net/publication/7564689
1 author:
Michael A Arbib
University of California, San Diego
All content following this page was uploaded by Michael A Arbib on 20 May 2014.
Abstract: The article analyzes the neural and functional grounding of language skills as well as their emergence in hominid evolution, hypothesizing stages leading from abilities known to exist in monkeys and apes and presumed to exist in our hominid ancestors right through to modern spoken and signed languages. The starting point is the observation that both premotor area F5 in monkeys and Broca’s area in humans contain a “mirror system” active for both execution and observation of manual actions, and that F5 and Broca’s area are homologous brain regions. This grounded the mirror system hypothesis of Rizzolatti and Arbib (1998) which offers the mirror system for grasping as a key neural “missing link” between the abilities of our nonhuman ancestors of 20 million years ago and modern human language, with manual gestures rather than a system for vocal communication providing the initial seed for this evolutionary process. The present article, however, goes “beyond the mirror” to offer hypotheses on evolutionary changes within and outside the mirror systems which may have occurred to equip Homo sapiens with a language-ready brain. Crucial to the early stages of this progression is the mirror system for grasping and its extension to permit imitation. Imitation is seen as evolving via a so-called simple system such as that found in chimpanzees (which allows imitation of complex “object-oriented” sequences but only as the result of extensive practice) to a so-called complex system found in humans (which allows rapid imitation even of complex sequences, under appropriate conditions) which supports pantomime. This is hypothesized to have provided the substrate for the development of protosign, a combinatorially open repertoire of manual gestures, which then provides the scaffolding for the emergence of protospeech (which thus owes little to nonhuman vocalizations), with protosign and protospeech then developing in an expanding spiral. It is argued that these stages involve biological evolution of both brain and body. By contrast, it is argued that the progression from protosign and protospeech to languages with full-blown syntax and compositional semantics was a historical phenomenon in the development of Homo sapiens, involving few if any further biological changes.
Key words: gestures; hominids; language evolution; mirror system; neurolinguistics; primates; protolanguage; sign language; speech; vocalization
Michael Anthony Arbib was born in England, grew up in Australia, and received his Ph.D. in Mathematics from MIT. After five years at Stanford, he became chairman of Computer and Information Science at the University of Massachusetts, Amherst, in 1970. He moved to the University of Southern California in 1986, where he is Professor of Computer Science, Neuroscience, Biomedical Engineering, Electrical Engineering, and Psychology. The author or editor of 38 books, Arbib recently co-edited Who Needs Emotions? The Brain Meets the Robot (Oxford University Press) with Jean-Marc Fellous. His current research focuses on brain mechanisms of visuomotor behavior, on neuroinformatics, and on the evolution of language.
1. Action-oriented neurolinguistics and the mirror system hypothesis
1.1. Evolving the language-ready brain
Two definitions:
1. A protolanguage is a system of utterances used by a particular hominid species (possibly including Homo sapiens) which we would recognize as a precursor to human language (if only the data were available!), but which is not itself a human language in the modern sense.1
2. An infant (of any species) has a language-ready brain if it can acquire a full human language when raised in an environment in which the language is used in interaction with the child.
Does the language readiness of human brains require that the richness of syntax and semantics be encoded in the genome, or is language one of those feats – from writing history to building cities to using computers – that played no role in biological evolution but rested on historical developments that created societies that could develop and transmit these skills? My hypothesis is that:
Language readiness evolved as a multimodal manual/facial/vocal system with protosign (manual-based protolanguage) providing the scaffolding for protospeech (vocal-based protolanguage) to provide “neural critical mass” to allow language to emerge from protolanguage as a result of cultural innovations within the history of Homo sapiens.2
The theory summarized here makes it understandable why it is as easy for a deaf child to learn a signed language as it is for a hearing child to learn a spoken language.
1.2. The mirror system hypothesis
Humans, chimps and monkeys share a general physical form and a degree of manual dexterity, but their brains, bodies, and behaviors differ. Moreover, humans can and normally do acquire language, and monkeys and chimps cannot – though chimps and bonobos can be trained to acquire a form of communication that approximates the complexity of the utterances of a 2-year-old human infant. The approach offered here to the evolution of brain mechanisms that support language is anchored in two observations: (1) The system of the monkey brain for visuomotor control of hand movements for grasping has its premotor outpost in an area called F5 which contains a set of neurons, called mirror neurons, each of which is active not only when the monkey executes a specific grasp but also when the monkey observes a human or other monkey execute a more or less similar grasp (Rizzolatti et al. 1996a). Thus F5 in monkey contains a mirror system for grasping which employs a common neural code for executed and observed manual actions (sect. 3.2 provides more details). (2) The region of the human brain homologous to F5 is part of Broca’s area, traditionally thought of as a speech area but which has been shown by brain imaging studies to be active when humans both execute and observe grasps.
These findings led to the mirror system hypothesis (Arbib & Rizzolatti 1997; Rizzolatti & Arbib 1998, henceforth R&A):
The parity requirement for language in humans – that what counts for the speaker must count approximately the same for the hearer3 – is met because Broca’s area evolved atop the mirror system for grasping, with its capacity to generate and recognize a set of actions.
One of the contributions of this paper will be to stress that the F5 mirror neurons in the monkey are linked to regions of parietal and temporal cortex, and then argue that the evolutionary changes that “lifted” the F5 homologue of the common ancestor of human and monkey to yield the human Broca’s area also “lifted” the other regions to yield Wernicke’s area and other areas that support language in the human brain.
Many critics have dismissed the mirror system hypothesis, stating correctly that monkeys do not have language and so the mere possession of a mirror system for grasping cannot suffice for language. But the key phrase here is “evolved atop” – and Rizzolatti and Arbib (1998) discuss explicitly how changes in the primate brain might have adapted the use of the hands to support pantomime (intended communication) as well as praxis, and then outlined how further evolutionary changes could support language. The hypothesis provides a neurological basis for the oft-repeated claim that hominids had a (proto)language based primarily on manual gestures before they had a (proto)language based primarily on vocal gestures (e.g., Armstrong et al. 1995; Hewes 1973; Kimura 1993; Stokoe 2001).4 It could be tempting to hypothesize that certain species-specific vocalizations of monkeys (such as the snake and leopard calls of vervet monkeys) provided the basis for the evolution of human speech, since both are in the vocal domain. However, these primate vocalizations appear to be related to non-cortical regions as well as the anterior cingulate cortex (see, e.g., Jürgens 1997) rather than F5, the homologue of Broca’s area. I think it likely (though empirical data are sadly lacking) that the primate cortex contains a mirror system for such species-specific vocalizations, and that a related mirror system persists in humans, but I suggest that it is a complement to, rather than an integral part of, the speech system that includes Broca’s area in humans.
The mirror system hypothesis claims that a specific mirror system – the primate mirror system for grasping – evolved into a key component of the mechanisms that render the human brain language-ready. It is this specificity that will allow us to explain below why language is multimodal, its evolution being based on the execution and observation of hand movements. There is no claim that mirroring or imitation is limited to primates. It is likely that an analogue of mirror systems exists in other mammals, especially those with a rich and flexible social organization. Moreover, the evolution of the imitation system for learning songs by male songbirds is divergent from mammalian evolution, but for the neuroscientist there are intriguing challenges in plotting the similarities and differences in the neural mechanisms underlying human language and birdsong (Doupe & Kuhl 1999).5
The monkey mirror system for grasping is presumed to allow other monkeys to understand praxic actions and use this understanding as a basis for cooperation, averting a threat, and so on. One might say that this is implicitly communicative, as a side effect of conducting an action for non-communicative goals. Similarly, the monkey’s orofacial gestures register emotional state, and primate vocalizations can also communicate something of the current priorities of the monkey, but to a first order this might be called “involuntary communication”6 – these “devices” evolved to signal certain aspects of the monkey’s current internal state or situation either through its observable actions or through a fixed species-specific repertoire of facial and vocal gestures. I will develop the hypothesis that the mirror system made possible (but in no sense guaranteed) the evolution of the displacement of hand movements from praxis to gestures that can be controlled “voluntarily.”
It is important to be quite clear as to what the mirror system hypothesis does not say.
1. It does not say that having a mirror system is equivalent to having language. Monkeys have mirror systems but do not have language, and I expect that many species have mirror systems for varied socially relevant behaviors.
2. Having a mirror system for grasping is not in itself sufficient for the copying of actions. It is one thing to recognize an action using the mirror system; it is another thing to use that representation as a basis for repeating the action. Hence, further evolution of the brain was required for the mirror system for grasping to become an imitation system for grasping.
3. It does not say that language evolution can be studied in isolation from cognitive evolution more generally.
Arbib (2002) modified and developed the R&A argument to hypothesize seven stages in the evolution of language, with imitation grounding two of the stages.7 The first three stages are pre-hominid:
S1: Grasping.
S2: A mirror system for grasping shared with the common ancestor of human and monkey.
S3: A simple imitation system for object-directed grasping through much-repeated exposure. This is shared with the common ancestor of human and chimpanzee.
The next three stages then distinguish the hominid line from that of the great apes:
S4: A complex imitation system for grasping – the ability to recognize another’s performance as a set of familiar actions and then repeat them, or to recognize that such a performance combines novel actions which can be approximated by variants of actions already in the repertoire.8
S5: Protosign, a manual-based communication system, breaking through the fixed repertoire of primate vocalizations to yield an open repertoire.
S6: Protospeech, resulting from the ability of control mechanisms evolved for protosign coming to control the vocal apparatus with increasing flexibility.9
The final stage is claimed (controversially!) to involve little if any biological evolution but instead to result from cultural evolution (historical change) in Homo sapiens:
S7: Language, the change from action-object frames to verb-argument structures to syntax and semantics; the co-evolution of cognitive and linguistic complexity.
The Mirror System Hypothesis is simply the assertion that the mechanisms that get us to the role of Broca’s area in language depend in a crucial way on the mechanisms established in stage S2. The above seven stages provide just one set of hypotheses on how this dependence may have arisen. The task of this paper is to re-examine this progression, responding to critiques by amplifying the supporting argument in some cases and tweaking the account in others. I believe that the overall framework is robust, but there are many details to be worked out and a continuing stream of new and relevant data and modeling to be taken into account.
The claim for the crucial role of manual communication in language evolution remains controversial. MacNeilage (1998; MacNeilage & Davis, in press b), for example, has argued that language evolved directly as speech. (A companion paper [Arbib 2005] details why I reject MacNeilage’s argument. The basic point is to distinguish the evolution of the ability to use gestures that convey meaning from the evolution of syllabification as a way to structure vocal gestures.)
A note to commentators: The arguments for stages S1 through S6 can and should be evaluated quite independently of the claim that the transition to language was cultural rather than biological.
The neurolinguistic approach offered here is part of a performance approach which explicitly analyzes both perception and production (Fig. 1). For production, we have much we could possibly talk about which is represented as cognitive structures (cognitive form; schema assemblages) from which some aspects are selected for possible expression. Further selection and transformation yields semantic structures (hierarchical constituents expressing objects, actions and relationships) which constitute a semantic form that is enriched by linkage to schemas for perceiving and acting upon the world (Arbib 2003; Rolls & Arbib 2003). Finally, the ideas in the semantic form must be expressed in words whose markings and ordering are expressed in phonological form – which may include a wide range of ordered expressive gestures, whether manual, orofacial, or vocal. For perception, the received sentence must be interpreted semantically, with the result updating the “hearer’s” cognitive structures. For example, perception of a visual scene may reveal “Who is doing what and to whom/which” as part of a nonlinguistic action-object frame in cognitive form. By contrast, the verb-argument structure is an overt linguistic representation in semantic form – in modern human languages, generally the action is named by a verb and the objects are named by nouns or noun phrases (see sect. 7). A production grammar for a language is then a specific mechanism (whether explicit or implicit) for converting verb-argument structures into strings of words (and hierarchical compounds of verb-argument structures into complex sentences), and vice versa for perception.
In the brain there may be no single grammar serving both production and perception, but rather, a “direct grammar” for production and an “inverse grammar” for perception. Jackendoff (2002) offers a competence theory with a much closer connection with theories of processing than has been common in generative linguistics and suggests (his sect. 9.3) strategies for a two-way dialogue between competence and performance theories. Jackendoff’s approach to competence appears to be promising in this regard because it at-
tends to the interaction of, for example, phonological, syntactic, and semantic representations. There is much, too, to be learned from a variety of approaches to cognitive grammar which relates cognitive form to syntactic structure (see, e.g., Heine 1997; Langacker 1987; 1991; Talmy 2000).
The next section provides a set of criteria for language readiness and further criteria for what must be added to yield language. It concludes (sect. 2.3) with an outline of the argument as it develops in the last six sections of the paper.
2. Language, protolanguage, and language readiness
I earlier defined a protolanguage as any system of utterances which served as a precursor to human language in the modern sense and hypothesized that the first Homo sapiens had protolanguage and a “language-ready brain” but did not have language.
Contra Bickerton (see Note 1), I will argue in section 7 that the prelanguage of Homo erectus and early Homo sapiens was composed mainly of “unitary utterances” that symbolized frequently occurring situations (in a general sense) without being decomposable into distinct words denoting components of the situation or their relationships. Words as we know them then co-evolved culturally with syntax through fractionation. In this view, many ways of expressing relationships that we now take for granted as part of language were the discovery of Homo sapiens; for example, adjectives and the fractionation of nouns from verbs may be “post-biological” in origin.
2.1. Criteria for language readiness
Here are properties hypothesized to support protolanguage:
LR1. Complex imitation: The ability to recognize another’s performance as a set of familiar movements and then repeat them, but also to recognize that such a performance combines novel actions that can be approximated by (i.e., more or less crudely be imitated by) variants of actions already in the repertoire.10
The idea is that this capacity – distinct from the simple imitation system for object-directed grasping through much repeated exposure which is shared with chimpanzees – is necessary to support properties LR2 and LR3, including the idea that symbols are potentially arbitrary rather than innate:
LR2. Symbolization: The ability to associate symbols with an open class of episodes, objects, or actions.
At first, these symbols may have been unitary utterances, rather than words in the modern sense, and they may have been based on manual and facial gestures rather than being vocalized.
LR3. Parity (mirror property): What counts for the speaker (or producer) must count for the listener (or receiver).
This extends Property LR2 by ensuring that symbols can be shared, and thus is bound up with LR4.
LR4. Intended communication: Communication is intended by the utterer to have a particular effect on the recipient rather than being involuntary or a side effect of praxis.
The remainder are more general properties, delimiting cognitive capabilities that underlie a number of the ideas which eventually find their expression in language:
LR5. From hierarchical structuring to temporal ordering: Perceiving that objects and actions have subparts; finding the appropriate timing of actions to achieve goals in relation to those hierarchically structured objects.
A basic property of language – translating a hierarchical conceptual structure into a temporally ordered structure of actions – is in fact not unique to language but is apparent whenever an animal takes in the nature of a visual scene and produces appropriate behavior. Animals possess subtle mechanisms of action-oriented perception with no necessary link to the ability to communicate about these components and their relationships. To have such structures does not entail the ability to communicate by using words or articulatory gestures (whether signed or vocalized) in a way that reflects these structures.
Hauser et al. (2002) assert that the faculty of language in the narrow sense (FLN) includes only recursion and is the one uniquely human component of the faculty of language. However, the flow diagram given by Byrne (2003) shows that the processing used by a mountain gorilla when preparing bundles of nettle leaves to eat is clearly recursive. Gorillas (like many other species, and not only mammals) have the working memory to refer their next action not only to sensory data but also to the state of execution of some current plan. Hence, when we refer to the monkey’s grasping and ability to recognize similar grasps in others, it is a mistake to treat the individual grasps in isolation – the F5 system is part of a larger system that can direct those grasps as part of a recursively structured plan.
Let me simply list the next two properties here, and then expand upon them in the next section:
LR6. Beyond the here-and-now 1: The ability to recall past events or imagine future ones.
LR7. Paedomorphy and sociality: Paedomorphy is the prolonged period of infant dependency which is especially pronounced in humans; this combines with social structures for caregiving to provide the conditions for complex social learning.
Where Deacon (1997) makes symbolization central to his account of the coevolution of language and the human brain, the present account will stress the parity property LR3, since it underlies the sharing of meaning, and the capacity for complex imitation. I will also argue that only protolanguage co-evolved with the brain, and that the full development of linguistic complexity was a cultural/historical process that required little or no further change from the brains of early Homo sapiens.
Later sections will place LR1 through LR7 in an evolutionary context (see sect. 2.3 for a summary), showing how the coupling of complex imitation to complex communication creates a language-ready brain.
2.2. Criteria for language
I next present four criteria for what must be added to the brain’s capabilities for the parity, hierarchical structuring, and temporal ordering of language readiness to yield language. Nothing in this list rests on the medium of exchange of the language, applying to spoken language, sign language, or written language, for example. My claim is that a brain that can support properties LR1 through LR7 above can support properties LA1 through LA4 below – as long
as its “owner” matures in a society that possesses language in the sense so defined and nurtures the child to acquire it. In other words, I claim that the mechanisms that make LR1 through LR7 possible are supported by the genetic encoding of brain and body and the consequent space of possible social interactions, but that the genome has no additional structures specific to LA1 through LA4. In particular, the genome does not have special features encoding syntax and its linkage to a compositional semantics.11
I suggest that “true language” involves the following further properties beyond LR1 through LR7:
LA1. Symbolization and compositionality: The symbols become words in the modern sense, interchangeable and composable in the expression of meaning.12
LA2. Syntax, semantics and recursion: The matching of syntactic to semantic structures coevolves with the fractionation of utterances, with the nesting of substructures making some form of recursion inevitable.
LA1 and LA2 are intertwined. Section 7 will offer candidates for the sorts of discoveries that may have led to progress from “unitary utterances” to more or less structured assemblages of words. Given the view (LR5) that recursion of action (but not of communication) is part of language readiness, the key transition here is the compositionality that allows cognitive structure to be reflected in symbolic structure (the transition from LR2 to LA1), as when perception (not uniquely human) grounds linguistic description (uniquely human) so that, for example, the noun phrase (NP) describing a part of an object may optionally form part of the NP describing the overall object. From this point of view, recursion in language is a corollary of the essentially recursive nature of action and perception once symbolization becomes compositional, and reflects addition of further detail to, for example, a description when needed to reduce ambiguity in communication.
The last two principles provide the linguistic complements of two of the conditions for language readiness, LR6 (Beyond the here-and-now 1) and LR7 (Paedomorphy and sociality), respectively.
LA3. Beyond the here-and-now 2: Verb tenses or other circumlocutions express the ability to recall past events or imagine future ones.
There are so many linguistic devices for going beyond the here and now, and beyond the factual, that verb tenses are mentioned to stand in for all the devices languages have developed to communicate about other “possible worlds” that are far removed from the immediacy of, say, the vervet monkey’s leopard call.
If one took a human language and removed all reference to time, one might still want to call it a language rather than a protolanguage, even though one would agree that it was thereby greatly impoverished. Similarly, the number system of a language can be seen as a useful, but not definitive, “plug-in.” LA3 nonetheless suggests that the ability to talk about past and future is a central part of human languages as we understand them. However, all this would be meaningless (literally) without the underlying cognitive machinery – the substrate for episodic memory provided by the hippocampus (Burgess et al. 1999) and the substrate for planning provided by frontal cortex (Passingham 1993, Ch. 10). It is not part of the mirror system hypothesis to explain the evolution of the brain structures that support LR6; it is an exciting challenge for work “beyond the mirror” to show how such structures could provide the basis for humans to discover the capacities for communication summarized in LA3.
LA4. Learnability: To qualify as a human language, much of the syntax and semantics of a human language must be learnable by most human children.
I say “much of” because it is not true that children master all the vocabulary or syntactic subtlety of a language by 5 or 7 years of age. Language acquisition is a process that continues well into the teens as we learn more subtle syntactic expressions and a greater vocabulary to which to apply them (C. Chomsky [1969] traces the changes that occur from ages 5 to 10), allowing us to achieve a richer and richer set of communicative and representational goals.
LR7 and LA4 link a biological condition “orthogonal” to the mirror system hypothesis with a “supplementary” property of human languages. This supplementary property is that languages do not simply exist – they are acquired anew (and may be slightly modified thereby) in each generation (LA4). The biological property is an inherently social one about the nature of the relationship between parent (or other caregiver) and child (LR7) – the prolonged period of infant dependency which is especially pronounced in humans has co-evolved with the social structures for caregiving that provide the conditions for the complex social learning that makes possible the richness of human cultures in general and of human languages in particular (Tomasello 1999b).
2.3. The argument in perspective
The argument unfolds in the remaining six sections as follows:
Section 3. Perspectives on grasping and mirror neurons: This section presents two models of the macaque brain. A key point is that the functions of mirror neurons reflect the impact of experience rather than being pre-wired.
Section 4. Imitation: This section presents the distinction between simple and complex imitation systems for grasping, and argues that monkeys have neither, that chimpanzees have only simple imitation, and that the capacity for complex imitation involved hominid evolution since the separation from our common ancestors, the great apes, including chimpanzees.
Section 5. From imitation to protosign: This section examines the relation between symbolism, intended communication, and parity, and looks at the multiple roles of the mirror system in supporting pantomime and then conventionalized gestures that support a far greater range of intended communication.
Section 6. The emergence of protospeech: This section argues that evolution did not proceed directly from monkey-like primate vocalizations to speech but rather proceeded from vocalization to manual gesture and back to vocalization again.
Section 7. The inventions of languages: This section argues that the transition from action-object frames to verb-argument structures embedded in larger sentences structured by syntax and endowed with a compositional semantics was the effect of the accumulation of a wide range of human discoveries that had little if any impact on the human genome.
Section 8. Toward a neurolinguistics “beyond the mirror”: This section extracts a framework for action-oriented linguistics informed by our analysis of the “extended mirror
Table 1. A comparative view of how the following sections relate the criteria LR1–LR7 for language readiness and LA1–LA2 for language (middle column) to the seven stages, S1–S7, of the extended mirror system hypothesis (right column)
Section | Criteria | Stages
2.1 | LR5: From hierarchical structuring to temporal ordering | This precedes the evolutionary stages charted here.
3.1 | | S1: Grasping. The FARS model.
3.2 | | S2: Mirror system for grasping. Modeling Development of the Mirror System. This supports the conclusion that mirror neurons can be recruited to recognize and encode an expanding set of novel actions.
4 | LR1: Complex imitation | S3: Simple imitation. This involves properties of the mirror system beyond the monkey’s data. S4: Complex imitation. This is argued to distinguish humans from other primates.
5 | LR2: Symbolization; LR4: Intended communication; LR3: Parity (mirror property) | S5: Protosign. The transition of complex imitation from praxic to communicative use involves two substages: S5a: the ability to engage in pantomime; S5b: the ability to make conventional gestures to disambiguate pantomime.
6.1 | | S6: Protospeech. It is argued that early protosign provided the scaffolding for early protospeech, after which both developed in an expanding spiral until protospeech became dominant for most people.
7 | LA1: Symbolization and compositionality; LA2: Syntax, semantics, and recursion | S7: Language. The transition from action-object frame to verb-argument structure to syntax and semantics.
8 | | The evolutionary developments of the preceding sections are restructured into synchronic form to provide a framework for further research in neurolinguistics relating the capabilities of the human brain for language, action recognition, and imitation.
system hypothesis” presented in the previous sections. The language-ready brain contains the evolved mirror system as a key component but also includes many other components that lie outside, though they interact with, the mirror system.
Table 1 shows how these sections relate the evolutionary stages S1 through S7, and their substages, to the above criteria for language readiness and language.13
3. Perspectives on grasping and mirror neurons
Mirror neurons in F5, which are active both when the monkey performs certain actions and when the monkey observes them performed by others, are to be distinguished from canonical neurons in F5, which are active when the monkey performs certain actions but not when the monkey observes actions performed by others. More subtly, canonical neurons fire when they are presented with a graspable object, irrespective of whether the monkey performs the grasp or not – but clearly this must depend on the extra (inferred) condition that the monkey not only sees the object but is aware, in some sense, that it is possible to grasp it. Were it not for the caveat, canonical neurons would also fire when the monkey observed the object being grasped by another.
The “classic” mirror system hypothesis (sect. 1.2) emphasizes the grasp-related neurons of the monkey premotor area F5 and the homology of this region with human Broca’s area. However, Broca’s area is part of a larger system supporting language, and so we need to enrich the mirror system hypothesis by seeing how the mirror system for grasping in monkey includes a variety of brain regions in addition to F5. I show this by presenting data and models that locate the canonical system of F5 in a systems perspective (the FARS model of sect. 3.1) and then place the mirror system of F5 in a system perspective (the MNS model of sect. 3.2).
3.1. The FARS model
Given our concern with hand use and language, it is striking that the ability to use the size of an object to preshape the hand while grasping it can be dissociated by brain lesions from the ability to consciously recognize and describe that size. Goodale et al. (1991) studied a patient (D.F.) whose cortical damage allowed signals to flow from primary
visual cortex (V1) towards posterior parietal cortex (PP) but not from V1 to inferotemporal cortex (IT). When asked to indicate the width of a single block by means of her index finger and thumb, D.F.’s finger separation bore no relationship to the dimensions of the object and showed considerable trial-to-trial variability. Yet when she was asked simply to reach out and pick up the block, the peak aperture (well before contact with the object) between her index finger and thumb changed systematically with the width of the object, as in normal controls. A similar dissociation was seen in her responses to the orientation of stimuli. In other words, D.F. could preshape accurately, even though she appeared to have no conscious appreciation (expressible either verbally or in pantomime) of the visual parameters that guided the preshape. Jeannerod et al. (1994) reported a study of impairment of grasping in a patient (A.T.) with a bilateral posterior parietal lesion of vascular origin that left IT and the pathway V1 → IT relatively intact, but grossly impaired the pathway V1 → PP. This patient can reach without deficit toward the location of an object, but cannot preshape appropriately when asked to grasp it.

A corresponding distinction in the role of these pathways in the monkey is crucial to the FARS model (named for Fagg, Arbib, Rizzolatti, and Sakata; see Fagg & Arbib 1998), which embeds F5 canonical neurons in a larger system. Taira et al. (1990) found that anterior intraparietal (AIP) cells (in the anterior intraparietal sulcus of the parietal cortex) extract neural codes for affordances for grasping from the visual stream and send these on to area F5. Affordances (Gibson 1979) are features of the object relevant to action, in this case to grasping, rather than aspects of the object’s identity. Turning to human data: Ehrsson et al. (2003) compared the brain activity when humans attempted to lift an immovable test object held between the tips of the right index finger and thumb with the brain activity obtained in two control tasks in which neither the load force task nor the grip force task involved coordinated grip-load forces. They found that the grip-load force task was specifically associated with activation of a section of the right intraparietal cortex. Culham et al. (2003) found greater activity for grasping than for reaching in several regions, including the anterior intraparietal (AIP) cortex. Although the lateral occipital complex (LOC), a ventral stream area believed to play a critical role in object recognition, was activated by the objects presented on both grasping and reaching trials, it showed no greater activity for grasping than for reaching.

The FARS model analyzes how the “canonical system,” centered on the AIP → F5 pathway, may account for basic phenomena of grasping. The highlights of the model are shown in Figure 2,14 which diagrams the crucial role of IT (inferotemporal cortex) and PFC (prefrontal cortex) in modulating F5’s selection of an affordance. The dorsal stream (from V1 to parietal cortex) carries the information needed for AIP to recognize that different parts of the object can be grasped in different ways, thus extracting affordances for the grasp system which are then passed on to F5. The dorsal stream does not know “what” the object is; it can only see the object as a set of possible affordances. The ventral stream (from V1 to IT), by contrast, is able to recognize what the object is. This information is passed to PFC, which can then, on the basis of the current goals of the organism and the recognition of the nature of the object, bias AIP to choose the affordance appropriate to the task at hand. The original FARS model posited connections between PFC and F5. However, there is evidence (reviewed by Rizzolatti & Luppino 2001) that these connections are very limited, whereas rich connections exist between PFC and AIP. Rizzolatti and Luppino (2003) therefore suggested that FARS
Figure 2. A reconceptualization of the FARS model in which the primary influence of PFC (prefrontal cortex) on the selection of af-
fordances is on parietal cortex (AIP, anterior intraparietal sulcus) rather than premotor cortex (the hand area F5).
be modified so that information on object semantics and the goals of the individual influence AIP rather than F5 neurons. I show the modified schematic in Figure 2. The modified figure represents the way in which AIP may accept signals from areas F6 (pre-SMA), 46 (dorsolateral prefrontal cortex), and F2 (dorsal premotor cortex) to respond to task constraints, working memory, and instruction stimuli, respectively. In other words, AIP provides cues on how to interact with an object, leaving it to IT to categorize the object or determine its identity.

Although the data on cell specificity in F5 and AIP emphasize single actions, these actions are normally part of more complex behaviors; to take a simple example, a monkey who grasps a raisin will, in general, then proceed to eat it. Moreover, a particular action might be part of many learned sequences, and so we do not expect the premotor neurons for one action to prime a single possible consequent action and hence must reject “hard wiring” of the sequence. The generally adopted solution is to segregate the learning of a sequence from the circuitry which encodes the unit actions, the latter being F5 in the present study. Instead, another area (possibly the part of the supplementary motor area called pre-SMA; Rizzolatti et al. 1998) has neurons whose connections encode an “abstract sequence” Q1, Q2, Q3, Q4, with sequence learning then involving learning that the activation of Q1 triggers the F5 neurons for A, Q2 triggers B, Q3 triggers A again, and Q4 triggers C to provide encoding of the sequence A-B-A-C. Other studies suggest that administration of the sequence (inhibiting extraneous actions, while priming imminent actions) is carried out by the basal ganglia on the basis of its interactions with the pre-SMA (Bischoff-Grethe et al. 2003; see Dominey et al. 1995 for an earlier model of the possible role of the basal ganglia in sequence learning).

3.2. Modeling development of the mirror system

The populations of canonical and mirror neurons appear to be spatially segregated in F5 (Rizzolatti & Luppino 2001). Both sectors receive a strong input from the secondary somatosensory area (SII) and parietal area PF. In addition, canonical neurons are the selective target of area AIP. Perrett et al. (1990; cf. Carey et al. 1997) found that STSa, in the rostral part of the superior temporal sulcus (STS), has neurons which discharge when the monkey observes such biological actions as walking, turning the head, bending the torso, and moving the arms. Of most relevance to us is that a few of these neurons discharged when the monkey observed goal-directed hand movements, such as grasping objects (Perrett et al. 1990), though STSa neurons do not seem to discharge during movement execution as distinct from observation. STSa and F5 may be indirectly connected via the inferior parietal area PF (Brodmann area 7b) (Cavada & Goldman-Rakic 1989; Matelli et al. 1986; Petrides & Pandya 1984; Seltzer & Pandya 1994). About 40% of the visually responsive neurons in PF are active for observation of actions such as holding, placing, reaching, grasping, and bimanual interaction. Moreover, most of these action-observation neurons were also active during the execution of actions similar to those for which they were “observers,” and were therefore called PF mirror neurons (Fogassi et al. 1998).

In summary, area F5 and area PF include an observation/execution matching system: When the monkey observes an action that resembles one in its movement repertoire, a subset of the F5 and PF mirror neurons is activated which also discharges when a similar action is executed by the monkey itself.

I next develop the conceptual framework for thinking about the relation between F5, AIP, and PF. Section 6.1 expands the mirror neuron database, reviewing the reports by Kohler et al. (2002) of a subset of mirror neurons responsive to sounds and by Ferrari et al. (2003) of neurons responsive to the observation of orofacial communicative gestures.

Figure 3 provides a glimpse of the schemas (functions) involved in the MNS model (Oztop & Arbib 2002) of the monkey mirror system.15 First, we look at those elements involved when the monkey itself reaches for an object. Areas IT and cIPS (caudal intraparietal sulcus; part of area 7) provide visual input concerning the nature of the observed object and the position and orientation of the object’s surfaces, respectively, to AIP. The job of AIP is then to extract the affordances the object offers for grasping. The upper diagonal in Figure 3 corresponds to the basic pathway AIP → F5 canonical → M1 (primary motor cortex) of the FARS model, but Figure 3 does not include the important role of PFC in action selection. The lower-right diagonal (MIP/LIP/VIP → F4) completes the “canonical” portion of the MNS model, since motor cortex must instruct not only the hand muscles how to grasp but also (via various intermediaries) the arm muscles how to reach, transporting the hand to the object. The rest of Figure 3 presents the core elements for the understanding of the mirror system. Mirror neurons do not fire when the monkey sees the hand movement or the object in isolation; it is the sight of the hand moving appropriately to grasp or otherwise manipulate a seen (or recently seen) object (Umiltá et al. 2001) that is required for the mirror neurons attuned to the given action to fire. This requires schemas both for recognition of the shape of the hand and analysis of its motion (ascribed in the figure to STS), and for analysis of the relation of these hand parameters to the location and affordance of the object (7a and 7b; we identify 7b with PF).

In the MNS model, the hand state was accordingly defined as a vector whose components represented the movement of the wrist relative to the location of the object and of the hand shape relative to the affordances of the object. Oztop and Arbib (2002) showed that an artificial neural network corresponding to PF and F5 mirror could be trained to recognize the grasp type from the hand state trajectory, with correct classification often being achieved well before the hand reached the object. The modeling assumed that the neural equivalent of a grasp being in the monkey’s repertoire is that there is a pattern of activity in the F5 canonical neurons which commands that grasp. During training, the output of the F5 canonical neurons, acting as a code for the grasp being executed by the monkey at that time, was used as the training signal for the F5 mirror neurons to enable them to learn which hand-object trajectories corresponded to the canonically encoded grasps. Moreover, the input to the F5 mirror neurons encodes the trajectory of the relation of parts of the hand to the object rather than the visual appearance of the hand in the visual field. As a result of this training, the appropriate mirror neurons come to fire in response to viewing the appropriate trajectories even when the trajectory is not accompanied by F5 canonical firing.
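
The training scheme described for the MNS model can be illustrated with a small Python sketch. It is a toy under loud assumptions and not the network of Oztop and Arbib (2002): the two grasp labels, the two-component hand state, the synthetic trajectories, and the nearest-template classifier (swapped in for their artificial neural network) are all invented here for illustration. What it does preserve is the logic of the text: the canonical code for a self-executed grasp serves as the teaching signal, the input is a hand state expressed relative to the object, and recognition can succeed on a partial trajectory, before the hand reaches the object.

```python
import random

# Hypothetical grasp labels standing in for the F5 canonical code (assumption).
GRASPS = ("precision", "power")


def self_grasp(grasp, steps=10, rng=random):
    """Return a synthetic hand-state trajectory for a self-executed grasp.

    Each hand state is (wrist-to-object distance, hand aperture relative to
    the graspable part of the object). All numbers are invented.
    """
    peak = 0.4 if grasp == "precision" else 1.2
    trajectory = []
    for t in range(steps):
        s = t / (steps - 1)  # normalized time, from 0 to 1
        distance = (1.0 - s) + rng.gauss(0.0, 0.02)
        aperture = peak * 4.0 * s * (1.0 - s) + rng.gauss(0.0, 0.05)
        trajectory.append((distance, aperture))
    return trajectory


def train_mirror(examples):
    """Build one mean trajectory per grasp, labeled by the canonical code.

    `examples` is a list of (canonical_code, trajectory) pairs gathered while
    the model "monkey" executes its own grasps.
    """
    steps = len(examples[0][1])
    sums = {g: [[0.0, 0.0] for _ in range(steps)] for g in GRASPS}
    counts = {g: 0 for g in GRASPS}
    for canonical_code, trajectory in examples:
        counts[canonical_code] += 1
        for t, (distance, aperture) in enumerate(trajectory):
            sums[canonical_code][t][0] += distance
            sums[canonical_code][t][1] += aperture
    return {
        g: [(d / counts[g], a / counts[g]) for d, a in sums[g]]
        for g in GRASPS
    }


def recognize(templates, observed):
    """Classify an observed, possibly partial, trajectory.

    No canonical signal is used here, so the hand may belong to another.
    """
    def cost(grasp):
        return sum(
            (d - td) ** 2 + (a - ta) ** 2
            for (d, a), (td, ta) in zip(observed, templates[grasp])
        )
    return min(GRASPS, key=cost)


rng = random.Random(0)
training = [(g, self_grasp(g, rng=rng)) for g in GRASPS for _ in range(20)]
templates = train_mirror(training)
observed = self_grasp("power", rng=rng)    # stands in for another's hand
print(recognize(templates, observed[:5]))  # only the first half is seen
```

Because the hand state is defined relative to the object, nothing in `recognize` depends on whose hand generated the trajectory, which is the property the model relies on for responding to the grasps of others.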
Figure 3. A schematic view of the Mirror Neuron System (MNS) model (Oztop & Arbib 2002).
This training prepares the F5 mirror neurons to respond to hand-object relational trajectories even when the hand is of the “other” rather than the “self,” because the hand state is based on the movement of a hand relative to the object, and thus only indirectly on the retinal input of seeing hand and object which can differ greatly between observation of self and other. What makes the modeling worthwhile is that the trained network not only responded to hand-state trajectories from the training set, but also exhibited interesting responses to novel hand-object relationships. Despite the use of a non-physiological neural network, simulations with the model revealed a range of putative properties of mirror neurons that suggest new neurophysiological experiments. (See Oztop & Arbib [2002] for examples and detailed analysis.)

Although MNS was constructed as a model of the development of mirror neurons in the monkey, it serves equally well as a model of the development of mirror neurons in the human infant. A major theme for future modeling, then, will be to clarify which aspects of human development are generic for primates and which are specific to the human repertoire. In any case, the MNS model makes the crucial assumption that the grasps that the mirror system comes to recognize are already in the (monkey or human) infant’s repertoire. But this raises the question of how grasps entered the repertoire. To simplify somewhat, the answer has two parts: (1) Children explore their environment, and as their initially inept arm and hand movements successfully contact objects, they learn to reproduce the successful grasps reliably, with the repertoire being tuned through further experience. (2) With more or less help from caregivers, infants come to recognize certain novel actions in terms of similarities with and differences from movements already in their repertoires, and on this basis learn to produce some version of these novel actions for themselves. Our Infant Learning to Grasp Model (ILGM; Oztop et al. 2004) strongly supports the hypothesis that grasps are acquired through experience as the infant learns how to conform the biomechanics of its hand to the shapes of the objects it encounters. However, limited space precludes presentation of this model here.

The classic papers on the mirror system for grasping in the monkey focus on a repertoire of grasps (such as the precision pinch and power grasp) that seem so basic that it is tempting to think of them as prewired. The crucial point of this section on modeling is that learning models such as ILGM and MNS, and the data they address, make clear that mirror neurons are not restricted to recognition of an innate set of actions but can be recruited to recognize and encode an expanding repertoire of novel actions. I will relate the FARS and MNS models to the development of imitation at the end of section 4.

With this, let us turn to human data. We mentioned in section 1.2 that Broca’s area, traditionally thought of as a speech area, has been shown by brain imaging studies to be active when humans both execute and observe grasps. This was first tested by two positron emission tomography (PET) experiments (Grafton et al. 1996; Rizzolatti et al. 1996) which compared brain activation when subjects observed the experimenter grasping an object against activation when subjects simply observed the object. Grasp observation significantly activated the superior temporal sulcus (STS), the inferior parietal lobule, and the inferior frontal gyrus (area 45). All activations were in the left hemisphere. The last area is of especial interest because areas 44 and 45 in the left hemisphere of the human constitute Broca’s area. Such data certainly contribute to the growing body of indirect evidence that there is a mirror system for grasping that
links Broca’s area with regions in the inferior parietal lobule and STS. We have seen that the “minimal mirror system” for grasping in the macaque includes mirror neurons in the parietal area PF (7b) as well as F5, and some not-quite-mirror neurons in the region STSa in the superior temporal sulcus. Hence, in further investigation of the mirror system hypothesis it will be crucial to extend the F5 → Broca’s area homology to examine the human homologues of PF and STSa as well. I will return to this issue in section 7 (see Fig. 6) and briefly review some of the relevant data from the rich and rapidly growing literature based on human brain imaging and transcranial magnetic stimulation (TMS) inspired by the effort to probe the human mirror system and relate it to action recognition, imitation, and language.

Returning to the term “language readiness,” let me stress that the reliable linkage of brain areas to different aspects of language in normal speaking humans does not imply that language per se is “genetically encoded” in these regions. There is a neurology of writing even though writing was invented only a few thousand years ago. The claim is not that Broca’s area, Wernicke’s area, and STS are genetically preprogrammed for language, but rather that the development of a human child in a language community normally adapts these brain regions to play a crucial (but not the only) role in language performance.

4. Imitation

We have already discussed the mirror system for grasping as something shared between macaque and human; hence the hypothesis that this set of mechanisms was already in place in the common ancestor of monkey and human some 20 million years ago.16 In this section we move from stage S2, a mirror system for grasping, to stages S3, a simple imitation system for grasping, and S4, a complex imitation system for grasping. I will argue that chimpanzees possess a capability for simple imitation that monkeys lack, but that humans have complex imitation whereas other primates do not. The ability to copy single actions is just the first step towards complex imitation, which involves parsing a complex movement into more or less familiar pieces and then performing the corresponding composite of (variations on) familiar actions. Arbib and Rizzolatti (1997) asserted that what makes a movement into an action is that it is associated with a goal, and that initiation of the movement is accompanied by the creation of an expectation that the goal will be met. Hence, it is worth stressing that when I speak of imitation here, I speak of the imitation of a movement and its linkage to the goals it is meant to achieve. The action may thus vary from occasion to occasion depending on parametric variations in the goal. This is demonstrated by Byrne’s (2003) description, noted earlier, of a mountain gorilla preparing bundles of nettle leaves to eat.

Visalberghi and Fragaszy (2002) review data on attempts to observe imitation in monkeys, including their own studies of capuchin monkeys. They stress the huge difference between the major role that imitation plays in learning by human children, and the very limited role, if any, that imitation plays in social learning in monkeys. There is little evidence for vocal imitation in monkeys or apes (Hauser 1996), but it is generally accepted that chimpanzees are capable of some forms of imitation (Tomasello & Call 1997).

There is not space here to analyze all the relevant distinctions between imitation and other forms of learning, but one example may clarify my view: Voelkl and Huber (2000) had marmosets observe a demonstrator removing the lids from a series of plastic canisters to obtain a mealworm. When subsequently allowed access to the canisters, marmosets that observed a demonstrator using its hands to remove the lids used only their hands. In contrast, marmosets that observed a demonstrator using its mouth also used their mouths to remove the lids. Voelkl and Huber (2000) suggest that this may be a case of true imitation in marmosets, but I would argue that it is a case of stimulus enhancement, apparent imitation resulting from directing attention to a particular object or part of the body or environment. This is to be distinguished from emulation (observing and attempting to reproduce results of another’s actions without paying attention to details of the other’s behavior) and true imitation which involves copying a novel, otherwise improbable action or some act that is outside the imitator’s prior repertoire.

Myowa-Yamakoshi and Matsuzawa (1999) observed in a laboratory setting that chimpanzees typically took 12 trials to learn to “imitate” a behavior and in doing so paid more attention to where the manipulated object was being directed than to the actual movements of the demonstrator. This involves the ability to learn novel actions which may require using one or both hands to bring two objects into relationship, or to bring an object into relationship with the body.

Chimpanzees do use and make tools in the wild, with different tool traditions found in geographically separated groups of chimpanzees: Boesch and Boesch (1983) have observed chimpanzees in Tai National Park, Ivory Coast, using stone tools to crack nuts open, although Goodall has never seen chimpanzees do this in the Gombe in Tanzania. They crack harder-shelled nuts with stone hammers and stone anvils. The Tai chimpanzees live in a dense forest where suitable stones are hard to find. The stone anvils are stored in particular locations to which the chimpanzees continually return.17 The nut-cracking technique is not mastered until adulthood. Tomasello (1999b) comments that, over many years of observation, Boesch observed only two possible instances in which the mother appeared to be actively attempting to instruct her child, and that even in these cases it is unclear whether the mother had the goal of helping the young chimp learn to use the tool. We may contrast the long and laborious process of acquiring the nut-cracking technique with the rapidity with which human adults can acquire novel sequences, and the crucial role of caregivers in the development of this capacity for complex imitation. Meanwhile, reports abound of imitation in many species, including dolphins and orangutans, and even tool use in crows (Hunt & Gray 2002). Consequently, I accept that the demarcation between the capability for imitation of humans and nonhumans is problematic. Nonetheless, I still think it is fair to claim that humans can master feats of imitation beyond those possible for other primates.

The ability to imitate has clear adaptive advantage in allowing creatures to transfer skills to their offspring, and therefore could be selected for quite independently of any adaptation related to the later emergence of protolanguage. By the same token, the ability for complex imitation could provide further selective advantage unrelated to language. However, complex imitation is central to human infants
both in their increasing mastery of the physical and social world and in the close coupling of this mastery to the acquisition of language (cf. Donald 1998; Arbib et al., in press). The child must go beyond simple imitation to acquire the phonological repertoire, words, and basic “assembly skills” of its language community, and this is one of the ways in which brain mechanisms supporting imitation were crucial to the emergence of language-ready Homo sapiens. If I then assume (1) that the common ancestor of monkeys and apes had no greater imitative ability than present-day monkeys (who possess, I suggest, stimulus enhancement rather than simple imitation), and (2) that the ability for simple imitation shared by chimps and humans was also possessed by their common ancestor, but (3) that only humans possess a talent for “complex” imitation, then I have established a case for the hypothesis that extension of the mirror system from recognizing single actions to being able to copy compound actions was the key innovation in the brains of our hominid ancestors that was relevant to language. And, more specifically, we have the hypotheses:

Stage S3 hypothesis: Brain mechanisms supporting a simple imitation system (imitation of short, novel sequences of object-directed actions through repeated exposure) for grasping developed in the 15-million-year evolution from the common ancestor of monkeys and apes to the common ancestor of apes and humans; and

Stage S4 hypothesis: Brain mechanisms supporting a complex imitation system (acquiring longer novel sequences of more abstract actions in a single trial) developed in the 5-million-year evolution from the common ancestor of apes and humans along the hominid line that led, in particular, to Homo sapiens.18

Now that we have introduced imitation, we can put the models of section 3.2 in perspective by postulating the following stages prior to, during, and building on the development of the mirror system for grasping in the infant:
A. The child refines a crude map (superior colliculus) to make unstructured reach and “swipe” movements at objects; the grasp reflex occasionally yields a successful grasp.
B. The child develops a set of grasps which succeed by kinesthetic, somatosensory criteria (ILGM).
C. AIP develops as affordances of objects become learned in association with successful grasps. Grasping becomes visually guided; the grasp reflex disappears.
D. The (grasp) mirror neuron system develops driven by visual stimuli relating hand and object generated by the actions (grasps) performed by the infant himself (MNS).
E. The child gains the ability to map other individuals’ actions into his internal motor representation.
F. Then the child acquires the ability to imitate, creating (internal) representations for novel actions that have been observed and developing an action prediction capability.

I suggest that stages A through D are much the same in monkey and human, but that stages E and F are rudimentary at best in monkeys, somewhat developed in chimps, and well-developed in human children (but not in infants). In terms of Figure 3, we might say that if MNS were augmented to have a population of mirror neurons that could acquire population codes for observed actions not yet in the repertoire of self-actions, then in stage E the mirror neurons would provide training for the canonical neurons, reversing the information flow seen in the MNS model. Note that this raises the further possibility that the human infant may come to recognize movements that not only are not within the repertoire but which never come to be within the repertoire. In this case, the cumulative development of action recognition may proceed to increase the breadth and subtlety of the range of actions that are recognizable but cannot be performed by children.

5. From imitation to protosign

The next posited transition, from stage S4, a complex imitation system for grasping, to stage S5, protosign, a manual-based communication system, takes us from imitation for the sake of instrumental goals to imitation for the sake of communication. Each stage builds on, yet is not simply reducible to, the previous stage.

I argue that the combination of the abilities (S5a) to engage in pantomime and (S5b) to make conventional gestures to disambiguate pantomime yielded a brain which could (S5) support “protosign,” a manual-based communication system that broke through the fixed repertoire of primate vocalizations to yield an open repertoire of communicative gestures.

It is important to stress that communication is about far more than grasping. To pantomime the flight of a bird, you might move your hand up and down in a way that indicates the flapping of a wing. Your pantomime uses movements of the hand (and arm and body) to imitate movement other than hand movements. You can pantomime an object either by miming a typical action by or with the object, or by tracing out the characteristic shape of the object.

The transition to pantomime does seem to involve a genuine neurological change. Mirror neurons for grasping in the monkey will fire only if the monkey sees both the hand movement and the object to which it is directed (Umiltá et al. 2001). A grasping movement that is not made in the presence of a suitable object, or is not directed toward that object, will not elicit mirror neuron firing. By contrast, in pantomime, the observer sees the movement in isolation and infers (1) what non-hand movement is being mimicked by the hand movement, and (2) the goal or object of the action. This is an evolutionary change of key relevance to language readiness. Imitation is the generic attempt to reproduce movements performed by another, whether to master a skill or simply as part of a social interaction. By contrast, pantomime is performed with the intention of getting the observer to think of a specific action, object, or event. It is essentially communicative in its nature. The imitator observes; the pantomimic intends to be observed.

As Stokoe (2001) and others emphasize, the power of pantomime is that it provides open-ended communication that works without prior instruction or convention. However (and I shall return to this issue at the end of this section), even signs of modern signed language which resemble pantomimes are conventionalized and are, thus, distinct from pantomimes. Pantomime per se is not a form of protolanguage; rather it provides a rich scaffolding for the emergence of protosign.

All this assumes rather than provides an explanation for LR4, the transition from making praxic movements (for example, those involved in the immediate satisfaction of some appetitive or aversive goal) to those intended by the utterer to have a particular effect on the recipient. I tentatively offer:

The intended communication hypothesis: The ability to
imitate combines with the ability to observe the effect of such imitation on conspecifics to support a migration of closed species-specific gestures supported by other brain regions to become the core of an open class of communicative gestures.
Darwin (1872/1965) observed long ago, across a far wider range of mammalian species than just the primates, that the facial expressions of conspecifics provide valuable cues to their likely reaction to certain courses of behavior (a rich complex summarized as “emotional state”). Moreover, the F5 region contains orofacial cells as well as manual cells. This suggests a progression from control of emotional expression by systems that exclude F5 to the extension of F5’s mirror capacity for orofacial as well as manual movement (discussed below), via its posited capacity (achieved by stage S3) for simple imitation, to support the imitation of emotional expressions. This would then provide the ability to affect the behavior of others by, for example, appearing angry. This would in turn provide the evolutionary opportunity to generalize the ability of F5 activity to affect the behavior of conspecifics from species-specific vocalizations to a general ability to use the imitation of behavior (as distinct from praxic behavior itself) as a means to influence others. This in turn makes possible reciprocity by a process of backward chaining where the influence is not so much on the praxis of the other as on the exchange of information. With this, the transition described by LR4 (intended communication) has been achieved in tandem with the achievement and increasing sophistication of LR2 (symbolization).
A further critical change (labeled 5b above) emerges from the fact that in pantomime it might be hard to distinguish, for example, a movement signifying “bird” from one meaning “flying.” This inability to adequately convey shades of meaning using “natural” pantomime would favor the invention of gestures that could in some way disambiguate which of their associated meanings was intended. Note that whereas a pantomime can freely use any movement that might evoke the intended observation in the mind of the observer, a disambiguating gesture must be conventionalized.19 This use of non-pantomimic gestures requires extending the use of the mirror system to attend to an entirely new class of hand movements. However, this does not seem to require a biological change beyond that limned above for pantomime.
As pantomime begins to use hand movements to mime different degrees of freedom (as in miming the flying of a bird), a dissociation begins to emerge. The mirror system for the pantomime (based on movements of face, hand, etc.) is now different from the recognition system for the action that is pantomimed, and – as in the case of flying – the action may not even be in the human action repertoire. However, the system is still able to exploit the praxic recognition system because an animal or hominid must observe much about the environment that is relevant to its actions but is not in its own action repertoire. Nonetheless, this dissociation now underwrites the emergence of protosign – an open system of actions that are defined only by their communicative impact, not by their direct relation to praxic goals.
Protosign may lose the ability of the original pantomime to elicit a response from someone who has not seen it before. However, the price is worth paying in that the simplified form, once agreed upon by the community, allows more rapid communication with less neural effort. One may see analogies in the history of Chinese characters. The character 山 (san) may not seem particularly pictorial, but if (following the “etymology” of Vaccari & Vaccari 1961), we see it as a simplification of a picture of three mountains, via intermediate forms, then we have no trouble seeing the simplified character 山 as meaning “mountain.”20 The important point here for our hypothesis is that although such a “picture history” may provide a valuable crutch to some learners, with sufficient practice the crutch is thrown away, and in normal reading and writing, the link between 山 and its meaning is direct, with no need to invoke an intermediate representation of the picture.
In the same way, I suggest that pantomime is a valuable crutch for acquiring a modern sign language, but that even signs which resemble pantomimes are conventionalized and are thus distinct from pantomimes.21 Interestingly, Emmorey (2002, Ch. 9) discusses studies of signers using ASL which show a dissociation between the neural systems involved in sign language and those involved in conventionalized gesture and pantomime. Corina et al. (1992b) reported left-hemisphere dominance for producing ASL signs, but no laterality effect when subjects had to produce symbolic gestures (e.g., waving good-bye or thumbs-up). Other studies report patients with left-hemisphere damage who exhibited sign language impairments but well-preserved conventional gesture and pantomime. Corina et al. (1992a) described patient W.L. with damage to left-hemisphere perisylvian regions. W.L. exhibited poor sign language comprehension and production. Nonetheless, this patient could produce stretches of pantomime and tended to substitute pantomimes for signs, even when the pantomime required more complex movement. Emmorey sees such data as providing neurological evidence that signed languages consist of linguistic gestures and not simply elaborate pantomimes.
Figure 4 is based on a scheme offered by Arbib (2004) in response to Hurford’s (2004) critique of the mirror system hypothesis. Hurford makes the crucial point that we must (in the spirit of Saussure) distinguish the “sign” from the “signified.” In the figure, we distinguish the “neural representation of the sign” (top row) from the “neural representation of the signified” (bottom row). The top row of the figure makes explicit the result of the progression within the mirror system hypothesis of mirror systems for:
1. Grasping and manual praxic actions.
2. Pantomime of grasping and manual praxic actions.
3. Pantomime of actions outside the pantomimic’s own behavioral repertoire (e.g., flapping the arms to mime a flying bird).
4. Conventional gestures used to formalize and disambiguate pantomime (e.g., to distinguish “bird” from “flying”).
5. Protosign, comprising conventionalized manual (and related orofacial) communicative gestures.
However, I disagree with Hurford’s suggestion that there is a mirror system for all concepts – actions, objects, and more – which links the perception and action related to each concept.22 In schema theory (Arbib 1981; 2003), I distinguish between perceptual schemas, which determine whether a given “domain of interaction” is present in the environment and provide parameters concerning the current relationship of the organism with that domain, and motor schemas, which provide the control systems which can be coordinated to effect a wide variety of actions. Recog-
system would work just as well if we and all our ancestors had been deaf. However, primates do have a rich auditory system which contributes to species survival in many ways, of which communication is just one (Ghazanfar 2003). The protolanguage perception system could thus build upon the existing auditory mechanisms in the move to derive protospeech. However, it appears that considerable evolution of the vocal-motor system was needed to yield the flexible vocal apparatus that distinguishes humans from other primates. MacNeilage (1998) offers an argument for how the mechanism for producing consonant-vowel alternations en route to a flexible repertoire of syllables might have evolved from the cyclic mandibular alternations of eating, but offers no clue as to what might have linked such a process to the expression of meaning (but see MacNeilage & Davis, in press b). This problem is discussed much further in Arbib (2005) which spells out how protosign (S5) may have provided a scaffolding for protospeech (S6), forming an “expanding spiral” wherein the two interacted with each other in supporting the evolution of brain and body that made Homo sapiens “language-ready” in a multi-modal integration of manual, facial and vocal actions.
New data on mirror neurons for grasping that exhibit auditory responses, and on mirror-like properties of orofacial neurons in F5, add to the subtlety of the argument. Kohler et al. (2002) studied mirror neurons for actions which are accompanied by characteristic sounds, and found that a subset of these neurons are activated by the sound of the action (e.g., breaking a peanut in half) as well as sight of the action. Does this suggest that protospeech mediated by the F5 homologue in the hominid brain could have evolved without the scaffolding provided by protosign? My answer is negative for two reasons: (1) I have argued that imitation is crucial to grounding pantomime in which a movement is performed in the absence of the object for which such a movement would constitute part of a praxic action. However, the sounds studied by Kohler et al. (2002) cannot be created in the absence of the object, and there is no evidence that monkeys can use their vocal apparatus to mimic the sounds they have heard. I would further argue that the limited number and congruence of these “auditory mirror neurons” is more consistent with the view that manual gesture is primary in the early stages of the evolution of language readiness, with audiomotor neurons laying the basis for later extension of protosign to protospeech.
Complementing earlier studies on hand neurons in macaque F5, Ferrari et al. (2003) studied mouth motor neurons in F5 and showed that about one-third of them also discharge when the monkey observes another individual performing mouth actions. The majority of these “mouth mirror neurons” become active during the execution and observation of mouth actions related to ingestive functions such as grasping, sucking, or breaking food. Another population of mouth mirror neurons also discharges during the execution of ingestive actions, but the most effective visual stimuli in triggering them are communicative mouth gestures (e.g., lip-smacking) – one action becomes associated with a whole performance of which one part involves similar movements. This fits with the hypothesis that neurons learn to associate patterns of neural firing rather than being committed to learn specifically pigeonholed categories of data. Thus, a potential mirror neuron is in no way committed to become a mirror neuron in the strict sense, even though it may be more likely to do so than otherwise. The observed communicative actions (with the effective executed action for different “mirror neurons” in parentheses) include lip-smacking (sucking and lip-smacking); lips protrusion (grasping with lips, lips protrusion, lip-smacking, grasping, and chewing); tongue protrusion (reaching with tongue); teeth-chatter (grasping); and lips/tongue protrusion (grasping with lips and reaching with tongue; grasping). We therefore see that the communicative gestures and their associated effective observed actions are a long way from the sort of vocalizations that occur in speech (see Fogassi & Ferrari [in press] for further discussion).
Rizzolatti and Arbib (1998) stated that “This new use of vocalization [in speech] necessitated its skillful control, a requirement that could not be fulfilled by the ancient emotional vocalization centers. This new situation was most likely the ‘cause’ of the emergence of human Broca’s area.” I would now rather say that Homo habilis and even more so Homo erectus had a “proto-Broca’s area” based on an F5-like precursor mediating communication by manual and orofacial gestures, which made possible a process of collateralization whereby this “proto” Broca’s area gained primitive control of the vocal machinery, thus yielding increased skill and openness in vocalization, moving from the fixed repertoire of primate vocalizations to the unlimited (open) range of vocalizations exploited in speech. Speech apparatus and brain regions could then coevolve to yield the configuration seen in modern Homo sapiens.
Corballis (2003b) argues that there may have been a single-gene mutation producing a “dextral” allele, which created a strong bias toward right-handedness and left-cerebral dominance for language at some point in hominid evolution.24 He then suggests that the “speciation event” that distinguished Homo sapiens from other large-brained hominids may have been a switch from a predominantly gestural to a predominantly vocal form of language. By contrast, I would argue that there was no one distinctive speciation event, and that the process whereby communication for most humans became predominantly vocal was not a switch but was “cultural” and cumulative.
7. The inventions of languages
The divergence of the Romance languages from Latin took about one thousand years. The divergence of the Indo-European languages to form the immense diversity of Hindi, German, Italian, English, and so on took about 6,000 years (Dixon 1997). How can we imagine what has changed since the emergence of Homo sapiens some 200,000 years ago? Or in 5,000,000 years of prior hominid evolution? I claim that the first Homo sapiens were language-ready but did not have language in the modern sense. Rather, my hypothesis is that stage S7, the transition from protolanguage to language, is the culmination of manifold discoveries in the history of mankind:
In section 2, I asserted that in much of protolanguage, a complete communicative act involved a unitary utterance, the use of a single symbol formed as a sequence of gestures, whose component gestures – whether manual or vocal – had no independent meaning. Unitary utterances such as “grooflook” or “koomzash” might have encoded quite complex descriptions such as “The alpha male has killed a meat animal and now the tribe has a chance to feast together. Yum, yum!” or commands such as “Take your spear and go
around the other side of that animal and we will have a better chance together of being able to kill it.” On this view, “protolanguage” grew by adding arbitrary novel unitary utterances to convey complex but frequently important situations, and it was a major later discovery en route to language as we now understand it that one could gain expressive power by fractionating such utterances into shorter utterances conveying components of the scene or command (cf. Wray 1998; 2000). Put differently, the utterances of prelanguage were more akin to the “calls” of modern primates – such as the “leopard call” of the vervet monkey, which is emitted by a monkey who has seen a leopard and which triggers the appropriate escape behavior in other monkeys – than to sentences as defined in a language like English, but they differed crucially from the primate calls in that new utterances could be invented and acquired through learning within a community, rather than emerging only through biological evolution. Thus, the set of such unitary utterances was open, whereas the set of calls was closed.
The following hypothetical but instructive example is similar to examples offered at greater length by Wray (1998; 2000) to suggest how the fractionation of unitary utterances might occur (and see Kirby [2000] for a related computer simulation): Imagine that a tribe has two unitary utterances concerning fire which, by chance, contain similar substrings which become regularized so that for the first time there is a sign for “fire.” Now the two original utterances are modified by replacing the similar substrings by the new regularized substring. Eventually, some tribe members regularize the complementary gestures in the first string to get a sign for “burns”; later, others regularize the complementary gestures in the second string to get a sign for “cooks meat.” However, because of the arbitrary origin of the sign for “fire,” the placement of the gestures that have come to denote “burns” relative to “fire” differs greatly from those for “cooks meat” relative to “fire.” It therefore requires a further invention to regularize the placement of the gestures in both utterances – and in the process, words are crystallized at the same time as the protosyntax that combines them. Clearly, such fractionation could apply to protosign as well as to protospeech.
However, fractionation is not the only mechanism that could produce composite structures. For example, a tribe might over the generations develop different signs for “sour apple,” “ripe apple,” “sour plum,” “ripe plum,” and so on, but not have signs for “sour” and “ripe” even though the distinction is behaviorally important. Hence, 2n signs are needed to name n kinds of fruit. Occasionally someone will eat a piece of sour fruit by mistake and make a characteristic face and intake of breath when doing so. Eventually, some genius pioneers the innovation of getting a conventionalized variant of this gesture accepted as the sign for “sour” by the community, to be used as a warning before eating the fruit, thus extending the protolanguage.25 A step towards language is taken when another genius gets people to use the sign for “sour” plus the sign for “ripe X” to replace the sign for “sour X” for each kind X of fruit. This innovation allows new users of the protolanguage to simplify learning fruit names, since now only n + 1 names are required for the basic vocabulary, rather than 2n as before. More to the point, if a new fruit is discovered, only one name need be invented rather than two. I stress that the invention of “sour” is a great discovery in and of itself. It might take hundreds of such discoveries distributed across centuries or more before someone could recognize the commonality across all these constructions and thus invent the precursor of what we would now call adjectives.26
The latter example is meant to indicate how a sign for “sour” could be added to the protolanguage vocabulary with no appeal to an underlying “adjective mechanism.” Instead, one would posit that the features of language emerged by bricolage (tinkering) which added many features as “patches” to a protolanguage, with general “rules” emerging both consciously and unconsciously only as generalizations could be imposed upon, or discerned in, a population of ad hoc mechanisms. Such generalizations amplified the power of groups of inventions by unifying them to provide expressive tools of greatly extended range. According to this account, there was no sudden transition from unitary utterances to an elaborate language with a rich syntax and compositional semantics; no point at which one could say of a tribe “Until now they used protolanguage but henceforth they use language.”
To proceed further, I need to distinguish two “readings” of a case frame like Grasp(Leo, raisin), as an action-object frame and as a verb-argument structure. I chart the transition as follows:
(1) As an action-object frame, Grasp(Leo, raisin) represents the perception that Leo is grasping a raisin. Here the action “grasp” involves two “objects,” one the “grasper” Leo and the other the “graspee,” the “raisin.” Clearly the monkey has the perceptual capability to recognize such a situation27 and enter a brain state that represents it, with that representation distributed across a number of brain regions. Indeed, in introducing principle LR5 (from hierarchical structuring to temporal ordering) I noted that the ability to translate a hierarchical conceptual structure into a temporally ordered structure of actions is apparent whenever an animal takes in the nature of a visual scene and produces appropriate behavior. But to have such a capability does not entail the ability to communicate in a way that reflects these structures. It is also crucial to note here the importance of recognition not only of the action (mediated by F5) but also of the object (mediated by IT). Indeed, Figure 2 (the FARS model) showed that the canonical activity of F5 already exhibits a choice between the affordances of an object (mediated by the dorsal stream) that involves the nature of the object (as recognized by IT and elaborated upon in PFC in a process of “action-oriented perception”). In the same way, the activity of mirror neurons does not rest solely upon the parietal recognition (in PF, Fig. 3) of the hand motion and the object’s affordances (AIP) but also on the “semantics” of the object as extracted by IT. In the spirit of Figure 2, I suggest that this semantics is relayed via PFC and thence through AIP and PF to F5 to affect there the mirror neurons as well as the canonical neurons.
(2) My suggestion is that at least the immediate hominid precursors of Homo sapiens would have been able to perceive a large variety of action-object frames and, for many of these, to form a distinctive gesture or vocalization to appropriately direct the attention of another tribe member, but that the vocalization used would be in general a unitary utterance which need not have involved separate lexical entries for the action or the objects. However, the ability to symbolize more and more situations would have required the creation of a “symbol tool kit” of meaningless elements28 from which an open-ended class of symbols could be generated.
(3) As a verb-argument structure, Grasp(Leo, raisin) is expressed in English in a sentence such as “Leo grasps the raisin,” with “grasps” the verb, and “Leo” and “raisin” the arguments. I hypothesize that stage S7 was grounded in the development of precursors to verb-argument structure using vocalizations that were decomposable into “something like a verb” and two somethings that would be “something like nouns.” This is the crucial step in the transition from protolanguage to human language as we know it. Abstract symbols are grounded (but more and more indirectly) in action-oriented perception; members of a community may acquire the use of these new symbols (the crucial distinction here is with the fixed repertoire of primate calls) by imitating their use by others; and, crucially, these symbols can be compounded in novel combinations to communicate about novel situations for which no agreed-upon unitary communicative symbol exists.
Having stressed above that adjectives are not a “natural category,” I hasten to add that I do not regard verbs or nouns as natural categories either. What I do assert is that every human language must find a way to express the content of action-object frames. The vast variety of these frames can yield many different forms of expression across human languages. I view linguistic universals as being based on universals of communication that take into account the processing loads of perception and production rather than as universals of autonomous syntax. Hence, in emphasizing verb-argument structures in the form familiar from English, I am opting for economy of exposition rather than further illustration of the diversities of human language. To continue with the bricolage theme, much of “protosyntax” would have developed at first on an ad hoc basis, with variations on a few basic themes, rather than being grounded from the start in broad categories like “noun” or “verb” with general rule-like procedures to combine them in the phonological expression of cognitive form. It might have taken many, many millennia for people to discover syntax and semantics in the sense of gaining immense expressive power by “going recursive” with a relatively limited set of strategies for compounding and marking utterances. As a language emerged, it would come to include mechanisms to express kinship structures and technologies of the tribes, and these cultural products would themselves be expanded by the increased effectiveness of transmission from generation to generation that the growing power of language made possible. Evans (2003) supports this view by surveying a series of linguistic structures in which some syntactic rules must refer to features of the kinship system which are common in Australian aboriginal tribes but are unknown elsewhere. On this basis, we see such linguistic structures as historical products reflecting the impact of various processes of “cultural selection” on emerging structure.
If one starts with unitary utterances, then symbols that correspond to statements like “Take your spear and go around the other side of that animal and we will have a better chance together of being able to kill it” must each be important enough, or occur often enough, for the tribe to agree on a symbol (e.g., arbitrary string of phonemes) and for each one to replace an elaborate pantomime with a conventionalized utterance of protosign or protospeech. Discovering that separate names could be assigned to each actor, object, and action would require many words instead of one to express such an utterance. However, once the number of utterances with overlap reaches a critical level, economies of word learning would accrue from building utterances from “reusable” components (cf. the Wray-Kirby and “sour fruit” scenarios above). Separating verbs from nouns lets one learn m + n + p words (or less if the same noun can fill two roles) to be able to form m*n*p of the most basic utterances. Of course, not all of these combinations will be useful, but the advantage is that new utterances can now be coined “on the fly,” rather than each novel event acquiring group mastery of a novel utterance.
Nowak et al. (2000) analyzed conditions under which a population that had two genes – one for unitary utterances and one for fractionated utterances – would converge into a situation in which one gene or the other (and therefore one type of language or the other) would predominate. But I feel that this misses the whole point: (1) It assumes that there is a genetic basis for this alternative, whereas I believe the basis is historical, without requiring genetic change. (2) It postulates that the alternatives already exist. I believe it is necessary to offer a serious analysis of how both unitary and fractionated utterances came to exist, and of the gradual process of accumulating changes that led from the predominance of the former to the predominance of the latter. (3) Moreover, it is not a matter of either/or – modern languages have a predominance of fractionated utterances but make wide use of unitary utterances as well.
The spread of these innovations rested on the ability of other humans not only to imitate the new actions and compounds of actions demonstrated by the innovators, but also to do so in a way that related increasingly general classes of symbolic behavior to the classes, events, behaviors, and relationships that they were to represent. Indeed, consideration of the spatial basis for “prepositions” may help show how visuomotor coordination underlies some aspects of language (cf. Talmy 2000), whereas the immense variation in the use of corresponding prepositions even in closely related languages like English and Spanish shows how the basic functionally grounded semantic-syntactic correspondences have been overlaid by a multitude of later innovations and borrowings.
The transition to Homo sapiens thus may have involved “language amplification” through increased speech ability coupled with the ability to name certain actions and objects separately, followed by the ability to create a potentially unlimited set of verb-argument structures and the ability to compound those structures in diverse ways. Recognition of hierarchical structure rather than mere sequencing provided the bridge to constituent analysis in language.
8. Towards a neurolinguistics “beyond the mirror”
Most of the stages of our evolutionary story are not to be seen so much as replacing “old” capabilities of the ancestral brain with new ones, but rather, as extending those capabilities by embedding them in an enriched system. I now build on our account of the evolution of the language-ready brain to offer a synchronic account of the “layered capabilities” of the modern adult human brain.
Aboitiz and García (1997) offer a neuroanatomical perspective on the evolutionary origin of the language areas in the human brain by analyzing possible homologies between language areas of the human brain and areas of the monkey
Figure 6. Extending the FARS model to include the mirror system for grasping and the language system evolved “atop” this. Note that
this simple figure neither asserts nor denies that the extended mirror system for grasping and the language-supporting system are anatom-
ically separable, nor does it address issues of lateralization. (From Arbib & Bota 2003.)
brain that may offer clues as to the structures of the brains of our ancestors of 20 million years ago. Arbib and Bota (2003) summarize the Aboitiz-García and mirror system hypotheses and summarize other relevant data on homologies between different cortical areas in macaque and human to ground further work on an evolutionary account of the readiness of the human brain for language.
Figure 6 is the diagram Arbib and Bota (2003) used to synthesize lessons about the language mechanisms of the human brain, extending a sketch for a “mirror neurolinguistics” (Arbib 2001b). This figure was designed to elicit further modeling; it does not have the status of fully implemented models, such as the FARS and MNS models, whose relation to, and prediction of, empirical results has been probed through computer simulation.
To start our analysis of Figure 6, note that an over-simple analysis of praxis, action understanding, and language production might focus on the following parallel parieto-frontal interactions:
I. object → AIP → F5canonical: praxis
II. action → PF → F5mirror: action understanding
III. scene → Wernicke’s → Broca’s: language production
The data on patients A.T. and D.F. reviewed in section 3.1 showed a dissociation between the praxic use of size information (parietal) and the “declaration” of that information either verbally or through pantomime (inferotemporal). D.F. had a lesion allowing signals to flow from V1 towards posterior parietal cortex (PP) but not from V1 to inferotemporal cortex (IT). D.F. could preshape accurately when reaching to grasp an object, even though she was unable to declare, either verbally or in pantomime, the visual parameters that guided the preshape. By contrast, A.T. had a bilateral posterior parietal lesion. A.T. could use her hand to pantomime the size of a cylinder, but could not preshape appropriately when asked to grasp it. This suggests the following scheme:
IV. Parietal “affordances” → preshape
V. IT “perception of object” → pantomime or verbally describe size
That is, one cannot pantomime or verbalize an affordance; but rather one needs a “recognition of the object” (IT) to which attributes can be attributed before one can express them. Recall now the path shown in Figure 2 from IT to AIP, both directly and via PFC. I postulate that similar pathways link IT and PF. I show neither of these pathways in Figure 6, but rather show how this pathway might in the human brain not only take the form needed for praxic actions but also be “reflected” into a pathway that supports the recognition of communicative manual actions. We would then see the “extended PF” of this pathway as functionally integrated with the posterior part of Brodmann’s area 22, or area Tpt (temporo-parietal) as defined by Galaburda and Sanides (1980). Indeed, lesion-based views of Wernicke’s area may include not only the posterior part of Tpt but also (in whole or in part) areas in the human cortex that correspond to macaque PF (see Arbib & Bota [2003] for further details). In this way, we see Wernicke’s area as combining capabilities for recognizing protosign and protospeech to support a language-ready brain that is capable of learning signed languages as readily as spoken languages.
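The dissociation between patients D.F. and A.T. described in this section (schemes IV and V) can be restated as a toy routing sketch. This is only an illustration of the logic, not a model from the article: the names `ROUTES` and `spared_behaviors` are hypothetical, each behavior is assumed to depend on a single route from V1, and A.T.'s bilateral posterior parietal lesion is simplified to the loss of the V1-to-PP link.

```python
# Toy sketch (hypothetical names): which size-related behaviors survive a lesion.
# Each behavior is assumed to depend on one route from V1, following schemes IV and V.
ROUTES = {
    "preshape": ["V1", "PP"],        # IV. parietal "affordances" -> preshape
    "pantomime_size": ["V1", "IT"],  # V. IT "perception of object" -> pantomime
    "describe_size": ["V1", "IT"],   # V. ... or verbally describe size
}


def spared_behaviors(lesioned_links):
    """Return the behaviors whose route avoids every lesioned link."""
    spared = []
    for behavior, route in ROUTES.items():
        # Consecutive pairs of areas along the route are the links it uses.
        links = set(zip(route, route[1:]))
        if not links & lesioned_links:
            spared.append(behavior)
    return sorted(spared)


# D.F.: signals flow from V1 to PP but not from V1 to IT.
DF_LESION = {("V1", "IT")}
# A.T.: posterior parietal lesion, simplified here (an assumption) to loss of V1 -> PP.
AT_LESION = {("V1", "PP")}

print(spared_behaviors(DF_LESION))  # ['preshape']
print(spared_behaviors(AT_LESION))  # ['describe_size', 'pantomime_size']
```

Under these assumptions the sketch shows D.F. retaining preshaping while losing the ability to declare size, and A.T. showing the reverse pattern, which is the double dissociation that motivates separating the parietal and inferotemporal routes.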
Finally, we note that Arbib and Bota (2003) responded to the analysis of Aboitiz and García (1997) by including a number of working memories crucial to the linkage of visual scene perception, motor planning, and the production and recognition of language. However, they did not provide data on the integration of these diverse working memory systems into their anatomical scheme.
When building upon Figure 6 in future work in neurolinguistics, we need to bear in mind the definition of “complex imitation” as the ability to recognize another’s performance as a set of familiar movements and then repeat them, but also to recognize when such a performance combines novel actions that can be approximated by (i.e., more or less crudely be imitated by) variants of actions already in the repertoire. Moreover, in discussing the FARS model in section 3.1, I noted that the interactions shown in Figure 2 are supplemented in the computer implementation of the model by code representing the role of the basal ganglia in administering sequences of actions, and that Bischoff-Grethe et al. (2003) model the possible role of the basal ganglia in interactions with the pre-SMA in sequence learning. Therefore, I agree with Visalberghi and Fragaszy’s (2002, p. 495) suggestion that “[mirror] neurons provide a neural substrate for segmenting a stream of action into discrete elements matching those in the observer’s repertoire, as Byrne (1999) has suggested in connection with his string-parsing theory of imitation,” while adding that the success of complex imitation requires that the appropriate motor system be linked to appropriate working memories (as in Fig. 6) as well as to pre-SMA and basal ganglia (not shown in Fig. 6) to extract and execute the overall structure of the compound action (which may be sequential, or a more general coordinated control program [Arbib 2003]). Lieberman (2002) emphasizes that the roles of Broca’s and Wernicke’s areas must be seen in relation to larger neocortical and subcortical circuits. He cites data from studies of Broca’s aphasia, Parkinson’s disease, focal brain damage, and so on, to demonstrate the importance of the basal ganglia in sequencing the elements that constitute a complete motor act, syntactic process, or thought process. Hanakawa et al. (2002) investigated numerical, verbal, and spatial types of nonmotor mental-operation tasks. Parts of the posterior frontal cortex, consistent with the pre-supplementary motor area (pre-SMA) and the rostral part of the dorsolateral premotor cortex (PMdr), were active during all three tasks. They also observed activity in the posterior parietal cortex and cerebellar hemispheres during all three tasks. An fMRI study showed that PMdr activity during the mental-operation tasks was localized in the depths of the superior precentral sulcus, which substantially overlapped the region active during complex finger movements and was located dorsomedial to the presumptive frontal eye fields.
Such papers are part of the rapidly growing literature that relates human brain mechanisms for action recognition, imitation, and language. A full review of such literature is beyond the scope of the target article, but let me first list a number of key articles – Binkofski et al. (1999), Decety et al. (1997), Fadiga et al. (2002), Grezes et al. (1998), Grezes and Decety (2001; 2002), Heiser et al. (2003), Hickok et al. (1998), Iacoboni et al. (1999; 2001), and Floel et al. (2003) – and then briefly describe a few others:
Koski et al. (2002) used fMRI to assess the effect of explicit action goals on neural activity during imitation. Their results support the hypothesis that areas relevant to motor preparation and motor execution are tuned to coding goal-oriented actions and are in keeping with single-cell recordings revealing that neurons in area F5 of the monkey brain represent goal-directed aspects of actions. Grezes et al. (2003) used event-related fMRI to investigate where in the human brain activation can be found that reflects both canonical and mirror neuronal activity. They found activation in the intraparietal and ventral limbs of the precentral sulcus when subjects observed objects and when they executed movements in response to the objects (“canonical neurons”); and activation in the dorsal premotor cortex, the intraparietal cortex, the parietal operculum (SII), and the superior temporal sulcus when subjects observed gestures (“mirror neurons”). Finally, activations in the ventral premotor cortex and inferior frontal gyrus (Brodmann area [BA] 44) were found when subjects imitated gestures and executed movements in response to objects. These results suggest that in the human brain, the ventral limb of the precentral sulcus may form part of the area designated F5 in the macaque monkey. It is possible that area 44 forms an anterior part of F5, though anatomical studies suggest that it may be a transitional area between the premotor and prefrontal cortices.
Manthey et al. (2003) used fMRI to investigate whether paying attention to objects versus movements modulates premotor activation during the observation of actions. Participants were asked to classify presented movies as showing correct actions, erroneous actions, or senseless movements. Erroneous actions were incorrect either with regard to employed objects, or to performed movements. The ventrolateral premotor cortex (vPMC) and the anterior part of the intraparietal sulcus (aIPS) were strongly activated during the observation of actions in humans. Premotor activation was dominantly located within BA 6, and sometimes extended into BA 44. The presentation of object errors and movement errors showed that left premotor areas were more involved in the analysis of objects, whereas right premotor areas were dominant in the analysis of movements. (Since lateralization is not analyzed in this article, such data may be a useful springboard for commentaries.)
To test the hypothesis that action recognition and language production share a common system, Hamzei et al. (2003) combined an action recognition task with a language production task and a grasping movement task. Action recognition-related fMRI activation was observed in the left inferior frontal gyrus and on the border between the inferior frontal gyrus (IFG) and precentral gyrus (PG), the ventral occipito-temporal junction, the superior and inferior parietal cortex, and in the intraparietal sulcus in the left hemisphere. An overlap of activations due to language production, movement execution, and action recognition was found in the parietal cortex, the left inferior frontal gyrus, and the IFG-PG border. The activation peaks of action recognition and verb generation were always different in single subjects, but no consistent spatial relationship was detected, presumably suggesting that action recognition and language production share a common functional architecture, with functional specialization reflecting developmental happenstance.
Several studies provide behavioral evidence supporting the hypothesis that the system involved in observation and preparation of grasp movements partially shares the cortical areas involved in speech production. Gentilucci (2003a) had subjects pronounce either the syllable ba or ga while
observing motor acts of hand grasp directed to objects of two sizes, and found that both lip aperture and voice peak amplitude were greater when the observed hand grasp was directed to the large object. Conversely, Glover and Dixon (2002; see Glover et al. 2004 for related results) presented subjects with objects on which were printed either the word large or small. An effect of the words on grip aperture was found early in the reach, but this effect declined continuously as the hand approached the target, presumably due to the effect of visual feedback. Gerlach et al. (2002) showed that the left ventral premotor cortex is activated during categorization not only for tools but also for fruits and vegetables and articles of clothing, relative to animals and nonmanipulable man-made objects. Such findings support the notion that certain lexical categories may evolve from action-based knowledge but are difficult to account for should knowledge representations in the brain be truly categorically organized.
Several insights have been gleaned from the study of signed language. Corina et al. (2003) used PET to examine deaf users of ASL as they generated verb signs independently with their right dominant and left nondominant hands (compared to the repetition of noun signs). Nearly identical patterns of left inferior frontal and right cerebellum activity were observed, and these were consistent with patterns that have been reported for spoken languages. Thus, lexical-semantic processing in production relies upon left-hemisphere regions regardless of the modality in which a language is realized, and, in signing, no matter which hand is used. Horwitz et al. (2003) studied the activation of Broca’s area during the production of spoken and signed language. They showed that BA45, not BA44, was activated by both speech and signing during the production of language narratives in bilingual subjects (fluent from early childhood in both ASL and English) with the generation of complex movements and sounds as control. Conversely, BA44, not BA45, was activated by the generation of complex articulatory movements of oral-laryngeal or limb musculature. Horwitz et al. therefore conclude that BA45 is the part of Broca’s area that is fundamental to the modality-independent aspects of language generation.
Gelfand and Bookheimer (2003), using fMRI, found that the posterior portion of Broca’s area responded specifically to sequence manipulation tasks, whereas the left supramarginal gyrus was somewhat more specific to sequencing phoneme segments. These results suggest that the left posterior inferior frontal gyrus responds not to the sound structure of language but rather to sequential operations that may underlie the ability to form words out of dissociable elements.
Much more must be done to take us up the hierarchy from elementary actions to the recognition and generation of novel compounds of such actions. Nonetheless, the above preliminary account strengthens the case that no powerful syntactic mechanisms need have been encoded in the brain of the first Homo sapiens. Rather, it was the extension of the imitation-enriched mirror system to support intended communication that enabled human societies, across many millennia of invention and cultural evolution, to achieve human languages in the modern sense.
ACKNOWLEDGMENTS
The early stages of building upon “Language within Our Grasp” (Rizzolatti & Arbib 1998) were conducted during my sabbatical visits in 1999 to the University of Western Australia and the Institute of Human Physiology in Parma, Italy, and my conversations there with Robyn Owens, E. J. Holden, Giacomo Rizzolatti, Morten Christiansen, Giuseppe Cossu, Giuseppe Luppino, Massimo Matelli, Vittorio Gallese, and other colleagues. So many people have offered perceptive comments on various results of that effort (as published in, e.g., Arbib 2001a; 2001b; 2002) that the following list is woefully incomplete – Shannon Casey, Chris Code, Bob Damper, Kerstin Dautenhahn, Barry Gordon, Jim Hurford, Bipin Indurkhya, Chrystopher Nehaniv, and Chris Westbury – but I do hope that all these people (and the BBS referees), whether named or not, will realize how much I value their thoughtful comments and that they will see how their suggestions and comments have helped me clarify, correct, and extend my earlier analyses.
Preparation of the present paper was supported in part by a fellowship from the Center for Interdisciplinary Research of the University of Southern California. In particular, this fellowship allowed me to initiate a faculty seminar in September of 2002 at which my ideas have been exposed to intense though friendly scrutiny and placed in the context of the range of fascinating work by the members of the seminar – Amit Almor, Elaine Andersen, Aude Billard, Mihail Bota, Dani Byrd, Vincent Chen, Karen Emmorey, Andrew Gordon, James Gordon, Jack Hawkins, Jerry R. Hobbs, Laurent Itti, Toby Mintz, Stefan Schaal, Craig Stanford, Jean-Roger Vergnaud, Christoph von der Malsburg, Carolee Winstein, Michail Zak, Patricia Zukow-Goldring, and Kie Zuraw.
NOTES
1. Bickerton (1995) views infant language, pidgins, and the “language” taught to apes as protolanguages in the sense of a form of communication whose users can only string together a small handful of words at a time with little if any syntax. Bickerton hypothesizes that the protolanguage (in my sense) of Homo erectus was a protolanguage in his sense, in which a few words much like those of today’s language are uttered a few at a time to convey meaning without the aid of syntax. I do not assume (or agree with) this hypothesis.
2. Today’s signed languages are fully expressive human languages with a rich syntax and semantics, and are not to be confused with the posited systems of protosign communication. By the same token, protospeech is a primitive form of communication based on vocal gestures but without the richness of modern human spoken languages.
3. Since we will be concerned in what follows with sign language as well as spoken language, the “speaker” and “hearer” may be using hand and face gestures rather than vocal gestures for communication.
4. However, I shall offer below the view that early forms of protosign provided a scaffolding for the initial development of protospeech, rather than holding that protosign was “completed” before protospeech was “initiated.”
5. I would welcome commentaries on “language-like” aspects of communication in nonprimates, but the present article is purely about changes within the primates that led to the human language-ready brain.
6. It could be objected that monkey calls are not “involuntary communication” because, for example, vervet alarm calls are given usually in the presence of conspecifics who would react to them. However, I would still call this involuntary – this just shows that two conditions, rather than one, are required to trigger the call. This is distinct from the human use of language to conduct a conversation that may have little or no connection to the current situation.
7. When I speak of a “stage” in phylogeny, I do not have in mind an all-or-none switch in the genotype that yields a discontinuous change in the phenotype, but rather the coalescence of a variety of changes that can be characterized as forming a global pattern that may emerge over the course of tens or even hundreds of millennia.
8. Let me stress that complex imitation involves both the
oped over the course of human evolution, however, unfortunately cannot be determined at our present level of knowledge.
Beyond the mirror neuron – the smoke neuron?
Derek Bickerton
Department of Linguistics, University of Hawaii, Honolulu, HI 96822. derbick@hawaii.rr.com
Abstract: Mirror neurons form a poor basis for Arbib’s account of language evolution, failing to explain the creativity that must precede imitation, and requiring capacities (improbable in hominids) for categorizing situations and unambiguously miming them. They also commit Arbib to an implausible holophrastic protolanguage. His model is further vitiated by failure to address the origins of symbolization and the real nature of syntax.
Mirror-neuron theory is the second-latest (FOXP2 is the latest) in a series of magic-bullet solutions to the problems of language evolution. To his credit, Arbib realizes it could not account for all of language. Unfortunately, his attempts to go beyond it fall far short of adequacy.
Even as a significant component of language, mirror neurons are dubious. There cannot be imitation unless someone has first created something to imitate, and mirror neurons offer no clue as to how totally novel sequences – complex ones, at that – could have been created ab ovo. Moreover, when someone addresses you, you don’t just imitate what they said (unless you want to be thought an idiot); you say something equally novel.
Arbib treats as wholly unproblematic both the category “frequently occurring situation” and the capacity of pantomime to represent such situations. Situations, frequent or otherwise, do not come with labels attached; indeed, it is questionable whether any species could isolate “a situation” from the unbroken, ongoing stream of experience unless it already had a language with which to do so. For this task requires abstracting away from a potentially infinite number of irrelevant features – place, weather, time of day, number and identity of participants, and on and on. How, short of mind-reading powers that would leave professional clairvoyants gasping, could our alingual ancestors have known which features seemed relevant to the sender of the message, and which did not?
If Arbib answers “through pantomime,” one assumes he has never played charades. Those who have, know that even with the help of a large set of “disambiguating signs” – stereotypic gestures for “film title,” “book title,” and so on, elaborate routines of finger-counting to provide numbers of words and syllables – participants with all the resources of modern human language and cognition find it often difficult and sometimes impossible to guess what the pantomimer is trying to represent. When what is to be represented is not a monosyllabic word but something as complex as “The alpha male has killed a meat animal and now the tribe has a chance to feast together. Yum, yum!” or “Take your spear and go round the other side of that animal and we will have a better chance of being able to kill it” (Arbib’s own examples, sect. 7, para. 2), the likelihood of successful guessing becomes infinitesimally small.
Arbib does see what I pointed out more than a decade ago (Bickerton 1990, pp. 97–98),¹ that any espousal of mirror neurons commits one to a holistic (Wray 1998; 2000) rather than a synthetic protolanguage – one that would have to represent “bird flying” with one symbol, rather than two (“bird” and “flying”) as all contemporary languages do (see Bickerton [2003] and especially Tallerman [2004] for discussion). True language is then supposed to develop straightforwardly through the “fractionation” of this protolanguage. Arbib asks us to “imagine that a tribe has two unitary utterances concerning fire which, by chance, contain similar substrings” (sect. 7, para. 3). But won’t similar substrings also occur in unitary utterances that have nothing to do with fire? Here he is on the horns of a dilemma. If he thinks they will not, he has smuggled in a ready-made word, and if all “similar substrings” behave similarly, a holistic stage becomes superfluous – all the separate words of a synthetic language are already present, clumsily disguised. If he thinks they will – and given the limited number of possible syllables even in modern languages, they will probably occur more often in sequences that have nothing to do with fire – why should they be taken as meaning “fire” in the rarer cases, and what will similar strings in other contexts be assumed to mean? And even before this dilemma can be addressed, Arbib must specify what would count as “similar enough” and explain why phonetic or gestural similarities would not be eroded by natural change processes long before hominids could correlate them with similarities of meaning. Moreover, to extract a symbol meaning “fire” from a holistic utterance, our ancestors must first have had the semantic concept of fire, and it becomes wholly unclear why, instead of going the roundabout holistic route, they could not immediately have given that concept a (signed or spoken) label. Real-world objects can be ostensively defined; events and situations cannot.
Two substantive issues lie at the heart of language evolution: how symbolism emerged, and how syntax emerged. No treatment that fails to deal with both can be taken seriously. Indeed, symbolism (as distinct from iconic or indexical reference, distinctions that Arbib nowhere makes) has seemed to some (e.g., Deacon 1997) to be the Rubicon between our species and others. Arbib mentions it several times, hypothesizing it as a “support” for protolanguage and noting the necessity for its “increasing sophistication” as true language emerges. But at no point does he even acknowledge the problem of how such an evolutionary novelty could have developed.
Syntax makes an even better candidate for a human apomorphy, since even with explicit instruction our nearest relatives fail to acquire the rudiments of it (Givon 2004). Arbib’s dismissal of syntax as a “historical phenomenon” makes such uniqueness hard to explain. According to him, “Words as we know them then coevolved culturally with syntax through fractionation” (sect. 2, para. 2). Even if syntax meant only the most frequent word-order in simple affirmative sentences, this claim might be tricky to defend. In fact, syntax depends on a wide variety of relationships within complex hierarchical structures. Where do these structures and relationships come from? Arbib, ignoring the half-century of linguistic research that has revealed (if not explained) them, remains silent on this.
Arbib’s treatment claims to go “beyond the mirror.” However, what he offers is only a smoke-and-mirrors version of language evolution, one in which all the real issues are obscured. His flowcharts and neurological references may look impressive, but they tell us nothing about the central problems of the field.
NOTE
1. It is surely worth reminding readers that all the features of mirror neurons (except for their catchy title) were described by David Perrett and his associates (Perrett et al. 1982; 1985) more than two decades ago – a fact seldom acknowledged in contemporary accounts, including Arbib’s.
[Figure 2 plot not reproduced: panels (a) Neuron h4, (b) Neuron h5, and (c) Neuron h6, each plotting Neuron Activity against Time Step (100 to 200).]
Figure 2 (Borenstein & Ruppin). The activation level of three hidden neurons in a specific successful agent during time steps 100 to
200. Circles, squares, diamonds, and triangles represent the four possible actions in the repertoire. An empty shape indicates that the
action was only observed but not executed, a filled shape indicates that the action was executed by the agent (stimulated by a visible
world state) but not observed, and a dotted shape indicates time steps in which the action was both observed and executed. Evidently,
each of these neurons is associated with one specific action and discharges whenever this action is observed or executed.
Sharpening Occam’s razor: Is there need for a hand-signing stage prior to vocal communication?
Conrado Bosman, Vladimir López, and Francisco Aboitiz
Departamento Psiquiatría, Facultad de Medicina, Pontificia Universidad Católica de Chile, Casilla 114-D Santiago 1, Chile. cbosman@med.puc.cl vlopezh@puc.cl faboitiz@puc.cl http://www.neuro.cl
Abstract: We commend Arbib for his original proposal that a mirror neuron system may have participated in language origins. However, in our view he proposes a complex evolutionary scenario that could be more parsimonious. We see no necessity to propose a hand-based signing stage as ancestral to vocal communication. The prefrontal system involved in human speech may have its precursors in the monkey’s inferior frontal cortical domain, which is responsive to vocalizations and is related to laryngeal control.
In the target article, Arbib extends his earlier hypothesis about the role of mirror neurons for grasping in the motor control of language (Rizzolatti & Arbib 1998), to a more detailed and fine-grained scenario for language evolution. We agree with and celebrate the main proposals that a mirror neuron system has had a fundamental role in the evolution of human communication and that imitation was important in prelinguistic evolution. We also agree that there has probably been an important vocal-gestural interaction in the evolution of communication. In these and other aspects, our viewpoints complement each other (Aboitiz & García 1997). We proposed that language networks originated as a specialization from ancestral working memory networks involved in vocal communication, and Figure 6 of the target article is a good attempt to synthesize both hypotheses. However, we are not so sure yet about the claim that gestural language was a precursor for vocal communication, for several reasons:
First, phylogenetic evidence indicates that in nonhuman primates, vocal communication transmits external meaning (i.e., about events in the world) and is more diverse than gestural communication (Acardi 2003; Leavens 2003; Seyfarth & Cheney 2003a). Second, there is evidence suggesting that the control of vocalizations in the monkey could be partly carried out by cortical areas close to F5 and does not depend exclusively on the anterior cingulate cortex. If this is so, the neural precursor for language would not need to be sought in a hand-based coordination system. For example, in the monkey there is an important overlap between area F5 and the cortical larynx representation (Jürgens 2003). Electrical stimulation of this area can elicit vocal fold movements (Hast et al. 1974), and cortical lesions in the supplementary motor area can significantly reduce the total number of vocalizations emitted by monkeys (Gemba et al. 1997; Kirzinger & Jürgens 1982). Furthermore, Romanski and Goldman-Rakic (2002) recently described, in Brodmann areas 12 and 45 of the monkey, neurons that respond strongly to vocalizations.
For these reasons, we suggested that this frontal auditory/motor domain may belong to, or be the precursor of, a vocalization mirror system similar to the mirror system for grasping, which in hominids participated in vocal imitative behavior, allowing them to compare heard vocalizations with their own productions (Bosman et al. 2004; Jürgens 2003). All it would take to develop this system into a complex, voluntary vocalizing system might be a refinement of the respective circuits and increasing cortico-bulbar control. In this line, evidence indicates a phylogenetic trend from nonhuman primates to humans towards increasing cortical control of the tongue, which may be related to the superior role the tongue plays in speech (Jürgens & Alipour 2002).
In parallel to this evidence, a very recent fMRI study has demonstrated that in humans, listening to speech activates a superior portion of the ventral premotor cortex that largely overlaps with a speech-production motor area (Wilson et al. 2004). This evidence suggests the existence of a human vocalization mirror system, perhaps derived from the regions in the monkey described above. In consequence, we think that a more parsimonious hypothesis could be that instead of a serial dependence of vocal communication upon gestural communication, both coevolved to a large extent; that is, both developed their own circuitry in parallel, with a high degree of interaction between the two systems (Izumi & Kojima 2004).
Against these arguments, it has been claimed that in nonhuman primates, cortical control over hand movements is stronger than control of vocalizations, which partly explains why apes can be taught sign language and not vocal communication (Corballis 2003a). However, in our view this does not imply that gestural communication must be ancestral to vocal communication. The same or even more behavioral flexibility (including combinatorial abilities) than that observed in hand coordination, may have developed in vocal communication by elaborating on preexisting vocal circuits. A similar situation may be observed in the elephant’s trunk: the neural machinery controlling the trunk probably developed on its own, without the necessity of borrowing a coordination system from other motor devices (Pinker 1995). In addition, the presumed ancestral signing stage remains highly speculative, there being still no evidence for it. Summarizing, since in monkeys and apes most communication is vocal, and given that there is an incipient prefrontal control for vocalizations in them, we see no necessity to propose a stage of gestural communication preceding “protospeech.”
Finally, we would like to comment on the contrast previously made by Arbib and Bota (2003), which we think may be misleading, between their theory being “prospective” (finding what is in the monkey – hand coordination – which may have served as a substrate for human language), and our theory (Aboitiz & García 1997) being “retrospective” (looking at what is in the human brain – working memory – and tracking it back to the monkey brain). Aboitiz and García (1997) followed standard phylogenetic methodology: first, the study identified in the monkey the networks that can be homologous to the language-related neural networks; second, it asked about the functions of these networks in the monkey and in the human, one of which is working memory. A good analogy for this strategy comes from the evolution of the eye (Dawkins 1996): Although image formation is a highly derived characteristic, there are more basic functions such as photoreception, which are central to vision and shared by other species whose visual organs lack image-forming properties; these functions permit us to track the phylogenetic ancestry of the eyes. Likewise, Aboitiz and García (1997) point to a function (working memory) that is present in both the human and the monkey and participates in language processing (Aboitiz et al., in press; Smith & Jonides 1998). On the other hand, although hand coordination networks are present in both species, at this point there is no evidence for the involvement of the hand control system in human linguistic processing.
ACKNOWLEDGMENTS
Part of the work referred to in this commentary was financed by the Millennium Center for Integrative Neuroscience. C. Bosman is supported by MECESUP PUC0005.
Action planning supplements mirror systems in language evolution
Bruce Bridgeman
Department of Psychology, Social Sciences 2, University of California, Santa Cruz, Santa Cruz, CA 95064. bruceb@ucsc.edu http://psych.ucsc.edu/Faculty/bBridge.shtml
Abstract: Mirror systems must be supplemented by a planning capability to allow language to evolve. A capability for creating, storing, and executing plans for sequences of actions, having evolved in primates, was applied to sequences of communicatory acts. Language could exploit this already-existing capability. Further steps in language evolution may parallel steps seen in the development of modern children.
Because the functional basis for language capability lies in the brain, it is sensible to look to brain evolution for insight into the evolution of language. Though the recently discovered mirror system in primates offers possibilities for the evolution of capabilities necessary for language, it is not enough to do the whole job. Indeed, the well-developed mirror system of monkeys in the absence of language shows that something more is needed, as Arbib points out. In emphasizing the mirror neuron system, a here-and-now system, Arbib makes a convincing case that mirror neurons are important in language evolution. A second need is for hierarchical structure rather than mere sequencing (target article, sect. 7, para. 13). This commentary will elaborate on that need and how it is met.
A key power of language is the use of sequences of symbols in a grammatical system. For the ability to handle sequences, evolution of primate planning mechanisms is essential. Complementary to the mirror-neuron evolution story is the increasing ability of primates to plan sequences of actions, for instance in preparing and using tools. Actions must be planned in the brain before the sequence starts, and must be executed in a particular order to achieve success. The organization is hierarchical, with smaller tasks embedded in larger ones. The lateral prefrontal cortex is probably the location of the machinery that produces, stores, and executes such plans. As planning abilities improved over the course of primate evolution, the planning of sequences of actions loomed ever greater in importance.
In this conception, a critical event in the evolution of language was the use of this growing capability for planning to generate not sequences of actions, but sequences of words (Bridgeman 1992). This idea addresses two of the central puzzles of language evolution – first, how such a complex capability could evolve in such a short time, and second, how it could evolve in small steps, each useful immediately. The solution to the first problem is that language is a new capability made mostly of old neurological parts, among them the mirror system and the planning capability.
To examine the second problem, the small steps, we can look to human development for hints about how the evolution of language may have proceeded, to the genetic remnants of earlier adaptations that remain in modern humans. The importance of gesture is clear from ontogeny as well as neurology, as most infants achieve a well-developed gestural communication before the first word. The gestures, although they eclipse the stereotyped call systems of other animals, remain single communications fixed in the here-and-now. The first words occur in combination with gesture and context to create useful communications with minimal verbal content.
Arbib’s suggestion (sect. 2, para. 2) – that single utterances of Homo erectus and early Homo sapiens could convey complex meanings that modern languages achieve only with longish phrases – is unlikely to be accurate. Arbib’s comparison to monkey calls demonstrates this; most of them can be paraphrased in one or two words; “leopard,” “I’m angry,” and so on. Similarly, an infant’s first words are at the monkey-call level of generalization, not the whole sentence in a word that Arbib imagines. Arbib’s suggestion would require that super-words and the capacity to develop and use them evolved, then disappeared again in favor of the more specific words that characterize all existing languages. All this would have had to occur before speaking hominids gave rise to the present population, because the generality of words is about
guage-like system (i.e., protosign). Diachronic linguistic analyses have traced grammaticalization pathways in American Sign Language (ASL) that originate with gesture (Janzen & Shaffer 2002). For example, grammatical markers of modality in ASL (e.g., “can,”
the same in all languages and therefore probably constitutes a “must”) are derived from lexical signs (“strong,” “owe”), and these
“universal” of language, that is, a species-specific and possibly a lexical signs are in turn derived from nonlinguistic communicative
part of our biological language equipment. gestures (clenching the fists and flexing muscles to indicate
One-word phrases address one of the paradoxes of language strength and a deictic pointing gesture indicating monetary debt).
evolution: in order to create a selective pressure for evolution of Investigations of newly emerging signed languages are also un-
better capability in using grammar, there must be a preexisting, covering patterns of conventionalization and grammaticalization
culturally defined lexicon with which the grammar can be built. that originate in pantomimic and communicative gestures (e.g.,
Many of the words used in modern languages could appear in this Kegl et al. 1999). Of course, these are modern sign languages ac-
way, but others, especially modifiers such as tense markers, can- quired and created by modern human brains, but the evidence in-
not. At this stage, words name things. The thing can be an object dicates that communicative gestures can evolve into language.
(later, noun), an action (verb), or a property (adjective/adverb). Arbib reasonably proposes that the transition from gesture to
Again, the paradox is the same: that such modifiers would have to speech was not abrupt, and he suggests that protosign and proto-
exist already before a complex grammar could develop. speech developed in an expanding spiral until protospeech be-
How could the sorts of words that cannot be used alone get came dominant for most people. However, there is no evidence
invented? Again we have evidence from the development of lan- that protosign ever became dominant for any subset of people –
guage in children. True, a child’s first words are single “holo- except for those born deaf. The only modern communities in
phrase” utterances, often comprehensible only in a context. But which a signed language is dominant have deaf members for
next comes a two-word slot grammar, the same all over the world whom a spoken language cannot be acquired naturally. No known
regardless of the structure of the parent language. This suggests a community of hearing people (without deaf members) uses a
biologically prepared mechanism (reviewed in Bridgeman 2003, signed language as the primary language. Hence, a community of
Ch. 7). Culturally, a large lexicon could develop at this stage, more deaf people appears to be a prerequisite for the emergence and
complex than one-word phrases could support, making possible maintenance of a sign language. Although it is possible that a sign
and useful the further development of grammar. language (and its deaf community) has existed for 6,000 years (the
Though the slot grammar of toddlers is different from that of divergence date for Indo-European spoken languages), the earli-
the child’s eventual language, it has several properties that make est known sign language can be tentatively traced back only 500
it useful for developing structure in a lexicon. Single-word utter- years to the use of Turkish Sign Language at the Ottoman court
ances need not differentiate parts of speech, since there is no (Zeshan 2003).
grammar. Words such as “sour” and “fruit” would be parallel – de- The fact that signed languages appear to be relatively new lan-
scriptions of some property of the world. Only when combined guages does not mean that they are somehow inferior to spoken
with another word must they be differentiated. Most of the utter- languages. Signed languages are just as complex, just as efficient,
ances of the slot grammar consist of a noun and a modifier, either and just as useful as spoken languages. Signed languages easily ex-
an adjective or a verb, that qualifies the context of the noun. press abstract concepts, are acquired similarly by children, and are
A “language” such as this is severely limited. We can imagine processed by the same neural systems within the left hemisphere
some group of Homo erectus sitting around their fire after a hard (see Emmorey 2002 for review). Thus, in principle, there is no lin-
day of hunting and gathering. Someone announces, “Lake cold.” guistic reason why the expanding spiral between protosign and
Another replies, “Fishing good.” The results seem almost comical protospeech could not have resulted in the evolutionary domi-
to us, but such terms would be tremendously more useful than no nance of sign over speech. A gestural-origins theory must explain
language at all, because they allow the huge advantage that hu- why speech evolved at all, particularly when choking to death is a
mans have over other living primates – to allow the experience of potential by-product of speech evolution due to the repositioning
one individual to increase the knowledge of another. Once this of the larynx.
level of communication is achieved, the selective pressure would Corballis (2002) presents several specific hypotheses why
be tremendous to develop all the power and subtlety of modern speech might have won out over gesture, but none are satisfactory
language. (at least to my mind). Corballis suggests that speech may have an
advantage because more arbitrary symbols are used, but sign lan-
guages also consist of arbitrary symbols, and there is no evidence
that the iconicity of some signs limits expression or processing.
The problem of signing in the dark is another oft-cited disadvan-
Sign languages are problematic for a gestural tage for sign language. However, early signers/gesturers could
origins theory of language evolution sign in moonlight or firelight, and a tactile version of sign language
could even be used if it were pitch black (i.e., gestures/signs are
Karen Emmorey felt). Furthermore, speech has the disadvantage of attracting
Laboratory for Cognitive Neuroscience, The Salk Institute for Biological predators with sound at night or alerting prey during a hunt.
Studies, La Jolla, CA 92037. emmorey@salk.edu
Corballis argues that speech would allow for communication
http://www-psy.ucsd.edu:80/~kemmorey
simultaneously with manual activities, such as tool construction or
demonstration. However, signers routinely sign with one hand,
Abstract: Sign languages exhibit all the complexities and evolutionary ad-
vantages of spoken languages. Consequently, sign languages are problem-
while the other hand holds or manipulates an object (e.g., turning
atic for a theory of language evolution that assumes a gestural origin. There the steering wheel while driving and signing to a passenger). It is
are no compelling arguments why the expanding spiral between protosign true that operation of a tool that requires two hands would neces-
and protospeech proposed by Arbib would not have resulted in the evolu- sitate serial manual activity, interspersing gesturing with object
tionary dominance of sign over speech. manipulation. But no deaths have occurred from serial manual ac-
tivity, unlike the deaths that occur as a result of choking.
At first glance, the existence of modern sign languages provides Everyone agrees that the emergence of language had clear and
support for Arbib’s hypothesis that there was an early stage in the compelling evolutionary advantages. Presumably, it was these ad-
evolution of language in which communication was predomi- vantages that outweighed the dangerous change in the vocal tract
nantly gestural. Modern sign languages offer insight into how pan- that allowed for human speech but increased the likelihood of
tomimic communication might have evolved into a more lan- choking. If communicative pantomime and protosign preceded
protospeech, it is not clear why protosign simply did not evolve
into sign language. The evolutionary advantage of language would
already be within the grasp of early humans.
Biological evolution of cognition and culture:
Off Arbib’s mirror-neuron system stage?
Horacio Fabrega, Jr.
Department of Psychiatry, University of Pittsburgh, Pittsburgh, PA 15213.
hfabregajr@adelphia.net
Abstract: Arbib offers a comprehensive, elegant formulation of brain/language
evolution, with significant implications for social as well as biological
sciences. Important psychological antecedents and later correlates are
presupposed; their conceptual enrichment through protosign and protospeech
is abbreviated in favor of practical communication. What culture
“is” and whether protosign and protospeech involve a protoculture are not
considered. Arbib also avoids dealing with the question of evolution of
mind, consciousness, and self.
Is the mirror-neuron system (MNS) purely for grasping a basis for
or a consequence of social communication and organization? Arbib
suggests that even monkey MNS (involving praxis and vocalization)
contains the seeds of or serves “real” communication functions,
as does simple imitation (but how?), with respect to social
cooperation and ecological problem-solving. Such functions are
easier to visualize for emotional and facial gestures than for grasping
per se (on which he places most emphasis).
Arbib’s formulation of what a pantomime sequence might communicate
presupposes enormous cognitive capacities. Much of social
cognition, conscious awareness of self and situation, and goal-setting
appear already resonant in the brain before pantomime.
Some have attributed self-consciousness and the “aboutness relationship”
to language (Macphail 2000), but Arbib posits that the
reverse occurs.
In Arbib’s Table 1, cognitive functions (LR5) are said to precede
all of language readiness: This involves a primate being able to
take in, decompose, and order a complex perceptual scene as per
an action. Yet how this capacity blends into LR1–LR4 is covered
mainly in brain terms, with natural selection (behavioral) factors
minimized. It is unclear to what extent the idea that much of cognition
precedes and gets recruited into language readiness differs
from formulations of others who cover related topics and whose
work is not discussed in detail, such as Deacon (1997), Greenfield
(1991), Jackendoff’s (1983) and Wilkins and Wakefield’s (1995)
conceptual structure, the latter’s POT (parieto-occipito-temporo
junction), MacNeilage’s (1998) syllabification, and metacommunication
and autonoesis (Suddendorf 1999; Suddendorf & Corballis
1997; Wheeler et al. 1997).
The biological line between LR and L (language) is left open:
How much of the protosign/protospeech spiral is enough? Arbib
promotes a slow, gradual evolution of LR and L as per communication
but handles these purely in analytical terms, as arbitrarily
discontinuous. Despite much work on human speciation events
(Crow 2002b), Arbib seems against it. He is vague on “what of”
and “how much of” spiraling establishes speciation, the identity of
Homo sapiens. Is the latter “merely” a cultural event? Arbib suggests
that a member of Homo erectus has the capacity to mentally
use and associate symbols that are arbitrary and abstract (i.e.,
showing considerable culture and cognition) yet is able to produce
only simple, unitary utterances (showing comparatively little language).
This renders ambiguous exactly what marks speciation:
Does it come “only” when full language is invented or does it require
more cognition and culture (and how much more?) made
possible by invention of language as we know it? Behavioral implications
of the cognitive/brain jump between simple imitation
and complex imitation are also not clearly spelled out. Expressing
relationships (compositionally) is said to come later, yet is analogous
capacity not perceptually, cognitively inherent even before
complex imitation? Exactly how LR differs from simple imitation
behaviorally and in terms of the brain is not clear.
Arbib relies on Tomasello’s (1999a) idea about the biological capacity
for intentional communication and social learning/culture,
all inherent in Homo sapiens (i.e., with language), yet includes intended
communication as part of protolanguage and hence “prehuman.”
Shared attributes of awareness of self as agent/sender
and conspecific as receiver, adding parity and symbolization to this
amalgam, are implied. Does this mean that Homo sapiens’ language-ready
brain already enabled self-awareness, self/other differentiation,
and social cognition well before its members could
actually “do” language? Arbib suggests that culture involves late
happenings (Pfeiffer 1982). Does not something like protoculture
(Hallowell 1960) accompany LR as the “spiral” begins and gets under
way? Arbib also suggests that much of the protosign/protospeech
had a learned (cultural?) basis. Behavioral and cognitive
implications of Arbib’s Language Readiness construct (LR) are
abstract and unclear. It (LR) appears to incorporate ordinary ecologic,
executive cognition as well as social cognition, and it is unclear
how language fits in these. Much of social and executive
cognition is collapsed into, seems entailed by, LR schema. What
exactly language adds to aspects of self-awareness/consciousness
and social cognition is not clear. Fundamental questions of the relation
between language and thought are simply bypassed (Carruthers
& Boucher 1998).
It is a challenging and very controversial idea that language is a
purely cultural achievement – that it was invented and then perfected
and remained as a cultural innovation ready for an infant
born into a language/culture community to just learn naturally. It
is difficult to understand how the articulatory, phonological equipment
for language evolved entirely during the pre–Homo sapiens
LR phase; complexities of speech production seem in excess of
what the protosign/protospeech spiraling entails, unless one includes
more of language within LR.
Arbib’s discussion of LA3 in section 2.2 is stunning: if one removed
syntax (how much of it? Arbib mentions only time ordering
and numbering system) one would still have language rather
than a protolanguage (Bickerton 1995). Arbib does elaborate on
this as per time travel but relates it to a whole array of brain/cognition
features that support LR6. Is time travel inherent in protolanguage
but only used through language? He also suggests that
language involves the capacity to exploit these cognitive structures
for communication purposes, suggesting that emergence/design
of cognitive structures did not have a communicative basis. Did
protosign/protospeech spiraling merely have communicational
basic functions? This relates to complex language/thought questions
which Arbib bypasses.
Why MNS may not have involved vocality along with praxic/
gestural features from the start without necessitating a detour of
gesture alone is not clear. What brain/genetic conditions were
“not in place” that precluded the use of vocality along with manual
gesture and that only later made it possible? Arbib’s two answers
to this conundrum are not entirely persuasive.
The conventionalized gestures used to disambiguate pantomime
constitute a major transition into protosign, involving a
dissociation between the mirror production system and the recognition
system, but this is dealt with by (merely) introducing the hypothesis
of intended communication, bypassing problems discussed
earlier.
Epilogue: The mirror system as a framework for the evolution
of culture. Intellectual quandaries hover over the evolutionary,
brain, and social sciences: the nature of consciousness, self-consciousness,
psychological experience, cultural knowledge, and
selfhood. To understand all of these in terms of brain function, and
to bring into this intellectual theater their human evolutionary basis,
makes for a very beclouded stage. Many researchers have
glided over such questions (Damasio 1987; D’Andrade 1999; Ingold
1996; Wierzbicka 1992; 1993). Some have addressed them in
piecemeal fashion (Barkow 1987; Bickerton 1995; Geertz 1973;
Noble & Davidson 1996). Mind, consciousness, and, especially,
capacity for and realization of culture constitute, at least in part,
neuroanatomical and neurophysiological phenomena. As the hominid
brain evolved, episodic and, especially, semantic memory
contained material that was fed into a working memory bin or
supervisory system providing a basis for experience and (autobiographical?)
selfhood (Baddeley 1986; Fuster 2002; Tononi &
Edelman 1998; Shallice 1988).
When evolutionary scientists address such topics, they focus on
concrete, expedient, raw, or “brutish” fitness imperatives, involving
such things as hunting, foraging, mating, or parenting (Wray
1998; 2000), leaving out cultural, symbolic, ritual complexities
(Fabrega 1997; 2002; 2004; Knight 1991). Arbib has managed to
touch on all of these matters implicitly and tangentially, but for the
most part leaves them off his MNS stage.
Beginning with the language-readiness phases wherein intended
communication is explicitly manifest, particularly during
the shift from imitation to (conscious use of) protosign, then to
protospeech, and finally to language, Arbib insinuates (and once
mentions) culture/community and implies a sense of shared social
life and social history. If there is a shared body of knowledge about
what pantomimes are for and what they mean, what disambiguating
gestures are for and mean, and what speech sounds are for and
mean, then there exists an obvious meaning-filled thought-world
or context “carried in the mind” that encompasses self-awareness,
other-awareness, need for cooperation, capacity for perspective-taking
– and, presumably, a shared framework of what existence,
subsistence, mating, parenting, helping, competing, and the like
entail and what they mean. All of this implies that evolution of culture
“happened” or originated during phases of biological evolution
as LR capacities came into prominence (Foley 2001). No one denies
that “culture” was evident at 40,000 b.c.e., yet virtually no one
ventures to consider “culture” prior to this “explosion.” Arbib implies,
along with Wray (1998; 2000), that the context of language evolution
was dominated by purely practical, expedient considerations
(e.g., getting things done, preserving social stability, greetings, requests,
threats). Boyer (1994) and Atran and Norenzayan (2004)
imply that as a human form of cognition “coalesced,” so did a significant
component of culture (Fabrega 1997; 2002; 2004). Arbib’s
formulation suggests culture “got started” well before this, perhaps,
as he implies, with Homo habilis and certainly Homo erectus.
Protomusic and protolanguage as
alternatives to protosign
W. Tecumseh Fitch
School of Psychology, University of St. Andrews, Fife KY16 9JP, Scotland.
wtsf@st-andrews.ac.uk
Abstract: Explaining the transition from a signed to a spoken protolanguage
is a major problem for all gestural theories. I suggest that Arbib’s
improved “beyond the mirror” hypothesis still leaves this core problem unsolved,
and that Darwin’s model of musical protolanguage provides a more
compelling solution. Second, although I support Arbib’s analytic theory of
language origin, his claim that this transition is purely cultural seems unlikely,
given its early, robust development in children.
Arbib’s wide-ranging paper commendably weaves together multiple
threads from neuroscience, linguistics, and ethology, providing
an explicit, plausible model for language phylogeny, starting
with our common ancestor with other primates and ending with
modern language-ready Homo sapiens. He takes seriously the
comparative data accrued over the last 40 years of primatology,
rightly rejecting any simple transition from “monkey calls to language,”
and provides an excellent integrative overview of an important
body of neuroscientific data on grasping and vision and
their interaction. I agree with Arbib’s suggestion that some type of
“protolanguage” is a necessary stage in language evolution, and
that the term should not be limited to any particular model of protolanguage
(e.g., Bickerton’s [1995] model). However, I suggest that
the relevance of monkey mirror neurons to gestural theories of language
evolution has been overstated, and I will focus on weaknesses
Arbib’s model faces in explaining two key transitions: protosign to
protospeech, and holistic protolanguage to syntactic language.
The chain of a logical argument is only as strong as its weakest
link. The weak link in Arbib’s model is the crucial leap from protosign
to protospeech, specifically his elision between two distinct
forms of imitation: vocal and manual. Comparative data suggest
that these two are by no means inevitably linked. Although dolphins
are accomplished at both whole-body and vocal imitation
(Reiss & McCowan 1993; Richards et al. 1984), and parrots can imitate
movements (Moore 1992), evidence for non-vocal imitation
in the largest group of vocal imitators, the songbirds, is tenuous at
best (Weydemeyer 1930). Apes exhibit the opposite dissociation
between some manual proto-imitation with virtually no vocal imitation.
There is therefore little reason to assume that the evolution
of manual imitation and protosign would inevitably “scaffold” vocal
imitation. Realizing this, Arbib offers a neuroanatomical justification
for this crucial link, suggesting that the hypertrophied
manual mirror system supporting protosign “colonized” the neighboring
vocal areas of F5 by a process of “collateralization.”
However, the key difference between human and other primate
brains is not limited to local circuitry in area F5 but includes long-distance
corticomotor connections from (pre)motor cortex to auditory
motor neurons in the brainstem, which exist in humans but
not other primates (Jürgens 1998). These probably represent a
crucial neural step in gaining the voluntary control over vocalization
differentiating humans from monkeys and apes. “Collateralization”
is not enough to create such corticomotor connections.
Indeed, given competition for cortical real estate in the developing
brain, it would seem, if anything, to make their survival less
likely. Thus, like other versions of gestural origins hypotheses, Arbib’s
model fails to adequately explain how a “protosign” system
can truly scaffold the ability for vocal learning that spoken language
rests upon. Are there alternatives?
Darwin suggested that our prelinguistic ancestors possessed an
intermediate “protolanguage” that was more musical than linguistic
(Darwin 1871). Combining Darwin’s idea with the “holistic
protolanguage” arguments given by Arbib and others (Wray 2002a),
and the “mimetic stage” hypothesized by Donald (1993), gives a
rather different perspective on the co-evolution of vocal and manual
gesture, tied more closely to music and dance than pantomime
and linguistic communication. By this hypothesis, the crucial first
step in human evolution from our last common ancestor with chimpanzees
was the development of vocal imitation, similar in form and
function to that independently evolved in many other vertebrate
lineages (including cetaceans, pinnipeds, and multiple avian lineages).
This augmented the already-present movement display behaviour
seen in modern chimpanzees and gorillas to form a novel,
learned, and open-ended multimodal display system. This hypothetical
musical protolanguage preceded any truly linguistic system
capable of communicating particulate, propositional meanings.
This hypothesis is equally able to explain the existence of sign
(via the dance/music linkage), makes equal use of the continuity
between ape and human gesture, and can inherit all of Arbib’s “expanding
spiral” arguments. But it replaces the weakest link in Arbib’s
logical chain (the scaffolding of vocal by manual imitation)
with a step that appears to evolve rather easily: the early evolution
of a vocally imitating “singing ape” (where vocal learning functions
in enhancement of multimodal displays). It renders understandable
why all modern human cultures choose speech over sign as
the linguistic medium, if this sensory-motor channel is available.
It also explains, “for free,” the evolution of two nonlinguistic human
universals, dance and music, as “living fossils” of an earlier
stage of human communicative behaviour. We need posit no hypothetical
or marginal protolanguage: evidence of a human-specific
music/dance communication system is as abundant as one
could desire. There are abundant testable empirical predictions
that would allow us to discriminate between this and Arbib’s hypotheses;
the key desideratum is a better understanding of the
neural basis of human vocal imitation (now sorely lacking).
The second stage I find problematic in Arbib’s model is his explanation
of the move from holistic protolinguistic utterances to analytic
(fully linguistic) sentences. I agree that analytic models (which start
with undecomposable wholes) are more plausible than synthetic
models (e.g., Bickerton 2003; Jackendoff 1999) from a comparative
viewpoint, because known complex animal signals map signal to
meaning holistically. Both analytic and synthetic theories must be
taken seriously, and their relative merits carefully examined. However,
the robust early development of the ontogenetic “analytic insight”
in modern human children renders implausible the suggestion
that its basis is purely cultural, on a par with chess or calculus.
No other animal (including especially language-trained chimpanzees
or parrots) appears able to make this analytic leap, which
is a crucial step to syntactic, lexicalized language. While dogs,
birds, and apes can learn to map between meanings and words
presented in isolation or in stereotyped sentence frames, the ability
to extract words from arbitrary, complex contexts and to recombine
them in equally complex, novel contexts is unattested in
any nonhuman animal. In vivid contrast, each generation of human
children makes this “analytic leap” by the age of three, without
tutelage, feedback, or specific scaffolding. This is in striking
contrast to children’s acquisition of other cultural innovations such
as alphabetic writing, which occurred just once in human history
and still poses significant problems for many children, even with
long and detailed tutelage.
Although the first behavioural stages in the transition from
holistic to analytic communication were probably Baldwinian
exaptations, they must have been strongly and consistently shaped
by selection since that time, given the communicative and conceptual
advantages that a compositional, lexicalized language offers.
The “geniuses” making this analytic insight were not adults,
but children, learning and (over)generalizing about language unanalyzed
by their adult caretakers, and this behaviour must have
been powerfully selected, and genetically canalized, in recent human
evolution. It therefore seems strange and implausible to
claim that the acquisition of the analytic ability had “little if any
impact on the human genome” (target article, sect. 2.3).
In conclusion, by offering an explicit phylogenetic hypothesis,
detailing each hypothetical protolinguistic stage and its mechanistic
underpinnings, and allowing few assumptions about these
stages to go unexamined, Arbib does a service to the field, goes beyond
previous models, and raises the bar for all future theories of
language phylogeny. However, further progress in our understanding
of language evolution demands parallel consideration of
multiple plausible hypotheses, and finding empirical data to test
between them, on the model of physics or other natural sciences.
Arbib’s article is an important step in this direction.
Imitation systems, monkey vocalization, and
the human language
Emmanuel Gilissen
Royal Belgian Institute of Natural Sciences, Anthropology and Prehistory,
B-1000 Brussels, Belgium. Emmanuel.Gilissen@naturalsciences.be
http://www.naturalsciences.be
eas evolved atop a system that already existed in nonhuman primates.
As explained in the target article, crucial early stages of the
progression towards a language-ready brain are the mirror system
for grasping and its extension to permit imitation.
When comparing vocal-acoustic systems in vertebrates, neuroanatomical
and neurophysiological studies reveal that such systems
extend from forebrain to hindbrain levels and that many of
their organizational features are shared by distantly related vertebrate
taxa such as teleost fish, birds, and mammals (Bass & Baker
1997; Bass & McKibben 2003; Goodson & Bass 2002). Given this
fundamental homogeneity, how are documented evolutionary
stages comparable to imitation in vertebrate taxa? Vocal imitation
is a type of higher-level vocal behaviour that is, for instance, illustrated
by the songs of humpback whales (Payne & Payne 1985).
In this case, there is not only voluntary control over the imitation
process of a supposedly innate vocal pattern, but also a voluntary
control over the acoustic structure of the pattern.
This behaviour seems to go beyond “simple” imitation of “object-oriented”
sequences and resembles a more complex imitation
system. Although common in birds, this level of vocal behaviour is
only rarely found in mammals (Jürgens 2002). It “evolved atop”
preexisting systems, therefore paralleling emergence of language
in humans. It indeed seems that this vocalization-based communication
system is breaking through a fixed repertoire of vocalizations
to yield an open repertoire, something comparable to protosign
stage (S5). Following Arbib, S5 is the second of the three stages
that distinguish the hominid lineage from that of the great apes.
Although the specific aspect of S5 is to involve a manual-based
communication system, it is interesting to see how cetaceans offer
striking examples of convergence with the hominid lineage in
higher-level complex cognitive characteristics (Marino 2002).
The emergence of a manual-based communication system that
broke through a fixed repertoire of primate vocalizations seems to
owe little to nonhuman primate vocalizations. Speech is indeed a
learned motor pattern, and even if vocal communication systems
such as the ones of New World monkeys represent some of the
most sophisticated vocal systems found in nonhuman primates
(Snowdon 1989), monkey calls cannot be used as models for
speech production because they are genetically determined in
their acoustic structure. As a consequence, a number of brain
structures crucial for the production of learned motor patterns
such as speech production are dispensable for the production of
monkey calls (Jürgens 1998).
There is, however, one aspect of human vocal behavior that does
resemble monkey calls in that it also bears a strong genetic component.
This aspect involves emotional intonations that are superimposed
on the verbal component. Monkey calls can therefore be
considered as an interesting model for investigating the central
mechanisms underlying emotional vocal expression (Jürgens 1998).
In recent studies, Falk (2004a; 2004b) hypothesizes that as human
infants develop, a special form of infant-directed speech
known as baby talk or motherese universally provides a scaffold
for their eventual acquisition of language. Human babies cry in order
to re-establish physical contact with caregivers, and human
mothers engage in motherese that functions to soothe, calm, and
reassure infants. These special vocalizations are in marked contrast
to the relatively silent mother/infant interactions that characterize
living chimpanzees (and presumably their ancestors).
Motherese is therefore hypothesized to have evolved in early hominin
mother/infant pairs, and to have formed an important
prelinguistic substrate from which protolanguage eventually
emerged. Although we cannot demonstrate whether there is a link
Abstract: In offering a detailed view of putative steps towards the emer-
gence of language from a cognitive standpoint, Michael Arbib is also in-
between monkey calls and motherese, it appears that the neural
troducing an evolutionary framework that can be used as a useful tool to substrate for emotional coding, prosody, and intonation, and
confront other viewpoints on language evolution, including hypotheses hence for essential aspects of motherese content, is largely pre-
that emphasize possible alternatives to suggestions that language could not sent in nonhuman primate phonation circuitry (Ploog 1988; Sut-
have emerged from an earlier primate vocal communication system. ton & Jürgens 1988). In a related view, Deacon (1989) suggested
that the vocalization circuits that play a central role in nonhuman
An essential aspect of the evolutionary framework presented by primate vocalization became integrated into the more distributed
Michael Arbib is that the system of language-related cortical ar- human language circuits.
Although the view of Falk puts language emergence in a con-
tinuum that is closer to primate vocal communication than the
framework of Michael Arbib, both models involve a progression
atop the systems already preexisting in nonhuman primates. Ar-
bib’s work gives the first detailed account of putative evolutionary
stages in the emergence of human language from a cognitive view-
point. It therefore could be used as a framework to test specific
links between cognitive human language and communicative hu-
man language emergence hypotheses, such as the one recently
proposed by Falk.

Auditory object processing and primate
biological evolution
Barry Horwitz,^a Fatima T. Husain,^a and Frank H. Guenther^b
^aBrain Imaging and Modeling Section, National Institute on Deafness and
Other Communications Disorders, National Institutes of Health, Bethesda,
MD 20892; ^bDepartment of Cognitive and Neural Systems, Boston University,
Boston, MA 02215. horwitz@helix.nih.gov husainf@nidcd.nih.gov
http://www.nidcd.nih.gov/research/scientists/horwitzb.asp
guenther@cns.bu.edu http://www.cns.bu.edu/~guenther/

Abstract: This commentary focuses on the importance of auditory object
processing for producing and comprehending human language, the rela-
tive lack of development of this capability in nonhuman primates, and the
consequent need for hominid neurobiological evolution to enhance this
capability in making the transition from protosign to protospeech to lan-
guage.

The target article by Arbib provides a cogent but highly speculative
proposal concerning the crucial steps in recent primate evolution
that led to the development of human language. Generally, much
of what Arbib proposes concerning the transition from the mirror
neuron system to protosign seems plausible, and he makes numer-
ous points that are important when thinking about language evolu-
tion. We especially applaud his use of neural modeling to imple-
ment specific hypotheses about the neural mechanisms mediating
the mirror neuron system. We also think his discussion in section 6
of the necessity to use protosign as scaffolding upon which to
ground symbolic auditory gestures in protospeech is a significant in-
sight. However, the relatively brief attention Arbib devotes to the
perception side of language, and specifically to the auditory aspects
of this perception, seems to us to be a critical oversight. The explicit
assumption that protosign developed before protospeech, rein-
forced by the existence of sign language as a fully developed lan-
guage, allows Arbib (and others) to ignore some of the crucial fea-
tures that both the productive and receptive aspects of speech
require in terms of a newly evolved neurobiological architecture.
One aspect of auditory processing that merits attention, but is
not examined by Arbib, has to do with auditory object processing.
By auditory object, we mean a delimited acoustic pattern that is
subject to figure-ground separation (Kubovy & Van Valkenburg
2001). Humans are interested in a huge number of such objects (in
the form of words, melodic fragments, important environmental
sounds), perhaps numbering on the order of 10^5 in an individual.
However, it is difficult to train monkeys on auditory object tasks,
and the number of auditory objects that interest them, compared
to visual objects, seems small, numbering perhaps in the hundreds
(e.g., some species-specific calls, some important environmental
sounds). For example, Mishkin and collaborators (Fritz et al. 1999;
Saunders et al. 1998) have shown that monkeys with lesions in the
medial temporal lobe (i.e., entorhinal and perirhinal cortex) are im-
paired relative to unlesioned monkeys in their ability to perform
correctly a visual delayed match-to-sample task when the delay pe-
riod is long, whereas both lesioned and unlesioned monkeys are
equally unable to perform such a task using auditory stimuli.
These results implicate differences in monkeys between vision
and audition in the use of long-term memory for objects. Our view
is that a significant change occurred in biological evolution allowing
hominids to develop the ability to discriminate auditory objects, to
categorize them, to retain them in long-term memory, to manipu-
late them in working memory, and to relate them to articulatory ges-
tures. It is only the last of these features that Arbib discusses. In our
view, the neural basis of auditory object processing will prove to be
central to understanding human language evolution. We have be-
gun a systematic approach combining neural modeling with neuro-
physiological and functional brain imaging data to explore the
neural substrates for this type of processing (Husain et al. 2004).
Concerning language production, Arbib’s model of the mirror-
neuron system (MNS) may require considerable modification, es-
pecially when the focus shifts to the auditory modality. For in-
stance, there is no treatment of babbling, which occurs in the
development of both spoken and sign languages (Petitto & Mar-
entette 1991). Underscoring the importance of auditory processing
in human evolution, hearing-impaired infants exhibit vocal bab-
bling that declines with time (Stoel-Gammon & Otomo 1986).
However, there has been work in developing biologically plau-
sible models of speech acquisition and production. In one such
model (Guenther 1995), a role for the MNS in learning motor
commands for producing speech sounds has been posited. Prior
to developing the ability to generate speech sounds, an infant must
learn what sounds to produce by processing sound examples from
the native language. That is, he or she must learn an auditory tar-
get for each native language sound. This occurs in the model via a
MNS involving speech sound-map cells hypothesized to corre-
spond to mirror neurons (Guenther & Ghosh 2003). Only after
learning this auditory target can the model learn the appropriate
motor commands for producing the sound via a combination of
feedback and feed-forward control subsystems. After the com-
mands are learned, the same speech sound-map cell can be acti-
vated to read out the motor commands for producing the sound.
In this way, mirror neurons in the model play an important role in
both the acquisition of speaking skills and in subsequent speech
production in the tuned system. This role of mirror neurons in de-
velopment of new motor skills differs from Arbib’s MNS model,
which “makes the crucial assumption that the grasps that the mir-
ror system comes to recognize are already in the (monkey or hu-
man) infant’s repertoire” (sect. 3.2, para. 7).
Our efforts to comprehend the biological basis of language evo-
lution will, by necessity, depend on understanding the neural sub-
strates for human language processing, which in turn will rely
heavily on comparative analyses with nonhuman primate neu-
robiology. All these points are found in Arbib’s target article. A
crucial aspect, which Arbib invokes, is the necessary reliance on
neurobiologically realistic neural modeling to generate actual im-
plementations of neurally based hypotheses that can be tested by
comparing simulated data to human and nonhuman primate ex-
perimental data (Horwitz 2005). It seems to us that the fact that
humans use audition as the primary medium for language expres-
sion means that auditory neurobiology is a crucial component that
must be incorporated into hypotheses about how we must go be-
yond the mirror-neuron system.

On the neural grounding for metaphor and
projection
Bipin Indurkhya
International Institute of Information Technology, Hyderabad 500 019, India.
bipin@iiit.net

Abstract: Focusing on the mirror system and imitation, I examine the role
of metaphor and projection in evolutionary neurolinguistics. I suggest that
the key to language evolution in hominid might be an ability to project one’s
thoughts and feelings onto another agent or object, to see and feel things
from another perspective, and to be able to empathize with another agent.
With regard to the evolutionary framework for neurolinguistics
spelled out in Arbib’s article, I would like to focus on the role of
metaphor and projection therein. In particular, I am interested in
the implications of Arbib’s framework for the thesis “all knowledge
(or language) is metaphorical.” It should be clarified from the out-
set that this thesis is sometimes misconstrued to suggest that the
literal or conventional does not exist – a suggestion that is trivially
refuted. However, the sense in which I take it here is based on a
well-known phenomenon that a novel metaphor sometimes be-
comes conventional through repeated use, and may even turn into
polysemy; and the claim is that all that is conventional and literal
now must have been metaphorical once (Indurkhya 1994). Fur-
thermore, I take the viewpoint that the key mechanism underly-
ing metaphor, especially creative metaphor, is that of projection,
which carves out a new ontology for the target of the metaphor
(Indurkhya 1992; 1998). This mechanism can be best explained as
projecting a structure onto a stimulus, as in gestalt interaction, and
is to be contrasted with the mapping-based approaches to
metaphor, which require a pre-existing ontology for mapping. For
example, in the context of Arbib’s article, it is the projection mech-
anism that determines what constitutes objects and actions when
a monkey watches a raisin being grasped by another monkey or by
a pair of pliers.
There are two particular places in the evolutionary account ar-
ticulated by Arbib where a projection step is implicit, and I shall
zoom in on them in turn to raise some open issues. The first of
these concerns the mirror neurons (sect. 3.2). Now, certain mir-
ror neurons are known to fire when a monkey observes another
monkey performing a particular grasping action but not when the
grasp is being performed with a tool. This suggests a predisposi-
tion towards the ontology of a biological effector. The interesting
question here is: How much variation can be introduced in the ef-
fector so that it is still acceptable to the mirror neuron. Does a ro-
bot arm trigger the mirror neuron? What about a hairy robot arm?
Similar remarks can be made with respect to the learning effect
in mirror neurons. When a monkey first sees a raisin being grasped
with a pair of pliers, then his mirror neurons do not fire. However,
after many such experiences, the monkey’s mirror neurons en-
coding precision grip start firing when he sees a raisin being
grasped with pliers. This shows a predisposition towards the on-
tology of the object raisin and the effect of grip on it, as it is not
the physical appearance of the effector but its effect on the object
that matters. Again we may ask how much variation is possible in
the object and the kind of grip before the mirror system fails to
learn. For example, after the mirror neurons learn to fire on see-
ing a raisin being grasped with pliers, do they also fire when tweez-
ers are used? Or, does the tweezers grasp have to be learned all
over again?
These issues become more prominent when we consider imita-
tion (sect. 4). In the literature, a wide range of animal behaviors
are classified as imitation (Caldwell & Whiten 2002; Zentall &
Akins 2001), and true imitation is distinguished from imprinting,
stimulus enhancement, emulation learning, and so on. However,
even in imitating a single action, one has to decide what aspect of
the situation to imitate, as any situation has many possible aspects;
and how to imitate, as the imitating agent has to interpret the sit-
uation from its point of view – it may not have the same effectors,
access to the same objects, and so on – and project the observed
action into its own action repertoire (Alissandrakis et al. 2002;
Hofstadter 1995). In this respect, studies on the behavior of ani-
mals that imitate a non-conspecific model, such as bottlenose dol-
phins or parrots imitating a human model (or a bottlenose dolphin
imitating a parrot?) are most illuminating. (See, e.g., Bauer &
Johnson 1994; Kuczaj et al. 1998; Moore 1992.) In Arbib’s frame-
work, a distinction is made between simple and complex imitation
to explain where humans diverge from monkeys, and a projection-
like mechanism is posited for complex imitation (sect. 2.1: LR1;
also sect. 5). But I would like to suggest that even simple imitation
could invoke projection, and the crux of the distinction between
humans and other animals might lie in the ability to interpret a
wider variety of actions and situations, and to project oneself into
those situations to imitate them in a number of ways.
Empathy – being able to put oneself into another’s shoes and to
project one’s thoughts and feelings into another person, animal, or
object – is often considered a hallmark of being human. Indeed,
one of the ideals of robotics research is to emulate this essentially
human trait in robots. (See, e.g., Breazeal et al. 2005; Kozima et
al. 2003. This is also the theme of the classic Philip K. Dick story
“Do Androids Dream of Electric Sheep?” upon which the popu-
lar film Blade Runner was based.) A glimpse of the key role played
by empathy in human cognition is provided by a study by Holstein
(1970), in which children were given projection tasks such as be-
ing asked to imagine being a doorknob or a rock, and to describe
one’s thoughts and feelings in order to stimulate their creativity.
In a very recent study, it was found that when participants hid one
of their hands and a rubber hand was placed in front of them to
make it look like their own hand, it took them only 11 seconds to
project their feelings onto the rubber hand as if it were their own,
down to the neural level: when the rubber hand was stroked by a
brush, the somatosensory area in the participants’ brain corre-
sponding to their hand was stimulated (Ehrsson et al. 2004). One
wonders if monkeys and other animals are capable of projecting
their selves into other animals or other objects to this degree, and
if the divergent point of hominid evolution might not be found
therein.

Listen to my actions!
Jonas T. Kaplan and Marco Iacoboni
UCLA Brain Mapping Center, David Geffen School of Medicine, University of
California at Los Angeles, Los Angeles, CA 90095. jonask@ucla.edu
iacoboni@loni.ucla.edu http://www.jonaskaplan.com

Abstract: We believe that an account of the role of mirror neurons in lan-
guage evolution should involve a greater emphasis on the auditory prop-
erties of these neurons. Mirror neurons in premotor cortex which respond
to the visual and auditory consequences of actions allow for a modality-in-
dependent and agent-independent coding of actions, which may have
been important for the emergence of language.

We agree with Arbib that the mirror property of some motor neu-
rons most probably played an important role in the evolution of
language. These neurons allow us to bridge the gap between two
minds, between perception and action. As strong evidence for the
role of mirror-like mechanisms in language, we have recently
demonstrated with functional magnetic resonance imaging (fMRI)
that a human cortical area encompassing primary motor and pre-
motor cortex involved in the production of phonemes is also ac-
tive during the perception of those same phonemes (Wilson et al.
2004). This suggests that motor areas are recruited in speech per-
ception in a process of auditory-to-articulatory transformation that
accesses a phonetic code with motor properties (Liberman et al.
1967).
However, we direct our commentary mostly at what Arbib calls
the transition from protosign to protospeech. In Arbib’s account,
a system of iconic manual gestures evolved from a mirror system
of action recognition, and then somehow transitioned to a vocal-
based language. Mention is made of the so-called audiovisual mir-
ror neurons, which respond to the sound of an action as well as
during the production of that action (Kohler et al. 2002). The role
of these neurons in the evolution of language deserves more at-
tention.
Arbib argues that arbitrary signs first evolved in gesture, which
was more amenable to iconic representation, and that this proto-
sign provided the “scaffolding” for vocal-based abstractions. We
suggest that rather than being added on later, the auditory re-
sponsiveness of premotor neurons may have played a more cen-
tral role in the development of abstract representations. The au-
diovisual property of these mirror neurons puts them in position to
form a special kind of abstraction. Many of the neurons respond
equally well to the sight of an action and to the sound associated with
an action (Keysers et al. 2003). This means that they are represent-
ing an action not only regardless of who performs it, but also re-
gardless of the modality through which it is perceived. The multi-
modality of this kind of representation may have been an important
step towards the use of the motor system in symbolic language. Per-
formed and observed actions can be associated with both sounds
and sights. This makes the motor cortex a prime candidate as a po-
tential locus for the development of multimodal (or amodal) repre-
sentations, which are so important to language.
Support for this view comes from an fMRI study we recently con-
ducted on audiovisual interactions in the perception of actions (Ka-
plan & Iacoboni, submitted). When subjects saw and heard an ac-
tion (i.e., tearing paper) simultaneously, there was greater activity in
the left ventral premotor cortex compared with control conditions
in which they only saw or only heard the action. This cross-modal
interaction did not happen with a non-action control stimulus (i.e.,
a square moving while a sound was played), suggesting that the pre-
motor cortex is sensitive to the conjunction of visual and auditory
representations of an action. Again, it may be this capacity for con-
junctive representations that led to true symbolic capability.
Further support for the role of the auditory responsiveness of
motor neurons in language evolution comes from transcranial
magnetic stimulation (TMS) studies on motor facilitation in the
two cerebral hemispheres in response to the sight or the sound of
an action. Motor activation to the sight of an action is typically bi-
lateral, albeit slightly larger in the left hemisphere in right-han-
ders (Aziz-Zadeh et al. 2002). Action sounds, in contrast, activate
the motor cortex only in the left hemisphere, the cerebral hemi-
sphere dominant for language (Aziz-Zadeh et al. 2004). Since
there is no evidence of lateralized auditory responses of mirror
neurons in the monkey, the lateralization for action sounds ob-
served in the TMS study and the lateralization of cross-modal in-
teractions in the ventral premotor cortex seem to be related to
evolutionary processes that made human brain functions such as
language lateralized to the left hemisphere.
A more central role of auditory properties of mirror neurons in
language evolution makes also the transition from manual ges-
tures to mouth-based communication (speech) easier to account
for. Recent fMRI data suggest that the human premotor cortex
seems able to map some kind of articulatory representation onto
almost any acoustic input (Schubotz & von Cramon 2003). A
multi-sensory representation of action provided by mirror neu-
rons responding also to action sounds may have more easily
evolved in articulatory representation of the sounds associated
with manual actions.
In summary, it may be the premotor cortex’s unique position of
having both cross-modal and cross-agent information that allowed
it to support language. The auditory properties of mirror neurons
may have been a facilitator rather than a by-product of language
evolution.

Pragmatics, prosody, and evolution:
Language is more than a symbolic system
Boris Kotchoubey
Institute of Medical Psychology and Behavioral Neurobiology, University of
Tübingen, 72074 Tübingen, Germany.
boris.kotchoubey@uni-tuebingen.de
http://www.uni-tuebingen.de/medizinischepsychologie/stuff/

Abstract: The model presented in the target article is biased towards a
cognitive-symbolic understanding of language, thus ignoring its other im-
portant aspects. Possible relationships of this cognitive-symbolic subsys-
tem to pragmatics and prosody of language are discussed in the first part
of the commentary. In the second part, the issue of a purely social versus
biological mechanisms for transition from protolanguage to properly lan-
guage is considered.

1. Arbib’s conception of language, summarised in LA1 to LA4,
is concentrated upon its cognitive components and the cognitive
abilities that both underlie and are based on verbal communica-
tion. Although semantics and syntax are the only components of
the language in highly intelligent speaking robots, human lan-
guages also include expressive components such as intonation and
gesticulation. Particularly, prosody subserves two important func-
tions of emotional expression (affective prosody) and of clarifica-
tion of the content’s meaning (linguistic prosody, such as distin-
guishing between an assertion and a question) (Bostanov &
Kotchoubey 2004; Seddoh 2002). Neuropsychological and neu-
roimaging data converge in demonstrating that both linguistic and
affective prosodic information is processed mainly in the right
temporal lobe (Ross 1981), in contrast to semantics and syntax,
which are processed in the left temporal lobe. Affective prosody
is strikingly similar in humans and other primates, so that human
subjects having no previous experience with monkeys correctly
identify the emotional content of their screams (Linnankoski et al.
1994).
It is therefore tempting to represent the system of language as
entailing two virtually additive subsystems. The left hemispheric
subsystem develops on the basis of the mirror system of apes in
an indirect way depicted in the target article, and subserves the
cognitive-symbolic function of language, its referential network,
and syntactic design. The right hemispheric subsystem, in con-
trast, is a direct successor of monkeys’ vocalisation mechanisms
and gives our language its intonational colour and expressive
power (Scherer 1986).
This view would ignore, however, the possibly most important
aspect of language: its pragmatics. Except for some scientific dis-
cussions, which did not play any important role before 2,500 years
ago (and even after this point their role should not be overesti-
mated), communication is directed to move somebody to do
something. Communication is only a means, whereas the goal is
co-operation.^1 The pragmatic function of language goes beyond
the mere referential semantics and mere expression of one’s own
state: It links together verbal and non-verbal, symbolic and non-
symbolic components of language because it relates us, over all
conventional symbols (words), to some, perhaps very remote, non-
conventional basis. Likewise, affective prosody is not symbolic
and conventional; it is a part of emotion itself. This pragmatic view
makes it very difficult to imagine a certain moment in the evolu-
tion of language when its left- and right-hemispheric components
met together; rather, they should have been together from the
very beginning.
Some recent neuropsychological data point in the same direc-
tion. Although the right temporal lobe is critical for recognition of
prosody (Adolphs et al. 2002), prosodic aspects of language are
also severely impaired in patients with lesions to orbitofrontal cor-
tex (Hornak et al. 2003) and the corpus callosum (Friederici et al.
2003). All this makes the simple additive model (i.e., the ancient
prosodic subsystem is simply added to the newly developed cog-
nitive subsystem) implausible. Rather, a theory is needed that
would describe the development of language in mutual interac-
tion of its different aspects.
2. Arbib suggests that the development of language from pro-
tolanguage was a social rather than biological process. The only
mechanism of such social progress he describes in section 7 is the
unexpected and unpredictable linguistic inventions made by nu-
merous but anonymous genii, those inventions being then seized
upon and employed by other people. I agree that no other social
mechanism can be thought of, because otherwise social systems
are usually conservative and favour hampering, rather than pro-
moting, development (e.g., Janis 1982). Surely, this putative pro-
cess of social inventions is familiar: somebody has a good idea, oth-
ers learn about it, after a period of resistance they become
accustomed to it and see its advantages, and soon the whole social
group uses it. However, the speed of this process critically de-
pends on such institutions as writing, hierarchical social organisa-
tion (the most powerful accelerator of social development; Cav-
alli-Sforza & Feldman 1981), and at least rudimentary mass
media. Churches and monasteries played an active role in dis-
semination of new notions and concepts in Europe as well as the
Far East.
Arbib argues that the development of modern languages such
as English required much less time than the time to pass over from
protolanguage to language. This analogy misses, however, the sim-
ple fact that modern languages did not start with a protolanguage.
Rather, their starting point was another highly developed lan-
guage. Italian needed only 800 years to reach its peak in The Di-
vine Comedy, but its precursor was Latin.
More generally, the problem can be formulated as follows: the
proposed theory postulates that the development of language was
not supported by natural selection. But the major social mecha-
nisms (e.g., the mechanisms of state, church, writing, social hier-
archies, and fast migration), which might be supposed to have re-
placed evolutionary mechanisms, did not exist when first
languages developed from their protolanguage ancestors. On the
other hand, social mechanisms which were present from the very
beginning (e.g., socialization in tribes and family education) are
known to be factors of conservation rather than development.
Due to these social processes I would expect that genial inventors
of words were ostracized rather than accepted. Hence, it remains
unclear how, if we retain Arbib’s example, the new notion “sour”
might ever have become known to anybody except the closest fel-
lows of its genial inventor. Therefore, any generalisation about the
development of the first human language(s) from what is known
about modern languages is problematic.
Given that the degrees of linguistic and genetic similarity be-
tween populations correlate (Cavalli-Sforza 1996), and that the
transition from protolanguage to language can have covered 1,500
to 2,000 generations, I do not understand why biological mecha-
nisms should be denied during the evolvement of the very first
(but not proto-) language. A possible argument could be the lack
of substantial biological progress between the early Homo sapiens,
having only a protolanguage, and modern people. But this argu-
ment would be misleading because it confounds evolution with
progress and power of different brains with their diversity. There
was not a big genetic progress since the appearance of Homo sapi-
ens, but the genetic changes took place.

ACKNOWLEDGMENT
This work was partially supported by a grant from the Fortune Founda-
tion, University of Tübingen Medical School.

NOTE
1. From the pragmatic point of view, a message always remains “here
and now.” For instance, I am going to discuss the transition from pro-
tolanguage to language, which was about 100,000 years ago, that is, fairly
“beyond the here-and-now”; but my aim is to convince Arbib or other

Why is there such a continued interest in formulating gestural-ori-
gins theories of language when they never provide an adequate
reason for the subsequent abandonment of the gestural medium,
or a means of getting us to the eventual vocal one? As to why the
change occurred, Arbib finesses that issue. The usual explanations
– that signed language is not omnidirectional, does not work in the
dark, and ties up the hands – have always constituted an insuffi-
cient basis for such a radical reorganization. As to how the change
occurred, we note that the first gestural-origins theory of the mod-
ern era was proposed by Hewes (1973; 1996), who gracefully ad-
mitted that “The ideas about the movement from a postulated pre-
speech language to a rudimentary spoken one are admittedly the
weakest part of my model” (1996, p. 589). Nothing has changed
since, whether in Arbib’s earlier gestural incarnation (Arbib & Riz-
zolatti 1997), in the most recent reincarnation of Corballis’s ges-
tural-origins theory (Corballis 2003a; see MacNeilage 2003 for
commentary), or in the present target article.
Arbib is more vulnerable than most on the why problem be-
cause he posits an original open (read unrestricted) pantomimic
protosign stage. Openness is a definitional property of true lan-
guage. Hockett (1978) pointed out, we think correctly, that if man-
ual communication had ever achieved openness, this would have
been such a momentous development that we would never have
abandoned the original form of the incarnation. Besides ignoring
the why question, Arbib palms the how question, saying only
“Once an organism has an iconic gesture, it can both modulate that
gesture and/or or symbolize it (non-iconically) by ‘simply’ associ-
ating a vocalization with it” (sect. 6.1, para. 2, Arbib’s quotation
marks). Simply?
Arbib’s problems arise from a very disappointing source, given
his own focus on the evolution of action. He shows little regard for
the affordances and constraints of the two language transmission
media (their action components). He consequently misses a num-
ber of opportunities to put constraints on his model. For example,
his problematical conclusion that pantomime could be an open
system disregards a commonly accepted conclusion in linguistics
that for language to become an open system, it must have a com-
binatorial phonology consisting of meaningless elements (such as
consonants and vowels in the vocal medium, and hand shapes, lo-
cations, and movements in the manual medium) (Jackendoff
2002; Studdert-Kennedy & Lane 1980). He makes scant refer-
ence to modern-day sign languages, apparently regarding them as
an adventitious side effect rather than a central phenomenon that
must be accounted for in a language-evolution context. Where did
modern day sign languages get the combinatorial phonology com-
monly thought to be necessary for an open linguistic system, if
their predecessor already had an open pantomimic system? Arbib
says nothing about the system-level problems of getting from a
pantomimic repertoire to a speech repertoire at either the per-
ceptual or the motor level.
A prominent consequence of Arbib’s neglect of the linguistic ac-
tion component is shown in his dubious contention that hominids
readers today. in the protospeech stage could have dashed off complex semantic
concepts with holistic phonetic utterances such as “grooflack” or
“koomzash,” forms that take a modern infant several years to mas-
ter. Such utterances are not holistic today. How could forms with
Evolutionary sleight of hand: Then, they saw such internal complexity, sounding like words with modern struc-
it; now we don’t ture, have originated, and how could they have become linked
with concepts? Also, if they indeed existed as holistic complexes,
Peter F. MacNeilagea and Barbara L. Davisb as Arbib claims, how did they get fractionated? And how was the
aDepartment of Psychology, The University of Texas at Austin, A8000 Austin,
phonetic fractionation related to the putative semantic fractiona-
TX 78712; bDepartment of Communication Sciences and Disorders, The tion into present-day forms of class elements such as nouns and
University of Texas at Austin, A1100 Austin, TX 78712.
verbs in a way that is consistent with phonology-morphology rela-
macneilage@psy.utexas.edu babs@mail.utexas.edu
tionships in present-day languages?
In light of the problems of gestural origins theories with the
Abstract: Arbib’s gestural-origins theory does not tell us why or how a sub-
sequent switch to vocal language occurred, and shows no systematic con-
why and how questions, there is a need for a theory of evolution
cern with the signalling affordances or constraints of either medium. Our of language that gets us to modern language in the old-fashioned
frame/content theory, in contrast, offers both a vocal origin in the inven- way – by speaking it! Our frame/content theory (MacNeilage
tion of kinship terms in a baby-talk context and an explanation for the 1998; MacNeilage & Davis 1990; 2000) is such a theory. Arbib bills
structure of the currently favored medium. our theory as being about “the evolution of syllabification as a way
to structure vocal gestures” but asserts that it “offers no clue as to what might have linked such a process to the expression of meaning” (sect. 6.1, para. 3). Apparently, Arbib did not revise the target article following an exchange of critiques with him earlier this year (our paper not being cited in the target article), in which we described our view that the first words may have been kinship terms formed in the baby-talk context. (For this exchange, see Arbib 2005; MacNeilage & Davis, in press b.)

Our primary contribution in this regard has been to refine earlier conceptions (cf. Locke 1993) of exactly how kinship terms might have originated in a baby-talk context (MacNeilage & Davis 2004; in press a). Our argument is that the structure of present-day baby-talk words is basically identical to the structure of the first words of early speakers of language. We propose that because of this basic identity, the first words had forms like baby-talk forms.

The basic idea (see Falk 2004a, for a recent version) starts from the contention that nasal vocalizations of infants in the presence of the mother (perhaps something like “mama”) came to be seen as standing for the mother. This is consistent with the fact that an extremely high proportion of words for the female parent in both baby talk (Ferguson 1964) and in a corpus of 474 languages (Murdock 1959) have nasal consonants in them.

We argue (MacNeilage & Davis 2004) that following this development a subsequent word for the male parent would have a similar simple structure but would need to contrast phonetically with the word for the female parent. Consistent with this proposal, words for male parent in baby talk (Ferguson 1964) and languages (Murdock 1959) tend to favor oral consonants (e.g., “papa” or “dada”).

The word for female parent in this scenario could be regarded as iconic in that it consistently “went with” the female parent as a result of the focus of infant demand on the nearby female parent. However, we argue that the force towards coining a male parental term that contrasted phonetically with the female term necessarily introduced an element of arbitrariness into the sound-meaning linkage. The conscious realization that arbitrary labels could be attached to concepts could have started spoken language on its momentous journey with the typical arbitrary relationship between concept and sound pattern that has been so difficult to explain (MacNeilage & Davis 2004).

The baby-talk origins scenario might not seem as plausible as the idea of pantomimes as first words, but it is the only one of the two ideas that is consistent with the present-day structure of language, even down to the level of structure of particular lexical items.

ACKNOWLEDGMENT
This paper was prepared with support from research grant No. HD 2773-10 from the U.S. Public Health Service.

Michael Arbib’s extension of the mirror-system hypothesis for explaining the origin of language elegantly sets the stage for further discussion, but we think it overlooks a crucial source of data – the kinds of gestures that actually occur in current human linguistic performance. These data lead us to doubt a basic claim of the “gesture-first” theory, that language started as a gesture language that was gradually supplanted by speech. Arbib has modified this theory with his concept of an expanding spiral, but this new model does not go far enough in representing a speech-gesture system that evolved together.

Classic gesture-first. The enduring popularity of “gesture-first” seems to presuppose that gestures are simple and that as we humans, and language, became more complex, speech evolved and to an extent supplanted gesture, a belief that emerged as part of the Enlightenment quest for the natural state of man and is credited to Condillac, and which has continued since (e.g., Hewes 1973; Armstrong et al. 1995; Corballis 2002). However, contrary to the traditional view, we contend that gesture and language, as they currently exist, belong to a single system of verbalized thinking and communication, and neither can be called the simple twin of the other. It is this system, in which both speech and gesture are crucial, that we should be explaining. It makes little sense to ask which part of an unbroken system is “simpler”; a better question is how the parts work together.

In this system, we find synchrony and coexpressiveness – gesture and speech conveying the same idea unit, at the same time. Gesture and speech exhibit what Wundt described long ago as the “simultaneous” and “sequential” sides of the sentence (Blumenthal 1970, p. 21) and Saussure, in notes recently discovered, termed “l’essence double du langage” (Harris 2002). Double essence, not enhancement, is the relationship, and we do not see how it could have evolved from the supplanting of gestures by speech. In the remainder of this commentary, we summarize three sources of evidence to support this assertion.

1. Consider the attached drawing (Fig. 1). The speaker was describing a cartoon episode in which one character tries to reach another character by climbing up inside a drainpipe. The speaker
is saying, “and he goes up through the pipe this time,” with the gesture occurring during the boldfaced portion (the illustration captures the moment when the speaker says the vowel of “through”). Coexpressively with “up,” her hand rose upward, and coexpressively with “through,” her fingers spread outward to create an interior space. These took place together and were synchronized with “up through,” the linguistic package that combines the same meanings.

The effect is a uniquely gestural way of packaging meaning – something like “rising hollowness,” which does not exist as a semantic package of English at all. Speech and gesture, at the moment of their synchronization, were coexpressive. The very fact that there is shared reference to the character’s climbing up inside the pipe makes clear that it is being represented by the speaker in two ways simultaneously – analytic/combinatoric in speech and global/synthetic in gesture. We suggest it was this very simultaneous combination of opposites that evolution seized upon.

2. When signs and speech do combine in contemporary human performance, they do not synchronize. Kendon (1988) observed sign languages employed by aboriginal Australian women – full languages developed culturally for (rather frequent) speech taboos – which they sometimes combine with speech. The relevant point is that in producing these combinations, speech and sign start out synchronously, but then, as the utterance proceeds, speech outruns the semantically equivalent signs. The speaker stops speaking until the signs catch up and then starts over, only for speech and signs to pull apart again. If, in the evolution of language, there had been a similar doubling up of signs and speech, as the supplanting scenario implies, they too would have been driven apart rather than into synchrony, and for this reason, too, we doubt the replacement hypothesis.

3. The Wundt/Saussure “double essence” of gesture and language appears to be carried by a dedicated thought-hand-language circuit in the brain. This circuit strikes us as a prime candidate for an evolutionary selection at the foundation of language. It implies that the aforementioned combinations of speech and gesture were the selected units, not gesture first with speech supplanting or later joining it. We observe this circuit in the unique neurological case of I.W., who lost all proprioception and spatial position sense from the neck down at age 19, and has since taught himself to move using vision and cognition. The thought-language-hand link, located presumably in Broca’s area, ties together language and gesture, and, in I.W., survives and is partly dissociable from instrumental action.

We can address Arbib’s pantomime model by observing the kinds of gestures the dedicated link sustains in I.W.’s performance, in the absence of vision: his gestures are (1) coexpressive and synchronous with speech; (2) not supplemental; and (3) not derivable from pantomime. I.W. is unable to perform instrumental actions without vision but continues to perform speech-synchronized, coexpressive gestures that are virtually indistinguishable from normal (topokinetic accuracy is reduced but morphokinetic accuracy is preserved) (Cole et al. 2002). His gestures without vision, moreover, minimize the one quality that could be derived from pantomime, a so-called “first-person” or “character” viewpoint, in which a gesture replicates an action of a character (cf. McNeill 1992).

More generally, an abundance of evidence demonstrates that spontaneous, speech-synchronized gestures should be counted as part of language (McNeill 1992). Gestures are frequent (accompanying up to 90% of utterances in narrations). They synchronize exactly with coexpressive speech segments, implying that gesture and related linguistic content are coactive in time and jointly convey what is newsworthy in context. Gesture adds cohesion, gluing together potentially temporally separated but thematically related segments of discourse. Speech and gesture develop jointly in children, and decline jointly after brain injury. In contrast to cultural emblems, such as the “O.K.” sign, speech-synchronized gestures occur in all languages, so far as is known. Finally, gestures are not “signs” with an independent linguistic code. Gestures exist only in combination with speech, and are not themselves a coded system.

Arbib’s gesture-first. Arbib’s concept of an expanding spiral may avoid some of the problems of the supplanting mechanism. He speaks of scaffolding and spiral expansion, which appear to mean, in both cases, that one thing is preparing the ground for or propping up further developments of the other thing – speech to gesture, gesture to speech, and so on. This spiral, as now described, brings speech and gesture into temporal alignment (see Fig. 6 in the target article), but also implies two things juxtaposed rather than the evolution of a single “thing” with a double essence. Modification to produce a dialectic of speech and gesture, beyond scaffolding, does not seem impossible. However, the theory is still focused on gestures of the wrong kind for this dialectic – in terms of Kendon’s Continuum (see McNeill 2000 for two versions), signs, emblems, and pantomime. Because it regards all gestures as simplified and meaning-poor, it is difficult to see how the expanding spiral can expand to include the remaining point on the Continuum, “gesticulations” – the kind of speech-synchronized coexpressive gesture illustrated above.

A compromise is that pantomime was the initial protolanguage but was replaced by speech plus gesture, leading to the thought-language-hand link that we have described. This hypothesis has the interesting implication that different evolutionary trajectories landed at different points along Kendon’s Continuum. One path led to pantomime, another to coexpressive and speech-synchronized gesticulation, and so on. These different evolutions are reflected today in distinct ways of combining movements with speech. Although we do not question the importance of extending the mirror system hypothesis, we have concerns about a theory that predicts, as far as gesture goes, the evolution of what did not evolve instead of what did.

Meaning and motor actions: Artificial life and behavioral evidence

Domenico Parisi,^a Anna M. Borghi,^b Andrea Di Ferdinando,^c and Giorgio Tsiotas^b
^aInstitute of Cognitive Science and Technologies, National Research Council, Rome 00137, Italy; ^bDepartment of Psychology, University of Bologna, Bologna 40127, Italy; ^cDepartment of Psychology, University of Padua, Padua 35131, Italy.
d.parisi@istc.cnr.it http://gral.ip.rm.cnr.it/dparisi/
annamaria.borghi@unibo.it http://gral.ip.rm.cnr.it/borghi/
andrea.diferdinando@unipd.it http://ccnl.psy.unipd.it/di_ferdinando.html
giorgio@tsiotas.com

Abstract: Mirror neurons may play a role in representing not only signs but also their meaning. Because actions are the only aspect of behavior that are inter-individually accessible, interpreting meanings in terms of actions might explain how meanings can be shared. Behavioral evidence and artificial life simulations suggest that seeing objects or processing words referring to objects automatically activates motor actions.

Arbib argues that the vocal signs of human language probably evolved from the gestural signs of some protolanguage, and this might explain why the production of vocal signs in the human brain is controlled by Broca’s area – which corresponds to area F5 in the monkey brain – which controls manual actions. The discovery of neurons in both areas that are activated both when a manual action is executed and when it is observed in others (mirror neurons) reinforces this interpretation, because language is based on what Arbib calls the parity requirement, according to which what counts for the speaker must count approximately the same for the hearer.

However, language is not only signs but is signs plus the meaning of signs. Mirror neurons tend to be invoked to explain the production of linguistic signs but they may also play an important role in the representation of the meaning of those signs. If meanings are interpreted as categories of entities in the environment, one can argue that these categories are represented in the brain in
An avian parallel to primate mirror neurons and language evolution?

Irene M. Pepperberg
Department of Psychology, Brandeis University, MS062, Waltham, MA 02454.
impepper@media.mit.edu http://www.alexfoundation.org

Abstract: Arbib presents a reasoned explanation for language evolution from nonhuman to human primates, one that I argue can be equally applied to animals trained in forms of interspecies communication. I apply his criteria for language readiness and language (in actuality, protolanguage) to the behavior of a Grey parrot (Psittacus erithacus) taught to communicate with humans using rudiments of English speech.

Arbib approaches an old chestnut – language evolution – from the novel standpoint of mirror neurons (MN), building upon his earlier theses. His counterarguments for innatist theories are clearly on target. With little to critique, I focus on possible parallels between Arbib’s proposals and Grey parrot behavior – particularly that of my oldest subject, Alex (Pepperberg 1999).

Concerning Arbib’s criteria for language-readiness (LR), little is unique to primates. Arbib provides not LR but CCR – “complex communication-ready” – criteria. He suggests this possibility but omits details. LR1 (complex imitation), reproduction of novel behavior that can be approximated by existent actions and their variants, is demonstrated, for example, by Alex’s initial immediate utterance of “s(pause)wool” for “spool” (for a wooden bobbin; Pepperberg 2005b), /p/ being particularly difficult to produce without lips and teeth (Patterson & Pepperberg 1998). LR2 (symbolization), LR3 (parity), and LR4 (intention) are demonstrated in detailed studies of Alex’s referential communication (Pepperberg 1999). LR5 (temporal versus hierarchical ordering) is more difficult to prove, except possibly in the understanding and use of interactive dialogue (Pepperberg 1999). LR6 (past/future) occurs in any animal that can be operantly conditioned. Although few data exist on Grey parrot behavior in nature, LR7 is likely, given that Greys take several years to reach sexual maturity.

In LA1 through LA4, Arbib also focuses on primates, but Greys seemingly meet most criteria. For LA1, for example, Alex transfers the use of “none” from responding to “What’s same/different?” for two objects when nothing is same or different, to responding to, without training, “What color bigger?” for equally sized objects (Pepperberg & Brezinsky 1991), and then, again without training, to designate an absent quantity in an enumeration task (Pepperberg & Gordon 2005). Furthermore, to Alex, “paper,” for example, is not merely index card pieces used for initial training, but large sheets of computer output, newspapers, and students’ textbooks. For LA2, Alex comprehends recursive, conjunctive queries (e.g., “What object is green and 3-corner?” versus “What color is wood and 4-corner?” versus “What shape is blue and wool?”; Pepperberg 1992). LA3 has not been demonstrated directly in Greys, but birds likely have episodic memory (e.g., work by Clayton et al. 2003). LA4, learnability, exists with respect to semantics and, to a limited extent, for sentence frames (appropriate use of “I want X” versus “Wanna go Y”; Pepperberg 1999). Interestingly, Arbib’s criteria closely parallel Hockett’s (1959) design features; direct comparison would be instructive.

Given these parallels, do Grey parrots also have MN systems – neurons that, for example, react similarly when birds hear and speak human labels? Biologically, existent evidence is sparse but intriguing. For oscine birds’ own song, some parallels exist with primates. Songbirds’ high vocal center (HVC) sends efferents to both input and output branches of the song system; HVC is necessary for song production and has neurons showing song-specific auditory responses (Williams 1989). Furthermore, playback of birds’ own song during sleep causes neural activity comparable to actual singing (Dave & Margoliash 2000).

How these findings relate to parrot brains, which are organized differently from songbird brains (e.g., Jarvis & Mello 2000; Striedter 1994) is unclear. Although studies of ZENK gene^1 expression show separation of budgerigar (Melopsittacus undulatus) response regions for hearing and vocalizing warble song (Jarvis & Mello 2000), electrophysiological studies in the frontal neostriatum of awake budgerigars show activity both in production of and response to calls (Plumer & Striedter 1997; 2000); evidence also exists for additional budgerigar auditory-vocal pathways (e.g., Brauth et al. 2001). Because ZENK response apparently is tuned to specific song features (Ribeiro et al. 1998), the relevance of these data for MNs in talking parrots is unknown.

However, arguments for complex imitation, and by inference, brain structures to support such behavior, exist. Like children described by Arbib, Alex goes beyond simple imitation; he acquires the phonological repertoire, some words, and basic “assembly skills” of his trainers and appears to parse complex behavior patterns (words and phrases) into recombinable pieces and familiar (or semi-familiar) actions. In addition to material described above, Alex (1) recognizes and produces small phonetic differences (“tea” vs. “pea”) meaningfully (Patterson & Pepperberg 1994; 1998), (2) produces initial phonemes differently depending upon subsequent ones (/k/ in “key” vs. “cork”; Patterson & Pepperberg 1998), and (3) consistently recombines parts of labels according to their order in existent labels (i.e., combines beginnings of one label with the ends of others – e.g., “banerry” [for apples] from banana/cherry. After analyzing more than 22,000 vocalizations, we never observed backwards combinations such as “percup” instead of “cupper/copper”; Pepperberg et al. 1991).

Surprisingly, Arbib doesn’t discuss Greenfield’s (1991) studies that might also involve co-opting gestural forms for vocal language, although she does not examine MNs and imitation. Apparently, human children – and language-trained chimps, but not monkeys – simultaneously develop hierarchical object and linguistic ordering (e.g., serial cup stacking, phrases like “I want X”) as, Greenfield argues, a consequence of Broca/F5 maturation. MNs in these brain areas are activated by both action and observation of hand or mouth gestures; less advanced MNs exist in monkeys than in apes and humans. Similar behavior is observed in Grey parrots (Pepperberg & Shive 2001), although avian combinations both involve the beak. Greenfield implies that these actions emerge without overt instruction; however, these behavior patterns are likely observed from birth (or hatching). Maybe only after maturation of MN and canonical neuron systems can they be expressed (Pepperberg 2005a).

In sum, the communication system I have taught Grey parrots will never be fully congruent with any current human language, but I am intrigued by the many parallels that can be drawn between their protolanguage and that described by Arbib for early Homo: Start with a brain of a certain complexity and give it enough social and ecological support; that brain will develop at least the building blocks of a complex communication system.

ACKNOWLEDGMENT
Preparation of this commentary was supported by donors to The Alex Foundation.

NOTE
1. Expression of the ZENK gene, a songbird analog to a human transcription factor, egr-1, is driven by actions of singing and hearing. Hence, it is used to form a functional map of avian brains for behavior related to both auditory processing and vocal production (Jarvis & Mello 2000).
Contagious yawning and laughing: Everyday imitation- and mirror-like behavior

Robert R. Provine
Department of Psychology, University of Maryland Baltimore County, Baltimore, MD 21250.
provine@umbc.edu

Abstract: Infectious yawning and laughing offer a convenient, noninvasive approach to the evolution, development, production, and control of imitation-like and mirror-like phenomena in normal, behaving humans.

The analysis of a scientific problem can benefit from taking a broad perspective before turning to narrower and more reductive issues. In this spirit, I nominate contagious yawning and laughing for consideration, which are two of the most familiar cases of human behavior with imitation-like and mirror-like properties. Even their relegation to special-case status would help set parameters and inform readers who are more familiar with these acts than such esoteric and inaccessible phenomena as mirror neurons. An attractive feature of contagious yawning and laughing as scientific problems is that we can use ourselves as subjects – no electrophysiological laboratory is required. They also offer tantalizing insights into the evolutionary process through which a motor act may become mirrored or imitated.

Contagious yawning and laughing involve a chain reaction of behavior and physiology that propagates through and synchronizes the state of a group. Being unconsciously controlled, the contagious responses do not involve a desire to replicate an observed yawn or laugh – we just do them. Although the sensory vector for contagious yawns is primarily visual and that for laughter is primarily auditory, both contagious acts involve the replication of observed movements, whether the facial contortions of the yawn, or the respiratory movements that produce the vocalization of laughter.

Although the focus of this commentary is on the mirror-like and imitation-like properties of contagion, the analysis of mechanism must begin with the motor act brought under stimulus control. Yawns and laughs evolved before the stimulus triggers responsible for their contagion. This is a case of motor precocity, the common tendency of motor systems to develop or evolve prior to receiving sensory inputs. Organisms often “spond before they respond.” Motor systems can be adaptive, stand-alone processes, unlike sensory systems that, by themselves, lack adaptive significance because they have no behavioral consequence. (By extension, reflexes are unlikely to emerge de novo because they require the improbable simultaneous genesis of both a sensory and motor process.) Let us now consider the evolution of yawning and laughing and how they came under sensory control.

Yawning (Provine 1986) is an ancient, stereotyped motor pattern that is performed by most vertebrates and develops prenatally in humans. Once initiated, a yawn goes to completion – recall the difficulty of stifling a yawn. There are no half-yawns. The motor pattern generator for yawning probably resides in the brain stem along with other pulmonary and vasomotor control centers. A yawn, like a laugh, is not under voluntary control and cannot be produced on command.

Contagious yawning (Provine 1986; 1989) probably emerged many millions of years after the ubiquitous motor act and, although it may be present in other species, has been clearly demonstrated only in humans. Lacking the remarkable precocity of the motor act, contagious yawning of humans appears sometime during early childhood, a developmental trajectory that suggests the involvement of a separate and higher brain mechanism. Contagious yawns can be triggered by the observation of the overall configuration of the animate, yawning face, regardless of its axial orientation or presence of the gaping mouth. (Shielding a yawn will not block its contagion.) The neurological yawn detector is so broadly tuned that almost any stimulus associated with yawning can trigger the act, including, as some readers have noticed, even thinking about or reading about yawning. The broad tuning insures contagion in darkness or in the absence of line-of-sight visual contact with a yawner.

Laughter has a clearer and much shorter history than yawning and is associated with the evolution of social play in mammals (Provine 1996; 2000). Laughter is literally the sound of labored breathing in rough and tumble play, where the sound of panting has come to represent the playful act that produced it. Ethologists refer to such processes as ritualization. Laughter evolved as a play vocalization, an unconsciously controlled, therefore honest signal that an encounter has playful intent and is not a physical assault. In humans, the “pant-pant” laughter of our primate ancestors morphed into “ha-ha.” Laughter is the clearest example of how a vocalization evolved – it does not involve the arbitrary pairing of a sound with a meaning. (The transition from “pant-pant” to “ha-ha” laughter reflects the increased vocal control of humans enabled by bipedality and ultimately explains why we can speak and other great apes cannot.) Laughter and speech time-share the same vocal apparatus, but each maintains unique features and neurological mechanisms. Laughter lacks the voluntary control of spoken words, and we tend to either laugh or speak, with speech being dominant because laughter seldom interrupts the phrase structure of speech. Laughter punctuates the speech stream (Provine 1993).

Laughter triggers the laughter of those who hear it, synchronizing and amplifying the neurobehavioral status of a group. It is the basis of the notorious television laugh tracks. Crying is another infectious vocalization, at least among human infants (Simner 1971). As suggested by Arbib, such processes are probably common among animals. Contagious laughs occur almost immediately after the stimulus laugh, in contrast to contagious yawns where there is a gradual increase in the probability of yawning during the seconds after the observed yawn.

A challenge of comparing the mirror systems of Arbib with those of yawning and laughter is that so little is known about the neurology of the latter. The laughing/yawning systems may, for example, more resemble systems involved in monkey vocalizations (midbrain and cingulate cortex) than those for language (e.g., Broca’s and Wernicke’s regions) or the specific mirror system considered by Arbib, the hand and orofacial system of monkey premotor area F5. However, the yawning/laughter systems may be a convenient exemplar of a class of processes at the foundation of Arbib’s proposal that can teach us about mirror/imitation mechanisms and their evolution. The parsimony of biological systems suggests that, in whole or in part, standard processes, components, and circuits in the neurological tool kit are likely to find many applications.

Motivation rather than imitation determined the appearance of language

Pavel N. Prudkov
Ecomon Ltd., Selskohosyastvennaya ul. 12-a, Moscow, Russia.
pnprudkov@mtu-net.ru

Abstract: Arbib derives the origin of language from the emergence of a complex imitation system; however, it is unlikely that this complication could occur without a prior complicating within the imitated systems. This means that Arbib’s hypothesis is not correct, because the other systems determined the appearance of language. In my opinion, language emerged when the motivational system became able to support goal-directed processes with no innate basis.

In the target article Arbib derives the origin of language from the emergence of a complex imitation system among ancient Homo. Describing in detail how the complex imitation system could facilitate the formation of protosign and protospeech, he says nothing, however, about why this system must have emerged. This is a serious problem; imitation is, by definition, copying of other pro-
cesses, therefore the complexity of the imitation system of an or- constraints. For example, the capacity of birds to navigate in three-
ganism cannot exceed the complexity of the systems to be imi- dimensional space on the basis of visual cues obviously exceeds
tated. This principle seriously constrains the possibility of the that of humans, but innate mechanisms determine the behavior of
emergence of a new, more complex imitation system without the birds.
corresponding complicating within the systems to be imitated. It is reasonable to think that there was a reciprocal interaction
Such a possibility seems to underlie Arbib’s approach because, in in the evolution of human language and motivation. The new mo-
emphasizing the changes in the imitation system, he does not re- tivational ability spurred the development of language; afterwards
quire similar fundamental changes in other systems. language was used to construct efficient, purposeful processes,
Of course, it is impossible to abandon the idea that the complex and this interaction likely determined all stages of human evolu-
imitation system could emerge as a result of a single mutation tion. This joint evolution was facilitated by the fact that a common
without the corresponding changes in other systems of some an- mechanism that evolved within these systems is the capacity to
cient hominids; but such hominids occasionally benefited from form and execute complex, hierarchical, goal-directed processes
their new possibilities, thereby surviving successfully, until other (such processes are rapid and relatively simple in language and are
systems achieved the complexity of the imitation system; and then slow and complex in motivation) (Prudkov & Rodina 1999). In
natural selection started working more conventionally again. The other words, I agree with Arbib that humans have a language-
probability of this scenario is extremely low, obviously. Another ready brain rather than special mechanisms embedded in the
approach to the origin of the complex imitation system, which genome. The capacity was also involved in the development of the
seems much more probable, is that a certain complication of other imitation system, because a basic characteristic distinguishing the
systems preceded this system and made its appearance necessary. human imitation system from its animal analogs is the possibility
This, however, means that Arbib’s hypothesis suggesting that the to imitate more complex and long-term processes. But the devel-
complex imitation system is the “missing link” is not correct, be- opment of the imitation system itself is not sufficient to construct
cause other systems in fact determined the appearance of lan- protolanguage, because only the new motivational system could
guage. make imitation voluntary and arbitrary. Indeed, in emphasizing
Like other hypotheses of language origin, Arbib’s hypothesis is that at a certain stage of evolution communication became volun-
based on the idea that language is a means of communication. This tary and intentional, Arbib does not explain what mechanisms un-
definition is correct but incomplete: language is a means of com- derlay such possibilities of communication.
munication for people engaged in a joint activity. There is a clear In my opinion, the gestural and vocal components of protolan-
correlation between the diversity of activities and the complexity guage emerged together, but the latter gained advantage in the
of the language serving these activities. Modern languages consist development because, unlike gestures, which are effective only in
of hundreds of thousands of words only because these languages dyadic contacts, vocalizations are more effective in group actions
are applied in thousands of diverse activities. Each human activ- (group hunting, collective self-defense, etc.), which became the
ity is goal-directed, hence, the complexity of languages is a conse- first actions guided by goals having no innate basis.
quence of the ability of the human brain to construct diverse goals.
Indeed, most human goals are not constrained by any innate ba-
sis; they are social, and result from interactions between people.
So, there is an obvious connection between language and the abil-
ity to construct and maintain long-term motivations with no innate Vocal gestures and auditory objects
basis.
No nonhuman animals have a motivational system with similar Josef P. Rauschecker
characteristics. Animals have long-term motivations (e.g., sex, Laboratory of Integrative Neuroscience and Cognition, Georgetown
University School of Medicine, Washington, DC 20057-1460.
hunger), but these are all innate. An animal can form learned mo-
rauschej@georgetown.edu
tivations, but only when its basic drives are activated. The hy-
pothesis that the motivation of animals is always constrained by
Abstract: Recent studies in human and nonhuman primates demonstrate
the activation of basic drives was suggested by Kohler (1917/ that auditory objects, including speech sounds, are identified in anterior
1927), and despite intensive researches, there have still been no superior temporal cortex projecting directly to inferior frontal regions and
data inconsistent with it (Suddendorf & Corballis 1997). With the not along a posterior pathway, as classically assumed. By contrast, the role
limited and stable number of long-term motivations, animals are of posterior temporal regions in speech and language remains largely un-
constrained in using and developing their languages. Since all explained, although a concept of vocal gestures may be helpful.
their motivations are connected with vital functions, any serious
misunderstanding in the process of communication can be fatal; In his target article, Arbib maintains (and before him, Rizzolatti &
as a result, the number of signals in animal languages must be lim- Arbib 1998) that language originated from a system of mirror neu-
ited, and the signals must have unequivocal meanings. Roughly rons coding manual gestures, rather than from vocal communica-
speaking, animals do not have a language similar to human lan- tion systems present in nonhuman primates (and other animals).
guages because they simply do not need it. I do not doubt the usefulness of the mirror-neuron concept, which
I have suggested elsewhere that the emergence of the ability to brings back to mind the motor theory of speech perception (Liber-
construct and maintain long-term goals with no innate basis was man et al. 1967). In fact, many recent neuroimaging studies have
the missing link for language (Prudkov 1999c) and for the other independently demonstrated a simultaneous activation of what
distinctively human characteristics (Prudkov 1999a; 1999b) be- were previously thought of as separate centers for the production
cause the ability allowed ancient humans to overcome the con- and perception of human language, Broca’s and Wernicke’s areas,
straints of innate motivations, thus providing the possibility of respectively. These designations go back more than a century to
constructing new, flexible, and open systems. In other words, pro- crudely characterized single-case studies of neurological patients,
tolanguage emerged because in new situations conditioned by which have been shown by modern magnetic resonance imaging
goals having no innate basis, the innate communicative means be- (MRI) techniques (Bookheimer 2002) to have missed much more
came inefficient for interactions between ancient hominids, and brain than the relatively small regions that now bear their discov-
those who were able to construct new means succeeded in repro- erers’ names.
duction. Of course, language, imitation, and the theory of mind Both on that basis and on the basis of his own belief in inter-
had started evolving then. It is very important to emphasize that twined systems of perception and action, it is surprising that Ar-
without the prior (or parallel) formation of the system able to con- bib continues to use this outdated terminology. “Broca’s area” at
struct learned, long-term motivations, any changes in other sys- least is redefined by him as part of a system that deals with, among
tems (e.g., in intelligence) were not sufficient to overcome innate others, “sequential operations that may underlie the ability to
form words out of dissociable elements” (sect. 8), a definition that many researchers could agree with, although the exact correspondence with cytoarchitectonically defined areas and the homologies between human and nonhuman primates are still controversial. “Wernicke’s area,” by contrast, gets short shrift. Arbib talks about it as consisting of the posterior part of Brodmann’s area 22, including area Tpt of Galaburda and Sanides (1980) and an “extended [parietal area] PF,” suggesting that this is the only route that auditory input takes after it reaches primary auditory cortex. Of course, this suggestion echoes the classical textbook view of a posterior language pathway leading from Wernicke’s to Broca’s area via the arcuate fascicle.

A remarkable convergence of recent neurophysiological and functional imaging work has demonstrated, however, that the analysis of complex auditory patterns and their eventual identification as auditory objects occurs in a completely different part of the superior temporal cortex, namely, its anterior portion. The anterior superior temporal (aST) region, including the anterior superior temporal gyrus (STG) and to some extent the dorsal aspect of the superior temporal sulcus (STS), project to the inferior frontal (IF) region and other parts of the ventrolateral prefrontal cortex (VLPFC) via the uncinate fascicle. Together, the aST and IF cortices seem to form a “what” stream for the recognition of auditory objects (Rauschecker 1998; Rauschecker & Tian 2000), quite similar to the ventral stream for visual object identification postulated previously (Ungerleider & Mishkin 1982). Neurophysiological data from rhesus monkeys suggest that neurons in the aST are more selective for species-specific vocalizations than are neurons in the posterior STG (Tian et al. 2001). In humans, there is direct evidence from functional imaging work that intelligible speech as well as other complex sound objects are decoded in the aST (Binder et al. 2004; Scott et al. 2000; Zatorre et al. 2004).

It seems, therefore, that the same anatomical substrate supports both the decoding of vocalizations in nonhuman primates and the decoding of human speech. If this is the case, the conclusion is hard to escape that the aST in nonhuman primates is a precursor of the same region in humans and (what Arbib may be reluctant to accept) that nonhuman primate vocalizations are an evolutionary precursor to human speech sounds. Indeed, the same phonological building blocks (or “features”), such as frequency-modulated (FM) sweeps, band-passed noise bursts, and so on, are contained in monkey calls as well as human speech. Admittedly, the decoding of complex acoustic sound structure alone is far from sufficient for language comprehension, but it is a necessary precondition for the effective use of spoken speech as a medium of communication. Arbib argues, with some justification, that communication is not bound to an acoustic (spoken) medium and can also function on the basis of visual gestures. However, in most hearing humans the acoustic medium, that is, “vocal gestures,” has gained greatest importance as an effective and reliable carrier of information.

An interesting question remaining, in my mind, is, therefore, how the auditory feature or object system in the aST could interact with a possible mirror system, as postulated by Arbib and colleagues. The projection from aST to IF seems like a possible candidate to enable such an interaction. Indeed, auditory neurons, some of them selectively responsive to species-specific vocalizations, are found in the VLPFC (Romanski & Goldman-Rakic 2002). According to our view, aST serves a similar role in the auditory system as inferotemporal (IT) cortex does for the visual system. Which role, if any, Wernicke’s area (or posterior STG) plays for vocal communication, including speech and language, remains the bigger puzzle. Understanding it as an input stage to parietal cortex in an auditory dorsal pathway is a good hint. However, as Arbib would say, “empirical data are sadly lacking” and need to be collected urgently.

Continuities in vocal communication argue against a gestural origin of language

Robert M. Seyfarth
Department of Psychology, University of Pennsylvania, Philadelphia, PA 19104.
seyfarth@psych.upenn.edu
http://www.psych.upenn.edu/~seyfarth/Baboon%20research/index.htm

Abstract: To conclude that language evolved from vocalizations, through gestures, then back to vocalizations again, one must first reject the simpler hypothesis that language evolved from prelinguistic vocalizations. There is no reason to do so. Many studies – not cited by Arbib – document continuities in behavior, perception, cognition, and neurophysiology between human speech and primate vocal communication.

Arbib argues that the emergence of human speech “owes little to nonhuman vocalizations” and concludes that “evolution did not proceed directly from monkey-like primate vocalizations to speech but rather proceeded from vocalization to manual gesture and back to vocalization again” (target article, sect. 2.3). Accepting this hypothesis requires us to adopt a convoluted argument over a simple one. There is no need to do so.

If dozens of scientists had been studying the natural vocalizations of nonhuman primates for the past 25 years and all had concluded that the vocal communication of monkeys and apes exhibited no parallels whatsoever with spoken language, one might be forced to entertain Arbib’s hypothesis. If years of neurobiological research on the mechanisms that underlie the perception of calls by nonhuman primates had revealed no parallels with human speech perception, this, too, might compel us to reject the idea that human language evolved from nonhuman primate vocalizations. Neither of these conclusions, however, is correct.

Arbib offers his hypothesis as if he had carefully reviewed the literature on nonhuman primate vocal communication and thoughtfully rejected its relevance to the evolution of human language. Readers should be warned, however, that his review ends around 1980 and even neglects some important papers published before that date.

Primate vocal repertoires contain several different call types that grade acoustically into one another. Despite this inter-gradation, primates produce and perceive their calls as, roughly speaking, discretely different signals. Different call types are given in different social contexts (e.g., Cheney & Seyfarth 1982; Fischer 1998; Fischer et al. 2001a; Hauser 1998; Snowdon et al. 1986). In playback experiments, listeners respond in distinct ways to these different call types, as if each type conveys different information (e.g., Fischer 1998; Fischer et al. 2001b; Rendall et al. 1999). Listeners discriminate between similar call types in a manner that parallels – but does not exactly duplicate – the categorical perception found in human speech (Fischer & Hammerschmidt 2001; Owren et al. 1992; Prell et al. 2002; Snowdon 1990; Zoloth et al. 1979). Offering further evidence for parallels with human speech, the grunts used by baboons (and probably many other primates) differ according to the placement of vowel-like formants (Owren et al. 1997; Rendall 2003).

Arbib incorrectly characterizes primate vocalizations as “involuntary” signals. To the contrary, ample evidence shows that nonhuman primate call production can be brought under operant control (Peirce 1985) and that individuals use calls selectively in the presence of others with whom they have different social relations (for further review and discussion, see Cheney & Seyfarth 1990; Seyfarth & Cheney 2003b).

Because nonhuman primates use predictably different calls in different social and ecological contexts, listeners can extract highly specific information from them, even in the absence of any supporting contextual cues. For example, listeners respond to acoustically different alarm calls as if they signal the presence of different predators (Fichtel & Hammerschmidt 2002; Fischer 1998; Seyfarth et al. 1980), and to acoustically different grunts as if they signal the occurrence of different social events (Cheney & Seyfarth 1982; Rendall et al. 1999). In habituation-dishabituation experiments that asked listeners to make a same-different judgment between calls, subjects assessed calls based on their meaning, not just their acoustic properties (Cheney & Seyfarth 1988; Zuberbuhler et al. 1999). The parallels with children’s perception of words cannot be ignored (see Zuberbuhler 2003 for review).

Indeed, it is now clear that although primates’ production of vocalizations is highly constrained, their ability to extract complex information from sounds is not (Seyfarth & Cheney 2003b). Upon hearing a sequence of vocalizations, for example, listeners acquire information that is referential, discretely coded, hierarchically structured, rule-governed, and propositional (Bergman et al. 2003; Cheney & Seyfarth, in press). These properties of primates’ social knowledge, although by no means fully human, bear striking resemblances to the meanings we express in language, which are built up by combining discrete-valued entities in a structured, hierarchical, rule-governed, and open-ended manner. Results suggest that the internal representations of language meaning in the human brain initially emerged from our prelinguistic ancestors’ knowledge of social relations, as exhibited in the information they acquire from vocalizations (Cheney & Seyfarth 1997; in press; Worden 1998).

Nonhuman primate vocalizations also exhibit parallels with human speech in their underlying neural mechanisms. Behavioral studies of macaques suggest that the left hemisphere is specialized for processing species-specific vocalizations but not other auditory stimuli (Hauser & Anderson 1994; Petersen et al. 1978). Lesion results demonstrate that ablation of auditory cortex on the left but not the right hemisphere disrupts individuals’ ability to discriminate among acoustically similar call types (Heffner & Heffner 1984). Most recently, Poremba et al. (2004) measured local cerebral metabolic activity as macaques listened to a variety of auditory stimuli. They found significantly greater activity in the left superior temporal gyrus as compared with the right, but only in response to conspecific vocalizations. These and other results (e.g., Wang et al. 1995; see Hauser [1996] and Ghazanfar & Hauser [2001] for review) suggest that Arbib is wrong to assume that primate vocalizations “appear to be related to non-cortical regions” (sect. 1.2, para. 3). They further suggest that the neurophysiological mechanisms underlying human speech processing evolved from similar mechanisms in our nonhuman primate ancestors.

In sum, research demonstrates a striking number of continuities – in behavior, perception, cognition, and neurophysiology – between human speech and the vocal communication of nonhuman primates. Nonhuman primate vocal communication does not qualify as language, but it does exhibit many of the characteristics that one would expect to find if human language had evolved from the vocal communication and cognition of the common ancestor of human and nonhuman primates.

Arbib cites none of this research. As a result, his presentation is strongly biased in favor of his own view that the emergence of human speech “owes little to nonhuman vocalizations” (target article, Abstract). To accept the convoluted hypothesis that spoken language evolved from vocalizations, through gestures, then back to vocalizations again, one must first have good reason to reject the simpler hypothesis that spoken language evolved from prelinguistic vocal communication. A substantial body of data argues against such a rejection.

Making a case for mirror-neuron system involvement in language development: What about autism and blindness?

Hugo Théoret (a) and Shirley Fecteau (b)
(a) Departement de Psychologie, Université de Montréal, Centre Ville, Montreal, Qc H3C 3J7, Canada; (b) Faculté de Médecine, Université de Montréal, Centre Ville, Montreal, Qc, H3C 3J7, Canada.
hugo.theoret@umontreal.ca
shirley.fecteau@umontreal.ca

Abstract: The notion that manual gestures played an important role in the evolution of human language was strengthened by the discovery of mirror neurons in monkey area F5, the proposed homologue of human Broca’s area. This idea is central to the thesis developed by Arbib, and lending further support to a link between motor resonance mechanisms and language/communication development is the case of autism and congenital blindness. We provide an account of how these conditions may relate to the aforementioned theory.

Arbib presents a strong argument in favor of a link between mirror neurons (MN), imitation, and the development of human language. We endorse his thesis that a protolanguage based on manual gestures was a precursor to human language as we know it today. Additional support for this claim comes from two seemingly different conditions: autism and congenital blindness.

Autism. Language and communication deficits are one of the defining features of autism spectrum disorders (ASD; American Psychiatric Association 1994) and are core elements of their diagnosis and prognosis (Herbert et al. 2002; Ventner et al. 1992). Particularly relevant is the fact that these impairments are more prominent in pragmatic speech associated with social communication (Tager-Flusberg 1997). Interestingly, individuals with ASD also display well-documented deficits in imitative behavior (e.g., Avikainen et al. 2003). Recent magnetoencephalographic data suggest that an abnormal mirror-neuron system (MNS) may underlie the imitative impairment observed in individuals with ASD (Nishitani et al. 2004). That study reported imitation-related abnormalities in Broca’s area and its contralateral homologue, the human equivalent of monkey area F5, where most MN are found (Rizzolatti & Craighero 2004).

The idea that imitative abilities, and possibly language impairments, are related to basic MN dysfunction in ASD was recently investigated in our laboratory. In line with the MN hypothesis of ASD (Williams et al. 2001), motor cortex activation during the observation of simple finger movements was found to be significantly weaker in individuals with ASD compared to matched controls (Théoret et al. 2005). The MNS/language disorder hypothesis is also supported by the fact that individuals with autism display structural abnormalities in Broca’s area (Herbert et al. 2002). Other symptoms that may be associated in some way with MN dysfunction in ASD include abnormal eye gaze, theory-of-mind deficits, the use of others’ hands to communicate or demand, hand mannerisms, repetitive behaviors, and echolalia.

Taken together, these data support Arbib’s main argument that a simple action observation/execution matching mechanism anchored in area F5 (Broca’s area in humans) may have evolved into a complex system subserving human language. Consequently, a pathological, congenital dysfunction of the mirror-cell system in humans would be expected to dramatically affect social interactions and language/communication as a result of gesture/speech interpretation and acquisition. This appears to be the case in ASD. As mentioned by Arbib, Broca’s is not the only area making up the human language and MNS. It is thus possible that other regions within the MNS underlie some intact language skills in some ASD individuals (e.g., grammar and syntax), which could in turn partly account for the heterogeneity of the symptoms across individuals.

To that effect, the case of individuals with ASD and normal IQ is particularly relevant to the argument put forth by Arbib. In that population, it is the social and pragmatic aspects of language that are usually impaired, with some individuals displaying normal abilities in, for example, vocabulary and syntax. It appears that abnormal imitation-related cortical activations in ASD with normal IQ are located mostly within the inferior frontal gyrus as opposed to the superior temporal sulcus and inferior parietal lobule (Nishitani et al. 2004). In light of these neurophysiological results, it may be that modularity of the MNS can account for differential language symptomatology, with Broca’s area being the principal component of the system.

Blindness. Another pathological condition that may add some insight into the perspective offered by Arbib is congenital blindness. It has been suggested that congenitally blind individuals display autism-like characteristics (Hobson & Bishop 2003). For example, visually impaired children perform at lower levels than normal subjects on theory-of-mind tasks (Minter et al. 1998), and blind children are at an increased risk of meeting diagnostic criteria for autism (Brown et al. 1997). Interpretation of these data as suggesting a causal link between sensory deprivation and autism-like characteristics has been challenged (Baron-Cohen 2002), but they nevertheless bring to mind interesting questions regarding ASD, MN function, and language impairment.

Some blind children display fewer periods in which they direct language towards other children and are generally impaired in the social and pragmatic aspects of language (Hobson & Bishop 2003), reminiscent of individuals with ASD. In blind individuals, lack of visual input would derail the normal mechanism matching action perception and execution within the visual system. A motor resonance mechanism could still operate through the auditory modality (Kohler et al. 2002), but in an obviously limited manner due to lack of visual input.

Mechanisms of disorder. We have tried to describe two pathological conditions that offer insight into the role of the MNS in language/communication. We have shown that a breakdown in MN function may be associated with specific language impairments, most notably pragmatic speech. In contrast to the theory put forth by Arbib, these examples speak to the ontogeny, rather than the phylogeny, of language. Nevertheless, they share a striking similarity: the necessity of an adequately “evolved” (as Arbib puts it) MNS to develop the unique ability of human language. Although still speculative, the two conditions we have described suggest different mechanisms that may lead to MNS impairment and associated language deficits.

In the case of blindness, it may be that loss of visual input impairs the normal development of a motor resonance system, thereby leading to language/communication deficits. In that sense, it is an environmental factor that hinders adequate development of the MNS. In ASD, where genetic factors are an important part of the etiology, individuals may be born with a dysfunctional MNS, preventing normal language and social behavior. In that regard, it is tempting to look at the Forkhead box P2 (FOXP2) gene, located on chromosome 7q, which is believed to be implicated in the acquisition of language (Lai et al. 2001) and may be involved in the human properties of the MNS (Corballis 2004). Most evidence argues against a direct link between autism and FOXP2 (e.g., Newbury et al. 2002), but the idea that MN development may be genetically determined is an intriguing possibility that requires further investigation.

In summary, this commentary highlights the need to test Arbib’s theory against various pathological conditions, either those specific to language (e.g., aphasia) or those which may be associated with MN dysfunction (autism, schizophrenia, Williams’ syndrome). For example, one of the co-morbidities of specific language impairment (SLI) is motor impairment (Hill 2001), suggesting yet another association between motor skill and language dysfunction. It seems obvious to us that specific predictions of Arbib’s model need to be tested this way, as direct evidence in support of some aspects of the theory is lacking.

ACKNOWLEDGMENTS
This commentary was supported by grants from the Natural Sciences and Engineering Research Council of Canada and the Fonds de Recherche en Santé du Québec.

Language is fundamentally a social affair

Justin H. G. Williams
Department of Child Health, University of Aberdeen School of Medicine, Royal Aberdeen Children’s Hospital, Aberdeen AB25 2ZG, Scotland, United Kingdom.
justin.Williams@abdn.ac.uk
http://www.abdn.ac.uk/child_health/williams.hti

Abstract: Perhaps the greatest evolutionary advantage conferred by spoken language was its ability to communicate mentalistic concepts, rather than just extending the vocabulary of action already served by an imitation function. An appreciation that the mirror-neuron system served a simple mentalising function before gestural communication sets Arbib’s theory in a more appropriate social cognitive context.

It may not be an obvious question to ask why spoken language should evolve from gestural communication, but it is an important one. Simply put, if gesture can be used to communicate effectively, why evolve speech? Why didn’t we just evolve a complex gesturing language that did not require changes to the larynx? Arbib has presented a theory of language evolution but has omitted to discuss the selection pressures involved.

According to the Machiavellian intelligence hypothesis (Byrne & Whiten 1988; Whiten & Byrne 1997), the human brain evolved because of the selection pressure to develop cognitive capacities that facilitate social manoeuvring. This would also suggest that language evolved through the need to communicate mental states. The evolution of language would be driven primarily by the need to discuss matters such as loyalty, betrayal, forgiveness, and revenge. Arbib uses few examples to illustrate the content of the language he is discussing; he mentions gestures used to describe flying birds, hunters directing each other, the tastes of food, and the use of fire to cook meat. His argument seems to assume that speech and gesture are used to discuss the physical activities of daily living, rather than to express feelings, desires, or intentions, or to consider the thoughts of conspecifics.

Also, Arbib derives his model of imitation from that proposed by Byrne and Russon (1998) following their observations of leaf-folding by mountain gorillas. This is an imitative task that requires replicating the structural organisation of an action, rather than the mental states driving it. Communicating the knowledge inherent to this skill is a relatively straightforward matter using action demonstration, whereas to describe it using only speech would be more difficult. Conversely, communication concerning invisible mental states may lend itself more to speech than descriptive gesture. Consider for example, “John wrongly thinks that Bob is jealous of me,” or, “you distract John whilst I plot revenge against Bob.” It may be that in the discussion of invisible mental states, speech can add a valuable modality of communication, which may even supplant manual and facial gesture.

Arbib does not mention the possible role of the mirror-neuron system in mentalising, or the importance of this mentalising function in imitation. Imitation involves incorporating a novel action into a pre-existing behavioural repertoire (Whiten et al. 2004). It follows that for this to occur, the observed behaviour must be compared with the existing knowledge of the behaviour. Therefore, imitation requires more than remembering and then replicating the components and organisational structure of an action sequence. Rather, imitation requires that the observer draw on his or her own knowledge of an action exhibited by a model. This includes the observer’s knowledge of the action’s relationships to causes, beliefs, goals, desires, effects, and agency. Only then can the observer understand the role of the action in the model’s behaviour.

Actions are therefore vehicles for the thoughts that shape them, in that thoughts are carried by actions from mind to mind. Both imitation and “simulation theory of mind” involve observing actions or behaviours from a stance of using self-knowledge to predict the mental states behind them (Meltzoff & Decety 2003). This means that both “theory of mind” and imitation depend on relating perceived actions to their motor counterparts (Meltzoff & Prinz 2002). The mirror-neuron system is the prime candidate to serve this
function (Gallese & Goldman 1998), not as the only component, but by providing the original
action-perception links that constitute the evolutionary origins and the developmental core for
social cognitive growth. I suggest that it is the capacity of the mirror-neuron system to represent
an observed action as if it were the behaviour associated with a self-generated mental state,
thereby allowing for attribution of intention (and a secondary representational capacity; see
Suddendorf & Whiten [2001]), rather than its capacity for coding an action’s organisational
structure, which enabled the mirror-neuron system to serve highly flexible imitation and praxis.
The neurodevelopmental disorder of autism is characterised by major developmental impairment of
social cognitive ability, including imitative and mentalising abilities. Another characteristic
feature, that is highly discriminative diagnostically, is the reduced use of all gestures, whether
descriptive, instrumental, emphatic, or facial (Lord et al. 2000). This suggests that the neural
system in humans serving gestural communication is knitted to that serving other social cognition
(Williams et al. 2001). Whether dysfunctional mirror-neuron systems account for this symptom
cluster is still a matter for research, but it seems unlikely that during evolution, language
became more divorced from social cognitive systems once it became spoken. Indeed spoken language
can become divorced from social cognition in autism, when it may be repetitive, stereotyped, and
pragmatically impaired, such that its communicative function is severely impaired. If language did
evolve only as Arbib describes, it could be impaired in a similar manner.
I suggest that the evolution of language from object-directed imitation would have been intimately
tied to the evolution of social communication at the neural level. During early hominid evolution,
the representations being pantomimed through gestural communication (including facial expression)
would have been concerned with mental states, including feelings and desires. Facial and manual
gestures were being used by individuals to express both their own feelings and what they thought
others were feeling. The neural systems serving these functions would form the basis for the
communication of more complex mental states, which would recruit vocal and auditory systems as well
as semantic and planning structures in temporal and frontal lobes.
In summary, I suggest that mirror neurons first evolved within social cognitive neural systems to
serve a mentalising function that was crucial to their praxic role in imitation and gestural
communication. As the evolution of social language was driven through the need to convey and
discuss invisible mental states, and these became increasingly complex, so a vocal-auditory
modality became recruited as an increasingly valuable additional means of communication. This
extended, rather than altered, the fundamentally social nature and function of language, and
maintained its dependence upon social cognitive mechanisms such as secondary representation.

ACKNOWLEDGMENTS
I am grateful to Thomas Suddendorf and Nina Dick-Williams for comments on an earlier draft of this
commentary.

The explanatory advantages of the holistic protolanguage model: The case of linguistic irregularity

Alison Wray
Centre for Language and Communication Research, Cardiff University, Cardiff, CF10 3XB, Wales,
United Kingdom. wraya@cf.ac.uk
http://www.cf.ac.uk/encap/staff/wray.html

Abstract: Our tolerance for, and promotion of, linguistic irregularity is a key arbitrator between
Arbib’s proposal that holistic protolanguage preceded culturally imposed compositionality, and the
standard view that discrete units with word-like properties came first. The former, coupled with
needs-only analysis, neatly accounts for the second-order linguistic complexity that is
rationalised as fuzzy grammaticality, subclass exception, and full irregularity.

Any model of language evolution must explain four basic things:
1. The interface between real-world semantics and the arbitrary phonetic medium: a difficult
problem, particularly if subcortical reflex vocalisations are not the precursor of speech;
2. The capacity for fast and fluent formulations of phonological strings, since this has no obvious
purpose beyond language itself (unless for display);
3. Our ability to express and understand messages that juxtapose many separate meaning features;
and
4. Why languages appear to be unnecessarily complex, relative to the perceived underlying simple
rule systems.
Arbib’s integrated model offers an explanation for the first three by identifying manual dexterity
and imitation, exapted for pantomimic communication, as the conduit between holistic message and
oral articulation. Associating Broca’s area first with grasping and imitation is much more
satisfactory than attributing to it an a priori involvement in language that must then be
independently explained. Indeed, in line with Arbib’s section 8, neurolinguistic and clinical
evidence strongly suggests that linguistic representation in the brain is mapped on the principle
of functional motivation, so language operations are expected to be distributed according to their
primary functions or derivation (Wray 2002a, Ch. 14).¹
However, Arbib’s model also indirectly offers an explanation for point 4. In Arbib’s scenario,
complex meaning existed in holistic expressions before there was a way of isolating and recombining
units. The subsequent application of what Arbib terms “fractionation” (“segmentation” for Peters
[1983], who identified the process in first language acquisition) is viewed as culturally rather
than biologically determined, and consequently, piecemeal and circumstantial rather than uniform
and universal.
On what basis should we favour this proposal over the standard alternative (e.g., Bickerton 1996),
that there have always been discrete units with word-like properties, which became combinable to
create meaning, first agrammatically (protolanguage) and later grammatically? First, we can note
that attributing to our biologically modern ancestors a default capacity for holistic rather than
compositional expression, begs the question: Where is that holistic foundation now? Wray (2002a)
demonstrates that holistic processing, far from being peripheral and inconsequential, is in fact
alive and well and motivating much of our everyday linguistic behaviour.²
But I want to focus mainly on one linguistic phenomenon that has long caused puzzlement and
demanded much explanatory effort: irregularity. It is surely a necessary corollary of the standard
view of language as an ab initio combinatory system that we are predisposed to orderliness, and
that unnecessary complexity and irregularity are an aberrance to be minimised rather than promoted
or protected. Hence, first, we should find that languages attempt to cleanse themselves of
phonological and morphological exceptions, oddities in patterns of lexical collocation, grammatical
restrictions of the sort that demand subcategorisations of word classes, and lexical gaps. For
instance, we would expect the upgrading of adjective subsets that cannot occur predicatively (*The
objection is principal) and attributively (*the asleep boy), and the filling of gaps in lexical
sets, for example, horror/horrid/horrify, terror/*terrid/terrify, candor/candid/*candify (Chomsky
1965, p. 186). Such cleansing does not generally occur. Most irregularity is preserved intact from
one generation to the next. Although regularisation does happen at the margins, it is balanced by
the creation of new irregularities (see below).
Second, children acquiring an L1 that is fully regular and transparent, such as Esperanto, ought to
do so efficiently and perfectly. However, they do not (Bergen 2001). Instead, they introduce
(apparently permanently) irregularities and sub-patterns that render complex the simple system of
the input.
Third, if native speakers naturally develop a full compositional
linguistic system during first language acquisition, we should expect their writing to reflect that
system from the start. This, too, is not the case. In semi-literates, Fairman (e.g., 2000) reports
taket (take it), in form (inform), a quaint (acquaint) and B four (before). Guillaume (1927/1973)
offers semy (c’est mis), a bitant (habitant), a ses (assez) and dé colle (d’école). Thus is speech
transcribed with strikingly little awareness of the grammatical or morphological components that
are supposedly being freely manipulated.
All of these oddities are readily explained if humans are predisposed to treat input and output
holistically where they can, and to engage in linguistic analysis only to the extent demanded by
expressive need (rather than a principle of system) – needs-only analysis (NOA; Wray 2002a,
pp. 130–32). Coupled with a parsimonious approach to pattern identification, NOA will:
a) Prevent the deconstruction of linguistic material that is no longer morphologically active, thus
preserving irregularity;
b) Fence off combinations that are regular but are not observed to be subject to paradigmatic
variation, and maintain them as complete units that cannot be generalised to other cases (as with
the L1 acquisition of Esperanto); in so doing, protect the units from subsequent linguistic change,
so they drift over time through fuzzy semi-regularity to full irregularity;
c) Support, in those who do not subsequently augment their fuzzy, half-formed linguistic system
with formal training through literacy, a tolerance for underspecification and an absence of any
expectation that language is fully composed of atomic lexical units. The bizarre spellings of
semi-literates reflect a direct link between the whole meaning and its phonological form.
In addition, the fractionation of a holistic expression may often result in a “remainder” of
phonological material that cannot be attributed a plausible meaning or function. Yet, because of
(a) and (c), there may well never be a point when that material demands rationalisation – until the
grammarian attempts to explain it in terms of a system it actually stands outside. Unless by
haphazard or imposed hypercorrection, such irregular remainders may never be expunged and, although
vulnerable to certain kinds of change, may persist in the long term, to the puzzlement of analysts
(Wray 2002a) and frustration of adult language learners (Wray 2004).
Therefore, I contend that linguistic irregularity is a source of support for Arbib’s proposal that
compositionality is a choice rather than a fundamental in human language, and that its application
is variable not absolute. Some aspects of what syntacticians are obliged to account for via complex
rules may be no more than detritus from the process of fractionising unprincipled phonological
strings.
If this is so, our challenge, before all the endangered languages disappear, is to recast our
assumptions about prehistorical norms, by establishing what the “natural” balance is between
compositionality and formulaicity in the absence of literacy and formal education. Many
“fundamentals,” such as the word, full classificatory potential, and inherent regularity of
pattern, may come down to culture-centricity (Grace 2002) and the long-standing uneasy attempt to
squeeze square pegs into the round holes of prevailing linguistic theory.

NOTES
1. This position easily supports Arbib’s hypothesis (sect. 1.2) that there would be an
extralinguistic human correlate of the primate mirror system for subcortical reflex vocalisations.
2. It was on the basis of this evidence that I first proposed a holistic protolanguage (Wray 1998;
2000; 2002b), but we avoid circularity since Arbib does not in any sense build his own story upon
my proposal, he only cites it as an independently developed account consistent with his own.

Language evolution: Body of evidence?

Chen Yuᵃ and Dana H. Ballardᵇ
ᵃDepartment of Psychology and Cognitive Science Program, Indiana University, Bloomington, IN 47405;
ᵇDepartment of Computer Science, University of Rochester, Rochester, NY 14627.
chenyu@indiana.edu dana@cs.rochester.edu
http://www.indiana.edu/~dll/
http://www.cs.rochester.edu/~dana/

Abstract: Our computational studies of infant language learning estimate the inherent difficulty of
Arbib’s proposal. We show that body language provides a strikingly helpful scaffold for learning
language that may be necessary but not sufficient, given the absence of sophisticated language in
other species. The extraordinary language abilities of Homo sapiens must have evolved from other
pressures, such as sexual selection.

Arbib’s article provides a complete framework showing how humans, but not monkeys, have
language-ready brains. A centerpiece in hominid language evolution is based on the recognition and
production of body movements, particularly hand movements, and their explicit representation in the
brain, termed the mirror property.
How can we evaluate this proposal? One way is to take a look at infant language learning. The human
infant has evolved to be language-ready, but nonetheless, examining the steps to competency in
detail can shed light on the constraints that evolution had to deal with. In a manner similar to
language evolution, the speaker (language teacher) and the listener (language learner) need to
share the meanings of words in a language during language acquisition. A central issue in human
word learning is the mapping problem – how to discover correct word-meaning pairs from multiple
co-occurrences between words and things in an environment, which is termed reference uncertainty by
Quine (1960). Our work in Yu et al. (2003) and Yu and Ballard (2004) shows that body movements play
a crucial role in addressing the word-to-world mapping problem, and the body’s momentary
disposition in space can be used to infer referential intentions in speech.
By testing human subjects and comparing their performances in different learning conditions, we
find that inference of speakers’ intentions from their body movements, which we term embodied
intentions, facilitates both word discovery and word-meaning association. In light of these
empirical findings, we have developed a computational model that can identify the sound patterns of
individual words from continuous speech using nonlinguistic contextual information and can employ
body movements as deictic references to discover word-meaning associations. As a complementary
study in language learning, we argue that one pivotal function of a language-ready brain is to
utilize temporal correlations among language, perception, and action to bootstrap early word
learning. Although language evolution and language acquisition are usually treated as different
topics, the consistency of the findings from both Arbib’s work and our work does show a strong link
between body and language. Moreover, it suggests that the discoveries in language evolution and
those in language acquisition can potentially provide some insightful thoughts to each other.
Language (even protolanguage) is about symbols, and those symbols must be grounded so that they can
be used to refer to a class of objects, actions, or events. To tackle the evolutionary problem of
the origins of language, Arbib argues that language readiness evolved as a multimodal system and
supported intended communication. Our work confirms Arbib’s hypothesis and shows that a
language-ready brain is able to learn words by utilizing temporal synchrony between speech and
referential body movements to infer referents in speech, which leads us to ask an intriguing
question: How can the mirror system proposed by Arbib provide a neurological basis for a language
learner to use body cues in language learning?
Our studies show quantitatively how body cues that signal intention could aid infant language
learning. Such intentional body movements with accompanying visual information provide a nat-
tem in evolution of the language-ready brain, but mechanisms supporting conversation (R3.1),
motivation (R3.2), and theory of mind (R3.3) must also be taken into account.
Section R4 considers lessons from modeling biological neural networks (R4.1) and evolving
artificial networks (R4.2).
Section R5 reviews the debate over the claim (H2) that the path to protospeech was indirect.
Discussion of co-speech gestures (R5.1) shows how strongly manual gesture and speech are
intertwined. Future work must factor in new data on the auditory system (R5.2). Data on primate
vocalization challenge H2 but do not topple it (R5.3). However, any claim that protosign had a
large head start (if any) on protospeech in the expanding spiral is questionable. The challenge of
evolving a phonological system remains (R5.4).
Section R6 discusses the transition from protolanguage to language with special attention to the
debate on H3, the holophrastic view of language (R6.1). Issues on bringing semantics into MSH
(R6.2) are followed by a brief discussion of H4, emphasizing the cultural basis of grammars (R6.3).
Section R7 revisits the overview in Figure 6 of the TA. Unfortunately, the figure was mentioned
only in one commentary and then only in passing, but I discuss commentary relevant to the issues of
whether anatomically separated regions may share an evolutionary history (R7.1) and how action
planning supplements mirror systems in language evolution (R7.2).
In addition to the commentaries published here in this issue, I had the privilege of receiving
equally fine commentaries from Yoonsuck Choe; Jean-Louis Dessalles & Laleh Ghadakpour; Peter
Dominey; James Hurford; Masao Ito; David Kemmerer; Takaki Makino, Kotaro Hirayama & Kazuyuki Aihara
(Makino et al.); Emese Nagy; Massimo Piattelli-Palmarini & Thomas Bever; Friedemann Pulvermüller;
Andreas Rogalewski, Andreas Jansen, Ann-Freya Foerster, Stefan Knecht, & Caterina Breitenstein
(Rogalewski et al.); Martin Ruchsow; Markus Werning; and Patricia Zukow-Goldring. These
commentaries are posted on the BBSOnline Web site and have been given a fuller Author’s Response.
The supplemental commentaries with Author’s Response are retrievable at the following URL:
http://www.bbsonline.org/Preprints/Arbib-05012002/Supplemental/. I am particularly grateful to the
many commentators whose correspondence allowed me to more fully understand the issues they raised.
I cannot do justice to this “conversation” in 10,000 words here, but hope to develop many of the
issues in Beyond the Mirror: Biology and Culture in the Evolution of Brain and Language, which I am
currently preparing for publication by Oxford University Press.
I use boldface for commentators’ names when responding to the present published commentaries and
italic when discussing (more briefly) the supplemental ones. A commentator’s name followed by the
notation (p.c.) refers to the follow-up correspondence (personal communication), not the original
commentary.

R1.2. Et cetera
A number of interesting points do not fit into the above framework:

R1.2.1. Birds and others. Pepperberg applies my criteria for language readiness to the behavior of
the Grey parrot, Alex, that she has taught to communicate with humans using rudiments of English
speech. Despite the lack of strong neural homologies between parrots, songbirds (Doupe & Kuhl
1999), and primates, we may still hope to model relevant circuitry (R4) to better understand what
allows a neural network to achieve different language-related functions. Fitch notes that some
species may have vocal but not bodily imitation, and vice versa. This is irrelevant to MSH, which
asserts that humans had a particular history. This does not deny that comparative study of neural
mechanisms underlying different forms of imitation may help us better understand the workings of
the human brain – though the closer the homology, the more likely the payoff.
Pepperberg’s assertion that little about my criteria for language readiness is unique to humans
seems a blow to my claim to characterize what allows human children to learn “full language” where
other species cannot. Perhaps there are differences of degree: for example, Alex does not meet my
full criteria for “complex imitation.” What enriches the discussion is that chimpanzees raised in a
human environment can exhibit far more “protolanguage” than their wild cousins – observing animals
in the wild does not define the limits of complexity of their behavior.

R1.2.2. Lateralization. Kaplan & Iacoboni show that motor activation to sight of an action is
typically bilateral, whereas action sounds activate the motor cortex only in the left hemisphere.
This may be related to evolutionary processes that lateralized language. Since lateralization has
been debated extensively in BBS (Vol. 26, No. 2, Corballis 2003a), I will not comment here (but see
R5.3) beyond the observation that, because children who receive a hemispherectomy early enough can
gain fairly good command of language (though comprehension of syntax does show some left-hemisphere
superiority [Dennis & Kohn 1975]), lateralization would seem to be not so much the genetic
specification of different kinds of circuitry in the two hemispheres as a developmental bias which
favors, but does not force, differential development of skills there.

R1.2.3. Sexual selection. Yu & Ballard cite the hypothesis that language is a product of sexual
selection. I am unable to evaluate this hypothesis, but raise two questions: Does sexual selection
function differently in early hominids and early great apes? Why does it not yield stronger
dimorphism between male and female language use?

R1.2.4. Genetic underpinnings. Théoret & Fecteau note attempts to implicate the FOXP2 gene in
language. However, FOXP2 is implicated in many systems, from the gut to the basal ganglia. It has
been argued that because the gene changed only once from mouse to our common ancestor with the great
apes but changed twice in the hominid line, it may hold the key to what distinguishes us from the
great apes. However, the mutation of the gene seen in a number of members of the family KE does not
reverse the two “recent” mutations to yield humans with a chimpanzee-like FOXP2 gene. The KE
language deficits seem more a function of motor problems than proving a causal relation between
changes in FOXP2 and the evolution of the language-ready brain (Corballis 2004). Pepperberg’s
description of the use of expression of the ZENK gene to form a functional map of avian brains for
behavior related both to auditory processing and vocal production and the coupling of this to
neurophysiology, provides an encouraging model for future studies in macaques.

R2. Complex imitation

R2.1. Complex imitation and planning
I hypothesize that the mirror system for grasping evolved in two stages: first to provide feedback
for dexterous manual control, then to underwrite the ability to act with other brain regions to
make information available for interacting with others. Makino, Hirayama & Aihara (Makino et al.)
observe that success in complex imitation requires the ability to recognize the goal of an action
as the basis for mastering the action which achieves that goal. Indeed, Arbib and Rizzolatti (1997)
gave the equation Action = Movement + Goal, and the mirror neuron system (MNS) model recognizes an
action in terms of the goal of successful grasping of an affordance. Complex imitation takes us
further. It rests on recognizing how different actions fit together to achieve various subgoals of
the overall goal.
Bickerton and Prudkov assert that there cannot be imitation unless someone has first created
something to imitate, and that mirror neurons offer no clue as to how totally novel sequences could
have been created. Actually, new skills can emerge by trial and error. The problem is to preserve
them. The data on chimpanzee cultures (Whiten et al. 2001) show how few skills chimpanzees acquire.
I suggest that it is complex imitation that enables humans to move beyond such limited repertoires,
cumulatively ratcheting up the available stock of novel skills.
Complex imitation presupposes a capacity for complex action analysis – the ability to analyze
another’s performance as a combination of actions approximated by (variants of) actions already in
the repertoire. In modern humans, imitation undergirds the child’s ability to acquire language,
whereas complex action analysis is essential for the adult’s ability to comprehend the novel
assemblage of “articulatory gestures” that constitute each utterance of a language. However, the
adult does not imitate this assemblage but rather factors it into the planning of his reply. I
agree with Bridgeman that mirror systems must be supplemented by a planning capability to create,
store, and execute plans for sequences of actions and communicatory acts. These apparent sequences
are the expression of hierarchical structures. In Figure 5 of the TA, interpretation of actions of
others is coupled to planning of one’s own actions; Bridgeman stresses the need for the
complementary evolution of these two capabilities. They underlie perception grammars and production
grammars, mentioned in discussion of Figure 1 of the TA.
Bickerton observes that when someone addresses you, you do not just imitate what they said. True.
The human mirror system creates a representation that can be used for feedback control, imitation
(which monkeys do not exhibit), or generating some appropriate response while inhibiting mimicking.
Only in pathology does this inhibition fail, yielding compulsive imitation (echopraxia; Podell et
al. 2001).

R2.2. Imitation in developmental perspective
Zukow-Goldring sees affordances and effectivities (what the body can do; Shaw & Turvey 1981) as two
sides of the mirror system. By directing the child’s attention to its own effectivities in relation
to affordances, the caregiver narrows the search space for learning, and thus enhances that
learning (Zukow-Goldring 1996). These practices may pave the way to early word learning. The
prolonged period of infant dependency in humans combines with caregiving to provide conditions for
complex social learning.
Neonatal imitation is based on moving single effectors and thus differs from goal-directed
imitation. (Studdert-Kennedy [2002] discusses data consistent with the view that the infant at
first imitates sounds by moving one articulator and only later coordinates articulators.) Social
reciprocity in neonatal imitation (R3.1) may be a necessary precursor for complex imitation,
establishing that “I am like the other.” Biological evolution may have selected for neonatal
imitation as a basis for complex imitation.
Yu & Ballard found that body cues signaling intention can aid word learning in adults, suggesting
the utility of such cues for children. Their computational model reliably associates spoken words
and their perceptually grounded meanings. This model employs “small steps” which, they suggest,
would be accessible by a variety of social species, and “yet they were only traversed by us.”
However, they seem accessible to parrots and bonobos as well as the 2-year-old child – which is why
I emphasize complex imitation.

R3. Complementing complex imitation: Motivation and theory of mind
Conversation, motivation, and theory of mind – and prosody (R5.3) – must all be addressed in a
satisfactory account of the language-ready brain. This requires expanding MSH rather than weakening
it.

R3.1. Conversation
Kotchoubey emphasizes pragmatics, for example, what we say depends on the mental state (R3.3) of
our “hearer.” However, his claim that “We do not use language to transmit information, but to
persuade and motivate” (R3.2) seems a false dichotomy. “Look at this beautiful flower” combines
information – “This flower is beautiful” – and persuasion – “Look at this flower.” Kotchoubey
(personal communication) stresses that his starting point is cooperation between two or more
humans, reinforcing the claims of MSH for relating praxic and communicative actions.
Nagy suggests an innate basis for conversation that precedes its pragmatic function – newborn
infants communicate by using “imitation” right after birth (Nagy & Molnar 2004). She suggests that
language develops from these early intersubjective “conversations” (Trevarthen 2001). The cycle of
turn taking in “imitating” a small repertoire of “almost innate” gestures is crucial in
establishing the social pattern of turn taking (R2.2). (Cf. “motherese”; R5.3.)

R3.2. Motivation
Prudkov downplays complex imitation, arguing that the complexity of languages builds on the ability
of the human brain to construct diverse goals. He suggests that animals can form learned
motivations only when basic drives are activated. However, animals can acquire secondary
reinforcers, and so on. Chimpanzees have the ability to develop non-innate subgoals (e.g., cracking
nuts). The mirror system is well linked to the motivational system in the macaque. The target
article shows that the F5 mirror system for grasping is best understood within the larger
F5-PF-STSa mirror system for manual and orofacial actions. Rizzolatti et al. (2001) observe that
STSa is also part of a circuit that includes the amygdala and the orbitofrontal cortex and so may
be involved in the elaboration of affective aspects of social behavior. Hence, Prudkov’s transition
to “non-innate motivation” may be less proximate for the evolution of the language-ready brain per
se than complex imitation, which made possible the rapid acquisition of new skills.

R3.3. Theory of mind
Fabrega stresses that successful pantomime presupposes social cognition, awareness of self, and
goal-setting – reversing the view of those who attribute self-consciousness to language (Macphail
2000). Fabrega (personal communication) also asks: “What are thoughts beyond internal use of
language”? Arbib (2001a) suggests that there are forms of consciousness common to many mammals, but
that mirror system developments help explain why humans also have forms of consciousness that build
upon, rather than precede language. Development of increasing subtlety of language can feed back
into the nonlanguage system to refine our perceptions and experience.

ers, one must invent terms which can express shadings specific to the new domain.
Indurkhya suggests that the ability to project one’s self into other animals or objects might mark
a crucial transition in hominid evolution. I think this notion is important. Much work on empathy
emphasizes the similarities between self and other – but one must be able to maintain different
models of other agents, adaptively going beyond what is held in common to imagine essential
differences.
Williams and Théoret & Fecteau see autism as providing a window on the role of the mirror system in
ToM and language. (Théoret & Fecteau add analysis of blindness.) Deficits in autism are prominent
in speech associated with social communication, but praxic aspects of language are fairly well
preserved. Perhaps what is affected is not so much language per se as the integration of this with
affect and ToM. Interestingly, autistics may exhibit stereotypic mimicking (which monkeys do not
have). Hence, it must be reiterated that a fully functional human mirror system inhibits mere
repetition (echopraxia and echolalia) and instead relates the perception of perceived actions to
the planning of an appropriate course of action.

R4. Lessons from modeling

R4.1. Biological neural networks
Horwitz, Husain, & Guenther (Horwitz et al.) note the importance of babbling in the development of
spoken and sign languages. The Infant Learning to Grasp Model
Fabrega says that I do not specify how much of the pro- (ILGM; Oztop et al. 2004), mentioned briefly in the TA, is
tosign/protospeech spiral is enough to support the cultural a theory of how “manual babbling” leads to an effective set
evolution of language, and he asks whether “protoculture” of grasps. Arbib and Rizzolatti (1997) discussed the rele-
emerges as the expanding spiral gets underway. I suggest vance of inverse and forward models to MSH, building on
that chimpanzees have protoculture (Whiten et al. 2001), insights of Jordan and Rumelhart (1992) into vocal bab-
but that human culture is qualitatively different, and lan- bling. Ito and Makino et al. also stressed the importance of
guage makes it so. internal models; see Carr et al. (2003), Makino and Aihara
Williams sees the greatest evolutionary advantage con- (2003), Miall (2003), Wolpert et al. (2003), and Ito (2005).
ferred by spoken language as its ability to communicate Because Piatelli-Palmarini & Bever note the problem of
mentalistic concepts (theory of mind, ToM). Williams determining similarity criteria for learning models, it is
stresses selection pressure for social maneuvering where I worth noting that the MNS and ILGM models have “in-
have emphasized physical activities. Surely the two “do- nate” hand-related biases which enable them to acquire a
mains of discourse” complement each other. Williams notes range of grasps without having them built in. MNS input is
the possible role of the mirror neuron system in mentaliz- the hand state relating the hand to the goal affordance of
ing (Gallese 2003; Meltzoff & Decety 2003). We need to in- the object. ILGM acquires grasps whose visual manifesta-
vestigate whether an account can be given of a shared evo- tion MNS is to learn. ILGM has as its basis that the child
lution of “mirror systems” suiting both ToM and complex reaches for a salient object, executes the grasp reflex if pal-
imitation. I hypothesize that the ancestral mirror system for mar contact is made, and builds a repertoire of grasps based
manual praxis was distinct from the putative mirror system on those which prove to be stable – stability supplies the re-
for facial expression of emotion. The former would support inforcement signal.
pantomime and thence on to multimodal symbols; and then Dominey models the transformation between semantic
the availability of symbols could enrich the latter to yield structures and grammatical structures. He exploits the de-
rudiments of ToM. velopmental analog of fractionation of holophrases to yield
Indurkhya sees the key to language evolution in an abil- “words” which fill slots thus formed in the holophrase.
ity to see and feels things from another perspective and Dominey et al. (2003) suggest that the resultant categorical
stresses the role of metaphor. Projection restructures the distinction between function and content elements evolved
target by creating a new ontology for it; generalization of first for sensory-motor function and then was exploited for
responsiveness of a mirror neuron may provide a novel on- phrasal-conceptual function. Dominey sees his modeling as
tology for objects and actions that can newly yield this ac- consistent with Ullman’s (2004) declarative/procedural
tivity. The TA shows that pantomime must be supple- model, in which the mental lexicon depends on temporal-
mented by conventional gestures to yield protosign. Within lobe substrates of declarative memory, while mental gram-
language itself, metaphor broadens our language by ex- mar depends on a “procedural” network of frontal, basal-
tending (proto)words to new contexts. In some cases, con- ganglia, parietal, and cerebellar structures supporting
text is enough to recapture the shade of meaning. In oth- learning and execution of motor and cognitive skills.
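The ILGM learning loop summarized earlier in R4.1 (reach for a salient object, grasp reflexively on palmar contact, keep the grasps that prove stable) can be illustrated with a toy sketch. This is only an illustration of stability acting as a reinforcement signal; it is not the model of Oztop et al. (2004). The discrete hand apertures, the object size, the contact probability, and the function name `toy_manual_babbling` are all hypothetical choices made for this example.

```python
import random


def toy_manual_babbling(n_trials=2000, seed=0):
    """Toy 'manual babbling' loop (NOT the ILGM of Oztop et al. 2004).

    All numbers below are assumptions made for illustration only.
    """
    rng = random.Random(seed)
    apertures = [0.2, 0.4, 0.6, 0.8, 1.0]  # hypothetical grasp parameter
    object_size = 0.6                       # hypothetical object width
    value = {a: 0.0 for a in apertures}     # learned preference per grasp

    for _ in range(n_trials):
        # Reach: usually repeat the best-valued grasp, sometimes babble.
        if rng.random() < 0.2:
            aperture = rng.choice(apertures)
        else:
            aperture = max(apertures, key=lambda a: value[a])

        # The grasp reflex fires only on palmar contact (assumed 80% of reaches).
        if rng.random() > 0.8:
            continue

        # Stability supplies the reinforcement signal.
        stable = 1.0 if abs(aperture - object_size) < 0.05 else 0.0
        value[aperture] += 0.1 * (stable - value[aperture])

    # The repertoire is the set of grasps that proved reliably stable.
    repertoire = [a for a in apertures if value[a] > 0.5]
    return value, repertoire


if __name__ == "__main__":
    learned_value, learned_repertoire = toy_manual_babbling()
    print(learned_repertoire)
```

In this sketch no grasp is built in: the only "innate" bias is the set of candidate apertures, and the repertoire emerges from which reaches happen to yield a stable grasp.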
Horwitz et al. model how learning an auditory target for each native language sound may occur via a mirror neuron system. Guenther (p.c.) notes that in this modeling, the perceptual system organizes largely independently of the motor system, whereas motor development relies very heavily on the auditory perceptual system.
Horwitz et al. emphasize the importance of combining neural modeling with neurophysiological and brain imaging data. Horwitz and Tagamets (2003) and Arbib et al. (1994) developed techniques for using models of primate neurophysiological data to predict and analyze results of human brain imaging. Arbib et al. (2000) analyze imitation of motor skills, relating human brain imaging to data on the macaque mirror system.
R4.2. Evolving artificial networks
We turn to using “simulated evolution” to obtain versions of initially unstructured networks whose parameters fit them to better perform or learn a class of actions. (Cangelosi & Parisi [2002] and Briscoe [2002] include papers using artificial evolution to probe constraints on language, but almost all the papers are far removed from neurobiology.)
Parisi, Borghi, Di Ferdinando & Tsiotas (Parisi et al.) support MSH through computer simulations and behavioral experiments with humans which suggest that seeing objects or processing words referring to objects automatically activates canonical actions that we perform on them. Parisi et al. (p.c.) point out that because actions are the only inter-individually accessible aspect of behavior, interpreting meanings in terms of actions might explain how meanings can be shared.
Borenstein & Ruppin evolve networks in which evolution of imitation promotes emergence of neural mirroring. However, primate data suggest that neural mirroring preceded imitation in human evolution. Borenstein (p.c.) responds that the key point is that in their system only the evolution of imitation was solicited, yet a mirror system emerged – suggesting that the link between imitation and mirroring may be universal.
Pulvermüller opts for specifying the putative neural circuits of syntax and recursion first, and thinking about possible evolution later. In his model (Pulvermüller 2002), words are represented by distributed cell assemblies whose cortical topographies reflect aspects of word meaning; these assemblies are formed by correlation learning and anatomical constraints. Syntactic rules emerge from the interplay between sequence detectors and general principles of neuronal dynamics. My concern is that this approach is so focused on symbols that it is ill suited to grounding an evolutionary approach to neurolinguistics.
Fagg and Arbib (1992) modeled the surprising speed with which monkeys could learn to associate a visual pattern and a motor response. This led us to distinguish Stage I from Stage II learning. Stage I may take months of shaping for the monkey to learn a general task like “associate a new visual pattern on this screen with the correct pull on the lever in front of you and you will get a reward.” In Stage II, the monkey knows the task, and then takes only seven or so trials to stabilize the correct response to a novel visual pattern (Mitz et al. 1991). My concern with models using small neural networks is that the search space is so restricted that Stage I is no longer necessary. As one approaches the more biological models of R4.1 one must constrain evolutionary models to provide insights that link to the anatomy and neurophysiology of real brains. A key challenge for MSH-related modeling is to understand how to “evolve” from a brain for which Stage I learning can never yield the ability to learn language, to a human brain in which perhaps 2.5 years of learning is required for Stage I to make possible the “word explosion” which distinguishes the human infant from the chimpanzee.
R5. The path to protospeech
Several commentaries concerned H2: protosign provides scaffolding for protospeech. Those who take a “speech only” approach ignore the fact that language is multimodal. However, future work on MSH needs greater attention to the auditory system.
R5.1. Co-speech gestures
McNeill, Bertenthal, Cole & Gallagher (McNeill et al.) show that speech and “gesticulations” form a single system – a “gesticulation” (Kendon 1988) is a motion that embodies a meaning relatable to the accompanying speech. About 90% of gesticulations synchronize with the speech segments with which they are co-expressive. I disagree with McNeill et al.’s claim that gesture and speech must convey the same idea unit. Kita and Özyürek (2003) compared speech-gesture coordination in Turkish, Japanese, and English descriptions of an animated cartoon. Gestures used to express motion events were influenced by how features of motion events were expressed in each language, but also by spatial information that was never verbalized. However, the key point is that gesticulation is truly part of language.
McNeill et al. reject the claim that language started as a gesture language that was supplanted by speech and stress the importance of a close coupling between manual and vocal action. However, they suggest that my concept of an expanding spiral of protosign and protospeech does not go far enough. They advocate the evolution of a speech-gesture system in which speech and gesture evolved in lockstep. This criticism may be mistaken. Gesticulations are part of language, not protolanguage. By contrast, protosign may indeed have had a pantomimic base, with protosign scaffolding protospeech.
In any case, McNeill et al. establish that protolanguage was multimodal and that gesture was not “turned off” in evolution. Relating this to brain function, McNeill et al. offer the telling example of a man who can control his limb movements only through arduous visually guided attentional control, yet can still gesticulate while speaking even when he cannot see his hands. Kemmerer describes a brain-damaged subject, with intact semantic and grammatical knowledge of motion events, whose ability to retrieve the phonological forms of concrete nouns, action verbs, and spatial prepositions was severely impaired but whose ability to produce gestures with language-typical information packaging was mostly preserved (Kemmerer et al., submitted).
Emmorey concedes that the existence of modern sign languages might seem to support my hypothesis that there was an early stage in the evolution of language in which communication was predominantly gestural. However, she rejects this view because “the only modern communities in which a signed language is dominant have deaf members.”
However, there are communities of hearing people using a signed language, albeit not their primary one (Kendon 1988). Emmorey suggests that sign languages can tentatively be traced back only 500 years, but such historical estimates are suspect. For example, http://www.ASLinfo.com/trivia.cfm says that by A.D. 530 Benedictine monks had invented signs to circumvent their vow of silence.
Emmorey asserts that “If communicative pantomime and protosign preceded protospeech, it is not clear why protosign simply did not evolve into sign language.” MacNeilage & Davis suggest that I am “vulnerable” because I posit an open pantomimic protosign stage, whereas Hockett (1978) asserted that if manual communication had ever achieved openness, we would never have abandoned it for speech. However, I make no claim that protosign by itself realized the full potential of this openness. Emmorey further asserts: “A gestural-origins theory must explain why speech evolved at all, particularly when choking to death is a potential by-product of speech evolution.” First, I do see slight advantages for speech over sign (agreeing with Corballis 2002) or, rather, for the early combination of protospeech with protosign over protosign alone, though this judgment is subjective. Second, Clegg and Aiello (2000) show that the risk of choking is highly overstated: “Mortality statistics for England & Wales . . . [show] that overall mortality from choking on food was very low averaging 0.6 per 100,000 head of population.” Third, just as it is a matter of historical contingency that some tribes have both signed and spoken languages, it may well be that some tribes of early humans had languages dominated by speech and others had protolanguages dominated by sign. At a time of evolutionary bottleneck before humans left Africa 50,000 years ago, speech could have taken a dominant role. The counter-question to Emmorey is then: “If speech has primacy and sign is a modern innovation, how can one explain the ubiquity of co-speech gestures?”
R5.2. Taking the auditory system seriously
Bosman, López & Aboitiz (Bosman et al.), Rauschecker, and Horwitz et al. all make clear that future work on MSH must pay more attention to data on the auditory system than does the TA.
Bosman et al. (personal communication) argue that the biological principles that supported the evolution of mirror neurons for grasping may also have independently supported the evolution of auditory mirror neurons, but they agree that gesture may have helped establish certain semantic aspects of protolanguage by the use of pantomime. Their view is that protosign and protospeech coexisted and coevolved, and each contributed to the development of the other.
Bosman et al. discuss neurons in frontal areas of the monkey that respond strongly to vocalizations and thereby suggest that this domain may be the precursor of a vocalization mirror system similar to the mirror system for grasping. Rauschecker presents further relevant material on the macaque auditory system (R7).
Horwitz et al. note that a human may recognize 10^5 auditory objects, whereas the number of those that interest a monkey seems small. Moreover, monkeys seem far better in vision than audition in the use of long-term memory for objects. They thus argue that biological evolution gave hominids the ability to better discriminate and categorize auditory objects, retain them in memory, and relate them to articulatory gestures. I would agree, while noting that the success of speech has been linked to the ability to form an immense vocabulary from a small set of “phonemes” (the Particulate Principle; Studdert-Kennedy 2002).
R5.3. Primate vocalization
The issue is whether primate calls evolved directly to speech. Seyfarth argues that the parallels between primate same-different judgments for calls and children’s perception of words cannot be ignored. However, such parallels suggest properties of auditory discrimination necessary for protospeech but do not help explain the crucial transition to production of an open-ended repertoire of symbols linked to an open semantics. Seyfarth faults me for characterizing primate vocalizations as “involuntary” signals but Note 6 of the TA addresses this explicitly. Seyfarth shows that I am wrong to deny that primate vocalizations are related to cortical regions, but his data primarily concern audition. Bosman et al. suggest that in the monkey there is overlap between area F5 and the cortical larynx representation, but Gilissen argues that monkey calls cannot be used as models for speech production because they are genetically determined in their acoustic structure. A number of brain structures crucial for the production of learned motor patterns such as speech production are dispensable for the production of monkey calls (Jürgens 1998).
I have been unable to consult Cheney and Seyfarth’s (in press) paper which apparently asserts that primate social knowledge bears striking resemblances to the meanings we express in language, which are built up by combining discrete-valued entities in a structured, hierarchical, rule-governed, and open-ended manner. Though uninformed, I speculate that Seyfarth may be misled by our human ability to offer a language-like description of the primates’ abilities. This is not to deny that prelinguistic knowledge of social relations is relevant to evolving the language-ready brain (R3.3).
Provine discusses contagious yawning and laughing. These seem analogous to the contagious alarm calls of nonhuman primates. He observes that laughter is a ritualization of the sound of labored breathing in rough-and-tumble play – but, presumably, we are talking of biological selection rather than the free symbol creation to which ritualization contributes at the phonological and morphosyntactic level in language (Bybee 2001). Laughter punctuates the speech stream, in contrast with the tight integration of gesticulation and speech (McNeill et al.).
Kotchoubey and Fitch note that my emphasis on cognitive-symbolic aspects of language ignores prosody. Kotchoubey notes that prosody subserves both affective prosody (emotional expression) and linguistic prosody (as in distinguishing between an assertion and a question) and that both forms of prosodic information are processed mainly in the right temporal lobe. In similar vein, Gilissen notes that human vocal behavior does resemble monkey calls in the emotional intonations superimposed on the verbal component. Kotchoubey (p.c.) observes that in many languages, intonation is the only distinction between question and declaration. He thus suggests that linguistic prosody is a part of the right hemisphere so closely controlled by the left that they cannot work without each other. This is reminiscent of the
coupling of gesticulations to the syntax and semantics of a specific language.
Gilissen cites Falk’s (2004a) evolutionary perspective on the hypothesis that, as human infants develop, a special form of infant-directed speech (motherese) provides a scaffold for their eventual acquisition of language. This enriches our discussion of the role of the caregiver in neonatal “conversation” (R3.1). Gilissen says that the special vocalizations of human motherese are in marked contrast to the relatively silent mother/infant interactions that characterize chimpanzees, yet suggests a possible link between monkey calls and motherese. This apparent contradiction suggests that the affective content of motherese (and protolanguage) builds upon the monkey vocalization system, but the information content of motherese (and protolanguage) has a complementary evolutionary history. Kotchoubey suggests that the left-hemispheric subsystem develops as described by MSH to subserve the cognitive-symbolic function, whereas the right-hemispheric subsystem is a direct successor of monkey vocalization mechanisms and gives language its intonational color. It is a long-standing observation (Hughlings Jackson 1878–79) that imprecations survive damage to the human brain that blocks normal speech. In Arbib (2002), I therefore suggested that the language-ready brain integrates action-oriented and affect-oriented systems in a pattern of cooperative computation.
Fitch adopts Darwin’s hypothesis that our prelinguistic ancestors possessed an intermediate “protolanguage” that was musical and that music scaffolds the early structural and imitative aspects of language (prosody). He sees the semantic stage as coming later. However, even if we accept the importance of “musicality,” it does not follow that the coevolution of vocal and manual gesture is tied more closely to music than to pantomime and linguistic communication – but it does encourage us to investigate how dance and music might enrich MSH.
R5.4. Evolving a phonological system
MacNeilage & Davis argue that my view that pantomime could be an open system disregards the view that for language to become an open system it must have a combinatorial phonology consisting of meaningless elements. However, I explicitly distinguish pantomime from protosign. But I do say too little about getting from a pantomimic repertoire to a speech repertoire. The following, adapted from Arbib (in press), may be helpful:
Signing exploits the signer’s rich praxic repertoire of arm and hand movements, and builds up vocabulary by language-sanctioned variations on this multi-dimensional theme (move a hand shape along a trajectory to a particular position while making appropriate facial gestures). By contrast, speech has no rich behavioral repertoire of nonspeech movements to build upon. Instead evolution took a particulate path, so that the word is built (to a first approximation) from a language-specific stock of phonemes (actions defined by the coordinated movement of several articulators, but with only the goal of sounding right rather than conveying meaning in themselves). On this analysis, a basic reach and grasp corresponds to a single word in signed language; whereas in speech, a basic reach and grasp is akin to a phoneme, with a word being one level up the hierarchy. In either case, the brain must provide a computational medium in which already available elements can be composed to form new ones, irrespective of the level at which these elements were themselves defined.
I characterized MacNeilage’s frame/content theory as being about the evolution of syllabification but offering no clue as to what might have linked such a process to the expression of meaning. MacNeilage & Davis note that they now address this criticism by arguing that the first words may have been kinship terms based on baby talk (MacNeilage & Davis, in press b – I received the final version only after the TA was “set in concrete”). I do not deny that words like “mama” and “dada” may have been based on baby talk. But to suggest that this gives us insights into the emergence of protolanguage seems to me to conflate phylogeny and ontogeny – the prototalk of adult hunter-gatherers is unlikely to have been much like baby talk.
For Fabrega, the complexities of speech production seem in excess of what protosign/protospeech spiraling entails. I disagree. Even a protovocabulary of a few hundred protowords would already provide selective advantage for changes in the vocal apparatus which “full” language could exploit without further change. In any case, I insist that the appropriate framework must also explain co-speech gestures.
Kaplan & Iacoboni argue that mirror neurons in premotor cortex that respond to the visual and auditory consequences of actions allow for a modality-independent and agent-independent coding of actions, which may have been important for the emergence of language. Kaplan and Iacoboni (in preparation) found that when subjects simultaneously saw and heard an action, there was greater activity in the premotor cortex compared with control conditions in which they only saw or only heard the action. Rogalewski et al. report the use of transcranial magnetic stimulation (TMS) to show that linguistic tasks, like speaking, covert reading, and listening to speech, activate the hand motor system bilaterally (Floel et al. 2003). Kaplan & Iacoboni argue that the role of audiovisual mirror neurons in the evolution of language deserves more attention. I agree, but suggest that the best framework for this is provided by the expanding spiral hypothesis. In discussing Kohler et al. (2002) and Ferrari et al. (2003), the TA argued that these data do not support the claim that protospeech mechanisms could have evolved from F5 without the scaffolding provided by protosign. This matter is further debated by Fogassi and Ferrari (in press), Arbib (2005), and MacNeilage & Davis (in press b).
R6. From protolanguage to language
R6.1. For and against holophrasis
The hypothesis that protolanguage was based on holophrases was offered as an alternative to the view of protolanguage utterances as strings of “words as we know them.” Fitch supports the holophrase theory of language origin but suggests that Baldwinian exaptations may underlie the first behavioral stages in the transition from holistic communication toward modern language. I accept that the development of an articulatory system adequate to the demands of (proto)language phonology may have involved a Baldwinian effect but doubt Fitch’s claim that the transition to language must have been “strongly and consistently shaped by selection [. . .], given the communicative and
conceptual advantages that a compositional, lexicalized language offers.” Agriculture, writing, and living in cities provide evidence that being advantageous does not imply a genetic change. Because of this I took pains to make clear that one’s account of the evolution of the human brain might be seen as having two very different results: “the language-ready brain” versus “the brain that ‘has’ language.”
Bridgeman argues against holophrasis. He asserts that monkey calls can be paraphrased in one or two words, such as “leopard.” However, the leopard call’s meaning can be better approximated in English by the sentence: “There is a leopard nearby. Danger! Danger! Run up a tree to escape.” To this he might respond, “It’s only one word, because ‘leopard’ is enough to activate the whole thing.” But once one moves to protolanguage, one may want to convey meanings like “There is a dead leopard. Let’s feast upon it,” and we clearly cannot use the alarm call as the word for leopard in this utterance. Bridgeman asserts that the generality of words is about the same in all languages and therefore constitutes a possibly biological “universal” of language. However, it is well known that a word in one language may require a phrase or more to translate into another language. I therefore maintain that the size of words is a result of a long history of building increasingly flexible languages.
Bickerton is so “language-centered” that he gives us little help in imaginatively recreating possible scenarios for a time when hominid protolanguage was at an early stage of development. He asserts that it is “questionable whether any species could isolate ‘a situation’ from the unbroken, ongoing stream of experience unless it already had a language with which to do so.” But we know that biological evolution yielded a repertoire of primate calls each of which is akin to (but very different from) a “protoword” describing a “situation” in my sense, and I have tried to imagine how the brain could have so evolved that such protowords could be “invented” and disseminated in hominid communities. I suggest that early hominids very rarely created protowords for new situations. I require only a slow accretion of such nameable situations in each generation to build towards the critical mass that constituted protolanguage. Bickerton notes that those who play charades use “a large set of ‘disambiguating signs’ – stereotypic gestures for ‘film title,’ ‘book title,’ and so on.” I concede that early hominids had no signs for these! But the crucial point is this: When I posit that there is a protoword for “The alpha male has killed a meat animal and now the tribe has a chance to feast together. Yum, yum!”, I do not claim that (at first) there were protowords for all the variations, such as “The alpha male has killed a meat animal but it’s too scrawny to eat. Woe is we.” I think this point also addresses one half of MacNeilage & Davis’s dismissal of my supposed claim that hominids in the protospeech stage could have “dashed off complex semantic concepts with holistic phonetic utterances” (those are their words not mine). Bickerton cites Tallerman (2004), who argues that holophrasis was incompatible with contrastive phonology, but (as argued above) as the protovocabulary increased, the different protowords (whether signed, spoken, or both) would need to be readily generated and comprehended, and this could provide as much pressure for the particulate principle as does the anti-holophrase position.
Bickerton (p.c.) makes the telling point that I have offered no example of a hypothetical conversation consisting of representations of frequently occurring situations and that any model that will not do conversation (R3.1) is “worse than dubious.” I have given too little thought to this, but suggest that protoconversations may have been like the interactions that we see in nonhuman primates, with a few protowords interspersed among actions, rather than taking – from the start – the form of a steady interchange of protowords.
I accept Bickerton’s argument that it is implausible that all “real words” are foreshadowed by widely distributed fragments of protowords. However, Kirby’s (2000) computer simulation shows that statistical extraction of substrings whose meanings stabilize can yield surprisingly powerful results across many generations. I thus see the “Wray-Kirby mechanism” as part of the story of the protolanguage-language transition, but not the whole one. My sour fruit story, plus aspects of ritualization, provides other mechanisms whereby the virtues of a synthetic description might emerge – with the consequent demand for a protosyntax to disambiguate combinations once the combinatorics began to explode.
Wray notes that neurolinguistic and clinical evidence suggests that linguistic representation in the brain is mapped on the principle of functional motivation (Wray 2002a, Ch. 14). Wray (p.c.) expands on this as follows: When people lose language skills after a stroke, it is common for them to retain formulaic expressions such as “good morning” while they are unable to compose novel messages (cf. R5.3). She focuses on the functions of the material, and proposes that the functional locus supporting a class of lexical material – for example, names for people whose faces we recognize might be activated via the mechanisms that process visual face recognition, whereas expressions used for context-determined phatic interaction would be activated via, say, the right-hemisphere areas that handle context pragmatics – would be linked to the “language” areas of the left hemisphere. Damage to left-hemisphere language areas could block the ability to generate names and expressions on request, but spare the ability to use the words and expressions themselves, if activated as functional wholes.
Wray supports the holophrasis theory by focusing on the linguistic phenomenon of irregularity. She presents a number of “oddities” about language use that are readily explained if humans are predisposed to treat input and output holistically where they can, and to engage in linguistic analysis only to the extent demanded by expressive need. Her formulation bridges between “true wholes” (protolanguage holophrasis) and “apparent compounds” (formulas within a modern language) – supporting our view that the “protolexicon” had many such wholes, rather than combining “words as we know them.” Wray shows that the “holophrastic impulse” remains part of modern language use, even though languages have long supplanted protolanguages in human society.
Bridgeman denies holophrasis, but then asks “How could the sorts of words that cannot be used alone get invented?” and looks for evidence to the development of language in children. He concedes that a child’s first utterances are holophrases but “next comes a two-word slot grammar.” But ontogeny does not equal phylogeny. The child “extracting statistics” from the adult word stream is a far cry from early hominids, for whom very few (proto)words already existed. In modeling the 2-year-old, Hill (1983; Arbib et al. 1987; cf. Peters 1983 for data) sees the child as extracting fragments of the adult’s speech stream to provide a
set of templates that describe situations. At first, “want milk” or “love Teddy” are unanalyzed wholes, but then words common to many templates crystallize out, and word categories follow as it is recognized that certain words can fill similar slots (recall Dominey’s modeling, R4.1). This supports the claim that “holophrasis” is prototypical but that modern communities provide a setting in which the “Wray-Kirby mechanism” can extract existing words in a few years rather than extending the protovocabulary over the course of many generations.
Piattelli-Palmarini & Bever note that although idioms like “he kicked the bucket” may be semantically non-compositional, they do obey strict syntactic constraints. However, it is a mistake to confuse Wray’s observation of the role of formulas in modern language use with the idea that protowords were themselves formulas. We are trying to imagine hominids without syntax and understand how they did get syntax. Hurford asserts that a synthetic evolutionary route to compositional syntax is “simpler” than the analytic (Wray-Arbib) account. But by what measure? Once you have discovered the combinatorial power of using syntax to combine words, then words are simpler. But if you have not done so, then labeling significant events or constructs seems the simpler strategy – which is why I still support the holophrasis theory as a viable alternative to the synthetic account.
R6.2. Bringing in semantics
The issue of semantics was emphasized not only by Indurkhya (see sect. R3.1) but also by Choe, Dessalles & Ghadakpour, Hurford, Pulvermüller, and Werning (supplemental commentaries). Hurford caught me in an embarrassing lapse. When I disagreed with “Hurford’s suggestion that there is a mirror system for all concepts,” it turns out that I was disagreeing not with the ideas published in Hurford (2004) but with Hurford’s preliminary ideas as expressed in early discussions before the paper was written. I am happy to report that we are now in essential agreement about the diversity of perceptual and motor schemas.
Choe cites a thought experiment of Choe and Bhamidipati (2004) to assert that voluntary action can provide meaning to one’s internal perceptual state, and that maintained invariance in the internal perceptual state can serve as a criterion for learning the appropriate action sequence.
Parisi et al. assert “If we interpret not only signs but also their meaning in terms of motor actions, we can understand how meanings can be shared between speakers and hearers. Motor actions are the only aspect of behavior which is interindividually accessible.” However, there is no “reddish action” as such. And we must be careful about jumping to the idea that every gesture mimics a direct action. Nonetheless, I welcome the discussion by Parisi et al. of evidence that language is grounded in action. Studies on the neural basis of cognition suggest that different areas are activated for manipulable and non-manipulable objects (Chao & Martin 2000); manual gestures may be automatically activated not only by visual stimuli but by words, too (Gentilucci 2003a; 2003b); and Borghi et al. (2004) found in a part verification task that responding by moving the arm in a direction incompatible with the part location was slow relative to responding in a direction compatible with the part location. This could explain why Ruchsow argued that MSH is in good agreement with an externalistic account of semantics. The externalist denies that there is any fact about my mental or neural states that constitutes the meaning of my words (Kripke 1982; Wittgenstein 1958). For internalists, conversely, cognition is computation over (symbolic) representations (Kurthen 1992). Ruchsow rejects internalism because it lets us “throw the world away,” allowing reason and thought to be focused on the inner model instead (Clark 1999). Ruchsow finds that many passages in the TA can be read in favor of externalism but sees “some sympathy for internalism” in references to fMRI and PET studies. Actually, I regard both externalism and internalism as incomplete and have sought a framework in which the partial truths of each can be integrated. Arbib and Hesse (1986) expanded my “internal” schema theory of “schemas in the head” to provide a complementary account of “external” schemas that reside in the statistics of social interaction and are thus the expression of socially shared (externalist) knowledge.
Werning confronts the “complex first” paradox: substance concepts are more frequently lexicalized across languages than attribute concepts, and their lexical expressions are ontogenetically acquired earlier. This is hard to reconcile with the view that prototypical substance concepts are semantically complex so that, for example, the substance concept [mango] is made up of the vector of attribute concepts orange, oval, big, soft, sweet, edible, . . .. My solution is to distinguish the distributed code for “orange” as a feature implicit in early visual processing from the neural representation of “orange” as a concept that can be put into words. Werning cites Fodor’s (1995) view of mental representations to argue that it would be logically impossible to have two representations with the same content in one and the same brain. However, data reviewed in section 3.1 of the TA show that the size of an object may have different representations for grasping (dorsal) and “declaring” (ventral). Similarly, the color of an object may be salient in segmenting it from the background or distinguishing it from other objects, yet not enter consciousness. For the child, the redness of his truck is as indissoluble from the truck’s identity as the fact that it has wheels – so the word “truck” may imply “redness” and “wheels” when the child does not have words or well-formed concepts for either.
Dessalles & Ghadakpour stress that an account of the evolutionary emergence of language should “explain why and how our ancestors got minds able to form predicative structures, and to express them through compositional languages.” (See Hurford [2003] for a different approach to this problem.) They point to a crucial distinction between gaining the use of predicative structures to communicate some aspects of a situation and using predicates to “think new thoughts”: “We can systematically express the negative version of a predicative structure, for example, ‘Leo doesn’t grasp the raisin,’ whereas there is no perceptive meaning corresponding to the negation of a visual scene.” I would agree, yet would suggest that one can imagine a series of “inventions” that would build more and more general power into such a capability. At first, the effort to apply certain recognition criteria failing to meet the threshold is what justifies the negation. This requires that context demands that only some tests, of all possible perceptual tests, be applied (cf. presuppositions). I would probably not say to you “There are no onions in my office” unless, for example, you knew I was in my office and had expressed your need of onions. Section 7 of the TA talks of language capabilities being extended by bricolage over a long period of time, and argues that “the language-ready brain” provided by the
genome lacks much of what we now take for granted as parts of language. I view “and,” “not,” and “every” as human inventions augmenting language and thus reshaping thought.
R6.3. Concerning innate universal grammar
Kotchoubey questions my view that the development of language from protolanguage was social by noting that dissemination of “inventions” currently exploits such institutions as writing, hierarchical social organization, and mass media. I respond that it was because early humans did not have such institutions that their inventions would diffuse more slowly by protoword of mouth – and hand. That is why I think that the “cumulative bricolage” that led to the earliest “full” languages may have taken 100,000 years. Kotchoubey (p.c.) responds that the social mechanisms present from the very beginning, for example, socialization in tribes and education in families, are known to be very conservative and to brake progress rather than to promote it. He thus argues that development of the first language rested on biological natural selection.
Kotchoubey notes that degrees of linguistic and genetic similarity between populations correlate, and that the transition from protolanguage to language may have covered 1,500 to 2,000 generations, and so he cannot understand why biological mechanisms should be denied during the evolution of the very first language. Yes, Cavalli-Sforza et al. (1996) attest that the length of isolation of a group yields correlated amounts of drift in both genetic structure and language structure, but there is no suggestion that the genetic changes are linked to the language changes. The counter to my hypothesis, then, would be to offer concrete proposals on how to shift the boundary from my set of criteria for protolanguage further into the language domain of syntax and compositional semantics. But how far? My firm conviction is that the genome does not specify a principles-and-parameters universal grammar, but I could accept that phonological expression of hierarchical structure might require a biological change not within my current criteria for language readiness. (Recall Fitch on Baldwinian exaptation, R6.1.)
Kemmerer supports the view that grammatical categories gradually emerged over hundreds of generations of historical language transmission and change. Linguists identify grammatical categories primarily by formal criteria, but the criteria used in some languages are either completely absent in others or are employed in ways that seem bizarre compared to English. For example, verbs are often marked for tense, aspect, mood, and transitivity, but some languages, such as Vietnamese, lack all such inflection; Makah, on the other hand, applies aspect and mood markers not only to words that are translated into English as verbs, but also to words that are translated into English as nouns or adjectives. Croft (e.g., 2001) addresses such quandaries by “construction grammar,” seeking to identify the grammatical categories of individual languages according to the constructions unique to those languages. Of course, one may relate these to semantic and pragmatic prototypes: prototypical nouns specify objects and have referential functions, prototypical verbs specify actions and have predicative functions, and so on. Such considerations are very much consistent with the notion that languages (not Language-with-a-capital-L) evolved culturally through bricolage within many communities and diffusion across communities (Aikhenvald & Dixon 2002; Dixon 1997; Lass 1997; Ritt 2004).
R7. Towards a mirror-system based neurolinguistic model
Only one commentary touched on Figure 6 of the TA, but I gather here a number of comments relevant to the program it exemplified.
R7.1. Anatomically distinct regions may share an evolutionary history
Figure 6 of the TA offered a highly conceptual extension of the FARS model to include the mirror system for grasping and the language system evolved “atop” this. I see the various circuits as evolutionary cousins but do not require that the same circuitry subserves them. Given this, I am grateful for the review by Barrett, Foundas & Heilman (Barrett et al.) of functional and structural evidence supporting differential localization of the neuronal modules controlling limb praxis, speech and language, and emotional communication, but I am puzzled as to why they view these data as justifying rejection of an evolutionary relationship between the underlying mechanisms. Barrett et al. assert that the TA treats different forelimb gesture classes interchangeably, whereas, for example, it cites data from Corina et al. (1992a) which separate pantomime from signing.
There is no space here for a disquisition on mammalian brain evolution. Let me simply refer to Kaas (1993), Butler and Hodos (1996), Krubitzer (1998), and Striedter (2004) for support of the conclusion that increasing complexity of behavior is paralleled by increases in the overall size and number of functional subdivisions of neocortex and the complexity of internal organization of the subdivisions, and that reduplication of circuitry may form the basis for differential evolution of copies of a given system, with differing connectivities, and so on, to serve a variety of functions.
Barrett et al. usefully summarize data showing that in most humans, the left hemisphere may be dominant in the control of vocalization associated with propositional speech, but the right hemisphere often controls vocalization associated with emotional prosody, automatic speech, and singing. Moreover, Kotchoubey notes that although the right temporal lobe is critical for recognition of prosody (R5.3), prosodic aspects of language are also severely impaired in patients with lesions to orbitofrontal cortex (which has links to the mirror system, R3.2) and the corpus callosum. The latter, presumably, is related to integration between the two hemispheres.
Such data must be taken into account in building upon Figure 6 of the TA but do not contradict MSH at its current level of detail.
R7.2. Action planning complements mirror systems
I have already (in R2.1) applauded Bridgeman’s insistence that mirror systems must be supplemented by a planning capability to allow language to evolve. Interestingly, the design of Figure 6 of the TA was motivated in part by the work of Bridgeman. Recall the crucial role of inferotemporal cortex (IT) and prefrontal cortex (PFC) in modulating affordance selection in the FARS model. In the psychophysical experiments of Bridgeman et al. (1997; Bridgeman 1999), an observer sees a target in one of several possible positions, and a frame either centered before the observer or deviated
left or right. Verbal judgments of the target position are altered by the background frame’s position, but “jabbing” at the target never misses, regardless of the frame’s position. The data demonstrate independent representations of visual space in the two systems, with the observer aware only of the spatial values in the cognitive (inferotemporal) system. The crucial point here is that communication must be based on the size estimate generated by IT, not that generated by posterior parietal cortex (PP). Thus, all three paths of Figure 6 of the TA are enriched by the prefrontal system, which combines current IT input with memory structures combining objects, actions, and relationships. (The figure just shows the path IT → DPLF as it may affect Wernicke’s area.)
Bosman et al. disagree with the contrast (Arbib & Bota 2003) between the MSH theory presented in the TA as being “prospective” and the Aboitiz and García (1997) theory as being “retrospective.” Our point was that Aboitiz and García focused on lexicon and syntax and looked at what might support them, without suggesting the intermediate stages that might have emerged through evolutionary pressures before language itself “appeared on the scene.” As noted in the TA, these researchers emphasize working memory, whereas earlier work on MSH failed to do so. Hence the inclusion of working memories in Figure 6 of the TA. Further modeling must also take into account issues discussed in R4.1.
Figure 6 of the TA has auditory input only to area Tpt, whereas Rauschecker notes that auditory objects, including speech sounds, are identified in anterior superior temporal cortex (aST), which projects directly to inferior frontal regions and not along a posterior pathway, as classically assumed. He suggests that aST supports both the decoding of vocalizations in nonhuman primates and the decoding of human speech: “the conclusion is hard to escape that . . . nonhuman primate vocalizations are an evolutionary precursor to human speech sounds” (cf. discussion of Seyfarth, R5.3). However, brain imaging of users of sign language (Emmorey 2002) suggests that the brain regions constituting the perceptual and motor periphery differ between sign (parietal lobe in; manual-facial out) and speech (temporal lobe in; vocal-articulatory out), but that there are large overlap regions assumed to be responsible for syntactic and semantic processing at a level abstracted from the peripheral codes. This is the set of “new” regions which my theory begins to explain. Given this, I would not expect “protospeech” as it “spirally evolves” with “protosign” to invent a whole new periphery, but rather, to co-opt available resources. My impression would be that the auditory system of nonhuman primates is (almost) adequate for speech perception (note the restrictions reviewed by Horwitz et al., R5.2) whereas (R5.4) the motor side needed immense changes to get separable control of vocal articulators to make the sounds of speech (as distinct from primate calls).
R8. Envoi
By way of conclusion, I simply invite the reader to return to section R1.1 of this response and assess my claim that, overall, the extended mirror system hypothesis is alive and well. I believe it has survived most of the criticism that would destroy its key claims, but that the commentaries provide many challenges for linking the evolving mirror system to other systems and for filling in many details that remain too sketchy or have not yet received due attention.
References
Letters “a” and “r” appearing before authors’ initials refer to target article and response, respectively.
Aboitiz, F. & García, V. R. (1997) The evolutionary origin of the language areas in the human brain. A neuroanatomical perspective. Brain Research Reviews 25:381–96. [arMAA, CB]
Aboitiz, F., García, R., Brunetti, E. & Bosman, C. (in press) The origin of Broca’s area and its connections from an ancestral working memory network. In: Broca’s area, ed. K. Amunts & Y. Grodzinsky. Oxford University Press. [CB]
Acardi, A. C. (2003) Is gestural communication more sophisticated than vocal communication in wild chimpanzees? Behavioral and Brain Sciences 26:210–11. [CB]
Adolphs, R., Damasio, H. & Tranel, D. (2002) Neural systems for recognition of emotional prosody: A 3-D lesion study. Emotion 2:23–51. [BK]
Aikhenvald, A. Y. & Dixon, R. M. W. (2002) Areal diffusion and genetic inheritance. Oxford University Press. [rMAA]
Albert, M. L., Goodglass, H., Helm, N. A., Rubens, A. B. & Alexander, M. P. (1981) Clinical aspects of dysphagia. Springer-Verlag. [AMB]
Alissandrakis, A., Nehaniv, C. L. & Dautenhahn, K. (2002) Imitation with ALICE: Learning to imitate corresponding actions across dissimilar embodiments. IEEE Transactions on Systems, Man, and Cybernetics, Part A 32(4):482–96. [BI]
American Psychiatric Association (1994) Diagnostic and statistical manual of mental disorders, 4th edition. American Psychiatric Association. [HT]
Arbib, M. A. (1981) Perceptual structures and distributed motor control. In: Handbook of physiology, section 2: The nervous system, vol. II: Motor control, part 1, ed. V. B. Brooks, pp. 1449–80. American Physiological Society. [aMAA]
(2001a) Co-evolution of human consciousness and language. In: Cajal and consciousness: Scientific approaches to consciousness on the centennial of Ramón y Cajal’s textura, ed. Pedro C. Marijuan. Annals of the New York Academy of Sciences 929:195–220. [arMAA]
(2001b) Computational models of monkey mechanisms for the control of grasping: Grounding the mirror system hypothesis for the evolution of the language-ready brain. In: Simulating the evolution of language, ed. A. Cangelosi & D. Parisi. Springer-Verlag. [aMAA]
(2002) The mirror system, imitation, and the evolution of language. In: Imitation in animals and artifacts, ed. C. Nehaniv & K. Dautenhahn, pp. 229–80. MIT Press. [aMAA]
(2003) Schema theory. In: The handbook of brain theory and neural networks, 2nd edition, ed. M. A. Arbib, pp. 993–98. A Bradford Book/MIT Press. [aMAA]
(2004) How far is language beyond our grasp? A response to Hurford. In: Evolution of communication systems: A comparative approach, ed. D. K. Oller & U. Griebel, pp. 315–21. MIT Press. [aMAA]
(2005) Interweaving protosign and protospeech: Further developments beyond the mirror. Interaction Studies: Social Behaviour and Communication in Biological and Artificial Systems 6:145–71. [arMAA, PFM]
(in press) A sentence is to speech as what is to action? [A Contribution to the Special Issue on Integrative Models of Broca’s Area and the Ventral Premotor Cortex]. Cortex: A Journal Devoted to the Study of the Nervous System and Behavior. [rMAA]
Arbib, M. A. & Bota, M. (2003) Language evolution: Neural homologies and neuroinformatics. Neural Networks 16:1237–60. [arMAA, CB]
Arbib, M. A. & Hesse, M. B. (1986) The construction of reality. Cambridge University Press. [rMAA]
Arbib, M. A. & Rizzolatti, G. (1997) Neural expectations: A possible evolutionary path from manual skills to language. Communication and Cognition 29:393–423. [arMAA, PFM]
Arbib, M. A., Billard, A., Iacoboni, M. & Oztop, E. (2000) Synthetic brain imaging: Grasping, mirror neurons and imitation. Neural Networks 13:975–97. [rMAA]
Arbib, M. A., Bischoff, A., Fagg, A. H. & Grafton, S. T. (1994) Synthetic PET: Analyzing large-scale properties of neural networks. Human Brain Mapping 2:225–33. [rMAA]
Arbib, M. A., Caplan, D. & Marshall, J. C., eds. (1982) Neural models of language processes. Academic Press. [rMAA]
Arbib, M. A., Conklin, E. J. & Hill, J. C. (1987) From schema theory to language. Oxford University Press. [rMAA]
Arbib, M. A., Érdi, P. & Szentágothai, J. (1998) Neural organization: Structure, function, and dynamics. MIT Press. [rMAA]
lesions on auditory short-term memory (STM) in the rhesus monkey. Society for Neuroscience Abstracts 24:1907. [BH]
Scherer, K. R. (1986) Vocal affect expression: A review and a model for future research. Psychological Bulletin 99:143–65. [BK]
Schubotz, R. I. & von Cramon, D. Y. (2003) Functional-anatomical concepts of human premotor cortex: Evidence from fMRI and PET studies. NeuroImage 20 (Suppl. 1):S120–31. [JTK]
Scott, S. K., Blank, C. C., Rosen, S. & Wise, R. J. (2000) Identification of a pathway for intelligible speech in the left temporal lobe. Brain 123 (Pt 12):2400–406. [JPR]
Seddoh, S. A. (2002) How discrete or independent are “affective prosody” and “linguistic prosody”? Aphasiology 16(7):683–92. [BK]
Seltzer, B. & Pandya, D. N. (1994) Parietal, temporal, and occipital projections to cortex of the superior temporal sulcus in the rhesus monkey: A retrograde tracer study. Journal of Comparative Neurology 15:445–63. [aMAA]
Seyfarth, R. M. & Cheney, D. L. (2003a) Meaning and emotion in animal vocalizations. Annals of the New York Academy of Sciences 1000:32–55. [CB]
(2003b) Signalers and receivers in animal communication. Annual Review of Psychology 54:145–73. [RMS]
Seyfarth, R. M., Cheney, D. L. & Marler, P. (1980) Monkey responses to three different alarm calls: Evidence for predator classification and semantic communication. Science 210:801–803. [RMS]
Shallice, T. (1988) From neuropsychology to mental structure. Cambridge University Press. [HF]
Shaw, R. & Turvey, M. (1981) Coalitions as models of ecosystems: A realist perspective on perceptual organization. In: Perceptual organization, ed. M. Kubovy & J. R. Pomerantz, pp. 343–415. Erlbaum. [rMAA]
Simner, M. L. (1971) Newborn’s response to the cry of another infant. Developmental Psychology 5:136–50. [RRP]
Skelly, M., Schinsky, L., Smith, R. W. & Fust, R. S. (1974) American Indian Sign (Amerind) as a facilitator of verbalization for the oral verbal apraxic. Journal of Speech and Hearing Disorders 39:445–56. [AMB]
Smith, E. E. & Jonides, J. (1998) Neuroimaging analyses of human working memory. Proceedings of the National Academy of Sciences U.S.A. 95:12061–68. [CB]
Snowdon, C. T. (1989) Vocal communication in New World monkeys. Journal of Human Evolution 18:611–33. [EG]
(1990) Language capacities of nonhuman animals. Yearbook of Physical Anthropology 33:215–43. [RMS]
Snowdon, C. T., French, J. A. & Cleveland, J. (1986) Ontogeny of primate vocalizations: Models from birdsong and human speech. In: Current perspectives in primate social behavior, ed. D. Taub & F. E. King. Van Nostrand Rheinhold. [RMS]
Speedie, L. J., Wertman, E., Tair, J. & Heilman, K. M. (1993) Disruption of automatic speech following a right basal ganglia lesion. Neurology 43:1768–74. [AMB]
Stoel-Gammon, C. & Otomo, K. (1986) Babbling development of hearing-impaired and normally hearing subjects. Journal of Speech and Hearing Disorders 51:33–41. [BH]
Stokoe, W. C. (2001) Language in hand: Why sign came before speech. Gallaudet University Press. [aMAA]
Striedter, G. (1994) The vocal control pathways in budgerigars differ from those in songbirds. Journal of Comparative Neurology 343:35–56. [IMP]
(2004) Principles of brain evolution. Sinauer Associates. [rMAA]
Studdert-Kennedy, M. (2002) Mirror neurons, vocal imitation and the evolution of particulate speech. In: Mirror neurons and the evolution of brain and language, ed. M. Stamenov & V. Gallese, pp. 207–27. John Benjamins. [rMAA]
Studdert-Kennedy, M. G. & Lane, H. (1980) Clues from the differences between signed and spoken languages. In: Signed and spoken language: Biological constraints on linguistic form, ed. U. Bellugi & M. Studdert-Kennedy. Verlag Chemie. [PFM]
Suddendorf, T. (1999) The rise of the metamind. In: The descent of mind: Psychological perspectives on hominid evolution, ed. M. C. Corballis & S. E. G. Lea, pp. 218–60. Oxford University Press. [HF]
Suddendorf, T. & Corballis, M. C. (1997) Mental time travel and the evolution of the human mind. Genetic, Social, and General Psychology Monographs 123(2):133–67. [HF, PNP]
Suddendorf, T. & Whiten, A. (2001) Mental evolution and development: Evidence for secondary representation in children, great apes, and other animals. Psychological Bulletin 127:629–50. [JHGW]
Supalla, T. (1986) The classifier system in American Sign Language. In: Noun classes and categorization, vol. 7 of Typological studies in language, ed. C. Craig. John Benjamins. [aMAA]
Supalla, T. & Newport, E. (1978) How many seats in a chair? The derivation of nouns and verbs in American sign language. In: Understanding language through sign language research, ed. P. Siple. Academic Press. [aMAA]
Sutton, D. & Jürgens, U. (1988) Neural control of vocalization. In: Comparative primate biology, vol. 4, Neurosciences. Alan R. Liss. [EG]
Tager-Flusberg, H. (1997) The role of theory of mind in language acquisition: Contributions from the study of autism. In: Research on communication and language disorders: Contributions to theories of language development, ed. L. Adamson & M. A. Romski. Paul Brookes. [HT]
Taira, M., Mine, S., Georgopoulos, A. P., Murata, A. & Sakata, H. (1990) Parietal cortex neurons of the monkey related to the visual guidance of hand movement. Experimental Brain Research 83:29–36. [aMAA]
Tallerman, M. (2004) Analyzing the analytic: Problems with holistic theories of protolanguage. Paper presented at the Fifth Biennial Conference on the Evolution of Language, Leipzig, Germany, April 2004. [rMAA, DB]
Talmy, L. (2000) Towards a cognitive semantics, 2 vols. MIT Press. [aMAA]
Théoret, H., Halligan, E., Kobayashi, M., Fregni, F., Tager-Flusberg, H. & Pascual-Leone, A. (2005) Impaired motor facilitation during action observation in individuals with autism spectrum disorder. Current Biology 15(3):R84–85. [HT]
Tian, B., Reser, D., Durham, A., Kustov, A. & Rauschecker, J. P. (2001) Functional specialization in rhesus monkey auditory cortex. Science 292:290–93. [JPR]
Tomasello, M. (1999a) The cultural origins of human cognition. Harvard University Press. [HF]
(1999b) The human adaptation for culture. Annual Review of Anthropology 28:509–29. [aMAA]
Tomasello, M. & Call, J. (1997) Primate cognition. Oxford University Press. [aMAA]
Tonkonogy, J. & Goodglass, H. (1981) Language function, foot of the third frontal gyrus, and rolandic operculum. Archives of Neurology 38:486–90. [AMB]
Tononi, G. & Edelman, G. M. (1998) Consciousness and complexity. Science 282:1846–51. [HF]
Trevarthen, C. (2001) The neurobiology of early communication: Intersubjective regulations in human brain development. In: Handbook on brain and behavior in human development, ed. A. F. Kalverboer & A. Gramsbergen, pp. 841–82. Kluwer. [rMAA]
Tsiotas, G., Borghi, A. & Parisi, D. (in press) Objects and affordances: An Artificial Life simulation. Proceeding of the XVII Annual Meeting of the Cognitive Science Society. [DP]
Tucker, D. M., Watson, R. T. & Heilman, K. M. (1977) Affective discrimination and evocation in patients with right parietal disease. Neurology 17:947–50. [AMB]
Tucker, M. & Ellis, R. (1998) On the relations between seen objects and components of potential actions. Journal of Experimental Psychology: Human Perception and Performance 24(3):830–46. [DP]
(2004) Action priming by briefly presented objects. Acta Psychologica 116:185–203. [DP]
Ullman, M. (2004) Contributions of memory circuits to language: The declarative procedural model. Cognition 92:231–70. [rMAA]
Umiltá, M. A., Kohler, E., Gallese, V., Fogassi, L., Fadiga, L., Keysers, C. & Rizzolatti, G. (2001) I know what you are doing. A neurophysiological study. Neuron 31:155–65. [aMAA]
Ungerleider, L. G. & Mishkin, M. (1982) Two cortical visual systems. In: Analysis of visual behaviour, ed. D. J. Ingle, M. A. Goodale & R. J. W. Mansfield, pp. 549–86. MIT Press. [JPR]
Vaccari, O. & Vaccari, E. E. (1961) Pictorial Chinese-Japanese characters, 4th edition. Charles E. Tuttle. [aMAA]
Valenstein, E. & Heilman, K. M. (1979) Apraxic agraphia with neglect-induced paragraphia. Archives of Neurology 36:506–508. [AMB]
Ventner, A., Lord, C. & Schopler, E. (1992) A follow-up study of high-functioning autistic children. Journal of Child Psychology and Psychiatry 33:489–507. [HT]
Visalberghi, E. & Fragaszy, D. (2002) “Do monkeys ape?” Ten years after. In: Imitation in animals and artifacts, ed. C. Nehaniv & K. Dautenhahn. MIT Press. [aMAA]
Voelkl, B. & Huber, L. (2000) True imitation in marmosets? Animal Behaviour 60:195–20. [aMAA]
Wang, X., Merzenich, M. M., Beitel, R. & Schreiner, C. E. (1995) Representation of species-specific vocalization in the primary auditory cortex of the common marmoset: Temporal and spectral characteristics. Journal of Neurophysiology 74:2685–706. [RMS]
Watson, R. T., Fleet, W. S., Gonzalez-Rothi, L. & Heilman, K. M. (1986) Apraxia and the supplementary motor area. Archives of Neurology 43:787–92. [AMB]
Weydemeyer, W. (1930) An unusual case of mimicry by a catbird. Condor 32:124–25. [WTF]
Wheeler, M. A., Stuss, D. T. & Tulving, E. (1997) Toward a theory of episodic memory: The frontal lobes and autonoetic consciousness. Psychological Bulletin 121(3):331–54. [HF]
Whiten, A. & Byrne, R. W. (1997) Machiavellian intelligence II.: Evaluations and extensions. Cambridge University Press. [JHGW]