CHBPP Chapter8 L2LP
Escudero P. and Yazawa K. (in press). “The Second Language Linguistic Perception
Model (L2LP),” in Amengual, M. (Ed.). The Cambridge Handbook of Bilingual Phonetics
and Phonology. Cambridge: Cambridge University Press. Preprint version 8/09/2023.
Chapter 8
Abstract
In this chapter, we thoroughly describe the L2LP model, its five ingredients to explain speech
development from first contact with a language or dialect (initial state) to proficiency
comparable to a native speaker of the language or dialect (ultimate attainment), and its
empirical, computational, and statistical method. We present recent studies comparing different
types of bilinguals (simultaneous and sequential) and explaining their differential levels of
ultimate attainment in different learning scenarios. We also show that although the model has
the word “perception” in its name, it was designed to also explain phonological development
in general, including lexical development, speech production, and orthographic effects. The
studies reviewed in the chapter include new methods for examining lexical development and
speech production, via implicit word learning and corpus-based analyses respectively, as well
as a novel suprasegmental example of the L2LP SUBSET problem, which was conceptualized
as the reverse of the common NEW scenario where L2 learners are faced with target
contrasts that do not exist in their L1. We also review a recent study on the effect of
bidialectalism on L2 acquisition, showing that the L2LP model’s explanations not only apply
to speakers of multiple languages but also of multiple dialects. Finally, we present other topics
and future directions, including phonetic training, going beyond segmental phonology, and the
formalisation of orthographic effects in phonological development. All in all, the chapter
demonstrates that the L2LP model can be regarded as a comprehensive theoretical,
computational, and probabilistic model or framework for explaining how we learn the
phonetics and phonology of multiple languages (sequentially or simultaneously) with variable
levels of language input throughout the life span.
1
The authors shared first authorship of this chapter, with names listed alphabetically.
8.1 Introduction
Since its original proposal (Escudero, 2005) and following a revision (van Leussen & Escudero,
2015), the Second Language Linguistic Perception model (L2LP) has received increasing
attention. It grew out of and co-evolved with the Bidirectional Phonology and Phonetics
(BiPhon) framework (Boersma, 1998, 2011), which itself is an extension of Optimality Theory
(OT; Prince & Smolensky, 1993).2 Numerous studies have been conducted within the model’s
framework over the last two decades, accumulating evidence for its adequacy in describing,
explaining, and predicting L2 learners’ perceptual patterns. Recent works have also extended
Escudero et al. [2016a]), to other domains of language acquisition (e.g., word learning as in
Escudero, Mulak, and Vlach [2016b] and Escudero, Smit, and Mulak [2022], orthography as
in Escudero, Simon, and Mulak [2014a], Escudero [2015] and Escudero, Smit, and Angwin
[2023], and speech production as in Yazawa et al. [2023] and Liu and Escudero [in press]), and
to other academic disciplines (e.g., language training and curriculum design as in Elvin and
Escudero [2019] and Colantoni et al. [2021]). This chapter aims to illustrate how L2LP can
address a breadth of issues in bilingual phonetics and phonology by reviewing pivotal research
conducted with the model. The focus here is on L2LP, but thorough comparisons with other
models of L2 and bilingual phonetics and phonology can be found in Escudero (2005),
2
While knowledge of OT is not a prerequisite for understanding the content of this chapter, those who wish to
have a brief overview of the elements of OT that can be used to model production and perception grammars can
refer to Boersma and Escudero (2008, p. 379), which motivates the inclusion of phonetic phenomena within the
domain of theoretical phonology.
Before we move on, it is important to note that most studies that have previously been
conducted within L2LP or other models of nonnative speech perception have tended to feature
“naïve” listeners and “L2 learners” with different proficiency levels. Given that within this
volume the term used to define users of two or more languages is “bilingual,” it seems
appropriate to first provide the definitions of a variety of participant groups that have been
Most studies within the L2LP framework have used a control group commonly termed
“monolingual” listeners of the target language. However, even a term that seems simple and
easy to determine has complexities. To clarify the term, Escudero, Sisinni, and Grimaldi
(2014b, p. 1578) defined monolingual listeners or functional monolinguals as those who use
only their L1 in their everyday life, have not resided in a country or region where another
language is spoken for longer than a month, and have received basic classroom L2 instruction
(if at all) by L1-accented teachers focusing on reading and grammar. Such monolinguals can
be regarded as being in their initial state for learning any subsequent language, that is, at the
onset of L2 learning.
In Escudero et al. (2022), an important difference is made between those who use two
languages, commonly referred to as bilinguals, based on their age of acquisition for each
bilinguals, with the former being exposed to their languages from birth and the latter acquiring
an L2 after their L1. Sequential bilinguals are commonly called L2 learners, with the onset of
end state that resembles nativelike performance, this may not be the case for all components of
monolinguals of the two languages, especially in the domain of phonetics and phonology
(Antoniou et al., 2011; Elvin, Tuninetti, & Escudero, 2018a). Below we will see that this
distinction between different types of bilinguals yields differential performance, which will be
L2LP to help familiarize the readers with the model’s key constructs (Section 8.2). This section
also discusses how computational and statistical methods are utilized to provide greater
explanatory adequacy and more specific and testable predictions, since quantification is a
crucial property of the model. We then report on a series of new studies to illustrate L2LP’s
recent approach to lexical development (Section 8.3). These studies shed light on previously
linguistic background influences their prelexical perception and lexical development. Finally,
we address some remaining questions concerning how the model handles important issues such
as the role of orthography, speech production, and applications to curriculum design and
training, including future directions (Section 8.4). The chapter ends with a summary and
Given that L2LP’s theoretical framework is based on “Linguistic Perception” (LP), we start by
outlining the principles of LP in Section 8.2.1, followed by their extension to L2LP in Section
8.2.2. Section 8.2.3 addresses how the model’s theoretical components can be computationally
implemented for explanatory adequacy as well as to formulate specific and testable predictions.
The term “Linguistic Perception” reflects the notion that human speech perception is a
language-specific rather than general auditory process. Escudero (2005, p. 7) defines speech
perception as “the act by which listeners map continuous and variable speech onto linguistic
targets.” Given that the very purpose of speech communication is to understand and to be
understood, the listener’s task is to map the incoming variable acoustic cues (e.g., first formant
or F1, second formant or F2, fundamental frequency, and duration) onto discrete and abstract
structures) to ultimately extract the meaning intended by the speaker. The mapping patterns are
language-specific in nature, since the number of linguistic representations and the use of
acoustic cues vary substantially not only across languages but also across varieties or dialects
Consider, for example, how the acoustic cues of F1 and F2 can map onto vowel
categories. These cues, though physically continuous, should perceptually map to a different
number of discrete categories depending on the language. Native English listeners need to
make a fine-grained mapping of the two cues onto a dozen vowel categories so that they can
identify and distinguish minimal pairs such as “heed,” “hid,” “hayed,” “head,” “had,” “hud,”
“hod,” “hawed,” “hoed,” “who’d,” “hood,” and “heard,” although “a dozen” is a very rough
approximation because the exact number of categories varies across different dialects of
English. The mapping is much less dense for Arabic, which has only three qualitative contrasts
(/i/, /a/, and /u/), though again dialectal variations exist. Languages also exhibit divergent
mapping patterns even when they have the same number of categories. For example, native
listeners of Greek, Hebrew, Czech, Spanish, and Japanese, all of which have a five-vowel
system in their standard varieties, show distinct mapping patterns of the F1 and F2 cues per
optimal perception hypothesis,3 which posits that listeners learn the optimal mapping of
3
The term “optimal” comes from OT and means “the best possible, given the circumstances.”
acoustic cues onto appropriate sound representations that leads to maximum likelihood
behaviour (Boersma, 1998, p. 337). This means that the probability of correctly perceiving the
intended linguistic representation based on the acoustic cues is maximized or, to put it another
is optimal in that it tries to extract as many linguistic representations as required in the language
(e.g., a dozen vowel categories in English, three in Arabic, or five in Greek, Hebrew, Czech,
Spanish, and Japanese, with nonnegligible dialectal differences). It is also optimal in that the
mapping patterns mirror the acoustic cues in the language (e.g., Japanese /u/ is generally more
fronted than Spanish /u/, and so the perceptual usage of the F2 cue differs between the two
languages).
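As a minimal sketch of the maximum-likelihood idea above, assume that each vowel category is a Gaussian distribution over a single cue (F1) and that all categories are equally frequent. The two inventories, their means, and their standard deviations are hypothetical values chosen for illustration, not measurements from any of the cited studies. Under these assumptions, the same F1 value is mapped to a different category depending on the listener's language:

```python
import math

# Hypothetical F1 distributions as (mean, standard deviation) in Hz.
# These values are illustrative assumptions, not measured data.
DENSE = {"i": (280.0, 40.0), "ɪ": (400.0, 40.0), "ɛ": (550.0, 40.0), "æ": (700.0, 40.0)}
SPARSE = {"i": (300.0, 60.0), "a": (750.0, 60.0)}


def gaussian_pdf(x, mean, sd):
    """Probability density of x under a normal distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def perceive(f1_hz, categories):
    """Return the category under which the F1 value is most likely.

    With equal category frequencies (assumed here), picking the most
    likely category maximizes the probability of a correct percept.
    """
    return max(categories, key=lambda c: gaussian_pdf(f1_hz, *categories[c]))


token = 420.0
print(perceive(token, DENSE), perceive(token, SPARSE))
```

In this toy setup, a 420 Hz token is heard as /ɪ/ by the listener with the denser inventory and as /i/ by the listener with the sparser one.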
assumes a general learning device that is responsible for creating representations and adjusting
cue usage, which is computationally implemented (see Section 8.2.3) by the Gradual Learning
Algorithm (GLA; Boersma & Hayes, 2001). An important attribute of the learning device is
statistical information concerning the acoustic cues in the ambient language and gradually
adjusts the mapping patterns based on this information (Boersma, Escudero, & Hayes, 2003),
whereby the resulting perception exhibits what is known as the perceptual magnet effect (Kuhl,
2004). The device is meaning-driven in that it evaluates how the mappings signal lexical
contrasts to determine the number of representations required for optimal perception in the
language. The meaning-driven nature of the device implies that LP goes beyond simple
acoustic-to-category mapping, since sound categories alone are meaningless unless they are
between them, as shown in the current LP model illustration in Figure 8.1 (van Leussen &
Escudero, 2015).
In Figure 8.1, the bottom-level representation, the [auditory] form, refers to the
incoming acoustic signals as they arrive in the peripheral auditory system. The variable
[auditory] form is then mapped to the following /surface/ form, which encodes the listener’s
allophonic details. The /surface/ form is further abstracted into the third, |underlying| form,
which encodes canonical phonemic contrasts that may change the meaning of a word. Finally,
the |underlying| form is connected to the <lexical> form, namely words and morphemes stored
in the mind or brain. These representations, together with the connections between them, are
(cue constraints4) are learned based on the distributions of acoustic values, while the
4
The connections are formulated as “constraints” such as “a value of x on the auditory continuum y should not
be mapped to the phonological category z” because LP, like BiPhon, derives from Stochastic OT. The revised
version of the model uses neural networks for processing with better results for lexical recognition (van Leussen
& Escudero, 2015).
connections between /surface/ and |underlying| forms (phonological constraints) are learned in
Importantly, notice in Figure 8.1 that LP distinguishes prelexical perception and lexical
recognition. Most psycholinguistic models of speech perception agree that lexical recognition
guides perceptual learning, but it remains controversial whether the two processes are
sequential (i.e., bottom-up) or interactive (i.e., bottom-up and top-down). The original LP
model (Escudero, 2005, 2009) held a sequential view where perception precedes recognition,
that is, the outcome of perception is faithfully passed on to recognition. According to this view,
the lexical influences on perception are explained by offline (i.e., post hoc) learning from the
lexicon (see the Merge model; Norris, McQueen, & Cutler, 2000). In contrast, the revised LP
model (van Leussen & Escudero, 2015) allows for an interactive view as well, in which the
lexicon can influence lower-level representations during the online (i.e., ad hoc) processing of
speech (see the TRACE model; McClelland & Elman, 1986). While the pursuit of this matter
is beyond the scope of this chapter, the distinction and connection between prelexical
The Second Language Linguistic Perception model (L2LP) is a conceptual extension of the LP
framework for L2 learners. The model consists of five theoretical ingredients, as shown in
Figure 8.2, where straight arrows represent the ingredients’ sequential nature and curved
Figure 8.2. Five theoretical ingredients of L2LP.
As shown in Figure 8.2, the first ingredient is optimal perception in the listener’s first
language (L1) and the target L2. As mentioned above, LP is language-specific, with the number
of linguistic representations and the mapping of acoustic cues being unique to each language.
This means that optimal perception for one language is not necessarily optimal for another and
vice versa (see Footnote 3 for L2LP’s definition of “optimal”). L2LP proposes that a thorough
analysis of optimal perception in each language, and specifically in each variety or dialect of a
optimal perception define the initial state (Ingredient 2) and the end state or ultimate attainment
development, the focus should be on the acoustic distributions of the target sounds and their
phonemic and allophonic status in the two languages (whether the sounds are lexically
contrastive or not), but other factors such as the quantity and quality of input and the learners’
cognitive capacity and skills are also relevant, as we shall see below.
The second ingredient is the L2 initial state. L2LP’s Full Copying hypothesis, which
derives from the Full Transfer hypothesis (Schwartz & Sprouse, 1996), states that listeners
5
Demonstrations of differential developmental paths can be found depending on the target L2 English dialect
(Escudero & Boersma, 2004) or the learners’ L1 English dialect (Williams & Escudero, 2014). See also Chládková
and Escudero (2012) for dialects of Portuguese, Escudero and Williams (2012) for dialects of Spanish, and
Escudero, Simon, and Mitterer (2012) for dialects of Dutch.
start with a copy or duplicate of their L1 optimal perception at the onset of L2 learning. This
results in the listener having a separate system or grammar for each of their L1 and L2, through
which the sounds in the L1 and L2 are perceived, respectively. Listeners at this stage are called
“naïve” because no L2 learning has taken place yet, and their perception of target language
sounds is commonly called “crosslinguistic” because L2 sounds are filtered by the L1. Note
that both L1 linguistic representations and perceptual mappings are copied, which relates to the
Since the initial L2 grammar is seldom optimal for perceiving L2 sounds because of
mismatches between optimal L1 and L2 perception, learners often struggle with misperception
and miscommunication in the target language. The learners’ goal, then, is to modify the L2
grammar to solve the mismatch. Two kinds of learning tasks are specified for this goal: a
representational task to modify the number of categories (by forming new ones or disposing
of existing ones), and a perceptual task to adjust the acoustic cue usage (by changing the
weighting of FAMILIAR cues and/or creating new mappings of UNFAMILIAR cues).6 L2LP
proposes that three types of learning scenarios emerge depending on the task(s): SIMILAR, NEW,
and SUBSET. These are illustrated with examples in Figure 8.3 and explained in detail in the
paragraphs below.
6
The terms “UNFAMILIAR” and “FAMILIAR” supersede the terms “non-previously categorized” and “already-
categorized” used in Escudero (2005) and other previous publications.
Figure 8.3. Three types of learning scenarios in L2LP.
The SIMILAR scenario occurs when the same number of representations are involved
across the two languages. L1 Canadian English listeners’ perception of the L2 Canadian French
/æ/–/ɛ/ contrast falls into this scenario (Escudero, 2009).7 While Canadian English also has /æ/
and /ɛ/ that differ in both F1 and duration (where /ɛ/ is generally shorter than /æ/), Canadian
French /æ/ and /ɛ/ differ primarily in F1 with little durational difference. The weighting of F1
and duration cues is thus different between the two languages. Consequently, Canadian
English learners of Canadian French tend to misperceive durationally short tokens of L2 /æ/ as
/ɛ/, relying on their higher use of duration cues in the L1. The learners therefore have the
perceptual task of adjusting the nonoptimal cue weighting to minimize the likelihood of L2
misperception. They do not have a representational task in this scenario because no addition or
7
In this example and the rest in Section 8.2, a /surface/ form is assumed to faithfully map to the same |underlying|
form that is associated with relevant <lexical> forms (e.g., /æ/ → |æ| → <man>) for the sake of simplicity.
The NEW scenario occurs when L2 representations outnumber L1 representations.
Unlike the SIMILAR scenario, this scenario poses a representational task because a new sound
category needs to be formed for L2 optimal perception. There are two subscenarios of NEW
that differ in the perceptual task: one that involves an UNFAMILIAR acoustic dimension and the
other that involves only FAMILIAR acoustic dimensions. An example of the UNFAMILIAR NEW
scenario comes from L1 Iberian Spanish listeners’ perception of the L2 Southern British English
/iː/–/ɪ/ contrast (Escudero & Boersma, 2004). This corresponds to a NEW scenario because the
target L2 vowels, which contrast in both F1 and duration, map to the same L1 vowel /i/. The
duration cue is UNFAMILIAR because Spanish does not employ duration for segmental
contrasts.8 The learners’ perceptual task is to create completely new mappings (e.g., long
versus short) on this ‘blank-slate’ or ‘uncategorized’ acoustic dimension. The mappings are
then integrated into an existing category to create new ones (e.g., long /i/ versus short /i/) to
accomplish the representational task. On the other hand, the FAMILIAR NEW scenario occurs
when new perceptual mappings are created along acoustic dimensions already utilized in the
English /ɛ/–/æ/–/ʌ/ contrast (Yazawa, 2020). This is also NEW because the three L2 vowels
map to two L1 vowels /e/ or /a/. A notable difference from the case of Escudero and Boersma
(2004) is that the learners’ L1, Japanese, has phonemic vowel length, unlike Spanish. Given
that all relevant acoustic cues for vowel identity (F1, F2, and duration) are FAMILIAR in the L1,
the perceptual task is to alter the existing mapping patterns along the known acoustic
dimensions. This would result in the splitting of an existing category (e.g., /a/) to yield a new
8
It has been proposed that the use of duration to distinguish nonnative vowel contrasts may be a language-
universal strategy (Bohn, 1995). However, this view has been challenged by behavioural and neurophysiological
studies demonstrating that the use of duration is language-specific in both quantity and nonquantity languages
(Escudero & Boersma, 2004; Escudero, Benders, & Lipski, 2009; Chládková, Escudero, & Lipski, 2015;
Chládková et al., 2022).
Finally, the SUBSET scenario occurs when L1 representations outnumber L2
L2LP is currently the only model that addresses this mapping pattern, which Escudero and
Boersma (2002) termed Multiple Category Assimilation (MCA). Examples of this scenario are
L1 North Holland Dutch listeners’ perception of L2 Iberian Spanish /i/ and /e/ (Boersma &
Escudero, 2008; van Leussen & Escudero, 2015) and L1 Australian English listeners’
perception of L2 Iberian Spanish vowels (Elvin & Escudero, 2019). For Dutch listeners, the
Spanish vowels /i/ and /e/ perceptually map to /i/, /ɛ/, or /ɪ/ in the L1, thus resulting in MCA.
Here, the listeners can have a representational problem where three categories are perceived
instead of two, which could lead to spurious lexical contrasts (i.e., /i/–/ɪ/ or /ɪ/–/ɛ/) in the L2.
Even when they ‘know’ from textbooks that there are only two such vowels in Spanish, they
have a perceptual problem where their L2 initial grammar cannot help automatically mapping
relevant acoustic cues to three categories. Thus, learners have a representational task to unlearn
unnecessary categories and a perceptual task to alter the existing mapping so as not to perceive
them.9
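The three scenarios can be summarized by comparing how many L1 and L2 representations are involved. The following is only a rough sketch of that decision rule; the function name, its arguments, and the boolean flag for an UNFAMILIAR cue are hypothetical conveniences rather than part of the model's formal machinery. The category counts in the usage lines are taken from the examples in the text.

```python
def learning_scenario(n_l1, n_l2, unfamiliar_cue=False):
    """Classify an L2LP learning scenario from category counts.

    n_l1: number of L1 categories the target sounds map to
    n_l2: number of target L2 categories
    unfamiliar_cue: True if a relevant acoustic dimension is not used
                    for such contrasts in the L1 (hypothetical flag)
    """
    if n_l2 > n_l1:
        # L2 representations outnumber L1 representations.
        return "UNFAMILIAR NEW" if unfamiliar_cue else "FAMILIAR NEW"
    if n_l1 > n_l2:
        # L1 representations outnumber L2 representations.
        return "SUBSET"
    return "SIMILAR"


# Canadian English -> Canadian French /æ/-/ɛ/: two categories in each language
print(learning_scenario(2, 2))
# Iberian Spanish /i/ -> Southern British English /iː/-/ɪ/, duration unfamiliar
print(learning_scenario(1, 2, unfamiliar_cue=True))
# Japanese /e/, /a/ -> English /ɛ/-/æ/-/ʌ/, all cues familiar
print(learning_scenario(2, 3))
# Dutch /i/, /ɪ/, /ɛ/ -> Iberian Spanish /i/, /e/
print(learning_scenario(3, 2))
```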
The fourth L2LP ingredient is L2 development, for which the proposal states that L2
learners have Full Access (Schwartz & Sprouse, 1996) to the general learning device of LP
(which is computationally implemented by the GLA; see Section 8.2.3) throughout their
which is distribution- and meaning-driven. Studies have shown that distributional learning has
immediate and long-lasting effects on adult L2 learners (Escudero, Benders, & Wanrooij, 2011;
Escudero & Williams, 2014). The meaning-driven nature of L2 perceptual learning becomes
9
MCA can occur in other types of scenarios as well (e.g., L2 /æ/ mapping to L1 /e/ and /a/ in the NEW scenario)
but is particularly problematic for the SUBSET scenario where listeners hear more words than they are supposed
to. However, there can be cases where acute perception along an acoustic dimension from the L1 leads to positive
L1 transfer in L2 perception, resulting in no spurious lexical contrasts and communication problems. Future
research could explore this possibility.
evident when its relationship with lexical development is considered (Section 8.3). The
hypothesized full access to an L1-like learning device does not guarantee that L2 learning
occurs as quickly and effortlessly as L1 learning, however. In fact, researchers have long noted
that adults progress more slowly than children in L2 perceptual learning. L2LP attributes age
effects to cognitive plasticity, which peaks in youth and then gradually decreases as one gets
older. Crucially, Escudero (2005) also argues that the role of input outweighs that of plasticity,
which explains why learners of the same age and linguistic background may not follow an
identical developmental path, since the quality and quantity of input is modulated by various
factors including motivation. These factors have significant implications for predicting the end
L2LP’s final ingredient proposes that all L2 learners can ultimately acquire L2 optimal
perception regardless of their age, provided that sufficient and appropriate linguistic input is
continuously provided to the learner. This holds true for all three learning scenarios, though
different scenarios can pose different levels of difficulty depending on the number of learning
tasks. Specifically, it is proposed that the NEW scenario is the most difficult, followed by
SUBSET and then by SIMILAR, as forming new categories is considered more difficult than
deleting or reusing existing ones (Escudero, 2005, p. 125). Note that the L1 grammar remains
intact because L2 development occurs in a separate copy of the grammar (see Ingredient 2),
by the higher weight of input for L2 development and L1 maintenance in Ingredient 4. L2LP
thus predicts that learners can attain two separate optimal grammars for the two languages.
This hypothesis may raise questions because bilinguals can show bidirectional interactions
when they code-mix their two languages (Antoniou et al., 2011). L2LP explains such
phenomena with the assumption of gradient and parallel activation of the two grammars, which
derives from Grosjean’s (2001) language mode hypothesis. A recent L2LP study has confirmed
perception modes in L1 Japanese learners of L2 American English, who adapt their cue
weighting (duration versus F2/F1) for vowel perception depending on whether they listen to
English or Japanese (Yazawa et al., 2020). Crucially, in this study, some learners showed L1-
L2 intermediate cue weighting, implying that both grammars were activated to different
degrees, which was also shown previously in Escudero (2009) and Boersma and Escudero
(2008). Within Ingredient 5, it is also proposed that for ultimate attainment and successful
performance, bilinguals need to master language control (Green, 1998) or selective inhibitory
control (Friessen et al., 2015). This proposal can explain individual or group differences in
performance and will be relevant for comparing results of different types of bilinguals in
Section 8.3.2.
perception is acquired, starting from the initial state (Full Copying of L1 optimal perception),
through learning tasks (SIMILAR, UNFAMILIAR/FAMILIAR NEW, and SUBSET) and development
(Full Access to L1-like learning device mediated by input and plasticity), to the end state (L1
and L2 optimal perception activated in different degrees). While these theoretical components
alone can explain and predict the outcome of various L1-L2 learning scenarios, the model’s
described below.
and working of a system of interest, and modelling is the process of building a model. For
example, Escudero’s (2005) work concerned the modelling of L2 speech perception, which
resulted in the L2LP model. A model should be a close approximation to the real system it
represents, incorporating its salient attributes, but it should not be too complex to understand.
through configuring it to virtually experiment with it. Simulations can serve at least two
purposes. First, one can validate a model by implementing it computationally under known
conditions and comparing the output with the real system output. For example, L2LP’s
Ingredient 2 (the initial state) can be tested by simulating a virtual listener who learns to
perceive Spanish as their L1. This virtual performance is then compared to that of real learners’
system under different configurations and over long periods of time, which would be too
expensive or impractical to conduct in the real world. For example, the outcome of specific L2
learning environments can be predicted by reconfiguring the types of input and the learning
period, such as 1, 3, 6, and 18 years of L2 Spanish input fed to L1 Dutch grammar (Boersma
& Escudero, 2008) or a few months versus a few years of L2 English input to L1 Japanese
grammar (Yazawa et al., 2020). The main incentive for computational modelling in L2LP is
thus to provide a direct test for a hypothesis before conducting an empirical study, resulting in
computational methods, the model has often utilized Stochastic OT (Boersma, 1998) and the
GLA (Boersma & Hayes, 2001). More recently, neural networks have been used to extend
these frameworks (van Leussen & Escudero, 2015). Stochastic OT is a probabilistic extension
of OT, which is used to represent the learners’ language-specific grammar. The GLA is an
error-driven algorithm for learning optimal constraint rankings in Stochastic OT that represents
10
“Every theory, after all, is ultimately wrong in some way” (Cutler, 2012, p. xv).
the learning device, which has been shown to outperform other machine learning algorithms
(Escudero et al., 2007). While we do not intend to provide detailed explanations of how
Stochastic OT and the GLA work here, interested readers can find step-by-step instructions for
implementing L2LP with these computational methods in the following studies: Yazawa et al.
(2020) for a SIMILAR scenario, Escudero and Boersma (2004) for an UNFAMILIAR NEW
scenario, and Boersma and Escudero (2008) for a SUBSET scenario. These studies focus mainly
on the acquisition of the cue constraints (see Figure 8.1), hence representing classic L2
perception research (i.e., cue-based segmental category identification and discrimination), but
Boersma (2011) discusses how the phonological and lexical constraints can also be
implemented. See also van Leussen and Escudero (2015) for how the constraints at different
levels may interact and for an implementation using an approach more compatible with neural
networks.
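To give a flavour of how such a simulation works, the following heavily simplified sketch trains a virtual listener on a single cue (duration) with two categories. It is not the implementation used in the studies cited above. The bin size, initial ranking, evaluation noise, plasticity, and input distributions are all assumed values, and every function name is hypothetical. Each cue constraint reads "a duration in this bin is not this category". At evaluation time, Gaussian noise is added to each ranking (the stochastic part), and the category whose constraint ends up ranked lowest is perceived. When the percept differs from the intended category, the GLA-style update raises the constraint violated by the wrong percept and lowers the one violated by the intended category.

```python
import random

CATEGORIES = ("short", "long")
BINS = range(50, 301, 25)  # hypothetical duration bins in ms


def nearest_bin(duration_ms):
    """Discretize a duration onto the closest bin."""
    return min(BINS, key=lambda b: abs(b - duration_ms))


def new_grammar(initial_ranking=100.0):
    """One cue constraint per (bin, category), all ranked equally at first."""
    return {(b, c): initial_ranking for b in BINS for c in CATEGORIES}


def perceive(grammar, duration_ms, rng, noise_sd=2.0):
    """Stochastic evaluation: add noise, then pick the least-penalized category."""
    b = nearest_bin(duration_ms)
    noisy = {c: grammar[(b, c)] + rng.gauss(0.0, noise_sd) for c in CATEGORIES}
    return min(noisy, key=noisy.get)


def gla_update(grammar, duration_ms, intended, perceived, plasticity=1.0):
    """Error-driven update: rankings change only after a misperception."""
    if perceived == intended:
        return
    b = nearest_bin(duration_ms)
    grammar[(b, perceived)] += plasticity  # constraint against the wrong percept rises
    grammar[(b, intended)] -= plasticity   # constraint against the intended form falls


def train(grammar, n_tokens, rng):
    """Feed the learner tokens drawn from assumed input distributions."""
    means = {"short": 100.0, "long": 220.0}  # illustrative means in ms
    for _ in range(n_tokens):
        intended = rng.choice(CATEGORIES)
        duration = rng.gauss(means[intended], 25.0)
        gla_update(grammar, duration, intended, perceive(grammar, duration, rng))


rng = random.Random(0)
grammar = new_grammar()
train(grammar, 5000, rng)
print(perceive(grammar, 100.0, rng, noise_sd=0.0),
      perceive(grammar, 220.0, rng, noise_sd=0.0))
```

Changing the number of tokens or the input distributions passed to the learner is the kind of reconfiguration that lets a simulation approximate different amounts or types of input.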
Other studies have utilized statistical methods to make L2LP’s predictions more
specific (Curtin, Fennell, & Escudero, 2009; Elvin, Vasiliev, & Escudero, 2018b; Elvin,
Williams, & Escudero, 2016). For example, Elvin et al. (2016) applied discriminant analysis
to Australian English vowel production data to predict which acoustic cues (duration, formant
means, and formant changes) would contribute to the identity of /iː/–/ɪ/–/ɪə/ and to what extent.
The analysis, which has been used for assessing cross-linguistic phoneme categorization as
well (Escudero & Vasiliev, 2011), found that /ɪ/ can be durationally distinguished from /iː/ or
/ɪə/ while formant changes were essential for distinguishing /iː/ and /ɪə/. The statistical model
results accurately predicted real Australian English listeners’ perception (Williams, Escudero,
& Gafos, 2018), which resembled simulation results using Stochastic OT and the GLA
(Yazawa, 2020). The key point here is that quantification of theoretical predictions is a crucial
8.3 Explaining Lexical Development within L2LP
We now present a series of new studies to highlight recent advancements within the L2LP
learners, the new studies expand the scope of inquiry to include lexical development in
monolinguals and a wider range of bilingual populations. This is a crucial step forward, given
the importance of lexical recognition for speech communication (see Figure 8.1) and the
diversity of bilinguals worldwide. Although previous studies had shown that L2LP can explain
the interrelation between prelexical perception and lexical development (Escudero, 2005;
Escudero, Broersma, & Simon, 2013; van Leussen & Escudero, 2015), it was assumed that
lexical learning took place via one-to-one mappings between words and their referents. In
Sections 8.3.1 and 8.3.2, we introduce a novel word learning paradigm that more closely
resembles the real world where word-referent mappings are ambiguous, and test the L2LP
proposal for explaining lexical encoding of minimal pairs in different types of bilinguals.
learning paradigm where each novel word is explicitly and unambiguously paired with its
corresponding referent. The method involves a learning phase where participants are presented
with a picture of a novel object in tandem with the object’s auditory form, followed by a testing
phase where they hear one of the learned words and select the corresponding visual object.
Many studies using this paradigm have shown that adults and children can learn minimal pairs,
that is, words that are distinguished by a phonological contrast, in their L1 or in a subsequent
language (Escudero, 2015; Escudero et al., 2013; Escudero, Hayes-Harb, & Mitterer, 2008;
Escudero, Simon, & Mulak, 2014a; Giezen, Escudero, & Baker, 2016; Escudero &
Kalashnikova, 2020). Results also show that word recognition accuracy is linked to how well
the phonological distinction is perceived, confirming the L2LP proposal of a tight relationship
between prelexical perception and lexical development. The mechanism underlying this rapid
word learning ability is commonly called fast mapping (Escudero et al., 2023).
However, this type of explicit and intentional learning does not entirely reflect how the
learning of new words proceeds in more naturalistic and immersive environments. Specifically,
everyday situations typically pose high levels of ambiguity because a novel word may appear
alongside many potential referents (Mulak, Vlach, & Escudero, 2019; Yu & Smith, 2007).
Real-world ambiguity can be resolved by drawing conclusions from statistical regularities across
instances or situations where the same word is presented, a mechanism known as cross-
situational word learning (CSWL; Yu & Smith, 2007; Escudero et al., 2016a, 2016b, 2023).
Studies have shown that adults (Angwin et al., 2022; Escudero et al., 2016a, 2016b; Mulak et
al., 2019) and children (Escudero, Mulak, & Vlach, 2016c; Smith & Yu, 2008; Pino Escobar
et al., 2023) use CSWL to learn words in their L1 and in subsequent languages (Tuninetti,
Mulak, & Escudero, 2020; Escudero et al., 2016b, 2022; Junttila & Ylinen, 2020). Importantly,
CSWL differs from incidental word learning paradigms used in previous L2 vocabulary
learning studies in that CSWL is not only unintentional but also ambiguous (see Escudero et
al., 2023).
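The cross-situational mechanism described above can be illustrated with a minimal sketch. It is not the model or analysis used in the cited studies: it assumes an idealized learner that simply counts word-object co-occurrences, a hypothetical four-word lexicon (the word labels echo examples used in this chapter, while the object names are invented), and perfect discrimination of the word forms.

```python
from collections import defaultdict
from itertools import combinations
import random

# Hypothetical mini-lexicon: word labels echo examples in the text,
# object names are invented for illustration only.
LEXICON = {"bon": "object_A", "ton": "object_B", "dit": "object_C", "dIt": "object_D"}

def make_trials(lexicon, seed=0):
    """Build one ambiguous trial per word pair: two words and two objects,
    with the object order shuffled independently of the word order."""
    rng = random.Random(seed)
    trials = []
    for pair in combinations(lexicon, 2):
        objects = [lexicon[w] for w in pair]
        rng.shuffle(objects)  # word order carries no cue to the mapping
        trials.append((list(pair), objects))
    return trials

def learn(trials):
    """Accumulate word-object co-occurrence counts across all trials."""
    counts = defaultdict(lambda: defaultdict(int))
    for words, objects in trials:
        for w in words:
            for o in objects:
                counts[w][o] += 1
    return counts

def choose_referent(counts, word, candidates):
    """Pick the candidate object that co-occurred most often with the word."""
    return max(candidates, key=lambda o: counts[word][o])

counts = learn(make_trials(LEXICON))
for word in LEXICON:
    print(word, "->", choose_referent(counts, word, list(LEXICON.values())))
```

No single trial reveals a mapping, but each word's true referent is present on every trial in which the word occurs, whereas any other object accompanies it on only one trial, so the counts disambiguate the mappings. Because the sketch assumes that all word forms are perfectly distinguished, it cannot reproduce the lower accuracy reported for minimal pairs; modelling that would require adding perceptual confusability between similar forms.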
Escudero et al. (2016b) were the first to apply the CSWL paradigm to the learning of
minimal pairs, demonstrating that adult monolingual Australian English listeners could track
distinctions in spoken words. Participants were presented with the eight words in Figure 8.4 (left) in
pairs that formed nonminimal pairs (e.g., /bɔn/–/dit/), vowel minimal pairs (e.g., /dit/–/dɪt/), or
consonant minimal pairs (e.g., /bɔn/–/tɔn/) in English. The experiment consisted of learning
and testing phases (Figure 8.4, right). During learning, participants were presented with a series
of trials with two auditory words and two visual objects, without any instruction about the
nature of the task or the correspondence between words and objects. Each trial was ambiguous
because the order of presentation of the auditory words was not synced with that of the visual
objects. Participants were then asked to identify word-object mappings in the test phase.
Performance at test was above chance for all pair types, but accuracy was lower for vowel minimal pairs
than for consonant and nonminimal pairs, with no difference between the latter two. These findings
suggest that phonological-lexical encoding may be weaker for vowels than for consonants at
least in Australian English monolinguals, indicating that this unintentional and ambiguous
paradigm can be used to explore the link between prelexical perception and lexical
Escudero & Hayes-Harb, 2022). The obvious next step was to examine how bilinguals fare at
Most L2LP studies and most studies previously conducted within the field of L2 phonetics and
phonology show that monolinguals outperform sequential bilinguals in their target language,
which leads researchers and observers to conclude that the “optimal” end state in L2 acquisition
is very hard to achieve. Additionally, the idea that monolinguals outperform bilinguals has been
confirmed in most studies on lexical processing (see Gollan & Kroll, 2001 for a review).
Contrary to this common belief, Escudero et al. (2016b) found that sequential bilinguals had
comparable word learning performance to monolinguals when tested in the same CSWL task
described above. One reason for this discrepancy may be that the CSWL task allows sequential
learners to perform well regardless of their linguistic background, yielding results that
differ from those gathered with more conventional tasks and from those predicted by the
L2LP model. However, another CSWL study (Tuninetti et al., 2020) has shown that the
relationship between L1 and L2 phonemes predicts the difficulty with which Australian English
listeners learn Dutch and Portuguese words, indicating that CSWL results are in line with the
L2LP proposal that perceptual difficulty is correlated with word learning and recognition
Alternatively, the composition of the bilingual group may have influenced the results.
That is, the sequential bilinguals in Escudero et al. (2016b) came from diverse linguistic
backgrounds and had diverse onsets of acquisition for their L2 English,11 and including this
monolinguals. To test the L2LP developmental proposal that both perceptual difficulties related
to linguistic background and acquired proficiency play a role in CSWL of minimal pairs, two
studies were conducted using the same method reported in Escudero et al. (2016b), summarized
at the end of Section 8.3.1. The first study tested simultaneous Mandarin-English bilinguals in
Singapore (Escudero et al., 2016a), while the second tested a homogeneous group of sequential
bilinguals with L1 Mandarin who started learning L2 English at school and resided in Shanghai
or Sydney at the time of testing (Escudero et al., 2022). As expected, opposite group results
11 Participants came from a pool of first-year psychology students, which in cosmopolitan cities such as Sydney
includes a majority of international students and students from multilingual households.
were found, where simultaneous bilinguals performed overall better than monolinguals, while
Both results are explained by the bilinguals’ linguistic background and their acquired
the inclusion of language control and selective inhibition as part of ultimate attainment (see
Ingredient 5), which states that with high proficiency and continuous input from both
languages, a bilingual can perform equally to a monolingual of either language. It seems that
the heterogeneous group of sequential bilinguals in Escudero et al. (2016b) had high enough
L2 proficiency and did not activate L1 features that could have negatively affected their
performance. In the case of the simultaneous bilinguals in Escudero et al. (2016a), their
“advantage” for overall word learning is explained by their heightened ability to selectively
inhibit or suppress the irrelevant language, which may enable higher performance when coping
with the ambiguity of a CSWL task. This L2LP proposal is in line with studies showing a
bilingual advantage for simultaneous bilinguals depending on their levels of inhibitory control,
an ability connected to domain-general executive functioning (Friesen et al., 2015; Pino
Escobar, Kalashnikova, & Escudero, 2018). Thus, the L2LP explanation extends the bilingual
advantage to the domain of statistical learning of minimal pairs, which involves the encoding
of phonological distinctions.
tones, due to the pitch variations in the stimuli presented, since no negative evidence against
the use of tonal contrasts was provided in the CSWL task (Escudero et al., 2022). This
possibility may have been enhanced by the words in the study being produced in infant-directed
speech (IDS) because its properties can facilitate the learning of phonetic contrasts in adults
and children (Graf Estes & Hurley, 2013; Golinkoff & Alioto, 1995; Escudero & Williams,
2014).12 However, since IDS has more variable pitch than adult-directed speech across
languages (Igarashi et al., 2013), the English words produced in IDS likely sounded as though
they had different lexical tones to L1 Mandarin ears, challenging their word-referent mappings.
A similar effect was found by Smit, Milne, and Escudero (2022), in whose study participants’
music perception abilities negatively influenced their learning of English vowel minimal pairs
via CSWL, presumably because of their enhanced sensitivity to pitch variations in vowels.
Escudero et al. (2022) explained that hearing tones in the English words could have resulted in
these sequential bilinguals’ MCA of the vowels in the words, leading to a SUBSET scenario
with spurious lexical contrasts and poorer performance. These findings suggest that not only
segmental but also suprasegmental details should be considered in predicting and explaining
vowel perception and word learning (Escudero et al., 2018; Escudero & Kalashnikova, 2020).
The recent studies reviewed above demonstrate that L2LP offers adequate explanations
regarding the relation between speech perception and lexical development in diverse bilingual
populations. Here we review other important issues that the model can explain such as the role
of orthography and speech production (Sections 8.4.1 and 8.4.2), as well as applications to
Many studies have shown that orthography influences speech processing in bilinguals (see
Bassetti, Escudero, & Hayes-Harb, 2015). Studies within L2LP have shown that the availability
12 The IDS nature of the stimuli does not explain the different results in Escudero et al. (2016b) and Escudero et
al. (2022) because the same stimuli were used in both studies. Importantly, L2 learners are exposed to foreign-
directed speech, which shares some of the properties of IDS (Uther, Knoll, & Burnham, 2007). The use of IDS
stimuli was motivated by L2LP’s assumption of similar learning mechanisms for both children and adults, with
input and cognitive plasticity constraints differing with age (see Boersma & Escudero, 2008).
of orthographic forms as input to bilinguals can have various influences on speech perception
and word learning, both facilitative and impeding (Escudero, 2015; Escudero et al., 2008,
2014a; Escudero & Wanrooij, 2010). For instance, Escudero et al. (2014a) demonstrate that
learning, as the orthography of the learners’ dominant language is activated when reading L2
words (Escudero, 2015; Escudero et al., 2008). For both prelexical perception and lexical
recognition, it has been shown that when the learners’ two orthographic systems match,
learning is facilitated, but when they do not, learning is more challenging. Also, CSWL is more
accurate for both monolinguals and bilinguals when words are presented orthographically than
auditorily, suggesting that visual information facilitates unintentional and ambiguous word
learning (Escudero et al., 2023). Thus, the role of orthography is clearly prominent in bilingual
L2LP assumes that bilinguals’ mental lexicon contains phonological and orthographic
representations of speech, based on much previous research attesting to the role of orthography in
bilingual speech processing (Escudero & Wanrooij, 2010; Escudero, 2015; Escudero et al.,
2023). However, how exactly orthography fits within L2LP’s architecture (Figure 8.1) is yet
There have also been attempts to extend L2LP to speech production (Elvin et al., 2018b; Elvin,
Williams, & Escudero, 2020), as the model claims that perception precedes production and is
a prerequisite for the development of production skills (Escudero, 2007, p. 110). Unlike other
5 predicts that L2 learners can ultimately attain optimal perception (and by extension,
production). To test this hypothesis for production, Yazawa et al. (2023) examined 102 adult
speech corpus called J-AESOP (Kondo, Tsubaki, & Sagisaka, 2015). All learners were late
sequential bilinguals who had been learning English since the age of 13 in Japanese schools
and had never lived outside of Japan. Despite the uniform linguistic background, the learners
exhibited diverse levels, with some (if not most) showing near-nativelike productions across
all vowel categories, regardless of the perceptual similarity between particular L1 and L2
sounds. The result is consistent with L2LP’s prediction and provides a promising extension
Liu and Escudero (in press) applied L2LP’s predictions to the influence of dialectal
variation in production, finding that those who speak two dialects of the same L1 had overall
better performance in L2 production tasks than those who speak only one L1 dialect, despite
their similar L2 learning backgrounds. This implies that the divergent performance of different
proposal to focus on each specific variety of a language and suggesting that the proposed
inhibitory control advantages may also apply to the control of two dialects (Section 8.2.2).
Further research can help to better understand how bilingualism and bidialectalism compare.
Finally, L2LP’s theoretical proposals have significant implications for language learning and
training, as detailed by Elvin and Escudero (2019). Specifically, its ingredients can be used to
differences and to predict their further development. The following are just a few examples of
Many studies within L2LP have capitalized on the distributional nature of perceptual
learning to demonstrate that difficult phonetic contrasts can be accurately perceived through
very short exposure to the most frequent sound exemplars of a phonetic continuum (Escudero
et al., 2011). It has been shown that distributional training can enhance the perception of
difficult vowel and tone contrasts (Ong, Burnham, & Escudero, 2015), that individual
differences prior to training modulate success (Wanrooij, Escudero, & Raijmakers, 2013), and
that SIMILAR contrasts are easier to train than NEW contrasts (Chládková, Boersma, &
Escudero, 2022). Escudero and Williams (2014) also demonstrated that the effects of
distributional learning in adult L2 learners can be long-lasting, as they remained over a year
after training.
The CSWL task can be used to teach real words to L2 learners at different
developmental stages. Tuninetti et al. (2020) show that learners can easily learn 12 to 18 words
within a learning session, suggesting that this paradigm could be quite successful for classroom
perception and production exercises for beginner university-level L2 learners of Spanish. The
proposed teaching materials are based on key principles such as a focus on features with high
functional load and shared by most varieties of the target language, which are direct
applications of L2LP.
8.5 Conclusion
L2LP is a comprehensive model of how people learn to perceive, recognize, and produce the
The model has unique strengths such as the powerful computational and statistical mechanisms
for precisely predicting learning outcomes, as well as the ability to explain previously
understudied issues including the bilingual/bidialectal (dis)advantage and the interrelation
between prelexical perception and lexical development, with new studies extending the model
to orthographic influences, speech production, and language training and curriculum design.
We hope that readers will find L2LP useful in deepening their understanding of bilingual phonetics
and phonology and in promoting explanatory adequacy for modelling language acquisition in
[Figures]
References
Angwin, A. J., Armstrong, S. R., Fisher, C., & Escudero, P. (2022). Acquisition of novel word
meaning via cross situational word learning: An event-related potential study. Brain
Antoniou, M., Best, C. T., Tyler, M., & Kroos, C. (2011). Inter-language interference in VOT
Bassetti, B., Escudero, P., & Hayes-Harb, R. (2015). Second language phonology at the
interface between acoustic and orthographic input. Applied Psycholinguistics, 36, 1–6.
Academic Graphics.
Boersma, P. (2011). A programme for bidirectional phonology and phonetics and their
Boersma, P. & Chládková, K. (2011). Asymmetries between speech perception and production
reveal phonological structure. In W.-S. Lee & E. Zee, eds., Proceedings of the 17th
International Congress of Phonetic Sciences. The University of Hong Kong, pp. 328–
331.
Boersma, P., Escudero, P., & Hayes, R. (2003). Learning abstract phonological from auditory
International Congress of Phonetic Sciences. Causal Productions Pty Ltd, pp. 1013–
1016.
Boersma, P. & Hayes, B. (2001). Empirical tests of the Gradual Learning Algorithm. Linguistic
Bohn, O. S. (1995). Cross-language speech perception in adults: First language transfer doesn’t
tell it all. In W. Strange, ed., Speech perception and linguistic experience: Issues in
Chládková, K., Boersma, P., & Escudero, P. (2022). Unattended distributional training can
Chládková, K. & Escudero, P. (2012). Comparing vowel perception and production in Spanish
and Portuguese: European versus Latin American dialects. The Journal of the
Chládková, K., Escudero, P., & Lipski, S. C. (2015). When “aa” is long but “a” is not short:
speakers who distinguish short and long vowels in production do not necessarily encode
Colantoni, L., Escudero, P., Marrero-Aguiar, V., & Steele, J. (2021). Evidence-based design
Colantoni, L., Steele, J., & Escudero, P. (2015). Second Language Speech: Theory and
Curtin, S., Fennell, C., & Escudero, P. (2009). Weighting of vowel cues explains patterns of
Cutler, A. (2012). Native Listening: Language Experience and the Recognition of Spoken
Elvin, J., Tuninetti, A., & Escudero, P. (2018a). Non-native dialect matters: The perception of
Elvin, J., Vasiliev, P., & Escudero, P. (2018b). Production and perception in the acquisition of
Spanish and Portuguese. In M. Gibson & J. Gil, eds., Romance Phonetics and
Elvin, J., Williams, D., & Escudero, P. (2016). Dynamic acoustic properties of monophthongs
and diphthongs in Western Sydney Australian English. The Journal of the Acoustical
Elvin, J., Williams, D., & Escudero, P. (2020). Australian English vs. European Spanish
Escudero, P. (2005). Linguistic perception and second language acquisition: Explaining the
Escudero, P. (2015). Orthography plays a limited role when learning the phonological forms
of new words: The case of Spanish and English learners of novel Dutch words. Applied
Escudero, P., Benders, T., & Lipski, S. C. (2009). Native, non-native and L2 perceptual cue
weighting for Dutch vowels: The case of Dutch, German, and Spanish
Escudero, P., Benders, T., & Wanrooij, K. (2011). Enhanced bimodal distributions facilitate
the learning of second language vowels. The Journal of the Acoustical Society of
& A. H.-J. Do, eds., Proceedings of the 26th Annual Boston University Conference on
Escudero, P. & Boersma, P. (2004). Bridging the gap between L2 speech perception research
Escudero, P., Broersma, M., & Simon, E. (2013). Learning words in a third language: Effects
28(6), 746–761.
Escudero, P. & Hayes-Harb, R. (2022). The Ontogenesis Model may provide a useful guiding
framework, but lacks explanatory power for the nature and development of L2 lexical
Escudero, P., Hayes-Harb, R., & Mitterer, H. (2008). Novel second-language words and
Escudero, P. & Kalashnikova, M. (2020). Infants use phonetic detail in speech perception and
Escudero, P., Kastelein, J., Weiand, K., & van Son, R. J. J. H. (2007). Formal modelling of L1
Escudero, P., Mulak, K. E., Elvin, J., & Traynor, N. M. (2018). “Mummy, keep it steady”:
21(5), e12640.
Escudero, P., Mulak, K. E., Fu, C. S. L., & Singh, L. (2016a). More limitations to
Escudero, P., Mulak, K. E., & Vlach, H. A. (2016b). Cross-situational learning of minimal
Escudero, P., Mulak, K. E., & Vlach, H. A. (2016c). Infants encode phonetic detail during
Escudero, P., Simon, E., & Mitterer, H. (2012). The perception of English front vowels by
North Holland and Flemish listeners: Acoustic similarity predicts and explains cross-
Escudero, P., Simon, E., & Mulak, K. E. (2014a). Learning words in a new language:
Orthography doesn’t always help. Bilingualism: Language and Cognition, 17(2), 384–
395.
Escudero, P., Sisinni, B., & Grimaldi, M. (2014b). The effect of vowel inventory and acoustic
properties in Salento Italian learners of Southern British English vowels. The Journal
Escudero, P., Smit, E. A., & Angwin, A. J. (2023). Investigating orthographic versus auditory
cross-situational word learning with online and lab-based testing. Language Learning,
73(2), 543–577.
Escudero, P., Smit, E. A., & Mulak, K. E. (2022). Explaining L2 lexical learning in multiple
assimilation of Canadian English and Canadian French vowels. The Journal of the
perception: Peruvian versus Iberian Spanish learners of Dutch. The Journal of the
Escudero, P. & Williams, D. (2014). Distributional learning has immediate and long-lasting
Friesen, D. C., Luo, L., Luk, G., & Bialystok, E. (2015). Proficiency and control in verbal
fluency performance across the lifespan for monolinguals and bilinguals. Language,
Giezen, M. R., Escudero, P., & Baker, A. E. (2016). Rapid learning of minimally different
Gollan, T. H. & Kroll, J. F. (2001). Bilingual lexical access. In B. Rapp, ed., The Handbook of
Cognitive Neuropsychology: What Deficits Reveal about the Human Mind. Psychology
Graf Estes, K. & Hurley, K. (2013). Infant-directed prosody helps infants map sounds to
Grosjean, F. (2001). The bilingual’s language modes. In J. Nicol, ed., One Mind, Two
Igarashi, Y., Nishikawa, K., Tanaka, K., & Mazuka, R. (2013). Phonological theory informs
Junttila, K. & Ylinen, S. (2020). Intentional training with speech production supports children’s
learning the meanings of foreign words: A comparison of four learning tasks. Frontiers
Kondo, M., Tsubaki, H., & Sagisaka, Y. (2015). Segmental variation of Japanese speakers’
English: Analysis of “the North Wind and the Sun” in AESOP corpus. Journal of the
Kuhl, P. (2004). Early language acquisition: Cracking the speech code. Nature Reviews
Liu, L. & Escudero, P. (in press). How bidialectalism interacts with cross-language phonetic
McClelland, J. L. & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive
Mulak, K. E., Vlach, H. A., & Escudero, P. (2019). Cross-situational learning of phonologically
Norris, D., McQueen, J. M., & Cutler, A. (2000). Merging information in speech recognition:
Ong, J. H., Burnham, D., & Escudero, P. (2015). Distributional learning of lexical tones: A
Pino Escobar, G., Kalashnikova, M., & Escudero, P. (2018). Vocabulary matters! The
Pino Escobar, G., Tuninetti, A., Antoniou, M., & Escudero, P. (2023). Understanding
Schwartz, B. D. & Sprouse, R. A. (1996). L2 cognitive states and the Full Transfer/Full Access
Smit, E. A., Milne, A. J., & Escudero, P. (2022). Music perception abilities and ambiguous
Smith, L. & Yu, C. (2008). Infants rapidly learn word-referent mappings via cross-situational
Tuninetti, A., Mulak, K. E., & Escudero, P. (2020). Cross-situational word learning in two
Communication, 5, 602471.
Uther, M., Knoll, M. A., & Burnham, D. (2007). Do you speak E-NG-L-I-SH? A comparison
van Leussen, J.-W. & Escudero, P. (2015). Learning to perceive and recognize a second
Wanrooij, K., Escudero, P., & Raijmakers, M. E. J. (2013). What do listeners learn from
Northern and Southern British English. The Journal of the Acoustical Society of
Williams, D., Escudero, P., & Gafos, A. (2018). Spectral change and duration as cues in
Australian English listeners’ front vowel categorization. The Journal of the Acoustical
Yazawa, K. (2020). Testing Second Language Linguistic Perception: A case study of Japanese,
Yazawa, K., Konishi, T., Whang, J., Escudero, P., & Kondo, M. (2023). Spectral and temporal
Yazawa, K., Whang, J., Kondo, M., & Escudero, P. (2020). Language-dependent cue
Yu, C. & Smith, L. B. (2007). Rapid word learning under uncertainty via cross-situational