Lexical Facility
Size, Recognition Speed and Consistency
as Dimensions of Second Language
Vocabulary Knowledge
Michael Harrington
University of Queensland
Brisbane, QLD, Australia
Acknowledgments
I would first like to thank my wife Jan and daughter Bridget for their
forbearance. I am also greatly indebted to John Read for his advice and sup-
port throughout this project. He, of course, is not responsible for the final
outcome. Special thanks to collaborators Thomas Roche, Michael Carey, and
Akira Mochida, and colleagues Noriko Iwashita, Paul Moore, Wendy Jiang,
Mike Levy, Yukie Horiba, Yuutaka Yamauchi, Shuuhei Kadota, Ken Hashimoto,
Fred Anderson, Mark Sawyer, Kazuo Misono, John Ingram and Jenifer Larson-
Hall. Thanks also to Said Al-Amrani, Lara Weinglass, and Mike Powers.
Vikram Goyal programmed the LanguageMAP online testing program
used to collect the data reported here and has served as its long-standing
system administrator; he has been especially valuable to the proj-
ect. Special thanks also to Chris Evason, Director of the University of
Queensland’s (UQ) Foundation-Year program, who has provided encour-
agement and financial support for testing and program development.
Funding support is also acknowledged from Andrew Everett and the UQ
International Education Directorate.
The research reported here has been supported by the Telstra Broadband
Fund and a UQ UniQuest Pathfinder grant for the development of the
LanguageMAP program. Support was also provided by research contracts
from the Milton College (Chap. 9) and International Education Services–
UQ Foundation-Year (Chaps. 8, 9, and 10), and a grant, with Thomas
Roche, from the Omani Ministry of Research.
List of Figures

Fig. 5.4 Elements of the instruction set for the Timed Yes/No Test
Fig. 6.1 Lexical facility measures by English proficiency levels
Fig. 6.2 Median proportion of hits and 95% confidence intervals for lexical facility measures by frequency levels and groups
Fig. 6.3 Median individual mnRT and 95% confidence intervals for lexical facility measures by frequency levels and groups
Fig. 6.4 Median coefficient of variation (CV) and 95% confidence intervals for lexical facility measures by frequency levels and groups
Fig. 7.1 University entry standard study. Mean proportion of hits by frequency levels for written and spoken test results
Fig. 7.2 University entry standard study. Mean response times by frequency levels for written and spoken test results
Fig. 7.3 University entry standard study. Mean CV ratio by frequency levels for written and spoken test results
Fig. 8.1 Combined IELTS dataset: Timed Yes/No Test scores by IELTS overall band scores
Fig. 9.1 Sydney language program study. Comparison of VKsize and mnRT scores with program placement grammar and listening scores across four placement levels
Fig. 9.2 Singapore language program levels. Standardized scores for the lexical facility measures (VKsize, mnRT, and CV) for the VLT and BNC test versions
Fig. 9.3 Singapore language program study. Standardized scores for the lexical facility measures (VKsize, mnRT, and CV) for the combined test by level
Fig. 10.1 Oman university GPA study. Standardized VKsize, mnRT, and CV scores by faculty
Introduction
Research Goals
This book has three goals. The first is to make the theoretical case for lexi-
cal facility. The validity of the construct is established in the first four
chapters by first examining the crucial roles that vocabulary size (Chaps. 1
and 2) and word recognition skill (Chap. 3) play in L2 performance.
The rationale for characterizing size and processing skill jointly as an L2
vocabulary construct, that is, for lexical facility, is then set out in Chap. 4.
This chapter discusses key theoretical and methodological issues that arise
from the proposal. Primary among these is the attempt to treat size and
speed as parts of a unitary construct. Standard practice in the psychomet-
ric tradition has long been to treat the two as separate dimensions.
Human performance has been characterized either as knowledge (also
called power) or speed, the relative importance of each depending on the
kind of performance being measured. Knowledge is seen as the critical
attribute of higher-level cognitive tasks such as educational testing, while
speed is paramount for mechanical tasks such as typing. The lexical
facility account proposes that size (knowledge) and processing skill
(speed) are parts of a single construct.
Part 1
Introduction
1 Size as a Dimension of L2 Vocabulary Skill
Aims
1.1 Introduction
This chapter introduces the field of what will be called vocabulary size
research, an approach based on the simple assumption that the overall
number of words a user knows—the breadth of an individual’s vocabulary
stock—provides an index of vocabulary knowledge. The focus on vocab-
ulary breadth means that little attention is given to what specific words
are known or the extent (or depth) to which any given word is used.
Rather, researchers in the area are interested in estimating the vocabulary
size needed to perform particular tasks in a target language. These tasks
can range from reading authentic texts (Hazenberg and Hulstijn 1996) to
coping with unscripted spoken language (Nation 2006). Size estimates
are used to propose vocabulary thresholds for second language (L2)
instruction, and more generally to provide a quantitative picture of an
individual’s L2 vocabulary knowledge (Laufer 2001; Laufer and
Ravenhorst-Kalovski 2010). The focus here, and in the book in general,
is on the size of recognition vocabulary and the role it plays in L2 use.
The main focus is on the recognition of written language.
Recognition vocabulary is acquired before productive vocabulary and
serves as the foundation for the learning of more complex language struc-
tures. The store of recognition vocabulary knowledge builds up over the
course of an individual’s experience with the language. This knowledge
ranges from the most minimal, as in the case of knowing only that a word
exists, to an in-depth understanding of its meaning and uses. A sparkplug
may be a thingamajig found in a car or, according to Wikipedia, ‘a device
for delivering electric current from an ignition system to the combustion
chamber of a spark-ignition engine to ignite the compressed fuel/air mix-
ture by an electric spark, while containing combustion pressure within
the engine’. Recognition vocabulary knowledge emerges from both
intentional learning and implicit experience, and even the most casual
experience can contribute to the stock of recognition vocabulary knowl-
edge. Repeated exposure to a word also has a direct effect on how effi-
ciently it is recognized.
The notion that knowing more words allows a language user to do
more in the language hardly seems controversial. However, many appar-
ently commonsensical assumptions in language learning are often diffi-
cult to specify in useful detail or to apply in practice (Lightbown and
Spada 2013). Even when evidence lends support to the basic idea, spe-
cific findings introduce qualifications that often diminish the scope and
power of the original insight. This chapter introduces and surveys the
vocabulary size research literature to see how the ‘greater size = better per-
formance’ assumption manifests itself. The methodology used for esti-
mating vocabulary size is first described, and then findings from key
studies are presented.
Size is a quantitative property and therefore requires some unit of mea-
surement. In the vocabulary size approach, it is the single word.

1.2 Estimating Vocabulary Size
What to Count
modification. But this knowledge is only part of the lexicon, which con-
sists of these words in combination with the mostly implicit grammatical
properties that constrain how the words are used. These properties reside
in procedural memory, a system of implicit, unconscious knowledge.
Paradis (2009) makes a distinction between vocabulary and the lexicon to
capture this difference. Vocabulary is the totality of sound–meaning asso-
ciations and is typical of L2 learner knowledge, particularly in the early
stages. The lexicon characterizes the system of explicit and implicit
knowledge that the first language (L1) user develops as a matter of course
in development, and which is developed to varying degrees in more
advanced L2 users. In Paradis’s terms, the lexical facility account relates
strictly to vocabulary knowledge, its measurement, and its relationship to
L2 proficiency and performance.
Last, the pivotal role the single word plays in online processing also
reflects its importance. The word serves as the intersecting node for a
range of sentence and discourse processes that unfold in the process of
reading (Andrews 2008). It is where the rubber meets the road, as it were,
in text comprehension.
The focus on the recognition of single words means that the vocabu-
lary size approach captures only a small part of L2 vocabulary knowledge,
a multidimensional notion comprising knowledge of form, meaning, and
usage. Each word is part of a complex web of relationships with other
words, and this complex network is used to realize the wide range of
expressive, communicative, and instrumental functions encountered in
everyday use. Figure 1.1 depicts the basic elements of word knowledge in
a three-part model adapted from Nation (2013); see also Richards (1976).
The vocabulary size account reduces vocabulary knowledge to the sin-
gle dimension of the number of individual words a user knows, or more
precisely, recognizes. It is about the user’s ability to relate a form to a basic
meaning, whether by identifying the meaning from among a set of alter-
natives, as in the Vocabulary Levels Test (VLT), or merely recognizing a
word when it is presented alone, as in the Yes/No Test. This passive ‘rec-
ognition knowledge’ is assumed to be an internal property—a trait—of
the L2 user’s vocabulary stock that can be measured independently of a
given context.
of a given word very often depends on the context, and ‘knowing’ a word
ultimately comes down to whether it facilitates comprehension in a par-
ticular context in an appropriate and timely manner. The measurement
of size alone says nothing about the depth of word knowledge, though
the two are not unrelated. Ultimately, greater vocabulary size correlates
with greater depth of vocabulary knowledge (Vermeer 2001).
The central question in the vocabulary size approach is the degree to
which this single form–meaning dimension relates to individual differ-
ences in L2 performance. Evidence of a reliable relationship between size
and performance has implications for the way L2 vocabulary knowledge
is conceptualized and, in turn, for L2 vocabulary assessment. The next
section will consider the challenging problem of how to count single
words.
There are alternative ways to calculate vocabulary size, all with their
advantages and disadvantages. The number of words on this page could
simply be counted by tallying the number of white spaces before each
word. These are all words in the simplest sense. But this method would
yield a very insensitive measure of vocabulary knowledge, given that
many words are repeated. For example, the word ‘the’ appears seven times
in this paragraph. The same word can also appear in different forms.
Does the researcher count ‘word’ and ‘words’ as one or two words? As a
result, although estimating vocabulary size is a quantitative process, the
researcher must make qualitative distinctions as to whether and how
individual word forms are counted. Several alternatives are available.
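The counting problem just described is easy to make concrete. The following minimal Python sketch (the sample sentence and the counts it yields are invented for illustration) contrasts a raw token count with a count of distinct types:

```python
# Token vs. type counting: the same stretch of text yields different
# "word counts" depending on whether every occurrence is tallied
# (tokens) or each distinct form is counted once (types).
text = ("The researcher counts the words on the page, "
        "but the words repeat and the word forms vary.")

tokens = text.lower().replace(",", "").replace(".", "").split()
types = set(tokens)

print(len(tokens))            # 17 tokens: every occurrence counted
print(len(types))             # 12 types; note 'word' and 'words'
                              # still count as two distinct types
print(tokens.count("the"))    # 5: repeated function words dominate
```

Whether 'word' and 'words' should collapse into a single unit is exactly the kind of qualitative decision the alternatives below address.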
Word Families Related to the lemma is the word family, which is defined
as the base word form plus its inflections and most common derivational
variants, for example, invite, invites, inviting, invitation (Hirsh and Nation
1992, p. 692). English inflections include third person -s, past participle
-ed, present participle -ing, plural -s, possessive -s, and comparative -er and
superlative -est. Derivational affixes include -able, -er, -ish, -less, -ly, -ness,
-th, -y, non-, un-, -al, -ation, -ess, -ful, -ism, -ist, -ity, -ize, -ment, and in-
(Hirsh and Nation 1992, p. 692). As with the lemma, the underlying idea
is that a base word and its inflected forms express the same core meaning,
and thus can be considered learned words if a learner knows the base and
the affix rules. Bauer and Nation (1993) proposed seven levels of affixes,
which include derivations and inflections. Word families differ from lem-
mas in that they cross syntactic categories. In the example of bank, as
above, the noun and verb forms are counted as part of the same family.
As a result, a lemma count will always be larger than the word family
count, since a lemma treats a narrower range of forms as a single instance.
Milton identifies what he terms a ‘very crude’ equivalence of lemma to
word family involving multiplying the word family size by 1.6 to get the
approximate lemma size (Milton 2009, p. 12).
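That grouping logic can be sketched in code. The suffix list, the stem-stripping rule, and the sample forms below are toy assumptions for illustration only, far cruder than the graded affix levels of Bauer and Nation (1993):

```python
# Toy word-family counter: strip a deliberately tiny suffix list to
# recover a base stem, so a family counts once however many of its
# members occur in the running text.
SUFFIXES = ("ation", "ing", "es", "ed", "er", "s")

def family_key(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            break
    return word.rstrip("e")   # crude: maps 'invite' and 'invit-' together

forms = ["invite", "invites", "inviting", "invitation",
         "build", "builds", "builder"]
families = {family_key(w) for w in forms}

print(len(forms))      # 7 running forms
print(len(families))   # 2 families: invit-, build-
```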
The word family has been widely used as the unit of counting in vocabu-
lary size studies (Schmitt 2010). Nation has argued that the word family
is a particularly appropriate unit for studying L2 recognition vocabulary
because it is primarily about meaning and meaning potential (Nation
2006, p. 76). It also has a degree of psycholinguistic reality regarding how
the different forms in a given family are stored in the mental lexicon
(Nagy et al. 1989). The basic assumption is that if the meaning of the
base word is known, the various inflections and derivations in which it
appears will also be potentially understood, at least to some degree. This
assumption has proved useful in relating individual vocabulary size to
L2 use, but it is probabilistic rather than categorical. A learner who
knows the meaning of build will not necessarily understand the meaning
of builder on first encounter. Schmitt and Zimmerman (2002) show that
university-level ESL students’ knowledge of the derived forms of many
stem words is far from complete; some, for example, do not know that
persistent, persistently, and persistence all come from persist. However, they
also note that users will probably work out the meaning of persistence
faster if they know persist than if they do not.
The word family construct also collapses the distinction that Paradis
(2009) makes between the stock of form–meaning associations stored in
declarative memory and morphological processes that are procedural in
nature. Widely used tests of vocabulary size, the VLT (Nation 2013) and
the Yes/No Test (Meara and Buxton 1987), always present the base form
as the test item, thus sidestepping any attempt to measure the morpho-
logical knowledge assumed in the word family construct.
Figuring out how many words a user knows is the next challenge for the
vocabulary size researcher. While in theory it may be possible to identify
every single word a user knows, in practice, the process of fixing vocabu-
lary size is one of estimation. A vocabulary size estimate is based on a
finite sample of a user’s knowledge obtained in a specific task or set of
tasks. Recognition vocabulary knowledge is passive by nature, and evi-
dence for it must be elicited from the user. This is done by presenting a
set of words to a user and eliciting a response that indicates whether the
items are known. Time and resource limitations mean that any test can
present only a limited number of words, and it is from this limited sam-
ple that the user’s vocabulary size is estimated. Word frequency statistics
provide the vocabulary size researcher with a reliable and objective means
to index the size of recognition (and productive) vocabulary knowledge
(Laufer 2001).
Words greatly differ in how often they occur in a given language.
When the words in a large corpus of spoken or written English are rank-
ordered from the most to least frequently occurring, a highly distinctive
pattern emerges. The 2000–3000 most frequently occurring words
account for the vast majority of tokens that appear in the corpus. Beyond
these high-frequency words, the relative frequency of a given word
steadily decreases as a function of its relative order, until the very-low-
frequency words tail off and account for only a tiny proportion of tokens.
This frequency distribution, called Zipf's law (after one of its original
discoverers), provides an index for the measurement and interpretation of
vocabulary size. The law states that, for a corpus of natural language
utterances, the frequency of any word is in inverse proportion to its rank.
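The shape of such a distribution is easy to illustrate. Assuming an idealized Zipf distribution over a hypothetical corpus of 50,000 word types (both the corpus size and the pure 1/rank form are assumptions for illustration), the share of tokens covered by the top-ranked types falls out directly:

```python
# Under an idealized Zipf's law, frequency(rank) is proportional to
# 1/rank, so the token share of the top n types is the ratio of
# partial harmonic sums H(n)/H(N).
N_TYPES = 50_000
harmonic = sum(1.0 / r for r in range(1, N_TYPES + 1))

def token_share(top_n):
    """Proportion of all tokens accounted for by the top_n ranked types."""
    return sum(1.0 / r for r in range(1, top_n + 1)) / harmonic

for top_n in (100, 2_000, 3_000, 10_000):
    print(f"top {top_n:>6} types cover {token_share(top_n):.0%} of tokens")
```

Even this idealization reproduces the pattern noted above: the 2000–3000 highest-ranked types already cover roughly three-quarters of all tokens, and the long tail adds little.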
they also have direct implications for vocabulary learning and the repre-
sentation of this knowledge in the mental lexicon. This is discussed below.
[Figure: likelihood of knowing a word (%) by frequency of occurrence (high, mid, low)]
Table 1.1 Vocabulary size expressed in word families and text coverage (written and spoken) across nine corpora (Nation 2006, p. 79)

Knowledge of all    Approximate written    Approximate spoken
words in band       text coverage (%)      text coverage (%)
1K                  78–81                  81–84
2K                  8–9                    5–6
3K                  3–5                    2–3
4K–5K               3                      1.5–3
6K–9K               2                      0.75–1
10K–14K             <1                     0.5
Proper nouns        2–4                    1–1.5
+14K                1–3                    1

Note: Corpora analyzed: Lancaster–Oslo–Bergen (LOB) Corpus, Freiburg–LOB, Brown, Frown, Kolhapur, Macquarie, Wellington Written, Wellington Spoken, and Lund, available from the International Computer Archive of Modern and Medieval English at http://gandalf.aksis.uib.no/icame.html (Nation 2006, p. 63).
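The cumulative logic behind coverage figures like those in Table 1.1 can be sketched as follows. The percentages below are midpoints of the written-text ranges in the table; note that the table also credits proper nouns with a further 2–4%, so in practice the often-cited 98% level arrives several bands earlier than this simple sum suggests:

```python
# Cumulative written-text coverage: knowing all families up to a band
# buys the summed coverage of every band at or below it. Values are
# midpoints of the ranges in Table 1.1, written-text column.
written_coverage = {
    "1K": 79.5, "2K": 8.5, "3K": 4.0, "4K-5K": 3.0,
    "6K-9K": 2.0, "10K-14K": 1.0,
}

running = 0.0
for band, pct in written_coverage.items():
    running += pct
    print(f"all families up to {band:>7}: ~{running:.1f}% coverage")
```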
[Figure: percentage of text coverage by frequency band (1K to 10K–14K) for spoken and written corpora]
Fig. 1.4 Text coverage as the number of unfamiliar words and the number of lines of text per unfamiliar word

Text coverage (%)    Unfamiliar words per 100 words    Lines of text per unfamiliar word
99                   1                                 10
98                   2                                 5
95                   5                                 2
90                   10                                1
80                   20                                0.5
word families were sufficient for 95% text coverage, a number similar to
the written text research. In contrast, knowledge of only the 6K–7K
bands was needed for 98% text coverage, fewer than the 8K–9K sug-
gested as being necessary to read authentic texts with some degree of
fluency (Nation 2006). van Zeeland and Schmitt (2013) also reported
that listening comprehension required knowledge of fewer word families
than comparable reading levels.
A question remains as to whether these text coverage levels, particu-
larly the 95% and 98% levels, reflect a qualitative threshold that must be
met for adequate comprehension, or a continuum from lesser to greater
comprehension skill. Schmitt et al. (2011) examined this issue by plot-
ting text coverage levels against performance for 600 tertiary L2 English
readers from 12 different countries. The relationship between text cover-
age and comprehension was plotted at ten text coverage levels, ranging
from 90% to 100% coverage. See Fig. 1.6.
This figure is adapted from Schmitt et al. (2011, p. 34), with only
alternating text coverage levels reported here. A consistent linear relation-
ship is evident across the reading comprehension and vocabulary cover-
age levels. There is little suggestion of discrete thresholds at the 95% or
98% coverage levels.
[Fig. 1.6 Mean reading comprehension percentage (with +1 SD and −1 SD bands) by vocabulary coverage level and number of participants at each level: 90% (n = 21), 92% (n = 39), 94% (n = 93), 96% (n = 176), 98% (n = 200), 99% (n = 186), 100% (n = 187); adapted from Schmitt et al. 2011, p. 34]
1.4 Conclusions
The vocabulary size approach is based on the simple assumption that the
number of words an individual knows has a direct relationship to L2
proficiency. The focus here is on recognition vocabulary knowledge,
which is narrowly defined as the ability to recognize the association
between a single word form and a basic meaning. Vocabulary learning is
viewed as an input-driven process in which vocabulary size emerges from
the user’s experience with the language. Corpus-based word frequency
statistics provide a means of estimating the overall vocabulary size from
recognition performance on a limited set of words. These vocabulary size
estimates have been related to L2 proficiency and use in two ways.
Vocabulary size has been examined as a predictor of differences in L2
performance as measured by standardized and context-specific tests. As a
key component of the lexical facility construct, this use of vocabulary size
is the focus of the book. Considerable attention has also been given to the
relationship between vocabulary size and text coverage, the latter reflect-
ing the comprehension demands of written or spoken texts. Vocabulary
size thresholds have been proposed to meet the levels of text coverage
required for successful comprehension. Both uses demonstrate the utility
of vocabulary size as a dimension of L2 vocabulary knowledge and a cor-
relate of L2 proficiency.
References
Adelman, J. S., Brown, G. D. A., & Quesada, J. F. (2006). Contextual diversity,
not word frequency, determines word-naming and reading times. Psychological
Science, 17(9), 814–823.
Adolphs, S., & Schmitt, N. (2003). Lexical coverage of spoken discourse.
Applied Linguistics, 24(4), 425–438.
Aitchison, J. (2012). Words in the mind: An introduction to the mental lexicon
(4th ed.). Malden: Wiley.
Andrews, S. (2008). Lexical expertise and reading skill. In B. H. Ross (Ed.), The
psychology of learning and motivation: Advances in research and theory (Vol. 49,
pp. 247–281). San Diego: Elsevier.
Bauer, L., & Nation, I. S. P. (1993). Word families. International Journal of
Lexicography, 6(4), 253–279.
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating
language structure and use. Cambridge: Cambridge University Press.
Cobb, T. (2007). Computing the vocabulary demands of L2 reading. Language
Learning & Technology, 11(3), 38–63.
Crossley, S. A., Subtirelu, N., & Salsbury, T. (2013). Frequency effects or con-
text effects in second language word learning. Studies in Second Language
Acquisition, 35(4), 727–755. doi:10.1017/S0272263113000375.
Ellis, N. C. (2002). Frequency effects in language processing: A review with
implications for theories of implicit and explicit language acquisition. Studies
in Second Language Acquisition, 24(2), 143–188.
Harrington, M., & Carey, M. (2009). The online yes/no test as a placement
tool. System, 37(4), 614–626. doi:10.1016/j.system.2009.09.006.
Hazenberg, S., & Hulstijn, J. H. (1996). Defining a minimal receptive vocabu-
lary for non-native university students: An empirical investigation. Applied
Linguistics, 17(2), 145–163.
Hirsh, D., & Nation, P. (1992). What vocabulary size is needed to read unsim-
plified texts for pleasure? Reading in a Foreign Language, 8(2), 689–696.
Hsueh-Chao, M. H., & Nation, I. S. P. (2000). Unknown vocabulary density
and reading comprehension. Reading in a Foreign Language, 13(1),
403–430.
Laufer, B. (1989). What percentage of text-lexis is essential for comprehension?
In C. Lauren & M. Nordman (Eds.), Special language: From humans thinking
to thinking machines (pp. 316–323). Clevedon: Multilingual Matters.
Laufer, B. (1992). How much lexis is necessary for reading comprehension? In
P. J. L. Arnaud & H. Béjoint (Eds.), Vocabulary and applied linguistics
(pp. 126–132). London: Macmillan. doi:10.1007/978-1-349-12396-4_12.
Laufer, B. (2001). Quantitative evaluation of vocabulary: How it can be done
and what it is good for. In C. Elder, K. Hill, A. Brown, N. Iwashita,
L. Grove, T. Lumley, & T. MacNamara (Eds.), Experimenting with uncer-
tainty: Essays in honour of Alan Davies (pp. 241–250). Cambridge:
Cambridge University Press.
Laufer, B., & Ravenhorst-Kalovski, G. C. (2010). Lexical threshold revisited:
Lexical text coverage, learners’ vocabulary size and reading comprehension.
Reading in a Foreign Language, 22(1), 15–30.
Leech, G., Rayson, P., & Wilson, A. (2001). Word frequencies in spoken and writ-
ten English. London: Longman.
Lightbown, P. M., & Spada, N. (2013). How languages are learned (4th ed.).
Oxford: Oxford University Press.
McCarthy, M. (1998). Spoken language and applied linguistics. Cambridge:
Cambridge University Press.
Meara, P., & Buxton, B. (1987). An alternative to multiple choice vocabulary
tests. Language Testing, 4(2), 142–145.
Meara, P., & Jones, G. (1988). Vocabulary size as placement indicator. In
P. Grunwell (Ed.), Applied linguistics in society (pp. 80–87). London: CILT.
Milton, J. (2009). Measuring second language vocabulary acquisition. Bristol:
Multilingual Matters.
Nagy, W. E., Anderson, R., Schommer, M., Scott, J. A., & Stallman, A. (1989).
Morphological families in the internal lexicon. Reading Research Quarterly,
24(3), 263–282. doi:10.2307/747770.
Nation, I. S. P. (2006). How large a vocabulary is needed for reading and lis-
tening? The Canadian Modern Language Review/La Revue Canadienne des
Langues Vivantes, 63(1), 59–82.
2 Measuring Recognition Vocabulary Size
Aims
2.1 Introduction
This chapter describes two approaches to measuring recognition vocabu-
lary size. The first approach is represented in two tests developed by Paul
Nation and his colleagues: the Vocabulary Levels Test (VLT) (Nation
2013; Schmitt et al. 2001) and, more recently, the Vocabulary Size Test
(VST) (Beglar 2010; Nation 2012). The other approach is embodied in
Paul Meara’s Yes/No Test of recognition vocabulary knowledge (Meara
and Buxton 1987; Meara and Jones 1988). The approaches share the
same frequentist perspective and the same measurement goal—vocabu-
lary size—but go about it in fundamentally different ways. Because the
latter serves as the foundation for the Timed Yes/No Test used in the lexi-
cal facility account, it is important to understand the differences between
the two and the advantages (and limitations) of the Yes/No Test format.
In both approaches, words are systematically sampled from frequency-
of-occurrence bands, with relative performance across the bands provid-
ing a measure of individual vocabulary size. As noted in the previous
chapter, the use of word frequency statistics as the basis for estimat-
ing vocabulary size is a distinctive feature of the vocabulary size approach.
The next section describes the three test formats and their three main
uses. The first use is characterizing vocabulary size as learning targets, or
thresholds. The second is examining the sensitivity of vocabulary size to
proficiency level differences evident in both global standards, such as
TOEFL and IELTS, and more local applications, such as placement testing
and academic performance. The third use is theoretical and relates to gain-
ing a better understanding of written recognition vocabulary size as a
dimension of L2 vocabulary knowledge. This includes both spoken recog-
nition vocabulary size and productive vocabulary size, spoken and written.
2.2 Approaches to Measuring Recognition Vocabulary Size

The VLT was first introduced in the early 1980s and has since been modi-
fied and revised (Nation 1990; Beglar and Hunt 1999; Schmitt et al.
2001). Each test item comprises six possible target word items and three
short definitions. The test-taker matches the word to the corresponding
definition. An example is given in Fig. 2.1.
This is a vocabulary test. You must choose the right word to go with each meaning.
[Example item: six numbered words (1 bench … 5 mirror, 6 province) to be matched with three short definitions]
Fig. 2.1 Instructions and example item for the Vocabulary Levels Test (Adapted from Nation 2013, p. 543)
The VLT includes words from each of the 2K, 3K, 5K, and 10K fre-
quency bands (Nation 2013). Also included is a set of academic words in
an Academic Word List (Coxhead 2000). These are words that occur with
high frequency in academic texts but are drawn from a range of frequency
levels. The current version comprises ten sets of word clusters from the
four levels and the academic words, for a total of 150 test words (3 target
items × 10 sets × 5 levels). The validity and reliability of the test have been
examined in a number of studies (Beglar and Hunt 1999; Culligan 2015;
Read 1988; Schmitt et al. 2001). The studies primarily evaluated the
target items used and the wording of the response alternatives. The two
versions of the test published by Schmitt, Schmitt and Clapham (2001)
are now the standards and have been used with a variety of learners in a
range of settings. The test is available in Schmitt (2010).
The VLT has also been used in a timed format by Laufer and Nation
(2001) and Zhang and Lu (2013). These studies are examined in the next
chapter, where vocabulary recognition speed is discussed.
The VLT samples words from four frequency bands in the 2K–10K range.
From this, it is possible to estimate an overall vocabulary size by interpo-
lating response levels for the untested bands. In other words, ceiling per-
formance on the 2K band is assumed to imply that the test-taker knows
the more frequent 1K words as well.
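The estimation logic can be sketched generically. The snippet below illustrates band-based interpolation only, not the published scoring procedure of the VLT or any other test, and the hit rates are invented:

```python
# Band-interpolation size estimate: each sampled band's proportion of
# correct responses is credited to every family up to that band edge,
# so bands below the first sample and between samples inherit the
# nearest tested rate.
band_hits = {2_000: 0.95, 3_000: 0.80, 5_000: 0.60, 10_000: 0.30}

def estimate_size(band_hits):
    estimate, prev_edge = 0.0, 0
    for band_edge, p_correct in sorted(band_hits.items()):
        estimate += p_correct * (band_edge - prev_edge)
        prev_edge = band_edge
    return round(estimate)

print(estimate_size(band_hits))   # ~5,400 word families
```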
[Figure: instructions and sample items from a paper-based French Yes/No Test: 'Look through the French words listed below. Cross out any words that you do not know well enough to say what they mean. Keep a record of how long it takes you to do the test.' Sample items: GÔTER, PONTE]
[Fig. 2.4 Comparison of VLT and Yes/No Test performance: percentage correct by frequency-of-occurrence band (2K, 3K, 5K, 10K) (Mochida and Harrington 2006)]
as words (‘false alarms’) from words correctly recognized (‘hits’). This for-
mula is used in the studies reported in Part 2.
As Fig. 2.4 shows, performance on the two tests is very similar across
all the frequency levels for the group of students tested. In an earlier
study, Cameron (2002) compared performance on the Yes/No Test and
the VLT by secondary ESL students in the UK. In contrast to Mochida
and Harrington’s (2006) results, the scores on the two tests did not cor-
relate. However, the Cameron study used a Yes/No Test including differ-
ent items from the VLT. The secondary-level participants were also of
lower language proficiency and produced a much higher error rate for the
pseudowords, making direct comparisons between the two studies
difficult.
Mochida and Harrington’s (2006) results suggest that the frequency
band format used in the VLT and the Yes/No Test yield similar size mea-
sures despite the difference in item presentation and response type. On a
more practical note, the respective formats also make significantly differ-
ent time demands: the VLT version took almost 30 minutes to complete,
while the Yes/No Test took around five minutes.
The tests have also been used to examine the relationship between recog-
nition vocabulary size and L2 proficiency that involves more than reading
skill alone. Individual difference in size has been correlated with a range
of standardized tests such as TOEFL, TOEIC, IELTS, and the Common
European Framework of Reference for Languages (CEFR). All measures
provide a characterization of overall English proficiency for academic,
government, and employment purposes. TOEFL performance by various
L1 learner groups has been correlated with the VLT (Qian 2002; Alavi
2012), VST (McLean et al. 2014), and Yes/No Test performance (Meara
and Milton 2003; Milton 2009). TOEIC scores have also been correlated
with the VLT (Kanzaki 2015), VST (Kanzaki 2015; McLean et al. 2014;
Stewart 2014), and Yes/No Test performance (Kanzaki 2015; Stubbe
2015). Similarly, variability in IELTS scores has been related to the VLT
performance differences (Alavi 2012) and the Yes/No Test performance
(Milton 2009; Stæhr 2008). The Yes/No Test has also been related to the
CEFR scale (Alderson 2005; Milton and Alexiou 2009).
This research consistently shows that individual differences in recogni-
tion vocabulary size are sensitive to outcomes on the criterion standard
examined. This sensitivity is evident in two ways. The first is the presence
of a statistically significant difference among the levels as a function of
vocabulary size or in the strength of association between the size and
proficiency levels. This is expressed in the p value, usually set at <0.05.
The other aspect of sensitivity is the effect size, which is the strength, or
magnitude, of the statistically significant result. In correlation and regres-
sion analyses the effect is expressed in the R² value, also called the coeffi-
cient of determination. This signifies the amount of variance in the
differences in the proficiency levels that can be attributed to individual
differences in vocabulary test performance. The R² value is an important
benchmark for comparing effect sizes across the different studies in the
later chapters. Other effect size statistics will be used when testing mean
differences using the t-test and ANOVA, and their nonparametric alter-
natives as well.
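The conversion between the two expressions of effect size is direct, as a quick computation shows:

```python
# R² is the squared correlation, read as the proportion of variance
# in the criterion accounted for by the predictor.
for r in (0.4, 0.5, 0.6, 0.7):
    print(f"r = {r:.1f} -> R² = {r * r:.0%} of variance accounted for")
```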
The studies cited above all report statistically significant results for the
size tests as discriminators of performance. However, they differ greatly in
the amount of variance accounted for in the criterion measure of interest.
Kanzaki (2015) examined VLT and VST performance as predictors of
TOEIC performance in Japanese learners. The VLT results correlated
with multiple TOEIC versions, r = 0.5–0.7, or between 25% and 50% of
the variance accounted for, while the VST scores produced slightly lower
correlations, r = 0.4–0.6, or 16–36% of the variance in the TOEIC scores.
Milton and Alexiou (2009) examined Yes/No Test performance as a
predictor of CEFR scale placement for learners of English, French, and
Greek learned as foreign language (FL)/L2 in Greece, Hungary, Spain,
and the UK. Vocabulary size accounted for around 70% of the variance
for Greek L2 and EFL (English as a foreign language) learners in Greece
and French FL learners in Spain, but only about 17% of that for the
CEFR levels for EFL learners in Hungary. The differences in effect sizes
across the groups, criterion standards, and vocabulary size tests in just
these two studies show that the relationship between vocabulary size and
proficiency standard is affected by the size measure used, the participants,
the setting, and the criterion outcome examined. Chap. 8 in Part 2 com-
pares performance on the Timed Yes/No Test and IELTS.
The tests have also been used to examine the relationship between vocab-
ulary size and individual proficiency differences in specific learning set-
tings and skill areas. As is the case with the standardized tests, these
proficiency domains include reading skill as a central component, but
also tap the range of language skills that contribute to overall language
performance. Both test formats have been examined as potential tools
for language program placement decisions. Placement outcomes have
been related to performance in the VLT (Akbarian 2010; Clark and
Ishida 2005), VST (Gee and Nguyen 2015) and Yes/No Test (Harrington
and Carey 2009; Harsch and Hartig 2015; Lam 2010; Meara and Jones
1990). All the studies show test outcomes to be sensitive to placement
level differences to some degree. However, when compared directly with
other placement measures, such as an in-house placement test
(Harrington and Carey 2009) or other vocabulary test formats such as
the C-test (Harsch and Hartig 2015), the size measure alone proves to be
less sensitive than the more complex measures. This issue is examined in
Chap. 9.
Vocabulary size has also been examined as a predictor of various types
of academic English performance. It has been related to classroom perfor-
mance (Morris and Cobb 2004; Roche and Harrington 2013), where size
is a moderately strong predictor of course grade outcomes. Size has also
been related to overall grade point average (Harrington and Roche 2014,
Roche et al. 2016), although the link is weaker. The latter research was
undertaken to identify students at potential academic risk due to lan-
guage proficiency limitations and is examined in Chap. 10. EFL writing
outcomes have been correlated with performance on the VLT (Lemmouh
2008) and Yes/No Test (Roche and Harrington 2013) recognition vocabulary
38 2 Measuring Recognition Vocabulary Size
measures. Similarly, EFL listening skill has been correlated with VLT
performance in university students (Stæhr 2008) and lower-proficiency
secondary students (Stæhr 2008). The range of domains illustrates the
sensitivity of vocabulary size differences to virtually all types of language
performance. However, the question remains as to how sensitive those
differences are and thus how useful they might be in characterizing user
performance.
A handful of studies also suggest that vocabulary size correlates with
other dimensions of L2 performance beyond reading and listening. VST
performance has been related to better phonetic discrimination skills
(Bundgaard-Nielsen et al. 2011). Better Yes/No Test performance has
also been correlated with superior learning strategy use (Kojic-Sabo and
Lightbown 1999). Both studies open up the possibility that recognition
vocabulary size may affect L2 performance in more ways than previously
imagined.
Is one format better than the other? Both use word frequency statistics to index vocabu-
lary size but differ substantially in format. The VLT/VST format uses
traditional matching and multiple-choice items, while the Yes/No Test
uses a simple self-report format to indicate whether a test-taker knows an
item. It also includes pseudowords to control for guessing, a distinctive
feature of the test that is also the most problematic. This issue is discussed
in Chap. 5, which introduces and describes the Timed Yes/No Test.
The lexical facility account uses the Yes/No Test format because it
directly taps knowledge of the target items in a way that is not affected by
the cues used in possible response alternatives (in monolingual or bilin-
gual versions). Presenting target items one at a time also eliminates the
guessing advantage afforded by an individual being able to reject unlikely
alternatives and greatly reduces the scope for the strategic allocation of
attention within and across items. The single-word presentation format
also allows for a direct measure of individual word recognition speed.
2.4 Conclusions
This chapter introduced the two most widely used approaches to measur-
ing L2 recognition vocabulary size, the multiple-choice VLT/VST and
the simple word recognition Yes/No Test. Both approaches are similar in
that vocabulary size is estimated from test performance on a set of words
systematically sampled from frequency-of-occurrence bands. Size esti-
mates based on this performance have been shown to be correlated to
variability in outcomes in a range of L2 performance domains. These
include proficiency standards such as IELTS and the CEFR and more
localized measures such as program placement and academic English per-
formance. The two approaches differ substantially in format, with the
VLT/VST approach using multiple-choice tasks and the self-report Yes/
No Test merely requiring users to indicate whether they know the word.
The Timed Yes/No Test used in the following chapters builds on this
format to measure lexical facility, combining the size measure described
here with a measure of word recognition speed. The motivation for including
recognition time in a measure of vocabulary skill is set out in the next
chapter, where research on L2 word recognition skill is discussed.
References
Akbarian, I. (2010). The relationship between vocabulary size and depth for ESP/
EAP learners. System, 38(3), 391–401. doi:10.1016/j.system.2010.06.013.
Alavi, S. M. (2012). The role of vocabulary size in predicting performance on
TOEFL reading item types. System, 40(3), 376–385.
Alderson, J. (2005). Diagnosing foreign language proficiency: The interface between
learning and assessment. New York: Continuum.
Bardel, C., & Lindquist, C. (2011). Developing a lexical profiler for spoken
French L2 and Italian L2: The role of frequency, thematic vocabulary and
cognates. EUROSLA Yearbook, 11, 75–93. doi:10.1075/eurosla.11.06bar.
Beeckmans, R., Eyckmans, J., Janssens, V., Dufranne, M., & Van de Velde, H.
(2001). Examining the yes/no vocabulary test: Some methodological issues
in theory and practice. Language Testing, 18(3), 235–274.
Beglar, D. (2010). A Rasch-based validation of the vocabulary size test. Language
Testing, 27(1), 101–118. doi:10.1177/0265532209340194.
Beglar, D., & Hunt, A. (1999). Revising and validating the 2000 word level and
the university word level vocabulary tests. Language Testing, 16(2), 131–162.
doi:10.1191/026553299666419728.
Bundgaard-Nielsen, R. L., Best, C. T., & Tyler, M. D. (2011). Vocabulary size
is associated with second-language vowel perception performance in adult
learners. Studies in Second Language Acquisition, 33(3), 433–461. doi:10.1017/
S0272263111000040.
Cameron, L. (2002). Measuring vocabulary size in English as an additional lan-
guage. Language Teaching Research, 6(2), 145–173.
Clark, M. K., & Ishida, S. (2005). Vocabulary knowledge differences between
placed and promoted students. Journal of English for Academic Purposes, 4(3),
225–238. doi:10.1016/j.jeap.2004.10.002.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2),
213–238.
Culligan, B. (2015). A comparison of three test formats to assess word difficulty.
Language Testing, 32(4), 503–520.
East, M. (2004). Calculating the lexical frequency profile of written German
texts. Australian Review of Applied Linguistics, 27(1), 30–43.
Elgort, I. (2013). Effects of L1 definitions and cognate status of test items on the
vocabulary size test. Language Testing, 30(2), 253–272. doi:10.1177/
0265532212459028.
Gee, R. W., & Nguyen, L. T. C. (2015). The bilingual vocabulary size test for
Vietnamese learners: Reliability and use in placement testing. Asian Journal
of English Language Teaching, 25, 63–80.
Harrington, M. (2006). The lexical decision task as a measure of L2 lexical pro-
ficiency. EUROSLA Yearbook, 6(1), 147–168.
Harrington, M., & Carey, M. (2009). The online yes/no test as a placement
tool. System, 37(4), 614–626. doi:10.1016/j.system.2009.09.006.
Harrington, M., & Roche, T. (2014). Identifying academically at-risk students
at an English-medium university in Oman: Post-enrolment language assess-
ment in an English-as-a-foreign language setting. Journal of English for
Academic Purposes, 15, 34–37.
Harsch, C., & Hartig, J. (2015). Comparing C-tests and yes/no vocabulary size
tests as predictors of receptive language skills. Language Testing, 33(4), 555–575.
Huibregtse, I., Admiraal, W., & Meara, P. (2002). Scores on a yes-no vocabulary
test: Correction for guessing and response style. Language Testing, 19(3),
227–245.
Kanzaki, M. (2015). Comparing TOEIC® and vocabulary test scores. In
G. Brooks, M. Grogan, & M. Porter (Eds.), 2014 PanSIG conference proceedings
(pp. 52–58). Miyazaki: JALT.
Kojic-Sabo, I., & Lightbown, P. M. (1999). Students’ approaches to vocabulary
learning and their relationship to success. The Modern Language Journal,
83(2), 176–192. doi:10.1111/0026-7902.00014.
Lam, Y. (2010). Yes/No tests for foreign language placement at the post-
secondary level. Canadian Journal of Applied Linguistics/Revue canadienne de
linguistique appliquee, 13(2), 54–72.
Laufer, B. (1992). How much lexis is necessary for reading comprehension? In
P. J. L. Arnaud & H. Béjoint (Eds.), Vocabulary and applied linguistics
(pp. 126–132). London: Macmillan. doi:10.1007/978-1-349-12396-4_12.
Laufer, B. (2005a). Focus on form in second language vocabulary learning.
EUROSLA Yearbook, 5(1), 223–250.
Laufer, B. (2005b). Lexical frequency profiles: From Monte Carlo to the real
world. A response to Meara. Applied Linguistics, 26(4), 582–588.
Laufer, B., & Levitzky-Aviad, T. (2016). CATTS (Computer Adaptive Test of
Size & Strength). Downloaded May 1, 2016, from http://www.lextutor.ca/
tests/levels/recognition/nvlt/paper.pdf
Laufer, B., & Nation, P. (1995). Vocabulary size and use: Lexical richness in L2
written production. Applied Linguistics, 16(3), 307–322.
Laufer, B., & Nation, P. (2001). Passive vocabulary size and speed of meaning
recognition: Are they related? EUROSLA Yearbook, 1(1), 7–28.
Lemmouh, Z. (2008). The relationship between grades and the lexical richness
of student essays. Nordic Journal of English Studies, 7(3), 163–180.
McLean, S., Hogg, N., & Kramer, B. (2014). Estimations of Japanese university
learners’ English vocabulary sizes using the vocabulary size test. Vocabulary
Learning and Instruction, 3(2), 47–55.
McLean, S., & Kramer, B. (2015). The creation of a new vocabulary levels test. In
G. Brooks, M. Grogan, & M. Porter (Eds.), 2014 PanSIG conference proceed-
ings (pp. 1–11). Miyazaki: JALT.
Meara, P., & Buxton, B. (1987). An alternative to multiple choice vocabulary
tests. Language Testing, 4(2), 142–145.
Meara, P., & Jones, G. (1988). Vocabulary size as placement indicator. In
P. Grunwell (Ed.), Applied linguistics in society (pp. 80–87). London: CILT.
Meara, P., & Jones, G. (1990). Eurocentres vocabulary size test. 10KA. Zurich:
Eurocentres.
Meara, P. M., & Milton, J. L. (2002). X_Lex: The Swansea vocabulary levels test.
Newbury: Express.
Meara, P. M., & Milton, J. (2003). X_Lex: The Swansea vocabulary levels test.
Swansea: Lognostics.
Meara, P. M., & Miralpeix, I. (2006). Y_Lex: The Swansea advanced vocabulary
levels test. v2.05. Swansea: Lognostics.
Milton, J. (2009). Measuring second language vocabulary acquisition. Bristol:
Multilingual Matters.
Milton, J., & Alexiou, T. (2009). Vocabulary size and the common European
framework of reference for languages. In B. Richards, M. H. Daller, D. D.
Malvern, P. Meara, J. Milton, & J. Treffers-Daller (Eds.), Vocabulary studies
in first and second language acquisition (pp. 194–211). Basingstoke: Palgrave
Macmillan.
Mochida, A., & Harrington, M. (2006). The yes-no test as a measure of recep-
tive vocabulary knowledge. Language Testing, 23(1), 73–98.
doi:10.1191/0265532206lt321oa.
Morris, L., & Cobb, T. (2004). Vocabulary profiles as predictors of the academic
performance of teaching English as a second language trainees. System, 32(1),
75–87. doi:10.1016/j.system.2003.05.001.
Nation, I. S. P. (1990). Teaching and learning vocabulary. Rowley: Newbury
House.
Nation, I. S. P. (2006). How large a vocabulary is needed for reading and lis-
tening? The Canadian Modern Language Review/La Revue Canadienne des
Langues Vivantes, 63(1), 59–82.
Nation, I. S. P. (2012). The vocabulary size test: Information and specifications.
Retrieved from http://www.victoria.ac.nz/lals/about/staff/publications/paul-
nation/Vocabulary-Size-Test-information-and-specifications.pdf
Nation, I. S. P. (2013). Learning vocabulary in another language (2nd ed.).
Cambridge, UK: Cambridge University Press.
Nation, P., & Coxhead, A. (2014). Vocabulary size research at Victoria University
of Wellington, New Zealand. Language Teaching, 47(03), 398–403.
Pellicer-Sánchez, A., & Schmitt, N. (2012). Scoring yes-no vocabulary tests:
Reaction time vs. nonword approaches. Language Testing, 29(4), 489–509.
doi:10.1177/0265532212438053.
Qian, D. D. (2002). Investigating the relationship between vocabulary knowl-
edge and academic reading performance: An assessment perspective. Language
Learning, 52(3), 513–536.
Read, J. (1988). Measuring the vocabulary knowledge of second language learners.
RELC Journal, 19, 12–25.
Roche, T., & Harrington, M. (2013). Recognition vocabulary knowledge as a
predictor of academic performance. Language Testing in Asia, 3, 1–12.
Roche, T., Harrington, M., Sinha, Y., & Denman, C. (2016). Vocabulary recog-
nition skill as a screening tool in English-as-a-Lingua-Franca University set-
tings. In J. Read (Ed.), Post-admission language assessment of University students,
English language education (Vol. 6, pp. 159–178). Switzerland: Springer.
Schmitt, N. (2010). Researching vocabulary: A vocabulary research manual.
Basingstoke: Palgrave Macmillan.
Schmitt, N., Schmitt, D., & Clapham, C. (2001). Developing and exploring
the behaviour of two new versions of the vocabulary levels test. Language
Testing, 18(1), 55–89. doi:10.1191/026553201668475857.
Shiotsu, T., & Read, J. (2009, November). Extending the yes/no test as a measure
of the English vocabulary knowledge of Japanese learners. Paper presented at The
measurement of L2 lexical development colloquium, Annual Conference of
the Applied Linguistics Association of Australia, Brisbane.
Stæhr, L. S. (2008). Vocabulary size and the skills of listening,
reading and writing. Language Learning Journal, 36(2), 139–152.
doi:10.1080/09571730802389975.
Stewart, J. (2014). Do multiple-choice options inflate estimates of vocabulary size
on the VST? Language Assessment Quarterly, 11(3), 271–282. doi:10.1080/15
434303.2014.922977.
3 L2 Word Recognition Skill and Its Measurement
Aims
3.1 Introduction
Efficiency in recognizing individual words is a fundamental element of
fluent discourse comprehension. As with vocabulary size, fast and rela-
tively effortless word recognition is a foundation of second language (L2)
proficiency. It is also an aspect of language skill where first language (L1)
and L2 users differ markedly. In the words of Paul Meara, ‘[t]he ability to
recognize and retrieve words effortlessly seems to be a basic feature of the
performance of L1 speakers, and a feature that is conspicuously lacking
from the performance of most L2 speakers’ (2002, p. 404).
The importance of this skill is readily evident when viewed in the con-
text of the broader text comprehension process. The construction–inte-
gration (C–I) model (Kintsch 1998) provides a useful framework for
situating word recognition skill (and later, lexical facility) in overall dis-
course comprehension, as well as relating this skill to other aspects of L2
vocabulary knowledge. The two-stage model is presented in Fig. 3.1.
The first stage of comprehension involves the construction of a text
base from the words and phrases in a text. Crucial to this stage is word
recognition (or ‘word identification’, in Kintsch’s terms), which involves
recognizing and extracting meaning from surface forms. This process
involves both phonological and orthographic knowledge, and is assumed
[Fig. 3.1 The two-stage construction–integration model: written text feeds word recognition, which produces semantic units; word-to-text integration, drawing on prior knowledge, yields the sentence representation]
Table 3.1 A meta-analysis of factors affecting L2 reading skill (Jeon and Yamashita 2014; N = 59 studies)

Type                      Factor                        r     Evidence level
Word recognition skill    L2 vocabulary knowledge       0.79  High
                          L2 decoding                   0.56  High
                          Phonological awareness        0.48  Low
                          Orthographic knowledge        0.51  Low
                          Morphological knowledge       0.61  Low
Grammar knowledge         L2 grammar knowledge          0.85  High
L1 reading skill          L1 reading comprehension      0.77  High
L2 proficiency            L2 listening comprehension    0.77  Low
Cognitive processes       Working memory                0.42  Low
                          Metacognition                 0.32  Low
that the participant is not trading off accuracy for the speed of response,
or vice versa. This is important because interpreting accuracy and speed
performance where both variables vary presents the researcher with a
potential confound. Lower accuracy may be due to a tendency to
answer more quickly and not necessarily to a lack of underlying knowl-
edge, while slower responses might reflect greater care being taken in
making the correct response at the expense of answering quickly
(Pachella 1974).
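One crude screen for such a confound, sketched below with invented participant-level data, is to check how mean response time and accuracy covary across participants: a positive correlation (slower but more accurate) is the trade-off signature, whereas the negative pattern here (slower and also less accurate) is more consistent with underlying knowledge differences:

```python
from statistics import mean

# (participant mean RT in ms, proportion correct); values are invented
participants = [(540, 0.97), (610, 0.95), (700, 0.92), (820, 0.88)]

rts = [rt for rt, _ in participants]
accs = [acc for _, acc in participants]

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Negative here: slower participants are also less accurate, so speed
# is not being traded for accuracy in this (invented) sample.
print(f"r(RT, accuracy) = {pearson_r(rts, accs):.2f}")
```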
As a result, typical applications of the LDT in psychological research
assume that the participant knows the words. The need for the partici-
pant to have a threshold level of vocabulary knowledge makes the tech-
nique less useful for child language research (Harley 2013) or for L2
research more generally, where vocabulary knowledge levels can vary
greatly. In L2 research, it has been used in bilingual research involving
advanced learners. Here, the interest is in investigating the organization
and interaction of the cross-linguistic mental lexicon, and the partici-
pants are proficient in both languages (Dijkstra 2005; Kroll et al. 2010).
Experimental L2 acquisition research has also used the LDT in studies
that involve only advanced learners (Elgort 2013; Favreau and Segalowitz
1983) or where the participants are trained on a set of items beforehand
to ensure error-free performance (Akamatsu 2008). In the lexical facility
account tested in Part 2, both response accuracy and response time will
be simultaneously examined as indices of L2 proficiency.
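A minimal sketch of such a trial loop, recording accuracy and response time together, might look as follows. The two items (including the invented pseudoword 'mantiness') and the console interface are illustrative only; dedicated testing software, such as the LanguageMAP program acknowledged earlier, controls presentation and timing far more precisely:

```python
import time

# (item, is_real_word): 'mantiness' is an invented pseudoword foil.
items = [("harbour", True), ("mantiness", False)]

results = []
for word, is_word in items:
    start = time.perf_counter()
    answer = input(f"Do you know the word '{word}'? (y/n) ").strip().lower()
    rt_ms = (time.perf_counter() - start) * 1000.0
    results.append({
        "item": word,
        "hit": answer == "y" and is_word,              # yes to a real word
        "false_alarm": answer == "y" and not is_word,  # yes to a pseudoword
        "rt_ms": rt_ms,                                # response time
    })

print(results)
```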
The LDT involves a simple yes/no response to a presented item, but the
factors affecting that judgment are complex. At a minimum, the task can
be broken down into two stages (Balota and Chumbley 1984; Jacobs and
Grainger 1994), as schematized in Fig. 3.2.
Balota and Chumbley (1984) decompose the task into the word recog-
nition processes that reflect the underlying strength of the representation
and the decision processes that act on the output of the recognition phase.
The mental representation of a word consists of phonological, ortho-
graphic, and semantic code information. The strength of these elements
is a function of the frequency with which the word has been used and its
relationship to other words in the mental lexicon. All things being equal,
a word with a stronger underlying representation will be recognized faster
and more consistently; it is this underlying representation strength and
its interaction with other specific task elements that is usually the object
of the researcher’s interest. The word recognition stage feeds into the
decision stage, where the participant makes the decision as to how to
respond. Decision-stage factors also affect the speed of response. Decision-
related factors include systematic effects, such as test-taker’s attitude and
motivation, and nonsystematic effects, such as attention on a given trial
or fatigue. Performance at the decision stage can also be affected by the
[Fig. 3.2 Two stages of the lexical decision task: item presentation feeds a word recognition stage (based on the strength of the word's representation), whose output feeds a decision stage (based on task understanding) that produces the response]
Time Measures The main LDT measure is the mean recognition time,
mnRT, for word items calculated over individuals, conditions, groups,
and items. Only correct responses are used in the calculation of the mnRT,
though incorrect responses are usually few. Interpreting the mnRTs also
requires a measure of variability of the sample on which the mnRT was
calculated. This is the SD, an estimate of the average variability between
data points in a sample, used when the data are normally distributed.
Where the data are not normally distributed, the median and range serve
the same function as the mnRT and SD. The smaller the SD, the more
confidence the researcher has that the mean is a ‘true’ indication of the
underlying group mean.
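These measures are straightforward to compute. In the sketch below the response times are invented, only correct responses enter the calculation as just described, and the coefficient of variation (CV = SD/mnRT) used by the lexical facility measures in later chapters is added on its standard definition:

```python
from statistics import mean, stdev

# (response time in ms, response correct?)
trials = [(612, True), (548, True), (1020, False), (701, True), (660, True)]
correct_rts = [rt for rt, correct in trials if correct]

mnRT = mean(correct_rts)   # mean recognition time over correct responses
sd = stdev(correct_rts)    # sample SD of those response times
cv = sd / mnRT             # coefficient of variation (dimensionless)

print(f"mnRT = {mnRT:.0f} ms, SD = {sd:.0f} ms, CV = {cv:.2f}")
```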
Age of Acquisition The age at which a word is learned affects how fast
words are recognized. Words learned at a younger age are recognized
faster than those learned later.
(Yap and Balota 2015). Naturally, the number of meanings and associates
a target has is directly related to vocabulary size.
There are also three factors related to the use of pseudowords and non-
words that affect recognition times.
Finally, there are two factors in LDT performance that are specific to
testing L2 and bilingual populations.
The preceding factors all affect LDT performance, though Jiang notes
that the role the various factors play in LDT performance, and word
recognition more generally, remains a subject of discussion and debate
(Jiang 2013, p. 84). Regardless, the quality and strength of these effects
3.7 Conclusions
Lexical facility is measured using the Timed Yes/No Test, an instrument
that incorporates basic features of the LDT. The format provides a win-
dow on the development of word recognition skill that is driven by expe-
rience with the language, as is the lexical facility construct itself. Applying
the format to the measurement of L2 word recognition skill across a
range of user proficiency levels, that is, where accuracy performance can
vary greatly, is a novel undertaking and one that presents a number
of challenges to the researcher. These will be examined in the studies pre-
sented in Part 2.
References
Akamatsu, N. (2003). The effects of first language orthographic features on sec-
ond language reading in text. Language Learning, 53(2), 207–231.
Akamatsu, N. (2008). The effects of training on automatization of word recog-
nition in English as a foreign language. Applied Psycholinguistics, 29(2),
175–193. doi:10.1017/S0142716408080089.
Andrews, S. (1992). Frequency and neighborhood effects on lexical access:
Lexical similarity or orthographic redundancy? Journal of Experimental
Psychology: Learning, Memory, and Cognition, 18(2), 234–254.
doi:10.1037/0278-7393.18.2.234.
Balota, D. A., & Chumbley, J. I. (1984). Are lexical decisions a good measure of
lexical access? The role of word frequency in the neglected decision phase.
Journal of Experimental Psychology: Human Perception and Performance, 10(3),
340–357. doi:10.1037/0096-1523.10.3.340.
Balota, D. A., Yap, M. J., & Cortese, M. J. (2006). Visual word recognition: The
journey from features to meaning (a travel update). In M. J. Traxler & M. A.
Gernsbacher (Eds.), Handbook of psycholinguistics (2nd ed., pp. 285–375).
Amsterdam: Elsevier.
Bell, L. C., & Perfetti, C. A. (1994). Reading skill: Some adult comparisons.
Journal of Educational Psychology, 86(2), 244–255.
doi:10.1037/0022-0663.86.2.244.
Brown, J. D. (2005). Testing in language programs: A comprehensive guide to
English language assessment. New York: McGraw-Hill.
Carreiras, M., Perea, M., & Grainger, J. (1997). Effects of the orthographic
neighborhood in visual word recognition: Cross-task comparisons. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 23(4), 857.
De Groot, A. M., Delmaar, P., & Lupker, S. J. (2000). The processing of inter-
lexical homographs in translation recognition and lexical decision: Support
for non-selective access to bilingual memory. The Quarterly Journal of
Experimental Psychology: Section A, 53(2), 397–428.
Dijkstra, T. (2005). Bilingual visual word recognition and lexical access. In J. F.
Kroll & A. M. B. de Groot (Eds.), Handbook of bilingualism: Psycholinguistic
approaches (pp. 179–201). New York: Oxford University Press.
Elgort, I. (2013). Effects of L1 definitions and cognate status of test items on the
vocabulary size test. Language Testing, 30(2), 253–272.
doi:10.1177/0265532212459028.
Ellis, N. C. (2002). Frequency effects in language processing: A review with
implications for theories of implicit and explicit language acquisition. Studies
in Second Language Acquisition, 24(2), 143–188.
Favreau, M., & Segalowitz, N. S. (1983). Automatic and controlled processes in
the first and second language of reading fluent bilinguals. Memory and
Cognition, 11(6), 565–574. doi:10.3758/BF03198281.
Fender, M. J. (2001). A review of L1 and L2/ESL word integration development
involved in lower-level text processing. Language Learning, 51(2), 319–396.
doi:10.1111/0023-8333.00157.
Fodor, J. (1983). Modularity of mind. Cambridge, MA: MIT Press.
Geva, E., & Wang, M. (2001). The development of basic reading skills in chil-
dren: A cross-language perspective. Annual Review of Applied Linguistics, 21,
182–204.
Harley, T. A. (2013). The psychology of language: From data to theory (4th ed.).
Hove: Psychology Press.
Harrington, M. (2006). The lexical decision task as a measure of L2 lexical pro-
ficiency. EUROSLA Yearbook, 6(1), 147–168.
Holmes, V. M. (2009). Bottom-up processing and reading comprehension in
experienced adult readers. Journal of Research in Reading, 32(3), 309–326.
doi:10.1111/j.1467-9817.2009.01396.
Hoover, W. A., & Gough, P. B. (1990). The simple view of reading. Reading and
Writing, 2(2), 127–160. doi:10.1007/BF00401799.
Hulstijn, J. H., Van Gelderen, A., & Schoonen, R. (2009). Automatization in
second language acquisition: What does the coefficient of variation tell us?
Applied Psycholinguistics, 30(4), 555–582.
Jacobs, A. M., & Grainger, J. (1994). Models of visual word recognition:
Sampling the state of the art. Journal of Experimental Psychology: Human
Perception and Performance, 20(6), 1311.
Jeon, E. H., & Yamashita, J. (2014). L2 reading comprehension and its corre-
lates: A meta-analysis. Language Learning, 64(1), 160–212. doi:10.1111/
lang.12034.
Jiang, N. (2013). Conducting reaction time research in second language studies.
New York: Routledge.
Juffs, M., & Harrington, M. (2011). Aspects of working memory in L2 learn-
ing. Language Teaching, 44(2), 137–166. doi:10.1017/S0261444810000509.
Just, M. A., & Carpenter, P. A. (1992). A capacity theory of comprehension: Individual
differences in working memory. Psychological Review, 99(1), 122–149.
Kintsch, W. (1998). Comprehension: A paradigm for cognition. Cambridge:
Cambridge University Press.
Kintsch, W. (2005). An overview of top-down and bottom-up effects in com-
prehension: The C-I perspective. Discourse Processes, 39(2–3), 125–128. doi:
10.1080/0163853X.2005.9651676.
Koda, K. (1992). The effects of lower-level processing skills on FL reading per-
formance: Implications for instruction. The Modern Language Journal, 76(4),
502–512.
Koda, K. (1996). L2 word recognition research: A critical review. The Modern
Language Journal, 80(4), 450–460.
Koda, K. (2005). Insights into second language reading: A cross-linguistic approach.
New York: Cambridge University Press.
Koda, K. (2007). Reading and language learning: Crosslinguistic constraints on
second language reading development. Language Learning, 57, 1–44.
doi:10.1111/0023-8333.101997010-i1.
Kroll, J., Van Hell, J., Tokowicz, N., & Green, D. (2010). The revised hierarchical
model: A critical review and assessment. Bilingualism: Language and Cognition,
13(3), 373–381. doi:10.1017/S136672891000009X.
LaBerge, D., & Samuels, S. J. (1974). Toward a theory of automatic informa-
tion processing in reading. Cognitive Psychology, 6(2), 293–323.
doi:10.1016/0010-0285(74)90015-2.
Lewellen, M. J., Goldinger, S. D., Pisoni, D. B., & Greene, B. G. (1993). Lexical
familiarity and processing efficiency: Individual differences in naming, lexical
decision, and semantic categorization. Journal of Experimental Psychology:
General, 122(3), 316–330. doi:10.1037/0096-3445.122.3.316.
Luce, R. D. (1986). Response times. New York: Oxford University Press.
Meara, P. (2002). The rediscovery of vocabulary. Second Language Research,
18(4), 393–407. doi:10.1191/0267658302sr211xx.
Meara, P., Lightbown, P. M., & Halter, R. H. (1994). The effects of cognates on
the applicability of yes/no vocabulary tests. The Canadian Modern Language
Review, 50(2), 296–311.
Nassaji, H. (2014). The role and importance of lower-level processes in second
language reading. Language Teaching, 47(1), 1–37.
Pachella, R. G. (1974). The interpretation of reaction time in information pro-
cessing research. In B. H. Kantowitz (Ed.), Human information processing:
Tutorials in performance and cognition (pp. 41–82). Hillsdale: Lawrence
Erlbaum Associates, Inc.
Pellicer-Sánchez, A., & Schmitt, N. (2012). Scoring yes-no vocabulary tests:
Reaction time vs. nonword approaches. Language Testing, 29(4), 489–509.
doi:10.1177/0265532212438053.
Perfetti, C. A., & Stafura, J. (2014). Word knowledge in a theory of reading
comprehension. Scientific Studies of Reading, 18(1), 22–37. doi:10.1080/108
88438.2013.827687.
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes
in L2 research. Language Learning, 64, 878–912. doi:10.1111/lang.12079.
Schmitt, N. (2010). Researching vocabulary. A vocabulary research manual.
Basingstoke: Palgrave Macmillan.
Segalowitz, N. (2010). Cognitive bases of second language fluency. New York:
Routledge.
Segalowitz, N., & Segalowitz, S. J. (1993). Skilled performance, practice and
differentiation of speed-up from automatization effects: Evidence from sec-
ond language word recognition. Applied Psycholinguistics, 14(3), 369–385.
doi:10.1017/S0142716400010845.
Segalowitz, N., Watson, V., & Segalowitz, S. J. (1995). Vocabulary skill: Single
case assessment of automaticity of word recognition in a timed lexical deci-
sion task. Second Language Research, 11(2), 121–136.
Segalowitz, N., Segalowitz, S. J., & Wood, A. G. (1998). Assessing the develop-
ment of automaticity in second language word recognition. Applied
Psycholinguistics, 19(1), 53–67.
Aims
4.1 Introduction
The preceding chapters examined vocabulary size and word recognition skill
as elements of second language (L2) vocabulary knowledge. This chapter
introduces an approach to L2 vocabulary skill and its measurement that
brings together vocabulary size and the two dimensions of word recognition
skill, recognition speed and consistency. Lexical facility characterizes vocabu-
lary size and processing skill dimensions as complementary indices of L2
vocabulary knowledge that, when combined, provide a more sensitive mea-
sure of individual differences in L2 vocabulary than vocabulary size alone.
This book presents conceptual support and empirical evidence for the
value of treating size and processing skill as a unitary construct. Three
goals were set out at the beginning. The first goal is to establish the theo-
retical basis of the lexical facility construct, with particular attention to
the combination of the knowledge (size) and speed dimensions. The sec-
ond is to provide empirical evidence for the lexical facility construct as a
valid and reliable measure of individual differences in L2 vocabulary skill.
The third goal is to demonstrate how this measure correlates with out-
comes in various L2 performance domains.
This chapter focuses on the first aim—that is, establishing the concep-
tual basis for lexical facility. The construct is first defined and related to
existing research on word recognition skill. Lexical facility is approached
both as a vocabulary skill construct and as a measurement construct. As
a vocabulary skill construct, it represents lower-level word recognition
processes that play a crucial role in discourse comprehension skill. As a
measurement construct, it combines the three dimensions of vocabulary
size, recognition speed, and consistency as trait-like entities underlying
performance across contexts and uses. The lexical facility proposal repre-
sents a significant departure from the traditional practice in vocabulary
learning and assessment of treating knowledge and speed as independent
entities. The implications of the proposal are discussed, and the ways in
which the account differs from current directions in L2 vocabulary
research are highlighted.
The final part of the chapter introduces the research program that tests
the lexical facility proposal. The studies reported in Chaps. 6, 7, 8, 9, and 10
address the second and third goals of the book, that is, to provide evidence
for the validity and reliability of the lexical facility measures as indices of
L2 vocabulary skill and as correlates of performance in key L2 domains.
Lexical facility is about user vocabulary size and recognition skill, and the
role they play in L2 performance, individually and in combination. The
term facility denotes a basic capacity that the learner develops and has
available for use across a range of contexts. This facility emerges as the
with sentence- and text-level factors then influencing the selection of the
correct meaning from among the available alternatives (Kintsch 1998;
Liu 2009).
The importance of fluent word recognition cannot be overstated.
Recognizing words is a singular recurring cognitive activity in reading,
with individual differences in word recognition skills serving to separate
lower-proficiency learners from their more fluent counterparts (Perfetti
2007, p. 357). It is also one of the most observable differences between
advanced L2 readers and their L1 counterparts (Meara 2002; Koda 2005).
Individuals with smaller and slower vocabularies, that is, less lexical
facility, expend greater effort in identifying individual words and attempt-
ing to figure out unfamiliar words encountered in a text. This, in turn,
results in slower and less effective comprehension outcomes. These indi-
viduals will process fewer words, cover less text, and achieve a diminished
understanding compared to more fluent readers in the same amount of
time. The degree of lexical facility thus has a direct impact on the working
memory resources available for higher-level comprehension. Working
memory is the capacity to maintain previously encountered material
while simultaneously processing new material (Baddeley 2012). It is an
important determinant of text comprehension in particular and L2 learn-
ing and use in general (Juffs and Harrington 2011). For less fluent read-
ers, slower, less effective word recognition processes draw directly upon
available memory resources and limit the amount of working memory
available for executing the higher-level processes needed for successful
comprehension (Perfetti 1985). This is not the case for fluent L1 readers,
for whom word identification is generally automatic and assumed to play
a relatively indirect role in comprehension outcomes. Hannon (2012),
for example, links word identification efficiency to overall text
comprehension via sentence-level processing, where it combines with word
integration and syntactic processes that together determine text compre-
hension outcomes.
The word identification skills represented in the lexical facility con-
struct are an integral element of L2 vocabulary skill, with efficient word
identification processes being important predictors of fluent L2 reading
outcomes (Koda 1996, 2005; see also Wang and Koda 2005; Shiotsu
2009).
are not, for practical purposes, independent. All things being equal,
increases in depth are accompanied by an increase in breadth. As is the
case with vocabulary speed and size, learners with very ‘broad’ vocabular-
ies also have very ‘deep’ ones.
Lexical facility emerges from the individual’s experience with the lan-
guage. The user’s vocabulary size, speed, and consistency reflect the fre-
quency of exposure to the words in the language. Frequency of occurrence
is a strong predictor of when a word will be learned and an important
determinant of the strength of word representation in the mental lexi-
con (Ellis 2002, 2012). This strength develops as a result of successive
word retrieval events and the resulting multiple associations formed
with other words in the mental lexicon. It, in turn, predicts how quickly
a word will be accessed in use (Balota et al. 2006). Recent research also
suggests that word frequency statistics may be more than just a quantita-
tive notion. The frequency with which a word appears has been shown
to closely relate to the range of contexts in which it appears, not just to
the overall number of occurrences (Adelman et al. 2006; Raymond
and Brown 2012). Frequency thus serves as an indicator of how widely
a particular word is used, that is, as a measure of vocabulary depth.
the language task versus the features of the task and its context of use. A
behaviorist approach treats the underlying construct as synonymous with the
behavior required in a particular context: specifying the performance
context thus defines the construct. The interactionalist approach character-
izes proficiency as an interaction between what the learner knows and the
context of use (Chapelle 1998). The interactionalist approach has been
the dominant one in recent L2 testing, as it takes into account both what
the learner brings to the task and what the specific task demands
(Bachman and Palmer 2010). The third approach shifts the focus solely
to what the learner knows. A trait approach characterizes the individual’s
vocabulary knowledge independent of any specific context. Lexical facil-
ity is assumed to be trait-like in that vocabulary size and processing skill
are assumed to be learner-internal characteristics that can be measured
and usefully interpreted independently of the vocabulary demands of
specific contexts. The trait approach is appropriate for the lexical facility
account, given its narrow scope (word recognition size and speed), which
serves as a fundamental constraint on comprehension processes in a con-
sistent manner across a range of contexts of use.
The trait approach to vocabulary measurement is generally disfavored
in L2 vocabulary testing, given the complexity of vocabulary knowledge
and its context-sensitive nature (Chapelle et al. 2010). However, the nar-
row scope of lexical facility as a lower-level word recognition skill lends
itself to characterization as a probabilistic capacity, independent of
specific contexts (Laufer 2001). As such, lexical facility is characterized as
a trait measurement construct: a frequency-based, objective index against
which recognition vocabulary skill can be measured and compared across
learners and settings (Kempe and MacWhinney 1996).
How well it does this will be evaluated in the research studies reported
in Part 2.
times are not independent. Relatively less time is spent processing the six
words in the second and third frame presentations. Also, and crucially,
the response time measures reflect the average speed of item completion.
They are not a direct measure of word recognition speed, though Zhang
and Lu (2013, p. 8) use that term interchangeably with response time.
Rather, the times are measures of word and meaning-matching speeds
that involve the consideration of alternatives before a response is given.
As a result, the mean response time values are very high. In both studies,
response times for the 3K level ranged from 8 seconds to beyond 15 sec-
onds. In contrast, in L2 word recognition studies in which individual
word recognition is measured, values typically range from 500 millisec-
onds to 1500 milliseconds (Segalowitz and Segalowitz 1993; van Gelderen
et al. 2004).
Of more direct relevance to the current work, a small number of stud-
ies have used the Timed Yes/No format to examine the relationship
between vocabulary size and mean recognition speed (mnRT). The stud-
ies differ in aims, settings, and participants, but all report a correlation
between vocabulary size and recognition speed when size and mnRT
measures are collected on the same words (i.e., Harrington 2006;
Harrington and Carey 2009; Shiotsu and Read 2009; Pellicer-Sánchez
and Schmitt 2012).
A precursor to the lexical facility proposal is Harrington (2006), who,
in examining the development of L2 word recognition automaticity, reported
a systematic correlation between vocabulary size, recognition speed, and
consistency (CV) across frequency and proficiency levels for university-age
English learners. Harrington and Carey (2009) also found that
vocabulary size and recognition speed predicted placement levels in an
English language program. This research will be examined more closely
in Part 2.
In contrast, possibly the only study that reported no correlation
between vocabulary size and speed is Miralpeix and Meara (2010). The
study examined vocabulary knowledge in L2 English students at a Spanish
university using separate size and speed tests. Vocabulary size was mea-
sured using the X_Lex (1K–5K) and Y_Lex (6K–10K) tests, and mean
response speed measured using an independent lexical access test that
required the individual to judge whether a presented word was animate
4.8 Conclusions
The lexical facility account is distinctive among L2 vocabulary approaches
in that it combines vocabulary size and recognition speed. The proposed
combination of size and processing skill (recognition time and consis-
tency) is at odds with the traditional psychometric approach to measure-
ment, which treats the two as independent dimensions of behavior. There
are a number of reasons for the practice, both in testing theory in general
and in L2 vocabulary assessment in particular, but treating size and speed
as independent entities may also obscure a basic underlying relationship,
given the time-contingent nature of L2 knowledge.
Central to the lexical facility account is the proposal that the combina-
tion of recognition speed and consistency with vocabulary size provides a
combined measure of greater sensitivity to group and individual differ-
ences than vocabulary size alone. Evidence for this effect validates the
account and has broader implications for theory and practice in L2
vocabulary instruction and assessment.
These will be taken up in Part 2, where empirical evidence for the lexi-
cal facility proposal is presented. The studies reported all use the Timed
Yes/No Test. In the next chapter, the test format and methodology are
described.
References
Adelman, J. S., Brown, G. D. A., & Quesada, J. F. (2006). Contextual diversity,
not word frequency, determines word-naming and reading times. Psychological
Science, 17(9), 814–823.
Akamatsu, N. (2008). The effects of training on automatization of word recog-
nition in English as a foreign language. Applied PsychoLinguistics, 29(2),
175–193. doi:10.1017/S0142716408080089.
Andrews, S. (Ed.). (2006). From inkmarks to ideas: Current issues in lexical pro-
cessing. Hove: Psychology Press.
Andrews, S. (2008). Lexical expertise and reading skill. In B. H. Ross (Ed.), The
psychology of learning and motivation: Advances in research and theory (Vol. 49,
pp. 247–281). San Diego: Elsevier.
Bachman, L. (1990). Fundamental considerations in language testing. Oxford:
Oxford University Press.
Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice: Developing
language assessments and justifying their use in the real world. New York: Oxford
University Press.
Baddeley, A. (2012). Working memory: Theories, models, and controversies.
Annual Review of Psychology, 63, 1–29.
Balota, D. A., Yap, M. J., & Cortese, M. J. (2006). Visual word recognition: The
journey from features to meaning (a travel update). In M. J. Traxler & M. A.
Gernsbacher (Eds.), Handbook of psycholinguistics (2nd ed., pp. 285–375).
Amsterdam: Elsevier.
Bell, L. C., & Perfetti, C. A. (1994). Reading skill: Some adult comparisons. Journal
of Educational Psychology, 86(2), 244–255. doi:10.1037/0022-0663.86.2.244.
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies.
Cambridge: Cambridge University Press.
Catalán, R. M. J. (Ed.). (2013). Lexical availability in English and Spanish as a
second language (Vol. 17). Dordrecht: Springer.
Chapelle, C. A. (1998). Construct definition and validity inquiry in SLA
research. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between second
language acquisition and language testing research. Cambridge: Cambridge
University Press.
Read, J., & Chapelle, C. A. (2001). A framework for second language vocabu-
lary assessment. Language Testing, 18(1), 1–32.
Schmitt, N. (2010). Researching vocabulary. A vocabulary research manual.
Basingstoke: Palgrave Macmillan.
Schnipke, D. L., & Scrams, D. J. (2002). Exploring issues of examinee behav-
ior: Insights gained from response-time analyses. In C. N. Mills, M. Potenza,
J. J. Fremer, & W. Ward (Eds.), Computer-based testing: Building the founda-
tion of future assessments (pp. 237–266). Hillsdale: Lawrence Erlbaum
Associates.
Segalowitz, N. (2010). Cognitive bases of second language fluency. New York:
Routledge.
Segalowitz, N., & Freed, B. (2004). Context, contact and cognition in oral flu-
ency acquisition: Learning Spanish in at home and study abroad contexts.
Studies in Second Language Acquisition, 26(2), 173–199. doi:10.1017/
S0272263104262027.
Segalowitz, N., & Segalowitz, S. J. (1993). Skilled performance, practice and
differentiation of speed-up from automatization effects: Evidence from sec-
ond language word recognition. Applied Psycholinguistics, 14(3), 369–385.
doi:10.1017/S0142716400010845.
Segalowitz, N., Segalowitz, S. J., & Wood, A. G. (1998). Assessing the develop-
ment of automaticity in second language word recognition. Applied
Psycholinguistics, 19(1), 53–67.
Shiotsu, T. (2009). Reading ability and components of word recognition speed:
The case of L1-Japanese EFL learners. In Z. Han & N. J. Anderson (Eds.),
Second language reading research and instruction: Crossing the boundaries
(pp. 15–39). Ann Arbor: University of Michigan Press.
Shiotsu, T., & Read, J. (2009, November). Extending the yes/no test as a measure
of the English vocabulary knowledge of Japanese learners. Paper presented at The
measurement of L2 lexical development colloquium, Annual Conference of
the Applied Linguistics Association of Australia, Brisbane.
Sternberg, S. (1998). Inferring mental operations from reaction time data: How
we compare objects. In D. N. Osherson, D. Scarborough, & S. Sternberg
(Eds.), An invitation to cognitive science: Methods, models, and conceptual issues
(Vol. 4, pp. 436–440). Cambridge, MA: MIT Press.
Ullman, M. T. (2005). A cognitive neuroscience perspective on second language
acquisition: The declarative/procedural model. In C. Sanz (Ed.), Mind and
context in adult second language acquisition: Methods, theory, and practice
(pp. 141–178). Washington, DC: Georgetown University Press.
Aims
5.1 Introduction
This chapter describes the Timed Yes/No Test, the online assessment tool
used to measure lexical facility in the studies reported in Part 2. Lexical
facility consists of three dimensions: vocabulary size, mean recognition
time (mnRT), and recognition speed consistency, as captured in the coef-
ficient of variation (CV). The size measure is based on the number of
words (hits) recognized minus pseudowords incorrectly recognized as
words (false alarms). Various formulas have been proposed to combine
hit and false alarm performance. These are described and evaluated. The
use of mnRT as a proficiency measure distinguishes the test (and the
Yes/No Test format presents items one at a time and collects both yes/no
and recognition times. The format differs from other L2 vocabulary tests
in the selection of items, the response format, and the scoring procedure.
These are described next.
Test Items
The range of frequency bands sampled can vary. The X_Lex test sam-
ples items from the 1K–5K range, while a version assessing knowledge of
lower-frequency words, Y_Lex (Meara and Miralpeix 2006), tests items
in the 6K–10K range. The proficiency level of the cohort being tested
affects what bands are selected for inclusion. Tests including low-
frequency bands (e.g., 7K–10K) run the risk of being too difficult for
beginner learners, while using too narrow a range (e.g., 1K–3K) may result
in more advanced learners performing at ceiling. At the same time, a
spread of frequency bands allows a greater range of learners to be tested
and compared. The band ranges used in the studies reported later range
from the 1K to the 10K band, with four bands (2K, 3K, 5K, and 10K)
and five bands (1K, 3K, 5K, 7K, and 9K) used, depending on the study.
The range used represents a trade-off between the aim of the study, the
proficiency range of the participants, and the time and resources available
for testing.
There are four kinds of responses possible for the two item types and two
response alternatives. A schematic diagram of the four is given in Fig. 5.1.
The item response matrix is based on the signal detection model of
decision-making (Green and Swets 1966). There are two kinds of correct
responses: 'yes' responses to actual word items (hits) and 'no' responses to
pseudoword items (correct rejections).

Fig. 5.1 The four response types by item type:

                        Item type
    Response type       Word                   Pseudoword
    YES                 correct ('hit')        incorrect ('false alarm')
    NO                  incorrect ('miss')     correct ('correct rejection')
1. Hits minus false alarms (H-FA). Adjusts the total score to reflect guessing, but is not very precise: H-FA = h − f
2. Correction for blind guessing (cfbg). Incorporates the correct rejection rate to account for 'blind' guessing (Anderson and Freebody 1983; Meara and Buxton 1987). Blind guessing assumes the respondent either knows the word or is guessing at random: cfbg = (h − f) / (1 − f)
3. Δm. A Signal Detection Theory approach to correction for guessing (Meara 1992, cited in Huibregtse et al. 2002). Does not take into account response style, and tends to underestimate scores: Δm = ((h − f) − f) / ((1 − f) − h)
4. ISDT. Assumes sophisticated guessing and takes into account individual response style (Huibregtse et al. 2002): ISDT = 1 − [4h(1 − f) − 2(h − f)(1 + h − f)] / [4h(1 − f) − (h − f)(1 + h − f)]
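By way of illustration, the formulas whose forms are given unambiguously above (H-FA, cfbg, and ISDT) reduce to a few lines of Python; this is a minimal sketch, the function names are ours, and h and f stand for the hit and false-alarm proportions.

    def score_h_fa(h, f):
        """Hits minus false alarms."""
        return h - f

    def score_cfbg(h, f):
        """Correction for blind guessing."""
        return (h - f) / (1 - f)

    def score_isdt(h, f):
        """ISDT: sophisticated guessing, adjusting for individual response style."""
        numerator = 4 * h * (1 - f) - 2 * (h - f) * (1 + h - f)
        denominator = 4 * h * (1 - f) - (h - f) * (1 + h - f)
        return 1 - numerator / denominator

    # Example: a test-taker with 80% hits and 10% false alarms
    h, f = 0.80, 0.10
    print(score_h_fa(h, f))   # 0.70
    print(score_cfbg(h, f))   # ~0.78
    print(score_isdt(h, f))   # ~0.70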
recall test than with the time-adjusted approach. Ultimately, the study
found no clear advantage for the reaction-time approach or for any one of the
established scoring formulas. Regardless, the logic of the Pellicer-Sánchez
and Schmitt (2012) study accords with that of the lexical facility proposal,
namely that speed of recognition reflects stronger word knowledge.
The research to date on Yes/No Test scoring has failed to identify a
formula that is clearly superior across the range of testing contexts, test-
taker samples, and performance outcomes encountered (Beeckmans et al.
2001, p. 272). Pellicer-Sánchez and Schmitt (2012) raise the possibility
of using adaptive scoring, in which the formula that appears to be the
most sensitive to the pattern of responses made by the individual test-
taker is used. The practical considerations would be significant but, as the
authors note, not insurmountable; however, it is not clear whether the
increased sensitivity that might come from using the various formulas
would be worth the effort and expense. It would also present difficulties
in comparing performance across individuals.
The empirical studies reported in Part 2 use the H-FA formula to score
test performance across a range of test-takers and testing domains. Of
course, if false-alarm rates are low, or even zero, the hits provide a usable
measure on their own. However, even in this case, the hit rate as a reflec-
tion of an individual’s ‘true’ vocabulary knowledge might still be an over-
estimate (the test-taker did some guessing) or an underestimate (the
test-taker did not select some known words). It is important to emphasize
that the vocabulary size measure that the test yields is a probabilistic esti-
mate serving as an indirect measure of vocabulary size, given the adjust-
ment for guessing involved. The usefulness of the testing approach lies not
in identifying an individual's absolute vocabulary size, but in pro-
ducing a relative measure of vocabulary size that can meaningfully dis-
criminate among proficiency and performance levels.
In addition to the vocabulary size measure, the test also collects recogni-
tion time for each item presented. A mean recognition time (mnRT) is
calculated for all the words correctly identified (hits). The mnRT and its
standard deviation (SD) are then used to estimate recognition speed con-
sistency. This is expressed in the coefficient of variation (CV), which is
the ratio of the SD of the mnRT to the mean RT (SDmnRT/mnRT). These
processing skill measures are examined individually and in combination
with the size measures as indices of proficiency.
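As a brief sketch (in Python, on invented RT sets for two hypothetical test-takers), the CV is a single ratio that distinguishes a variable performer from a consistent one:

    import statistics

    def cv(hit_rts):
        """Coefficient of variation: SD of the hit RTs over their mean (SDmnRT/mnRT)."""
        return statistics.stdev(hit_rts) / statistics.mean(hit_rts)

    varied = [900, 1400, 650, 1800, 1100]   # hypothetical less consistent test-taker (ms)
    steady = [520, 560, 540, 580, 555]      # hypothetical more consistent test-taker (ms)

    print(round(cv(varied), 2))   # 0.38: slower, less consistent recognition
    print(round(cv(steady), 2))   # 0.04: faster, more consistent recognition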
A perfect negative correlation between vocabulary size and recognition time (−1.0) would indicate that the measures are redun-
dant, that size can be perfectly predicted by RT and vice versa. Perfect
correlations, of course, do not happen, so the interest is in the direction
and relative strength of the relationship. A strong positive correlation
indicates a systematic speed-accuracy trade-off, while a strong negative
correlation is consistent with the lexical facility account. A weak, or no,
correlation shows no systematic trade-off but does not preclude it
entirely. It would also provide little support for the lexical facility
proposal.
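Assuming per-participant VKsize and mnRT scores are available, the direction of the relationship can be checked with a simple Pearson correlation; the data below are invented for illustration.

    import statistics

    vksize = [0.42, 0.55, 0.61, 0.70, 0.82, 0.90]   # hypothetical size scores
    mnrt = [1450, 1320, 1180, 990, 870, 760]        # hypothetical mean RTs (ms)

    # Pearson r (statistics.correlation requires Python 3.10+)
    r = statistics.correlation(vksize, mnrt)
    print(round(r, 2))  # strongly negative here (about -0.99), consistent
                        # with the lexical facility account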
Two contrasting instruction sets for a yes/no test of Dutch vocabulary (Eyckmans 2004) illustrate the difference:

Brief version: 'Tick the words you know. Some of the words in the list do not exist in Dutch.'

Detailed version: 'Tick the words you know the meaning of. When in doubt, do not tick the item. Notice that some of the words in the list do not exist in Dutch. After completing this test, you will be asked to translate some of the words of the list.'
In this version, the test-taker is told to tick only the words for which
they know the meaning, to the degree that they can supply a transla-
tion. Also, the individual is explicitly cautioned against guessing. The
1. Description of the difference between words and pseudowords. It should be clear that
the pseudowords look like possible words (i.e., are phonologically 'legal') but do not exist
in the language.
2. Criteria for responding 'yes'. What it means to know a word should be specified clearly.
3. Criteria for scoring. It should be specified that there is a penalty for guessing: incorrect
'yes' responses to pseudowords (false alarms) lower the overall score.
5. Any follow-up activities. If appropriate, planned or even possible follow-up activities can
be specified (e.g., 'you may be tested on some of the items after the test').
Fig. 5.4 Elements of the instruction set for the Timed Yes/No Test
Procedure
context independent versus context dependent, and low stakes versus high
stakes. These dimensions affect how the test scores are interpreted and
used.
The second task dimension concerns the nature of the vocabulary being
tested. Selective versus comprehensive measures differ according to the
range of vocabulary used in the assessment. Selective tests incorporate a
set of target items from a text and the test-taker is tested on just these
items. A comprehensive measure, in contrast, takes into account all
vocabulary content in the test material. Knowledge of individual words is
not assessed; rather, overall vocabulary use is rated and a judgment made
as to the individual’s relative level of vocabulary mastery. In Read’s terms,
the Timed Yes/No Test is a selective test because it consists of a set of
items drawn from frequency lists. However, as with comprehensive mea-
sures, the focus lies not on whether specific items are known, but rather
on the proportion of items known at different frequency levels. In this
way, the items are representative of frequency bands and are assessed as such.
In principle, items can be sampled at random from matched frequency
levels and yield the same measure, despite being different items. Therefore,
although the Timed Yes/No Test is selective in that a set of target items at
different frequency levels are tested, it is also comprehensive in that per-
formance on the target items is assumed to represent the proportion of
words known at the frequency level in question. For example, a score of
80% on 20 words sampled from the 2K band is interpreted as evidence
that the test-taker knows 800 of the words in the 2K band. It does not
target knowledge of specific words.
The fourth dimension involves the perceived importance of the test out-
comes. To date, the Timed Yes/No Test has been administered in set-
tings and for purposes where there is relatively little at stake for the
test-taker. The results reported in Part 2 were collected in research proj-
ects and pilot testing programs that did not have an immediate bearing
on the test-taker's course of study or future goals, making the test low
stakes. Although low stakes to date, the Timed Yes/No Test can poten-
tially be used to complement high-stakes testing functions related to
university entrance and placement decisions, as illustrated in Chaps. 8
to 10.
5.7 Conclusions
The Timed Yes/No Test is an online assessment tool that is used to test the
lexical facility construct. The test format has a number of features that
distinguish it from other approaches to L2 vocabulary testing. These fea-
tures and the motivation for their use were explained. Two features
received particular attention: the use of pseudowords as a control for
guessing, and the use of recognition speed and consistency as
vocabulary knowledge measures. These features raise theoretical and
methodological challenges for the research reported in Part 2.
The central aim of the empirical research program is to establish the
sensitivity of the three measures, individually and in combination, to
learners’ performance differences. The combined effect of the measures
will be evaluated using composite scores, which is another distinctive fea-
ture of the approach. The studies in Part 2 test the lexical facility proposal
by examining Timed Yes/No Test performance as a predictor of outcomes
in various domains of academic English in both ESL and EFL settings.
References
Akamatsu, N. (2008). The effects of training on automatization of word recog-
nition in English as a foreign language. Applied Psycholinguistics, 29(2),
175–193. doi:10.1017/S0142716408080089.
Anderson, R. C., & Freebody, P. (1981). Vocabulary knowledge. In J. T. Guthie
(Ed.), Comprehension and teaching: Research reviews (pp. 77–117). Newark:
International Reading Association.
Anderson, R. C., & Freebody, P. (1983). Reading comprehension and the assess-
ment and acquisition of word knowledge. In B. Huston (Ed.), Advances in
reading/language research (Vol. 2, pp. 231–256). Greenwich: JAI Press.
Beeckmans, R., Eyckmans, J., Janssens, V., Dufranne, M., & Van de Velde, H.
(2001). Examining the yes/no vocabulary test: Some methodological issues
in theory and practice. Language Testing, 18(3), 235–274.
Cameron, L. (2002). Measuring vocabulary size in English as an additional lan-
guage. Language Teaching Research, 6(2), 145–173.
Davies, M. (2008). The Corpus of Contemporary American English: 450 million words,
1990–present. Available online at http://corpus.byu.edu/coca/
Eyckmans, J. (2004). Learners’ response behavior in Yes/No vocabulary tests. In
H. Daller, M. Milton, & J. Treffers-Daller (Eds.), Modelling and assessing
vocabulary knowledge (pp. 59–76). Cambridge: Cambridge University Press.
Fender, M. (2008). Spelling knowledge and reading development: Insights from
Arab ESL learners. Reading in a Foreign Language, 20(1), 19–42.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics.
New York: Wiley.
Harsch, C., & Hartig, J. (2015). Comparing C-tests and yes/no vocabulary size
tests as predictors of receptive language skills. Language Testing, 33(4),
555–575.
Heitz, R. P. (2014). The speed-accuracy tradeoff: History, physiology, methodol-
ogy, and behavior. Frontiers in Neuroscience, 8, 150.
Huibregtse, I., Admiraal, W., & Meara, P. (2002). Scores on a yes-no vocabulary
test: Correction for guessing and response style. Language Testing, 19(3),
227–245.
Jiang, N. (2013). Conducting reaction time research in second language studies.
London/New York: Routledge.
Keuleers, E., & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword gen-
erator. Behavior Research Methods, 42(3), 627–633. doi:10.3758/BRM.42.3.627.
Segalowitz, N., Segalowitz, S. J., & Wood, A. G. (1998). Assessing the develop-
ment of automaticity in second language word recognition. Applied
Psycholinguistics, 19(1), 53–67.
Siakaluk, P. D., Buchanan, L., & Westbury, C. (2003). The effect of semantic
distance in yes/no and go/no-go semantic categorization tasks. Memory &
Cognition, 31(1), 100–113.
Sternberg, S. (1998). Inferring mental operations from reaction time data: How
we compare objects. In D. N. Osherson, D. Scarborough, & S. Sternberg
(Eds.), An invitation to cognitive science: Methods, models, and conceptual issues
(Vol. 4, pp. 436–440). Cambridge, MA: MIT Press.
Thoma, D. (2009). Strategic attention in language testing. Metacognition in yes/no
business English vocabulary test. Frankfurt: Peter Lang.
Waters, G. S., & Caplan, D. (2003). The reliability and stability of verbal work-
ing memory measures. Behavior Research Methods, Instruments, and Computers,
35(4), 550–564. doi:10.3758/BF03195534.
Part 2
Introduction
1.1 Overview
Part 1 introduced the lexical facility construct. Lexical facility combines
size and processing skill as a unitary second language (L2) vocabulary
skill construct. The challenges arising from combining the two were
acknowledged, but the case was made for treating the two as a unitary
construct, both due to the time-contingent nature of L2 vocabulary
knowledge and the potential utility of combining knowledge and skill as
a measurement tool to characterize individual and group differences in
L2 proficiency and performance.
Part 2 provides empirical evidence for the account. It presents a set of
studies that investigate the lexical facility measures (vocabulary knowl-
edge, mean recognition time, and consistency) as reliable indices of indi-
vidual differences in L2 vocabulary skill, separately and in combination
(Chap. 6). The sensitivity of the measures to performance differences in
selected domains of academic English performance is then examined.
These domains consist of university entry standards (Chap. 7), perfor-
mance on the International English Language Testing System (IELTS;
Chap. 8), language program placement (Chap. 9), and general and aca-
demic English classroom performance (Chap. 10). A summary chapter
that identifies the main findings is also included (Chap. 11). The data
presented here are drawn from published and unpublished research by
the author and colleagues. In the final chapter (Chap. 12), the implica-
tions for L2 vocabulary teaching, learning, and testing are discussed,
including the potential for incorporating time measures into models of
L2 vocabulary acquisition and L2 theory more generally.
1. compares the three measures of lexical facility (VKsize, mnRT, and CV) as
stable indices of L2 vocabulary skill;
2. evaluates the sensitivity of these measures individually and as composites to
differences in a range of academic English domains; and, in doing so,
3. establishes the degree to which the composite measures combining the
VKsize measure with the mnRT and CV measures provide a more sensi-
tive measure of L2 proficiency differences than the VKsize measure alone.
alarms) is used to adjust the final score. The computerized test format
presents the test items individually in a randomized order for each test-
taker. The test records test-takers’ yes/no responses and the time they take
to recognize each item. These responses are used to calculate individual
and composite measures of lexical facility. These are described next.
Individual Measures
Composite Measures
In all the studies, the responses are initially examined for factors that may
affect the outcomes, independent of the research variables of interest.
These potentially compromising factors are both general to quantitative
measurement research and specific to the use of the Timed Yes/No Test
format. The raw findings are examined for instrument reliability, exces-
sive false-alarm rates, the occurrence of outliers, and a potential speed–
accuracy trade-off in responses.
Outliers are responses that are either too fast or too slow to reflect the cognitive process of interest.
Random finger presses, lapses of attention, or external distractions can all
contribute to responses that do not reflect the word recognition process.
Outliers are identified here using an absolute value approach in which
response times faster than 300 milliseconds and slower than 5000 milli-
seconds are the low and high cut-off values (Jiang 2013). The high cut-off
value is the item time-out value for the test program and any response
beyond this time is automatically discarded. The time-out value is set at
5000 milliseconds to accommodate the lower-proficiency test-takers in
several of the studies. The data points that fell below the low cut-off of
300 milliseconds are simply removed. These involved only a handful of
data points in any given study, well below 1% of the data.
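A minimal sketch of this absolute-value screening (in Python, using the cut-off values given above; the names are ours):

    LOW_CUTOFF_MS = 300    # faster responses removed as preemptive presses
    HIGH_CUTOFF_MS = 5000  # the test program's item time-out value

    def screen_rts(rts):
        """Keep only response times within the 300-5000 ms window."""
        return [rt for rt in rts if LOW_CUTOFF_MS <= rt <= HIGH_CUTOFF_MS]

    raw_rts = [212, 640, 820, 1150, 4999, 5000]
    print(screen_rts(raw_rts))  # [640, 820, 1150, 4999, 5000]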
In each study, the descriptive results are first presented, followed by the
inferential statistics used to test the sensitivity of the measures.
Descriptive Statistics The means, SDs, and confidence intervals (CIs) for
the lexical facility measures are reported in all studies. The value of the CI
as a statistical measure for both descriptive and inferential statistics is
being increasingly recognized in L2 research (Larson-Hall and Herrington
2010; Larson-Hall and Plonsky 2015). The CI is a range of values that is
likely to contain the true population mean. A bootstrapped
(see below) 95% CI is reported in all the studies, meaning that
there is a 95% chance that the interval between the lower- and upper-bound
values contains the true mean. A lack of overlap in the CIs
of two mean values indicates a statistically significant difference between
them.
Group mean differences are tested using t-tests for comparisons involv-
ing two groups and an analysis of variance (ANOVA) for comparing
more than two groups when the relevant assumptions are met. The t-tests
and ANOVA assume that the data are normally distributed and exhibit
homogeneity of variance; that is, the SDs of the samples are approxi-
mately equal. The data were tested for the key assumptions of normality
and equality of variance, which were generally, but not always, met.
Where variance assumptions are not met for the standard ANOVA,
Welch’s ANOVA is used for the omnibus test and the Games–Howell test
for any follow-up pairwise comparisons (Tabachnick and Fidell 2013).
The studies here use bootstrapping for calculating mean CIs. Bootstrapping
provides a more robust way to deal with non-normally distributed
data than the use of nonparametric tests, particularly for smaller sample
sizes (Larson-Hall and Herrington 2010).
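For illustration, a bootstrapped 95% CI for a mean can be obtained with SciPy's bootstrap routine (available from SciPy 1.7 on), here applied to simulated mnRT scores; the BCa method matches the intervals reported in the studies, and the data are invented.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    mnrt_scores = rng.normal(loc=750, scale=120, size=40)  # simulated mnRTs (ms)

    # BCa bootstrap CI for the mean
    result = stats.bootstrap((mnrt_scores,), np.mean,
                             confidence_level=0.95, method='BCa',
                             random_state=rng)
    print(result.confidence_interval)  # lower- and upper-bound values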
Interpreting the Effect Size The other element of sensitivity is the effect
size, which is the strength of the measure as a discriminator of criterion-
level differences. In correlation and regression analyses, the effect size is
calculated directly as the r-value. This value squared, R2 (also called the
coefficient of determination), represents the amount of variance in the
criterion variable (e.g., the proficiency levels) attributable to differences
in the predictor variable.
For the tests of group mean differences, standardized effect sizes are
calculated separately: Cohen's d for the t-tests and eta-squared (η2) for
the ANOVAs (Fritz et al. 2011). The relative importance of the
observed effect sizes is interpreted using a recently introduced scale for
interpreting the r- and d-values in L2 research (Plonsky and Oswald 2014,
p. 899). The scale revises upward the widely used benchmarks suggested
in Cohen (1988). Benchmark values for the interpretation of d are small
(d = .40), medium (d = .70), and large (d = 1.00). Plonsky and Oswald
(2014) note that these values pertain to between-group contrasts, with
pre-post and within-group contrasts requiring larger effect sizes. The
benchmarks for these contrasts are small (d = .60), medium (d = 1.00), and
large (d = 1.40). Between-group contrasts involving proficiency levels are of
primary interest in the studies presented in the following chapters, but
within-group contrasts will also be relevant when comparing performance
over item frequency bands. The corresponding benchmarks for r are small
(r = .25), medium (r = .40), and large (r = .60).
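As a sketch (in Python; the function names and data are ours), Cohen's d for a two-group contrast and its interpretation against the Plonsky and Oswald (2014) between-group benchmarks can be computed as follows.

    import statistics

    def cohens_d(a, b):
        """Cohen's d using a pooled standard deviation."""
        na, nb = len(a), len(b)
        var_a, var_b = statistics.variance(a), statistics.variance(b)
        pooled_sd = (((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)) ** 0.5
        return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

    def benchmark(d):
        """Plonsky and Oswald (2014) benchmarks for between-group contrasts."""
        d = abs(d)
        if d >= 1.00:
            return 'large'
        if d >= 0.70:
            return 'medium'
        if d >= 0.40:
            return 'small'
        return 'below small'

    group_a = [0.82, 0.78, 0.90, 0.85, 0.74]  # hypothetical VKsize scores
    group_b = [0.61, 0.55, 0.70, 0.64, 0.58]
    d = cohens_d(group_a, group_b)
    print(round(d, 2), benchmark(d))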
The following presents the empirical evidence for the lexical facility
proposal. Chapters 6, 7, 8, 9, 10, and 11 report on a series of empirical
studies, and the final chapter, Chap. 12, discusses the implications and
way forward for the lexical facility account.
References
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.).
Hillsdale: Lawrence Erlbaum.
Field, A. (2009). Discovering statistics using SPSS (3rd ed.). London: Sage.
Fritz, C. O., Morris, P. E., & Richler, J. J. (2011). Effect size estimates: Current
use, calculations, and interpretation. Journal of Experimental Psychology:
General, 141(1), 2–18. doi:10.1037/a0024338.
Jiang, N. (2013). Conducting reaction time research in second language studies.
London/New York: Routledge.
Larson-Hall, J., & Herrington, R. (2010). Improving data analysis in second
language acquisition by utilizing modern developments in applied statistics.
Applied Linguistics, 31(3), 368–390.
Larson-Hall, J., & Plonsky, L. (2015). Reporting and interpreting quantitative
research findings: What gets reported and recommendations for the field.
Language Learning, 65(S1), 127–159.
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes
in L2 research. Language Learning, 64, 878–912. doi:10.1111/lang.12079.
Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics (6th ed.).
Boston: Pearson.
6
Lexical Facility as an Index of L2 Proficiency
Aims
6.1 Introduction
This chapter presents the first of seven studies that evaluate lexical facility
as a second language (L2) vocabulary construct. The study examines the
sensitivity of the lexical facility measures to group proficiency differences.
The focus is on three student groups that represent distinct populations
of English users at an Australian university. One is a group of English L2
students studying in a preuniversity language program, and the other two
are English L2 and first language (L1) university students studying in the
arts faculty. The sensitivity of the three lexical facility measures (vocabu-
lary size, mean recognition speed, and recognition speed consistency) to
group differences is examined for each measure individually and in com-
bination. Sensitivity refers to both how well the measures discriminate
between the groups and the strength of the observed differences.
The study tests a claim central—but not unique—to the lexical facility
account, namely that vocabulary size and recognition speed are strong cor-
relates of proficiency differences. The lexical facility account further pro-
poses that consistency in recognition speed, as indexed by the coefficient of
variation (CV), can complement these two dimensions as a reliable index
of recognition vocabulary skill. Of the three measures, vocabulary size has
been shown to be a particularly robust correlate of proficiency. The focus is
on whether the combination of size with speed and consistency scores
results in a more sensitive measure of group differences than size alone.
Vocabulary knowledge is measured using the Timed Yes/No Test. The
format was described in Chap. 5. A defining feature of the Yes/No Test
format is the use of words drawn from a range of frequency-of-occurrence
levels. A basic assumption is that test performance will systematically
decrease as word frequency decreases. The lower frequency a word has,
the less likely it will be known, and if known, the slower it will be recog-
nized. Construct validity for the test format thus depends on showing
that this predicted frequency effect holds. This is done by demonstrat-
ing that the lexical facility measures are sensitive to differences in word
frequency levels in a manner analogous to that shown for the group dif-
ferences. Establishing the validity of this feature of the test format is
important for the lexical facility proposal itself, as the account makes the
fundamental assumption that vocabulary skill development is input
driven, with a word’s frequency of occurrence a fundamental predictor of
when and how well it will be learned.
Each participant took a written version of the Timed Yes/No Test. The
test measures English vocabulary knowledge and recognition speed by
eliciting a yes/no decision as to whether a presented item is known. Word
items are drawn from four frequency-of-occurrence bands. Pseudowords
are also included as a control for guessing.
The test contained 90 words and 60 pseudowords for a total of 150
items. The word items were taken from the Vocabulary Levels Test (VLT)
introduced in Chap. 1 (Schmitt et al. 2001). The test includes 18 items
from each of four frequency-of-occurrence bands comprising the 2000
(2K), 3000 (3K), 5000 (5K), and 10,000 (10K) most frequently occur-
ring words in English. Also used are 18 words from the Academic Word
List, a set of words frequently used in academic texts (Coxhead 2000). The
latter were included in an earlier study by Mochida and Harrington
(2006) but are not examined here, given the focus on frequency as a pre-
dictor of vocabulary skill. The pseudowords included to control for guess-
ing all conform to English orthographic and phonological rules.
The test yields a score of vocabulary knowledge, VKsize, that approxi-
mates the individual’s vocabulary size. This measure is the proportion of
correct responses to the frequency-graded word items (also referred to as
‘hits’) minus the proportion of incorrectly identified pseudowords (‘false
alarms’). As the false-alarm rate is used to correct the overall number of
hits, VKsize is an indirect measure of size. Recognition speed is measured
by calculating the mean recognition time, mnRT, for the correctly recog-
nized hits. The mnRT score is reported in milliseconds (1000 millisec-
onds = 1 second). The third lexical facility measure is the coefficient of
variation (CV), which reflects the consistency of recognition time
performance as measured by the mnRT. The lexical facility account is the
first to examine the CV as a useful index of L2 vocabulary development.
The CV is a single value that reflects the relationship between the stan-
dard deviation (SD)—a measure of the variability of the response times
in the set—and the mean response time itself. It is the ratio of the SD of
the mnRT to the mnRT itself (SDmnRT/mnRT).
The sensitivity of the three individual measures to group proficiency
differences is first investigated for each measure separately and then com-
pared with that of the composite scores. Two composite measures are
Test instrument reliability was first calculated to ensure the Timed Yes/
No Test provided a consistent measure of L2 vocabulary knowledge.
Cronbach’s alpha analyses were carried out on item performance on the
words and pseudowords to establish the internal reliability of the test
(Beeckmans et al. 2001). All results in this study showed satisfactory reliability
for both yes/no responses and item RTs, with coefficients in the high .80s to low .90s
range (Plonsky and Derrick 2016). Reliability coefficients for the CV
were not calculated, as the measure is derived from the RT means and SD
and is not analyzable at the item level.
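Cronbach's alpha itself is straightforward to compute; a minimal sketch (in Python with NumPy, on an invented participants-by-items response matrix):

    import numpy as np

    def cronbach_alpha(scores):
        """Cronbach's alpha for a (participants x items) score matrix."""
        scores = np.asarray(scores, dtype=float)
        k = scores.shape[1]                           # number of items
        item_vars = scores.var(axis=0, ddof=1).sum()  # sum of item variances
        total_var = scores.sum(axis=1).var(ddof=1)    # variance of total scores
        return (k / (k - 1)) * (1 - item_vars / total_var)

    # Hypothetical yes/no responses (1 = yes, 0 = no): 5 test-takers x 4 items
    responses = [[1, 1, 1, 0],
                 [1, 0, 1, 0],
                 [1, 1, 1, 1],
                 [0, 0, 1, 0],
                 [1, 1, 0, 1]]
    print(round(cronbach_alpha(responses), 2))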
Individual item recognition times were screened for outliers. Only the
responses to word items correctly answered (hits) were included in the
screening. Outliers were defined as being 3
SDs beyond the mean for the individual participant. In the Harrington
(2006) analysis, RTs were trimmed at 2.5 SDs, a threshold challenged as
being too liberal for adjusting the mean responses (Hulstijn et al. 2009).
The data from that study were reanalyzed here at the 3 SD criterion. Item
responses at less than 300 milliseconds were first removed, as these were
too fast for an actual response. These responses reflected preemptive
guessing or keystroke errors and were rare, appearing in only a handful of
participant responses. As in most RT studies, the screening carried out
here focuses on identifying responses that are excessively slow. Participants
were told at the outset that there was a 5000 millisecond (5 second) limit
for responding to each item. The 5-second window allowed enough time
to complete the task but required attention on the part of the test-taker.
Timed-out responses were evident in about 10% of the participant
response sets. These sets typically had only one or two timed-out responses,
and these were at the beginning of the test. Individual item response
times for the correct hits beyond the 3 SD cut-off accounted for about 2% of responses: 1.6% for the L2 preuniversity group, 1.9% for the L2 university group, and 2.1% for the L1 university group. This compares with the outlier rate of 3% reported in
Harrington (2006), which used a 2.5 SD cut-off. These are low numbers
but are based on correct hits, meaning that for less proficient participants,
the total number can be small, for example, in the 20s for some of the
preuniversity participants. Also, slower mean RTs and associated SDs for
the slower participants are closer to the 5000-millisecond cut-off, and
that also decreases the number of potential outliers. The marginally higher
outlier rates for the L2 and L1 university groups reflect the relatively
faster means and SDs. Given the low rates, the outliers were not adjusted
for the statistical analyses.
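A sketch of the screening procedure for a single participant, under the thresholds described above (the function name and return format are mine):

    import numpy as np

    def screen_rts(rts_ms, floor_ms=300, sd_cut=3.0):
        rts = np.asarray(rts_ms, dtype=float)
        rts = rts[rts >= floor_ms]             # drop preemptive guesses/keystroke errors
        m, sd = rts.mean(), rts.std(ddof=1)    # participant's own mean and SD
        slow_outliers = rts > m + sd_cut * sd  # excessively slow responses (3 SDs)
        return rts, slow_outliers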
Excessive guessing can also compromise the reliability of the VKsize
score. Guessing is reflected by the proportion of the ‘yes’ responses to
pseudowords. This false-alarm rate is used to correct for guessing,
with higher false-alarm rates resulting in lower VKsize. It is of course
also possible to guess correctly on the word items (hits), but this is
not directly detectable in the format. The absolute false-alarm rate is
also important, as higher rates will make the size estimate less accu-
rate and more difficult to compare with findings from other studies.
False-alarm rates are given in Table 6.2. The rates varied across the
groups, with the L2 preuniversity group averaging over 20% and the
L2 and L1 university groups around 6% each. The difference between
the preuniversity group and the two university groups was statistically
significant.1
The error rate here compares with false-alarm rates of around 5% by
adult L1 English subjects (Ziegler and Perry 1998, p. 57), 6% by advanced
Dutch EFL subjects (Van Heuven et al. 1998), and 9% for French-
speaking learners of Dutch (Eyckmans 2004). Schmitt, Jiang and Grabe
(2011) eliminated all participants at higher than 10% but did not report
a mean false-alarm rate. The false-alarm rate here contrasts with much
larger rates, for example, over 20% reported in Beeckmans et al. (2001),
Cameron (2002), and the minimal instruction condition in Eyckmans
(2004). As evident here and in previous studies (see Chap. 3), false-alarm
rates decrease as proficiency increases, at least when the proficiency differ-
ences are relatively distinct.
Table 6.1 Bivariate correlations and 95% confidence intervals (within square brackets) for the three lexical facility measures (VKsize, mnRT, and CV) and two composite scores (VKsize_mnRT and VKsize_mnRT_CV)

                  VKsize           mnRT             CV               VKsize_mnRT
mnRT              .68 [.57, .77]   –
CV                .51 [.39, .62]   .62 [.45, .75]   –
VKsize_mnRT       .93 [.89, .96]   .89 [.83, .93]   .63 [.49, .74]   –
VKsize_mnRT_CV    .85 [.80, .90]   .87 [.82, .91]   .83 [.75, .90]   .95 [.93, .97]

Note: N = 110. All correlations significant at the p < .01 level (two-tailed). VKsize, correction for guessing scores (hits - false alarms); mnRT, mean recognition time in milliseconds; CV, coefficient of variation (SDmnRT/mnRT); 95% CI, BCa (bias-corrected and accelerated) confidence interval.
The raw data were also examined for a systematic trade-off between
accuracy and speed by the participants. While some trade-off is expected,
a systematic bias toward one or the other dimensions should be avoided
(Heitz 2014). Evidence for such a trade-off would suggest strategic biases
on the part of the test-takers that obscure the actual state of the underly-
ing vocabulary skill. The size and speed data showed no evidence for such
a systematic trade-off (see Table 6.1). The Pearson correlation between
the VKsize score and the inverted mnRT was almost .7, indicating a
strong positive relationship between size and speed. Participants who recognized more words also did so more quickly. These values are similar to
those reported in Laufer and Nation (2001, p. 19). The positive correla-
tions suggest that mnRTs and the VKsize scores both measure similar
underlying proficiency.
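The trade-off check reduces to a correlation between the two per-participant scores. A minimal illustration with simulated, not the study's, data:

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(1)
    vksize = rng.uniform(0.3, 0.9, 110)                    # hypothetical VKsize scores
    mnrt = 2000 - 1200 * vksize + rng.normal(0, 150, 110)  # faster as size grows

    r, p = pearsonr(vksize, -mnrt)  # mnRT inverted so higher = faster
    print(f"size-speed r = {r:.2f} (p = {p:.3g})")
    # A strong positive r argues against a systematic speed-accuracy trade-off;
    # a negative r (slower responders scoring higher) would signal one.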
The preliminary analysis sets the stage for the presentation of the
results and a discussion of the research findings.
Descriptive Results
Results for the three individual lexical facility measures across the three
groups are presented in Table 6.2. VKsize scores are reported in percent-
ages, mnRT in milliseconds, and CV in proportions. Confidence inter-
vals (CIs) for the means are also given. The lower and upper CI values are
the range within which the true mean of the population can be found
95% of the time. CIs are particularly useful for comparing the relative
difference between two means. The less overlap there is between the two
sets of CIs, the more likely the means come from different underlying
population distributions. Bootstrapped CIs are used throughout to pro-
vide more robust estimates of the mean differences. The 95% CIs are
based on the bias-corrected and accelerated (BCa) method using 1000
samples (Larson-Hall 2016).
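The same style of BCa interval can be reproduced in standard tools; a minimal SciPy sketch with simulated group RTs (the book's analyses were run elsewhere):

    import numpy as np
    from scipy.stats import bootstrap

    rng = np.random.default_rng(0)
    rts = rng.lognormal(mean=6.9, sigma=0.25, size=40)  # hypothetical group mnRTs (ms)

    res = bootstrap((rts,), np.mean, confidence_level=0.95,
                    n_resamples=1000, method='BCa', random_state=0)
    print(res.confidence_interval)  # BCa 95% CI for the group mean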
Performance on all three measures systematically improved across the
three proficiency levels. For the VKsize scores, the preuniversity group
had the lowest mean at around 35%; the L2 university group had 70%
and the L1 university group, 85%. A similar pattern was evident in the
mnRT scores. The preuniversity group (mean = 1660 milliseconds) was
much slower than the L2 university (960 milliseconds) and the L1 baseline group (770 milliseconds). The CV results also reflected the respec-
tive proficiency level but were more evenly spaced out, with group means
of .45, .36, and .25, respectively. The mnRT value for the L2 university
group was midrange of the means reported by Segalowitz and Segalowitz
(1993) for their fast and slow response groups. In that study, only high-
frequency vocabulary items were used, in contrast to the spread of fre-
quency levels examined here. The mnRT value for the L1 university
group was toward the upper range of L1 English subjects reported in
Ratcliff et al. (2004).
To facilitate comparison, the lexical facility means reported in Table 6.2
are converted to standard scores and plotted by the three groups in the
bar chart in Fig. 6.1. For all calculations, the mnRT and CV values are inverted so that, as with the VKsize scores, higher scores reflect better performance for all
three measures. The standard scores are calculated by transforming all
three measures into z-scores, averaging them, and then adding 5 to each
to eliminate negative values, in the same way as the composite scores were calculated.

Fig. 6.1 Lexical facility measures by groups (standard score = z-score + 5)
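A sketch of the standardization just described, with simulated per-participant scores (the inversion and the +5 shift follow the text):

    import numpy as np
    from scipy.stats import zscore

    rng = np.random.default_rng(2)
    vksize = rng.uniform(0.2, 0.9, 110)  # hypothetical raw scores
    mnrt = rng.normal(1100, 350, 110)
    cv = rng.uniform(0.2, 0.5, 110)

    z_size = zscore(vksize, ddof=1)
    z_rt = zscore(-mnrt, ddof=1)         # inverted: higher = faster
    z_cv = zscore(-cv, ddof=1)           # inverted: higher = more consistent

    standard = {m: z + 5 for m, z in
                [('VKsize', z_size), ('mnRT', z_rt), ('CV', z_cv)]}
    vksize_mnrt = (z_size + z_rt) / 2 + 5            # two-measure composite
    vksize_mnrt_cv = (z_size + z_rt + z_cv) / 3 + 5  # three-measure composite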
There is a consistent pattern for all three measures across the three
proficiency groups. The overlap of the CIs for the means within the groups indicates little difference in performance across the three measures. At the same time, the total lack of overlap between the groups shows
that the measures are consistently sensitive to proficiency differences.
Composite scores involving all three were calculated to gauge the effi-
cacy of the combined measures in accounting for the differences of inter-
est. These are also presented in Table 6.1. The pattern of performance on
the composite measures mirrors that observed for the individual mea-
sures, an overall pattern predictable because the composites are made up
of constituent individual scores. The main interest is in how the compos-
ite measures compare with the individual scores regarding sensitivity to
group differences.
The mean performance for all the individual and composite measures
improved as the group proficiency increased, and the CIs indicate that
the observed differences are statistically significant. The level of significance
and the magnitude of the related effect sizes are established in a series of
one-way analyses of variance (ANOVAs).2
The sensitivity of a given measure reflects whether it yields differences
that are statistically significant at the conventional p < .05 level and that
have an effect size that reaches a recognized level of impact. The effect size
for the omnibus ANOVAs is eta-squared (η2). The ‘real-world’ interpreta-
tion of η2 is based on Plonsky and Oswald (2014, p. 889), with .06 being
small, .16 medium, and .36 large. The effect size for the post hoc com-
parisons is Cohen’s d (Lenhard and Lenhard 2014). It is interpreted as .40
being small, .70 medium, and 1.0 large (Plonsky and Oswald 2014).
Note that these values are all larger than the widely used benchmarks first
proposed in Cohen (1988).
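Both indices are easily derived from raw group scores. A sketch (the function names are mine; the Hedges-corrected d used for the post hoc comparisons is reported in Table 6.4):

    import numpy as np

    def eta_squared(groups):
        # SS_between / SS_total for a one-way design
        all_x = np.concatenate(groups)
        grand = all_x.mean()
        ss_between = sum(len(g) * (np.mean(g) - grand) ** 2 for g in groups)
        return ss_between / ((all_x - grand) ** 2).sum()

    def hedges_g(x1, x2):
        # Cohen's d with Hedges' small-sample correction (unequal ns)
        n1, n2 = len(x1), len(x2)
        sp = np.sqrt(((n1 - 1) * np.var(x1, ddof=1) +
                      (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2))
        d = (np.mean(x1) - np.mean(x2)) / sp
        return d * (1 - 3 / (4 * (n1 + n2) - 9))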
The data set was first examined to see if it met the assumptions of the
one-way ANOVA procedure. There was no evidence of univariate outli-
ers.3 Four of the 15 group-measure conditions did not meet the normality assumption, as assessed by the Shapiro–Wilk test (p < .05).4 Although not all the conditions met the assumption, the one-way ANOVA is considered robust to some deviation from normality (Maxwell and Delaney 2004). Results from Levene's test showed that only the CV measure met the homogeneity of variance assumption (p > .05). To accommodate this, Welch's ANOVA is used, an
ANOVA procedure considered more robust to data with unequal vari-
ances (Moder 2010). Also, the post hoc comparisons use the Games–
Howell test, which also does not assume equal variances.
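A sketch of this pipeline, assuming the pingouin package (not the software used here) and simulated data with group mnRT means patterned on the values reported above:

    import numpy as np
    import pandas as pd
    import pingouin as pg

    rng = np.random.default_rng(3)
    df = pd.DataFrame({
        'group': ['preuni'] * 30 + ['L2uni'] * 50 + ['L1uni'] * 30,
        'mnRT': np.concatenate([rng.normal(1660, 400, 30),  # unequal variances
                                rng.normal(960, 200, 50),
                                rng.normal(770, 150, 30)]),
    })

    print(pg.welch_anova(dv='mnRT', between='group', data=df))           # omnibus test
    print(pg.pairwise_gameshowell(dv='mnRT', between='group', data=df))  # post hoc pairs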
Test Results
The significance level and effect size findings for the five univariate
ANOVAs are given in Table 6.3. Post hoc tests comparing performance
by group pairs were subsequently carried out for all the omnibus tests and
are reported in Table 6.4.
All five univariate ANOVAs were statistically significant at p < .001.
The η2 values showed that the effect size for all five measures was strong,
accounting for over 50% of the group variance for the VKsize and CV
measures and around 75% for the mnRT, VKsize_mnRT, and VKsize_
mnRT_CV measures. The CIs for the mnRT η2 show that the measure is
significantly stronger than both the VKsize and CV measures. The lexical
facility account assumes that vocabulary size is primary, both in being the
first of the three elements to develop and in being the strongest predictor
of differences between proficiency levels. That is not the case for the
groups here, as the recognition speed differences are a much stronger
overall predictor than the other two measures. The sensitivity of mnRT
performance to group differences is further examined in the pairwise
comparisons of mean differences. The individual and composite mea-
sures and associated d effect sizes are presented in Table 6.4. Cohen’s d
with Hedge’s correction measure was used, given the unequal sample sizes
(Lenhard and Lenhard 2014).
Table 6.3 Proficiency-level study. One-way ANOVAs for individual and composite lexical facility measures as discriminators of English proficiency levels

                  df           F*       η2    95% CI for η2
VKsize            (2, 56.51)   56.41    .54   [.37, .61]
mnRT              (2, 57.06)   169.48   .76   [.67, .83]
CV                (2, 66.88)   59.44    .51   [.37, .62]
VKsize_mnRT       (2, 51.38)   148.33   .73   [.64, .79]
VKsize_mnRT_CV    (2, 51.38)   238.86   .75   [.66, .81]

Note: *All significant at p < .0005 (two-tailed, assuming unequal variances). VKsize, correction for guessing scores (hits - false alarms); mnRT, mean response time in milliseconds; CV, coefficient of variation (SDmnRT/mnRT).
Table 6.4 Proficiency-level study. Post hoc comparisons for individual and composite measures, VKsize, mnRT, CV, VKsize_mnRT, and VKsize_mnRT_CV

                                 Mean difference*   d      95% CI for d
VKsize
  L2 university–preuniversity    37.39              1.72   [1.16, 2.28]
  L1 university–L2 university    13.58              1.05   [.58, 1.53]
  L1 university–preuniversity    50.97              2.68   [2.05, 3.31]
mnRTa
  L2 university–preuniversity    693                2.67   [2.01, 3.32]
  L1 university–L2 university    185                1.18   [.70, 1.66]
  L1 university–preuniversity    878                4.82   [3.92, 5.73]
CV
  L2 university–preuniversity    .086               1.03   [.52, 1.53]
  L1 university–L2 university    .114               1.46   [.96, 1.96]
  L1 university–preuniversity    .200               2.60   [1.98, 3.23]
VKsize_mnRT
  L2 university–preuniversity    1.37               2.30   [1.69, 2.91]
  L1 university–L2 university    .56                1.33   [.84, 1.82]
  L1 university–preuniversity    1.94               4.41   [3.56, 5.25]
VKsize_mnRT_CV
  L2 university–preuniversity    1.22               2.30   [1.69, 2.92]
  L1 university–L2 university    .65                1.39   [.89, 1.88]
  L1 university–preuniversity    1.87               5.38   [4.33, 6.2]

Note: *All values significant at p < .0005; Games–Howell test, unequal variances assumed. aRaw values given; differences calculated on mnRT(log). VKsize, correction for guessing scores (hits - false alarms); mnRT, mean response time in milliseconds; CV, coefficient of variation (SDmnRT/mnRT); 95% CI, 95% confidence interval.
L2 Preuniversity and L2 University The pattern of effect sizes for the two
L2 groups was similar to that of the ANOVA results. The mnRT measure
had the largest effect size (around 2.7), compared with VKsize (1.7) and
CV (1). However, unlike the omnibus analysis, there was some overlap
across the VKsize and mnRT CIs. There was no overlap evident for the mnRT and CV measures, suggesting that they tap different aspects of performance.
L2 University and L1 University The effect sizes for the three individual
measures were similar, with the CV being the highest in absolute terms.
However, the significant overlap across the CIs for all three measures
indicates little difference in strength. This implies that the greater effect
for mnRT evident in the omnibus ANOVA is attributable to differences
between the preuniversity group and the other two groups.
Table 6.5 Proficiency-level study. Medians (Mdn), interquartile ranges (IQR), and 95% confidence intervals for the hits, mean response time (in milliseconds), and coefficient of variation by frequency levels and groups

                 Hits                          mnRT                         CV
                 Mdn    IQR  95% CI            Mdn   IQR  95% CI            Mdn   IQR  95% CI
2K
 Preuniversity   83.51  16   [79.00, 87.00]    1265  490  [1078, 1364]      .405  .27  [.325, .470]
 L2 university   100    8    –                 752   137  [733, 791]        .249  .14  [.199, .280]
 L1 university   100    8    –                 706   124  [666, 743]        .179  .13  [.151, .215]
 Total           92.50  14   [92.25, 97.50]    768   297  [750, 810]        .251  .20  [.194, .288]
3K
 Preuniversity   66.00  18   [63.00, 71.00]    1518  396  [1332, 1577]      .410  .18  [.390, .460]
 L2 university   92.00  14   –                 807   259  [777, 908]        .304  .22  [.261, .339]
 L1 university   100    13   –                 746   129  [702, 769]        .212  .15  [.173, .249]
 Total           88.00  28   [85.00, 92.00]    827   510  [794, 906]        .299  .20  [.270, .340]
5K
 Preuniversity   42.50  22   [34.00, 52.00]    1738  671  [1488, 1909]      .515  .30  [.370, .580]
 L2 university   80.00  16   [76.00, 81.00]    957   372  [890, 1052]       .391  .22  [.303, .445]
 L1 university   92.50  13   [92.50, 92.50]    762   117  [732, 779]        .203  .10  [.186, .236]
 Total           80.00  39   [73.50, 85.00]    910   637  [846, 1040]       .299  .29  [.267, .364]
10K
 Preuniversity   31.50  24   [21.00, 41.00]    2200  922  [1948, 2422]      .425  .24  [.348, .445]
 L2 university   51.50  23   [47.00, 56.00]    1092  599  [993, 1303]       .426  .23  [.348, .454]
 L1 university   80.25  14   [76.00, 86.75]    874   182  [836, 898]        .263  .18  [.222, .286]
 Total           53.50  42   [47.00, 65.25]    1059  897  [973, 1277]       .352  .23  [.314, .399]

Note: Hits, correct responses to words; mnRT, mean response time to hits in milliseconds; CV, coefficient of variation (SDmnRT/mnRT); Mdn, median; IQR, interquartile range; 95% CI, BCa 95% confidence interval; 2K = 2000; 3K = 3000; 5K = 5000; 10K = 10,000.

Sensitivity of the Lexical Facility Measures to Frequency Levels
The VKsize score is calculated by subtracting the false-alarm rate from the hits (‘yes’ to words) overall. As such, it is not possible to break
the VKsize performance down to frequency levels. Instead, the propor-
tion of hits at each frequency level will be used as a measure of vocabulary
knowledge. These are uncorrected scores and, as such, will represent some
over- and underestimation of the individual’s actual vocabulary size.
However, it is assumed that this will be reasonably consistent across the
frequency levels (see Mochida and Harrington 2006) such that the relative differences that emerge will be a valid test of the frequency assumption.
The mnRT and CV measures can be identified by levels and will be used,
as in the earlier analyses. Results for the three measures are set out by
frequency levels and groups in Table 6.5. The hit results depart markedly
from a normal distribution, given that the uncorrected hit scores reach
ceiling for the L2 and L1 university groups for the high-frequency word
conditions. As a result, nonparametric statistics will be used to test for
differences between the frequency levels.
Fig. 6.2 Median proportion of hits and 95% confidence intervals for lexical facil-
ity measures by frequency levels and groups
Fig. 6.3 Median individual mnRT and 95% confidence intervals for lexical facility
measures by frequency levels and groups
Fig. 6.4 Median coefficient of variation (CV) and 95% confidence intervals for
lexical facility measures by frequency levels and groups
6.6 Findings for Study 1
The Wilcoxon signed-rank tests showed significant differences between adjacent frequency levels, except for the difference between the 2K and 3K levels. The effect size used for
the Wilcoxon signed-rank test is r (Field 2009, p. 550). The r values for
the hits and mnRTs were medium to large (.4–.6). The CV, in contrast,
was negligible at .17. The results reported in Table 6.6 are based on the
combined performance of the three groups. Given the differing profiles
exhibited by the L1 group, it is possible that an analysis using only the L2
groups would result in larger effect sizes. This analysis was run and showed
no differences in the significance findings and only slight changes in the
r sizes, and these changes went in both directions.
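A sketch of this style of nonparametric follow-up, with simulated hit proportions; r is recovered from the Wilcoxon z using the r = z/√(N × 2) formula applied in these analyses:

    import numpy as np
    from scipy.stats import friedmanchisquare, norm, wilcoxon

    rng = np.random.default_rng(4)
    n = 80
    base = rng.uniform(0.5, 0.95, n)  # hypothetical per-person skill
    bands = {k: np.clip(base - drop + rng.normal(0, 0.05, n), 0, 1)
             for k, drop in [('2K', 0.00), ('3K', 0.05), ('5K', 0.25), ('10K', 0.45)]}

    chi2, p = friedmanchisquare(*bands.values())  # omnibus test across levels
    print(f"Friedman chi2 = {chi2:.2f}, p = {p:.3g}")

    stat, p = wilcoxon(bands['2K'], bands['3K'])  # follow-up signed-rank test
    z = norm.isf(p / 2)                           # |z| from the two-sided p
    print(f"2K-3K: p = {p:.3g}, r = {z / np.sqrt(2 * n):.2f}")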
A basic assumption of the Yes/No Test format and the lexical facility
construct is that the frequency with which a word is used will predict how
soon and how well it is learned. It is a probabilistic assumption that
underpins input-driven learning more widely, and one that is supported
in the hits and mnRT results. The sensitivity of the two measures to
frequency-level differences thus provides construct validity for both the
test format and the lexical facility proposal, though the modest effect sizes
must be acknowledged. The CV measure also showed some sensitivity to
frequency differences, but to a much lesser extent.
6.7 Conclusions
Study 1 showed that all three measures discriminate between the profi-
ciency levels, and that the magnitude of the differences was substantial.
All three measures discriminated between all group levels. Effect sizes
were larger for the individual mnRT and composite measures, suggesting
that a combination of size and speed results in a more sensitive measure
than size alone. However, the effect size differences were not statistically
significant. The results also indicate that frequency-of-occurrence levels
serve as stable indices of vocabulary proficiency. Together, these findings
show that the lexical facility construct is a useful index of L2 vocabulary
skill. In the next chapter, the sensitivity of the lexical facility measure to
differences in English university entry standards is examined. The group proficiency differences in that study are less pronounced than those examined here and will thus provide a more stringent test of the lexical facility proposal.
Notes
1. The false-alarm data depart markedly from a normal distribution, as some
participants had few or no false alarms. A Kruskal–Wallis test was run
to test for the equality of the group false-alarm means. There was a signifi-
cant difference between the groups, χ2 = 18.18, p < .001, η2 = .82.
Follow-up Mann–Whitney tests showed that the difference between the
preuniversity and L2 university groups was significant at U = 289.50,
p < .001, d = .94 (Lenhard and Lenhard 2014).
2. The use of a multivariate ANOVA (MANOVA) is motivated in concep-
tual terms, as the three measures are all assumed to be elements of the
lexical facility construct. However, the data departed significantly from a multivariate normal distribution, so separate univariate ANOVAs were used instead.
References
Beeckmans, R., Eyckmans, J., Janssens, V., Dufranne, M., & Van de Velde, H.
(2001). Examining the yes/no vocabulary test: Some methodological issues
in theory and practice. Language Testing, 18(3), 235–274.
Cameron, L. (2002). Measuring vocabulary size in English as an additional lan-
guage. Language Teaching Research, 6(2), 145–173.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.).
Hillsdale: Lawrence Erlbaum.
Schmitt, N., Schmitt, D., & Clapham, C. (2001). Developing and exploring
the behaviour of two new versions of the vocabulary levels test. Language
Testing, 18(1), 55–89. doi:10.1191/026553201668475857.
Schmitt, N., Jiang, X., & Grabe, W. (2011). The percentage of words known in
a text and reading comprehension. The Modern Language Journal, 95(1),
26–43. doi:10.1111/j.1540-4781.2011.01146.x.
Segalowitz, N., & Segalowitz, S. J. (1993). Skilled performance, practice and
differentiation of speed-up from automatization effects: Evidence from sec-
ond language word recognition. Applied Psycholinguistics, 14(3), 369–385.
doi:10.1017/S0142716400010845.
van Heuven, W. J. B., Dijkstra, T., & Grainger, J. (1998). Orthographic neigh-
borhood effects in bilingual word recognition. Journal of Memory and
Language, 39(3), 458–483. doi:10.1006/jmla.1998.2584.
Ziegler, J. C., & Perry, C. (1998). No more problems in Coltheart’s neighbor-
hood: Resolving neighborhood conflicts in the lexical decision task. Cognition,
68(2), B53–B62.
7
Lexical Facility and Academic English
Proficiency
Aims
7.1 Introduction
In Study 1, the lexical facility measures were shown to be highly sensitive
to proficiency differences across university English groups, both individ-
ually and in combination. This was evident in the discriminating power
of the measures and the magnitude of the observed group differences.
Greater effect sizes for the mean recognition time and the composites
indicate that a combination of size and speed yields a more sensitive mea-
sure than size alone.
This chapter presents the second of seven studies that establish lexical
facility as a valid and reliable measure of second language (L2) vocabulary.
University English entry standards differ greatly, but all are intended to ensure that, if met, the
student has the minimum language skills needed to start English-medium
study.
This study examines the lexical skills of international students com-
mencing study at a major Australian university. The participants have
demonstrated the minimum English proficiency needed to undertake
English-medium university study in different ways, but beyond that can
differ greatly in English skill. The sensitivity of the three lexical facility
measures (VKsize, mnRT, and the CV) to these group differences is mea-
sured using both written and spoken versions of the Timed Yes/No Test.
The use of a spoken version will allow the core characteristics of the lexi-
cal facility account to be tested in the absence of orthographic informa-
tion. The measures are examined individually and in combination, with
the central interest in whether the combination of the processing speed
measures (mnRT and/or CV) with VKsize will be more sensitive to group
differences than VKsize alone.
Each participant took both written and spoken versions of the Timed
Yes/No Test. Both versions used British National Corpus (BNC) items
sampled from the 1K, 3K, 5K, 7K, and 9K levels, as well as pseudowords
to control for guessing. Different items were used in the respective ver-
sions. The written version included 16 items per BNC level (80 word
items in total) plus 40 pseudowords, for a total of 120 items. The spoken
version included 12 items per level (60 word items) plus 36 pseudowords,
for a total of 96 items. The spoken version was shortened out of concern
for possible test-taker fatigue. The pseudowords all conform to English
orthographic and phonological rules.
The Timed Yes/No Test yields three individual measures. VKsize is a
measure of vocabulary knowledge size based on the proportion of the ‘yes’
responses to word items (hits) minus the proportion of ‘yes’ responses to
pseudowords (false alarms). The mnRT measure is the individual’s mean
recognition time for ‘yes’ responses to the word items (hits), and the CV is
a measure of the consistency of that recognition speed performance. It is
the ratio of the standard deviation (SD) of the mnRT to the mnRT itself
(SDmnRT/mnRT). From these individual measures, composite measures
are also calculated. VKsize_mnRT is the composite of vocabulary knowl-
edge calibrated in size and mean recognition speed. All three measures are
combined in VKsize_mnRT_CV. For group comparisons, the mnRT and
CV values were inverted, so higher values in the VKsize, mnRT, and CV
would all reflect better performance. The VKsize_CV is not calculated, as
the CV can only be interpreted as a proficiency index in combination with
mnRT. It is possible to be very slow and very consistent. The composite
measures were calculated by first converting the values to standardized z
scores and then averaging them. Because the conversion to standardized
scores results in negative scores, a value of 5 was added to each score to
make all the results positive and the presentation easier. Group member-
ship is the criterion measure for the statistical analysis.
Participants were tested individually or in small groups in a university
computer lab. The test was administered using LanguageMAP, a multimedia package for the assessment of language processing skill available at www.languagemap.com. Items were presented on the screen one at a time, and the test-taker responded ‘yes’ or ‘no’ to each.
Test reliability coefficients were calculated for the written and spoken
tests. Separate analyses were carried out for performance on the words
and pseudowords from each list. All Cronbach’s alpha tests had satisfac-
tory reliability, ranging from high .8 to mid .9 for the yes/no judgments
and item recognition time.
The data were screened for recognition time outliers for individual
items, following the procedures set out in the previous chapter. Responses
of less than 300 milliseconds were first removed, as these were deemed too
fast for an actual response. These exceedingly fast responses were assumed
to be performance errors arising from inadvertent keystrokes and similar response-external factors. They accounted for less than 1% of the total
responses. RT item responses were then log-transformed to reduce vari-
ability. Item response times for the correct hits beyond the 3 SD cut-off
were then identified. These accounted for well under 2% of the responses
across all the participants and were left intact for the analysis.
False-alarm rates are given in Table 7.1. The overall false-alarm rate for
the spoken test (24%) was almost double that of the written version
(14%). In both tests, the IELTS 6.5 group had the highest false-alarm rate
(18% and 30%, for written and spoken version, respectively), and the L1
English group the lowest (8% and 14%, respectively). The Singaporean
group had a slightly lower false-alarm rate than the L1 English group in
the written version, but was much closer to the other L2 groups in the
spoken one (24%). As for statistical significance, IELTS 6.5 and IELTS 7+
groups were both higher than the L1 English group, but the IELTS 6.5
group was not different from the Singaporean group, while the IELTS 7+
group was.1 Overall, the spoken false-alarm rates were high and variable.
A mean false-alarm rate of over 30% is very high and raises the ques-
tion as to what constitutes an excessive false-alarm rate. Other studies
have reported much lower false-alarm rates. Schmitt et al. (2012) removed
all participants with false-alarm rates over 10%, while Pellicer-Sánchez and
Schmitt (2012) reported exceptionally low false-alarm rates—a number
of participants had no false alarms at all. Trimming the data to approxi-
mate these false alarms reduces the sample size for both versions by close
to half. Doing so would clearly compromise the usefulness of the method
for use in typical assessment settings such as the one here. Nevertheless,
the high false-alarm rates do raise the issue of comparability with other
studies that report low, or even no, false alarms. The main statistical anal-
yses are carried out on all the participants to assess the robustness of the
lexical facility measures in the context of error-filled performance.
However, subsequent analyses are also carried out on the data sets in
which false-alarm rates are trimmed. The findings are discussed below.
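A sketch of the trimming step, with simulated false-alarm proportions (the cut-offs follow the text; the trimmed subsets are then resubmitted to the same ANOVAs):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(5)
    df = pd.DataFrame({'fa_rate': rng.beta(2, 8, 131)})  # hypothetical fa proportions

    for cut in (0.20, 0.10):
        sub = df[df['fa_rate'] <= cut]
        print(f"fa <= {cut:.0%}: n = {len(sub)}, "
              f"mean fa = {sub['fa_rate'].mean():.1%}")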
Table 7.1 University entry standard study: written and spoken test results. Pearson's correlations for the three individual measures (VKsize score, mnRT, and CV) and the two composite scores (VKsize_mnRT and VKsize_mnRT_CV)
Note: N = 132. All correlations significant at p < .0005. VKsize, correction for guessing scores (hits – false alarms); mnRT, mean response time in milliseconds; CV, coefficient of variation (SDmnRT/mnRT).
Descriptive Results
Table 7.2 contains the means, SDs, and CIs for the lexical facility mea-
sures for the five entry standard groups. Results for both the spoken and
written versions are given. VKsize scores are reported in percentages, the
mnRT in milliseconds (msec), and the CV as mean ratios. Table 7.3
presents the same findings for the composite scores. The written test
results are discussed first, then the scores for the spoken test.
Table 7.2 University entry standard study: written and spoken test results. Means (M), standard deviations (SD), and 95% confidence intervals (CI) for the lexical facility measures for the five English proficiency standard groups

                      False alarm        VKsize             mnRT (msec)       CV
                      M (SD)             M (SD)             M (SD)            M (SD)
                      [95% CI]           [95% CI]           [95% CI]          [95% CI]
IELTS 6.5      WRI    18.91 (14.80)      56.09 (17.19)      1446 (416)        .433 (.088)
(n = 54)              [15.05, 22.86]     [51.63, 60.40]     [1335, 1563]      [.411, .457]
               SPO    31.53 (15.85)      43.10 (19.49)      1595 (398)        .322 (.12)
                      [27.54, 35.74]     [37.73, 48.58]     [1512, 1699]      [.313, .356]
IELTS 7+       WRI    10.90 (9.15)       73.00 (11.53)      1280 (299)        .416 (.086)
(n = 25)              [7.67, 14.50]      [68.26, 77.37]     [1164, 1401]      [.375, .425]
               SPO    22.78 (12.57)      57.29 (16.75)      1416 (265)        .278 (.092)
                      [17.99, 27.91]     [68.26, 77.37]     [1164, 1401]      [.247, .309]
Malaysian      WRI    14.71 (12.44)      70.95 (17.83)      975 (206)         .459 (.121)
(n = 17)              [9.16, 20.96]      [61.83, 78.75]     [882, 1074]       [.402, .517]
               SPO    27.28 (11.83)      60.06 (12.44)      1379 (159)        .299 (.087)
                      [20.91, 34.26]     [61.83, 78.75]     [1264, 1505]      [.257, .329]
Singaporean    WRI    7.50 (9.05)        85.06 (11.72)      889 (193)         .432 (.114)
(n = 19)              [4.78, 12.36]      [79.23, 89.61]     [806, 976]        [.379, .481]
               SPO    23.68 (11.83)      68.59 (12.44)      1276 (159)        .314 (.087)
                      [18.15, 29.78]     [62.85, 74.13]     [1213, 1340]      [.278, .351]
English L1     WRI    7.81 (6.57)        84.76 (9.66)       960 (228)         .347 (.104)
(n = 16)              [5.53, 10.95]      [79.91, 89.21]     [853, 1047]       [.291, .399]
               SPO    14.23 (9.12)       81.70 (10.03)      1169 (122)        .222 (.065)
                      [5.53, 10.95]      [79.91, 89.21]     [853, 1047]       [.291, .399]
Total L2       WRI    14.69 (13.22)      66.66 (18.79)      1250 (404)        .433 (.089)
(N = 115)             [12.18, 17.53]     [63.17, 70.00]     [1179, 1326]      [.416, .451]
               SPO    27.74 (14.75)      52.82 (20.06)      1472 (341)        .313 (.083)
                      [25.19, 30.35]     [48.91, 56.62]     [1416, 1538]      [.297, .329]

Note: False alarms, ‘yes’ responses to pseudowords; VKsize, correction for guessing scores (hits – false alarms); mnRT, mean response time in milliseconds; CV, coefficient of variation (SDmnRT/mnRT); 95% CI, BCa (bias-corrected and accelerated) 95% confidence intervals; WRI, written test; SPO, spoken test.
Table 7.3 University entry standard study: written and spoken test results. Means (M), standard deviations (SD), and confidence intervals (CI) for the composite scores VKsize_mnRT and VKsize_mnRT_CV for the five English entry standard groups

                      VKsize_mnRT                VKsize_mnRT_CV
                      M (SD) [95% CI]            M (SD) [95% CI]
IELTS 6.5      WRI    4.36 (.75) [4.16, 4.57]    4.55 (.63) [4.40, 4.71]
(n = 55)       SPO    4.37 (.77) [4.15, 4.57]    4.43 (.61) [4.26, 4.58]
IELTS 7+       WRI    5.01 (.55) [4.79, 5.23]    5.04 (.59) [4.93, 5.39]
(n = 25)       SPO    5.03 (.66) [4.73, 5.31]    5.10 (.76) [4.81, 5.41]
Malaysian      WRI    5.39 (.63) [5.10, 5.66]    5.16 (.68) [4.83, 5.47]
(n = 17)       SPO    5.16 (.66) [4.85, 5.44]    5.12 (.60) [4.84, 5.40]
Singaporean    WRI    5.94 (.51) [5.73, 6.17]    5.61 (.67) [5.32, 5.92]
(n = 19)       SPO    5.57 (.45) [5.37, 5.77]    5.31 (.58) [5.04, 5.57]
English L1     WRI    5.81 (.48) [5.58, 6.03]    5.81 (.60) [5.51, 6.10]
(n = 16)       SPO    6.12 (.41) [5.92, 6]       6.00 (.48) [5.81, 6.28]
Total L2       WRI    4.91 (.88) [4.74, 5.07]    4.92 (.74) [4.79, 5.05]
(N = 116)      SPO    4.82 (.84) [4.66, 4.98]    4.82 (.73) [4.67, 4.94]

Note: VKsize_mnRT, ((zVKsize + zmnRT)/2) + 5; VKsize_mnRT_CV, ((zVKsize + zCV + zmnRT)/3) + 5; 95% CI, BCa confidence interval; WRI, written test; SPO, spoken test.
The group mean differences for the individual written test measures are
set out as follows, with ‘<’ indicating increasingly better performance.
The mean differences between groups in the same brackets are not statis-
tically significant. The results of the statistical tests are presented later:
Written VKsize: IELTS 6.5 < [Malaysian < IELTS 7+] < [Singaporean <
English L1]
Written mnRT: IELTS 6.5 < IELTS 7+ < [Malaysian < English L1] <
Singaporean
Written CV: [Malaysian < IELTS 6.5 < Singaporean < IELTS 7+] <
English L1
There is some variation in the orders for the respective measures. The
written VKsize scores cluster into three groups: the IELTS 6.5 group was
the lowest at around 55%; the IELTS 7+ and Malaysian groups were both
over 70%; and the Singaporean and L1 English groups at 85%. The
IELTS 6.5 group also had the slowest mnRT responses at around 1450
milliseconds, with the IELTS 7+ group next at around 1300 millisec-
onds. The mnRT responses for the other three groups ranged from the
lower 800 to the upper 900 milliseconds. The Singaporean group was the
fastest, significantly faster than even the L1 group, the latter not differing
significantly from the Malaysian group. The mnRTs for the Malaysian,
English L1 and Singaporean groups were similar to the overall mean for
the L2 university group in Study 1 (M = 960 milliseconds, SD = 203).
The L1 English group here was noticeably slower than the L1 group in
Study 1 (M = 777 milliseconds, SD = 200). No discernible pattern was
evident for the CV results. The Malaysian group had the least consistent
responses at .46 and the L1 English group the most consistent at .35, the
latter somewhat higher than the .25 for the L1 baseline group in Study 1.
The individual written test scores were combined into the two com-
posite scores. The group orders for the respective score are as follows:
Written VKsize_mnRT: IELTS 6.5 < [IELTS 7+ < Malaysian] < [Malaysian
< Singaporean < English L1]
Written VKsize_mnRT_CV: [IELTS 6.5 < IELTS 7+ < Malaysian < Singaporean] < English L1
The means for groups appearing in more than one bracket were not
significantly different from the other group(s) in the bracket, though
there were mean differences. The overlaps in all three measures show that
there was considerable variability in the outcomes. This variability will be taken up below. The group mean differences for the individual spoken test measures are as follows:
Spoken VKsize: IELTS 6.5 < [IELTS 7+ < Malaysian] < [Malaysian <
Singaporean] < English L1
Spoken mnRT: IELTS 6.5 < [IELTS 7+ < Malaysian] < [Malaysian <
Singaporean] < English L1
Spoken CV: IELTS 6.5 < [IELTS 7+ < Malaysian < Singaporean] < English
L1
The spoken VKsize scores were 10% to 15% lower across the groups.
The VKsize means were also more differentiated. The IELTS 6.5 group
was the lowest at 43%, the IELTS 7+ and Malaysian groups were around
60%, the Singaporean group just under 70%, and the L1 English group
over 80%. The spoken mnRT values were also 200–300 milliseconds slower than the written mnRT values. This included the L1 group, which was 200 milliseconds slower in recognizing the words in the spoken version, despite a very small drop in spoken VKsize scores (written: M = 85%, SD = 10.00; spoken: M = 82%, SD = 10.00). The CV scores differentiated between the IELTS 6.5 group as the highest (= least consistent), the other L2 groups in a single grouping, and the L1 English group as the lowest. The spoken mnRT scores were higher and had relatively less variability, resulting in lower spoken CV scores compared with
the written test results. Lower CV values are notionally more consistent,
but only when accompanied by faster mean recognition times. Very slow
responders can also be very consistent.
As was the case with the written test findings, the mean differences for
the spoken test range from the IELTS 6.5 group at one end to the L1
English group at the other, with the Singaporean group being very close
to the L1 group. The group orders for the two composite scores are as
follows:
Spoken VKsize_mnRT: IELTS 6.5 < IELTS 7+ < Malaysian < Singaporean
< English L1
Spoken VKsize_mnRT_CV: IELTS 6.5 < [IELTS 7+ < Malaysian] <
[Malaysian < Singaporean] < English L1
Both composite scores again show a continuum of performance marked by the IELTS 6.5 group at one end and the L1 English
group at the other. Unlike on the written test, the Singaporean group was
more similar to the other L2 groups than to the L1 group in the spoken
format.
The sensitivity of the measures, as reflected in statistical significance
and effect size, is examined next.
As in Study 1, the effect size for the omnibus ANOVAs is eta-squared (η2), with .06 being small, .16 medium, and .36 large. The effect size for the post hoc comparisons
is Cohen’s d, interpreted as .40 being small, .70 medium, and 1.0 large.
As noted, these values are all larger than the more commonly used values
proposed in Cohen (1988).
Before running the tests, the data were examined to ensure they met
the assumptions for the ANOVA procedure. No outliers were observed in
the written test results and only one in the spoken test data (for an IELTS
6.5 participant in the mnRT). The case was removed from the analysis.
The normality assumption was met for all conditions in both tests, save
for two, the Singaporean group’s VKsize responses in the written version
and the L1 English group's CV responses in the spoken one. Although not all the conditions met the assumption, the one-way ANOVA is considered robust to violations of normality (Maxwell and Delaney 2004). The homogeneity of variance assumption was met for all but two of the measures, the exceptions being the VKsize score in the written and spoken versions. Given the heterogeneity of variance,
the significant findings are confirmed by running Welch’s ANOVA for
the omnibus test and the Games–Howell test for the pairwise compari-
sons. Bootstrapping is also used for the pairwise comparisons to provide
a more robust test, given the differences in sample sizes, and the fact that
some of those data sets are borderline in meeting normality and variance
assumptions (Larson-Hall and Herrington 2009).
Test Results
The one-way ANOVAs were carried out for the respective written and
spoken tests. The significance and effect size findings for both versions are
given in Table 7.4. Results are reported for both the complete data set
and for subsets in which all participants with false-alarm rates of greater
than 20% are removed. Post hoc tests comparing performance by group
pairs were subsequently carried out for all the univariate ANOVA tests
for the complete data sets. These are reported in Tables 7.5–7.7.
All five univariate ANOVAs were statistically significant for both the
written and spoken tests. The η2 values for VKsize and mnRT results were
around the threshold for a strong effect for both versions. The VKsize,
Table 7.4 Entry standard study. One-way ANOVA for individual and composite lexical facility measures as discriminators of English proficiency groups

                                     df         F         η2    95% CI for η2
VKsize          WRITTEN all          (4, 126)   20.69**   .39   [.24, .52]
                20% fa trim          (4, 96)    14.64**   .38   [.23, .52]
                10% fa trim          (4, 75)    11.98**   .41   [.23, .57]
                SPOKEN all           (4, 126)   19.69**   .39   [.24, .52]
                20% fa trim          (4, 50)    11.63**   .48   [.32, .61]
mnRT            WRITTEN all          (4, 126)   19.94**   .39   [.24, .52]
                20% fa trim          (4, 96)    14.07**   .37   [.22, .51]
                10% fa trim          (4, 75)    11.98**   .41   [.23, .57]
                SPOKEN all           (4, 126)   14.54**   .32   [.19, .46]
                20% fa trim          (4, 50)    12.81**   .51   [.35, .62]
CV              WRITTEN all          (4, 126)   3.13*     .09   [.02, .20]
                20% fa trim          (4, 96)    2.59*     .09   [.01, .22]
                10% fa trim          (4, 75)    2.16      .11   [.01, .26]
                SPOKEN all           (4, 126)   7.85**    .20   [.08, .34]
                20% fa trim          (4, 50)    8.22**    .39   [.23, .53]
VKsize_mnRT     WRITTEN all          (4, 126)   31.89**   .50   [.37, .62]
                20% fa trim          (4, 96)    21.55**   .47   [.38, .58]
                10% fa trim          (4, 75)    15.34**   .47   [.32, .60]
                SPOKEN all           (4, 126)   29.69**   .49   [.36, .61]
                20% fa trim          (4, 50)    18.47**   .59   [.45, .71]
VKsize_mnRT_CV  WRITTEN all          (4, 126)   17.39**   .37   [.24, .50]
                20% fa trim          (4, 96)    10.83**   .31   [.16, .46]
                10% fa trim          (4, 75)    8.56**    .31   [.14, .48]
                SPOKEN all           (4, 126)   24.09**   .43   [.28, .58]
                20% fa trim          (4, 50)    18.30**   .59   [.46, .69]

Note: *p < .05; **p < .0005; VKsize, correction for guessing scores (hits - false alarms); mnRT, mean response time in milliseconds; CV, coefficient of variation (SDmnRT/mnRT); VKsize_mnRT, ((zVKsize + zmnRT)/2) + 5; VKsize_mnRT_CV, ((zVKsize + zCV + zmnRT)/3) + 5; 20% fa trim, data trimmed to exclude mean false-alarm rates above 20%.
Table 7.5 University entry standard group. Significant pairwise comparisons for the VKsize measure for written and spoken test results

                IELTS 6.5            IELTS 7+            Malaysian           Singaporean
VKsize WRITTEN
IELTS 7+        16.91**
                1.08 [.57, 1.58]
Malaysian       14.86**
                .86 [.29, 1.42]
Singaporean     28.97**              12.06**             14.11*
                1.81 [1.21, 2.40]    1.05 [.40, 1.67]    .95 [.26, 1.63]
English L1      28.76**              11.77**             13.81**
                1.81 [1.18, 2.44]    1.08 [.41, 1.75]    .96 [.23, 1.67]
VKsize SPOKEN
IELTS 7+        14.18**
                .77 [.28, 1.26]
Malaysian       16.96**
                .89 [.33, 1.45]
Singaporean     25.49**              11.31**
                1.41 [.85, 1.99]     .66 [.05, 1.27]
English L1      38.59**              24.41**             21.63**             13.10**
                2.16 [1.50, 2.82]    1.68 [.95, 2.40]    1.50 [.72, 2.27]    .89 [.20, 1.59]

Note: Each cell gives the mean difference (row group – column group means) over Cohen's d with its 95% CI; IELTS 6.5, students entering with IELTS 6.5 overall (n = 55); IELTS 7+, students entering with IELTS 7–7.5 overall (n = 25); Malaysian, students from Malaysian high school English (n = 17); Singaporean, students from Singaporean high school (n = 19); English L1, students educated in English L1 countries: the US, New Zealand, Canada, South Africa (n = 16); *p < .05; **p < .01 (two-tailed).
In the pairwise comparisons, the composite VKsize_mnRT measure again yielded larger effect sizes for the group means than VKsize alone. However, the differences are not statistically
reliable, as there is substantial overlap in the CIs for the respective mea-
sures. The individual CV measure had a lower effect size, accounting for
10% and 20% of the variance for the written and spoken test results,
respectively. When combined with the other two measures in the com-
posite VKsize_mnRT_CV measure, it yielded a smaller overall effect size
than that evident in the VKsize_mnRT measure. However, as noted, the
Table 7.6 University entry standard study. Significant pairwise comparisons for the mnRT and CV measures for written and spoken test results

                IELTS 6.5            IELTS 7+            Malaysian           Singaporean
mnRT WRITTEN
Malaysian       .163**               .116**
                1.39 [.80, 1.98]     1.14 [.48, 1.80]
Singaporean     .204**               .157**
                1.86 [1.22, 2.41]    1.62 [.94, 2.31]
English L1      .171**               .124**
                1.46 [.85, 2.06]     1.21 [.52, 1.89]
mnRT SPOKEN
IELTS 7+        .049*
                .63 [.19, 1.31]
Malaysian       .060*
                .76 [.22, 1.29]
Singaporean     .091**               .041*
                1.19 [.63, 1.75]     .62 [.00, 1.00]
English L1      .128**               .079**              .068*               .037*
                1.72 [1.10, 2.35]    1.21 [.53, 1.89]    1.18 [.44, 1.92]    .82 [.13, 1.51]
CV WRITTEN
English L1      .086**               .069*               .111**              .085*
                .94 [.36, 1.52]      .71 [.07, 1.36]     .99 [.26, 1.70]     .78 [.09, 1.47]
CV SPOKEN
IELTS 7+        .056**
                1.48 [.88, 2.07]
English L1      .112**               .055*               .072**              .091**
                1.49 [.88, 2.10]     .72 [.07, 1.36]     1.05 [.32, 1.77]    1.19 [.47, 1.91]

Note: Each cell gives the mean difference (row group – column group means) over Cohen's d with its 95% CI; IELTS 6.5, students entering with IELTS 6.5 overall (n = 55); IELTS 7+, students entering with IELTS 7–7.5 overall (n = 25); Malaysian, students from Malaysian high school English (n = 17); Singaporean, students from Singaporean high school (n = 19); English L1, students educated in English L1 countries: the US, New Zealand, Canada, South Africa (n = 16); *p < .05; **p < .01 (two-tailed).
Table 7.7 University entry standard study. Significant pairwise comparisons for composite VKsize_mnRT and VKsize_mnRT_CV measures for written and spoken test results

VKsize_mnRT_CV SPOKEN
                IELTS 6.5            IELTS 7+            Malaysian           Singaporean
IELTS 7+        .67**
                1.00 [.51, 1.50]
Malaysian       .69**
                1.14 [.56, 1.74]
Singaporean     .88**
                1.46 [.88, 2.03]
English L1      1.62**               .95**               .93**               .73**
                2.78 [2.06, 3.50]    1.46 [.76, 2.16]    1.74 [.93, 2.53]    1.57 [.81, 2.34]

Note: Each cell gives the mean difference (row group – column group means) over Cohen's d with its 95% CI; IELTS 6.5, students entering with IELTS 6.5 overall (n = 55); IELTS 7+, students entering with IELTS 7–7.5 overall (n = 25); Malaysian, students from Malaysian high school English (n = 17); Singaporean, students from Singaporean high school (n = 19); English L1, students educated in English L1 countries: the US, New Zealand, Canada, South Africa (n = 16); *p < .05; **p < .01 (two-tailed); X, comparison not significant in individual VKsize analysis. Shaded cells indicate that the d value is higher than that for the same comparison in the individual VKsize analysis.
Mean false-alarm rates were high in this study, particularly for the spoken test. In comparison, the overall rate for Study 1 was 10%. To assess
the possible effect of differences in false-alarm rates on the results, a sub-
sequent analysis was run on the written and spoken test data. In these
analyses, only those students who had rates of less than 20% were included
in the analysis. The 20% trim reduced the written test sample to n = 101
and the spoken test sample to n = 54, the latter less than half the size of
the original. The overall mean false-alarm rates fell as well. For the written
test results, the overall false-alarm rate reported in Table 7.2 (M = 14.67, SD = 13.22) halved as a result of the 20% trim (M = 7.87, SD = 5.69). A similar fall was evident for the spoken test results: overall (M = 27.74, SD = 14.75); 20% trim (M = 12.57, SD = 5.38). The data were again trimmed for false-alarm rates of less than 10%. This reduced the written test sample to n = 74 (with a false-alarm rate of M = 4.90, SD = 3.11) and the spoken test sample to an unanalyzable n = 14.
The results in Table 7.4 show that despite the trimming, the pattern of
significance levels does not change and there is only a slight improvement
in effect size. Pairwise comparisons similar to the ones reported next also
produced results similar to the original findings, though these are not
reported here.
Post hoc pairwise comparisons of complete data sets provide a more
detailed picture of the sensitivity of the measures. Performances on the
written and spoken versions are compared for each measure. Significant
pairwise comparisons for the individual VKsize scores for the two ver-
sions are presented in Table 7.5, with temporal variables reported in
Table 7.6.
There are two salient findings for the VKsize results, and they hold for
both the written and spoken tests. The first is the noticeable difference
between the IELTS 6.5 group and all the others. This is reflected in the
statistically significant differences and related effect sizes, which ranged
from a Cohen’s d of .9 for the written IELTS 6.5–Malaysian group differ-
ence to over 2 for the spoken IELTS 6.5–L1 English group comparison.
The IELTS 6.5 group clearly performed at a lower level. Another finding
of note is the similarity between the Singaporean and L1 English group
performance. This is particularly evident in the written test results, where
the two groups were nearly identical. Performance by the Singaporean group in the spoken version, in contrast, was closer to that of the other L2 groups.
The composite results bear on the proposal that the combination of the VKsize and mnRT measures will
yield a more sensitive measure of group differences than VKsize alone.
However, some of these differences were very small, for example, only .10 in the case of the Malaysian–Singaporean pair, and none were statisti-
cally significant, as evident in the overlap of the CIs. The inclusion of CV
in the VKsize_mnRT_CV composite did not improve the composite’s
sensitivity. Only half the composite comparisons (4 out of 8) in the written test results yielded stronger effect sizes, while only 3 out of 8 spoken comparisons showed an advantage for the VKsize_mnRT_CV
over the individual VKsize comparisons.
Fig. 7.1 University entry standard study. Mean proportion of hits by frequency
levels for written and spoken test results
For the hits, the pairwise differences were significant for the 1K–3K, 3K–5K, and 5K–7K bands, as well as for the overall 1K–9K bands. The r2
effect sizes ranged from .02 (written 7K–9K) to .32 (written and spoken
1K–9K). For the CV, the only significant pairwise comparisons were for
the 1K–9K comparison, with a negligible effect size of .03. Test details are
given in note 2.
The hit and mnRT results in both test versions support the frequency assumption and again support the construct validity of
the test. The mnRT results were slightly less sensitive than the hit mea-
sures to differences in band levels, but overall, both measures were strong
indicators of frequency-level differences. The CV measures were only sen-
sitive to the contrast of the most extreme values. And in these cases, the
effect sizes were quite low.
Fig. 7.2 University entry standard study. Mean response times by frequency lev-
els for written and spoken test results
Fig. 7.3 University entry standard study. Mean CV ratio by frequency levels for
written and spoken test results
Overall, the lexical facility measures reliably discriminated between the English L1 and the other entry groups. Post hoc analyses of the pairwise
differences between the groups showed increasingly better performance
in both test versions, along with an approximate group continuum of
IELTS 6.5 < IELTS 7+ < Malaysian < Singaporean < English L1. The dif-
ferences between the lowest group, IELTS 6.5, and the other four groups
were statistically significant for both test versions. The IELTS 7+ group
also performed significantly lower than the Malaysian, Singaporean, and
L1 English groups for the VKsize measure, but was at the same level as
the Malaysian group for the mnRT and CV responses. The Singaporean
group performed at the same level as the L1 English group in the written
version, but not in the spoken one. Effect sizes for significant differences
were in the moderate to mostly strong range. The largest effect size (d >
2) was for the IELTS 6.5–L1 English group comparisons in the two tests.
Consistent with the lexical facility account, the composite VKsize_mnRT
score produced larger effect sizes than the individual VKsize analysis on
both test versions. However, the support is only suggestive, as the observed
differences between the scores were not statistically significant. The inclu-
sion of the CV score in the VKsize_mnRT_CV composite score decreased
the sensitivity of the measure compared with the individual measures.
A main focus of the study was the comparison of written and spoken
test modes. The overall pattern of group differences was similar in the two
versions. However, the spoken test yielded lower VKsize scores and slower mnRTs, and false-alarm rates were also noticeably higher. All three outcomes may be due to the linear nature of the phonological stimuli and the absence of orthographic cues. The spoken CV values, in contrast, were lower: all the groups were slower but more consistent in the spoken version.
Mean false-alarm rates ranged as high as 30% in the data, raising the
question of how variability in these rates might affect the pattern of
results obtained, as well as the comparability of these findings with other
studies with lower rates. After the complete data set had been analyzed,
subsequent analyses were done in which the data were trimmed for false-
alarm rates exceeding 20% in the written and spoken versions, and 10%
in the written one. Although these trims significantly reduced the sample
sizes—by half for the spoken test data at the 20% rate—the resulting
analyses produced results highly comparable to the original analysis. The
test yields consistent results across different false-alarm rates.
The validity of word frequency statistics as an index of vocabulary
knowledge was also examined. The number of words identified (hits) was
plotted by the five frequency-of-occurrence bands used in the test (1K,
3K, 5K, 7K, and 9K) to test the assumption that frequency of occurrence
will predict learning outcomes. The percentage of hits systematically
decreased as a function of frequency band in both the written and spoken
versions. The effect sizes were negligible or small for bandwise differences
but stronger for nonadjacent band differences. The mnRT findings were
similar to the VKsize results for both versions, though there was more
variability in the recognition time measure. The effect sizes were in the
same range as in the hits. The CV results were insensitive to differences in
frequency bands for both, though noticeably less so in the spoken ver-
sion. The findings support the validity of frequency-based approaches to
measuring vocabulary knowledge.
7.5 Conclusions
The findings replicate those from Study 1. The VKsize, mnRT, and, to a
lesser extent, CV measures provide a reliable means to discriminate
among the entry standard groups. The mean differences suggest that the
combination of vocabulary size and recognition speed provides a more
sensitive measure of group differences than size alone, though the find-
ings await further confirmation, as the differences were not statistically
significant.
The lexical facility measures were used to characterize proficiency dif-
ferences across groups of international university students beginning
study at an Australian university. Although all students are assumed to
have the minimum English needed to commence academic study, it is
evident that they differ markedly in the vocabulary skills tapped by the
lexical facility measures. These measures are core elements of academic
language proficiency. The sensitivity of the measures shows that they can
provide an objective, independent benchmark for assessing these skills.
As such, they have potential as an assessment tool in this domain, for
example, as a means to identify students, pre- and post-enrolment, who
may be at academic risk due to shortcomings in English proficiency
(Read 2016).
The study also compared performance of the written and spoken for-
mats to assess whether and how the mode of presentation has an effect on
outcomes. The test format yields a similar pattern of results in both ver-
sions, though the VKsize and mnRT scores are lower in the spoken ver-
sion. False-alarm rates are also higher, indicating that the spoken format
is more challenging.
The studies in the previous chapter and this one examined the sensitiv-
ity of the lexical facility measures to proficiency differences in university
groups representing different user populations (Study 1) and English
standards used for university entry (Study 2), respectively. In the next
chapter, the focus narrows and the sensitivity of the measures to individ-
ual differences in one of these standards, the IELTS test, is investigated.
Notes
1. The false-alarm data depart markedly from a normal distribution, given that a number of participants had few to no false alarms. A Kruskal–Wallis test was run to test for the equality of the group false-alarm means. For the written version, there was a significant difference between the five groups, χ2 = 17.00, p < .005, η2 = .09 (Lenhard and Lenhard 2014). A
follow-up Mann–Whitney test of pairs showed that the IELTS 6 group
was significantly higher than all the other groups, and that the only other
significant difference was between the IELTS 7+ and Singaporean groups,
U = 154.50, p < .05, d = .62. For the spoken version, there was also a significant difference between the five groups, χ2 = 17.92, p < .001, η2 = .10. A follow-up Mann–Whitney test of pairs showed mixed results.
The IELTS 6.5 group was significantly higher than the IELTS 7+ group,
U = 480, p < .05, d = .47, but not significantly different from the Malaysian
or Singaporean group. It was significantly higher than the L1 English
group, U = 231.50, p < .01, d = .71; the IELTS 7+ group was also signifi-
cantly higher than the L1 English group, U = 117.50, p < .05, d = .74.
2. The degrees of freedom for all the Friedman tests are 4. The p values and effect sizes in r are for the band comparisons. Effect size is calculated as r = Wilcoxon z/square root (N × 2).
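As a worked illustration of the formula (the z and N values here are invented for the example, not taken from the studies): a Wilcoxon comparison yielding z = 4.5 across N = 98 participants gives

    r = z / square root (N × 2) = 4.5 / square root (196) = 4.5 / 14 ≈ .32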
Written hits: Friedman test χ2 = 363.69, significant at p < .001. The follow-up Wilcoxon signed-rank tests for the pairwise band comparisons showed that the first two adjacent band comparisons were significant at p < .001: 1K–3K, (r = .35), and 3K–5K, (.54); the 5K–7K comparison was significant at p < .05, (.14); and the overall 1K–9K comparison was significant at p < .001, (.57).
Written mnRT: Friedman test χ2 = 262.02, significant at p < .001.
The follow-up Wilcoxon test for the band differences varied some-
what: 1K–3K: p = .003, (r = .17); 3K–5K: p = .275, ns. (.00);
5K–7K: p < .001, (.45); 7K–9K: p = .044 (.04); and the overall
1K–9K comparison: p < .001, (.59).
Written CV: Friedman test χ2 = 10.23, p < .05. The only significant
Wilcoxon test result was the 1K–9K comparison: p < .010, (r = .17).
Spoken hits: Friedman test χ2 = 293.19, significant at p < .001. The Wilcoxon tests showed that the first three band comparisons were significant at p < .001: 1K–3K, (r = .30); 3K–5K, (.42); 5K–7K, (.28); the 7K–9K comparison was significant at p < .01, (.41); and the overall 1K–9K comparison was significant at p < .001, (.57).
Spoken mnRT: Friedman test χ2 = 223.21, significant at p < .001. The Wilcoxon test results varied: 1K–3K: p < .001, (r = .41); 3K–5K: p < .001, (.37); 5K–7K: p = .462, ns. (.00); 7K–9K: p = .001, (.20); and the overall 1K–9K comparison: p < .001, (.59).
Spoken CV: Friedman test χ2 = 9.51, significant at p < .05. The only significant Wilcoxon test result again was the 1K–9K comparison: p < .01, (r = .17).
References
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.).
Hillsdale: Lawrence Erlbaum.
Larson-Hall, J., & Herrington, R. (2009). Improving data analysis in second
language acquisition by utilizing modern developments in applied statistics.
Applied Linguistics, 31(3), 368–390.
Lenhard, W., & Lenhard, A. (2014). Calculation of effect sizes. Retrieved
November 29, 2014, from http://www.psychometrica.de/effect_size.html
Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing
data: A model comparison perspective (2nd ed.). New York: Psychology Press.
Moore, P., & Harrington, M. (2016). Fractionating English language proficiency: Policy and practice in Australian higher education, T. Liddicoat (Ed.). London: Taylor & Francis.
Pellicer-Sánchez, A., & Schmitt, N. (2012). Scoring yes-no vocabulary tests:
Reaction time vs. nonword approaches. Language Testing, 29(4), 489–509.
https://doi.org/10.1177/0265532212438053.
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes in L2 research. Language Learning, 64, 878–912. doi:10.1111/lang.12079.
Read, J. (2016). Post-admission language assessment in universities: International
perspectives. Switzerland: Springer International Publishing.
Schmitt, N., Jiang, X., & Grabe, W. (2011). The percentage of words known in
a text and reading comprehension. The Modern Language Journal, 95(1),
26–43. doi:10.1111/j.1540-4781.2011.01146.x.
8
Lexical Facility and IELTS Performance
Aims
8.1 Introduction
This chapter presents the third of seven studies examining lexical facility
as a second language (L2) vocabulary construct. Study 3 narrows the
scope of the previous study by examining the sensitivity of the lexical
facility measures (VKsize, mnRT, and CV) to band-score differences on
the IELTS test, an English proficiency standard widely used for educa-
tional, employment and immigration purposes. The sensitivity of the
individual and composite measures to score differences across five adja-
cent IELTS bands (5–7) is examined. The data were obtained from stu-
dents in an Australian university foundation-year program (N = 371).
Demonstrating the sensitivity of the lexical facility measures to band-score differences is the aim of the study. Most of the participants came from mainland China (n = 226), Hong Kong (n = 54), Macau (n = 16), and Taiwan (n = 8). The remainder came from a wide range of countries, including Fiji, Indonesia, Japan, Kazakhstan, Korea, Kuwait, Malaysia, Nepal, the Philippines, Tanzania, Timor-Leste, and Vietnam. Females made up 55% of the sample.
Students entering the program in three successive years completed the same test. Test items were randomized for each par-
ticipant and presented individually on a computer screen. Participants
were asked to judge, as quickly and accurately as they could, whether they
knew the target word. They were told that they would see items that were
either actual words or pseudowords, the latter being orthographically
possible words in English. Each trial had a 5000-millisecond time limit.
The participants were told to work as quickly and accurately as possible,
as they would be scored on both dimensions. A practice set of five items
was completed before the test. After the test was completed, the students
were asked to sign a consent form allowing the researcher to access their
IELTS scores from the school administration.
The means at the respective IELTS levels across all three groups were
very similar. For example, the VKsize mean for the band-score 6 group
was, for years 1–3, 45, 47, and 44%. None of the small mean differences
for either the VKsize or the mnRT observed across the years for the
respective IELTS levels was significant. The consistency across the three
years is an indication of the reliability of the testing instrument. It
also allows the data to be combined into a single data set. Given the rela-
tively small numbers at the 7.0 (n = 18) and 7.5 (n = 14) levels, these two
levels were combined for the statistical analyses.
The raw data were first examined for four performance factors that can
potentially affect the interpretation of the results. There was adequate test
instrument reliability with Cronbach’s alpha values ranging from .8 to .9
for the VKsize and mnRT measures for the word and pseudoword responses. The data were also screened for item recognition time (RT) outliers. Item responses at less than 300 milliseconds were removed first, as these were too fast to reflect an actual recognition response. These exceedingly fast responses were deemed performance errors arising from inadvertent keystrokes and similar response-external factors, and they accounted for less than 1% of the total responses. Item response times for correct hits falling beyond 3 SDs of the mean were also screened.
Table 8.1 IELTS study data set. Years 1–3 means and standard deviations, within brackets, for the VKsize, mnRT, and CV measures by IELTS overall band score
Year #1 Year #2 Year #3
Note: VKsize, correction for guessing scores (hits - false alarms); mnRT, mean response time in milliseconds; CV, coefficient of variation (SDMeanRT/Mean RT).
Table 8.2 IELTS band-score study. Means, standard deviations, and confidence
intervals (CI) for the lexical facility measures, individual and composite, for IELTS
overall band scores
False alarm VKsize mnRT (msec) CV
IELTS overall M SD M SD M SD M SD
[95% CI] [95% CI] [95% CI] [95% CI]
5 20.35 12.59 35.30 11.08 1342 443 .436 .119
n = 30 [15.65, 20.06] [31.16, 39.44] [1176, 1508] [.399, .477]
5.5 19.78 12.62 39.52 11.52 1139 213 .397 .102
n = 169 [17.86, 21.69] [37.77, 41.27] [1007, 1171] [.382, .412]
6 22.42 13.66 45.61 12.52 1040 214 .382 .110
n = 72 [19.11, 25.63] [42.66, 48.55] [989, 1199] [.355, .406]
6.5 12.79 12.80 58.77 13.46 1032 131 .358 .111
n = 42 [08.81, 16.78] [54.58, 62.97] [992, 1073] [.326, .389]
7.0–7.5 08.79 09.57 71.77 10.27 861 111 .329 .119
n = 31 [5.28, 12.30] [68.00, 75.55] [820, 902] [.295, .370]
Total 18.54 13.20 45.65 15.79 1098 253 .388 .111
N = 344 [17.14, 19.94] [44.01, 47.36] [1071, 1122] [.376, .399]
Composite score VKsize_mnRT VKsize_mnRT_CV
5 4.27 .93 4.34 .69
[3.91, 4.62] [4.08, 4.60]
5.5 4.79 .53 4.84 .51
[4.71, 4.87] [4.77, 4.92]
6 5.16 .64 5.14 .58
[5.01, 5.36] [5.00, 5.28]
6.5 5.56 .54 5.47 .57
[5.40, 5.73] [5.29, 5.65]
7–7.5 6.29 .41 6.04 .54
[6.14, 6.44] [5.84, 6.23]
Note: VKsize, correction for guessing scores (hits - false alarms), mnRT, mean
response time in milliseconds; CV, coefficient of variation (SDMeanRT/Mean RT);
CI, 95% confidence interval.
Bands 6 and 6.5, and bands 6 and 7+, differed significantly in false-alarm rates, while bands 6.5 and 7 were not.2 The possible effect of differing false-alarm rates
will be examined in a multiple regression assessing the contribution of the
three measures.
Bootstrapped bivariate correlations for IELTS band scores and the
individual and composite lexical facility measures are reported in
Table 8.3. There was no evidence of a systematic trade-off between yes/no
performance and recognition speed as would be evident in a negative cor-
relation between VKsize scores and inverted mnRTs. The small but sig-
nificant correlation (r = .31) between the two measures indicates that
participants with larger vocabulary sizes also tended to be faster, but that
other factors are also at work.
Table 8.3 IELTS band-score study. Bivariate correlations with bootstrapped confidence intervals for IELTS band scores and lexical facility measures
IELTS band score VKsize Hits False alarms mnRT CV VKsize_mnRT
VKsize .65** [.58, .70]
Hits .52** [.36, .52]; .61** [.54, .66]
False alarms .24** [.32, .13]; .57** [.63, .51]; .31** [.21, .40]
mnRT .41** [.32, .48]; .31** [.22, .40]; .39** [.16, .35]; .04 [.13, −.06]
CV .23** [.13, .33]; .22* [.11, .33]; .22 [.11, .33]; .09 [−.02, .05]; .24** [.13, .34]
VKsize_mnRT .65** [.59, .70]; .79** [.74, .84]; .61** [.53, .69]; .32** [.40, .22]; .82* [.78, .85]; .29 [.19, .39]
VKsize_mnRT_CV .60** [.53, .66]; .70 [.65, .75]; .53** [.45, .60]; .29** [−.37, −.19]; .74** [.68, .78]; .69 [.63, .75]; .89 [.86, .91]
Note: N = 344. *p < .01; **p < .001 (two-tailed). VKsize, correction for guessing scores (hits - false alarms); mnRT, mean response time in milliseconds; CV, coefficient of variation (SDMeanRT/Mean RT); 95% CI, BCa (bias-corrected and accelerated) confidence interval.
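Intervals of the kind reported in Table 8.3 can be sketched in a few lines. The sketch below is illustrative only: it uses simulated stand-in data, and it assumes scipy's stats.bootstrap routine with the BCa method rather than the study's actual analysis software.

    import numpy as np
    from scipy import stats

    def pearson_r(x, y):
        return stats.pearsonr(x, y)[0]

    rng = np.random.default_rng(1)
    # Simulated stand-ins for 344 test-takers' band and VKsize scores.
    band = rng.choice([5, 5.5, 6, 6.5, 7], size=344)
    vksize = 10 * band + rng.normal(0, 8, size=344)

    res = stats.bootstrap((band, vksize), pearson_r, paired=True,
                          vectorized=False, n_resamples=2000,
                          method="BCa", random_state=rng)
    print(pearson_r(band, vksize), res.confidence_interval)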
Descriptive Results
The descriptive statistics for the lexical facility measures are given in
Table 8.2. The number of individuals at the respective IELTS levels
varied, with the largest number of scores, by a considerable degree, at the
5.5 band, and the smallest at the combined 7+ band (7 and 7.5). The
minimum entry requirement for the university in question is an IELTS
score of 6.5. Students in the foundation-year program typically attain
this level by the end of the course.
Performance across the band-score levels for the three measures is pre-
sented visually in Fig. 8.1. The measures are converted to standard scores
to allow a direct comparison. The mean responses for all three lexical
facility measures show a consistent linear relationship between perfor-
mance and band-score level, with the 5 band group having the lowest VKsize score and the highest mnRT and CV means, and the 7+ band group the reverse.
The VKsize scores show a consistent increase, and the CV means a consistent decrease, across the band levels. The mnRTs depart slightly from this pattern, being faster at the 6.0 than at the 6.5 level, though the 15-millisecond difference is not statistically significant. Composite scores were also calculated and are presented in the lower half of Table 8.2.
Fig. 8.1 Combined IELTS dataset: Timed Yes/No Test scores (VKsize, mnRT, and CV, as standardised scores, z + 5) by IELTS overall band scores (5, 5.5, 6, 6.5, 7–7.5; n = 37, 199, 60, 43, 32)
The focus here is on the sensitivity of the lexical facility measures to dif-
ferences in IELTS band scores. A question addressed in every study pre-
sented in the book is the degree to which the mnRT and CV measures
can account for variance in IELTS band-score differences beyond that
attributable to vocabulary size alone. Study 1 (Chap. 6) focused on the
sensitivity of the measures to three clearly defined proficiency levels in an
Australian university setting. Study 2 (Chap. 7) examined groups in
which proficiency differences were narrower, in that all groups had met
the university’s English-language entry minimum. But there were also
identifiable differences in performance among the groups, both between
the first language (L1) and L2 groups and within the L2 standard groups
themselves. The range of L2 proficiency levels in this study approximates
that of the L2 groups in Study 1, and the focus is on the sensitivity of the
measures to finer gradations of proficiency within this range.
Test Results
The significance and effect sizes of the mean differences are examined in
five one-way analyses of variance (ANOVAs) done with the three indi-
vidual and two composite measures. The ANOVA tests are followed up
with post hoc tests that examine the pairwise differences between the
scores.3 The mnRTs were log-transformed but not otherwise modified. The level-by-measure responses were all normally distributed. The homogeneity of variance assumption was met for the VKsize and CV scores but was borderline for the mnRT scores. There is also a large imbalance in group size between the 5.5 band and the other groups. As a result, bootstrapping is used to provide a more robust set of results for the post hoc comparisons reported in Tables 8.5 and 8.6.
The ANOVA results presented in Table 8.4 show that all five measures
were statistically significant at p < .001. Hits are also included to assess the sensitivity of a ‘pure’ size measure, one unadjusted for false-alarm performance. The η2 values for the individual measures differed somewhat: the VKsize scores accounted for over 40% of the variance in the overall test for equality of means, hits for just under 30%, the mnRT scores for about 20%, and the CV for about 5%. The omnibus ANOVAs were followed up by post hoc pairwise
comparisons for the individual (Table 8.5) and composite measures
(Table 8.6).
The VKsize scores discriminated between nonadjacent band compari-
sons (5–6, 5–7, 6–7+) and all the adjacent band comparisons, except for
the 5–5.5 difference. The hits also discriminated between the three non-
adjacent band comparisons but only two of the adjacent ones, 5.5–6 and
Table 8.4 IELTS band-score study. One-way ANOVAs for individual and composite
lexical facility measures as discriminators of IELTS overall band scores
df (4, 339) Fa η2 95% CI for η2
VKsize 67.55 .44 .36, .50
Hits 33.26 .28 .20, .60
mnRT 23.75 .22 .13, .27
CV 5.19 .06 .02, .10
VKsize_mnRT 64.54 .43 .36, .51
VKsize_mnRT_CV 49.49 .37 .28, .45
Note: aAll F-values are significant at p < .0005.
Table 8.5 IELTS study. Bandwise significant post hoc comparisons for VKsize,
mnRT, and CV
Mean difference d CI for d
VKsize
5.5 and 6 6.08* .48 .19, .77
6 and 6.5 13.19* .94 .53, 1.36
6.5 and 7+ 12.99* .88 .40, 1.36
5 and 6 10.31* .95 .51, 1.37
6 and 7+ 26.16* 1.89 1.38, 2.4
5 and 7+ 36.47* 2.93 2.25, 3.61
Hits
5.5 and 6 8.72** .78 .50, 1.07
6.5 and 7+ 8.99* .90 .45, 1.38
5 and 6 12.37* 1.07 .59, 1.49
6 and 7+ 12.54** 1.05 .60, 1.46
5 and 7+ 24.09** 2.33 1.68, 2.99
mnRTa
5 and 5.5 184*** .67 .32, 1.1
5.5 and 6 82* .64 .35, .93
6.5 and 7+ 167* 1.19 .70, 1.70
5 and 6 269** 1.15 .71, 1.59
6 and 7+ 182** .85 .40, 1.29
5 and 7+ 452** 1.60 1.06, 2.14
CV
6 and 7+ .049 .42 .15–.85
5 and 7+ .104* .88 .38–1.37
Note: *p < .05, **p < .0005, ***p < .10; all Games–Howell significance levels
assume unequal variances; araw values given, contrast calculated on mnRT(log).
VKsize, correction for guessing scores (hits - false alarms); mnRT, mean
response time in milliseconds; CV, coefficient of variation (SDMeanRT/MeanRT); CI,
95% confidence interval for Cohen’s d.
6.5–7+. The mnRT means were significantly different for all the nonadjacent comparisons and for all the adjacent comparisons, except for the 6–6.5 difference. Unlike the VKsize and hits measures, the mnRT discriminated between the lowest levels (5–5.5). The CV measures were significant for only two comparisons, the nonadjacent 6–7+ and 5–7+ differences.
The effect sizes for these comparisons varied. The d values for VKsize
and hits were in the moderate range for the lower 5–5.5 and the 5.5–6
comparisons (d = .48–.78). They were stronger for the higher-level adja-
cent pairs, 6–6.5 and 6.5–7+. The 5–7+ (d ≈ 3) and 6–7+ (d ≈ 2) band comparisons were very strong. The effect sizes for the significant mnRT comparisons were comparable to those for the VKsize measure: smaller for the lower-level, adjacent band comparisons and larger for the higher-level, nonadjacent band comparisons. As in Studies 1 and 2, the CV measure was the weakest.
Table 8.6 IELTS band-score study. IELTS bandwise post hoc comparisons for the VKsize_mnRT and VKsize_mnRT_CV measures
Mean difference d CI
VKsize_mnRT
5 and 5.5 .53* .86 .46, 1.25
5.5 and 6 .37* .65 .37, .94
6 and 6.5 .40* .68 .29, 1.07
6.5 and 7+ .72* 1.24 .73, 1.74
5 and 6 .90* 1.23 .81, 1.66
6 and 7+ 1.12* 1.98 1.48, 2.48
5 and 7+ 2.02* 2.55 1.87, 3.22
VKsize_mnRT_CV
5 and 5.5 .50* .92 .52, 1.32
5.5 and 6 .29* .57 .28, .84
6 and 6.5 .33* .57 .18, .96
6.5 and 7+ .56* 1.04 .55, 1.53
5 and 6 .79* 1.22 .80, 1.64
6 and 7+ .89* 1.57 1.09, 2.04
5 and 7+ 1.70* 2.74 2.04, 3.43
Note: *p < .05; all Games–Howell significance levels assume unequal variances; VKsize, correction for guessing scores (hits - false alarms); mnRT, mean response time in milliseconds; CV, coefficient of variation (SDMeanRT/MeanRT); CI, 95% confidence interval for Cohen’s d.
Of particular interest is how well the effect sizes for the individual
VKsize measure compare with those for the two composite measures.
These are reported in Table 8.6. The two composite measures were more
sensitive than any of the individual measures, discriminating between all
adjacent and nonadjacent bands.
However, there was little difference in the effect sizes for the two com-
posite measures. This may reflect the insensitivity of the CV measure to
band differences. A comparison of d values between Tables 8.5 and 8.6
showed no discernible difference between the composite and individual
scores. The composite measures were superior in discriminating between
band levels, but the difference was not reflected in the effect sizes.
Table 8.7 IELTS band-score study. Model summary (R2 and ΔR2) for hierarchical
regression analysis with proficiency level as criterion and VKsize, mnRT, and CV
as predictor variables on written and spoken tests with complete and false-
alarm-trimmed (20 and 10%) data sets
β t Sig R2 Δ R2
Greater than 20% false VKsize .604 11.36 .001 .474 .474**
alarm trim
df (3, 208) mnRT .160 2.94 .004 .502 .028**
Greater than 10% false alarm VKsize .644 9.28 .000 .527 .527**
trim
mnRT .174 2.47 .015 .553 .027*
df (3,116)
Note: VKsize, correction for guessing scores (hits - false alarms); mnRT, mean
response time in milliseconds; CV, coefficient of variation (SDMeanRT/Mean RT); β,
standardized beta coefficient; Sig., significance; df, degrees of freedom for
model 3 ANOVA.
The overall mean false-alarm rate was 18%, which represents values
ranging from 20% for the lowest 5 level to 8% for the highest 7–7.5 level.
Two follow-up analyses were done to assess whether a high false-alarm
rate affects the results. The complete data set was trimmed to only include
those participants whose mean false-alarm rate did not exceed 20% or
10%, respectively. The 20% trim yielded an overall false-alarm rate of
10% (standard deviation [SD] of 7), and the 10% trim a mean false-
alarm rate of near 5% (SD of 3.5). Both regression analyses produced the
same pattern of responses. The VKsize scores accounted for the over-
whelming amount of the variance, and the mnRT scores accounted for a
significant, small amount of additional variance. The total R2 increased
for each successive trim, but the confidence intervals (CIs) within brack-
ets indicated that the differences were not statistically significant. For the
respective analyses: overall, R2 = .47, [.40, .54]; 20% trim, R2 = .51, [.40,
.59]; and 10% trim, R2 = .56, [.44, .67].
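The two-step structure of this analysis can be sketched briefly. The sketch below uses simulated stand-in data and the statsmodels package; the variable names and simulated values are hypothetical, chosen only to show how ΔR2 is obtained when mnRT is entered after VKsize.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 344
    vksize = rng.normal(46, 16, n)                       # size scores
    mnrt_log = rng.normal(7.0, 0.2, n) - 0.004 * (vksize - 46)
    band = (5.5 + 0.04 * (vksize - 46) - 1.5 * (mnrt_log - 7.0)
            + rng.normal(0, 0.4, n))
    df = pd.DataFrame({"band": band, "vksize": vksize, "mnrt_log": mnrt_log})

    def r2(predictors):
        # R-squared for an OLS model with the given predictor set.
        X = sm.add_constant(df[predictors])
        return sm.OLS(df["band"], X).fit().rsquared

    step1 = r2(["vksize"])                # size alone
    step2 = r2(["vksize", "mnrt_log"])    # size plus speed
    print(step1, step2, step2 - step1)    # delta R2: unique mnRT contribution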
8.5 Conclusions
The VKsize and mnRT measures were sensitive to IELTS band-level dif-
ferences. The CV, in contrast, accounted for few of the differences
observed. These findings replicate those of the first two studies, showing
that the VKsize and mnRT measures are reliable discriminators of test
levels and account for moderate-to-strong effect sizes for these differ-
ences. The composite score results support the claim that the combina-
tion of size and speed provides a more sensitive index of band-score
differences than size alone. The composites discriminated between all the
band levels, and the mnRT measure accounted for a significant, unique
amount of variance over and above the VKsize measure in the regression
analysis. The results are tempered, though, by the fact that the composite
measures yielded effect sizes in about the same range as the individual
measures, and the additional amount of variance accounted for in the
regression model was only about 5% of the total model.
In the next chapter, the sensitivity of the lexical facility variables is
examined in the context of placement testing.
Notes
1. The original number of participants was 371. However, 27 of these had
false-alarm rates exceeding 50%, including several around 75%. These
cases, mostly from the 5 and 5.5 bands, were removed from the analysis,
leaving a total sample of N = 344.
2. The false-alarm data depart markedly from a normal distribution, given
that some participants had few to no false alarms. A Kruskal–Wallis test was run to test for the equality of the group false-alarm means. There was a significant difference between the groups, χ2 = 36.89, p < .001, η2 = .07 (Lenhard and Lenhard 2014). A follow-up Mann–
Whitney test of the pairs showed that the 5, 5.5, and 6 bands were not
significantly different. Bands 6 and 6.5 were significantly different,
U = 871, p < .001, d = .53, as were bands 6 and 7+, U = 459, p < .001,
d = 1.05. Bands 6.5 and 7 were not significantly different.
3. Statistical significance is set at the conventional p < .05, and strength of
the difference is reported in effect size measures. The effect size for the
omnibus ANOVAs is eta-squared (η2). The ‘real-world’ interpretation of
η2 is based on Plonsky and Oswald (2014, p. 889), with .06 considered
small, .16 medium, and .36 large. The effect size for the post hoc compari-
sons is Cohen’s d. It is interpreted as .40 being small, .70 medium, and 1.0
large.
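These benchmarks are easy to encode; the small helper below is a convenience sketch of the thresholds just cited, not part of the studies' analysis code.

    def interpret_d(d):
        # Cohen's d benchmarks as cited from Plonsky and Oswald (2014).
        d = abs(d)
        return ("large" if d >= 1.0 else "medium" if d >= .70
                else "small" if d >= .40 else "negligible")

    def interpret_eta2(e):
        # Eta-squared benchmarks from the same source.
        return ("large" if e >= .36 else "medium" if e >= .16
                else "small" if e >= .06 else "negligible")

    # interpret_d(.88) -> 'medium'; interpret_eta2(.44) -> 'large'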
4. The first assumption for hierarchical regression is that the criterion vari-
able is continuous. The criterion here is the five IELTS band-score levels.
In the analysis, they are treated as continuous scores, though the small range of five levels might make this a questionable assumption for some. The data met the other assumptions for the use of the regression procedure. There was independence of residuals, as assessed by a Durbin–Watson statistic of 1.64. Scatterplot analyses indicated that a linear relationship held between the predictor variables and the criterion.
References
Lenhard, W., & Lenhard, A. (2014). Calculation of effect sizes. Retrieved
November 29, 2014, from http://www.psychometrica.de/effect_size.html
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes in L2 research. Language Learning, 64, 878–912. doi:10.1111/lang.12079.
9
Lexical Facility and Language
Program Placement
Aims
9.1 Introduction
The first three studies showed the three lexical facility measures to be
sensitive to group differences in proficiency, whether reflected in user
groups, university entry standard, or IELTS band scores. Vocabulary size
(VKsize) was the most sensitive measure, accounting for differences in all
the group comparisons and consistently yielding strong effect sizes. The
importance of size (breadth) in second language (L2) proficiency and
performance has been long established, and the findings further under-
score its significance. The mean recognition time (mnRT) was also a sensitive measure in all three studies. It had a larger effect size than the VKsize measure in the first study, but it was less sensitive in the other two. Crucially for the lexical facility proposal, the mnRT measure accounted for variability in group differences beyond that of size alone. The mnRT values varied considerably within and between groups. This has meant that the coefficient of variation (CV) measure was less informative overall, usually being insensitive to adjacent proficiency-level differences and yielding smaller effect sizes.
This chapter presents two studies (Studies 4 and 5) that gauge the sen-
sitivity of the lexical facility measures to English proficiency differences as
defined by language program placement. The studies investigate the sen-
sitivity of the measures to differences in English proficiency across a nar-
rower range of proficiency than in the previous datasets. They also include
learners from a lower level of proficiency than has been examined so far.
Study 4 correlates the measures with placement testing outcomes at a
commercial language school in Australia. The predictive power of the
lexical facility measures is compared with that of an in-house placement
test in identifying learner proficiency for placement across four profi-
ciency levels, ranging from beginners to advanced learners (N = 85).
Study 5 examines program placement in a similar setting in Singapore.
It compares performance on the lexical facility measures with student
placement (N = 66) across four program levels, spanning a similar range
as in the Sydney study, though with learners who are somewhat less
proficient.
Evidence for a strong correlation between the measures and the place-
ment results will further demonstrate the validity and reliability of the
lexical facility construct. The research also has a practical dimension.
Placement testing is an important activity in English-language programs
universally, with significant ramifications when it is done poorly.
Misplaced students can suffer in terms of learning outcomes, and pro-
gram quality can be compromised. The placement-testing process can
also be a very time- and resource-intensive activity. Alderson (2005) and
others have suggested that the Yes/No Test format may be a useful tool for
screening and placement decisions, particularly in the early stages, due to
both its reliability and its ease of administration. The untimed Yes/No
Test format has already been applied to placement decisions in English
(Clark and Ishida 2005; Harsch and Hartig 2015) and Spanish
(Lam 2010). Harrington and Carey (2009) were the first to examine the
use of recognition times in placement decisions. Part of these data is pre-
sented here.
9.2 Study 4: Sydney Language School Placement Study
The word items in the first test were drawn from the four frequency bands (2K,
3K, 5K, and 10K) used in the Vocabulary Levels Test (VLT). The VLT
target words provide a measure that can be both related to the language
program placement decisions and generalized to other settings (see also
Mochida and Harrington 2006). The second test contained word items
taken from course books and materials used at the school. A list of con-
tent words from the elementary to advanced levels of instruction was
selected from recently used texts and materials. In the absence of fixed
vocabulary lists for the program levels, a range of word difficulty
(reflected in the frequency of word occurrence) was obtained by select-
ing words from the program list at the 1K, 3K, 5K, and 10K frequency
bands using the British National Corpus (BNC; http://www.comp.
lancs.ac.uk/ucrel/bncfreq/flists.html). Eighteen items were selected
from each frequency level. Both tests included 28 pseudowords, with
different pseudowords used in the respective tests. The vocabulary mea-
sures analyzed here are combined scores from the two tests (Harrington
and Carey 2009).
The study compares scores from the language school’s placement test
battery and the lexical facility measures. The battery contains four tests
that assess English listening, grammar, writing, and speaking skills.
tests differ in task format and the role they play in the placement process.
The listening and grammar tests are paper based and designed to assess
knowledge of specific linguistic features and content comprehension. The
listening test assesses global listening comprehension skills through con-
tent questions based on a listening passage. The grammar test assesses the
ability to use grammatical structures and identify grammar errors. Both
tests are scored immediately after completion by a teacher using a scoring
key. The writing test consists of a 120-word essay addressing the question
‘Why did you choose to come to Australia?’ The speaking test is an infor-
mal ten-minute interview with a teacher, the content based on a general
list of questions related to family background, interests, career goals, and
so on. After the student completes the writing and speaking tests, they are
scored by the teacher using a holistic six-step scale that specifies compe-
tencies appropriate to the respective program levels. The teacher refers to
the listening and grammar results before assigning the speaking and
listening scores, which serve as the initial placement level. The entire test
battery takes 80–85 minutes to complete.
The language school admits students on a weekly basis, with new stu-
dents tested on Monday. Based on the placement tests, newly entering
students are placed at one of six proficiency levels: beginner, elementary,
lower intermediate, upper intermediate, advanced, and English for
Academic Purposes (EAP). The data collection period in Study 4 spanned 15 weeks, and the same teacher administered all 15 weekly placement sessions.
Each student completed the Timed Yes/No Test and then the language
program tests using the same procedure described in the previous studies.
Test items were randomized for each participant and presented individu-
ally on a computer screen. Participants were asked to judge, as quickly
and accurately as they could, whether they knew the word presented.
They were told that they would see items that were either actual words or
pseudowords, the latter being orthographically possible words in English.
Each trial had a 5000-millisecond time limit. Items not answered were
counted as incorrect. There were only a handful of ‘no answer’ responses
(less than 0.1% of the entire response set). A practice set of five items
with feedback was completed before the test. Instructions for the Timed
Yes/No Test were translated into Korean, Japanese, Chinese, Spanish,
Portuguese, and Czech. A handful of students from other L1 backgrounds
received the instructions in English.
As in the previous studies, three lexical facility measures were collected:
VKsize (proportion of hits minus false alarms), mnRT (mean recognition
times for correct word responses), and CV (SDmnRT/ mnRT). Only one
composite score, the VK_mnRT, is reported because the individual CV
results showed no systematic differences among the groups.
Scores were calculated for the general (VLT-based) and program-based versions together. The study here examines only the combined scores. For a more detailed analysis of the respective tests, see Harrington and Carey (2009).
As done in the previous studies, the raw test results were first examined
for adequate test instrument reliability, an absence of excessive item rec-
ognition time outliers and false-alarm rates, and no systematic trade-offs
between speed and accuracy in participants’ responses.
Cronbach’s alpha reliability coefficients for the word and pseudoword
items on the original tests fell within an acceptable range of .85–.92. As
in previous studies, a small number of item recognition times (less than
2%) went beyond 3 standard deviations (SDs), and these were left intact.
The small number of outliers is due in part to the 5000-millisecond cut-
off time for the presentation of individual items, eliminating extremely
slow recognition times.
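The screening steps and the three measures themselves can be summarized in a short computational sketch. The function below is illustrative only: the argument names are invented, and it implements the definitions used throughout (VKsize = hits − false alarms; mnRT over correct word responses; CV = SD of those RTs over their mean), not the actual LanguageMAP scoring code.

    import numpy as np

    def lexical_facility(rts_ms, is_word, said_yes):
        # One participant's trials: RTs in ms, item type, and yes/no response.
        rts = np.asarray(rts_ms, dtype=float)
        is_word = np.asarray(is_word, dtype=bool)
        said_yes = np.asarray(said_yes, dtype=bool)

        keep = rts >= 300                        # drop anticipatory key presses
        rts, is_word, said_yes = rts[keep], is_word[keep], said_yes[keep]

        hits = said_yes[is_word].mean()          # proportion of words accepted
        false_alarms = said_yes[~is_word].mean() # pseudowords accepted
        vksize = hits - false_alarms             # correction for guessing

        hit_rts = rts[is_word & said_yes]        # correct word responses only
        mnrt = hit_rts.mean()
        cv = hit_rts.std(ddof=1) / mnrt          # SDmnRT / mnRT
        return vksize, mnrt, cv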
In Harrington and Carey (2009), the six placement levels were ana-
lyzed separately. In this study, the beginner (n = 10) group was com-
bined with the elementary (n = 12) group for more power in the analysis.
There was no evidence of a systematic speed–accuracy trade-off by the participants. The correlation between VKsize and the inverted mnRT was positive (.51; see Table 9.1), indicating that larger vocabulary sizes went with faster, not slower, responses.
Table 9.1 Sydney language program study. Bivariate Pearson’s correlations for
lexical facility measures, and listening and grammar test scores
VKsize Hit mnRT CV VKsize_mnRT Listening
Hit .64*
[.46, .79]
mnRT .51** .66**
[.25, .65] [.45, .79]
CV .03 .24* .14
[−.23, .18] [−.20, .20] [−.34, .03]
VKsize_mnRT .86* .75** .87* .08
[.78, .91] [.59, .84] [.79, .91] [−.09, .27]
Listening .65** .65* .66* .17 .75
[.49, .75] [.48, .76] [.53, .76] [−.04, .35] [.64, .83]
Grammar .65* .62* .46* .13 .62** .66**
[.50, .76] [.42, .74] [.24, .61] [−.05, .29] [.41, .73] [.52, .78]
Note: N = 87; significant at *p < .05; **p < .001 (two-tailed).
Table 9.2 Sydney language program study. Means, standard deviations, and 95%
confidence intervals for the lexical facility measures at the four placement levels
False alarm VKsize mnRT (msec) CV
M SD M SD M SD M SD
[95% CI] [95% CI] [95% CI] [95% CI]
1. Elementary 18.11 16.67 36.52 18.90 1854 547 .462 .057
n = 21 [11.77, 26.61] [28.48, 43.76] [1613, 2115] [.436, .490]
2. Lower 19.62 13.68 52.23 15.61 1464 284 .482 .072
intermediate [15.28, 26.25] [46.72, 57.73] [1376, 1600] [.449, .513]
n = 19
3. Upper 16.48 14.99 56.31 14.38 1506 318 .494 .104
intermediate [11.54, 22.03] [50.14, 61.46] [1384, 1626] [.457, .533]
n = 26
4. Advanced 07.52 05.31 69.24 13.43 1326 241 .493 .094
n = 19 [04.91, 10.31] [64.76, 74.02] [1206, 1428] [.451, .534]
5. Overall 15.05 13.88 53.43 18.86 1543 410 .483 .085
N = 85 [12.49, 18.64] [49.31, 57.33] [1459, 1630] [.451, .531]
Composite score VKsize_mnRT
1. Elementary 4.19 1.01
[3.78, 4.57]
2. Lower intermediate 5.08 .46
[4.88, 5.31]
3. Upper intermediate 5.14 .57
[4.92, 5.37]
4. Advanced 5.70 .45
[5.48, 5.85]
Note: VKsize, correction for guessing scores (hits - false alarms); mnRT, mean
recognition time in milliseconds; CV, coefficient of variation (SDMeanRT/Mean RT);
95% CI, BCa (bias-corrected and accelerated) confidence interval.
The overall false-alarm rate was 15%, ranging from a high of 20% for the lower intermediate group to 8% for the advanced group. All the groups had
large SDs relative to the mean, indicating considerable variability across
individuals. The false-alarm data are not normally distributed, given that
a portion of the participants had few-to-no false alarms. Kruskal–Wallis
and follow-up Mann–Whitney tests indicated that the advanced group had a significantly lower false-alarm rate than the upper intermediate group, with a moderate effect size (Cohen’s d = .68).1
Bivariate correlations between the lexical facility measures and the lis-
tening and grammar test scores are presented in Table 9.1. The writing
and speaking ratings are not included, as they are essentially placement
decisions, and thus are not normally distributed. The VKsize, hits, and
mnRT scores significantly correlated with the listening and grammar
tests, with r coefficients in the mid .6 range. The correlation between the composite VKsize_mnRT and grammar (.62) was in the same range, but the correlation between VKsize_mnRT and listening scores was stronger (.75), though the difference was not statistically significant.
did not correlate with either placement test.
Descriptive Results
The descriptive statistics for the VKsize, mnRT, and CV scores by Sydney
program placement levels are presented in Table 9.2.
The VKsize and mnRT means differed as a function of placement
level, with the lowest scores produced by the elementary group and the
highest by the advanced group. The CV values showed little variability
across the program levels, ranging from the lowest in the elementary
group to the highest in the upper intermediate and advanced groups.
However, there is a maximum difference of only .025 between the means.
With no observable pattern, the CV values were not included in any
further analysis.
The VKsize and mnRT scores by placement levels are compared visu-
ally with the grammar and listening tests in Fig. 9.1. The scores are pre-
sented as standard scores consisting of the mean standard (z) score plus 5;
the latter added to make all scores positive.
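The conversion is a one-liner, shown below; the illustrative input values are borrowed from the VKsize level means in Table 9.2 (the figures themselves standardize the full score distributions, not the level means).

    import numpy as np

    def z_plus_5(scores):
        # Standardize, then shift by 5 so all values are positive.
        scores = np.asarray(scores, dtype=float)
        return (scores - scores.mean()) / scores.std(ddof=1) + 5

    print(z_plus_5([36.52, 52.23, 56.31, 69.24]))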
Fig. 9.1 Sydney language program study. Comparison of VKsize and mnRT scores with program placement grammar and listening scores across four placement levels (standardised scores, z + 5; Elementary, Lower Intermediate, Upper Intermediate, Advanced; n = 22, 20, 26, 20)
Figure 9.1 shows that the VKsize and mnRT scores were similar to the
pattern of placement test results. The VKsize scores consistently increased
as a function of placement level. The mnRT values were less sensitive in
the middle range, showing little difference between the lower and upper
intermediate groups. The VKsize_mnRT results mirrored those of the individual VKsize and mnRT measures. In the next section, the observed mean differences are tested for statistical significance, and the magnitude of the effect sizes for the differences is assessed.
The lexical facility and program placement tests were compared for how
well they discriminated between the groups and the size of the observed
effects. Also of interest was whether the mnRT and CV measures could
account for unique variance in the placement decisions beyond that
attributable to the VKsize measure alone. The sensitivity of hits as an
alternative to VKsize will also be examined (Shillaw 1996; Harsch and
Hartig 2016).
Test Results
Normality assumptions were met for all three individual measures. The
homogeneity of variance assumption was met for the VKsize and CV
measures, but not for the mnRT (log) scores. As a result, Welch’s analysis
of variance (ANOVA) was used for the omnibus test and the Games–
Howell test for the pairwise comparisons. Bootstrapping was also used to
validate the results of the post hoc comparisons. The results are given in
Tables 9.3 and 9.4.
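A minimal sketch of this testing sequence, assuming the pingouin statistics package and simulated stand-in data (the group means loosely echo Table 9.2; nothing here reproduces the study's actual data):

    import numpy as np
    import pandas as pd
    import pingouin as pg

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "level": np.repeat(
            ["elementary", "lower_int", "upper_int", "advanced"], 20),
        "vksize": np.concatenate(
            [rng.normal(m, 15, 20) for m in (37, 52, 56, 69)]),
    })

    print(pg.welch_anova(dv="vksize", between="level", data=df))
    print(pg.pairwise_gameshowell(dv="vksize", between="level", data=df))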
All the omnibus ANOVA results were statistically significant, except for the CV. Effect size, as measured by η2, ranged from a low of .22 for the mnRT measure to a high of .56 for the grammar test. The effect size for the composite VKsize_mnRT measure (.37) was slightly smaller than the individual VKsize (.40) and hits (.41) results, though none of these differences was statistically significant. The effect sizes for the two vocabulary size measures were nearly twice that of the mnRT measure (.22), a difference that is statistically significant.
Significant pairwise comparisons and effect sizes are reported in
Table 9.4. The grammar results were significant for all the comparisons,
with effect sizes of d > 1.5. These results are not included in the table.
The VKsize, mnRT, composite VKsize_mnRT, and listening tests dis-
criminated between all the placements levels, except the lower and upper
intermediate groups. The effect sizes for the respective comparisons all
reached and exceeded the threshold of 1.0, which is considered strong
Table 9.3 Sydney language program study. One-way ANOVAs for individual and
composite lexical facility measures and placement test scores as discriminators of
placement levels
df (3,83) F η2 CI for η2
VKsize 18.35** .40 .27, .52
Hits 18.98** .41 .27, .53
mnRT 7.96* .22 .15, .29
CV .662 .02 −.00, .07
VKsize_mnRT 16.02** .37 .29, .45
Listening 26.47** .49 .36, .60
Grammar 34.60** .56 .41, .67
Note: *p < .05; **p < .001.
Table 9.4 Sydney language program study. Significant post hoc pairwise compari-
sons of the lexical facility measures and listening test
Mean difference d CI for d
Elementary and lower intermediate 15.70** .99 .34, 1.65
Elementary and upper intermediate 19.78** 1.19 .57, 1.82
VKsize Lower intermediate and advanced 17.17** 1.63 .89, 2.36
Upper intermediate and advanced 13.08* 1.04 .41, 1.67
Elementary and advanced 32.87** 2.13 1.38, 2.94
Elementary and lower intermediate 15.70** 1.35 .17, .88
Hits Elementary and upper intermediate 19.78** 1.54 .88, 2.19
Elementary and advanced 22.23** 1.71 .99, 2.43
Elementary and lower intermediate 389* .88 .23, 1.53
mnRTa Elementary and upper intermediate 347**** .80 .20, 1.39
Elementary and advanced 302*** 1.22 .54, 1.90
CV (No significant differences)
Elementary and lower intermediate .890** 1.11 1.17, 2.67
Elementary and upper intermediate .940* 1.82 1.13, 2.50
VKsize_ Lower intermediate and advanced .632** 1.34 .63, 2.04
mnRT Upper intermediate and advanced .567** 1.07 .44, 1.67
Elementary and advanced 1.51*** 1.90 1.15, 2.60
Elementary and lower intermediate 19.84** 1.19 .52, 1.86
Elementary and upper intermediate 25.10*** 1.52 .87, 2.18
Listening Lower intermediate and advanced 20.74*** 1.73 .98, 2.47
Upper intermediate and advanced 15.47** 1.16 .52, 1.79
Elementary and advanced 40.58*** 2.67 1.82, 3.52
Note. Games–Howell test significant at *p < .05; **p < .01 (two-tailed);
***p < .001; ****p < .10; araw values given, contrast calculated on mnRT(log),
significance levels assume unequal variances. VKsize, correction for guessing
scores (hits - false alarms); mnRT, mean response time in milliseconds; CV,
coefficient of variation (SDMeanRT/MeanRT); d, Cohen’s d with Hedges’s correction
for unequal sample sizes (Lenhard and Lenhard 2014); CI, BCa 95% confidence
interval.
(Plonsky and Oswald 2014). The listening comparisons yielded the larg-
est effect sizes, with four of the five comparisons at d > 1.5. The hits and the mnRT were significant for the comparisons involving the elementary group and the lower intermediate, upper intermediate, and advanced groups.
The mnRT significance values were borderline and the accompanying
effect sizes lower. The CV omnibus test was not significant, so there were
no pairwise comparisons to consider.
The participants were university-age students who had just begun study
in a commercial language school in Singapore. A total of 56 students
(47% females) from China (n = 28), Vietnam (n = 17), and Malaysia (n =
11) participated as volunteers.
Performance on the two tests across the four program levels is pre-
sented as standard scores (z score plus 5) in Fig. 9.2. The two tests differed
slightly in the range of frequency levels used. The BNC set had higher
overall frequency, as it included the 1K level as the highest and the 9K
level as the lowest, compared with respective 2K and 10K levels in the
VLT set. The patterns of the VKsize scores were highly consistent for the
two versions, moving higher by placement level. The only exception was
the preintensive group performance on the VLT, which was relatively
higher than in the BNC counterpart. The mnRT results increased across
the placement levels in a nearly identical manner. Given the consistency,
a combination score averaging the scores on the two tests was used to
increase power.
The overall false-alarm rate was 26%, ranging from a high of 32% in the intermediate group to just under 20% in the EAP group (see Table 9.5). The difference of 12% between the two groups was not statistically significant.
Fig. 9.2 Singapore language program levels. Standardized scores (z + 5) for the lexical facility measures (VKsize, mnRT, and CV) for the VLT and BNC test versions, by program level (Elementary, Pre-intermediate, Intermediate, English for Academic Purposes)
Descriptive Results
The descriptive statistics for the VK, mnRT, and CV scores for the
Singapore language program levels are given in Table 9.5.
Overall, the scores were lower than those in the Sydney placement
study. The highest proficiency group, EAP, had a mean of almost 50%,
which was comparable to the lower intermediate group in the Sydney
study. Overall, the mean VKsize scores here were 10–15% less than the
corresponding level in the Sydney study. There was less of a difference
for the mnRT scores, with the two highest proficiency groups in Sydney and Singapore being very similar in this regard, both around 1300 milliseconds.
Table 9.5 Singapore language program study. Means, standard deviations, and
confidence intervals for the lexical facility measures for the four Singapore lan-
guage program levels
False alarm VKsize mnRT CV
M SD M SD M SD M SD
Singapore [95% CI] [95% CI] [95% CI] [95% CI]
Elementary 28.88 20.35 22.42 13.70 2083 569 .610 .123
n = 12 [18.98, 38.85] [14.85, 30.25] [1725, 2377] [.541, .568]
Pre intermediate 24.19 15.89 30.70 11.74 1888 456 .644 .094
n = 18 [16.05, 32.33] [24.14, 36.46] [1697, 2103] [.602, .687]
Intermediate 32.33 19.47 34.28 15.99 1565 475 .568 .131
n = 15 [23.41, 41.25] [27.03, 42.22] [1328, 1789] [.505, .624]
EAP 18.97 11.16 49.42 13.91 1318 344 .565 .128
n = 11 [8.56, 29.39] [36.50, 57.84] [1125, 1531] [.487, .641]
Overall 26.35 17.41 33.56 16.15 1731 533 .601 .016
N = 56 [21.68, 31.02] [29.24, 37.89] [1588, 1774] [.569, .633]
Composite score VKsize_mnRT
Elementary 4.32 .651
[3.99, 4.66]
Preintermediate 4.76 .584
[4.49, 5.03]
Intermediate 5.17 .819
[4.78, 5.63]
EAP 5.87 .661
[5.47, 6.24]
Note: EAP, English for Academic purposes; VKsize, correction for guessing scores
(hits - false alarms); mnRT, mean recognition time in milliseconds; CV,
coefficient of variation (SDMeanRT/MeanRT); CI, BCa confidence intervals; VLT,
Vocabulary Levels Test; BNC, British National Corpus.
Fig. 9.3 Singapore language program study. Standardized scores (z + 5) for the lexical facility measures (VKsize, mnRT, and CV) for the combined test by level (Elementary, Pre-intermediate, Intermediate, English for Academic Purposes)
The mean-level differences are tested for statistical significance and effect
size. Although the level sizes are small, all three measures met normality
and variance assumptions. The mean-level differences observed for
the individual measures and the composite VKsize_mnRT were tested
for statistical significance in separate one-way ANOVAs, with program
level as the independent variable and scores as the dependent variable.
Table 9.6 Singapore language program study. One-way ANOVAs for individual
and composite lexical facility measures as discriminators of program levels
df (3,52) F η2 CI for η2
VKsize 7.96** .31 .12, .50
Hits 4.61* .21 .04, .40
mnRT 6.38** .27 .14, .46
CV 1.56 .08 −.03, .11
VKsize_mnRT 11.01** .39 .29, .45
Note: *p < .01; **p < .001.
The results are reported in Table 9.6. The group means in the post hoc
analysis were bootstrapped to provide more robust statistics for the small
sample sizes.
The omnibus analyses for the VKsize, hits, mnRT, and composite VKsize_mnRT are all statistically significant. The η2 values are around .3 here, compared with .4 in the Sydney study, and signal a moderate-to-strong effect size. The post hoc contrasts testing adjacent-level differences
are reported in Table 9.7.
The VKsize score was significant in comparisons involving the EAP
group and the elementary, preintermediate, and intermediate groups,
respectively. The hits were less sensitive than VKsize, accounting only for
differences between the elementary and intermediate groups and the
elementary and EAP groups. The mnRT scores were involved in four
significant pairwise comparisons. These were the differences between the
EAP group and the elementary and preintermediate groups, as well as the
preintermediate and intermediate and the elementary and intermediate
comparisons. The composite VKsize_mnRT scores were also sensitive to
four score differences, including all the EAP comparisons and the differ-
ence between the elementary and intermediate groups. The effect sizes
approached or exceeded 1.0 for all the significant comparisons, except
the preintermediate and intermediate group difference for the mnRT
score (.73). A comparison of the individual VKsize scores with the com-
posite VKsize_mnRT scores indicates that the combination of size and
speed provides a more sensitive measure of program level differences than
size alone.
Table 9.7 Singapore language program study. Significant post hoc comparisons
for the lexical facility measures for the four placement levels
Mean
difference d 95% CI for d
VKsize
Elementary and EAP 27.00*** 1.94 .95, 2.93
Preintermediate and EAP 18.72** 1.48 .64, 2.32
Intermediate and EAP 15.13* .99 .175, 1.82
Hits
Elementary and intermediate 15.32* .91 .11, 1.71
Elementary and EAP 17.09* 1.01 1.44, 1.88
mnRTa
Elementary and intermediate 518*** .94 .13, 1.73
Preintermediate and intermediate 323** .73 .02, 1.43
Preintermediate and EAP 570* 1.36 .53, 2.19
Elementary and EAP 765*** 1.61 .66, 2.58
CV None
VKsize_mnRT
Elementary and intermediate .852* .92 .11, 1.74
Preintermediate and EAP 1.11*** 1.00 .38, 1.65
Intermediate and EAP .699** 1.00 .38, 1.65
Elementary and EAP 1.55*** 2.36 1.30, 3.42
Note: *p < .05; **p < .01; ***p < .0005 (two-tailed); araw values given, contrast
calculated on mnRT(log), significance levels assume unequal variances. VKsize,
correction for guessing scores (hits - false alarms); mnRT, mean response time
in milliseconds; CV, coefficient of variation (SDMeanRT/MeanRT); CI, BCa 95%
confidence interval.
9.8 Conclusions
The two studies examined the sensitivity of the lexical facility measures to
differences in language school placement levels. The combination of size
and mnRT provided a more sensitive measure of levels than size alone,
supporting a key part of the lexical facility proposal. The CV again pro-
vided little information about level differences, offering further evidence
that its status as a component of the proposed lexical facility construct is
questionable.
The results give no basis for suggesting that the Timed Yes/No Test can
replace the placement procedures used in the two schools. Harsch and
Hartig (2016) arrived at a similar conclusion about the effectiveness of
the Yes/No Test format as a placement instrument when they compared
it with the C-Test, which they concluded was a more sensitive instru-
ment. However, framing the questions in terms of either/or oversimpli-
fies matters. The in-house placement tests from the Sydney study, the
C-Tests, draw on higher-order linguistic and strategic skills that are not
tapped in the Yes/No Test and are in a complementary relationship with
the low-level lexical facility skills captured by the Timed Yes/No Test
format.
The evidence does indicate that size and speed measures provide a reli-
able and, arguably, a potentially useful tool for identifying learners’ pro-
ficiency levels, with possible future applications independently and in
combination with other measures in the placement process, for example,
as a tool for screening students before arrival at the university.
Notes
1. A Kruskal–Wallis test was run to test for the equality of the group false-alarm means. There was a significant difference between the groups, χ2 = 11.07, p = .011. A follow-up Mann–Whitney test of pairs showed
that the only significant difference was between the advanced and the
upper intermediate groups, at U = 153, p < .05, Cohen’s d = .68 (Lenhard
and Lenhard 2014). None of the other mean differences was significant.
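A sketch of this nonparametric sequence, using scipy and simulated stand-in false-alarm rates (the group means loosely echo Table 9.2; none of this is the study's data):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    elementary = rng.normal(18, 15, 21).clip(0)
    lower_int = rng.normal(20, 14, 19).clip(0)
    upper_int = rng.normal(16, 15, 26).clip(0)
    advanced = rng.normal(8, 5, 19).clip(0)

    h, p = stats.kruskal(elementary, lower_int, upper_int, advanced)
    u, p_pair = stats.mannwhitneyu(advanced, upper_int)  # follow-up pair
    print(h, p, u, p_pair)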
References
Alderson, J. (2005). Diagnosing foreign language proficiency: The interface between
learning and assessment. New York: Continuum.
Clark, M. K., & Ishida, S. (2005). Vocabulary knowledge differences between
placed and promoted students. Journal of English for Academic Purposes, 4(3),
225–238. doi:10.1016/j.jeap.2004.10.002.
Harrington, M., & Carey, M. (2009). The online yes/no test as a placement
tool. System, 37(4), 614–626. doi:10.1016/j.system.2009.09.006.
Harsch, C., & Hartig, J. (2015). Comparing C-tests and yes/no vocabulary size
tests as predictors of receptive language skills. Language Testing, 33(4),
555–575.
Harsch, C., & Hartig, J. (2016). Comparing C-tests and Yes/No vocabulary size
tests as predictors of receptive language skills. Language Testing, 33(4),
555–575.
Lam, Y. (2010). Yes/No tests for foreign language placement at the post-
secondary level. Canadian Journal of Applied Linguistics/Revue canadienne de
linguistique appliquee, 13(2), 54–72.
Lenhard, W., & Lenhard, A. (2014). Calculation of effect sizes. Retrieved
November 29, 2014, from http://www.psychometrica.de/effect_size.html
Mochida, A., & Harrington, M. (2006). The yes-no test as a measure of receptive vocabulary knowledge. Language Testing, 23(1), 73–98. doi:10.1191/0265532206lt321oa.
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes in L2 research. Language Learning, 64, 878–912. doi:10.1111/lang.12079.
Shillaw, J. (1996). The application of Rasch modelling to yes/no vocabulary tests.
Swansea: Vocabulary Acquisition Research Group, University of Wales
Swansea.
10
Lexical Facility and Academic
Performance in English
Aims
10.1 Introduction
Vocabulary knowledge (VKsize) and mean recognition time (mnRT)
have been shown to be sensitive indicators of English proficiency differ-
ences across the five studies examined so far. The sensitivity of the two
measures was evident in the sharply defined differences between preuni-
versity, second language (L2), and first language (L1) users in Study 1,
between university entry standards in Studies 2 and 3, and the language
program levels in Studies 4 and 5. Of interest throughout has been the
construct validity of lexical facility as an account of L2 vocabulary skill
and the usefulness and reliability of the Timed Yes/No Test as a measure-
ment instrument.
10.2 Study 6: Lexical Facility Measures and Academic English Grades
This study evaluates the three lexical facility measures as correlates of
academic performance in a foundation-year course at an Australian uni-
versity. The measures are based on performance on the Timed Yes/No Test and are correlated with the year-end grade received in a mandatory EAP course. Two student cohorts were tested that differ as to
when they took the test. The entry group took the test at the beginning
of the first semester of the two-semester course, and the exit group took
it at the end of the second. Entry group performance provides a window
on how well the measures can predict learning outcomes for newly
entered students, while exit group performance establishes the degree to
which the measures correlate with grade outcomes at the end of the
course. In Study 7, the measures are examined and correlated with end-
of-year GPAs. In both studies, the effects of individual and combined
measures are examined.
Both groups completed the same version of the written Timed Yes/No Test. The test words were drawn from the 2K, 3K, 5K, and 10K bands of the British National Corpus (BNC) lists available on the Lextutor website (Cobb 2008). Eighteen words were drawn from each level for a total of 72 words, which, together with 28 pseudowords, generated a total of 100 items. Second-semester EAP grades and GPAs were obtained with permission from the students’ academic records.
The test was given in groups in a computer-equipped classroom in
the school. It was administered using LanguageMAP, an online testing
program available at www.languagemap.com. The order of presentation
was randomized for each test-taker. The entry group took the test in
February, at the beginning of the Australian academic year. Participants
in the exit group took the test in October. The academic year finished
in early December, with second-semester grades and GPAs for both
groups obtained after that. The academic English classes met three times
a week in both semesters and covered all four academic English skills,
with an emphasis on writing. The course grade for the second semester
is used here. Participating students signed a release-for-access form so that their course grades and GPAs could be accessed after they had finished the test. Students were given the option to opt out, but none did.
Otherwise, the testing followed the procedures set out in previous
chapters.
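To make the test construction concrete, the following is a minimal sketch of how such a form might be assembled and randomized per test-taker. The band labels mirror the 2K–10K BNC sampling described above, but the word pools, pseudowords, and function name are placeholders, not the actual LanguageMAP items or code.

```python
import random

# Placeholder pools standing in for the BNC band lists on Lextutor
# (Cobb 2008); the real test drew 18 words per band and 28 pseudowords.
BANDS = {
    "2K": ["answer", "journey", "sharp", "quiet"],
    "3K": ["scheme", "harbor", "lean", "fetch"],
    "5K": ["brittle", "scorn", "fathom", "tinge"],
    "10K": ["obdurate", "palaver", "rivet", "bemoan"],
}
PSEUDOWORDS = ["plosh", "tradle", "mervic", "stradgy"]

def build_form(words_per_band=2, n_pseudo=3, seed=None):
    """Assemble one test form and randomize item order per test-taker."""
    rng = random.Random(seed)
    items = [(w, "word", band)
             for band, pool in BANDS.items()
             for w in rng.sample(pool, words_per_band)]
    items += [(pw, "pseudoword", None) for pw in rng.sample(PSEUDOWORDS, n_pseudo)]
    rng.shuffle(items)  # presentation order randomized for each test-taker
    return items

print(build_form(seed=1)[:5])
```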
The entry group had a false-alarm rate of over 20%, twice that of the exit group. The difference was significant by a Mann–Whitney test, U = 1290, p < .001, d = .89.1 The entry group also had a lower EAP grade percentage: 66% versus 74% for the exit group, t (143) = 4.57, p < .001, d = .74. Both differences will affect the interpretation of the results.
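For readers who want to reproduce this style of group comparison, the sketch below shows a Mann–Whitney test and a pooled-SD Cohen’s d using SciPy and NumPy. The data are simulated stand-ins, not the study’s false-alarm rates.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
entry_fa = rng.beta(2, 8, size=72)  # simulated false-alarm rates, entry group
exit_fa = rng.beta(1, 9, size=68)   # simulated false-alarm rates, exit group

# Rank-based test, since false-alarm rates violate normality (see Note 1)
u, p = mannwhitneyu(entry_fa, exit_fa, alternative="two-sided")

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * np.var(a, ddof=1) +
                      (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled

print(f"U = {u:.0f}, p = {p:.4f}, d = {cohens_d(entry_fa, exit_fa):.2f}")
```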
The moderate significant correlation between VKsize and the inverse mnRT indicated no systematic trade-off between yes/no performance and recognition speed: for the entry group, r = .38; for the exit group, r = .27, both significant at p < .01. See Table 10.2.
10.3 Study 6 Results

Descriptive Statistics
The means and SDs for the lexical facility measures are given in Table 10.1. Also included are academic English percentage marks, letter grades, and GPAs. The latter two are not normally distributed, so all three are reported as medians and ranges. The GPA results are discussed later.
The exit group performed better on all three measures and, in turn, on the composite measures. The absence of any overlap between the upper and lower bounds of the confidence intervals (CIs) for the respective group means indicates that this difference is statistically significant.2 The exit group VKsize score of 48% places it in the 6–6.5 IELTS band in the results reported for Study 3 in Chap. 8. The mnRT of 1182 was a half-band slower than in the IELTS study, corresponding to the 5.5 band. The CV measure for both groups was also in the 5–5.5 range. The CV measure correlated only weakly with the criterion variables, a finding similar to Study 3.
The sensitivity of the lexical facility measures to academic performance
outcomes is examined separately for the two groups.
Table 10.2 Bivariate correlations between lexical facility measures and academic English performance measures for entry and exit groups

          VKsize       mnRT         CV           VK_mnRT      AE grade %   GPA
VKsize    —            .38**        .10          .83**        .44**        .32**
                       [.17, .54]   [−.14, .33]  [.73, .89]   [.33, .66]   [.03, .59]
mnRT      .27**        —            .06          .82**        .21          .22*
          [.08, .46]                [−.01, .12]  [.76, .88]   [.02, −.42]  [.00, .42]
CV        .03          .39**        —            .03          .08          .02
          [−.18, .24]  [.17, .59]                [−.20, .25]  [−.16, .32]  [−.22, .18]
VK_mnRT   .78**        .81**        .24          —            .39**        .33*
          [.67, .86]   [.73, .88]   [.23, .51]                [.10, .62]   [.06, .56]
AE%       .59**        .33*         .01          .57**        —            .67**
          [.41, .71]   [.09, .54]   [.18, −.14]  [.36, .72]                [.53, .79]
GPA       .45*         .32**        .05          .48**        .77**        —
          [.27, .60]   [.13, .52]   [−.17, .28]  [.31, .62]   [.64, .86]

Note: Entry group (n = 72) correlations appear above the diagonal; exit group (n = 68) correlations appear below the diagonal; 95% confidence intervals in brackets. *p < .05; **p < .01 (two-tailed); VKsize, correction-for-guessing scores (hits − false alarms); mnRT, mean response time in milliseconds; CV, coefficient of variation (SD of RT/mean RT); AE%, academic English grade in percentages; GPA, overall grade point average for academic subjects.
Studies 6 and 7 differ from the earlier ones in that establishing the relative sensitivity of the lexical facility measures does not involve discriminating among group levels. Rather, these studies focus solely on the strength of the association between individual differences in vocabulary scores and academic English performance. The bivariate correlations between the vocabulary measures and academic English grades, as well as GPAs, from Study 7 are included in Table 10.2.
The bivariate correlations between VKsize and academic English percentage and letter grades were significant for both groups, though larger for the exit group. The difference between the groups was marginally significant for the academic English percentage, z = 1.63, p = .051. The mnRT and academic English correlations were statistically significant and moderately strong for the exit group, but weak and nonsignificant for the entry group. The results of the composite VK_mnRT measure were nearly identical to those of the individual VKsize measure. The CV scores did not correlate with either measure in either group.
Test Results
Academic English Grades: Entry Group The VKsize scores accounted for
an R2 of .191, and the mnRT scores accounted for an additional, nonsig-
nificant amount of variance. The model R2 value was .213 and the β coef-
ficient for VKsize was statistically significant.
Academic English Grades: Exit Group The VKsize and mnRT scores accounted for a significant amount of variance, though the mnRT contribution was significant only at the more liberal p < .10 level.
             β      t      Sig    R2     ΔR2
20% trim
mnRT         .241   1.53   .135   .204   .055
CV           .304   1.58   .073   .286   .082
df (3, 30)

Note: *p < .10; **p < .05; ***p < .001 (two-tailed); VKsize, vocabulary knowledge; mnRT, log mean response time; CV, coefficient of variation; β, standardized beta coefficient; ΔR2, change in total R2 (shaded cell is total variance accounted for by the model); df, degrees of freedom for the model 3 ANOVA.
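The hierarchical entry procedure behind these ΔR2 values is straightforward to reproduce. A minimal sketch with simulated data follows: predictors are entered in the order VKsize, then log mnRT, then CV, and each step’s ΔR2 is the unique variance the newly entered predictor adds. All values and the data-generating assumptions are illustrative only, not the study’s data.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 68                                   # illustrative exit-group size
vksize = rng.normal(50, 15, n)
log_mnrt = rng.normal(7.0, 0.2, n)       # log mean response time
cv = rng.normal(0.4, 0.1, n)
grade = 120 + 0.5 * vksize - 10 * log_mnrt + rng.normal(0, 10, n)

def r_squared(y, X):
    """Fit OLS with an intercept and return R-squared."""
    return sm.OLS(y, sm.add_constant(X)).fit().rsquared

# Hierarchical entry: VKsize first, then mnRT, then CV
steps = [np.column_stack([vksize]),
         np.column_stack([vksize, log_mnrt]),
         np.column_stack([vksize, log_mnrt, cv])]
r2 = [r_squared(grade, X) for X in steps]
print(f"R2(VKsize) = {r2[0]:.3f}, "
      f"dR2(mnRT) = {r2[1] - r2[0]:.3f}, dR2(CV) = {r2[2] - r2[1]:.3f}")
```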
The correlations between the lexical facility measures and GPA were stronger for the exit group, with the VKsize, mnRT, and VK_mnRT measures all showing significant, moderate correlations with academic performance. The CV measure did not correlate with GPA. The differences in the correlations between the groups were not statistically significant.
Hierarchical multiple regression analyses were again performed to assess
the combined effect of the VKsize, mnRT, and CV measures on GPA. The
data met the assumptions for the analysis.
GPA: Entry Group The VKsize scores accounted for an R2 value of .10,
significant at p < .01. Neither the mnRT nor the CV variable accounted
for additional unique variance. The overall model accounted for 12% of
the total variance. The standardized β coefficient for VKsize was .284,
t = 2.29, p < .05.
GPA: Exit Group The VKsize scores were entered as the first step and accounted for a ΔR2 value of .20, significant at p < .001. The mnRT scores were then added, accounting for a small amount of additional unique variance, ΔR2 = .042, statistically significant at the more liberal p < .10. The CV measure did not account for a unique amount of significant variance in GPA beyond the VKsize and mnRT measures. The total model accounted for R2 = .243. The β coefficients for VKsize and mnRT were statistically significant: VKsize: β = .385, t = 3.36, p < .01; mnRT: β = .224, t = 1.79, p < .10.
Fig. 10.1 Oman university GPA study. Standardized VKsize, mnRT, and CV scores
by faculty
The results show noticeable variability across the measures and faculties; however, the differences were not statistically significant, apart from the VKsize difference between the humanities and computing groups. For the total group, only VKsize accounted for a significant amount of unique variance in GPA (R2 = .13, p < .001). Individual regression analyses by faculty showed some differences across the four faculties. On its own, VKsize accounted for 25% of GPA variance in the humanities faculty (R2 = .25, p < .001), nearly double that of the others; the mnRT did not account for any unique variance. The results for the engineering faculty were unusual: the mnRT measure alone accounted for over 40% of the variance.
Overall, the language proficiency of the participants was lower than that of most of the English users examined in the studies presented in this book. The effect of L1 orthography on the Timed Yes/No Test format may also be a contributing factor. Fender (2008) reported that Arabic L1 readers of English demonstrated poorer English word recognition skills than proficiency-matched learners from Chinese, Korean, and Japanese L1 backgrounds, a finding he attributed to distinctive aspects of the Arabic script.
10.8 Conclusions
This chapter examined the lexical facility measures as predictors of per-
formance in academic English settings. The measures were related to
individual differences in EAP course grades and program GPA, and as
such complement the previous chapters which examined group
differences. Study 6 investigated the lexical facility measures as predictors
of semester-end course grades in an EAP course in an Australian univer-
sity foundation-year program. Study 7 examined the relationship between
the three measures and overall GPA in the same cohort. Other studies
examining GPA in English-medium university programs were also dis-
cussed. The results stood in some contrast to the earlier studies. VKsize was again shown to be the most influential factor in accounting for grade and GPA differences, with mnRT also playing a role, albeit a smaller one than in the earlier findings. It accounted for a significant amount of variance beyond VKsize, but only at a more liberal p level.
The VKsize and mnRT measures had measurable effects on both course grades and GPA for the participants in the studies examined here.
However, it is also evident from the Omani findings that measures incor-
porating higher-order skills such as reading and writing are more infor-
mative indicators of individual proficiency differences than the lower-level
lexical facility processes alone. This result is similar to the placement find-
ings reported in Chap. 9. The effect of L1 orthography might also play a
role.
Notes
1. The false-alarm data do not meet normality assumptions, as some partici-
pants have few or no false alarms.
2. Independent t-tests showed that all of the differences were statistically
significant (two-tailed): for the VK scores: t (143) = 4.19, p < 0.001,
d = 0.69; for mnRT: t (143) = 2.93, p = 0.004, d = 0.49; and for CV: t
(143) = 3.03, p = 0.003, d = 0.51.
3. Namely independence of observations (residuals), no outliers, noncol-
linearity, and normally distributed residuals.
References
Cobb, T. (2008). The Compleat Lexical Tutor. http://www.lextutor.ca/
Fender, M. (2008). Spelling knowledge and reading development: Insights from
Arab ESL learners. Reading in a Foreign Language, 20(1), 19–42.
Harrington, M., & Roche, T. (2014a). Word recognition skill and academic
achievement across disciplines in an English-as-lingua-franca setting. In
U. Knoch (Ed.), Papers in Language Testing, 16, 4.
Harrington, M., & Roche, T. (2014b). Identifying academically at-risk students
at an English-medium university in Oman: Post-enrolment language assess-
ment in an English-as-a-foreign language setting. Journal of English for
Academic Purposes, 15, 34–37.
Roche, T., & Harrington, M. (2013). Recognition vocabulary knowledge as a
predictor of academic performance in an English as a foreign language set-
ting. Language Testing in Asia, 3(1), 1–13. doi:10.1186/2229-0443-3-12.
11
The Effect of Lexical Facility
Aims
11.1 Introduction
This chapter summarizes the findings from Chaps. 6, 7, 8, 9, and 10.
Seven studies have evaluated the sensitivity of size, speed, and consistency
to differences in proficiency and performance in domains of academic
English. Throughout the book, this sensitivity has been characterized as
how well the measures discriminate between the criterion levels and,
more importantly, the relative magnitude of these differences, both indi-
vidually and in combination. Of particular interest is the degree to which
composite measures provide a more sensitive measure of the observed
differences than vocabulary size alone.
11.2 Sensitivity of Lexical Facility Measures by Performance Domain
Table 11.1 summarizes the group means for the individual lexical facility
measures, vocabulary size (VKsize), mean recognition time (mnRT), and
coefficient of variation (CV). Also included are the hits, which are the
percentage of words recognized.
The different groups, settings, and lack of a single, independent measure of proficiency make direct comparisons across the studies impossible—
Table 11.1 Summary of means (M) and standard deviations (SD) for VKsize, hits, mnRT, and CV measures for Studies 1–5

                              VKsize      Hits       mnRT          CV
Study                    n    M    SD     M    SD    M      SD     M     SD
Study 1: University groups
  Preuniversity          32   34   27     55   10    1656   332    .447  .086
  L2 university          36   71   16     77   10     963   203    .361  .087
  L1 university          42   85   10     91    7     777   202    .247  .071
Study 2: Entry standards
  IELTS 6.5              54   57   16     76   12    1444   417    .434  .088
  IELTS 7+               25   73   12     84    7    1280   299    .416  .092
  Malaysian              17   70   18     87   11     975   205    .458  .121
  Singaporean            19   85   12     93    8     899   193    .432  .114
  L1 English             15   85   10     93    6     960   228    .347  .104
Study 3: IELTS band scores
  5                      30   35   11     56   11    1342   443    .443  .122
  5.5                   169   40   12     59   12    1139   214    .392  .102
  6                      72   47   13     68   12    1040   214    .378  .111
  6.5                    42   59   13     72   10    1032   131    .356  .112
  7+                     31   72   10     81   10     861   111    .329  .119
Study 4: Sydney
  Elementary             21   37   19     55   16    1854   548    .462  .057
  Lower intermediate     19   52   11     71    7    1464   272    .486  .072
  Upper intermediate     26   56   14     73    7    1506   318    .494  .104
  Advanced               19   69   10     77    9    1326   242    .495  .096
Study 5: Singapore
  Elementary             12   22   14     51   20    2083   569    .610  .123
  Preintermediate        18   31   12     55   10    1888   456    .644  .095
  Intermediate           15   34   16     67   13    1565   475    .568  .132
  EAP                    11   59   14     68   12    1318   344    .565  .128
indeed, the lexical facility measure represents one such single measure. However, several generalizations can be made. The VKsize scores and, to a lesser extent, the hits are the most consistent in separating the proficiency levels across all the studies. The IELTS 6.5 band scores were nearly identical in Studies 2 (M = 57) and 3 (M = 59).1 The adjacent 7+ band (both 7 and 7.5) yielded similar means in Studies 2 (M = 73) and 3 (M = 72). These are similar to the L2 university score in Study 1 (M = 71) and the Malaysian group in Study 2 (M = 70). All of this suggests that VKsize is a reliable measure of vocabulary skill. The mnRT results are much less consistent. The 6.5 group in Study 2 had a much higher mnRT (M = 1444) than the 6.5 group in Study 3 (M = 1032). The mnRTs for the 7+ levels also differed between Studies 2 (M = 1280)2 and 3 (M = 861). The amount of recognition time variability underscores a difficulty with its use as a measure. This is discussed below. The two language program groups were much slower than the university groups, despite the similarity of the VKsize scores for the advanced group in Study 4 (M = 69) and the English for Academic Purposes (EAP) group in Study 5 (M = 59) compared with the IELTS 6.5 band scores in Studies 2 and 3. This is consistent with the notion that recognition speed lags behind size (Zhang and Lu 2013). Aside from Study 1, the CV scores were only sensitive in instances where the proficiency difference between the groups was great, as in that between IELTS band scores of 5 and 7+.
Table 11.2 summarizes the effect sizes for the individual and compos-
ite measures in the seven studies. Only effect sizes (Cohen’s d) for the
significant pairwise comparisons of means are presented. Blank cells indi-
cate that the mean difference did not reach statistical significance. An
effect size can be interpreted in the absence of statistically significant dif-
ferences, but for presentation purposes, only the significant results will be
discussed. See the specific chapters for effect sizes not reported here.
The benchmark used throughout the book for interpreting the magnitude of the observed effect sizes is taken from Plonsky and Oswald’s meta-analysis (2014, p. 889). For mean differences between groups, values around .40 are considered small, around .70 medium, and 1.0 and beyond large. The authors recommend higher values for within-group contrasts, namely .60, 1.00, and 1.40, respectively. The comparisons in the studies are interpreted against these benchmarks.
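These benchmarks are easy to apply mechanically. A small helper along the following lines (a sketch, with a hypothetical function name) labels a d value against the Plonsky and Oswald thresholds:

```python
def label_d(d, within_group=False):
    """Label an effect size against Plonsky and Oswald's (2014) benchmarks:
    .40/.70/1.00 between groups, .60/1.00/1.40 within groups."""
    small, medium, large = (0.60, 1.00, 1.40) if within_group else (0.40, 0.70, 1.00)
    d = abs(d)
    if d >= large:
        return "large"
    if d >= medium:
        return "medium"
    if d >= small:
        return "small"
    return "negligible"

print(label_d(0.68))  # -> 'small' (approaching medium) for a between-group contrast
```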
Table 11.2 Summary of lexical facility measures’ effect sizes for individual and composite measures

Range of Cohen’s d effect sizes for pairwise comparisonsa

Study                                              VKsize     mnRT       CV         VKsize_mnRT  VKsize_mnRT_CV
Study 1: University proficiency levels (N = 110)   1.05–2.68  1.18–4.82  1.03–2.60  1.33–4.11    1.39–5.38
Study 2a: University entry standards, written test  .86–1.81  1.14–1.86   .71–.94    .73–2.27     .67–2.00
Study 2b: University entry standards, spoken test   .66–2.16   .62–1.72   .72–1.49   .76–2.46    1.00–2.00
  (N = 132)
Study 3: IELTS band scores (N = 371)                .48–2.93   .67–1.60   .42–.88    .65–2.55     .57–2.74
Study 4: Australian language program placement      .99–2.13   .80–1.22   –         1.07–1.90    –
  (N = 87)
Study 5: Singapore language program levels          .99–1.94   .73–1.61   –          .92–2.36    –
  (N = 56)

Variance accounted for in hierarchical regression model

Study                                        ΔR2 VKsize   ΔR2 mnRT   ΔR2 CV   R2 total
Study 3: IELTS band scores    All            .419**       .055**     .005     .473
  (N = 344)                   FA < 20%       .474**       .028**     .008     .511
                              FA < 10%       .527**       .027**     .001     .544
Study 6: EAP grade            Entry all      .191***      .003       .020     .213
  Entry group (N = 72),       Entry FA < 20% .149**       .055       .082     .286
  exit group (N = 68)         Exit all       .349***      .029*      .020     .398
                              Exit FA < 20%  .302***      .056**     .021     .379

Note: *p < .10; **p < .05; ***p < .001 (two-tailed); VKsize, correction-for-guessing scores (hits − false alarms); mnRT, mean recognition time in milliseconds; CV, coefficient of variation; FA < 20%, only participants with false-alarm rates less than 20% included in the analysis; FA < 10%, only participants with false-alarm rates less than 10% included in the analysis.
a Comparisons: Study 1: L2 preuniversity – L2 university – L1 university; Study 2: IELTS 6.5 – IELTS 7+ – Malaysian – Singaporean – L1 English; Study 3: IELTS overall band score 5 – 5.5 – 6 – 6.5 – 7+; Study 4: elementary – lower intermediate – upper intermediate – advanced; Study 5: elementary – preintermediate – intermediate – English for Academic Purposes (EAP; advanced).
Performance across the progressively lower frequency bands (e.g., 2K, 3K, 5K, and 10K) decreased uniformly in all three groups. All the adjacent-level differences were statistically significant, and r values ranged between .41 and .63 for the hits, and .38 and .64 for the mnRT, all in the medium-to-strong range. The CV was less sensitive to frequency-level differences, although mean differences mirrored those of the hits and mnRT. Only the 2K–3K difference was statistically significant, and the negligible r value of .03 indicated no effect.
In summary, the VKsize and mnRT measures were sensitive to group
differences, while the CV was less so. The effect size ranged from moder-
ate to strong depending on the comparison. The composite VKsize_
mnRT was the most sensitive, consistent with the proposal that the
combination of size and speed was better than size alone in characterizing
group differences. It was also evident that the spoken format yielded a
pattern of results similar to the written, though the size scores were lower
and the responses slower.
Study 3 examined the sensitivity of the measures to IELTS overall
band-score differences among students in a preuniversity foundation-
year course. Scores across five adjacent band-score levels (5–5.5–6–
6.5–7+) were examined. The VKsize score discriminated among all the
IELTS band-score differences, except for the lowest adjacent comparison,
5–5.5. The smallest effect size (d = .48) was between the adjacent 5.5 and
6 levels and the largest (2.93) between the nonadjacent 5 and 7+ levels.
The results for the hits were similar. The mnRT measures discriminated
between all adjacent levels, with effect sizes ranging from .78 for the low-
est comparison (5–5.5) to 1.60 for the largest (5–7+). The CV was only significant for the 5.5–7+ (d = .42) and 5–7+ (.88) comparisons. The two
composite measures, VKsize_mnRT and VKsize_mnRT_CV, were more
sensitive than the individual VKsize measure. They discriminated between
all the adjacent levels with comparable effect sizes, providing further sup-
port for the lexical facility proposal. The advantage of combining size and
speed was also supported by the regression analysis, where the mnRT
results accounted for additional unique variance in the model beyond
VKsize, though for only about 6% of the variance, compared with 42%
for VKsize.
11.2 Sensitivity of Lexical Facility Measures by Performance Domain 249
The IELTS band-score results replicate those of the first two studies:
the VKsize and mnRT measures were reliable discriminators of test levels
and accounted for moderate-to-strong effect sizes for these differences,
and the mnRT contributed a unique amount of variance in doing this.
The CV scores were again shown to be less informative than the other
two measures, being sensitive only to the most extreme band-score
differences.
The composite VKsize_mnRT measure was more sensitive than the individual mnRT measure, though the differences were not statistically significant. The relatively greater sensitivity of the individual mnRT and composite VKsize_mnRT measures over the individual VKsize measure provides another piece of support for the lexical facility proposal.
The results from the two language school studies are consistent with
the first three studies. Vocabulary size and mnRT measures are reliable
discriminators of test levels, with the differences mostly accompanied by
strong effect sizes. The combination of size and speed results in a more
sensitive measure than size alone.
In Study 6, the mnRT measure showed a small but significant correlation with grades for the exit group (r = .32) and a nonsignificant correlation for the entry group. The CV did not correlate with course grades in either group. The VKsize_mnRT correlations for both groups were the same as the respective VKsize correlations. A
regression analysis examining the size and speed measures as joint predic-
tors of course grades showed that the VKsize scores accounted for all the
significant variance in academic English grades for the entry group
(20%). In the exit group, VKsize also accounted for most of the variance
(over 30%). The mnRT measure also accounted for a small (but signifi-
cant) amount of variance. In both the regression models on the complete
data set, it accounted for 3% of the variance, though at a more liberal
p-level of .10. For the analysis in which the individual false-alarm rates
were trimmed at 20%, it accounted for 6% at the conventional p < .05
level.
In summary, the VKsize and mnRT measures were more sensitive for
the exit group than for the entry group. A moderately strong correlation
was evident for the exit group between the final grades and VKsize and,
to a lesser extent, mnRT. The mnRT accounted for about 5% of the exit
group grade variance, an amount comparable to the earlier entry stan-
dards, IELTS band scores, and language program studies. There was a
substantial difference between the two groups in academic grades and test
performance that may have had a bearing on the results.
Study 7 explored the link between test performance and program-end
GPAs in the same cohort as in Study 6. Not unexpectedly, the results
mirrored those of the earlier study. For the entry group, there were small
correlations (in the low .3 range) for VKsize, mnRT, and the two com-
bined. The same correlations for the exit group were in the medium range
(.45).
The VKsize and mnRT measures have also been examined as predictors of GPA in tertiary English-medium programs in Oman (Roche and Harrington 2013; Harrington and Roche 2014a, b). The first study compared the two measures and academic writing skill as predictors of first-semester GPAs (Roche and Harrington 2013), and the second included reading skill, along with writing and the two lexical facility measures, as predictors of GPA (Harrington and Roche 2014b). Roche and Harrington (2013) found that VKsize and mnRT accounted for unique
GPA variance in a regression analysis that included only the two measures as predictors, though the amount of variance (about 10 and 8%, respectively) was relatively low. But they also found that when the measures were entered in a model that included an academic English writing score, the two measures accounted for no additional variance. Similarly, Harrington and Roche (2014b) examined the combined effect of reading skill, writing skill, and the two lexical facility scores as GPA predictors and also found that academic writing skill was the best overall predictor of GPA. It accounted for most of the variance in the criterion (27%), but the other three measures also accounted for a significant amount of variance (reading, 3%; VKsize, 3%; and mnRT, 2%). When the effects of the VKsize and mnRT scores were considered independently, both accounted for a small but significant amount of GPA variance (7 and 9%, respectively).
Harrington and Roche (2014a) also found that the sensitivity of the lexi-
cal facility measures as predictors of GPA varied by academic field of
study.
In summary, for the Omani data, the VKsize and mnRT measures were less sensitive to individual academic grade and GPA differences than to the group differences examined earlier. This was particularly the case where they were compared with writing and reading tasks that measure more global proficiency.
• compare the three measures of lexical facility (VKsize, mnRT, and CV)
as stable indices of L2 vocabulary skill;
• evaluate the sensitivity of these measures individually and as compos-
ites to differences in a range of academic English domains; and, in
doing so,
11.3 Key Findings
The VKsize score was the most sensitive individual measure. It was as good as (Study 1) or better than (Studies 2–4 and 6–7) the mnRT measure in discriminating between proficiency levels. In the regression models reported in Studies 3 and 6, VKsize accounted for far greater variance than mnRT (and, of course, CV). The effect sizes for the VKsize differences were consistently strong, whether reflected in Cohen’s d or the R2 value. In the trimmed data set in Study 3, VKsize accounted for over half the total variance. This finding was not unexpected, given that previous work on vocabulary size by Laufer, Nation, and their colleagues has shown that frequency-based vocabulary size measures are a robust correlate of L2 academic performance. The findings strongly replicate this earlier research.
The mnRT measure also discriminated between the groups across the studies, though it was slightly less sensitive than the VKsize measure. The d effect sizes for the significant pairwise comparisons were at minimum of medium strength, and most were strong. In Study 1, mnRT had a larger effect size than VKsize across the L1 and two L2 groups, as well as between the two L2 groups alone. In the regression analyses, the measure accounted for 3–5% of the unique variance in the models. The measure was less informative about differences in English grades and GPAs, although even here it was sensitive to some of the group comparisons.
The proposal that size and speed together provide a more sensitive mea-
sure than size alone is at the heart of the lexical facility account. This was
supported. The composite measure VKsize_mnRT was generally more
sensitive than VKsize alone. This was evident in both the number of sig-
nificant group comparisons and the relative effect sizes of these differ-
ences. In five of the seven studies, the composite VKsize_mnRT measure
produced a larger effect size than for the VKsize measure alone, though
the differences were not always statistically significant. In the regression
studies, mnRT accounted for a significant amount of unique variance
beyond vocabulary size, although the magnitude of the effect was small
(3–6%).
The findings replicate earlier research demonstrating a reliable relationship between vocabulary size and speed (Laufer and Nation 2001; Harrington 2006), and are at odds with Miralpeix and Meara (2010), who found none. The results indicate that recognition time does provide an additional, reliable source of information about individual vocabulary skill. This is the central finding of the research reported here, and it provides a solid basis for combining size and speed as a measurement dimension, that is, for lexical facility.
A distinctive feature of the lexical facility account and the vocabulary size
literature more generally is the use of word frequency statistics to estimate
vocabulary size. A basic assumption is that word frequency is a strong
predictor of when a word is learned and the speed with which it is recog-
nized. The findings here and elsewhere (e.g., Milton 2009) show that
frequency levels provide a reliable and informative framework for charac-
terizing vocabulary development that directly relates to performance.
This holds for both written and spoken modes; however, it was evident
that performance on the spoken version was consistently lower. Word
frequency statistics provide an objective, context-independent way to
benchmark L2 vocabulary development.
The most distinctive feature of the Yes/No Test format is the use of pseu-
dowords. The self-report nature of the format motivates the inclusion of
these phonologically possible, but meaningless, words as a means to
gauge if the test-taker is guessing. In principle, the false-alarm rate is a
measure of guessing independent of vocabulary size, as estimated from
the hits. In practice, this was not the case. There was substantial variabil-
ity in the false-alarm rates within and across studies, but overall the false-
alarm rates were a fair reflection of proficiency levels. They were much
higher for lower-proficiency groups and progressively dropped as levels
improved. The mean performance by the lower-proficiency groups was
20% and higher, while for more proficient L2 and L1 groups, it was
under 10%. Differences in false-alarm rates evident across the studies
here, and in other published research, raise the issue of the comparability
of findings across studies. In Studies 3 and 6, secondary analyses were carried out in which the data sets were trimmed of individuals who had mean false-alarm rates exceeding 20% (Chaps. 8 and 10) and 10% (Chap. 8).
The statistical tests were then run again. The results were very similar to
the original analyses, with the trimmed data sets yielding larger effect
sizes, though the differences were not significant. It was also evident that
the hits by themselves yield a reasonably sensitive measure of vocabulary
knowledge, though not as strong as the VKsize measure. This all suggests
that false alarms may not be necessary for measuring individual perfor-
mance (Harsch and Hartig 2015).
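The trimming and scoring logic is straightforward to express in code. The sketch below assumes the 72-word/28-pseudoword test used in Studies 6 and 7 and illustrative response counts; VKsize is computed as the hit rate minus the false-alarm rate, the correction-for-guessing scoring given in the table notes.

```python
import numpy as np

# Illustrative per-participant response counts (72 words, 28 pseudowords)
hits = np.array([60, 48, 39, 66])          # 'yes' responses to real words
false_alarms = np.array([3, 9, 12, 1])     # 'yes' responses to pseudowords
N_WORDS, N_PSEUDO = 72, 28

hit_rate = hits / N_WORDS
fa_rate = false_alarms / N_PSEUDO

# Correction-for-guessing score: hits minus false alarms, as a percentage
vksize = (hit_rate - fa_rate) * 100

# Trimmed reanalysis: keep only participants below the false-alarm cutoff
keep_20 = fa_rate < 0.20
print(vksize.round(1))
print(f"retained at the 20% cutoff: {keep_20.sum()} of {len(keep_20)}")
```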
The collection of recognition time data and its use as evidence for
underlying knowledge states is typically associated with the laboratory.
In these controlled settings, the focus is on response time variability in
largely error-free performance in which target behaviors are narrowly
defined and technical demands readily met. The research presented here has examined mean recognition time differences in error-filled performance in more everyday instructional settings. Ensuring optimum performance, that is, that the test-taker is attending to the task and working as quickly (and accurately) as possible, is a challenge. A significant threat to the reliability of the results is a systematic trade-off in
how quickly and accurately a test-taker responds. Responding very
quickly with many errors, or very slowly with few errors, will render
the results difficult to interpret. There was little evidence of a system-
atic correlation in individual performance between higher accuracy
and slower performance (or vice versa). It is not possible to rule out any
trade-off behavior, but there was no evidence of systematic bias in any
of the individuals or groups studied. All of the studies showed signifi-
cant positive relationships between VKsize and the inverted mnRT, but
the size of the correlations (.2–.5) indicated that other factors were also
at play.
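A simple way to run this check is to invert the RT measure so that larger values mean faster responding and then correlate it with VKsize, as in this sketch with simulated scores (a positive correlation, as found in the studies, argues against a systematic trade-off):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
vksize = rng.normal(55, 15, 100)                    # simulated VKsize scores
mnrt = 2000 - 8 * vksize + rng.normal(0, 250, 100)  # faster at higher size

inverted_speed = -mnrt        # invert so that larger values mean faster
r, p = pearsonr(vksize, inverted_speed)
print(f"r = {r:.2f}, p = {p:.4f}")  # positive r: no sign of a trade-off
```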
The variability across the studies is also a concern. The IELTS 6.5 group in Study 2 had a much higher mnRT (M = 1444) than the 6.5 group in Study 3 (M = 1032), despite the VKsize scores being nearly identical. The mnRT means for both language program groups (Studies 4 and 5) are much higher relative to the VKsize scores compared with the other groups. As noted, this may be due to a relative lag in the development of recognition speed behind size.
11.4 Conclusions
The findings support the key element of the lexical facility proposal,
namely that the combination of size and speed provides a more sensitive
index of differences in L2 lexical skill than size alone. This advantage is
reflected in greater sensitivity to proficiency and performance differences
in a range of academic English domains. In five of the seven studies, the
combination of size and speed resulted in larger effect sizes than for the
VKsize measure alone, whether in the regression models or in the com-
posite scores, although the differences were not always statistically sig-
nificant and further confirmation is needed. The CV was much less sensitive to proficiency differences, showing significant and strong effects for all the pairwise comparisons only in Study 1. Otherwise, effects were only evident when comparing groups whose level differences were highly distinct, as in the IELTS 6.5 and L1 English groups in Study 2. The usefulness of the CV as an index of proficiency remains very much an open question.
Unique to the testing format used here is the inclusion of pseudowords to assess guessing. The false-alarm rate provided a somewhat stable index of proficiency across the studies.
Notes
1. The VKsize score is an indirect measure of the individual’s vocabulary size. A very rough estimate of what a VKsize score of 70 represents as overall vocabulary size can be calculated by taking 70% of 10,000, the word range sampled in almost all the tests here. That would be a minimum of 7000 words. Note this is based on the unlikely assumption that the false-alarm rate adjusts the hit rate exactly for the actual size. The individual will also know some words beyond the 10K level, but a steadily diminishing percentage of these, maybe an additional 1500 words, for a total of 8500 words. This is a rough estimate of the actual size and, given that only four frequency bands are sampled, one closer to a guess than an estimate; the arithmetic is sketched after these notes. For more precise estimation, the Vocabulary Size Test, which samples each level from 1K to 15K, is superior (Beglar 2010).
2. Study 2 also had a much slower L1 group (M = 960) compared with the
baseline L1 group in Study 1 (M = 777).
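The arithmetic in Note 1, expressed as a sketch (the 1500-word tail beyond the 10K level is the note’s loose assumption, not an estimate from the data):

```python
# Note 1 arithmetic: a VKsize score of 70 over the 10,000-word range
vksize_score = 70        # percent, corrected for guessing
sampled_range = 10_000   # the 2K-10K range sampled in almost all tests here
beyond_10k = 1_500       # the note's loose assumption about the >10K tail

minimum = vksize_score / 100 * sampled_range   # 7,000 words
estimate = minimum + beyond_10k                # ~8,500 words
print(f"minimum {minimum:,.0f}, rough total {estimate:,.0f} words")
```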
References
Beglar, D. (2010). A Rasch-based validation of the vocabulary size test. Language
Testing, 27(1), 101–118. doi:10.1177/0265532209340194.
Harrington, M. (2006). The lexical decision task as a measure of L2 lexical pro-
ficiency. EUROSLA Yearbook, 6(1), 147–168.
Harrington, M., & Roche, T. (2014a). Word recognition skill and academic
achievement across disciplines in an English-as-lingua-franca setting. In
U. Knoch (Ed.), Papers in Language Testing, 16, 4.
Harrington, M., & Roche, T. (2014b). Identifying academically at-risk students
at an English-medium university in Oman: Post-enrolment language assess-
ment in an English-as-a-foreign language setting. Journal of English for
Academic Purposes, 15, 34–37.
Harsch, C., & Hartig, J. (2015). Comparing C-tests and yes/no vocabulary size
tests as predictors of receptive language skills. Language Testing, 33(4),
555–575.
Laufer, B., & Nation, P. (1995). Vocabulary size and use: Lexical richness in L2
written production. Applied Linguistics, 16(3), 307–322.
Laufer, B., & Nation, P. (2001). Passive vocabulary size and speed of meaning
recognition: Are they related? EUROSLA Yearbook, 1(1), 7–28.
Milton, J. (2009). Measuring second language vocabulary acquisition. Bristol:
Multilingual Matters.
Miralpeix, I., & Meara, P. (2010). The written word. Retrieved from www.
lognostics.co.uk/Vlibrary
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes in L2 research. Language Learning, 64, 878–912. doi:10.1111/lang.12079.
Roche, T., & Harrington, M. (2013). Recognition vocabulary knowledge as a
predictor of academic performance in an English as a foreign language set-
ting. Language Testing in Asia, 3(1), 1–13. doi:10.1186/2229-0443-3-12.
Zhang, X., & Lu, X. (2013). A longitudinal study of receptive vocabulary
breadth knowledge growth and fluency development. Applied Linguistics,
35(3), 283–304. doi:10.1093/applin/amt014.
12
The Future of Lexical Facility
Aims
• Recap the conceptual and empirical cases for the lexical facility
proposal.
• Discuss the suitability of the Timed Yes/No Test and alternative measurement instruments.
• Identify future research directions.
• Consider applications for second language (L2) vocabulary instruction
and assessment.
12.1 Introduction
The lexical facility proposal is driven by the idea that combining vocabu-
lary size and recognition skill (speed and consistency) results in a second
language (L2) vocabulary measure that is more sensitive than size alone
to user proficiency and performance differences. A three-part construct
was introduced at the outset of the book that combines the size of an
individual’s recognition vocabulary, the relative speed with which these
words are recognized, and the consistency of this recognition speed into
the unitary notion of lexical facility. It was proposed that the three mea-
sures combined are more sensitive to individual differences in L2 vocabu-
lary knowledge, as manifested in performance in various domains of
academic English, than size alone. This proposal was tested in the studies reported in Part 2.
The chapter begins by recapping the conceptual and empirical case for
the lexical facility account. Following this, the strengths and limitations
of the Timed Yes/No Test as a measure of lexical facility are considered,
and alternative approaches to testing the construct are identified. Directions
for further research are then discussed. These address limitations of the
current work and serve to more solidly establish the function and scope
of the lexical facility construct. Finally, this chapter (and the book) con-
cludes by outlining possible applications of the approach to L2 vocabu-
lary instruction, learning, and assessment.
Empirical evidence for the account comes from the studies summarized
in Chap. 11. The studies examined the sensitivity of the three lexical facil-
ity measures to performance differences in various domains of academic
English. Vocabulary size is measured by the VKsize score, mean recogni-
tion time by the mnRT measure, and consistency by the coefficient of
variation, CV. Sensitivity reflects how well the measures discriminate
between proficiency levels and the strength of the observed differences,
both separately and in combination.
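To fix the three measures’ definitions, here is a minimal sketch computing them from a hypothetical trial log: VKsize as hits minus false alarms, mnRT as the mean response time, and the CV as the RT standard deviation over the mean. Computing the RT statistics over correct ‘yes’ responses to real words is an assumption of this sketch, not a detail specified at this point in the text.

```python
import numpy as np

# Hypothetical single-participant trial log: (is_word, said_yes, rt_ms)
trials = [(True, True, 812), (True, False, 1430), (False, False, 955),
          (True, True, 701), (False, True, 1120), (True, True, 990)]

words = [t for t in trials if t[0]]
pseudos = [t for t in trials if not t[0]]

hit_rate = sum(t[1] for t in words) / len(words)
fa_rate = sum(t[1] for t in pseudos) / len(pseudos)
vksize = (hit_rate - fa_rate) * 100          # size, corrected for guessing

# mnRT and CV over correct 'yes' responses to real words (an assumption)
yes_rts = np.array([t[2] for t in words if t[1]], dtype=float)
mnrt = yes_rts.mean()                        # mean recognition time (ms)
cv = yes_rts.std(ddof=1) / mnrt              # consistency: SD of RT / mean RT

print(f"VKsize = {vksize:.0f}, mnRT = {mnrt:.0f} ms, CV = {cv:.2f}")
```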
Of the three measures, VKsize was the most sensitive, consistently dis-
criminating between proficiency levels across university-age groups,
English entry standards, IELTS, and program placement outcomes. The
differences were also accompanied by consistently large effect sizes. mnRT
scores were also sensitive to performance differences. Faster recognition
times consistently correlated with higher-proficiency levels, test scores,
and placement levels, though not to the extent of VKsize. The mnRTs
were more variable than the VKsize responses but did provide a stable
index of proficiency.
The lexical facility account is distinctive in proposing that recognition
time consistency (captured by the CV) can also serve as an index of L2
vocabulary proficiency. The findings provide only limited support for the
notion: the measure was sensitive to group or level differences only when there was a noticeably large difference in proficiency, and even in these instances, the effect sizes were, at best, small. The results indicate that consistency may serve as a broad index of proficiency, but one with still limited practical application in measurement.
It was proposed that composite measures combining size and processing skill (mnRT and CV) are more sensitive to criterion differences than size (VKsize) alone. The proposal was borne out for the mnRT measure but not for the CV. The VKsize_mnRT composite measure was more sensitive to group differences than VKsize alone. Greater sensitivity was evident in both the number of significant group comparisons and the relative effect sizes of these differences. The VKsize_mnRT measure produced a larger
effect size in five of the seven studies, though the differences were not
always statistically significant. In the regression studies, it accounted
for a significant amount of unique variance beyond VKsize, although
the magnitude of the effect was small, accounting for 3–6% of the
unique variance. This pattern was evident for both the written and
spoken modes, though performance on the spoken version was consis-
tently lower across groups and frequency levels.
The incorporation of the CV in the lexical facility construct is an attempt to treat response variability as a window on performance, as opposed to mere ‘noise’ that might otherwise obscure experimental effects of interest. Variability as a characteristic of skill development has been of interest to cognitive science researchers (Balota and Yap 2015; Hird and Kirsner 2010) and is represented in L2 research by the work on the CV by Segalowitz and his colleagues (Segalowitz 2010). It is an area that warrants greater attention in L2 acquisition research, despite the modest CV results reported here.
The findings also validated the frequency-based approach to measuring vocabulary knowledge. A fundamental assumption of the lexical facility approach is that corpus-based word frequency statistics predict, in probabilistic terms, when and how well a word is learned. The latter is reflected in part by mean recognition speed. Size performance aligned closely with frequency-of-occurrence levels, as did recognition time. The growing recognition that frequency correlates with the contextual diversity in which a word is encountered (Adelman et al. 2006; Crossley et al. 2013) makes the approach potentially even more informative.
In summary, VKsize and mnRT were shown to be robust measures of
proficiency, and the composite measures of size and speed provided a
more sensitive measure than size alone in the majority of the compari-
sons. Most of the studies reported a mean advantage for the composites,
though in some of these instances, the mean advantages were not statisti-
cally significant. The results thus await further confirmation. The two
regression models also showed that mnRT responses accounted for sig-
nificant, unique variance beyond VKsize. The CV was shown to be only
an approximate measure of proficiency. The use of word frequency statis-
tics as an index of L2 vocabulary knowledge was also corroborated.
12.3 Measuring Lexical Facility: The Timed Yes/No Test and Alternatives
In principle, the lexical facility construct is independent of the measure-
ment format. However, all the evidence presented for it has been col-
lected using the Timed Yes/No Test. As such, problems with the testing
instrument will also be problems for the research construct. Issues arising
from using the Timed Yes/No Test format are first identified, and then
alternatives to the current paradigm are considered.
The test relies on self-report of user vocabulary knowledge. The user sim-
ply indicates whether or not the word is known. Instances of guessing
aside, the format provides no way of establishing what the user knows
about the target word. The test provides a measure of size, a quantitative
property of the user’s mental lexicon. At issue is how well this property
relates to L2 performance and not to knowledge of specific meanings or
senses. The property is described as a probabilistic estimate that ulti-
mately has to be combined with other measures to get a complete picture
of the user’s L2 vocabulary knowledge and skill.
The Timed Yes/No Test draws on the lexical decision paradigm for the measurement of recognition time. However, there are important differences between the respective testing conditions that make the interpretation of mean recognition times in the Timed Yes/No Test more complicated.
Pseudowords
The written version of the test incidentally assesses English spelling skill.
Differences in a test-taker’s L1 script may, therefore, also affect performance. Yes/No Test performance by Arabic L1 users has been shown to be lower than that of matched-proficiency L2 counterparts from alphabet-based L1 backgrounds (Milton 2009), making direct comparison across these populations problematic. This asymmetry disappears in
the spoken version. Users from cognate languages may also be differen-
tially affected due to the similarities of the L1 and L2 scripts. It has been
shown that users from closely cognate languages can confuse test pseudo-
words with real L1 words (Meara et al. 1994)—another reason for care in
developing the test items. The variability in written Yes/No Test perfor-
mance potentially introduced by the L1 script needs attention. This is
particularly the case when interpreting performance in settings where
learners from markedly different L1 orthographies are compared. The
studies here had participants from China, Hong Kong, Taiwan, Japan,
Vietnam, and Oman.
Absence of Context
Words are normally encountered in context in the real world, and the word recognition processes at the center of the test are
themselves highly sensitive to context. The relatedness of word meaning
is best exemplified in the pervasive effect of priming in word recognition.
Words encountered before or along with a target exert a strong influence
on how quickly the target is recognized and judged. The format does not
directly tap these processes. Previously presented items no doubt affect
performance, but the potential effect is controlled by randomization.
However, it is also the case that individual words develop resting repre-
sentation strengths that reflect the user’s exposure to the word and the
resulting links the word has with the other words in the mental lexicon.
These strengths are a property of the learner’s L2 mental lexicon—a
trait—that provides an important window on L2 performance.
Low-Stakes Testing
All of the studies are examples of low-stakes testing, in which the impor-
tance of the test outcome for the user is limited. The studies were carried
out as part of a research-driven data collection program. Test outcomes
had no bearing on the participants’ grade or any other aspect of study. As
a result, the degree of motivation and the attention users gave to the task
varied within and between groups. Most users were keen and focused on
the task, but on occasion needed to be reminded and, more generally, monitored.
There are alternatives to the test format that can reduce or eliminate most
of the limitations noted. Any alternative measure of lexical facility needs
to meet two basic criteria: the test items must be sampled from a range of
frequency levels that allow vocabulary size to be estimated, and the speed
with which individual items are recognized needs to be collected or con-
trolled. Alternative formats for collecting both size and speed measures
are available.
Pair Choice The potential response bias that arises by using binary
response options can be avoided by using a pair choice format, such as in
the Recognition-Based Vocabulary Test proposed by Eyckmans (2004).
Here, the user chooses which member of a word–pseudoword pair is an
actual word. While this format avoids the yes/no response bias problem,
other problems associated with using pseudowords remain, including
potential similarity with words in the L1 or L2. Another alternative is an
animacy-judgment task, which involves a semantic choice (Segalowitz
and Freed 2004). Users are presented with a noun pair—one animate and
one inanimate—and must only identify the animate term. This format
effectively circumvents the problems arising from the use of pseudowords.
Unfortunately, there is only a relatively small number of animate nouns in the language, making it difficult to adequately cover the frequency range required for estimating size.
Words Only (No Pseudowords) The simplest way to avoid the pseudoword
problem is to eliminate them altogether. A suitable instruction set might
prove sufficient to minimize test-taker guessing, especially for particular
learner backgrounds and settings (Shillaw 1996). Another way to dis-
courage guessing is to regularly and randomly stop the test after a ‘yes’
response to a word and ask the user to define, describe, or translate it.
However, recurrent interruptions will lengthen the time it takes to com-
plete the test and may affect response speed, as the user constantly goes
off- and online. Another option is to test (or at least threaten to test) the
word items at the end of the test. This may also lengthen the duration of
the test beyond desirable limits.
12.4 The Next Step in Lexical Facility Research
The lexical facility proposal is motivated by the notion that time is a
defining feature of L2 vocabulary skill and thus should be directly incor-
porated in the measurement of L2 vocabulary. Differences in processing
time, especially response times, have long been used as a window on
underlying knowledge representations.1 However, in both L1 and L2 research, time differences have typically been examined across established knowledge representations. The lexical facility account proposes
that the combination of developing vocabulary knowledge and process-
ing skill provides a measure of L2 vocabulary knowledge/skill more sensi-
tive to differences in L2 performance than the individual measures alone.
Conceptually, the account represents an approach to modeling L2 vocab-
ulary knowledge that recognizes its time-contingent nature and seeks to
understand how the temporal dimension of recognition speed (and pos-
sibly consistency) covaries with vocabulary knowledge in L2 proficiency
development. It has been evident that combining size and speed presents
significant theoretical and methodological concerns. These have been
identified and addressed to varying degrees. The studies provide empirical
support for combining size and speed, but more work is needed before
the lexical facility construct can be considered firmly established.
The first need is for a better understanding of how vocabulary size and
recognition speed covary in the course of development. Previous research
has shown that the development of recognition speed lags behind that of
size within individuals. The one-off nature of the studies here, coupled
with smaller sample sizes in many of the conditions, did not allow this
issue to be addressed. More work is needed to ascertain whether there is a
consistent relationship between lexical facility and proficiency as the learner
12.5 Uses of Lexical Facility in Vocabulary Assessment and Instruction
Lexical facility is a low-level processing constraint on comprehension that is sensitive to proficiency differences across a number of academic English domains. The online Timed Yes/No Test format is a time- and resource-
effective tool that allows the size and speed measures to be gathered easily
in program, classroom, and individual settings, for low-stakes testing
purposes. The attractiveness of the untimed Yes/No Test format for place-
ment testing and user self-diagnosis was recognized from the time it first
appeared (Meara and Jones 1990; Milton 2009). The inclusion of recog-
nition time as a response measure in the timed version improves the sen-
sitivity of the measure and has the potential to increase user engagement
(Lee and Chen 2011).
The testing format allows a sample of learner vocabulary size and speed to
be obtained quickly and reliably in classroom and program settings. It
lends itself to three types of testing in particular.
Placement Testing The studies reported here and elsewhere have shown that the vocabulary size and RT measures have good predictive validity when calibrated against placement levels, and especially when used in combination with site-specific tools (Harrington and Carey 2009). The test is also an efficient online tool for institutions to test international students offshore (Roche and Harrington 2017).
Vocabulary Instruction The need to increase learner vocabulary size has long been recognized as an imperative in L2 vocabulary instruction (Nation 2013). As recognition time is also established as a crucial aspect of developing vocabulary skill, one might ask whether explicit attempts to develop recognition speed in and outside the classroom are also warranted.
ognition speed in and outside the classroom are also warranted. A small
number of studies have attempted to explicitly develop learner recogni-
tion speed. Explicit retrieval practice on learned words has been used to
develop automaticity in single-word recognition (Akamatsu 2008) and
reading comprehension (Fukkink et al. 2005), and to facilitate written
production in English (Snellings et al. 2002). Other intentional retrieval
activities have also been used to improve vocabulary learning outcomes
(Barcroft 2007). Such studies are still relatively scarce, and the scope for
future work in the area is significant.
12.6 Conclusions
This book has made a case for including recognition speed (and to a lesser
extent, consistency) in the measurement of L2 vocabulary knowledge. The
point of departure was the observation that more proficient users can rec-
ognize more words and do this faster and more consistently than less pro-
ficient users, and the suggestion that this relationship is not coincidental.
While vocabulary size has received significant attention, recognition speed
has largely been ignored as a usable index of L2 vocabulary learning—as
opposed to processing skill.2 The findings show that the combination of
mean recognition speed and size provides a more sensitive measure of L2
vocabulary differences than either alone.
This project is the first to systematically examine a measure of process-
ing consistency—the CV—as an index of proficiency, analogous to
vocabulary size and recognition speed. The measure was sensitive to
group differences when the proficiency levels were very distinct, but was otherwise less informative.
The findings reinforce the importance of lower-level vocabulary pro-
cesses in models of L2 vocabulary and also demonstrate the usefulness of
frequency-based approaches to indexing L2 development. They replicate
and extend previous work that has shown vocabulary size to be a sensitive index of L2 proficiency.
Notes
1. Or, in the words of a popular cognitive psychology textbook, “Time is cognition” (Lachman et al. 1979, p. 133).
2. One of the few exceptions is Pellicer-Sánchez and Schmitt (2012), who
used a threshold recognition time to establish whether a word had been
learned.
References
Adelman, J. S., Brown, G. D. A., & Quesada, J. F. (2006). Contextual diversity,
not word frequency, determines word-naming and reading times. Psychological
Science, 17(9), 814–823.
Akamatsu, N. (2008). The effects of training on automatization of word recognition in English as a foreign language. Applied Psycholinguistics, 29(2), 175–193. doi:10.1017/S0142716408080089.
Albrechtsen, D., Haastrup, K., & Henriksen, B. (2008). Vocabulary and writing
in a first and second language: Processes and development. Basingstoke: Palgrave
Macmillan.
Alderson, J. (2005). Diagnosing foreign language proficiency: The interface between
learning and assessment. New York: Continuum.
Anderson, R. C., & Freebody, P. (1981). Vocabulary knowledge. In J. T. Guthrie
(Ed.), Comprehension and teaching: Research reviews (pp. 77–117). Newark:
International Reading Association.
Anderson, R. C., & Freebody, P. (1983). Reading comprehension and the assess-
ment and acquisition of word knowledge. In B. Hutson (Ed.), Advances in
reading/language research (Vol. 2, pp. 231–256). Greenwich: JAI Press.
Andrews, S. (1992). Frequency and neighborhood effects on lexical access: Lexical
similarity or orthographic redundancy? Journal of Experimental Psychology:
Learning, Memory, and Cognition, 18(2), 234–254.
doi:10.1037/0278-7393.18.2.234.
Andrews, S. (2008). Lexical expertise and reading skill. In B. H. Ross (Ed.), The
psychology of learning and motivation: Advances in research and theory (Vol. 49,
pp. 247–281). San Diego: Elsevier.
Andrews, S. (Ed.). (2010). From inkmarks to ideas: Current issues in lexical pro-
cessing. Hove: Psychology Press.
Bachman, L. (1990). Fundamental considerations in language testing. Oxford:
Oxford University Press.
Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice: Developing
language assessments and justifying their use in the real world. Oxford: Oxford
University Press.
Baddeley, A. (2012). Working memory: Theories, models, and controversies.
Annual Review of Psychology, 63, 1–29.
Baddeley, A. D., & Hitch, G. (1974). Working memory. In G. H. Bower (Ed.),
The psychology of learning and motivation (Vol. 8, pp. 47–89). New York:
Academic Press.
Bader, M., & Häussler, J. (2010). Toward a model of grammaticality judgments.
Journal of Linguistics, 46(2), 273–330. doi:10.1017/S0022226709990260.
Balota, D. A., & Chumbley, J. I. (1984). Are lexical decisions a good measure of
lexical access? The role of word frequency in the neglected decision phase.
Journal of Experimental Psychology: Human Perception and Performance, 10(3),
340–357. doi:10.1037/0096-1523.10.3.340.
Balota, D. A., Cortese, M. J., Sergent-Marshall, S. D., Spieler, D. H., & Yap,
M. J. (2004). Visual word recognition of single-syllable words. Journal of
Experimental Psychology: General, 133(2), 382–416.
Balota, D. A., Yap, M. J., & Cortese, M. J. (2006). Visual word recognition: The
journey from features to meaning (a travel update). In M. J. Traxler & M. A.
Gernsbacher (Eds.), Handbook of psycholinguistics (2nd ed., pp. 285–375).
Amsterdam: Elsevier.
Barcroft, J. (2007). Effects of opportunities for word retrieval during second
language vocabulary learning. Language Learning, 57(1), 35–56.
Bardel, C., & Lindqvist, C. (2011). Developing a lexical profiler for spoken
French L2 and Italian L2: The role of frequency, thematic vocabulary and
cognates. EUROSLA Yearbook, 11, 75–93. doi:10.1075/eurosla.11.06bar.
Bauer, L., & Nation, I. S. P. (1993). Word families. International Journal of
Lexicography, 6(4), 253–279.
Beeckmans, R., Eyckmans, J., Janssens, V., Dufranne, M., & Van de Velde, H.
(2001). Examining the yes/no vocabulary test: Some methodological issues
in theory and practice. Language Testing, 18(3), 235–274.
Beglar, D. (2010). A Rasch-based validation of the vocabulary size test. Language
Testing, 27(1), 101–118. doi:10.1177/0265532209340194.
Beglar, D., & Hunt, A. (1999). Revising and validating the 2000 word level and
the university word level vocabulary tests. Language Testing, 16(2), 131–162.
doi:10.1191/026553299666419728.
Bell, L. C., & Perfetti, C. A. (1994). Reading skill: Some adult comparisons.
Journal of Educational Psychology, 86(2), 244–255. doi:10.1037/0022-0663.86.2.244.
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating
language structure and use. Cambridge: Cambridge University Press.
Brown, J. D. (2005). Testing in language programs: A comprehensive guide to
English language assessment. New York: McGraw-Hill.
Bruton, A. (2009). The vocabulary knowledge scale: A critical analysis. Language
Assessment Quarterly, 6(4), 288–297.
Bundgaard-Nielsen, R. L., Best, C. T., & Tyler, M. D. (2011). Vocabulary size
is associated with second-language vowel perception performance in adult
learners. Studies in Second Language Acquisition, 33(3), 433–461. doi:10.1017/
S0272263111000040.
Cameron, L. (2002). Measuring vocabulary size in English as an additional lan-
guage. Language Teaching Research, 6(2), 145–173.
Carreiras, M., Perea, M., & Grainger, J. (1997). Effects of the orthographic
neighborhood in visual word recognition: Cross-task comparisons. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 23(4), 857.
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies.
Cambridge: Cambridge University Press.
Crossley, S. A., Subtirelu, N., & Salsbury, T. (2013). Frequency effects or context effects in second language word learning. Studies in Second Language Acquisition, 35, 727–755. doi:10.1017/S0272263113000375.
Eyckmans, J. (2004). Learners' response behavior in Yes/No vocabulary tests. In H. Daller, J. Milton, & J. Treffers-Daller (Eds.), Modelling and assessing vocabulary knowledge (pp. 59–76). Cambridge: Cambridge University Press.
Fukkink, R. G., Hulstijn, J., & Simis, A. (2005). Does training in second-language word recognition affect reading comprehension? An experimental study. Modern Language Journal, 89(1), 54–75. doi:10.1111/j.0026-7902.2005.00265.x.
Gelderen, A. V., Schoonen, R., Glopper, K. D., Hulstijn, J., Simis, A., Snellings,
P., & Stevenson, M. (2004). Linguistic knowledge, processing speed, and
metacognitive knowledge in first- and second-language reading compre-
hension: A componential analysis. Journal of Educational Psychology, 96(1),
19–30.
Gelderen, A. V., Schoonen, R., Stoel, R. D., Glopper, K. D., & Hulstijn, J. (2007).
Development of adolescent reading comprehension in language 1 and language
2: A longitudinal analysis of constituent components. Journal of Educational
Psychology, 99(3), 477–491. doi:10.1037/0022-0663.99.3.477.
Geva, E., & Wang, M. (2001). The development of basic reading skills in chil-
dren: A cross-language perspective. Annual Review of Applied Linguistics, 21,
182–204.
Grabe, W. (2009). Reading in a second language: Moving from theory to practice.
New York: Cambridge University Press.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics.
New York: Wiley.
Grigorenko, E. L., & Naples, A. J. (2012). Single-word reading: Behavioral and
biological perspectives. New York: Taylor & Francis.
Hannon, B. (2012). Understanding the relative contributions of lower-level
word processes, higher-level processes, and working memory to reading com-
prehension performance in proficient adult readers. Reading Research
Quarterly, 47(2), 125–152. doi:10.1002/RRQ.013.
Harrington, M. (2006). The lexical decision task as a measure of L2 lexical pro-
ficiency. EUROSLA Yearbook, 6(1), 147–168.
Harrington, M., & Carey, M. (2009). The online yes/no test as a placement
tool. System, 37(4), 614–626. doi:10.1016/j.system.2009.09.006.
Harrington, M., & Jiang, W. (2013). Focus on the forms: From recognition
practice in Chinese vocabulary learning. Australian Review of Applied
Linguistics, 36(2), 132–145.
Harrington, M., & Levy, M. (2001). CALL begins with a ‘C’: Interaction in
computer-mediated language learning. System, 29(1), 15–26.
Harrington, M., & Roche, T. (2014a). Word recognition skill and academic
achievement across disciplines in an English-as-lingua-franca setting. In
U. Knoch (Ed.), Papers in Language Testing, 16, 4.
Harrington, M., & Roche, T. (2014b). Identifying academically at-risk students
at an English-medium university in Oman: Post-enrolment language assess-
ment in an English-as-a-foreign language setting. Journal of English for
Academic Purposes, 15, 34–37.
Harsch, C., & Hartig, J. (2015). Comparing C-tests and yes/no vocabulary size
tests as predictors of receptive language skills. Language Testing, 33(4),
555–575.
Hazenberg, S., & Hulstijn, J. H. (1996). Defining a minimal receptive vocabu-
lary for non-native university students: An empirical investigation. Applied
Linguistics, 17(2), 145–163.
Heitz, R. P. (2014). The speed-accuracy tradeoff: History, physiology, methodol-
ogy, and behavior. Frontiers in Neuroscience, 8, 150.
Hird, K., & Kirsner, K. (2010). Objective measurement of fluency in natural
language production: A dynamic systems approach. Journal of Neurolinguistics,
23(5), 518–530. doi:10.1016/j.jneuroling.2010.03.001.
Hirsh, D., & Nation, P. (1992). What vocabulary size is needed to read
unsimplified texts for pleasure? Reading in a Foreign Language, 8(2), 689–696.
Holden, J. G., Van Orden, G. C., & Turvey, M. T. (2009). Dispersion of
response times reveals cognitive dynamics. Psychological Review, 116(2),
318–342. doi:10.1037/a0014849.
Holmes, V. M. (2009). Bottom-up processing and reading comprehension in
experienced adult readers. Journal of Research in Reading, 32(3), 309–326.
doi:10.1111/j.1467-9817.2009.01396.
Hoover, W. A., & Gough, P. B. (1990). The simple view of reading. Reading and
Writing, 2(2), 127–160. doi:10.1007/BF00401799.
Hu, M., & Nation, P. (2000). Unknown vocabulary density and reading com-
prehension. Reading in a Foreign Language, 13(1), 403–430.
Huibregtse, I., Admiraal, W., & Meara, P. (2002). Scores on a yes-no vocabulary
test: Correction for guessing and response style. Language Testing, 19(3),
227–245.
Hulstijn, J. H. (2011). Language proficiency in native and nonnative speakers:
An agenda for research and suggestions for second-language assessment.
Language Assessment Quarterly, 8(3), 229–249.
Hulstijn, J. H., Van Gelderen, A., & Schoonen, R. (2009). Automatization in
second language acquisition: What does the coefficient of variation tell us?
Applied Psycholinguistics, 30(4), 555–582.
Jackson, N. E. (2005). Are university students’ component reading skills related
to their text comprehension and academic achievement? Learning and
Individual Differences, 15(2), 113–139. doi:10.1016/j.lindif.2004.11.001.
Jacobs, A. M., & Grainger, J. (1994). Models of visual word recognition:
Sampling the state of the art. Journal of Experimental Psychology: Human
Perception and Performance, 20(6), 1311.
Lightbown, P. M., & Spada, N. (2013). How languages are learned (4th ed.).
Oxford: Oxford University Press.
Luce, R. D. (1986). Response times. New York: Oxford University Press.
Magnuson, J. S. (2008). Nondeterminism, pleiotropy, and single-word reading:
Theoretical and practical concerns. In E. L. Grigorenko & A. J. Naples (Eds.),
Single-word reading: Behavioral and biological perspectives (pp. 377–404).
New York: Lawrence Erlbaum Associates.
Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing
data: A model comparison perspective (2nd ed.). New York: Psychology Press.
McCarthy, M. (1998). Spoken language and applied linguistics. Cambridge:
Cambridge University Press.
McLean, S., Hogg, N., & Kramer, B. (2014). Estimations of Japanese university
learners’ English vocabulary sizes using the vocabulary size test. Vocabulary
Learning and Instruction, 3(2), 47–55.
McLean, S., Kramer, B., & Beglar, D. (2015). The creation of a new vocabulary
levels test. Language Teaching Research, 19(6), 741–760. doi:10.1177/1362168814567889.
McNamara, T. F. (1996). Measuring second language performance. London:
Addison Wesley Longman.
Meara, P. (1989). Matrix models of vocabulary acquisition. AILA Review, 6,
66–74.
Meara, P. (1996). The dimensions of lexical competence. In G. Brown,
K. Malmkjaer, & J. Williams (Eds.), Performance and competence in second
language acquisition (pp. 35–53). Cambridge: Cambridge University Press.
Meara, P. (2002). The rediscovery of vocabulary. Second Language Research,
18(4), 393–407. doi:10.1191/0267658302sr211xx.
Meara, P. (2005). Lexical frequency profiles: A Monte Carlo analysis. Applied
Linguistics, 26(1), 32–47.
Meara, P. (2009). Connected words: Word associations and second language vocabu-
lary acquisition. Amsterdam: John Benjamins.
Meara, P., & Buxton, B. (1987). An alternative to multiple choice vocabulary
tests. Language Testing, 4(2), 142–145.
Meara, P., & Jones, G. (1987). Tests of vocabulary size in English as a foreign
language. Polyglot, 8(1), 1–40.
Meara, P., & Jones, G. (1988). Vocabulary size as placement indicator. In
P. Grunwell (Ed.), Applied linguistics in society (pp. 80–87). London: CILT.
Meara, P., & Jones, G. (1990). Eurocentres vocabulary size test. 10KA. Zurich:
Eurocentres.
Meara, P. M., & Milton, J. L. (2002). X_Lex: The Swansea vocabulary levels test.
Newbury: Express.
Meara, P. M., & Milton, J. (2003). X_Lex: The Swansea vocabulary levels test.
Swansea: Lognostics.
Meara, P. M., & Miralpeix, I. (2006). Y_Lex: The Swansea advanced vocabulary
levels test. v2.05. Swansea: Lognostics.
Meara, P., Lightbown, P. M., & Halter, R. H. (1994). The effects of cognates on
the applicability of yes/no vocabulary tests. The Canadian Modern Language
Review, 50(2), 296–311.
Messick, S. (1995). Validity of psychological assessment: Validation of infer-
ences from persons’ responses and performances as scientific inquiry into
score meaning. American Psychologist, 50(9), 741.
Milton, J. (2009). Measuring second language vocabulary acquisition. Bristol:
Multilingual Matters.
Milton, J., & Alexiou, T. (2009). Vocabulary size and the common European
framework of reference for languages. In B. Richards, M. H. Daller, D. D.
Malvern, P. Meara, J. Milton, & J. Treffers-Daller (Eds.), Vocabulary studies
in first and second language acquisition (pp. 194–211). Basingstoke: Palgrave
Macmillan.
Milton, J., & Hopkins, N. (2006). Lexical profiles, learning styles and the con-
struct validity of lexical size tests. In H. Daller, J. Milton, & J. Treffers-Daller
(Eds.), Modelling and assessing vocabulary knowledge (pp. 47–58). Cambridge:
Cambridge University Press.
Miralpeix, I. (2007). Lexical knowledge in instructed language learning: The
effects of age and exposure. International Journal of English Studies, 7(2),
61–83.
Miralpeix, I., & Meara, P. (2010). The written word. Retrieved from www.lognostics.co.uk/Vlibrary
Mochida, A., & Harrington, M. (2006). The yes-no test as a measure of recep-
tive vocabulary knowledge. Language Testing, 23(1), 73–98. doi:10.1191/0265532206lt321oa.
Moder, K. (2010). Alternatives to F-test in one way ANOVA in case of hetero-
geneity of variances (a simulation study). Psychological Test and Assessment
Modeling, 52(4), 343–353.
Nagy, W. E., Anderson, R., Schommer, M., Scott, J. A., & Stallman, A. (1989).
Morphological families in the internal lexicon. Reading Research Quarterly,
24(3), 263–282. doi:10.2307/747770.
Nassaji, H. (2014). The role and importance of lower-level processes in second
language reading. Language Teaching, 47(1), 1–37.
Nassaji, H., & Geva, E. (1999). The contribution of phonological and ortho-
graphic processing skills to adult ESL reading: Evidence from native speakers
of Farsi. Applied Psycholinguistics, 20(2), 241–267.
Nation, I. S. P. (2006). How large a vocabulary is needed for reading and lis-
tening? The Canadian Modern Language Review/La Revue Canadienne des
Langues Vivantes, 63(1), 59–82.
Nation, I. S. P. (2012). The vocabulary size test: Information and specifications.
Retrieved from http://www.victoria.ac.nz/lals/about/staff/publications/paul-nation/Vocabulary-Size-Test-information-and-specifications.pdf
Nation, I. S. P. (2013). Learning vocabulary in another language (2nd ed.).
Cambridge: Cambridge University Press.
Nation, P., & Coxhead, A. (2014). Vocabulary size research at Victoria University
of Wellington, New Zealand. Language Teaching, 47(3), 398–403.
Norris, D. (2013). Models of visual word recognition. Trends in Cognitive
Sciences, 17(10), 517–524. doi:10.1016/j.tics.2013.08.003.
Pachella, R. G. (1974). The interpretation of reaction time in information pro-
cessing research. In B. H. Kantowitz (Ed.), Human information processing:
Tutorials in performance and cognition (pp. 41–82). Hillsdale: Lawrence
Erlbaum Associates, Inc.
Paradis, M. (2004). A neurolinguistic theory of bilingualism. Amsterdam: John
Benjamins.
Paradis, M. (2009). Declarative and procedural determinants of second languages
(Vol. 40). Amsterdam: John Benjamins Publishing.
Paradis, J. (2010). Bilingual children’s acquisition of English verb morphology:
Effects of language exposure, structure complexity, and task type. Language
Learning, 60(3), 651–680.
Pellicer-Sánchez, A., & Schmitt, N. (2012). Scoring yes-no vocabulary tests:
Reaction time vs. nonword approaches. Language Testing, 29(4), 489–509.
doi:10.1177/0265532212438053.
Perea, M., Rosa, E., & Gómez, C. (2002). Is the go/no-go lexical decision task
an alternative to the yes/no lexical decision task? Memory & Cognition, 30(1),
34–45.
Perfetti, C. A. (1985). Reading ability. New York: Oxford University Press.
Perfetti, C. A. (2007). Reading ability: Lexical ability to comprehension. Scientific
Studies of Reading, 11(4), 357–383. doi:10.1080/10888430701530730.
Perfetti, C. A., & Hart, L. (2001). The lexical basis of comprehension skill. In
D. S. Gorfien (Ed.), On the consequences of meaning selection: Perspectives on
resolving lexical ambiguity (pp. 67–86). Washington, DC: American
Psychological Association.
Perfetti, C. A., & Hart, L. (2002). The lexical quality hypothesis. In L. Verhoeven
(Ed.), Precursors of functional literacy (pp. 189–213). Philadelphia: Benjamins.
Perfetti, C. A., & Stafura, J. (2014). Word knowledge in a theory of reading
comprehension. Scientific Studies of Reading, 18(1), 22–37. doi:10.1080/10888438.2013.827687.
Plonsky, L., & Derrick, D. J. (2016). A meta-analysis of reliability coefficients
in second language research. The Modern Language Journal, 100, 538–553.
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes
in L2 research. Language Learning, 64, 878–912. doi:10.1111/lang.12079.
Qian, D. D. (1999). Assessing the roles of depth and breadth of vocabulary
knowledge in reading comprehension. The Canadian Modern Language
Review, 56(2), 282–307.
Ratcliff, R., Gomez, P., & McKoon, G. (2004). A diffusion model account of
the lexical decision task. Psychological Review, 111(1), 159–182.
Raymond, W. D., & Brown, E. L. (2012). Are effects of word frequency effects
of contexts of use? In S. T. Gries & D. Divjak (Eds.), Frequency effects in lan-
guage learning and processing (pp. 35–52). Berlin: De Gruyter Mouton.
Read, J. (2000). Assessing vocabulary. Cambridge: Cambridge University Press.
Read, J. (2004a). Plumbing the depths: How should the construct of vocabulary
knowledge be defined? In P. Bogaards & B. Laufer (Eds.), Vocabulary in a
second language: Selection, acquisition, and testing (pp. 209–227). Amsterdam:
John Benjamins.
Read, J. (2004b). Research in teaching vocabulary. Annual Review of Applied
Linguistics, 24, 146–161.
Read, J. (2016). Post-admission language assessment in universities: International
perspectives. Switzerland: Springer International Publishing.
Read, J., & Chapelle, C. A. (2001). A framework for second language vocabu-
lary assessment. Language Testing, 18(1), 1–32.
Read, J., & Nation, P. (2009). Introduction: Meara’s contribution to research in
lexical processing. In T. Fitzpatrick & A. Barfield (Eds.), Lexical processing in
second language learners (pp. 1–12). Bristol: Multilingual Matters.
Read, J., & Shiotsu, T. (2010). Extending the yes/no test as a measure of the English
vocabulary knowledge of Japanese learners. Paper presented at the colloquium
on the measurement of L2 vocabulary development at the 2010 Annual
Conference of the Applied Linguistics Association of Australia, Brisbane.
Richards, J. C. (1976). The role of vocabulary teaching. TESOL Quarterly, 10,
77–89.
Richards, B. (1987). Type/token ratios: What do they really tell us? Journal of
Child Language, 14(2), 201–209. doi:10.1017/S0305000900012885.
Richland, L. E., Kornell, N., & Kao, L. S. (2009). The pretesting effect: Do
unsuccessful retrieval attempts enhance learning? Journal of Experimental
Psychology: Applied, 15(3), 243–257. doi:10.1037/a0016496.
Roche, T., & Harrington, M. (2013). Recognition vocabulary knowledge as a
predictor of academic performance in an English as a foreign language set-
ting. Language Testing in Asia, 3(1), 1–13. doi:10.1186/2229-0443-3-12.
Roche, T., & Harrington, M. (2017). Offshore and onsite placement testing for
English pathway programmes. Journal of Further and Higher Education. doi:
10.1080/0309877X.2017.1301403. Published online May 9, 2017.
Roediger, H. L., III, & Karpicke, J. D. (2006). Test-enhanced learning: Taking
memory tests improves long-term retention. Psychological Science, 17(3),
249–255.
Sawaki, Y. (2007). Construct validation of analytic rating scales in a speaking
assessment: Reporting a score profile and a composite. Language Testing,
24(3), 355–390. doi:10.1177/0265532207077205.
Schmitt, N. (2010). Researching vocabulary: A vocabulary research manual.
Basingstoke: Palgrave Macmillan.
Schmitt, N. (2014). Size and depth of vocabulary knowledge: What the research
shows. Language Learning, 64(4), 913–951.
Schmitt, N., & Schmitt, D. (2014). A reassessment of frequency and vocabulary
size in L2 vocabulary teaching. Language Teaching, 47(4), 484–503.
doi:10.1017/S0261444812000018.
Schmitt, N., & Zimmerman, C. B. (2002). Derivative word forms: What do
learners know? TESOL Quarterly, 36(2), 145–171. doi:10.2307/3588328.
Schmitt, N., Schmitt, D., & Clapham, C. (2001). Developing and exploring
the behaviour of two new versions of the vocabulary levels test. Language
Testing, 18(1), 55–89. doi:10.1191/026553201668475857.
Schmitt, N., Jiang, X., & Grabe, W. (2011). The percentage of words known in
a text and reading comprehension. The Modern Language Journal, 95(1),
26–43. doi:10.1111/j.1540-4781.2011.01146.x.
Schnipke, D. L., & Scrams, D. J. (2002). Exploring issues of examinee behavior:
Insights gained from response-time analyses. In C. N. Mills, M. Potenza, J. J.
Fremer, & W. Ward (Eds.), Computer-based testing: Building the foundation of
future assessments (pp. 237–266). Hillsdale: Lawrence Erlbaum Associates.
Segalowitz, N. (2005). Automaticity and second languages. In C. Doughty &
M. Long (Eds.), The handbook of second language acquisition (pp. 382–408).
Oxford: Blackwell.
Segalowitz, N. (2007). Access fluidity, attention control, and the acquisition of
fluency in a second language. TESOL Quarterly, 41(1), 181–186.
Index

E
emergent property, 76
English for Academic Purposes, 228
English-medium academic study, 115
entry requirement, 194
entry standards, 158
error rates, 105

F
false-alarm rate, 102
false alarms, 55, 98
familiarity, 57
fluency, 46
formulaic speech, 5
frequency band, 34
frequency distribution, 11
frequency-of-occurrence bands, 75
frequency statistics, 12
frequentist, 15

G
Go/No-Go response format, 273
grade-point-averages (GPAs), 228
grain size, 73

H
higher-order cognitive processes, 49
high stakes, 113
hits, 100

I
intentional retrieval activities, 278
International English Language Testing System (IELTS), 26, 189
interactionalist, 79

L
LanguageMAP, 160
lemmas, 9
lexical availability, 70
lexical decision task (LDT), 46
lexical expertise, 70
lexical facility, 26
lexical fluency, 70
lexicality, 59
lexical quality hypothesis, 71
liberal response condition, 102
L1 script effects, 269
long-term memory, 48
low stakes, 113
low-stakes testing, 270

M
measurement-based variance, 55
mental lexicon, 13
multiword units, 5

N
neighbours, 52
nonparametric, 36
nonwords, 52

O
orthographic processing, 49
outlier values, 106

P
phonological skills, 49
pivot, 73
placement decisions, 37
placement testing, 5, 276

R
rapid serial visual presentation, 273
readiness testing, 276
recognition vocabulary, 4
reliability, 68
response bias, 102
response style, 100

S
scoring formulas, 100
self-report, 266
single word, 74
single-word presentation, 39
situation model, 48
sound-spelling correspondences, 58
speed-accuracy trade-offs, 83, 106, 107
spoken version, 161
standardized tests, 37
strategic processing, 29

T
testing effect, 277
text coverage, 16
text integration processes, 48

U
university admission, 158

V
validity, 68
variability, 77
verbal efficiency, 70
vocabulary fluency, 70
Vocabulary Levels Test (VLT), 13
Vocabulary Size Test (VST), 25, 28

W
word families, 9–10
word frequency, 46
word identification, 47
word recognition speed, 39, 85
working memory, 72
written version, 169

Y
Yes/No Test, 13

Z
Zipf's law, 11