Lexical Facility
Size, Recognition Speed and Consistency as Dimensions of Second Language Vocabulary Knowledge

Michael Harrington
University of Queensland
Brisbane, QLD, Australia

ISBN 978-1-137-37261-1    ISBN 978-1-137-37262-8 (eBook)


DOI 10.1057/978-1-137-37262-8

Library of Congress Control Number: 2017946891

© The Editor(s) (if applicable) and The Author(s) 2018


The author(s) has/have asserted their right(s) to be identified as the author(s) of this work in accordance
with the Copyright, Designs and Patents Act 1988.
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and trans-
mission or information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

Cover design by Henry Petrides

Printed on acid-free paper

This Palgrave Macmillan imprint is published by Springer Nature


The registered company is Macmillan Publishers Ltd.
The registered company address is: The Campus, 4 Crinan Street, London, N1 9XW, United Kingdom
This book is dedicated in loving memory to my parents, Frank and Dolores.
Front Cover

The image on the front cover is a stylized representation of what is known
as ‘Zipf’s law’, which states that the frequency with which a word is used
is inversely proportional to its rank in a frequency table. The vertical
y-axis represents the frequency with which a word is used, and its rank
order is set out along the horizontal x-axis. The sloping function shows
that a small number of words account for the majority of uses. The
approach set out in this book assumes that frequency rank is a strong
predictor of vocabulary learning.

Acknowledgments

I would first like to thank my wife Jan and daughter Bridget for their
forbearance. I am also greatly indebted to John Read for his advice and sup-
port throughout this project. He, of course, is not responsible for the final
outcome. Special thanks to collaborators Thomas Roche, Michael Carey, and
Akira Mochida, and colleagues Noriko Iwashita, Paul Moore, Wendy Jiang,
Mike Levy, Yukie Horiba, Yuutaka Yamauchi, Shuuhei Kadota, Ken Hashimoto,
Fred Anderson, Mark Sawyer, Kazuo Misono, John Ingram and Jenifer Larson-
Hall. Thanks also to Said Al-Amrani, Lara Weinglass, and Mike Powers.
Vikram Goyal programmed the LanguageMAP online testing program
used to collect the data reported and has served as its long-standing
system administrator. He has been especially valuable to the proj-
ect. Special thanks also to Chris Evason, Director of the University of
Queensland’s (UQ) Foundation-Year program, who has provided encour-
agement and financial support for testing and program development.
Funding support is also acknowledged from Andrew Everett and the UQ
International Education Directorate.
The research reported here has been supported by the Telstra Broadband
Fund and a UQ Uniquest Pathfinder grant for the development of the
LanguageMAP program. Support was also provided by research contracts
from Milton College (Chap. 9) and International Education Services–
UQ Foundation-Year (Chaps. 8, 9, and 10), and a grant, with Thomas
Roche, from the Omani Ministry of Research.
Contents

Part 1 Introduction   1


References 2

1 Size as a Dimension of L2 Vocabulary Skill  3


1.1 Introduction  3
1.2 Estimating Vocabulary Size  5
1.3 Vocabulary Size as a Dimension of Learners’
Vocabulary Knowledge 13
1.4 Conclusions 21
References 22

2 Measuring Recognition Vocabulary Size 25


2.1 Introduction 25
2.2 Approaches to Measuring Recognition
Vocabulary Size 26
2.3 Uses of the Vocabulary Size Tests 34
2.4 Conclusions 39
References 40


3 L2 Word Recognition Skill and Its Measurement 45


3.1 Introduction 45
3.2 Word Recognition Skill and Text Comprehension 46
3.3 Word Recognition Skill and L2 Text Comprehension 49
3.4 The LDT as a Measure of Word Recognition Skill 51
3.5 LDT Performance as a Window on Word Knowledge 57
3.6 Using the LDT Format to Measure L2 Word
Recognition Skill 60
3.7 Conclusions 61
References 61

4 Lexical Facility: Bringing Size and Speed Together 67


4.1 Introduction 67
4.2 Defining Lexical Facility 68
4.3 Lexical Facility as a Vocabulary Skill Construct 73
4.4 Lexical Facility as a Measurement Construct 76
4.5 Bringing Size and Speed Together 79
4.6 Recognition Vocabulary Size and Speed as a Vocabulary
Measure 83
4.7 Establishing Lexical Facility: The Research Program 86
4.8 Conclusions 88
References 89

5 Measuring Lexical Facility: The Timed Yes/No Test 95


5.1 Introduction 95
5.2 The Timed Yes/No Test 96
5.3 Scoring the Timed Yes/No Test 99
5.4 Administering the Test 109
5.5 The Timed Yes/No Test as an L2 Vocabulary Task 112
5.6 Lexical Facility in English 115
5.7 Conclusions 116
References 117

Part 2 Introduction 121


1.1 Overview 121
1.2 Aims of the Empirical Research 122
1.3 An Overview of Methods Used 122
References 129

6 Lexical Facility as an Index of L2 Proficiency 131


6.1 Introduction 131
6.2 Study 1: Lexical Facility as an Index
of English Proficiency 132
6.3 Study 1 Results 136
6.4 Sensitivity of the Lexical Facility Measures
to Frequency Levels 146
6.5 Discriminating Between Frequency Levels 148
6.6 Findings for Study 1 151
6.7 Conclusions 152
References 153

7 Lexical Facility and Academic English Proficiency 157


7.1 Introduction 157
7.2 Study 2: Lexical Facility and University English
Entry Standards 158
7.3 Study 2 Results 161
7.4 Study 2 Findings 180
7.5 Conclusions 183
References 185

8 Lexical Facility and IELTS Performance 187


8.1 Introduction 187
8.2 Study 3: Lexical Facility and IELTS Performance 188
8.3 Study 3 Results 190

8.4 Findings for Study 3 IELTS Band-Scores 201


8.5 Conclusions 201
References 203

9 Lexical Facility and Language Program Placement 205


9.1 Introduction 205
9.2 Study 4: Sydney Language School Placement Study 207
9.3 Study 4 Results 209
9.4 Findings for Study 4 Sydney Language Program
Placement 216
9.5 Study 5: Singapore Language Program Study 217
9.6 Study 5 Results 218
9.7 Findings for Study 5 Singapore Language
Program Levels 223
9.8 Conclusions 224
References 225

10 Lexical Facility and Academic Performance in English 227


10.1 Introduction 227
10.2 Study 6: Lexical Facility Measures and Academic
English Grades 228
10.3 Study 6 Results 230
10.4 Findings for Study 6 Lexical Facility and Academic
English Grades 235
10.5 Study 7: Lexical Facility and GPA 235
10.6 Findings for Study 7 Lexical Facility and GPA 236
10.7 Other GPA Studies 236
10.8 Conclusions 239
References 240

11 The Effect of Lexical Facility 241


11.1 Introduction 241
11.2 Sensitivity of Lexical Facility Measures by
Performance Domain 242
11.3 Key Findings 252
11.4 Conclusions 257
References 258

12 The Future of Lexical Facility 261


12.1 Introduction 261
12.2 The Case for Lexical Facility 262
12.3 Measuring Lexical Facility: The Timed Yes/No Test
and Alternatives 266
12.4 The Next Step in Lexical Facility Research 274
12.5 Uses of Lexical Facility in Vocabulary Assessment
and Instruction 276
12.6 Conclusions 278
References 279

References 283

Index 303
List of Figures

Fig. 1.1 Elements of vocabulary knowledge tapped by vocabulary
size tests 7
Fig. 1.2 A frequentist model of vocabulary learning 15
Fig. 1.3 Cumulative percentage of text coverage and corresponding
frequency bands 18
Fig. 1.4 Text coverage as the number of unfamiliar words and the
number of lines of text per unfamiliar word 18
Fig. 1.5 A sample reading text with 80% text coverage 19
Fig. 1.6 Text comprehension percentage as a function of vocabulary
coverage (Schmitt et al. 2011, p. 34) 20
Fig. 2.1 Instructions and example item for Vocabulary Levels Test
(Adapted from Nation 2013, p. 543) 27
Fig. 2.2 Sample item from Nation’s Vocabulary Size Test 28
Fig. 2.3 A simple checklist version of the original Yes/No Test 31
Fig. 2.4 Comparison of VLT and Yes/No Test Performance
(Mochida and Harrington 2006) 33
Fig. 3.1 Word recognition in the construction–integration model
of text comprehension (figure adapted from Perfetti and
Stafura (2014, p. 33)) 47
Fig. 3.2 A schematic diagram of the lexical decision task 54
Fig. 5.1 Yes/No Test response types 99
Fig. 5.2 Four Yes/No Test scoring formulas 101
Fig. 5.3 Composite measure formulas 108


Fig. 5.4 Elements of the instruction set for the Timed Yes/No Test 111
Fig. 6.1 Lexical facility measures by English proficiency levels 140
Fig. 6.2 Median proportion of hits and 95% confidence intervals
for lexical facility measures by frequency levels and groups 149
Fig. 6.3 Median individual mnRT and 95% confidence intervals
for lexical facility measures by frequency levels and groups 150
Fig. 6.4 Median coefficient of variation (CV) and 95% confidence
intervals for lexical facility measures by frequency levels
and groups 150
Fig. 7.1 University entry standard study. Mean proportion of hits
by frequency levels for written and spoken test results 179
Fig. 7.2 University entry standard study. Mean response times by
frequency levels for written and spoken test results 180
Fig. 7.3 University entry standard study. Mean CV ratio by
frequency levels for written and spoken test results 181
Fig. 8.1 Combined IELTS dataset: Timed Yes/No Test scores by
IELTS overall band scores 194
Fig. 9.1 Sydney language program study. Comparison of VKsize
and mnRT scores with program placement grammar and
listening scores across four placement levels 213
Fig. 9.2 Singapore language program levels. Standardized scores
for the lexical facility measures (VKsize, mnRT, and CV)
for the VLT and BNC test versions 219
Fig. 9.3 Singapore language program study. Standardized scores
for the lexical facility measures (VKsize, mnRT, and CV)
for the combined test by level 221
Fig. 10.1 Oman university GPA study. Standardized VKsize,
mnRT, and CV scores by faculty 238
List of Tables

Table 1.1 Vocabulary size expressed in word families
and text coverage (written and spoken) across nine
corpora (Nation 2006, p. 79) 17
Table 3.1 A meta-analysis of factors affecting L2 reading skill
(Jeon and Yamashita 2014) 50
Table 6.1 Bivariate correlations and 95% confidence intervals
(within square brackets) for the three lexical facility
measures (VKsize, mnRT, and CV) and two composite
scores (VKsize_mnRT and VKsize_mnRT_CV) 138
Table 6.2 Proficiency-level study. Means, standard deviations,
and confidence intervals for false-alarm rates and
the lexical facility measures, individual and
composite, for the three proficiency levels 139
Table 6.3 Proficiency-level study. One-way ANOVAs for
individual and composite lexical facility measures
as discriminators of English proficiency levels 143
Table 6.4 Proficiency-level study. Post hoc comparisons for
individual and composite measures, VKsize,
mnRT, CV VKsize_mnRT, and VKsize_mnRT_CV 143
Table 6.5 Proficiency-level study. Medians, interquartile ranges,
and 95% confidence intervals for the hits, mean response
time (in milliseconds), and mean proportion coefficient
of variation by frequency levels and groups 147


Table 6.6 Frequency-level analysis. Comparing sensitivity of hits
(correct responses to words), mean RT, and CV to
frequency band differences using the omnibus Friedman
test and the follow-up Wilcoxon signed-rank test 149
Table 7.1 University entry standard study: written and spoken test
results. Pearson’s correlations for the three individual
measures (VKsize score, mnRT, and CV) and the two
composite scores (VKsize_mnRT and VKsize_mnRT_CV) 163
Table 7.2 University entry standard study: written and spoken test
results. Means (M), standard deviations (SD), and
confidence intervals (95% CI) for the lexical facility
measures for the five English proficiency standard groups 165
Table 7.3 University entry standard study: written and spoken test
results. Means (M), standard deviations (SD) and
confidence intervals (CI) for the composite scores
VKsize_mnRT and VKsize_mnRT_CV for
the five English entry standard groups 166
Table 7.4 Entry standard study. One-way ANOVA for individual
and composite lexical facility measures as discriminators
of English proficiency groups 172
Table 7.5 University entry standard group. Significant pairwise
comparisons for the VKsize measure for written and
spoken test results 173
Table 7.6 University entry standard study. Significant pairwise
comparisons for the mnRT and CV measures for written
and spoken test results 174
Table 7.7 University entry standard study. Significant pairwise
comparisons for composite VKsize_mnRT and VKsize_
mnRT_CV measures for written and spoken test results 175
Table 8.1 IELTS study data set. Years 1–3 means and standard
deviations, within brackets, for the VKsize, mnRT, and
CV measures by IELTS overall band score 191
Table 8.2 IELTS band-score study. Means, standard deviations, and
confidence intervals (CI) for the lexical facility measures,
individual and composite, for IELTS overall band scores 192
Table 8.3 IELTS band-score study. Bivariate
correlations with bootstrapped confidence intervals for
IELTS band scores and lexical facility measures 193

Table 8.4 IELTS band-score study. One-way ANOVAs for individual
and composite lexical facility measures as discriminators of
IELTS overall band scores 196
Table 8.5 IELTS study. Bandwise significant post hoc comparisons
for VKsize, mnRT, and CV 197
Table 8.6 IELTS band-score study. IELTS bandwise post hoc
comparisons for the VKsize_mnRT and VKsize_mnRT_
CV measures 198
Table 8.7 IELTS band-score study. Model summary (R2 and ΔR2)
for hierarchical regression analysis with proficiency level as
criterion and VKsize, mnRT, and CV as predictor variables
on written and spoken tests with complete and false-alarm-
trimmed (20 and 10%) data sets 200
Table 9.1 Sydney language program study. Bivariate Pearson’s
correlations for lexical facility measures, and listening and
grammar test scores 210
Table 9.2 Sydney language program study. Means, standard
deviations, and 95% confidence intervals for the lexical
facility measures at the four placement levels 211
Table 9.3 Sydney language program study. One-way ANOVAs for
individual and composite lexical facility measures and
placement test scores as discriminators of placement levels 214
Table 9.4 Sydney language program study. Significant post hoc
pairwise comparisons of the lexical facility measures and
listening test 215
Table 9.5 Singapore language program study. Means, standard
deviations, and confidence intervals for the lexical facility
measures for the four Singapore language program levels 220
Table 9.6 Singapore language program study. One-way ANOVAs
for individual and composite lexical facility measures as
discriminators of program levels 222
Table 9.7 Singapore language program study. Significant post hoc
comparisons for the lexical facility measures for the four
placement levels 223
Table 10.1 Australian university foundation-year study. Means,
standard deviations, and confidence intervals (CI) for
the individual and composite lexical facility measures, and
median and range values for academic grades and GPAs
for entry and exit groups 231

Table 10.2 Bivariate correlations between lexical facility measures and
academic English performance measures for entry and exit
groups 232
Table 10.3 Australian university foundation-year study. Model
summary of hierarchical regression analyses for entry and
exit groups using EAP grade percentage as criterion and
VKsize, mnRT, and CV as ordered predictor variables 234
Table 11.1 Summary of means (M) and standard deviations (SD) for
VKsize, hits, mnRT, and CV measures for Studies 1–5 242
Table 11.2 Summary of lexical facility measures’ effect sizes for
individual and composite measures 244
Introduction

Two bedrocks of fluent second language (L2) performance are an adequate
stock of words and the ability to access those words quickly.
Separately, the two have been shown to be reliable and sensitive correlates
of L2 proficiency both across and within user levels. The two are exam-
ined here jointly as a property of L2 vocabulary skill called lexical facility.
The book first makes the conceptual case for combining the two dimen-
sions and then provides empirical evidence for the sensitivity of the com-
bined measures to differences in proficiency and performance in common
domains of academic English. The main focus is on lexical facility in
written English, though some spoken language data are also presented.

Scope of the Book


The term lexical facility reflects how many words a learner knows and how
fast these words can be recognized. The term lexical is used to denote the
word-level focus, and the term facility the relative ease of accessing that
knowledge. A sizeable literature exists that relates vocabulary size to L2
performance. Researchers, including Batia Laufer, Paul Meara, Paul
Nation, and Norbert Schmitt, have sought to identify the kind and num-
ber of words an individual needs to function in various L2 domains, with


a particular interest in the vocabulary size needed for fluent performance
and its assessment in domains of academic English. A foundation of
vocabulary size research is the use of word frequency statistics as an index
for estimating an individual user’s vocabulary size. The resulting estimates
are then related to performance in various domains (e.g., Laufer and
Nation 1995). The vocabulary size research literature is the point of
departure for the lexical facility approach presented in the book.
A smaller body of research has also examined how L2 word processing
skill develops. Norman Segalowitz, Jan Hulstijn, and colleagues have
investigated the role that word recognition speed and consistency play in
fluent L2 performance, and in particular the development of automaticity.
Word recognition speed is expressed throughout this book as the mean
recognition time (mnRT) it takes an individual to recognize a set of words
presented separately. Faster recognition times have been shown to reliably
correlate with better performance both within and between users. In addi-
tion to the relative speed with which words are recognized, the overall
consistency of recognition speed is also of interest. Word recognition con-
sistency is captured in the coefficient of variation (CV), which is the ratio
of the standard deviation of the mnRT to the mnRT itself (SDmnRT/
mnRT). Segalowitz has proposed that the interaction of the mnRT and
the CV over the course of proficiency development can serve as an indica-
tor of automatization (Segalowitz and Segalowitz 1993). In the lexical
facility account, the CV is examined as an index of proficiency by itself
and in combination with the size and mnRT measures. As a measure of
response variability, the CV is examined as a window on vocabulary skill
development, as opposed to mere ‘noise’ that might otherwise obscure
experimental effects of interest. The interest in variability as a characteris-
tic of performance in its own right is attracting increasing attention in
cognitive science (Balota and Yap 2011; Hird and Kirsner 2010).
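To make the two processing-skill measures concrete, the following is a minimal Python sketch of how mnRT and the CV just defined could be computed from a set of per-item response times; the response times themselves are invented for illustration.

```python
import statistics

def mnrt_and_cv(rts_ms):
    """Return the mean recognition time (mnRT) and the coefficient of
    variation, CV = SD(RT) / mnRT, for one test-taker's response times."""
    mnrt = statistics.mean(rts_ms)
    cv = statistics.stdev(rts_ms) / mnrt
    return mnrt, cv

# Invented response times (ms) for correct responses to word items.
rts = [612, 540, 705, 660, 587, 630, 598, 644]
mnrt, cv = mnrt_and_cv(rts)
print(f"mnRT = {mnrt:.0f} ms, CV = {cv:.3f}")
```

Because the CV divides out the mean, it lets response-time variability be compared across test-takers who differ in overall speed.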
The two research areas differ in goals and method, but are in accord that
quantitative measures of vocabulary size and processing skill are important
indicators of L2 proficiency. Proficient learners have bigger vocabularies
and can access that knowledge more efficiently than their less proficient
counterparts. The book explores how the empirically established—and
intuitive—relationship between proficiency, and vocabulary size and pro-
cessing skill is manifested in various domains of academic English.
Introduction
   xxv

The book is the first to investigate the value of treating vocabulary
size and processing skill (recognition speed and consistency) as a uni-
tary construct. The main empirical concern is the extent to which com-
bined measures of vocabulary size and processing skill are more sensitive
to performance differences than size alone. Sensitivity is reflected in
how reliably (as reflected in statistical significance) the measures dis-
criminate between levels in a given domain, and the magnitude of this
difference as reflected in the effect size. Evidence for the efficacy of a
composite measure combining static knowledge (size) and dynamic
processing skill (speed and consistency)—that is, for lexical facility—
has clear implications for L2 vocabulary research, testing, and assessment.
Lexical facility is a quantitative entity that captures a crucial facet of
lower-level L2 vocabulary knowledge skill. It is approached as a trait, that
is, as a user-internal, context-free property of L2 vocabulary knowledge
that is developed as a result of experience with the language and is avail-
able for use across contexts (Read and Chapelle 2001).
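The composite formulas actually used in the studies are presented in Chap. 5 (Fig. 5.3). As a purely hypothetical illustration of the general idea, the sketch below combines a knowledge measure (VKsize) with the two processing measures (mnRT and CV) by standardizing each across test-takers and averaging, with the signs of mnRT and CV flipped so that faster and more consistent performance scores higher; all values are invented.

```python
import statistics

def zscores(values):
    m, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / sd for v in values]

def composite(vksize, mnrt, cv):
    """Average of standardized scores; mnRT and CV are sign-flipped
    because lower (faster, more consistent) values indicate better skill."""
    z_size = zscores(vksize)
    z_speed = [-z for z in zscores(mnrt)]
    z_cons = [-z for z in zscores(cv)]
    return [(a + b + c) / 3 for a, b, c in zip(z_size, z_speed, z_cons)]

# Invented scores for five test-takers.
vksize = [7200, 5400, 8100, 4600, 6300]  # estimated word families known
mnrt = [655, 840, 590, 920, 730]         # mean response time (ms)
cv = [0.28, 0.41, 0.24, 0.47, 0.33]      # coefficient of variation
print([round(s, 2) for s in composite(vksize, mnrt, cv)])
```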

Research Goals
This book has three goals. The first is to make the theoretical case for lexi-
cal facility. The validity of the construct is established in the first four
chapters by first examining the crucial roles that vocabulary size (Chaps. 1
and 2) and word recognition skill (Chap. 3) play in L2 performance.
The rationale for characterizing size and processing skill jointly as an L2
vocabulary construct, that is, for lexical facility, is then set out in Chap. 4.
This chapter discusses key theoretical and methodological issues that arise
from the proposal. Primary among these is the attempt to treat size and
speed as parts of a unitary construct. Standard practice in the psychomet-
ric tradition has long been to treat the two as separate dimensions.
Human performance has been characterized either as knowledge (also
called power) or speed, the relative importance of each dependent on the
kind of performance being measured. Knowledge is seen as the critical
attribute of higher-level cognitive tasks such as educational testing, while
speed is paramount for mechanical tasks such as typing. The lexical
facility account proposes that size (knowledge) and processing skill (speed
and consistency) can be productively considered together as indices of L2
vocabulary proficiency. As a result, the proposal has implications for the
broader incorporation of temporal measures in models of L2 learning
and use.
The second and third goals concern the empirical case for the con-
struct. The second goal is to assess the reliability and validity of an instru-
ment to measure lexical facility, the Timed Yes/No Test. In Part 2, seven
studies are presented that examine the sensitivity of the vocabulary size
and processing skill measures (speed and consistency), individually and in
combination, to variability in proficiency and performance in various
academic English domains. All seven studies measure lexical facility using
the Timed Yes/No Test. The instrument is an online measure of recogni-
tion vocabulary knowledge based on the lexical decision task, a measure
of lexical access widely used in cognitive psychology. Chapter 5 describes
the Timed Yes/No Test and provides a rationale for its use. The use of
speed and consistency as measures of proficiency raises methodological
and technical issues. These are identified, and the implications for bring-
ing time as a performance measure out of the laboratory and into class-
room and testing contexts are discussed.
The third goal is to demonstrate the sensitivity of the lexical facility
measures to proficiency and performance differences in academic English.
Chapter 6 establishes the sensitivity of the size, speed, and consistency
measures to differences in proficiency levels in university-age users.
The chapter also demonstrates the validity of word frequency statistics to
index individual vocabulary knowledge. In Chap. 7, the sensitivity of the
measures to group differences in English entry standards used in an
Australian university is examined. Written and spoken versions of the
test are administered to evaluate differences in test performance due to
language mode. Chapter 8 investigates the measures as predictors of per-
formance by preuniversity students on one specific English entry stan-
dard, the International English Language Testing System (IELTS) test.
Performance on the lexical facility measures is compared with placement
Introduction
   xxvii

testing outcomes in language schools in Sydney and Singapore in Chap. 9.


The last chapter, Chap. 10, investigates the measures as predictors of aca-
demic English grades and grade point average (GPA) in a university prep-
aration program in Australia. Also discussed are findings from other
studies that have addressed the same issues. Chapter 11 presents a sum-
mary of the findings from all the studies. The data reported in the various
studies are drawn from published and unpublished research by the author
and colleagues. Chapter 12 completes the book by considering the future
of the lexical facility proposal in light of the findings.
In summary, this book attempts to establish lexical facility as a quanti-
tative measure of L2 vocabulary proficiency that can serve as a context-
independent index sensitive to learner performance in specific academic
English settings. The studies in Part 2 aim to

1. compare the three measures of lexical facility (vocabulary knowledge,
mean recognition time, and recognition time consistency) as stable indices
of L2 vocabulary skill;
2. evaluate the sensitivity of the three measures individually and as compos-
ites to differences in a range of academic English domains; and, in doing
so,
3. establish the degree to which the composite measures combining size with
processing skill (recognition speed and consistency) provide a more sensitive
indicator of L2 proficiency and performance differences than vocabulary
size alone.

The book is in two parts. Part 1 presents the theoretical foundation
and motivation for the lexical facility proposal. Part 2 reports on a set of
studies that provide empirical evidence for lexical facility and concludes
with a chapter that considers the place of lexical facility in the modeling
and measurement of L2 vocabulary.
xxviii Introduction

References
Balota, D. A., & Yap, M. J. (2011). Moving beyond the mean in studies of
mental chronometry: The power of response time distributional analyses.
Current Directions in Psychological Science, 20(3), 160–166.
Hird, K., & Kirsner, K. (2010). Objective measurement of fluency in natural
language production: A dynamic systems approach. Journal of Neurolinguistics,
23(5), 518–530. doi:10.1016/j.jneuroling.2010.03.001.
Laufer, B., & Nation, P. (1995). Vocabulary size and use: Lexical richness in L2
written production. Applied Linguistics, 16(3), 307–322.
Read, J., & Chapelle, C. A. (2001). A framework for second language vocabu-
lary assessment. Language Testing, 18(1), 1–32.
Segalowitz, N., & Segalowitz, S. J. (1993). Skilled performance, practice and
differentiation of speed-up from automatization effects: Evidence from sec-
ond language word recognition. Applied Psycholinguistics, 14(3), 369–385.
doi:10.1017/S0142716400010845.
Part 1
Introduction

Part 1 (Chaps. 1, 2, 3, 4, and 5) introduces the theoretical and method-
ological foundations of the lexical facility account. Chapter 1 introduces
the vocabulary size research program, including the frequency-based tests
of vocabulary knowledge that are used to estimate second language (L2)
vocabulary size in the individual user, which in turn has been related to
differences in L2 performance. Chapter 2 then presents different types of
vocabulary size tests, including the Vocabulary Levels Test (Nation 2013)
and the Yes/No Test (Meara and Buxton 1987). Test assumptions and
uses in testing and instruction are described and key findings surveyed.
Research on the development of speed and consistency in L2 word recog-
nition skill is examined in Chap. 3. The aims and methods of this research
paradigm are then described, as are key research findings. These two inde-
pendent lines of research provide the foundation for the lexical facility
proposal introduced in Chap. 4, which sets out the rationale for combin-
ing the two dimensions and discusses the key issues related to this under-
taking. Chapter 5 describes the Timed Yes/No Test, which is used in the
studies in Part 2 that provide evidence for the lexical facility account.

References
Meara, P., & Buxton, B. (1987). An alternative to multiple choice vocabulary
tests. Language Testing, 4(2), 142–145.
Nation, I. S. P. (2013). Learning vocabulary in another language (2nd ed.).
Cambridge: Cambridge University Press.
1
Size as a Dimension
of L2 Vocabulary Skill

Aims

• Introduce the vocabulary size research literature.
• Describe how vocabulary size is counted.
• Describe the use of word frequency statistics to estimate vocabulary size.
• Relate vocabulary size measures to second language (L2)
performance.

1.1 Introduction
This chapter introduces the field of what will be called vocabulary size
research, an approach based on the simple assumption that the overall
number of words a user knows—the breadth of an individual’s vocabulary
stock—provides an index of vocabulary knowledge. The focus on vocab-
ulary breadth means that little attention is given to what specific words
are known or the extent (or depth) to which any given word is used.
Rather, researchers in the area are interested in estimating the vocabulary
size needed to perform particular tasks in a target language. These tasks


can range from reading authentic texts (Hazenberg and Hulstijn 1996) to
coping with unscripted spoken language (Nation 2006). Size estimates
are used to propose vocabulary thresholds for second language (L2)
instruction, and more generally to provide a quantitative picture of an
individual’s L2 vocabulary knowledge (Laufer 2001; Laufer and
Ravenhorst-Kalovski 2010). The focus here, and in the book in general,
is on the size of recognition vocabulary and the role it plays in L2 use.
The main focus is on the recognition of written language.
Recognition vocabulary is acquired before productive vocabulary and
serves as the foundation for the learning of more complex language struc-
tures. The store of recognition vocabulary knowledge builds up over the
course of an individual’s experience with the language. This knowledge
ranges from the most minimal, as in the case of knowing only that a word
exists, to an in-depth understanding of its meaning and uses. A sparkplug
may be a thingamajig found in a car or, according to Wikipedia, ‘a device
for delivering electric current from an ignition system to the combustion
chamber of a spark-ignition engine to ignite the compressed fuel/air mix-
ture by an electric spark, while containing combustion pressure within
the engine’. Recognition vocabulary knowledge emerges from both
intentional learning and implicit experience, and even the most casual
experience can contribute to the stock of recognition vocabulary knowl-
edge. Repeated exposure to a word also has a direct effect on how effi-
ciently it is recognized.
The notion that knowing more words allows a language user to do
more in the language hardly seems controversial. However, many appar-
ently commonsensical assumptions in language learning are often diffi-
cult to specify in useful detail or to apply in practice (Lightbown and
Spada 2013). Even when evidence lends support to the basic idea, spe-
cific findings introduce qualifications that often diminish the scope and
power of the original insight. This chapter introduces and surveys the
vocabulary size research literature to see how the ‘greater size = better per-
formance’ assumption manifests itself. The methodology used for esti-
mating vocabulary size is first described, and then findings from key
studies are presented.
Size is a quantitative property and therefore requires some unit of mea-
surement. In the vocabulary size approach, it is the single word. Size
estimates reflect vocabulary breadth and have been related to L2 perfor-
mance in two ways. Researchers have sought to establish the minimum
size thresholds needed to perform specific tasks, such as reading an aca-
demic text (Schmitt et al. 2011), or related size to performance outcomes
in specific settings, as in placement testing (Meara and Jones 1988).

1.2 Estimating Vocabulary Size


The measurement of recognition vocabulary is a far more complex task
than might first appear. The first difficulty involves defining what to
count as a word. Criteria must also be established for deciding how a
given word is recognized for counting. Finally, a practical means must be
devised for obtaining a sufficient sample of the individual’s language from
which to make a valid size estimate. All three factors present challenges
for the researcher.

What to Count

The vocabulary size approach quantifies vocabulary knowledge as a col-
lection of single words. Characterizing vocabulary knowledge as a collec-
tion of individual words accords with how vocabulary knowledge is
popularly viewed. Single words are the means by which children learn to
spell and are the basis for dictionaries, spelling bees, and crossword puz-
zles. They also have a privileged place in vocabulary learning and teach-
ing, where word lists are a staple feature of any language textbook. And,
of course, multiword units (collocations, formulaic speech) are ultimately
made up of single words. Learning these forms involves either associating
a combination of known words to a new meaning or learning a new unit
in which some or all of the words are unknown (Wray 2008). In either
case, the single word represents a basic building block.
Single words are different from other kinds of language knowledge in
how they are acquired and represented in the brain. The L2 learner learns
a word (sound–meaning pair) consciously and that is stored as part of the
declarative memory system, a system open to reflection and explicit
modification. But this knowledge is only part of the lexicon, which con-
sists of these words in combination with the mostly implicit grammatical
properties that constrain how the words are used. These properties reside
in procedural memory, a system of implicit, unconscious knowledge.
Paradis (2009) makes a distinction between vocabulary and the lexicon to
capture this difference. Vocabulary is the totality of sound–meaning asso-
ciations and is typical of L2 learner knowledge, particularly in the early
stages. The lexicon characterizes the system of explicit and implicit
knowledge that the first language (L1) user develops as a matter of course
in development, and which is developed to varying degrees in more
advanced L2 users. In Paradis’s terms, the lexical facility account relates
strictly to vocabulary knowledge, its measurement, and its relationship to
L2 proficiency and performance.
Last, the pivotal role the single word plays in online processing also
reflects its importance. The word serves as the intersecting node for a
range of sentence and discourse processes that unfold in the process of
reading (Andrews 2008). It is where the rubber meets the road, as it were,
in text comprehension.
The focus on the recognition of single words means that the vocabu-
lary size approach captures only a small part of L2 vocabulary knowledge,
a multidimensional notion comprising knowledge of form, meaning, and
usage. Each word is part of a complex web of relationships with other
words, and this complex network is used to realize the wide range of
expressive, communicative, and instrumental functions encountered in
everyday use. Figure 1.1 depicts the basic elements of word knowledge in
a three-part model adapted from Nation (2013); see also Richards (1976).
The vocabulary size account reduces vocabulary knowledge to the sin-
gle dimension of the number of individual words a user knows, or more
precisely, recognizes. It is about the user’s ability to relate a form to a basic
meaning, whether by identifying the meaning from among a set of alter-
natives, as in the Vocabulary Levels Test (VLT), or merely recognizing a
word when it is presented alone, as in the Yes/No Test. This passive ‘rec-
ognition knowledge’ is assumed to be an internal property—a trait—of
the L2 user’s vocabulary stock that can be measured independently of a
given context.

[Figure: a three-part model of word knowledge, arranged along a scale from very high certainty (top) to very low certainty (bottom) of being tapped by vocabulary size tests.

FORM: written form, including orthography and possible letter combinations; spoken form, including the pronunciation of individual sounds and connected speech; word parts, including part of speech and morphology.

MEANING: referents, such as chair, sky, car; concepts, such as truth, love, justice; conceptual associations and links, including metaphoric language such as life is a journey.

USE: collocations, the tendency of two or more words to occur together in discourse, both grammatical collocations (abide by, deal with) and semantic collocations (spend money, cheerful expression); associations, comprising links between words, including syntagmatic associates (abandon: hope, ship, me) and paradigmatic associates (abandon: neglect, give up, forsake); stylistic variations, based on setting and participants, including changes in use over time, geographical or regional variation, social class and social role variation, and emotional valence.]

Fig. 1.1 Elements of vocabulary knowledge tapped by vocabulary size tests

Single words are therefore of primary importance, primary being used here
both in the sense of being crucial to understanding and in representing
the first stage of the comprehension process. However, it is also the case
that single words are typically used in combination with other words.
These combinations can be fixed, as in the case of collocations, or they
can be governed by grammatical and discourse constraints. The meaning
of a given word very often depends on the context, and ‘knowing’ a word
ultimately comes down to whether it facilitates comprehension in a par-
ticular context in an appropriate and timely manner. The measurement
of size alone says nothing about the depth of word knowledge, though
the two are not unrelated. Ultimately, greater vocabulary size correlates
with greater depth of vocabulary knowledge (Vermeer 2001).
The central question in the vocabulary size approach is the degree to
which this single form–meaning dimension relates to individual differ-
ences in L2 performance. Evidence of a reliable relationship between size
and performance has implications for the way L2 vocabulary knowledge
is conceptualized and, in turn, for L2 vocabulary assessment. The next
section will consider the challenging problem of how to count single
words.

Quantifying Vocabulary Size

There are alternative ways to calculate vocabulary size, all with their
advantages and disadvantages. The number of words on this page could
simply be counted by tallying the number of white spaces before each
word. These are all words in the simplest sense. But this method would
yield a very insensitive measure of vocabulary knowledge, given that
many words are repeated. For example, the word ‘the’ appears seven times
in this paragraph. The same word can also appear in different forms.
Does the researcher count ‘word’ and ‘words’ as one or two words? As a
result, although estimating vocabulary size is a quantitative process, the
researcher must make qualitative distinctions as to if and how individual
word forms are counted. Several alternatives are available.

Type/Token Ratio A basic distinction can be made between the first
appearance of a word in a text and that same word being repeated, or
what is termed the type/token ratio (TTR). Types are the unique words in
a text, counted once at their first appearance; tokens are all the running
words, so every repetition of a type adds another token. The
phrase ‘the big cat in the big hat’ has five types (the, big, cat, in, hat) out
of seven total tokens. The TTR is a measure of lexical diversity originally
developed for measuring L1 vocabulary development. It is an index of
lexical diversity and not a measure of absolute size, but it is reasonable to
assume that users who produce a wider variety of words—that is, have a
higher TTR—will also have larger vocabularies. In practice, however, the
measure has been shown to be relatively insensitive to differences in pro-
ficiency levels (Richards 1987).
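As a minimal illustration of the TTR calculation described above, the following Python sketch reproduces the ‘big cat’ example.

```python
def ttr(text):
    """Return (types, tokens, type/token ratio) for a whitespace-split text."""
    tokens = text.lower().split()
    return len(set(tokens)), len(tokens), len(set(tokens)) / len(tokens)

n_types, n_tokens, ratio = ttr("the big cat in the big hat")
print(n_types, n_tokens, round(ratio, 2))  # 5 types, 7 tokens, TTR = 0.71
```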

Lemmas Estimating vocabulary size requires a way to identify the
number of distinct word meanings represented in all the tokens in a
text. One approach widely used in corpus linguistics is the lemma. A
lemma (or citation form) is a particular form of a word (or lexeme) that
is chosen by convention to represent the canonical form. These forms
are typically used in dictionaries as the headwords. Lemmas consist of
all regularly inflected forms sharing the same stem and belonging to
the same syntactic category. The lemma for the verb bank, for exam-
ple, includes the verb forms banks, banked, and banking. A separate
lemma is assumed for the noun bank (as an institution) and its plural,
banks. A further distinction is made between the lemmas representing
homonyms, such as in river bank and bank loan, though these can
pose a particular problem for computer-based corpus analyses
(Aitchison 2012).
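A minimal sketch of lemma counting under this definition appears below. The hand-built lookup table is hypothetical; real corpus analyses rely on part-of-speech-tagged lemmatizers, and the table illustrates why the noun and verb bank must be kept as separate lemmas.

```python
# Hypothetical lemma lookup: (form, part of speech) -> (lemma, part of speech).
LEMMA_TABLE = {
    ("banks", "VERB"): ("bank", "VERB"),
    ("banked", "VERB"): ("bank", "VERB"),
    ("banking", "VERB"): ("bank", "VERB"),
    ("banks", "NOUN"): ("bank", "NOUN"),  # noun 'bank' is a separate lemma
}

def lemma_of(token, pos):
    """Map an inflected form to its lemma; unknown forms stand for themselves."""
    return LEMMA_TABLE.get((token, pos), (token, pos))

tagged = [("banked", "VERB"), ("banks", "NOUN"), ("banking", "VERB")]
print({lemma_of(t, p) for t, p in tagged})
# Three tokens, but only two lemmas: ('bank', 'VERB') and ('bank', 'NOUN')
```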

Word Families Related to the lemma is the word family, which is defined
as the base word form plus its inflections and most common derivational
variants, for example, invite, invites, inviting, invitation (Hirsh and Nation
1992, p. 692). English inflections include third person -s, past participle
-ed, present participle -ing, plural -s, possessive -s, and comparative -er and
superlative -est. Derivational affixes include -able, -er, -ish, -less, -ly, -ness,
-th, -y, non-, un-, -al, -ation, -ess, -ful, -ism, -ist, -ity, -ize, -ment, and in-
(Hirsh and Nation 1992, p. 692). As with the lemma, the underlying idea
is that a base word and its inflected forms express the same core meaning,
and thus can be considered learned words if a learner knows the base and
the affix rules. Bauer and Nation (1993) proposed seven levels of affixes,
which include derivations and inflections. Word families differ from lem-
mas in that they cross syntactic categories. In the example of bank, as
above, the noun and verb forms are counted as part of the same family.

As a result, a lemma count will always be larger than the word family
count, given the narrower range of forms counted as a single instance.
Milton identifies what he terms a ‘very crude’ equivalence of lemma to
word family involving multiplying the word family size by 1.6 to get the
approximate lemma size (Milton 2009, p. 12).
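The following sketch illustrates the word-family idea with a deliberately crude suffix-stripper (Bauer and Nation's affix levels are far more careful than this), together with Milton's ‘very crude’ family-to-lemma conversion; the suffix list and matching logic are simplifications introduced here for illustration only.

```python
# Illustration-only suffix list drawn from the inflections and derivational
# affixes listed above; Bauer and Nation (1993) define membership much
# more carefully.
SUFFIXES = ["ation", "ness", "ment", "able", "less", "ing", "ed",
            "er", "est", "ly", "s"]

def family_base(word, known_bases):
    """Map a form to a known base word by stripping one suffix."""
    if word in known_bases:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            for candidate in (stem, stem + "e"):  # invit- -> invite
                if candidate in known_bases:
                    return candidate
    return word

bases = {"invite"}
for form in ["invite", "invites", "inviting", "invitation"]:
    print(form, "->", family_base(form, bases))

# Milton's (2009, p. 12) 'very crude' conversion: lemmas ~ families x 1.6.
print("5,000 word families ~", int(5000 * 1.6), "lemmas")
```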

The Word Family as a Unit of Recognition Vocabulary Knowledge

The word family has been widely used as the unit of counting in vocabu-
lary size studies (Schmitt 2010). Nation has argued that the word family
is a particularly appropriate unit for studying L2 recognition vocabulary
because it is primarily about meaning and meaning potential (Nation
2006, p. 76). It also has a degree of psycholinguistic reality regarding how
the different forms in a given family are stored in the mental lexicon
(Nagy et al. 1989). The basic assumption is that if the meaning of the
base word is known, the various inflections and derivations in which it
appears will also be potentially understood, at least to some degree. This
assumption has proved to be useful in relating individual vocabulary size
to L2 use, but it is not categorical. The assumption that a learner
who knows the meaning of build will understand the meaning of builder
on the first encounter is a probabilistic one. Schmitt and Zimmerman
(2002) show that university-level ESL students’ knowledge of the derived
forms of many stem words is far from complete, for example, not knowing
that persistent, persistently, and persistence all come from persist. However,
they also recognize that users would probably work out the meaning of per-
sistence faster if they knew persist than if they did not.
The word family construct also conflates the distinction that Paradis
(2009) makes between the stock of form–meaning associations stored in
declarative memory and morphological processes that are procedural in
nature. Widely used tests of vocabulary size, the VLT (Nation 2013) and
the Yes/No Test (Meara and Buxton 1987), always present the base form
as the test item, thus sidestepping any attempt to measure the morpho-
logical knowledge assumed in the word family construct.
While recognizing these limitations, the word family is nonetheless an
easy-to-understand and widely used measure of recognition vocabulary
size, and is used in the research studies reported in Part 2. From here
forward, word and word family will be used interchangeably when refer-
ring to size; that is, a reference to ‘the number of words’ will mean the
same as ‘the number of word families’.

Word Frequency Statistics as an Index of Vocabulary Size

Figuring out how many words a user knows is the next challenge for the
vocabulary size researcher. While in theory it may be possible to identify
every single word a user knows, in practice, the process of fixing vocabu-
lary size is one of estimation. A vocabulary size estimate is based on a
finite sample of a user’s knowledge obtained in a specific task or set of
tasks. Recognition vocabulary knowledge is passive by nature, and evi-
dence for it must be elicited from the user. This is done by presenting a
set of words to a user and eliciting a response that indicates whether the
items are known. Time and resource limitations mean that any test can
present only a limited number of words, and it is from this limited sam-
ple that the user’s vocabulary size is estimated. Word frequency statistics
provide the vocabulary size researcher with a reliable and objective means
to index the size of recognition (and productive) vocabulary knowledge
(Laufer 2001).
Words greatly differ in how often they occur in a given language.
When the words in a large corpus of spoken or written English are rank-
ordered from the most to least frequently occurring, a highly distinctive
pattern emerges. The 2000–3000 most frequently occurring words
account for the vast majority of tokens that appear in the corpus. Beyond
these high-frequency words, the relative frequency of a given word
steadily decreases as a function of its relative order, until the very-low-
frequency words tail off and account for only a tiny proportion of tokens.
This frequency distribution, called Zipf’s law (after one of its original
discoverers), provides an index for the measurement and interpretation of
vocabulary size. The law states that, for a corpus of natural language
utterances, the frequency of any word is in inverse proportion to its rank
in the frequency table. This inverse proportionality means that words
occur in a predictable and distinctive pattern. The most frequent word
occurs approximately twice as often as the second most frequent word,
three times as often as the third most frequent word, and so on. For
example, in the classic Brown Corpus, the word ‘the’ is the most fre-
quently occurring word, and by itself accounts for nearly 7% of all word
occurrences. Reflecting Zipf ’s law, the second-place word, ‘of ’, accounts
for slightly over 3.5% of words (36,411 occurrences), followed by ‘and’
at 2.8% (28,852). The first 135 words alone account for half the Brown
Corpus tokens (Biber et al. 1998).
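The Brown Corpus figures just cited follow the 1/r shape closely, as the short sketch below shows. The constant 0.07 (the share taken by the top-ranked word) is taken from the figure for ‘the’ above; the fit is approximate rather than exact.

```python
def zipf_share(rank, c=0.07):
    """Predicted share of all tokens for the word at a given frequency rank,
    assuming the top-ranked word accounts for a share c of all tokens."""
    return c / rank

for rank, word in enumerate(["the", "of", "and"], start=1):
    print(f"rank {rank} ({word}): predicted {zipf_share(rank):.1%}")
# Predicted: 7.0%, 3.5%, 2.3% -- against the observed ~7%, ~3.5%, 2.8%
```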
When describing vocabulary size using these frequency counts,
researchers usually fix size in increments of 1000 words, that is, the first
1000 most frequent words (ranks 1–1000), the second 1000 (ranks 1001–2000), and
so on. The shorthand 1K, 2K, 3K, and so on is used to refer to these
bands throughout the book. Words sampled from selected bands are used
in testing instruments that systematically elicit word knowledge across a
range of frequency levels and quantify this knowledge in an objective and
context-independent way.
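A sketch of this band-sampling logic is given below: items are sampled from each 1K band, the proportion known is projected back onto the full band, and the band estimates are summed. Corrections for guessing and false alarms, which actual scoring formulas apply (see Chap. 5), are omitted, and the hit counts are invented.

```python
BAND_SIZE = 1000  # each frequency band spans 1000 word families

def estimate_size(hits_per_band, items_per_band=20):
    """Project the proportion of known items in each sampled band back
    onto the full band and sum across bands."""
    return sum(round(hits / items_per_band * BAND_SIZE)
               for hits in hits_per_band)

# Invented hits: known items out of 20 sampled from the 1K-5K bands.
hits = [20, 18, 14, 9, 5]
print("estimated size:", estimate_size(hits), "word families")  # 3300
```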
The use of word frequency statistics is a distinctive feature of the
vocabulary size approach. The development of very large and accessible
corpora has resulted in increasingly refined frequency counts of spoken
and written language use. These corpora include the British National
Corpus (BNC) (Leech et al. 2001), the Cambridge and Nottingham
Corpus of Discourse in English (CANCODE) (McCarthy 1998), and
the Corpus of Contemporary American English (COCA)
(http://corpus.byu.edu/coca/), among others. The increasing availability of vocabulary
software tools at high-quality websites such as the Complete Lexical
Tutor, www.lextutor.ca/, and Lancaster University, corpora.lancs.ac.uk,
permits researchers and practitioners to define and map learner vocabu-
lary size onto receptive and productive L2 performance in an increasingly
sophisticated manner.
In summary, the word family is an attested unit of recognition vocabu-
lary knowledge that has been widely used in vocabulary size research. It is
also used in the studies reported in this book, in conjunction with
word frequency statistics, to estimate user vocabulary size. The pat-
terns of word frequency have more than just a descriptive function, as
they also have direct implications for vocabulary learning and the repre-
sentation of this knowledge in the mental lexicon. This is discussed below.

1.3 Vocabulary Size as a Dimension of Learners’ Vocabulary Knowledge
The preceding has introduced the logic of vocabulary size research. The
next issue to consider is what individual differences in vocabulary size tell
us about a user’s underlying vocabulary knowledge and how it relates to
L2 learning and use. In other words, why measure vocabulary size?

Vocabulary Size as a Dimension of L2 Vocabulary Knowledge

The vocabulary size approach focuses on the size, or breadth, of an indi-
vidual’s vocabulary knowledge. This breadth is established by identifying
how many words a user can recognize. This can be done by matching a
presented word with a basic definition, as is done in the VLT, or simply
by indicating that the word is known, as in the Yes/No Test. Both tests are
examined in the following chapter. Correct performance on the VLT
shows that a user knows a basic meaning for a word. It does not indicate
whether the user knows all, or even any other, meanings for the word.
Accuracy on the Yes/No Test shows that the test-taker can link some
meaning with the word form; however, the nature of that meaning is an
open question. At a minimum, it might just be that a particular word
exists in the language.
As is evident in Fig. 1.1, vocabulary size tests directly measure only a
very small part of what it means to know a word. Complete word knowl-
edge consists of form, meaning, and use. Form concerns the perceptual
and physical shape of a word, including both how it is pronounced and
how it is spelled. The meaning of a word includes knowledge of its basic
meaning and the words typically associated with it. Word meaning also
encompasses the range of lexical relationships that the word has with
other words in the mental lexicon. These links include polysemy,
antonymy, homonymy, synonymy, and other relational links such as
metonymy (Aitchison 2012). Knowledge of word use is the third part of
word knowledge and arguably the most important. Words are typically
used in particular combinations dictated by the requirements of the set-
ting, goal, and participants. Using a word appropriately also involves a
range of conceptual and world knowledge that goes beyond the mental
lexicon proper.
Performance on vocabulary size tests reflects only how many form–
meaning mappings the user knows, and these typically correspond to the
most basic meaning or meanings. These tests do not directly tap the user’s
knowledge of the range of word meanings or uses, that is, depth. Although
the relationship between breadth and depth is open to debate (Qian
1999; Read 2004), it is assumed here that the two are not independent
dimensions (Vermeer 2001). The assumption is that depth of knowledge
emerges with increasing size; that is, evidence that a user knows a large
number of these form–meaning mappings implies that vocabulary depth
knowledge is also present to some degree. However, vocabulary size tests
are ultimately not about qualitative differences in vocabulary knowledge,
but instead provide an estimate of the number of words (minimally
form–meaning mappings) the user has. The relative size of this vocabu-
lary stock is assumed to represent a basic constraint on performance in
the L2. How strong a constraint is a central focus of the research reported
in Part 2.
In addition to assumptions about the nature of the underlying vocabu-
lary knowledge, the vocabulary size approach also makes assumptions
about how this knowledge is acquired and used.

Vocabulary Learning Assumptions in the Vocabulary Size Approach

A defining characteristic of the vocabulary size approach is the use of
word frequency statistics to estimate individual vocabulary size. Schmitt
(2010) states that frequency ‘is arguably the single most important char-
acteristic of lexis that researchers must address’ (p. 64). Frequency is par-
ticularly important in the development of recognition vocabulary, which
is driven by exposure to the language. This exposure is both intended and
incidental. Vocabulary learning is thus assumed to be input driven (Ellis
2002). The likelihood that a given word is known can be predicted to a
large extent by the frequency with which it appears in the language. All
things being equal, words that occur more frequently will be learned
sooner. Over time, they will also be accessed more quickly. High-
frequency words are learned before mid-frequency words, which in turn
are acquired before low-frequency words. In short, vocabulary size is an
emergent property of L2 knowledge development, driven by the input the
learner receives.
Figure 1.2 depicts this frequentist model in schematic terms. Three
things should be noted. First and foremost, the more frequently a word
is used in a given language, the more likely it is that an individual will
know the word. Second, word knowledge (as reflected by accurate test
performance) forms a gradient across frequency bands such that knowl-
edge in specific bands decreases proportionally from the high-frequency
to low-frequency bands. Proportionally, more high-frequency words are
known than mid-frequency words, and in turn, more mid-frequency
words are known than low-frequency words. Finally, the relationship
between size and learning is expressed in probabilistic terms. It is not
assumed that higher-frequency words are always learned before lower-fre-
quency ones. Individual differences in life experience mean that beginner
users will know some very-low-frequency words as a reflection of their
interests and experience. The key point is that word frequency statistics

provide an objective means to scale learner vocabulary size as probabilistic estimates that can, in turn, serve as sensitive discriminators of user proficiency differences and performance outcomes.

Fig. 1.2 A frequentist model of vocabulary learning (y-axis: likelihood of knowing a word, 0–100%; x-axis: frequency of occurrence, high, mid, and low)
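The probabilistic pattern sketched in Fig. 1.2 can be made concrete with a small simulation. The sketch below is purely illustrative: it assumes a logistic decline in the likelihood of knowing a word as log frequency rank increases, and the midpoint and slope parameters are hypothetical values chosen for demonstration, not estimates from any study reported here.

```python
import math

def p_known(rank, midpoint=5000, slope=1.5):
    """Hypothetical likelihood that a word at a given frequency rank is known.

    Assumes a logistic decline across log frequency rank; `midpoint` is the
    rank at which the likelihood falls to 50%. Both parameters are
    illustrative, not empirical estimates.
    """
    return 1 / (1 + math.exp(slope * (math.log(rank) - math.log(midpoint))))

# A likelihood gradient across high-, mid-, and low-frequency ranks
for label, rank in [("High (rank 500)", 500),
                    ("Mid (rank 5,000)", 5000),
                    ("Low (rank 20,000)", 20000)]:
    print(f"{label}: {p_known(rank):.2f}")
```

Any monotonically decreasing function would make the same qualitative point; the logistic form simply keeps the output interpretable as a probability.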

Frequency of Occurrence and Diversity of Context

Frequency is a quantitative variable that ignores where words appear.


Quantity, not quality, is what matters. However, recent work indicates
that the effects of frequency on learning, and particularly on use, may not
merely be the function of the number of times a word occurs, but instead
reflect the range of contexts in which it is encountered. Research suggests
that frequency of occurrence is strongly correlated, if not confounded,
with contextual diversity when accounting for speed of lexical decision test
performance (Adelman et al. 2006) and L2 vocabulary learning (Crossley et al. 2013). This suggests that word frequency statistics may have a qualitative dimension beyond mere token counts, serving as surrogates for contextual variety.

Frequency of Occurrence and Text Coverage

The primary focus of the vocabulary size approach is the relationship


between vocabulary size and L2 performance. Evidence for this relation-
ship takes two forms. One is the relationship between vocabulary size and
L2 proficiency as measured in proficiency standards such as the Common
European Framework (Milton 2009) and more localized performance
measures such as a placement test (Harrington and Carey 2009). This
approach is exemplified in the studies reported in Part 2 and will be dis-
cussed at length there. But there is another way in which vocabulary size
has been related to L2 performance. There has been a long-standing inter-
est in how user vocabulary size relates to the lexical demands of the text.
The basic question driving this research is how many words a
reader needs to know to ensure that all the words encountered in a given
text are recognized, that is, text coverage (Laufer 1992). Text coverage has
been widely used as a graded measure of the relative difficulty the reader
will have in comprehending a text and, in turn, as a means to identify
Table 1.1 Vocabulary size expressed in word families and text coverage (written and spoken) across nine^a corpora (Nation 2006, p. 79)

Knowledge of all words in   Approximate written text coverage (%)   Approximate spoken text coverage (%)
1K                          78–81                                   81–84
2K                          8–9                                     5–6
3K                          3–5                                     2–3
4K–5K                       3                                       1.5–3
6K–9K                       2                                       0.75–1
10K–14K                     <1                                      0.5
Proper nouns                2–4                                     1–1.5
+14K                        1–3                                     1

^a Corpora analyzed: Lancaster–Oslo–Bergen (LOB) Corpus, Freiburg–LOB, Brown, Frown, Kolhapur, Macquarie, Wellington Written, Wellington Spoken, and Lund, available from the International Computer Archive of Modern and Medieval English at http://gandalf.aksis.uib.no/icame.html (Nation 2006, p. 63).

vocabulary learning needs. A number of studies have examined the user


size–text coverage relationship in both spoken and written texts (Adolphs
and Schmitt 2003; Cobb 2007; Hazenberg and Hulstijn 1996; Hsueh-
Chao and Nation 2000; Laufer 1989, 1992; Laufer and Ravenhorst-
Kalovski 2010; Milton 2009; Nation 2006; Schmitt et al. 2011; Webb and
Rodgers 2009; van Zeeland and Schmitt 2013).
Nation (2006) examined the relationship between size and text cover-
age in nine spoken and written corpora covering a range of text types.
Table 1.1 sets out the correspondence of vocabulary size with text cover-
age levels.
The data in Table 1.1 are presented visually in Fig. 1.3 to illustrate the
profound effect of bands 1K–3K on how much of the text will be recog-
nized. Knowledge of the 1K band alone allows a user to recognize around
80% of the words occurring in written or spoken texts.
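The cumulative effect can be checked directly against Table 1.1. The short calculation below sums the approximate written-text percentages band by band, using midpoints where the table reports a range; it reproduces the roughly 80% coverage contributed by the 1K band alone and the rapidly diminishing returns thereafter.

```python
# Approximate written-text coverage per frequency band from Table 1.1
# (Nation 2006); midpoints are used where the table reports a range.
written = [("1K", 79.5), ("2K", 8.5), ("3K", 4.0),
           ("4K-5K", 3.0), ("6K-9K", 2.0), ("10K-14K", 0.5)]

cumulative = 0.0
for band, pct in written:
    cumulative += pct
    print(f"up to {band:>7}: ~{cumulative:.1f}% of running words covered")
```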
The 1K band contains all of the function words (articles, prepositions,
pronominals, auxiliary verbs, conjunctions) that account for a significant
amount of coverage in any given text. Knowledge of the 2K band pro-
vides an additional 10% coverage in written texts and 5% coverage in
spoken texts. The 3K words provide an additional coverage of about 5%
and 3%, respectively. Past the 5K band, there is only a small increase in
text coverage as a function of increasingly lower-frequency bands. Proper nouns are usually treated separately in size–text coverage discussions because they are highly text and context specific.

Fig. 1.3 Cumulative percentage of text coverage and corresponding frequency bands (spoken and written texts; y-axis: percentage of coverage; x-axis: frequency bands 1K, 2K, 3K, 4–5K, 6–9K, 10–14K)

% of text coverage   Unfamiliar words per 100 words   Text lines per unfamiliar word
99                   1                                10
98                   2                                5
95                   5                                2
90                   10                               1
80                   20                               0.5
Note: Assumes ten words per line

Fig. 1.4 Text coverage as the number of unfamiliar words and the number of lines of text per unfamiliar word
So how do these various text coverage levels map onto text comprehen-
sion? Figure 1.4 sets out the relationship between text coverage levels and
the number of unfamiliar words a reader will encounter.
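The relationship in Fig. 1.4 is simple arithmetic: at a coverage level of c percent, a reader meets 100 − c unfamiliar words per 100 running words and, on the figure's assumption of ten words per line, one unfamiliar word every 10/(100 − c) lines. A minimal sketch of the calculation:

```python
def unfamiliar_density(coverage_pct, words_per_line=10):
    """Unfamiliar words per 100 running words, and lines per unfamiliar word."""
    per_100 = 100 - coverage_pct
    lines_per_unfamiliar = (100 / per_100) / words_per_line
    return per_100, lines_per_unfamiliar

for cov in (99, 98, 95, 90, 80):
    per_100, lines = unfamiliar_density(cov)
    print(f"{cov}% coverage: {per_100} unfamiliar per 100 words, "
          f"1 per {lines:g} line(s)")
```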
There are three ___ that must be considered when ______

to ______ how the most _______ of ____ coverage relates

to reading ____ ____. The first is the ___ of

reading, or ____ Reading a _____ is different from reading

a novel. The second is the nature of ______. Reading

for general _______ is different from reading for _____. The

third is the way in which ______ is _______.

All these _____must be taken into consideration when ______

to ___ how differences in ___ coverage might affect ______.

Fig. 1.5 A sample reading text with 80% text coverage

Learning the 1K band alone means that a reader should recognize


approximately 80% of the words in a text. While this might appear to be
enough to make some sense of a text, in fact, it is woefully inadequate for
even minimal understanding. The text in Fig. 1.5 has been modified so
that 20% of the words are missing. The omissions render the text almost
incomprehensible. Note that the unfamiliar words are usually the most
important for understanding a given text.
The various studies differ in aim, setting, and size, but a consensus has
emerged as to vocabulary size needs for key text coverage thresholds. For
written texts, it is generally agreed that 95% text coverage requires knowl-
edge of 1K–3K bands, and that this will result in only a basic level of
comprehension. For 98%–99% text coverage, up to the 8K–9K range is
needed, and it is only in this range that reading starts to become fluent.
These generalizations are relatively stable across text types and genres,
which can differ significantly in difficulty and purpose.
The relationship between size and spoken text coverage has received
somewhat less attention than the reading text research, but the findings
are generally the same. Adolphs and Schmitt (2003) found that 2K–3K
word families were sufficient for 95% text coverage, a number similar to
the written text research. In contrast, knowledge of only the 6K–7K
bands was needed for 98% text coverage, fewer than the 8K–9K sug-
gested as being necessary to read authentic texts with some degree of
fluency (Nation 2006). van Zeeland and Schmitt (2013) also reported
that listening comprehension required knowledge of fewer word families
than comparable reading levels.
A question remains as to whether these text coverage levels, particu-
larly the 95% and 98% levels, reflect a qualitative threshold that must be
met for adequate comprehension, or a continuum from lesser to greater
comprehension skill. Schmitt et al. (2011) examined this issue by plot-
ting text coverage levels against performance for 600 tertiary L2 English
readers from 12 different countries. The relationship between text cover-
age and comprehension was plotted at ten text coverage levels, ranging
from 90% to 100% coverage. See Fig. 1.6.
Fig. 1.6 Text comprehension percentage as a function of vocabulary coverage (Schmitt et al. 2011, p. 34; y-axis: comprehension percentage, 0–100; x-axis: vocabulary coverage from 90% to 100%, with the number of participants at each level, n = 21–200; separate lines plot the mean and ±1 standard deviation)

This figure is adapted from Schmitt et al. (2011, p. 34), with only alternating text coverage levels reported here. A consistent linear relationship is evident across the reading comprehension and vocabulary coverage levels. There is little suggestion of discrete thresholds at the 95% or
the 98% coverage level, with the comprehension slopes trending up in a


consistent way. It is important to note that the respective levels represent
averages that hide highly significant individual variability. For example,
comprehension scores at +1 standard deviation for the 92% coverage
group are nearly identical to the mean performance of the 98% group.
Although Schmitt et al. (2011) reject the idea of discrete thresholds, they
do endorse 98% text coverage as a target for achieving adequate compre-
hension, and with it the 8K–9K bands as the target identified by Nation
and others as needed for meeting this goal (Nation 2006). The significant
variability underscores the probabilistic nature of fixing vocabulary size
and its relationship to L2 performance.

1.4 Conclusions
The vocabulary size approach is based on the simple assumption that the
number of words an individual knows has a direct relationship to L2
proficiency. The focus here is on recognition vocabulary knowledge,
which is narrowly defined as the ability to recognize the association
between a single word form and a basic meaning. Vocabulary learning is
viewed as an input-driven process in which vocabulary size emerges from
the user’s experience with the language. Corpus-based word frequency
statistics provide a means of estimating the overall vocabulary size from
recognition performance on a limited set of words. These vocabulary size
estimates have been related to L2 proficiency and use in two ways.
Vocabulary size has been examined as a predictor of differences in L2
performance as measured by standardized and context-specific tests. As a
key component of the lexical facility construct, this use of vocabulary size
is the focus of the book. Considerable attention has also been given to the
relationship between vocabulary size and text coverage, the latter reflect-
ing the comprehension demands of written or spoken texts. Vocabulary
size thresholds have been proposed to meet the levels of text coverage
required for successful comprehension. Both uses demonstrate the utility
of vocabulary size as a dimension of L2 vocabulary knowledge and a cor-
relate of L2 proficiency.
Vocabulary size is a basic element of the lexical facility account. In the


next chapter, three vocabulary size test formats are examined. The VLT,
the more recent Vocabulary Size Test (VST), and the Yes/No Test will all
be described and compared. The strengths and limitations of the respec-
tive tests are also discussed. The Yes/No Test is the basis for the Timed
Yes/No Test used in the studies reported in Part 2. The rationale for using
the format is also presented.

References
Adelman, J. S., Brown, G. D. A., & Quesada, J. F. (2006). Contextual diversity,
not word frequency, determines word-naming and reading times. Psychological
Science, 17(9), 814–823.
Adolphs, S., & Schmitt, N. (2003). Lexical coverage of spoken discourse.
Applied Linguistics, 24(4), 425–438.
Aitchison, J. (2012). Words in the mind: An introduction to the mental lexicon
(4th ed.). Malden: Wiley.
Andrews, S. (2008). Lexical expertise and reading skill. In B. H. Ross (Ed.), The
psychology of learning and motivation: Advances in research and theory (Vol. 49,
pp. 247–281). San Diego: Elsevier.
Bauer, L., & Nation, I. S. P. (1993). Word families. International Journal of
Lexicography, 6(4), 253–279.
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating
language structure and use. Cambridge: Cambridge University Press.
Cobb, T. (2007). Computing the vocabulary demands of L2 reading. Language
Learning & Technology, 11(3), 38–63.
Crossley, S. A., Subtirelu, N., & Salsbury, T. (2013). Frequency effects or con-
text effects in second language word learning. Studies in Second Language
Acquisition, 35(4), 727–755. doi:10.1017/S0272263113000375.
Ellis, N. C. (2002). Frequency effects in language processing: A review with
implications for theories of implicit and explicit language acquisition. Studies
in Second Language Acquisition, 24(2), 143–188.
Harrington, M., & Carey, M. (2009). The online yes/no test as a placement
tool. System, 37(4), 614–626. doi:10.1016/j.system.2009.09.006.
Hazenberg, S., & Hulstijn, J. H. (1996). Defining a minimal receptive vocabu-
lary for non-native university students: An empirical investigation. Applied
Linguistics, 17(2), 145–163.
Hirsh, D., & Nation, P. (1992). What vocabulary size is needed to read unsim-
plified texts for pleasure? Reading in a Foreign Language, 8(2), 689–696.
Hsueh-Chao, M. H., & Nation, I. S. P. (2000). Unknown vocabulary density
and reading comprehension. Reading in a Foreign Language, 13(1),
403–430.
Laufer, B. (1989). What percentage of text-lexis is essential for comprehension?
In C. Lauren & M. Nordman (Eds.), Special language: From humans thinking
to thinking, machines (pp. 316–323). Clevedon: Multilingual Matters.
Laufer, B. (1992). How much lexis is necessary for reading comprehension? In
P. J. L. Arnaud & H. Béjoint (Eds.), Vocabulary and applied linguistics
(pp. 126–132). London: Macmillan. doi:10.1007/978-1-349-12396-4_12.
Laufer, B. (2001). Quantitative evaluation of vocabulary: How it can be done and what it is good for. In C. Elder, K. Hill, A. Brown, N. Iwashita, E. Grove, T. Lumley, & T. McNamara (Eds.), Experimenting with uncertainty: Essays in honour of Alan Davies (pp. 241–250). Cambridge:
Cambridge University Press.
Laufer, B., & Ravenhorst-Kalovski, G. C. (2010). Lexical threshold revisited:
Lexical text coverage, learners’ vocabulary size and reading comprehension.
Reading in a Foreign Language, 22(1), 15–30.
Leech, G., Rayson, P., & Wilson, A. (2001). Word frequencies in spoken and writ-
ten English. London: Longman.
Lightbown, P. M., & Spada, N. (2013). How languages are learned (4th ed.).
Oxford: Oxford University Press.
McCarthy, M. (1998). Spoken language and applied linguistics. Cambridge:
Cambridge University Press.
Meara, P., & Buxton, B. (1987). An alternative to multiple choice vocabulary
tests. Language Testing, 4(2), 142–145.
Meara, P., & Jones, G. (1988). Vocabulary size as placement indicator. In
P. Grunwell (Ed.), Applied linguistics in society (pp. 80–87). London: CILT.
Milton, J. (2009). Measuring second language vocabulary acquisition. Bristol:
Multilingual Matters.
Nagy, W. E., Anderson, R., Schommer, M., Scott, J. A., & Stallman, A. (1989).
Morphological families in the internal lexicon. Reading Research Quarterly,
24(3), 263–282. doi:10.2307/747770.
Nation, I. S. P. (2006). How large a vocabulary is needed for reading and lis-
tening? The Canadian Modern Language Review/La Revue Canadienne des
Langues Vivantes, 63(1), 59–82.
Nation, I. S. P. (2013). Learning vocabulary in another language (2nd ed.).


Cambridge, UK: Cambridge University Press.
Paradis, M. (2009). Declarative and procedural determinants of second languages
(Vol. 40). Amsterdam: John Benjamins Publishing.
Qian, D. D. (1999). Assessing the roles of depth and breadth of vocabulary
knowledge in reading comprehension. The Canadian Modern Language
Review, 56(2), 282–307.
Read, J. (2004). Plumbing the depths: How should the construct of vocabulary
knowledge be defined? In P. Bogaards & B. Laufer (Eds.), Vocabulary in a
second language: Selection, acquisition, and testing (pp. 209–227). Amsterdam:
John Benjamins.
Richards, J. C. (1976). The role of vocabulary teaching. TESOL Quarterly, 10,
77–89.
Richards, B. (1987). Type/token ratios: What do they really tell us? Journal of
Child Language, 14(2), 201–209. doi:10.1017/S0305000900012885.
Schmitt, N. (2010). Researching vocabulary. A vocabulary research manual.
Basingstoke: Palgrave Macmillan.
Schmitt, N., & Zimmerman, C. B. (2002). Derivative word forms: What do
learners know? TESOL Quarterly, 36(2), 145–171. doi:10.2307/3588328.
Schmitt, N., Jiang, X., & Grabe, W. (2011). The percentage of words known in
a text and reading comprehension. The Modern Language Journal, 95(1),
26–43. doi:10.1111/j.1540-4781.2011.01146.x.
van Zeeland, H., & Schmitt, N. (2013). Lexical coverage in L1 and L2 listening
comprehension: The same or different from reading comprehension? Applied
Linguistics, 34(4), 457–479.
Vermeer, A. (2001). Breadth and depth of vocabulary in relation to L1/L2
acquisition and frequency of input. Applied PsychoLinguistics, 22(2), 217–234.
Webb, S., & Rodgers, M. P. H. (2009). The lexical coverage of movies. Applied
Linguistics, 30(3), 407–427. doi:10.1093/applin/amp010.
Wray, A. (2008). Formulaic language: Pushing the boundaries. Oxford: Oxford
University Press.
2
Measuring Recognition Vocabulary Size

Aims

• Introduce the two approaches to measuring second language (L2) rec-


ognition vocabulary size, the Vocabulary Levels Test (VLT)/Vocabulary
Size Test (VST) and the Yes/No Test
• Identify uses of the tests
• Compare the test formats

2.1 Introduction
This chapter describes two approaches to measuring recognition vocabu-
lary size. The first approach is represented in two tests developed by Paul
Nation and his colleagues: the Vocabulary Levels Test (VLT) (Nation
2013; Schmitt et al. 2001) and, more recently, the Vocabulary Size Test
(VST) (Beglar 2010; Nation 2012). The other approach is embodied in
Paul Meara’s Yes/No Test of recognition vocabulary knowledge (Meara
and Buxton 1987; Meara and Jones 1988). The approaches share the
same frequentist perspective and the same measurement goal—vocabu-
lary size—but go about it in fundamentally different ways. Because the

latter serves as the foundation for the Timed Yes/No Test used in the lexi-
cal facility account, it is important to understand the differences between
the two and the advantages (and limitations) of the Yes/No Test format.
In both approaches, words are systematically sampled from frequency-­
of-­occurrence bands, with relative performance across the bands provid-
ing a measure of individual vocabulary size. As noted in the previous
chapter, the use of word frequency statistics as the basis for estimat-
ing vocabulary size is a distinctive feature of the vocabulary size approach.
The next section describes the three test formats and their three main
uses. The first use is characterizing vocabulary size as learning targets, or
thresholds. The second is examining the sensitivity of vocabulary size to
proficiency level differences evident in both global standards, such as
TOEFL and IELTS, and more local applications, such as placement testing
and academic performance. The third use is theoretical and relates to gain-
ing a better understanding of written recognition vocabulary size as a
dimension of L2 vocabulary knowledge. This includes both spoken recog-
nition vocabulary size and productive vocabulary size, spoken and written.

2.2 Approaches to Measuring Recognition Vocabulary Size
The two approaches differ in how they elicit user vocabulary knowledge.
The VLT/VST approach uses a multiple-choice format in which a target
item is presented with alternative definitions from which the user chooses.
The Yes/No Test approach simply requires the user to indicate whether
they know the word. The multiple-choice tests are examined first.

The Vocabulary Levels Test

The VLT was first introduced in the early 1980s and has since been modi-
fied and revised (Nation 1990; Beglar and Hunt 1999; Schmitt et al.
2001). Each test item comprises six possible target word items and three
short definitions. The test-taker matches the word to the corresponding
definition. An example is given in Fig. 2.1.
This is a vocabulary test. You must choose the right word to go with each meaning.

Write the number of the word next to its meaning.

1 bench

2 charity ____ long seat

3 mate ____ help to the poor

4 jar ____ part of a country

5 mirror

6 province

Fig. 2.1 Instructions and example item for Vocabulary Levels Test (Adapted from
Nation 2013, p. 543)

The VLT includes words from each of the 2K, 3K, 5K, and 10K fre-
quency bands (Nation 2013). Also included is a set of academic words in
an Academic Word List (Coxhead 2000). These are words that occur with
high frequency in academic texts but are drawn from a range of frequency
levels. The current version comprises ten sets of word clusters from the
four levels and the academic words, for a total of 150 test words (3 target
items × 10 sets × 5 levels). The validity and reliability of the test have been
examined in a number of studies (Beglar and Hunt 1999; Culligan 2015;
Read 1988; Schmitt et al. 2001). The studies primarily evaluated the
target items used and the wording of the response alternatives. The two
versions of the test published by Schmitt, Schmitt and Clapham (2001)
are now the standards and have been used with a variety of learners in a
range of settings. The test is available in Schmitt (2010).
The VLT has also been used in a timed format by Laufer and Nation
(2001) and Zhang and Lu (2013). These studies are examined in the next
chapter, where vocabulary recognition speed is discussed.

The Vocabulary Size Test

The VLT samples words from four frequency bands in the 2K–10K range.
From this, it is possible to estimate an overall vocabulary size by interpo-
lating response levels for the untested bands. In other words, ceiling per-
formance on the 2K band is assumed to imply that the test-taker knows
all of the 1K words as well. However, this is a rough estimate, especially


for the lower-frequency bands, 6K–9K, and beyond. In light of this
shortcoming, Nation and his colleagues introduced the Vocabulary Size
Test (VST), to provide a more direct estimate of overall vocabulary size
(Nation 2012). The test expands the range of bands tested, including ten
items each from the 1K to either the 13K or the 14K band, depending on
the version of the test used. Overall size is calculated by multiplying the
score on the 140-item total by 100; a perfect score of 140/140 would
yield an overall vocabulary size of 14,000 words. Stopping at the 14K
band was motivated by the fact that the most frequent 14K words pro-
vide over 99% coverage in written and spoken texts (Nation 2006). It was
assumed that sampling up to that level would ensure satisfactory reading
comprehension of most texts. The test does not need to interpolate word
knowledge for bands not tested, but still makes the assumption that the
ten items selected for each band are representative of all the other words
in the band.
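Because the VST samples ten items from each 1,000-word-family band, each correct response is taken to stand for 100 word families. The calculation is trivial, and is written out here only to make the sampling logic explicit:

```python
def vst_size_estimate(correct_items, families_per_item=100):
    """Estimate total vocabulary size from a VST score.

    The VST samples ten items per 1,000-word-family band, so each correct
    response is taken to represent 100 word families.
    """
    return correct_items * families_per_item

print(vst_size_estimate(97))   # 97/140 correct -> estimated 9,700 word families
```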
The format also differs from the VLT in that it uses a standard four-­
alternative multiple-choice format. This format allows each target item to
be tested independently, rather than with two targets, as is the case with
the VLT. An example of an item is given in Fig. 2.2. The test is available
in an interactive web format on the Lextutor website, www.lextutor.ca/
tests/. It is also available online at www.my.vocabularysize.com.
Beglar (2010) examined the validity of the VST using Rasch analysis.
The fit indices showed the test to be a highly reliable instrument for dis-
criminating across a range of learner proficiency levels. However, it has

DINOSAUR: The children were pretending to be dinosaurs.

a. robbers who work at sea

b. very small creatures with human form but with wings

c. large creatures with wings that breathe fire

d. animals that lived a long time ago

Fig. 2.2 Sample item from Nation’s Vocabulary Size Test


also been observed that multiple-choice tests are sensitive to guessing—


that is, a test consisting of four alternative answer items will yield a score
of 25% by guessing alone. Stewart (2014) challenged the sensitivity of
the Rasch validation procedure to potential guessing, with implications
for the resulting size estimate. Using an alternative analysis on a set of
simulated response data, Stewart concluded that using multiple-choice
tests for the direct measurement of vocabulary size is ‘inadvisable’
(p. 280). This criticism notwithstanding, a multiple-choice version of the VLT, the New Vocabulary Levels Test (NVLT), has recently been introduced (McLean and Kramer 2015). Laufer and Levitzky-Aviad
(2016) also use a multiple-choice format to assess receptive vocabulary
size in the CATTS (Computer Adaptive Test of Size and Strength), an
online instrument that tests receptive and productive vocabulary size.
Given the wider 1K–14K range, the VST provides a more systematic
measure of vocabulary size. However, the VLT remains widely used in
pedagogical settings because it is easy to use and particularly useful for
identifying learner vocabulary needs. The two formats differ in the range
of vocabulary they test and presentation formats, but are similar in that
both require the test-taker to demonstrate knowledge of target word
meaning. In the VLT, the test-taker matches target words with short
meanings; in the VST, they select among multiple-choice answer options.
As a result, the monolingual versions of the tests assess knowledge of
more vocabulary than just the target test items, as a correct response
depends on a test-taker being able to understand all the words used in the
test. The recent development of bilingual versions has sought to address
this problem (Nation and Coxhead 2014). In these instances, the target
word is presented in English, but the response alternatives are shown in
the first language, for example, Russian (Elgort 2013).
Both formats also permit the test-taker to engage in significant strate-
gic processing. The possibility of producing a correct response can be
improved by being able to reject the alternatives definitively. In the paper
and online versions of the tests, where multiple items appear together on
a single page or screen, it is also possible for test-takers to spend varying
amounts of time on different items. As a result, the scores produced may
also reflect individual differences in how the individual goes about taking
the test, in addition to the vocabulary they know.
The Yes/No Test

A much different approach to measuring recognition vocabulary size is


the Yes/No Test format first introduced in Meara and Buxton (1987).
The original checklist format presented a list of target words and required
that test-takers merely indicate which ones they knew. No further attempt
was made to establish whether a word checked was in fact known. Rather,
to control for guessing on the part of the test-taker, nonword items were
presented along with the target words. These pseudowords conform to the
orthographic and phonological rules of the language but have no mean-
ing. The incorrect recognition of pseudowords as known words by the
test-taker provides a measure of guessing. The final score reflects the total
number of words identified, adjusted by performance on the pseudo-
words. As a result, the format provides only an approximate measure of
size that varies by the individual’s tendency to guess.
Figure 2.3 recreates an early version of the test adapted for French
(Meara and Buxton 1987, p. 145). The list includes words that are not
actual French words, such as fombe, étoulage, and ponte. These pseudo-
words conform to French spelling but do not exist in the language. All
the words are presented on a single test sheet, and the user checks the
words they know. Using pseudowords to control for guessing is based on
the lexical decision task, an experimental technique with a long tradition
in cognitive psychology. The use of pseudowords will be discussed in the
next chapter, where the lexical decision task is examined.
The checklist format is efficient in that it permits a large number of
vocabulary items from a range of frequency levels to be presented in a single
test. Meara and Jones (1988, 1990) developed a computerized version of
the test, the Eurocentres Vocabulary Size Test, which uses an adaptive pre-
sentation format to test the knowledge of the 1K–10K words. Subsequently,
Meara and colleagues introduced the X-Lex (Meara and Milton 2002),
which tests knowledge of 1K–5K words, and the more advanced Y-Lex
version (Meara and Miralpeix 2006), which assesses knowledge of vocabu-
lary in the 6K–10K range. Recent versions of the tests are available at
www.lognostics.co.uk/tools/. The testing format is also used in the
Vocabulary Size Placement Test included in the web-­based DIALANG
system, available at www.lancaster.ac.uk/researchenterprise/dialang/about/.
Look through the French words listed below. Cross out any words that you do not know

well enough to say what they mean. Keep a record of how long it takes you to do the test.

VIVANT TROUVER MAGIR ROMPTANT

MÉLANGE LIVRER IVRE FOMBE

MOUP VION LAGUE INONDATION

SOUTENIR SIÉCLE TORVEAU PRÉTRE

REPOS GANAL HARTON TOULE

GOÛTER FOULARD EXIGER AVARE

ÉTOULAGE ÉCARTER MIGNETTE JAMBONNANT

DÉMÉNAGER POIGNÉE ÉQUIPE MISSONNEUR

AJURER BARRON CLAGE TOUTEFOIS

LEUSSE CRUYER HÉSITER SURPRENDRE

LAVIRE SID ROMAN CHIC

ORNIR CÉRISE PAPIMENT CONFITURE

GÔTER PONTE

Fig. 2.3 A simple checklist version of the original Yes/No Test

The DIALANG vocabulary test uses a checklist format in which the


test-taker is presented with a set of words and pseudowords and can
answer the individual items in order.
The online X-Lex and A-Lex present items one at a time, as does the
LanguageMAP test, which is used in Part 2. The single-word format per-
mits individual responses to items to be timed and also minimizes the
amount of strategic processing the test-taker can engage in. Both of these
are important for testing the lexical facility construct.
The use of self-report knowledge and pseudowords to correct for guess-
ing distinguishes the test from the VLT/VST and L2 vocabulary tests
more generally. Unlike the VLT, VST, and CATTS, the Yes/No Test format
provides no cues, in the form of question wording or response alterna-


tives, to aid recognition. It also makes the significant assumption that a
‘yes’ response to a word means that the test-taker knows the word, with no
attempt made to specify what the test-taker knows about the word. The
final score on the test comprises the total of the ‘yes’ responses to word
items adjusted by performance on the pseudowords. There are different
ways to make the adjustment. A simple method is to subtract the proportion of incorrect pseudoword responses from the proportion of correct word responses. The
resulting adjusted score then represents a vocabulary size estimate that
potentially reflects the number of words known, but guessing behavior is
taken into account as well. Other methods have been examined in a num-
ber of studies (Beeckmans et al. 2001; Huibregtse et al. 2002; Mochida
and Harrington 2006; Pellicer-Sánchez and Schmitt 2012; Culligan
2015). Despite the attention that scoring and the related issue of pseudo-
word use have received, there is no strong consensus as to best practice.
This issue is discussed in more detail in Chap. 5 in relation to the Timed
Yes/No Test used to assess the lexical facility account. Timed versions of
the Yes/No Test (Harrington 2006; Harrington and Carey 2009; Shiotsu
and Read 2009; Pellicer-Sánchez and Schmitt 2012) are introduced in the
next chapter, where the measurement of vocabulary recognition speed is
examined.
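As a concrete illustration, the simple hits-minus-false-alarms adjustment described above can be sketched as follows. This is only the basic correction; it does not implement the alternative formulas (e.g., the correction for guessing and response style in Huibregtse et al. 2002) compared in the studies just cited.

```python
def yes_no_score(word_yes, n_words, pseudo_yes, n_pseudo):
    """Adjusted Yes/No Test score: hit rate minus false-alarm rate.

    word_yes   -- real words answered 'yes' (hits)
    pseudo_yes -- pseudowords answered 'yes' (false alarms)
    """
    return word_yes / n_words - pseudo_yes / n_pseudo

# 'Yes' to 80 of 100 words and to 5 of 20 pseudowords:
print(yes_no_score(80, 100, 5, 20))   # 0.80 - 0.25 = 0.55
```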
In summary, the test formats differ in presentation format and response
type. The VLT, VST, and CATTS formats all present target words with a
set of alternatives containing the correct meaning. The Yes/No Test pres-
ents only the target items, either in a checklist or individually, with no
other recognition cues available. Accordingly, the response types differ,
with the VLT/VST/CATTS response involving the identification of the
correct word among available alternatives, and the Yes/No Test, a simple
yes/no judgment as to whether the item is known. The formats also differ
in the amount of strategic processing possible. Multiple response alterna-
tives allow the test-taker to choose among more or less likely responses,
with the chances of correct responses improved by being able to defini-
tively reject highly unlikely alternatives. There is also the scope for the
test-taker to vary the time spent on specific items. The earlier checklist
versions of the Yes/No Test also allow some strategic processing, as the
entire list can be examined and attention apportioned as desired. In the
single-item presentation format, attention is limited to the current item,


with no opportunity to go backward or forward to answer the ‘easy’ ones
first.

Comparing VLT and Yes/No Test Performance

Given the differences in presentation and response type, the question


arises as to how comparable the results of the respective formats might
be. Mochida and Harrington (2006) investigated this question by com-
paring performance on the two formats. In this study, university-level
English L2 participants completed both tests using the Yes/No format
and the VLT format. The same target items, drawn from the VLT, were
used in both tests (Schmitt et al. 2001) and pseudowords were added in
the Yes/No version. The online Yes/No Test version was completed first,
with no feedback given to the test-takers. The paper-based VLT was then
administered. The aim of the study was to compare alternative scoring
methods for the Yes/No Test. The results yielded only small differences
across the different scoring formulas examined. Figure 2.4 presents the
results for Yes/No Test and VLT performance. The Yes/No Test results were scored by subtracting the proportion of pseudowords incorrectly identified as words ('false alarms') from the proportion of words correctly recognized ('hits'). This formula is used in the studies reported in Part 2.

Fig. 2.4 Comparison of VLT and Yes/No Test Performance (Mochida and Harrington 2006; y-axis: percentage correct; x-axis: frequency-of-occurrence bands 2K, 3K, 5K, 10K)
It was evident that performance on the two tests is very similar across
all the frequency levels for the group of students tested. In an earlier
study, Cameron (2002) compared performance on the Yes/No Test and
the VLT by secondary ESL students in the UK. In contrast to Mochida
and Harrington’s (2006) results, the scores on the two tests did not cor-
relate. However, the Cameron study used a Yes/No Test including differ-
ent items from the VLT. The secondary-level participants were also of
lower language proficiency and produced a much higher error rate for the
pseudowords, making direct comparisons between the two studies
difficult.
Mochida and Harrington’s (2006) results suggest that the frequency
band format used in the VLT and the Yes/No Test yield similar size mea-
sures despite the difference in item presentation and response type. On a
more practical note, the respective formats also make significantly differ-
ent time demands: the VLT version took almost 30 minutes to complete,
while the Yes/No Test took around five minutes.

2.3 Uses of the Vocabulary Size Tests


The measurement of vocabulary size has both pedagogical and research
motivations, with the development and use of the tests driven in particu-
lar by interests in vocabulary instruction and assessment.

Vocabulary Size and Text Coverage

Identifying the number of words a user needs to understand an L2 text


adequately has been of long-standing interest in vocabulary size
research. The benchmark used in both written and spoken text compre-
hension is text coverage. This refers to the number of words a reader
needs to know to ensure that at least one meaning for all the words
encountered in a given text is known (Laufer 1992). There is a consen-
sus that 95% text coverage requires knowledge of 1K–3K word families
(Nation 2006). However, this affords only a basic level of comprehen-


sion, with upward of 98–99% text coverage needed for a more com-
plete understanding. To meet this level of text coverage, knowledge of
words in the 8K–9K range is needed. These figures are generalizations
across text types and genres that can differ significantly in difficulty and
purpose but have been widely recognized. They are discussed in more
detail in Chap. 1.
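In computational terms, coverage is simply the proportion of running words (tokens) in a text that the reader knows. A minimal sketch, assuming whitespace tokenization and a set of known word forms (both simplifications of how coverage is computed in the studies cited):

```python
def text_coverage(text, known_words):
    """Percentage of running words (tokens) in `text` that are known."""
    tokens = text.lower().split()
    known = sum(1 for t in tokens if t.strip('.,;:!?"\'()') in known_words)
    return 100 * known / len(tokens)

known = {"the", "cat", "sat", "on", "mat"}
print(f"{text_coverage('The cat sat on the mat.', known):.0f}% coverage")  # 100%
```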
These thresholds provide a target for learning, and the tests are
often used in instructional settings for diagnostic and screening purposes.
The ready availability of the tests at online sites such as www.my.vocabularysize.com, www.lextutor.ca, and www.lognostics.co.uk also makes it
possible for individual learners to estimate their vocabulary size and track
their vocabulary learning progress. The VLT, in particular, has been
widely used in instructional settings to provide a relatively quick way to
assess whether learners have mastered the higher-frequency bands (up
through 3K) that are crucial for meaning-based learning activities in the
L2. In fact, Nation sees this diagnostic function as a primary purpose of
the test (Nation 2013). The X-Lex and Y-Lex tests also assess mastery of
the higher-frequency bands.

Vocabulary Size and Proficiency Standards

The tests have also been used to examine the relationship between recog-
nition vocabulary size and L2 proficiency that involves more than reading
skill alone. Individual difference in size has been correlated with a range
of standardized tests such as TOEFL, TOEIC, IELTS, and the Common
European Framework of Reference for Languages (CEFR). All measures
provide a characterization of overall English proficiency for academic,
government, and employment purposes. TOEFL performance by various
L1 learner groups has been correlated with the VLT (Qian 2002; Alavi
2012), VST (McLean et al. 2014), and Yes/No Test performance (Meara
and Milton 2003; Milton 2009). TOEIC scores have also been correlated
with the VLT (Kanzaki 2015), VST (Kanzaki 2015; McLean et al. 2014;
Stewart 2014), and Yes/No Test performance (Kanzaki 2015; Stubbe
2015). Similarly, variability in IELTS scores has been related to the VLT
performance differences (Alavi 2012) and the Yes/No Test performance
(Milton 2009; Stæhr 2008). The Yes/No Test has also been related to the
CEFR scale (Alderson 2005; Milton and Alexiou 2009).
This research consistently shows that individual differences in recogni-
tion vocabulary size are sensitive to outcomes on the criterion standard
examined. This sensitivity is evident in two ways. The first is the presence
of a statistically significant difference among the levels as a function of
vocabulary size or in the strength of association between the size and
proficiency levels. This is expressed in the p value, usually set at <0.05.
The other aspect of sensitivity is the effect size, which is the strength, or
magnitude, of the statistically significant result. In correlation and regres-
sion analyses the effect is expressed in the R2 value, also called the coeffi-
cient of determination. This signifies the amount of variance in the
differences in the proficiency levels that can be attributed to individual
differences in vocabulary test performance. The R2 value is an important
benchmark for comparing effect sizes across the different studies in the
later chapters. Other effect size statistics will be used when testing mean
differences using the t-test and ANOVA, and their nonparametric alter-
natives as well.
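For the bivariate case, the R2 value is simply the square of the Pearson correlation coefficient. The toy calculation below uses made-up score pairs purely to show how an observed r converts into a proportion of shared variance; no data from the studies discussed are involved.

```python
from statistics import correlation  # available from Python 3.10

# Made-up score pairs for illustration only
vocab_size = [2100, 3400, 4200, 5100, 5800, 6900]
proficiency = [410, 520, 500, 640, 600, 710]

r = correlation(vocab_size, proficiency)
print(f"r = {r:.2f}, R2 = {r ** 2:.2f} ({r ** 2:.0%} of variance shared)")
```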
The studies cited above all report statistically significant results for the
size tests as discriminators of performance. However, they differ greatly in
the amount of variance accounted for in the criterion measure of interest.
Kanzaki (2014) examined VLT and VST performance as predictors of
TOEIC performance in Japanese learners. The VLT results correlated
with multiple TOEIC versions, r = 0.5–0.7, or between 25% and 50% of
the variance accounted for, while the VST scores produced slightly lower
correlations of 0.4–0.6, or 16–36% of the variance in the TOEIC scores.
Milton and Alexiou (2009) examined Yes/No Test performance as a
­predictor of CEFR scale placement for learners of English, French, and
Greek learned as foreign language (FL)/L2 in Greece, Hungary, Spain,
and the UK. Vocabulary size accounted for around 70% of the variance
for Greek L2 and EFL (English as foreign language) learners in Greece
and French FL learners in Spain, but only about 17% of that for the
CEFR levels for EFL learners in Hungary. The differences in effect sizes
across the groups, criterion standards, and vocabulary size tests in just
these two studies show that the relationship between vocabulary size and
proficiency standard is affected by the size measure used, the participants,
the setting, and the criterion outcome examined. Chap. 8 in Part 2 com-
pares performance on the Timed Yes/No Test and IELTS.

Vocabulary Size and Performance in Instructional Settings

The tests have also been used to examine the relationship between vocab-
ulary size and individual proficiency differences in specific learning set-
tings and skill areas. As is the case with the standardized tests, these
proficiency domains include reading skill as a central component, but
also tap the range of language skills that contribute to overall language
performance. Both test formats have been examined as potential tools
for language program placement decisions. Placement outcomes have
been related to performance in the VLT (Akbarian 2010; Clark and
Ishida 2005), VST (Gee and Nguyen 2015) and Yes/No Test (Harrington
and Carey 2009; Harsch and Hartig 2015; Lam 2010; Meara and Jones
1990). All the studies show test outcomes to be sensitive to placement
level differences to some degree. However, when compared directly with
other placement measures, such as an in-house placement test
(Harrington and Carey 2009) or other vocabulary test formats such as
the C-test (Harsch and Hartig 2015), the size measure alone proves to be
less sensitive than the more complex measures. This issue is examined in
Chap. 9.
Vocabulary size has also been examined as a predictor of various types
of academic English performance. It has been related to classroom perfor-
mance (Morris and Cobb 2004; Roche and Harrington 2013), where size
is a moderately strong predictor of course grade outcomes. Size has also
been related to overall grade point average (Harrington and Roche 2014;
Roche et al. 2016), although the link is weaker. The latter research was
undertaken to identify students at potential academic risk due to lan-
guage proficiency limitations and is examined in Chap. 10. EFL writing
outcomes have been correlated with performance on the VLT (Lemmouh
2008) and Yes/No Test (Roche and Harrington 2013) recognition vocabulary
measures. Similarly, EFL listening skill has been correlated with VLT performance in both university students and lower-proficiency secondary students (Stæhr 2008). The range of domains illustrates the
sensitivity of vocabulary size differences to virtually all types of language
performance. However, the question remains as to how sensitive those
differences are and thus how useful they might be in characterizing user
performance.
A handful of studies also suggest that vocabulary size correlates with
other dimensions of L2 performance beyond reading and listening. VST
performance has been related to better phonetic discrimination skills
(Bundgaard-Nielsen et al. 2011). Better Yes/No Test performance has
also been correlated with superior learning strategy use (Kojic-Sabo and
Lightbown 1999). Both studies open up the possibility that recognition
vocabulary size may affect L2 performance in more ways than previously
imagined.

Receptive and Productive Vocabulary Size

The focus of this book is on written recognition vocabulary size, but


analogous research has been undertaken on spoken recognition vocabu-
lary size as measured by the Listening VLT (McLean et al. 2015) and the
A-Lex, a spoken version of the Yes/No Test (Milton 2009). A sizeable
body of research also relates to the role of productive vocabulary size in
L2 writing in the VLT/VST tradition (Laufer and Nation 1995; Bardel
and Lindquist 2011; East 2004; Laufer 2005a, b; Webb 2007). Although
there are significant differences between written and spoken language
performance, the underlying importance of vocabulary size in fluent per-
formance is the same.

Comparing the Size Test Formats

Recognition vocabulary size is a stable correlate of differences in L2 profi-


ciency across a range of domains. As one of three dimensions of the lexical
facility construct, how it is measured is a central concern. Is one approach
better than the other? Both use word frequency statistics to index vocabu-
lary size but differ substantially in format. The VLT/VST format uses
traditional matching and multiple-choice items, while the Yes/No Test
uses a simple self-report format to indicate whether a test-taker knows an
item. It also includes pseudowords to control for guessing, a distinctive
feature of the test that is also the most problematic. This issue is discussed
in Chap. 5, which introduces and describes the Timed Yes/No Test.
The lexical facility account uses the Yes/No Test format because it
directly taps knowledge of the target items in a way that is not affected by
the cues used in possible response alternatives (in monolingual or bilin-
gual versions). Presenting target items one at a time also eliminates the
guessing advantage afforded by an individual being able to reject unlikely
alternatives and greatly reduces the scope for the strategic allocation of
attention within and across items. The single-word presentation format
also allows for a direct measure of individual word recognition speed.

2.4 Conclusions
This chapter introduced the two most widely used approaches to measur-
ing L2 recognition vocabulary size, the multiple-choice VLT/VST and
the simple word recognition Yes/No Test. Both approaches are similar in
that vocabulary size is estimated from test performance on a set of words
systematically sampled from frequency-of-occurrence bands. Size esti-
mates based on this performance have been shown to be correlated to
variability in outcomes in a range of L2 performance domains. These
include proficiency standards such as IELTS and the CEFR and more
localized measures such as program placement and academic English per-
formance. The two approaches differ substantially in format, with the
VLT/VST approach using multiple-choice tasks and the self-report Yes/No Test merely requiring users to indicate whether they know the word.
In the following chapters, the Timed Yes/No Test uses this format to measure lexical facility. The test combines the size measure described here
with a measure of word recognition speed. The motivation for including
recognition time in a measure of vocabulary skill is set out in the next
chapter, where research on L2 word recognition skill is discussed.
References
Akbarian, I. (2010). The relationship between vocabulary size and depth for ESP/
EAP learners. System, 38(3), 391–401. doi:10.1016/j.system.2010.06.013.
Alavi, S. M. (2012). The role of vocabulary size in predicting performance on
TOEFL reading item types. System, 40(3), 376–385.
Alderson, J. (2005). Diagnosing foreign language proficiency: The interface between
learning and assessment. New York: Continuum.
Bardel, C., & Lindquist, C. (2011). Developing a lexical profiler for spoken
French L2 and Italian L2: The role of frequency, thematic vocabulary and
cognates. EUROSLA Yearbook, 11, 75–93. doi:10.1075/eurosla.11.06bar.
Beeckmans, R., Eyckmans, J., Janssens, V., Dufranne, M., & Van de Velde, H.
(2001). Examining the yes/no vocabulary test: Some methodological issues
in theory and practice. Language Testing, 18(3), 235–274.
Beglar, D. (2010). A Rasch-based validation of the vocabulary size test. Language
Testing, 27(1), 101–118. doi:10.1177/0265532209340194.
Beglar, D., & Hunt, A. (1999). Revising and validating the 2000 word level and
the university word level vocabulary tests. Language Testing, 16(2), 131–162.
doi:10.1191/026553299666419728.
Bundgaard-Nielsen, R. L., Best, C. T., & Tyler, M. D. (2011). Vocabulary size
is associated with second-language vowel perception performance in adult
learners. Studies in Second Language Acquisition, 33(3), 433–461. doi:10.1017/
S0272263111000040.
Cameron, L. (2002). Measuring vocabulary size in English as an additional lan-
guage. Language Teaching Research, 6(2), 145–173.
Clark, M. K., & Ishida, S. (2005). Vocabulary knowledge differences between
placed and promoted students. Journal of English for Academic Purposes, 4(3),
225–238. doi:10.1016/j.jeap.2004.10.002.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2),
213–238.
Culligan, B. (2015). A comparison of three test formats to assess word difficulty.
Language Testing, 32(4), 503–520.
East, M. (2004). Calculating the lexical frequency profile of written German
texts. Australian Review of Applied Linguistics, 27(1), 30–43.
Elgort, I. (2013). Effects of L1 definitions and cognate status of test items on the
vocabulary size test. Language Testing, 30(2), 253–272. doi:10.1177/
0265532212459028.
Gee, R. W., & Nguyen, L. T. C. (2015). The bilingual vocabulary size test for
Vietnamese learners: Reliability and use in placement testing. Asian Journal
of English Language Teaching, 25, 63–80.
Harrington, M. (2006). The lexical decision task as a measure of L2 lexical pro-
ficiency. EUROSLA Yearbook, 6(1), 147–168.
Harrington, M., & Carey, M. (2009). The online yes/no test as a placement
tool. System, 37(4), 614–626. doi:10.1016/j.system.2009.09.006.
Harrington, M., & Roche, T. (2014). Identifying academically at-risk students
at an English-medium university in Oman: Post-enrolment language assess-
ment in an English-as-a-foreign language setting. Journal of English for
Academic Purposes, 15, 34–37.
Harsch, C., & Hartig, J. (2015). Comparing C-tests and yes/no vocabulary size
tests as predictors of receptive language skills. Language Testing, 33(4), 555–575.
Huibregtse, I., Admiraal, W., & Meara, P. (2002). Scores on a yes-no vocabulary
test: Correction for guessing and response style. Language Testing, 19(3),
227–245.
Kanzaki, M. (2014). Comparing TOEIC® and vocabulary test scores. In G.
Brooks, M. Grogan, & M. Porter (Eds.), 2014 PanSIG conference proceedings
(pp. 52–58). Miyazaki: JALT.
Kanzaki, M. (2015). Comparing TOEIC® and vocabulary test scores. In
G. Brooks, M. Grogan, & M. Porter (Eds.), 2014 PanSIG conference proceedings
(pp. 52–58). Miyazaki: JALT.
Kojic-Sabo, I., & Lightbown, P. M. (1999). Students’ approaches to vocabulary
learning and their relationship to success. The Modern Language Journal,
83(2), 176–192. doi:10.1111/0026-7902.00014.
Lam, Y. (2010). Yes/No tests for foreign language placement at the post-­
secondary level. Canadian Journal of Applied Linguistics/Revue canadienne de
linguistique appliquee, 13(2), 54–72.
Laufer, B. (1992). How much lexis is necessary for reading comprehension? In
P. J. L. Arnaud & H. Béjoint (Eds.), Vocabulary and applied linguistics
(pp. 126–132). London: Macmillan. doi:10.1007/978-1-349-12396-4_12.
Laufer, B. (2005a). Focus on form in second language vocabulary learning.
EUROSLA Yearbook, 5(1), 223–250.
Laufer, B. (2005b). Lexical frequency profiles: From Monte Carlo to the real
world. A response to Meara. Applied Linguistics, 26(4), 582–588.
Laufer, B., & Levitzky-Aviad, T. (2016). CATTS (Computer Adaptive Test of
Size & Strength). Downloaded May 1, 2016, from http://www.lextutor.ca/tests/levels/recognition/nvlt/paper.pdf
Laufer, B., & Nation, P. (1995). Vocabulary size and use: Lexical richness in L2
written production. Applied Linguistics, 16(3), 307–322.
Laufer, B., & Nation, P. (2001). Passive vocabulary size and speed of meaning
recognition: Are they related? EUROSLA Yearbook, 1(1), 7–28.
Lemmouh, Z. (2008). The relationship between grades and the lexical richness
of student essays. Nordic Journal of English Studies, 7(3), 163–180.
McLean, S., Hogg, N., & Kramer, B. (2014). Estimations of Japanese university
learners’ English vocabulary sizes using the vocabulary size test. Vocabulary
Learning and Instruction, 3(2), 47–55.
McLean, S., & Kramer, B. (2015). The creation of a new vocabulary levels test. In
G. Brooks, M. Grogan, & M. Porter (Eds.), 2014 PanSIG conference proceed-
ings (pp. 1–11). Miyazaki: JALT.
Meara, P., & Buxton, B. (1987). An alternative to multiple choice vocabulary
tests. Language Testing, 4(2), 142–145.
Meara, P., & Jones, G. (1988). Vocabulary size as placement indicator. In
P. Grunwell (Ed.), Applied linguistics in society (pp. 80–87). London: CILT.
Meara, P., & Jones, G. (1990). Eurocentres vocabulary size test. 10KA. Zurich:
Eurocentres.
Meara, P. M., & Milton, J. L. (2002). X_Lex: The Swansea vocabulary levels test.
Newbury: Express.
Meara, P. M., & Milton, J. (2003). X_Lex: The Swansea vocabulary levels test.
Swansea: Lognostics.
Meara, P. M., & Miralpeix, I. (2006). Y_Lex: The Swansea advanced vocabulary
levels test. v2.05. Swansea: Lognostics.
Milton, J. (2009). Measuring second language vocabulary acquisition. Bristol:
Multilingual Matters.
Milton, J., & Alexiou, T. (2009). Vocabulary size and the common European
framework of reference for languages. In B. Richards, M. H. Daller, D. D.
Malvern, P. Meara, J. Milton, & J. Treffers-Daller (Eds.), Vocabulary studies
in first and second language acquisition (pp. 194–211). Basingstoke: Palgrave
Macmillan.
Mochida, A., & Harrington, M. (2006). The yes-no test as a measure of receptive vocabulary knowledge. Language Testing, 23(1), 73–98. doi:10.1191/0265532206lt321oa.
Morris, L., & Cobb, T. (2004). Vocabulary profiles as predictors of the academic
performance of teaching English as a second language trainees. System, 32(1),
75–87. doi:10.1016/j.system.2003.05.001.
Nation, I. S. P. (1990). Teaching and learning vocabulary. Rowley: Newbury
House.
Nation, I. S. P. (2006). How large a vocabulary is needed for reading and lis-
tening? The Canadian Modern Language Review/La Revue Canadienne des
Langues Vivantes, 63(1), 59–82.
Nation, I. S. P. (2012). The vocabulary size test: Information and specifications.
Retrieved from http://www.victoria.ac.nz/lals/about/staff/publications/paul-nation/Vocabulary-Size-Test-information-and-specifications.pdf
Nation, I. S. P. (2013). Learning vocabulary in another language (2nd ed.).
Cambridge, UK: Cambridge University Press.
Nation, P., & Coxhead, A. (2014). Vocabulary size research at Victoria University
of Wellington, New Zealand. Language Teaching, 47(03), 398–403.
Pellicer-Sánchez, A., & Schmitt, N. (2012). Scoring yes-no vocabulary tests:
Reaction time vs. nonword approaches. Language Testing, 29(4), 489–509.
doi:10.1177/0265532212438053.
Qian, D. D. (2002). Investigating the relationship between vocabulary knowl-
edge and academic reading performance: An assessment perspective. Language
Learning, 52(3), 513–536.
Read, J. (1988). Measuring the vocabulary knowledge of second language learners.
RELC Journal, 19, 12–25.
Roche, T., & Harrington, M. (2013). Recognition vocabulary knowledge as a
predictor of academic performance. Language Testing in Asia, 3, 1–12.
Roche, T., Harrington, M., Sinha, Y., & Denman, C. (2016). Vocabulary recog-
nition skill as a screening tool in English-as-a-Lingua-Franca University set-
tings. In J. Read (Ed.), Post-admission language assessment of University students,
English language education (Vol. 6, pp. 159–178). Switzerland: Springer.
Schmitt, N. (2010). Researching vocabulary: A vocabulary research manual.
Basingstoke: Palgrave Macmillan.
Schmitt, N., Schmitt, D., & Clapham, C. (2001). Developing and exploring
the behaviour of two new versions of the vocabulary levels test. Language
Testing, 18(1), 55–89. doi:10.1191/026553201668475857.
Shiotsu, T., & Read, J. (2009, November). Extending the yes/no test as a measure
of the English vocabulary knowledge of Japanese learners. Paper presented at The
measurement of L2 lexical development colloquium, Annual Conference of
the Applied Linguistics Association of Australia, Brisbane.
Stæhr, L. S. (2008). Vocabulary size and the skills of listening,
reading and writing. Language Learning Journal, 36(2), 139–152.
doi:10.1080/09571730802389975.
Stewart, J. (2014). Do multiple-choice options inflate estimates of vocabulary size on the VST? Language Assessment Quarterly, 11(3), 271–282. doi:10.1080/15434303.2014.922977.
Stubbe, R. (2015). Replacing translation tests with yes/no tests. Vocabulary Learning and Instruction, 4(2), 38–48. doi:10.7820/vli.v04.2.stubbe.
Webb, S. (2007). The effect of repetition on vocabulary knowledge. Applied Linguistics, 28(1), 46–65. doi:10.1093/applin/aml048.
Zhang, X., & Lu, X. (2013). A longitudinal study of receptive vocabulary
breadth knowledge growth and fluency development. Applied Linguistics,
35(3), 283–304. doi:10.1093/applin/amt014.
3 L2 Word Recognition Skill and Its Measurement

Aims

• Introduce word recognition skill as an aspect of second language (L2) vocabulary knowledge.
• Identify the role of word recognition skill in text comprehension.
• Describe the lexical decision task (LDT) as a tool for measuring word recognition skill.
• Examine LDT performance as a window on word knowledge.

3.1 Introduction
Efficiency in recognizing individual words is a fundamental element of
fluent discourse comprehension. As with vocabulary size, fast and rela-
tively effortless word recognition is a foundation of second language (L2)
proficiency. It is also an aspect of language skill where first language (L1)
and L2 users differ markedly. In the words of Paul Meara, ‘[t]he ability to
recognize and retrieve words effortlessly seems to be a basic feature of the
performance of L1 speakers, and a feature that is conspicuously lacking
from the performance of most L2 speakers’ (2002, p. 404).

This chapter examines word recognition skill as a key aspect of L2
vocabulary skill and, in turn, a basic element of L2 proficiency. Word
recognition skill refers to the relative speed with which an individual rec-
ognizes individual words and the consistency of this recognition speed.
Vocabulary size and recognition skill are lower level processes that are a
primary constraint on fluent text comprehension. The motivation for
including size and speed in the lexical facility account is established by
first examining the importance of word recognition skill in fluent dis-
course processing and the development of L2 fluency. As is the case with
increasing vocabulary size, the development of fluent word recognition
skill is an input-driven process that reflects a user’s experience with the
language, making both size and speed directly affected by word frequency.
A traditional instrument for the measurement of word recognition
skill is the lexical decision task (LDT) (Balota and Chumbley 1984). The
LDT format is the basis of the Timed Yes/No Test used to test the lexical
facility proposal. The logic and design of the LDT are first presented,
including a description of how task performance is scored and inter-
preted. LDT performance is affected by factors related to both the nature
of the mental lexicon and the user’s experience with the language.
Performance on the task can thus provide a window on the mental lexi-
con and potentially a useful measure of L2 word recognition skill and its
development.

3.2 Word Recognition Skill and Text Comprehension
Text comprehension starts with the processing of words. Written text
comprehension begins with decoding visual forms and accessing their
related meanings in the mental lexicon. This skill has been characterized
as ‘the ability to rapidly derive a representation from printed input that
allows access to the appropriate entry in the mental lexicon, and thus, the
retrieval of semantic information at the word level’ (Hoover and Gough
1990, p. 130). The focus here is on the low-level word identification pro-
cesses responsible for initial access to meaning.
The importance of this skill is readily evident when viewed in the con-
text of the broader text comprehension process. The construction–inte-
gration (C–I) model (Kintsch 1998) provides a useful framework for
situating word recognition skill (and later, lexical facility) in overall dis-
course comprehension, as well as relating this skill to other aspects of L2
vocabulary knowledge. The two-stage model is presented in Fig. 3.1.
The first stage of comprehension involves the construction of a text
base from the words and phrases in a text. Crucial to this stage is word
recognition (or ‘word identification’, in Kintsch’s terms), which involves
recognizing and extracting meaning from surface forms. This process
involves both phonological and orthographic knowledge, and is assumed

[Fig. 3.1 Word recognition in the construction–integration model of text comprehension (figure adapted from Perfetti and Stafura (2014, p. 33)). The diagram shows written text feeding orthographic and phonological units into word recognition; the resulting semantic units feed sentence representations and word-to-text integration, which, together with prior knowledge, yield the text model and the situation model.]
to be modular in that it is carried out with little top-down influence
(Fodor 1983). Kintsch characterizes identification as the result of ‘dumb
activation’ of word forms arising from the interaction of the orthographic
and phonological stimuli with the semantic unit in the mental lexicon.
This interaction happens at the individual word level, which then feeds
into sentence-level representations. Lexical information in the text base
also triggers the retrieval of additional, related information from prior
knowledge stored in long-term memory.
The output of this construction process guides, or constrains, the
higher-order text integration processes that result in a situation model of
the world evoked by the text (Kintsch 2005). The situation model reflects
how an individual understands a text, or more precisely, the situation
evoked by a text. This situation can be real or imaginary and is important
because it facilitates other key facets of understanding beyond the text,
particularly the use of inferencing and general knowledge in comprehen-
sion. An inadequate situation model, or one that is not relevant to the
task context, will limit understanding. Constructing the text base requires
the reader to access words quickly to form a semantic network that must
be continually revised to reflect the overall emerging discourse. The con-
struction and integration phases make significant demands on working
memory, that is, the capacity to simultaneously process new information
while continuing to build up a coherent understanding of the text (Juffs
and Harrington 2011). Nonfluent word recognition skills make a large
demand on overall processing resources and give rise to incomplete or
invalid situation models.
Efficient word recognition skill sits at the base of the text comprehen-
sion process. As a low-level process, it involves recognizing quickly what
letters and words mean as a crucial step to integrating this information
into meaningful sentences, paragraphs, and larger discourse units. Slow,
inefficient word recognition can create a bottleneck in this integration
process that impedes comprehension. Not surprisingly, word recognition
skill has been shown to be a strong predictor of L1 reading skill across a
range of proficiency levels (LaBerge and Samuels 1974; Just and Carpenter
1992; Stanovich et al. 1991; Bell and Perfetti 1994; Holmes 2009). It has
also been shown to play a crucial role in L2 text comprehension.
3.3 Word Recognition Skill and L2 Text Comprehension
Fluent word recognition is a basic characteristic of L1 reading perfor-
mance, but is often limited in L2 readers (Fender 2001; Meara 2002).
While its importance is widely acknowledged, the contribution of L2
word recognition skill to L2 reading and L2 proficiency more generally
has received somewhat limited attention from L2 reading researchers. This
lack of interest has been attributed, in part, to an assumption that L2 word
recognition skill largely mirrors that of the L1 (Nassaji 2014). However,
while there are fundamental similarities between the L1 and the L2, there
are also striking differences. Unlike the child L1 reader, the adult L2 reader
typically begins reading in the L2 with a limited oral vocabulary. This
means that the development of L2 word recognition skills goes beyond the
simple mapping of known spoken words onto new orthographic forms.
The L2 learner typically also brings fluent L1 word recognition skills to the
L2 reading process. These cross-linguistic effects, especially those of L1
orthographic and phonological skills on L2 orthographic processing, can
have both positive and negative effects (Koda 2005).
Researchers that have examined L2 word recognition skill have estab-
lished it as an important predictor of fluent reading. In one of the earliest
studies, Favreau and Segalowitz (1983) identified more efficient word rec-
ognition processes as a key predictor of fluent L2 reading in bilingual read-
ers. Subsequent research has, with few exceptions, implicated fluent word
recognition skill in fluent L2 reading by adults (Akamatsu 2003; Koda
1992; Nassaji 2014; Shiotsu 2009) and younger readers (Geva and Wang
2001). Word recognition skills were shown to play a central role in L2
reading proficiency in a recent meta-analysis by Jeon and Yamashita (2014).
The analysis comprised 59 studies drawn from the applied linguistics and
L2 reading literature, and included data from child, ­bilingual, and adult
readers across a variety of contexts. Ten factors were correlated with L2
reading skill, ranging from higher-order cognitive processes such as meta-
cognition to low-level word decoding. The factors are listed in Table 3.1.
The ten factors in the central column are drawn from the original study and are divided into high-evidence (18 or more effect sizes) or low-evidence factors. The factor types in the left column have been added here.

Table 3.1 A meta-analysis of factors affecting L2 reading skill (Jeon and Yamashita 2014; N = 59 studies)

Type                     Factor                        r     Evidence level
Word recognition skill   L2 vocabulary knowledge       0.79  High
                         L2 decoding                   0.56  High
                         Phonological awareness        0.48  Low
                         Orthographic knowledge        0.51  Low
                         Morphological knowledge       0.61  Low
Grammar knowledge        L2 grammar knowledge          0.85  High
L1 reading skill         L1 reading comprehension      0.77  High
L2 proficiency           L2 listening comprehension    0.77  Low
Cognitive processes      Working memory                0.42  Low
                         Metacognition                 0.32  Low

High-evidence correlates for L2 reading skill include L2 grammar knowledge, L1 reading comprehension, L2 vocabulary knowledge, and
L2 decoding—in that order of effect size. Low-evidence factors include
L2 phonological awareness, L2 orthographic knowledge, L2 morphologi-
cal knowledge, L2 listening comprehension, working memory (L1 and
L2 measures combined), and metacognition. Five of these factors can be
considered word recognition subskills. L2 vocabulary knowledge includes
the measures of vocabulary size examined in Chaps. 1 and 2, as well as mea-
sures of vocabulary depth. Second to L2 grammar knowledge (r = 0.85),
L2 vocabulary knowledge has the strongest correlation with L2 reading
(r = 0.79). The correlations with L2 reading skill for other factors ranged
from 0.4 to 0.6. Word recognition speed by itself does not appear in the
list of factors; rather, it is subsumed under the category of L2 decoding
skills. Decoding includes all measures that assess either silent or oral read-
ing efficiency and/or accuracy in processing L2 pseudowords and/or real
words. The analysis of the relationship between L2 decoding and L2
reading comprehension identified 20 independent decoding–reading
skill correlations from 18 studies. The studies represented eight different
languages, with participants ranging from kindergarten to postgraduate-­
level study. Overall, L2 decoding had a mean correlation of 0.56, just short of being considered a strong effect (Plonsky and Oswald 2014), but substantial nonetheless, given that it was drawn from such a heterogeneous set of studies and measures.
The lexical facility construct combines L2 vocabulary knowledge and
decoding skill, the latter being characterized by the speed with which real
words can be accurately recognized. It operationalizes word recognition
skill as a three-part construct consisting of vocabulary size (breadth), rec-
ognition speed, and the consistency with which words are recognized. The
latter is indexed by the coefficient of variation (CV) (Segalowitz and
Segalowitz 1993), a measure that has been implicated in the development
of lexical automaticity. The measure and its uses are described later.
The three lexical facility measures are collected using the Timed Yes/
No Test described in Chap. 5. The test format is based on the LDT. The
next section gives an overview of the logic and structure of the LDT as a
measure of word recognition skill.

3.4  he LDT as a Measure of Word


T
Recognition Skill
The LDT has been widely used in cognitive psychology to study lexical
access (Balota et al. 2006). It has also been used in bilingual lexicon
research (Dijkstra 2005) and more recently in L2 lexical research, where
it has provided the basis for the timed component of the Yes/No Test
(Harrington 2006; Pellicer-Sánchez and Schmitt 2012). The latter is the
primary focus here. As the name suggests, the LDT has a lexical compo-
nent and a decision component.

The Lexical: Words and Not Words

The LDT requires an individual to make an immediate judgment about
a word item (a string of letters) presented on a computer screen. This
judgment can take various forms. It might be whether the item is an
actual word in a given language, has been seen previously, or relates to
some other feature of the word that is relevant to the particular research
question. The test items consist of both real words and made-up words
included to control for guessing. The real words are selected according to
the purpose of the study, which is typically to examine how recognition
speed is affected by such factors as word frequency, the number of sylla-
bles, the number of related words, and other factors. A number of these
factors will be introduced below. The made-up words are of two types. A
pseudoword (also called a legal nonword) is spelled and pronounced as a
real word but does not exist in the language, such as mordle in English
(Stone and Van Orden 1992). The other type of made-up word is a non-
word (illegal nonword) that is meaningless and does not conform to
the orthographic or phonological conventions of the language, such as
ztkumi in English. As accurate word performance requires contact with
actual word entries in the mental lexicon, including pseudowords ensures
that the individual is processing the test items for meaning. Using (illegal)
nonwords, on the other hand, tests the mastery of phonological and
orthographic rules of a language, especially in the L2 (Koda 2007;
Yamashita 2013).
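
To make the pseudoword/nonword distinction concrete, the following minimal sketch (a toy illustration, not drawn from any published test battery) classifies a letter string as orthographically legal only if every letter bigram it contains is attested in a small sample of real English words. The sample lexicon and the bigram rule are simplifying assumptions; real item construction also weighs phonological and positional constraints.

    # Minimal sketch (toy lexicon, hypothetical): a crude orthographic
    # legality check. A pseudoword such as 'mordle' passes because all of
    # its bigrams occur in real words; an illegal nonword such as 'ztkumi'
    # fails at the first unattested bigram.
    REAL_WORDS = {"house", "mouse", "word", "candle", "cat"}

    def attested_bigrams(words):
        """Collect every two-letter sequence occurring in the sample words."""
        return {w[i:i + 2] for w in words for i in range(len(w) - 1)}

    def looks_legal(item, bigrams):
        """True if every bigram in `item` is attested in the sample."""
        return all(item[i:i + 2] in bigrams for i in range(len(item) - 1))

    bigrams = attested_bigrams(REAL_WORDS)
    for item in ("mordle", "ztkumi"):
        label = "pseudoword-like" if looks_legal(item, bigrams) else "nonword-like"
        print(item, "->", label)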

The Decision: Speeded Judgments

In a typical LDT study, the participant is presented with a large number
of words and pseudoword/nonword items and instructed to respond to
them as accurately and quickly as possible. The nature of the response
depends on the aim of the study. Differences in the mean response times
by condition, participant, or both can be of interest to the researcher. The
relative speed of response reflects the strength of the underlying lexical
representation. This is in part a function of word-internal characteristics
such as its frequency of occurrence and the number of words ('neighbors')
with which it shares overlapping surface forms. These features are exam-
ined below. Response times are also affected by task effects, such as priming,
where a previously presented word influences the speed of response to a
subsequent word (Balota et al. 2006).
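
The core logic of such a task is simple to state. The sketch below (illustrative only; actual studies use dedicated experiment software with millisecond-accurate display and response collection) shows a bare trial loop: present an item, time the yes/no response, and record accuracy. The item list and console interface are hypothetical.

    # Minimal sketch (hypothetical items, console-based): the skeleton of
    # an LDT trial loop. Each trial records whether the yes/no judgment
    # matched the item's word status and how long the response took in ms.
    import time

    trials = [("house", True), ("mordle", False), ("cat", True)]  # (item, is_word)
    results = []

    for item, is_word in trials:
        start = time.perf_counter()
        answer = input(f"Is '{item}' a word? (y/n): ").strip().lower() == "y"
        rt_ms = (time.perf_counter() - start) * 1000.0
        results.append({"item": item, "correct": answer == is_word, "rt_ms": rt_ms})

    for r in results:
        print(r)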
The focus in LDT performance is on individual and group differ-
ences in response speed. As such, the task is designed so that word
recognition performance is nearly error free. Of interest is not whether
the individual knows the target word, but rather how fast a mean-
ing for the word can be accessed. By controlling for accuracy (i.e.,
ensuring that there are very few errors), the researcher can be confident
that the participant is not trading off accuracy for the speed of response,
or vice versa. This is important because interpreting accuracy and speed
performance where both variables vary presents the researcher with a
potential confound. Lower accuracy may be due to a tendency to
answer more quickly and not necessarily to a lack of underlying knowl-
edge, while slower responses might reflect greater care being taken in
making the correct response at the expense of answering quickly
(Pachella 1974).
As a result, typical applications of the LDT in psychological research
assume that the participant knows the words. The need for the partici-
pant to have a threshold level of vocabulary knowledge makes the tech-
nique less useful for child language research (Harley 2013) or for L2
research more generally, where vocabulary knowledge levels can vary
greatly. In L2 research, it has been used in bilingual research involving
advanced learners. Here, the interest is in investigating the organization
and interaction of the cross-linguistic mental lexicon, and the partici-
pants are proficient in both languages (Dijkstra 2005; Kroll et al. 2010).
Experimental L2 acquisition research has also used the LDT in studies
that involve only advanced learners (Elgort 2013; Favreau and Segalowitz
1983) or where the participants are trained on a set of items beforehand
to ensure error-free performance (Akamatsu 2008). In the lexical facility
account tested in Part 2, both response accuracy and response time will
be simultaneously examined as indices of L2 proficiency.

Lexical Decision as a Two-Stage Process

The LDT involves a simple yes/no response to a presented item, but the
factors affecting that judgment are complex. At a minimum, the task can
be broken down into two stages (Balota and Chumbley 1984; Jacobs and
Grainger 1994), as schematized in Fig. 3.2.
[Fig. 3.2 A schematic diagram of the lexical decision task: item presentation feeds (1) a recognition process, based on the strength and quality of the representation, which feeds (2) a decision process, based on test-taker bias, task demands, and degree of task understanding, which yields the response.]

Balota and Chumbley (1984) decompose the task into the word recognition processes that reflect the underlying strength of the representation and the decision processes that act on the output of the recognition phase. The mental representation of a word consists of phonological, orthographic, and semantic code information. The strength of these elements
is a function of the frequency with which the word has been used and its
relationship to other words in the mental lexicon. All things being equal,
a word with a stronger underlying representation will be recognized faster
and more consistently; it is this underlying representation strength and
its interaction with other specific task elements that is usually the object
of the researcher’s interest. The word recognition stage feeds into the
decision stage, where the participant makes the decision as to how to
respond. Decision-stage factors also affect the speed of response. Decision-
related factors include systematic effects, such as test-taker’s attitude and
motivation, and nonsystematic effects, such as attention on a given trial
or fatigue. Performance at the decision stage can also be affected by the
decision criteria used by the participant. These can be based on learner-
internal biases, as in the general tendency to be more conservative, or
task-induced demands arising from, for example, instructions to priori-
tize speed over accuracy (Wagenmakers et al. 2008).
Decision-stage factors introduce what the testing literature terms
construct-­irrelevant or measurement-based variance that obscures genuine
differences in skill (Brown 2005). Any attempt to use the LDT as a mea-
sure of L2 vocabulary skill needs to recognize and, to the greatest extent
possible, minimize the influence of these factors on the skill measures.

LDT Performance Measures

The LDT yields three measures of performance. Knowledge is measured by
the accuracy score, which reflects performance on word and pseudoword/
nonword items. Recognition speed is measured in terms of the mean response
time and the standard deviation (SD) of the mean response time. An addi-
tional measure of consistency, the CV, can also be calculated from the mean
response time and SD values. In the present study ‘mean response time’ is
labelled ‘mean recognition time’ (mnRT). The two concepts are the same.

Accuracy The LDT is a forced-choice task in which the participant pro-
vides a yes/no judgment about items presented individually as to whether
each meets a prespecified criterion (e.g., is a word). The test set consists
of both word and pseudoword/nonword items. The proportion of each
item type depends on the aims of the study, but tests typically use a 50/50 split between the two item types. Pseudowords (legal nonwords) are used
when the task involves the processing of the word’s meaning. In studies
investigating the mastery of graphophonic skill, nonwords may also be used
(Koda 2007; Yamashita 2013). Accuracy represents the ability to discrim-
inate between actual words and the pseudoword/nonword stimuli.

Correct ‘yes’ responses to words (termed hits) reflect the individual’s
knowledge of the word items. Incorrect ‘yes’ responses to nonwords or
pseudowords (false alarms) index an individual’s tendency to guess. As
noted, accuracy is usually used as a control variable to establish that the
participant knows the words, with speed of response and its variability
being the focus of correct responses. Participants with excessive false-­
alarm rates are usually excluded from analysis, though what constitutes
‘excessive’ is an issue.
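
One way to summarize such data, sketched below under stated assumptions rather than as the scoring method endorsed here, is to compute the hit rate, the false-alarm rate, a simple hits-minus-false-alarms adjustment (one of several corrections proposed in the yes/no testing literature), and an exclusion flag. The 10% false-alarm cut-off is purely hypothetical.

    # Minimal sketch (hypothetical threshold): summarizing yes/no accuracy.
    def yes_no_summary(word_yes, n_words, pseudo_yes, n_pseudo, max_fa=0.10):
        hit_rate = word_yes / n_words    # correct 'yes' responses to real words
        fa_rate = pseudo_yes / n_pseudo  # incorrect 'yes' to made-up items
        return {
            "hit_rate": hit_rate,
            "false_alarm_rate": fa_rate,
            "adjusted_score": hit_rate - fa_rate,  # one simple adjustment
            "exclude": fa_rate > max_fa,           # flag excessive guessing
        }

    # e.g., 45/50 words accepted, 8/50 pseudowords accepted:
    print(yes_no_summary(word_yes=45, n_words=50, pseudo_yes=8, n_pseudo=50))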

Time Measures The main LDT measure is the mean recognition time,
mnRT, for word items calculated over individuals, conditions, groups,
and items. Only correct responses are used in the calculation of the mnRT,
though incorrect responses are usually few. Interpreting the mnRTs also
requires a measure of variability of the sample on which the mnRT was
calculated. This is the SD, which estimates the average variability between data points in a sample and is used when the data are normally distributed. Where the data are not normally distributed, the median and range serve the same function as the mnRT and SD.
confidence the researcher has that the mean is a ‘true’ indication of the
underlying group mean.

In addition to the SD, another measure of variability used is the
CV. The CV is the ratio of the SD of the mnRT to the mnRT itself
(SDmnRT/mnRT) and serves as a measure of recognition time consistency.
The CV characterizes the relationship between the mnRT and the SD in
a single value, which can be used to compare performance across groups
and individuals, and to track both over time. In the same way that the mnRT
reflects how fast a participant recognizes a set of items, the CV indexes
how consistent the recognition speed is across items in a set, independent
of the mnRT.
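
As a minimal illustration, the sketch below computes the three time-based quantities just described from a set of correct-response recognition times; the millisecond values are hypothetical.

    # Minimal sketch (hypothetical RTs in ms): mnRT, SD, and CV = SD/mnRT.
    from statistics import mean, pstdev

    correct_rts = [512, 498, 533, 621, 505, 587, 540, 560]

    mnRT = mean(correct_rts)  # mean recognition time
    sd = pstdev(correct_rts)  # variability of the recognition times
    cv = sd / mnRT            # coefficient of variation: SD(mnRT) / mnRT

    print(f"mnRT = {mnRT:.0f} ms, SD = {sd:.0f} ms, CV = {cv:.2f}")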
The changing relationship between the mnRT and the CV in the indi-
vidual over time has been examined as a window on the development of
automaticity. Segalowitz and colleagues have argued that a positive cor-
relation between the mnRT and the CV in the presence of decreasing
mnRTs signals the emergence of automaticity in word recognition
(Segalowitz and Segalowitz 1993; Segalowitz et al. 1995, 1998). Empirical
support for the proposal has appeared (Harrington 2006; Akamatsu
2008), but both theoretical and empirical challenges to the account have
been set out in a study by Hulstijn et al. (2009).
The lexical facility account incorporates vocabulary size, the mnRT,
and the CV as measures of L2 vocabulary. Segalowitz et al. (1998)
described the mnRT as ‘a useful index of word recognition skill’ (p. 59).
The research program testing the lexical facility proposal in Part 2 seeks
to establish whether the CV can serve as another.

3.5 LDT Performance as a Window on Word Knowledge
A 2016 Google Scholar search using the search terms ‘lexical decision
task’ and ‘lexical decision tests’ elicited over 25,000 hits. This voluminous
body of research has resulted in a sophisticated understanding of the pro-
cesses and knowledge structures that subserve LDT performance. Jiang
(2013, pp. 78–82) identifies a range of factors that can affect response
speed in the LDT. They are, in turn, also key factors in vocabulary learn-
ing, representation, and use.

Frequency One of the strongest effects on recognition speed is the fre-
quency with which a target word occurs in a language (Ellis 2002). High-­
frequency items such as cat are recognized more quickly than low-frequency
ones such as platypus. Frequency also plays a central role in the way that
L2 vocabulary is learned and processed, as was discussed in Chaps. 1 and
2 (Schmitt 2010, p. 63).

Familiarity Frequency provides an objective, corpus-based estimate of
the likelihood that the word will be known and the speed with which it
will be recognized. However, individual learners have different learning
experiences, and typically know some low-frequency words not predicted
by their current general level of proficiency. Familiarity-based ratings
based on subjective judgments by users have been shown to be superior
to corpus-based word frequency accounts in predicting response times
(Lewellen et al. 1993).
Neighbors Words also differ in the number of closely related words, or
neighbors, they have in the mental lexicon. Neighbors are words that dif-
fer by only one letter, such as house and mouse. Words with more neigh-
bors are recognized more quickly in what is called the ‘neighborhood
density effect’ (Andrews 1992). However, the number of neighbors can
also slow down responses in some instances: a high-frequency word with
many neighbors is recognized more slowly than a high-frequency word
with fewer neighbors (Carreiras et al. 1997). Of course, the number of
neighbors an individual knows is a function of vocabulary size.
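
As a concrete illustration, the sketch below counts orthographic neighbors, that is, same-length words differing in exactly one letter position, over a toy lexicon; an actual count would be run against a large frequency-ranked word list.

    # Minimal sketch (toy lexicon, hypothetical): counting one-letter-
    # different neighbors, the measure behind the neighborhood density effect.
    LEXICON = {"house", "mouse", "horse", "cat", "mat", "map"}

    def is_neighbor(a, b):
        """Same length and exactly one differing letter position."""
        return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

    def neighbors(word, lexicon):
        return sorted(w for w in lexicon if is_neighbor(word, w))

    print(neighbors("house", LEXICON))  # ['horse', 'mouse']
    print(neighbors("cat", LEXICON))    # ['mat']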

Age of Acquisition The age at which a word is learned affects how fast
words are recognized. Words learned at a younger age are recognized
faster than those learned later.

Regularity of Sound-Spelling Correspondences The regularity of sound-
spelling correspondences in words also affects recognition speed, though
the effect is complex. For example, key is recognized faster than quay. A frequency-by-regularity interaction also occurs in which regular low-frequency words (yak) are recognized more quickly than irregular low-frequency words (yacht). This interaction is not evident for high-frequency words, with the regular cat recognized as fast as the irregular have. The
English spelling system presents a particular challenge for L2 learners
because the correspondence between a word’s pronunciation and its spelling
can be quite irregular. Although some basic rules can be explicitly taught,
the many exceptions (e.g., night, byte, and kite) have to be learned individu-
ally and require substantial experience with the language. This experience is
reflected in LDT performance.

Number of Meanings The number of meanings a word expresses will
affect how quickly it is recognized. Break will be recognized faster than
bifurcate, given that it has many more meanings. This feature is closely
related to the notion of vocabulary depth and is also a direct reflection of
an individual’s experience with the language.

Number of Associates An associate refers to all words activated when a
target word is recognized. Words with more associates are generally recog-
nized faster, though the effect is complex and sensitive to task conditions
(Yap and Balota 2015). Naturally, the number of meanings and associates
a target has is directly related to vocabulary size.

There are also three factors related to the use of pseudowords and non-
words that affect recognition times.

Lexicality Actual words are recognized faster than pseudowords or
nonwords.

Nonword Advantage Nonwords, as used in this book, refer to words that
do not conform to the orthographic conventions of a given language,
while pseudowords do. The latter could be words but are not. Nonwords
are typically rejected faster than pseudowords.

Pseudohomophones A pseudohomophone is a pseudoword that sounds
like a real word, such as brane for brain. Pseudohomophones have been
shown to be much slower to reject, and Jiang notes that using these forms
should be avoided unless they are an object of study (Jiang 2013).

Finally, there are two factors in LDT performance that are specific to
testing L2 and bilingual populations.

Cognate Status Cross-linguistic word pairs with a high degree of overlap
in meaning and orthography, such as English tourist and Spanish turista,
are recognized faster than noncognates. Meara et al. (1994) suggest that
cognate status can affect the speed with which L2 pseudowords that
resemble words in the L1 are rejected.

Cross-linguistic Homographs These are words that have identical forms in
two languages but different meanings. For example, the English word
pan means ‘bread’ in Spanish. Some research has reported that homo-
graphs can cause slower recognition times (de Groot et al. 2000).

The preceding factors all affect LDT performance, though Jiang notes
that the role the various factors play in LDT performance, and word
recognition more generally, remains a subject of discussion and debate
(Jiang 2013, p. 84). Regardless, the quality and strength of these effects
are a direct outgrowth of the individual’s experience with the language.
These properties change over the course of learning and use, and, as a
whole, reflect the development of L2 vocabulary proficiency. The LDT is
a tool that can provide a window on this development.

3.6 Using the LDT Format to Measure L2 Word Recognition Skill
The lexical facility account introduced in the next chapter combines
vocabulary size and word recognition skill, the latter consisting of speed
and consistency measures. Evidence for the proposal will come from user
performance on the Timed Yes/No Test, a format which incorporates
basic features of the LDT. The features are well-suited to measuring L2
word recognition skill as operationalized in the lexical facility account.
The timed yes/no response format permits the accuracy and speed of performance on words from a range of frequency levels to be tested. The
capacity to test a large number of items allows more reliable estimates of
frequency-indexed vocabulary size. The format is also time and resource
efficient.
The use of the LDT format as a vocabulary measurement tool departs
from its traditional use as a laboratory method to study reaction time
variability in lexical access. In these controlled settings, performance
accuracy is a control variable. Performance at, or near, ceiling provides
the baseline against which differences in response time performance are
used to infer the nature of the underlying cognitive structures and pro-
cesses responsible for performance. In the Timed Yes/No Test, accuracy
and speed are both treated as independent variables; their relationship to
L2 performance is tracked both separately and in combination.
Interpreting reaction time data in the presence of variable accuracy—that
is, high error rates—departs from conventional practice in reaction/
response time research of keeping the two separate. This is done to avoid a
potential confound between accuracy and speed that might arise when
interpreting results (Luce 1986).
Other aspects of the test format also sit outside current mainstream
practice in L2 vocabulary testing. The use of pseudowords, a simple yes/
no judgment as to whether a word is known, and the presentation of
single words without a context all make the test format distinctive. The
motivation for using the format to measure lexical facility is presented in
the next chapter.

3.7 Conclusions
Lexical facility is measured using the Timed Yes/No Test, an instrument
that incorporates basic features of the LDT. The format provides a win-
dow on the development of word recognition skill that is driven by expe-
rience with the language, as is the lexical facility construct itself. Applying
the format to the measurement of L2 word recognition skill across a
range of user proficiency levels, that is, where accuracy performance can
vary greatly, is a novel undertaking and one that presents a number
of challenges to the researcher. These will be examined in the studies pre-
sented in Part 2.

References
Akamatsu, N. (2003). The effects of first language orthographic features on sec-
ond language reading in text. Language Learning, 53(2), 207–231.
Akamatsu, N. (2008). The effects of training on automatization of word recog-
nition in English as a foreign language. Applied PsychoLinguistics, 29(2),
175–193. doi:10.1017/S0142716408080089.
Andrews, S. (1992). Frequency and neighborhood effects on lexical access:
Lexical similarity or orthographic redundancy? Journal of Experimental
Psychology: Learning, Memory, and Cognition, 18(2), 234–254.
doi:10.1037/0278-7393.18.2.234.
Balota, D. A., & Chumbley, J. I. (1984). Are lexical decisions a good measure of
lexical access? The role of word frequency in the neglected decision phase.
Journal of Experimental Psychology: Human Perception and Performance, 10(3),
340–357. doi:10.1037/0096-1523.10.3.340.
Balota, D. A., Yap, M. J., & Cortese, M. J. (2006). Visual word recognition: The
journey from features to meaning (a travel update). In M. J. Traxler & M. A.
Gernsbacher (Eds.), Handbook of psycholinguistics (2nd ed., pp. 285–375).
Amsterdam: Elsevier.
Bell, L. C., & Perfetti, C. A. (1994). Reading skill: Some adult comparisons.
Journal of Educational Psychology, 86(2), 244–255.
doi:10.1037/0022-0663.86.2.244.
Brown, J. D. (2005). Testing in language programs: A comprehensive guide to
English language assessment. New York: McGraw-Hill.
Carreiras, M., Perea, M., & Grainger, J. (1997). Effects of the orthographic neighborhood in visual word recognition: Cross-task comparisons. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23(4), 857–871.
De Groot, A. M., Delmaar, P., & Lupker, S. J. (2000). The processing of inter-
lexical homographs in translation recognition and lexical decision: Support
for non-selective access to bilingual memory. The Quarterly Journal of
Experimental Psychology: Section A, 53(2), 397–428.
Dijkstra, T. (2005). Bilingual visual word recognition and lexical access. In J. F.
Kroll & A. M. B. de Groot (Eds.), Handbook of bilingualism: Psycholinguistic
approaches (pp. 179–201). New York: Oxford University Press.
Elgort, I. (2013). Effects of L1 definitions and cognate status of test items on the
vocabulary size test. Language Testing, 30(2), 253–272.
doi:10.1177/0265532212459028.
Ellis, N. C. (2002). Frequency effects in language processing: A review with
implications for theories of implicit and explicit language acquisition. Studies
in Second Language Acquisition, 24(2), 143–188.
Favreau, M., & Segalowitz, N. S. (1983). Automatic and controlled processes in
the first and second language of reading fluent bilinguals. Memory and
Cognition, 11(6), 565–574. doi:10.3758/BF03198281.
Fender, M. J. (2001). A review of L1 and L2/ESL word integration development
involved in lower-level text processing. Language Learning, 51(2), 319–396.
doi:10.1111/0023-8333.00157.
Fodor, J. (1983). Modularity of mind. Cambridge, MA: MIT Press.
Geva, E., & Wang, M. (2001). The development of basic reading skills in chil-
dren: A cross-language perspective. Annual Review of Applied Linguistics, 21,
182–204.
Harley, T. A. (2013). The psychology of language: From data to theory (4th ed.).
Hove: Psychology Press.
Harrington, M. (2006). The lexical decision task as a measure of L2 lexical pro-
ficiency. EUROSLA Yearbook, 6(1), 147–168.
Holmes, V. M. (2009). Bottom-up processing and reading comprehension in
experienced adult readers. Journal of Research in Reading, 32(3), 309–326.
doi:10.1111/j.1467-9817.2009.01396.x.
Hoover, W. A., & Gough, P. B. (1990). The simple view of reading. Reading and
Writing, 2(2), 127–160. doi:10.1007/BF00401799.
Hulstijn, J. H., Van Gelderen, A., & Schoonen, R. (2009). Automatization in
second language acquisition: What does the coefficient of variation tell us?
Applied PsychoLinguistics, 30(4), 555–582.
Jacobs, A. M., & Grainger, J. (1994). Models of visual word recognition:
Sampling the state of the art. Journal of Experimental Psychology: Human
Perception and Performance, 20(6), 1311–1334.
Jeon, E. H., & Yamashita, J. (2014). L2 reading comprehension and its corre-
lates: A meta-analysis. Language Learning, 64(1), 160–212. doi:10.1111/
lang.12034.
Jiang, N. (2013). Conducting reaction time research in second language studies.
New York: Routledge.
Juffs, A., & Harrington, M. (2011). Aspects of working memory in L2 learn-
ing. Language Teaching, 44(2), 137–166. doi:10.1017/S0261444810000509.
Just, M. A., & Carpenter, P. A. (1992). A capacity theory of comprehension: Individual
differences in working memory. Psychological Review, 99(1), 122–149.
Kintsch, W. (1998). Comprehension: A paradigm for cognition. Cambridge:
Cambridge University Press.
Kintsch, W. (2005). An overview of top-down and bottom-up effects in com-
prehension: The C-I perspective. Discourse Processes, 39(2–3), 125–128. doi:
10.1080/0163853X.2005.9651676.
Koda, K. (1992). The effects of lower-level processing skills on FL reading per-
formance: Implications for instruction. The Modern Language Journal, 76(4),
502–512.
Koda, K. (1996). L2 word recognition research: A critical review. The Modern
Language Journal, 80(4), 450–460.
Koda, K. (2005). Insights into second language reading: A cross-linguistic approach.
New York: Cambridge University Press.
Koda, K. (2007). Reading and language learning: Crosslinguistic constraints on
second language reading development. Language Learning, 57, 1–44.
doi:10.1111/0023-8333.101997010-i1.
Kroll, J., Van Hell, J., Tokowicz, N., & Green, D. (2010). The revised hierarchical
model: A critical review and assessment. Bilingualism: Language and Cognition,
13(3), 373–381. doi:10.1017/S136672891000009X.
LaBerge, D., & Samuels, S. J. (1974). Toward a theory of automatic informa-
tion processing in reading. Cognitive Psychology, 6(2), 293–323.
doi:10.1016/0010-0285(74)90015-2.
Lewellen, M. J., Goldinger, S. D., Pisoni, D. B., & Greene, B. G. (1993). Lexical
familiarity and processing efficiency: Individual differences in naming, lexical
decision, and semantic categorization. Journal of Experimental Psychology:
General, 122(3), 316–330. doi:10.1037/0096-3445.122.3.316.
Luce, R. D. (1986). Response times. New York: Oxford University Press.
Meara, P. (2002). The rediscovery of vocabulary. Second Language Research,
18(4), 393–407. doi:10.1191/0267658302sr211xx.
Meara, P., Lightbown, P. M., & Halter, R. H. (1994). The effects of cognates on
the applicability of yes/no vocabulary tests. The Canadian Modern Language
Review, 50(2), 296–311.
Nassaji, H. (2014). The role and importance of lower-level processes in second
language reading. Language Teaching, 47(1), 1–37.
Pachella, R. G. (1974). The interpretation of reaction time in information pro-
cessing research. In B. H. Kantowitz (Ed.), Human information processing:
Tutorials in performance and cognition (pp. 41–82). Hillsdale: Lawrence
Erlbaum Associates, Inc.
Pellicer-Sánchez, A., & Schmitt, N. (2012). Scoring yes-no vocabulary tests:
Reaction time vs. nonword approaches. Language Testing, 29(4), 489–509.
doi:10.1177/0265532212438053.
Perfetti, C. A., & Stafura, J. (2014). Word knowledge in a theory of reading
comprehension. Scientific Studies of Reading, 18(1), 22–37. doi:10.1080/10888438.2013.827687.
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes
in L2 research. Language Learning, 64, 878–912. doi:10.1111/lang.12079.
Schmitt, N. (2010). Researching vocabulary: A vocabulary research manual.
Basingstoke: Palgrave Macmillan.
Segalowitz, N. (2010). Cognitive bases of second language fluency. New York:
Routledge.
Segalowitz, N., & Segalowitz, S. J. (1993). Skilled performance, practice and
differentiation of speed-up from automatization effects: Evidence from sec-
ond language word recognition. Applied PsychoLinguistics, 14(3), 369–385.
doi:10.1017/S0142716400010845.
Segalowitz, N., Watson, V., & Segalowitz, S. J. (1995). Vocabulary skill: Single
case assessment of automaticity of word recognition in a timed lexical deci-
sion task. Second Language Research, 11(2), 121–136.
Segalowitz, N., Segalowitz, S. J., & Wood, A. G. (1998). Assessing the develop-
ment of automaticity in second language word recognition. Applied
PsychoLinguistics, 19(1), 53–67.
Shiotsu, T. (2009). Reading ability and components of word recognition speed:
The case of L1-Japanese EFL learners. In Z. Han & N. J. Anderson (Eds.),
Second language reading research and instruction: Crossing the boundaries
(pp. 15–39). Ann Arbor: University of Michigan Press.
Stanovich, K. E., West, R. F., & Cunningham, A. E. (1991). Beyond phonological processes: Print exposure and orthographic processing. In S. A. Brady & D. P. Shankweiler (Eds.), Phonological processes in literacy: A tribute to Isabelle Y. Liberman (pp. 219–235). Hillsdale: Lawrence Erlbaum Associates.
Stone, G. O., & Van Orden, G. C. (1992). Resolving empirical inconsistencies concerning priming, frequency, and nonword foils in lexical decision. Language
and Speech, 35(3), 295–324. doi:10.1177/002383099203500302.
Wagenmakers, E. J., Ratcliff, R., Gomez, P., & McKoon, G. (2008). A diffusion
model account of criterion shifts in the lexical decision task. Journal of
Memory and Language, 58(1), 140–159. doi:10.1016/j.jml.2007.04.006.
Yamashita, J. (2013). Word recognition subcomponents and passage level read-
ing in a foreign language. Reading in a Foreign Language, 25(1), 52–71.
Yap, M., & Balota, D. (2015). Visual word recognition. In A. Pollatsek &
R. Treiman (Eds.), The Oxford handbook of reading (pp. 26–43). New York:
Oxford University Press.
4 Lexical Facility: Bringing Size and Speed Together

Aims

• Introduce lexical facility
–– as a vocabulary skill construct
–– as a measurement construct
• Identify the challenges of combining size and speed measures.
• Describe the lexical facility research program.

4.1 Introduction
The preceding chapters examined vocabulary size and word recognition skill
as elements of second language (L2) vocabulary knowledge. This chapter
introduces an approach to L2 vocabulary skill and its measurement that
brings together vocabulary size and the two dimensions of word recognition
skill, recognition speed and consistency. Lexical facility characterizes vocabu-
lary size and processing skill dimensions as complementary indices of L2
vocabulary knowledge that, when combined, provide a more sensitive mea-
sure of individual differences in L2 vocabulary than vocabulary size alone.

This book presents conceptual support and empirical evidence for the
value of treating size and processing skill as a unitary construct. Three
goals were set out at the beginning. The first goal is to establish the theo-
retical basis of the lexical facility construct, with particular attention to
the combination of the knowledge (size) and speed dimensions. The sec-
ond is to provide empirical evidence for the lexical facility construct as a
valid and reliable measure of individual differences in L2 vocabulary skill.
The third goal is to demonstrate how this measure correlates with out-
comes in various L2 performance domains.
This chapter focuses on the first aim—that is, establishing the concep-
tual basis for lexical facility. The construct is first defined and related to
existing research on word recognition skill. Lexical facility is approached
both as a vocabulary skill construct and as a measurement construct. As
a vocabulary skill construct, it represents lower-level word recognition
processes that play a crucial role in discourse comprehension skill. As a
measurement construct, it combines the three dimensions of vocabulary
size, recognition speed, and consistency as trait-like entities underlying
performance across contexts and uses. The lexical facility proposal repre-
sents a significant departure from the traditional practice in vocabulary
learning and assessment of treating knowledge and speed as independent
entities. The implications of the proposal are discussed, and the ways in
which the account differs from current directions in L2 vocabulary
research are highlighted.
The final part of the chapter introduces the research program that tests
the lexical facility proposal. The studies reported in Chaps. 6, 7, 8, 9, and 10
address the second and third goals of the book, that is, to provide evidence
for the validity and reliability of the lexical facility measures as indices of
L2 vocabulary skill and as correlates of performance in key L2 domains.

4.2 Defining Lexical Facility


Lexical facility is the capacity to recognize words quickly. It combines
vocabulary knowledge, as manifested by vocabulary size, with the pro-
cessing skill needed to access these words. This latter includes how fast,
on average, these words are recognized, as measured by mean recognition
time (mnRT), and the consistency in the speed with which this takes
place, as captured in the coefficient of variation (CV). All refer to the
linking of a visual word form to a meaning (semantic representation) in
the individual’s mental lexicon. The recognition process is bounded in
time from the moment the visual stimulus is first perceived until the lexi-
cal entry—the meaning component—becomes available.
Vocabulary size is the boundary condition on lexical facility because a
word must be known before it can be recognized. Size is measured using
the Timed Yes/No Test, which elicits simple judgments as to whether the
individual knows the word, but does not indicate what is known about it,
nor how it is used. Recognition speed reflects the mnRT with which words
are identified. The interest here is how differences in mnRT correlate with
vocabulary size and the relationship of the two to L2 performance.
Along with vocabulary size and mnRT, lexical facility includes consis-
tency with which words are recognized. The measure of consistency used
here is the CV, which is a single value that reflects the stability of response
times across a set of items (Segalowitz et al. 1998). Here it is the ratio of
recognition speed variability (the standard deviation of the mean, SDmnRT)
divided by the mean recognition speed. The CV has been implicated in
the development of L2 automaticity (Akamatsu 2008; Hulstijn et al.
2009; Segalowitz et al. 1998), but the concern here is the usefulness of
the measure as an index of L2 vocabulary skill development. The relative
speed and consistency of recognition reflect how strongly the words are
represented in the mental lexicon. This strength determines the amount
of processing required, with stronger representations yielding faster and
more consistent responses and drawing on fewer processing resources
(Luce 1986; Balota et al. 2006).
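
A hypothetical worked example (illustrative numbers only, not data from the studies reported here) shows why the CV adds information beyond the mnRT: if the SD shrinks in proportion to the mnRT, the CV is unchanged and only a general speed-up has occurred, whereas a CV that falls as responses get faster signals the more consistent processing that Segalowitz and Segalowitz (1993) associate with automatization.

    # Minimal sketch: the CV separates proportional speed-up from a
    # qualitative gain in consistency. All values are hypothetical.
    def cv(mn_rt, sd):
        return sd / mn_rt

    print(cv(900, 270))  # time 1: CV = 0.30
    print(cv(600, 180))  # time 2a: faster, CV still 0.30 -> speed-up only
    print(cv(600, 120))  # time 2b: faster and CV = 0.20 -> greater consistency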

Situating Lexical Facility in Word Recognition Research

Lexical facility is about user vocabulary size and recognition skill, and the
role they play in L2 performance, individually and in combination. The
term facility denotes a basic capacity that the learner develops and has
available for use across a range of contexts. This facility emerges as the
result of a user’s experience with a language, or what is termed an ‘emer-
gent property’ of the L2 lexicon (Ellis 2002). The focus here is on relating
individual differences in lexical facility to proficiency differences across
key domains of academic English performance.
The construct is similar to other terms used to characterize word rec-
ognition skill. It shares an affinity with verbal efficiency (LaBerge and
Samuels 1974), lexical expertise (Perfetti 1992; Andrews 2008), lexical
fluency (Kroll et al. 2002), word fluency (Bell and Perfetti 1994; Hannon
2012), lexical availability (Catalán 2013), and vocabulary fluency (Zhang
and Lu 2013). While similar (in some cases very similar) to these con-
cepts, the other approaches focus on speed and resource demands related
to processing, with little reference to vocabulary knowledge, particularly
size. For both theoretical and methodological reasons, processing in these
approaches is examined in settings where knowledge of words tested is
assumed. The lexical facility construct, in contrast, focuses on the concur-
rent development of vocabulary size and processing skill.
The notions of efficiency, expertise, and fluency are all closely related
to lexical facility, but each foregrounds particular aspects of word recogni-
tion skill. Efficiency focuses on recognition speed and the processing
resources these recognition processes draw on (LaBerge and Samuels
1974). Lexical expertise denotes the possession of a specialized skill, typi-
cally to distinguish a ‘skilled’—usually adult—reader from a less accom-
plished counterpart (Andrews 2007). Expertise also suggests an end state
of learning instead of a developmental process and is used in normative
models of first language (L1) reading (e.g., Perfetti 1985; Hannon 2012).
Lexical facility can also be described as a kind of fluency as in Zhang and
Lu (2013), but that term is used here to refer to language processing at a
larger grain of linguistic knowledge than the word. Fluency is used to
denote the ability to understand and produce larger units of language
(‘chunks’) that are essential to skilled performance (Segalowitz 2010).
Lexical facility concerns word-level vocabulary skill, which is a critical
contributor to fluent performance. While references to fluency at the
word level are not unusual (e.g., Bell and Perfetti 1994; Kroll et al. 2002;
Zhang and Lu 2013), the term lexical facility is preferred here instead.
Lexical facility is also closely related to automaticity or automatized
processing. Automaticity is characterized by fast recognition speed and
processing consistency (Segalowitz 2010): fast and highly consistent word identification in the context of the lexical facility account is indistinguishable
from automatized processing, but the term lexical facility is used to denote
the complete developmental continuum, rather than just the end state
that automaticity implies. Lexical facility also has close affinity with
Perfetti’s (2007) lexical quality hypothesis, in which word-level knowl-
edge and processing are closely related; however, that account encom-
passes a much broader treatment of the qualitative aspects of word
knowledge in contrast to the narrower word identification criteria used in
the lexical facility account. In summary, lexical facility shares key features
with other approaches to word recognition skill but also represents a dis-
tinctive perspective on the development of this crucial element of L2
linguistic knowledge.

Lexical Facility as a Bottleneck in Text Comprehension

Lexical facility is about skill in word identification. Although narrow in
scope, this skill is crucial for discourse comprehension in general and
written text comprehension in particular. The role of lexical facility in
text comprehension was described in Chap. 3. Lexical facility plays a
primary—as in both initial and central—role in the text comprehension
process. It is a lower-level process that links a written form with a seman-
tic entry in a user’s mental lexicon. The output of this process then feeds
into sentence-level processes, which in turn interact with higher-level dis-
course and cognitive processes.
Fluent text comprehension rests on lower-level word recognition pro-
cesses that are both efficient (meaning that words can be accessed quickly
and with little effort) and effective (meaning that the right word is
­identified) (Just and Carpenter 1992, p. 122). Efficiency involves both
the time (speed and consistency) and the amount of attention and mem-
ory resources required to identify a word. Effectiveness depends on
accessing a word appropriate to the discourse context. This access relies,
at a minimum, on having a vocabulary stock of adequate size for the tar-
get text. For fluent readers, it is assumed that all the semantic informa-
tion related to a given word form is activated at the initial identification,
with sentence- and text-level factors then influencing the selection of the
correct meaning from among the available alternatives (Kintsch 1998;
Liu 2009).
The importance of fluent word recognition cannot be exaggerated.
Recognizing words is a singular recurring cognitive activity in reading,
with individual differences in word recognition skills serving to separate
lower-proficiency learners from their more fluent counterparts (Perfetti
2007, p. 357). It is also one of the most observable differences between
advanced L2 readers and their L1 counterparts (Meara 2002; Koda 2005).
Individuals with smaller and slower vocabularies, that is, less lexical
facility, expend greater effort in identifying individual words and attempt-
ing to figure out unfamiliar words encountered in a text. This, in turn,
results in slower and less effective comprehension outcomes. These indi-
viduals will process fewer words, cover less text, and achieve a diminished
understanding compared to more fluent readers in the same amount of
time. The degree of lexical facility thus has a direct impact on the working
memory resources available for higher-level comprehension. Working
memory is the capacity to maintain previously encountered material
while simultaneously processing new material (Baddeley 2012). It is an
important determinant of text comprehension in particular and L2 learn-
ing and use in general (Juffs and Harrington 2011). For less fluent read-
ers, slower, less effective word recognition processes draw directly upon
available memory resources and limit the amount of working memory
available for executing the higher-level processes needed for successful
comprehension (Perfetti 1985). This is not the case for fluent L1 readers,
for whom word identification is generally automatic and assumed to play
a relatively indirect role in comprehension outcomes. Hannon (2012),
for example, links word identification efficiency to overall text
comprehension via sentence-level processing, where it combines word
integration and syntactic processes that together determine text compre-
hension outcomes.
The word identification skills represented in the lexical facility con-
struct are an integral element of L2 vocabulary skill, with efficient word
identification processes being important predictors of fluent L2 reading
outcomes (Koda 1996, 2005; see also Wang and Koda 2005; Shiotsu
2009).

4.3 Lexical Facility as a Vocabulary Skill Construct
L2 vocabulary knowledge is a multidimensional construct (Read 2000)
of which lexical facility is but one, albeit crucial, element. The lexical
facility account makes a set of related assumptions regarding how words
are learned and represented in the mental lexicon, how this knowledge is
processed, and how it relates to language performance. Each assumption
draws on established findings and traditions, but several are at odds with
dominant approaches in current theory and methodology in L2 vocabu-
lary research. These will be identified in turn.
The primary unit of analysis in the lexical facility account is the single
word, as was the case with the vocabulary size research discussed in Chaps. 1
and 2. The motivation for focusing on the single word is both theoretical
and practical. Words are the basic stuff of language use. Balota et al. (2006)
liken the role of individual words in language to the role of the cell in biol-
ogy. The cell is the basic element of which the larger organism is constituted.
The function of the organism depends on how the cells are organized, and
the health of these cell assemblies dictates the organism’s health. Furthermore,
a given cell has value and function only in combination with other cells. The
single word is likewise the basic building block of the mental lexicon and a
primary unit of language performance.
The focus on the single word also accords with psycholinguistic models
of reading comprehension. What is known about basic reading processes
is based on the study of single words: ‘Studies of single word reading are
the source of nearly all our crucial knowledge of reading mechanisms’
(Magnuson 2008, p. 379). Andrews (2006) uses another metaphor, that
of a ‘pivot’, to describe the crucial role that words play in reading com-
prehension: ‘Whatever the “grain size” of linguistic knowledge in general,
there is a functional level at which lexical knowledge acts in a “localist”
manner … because words serve as the interface or “pivot” for reading
comprehension’ (p. 319). This so-called pivot is the meeting place of
lower-level perceptual processes and higher-level processes that yield
comprehension. Andrews draws on both the physical sense of a pivot as a
mechanical device that supports the movement of an object and the more
abstract sense of being an event or action on which further progress
depends. Words are described as pivots in comprehension because they
are at the end of perceptual (orthographic) processing and the starting
point of conceptual processing. Lexical facility can thus be said to be
pivotal.
The focus on single words is also warranted by the manner in which
lexical information is represented in the mental lexicon. Single words are
nodes of information an individual develops as a function of experience
and learning. This information is orthographic, graphemic, and seman-
tic, and in total constitutes the representations needed for successful
reading (Grigorenko and Naples 2008, p. ix).
Word-level representations have a distinctive status in memory. Individual
words are learned and stored in declarative memory (Ullman 2005; Paradis
2010). Declarative memory (also called explicit memory) is responsible for
the conscious learning of new material, as well as the recall of previously
learned material. Declarative knowledge can be talked about and explicitly
taught, and is affected directly by the frequency and specific characteristics
of input. In contrast, basic elements of language structure, particularly pho-
nology, morphology, and syntax, are supported by procedural memory sys-
tems. These rule-based processes are typically not available to conscious
deliberation, and the relationship between their development and experi-
ence is more complex. Declarative memory-based knowledge is accessible
to tasks that elicit user recall and judgment, as in the Timed Yes/No Test
used here to measure lexical facility.
The single word is also a fundamental unit of L2 vocabulary instruc-
tion. It is the starting point for beginner learners and a basic unit of learn-
ing, teaching, and assessment at all levels of development (Grabe 2009).
This is not to discount the importance of multiword units such as lexical
phrases, collocations, and other fixed forms in fluent performance.
Spoken and written discourse rarely involves single-word utterances
alone, and these larger units are increasingly viewed as being of equal
importance to individual words in teaching and learning (Schmitt 2010,
p. 8). The importance of multiword units is also evident from corpus
linguistics research, where the context-dependent nature of much word
meaning and the role that morphological rules play in relating and medi-
ating the meaning of individual words suggest limits to focusing on single
words alone in vocabulary teaching and research (Gardner 2007).
However, neither concern invalidates the single-word focus of the lexical
facility proposal. Vocabulary knowledge is a multidimensional notion,
and the perspectives complement rather than oppose each other. However,
even while recognizing the importance of lexical units above the single
word, it is also a fact that multiword units are ultimately composed of
individual words, both at the perceptual and at the representational level.
The focus on the single-word unit in the lexical facility account is
motivated by the central role that the word plays in learning, representa-
tion, and processing. These properties, in turn, make it a particularly
appropriate unit for the measurement of L2 vocabulary knowledge. It
also provides a unit of measurement for L2 vocabulary knowledge that
can be readily quantified.

A Quantitative Approach to L2 Vocabulary Knowledge

The three elements of the lexical facility construct lend themselves to
quantitative measurement. Vocabulary size is estimated from performance
on tests that sample word items from a range of frequency-of-occurrence
bands. Speed is measured by the mean response time for the words correctly
recognized, and consistency by the SD of those response times and by the
CV, the SD divided by the mean. All other aspects of word
knowledge are ignored—in particular, the multiple associations that a
given word shares with other words in the individual’s mental lexicon, for
example, semantic and usage-based associations and collocations, are
excluded (Fitzpatrick 2006). These are traditionally viewed as measures of
vocabulary depth and are contrasted with the breadth of vocabulary mea-
sured by the vocabulary size test. The breadth/depth distinction is a simple,
intuitively appealing way to characterize user vocabulary knowledge;
however, the actual independence of the two is open to question.
Researchers such as Qian (1999) and Read (2004) have argued that the
two are independent dimensions of vocabulary knowledge, while others
view them as the same, with an increase in vocabulary breadth accompa-
nied closely by increasing depth (e.g., Vermeer 2001). This debate is
beyond the scope of this book, but the assumption here is that the two
are not, for practical purposes, independent. All things being equal,
increases in depth are accompanied by an increase in breadth. As is the
case with vocabulary speed and size, learners with very ‘broad’ vocabular-
ies also have very ‘deep’ ones.
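
The three measures are simple enough to compute directly from raw test responses. The following is a minimal sketch in Python, using hypothetical per-item data; it illustrates the definitions only and does not reproduce the scoring procedure of any particular test reported here.

    import statistics

    # Hypothetical per-item data: correctness and response time in milliseconds.
    correct = [True, True, False, True, True, True, False, True]
    rts_ms = [842, 910, 1530, 777, 1015, 698, 1920, 880]

    # Speed: mean response time (mnRT), computed over correctly recognized words.
    correct_rts = [rt for ok, rt in zip(correct, rts_ms) if ok]
    mn_rt = statistics.mean(correct_rts)

    # Consistency: the SD of those times and the CV, the SD scaled by the mean
    # so that faster and slower responders can be compared on the same footing.
    sd = statistics.stdev(correct_rts)
    cv = sd / mn_rt

    print(f"mnRT = {mn_rt:.0f} ms, SD = {sd:.0f} ms, CV = {cv:.2f}")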

Lexical Facility Is an Emergent Property

Lexical facility emerges from the individual’s experience with the lan-
guage. The user’s vocabulary size, speed, and consistency reflect the fre-
quency of exposure to the words in the language. Frequency of occurrence
is a strong predictor of when a word will be learned and an important
determinant of the strength of word representation in the mental lexi-
con (Ellis 2002, 2012). This strength develops as a result of successive
word retrieval events and the resulting multiple associations formed
with other words in the mental lexicon. It, in turn, predicts how quickly
a word will be accessed in use (Balota et al. 2006). Recent research also
suggests that word frequency statistics may be more than just a quantita-
tive notion. The frequency with which a word appears has been shown
to closely relate to the range of contexts in which it appears, not just to
the overall number of occurrences (Adelman et al. 2006; Raymond
and Brown 2012). Frequency thus serves as an indicator of how widely
a particular word is used, that is, as a measure of vocabulary depth.

4.4 Lexical Facility as a Measurement Construct
The lexical facility proposal has significant implications for testing and
assessment. It has two fundamental characteristics as a measurement con-
struct that set it apart from other approaches to L2 vocabulary and profi-
ciency testing. The first is the combination of vocabulary size and
processing skill (speed and consistency) in a unified measure of L2 lexical
skill. The second is the assumption that lexical facility is best character-
ized as a trait for measurement purposes.

Lexical Facility as a Measure of Size and Processing Skill

Lexical facility is unique in that it examines vocabulary size and recogni-
tion skill (recognition speed and consistency) concurrently as correlates
of L2 vocabulary skill. Chapters 1 and 2 established vocabulary size as a
reliable predictor of performance in common L2 domains. A central
question addressed in this book is whether the processing skill measures
can serve a similar function. Ultimately, of course, the answer is ‘yes’.
Chapter 3 showed that faster mean recognition time (mnRT) and more
consistent (CV) word recognition are hallmarks of advanced L2 users and
essential attributes of automaticity. The issue is whether these elements
can also serve as reliable and sensitive measures across a more fine-grained
range of proficiency levels. In statistical terms, it is about how well the
processing skill measures can account for additional amounts of variance
in performance differences after the effect of vocabulary size is taken into
account. The incorporation of the CV in the lexical facility construct is
an attempt to treat response variability as a window on performance, as
opposed to mere ‘noise’ that might otherwise obscure experimental effects
of interest. The interest in variability as a characteristic of performance in
its own right is attracting increasing attention in cognitive science (Yap
and Balota 2015). The research program here marks one of the first
attempts to establish consistency, as measured by the CV, as a reliable
measure of performance differences.
The prima facie case for combining size and speed seems strong. Fluent
performance depends on what you know (i.e., having enough words to be
able to access the ones needed for the particular discourse context) and
when you know it (i.e., being able to access these words in a manner
quick enough and consistent enough for the specific discourse demands).
Despite the intuitive nature of the idea, recognition speed—not to men-
tion response consistency—is neglected in models of L2 vocabulary mea-
surement. Recognition speed, and response speed in general, has received
scant mention in standard L2 testing texts (Bachman 1990; McNamara
1996). Likewise, L2 vocabulary testing research devotes little attention to
temporal measures in the discussions of models of L2 vocabulary assess-
ment (Read 2001; Milton 2009; Hulstijn 2011; Schmitt 2010). In recent
work on L2 proficiency testing, Hulstijn (2011) mentions speed as an
index of L2 proficiency but does not include it as an explicit measure.
This is partly due to current methodological limitations in the field.
Hulstijn et al. (2009) note that the emergence of vocabulary size (or
‘knowledge acquisition’, in their terms) and of the speed and consistency
of recognition (or ‘skill acquisition’) are fundamentally intertwined, mak-
ing the isolation of the knowledge and skill dimensions difficult, if not
impossible (p. 579).
The difficulty posed in trying to relate knowledge to speed is also rec-
ognized by testing specialists. Van der Linden (2009) describes the ten-
sion between the intuitive attraction of combining the two and its
operationalization in theory and practice in the following terms:

Test theorists have always been intrigued by the relationship between
responses to test items and the time used by a test taker to produce them.
Both seem indicative of the same behavior on test items. Nevertheless,
their relationship appears to be difficult to conceptualize, let alone repre-
sent coherently in a statistical model. (2009, p. 247)

The lexical facility proposal might thus appear to be a case of treading
where others fear to go. It attempts to answer the basic question as to
whether size and speed together provide a more sensitive indicator of
individual differences in a fundamentally important aspect of L2 vocabu-
lary skill. The research also attempts to gain a better understanding of
how these two elements interrelate across the L2 developmental contin-
uum. This will have immediate implications for L2 vocabulary research
and assessment, and more widely for the incorporation of speed as an
element in models of L2 development.

Lexical Facility Is a Trait

The accurate measurement of language skill requires the tester to define
the underlying construct being measured in a way appropriate to the
target behavior. Read and Chapelle (2001) identify three approaches to
this construct definition process. The essential difference between the
approaches is the weight given to the knowledge the individual brings to
the language task versus the features of the task and its context of use. A
behaviorist approach posits the underlying construct as synonymous with the
behavior required for the particular context. Specifying the performance
context is defining the construct. The interactionalist approach character-
izes proficiency as an interaction between what the learner knows and the
context of use (Chapelle 1998). The interactionalist approach has been
the dominant one in recent L2 testing, as it takes into account both what
the learner brings to the task and what the specific task demands
(Bachman and Palmer 2010). The third approach shifts the focus solely
to what the learner knows. A trait approach characterizes the individual’s
vocabulary knowledge independent of any specific context. Lexical facil-
ity is assumed to be trait-like in that vocabulary size and processing skill
are assumed to be learner-internal characteristics that can be measured
and usefully interpreted independently of the vocabulary demands of
specific contexts. The trait approach is appropriate for the lexical facility
account, given its narrow scope (word recognition size and speed), which
serves as a fundamental constraint on comprehension processes in a con-
sistent manner across a range of contexts of use.
The trait approach to vocabulary measurement is generally disfavored
in L2 vocabulary testing, given the complexity of vocabulary knowledge
and its context-sensitive nature (Chapelle et al. 2010). However, the nar-
row scope of lexical facility as a lower-level word recognition skill lends
itself to characterization as a probabilistic capacity, independent of
­specific contexts (Laufer 2001). As such, lexical facility is characterized as
a trait measurement construct that serves as a frequency-based objective
index against which to measure and compare lexical facility across learn-
ers and settings (Kempe and MacWhinney 1996).
How well it does this will be evaluated in the research studies reported
in Part 2.

4.5 Bringing Size and Speed Together


Lexical facility is about the concurrent emergence of vocabulary size and
processing skill and how this relates to increasing L2 proficiency.
Examining the simultaneous development of vocabulary knowledge and
processing skill runs counter to the long-standing practice in lexical
research that uses speed as a dependent variable interpreted against per-
formance that has few, if any, errors (Sternberg 1998). By controlling for
what the participant knows, the effect of changes in the stimulus material
on response speed can be isolated and interpreted in relation to specific
research questions. Fixing one dimension allows the researcher to elimi-
nate potential confounds in the assignment of causes to observed effects.
This research paradigm has been very successful in identifying key mech-
anisms and processes underlying lexical performance. However, the
assumption that the individuals being studied already have the knowl-
edge being examined means that the paradigm only works in controlled
experimental settings with mature or expert users, be they L1 monolin-
guals (Balota et al. 2006) or balanced bilinguals (Kroll and Stewart 1994).
The lexical facility account differs in that its central concern is how vocab-
ulary size and processing skill covary as correlates of performance, both at a
single point in time and across development. Interpreting speed differences
in the presence of varying vocabulary size, and vice versa, raises significant
conceptual and methodological challenges. These are considered next.

Combining Size and Speed: Mixing Apples and Oranges?

The lexical facility account combines vocabulary size, mean recognition
speed, and recognition speed consistency as a unitary L2 vocabulary con-
struct. The proposal thus ignores the long-standing distinction in the
psychometric literature between lower-order behavior that is defined by
speed and higher-order behavior defined by knowledge (Carroll 1993).
Speed is a defining attribute for lower-level behavior typified in simple
perceptual and motor tasks such as sorting and typing. In these tasks, the
underlying knowledge is of limited cognitive complexity, and the pro-
cessing task by which it is manifested is simple and well understood. In
contrast, knowledge (also called power) is the critical attribute of higher-­
order complex cognitive domains, in which the knowledge is complex
and displayed in performance that is varied and of varying degrees of
complexity. Test performance draws on reasoning and other higher-order
cognitive skills, including language, and is characteristic of almost all
testing done for educational purposes. Rather than reflecting a single
underlying ability, power and speed are assumed to be qualitatively differ-
ent dimensions (Carroll 1993).
In educational testing, individual differences in how fast a test-taker
can answer a test item are assumed to be irrelevant to assessing mastery of
the material. Instead, what is crucial is the ability to demonstrate mastery
of the knowledge—that is, answering the item correctly. This might
involve answering a question, solving a problem, or relating some aspect
of knowledge to another. For example, if a geography test item asks for
the names of all the countries bordering Switzerland, what is important
is that the test-taker gives the correct names of the countries. The speed
with which these names are produced is assumed to have no bearing on
whether the material is known, nor is it seen as relevant to how that infor-
mation might be used in other tasks, say, writing an essay on the causes
of World War I. Time limits in educational tests are set to allow an indi-
vidual who has mastered the material adequate time to demonstrate that
knowledge (Schnipke and Scrams 2002). Running out of time in properly
designed tests is a function of inadequate mastery of the material, not a
problem arising from slow retrieval processes.
There has generally been little interest in speed in educational testing,
and in L2 testing in particular. John Carroll, a seminal figure in
psychometrics, notes that the value of including a speed component in
the measurement of cognitive performance may (merely) reflect ‘a soci-
etal judgment concerning the value of high intelligence combined with
quickness of response or problem-solving’ (Carroll 1993, p. 509). The
speed with which knowledge is demonstrated is of limited importance
outside of TV quiz shows and party games. This is particularly the case in
educational testing, where the demonstration of knowledge is an
outcome-oriented process. The nature of L2 knowledge is such that the
sole emphasis on what the learner knows ignores an integral aspect of that
knowledge.

The Time-Contingent Nature of L2 Knowledge

The lack of interest in speed as a measure of cognitive performance arises


from the distinction made between lower-order skills (such as typing)
and higher-order cognitive skills (such as knowledge of geography). Speed
is deemed essential for the first but extraneous to the second. This prompts
the question as to the nature of the linguistic knowledge that underpins
performance in the L2. In short—and oversimplifying matters greatly—
is L2 knowledge more like typing or geography? Language tests in formal
settings typically resemble those given in other academic subjects. The
student needs to show on paper, or increasingly online, mastery of the
material; how quickly this is done is largely irrelevant. There is the
assumption and hope that knowledge demonstrated in a test will be
applied in actual language use. However, experience has long suggested
that successful classroom performance does not necessarily result in com-
municative competence. A central question raised by the lexical facility
proposal is the extent to which the treatment of vocabulary knowledge
(size) and processing skill (speed) as a unitary measurement construct
provides a more sensitive discriminator of individual differences in L2
vocabulary knowledge/skill, which in turn relates to L2 performance.
In addition to these theoretical considerations, there is also a serious
methodological problem that can arise when examining knowledge and
speed simultaneously. In Chap. 3, the potential problem posed by a
trade-off in test performance between response speed and accuracy was
introduced. Individuals can vary in the relative emphasis placed on giving
a quick answer versus a correct one—quick and sloppy versus slow and
careful. As a result, final accuracy and speed measures may reflect in part
an individual response bias. Fast responses accompanying lower-accuracy
scores may indicate that the test-taker is emphasizing a fast response over
a correct one. Conversely, slower responses accompanied by high accu-
racy may indicate an overly deliberate approach on the part of the test-
taker in answering. In both cases, test performance does not directly
reflect the nature of the underlying knowledge representations. The
potential for such trade-offs depends greatly on the nature of the test task
and can differ widely across individuals (Luce 1986). These individual
differences arise from a range of cognitive, attentional, and motivational
factors often not related to the knowledge being measured. This means
that the researcher must be careful when setting up the testing task, and
particularly the instructions, in a manner that minimizes the likelihood
of these trade-offs occurring. The accuracy and speed responses must also
be carefully observed for the presence of possible trade-offs, and appro-
priate steps must be taken when they are present. It is assumed that
speed–accuracy trade-offs can never be eliminated completely, and that
attention must be given to their potential effects (Heitz 2014).
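
One simple screening heuristic, sketched below on hypothetical per-participant summaries, is to check whether faster responding is systematically accompanied by lower accuracy across test-takers. This is offered only as an illustration of the kind of inspection meant here, not as a substitute for the careful item-level observation described above.

    import statistics

    # Hypothetical per-participant summaries: proportion correct and mnRT (ms).
    accuracy = [0.92, 0.85, 0.78, 0.95, 0.70, 0.88]
    mn_rt = [950, 830, 690, 1010, 640, 900]

    # A strong positive Pearson correlation (slower = more accurate) is one
    # warning sign that speed is being traded for accuracy in the group.
    r = statistics.correlation(accuracy, mn_rt)
    print(f"accuracy-mnRT correlation: r = {r:.2f}")
    if r > 0.5:
        print("Possible speed-accuracy trade-off; inspect individual responses.")
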
Finally, the use of speed as a proficiency measure also presents techni-
cal challenges. Before the development of computer-based data collection
tools, time measures were unwieldy to collect and analyze, with adequate
control possible only in laboratory conditions. Increasingly accessible off-
the-shelf and online tools are now available that allow temporal measures
to be incorporated in test development, administration, and scoring in
the classroom and other settings outside the laboratory (Lee and Chen
2011).
The lexical facility proposal seeks to establish whether the processing
skill measures (speed and consistency), both independently and in com-
bination with vocabulary size, can provide a fuller picture of L2 vocabu-
lary skill, and one that is more sensitive to differences in proficiency and
performance levels than vocabulary size alone. The discussion so far has
focused on speed of response. Also part of the lexical facility proposal is
the potential role of response consistency, as captured by the CV. The CV
is calculated from recognition time and as such is open to the same concep-
tual and methodological issues raised in regard to using speed as a perfor-
mance measure.
The relative neglect of processing skill (mean RT and CV) as a perfor-
mance measure has conceptual, empirical, and practical bases. The issues
raised present a number of challenges to establishing lexical facility as an
L2 vocabulary construct. These are addressed in Part 2 as evidence for the
account is presented.

4.6 Recognition Vocabulary Size and Speed as a Vocabulary Measure
Recognition speed as an L2 vocabulary measure has not been totally
ignored. The lexical facility account is not the first attempt to examine
vocabulary size and recognition speed as covarying elements in measuring
L2 proficiency. Laufer and Nation (2001) examined item accuracy and
response speed in Vocabulary Levels Test (VLT) performance. The study
compared item completion speed and accuracy scores for university-level
EFL students and L1 controls on a computerized version of the VLT. They
found that greater accuracy was accompanied by faster response times, as
evident in a significant inverse correlation between overall mean response
times and scores at all levels of the test. Mean response times also discrimi-
nated individual performance, but only for the L2 participants with larger
vocabularies. The L2 participants also showed significantly greater
response time variability than the L1 controls. Finally, in comparisons
where vocabulary size was controlled, speed-up in response times lagged
behind increases in overall size (Laufer and Nation 2001, p. 18). Despite
these results, the authors have shown little subsequent interest in response
time as a measure of vocabulary skill in applied settings (Nation 2013).
A more recent longitudinal study also examined the relationship between
vocabulary size and response speed in VLT performance. Zhang and Lu
(2013) administered the test to 300 EFL learners at a Chinese university
three times over a 22-month period to examine how vocabulary size and
item response speed (which they labeled ‘fluency’) covaried over time.
Both accuracy scores and response times improved systematically across
the three tests as a function of the VLT frequency levels, but only a weak
relationship overall was observed between vocabulary size and response
speed. Like Laufer and Nation (2001), the authors found that increases in
vocabulary size outpaced the development of faster response times.
Both studies demonstrated a relationship between learners’ response
speed and vocabulary size. But due to the design of the studies, they pro-
vide only a coarse picture of that relationship. Recall that a VLT item
consists of three target words matched to three of six possible alternative
definitions (see the example item in Fig. 2.1 in Chap. 2). In both studies,
the three item definitions were presented in individual frames, one at a
time, with the same six word options appearing in each of the three
frames. The researchers also randomized the order of the six words in
each trial.
The speed measure in both studies reflects overall item response time
rather than individual word recognition time. Given that the six word
options appear in all three frame presentation trials, the item response
times are not independent. Relatively less time is spent processing the six
words in the second and third frame presentations. Also, and crucially,
the response time measures reflect the average speed of item completion.
They are not a direct measure of word recognition speed, though Zhang
and Lu (2013, p. 8) use that term interchangeably with response time.
Rather, the times are measures of word and meaning-matching speeds
that involve the consideration of alternatives before a response is given.
As a result, the mean response time values are very high. In both studies,
response times for the 3K level ranged from 8 seconds to beyond 15 sec-
onds. In contrast, in L2 word recognition studies in which individual
word recognition is measured, values typically range from 500 millisec-
onds to 1500 milliseconds (Segalowitz and Segalowitz 1993; van Gelderen
et al. 2004).
Of more direct relevance to the current work, a small number of stud-
ies have used the Timed Yes/No format to examine the relationship
between vocabulary size and mean recognition speed (mnRT). The stud-
ies differ in aims, settings, and participants, but all report a correlation
between vocabulary size and recognition speed when size and mnRT
measures are collected on the same words (i.e., Harrington 2006;
Harrington and Carey 2009; Shiotsu and Read 2009; Pellicer-Sánchez
and Schmitt 2012).
A precursor to the lexical facility proposal is Harrington (2006),
who reported a systematic correlation between vocabulary size, recogni-
tion speed, and consistency (CV) across frequency and proficiency levels
for university-age English learners when examining the development of
L2 word automaticity. Harrington and Carey (2009) also found that
vocabulary size and recognition speed predicted placement levels in an
English language program. This research will be examined more closely
in Part 2.
In contrast, possibly the only study that reported no correlation
between vocabulary size and speed is Miralpeix and Meara (2010). The
study examined vocabulary knowledge in L2 English students at a Spanish
university using separate size and speed tests. Vocabulary size was mea-
sured using the X_Lex (1K–5K) and Y_Lex (6K–10K) tests, and mean
response speed was measured using an independent lexical access test
required the individual to judge whether a presented word was animate
(Segalowitz and Freed 2004). The researchers found no overall correla-
tion between vocabulary size and animacy-judgment times, but did
report correlations of around 0.7 for both groups for the relationship
between time and a vocabulary interview measure that included vocabu-
lary quality. The lack of observed correlation may be due to the relative
ease of the animacy recognition test, as reflected in the fast responses. The
mean response times for the intermediate and advanced L2 participants
in the study were around 815 milliseconds, close to the 784 milliseconds
reported for the L1 participants. In contrast, Pellicer-Sánchez and Schmitt
(2012) reported times on a Timed Yes/No Test of 860 milliseconds for an
advanced L2 group and 854 milliseconds for L1 controls. Harrington
(2006) reported means of 1650 milliseconds for preuniversity English
learners compared with 780 milliseconds for L1 university students.
The notion that larger vocabulary size correlates with faster recogni-
tion is supported by all but one of the studies, and that study examined
size and speed measures on different sets of lexical items. The relationship
between vocabulary size, recognition speed, and, consistency of this rec-
ognition speed (CV) is examined in the lexical facility research program
described in the next section.

4.7 Establishing Lexical Facility: The Research Program
Lexical facility is a fundamental component of L2 vocabulary skill and a
prerequisite to fluent L2 performance. It consists of recognition vocabu-
lary size, the mean speed with which these words are recognized, and
consistency of the recognition speed. The measures are objective, inde-
pendent variables that permit comparison of performance across groups
and settings. The three measures are examined as correlates of L2 profi-
ciency both separately and in composite scores. These measures are col-
lected using the Timed Yes/No Test, an online measure of recognition
vocabulary knowledge described in detail in Chap. 5. The frequency-
based measurement instrument is assumed to assess trait-like vocabulary
skill that reflects the individual’s capacity to process lexical knowledge
across contexts of use.
Empirical evidence for the account is presented in Part 2. The findings
reported there address three empirical goals. They will:

1. compare the three measures of lexical facility (vocabulary knowledge [size],
mnRT, and recognition time consistency) as stable indices of L2 vocabu-
lary skill;
2. evaluate the sensitivity of the three measures to group and individual per-
formance differences in a range of L2 instructional domains; and by doing
so
3. establish the degree to which the composite measures provide a more sensi-
tive measure of L2 proficiency and performance than any measure consid-
ered separately.

Sensitivity is defined as how reliable, as in statistically significant, the
measures are in discriminating between proficiency levels (research goal 1),
and the relative magnitude of the differences, individually and in combi-
nation (research goal 2). A corollary question (research goal 3) relates to
whether the two time measures, mnRT and CV, can account for unique
variance in the differences beyond that attributed to vocabulary size
alone. The lexical facility account is the first to examine the CV as an
independent index of L2 vocabulary skill.
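
In statistical terms, the corollary question amounts to comparing nested regression models: does adding mnRT and the CV to a size-only model increase the variance explained (R-squared)? The sketch below illustrates the logic of that comparison in Python using simulated data; the actual analyses in Part 2 use the studies' own datasets and procedures.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 120
    # Simulated predictors: vocabulary size, mnRT (ms), and CV.
    size = rng.normal(5000, 1200, n)
    mnrt = rng.normal(1100, 250, n)
    cv = rng.normal(0.35, 0.08, n)
    # Simulated proficiency outcome influenced by all three measures.
    prof = 0.004 * size - 0.02 * mnrt - 40 * cv + rng.normal(0, 5, n)

    def r_squared(cols, y):
        """R-squared of an ordinary least-squares fit with an intercept."""
        X = np.column_stack([np.ones(len(y))] + list(cols))
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return 1.0 - resid.var() / y.var()

    r2_size = r_squared([size], prof)            # size-only model
    r2_full = r_squared([size, mnrt, cv], prof)  # size plus processing skill
    print(f"unique variance for mnRT and CV: {r2_full - r2_size:.3f}")
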
The first five studies in Part 2 examine the sensitivity of the measures
to program- and test-based group differences. Study 1 (Chap. 6) investi-
gates the sensitivity of the individual and combined measures to differ-
ences across three distinct English proficiency groups. The groups consist
of preuniversity L2 students, L2 university students, and L1 university
students at an Australian university. Studies 2 (Chap. 7) and 3 (Chap. 8)
examine the sensitivity of the measures to group differences in English
entry standards used in Australian universities. Study 2 examines the
measures across five English entry standard groups, and Study 3 considers
the measures as correlates to IELTS scores for students in a university
foundation-year program. Studies 4 and 5 (Chap. 9) investigate lexical
facility as a measure of proficiency as reflected in language program
placement levels. Study 4 correlates the three measures with
in-house placement tests at an English language school in Australia, and
Study 5 examines them as correlates of placement levels in an English
language school in Singapore.
The last two studies (Chap. 10) investigate the sensitivity of the mea-
sures to individual differences in English for Academic Purposes (EAP)
proficiency, as evident in English-medium academic performance. Study
6 examines the measures as predictors of course grades in an EAP course,
and Study 7 investigates how well the measures predicted grade point
average outcomes in a university preparation program.
The studies evaluate the proposal that increasing vocabulary size
and processing skill can productively serve in tandem as dimensions
of L2 vocabulary skill. Evidence for the proposal will have implica-
tions for incorporating time into models of L2 acquisition and in the
use of composite measures of size and time in measurement and
assessment.

4.8 Conclusions
The lexical facility account is distinctive among L2 vocabulary approaches
in that it combines vocabulary size and recognition speed. The proposed
combination of size and processing skill (recognition time and consis-
tency) is at odds with the traditional psychometric approach to measure-
ment which treats the two as independent dimensions of behavior. There
are a number of reasons for the practice, both in testing theory in general
and in L2 vocabulary assessment in particular, but treating size and speed
as independent entities may also obscure a basic underlying relationship,
given the time-contingent nature of L2 knowledge.
Central to the lexical facility account is the proposal that the combina-
tion of recognition speed and consistency with vocabulary size provides a
combined measure of greater sensitivity to group and individual differ-
ences than vocabulary size alone. Evidence for this effect validates the
account and has broader implications for theory and practice in L2
vocabulary instruction and assessment.
These will be taken up in Part 2, where empirical evidence for the lexi-
cal facility proposal is presented. The studies reported all use the Timed
Yes/No Test. In the next chapter, the test format and methodology are
described.

References
Adelman, J. S., Brown, G. D. A., & Quesada, J. F. (2006). Contextual diversity,
not word frequency, determines word-naming and reading times. Psychological
Science, 17(9), 814–823.
Akamatsu, N. (2008). The effects of training on automatization of word recog-
nition in English as a foreign language. Applied PsychoLinguistics, 29(2),
175–193. doi:10.1017/S0142716408080089.
Andrews, S. (Ed.). (2006). From inkmarks to ideas: Current issues in lexical pro-
cessing. Hove: Psychology Press.
Andrews, S. (2008). Lexical expertise and reading skill. In B. H. Ross (Ed.), The
psychology of learning and motivation: Advances in research and theory (Vol. 49,
pp. 247–281). San Diego: Elsevier.
Bachman, L. (1990). Fundamental considerations in language testing. Oxford:
Oxford University Press.
Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice: Developing
language assessments and justifying their use in the real world. New York: Oxford
University Press.
Baddeley, A. (2012). Working memory: Theories, models, and controversies.
Annual Review of Psychology, 63, 1–29.
Balota, D. A., Yap, M. J., & Cortese, M. J. (2006). Visual word recognition: The
journey from features to meaning (a travel update). In M. J. Traxler & M. A.
Gernsbacher (Eds.), Handbook of psycholinguistics (2nd ed., pp. 285–375).
Amsterdam: Elsevier.
Bell, L. C., & Perfetti, C. A. (1994). Reading skill: Some adult comparisons. Journal
of Educational Psychology, 86(2), 244–255. doi:10.1037/0022-0663.86.2.244.
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies.
Cambridge: Cambridge University Press.
Catalán, R. M. J. (Ed.). (2013). Lexical availability in English and Spanish as a
second language (Vol. 17). Dordrecht: Springer.
Chapelle, C. A. (1998). Construct definition and validity inquiry in SLA
research. In L. F. Bachman & A. D. Cohen (Eds.), Interfaces between second
language acquisition and language testing (pp. 32–70). Cambridge: Cambridge
University Press.
Chapelle, C. A., Enright, M. K., & Jamieson, J. (2010). Does an argument-
based approach to validity make a difference? Educational Measurement: Issues
and Practice, 29(1), 3–13.
Ellis, N. C. (2002). Frequency effects in language processing: A review with
implications for theories of implicit and explicit language acquisition. Studies
in Second Language Acquisition, 24(2), 143–188.
Ellis, N. C. (2012). What can we count in language, and what counts in lan-
guage acquisition, cognition, and use? In S. T. Gries & D. Divjak (Eds.),
Frequency effects in language learning and processing (pp. 7–34). Berlin:
DeGruyter Mouton.
Fitzpatrick, T. (2006). Habits and rabbits: Word associations and the L2 lexi-
con. EUROSLA Yearbook, 6(1), 147–168.
Gardner, D. (2007). Validating the construct of ‘word’ in applied corpus-based
vocabulary research: A critical survey. Applied Linguistics, 28(2), 242–265.
doi:10.1093/applin/amm010.
Gelderen, A. V., Schoonen, R., Glopper, K. D., Hulstijn, J., Simis, A., Snellings,
P., & Stevenson, M. (2004). Linguistic knowledge, processing speed, and
metacognitive knowledge in first- and second-language reading compre-
hension: A componential analysis. Journal of Educational Psychology, 96(1),
19–30.
Grabe, W. (2009). Reading in a second language: Moving from theory to practice.
New York: Cambridge University Press.
Grigorenko, E. L., & Naples, A. J. (Eds.). (2008). Single-word reading: Behavioral
and biological perspectives. New York: Taylor & Francis.
Hannon, B. (2012). Understanding the relative contributions of lower-level
word processes, higher-level processes, and working memory to reading
comprehension performance in proficient adult readers. Reading Research
Quarterly, 47(2), 125–152. doi:10.1002/RRQ.013.
Harrington, M. (2006). The lexical decision task as a measure of L2 lexical pro-
ficiency. EUROSLA Yearbook, 6(1), 147–168.
Harrington, M., & Carey, M. (2009). The online yes/no test as a placement
tool. System, 37(4), 614–626. doi:10.1016/j.system.2009.09.006.
Heitz, R. P. (2014). The speed-accuracy tradeoff: History, physiology, methodol-
ogy, and behavior. Frontiers in Neuroscience, 8, 150.
Hird, K., & Kirsner, K. (2010). Objective measurement of fluency in natural
language production: A dynamic systems approach. Journal of Neurolinguistics,
23(5), 518–530. doi:10.1016/j.jneuroling.2010.03.001.
Hulstijn, J. H. (2011). Language proficiency in native and nonnative speakers:
An agenda for research and suggestions for second-language assessment.
Language Assessment Quarterly, 8(3), 229–249.
Hulstijn, J. H., Van Gelderen, A., & Schoonen, R. (2009). Automatization in
second language acquisition: What does the coefficient of variation tell us?
Applied PsychoLinguistics, 30(4), 555–582.
Juffs, M., & Harrington, M. (2011). Aspects of working memory in L2 learn-
ing. Language Teaching, 44(2), 137–166. doi:10.1017/S0261444810000509.
Just, M. A., & Carpenter, P. A. (1992). A capacity theory of comprehension:
Individual differences in working memory. Psychological Review, 99(1),
122–149.
Kempe, V., & MacWhinney, B. (1996). The crosslinguistic assessment of for-
eign language vocabulary learning. Applied PsychoLinguistics, 17(2), 149–183.
doi:10.1017/S0142716400007621.
Kintsch, W. (1998). Comprehension: A paradigm for cognition. Cambridge:
Cambridge University Press.
Koda, K. (1996). L2 word recognition research: A critical review. The Modern
Language Journal, 80(4), 450–460.
Koda, K. (2005). Insights into second language reading: A cross-linguistic approach.
New York: Cambridge University Press.
Kroll, J. F., & Stewart, E. (1994). Category interference in translation and pic-
ture naming: Evidence for asymmetric connections between bilingual mem-
ory representations. Journal of Memory and Language, 33(2), 149–174.
doi:10.1006/jmla.1994.1008.
Kroll, J. F., Michael, E., Tokowicz, N., & Dufour, R. (2002). The development
of lexical fluency in a second language. Second Language Research, 18(2),
137–171.
LaBerge, D., & Samuels, S. J. (1974). Toward a theory of automatic informa-
tion processing in reading. Cognitive Psychology, 6(2), 293–323.
Laufer, B. (2001). Quantitative evaluation of vocabulary: How it can be done
and what it is good for. In C. Elder, K. Hill, A. Brown, N. Iwashita,
L. Grove, T. Lumley, & T. McNamara (Eds.), Experimenting with uncer-
tainty: Essays in honour of Alan Davies (pp. 241–250). Cambridge:
Cambridge University Press.
Laufer, B., & Nation, P. (2001). Passive vocabulary size and speed of meaning
recognition: Are they related? EUROSLA Yearbook, 1(1), 7–28.
Lee, Y. H., & Chen, H. (2011). A review of recent response-time analyses in
educational testing. Psychological Test and Assessment Modeling, 53(3),
359–379.
Luce, R. D. (1986). Response times. New York: Oxford University Press.
Magnuson, J. S. (2008). Nondeterminism, pleiotropy, and single-word reading:
Theoretical and practical concerns. In E. L. Grigorenko & A. J. Naples (Eds.),
Single-word reading: Behavioral and biological perspectives (pp. 377–404).
New York: Taylor & Francis.
McNamara, T. F. (1996). Measuring second language performance. London:
Addison Wesley Longman.
Meara, P. (2002). The rediscovery of vocabulary. Second Language Research,
18(4), 393–407. doi:10.1191/0267658302sr211xx.
Milton, J. (2009). Measuring second language vocabulary acquisition. Bristol:
Multilingual Matters.
Miralpeix, I., & Meara, P. (2010). The written word. Retrieved from
www.lognostics.co.uk/Vlibrary
Nation, I. S. P. (2013). Learning vocabulary in another language (2nd ed.).
Cambridge, UK: Cambridge University Press.
Paradis, J. (2010). Bilingual children’s acquisition of English verb morphology:
Effects of language exposure, structure complexity, and task type. Language
Learning, 60(3), 651–680.
Pellicer-Sánchez, A., & Schmitt, N. (2012). Scoring yes-no vocabulary tests:
Reaction time vs. nonword approaches. Language Testing, 29(4), 489–509.
doi:10.1177/0265532212438053.
Perfetti, C. A. (1985). Reading ability. New York: Oxford University Press.
Perfetti, C. A. (1992). The representation problem in reading acquisition. In P.
B. Gough, L. C. Ehri, & R. Treiman (Eds.), Reading acquisition (pp. 145–
174). Hillsdale: Erlbaum.
Perfetti, C. A. (2007). Reading ability: Lexical ability to comprehension. Scientific
Studies of Reading, 11(4), 357–383. doi:10.1080/10888430701530730.
Qian, D. D. (1999). Assessing the roles of depth and breadth of vocabulary
knowledge in reading comprehension. The Canadian Modern Language
Review, 56(2), 282–307.
Raymond, W. D., & Brown, E. L. (2012). Are effects of word frequency effects
of contexts of use? In S. T. Gries & D. Divjak (Eds.), Frequency effects in lan-
guage learning and processing (pp. 35–52). Berlin: De Gruyter Mouton.
Read, J. (2000). Assessing vocabulary. Cambridge: Cambridge University Press.
Read, J. (2004). Plumbing the depths: How should the construct of vocabulary
knowledge be defined? In P. Bogaards & B. Laufer (Eds.), Vocabulary in a
second language: Selection, acquisition, and testing (pp. 209–227). Amsterdam:
John Benjamins.
Read, J., & Chapelle, C. A. (2001). A framework for second language vocabu-
lary assessment. Language Testing, 18(1), 1–32.
Schmitt, N. (2010). Researching vocabulary: A vocabulary research manual.
Basingstoke: Palgrave Macmillan.
Schnipke, D. L., & Scrams, D. J. (2002). Exploring issues of examinee behav-
ior: Insights gained from response-time analyses. In C. N. Mills, M. Potenza,
J. J. Fremer, & W. Ward (Eds.), Computer-based testing: Building the founda-
tion of future assessments (pp. 237–266). Hillsdale: Lawrence Erlbaum
Associates.
Segalowitz, N. (2010). Cognitive bases of second language fluency. New York:
Routledge.
Segalowitz, N., & Freed, B. (2004). Context, contact and cognition in oral flu-
ency acquisition: Learning Spanish in at home and study abroad contexts.
Studies in Second Language Acquisition, 26(2), 173–199. doi:10.1017/
S0272263104262027.
Segalowitz, N., & Segalowitz, S. J. (1993). Skilled performance, practice and
differentiation of speed-up from automatization effects: Evidence from sec-
ond language word recognition. Applied PsychoLinguistics, 14(3), 369–385.
doi:10.1017/S0142716400010845.
Segalowitz, N., Segalowitz, S. J., & Wood, A. G. (1998). Assessing the develop-
ment of automaticity in second language word recognition. Applied
PsychoLinguistics, 19(1), 53–67.
Shiotsu, T. (2009). Reading ability and components of word recognition speed:
The case of L1-Japanese EFL learners. In Z. Han & N. J. Anderson (Eds.),
Second language reading research and instruction: Crossing the boundaries
(pp. 15–39). Ann Arbor: University of Michigan Press.
Shiotsu, T., & Read, J. (2009, November). Extending the yes/no test as a measure
of the English vocabulary knowledge of Japanese learners. Paper presented at The
measurement of L2 lexical development colloquium, Annual Conference of
the Applied Linguistics Association of Australia, Brisbane.
Sternberg, S. (1998). Inferring mental operations from reaction time data: How
we compare objects. In D. N. Osherson, D. Scarborough, & S. Sternberg
(Eds.), An invitation to cognitive science, Methods, models, and conceptual issues
(Vol. 4, pp. 436–440). Cambridge, MA: MIT Press.
Ullman, M. T. (2005). A cognitive neuroscience perspective on second language
acquisition: The declarative/procedural model. In C. Sanz (Ed.), Mind and
context in adult second language acquisition: Methods, theory, and practice
(pp. 141–178). Washington, DC: Georgetown University Press.
van der Linden, W. J. (2009). Conceptual issues in response time modelling.
Journal of Educational Measurement, 46(3), 247–272. doi:10.1111/
j.1745-3984.2009.00080.x.
Vermeer, A. (2001). Breadth and depth of vocabulary in relation to L1/L2
acquisition and frequency of input. Applied PsychoLinguistics, 22(2), 217–234.
Wang, M., & Koda, K. (2005). Commonalities and differences in word identi-
fication skills among learners of English as a second language. Language
Learning, 55(1), 71–98. doi:10.1111/j.0023-8333.2005.00290.x.
Yap, M., & Balota, D. (2015). Visual word recognition. In A. Pollatsek & R.
Treiman (Eds.), The Oxford handbook of reading (pp. 26–43). New York:
Oxford University Press.
Zhang, X., & Lu, X. (2013). A longitudinal study of receptive vocabulary
breadth knowledge growth and fluency development. Applied Linguistics,
35(3), 283–304. doi:10.1093/applin/amt014.

5 Measuring Lexical Facility: The Timed Yes/No Test

Aims

• Introduce the Timed Yes/No Test as a measure of lexical facility.
• Describe test characteristics and features.
• Discuss challenges in combining vocabulary size and processing skill
in a single measure.

5.1 Introduction
This chapter describes the Timed Yes/No Test, the online assessment tool
used to measure lexical facility in the studies reported in Part 2. Lexical
facility consists of three dimensions: vocabulary size, mean recognition
time (mnRT), and recognition speed consistency, as captured in the coef-
ficient of variation (CV). The size measure is based on the number of
words (hits) recognized minus pseudowords incorrectly recognized as
words (false alarms). Various formulas have been proposed to combine
hit and false alarm performance. These are described and evaluated. The
use of mnRT as a proficiency measure distinguishes the test (and the
lexical facility account) from other approaches but also presents
methodological challenges. Considered in turn are issues related to inter-
preting recognition time results in performance with high error rates, the
non-­normality of speed data, and potential for speed–accuracy trade-offs
(SATs) in test performance.
The empirical studies establish the sensitivity of the three measures to
performance differences in selected domains of academic English, sepa-
rately and in combination. The combined effect of the measures is evalu-
ated in composite measures that index lexical facility performance in a
single score and also serve to offset the effect of trade-offs in performance
among the measures. The composite scores used here are described. In
some instances, the combined effect of the measures is also evaluated
using multiple regression models that fix the relative contribution of the
respective measures in accounting for performance outcomes.
As a test of L2 vocabulary, the Timed Yes/No Test is distinctive in sev-
eral respects. Knowledge is measured by self-report via a simple yes/no
response, pseudowords are included to control for guessing, and response
time performance (speed and consistency) is incorporated as an index of
proficiency. The format is compared with other L2 vocabulary test for-
mats to fix the test within the broader field of L2 vocabulary testing.
Finally, the focus on written recognition skill in academic English in the
empirical research program is motivated.

5.2 The Timed Yes/No Test


The Yes/No Test format was introduced in Chap. 2. The test has a num-
ber of distinctive features that set it apart from other formats. It elicits a
simple yes/no judgment as to whether the presented word item is known,
meaning that what the test-taker knows about the word is not directly
assessed. Test items consist of words drawn from a range of frequency
bands that allow the test-taker’s vocabulary size to be estimated. Word items
are presented along with pseudowords to control for guessing.
Pseudowords are letter strings that conform to orthographic and phono-
logical rules of the given language but are not actual words. The Timed
Yes/No Test format presents items one at a time and collects both yes/no
and recognition times. The format differs from other L2 vocabulary tests
in the selection of items, the response format, and the scoring procedure.
These are described next.

Test Items

Word Items The Timed Yes/No Test yields an estimate of an individual’s
vocabulary size by testing knowledge of items sampled from selected
frequency-­of-occurrence bands. By convention, these are divided into
bands of 1000 words, beginning with the 1000 most frequent words. For
convenience, the bands will be denoted by 1K, 2K, 3K, and so on. The
frequency bands are drawn from corpora including the British National
Corpus (BNC) (Leech et al. 2001), the Cambridge and Nottingham
Corpus of Discourse in English (CANCODE) (McCarthy 1998), and
the Corpus of Contemporary American English (COCA) (Davies 2008).
Frequency-based word lists are available at The Compleat Lexical Tutor
(http://www.lextutor.ca/) and Lancaster University (http://corpora.lancs.
ac.uk).

The range of frequency bands sampled can vary. The X_Lex test sam-
ples items from the 1K–5K range, while a version assessing knowledge of
lower-frequency words, Y_Lex (Meara and Miralpeix 2006), tests items
in the 6K–10K range. The proficiency level of the cohort being tested
affects what bands are selected for inclusion. Tests including low-
frequency bands (e.g., 7K–10K) run the risk of being too difficult for
beginner learners, while using too narrow a range (1K–3K) may result
in more advanced learners performing at ceiling. At the same time, a
spread of frequency bands allows a greater range of learners to be tested
and compared. The band ranges used in the studies reported later range
from the 1K to the 10K band, with four bands (2K, 3K, 5K, and 10K)
and five bands (1K, 3K, 5K, 7K, and 9K) used, depending on the study.
The range used represents a trade-off between the aim of the study, the
proficiency range of the participants, and the time and resources available
for testing.
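
For concreteness, the sketch below shows one way a test form might be assembled by drawing a fixed number of items from each selected band. The band lists here are placeholder words invented for illustration; in practice they would come from corpus-based frequency lists such as those noted above.

    import random

    # Placeholder band lists; real lists come from corpus frequency data.
    bands = {
        "1K": ["time", "people", "water", "year", "house"],
        "3K": ["stable", "urge", "remedy", "panel", "lease"],
        "5K": ["rubble", "hymn", "saddle", "plight", "fodder"],
    }

    def sample_items(bands, per_band, seed=0):
        """Draw per_band words from each frequency band for a test form."""
        rng = random.Random(seed)
        return {band: rng.sample(words, per_band) for band, words in bands.items()}

    print(sample_items(bands, per_band=3))
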
Pseudoword Items Pseudowords are letter strings that have no meaning
but conform to the phonological and orthographical rules in the lan-
guage, such as erlude or gallify in English. They are included to control for
guessing on the part of the test-taker. Early researchers generated pseudo-
words by switching letters in existing words to make phonologically legal
but meaningless words in the target language (Anderson and Freebody
1981). Pseudowords can vary in how closely they resemble actual words,
and it is important to ensure that the pseudowords used do not resemble
actual words too closely. This potential confusion can be avoided by hav-
ing independent raters, students similar to the cohort being tested, verify
that the pseudowords used cannot be easily mistaken as actual words.
More recently, pseudoword generators such as Wuggy have appeared that
allow researchers to generate their own pseudowords with relative ease
(Keuleers and Brysbaert 2010).
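
For illustration, the early letter-switching approach can be sketched in a few lines of Python. This is a minimal, hypothetical version for exposition only: it swaps a single vowel and does not check phonological legality or nonword status, both of which the rater- and generator-based procedures above are designed to handle.

```python
import random

VOWELS = "aeiou"

def make_pseudoword(word, rng=random.Random(42)):
    """Toy letter-switching generator: swap one vowel in a real word.

    Illustrative only -- it does not check phonological legality or
    that the result is a nonword, so candidates would still need
    vetting against a dictionary and, ideally, human raters.
    """
    letters = list(word)
    vowel_positions = [i for i, ch in enumerate(letters) if ch in VOWELS]
    if not vowel_positions:
        return word  # no vowel to swap
    i = rng.choice(vowel_positions)
    letters[i] = rng.choice([v for v in VOWELS if v != letters[i]])
    return "".join(letters)

print(make_pseudoword("prelude"))  # e.g. 'prolude'
```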

The number of ‘yes’ responses to pseudowords (‘false alarms’) reflects


the degree of guessing by a test-taker. The false-alarm rate is used either
to adjust the final score or, if excessive, as the basis for removing
the individual from the analysis. No firm consensus exists as to the opti-
mum percentage of words and pseudowords a test should contain.
Beeckmans et al. (2001) suggest a proportion of 70% words to 30%
pseudowords, while Mochida and Harrington (2006) used a 60%/40%
split. The tests used in the studies reported later contain both 30% and
40% pseudowords. The main factor to consider when deciding on the
number of pseudowords is the potential response pattern by the partici-
pants. The test-taker responds by hitting either a ‘yes’ or a ‘no’ key, and an
overpreponderance on either over the course of the test may give rise to a
conscious or unconscious response bias. Imagine a 100-item test contain-
ing 70 words and 30 pseudowords. Perfect performance would result in
the test-taker hitting the ‘yes’ key 70 times (for the 70 actual words) and
the ‘no’ key 30 times (for the 30 pseudowords). The much larger number
of ‘yes’ keystrokes may prompt increasing hesitation to (correctly) respond
‘yes’ when actual word items appear, out of concern that there have been
too many ‘yes’ responses. The test-taker may even answer ‘no’ to a known
word in order to give a more ‘even’ set of responses. In instances where the
test-taker is assumed to know all the words, a 50/50 split of words and
pseudowords gives rise to little response bias, as an optimum score would


be the result of 50 ‘yes’ and 50 ‘no’ keystrokes. This is typically the case in
lexical decision tasks used with first language (L1) adults. In the Timed
Yes/No Test used here, the test-takers do not know a number of the word
items used, and the proportion can vary somewhat according to word
frequency and individual proficiency level. As a result, the total number
of ‘no’ responses will be a function of both not knowing actual words and
rejecting pseudowords. Ideally, an optimum mix will result in the test-­
taker responding ‘yes’ about half the time, though this is difficult to
achieve. The use of 30% or 40% pseudowords in the tests reported in
Part 2 represents a compromise that permits the test to be used across a
range of proficiency levels.

5.3 Scoring the Timed Yes/No Test


Vocabulary Size

There are four kinds of responses possible for the two item types and two
response alternatives. A schematic diagram of the four is given in Fig. 5.1.
The item response matrix is based on the signal detection theory model of
decision-making (Green and Swets 1966). There are two kinds of correct
responses: ‘yes’ responses to actual word items (hits) and ‘no’ responses to

                    Item type
                    Word                   Pseudoword
Response ‘yes’      correct (‘hit’)        incorrect (‘false alarm’)
Response ‘no’       incorrect (‘miss’)     correct (‘correct rejection’)

Fig. 5.1 Yes/No Test response types



pseudowords (correct rejections). Corresponding incorrect responses are


‘no’ to actual words (misses) and ‘yes’ responses to pseudowords (false
alarms).
Performance is measured by the number of hits, which reflects vocabu-
lary size, and the false alarms, which indicate guessing behavior. A false-alarm rate of 20%, for example, suggests that the hit rate, and therefore the vocabulary size estimate, is inflated by a similar amount. What the format does not
capture are instances where a ‘no’ response is given to a word the test-­
taker actually knows. This results in an underestimation of vocabulary
size and may arise when the test-taker is being overly cautious.
Researchers have investigated various ways to use the false-alarm per-
formance to correct for guessing (Beeckmans et al. 2001; Cameron 2002;
Eyckmans 2004; Huibregtse et al. 2002; Mochida and Harrington 2006;
Siakaluk et al. 2003; Thoma 2009; Pellicer-Sánchez and Schmitt 2012).
Alternative scoring formulas have been proposed that vary by the assump-
tions made about the state of an individual’s vocabulary knowledge
and individual response style. Huibregtse et al. (2002) and, subsequently,
Mochida and Harrington (2006) and Pellicer-Sánchez and Schmitt
(2012) compared four scoring formulas. These are presented in Fig. 5.2.
The first method calculates the number of hits minus the false-alarm
rate (H-FA). This method takes the false-alarm performance rate into
account in the most direct manner, but has been shown to underestimate
actual vocabulary knowledge when the false-alarm rate is low (Huibregtse
et al. 2002, p. 231). The rate of under- and over-estimations can be
gauged by comparing Yes/No Test performance with performance on
more standard vocabulary tests where the test-taker’s knowledge of a
word can be verified, such as the Vocabulary Levels Test (VLT) (Mochida
and Harrington 2006; also Pellicer-Sánchez and Schmitt 2012). Mochida
and Harrington (2006) examined Yes/No Test performance by English
L2 university students and found under- and over-estimation rates of 6%
and 5%, respectively. However, there were substantial differences across
frequency bands, with performance for both types for the 2K level at 1%,
and for the 10K level at around 13%.
Earlier studies using the untimed Yes/No Test format used a correction-for-blind-guessing (cfbg) procedure, as illustrated in the second formula (Meara 1989; Meara and Buxton 1987; Cameron 2002). This formula
1. Hits minus false alarms (H-FA). Adjusts the total score to reflect guessing, but is not very sensitive to individual response style. Ignores the correct rejection of pseudowords.

   H-FA = h − f

2. Correction for blind guessing (cfbg). Incorporates the correct rejections to account for ‘blind’ guessing (Anderson and Freebody 1983; Meara and Buxton 1987). Blind guessing assumes the respondent either knows the word or is guessing at random. Does not account for response style.

   P(k) = (h − f) / (1 − f)

3. Meara’s Δm. Attempts to incorporate sophisticated guessing, within a signal detection theory approach, into the correction for guessing (Meara 1992, cited in Huibregtse et al. 2002). Does not take into account response style, and tends to underestimate scores compared to other methods (Huibregtse et al. 2002).

   Δm = ((h − f) − f) / ((1 − f) − h)

4. ISDT. Assumes sophisticated guessing and takes into account individual response styles (Green and Swets 1966; Huibregtse et al. 2002).

   ISDT = 1 − [4h(1 − f) − 2(h − f)(1 + h − f)] / [4h(1 − f) − (h − f)(1 + h − f)]

(h = hit rate, f = false-alarm rate, both expressed as proportions)

Fig. 5.2 Four Yes/No Test scoring formulas

takes into account individual differences in the proportion of hits to false


alarms produced, calculating the proportion of H-FA divided by the pro-
portion of pseudowords correctly rejected (Anderson and Freebody
1983). The cfbg formula is based on a ‘blind guessing model’ that assumes
that the respondent either knows a word or is guessing at random—that


is, it assumes that L2 vocabulary knowledge is categorical (Huibregtse
et al. 2002, p. 231). The either/or random guessing assumption is at odds with the graded nature of word knowledge, and the formula’s mathematical properties emphasize the hit rate over the false-alarm rate. At its most extreme,
when the hit rate equals 1, the score will be 1 regardless of the number of
false alarms. In an attempt to address this problem, Meara proposed an alternative formula: Δm, the third formula in Fig. 5.2. This formula,
however, turned out to be overly conservative, yielding scores that over-
corrected for false alarms in general and, in particular, yielded uninter-
pretable—or no—scores when hits were relatively low and false alarms
relatively high (Huibregtse et al. 2002, p. 245).
None of the first three formulas (H-FA, cfbg, Δm) takes into account
individual response bias as a source of variability. This response bias reflects
a systematic tendency by an individual to respond in a specific manner,
which could be consistently liberal (a tendency to respond ‘yes’) or conser-
vative (a tendency to respond ‘no’). A more complex formula, the ISDT,
which stands for an Index of Signal Detection Theory, was proposed to
account for both sophisticated guessing and this underlying response bias
(Huibregtse et al. 2002; Beeckmans et al. 2001). Huibregtse et al. (2002)
modeled scores generated by the four formulas across a hypothetical range
of hit and false-alarm values produced in different response styles. The
proficiency and response bias levels were arbitrary but allowed the four
scoring methods to be systematically compared across different hit and
false-alarm rate levels. The sensitivity of the formulas varied as a function
of the proportion of hits to false alarms. Δm performed the worst, yielding
the lowest scores overall, especially when false-alarm rates were high rela-
tive to hits, and it did not even yield a score in the liberal response condi-
tion, where hits were low and false alarms high. Outcomes for the other
methods also showed variability across the different modeling conditions.
The ISDT overestimates performance when the hit and false-alarm rates
are low. At the intermediate and advanced proficiency levels, the H-FA
and ISDT formulas yield similar findings, suggesting that the underlying
assumption concerning guessing and response style embodied in the more
complex Δm and ISDT formulas is not necessary.
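
To make the behavior of the four formulas concrete, the following Python sketch computes all four scores from a hit rate h and false-alarm rate f (proportions between 0 and 1). The formulas follow Fig. 5.2 as transcribed above; the Δm line in particular should be checked against the original source (Huibregtse et al. 2002), and the guard clauses reflect the ‘no score’ cases just discussed.

```python
def yes_no_scores(h, f):
    """Four Yes/No Test scores for one test-taker, per Fig. 5.2.

    h: hit rate (proportion of words answered 'yes'), 0..1
    f: false-alarm rate (proportion of pseudowords answered 'yes'), 0..1
    """
    scores = {"H-FA": h - f}
    # cfbg: correction for blind guessing
    scores["cfbg"] = (h - f) / (1 - f) if f < 1 else None
    # Meara's Delta-m, as transcribed in Fig. 5.2; the denominator reaches
    # zero when hits are low and false alarms high -- the 'no score' case.
    dm_denom = (1 - f) - h
    scores["Dm"] = ((h - f) - f) / dm_denom if dm_denom != 0 else None
    # ISDT (Huibregtse et al. 2002)
    isdt_denom = 4 * h * (1 - f) - (h - f) * (1 + h - f)
    scores["ISDT"] = (1 - (4 * h * (1 - f) - 2 * (h - f) * (1 + h - f))
                      / isdt_denom) if isdt_denom != 0 else None
    return scores

# A hypothetical test-taker with moderate hits and some guessing
print(yes_no_scores(h=0.80, f=0.10))
```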
Mochida and Harrington (2006) examined the four scoring methods in


a study that compared performance on the same words presented in both
the Yes/No Test and VLT formats. The results of both tests were similar
regardless of the scoring method used. Meara’s Δm formula was most dis-
similar, as it was in Huibregtse et al.’s earlier (2002) study. Furthermore,
Mochida and Harrington (2006) found that hits alone provided as good a
predictor of an individual’s VLT score as any of the other formulas. Harsch
and Hartig (2015) also found that hits alone are a satisfactory measure of
vocabulary size, suggesting that hit and false-alarm rates be treated as sepa-
rate indicators, the former as a measure of ­vocabulary breadth and the
latter as a measure of guessing. In both these studies, false-alarm rates were
low, resulting in only a limited need to correct for guessing. As false-alarm rates increase, the need to correct for guessing becomes more important.
In an innovative attempt to develop an alternative to pseudowords as a
means to control for guessing, Pellicer-Sánchez and Schmitt (2012) pro-
posed that reaction time thresholds on individual word responses (hits) can
be used to control for the effects of guessing. The authors hypothesized
that slower times reflect less certainty on the part of the test-taker, and it is
on these items that most guessing will occur. By removing these items from
the analysis, a more reliable measure of vocabulary size can be obtained.
They tested this assumption by comparing performance on words pre-
sented in the Timed Yes/No Test format with that of the same words pre-
sented in a follow-up individual recall test in which the word had to be
recalled and used correctly. It was assumed that requiring the participant to
use the word correctly would provide a more reliable measure of overall
vocabulary size. Participants were significantly faster in recognizing words
presented in the Yes/No Test when they could also recall them in the fol-
low-up test (accurate words), compared with when they answered ‘yes’ to a
target word in the test but could not subsequently recall those words (inac-
curate words). Reaction time differences between the accurate and inac-
curate words were then used to identify a general reaction time threshold
level for each individual. Accurate responses falling below the threshold
were removed and the Yes/No Test score was recalculated. The authors then
correlated this score and the scores generated by the four scoring formulas
discussed earlier with the follow-up recall test score. The results showed
that the traditional formulas had a slightly higher correlation with the
recall test than with the time-adjusted approach. Ultimately, the findings did not show a clear advantage for the reaction time approach or for any one of the
established scoring formulas. Regardless, the logic of the Pellicer-Sánchez
and Schmitt (2012) study accords with that of the lexical facility proposal,
namely that speed of recognition reflects stronger word knowledge.
The research to date on Yes/No Test scoring has failed to identify a
formula that is clearly superior across the range of testing contexts, test-­
taker samples, and performance outcomes encountered (Beeckmans et al.
2001, p. 272). Pellicer-Sánchez and Schmitt (2012) raise the possibility
of using adaptive scoring, in which the formula that appears to be the
most sensitive to the pattern of responses made by the individual test-­
taker is used. The practical considerations would be significant but, as the
authors note, not insurmountable; however, it is not clear whether the
increased sensitivity that might come from using the various formulas
would be worth the effort and expense. It would also present difficulties
in comparing performance across individuals.
The empirical studies reported in Part 2 use the H-FA formula to score
test performance across a range of test-takers and testing domains. Of
course, if false-alarm rates are low, or even zero, the hits provide a usable
measure on their own. However, even in this case, the hit rate as a reflec-
tion of an individual’s ‘true’ vocabulary knowledge might still be an over-
estimate (the test-taker did some guessing) or an underestimate (the
test-taker did not select some known words). It is important to emphasize
that the vocabulary size measure that the test yields is a probabilistic esti-
mate serving as an indirect measure of vocabulary size, given the adjust-
ment for guessing involved. The usefulness of the testing approach lies not in identifying an individual’s absolute vocabulary size, but in producing a relative measure of vocabulary size that can meaningfully discriminate among proficiency and performance levels.

Recognition Speed and Consistency

In addition to the vocabulary size measure, the test also collects recogni-
tion time for each item presented. A mean recognition time (mnRT) is
calculated for all the words correctly identified (hits). The mnRT and its
standard deviation (SD) are then used to estimate recognition speed con-
sistency. This is expressed in the coefficient of variation (CV), which is
the ratio of the SD of the mnRT to the mean RT (SDmnRT/mnRT). These
processing skill measures are examined individually and in combination
with the size measures as indices of proficiency.
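
As a minimal sketch, the two processing measures can be computed from a test-taker’s hit RTs as follows (Python; the RT values are hypothetical):

```python
from statistics import mean, stdev

def mnrt_and_cv(hit_rts_ms):
    """Mean recognition time (mnRT) and coefficient of variation (CV)
    for one test-taker's correct word responses (hits), RTs in ms."""
    mnrt = mean(hit_rts_ms)
    cv = stdev(hit_rts_ms) / mnrt   # SD of the hit RTs over their mean
    return mnrt, cv

rts = [640, 710, 820, 590, 980, 730]   # hypothetical hit RTs
mnrt, cv = mnrt_and_cv(rts)
print(round(mnrt), round(cv, 2))       # 745 and roughly 0.19
```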

Recognition Speed as a Proficiency Measure

The use of the mnRT as a proficiency measure presents a number of chal-


lenges for data collection, analysis, and interpretation. Three of particular
concern are the interpretation of mnRT performance when high levels of
errors occur in the yes/no performance, the non-normality of time distri-
butions, and the potential influence of systematic speed-accuracy trade-­
offs on overall response outcomes.

Interpreting mnRT Results in Performance with High Error Rates Response


time performance in L1 lexical research is examined on words that are known
to the participant. This is done to ensure that any observed differences in
response time behavior can be more confidently attributed to the indepen-
dent variable(s) manipulated in the study, and not to a lack of knowledge of
the word (Sternberg 1998). This allows the effect of specific factors that
influence response time behavior to be isolated and systematically tested.
Such factors might include the effect of word class membership or of prior
exposure (priming) on performance. L2 and bilingual processing research
also uses this approach. For example, in automaticity research, increasing retrieval speed is examined on words that participants already know and then practice as part of the study (Akamatsu 2008; Segalowitz et al. 1998). The lexical
facility account, in contrast, examines the emergence of size, recognition
speed, and consistency simultaneously. The mnRTs are calculated on correct
word responses (hits) only, meaning that they are typically based on only a
portion of the words tested, and the same words that contribute to the size
measure. As a result, mnRTs can sometimes represent only a small set of
words. This is particularly the case for performance on the lower-frequency
bands and by lower-proficiency test-takers. The validity and reliability of these measures are empirical questions addressed in Part 2.
The Non-normality of Response Time Data Recognition speed


data typically do not form a normal distribution (Luce 1986). This is
because there is a physical limit on how fast a test-taker can respond but
not, in theory, on how slow—though in practical terms this is limited by
time allotted to complete the task. As a result, mnRT data are often positively skewed, with faster responses clustering at one end of the distribution and the more extreme slower responses (larger response times) at the
other. In these instances, the data do not meet the normality assumption
required for the use of the conventional parametric tests such as the
t-test and the analysis of variance (ANOVA). A variety of techniques
have been developed to deal with this problem and thus render the data
amenable to analysis with these more powerful statistical tools (Jiang
2012). Skewed distributions can be transformed mathematically using
data transformation techniques that preserve the pattern of differences
between the data points but diminish the impact of extreme values.
These outlier values are usually defined as data points that are 2.5–3 SDs
beyond the sample mean, and they can have a pronounced effect on nor-
mality. There are different ways to deal with them, including deleting
them completely or replacing them with less extreme values. The studies
reported in Part 2 screen the raw RT data for extreme values at 3 SDs.
Given the very small number of such outliers (none exceeding more
than 2% of a given data set), they are treated in the same way as other
incorrect responses. The small number of individual recognition time
outliers is due in part to a 5000 millisecond time limit for each item
presentation trial in the test. If no response is made before time is up, the item is timed out and logged as an incorrect response. As will be evident in the studies reported later, timed-out responses are rare, particularly in later trials.
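
A minimal sketch of this screening-plus-transformation sequence, assuming the 3 SD criterion and a log transformation (one common choice; the studies’ own analyses may differ):

```python
import math
from statistics import mean, stdev

def trim_outliers(rts_ms, n_sds=3.0):
    """Drop RTs more than n_sds SDs from the sample mean (the 3 SD
    screening criterion described above)."""
    m, sd = mean(rts_ms), stdev(rts_ms)
    return [rt for rt in rts_ms if abs(rt - m) <= n_sds * sd]

def log_transform(rts_ms):
    """Log transformation: one common way to reduce positive skew while
    preserving the ordering of the data points."""
    return [math.log(rt) for rt in rts_ms]

# Hypothetical RTs in ms; the 4900 ms value lies beyond 3 SDs and is dropped
raw = [640, 655, 670, 688, 700, 705, 712, 730, 745, 760, 775, 790, 4900]
screened = trim_outliers(raw)
print(log_transform(screened))
```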

Speed–Accuracy Trade-Offs Response time analyses are usually restricted


to performance on low or no-error data out of concern for potential
trade-offs in speed and accuracy. In the Timed Yes/No Test, a perfect
positive correlation (r = +1) between the size and mnRT scores means
that individuals with the highest accuracy are also the slowest respond-
ers and vice versa. A perfect negative correlation between size and mean
recognition time (−1.0) would indicate that the measures are redun-
dant, that size can be perfectly predicted by RT and vice versa. Perfect
correlations, of course, do not happen, so the interest is in the direction
and relative strength of the relationship. A strong positive correlation
indicates a systematic speed-accuracy trade-off, while a strong negative
correlation is consistent with the lexical facility account. A weak, or no,
correlation shows no systematic trade-off but does not preclude it
entirely. It would also provide little support for the lexical facility
proposal.

The potential for speed-accuracy trade-offs is greatly diminished by


ensuring that the test-taker works as quickly and accurately as possible.
This is facilitated by giving clear and effective instructions that emphasize
the importance of both speed and accuracy in responding. Limiting the
time available to give the response also helps mitigate the occurrence of
slower responses that might contribute to a trade-off effect. In the studies
reported in Part 2, the test-taker is given only five seconds to respond to each item. After the test is completed, it is also important to check the results for evidence of possible speed–accuracy trade-offs before statistical tests are run and the
findings interpreted (Heitz 2014).
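
A quick post hoc check of this kind can be as simple as correlating the accuracy and time scores across test-takers. A sketch with hypothetical values (statistics.correlation requires Python 3.10+):

```python
from statistics import correlation   # Python 3.10+

# Hypothetical per-test-taker scores: accuracy (VKsize) and raw mnRT (ms)
vksize = [0.82, 0.64, 0.91, 0.55, 0.73]
mnrt = [690, 840, 610, 930, 760]

# A strong positive r (more accurate but slower) would flag a trade-off;
# a negative r, as here, is consistent with the lexical facility account.
print(round(correlation(vksize, mnrt), 2))
```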

CV as a Dimension of Proficiency The consistency of recognition speed is


also a dimension of proficiency in the lexical facility construct. The CV
captures the relationship between the mnRT and the SD in a single
value and indexes the consistency of response speed in a set of responses,
independent of the mnRT. The changing relationship between the indi-
vidual’s mnRT and CV over time has been examined as a window on the
development of automaticity. A positive correlation between the mnRT
and the CV in the presence of decreasing mnRTs and CVs has been taken
as evidence for the emergence of automatic word recognition
skill (Segalowitz and Segalowitz 1993; Segalowitz et al. 1995, 1998). The
focus here is not on emergence of automaticity—much of the data will
come from users whose processing skill is far from automatic. Rather, in
the lexical facility account, lower CVs in the presence of faster mnRTs are
examined as a potential index of increasing word recognition skill in the
L2, which in turn is reflected in better L2 performance.
Composite Measures of Lexical Facility

Lexical facility is composed of vocabulary size, mean response time, and


the CV. The sensitivity of these measures to differences in performance in selected domains of academic English is the focus of the studies reported
in subsequent chapters. The measures are evaluated individually and in
combination. The combined effect of the measures is examined in two
ways. The primary way is through the use of composite measures, which
provide a single score for performance. Composite scores can be obtained
either by averaging or summing the component scores or by weighting
the contribution of each score individually (Moses 2013). The scores are
assumed to represent a unitary underlying construct, so it is important
that the component scores correlate with each other but at the same time
provide unique information about the construct (Sawaki 2007).
Composite scores are also useful when there is a possible trade-off in per-
formance across the measures involved, as is the case here (Waters and
Caplan 2003). The formulas for the composite measures used in the
study are presented in Fig. 5.3.

Measures                     Label               Formula
Size & speed                 VKsize_mnRT         ((zVKsize + zmnRT)/2) + 5
Size & consistency           VKsize_CV           ((zVKsize + zCV)/2) + 5
Size, speed & consistency    VKsize_mnRT_CV      ((zVKsize + zmnRT + zCV)/3) + 5

Key
VKsize: Vocabulary knowledge-size
mnRT: Mean response time
CV: Coefficient of variation
z: the standardized (z) score of the component measure

Fig. 5.3 Composite measure formulas


The composite measures are calculated from the three component


scores. The vocabulary size component is the proportion of incorrect
responses to pseudowords (false alarms) subtracted from correct responses
to words (hits), or hits minus false alarms. The size measure is labeled
VKsize to denote that it is a measure of vocabulary knowledge based on a frequency-indexed size measure that is adjusted for guessing. It is at
best an approximate measure of the individual’s actual vocabulary size.
The other two component scores are mean recognition times (mnRT) for
all the words correctly identified (hits) and the CV, which is the ratio of
the SD of the mnRT to the mnRT itself. The three composite measures
in Fig. 5.3 represent all three possible combinations of the size, speed,
and consistency dimensions. Each score is the average of the component
z scores plus a value of 5 added to eliminate negative values. A main focus
of the empirical research is how these measures compare with the indi-
vidual component scores in accounting for differences in L2 perfor-
mance. Although included for completeness, the VKsize_CV measure is
of limited usefulness. Low CV values reflect high consistency, but this
can be evident in consistently fast or consistently slow performance. Thus
the measure only indicates greater skill when it is accompanied by fast
recognition times.
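
As a sketch of the calculation just described, assuming hypothetical scores for five test-takers (the signs of the time measures are flipped first so that higher values reflect better performance on every component):

```python
from statistics import mean, stdev

def zscores(xs):
    m, sd = mean(xs), stdev(xs)
    return [(x - m) / sd for x in xs]

def vksize_mnrt_cv(vksize, mnrt, cv):
    """VKsize_mnRT_CV composite: mean of the component z scores plus 5.

    The time measures are negated first so that higher values mean
    better performance on every component.
    """
    zv = zscores(vksize)
    zr = zscores([-x for x in mnrt])   # faster -> higher
    zc = zscores([-x for x in cv])     # more consistent -> higher
    return [(a + b + c) / 3 + 5 for a, b, c in zip(zv, zr, zc)]

# Hypothetical scores for five test-takers
print(vksize_mnrt_cv([0.82, 0.64, 0.91, 0.55, 0.73],
                     [690, 840, 610, 930, 760],
                     [0.21, 0.33, 0.18, 0.41, 0.26]))
```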
An alternative way to evaluate the combined effect of the measures is
through regression analyses. Hierarchical regression models allow the
contribution made by an individual variable to performance outcomes to
be fixed relative to the contribution of the other variables. In analyses
reported in Part 2, VKsize, mnRT, and CV are entered sequentially as
predictor variables and their effects on criterion performance differences (e.g., placement decisions) are evaluated (see Chaps. 8 and 10). The VKsize
measure is always entered first, both because knowing a word is logically
prior to being able to recognize it and because of the well-established
relationship between size and performance reviewed in Chaps. 1 and 2.

5.4 Administering the Test


The distinctive nature of the test means that test-takers must fully under-
stand the purpose of the test and the proper procedure for completing it.
Instructions

The unique format makes it particularly important that the instructions


are clear. The self-report nature of the response, the use of pseudowords,
and the need to respond as quickly as possible can all be a source of
potential confusion for the first-time test-taker. The format differs from
other vocabulary tests in that the test-taker merely reports whether the
presented item is known or unknown. There is no need to demonstrate
that knowledge. The test-taker must understand that the yes/no judge-
ment relates to whether the word is known and not whether it is a possible
English word. This is important because pseudowords observe English
sound and spelling conventions. Just because a word looks like an English
word does not mean it is one. It must also be made clear that answering
‘yes’ to pseudowords will reduce the score received.
Instructions can differ in how prescriptive they are for the yes/no deci-
sion. Eyckmans (2004) makes a distinction between lenient and stringent
instructions. An example of lenient instructions is given below from a test
given to French L1 students of Dutch. The instructions were glossed in
English and provided alongside the L1 French (Eyckmans 2004, p. 95).

Tick the words you know. Some of the words in the list do not exist in
Dutch.

This instruction is considered lenient because, in the words of the author,


it ‘left much to the imagination of the test taker’ (2004, p. 95). Note that
no guidance was given as to what it means to ‘know’ a word.
Compare a more stringent version used by the author:

Tick the words you know the meaning of. When in doubt, do not tick the
item. Notice that some of the words in the list do not exist in Dutch. After
completing this test, you will be asked to translate some of the words of the
list.

In this version, the test-taker is told to tick only the words for which
they know the meaning, to the degree that they can supply a transla-
tion. Also, the individual is explicitly cautioned against guessing. The
more explicit instructions are intended to diminish potential uncer-


tainty and thus result in a lower false-alarm rate (Eyckmans 2004,
pp. 95–96).
In the Timed Yes/No Test, the test-taker must also understand that
speed of response is as important as accuracy in assessing overall
­performance. The instructions should explicitly state that the test-taker
work as quickly and accurately as possible. As noted earlier, both exces-
sively fast responses made with less attention to accuracy and excessively
slow but deliberate responses can result in a trade-off in accuracy and
speed that makes the results difficult to interpret.
Figure 5.4 summarizes the elements of the test instruction set.
The use of bilingual instructions can also provide a greater degree of
confidence that the test-takers understand the task.

The instruction set should provide the following elements.

1. Description of the difference between words and pseudowords. It should be clear that the pseudowords look like possible words (i.e., are phonologically ‘legal’) but do not exist in the language.

2. Criteria for responding ‘yes’. What it means to know a word should be specified clearly (e.g., recognize it when encountered in a text).

3. Criteria for scoring. It should be specified that there is a penalty for guessing; incorrect ‘yes’ responses to pseudowords will result in lower scores.

4. Response behavior expectations. It should be emphasized that it is important to work as accurately and as quickly as possible.

5. Any follow-up activities. Planned or even possible follow-up activities can be specified where appropriate, e.g., you may be tested on some of the items after the test.

Fig. 5.4 Elements of the instruction set for the Timed Yes/No Test
Procedure

The Timed Yes/No Test used in the empirical research is administered


using LanguageMAP, an online test developed at the University of
Queensland, Australia. It is available at www.languagemap.com. The test-­
taker is presented with a single item on a computer screen and gives a
simple yes/no response as to whether they know the item. Other criteria
are also possible, such as whether the word exists in the language being
tested. The items are presented in randomized order for each test-taker,
with responses collected through keyboard presses. Response time is mea-
sured from the moment an item appears on the screen until the test-taker
presses the relevant key. The test uses a 5000 millisecond time limit for
each trial, well beyond the upper range of typical item response times
(900–1500 milliseconds). If the cutoff time is reached before a test-taker
gives a response, the item disappears from the screen and an incorrect
score is recorded. The knowledge that there is a cutoff time helps to elimi-
nate excessively long response times, which can result in outliers. The
time limit can also motivate test-takers to respond more quickly than
they might otherwise.
Administration can be by individuals or in a group, and the online
format means that the responses are collected and scored automatically.
After the instructions are given, the test-taker completes a set of practice
items, with feedback, to ensure that they understand the procedure. The
data reported here were collected in computer labs or computer-equipped
classrooms.

5.5 The Timed Yes/No Test as an L2 Vocabulary Task
The distinctive nature of the Timed Yes/No Test can be placed within the
larger field of L2 vocabulary testing by locating it along the four vocabu-
lary task dimensions identified by Read (2000). The four dimensions are
presented as dichotomies, with the Timed Yes/No Test’s position given first in each pair. The test is considered discrete (versus embedded), selective (versus comprehensive), context independent (versus context dependent), and low stakes (versus high stakes). These dimensions affect how the test scores are interpreted and
used.

Discrete (Versus Embedded)

This dimension concerns the degree of task independence. Discrete ver-


sus embedded vocabulary measures differ by the extent to which vocabu-
lary is measured as a discrete, independent construct—as opposed to
being part of a larger construct such as reading comprehension. The
Timed Yes/No Test focuses on size and speed measures (mean recognition
time and consistency), which are then correlated with performance in
various domains. As Read (2000, p. 9) notes, most existing vocabulary tests assume, at least to some degree, that vocabulary knowledge is an independent construct. The Timed Yes/No Test is unequivocally independent and thus discrete.

Selective (Versus Comprehensive)

The second task dimension concerns the nature of the vocabulary being
tested. Selective versus comprehensive measures differ according to the
range of vocabulary used in the assessment. Selective tests incorporate a
set of target items from a text and the test-taker is tested on just these
items. A comprehensive measure, in contrast, takes into account all
vocabulary content in the test material. Knowledge of individual words is
not assessed; rather, overall vocabulary use is rated and a judgment made
as to the individual’s relative level of vocabulary mastery. In Read’s terms,
the Timed Yes/No Test is a selective test because it consists of a set of
items drawn from frequency lists. However, as with comprehensive mea-
sures, the focus lies not on whether specific items are known, but rather
on the proportion of items known at different frequency levels. In this
way, the items are representative of frequency bands and assessed as such.
In principle, items can be sampled at random from matched frequency
levels and yield the same measure, despite being different items. Therefore,
although the Timed Yes/No Test is selective in that a set of target items at
different frequency levels are tested, it is also comprehensive in that per-
formance on the target items is assumed to represent the proportion of
words known at the frequency level in question. For example, a score of
80% on 20 words sampled from the 2K band is interpreted as evidence
that the test-taker knows 800 of the words in the 2K band. It does not
target knowledge of specific words.
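
The band-to-size extrapolation is simple arithmetic; a minimal sketch (hypothetical proportions, ignoring the false-alarm adjustment discussed earlier):

```python
def band_size_estimate(band_scores, band_width=1000):
    """Extrapolate per-band hit proportions to a size estimate.

    band_scores: proportion correct for each sampled 1K band, e.g.
    {'1K': 0.95, '2K': 0.80}; each proportion stands for the 1000
    words of its band, as in the 80% -> 800 words example above.
    """
    return sum(round(p * band_width) for p in band_scores.values())

print(band_size_estimate({'1K': 0.95, '2K': 0.80, '3K': 0.60}))   # 2350
```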

Context Independent (Versus Context Dependent)

The third dimension focuses on the role of context. Context-independent


tests can be completed without reference to any context, while context-­
dependent tests require the test-taker to, in Read’s words, ‘engage with the text’ (p. 11). The Timed Yes/No Test format is clearly a context-
independent test, in which the focus is on quantifying vocabulary
knowledge and speed based on objective word frequency statistics. It
does not measure context-appropriate vocabulary use. The context-­
independent nature of the test aligns with the trait approach to charac-
terizing L2 vocabulary knowledge that the lexical facility account
embodies.

Low Stakes (Versus High Stakes)

The fourth dimension involves the perceived importance of the test out-
comes. To date, the Timed Yes/No Test has been administered in set-
tings and for purposes where there is relatively little at stake for the
test-taker. The results reported in Part 2 were collected in research proj-
ects and pilot testing programs that did not have an immediate bearing
on the test-taker’s course of study or future goals, making the test low stakes. Although a low-stakes test, the Timed Yes/No Test can poten-
tially be used to complement high-stakes testing functions related to
university entrance and placement decisions, as illustrated in Chaps. 8
to 10.
In summary, using Read’s framework, the Timed Yes/No Test is a discrete, context-independent, low-stakes test. The test is more difficult to place on the selective–comprehensive continuum, as it has elements of both. The discrete and context-independent nature of the format aligns with the trait characterization of lexical facility as a measurement construct discussed in Chap. 4.

5.6 Lexical Facility in English


The lexical facility account is tested in performance by L2 English users
of academic English. Examined is the sensitivity of the size, speed, and
consistency measures to performance differences in standardized profi-
ciency tests, placement testing, and English language and general aca-
demic outcomes in English-medium academic settings. Lexical facility is
assumed to be a facet of both spoken and written language, but the focus
here is mainly on the written language, although some spoken language
data are presented.
The focus on academic English performance is motivated in part by
the sheer size of the domain. The number of L2 English students involved
in English-medium academic study continues to expand rapidly, both in
traditional English L1 countries such as Australia, the US, and the UK,
and in the growing number of English-medium programs in coun-
tries where English is not the dominant language. However, the focus on
academic English does not imply that the lexical facility proposal is not
relevant to other domains of L2 English performance. The bedrock nature
of lexical facility for language comprehension, and arguably for produc-
tion as well, means that it is implicated in all language performance.
The focus primarily, but not exclusively, on written English means that
some caution is needed in extending the results to other languages,
especially those that are not alphabet-based. The irregular sound–spelling
relationship characteristic of many English words means that the devel-
opment of fluent word recognition skills presents a challenge to many L2
English learners (Norris 2013). Fender (2008) reported that Arabic L1
readers of English demonstrated poorer English word recognition skills
than proficiency-matched English L2 learners from Chinese, Korean,


and Japanese L1 backgrounds. The fact that L1 orthography can be a
potential source of performance variability in lexical tests such as the
Timed Yes/No Test certainly affects the way the results can be interpreted
(Koda 2005).
L1 background may also affect test performance through cognate
effects. These can be evident in the processing of both words and pseudo-
words. The French L1 learners of Dutch studied by Eyckmans (2004) had
much higher false-alarm rates than the Asian-background ESL users in
Mochida and Harrington’s (2006) study, possibly due to the similarity of
pseudowords to actual words in the L1. The effect of L1 cognates on test
performance may be a particular problem for lower-proficiency learners
(Meara 1996).
However, while cross-linguistic effects are expected, the underlying logic regarding the importance of vocabulary size and speed in performance is assumed to hold for all languages, all things being equal.

5.7 Conclusions
The Timed Yes/No Test is an online assessment tool that is used to test the
lexical facility construct. The test format has a number of features that
distinguish it from other approaches to L2 vocabulary testing. These fea-
tures and the motivation for their use were explained. Two features that
received particular attention were the use of pseudowords as a means to
control for guessing, and the use of recognition speed and consistency as
vocabulary knowledge measures. These features raise theoretical and
methodological challenges for the research reported in Part 2.
The central aim of the empirical research program is to establish the
sensitivity of the three measures, individually and in combination, to
learners’ performance differences. The combined effect of the measures
will be evaluated using composite scores, which is another distinctive fea-
ture of the approach. The studies in Part 2 test the lexical facility proposal
by examining Timed Yes/No Test performance as a predictor of outcomes
in various domains of academic English in both ESL and EFL settings.

References
Akamatsu, N. (2008). The effects of training on automatization of word recog-
nition in English as a foreign language. Applied Psycholinguistics, 29(02),
175–193. doi:10.1017/S0142716408080089.
Anderson, R. C., & Freebody, P. (1981). Vocabulary knowledge. In J. T. Guthrie
(Ed.), Comprehension and teaching: Research reviews (pp. 77–117). Newark:
International Reading Association.
Anderson, R. C., & Freebody, P. (1983). Reading comprehension and the assess-
ment and acquisition of word knowledge. In B. Huston (Ed.), Advances in
reading/language research (Vol. 2, pp. 231–256). Greenwich: JAI Press.
Beeckmans, R., Eyckmans, J., Janssens, V., Dufranne, M., & Van de Velde, H.
(2001). Examining the yes/no vocabulary test: Some methodological issues
in theory and practice. Language Testing, 18(3), 235–274.
Cameron, L. (2002). Measuring vocabulary size in English as an additional lan-
guage. Language Teaching Research, 6(2), 145–173.
Davies, M. (2008). The corpus of contemporary American English: 450 million words, 1990–present. Available online at http://corpus.byu.edu/coca/
Eyckmans, J. (2004). Learners’ response behavior in Yes/No vocabulary tests. In
H. Daller, M. Milton, & J. Treffers-Daller (Eds.), Modelling and assessing
vocabulary knowledge (pp. 59–76). Cambridge: Cambridge University Press.
Fender, M. (2008). Spelling knowledge and reading development: Insights from
Arab ESL learners. Reading in a Foreign Language, 20(1), 19–42.
Green, D., & Swets, J. A. (1966). Signal detection theory and psychophysics.
New York: Wiley.
Harsch, C., & Hartig, J. (2015). Comparing C-tests and yes/no vocabulary size
tests as predictors of receptive language skills. Language Testing, 33(4),
555–575.
Heitz, R. P. (2014). The speed-accuracy tradeoff: History, physiology, methodol-
ogy, and behavior. Frontiers in Neuroscience, 8, 150.
Huibregtse, I., Admiraal, W., & Meara, P. (2002). Scores on a yes-no vocabulary
test: Correction for guessing and response style. Language Testing, 19(3),
227–245.
Jiang, N. (2012). Conducting reaction time research in second language studies.
London/New York: Routledge.
Keuleers, E., & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword generator. Behavior Research Methods, 42(3), 627–633. doi:10.3758/BRM.42.3.627.
Koda, K. (2005). Insights into second language reading: A cross-linguistic approach. New York: Cambridge University Press.
Leech, G., Rayson, P., & Wilson, A. (2001). Word frequencies in spoken and writ-
ten English. London: Longman.
Luce, R. D. (1986). Response times. New York: Oxford University Press.
McCarthy, M. (1998). Spoken language and applied linguistics. Cambridge:
Cambridge University Press.
Meara, P. (1989). Matrix models of vocabulary acquisition. AILA Review, 6,
66–74.
Meara, P. (1996). The dimensions of lexical competence. In G. Brown,
K. Malmkjaer, & J. Williams (Eds.), Performance and competence in second
language acquisition (pp. 35–53). Cambridge: Cambridge University Press.
Meara, P., & Buxton, B. (1987). An alternative to multiple choice vocabulary
tests. Language Testing, 4(2), 142–145.
Meara, P. M., & Miralpeix, I. (2006). Y_Lex: The Swansea advanced vocabulary
levels test. v2.05. Swansea: Lognostics.
Mochida, A., & Harrington, M. (2006). The yes-no test as a measure of receptive vocabulary knowledge. Language Testing, 23(1), 73–98. doi:10.1191/0265532206lt321oa.
Moses, T. (2013). Alternative smoothing and scaling strategies for weighted com-
posite scores. Educational and Psychological Measurement, 74(3), 516–536.
Norris, D. (2013). Models of visual word recognition. Trends in Cognitive
Sciences, 17(1), 517–524. doi:10.1016/j.tics.2013.08.003.
Pellicer-Sánchez, A., & Schmitt, N. (2012). Scoring yes-no vocabulary tests:
Reaction time vs. nonword approaches. Language Testing, 29(4), 489–509.
doi:10.1177/0265532212438053.
Read, J. (2000). Assessing vocabulary. Cambridge: Cambridge University Press.
Sawaki, Y. (2007). Construct validation of analytic rating scales in a speaking
assessment: Reporting a score profile and a composite. Language Testing,
24(3), 355–390. doi:10.1177/0265532207077205.
Segalowitz, N., & Segalowitz, S. J. (1993). Skilled performance, practice and
differentiation of speed-up from automatization effects: Evidence from sec-
ond language word recognition. Applied Psycholinguistics, 14(3), 369–385.
doi:10.1017/S0142716400010845.
Segalowitz, N., Watson, V., & Segalowitz, S. J. (1995). Vocabulary skill: Single
case assessment of automaticity of word recognition in a timed lexical deci-
sion task. Second Language Research, 11(2), 121–136.
Segalowitz, N., Segalowitz, S. J., & Wood, A. G. (1998). Assessing the develop-
ment of automaticity in second language word recognition. Applied Psycholinguistics, 19(1), 53–67.
Siakaluk, P. D., Buchanan, L., & Westbury, C. (2003). The effect of semantic
distance in yes/no and go/no-go semantic categorization tasks. Memory &
Cognition, 31(1), 100–113.
Sternberg, S. (1998). Inferring mental operations from reaction time data: How
we compare objects. In D. N. Osherson, D. Scarborough, & S. Sternberg
(Eds.), An invitation to cognitive science, Methods, models, and conceptual issues
(Vol. 4, pp. 436–440). Cambridge, MA: MIT Press.
Thoma, D. (2009). Strategic attention in language testing. Metacognition in yes/no
business English vocabulary test. Frankfurt: Peter Lang.
Waters, G. S., & Caplan, D. (2003). The reliability and stability of verbal work-
ing memory measures. Behavior Research Methods, Instruments, and Computers,
35(4), 550–564. doi:10.3758/BF03195534.
Part 2
Introduction

1.1 Overview
Part 1 introduced the lexical facility construct. Lexical facility combines
size and processing skill as a unitary second language (L2) vocabulary
skill construct. The challenges arising from combining the two was
acknowledged, but the case was made for treating the two as a unitary
construct, both due to the time-contingent nature of L2 vocabulary
knowledge and the potential utility of combining knowledge and skill as
a measurement tool to characterize individual and group differences in
L2 proficiency and performance.
Part 2 provides empirical evidence for the account. It presents a set of
studies that investigate the lexical facility measures (vocabulary knowl-
edge, mean recognition time, and consistency) as reliable indices of indi-
vidual differences in L2 vocabulary skill, separately and in combination
(Chap. 6). The sensitivity of the measures to performance differences in
selected domains of academic English performance is then examined.
These domains consist of university entry standards (Chap. 7), perfor-
mance on the International English Language Testing System (IELTS;
Chap. 8), language program placement (Chap. 9), and general and aca-
demic English classroom performance (Chap. 10). A summary chapter
that identifies the main findings is also included (Chap. 11). The data
presented here are drawn from published and unpublished research by
the author and colleagues. In the final chapter (Chap. 12), the implica-
tions for L2 vocabulary teaching, learning, and testing are discussed,
including the potential for incorporating time measures into models of
L2 vocabulary acquisition and L2 theory more generally.

1.2 Aims of the Empirical Research


The research aims to establish lexical facility as a context-independent
index of L2 vocabulary skill that is sensitive to performance differences in
various academic English domains. The research program reported in
Part 2:

1. compares the three measures of lexical facility (VKsize, mnRT, and CV) as
stable indices of L2 vocabulary skill;
2. evaluates the sensitivity of these measures individually and as composites to
differences in a range of academic English domains; and, in doing so,
3. establishes the degree to which the composite measures combining the
VKsize measure with the mnRT and CV measures provide a more sensi-
tive measure of L2 proficiency differences than the VKsize measure alone.

1.3 An Overview of Methods Used


The seven studies reported in Part 2 use the same approach to testing the
lexical facility proposal, with the methodology adapted in minor ways to
the particular study. These methods and their rationale are described here.
Each of the studies collected data for the lexical facility measures with
the Timed Yes/No Test, described in Chap. 5. The test elicits a yes/no
response on test items that yields an estimate of the individual’s English
vocabulary size and a measure of the speed and consistency with which
these words are recognized. The test items are both actual words and
pseudowords—orthographically possible but nonexistent words inter-
spersed randomly as a control for guessing. Word items are drawn from a
range of frequency-of-occurrence bands, and the proportion of words
recognized (hits) across the range provides an estimate of the individual’s
vocabulary size. The incorrect recognition of pseudowords as words (false
alarms) is used to adjust the final score. The computerized test format
presents the test items individually in a randomized order for each test-­
taker. The test records test-takers’ yes/no responses and the time they take
to recognize each item. These responses are used to calculate individual
and composite measures of lexical facility. These are described next.

Individual Measures

The test responses are used to calculate measures of vocabulary knowl-


edge, mean recognition time, and recognition time consistency, as well as
composites combining these measures. The vocabulary knowledge score
(VKsize) is the proportion of hits (‘yes’ responses to words) minus the
proportion of false alarms (‘yes’ responses to pseudowords). The ‘hit
minus false alarm’ score is an estimate of an individual’s vocabulary size,
or breadth. Two individuals with the same hit rate (an index of size) but
different false-alarm rates will therefore also have different VKsize scores.
The notation VKsize is used to denote the fact that the measure is not a
direct estimate of the individual’s vocabulary size, but rather a measure of
vocabulary knowledge based on a frequency-indexed size measure that
also takes guessing into account.
The second response measure is mean recognition speed (mnRT),
based on the average of all the individual recognition times for the correct
hits. The third measure is the coefficient of variation (CV), which is an
index of the consistency of speed with which a set of words is recognized.
It is a single value that reflects the relationship between the variability of
the response times in a given set, as reflected in the standard deviation
(SD), and the mean response time itself. The CV is not collected directly
by the test but is derived from the mnRT and SD responses. It is the ratio
of the SD of the mnRT to the mnRT itself (SDmnRT/mnRT).
Hits ‘Yes’ responses to words
VKsize Accuracy score providing an indirect measure of vocabulary size:
proportion of hits minus false alarms (‘yes’ responses to
pseudowords)
mnRT Mean response time of correct hits
CV Coefficient of variation: ratio of the standard deviation of the mean
RT to the mean RT (SDmnRT/mean RT)
Composite Measures

Two composite measures are also examined. The VKsize_mnRT measure


combines the VKsize scores and mnRTs as a composite of size and speed.
The VKsize_mnRT_CV measure combines all three measures in a single
value. The third possibility, VKsize_CV, which combines the VKsize
scores and the CV, is not examined, as the CV is only interpretable in
combination with the mnRT. It is possible for a test-taker to be very slow
and very consistent.
The composite measures are calculated by first switching the sign on
the raw mnRT and CV measures so that higher values reflect better per-
formance. This makes the scores consistent with the VKsize values and
permits standardized (z) scores for the three measures to be added together
and averaged. Because the use of standardized scores usually results in
some negative scores, a value of 5 is added to each score to make all the
results positive. The composite scores are compared with the individual
scores for how reliably they discriminate among levels and groups, and
how large an effect they have on the criterion differences. Of particular
interest is whether the combination of VKsize and mnRT/CV is more
sensitive to criterion differences than the VKsize measure alone.

Establishing Response Reliability

In all the studies, the responses are initially examined for factors that may
affect the outcomes, independent of the research variables of interest.
These potentially compromising factors are both general to quantitative
measurement research and specific to the use of the Timed Yes/No Test
format. The raw findings are examined for instrument reliability, exces-
sive false-alarm rates, the occurrence of outliers, and a potential speed–
accuracy trade-off in responses.

Instrument Reliability Instrument reliability reflects the degree to which


a test consistently measures what it is intended to measure; that is, it gives
the same results every time it is used (in a hypothetically similar situa-
tion). The reliability of the Timed Yes/No Test is assessed by Cronbach’s
alpha coefficient, a widely used measure of the consistency of responses


across the items. Test instruments are assumed to be minimally reliable if
the coefficient exceeds .7 on the 0–1 scale. However, values above .8 are
preferred for cognitive tests (Field 2009). In the studies reported later, the
values range from mid-.8 to mid-.9.
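
For reference, Cronbach’s alpha can be computed directly from an item-response matrix; a minimal sketch with hypothetical 0/1 responses:

```python
from statistics import pvariance

def cronbach_alpha(matrix):
    """Cronbach's alpha for a test-takers x items matrix of 0/1 scores:
    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(matrix[0])                           # number of items
    item_vars = sum(pvariance(col) for col in zip(*matrix))
    total_var = pvariance([sum(row) for row in matrix])
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical responses: 4 test-takers x 5 items (1 = correct)
data = [[1, 1, 1, 0, 1],
        [1, 0, 1, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 1, 0, 0]]
print(round(cronbach_alpha(data), 2))            # 0.81
```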

Excessive False-Alarm Rates A potentially compromising factor unique to


the Yes/No Test format is the presence of high false-alarm rates. An indi-
vidual’s false-alarm rate is subtracted from the hit rate (i.e., the number
of actual words they correctly recognize) to yield the VKsize score. Higher
false-alarm rates result in lower VKsize scores, and excessively high false-­
alarm rates call into question the viability of the score as a valid measure
of underlying vocabulary knowledge. High individual false-alarm rates
may result when the test-taker genuinely confuses pseudowords with
known words, or does not pay close attention during testing, or fails to
understand the task demands. Extremely high or low false-alarm and hit
rates together are indicative of a pronounced tendency to respond to all
items with either a ‘yes’ (high rates) or a ‘no’ (low rates). These cases need
to be removed from the analysis. It is difficult to specify what constitutes
a reasonable maximum false-alarm rate. In the studies reported in Part 2,
the mean group false-alarm rates range from around 5% to 20%, the
range reflecting decreases in group proficiency. Within these group means, individual false-alarm rates can be very high, with cases removed
from the study only when the rate exceeded 45%. This is a very high level
of guessing, but in most of the studies, a lower threshold would result in
a substantial loss of data. If a number of test-takers must regularly be
discarded because of excessive false-alarm rates, the instrument is of lim-
ited application in either research or assessment. Still, the high rate of
guessing in some studies and groups within studies does raise questions
about the validity of the measure. To address this concern, in several stud-
ies, a follow-up statistical analysis is performed in which the false-alarm
rate threshold is lowered, scores exceeding the threshold are removed, and
the data analysis is run again.

Outliers A common problem encountered in response time data analysis


is the occurrence of outlier values. These are item response times that are
either too fast or too slow to reflect the cognitive process of interest.
Random finger presses, lapses of attention, or external distractions can all
contribute to responses that do not reflect the word recognition process.
Outliers are identified here using an absolute value approach in which
response times faster than 300 milliseconds and slower than 5000 milli-
seconds are the low and high cut-off values (Jiang 2012). The high cut-off
value is the item time-out value for the test program and any response
beyond this time is automatically discarded. The time-out value is set at
5000 milliseconds to accommodate the lower-proficiency test-takers in
several of the studies. The data points that fell below the low cut-off of
300 milliseconds are simply removed. These involved only a handful of
data points in any given study, well below 1% of the data.
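A sketch of the absolute-value screen in Python, assuming an array of one
participant's per-item response times in milliseconds; the data are fake
and the cut-offs are those given above.

import numpy as np

rts = np.array([250, 640, 812, 1050, 4990, 730])  # fake response times (ms)
kept = rts[(rts >= 300) & (rts < 5000)]  # 5000 ms is the program's time-out
print(kept, kept.mean())  # mnRT is then computed over the retained hits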

Speed–Accuracy Trade-Off A potential confounding factor in the Timed
Yes/No Test format is the possibility of a trade-off between speed and
accuracy in individual test performance. Test-takers are instructed to
respond as quickly and accurately as possible on every trial. Given the
dual dimension of the task, it is possible that individuals might differ in
how much emphasis they give to the respective dimensions. Some may
work very quickly at the expense of accuracy, while others may work
slowly but very carefully. In the studies reported here, the VKsize scores
are correlated with the inverse of the mnRT and CV scores to avoid the
presence of minus signs in the reporting and discussion of results. If
greater size (VKsize) and higher speed (inverted mnRT or CV) are both
elements of greater lexical facility, a positive correlation should exist
overall between the two. Evidence of a systematic speed–accuracy trade-off
would be a significant negative correlation.
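A sketch of this check in Python, with fake per-participant values;
negating the mnRTs is one plausible way of 'inverting' them so that
higher values mean faster performance.

import numpy as np
from scipy import stats

vk = np.array([0.30, 0.45, 0.60, 0.72, 0.85])   # fake VKsize scores
mnrt = np.array([1700, 1400, 1100, 950, 800])   # fake mean RTs (ms)
r, p = stats.pearsonr(vk, -mnrt)                # invert RTs by negation
print(round(r, 2), round(p, 3))
# A strong positive r argues against a systematic trade-off; a significant
# negative r would be evidence for one.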

Evaluating the Sensitivity of the Lexical Facility Measures

The focus of the empirical research program is establishing the sensitivity
of the lexical facility measures to the various performance criteria.
Sensitivity is reflected in the degree to which the measures discriminate
between levels in a given criterion and in the effect size of these
differences. In each study, the descriptive results are first presented,
followed by the inferential statistics used to test the sensitivity of the
measures.

Descriptive Statistics The means, SDs, and confidence intervals (CIs) for
the lexical facility measures are reported in all studies. The value of the CI
as a statistical measure for both descriptive and inferential statistics is
being increasingly recognized in L2 research (Larson-Hall and Herrington
2010; Larson-Hall and Plonsky 2015). The CI is a range of values that is
likely to include the true population mean estimated from the sample. A
bootstrapped (see below) 95% CI is reported in all the studies, meaning
that the interval between the lower- and upper-bound values is expected
to contain the true mean 95% of the time. A lack of overlap in the CIs
of two mean values indicates a statistically significant difference between
them.
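As an illustration, a bootstrapped BCa interval of this kind can be
computed with scipy.stats.bootstrap (SciPy 1.7 or later); the ten scores
below are fake, and 1000 resamples matches the studies reported here.

import numpy as np
from scipy import stats

scores = np.array([.31, .42, .28, .55, .47, .39, .61, .25, .50, .44])
res = stats.bootstrap((scores,), np.mean, n_resamples=1000,
                      confidence_level=0.95, method='BCa',
                      random_state=np.random.default_rng(1))
print(res.confidence_interval)  # lower- and upper-bound values for the mean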

Discriminating Between Performance Levels Statistical tests are used to
establish the sensitivity of the lexical facility measures to criterion
performance differences, both individually and in combination. The
sensitivity of each measure is based on the statistical significance of the
test and the accompanying effect size. The alpha value for all the studies
is .05, and specific adjustments are made where appropriate for multiple
comparisons.

Group mean differences are tested using t-tests for comparisons involving
two groups and an analysis of variance (ANOVA) for comparisons of more
than two groups when the relevant assumptions are met. The t-test and
ANOVA assume that the data are normally distributed and exhibit
homogeneity of variance; that is, that the SDs of the samples are
approximately equal. The data were tested for the key assumptions of
normality and equality of variance, which were generally, but not always,
met. Where variance assumptions are not met for the standard ANOVA,
Welch's ANOVA is used for the omnibus test and the Games–Howell test
for any follow-up pairwise comparisons (Tabachnick and Fidell 2013).
The studies here use bootstrapping for calculating mean CIs. Bootstrapping
provides a more robust way to deal with non-normally distributed
data than the use of nonparametric tests, particularly for smaller sample
sizes (Larson-Hall and Herrington 2010).
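The analyses in the book were run in SPSS; as an illustration only,
Welch's omnibus statistic can be hand-rolled in a few lines of Python
using the standard Welch formulas. The three fake groups below loosely
mimic the design; Games–Howell follow-ups are available in third-party
packages such as pingouin.

import numpy as np
from scipy import stats

def welch_anova(*groups):
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    m = np.array([np.mean(g) for g in groups])
    v = np.array([np.var(g, ddof=1) for g in groups])
    w = n / v                              # weights n_i / s_i^2
    grand = (w * m).sum() / w.sum()        # variance-weighted grand mean
    a = (w * (m - grand) ** 2).sum() / (k - 1)
    tail = ((1 - w / w.sum()) ** 2 / (n - 1)).sum()
    f = a / (1 + 2 * (k - 2) / (k ** 2 - 1) * tail)
    df1, df2 = k - 1, (k ** 2 - 1) / (3 * tail)
    return f, df1, df2, stats.f.sf(f, df1, df2)  # F, dfs, p-value

rng = np.random.default_rng(2)
g1, g2, g3 = (rng.normal(mu, sd, 35) for mu, sd in
              [(1656, 332), (963, 203), (777, 203)])
print(welch_anova(g1, g2, g3))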

The strength of association between the predictor lexical facility measures
and criterion performance variables is measured using the Pearson
product–moment correlation for bivariate correlations. In instances where
the statistical significance of the difference between two bivariate
correlations is of interest, the Fisher r-to-z transformation developed
by Richard Lowry and available at http://vassarstats.net/tabs_rz.html is
used. The data were analyzed using SPSS (Statistical Package for the
Social Sciences). The program reports significance levels only to three
decimal places; in instances where a value of p = .000 is obtained, it is
reported as p < .001. Bootstrapped CIs for the group means are reported
throughout. This permits all the results to be presented in a uniform
manner, regardless of the size of the sample or whether a particular data
set met all the assumptions for the use of the parametric measures.
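For the r-to-z comparison mentioned above, the standard calculation for
two independent correlations is simple enough to reproduce; a minimal
Python sketch with fake values follows.

import numpy as np
from scipy import stats

def fisher_rz(r1, n1, r2, n2):
    z1, z2 = np.arctanh(r1), np.arctanh(r2)      # Fisher r-to-z
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    return z, 2 * stats.norm.sf(abs(z))          # z and two-tailed p

print(fisher_rz(.68, 110, .51, 110))             # fake example values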

Interpreting the Effect Size The other element of sensitivity is the effect
size, which is the strength of the measure as a discriminator of criterion-level
differences. In correlation and regression analyses, the effect size is
given directly by the r-value. This value squared, R2 (also called the
coefficient of determination), represents the proportion of variance in the
criterion variable, for example, the proficiency levels, attributable to
differences in the predictor variable.

For the tests of group mean differences, standardized effect sizes are
calculated separately. The t-test uses Cohen's d, and the ANOVA uses η2
(eta-squared) (Fritz et al. 2011). The relative importance of the
observed effect sizes is interpreted using a recently introduced scale for
interpreting the r- and d-values in L2 research (Plonsky and Oswald 2014,
p. 899). The scale revises upward the widely used benchmarks suggested
in Cohen (1988). Benchmark values for the interpretation of d are small
(d = .40), medium (d = .70), and large (d = 1.00). Plonsky and Oswald
(2014) note that these values pertain to between-group contrasts, with
pre–post and within-group contrasts requiring larger effect sizes. The
benchmarks for these contrasts are small (d = .60), medium (d = 1.00), and
large (d = 1.40). Between-group contrasts involving proficiency levels are of
primary interest in the studies presented in the following chapters, but
within-group contrasts will also be relevant when comparing performance
over item frequency bands. The benchmarks for r are small (r = .25),
medium (r = .40), and large (r = .60).
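As an illustration of the two effect size statistics, here is a short
Python sketch; the Hedges-corrected variant of d anticipates its use with
unequal sample sizes in the studies below, and all names are invented for
the example.

import numpy as np

def cohens_d(x, y, hedges=True):
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * np.var(x, ddof=1) +
                         (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2))
    d = (np.mean(x) - np.mean(y)) / pooled_sd
    return d * (1 - 3 / (4 * (nx + ny) - 9)) if hedges else d

def eta_squared(*groups):
    allv = np.concatenate(groups)
    ss_between = sum(len(g) * (np.mean(g) - allv.mean()) ** 2 for g in groups)
    ss_total = ((allv - allv.mean()) ** 2).sum()
    return ss_between / ss_total  # proportion of total variance between groups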
The following presents the empirical evidence for the lexical facility
proposal. Chapters 6, 7, 8, 9, 10, and 11 report on a series of empirical
studies, and the final chapter, Chap. 12, discusses the implications and
way forward for the lexical facility account.

References
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.).
Hillsdale: Lawrence Erlbaum.
Field, A. (2009). Discovering statistics using SPSS (3rd ed.). London: Sage.
Fritz, C. O., Morris, P. E., & Richler, J. J. (2011). Effect size estimates: Current
use, calculations, and interpretation. Journal of Experimental Psychology:
General, 141(1), 2–18. doi:10.1037/a0024338.
Jiang, N. (2013). Conducting reaction time research in second language studies.
London/New York: Routledge.
Larson-Hall, J., & Herrington, R. (2010). Improving data analysis in second
language acquisition by utilizing modern developments in applied statistics.
Applied Linguistics, 31(3), 368–390.
Larson-Hall, J., & Plonsky, L. (2015). Reporting and interpreting quantitative
research findings: What gets reported and recommendations for the field.
Language Learning, 65(S1), 127–159.
Plonsky, L., & Oswald, F. L. (2014). How big is "big"? Interpreting effect sizes
in L2 research. Language Learning, 64, 878–912. doi:10.1111/lang.12079.
Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics (6th ed.).
Boston: Pearson.
6 Lexical Facility as an Index of L2 Proficiency

Aims

• Evaluate the lexical facility measures as indices of second language (L2) proficiency.
• Examine the sensitivity of the measures to
–– well-defined proficiency levels
–– word frequency levels
• Assess the sensitivity of the measures independently and in combination.

6.1 Introduction
This chapter presents the first of seven studies that evaluate lexical facility
as a second language (L2) vocabulary construct. The study examines the
sensitivity of the lexical facility measures to group proficiency differences.
The focus is on three student groups that represent distinct populations
of English users at an Australian university. One is a group of English L2
students studying in a preuniversity language program, and the other two
are English L2 and first language (L1) university students studying in the
arts faculty. The sensitivity of the three lexical facility measures (vocabu-
lary size, mean recognition speed, and recognition speed consistency) to
group differences is examined for each measure individually and in com-
bination. Sensitivity refers to both how well the measures discriminate
between the groups and the strength of the observed differences.
The study tests a claim central—but not unique—to the lexical facility
account, namely that vocabulary size and recognition speed are strong cor-
relates of proficiency differences. The lexical facility account further pro-
poses that consistency in recognition speed, as indexed by the coefficient of
variation (CV), can complement these two dimensions as a reliable index
of recognition vocabulary skill. Of the three measures, vocabulary size has
been shown to be a particularly robust correlate of proficiency. The focus is
on whether the combination of size with speed and consistency scores
results in a more sensitive measure of group differences than size alone.
Vocabulary knowledge is measured using the Timed Yes/No Test. The
format was described in Chap. 5. A defining feature of the Yes/No Test
format is the use of words drawn from a range of frequency-of-occurrence
levels. A basic assumption is that test performance will systematically
decrease as word frequency decreases. The lower frequency a word has,
the less likely it will be known, and if known, the slower it will be recog-
nized. Construct validity for the test format thus depends on showing
that this predicted frequency effect is valid. This is done by demonstrat-
ing that the lexical facility measures are sensitive to differences in word
frequency levels in a manner analogous to that shown for the group dif-
ferences. Establishing the validity of this feature of the test format is
important for the lexical facility proposal itself, as the account makes the
fundamental assumption that vocabulary skill development is input
driven, with a word’s frequency of occurrence a fundamental predictor of
when and how well it will be learned.

6.2 Study 1: Lexical Facility as an Index of English Proficiency
This study examines the sensitivity of the lexical facility measures to dif-
ferences across three levels of English proficiency in university-age users.
It also investigates the sensitivity of the measures to differences in test
item word frequency. The three measures are vocabulary knowledge
(VKsize), which reflects vocabulary size (also referred to as breadth),
mean response time (mnRT), and the coefficient of variation (CV). The
effect of the measures is examined individually and in combination
through the use of composite scores.
A central concern here and in the studies presented throughout the
book is the extent to which the composite measures surpass the VKsize
measure alone in sensitivity. As discussed in Chaps. 1 and 2, vocabulary
size is a well-established index of L2 proficiency. At issue is whether, and
to what extent, the addition of processing speed measures (mnRT and
CV) increases the sensitivity of that index. Or, more precisely: do the
mnRT and CV measures, separately and together, combine with VKsize
to account for variance in the group differences beyond that attributable
to the vocabulary size measure itself?
The findings examined include data reported in Mochida and
Harrington (2006) and Harrington (2006), as well as previously unpub-
lished data.

Setting and Participants

The study was carried out at a large Australian university. Participants
were from groups representing three distinct populations of English users.
The groups consisted of students from a preuniversity English language
program (n = 32), L2 English students studying in the university's arts
faculty (n = 36), and their L1 English counterparts (n = 42). Participants
in the preuniversity language program were from East Asia, including
China, Vietnam, Thailand, Japan, and Korea. They were in the fifth level
of a full-time, seven-level program that serves as a pathway to university
studies. The L2 English university students had the same linguistic
background as the preuniversity group. All the students were in their first
or second semester of study. The participants in the L1 English group were
all Australian-educated native speakers of English. The preuniversity
group participated as volunteers as part of a class activity. The L2 and
L1 university students were recruited in a first-year linguistics course
and participated for course credit. About 60% of participants were female
across the groups.

Materials and Procedures

Each participant took a written version of the Timed Yes/No Test. The
test measures English vocabulary knowledge and recognition speed by
eliciting a yes/no decision as to whether a presented item is known. Word
items are drawn from four frequency-of-occurrence bands. Pseudowords
are also included as a control for guessing.
The test contained 90 words and 60 pseudowords for a total of 150
items. The word items were taken from the Vocabulary Levels Test (VLT)
introduced in Chap. 1 (Schmitt et al. 2001). The test includes 18 items
from each of four frequency-of-occurrence bands comprising the 2000
(2K), 3000 (3K), 5000 (5K), and 10,000 (10K) most frequently occur-
ring words in English. Also used are 18 words from the Academic Word
List, a set of words used frequently in academic texts (Coxhead 2000). The
latter were included in an earlier study by Mochida and Harrington
(2006) but are not examined here, given the focus on frequency as a pre-
dictor of vocabulary skill. The pseudowords included to control for guess-
ing all conform to English orthographic and phonological rules.
The test yields a score of vocabulary knowledge, VKsize, that approxi-
mates the individual’s vocabulary size. This measure is the proportion of
correct responses to the frequency-graded word items (also referred to as
‘hits’) minus the proportion of incorrectly identified pseudowords (‘false
alarms’). As the false-alarm rate is used to correct the overall number of
hits, VKsize is an indirect measure of size. Recognition speed is measured
by calculating the mean recognition time, mnRT, for the correctly recog-
nized hits. The mnRT score is reported in milliseconds (1000 millisec-
onds = 1 second). The third lexical facility measure is the coefficient of
variation (CV), which reflects the consistency of recognition time
performance as measured by the mnRT. The lexical facility account is the
first to examine the CV as a useful index of L2 vocabulary development.
The CV is a single value that reflects the relationship between the stan-
dard deviation (SD)—a measure of the variability of the response times
in the set—and the mean response time itself. It is the ratio of the SD of
the mnRT to the mnRT itself (SDmnRT/mnRT).
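For one participant, the computation is a one-liner; a sketch with fake
hit RTs:

import numpy as np

rts = np.array([820, 765, 1010, 690, 930, 1180])  # fake hit RTs (ms)
cv = rts.std(ddof=1) / rts.mean()                 # CV = SD of RTs / mnRT
print(round(cv, 3))                               # lower CV = more consistent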
The sensitivity of the three individual measures to group proficiency
differences is first investigated for each measure separately and then com-
pared with that of the composite scores. Two composite measures are
examined: VKsize_mnRT, which combines the VKsize scores and mnRTs
as a composite measure of size and speed, and VKsize_mnRT_CV, which
combines all three measures in a single value. A composite of VKsize and
CV was not calculated, as the CV is only interpretable as a measure of
skill when accompanied by fast recognition times; it is possible to be very
slow and very consistent. The raw values for mnRT and CV are inverted
to make them consistent with the VKsize values, so that higher scores
reflect better performance on all three measures. The composites are then
calculated by first converting the individual scores to standardized
z-scores and then averaging them. Because the use of standardized scores
will result in negative scores for some individuals, 5 is added to each
score to make all the results positive. For example, VKsize_mnRT is
calculated as ((VKsizez + mnRTz)/2) + 5.
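A sketch of this composite calculation in Python, using fake values for
five participants; negation is used here to invert the mnRT and CV scores.

import numpy as np

def z(x):
    return (x - x.mean()) / x.std(ddof=1)

vk = np.array([.34, .71, .85, .60, .45])            # fake VKsize scores
mnrt = np.array([1656., 963., 777., 1100., 1400.])  # fake mean RTs (ms)
cv = np.array([.447, .361, .247, .390, .420])       # fake CVs

vk_z, rt_z, cv_z = z(vk), z(-mnrt), z(-cv)          # invert speed measures
vksize_mnrt = (vk_z + rt_z) / 2 + 5                 # ((VKsize_z + mnRT_z)/2) + 5
vksize_mnrt_cv = (vk_z + rt_z + cv_z) / 3 + 5
print(vksize_mnrt.round(2), vksize_mnrt_cv.round(2))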
Participants were tested individually or in small groups in a university
computer lab. The test was administered using LanguageMAP, a multi-
media package for the assessment of language processing skill developed
at the University of Queensland and available at www.languagemap.com.
The presentation of items was randomized for each participant. Test items
were presented individually on a computer screen, and participants were
asked to judge, as quickly and accurately as possible, whether they knew
the word.
Participants were informed that items appearing on the screen could be
either words or pseudowords. They were told to work as quickly and
accurately as possible, as they would be scored on both dimensions.
Following the instruction set used by Eyckmans (2004, p. 96), they were
also warned that they might be tested on some of the words later, though,
in fact, they were not tested at the end.
On each trial, the participant first saw a screen with a three-point fixa-
tion star. After a 1500-millisecond interval, a word or pseudoword
appeared on the screen. The participant responded ‘yes’ or ‘no’ by press-
ing the appropriate key on a keyboard. Each word appeared on the screen
for only 5000 milliseconds, after which the presentation was terminated.
If the participant failed to answer in the allotted time, a ‘not answered’
value was recorded and treated as a ‘miss’ in the subsequent data analysis.
Only a handful of ‘not answered’ responses occurred in any of the groups.
A practice set of five items was completed before the test. Feedback was
given during the practice set to ensure the instructions were understood.
No feedback was given during the actual test.

6.3 Study 1 Results


Preliminary Analysis The raw data were first checked for factors that
might affect outcomes independent of the research variables of interest.
These factors are both general to quantitative research and specific to the
use of the Timed Yes/No Test format. The raw findings are examined for
instrument reliability, excessive false-alarm rates, the occurrence of outli-
ers, and a potential speed–accuracy trade-off in responses.

Test instrument reliability was first calculated to ensure the Timed Yes/
No Test provided a consistent measure of L2 vocabulary knowledge.
Cronbach’s alpha analyses were carried out on item performance on the
words and pseudowords to establish the internal reliability of the test
(Beeckmans et al. 2001). All results in this study showed satisfactory
reliability for yes/no responses and item RTs, all in the high-.8 to low-.9
range (Plonsky and Derrick 2016). Reliability coefficients for the CV
were not calculated, as the measure is derived from the RT means and SD
and is not analyzable at the item level.
Individual item recognition times on correctly answered words (hits) were
screened for outliers. Outliers were defined as being 3
SDs beyond the mean for the individual participant. In the Harrington
(2006) analysis, RTs were trimmed at 2.5 SDs, a threshold challenged as
being too liberal for adjusting the mean responses (Hulstijn et al. 2009).
The data from that study were reanalyzed here at the 3 SD criterion. Item
responses at less than 300 milliseconds were first removed, as these were
too fast for an actual response. These responses reflected preemptive
guessing or keystroke errors and were rare, appearing in only a handful of
participant responses. As in most RT studies, the screening carried out
here focuses on identifying responses that are excessively slow. Participants
were told at the outset that there was a 5000 millisecond (5 second) limit
for responding to each item. The 5-second window allowed enough time
to complete the task but required attention on the part of the test-taker.
Timed-out responses were evident in about 10% of the participant
response sets. These sets typically had only one or two timed-out responses,
and these were at the beginning of the test. Individual item response
times for the correct hits beyond the 3 SD cut-off accounted for around
2% of responses: 1.6% for the L2 preuniversity group, 1.9% for the L2
university group, and 2.1% for the L1 university group. This compares
with the outlier rate of 3% reported in Harrington (2006), which used a
2.5 SD cut-off. These are low numbers, but they are based on correct hits,
meaning that for less proficient participants the total number of responses
involved can be small, for example, in the 20s for some of the
preuniversity participants. Also, the slower mean RTs and associated SDs
for the slower participants are closer to the 5000-millisecond cut-off,
which also decreases the number of potential outliers. The marginally
higher outlier rates for the L2 and L1 university groups reflect their
relatively faster means and SDs. Given the low rates, the outliers were
left unadjusted in the statistical analyses.
Excessive guessing can also compromise the reliability of the VKsize
score. Guessing is reflected by the proportion of the ‘yes’ responses to
pseudowords. This false-alarm rate is used to correct for guessing,
with higher false-alarm rates resulting in lower VKsize. It is of course
also possible to guess correctly on the word items (hits), but this is
not directly detectable in the format. The absolute false-alarm rate is
also important, as higher rates will make the size estimate less accu-
rate and more difficult to compare with findings from other studies.
False-alarm rates are given in Table 6.2. The rates varied across the
groups, with the L2 preuniversity group averaging over 20% and the
L2 and L1 university groups around 6% each. The difference between
the preuniversity group and the two university groups was statistically
significant.1
The error rate here compares with false-alarm rates of around 5% by
adult L1 English subjects (Ziegler and Perry 1998, p. 57), 6% by advanced
Dutch EFL subjects (Van Heuven et al. 1998), and 9% for French-­
speaking learners of Dutch (Eyckmans 2004). Schmitt, Jiang and Grabe
(2011) eliminated all participants at higher than 10% but did not report
a mean false-alarm rate. The false-alarm rate here contrasts with much
larger rates, for example, over 20% reported in Beeckmans et al. (2001),
Cameron (2002), and the minimal instruction condition in Eyckmans
(2004). As evident here and in previous studies (see Chap. 3), false-alarm
rates decrease as proficiency increases, at least when the proficiency differ-
ences are relatively distinct.

Table 6.1 Bivariate correlations and 95% confidence intervals (within square
brackets) for the three lexical facility measures (VKsize, mnRT, and CV) and two
composite scores (VKsize_mnRT and VKsize_mnRT_CV)

                VKsize          mnRT            CV              VKsize_mnRT
mnRT            .68 [.57, .77]  –
CV              .51 [.39, .62]  .62 [.45, .75]  –
VKsize_mnRT     .93 [.89, .96]  .89 [.83, .93]  .63 [.49, .74]  –
VKsize_mnRT_CV  .85 [.80, .90]  .87 [.82, .91]  .83 [.75, .90]  .95 [.93, .97]

Note: N = 110. All correlations significant at the p < .01 level (two-tailed). VKsize,
correction-for-guessing score (hits − false alarms); mnRT, mean recognition
time in milliseconds; CV, coefficient of variation (SDmnRT/mnRT); 95% CI,
BCa (bias-corrected and accelerated) confidence interval.

The raw data were also examined for a systematic trade-off between
accuracy and speed by the participants. While some trade-off is expected,
a systematic bias toward one or the other dimensions should be avoided
(Heitz 2014). Evidence for such a trade-off would suggest strategic biases
on the part of the test-takers that obscure the actual state of the underly-
ing vocabulary skill. The size and speed data showed no evidence for such
a systematic trade-off (see Table 6.1). The Pearson correlation between
the VKsize score and the inverted mnRT was almost .7, indicating a
strong positive relationship between size and speed: participants who
recognized more words also did so more quickly. These values are similar
to those reported in Laufer and Nation (2001, p. 19). The positive
correlations suggest that the mnRTs and the VKsize scores measure
similar underlying proficiency.
The preliminary analysis sets the stage for the presentation of the
results and a discussion of the research findings.

Descriptive Results

Results for the three individual lexical facility measures across the three
groups are presented in Table 6.2. VKsize scores are reported in percent-
ages, mnRT in milliseconds, and CV in proportions. Confidence inter-
vals (CIs) for the means are also given. The lower and upper CI values are
the range within which the true mean of the population can be found
95% of the time.

Table 6.2 Proficiency-level study. Means, standard deviations, and confidence
intervals for false-alarm rates and the lexical facility measures, individual and
composite, for the three proficiency levels

                      L2 preuniversity   L2 university     L1 university
                      (n = 32)           (n = 36)          (n = 42)
                      Mean   SD          Mean   SD         Mean   SD
                      [95% CI]           [95% CI]          [95% CI]
False-alarm rate (%)  21.47  18.81       6.50   7.77       6.01   5.38
                      [17.41, 27.64]     [4.10, 8.90]      [4.67, 7.69]
Individual measures
VKsize (%)            33.58  26.79       70.97  15.91      84.55  9.62
                      [23.32, 42.57]     [65.42, 76.19]    [81.66, 87.10]
mnRT (msec)           1656   332         963    203        777    203
                      [1536, 1773]       [890, 1032]       [753, 802]
CV (proportion)       .447   .086        .361   .087       .247   .071
                      [.417, .472]       [.332, .388]      [.227, .271]
Composite measures
VKsize_mnRT           3.77   0.63        5.15   0.58       5.72   0.21
                      [3.56, 3.98]       [4.97, 5.32]      [5.65, 5.77]
VKsize_mnRT_CV        3.88   0.41        5.10   0.61       5.72   0.29
                      [3.74, 4.03]       [4.92, 5.31]      [5.66, 5.83]

Note: VKsize, correction-for-guessing score (hits − false alarms); mnRT, mean
response time in milliseconds; CV, coefficient of variation (SDmnRT/mnRT);
95% CI, BCa confidence interval.

CIs are particularly useful for comparing the relative difference between
two means. The less overlap there is between the two sets of CIs, the more
likely the means come from different underlying population distributions.
Bootstrapped CIs are used throughout to provide more robust estimates
of the mean differences. The 95% CIs are based on the bias-corrected and
accelerated (BCa) method using 1000 samples (Larson-Hall 2016).
Performance on all three measures systematically improved across the
three proficiency levels. For the VKsize scores, the preuniversity group
had the lowest mean at around 35%; the L2 university group had 70%
and the L1 university group, 85%. A similar pattern was evident in the
mnRT scores. The preuniversity group (mean = 1660 milliseconds) was
much slower than the L2 university (960 milliseconds) and the L1 base-
line group (770 milliseconds). The CV results also reflected the respective
proficiency levels but were more evenly spaced, with group means
of .45, .36, and .25, respectively. The mnRT value for the L2 university
group was midrange between the means reported by Segalowitz and
Segalowitz (1993) for their fast and slow response groups. In that study,
only high-frequency vocabulary items were used, in contrast to the spread
of frequency levels examined here. The mnRT value for the L1 university
group was toward the upper range of the L1 English subjects reported in
Ratcliff et al. (2004).
To facilitate comparison, the lexical facility means reported in Table 6.2
are converted to standard scores and plotted by the three groups in the
bar chart in Fig. 6.1. For all calculations, the mnRT and CV values are
inverted to make negative values positive, making them consistent with
the VKsize scores. Higher scores thus reflect better performance for all
three measures. The standard scores are calculated by transforming all
three measures into z-scores, averaging them, and then adding 5 to each

to eliminate negative values, in the same way as the composite scores were
calculated.

Fig. 6.1 Lexical facility measures by English proficiency levels
There is a consistent pattern for all three measures across the three
proficiency groups. The overlap among the CIs for the means within each
group indicates little difference in performance across the three measures.
At the same time, the total lack of overlap between the groups shows
that the measures are consistently sensitive to proficiency differences.
Composite scores involving all three were calculated to gauge the effi-
cacy of the combined measures in accounting for the differences of inter-
est. These are also presented in Table 6.2. The pattern of performance on
the composite measures mirrors that observed for the individual mea-
sures, an overall pattern predictable because the composites are made up
of constituent individual scores. The main interest is in how the compos-
ite measures compare with the individual scores regarding sensitivity to
group differences.

Sensitivity of the Lexical Facility Measures to Group Differences

The mean performance for all the individual and composite measures
improved as group proficiency increased, and the CIs indicate that
the observed differences are statistically significant. The level of
significance and the magnitude of the related effect sizes are established
in a series of one-way analyses of variance (ANOVAs).2
The sensitivity of a given measure reflects whether it yields differences
that are statistically significant at the conventional p < .05 level and that
have an effect size that reaches a recognized level of impact. The effect size
for the omnibus ANOVAs is eta-squared (η2). The ‘real-world’ interpreta-
tion of η2 is based on Plonsky and Oswald (2014, p. 889), with .06 being
small, .16 medium, and .36 large. The effect size for the post hoc com-
parisons is Cohen’s d (Lenhard and Lenhard 2014). It is interpreted as .40
being small, .70 medium, and 1.0 large (Plonsky and Oswald 2014).
Note that these values are all larger than the widely used benchmarks first
proposed in Cohen (1988).

The data set was first examined to see if it met the assumptions of the
one-way ANOVA procedure. There was no evidence of univariate outli-
ers.3 Four of the 15 group-measure conditions did not meet the normality
assumption as assessed by the Shapiro–Wilk test (p > .05).4 Although all
the conditions did not meet the assumption, the one-way ANOVA is
considered robust to some deviation from normality (Maxwell and
Delaney 2004). Results from Levene’s test of homogeneity of variance
showed that only the CV measure met the homogeneity of variance
assumption at p < .05. To accommodate this, Welch’s ANOVA is used, an
ANOVA procedure considered more robust to data with unequal vari-
ances (Moder 2010). Also, the post hoc comparisons use the Games–
Howell test, which also does not assume equal variances.

Test Results

The significance level and effect size findings for the five univariate
ANOVAs are given in Table 6.3. Post hoc tests comparing performance
by group pairs were subsequently carried out for all the omnibus tests and
are reported in Table 6.4.
All five univariate ANOVAs were statistically significant at p < .001.
The η2 values showed that the effect size for all five measures was strong,
accounting for over 50% of the group variance for the VKsize and CV
measures and around 75% for the mnRT, VKsize_mnRT, and VKsize_
mnRT_CV measures. The CIs for the mnRT η2 show that the measure is
significantly stronger than both the VKsize and CV measures. The lexical
facility account assumes that vocabulary size is primary, both in being the
first of the three elements to develop and in being the strongest predictor
of differences between proficiency levels. That is not the case for the
groups here, as the recognition speed differences are a much stronger
overall predictor than the other two measures. The sensitivity of mnRT
performance to group differences is further examined in the pairwise
comparisons of mean differences. The individual and composite mea-
sures and associated d effect sizes are presented in Table 6.4. Cohen’s d
with Hedge’s correction measure was used, given the unequal sample sizes
(Lenhard and Lenhard 2014).

Table 6.3 Proficiency-level study. One-way ANOVAs for individual and composite
lexical facility measures as discriminators of English proficiency levels

                 df          F*      η2   95% CI for η2
VKsize           (2, 56.51)  56.41   .54  [.37, .61]
mnRT             (2, 57.06)  169.48  .76  [.67, .83]
CV               (2, 66.88)  59.44   .51  [.37, .62]
VKsize_mnRT      (2, 51.38)  148.33  .73  [.64, .79]
VKsize_mnRT_CV   (2, 51.38)  238.86  .75  [.66, .81]

Note: *All significant at p < .0005 (two-tailed, assuming unequal variances).
VKsize, correction-for-guessing score (hits − false alarms); mnRT, mean response
time in milliseconds; CV, coefficient of variation (SDmnRT/mnRT).

Table 6.4 Proficiency-level study. Post hoc comparisons for individual and
composite measures, VKsize, mnRT, CV, VKsize_mnRT, and VKsize_mnRT_CV

                              Mean difference*  d     95% CI for d
VKsize
L2 university–preuniversity   37.39             1.72  [1.16, 2.28]
L1 university–L2 university   13.58             1.05  [.58, 1.53]
L1 university–preuniversity   50.97             2.68  [2.05, 3.31]
mnRT(a)
L2 university–preuniversity   693               2.67  [2.01, 3.32]
L1 university–L2 university   185               1.18  [.70, 1.66]
L1 university–preuniversity   878               4.82  [3.92, 5.73]
CV
L2 university–preuniversity   .086              1.03  [.52, 1.53]
L1 university–L2 university   .114              1.46  [.96, 1.96]
L1 university–preuniversity   .200              2.60  [1.98, 3.23]
VKsize_mnRT
L2 university–preuniversity   1.37              2.30  [1.69, 2.91]
L1 university–L2 university   .56               1.33  [.84, 1.82]
L1 university–preuniversity   1.94              4.41  [3.56, 5.25]
VKsize_mnRT_CV
L2 university–preuniversity   1.22              2.30  [1.69, 2.92]
L1 university–L2 university   .65               1.39  [.89, 1.88]
L1 university–preuniversity   1.87              5.38  [4.33, 6.2]

Note: *All values significant at p < .0005; Games–Howell test, unequal variances
assumed; raw values given; (a) differences calculated on mnRT(log); VKsize,
correction-for-guessing score (hits − false alarms); mnRT, mean response time
in milliseconds; CV, coefficient of variation (SDmnRT/mnRT); 95% CI, 95%
confidence interval.

Individual Measure Group Differences

All the individual measure contrasts were statistically significant at
p < .001 (two-tailed). The d effect sizes were all at, and usually beyond,
the threshold value of 1, which is considered a strong effect. The relative
strength of the effects differs by group comparison. These are considered
in turn next.

L2 Preuniversity and L2 University The pattern of effect sizes for the two
L2 groups was similar to that of the ANOVA results. The mnRT measure
had the largest effect size (around 2.7), compared with VKsize (1.7) and
CV (1.0). However, unlike the omnibus analysis, there was some overlap
across the VKsize and mnRT CIs. There was no overlap between the
mnRT and CV measures, suggesting that they may be tapping different
aspects of performance.

L2 University and L1 University The effect sizes for the three individual
measures were similar, with the CV being the highest in absolute terms.
However, the substantial overlap across the CIs for all three measures
indicates little difference in strength. This implies that the greater effect
for mnRT evident in the omnibus ANOVA is attributable to differences
between the preuniversity group and the other two groups.

L2 Preuniversity and L1 University The largest differences are expected
in the comparison of the preuniversity group and the L1 group. The
mnRT measure had the strongest effect size (over 4.8), compared with
VKsize (2.7) and CV (2.6). As with the omnibus test, there was no over-
lap in the CIs between mnRT and either the VKsize or CV effect sizes.

Differences in recognition time thus account for the largest amount
of variability and serve as the most sensitive index of differences between
the groups, especially between the lowest proficiency group and the
others.

Composite Measure Group Differences

A central claim of the lexical facility proposal is that the combination
of size and speed (i.e., VKsize_mnRT) provides a more sensitive index of
proficiency differences than size alone (VKsize). Logically, the individual
mnRT value can be examined in the same way, but the account assumes
that the interpretation of mnRT as an index of proficiency is only possi-
ble in association with a measure of vocabulary knowledge. The compos-
ite VKsize_mnRT_CV is the strongest test of the lexical facility proposal,
as it combines all three measures. Both composite measure comparisons
in Table 6.4 were statistically significant at p < .001 (two-tailed). The d
effect sizes were significant and, overall, larger than those of the
constituent measures. As was the case with the individual measures, the
strength of the effects differs across the groups.

L2 Preuniversity and L2 University The composites VKsize_mnRT and
VKsize_mnRT_CV both had an effect size of 2.3, compared with the
individual VKsize measure at 1.7 and the mnRT measure at 2.7. The
identical value for the two composites shows that the CV scores contributed
nothing beyond the VKsize_mnRT composite. The noticeably larger
effect sizes for the two composites compared with the individual VKsize
measure support the lexical facility proposal, but there was substantial
overlap in the CIs for all three measures.

L2 and L1 University The composites VKsize_mnRT and VKsize_
mnRT_CV had similar effect sizes of 1.33 and 1.39, respectively, com-
pared with the individual VKsize measure at 1.05 and the mnRT measure
at 1.18. As in the preceding contrast, both composites had a larger effect
size than the individual VKsize measure, but there was substantial overlap
in the CIs for all the effect sizes.

L2 Preuniversity and L1 University The large effect sizes evident in the
individual measure analysis are mirrored in the composite score compari-
son. The effect size for the composite VKsize_mnRT was 4.41, and for
VKsize_mnRT_CV, 5.38. These compare with the individual VKsize
effect size of 2.68 and the individual mnRT effect size of 4.82. The CIs for
both composite scores overlapped considerably, but neither overlapped
with the CI for the individual VKsize measure, indicating that size and
speed together are better than size alone, though, as noted above, speed
(mnRT) was the most sensitive measure on its own.

Three findings emerge. The first is that mean recognition time is an
important index of proficiency, stronger here than VKsize alone. The sec-
ond is that, when combined, VKsize and mnRT provide a more sensitive
measure of group differences than VKsize alone. And finally, it is evident
that mnRT performance is the most salient difference between the lowest
proficiency preuniversity group and the other two groups.5

6.4 Sensitivity of the Lexical Facility Measures to Frequency Levels
A basic assumption of the lexical facility proposal, and the lexical fre-
quency profile approach of Laufer and others more generally (Laufer and
Nation 1995), is that there is a largely linear relationship between the
likelihood that a word is known and the frequency with which it is used
in the language. This basic relationship has profound consequences for
models of L2 learning and more practical implications for vocabulary
instruction and measurement, as it allows the researcher to estimate an
individual's approximate vocabulary size by sampling vocabulary from a
range of frequency levels. The assumption is tested here by examining the
sensitivity of the lexical facility measures to frequency-level differences in
the test overall and in the respective groups. Showing that frequency-level
differences systematically predict test performance is integral to establish-
ing the construct validity of the test format and of the lexical facility
approach more generally (Messick 1995).
The effect of word frequency on test performance is established by
examining responses at the respective frequency level (2K, 3K, 5K, and
10K). The VKsize measure is based on subtracting false alarms ('yes' to
pseudowords) from the hits ('yes' to words) overall. As such, it is not
possible to break the VKsize performance down by frequency level.
Instead, the proportion of hits at each frequency level is used as a measure
of vocabulary knowledge. These are uncorrected scores and, as such, will
represent some over- and underestimation of the individual's actual
vocabulary size. However, it is assumed that this will be reasonably
consistent across the frequency levels (see Mochida and Harrington 2006),
such that the relative differences that emerge will be a valid test of the
frequency assumption. The mnRT and CV measures can be identified by
level and are used as in the earlier analyses. Results for the three measures
are set out by frequency levels and groups in Table 6.5. The hit results
depart markedly from a normal distribution, given that the uncorrected
hit scores reach ceiling for the L2 and L1 university groups in the
high-frequency word conditions. As a result, nonparametric statistics are
used to test for differences between the frequency levels.

Table 6.5 Proficiency-level study. Medians, interquartile ranges, and 95%
confidence intervals for the hits, mean response time (in milliseconds), and
coefficient of variation by frequency levels and groups

               Hits                        mnRT                      CV
               Mdn    IQR  95% CI          Mdn   IQR  95% CI         Mdn   IQR  95% CI
2K
Preuniversity  83.51  16   [79.00, 87.00]  1265  490  [1078, 1364]   .405  .27  [.325, .470]
L2 university  1.00   8    –               752   137  [733, 791]     .249  .14  [.199, .280]
L1 university  1.00   8    –               706   124  [666, 743]     .179  .13  [.151, .215]
Total          92.50  14   [92.25, 97.50]  768   297  [750, 810]     .251  .20  [.194, .288]
3K
Preuniversity  66.00  18   [63.00, 71.00]  1518  396  [1332, 1577]   .410  .18  [.390, .460]
L2 university  92.00  14   –               807   259  [777, 908]     .304  .22  [.261, .339]
L1 university  1.00   13   –               746   129  [702, 769]     .212  .15  [.173, .249]
Total          88.00  28   [85.00, 92.00]  827   510  [794, 906]     .299  .20  [.270, .340]
5K
Preuniversity  42.50  22   [34.00, 52.00]  1738  671  [1488, 1909]   .515  .30  [.370, .580]
L2 university  80.00  16   [76.00, 81.00]  957   372  [890, 1052]    .391  .22  [.303, .445]
L1 university  92.50  13   [92.50, 92.50]  762   117  [732, 779]     .203  .10  [.186, .236]
Total          80.00  39   [73.50, 85.00]  910   637  [846, 1040]    .299  .29  [.267, .364]
10K
Preuniversity  31.50  24   [21.00, 41.00]  2200  922  [1948, 2422]   .425  .24  [.348, .445]
L2 university  51.50  23   [47.00, 56.00]  1092  599  [993, 1303]    .426  .23  [.348, .454]
L1 university  80.25  14   [76.00, 86.75]  874   182  [836, 898]     .263  .18  [.222, .286]
Total          53.50  42   [47.00, 65.25]  1059  897  [973, 1277]    .352  .23  [.314, .399]

Note: Hits, correct responses to words; mnRT, mean response time to hits in
milliseconds; CV, coefficient of variation (SDmnRT/mnRT); Mdn, median; IQR,
interquartile range; 95% CI, BCa 95% confidence interval; 2K = 2000; 3K = 3000;
5K = 5000; 10K = 10,000.

6.5 Discriminating Between Frequency Levels


Results for the three lexical facility measures reported in Table 6.5 are set
out visually in Figs. 6.2, 6.3, and 6.4. As is evident, there is a stable
pattern across the frequency levels within each group for each of the
measures. The hits and mnRT responses reflect differences between
frequency levels for both L2 groups. The absence of any overlap in the CIs
for the level medians indicates that all these differences are statistically
significant. The level medians for the CV responses showed little
difference across the levels for the L2 preuniversity group, and only one
difference, between the 2K and 5K levels, for the L2 university group. For
the L1 university group, the only significant (i.e., nonoverlapping CIs)
median differences for the hits, mnRTs, and CVs were between the 2K and
10K levels.
The overall median differences between levels were first tested using the
nonparametric Friedman test for the equality of the four level medians;
this was statistically significant at p < .001. It was followed up with
Wilcoxon signed-rank tests of the level-pair differences (see Table 6.6).
The median-level differences for the combined groups were significant
for all the hits and mnRT comparisons, while the CV was significant only
for the 2K–3K and 2K–10K comparisons. The effect size used for the
Wilcoxon signed-rank test is r (Field 2009, p. 550). The r values for the
hits and mnRTs were medium to large (.4–.6); the significant CV effects,
in contrast, were negligible to small (.17 and .31). The results reported in
Table 6.6 are based on the combined performance of the three groups.
Given the differing profiles exhibited by the L1 group, it is possible that
an analysis using only the L2 groups would result in larger effect sizes.
This analysis was run; it showed no differences in the significance
findings and only slight changes in the r values, and these changes went
in both directions.

Table 6.6 Frequency-level analysis. Comparing sensitivity of hits (correct
responses to words), mean RT, and CV to frequency band differences using the
omnibus Friedman test and the follow-up Wilcoxon signed-rank test

Friedman test (df = 3; all χ2 significant at p < .001)
       χ2
Hits   206.302
mnRT   199.46
CV     28.17

Wilcoxon signed-rank test
          Hits           mnRT           CV
Level     Z       r      Z       r      Z      r
2K–3K     5.83**  .41    6.28**  .45    2.35*  .17
3K–5K     6.41**  .46    5.37**  .38    1.81   .13
5K–10K    7.42**  .53    6.96**  .50    1.13   .08
2K–10K    8.82**  .63    8.99**  .64    4.31*  .31

Note: N = 110; *p < .05; **p < .001; Hits, number of correct responses to words;
mnRT, mean response time in milliseconds; CV, coefficient of variation
(SDmnRT/mnRT); r = Z/√N.

Fig. 6.2 Median proportion of hits and 95% confidence intervals for lexical
facility measures by frequency levels and groups

Fig. 6.3 Median individual mnRT and 95% confidence intervals for lexical
facility measures by frequency levels and groups

Fig. 6.4 Median coefficient of variation (CV) and 95% confidence intervals for
lexical facility measures by frequency levels and groups
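A sketch of this two-step nonparametric analysis in Python; the 110 x 4
matrix of hit percentages is fake, and the Z value for the Wilcoxon
follow-up is obtained from the usual normal approximation so that
r = Z/√N can be computed.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
hits = np.clip(rng.normal([90, 85, 75, 55], 10, (110, 4)), 0, 100)  # 2K-10K

chi2, p = stats.friedmanchisquare(*hits.T)   # omnibus test over four levels
print(round(chi2, 2), p)

w = stats.wilcoxon(hits[:, 0], hits[:, 3])   # follow-up: 2K vs 10K
n = hits.shape[0]
mu = n * (n + 1) / 4
sigma = np.sqrt(n * (n + 1) * (2 * n + 1) / 24)
z_val = (w.statistic - mu) / sigma           # normal approximation for Z
print(round(z_val, 2), round(abs(z_val) / np.sqrt(n), 2))  # Z, r = Z/sqrt(N)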
A basic assumption of the Yes/No Test format and the lexical facility
construct is that the frequency with which a word is used will predict how
soon and how well it is learned. It is a probabilistic assumption that
underpins input-driven learning more widely, and one that is supported
in the hits and mnRT results. The sensitivity of the two measures to
frequency-­level differences thus provides construct validity for both the
test format and the lexical facility proposal, though the modest effect sizes
must be acknowledged. The CV measure also showed some sensitivity to
frequency differences, but to a much lesser extent.

6.6 Findings for Study 1


The individual and composite measures discriminated between the
English L2 preuniversity, English L2 university, and L1 university groups.
The strength of the differences, as reflected in effect sizes, showed that the
VKsize, mnRT, and CV measures were all sensitive to group differences.
The individual mnRT measure was stronger than the VKsize measure,
and the combination of size and speed provided a more sensitive
indicator of group differences than size alone. The inclusion of the CV
measure alongside the other two in the composite VKsize_mnRT_CV
measure did not increase sensitivity, as reflected in the unchanged effect
size.
Differences in performance across word frequency levels were also ana-
lyzed to test the assumption that there is a direct relationship between the
frequency with which a word is used and how soon and well it is learned.
Overall, the results show a clear linear relationship between the lexical
152 6 Lexical Facility as an Index of L2 Proficiency

facility measures and the frequency of occurrence, as reflected in the
frequency levels. The VKsize and mnRT measures were more sensitive than
the CV measure to frequency-level differences. In general, the effect sizes
were modest.

6.7 Conclusions
Study 1 showed that all three measures discriminate between the profi-
ciency levels, and that the magnitude of the differences was substantial.
All three measures discriminated between all group levels. Effect sizes
were larger for the individual mnRT and composite measures, suggesting
that a combination of size and speed results in a more sensitive measure
than size alone. However, the effect size differences were not statistically
significant. The results also indicate that frequency-of-occurrence levels
serve as stable indices of vocabulary proficiency. Together, these findings
show that the lexical facility construct is a useful index of L2 vocabulary
skill. In the next chapter, the sensitivity of the lexical facility measures to
differences in English university entry standards is examined. The group
proficiency differences in that study are less pronounced than those
examined here and will thus provide a more stringent test of the lexical
facility proposal.

Notes
1. The false-alarm data depart markedly from a normal distribution, as some
participants had few or no false alarms. A Kruskal–Wallis test was run
to test for the equality of the group false-alarm means. There was a signifi-
cant difference between the groups, χ2 = 18.18, p < .001, η2 = .82.
Follow-up Mann–Whitney tests showed that the difference between the
preuniversity and L2 university groups was significant at U = 289.50,
p < .001, d = .94 (Lenhard and Lenhard 2014).
2. The use of a multivariate ANOVA (MANOVA) is motivated in conceptual
terms, as the three measures are all assumed to be elements of the
lexical facility construct. However, the data departed significantly from a
key assumption for the test, namely homogeneity of variance/covariance,
and so the MANOVA procedure was not used. Instead, five univariate
ANOVAs are carried out to compare the effect of the three individual
measures with each other and with the two composite measures of
interest, VKsize_mnRT and VKsize_mnRT_CV.
3. Boxplot inspections used throughout the book assume outliers to be val-
ues 1.5 box lengths from the edge of the box.
4. Two of these were the VKsize scores for the L2 university (p < .005) and
L1 university groups (p < .02), both showing a tendency to higher scores,
as reflected in the moderately negative skew. The others were the CV score
for the L2 university group (p < .02) and the composite VKsize_mnRT
score for the L1 university group (p = .008).
5. A central aim in the empirical research presented in these chapters is to
demonstrate that the measures of processing skill (mnRT and CV), com-
bined with VKsize, will result in a more sensitive measure of proficiency
differences than the VKsize measure alone. This question is particularly
conducive to treatment in a regression format, where the effect of the
individual measures on group differences can be sequentially analyzed and
quantified. A candidate technique for the current study is the ordinal
logistic regression; the MANOVA-related discriminant analysis is another
(Field 2009). The procedure can be used to predict an ordinal (categori-
cal) variable, as in proficiency group membership, given one or more
independent variables—in this case, the three lexical facility measures.
This is the same logic as standard multiple or hierarchical regression, but
the criterion is an ordered category instead of a continuous variable. An
ordinal logistic regression was tried with these data, but the assumptions
were not met, particularly that of proportional odds.

References
Beeckmans, R., Eyckmans, J., Janssens, V., Dufranne, M., & Van de Velde, H.
(2001). Examining the yes/no vocabulary test: Some methodological issues
in theory and practice. Language Testing, 18(3), 235–274.
Cameron, L. (2002). Measuring vocabulary size in English as an additional lan-
guage. Language Teaching Research, 6(2), 145–173.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.).
Hillsdale: Lawrence Erlbaum.
154 6 Lexical Facility as an Index of L2 Proficiency

Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2),


213–238.
Eyckmans, J. (2004). Learners’ response behavior in Yes/No vocabulary tests. In
H. Daller, M. Milton, & J. Treffers-Daller (Eds.), Modelling and assessing
vocabulary knowledge (pp. 59–76). Cambridge: Cambridge University Press.
Field, A. (2009). Discovering statistics using SPSS (3rd ed.). London: Sage.
Harrington, M. (2006). The lexical decision task as a measure of L2 lexical pro-
ficiency. EUROSLA Yearbook, 6(1), 147–168.
Heitz, R. P. (2014). The speed-accuracy tradeoff: History, physiology, methodol-
ogy, and behavior. Frontiers in Neuroscience, 8, 150.
Hulstijn, J. H., Van Gelderen, A., & Schoonen, R. (2009). Automatization in
second language acquisition: What does the coefficient of variation tell us?
Applied Psycholinguistics, 30(4), 555–582.
Larson-Hall, J. (2016). A guide to doing statistics in second language research using
SPSS and R. New York: Routledge.
Laufer, B., & Nation, P. (1995). Vocabulary size and use: Lexical richness in L2
written production. Applied Linguistics, 16(3), 307–322.
Laufer, B., & Nation, P. (2001). Passive vocabulary size and speed of meaning
recognition: Are they related? EUROSLA Yearbook, 1(1), 7–28.
Lenhard, W., & Lenhard, A. (2014). Calculation of effect sizes. Retrieved
November 29, 2014, from http://www.psychometrica.de/effect_size.html
Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing
data: A model comparison perspective (2nd ed.). New York: Psychology Press.
Messick, S. (1995). Validity of psychological assessment: Validation of infer-
ences from persons’ responses and performances as scientific inquiry into
score meaning. American Psychologist, 50(9), 741.
Mochida, A., & Harrington, M. (2006). The yes-no test as a measure of recep-
tive vocabulary knowledge. Language Testing, 26(1), 73–98.
doi:10.1191/0265532206lt321oa.
Moder, K. (2010). Alternatives to F-test in one way ANOVA in case of hetero-
geneity of variances (a simulation study). Psychological Test and Assessment
Modeling, 52(4), 343–353.
Plonsky, L., & Derrick, D. J. (2016). A meta-analysis of reliability coefficients
in second language research. The Modern Language Journal, 100, 538–553.
Plonsky, L., & Oswald, F. L. (2014). How big is "big"? Interpreting effect sizes
in L2 research. Language Learning, 64, 878–912. doi:10.1111/lang.12079.
Ratcliff, R., Gomez, P., & McKoon, G. (2004). A diffusion model account of
the lexical decision task. Psychological Review, 111(1), 159–182.
References 155

Schmitt, N., Schmitt, D., & Clapham, C. (2001). Developing and exploring
the behaviour of two new versions of the vocabulary levels test. Language
Testing, 18(1), 55–89. doi:10.1191/026553201668475857.
Schmitt, N., Jiang, X., & Grabe, W. (2011). The percentage of words known in
a text and reading comprehension. The Modern Language Journal, 95(1),
26–43. doi:10.1111/j.1540-4781.2011.01146.x.
Segalowitz, N., & Segalowitz, S. J. (1993). Skilled performance, practice and
differentiation of speed-up from automatization effects: Evidence from sec-
ond language word recognition. Applied Psycholinguistics, 14(3), 369–385.
doi:10.1017/S0142716400010845.
van Heuven, W. J. B., Dijkstra, T., & Grainger, J. (1998). Orthographic neigh-
borhood effects in bilingual word recognition. Journal of Memory and
Language, 39(3), 458–483. doi:10.1006/jmla.1998.2584.
Ziegler, J. C., & Perry, C. (1998). No more problems in Coltheart’s neighbor-
hood: Resolving neighborhood conflicts in the lexical decision task. Cognition,
68(2), B53–B62.
7 Lexical Facility and Academic English Proficiency

Aims

• Evaluate the sensitivity of the lexical facility measures to English university admission standards.
• Assess the sensitivity of the measures as individual and composite measures.
• Compare performance on written and spoken versions of the test.

7.1 Introduction
In Study 1, the lexical facility measures were shown to be highly sensitive
to proficiency differences across university English groups, both individ-
ually and in combination. This was evident in the discriminating power
of the measures and the magnitude of the observed group differences.
Greater effect sizes for the mean recognition time and the composites
indicate that a combination of size and speed yields a more sensitive mea-
sure than size alone.
This chapter presents the second of seven studies that establish lexical
facility as a valid and reliable measure of second language (L2) vocabulary.
It examines the sensitivity of the three measures to various English profi-
ciency standards used for admission to an Australian university. The range
of English proficiency in the groups in this study is narrower than that
examined in Study 1.
The entry standards represent the minimum English proficiency
needed for university admission, but the actual proficiency of incoming
students varies, sometimes greatly (Moore and Harrington 2016). As
vocabulary size and processing skill are core elements of academic lan-
guage proficiency, the lexical facility construct can provide an objective,
independent benchmark for quantifying these skills as indices of learners’
proficiency levels. Study 2 thus tests the validity of the lexical facility
proposal, and the usefulness of the test format in this setting, with poten-
tial implications for English-language proficiency assessment (Read
2016).
An additional aim of the chapter is to assess the effect of language
mode on performance. Within-subject performance on written and spo-
ken versions of the Timed Yes/No Test is compared to evaluate the effect
of language modes on the response measures. Although spoken and writ-
ten word recognition are discrete skills, vocabulary size and processing
skill are crucial to both.

7.2 Study 2: Lexical Facility and University English Entry Standards
International students who want to study in English-speaking countries
such as Australia must demonstrate they have the English proficiency
needed to successfully undertake academic work. Evidence for this profi-
ciency takes various forms. The majority submit scores from standardized
English-language tests, particularly the IELTS and TOEFL tests. Others
enter by successfully completing a high school degree with a recognized
English-language component, as in Singapore, Malaysia, and some
European countries. Still others have their English proficiency deemed
sufficient by the institution based on life and school experiences. The
standards differ greatly, but all are intended to ensure that, if met, the
student has the minimum language skills needed to start English-medium
study.
This study examines the lexical skills of international students com-
mencing study at a major Australian university. The participants have
demonstrated the minimum English proficiency needed to undertake
English-medium university study in different ways, but beyond that can
differ greatly in English skill. The sensitivity of the three lexical facility
measures (VKsize, mnRT, and the CV) to these group differences is mea-
sured using both written and spoken versions of the Timed Yes/No Test.
The use of a spoken version will allow the core characteristics of the lexi-
cal facility account to be tested in the absence of orthographic informa-
tion. The measures are examined individually and in combination, with
the central interest in whether the combination of the processing speed
measures (mnRT and/or CV) with VKsize will be more sensitive to group
differences than VKsize alone.

Setting and Participants

Incoming international students at an Australian university (N = 131)
took part in the study. They were recruited at an orientation program at
the beginning of the academic year. The sample consisted of 67 under-
graduates (63% female) and 75 postgraduates (44% female) from a
range of countries and entering different faculties at the university. They
belonged to five different English entry standard groups consisting of
(1) students entering with the minimum IELTS band score of 6.5 over-
all (n = 54), (2) a higher proficiency IELTS group entering with 7.0–7.5
overall (n = 25), (3) Singaporean students who completed a recognized
English high school program in their home country (n = 19), (4)
Malaysian students who met similar requirements (n = 17), and (5) a
baseline group of native speaker exchange students from the UK, the
US, and South Africa (n = 16). Each student received $20 for
participating.

Materials and Procedures

Each participant took both written and spoken versions of the Timed
Yes/No Test. Both versions used British National Corpus (BNC) items
sampled from the 1K, 3K, 5K, 7K, and 9K levels, as well as pseudowords
to control for guessing. Different items were used in the respective ver-
sions. The written version included 16 items per BNC level (80 word
items in total) plus 40 pseudowords, for a total of 120 items. The spoken
version included 12 items per level (60 word items) plus 36 pseudowords,
for a total of 96 items. The spoken version was shortened out of concern
for possible test-taker fatigue. The pseudowords all conform to English
orthographic and phonological rules.
The Timed Yes/No Test yields three individual measures. VKsize is a
measure of vocabulary knowledge size based on the proportion of the ‘yes’
responses to word items (hits) minus the proportion of ‘yes’ responses to
pseudowords (false alarms). The mnRT measure is the individual’s mean
recognition time for ‘yes’ responses to the word items (hits), and the CV is
a measure of the consistency of that recognition speed performance. It is
the ratio of the standard deviation (SD) of the mnRT to the mnRT itself
(SDmnRT/mnRT). From these individual measures, composite measures
are also calculated. VKsize_mnRT is the composite of vocabulary knowl-
edge calibrated in size and mean recognition speed. All three measures are
combined in VKsize_mnRT_CV. For group comparisons, the mnRT and
CV values were inverted, so higher values in the VKsize, mnRT, and CV
would all reflect better performance. The VKsize_CV is not calculated, as
the CV can only be interpreted as a proficiency index in combination with
mnRT. It is possible to be very slow and very consistent. The composite
measures were calculated by first converting the values to standardized z
scores and then averaging them. Because the conversion to standardized
scores results in negative scores, a value of 5 was added to each score to
make all the results positive and the presentation easier. Group member-
ship is the criterion measure for the statistical analysis.
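
To make the scoring concrete, the sketch below reconstructs these calculations from the definitions above. It is an illustration only, not the LanguageMAP scoring code; the function names are hypothetical, and negating the z scores is one way of implementing the inversion of the mnRT and CV values.

```python
import numpy as np

def lexical_facility_measures(hit_rts, n_hits, n_false_alarms, n_words, n_pseudowords):
    """Individual measures for one test-taker.
    hit_rts: recognition times (ms) for the 'yes' responses to words (hits)."""
    vksize = n_hits / n_words - n_false_alarms / n_pseudowords  # hits minus false alarms
    mnrt = np.mean(hit_rts)                                     # mean recognition time
    cv = np.std(hit_rts, ddof=1) / mnrt                         # SDmnRT / mnRT
    return vksize, mnrt, cv

def composite_scores(vksize, mnrt, cv):
    """Composites across test-takers (1-D arrays): z scores are averaged and
    5 is added; the mnRT and CV z scores are negated so that higher values
    reflect better performance, as described in the text."""
    z = lambda v: (v - v.mean()) / v.std(ddof=1)
    z_vk, z_rt, z_cv = z(vksize), -z(mnrt), -z(cv)
    vksize_mnrt = (z_vk + z_rt) / 2 + 5
    vksize_mnrt_cv = (z_vk + z_rt + z_cv) / 3 + 5
    return vksize_mnrt, vksize_mnrt_cv
```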
Participants were tested individually or in small groups in a university
computer lab. The test was administered using LanguageMAP, a multimedia
package for the assessment of language processing skill available at
www.languagemap.com. Items were presented on the screen one at a time and
in a randomized order for each participant. Participants were told that
they would see both real English words and pseudowords (the latter being
orthographically possible words in English) and to answer ‘yes’ only to
words known, as they would lose points for incorrectly identifying pseu-
dowords as known words. They were also told to work as quickly and
accurately as possible because they would be scored on both dimensions.
In each trial in the written version, the participant first saw a three-star
fixation point. After a 1500-millisecond interval, a word or pseudoword
appeared on the screen. The participant responded ‘yes’ or ‘no’ by pressing
the appropriate key on the keyboard. No feedback was given. The word
appeared on the screen for 5000 milliseconds, after which the presentation
trial was terminated. If the participant failed to answer in the allotted time,
a ‘not answered’ value was recorded and treated as a ‘miss’ in the subsequent
data analysis. There were only a tiny number of ‘not answered’ responses. A
practice set of five items was completed before the test. Feedback was given
in the practice set to ensure the instructions were understood.
The spoken version of the trial was presented via computer head-
phones. Before each trial, the participant first saw a three-star fixation
point in the center of the screen. Each star successively disappeared at a
rate of 1000 milliseconds per star. When the final star disappeared, a
word item was presented only through the headphones. The screen
remained blank. Participants had 5000 milliseconds to answer before the
trial timed out. As with the written version, there were only a handful of
timed-out responses. All participants completed the written version first.

7.3 Study 2 Results


Preliminary Analysis The raw results were first checked for test instrument
reliability, recognition time outliers, excessive false-alarm rates, and evidence
for a systematic trade-off between speed and accuracy by participants.

Test reliability coefficients were calculated for the written and spoken
tests. Separate analyses were carried out for performance on the words
and pseudowords from each list. All Cronbach’s alpha tests had satisfac-
tory reliability, ranging from high .8 to mid .9 for the yes/no judgments
and item recognition time.
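
For reference, Cronbach's alpha can be computed directly from an items-by-persons score matrix. The sketch below follows the standard formula and is only a stand-in for the reliability routine actually used:

```python
import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array, rows = test-takers, columns = items
    (e.g., 0/1 yes/no accuracy, or per-item log RT)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)
```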

The data were screened for recognition time outliers for individual
items, following the procedures set out in the previous chapter. Responses
of less than 300 milliseconds were first removed, as these were deemed too
fast for an actual response. These exceedingly fast responses were assumed
to be performance errors arising from inadvertent keystrokes and similar
response-external factors. They accounted for less than .01 of the total
responses. RT item responses were then log-transformed to reduce vari-
ability. Item response times for the correct hits beyond the 3 SD cut-off
were then identified. These accounted for well under 2% of the responses
across all the participants and were left intact for the analysis.
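
A minimal sketch of this screening pipeline, assuming it is applied to each participant's hit RTs (the function name and the per-participant application are assumptions):

```python
import numpy as np

def screen_rts(rts, floor_ms=300, sd_cutoff=3.0):
    """rts: raw hit RTs in milliseconds for one participant."""
    rts = np.asarray(rts, dtype=float)
    rts = rts[rts >= floor_ms]            # drop sub-300 ms anticipations
    log_rts = np.log(rts)                 # log-transform to reduce variability
    m, sd = log_rts.mean(), log_rts.std(ddof=1)
    flagged = np.abs(log_rts - m) > sd_cutoff * sd
    # Responses beyond the 3 SD cut-off are flagged but retained,
    # mirroring the treatment described in the text.
    return rts, flagged
```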
False-alarm rates are given in Table 7.2. The overall false-alarm rate for
the spoken test (24%) was almost double that of the written version
(14%). In both tests, the IELTS 6.5 group had the highest false-alarm rate
(18% and 30%, for written and spoken version, respectively), and the L1
English group the lowest (8% and 14%, respectively). The Singaporean
group had a slightly lower false-alarm rate than the L1 English group in
the written version, but was much closer to the other L2 groups in the
spoken one (24%). As for statistical significance, IELTS 6.5 and IELTS 7+
groups were both higher than the L1 English group, but the IELTS 6.5
group was not different from the Singaporean group, while the IELTS 7+
group was.1 Overall, the spoken false-alarm rates were high and variable.
A mean false-alarm rate of over 30% is very high and raises the ques-
tion as to what constitutes an excessive false-alarm rate. Other studies
have reported much lower false-alarm rates. Schmitt et al. (2012) removed
all participants with false-alarm rates over 10%, while Pellicer-Sánchez and
Schmitt (2012) reported exceptionally low false-alarm rates—a number
of participants had no false alarms at all. Trimming the data to approxi-
mate these false alarms reduces the sample size for both versions by close
to half. Doing so would clearly compromise the usefulness of the method
for use in typical assessment settings such as the one here. Nevertheless,
the high false-alarm rates do raise the issue of comparability with other
studies that report low, or even no, false alarms. The main statistical anal-
yses are carried out on all the participants to assess the robustness of the
lexical facility measures in the context of error-filled performance.
However, subsequent analyses are also carried out on the data sets in
which false-alarm rates are trimmed. The findings are discussed below.

Table 7.1 University entry standard study: written and spoken test results.
Pearson's correlations [95% CI] for the three individual measures (VKsize score,
mnRT, and CV) and the two composite scores (VKsize_mnRT and VKsize_mnRT_CV)

Individual measures
                 Spoken VKsize   Written mnRT    Spoken mnRT     Written CV      Spoken CV
Written VKsize   .72             .50             .34             .21             .27
                 [.63, .79]      [.35, .64]      [.18, .47]      [.05, .37]      [.12, .40]
Spoken VKsize                    .46             .45             .31             .32
                                 [.33, .57]      [.31, .57]      [.17, .45]      [.17, .47]
Written mnRT                                     .62             .42             .34
                                                 [.52, .72]      [.25, .57]      [.19, .47]
Spoken mnRT                                                      .40             .62
                                                                 [.23, .55]      [.51, .72]
Written CV                                                                       .45
                                                                                 [.29, .58]

Composite measures
                        Spoken           Written          Spoken
                        VKsize_mnRT      VKsize_mnRT_CV   VKsize_mnRT_CV
Written VKsize_mnRT     .73              .92              .66
                        [.66, .79]       [.89, .94]       [.58, .74]
Spoken VKsize_mnRT                       .73              .93
                                         [.56, .80]       [.91, .95]
Written VKsize_mnRT_CV                                    .71
                                                          [.62, .79]

Note: N = 132; all correlations significant at p < .0005; VKsize, correction for
guessing scores (hits – false alarms); mnRT, mean response time in milliseconds;
CV, coefficient of variation (SDmnRT/mnRT).

Also of concern is the potential for systematic trade-offs between speed
and accuracy by individuals. Table 7.1 presents bivariate correlations
between the individual and composite measures for the written and spo-
ken test results. The
VKsize and mnRT correlations for the written test (r = .50) and the spo-
ken test (r = .45) indicate a moderate positive correlation, with higher
vocabulary knowledge correlating with faster recognition performance.
The correlations were slightly lower than the .7 reported in Study 1. The
positive correlations are consistent with the notion that yes/no perfor-
mance and recognition time tap similar underlying proficiency, with no
obvious trade-off between speed of response and accuracy in performance.
However, the correlation was only moderately strong, which means that
other factors also affect the relationship between the size scores and RTs.
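
Correlations of this kind, with bootstrapped intervals, can be sketched as follows. The percentile bootstrap shown here is a simpler stand-in for the BCa intervals reported in the tables, so it approximates the procedure rather than reproducing it:

```python
import numpy as np

def pearson_r_boot_ci(x, y, n_boot=10_000, alpha=0.05, seed=1):
    """Pearson's r with a percentile bootstrap confidence interval."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    rng = np.random.default_rng(seed)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    boots = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)       # resample participant pairs
        boots[i] = np.corrcoef(x[idx], y[idx])[0, 1]
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return r, (lo, hi)
```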
The three measures are moderately to strongly correlated, both within
and across the test versions. The correlations between the written and
spoken versions on the VKsize (r = .72) and mnRT (r = .62) measures
were stronger than that for the CV (r = .45) measure, a difference that is
statistically significant at the 95% confidence interval (CI) level. The
mnRT is part of the formula used to calculate the CV, but the spoken
(r = .61) and written (r = .35) mnRT–CV correlations, while significant,
indicate that the correlation between the two measures leaves a large amount of variance
unaccounted for. The composite scores show a similar pattern.
The size of the correlation coefficients shows that the measures corre-
late within and between tests, but that they also tap other knowledge and
skill aspects. The lexical facility account assumes that the three elements
are related, but that each also makes a distinctive contribution to the
construct. The study findings are examined next.

Descriptive Results

Table 7.2 contains the means, SDs, and CIs for the lexical facility mea-
sures for the five entry standard groups. Results for both the spoken and
written versions are given. VKsize scores are reported in percentages, the
mnRT in milliseconds (msec), and the CV as mean ratios. Table 7.3
presents the same findings for the composite scores. The written test
results are discussed first, then the scores for the spoken test.

Table 7.2 University entry standard study: written and spoken test results. Means
(M), standard deviations (SD), and confidence intervals (95% CI) for the lexical
facility measures for the five English proficiency standard groups. Entries are
M (SD) [95% CI].

IELTS 6.5 (n = 54)
  WRI: False alarm 18.91 (14.80) [15.05, 22.86]; VKsize 56.09 (17.19)
       [51.63, 60.40]; mnRT 1446 (416) [1335, 1563]; CV .433 (.088) [.411, .457]
  SPO: False alarm 31.53 (15.85) [27.54, 35.74]; VKsize 43.10 (19.49)
       [37.73, 48.58]; mnRT 1595 (398) [1512, 1699]; CV .322 (.120) [.313, .356]
IELTS 7+ (n = 25)
  WRI: False alarm 10.90 (9.15) [7.67, 14.50]; VKsize 73.00 (11.53)
       [68.26, 77.37]; mnRT 1280 (299) [1164, 1401]; CV .416 (.086) [.375, .425]
  SPO: False alarm 22.78 (12.57) [17.99, 27.91]; VKsize 57.29 (16.75)
       [68.26, 77.37]; mnRT 1416 (265) [1164, 1401]; CV .278 (.092) [.247, .309]
Malaysian (n = 17)
  WRI: False alarm 14.71 (12.44) [9.16, 20.96]; VKsize 70.95 (17.83)
       [61.83, 78.75]; mnRT 975 (206) [882, 1074]; CV .459 (.121) [.402, .517]
  SPO: False alarm 27.28 (11.83) [20.91, 34.26]; VKsize 60.06 (12.44)
       [61.83, 78.75]; mnRT 1379 (159) [1264, 1505]; CV .299 (.087) [.257, .329]
Singaporean (n = 19)
  WRI: False alarm 7.50 (9.05) [4.78, 12.36]; VKsize 85.06 (11.72)
       [79.23, 89.61]; mnRT 889 (193) [806, 976]; CV .432 (.114) [.379, .481]
  SPO: False alarm 23.68 (11.83) [18.15, 29.78]; VKsize 68.59 (12.44)
       [62.85, 74.13]; mnRT 1276 (159) [1213, 1340]; CV .314 (.087) [.278, .351]
English L1 (n = 16)
  WRI: False alarm 7.81 (6.57) [5.53, 10.95]; VKsize 84.76 (9.66)
       [79.91, 89.21]; mnRT 960 (228) [853, 1047]; CV .347 (.104) [.291, .399]
  SPO: False alarm 14.23 (9.12) [5.53, 10.95]; VKsize 81.70 (10.03)
       [79.91, 89.21]; mnRT 1169 (122) [853, 1047]; CV .222 (.065) [.291, .399]
Total L2 (N = 115)
  WRI: False alarm 14.69 (13.22) [12.18, 17.53]; VKsize 66.66 (18.79)
       [63.17, 70.00]; mnRT 1250 (404) [1179, 1326]; CV .433 (.089) [.416, .451]
  SPO: False alarm 27.74 (14.75) [25.19, 30.35]; VKsize 52.82 (20.06)
       [48.91, 56.62]; mnRT 1472 (341) [1416, 1538]; CV .313 (.083) [.297, .329]

Note: False alarms, 'yes' responses to pseudowords; VKsize, correction for guessing
scores (hits – false alarms); mnRT, mean response time in milliseconds; CV,
coefficient of variation (SDmnRT/mnRT); 95% CI, BCa (bias-corrected and
accelerated) 95% confidence intervals; WRI, written test; SPO, spoken test.

Participants in all five groups met the minimum English proficiency
requirement for university entrance, but group differences were antici-
pated. Minimally, performance by the IELTS 6.5 and IELTS 7+ groups is
expected to differ, as is that between the L1 English group and some, if
not all, of the L2 groups.

Table 7.3 University entry standard study: written and spoken test results. Means
(M), standard deviations (SD), and confidence intervals (CI) for the composite
scores VKsize_mnRT and VKsize_mnRT_CV for the five English entry standard
groups. Entries are M (SD) [95% CI].

IELTS 6.5 (n = 55)
  WRI: VKsize_mnRT 4.36 (.75) [4.16, 4.57]; VKsize_mnRT_CV 4.55 (.63) [4.40, 4.71]
  SPO: VKsize_mnRT 4.37 (.77) [4.15, 4.57]; VKsize_mnRT_CV 4.43 (.61) [4.26, 4.58]
IELTS 7+ (n = 25)
  WRI: VKsize_mnRT 5.01 (.55) [4.79, 5.23]; VKsize_mnRT_CV 5.04 (.59) [4.93, 5.39]
  SPO: VKsize_mnRT 5.03 (.66) [4.73, 5.31]; VKsize_mnRT_CV 5.10 (.76) [4.81, 5.41]
Malaysian (n = 17)
  WRI: VKsize_mnRT 5.39 (.63) [5.10, 5.66]; VKsize_mnRT_CV 5.16 (.68) [4.83, 5.47]
  SPO: VKsize_mnRT 5.16 (.66) [4.85, 5.44]; VKsize_mnRT_CV 5.12 (.60) [4.84, 5.40]
Singaporean (n = 19)
  WRI: VKsize_mnRT 5.94 (.51) [5.73, 6.17]; VKsize_mnRT_CV 5.61 (.67) [5.32, 5.92]
  SPO: VKsize_mnRT 5.57 (.45) [5.37, 5.77]; VKsize_mnRT_CV 5.31 (.58) [5.04, 5.57]
English L1 (n = 16)
  WRI: VKsize_mnRT 5.81 (.48) [5.58, 6.03]; VKsize_mnRT_CV 5.81 (.60) [5.51, 6.10]
  SPO: VKsize_mnRT 6.12 (.41) [5.92, 6]; VKsize_mnRT_CV 6.00 (.48) [5.81, 6.28]
Total L2 (N = 116)
  WRI: VKsize_mnRT 4.91 (.88) [4.74, 5.07]; VKsize_mnRT_CV 4.92 (.74) [4.79, 5.05]
  SPO: VKsize_mnRT 4.82 (.84) [4.66, 4.98]; VKsize_mnRT_CV 4.82 (.73) [4.67, 4.94]

Note: VKsize_mnRT, ((zVKsize + zmnRT)/2) + 5; VKsize_mnRT_CV, ((zVKsize + zCV
+ zmnRT)/3) + 5; 95% CI, BCa confidence interval; WRI, written test; SPO,
spoken test.

Written Test Results

The group mean differences for the individual written test measures are
set out as follows, with ‘<’ indicating increasingly better performance.
The mean differences between groups in the same brackets are not statis-
tically significant. The results of the statistical tests are presented later:

Written VKsize: IELTS 6.5 < [Malaysian < IELTS 7+] < [Singaporean <
English L1]
Written mnRT: IELTS 6.5 < IELTS 7+ < [Malaysian < English L1] <
Singaporean
Written CV: [Malaysian < IELTS 6.5 < Singaporean < IELTS 7+] <
English L1

There is some variation in the orders for the respective measures. The
written VKsize scores cluster into three groups: the IELTS 6.5 group was
the lowest at around 55%; the IELTS 7+ and Malaysian groups were both
over 70%; and the Singaporean and L1 English groups at 85%. The
IELTS 6.5 group also had the slowest mnRT responses at around 1450
milliseconds, with the IELTS 7+ group next at around 1300 millisec-
onds. The mnRT responses for the other three groups ranged from just
under 900 to just under 1,000 milliseconds. The Singaporean group was the
fastest, significantly faster than even the L1 group, the latter not differing
significantly from the Malaysian group. The mnRTs for the Malaysian,
English L1 and Singaporean groups were similar to the overall mean for
the L2 university group in Study 1 (M = 960 milliseconds, SD = 203).
The L1 English group here was noticeably slower than the L1 group in
Study 1 (M = 777 milliseconds, SD = 200). No discernible pattern was
evident for the CV results. The Malaysian group had the least consistent
responses at .46 and the L1 English group the most consistent at .35, the
latter somewhat higher than the .25 for the L1 baseline group in Study 1.
The individual written test scores were combined into the two com-
posite scores. The group orders for the respective score are as follows:

Written VKsize_mnRT: IELTS 6.5 < [IELTS 7+ < Malaysian] < [Malaysian
< Singaporean < English L1]
Written VKsize_mnRT_CV: [IELTS 6.5 < IELTS 7+ < Malaysian <
Singaporean] < English L1

The means for groups appearing in more than one bracket were not
significantly different from the other group(s) in the bracket, though
there were mean differences. The overlaps in all three measures show that
there was considerable variability in the outcomes. This variability will be
characterized more precisely in the statistical analyses below. The pattern
of mean differences emerging from the written test results is a continuum
of performance marked by the IELTS 6.5 group at one end and the L1
English group at the other, though the Singaporean group was very close
to the latter.

Spoken Test Results

Performance on the spoken version showed the same relative pattern as
the written version, but performance on the VKsize and mnRT measures
was systematically lower. The mean orders are as follows:

Spoken VKsize: IELTS 6.5 < [IELTS 7+ < Malaysian] < [Malaysian <
Singaporean] < English L1
Spoken mnRT: IELTS 6.5 < [IELTS 7+ < Malaysian] < [Malaysian <
Singaporean] < English L1
Spoken CV: IELTS 6.5 < [IELTS 7+ < Malaysian < Singaporean] < English
L1

The spoken VKsize scores were 10% to 15% lower across the groups.
The VKsize means were also more differentiated. The IELTS 6.5 group
was the lowest at 43%, the IELTS 7+ and Malaysian groups were around
60%, the Singaporean group just under 70%, and the L1 English group
over 80%. The spoken mnRT values were also 200–300 milliseconds slower
than the written mnRT values. This included the L1 group, which was
200 milliseconds slower in recognizing the words in the spoken version,
despite a very small drop in spoken VKsize scores (written: M = 85%,
SD = 10.00; spoken: M = 82%, SD = 10.00). The
CV scores differentiated between the IELTS 6.5 score as the highest (=
least consistent), the other L2 groups in a single grouping, and the L1
English group as the lowest. The spoken mnRTs scores were higher and
had less variability, resulting in lower spoken CV scores compared with
the written test results. Lower CV values are notionally more consistent,
but only when accompanied by faster mean recognition times. Very slow
responders can also be very consistent.

As was the case with the written test findings, the mean differences for
the spoken test range from the IELTS 6.5 group at one end to the L1
English group at the other, with the Singaporean group being very close
to the L1 group. The group orders for the two composite scores are as
follows:

Spoken VKsize_mnRT: IELTS 6.5 < IELTS 7+ < Malaysian < Singaporean
< English L1
Spoken VKsize_mnRT_CV: IELTS 6.5 < [IELTS 7+ < Malaysian] <
[Malaysian < Singaporean] < English L1

The spoken VKsize_mnRT score discriminated between all five groups,
the only measure in either mode to do so. The other two composite mea-
sures had the same mean order, but the absolute differences were small
and there was significant variability.
In the spoken test, the mnRTs were higher and less variable, resulting
in lower CV scores. Slower spoken responses are due, in part, to the
nature of the phonological cue. In the written version, the entire word
appears at once, with the printed word providing immediate, simultane-
ous cues for recognition. Spelling also provides a way to disambiguate
meaning for words that sound the same. The phonological cue, in
contrast, is constrained by the linear nature of the sound stream. The
test-taker must wait until the entire word has been presented before giving
a response. Readers may thus respond relatively faster in the written version.
Given that timing begins with sound onset, there is the possibility that
word length can affect recognition time. However, the words were similar
in length, ranging mostly between five and seven letters, and the
keyboard-collected response times were insensitive to this relatively fine difference
in length. The greater difficulty in the spoken version was most evident in
pseudoword performance, where the false-alarm rate was almost twice
that of the written version for most of the participants, regardless of
group. Without the visual information provided by spelling cues, the task
of judging whether a segment was a known word appeared to be much
more difficult.
In summary, the pattern of mean differences for the spoken test results
mirrors that of the written test results. There is a continuum of perfor-
mance marked by the IELTS 6.5 group at one end and the L1 English
group at the other. Unlike on the written test, the Singaporean group was
more similar to the other L2 groups than to the L1 group in the spoken
format.
The sensitivity of the measures, as reflected in statistical significance
and effect size, is examined next.

Sensitivity of the Lexical Facility Measures to Entry Standard Differences

Chapter 6 examined the sensitivity of the lexical facility measures to
groups at clearly differentiated proficiency levels. The proficiency
differences between the entry standard groups in this study are not
as distinct, nor are they ordered a priori by proficiency levels. The stu-
dents in the five groups ostensibly have the minimum proficiency needed
to begin English-medium study at university, but the pattern of responses
set out above show distinct levels of test performance. There is a rela-
tively stable continuum spanning the IELTS 6.5 group scores at the lower
end and the L1 group at the upper. Between these two poles, the differences
are less clear-cut; in both versions, the IELTS 7+ group is similar to the
IELTS 6.5 group at the lower end, and the Singaporean group approxi-
mates the L1 group at the upper end for the written test results and
is similar to the other L2 groups for the spoken test results. As in the
previous chapter, the mean differences are tested for statistical signifi-
cance in a series of one-way analyses of variance (ANOVAs) on the writ-
ten and spoken tests. The dependent variables for the respective test are
the three individual measures, VKsize, mnRT, and CV, and the two com-
posite scores, VKsize_mnRT and VKsize_mnRT_CV. Entry standard
group is the independent variable. Post hoc tests comparing performance
by group pairs are also carried out for all the ANOVA tests.
The lexical facility measures are considered sensitive to group differ-
ences if they reach statistical significance at p < .05 level, accompanied by
an effect size that reaches a recognized level of impact. The effect size
for the omnibus ANOVAs is eta-squared (η2). The interpretation of η2
is based on Plonsky and Oswald (2014: 889), with .06 being small,
.16 medium, and .36 large. The effect size for the post hoc comparisons
is Cohen’s d, interpreted as .40 being small, .70 medium, and 1.0 large.
As noted, these values are all larger than the more commonly used values
proposed in Cohen (1988).
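
For the omnibus tests, η2 is the ratio of between-group to total sums of squares. A minimal sketch (the study's analyses were run in a statistics package; this is only an illustrative equivalent):

```python
import numpy as np
from scipy import stats

def eta_squared(*groups):
    """Eta-squared for a one-way design: SS_between / SS_total."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Used alongside the omnibus F test, e.g.:
# f, p = stats.f_oneway(g1, g2, g3, g4, g5)
# eta2 = eta_squared(g1, g2, g3, g4, g5)
```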
Before running the tests, the data were examined to ensure they met
the assumptions for the ANOVA procedure. No outliers were observed in
the written test results and only one in the spoken test data (for an IELTS
6.5 participant in the mnRT). The case was removed from the analysis.
The normality assumption was met for all conditions in both tests, save
for two, the Singaporean group’s VKsize responses in the written version
and the L1 English group's CV responses in the spoken one. Although not
all the conditions met the assumption, the one-way ANOVA is
considered robust to violations of normality (Maxwell and Delaney
2004). The homogeneity of variance assumption was met for all but two
of the measures (at p < .05), with the VKsize score in both written and
spoken versions being the exception. Given the heterogeneity of variance,
the significant findings are confirmed by running Welch’s ANOVA for
the omnibus test and the Games–Howell test for the pairwise compari-
sons. Bootstrapping is also used for the pairwise comparisons to provide
a more robust test, given the differences in sample sizes, and the fact that
some of those data sets are borderline in meeting normality and variance
assumptions (Larson-Hall and Herrington 2009).
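
A sketch of these robustness checks is given below. It assumes the pingouin Python package (one of several tools offering Welch's ANOVA and the Games–Howell test) and uses synthetic data in place of the real scores; the column names are hypothetical:

```python
import numpy as np
import pandas as pd
import pingouin as pg  # assumed package; equivalent procedures exist elsewhere

rng = np.random.default_rng(0)
# Synthetic stand-in for the real data: three groups with unequal variances
df = pd.DataFrame({
    "group": np.repeat(["IELTS 6.5", "IELTS 7+", "English L1"], 30),
    "VKsize": np.concatenate(
        [rng.normal(m, s, 30) for m, s in ((56, 17), (73, 12), (85, 10))]
    ),
})
print(pg.welch_anova(data=df, dv="VKsize", between="group"))           # omnibus
print(pg.pairwise_gameshowell(data=df, dv="VKsize", between="group"))  # pairwise
```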

Test Results

The one-way ANOVAs were carried out for the respective written and
spoken tests. The significance and effect size findings for both versions are
given in Table 7.4. Results are reported for both the complete data set
and for subsets in which all participants with false-alarm rates of greater
than 20% are removed. Post hoc tests comparing performance by group
pairs were subsequently carried out for all the univariate ANOVA tests
for the complete data sets. These are reported in Tables 7.5–7.7.
All five univariate ANOVAs were statistically significant for both the
written and spoken tests. The η2 values for VKsize and mnRT results were
around the threshold for a strong effect for both versions. The VKsize,
mnRT, and composite VKsize_mnRT_CV measures all accounted for


around 40% of the variance in both the written and spoken test results.
The strongest effect was evident in the composite VKsize_mnRT mea-
sure, which accounted for 50% of the variance in both the written and
spoken test findings. In absolute terms, this supports the lexical facil-
ity proposal, as the combined VKsize and mnRT measure had a larger
mean than VKsize alone. However, the differences are not statistically
reliable, as there is substantial overlap in the CIs for the respective
measures. The individual CV measure had a lower effect size, accounting
for 10% and 20% of the variance for the written and spoken test results,
respectively. When combined with the other two measures in the composite
VKsize_mnRT_CV measure, it yielded a smaller overall effect size than that
evident in the VKsize_mnRT measure. However, as noted, the difference is
not statistically significant. The somewhat negative effect of the CV in the
composite score is more evident when pairwise group comparisons are
considered in Table 7.6 below.

Table 7.4 Entry standard study. One-way ANOVA for individual and composite
lexical facility measures as discriminators of English proficiency groups

                                       df       F        η2    95% CI for η2
VKsize          WRITTEN  all           (4,126)  20.69**  .39   [.24, .52]
                         20% fa trim   (4,96)   14.64**  .38   [.23, .52]
                         10% fa trim   (4,75)   11.98**  .41   [.23, .57]
                SPOKEN   all           (4,126)  19.69**  .39   [.24, .52]
                         20% fa trim   (4,50)   11.63**  .48   [.32, .61]
mnRT            WRITTEN  all           (4,126)  19.94**  .39   [.24, .52]
                         20% fa trim   (4,96)   14.07**  .37   [.22, .51]
                         10% fa trim   (4,75)   11.98**  .41   [.23, .57]
                SPOKEN   all           (4,126)  14.54**  .32   [.19, .46]
                         20% fa trim   (4,50)   12.81**  .51   [.35, .62]
CV              WRITTEN  all           (4,126)  3.13*    .09   [.02, .20]
                         20% fa trim   (4,96)   2.59*    .09   [.01, .22]
                         10% fa trim   (4,75)   2.16     .11   [.01, .26]
                SPOKEN   all           (4,126)  7.85**   .20   [.08, .34]
                         20% fa trim   (4,50)   8.22**   .39   [.23, .53]
VKsize_mnRT     WRITTEN  all           (4,126)  31.89**  .50   [.37, .62]
                         20% fa trim   (4,96)   21.55**  .47   [.38, .58]
                         10% fa trim   (4,75)   15.34**  .47   [.32, .60]
                SPOKEN   all           (4,126)  29.69**  .49   [.36, .61]
                         20% fa trim   (4,50)   18.47    .59   [.45, .71]
VKsize_mnRT_CV  WRITTEN  all           (4,126)  17.39**  .37   [.24, .50]
                         20% fa trim   (4,96)   10.83**  .31   [.16, .46]
                         10% fa trim   (4,75)   8.56**   .31   [.14, .48]
                SPOKEN   all           (4,126)  24.09**  .43   [.28, .58]
                         20% fa trim   (4,50)   18.30    .59   [.46, .69]

Note: *p < .05; **p < .0005; VKsize, correction for guessing scores (hits – false
alarms); mnRT, mean response time in milliseconds; CV, coefficient of variation
(SDmnRT/mnRT); VKsize_mnRT, ((zVKsize + zmnRT)/2) + 5; VKsize_mnRT_CV,
((zVKsize + zCV + zmnRT)/3) + 5; 20% fa, data trimmed to exclude mean
false-alarm rates above 20%.

Table 7.5 University entry standard group. Significant pairwise comparisons for
the VKsize measure for written and spoken test results. Entries are mean
difference; Cohen's d [95% CI].

VKsize WRITTEN
  IELTS 7+ vs IELTS 6.5:     16.91**; d = 1.08 [.57, 1.58]
  Malaysian vs IELTS 6.5:    14.86**; d = .86 [.29, 1.42]
  Singaporean vs IELTS 6.5:  28.97**; d = 1.81 [1.21, 2.40]
  Singaporean vs IELTS 7+:   12.06**; d = 1.05 [.40, 1.67]
  Singaporean vs Malaysian:  14.11*; d = .95 [.26, 1.63]
  English L1 vs IELTS 6.5:   28.76**; d = 1.81 [1.18, 2.44]
  English L1 vs IELTS 7+:    11.77**; d = 1.08 [.41, 1.75]
  English L1 vs Malaysian:   13.81**; d = .96 [.23, 1.67]

VKsize SPOKEN
  IELTS 7+ vs IELTS 6.5:     14.18**; d = .77 [.28, 1.26]
  Malaysian vs IELTS 6.5:    16.96**; d = .89 [.33, 1.45]
  Singaporean vs IELTS 6.5:  25.49**; d = 1.41 [.85, 1.99]
  Singaporean vs IELTS 7+:   11.31**; d = .66 [.05, 1.27]
  English L1 vs IELTS 6.5:   38.59**; d = 2.16 [1.50, 2.82]
  English L1 vs IELTS 7+:    24.41**; d = 1.68 [.95, 2.40]
  English L1 vs Malaysian:   21.63**; d = 1.50 [.72, 2.27]
  English L1 vs Singaporean: 13.10**; d = .89 [.20, 1.59]

Note: Mean difference, first-named group mean – second-named group mean;
d, Cohen's d; CI, 95% confidence interval for Cohen's d; IELTS 6.5, students
entering with IELTS 6.5 overall (n = 55); IELTS 7+, students entering with IELTS
7–7.5 overall (n = 25); Malaysian, students from Malaysian high school English
(n = 17); Singaporean, students from Singaporean high school (n = 19); English
L1, students educated in English L1 countries: the US, New Zealand, Canada,
South Africa (n = 16); *p < .05; **p < .01 (two-tailed).

Table 7.6 University entry standard study. Significant pairwise comparisons for
the mnRT and CV measures for written and spoken test results. Entries are mean
difference; Cohen's d [95% CI].

mnRT WRITTEN
  Malaysian vs IELTS 6.5:    .163**; d = 1.39 [.80, 1.98]
  Malaysian vs IELTS 7+:     .116**; d = 1.14 [.48, 1.80]
  Singaporean vs IELTS 6.5:  .204**; d = 1.86 [1.22, 2.41]
  Singaporean vs IELTS 7+:   .157**; d = 1.62 [.94, 2.31]
  English L1 vs IELTS 6.5:   .171**; d = 1.46 [.85, 2.06]
  English L1 vs IELTS 7+:    .124**; d = 1.21 [.52, 1.89]

mnRT SPOKEN
  IELTS 7+ vs IELTS 6.5:     .049*; d = .63 [.19, 1.31]
  Malaysian vs IELTS 6.5:    .060*; d = .76 [.22, 1.29]
  Singaporean vs IELTS 6.5:  .091**; d = 1.19 [.63, 1.75]
  Singaporean vs IELTS 7+:   .041*; d = .62 [.00, 1.00]
  English L1 vs IELTS 6.5:   .128**; d = 1.72 [1.10, 2.35]
  English L1 vs IELTS 7+:    .079**; d = 1.21 [.53, 1.89]
  English L1 vs Malaysian:   .068*; d = 1.18 [.44, 1.92]
  English L1 vs Singaporean: .037*; d = .82 [.13, 1.51]

CV WRITTEN
  English L1 vs IELTS 6.5:   .086**; d = .94 [.36, 1.52]
  English L1 vs IELTS 7+:    .069*; d = .71 [.07, 1.36]
  English L1 vs Malaysian:   .111**; d = .99 [.26, 1.70]
  English L1 vs Singaporean: .085*; d = .78 [.09, 1.47]

CV SPOKEN
  IELTS 7+ vs IELTS 6.5:     .056**; d = 1.48 [.88, 2.07]
  English L1 vs IELTS 6.5:   .112**; d = 1.49 [.88, 2.10]
  English L1 vs IELTS 7+:    .055*; d = .72 [.07, 1.36]
  English L1 vs Malaysian:   .072**; d = 1.05 [.32, 1.77]
  English L1 vs Singaporean: .091**; d = 1.19 [.47, 1.91]

Note: Mean difference, first-named group mean – second-named group mean;
d, Cohen's d; CI, 95% confidence interval for Cohen's d; group definitions as in
Table 7.5; *p < .05; **p < .01 (two-tailed).
The mean false-alarm rates ranged from 8% for the L1 English group
to over 30% for the IELTS 6.5 group. The overall false-alarm rate for the
L2 participants was 15% in the written version and 28% in the spoken
test. In comparison, the overall rate for Study 1 was 10%.

Table 7.7 University entry standard study. Significant pairwise comparisons for
the composite VKsize_mnRT and VKsize_mnRT_CV measures for written and
spoken test results. Entries are mean difference; Cohen's d [95% CI].

VKsize_mnRT WRITTEN
  IELTS 7+ vs IELTS 6.5:     .65**; d = .94 [.44, 1.43]
  Malaysian vs IELTS 6.5:    1.00**; d = 1.42 [.84, 2.02]
  Malaysian vs IELTS 7+:     .38* (X); d = .64 [.02, 1.28]
  Singaporean vs IELTS 6.5:  1.58**; d = 2.27 [1.63, 2.91]
  Singaporean vs IELTS 7+:   .93**; d = 1.73 [1.04, 2.03]
  Singaporean vs Malaysian:  .55**; d = .96 [.27, 1.65]
  English L1 vs IELTS 6.5:   1.44**; d = 2.07 [1.41, 2.72]
  English L1 vs IELTS 7+:    .79**; d = 1.50 [.80, 2.21]
  English L1 vs Malaysian:   .41*; d = .73 [.02, 1.43]

VKsize_mnRT SPOKEN
  IELTS 7+ vs IELTS 6.5:     .66**; d = .86 [.36, 1.35]
  Malaysian vs IELTS 6.5:    .79**; d = 1.06 [.49, 1.63]
  Singaporean vs IELTS 6.5:  1.19**; d = 1.69 [1.10, 2.28]
  Singaporean vs IELTS 7+:   .54**; d = .84 [.21, 1.46]
  Singaporean vs Malaysian:  .40* (X); d = .72 [.05, 1.39]
  English L1 vs IELTS 6.5:   1.76**; d = 2.46 [1.78, 3.16]
  English L1 vs IELTS 7+:    1.10**; d = 1.69 [.96, 2.41]
  English L1 vs Malaysian:   .96**; d = 1.74 [.94, 2.54]
  English L1 vs Singaporean: .56**; d = 1.29 [.56, 2.03]

VKsize_mnRT_CV WRITTEN
  IELTS 7+ vs IELTS 6.5:     .49**; d = .79 [.30, 1.28]
  Malaysian vs IELTS 6.5:    .61**; d = .94 [.37, 1.50]
  Singaporean vs IELTS 6.5:  1.05**; d = 1.64 [1.06, 2.23]
  Singaporean vs IELTS 7+:   .57**; d = .90 [.28, 1.53]
  Singaporean vs Malaysian:  .45*; d = .67 [0, 1.35]
  English L1 vs IELTS 6.5:   1.24**; d = 2.00 [1.36, 2.65]
  English L1 vs IELTS 7+:    .76**; d = 1.28 [.59, 1.96]
  English L1 vs Malaysian:   .64**; d = 1.01 [.29, 1.73]

VKsize_mnRT_CV SPOKEN
  IELTS 7+ vs IELTS 6.5:     .67**; d = 1.00 [.51, 1.50]
  Malaysian vs IELTS 6.5:    .69**; d = 1.14 [.56, 1.74]
  Singaporean vs IELTS 6.5:  .88**; d = 1.46 [.88, 2.03]
  English L1 vs IELTS 6.5:   1.62**; d = 2.78 [2.06, 3.50]
  English L1 vs IELTS 7+:    .95**; d = 1.46 [.76, 2.16]
  English L1 vs Malaysian:   .93**; d = 1.74 [.93, 2.53]
  English L1 vs Singaporean: .73**; d = 1.57 [.81, 2.34]

Note: Mean difference, first-named group mean – second-named group mean;
d, Cohen's d; CI, 95% confidence interval for Cohen's d; group definitions as in
Table 7.5; *p < .05; **p < .01 (two-tailed); X, comparison not significant in the
individual VKsize analysis.

To assess
the possible effect of differences in false-alarm rates on the results, a sub-
sequent analysis was run on the written and spoken test data. In these
analyses, only those students who had rates of less than 20% were included
in the analysis. The 20% trim reduced the written test sample to n = 101
and the spoken test sample to n = 54, the latter less than half the size of
the original. The overall mean false-alarm rates fell as well. For the written
test results, the overall false-alarm rate reported in Table 7.2 (M = 14.67,
SD = 13.22) halved as a result of the 20% trim (M = 7.87, SD = 5.69). A
similar fall was evident for the spoken test results: overall (M = 27.74,
SD = 14.75); 20% trim (M = 12.57, SD = 5.38). The data
were again trimmed for false-alarm rates of less than 10%. This reduced
the written test sample to n = 74 (with a false-alarm rate of M = 4.90,
SD = 3.11) and the spoken test sample to an unanalyzable n = 14.
The results in Table 7.4 show that despite the trimming, the pattern of
significance levels does not change and there is only a slight improvement
in effect size. Pairwise comparisons similar to the ones reported next also
produced results similar to the original findings, though these are not
reported here.
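
The trimming step itself is straightforward. A pandas sketch, assuming a hypothetical fa_rate column holding each participant's false-alarm percentage:

```python
import pandas as pd

def trim_by_false_alarms(scores: pd.DataFrame, max_fa: float) -> pd.DataFrame:
    """Keep only participants whose false-alarm rate (%) is below max_fa."""
    return scores[scores["fa_rate"] < max_fa]

# e.g., written_20 = trim_by_false_alarms(written_scores, 20)  # n = 101 above
```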
Post hoc pairwise comparisons of complete data sets provide a more
detailed picture of the sensitivity of the measures. Performances on the
written and spoken versions are compared for each measure. Significant
pairwise comparisons for the individual VKsize scores for the two ver-
sions are presented in Table 7.5, with temporal variables reported in
Table 7.6.
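
The d values in Tables 7.5–7.7 follow the standard pooled-SD formulation. The sketch below computes d with a common normal-theory confidence interval, which may differ slightly from the routine that produced the published intervals:

```python
import numpy as np
from scipy import stats

def cohens_d_ci(a, b, alpha=0.05):
    """Cohen's d (pooled SD) with an approximate CI for two groups."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n1, n2 = len(a), len(b)
    pooled_sd = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1))
                        / (n1 + n2 - 2))
    d = (a.mean() - b.mean()) / pooled_sd
    se = np.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = stats.norm.ppf(1 - alpha / 2)
    return d, (d - z * se, d + z * se)
```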
There are two salient findings for the VKsize results, and they hold for
both the written and spoken tests. The first is the noticeable difference
between the IELTS 6.5 group and all the others. This is reflected in the
statistically significant differences and related effect sizes, which ranged
from a Cohen’s d of .9 for the written IELTS 6.5–Malaysian group differ-
ence to over 2 for the spoken IELTS 6.5–L1 English group comparison.
The IELTS 6.5 group clearly performed at a lower level. Another finding
of note is the similarity between the Singaporean and L1 English group
performance. This is particularly evident in the written test results, where
the two groups were nearly identical. Performance by the Singaporean
group decreased somewhat in the spoken version, where it was signifi-
cantly lower than the L1 English group and equal to the Malaysian group.
The mnRT and CV comparisons are presented in Table 7.6.
The mnRT scores for the written test results were less sensitive than the
VKsize scores to group differences. There was no significant difference
between the two IELTS groups, and these two groups were significantly
different from the other three groups, who did not differ among them-
selves. Effect sizes were all strong, ranging from d = 1.14 for the IELTS
7+–Malaysian group comparison to 1.86 for the IELTS 6.5–Singaporean
group difference. The mnRT results for the spoken version were some-
what more sensitive to group differences. The IELTS 6.5 group was sig-
nificantly slower than all the other groups, while the L1 English group
was significantly faster. The effect sizes were somewhat smaller than in the
written test, ranging from the moderate (.63) for the IELTS 6.5–IELTS
7+ group comparison to the largest (1.72) for the IELTS 6.5–L1 English
group comparison.
The CV was the least sensitive of the three measures in both test ver-
sions. In the written test, the L1 English group outperformed all the
other groups, with the differences in all the other comparisons not reach-
ing significance. In the spoken test, the measure was slightly more sensi-
tive, discriminating between the IELTS 6.5 and IELTS 7+ group
performance, in addition to all the comparisons involving the L1 English
group. The effect sizes were also weaker. For the written test results, all
effect sizes were just under 1.0. For the spoken test results, they were
slightly stronger, ranging from 1.05 (English L1–Malaysian) to 1.49
(English L1–IELTS 6.5).
Results for the composite scores are given in Table 7.7. The main interest
is in the sensitivity of the composite measures in comparison with the
VKsize measures, that is, in whether the composite d value for a given
comparison is higher than that for the same comparison in the individual
VKsize analysis reported in Table 7.5.
For the VKsize_mnRT measure, six out of the eight written test
comparisons yielded greater effect sizes than the individual VKsize
analysis. In the spoken results, all the composite comparisons (8 out
of 8) had greater effect sizes. This is consistent with the
proposal that the combination of the VKsize and mnRT measures will
yield a more sensitive measure of group differences than VKsize alone.
However, some of these differences were very small, for example, only
.01 in the case of the Malaysian–Singaporean pair, and none were statisti-
cally significant, as evident in the overlap of the CIs. The inclusion of CV
in the VKsize_mnRT_CV composite did not improve the composite’s
sensitivity. Only half the composite comparisons (4 out of 8) in the writ-
ten test results yielded stronger effect sizes, while only three out of 8
spoken comparisons showed an advantage for the VKsize_mnRT_CV
over the individual VKsize comparisons.

Sensitivity of the Lexical Facility Measures to Frequency Differences

A fundamental assumption of the lexical facility account is that word
frequency statistics are a reliable predictor of learning outcomes. This
assumption was supported by the written test results reported in Chapter
6 and is examined again here for the written and spoken test results. As
in the previous chapter, the measure of vocabulary size is the percentage
of hits (‘yes’ responses to words) at the respective frequency band levels,
with the mnRT as the measure of speed. The VKsize score is not used
because it is calculated across all the frequency bands. The highest fre-
quency level had near-ceiling performance by all the groups. As this
resulted in a markedly non-normal distribution, nonparametric statistics
are used to test frequency-level differences. The median and interquartile
range (IQR) for the five frequency levels are used as an alternative to the
mean and SDs.
Figures 7.1 and 7.2 present the median for the hits and the mnRTs
across the five frequency levels, respectively, whereas Fig. 7.3 presents the
mean CV ratio.
The median frequency-level differences were tested using the Friedman
test, a nonparametric alternative to a one-way repeated-measures
ANOVA. The test was significant for all three measures at p < .001.2 The
follow-up Wilcoxon signed-rank test for the pairwise band comparison
showed that the written and spoken hits were significant for the 1K–3K,
3K–5K, and 5K–7K bands, as well as for the overall 1K–9K bands. The r2
effect sizes ranged from .02 (written 7K–9K) to .32 (written and spoken
1K–9K). For the CV, the only significant pairwise comparisons were for
the 1K–9K comparison, with a negligible effect size of .03. Test details are
given in note 2.

[Fig. 7.1 here: grouped bar chart of the percentage of hits (y-axis: Hits
(percentage), 0–100) for the 1K, 3K, 5K, 7K, and 9K frequency bands, by test
mode (Written, Spoken)]
Fig. 7.1 University entry standard study. Mean proportion of hits by frequency
levels for written and spoken test results
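
A sketch of this nonparametric pipeline, with the r effect size computed as in note 2 (the data layout and the recovery of |z| from the two-sided p value are assumptions):

```python
import numpy as np
from scipy import stats

def band_tests(band_scores):
    """band_scores: 2-D array, rows = participants, columns = the five
    frequency bands (1K, 3K, 5K, 7K, 9K)."""
    chi2, p = stats.friedmanchisquare(*band_scores.T)   # omnibus test
    n = band_scores.shape[0]
    pairwise = []
    for i in range(band_scores.shape[1] - 1):           # adjacent-band follow-ups
        _, p_w = stats.wilcoxon(band_scores[:, i], band_scores[:, i + 1])
        z = stats.norm.isf(p_w / 2)                     # |z| from two-sided p
        pairwise.append((p_w, z / np.sqrt(2 * n)))      # r = z / sqrt(N x 2)
    return chi2, p, pairwise
```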
The hit and mnRT results in both test versions support the frequency
assumption and further attest to the construct validity of the test. The
mnRT results were slightly less sensitive than the hit mea-
sures to differences in band levels, but overall, both measures were strong
indicators of frequency-level differences. The CV measures were only sen-
sitive to the contrast of the most extreme values. And in these cases, the
effect sizes were quite low.
[Fig. 7.2 here: bar chart of mean response times (y-axis: Mean RT (msec),
600–1,600) for the 1K, 3K, 5K, 7K, and 9K frequency bands, by test mode
(Written, Spoken)]
Fig. 7.2 University entry standard study. Mean response times by frequency
levels for written and spoken test results

7.4 Study 2 Findings


This study examined the sensitivity of the lexical facility measures to dif-
ferences in proficiency across five English entrance standard groups at an
Australian university. The groups of international students ranged from
those who entered with the minimum IELTS overall band score of 6.5,
students with 7–7.5 band scores, Malaysian and Singaporean high school
graduates, and English L1 students from English-speaking countries.
Both written and spoken versions of the test were given to all students to
assess the effect of language mode on performance.
The individual VKsize and mnRT measures were reliable indicators of
group differences in both the written and spoken versions of the test. The
CV measure was useful only in discriminating performance between the
English L1 and the other entry groups.

[Fig. 7.3 here: bar chart of mean CV ratios (y-axis: CV ratio, 0.00–0.50) for the
1K, 3K, 5K, 7K, and 9K frequency bands, by test mode (Written, Spoken)]
Fig. 7.3 University entry standard study. Mean CV ratio by frequency levels for
written and spoken test results

Post hoc analyses of the pairwise
differences between the groups showed increasingly better performance
in both test versions, along with an approximate group continuum of
IELTS 6.5 < IELTS 7+ < Malaysian < Singaporean < English L1. The dif-
ferences between the lowest group, IELTS 6.5, and the other four groups
were statistically significant for both test versions. The IELTS 7+ group
also performed significantly lower than the Malaysian, Singaporean, and
L1 English groups for the VKsize measure, but was at the same level as
the Malaysian group for the mnRT and CV responses. The Singaporean
group performed at the same level as the L1 English group in the written
version, but not in the spoken one. Effect sizes for significant differences
were in the moderate to mostly strong range. The largest effect size (d >
2) was for the IELTS 6.5–L1 English group comparisons in the two tests.
Consistent with the lexical facility account, the composite VKsize_mnRT
score produced larger effect sizes than the individual VKsize analysis on
both test versions. However, the support is only suggestive, as the observed
differences between the scores were not statistically significant. The inclu-
sion of the CV score in the VKsize_mnRT_CV composite score decreased
the sensitivity of the measure compared with the individual measures.
A main focus of the study was the comparison of written and spoken
test modes. The overall pattern of group differences was similar in the two
versions. However, the spoken test yielded lower VKsize scores and slower
mnRTs. False-alarm rates were also noticeably higher. All three outcomes
may be due to the linear nature of the phonological stimuli and the absence
of orthographic cues. In the spoken version, mnRTs were slower and the
CV values were also lower. All the groups were slower but more consis-
tent in the spoken version.
Mean false-alarm rates ranged as high as 30% in the data, raising the
question of how variability in these rates might affect the pattern of
results obtained, as well as the comparability of these findings with other
studies with lower rates. After the complete data set had been analyzed,
subsequent analyses were done in which the data were trimmed for false-
alarm rates exceeding 20% in the written and spoken versions, and 10%
in the written one. Although these trims significantly reduced the sample
sizes—by half for the spoken test data at the 20% rate—the resulting
analyses produced results highly comparable to the original analysis. The
test yields consistent results across different false-alarm rates.
The validity of word frequency statistics as an index of vocabulary
knowledge was also examined. The number of words identified (hits) was
plotted by the five frequency-of-occurrence bands used in the test (1K,
3K, 5K, 7K, and 9K) to test the assumption that frequency of occurrence
will predict learning outcomes. The percentage of hits systematically
decreased as a function of frequency band in both the written and spoken
versions. The effect sizes were negligible or small for bandwise differences
but stronger for nonadjacent band differences. The mnRT findings were
similar to the VKsize results for both versions, though there was more
variability in the recognition time measure. The effect sizes were in the
same range as in the hits. The CV results were insensitive to differences in
frequency bands for both, though noticeably less so in the spoken ver-
sion. The findings support the validity of frequency-based approaches to
measuring vocabulary knowledge.
7.5 Conclusions
The findings replicate those from Study 1. The VKsize, mnRT, and, to a
lesser extent, CV measures provide a reliable means to discriminate
among the entry standard groups. The mean differences suggest that the
combination of vocabulary size and recognition speed provides a more
sensitive measure of group differences than size alone, though the find-
ings await further confirmation, as the differences were not statistically
significant.
The lexical facility measures were used to characterize proficiency dif-
ferences across groups of international university students beginning
study at an Australian university. Although all students are assumed to
have the minimum English needed to commence academic study, it is
evident that they differ markedly in the vocabulary skills tapped by the
lexical facility measures. These measures are core elements of academic
language proficiency. The sensitivity of the measures shows that they can
provide an objective, independent benchmark for assessing these skills.
As such, they have potential as an assessment tool in this domain, for
example, as a means to identify students, pre- and post-enrolment, who
may be at academic risk due to shortcomings in English proficiency
(Read 2016).
The study also compared performance of the written and spoken for-
mats to assess whether and how the mode of presentation has an effect on
outcomes. The test format yields a similar pattern of results in both ver-
sions, though the VKsize and mnRT scores are lower in the spoken ver-
sion. False-alarm rates are also higher, indicating that the spoken format
is more challenging.
The studies in the previous chapter and this one examined the sensitiv-
ity of the lexical facility measures to proficiency differences in university
groups representing different user populations (Study 1) and English
standards used for university entry (Study 2), respectively. In the next
chapter, the focus narrows and the sensitivity of the measures to individ-
ual differences in one of these standards, the IELTS test, is investigated.

Notes
1. The false-alarm data depart markedly from a normal distribution, given
that a number of participants had few or no false alarms. A Kruskal–
Wallis test was run to test for the equality of the group false-alarm means.
For the written version, there was a significant difference between the five
groups, χ2 = 17.00, p < .005. η2 = .09 (Lenhard and Lenhard 2014). A
follow-up Mann–Whitney test of pairs showed that the IELTS 6.5 group
was significantly higher than all the other groups, and that the only other
significant difference was between the IELTS 7+ and Singaporean groups,
U = 154.50, p < .05, d = .62. For the spoken version, there was also a
significant difference between the five groups, χ2 = 17.92, p < .001.
η2 = .10. A follow-up Mann–Whitney test of pairs showed mixed results.
The IELTS 6.5 group was significantly higher than the IELTS 7+ group,
U = 480, p < .05, d = .47, but not significantly different from the Malaysian
or Singaporean group. It was significantly higher than the L1 English
group, U = 231.50, p < .01, d = .71; the IELTS 7+ group was also signifi-
cantly higher than the L1 English group, U = 117.50, p < .05, d = .74.
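
For reference, the corresponding scipy calls, run here on synthetic false-alarm rates standing in for the real data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
ielts65 = rng.normal(19, 15, 54).clip(min=0)      # synthetic false-alarm rates
ielts7plus = rng.normal(11, 9, 25).clip(min=0)
h, p = stats.kruskal(ielts65, ielts7plus)         # omnibus across groups
u, p_u = stats.mannwhitneyu(ielts65, ielts7plus)  # pairwise follow-up
```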
2. The degrees of freedom for all the Friedman tests are 4. The p values and
effect sizes are for the band comparisons; r values are given here (the text
above reports them squared, as r2). Effect size is calculated as
r = Wilcoxon z / √(N × 2).
Written hits: Friedman test χ2 = 363.69, significant at p < .001.
The follow-­up Wilcoxon signed-rank test for the pairwise band
comparison showed that the first three adjacent band comparisons
were significant at p < .001: the 1K–3K, (r = .35); 3K–5K, (.54);
the 5K–7K comparison was significant at p < .05, (.14); and the
overall 1K–9K comparison was significant at p < .001, (.57).
Written mnRT: Friedman test χ2 = 262.02, significant at p < .001.
The follow-up Wilcoxon test for the band differences varied some-
what: 1K–3K: p = .003, (r = .17); 3K–5K: p = .275, ns. (.00);
5K–7K: p < .001, (.45); 7K–9K: p = .044 (.04); and the overall
1K–9K comparison: p < .001, (.59).
Written CV: Friedman test χ2 = 10.23, p < .05. The only significant
Wilcoxon test result was the 1K–9K comparison: p < .010, (r = .17).
Spoken hit: Friedman test χ2 = 293.19, significant at p < .001. The
Wilcoxon test showed that the first three band comparisons were
significant at p < .001, 1K–3K, (r = .30), 3K–5K, (.42), 5K–7K
(.28); the 7K–9K comparison was significant at p < .01, (.41); and
overall 1K–9K comparison was significant at p < .001, (.57).
Spoken mnRT: Friedman test χ2 = 223.21, significant at p < .001.
The Wilcoxon test varied: 1K–3K: p < .001, (r = .41); 3K–5K:
p < .001, (.37); 5K–7K: p = .462, ns. (.00); 7K–9K: p = .001, (.20);
and the overall 1K–9K comparison: p < .001, (.59).
Spoken CV: Friedman test χ2 = 9.51, significant at p < .05. The only
significant Wilcoxon test result again was the 1K–9K comparison:
p < .01, (r = .17).
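
For readers who want to reproduce this style of analysis, the following is a
minimal sketch in Python with SciPy (not the analysis code used in the book)
of the Friedman-plus-Wilcoxon procedure and the effect size formula
r = z/√(N × 2) given above. The simulated scores, sample size, and variable
names are illustrative assumptions.

```python
# Sketch of the Friedman / Wilcoxon follow-up described in Note 2.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon, norm

rng = np.random.default_rng(0)
n = 120                                  # illustrative sample size
# Hit rates per frequency band: one column per band, one row per person.
scores = np.clip(rng.normal([.95, .85, .70, .60, .55], .10, (n, 5)), 0, 1)

# Omnibus Friedman test across the five related band scores (df = 4).
chi2, p = friedmanchisquare(*scores.T)
print(f"Friedman chi2 = {chi2:.2f}, p = {p:.4f}")

# Follow-up pairwise Wilcoxon signed-rank test, e.g. the 1K-3K contrast.
stat, p_pair = wilcoxon(scores[:, 0], scores[:, 1])
# Recover the z approximation from the two-sided p, then
# r = z / sqrt(N x 2), the effect size formula used in the note.
z = norm.isf(p_pair / 2)
r = z / np.sqrt(n * 2)
print(f"1K-3K: p = {p_pair:.4f}, r = {r:.2f}")
```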

References
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.).
Hillsdale: Lawrence Erlbaum.
Larson-Hall, J., & Herrington, R. (2009). Improving data analysis in second
language acquisition by utilizing modern developments in applied statistics.
Applied Linguistics, 31(3), 368–390.
Lenhard, W., & Lenhard, A. (2014). Calculation of effect sizes. Retrieved
November 29, 2014, from http://www.psychometrica.de/effect_size.html
Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing
data: A model comparison perspective (2nd ed.). New York: Psychology Press.
Moore, P., & Harrington, M. (2016). Fractionating English language proficiency:
Policy and practice in Australian higher education (T. Liddicoat, Ed.). London:
Taylor & Francis.
Pellicer-Sánchez, A., & Schmitt, N. (2012). Scoring yes-no vocabulary tests:
Reaction time vs. nonword approaches. Language Testing, 29(4), 489–509.
https://doi.org/10.1177/0265532212438053.
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes
in L2 research. Language Learning, 64, 878–912. doi:10.1111/lang.12079.
Read, J. (2016). Post-admission language assessment in universities: International
perspectives. Switzerland: Springer International Publishing.
Schmitt, N., Jiang, X., & Grabe, W. (2011). The percentage of words known in
a text and reading comprehension. The Modern Language Journal, 95(1),
26–43. doi:10.1111/j.1540-4781.2011.01146.x.
8
Lexical Facility and IELTS Performance

Aims

• Evaluate the sensitivity of the lexical facility measures to IELTS overall
  band scores.
• Assess the sensitivity of the measures both independently and in
  combination.

8.1 Introduction
This chapter presents the third of seven studies examining lexical facility
as a second language (L2) vocabulary construct. Study 3 narrows the
scope of the previous study by examining the sensitivity of the lexical
facility measures (VKsize, mnRT, and CV) to band-score differences on
the IELTS test, an English proficiency standard widely used for educa-
tional, employment and immigration purposes. The sensitivity of the
individual and composite measures to score differences across five adja-
cent IELTS bands (5–7) is examined. The data were obtained from stu-
dents in an Australian university foundation-year program (N = 371).
Demonstrating the sensitivity of the lexical facility measures to band-score

differences provides further support for the lexical facility proposal. It
also has implications for the use of the testing format in assessing readi-
ness to take the test.

8.2 Study 3: Lexical Facility and IELTS Performance
The previous chapter examined the sensitivity of the lexical facility mea-
sures to differences in English proficiency as reflected in different English
entry standards for an Australian university. Two of the groups comprised
students using IELTS scores for entry. One group met the minimum 6.5
overall band-score requirement and the other had IELTS scores at the
next two levels, 7.0–7.5. The VKsize scores were sensitive to between-
group differences in both the spoken and written versions of the test. The
mnRT and CV measures were also sensitive to differences in the spoken
version of the test.
This study takes a closer look at the relationship between the lexical
facility measures and IELTS performance across a range of overall band-
score levels. The data come from a university preparation program attached
to an Australian university. The foundation-year course is equivalent to
Year 12 high school study in Australia, providing both the high school
academic coursework and the English language training needed for entry
to university. Students with varying levels of English proficiency enroll in
the program. Many students take IELTS before entering, with overall
band scores ranging from the minimum of 5.0 required for foundation-
year entry to 8.0. In this study, students entering the program took the
written Timed Yes/No Test at the beginning of the academic year. Test
performance was then correlated with the most recent IELTS overall band
score in their academic record at the time of the testing.

Setting and Participants

University foundation-year students (N = 344) participated as volunteers.
Most came from countries that use Chinese as the main written language:
mainland China (n = 226), Hong Kong (n = 54), Macau (n = 16), and
Taiwan (n = 8). The remainder came from a wide range of countries,
including Fiji, Indonesia, Japan, Kazakhstan, Korea, Kuwait, Malaysia,
Nepal, the Philippines, Tanzania, Timor-Leste, and Vietnam. Females
made up 55% of the sample.

Materials and Procedures

All the participants completed a written version of the Timed Yes/No
Test. The 100-item test contained 72 actual words and 28 pseudowords.
The 72 actual words consisted of 18 items from four British National
Corpus (BNC) bands: 1K, 3K, 5K, and 9K. The pseudowords, included
to control for guessing, conform to English orthographic and phonologi-
cal rules. Performance was measured and scored in the same way as in
previous studies. Three individual measures (VKsize, mnRT, and CV)
and two composite measures (VKsize_mnRT and VKsize_mnRT_CV)
were investigated. The mean recognition time (mnRT) and coefficient of
variation (CV) scores were inverted so that higher values reflected better
performance, making the scores consistent with the VKsize values.
Composite measures were calculated by converting the raw means to
standardized (z) scores and averaging them. The conversion to standardized
scores results in negative scores for some. To eliminate these scores, the
value of 5 was added to each score to make all the scores positive.
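
As an illustration, the scoring transformation just described can be sketched
as follows. The values are the band means from Table 8.2, and the sign flip
used to "invert" mnRT and CV is an assumption about how the direction
reversal was done; this is not the author's own code.

```python
# Sketch of the composite-score construction: invert mnRT and CV so
# higher = better, z-standardize, average, then add 5.
import pandas as pd

df = pd.DataFrame({
    "VKsize": [35.3, 39.5, 45.6, 58.8, 71.8],
    "mnRT":   [1342, 1139, 1040, 1032, 861],    # milliseconds
    "CV":     [.436, .397, .382, .358, .329],
})

# Negate speed and consistency so that higher values reflect better
# performance, matching the VKsize scores.
df["mnRT_inv"] = -df["mnRT"]
df["CV_inv"] = -df["CV"]

def z(col):
    return (col - col.mean()) / col.std(ddof=0)

# Average the standardized scores; add 5 to remove negative values.
df["VKsize_mnRT"] = (z(df["VKsize"]) + z(df["mnRT_inv"])) / 2 + 5
df["VKsize_mnRT_CV"] = (z(df["VKsize"]) + z(df["mnRT_inv"]) + z(df["CV_inv"])) / 3 + 5
print(df[["VKsize_mnRT", "VKsize_mnRT_CV"]].round(2))
```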
Each IELTS score was taken from the student's academic record and was
the score closest to that student's entry into the program. As a result,
the time between when the most recent IELTS test had been taken and
when the Timed Yes/No Test was administered varied by individual. The
academic records indicated that the students took the test on average
5.5 months (SD = 2.2 months) before the testing session. Twelve students
who had reported IELTS scores older than one year were not included in
the group of 371 students analyzed.
Participants were tested as class groups in a school computer lab as part
of a class activity, applying the same procedures used in the earlier stud-
ies. The test was administered on LanguageMAP, an online language test-
ing program available at www.languagemap.com. Students in all three
years completed the same test. Test items were randomized for each par-
ticipant and presented individually on a computer screen. Participants
were asked to judge, as quickly and accurately as they could, whether they
knew the target word. They were told that they would see items that were
either actual words or pseudowords, the latter being orthographically
possible words in English. Each trial had a 5000-millisecond time limit.
The participants were told to work as quickly and accurately as possible,
as they would be scored on both dimensions. A practice set of five items
was completed before the test. After the test was completed, the students
were asked to sign a consent form allowing the researcher to access their
IELTS scores from the school administration.

8.3 Study 3 Results


Preliminary Analysis The data were collected at the beginning of the aca-
demic year over three consecutive years (see Table 8.1). There was some
variation across the three years in the distribution of the scores. Year 1
had a relatively large number of students at the 5.5 level and none at the
6.5 or 7.0 level. Year 3 had the largest number overall, but none at the 5.0
level.

The means at the respective IELTS levels across all three groups were
very similar. For example, the VKsize mean for the band-score 6 group
was, for years 1–3, 45, 47, and 44%. None of the small mean differences
for either the VKsize or the mnRT observed across the years for the
respective IELTS levels was significant. The consistency across the three
years is an indication of the reliability of the testing instrument. It
also allows the data to be combined into a single data set. Given the rela-
tively small numbers at the 7.0 (n = 18) and 7.5 (n = 14) levels, these two
levels were combined for the statistical analyses.
The raw data were first examined for four performance factors that can
potentially affect the interpretation of the results. There was adequate test
instrument reliability with Cronbach’s alpha values ranging from .8 to .9
for the VKsize and mnRT measures for the word and pseudoword responses.
Table 8.1 IELTS study data set. Years 1–3 means and standard deviations, within
brackets, for the VKsize, mnRT, and CV measures by IELTS overall band score

        Year #1                     Year #2                     Year #3
Band    VKsize   mnRT   CV          VKsize   mnRT   CV          VKsize   mnRT   CV
5       32.61    1334   .441        28.83    1119   .498        –        –      –
        (13.27)  (431)  (.115)      (6.69)   (267)  (.079)
        n = 32                      n = 8
5.5     39.59    1235   .428        40.04    1205   .376        36.29    1059   .392
        (11.86)  (270)  (.088)      (12.04)  (281)  (.114)      (12.42)  (137)  (.079)
        n = 51                      n = 26                      n = 114
6       45.50    1243   .443        46.67    1074   .295        44.38    1026   .408
        (19.61)  (333)  (.140)      (10.78)  (303)  (.076)      (13.92)  (168)  (.097)
        n = 3                       n = 18                      n = 54
6.5     –        –      –           59.72    1148   .338        57.75    1019   .369
                                    (15.12)  (233)  (.133)      (14.71)  (147)  (.104)
                                    n = 8                       n = 37
7       –        –      –           59.35    1011   .326        68.33    838    .343
                                    (18.17)  (200)  (.112)      (11.00)  (78)   (.105)
                                    n = 4                       n = 14
7.5     74.51    1109   .560        81.86    854    .330        76.24    796    .290
        (1.27)   (238)  (.003)      (6.81)   (45)   (.142)      (7.67)   (53)   (.108)
        n = 2                       n = 5                       n = 7
Total   n = 88                      n = 69                      n = 214      (N = 371)

Reliability      Items   RT         Items   RT                  Items   RT
Words            .91     .93        .92     .88                 .86     .87
Pseudowords      .87     .90        .95     .93                 .88     .89

Note: VKsize, correction for guessing scores (hits − false alarms); mnRT, mean
response time in milliseconds; CV, coefficient of variation (SDMeanRT/Mean RT).
Empty cells indicate that no students were at that band level in that year.

The data were also screened for item recognition time (RT) outliers.
Item responses at less than 300 milliseconds were first removed, as
these were too fast for an actual response. These exceedingly fast responses
were deemed performance errors arising from inadvertent keystrokes and
similar response-external factors, and they accounted for less than 1% of the
total responses. Item response times for the correct hits beyond the 3 SD
Table 8.2 IELTS band-score study. Means, standard deviations, and confidence
intervals (CI) for the lexical facility measures, individual and composite, for
IELTS overall band scores

                   False alarm        VKsize             mnRT (msec)      CV
IELTS overall      M      SD          M      SD          M     SD         M      SD
5 (n = 30)         20.35  12.59       35.30  11.08       1342  443        .436   .119
                   [15.65, 20.06]     [31.16, 39.44]     [1176, 1508]     [.399, .477]
5.5 (n = 169)      19.78  12.62       39.52  11.52       1139  213        .397   .102
                   [17.86, 21.69]     [37.77, 41.27]     [1007, 1171]     [.382, .412]
6 (n = 72)         22.42  13.66       45.61  12.52       1040  214        .382   .110
                   [19.11, 25.63]     [42.66, 48.55]     [989, 1199]      [.355, .406]
6.5 (n = 42)       12.79  12.80       58.77  13.46       1032  131        .358   .111
                   [08.81, 16.78]     [54.58, 62.97]     [992, 1073]      [.326, .389]
7.0–7.5 (n = 31)   08.79  09.57       71.77  10.27       861   111        .329   .119
                   [5.28, 12.30]      [68.00, 75.55]     [820, 902]       [.295, .370]
Total (N = 344)    18.54  13.20       45.65  15.79       1098  253        .388   .111
                   [17.14, 19.94]     [44.01, 47.36]     [1071, 1122]     [.376, .399]

Composite score    VKsize_mnRT        VKsize_mnRT_CV
                   M (SD)             M (SD)
5                  4.27 (.93)         4.34 (.69)
                   [3.91, 4.62]       [4.08, 4.60]
5.5                4.79 (.53)         4.84 (.51)
                   [4.71, 4.87]       [4.77, 4.92]
6                  5.16 (.64)         5.14 (.58)
                   [5.01, 5.36]       [5.00, 5.28]
6.5                5.56 (.54)         5.47 (.57)
                   [5.40, 5.73]       [5.29, 5.65]
7–7.5              6.29 (.41)         6.04 (.54)
                   [6.14, 6.44]       [5.84, 6.23]

Note: VKsize, correction for guessing scores (hits − false alarms); mnRT, mean
response time in milliseconds; CV, coefficient of variation (SDMeanRT/Mean RT);
CI, 95% confidence interval.

cutoff were then identified. As in the previous studies, these accounted
for less than 2% of the responses across all the participants and were left
intact for the analysis.
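
A minimal sketch of this two-step screening, under the assumption that
trial-level data are held in a long-format table (the column names and
simulated RTs are illustrative, not the study's actual data):

```python
# Sketch of RT screening: drop sub-300 ms responses, flag hits beyond 3 SD.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
trials = pd.DataFrame({
    "participant": np.repeat(np.arange(20), 72),
    "rt_ms": rng.lognormal(7.0, 0.4, 20 * 72),   # word-trial RTs
    "hit": rng.random(20 * 72) < .8,
})

# 1. Remove implausibly fast responses (< 300 ms), treated as keystroke errors.
trials = trials[trials["rt_ms"] >= 300]

# 2. Flag correct-hit RTs more than 3 SD above each participant's mean.
hits = trials[trials["hit"]].copy()
stats = hits.groupby("participant")["rt_ms"].agg(["mean", "std"])
cut = hits["participant"].map(stats["mean"]) + 3 * hits["participant"].map(stats["std"])
hits["outlier"] = hits["rt_ms"] > cut
print(f"{hits['outlier'].mean():.1%} of correct hits beyond 3 SD (left intact)")
```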
False-alarm rates by groups are given in Table 8.2 (see Note 1). The rates
reflect the IELTS score gradient, ranging from just over 20% for the 5 band
to under 10% for the 7+ band. The 5, 5.5, and 6 bands were not significantly
different. Bands 6 and 6.5 were significantly different, as were bands 6 and 7.
Bands 6.5 and 7 were not.2 The possible effect of differing false-alarm rates
will be examined in a multiple regression assessing the contribution of the
three measures.
Bootstrapped bivariate correlations for IELTS band scores and the
individual and composite lexical facility measures are reported in
Table 8.3. There was no evidence of a systematic trade-off between yes/no
performance and recognition speed as would be evident in a negative cor-
relation between VKsize scores and inverted mnRTs. The small but sig-
nificant correlation (r = .31) between the two measures indicates that
participants with larger vocabulary sizes also tended to be faster, but that
other factors are also at work.
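
The bootstrapped correlations reported below can be approximated with SciPy's
bootstrap routine; this sketch assumes paired resampling with BCa intervals,
and the simulated scores stand in for the real data.

```python
# Sketch of a bootstrapped (BCa) CI for the VKsize-mnRT correlation.
import numpy as np
from scipy.stats import bootstrap, pearsonr

rng = np.random.default_rng(2)
n = 344
vksize = rng.normal(46, 16, n)
# Inverted mnRT simulated to correlate about .31 with VKsize.
mnrt_inv = .31 * vksize + rng.normal(0, 16 * np.sqrt(1 - .31**2), n)

res = bootstrap(
    (vksize, mnrt_inv),
    statistic=lambda x, y: pearsonr(x, y)[0],
    paired=True,                # resample (x, y) pairs together
    vectorized=False,
    n_resamples=2000,
    method="BCa",               # bias-corrected and accelerated CI
    random_state=rng,
)
r = pearsonr(vksize, mnrt_inv)[0]
print(f"r = {r:.2f}, 95% BCa CI [{res.confidence_interval.low:.2f}, "
      f"{res.confidence_interval.high:.2f}]")
```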

Table 8.3 IELTS band-score study. Bivariate correlations with bootstrapped
confidence intervals for IELTS band scores and lexical facility measures

                 IELTS band    VKsize        Hits         False alarms  mnRT        CV          VKsize_mnRT
VKsize           .65**
                 [.58, .70]
Hits             .52**         .61**
                 [.36, .52]    [.54, .66]
False alarms     −.24**        −.57**        .31**
                 [−.32, −.13]  [−.63, −.51]  [.21, .40]
mnRT             .41**         .31**         .39**        .04
                 [.32, .48]    [.22, .40]    [.16, .35]   [−.06, .13]
CV               .23**         .22*          .22          .09
                 [.13, .33]    [.11, .33]    [.11, .33]   [−.02, .05]   .24**
                                                                        [.13, .34]
VKsize_mnRT      .65**         .79**         .61**        −.32**        .82*        .29
                 [.59, .70]    [.74, .84]    [.53, .69]   [−.40, −.22]  [.78, .85]  [.19, .39]
VKsize_mnRT_CV   .60**         .70           .53**        −.29**        .74**       .69         .89
                 [.53, .66]    [.65, .75]    [.45, .60]   [−.37, −.19]  [.68, .78]  [.63, .75]  [.86, .91]

Note: N = 344; *p < .01, **p < .001 (two-tailed). VKsize, correction for
guessing scores (hits − false alarms); mnRT, mean response time in milliseconds;
CV, coefficient of variation (SDMeanRT/Mean RT); 95% CI, BCa (bias-corrected
and accelerated) confidence interval.
Descriptive Results

The descriptive statistics for the lexical facility measures are given in
Table 8.2. The number of individuals at the respective IELTS levels
varied, with the largest number of scores, by a considerable degree, at the
5.5 band, and the smallest at the combined 7+ band (7 and 7.5). The
minimum entry requirement for the university in question is an IELTS
score of 6.5. Students in the foundation-year program typically attain
this level by the end of the course.
Performance across the band-score levels for the three measures is pre-
sented visually in Fig. 8.1. The measures are converted to standard scores
to allow a direct comparison. The mean responses for all three lexical
facility measures show a consistent linear relationship between perfor-
mance and band-score level, with the 5 band group having the lowest
VKsize score and the highest mnRT and CV means, and the 7 band
group the highest.
The VKsize and CV measures show a consistent increase across the
band levels. The mnRTs depart slightly from this pattern, being faster at
the 6.0 than at the 6.5 level, though the 15-millisecond difference is not
statistically significant. Composite scores were also calculated and are pre-

[Fig. 8.1 Combined IELTS dataset: Timed Yes/No Test scores by IELTS overall band
scores. Standardized scores (z + 5) for VKsize, mnRT, and CV plotted across the
5 (n = 37), 5.5 (n = 199), 6 (n = 60), 6.5 (n = 43), and 7–7.5 (n = 32) bands.]
sented in Table 8.2. As expected, they mirror the individual measures of
which they are composed. The mean differences for the individual and
composite scores are tested below for statistical significance and effect size.
The Pearson r correlations for the IELTS scores and the respective indi-
vidual and composite measures are given in Table 8.3. The band levels are
treated as a continuous variable here and in the multiple regression analy-
sis reported below.
All the correlations were significant at p < .01 (two-tailed). VKsize had
the strongest correlation with IELTS scores, as evident in the individual
and composite correlations of around r = .60. The bivariate correlations
between IELTS band scores and the individual mnRT and CV measures
were smaller, around .40 and .25, respectively. The nearly identical cor-
relations between band scores across the individual VKsize scores and the
composite VKsize_mnRT and VKsize_mnRT_CV scores indicate that
neither the mnRT nor the CV increased the sensitivity of the VKsize in
the bivariate analysis.

Sensitivity of the Lexical Measures to IELTS Score Differences

The focus here is on the sensitivity of the lexical facility measures to dif-
ferences in IELTS band scores. A question addressed in every study pre-
sented in the book is the degree to which the mnRT and CV measures
can account for variance in IELTS band-score differences beyond that
attributable to vocabulary size alone. Study 1 (Chap. 6) focused on the
sensitivity of the measures to three clearly defined proficiency levels in an
Australian university setting. Study 2 (Chap. 7) examined groups in
which proficiency differences were narrower, in that all groups had met
the university’s English-language entry minimum. But there were also
identifiable differences in performance among the groups, both between
the first language (L1) and L2 groups and within the L2 standard groups
themselves. The range of L2 proficiency levels in this study approximates
that of the L2 groups in Study 1, and the focus is on the sensitivity of the
measures to finer gradations of proficiency within this range.
Test Results

The significance and effect sizes of the mean differences are examined in
five one-way analyses of variance (ANOVAs) done with the three indi-
vidual and two composite measures. The ANOVA tests are followed up
with post hoc tests that examine the pairwise differences between the
scores.3 The mnRTs were log-transformed but not otherwise modified.
The level-by-measure responses were all normally distributed. The homo-
geneity of variance assumption was met for the VKsize and CV scores but
was borderline for the mnRT scores. There is a substantial imbalance in n
size between the 5.5 band and the other groups. As a result, bootstrapping is
used to provide a more robust set of results for the post hoc comparisons
reported in Tables 8.5 and 8.6.
The ANOVA results presented in Table 8.4 show that all five measures
were statistically significant at p < .001. Hits are also included to assess
the sensitivity of a ‘pure’ size measure, unadjusted for false-alarm perfor-
mance. The η2 values for the individual measures differed somewhat.
The VKsize scores accounted for over 40%, hits just under 30%, the
mnRT scores 20%, and the CV 5% for the overall test for equality of
means. The omnibus ANOVAs were followed up by post hoc pairwise
comparisons for the individual (Table 8.5) and composite measures
(Table 8.6).
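
A sketch of the omnibus step (one-way ANOVA with η2 computed from the sums
of squares) is given below; the group means and ns follow Table 8.2, the SDs
are assumptions, and this is not the author's analysis script. (Games–Howell
post hocs are not in SciPy itself; packages such as pingouin provide them.)

```python
# Sketch of a one-way ANOVA over the five band groups plus eta-squared.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(3)
# VKsize scores for the five IELTS band groups (means from Table 8.2).
groups = [rng.normal(m, 12, size) for m, size in
          [(35.3, 30), (39.5, 169), (45.6, 72), (58.8, 42), (71.8, 31)]]

F, p = f_oneway(*groups)

# eta-squared = SS_between / SS_total
grand = np.mean(np.concatenate(groups))
ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_total = sum(((g - grand) ** 2).sum() for g in groups)
eta_sq = ss_between / ss_total
print(f"F = {F:.2f}, p = {p:.4g}, eta2 = {eta_sq:.2f}")
```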
The VKsize scores discriminated between nonadjacent band compari-
sons (5–6, 5–7, 6–7+) and all the adjacent band comparisons, except for
the 5–5.5 difference. The hits also discriminated between the three non-
adjacent band comparisons but only two of the adjacent ones, 5.5–6 and

Table 8.4 IELTS band-score study. One-way ANOVAs for individual and composite
lexical facility measures as discriminators of IELTS overall band scores

                    F        η2     98% CI for η2
VKsize              67.55    .44    .36, .50
Hits                33.26    .28    .20, .60
mnRT                23.75    .22    .13, .27
CV                   5.19    .06    .02, .10
VKsize_mnRT         64.54    .43    .36, .51
VKsize_mnRT_CV      49.49    .37    .28, .45

Note: df (4, 339) for all tests; all F-values are significant at p < .0005.
Table 8.5 IELTS study. Bandwise significant post hoc comparisons for VKsize,
mnRT, and CV
Mean difference d CI for d
VKsize
5.5 and 6 6.08* .48 .19, .77
6 and 6.5 13.19* .94 .53, 1.36
6.5 and 7+ 12.99* .88 .40, 1.36
5 and 6 10.31* .95 .51, 1.37
6 and 7+ 26.16* 1.89 1.38, 2.4
5 and 7+ 36.47* 2.93 2.25, 3.61
Hits
5.5 and 6 8.72** .78 .50, 1.07
6.5 and 7+ 8.99* .90 .45, 1.38
5 and 6 12.37* 1.07 .59, 1.49
6 and 7+ 12.54** 1.05 .60, 1.46
5 and 7+ 24.09** 2.33 1.68, 2.99
mnRT(a)
5 and 5.5 184*** .67 .32, 1.1
5.5 and 6 82* .64 .35, .93
6.5 and 7+ 167* 1.19 .70, 1.70
5 and 6 269** 1.15 .71, 1.59
6 and 7+ 182** .85 .40, 1.29
5 and 7+ 452** 1.60 1.06, 2.14
CV
6 and 7+ .049 .42 .15, .85
5 and 7+ .104* .88 .38, 1.37
Note: *p < .05, **p < .0005, ***p < .10; all Games–Howell significance levels
assume unequal variances; (a) raw values given, contrast calculated on mnRT(log).
VKsize, correction for guessing scores (hits - false alarms); mnRT, mean
response time in milliseconds; CV, coefficient of variation (SDMeanRT/MeanRT); CI,
95% confidence interval for Cohen’s d.

6.5–7+. The mnRT means were significantly different for all the nonad-
jacent comparisons and for all the adjacent comparisons, except for the
6–6.5 difference. Unlike the VKsize and hits measures, the mnRT dis-
criminated between the lowest levels (5–5.5). The CV measures were sig-
nificant for only two comparisons, the nonadjacent 5–6 and 5–7
differences.
The effect sizes for these comparisons varied. The d values for VKsize
and hits were in the moderate range for the lower 5–5.5 and the 5.5–6
comparisons (d = .48–.78). They were stronger for the higher-level adja-
cent pairs, 6–6.5 and 6.5–7+. The 5–7+ (d = 2.93) and 6–7+ (d = 1.89) band comparisons
Table 8.6 IELTS band-score study. IELTS bandwise post hoc comparisons for the
VKsize_mnRT and VKsize_mnRT_CV measures
Mean difference d CI
VKsize_mnRT
5 and 5.5 .53* .86 .46, 1.25
5.5 and 6 .37* .65 .37, .94
6 and 6.5 .40* .68 .29, 1.07
6.5 and 7+ .72* 1.24 .73, 1.74
5 and 6 .90* 1.23 .81, 1.66
6 and 7+ 1.12* 1.98 1.48, 2.48
5 and 7+ 2.02* 2.55 1.87, 3.22
VKsize_mnRT_CV
5 and 5.5 .50* .92 .52, 1.32
5.5 and 6 .29* .57 .28, .84
6 and 6.5 .33* .57 .18, .96
6.5 and 7+ .56* 1.04 .55, 1.53
5 and 6 .79* 1.22 .80, 1.64
6 and 7+ .89* 1.57 1.09, 2.04
5 and 7+ 1.70* 2.74 2.04, 3.43
Note: *p < .05; all Games–Howell significance levels assume unequal variances;
VKsize, correction for guessing scores (hits - false alarms); mnRT, mean
response time in milliseconds; CV, coefficient of variation (SDMeanRT/MeanRT); CI,
95% confidence interval for Cohen’s d.

were very strong. The effect sizes for the significant mnRT comparisons
were comparable to the VKsize measure; smaller for the lower-level, adja-
cent band comparisons and larger for the higher-level, nonadjacent band
comparisons. As in Studies 1 and 2, the CV measure was the weakest.
Of particular interest is how well the effect sizes for the individual
VKsize measure compare with those for the two composite measures.
These are reported in Table 8.6. The two composite measures were more
sensitive than any of the individual measures, discriminating between all
adjacent and nonadjacent bands.
However, there was little difference in the effect sizes for the two com-
posite measures. This may reflect the insensitivity of the CV measure to
band differences. A comparison of d values between Tables 8.5 and 8.6
showed no discernible difference between the composite and individual
scores. The composite measures were superior in discriminating between
band levels, but the difference was not reflected in the effect sizes.
The Lexical Facility Measures as Predictors of IELTS Band-Score Differences

The potential benefit of combining vocabulary knowledge and processing
skill in a single measure can also be examined via multiple regression
analysis. This analysis allows the relative effect of individual mea-
sures on the criterion IELTS score levels to be compared in combination
with the contribution of the other measures. In the composite scores used
so far, the constituent measure scores are averaged, with each making the
same proportional contribution to the composite measure (i.e., the com-
posite VKsize_mnRT score is the average of the VKsize and mnRT z
scores, plus 5). The regression approach allows the relative contribution
of the individual measures to be established independently and provides
an additional way to test the claim that the combination of mean recog-
nition time and consistency with size provides a more sensitive measure
than size alone.
The relative contribution of the variables in combination was tested in
a hierarchical regression analysis in which the VKsize, mnRT, and CV
scores are entered sequentially as predictors of the IELTS score. The data
met the assumptions for the use of the multiple regression approach.4 The
results are given in Table 8.7. The analysis of the entire data set is first
reported. Analyses involving data of participants with false-alarm rates of
<20 and <10% are also done to investigate the possible effect of variabil-
ity of overall false-alarm rates on the pattern of responses.
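
The hierarchical entry can be sketched with statsmodels as follows; the
simulated predictors and criterion are assumptions chosen only to show how
R2 and ΔR2 are read off at each step.

```python
# Sketch of hierarchical regression: enter VKsize, then mnRT, then CV.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 344
vksize = rng.normal(0, 1, n)
mnrt = .3 * vksize + rng.normal(0, 1, n)
cv = rng.normal(0, 1, n)
ielts = .6 * vksize + .2 * mnrt + rng.normal(0, .7, n)

predictors = {"VKsize": vksize, "mnRT": mnrt, "CV": cv}
steps = [["VKsize"], ["VKsize", "mnRT"], ["VKsize", "mnRT", "CV"]]

prev_r2 = 0.0
for step in steps:
    X = sm.add_constant(np.column_stack([predictors[v] for v in step]))
    r2 = sm.OLS(ielts, X).fit().rsquared
    print(f"{' + '.join(step):22s} R2 = {r2:.3f}  dR2 = {r2 - prev_r2:.3f}")
    prev_r2 = r2
```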
The VKsize was entered as the first step and alone accounted for over
40% of the variance in IELTS band-score differences. mnRT was added
in the second step and accounted for an additional, unique amount of
variance (ΔR2) of 5%. Both were significant at p < .001. The inclusion of
the CV in the third step added a negligible, nonsignificant ΔR2 of less than
1%. The total model accounted for 47% of the variance. This is a very
substantial amount for two basic dimensions of language knowledge. The
results show that size and speed combined to provide a more sensitive
measure than size alone, though additional contribution made by mnRT
was small.
Table 8.7 IELTS band-score study. Model summary (R2 and ΔR2) for hierarchical
regression analysis with proficiency level as criterion and VKsize, mnRT, and CV
as predictor variables, with complete and false-alarm-trimmed (20 and 10%)
data sets

                                   β      t       Sig.   R2     ΔR2
Complete data set       VKsize    .566   13.56    .001   .419   .419**
df (3, 340)             mnRT      .221    5.23    .001   .469   .055**
                        CV        .065    1.59    .112   .473   .005
Greater-than-20%        VKsize    .604   11.36    .001   .474   .474**
false-alarm trim        mnRT      .160    2.94    .004   .502   .028**
df (3, 208)             CV        .100    1.97    .050   .511   .008
Greater-than-10%        VKsize    .644    9.28    .000   .527   .527**
false-alarm trim        mnRT      .174    2.47    .015   .553   .027*
df (3, 116)             CV        .031    .466    .642   .554   .001

Note: VKsize, correction for guessing scores (hits − false alarms); mnRT, mean
response time in milliseconds; CV, coefficient of variation (SDMeanRT/Mean RT); β,
standardized beta coefficient; Sig., significance; df, degrees of freedom for
model 3 ANOVA.

The overall mean false-alarm rate was 18%, which represents values
ranging from 20% for the lowest 5 level to 8% for the highest 7–7.5 level.
Two follow-up analyses were done to assess whether a high false-alarm
rate affects the results. The complete data set was trimmed to only include
those participants whose mean false-alarm rate did not exceed 20% or
10%, respectively. The 20% trim yielded an overall false-alarm rate of
10% (standard deviation [SD] of 7), and the 10% trim a mean false-
alarm rate of near 5% (SD of 3.5). Both regression analyses produced the
same pattern of responses. The VKsize scores accounted for the over-
whelming amount of the variance, and the mnRT scores accounted for a
significant, small amount of additional variance. The total R2 increased
for each successive trim, but the confidence intervals (CIs) within brack-
ets indicated that the differences were not statistically significant. For the
respective analyses: overall, R2 = .47, [.40, .54]; 20% trim, R2 = .51, [.40,
.59]; and 10% trim, R2 = .56, [.44, .67].
8.4 Findings for Study 3: IELTS Band Scores


Study 3 focused on the sensitivity of the lexical facility measures to overall
score differences in IELTS band scores. The VKsize scores discriminated
among all five IELTS overall band-score levels. Effect sizes for lower-level,
adjacent band comparisons were moderate, while those for higher-level,
adjacent bands and nonadjacent band comparisons were strong. The
mnRT measure was somewhat less sensitive than the size measure. The
effect size for the significant comparisons was moderate to strong. The
CV measure was only of limited sensitivity, discriminating between only
two nonadjacent bands. The composite VKsize_mnRT and VKsize_
mnRT_CV scores discriminated between all adjacent and nonadjacent
band levels and had effect sizes in the same range as the significant, non-
adjacent results.
A hierarchical regression showed that the mnRT measure accounted
for a unique amount of additional variance overall. This effect was not
evident in the pairwise comparisons involving the composite scores and
is consistent with the notion that size and speed together yield a more
sensitive measure than size alone. Follow-up regression analyses were run
with data sets trimmed at 20% and 10% false-alarm rate to investigate
the potential effect of high false-alarm rates. The follow-up analyses
accounted for slightly more overall model variance, but the difference was
not statistically significant.

8.5 Conclusions
The VKsize and mnRT measures were sensitive to IELTS band-level dif-
ferences. The CV, in contrast, accounted for few of the differences
observed. These findings replicate those of the first two studies, showing
that the VKsize and mnRT measures are reliable discriminators of test
levels and account for moderate-to-strong effect sizes for these differ-
ences. The composite score results support the claim that the combina-
tion of size and speed provides a more sensitive index of band-score
differences than size alone. The composites discriminated between all the
band levels, and the mnRT measure accounted for a significant, unique
amount of variance over and above the VKsize measure in the regression
analysis. The results are tempered, though, by the fact that the composite
measures yielded effect sizes in about the same range as the individual
measures, and the additional amount of variance accounted for in the
regression model was only about 5% of the total model.
In the next chapter, the sensitivity of the lexical facility variables is
examined in the context of placement testing.

Notes
1. The original number of participants was 371. However, 27 of these had
false-alarm rates exceeding 50%, including several around 75%. These
cases, mostly from the 5 and 5.5 bands, were removed from the analysis,
leaving a total sample of N = 344.
2. The false-alarm data depart markedly from a normal distribution, given
that some participants had few-to-no false alarms. A Kruskal–Wallis
test was run to test for the equality of the group false-alarm means. There
was a significant difference between the groups, χ2 = 36.89,
p < .001, η2 = .07 (Lenhard and Lenhard 2014). A follow-up Mann–
Whitney test of the pairs showed that the 5, 5.5, and 6 bands were not
significantly different. Bands 6 and 6.5 were significantly different,
U = 871, p < .001, d = .53, as were bands 6 and 7+, U = 459, p < .001,
d = 1.05. Bands 6.5 and 7 were not significantly different.
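A sketch of this nonparametric comparison (Kruskal–Wallis omnibus plus a
Mann–Whitney follow-up) using SciPy is given below; the simulated
false-alarm rates are assumptions based on the Table 8.2 means.

```python
# Sketch of the Kruskal-Wallis / Mann-Whitney procedure in this note.
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

rng = np.random.default_rng(5)
# False-alarm rates (%) for bands 5, 5.5, 6, 6.5, 7+ (means from Table 8.2).
fa = [np.clip(rng.normal(m, 12, size), 0, 100) for m, size in
      [(20.4, 30), (19.8, 169), (22.4, 72), (12.8, 42), (8.8, 31)]]

H, p = kruskal(*fa)
print(f"Kruskal-Wallis chi2 = {H:.2f}, p = {p:.4f}")

# Follow-up pairwise test, e.g. bands 6 and 6.5.
U, p_pair = mannwhitneyu(fa[2], fa[3], alternative="two-sided")
print(f"Bands 6 vs 6.5: U = {U:.1f}, p = {p_pair:.4f}")
```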
3. Statistical significance is set at the conventional p < .05, and strength of
the difference is reported in effect size measures. The effect size for the
omnibus ANOVAs is eta-squared (η2). The ‘real-world’ interpretation of
η2 is based on Plonsky and Oswald (2014, p. 889), with .06 considered
small, .16 medium, and .36 large. The effect size for the post hoc compari-
sons is Cohen’s d. It is interpreted as .40 being small, .70 medium, and 1.0
large.
4. The first assumption for hierarchical regression is that the criterion vari-
able is continuous. The criterion here is the five IELTS band-score levels.
In the analysis, they are treated as continuous scores, though the small
range of five levels might make this a questionable assumption for some.
The data met the other assumptions for the use of the regression proce-
dure. There was independence of residuals, as assessed by a Durbin–
Watson statistic of 1.64. Scatterplot analyses indicated that a linear
relationship exists between the three predictor variables collectively and
individually. A visual inspection of the P–P plot showed the standardized
residuals to be approximately normally distributed, while the residual
plots indicated that the data met the homoscedasticity assumption.
Tolerance values of around .9 showed no problem with multicollinearity.

References
Lenhard, W., & Lenhard, A. (2014). Calculation of effect sizes. Retrieved
November 29, 2014, from http://www.psychometrica.de/effect_size.html
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes
in L2 research. Language Learning, 64, 878–912. doi:10.1111/lang.12079.
9
Lexical Facility and Language Program Placement

Aims

• Evaluate the sensitivity of the lexical facility measures to language pro-
  gram placement outcomes in Australian and Singaporean language
  schools.
• Assess the sensitivity of the measures independently and in
  combination.

9.1 Introduction
The first three studies showed the three lexical facility measures to be
sensitive to group differences in proficiency, whether reflected in user
groups, university entry standard, or IELTS band scores. Vocabulary size
(VKsize) was the most sensitive measure, accounting for differences in all
the group comparisons and consistently yielding strong effect sizes. The
importance of size (breadth) in second language (L2) proficiency and
performance has been long established, and the findings further under-
score its significance. The mean recognition time (mnRT) was also a sen-
sitive measure in all three studies. It had a larger effect size than the

VKsize measure in the first study, but less sensitive in the other two.
Crucial to the lexical facility proposal, the mnRT measure accounted for
variability in group differences beyond that of size alone. The mnRT val-
ues varied considerably within and between groups. This meant that the
coefficient of variation (CV) measure has been less informative overall,
usually being insensitive to adjacent proficiency-level differences and
yielding smaller effect sizes.
This chapter presents two studies (Studies 4 and 5) that gauge the sen-
sitivity of the lexical facility measures to English proficiency differences as
defined by language program placement. The studies investigate the sen-
sitivity of the measures to differences in English proficiency across a nar-
rower range of proficiency than in the previous datasets. They also include
learners from a lower level of proficiency than has been examined so far.
Study 4 correlates the measures with placement testing outcomes at a
commercial language school in Australia. The predictive power of the
lexical facility measures is compared with that of an in-house placement
test in identifying learner proficiency for placement across four profi-
ciency levels, ranging from beginners to advanced learners (N = 85).
Study 5 examines program placement in a similar setting in Singapore.
It compares performance on the lexical facility measures with student
placement (N = 66) across four program levels, spanning a similar range
as in the Sydney study, though with learners who are somewhat less
proficient.
Evidence for a strong correlation between the measures and the place-
ment results will further demonstrate the validity and reliability of the
lexical facility construct. The research also has a practical dimension.
Placement testing is an important activity in English-language programs
universally, with significant ramifications when it is done poorly.
Misplaced students can suffer in terms of learning outcomes, and pro-
gram quality can be compromised. The placement-testing process can
also be a very time- and resource-intensive activity. Alderson (2005) and
others have suggested that the Yes/No Test format may be a useful tool for
screening and placement decisions, particularly in the early stages, due to
both its reliability and its ease of administration. The untimed Yes/No
Test format has already been applied to placement decisions in English
(Clark and Ishida 2005; Harsch and Hartig 2015) and Spanish
(Lam 2010). Harrington and Carey (2009) were the first to examine the
use of recognition times in placement decisions. Part of these data is pre-
sented here.

9.2 Study 4: Sydney Language School Placement Study
This study compares the effectiveness of the Timed Yes/No Test with that
of an in-house placement test in discriminating among entry placement
levels at an English-language school in Australia. A portion of the results
was originally reported in Harrington and Carey (2009).

Setting and Participants

The setting for the study was a well-established English-language college
in Sydney. The school offers a range of English-language courses, includ-
ing test preparation for IELTS, TOEIC, and other standardized tests. At
the time of the study, it had an enrollment of 300 students coming from
Asia, Europe, and South America. The program uses a rolling enrollment
format, with 6–10 new students commencing studies every Monday.
New students are placed at one of six levels in the program based on the
outcome of a two-hour placement test battery given on the first day.
The participants (N = 85) ranged in age from 19 to 33 and were
approximately split by gender (53% females). Many of the students con-
tinue to university study in Australia or elsewhere at the end of their time
at the college. The largest number of participants came from Korea
(n = 32) and Japan (n = 18), with the remainder from 14 different first
language (L1) backgrounds.

Materials and Procedures

The participants completed two versions of the written Timed Yes/No
Test. Each version contained 72 words and 28 pseudowords. The word
items in the first test were drawn from the four frequency bands (2K,
3K, 5K, and 10K) used in the Vocabulary Levels Test (VLT). The VLT
target words provide a measure that can be both related to the language
program placement decisions and generalized to other settings (see also
Mochida and Harrington 2006). The second test contained word items
taken from course books and materials used at the school. A list of con-
tent words from the elementary to advanced levels of instruction was
selected from recently used texts and materials. In the absence of fixed
vocabulary lists for the program levels, a range of word difficulty
(reflected in the frequency of word occurrence) was obtained by select-
ing words from the program list at the 1K, 3K, 5K, and 10K frequency
bands using the British National Corpus (BNC;
http://www.comp.lancs.ac.uk/ucrel/bncfreq/flists.html). Eighteen items were selected
from each frequency level. Both tests included 28 pseudowords, with
different pseudowords used in the respective tests. The vocabulary mea-
sures analyzed here are combined scores from the two tests (Harrington
and Carey 2009).
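
Assembling a test form of this kind (18 words from each of four frequency
bands plus 28 pseudowords, for 100 items) amounts to stratified sampling; a
small sketch follows, with placeholder word lists standing in for the real
BNC/VLT band lists.

```python
# Sketch of test-form assembly: 18 items per band + 28 pseudowords.
import random

random.seed(6)
# Placeholder word lists standing in for the real frequency-band lists.
bands = {b: [f"{b}_word{i:02d}" for i in range(50)]
         for b in ["1K", "3K", "5K", "10K"]}
pseudowords = [f"pseudo{i:02d}" for i in range(28)]

words = [w for band in bands.values() for w in random.sample(band, 18)]
test_form = words + pseudowords
random.shuffle(test_form)      # presentation order randomized per participant
print(len(test_form), "items, e.g.", test_form[:4])
```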
The study compares scores from the language school’s placement test
battery and the lexical facility measures. The battery contains four tests
that assess English listening, grammar, writing, and speaking skill. The
tests differ in task format and the role they play in the placement process.
The listening and grammar tests are paper based and designed to assess
knowledge of specific linguistic features and content comprehension. The
listening test assesses global listening comprehension skills through con-
tent questions based on a listening passage. The grammar test assesses the
ability to use grammatical structures and identify grammar errors. Both
tests are scored immediately after completion by a teacher using a scoring
key. The writing test consists of a 120-word essay addressing the question
‘Why did you choose to come to Australia?’ The speaking test is an infor-
mal ten-minute interview with a teacher, the content based on a general
list of questions related to family background, interests, career goals, and
so on. After the student completes the writing and speaking tests, they are
scored by the teacher using a holistic six-step scale that specifies compe-
tencies appropriate to the respective program levels. The teacher refers to
the listening and grammar results before assigning the speaking and
listening scores, which serve as the initial placement level. The entire test
battery takes 80–85 minutes to complete.
The language school admits students on a weekly basis, with new stu-
dents tested on Monday. Based on the placement tests, newly entering
students are placed at one of six proficiency levels: beginner, elementary,
lower intermediate, upper intermediate, advanced, and English for
Academic Purposes (EAP). The data collection period in Study 4 spanned
15 weeks, and the same teacher administered all 15 placements.
Each student completed the Timed Yes/No Test and then the language
program tests using the same procedure described in the previous studies.
Test items were randomized for each participant and presented individu-
ally on a computer screen. Participants were asked to judge, as quickly
and accurately as they could, whether they knew the word presented.
They were told that they would see items that were either actual words or
pseudowords, the latter being orthographically possible words in English.
Each trial had a 5000-millisecond time limit. Items not answered were
counted as incorrect. There were only a handful of ‘no answer’ responses
(less than 0.1% of the entire response set). A practice set of five items
with feedback was completed before the test. Instructions for the Timed
Yes/No Test were translated into Korean, Japanese, Chinese, Spanish,
Portuguese, and Czech. A handful of students from other L1 backgrounds
received the instructions in English.
As in the previous studies, three lexical facility measures were collected:
VKsize (proportion of hits minus false alarms), mnRT (mean recognition
times for correct word responses), and CV (SDmnRT/mnRT). Only one
composite score, the VK_mnRT, is reported because the individual CV
results showed no systematic differences among the groups.
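
For concreteness, here is a sketch of how the three measures can be derived
from one participant's trial log; the column names and simulated responses
are illustrative assumptions, not the study's data.

```python
# Sketch: VKsize (hits - false alarms), mnRT, and CV from trial data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
trials = pd.DataFrame({
    "is_word": np.r_[np.ones(72, bool), np.zeros(28, bool)],
    "said_yes": np.r_[rng.random(72) < .70, rng.random(28) < .15],
    "rt_ms": rng.lognormal(7.2, .35, 100),
})

hit_rate = trials.loc[trials["is_word"], "said_yes"].mean()
fa_rate = trials.loc[~trials["is_word"], "said_yes"].mean()
vksize = (hit_rate - fa_rate) * 100            # correction for guessing

hit_rts = trials.loc[trials["is_word"] & trials["said_yes"], "rt_ms"]
mnrt = hit_rts.mean()                          # mean recognition time (ms)
cv = hit_rts.std() / mnrt                      # coefficient of variation
print(f"VKsize = {vksize:.1f}, mnRT = {mnrt:.0f} ms, CV = {cv:.2f}")
```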

9.3 Study 4 Results


Preliminary Analysis Two versions of the Timed Yes/No Test were used in
Harrington and Carey’s original (2009) study. They consisted of a general
version based on the VLT and a program version using items based on
program teaching material. A combination score was also calculated for
the general and program versions together. The study here examines only
the combined scores. For a more detailed analysis of the respective tests,
see Harrington and Carey (2009).

As done in the previous studies, the raw test results were first examined
for adequate test instrument reliability, an absence of excessive item rec-
ognition time outliers and false-alarm rates, and no systematic trade-offs
between speed and accuracy in participants’ responses.
Cronbach’s alpha reliability coefficients for the word and pseudoword
items on the original tests fell within an acceptable range of .85–.92. As
in previous studies, a small number of item recognition times (less than
2%) went beyond 3 standard deviations (SDs), and these were left intact.
The small number of outliers is due in part to the 5000-millisecond cut-
off time for the presentation of individual items, eliminating extremely
slow recognition times.
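
The Cronbach's alpha check mentioned above can be computed directly from an
items-by-participants score matrix; a minimal sketch with simulated
dichotomous item scores (not the study's data) follows.

```python
# Sketch of Cronbach's alpha for a participants-by-items score matrix.
import numpy as np

rng = np.random.default_rng(10)
ability = rng.normal(0, 1, (85, 1))
scores = (ability + rng.normal(0, 1, (85, 72)) > 0).astype(float)

def cronbach_alpha(x):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = x.shape[1]
    item_var = x.var(axis=0, ddof=1).sum()
    total_var = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

print(f"alpha = {cronbach_alpha(scores):.2f}")
```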
In Harrington and Carey (2009), the six placement levels were ana-
lyzed separately. In this study, the beginner (n = 10) group was com-
bined with the elementary (n = 12) group for more power in the analysis.
There was no evidence of a systematic speed–accuracy trade-off by the
participants. The correlation between VKsize and the inverted mnRT

Table 9.1 Sydney language program study. Bivariate Pearson’s correlations for
lexical facility measures, and listening and grammar test scores

              VKsize       Hit          mnRT         CV            VKsize_mnRT  Listening
Hit           .64*
              [.46, .79]
mnRT          .51**        .66**
              [.25, .65]   [.45, .79]
CV            .03          .24*         .14
              [−.23, .18]  [−.20, .20]  [−.34, .03]
VKsize_mnRT   .86*         .75**        .87*         .08
              [.78, .91]   [.59, .84]   [.79, .91]   [−.09, .27]
Listening     .65**        .65*         .66*         .17           .75
              [.49, .75]   [.48, .76]   [.53, .76]   [−.04, .35]   [.64, .83]
Grammar       .65*         .62*         .46*         .13           .62**        .66**
              [.50, .76]   [.42, .74]   [.24, .61]   [−.05, .29]   [.41, .73]   [.52, .78]

Note: N = 87; significant at *p < .05; **p < .001 (two-tailed).

measures was r = .51, indicating a moderately strong relationship


between higher scores and faster responses. See Table 9.1. This is the
opposite of what might be expected if there were a systematic trade-off
between accuracy and speed, and suggests that recognition time and
accuracy tap the same underlying proficiency to some extent. However,
the amount of unexplained variance means that the relationship between
accuracy and speed is also affected by other factors as well.
False-alarm rates are given in Table 9.2. The mean false-alarm
rate for the entire group was 15%, ranging from 20% for the lower

Table 9.2 Sydney language program study. Means, standard deviations, and 95%
confidence intervals for the lexical facility measures at the four placement levels

                         False alarm       VKsize            mnRT (msec)     CV
                         M      SD         M      SD         M      SD       M      SD
1. Elementary            18.11  16.67      36.52  18.90      1854   547      .462   .057
   n = 21                [11.77, 26.61]    [28.48, 43.76]    [1613, 2115]    [.436, .490]
2. Lower intermediate    19.62  13.68      52.23  15.61      1464   284      .482   .072
   n = 19                [15.28, 26.25]    [46.72, 57.73]    [1376, 1600]    [.449, .513]
3. Upper intermediate    16.48  14.99      56.31  14.38      1506   318      .494   .104
   n = 26                [11.54, 22.03]    [50.14, 61.46]    [1384, 1626]    [.457, .533]
4. Advanced              07.52  05.31      69.24  13.43      1326   241      .493   .094
   n = 19                [04.91, 10.31]    [64.76, 74.02]    [1206, 1428]    [.451, .534]
5. Overall               15.05  13.88      53.43  18.86      1543   410      .483   .085
   N = 85                [12.49, 18.64]    [49.31, 57.33]    [1459, 1630]    [.451, .531]

Composite score          VKsize_mnRT, M (SD)
1. Elementary            4.19 (1.01)
                         [3.78, 4.57]
2. Lower intermediate    5.08 (.46)
                         [4.88, 5.31]
3. Upper intermediate    5.14 (.57)
                         [4.92, 5.37]
4. Advanced              5.70 (.45)
                         [5.48, 5.85]

Note: VKsize, correction for guessing scores (hits − false alarms); mnRT, mean
recognition time in milliseconds; CV, coefficient of variation (SDMeanRT/Mean RT);
95% CI, BCa (bias-corrected and accelerated) confidence interval.
intermediate group to 8% for the advanced group. All the groups had
large SDs relative to the mean, indicating considerable variability across
individuals. The false-alarm data are not normally distributed, given that
a portion of the participants had few-to-no false alarms. Kruskal–Wallis
and follow-up Mann–Whitney tests indicated that the advanced group
had a significantly lower false-alarm rate than all the other groups, with
moderate effect size (Cohen’s d = .68).1
Bivariate correlations between the lexical facility measures and the lis-
tening and grammar test scores are presented in Table 9.1. The writing
and speaking ratings are not included, as they are essentially placement
decisions, and thus are not normally distributed. The VKsize, hits, and
mnRT scores significantly correlated with the listening and grammar
tests, with r coefficients in the mid .6 range. The correlation between the
composite VKsize_mnRT and grammar (.65) was the same, but the cor-
relation between VKsize_mnRT and listening scores was stronger (.75),
though the difference was not statistically significant. The CV measure
did not correlate with either placement test.

Descriptive Results

The descriptive statistics for the VKsize, mnRT, and CV scores by Sydney
program placement levels are presented in Table 9.2.
The VKsize and mnRT means differed as a function of placement
level, with the lowest scores produced by the elementary group and the
highest by the advanced group. The CV values showed little variability
across the program levels, ranging from the lowest in the elementary
group to the highest in the upper intermediate and advanced groups.
However, there is a maximum difference of only .025 between the means.
With no observable pattern, the CV values were not included in any
further analysis.
The VKsize and mnRT scores by placement levels are compared visu-
ally with the grammar and listening tests in Fig. 9.1. The scores are pre-
sented as standard scores consisting of the mean standard (z) score plus 5;
the latter added to make all scores positive.
[Fig. 9.1 Sydney language program study. Comparison of VKsize and mnRT scores
with program placement grammar and listening scores (standardized, z + 5) across
four placement levels: elementary (n = 22), lower intermediate (n = 20), upper
intermediate (n = 26), and advanced (n = 20).]

Figure 9.1 shows that the VKsize and mnRT scores were similar to the
pattern of placement test results. The VKsize scores consistently increased
as a function of placement level. The mnRT values were less sensitive in
the middle range, showing little difference between the lower and upper
intermediate groups. The VKsize_mnRT results mirrored that of the
individual VKsize and mnRT results. In the next section, the observed
mean differences are tested for statistical significance and the magnitude
of the effect sizes for the differences.

Sensitivity of the Lexical Facility Measures to Placement Decisions

The lexical facility and program placement tests were compared for how
well they discriminated between the groups and the size of the observed
effects. Also of interest was whether the mnRT and CV measures could
account for unique variance in the placement decisions beyond that
attributable to the VKsize measure alone. The sensitivity of hits as an
alternative to VKsize will also be examined (Shillaw 1996; Harsch and
Hartig 2016).
Test Results

Normality assumptions were met for all three individual measures. The
homogeneity of variance assumption was met for the VKsize and CV
measures, but not for the mnRT (log) scores. As a result, Welch’s analysis
of variance (ANOVA) was used for the omnibus test and the Games–
Howell test for the pairwise comparisons. Bootstrapping was also used to
validate the results of the post hoc comparisons. The results are given in
Tables 9.3 and 9.4.
All the omnibus ANOVA results were statistically significant, except
for the CV. Effect size, as measured by η2, ranged from a low of .22 for
the mnRT measure to a high of .56 for the grammar test. The effect size
for the composite VKsize_mnRT measure (.37) was slightly smaller than
the individual VKsize (.40) or hits (.41) results, though none of these
differences were statistically significant. The magnitude of the two vocab-
ulary size measures was nearly twice that of the mnRT measure (.22), a
difference that is statistically significant.
Significant pairwise comparisons and effect sizes are reported in
Table 9.4. The grammar results were significant for all the comparisons,
with effect sizes of d > 1.5. These results are not included in the table.
The VKsize, mnRT, composite VKsize_mnRT, and listening tests dis-
criminated between all the placements levels, except the lower and upper
intermediate groups. The effect sizes for the respective comparisons all
reached and exceeded the threshold of 1.0, which is considered strong

Table 9.3 Sydney language program study. One-way ANOVAs for individual and
composite lexical facility measures and placement test scores as discriminators of
placement levels

                F         η2     CI for η2
VKsize          18.35**   .40    .27, .52
Hits            18.98**   .41    .27, .53
mnRT             7.96*    .22    .15, .29
CV                .66     .02    −.00, .07
VKsize_mnRT     16.02**   .37    .29, .45
Listening       26.47**   .49    .36, .60
Grammar         34.60**   .56    .41, .67

Note: df (3, 83) for all tests; *p < .05; **p < .001.
Table 9.4 Sydney language program study. Significant post hoc pairwise compari-
sons of the lexical facility measures and listening test

                                        Mean difference   d      CI for d
VKsize
  Elementary and lower intermediate     15.70**            .99   .34, 1.65
  Elementary and upper intermediate     19.78**           1.19   .57, 1.82
  Lower intermediate and advanced       17.17**           1.63   .89, 2.36
  Upper intermediate and advanced       13.08*            1.04   .41, 1.67
  Elementary and advanced               32.87**           2.13   1.38, 2.94
Hits
  Elementary and lower intermediate     15.70**           1.35   .17, .88
  Elementary and upper intermediate     19.78**           1.54   .88, 2.19
  Elementary and advanced               22.23**           1.71   .99, 2.43
mnRT(a)
  Elementary and lower intermediate     389*               .88   .23, 1.53
  Elementary and upper intermediate     347****            .80   .20, 1.39
  Elementary and advanced               302***            1.22   .54, 1.90
CV
  (No significant differences)
VKsize_mnRT
  Elementary and lower intermediate     .890**            1.11   1.17, 2.67
  Elementary and upper intermediate     .940*             1.82   1.13, 2.50
  Lower intermediate and advanced       .632**            1.34   .63, 2.04
  Upper intermediate and advanced       .567**            1.07   .44, 1.67
  Elementary and advanced               1.51***           1.90   1.15, 2.60
Listening
  Elementary and lower intermediate     19.84**           1.19   .52, 1.86
  Elementary and upper intermediate     25.10***          1.52   .87, 2.18
  Lower intermediate and advanced       20.74***          1.73   .98, 2.47
  Upper intermediate and advanced       15.47**           1.16   .52, 1.79
  Elementary and advanced               40.58***          2.67   1.82, 3.52

Note: Games–Howell test significant at *p < .05; **p < .01 (two-tailed);
***p < .001; ****p < .10; (a) raw values given, contrast calculated on mnRT(log);
significance levels assume unequal variances. VKsize, correction for guessing
scores (hits − false alarms); mnRT, mean response time in milliseconds; CV,
coefficient of variation (SDMeanRT/Mean RT); d, Cohen’s d with Hedges’s correction
for unequal sample sizes (Lenhard and Lenhard 2014); CI, BCa 95% confidence
interval.

(Plonsky and Oswald 2014). The listening comparisons yielded the larg-
est effect sizes, with four of the five comparisons at d > 1.5. The hits and
the mnRT were significant for the comparisons involving the elementary
group and the low intermediate, high intermediate, and advanced groups.
The mnRT significance values were borderline and the accompanying
effect sizes lower. The CV omnibus test was not significant, so there were
no pairwise comparisons to consider.
9.4 Findings for Study 4: Sydney Language Program Placement
This study evaluated the predictive validity of the lexical facility mea-
sures as placement tools at an Australian English-language school.
VKsize and mnRT predicted placement decisions with some success.
The measures had strong correlations with the respective placement
tests and the overall placement decisions. Both discriminated among
placement levels in a pattern similar to (but not quite as sensitive as) the
in-house measures. The CV, in contrast, showed no sensitivity to place-
ment-level differences. Of central interest was whether the composite
VKsize_mnRT would be a more sensitive measure of level differences
than VKsize alone. There was no difference between the two in terms of
discriminating between levels and little difference in the magnitude of
the effect sizes.
The effect sizes can be compared with other placement studies using
the untimed Yes/No Test format. Clark and Ishida (2005) reported only
a small Cohen’s d effect size of .4, while Harsch and Hartig (2016)
reported an R2 value of .35, which just meets the criterion for a strong
effect. Lam (2010) did not report effect sizes or the statistics needed to
calculate them. These findings also mirror results in Harsch and Hartig
(2016), who found that the performance on the untimed Yes/No Test
accounted for no additional variance in English L2 listening and reading
test scores than C-Test scores. However, the findings did not support the
notion that hits alone are more sensitive than the corrected-for-guessing
VKsize scores.
Considered separately, the program tests were somewhat more sensi-
tive than the lexical facility measures. Program tests that test a wider
range of language skills have been developed and regularly modified for
placement purposes, and provide the standard against which the lexical
facility measures are evaluated. Accordingly, it is expected that they will
be more sensitive than the lexical facility measures. However, the VKsize
scores, at least, provide a reasonable approximation of the placement-­
level differences detected by the program measures.

9.5 Study 5: Singapore Language Program Study
Study 4 showed that the lexical facility measures were reasonable corre-
lates of a placement test used in a language school in Australia. Students
at the school were studying English for a variety of reasons, with a sub-
stantial number planning to pursue university study in Australia. This
study examines the sensitivity of the three lexical facility measures to
placement levels in an English-language school in Singapore. The partici-
pants came from Southeast Asia and overall had lower English proficiency
than the Sydney cohort.

Setting and Participants

The participants were university-age students who had just begun study
in a commercial language school in Singapore. A total of 56 students
(47% females) from China (n = 28), Vietnam (n = 17), and Malaysia (n =
11) participated as volunteers.

Materials and Procedures

Two tests were administered to each participant. Both tests consisted of


64 test words (16 items multiplied by four levels) and 32 pseudowords.
The first test comprised VLT items from the 2K, 3K, 5K, and 10K fre-
quency bands. The second test included items from the BNC’s 1K, 3K,
5K, and 9K levels. The tests were modified versions of those used in
Study 4. Intact groups for the respective placement levels were used as the
criterion measures.
Program staff in Singapore administered the testing in the language
school’s computer room with the assistance of the classroom instructors.
Participants were tested as class groups as part of a class activity, apply-
ing the same procedures described previously. The test was administered

on LanguageMAP, an online language testing program available at www.languagemap.com. The instructors provided login and password information and directed the students to read the online instructions.
The testing format was also explained orally in both Chinese and
Vietnamese, and the students were monitored for understanding.

9.6 Study 5 Results


Preliminary Analysis Both tests were of satisfactory reliability, with Cronbach's alpha values from .87 to .95. Item responses faster than 300 milliseconds were first removed as too fast to reflect an actual decision; such responses are likely performance errors arising from inadvertent keystrokes and similar response-external factors, and they accounted for less than 0.1% of the total responses. Item response times for the correct hits beyond the 3 SD cut-off were then identified. These accounted for less than 2% of the correct hits across all the participants and were left intact for the analysis.
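As a rough sketch (assuming each participant's correct-hit response times sit in a plain list of milliseconds; the function and variable names are illustrative, not from the study's materials), the two-step screening just described amounts to:

    import statistics

    def screen_rts(rts_ms, floor=300, sd_cut=3.0):
        # Step 1: drop responses faster than the 300 ms floor.
        kept = [rt for rt in rts_ms if rt >= floor]
        # Step 2: flag (but retain) hits beyond the 3 SD cut-off,
        # mirroring the text, which identifies them without removal.
        m, sd = statistics.mean(kept), statistics.stdev(kept)
        flagged = [rt for rt in kept if abs(rt - m) > sd_cut * sd]
        return kept, flagged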

Performance on the two tests across the four program levels is pre-
sented as standard scores (z score plus 5) in Fig. 9.2. The two tests differed
slightly in the range of frequency levels used. The BNC set had higher overall frequency, as it included the 1K level as the highest and the 9K level as the lowest, compared with the respective 2K and 10K levels in the VLT set. The patterns of the VKsize scores were highly consistent across the two versions, moving higher by placement level. The only exception was the preintermediate group's performance on the VLT, which was relatively higher than on the BNC counterpart. The mnRT results improved across the placement levels in a nearly identical manner. Given this consistency, a combination score averaging the scores on the two tests was used to increase power.
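The standard-score scaling and the combination score can be sketched as follows, under the assumption of simple per-participant score lists (names are illustrative):

    import statistics

    def z_plus_5(scores):
        # Standardize, then shift by 5 so all plotted values are positive.
        m, sd = statistics.mean(scores), statistics.stdev(scores)
        return [(x - m) / sd + 5 for x in scores]

    def combine_versions(vlt, bnc):
        # Average each participant's standardized VLT and BNC scores.
        return [(a + b) / 2 for a, b in zip(z_plus_5(vlt), z_plus_5(bnc))]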
The overall false-alarm rate was 26%, ranging from a high of 32%
in the intermediate group to just under 20% in the EAP group. See
Table 9.4. The difference of 12% between the two groups was not

Fig. 9.2 Singapore language program levels. Standardized scores (z + 5) for the lexical facility measures (VKsize, mnRT, and CV) for the VLT and BNC test versions, by placement level (elementary, preintermediate, intermediate, and English for Academic Purposes)

statistically significant: Kruskal–Wallis test, χ2 = 4.87, p = .254. This


rate compares with an overall false-alarm rate of 15% for the Sydney
group.
As in the previous studies, there was little evidence of a systematic
trade-off between speed and accuracy. The VKsize scores and inverted
mnRTs correlated at r = .44, p < .001, indicating that participants with
larger vocabulary sizes also tended to be faster, but that other factors were
also at play.
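The trade-off check amounts to correlating accuracy with inverted speed. A minimal sketch, assuming SciPy and using negation as one simple way to invert mnRT so that faster responses score higher:

    from scipy.stats import pearsonr

    def speed_accuracy_check(vksize, mnrt_ms):
        inverted = [-rt for rt in mnrt_ms]  # faster = higher
        r, p = pearsonr(vksize, inverted)
        # A positive r means larger vocabularies go with faster responses.
        return r, p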

Descriptive Results

The descriptive statistics for the VK, mnRT, and CV scores for the
Singapore language program levels are given in Table 9.5.
Overall, the scores were lower than those in the Sydney placement
study. The highest proficiency group, EAP, had a mean of almost 50%,
which was comparable to the lower intermediate group in the Sydney
study. Overall, the mean VKsize scores here were 10–15% less than the
corresponding level in the Sydney study. There was less of a difference
for the mnRT scores, with the two highest proficiency groups in
Sydney and Singapore being very similar in this regard, both around

Table 9.5 Singapore language program study. Means, standard deviations, and
confidence intervals for the lexical facility measures for the four Singapore lan-
guage program levels
False alarm VKsize mnRT CV
M SD M SD M SD M SD
Singapore [95% CI] [95% CI] [95% CI] [95% CI]
Elementary 28.88 20.35 22.42 13.70 2083 569 .610 .123
n = 12 [18.98, 38.85] [14.85, 30.25] [1725, 2377] [.541, .568]
Preintermediate 24.19 15.89 30.70 11.74 1888 456 .644 .094
n = 18 [16.05, 32.33] [24.14, 36.46] [1697, 2103] [.602, .687]
Intermediate 32.33 19.47 34.28 15.99 1565 475 .568 .131
n = 15 [23.41, 41.25] [27.03, 42.22] [1328, 1789] [.505, .624]
EAP 18.97 11.16 49.42 13.91 1318 344 .565 .128
n = 11 [8.56, 29.39] [36.50, 57.84] [1125, 1531] [.487, .641]
Overall 26.35 17.41 33.56 16.15 1731 533 .601 .016
N = 56 [21.68, 31.02] [29.24, 37.89] [1588, 1774] [.569, .633]
Composite score VKsize_mnRT
Elementary 4.32 .651
[3.99, 4.66]
Preintermediate 4.76 .584
[4.49, 5.03]
Intermediate 5.17 .819
[4.78, 5.63]
EAP 5.87 .661
[5.47, 6.24]
Note: EAP, English for Academic Purposes; VKsize, correction for guessing scores
(hits - false alarms); mnRT, mean recognition time in milliseconds; CV,
coefficient of variation (SDMeanRT/MeanRT); CI, BCa confidence intervals; VLT,
Vocabulary Levels Test; BNC, British National Corpus.

1300 milliseconds. However, the lower proficiency Sydney groups


were approximately 200 milliseconds faster on average. In Study 1, the preuniversity group had an overall mean of 1600 milliseconds, comparable to the overall mean of 1540 milliseconds for the Sydney study and 1730 milliseconds for the Singaporean study. The CV (around .60) for the Singaporean group was also much higher than that for the Sydney group (around .48), and the latter itself was higher than that in the proficiency and entry standard studies. The mean differences are tested for significance below. The CV values were not very

Fig. 9.3 Singapore language program study. Standardized scores (z + 5) for the lexical facility measures (VKsize, mnRT, and CV) for the combined test, by level (elementary, preintermediate, intermediate, and English for Academic Purposes)

informative, with the preintermediate group having a higher score (less


consistency) than the elementary group, and the intermediate and EAP
groups being nearly identical. As was the case with the Sydney data, the
CV scores were not included in the composite analysis.
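One plausible reconstruction of the composite VKsize_mnRT values reported in Table 9.5, consistent with the z + 5 scaling used for the figures, is the average of the standardized size score and the standardized inverted speed score. The scaling here is an assumption, not the study's actual code:

    import statistics

    def zscores(xs):
        m, sd = statistics.mean(xs), statistics.stdev(xs)
        return [(x - m) / sd for x in xs]

    def composite_vksize_mnrt(vksize, mnrt_ms):
        z_size = zscores(vksize)
        z_speed = zscores([-rt for rt in mnrt_ms])  # invert: faster = higher
        # Assumed scaling: mean of the two z scores, shifted by 5.
        return [(a + b) / 2 + 5 for a, b in zip(z_size, z_speed)]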
The program-level differences are presented visually in Fig. 9.3, where
the lexical facility measures are presented as standardized scores.
As with the previous studies, the VKsize and mnRT scores consistently
increase as a function of increasing language proficiency.

Sensitivity of the Lexical Facility Measures to Program-Level Differences

The mean-level differences are tested for statistical significance and effect
size. Although the level sizes are small, all three measures met normality
and variance assumptions. The mean-level differences observed for
the individual measures and the composite VKsize_mnRT were tested
for statistical significance in separate one-way ANOVAs, with program
level as the independent variable and scores as the dependent variable.

Table 9.6 Singapore language program study. One-way ANOVAs for individual
and composite lexical facility measures as discriminators of program levels
df (3,52) F η2 CI for η2
VKsize 7.96** .31 .12, .50
Hits 4.61* .21 .04, .40
mnRT 6.38** .27 .14, .46
CV 1.56 .08 −.03, .11
VKsize_mnRT 11.01** .39 .29, .45
Note: *p < .01; **p < .001.

The results are reported in Table 9.6. The group means in the post hoc analysis were bootstrapped to provide more robust statistics for the small sample sizes.
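In outline, the omnibus test and effect size in Table 9.6 correspond to the following sketch (SciPy for the ANOVA; η2 computed as between-groups sum of squares over total sum of squares; the bootstrap step is omitted here):

    from scipy.stats import f_oneway

    def anova_eta2(*groups):
        f, p = f_oneway(*groups)              # omnibus one-way ANOVA
        scores = [x for g in groups for x in g]
        grand = sum(scores) / len(scores)
        ss_total = sum((x - grand) ** 2 for x in scores)
        ss_between = sum(len(g) * ((sum(g) / len(g)) - grand) ** 2
                         for g in groups)
        return f, p, ss_between / ss_total    # F, p, eta-squared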
The omnibus analyses for the VKsize, hits, mnRT, and composite
VKsize_mnRT are all statistically significant. The η2 values are around .3 here, compared with .4 in the Sydney study, and signal a moderate-to-strong effect size. The post hoc contrasts testing adjacent-level differences
are reported in Table 9.7.
The VKsize score was significant in comparisons involving the EAP
group and the elementary, preintermediate, and intermediate groups,
respectively. The hits were less sensitive than VKsize, accounting only for
differences between the elementary and intermediate groups and the
­elementary and EAP groups. The mnRT scores were involved in four
significant pairwise comparisons. These were the differences between the
EAP group and the elementary and preintermediate groups, as well as the
preintermediate and intermediate and the elementary and intermediate
comparisons. The composite VKsize_mnRT scores were also sensitive to
four score differences, including all the EAP comparisons and the differ-
ence between the elementary and intermediate groups. The effect sizes
approached or exceeded 1.0 for all the significant comparisons, except
the preintermediate and intermediate group difference for the mnRT
score (.73). A comparison of the individual VKsize scores with the com-
posite VKsize_mnRT scores indicates that the combination of size and
speed provides a more sensitive measure of program level differences than
size alone.

Table 9.7 Singapore language program study. Significant post hoc comparisons
for the lexical facility measures for the four placement levels
Mean difference d 95% CI for d
VKsize
Elementary and EAP 27.00*** 1.94 .95, 2.93
Preintermediate and EAP 18.72** 1.48 .64, 2.32
Intermediate and EAP 15.13* .99 .175, 1.82
Hits
Elementary and intermediate 15.32* .91 .11, 1.71
Elementary and EAP 17.09* 1.01 1.44, 1.88
mnRTa
Elementary and intermediate 518*** .94 .13, 1.73
Preintermediate and intermediate 323** .73 .02, 1.43
Preintermediate and EAP 570* 1.36 .53, 2.19
Elementary and EAP 765*** 1.61 .66, 2.58
CV None
VKsize_mnRT
Elementary and intermediate .852* .92 .11, 1.74
Preintermediate and EAP 1.11*** 1.00 .38, 1.65
Intermediate and EAP .699** 1.00 .38, 1.65
Elementary and EAP 1.55*** 2.36 1.30, 3.42
Note: *p < .05; **p < .01; ***p < .0005 (two-tailed); araw values given, contrast
calculated on mnRT(log), significance levels assume unequal variances. VKsize,
correction for guessing scores (hits - false alarms); mnRT, mean response time
in milliseconds; CV, coefficient of variation (SDMeanRT/MeanRT); CI, BCa 95%
confidence interval.

9.7 Findings for Study 5: Singapore Language Program Levels
This study examined the sensitivity of the lexical facility measures to
program-­level differences in a Singapore language school. Overall, the
Singapore measures were slightly less sensitive to differences than they
were in the Sydney study. Both the VKsize and mnRT measures again
were shown to be sensitive to level differences, with the composite
VKsize_mnRT measure doing the best job. The CV proved to be singu-
larly uninformative for this group of learners.

9.8 Conclusions
The two studies examined the sensitivity of the lexical facility measures to
differences in language school placement levels. The combination of size
and mnRT provided a more sensitive measure of levels than size alone,
supporting a key part of the lexical facility proposal. The CV again pro-
vided little information about level differences, offering further evidence
that its status as a component of the proposed lexical facility construct is
questionable.
The results give no basis for suggesting that the Timed Yes/No Test can
replace the placement procedures used in the two schools. Harsch and
Hartig (2016) arrived at a similar conclusion about the effectiveness of
the Yes/No Test format as a placement instrument when they compared
it with the C-Test, which they concluded was a more sensitive instru-
ment. However, framing the questions in terms of either/or oversimpli-
fies matters. The in-house placement tests from the Sydney study, like the C-Tests, draw on higher-order linguistic and strategic skills that are not tapped by the Yes/No Test and so stand in a complementary relationship with the low-level lexical facility skills captured by the Timed Yes/No Test format.
The evidence does indicate that the size and speed measures provide a reliable and potentially useful tool for identifying learners' proficiency levels. They could in future be used independently or in combination with other measures in the placement process, for example, to screen students before arrival at the university.

Notes
1. A Kruskal–Wallis Test was run to test for the equality of the group false-­
alarm means. There was a significant difference between the groups,
χ2 = 11.07, p < .011. A follow-up Mann–Whitney test of pairs showed
that the only significant difference was between the advanced and the
upper intermediate groups, at U = 153, p < .05, Cohen’s d = .68 (Lenhard
and Lenhard 2014). None of the other mean differences was significant.
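The nonparametric sequence in Note 1 can be sketched as follows (SciPy; the dictionary of group false-alarm rates is an assumed data layout, not the study's materials):

    from scipy.stats import kruskal, mannwhitneyu

    def false_alarm_contrasts(groups):
        # groups: dict mapping level name -> list of false-alarm rates.
        h, p = kruskal(*groups.values())      # omnibus Kruskal-Wallis test
        names = list(groups)
        pairwise = {}
        for i, a in enumerate(names):         # follow-up pairwise tests
            for b in names[i + 1:]:
                u, pu = mannwhitneyu(groups[a], groups[b],
                                     alternative="two-sided")
                pairwise[(a, b)] = (u, pu)
        return (h, p), pairwise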

References
Alderson, J. (2005). Diagnosing foreign language proficiency: The interface between
learning and assessment. New York: Continuum.
Clark, M. K., & Ishida, S. (2005). Vocabulary knowledge differences between
placed and promoted students. Journal of English for Academic Purposes, 4(3),
225–238. doi:10.1016/j.jeap.2004.10.002.
Harrington, M., & Carey, M. (2009). The online yes/no test as a placement
tool. System, 37(4), 614–626. doi:10.1016/j.system.2009.09.006.
Harsch, C., & Hartig, J. (2016). Comparing C-tests and Yes/No vocabulary size
tests as predictors of receptive language skills. Language Testing, 33(4),
555–575.
Lam, Y. (2010). Yes/No tests for foreign language placement at the post-secondary level. Canadian Journal of Applied Linguistics/Revue canadienne de linguistique appliquée, 13(2), 54–72.
Lenhard, W., & Lenhard, A. (2014). Calculation of effect sizes. Retrieved
November 29, 2014, from http://www.psychometrica.de/effect_size.html
Mochida, A., & Harrington, M. (2006). The yes-no test as a measure of receptive vocabulary knowledge. Language Testing, 23(1), 73–98. doi:10.1191/0265532206lt321oa.
Plonsky, L., & Oswald, F. L. (2014). How big is "big"? Interpreting effect sizes in L2 research. Language Learning, 64, 878–912. doi:10.1111/lang.12079.
Shillaw, J. (1996). The application of Rasch modelling to yes/no vocabulary tests.
Swansea: Vocabulary Acquisition Research Group, University of Wales
Swansea.
10
Lexical Facility and Academic Performance in English

Aims

• Examine the sensitivity of the lexical facility measures to academic English grades and overall grade point averages (GPAs) in an Australian university foundation-year program.
• Survey GPA results from other studies.

10.1 Introduction
Vocabulary knowledge (VKsize) and mean recognition time (mnRT)
have been shown to be sensitive indicators of English proficiency differ-
ences across the five studies examined so far. The sensitivity of the two
measures was evident in the sharply defined differences between preuni-
versity, second language (L2), and first language (L1) users in Study 1,
between university entry standards in Studies 2 and 3, and the language
program levels in Studies 4 and 5. Of interest throughout has been the
construct validity of lexical facility as an account of L2 vocabulary skill
and the usefulness and reliability of the Timed Yes/No Test as a measure-
ment instrument.


The research discussed in this chapter examines the measures as predic-


tors of individual differences in performance in two domains of academic
English. Study 6 investigates the lexical facility measures as predictors
of English for Academic Purposes (EAP) course grades in an Australian
­university foundation-year course, and Study 7 as predictors of grade
point average (GPA) in the same program. In addition, findings from
other studies that have also examined lexical facility as a predictor of
GPA are discussed (Roche and Harrington 2013; Harrington and
Roche 2014b, b).
As in the previous studies, these studies evaluate the sensitivity of the
measures both independently and in combination.

10.2 Study 6: Lexical Facility Measures and Academic English Grades
This study evaluates the three lexical facility measures as correlates of
academic performance in a foundation-year course at an Australian uni-
versity. The measures are based on performance on the Timed Yes/No
Test, which are then correlated with the year-end grade received in a
mandatory EAP course. Two student cohorts were tested that differed as to
when they took the test. The entry group took the test at the beginning
of the first semester of the two-semester course, and the exit group took
it at the end of the second. Entry group performance provides a window
on how well the measures can predict learning outcomes for newly
entered students, while exit group performance establishes the degree to
which the measures correlate with grade outcomes at the end of the
course. In Study 7, the measures are examined and correlated with end-­
of-­year GPAs. In both studies, the effects of individual and combined
measures are examined.

Setting and Participants

The data were collected in the Australian university’s foundation-year


program introduced in Study 3 (Chap. 7). The foundation-year course is

a university preparation course equivalent to Year 12 high school study in


Australia. The course provides both the academic coursework and the
English-language training needed for entry to the university. Students
enter the program with varying levels of English proficiency, with a mini-
mum of IELTS overall 5 required for entry.
The entry group consisted of 72 students (42% female), while the exit
group had 68 (59% female). The participants in both groups came from
North Asia, predominantly from China and Hong Kong, Taiwan, and
Japan. Some of the entry group participants were also part of the IELTS
data reported in Chap. 7. Participation in the study was voluntary.

Material and Procedures

Both groups completed the same version of the written Timed Yes/No
Test. The test words were drawn from the 2K, 3K, 5K, and 10K bands
from the British National Corpus (BNC) from the Lextutor website
(Cobb 2008). Eighteen words were drawn from each level for a total of
72 words and 28 pseudowords, generating a total of 100 items. Second-­
semester EAP grades and GPAs were obtained with permission from the
students’ academic records.
The test was given in groups in a computer-equipped classroom in
the school. It was administered using LanguageMAP, an online testing
program available at www.languagemap.com. The order of presentation
was randomized for each test-taker. The entry group took the test in
February, at the beginning of the Australian academic year. Participants
in the exit group took the test in October. The academic year finished
in early December, with second-semester grades and GPAs for both
groups obtained after that. The academic English classes met three times
a week in both semesters and covered all four academic English skills,
with an emphasis on writing. The course grade for the second semester
is used here. Participating students signed a release-for-access form so
that their course grades and GPAs could be assessed after they had fin-
ished the test. Students were given the option to opt out, but none did.
Otherwise, the testing followed the procedures set out in previous
chapters.

10.3 Study 6 Results


Preliminary Analysis The raw data were first examined for the presence of
factors independent of the research variable effects that might affect the
interpretation of the results. Cronbach’s alpha coefficients of .8–.9 indi-
cated satisfactory instrument reliability for the Timed Yes/No Test. The
individual item recognition times for correct hits were first screened for responses faster than 300 milliseconds. Such responses are too fast to reflect an actual decision and indicate performance errors arising from
inadvertent keystrokes and similar response-external factors. There were
only a handful, and they were removed from the data set. The remaining
recognition times were then screened for outliers. Item response times for
the correct hits beyond the 3 standard deviation (SD) cut-off were then
identified. These accounted for just over 1% of the correct hits across all
the participants and were left intact for the analysis.
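The reliability coefficient reported above follows the standard Cronbach's alpha formula, sketched here with an assumed participants-by-items matrix of 0/1 responses (an illustrative layout, not the study's materials):

    import statistics

    def cronbach_alpha(matrix):
        # matrix: one row per participant, one 0/1 column per item.
        k = len(matrix[0])
        item_vars = [statistics.variance(col) for col in zip(*matrix)]
        total_var = statistics.variance([sum(row) for row in matrix])
        return k / (k - 1) * (1 - sum(item_vars) / total_var)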

The entry group had a false-alarm rate of over 20%, twice that of the exit group. The difference was significant by a Mann–Whitney test, U = 1290, p < .001, d = .89.1 The entry group also had a lower EAP grade percentage: 66% versus 74% for the exit group, t (143) = 4.57, p < .001, d = .74. Both differences will affect the interpretation of the results.
The moderate, significant correlation between the VKsize and the
inverse mnRT indicated no systematic trade-off between yes/no perfor-
mance and recognition speed: for the exit group, r = .38; for the entry
group, r = .27, both significant at p < .01. See Table 10.2.

Descriptive Statistics

The means and SDs for the lexical facility measures are given in Table 10.1.
Also included are academic English percentage marks, letter grades, and
GPAs. The latter two are not normally distributed, so all three are reported
as medians and ranges. The GPA results are discussed later.

Table 10.1 Australian university foundation-year study. Means, standard devia-


tions, and confidence intervals (CI) for the individual and composite lexical facility
measures, and median and range values for academic grades and GPAs for entry
and exit groups
Entry test group (n = 72) Exit test group (n = 68)
Lexical facility measure Mean SD [95% CI] Mean SD [95% CI]
False-alarm rate 23.02 14.53 [20.36, 11.45 10.57 [8.39,
26.11] 14.50]
VKsize 38.46 15.44 [34.64, 48.51 14.45 [45.92,
42.43] 51.12]
mnRT (msec) 1357 340 [1274, 1182 359 [1098,
1432] 1266]
CV .349 .102 [.436, −.467] .401 .109 [.375, .415]
Composite
VKsize_mnRT 4.79 0.80 [4.62, 4.96] 5.37 0.78 [5.18, 5.54]
VKsize_mnRT_CV 4.62 0.67 [4.49, 4.76] 5.65 0.52 [5.51, 5.79]
Academic score Median Range Median Range
AE percentage mark 65.81 30–88 73.84 50–93
AE grade 4.75 2–7 6.0 4–7
GPA 5.40 2.0–6.8 5.80 2.8–7.00
Note: VKsize, correction for guessing scores (hits - false alarms); mnRT, mean
response time in milliseconds; CV, coefficient of variation (SDMeanRT/Mean RT);
95% CI, 95% confidence interval; AE, academic English; GPA, grade point
average.

The exit group performed better on all three measures, and in turn on
the composite measures. The absence of any overlap between the upper
and lower bounds of the confidence intervals (CIs) for the respective
group means indicates that this difference is statistically significant.2 The
exit group VKsize score of 48% places it in the 6–6.5 IELTS band in the
results reported for Study 3 in Chap. 8. The mnRT of 1182 was a half-band slower than in the IELTS study, where it corresponds to the 5.5
band. The CV measure for both groups was also in the 5–5.5 range. The
CV measure weakly correlated with the criterion variables, a finding simi-
lar to Study 3.
The sensitivity of the lexical facility measures to academic performance
outcomes is examined separately for the two groups.

Table 10.2 Bivariate correlations between lexical facility measures and academic English performance measures for entry
and exit groups
VKsize mnRT CV VK_mnRT AE grade % GPA
VKsize .38** .10 .83** .44** .32** Entry group n = 72
[.17, .54] [−.14, .33] [.73, .89] [.33, .66] [.03, .59]
mnRT .27** .06 .82** .21 .22*
[.08, .46] [−.01, .12] [.76, .88] [.02, −.42] [.00, .42]
CV .03 .39** .03 .08 .02
[−.18, .24] [.17, .59] [−.20, .25] [−.16, .32] [−.22, .18]
VK_mnRT .78** .81** .24 .39** .33*
[.67, .86] [.73, .88] [.23, .51] [.10, .62] [.06, .56]
AE% .59** .33* .01 .57** .67**
[.41, .71] [.09, .54] [.18, −.14] [.36, .72] [.53, .79]
GPA .45* .32** .05 .48** .77**
[.27, .60] [.13, .52] [−.17, .28] [.31, .62] [.64, .86]
Exit group n = 68
Note: *p < .05; **p < 01 (two-tailed); VKsize, correction for guessing scores (hits - false alarms); mnRT, mean response
time in milliseconds; CV, coefficient of variation (SDMeanRT/MeanRT); AE%, academic English grade in percentages; GPA,
overall grade point average for academic subjects.

Sensitivity of the Lexical Facility Measures to Academic English Grades

Studies 6 and 7 differ from the earlier ones in that establishing the relative
sensitivity of the lexical facility measures does not involve discriminating
among group levels. Rather, these studies focused solely on the strength
of the association between individual differences in vocabulary scores and
academic English performance. The bivariate correlations between the
vocabulary measures and academic English grades, as well as GPAs, from
Study 7 are included in Table 10.2.
The bivariate correlations between VKsize and academic English per-
centage and letter grades were significant for both groups, though nearly
twice as large for the exit group. The differences between the groups were
marginally significant: academic English percentage, z = 1.63, p = .051.
The mnRT and academic English measure correlations were statistically
significant and moderately strong for the exit group, while weak and non-
significant for the entry group. The results of the composite VK_mnRT
measure were nearly identical to those of the individual VKsize measure.
The CV scores did not correlate with either measure in either group.
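The marginal group difference just reported (z = 1.63) is consistent with the standard Fisher r-to-z test for two independent correlations, sketched below; note that the p value in the text matches the one-tailed form.

    import math
    from scipy.stats import norm

    def fisher_z_compare(r1, n1, r2, n2):
        z1, z2 = math.atanh(r1), math.atanh(r2)   # r-to-z transform
        se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
        z = (z1 - z2) / se
        return z, 1 - norm.cdf(abs(z))            # z statistic, one-tailed p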

Test Results

The combined contribution of the VKsize, mnRT, and CV variables to


academic performance differences was tested in a hierarchical regression
analysis in which academic English percentage was the criterion variable
and the VKsize, mnRT and CV scores were entered sequentially as pre-
dictor variables. The data met the assumptions for a hierarchical multiple
regression analysis.3 Separate analyses were done for each group and are
reported in Table 10.3.
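In outline, the sequential-entry analysis can be run as in the sketch below (statsmodels, with predictors entered in the fixed order VKsize, mnRT, CV, and the change in R2 recorded at each step; the data layout is assumed):

    import numpy as np
    import statsmodels.api as sm

    def hierarchical_r2(y, steps):
        # steps: ordered list of (name, values) predictor columns,
        # e.g. [("VKsize", vk), ("mnRT", rt), ("CV", cv)].
        y = np.asarray(y, dtype=float)
        cols, out, prev = [], [], 0.0
        for name, x in steps:
            cols.append(np.asarray(x, dtype=float))
            X = sm.add_constant(np.column_stack(cols))  # add intercept
            r2 = sm.OLS(y, X).fit().rsquared
            out.append((name, r2, r2 - prev))           # (step, R2, delta R2)
            prev = r2
        return out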

Academic English Grades: Entry Group The VKsize scores accounted for
an R2 of .191, and the mnRT scores accounted for an additional, nonsig-
nificant amount of variance. The model R2 value was .213 and the β coef-
ficient for VKsize was statistically significant.

Academic English Grades: Exit Group The VKsize and mnRT scores
accounted for a significant amount of variance, though the latter was

Table 10.3 Australian university foundation-year study. Model summary of hier-


archical regression analyses for entry and exit groups using EAP grade percentage
as criterion and VKsize, mnRT, and CV as ordered predictor variables

β t Sig R2 ΔR2
Entry group, all; df (3, 68)
VKsize .233 3.71 .001 .190 .191***
mnRT◊ 5.25 .344 .732 .193 .003
CV 10.86 1.31 .194 .213 .020
Entry group, 20% trim; df (3, 30)
VKsize .445 2.67 .012 .149 .149**
mnRT◊ .241 1.53 .135 .204 .055
CV .304 1.58 .073 .286 .082
Exit group, all; df (3, 64)
VKsize .518 5.07 .001 .349 .349***
mnRT .245 2.20 .031 .378 .029*
CV .158 .158 .145 .398 .020
Exit group, 20% trim; df (3, 54)
VKsize .435 3.63 .001 .302 .302***
mnRT .319 2.49 .016 .358 .056**
CV .115 1.32 .192 .379 .021

Note: *p < .10; **p < .05; ***p < .001 (two-tailed); VKsize, vocabulary
knowledge; mnRT, log mean response time; CV, coefficient of variation; β,
standardized beta coefficient; ΔR2, change in total R2 (shaded cell is total
variance accounted for by the model); df, degrees of freedom for model 3
ANOVA.

only significant at p < .10. The CV accounted for no variance in the


model. The total model R2 value was .349. The β coefficients for VKsize
and mnRT were statistically significant. A secondary analysis was also
done in which all participants with a false-alarm rate exceeding 20% were
excluded from the analysis. For the entry group, the trim resulted in a
somewhat better total R2 value for the model from .213 in the original
analysis to .286 in the trimmed analysis. In the exit group, the trim
resulted in a slightly smaller R2 value, from .398 to .379.

10.4 Findings for Study 6: Lexical Facility and Academic English Grades
The relationship between the VKsize and mnRT measures and academic
English grades was much stronger for the exit group. In this group,
VKsize accounted for 35% of the variance, with mnRT also accounting
for a small amount of unique variance at a more liberal level of statistical
significance (p < .10). The CV measure showed little sensitivity to grades
in either group. In summary, the results indicate that the lexical facility measures, when administered at entry, are poor predictors of end-of-course grades in this setting, but are moderately strong correlates of grades when administered at the end of the course.
It is unclear how the overall academic superiority of the exit group
affects this outcome. As with the previous studies, trimming the data
sets of larger false-alarm rates does not materially affect the pattern of
findings.

10.5 Study 7: Lexical Facility and GPA


The GPAs of the foundation-year students taking part in Study 6 were
also collected. The descriptive and bivariate correlation results were
reported in Tables 10.1 and 10.2. The findings mirror those of the aca-
demic English grades study, though the magnitude is somewhat smaller
for the GPA data.

Results: The Measures as Predictors of GPA

The correlations between the lexical facility measures and GPA were
stronger for the exit group, with the VKsize, mnRT, and VK_mnRT
measures all showing significant, moderate correlation with academic
performance. The CV measure did not correlate with GPA. The differences in
the correlations between the groups were not statistically significant.
Hierarchical multiple regression analyses were again performed to assess
the combined effect of the VKsize, mnRT, and CV measures on GPA. The
data met the assumptions for the analysis.

GPA: Entry Group The VKsize scores accounted for an R2 value of .10,
significant at p < .01. Neither the mnRT nor the CV variable accounted
for additional unique variance. The overall model accounted for 12% of
the total variance. The standardized β coefficient for VKsize was .284,
t = 2.29, p < .05.

GPA: Exit Group The VKsize scores were entered as the first step and
accounted for a ΔR2 value of .20, significant at p < .001. The mnRT
scores were then added, accounting for a small amount of additional
unique variance, ΔR2 = .042, statistically significant only at the more liberal p < .10; at conventional levels, then, recognition time did not account for a unique amount of significant variance in GPA beyond the VKsize measure. The total model
accounted for R2 = .243. The β coefficients for VKsize and mnRT were
statistically significant: VKsize: β = .385, t = 3.36, p < .01; mnRT:
β = .224, t = 1.79, p < .10.

10.6 Findings for Study 7: Lexical Facility and GPA
The VKsize measure accounted for a moderate amount of variance in
GPA for the exit group and a smaller portion for the entry group. As was
the case with the academic English grades, the lexical facility measures
weakly predict GPA for the entry group and are correlates of GPA for the
exit group. The weaker associations evident in the GPA results may be
because GPA reflects both language and academic skills, in contrast to
the language skill focus of the academic English grades.

10.7 Other GPA Studies


The relationship between the lexical facility measures and GPA was also
examined in three studies carried out in English-medium university pro-
grams in Oman (Roche and Harrington 2013; Harrington and Roche
2014a, b). The setting differs substantially from the Australian university
foundation-year study. English in Oman serves as a lingua franca used for

specific ends (e.g., international business, study) and in limited societal


spheres. As in many developing economies, the Omani government has
introduced English-medium universities in a bid to develop what is seen
to be a more globally competitive workforce (Harrington and Roche
2014b). Both students and academic staff often struggle in these settings
to develop the English-language proficiency needed for academic success,
and the nexus between proficiency and academic outcomes is a major
concern for staff, administrators, and political leaders alike.
The research compared the lexical facility measures as correlates of
first-semester GPA and academic writing skill (Roche and Harrington
2013), and both academic reading and writing skill (Harrington and
Roche 2014b). Roche and Harrington (2013) found that VKsize and
mnRT both contributed to GPA differences among students in an Omani
college of applied sciences (N = 70). The effect size, however, was small,
with VKsize accounting for 7% of the variance and mnRT accounting
for 8%. As with most of the studies here, the CV measure did not cor-
relate with GPA at all. However, when a measure of academic writing (a
mock IELTS essay) was entered into the model initially, writing scores
accounted for 16% of the GPA variance, mnRT an additional (signifi-
cant) 8%, and VKsize less than 1% (not significant). The composite
VKsize_mnRT result was identical to the individual mnRT result.
Harrington and Roche (2014b) also compared the lexical facility mea-
sures with academic writing and reading as predictors of GPA in another
Omani university (N = 174). When considered in isolation, both VKsize
and mnRT were statistically significant, accounting for about 17%
of the total variance. Furthermore, when writing, reading, VKsize, and
mnRT scores were entered sequentially into a regression model, only
writing (25%) accounted for significant variance in the GPA results.
Finally, as part of the same study, Harrington and Roche (2014a) explored
the link between the lexical facility measures and GPA as a function of
field of study, analyzing this relationship separately in four faculties:
humanities, computing, business, and engineering (N = 280). Figure 10.1
shows the standardized scores (mean of z scores plus 5) for the VKsize,
mnRT, and CV scores. The mnRT and CV measures were inverted so
that the higher values were faster and more consistent.

Fig. 10.1 Oman university GPA study. Standardized VKsize, mnRT, and CV scores (z mean + 5) by faculty: humanities (n = 143), computing (n = 51), business (n = 54), and engineering (n = 32)

The results show noticeable variability across the measures and facul-
ties; however, the differences were not statistically significant, apart from
the VKsize difference between the humanities and computing groups.
For the total group, only VKsize accounted for any significant amount of
unique variance in GPA (R2 = .13, p < .001). Individual regression analy-
ses by faculty showed some differences across the four faculties. On its
own, VKsize accounted for 25% of GPA variance in the humanities fac-
ulty (R2 = .25, p < .001), nearly double that of the others; the mnRT did
not account for any unique variance. The results for the engineering fac-
ulty were unusual in that, somewhat remarkably, the mnRT measure
accounted for over 40% of the variance.
Overall, the language proficiency of the participants was lower than
most of the English users examined in the studies presented in this book.
The effect of L1 orthography on the Timed Yes/No Test format may also be
a contributing factor. Fender (2008) reported that Arabic L1 readers of
English demonstrated poorer English word recognition skills than
proficiency-­matched learners from Chinese, Korean, and Japanese L1
backgrounds, a finding he attributed to distinctive aspects of the Arabic
script.

10.8 Conclusions
This chapter examined the lexical facility measures as predictors of per-
formance in academic English settings. The measures were related to
individual differences in EAP course grades and program GPA, and as
such complement the previous chapters which examined group
­differences. Study 6 investigated the lexical facility measures as predictors
of semester-end course grades in an EAP course in an Australian univer-
sity foundation-year program. Study 7 examined the relationship between
the three measures and overall GPA in the same cohort. Other studies
examining GPA in English-medium university programs were also dis-
cussed. The results were in some contrast to the earlier studies. VKsize
was again shown to be the most influential factor in accounting for grade and GPA differences, with mnRT also playing a role, albeit a smaller one than in the earlier findings. It accounted for a significant amount of variance
beyond VKsize, but only when using a more liberal p level.
The VKsize and mnRT measures have measurable effects on both
course grades and GPA in the participants in the studies examined here.
However, it is also evident from the Omani findings that measures incor-
porating higher-order skills such as reading and writing are more infor-
mative indicators of individual proficiency differences than the lower-level
lexical facility processes alone. This result is similar to the placement find-
ings reported in Chap. 9. The effect of L1 orthography might also play a
role.

Notes
1. The false-alarm data do not meet normality assumptions, as some partici-
pants have few or no false alarms.
2. Independent t-tests showed that all of the differences were statistically
significant (two-tailed): for the VK scores: t (143) = 4.19, p < 0.001,
d = 0.69; for mnRT: t (143) = 2.93, p = 0.004, d = 0.49; and for CV: t
(143) = 3.03, p = 0.003, d = 0.51.
3. Namely independence of observations (residuals), no outliers, noncol-
linearity, and normally distributed residuals.

References
Cobb, T. (2008). The Compleat Lexical Tutor. http://www.lextutor.ca/
Fender, M. (2008). Spelling knowledge and reading development: Insights from
Arab ESL learners. Reading in a Foreign Language, 20(1), 19–42.
Harrington, M., & Roche, T. (2014a). Word recognition skill and academic
achievement across disciplines in an English-as-lingua-franca setting. In
U. Knoch (Ed.), Papers in Language Testing, 16, 4.
Harrington, M., & Roche, T. (2014b). Identifying academically at-risk students
at an English-medium university in Oman: Post-enrolment language assess-
ment in an English-as-a-foreign language setting. Journal of English for
Academic Purposes, 15, 34–37.
Roche, T., & Harrington, M. (2013). Recognition vocabulary knowledge as a
predictor of academic performance in an English as a foreign language set-
ting. Language Testing in Asia, 3(1), 1–13. doi:10.1186/2229-0443-3-12.
11
The Effect of Lexical Facility

Aims

• Summarize the empirical results from Chaps. 6, 7, 8, 9, and 10.


• Highlight key findings.

11.1 Introduction
This chapter summarizes the findings from Chaps. 6, 7, 8, 9, and 10.
Seven studies have evaluated the sensitivity of size, speed, and consistency
to differences in proficiency and performance in domains of academic
English. Throughout the book, this sensitivity has been characterized as
how well the measures discriminate between the criterion levels and,
more importantly, the relative magnitude of these differences, both indi-
vidually and in combination. Of particular interest is the degree to which
composite measures provide a more sensitive measure of the observed
differences than vocabulary size alone.


11.2 Sensitivity of Lexical Facility Measures by Performance Domain
Table 11.1 summarizes the group means for the individual lexical facility
measures, vocabulary size (VKsize), mean recognition time (mnRT), and
coefficient of variation (CV). Also included are the hits, which are the
percentage of words recognized.
The different groups, settings, and lack of a single, independent measure of proficiency make direct comparisons across the studies impossible—

Table 11.1 Summary of means (M) and standard deviations (SD) for VKsize, hits,
mnRT, and CV measures for Studies 1–5
VKsize Hits mnRT CV
Study n M SD M SD M SD M SD
Study 1: University groups
Preuniversity 32 34 27 55 10 1656 332 .447 .086
L2 university 36 71 16 77 10 963 203 .361 .087
L1 university 42 85 10 91 7 777 202 .247 .071
Study 2: Entry standards
IELTS 6.5 54 57 16 76 12 1444 417 .434 .088
IELTS 7+ 25 73 12 84 7 1280 299 .416 .092
Malaysian 17 70 18 87 11 975 205 .458 .121
Singaporean 19 85 12 93 8 899 193 .432 .114
L1 English 15 85 10 93 6 960 228 .347 .104
Study 3: IELTS band scores
5 30 35 11 56 11 1342 443 .443 .122
5.5 169 40 12 59 12 1139 214 .392 .102
6 72 47 13 68 12 1040 214 .378 .111
6.5 42 59 13 72 10 1032 131 .356 .112
7+ 31 72 10 81 10 861 111 .329 .119
Study 4: Sydney
Elementary 21 37 19 55 16 1854 548 .462 .057
Lower intermediate 19 52 11 71 7 1464 272 .486 .072
Upper intermediate 26 56 14 73 7 1506 318 .494 .104
Advanced 19 69 10 77 9 1326 242 .495 .096
Study 5: Singapore
Elementary 12 22 14 51 20 2083 569 .610 .123
Preintermediate 18 31 12 55 10 1888 456 .644 .095
Intermediate 15 34 16 67 13 1565 475 .568 .132
EAP 11 59 14 68 12 1318 344 .565 .128

indeed, the lexical facility measure represents one such single measure.
However, several generalizations can be made. The VKsize scores and, to a lesser extent, hits are the most consistent in separating the proficiency levels across all the studies. The IELTS 6.5 band scores were nearly identical in Studies 2 (M = 57) and 3 (M = 59).1 The adjacent 7+ band (7 and 7.5 combined) yielded similar means in Study 2 (M = 73) and Study 3 (M = 72). These are close to the L2 university score in Study 1 (M = 71) and the Malaysian group in Study 2 (M = 70). All of this
suggests that VKsize is a reliable measure of vocabulary skill. The mnRT
results are much less consistent. The 6.5 group in Study 2 has a much
higher mnRT (M = 1444) than the 6.5 group in Study 3 (M = 1031).
The mnRTs for the 7+ levels were also different, for Studies 2 (M = 1280)2
and 3 (M = 861). The amount of recognition time variability under-
scores a difficulty with its use as a measure. This is discussed below. The
two language program groups were much slower than the university
groups, despite the similarity of the VKsize scores for the advanced group in Study 4 (M = 69) and the English for Academic Purposes (EAP) group in Study 5 (M = 59) to the IELTS 6.5 band scores in Studies 2 and 3.
lags behind size (Zhang and Lu 2013). Aside from Study 1, the CV
scores were only sensitive in instances where the proficiency difference
between the groups was great, as in that between IELTS band scores of
5 and 7+.
Table 11.2 summarizes the effect sizes for the individual and compos-
ite measures in the seven studies. Only effect sizes (Cohen’s d) for the
significant pairwise comparisons of means are presented. Blank cells indi-
cate that the mean difference did not reach statistical significance. An
effect size can be interpreted in the absence of statistically significant dif-
ferences, but for presentation purposes, only the significant results will be
discussed. See the specific chapters for effect sizes not reported here.
The benchmark used throughout the book for interpreting the magni-
tude of the observed effect size is taken from Plonsky and Oswald’s meta-­
analysis (2014, p. 889). For mean differences between groups, values
around .40 are considered small, around .70 medium, and 1.0 and
beyond large. The authors recommend higher values for within-group
contrasts, namely .60, 1.00, and 1.40, respectively. The comparisons in

Table 11.2 Summary of lexical facility measures’ effect sizes for individual and
composite measures
Range of Cohen's d effect sizes for pairwise comparisonsa
Study VKsize mnRT CV VKsize_mnRT VKsize_mnRT_CV
Study 1: University 1.05–2.68 1.18–4.82 1.03–2.60 1.33–4.11 1.39–5.38
proficiency levels
N = 110
Study 2a: University .86–1.81 1.14–1.86 .71–.94 .73–2.27 .67–2.00
entry standards
study: written test
Study 2b: University .66–2.16 .62–1.72 .72–1.49 .76–2.46 1.00–2.00
entry standards
study: spoken test
N = 132
Study 3: IELTS band .48–2.93 .67–1.60 .42–.88 .65–2.55 .57–2.74
scores
N = 371
Study 4: Australian .99–2.13 .80–1.22 – 1.07–1.90 –
language program
placement
N = 87
Study 5: Singapore .99–1.94 .73–1.61 – .92–2.36 –
language program
levels N = 56
Variance accounted for in hierarchical
regression model
ΔR2 VKsize ΔR2 mnRT ΔR2 CV R2 total
Study 3: IELTS band All .419** .055** .005 .473
scores FA < 20% .474** .028** .008 .511
N = 344 FA < 10% .527** .027** .001 .544
Study 6: EAP grade Entry all .191*** .003 .020 .213
Entry group Entry FA < 20% .149** .055 .082 .286
N = 72 Exit all .349*** .029* .020 .398
Exit group Exit FA < 20% .302*** .056** .021 .379
N = 68
Note: *p < .10; **p < .05; ***p < .001 (two-tailed); VKsize, correction for
guessing scores (hits - false alarms); mnRT, mean recognition time in
milliseconds; CV, coefficient of variation; FA < 20% = only participants with
false-alarm rates less than 20% included in analysis; FA < 10% = only
participants with false-alarm rates less than 10% included in analysis.
aComparisons: Study 1: L2 preuniversity – L2 university – L1 university; Study 2:
IELTS 6 – IELTS 7 – Malaysian – Singaporean – L1 English; Study 3: IELTS overall
band score 5 – 5.5 – 6 – 6.5 – 7+; Study 4: Elementary – lower intermediate –
upper intermediate – advanced; Study 5: Elementary – preintermediate –
intermediate – English for Academic Purposes (EAP; advanced).

Table 11.1 all represent between-group contrasts. Within-group comparisons are made as part of the test of the word frequency assumption, but the nonparametric test in that analysis uses r as the effect size.
The table also contains the effect sizes, in R2, for regression model vari-
ance accounted for by the three measures. Benchmarks for interpreting
r/R2 are .25/.06 for small, .40/.16 for medium, and .60/.36 for large
effects (Plonsky and Oswald 2014, p. 889).
The first five studies examined the sensitivity of the measures to pro-
gram- and test-based group differences. The last two investigated the
measures’ sensitivity to individual differences in English-medium aca-
demic performance. The results of each study are briefly summarized.

University English Group Differences (Study 1, Chap. 6)

Study 1 focused on the sensitivity of the individual and composite mea-


sures to differences between three distinct English proficiency groups. All
three measures (VKsize, mnRT, and CV) were sensitive to differences
between L2 preuniversity students, L2 university students, and first lan-
guage (L1) university students at an Australian university. The effect sizes
for all measures exceeded, in most cases considerably, the benchmark of 1 for a strong effect. The proficiency difference between the preuniversity and
L1 university groups is reflected in very high effect sizes, ranging from
2.68 for the VKsize difference to over 5 for the composite VKsize_
mnRT_CV. The effect sizes for the L2 preuniversity and L2 university
groups were also robust, ranging from 1.70 for the VKsize to 2.60 for the
mnRT comparison. The stronger effect sizes for the VKsize_mnRT_CV
comparisons (1.4–5.4) over the individual VKsize comparisons (1.1–2.7)
support the proposal that a combination of size and speed provides a
more sensitive measure of group differences than size alone. However,
this result was largely due to the strength of the mnRT measure. This was
the only study in which the mnRT effect size was stronger than that of
VKsize.
The study also analyzed group performance as a function of word fre-
quency levels. Hits (percentage of words identified) were used instead of

the VKsize measure, as the latter incorporates false-alarm performance


and is calculated across the entire item set. Hits and mnRT performance
across the progressively lower frequency bands (e.g., 2K, 3K, 5K, and
10K) decreased uniformly in all three groups. All the adjacent-level dif-
ferences were statistically significant, and r values ranged between .41 and
.63 for the hits, and .38 and .64 for the mnRT, all in the medium-to-­
strong range. The CV was less sensitive to frequency-level differences.
Although mean differences mirrored those of the hits and mnRT, only
the 2K–3K difference was statistically significant, and the r-value of .17
indicates no effect.
Study 1 showed that all three measures were stable dimensions of L2
vocabulary proficiency and highly sensitive to group differences. mnRT
and CV measures also accounted for additional variability beyond size
alone. The study also demonstrated the validity of frequency-of-­
occurrence levels as indices of L2 vocabulary learning.

University English Entry Standards and IELTS (Study 2, Chap. 7, and Study 3, Chap. 8)

Studies 2 and 3 examined the sensitivity of the measures to group differ-


ences in English entry standards used in Australian universities. In Study
2, the sensitivity of the measures was examined across five groups of inter-
national students. It comprised entering students with a university mini-
mum IELTS 6.5 overall band score, a combined group of the next two
bands IELTS 7+ (7 and 7.5 combined), Malaysian high school graduates,
Singaporean high school graduates, and a baseline group of L1 students
from English-speaking countries. Unlike Study 1, the four L2 groups did
not represent a fixed order of proficiency, though the IELTS groups were
expected to be different from each other and the baseline L1 English
group was better than all the rest. The motivation for the study was to
assess how well the measures can serve as an independent benchmark to
compare the somewhat disparate groups; all are assumed to share a thresh-
old level of English proficiency but also differ noticeably beyond that. The
study also compared performance on the lexical facility measures in writ-
ten and spoken formats to assess the mode of presentation on test out-
comes, for both group differences and the frequency assumption.

Considering the three measures, the VKsize scores improved on a con-


tinuum of IELTS 6.5 < IELTS 7+ < Malaysian high school < Singaporean
high school < L1 English. This was true for both presentation modes,
though the spoken test scores were consistently lower. In the written ver-
sion, the differences between the IELTS 6.5 group and the other four groups
were statistically significant, with the effect size d values ranging from .86 to
1.81. The Singaporean group was also significantly different from the
IELTS 7+ (d = .66) and the Malaysian groups (d = .95), but was not differ-
ent from the L1 English group. In the spoken test results, the pattern was
the same, except that the Singaporean group was not significantly different
from the Malaysian group, but differed from the L1 English group (d = .89).
The individual mnRTs were less sensitive to group differences in both the
written and spoken test data. For the written version, the groups split into
two. The IELTS 6.5 and 7+ groups did not differ from each other, but were
different from the other three groups (d = 1.14–1.86), who in turn did not
differ from each other. In the spoken test results, the IELTS 6.5 group dif-
fered significantly from the other groups (d = .63–1.73). The L1 English
group was also significantly different from all the other groups (d = .82–1.72).
The CV measure was the least sensitive of the three individual mea-
sures. In the written test results, the L1 English group was significantly
different from the other groups, though effect sizes were in the medium
range (d = .71–.99). The same pattern was evident for the spoken test
data, though the effect sizes were slightly larger (d = .71–1.49). The dif-
ference between the IELTS 6.5 and IELTS 7+ groups was also significant
(d = 1.48).
The most sensitive measure in both the written and spoken formats
was the composite VKsize_mnRT. It discriminated between all groups in
both versions, with the sole exception of the written Singaporean and
L1 English group difference. It also yielded higher effect sizes in seven of
the ten comparisons in the written test results and all (10 out of 10) of
the comparisons in the spoken test results. However, although the d val-
ues were higher, the confidence intervals (CIs) show that the differences
were not statistically significant. The other composite measure VKsize_
mnRT_CV was less informative due to the inclusion of the CV.
The study also analyzed test performance as a function of word fre-
quency levels. As was the case in Study 1, hits and mnRT performance

across the progressively lower frequency bands (e.g., 2K, 3K, 5K, and
10K) decreased uniformly in all three groups. All the adjacent-level dif-
ferences were statistically significant, and r values ranged between .41 and
.63 for the hits, and .38 and .64 for the mnRT, all in the ­medium-to-­strong
range. The CV was less sensitive to frequency-level differences, although
mean differences mirrored those of the hits and mnRT. Only the 2K–3K
difference was statistically significant, and the negligible r-value of .03
indicated no effect.
In summary, the VKsize and mnRT measures were sensitive to group
differences, while the CV was less so. The effect size ranged from moder-
ate to strong depending on the comparison. The composite VKsize_
mnRT was the most sensitive, consistent with the proposal that the
combination of size and speed was better than size alone in characterizing
group differences. It was also evident that the spoken format yielded a
pattern of results similar to the written, though the size scores were lower
and the responses slower.
Study 3 examined the sensitivity of the measures to IELTS overall
band-score differences among students in a preuniversity foundation-­
year course. Scores across five adjacent band-score levels (5–5.5–6–
6.5–7+) were examined. The VKsize score discriminated among all the
IELTS band-score differences, except for the lowest adjacent comparison,
5–5.5. The smallest effect size (d = .48) was between the adjacent 5.5 and
6 levels and the largest (2.93) between the nonadjacent 5 and 7+ levels.
The results for the hits were similar. The mnRT measures discriminated
between all adjacent levels, with effect sizes ranging from .78 for the low-
est comparison (5–5.5) to 1.60 for the largest (5–7+). The CV was
significant only for the most extreme band-score comparisons (d = .42 and .88). The two
composite measures, VKsize_mnRT and VKsize_mnRT_CV, were more
sensitive than the individual VKsize measure. They discriminated between
all the adjacent levels with comparable effect sizes, providing further sup-
port for the lexical facility proposal. The advantage of combining size and
speed was also supported by the regression analysis, where the mnRT
results accounted for additional unique variance in the model beyond
VKsize, though for only about 6% of the variance, compared with 42%
for VKsize.
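The unique-variance finding comes from comparing nested regression models: the R² for a model with VKsize alone against the R² once mnRT is added. A minimal sketch with simulated data (all variable names and effect magnitudes here are invented for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 120
vksize = rng.normal(60, 10, n)                     # hypothetical VKsize scores
mnrt = 1600 - 10 * vksize + rng.normal(0, 80, n)   # ms; faster with larger size
grade = 0.5 * vksize - 0.01 * mnrt + rng.normal(0, 6, n)

# Nested models: size only, then size plus speed.
base = sm.OLS(grade, sm.add_constant(vksize)).fit()
full = sm.OLS(grade, sm.add_constant(np.column_stack([vksize, mnrt]))).fit()
print(f"R2, size only: {base.rsquared:.2f}; added by mnRT: {full.rsquared - base.rsquared:.2f}")
```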
The IELTS band-score results replicate those of the first two studies:
the VKsize and mnRT measures were reliable discriminators of test levels
and accounted for moderate-to-strong effect sizes for these differences,
and the mnRT contributed a unique amount of variance in doing this.
The CV scores were again shown to be less informative than the other
two measures, being sensitive only to the most extreme band-score
differences.

Language Program Placement (Studies 4 and 5, Chap. 9)

Studies 4 and 5 examined the sensitivity of the measures to proficiency as
characterized by placement in English language programs. Study 4 com-
pared the three measures with in-house placement tests at a Sydney
English language school. There was a strong correlation between the lexi-
cal facility measures and the program placement tests, evident both in
discriminating among the four levels and in the overall placement deci-
sions. VKsize scores discriminated between all the program levels (begin-
ner, lower intermediate, upper intermediate, and advanced), with strong
effect sizes (.99–2.13). mnRT discriminated between the elementary and
other levels, but not between the other three. The effect sizes for the sig-
nificant comparisons ranged from .80 to 1.22. The composite VKsize_
mnRT matched the individual VKsize in sensitivity, with slightly higher
effect sizes in most of the respective comparisons. The CV measure
showed no sensitivity to any of the level differences.
Study 5 examined the lexical facility measures as correlates of place-
ment levels (elementary, preintermediate, intermediate, and EAP) in an
English language school in Singapore. The students were similar to those
in Study 4 regarding learning goals but were of lower overall proficiency.
VKsize scores discriminated the EAP group from the other three levels,
with strong effect sizes (.99–1.94), but the three levels themselves did
not differ. The mnRTs discriminated between all three adjacent levels,
with effect sizes ranging from .73 to 1.61. The CV measure was not sen-
sitive to any level differences. The composite VKsize_mnRT also dis-
criminated between all four levels, with effects sizes slightly stronger
than the individual mnRT measure, though the differences were not
statistically significant. The relatively greater sensitivity of the individual
mnRT and composite VKsize_mnRT measures over the individual
VKsize measure provides further support for the lexical facility
proposal.
The results from the two language school studies are consistent with
the first three studies. Vocabulary size and mnRT measures are reliable
discriminators of test levels, with the differences mostly accompanied by
strong effect sizes. The combination of size and speed results in a more
sensitive measure than size alone.

Academic English Performance (Studies 6 and 7, Chap. 10)

Studies 1–5 concerned the sensitivity of the lexical facility measures to
differences in proficiency as used for functional ends, as in evaluating
university entry standards and language program placement. In the last
two studies, the sensitivity of the measures to individual differences in
academic English performance was examined. Study 6 investigated the
measures as predictors of course grades in an English for Academic Purposes
(EAP) course, and Study 7 examined them as correlates of grade point
average (GPA) in the same university preparation program.
Study 6 examined the sensitivity of the measures to individual differ-
ences in academic performance in a preuniversity foundation-year course
in Australia. The measures were correlated with the semester-end grades
for a year-long EAP course. The study also examined two groups that
differed by when they took the Timed Yes/No test: an entry test group
took the test at the beginning of the academic year, and an exit test
group took it at the end, just before receiving the course grade. The entry
group results reflect the strength of the measures as predictors of overall
course performance, while the exit group findings provided a test of how
well the measures correlate more immediately with the end-of-year course
grades. There was a marginally strong correlation between VKsize perfor-
mance and academic English grades (r = .60) for the exit group, and a
moderate correlation (r = .44) for the entry group. The mnRT results had
a small but significant correlation with grades for the exit group (r = .32),
and a nonsignificant correlation for the entry group. The CV did not
correlate with course grades in either group. The VKsize_mnRT correla-
tions for both groups were the same as the respective VKsize scores. A
regression analysis examining the size and speed measures as joint predic-
tors of course grades showed that the VKsize scores accounted for all the
significant variance in academic English grades for the entry group
(20%). In the exit group, VKsize also accounted for most of the variance
(over 30%). The mnRT measure also accounted for a small (but signifi-
cant) amount of variance. In both the regression models on the complete
data set, it accounted for 3% of the variance, though at a more liberal
p-level of .10. For the analysis in which the individual false-alarm rates
were trimmed at 20%, it accounted for 6% at the conventional p < .05
level.
In summary, the VKsize and mnRT measures were more sensitive for
the exit group than for the entry group. A moderately strong correlation
was evident for the exit group between the final grades and VKsize and,
to a lesser extent, mnRT. The mnRT accounted for about 5% of the exit
group grade variance, an amount comparable to the earlier entry stan-
dards, IELTS band scores, and language program studies. There was a
substantial difference between the two groups in academic grades and test
performance that may have had a bearing on the results.
Study 7 explored the link between test performance and program-end
GPAs in the same cohort as in Study 6. Not unexpectedly, the results
mirrored those of the earlier study. For the entry group, there were small
correlations (in the low .3 range) for VKsize, mnRT, and the two com-
bined. The same correlations for the exit group were in the medium range
(.45).
The VKsize and mnRT measures were also considered as predictors of
GPA in tertiary English-medium programs in Oman (Roche and
Harrington 2013; Harrington and Roche 2014a, b). The first study com-
pared the two measures and academic writing skill as predictors of first-­
semester GPAs (Roche and Harrington 2013), and the second included
reading skill, along with writing and the two lexical facility measures, as
predictors of GPA (Harrington and Roche 2014b). Roche and
Harrington (2013) found that VKsize and mnRT accounted for unique
GPA variance in a regression analysis that only included the two measures
as predictors, though the amount of variance (about 10 and 8%, respec-
tively) was relatively low. But they also found that when the measures were
entered in a model that included an academic English writing score, the
two measures accounted for no additional variance. Similarly, Harrington
and Roche (2014b) examined the combined effect of reading skill, ­writing
skill, and the two lexical facility scores as GPA predictors and also found
that academic writing skill was the best overall predictor of GPA. It
accounted for most of the variance in the criterion (27%), but the
other three measures also accounted for a significant amount of variance
(reading, 3%; VKsize, 3%; and mnRT, 2%). When the effects of the
VKsize and mnRT scores were considered independently, both accounted
for a small but significant amount of GPA variance (7 and 9%).
Harrington and Roche (2014a) also found that the sensitivity of the lexi-
cal facility measures as predictors of GPA varied by academic field of
study.
In summary, for the Omani data, the VKsize and mnRT measures
were less sensitive to individual academic grade and GPA differences than
to the group differences examined earlier. This was particularly the case
where they were compared with writing and reading tasks that measure
more global proficiency.

11.3 Key Findings


The studies have sought to establish lexical facility as a context-­
independent index of L2 vocabulary skill sensitive to performance differ-
ences in various academic English domains. There were three closely
related aims of the research. The research sought to

• compare the three measures of lexical facility (VKsize, mnRT, and CV)
as stable indices of L2 vocabulary skill;
• evaluate the sensitivity of these measures individually and as compos-
ites to differences in a range of academic English domains; and, in
doing so,
• establish the degree to which the composite measures combining the
VKsize measure with the mnRT and CV measures provide a more
sensitive measure of L2 proficiency than the VKsize measure alone.

The main findings relative to these aims are now summarized.

Vocabulary Size Is a Sensitive Measure of Proficiency

The VKsize score was the most sensitive individual measure. It was as
good as (Study 1) or better than (Studies 2–4 and 6–7) the mnRT mea-
sure in discriminating between proficiency levels. In the regression mod-
els reported in Studies 3 and 5, VKsize accounted for far greater variance
than mnRT (and of course CV). The effect sizes for the VKsize differ-
ences were consistently strong, whether reflected in Cohen’s d or the R²
value. In the trimmed data set in Study 3, VKsize accounted for over half
the total variance. This finding was not unexpected, given that previous
work on vocabulary size by Laufer, Nation, and their colleagues has
shown that frequency-based vocabulary size measures are a robust cor-
relate of L2 academic performance. The findings strongly replicate the
earlier research.

Mean Response Time Is Also a Sensitive Measure of Proficiency

The mnRT measure also discriminated between the groups across the
studies, though it was slightly less sensitive than the VKsize measure. The d
effect sizes for the significant pairwise comparisons were at a minimum of
medium strength, and most were strong. In Study 1, mnRT had a larger effect
size than VKsize across the L1 and two L2 groups, as well as between the
two L2 groups alone. In the regression analyses, the measure accounted
for 3–5% of the unique variance in the models. The measure was less
informative of differences between English grades and GPAs, although
even here it was sensitive to some of the group comparisons.
The CV Is Less Sensitive Than the VKsize and mnRT Measures

The lexical facility account introduced in Chap. 4 proposed that response
time consistency, as measured by the CV, can be a reliable and informa-
tive index of vocabulary skill and a sensitive measure of proficiency, both
by itself and in combination with VKsize and mnRT. This proposal
received very limited support. In only one study (Study 1) did the CV
mirror the sensitivity of the other two measures. Significant CV effects
were only evident in group comparisons in which level differences were
very distinct, as in the IELTS 6.5 and L1 English groups in Study 2, and
the IELTS 5 and 7+ groups in Study 3. The CV can be considered as an
index of proficiency, but only in somewhat crude terms.

VKsize and mnRT Together Are More Sensitive Than VKsize Alone

The proposal that size and speed together provide a more sensitive mea-
sure than size alone is at the heart of the lexical facility account. This was
supported. The composite measure VKsize_mnRT was generally more
sensitive than VKsize alone. This was evident in both the number of sig-
nificant group comparisons and the relative effect sizes of these differ-
ences. In five of the seven studies, the composite VKsize_mnRT measure
produced a larger effect size than the VKsize measure alone, though
the differences were not always statistically significant. In the regression
studies, mnRT accounted for a significant amount of unique variance
beyond vocabulary size, although the magnitude of the effect was small
(3–6%).
The findings replicate earlier research that demonstrates a reliable rela-
tionship between size and vocabulary speed (Laufer and Nation 2001;
Harrington 2006), and are at odds with Miralpeix and Meara (2010), who
found none. The results indicate that recognition time does provide an
additional, reliable source of information about individual vocabulary
skill. This is the central finding of the research reported here, and it pro-
vides a solid basis for combining size and speed as a measurement
dimension, that is, for lexical facility.
A Frequency-Based Measure Provides a Valid Index of Vocabulary Knowledge

A distinctive feature of the lexical facility account and the vocabulary size
literature more generally is the use of word frequency statistics to estimate
vocabulary size. A basic assumption is that word frequency is a strong
predictor of when a word is learned and the speed with which it is recog-
nized. The findings here and elsewhere (e.g., Milton 2009) show that
frequency levels provide a reliable and informative framework for charac-
terizing vocabulary development that directly relates to performance.
This holds for both written and spoken modes; however, it was evident
that performance on the spoken version was consistently lower. Word
frequency statistics provide an objective, context-independent way to
benchmark L2 vocabulary development.

False-Alarm Rates Are a General Indicator of Proficiency as Well as Guessing (but Might Not Make That Much Difference)

The most distinctive feature of the Yes/No Test format is the use of pseu-
dowords. The self-report nature of the format motivates the inclusion of
these phonologically possible, but meaningless, words as a means to
gauge if the test-taker is guessing. In principle, the false-alarm rate is a
measure of guessing independent of vocabulary size, as estimated from
the hits. In practice, this was not the case. There was substantial variabil-
ity in the false-alarm rates within and across studies, but overall the false-­
alarm rates were a fair reflection of proficiency levels. They were much
higher for lower-proficiency groups and progressively dropped as levels
improved. The mean performance by the lower-proficiency groups was
20% and higher, while for more proficient L2 and L1 groups, it was
under 10%. Differences in false-alarm rates evident across the studies
here, and in other published research, raise the issue of the comparability
of findings across studies. In Studies 3 and 5, secondary analyses were car-
ried out in which the data sets were trimmed for individuals who had mean
false-alarm rates exceeding 20% (Chaps. 3 and 5) and 10% (Chap. 3).
The statistical tests were then run again. The results were very similar to
the original analyses, with the trimmed data sets yielding larger effect
sizes, though the differences were not significant. It was also evident that
the hits by themselves yield a reasonably sensitive measure of vocabulary
knowledge, though not as strong as the VKsize measure. This all suggests
that false alarms may not be necessary for measuring individual perfor-
mance (Harsch and Hartig 2015).
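The trimming procedure itself is a simple filter over test-takers. A minimal sketch, assuming hypothetical participant records that carry a mean false-alarm rate:

```python
# Hypothetical participant records: mean false-alarm rate per test-taker.
participants = [{"id": 1, "fa_rate": 0.05}, {"id": 2, "fa_rate": 0.27},
                {"id": 3, "fa_rate": 0.12}, {"id": 4, "fa_rate": 0.21}]

def trim_by_false_alarms(records, cutoff=0.20):
    """Drop test-takers whose mean false-alarm rate exceeds the cutoff."""
    return [r for r in records if r["fa_rate"] <= cutoff]

print([r["id"] for r in trim_by_false_alarms(participants)])        # 20% cutoff
print([r["id"] for r in trim_by_false_alarms(participants, 0.10)])  # stricter 10% cutoff
```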

Recognition Time Can Be a Messy Measure

The collection of recognition time data and its use as evidence for
underlying knowledge states is typically associated with the laboratory.
In these controlled settings, the focus is on response time variability in
largely error-free performance in which target behaviors are narrowly
defined and technical demands readily met. The research presented
here has examined mean recognition time differences in error-filled per-
formance in more everyday instructional settings. Ensuring optimum
performance, that is, that the test-taker is attending to the task and
working as quickly (and accurately) as possible, is a challenge. A signifi-
cant threat to the reliability of the results is a systematic trade-off in
how quickly and accurately a test-taker responds. Responding very
quickly with many errors, or very slowly with few errors, will render
the results difficult to interpret. There was little evidence of a system-
atic correlation in individual performance between higher accuracy
and slower performance (or vice versa). It is not possible to rule out any
trade-off behavior, but there was no evidence of systematic bias in any
of the individuals or groups studied. All of the studies showed signifi-
cant positive relationships between VKsize and the inverted mnRT, but
the size of the correlations (.2–.5) indicated that other factors were also
at play.
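A basic screen for such trade-off behavior is to correlate test-takers' accuracy with their mean RTs: a strong positive correlation (slower responders being more accurate) would be the warning sign. A minimal sketch with invented per-person summaries (statistics.correlation requires Python 3.10 or later):

```python
from statistics import correlation

# Hypothetical per-person summaries: corrected accuracy and mean RT (ms).
accuracy = [55, 62, 48, 70, 66, 59]
mean_rt = [980, 940, 1100, 870, 905, 1010]

# A strong positive accuracy-RT correlation (slower = more accurate) would
# signal a speed-accuracy trade-off; here the value comes out negative.
print(round(correlation(accuracy, mean_rt), 2))
```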
The variability across the studies is also a concern. The IELTS 6.5
group in Study 2 had a much higher mnRT (M = 1444) than the 6.5
group in Study 3 (M = 1031) despite the VKsize scores being nearly
identical. The mnRT means for both the language program groups
(Studies 4 and 5) are much higher relative to the VKsize scores compared
with the other groups. As noted, this may be due to a relative lag in
development or may reflect different testing conditions. Both studies
were administered by local, on-site staff at the Sydney and Singaporean
schools. All the other studies were carried out by the author or close col-
leagues. Recognition time performance is far more sensitive to differ-
ences in individual motivation and attention, and it is possible that the
importance of the recognition time responses received less emphasis by
the administrators in the language program studies. There was also con-
siderable variation within the other studies, which also varied somewhat
in testing conditions, for example, data collected in a group versus
individually.
While acknowledging these limitations, the results also show that rec-
ognition time on its own, and in combination with size, provides a reli-
able and informative means of characterizing L2 vocabulary knowledge
that is sensitive to proficiency differences in important functional
domains of academic English performance.

11.4 Conclusions
The findings support the key element of the lexical facility proposal,
namely that the combination of size and speed provides a more sensitive
index of differences in L2 lexical skill than size alone. This advantage is
reflected in greater sensitivity to proficiency and performance differences
in a range of academic English domains. In five of the seven studies, the
combination of size and speed resulted in larger effect sizes than for the
VKsize measure alone, whether in the regression models or in the com-
posite scores, although the differences were not always statistically sig-
nificant and further confirmation is needed. The CV was much less
sensitive to proficiency differences, having significant and strong effects
for all the pairwise comparisons only in Study 1. These effects were only
evident when comparing groups where the level differences were highly
distinct, as in the IELTS 6.5 and L1 English groups in Study 2. The use-
fulness of the CV as an index of proficiency remains very much an open
question.
Unique to the testing format used here is the inclusion of pseudowords
to assess guessing. The false-alarm rate provided a somewhat stable
measure of performance, but considerable variability within and between
groups was also evident. The results suggest that word performance (hits)
alone can provide a reasonable measure of vocabulary knowledge without
including false-alarm performance.
The validity of the testing format was also established in the analysis of
performance at the word frequency-of-occurrence levels. The results from
both written and spoken versions showed that word frequency statistics
provide a reliable and robust predictor of outcomes.
The final chapter revisits the original lexical facility proposal in light of
these findings and identifies directions for future research.

Notes
1. The VKsize score is an indirect measure of the individual’s vocabulary size.
A very rough estimate of what a VKsize score of 70 represents as overall
vocabulary size can be calculated by taking 70% of 10,000, which is the
word range sampled in almost all the tests here (see the sketch after these
notes). That would be a mini-
mum of 7000 words. Note this is based on the unlikely assumption that
the false-alarm rate adjusts the hit rate exactly for the actual size. The
individual will also know some words beyond the 10K level, but it will be
a steadily diminishing percentage of these, maybe an additional 1500
words, for a total of 8500 words. This is a rough estimate of the actual size,
and given that only four frequency bands are sampled, one that is closer
to a guess than an estimate. For more precise estimation, the Vocabulary
Size Test, which samples each level from 1K to 15K, is superior (Beglar
2010).
2. Study 2 also had a much slower L1 group (M = 960) compared with the
baseline L1 group in Study 1 (M = 777).
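The arithmetic in Note 1 can be sketched as follows; the function name is invented, and the 10,000-word sampled range and the 1500-word allowance above 10K are taken from the note's own assumptions.

```python
def rough_vocab_estimate(vksize_pct, sampled_range=10_000, beyond_range=1_500):
    """Very rough overall-size estimate from a VKsize score (see Note 1).

    vksize_pct: corrected score as a percentage of the sampled 10K range.
    beyond_range: assumed extra words known above the sampled range.
    """
    within = sampled_range * vksize_pct / 100  # e.g. 70% of 10,000 = 7,000
    return within + beyond_range

print(rough_vocab_estimate(70))  # 8500.0, closer to a guess than an estimate
```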

References
Beglar, D. (2010). A Rasch-based validation of the vocabulary size test. Language
Testing, 27(1), 101–118. doi:10.1177/0265532209340194.
Harrington, M. (2006). The lexical decision task as a measure of L2 lexical pro-
ficiency. EUROSLA Yearbook, 6(1), 147–168.
Harrington, M., & Roche, T. (2014a). Word recognition skill and academic
achievement across disciplines in an English-as-lingua-franca setting. In
U. Knoch (Ed.), Papers in Language Testing, 16, 4.
Harrington, M., & Roche, T. (2014b). Identifying academically at-risk students
at an English-medium university in Oman: Post-enrolment language assess-
ment in an English-as-a-foreign language setting. Journal of English for
Academic Purposes, 15, 34–37.
Harsch, C., & Hartig, J. (2015). Comparing C-tests and yes/no vocabulary size
tests as predictors of receptive language skills. Language Testing, 33(4),
555–575.
Laufer, B., & Nation, P. (1995). Vocabulary size and use: Lexical richness in L2
written production. Applied Linguistics, 16(3), 307–322.
Laufer, B., & Nation, P. (2001). Passive vocabulary size and speed of meaning
recognition: Are they related? EUROSLA Yearbook, 1(1), 7–28.
Milton, J. (2009). Measuring second language vocabulary acquisition. Bristol:
Multilingual Matters.
Miralpeix, I., & Meara, P. (2010). The written word. Retrieved from www.
lognostics.co.uk/Vlibrary
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes
in L2 research. Language Learning, 64, 878–912. doi:10.1111/lang.12079.
Roche, T., & Harrington, M. (2013). Recognition vocabulary knowledge as a
predictor of academic performance in an English as a foreign language set-
ting. Language Testing in Asia, 3(1), 1–13. doi:10.1186/2229-0443-3-12.
Zhang, X., & Lu, X. (2013). A longitudinal study of receptive vocabulary
breadth knowledge growth and fluency development. Applied Linguistics,
35(3), 283–304. doi:10.1093/applin/amt014.
12 The Future of Lexical Facility

Aims

• Recap the conceptual and empirical cases for the lexical facility
proposal.
• Discuss the suitability of Timed Yes/No Test and alternative measure-
ment instruments.
• Identify future research directions.
• Consider applications for second language (L2) vocabulary instruction
and assessment.

12.1 Introduction
The lexical facility proposal is driven by the idea that combining vocabu-
lary size and recognition skill (speed and consistency) results in a second
language (L2) vocabulary measure that is more sensitive than size alone
to user proficiency and performance differences. A three-part construct
was introduced at the outset of the book that combines the size of an
individual’s recognition vocabulary, the relative speed with which these
words are recognized, and the consistency of this recognition speed into
the unitary notion of lexical facility. It was proposed that the three mea-
sures combined are more sensitive to individual differences in L2 vocabu-
lary knowledge, as manifested in performance in various domains of
academic English, than size alone. This proposal was tested in the studies
reported in Part 2.
The chapter begins by recapping the conceptual and empirical case for
the lexical facility account. Following this, the strengths and limitations
of the Timed Yes/No Test as a measure of lexical facility are considered,
and alternative approaches to testing the construct identified. Directions
for further research are then discussed. These address limitations of the
current work and serve to more solidly establish the function and scope
of the lexical facility construct. Finally, this chapter (and the book) con-
cludes by outlining possible applications of the approach to L2 vocabu-
lary instruction, learning, and assessment.

12.2 The Case for Lexical Facility


The Conceptual Case for Lexical Facility

Lexical facility characterizes vocabulary skill as a property defined by the
number of words an individual can recognize and the efficiency with
which this is done. Word knowledge in this framework denotes a person’s
ability to link a form encountered with a meaning representation in the
mental lexicon—at a minimum, the ability to recognize a word as belong-
ing to the language. By any standard, it is a minimal account of what it
means to know a word. However, having an adequate stock of words is a
fundamental precondition for fluent performance, as is efficient skill in
accessing this knowledge. This efficiency is reflected in both how fast a
user can recognize words and the consistency of this speed from word to
word.
Researchers interested in vocabulary size and processing skill have
worked independently. Vocabulary size researchers such as Laufer,
Nation, and Schmitt have been primarily interested in understanding
how increasing vocabulary stock covaries with performance outcomes;
speed researchers such as Segalowitz and Hulstijn have focused on how
retrieval practice facilitates the development of processing efficiency,
especially automaticity. Size research is concerned with the effect of an
increasing number of new words; speed research, with developing skill in
processing words already learned. This division reflects a long-standing psy-
chometric tradition that treats knowledge (power) and speed as
independent dimensions. In higher cognitive domains, the speed with
which knowledge mastery is demonstrated is assumed to be incidental to
the mastery of the knowledge itself. An algebra test is scored for how
many items are correct, not for how fast they are answered. In language
learning, however, the speed with which a user displays knowledge is
anything but incidental. Language that is not available to meet immedi-
ate discourse demands is of no use. The lexical facility account captures
the dependency between knowledge and processing skill in L2
vocabulary.
The parallel role that vocabulary size and processing skill play in fluent
text comprehension also warrants their combination in a single construct.
Both lower-level processes are critical bottlenecks on global comprehen-
sion processes and two aspects of L2 vocabulary knowledge that most
directly reflect the user’s experience with the language; they are exposure-­
driven, emergent properties of the user’s L2 mental lexicon.
The discrete, context-independent nature of vocabulary size and speed
also means that as an L2 vocabulary construct, lexical facility can be use-
fully characterized as a trait: as a set of relatively stable, learner-internal
representations whose availability and use is not limited to any particular
context. The trait approach focuses on individual abilities and contrasts
with the context-based and interactional approaches to measuring L2
vocabulary prevalent in L2 language acquisition and testing theory. It
provides an objective means to compare user vocabulary knowledge
across contexts of use.
In short, the conceptual case for the lexical facility account arises from
the time-contingent nature of L2 performance, the critical role that size
and speed play in fluent text comprehension skills, and the emergentist
nature of its origin and development. It is a trait the user develops and
brings to specific contexts of use.
The Empirical Case for Lexical Facility

Empirical evidence for the account comes from the studies summarized
in Chap. 11. The studies examined the sensitivity of the three lexical facil-
ity measures to performance differences in various domains of academic
English. Vocabulary size is measured by the VKsize score, mean recogni-
tion time by the mnRT measure, and consistency by the coefficient of
variation, CV. Sensitivity reflects how well the measures discriminate
between proficiency levels and the strength of the observed differences,
both separately and in combination.
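As a concrete sketch of how the three measures can be derived from item-level responses: the data layout and names below are hypothetical, the size score is simplified to the hit rate minus the false-alarm rate (the correction discussed in Sect. 12.3), and the CV is the standard ratio of the RT standard deviation to its mean.

```python
from statistics import mean, stdev

# Hypothetical item-level responses: (is_word, said_yes, rt_ms; 0 for 'no').
trials = [(True, True, 620), (True, True, 700), (True, False, 0),
          (False, True, 810), (False, False, 0), (True, True, 560)]

words = [t for t in trials if t[0]]
pseudos = [t for t in trials if not t[0]]
hit_rate = sum(t[1] for t in words) / len(words)
fa_rate = sum(t[1] for t in pseudos) / len(pseudos)

vksize = 100 * (hit_rate - fa_rate)          # corrected size score (in %)
correct_rts = [t[2] for t in words if t[1]]  # RTs over hits only
mnrt = mean(correct_rts)                     # mean recognition time
cv = stdev(correct_rts) / mnrt               # coefficient of variation

print(round(vksize, 1), round(mnrt, 1), round(cv, 2))
```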
Of the three measures, VKsize was the most sensitive, consistently dis-
criminating between proficiency levels across university-age groups,
English entry standards, IELTS, and program placement outcomes. The
differences were also accompanied by consistently large effect sizes. mnRT
scores were also sensitive to performance differences. Faster recognition
times consistently correlated with higher-proficiency levels, test scores,
and placement levels, though not to the extent of VKsize. The mnRTs
were more variable than the VKsize responses but did provide a stable
index of proficiency.
The lexical facility account is distinctive in proposing that recognition
time consistency (captured by the CV) can also serve as an index of L2
vocabulary proficiency. The findings provide only limited support for the
notion: the measure was sensitive to group or levels only when there was
a noticeably large difference in proficiency, and even in these instances,
the effect sizes were, at best, small. The results indicate that consistency
may serve as a broad index of proficiency, but with still limited practical
application in measurement.
It was proposed that composite measures combining size and pro-
cessing skill (mnRT and CV) are more sensitive to criterion differ-
ences than size (VKsize) alone. The proposal was borne out for the
mnRT results but not for the CV. was for the mnRT measure. The
VKsize_mnRT composite measure was more sensitive to group differ-
ences than VKsize alone. Greater sensitivity was evident in both the
number of significant group comparisons and the relative effect sizes
of these differences. The VKsize_mnRT measure produced a larger
effect size in five of the seven studies, though the differences were not
always statistically significant. In the regression studies, it accounted
for a significant amount of unique variance beyond VKsize, although
the magnitude of the effect was small, accounting for 3–6% of the
unique variance. This pattern was evident for both the written and
spoken modes, though performance on the spoken version was consis-
tently lower across groups and frequency levels.
The incorporation of the CV in the lexical facility construct is an
attempt to treat response variability as a window on performance, as
opposed to mere ‘noise’ that might otherwise obscure experimental effects
of interest. Variability as a characteristic of skill develop-
ment has long been of interest to cognitive science researchers (Balota and Yap
2015; Hird and Kirsner 2010) and is represented in L2 research in work
on the CV done by Segalowitz and his colleagues (Segalowitz 2010). It is
an area that warrants greater attention in L2 acquisition research, despite
the modest CV results reported here.
The findings also validated the frequency-based approach to measur-
ing vocabulary knowledge. A fundamental assumption of the lexical facil-
ity approach is that corpus-based word frequency statistics serve to
predict, in probabilistic terms, when and how well a word is learned. The
latter is reflected in part by the mean recognition speed. Size performance
aligned closely with frequency-of-occurrence levels, as did time. The
growing recognition of the correlation of frequency and the contextual
diversity in which a word is encountered (Adelman 2006; Crossley et al.
2013) makes the approach potentially even more informative.
In summary, VKsize and mnRT were shown to be robust measures of
proficiency, and the composite measures of size and speed provided a
more sensitive measure than size alone in the majority of the compari-
sons. Most of the studies reported a mean advantage for the composites,
though in some of these instances, the mean advantages were not statisti-
cally significant. The results thus await further confirmation. The two
regression models also showed that mnRT responses accounted for sig-
nificant, unique variance beyond VKsize. The CV was shown to be only
an approximate measure of proficiency. The use of word frequency statis-
tics as an index of L2 vocabulary knowledge was also corroborated.
12.3 Measuring Lexical Facility: The Timed Yes/No Test and Alternatives
In principle, the lexical facility construct is independent of the measure-
ment format. However, all the evidence presented for it has been col-
lected using the Timed Yes/No Test. As such, problems with the testing
instrument will also be problems for the research construct. Issues arising
from using the Timed Yes/No Test format are first identified, and then
alternatives to the current paradigm are considered.

The Timed Yes/No Test as a Measurement Instrument

The test format has distinctive characteristics that can alternatively be
considered strengths or limitations, depending on the research goal and
learning assumptions.

Self-Report of Vocabulary Knowledge

The test relies on self-report of user vocabulary knowledge. The user sim-
ply indicates whether or not the word is known. Instances of guessing
aside, the format provides no way of establishing what the user knows
about the target word. The test provides a measure of size, a quantitative
property of the user’s mental lexicon. At issue is how well this property
relates to L2 performance and not to knowledge of specific meanings or
senses. The property is described as a probabilistic estimate that ulti-
mately has to be combined with other measures to get a complete picture
of the user’s L2 vocabulary knowledge and skill.

Timed Binary Response Format

The Timed Yes/No Test draws on the lexical decision paradigm for
the measurement of recognition time. However, there are important
differences between the respective testing conditions that make the
interpretation of mean recognition times in the Timed Yes/No Test more
tentative. In both test formats, the user responds as quickly as possible
with either a ‘yes’ or a ‘no’ to individual items drawn from a set of items. In the
traditional lexical decision task, word and pseudoword items are randomly
balanced such that there is a 50% probability of either response on any
given trial. It is also assumed that the test-taker will know the words and
correctly reject the pseudowords, resulting in few, if any, errors. The
potential for a response bias to emerge in response to a preponderance of
either ‘yes’ or ‘no’ responses is low.
In the Timed Yes/No Test, in contrast, the proportion of each response
type will be a function of both the individual’s vocabulary size and guess-
ing behavior. This potential asymmetry gives rise to possible response
biases by individuals. A test-taker who has a small vocabulary and low
false-alarm rate may end up responding ‘no’ far more often than ‘yes’. As
a result, the user may start anticipating hitting the ‘no’ key, resulting in
judgment errors and overly fast recognition times; conversely, the strate-
gic test-taker might feel that there have been too many ‘no’ responses and
answer ‘yes’ to try to balance things up.
The studies reported here used word/pseudoword splits that incorpo-
rate more real words than pseudowords (typically, 70 words/30 pseudo-
words) to take into account the potential ‘no’ responses to word items. As
the proficiency range of the users in a single sample increases, getting the
right balance becomes increasingly difficult. To what degree the imbal-
ance between words and pseudowords affects test performance, particu-
larly recognition time, needs more attention.
The yes/no response format also distorts what it means to know a
word. Recognition knowledge of a given word is not a
‘yes’ or ‘no’ proposition; rather, it is a continuum. Vocabulary knowledge
scales have been proposed for self-report testing that offer the user alter-
natives that more closely reflect the true knowledge state for the word.
These alternatives might be: I have never seen the word/I have seen the word
but don’t know what it means/I have seen the word and know what it means.
Whether these formats are more effective than the straight yes/no response
remains an open question (Bruton 2009). It is possible to incorporate
three (or more) response alternatives for each item and thus get a possibly
more accurate report of how well the individual knows the word.
However, by increasing the number of alternatives in a single trial,
recognition time becomes more difficult to interpret, as it would reflect
the strength of the target item representation as well as that of all the
alternatives. It would also complicate scoring.

Pseudowords

The use of pseudowords is the most distinctive, and arguably problem-
atic, feature of the Timed Yes/No Test format. In the formula used here,
pseudoword performance is subtracted from hit scores to yield a cor-
rected score that reflects both the number of words recognized and the
guessing performance. As a result, it is possible that two users may end up
with the same corrected scores but have markedly different guessing rates.
Student A with 40 hits and 20 false alarms has the same corrected score
as Student B with 30 hits and 10 false alarms. If the false-alarm rates
across the users in the group are similar, it is not too much of a problem.
But when there is a range of mean false-alarm rates, the vocabulary size
scores become far less sensitive. Different scoring methods have been
investigated for different false-alarm rates, but they yield similar results
when a single formula is applied to a sample that has even moderate vari-
ability in individual false-alarm rates (Huibregtse et al. 2002; Mochida
and Harrington 2006). In theory, it is possible to apply the scoring for-
mula optimal for the false-alarm rate for each user, but in practice, it
would be difficult to do (Pellicer-Sánchez and Schmitt 2012).
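The Student A/B equivalence can be stated in a few lines. The sketch below uses raw counts, as in the example above; the formulas investigated in the literature apply proportion-based corrections of the same general kind.

```python
def corrected_score(hits, false_alarms):
    """Corrected recognition score as described above: hits minus false alarms.

    Uses raw counts, as in the Student A/B example; the studies apply
    proportion-based correction formulas from the same family.
    """
    return hits - false_alarms

print(corrected_score(40, 20))  # Student A -> 20
print(corrected_score(30, 10))  # Student B -> 20: same score, different guessing
```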
The scoring difficulties underscore the importance of keeping false-­
alarm rates as low as possible. It is important to ensure that the pseudo-
words conform to the phonological and orthographic regularities of the
language, but that they are not too similar to real words. The recent
appearance of pseudoword generators makes the task easier, as they allow
reliable pseudowords that conform to the desired attributes to be readily
produced (Keuleers and Brysbaert 2010).
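The following toy generator conveys the idea by chaining letter bigrams drawn from real words so that the output respects some orthographic regularities. It is an illustration only, not the generator cited above, and its candidates would still need screening against real words in the L2 and in test-takers' L1s.

```python
import random

random.seed(3)

# Build a bigram table from a (toy) lexicon: which letters follow which.
lexicon = ["garden", "window", "thunder", "carpet", "silver", "market"]
bigrams = {}
for w in lexicon:
    for a, b in zip(w, w[1:]):
        bigrams.setdefault(a, []).append(b)

def pseudoword(length=6):
    """Chain attested bigrams into a candidate pseudoword."""
    letter = random.choice(list(bigrams))
    out = [letter]
    while len(out) < length and letter in bigrams:
        letter = random.choice(bigrams[letter])
        out.append(letter)
    return "".join(out)

candidates = [pseudoword() for _ in range(5)]
# Any real items must still be screened out by hand (and against the L1).
print([c for c in candidates if c not in lexicon])
```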
The type and quality of test instructions also play a significant role in
minimizing extreme false-alarm rates. Instructions should be presented
clearly and contain explicit caution against guessing. Although users
should be told to work as quickly and accurately as possible, a balance
must also be maintained between warnings against gratuitous guessing
and encouraging the test-taker to work both quickly and accurately, as


overly conservative performance can also compromise the validity of the
final score.
Finally, the inclusion of pseudowords in the test raises issues of ecologi-
cal validity. Language users are never called upon in everyday life to con-
sciously reject a string of letters as not being a word. Exposure to and
processing of letter strings that do not exist in the language may promote
processing strategies that have little relevance to actual use and, by defini-
tion, provide no opportunity for learning. This limitation may under-
mine the utility of the test for classroom applications. Alternatives to the
use of pseudowords are discussed below.

L1 Script Effects

The written version of the test incidentally assesses English spelling skill.
Differences in a test-taker’s L1 script may, therefore, also affect perfor-
mance. Yes/No Test performance by Arabic L1 users has been shown to
be lower than their matched-proficiency L2 counterparts who come from
alphabet-based L1 backgrounds (Milton 2009), making direct compari-
son across these populations problematic. This asymmetry disappears in
the spoken version. Users from cognate languages may also be differen-
tially affected due to the similarities of the L1 and L2 scripts. It has been
shown that users from closely cognate languages can confuse test pseudo-
words with real L1 words (Meara et al. 1994)—another reason for care in
developing the test items. The variability in written Yes/No Test perfor-
mance potentially introduced by the L1 script needs attention. This is
particularly the case when interpreting performance in settings where
learners from markedly different L1 orthographies are compared. The
studies here had participants from China, Hong Kong, Taiwan, Japan,
Vietnam, and Oman.

Absence of Context

The Timed Yes/No Test format assesses word knowledge independent of
context. However, words are rarely encountered in isolation in the real
world, and the word recognition processes at the center of the test are
themselves highly sensitive to context. The relatedness of word meaning
is best exemplified in the pervasive effect of priming in word recognition.
Words encountered before or along with a target exert a strong influence
on how quickly the target is recognized and judged. The format does not
directly tap these processes. Previously presented items no doubt affect
performance, but the potential effect is controlled by randomization.
However, it is also the case that individual words develop resting repre-
sentation strengths that reflect the user’s exposure to the word and the
resulting links the word has with the other words in the mental lexicon.
These strengths are a property of the learner’s L2 mental lexicon—a
trait—that provides an important window on L2 performance.

Focus on Single Words

The test ignores vocabulary knowledge represented in multiword units
(phrases, idioms, etc.). Knowledge of these units is, of course, central to
fluent use; however, with only limited exceptions, they begin for the
learner as a collection of individual words that are apprehended, com-
bined, and then formed into a larger unit of extended meaning. Reading
research shows that text recognition begins, however fleetingly, with read-
ing single words. The mean speed with which a reader recognizes indi-
vidual words is also a trait-like property that has a direct relationship to
L2 performance.

Low-Stakes Testing

All of the studies are examples of low-stakes testing, in which the impor-
tance of the test outcome for the user is limited. The studies were carried
out as part of a research-driven data collection program. Test outcomes
had no bearing on the participants’ grade or any other aspect of study. As
a result, the degree of motivation and the attention users gave to the task
varied within and between groups. Most users were keen and focused on
the task, but on occasion needed to be reminded and more generally
cajoled to focus on the task. Lack of attention is particularly problematic


for the Timed Yes/No Test format, as the timed nature of the test requires
the concentration of the test-taker to produce as-fast-as-possible responses.
While the time condition motivates many users to approach the task as a
challenge, others were less keen and required additional prompting to
maintain attention. How these differing levels of motivation affect testing
outcomes is a fundamental question. A major challenge is to devise test
instructions and presentation formats that optimize test-taker engage-
ment with the test.

Alternatives to the Timed Yes/No Test Format

There are alternatives to the test format that can reduce or eliminate most
of the limitations noted. Any alternative measure of lexical facility needs
to meet two basic criteria: the test items must be sampled from a range of
frequency levels that allow vocabulary size to be estimated, and the speed
with which individual items are recognized needs to be collected or con-
trolled. Alternative formats for collecting both size and speed measures
are available.

Pair Choice The potential response bias that arises from using binary
response options can be avoided by using a pair choice format, such as in
the Recognition-Based Vocabulary Test proposed by Eyckmans (2004).
Here, the user chooses which member of a word–pseudoword pair is an
actual word. While this format avoids the yes/no response bias problem,
other problems associated with using pseudowords remain, including
potential similarity with words in the L1 or L2. Another alternative is an
animacy-judgment task, which involves a semantic choice (Segalowitz
and Freed 2004). Users are presented with a noun pair—one animate and
one inanimate—and must only identify the animate term. This format
effectively circumvents the problems arising from the use of ­pseudowords.
Unfortunately, there are only a relatively small number of animate nouns
in the language, making it difficult to cover adequately the frequency
range required for estimating size.
Words Only (No Pseudowords) The simplest way to avoid the pseudoword
problem is to eliminate them altogether. A suitable instruction set might
prove sufficient to minimize test-taker guessing, especially for particular
learner backgrounds and settings (Shillaw 1996). Another way to dis-
courage guessing is to regularly and randomly stop the test after a ‘yes’
response to a word and ask the user to define, describe, or translate it.
However, recurrent interruptions will lengthen the time it takes to com-
plete the test and may affect response speed, as the user constantly goes
off- and online. Another option is to test (or at least threaten to test) the
word items at the end of the test. This may also lengthen the duration of
the test beyond desirable limits.

A further possibility is to retain the pseudowords but use them only as
measures of guessing, with the hit rate used as the sole size mea-
sure (Harsch and Hartig 2015). The hits were similar to the VKsize scores
in sensitivity to group differences for most of the studies, reflecting a rela-
tively narrow range of false-alarm rates within the group. At the same
time, false-alarm rates also reflected proficiency levels more broadly.
Lower-proficiency participants invariably had high false-alarm rates,
while high-proficiency individuals had the lowest. False-alarm perfor-
mance is thus both a measure of guessing and of proficiency. Dispensing
with pseudowords may affect the face validity of the format as a reliable
measure of vocabulary knowledge, as pseudowords provide the only evi-
dence for the user’s guessing behavior.

Using Context Cues The perceived limitation of presenting words with-
out context can, of course, be addressed by giving contextual cues as
prompts. Some research designs have attempted to incorporate context in
the test format to improve test reliability and validity (e.g., Read and
Shiotsu 2010). However, presenting a target context for each item
requires that the effect of the context words on recognition time be
­controlled. Incorporating contextual cues also significantly increases the
time it takes to complete a test and can greatly decrease the number of
target items that can be tested in one sitting. Testing single words in the
current format permits a large number of items to be tested in a short
period—and ensures greater test reliability.
Go/No-Go Format Recognition speed can be attributed to three basic fac-
tors: how well a word has been learned (the strength of the underlying
word representation), the time an individual takes when deciding whether
to answer ‘yes’ or ‘no’, and the motor movements needed to realize the
response. The last two components can be minimized—and the latter
eliminated entirely—by using a Go/No-Go response format. Here, the
user needs to only tap a single key if responding ‘yes’ (the ‘go’). If the
answer is ‘no’, the test-taker does nothing (the ‘no-go’) and simply waits
until the item times out. Performance on this format has been compared
with that on the standard lexical decision task format and has been argued
to yield more accurate judgments, promote faster response times, and
make fewer processing demands than the forced-choice format (Perea
et al. 2002). Whether adopting such a format would result in the same
outcomes for the learner populations that have been tested with the
Timed Yes/No Test remains an open question and one that merits further
investigation.
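The response logic of a Go/No-Go trial is easy to state: a key press before the deadline counts as 'yes', and letting the item time out counts as 'no'. A minimal scoring sketch, with a hypothetical timeout value and trial records:

```python
# A key press within the time limit counts as 'yes' (the 'go'); letting the
# item time out counts as 'no' (the 'no-go'). RT is None when no key is hit.
TIMEOUT_MS = 2500

def score_trial(is_word, rt_ms):
    responded = rt_ms is not None and rt_ms < TIMEOUT_MS
    if is_word:
        return "hit" if responded else "miss"
    return "false_alarm" if responded else "correct_rejection"

trials = [(True, 640), (True, None), (False, 900), (False, None)]
print([score_trial(w, rt) for w, rt in trials])
# ['hit', 'miss', 'false_alarm', 'correct_rejection']
```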

Program-Controlled Presentation Times A very different approach to mea-
suring individual differences in word recognition speed involves varying
the amount of time available to the user to view and respond to a word
target. In the Yes/No Test format, frequency level is manipulated by
including items from a range of frequency bands and then collecting an
individual’s response time to each item. An alternative is to vary presenta-
tion durations within frequency levels so that words appear for different
lengths of time, ranging from very fast (say, 500–600 milliseconds) to
very slow (over 3000 milliseconds). User response times can also be col-
lected. Accuracy performance on both the hits and the false alarms is the
main dependent variable and is interpreted as a function of frequency
level and time available. This format may provide a better window on the
development of speed relative to size within and across users. It may also
allow exposure thresholds to be established that index fluent word recog-
nition and that can be correlated with language performance outcomes.
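Such a design crosses frequency bands with a set of fixed exposure durations. A minimal sketch of trial-list construction, with placeholder bands, items, and durations:

```python
import itertools
import random

random.seed(1)

# Cross each frequency band's items with fixed exposure durations (ms);
# accuracy is then analyzed as a function of band and time available.
bands = {"2K": ["item_a", "item_b"], "5K": ["item_c", "item_d"]}
durations_ms = [500, 1000, 2000, 3000]

trial_list = [(band, item, dur)
              for (band, items), dur in itertools.product(bands.items(), durations_ms)
              for item in items]
random.shuffle(trial_list)
print(trial_list[:4])
```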

The adaptation of the rapid serial visual presentation paradigm may
also be possible (Spence and Witkowski 2013). The format allows unlim-
ited text to be presented in one location on a computer screen and could
be used to present a contextual frame for target items, offering a more
tightly controlled way to measure target recognition time than does a line
of text on the screen.
These alternatives are not all mutually exclusive, and other alternatives
meeting the two lexical facility criteria are no doubt possible.

12.4 The Next Step in Lexical Facility Research
The lexical facility proposal is motivated by the notion that time is a
defining feature of L2 vocabulary skill and thus should be directly incor-
porated in the measurement of L2 vocabulary. Differences in processing
time, especially response times, have long been used as a window on
underlying knowledge representations.1 However, in both L1 and L2
research, time differences have been examined as evident across estab-
lished knowledge representations. The lexical facility account proposes
that the combination of developing vocabulary knowledge and process-
ing skill provides a measure of L2 vocabulary knowledge/skill more sensi-
tive to differences in L2 performance than the individual measures alone.
Conceptually, the account represents an approach to modeling L2 vocab-
ulary knowledge that recognizes its time-contingent nature and seeks to
understand how the temporal dimension of recognition speed (and pos-
sibly consistency) covaries with vocabulary knowledge in L2 proficiency
development. It has been evident that combining size and speed presents
significant theoretical and methodological concerns. These have been
identified and addressed to varying degrees. The studies provide empirical
support for combining size and speed, but more work is needed before
the lexical facility construct can be considered firmly established.
The first need is for a better understanding of how vocabulary size and
recognition speed covary in the course of development. Previous research
has shown that the development of recognition speed lags behind that of
size within individuals. The one-off nature of the studies here, coupled
with smaller sample sizes in many of the conditions, did not allow this
issue to be addressed. More work is needed to ascertain whether there is a
consistent relationship between lexical facility and proficiency as the learner
moves from beginner to more advanced stages of development. The use of
larger data sets, ideally collected over multiple sites, would allow vocabu-
lary size scores to be held constant overall and for different frequency levels.
This would permit recognition time (and possibly CV) variability to be
examined as a function of proficiency differences at a finer grain of analy-
sis. One possible outcome from this would be the identification of recogni-
tion time thresholds similar to the vocabulary size thresholds or ranges
identified as necessary for fluent reading (Schmitt et al. 2011).
The place of lexical facility in a complete model of L2 reading also
needs attention. The independent contribution of low-level word knowl-
edge and word recognition processes to reading skill is well-established
(Jeon and Yamashita 2014). More needs to be understood about the
interaction of these elements—lexical facility—with higher-level sen-
tence and discourse processes in predicting individual differences in
reader performance. The effect of lexical facility on L2 reading outcomes
may be more direct for beginner learners, where it is still developing, than
in more advanced L2 and L1 readers (Hannon 2012). Also ignored to
date is the potential contribution to lexical facility differences made by
individual differences in base rates of information processing speed
(Segalowitz 2010). An inclusion of baseline measures would provide a
sharper picture of the recognition time effects.
The findings show a consistent linear relationship between lexical facil-
ity (particularly size and recognition speed) and proficiency differences,
whether reflected in user and entry standard group differences or place-
ment test and IELTS scores. These criterion measures all reflect global
proficiency. Further research needs to assess the relationship between
lexical facility and performance in the four skill areas of listening, speak-
ing, reading, and writing. The one study (Study 2) that examined the
spoken lexical facility measure produced the same linear pattern evident
in the written test results. Whether there is a correlation between the
spoken lexical facility measure and speaking skill awaits further investi-
gation. The correlations between the lexical facility measures and the
placement listening and grammar tests (Study 4) also suggest that the
relationship may be pervasive. The more stable this relationship is shown
to be, the more useful lexical facility is as an objective index of L2 vocab-
ulary development.

12.5 Uses of Lexical Facility in Vocabulary
Assessment and Instruction
Lexical facility is a low-level processing constraint on comprehension that
is sensitive to proficiency differences across a number of academic English
domains. The online Timed Yes/No Test format is a time- and resource-
effective tool that allows the size and speed measures to be gathered easily
in program, classroom, and individual settings, for low-stakes testing
purposes. The attractiveness of the untimed Yes/No Test format for place-
ment testing and user self-diagnosis was recognized from the time it first
appeared (Meara and Jones 1990; Milton 2009). The inclusion of recog-
nition time as a response measure in the timed version improves the sen-
sitivity of the measure and has the potential to increase user engagement
(Lee and Chen 2011).
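By way of illustration, the sketch below (in Python, with invented
responses) shows how a size estimate and a mean recognition time might be
derived from timed yes/no data; the simple hits-minus-false-alarms
adjustment is only one of the corrections discussed in the yes/no test
literature, not necessarily the scoring model used in the studies
reported here.

    # Each trial records the item, whether it is a real word or a pseudoword,
    # the test-taker's response, and the response time in milliseconds.
    # All data are invented for illustration.
    trials = [
        ("govern", True, "yes", 642), ("mosp", False, "no", 801),
        ("basket", True, "yes", 588), ("terrin", False, "yes", 910),
        ("filter", True, "no", 734), ("window", True, "yes", 515),
    ]

    words = [t for t in trials if t[1]]
    pseudowords = [t for t in trials if not t[1]]

    hit_rate = sum(t[2] == "yes" for t in words) / len(words)
    false_alarm_rate = sum(t[2] == "yes" for t in pseudowords) / len(pseudowords)
    adjusted_score = hit_rate - false_alarm_rate  # one simple correction

    # Mean recognition time over correct 'yes' responses to real words.
    correct_rts = [t[3] for t in words if t[2] == "yes"]
    mean_rt = sum(correct_rts) / len(correct_rts)

    print(f"hits {hit_rate:.2f}, false alarms {false_alarm_rate:.2f}, "
          f"adjusted score {adjusted_score:.2f}, mean RT {mean_rt:.0f} ms")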

Vocabulary Testing and Assessment

The testing format allows a sample of learner vocabulary size and speed to
be obtained quickly and reliably in classroom and program settings. It
lends itself to three types of testing in particular.

Readiness Testing The results of the IELTS band-score study (Study 3)
show how the measures can be used to assess readiness to take the IELTS
and tests like it (TOEFL, TOEIC, etc.). These tests represent a significant
hurdle for international students who want to study in Australia, New
Zealand, the UK, and the USA. For most prospective students, these tests
involve a considerable investment of time and money. Lexical facility per-
formance can provide one indication of readiness to complete these high-
stakes tests.

Placement Testing Placing new students at the appropriate level in lan-
guage programs is an important decision that affects teaching and learn-
ing outcomes. It is a time- and resource-intensive activity for many
language programs, especially when the testing must be done at regular
intervals and involves a range of proficiency levels. The findings indicate
that the vocabulary size and RT measures have good predictive validity
when calibrated against placement levels, and especially when used in
combination with site-specific tools (Harrington and Carey 2009). The
test is also an efficient online tool for institutions to test international
students offshore (Roche and Harrington 2017).

Postenrollment Language Assessment/At-Risk Screening The increasing
number of English L2 students undertaking study in traditional English-
speaking countries means that more attention is being given to English-
language proficiency needs post enrollment. The test can provide a
cost-effective and reliable screening tool for tracking student progress and
a potential means to identify students at risk in English-medium study
for language-related reasons (Read 2016). This diagnostic screening func-
tion has been incorporated in multiskill tests, such as DIALANG
(Jamieson 2005), and can serve as an initial diagnostic to establish which
students need to take a more comprehensive diagnostic test, such as the
DELNA test (Read 2007).

Classroom Applications The game-like quality of the Timed Yes/No test
format lends itself to classroom use, as well as to formative, individual
assessment activities outside of class. The format requires the explicit
retrieval of the word from memory, a process that strengthens the word
itself and the links it has to related words, thus facilitating long-term reten-
tion and easier future access. Research on the testing effect (Karpicke and
Roediger 2007) shows that test-driven retrieval of material from memory
can substantially boost the learning of new material, and the mere attempt
at retrieval may also have a positive effect on learning, regardless of whether
test performance was successful (Richland et al. 2009). The test format
lends itself to adaptation in individual and class-based testing programs
where new and revision words are systematically tested.

Developing Word Recognition Speed

An increasing recognition of processing skill as a measurable dimension
of L2 vocabulary knowledge may also affect approaches to vocabulary
instruction. The need to increase learner vocabulary size has long been
recognized as an imperative in L2 vocabulary instruction (Nation 2013).
As recognition speed becomes established as a crucial aspect of developing
vocabulary skill, one might ask whether explicit attempts to develop it
in and outside the classroom are also warranted. A small
number of studies have attempted to explicitly develop learner recogni-
tion speed. Explicit retrieval practice on learned words has been used to
develop automaticity in single-word recognition (Akamatsu 2008) and
reading comprehension (Fukkink et al. 2005), and to facilitate written
production in English (Snellings et al. 2002). Other intentional retrieval
activities have also been used to improve vocabulary learning outcomes
(Barcroft 2007). Such studies are still relatively scarce, and the scope for
future work in the area is significant.

12.6 Conclusions
This book has made a case for including recognition speed (and to a lesser
extent, consistency) in the measurement of L2 vocabulary knowledge. The
point of departure was the observation that more proficient users can rec-
ognize more words and do this faster and more consistently than less pro-
ficient users, and the suggestion that this relationship is not coincidental.
While vocabulary size has received significant attention, recognition speed
has largely been ignored as a usable index of L2 vocabulary learning—as
opposed to processing skill.2 The findings show that the combination of
mean recognition speed and size provides a more sensitive measure of L2
vocabulary differences than either alone.
This project is the first to systematically examine a measure of process-
ing consistency—the CV—as an index of proficiency, analogous to
vocabulary size and recognition speed. The measure was sensitive to
group differences when the proficiency levels were very distinct, but other-
wise was less informative.
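For readers who wish to compute the measure, the CV is simply the standard
deviation of an individual's response times divided by their mean, with
lower values indicating more consistent responding. A minimal sketch
follows (in Python, with hypothetical data).

    # Coefficient of variation (CV) of response times: SD / mean.
    # Lower values indicate more consistent responding. Data are hypothetical.
    from statistics import mean, stdev

    rts = [620, 700, 655, 590, 740, 610]  # correct-response times in ms
    cv = stdev(rts) / mean(rts)
    print(f"mean RT {mean(rts):.0f} ms, CV {cv:.3f}")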
The findings reinforce the importance of lower-level vocabulary pro-
cesses in models of L2 vocabulary and also demonstrate the usefulness of
frequency-based approaches to indexing L2 development. They replicate
and extend previous work that has shown vocabulary size to be a sensitive
heuristic for identifying L2 proficiency differences. The question posed
here was whether the addition of recognition time and consistency
improves this sensitivity. The answer is a qualified ‘yes’.

Notes
1. Or in the words of a popular cognitive psychology textbook, Time is
cognition (Lachman et al. 1979, p. 133).
2. One of the few exceptions is Pellicer-Sánchez and Schmitt (2012), who
used a threshold recognition time to establish whether a word had been
learned.

References
Adelman, J. S., Brown, G. D. A., & Quesada, J. F. (2006). Contextual diversity,
not word frequency, determines word-naming and reading times. Psychological
Science, 17(9), 814–823.
Akamatsu, N. (2008). The effects of training on automatization of word recog-
nition in English as a foreign language. Applied Psycholinguistics, 29(2),
175–193. doi:10.1017/S0142716408080089.
Barcroft, J. (2007). Effects of opportunities for word retrieval during second
language vocabulary learning. Language Learning, 57(1), 35–56.
Bruton, A. (2009). The vocabulary knowledge scale: A critical analysis. Language
Assessment Quarterly, 6(4), 288–297.
Crossley, S. A., Subtirelu, N., & Salsbury, T. (2013). Frequency effects or con-
text effects in second language word learning. Studies in Second Language
Acquisition, 35, 727–755. doi:10.1017/S0272263113000375.
Eyckmans, J. (2004). Learners’ response behavior in Yes/No vocabulary tests. In
H. Daller, M. Milton, & J. Treffers-Daller (Eds.), Modelling and assessing
vocabulary knowledge (pp. 59–76). Cambridge: Cambridge University Press.
Fukkink, R. G., Hulstijn, J., & Simis, A. (2005). Does training in second-­
language word recognition affect reading comprehension? An experimental
study. Modern Language Journal, 89(1), 54–75. doi:10.1111/j.0026-7902.
2005.00265.x.
Hannon, B. (2012). Understanding the relative contributions of lower-level
word processes, higher-level processes, and working memory to reading com-
prehension performance in proficient adult readers. Reading Research
Quarterly, 47(2), 125–152. doi:10.1002/RRQ.013.
Harrington, M., & Carey, M. (2009). The online yes/no test as a placement
tool. System, 37(4), 614–626. doi:10.1016/j.system.2009.09.006.
Harsch, C., & Hartig, J. (2015). Comparing C-tests and yes/no vocabulary size
tests as predictors of receptive language skills. Language Testing, 33(4), 555–575.
Hird, K., & Kirsner, K. (2010). Objective measurement of fluency in natural
language production: A dynamic systems approach. Journal of Neurolinguistics,
23(5), 518–530. doi:10.1016/j.jneuroling.2010.03.001.
Huibregtse, I., Admiraal, W., & Meara, P. (2002). Scores on a yes-no vocabulary
test: Correction for guessing and response style. Language Testing, 19(3),
227–245.
Jamieson, J. (2005). Trends in computer-based second language assessment.
Annual Review of Applied Linguistics, 25, 228–242.
Jeon, E. H., & Yamashita, J. (2014). L2 reading comprehension and its corre-
lates: A meta-analysis. Language Learning, 64(1), 160–212. doi:10.1111/
lang.12034.
Karpicke, J. D., & Roediger, H. L. (2007). Repeated retrieval during learning
is the key to long-term retention. Journal of Memory and Language, 57(2),
151–162. doi:10.1016/j.jml.2006.09.004.
Keuleers, E., & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword
generator. Behavior Research Methods, 42(3), 627–633. doi:10.3758/BRM.
42.3.627.
Lachman, R., Lachman, J. L., & Butterfield, E. C. (1979). Cognitive psychology
and information processing. Hillsdale: Lawrence Erlbaum Associates.
Lee, Y. H., & Chen, H. (2011). A review of recent response-time analyses in
educational testing. Psychological Test and Assessment Modeling, 53(3),
359–379.
Meara, P., & Jones, G. (1990). Eurocentres vocabulary size test. 10KA. Zurich:
Eurocentres.
Meara, P., Lightbown, P. M., & Halter, R. H. (1994). The effects of cognates on
the applicability of yes/no vocabulary tests. The Canadian Modern Language
Review, 50(2), 296–311.
Milton, J. (2009). Measuring second language vocabulary acquisition. Bristol:
Multilingual Matters.
Mochida, A., & Harrington, M. (2006). The yes-no test as a measure of recep-
tive vocabulary knowledge. Language Testing, 23(1), 73–98. doi:10.1191/02
65532206lt321oa.
Nation, I. S. P. (2013). Learning vocabulary in another language (2nd ed.).
Cambridge: Cambridge University Press.
Pellicer-Sánchez, A., & Schmitt, N. (2012). Scoring yes-no vocabulary tests:
Reaction time vs. nonword approaches. Language Testing, 29(4), 489–509.
doi:10.1177/0265532212438053.
Perea, M., Rosa, E., & Gómez, C. (2002). Is the go/no-go lexical decision task an
alternative to the yes/no lexical decision task? Memory & Cognition, 30(1), 34–45.
Read, J. (2016). Post-admission language assessment in universities: International
perspectives. Switzerland: Springer International Publishing.
Read, J., & Shiotsu, T. (2010). Extending the yes/no test as a measure of the English
vocabulary knowledge of Japanese learners. Paper presented at the colloquium
on the measurement of L2 vocabulary development at the 2010 Annual
Conference of the Applied Linguistics Association of Australia, Brisbane.
Richland, L. E., Kornell, N., & Kao, L. S. (2009). The pretesting effect: Do
unsuccessful retrieval attempts enhance learning? Journal of Experimental
Psychology: Applied, 15(3), 243–257. doi:10.1037/a0016496.
Roche, T., & Harrington, M. (2017). Offshore and onsite placement testing for
English pathway programmes. Journal of Further and Higher Education. doi:
10.1080/0309877X.2017.1301403. Published online May 9, 2017.
Schmitt, N., Jiang, X., & Grabe, W. (2011). The percentage of words known in
a text and reading comprehension. The Modern Language Journal, 95(1),
26–43. doi:10.1111/j.1540-4781.2011.01146.x.
Segalowitz, N. (2010). Cognitive bases of second language fluency. New York:
Routledge.
Segalowitz, N., & Freed, B. (2004). Context, contact and cognition in oral flu-
ency acquisition: Learning Spanish in at home and study abroad contexts.
Studies in Second Language Acquisition, 26(2), 173–199. doi:10.1017/
S0272263104262027.
Shillaw, J. (1996). The application of Rasch modelling to yes/no vocabulary tests.
Swansea: Vocabulary Acquisition Research Group, University of Wales
Swansea.
Snellings, P., van Gelderen, A., & de Glopper, K. (2002). Lexical retrieval: An
aspect of fluent second language production that can be enhanced. Language
Learning, 52(4), 723–754.
Spence, R., & Witkowski, M. (2013). Rapid serial visual presentation: Design for
cognition. London/New York: Springer. ISBN 9781447150855.
Yap, M., & Balota, D. (2015). Visual word recognition. In A. Pollastsek & R.
Treiman (Eds.), The Oxford handbook of reading (pp. 26–43). New York:
Oxford University Press.
References

Adelman, J. S., Brown, G. D. A., & Quesada, J. F. (2006). Contextual diversity,
not word frequency, determines word-naming and reading times. Psychological
Science, 17(9), 814–823.
Adolphs, S., & Schmitt, N. (2003). Lexical coverage of spoken discourse.
Applied Linguistics, 24(4), 425–438.
Aitchison, J. (2012). Words in the mind: An introduction to the mental lexicon
(4th ed.). Malden: Wiley.
Akamatsu, N. (1999). The effects of first language orthographic features on
word recognition processing in English as a second language. Reading and
Writing: An Interdisciplinary Journal, 11(4), 381–403. doi:10.1023/A:1008
053520326.
Akamatsu, N. (2003). The effects of first language orthographic features on sec-
ond language reading in text. Language Learning, 53(2), 207–231.
Akamatsu, N. (2008). The effects of training on automatization of word recog-
nition in English as a foreign language. Applied Psycholinguistics, 29(2),
175–193. doi:10.1017/S0142716408080089.
Akbarian, I. (2010). The relationship between vocabulary size and depth for ESP/
EAP learners. System, 38 (3), 391–401. doi:10.1016/j.system.2010.06.013.
Alavi, S. M. (2012). The role of vocabulary size in predicting performance on
TOEFL reading item types. System, 40(3), 376–385.
Albrechtsen, D., Haastrup, K., & Henriksen, B. (2008). Vocabulary and writing
in a first and second language: Processes and development. Basingstoke: Palgrave
Macmillan.
Alderson, J. (2005). Diagnosing foreign language proficiency: The interface between
learning and assessment. New York: Continuum.
Anderson, R. C., & Freebody, P. (1981). Vocabulary knowledge. In J. T. Guthrie
(Ed.), Comprehension and teaching: Research reviews (pp. 77–117). Newark:
International Reading Association.
Anderson, R. C., & Freebody, P. (1983). Reading comprehension and the assess-
ment and acquisition of word knowledge. In B. Huston (Ed.), Advances in
reading/language research (Vol. 2, pp. 231–256). Greenwich: JAI Press.
Andrews, S. (1992). Frequency and neighborhood effects on lexical access: Lexical
similarity or orthographic redundancy? Journal of Experimental Psychology:
Learning, Memory, and Cognition, 18(2), 234–254.
doi:10.1037/0278-7393.18.2.234.
Andrews, S. (2008). Lexical expertise and reading skill. In B. H. Ross (Ed.), The
psychology of learning and motivation: Advances in research and theory (Vol. 49,
pp. 247–281). San Diego: Elsevier.
Andrews, S. (Ed.). (2010). From inkmarks to ideas: Current issues in lexical pro-
cessing. Hove: Psychology Press.
Bachman, L. (1990). Fundamental considerations in language testing. Oxford:
Oxford University Press.
Bachman, L. F., & Palmer, A. (2010). Language assessment in practice: Developing
language assessments and justifying their use in the real world. Oxford: Oxford
University Press.
Baddeley, A. (2012). Working memory: Theories, models, and controversies.
Annual Review of Psychology, 63, 1–29.
Baddeley, A. D., & Hitch, G. (1974). Working memory. In G. A. Bowers (Ed.),
The psychology of learning and motivation (Vol. 8, pp. 47–89). New York:
Academic Press.
Bader, M., & Häussler, J. (2010). Toward a model of grammaticality judgments.
Journal of Linguistics, 46(2), 273–330. doi:10.1017/S0022226709990260.
Balota, D. A., & Chumbley, J. I. (1984). Are lexical decisions a good measure of
lexical access? The role of word frequency in the neglected decision phase.
Journal of Experimental Psychology: Human Perception and Performance, 10(3),
340–357. doi:10.1037/0096-1523.10.3.340.
Balota, D. A., Cortese, M. J., Sergent-Marshall, S. D., Spieler, D. H., & Yap,
M. J. (2004). Visual word recognition of single-syllable words. Journal of
Experimental Psychology: General, 133(2), 382–416.
Balota, D. A., Yap, M. J., & Cortese, M. J. (2006). Visual word recognition: The
journey from features to meaning (a travel update). In M. J. Traxler & M. A.
Gernsbacher (Eds.), Handbook of psycholinguistics (2nd ed., pp. 285–375).
Amsterdam: Elsevier.
Barcroft, J. (2007). Effects of opportunities for word retrieval during second
language vocabulary learning. Language Learning, 57(1), 35–56.
Bardel, C., & Lindqvist, C. (2011). Developing a lexical profiler for spoken
French L2 and Italian L2: The role of frequency, thematic vocabulary and
cognates. EUROSLA Yearbook, 11, 75–93. doi:10.1075/eurosla.11.06bar.
Bauer, L., & Nation, I. S. P. (1993). Word families. International Journal of
Lexicography, 6(4), 253–279.
Beeckmans, R., Eyckmans, J., Janssens, V., Dufranne, M., & Van de Velde, H.
(2001). Examining the yes/no vocabulary test: Some methodological issues
in theory and practice. Language Testing, 18(3), 235–274.
Beglar, D. (2010). A Rasch-based validation of the vocabulary size test. Language
Testing, 27(1), 101–118. doi:10.1177/0265532209340194.
Beglar, D., & Hunt, A. (1999). Revising and validating the 2000 word level and
the university word level vocabulary tests. Language Testing, 16(2), 131–162.
doi:10.1191/026553299666419728.
Bell, L. C., & Perfetti, C. A. (1994). Reading skill: Some adult comparisons.
Journal of Educational Psychology, 86(2), 244–255. doi:10.1037/0022-
0663.86.2.244.
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating
language structure and use. Cambridge: Cambridge University Press.
Brown, J. D. (2005). Testing in language programs: A comprehensive guide to
English language assessment. New York: McGraw-Hill.
Bruton, A. (2009). The vocabulary knowledge scale: A critical analysis. Language
Assessment Quarterly, 6(4), 288–297.
Bundgaard-Nielsen, R. L., Best, C. T., & Tyler, M. D. (2011). Vocabulary size
is associated with second-language vowel perception performance in adult
learners. Studies in Second Language Acquisition, 33(3), 433–461. doi:10.1017/
S0272263111000040.
Cameron, L. (2002). Measuring vocabulary size in English as an additional lan-
guage. Language Teaching Research, 6(2), 145–173.
Carreiras, M., Perea, M., & Grainger, J. (1997). Effects of the orthographic
neighborhood in visual word recognition: Cross-task comparisons. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 23(4), 857.
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies.
Cambridge: Cambridge University Press.
Chapelle, C. A. (1998). Construct definition and validity inquiry in SLA
research. In L. F. Bachman & A. E. Cohen (Eds.), Interfaces between second
language acquisition and language testing (pp. 32–70). Cambridge: Cambridge
University Press.
Chapelle, C. A. (2006). L2 vocabulary acquisition theory: The role of infer-
ence, dependability and generalizability in assessment. In M. Chalhoub-
Deville, C. A. Chapelle, & P. Duff (Eds.), Inference and generalizability in
applied linguistics: Multiple perspectives (pp. 47–64). Amsterdam: John
Benjamins.
Chapelle, C. A., Enright, M. K., & Jamieson, J. (2010). Does an argument-­
based approach to validity make a difference? Educational Measurement: Issues
and Practice, 29(1), 3–13.
Clark, M. K., & Ishida, S. (2005). Vocabulary knowledge differences between
placed and promoted students. Journal of English for Academic Purposes, 4(3),
225–238. doi:10.1016/j.jeap.2004.10.002.
Cobb, T. (2007). Computing the vocabulary demands of L2 reading. Language
Learning & Technology, 11(3), 38–63.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.).
Hillsdale: Lawrence Erlbaum.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2),
213–238.
Coxhead, A., & Nation, P. (2001). The specialised vocabulary of English for
academic purposes. In J. Flowerdew & M. Peacock (Eds.), Research perspec-
tives on English for academic purposes (pp. 252–267). Cambridge: Cambridge
University Press.
Crossley, S. A., Subtirelu, N., & Salsbury, T. (2013). Frequency effects or con-
text effects in second language word learning. Studies in Second Language
Acquisition, 35, 727–755. doi:10.1017/S0272263113000375.
Culligan, B. (2015). A comparison of three test formats to assess word difficulty.
Language Testing, 32(4), 503–520.
Davies, M. (2008). The corpus of contemporary American English: 450 million
words, 1990–present. Available online at http://corpus.byu.edu/coca/
De Groot, A. M., Delmaar, P., & Lupker, S. J. (2000). The processing of inter-
lexical homographs in translation recognition and lexical decision: Support
for non-selective access to bilingual memory. The Quarterly Journal of
Experimental Psychology: Section A, 53(2), 397–428.
Elgort, I. (2013). Effects of L1 definitions and cognate status of test items on the
vocabulary size test. Language Testing, 30(2), 253–272. doi:10.1177/02655
32212459028.
Ellis, N. C. (2002). Frequency effects in language processing: A review with
implications for theories of implicit and explicit language acquisition. Studies
in Second Language Acquisition, 24(2), 143–188.
Ellis, N. C. (2012). What can we count in language, and what counts in lan-
guage acquisition, cognition, and use? In S. T. Gries & D. Divjak (Eds.),
Frequency effects in language learning and processing (pp. 7–34). Berlin:
DeGruyter Mouton.
Eyckmans, J. (2004a). Measuring receptive vocabulary size. Utrecht: LOT.
Eyckmans, J. (2004b). Learners’ response behavior in yes/no vocabulary tests. In
H. Daller, M. Milton, & J. Treffers-Daller (Eds.), Modelling and assessing
vocabulary knowledge (pp. 59–76). Cambridge: Cambridge University Press.
Favreau, M., & Segalowitz, N. S. (1983). Automatic and controlled processes in
the first and second language of reading fluent bilinguals. Memory and
Cognition, 11(6), 565–574. doi:10.3758/BF03198281.
Fender, M. J. (2001). A review of L1 and L2/ESL word integration development
involved in lower-level text processing. Language Learning, 51(2), 319–396.
doi:10.1111/0023-8333.00157.
Fender, M. (2003). English word recognition and word integration skills of
native Arabic- and Japanese-speaking learners of English as a second lan-
guage. Applied Psycholinguistics, 24(2), 289–315.
Fender, M. (2008). Spelling knowledge and reading development: Insights from
Arab ESL learners. Reading in a Foreign Language, 20(1), 19–42.
Field, A. (2009). Discovering statistics using SPSS (3rd ed.). London: Sage.
Fitzpatrick, T. (2006). Habits and rabbits: Word associations and the L2 lexi-
con. EUROSLA Yearbook, 6(1), 147–168.
Fodor, J. (1983). Modularity of mind. Cambridge, MA: MIT Press.
Fritz, C. O., Morris, P. E., & Richler, J. J. (2011). Effect size estimates: Current
use, calculations, and interpretation. Journal of Experimental Psychology:
General, 141(1), 2–18. doi:10.1037/a0024338.
Fukkink, R. G., Hulstijn, J., & Simis, A. (2005). Does training in second-­
language word recognition affect reading comprehension? An experimental
study. Modern Language Journal, 89(1), 54–75. doi:10.1111/j.0026-
7902.2005.00265.x.
Gardner, D. (2007). Validating the construct of ‘word’ in applied corpus-based
vocabulary research: A critical survey. Applied Linguistics, 28(2), 242–265.
doi:10.1093/applin/amm010.
Gee, R. W., & Nguyen, L. T. C. (2015). The bilingual vocabulary size test for
Vietnamese learners: Reliability and use in placement testing. Asian Journal
of English Language Teaching, 25, 63–80.
Gelderen, A. V., Schoonen, R., Glopper, K. D., Hulstijn, J., Simis, A., Snellings,
P., & Stevenson, M. (2004). Linguistic knowledge, processing speed, and
metacognitive knowledge in first- and second-language reading compre-
hension: A componential analysis. Journal of Educational Psychology, 96(1),
19–30.
Gelderen, A. V., Schoonen, R., Stoel, R. D., Glopper, K. D., & Hulstijn, J. (2007).
Development of adolescent reading comprehension in language 1 and language
2: A longitudinal analysis of constituent components. Journal of Educational
Psychology, 99(3), 477–491. doi:10.1037/0022-0663.99.3.477.
Geva, E., & Wang, M. (2001). The development of basic reading skills in chil-
dren: A cross-language perspective. Annual Review of Applied Linguistics, 21,
182–204.
Grabe, W. (2009). Reading in a second language: Moving from theory to practice.
New York: Cambridge University Press.
Green, D., & Swets, J. A. (1966). Signal detection theory and psychophysics.
New York: Wiley.
Grigorenko, E. L., & Naples, A. J. (2012). Single-word reading: Behavioral and
biological perspectives. New York: Taylor & Francis.
Hannon, B. (2012). Understanding the relative contributions of lower-level
word processes, higher-level processes, and working memory to reading com-
prehension performance in proficient adult readers. Reading Research
Quarterly, 47(2), 125–152. doi:10.1002/RRQ.013.
Harrington, M. (2006). The lexical decision task as a measure of L2 lexical pro-
ficiency. EUROSLA Yearbook, 6(1), 147–168.
Harrington, M., & Carey, M. (2009). The online yes/no test as a placement
tool. System, 37(4), 614–626. doi:10.1016/j.system.2009.09.006.
Harrington, M., & Jiang, W. (2013). Focus on the forms: From recognition
practice in Chinese vocabulary learning. Australian Review of Applied
Linguistics, 36(2), 132–145.
Harrington, M., & Levy, M. (2001). CALL begins with a ‘C’: Interaction in
computer-mediated language learning. System, 29(1), 15–26.
Harrington, M., & Roche, T. (2014a). Word recognition skill and academic
achievement across disciplines in an English-as-lingua-franca setting. In
U. Knoch (Ed.), Papers in Language Testing, 16, 4.
Harrington, M., & Roche, T. (2014b). Identifying academically at-risk students
at an English-medium university in Oman: Post-enrolment language assess-
ment in an English-as-a-foreign language setting. Journal of English for
Academic Purposes, 15, 34–37.
Harsch, C., & Hartig, J. (2015). Comparing C-tests and yes/no vocabulary size
tests as predictors of receptive language skills. Language Testing, 33(4),
555–575.
Hazenberg, S., & Hulstijn, J. H. (1996). Defining a minimal receptive vocabu-
lary for non-native university students: An empirical investigation. Applied
Linguistics, 17(2), 145–163.
Heitz, R. P. (2014). The speed-accuracy tradeoff: History, physiology, methodol-
ogy, and behavior. Frontiers in Neuroscience, 8, 150.
Hird, K., & Kirsner, K. (2010). Objective measurement of fluency in natural
language production: A dynamic systems approach. Journal of Neurolinguistics,
23(5), 518–530. doi:10.1016/j.jneuroling.2010.03.001.
Hirsh, D., & Nation, P. (1992). What vocabulary size is needed to read
unsimplified texts for pleasure? Reading in a Foreign Language, 8(2), 689–696.
Holden, J. G., Van Orden, G. C., & Turvey, M. T. (2009). Dispersion of
response times reveals cognitive dynamics. Psychological Review, 116(2),
318–342. doi:10.1037/a0014849.
Holmes, V. M. (2009). Bottom-up processing and reading comprehension in
experienced adult readers. Journal of Research in Reading, 32(3), 309–326.
doi:10.1111/j.1467-9817.2009.01396.
Hoover, W. A., & Gough, P. B. (1990). The simple view of reading. Reading and
Writing, 2(2), 127–160. doi:10.1007/BF00401799.
Hu, M., & Nation, P. (2000). Unknown vocabulary density and reading com-
prehension. Reading in a Foreign Language, 13(1), 403–430.
Huibregtse, I., Admiraal, W., & Meara, P. (2002). Scores on a yes-no vocabulary
test: Correction for guessing and response style. Language Testing, 19(3),
227–245.
Hulstijn, J. H. (2011). Language proficiency in native and nonnative speakers:
An agenda for research and suggestions for second-language assessment.
Language Assessment Quarterly, 8(3), 229–249.
Hulstijn, J. H., Van Gelderen, A., & Schoonen, R. (2009). Automatization in
second language acquisition: What does the coefficient of variation tell us?
Applied Psycholinguistics, 30(04), 555–582.
Jackson, N. E. (2005). Are university students’ component reading skills related
to their text comprehension and academic achievement? Learning and
Individual Differences, 15(2), 113–139. doi:10.1016/j.lindif.2004.11.001.
Jacobs, A. M., & Grainger, J. (1994). Models of visual word recognition:
Sampling the state of the art. Journal of Experimental Psychology: Human
Perception and Performance, 20(6), 1311.
Jamieson, J. (2005). Trends in computer-based second language assessment.
Annual Review of Applied Linguistics, 25, 228–242.
Jeon, E. H., & Yamashita, J. (2014). L2 reading comprehension and its corre-
lates: A meta-analysis. Language Learning, 64(1), 160–212. doi:10.1111/
lang.12034.
Jiang, N. (2012). Conducting reaction time research in second language studies.
London/New York: Routledge.
Jiang, X., Sawaki, Y., & Sabatini, J. (2012). Word reading efficiency and oral
reading fluency in ESL reading comprehension. Reading Psychology, 33(4),
323–349. doi:10.1080/02702711.2010.526051.
Juffs, M., & Harrington, M. (2011). Aspects of working memory in L2 learn-
ing. Language Teaching, 44(2), 137–166. doi:10.1017/S026144481000
0509.
Just, M. A., & Carpenter, P. A. (1992). A capacity theory of comprehension:
Individual differences in working memory. Psychological Review, 99(1),
122–149.
Kane, M., & Case, S. M. (2004). The reliability and validity of weighted com-
posite scores. Applied Measurement in Education, 17(3), 221–240.
Kanzaki, M. (2015). Comparing TOEIC® and vocabulary test scores. In
G. Brooks, M. Grogan, & M. Porter (Eds.), The 2014 PanSIG conference
proceedings (pp. 52–58). Miyazaki: JALT.
Karpicke, J. D., & Roediger, H. L. (2007). Repeated retrieval during learning
is the key to long-term retention. Journal of Memory and Language, 57(2),
151–162. doi:10.1016/j.jml.2006.09.004.
Kempe, V., & MacWhinney, B. (1996). The crosslinguistic assessment of for-
eign language vocabulary learning. Applied Psycholinguistics, 17(2), 149–183.
doi:10.1017/S0142716400007621.
Keuleers, E., & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword gen-
erator. Behavior Research Methods, 42(3), 627–633. ­ doi:10.3758/BRM.
42.3.627.
Kintsch, W. (1998). Comprehension: A paradigm for cognition. Cambridge:
Cambridge University Press.
Kintsch, W. (2005). An overview of top-down and bottom-up effects in com-
prehension: The C-I perspective. Discourse Processes, 39(2–3), 125–128. doi:
10.1080/0163853X.2005.9651676.
Koda, K. (1996). L2 word recognition research: A critical review. The Modern
Language Journal, 80(4), 450–460.
Koda, K. (2005). Insights into second language reading: A cross-linguistic approach.
New York: Cambridge University Press.
Koda, K. (2007). Reading and language learning: Crosslinguistic constraints on
second language reading development. Language Learning, 57, 1–44. doi:10.
1111/0023-8333.101997010-i1.
Kojic-Sabo, I., & Lightbown, P. M. (1999). Students’ approaches to vocabulary
learning and their relationship to success. The Modern Language Journal,
83(2), 176–192. doi:10.1111/0026-7902.00014.
Kroll, J. F., & Bialystok, E. (2013). Understanding the consequences of bilin-
gualism for language processing and cognition. Journal of Cognitive Psychology,
25(5), 497–514.
Kroll, J. F., & Stewart, E. (1994). Category interference in translation and pic-
ture naming: Evidence for asymmetric connections between bilingual mem-
ory representations. Journal of Memory and Language, 33(2), 149–174.
doi:10.1006/jmla.1994.1008.
Kroll, J. F., & Tokowicz, N. (2001). The development of conceptual representa-
tion for words in a second language. In J. L. Nicol & T. Langendoen (Eds.),
Language processing in bilinguals (pp. 49–71). Cambridge, MA: Blackwell.
Kroll, J. F., Michael, E., Tokowicz, N., & Dufour, R. (2002). The development
of lexical fluency in a second language. Second Language Research, 18(2),
137–171.
LaBerge, D., & Samuels, S. J. (1974). Toward a theory of automatic informa-
tion processing in reading. Cognitive Psychology, 6(2), 293–323. doi:10.1016/
0010-0285(74)90015-2.
Lam, Y. (2010). Yes/No tests for foreign language placement at the post-­
secondary level. Canadian Journal of Applied Linguistics/Revue canadienne de
linguistique appliquee, 13(2), 54–72.
Larson-Hall, J. (2016). A guide to doing statistics in second language research using
SPSS and R. New York: Routledge.
Larson-Hall, J., & Herrington, R. (2010). Improving data analysis in second
language acquisition by utilizing modern developments in applied statistics.
Applied Linguistics, 31(3), 368–390.
Larson-Hall, J., & Plonsky, L. (2015). Reporting and interpreting quantitative
research findings: What gets reported and recommendations for the field.
Language Learning, 65(S1), 127–159.
Laufer, B. (1989). What percentage of text-lexis is essential for comprehension?
In C. Lauren & M. Nordman (Eds.), Special language: From humans thinking
to thinking machines (pp. 316–323). Clevedon: Multilingual Matters.
Laufer, B. (1992). How much lexis is necessary for reading comprehension? In
P. J. L. Arnaud & H. Béjoint (Eds.), Vocabulary and applied linguistics
(pp. 126–132). London: Macmillan. doi:10.1007/978-1-349-12396-4_12.
Laufer, B. (2001). Quantitative evaluation of vocabulary: How it can be done
and what it is good for. In C. Elder, K. Hill, A. Brown, N. Iwashita,
L. Grove, T. Lumley, & T. McNamara (Eds.), Experimenting with uncer-
tainty: Essays in honour of Alan Davies (pp. 241–250). Cambridge:
Cambridge University Press.
Laufer, B. (2005a). Focus on form in second language vocabulary learning.
EUROSLA Yearbook, 5(1), 223–250.
Laufer, B. (2005b). Lexical frequency profiles: From Monte Carlo to the real
world. A response to Meara. Applied Linguistics, 26(4), 582–588.
Laufer, B. (2013). Lexical frequency profiles. In C. A. Chapelle (Ed.), The ency-
clopedia of applied linguistics. Boston: Wiley-Blackwell.
Laufer, B., & Goldstein, Z. (2004). Testing vocabulary knowledge: Size,
strength, and computer adaptiveness. Language Learning, 54(3), 399–436.
doi:10.1111/j.0023-8333.2004.00260.x.
Laufer, B., & Levitzky-Aviad, T. (2016). CATTS (Computer Adaptive Test of
Size & Strength). Downloaded May 1, 2016, from http://www.lextutor.ca/
tests/levels/recognition/nvlt/paper.pdf
Laufer, B., & Nation, P. (1995). Vocabulary size and use: Lexical richness in L2
written production. Applied Linguistics, 16(3), 307–322.
Laufer, B., & Nation, P. (2001). Passive vocabulary size and speed of meaning
recognition: Are they related? EUROSLA Yearbook, 1(1), 7–28.
Laufer, B., & Ravenhorts-Kalovski, G. C. (2010). Lexical threshold revisited:
Lexical text coverage, learners’ vocabulary size and reading comprehension.
Reading in a Foreign Language, 22(1), 15–30.
Laufer, B., Elder, C., Hill, K., & Congdon, P. (2004). Size and strength: Do we
need both to measure vocabulary knowledge? Language Testing, 21(2),
202–226.
Lee, Y. H., & Chen, H. (2011). A review of recent response-time analyses in edu-
cational testing. Psychological Test and Assessment Modeling, 53(3), 359–379.
Lee, I. A., & Preacher, K. J. (2013, September). Calculation for the test of the
difference between two dependent correlations with one variable in common
[Computer software]. Available from http://quantpsy.org
Leech, G., Rayson, P., & Wilson, A. (2001). Word frequencies in spoken and writ-
ten English. London: Longman.
Lemmouh, Z. (2008). The relationship between grades and the lexical richness
of student essays. Nordic Journal of English Studies, 7(3), 163–180.
Lenhard, W., & Lenhard, A. (2014). Calculation of effect sizes. Retrieved
November 29, 2014, from http://www.psychometrica.de/effect_size.html
Lightbown, P. M., & Spada, N. (2013). How languages are learned (4th ed.).
Oxford: Oxford University Press.
Luce, R. D. (1986). Response times. New York: Oxford University Press.
Magnuson, J. S. (2008). Nondeterminism, pleiotropy, and single-word reading:
Theoretical and practical concerns. In E. L. Grigorenko & A. J. Naples (Eds.),
Single-word reading: Behavioral and biological perspectives (pp. 377–404).
New York: Lawrence.
Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing
data: A model comparison perspective (2nd ed.). New York: Psychology Press.
McCarthy, M. (1998). Spoken language and applied linguistics. Cambridge:
Cambridge University Press.
McLean, S., Hogg, N., & Kramer, B. (2014). Estimations of Japanese university
learners’ English vocabulary sizes using the vocabulary size test. Vocabulary
Learning and Instruction, 3(2), 47–55.
McLean, S., Kramer, B., & Beglar, D. (2015). The creation of a new vocabulary
levels test. Language Teaching Research, 19(6), 741–760. doi:10.1177/
1362168814567889.
McNamara, T. F. (1996). Measuring second language performance. London:
Addison Wesley Longman.
Meara, P. (1989). Matrix models of vocabulary acquisition. AILA Review, 6,
66–74.
Meara, P. (1996). The dimensions of lexical competence. In G. Brown,
K. Malmkjaer, & J. Williams (Eds.), Performance and competence in second
language acquisition (pp. 35–53). Cambridge: Cambridge University Press.
Meara, P. (2002). The rediscovery of vocabulary. Second Language Research,
18(4), 393–407. doi:10.1191/0267658302sr211xx.
Meara, P. (2005). Lexical frequency profiles: A Monte Carlo analysis. Applied
Linguistics, 26(1), 32–47.
Meara, P. (2009). Connected words: Word associations and second language vocabu-
lary acquisition. Amsterdam: John Benjamins.
Meara, P., & Buxton, B. (1987). An alternative to multiple choice vocabulary
tests. Language Testing, 4(2), 142–145.
Meara, P., & Jones, G. (1987). Tests of vocabulary size in English as a foreign
language. Polyglot, 8(1), 1–40.
Meara, P., & Jones, G. (1988). Vocabulary size as placement indicator. In
P. Grunwell (Ed.), Applied linguistics in society (pp. 80–87). London: CILT.
Meara, P., & Jones, G. (1990). Eurocentres vocabulary size test. 10KA. Zurich:
Eurocentres.
Meara, P. M., & Milton, J. L. (2002). X_Lex: The Swansea vocabulary levels test.
Newbury: Express.
Meara, P. M., & Milton, J. (2003). X_Lex: The Swansea vocabulary levels test.
Swansea: Lognostics.
Meara, P. M., & Miralpeix, I. (2006). Y_Lex: The Swansea advanced vocabulary
levels test. v2.05. Swansea: Lognostics.
Meara, P., Lightbown, P. M., & Halter, R. H. (1994). The effects of cognates on
the applicability of yes/no vocabulary tests. The Canadian Modern Language
Review, 50(2), 296–311.
Messick, S. (1995). Validity of psychological assessment: Validation of infer-
ences from persons’ responses and performances as scientific inquiry into
score meaning. American Psychologist, 50(9), 741.
Milton, J. (2009). Measuring second language vocabulary acquisition. Bristol:
Multilingual Matters.
Milton, J., & Alexiou, T. (2009). Vocabulary size and the common European
framework of reference for languages. In B. Richards, M. H. Daller, D. D.
Malvern, P. Meara, J. Milton, & J. Treffers-Daller (Eds.), Vocabulary studies
in first and second language acquisition (pp. 194–211). Basingstoke: Palgrave
Macmillan.
Milton, J., & Hopkins, N. (2006). Lexical profiles, learning styles and the con-
struct validity of lexical size tests. In H. Daller, J. Milton, & J. Treffers-Daller
(Eds.), Modelling and assessing vocabulary knowledge (pp. 47–58). Cambridge:
Cambridge University Press.
Miralpeix, I. (2007). Lexical knowledge in instructed language learning: The
effects of age and exposure. International Journal of English Studies, 7(2),
61–83.
Miralpeix, I., & Meara, P. (2010). The written word. Retrieved from www.lognostics.co.uk/Vlibrary
Mochida, A., & Harrington, M. (2006). The yes-no test as a measure of recep-
tive vocabulary knowledge. Language Testing, 23(1), 73–98. doi:10.1191/02
65532206lt321oa.
Moder, K. (2010). Alternatives to F-test in one way ANOVA in case of hetero-
geneity of variances (a simulation study). Psychological Test and Assessment
Modeling, 52(4), 343–353.
Nagy, W. E., Anderson, R., Schommer, M., Scott, J. A., & Stallman, A. (1989).
Morphological families in the internal lexicon. Reading Research Quarterly,
24(3), 263–282. doi:10.2307/747770.
Nassaji, H. (2014). The role and importance of lower-level processes in second
language reading. Language Teaching, 47(1), 1–37.
Nassaji, H., & Geva, E. (1999). The contribution of phonological and ortho-
graphic processing skills to adult ESL reading: Evidence from native speakers
of Farsi. Applied Psycholinguistics, 20(2), 241–267.
Nation, I. S. P. (2006). How large a vocabulary is needed for reading and lis-
tening? The Canadian Modern Language Review/La Revue Canadienne des
Langues Vivantes, 63(1), 59–82.
Nation, I. S. P. (2012). The vocabulary size test: Information and specifications.
Retrieved from http://www.victoria.ac.nz/lals/about/staff/publications/paul-
nation/Vocabulary-Size-Test-information-and-specifications.pdf
Nation, I. S. P. (2013). Learning vocabulary in another language (2nd ed.).
Cambridge: Cambridge University Press.
Nation, P., & Coxhead, A. (2014). Vocabulary size research at Victoria University
of Wellington, New Zealand. Language Teaching, 47(03), 398–403.
Norris, D. (2013). Models of visual word recognition. Trends in Cognitive
Sciences, 17(1), 517–524. doi:10.1016/j.tics.2013.08.003.
Pachella, R. G. (1974). The interpretation of reaction time in information pro-
cessing research. In B. H. Kantowitz (Ed.), Human information processing:
Tutorials in performance and cognition (pp. 41–82). Hillsdale: Lawrence
Erlbaum Associates, Inc.
Paradis, M. (2004). A neurolinguistic theory of bilingualism. Amsterdam: John
Benjamins.
Paradis, M. (2009). Declarative and procedural determinants of second languages
(Vol. 40). Amsterdam: John Benjamins Publishing.
Paradis, J. (2010). Bilingual children’s acquisition of English verb morphology:
Effects of language exposure, structure complexity, and task type. Language
Learning, 60(3), 651–680.
Pellicer-Sánchez, A., & Schmitt, N. (2012). Scoring yes-no vocabulary tests:
Reaction time vs. nonword approaches. Language Testing, 29(4), 489–509.
doi:10.1177/0265532212438053.
Perea, M., Rosa, E., & Gómez, C. (2002). Is the go/no-go lexical decision task
an alternative to the yes/no lexical decision task? Memory & Cognition, 30(1),
34–45.
Perfetti, C. A. (1985). Reading ability. New York: Oxford University Press.
Perfetti, C. A. (2007). Reading ability: Lexical ability to comprehension. Scientific
Studies of Reading, 11(4), 357–383. doi:10.1080/10888430701530730.
Perfetti, C. A., & Hart, L. (2001). The lexical basis of comprehension skill. In
D. S. Gorfien (Ed.), On the consequences of meaning selection: Perspectives on
resolving lexical ambiguity (pp. 67–86). Washington, DC: American
Psychological Association.
Perfetti, C. A., & Hart, L. (2002). The lexical quality hypothesis. In L. Verhoeven
(Ed.), Precursors of functional literacy (pp. 189–213). Philadelphia: Benjamins.
Perfetti, C. A., & Stafura, J. (2014). Word knowledge in a theory of reading
comprehension. Scientific Studies of Reading, 18(1), 22–37. doi:10.1080/108
88438.2013.827687.
Plonsky, L., & Derrick, D. J. (2016). A meta-analysis of reliability coefficients
in second language research. The Modern Language Journal, 100, 538–553.
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes
in L2 research. Language Learning, 64, 878–912. doi:10.1111/lang. 12079.
Qian, D. D. (1999). Assessing the roles of depth and breadth of vocabulary
knowledge in reading comprehension. The Canadian Modern Language
Review, 56(2), 282–307.
Ratcliff, R., Gomez, P., & McKoon, G. (2004). A diffusion model account of
the lexical decision task. Psychological Review, 111(1), 159–182.
Raymond, W. D., & Brown, E. L. (2012). Are effects of word frequency effects
of contexts of use? In S. T. Gries & D. Divjak (Eds.), Frequency effects in lan-
guage learning and processing (pp. 35–52). Berlin: De Gruyter Mouton.
Read, J. (2000). Assessing vocabulary. Cambridge: Cambridge University Press.
Read, J. (2004a). Plumbing the depths: How should the construct of vocabulary
knowledge be defined? In P. Bogaards & B. Laufer (Eds.), Vocabulary in a
second language: Selection, acquisition, and testing (pp. 209–227). Amsterdam:
John Benjamins.
Read, J. (2004b). Research in teaching vocabulary. Annual Review of Applied
Linguistics, 24, 146–161.
Read, J. (2016). Post-admission language assessment in universities: International
perspectives. Switzerland: Springer International Publishing.
Read, J., & Chapelle, C. A. (2001). A framework for second language vocabu-
lary assessment. Language Testing, 18(1), 1–32.
Read, J., & Nation, P. (2009). Introduction: Meara’s contribution to research in
lexical processing. In T. Fitzpatrick & A. Barfield (Eds.), Lexical processing in
second language learners (pp. 1–12). Bristol: Multilingual Matters.
Read, J., & Shiotsu, T. (2010). Extending the yes/no test as a measure of the English
vocabulary knowledge of Japanese learners. Paper presented at the colloquium
on the measurement of L2 vocabulary development at the 2010 Annual
Conference of the Applied Linguistics Association of Australia, Brisbane.
Richards, J. C. (1976). The role of vocabulary teaching. TESOL Quarterly, 10,
77–89.
Richards, B. (1987). Type/token ratios: What do they really tell us? Journal of
Child Language, 14(2), 201–209. doi:10.1017/S0305000900012885.
Richland, L. E., Kornell, N., & Kao, L. S. (2009). The pretesting effect: Do
unsuccessful retrieval attempts enhance learning? Journal of Experimental
Psychology: Applied, 15(3), 243–257. doi:10.1037/a0016496.
Roche, T., & Harrington, M. (2013). Recognition vocabulary knowledge as a
predictor of academic performance in an English as a foreign language set-
ting. Language Testing in Asia, 3(1), 1–13. doi:10.1186/2229-0443-3-12.
Roche, T., & Harrington, M. (2017). Offshore and onsite placement testing for
English pathway programmes. Journal of Further and Higher Education. doi:
10.1080/0309877X.2017.1301403. Published online May 9, 2017.
Roediger, H. L., III, & Karpicke, J. D. (2006). Test-enhanced learning: Taking
memory tests improves long-term retention. Psychological Science, 17(3),
249–255.
Sawaki, Y. (2007). Construct validation of analytic rating scales in a speaking
assessment: Reporting a score profile and a composite. Language Testing,
24(3), 355–390. doi:10.1177/0265532207077205.
Schmitt, N. (2010). Researching vocabulary. A vocabulary research manual.
Basingstoke: Palgrave Macmillan.
Schmitt, N. (2014). Size and depth of vocabulary knowledge: What the research
shows. Language Learning, 64(4), 913–951.
Schmitt, N., & Schmitt, D. (2014). A reassessment of frequency and vocabulary
size in L2 vocabulary teaching. Language Teaching, 47(4), 484–503.
doi:10.1017/S0261444812000018.
Schmitt, N., & Zimmerman, C. B. (2002). Derivative word forms: What do
learners know? TESOL Quarterly, 36(2), 145–171. doi:10.2307/3588328.
Schmitt, N., Schmitt, D., & Clapham, C. (2001). Developing and exploring
the behaviour of two new versions of the vocabulary levels test. Language
Testing, 18(1), 55–89. doi:10.1191/026553201668475857.
Schmitt, N., Jiang, X., & Grabe, W. (2011). The percentage of words known in
a text and reading comprehension. The Modern Language Journal, 95(1),
26–43. doi:10.1111/j.1540-4781.2011.01146.x.
Schnipke, D. L., & Scrams, D. J. (2002). Exploring issues of examinee behavior:
Insights gained from response-time analyses. In C. N. Mills, M. Potenza, J. J.
Fremer, & W. Ward (Eds.), Computer-based testing: Building the foundation of
future assessments (pp. 237–266). Hillsdale: Lawrence Erlbaum Associates.
Segalowitz, N. (2005). Automaticity and second languages. In C. Doughty &
M. Long (Eds.), The handbook of second language acquisition (pp. 382–408).
Oxford: Blackwell.
Segalowitz, N. (2007). Access fluidity, attention control, and the acquisition of
fluency in a second language. TESOL Quarterly, 41(1), 181–186.
Segalowitz, N. (2010). Cognitive bases of second language fluency. New York:
Routledge.
Segalowitz, N., & Freed, B. (2004). Context, contact and cognition in oral flu-
ency acquisition: Learning Spanish in at home and study abroad contexts.
Studies in Second Language Acquisition, 26(2), 173–199. doi:10.1017/
S0272263104262027.
Segalowitz, N., & Segalowitz, S. J. (1993). Skilled performance, practice and
differentiation of speed-up from automatization effects: Evidence from sec-
ond language word recognition. Applied Psycholinguistics, 14(3), 369–385.
doi:10.1017/S0142716400010845.
Segalowitz, N., Watson, V., & Segalowitz, S. J. (1995). Vocabulary skill: Single
case assessment of automaticity of word recognition in a timed lexical deci-
sion task. Second Language Research, 11(2), 121–136.
Segalowitz, N., Segalowitz, S. J., & Wood, A. G. (1998). Assessing the develop-
ment of automaticity in second language word recognition. Applied
Psycholinguistics, 19(1), 53–67.
Shah, S. K., Gill, A. A., Mahmood, R., & Bilal, M. (2013). Lexical richness, a
reliable measure of intermediate L2 learners’ current status of acquisition of
English language. Journal of Education and Practice, 4(6), 42–47.
Shillaw, J. (1996). The application of Rasch modelling to yes/no vocabulary tests.
Swansea: Vocabulary Acquisition Research Group, University of Wales Swansea.
Shillaw, J. (2009). Putting yes/no tests in context. In T. Fitzpatrick & A. Barfield
(Eds.), Lexical processing in second language learners (pp. 13–24). Bristol:
Multilingual Matters.
Shiotsu, T. (2009). Reading ability and components of word recognition speed:
The case of L1-Japanese EFL learners. In Z. Han & N. J. Anderson (Eds.),
Second language reading research and instruction: Crossing the boundaries
(pp. 15–39). Ann Arbor: University of Michigan Press.
Shiotsu, T., & Read, J. (2009, November). Extending the yes/no test as a measure
of the English vocabulary knowledge of Japanese learners. Paper presented at The
measurement of L2 lexical development colloquium, Annual Conference of
the Applied Linguistics Association of Australia, Brisbane.
Siakaluk, P. D., Buchanan, L., & Westbury, C. (2003). The effect of semantic
distance in yes/no and go/no-go semantic categorization tasks. Memory &
Cognition, 31(1), 100–113.
Siegel, L. S. (2005). A comparison of the cognitive processes underlying reading
comprehension in native English and ESL speakers. Written Language &
Literacy, 8(2), 207–231.
Skehan, P. (1989). Individual differences in second-language learning. London:
Edward Arnold.
Snellings, P., van Gelderen, A., & de Glopper, K. (2002). Lexical retrieval: An
aspect of fluent second language production that can be enhanced. Language
Learning, 52(4), 723–754.
Spence, R., & Witkowski, M. (2013). Rapid serial visual presentation: Design for
cognition. London/New York: Springer. ISBN 9781447150855.
Stæhr, L. S. (2008). Vocabulary size and the skills of listening, reading and writ-
ing. Language Learning Journal, 36(2), 139–152. doi:10.1080/09571
730802389975.
Stanovich, K. E. (1990). Concepts in developmental theories of reading skill:
Cognitive resources, automaticity, and modularity. Developmental Review,
10(1), 72–100. doi:10.1016/0273-2297(90)90005-O.
Stanovich, K. E., West, R. F., & Cunningham, A. E. (1991). Beyond phonologi-
cal processes: Print exposure and orthographic processing. In Phonological
processes in literacy: A tribute to Isabelle Y. Liberman (pp. 219–235). Hillsdale:
Lawrence Erlbaum Associates.
Sternberg, S. (1998). Inferring mental operations from reaction time data: How
we compare objects. In D. N. Osherson, D. Scarborough, & S. Sternberg
(Eds.), An invitation to cognitive science, Methods, models, and conceptual issues
(Vol. 4, pp. 436–440). Cambridge, MA: MIT Press.
Stewart, J. (2014). Do multiple-choice options inflate estimates of vocabulary
size on the VST? Language Assessment Quarterly, 11(3), 271–282. doi:10.1080/
15434303.2014.922977.
Stone, G., & Van Orden, C. (1992). Resolving empirical inconsistencies con-
cerning priming, frequency, and nonword foils in lexical decision. Language
and Speech, 35(3), 295–324. doi:10.1177/002383099203500302.
Stubbe, R. (2012). Do pseudoword false alarm rates and overestimation rates in
yes/no vocabulary tests change with Japanese university students’ English
ability levels? Language Testing, 29(4), 471–488.
Stubbe, R. (2015). Replacing translation tests with yes/no tests. Vocabulary
Learning and Instruction, 4, 38–48. doi:10.7820/vli.v04.2.stubbe.
Tabachnick, B. G., Fidell, L. S., & Osterlind, S. J. (2001). Using multivariate
statistics. New York: Pearson.
Thoma, D. (2009). Strategic attention in language testing. Metacognition in yes/no
business English vocabulary test. Frankfurt: Peter Lang.
Ullman, M. T. (2005). A cognitive neuroscience perspective on second language
acquisition: The declarative/procedural model. In C. Sanz (Ed.), Mind and
context in adult second language acquisition: Methods, theory, and practice
(pp. 141–178). Washington, DC: Georgetown University Press.
van der Linden, W. J. (2009). Conceptual issues in response time modelling.
Journal of Educational Measurement, 46(3), 247–272. doi:10.1111/
j.1745-3984.2009.00080.x.
van Heuven, W. J. B., Dijkstra, T., & Grainger, J. (1998). Orthographic neigh-
borhood effects in bilingual word recognition. Journal of Memory and
Language, 39(3), 458–483. doi:10.1006/jmla.1998.2584.
van Zeeland, H., & Schmitt, N. (2012). Lexical coverage in L1 and L2 listening
comprehension: The same or different from reading comprehension? Applied
Linguistics, 34(4), 457–479.
Verhoeven, L., van Leeuwe, J., & Vermeer, A. (2011). Vocabulary growth and
reading development across the elementary school years. Scientific Studies of
Reading, 15(1), 8–25.
Vermeer, A. (2001). Breadth and depth of vocabulary in relation to L1/L2
acquisition and frequency of input. Applied Psycholinguistics, 22(2), 217–234.
Wagenmakers, E. J., & Brown, S. (2007). On the linear relation between the
mean and the standard deviation of a response time distribution. Psychological
Review, 114(3), 830–841. doi:10.1037/0033-295X.114.3.830.
Wagenmakers, E. J., Ratcliff, R., Gomez, P., & McKoon, G. (2008). A diffusion
model account of criterion shifts in the lexical decision task. Journal of
Memory and Language, 58(1), 140–159. doi:10.1016/j.jml.2007.04.006.
Walter, C. (2004). Transfer of reading comprehension skills to L2 is linked to
mental representations of text and to L2 working memory. Applied Linguistics,
25(3), 315–339.
Wang, M., & Koda, K. (2005). Commonalities and differences in word identi-
fication skills among learners of English as a second language. Language
Learning, 55(1), 71–98. doi:10.1111/j.0023-8333.2005.00290.x.
Ward, J. (1999). How large a vocabulary do EAP engineering students need?
Reading in a Foreign Language, 12(2), 309–323.
Waters, G. S., & Caplan, D. (2003). The reliability and stability of verbal work-
ing memory measures. Behavior Research Methods, Instruments, and Computers,
35(4), 550–564. doi:10.3758/BF03195534.
Webb, S. (2007). The effect of repetition on vocabulary knowledge. Applied
Linguistics, 28(1). doi:10.1093/applin/aml048.
West, M. (1953). A general service list of English words. London: Longman.
Wray, A. (2008). Formulaic language: Pushing the boundaries. Oxford: Oxford
University Press.
Yamashita, J. (2013). Word recognition subcomponents and passage level read-


ing in a foreign language. Reading in a Foreign Language, 25(1), 52–71.
Yule, G., Yanz, J. L., & Tsuda, A. (1985). Investigating aspects of the language learner's confidence: An application of the theory of signal detection. Language Learning, 35(3), 473–488. doi:10.1111/j.1467-1770.1985.tb01088.x.
Zareva, A. (2005). Models of lexical knowledge assessment of second language
learners of English at higher levels of language proficiency. System, 33(4),
547–562.
Zareva, A., Schwanenflugel, P., & Nikolova, Y. (2005). Relationship between lexical competence and language proficiency: Variable sensitivity. Studies in Second Language Acquisition, 27(4), 567–595. doi:10.1017/S0272263105050254.
Zhang, X., & Lu, X. (2013). A longitudinal study of receptive vocabulary
breadth knowledge growth and fluency development. Applied Linguistics,
35(3), 283–304. doi:10.1093/applin/amt014.
Zhang, S., & Thomson, S. (2004). DIALANG: A diagnostic language assessment system. The Canadian Modern Language Review/La Revue Canadienne des Langues Vivantes, 61(2), 290–293.
Ziegler, J. C., & Perry, C. (1998). No more problems in Coltheart's neighborhood: Resolving neighborhood conflicts in the lexical decision task. Cognition, 68(2), B53–B62.
Zipf, G. K. (1949). Human behavior and the principle of least effort. Cambridge,
MA: Addison-Wesley.
Zwaan, R. A., & Brown, C. M. (1996). The influence of language proficiency
and comprehension skill on situation-model construction. Discourse Processes,
21(3), 289–327.
Index

A
academic English, 115
academic risk, 183
ANOVA, 36
automaticity, 51

B
band score levels, 188
behaviorist, 79
bilingual readers, 49
binary response format, 267
breadth, 3

C
coefficient of determination, 36
coefficient of variation (CV), 51
cognate status, 59
collocations, 5
Common European Framework of Reference for Languages (CEFR), 36
composite measures, 108
construct-irrelevant, 55
construct validity, 132
context dependent, 113
context independent, 113
contextual diversity, 16
correction-for-guessing, 100

D
decision stage, 54
declarative memory, 5
decoding skills, 47, 50
DELNA, 277
depth, 3
DIALANG, 277
discourse constraints, 7
discrete, 112

E
effect size, xxv
embedded, 112
emergent property, 76
English for Academic Purposes, 228
English-medium academic study, 115
entry requirement, 194
entry standards, 158
error rates, 105

F
false-alarm rate, 102
false alarms, 55, 98
familiarity, 57
fluency, 46
formulaic speech, 5
frequency band, 34
frequency distribution, 11
frequency-of-occurrence bands, 75
frequency statistics, 12
frequentist, 15

G
Go/No-Go response format, 273
grade-point-averages (GPAs), 228
grain size, 73

H
higher-order cognitive processes, 49
high stakes, 113
hits, 100

I
intentional retrieval activities, 278
International English Language Testing System (IELTS), 26, 189
interactionalist, 79

L
LanguageMAP, 160
lemmas, 9
lexical availability, 70
lexical decision task (LDT), 46
lexical expertise, 70
lexical facility, 26
lexical fluency, 70
lexicality, 59
lexical quality hypothesis, 71
liberal response condition, 102
L1 script effects, 269
long-term memory, 48
low stakes, 113
low-stakes testing, 270

M
measurement-based variance, 55
mental lexicon, 13
multiword units, 5

N
neighbours, 52
nonparametric, 36
nonwords, 52

O
orthographic processing, 49
outlier values, 106

P
phonological skills, 49
pivot, 73
placement decisions, 37
placement testing, 5, 276
postenrollment language assessment/at-risk screening, 277
predictor variables, 109
procedural memory, 6
program placement, 206
pseudohomophones, 59
pseudowords, 30

R
rapid serial visual presentation, 273
readiness testing, 276
recognition vocabulary, 4
reliability, 68
response bias, 102
response style, 100

S
scoring formulas, 100
self-report, 266
single word, 74
single-word presentation, 39
situation model, 48
sound-spelling correspondences, 58
speed-accuracy trade-offs, 83, 106, 107
spoken version, 161
standardized tests, 37
strategic processing, 29

T
testing effect, 277
text coverage, 16
text integration processes, 48
TOEFL, 26
TOEIC, 36
tokens, 8
trait, 6, 79
t-test, 36
types, 8
type/token ratio (TTR), 8–9

U
university admission, 158

V
validity, 68
variability, 77
verbal efficiency, 70
vocabulary fluency, 70
Vocabulary Levels Test (VLT), 13
Vocabulary Size Test (VST), 25, 28

W
word families, 9–10
word frequency, 46
word identification, 47
word recognition speed, 39, 85
working memory, 72
written version, 169

Y
Yes/No Test, 13

Z
Zipf's law, 11