
QUANTITATIVE METHODS FOR

SECOND LANGUAGE RESEARCH

Quantitative Methods for Second Language Research introduces approaches to and techniques for quantitative data analysis in second language research, with a primary focus on second language learning and assessment research. It takes a conceptual, problem-solving approach by emphasizing the understanding of statistical theory and its application to research problems while paying less attention to the mathematical side of statistical analysis. The text discusses a range of common statistical analysis techniques, presented and illustrated through applications of the IBM Statistical Package for Social Sciences (SPSS) program. These include tools for descriptive analysis (e.g., means and percentages) as well as inferential analysis (e.g., correlational analysis, t-tests, and analysis of variance [ANOVA]). The text provides conceptual explanations of quantitative methods through the use of examples, cases, and published studies in the field. In addition, a companion website to the book hosts slides, review exercises, and answer keys for each chapter as well as SPSS files. Practical and lucid, this book is the ideal resource for data analysis for graduate students and researchers in applied linguistics.

Carsten Roever is Associate Professor in Applied Linguistics in the School of Languages and Linguistics at the University of Melbourne, Australia.

Aek Phakiti is Associate Professor in TESOL in the Sydney School of Education and Social Work at the University of Sydney, Australia.
QUANTITATIVE
METHODS FOR
SECOND LANGUAGE
RESEARCH
A Problem-Solving Approach

Carsten Roever and Aek Phakiti


First published 2018
by Routledge
711 Third Avenue, New York, NY 10017
and by Routledge
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2018 Taylor & Francis
The right of Carsten Roever and Aek Phakiti to be identified as authors
of this work has been asserted by them in accordance with sections 77 and
78 of the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this book may be reprinted or reproduced
or utilised in any form or by any electronic, mechanical, or other means,
now known or hereafter invented, including photocopying and recording,
or in any information storage or retrieval system, without permission in
writing from the publishers.
Trademark notice: Product or corporate names may be trademarks or
registered trademarks, and are used only for identification and explanation
without intent to infringe.
Every effort has been made to contact copyright-holders. Please advise
the publisher of any errors or omissions, and these will be corrected in
subsequent editions.
Library of Congress Cataloging-in-Publication Data
A catalog record for this book has been requested
ISBN: 978-0-415-81401-0 (hbk)
ISBN: 978-0-415-81402-7 (pbk)
ISBN: 978-0-203-06765-9 (ebk)

Typeset in Bembo
by Apex CoVantage, LLC

Visit the Companion Website: www.routledge.com/cw/roever


CONTENTS

List of Illustrations vii


Foreword xv
Preface xvii
Acknowledgments xxii

1 Quantification 1

2 Introduction to SPSS 14

3 Descriptive Statistics 28

4 Descriptive Statistics in SPSS 44

5 Correlational Analysis 60

6 Basics of Inferential Statistics 81

7 T-Tests 92

8 Mann-Whitney U and Wilcoxon Signed-Rank Tests 106

9 One-Way Analysis of Variance (ANOVA) 117

10 Analysis of Covariance (ANCOVA) 135

11 Repeated-Measures ANOVA 154



12 Two-Way Mixed-Design ANOVA 166

13 Chi-Square Test 182

14 Multiple Regression 200

15 Reliability Analysis 219

Epilogue 246

References 250
Key Research Terms in Quantitative Methods 255
Index 263
ILLUSTRATIONS

Figures
2.1 New SPSS spreadsheet 16
2.2 SPSS Variable View 17
2.3 Type Column 18
2.4 Variable Type dialog 18
2.5 Label Column 18
2.6 Creating student and score variables for the Data View 19
2.7 Adding variables named ‘placement’ and ‘campus’ 19
2.8 The SPSS spreadsheet in Data View mode 19
2.9 Accessing Case Summaries in the SPSS menus 20
2.10 Summarize Cases dialog 21
2.11 SPSS output based on the variables set in the Summarize
Cases dialog 21
2.12 SPSS menu to open and import data 23
2.13 SPSS dialog to open a data file in SPSS 23
2.14 Illustrated example of an Excel data file to be imported into SPSS 24
2.15 SPSS dialog when opening an Excel data source 24
2.16 The personal factor questionnaire on demographic information 25
2.17 SPSS spreadsheet that shows the demographic data of
Phakiti et al. (2013) 25
2.18 The questionnaires and types of scales and descriptors in
Phakiti et al. (2013) 26
2.19 SPSS spreadsheet that shows questionnaire items of
Phakiti et al. (2013) 26
3.1 A pie chart based on gender 34

3.2 A pie chart based on a 10-point score range 34


3.3 A bar chart based on a 10-point score range 35
3.4 An example of questionnaire items using a Likert-type scale 40
3.5 The positively skewed distribution of length of residence 41
3.6 The negatively skewed distribution of speech act scores 42
3.7 The low skewed distribution of implicature scores 42
4.1 Ch4TEP.sav (Data View) 45
4.2 Ch4TEP.sav (Variable View) 45
4.3 Defining gender in the Value Labels dialog 46
4.4 Defining selfrate (self-rating of proficiency) in the Value Labels dialog 47
4.5 Defining missing values 48
4.6 SPSS menu for computing descriptive statistics 49
4.7 Frequencies dialog 50
4.8 Frequencies: Statistics dialog 50
4.9 Frequencies: Charts dialog 51
4.10 A histogram of the self-rating of proficiency variable with a normal
curve 53
4.11 SPSS Descriptives options 54
4.12 SPSS graphical options 55
4.13 SPSS bar option 55
4.14 SPSS pie option 56
4.15 SPSS histogram option 57
4.16 The histogram for the total score variable 58
5.1 A scatterplot displaying the values of two variables
with a perfect positive correlation of 1 64
5.2 A scatterplot displaying the values of two variables with a
correlation coefficient of 0.90 64
5.3 A scatterplot displaying the values of two variables with a
correlation coefficient of 0.33 65
5.4 A scatterplot displaying the values of two variables with a perfect
negative correlation coefficient of –1 66
5.5 A scatterplot displaying the values of two variables with a low
correlation coefficient of 0.06 67
5.6 SPSS output displaying the Pearson product moment correlation
between two subsections of a grammar test 71
5.7 A view of Ch5correlation.sav 72
5.8 SPSS graphs menu with Scatter/Dot option 74
5.9 Simple scatterplot options 74
5.10 A scatterplot displaying the values of the listening and
grammar scores 75
5.11 Adding the fit line in a scatterplot 76
5.12 A scatterplot displaying the values of the listening and
grammar scores with a line of best fit added 77

5.13 SPSS Bivariate Correlations dialog 77


6.1 A normally distributed data set 85
7.1 Accessing the SPSS menu to perform the
independent-samples t-test 98
7.2 SPSS dialog for the independent-samples t-test 99
7.3 Lee Becker’s effect size calculators 101
7.4 Accessing the SPSS menu to perform the
paired-samples t-test 102
7.5 Paired-Samples T Test dialog 103
8.1 SPSS menu to perform the Mann-Whitney U test 109
8.2 SPSS dialog to perform the Mann-Whitney U test 110
8.3 SPSS menu to perform the Wilcoxon Signed-rank test 113
8.4 SPSS dialog to perform the Wilcoxon Signed-rank test 113
9.1 SPSS menu to launch a one-way ANOVA 123
9.2 Univariate dialog for a one-way ANOVA 123
9.3 Options for post hoc tests 124
9.4 Options dialog for ANOVA 125
9.5 SPSS menu to launch the Kruskal-Wallis test 129
9.6 Setup for the Kruskal-Wallis test 130
9.7 Variable entry for the Kruskal-Wallis test 131
9.8 Analysis settings for the Kruskal-Wallis test 131
9.9 Kruskal-Wallis test results 132
9.10 Model Viewer window for the Kruskal-Wallis test 132
9.11 Viewing pairwise comparisons 133
9.12 Pairwise comparisons in the Kruskal-Wallis test 133
10.1 Accessing the SPSS menu to launch the Compute
Variable dialog 137
10.2 Compute Variable dialog 137
10.3 Checking ANCOVA assumption of independence of
covariate and independent variable 141
10.4 Accessing the SPSS menu to select Cases for analysis 143
10.5 Select Cases dialog 144
10.6 Defining case selection conditions 144
10.7 Data View with cases selected out 145
10.8 Accessing the SPSS menu to launch ANCOVA 146
10.9 Univariate dialog for choosing a model to examine an
interaction among factors and covariances 147
10.10 Univariate: Model dialog for defining the interaction term to
check the homogeneity of regression slopes 147
10.11 Changing the analysis setup back to the original setup 149
10.12 Options in the Univariate dialog 149
11.1 A pretest, posttest, and delayed posttest design 154
11.2 Accessing the SPSS menu to launch a repeated-measures ANOVA 159

11.3 Repeated Measures Define Factors dialog 159


11.4 Repeated Measures dialog 160
11.5 Repeated Measures: Options dialog 161
12.1 Diagram of a pretest-posttest control-group design 167
12.2 Changes across time points among the five groups 169
12.3 The Repeated Measures dialog 170
12.4 Repeated Measures: Profile Plots dialog 171
12.5 Repeated Measures: Profile Plots dialog with colres∗section shown 172
12.6 Repeated Measures: Post Hoc Multiple Comparisons for Observed
Means dialog 172
12.7 Repeated Measures: Options dialog 173
12.8 Estimated marginal means of MEASURE_1 180
13.1 Accessing the SPSS menu to launch the two-dimensional
chi-square test 191
13.2 Crosstabs dialog 192
13.3 Crosstabs: Statistics settings 193
13.4 Crosstabs: Cell Display dialog 193
13.5 VassarStats website’s chi-square calculator
(http://vassarstats.net/newcs.html) 196
13.6 Contingency table for two rows and two columns 197
13.7 Contingency table for two rows and two columns with data
entered 197
13.8 Chi-square test results from VassarStats 198
14.1 A scatterplot of the relationship between chocolate consumption
and vocabulary recall success 201
14.2 Accessing the SPSS menu to launch multiple regression 207
14.3 Linear Regression dialog 207
14.4 Linear Regression: Statistics dialog 208
14.5 Linear Regression: Options dialog 208
14.6 Linear Regression dialog for a hierarchical regression (Block 1 of 1) 213
14.7 Linear Regression dialog for a hierarchical regression (Block 2 of 2) 214
14.8 Linear Regression dialog for a hierarchical regression (Block 3 of 3) 214
15.1 Accessing the SPSS menu to launch Cronbach’s alpha analysis 224
15.2 Reliability Analysis dialog for Cronbach’s alpha analysis 224
15.3 Reliability Analysis: Statistics dialog 225
15.4 A selection from Ch15analyticrater.sav (Data View) 228
15.5 Excerpt from Ch15raters.sav (Data View) 231
15.6 Accessing the SPSS menu to launch Reliability Analysis 232
15.7 Reliability Analysis dialog for the split-half analysis 233
15.8 Excerpt from Ch15kappa.sav 235
15.9 Accessing the SPSS menu to launch Crosstabs for kappa analysis 236
15.10 Crosstabs dialog 237
15.11 Crosstabs: Statistics dialog for choosing kappa 237

15.12 Reliability Analysis dialog for raters’ totals as selected variables 240
15.13 Reliability Analysis: Statistics dialog for intraclass correlation analysis 241

Tables
1.1 Examples of learners and their scores 4
1.2 An example of learners’ scores converted into percentages 4
1.3 How learners are rated and ranked 5
1.4 How learners are scored on the basis of performance descriptors 6
1.5 How learners are scored on a different set of performance descriptors 6
1.6 Nominal data and their numerical codes 8
1.7 Essay types chosen by students 8
1.8 The three placement levels taught at three different locations 9
1.9 The students’ test scores, placement levels, and campuses 9
1.10 The students’ placement levels and campuses 10
1.11 The students’ campuses 11
1.12 Downward transformation of scales 11
3.1 IDs, gender, self-rated proficiency, and test score of the first 50
participants 29
3.2 Frequency counts based on gender 31
3.3 Frequency counts based on test takers’ self-assessment of
their English proficiency 31
3.4 Frequency counts based on test takers’ test scores 32
3.5 Frequency counts based on test takers’ test score ranges 32
3.6 Test score ranges based on quartiles 33
3.7 Imaginary test taker sample with an outlier 36
4.1 SPSS output on the descriptive statistics 51
4.2 SPSS frequency table for gender 52
4.3 SPSS frequency table for the selfrate variable
(self-rating of proficiency) 52
4.4 Taxonomy of the questionnaire and Cronbach’s alpha (N = 51) 59
4.5 Example of item-level descriptive statistics (N = 51) 59
5.1 Descriptive statistics of the listening, grammar, vocabulary, and
reading scores (N = 50) 73
5.2 Pearson product moment correlation between the listening
scores and grammar scores 78
5.3 Spearman correlation between the listening scores and
grammar scores 78
6.1 Correlation between verb tenses and prepositions in a
grammar test 84
6.2 Explanations of the relationship between the sample size and the
effect 88
6.3 The null hypothesis versus alternative hypothesis 89
7.1 Mean and standard deviation of error counts for generation 1.5 learners and L1 writers 93
7.2 Mean and standard deviations of ratios of error-free clauses in the
cartoon description task for both modalities 95
7.3 Means and standard deviations of the two groups 99
7.4 Levene’s test 100
7.5 The independent-samples t-test results 100
7.6 Means and standard deviations of the two means 104
7.7 Correlation coefficient between the two means 104
7.8 Paired-samples t-test results 104
8.1 Mann-Whitney U test results 107
8.2 Descriptive statistics (N = 46) 110
8.3 Mean ranks in the Mann-Whitney U test (N = 46) 110
8.4 Mann-Whitney U test statistics (N = 46) 111
8.5 Descriptive statistics (N = 46) 114
8.6 Ranks statistics in the Wilcoxon Signed-rank test (N = 46) 115
8.7 Wilcoxon Signed-rank test statistics (N = 46) 115
9.1 Immediate posttest 118
9.2 Descriptives for proficiency in TEP 125
9.3 Levene’s statistic 126
9.4 Tests of between-subjects effects as the ANOVA result 126
9.5 Scheffé post hoc test for multiple comparisons 127
10.1 ANOVA for the independent variable and covariate
(test between-subjects effects) 141
10.2 Post hoc tests for independence of covariate and independent
variable (multiple comparisons) 142
10.3 Post hoc tests for the independence of covariate and independent
variable 142
10.4 Output of homogeneity of regression slopes check
(tests of between-subjects effects) 148
10.5 Descriptive statistics of the routines scores between the two
residence groups 150
10.6 Levene’s test 150
10.7 ANCOVA analysis 151
10.8 Estimated means after adjustment for the covariate 151
10.9 Group comparisons 152
11.1 Six different tests with 10 vocabulary items 155
11.2 The within-subjects factors 162
11.3 Descriptive statistics 162
11.4 The multivariate test output 162
11.5 Mauchly’s Test of Sphericity 162
11.6 Results from tests of within-subjects effects 163
11.7 Estimates 163

11.8 Pairwise comparisons 164


12.1 Descriptive statistics of the percentage scores for correct use
for the five treatment conditions by three tasks 168
12.2 The within-subjects factors 174
12.3 The between-subjects factors 174
12.4 Descriptive statistics 175
12.5 Mauchly’s Test of Sphericity 175
12.6 Results from tests of within-subjects effects 176
12.7 Levene’s test 176
12.8 The between-subjects effects 176
12.9 Descriptive statistics for ‘residence’ 177
12.10 Pairwise comparisons on collapsed residence 177
12.11 Univariate tests 177
12.12 Post hoc test 178
12.13 Descriptive statistics for sections 179
12.14 Pairwise comparisons on test sections 179
13.1 Frequency of phrasal verb use in five registers 183
13.2 Chi-square observed and expected counts and residuals 184
13.3 Frequency counts of language-related episodes (LREs) by
accuracy of recall 185
13.4 Marginal totals, expected frequencies, and residuals for recall by
type of LREs 186
13.5 Collocation use by proficiency level 188
13.6 Marginal totals, expected frequencies, and residuals for
collocation type and proficiency level 188
13.7 SPSS summary of the two-dimensional chi-square analysis 194
13.8 Cross-tabulation output based on gender and collapsed residence 194
13.9 Outputs of the two-dimensional chi-square test 195
13.10 Symmetric measures for the two-dimensional chi-square test 195
14.1 Three hierarchical regression models 204
14.2 Descriptive statistics 209
14.3 Correlations among the outcome and predictor variables 209
14.4 Variables entered/removed 210
14.5 Model summary 210
14.6 The ANOVA result 211
14.7 Model coefficients output: Unstandardized and standardized Beta
coefficients 211
14.8 Model coefficients output: Correlations and collinearity statistics 212
14.9 Model summary 215
14.10 ANOVA results 216
14.11 Model coefficients output: Unstandardized and standardized Beta
coefficients 216
14.12 Model coefficients output: Correlations and collinearity statistics 217

14.13 Excluded variables 217


15.1 A simple (simulated) data matrix for a course feedback
questionnaire (N = 10) 222
15.2 The reliability for the 12-item implicature section of the TEP 222
15.3 Item-total statistics of the 12-item implicature section of the TEP 223
15.4 The case processing summary for items ‘imp1sc’ to ‘imp12sc’ 226
15.5 The overall reliability statistics 226
15.6 The item statistics 226
15.7 The summary item statistics 227
15.8 The item-total statistics 227
15.9 The scale statistics 227
15.10 The Spearman-Brown coefficient 233
15.11 Cross-tabulation of pass-fail ratings for 25 ESL learners 234
15.12 Cross-tabulation of pass-fail ratings by raters 1 and 2 238
15.13 Case processing summary for raters 1 and 2 238
15.14 Measure of agreement (kappa value) 238
15.15 Simulated data set for two raters (rater 1 and rater 2) 239
15.16 The case processing summary output 242
15.17 The reliability estimate output 242
15.18 The item statistics output 242
15.19 The intraclass correlation coefficient 243
FOREWORD

There is a certain degree of confidence or credibility that often accompanies statistical evidence. “The numbers don’t lie”, we often hear in casual conversation. As consumers of information, whether in the news or in published second
language (L2) research, we tend to associate statistical evidence with objectivity
and, consequently, truth. The road that leads to statistical evidence, however, is
often long, winding, and full of decisions (even detours!) that the researcher has
taken. In the case of L2 research, examples of such choices might include deciding
(a) whether to collect speech samples using a more open-ended versus a controlled
task, (b) whether certain items in a questionnaire—or individuals in a sample—
should be removed from analysis based on aberrant observations, and (c) how to
score learner production that is only partially correct. Each of these choices may
influence a study’s outcomes in one direction or another, and it is critical that we
recognize the centrality of researcher judgment in all that we read and produce. As
Huff (1954) stated in his now-classic introduction to pitfalls that both researchers
and consumers succumb to, How to Lie With Statistics, “Statistics is as much an art
as it is a science” (p. 120).
A second point I offer as you enter into the wonders of quantitative research
is that nearly all of the objects we measure and quantify are actually qualitative
in nature. It may seem odd to point this out in the foreword of a text like this,
but it is true! And although quantitative techniques are valuable in helping us to
organize data and to conduct the many systematic and insight-producing analyses
described throughout this book, they almost necessarily involve abstractions from
our initial interests. Imagine, for example, a study of the effects of two instruc-
tional treatments on learners’ ability to speak accurately and fluently. En route to
addressing that issue we would likely transcribe participants’ speech samples and
then code or score them for a given set of features. Next, we would summarize those scores across the sample, the results of which would be subject to one or
more statistical tests for subsequent interpretation. In each of these procedures, we
have made abstractions, tiny steps away from learner knowledge.
I realize these comments might make me appear skeptical of quantitative research.
Of course I am! Likewise, we should all approach the task of conducting, report-
ing, and understanding empirical research with a critical eye. And thankfully, that
is precisely what this very timely and well-crafted book will enable you to do,
thereby advancing our collective ability both to conduct and evaluate research.
The text, in my view, manages to balance on the one hand a conceptual grounding
that enlightens without overwhelming and, on the other, the need for a hands-
on tutorial—in other words, precisely the knowledge and skills needed to make
and justify your own decisions throughout the process of producing rigorous and
meaningful studies. I look forward to reading them!
Luke Plonsky
Georgetown University
PREFACE

In the field of L2 research, the quantitative approach is one of the predominant methodologies (see e.g., Norris, Ross & Schoonen, 2015; Plonsky, 2013, 2014;
Plonsky & Gass, 2011; Purpura, 2011). Quantitative research uses numbers, quan-
tification, and statistics to answer research questions. It involves the measurement
and quantification of language and language-related features of interest, such as
language proficiency, language skills, aptitudes, and motivation. The data collected
are then analyzed using statistical tools, the results of which are used to produce
research findings. In practice, however, the use of statistical tools and the way that
the results of quantitative research are reported leave much to be desired.
In 2013, Plonsky conducted a systematic review of 606 second language acqui-
sition (SLA) studies in regard to study design, analysis, and reporting practices.
Several weaknesses in those practices were found, including a lack of basic statistical information, such as means, standard deviations, and probability values. Plonsky
and Gass (2011), and Plonsky (2013, 2014) call for a reform of the data analysis
and reporting practices used in L2 research. According to Plonsky and Gass (2011),
these shortcomings could be a reflection of inadequate methodological and statis-
tical concept training, as well as insufficient coverage in research methods courses
in graduate programs of how researchers should report statistical findings.
The dearth of adequate training in quantitative research has potentially seri-
ous repercussions for the field. Certain areas in L2 research cannot be adequately
addressed if there is a lack of appropriate training in statistical methods and if suf-
ficient resources are inaccessible to new researchers and experienced researchers
new to quantitative methods. Quantitative methods, particularly inferential statis-
tics, can be technical and difficult to learn because they require an understanding
of not only the logic underpinning the statistical approaches taken, but also the
technical procedures that need to be followed to produce statistical outcomes. In
addition, researchers need to be able to interpret outcomes of statistical analyses and draw conclusions from them to answer research questions.
Since researchers in applied linguistics frequently come from an arts, humani-
ties, education, and/or social sciences background, they often have little familiarity
with mathematical and statistical concepts and procedures, and perceive statistics
as a foreign language, feeling apprehensive at the prospect of grappling with quan-
titative concepts and developing statistical skills. This may lead them to choose a
qualitative research approach, despite a quantitative one being more suitable to
answer a particular research question.
Not only can a lack of familiarity with quantitative procedures close off major
avenues of research to students, but it can also prevent new researchers from under-
standing and critically evaluating existing studies that use quantitative methods: if
readers do not understand the use of statistics in a paper, they are forced to take
the author’s interpretation of statistical outcomes on faith, rather than being able
to critically evaluate it. In the current market, there are a number of books that
deal with quantitative methods (e.g., Bachman, 2004; Bachman & Kunnan, 2005;
Larson-Hall, 2010, 2016), but these can be highly technical, mathematical, and
lengthy in their statistical treatments, as such books are often written for a particular
audience (e.g., advanced doctoral students, or experienced researchers). By contrast,
the current book assumes no prior experience in quantitative research and is writ-
ten for students and researchers new to quantitative methods.

The Aims and Scope of This Book


This book aims to introduce approaches to and techniques for quantitative data
analysis in L2 research, with a primary focus on L2 learning and assessment
research. It takes a conceptual, problem-solving approach, emphasizing the understanding of statistical theory and its application to research problems while paying less attention to the mathematical side of statistical analysis.
This book is, therefore, intended as a practical academic resource and a starting
point for new researchers in their quest to learn about data analysis. It provides con-
ceptual explanations of quantitative methods through the use of examples, cases, and
published studies in the field. Statistical analysis is presented and illustrated through
applications of the IBM Statistical Package for Social Sciences (SPSS) program.
Formulae that can easily be computed manually will be presented in this book.
More involved statistical formulae associated with complex statistical procedures
being introduced will not be presented for several reasons. First, this book is
intended to nurture a conceptual understanding of statistical tests at an intro-
ductory level. Second, applied linguistics researchers rarely calculate inferential
statistics such as those presented in this book manually because there are numerous
statistical programs and online tools that are able to perform the required compu-
tations. Finally, there are many books on statistics that present statistical formulae
that the reader can consult if desired.
In this book, a range of common statistical analysis techniques that can be employed in L2 research are presented and discussed. These include tools for
descriptive analysis, such as means and percentages, as well as inferential analy-
sis, such as correlational analysis, t-tests, and analysis of variance (ANOVA). An
understanding of statistics for L2 research at this level will lay the foundation on
which readers can further their learning of more complex statistics not covered
in this book (e.g., factor analysis, multivariate analysis of variance, Rasch analysis,
generalizability theory, multilevel modeling, and structural equation modeling).
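Although the book carries out these analyses in SPSS rather than by hand, the two families of tools just mentioned can be illustrated in a few lines of code. The sketch below (standard-library Python, with invented learner scores that do not come from the book) computes descriptive statistics for two groups and then a pooled-variance independent-samples t statistic:

```python
import statistics as st

# Hypothetical scores for two small groups of learners (invented for
# illustration; not data from the book).
group_a = [12, 15, 11, 14, 13, 16, 12, 15]
group_b = [10, 11, 9, 12, 10, 13, 11, 10]

# Descriptive statistics: means and sample standard deviations.
mean_a, sd_a = st.mean(group_a), st.stdev(group_a)
mean_b, sd_b = st.mean(group_b), st.stdev(group_b)

# Inferential statistics: the classic pooled-variance formula for an
# independent-samples t statistic.
na, nb = len(group_a), len(group_b)
pooled_var = ((na - 1) * sd_a ** 2 + (nb - 1) * sd_b ** 2) / (na + nb - 2)
t = (mean_a - mean_b) / (pooled_var * (1 / na + 1 / nb)) ** 0.5

print(f"Group A: M = {mean_a:.2f}, SD = {sd_a:.2f}")
print(f"Group B: M = {mean_b:.2f}, SD = {sd_b:.2f}")
print(f"t({na + nb - 2}) = {t:.2f}")
```

SPSS reports the same quantities (plus a probability value) in its t-test output, which is why the book can stay conceptual and leave the arithmetic to the software.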

Overview of the Book


The book begins with the basics of the quantification process, then moves on to
more sophisticated statistical tools. The book comprises a preface, 15 chapters, an
epilogue, references, key research terms in quantitative methods, and an index.
However, readers may choose to skip some chapters and focus on those chapters
relevant to their particular interest or research need. The chapters in this book
include specific examples and cases in quantitative research in language acquisition
and assessment, as well as analysis of unpublished data collected by the authors.
Most chapters illustrate how to use SPSS to perform the statistical analysis related
to the focus of the chapter.
Chapter 1 (Quantification) introduces the concept of quantification and dis-
cusses its benefits and limitations, and how data that are not initially quantitative
may become quantitative through coding and frequency counts. It also introduces
different scales of measurement (interval/ratio, ordinal, and nominal scales).
Chapter 2 (Introduction to SPSS) presents the interface of the SPSS program,
the appearance of an SPSS data sheet, and preparing a data file for quantitative
data entry.
Chapter 3 (Descriptive Statistics) describes ways of representing data sets,
including graphical displays, frequency counts, and descriptive statistics. It also
foreshadows some of the statistical conditions that must be met to use some of the
statistical tests described later in the book.
Chapter 4 (Descriptive Statistics in SPSS) shows how to compute descriptive
statistics in SPSS, and how to create simple graphs or displays of data.
Chapter 5 (Correlational Analysis) introduces the first two types of inferential
statistics, Pearson and Spearman correlations. The rationale behind correlations
and how to interpret a correlation coefficient are discussed.
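The distinction the chapter draws between the two coefficients can be previewed in code. A minimal standard-library Python sketch (the listening and grammar scores are invented for illustration, and the simple ranking below ignores tied scores, which SPSS handles properly):

```python
import statistics as st

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    mx, my = st.mean(x), st.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sum((a - mx) ** 2 for a in x) ** 0.5
                  * sum((b - my) ** 2 for b in y) ** 0.5)

def spearman(x, y):
    """Spearman rank correlation: Pearson applied to ranks (no tie handling)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

# Invented listening and grammar scores (not data from the book).
listening = [14, 18, 11, 16, 20, 13, 17, 15]
grammar = [12, 17, 10, 15, 19, 11, 16, 13]
print(f"Pearson r = {pearson(listening, grammar):.2f}")
print(f"Spearman rho = {spearman(listening, grammar):.2f}")
```

Because Spearman works on ranks rather than raw scores, a perfectly monotonic relationship yields rho = 1 even when Pearson r is slightly lower.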
Chapter 6 (Basics of Inferential Statistics) discusses the distinction between a
population and a sample, the logic of hypothesis testing, the normal distribution,
and the concept of probability. The concept of significance is also discussed. The
relationships among significance level, effect size, and sample size are highlighted.
Chapter 7 (T-Tests) presents inferential statistics for detecting differences
between groups (the independent-samples t-test), and between repeated measure-
ment instances from the same group of participants (the paired-samples t-test).
Chapter 8 (Mann-Whitney U and Wilcoxon Signed-Rank Tests) presents the two nonparametric versions of the t-tests presented in Chapter 7. These two tests are
useful for the analysis of nonnormally distributed and ordinal data.
Chapter 9 (One-Way Analysis of Variance [ANOVA]) extends between-group
comparisons as performed in the independent-samples t-test to three or more groups.
It discusses the principles of the one-way ANOVA and effect size considerations.
Chapter 10 (One-Way Analysis of Covariance [ANCOVA]) presents an extended
version of the one-way ANOVA that is used when there are preexisting differ-
ences between groups, which can distort outcomes.
Chapter 11 (Repeated-Measures ANOVA) extends the paired-samples t-test to more than two measurement points. The repeated-measures ANOVA can analyze whether there are differences among several measures of the same group.
This chapter covers the procedures that must be followed when using the repeated-
measures ANOVA, and discusses the types of research questions for which this
procedure is useful.
Chapter 12 (Two-Way Mixed-Design ANOVA) presents an inferential statistic
that combines a repeated-measures ANOVA (Chapter 11) with a between-groups
ANOVA (Chapter 9). Such a combination has the advantage of not only evaluat-
ing whether group differences affect performance outcomes, but also of being able
to simultaneously analyze the influences of time or task factors on performance
outcomes.
Chapter 13 (Chi-Square Test) demonstrates the use of the chi-square test in
L2 research and compares it with the use of Pearson and Spearman correlations.
Chapter 14 (Multiple Regression) presents simple regression and multiple regres-
sion analyses, which are used for assessing the relative impact of language learning
and test performance variables. Multiple regression allows researchers to examine
the relative contributions of predictor variables on an outcome variable.
Chapter 15 (Reliability Analysis) demonstrates an extension and application of
correlational analysis to examine the reliability of research instruments.
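One reliability index treated in that chapter, Cronbach's alpha, follows a formula simple enough to sketch directly: alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores). A minimal standard-library Python illustration, with invented questionnaire responses rather than data from the book:

```python
import statistics as st

def cronbach_alpha(items):
    """Cronbach's alpha from a list of item-score columns.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores)),
    using sample variances throughout.
    """
    k = len(items)
    # Total score per respondent: sum each person's answers across items.
    totals = [sum(scores) for scores in zip(*items)]
    item_var = sum(st.variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / st.variance(totals))

# Invented responses: three 5-point items answered by six people
# (one list per item, one position per respondent).
item1 = [4, 5, 3, 4, 2, 5]
item2 = [3, 5, 4, 4, 2, 4]
item3 = [4, 4, 3, 5, 2, 5]
print(f"alpha = {cronbach_alpha([item1, item2, item3]):.2f}")
```

The logic mirrors what SPSS's Reliability Analysis procedure reports as "Cronbach's Alpha": when items rise and fall together across respondents, total-score variance dwarfs the summed item variances and alpha approaches 1.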
The Epilogue at the end of the book suggests resources for further reading in
quantitative methods.

Quantitative Research Abilities


At the end of this book, readers will have developed the following abilities:

• to understand and use suitable quantitative research analyses and approaches in a specific research area and context;
• to critically read and evaluate quantitative research reports (e.g., journal arti-
cles, theses, or dissertations), including the claims made by researchers;
• to apply statistical concepts to their own research contexts. This ability goes beyond understanding the specific research examples and statistical procedures presented in this book; it means that researchers will be enabled to conduct analysis on their own data to answer research questions; and,
• to independently extend their statistical knowledge beyond what has been
covered in this book. Numerous advanced statistical analyses, such as Rasch
analysis or structural equation modeling, are not included in this book. They
are, however, important methods for L2 research.

Companion Website
A Companion Website hosted by the publisher houses online and up-to-date
materials such as exercises and activities: www.routledge.com/cw/roever

Comments/Suggestions
The authors would be grateful to hear comments and suggestions regarding this
book. Please contact Carsten Roever at [email protected] or Aek Phakiti
at [email protected].
ACKNOWLEDGMENTS

In preparing and writing this book, we have benefitted greatly from the support of
many friends, colleagues, and students. First and foremost, we wish to acknowledge
Tim McNamara, whose brilliant pedagogical design of the course Quantitative
Methods in Language Studies at the University of Melbourne inspired us to write an
introductory statistical methods book that focuses on conceptual understanding
rather than mathematical intricacies. In addition, several colleagues, mentors, and
friends have helped us shape the book structure and content through invaluable
feedback and engaging discussion: Mike Baynham, Janette Bobis, Andrew Cohen,
Talia Isaacs, Antony Kunnan, Susy Macqueen, Lourdes Ortega, Brian Paltridge,
Luke Plonsky, Jim Purpura, and Jack Richards. We would like to thank Guy
Middleton for his exceptional work on editing the book chapter drafts. We also
greatly appreciate the feedback from Master of Arts (Applied Linguistics) students
at the University of Melbourne and Master of Education (TESOL) students at
the University of Sydney on an early draft. We would like to thank the staff at
Routledge for their assistance during this book project: Kathrene Binag, Rebecca
Novack, and the copy editors.
The support of our institutions and departments has allowed us time to con-
centrate on completing this book. The School of Languages and Linguistics at the
University of Melbourne supported Carsten with a sabbatical semester, which he
spent in the stimulating environment of the Teachers College, Columbia Uni-
versity. The Sydney School of Education and Social Work (formerly the Faculty
of Education and Social Work) supported Aek with a sabbatical semester at the
University of Bristol to complete this book project. Finally, Kevin Yang and Damir
Jambrek deserve our gratitude for their unflagging support while we worked on
this project over several years.
1
QUANTIFICATION

Introduction
Quantification is the use of numbers to represent facts about the world. It is used to
inform the decision-making process in countless situations. For example, a doctor
might prescribe some form of treatment if a patient’s blood pressure is too high.
Similarly, a university may accept the application of a student who has attained the
minimum required grades. In both these cases, numbers are used to inform deci-
sions. In L2 research, quantification is also used. For example,

• researchers in SLA might investigate the effect of feedback on students’ writing by comparing the writing scores of a group of students that received feedback with the scores of a group that did not. They may then draw conclusions regarding the effect of that feedback;
• researchers in cross-cultural pragmatics might code requests made by people
from different cultures as direct or indirect and then use the codings to com-
pare those cultures; and
• researchers may be interested in the effect of a study-abroad program on stu-
dents’ language proficiency level. In this case, they may administer a language
proficiency test prior to the program, and another following the program.
Analysis of the test scores can then be carried out to determine whether it is
worthwhile for students to attend such programs.

This chapter introduces fundamental concepts related to quantitative research, such as the nature of variables, measurement scales, and research topics in L2 research that can be addressed through quantitative methods.

Quantitative Research
Quantitative researchers aim to draw conclusions from their research that can be
generalized beyond the sample participants used in their research. To do this, they
must generate theories that describe and explain their research results. When a
theory is in the process of being tested, several aspects of the theory are referred to
as hypotheses. This testing process involves analyzing data collected from, for exam-
ple, research participants or databases. In language assessment research, researchers
may be interested in the interrelationships among test performances across various
language skills (e.g., reading, listening, speaking, and writing). Researchers may
hypothesize that there are positive relationships among these skills because there
are common linguistic aspects underlying each skill (e.g., vocabulary and syntac-
tic knowledge). To test this hypothesis, researchers may ask participants to take a
test for each of the skills. They may then perform statistical analysis to investigate
whether their hypothesis is supported by the collected data.

Variables, Constructs, and Data


In quantitative research, the term variable is used to describe a feature that can
vary in degree, value, or quantity. Values of a variable may be obtained directly
from research participants with a high degree of certainty (e.g., their ages or first
language), or may have to be inferred from data collected using observation or
measurements of behavior. In quantitative research, the term construct is used to
refer to a feature of interest that is not apparent to the naked eye. Often constructs
are internal to individuals, for example, L2 constructs include language profi-
ciency, motivation, anxiety, and beliefs. Researchers may use a research instrument
(e.g., a language proficiency test or questionnaire) to collect data regarding these
constructs. For example, if researchers are interested in the vocabulary knowledge
of a group of students, then vocabulary knowledge is the construct of interest.
Researchers can ask students to demonstrate their knowledge by taking a vocab-
ulary test. Here, students’ performance on the test is treated as a variable that
represents their vocabulary knowledge. The test scores are the data, which will
enable researchers to infer the students’ vocabulary knowledge. The term data is
used to refer to the values that a variable may take on. The term data is, therefore,
used as a plural noun (e.g., ‘data are’ and ‘data were analyzed’).

Issues in Quantification
For the results of a piece of quantitative research to be believable, a minimum number
of research participants is required, which will depend on the research question under
analysis, and, in particular, the expected effect size (to be discussed in Chapter 6).

In most cases, researchers need to use some type of instrument (e.g., a lan-
guage test, a rating scale, or a Likert-type scale questionnaire) to help them
quantify a construct that cannot be directly seen or observed (e.g., writing abil-
ity, reading skills, motivation, and anxiety). When researchers try to quantify
how well a student can write, it is not a matter of simply counting. Rather, it
involves the conversion of observations into numbers, for example, by applying a
scoring rubric that contains criteria which allow researchers to assign an overall
score to a piece of writing. That score then becomes the data used for further
analyses.

Measurement Scales
Different types of data contain different levels of information. These differences
are reflected in the concept of measurement scales. What is measured and how it is
measured determines the kind of data that results. Raw data may be interpreted
differently on different measurement scales. For example, suppose Heather and
Tom took the same language test. The results of the test may be interpreted in
different ways according to the measurement scale adopted. It may be said that
Heather got three more items correct than Tom, or that Heather performed better
than Tom. Alternatively, it may simply be said that their performances were not
identical. The amount of information in these statements about the relative abili-
ties of Heather and Tom is quite different and affects what kinds of conclusion can
be drawn about their abilities. The three statements about Heather and Tom relate
directly to the three types of quantitative data that are introduced in this chapter:
interval, ordinal, and nominal/categorical data.

Interval and Ratio Data


Interval data allow the difference between data values to be calculated. Test scores
are a typical kind of interval data. For example, if Heather scored 19 points on
a test, and Tom scored 16 points, it is clear that Heather got three points more
than Tom. A ratio scale is an interval scale with the additional property that it
has a well-defined true zero, which an interval scale does not. Examples of ratio
data include age, period of time, height, and weight. In practice, interval data and
ratio data are treated exactly the same way, so the difference between them has no
statistical consequences, and researchers generally just refer to “interval data” or
sometimes “interval/ratio data”.
It is the precision and information richness of interval data that make them the preferred type of data for statistical analyses. For example, consider the test that
Heather and Tom (and some other students) took. Suppose that the test was com-
posed of 20 questions. The full results of the test appear in Table 1.1.

TABLE 1.1 Examples of learners and their scores

Learner Score (out of 20)

Heather 19
Tom 16
Phil 16
Jack 11
Mary 8

TABLE 1.2 An example of learners’ scores converted into percentages

Learner Score (out of 20) Percentage correct

Heather 19 95%
Tom 16 80%
Phil 16 80%
Jack 11 55%
Mary 8 40%

According to Table 1.1, it can be said that:

• Heather got more questions right than Tom, and also that she got three more
right than Tom did;
• Tom got twice as many questions right as the lowest scorer, Mary; and,
• the difference between Heather and Jack’s scores was the same as the differ-
ence between Tom and Mary’s scores, namely eight points in each case.

Interval data contain a large amount of detailed information and they tell us exactly
how large the interval is between individual learners’ scores. They therefore lend them-
selves to conversion to percentages. Table 1.2 shows the learners’ scores in percentages.
Percentages allow researchers to compare results from tests with different maxi-
mum scores (via a transformation to a common scale). For example, if the next
test consists of only 15 items, and Tom gets 11 of them right, his percentage score
will have declined (as 11 out of 15 is 73%), even though in both cases he got
four questions wrong. In addition to allowing conversion to percentages, interval
data can also be used for a wide range of statistical computations (e.g., calculating
means) and analyses.
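The conversion to a common scale and a simple computation such as the mean can be sketched in a few lines of Python (the book itself works in SPSS; this script is only an illustration, using the scores from Table 1.1):

```python
# Raw scores from Table 1.1, out of a 20-item test.
max_score = 20
scores = {"Heather": 19, "Tom": 16, "Phil": 16, "Jack": 11, "Mary": 8}

# Convert each raw score to a percentage: a common-scale transformation
# that makes tests with different maximum scores comparable.
percentages = {name: 100 * s / max_score for name, s in scores.items()}

# Interval data also support arithmetic such as the mean.
mean_score = sum(scores.values()) / len(scores)

print(percentages)  # Heather: 95.0, Tom: 80.0, ..., Mary: 40.0
print(mean_score)   # 14.0
```

The same two steps (rescaling, then averaging) apply regardless of the test length, which is what makes interval data so flexible for later analyses.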
Typical real-world examples of interval data include age, annual income, weekly
expenditure, and the time it takes to run a marathon. In L2 research, interval data
include age, number of years learning the target language, and raw scores on lan-
guage tests. Scaled test scores on a language proficiency test, such as the Test of
English as a Foreign Language (TOEFL), International English Language Testing
System (IELTS), and Test of English for International Communication (TOEIC)
are also normally considered interval data.

Ordinal Data
For statistical purposes, ratio and interval data are normally considered desirable
because they are rich in information. Nonetheless, not all data can be classified as
interval data, and some data contain less precise information. Ordinal data contain
information about relative ranking but not about the precise size of a difference.
If the data in Tables 1.1 and 1.2 regarding students’ test scores were expressed as
ordinal data (i.e., they were on an ordinal scale of measurement), they would tell
the researchers that Heather performed better than Tom, but they would not indi-
cate by how much Heather outperformed Tom. Ordinal data are obtained when
participants are rated or ranked according to their test performances or levels of
some trait. For example, when language testers score learners’ written production
holistically using a scoring rubric that describes characteristics of performance,
they are assigning ratings to texts such as ‘excellent’, ‘good’, ‘adequate’, ‘support
needed’, or ‘major support needed’. Table 1.3 is an example of how the learners
discussed earlier are rated and ranked.
According to Table 1.3, it can be said that

• Heather scored better than all of the other students;


• Phil and Tom scored the same, and each scored more highly than Jack and
Mary; and
• Mary scored the lowest of all the students.

While ordinal data contain useful information about the relative standings of
test takers, they do not show precisely how large the differences between test tak-
ers are. Phil and Tom performed better than Mary did, but it is unknown how
much better than her they performed. Consequently, with the data in Table 1.3,
it is impossible to see that Phil and Tom scored twice as high as Mary. Although
it could be said that Phil and Tom are two score levels above Mary, that is rather
vague.
Ordinal data can be used to put learners in order of ability, but they do little
beyond establishing that order. In other words, they do not give researchers as
much information about the extent of the differences between individual learn-
ers as interval data do. Ratings of students’ writing or speaking performance are

TABLE 1.3 How learners are rated and ranked

Learner Rating Rank

Heather Excellent 1
Tom Good 2
Phil Good 2
Jack Adequate 3
Mary Support Needed 4

often expressed numerically; however, that does not mean that they are interval
data. For example, numerical values can be assigned to descriptors as follows:
Excellent (5), Good (4), Adequate (3), Support Needed (2), and Major Support
Needed (1). Table 1.4 presents how the learners are rated on the basis of perfor-
mance descriptors.
The numerical scores in Table 1.4 may look like interval data, but they are not.
They are only numbers that represent the descriptor, so it would not make sense
to say that Tom scored twice as high as Mary did. It makes sense to say only that
his score is two levels higher than Mary’s. This becomes even clearer if the rating
scales are changed as follows: Excellent (8), Good (6), Adequate (4), Support Needed (2), and Major Support Needed (0). That would give the information in Table 1.5.
As can be seen in Tables 1.4 and 1.5, the descriptors do not change, but
the numerical scores do. Tom and Phil’s scores are still two levels higher than
Mary’s, but now their numerical scores are three times as high as Mary’s score.
This illustration makes it clear that numerical representations of descriptors are
only symbols that say nothing about the size of the intervals between adjacent
levels. They indicate that Heather is a better writer than Tom, but since they are
not based on counts, they cannot indicate precisely how much of a better writer
Heather is than Tom.
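The point that numerical codes for descriptors preserve rank order but not ratios can be sketched in Python, using the two alternative codings from Tables 1.4 and 1.5:

```python
# Two alternative numeric codings of the same descriptors
# (Tables 1.4 and 1.5); only the order of the codes is meaningful.
coding_a = {"Excellent": 5, "Good": 4, "Adequate": 3, "Support Needed": 2}
coding_b = {"Excellent": 8, "Good": 6, "Adequate": 4, "Support Needed": 2}

ratings = {"Heather": "Excellent", "Tom": "Good", "Phil": "Good",
           "Jack": "Adequate", "Mary": "Support Needed"}

# The rank order of learners is identical under both codings ...
order_a = sorted(ratings, key=lambda n: coding_a[ratings[n]], reverse=True)
order_b = sorted(ratings, key=lambda n: coding_b[ratings[n]], reverse=True)
print(order_a == order_b)  # True: the ordinal information survives

# ... but the apparent "ratio" between scores does not: Tom versus Mary
# is 4/2 = 2.0 under one coding and 6/2 = 3.0 under the other.
print(coding_a["Good"] / coding_a["Support Needed"])  # 2.0
print(coding_b["Good"] / coding_b["Support Needed"])  # 3.0
```

Any recoding that keeps the levels in the same order is equally legitimate, which is precisely why ratio statements about ordinal codes are meaningless.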
In L2 research, rating scale data are an example of ordinal data. These are
commonly collected in relation to productive tasks (e.g., writing and speaking).
Whenever there are band levels, such as A1, A2, and B1, as in the Common European Framework of Reference for Languages (see Council of Europe, 2001), or bands

TABLE 1.4 How learners are scored on the basis of performance descriptors

Learner Descriptor Numerical score

Heather Excellent 5
Tom Good 4
Phil Good 4
Jack Adequate 3
Mary Support Needed 2

TABLE 1.5 How learners are scored on a different set of performance descriptors

Learner Descriptor Numerical score

Heather Excellent 8
Tom Good 6
Phil Good 6
Jack Adequate 4
Mary Support Needed 2

1–9, as in the IELTS, researchers are dealing with ordinal data, rather than interval
data. Data collected by putting learners into ordered categories, such as ‘beginner’,
‘intermediate’, or ‘advanced’ are another case of ordinal data. Finally, ordinal data
occur when researchers rank learners relative to each other. For example, researchers
may say that in reference to a particular feature, Heather is the best, Tom and Phil
share second place, Jack is behind them, and Mary is the weakest. This ranking indi-
cates only that the first learner is better (e.g., stronger, faster, more capable) than the
second learner, but not by how much. Ordinal data can only provide information
about the relative strengths of the test takers in regard to the feature in question. The
final data type often used in L2 research (i.e., nominal or categorical data) does not
contain information about the strengths of learners, but rather about their attributes.

Nominal or Categorical Data


Nominal data (i.e., named data, also called categorical data) are concerned only
with sameness or difference, rather than size or strength. Gender, native language,
country of origin, experimental treatment group, and test version taken are typical
examples of nominal data (i.e., data on a nominal scale of measurement). In the
example of Heather, Tom, Phil, Jack, and Mary, the nominal variable of gender has
two levels (male and female), and there are three males and two females. In research,
nominal variables are often used as independent variables; in other words, variables
that are expected to affect an outcome. Independent variables, such as teaching
methods and types of corrective feedback on performance, can be hypothesized to
affect learning outcomes or behaviors, which are then treated as dependent variables,
as they depend on the independent variables. It should be noted that dependent
and independent variables are related to research design. The nominal variable
‘study-abroad experience’, with the levels ‘has studied abroad’ (Yes = coded 1) or
‘has not studied abroad’ (No = coded 0), can be used to split a sample of learn-
ers into two groups in order to compare the scores of learners with study-abroad
experience with the scores of learners without study-abroad experience.
Nominal data are often coded numerically to facilitate the use of spreadsheets.
Table 1.6 presents an example of how nominal data can be coded numerically.
As can be seen in Table 1.6, it does not matter which numbers are assigned to the
nominal data because the idea that one number is better than another is meaningless
in this case. Also, the numerical codes do not have a mathematical value in the way
that ratio, interval and ordinal data do. For example, it cannot be said that females
are better than males merely because the code assigned to females is 2 and the code
for males is 1. However, frequency counts of nominal variables can be made, which
do have mathematical values. For instance, for the variable ‘gender’, there are three
males and two females (i.e., 40% of the participants are female and 60% are male in
the data set).
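These frequency counts are straightforward to compute once the nominal values are coded; a short Python sketch, using the gender coding scheme from Table 1.6:

```python
from collections import Counter

# Gender codes for the five learners (1 = male, 2 = female),
# following the coding scheme in Table 1.6.
gender = {"Heather": 2, "Tom": 1, "Phil": 1, "Jack": 1, "Mary": 2}

# The codes themselves carry no magnitude, but frequency counts do.
counts = Counter(gender.values())
total = len(gender)

print(counts[1], counts[2])     # 3 males, 2 females
print(100 * counts[2] / total)  # 40.0 (% female)
```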
Nominal data are sometimes called categorical data because objects of inter-
est can be sorted into categories (e.g., men versus women; Form A versus Form

TABLE 1.6 Nominal data and their numerical codes

Nominal variables Numerical codes

Gender Male (coded 1), female (coded 2)
Native or nonnative speaker Native (coded 1), nonnative (coded 2)
Pass or fail Pass (coded 1), fail (coded 0)
Test form Form A (coded 1), Form B (coded 2), Form C (coded 3)
Nationality American (coded 1), Canadian (coded 2),
British (coded 3), Singaporean (coded 4), Australian
(coded 5), and New Zealander (coded 6)
First language English (coded 1), Mandarin (coded 2),
Spanish (coded 3), French (coded 4), Japanese (coded 5)
Experimental groups Treatment A group (coded 1), Treatment B group
(coded 2), Control group (coded 3)
Proficiency level groups Beginner (coded 1), Intermediate (coded 2),
High Intermediate (coded 3), Advanced (coded 4)

TABLE 1.7 Essay types chosen by students

Learner Type Coded

Tom Personal experience 1
Mary Argumentative essay 2
Heather Personal experience 1
Jack Process description 3
Phil Process description 3

B versus Form C). When a variable can only have two possible values (pass/
fail, international student/domestic student, correct/incorrect), this type of data
is sometimes called dichotomous data. For example, students may be asked to com-
plete a free writing task in which they are limited to three types of essays: personal
experience (coded 1), argumentative essay (coded 2), and description of a process
(coded 3). Table 1.7 shows which student chose which type.
The data in the Type column do not provide any information about one learner
being more capable than another. It only shows which learners chose which essay
type, from which frequency counts can be made. That is, the process description
and personal experience types were chosen two times each, and the argumenta-
tive essay was chosen once. How nominal data are used in statistical analysis for
research purposes will be addressed in the next few chapters.

Transforming Data in a Real-Life Context


In a real-life situation, raw data need to be transformed for a variety of reasons.
Take the common situation in which new students entering a language program

TABLE 1.8 The three placement levels taught at three different locations

Test score Placement level Location

0–20 Beginner City Campus
21–40 Intermediate Eastern Campus
41–60 Advanced Ocean Campus

TABLE 1.9 The students’ test scores, placement levels, and campuses

Student Test score Placement level Campus

Heather 51 Advanced Ocean
Tom 38 Intermediate Eastern
Phil 21 Intermediate Eastern
Jack 17 Beginner City
Mary 11 Beginner City

take a placement test consisting of, say, 60 multiple-choice questions assessing their
listening, reading, and grammar skills. Based on the test scores, the students are
placed in one of three levels: beginner, intermediate, or advanced. In addition, the
three levels are taught at three different locations, as presented in Table 1.8.
Table 1.9 presents the scores and placements of the five students introduced earlier.
The test scores are measured on an interval measurement scale that is based on
the count of correct answers in the placement test and provides detailed informa-
tion. It can be said that:

• Heather’s score is in the advanced range since her score is 11 points above the
cut-off, and her score is much higher than Tom’s, whose score was 13 points
lower than hers;
• Tom’s score is in the intermediate range, but it is close to the cut-off for the
advanced range, missing it by just three points;
• Tom’s score is far higher than Phil’s, with a difference of 17 points, yet both
scores are in the intermediate range;
• Phil’s score is just one point above the cut-off for the intermediate level, and
is only four points higher than Jack’s score. Despite the small difference in
their scores, Jack was placed in the beginner level and Phil was placed in the
intermediate level; and,
• Mary’s score is in the middle of the beginner level.

Because the information is detailed, the placement test can be evaluated criti-
cally. For example, Phil and Tom’s scores are 17 points apart whereas Phil and
Jack’s are only four points apart. Phil’s proficiency level is arguably closer to Jack’s
than to Tom’s. Yet, Phil and Tom are both classified as intermediate, but Jack is
classified in the beginner level. This is known as the contiguity problem, and it is

TABLE 1.10 The students’ placement levels and campuses

Student Placement level Campus

Heather Advanced Ocean
Tom Intermediate Eastern
Phil Intermediate Eastern
Jack Beginner City
Mary Beginner City

common whenever cut-off points are set arbitrarily: students close to each other
but on different sides of the cut-off point can be more similar to each other than
to people further away from each other but on the same side of the cut-off point.
Now imagine that there are no interval-level test-score data, but instead just the
ordinal-level placement levels data and the campus data, as in Table 1.10.
As can be seen in Table 1.10, the differences between Tom and Phil and the
problematic nature of the classification that were so apparent before are no longer
visible. The information about the size of the differences between learners has
been lost and all that can be deduced now is that some students are more profi-
cient than others. Tom and Phil have the same level of proficiency and Jack is
clearly different from both of them. This demonstrates why ordinal data are not as
precise as interval data. Information is lost, and the differences between the learn-
ers seen earlier are no longer as clear.
Highly informative interval data are often transformed into less informative
ordinal data to reduce the number of categories the data must be split into. No
language program can run with classes at 60 different proficiency levels; moreover,
some small differences are not meaningful, so it does not make sense to group
learners into such a large number of levels. However, setting the cut-off points is
often a problematic issue in practice.
While the ordinal proficiency level data are less informative than the interval
test-score data, they can be scaled down even further, namely to the nominal cam-
pus data (see Table 1.11).
If this is all that can be seen, it is impossible to know how campus assignment
is related to proficiency level. However, it can be said that:

• Tom and Phil are on the same campus;


• Mary and Jack are on the same campus; and
• Heather is the only one at the Ocean campus.

This information does not indicate who is more proficient since nominal data
do not contain information about the size or direction of differences. They indi-
cate only whether differences exist or not.
Transformation of types of data can happen downwards only, rather than
upwards, in the sense that interval data can be transformed into ordinal data and

TABLE 1.11 The students’ campuses

Student Campus

Tom Eastern
Mary City
Heather Ocean
Jack City
Phil Eastern

TABLE 1.12 Downward transformation of scales

Student Test score ⇒ Placement level ⇒ Campus

Heather 51 ⇒ Advanced ⇒ Ocean
Jack 17 ⇒ Beginner ⇒ City
Mary 11 ⇒ Beginner ⇒ City
Phil 21 ⇒ Intermediate ⇒ Eastern
Tom 38 ⇒ Intermediate ⇒ Eastern

ordinal data can be transformed into nominal data (e.g., by using test scores to
place learners in classes based on proficiency levels and then by assigning classes to
campus locations). Table 1.12 illustrates the downward transformation of scales.
Transformation does not work the other way around. That is, if it is known
which campus a learner studies at, it is impossible to predict that learner’s profi-
ciency level. Similarly, if a learner’s proficiency level is known, it is impossible to
predict that learner’s exact test score.
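The downward transformation in Table 1.12 can be sketched in Python, using the cut-offs from Table 1.8 (the function and variable names are ours, for illustration only):

```python
# Interval scores are collapsed into ordinal placement levels (Table 1.8),
# which in turn map onto a nominal campus label.
def placement_level(score):
    if score <= 20:
        return "Beginner"
    elif score <= 40:
        return "Intermediate"
    return "Advanced"

campus = {"Beginner": "City", "Intermediate": "Eastern", "Advanced": "Ocean"}

test_scores = {"Heather": 51, "Tom": 38, "Phil": 21, "Jack": 17, "Mary": 11}

for name, score in test_scores.items():
    level = placement_level(score)
    print(name, score, level, campus[level])

# The reverse mapping is impossible: knowing a learner is at the Eastern
# campus pins down only a score range (21-40), not an exact score.
```

Note that `placement_level` is many-to-one: several different scores produce the same level, which is exactly why the transformation cannot be reversed.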

Topics in L2 Research
It is useful to introduce some of the key topics in L2 research that can be examined
using a quantitative research methodology. Here, areas of research interests in SLA,
and language testing and assessment (LTA) research are presented.

SLA Research
There is a wide range of topics in SLA research that can be investigated using
quantitative methods, although the nature of SLA itself is qualitative. SLA research
aims to examine the nature of language learning and interlanguage processes (e.g.,
sequences of language acquisition; the order of morpheme acquisition; charac-
teristics of language errors and their sources; language use avoidance; cognitive
processes; and language accuracy, fluency, and complexity). SLA research also
aims to understand the factors that affect language learning and success. Such
factors may be internal or individual factors (e.g., age, first language or cross-
linguistic influences, language aptitude, motivation, anxiety, and self-regulation), or
external or social factors (e.g., language exposure and interactions, language and
socialization, language community attitude, feedback, and scaffolding). There are several texts that provide further details of the scope of SLA research (e.g., Ellis,
2015; Gass with Behney & Plonsky, 2013; Lightbown & Spada, 2013; Macaro,
2010; Ortega, 2009; Pawlak & Aronin, 2014).

Topics in LTA Research


LTA research primarily focuses on the quality and usefulness of language tests and
assessments, and issues surrounding test development and use (e.g., test validity,
impact, use and fairness; see Purpura, 2016, or Read, 2015, for an overview). Like
SLA research, LTA research focuses on the measurement of language skills and
communicative abilities in a variety of contexts (e.g., academic language purposes
such as achievement tests, proficiency tests, and screening tests, and occupational
purposes such as tests for medical professions, aviation, or tourist guides). The
term assessment is used to cover more than the use of tests to elicit language perfor-
mance. For example, assessment may be informally carried out by teachers in the
classroom. There are several books on LTA that consider the key issues: Bachman
and Palmer, 2010; Carr, 2011; Coombe, Davidson, O’Sullivan and Stoynoff, 2012;
Douglas, 2010; Fulcher, 2010; Green, 2014; Kunnan, 2014; Weir, 2003. While
there has been an increase in qualitative and mixed methods approaches in LTA,
quantitative methods remain predominant in LTA research. This is mainly because
tests and assessments involve the measurement and evaluation of language ability.
Like SLA researchers, LTA researchers are interested in understanding the internal
factors (e.g., language knowledge, cognitive processes, and affective factors), and
external factors (e.g., characteristics of test tasks such as text characteristics, test
techniques, and the task demands and roles of raters) that affect test performance
variation. SLA and LTA research are related to each other in that SLA research
focuses on developing an understanding of the processes of language learning,
whereas LTA research measures the products of language learning processes.

A Sample Study
Khang (2014) will be used to further illustrate how L2 researchers apply the prin-
ciples of scales of measurement in their research. Khang (2014) investigated the
fluency of spoken English of 31 Korean English as a Foreign Language (EFL)
learners compared to that of 15 native English (L1) speakers. The research partici-
pants included high and low proficiency learners. Khang conducted a stimulated
recall study with a subset of this population (eight high proficiency learners and
nine low proficiency learners). This study exemplifies all three measurement scales.
The status of a learner as native or nonnative speaker of English was used as a
nominal variable. ‘Native’ was not in any way better or worse than ‘nonnative’; it
was just different. The only statistic applied to this variable was a frequency count
(15 native speakers and 31 nonnative speakers). Khang used this variable to estab-
lish groups for comparison. Proficiency level was used as an ordinal variable in
this study. High proficiency learners were assumed to have greater target language
competence than low proficiency learners had, but the degree of the difference
was not relevant. The researcher was interested only in comparing the issues that
high and low proficiency learners struggled with. Khang’s other measures were
interval variables (e.g., averaged syllable duration, number of corrections per min-
ute, and number of silent pauses per minute, which can all be precisely quantified).

Summary
It is essential that quantitative researchers consider the types of data and levels of
measurement that they use (i.e., the nature of the numbers used to measure the
variables). In this chapter, issues of quantification and measurement in L2 research,
particularly the types of data and scales associated with them, have been discussed.
The next chapter will turn to a practical concern: how to manage quantitative data
with the help of a statistical analysis program, namely the IBM Statistical Package
for Social Sciences (SPSS). The concept of measurement scales will be revisited
through SPSS in the next chapter.

Review Exercises
To download review questions and SPSS exercises for this chapter, visit the Com-
panion Website: www.routledge.com/cw/roever.
2
INTRODUCTION TO SPSS

Introduction
There are a number of statistical programs that can be used for statistical analysis
in L2 research, for example, SPSS (www.01.ibm.com/software/au/analytics/spss/),
SAS (Statistical Analysis Software; www.sas.com/en_us/software/analytics/stat.
html), Minitab (www.minitab.com/en-us/), R (www.r-project.org/), and PSPP
(www.gnu.org/software/pspp/).
In this book, SPSS is used as part of a problem-solving approach to quantita-
tive data analysis. IBM is the current owner of SPSS, and SPSS is available in both
PC and MacOS formats. SPSS is widely used by L2 researchers, partly because its
interface is designed to be user friendly: users can use the point-and-click options
to perform statistical analysis. There are both professional and student versions of
SPSS. At the time of writing, SPSS uses a licensing system under which the user
has to pay to renew his/her license every year. It is advised that readers check
whether their academic institution holds an institutional license, under which
SPSS can be freely accessed by staff and students. Alternatively, readers could con-
sider PSPP, a freeware program modeled on SPSS.

Preparing Data for SPSS


In Chapter 1, the nature of quantification in L2 research was discussed. In this chap-
ter, four basic steps required to prepare the data for analysis using SPSS are outlined.

Step 1: Checking and Organizing Data


Once the data have been collected, the researchers check whether the data are com-
plete or whether there are missing data or responses from participants. Missing data
will reduce the sample size. The data should then be organized by assigning identity
numbers (IDs) to each participant’s data. IDs are important in that they allow the data
in SPSS to be checked against the actual data. If the research instrument requires scor-
ing (e.g., a test), the scoring will need to be completed and checked before the data
can be entered into SPSS. The data should be stored in a secure place.

Step 2: Coding Data


As discussed in Chapter 1, quantitative data can be categorized into ratio, interval,
ordinal, and nominal data. Researchers need to know the type of data that they
have obtained so that they can code the data appropriately for analysis. In some
cases, data may already be numerical and hence will not require coding (e.g., age,
test scores, grade point average, number of years of study, ratings or responses in
Likert-type scale questionnaires). These data can be used directly for data entry.
Researchers simply need to know the numerical data type to be able to analyze
them appropriately. Other types of data, especially nominal data, require coding.
For example, the numbers 1 and 2 can be assigned to male and female participants
respectively. Also, country codes for each participant can be assigned if the partici-
pants are from different countries (e.g., 1 = China, 2 = United States, 3 = Spain,
etc.). Once the data have been coded, they need to be entered into SPSS.
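Although this book performs coding in SPSS, the logic of the coding step can be sketched in a few lines of Python for readers who want to see the computation itself. All labels, participants, and code numbers below are invented for illustration:

```python
# A sketch of Step 2: replacing nominal labels with numeric codes.
# The labels and code numbers here are invented for illustration.
gender_codes = {"male": 1, "female": 2}
country_codes = {"China": 1, "United States": 2, "Spain": 3}

participants = [
    {"gender": "male", "country": "Spain"},
    {"gender": "female", "country": "China"},
]

# Each label is swapped for its numeric code, ready for data entry.
coded = [
    {"gender": gender_codes[p["gender"]],
     "country": country_codes[p["country"]]}
    for p in participants
]
print(coded)  # [{'gender': 1, 'country': 3}, {'gender': 2, 'country': 1}]
```

Keeping the code book (the two dictionaries) alongside the data file makes it possible to translate the numbers back into their categories later.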

Step 3: Entering Data Into SPSS


After the data have been organized into quantitative form, they can be typed into
SPSS or imported from a Microsoft Excel file into an SPSS spreadsheet (see the
“Importing Data From Excel” section). Data types need to be defined in SPSS, so
that they can be properly analyzed.

Step 4: Screening and Cleaning Data


Once data entry has been completed, the accuracy of the data entry needs to be
carefully checked. The issue of missing data also needs to be addressed. The data
in the SPSS file can be compared with the actual data one point at a time, or by
randomly checking a sample of the data. Screening can also be achieved through
an application of descriptive statistics. The number of items in the data set and the
minimum and maximum values in the data set may easily be found and compared
with those of the actual data. For example, if the maximum score for a question-
naire item is 5, but the maximum score detected in the SPSS file is 55, it is clear
that there was a mistake in data entry.
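The questionnaire example above (a maximum possible score of 5, but a 55 in the file) amounts to a simple range check, sketched here in Python with invented response values:

```python
# A sketch of Step 4: screening for out-of-range values.
# Invented responses to an item scored from 1 to 5;
# the 55 represents a data-entry mistake.
responses = [4, 5, 3, 55, 2, 1]

valid_min, valid_max = 1, 5
out_of_range = [x for x in responses if not valid_min <= x <= valid_max]

# The maximum alone already exposes the error.
print(min(responses), max(responses))  # 1 55
print(out_of_range)                    # [55]
```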

Important Notes on SPSS


First, SPSS deals with quantitative data. There is limited scope to enter words into
SPSS, and this should be avoided. For example, while it is possible to enter the
word ‘male’ in an SPSS spreadsheet for a learner’s value under the nominal vari-
able ‘gender’, it is more effective to enter ‘1’ to represent ‘male’, and ‘2’ to represent
‘female’, for example. This chapter will illustrate how to code data in SPSS.
Second, SPSS can produce a statistical analysis output as per researchers’ instruc-
tions, but the output can be ‘meaningful’ or ‘meaningless’ depending on the types of
data used and how well the characteristics of the scales discussed in Chapter 1 are
understood. For example, SPSS will quite readily compute the average of two nomi-
nal data codes, such as gender coded as ‘1’ for male and ‘2’ for female. However, it does
not make sense to talk about ‘average gender’. SPSS will not stop researchers from
performing such meaningless computations, so knowledgeable quantitative research-
ers need to be aware of what computations will produce meaningful, useful results.
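The pitfall can be made concrete with a short sketch (the codes below are invented, following the gender coding example in the text):

```python
from collections import Counter
from statistics import mean

# Gender coded 1 = male and 2 = female; a small invented sample of codes.
gender = [1, 2, 2, 1, 2]

# Statistical software will compute this average without complaint...
print(mean(gender))  # 1.6

# ...but 1.6 corresponds to no category at all. For nominal data,
# only frequency counts are meaningful.
print(Counter(gender))  # Counter({2: 3, 1: 2})
```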

Creating a Spreadsheet in SPSS


When using SPSS, the first thing to do is to create a spreadsheet into which data
can be entered. Data collected on the five learners in Chapter 1 (Tom, Mary,
Heather, Jack, and Phil) will be used to illustrate how to perform analysis in SPSS.

SPSS Instructions: Creating a Spreadsheet

Open SPSS.

Cancel the dialog offering to open an existing spreadsheet. A new,
blank spreadsheet will open (as shown in Figure 2.1).

There are two tabs at the bottom left-hand side of the spreadsheet (Data View and
Variable View). When a new file in SPSS is created, you will automatically be in

FIGURE 2.1 New SPSS spreadsheet



FIGURE 2.2 SPSS Variable View

Data View, and the data can be entered using this view. However, it is best to define
the variables that will be used first. To do this, click on the Variable View tab.
To illustrate how to define variables, the data from the five students in Table 1.9
in Chapter 1 will be used. There are four variables, namely student name, test
score, placement level, and campus. When the word ‘student’ is typed into a cell in
the Name Column, SPSS automatically populates the rest of the row with default
values (see Figure 2.2).
SPSS does not allow spaces in names. For example, ‘First Language’ (with a
space between the two words) cannot be typed into the Name Column, but
‘FirstLanguage’ (without a space) can. If a space is present in a variable name,
SPSS will indicate that the ‘variable name contains an illegal character’. Further
information on how to name variables can be found at: www.ibm.com/support/
knowledgecenter/SSLVMB_20.0.0/com.ibm.spss.statistics.help/syn_variables_
variable_names.htm.
In the second column (Figure 2.3), Type is automatically set to Numeric, which
means that only numbers can be entered into the spreadsheet for that variable. If
researchers wish to enter the names of the research participants, they need to be
able to enter words (SPSS calls variables that take on values containing characters
other than numbers 'string' variables). To do this, click on the cell in the Type Column, and then on
the blue square with ‘. . .’ that appears next to Numeric. When the Variable Type
dialog opens, choose ‘String’ and then click on the OK button (see Figure 2.4).
The variable type is now set to be a string variable. The column width in SPSS
is set to a default of eight characters, but this can be increased. Another column
that is optional but useful to fill in is Label (see Figure 2.5). Labels are useful when
abbreviations or acronyms are used as variables (e.g., L1 = first language; EFL =
English as a foreign language).

FIGURE 2.3 Type Column

FIGURE 2.4 Variable Type dialog

FIGURE 2.5 Label Column

For now, the other columns can be ignored (see further discussion in Chap-
ter 4). Each row in the Variable View (starting from 1) forms a variable column
in the Data View. Researchers can name a variable in each row (e.g., student and
score). Once added, the details of the ‘score’ variable can be adjusted to reflect its
characteristics (see Figure 2.6). To do that, click the cell in the Decimals Column
and adjust the number of decimals to zero since all the test scores are integers.
Then ‘Test Score’ can be entered as the variable label.
The number of decimal places used should not misrepresent the data. For
example, if the variable name is ‘gender’ (nominal data) and ‘1’ can be coded for
males and ‘2’ for females, decimals are not needed. However, for other data, there
is the possibility that there are digits after the decimal point (e.g., 3.49 and 3.50).
If the number of decimal places is set to zero, the score for 3.49 will be ‘3’, but for
3.50, it will be ‘4’. This can lead to a misrepresentation of the data, so choosing to
keep one or two decimal places will result in more accurate findings.
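The effect of the Decimals setting can be illustrated with Python's string formatting (an analogy for the display behavior described above, not SPSS itself):

```python
# Displaying two distinct scores with zero decimal places,
# as when the Decimals column is set to 0.
print(f"{3.49:.0f}")  # 3
print(f"{3.50:.0f}")  # 4

# Keeping two decimal places preserves the distinction.
print(f"{3.49:.2f}", f"{3.50:.2f}")  # 3.49 3.50
```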
Let us return to the data from the five students. Two more string variables still
need to be entered: placement (level) and campus. Note that the column width of
eight characters will not be enough for placement level entries since the word
'intermediate' has 12 characters, so 12 or higher is needed for the width of the Placement
Column. The final variable definition page appears as shown in Figure 2.7.
If the Data View tab is clicked, the program will return to Data View, which
is now set up for data entry (as shown in Figure 2.8). At this point the students’
names, their scores, their placement levels, and the campuses at which they study
can be entered.

FIGURE 2.6 Creating student and score variables for the Data View

FIGURE 2.7 Adding variables named ‘placement’ and ‘campus’

FIGURE 2.8 The SPSS spreadsheet in Data View mode



The following should be borne in mind at the data entry stage:

1. If a variable is numeric, only numbers can be entered (as noted earlier); no
letters or other nonnumeric characters are allowed. Every value in
that column must be a number.
2. String variables can contain any combination of letters, numbers, and special
characters.
3. SPSS is case sensitive, so it will consider ‘beginner’ and ‘Beginner’ as two
entirely different values. This can become an issue if a researcher later uses the
placement variable for calculations or asks SPSS to count how many begin-
ners there are. So how values of string variables are treated must be consistent.
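One defensive habit, sketched here in Python with invented entries, is to normalize case before any counting or comparison:

```python
from collections import Counter

# Placement levels typed inconsistently during data entry (invented values).
placement = ["beginner", "Beginner", "intermediate", "BEGINNER"]

# A case-sensitive count treats these as three different spellings
# of 'beginner' plus one 'intermediate'.
print(Counter(placement))

# Normalizing to lowercase first gives the intended counts.
normalized = [level.lower() for level in placement]
print(Counter(normalized))  # Counter({'beginner': 3, 'intermediate': 1})
```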

Once the data have been entered, SPSS can be used to conduct statistical anal-
ysis on them. Many different pieces of analysis can be done using SPSS. The
following section illustrates how to generate a list of students and their test scores,
placement levels, and campuses.

SPSS Instructions: Generating Case Summaries

Click Analyze, next Reports, and then Case Summaries (see Fig-
ure 2.9) to call up the Summarize Cases dialog.

FIGURE 2.9 Accessing Case Summaries in the SPSS menus



FIGURE 2.10 Summarize Cases dialog

In the Summarize Cases dialog shown in Figure 2.10, select all vari-
ables, then move them into the ‘Variables’ pane on the right-hand
side by clicking the arrow button to the left of that pane.

Do not worry about the Display cases options. Click on the OK but-
ton. A new output window opens, showing the SPSS output (see Figure 2.11).

FIGURE 2.11 SPSS output based on the variables set in the Summarize Cases dialog

In Figure 2.11, the second table, labeled Case Summaries, shows the names of the
students, their test scores, their placement levels, and their campuses. SPSS output
tables can be copied and pasted into a Microsoft Word document.

Saving and Naming an SPSS Data File


If you do not specify the location for an SPSS file, the computer will save it in the
Documents folder (on both Windows and MacOS computers). However, well-organized
researchers manage the locations and names of these files carefully. On a computer, a
folder for the research project should be created and within that folder, a sub-folder can
be created (e.g.,‘Data’). When a data file is named, the file name should be informative
and meaningful. For example, if a data file is named ‘Data.sav’, it may be difficult to
locate the file later on, especially when there are several research projects to complete.
Instead, it is recommended that the name of the project and the year the project began
be used (e.g., Motivation2017.sav, IELTS2018.sav).

Importing Data From Excel


There are circumstances in which data from an Excel spreadsheet will need to
be imported. This may occur, for example, when the file has been retrieved from
a third party. Bear in mind that a data set may be large (e.g., 50 variables and
500 participants). Hence, instead of entering the data manually into an SPSS
spreadsheet, which is time-consuming, researchers can import the Excel data file
into SPSS directly. It is important to make sure that the Excel spreadsheet to be
imported resembles an SPSS data spreadsheet in the following ways:

1. The first row in the Excel spreadsheet must contain the names of the variables;
there cannot be a heading or title row above it. All the other rows must be data.
2. All variable names must consist of letters or numbers; the only special charac-
ter allowed is the underscore.
3. The data for each person must be contained in a single row in the Excel file.
If data from the same participant is contained in two separate rows, SPSS will
consider it as coming from two different participants.
4. There can be no formulae, graphs, or results in the Excel spreadsheet, only
variable names and data.
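Rule 2 can be checked mechanically before importing. The sketch below applies a simplified version of SPSS's naming rules (letters, digits, and underscores, starting with a letter; SPSS's full rules allow a few more characters) to an invented header row:

```python
import re

# First row of a hypothetical Excel sheet intended for SPSS import.
header = ["StudentID", "Test Score", "L1", "self_rating"]

# Simplified SPSS-friendly pattern: a letter, then letters,
# digits, or underscores.
pattern = re.compile(r"[A-Za-z][A-Za-z0-9_]*$")

bad_names = [name for name in header if not pattern.match(name)]
print(bad_names)  # ['Test Score'] -> the space makes this name illegal
```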

If you have prepared the Excel spreadsheet as specified, importing it into SPSS
can be done as follows.

SPSS Instructions: Importing an Excel Spreadsheet

Click File, next Open, and then Data (see Figure 2.12).

FIGURE 2.12 SPSS menu to open and import data

FIGURE 2.13 SPSS dialog to open a data file in SPSS

In the dialog that opens, the default file type is SPSS (*.sav). Select Excel
(*.xls, *.xlsx, *.xlsm) as the file type (Figure 2.13).

FIGURE 2.14 Illustrated example of an Excel data file to be imported into SPSS

Select the Excel spreadsheet and click on the Open button. SPSS
then displays the dialog shown in Figure 2.15.

FIGURE 2.15 SPSS dialog when opening an Excel data source

Click on the OK button to create the SPSS spreadsheet. It is important to check
the Variable View to make sure that SPSS has interpreted the data correctly.

How SPSS Is Used in a Real Study


In order to illustrate how quantitative data can be transferred into an SPSS spread-
sheet, a study by Phakiti, Hirsh and Woodrow (2013) is used. Phakiti et al. (2013)
examined the relative influences of individual factors such as English language
proficiency, motivation, self-efficacy, personal value, and self-regulation on Eng-
lish language learning and the learning of academic English as a second language
(ESL) through the use of a structural equation modeling approach—a type of sta-
tistics used for testing hypotheses. The researchers used a questionnaire to collect
demographic data. There were 341 participants in this study. Figure 2.16 shows
the questionnaire.
In order to enter students’ responses into an SPSS file, coding was needed. In
Figure 2.16, the data on age, IELTS, and English Entry Test were ready for data
entry, but others required coding (e.g., nationality, gender, program, and stream).
The rules of coding discussed in Chapter 1 were used. For example, 21 codes were
assigned to nationalities (e.g., 1 = Chinese, 5 = Vietnamese, 10 = Cambodian,
etc.). Further, two codes were assigned for gender (1 = male and 2 = female).
Numbers were also assigned to the types of study programs and streams. Figure 2.17
shows an SPSS screenshot of the resulting data file. In this data file, missing data
were coded ‘99’. Note that the participants’ IDs have been blanked out to pre-
serve anonymity.

Student ID: _________________________ Name: ____________________________________


Email: _____________________________ Nationality: ______________________________
Age: ___________ Gender: [ ] Male [ ] Female
IELTS: ___________ English Entry Test : ________
Program: [ ] Standard (33/34 weeks) [ ] Standard (40 weeks) [ ] Extended (59 Weeks)
Stream: [ ] Science/Engineering and IT/Health Science [ ] Economics/Commerce
[ ] Arts/Media [ ] Music [ ] Visual arts and Design

FIGURE 2.16 The personal factor questionnaire on demographic information

FIGURE 2.17 SPSS spreadsheet that shows the demographic data of Phakiti et al. (2013)

According to Phakiti et al. (2013), the 341 ESL students were made up of 158
males and 179 females. Four participants did not report their gender. The majority
of the participants were from mainland China (N = 233). Their mean age was 19
with a standard deviation of 1.5.
Five personal factors were measured in this questionnaire: self-efficacy, personal
values, academic difficulty, motivation, and self-regulation. The questionnaire
to measure these factors comprised 61 items. Questions 1 to 5 asked

Part B: Self-efficacy (Items 5-14). 10 items; example: Item 5, "I can pass this program."
Scales: 0% sure I can do (= 0%); 25% sure I can do (= 25%); 50% sure I can do (= 50%); 75% sure I can do (= 75%); 90% sure I can do (= 90%); 100% sure I can do (= 100%)

Part C: Personal values (Items 15-22). 8 items; example: Item 18, "I am able to understand what I read."
Scales: No importance (= 1); Slight importance (= 2); Moderate importance (= 3); Great importance (= 4); Extreme importance (= 5)

Part D: Perceived difficulty (Items 23-30). 8 items; example: Item 30, "I understand what is required in assignments."
Scales: No difficulty (= 1); Slight difficulty (= 2); Moderate difficulty (= 3); Great difficulty (= 4); Extreme difficulty (= 5)

Part E: Motivation (Items 31-41). 11 items; example: Item 32, "I want my parents to be proud of me."
Scales: Not at all true of me (= 1); Slightly true of me (= 2); True of me (= 3); Very true of me (= 4); Totally true of me (= 5)

Part F: Self-regulation (Items 42-61). 20 items; example: Item 42, "I determine how to solve a problem before I begin."
Scales: Not at all true of me (= 1); Slightly true of me (= 2); True of me (= 3); Very true of me (= 4); Totally true of me (= 5)

FIGURE 2.18 The questionnaires and types of scales and descriptors in Phakiti et al.
(2013)

FIGURE 2.19 SPSS spreadsheet that shows questionnaire items of Phakiti et al. (2013)

participants to describe themselves as learners (short answers), but their responses
were not used in the study. Figure 2.18 summarizes the questionnaires and the types
of scales and descriptors used.
Figure 2.19 shows a section of the SPSS screenshot of this data file.

Summary
This chapter has provided details of the step-by-step procedures that need to be
followed in entering data manually into SPSS, importing data files from Excel,
and creating Case Summaries. In later chapters, relevant studies and examples in
L2 research are provided to help the reader contextualize the meaningfulness of
quantitative analysis and inferences made based on statistical results.

Review Exercises
To download review questions and SPSS exercises for this chapter, visit the Com-
panion Website: www.routledge.com/cw/roever.
3
DESCRIPTIVE STATISTICS

Introduction
This chapter presents the descriptive statistics that are used in quantitative research.
In L2 research, quantification is useful not only for describing the attributes of
an individual learner but also for showing how the attributes of different groups of
learners may differ. These differences can be handily summarized and highlighted using
descriptive statistics.

Quantification at the Individual Level


The first step in any quantitative study is to measure the feature of interest
(e.g., learners’ knowledge of vocabulary). To do this, a tool must be used to
quantify the strength of that feature. To measure learners’ vocabulary knowl-
edge, a vocabulary test that translates a learner’s knowledge of vocabulary into
a number can be used to generate numerical data for each research participant.
Any research tool needs to be carefully designed if the conclusions drawn from
the study are to be valid. Although no research process is completely flawless,
researchers can attempt to minimize any shortcomings through careful research
designs and by piloting the research instruments (see e.g., Brown, 2005; Dörnyei
& Taguchi, 2010; Mackey & Gass, 2015; Paltridge & Phakiti, 2015).

Quantification at the Group Level


Once data have been collected for each research participant, the data need to be
organized. The results could simply be listed. In Table 3.1, the first 50 scores (out
TABLE 3.1 IDs, gender, self-rated proficiency, and test score of the first 50 participants

Participant ID Gender Self-rated proficiency Total score

1 – advanced 86.11
2 female upper intermediate 61.11
3 male lower intermediate 41.67
4 female intermediate 69.44
5 male intermediate 75.00
6 male – 50.00
7 female advanced 90.74
8 male upper intermediate 77.78
9 female intermediate 72.22
10 female intermediate 75.00
11 male intermediate 63.89
12 female upper intermediate 44.44
13 male advanced 62.04
14 male intermediate 58.33
15 female intermediate 50.00
16 female advanced 80.05
17 female upper intermediate 71.30
18 female upper intermediate 63.89
19 female intermediate 50.00
20 female upper intermediate 80.56
21 male intermediate 63.89
22 female upper intermediate 71.03
23 female intermediate 72.22
24 male – 11.11
25 female upper intermediate 75.00
26 female intermediate 55.56
27 – upper intermediate 69.44
28 male upper intermediate 44.44
29 male upper intermediate 19.95
30 female intermediate 61.11
31 female upper intermediate 50.00
32 male upper intermediate 36.11
33 male upper intermediate 27.78
34 female intermediate 48.48
35 female intermediate 46.83
36 male upper intermediate 16.92
37 male upper intermediate 55.56
38 – upper intermediate 30.56
39 male upper intermediate 50.00
40 female advanced 83.33
41 male upper intermediate 75.00
42 male advanced 61.11

43 female intermediate 83.33
44 female upper intermediate 73.99
45 male upper intermediate 47.22
46 male upper intermediate 52.78
47 female upper intermediate 56.11
48 female upper intermediate 88.89
49 male upper intermediate 66.67
50 male upper intermediate 77.78

of a total of 267) from a run of Roever's web-based Test of English Pragmatics
(TEP; Roever, 2005, 2006) are provided, together with each participant's self-rated
proficiency level. These data will be used as an example throughout the book, and
you can find more details on TEP (including its sections and its item types) in
the Chapter 3 exercises on the Companion Website.
Researchers may want to draw conclusions regarding the participants from
the data in this table. They may wish to know how well this group of test
takers did, or how well the test distinguished between high-ability and low-
ability test takers. However, it would be difficult to answer these questions
based on this list of data. This data set came from just 50 test takers; some
large-scale tests, such as TOEFL or IELTS, have runs of hundreds of thousands
of test takers, the data for which are extensive. For this reason, researchers seek
to summarize data sets.
Ko’s (2012) study can be used to exemplify how data are quantified at the
individual and group levels. In an experimental study on the effect of the avail-
ability of glossaries on vocabulary learning, Ko (2012) assigned Korean learners
of English to three groups: the control group, which read an English text without a
glossary (i.e., vocabulary explanations) for unknown words; an experimental group
A, which read the same text with a glossary in Korean; and an experimental group B,
which read the same text with a glossary in English. She then gave all the students
a vocabulary test to find out which of the three groups had learned more vocabu-
lary. Ko (2012), therefore, used quantification in two steps:

1. She quantified each individual learner's knowledge of English vocabulary by
giving them a test.
2. She quantified the vocabulary knowledge at the level of her three groups by
averaging individual learners’ test scores.
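These two steps can be sketched abstractly in Python. The group names below mirror Ko's design, but all scores are invented for illustration, not Ko's data:

```python
from statistics import mean

# Step 1 has produced an individual vocabulary score for each learner
# (all scores here are invented for illustration).
scores = {
    "control": [12, 15, 11],
    "glossary_korean": [18, 20, 17],
    "glossary_english": [16, 19, 15],
}

# Step 2: quantify at the group level by averaging within each group.
group_means = {group: round(mean(s), 2) for group, s in scores.items()}
print(group_means)
# {'control': 12.67, 'glossary_korean': 18.33, 'glossary_english': 16.67}
```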

Frequency Counts
The simplest way to reduce a mass of data is to count how often individual values
or scores occur. For example, in the data set in Table 3.1, it is possible to count
how many male and female test takers there were. Table 3.2 presents the frequency
counts according to gender.
Table 3.2 shows that there were slightly more females than males. Counting
the frequency with which each value occurs is all that can be done with nominal
data, such as gender, first language, and country of origin.
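The gender counts in Table 3.2 can be reproduced in a few lines of Python (the list below simply encodes the counts of 22 males, 25 females, and 3 missing, not the actual data order):

```python
from collections import Counter

# The 50 gender values, reduced to their counts as in Table 3.2.
gender = ["male"] * 22 + ["female"] * 25 + ["missing"] * 3

counts = Counter(gender)
percents = {value: 100 * n / len(gender) for value, n in counts.items()}
print(counts["female"], percents["female"])  # 25 50.0
print(counts["male"], percents["male"])      # 22 44.0
```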
Frequency counts can also be used for an ordinal variable, such as self-assessed
proficiency level, and these show how many test takers self-assessed themselves as
being at the beginner, lower intermediate, intermediate, upper intermediate, and
advanced levels. Table 3.3 presents the frequency counts based on test takers’ self-
rated proficiency levels.
Table 3.3 shows that the majority of the test takers rated themselves as
upper intermediate, and about a third as intermediate. Frequency counts for each
score could also be computed. These are shown in Table 3.4.
While it summarizes the data somewhat, Table 3.4 does not reduce the volume
of information greatly. There is still much information to process. Score ranges

TABLE 3.2 Frequency counts based on gender

Gender Raw count Percent

Male 22 44.0
Female 25 50.0
Missing data 3 6.0
Total 50 100.0

TABLE 3.3 Frequency counts based on test takers' self-assessment of their English proficiency

Raw count Percent

Beginner 0 0
Lower intermediate 1 2.0
Intermediate 15 30.0
Upper intermediate 26 52.0
Advanced 6 12.0
Missing data 2 4.0
Total 50 100.0

TABLE 3.4 Frequency counts based on test takers' test scores

Scores Raw count Percent
11.11 1 2.0
16.92 1 2.0
19.95 1 2.0
27.78 1 2.0
30.56 1 2.0
36.11 1 2.0
41.67 1 2.0
44.44 2 4.0
46.83 1 2.0
47.22 1 2.0
48.48 1 2.0
50.00 5 10.0
52.78 1 2.0
55.56 2 4.0
56.11 1 2.0
58.33 1 2.0
61.11 3 6.0
62.04 1 2.0
63.89 3 6.0
66.67 1 2.0
69.44 2 4.0
71.03 1 2.0
71.30 1 2.0
72.22 2 4.0
73.99 1 2.0
75.00 4 8.0
77.78 2 4.0
80.05 1 2.0
80.56 1 2.0
83.33 2 4.0
86.11 1 2.0
88.89 1 2.0
90.74 1 2.0
Total 50 100.0

TABLE 3.5 Frequency counts based on test takers’ test score ranges

Range Raw count Percent Cumulative percent

0–10 0 0 0
10–20 3 6.0 6.0
20–30 1 2.0 8.0
30–40 2 4.0 12.0
40–50 11 22.0 34.0
50–60 5 10.0 44.0
60–70 10 20.0 64.0
70–80 11 22.0 86.0
80–90 6 12.0 98.0
90–100 1 2.0 100.0
Total 50 100.0

(i.e., 0–10, 10–20, 20–30, etc.) could be used instead, which in effect transform
interval-level raw scores into ordinal-level ranges. Table 3.5 is the frequency table
for the score ranges.
Table 3.5 is easier to understand than Table 3.4 because it indicates where the
clusters are located: there are many test takers in the 40–50, 60–70, and 70–80
ranges. Smaller groups are in the 50–60 and 80–90 ranges, and the other ranges
have even fewer test takers in them. SPSS produces a Cumulative Percent Column

TABLE 3.6 Test score ranges based on quartiles

Range Frequency Percent Cumulative percent

0–25 3 6.0 6.0


26–50 14 28.0 34.0
51–75 24 48.0 82.0
76–100 9 18.0 100.0
Total 50 100.0

that allows readers to see the overall distribution of test takers. Only about a third
(34%) have scores of 50% or lower, which implies that two thirds have scores
higher than 50% and indicates that this sample of the test takers did reasonably
well on this test.
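The binning and cumulative column behind Table 3.5 can be sketched with a short, invented score list (the full computation over all 50 scores works the same way):

```python
from itertools import accumulate

# Six invented scores to illustrate the 10-point ranges of Table 3.5.
scores = [11.11, 44.44, 50.00, 63.89, 75.00, 86.11]

# Each range is open at the bottom and closed at the top (e.g., a 50.00
# falls in the 40-50 range), mirroring the table as printed.
bins = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50),
        (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]
counts = [sum(1 for s in scores if lo < s <= hi) for lo, hi in bins]
print(counts)  # [0, 1, 0, 0, 2, 0, 1, 1, 1, 0]

# Running totals give the cumulative column.
cum_counts = list(accumulate(counts))
print(cum_counts)  # [0, 1, 1, 1, 3, 3, 4, 5, 6, 6]
print(100 * cum_counts[4] / len(scores))  # 50.0 -> half at or below 50
```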
These test results are skewed towards the higher score ranges. In a typical pro-
ficiency test (e.g., TOEFL, or a university designed placement test), it would be
expected that there would be equal numbers of test takers in the lower and upper
half of the score range; it would further be expected that most test takers would
be clustered around the 50% mark. The results of this group of test takers indicate
that either the test taker group was somewhat more proficient than expected,
or that the test might have been somewhat too easy for them. Generally speak-
ing, the assumption that most test takers cluster around the 50% mark is valid for
proficiency tests, but this assumption is not necessary for achievement tests, which
measure what students have learned in a course. In an achievement test, most of
the learners would be expected to fall into the high score ranges, otherwise they
would not have learned what they were supposed to, or the test could have been
too difficult.
Of course, how the score ranges are selected is somewhat arbitrary. It is com-
mon to divide the range of possible scores into four equal parts, or quartiles. The
test takers’ scores divided into these ranges are presented in Table 3.6.
The data as presented in Table 3.6 are easier to grasp than that presented in
Table 3.5. However, the frequency counts are still somewhat difficult to interpret.
A quicker, all-at-one-glance way of representing the data is to use graphs and
diagrams.

Graphs and Diagrams

Pie Chart
A simple diagram that is effective for displaying the relative sizes of a small number
of groups is the pie chart. For the gender information in Table 3.2, it would look
as displayed in Figure 3.1.
The pie chart in Figure 3.1 shows that there are slightly more female than male
test takers. Note that this pie chart ignores test takers who did not disclose their

FIGURE 3.1 A pie chart based on gender

FIGURE 3.2 A pie chart based on a 10-point score range (Number in each slice = the frequency count)

gender, so the percentages differ slightly from Table 3.2. Pie charts are effective
when there are few values that a variable can take on (e.g., 2–4 values). Displaying
the data of a 10-point score range is less effective, as illustrated in Figure 3.2, which
is based on Table 3.5.

In Figure 3.2, the value in each slice refers to the frequency count. The chart
contains too much information, and pie charts do not clearly show the ordering
of categories by size. In summary, pie charts are good as visual representa-
tions of frequency counts for nominal data for which there are only a few values
that the data can take on.

Bar Graphs
Bar graphs are suitable for the representation of ordinal data. For example, the bar
graph in Figure 3.3 makes it obvious that the majority of test takers were in the
score ranges between 40 and 90, and visually demonstrates the skewing of the data
towards the higher scores.
The bar graph in Figure 3.3 is a better representation than a long list of raw
data because it summarizes the overall picture of the test scores. However, it is
not portable, and it would be difficult to recall the entire graph. So the pre-
ferred method of summarizing interval data is the mean. Descriptive statistics,
which are the foundations of many statistical analyses in L2 research, are
now discussed.

FIGURE 3.3 A bar chart based on a 10-point score range



Measures of Central Tendency


Measures of central tendency give researchers an idea about where the center of the
data set is. There are three measures of central tendency: the mean, the median,
and the mode.

The Mean
The mean is a frequently used numeric representation of a data set. It is the average
of all the values in the data set, and it is easy to compute by dividing the sum of all
the scores by the total number of scores. In the case of the data from Table 3.1, the
sum of all scores is 2,995.77, so the mean is 2,995.77 ÷ 50 = 59.92.
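As a quick check, the same calculation can be done in a couple of lines of Python; the sum of 2,995.77 is taken from the text.

```python
total = 2995.77  # sum of the 50 test scores from Table 3.1
n = 50           # number of test takers
mean = total / n
print(round(mean, 2))  # 59.92
```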
The mean provides less information than the bar graph in Figure 3.3 does. It
confirms the impression that the score distribution is skewed towards the higher
score range because it is larger than 50, and this allows readers to conclude that
either the test is on the easy side for this sample, or that the sample is slightly more
capable than the test assumed. Although the mean does not show how the data
are distributed, which the bar graph shows readers at a glance, it does have two
great advantages:

1. The mean is a portable summary. Researchers do not have to recall all the
details of a graph; instead, they just need to remember a single number.
2. The mean allows calculations and easy comparisons. If researchers have a sec-
ond sample of test takers and they want to check which group performed bet-
ter overall, they can just compare the means of the two samples. For example,
if a second group of learners had taken the same test and obtained a mean
score of 53.12, it can be deduced that they are, on average, less capable than
the first group.

The Median
The mean has one major shortcoming: it is sensitive to outliers (i.e., extreme scores),
which are scores far above or below the rest of the sample. For example, consider
a sample of five test takers, the scores for which are shown in Table 3.7.

TABLE 3.7 Imaginary test taker sample with an outlier

Test taker ID Score (out of 100)

1 27
2 33
3 40
4 46
5 99

According to Table 3.7, the first four test takers have a mean score of 36.5 (i.e.,
(27 + 33 + 40 + 46) ÷ 4 = 146 ÷ 4 = 36.5). This indicates that this group is below aver-
age in its ability. But when the score of Test Taker 5 is added, the mean increases
by more than a third to 49 (i.e., (27 + 33 + 40 + 46 + 99) ÷ 5 = 245 ÷ 5 = 49). This
group of five test takers now appears to be of average ability. Accordingly, the mean
for the group of five does not reflect the fact that the majority of the test takers
achieved scores far below the overall mean; the inclusion of one test taker’s score
pulls the group average score up to an average level. Often quantitative researchers
consider removing extreme cases from their data set to avoid inaccurate findings.
To avoid distortion of the mean by outliers, another statistic is sometimes used,
namely the median. The median is the value that divides a data set into two
groups, so that half the participants have a value lower than or equal to the median,
and half the participants have a value higher than or equal to the median. In the
data set in Table 3.7, the median would be 40 (i.e., 27, 33, 40, 46 and 99). The
median itself does not have to occur in the data set. For example, for the data set
(27, 33, 41, 43, 46, 99), the median is 42, which is the average of the two middle
values. In the data set in Table 3.1, the median is 61.57, which also does not occur
in the data set.
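Python’s standard-library statistics module implements the same rule, including averaging the two middle values when the number of data points is even:

```python
import statistics

# Odd number of values: the median is the middle value.
print(statistics.median([27, 33, 40, 46, 99]))      # 40
# Even number of values: the median is the average of the two middle values.
print(statistics.median([27, 33, 41, 43, 46, 99]))  # 42.0
```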
In data sets with extreme values or outliers, the median can be more representa-
tive of the overall data set. The median is not very commonly used or reported in
applied linguistics or L2 research statistics, but is commonly found in research in
economics that investigates people’s incomes, for example. Imagine a community
that is for the most part in a lower-middle class income range, but contains a small
number of billionaires. The very high incomes of the billionaires will make the
overall community look much wealthier than it really is, and using the median
to represent overall income gives a more realistic picture of the typical income in
this community.

The Mode
The final descriptor of the central tendency of a data set is the mode. The mode
is the value that occurs most frequently in the data set. In order to illustrate the
mode, consider the following data set:

18, 29, 43, 43, 43, 58, 71, 71, 82

The mode can be seen to be 43, because it is the number that occurs most
often. In the larger data set in Table 3.4, the mode is 50 because the value 50
occurs most frequently (five times). In real research situations, it is possible for
a data set to have two or more modes. When there are two modes, it is called
bimodal; for example:

18, 29, 43, 43, 43, 58, 71, 82, 82, 82



This data set has two modes: 43 and 82. Bimodal data sets can sometimes be
suspicious because the sample may consist of some high-level and some low-level
learners, with few learners in between. This can affect some inferential statistics,
such as correlations and t-tests.
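In Python (3.8 and later), statistics.multimode returns all modes, which makes bimodal data sets like the one above easy to spot:

```python
import statistics

# One mode: 43 occurs three times, more often than any other value.
print(statistics.multimode([18, 29, 43, 43, 43, 58, 71, 71, 82]))      # [43]
# Two modes: both 43 and 82 occur three times.
print(statistics.multimode([18, 29, 43, 43, 43, 58, 71, 82, 82, 82]))  # [43, 82]
```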
Measures of central tendency have one major shortcoming. They do not indi-
cate the dispersion of the data (i.e., how they are spread out). To analyze the spread
of a data set, measures of dispersion are used.

Measures of Dispersion
Measures of dispersion give researchers an idea of how different from one another
the data points in a sample are, i.e., how much variability there is in the data.
Consider the following two data sets from test scores of two groups of students:

Group 1: 47, 48, 49, 51, 52, 53


Group 2: 12, 37, 41, 59, 73, 88

The mean test scores of the two groups are very similar (50 for Group 1,
51.67 for Group 2). However, the variability of the scores of the two groups is very different. In
Group 1, the scores range from a minimum of 47 to a maximum of 53, whereas in
Group 2, they range from a much lower minimum of 12 to a much higher maxi-
mum of 88. In other words, in Group 1, the scores are very similar to one another,
meaning that all the students in that group are similar in their knowledge of the
subject matter. In contrast, the scores of the students in Group 2 are much more
diverse, which indicates that it contains students with very little knowledge and
some with much more extensive knowledge. This information may be important
for a teacher or curriculum designer. For example, teaching Group 1 may not
need much internal differentiation as all students are likely to benefit from the
same materials. Group 2, however, is much more challenging to teach because
some of the learners need a lot of basic instruction, while others require very little.
The most frequently used measure of dispersion is the standard deviation (SD
or Std. Dev in SPSS). Conceptually, the standard deviation indicates how different
individual values are from the mean. The more scores are spread out, the larger
the standard deviation will be. In the most extreme case, if all research participants
have the same score, the standard deviation is 0 because there is no difference at all
between individual values and the mean.
By looking at the data for Groups 1 and 2, it is easy to see that the standard
deviation for Group 1 will be much smaller than that for Group 2, because the
data for Group 1 are clustered much more tightly around the mean. In fact, that
is the case. The standard deviations of the two groups’ scores are very different:

• SD for Group 1 = 2.37, suggesting that the data set is quite homogeneous.
• SD for Group 2 = 27.32, suggesting that the data set is highly heterogeneous.

By including the standard deviation along with the mean, readers can get a bet-
ter idea of the general shape of a data set. In the case of Groups 1 and 2, the results
may be reported as follows:

• Group 1: M = 50, SD = 2.37
• Group 2: M = 51.67, SD = 27.32

Now the two numbers provide fairly similar information to what can be seen
in a graph, but in a much more precise and compact way.
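The two standard deviations reported above can be verified with statistics.stdev, which computes the sample standard deviation (dividing by n − 1):

```python
import statistics

group1 = [47, 48, 49, 51, 52, 53]
group2 = [12, 37, 41, 59, 73, 88]

# Tightly clustered scores yield a small SD; widely spread scores a large one.
print(round(statistics.stdev(group1), 2))  # 2.37
print(round(statistics.stdev(group2), 2))  # 27.32
```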

Data Types and Scales of Measurement Revisited


Different types of data may need to be summarized in different ways. Nominal
data allow frequency counts only, including raw frequencies and percentages as
well as their visual representations (e.g., pie charts). The mode can be reported
as the category that occurs most commonly, but the mean, median and standard
deviation cannot be computed. If coding has been used (e.g., for the gender vari-
able, 1 for male and 2 for female), SPSS will compute a mean, but it is a meaningless
mean. Using numbers as codes does not convert the data from nominal to interval.
Interval data allow researchers to compute means and standard deviations, and
if they transform them to ordinal or nominal data (e.g., by using score ranges), it
is also possible to use frequencies to represent them. However, means and standard
deviations are the preferred representation for interval data because they are simple
and portable.
Ordinal data are a type of quantitative data with some unusual characteristics.
In statistical analysis, ordinal data can sometimes be treated as interval data,
while at other times they cannot. In research publications, it is not uncommon to
see means and standard deviations being computed from ordinal data (e.g., from
questionnaire data). Whether this is legitimate depends on several factors, and
there is much discussion in the literature about this. For example, Jamieson (2004)
argues against this practice, whereas Carifio and Perla (2007) make a strong case
for it. The current book follows Brown (2011) in concluding that it is sometimes
justified to treat ordinal data like interval data. Most of the data in L2 research are
of the kind for which such a treatment makes sense. For example, data from ques-
tionnaires in which learners respond to several statements using a Likert-type scale
may be treated as interval data. Figure 3.4 illustrates an example of questionnaire
items using a Likert-type scale: Strongly Agree (5), Agree (4), Neutral (3), Disagree
(2), Strongly Disagree (1).
A bundle of Likert-type scale items, such as the ones shown in Figure 3.4
is normally used to arrive at a score for an underlying variable (e.g., language
anxiety). So if some learners were to respond as in Figure 3.4, their raw score for
‘language anxiety’ would be 16 out of 20, and their mean score would be 4 (i.e.,
16 ÷ 4 = 4). When the scores for different Likert-type scale items are added up

Language Anxiety 5 4 3 2 1
Speaking English makes me nervous. X
I feel I can’t get my message across when I speak English. X
I worry that people won’t understand me when I speak English. X
I avoid speaking English whenever I can. X

FIGURE 3.4 An example of questionnaire items using a Likert-type scale

to obtain an overall score (also known as a composite), these overall scores can be
treated as interval or continuous data.
For example, Fushino (2010) used a Likert-type scale questionnaire to collect
information about six learner characteristics, including willingness to communi-
cate in L2 group work, self-perceived communicative competence in L2 group
work, beliefs about the usefulness of group work, and others. Each characteristic
was measured using between 6 and 20 items. For example, 10 items were related
to ‘willingness to communicate in L2 group work’. Each learner received a mean
score for each characteristic, which was obtained by adding up their individual
item scores (1–5) and dividing the sum obtained by the number of items measur-
ing that characteristic.
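The scoring described above can be sketched in a few lines. The item responses below are hypothetical, chosen to reproduce the 16-out-of-20 example from Figure 3.4:

```python
responses = [5, 4, 4, 3]  # one learner's answers to four 5-point Likert items

composite = sum(responses)             # raw (composite) score out of 20
mean_score = composite / len(responses)  # mean item score
print(composite, mean_score)  # 16 4.0
```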
Ordinal data should not be treated as interval data if researchers are dealing with
rankings of students (e.g., Tom is the best, Mary the second best, Jack the third
best) or discrete groups (e.g., beginner, intermediate, or advanced). With these
kinds of data, researchers can report only frequencies or a median rank. However,
if a piece of data is the result of adding up individual data points, ordinal data may
be treated as interval, and means and standard deviations can be computed.

Measures of Normal Distribution


Many statistical analyses assume that quantitative data, especially those from
large data sets, are normally distributed. A data set that has a bell-like curve can
be described as being normally distributed. Data that are not normally distributed
may be skewed, i.e., asymmetrically distributed. Normally distributed data can
also differ in how tightly their values cluster around the mean. The statistics that
describe these features of data distribution are known as skewness and kurtosis statistics.

Skewness Statistics
A skewness statistic describes whether more of the data are at the low end of the
range or the high end of the range. The greater the absolute value of a skewness statistic, the
more skewed the distribution of the data set is. A value of 0 indicates no skewness
at all because the data are symmetrical. Conservatively, statisticians recommend

that skewness values between ±1.00 suggest normally distributed data. In L2 data,
however, it is acceptable to use skewness values between ±2.00 as an indicator that
the data are generally normally distributed. Skewness values outside of the ±3.00
range are a warning sign that the data are highly skewed and hence some statistical
tests that require that the data be normally distributed may not be used.
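A minimal skewness computation is sketched below. Note that SPSS reports an adjusted (sample) version of this coefficient, so for small samples its value differs slightly from the population formula used here, although the sign and rough magnitude are the same.

```python
def skewness(xs):
    """Fisher-Pearson moment coefficient of skewness (population form)."""
    n = len(xs)
    m = sum(xs) / n                                    # mean
    s = (sum((x - m) ** 2 for x in xs) / n) ** 0.5     # population SD
    return sum(((x - m) / s) ** 3 for x in xs) / n

print(skewness([1, 2, 3, 4, 5]))   # approximately 0: symmetrical data
print(skewness([1, 1, 1, 2, 10]))  # positive: the tail points right
```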
Figure 3.5 shows the distribution of length of residence in a sample of 68 ESL
learners living in Australia. This variable was collapsed from an interval variable
to an ordinal variable (0–3 months, 3–6 months, 6–9 months, etc.). The graph is
bunched on the left-hand side; the data are then said to be positively skewed because
the tail points towards the positive side of the scale. This distribution is positively
skewed with a skewness statistic of +1.86.
The distribution of speech act scores shown in Figure 3.6, by contrast, is nega-
tively skewed with the values bunched up on the right-hand side of the scale, with
the tail pointing towards the negative side of the scale. The skewness statistic is
small at –0.60.
Figure 3.7 shows a distribution with very little skewness and with a low skew-
ness statistic of –0.03.
Exploring a data set in this way helps researchers understand whether there are
outliers or whether the characteristics of the sample are unexpected (e.g., there
may be clusters at the extremes, but few in the middle).

FIGURE 3.5 The positively skewed distribution of length of residence


FIGURE 3.6 The negatively skewed distribution of speech act scores

FIGURE 3.7 The minimally skewed distribution of implicature scores



Kurtosis Statistics
A kurtosis statistic describes how close the values in a data set are to the mean, and
whether the distribution is leptokurtic (i.e., tall and skinny) or platykurtic (i.e., wide
and flat). This is usually fairly obvious from the standard deviation, but a kurtosis
statistic gives a standardized value, which, like skewness statistics, should
conservatively fall between ±1.00, though values between ±2.00 are acceptable. Values outside the
±3.00 range suggest that the data set may violate the assumptions of paramet-
ric statistical tests, such as Pearson correlation analysis and ANOVA. To address
a research question through inferential statistics, skewness and kurtosis statistics
are of concern only if they are extreme. In the case of skewness in particular, an
extreme value may suggest that there are outliers with the potential to distort the
data set. How skewness and kurtosis statistics can be computed from SPSS will be
presented in the next chapter.
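The excess-kurtosis statistic can be sketched in the same way (again in the population form; SPSS applies a small-sample adjustment, so its values differ slightly for small n):

```python
def excess_kurtosis(xs):
    """Fourth standardized moment minus 3 (population form)."""
    n = len(xs)
    m = sum(xs) / n                           # mean
    var = sum((x - m) ** 2 for x in xs) / n   # population variance
    return sum((x - m) ** 4 for x in xs) / n / var ** 2 - 3

print(excess_kurtosis([1, 2, 3, 4, 5]))        # negative: platykurtic (wide and flat)
print(excess_kurtosis([1, 3, 3, 3, 3, 3, 5]))  # positive: leptokurtic (tall and skinny)
```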

Summary
A good quantitative study requires researchers to carefully examine the descrip-
tive statistics of their quantitative data set prior to any data analysis. This step
ensures that the characteristics of the data set are in order and in line with
expectations. Finally, descriptive statistics including skewness and
kurtosis statistics should be presented in a research report because they allow read-
ers to evaluate the basic nature of the quantitative data that are used to address
research questions.

Review Exercises
To download review questions and SPSS exercises for this chapter, visit the Com-
panion Website: www.routledge.com/cw/roever.
4
DESCRIPTIVE STATISTICS
IN SPSS

Introduction
Although descriptive statistics can be calculated manually using a calculator, it is
more efficient to use SPSS to compute them. This is especially true when a data
set is large and when complex statistics are needed to answer research questions.
This chapter shows how to compute descriptive statistics using SPSS. Before tack-
ling complex statistical analysis, it is important to have a grasp of how to compute
the simplest descriptive statistics. In this chapter, the data set presented in Fig-
ure 4.1 will be used to illustrate how to compute descriptive statistics. The data
file (Ch4TEP.sav) can be downloaded from the Companion Website for this book.
In this data file, it can be seen that some values of the gender variable are 99.
That value does not indicate a gender code, which can only be 1 or 2; rather, this
value is used to indicate missing data, and this chapter will show how to set this
up. First, however, how to assign values to nominal variables will be presented.
Figure 4.2 shows SPSS in variable view for this data file.

Assigning Values to Nominal Variables


Chapter 2 showed how to create a variable in an SPSS spreadsheet. In Ch4TEP.
sav, the first language and gender variables are nominal or categorical data because
numbers assigned to these variables are not true measures, but only represent a
category (e.g., 1 for males, 2 for females). The variable type can be selected in the
Measure Column, and in Figure 4.2, the ID and gender variables are labeled as
‘nominal’. By contrast, the age and total score variables are interval, so these are
labeled as ‘scale’ in the Measure Column. The selfrate (i.e., self-rating of profi-
ciency) variable is labeled ‘ordinal’.
FIGURE 4.1 Ch4TEP.sav (Data View)

FIGURE 4.2 Ch4TEP.sav (Variable View)


46 Descriptive Statistics in SPSS

While recent versions of SPSS can read strings, such as ‘male’, ‘female’, ‘Ger-
man’, ‘Thai’, and ‘English’ as values for nominal variables, it is more practical and
convenient for data entry purposes to use codes to represent them. In addition, SPSS
is strict about spelling: if you type ‘mael’ instead of ‘male’, or ‘Gemrn’ instead
of ‘German’, SPSS will interpret the misspelling as a new category, and the
subsequent analysis will not be accurate.
One disadvantage of using numbers to represent values of nominal variables is that
the numbers do not mean anything by themselves, so you need to program SPSS to
recognize what value is represented by a given number. This can be set up so that the
Values Column indicates what value is represented by a given number. Take gender
as an example, as illustrated in Figure 4.3, which shows the Value Labels dialog.

SPSS Instructions: Assigning Value Labels

Click the Values column of the gender variable. This column will be
activated and a blue button inside this column will appear.

Click on this blue button and a pop-up dialog will appear (see Fig-
ure 4.3).

Type ‘1’ in ‘Value’ and ‘male’ in ‘Label’. Then click on the Add but-
ton. Then repeat the same procedure for female. This time, type ‘2’.
Then click on the OK button to return to Data View.

FIGURE 4.3 Defining gender in the Value Labels dialog



FIGURE 4.4 Defining selfrate (self-rating of proficiency) in the Value Labels dialog

The same can be done for the selfrate variable (see Figure 4.4). To check which
code is used for each value of selfrate, return to this Value Labels dialog where
they will be listed. While performing the data entry, type ‘1’ if participants rated
themselves as ‘beginner’ and ‘2’ if participants rated themselves as ‘lower intermedi-
ate’, and so on.
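Outside SPSS, the same idea, storing compact numeric codes and translating them to labels only for display, can be expressed with a dictionary. The codes follow the text; the response list is hypothetical.

```python
from collections import Counter

GENDER_LABELS = {1: "male", 2: "female"}

coded = [1, 2, 2, 1, 2]  # hypothetical data entered as numeric codes
labelled = [GENDER_LABELS[c] for c in coded]
print(Counter(labelled))  # counts: female 3, male 2
```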

Assigning Missing Values


In Figure 4.1, there are cells showing ‘99’. As mentioned earlier, these numbers
indicate missing data (i.e., data that participants did not supply). Cells should not
be left blank, as then some computations in SPSS will not run properly.

SPSS Instructions: Assigning Missing Values

Click the Missing column of the gender variable. This column will be
activated and a blue button inside this column will appear.

Click this blue button and the Missing Values dialog will appear (see
Figure 4.5).

FIGURE 4.5 Defining missing values

You may select No missing values, Discrete missing values, or Range
plus one optional discrete missing value. In this illustration, tick the
Discrete missing values checkbox and type ‘99’. When you have finished,
click on the OK button to return to Data View.

In the case of the test score variable having a maximum score of 100, you should
not use ‘99’, but ‘999’ to define a missing value.
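The same convention can be honored in ordinary code by filtering out the missing-value code before computing statistics; the ages below are invented for illustration.

```python
MISSING = 99
ages = [16, 17, MISSING, 15, 18]  # hypothetical; 99 marks a nonresponse

# Exclude the missing-value code before any calculation.
valid = [a for a in ages if a != MISSING]
mean_age = sum(valid) / len(valid)
print(mean_age)  # 16.5
```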

Computing Descriptive Statistics


Figure 4.6 presents the SPSS menu for computing descriptive statistics. There are
several options, but this chapter illustrates only the two most commonly used ones:
Frequencies and Descriptives.

SPSS Instructions: The Frequencies Option

Click Analyze, next Descriptive Statistics, and then Frequencies (Fig-
ure 4.6). The Frequencies pop-up dialog will appear (see Figure 4.7).

Move the ‘gender’, ‘age’, ‘self-rating of proficiency’, and ‘total score’
variables from the pane on the left-hand side to the ‘Variables’ pane on
the right.

FIGURE 4.6 SPSS menu for computing descriptive statistics

Click the Statistics button and a pop-up dialog will appear (see Fig-
ure 4.8). Tick the following checkboxes: Mean, Median, Mode, Std.
deviation, Minimum, Maximum, Skewness, and Kurtosis. Then click on the
Continue button to return to the Frequencies dialog.

Click on the Charts button and a pop-up dialog will appear (see Fig-
ure 4.9). Only one chart type can be chosen. In this illustration, the
Histograms checkbox is selected with the Show normal curve on histogram
option. Then click on the Continue button to return to the Frequencies dialog.

The Frequencies dialog offers the Display frequency tables option. In
Figure 4.7, this checkbox is selected (by default) so that SPSS reports the
frequencies and percentages of the gender and selfrate variables for this
illustration. If frequency tables are not required, the checkbox can be
unticked.

Finally, click on the OK button.

FIGURE 4.7 Frequencies dialog

FIGURE 4.8 Frequencies: Statistics dialog



FIGURE 4.9 Frequencies: Charts dialog

TABLE 4.1 SPSS output on the descriptive statistics

                         Gender   Age     Self-rating of   Total score
                                          proficiency
N         Valid           47.00   50.00   48.00            50.00
          Missing          3.00    0.00    2.00             0.00
Mean                       1.53   16.42    3.77            59.92
Median                     2.00   16.00    4.00            61.57
Mode                       2.00   16.00    4.00            50.00
Std. Deviation             0.50    1.70    0.69            18.71
Skewness                  –0.13    0.03   –0.07            –0.67
Std. Error of Skewness     0.35    0.34    0.34             0.34
Kurtosis                  –2.07   –1.12   –0.15             0.14
Std. Error of Kurtosis     0.68    0.66    0.67             0.66
Minimum                    1.00   14.00    2.00            11.11
Maximum                    2.00   19.00    5.00            90.74

Several output tables will be produced. For the purpose of this chapter, not all the
tables are shown. Table 4.1 presents the descriptive statistics of the gender, age,
selfrate, and total score variables.
It should be noted that not all the information in Table 4.1 is useful. While
descriptive statistics make sense for the age and total score variables, they do not
make sense for the gender and selfrate variables, as discussed earlier. These two
variables were included to demonstrate that it is possible to calculate descriptive

TABLE 4.2 SPSS frequency table for gender

                  Frequency   Percent   Valid percent   Cumulative percent
Valid    male        22        44.00        46.80             46.80
         female      25        50.00        53.20            100.00
         Total       47        94.00       100.00
Missing  99           3         6.00
Total                50       100.00

TABLE 4.3 SPSS frequency table for the selfrate variable (self-rating of proficiency)

                             Frequency   Percent   Valid percent   Cumulative percent
Valid    Lower intermediate      1         2.00         2.10              2.10
         Intermediate           15        30.00        31.30             33.30
         Upper intermediate     26        52.00        54.20             87.50
         Advanced                6        12.00        12.50            100.00
         Total                  48        96.00       100.00
Missing  99                      2         4.00
Total                           50       100.00

statistics for all variables, but that frequency tables are more useful for nominal
variables.
In Table 4.1, the mean age of the participants was 16.42 (SD = 1.70). The
mean, median, and mode were similar, suggesting that the age data were normally
distributed. The skewness and kurtosis statistics for the age variable were 0.03
and –1.12 respectively, which are within the acceptable range for the assumption
of a normal distribution to be valid. The minimum and maximum test scores
were 11.11 and 90.74 respectively. The mean test score was 59.92 (SD = 18.71).
Although the mode was 50, this value occurred only five times in the data set, so
it did not greatly affect the data distribution. The score 75 occurred four times, so
the data were close to being bimodal. The skewness and kurtosis statistics (–0.67,
and 0.14 respectively) for the test scores were within the conservative limits of
±1.00. Finally, SPSS can also produce histograms. Tables 4.2 and 4.3 show the
frequency tables for the gender and selfrate variables.
Figure 4.10 shows the histogram for the selfrate variable, along with a normal
curve for the purpose of comparison.

SPSS Instructions: The Descriptive Statistics Option


The Descriptives option shown in Figure 4.6 can be used to compute descriptive
statistics.

FIGURE 4.10 A histogram of the self-rating of proficiency variable with a normal curve

Click Analyze, next Descriptive Statistics, and then Descriptives. The
Descriptives dialog will appear (see Figure 4.11).

Move the ‘gender’, ‘age’, ‘self-rating of proficiency’, and ‘total score’
variables from the pane on the left-hand side to the ‘Variable(s)’ pane on
the right. Note that the ID variable does not occur in this dialog because it
is defined as a string variable.

Click on the Options button and the Descriptives: Options dialog will
appear (see Figure 4.11). Tick the following checkboxes: Mean, Std.
Deviation, Minimum, Maximum, Skewness, and Kurtosis.

Then click the Continue button and then the OK button.



FIGURE 4.11 SPSS Descriptives options


It should be noted that the Descriptives option does not allow the calculation
of the median or the mode, nor does it produce frequency tables. The
Descriptives option is thus less informative than the Frequencies option, but if
frequency information is not required for your research, it is sufficient.

Graphical Displays

SPSS Instructions: Bar Graphs

Click Graphs, then Legacy Dialogs to find several options for the
graphical representation of data (see Figure 4.12).

To create a bar chart, click Bar. Choose Simple in the dialog that appears,
and then click Define (see Figure 4.13). In the Define Simple Bar . . .
dialog that pops up, move a variable of interest (e.g., ‘age’) from the pane on
the left-hand side to the Category Axis field in the pane on the right-hand side.

Click on the OK button.


Note: A chart title can be created by clicking the Titles button in the
top right-hand corner.
FIGURE 4.12 SPSS graphical options

FIGURE 4.13 SPSS bar option



SPSS Instructions: Pie Charts

To create a pie chart, click Graphs, next Legacy Dialogs, and then Pie
(see Figure 4.12). Choose Summaries for Groups of Cases in the dialog
that appears, and then click the Define button to call up the Define Pie . . .
dialog (see Figure 4.14). Note that there are two other options (Summaries of
Separate Variables and Values of Individual Cases) that are not presented here.

Move a variable of interest (e.g., ‘age’) from the pane on the left-hand
side to the Define Slices By field. Then click on the OK button.

FIGURE 4.14 SPSS pie option



SPSS Instructions: Histograms

To create a histogram, click Graphs, next Legacy Dialogs, and then
Histogram (see Figure 4.12) to call up the Histogram dialog. Move a
variable of interest (e.g., ‘total score’) from the pane on the left-hand side to
the Variable field (see Figure 4.15). If you wish to display a normal distribu-
tion curve, make sure to tick the Display normal curve checkbox.

Click on the OK button.

Figure 4.16 shows the histogram for the total score variable.

FIGURE 4.15 SPSS histogram option



FIGURE 4.16 The histogram for the total score variable

The Application of SPSS in a Real Quantitative Study


In order to illustrate how SPSS is used to calculate descriptive statistics, Phakiti
and Li’s (2011) study will be used. This study examined ESL postgraduate TESOL
students’ general academic difficulties—specifically their difficulties in reading
and writing. It provides a good illustration of how descriptive statistics can be
reported. In this study, 51 TESOL students answered a 5-point, Likert-type scale
questionnaire on their general academic difficulties, and their difficulties in read-
ing and writing. The scale descriptors were: 1 (not at all true of me); 2 (not true
of me); 3 (somewhat true of me); 4 (true of me), and 5 (very true of me). In the
original study, there was a total of 55 items, but only 25 of the items measured
students’ general academic difficulties and reading, writing, and academic facility
use (e.g., library, Information and Communications Technology (ICT), Learning
Center) difficulties. Table 4.4 presents the taxonomy of the questionnaire with
reported reliability estimates (Cronbach’s alpha is further discussed in Chapter 15).
Phakiti and Li (2011) decided to exclude the items that measured students’
approach to learning in their home country (Items 1 to 5) due to their low

TABLE 4.4 Taxonomy of the questionnaire and Cronbach’s alpha (N = 51) (adapted from
Phakiti & Li, 2011, p. 273)

Category                Subscale   No. of items                 Cronbach’s alpha
Learning Behaviors                 1, 2, 3, 4, 5                0.45
in Home Country
Academic Difficulties   General    6, 7, 8, 9, 10, 11, 12       0.76
                        Reading    13, 14, 15, 16, 17, 18       0.84
                        Writing    19, 20, 21, 22, 23, 24, 25   0.85
                        Facility   26, 27, 28, 29, 30           0.84
Overall reliability                Items 6–30                   0.93

TABLE 4.5 Example of item-level descriptive statistics (N = 51) (adapted from Phakiti &
Li, 2011, pp. 262–263)

Item   Minimum   Maximum   Mean   SD     Skewness   Kurtosis
1      2.00      5.00      3.78   0.78   –0.38      –0.01
2      2.00      5.00      3.98   0.81   –0.43      –0.28
3      2.00      5.00      3.59   0.78    0.09      –0.37
4      1.00      5.00      3.16   0.94    0.11      –0.48
5      2.00      5.00      3.51   1.14   –0.03      –1.40

reliability estimate (i.e., 0.45). Table 4.5 shows the descriptive statistics of five out
of 30 items.
When reporting descriptive statistics, minimum and maximum scores as well as
skewness and kurtosis statistics should be included. Skewness and kurtosis statistics
allow readers to evaluate whether the data for each variable were normally dis-
tributed. In the sample of items shown in Table 4.5, all the items have reasonable
skewness and kurtosis values.

Summary
This chapter has illustrated how SPSS can be used to perform basic analyses of
quantitative data using descriptive statistics. SPSS is practical as it allows research-
ers to handle a large data set, and the results it produces are reliable so long as
empirical data are entered into the data spreadsheet accurately. The next chapter
will present the concept of correlation in L2 research as well as how to perform
correlational analysis in SPSS.

Review Exercises
To download review questions and SPSS exercises for this chapter, visit the Com-
panion Website: www.routledge.com/cw/roever.
5
CORRELATIONAL ANALYSIS

Introduction
Correlation exists in many situations. For example, the further a car is driven, the
more fuel it will use, and the more the driver will have to spend on that fuel. In this
case, the distance driven and the amount of money spent on fuel would be said to
correlate: as distance increases, so does expenditure on fuel. Correlation describes the
relationship between variables, and this chapter introduces and explores correla-
tional analysis for L2 research.

Inferential Statistics in L2 Research


In L2 research, there are several interesting research questions that can enhance
researchers’ knowledge about language learning, use, assessment, and pedagogy. In
their quest to answer these questions, L2 researchers are often interested in establishing
whether two variables are related. For example, they may be interested in whether
there is a relationship between vocabulary knowledge and reading comprehension (as
in Guo & Roehrig, 2011). To do this, researchers may use a sample of learners with
different levels of vocabulary knowledge and different levels of reading comprehen-
sion. In the case of Guo and Roehrig, this investigation involved three steps:

1. The researchers quantified learners' vocabulary knowledge and their reading
   comprehension through the use of research instruments (e.g., vocabulary and
   reading comprehension tests).
2. They input the vocabulary and reading scores into an SPSS spreadsheet. The two
variables were (1) vocabulary test scores, and (2) reading comprehension test scores.
3. They computed the correlation coefficient between the two variables, and
   used it to evaluate the strength of the relationship between the two variables.

Other research questions that concern relationships between variables include:

1. What is the contribution of L2 vocabulary knowledge, L2 syntactic awareness,
   and metacognitive awareness to success in L2 reading comprehension (e.g.,
   Guo & Roehrig, 2011)?
2. What is the relationship between L2 accuracy, complexity and fluency among
advanced learners (e.g., Mora & Valls-Ferrer, 2012)?
3. Do learners’ perceptions of their teachers’ commitment to teaching affect
their own L2 learning motivation (e.g., Matsumoto, 2011)?
4. To what extent do iBT (Internet-based Testing) TOEFL scores predict aca-
demic success (e.g., Cho & Bridgeman, 2012)?

Statistical procedures that allow researchers to draw conclusions about a relationship
among variables through data analysis constitute inferential statistical analysis.
This chapter introduces one type of inferential statistics: correlational analysis.

Correlation: Quantifying the Strength of a Relationship


The correlation coefficient (r) is a measure of the strength of the linear relationship
between two variables. The correlation coefficient can take a value between –1
and 1. The numerical value expresses the strength of the relationship, with a
correlation coefficient of 1 (r = 1) indicating that the variables are directly proportional
to one another. If the correlation coefficient between distance driven and amount
spent on fuel is 1, for example, then doubling the distance driven will also double
the cost of fuel used. The amount spent on fuel would then be easy to calculate, if
the distance driven were known, and vice versa.
In L2 research, a correlation coefficient of 1 between vocabulary knowledge
and success in reading comprehension would mean that success in reading com-
prehension is entirely determined by vocabulary knowledge, so that an increase in
vocabulary knowledge would lead to a proportionate increase in comprehension.
However, a correlation coefficient of 1 in this example is unlikely, because reading
comprehension does not depend solely on vocabulary knowledge.
A correlation coefficient of 0 (r = 0) indicates that there is no relationship
between two variables. The two variables are completely independent of each
other, and one variable cannot be used to predict the other. If the correlation
coefficient between vocabulary knowledge and success in reading comprehen-
sion were 0, it could be inferred that reading comprehension and vocabulary
knowledge were entirely unrelated. This scenario is also unlikely because, theo-
retically and empirically, L2 reading and vocabulary are related to each other
to some extent (see e.g., Alderson, 2000; Qian, 2002; Read, 2000). It is more
likely that the correlation coefficient between these two variables lies between
0 and 1. For example, Guo and Roehrig (2011) found a correlation coefficient
of 0.43 between depth of L2 vocabulary knowledge and scores on the TOEFL
reading section.
Whether a particular correlation coefficient indicates a strong or weak relation-
ship may depend on various factors, including theoretical issues and the expectations
of the researchers. If researchers believe that vocabulary is essential to reading and
should account for most of the success in reading performance, a correlation coef-
ficient of 0.43 might seem low because they might have expected it to be 0.70 or
higher. However, if their stance is that reading comprehension is co-determined by
a range of other factors, such as background knowledge, metalinguistic knowledge,
and syntactic knowledge, they might have expected a lower correlation coefficient
of 0.30, for example, and 0.43 would then seem high to them.
The following is a general guideline about the strength of the correlation coef-
ficient (Cohen, 1988):

• 0.70 < r ≤ 1.00 = strong relationship,
• 0.40 < r ≤ 0.70 = medium relationship,
• 0.10 < r ≤ 0.40 = weak relationship.

Following this guideline, the effect of vocabulary on reading comprehension
would be considered medium, as the correlation coefficient that Guo and Roehrig
(2011) found was 0.43.
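Cohen's guideline can be turned into a small helper for reporting purposes. The sketch below is ours, not part of the book's SPSS workflow; the band boundaries follow the guideline above:

```python
def correlation_strength(r):
    """Classify a correlation coefficient using Cohen's (1988) guideline."""
    size = abs(r)  # the sign only indicates the direction of the relationship
    if size > 0.70:
        return "strong"
    if size > 0.40:
        return "medium"
    if size > 0.10:
        return "weak"
    return "negligible"

print(correlation_strength(0.43))   # medium (Guo & Roehrig, 2011)
print(correlation_strength(-0.82))  # strong
```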

Positive and Negative Correlations


The + (positive) and – (negative) signs that form part of correlation coefficients
indicate whether the variables are directly or inversely proportional to each
other. For instance, whereas the amount spent on fuel will increase with distance
driven, the number of avocados sold on a particular day will fall with an increase
in their price. In the former case, the correlation is positive, while in the latter it is
negative. The sign of a correlation coefficient indicates only the direction of the
correlation between the two variables. A positive correlation coefficient indicates
that the variables move in the same direction, so that an increase in
one variable is accompanied by an increase in the
other variable. As an increase in vocabulary knowledge leads to greater success in
reading comprehension, there is a positive correlation between these two variables.
Positive correlations are normally shown without the + sign, so when researchers
report ‘r = 0.43’, it is assumed to mean r = +0.43.
A negative correlation coefficient between two variables indicates that they
move in opposite directions, so that an increase in one variable is accompanied by
a decrease in the other variable. The following are examples in which the correla-
tion coefficients between the variables are likely to be negative:

• Learners’ general L2 proficiency and number of errors that they make when
completing a dictation test. This is because the higher learners’ proficiency
level is, the fewer errors they are likely to make.
• The amount of time learners take to read a text and their vocabulary knowl-
edge. This is because the more vocabulary learners know, the less time they
are likely to need to read a text.

Negative correlations are always shown with the – sign. For example, research-
ers report ‘r = –0.82’ or ‘r = –0.21’ in their journal articles.
Scatterplots are often used to visualize the direction of a correlation and the
strength of the relationship between two variables. In a scatterplot, the values of
the two variables are taken as coordinates on a pair of axes, and a dot is placed
for each data point. The closer the dots are to a ‘line of best fit’, the stronger the
correlation between the variables. The direction of the line indicates whether the
correlation is positive or negative:

• A line rising from the lower left-hand side to the upper right-hand side indi-
cates a positive correlation.
• A line falling from the upper left-hand side to the lower right-hand side indi-
cates a negative correlation.

Figure 5.1 presents a scatterplot (based on simulated data) that shows a perfect
positive correlation.
In Figure 5.1, a straight line can be drawn through all the dots. This scatter-
plot suggests that a value for Variable 1 can be predicted from the corresponding
value of Variable 2 with certainty. For example, in Figure 5.1, if the value of
Variable 1 is 20, then the value of Variable 2 is 60. The relationship between
the two variables is perfectly linear and deterministic. Finally, the line of best fit
goes from the lower left-hand side to the upper right-hand side, so the relation-
ship is positive. That is, as the values for Variable 1 increase, so do the values for
Variable 2.
FIGURE 5.1 A scatterplot displaying the values of two variables with a perfect positive
correlation of 1

FIGURE 5.2 A scatterplot displaying the values of two variables with a correlation coefficient of 0.90

Figure 5.2 shows a scatterplot that indicates a high but not perfect correlation
(the correlation coefficient is 0.90). It can be seen that the dots are fairly close to
the line of best fit, which rises from the left-hand side to the right. Predictions of
the value of one variable based on the value of the other can be made using the
line of best fit, but there will inevitably be some degree of error.

FIGURE 5.3 A scatterplot displaying the values of two variables with a correlation coefficient of 0.33
Figure 5.3 shows a scatterplot that indicates a much weaker correlation (the correla-
tion coefficient is only 0.33). An accurate prediction of the value of one variable
using the value of the other variable would be difficult to achieve. For example, if
the value of variable 2 is between 10 and 20, the corresponding value for variable
1 may lie anywhere between 4 and 10. However, there is still a noticeable correla-
tion between the variables.
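The predictions described here rely on the line of best fit, which is a least-squares regression line. A minimal NumPy sketch with simulated (not the book's) data:

```python
import numpy as np

rng = np.random.default_rng(42)
var1 = rng.normal(50, 10, 100)             # simulated values for Variable 1
var2 = 0.8 * var1 + rng.normal(0, 4, 100)  # strong but imperfect relationship

# A degree-1 polyfit returns the slope and intercept of the line of best fit
slope, intercept = np.polyfit(var1, var2, 1)

# Predict Variable 2 from a Variable 1 value of 55; because the correlation
# is below 1, this prediction carries some degree of error
predicted = slope * 55 + intercept
```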
Figure 5.4 shows a scatterplot that illustrates a perfect negative correlation coef-
ficient (r = –1). As with the case of a perfectly positive correlation (Figure 5.1), all
the dots lie on the line of best fit. However, in this case, the relationship between
the variables is inverse, so that a high value for Variable 1 would imply a low value
of Variable 2, and vice versa. Finally, Figure 5.5 shows a scatterplot for a data set
in which there is virtually no relationship between the variables (the correlation
coefficient is 0.06). For this data set, no reasonable prediction can be made for a
value of one variable from a value of the other.
FIGURE 5.4 A scatterplot displaying the values of two variables with a perfect negative
correlation coefficient of –1

Types of Correlation
To calculate the correlation between two variables, the nature of the variables
(interval/ordinal/nominal) needs to be taken into account, and this will determine
which correlation analysis should be used.

Interval-Interval Relationships
Most correlations encountered in L2 research are between variables that are both
interval, and the statistic used for this is the Pearson Product Moment correlation or
Pearson’s r. This is a parametric statistic, which requires that the distribution of each
variable in the underlying population from which the sample is taken must be
normal. The normal distribution will be discussed in Chapter 6. It is not appro-
priate to use Pearson’s r if the data are not interval. Also, outliers can distort the
value of Pearson's r, and it can become artificially inflated if there are clusters at the
extremes. Finally, Pearson's r does not always give usable results if either of the score
ranges is restricted (e.g., the data for one variable consist only of ratings 1–5). In
such a case, it may be more appropriate to use a nonparametric statistic, such as
Spearman's rho (as discussed in the "Interval-Ordinal or Ordinal-Ordinal Relationships"
section).

FIGURE 5.5 A scatterplot displaying the values of two variables with a low correlation
coefficient of 0.06
Pearson's r can be converted to the coefficient of determination (denoted by R2),
which is the square of the correlation coefficient, often expressed as a percentage. This coef-
ficient expresses the shared variance between the two variables, which refers to
the overlapping content between the two variables (e.g., vocabulary knowledge
and reading comprehension). If an r coefficient is 0.43 (as is the correlation
coefficient between vocabulary and success in reading comprehension in Guo
and Roehrig, 2011), the coefficient of determination will be 18.49% (i.e., R2 =
(0.43)2 = 0.1849 = 18.49%), which indicates approximately 18.5% of overlap
between vocabulary knowledge and reading comprehension. This figure can be
interpreted as showing that nearly one fifth of reading comprehension scores is
accounted for by vocabulary knowledge alone. That is a sizeable amount, but of
course four fifths will still be accounted for by other variables, such as metalin-
guistic knowledge.
The coefficient of determination is useful because it allows researchers to
quantify the extent of the relationship between variables. Being able to say that
nearly one fifth of the variance in vocabulary scores is shared with reading com-
prehension scores is easier to understand than to say that the correlation between
the variables is 0.43.
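The conversion from r to the coefficient of determination is a single squaring step, as a quick check of the Guo and Roehrig (2011) figure shows:

```python
r = 0.43  # correlation between vocabulary and reading (Guo & Roehrig, 2011)

r_squared = r ** 2                 # coefficient of determination
shared_variance = r_squared * 100  # expressed as a percentage

print(f"R2 = {r_squared:.4f}, i.e. {shared_variance:.2f}% shared variance")
# R2 = 0.1849, i.e. 18.49% shared variance
```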

Interval-Ordinal or Ordinal-Ordinal Relationships


If one or both of the variables are ordinal, Pearson’s r should not be used (see
the Ordinal Data section in Chapter 1). The same is true if the underlying
populations for one or both of the variables are non-normally distributed. An
alternative to Pearson’s r is Spearman’s rho, which is based on ranked data. If one
variable is interval, its data must be converted to ranked data before Spearman’s
rho can be used. The correlation between the ranked data for the two variables
is then computed. As with Pearson’s r, if data points that have high ranks on one
variable also have high ranks on the other, and those that have low ranks on one
variable also have low ranks on the other, the correlation is positive. If data points
that have high ranks on one variable have low ranks on the other, and those that
have low ranks on one variable have high ranks on the other, the correlation is
negative. If data points that have high ranks on one variable have a range of ranks
from low to high on the other, and vice versa, there is little or no correlation
between the variables.
Spearman’s rho is sometimes written with the Greek letter ρ or written out as
rho. It does not have a coefficient of determination, unlike Pearson’s r.
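Because Spearman's rho is Pearson's r computed on ranked data, the equivalence can be checked directly. A sketch with simulated data, assuming SciPy is available (with tied ranks, SciPy uses average ranks):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=60)
y = x + rng.normal(scale=0.5, size=60)  # related, so ranks largely agree

rho, p = stats.spearmanr(x, y)

# Converting both variables to ranks and correlating the ranks with
# Pearson's r reproduces Spearman's rho
r_of_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
```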
Spearman’s rho can be inappropriate if there are a lot of tied ranks (i.e., there are
many cases where the same rank has several data points in it). In that case, Kendall’s
tau is an alternative. Matsumoto (2011) used Kendall’s tau (after the Greek letter τ)
for this type of data. SPSS can compute values of both rho and tau. This book does
not present Kendall’s tau, as it is not commonly used in L2 research. The interested
reader may consult Phakiti (2014), which illustrates how to compute Kendall’s tau
and phi coefficients in SPSS, for more details.

Interval-Nominal Relationships
When data on one or both variables is nominal, correlations are not frequently
calculated. The most commonly encountered example of correlating interval with
nominal data in LTA research is the computation of the discrimination of a test item.
Discrimination here refers to a test item’s usefulness in distinguishing between
high- and low-ability test takers. For a mid-difficulty question, researchers would
expect high-ability test takers to be more likely to answer the item correctly than
low-ability test takers. To find out if that is the case, researchers can correlate test
takers’ item scores (0 or 1) for a particular question with individual total test scores
excluding the item under consideration. In this case, researchers would be cor-
relating a nominal variable with an interval variable (the total test score). This can
be achieved through the use of a point-biserial correlation. The point-biserial cor-
relation is, however, not covered in this book (see instead Chapter 11 in Phakiti,
2014).
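The discrimination computation described above can nevertheless be sketched with SciPy's point-biserial function; the simulated responses below are ours, not from an actual test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
ability = rng.normal(size=200)  # 200 simulated test takers

# A 10-item test: higher-ability test takers are more likely to score 1 (correct)
items = (rng.normal(size=(200, 10)) < ability[:, None]).astype(int)

item = items[:, 0]                     # 0/1 scores on the item of interest
rest_score = items[:, 1:].sum(axis=1)  # total score excluding that item

# Point-biserial correlation between the dichotomous item scores and the
# rest score; a discriminating item yields a clearly positive coefficient
r_pb, p = stats.pointbiserialr(item, rest_score)
```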

Interpreting Correlation
According to Guo and Roehrig (2011), for example, vocabulary scores and read-
ing comprehension scores correlate with a correlation coefficient of 0.43 and have
18.49% of shared variance. Statisticians often say that correlation is not causation.
Of course, causation is what quantitative researchers are ultimately interested in—
but merely because two variables systematically move in the same (or opposite)
direction does not necessarily mean that a change on one causes a change in the
other. The correlation coefficient only provides an idea of the strength and direc-
tion of the association between the variables; the exact nature of the relationship
between the variables has to be investigated in a different way.
To express the nature of correlation, it is said that two variables (such as
vocabulary knowledge and success in reading comprehension) ‘co-vary’ or ‘share
variance’. If one changes, the other also changes. Researchers also say that they
‘overlap’. Alternatively, they may say that ‘18.5% of success in reading comprehen-
sion is accounted for by vocabulary knowledge’. Expressing the relationship in
this way assumes a one-way relationship in which more vocabulary knowledge
implies better reading comprehension, but not necessarily the other way around.
To be able to make this claim, researchers need a good theoretical foundation
that supports their assumption that vocabulary knowledge supports reading com-
prehension. A different way of making the same claim is to say that vocabulary
knowledge explains 18.5% of success in reading comprehension. It may be implied
that vocabulary knowledge is the underlying factor and reading comprehension is
the outcome, which suggests a causal-like relationship.
Statisticians may be wary of causative explanations in correlational analysis as
there can be one or more underlying factors that explain both variables. For
example, suppose researchers take a random sample of teenagers aged between
12 and 18, give each of them the same IQ test, and then measure their shoe sizes.
A statistical analysis of the resulting data set may indicate that there is a correla-
tion between shoe size and IQ. However, age may explain the correlation as older
respondents are likely to have bigger feet and be able to score higher on the same
IQ test.
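The shoe-size example is easy to simulate: both variables are driven by age, so they correlate even though neither causes the other. The coefficients below are invented purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
age = rng.uniform(12, 18, 300)                     # teenagers aged 12 to 18
shoe = 3 + 0.4 * age + rng.normal(0, 0.6, 300)     # shoe size grows with age
iq_raw = 10 + 2.0 * age + rng.normal(0, 3.0, 300)  # raw IQ-test score grows with age

r, _ = stats.pearsonr(shoe, iq_raw)  # clearly positive, yet spurious

# Correlating the residuals after removing the age trend from each variable
# makes the association largely disappear
shoe_resid = shoe - np.polyval(np.polyfit(age, shoe, 1), age)
iq_resid = iq_raw - np.polyval(np.polyfit(age, iq_raw, 1), age)
r_partial, _ = stats.pearsonr(shoe_resid, iq_resid)
```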
Statistical Conditions for the Pearson Product Moment and
Other Correlations
There are at least six conditions that must be met before correlational analysis in
L2 research should be conducted.

1. Data on each of the variables must come from the same group of people. If
researchers wish to correlate the scores on an ESL grammar test from a group
of students with scores on a listening test, this can be done only if the same
students took the two tests.
2. Data must be of the appropriate type for the specific correlation coefficient
being calculated. The Pearson Product Moment correlation requires interval
data or data resulting from the combination of ordinal numbers or scores.
Spearman’s rho requires ordinal data, as does Kendall’s tau. The point-biserial
correlation requires nominal data on one variable and interval or ordinal
data on the other. If there is nominal data on both variables (e.g., gender and
native language), the chi-square test should be used. This test is discussed in
Chapter 13.
3. To use the Pearson Product Moment correlation, the underlying population
from which the sample is taken should be normally distributed. If not, other
correlations such as Spearman’s rho and Kendall’s tau should be considered.
4. For the Pearson Product Moment correlation, it is preferable that the data be
spread across a wide range. The greater the variance in the data set, the more
suitable is the Pearson Product Moment correlation.
5. The relationship between the variables should be linear. Drawing a scatterplot
is an effective way to see if the relationship between variables is linear. If it is
nonlinear, the Pearson Product Moment correlation is not appropriate, and
Spearman's rho is suitable only if the relationship is monotonic.
6. The paired variables to be correlated must not be dependent upon each other.
That is, researchers should not correlate scores on a subsection of an instru-
ment or a test, or even a single item with a total score, because the total score
is a result of the individual scores.
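Condition 3 can be screened before committing to Pearson's r. One common option (our choice here, not one the book prescribes) is a Shapiro-Wilk test alongside the skewness and kurtosis checks from Chapter 3:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
scores = rng.normal(70, 10, 50)  # simulated test scores for 50 learners

stat, p = stats.shapiro(scores)

# A p-value above 0.05 gives no evidence against normality, so Pearson's r
# is defensible; otherwise Spearman's rho is the safer choice
use_pearson = p > 0.05

skewness = stats.skew(scores)
excess_kurtosis = stats.kurtosis(scores)
```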

How to Interpret Statistical Output


Figure 5.6 presents the Pearson Product Moment correlation between two subsec-
tions of a grammar test, one focusing on prepositions, the other on verb tenses. In
this example, test scores from 104 test takers were correlated.
The Pearson Product Moment coefficient between English verb tenses and
prepositions is 0.719, or 0.72 to two decimal places, which is how correlation coefficients
are more commonly expressed.

Correlations

                                     verb tenses   prepositions

verb tenses    Pearson Correlation   1             .719**
               Sig. (2-tailed)                     .000
               N                     104           104
prepositions   Pearson Correlation   .719**        1
               Sig. (2-tailed)       .000
               N                     104           104

** Correlation is significant at the 0.01 level (2-tailed).

FIGURE 5.6 SPSS output displaying the Pearson product moment correlation between
two subsections of a grammar test

This relationship is statistically significant at 0.01 (p < 0.01) (to be discussed
further in the Probability and Statistical Significance section in Chapter 6). As
SPSS does not compute the coefficient of determination (R2), this needs to be
computed manually; it is 52% (as 0.72 × 0.72 = 0.5184 ≈ 0.52 to two decimal places).
In the SPSS output shown in Figure 5.6, both the significance level (which
will be discussed in Chapter 6), and the N-size of the sample (i.e., the number of
participants involved in the correlational analysis) are shown.
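The figures in this kind of SPSS output can also be reproduced with SciPy: pearsonr returns both the coefficient and the two-tailed significance level, and R2 is squared by hand, as noted above. The scores below are synthetic stand-ins for the 104 grammar-test records:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
verb_tenses = rng.normal(20, 5, 104)
prepositions = 0.7 * verb_tenses + rng.normal(0, 3.5, 104)

r, p_two_tailed = stats.pearsonr(verb_tenses, prepositions)

# SPSS does not report the coefficient of determination, so square r manually
r_squared = r ** 2
```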

SPSS Instructions for the Pearson and Spearman Correlations
This section illustrates how to perform the Pearson Product Moment and Spear-
man’s rho correlation between two variables in SPSS. There are three steps that
need to be followed:

1. Compute the descriptive statistics of the two variables to make sure that the
mean, median, mode, standard deviation, skewness, and kurtosis for each vari-
able of interest are within acceptable bounds (see Chapter 4 for the SPSS pro-
cedures for descriptive statistics). Examining descriptive statistics is a standard
practice prior to all inferential statistics. Recall that certain conditions need
to be fulfilled to be able to use a Pearson Product Moment. If the data set has
strong kurtosis, outliers, or a bimodal distribution, it might be better to use
the Spearman correlation rather than the Pearson Product Moment.
2. Draw a scatterplot between the two variables to determine whether the two
variables have a linear relationship.
FIGURE 5.7 A view of Ch5correlation.sav

3. Compute the Pearson Product Moment and/or Spearman correlation in SPSS.

To illustrate how to perform these two correlational tests, the file Ch5correla-
tion.sav will be used (downloadable from the Companion Website for this book).
This data set comprises the scores of 50 students who took an English proficiency
test that focused on listening, grammar, vocabulary, and reading skills. Figure 5.7
presents a screenshot of one of the worksheets in this data file.

Step 1: Compute Descriptive Statistics


Use SPSS to compute the mean, median, mode, standard deviation, skewness,
and kurtosis statistics for the data set (see Chapter 4). Table 5.1 presents these
descriptive statistics.
According to Table 5.1, all four scores were distributed acceptably (the skew-
ness and kurtosis statistics are within acceptable values, see Chapter 3), so the
Pearson Product Moment correlation is appropriate to explore the relationships
among these variables.
TABLE 5.1 Descriptive statistics of the listening, grammar, vocabulary, and reading scores
(N = 50)

                         Listening score   Grammar score   Vocabulary score   Reading score

N Valid                  50                50              50                 50
N Missing                0                 0               0                  0
Mean                     8.60              14.42           13.82              8.58
Median                   7.00              13.00           12.00              7.00
Mode                     6.00              13.00           10.00              7.00
Std. Deviation           4.41              6.51            6.33               4.58
Skewness                 0.79              0.97            0.86               0.91
Std. Error of Skewness   0.34              0.34            0.34               0.34
Kurtosis                 –0.57             –0.13           –0.12              –0.13
Std. Error of Kurtosis   0.66              0.66            0.66               0.66
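Outside SPSS, the same screening statistics can be produced with pandas; the ten hypothetical score pairs below merely stand in for the Ch5correlation.sav data:

```python
import pandas as pd

scores = pd.DataFrame({
    "listening": [6, 7, 9, 12, 5, 14, 8, 6, 7, 11],
    "grammar": [13, 12, 18, 21, 10, 25, 14, 11, 13, 19],
})

summary = pd.DataFrame({
    "mean": scores.mean(),
    "median": scores.median(),
    "sd": scores.std(),         # sample standard deviation, as SPSS reports
    "skewness": scores.skew(),
    "kurtosis": scores.kurt(),  # excess kurtosis, as SPSS reports
})
print(summary.round(2))
```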

Step 2: Draw a Scatterplot


To illustrate how to create a scatterplot of two variables to be correlated, the listen-
ing and grammar scores will be used.

SPSS Instructions: Scatterplots

Click Graphs, next Legacy Dialogs, and then Scatter/Dot (Figure 5.8).

In the dialog that appears, choose Simple Scatter and click the
Define button to access the Simple Scatterplot dialog. Move ‘Lis-
tening Score’ to the Y Axis field and ‘Grammar Score’ to the X Axis field
(Figure 5.9).

Click on the OK button.


Repeat to create scatterplots for other pairs of variables. In this
data set, six scatterplots need to be created.
FIGURE 5.8 SPSS graphs menu with Scatter/Dot option

FIGURE 5.9 Simple scatterplot options


FIGURE 5.10 A scatterplot displaying the values of the listening and grammar scores

Figure 5.10 shows the scatterplot obtained for the listening and grammar scores
variables.

SPSS does not produce a fit line by default. To add one, double-click
the scatterplot in the SPSS output. A new dialog will appear (see
Figure 5.11). In the Element menu of this new window, choose Add Fit Line at
Total.

According to Figure 5.12, the relationship between the listening and grammar
scores was linear and positive. It can also be seen that the R2 (coefficient of
determination) is 0.668, which indicates that the two variables share nearly 67%
of their variance.
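The same scatterplot-with-fit-line can be drawn outside SPSS with Matplotlib; the listening and grammar scores below are simulated, not taken from Ch5correlation.sav:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(9)
grammar = rng.normal(14, 6, 50)
listening = 0.6 * grammar + rng.normal(0, 2, 50)

fig, ax = plt.subplots()
ax.scatter(grammar, listening)

# Add the least-squares line of best fit, as SPSS's Add Fit Line at Total does
slope, intercept = np.polyfit(grammar, listening, 1)
xs = np.linspace(grammar.min(), grammar.max(), 2)
ax.plot(xs, slope * xs + intercept)

ax.set_xlabel("Grammar Score")
ax.set_ylabel("Listening Score")
fig.savefig("scatterplot.png")
```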

Step 3: Compute the Pearson Product Moment Correlation


Compute the Pearson Product Moment correlation between the listening and
grammar scores in SPSS. To illustrate the difference between Pearson Product
Moment and Spearman correlation, also compute the Spearman correlation.
FIGURE 5.11 Adding the fit line in a scatterplot

SPSS Instructions: Pearson and Spearman Correlations

Click Analyze, next Correlate, and then Bivariate. A Bivariate Correlations
dialog will appear (see Figure 5.13).

Move 'Listening Score' and 'Grammar Score' from the left-hand side pane
to the 'Variables' pane on the right. Tick the Pearson, Spearman, Two-tailed,
and Flag significant correlations checkboxes.
FIGURE 5.12 A scatterplot displaying the values of the listening and grammar scores
with a line of best fit added

FIGURE 5.13 SPSS Bivariate Correlations dialog


Click on the OK button.


Note: In SPSS, the Bivariate correlations option provides a number
of correlational tests.

TABLE 5.2 Pearson product moment correlation between the listening scores and grammar
scores

                                        Listening score   Grammar score

Listening Score   Pearson Correlation   1                 0.82∗∗
                  Sig. (2-tailed)                         0.00
                  N                     50                50
Grammar Score     Pearson Correlation   0.82∗∗            1
                  Sig. (2-tailed)       0.00
                  N                     50                50

∗∗ Correlation is significant at the 0.01 level (2-tailed).

TABLE 5.3 Spearman correlation between the listening scores and grammar scores

                                                           Listening score   Grammar score

Spearman's rho   Listening Score   Correlation Coefficient   1.00            0.73∗∗
                                   Sig. (2-tailed)                           0.00
                                   N                         50              50
                 Grammar Score     Correlation Coefficient   0.73∗∗          1.00
                                   Sig. (2-tailed)           0.00
                                   N                         50              50

∗∗ Correlation is significant at the 0.01 level (2-tailed).

The settings for Test of Significance and Flag significant correlations are preselected, and
these should be left unchanged. Chapter 6 will discuss tests of significance; at
this stage, it is sufficient to know that the default settings are appropriate.
Table 5.2 presents the SPSS output for the Pearson Product Moment correla-
tional analysis.
According to Table 5.2, the Pearson Product Moment correlation coefficient
was 0.82 (R2 = 0.67). For the purpose of comparison, Table 5.3 presents the SPSS
output for the Spearman correlational analysis.
In Table 5.3, the Spearman correlation coefficient was smaller than the Pearson
Product Moment coefficient (0.73 versus 0.82). This is because the Spearman
analysis ranked the variables before it analyzed them, and so some information
will have been lost. According to Tables 5.2 and 5.3, both the Pearson Product
Moment and Spearman correlations suggest a strong positive correlation between
the listening scores and the grammar scores (with correlation coefficients of 0.82
and 0.73, respectively).
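The loss of information from ranking is one source of such differences; sensitivity to outliers is another. A simulated illustration of how a single extreme case affects the two coefficients differently:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 0.8 * x + rng.normal(scale=0.6, size=50)
x[0], y[0] = 6.0, -6.0  # one extreme, discordant outlier

r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)

# The outlier drags Pearson's r down sharply, while Spearman's rho,
# which works on ranks, is far less affected
```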

How Correlations Are Used in a Real Study


Ockey, Koyama, Setoguchi, and Sun (2015) examined the relationships between
TOEFL iBT speaking performance and indicators of Japanese university stu-
dents’ ability to communicate orally in an academic English context. In this
study 226 Japanese university students participated. Five instruments were
employed: three university oral tasks (group oral discussion, picture and graph
description, and oral presentation), a TOEFL iBT speaking section, and a study
log on the amount of time students spent preparing for the oral presentation
task. The researchers employed Pearson Product Moment correlational analysis
to determine the relationships between TOEFL iBT and the communicative
ability indicators. The researchers reported the descriptive statistics of the data
(the mean, SD, minimum, and maximum, as well as skewness and kurtosis statis-
tics), as well as the Cronbach’s alpha reliability of the three university oral tasks
(discussed in Chapter 15).
The correlations between the TOEFL iBT speaking section and the overall
group oral discussion, picture and graph description, and oral presentation scores were
0.76, 0.73, and 0.68, respectively (p < 0.05), which the researchers evaluated as
being moderate to high. The researchers also reported the correlations between
the TOEFL iBT speaking section and other components of the three university
oral tasks such as pronunciation, fluency, lexis/grammar, and interactional com-
petence. The correlations were found to be moderate, ranging from 0.50 (lexis/
grammar in the oral presentation task) to 0.75 (lexis/grammar in the picture and
graph task) (p < 0.05). The researchers concluded that the TOEFL iBT speaking
test is suitable for predicting academic speaking ability, but also that it is better at
predicting some aspects of speaking ability than others.

Summary
This chapter has introduced correlation as a measure of the relationship between
two variables. There are different types of correlational analyses, which depend
on the nature of the variables (interval/ordinal/nominal). This chapter has pre-
sented how to compute Pearson Product Moment correlation and Spearman’s rho
correlation in SPSS, as well as the way in which a correlation coefficient should
be interpreted. The next chapter will further explore the concepts of inferential statistics.

Review Exercises
To download review questions and SPSS exercises for this chapter, visit the Com-
panion Website: www.routledge.com/cw/roever.
6
BASICS OF INFERENTIAL
STATISTICS

Introduction
Inferential statistics are used in L2 research to draw conclusions about a popula-
tion of interest from a sample of that population. This chapter focuses on the basic
notions of inferential statistics, including sampling, correlation coefficients, and
how researchers can use probability to quantify how likely it is that their conclu-
sions about the population of interest are correct. It is important that researchers
are fully aware of and disclose the limitations of their research, so they need to
understand the factors that limit the validity of their results. These factors include
the way in which samples are selected, sample size, and the strength of the effect.

Populations and Samples


Quantitative research not only examines descriptive statistics, but also employs
inferential statistics to address research questions. Descriptive statistics, as discussed
in Chapter 3, provide information about particular samples of the target popula-
tion. However, when researchers wish to make inferences about the characteristics
of a population using data drawn from a sample of that population, they use
inferential statistics. When the term population is used in statistics, it generally
refers to a particular population of interest. A population can be, for example, all
first-year undergraduate students at a specific university in a specific year, or all
English language teachers in Tokyo, Japan, or all ESL learners. The characteristics
of the population of interest are called parameters. Since researchers do not usually
have data from the entire population, they draw a sample from the population and
then use statistics to estimate these parameters. If these estimates are to be accu-
rate, researchers must ensure that the sample of the population they have used in
their research is representative of the population, so how these samples are selected
(sampling) is a critical part of the quantitative research process (see e.g., Scheaffer,
Mendenhall, Ott & Gerow, 2012).
In quantitative research, it is frequently desirable that random sampling be
employed. In this type of sampling, each member of the target population has an
equal chance of being chosen. A random sampling technique is highly desirable
when researchers aim to generalize their research findings from a sample study to
the wider population. In L2 research, random sampling can be difficult to achieve
and the samples are often selected on the basis of how convenient they are for
researchers to obtain. When researchers use easily obtainable participants for their
research (e.g., a group of students they are teaching), the sampling technique may
be described as convenience sampling. Convenience sampling is unlikely to lead
to a representative sample, which is a major drawback when researchers wish to
make inferences about the population of interest. This problem can be avoided
by narrowly defining the target population on the basis of the sample and hence
treating this group of learners as the population of interest (e.g., EFL students in
an English for engineering course at a Vietnamese university), but the results
of such a study will have limited scope for generalization and usefulness, and will
be beset by bias as the researchers have no guarantee that the selected participants
are representative of students who typically take the course. Some quantitative
researchers may describe their convenience sampling method as purposive (i.e.,
selective) sampling, which underlines the fact that their claims or generalizations
from their research findings will be limited to populations comprised of members
very similar to the actual sample.
In practice, a population of interest may be comprised of different proportions of
sub-populations, and researchers may need to adopt a sampling method that ensures
that those sub-populations are represented equally in the research sample. This is
known as stratified random sampling. For example, researchers may wish to ensure that
a sample includes equal numbers of high, intermediate, and low proficiency levels.
Researchers may first divide students into sub-groups based on their proficiency lev-
els, and then randomly choose equal numbers of participants from each sub-group
to form a total sample. This technique allows researchers to ensure that the sample
contains all proficiency levels, which may not be achieved by using a random sam-
pling technique. It is important to note the distinction between random sampling
and random assignment. Random assignment is a required condition for experimental
research (see Phakiti, 2014). When random assignment is employed, research par-
ticipants are randomly assigned into groups (e.g., experimental or control groups),
but these groups need to be equivalent in every respect except for the experimental
treatment, which is given to the experimental group only. For a further discussion
of sampling techniques in an applied linguistics context, see Blair and Blair (2015),
or Hudson and Llosa (2015) for an in-depth discussion.
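The contrast between simple random sampling and stratified random sampling can be sketched in a few lines of Python (the population size, proficiency levels, and stratum sizes below are invented for illustration):

```python
import random

random.seed(42)

# Hypothetical population of 300 learners, each tagged with a proficiency level
population = (
    [("low", i) for i in range(120)]
    + [("intermediate", i) for i in range(120, 220)]
    + [("high", i) for i in range(220, 300)]
)

# Simple random sampling: every learner has an equal chance of being chosen,
# so the proficiency mix of the sample is left to chance
simple_sample = random.sample(population, 30)

# Stratified random sampling: draw equal numbers from each proficiency stratum
stratified_sample = []
for level in ("low", "intermediate", "high"):
    stratum = [p for p in population if p[0] == level]
    stratified_sample += random.sample(stratum, 10)

print(len(simple_sample), len(stratified_sample))
```

The stratified sample is guaranteed to contain ten learners at each level, whereas the simple random sample may over- or under-represent any level by chance.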
It is important to stress that all sampling methods are prone to sampling error
or bias. That is, participants in a sample group can never perfectly represent the
population. Statistics are used as a tool to help researchers understand the char-
acteristics of the target population or research participants, but they are based on
probability analysis, so researchers cannot claim that their findings are absolute,
but merely likely. How this likelihood can be quantified will be seen later in this
chapter.

Sources of Errors in Statistical Analysis


The issue of statistical errors is critical as such errors can have serious consequences
when the results of statistical studies are used as the basis for decision-making.
For example, curriculum designers might increase the amount of time spent on
vocabulary training in the classroom because statistical studies have shown that a
wider vocabulary enhances reading comprehension. It would be a waste of time
and resources to conduct that training without a certain level of confidence in its
beneficial effect. Or university administrators might decide to use TOEFL scores
for admissions decisions because a study has used statistical tools to show that lan-
guage proficiency impacts academic performance. If there were no link between
TOEFL scores and subsequent academic performance, the test would not aid the
decision-making process.
It is, therefore, important to understand what may cause a statistical study to
produce erroneous results. First, there may be limitations imposed by the cho-
sen sample of participants. The sample may be too small, or it may have been
selected in such a way that makes it atypical of the wider population. Second,
the selection of research instruments may be influenced by researchers’ expecta-
tions regarding the outcome of their research. Third, research instruments can
never be fully reliable and accurate, so they cannot entirely measure the character-
istics under scrutiny. Fourth, researchers may simply misinterpret their findings.
Shadish, Cook, and Campbell (2002) have outlined the various threats to infer-
ences from quantitative research and how they may be countered. Specifically, the
effect of two factors, the sample size and the effect size (e.g., size of the correlation
coefficient), can be statistically computed as the significance level.

Probability and Statistical Significance


Researchers can estimate how likely it is that their findings are incorrect. The
index that shows this likelihood is known as the significance level and it is given as
a decimal (p < 0.05 or p = 0.032) or as a percentage (e.g., 5%). The significance
level is normally reported together with the statistical index that has been com-
puted; for example: r (the Pearson correlation coefficient) = 0.56, p = 0.02. This
can be read as: ‘there is a 2% chance that the findings are due to chance’. In other
words, there is a 2% likelihood that the finding holds for this specific sample only,
and that it would not be replicable if the study were run again with a different
sample.

TABLE 6.1 Correlation between verb tenses and prepositions in a grammar test

                                    Verb tenses    Prepositions

Verb tenses   Pearson Correlation   1              0.719**
              Sig. (2-tailed)                      0.000
              N                     104            104

** Correlation is significant at the 0.01 level (2-tailed).

In general, significance levels of p < 0.05 are acceptable in L2 research. In other
words, L2 researchers accept a 5% likelihood that their results are meaningless. A
significance level of p < 0.01 is also sometimes adopted. This lower value of sig-
nificance is more desirable, as a lower value indicates a lower probability that the
results of the research are not representative of the population of interest. However,
it is more difficult to attain than a higher value, as outlined further in the ‘Effect
Size and Sample Size’ section.
The correlation between the scores in two sections of the grammar test intro-
duced in Chapter 5 will be used to illustrate the concept of significance. Table 6.1
presents the correlation between the scores in the two sections.
The correlation between the scores in the verb tenses and prepositions sections
is high (r = 0.719), and from the significance level, it can be said that it is also sig-
nificant since the reported significance is below the standard level of 0.05. The SPSS
output in Table 6.1 seems to suggest that the significance value is 0.000 (i.e., that
the results are absolutely trustworthy, with no chance of error). That is, however,
illusory. SPSS displays three decimal places, so the significance level can only be said
to be less than 0.001 (p < 0.001), based on the SPSS output. Therefore, when report-
ing this result, it is important that researchers do not report p = 0.000. Instead,
the result should be reported as ‘the scores on the two test sections correlated strongly
(r = 0.719, p < 0.001)’. Researchers commonly report precise p-values when they
are larger than 0.001, and report p < 0.001 for values smaller than 0.001 (following
the Publication Manual of the American Psychological Association, sixth edition, hereafter
APA, 6th edition).
There are factors that affect how strictly a significance level can be set. In L2
research, these factors include sample size and effect size (e.g., the correlation coef-
ficient or coefficient of determination (R2), which was discussed in Chapter 5).

Sample Size
Researchers conduct empirical studies because they want to draw conclusions
about the population of interest. Not all L2 learners can be included in a study
because there are too many of them, so samples are taken instead (i.e., researchers
take groups of L2 learners they believe to be representative of the larger popula-
tion). Findings are generally more trustworthy if they are based on large samples,
but such samples may be difficult to obtain due to resource limitations; it is both
time-consuming and costly to recruit and administer a large number of partici-
pants. Large samples are generally preferable to small samples as small samples may
not be able to capture a sufficiently wide range of characteristics of the popula-
tion. For example, in a normal distribution, around 95% of the data will lie within
two standard deviations of the mean. If the sample is too small, it is likely that the
data at the extremes (e.g., that associated with exceptionally strong or exception-
ally poor students) will be underrepresented. If a sample size of 10 is used, for
example, it is impossible for the population to be accurately represented as choos-
ing no exceptional participants would be an underestimation, and choosing one
or more would be an overestimation. Figure 6.1 shows a normal distribution. The
students represented on the far right-hand side are the extremely strong language
learners, while the ones on the far left-hand side are the extremely weak ones; the
vast majority of students lie between these two extremes. All parametric statistics
(e.g., Pearson’s r, t-test, or ANOVA) assume that the target construct (e.g., language
ability) is normally distributed in the population from which the sample is drawn.

FIGURE 6.1 A normally distributed data set
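The property that around 95% of normally distributed data lies within two standard deviations of the mean can be checked with a quick simulation (a Python sketch; the mean of 50 and SD of 10 are arbitrary choices):

```python
import random

random.seed(1)

# Simulate a large sample from a normal distribution (mean 50, SD 10)
scores = [random.gauss(50, 10) for _ in range(100_000)]

# Proportion of scores within two SDs of the mean, i.e., between 30 and 70
within_2sd = sum(1 for s in scores if 30 <= s <= 70) / len(scores)
print(f"Within two SDs of the mean: {within_2sd:.3f}")
```

With a sample this large, the proportion comes out very close to the theoretical 95.4%; with a sample of 10, it could not.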


An adequately large sample size is a necessary but insufficient condition for
statistical inferences to be trustworthy. If a sample consists of 100 students, but
those students are not representative of the target student population, the sample
cannot be used to make statistical inferences regarding that population. However,
non-representativeness becomes less and less likely as the sample grows larger.
A sample size of 30 is often regarded as the minimum sample size. However, all
samples should be chosen to be representative of the population, irrespective of
the size of those samples. Researchers need to be aware of any limitations that the
sample size they have used may impose, and take those into account when select-
ing the particular statistical tests they use.

Degrees of Freedom (df)


The concept of degrees of freedom (df ) is a way of making a correction for
sample size in statistical calculations. This concept has greater implications for
the analysis of data derived from small sample sizes than for large ones. A good
understanding of degrees of freedom is critical when statistical analysis is cal-
culated by hand because researchers need to refer to the critical values table for
a particular statistical test (see e.g., www.statisticssolutions.com/table-of-critical-
values-pearson-correlation/). However, statistical analysis through SPSS can be
performed without a detailed knowledge of degrees of freedom, as long as the
sample size to be used is sufficiently large. An accessible in-depth discussion of
degrees of freedom is Eisenhauer (2008).
In statistical analysis, degrees of freedom play a largely historical role in estab-
lishing the significance of statistical results through the use of a critical values table,
which relate degrees of freedom and effect sizes to significance levels. Different
statistical tests require different formulas for degrees of freedom. Technically, the
degrees of freedom are the number of independent values on which researchers
base their inferences about the parameter of interest. For practical purposes,
in correlational analysis, the degree of freedom can be taken to be the difference
between the sample size and the number of variables, i.e., N – 2, where N is the
sample size and 2 is the number of variables associated with each element of the
sample. So if there are 100 participants in a study, there are 98 degrees of freedom
for examining the relationship between the two variables.
In comparative statistical analysis, such as an analysis of variance (ANOVA),
there are two different degrees of freedom to be considered. The first has to do
with the number of groups to be compared (df1 = k – 1, where k = the total num-
ber of groups). The second has to do with the sample size of each group (df2 =
N – k, where N = the total number of participants and k = the total
number of groups) in the data analysis. So if there are three groups of learners to
be compared and the study is based on a sample size of 100, df1 is 2 and df2 is 97.
The degrees of freedom here differ from the sample size of each group, as ANOVA
deals with comparisons. When using SPSS, researchers do not need to look up a
critical values table to determine whether a finding is statistically significant. SPSS
will flag whether a result is statistically significant and produce the df values for
researchers.
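The two formulas can be captured in a couple of lines (a Python sketch, using the N = 100, k = 3 example from the text):

```python
def df_correlation(n):
    # Correlational analysis: df = N - 2 (two variables per participant)
    return n - 2

def df_anova(n_total, k_groups):
    # One-way ANOVA: df1 = k - 1 (groups), df2 = N - k (total sample minus groups)
    return k_groups - 1, n_total - k_groups

print(df_correlation(100))  # 98
print(df_anova(100, 3))     # (2, 97)
```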

Effect Size
There are two issues associated with the effect size to be considered in inferential
statistics. The first has to do with the chance of detecting a relationship or differ-
ence through statistical analysis when such a relationship actually exists. This is
strongly influenced by sample size and is closely related to statistical significance
(e.g., p < 0.05). The second has to do with the magnitude of the effect size that
needs to be reported and interpreted in research findings. This is related to the
question of whether the relationship or difference is meaningful or has practical
relevance. Both considerations are discussed in the next section.

Effect Size and Sample Size


If large effects are being investigated, they can usually be detected in small sam-
ples. If two variables correlate very strongly, a large group is not required to
find that correlation. This can be compared to the detection of singing (a large
effect) and humming (a small effect) in a busy city square. On the one hand,
it is easy to establish that there is a group of people singing, even if there are
only a few of them. On the other hand, it may be hard to detect humming if
there are just a few people humming. Increasing the number of people hum-
ming increases the ease with which the humming can be distinguished from
the ambient noise, so that at some point it becomes clear that the humming is
coming from the people.
Analogous scenarios occur in quantitative analysis. For example, two indicators
of advanced vocabulary knowledge—knowledge of synonyms and knowledge of
collocations—were measured by means of two brief-response tests given to 127
learners and found to correlate at 0.806 (p < 0.001, Roever, 1995, unpublished
data). The correlation is strong, and even when five participants are randomly
selected from the sample, the correlation remains similarly strong (0.90 and 0.89
on another two analyses) and statistically significant. So, strong effects are clear
even when small samples are used. However, if the correlation is weak, it might
not be reliably detectable in a small sample because it is difficult to be sure that
the weak correlation found is due to the existence of a real relationship or
to random noise. The relationship between sample size and effect size can be seen
in Table 6.2.
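The singing-versus-humming intuition can also be simulated: draw tiny samples from a population with a strong correlation (r = 0.8) and from one with a weak correlation (r = 0.1), and compare the sample coefficients (a Python sketch; scipy is assumed to be available, and the populations are artificial):

```python
import random
from scipy.stats import pearsonr

random.seed(7)

def sample_r(pop_r, n):
    """Sample correlation from an artificial bivariate normal population
    whose true correlation is pop_r."""
    xs, ys = [], []
    for _ in range(n):
        x = random.gauss(0, 1)
        xs.append(x)
        ys.append(pop_r * x + (1 - pop_r**2) ** 0.5 * random.gauss(0, 1))
    return pearsonr(xs, ys)[0]

# Five tiny samples (n = 5) from each population
strong_small = [sample_r(0.8, 5) for _ in range(5)]
weak_small = [sample_r(0.1, 5) for _ in range(5)]
print("pop r = 0.8, n = 5:", [round(r, 2) for r in strong_small])
print("pop r = 0.1, n = 5:", [round(r, 2) for r in weak_small])
```

The samples from the strongly correlated population tend to show large coefficients even at n = 5, while those from the weakly correlated population scatter widely around zero, so the weak effect cannot be reliably distinguished from random noise without a much larger sample.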
The example in Table 6.2 also explains why researchers do not set significance
levels that are overly strict. Novices often think that it is good to be absolutely
certain, and wonder why the conventional significance level in L2 research is 0.05,
rather than 0.01, or even 0.001. The reason is that there is a trade-off between the

TABLE 6.2 Explanations of the relationship between the sample size and the effect

If researchers want to find

a small effect with a small likelihood of error, they need a large sample.
a medium effect with a small likelihood of error, they need a medium sample.
a strong effect with a small likelihood of error, they need a small sample.

significance level, sample size, and effect size. If a strict significance level is set, a
large sample will be required to draw conclusions, or only strong effects will be
detectable.
Given the interaction between sample size, effect size and significance, it is
impossible to say what the ‘perfect’, or even the ‘minimum’ sample size should be.
The general rule of thumb is that the sample should have at least 30 participants,
but this may not be necessary if the effect can be expected to be strong and the
significance level is liberal. Conversely, a much greater sample size may be required
when the effect is expected to be weak and a strict significance level has been set.
Some statistical procedures, especially highly complex ones, also often require large
samples to render stable results.

The Magnitude of the Effect Size as Practical Significance


Sometimes statistical findings that are significant at a p-value of less than 0.05
are not worthy of attention. For example, if a significant correlation coefficient
of –0.10 (p < 0.05) is found between test anxiety and reading test scores in a
large sample, that does not mean that test anxiety strongly affects students’ test
scores. In fact, R2 = 0.01, so that there is only 1% of shared variance between the
two variables. For this reason, it may not be worthwhile developing a program
to help students lower their test anxiety. However, if it is found that the correla-
tion coefficient between vocabulary knowledge and reading test scores is 0.70
(p < 0.05), the level of vocabulary knowledge clearly plays a part in determining
success in a reading test (49% shared variance). In both correlations, the p-value
is less than 0.05, but the effect sizes differ in terms of meaning and practical
significance.
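The arithmetic behind these two examples is simply squaring the coefficient:

```python
# Two significant correlations with very different practical significance
for r in (-0.10, 0.70):
    r_squared = r ** 2
    print(f"r = {r:+.2f} -> R^2 = {r_squared:.2f} ({r_squared:.0%} shared variance)")
```

A coefficient of –0.10 leaves 99% of the variance in test scores unexplained by anxiety, whereas a coefficient of 0.70 accounts for nearly half the variance in reading scores.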
In inferential statistics, it is, therefore, insufficient to merely report statistical
significance. Researchers are required to incorporate and interpret the associated
effect size of a particular statistical test. The correlation coefficient and R2, which
were introduced in Chapter 5, are measures of effect size. In the remaining chap-
ters, the effect sizes associated with specific tests will be presented and discussed.
For example, Cohen’s d is used as the effect size for t-tests (see Chapter 7). It is
important to note that APA (6th edition) recommends that all published statistical
reports include details of the effect sizes found for the tests conducted.

A Technical View on Statistical Significance


We have so far considered statistical significance as the likelihood that results are
correct, but it can also be viewed from a more technical perspective based on the
premise that empirical research is an exercise in hypothesis testing. In L2 research,
researchers do not usually formulate hypotheses, but rather formulate research
questions, such as ‘what is the correlation between reading ability and vocabu-
lary knowledge?’ This research question could be phrased as a hypothesis, such as
‘vocabulary knowledge and reading ability correlate.’
Once a research question has been phrased as a hypothesis, it becomes a yes/no
proposition. That is, either scores for reading ability and vocabulary knowledge
correlate, or they do not. These two possible outcomes can be viewed as two
hypotheses, known as the null hypothesis (H0) and the alternative hypothesis (H1):

• H0: There is no correlation between vocabulary knowledge and reading ability.


• H1: There is a correlation between vocabulary knowledge and reading ability.

The null hypothesis usually contains a word such as ‘no’ or ‘not’. In all statistical
investigations, these two hypotheses exist. What they imply is shown in Table 6.3.
Technically, in a statistical study, researchers test the null hypothesis using the
empirical data they have collected. They assume initially that the null hypothesis
is correct, and then conduct the study to test that assumption. Only if they are
certain beyond a reasonable doubt that the null hypothesis is incorrect do they
reject it.
Since to accept or reject is an either-or decision, significance is also an either-or
proposition. That is, either a result is significant or it is not. There is no middle
ground, and therefore, it is not possible to talk meaningfully about a result being
really significant, nearly significant, almost significant, or totally insignificant. It is
significant or it is nonsignificant. Those are the only two options. The significance
level of each finding of a study needs to be below the significance level that has
been set for the study. So at the beginning of the study, the researcher might be
satisfied with a likelihood of error of 5%, and therefore set the significance level
at the p-value of 0.05. That is known as setting the alpha level (this must not be

TABLE 6.3 The null hypothesis versus alternative hypothesis

The null hypothesis claims that        The alternative hypothesis claims that

there is nothing to be found,          there is something to be found,
there is no relationship,              there is a relationship,
there is no difference,                there is a difference,
there is no effect,                    there is an effect,
it is all the same, and                it is not all the same, and
it is all random.                      it is certainly not random.

confused with Cronbach’s alpha, which is used in reliability analysis). Any infer-
ential statistics, such as a correlation, need to be below 0.05 to be significant, but
since significance is an either-or proposition, it does not actually matter how far
below 0.05 the result is. For example:

• If a correlational analysis is conducted, and the significance of the result is less
  than 0.05, it is significant, regardless of what the actual significance level of
  the analysis is.
• If the p-value is larger than 0.05, the result is not significant, regardless of the
actual significance level.
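The either-or logic of these two bullet points can be stated as a tiny function (a Python sketch; the 0.05 alpha level is the conventional one discussed above):

```python
ALPHA = 0.05  # significance level set before the study begins

def decision(p_value):
    # Significance is an either-or proposition: a result is significant
    # only if its p-value falls below the pre-set alpha level
    return "significant" if p_value < ALPHA else "nonsignificant"

print(decision(0.002))  # significant
print(decision(0.049))  # significant
print(decision(0.051))  # nonsignificant
```

How far below (or above) alpha the p-value falls does not change the decision.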

These considerations explain why, traditionally, significance levels are not reported
to three decimal places (e.g., p = 0.029), but are reported only with regard to the
pre-set significance level (e.g., p < 0.05). From the point of view of statistical logic,
it makes more sense to report p < 0.05, but since computer programs provide sig-
nificance levels to a greater number of decimal places, it is becoming increasingly
common for researchers to report the p-value to three decimal places.

Types of Statistical Error


Researchers sometimes reject the null hypothesis even though it is true—in other
words, they find an effect in their sample that does not actually exist in the general
population. In that case, they have made a type I error (or false positive). Significance
in statistical analysis is exclusively focused on type I errors, and the significance
level describes the likelihood that researchers have committed a type I error.
Researchers can also incorrectly accept the null hypothesis (i.e., they can fail to
find an effect in their sample when there is one in the population). This is known
as a type II error (or false negative). Type II errors are usually due to sample sizes not
being large enough to detect a small effect.
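The meaning of the type I error rate can be demonstrated by simulation: if many studies are run in situations where the null hypothesis is in fact true, roughly 5% of them will still come out significant at the 0.05 level. A Python sketch (artificial population values; scipy is assumed to be available):

```python
import random
from scipy.stats import ttest_ind

random.seed(3)

# Simulate many studies in which the null hypothesis is true:
# both groups are drawn from the same population (mean 50, SD 10)
runs = 2000
false_positives = 0
for _ in range(runs):
    a = [random.gauss(50, 10) for _ in range(30)]
    b = [random.gauss(50, 10) for _ in range(30)]
    if ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(f"Type I error rate: {false_positives / runs:.3f}")  # roughly 0.05
```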
Until fairly recently, statisticians were less concerned with type II errors than
with type I errors, which led to something known as the file drawer problem. A lot
of potentially interesting studies that did not reach the accepted significance level
disappeared into file drawers, never to see the light of day again. This reflects the
tendency for researchers to report positive results, rather than negative ones.

Summary
Inferential statistics seek to interpret raw empirical data. To use them effectively,
researchers require logical reasoning and an understanding of statistical probabil-
ity. Good quantitative research can be appropriately conducted when researchers
understand the conceptual basics of inferential statistics (e.g., population and sam-
pling, probability, statistical significance, sample size, and effect sizes). The next
chapter further discusses inferential statistics by focusing on t-tests, which are used
to compare the mean scores of two samples.

Review Exercises
To download review questions and exercises for this chapter, visit the Companion
Website: www.routledge.com/cw/roever.
7
T-TESTS

Introduction
The statistical procedures discussed in Chapter 5 are designed for research-
ers to find relationships between variables. Such relationships are investigated
by asking research questions, such as ‘are vocabulary knowledge and reading
comprehension related?’ (e.g., Guo & Roehrig, 2011), or ‘is proficiency related
to use of collocations?’ (e.g., Laufer & Waldman, 2011). However, sometimes
researchers are not interested in relationships, but in differences. For example,
Doolan and Miller (2012) examined whether generation 1.5 writers make more
errors in their English essay writing than L1 writers. In Doolan and Miller’s
study, a group of generation 1.5 students (i.e., L2 speakers who have resided in
the target-language country for an extended period), and a group of L1 English
speakers wrote an essay based on the same prompt. Essays were rated, analyzed
for errors, and then ratings and mean numbers of errors were compared between
the two groups of writers. Kormos and Trebits (2012) investigated whether
modality (i.e., written versus spoken) affected task performance by L2 learners.
The researchers asked a group of EFL learners to describe a cartoon orally, and
then a month later asked them to describe a similar cartoon in writing. Learners’
accuracy, fluency, syntactic complexity, and lexical variety were measured and
analyzed for each description; the oral and the written descriptions could then
be compared.
In these two examples, the researchers were interested in differences: in the
first case between the number of errors made by L1 and generation 1.5 writ-
ers, and in the second between two descriptions produced by the same group of
learners, one oral and one written. To make such comparisons, researchers can
run a procedure known as a t-test. Two types of t-tests will be presented in this
chapter. In Doolan and Miller’s (2012) study, the researchers used a t-test known
as the independent-samples t-test because the performances of two different groups
of participants in the completion of the same task were compared. In Kormos and
Trebits’s (2012) study, however, the researchers used the paired-samples t-test because
two performances on two different tasks by the same group of participants were
compared. The paired-samples t-test is also called the dependent t-test. The paired-
samples t-test is related to a repeated-measures research design (hence it is also
called repeated-measures t-test). However, in this book, the term paired-samples t-test
is used, as it is consistent with SPSS.

The Independent-Samples T-Test


Researchers use the independent-samples t-test when they compare two dif-
ferent groups of research participants using measurements taken by means of
the same instrument (e.g., the participants of both the groups complete the
same essay task or answer the same motivation questionnaire). In Doolan and
Miller’s (2012) study, the two groups differed in a background variable in that
one group consisted of generation 1.5 writers, and the other group consisted
of L1 writers. The aim of the comparison was to determine whether the dif-
ference in background resulted in differences in writing performance. That
is, the researchers asked ‘does being a generation 1.5 writer impact essay per-
formance?’ In L2 research, two groups of participants comprise independent
samples when no member of one of the groups is also a member of the second
group.
In Doolan and Miller’s (2012) study, all the participants responded to the same
essay prompt, and the errors they made were identified, classified, and counted. As
Table 7.1 shows, the generation 1.5 writers produced more than twice as many
errors as the L1 writers did.
According to Table 7.1, it might be expected that the difference between the
groups was significant, and an independent-samples t-test confirmed that it was
indeed significant at the p-value of 0.05 (p < 0.001), with a t-value of 5.11. On
this basis, the authors concluded that generation 1.5 writers made significantly
more errors in their essay writing, and the likelihood that this conclusion would
be wrong was small, given the low p-value. Generally speaking, a significant result

TABLE 7.1 Mean and standard deviation of error counts for generation 1.5 learners and L1
writers (based on Doolan & Miller, 2012, p. 7)

Mean error SD

Generation 1.5 16.12 9.46


L1 English 7.20 3.96
94 T-Tests

for an independent-samples t-test implies that the group means are statistically
different. It could, therefore, be concluded that the difference in the background
variable (generation 1.5 status) affected the outcome measure (i.e., the error
count).
There is still the possibility that the background variable being used is a proxy
for another underlying variable that is the actual reason for the outcome. So, for
example, if an independent-samples t-test indicates significant differences in TEP
scores between test takers with and without residence, it might not be residence
itself that causes the difference in scores, but a host of associated factors, such as
higher proficiency going hand-in-hand with residence, or self-selection of high-
ability test takers going abroad. Which factors actually lead to this significant
difference cannot be answered simply by the use of the t-test, but requires further
thorough investigation.
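Although the book's analyses are run in SPSS, the same comparison can be sketched in Python with scipy for readers who want to cross-check results by another route. The error counts below are invented for illustration; they are not Doolan and Miller's data.

```python
# Independent-samples t-test on two invented sets of essay error counts.
# (Illustrative only -- not the Doolan & Miller data.)
from scipy import stats

group_1 = [14, 18, 22, 11, 25, 16, 19, 21, 13, 20]  # hypothetical group 1
group_2 = [6, 9, 5, 8, 7, 10, 4, 9, 6, 8]           # hypothetical group 2

t, p = stats.ttest_ind(group_1, group_2)
print(f"t = {t:.2f}, p = {p:.4f}")
```

A positive t here simply reflects that the first group's mean is higher; as noted later in this chapter, the sign depends only on the order of subtraction.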

The Paired-Samples T-Test


In an experimental study, a paired-samples t-test is used when researchers mea-
sure the same construct from the same group of people twice. For example,
researchers may ask students to retell one story in writing and another story in
speaking. Or researchers may measure students’ knowledge of vocabulary, give
them focused vocabulary training, and measure their vocabulary knowledge
again. In each case, the same group of students is asked to complete the same
task (or a parallel task) twice. What researchers try to detect is the effect of
different kinds of tasks (e.g., written versus oral), or the effect of a treatment
in a pretest-posttest experimental design (see Phakiti, 2014). Research ques-
tions may be ‘do students perform better in a written or an oral retelling?’, ‘do
students know more vocabulary after the training than they did before?’, and
‘do students’ IELTS listening scores differ significantly from their IELTS read-
ing scores?’
Researchers could also perform a correlation to answer these research
questions. That is, they could investigate the amount of overlap between par-
ticipants’ scores on a written retelling and on a spoken retelling. They could
also examine the differences between the scores on the pre-treatment vocabu-
lary test and those on the post-treatment vocabulary test. However, doing a
correlation analysis answers a fundamentally different research question than
a t-test does. The correlation analysis shows to what extent oral and written
retelling measure the same attribute, but the paired-samples t-test can deter-
mine whether or not oral and written retelling measures are equal in terms
of difficulty. Furthermore, the paired-samples t-test shows which task is more
difficult. Similarly, in the pretest-posttest experimental design, the correlation
analysis shows whether the pretest and the posttest measure the same attribute,

but the paired-samples t-test shows whether learners’ scores on the posttest
are significantly different from those on the pretest (e.g., whether the posttest
performance is higher). If the posttest scores are significantly higher than the
pretest scores, the researchers may be able to conclude that the experimental
treatment was the reason for the increase. An example of the use of a depen-
dent t-test is Kormos and Trebits’s (2012) study, in which the researchers gave
a group of 44 Hungarian high school students a cartoon description task and
a picture narration task, first as oral tasks, and a month later as written tasks
with no intervening treatment. The researchers then conducted comparisons
on measures of lexical variety, syntactic complexity, fluency, and accuracy
between:

1. the oral cartoon descriptions and the oral picture narrations;


2. the written cartoon description and the written picture narrations;
3. the oral cartoon descriptions and the written cartoon descriptions; and
4. the oral picture narrations and the written picture narrations.

All these comparisons involve paired-samples t-tests because it was always the
same participants providing data on both tasks. The researchers found, as Table 7.2
shows, that participants produced significantly more error-free clauses in their
written cartoon descriptions than in their oral cartoon descriptions.
In general, a significant result for the paired-samples t-test suggests that the
means for the two measures are significantly different (with less than a 5% chance
of error). From this finding, it may be concluded that the students found writing
a cartoon description easier than providing the description orally. To explain this
finding, it could be hypothesized that the offline nature of writing, the possibility
of revising and correcting errors, and the more formal atmosphere of a written
test setting may have led to a greater focus on accuracy, resulting in fewer errors.
However, the statistical result does not inform the researchers what the reason for
this outcome was, and researchers would have to do further research to pinpoint
what it was about writing that made it more accurate than speaking, at least for
this type of task.

TABLE 7.2 Mean and standard deviations of ratios of error-free clauses in the cartoon
description task for both modalities (adapted from Kormos & Trebits, 2012, p. 455, Table 3)

Mean SD t-value

Oral 0.75 0.11 t(43) = 3.27, p < 0.05


Written 0.81 0.08
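Outside SPSS, this design can be sketched with scipy's ttest_rel, the analogue of the paired-samples procedure. The accuracy ratios below are invented for ten hypothetical learners who each completed both an oral and a written task; they are not Kormos and Trebits's data.

```python
# Paired-samples t-test: two measurements from the same participants.
# Scores are invented for illustration; pairs are matched by position.
from scipy import stats

oral    = [0.70, 0.75, 0.62, 0.80, 0.68, 0.77, 0.73, 0.65, 0.71, 0.79]
written = [0.78, 0.80, 0.70, 0.85, 0.74, 0.83, 0.79, 0.72, 0.77, 0.86]

t, p = stats.ttest_rel(written, oral)
print(f"t = {t:.2f}, p = {p:.4f}")
```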

Assumptions of T-Tests
Both types of t-tests require interval or continuous data, and the corresponding
data for the population to which the findings are to be generalized need to be
normally distributed. In independent-samples t-tests, the sizes of the two samples
should not differ greatly, and neither should their variances. SPSS can be used to
check for the violations of this equal variances assumption in independent t-tests
through the running of Levene’s test (to be discussed in the SPSS section). SPSS can
provide a corrected t-test result if the variances differ too greatly. It is desirable that
all samples have at least 30 participants so that small differences may be detected
(as discussed in Chapter 6).
The t-test compares the means of the sample scores, taking into account the sample
sizes and the standard deviations of the scores. The t-test is likely to be significant if:

• the difference between the two means is large;


• the standard deviations of the two means are low; and
• the sample sizes are large.

Similar to the chi-square test (discussed in Chapter 12), the t-test produces a
value (simply known as t ) that can only be used to determine statistical significance;
it does not say anything about the size of the difference between the two mean
scores (i.e., the effect size). For that, researchers have to run a separate effect size
calculation to obtain what is known as Cohen’s d (discussed further in the ‘Effect
Size for T-Tests’ section). Since the t-test formula involves the subtraction of one
sample mean from the other, a negative t-test result can be found when the larger
mean is subtracted from the smaller mean. This is not problematic—it is the size
of the t-value that is important.
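These three factors can be seen directly in the t statistic itself. A minimal sketch (using the Welch form of the independent-samples formula, with arbitrary numbers) shows that holding the means and standard deviations constant while increasing the sample sizes raises the t-value:

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """t statistic for two independent means (Welch form, no pooled variance)."""
    return (m1 - m2) / math.sqrt(s1**2 / n1 + s2**2 / n2)

# Same mean difference (5) and SDs (10), but different sample sizes.
t_small = welch_t(55, 10, 15, 50, 10, 15)   # n = 15 per group
t_large = welch_t(55, 10, 60, 50, 10, 60)   # n = 60 per group
print(round(t_small, 2), round(t_large, 2))  # → 1.37 2.74
```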

The Effect Size for T-Tests


The size of the difference between groups is related to practical significance. That
is, merely finding that two means are significantly different at the significance value
of 0.05 does not imply that the difference is meaningful or worthy of attention. To
work out how large the difference is, an effect size measure needs to be calculated.
The most common of these measures is Cohen’s d, which is the difference between
the group means divided by the pooled standard deviation of the two groups.

Cohen’s d = (Mean 1 – Mean 2) ÷ pooled standard deviation

In the case of the error counts of generation 1.5 versus L1 students in Doolan
and Miller’s (2012) study, d was calculated as:

Cohen’s d = (16.12 – 7.2) ÷ 6.26 = 1.42.



According to Cohen (1988), a d-value of 0.8 or above is considered a large effect,
a d-value around 0.5 is considered a medium effect, and a d-value around 0.2 is
considered a small effect. Therefore, the d-value of 1.42 would be classified as a large
effect. This Cohen’s d showed that the mean error count of the generation 1.5
group was 1.42 (pooled) standard deviations above the mean error count of the
L1 group. It is worth noting that how effect size measures are classified as small,
medium or large depends on the field of study. Cohen’s classifications were mostly
designed with research in psychology in mind. After reviewing typical effect sizes
in L2 research, Plonsky and Oswald (2014) posit that for L2 research, 0.4 should
be considered small, 0.7 medium, and 1.0 large. There is no absolute benchmark
for how Cohen’s d effect sizes should be categorized, and their evaluation depends
to a large extent on researchers’ expectations.
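A minimal sketch of this calculation in Python. The shorthand formula above leaves the pooled standard deviation undefined; the version below uses the common sample-size-weighted definition, which also requires the two group sizes (n1, n2). The example numbers are arbitrary.

```python
import math

def cohens_d(m1, s1, n1, m2, s2, n2):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# With equal SDs and equal group sizes, the pooled SD equals that SD,
# so d = (10 - 7) / 2 = 1.5 (a large effect under Cohen's benchmarks).
print(round(cohens_d(10, 2, 30, 7, 2, 30), 2))  # → 1.5
```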
Unfortunately, SPSS does not provide the Cohen’s d effect size in its output.
To determine effect sizes, researchers either have to compute them by hand or use
a web-based calculator, such as Becker’s at www.uccs.edu/~lbecker/.
The following steps should be followed when using the t-test in a research study.

• Step 1: examine and evaluate the descriptive statistics of the data from two
groups and the reliability of the research instrument(s) being used.
• Step 2: check whether the statistical assumptions for the particular t-test are
met. Levene’s test can be used to determine whether the two means have
equal variances (SPSS can perform this statistical test; see the ‘SPSS Instruc-
tions: Independent-Samples T-test’ section).
• Step 3: perform the t-test using SPSS.
• Step 4: determine whether the two group means are statistically significantly
different (e.g., p < 0.05).
•	 Step 5: compute Cohen’s d if there is a statistically significant difference
between the two means.

How to Perform the Independent-Samples T-Test in SPSS


To illustrate how to perform the independent-samples t-test in SPSS, we will run
an analysis to determine whether residence in the target language country influ-
ences the use of routine formulae. Routine formulae are fixed expressions that
are tied to specific social situations; e.g., ‘nice to meet you’ is used when first
meeting someone. Previous research (e.g., House, 1996; Roever, 2012) has shown
that knowledge of routine formulae is strongly influenced by residence, and that
even learners with a low level of residence have a distinct advantage over learners
with no residence at all. We will investigate this issue with TEP data (available in
Ch7TEP.sav, which can be found on the Companion Website). The research ques-
tion is ‘do TEP test takers with residence in English-speaking countries score more
highly on the routines test section than test takers without residence?’

To address this research question, the independent-samples t-test can be used to
compare these groups’ scores on the routines test section.

SPSS Instructions: Independent-Samples T-test

Click Analyze, next Compare Means, and then Independent-Samples
T Test (see Figure 7.1).

Clicking Independent-Samples T Test calls up a dialog in which
variables can be selected (see Figure 7.2).

FIGURE 7.1 Accessing the SPSS menu to perform the independent-samples t-test

FIGURE 7.2 SPSS dialog for the independent-samples t-test

In the Independent-Samples T Test dialog, move ‘Routines score’ to
the Test Variable field, then move ‘collres2’ (collapsed residence as
yes/no) to the Grouping Variable field.

Click on the Define Groups button to tell SPSS how the two groups
are defined. In the resulting dialog, enter ‘0’ for Group 1 and ‘1’ for
Group 2 (see Figure 7.2).
Note: Defining groups may seem superfluous in the case of a dichotomous
variable (residence/no residence) but the t-test could be run with a group
variable that has several levels (e.g., learners’ L1s), and then it would be
important to define which groups to compare.

Click the Continue button, then the OK button.

The following is the SPSS output from the independent-samples t-test. Table 7.3
presents the descriptive statistics of the two group means.

TABLE 7.3 Means and standard deviations of the two groups

Collapsed residence (yes/no)	N	Mean	SD	Std. error mean

Routines score	No residence	73	51.26	18.09	2.12
Routines score	Residence	56	81.40	15.97	2.13

The group statistics are general descriptive statistics about the two groups. It
can be observed that there was a large difference in the routines scores of the two
groups. The test takers without residence had a mean score of 51.26%, whereas
the ones with residence had a mean score of 81.40%. The next SPSS output
will indicate whether the difference was statistically significant. SPSS presents one
large table with the statistics related to the independent-samples t-test (including
Levene’s test and the t-test for equality of means). For ease of presentation, this
output has been split into two tables. Table 7.4 presents the results from Levene’s
test. Note that this table does not yield the answer to the question regarding the
statistical significance of the difference in means.
In Table 7.4, both possible statistical assumptions for the equality of variances
are made separately by SPSS: one (equal variances assumed) posits that the t-test
condition of equal group variances (or at least similar) was met, and the other
(equal variances not assumed) assumes that it was not met. In the latter case, SPSS
corrects for the violation of this condition of equal variances. To know whether
the condition of equal variances holds, the result of Levene’s test can be examined.
Levene’s test has as its null hypothesis that variances are equal, so if it is nonsig-
nificant at 0.05, the t-test condition is met. This means that the t-test result for
equal variances can be used. In other words, Levene’s test must not be statistically
significant (i.e., the p-value must be larger than 0.05) in order to say that the data
have met the homogeneity assumption. In this particular SPSS output, Levene’s
test suggests that the p-value was 0.56, which is far above the threshold of the
p-value of 0.05, so it can safely be assumed that the group variances were similar
enough to run the independent-samples t-test without any corrections (hence in
Table 7.4, the row ‘Equal variances not assumed’ was left blank by SPSS). Table 7.5

TABLE 7.4 Levene’s test

Levene’s test

F Sig.

Routines score Equal variances assumed 0.35 0.56


Equal variances not assumed

TABLE 7.5 The independent-samples t-test results

t	df	Sig. (2-tailed)	Mean difference	Std. error difference	95% confidence interval of the difference (lower, upper)

Equal variances assumed	–9.86	127	0.00	–30.14	3.06	(–36.19, –24.10)
Equal variances not assumed	–10.03	124.46	0.00	–30.14	3.01	(–36.09, –24.19)

presents the t-test for equality of means. This output can answer the question of
whether or not the two means were statistically different.
In Table 7.5, the first analysis row is based on the assumption of equal vari-
ances, whereas the second row is based on the assumption of unequal variances.
The second row can be ignored given that the result of Levene’s test was that the
condition of equality of variances was met. As can be seen in the column entitled
t, the t-test result is –9.86, which is not meaningful in itself. This t-value enables
researchers to determine the significance level only (if they use a critical value
table, as discussed in Chapter 6). In this output, SPSS reports the t-test result as
a negative because the higher mean was subtracted from the lower one. So the
negative sign in the t-value can be ignored when writing up a report. Also, df in
this table are needed only for reporting results. The entries in the Sig. (2-tailed)
column indicate whether the difference in means was significant, as it shows the
significance level. Even though SPSS reports the significance value here as 0.00,
this is a rounded figure: the true p-value is not zero, but it can be assumed to
be less than 0.001. This means that there is a low likelihood that it
would be wrong to claim an effect of residence on routines scores.
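The same two-step logic (Levene's test first, then the appropriate t-test row) can be sketched in Python with scipy; the scores below are invented and are not the TEP data. Note that scipy's levene defaults to a median-centered variant, so center="mean" is set here to mirror the SPSS calculation more closely.

```python
from scipy import stats

# Invented routines scores for two independent groups (not the TEP data).
no_residence = [45, 52, 38, 60, 49, 55, 41, 58, 47, 50]
residence    = [78, 85, 80, 90, 76, 88, 82, 79, 86, 81]

lev_stat, lev_p = stats.levene(no_residence, residence, center="mean")
equal_var = lev_p > 0.05   # nonsignificant Levene -> variances similar enough
t, p = stats.ttest_ind(no_residence, residence, equal_var=equal_var)
print(f"Levene p = {lev_p:.2f}, t = {t:.2f}, p = {p:.4f}")
```

As in the SPSS output, a negative t simply means the second group's mean is higher.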
SPSS does not report values of Cohen’s d. The easiest way to calculate Cohen’s d
is to use an online calculator such as that found in www.uccs.edu/~lbecker/. In this
calculator, the means and standard deviations from the SPSS descriptive statistics
output can be entered (Table 7.3), after which the compute button in the online
calculator should be clicked. Figure 7.3 presents Cohen’s d as well as the effect size r
which is another effect size measure that is not focused on in this book. According
to Figure 7.3, the Cohen’s d effect size was 1.77, which is a large effect size.

FIGURE 7.3 Lee Becker’s effect size calculators



The independent-samples t-test result can be written up as follows: ‘the
routines test scores of learners with and without residence in English-speaking
countries were compared through the use of the independent-samples t-test. It
was found that learners with residence had significantly higher routines test scores
(t(127) = 9.86, p < 0.001), and the effect size of the difference was large (d = 1.77).
It may be concluded that residence has a strong impact on knowledge of routine
formulae.’

How to Perform the Paired-Samples T-Test in SPSS


In order to illustrate how to use SPSS to perform the paired-samples t-test, using the
same data file as earlier, we will investigate which of the multiple-choice sections on
TEP was more difficult for learners. The research question is ‘did learners find it easier
to interpret implicatures or to recognize routine formulae?’

SPSS Instructions: Paired-Samples T-test

Click Analyze, next Compare Means, and then Paired-Samples T Test
(see Figure 7.4).

FIGURE 7.4 Accessing the SPSS menu to perform the paired-samples t-test

FIGURE 7.5 Paired-Samples T Test dialog

Clicking Paired-Samples T Test calls up its dialog (see Figure 7.5).

First, select the ‘Implicature score’ and ‘Routines score’ variables and
move them to the Variable 1 and Variable 2 columns in the ‘Paired
Variables’ pane.

Click on the OK button.

In its output, SPSS presents the descriptive statistics of the two variables (Table 7.6).
According to Table 7.6, the implicature test section seems to be easier than the
routines test section. However, it cannot be said yet that the difference was statisti-
cally significant.
Table 7.7 presents the correlation coefficient between the two sets of scores.
It is provided mainly for the researchers’ information and need not be reported.
Table 7.8 presents the paired-samples t-test results. Examine the last three columns
of this table: the t-value was 1.81 with 165 degrees of freedom, and the signifi-
cance level was 0.07, which indicates that the difference between implicature test
scores and routines test scores was not statistically significant (p > 0.05). Therefore,
it cannot be concluded that one section is more difficult than the other. Given the
nonsignificant result, it is not necessary to calculate an effect size measure such as
Cohen’s d.

TABLE 7.6 Means and standard deviations of the two variables

Mean	N	SD	Std. error mean

Pair 1	Implicature score	64.22	166	26.70	2.07
	Routines score	60.74	166	23.03	1.79

TABLE 7.7 Correlation coefficient between the two sets of scores

N	Correlation	Sig.

Pair 1	Implicature score & Routines score	166	0.51	0.00

TABLE 7.8 Paired-samples t-test results

Mean difference	SD	Std. error mean	95% confidence interval of the difference (lower, upper)	t	df	Sig. (2-tailed)

Pair 1	Implicature score – Routines score	3.47	24.75	1.92	(–0.32, 7.27)	1.81	165	0.07

This finding can be reported as follows: ‘the scores on the implicature and
routines test sections were compared through the use of the paired-samples t-test.
No significant difference was found between these two test sections (t(165) = 1.81,
p = 0.07)’.
It should be noted that in the case of statistical significance, researchers should
compute Cohen’s d. There is disagreement in the literature about whether
Cohen’s d for paired-samples t-tests needs to consider the correlation between the
two variables (see Lakens, 2013), which we showed in Table 7.7. Becker’s effect
size calculator does not take correlation into account, but Melody Wiseheart’s cal-
culator (www.cognitiveflexibility.org/effectsize/) can do so. For consistency and
simplicity, researchers can compute the paired-samples t-test effect sizes in the
same way they do for the independent-samples t-test. That is, there is no need to
integrate the correlation value into the computation.
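Had the difference been significant, the two approaches could be sketched as follows, using the values from Tables 7.6 and 7.8; with these particular numbers the two routes happen to agree once rounded.

```python
import math

def d_pooled(m1, s1, m2, s2):
    """Paired d computed as for independent samples (equal ns): pooled SD
    of the two score sets, ignoring the correlation between them."""
    return (m1 - m2) / math.sqrt((s1**2 + s2**2) / 2)

def d_z(mean_diff, sd_diff):
    """Correlation-aware alternative: mean of the difference scores divided
    by the SD of the difference scores (see Lakens, 2013)."""
    return mean_diff / sd_diff

# Values from Tables 7.6 and 7.8 (implicature vs. routines scores).
print(round(d_pooled(64.22, 26.70, 60.74, 23.03), 2))  # → 0.14
print(round(d_z(3.47, 24.75), 2))                      # → 0.14
```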

Summary
This chapter has explained the underlying principles behind the two types of
t-tests (the independent t-test and paired-samples t-test). It has illustrated how to

compute them in SPSS. Effect size calculations for t-tests have been explained and
presented. The next chapter presents the nonparametric versions of the two t-tests.

Review Exercises
To download review questions and exercises for this chapter, visit the Companion
Website: www.routledge.com/cw/roever.
8
MANN-WHITNEY U AND WILCOXON SIGNED-RANK TESTS

Introduction
Nonparametric tests are useful in the analysis of data that do not meet the condi-
tions required for parametric tests, for example, if researchers are working with
small sample sizes or ordinal/rank data, or if the assumption of normally distrib-
uted data may not be justified. This chapter presents nonparametric alternatives to
t-tests, namely the Mann-Whitney U test, which is analogous to the independent-
samples t-test, followed by the Wilcoxon Signed-rank test, which is analogous to
the paired-samples t-test.

Making the Decision to Use a Nonparametric Test


The following are preliminary steps for determining whether a nonparametric test
should be used instead of the t-test.

1. Examine and evaluate the descriptive statistics of the data to be analyzed.


2. Check whether the statistical assumptions for the relevant t-test are met (see
Chapter 7). If not, consider using the Mann-Whitney U test or the Wilcoxon
Signed-rank test instead.

The Mann-Whitney U Test


The Mann-Whitney U test is an alternative to the independent-samples t-test.
Instead of comparing means, the Mann-Whitney U test agglomerates the par-
ticipants from both groups and ranks them relative to each other. It then checks

whether high-, mid-, and low-ranked participants from each group are evenly
distributed in the pooled group. If each group has some high-ranking partici-
pants, some mid-ranking participants, and some low-ranking participants, the
two groups are not likely to be significantly different. However, if one group
has a lot of high-ranking participants but few mid- and low-ranking partici-
pants, and the other group has very few high-ranking participants, but a lot of
mid- and low-ranking ones, then the two groups are likely to be significantly
different. The U-value from this analysis is an index that helps researchers find
the relevant significance level. SPSS provides both the U-value and Z-value, and
both are reported in some studies. The Z-value can be used to compute an effect
size for the Mann-Whitney U test. Corder and Foreman (2009, p. 59) suggest
a simple formula to calculate the effect size for the Mann-Whitney U test (r)
as follows:

r = Z ÷ √total N

It should be noted that the effect size r here is not the same as a correlation coef-
ficient. According to Cohen (1988), r = 0.10 is considered a small effect size, r =
0.30 is considered a medium effect size, and r = 0.50 is considered a large effect size.
Doolan and Miller (2012), for example, used the Mann-Whitney U test to
detect differences between generation 1.5 writers and L1 writers in the frequency
of occurrence of a variety of error types. They found no significant difference
between the groups in their word choice errors, although generation 1.5 writers
made more word choice errors. However, they did find a statistically significant
difference in verb errors. Table 8.1 presents their Mann-Whitney U test results
(adapted from Doolan & Miller, 2012).
The authors did not report the r effect sizes in this table, but they can be easily
calculated. The r effect size for the word choice errors was 0.14 (i.e., 1.13 ÷ √61),
which is small, and the r effect size in the case of the verb errors was 0.47 (i.e.,
3.64 ÷ √61), which is considered medium-to-large. It should be noted that when
the test does not produce a statistically significant result, the r effect size does not
have to be calculated.
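The calculation can be sketched in a few lines; the Z-values and total N come from Doolan and Miller's results in Table 8.1, and the sign of Z is dropped because it does not affect the size of the effect.

```python
import math

def mw_effect_r(z, n):
    """Effect size r for the Mann-Whitney U test: |Z| / sqrt(total N)
    (Corder & Foreman, 2009)."""
    return abs(z) / math.sqrt(n)

# Doolan and Miller (2012): total N = 61.
print(round(mw_effect_r(-1.13, 61), 2))  # word choice errors → 0.14
print(round(mw_effect_r(-3.64, 61), 2))  # verb errors        → 0.47
```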

TABLE 8.1 Mann-Whitney U test results (adapted from Doolan & Miller, 2012, Table 2,
p. 7)

Mean (SD) Gen 1.5	Mean (SD) L1	Mean rank Gen 1.5	Mean rank L1	z statistic	Sig.

Wrong word	2.63 (2.24)	2.00 (1.89)	32.76	27.40	–1.13	.260
Verb error	6.24 (5.61)	1.50 (1.50)	36.72	19.28	–3.64	.001∗

How to Perform the Mann-Whitney U Test in SPSS


To illustrate how the Mann-Whitney U test can be performed in SPSS, a small
portion of a data set from Phakiti (2006) is used. Only 46 research participants
were included. Phakiti (2006) examined the relationships among cognitive and
metacognitive strategy use and EFL reading test performance. In this section, the
data file Ch8strategy.sav is analyzed (downloadable from the Companion Website
for this book).
In this data set, females were coded as ‘0’, whereas males were coded as ‘1’.
There were 20 females and 26 males. The variables in this file are (1) strategy use
and (2) reading test scores. The strategy use variables were (1) comprehending
strategy use (e.g., identifying main ideas), (2) memory strategy use (e.g., trying to
retain information from texts for later use), (3) retrieval strategy use (e.g., recall-
ing information from memory), (4) planning strategy use (e.g., organizing steps
to complete reading), (5) monitoring strategy use (e.g., checking ongoing com-
prehension and reading task completion), and (6) evaluating strategy use (e.g.,
appraising comprehension and test performance). The Mann-Whitney U test will
be used to examine gender differences in the overall reading test scores.

SPSS Instructions: Mann-Whitney U Test

Click Analyze, next Nonparametric Tests, then Legacy Dialogs, and
then 2 Independent Samples (see Figure 8.1).

Clicking 2 Independent Samples calls up the Two-Independent-Samples
Tests dialog, which allows the selection of new variables (see Figure 8.2).

Move ‘Total Score’ to the ‘Test Variable List’ pane.

Click the Define Groups button to tell SPSS how the two groups are
defined. Enter ‘0’ for Group 1 and ‘1’ for Group 2. Note that 0 rep-
resents females and 1 represents males in this data set. Click on the Continue
button.

FIGURE 8.1 SPSS menu to perform the Mann-Whitney U test

In the Two-Independent-Samples Tests dialog, click the Options
button to open a new dialog. Tick the Descriptive checkbox and click
the Continue button.

Mann-Whitney U is checked by default. Click on the OK button.



FIGURE 8.2 SPSS dialog to perform the Mann-Whitney U test

TABLE 8.2 Descriptive statistics (N = 46)

N Mean SD Minimum Maximum

Total Score 46 48.48 9.94 27.00 67.00


Gender 46 0.57 0.50 0.00 1.00

TABLE 8.3 Mean ranks in the Mann-Whitney U test (N = 46)

Gender N Mean rank Sum of ranks

Total Score Female 20 28.73 574.50


Male 26 19.48 506.50
Total 46

Table 8.2 presents the descriptive statistics produced by SPSS. You can ignore the
descriptive statistic for the gender variable, which makes no sense as it consists of
nominal data (see Chapter 3). In Table 8.2, the mean score for the total test score
was 48.48 (SD = 9.94).
Table 8.3 presents the mean ranks using the total test score. In this table, the
mean ranks for female and male test takers were 28.73 and 19.48 respectively.
Table 8.4 presents the Mann-Whitney U test statistics. In order to determine
whether the two groups significantly differed in their total test score, first examine
the Z-value and the Asymp. Sig. (2-tailed) value. It can be seen that there was a
statistically significant difference between the female and male test takers in the total
test score (Z = –2.32, p = 0.02, r = 0.34). SPSS does not produce the r-effect size,
so this needs to be calculated using the formula provided in the ‘Mann-Whitney U

TABLE 8.4 Mann-Whitney U test statistics (N = 46)

Total score

Mann-Whitney U 155.50
Wilcoxon W 506.50
Z	–2.32
Asymp. Sig. (2-tailed) 0.02

Test’ section. The effect size is medium in this case. It should be noted that similar
to Cohen’s d, the r-effect size can take a negative value but the negative sign in the
effect size can be ignored.
The Mann-Whitney U test result can be reported as follows: ‘the total scores of
female and male test takers were compared through the use of the Mann-Whitney
U test. Female test takers had significantly higher total test scores than their male
counterparts (Z = 2.32, p = 0.02, r = 0.34), and the effect size of the difference
was medium. It can be concluded that gender may play a role in determining suc-
cess in reading test performance’.
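The same comparison can be sketched in Python with scipy, using invented reading scores rather than Phakiti's data. Note that scipy reports the U statistic and a p-value; unlike SPSS's legacy dialog, it does not print a Z-value by default.

```python
from scipy import stats

# Invented reading test scores for two independent groups (not Phakiti's data).
female = [52, 61, 58, 67, 49, 63, 55, 60, 57, 64]
male   = [41, 47, 50, 39, 45, 52, 43, 48, 40, 46]

u, p = stats.mannwhitneyu(female, male, alternative="two-sided")
print(f"U = {u}, p = {p:.4f}")
```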

The Wilcoxon Signed-Rank Test


The Wilcoxon Signed-rank test is a nonparametric test that is analogous to the paired-
samples t-test. This test is useful when the statistical assumptions for a parametric test
are not met. Compared to the paired-samples t-test, the Wilcoxon Signed-rank test is
less commonly reported in published studies.
An example of research employing the Wilcoxon Signed-rank test is that by
Gass, Svetics, and Lemelin (2003), which asked to what extent attention differen-
tially affects different parts of language learning and to what extent the differential
effect interacts with language proficiency. In SLA research, the role of attention or
awareness in language learning has been long contested among researchers. The
researchers aimed to compare the language learning performances of the same
group of learners who were asked to learn three grammatical areas (i.e., syntax,
morphosyntax, and lexicon) under two learning conditions: (1) the presence of
focused attention (+) to the target grammatical features, and (2) the absence of
focused attention (–) to the target grammatical features. That is, in the first con-
dition, students were asked to pay attention to a grammatical feature they might
have a problem using (hence, + attention), whereas in the second condition, the
instructor did not point out any such feature to students (hence, – attention).
In this study, the researchers used the Wilcoxon Signed-rank tests to examine
the differences between the pretest and posttest scores among 34 students. The
researchers found that in the + attention condition, all three grammatical
areas (syntax, morphosyntax, and lexicon) showed significant, large gains. It
should be noted that the researchers did not report the r-effect size, but Cohen’s

d instead. In this study, Cohen’s d ranged from 0.97 to 1.57, which implied
large effect sizes. The findings suggest that when learners pay attention to the
language areas they are learning, they are likely to learn them more successfully
than when they do not pay attention. However, another intriguing finding from
this study was that there seemed to be an interaction between the effects of
attention and learners’ proficiency levels (as determined by the number of years
of study, i.e., first-, second-, and third-year levels). For example, the impacts of
+focused attention on the three grammatical areas were pronounced among the
first- and second-year students, but became more complex for the third-year
students. That is, for the third-year group, there was no significant gain in the
three grammatical areas, whether attention was given or not. The researchers
proposed that for the third-year students, when attention was not given, learners
might have drawn on their own learning resources to help them learn the target
language features.

How to Perform the Wilcoxon Signed-Rank Test in SPSS


To illustrate how to run the Wilcoxon Signed-rank test, the data file Ch8strategy.sav
will be used (which is available on the Companion Website for this book). In this
data set, there are six strategy-use categories (i.e., comprehending, memory, retrieval,
planning, monitoring, and evaluating) as in the Mann-Whitney U test section earlier.
With regard to this data set, the aim is to determine whether test takers’ comprehending
strategy use differs significantly from the use of the other reported strategies.

SPSS Instructions: Wilcoxon Signed-Rank Test

Click Analyze, next Nonparametric Tests, then Legacy Dialogs, and


then 2 Related Samples (see Figure 8.3)

Clicking 2 Related Samples calls up the Two-Related-Samples Tests


dialog, which allows variables to be selected (see Figure 8.4).

From the left-hand pane, move ‘Comprehending’ to the Variable 1


column and ‘Memory’ to the Variable 2 column in the ‘Test Pairs’
pane. Repeat this, putting ‘Comprehending’ in Variable 1 and other vari-
ables in Variable 2 in subsequent rows (see Figure 8.4). There are a total of
five pairs to be compared.
FIGURE 8.3 SPSS menu to perform the Wilcoxon Signed-rank test

FIGURE 8.4 SPSS dialog to perform the Wilcoxon Signed-rank test



Click the Options button to open the dialog shown in Figure 8.4,
then select Descriptive and click the Continue button. There is no
need to change other defaults.

Back in the Two-Related-Samples Tests dialog, Wilcoxon is checked


by default. Click on the OK button.

Table 8.5 presents the descriptive statistics produced by SPSS. In this table, the
mean scores ranged from 3.14 (memory) to 3.60 (monitoring).
Table 8.6 presents the ranks statistics, which compare comprehending strategy
use with the use of other strategies. In Table 8.6, the label negative ranks refers to
the observation that a test taker reported less use of the strategy being compared
(e.g., memory) than the use of the comprehending strategy. The label positive ranks
refers to the observation that a test taker reported higher use of the strategy being
compared than the use of the comprehending strategy. The label ties indicates that
the use of comprehending and the compared strategy were equal. According to
Table 8.6, the use of the comprehending strategy was reported to be higher than
the use of the memory, retrieval, and planning strategies, but lower than the use
of the monitoring and evaluating strategies. However, at this stage it is not known
whether these differences were statistically significant.
Table 8.7 presents the Wilcoxon signed-rank test statistics. As SPSS does not
produce the r-effect sizes, these need to be calculated using the formula provided
for the Mann-Whitney U test in the ‘Mann-Whitney U Test’ section. In order
to determine whether a pair of strategies significantly differed from each other,
the Z-value and the Asymp. Sig (2-tailed) value should be examined. According
to Table 8.7, there was a statistically significant difference between the use of the
memory and comprehending strategies (Z = –3.07, p < 0.001, r = –0.45), and

TABLE 8.5 Descriptive statistics (N = 46)

N Mean SD Minimum Maximum

Comprehending 46 3.45 0.74 1.80 5.00


Memory 46 3.14 0.60 2.25 4.50
Retrieval 46 3.40 0.71 1.75 5.00
Planning 46 3.20 0.59 2.00 4.67
Monitoring 46 3.60 0.60 2.50 5.00
Evaluating 46 3.57 0.64 2.20 5.00

TABLE 8.6 Ranks statistics in the Wilcoxon Signed-rank test (N = 46)

N Mean rank Sum of ranks

Memory versus Negative Ranks 29a 25.07 727.00


Comprehending Positive Ranks 14b 15.64 219.00
Ties 3c
Total 46
Retrieval versus Negative Ranks 24d 19.79 475.00
Comprehending Positive Ranks 19e 24.79 471.00
Ties 3f
Total 46
Planning versus Negative Ranks 32g 23.33 746.50
Comprehending Positive Ranks 14h 23.89 334.50
Ties 0i
Total 46
Monitoring versus Negative Ranks 16j 22.84 365.50
Comprehending Positive Ranks 28k 22.30 624.50
Ties 2l
Total 46
Evaluating versus Negative Ranks 18m 19.69 354.50
Comprehending Positive Ranks 25n 23.66 591.50
Ties 3o
Total 46
a Memory < Comprehending; b Memory > Comprehending; c Memory = Comprehending
d Retrieval < Comprehending; e Retrieval > Comprehending; f Retrieval = Comprehending
g Planning < Comprehending; h Planning > Comprehending; i Planning = Comprehending
j Monitoring < Comprehending; k Monitoring > Comprehending; l Monitoring = Comprehending
m Evaluating < Comprehending; n Evaluating > Comprehending; o Evaluating = Comprehending

TABLE 8.7 Wilcoxon Signed-rank test statistics (N = 46)

Mem—Com Ret—Com Plan—Com Mon—Com Eva—Com

Z –3.07b –0.02b –2.25b –1.51c –1.44c


Asymp. Sig. (2-tailed) 0.00 0.98 0.02 0.13 0.15
r –0.45 –0.00 –0.33 –0.22 –0.21
a Wilcoxon Signed-Ranks Test. b Based on positive ranks. c Based on negative ranks.
Com = Comprehending; Mem = Memory; Ret = Retrieval; Plan = Planning;
Mon = Monitoring; Eva = Evaluating

between the use of the planning and comprehending strategies (Z = –2.25, p =


0.02, r = –0.33). The use of the comprehending strategy did not significantly dif-
fer from the use of the retrieval, monitoring, and evaluating strategies. The effect
sizes for the significant differences were medium. The negative sign in the effect
size can be ignored.
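The same kind of comparison can be approximated outside SPSS. The sketch below, in Python with scipy, uses hypothetical paired ratings from 10 test takers (invented for illustration, not the Ch8strategy.sav data); it runs the Wilcoxon Signed-rank test and recovers the r effect size from the two-tailed p-value via the normal distribution, following the r = Z/√N formula given for the Mann-Whitney U test.

```python
import numpy as np
from scipy import stats

# Hypothetical paired strategy-use ratings from 10 test takers
# (invented for illustration; not the Ch8strategy.sav data).
comprehending = np.array([3.2, 4.0, 3.5, 2.8, 3.9, 3.1, 4.2, 3.6, 2.9, 3.8])
memory        = np.array([2.9, 3.35, 3.05, 2.65, 3.2, 3.35, 3.65, 3.1, 2.7, 3.2])

# Wilcoxon Signed-rank test on the paired differences
w, p = stats.wilcoxon(comprehending, memory)

# Recover |Z| from the two-tailed p-value, then r = Z / sqrt(N),
# where N is the number of pairs (the formula used in this chapter).
z = stats.norm.isf(p / 2)
r = z / np.sqrt(len(comprehending))
print(f"W = {w:.1f}, p = {p:.4f}, r = {r:.2f}")
```

SPSS reports the Z-value directly; recovering |Z| from the p-value, as here, is a convenient workaround when only the significance level is available.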

The Wilcoxon Signed-rank test results can be reported as follows: ‘Test takers’
reported use of comprehending strategies was compared to that of other reported
strategies (e.g., memory, retrieval, and planning) through the Wilcoxon Signed-
rank test. It was found that there was significantly higher use of comprehending
strategies compared to that of the memory (Z = 3.07, p < 0.001, r = 0.45, medium
effect) and planning strategies (Z = 2.25, p = 0.02, r = 0.33, medium effect). The
effect sizes of these differences were medium. Comprehending strategy use did not
significantly differ from retrieval, monitoring, or evaluating strategy use. Accord-
ing to the Wilcoxon Signed-rank test results, it may be concluded that in this
reading comprehension test, test takers reported using comprehending strategies
significantly more frequently than using memory and planning strategies, but the
frequency of use of comprehending strategies was not significantly different from
that of the retrieval, monitoring and evaluating strategies.’

Summary
This chapter has explained two nonparametric tests analogous to the independent-
samples and paired-samples t-tests, and illustrated how to compute them in SPSS.
These two nonparametric tests are suitable for the analysis of ordinal and non-
normal data. If a sample size is small, researchers may use these tests to explore
possible differences between data sets. These tests are useful as alternatives to the
t-tests if some assumptions of the t-tests cannot be met. The next chapter presents
the one-way ANOVA, which is an extension of the independent-samples t-test to
compare three or more groups of people.

Review Exercises
To download review questions and exercises for this chapter, visit the Companion
Website: www.routledge.com/cw/roever.
9
ONE-WAY ANALYSIS OF
VARIANCE (ANOVA)

Introduction
The independent-samples t-test allows researchers to compare two different groups
of participants measured with the same instrument. However, if there are more
than two groups of language learners or test takers, the independent-samples
t-test cannot be used. In this chapter, an extension of the independent t-test to an
inferential statistic called analysis of variance, commonly abbreviated as ANOVA,
is introduced. This chapter focuses on one-way ANOVA and its alternative non-
parametric test, namely the Kruskal-Wallis test, which can be used for ordinal data
and data that do not exhibit a normal distribution.

One-Way ANOVA
One-way ANOVA functions in a similar way to the independent-samples t-test,
but instead of two groups, it can examine the differences among three or more
groups based on one background variable that distinguishes participants (e.g.,
native languages, proficiency levels, and experimental conditions). The presence
of only one independent variable to distinguish participants is the reason that this
test is called the ‘one-way’ ANOVA. If there are two independent variables (e.g.,
native language and gender), it is called a ‘two-way’ ANOVA, and so on. This
chapter deals only with one-way ANOVA.
Once learner groups have been created that differ on the independent
variable, one-way ANOVA compares research participants on one outcome
variable (e.g., test score) to see if differences in the independent background
variable lead to differences in the outcome variable. ANOVA is also called a


univariate analysis of variance because only one dependent variable is examined
at a time. When more than one dependent variable is examined in this type of
analysis simultaneously, researchers can employ a multivariate analysis of variance
(MANOVA). MANOVA is not included here due to the introductory nature
of the current book (see Resources on Methods for Second Language Research in
the Epilogue for a list of publications that cover MANOVA, e.g., Field, 2013;
Larson-Hall, 2010).
Several research questions can be answered through the use of one-way
ANOVA. For example, Ko (2012) asked “what type of vocabulary gloss is the
most effective?” The study used one-way ANOVA to compare the effect of no
gloss, L1 gloss, and L2 gloss on Korean EFL learners’ learning of English vocabu-
lary. Based on Ko’s (2012) study, the background variable was the type of gloss
the learners had been exposed to, and the outcome variable was their score on
a multiple-choice vocabulary test. One-way ANOVA was used to determine
whether the learner groups’ vocabulary scores differed significantly from each
other. Ko chose a sample of 90 Korean EFL learners and randomly assigned them
to three groups. She first conducted a cloze test as a pretest to ensure that all
three groups had the same proficiency level at the outset of her study. She then
gave each group the same English language reading text. One group received
no vocabulary explanations (i.e., no-gloss group), another received vocabulary
explanations in Korean (i.e., L1 gloss group), and the third received vocabulary
explanations in English (i.e., L2 gloss group). She measured their learning of
vocabulary with an immediate posttest consisting of 16 multiple-choice items,
and she administered a delayed posttest four weeks later to measure students’
retention.
For the scores of the immediate posttest, she employed a one-way ANOVA to
compare the results of her three groups. The means and standard deviations of the
groups are shown in Table 9.1.
Looking at the descriptive statistics for the immediate posttest, it appears that
there were substantial differences in the results of the three groups, especially
between the no-gloss group and the two gloss groups, which had much larger
means, but the researcher needed to conduct an ANOVA to be sure that the dif-
ferences were statistically significant.

TABLE 9.1 Immediate posttest (adapted from Ko, 2012, p. 66)

Groups N M SD

No gloss 30 15.10 3.57


L1 gloss 30 25.73 5.07
L2 gloss 30 26.96 3.70
Total 90 22.60 6.76

Key Steps in One-Way ANOVA


There are two basic steps when performing one-way ANOVA:
Step 1: One-way ANOVA looks for any significant differences in the data sets
and produces an overall result. If it does not find a statistically significant outcome,
the test does not proceed further. However, if the result is found to be significant,
there are differences between the groups, but where these differences lie remains
unknown. For example, in Ko’s study, the difference may have been between the
no-gloss and L1 gloss groups, the L1 and L2 gloss groups, or the no-gloss and L2
gloss groups.
Step 2: If the first step finds a statistically significant difference, a second step
in which each group is compared with all the other groups is required. This step is
called a post hoc test, and a number of different post hoc tests are available in SPSS.
In the L2 research field, it is common to use the Scheffé post hoc test.
Ko (2012) found a significant ANOVA result of F(2,87) = 73.21, p < 0.05.
This means that the background variable was found to have an effect on test
performance; i.e., glossing made a difference in the students’ learning of vocabu-
lary. To find out where the differences between the groups lay, Ko employed the
Scheffé post hoc test and found that the no-gloss group differed significantly from
the L1 gloss group, and also from the L2 gloss group. However, the two gloss
groups did not differ significantly from each other.
The main conclusion that can be drawn from Ko’s study is that glossing helps
students learn vocabulary more successfully. Both gloss groups performed signifi-
cantly better than the no-gloss group did. It is less clear, however, whether L1 or
L2 gloss was more effective in this study. The L2 gloss group had a higher mean
score than the L1 gloss group, but the difference was not statistically significant.
It is important to remember that in quantitative research, the notion of ‘nonsig-
nificant’ is not the same as the notion ‘nonexistent’. Although the L2 gloss group
scored more highly, it would be difficult to claim that the difference between the
groups was real. The detected difference could well be a chance occurrence. So
strictly speaking, the researcher could not definitively show that one type of gloss
(L1 or L2) was more effective than the other for students’ learning of vocabulary.

The Statistical Assumptions and Outcomes of ANOVA


One-way ANOVA has the same statistical assumptions as do t-tests. Similar to the
independent-samples t-test, two variables are relevant to one-way ANOVA: (1) the
grouping variable (also known as the factor or independent variable) and (2) the outcome
variable (also known as the dependent variable). The grouping variable sorts par-
ticipants into mutually exclusive groups, so that each participant is in one group
only. Typical grouping variables include first language, proficiency level, and type
of treatment (e.g., no gloss, L1 gloss, L2 gloss in Ko, 2012). Grouping variables are
nominal or ordinal and should have few levels with a large number of participants
per level; ideally, each group should have at least 30 participants (as discussed in
Chapter 6). Although one-way ANOVA can be run with fewer than 30 partici-
pants per group, the results are generally more trustworthy if the sample size per
group is larger. The sizes of the groups should also be roughly similar.
The outcome variable in one-way ANOVA should be interval and have a
broad range of scores or data. The expected scores for the underlying popula-
tion should be normally distributed. Test scores are a typical outcome measure.
The score variances of each group should not be too different, although one-way
ANOVA can correct for unequal variances through its post hoc tests.
In its analysis, one-way ANOVA compares the differences between group
means with the differences between participants within groups. If there are large
differences between group means while the scores within groups are highly
homogeneous and have small standard deviations, the one-way ANOVA outcome
is likely to be significant. By contrast, the more similar the group means are, and
the more widely the individuals’ scores within groups are spread out, the less likely
it is that the one-way ANOVA outcome will be statistically significant.
The outcome of a one-way ANOVA is the F-value, which allows researchers to
determine whether the analysis is statistically significant. The F-value is a number
that researchers can use to look up the significance level in a table of critical val-
ues. However, in SPSS, the significance level is calculated automatically.
The F-value can be any positive value and does not take on negative values,
unlike the t-value. For strong effects, the F-value is often found to be quite high.
By convention, the degrees of freedom between groups (df1), and the degrees of
freedom within groups (df2) are reported with the F-value set apart by a comma
as F(df1,df2). For example, in Ko’s (2012) study, she had three groups and a total
of 90 participants, so she reported the F-value as F(2,87) = 73.21. Here, 2 is df1
(i.e., the total number of groups minus 1) and 87 is df2 (i.e., the total number of
participants minus the number of groups being compared).
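The F-value and its degrees of freedom can be reproduced with any statistics package. Here is a minimal sketch in Python using scipy, with randomly generated scores standing in for Ko’s three groups (the group sizes and degrees of freedom match her design, but the data are simulated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated scores for three groups of 30 learners (not Ko's real data;
# the means and SDs loosely echo Table 9.1)
no_gloss = rng.normal(15.10, 3.57, 30)
l1_gloss = rng.normal(25.73, 5.07, 30)
l2_gloss = rng.normal(26.96, 3.70, 30)

f_value, p_value = stats.f_oneway(no_gloss, l1_gloss, l2_gloss)

k, n_total = 3, 90
df1 = k - 1        # between-groups df: number of groups minus 1
df2 = n_total - k  # within-groups df: participants minus groups
print(f"F({df1},{df2}) = {f_value:.2f}, p = {p_value:.4f}")
```

With three groups of 30, the result is reported as F(2,87), exactly as in Ko’s study.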

The Post Hoc Tests


Researchers need to perform the second step in one-way ANOVA if they find a
significant F-value (the F-value is significant at a p-value of less than 0.05). The
second step is to run a post hoc test to identify the groups between which there
are statistically significant differences.
Some post hoc tests are designed for groups with unequal variances, and others
are designed for groups with equal variances. As with the independent-samples
t-test, the Levene’s test in ANOVA is used to indicate whether the comparison
groups have equal or unequal variances. If they are equal (Levene’s test is nonsig-
nificant), it is common to use the Scheffé post hoc test. If they are unequal (Levene’s
test is significant), it is common to use Tamhane’s T2 post hoc test. SPSS offers
several options for the post hoc tests (as can be seen in Figure 9.3 in the “SPSS
Instructions: ANOVA” section). Results from post hoc tests are often just reported
One-Way Analysis of Variance (ANOVA) 121

in the text of a Results section. Post hoc tests are not needed when the one-way
ANOVA result is nonsignificant, as a nonsignificant ANOVA result means that
there are no significant differences between the comparison groups.
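The decision rule for choosing a post hoc test can be sketched in code, since scipy provides Levene’s test directly. The simulated scores below are illustrative only, not the TEP data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Three simulated groups with similar spread (illustrative data)
group_a = rng.normal(50, 10, 40)
group_b = rng.normal(55, 10, 40)
group_c = rng.normal(60, 10, 40)

# A nonsignificant Levene's test (p > 0.05) supports the equal-variances
# assumption, pointing to the Scheffé post hoc test; a significant result
# points to Tamhane's T2.
stat, p = stats.levene(group_a, group_b, group_c)
post_hoc = "Scheffé" if p > 0.05 else "Tamhane's T2"
print(f"Levene statistic = {stat:.3f}, p = {p:.3f} -> use {post_hoc}")
```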
Researchers may obtain a significant ANOVA result, but subsequently the post
hoc test does not show there to be significant differences between any of the
groups. This usually occurs when the one-way ANOVA result itself is significant
by only a small margin. In that case, the additive group differences may be large
enough to reach significance, but when groups are considered pairwise, the dif-
ferences may not be sufficiently large for the post hoc test to show significance.
One-way ANOVA and post hoc tests need to be conducted instead of several
independent-samples t-tests, which would compare each group with each of the
others, leading to an increased chance of a type I error. This is because every time
the independent-samples t-test is performed, there is a 5% chance of a type I error
in rejecting the null hypothesis when it is true (see Chapter 6). In other words,
there is a 95% chance of being right in rejecting it. If two computations of the
independent-samples t-test are performed from the same data set, and the null
hypothesis is rejected both times, the likelihood of being right both times is now
about 0.9025 (i.e., 0.95 × 0.95), so that there is now a 90.25% chance of being
right in rejecting the null hypothesis. This likelihood leaves a 9.75% likelihood of
being wrong in rejecting the null hypothesis. If there are three comparison groups
and three consecutive independent-samples t-tests are performed on the same data
set, the likelihood of being correct in rejecting all the three null hypotheses is now
85.7% (i.e., 0.95 × 0.95 × 0.95), and the likelihood of falsely rejecting the null
hypotheses is therefore 14.3%. As the number of comparisons increases, the chance
of being correct in rejecting all the null hypotheses decreases, so that if there are five
comparison groups, 10 independent-samples t-tests would need to be done, and
there would be only about a 60% chance (0.95^10 ≈ 0.599) of being right across all
of them; in other words, roughly a 40% chance of at least one false rejection.
One-way ANOVA and its post hoc tests have a correction for multiple tests built in,
so the inflation of the type I error rate is kept under control.
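The error inflation described above is easy to verify numerically. A short Python check of the figures in this paragraph (treating the tests as independent, as the paragraph does):

```python
# Probability of correctly rejecting every null hypothesis when each of
# n independent tests is run at alpha = 0.05, and the complementary
# familywise chance of at least one type I error.
alpha = 0.05
for n_tests in (1, 2, 3, 10):
    p_all_correct = (1 - alpha) ** n_tests
    print(f"{n_tests:2d} tests: P(all correct) = {p_all_correct:.4f}, "
          f"P(at least one type I error) = {1 - p_all_correct:.4f}")
```

Two tests give 0.9025, three give about 0.857, and ten give about 0.599, matching the percentages in the text.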

Effect Size for ANOVA


Results obtained through one-way ANOVA require the reporting of an effect
size, as is the case for t-tests. The effect size in one-way ANOVA is calculated as a
coefficient called eta squared (η2) or partial eta squared (partial η2) in SPSS. The η2
effect size cannot exceed the value of 1. This is in contrast to Cohen’s d, which is
unlimited. Similar to R2 for the Pearson correlation coefficient, the η2 effect size
can be interpreted as the percentage of overall variance that is due to the back-
ground variable. This means that, for example:

• If the η2 value is 0.1, then 10% of the overall variance is due to the back-
ground variable, and 90% is due to other factors. This is normally considered
a small effect size.

• If the η2 value is 0.3, then 30% of the overall variance is due to the back-
ground variable, and 70% is due to other factors. This is normally considered
a medium effect size.
• If the η2 value is 0.5, then half the overall variance is due to the background
variable and half is due to other factors. A η2 value of 0.5 or above is nor-
mally considered a large effect size.
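Eta squared is simply the between-groups sum of squares divided by the total sum of squares. A hand computation in Python with invented scores for three small groups:

```python
import numpy as np

def eta_squared(*groups):
    """Eta squared for a one-way design: SS_between / SS_total."""
    scores = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    grand_mean = scores.mean()
    ss_total = ((scores - grand_mean) ** 2).sum()
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    return ss_between / ss_total

# Invented scores for three small groups (illustration only)
g1 = [12, 14, 15, 13, 16]
g2 = [18, 20, 19, 21, 17]
g3 = [25, 24, 26, 23, 27]
print(f"eta squared = {eta_squared(g1, g2, g3):.3f}")  # -> 0.910
```

Here the group means are far apart while the scores within each group are tight, so 91% of the total variance is attributable to group membership — a large effect by the conventions above.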

How to Perform One-Way ANOVA in SPSS


There are two options in SPSS that allow you to perform one-way ANOVA. Both
are located in the Analyze menu.

• The first option is Analyze → Compare Means → One-way ANOVA, which
is a simple menu with few options, and the output does not include the
ANOVA effect size known as partial η2.
• The second option is Analyze → General Linear Model → Univariate, as
shown in Figure 9.1. This choice has more statistical options than the first
option and so is illustrated in this chapter.

To illustrate how to perform one-way ANOVA, the TEP data are used (available
in Ch9TEP.sav, which can be found on the Companion Website). In L2 pragmatics
research, the impact of general proficiency on pragmatic knowledge is a recurring
research question (see Taguchi & Roever, 2017, for further details). Researchers may
ask the question ‘does proficiency level impact pragmatic knowledge?’ TEP scores for
test takers at five proficiency levels (i.e., beginner, low intermediate, intermediate,
upper intermediate, and advanced) can be compared using one-way ANOVA.
That is, the background variable is the test takers’ proficiency level, and the outcome
variable is their TEP score. A one-way ANOVA with the proficiency level (level) as
the independent, or background, variable (or ‘factor’ in SPSS terms), and total score
(totalscore) as the dependent or outcome variable will be performed.

SPSS Instructions: One-Way ANOVA

Click Analyze, next General Linear Model, and then Univariate (see
Figure 9.1).

In the dialog that appears, move ‘total score’ from the left pane to
the Dependent Variable field and ‘proficiency level’ to the Fixed
Factor(s) field (see Figure 9.2).
FIGURE 9.1 SPSS menu to launch a one-way ANOVA

FIGURE 9.2 Univariate dialog for a one-way ANOVA



Click the Post Hoc button. In the Univariate: Post Hoc . . . dialog that
appears, move ‘level’ from the ‘Factors’ pane to the ‘Post Hoc Tests
for:’ pane. Select Scheffé and Tamhane T2, as shown in Figure 9.3.
Notes: Which post hoc test is to be used depends on whether Levene’s test is
significant. Both post hoc tests are chosen at this point because it is not yet
known if the Levene’s test will show significance (see the “Post Hoc Tests” sec-
tion). The post hoc tests are not needed if the ANOVA results are nonsignificant.

Click the Continue button to confirm these choices, then click the
Options button.

In the Univariate: Options dialog, select Descriptive statistics, Estimates


of effect size, and Homogeneity tests (which includes the Levene’s test)
as shown in Figure 9.4.

Click the Continue button and the OK button.

FIGURE 9.3 Options for post hoc tests



FIGURE 9.4 Options dialog for ANOVA

TABLE 9.2 Descriptives for proficiency in TEP

Proficiency level Mean Std. deviation N

Beginner 28.3624 13.31038 15


Low Intermediate 40.5613 13.36891 36
Intermediate 58.7390 18.43626 21
Upper Intermediate 71.6956 12.16580 73
Advanced 87.2928 12.18019 18
Total 60.8847 22.30940 163

Table 9.2 presents the descriptive statistics output.


It can be seen that mean scores increased with proficiency level. The higher the
proficiency level, the higher the scores. The differences in mean scores between
adjacent groups seem pronounced, ranging from about 12% between the beginner
and the low intermediate groups, to 18% between the low intermediate and the
intermediate groups. The results of the Levene’s test, as shown in Table 9.3, show
that the group variances are not significantly different. Recall that, as discussed
in relation to the independent-samples t-test, the Levene’s statistic must not be

TABLE 9.3 Levene’s statistic

F df1 df2 Sig.

2.040 4 158 .091

TABLE 9.4 Tests of between-subjects effects as the ANOVA result

Source Type III sum df Mean square F Sig. Partial eta


of squares squared

Corrected 51,916.647a 4 12,979.162 71.423 .000 .644


Model
Intercept 388,840.122 1 388,840.122 2,139.739 .000 .931
Level 51,916.647 4 12,979.162 71.423 .000 .644
Error 28,712.260 158 181.723
Total 68,4861.688 163
Corrected 80,628.907 162
Total
a
R Squared = .644 (Adjusted R Squared = .635)

significant to conclude that the homogeneity of variance assumption for group


comparisons holds (i.e., p > 0.05). The Sig. value of 0.091 indicates that the assumption
for ANOVA of homogeneity of variance has been met, and that the Scheffé post
hoc test can be used, instead of the Tamhane T2 post hoc test.
The ANOVA result is presented in Table 9.4. The results for ‘level’ are shown
in the third row of Table 9.4. The F-value was 71.423, which was statistically
significant (p < 0.001). The partial eta squared effect size (η2) was 0.644. This was a
significant result with a large effect size that indicates that 64.4% of the variance
between total scores on TEP was accounted for by proficiency level. Table 9.5 pres-
ents the results of Scheffé post hoc test. Note that SPSS also produces the Tamhane
T2 post hoc results, but they are not reported in this chapter because the data met
the homogeneity of variance assumption. According to Table 9.5, except for the
beginners and low intermediate groups, all groups differ significantly from one
another.
The results can be written up as follows:

A one-way ANOVA was conducted with proficiency level as the indepen-


dent variable, and the total TEP scores as the dependent variable. The
results indicate a significant and strong effect of proficiency level on the
TEP scores (F(4,158) = 71.423, p < 0.001, partial η2 = 0.644). The Scheffé
post hoc test showed significant differences between all groups, except
between the beginner and low intermediate groups.

TABLE 9.5 Scheffé post hoc test for multiple comparisons

(I) Proficiency level (J) Proficiency level Mean difference (I-J) Std. error Sig. 95% confidence interval

Lower Upper
bound bound

Beginner Low Intermediate –12.1989 4.14279 .075 –25.1119 .7141


Intermediate –30.3766∗ 4.55723 .000 –44.5814 –16.1718
Upper –43.3333∗ 3.82155 .000 –55.2449 –31.4216
Intermediate
Advanced –58.9304∗ 4.71281 .000 –73.6201 –44.2407
Low Beginner 12.1989 4.14279 .075 –.7141 25.1119
Intermediate Intermediate –18.1777∗ 3.70153 .000 –29.7153 –6.6401
Upper –31.1344∗ 2.74540 .000 –39.6917 –22.5770
Intermediate
Advanced –46.7315∗ 3.89148 .000 –58.8611 –34.6019
Intermediate Beginner 30.3766∗ 4.55723 .000 16.1718 44.5814
Low Intermediate 18.1777∗ 3.70153 .000 6.6401 29.7153
Upper –12.9566∗ 3.33809 .006 –23.3614 –2.5519
Intermediate
Advanced –28.5538∗ 4.33004 .000 –42.0504 –15.0572
Upper Beginner 43.3333∗ 3.82155 .000 31.4216 55.2449
Intermediate Low Intermediate 31.1344∗ 2.74540 .000 22.5770 39.6917
Intermediate 12.9566∗ 3.33809 .006 2.5519 23.3614
Advanced –15.5971∗ 3.54755 .001 –26.6548 –4.5395
Advanced Beginner 58.9304∗ 4.71281 .000 44.2407 73.6201
Low Intermediate 46.7315∗ 3.89148 .000 34.6019 58.8611
Intermediate 28.5538∗ 4.33004 .000 15.0572 42.0504
Upper 15.5971∗ 3.54755 .001 4.5395 26.6548
Intermediate

Based on observed means.


The error term is Mean Square(Error) = 181.723.
∗ The mean difference is significant at the 0.05 level.
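Scheffé’s test itself is not available in scipy, but the spirit of a post hoc analysis can be sketched with pairwise t-tests plus a Bonferroni correction (a more conservative adjustment than Scheffé’s). The group names, means, and scores below are simulated for illustration, not the TEP data:

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Simulated scores loosely echoing three of the proficiency groups
groups = {
    "beginner":     rng.normal(28, 13, 15),
    "intermediate": rng.normal(59, 18, 21),
    "advanced":     rng.normal(87, 12, 18),
}

# Pairwise t-tests with a Bonferroni correction: multiply each p-value
# by the number of comparisons (capped at 1.0) to control type I error.
pairs = list(combinations(groups, 2))
adjusted = {}
for a, b in pairs:
    t, p = stats.ttest_ind(groups[a], groups[b])
    adjusted[(a, b)] = min(p * len(pairs), 1.0)
    print(f"{a} vs {b}: t = {t:.2f}, "
          f"Bonferroni-adjusted p = {adjusted[(a, b)]:.4f}")
```

This is a rough stand-in for SPSS’s post hoc dialog; with group means this far apart, every pairwise comparison remains significant even after the correction.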

The Kruskal-Wallis Test


If the statistical assumptions for one-way ANOVA cannot be met, researchers can
consider whether the Kruskal-Wallis test is a plausible alternative. The Kruskal-
Wallis test is the nonparametric version of one-way ANOVA. Additionally, when
researchers deal with an outcome variable that is ordinal or based on ranked data,
ANOVA is not appropriate. The Kruskal-Wallis test is the multigroup equivalent
of the Mann-Whitney U test as presented in Chapter 8.
Di Silvio, Donovan, and Malone (2014) investigated American students’ gains in
three target languages (Spanish, Mandarin, and Russian), as well as their perceptions
about living with a host family. The researchers asked whether perceptions differed
by the target language. Di Silvio et al. (2014) collected data from 152 US college
students who spent a semester studying abroad in Peru, Chile, China, or Russia.
In addition to giving students a simulated oral proficiency test before and after the
study-abroad experience, the researchers administered a questionnaire on students’
perceptions of their study-abroad experiences. The questionnaire statements were
accompanied by five response options, ranging from ‘strongly agree’ to ‘strongly
disagree’. Cognizant of the ordinal nature of Likert-type scales, Di Silvio et al.
employed the Kruskal-Wallis test to compare participants’ perceptions after group-
ing participants by target language. For example, with the statement ‘I’m glad that
I lived with a host family’, there were statistically significant differences between
the three groups, with 94% of L2 Spanish learners agreeing or strongly agreeing
with the statement, compared to 90% of L2 Mandarin learners, and only 74% of
L2 Russian learners. The authors do not show post hoc tests for the Kruskal-Wallis
test, but based on the frequency data, the Spanish and Mandarin learners appeared
to be happier with their host families than the Russian learners were.

The Statistical Assumptions and Outcomes


of the Kruskal-Wallis Test
Being a nonparametric test, the Kruskal-Wallis test makes fewer statistical assump-
tions than one-way ANOVA does. As in the case of Di Silvio et al. (2014), the
Kruskal-Wallis test can handle ordinal data and narrow score ranges, and does not
require the underlying population to be normally distributed. The independent
(or grouping) variable needs to be nominal or ordinal, with few levels and a large
number of participants per level.
As with most nonparametric procedures, the Kruskal-Wallis test ranks all par-
ticipants (e.g., from high scores to low), divides them into groups, and examines
the distribution of ranks between groups. If one group consists of predominantly
high-ranked students and another of predominantly low-ranked students, then
they are considered to be statistically different.
The outcome of the Kruskal-Wallis test is technically the H statistic. However, it is also
sometimes written as χ2 because the critical values for the two are the same. For exam-
ple, in Di Silvio et al. (2014), for the statement ‘I’m glad that I lived with a host family’,
the result was H(2) = 9.359, p < 0.009, indicating a significant difference between
groups. The degree of freedom for the Kruskal-Wallis test is the number of comparison
groups minus 1 (i.e., df = 2 in Di Silvio et al.). While Di Silvio et al. (2014) did not
make comparisons between individual groups pairwise, such comparisons could be
made with the Mann-Whitney U test, adjusted for multiple comparisons.

How to Perform the Kruskal-Wallis Test in SPSS


To illustrate how to perform a Kruskal-Wallis test in SPSS, TEP data will be used
(available in Ch9TEP.sav, which can be found on the Companion Website). The
question to be addressed is: ‘Do participants with residence (i.e., up to one year
of residence and more than one year of residence), regardless of length, have
higher proficiency levels than participants without residence?’

SPSS Instructions: Kruskal-Wallis Test

Click Analyze, next Nonparametric Tests, and then Independent
Samples (see Figure 9.5).

FIGURE 9.5 SPSS menu to launch the Kruskal-Wallis test



FIGURE 9.6 Setup for the Kruskal-Wallis test

Figure 9.6 presents the SPSS default for a nonparametric test: two or more inde-
pendent samples.

Click the Fields tab, then move ‘proficiency level’ from the ‘Fields’
pane to the ‘Test Fields’ pane, and move ‘collapsed residence’ to the
Groups field (see Figure 9.7).
Notes: It is important that collapsed residence has the correct designation
(nominal/ordinal/scale) in the Measure Column of the SPSS Data View. Only
ordinal and nominal variables work as Group variables.

Next, click the Settings tab and tick the Customize tests checkbox.

Tick the checkbox for Kruskal-Wallis 1-way ANOVA (k samples), and
from the Multiple Comparisons drop-down menu, select All pairwise
(see Figure 9.8).

FIGURE 9.7 Variable entry for the Kruskal-Wallis test

FIGURE 9.8 Analysis settings for the Kruskal-Wallis test



Finally, click on the Run button to view the SPSS output for the
Kruskal-Wallis test.

In order to see more detail for the results in the SPSS output, double-click on
the Hypothesis Test Summary table (see Figure 9.9), and a Model Viewer window
will be activated (see Figure 9.10). According to Figure 9.10, the test was statistically
significant (H(2) = 30.448, p < 0.001). Note that the degrees of freedom were 2
(i.e., the number of groups minus 1).

FIGURE 9.9 Kruskal-Wallis test results

FIGURE 9.10 Model Viewer window for the Kruskal-Wallis test



To examine pairwise comparisons between groups, from the View drop-down
menu, located at the bottom right-hand side of the window, select Pairwise
Comparisons, as shown in Figure 9.11. SPSS then presents the comparisons between
individual groups (using the Dunn-Bonferroni test) visually and numerically
(Figure 9.12). The pairwise comparisons in Figure 9.12 suggest that there were
significant differences between the no-residence group and each residence group,
but not between the two residence groups. While there is no standard way to
compute the effect size for the Kruskal-Wallis test, the Mann-Whitney U test can
be used to compute the r effect size (see Chapter 8). The findings can be written
as follows:

FIGURE 9.11 Viewing pairwise comparisons

FIGURE 9.12 Pairwise comparisons in the Kruskal-Wallis test

A Kruskal-Wallis test showed that participants who differed in their length of
residence also differed in their proficiency levels (H(2) = 30.448, p < 0.001).
The analysis suggests that participants with residence, regardless of the length
of their stay, had higher proficiency levels than participants without residence.
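The r effect size mentioned above can be sketched in Python with invented scores. Note that this simple normal approximation omits the tie correction that SPSS applies, so hand-computed values may differ slightly from SPSS output:

```python
import math
from scipy.stats import mannwhitneyu

# Invented proficiency levels: no residence vs. more than one year
no_res = [2, 3, 3, 2, 3, 4, 3, 2, 3, 3]
long_res = [4, 4, 5, 3, 4, 5, 4, 4]

u, p = mannwhitneyu(no_res, long_res, alternative="two-sided")
n1, n2 = len(no_res), len(long_res)

# Normal approximation (no tie correction): z from U, then r = |z| / sqrt(N)
mu = n1 * n2 / 2
sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (u - mu) / sigma
r = abs(z) / math.sqrt(n1 + n2)
print(f"U = {u}, z = {z:.2f}, r = {r:.2f}")
```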

Summary
This chapter has presented one-way ANOVA, which is the extension of the
independent-samples t-test to several groups. One-way ANOVA assumes that the
groups being compared differ in one dependent or outcome variable only (i.e.,
univariate). This chapter has also presented a nonparametric version of ANOVA,
namely the Kruskal-Wallis test. The next chapter presents the analysis of covari-
ance (ANCOVA), which is an extension of ANOVA that partials out the effect of
an intervening variable.

Review Exercises
To download review questions and exercises for this chapter, visit the Companion
Website: www.routledge.com/cw/roever.
10
ANALYSIS OF COVARIANCE
(ANCOVA)

Introduction
As discussed in the previous chapter, researchers use one-way ANOVA to inves-
tigate differences among three or more groups of individuals that vary on a
dependent variable. However, sometimes there are other variables that could
influence research results and that researchers have to take into account at the data
analysis stage. In a pre-post study, such as Ko (2012), which was discussed in the
previous chapter, it could have been the case that one of the groups had greater
knowledge of the target feature at pretest time, so that any differences at posttest
time might not have been due to the treatment alone. It is, therefore, important to
be able to account statistically for differences that may exist before the treatment
is applied. ANCOVA is one method that can be used for this purpose.
A variable that interferes with the influence of the target independent variable
on the dependent variable is called an intervening variable (also known as a moderator
variable or confounding variable). If researchers do not or cannot control its effect as
part of the research design, they can at least attempt to minimize its effect through
a statistical method such as using ANCOVA.

Eliminating Intervening Variables


One way to ensure that an intervening variable (such as pretest differences) does not
affect the main analysis is by running a separate ANOVA for the intervening vari-
able. Ko (2012) did this by running an ANOVA at the outset of her study to check
that the proficiency levels of her three groups were not significantly different before
the treatment. From this, she was able to claim that any differences at the posttest
time were mainly due to her treatment. Although the logic behind this research
strategy is sound, problems may arise, particularly when group sizes are small and
when the differences between the groups at the pretest time, although not
statistically significant, are large enough to affect the results of the posttest analysis.
If groups have been chosen entirely via random sampling, and are large, these
differences are likely to be negligible, but in the applied linguistics field, sampling
is rarely truly random and groups are seldom large. So even though researchers
may obtain a nonsignificant ANOVA result at the pretest time, any undetected
differences that do exist can affect the posttest outcome.
A simple way to deal with this issue in a pretest-posttest experimental design
(see Phakiti, 2014) is to run an ANOVA test on ‘gain scores’, rather than on the
posttest scores. Gain scores are easily computed by subtracting the pretest score
from the posttest score:

Gain score = posttest score – pretest score

Instead of using the posttest scores as the dependent variable in an ANOVA,
researchers can use gain scores as the dependent variable and investigate whether
group membership led to differences in gains, and which groups’ mean gain scores
differed significantly.
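The gain-score approach can also be sketched in a few lines of Python. The group labels and scores below are invented for illustration:

```python
import pandas as pd
from scipy.stats import f_oneway

# Invented pretest/posttest scores for three hypothetical instruction groups
data = pd.DataFrame({
    "group":    ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "pretest":  [10, 12, 11, 9, 13, 10, 11, 12, 9, 10, 11, 10, 12, 9, 11],
    "posttest": [14, 15, 16, 13, 17, 12, 13, 14, 11, 12, 11, 11, 13, 9, 12],
})

# Gain score = posttest score - pretest score
data["gainscore"] = data["posttest"] - data["pretest"]

# One-way ANOVA on the gain scores rather than the posttest scores
groups = [g["gainscore"].to_numpy() for _, g in data.groupby("group")]
f_stat, p_val = f_oneway(*groups)
print(f"F(2, {len(data) - 3}) = {f_stat:.2f}, p = {p_val:.4f}")
```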

SPSS Instructions: Computing Variables

Click Transform, then Compute Variable (as shown in Figure 10.1).

In the Compute Variable dialog (see Figure 10.2), type ‘gainscore’ in
the Target Variable field. Then move the posttest variable into the
‘Numeric Expression’ pane, followed by a minus sign from the provided
calculator (or it can be typed in from the keyboard), and then the pretest
variable (e.g., posttest – pretest).

Click on the OK button and a new variable named ‘gainscore’ will
be created in the data sheet. Then use this variable to run ANOVA as
explained in Chapter 9.
FIGURE 10.1 Accessing the SPSS menu to launch the Compute Variable dialog

FIGURE 10.2 Compute Variable dialog



Eliminating the Effect of the Intervening Variable


Using gain scores is an option only in a pretest-posttest experimental design.
However, there are other cases of intervening variables that can distort statistical
results. For example, researchers may have research questions such as:

• Do different presentation formats of L2 Chinese vocabulary lead to different
learning outcomes, taking into account learners’ prior familiarity with the
vocabulary items?
• How strong is the effect of exposure on knowledge of L2 routine formulae
in a test of English pragmatic competence (TEP), taking into account
proficiency differences?

In a vocabulary learning study for Chinese as a foreign language, Lee and
Kalyuga (2011) investigated which format was most suitable for presenting
Chinese characters, their Romanization (pinyin), and their English translations to
learners. The researchers compared a vertical arrangement with the pinyin and
translation below the character with a horizontal arrangement with the pinyin
and translation next to the character. They conducted a vocabulary training ses-
sion, delivering 25 vocabulary items to one group in the horizontal format and to
the other group in the vertical format. They then administered a multiple-choice
test of vocabulary recognition, and asked participants to indicate which of the
vocabulary items they had known prior to the study. Finding out what learners
knew before the study was crucial because the test result was the outcome of a
combination of students’ previous knowledge and the new vocabulary they had
learned during the training. Lee and Kalyuga could have administered a pretest,
but would have run the risk of frustrating their students because they were testing
a large number of new items.
To eliminate the effect of prior vocabulary knowledge, Lee and Kalyuga
(2011) used ANCOVA in their study. ANCOVA is basically a one-way ANOVA
test with the prior vocabulary knowledge variable being used as a covariate.
ANCOVA can correct participants’ test results for their prior familiarity with the
vocabulary items before comparing the results from the groups that experienced
the two experimental conditions. By using the prior vocabulary knowledge
variable as a covariate, ANCOVA adjusts participants’ test scores for their prior
knowledge, so that all participants can be considered equal at the outset of the
study. Lee and Kalyuga (2011) found that the vertical presentation format was
more effective than the horizontal presentation format (F(1,70) = 4.20, p <
0.05, d = 0.99). By correcting the test results for prior familiarity, the researchers
excluded a possible intervening variable, thereby strengthening their findings.
As Lee and Kalyuga (2011) compared two groups, the authors could have used
a t-test. However, the independent-samples t-test cannot deal with a covariate.
ANCOVA can perform comparison tests between two or more groups, and if a
covariate is involved, it can take it into account.

How ANCOVA Works Conceptually


Similar to one-way ANOVA, ANCOVA first computes the means and standard
deviations for each of the groups on the outcome variable without adjustment
for the covariate (prior familiarity with the vocabulary items, in Lee and Kalyuga’s
(2011) case).
As discussed earlier, these scores were partly influenced by the ‘prior familiarity’
covariate. This covariate, therefore, needed partialing out, i.e., its effect had to be
eliminated.
To partial out the covariate, ANCOVA first computes the mean of the
covariate for the whole sample (not for each group). In a complex subsequent
computation, which involves the correlation of the dependent variable and
the covariate, the scores of the outcome variable are adjusted for the covari-
ate. This adjustment leads to scores with the effect of the prior familiarity
minimized, if not totally eliminated. This method provides the answer to the
following question: if there was no effect of prior familiarity, what would the
group means be?
Once these ‘purified’ estimates have been computed, a normal ANOVA test
can be run on them. The main advantage of this process is that researchers can
reduce the error in the ANOVA test result. Once one intervening variable has
been partialled out, researchers can go on to identify other intervening factors and
include them as additional covariates as well. However, it is rare to see more than
one covariate used in an ANCOVA in L2 research.

Conditions of ANCOVA
ANCOVA remains a controversial analysis. Miller and Chapman (2001) list a
number of ways ANCOVA can be misused and has been misused by researchers.
In particular, researchers need to check carefully that the statistical conditions of
ANCOVA are met.
Field (2013) discusses two important conditions for ANCOVA. The first
condition concerns the independence of the independent variable (e.g., the
treatment condition or the proficiency level) and the covariate. That is, the
mean of the covariate should not differ significantly between the groups because
ANCOVA computes an overall value for the covariate across the whole sample,
rather than for each group or each participant. This condition can be checked
by running an ANOVA with the grouping variable as the independent variable,
and the covariate as the dependent variable. The outcome should be
nonsignificant (p > 0.05).

The second condition concerns the relationship between the covariate and
the dependent variable. This condition is known as the homogeneity of regression
slopes, in which the effect of the covariate on the scores for each group should
be similar. Field (2013) shows a way to check that this condition holds in
SPSS. If the homogeneity of regression slopes condition is violated, ANCOVA
should not be performed, or should be performed only with complex adjustments
(Rutherford, 2011).

How to Perform an ANCOVA in SPSS


In order to illustrate how to perform an ANCOVA, TEP data are used (avail-
able in Ch10TEP.sav, which can be found on the Companion Website).
Previous research (e.g., House, 1996; Roever, 2012) has shown a strong effect
of exposure on knowledge of routine formulae. However, in a test such as
TEP, in which situation descriptions and multiple-choice options are delivered
as written text, an effect of general proficiency can be expected as well. To
investigate the strength of the exposure effect once general proficiency as an
intervening factor has been removed, an ANCOVA is performed with rou-
tines scores (routines) as the dependent variable, collapsed residence (collres)
as the independent variable, or ‘fixed factor’, and proficiency level (level) as
the covariate. A collapsed residence (none, up to one year, and more than one
year) is used instead of the actual length of residence, because the independent
variable in ANOVA-type procedures is the grouping variable, so an indepen-
dent variable that does not have too many levels is needed. If the actual length
of stay were used, there would be a group for zero months, a group for one
month, a group for two months, a group for three months, and so on.
Consequently, there would be very few participants in each group, making the
result unstable.

Checking the ANCOVA Conditions


Before the ANCOVA is run, it is necessary to check that its conditions are
met. To test the first condition of independence between the independent
variable and covariate, a regular ANOVA with the collapsed length of resi-
dence (collres) as the independent variable (factor) and proficiency level (level)
as the dependent variable can be run (see Chapter 9 for specific SPSS instruc-
tions for the one-way ANOVA). Scheffé and Tamhane’s T2 post hoc tests
should also be run, in case the ANOVA is statistically significant (as illustrated
in Figure 10.3).
Based on the SPSS outputs, it is found that the ANOVA is statistically signifi-
cant (F(2, 124) = 19.157, p < 0.001), as shown in Table 10.1. Table 10.2 presents
the results of the post hoc tests, which demonstrate that the no-residence group
has significantly lower proficiency than each of the two residence groups.

FIGURE 10.3 Checking ANCOVA assumption of independence of covariate and independent variable

TABLE 10.1 ANOVA for the independent variable and covariate (tests of between-subjects effects)

Source            Type III sum of squares   df    Mean square   F           Sig.
Corrected Model   34.403a                   2     17.202        19.157      .000
Intercept         1,408.945                 1     1,408.945     1,569.084   .000
collres           34.403                    2     17.202        19.157      .000
Error             111.345                   124   .898
Total             1,698.000                 127
Corrected Total   145.748                   126

a R Squared = .236 (Adjusted R Squared = .224)

Table 10.3 shows the difference more clearly and indicates that the members of
the no-residence group are generally at proficiency level 3, whereas the members
of the two residence groups are generally at level 4.
According to these post hoc test results, the first condition of independence
between the independent variable and covariate is not met. This is a statistical
reason to leave out the no-residence group from the ANCOVA. Therefore, the
TABLE 10.2 Post hoc tests for independence of covariate and independent variable (multiple comparisons)

          (I) Collapsed residence      (J) Collapsed residence      Mean difference (I-J)   Std. error   Sig.   95% CI lower   95% CI upper
Scheffé   No residence                 Up to 1 year residence       –1.02∗                  .199         .000   –1.51          –.53
          No residence                 More than 1 year residence   –1.09∗                  .231         .000   –1.67          –.52
          Up to 1 year residence       No residence                 1.02∗                   .199         .000   .53            1.51
          Up to 1 year residence       More than 1 year residence   –.08                    .261         .959   –.72           .57
          More than 1 year residence   No residence                 1.09∗                   .231         .000   .52            1.67
          More than 1 year residence   Up to 1 year residence       .08                     .261         .959   –.57           .72
Tamhane   No residence                 Up to 1 year residence       –1.02∗                  .148         .000   –1.38          –.66
          No residence                 More than 1 year residence   –1.09∗                  .223         .000   –1.65          –.54
          Up to 1 year residence       No residence                 1.02∗                   .148         .000   .66            1.38
          Up to 1 year residence       More than 1 year residence   –.08                    .188         .970   –.55           .40
          More than 1 year residence   No residence                 1.09∗                   .223         .000   .54            1.65
          More than 1 year residence   Up to 1 year residence       .08                     .188         .970   –.40           .55

Based on observed means.
The error term is Mean Square(Error) = .898.
∗ The mean difference is significant at the 0.05 level.

TABLE 10.3 Post hoc tests for the independence of covariate and independent variable

               Collapsed residence          N    Subset 1   Subset 2
Schefféa,b,c   No residence                 72   3.04
               Up to 1 year residence       33              4.06
               More than 1 year residence   22              4.14
               Sig.                              1.000      .948

Means for groups in homogeneous subsets are displayed.
Based on observed means.
The error term is Mean Square(Error) = .898.
a Uses Harmonic Mean Sample Size = 33.465.
b The group sizes are unequal. The harmonic mean of the group sizes is used. Type I error levels are not guaranteed.
c Alpha = 0.05.

next step is to instruct SPSS to ignore participants without residence when
performing the ANCOVA.

SPSS Instructions: Selecting Cases

Click Data, then Select Cases (Figure 10.4).

In the Select Cases dialog, select If condition is satisfied, then click
the If button (Figure 10.5).

FIGURE 10.4 Accessing the SPSS menu to select Cases for analysis

FIGURE 10.5 Select Cases dialog

FIGURE 10.6 Defining case selection conditions

Clicking the If button calls up the Select Cases: If dialog, in which the
condition may be specified: namely, only participants whose residence is
not 0 but 1 or 2 are to be included. For simplicity, this condition
can be entered as ‘collres > 0’ (see Figure 10.6).

FIGURE 10.7 Data View with cases selected out

Click the Continue button, and then the OK button in the Select
Cases dialog to return to Data View. It can be seen that a large
number of participants have been crossed out (Figure 10.7).

Based on the previous post hoc test, researchers can be confident that the condition
of independence between the independent variable and covariate holds for the
data relating to the two remaining groups (up to one year and more than one
year of residence).
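For readers working outside SPSS, the same case selection is a one-line filter; a pandas sketch with invented data:

```python
import pandas as pd

# Invented data with the collapsed-residence coding used above:
# 0 = no residence, 1 = up to one year, 2 = more than one year
data = pd.DataFrame({
    "collres": [0, 1, 2, 0, 2, 1, 0],
    "routines": [55, 75, 90, 60, 88, 72, 58],
})

# Equivalent of SPSS Select Cases with the condition 'collres > 0'
selected = data[data["collres"] > 0]
print(selected)
print("Cases retained:", len(selected))
```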
The next step is to check the second condition for ANCOVA, namely the
assumption of the homogeneity of regression slopes. This can be checked after the
main ANCOVA has been set up (Field, 2013).

FIGURE 10.8 Accessing the SPSS menu to launch ANCOVA

Go to Analyze, then General Linear Model, and choose Univariate,
as shown in Figure 10.8.

In the Univariate dialog (Figure 10.9), click the Model button.

In the Univariate: Model dialog, select Custom at the top. Move both
variables (i.e., collres and level) from the ‘Factor and Covariates’
pane into the ‘Model’ pane.

Highlight both variables at the same time and click the arrow to
build the interaction term, which appears as ‘collres ∗ level’ in the
‘Model’ pane (see Figure 10.10). Click on the Continue button.

FIGURE 10.9 Univariate dialog for choosing a model to examine an interaction among
factors and covariates

FIGURE 10.10 Univariate: Model dialog for defining the interaction term to check the
homogeneity of regression slopes

Click on the OK button.



TABLE 10.4 Output of homogeneity of regression slopes check (tests of between-subjects effects)

Source            Type III sum of squares   df   Mean square   F       Sig.
Corrected Model   4,921.323a                3    1,640.441     9.231   .000
Intercept         1,643.487                 1    1,643.487     9.248   .004
collres           5.934                     1    5.934         .033    .856
level             913.915                   1    913.915       5.143   .028
collres ∗ level   73.579                    1    73.579        .414    .523
Error             9,063.526                 51   177.716
Total             379,444.444               55
Corrected Total   13,984.848                54

a R Squared = .352 (Adjusted R Squared = .314)

Table 10.4 presents the SPSS output for checking the homogeneity of regression
slopes assumption. In Table 10.4, the entries in the collres ∗ level row indicate
whether or not the interaction is significant at the 0.05 level. The nonsignificant
value obtained here is desirable because it means that the condition of
homogeneity of regression slopes has been met. This means that ANCOVA can
be used in the final step.

Open the Univariate dialog and click the Model button to access the
Univariate: Model dialog (see Figure 10.11). Restore the model to
Full factorial by selecting that option.

Click on the Continue button and then in the Univariate dialog select
Options to open the Univariate: Options dialog (Figure 10.12).

Move ‘(OVERALL)’ and ‘collres’ in the ‘Factor(s) and Factor
Interactions’ pane to the ‘Display Means for:’ pane.

Select ‘Compare main effects’, ‘Descriptive statistics’, ‘Estimates of
effect size’, and ‘Homogeneity tests’ (for the Levene’s test).

FIGURE 10.11 Changing the analysis setup back to the original setup

FIGURE 10.12 Options in the Univariate dialog

Click the Continue button to return to the Univariate dialog and
then click on the OK button.

In Figure 10.12, it is important that means are displayed and that main effects
are compared for the independent variable (collapsed residence = collres). The
checkbox ‘Compare main effects’ should be ticked. These are the post hoc tests,
but there are fewer of them here than in the one-way ANOVA. In the Confidence
Interval Adjustment field, select either the Bonferroni or Sidak post hoc tests,
which will be sufficient.
For the purpose of this chapter, not all SPSS output will be presented. Only
the output important for doing ANCOVA will be shown. The first output is nor-
mally labeled as ‘Univariate Analysis of Variance’, which is the same as one-way
ANOVA. Table 10.5 shows the descriptive statistics for the two groups regarding
the routine scores.
According to Table 10.5, residence seems to make a difference to learners’
routines scores, with the mean score of the group of more than one year residence
being 15 points higher than that of the up to one-year residence group. However,
it cannot be determined whether the scores were significantly different until the
ANOVA result is examined. Table 10.6 presents the results of the Levene’s test
between the two comparison groups.
The Levene’s test result is statistically significant, which indicates that the
homogeneity of variance assumption has been violated. However, the result of the
Levene’s test performed by SPSS is actually not relevant when using ANCOVA
because it is not homogeneity of error variances across groups that is assumed
in ANCOVA, though it does matter in ANOVA. Instead, a condition known
as homoscedasticity is required, which means that error variances are similar for
each combination of predictor variables (see Rutherford, 2011, for further dis-
cussion). The Levene’s test does not evaluate homoscedasticity, so it can be
discounted in ANCOVA. Unfortunately, SPSS does not include statistical tests

TABLE 10.5 Descriptive statistics of the routines scores between the two residence groups

Dependent Variable: routines score

Collapsed residence          Mean      Std. deviation   N
Up to 1 year residence       75.5051   15.58180         33
More than 1 year residence   90.5303   12.41251         22
Total                        81.5152   16.09281         55

TABLE 10.6 Levene’s test

Dependent Variable: routines score

F       df1   df2   Sig.
7.396   1     53    .009

of homoscedasticity, and the homoscedasticity assumption is rarely checked in
ANCOVA. Rutherford (2011) suggests how it can be checked through plotting
errors by experimental conditions and applying Cook and Weisberg’s (1983) score
test. This is an area that requires further improvement by SPSS.
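Outside SPSS, a score test of this kind is readily available: the Breusch-Pagan test implemented in statsmodels is essentially the Cook-Weisberg score test for non-constant error variance. A sketch on simulated data (not the TEP dataset):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)
n = 50  # per group, simulated

data = pd.DataFrame({
    "group": ["A"] * n + ["B"] * n,
    "level": rng.normal(4, 1, 2 * n),  # covariate: proficiency level
})
data["routines"] = (60 + 5 * data["level"]
                    + np.where(data["group"] == "B", 10, 0)
                    + rng.normal(0, 8, 2 * n))

model = smf.ols("routines ~ C(group) + level", data=data).fit()

# Score test for heteroscedasticity: regresses squared residuals
# on the model's predictors
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(model.resid, model.model.exog)
print(f"LM = {lm_stat:.2f}, p = {lm_p:.3f}")
```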
Table 10.7 reports the significance (Sig.) and partial eta squared values for (1) the
covariate (level) and (2) the independent variable (collres). The statistically
significant value for level suggests that proficiency level had a noticeable effect on
routines scores (F(1, 52) = 10.629, p < 0.01, partial eta squared = 0.17). The same
is true for residence, which is the independent variable and which is also significant
(p < 0.001), with an even stronger effect (partial eta squared = 0.226). Note that
this is the effect of residence with proficiency removed, and so it may be viewed
as a ‘pure’ residence effect. Table 10.8 shows the estimated routines scores for the
two groups with the proficiency level removed.

TABLE 10.7 ANCOVA analysis (tests of between-subjects effects)

Dependent Variable: routines score

Source            Type III sum of squares   df   Mean square   F        Sig.   Partial eta squared
Corrected Model   4,847.743a                2    2,423.872     13.794   .000   .347
Intercept         1,882.255                 1    1,882.255     10.712   .002   .171
level             1,867.735                 1    1,867.735     10.629   .002   .170
collres           2,674.510                 1    2,674.510     15.221   .000   .226
Error             9,137.105                 52   175.714
Total             379,444.444               55
Corrected Total   13,984.848                54

a R Squared = .347 (Adjusted R Squared = .322)

TABLE 10.8 Estimated means after adjustment for the covariate (estimates)

Dependent Variable: routines score

Collapsed residence          Mean      Std. error   95% CI lower   95% CI upper
Up to 1 year residence       75.810a   2.309        71.176         80.444
More than 1 year residence   90.073a   2.830        84.395         95.751

a Covariates appearing in the model are evaluated at the following values: proficiency level = 4.09.

TABLE 10.9 Group comparisons (pairwise comparisons)

Dependent Variable: routines score

(I) Collapsed residence      (J) Collapsed residence      Mean difference (I-J)   Std. error   Sig.b   95% CI lowerb   95% CI upperb
Up to 1 year residence       More than 1 year residence   –14.263∗                3.656        .000    –21.600         –6.927
More than 1 year residence   Up to 1 year residence       14.263∗                 3.656        .000    6.927           21.600

Based on estimated marginal means.
∗ The mean difference is significant at the 0.05 level.
b Adjustment for multiple comparisons: Bonferroni.

According to Table 10.8, the means have changed little (compared to those
shown in Table 10.5). The mean for the up-to-one-year residence group rose
from 75.5 to 75.8, and the mean for the more-than-one-year residence group
fell from 90.5 to 90.1. The results of the first ANOVA, with proficiency level
as the dependent variable (Table 10.3), indicate that the members of the
more-than-one-year residence group had slightly higher proficiency than the
members of the up-to-one-year residence group, and as this proficiency effect
was minimized through the use of ANCOVA, the mean score for this group
subsequently decreased. Table 10.9 shows the post hoc comparisons between the
two groups.
In Table 10.9, it can be seen that the two residence groups significantly differ
from each other. This ANCOVA result could be written up as follows.

An analysis of covariance (ANCOVA) for routines scores was run with
residence as the independent variable and proficiency level as a covariate.
Checking the assumption that residence and proficiency level were
independent factors led to the removal of participants without residence
from further analysis. That is, the analysis was subsequently run on
participants with residence only. The homogeneity of regression slopes was
confirmed. The ANCOVA results show that proficiency level was
statistically significant (F(1, 52) = 10.629, p = 0.002, partial eta squared =
0.17, large effect size), and after removing its effect, length of residence was
also a significant factor (F(1, 52) = 15.221, p < 0.001, partial eta squared =
0.226, large effect size). This finding suggests that being in an
English-speaking country for more than one year helped test takers improve
their knowledge of routines significantly more than being in the country
for less than one year.

Summary
In cases in which there are pre-treatment or pre-existing differences among research
participants, researchers can attempt to correct for them in two ways. The first
method is by analyzing gain scores; the second method is by employing ANCOVA.
As can be seen, conducting an ANCOVA requires several steps. This chapter has
also shown that ANCOVA is not without issues. First, complex statistical
assumptions need to be met in order to arrive at a meaningful outcome. Second,
the choice of intervening variables requires careful justification, especially in the
case of L2 research, in which multiple independent factors often affect language
learning or use simultaneously. This choice depends on researchers’ understand-
ing of the research context. While it is technically possible to include multiple
covariates in ANCOVA, this is not recommended as it makes the analysis and
its outcomes overly complex. Wherever possible, intervening variables should be
controlled in the research design. The next chapter presents repeated-measures
ANOVA, which is an extension of the paired-samples t-test, in which a dependent
variable is measured three or more times.

Review Exercises
To download review questions and exercises for this chapter, visit the Companion
Website: www.routledge.com/cw/roever.
11
REPEATED-MEASURES ANOVA

Introduction
The previous chapters on one-way ANOVA and ANCOVA presented the exten-
sion of the independent-samples t-tests to three or more groups. In this chapter, we
introduce an extension of the paired-samples t-test, which is used to compare the
mean scores of the same group of participants on two occasions, to three or more
occasions. This analysis is known as a repeated-measures analysis of variance (hereafter,
repeated-measures ANOVA). This kind of analysis is common in pre-post studies in
which researchers first give participants a pretest, then administer a treatment, and
finally give participants one or more posttests (posttest/delayed posttest) to investi-
gate the effect of the treatment on the participants (see Figure 11.1).
Another application of the repeated-measures ANOVA is to examine the rela-
tive levels of difficulty of several test sections or types of language tasks completed
by the same learner group. A one-way (i.e., one independent variable) repeated-
measures ANOVA takes a single sample and compares several measures taken on
that group. A study by Laufer and Rozovski-Roitblat (2011) used a repeated-
measures ANOVA to compare the learning of vocabulary presented 2–3, 4–5,
or 6–7 times. They investigated whether different amounts of exposure affected
learning by comparing learners’ recall and recognition scores for each group of
words. In this study, the ‘amount of exposure’ was considered the independent,
within-subject variable with three levels. In addition to varying the different

Pretest (Time 1) → Treatment → Posttest (Time 2) → Delayed Posttest (Time 3)

FIGURE 11.1 A pretest, posttest, and delayed posttest design


Repeated-Measures ANOVA 155

amounts of exposure to the vocabulary items, two types of instruction were
administered: Focus on Form and Focus on FormS. The Focus on Form condition
included no further practice of the items, whereas the Focus on FormS condition
included targeted, decontextualized practice.

TABLE 11.1 Six different tests with 10 vocabulary items

                  2–3 encounters   4–5 encounters   6–7 encounters
Focus on Form     10 items         10 items         10 items
Focus on FormS    10 items         10 items         10 items
The vocabulary presented consisted of 60 words that the participants were
unlikely to know. The researchers seeded these vocabulary items throughout read-
ing comprehension texts at three levels of ‘amount of exposure’: (1) 2–3 times,
(2) 4–5 times, and (3) 6–7 times. Twenty items occurred at each level of exposure;
these were then split again into two sets based on the two teaching conditions.
Altogether, six different tests with 10 vocabulary items were covered in the study,
as presented in Table 11.1 (see Laufer & Rozovski-Roitblat, 2011, p. 398, for the
vocabulary items used).
The instructional exposure phase consisted of learners’ completion of reading
comprehension tasks, and the additional Focus on FormS practice. Upon comple-
tion of the instructional phase, the participants completed (1) a passive recall task,
in which they provided an L1 Hebrew translation for English words, and (2) a
passive recognition task in which they selected the correct L1 Hebrew translation
for English words from four options.
The descriptive statistics (Laufer & Rozovski-Roitblat, 2011, pp. 400–401) sug-
gest that recognition was easier for participants than recall, and that more exposure
to vocabulary led to higher scores, regardless of the test or teaching condition.
Also, the students under the Focus on FormS condition always had a higher reten-
tion rate than those under the Focus on Form condition for the same amount of
exposure. To investigate whether the effect of the differences among the groups
with 2–3, 4–5, and 6–7 exposures was statistically significant, Laufer and Rozovski-
Roitblat (2011) performed four separate one-way repeated-measures ANOVAs,
with the following results:

• Recall scores under the Focus on Form condition: F(2,38) = 2.77, n.s.
• Recall scores under the Focus on FormS condition: F(2,38) = 33.91, p < 0.001
• Recognition scores under the Focus on Form condition: F(2,38) = 1.45, n.s.
• Recognition scores under the Focus on FormS condition: F(2,38) = 13.4,
p < 0.001

It can be seen that the differences under the Focus on FormS conditions were
significant at p < 0.001, while the differences under the Focus on Form conditions
were not statistically significant. The post hoc tests for the Focus on FormS condi-
tion indicated significant differences between all levels of exposure for the recall
scores, and between the 2–3 and 6–7 exposures levels for the recognition scores.
These post hoc tests were performed for the Focus on FormS condition only
because the repeated-measures ANOVA results for the Focus on Form condition
were not statistically significant. In summary, Laufer and Rozovski-Roitblat (2011)
found that students’ retention of vocabulary was better if they were exposed to it
more and received additional practice.

The Assumptions of the Repeated-Measures ANOVA

The basic assumptions of the repeated-measures ANOVA are similar to those of
the one-way ANOVA discussed in Chapter 9. The repeated-measures ANOVA
requires an interval-level dependent (or outcome) variable, whose values are the
measurements of the target phenomenon (e.g., vocabulary test scores). This variable
has three or more “levels”, each of which is a measurement point (e.g., 2–3, 4–5, and
6–7 exposures). There is no grouping variable since the whole sample is measured at
least three times, and the results from the three measurements are treated as levels
that are compared with each other, rather than as separate groups. In Laufer and Rozovski-Roitblat’s (2011)
study, the independent variable was the frequency of encountering vocabulary items.
In addition to the general assumptions of ANOVA, a repeated-measures
ANOVA has another requirement, known as sphericity. Sphericity refers to the
condition that the variances of differences between the individual measurements
should be roughly equal. SPSS can provide a test for sphericity, and if this assump-
tion is violated, the repeated-measures ANOVA will be corrected for this violation.
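For readers who like to see the arithmetic, the sketch below illustrates what sphericity refers to, using invented scores for six learners measured on three occasions (the data and variable names are illustrative only, not output from SPSS):

```python
# Sphericity concerns the variances of the difference scores between each pair
# of measurement occasions: these variances should be roughly equal.
from itertools import combinations

def variance(xs):
    # Sample variance (denominator n - 1)
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Hypothetical scores: six learners, each measured on three occasions
scores = {
    "time1": [10, 12, 9, 14, 11, 13],
    "time2": [12, 14, 10, 15, 13, 14],
    "time3": [15, 16, 13, 18, 16, 17],
}
for a, b in combinations(scores, 2):
    diffs = [x - y for x, y in zip(scores[a], scores[b])]
    print(f"var({a} - {b}) = {variance(diffs):.2f}")
```

Here the three variances of the pairwise difference scores are of the same order, which is the pattern that Mauchly's test checks formally.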
As with one-way ANOVA, the repeated-measures ANOVA compares three
or more means. However, in this case, it compares the means of the same group
of people across different measurements. When the means are largely different at
the different measurement times, researchers can assume that the treatment has an
effect on the dependent variable. For example, in Laufer and Rozovski-Roitblat’s
(2011) case, it was the amount of exposure to a vocabulary item that had an impact
on student learning.
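The variance partitioning behind this comparison can be sketched in a few lines of Python on invented data (four learners, three measurement occasions). This is only an illustration of the logic that SPSS carries out, not any study's actual analysis:

```python
# A minimal one-way repeated-measures ANOVA on hypothetical scores.
# Total variability is partitioned into condition, subject, and error parts.
def rm_anova(data):  # data: one row of scores per participant
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]
    subj_means = [sum(row) / k for row in data]
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_cond - ss_subj
    df_cond, df_error = k - 1, (k - 1) * (n - 1)
    f = (ss_cond / df_cond) / (ss_error / df_error)
    return f, df_cond, df_error

scores = [[2, 4, 7], [3, 6, 7], [4, 5, 8], [3, 5, 6]]  # invented scores
f, df1, df2 = rm_anova(scores)
print(f"F({df1}, {df2}) = {f:.1f}")  # F(2, 6) = 36.0
```

Because each participant contributes a score at every occasion, the subject-to-subject variability is removed from the error term, which is what gives the repeated-measures design its power.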
There are also post hoc tests for repeated-measures ANOVA that compare the
various treatments with each other. In Laufer and Rozovski-Roitblat’s (2011) study,
the post hoc test compares 2–3 exposures with 4–5 and 6–7 exposures, as well as
4–5 exposures with 6–7 exposures. The selection of post hoc tests for repeated-
measures ANOVA in SPSS is more limited than that for one-way ANOVA. Usually
researchers choose Bonferroni as the post hoc test, which is quite conservative, but
the Sidak post hoc test is also an option for exploratory research. It is important to
note that a post hoc test does not have to be run or reported if the overall result
from a repeated-measures ANOVA is not statistically significant.
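The Bonferroni adjustment that SPSS applies is conceptually simple: each pairwise p-value is multiplied by the number of comparisons and capped at 1 before being compared with the significance level. A minimal sketch with invented p-values:

```python
def bonferroni_adjust(p_values):
    # Multiply each p-value by the number of comparisons, capping at 1.0
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]

raw = [0.01, 0.04, 0.20]  # hypothetical p-values for three pairwise tests
print([round(p, 3) for p in bonferroni_adjust(raw)])  # [0.03, 0.12, 0.6]
```

The capping explains why Bonferroni is conservative: a raw p-value of 0.04 that would look significant on its own becomes 0.12 once three comparisons are taken into account.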
In one-way ANOVA, one source of error is possible differences between groups
that are unassociated with the independent variable, but affect the dependent vari-
able. Researchers can try to control these differences by adopting an experimental
design that controls these confounding factors or by including a covariate (and
running ANCOVA, as discussed in Chapter 10). Nevertheless, there is always the
risk that an intervening variable has contaminated the research results. In the case
of repeated-measures ANOVA, the participants are the same for all measurements,
so this source of extraneous variance does not occur. Of course, participants are
assumed not to change in other, uncontrolled ways over the course of the research.
One challenge with repeated-measures designs, however, is that measures, such as
tests, need to be ‘equivalent’ but not ‘identical’, in order to avoid a memory or
practice effect.
Finally, researchers cannot simply run multiple paired-samples t-tests in place
of the repeated-measures ANOVA, as error rates accumulate across tests, making
researchers prone to a Type I error—
believing they have found something statistically significant when in fact there
was nothing to find (see Chapter 9 for a fuller discussion in relation to one-way
ANOVA).
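The accumulation of error can be quantified under the simplifying assumption that the tests are independent: the probability of at least one false positive across k tests, each run at significance level α, is 1 − (1 − α)^k.

```python
def familywise_error(alpha, k):
    # Probability of at least one Type I error across k independent tests
    return 1 - (1 - alpha) ** k

# Three pairwise t-tests at alpha = .05 instead of one repeated-measures ANOVA:
print(round(familywise_error(0.05, 3), 3))  # 0.143
```

In other words, the nominal 5% error rate has already inflated to roughly 14% with only three pairwise tests, which is why a single omnibus ANOVA followed by adjusted post hoc tests is preferred.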

Meaning of Statistical Significance and Significant Post Hoc Results
A significant result for the repeated-measures ANOVA indicates that there was
a significant difference in the means between the measurements. In Laufer and
Rozovski-Roitblat (2011), a significant result showed that the number of expo-
sures to new vocabulary items made a significant difference to the retention of
new vocabulary, but it did not tell the researchers the number of exposures that
was most conducive to the learning of vocabulary. To find out how the treatment
effect works more precisely, researchers need to look at post hoc results. In a pre-post
design with a delayed posttest, a significant difference between the pretest and the
posttest results as well as between the pretest and the delayed posttest results should
be expected. This is because the treatment between the pretest and posttest should
lead to learning and a change in the attribute being measured. However, any differ-
ences between the posttest and delayed posttest could be due to the maintenance or
sustainability of learned knowledge. If a decline in scores between the posttest
and delayed posttest is observed, it may be inferred that the effect of the treatment
is not sustainable. That is, learners are able to remember what they learned for an
immediate posttest, but their learning does not last over time. In the case of Laufer
and Rozovski-Roitblat, the post hoc differences show that 6–7 exposures was
the most effective number of exposures for vocabulary learning in the context of a
Focus on FormS curriculum. It can be seen that the results from a post hoc test can
be important for drawing pedagogical conclusions relevant to curriculum designs.

The Effect Size for the Repeated-Measures ANOVA


The effect size for the repeated-measures ANOVA is calculated as a coefficient
called partial eta squared (partial η2 or sometimes η2 partial). The partial η2 is calculated
in a slightly different way from the η2 in the one-way ANOVA, but the interpreta-
tion of the effect size is similar.
The value of the partial η2 is between 0 and 1. It can be interpreted as the
proportion of overall variance accounted for by the variable under measurement.
In the case of Laufer and Rozovski-Roitblat’s (2011) study, this variable would be
the treatment or amount of vocabulary exposure. Whether the size of the partial
η2 is considered small or large depends on researchers’ expectations. Generally, a
partial eta squared over 0.5 would be considered large and one below 0.1 would
be considered small.
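Although SPSS reports partial η2 directly, it can be recovered from the sums of squares in the output as SS_effect / (SS_effect + SS_error). A quick check using the values that appear later in this chapter in Table 11.6:

```python
def partial_eta_squared(ss_effect, ss_error):
    # partial eta^2 = SS_effect / (SS_effect + SS_error)
    return ss_effect / (ss_effect + ss_error)

# Sums of squares from Table 11.6 (the section effect and its error term):
print(round(partial_eta_squared(2551.417, 101213.414), 3))  # 0.025
```

This reproduces the partial η2 of 0.025 that SPSS reports for the section effect in the worked example below.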

How to Perform the Repeated-Measures ANOVA in SPSS


In order to illustrate how to perform the repeated-measures ANOVA, the TEP
data are used (available in Ch11TEP.sav, which can be found on the Companion
Website). While this example is related to the comparison of three test sections, the
methods for examining the pretest, posttest, and delayed posttest differences follow
the same procedures. In the TEP, scores on the three test sections can be compared
to evaluate their relative difficulty. The ‘type of pragmatic knowledge’ is the inde-
pendent, within-subject variable with three levels. The key research question is:
‘which aspect of pragmatics is most difficult for test takers?’ In other words, is it
easier to interpret implicature than to recognize routine formulae? Is producing
speech acts easier than recognizing implicature or producing routine formulae?
To answer these questions, a repeated-measures ANOVA can be performed for
the three section scores for (1) the implicature section (implicature), (2) routines
section (routines), and (3) speech acts section (speechacts).

SPSS Instructions: Repeated-Measures ANOVA

Click Analyze, then General Linear Model, and then Repeated Mea-
sures (Figure 11.2).

In the Repeated Measures Define Factors dialog (Figure 11.3), enter
the variable name ‘section’ (or a name of your choice). As can be
seen in the data sheet, there are three levels for this variable: implicature,
routines, and speech acts. Insert ‘3’ in the Number of Levels field, then click
on the Add button. In Figure 11.3, the Measure Name field can be ignored.
FIGURE 11.2 Accessing the SPSS menu to launch a repeated-measures ANOVA

FIGURE 11.3 Repeated Measures Define Factors dialog



FIGURE 11.4 Repeated Measures dialog

The Define button will become active after clicking Add (shown in
Figure 11.3). Click this button to obtain the Repeated Measures dia-
log shown in Figure 11.4. To add the variables ‘implicature’, ‘routines’, and
‘speech acts,’ move each of these three variables one at a time from the left-
hand pane to the ‘Within-Subjects Variables’ pane. In Figure 11.4, ‘implica-
ture’ and ‘routines’ have been added to the Within-Subjects Variables pane,
while ‘speech acts score’ is still to be added.

In the Repeated Measures dialog, click the Options button.

In the resulting Repeated Measures: Options dialog (Figure 11.5),
move ‘section’ from the ‘Factor(s) and Factor Interactions:’ pane to
the ‘Display Means for:’ pane.

Tick the Compare main effects checkbox, and select Bonferroni as the
post hoc test for the repeated-measures ANOVA from the Confidence
interval adjustment drop-down menu.

FIGURE 11.5 Repeated Measures: Options dialog

Tick the Descriptive statistics and Estimates of effect size checkboxes.


Click the Continue button to return to the Repeated Measures dia-
log, then click on the OK button.

The output begins with the codes that SPSS assigns to the levels of the within-
subjects factor (Section) as shown in Table 11.2. Table 11.3 presents the descriptive
statistics, which indicate that implicature was the easiest section for the partici-
pants, followed by routines, and then speech acts. At first glance, the differences do
not appear large enough to be statistically significant.
Table 11.4 presents the multivariate test output. This is ordinarily used for a
multivariate ANOVA (also known as MANOVA), which has several dependent
variables. SPSS produces it here as well in case Mauchly’s Test of Sphericity indi-
cates a severe violation of sphericity since these tests do not assume sphericity.
Should sphericity be violated, results from these tests together with the corrections
for sphericity violations (see Table 11.6) can help researchers judge whether the
outcome is significant (see Chapter 14 in Field, 2013 for more discussion on this
strategy).
Table 11.5 presents the result of Mauchly’s Test of Sphericity, which has sphe-
ricity as its null hypothesis. Mauchly’s Test of Sphericity should be nonsignificant
in order for the sphericity assumption to be met. In this table, the result of

TABLE 11.2 The within-subjects factors

Section   Dependent variable
1         implicature
2         routines
3         speechacts

TABLE 11.3 Descriptive statistics

                    Mean      Std. deviation   N
Implicature score   64.2160   26.69875         166
Routines score      60.7430   23.02941         166
Speech acts score   58.7366   29.27428         166

TABLE 11.4 The multivariate test output

Effect                        Value   F       Hypothesis df   Error df   Sig.   Partial eta squared
section   Pillai’s Trace      .055    4.774   2.000           164.000    .010   .055
          Wilks’ Lambda       .945    4.774   2.000           164.000    .010   .055
          Hotelling’s Trace   .058    4.774   2.000           164.000    .010   .055
          Roy’s Largest Root  .058    4.774   2.000           164.000    .010   .055

TABLE 11.5 Mauchly’s Test of Sphericity

Within-subjects effect: section
Mauchly’s W = .981, approx. chi-square = 3.093, df = 2, Sig. = .213
Epsilon: Greenhouse-Geisser = .982, Huynh-Feldt = .993, Lower bound = .500

Mauchly’s Test of Sphericity was nonsignificant (p = 0.213), so this assumption
for the repeated-measures ANOVA has been met.
Table 11.6 presents the results of the tests of within-subjects effects, which
include the repeated-measures ANOVA result. Since Mauchly’s Test of Sphericity
was nonsignificant, the first row of the ‘Section’ part of the table can be examined, in
particular the significance column (Sig.). It can be seen that ‘Section’ has a significant
effect ( p = 0.016), which indicates that the three sections were indeed significantly
different in their difficulty levels. However, in the next column, it can be seen that

TABLE 11.6 Results from tests of within-subjects effects

Source                           Type III sum   df        Mean        F       Sig.   Partial eta
                                 of squares               square                     squared
section    Sphericity Assumed    2,551.417      2.000     1,275.708   4.159   .016   .025
           Greenhouse-Geisser    2,551.417      1.963     1,299.544   4.159   .017   .025
           Huynh-Feldt           2,551.417      1.987     1,284.227   4.159   .017   .025
           Lower Bound           2,551.417      1.000     2,551.417   4.159   .043   .025
Error      Sphericity Assumed    101,213.414    330.000   306.707
(section)  Greenhouse-Geisser    101,213.414    323.947   312.438
           Huynh-Feldt           101,213.414    327.811   308.755
           Lower bound           101,213.414    165.000   613.415

TABLE 11.7 Estimates

Section   Mean     Std. error   95% confidence interval
                                Lower bound   Upper bound
1         64.216   2.072        60.124        68.307
2         60.743   1.787        57.214        64.272
3         58.737   2.272        54.250        63.223

the partial η2 was small (0.025). The partial η2 value suggests that only 2.5% of the
overall variance was accounted for by the differences among the three test sections.
Therefore, although the three test sections were significantly different in terms of dif-
ficulty for the test takers, the differences among them were small. The result of this
repeated-measures ANOVA can be written as F(2, 330) = 4.159, p < 0.05.
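The degrees of freedom in this result follow directly from the design: with k measurements and n participants, the within-subjects effect has k − 1 degrees of freedom and its error term has (k − 1)(n − 1). A quick check for the TEP data (k = 3 sections, n = 166 test takers):

```python
def rm_anova_df(k, n):
    # (df for the within-subjects effect, df for its error term)
    return k - 1, (k - 1) * (n - 1)

print(rm_anova_df(3, 166))  # (2, 330) -- matching F(2, 330) above
```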
The other results in Table 11.6 include those for the Greenhouse-Geisser,
Huynh-Feldt and Lower Bound corrections. These are corrections for the statisti-
cal results when the sphericity assumption is violated. Greenhouse-Geisser is used
in cases of epsilon (ε) < 0.75 in Mauchly’s test and Huynh-Feldt if ε > 0.75 (see
Table 11.5). The Lower Bound correction (also included in Table 11.6) is conservative,
and can be used if there are serious concerns about sphericity violations.
SPSS also produces extra outputs called Tests of Within-Subjects Contrast and Test
of Between-Subjects Effects, which can be ignored (these outputs are therefore not
included here). The within-subject contrasts are only relevant if there are specific
preexisting hypotheses, and the between-subjects effects are not of interest when a
repeated-measures ANOVA is performed. The next few tables in the SPSS output

TABLE 11.8 Pairwise comparisons

(I) section   (J) section   Mean difference (I–J)   Std. error   Sig.b   95% confidence interval for differenceb
                                                                         Lower bound   Upper bound
1             2             3.473                   1.921        .217    –1.172        8.118
              3             5.479∗                  1.806        .008    1.112         9.847
2             1             –3.473                  1.921        .217    –8.118        1.172
              3             2.006                   2.034        .976    –2.912        6.925
3             1             –5.479∗                 1.806        .008    –9.847        –1.112
              2             –2.006                  2.034        .976    –6.925        2.912

Based on estimated marginal means
∗ The mean difference is significant at the .05 level.
b Adjustment for multiple comparisons: Bonferroni.

show the post hoc test results that compare the various levels of the test section.
Estimates are shown in Table 11.7, which presents similar information to that in the
descriptive statistics table (see Table 11.3). Table 11.7 includes the lower and upper
bounds at the 95% confidence interval.
Table 11.8 presents the pairwise comparisons from the results of the Bon-
ferroni post hoc test. This table shows that the only significant difference was
between Sections 1 and 3 (implicature and speech acts); the former was signifi-
cantly easier than the latter. The routines section did not differ significantly from
either the implicature section or the speech act section. The final table in the SPSS
output, which is not presented in this section, is called Multivariate Tests. This out-
put is the same as that shown in Table 11.4.
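As a quick arithmetic check, the mean differences tested in Table 11.8 are simply the differences between the section means shown in Table 11.7:

```python
# Section means from Table 11.7 (1 = implicature, 2 = routines, 3 = speech acts)
means = {"implicature": 64.216, "routines": 60.743, "speechacts": 58.737}
print(round(means["implicature"] - means["routines"], 3))    # 3.473
print(round(means["implicature"] - means["speechacts"], 3))  # 5.479
```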
On the basis of the repeated-measures ANOVA, the results may be written up
as follows.

A one-way repeated-measures ANOVA was conducted to compare the
three sections of the TEP (i.e., implicature, routines, and speech acts). The
result indicated that the three sections were significantly different in their
level of difficulty, but that the differences were small, F(2, 330) = 4.159, p
< 0.05, partial η2 = 0.025, small effect size. A post hoc comparison using
a Bonferroni adjustment showed that the implicature section was signifi-
cantly easier than the speech acts section (p < 0.05), but the difference was
also small (Cohen’s d = 0.19). The routines section was not significantly
different from either the implicature or the speech acts section.

Summary
The rationale behind repeated-measures ANOVA is the same as that of one-way
ANOVA discussed in Chapter 9. However, unlike one-way ANOVA, which
examines differences among three or more groups, repeated-measures ANOVA identifies
whether scores on the same measure differ over three or more occasions (e.g., in an
experimental study) or whether scores on three or more test sections that mea-
sure different aspects of ability differ. Repeated-measures ANOVA is the extension
of the paired-samples t-test presented in Chapter 7. Repeated-measures ANOVA
is a within-group design, whereas one-way ANOVA is a between-group design.
Repeated-measures ANOVA can be applied in language testing studies in which
researchers aim to find out whether one particular test section is more or less dif-
ficult than other sections, or whether there are differences among mean scores
under three or more different test conditions. Typically, the repeated-measures
ANOVA requires one independent variable (nominal, categorical, or ordinal), and
one dependent variable (interval or continuous). The next chapter presents two-
way, mixed-design ANOVA.

Review Exercises
To download review questions and exercises for this chapter, visit the Companion
Website: www.routledge.com/cw/roever.
12
TWO-WAY MIXED-DESIGN ANOVA

Introduction
L2 researchers can combine a repeated-measures ANOVA (Chapter 11) with a
between-groups ANOVA (Chapter 9). Such a combination allows researchers to
simultaneously examine the effect of a between-subject variable, such as length of
residence or type of treatment, a within-subject variable, such as test time (e.g., pre,
post, and delayed post) or task type (e.g., implicature, routines, and speech acts),
and the interaction among these variables. This chapter explains how a two-way
mixed-design ANOVA can be performed in SPSS and extends what has been
discussed in Chapter 11.
Figure 12.1 presents an experimental design that investigates the effect of a
treatment condition on an aspect of language learning, considering test time and
task types as independent factors.
The design of Shintani, Ellis, and Suzuki’s (2014) study was similar to that
shown in Figure 12.1, except that their study used five conditions. The researchers
employed a mixed-design ANOVA to address their research questions. The
researchers investigated the effect of different types of written feedback on ESL
learners’ use of English indefinite articles and hypothetical conditionals. Dictogloss
tasks were used in which the participants listened to a text twice, took notes, and
then tried to recreate the text as accurately as possible. After completing one dic-
togloss as a pretest, the learners received feedback, and a week after the feedback,
they completed another dictogloss task as a posttest, and a third one as a delayed
posttest two weeks following the first. In this study, ‘time’ was the independent,
within-subjects variable, and it had three levels since the researchers evaluated
performance at three points: pretest, posttest, and delayed posttest. Shintani et al.
(2014) divided their sample into five groups depending on the type of feedback
Two-Way Mixed-Design ANOVA 167

Randomly assign participants to:
Group A: Pretest + Treatment A → Posttest → Delayed Posttest
Group B: Pretest + Treatment B → Posttest → Delayed Posttest
Control Group: Pretest + No Treatment → Posttest → Delayed Posttest

FIGURE 12.1 Diagram of a pretest-posttest control-group design (adapted from Phakiti,
2014, p. 68)

the groups received, and whether or not they revised their texts; the study
therefore had not only a within-subjects variable, but also a between-subjects
variable. Their analysis was thus a two-way mixed-design ANOVA involving the
following five groups:

1. The first group received metalinguistic (grammatical) explanations, and were
given the opportunity to revise their texts.
2. The second group received metalinguistic explanations, but were not given the
opportunity to revise their texts.
3. The third group received direct corrective feedback on their errors, and were
given the opportunity to revise their texts.
4. The fourth group received direct corrective feedback, but were not given the
opportunity to revise their texts.
5. The fifth group was the control group: its members received no metalinguistic
explanations, no direct corrective feedback, and did not have the opportunity
to revise their texts.

Shintani et al. were interested in whether the treatments were effective in
promoting learning, and whether one was more beneficial than another. It was
expected that the control group would not perform differently between the pre-
test and posttest, but that the treatment groups would.
The researchers ran separate analyses for each of the target features of inter-
est (i.e., the use of articles and the hypothetical conditional). Their approach
will be illustrated using the participants’ performance in using the hypothetical
conditional.

TABLE 12.1 Descriptive statistics of the percentage scores for correct use for the five
treatment conditions by three tasks (adapted from Shintani et al., 2014, p. 118)

Group                           N    Pretest      Posttest     Delayed posttest
                                     Mean   SD    Mean   SD    Mean   SD
Explanation, without revision   29   28     28    78     26    55     38
Correction, without revision    27   19     20    82     28    54     36
Explanation with revision       26   21     29    81     23    55     41
Correction with revision        24   20     27    84     28    61     36
Control                         34   27     30    20     27    32     33

Table 12.1 shows the percentage scores for correct use, N sizes, and
standard deviations for the five treatment conditions across the three task time
points.
According to Table 12.1, all the treatments had a strong immediate effect. The
scores of the members of the treatment groups were between 20 and 30 at the
pretest time and increased to approximately 80 at the posttest time. The scores for
the members of the control group, however, dropped between the pretest and the
posttest. In the delayed posttest, the scores of the members of the four treatment
groups had fallen from the time of the posttest to the 50–60 range, while those
of the members of the control group had risen slightly. Figure 12.2 illustrates the
changes across time points among the five groups (created based on the means
reported in Table 12.1).
The line graph in Figure 12.2 demonstrates the steep increase and subsequent
drop in scores for the members of the four treatment groups, which differed sig-
nificantly from one another. However, the overall trends show some similarities
among the groups. Shintani et al. performed a two-way mixed-design ANOVA
that showed statistically significant effects. Effects for independent variables (i.e.,
groups and time—hence, a two-way ANOVA) are called main effects, and it was
found that the main effect for groups was statistically significant (F(4, 135) =
9.428, p < 0.05, η2 = 0.156, small effect size), as was that for time (F(2, 270) =
113.574, p < 0.05, η2 = 0.429, large effect size). There was also a statistically sig-
nificant interaction of the group and time factors (F(8, 270) = 11.331, p < 0.05,
η2 = 0.196, medium effect size).
The findings from this study can be summarized as follows. First, for their
within-subjects variable, time, the researchers found statistically significant and large
differences between the scores in the pre- and posttest for all of the experimen-
tal groups, except for the control group. They also found statistically significant,
but small, differences among the treatment groups with regard to the changes in
the scores of the participants in the posttest and delayed posttest. The difference
between the posttest and delayed posttest scores for members of the control group
was not statistically significant.

FIGURE 12.2 Changes across time points among the five groups (line graph of the
five groups’ mean scores at the pretest, posttest, and delayed posttest)
Second, for the between-subject variable ‘Group’, the researchers found that
none of the groups differed significantly from any other at the pretest time. This
finding reduces the possibility that any differences that existed prior to the experi-
ment affected the final results. All four treatment groups significantly differed from
the control group at the posttest time. However, at the time of the delayed post-
test, only one treatment group (the group that received direct correction, and who
made revisions to their texts) had significantly higher scores than the control group
(with a medium effect size). None of the experimental groups differed significantly
from each other at any time.
Third, there are no post hoc tests for interactions. However, the significant
interaction term in the ANOVA shows that ‘time’ impacted the performance of
the groups differently. That is, the scores of the members of the experimental
groups increased greatly initially, then decreased over time, whereas the scores of
the members of the control groups first decreased, and then increased slightly. It
was this contrast between the control group and the experimental groups that
made the interaction term statistically significant. The pretest, posttest, and delayed
posttest changes for the control group might have been due to random fluctuation.

How to Perform a Two-Way Mixed-Design ANOVA in SPSS


To illustrate how to perform a two-way mixed-design ANOVA in SPSS, the TEP data
are used (available in Ch12TEP.sav, which can be found on the Companion Website).
The research question of interest is whether residence had an effect on performance,
and whether it affected some sections more than others. Learners with three levels of
residence in an English-speaking country—no residence, up to one year’s residence,
and residence of more than one year—completed all the three sections of the TEP.
It was expected that residence (the between-subject variable) would affect learners’
performance in each section of the test. The three sections (which together constitute
the within-subject variable ‘section’) did not have the same degree of difficulty.
A mixed-design ANOVA can be run with the test section as the within-subjects
variable with three levels (i.e., implicature, routines, and speech acts) and residence
as the between-subjects variable with three levels (i.e., no residence, up to one year’s
residence, and more than one year’s residence). This mixed-design ANOVA is two-
way because there are two independent variables, namely the residence variable and
the test section variable. The instructions for the use of SPSS are similar to those for
a repeated-measures ANOVA, but with a few additional procedures.

SPSS Instructions: Mixed-Design ANOVA


The first three steps to perform this two-way mixed-design ANOVA can be found
in SPSS Instructions: Repeated-Measures ANOVA in Chapter 11.

After clicking the active Add button, followed by Define (see Figure 11.3
in Chapter 11), the Repeated Measures dialog appears (Figure 12.3).

FIGURE 12.3 The Repeated Measures dialog



FIGURE 12.4 Repeated Measures: Profile Plots dialog

Move ‘collapsed residence’ from the pane on the left to the
‘Between-Subjects Factor(s):’ pane.

In the Repeated Measures dialog (Figure 12.3), click the Plots but-
ton. (Plots are useful to help visualize the interactions.)

In the resulting Repeated Measures: Profile Plots dialog (Figure 12.4),
move the variable ‘collres’ from the left pane to the Horizontal Axis
field, and move ‘section’ from the left pane to the Separate Lines field.

Click on the Add button (see Figure 12.5, which shows collres∗section
in the Plots field) and then click on the Continue button.

Return to the Repeated Measures dialog shown in Figure 12.3 and
click on the Post Hoc button. In the dialog that results, tick the
Scheffé and Tamhane’s T2 checkboxes (Figure 12.6). Note that post hoc tests
can be chosen only for the between-subjects variable ‘collres’.
FIGURE 12.5 Repeated Measures: Profile Plots dialog with collres∗section shown

FIGURE 12.6 Repeated Measures: Post Hoc Multiple Comparisons for Observed Means dialog

FIGURE 12.7 Repeated Measures: Options dialog

Click on the Continue button to return to the Repeated Measures dialog,
then click the Options button to call up the dialog shown in Figure 12.7.

In the Repeated Measures: Options dialog, move ‘section’ from the
‘Factor(s) and Factor Interactions:’ pane to the ‘Display Means for:’ pane.

Tick the Compare main effects checkbox, and choose Bonferroni in the
Confidence Interval Adjustment drop-down menu.

Tick the Descriptive statistics, Estimates of effect size, and Homogeneity tests
checkboxes.

Click on the Continue button to return to the Repeated Measures dialog
and then click the OK button to run the analysis.
Note: The reason for not choosing ‘collres’ and ‘collres∗section’ in the
Repeated Measures: Options dialog is that a post hoc comparison for the
within-subjects variable can be performed for ‘section’. The different levels
of ‘collres’ will be compared using a post hoc test (see Figure 12.6). There is
no comparison for the interaction term.

In the following SPSS output, it can be seen that SPSS assigns codes to the levels of
the within-subjects factor (section) as shown in Table 12.2. The between-subject
factor (collapsed residence) is shown in Table 12.3, and Table 12.4 presents the
descriptive statistics.
According to the descriptive statistics, the scores on all sections increased
with length of residence. Such increases were much more drastic for the routines
section than for the other test sections. Length of residence might benefit the
knowledge of routine formulae more than it benefits other aspects of pragmatic
knowledge, which is in line with previous research (e.g., Roever, 2005). On the
basis of the descriptive statistics, a main effect for length of residence, and an inter-
action between residence and section type may be expected.

TABLE 12.2 The within-subjects factors

Section Dependent variable

1 implicature
2 routines
3 speechacts

TABLE 12.3 The between-subjects factors

Value label N

collapsed residence 0 no residence 73
1 up to 1 year residence 33
2 more than 1 year residence 23

TABLE 12.4 Descriptive statistics

Collapsed residence Mean Std. deviation N

implicature score no residence 62.6712 25.68917 73
up to 1 year residence 75.5510 17.75993 33
more than 1 year residence 79.1502 17.26613 23
Total 68.9042 23.53117 129
routines score no residence 51.2557 18.09138 73
up to 1 year residence 75.5051 15.58180 33
more than 1 year residence 89.8551 12.55204 23
Total 64.3411 22.77362 129
speech acts score no residence 55.3124 29.46724 73
up to 1 year residence 70.7334 21.27750 33
more than 1 year residence 76.0267 22.87020 23
Total 62.9505 27.76912 129

TABLE 12.5 Mauchly’s Test of Sphericity

Within-subjects effect: section
Mauchly’s W = .988, Approx. chi-square = 1.468, df = 2, Sig. = .480
Epsilon: Greenhouse-Geisser = .988, Huynh-Feldt = 1.000, Lower-bound = .500

For the purpose of this chapter, the multivariate test table will be ignored.
According to Table 12.5, Mauchly’s test was nonsignificant ( p > 0.05), so the
sphericity assumption can be taken to hold. The tests of within-subjects effect in
Table 12.6 indicate a significant effect of the section variable (F(2,252) = 3.117,
p < 0.05, partial η2 = 0.024), a small effect size. On this basis, the type of test
section has a significant but small impact on performance (regardless of residence).
In Table 12.6, the interaction effect was statistically significant (F(4,252) = 4.869,
p < 0.01, partial η2 = 0.072, small effect size). It can be inferred that the impact of
the residence variable was stronger on some sections than on others. Table 12.7
shows the result of Levene’s test, and Table 12.8 displays the test of between-
subjects effects.
In Table 12.7, the results of Levene’s test indicate that the homogeneity of
variance condition was not met for the implicature and speech act sections
( p < 0.05). Since this concerns the within-subjects variable (rather than the
between-subjects variable), SPSS does not offer specific post hoc tests to correct
for this violation of assumptions, so the result of comparisons between sections
needs to be interpreted with caution. There is no option to run a Levene’s test for
the between-subjects variable.

TABLE 12.6 Results from tests of within-subjects effects

Source / Type III sum of squares / df / Mean square / F / Sig. / Partial eta squared

section
Sphericity Assumed 1,700.073 2 850.037 3.117 .046 .024
Greenhouse-Geisser 1,700.073 1.977 859.958 3.117 .047 .024
Huynh-Feldt 1,700.073 2.000 850.037 3.117 .046 .024
Lower bound 1,700.073 1.000 1,700.073 3.117 .080 .024

section∗collres
Sphericity Assumed 5,311.680 4 1,327.920 4.869 .001 .072
Greenhouse-Geisser 5,311.680 3.954 1,343.419 4.869 .001 .072
Huynh-Feldt 5,311.680 4.000 1,327.920 4.869 .001 .072
Lower bound 5,311.680 2.000 2,655.840 4.869 .009 .072

Error (section)
Sphericity Assumed 68,727.745 252 272.729
Greenhouse-Geisser 68,727.745 249.093 275.912
Huynh-Feldt 68,727.745 252.000 272.729
Lower bound 68,727.745 126.000 545.458

TABLE 12.7 Levene’s test

F df1 df2 Sig.

implicature score 8.350 2 126 .000
routines score 2.036 2 126 .135
speech acts score 5.341 2 126 .006

TABLE 12.8 The between-subjects effects

Source / Type III sum of squares / df / Mean square / F / Sig. / Partial eta squared

Intercept 1,541,585.230 1 1,541,585.230 1,635.652 .000 .928
collres 43,172.048 2 21,586.024 22.903 .000 .267
Error 118,753.721 126 942.490

The test of between-subjects effects as presented in Table 12.8 is the output
of a regular ANOVA. It was found that residence had a significant main effect
on scores (F(2,126) = 22.903, p < 0.001, partial η2 = 0.267, medium effect size).
This main effect was the impact of residence on the scores from all the sections
combined.
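As a cross-check on the SPSS output, the partial eta squared values in Tables 12.6 and 12.8 can be reproduced by hand from the reported sums of squares. The following is a minimal Python sketch of that computation, not part of the SPSS procedure:

```python
# Partial eta squared from SPSS sums of squares:
# partial eta^2 = SS_effect / (SS_effect + SS_error)
def partial_eta_squared(ss_effect, ss_error):
    return ss_effect / (ss_effect + ss_error)

# Values taken from Tables 12.6 and 12.8:
eta_section = partial_eta_squared(1700.073, 68727.745)       # about .024
eta_interaction = partial_eta_squared(5311.680, 68727.745)   # about .072
eta_residence = partial_eta_squared(43172.048, 118753.721)   # about .267
```

Each result matches the ‘Partial eta squared’ column in the corresponding SPSS table.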

Table 12.9 presents the descriptive statistics for the between-subjects variable
(residence). The information in this table is similar to that shown in Table 12.4,
but with a 95% confidence interval. Table 12.10 presents the pairwise comparison
outcomes. Table 12.11 presents the results of the univariate tests.

TABLE 12.9 Descriptive statistics for ‘residence’

Collapsed residence Mean Std. error 95% confidence interval

Lower bound Upper bound

No residence 56.413 2.075 52.308 60.519
Up to 1 year residence 73.930 3.085 67.824 80.036
More than 1 year residence 81.677 3.696 74.363 88.991

TABLE 12.10 Pairwise comparisons on collapsed residence

(I) Collapsed residence / (J) Collapsed residence / Mean difference (I-J) / Std. error / Sig.b / 95% confidence interval for differenceb (Lower bound, Upper bound)

No residence vs. up to 1 year residence: –17.517∗, 3.718, .000, (–26.538, –8.495)
No residence vs. more than 1 year residence: –25.264∗, 4.238, .000, (–35.548, –14.981)
Up to 1 year residence vs. no residence: 17.517∗, 3.718, .000, (8.495, 26.538)
Up to 1 year residence vs. more than 1 year residence: –7.748, 4.814, .330, (–19.429, 3.934)
More than 1 year residence vs. no residence: 25.264∗, 4.238, .000, (14.981, 35.548)
More than 1 year residence vs. up to 1 year residence: 7.748, 4.814, .330, (–3.934, 19.429)

Based on estimated marginal means
∗ The mean difference is significant at the .05 level.
b Adjustment for multiple comparisons: Bonferroni.

TABLE 12.11 Univariate tests

Sum of squares / df / Mean square / F / Sig. / Partial eta squared

Contrast 14,390.683 2 7,195.341 22.903 .000 .267
Error 39,584.574 126 314.163

The F tests the effect of collapsed residence. This test is based on the linearly independent pairwise
comparisons among the estimated marginal means.

TABLE 12.12 Post hoc test

(I) Collapsed residence / (J) Collapsed residence / Mean difference (I-J) / Std. error / Sig. / 95% confidence interval (Lower bound, Upper bound)

Scheffé
No residence vs. up to 1 year residence: –17.5167∗, 3.71802, .000, (–26.7268, –8.3067)
No residence vs. more than 1 year residence: –25.2642∗, 4.23826, .000, (–35.7630, –14.7655)
Up to 1 year residence vs. no residence: 17.5167∗, 3.71802, .000, (8.3067, 26.7268)
Up to 1 year residence vs. more than 1 year residence: –7.7475, 4.81450, .278, (–19.6737, 4.1786)
More than 1 year residence vs. no residence: 25.2642∗, 4.23826, .000, (14.7655, 35.7630)
More than 1 year residence vs. up to 1 year residence: 7.7475, 4.81450, .278, (–4.1786, 19.6737)

Tamhane
No residence vs. up to 1 year residence: –17.5167∗, 3.18531, .000, (–25.2561, –9.7773)
No residence vs. more than 1 year residence: –25.2642∗, 3.87957, .000, (–34.8378, –15.6906)
Up to 1 year residence vs. no residence: 17.5167∗, 3.18531, .000, (9.7773, 25.2561)
Up to 1 year residence vs. more than 1 year residence: –7.7475, 3.69996, .122, (–16.9542, 1.4592)
More than 1 year residence vs. no residence: 25.2642∗, 3.87957, .000, (15.6906, 34.8378)
More than 1 year residence vs. up to 1 year residence: 7.7475, 3.69996, .122, (–1.4592, 16.9542)

Based on observed means.
The error term is Mean Square(Error) = 314.163.
∗ The mean difference is significant at the .05 level.

According to Table 12.10, statistically significant differences were found between
the no-residence group and each of the residence groups. The two groups with
residence were not significantly different from each other. Given the
outcome of Levene’s test, the Tamhane T2 test should be used for the implicature
and speech act sections, and the Scheffé post hoc test should be used for the
routines section (see Table 12.12).
Tables 12.13 and 12.14 present the statistical results for the within-subjects
variable (section). According to Table 12.14, there was no statistically significant
difference among the three test sections (p > 0.05). This may be surprising given

TABLE 12.13 Descriptive statistics for sections

Section Mean Std. Error 95% confidence interval

Lower bound Upper bound

1 72.457 2.225 68.055 76.860
2 72.205 1.638 68.963 75.448
3 67.358 2.613 62.186 72.529

TABLE 12.14 Pairwise comparisons on test sections

(I) Section ( J) Section Mean Std. Error Sig.a 95% confidence interval for
difference differencea
(I-J)
Lower bound Upper bound

1 2 .252 2.204 1.000 –5.095 5.599
1 3 5.100 2.283 .082 –.439 10.639
2 1 –.252 2.204 1.000 –5.599 5.095
2 3 4.848 2.416 .141 –1.014 10.710
3 1 –5.100 2.283 .082 –10.639 .439
3 2 –4.848 2.416 .141 –10.710 1.014

Based on estimated marginal means
a Adjustment for multiple comparisons: Bonferroni

that the ANOVA result for sections was significant. The reason for this is that
the post hoc test (Bonferroni) is conservative, which means the 0.05 criterion is
harder to meet. Also, the violation of the assumption of equal error variances (as
indicated by Levene’s test in Table 12.7) may have made this result less stable.
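The Bonferroni adjustment itself is simple to sketch: each pairwise p-value is multiplied by the number of comparisons and capped at 1.0, which is why the 0.05 criterion becomes harder to meet. The following is a minimal Python illustration; the p-values are hypothetical, not taken from the SPSS output:

```python
# A minimal sketch of the Bonferroni adjustment for multiple comparisons.
# Each p-value is multiplied by the number of comparisons, capped at 1.0.
def bonferroni(p_values):
    m = len(p_values)                         # number of pairwise comparisons
    return [min(p * m, 1.0) for p in p_values]

# Hypothetical uncorrected p-values for three pairwise comparisons:
adjusted = bonferroni([0.028, 0.047, 0.40])
# 0.028 alone meets the 0.05 criterion, but 0.028 * 3 = 0.084 no longer does,
# which is how a significant omnibus test can fail to yield any significant
# pairwise comparisons.
```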
The results of a mixed-design ANOVA show the interaction between the
variables. Figure 12.8 presents the plots for the mean scores for each residence
group in each of the three sections. The graphs for sections 1 and 3 (implicature
and speech acts) are similar to one another: both rose sharply from no
residence to up to one year’s residence, then rose less sharply for residence of more
than one year. This means that some residence might increase implicature and
speech act knowledge strongly, but that an extended residence might not further
influence such knowledge to the same degree. The routines score, by contrast, not
only rose more steeply than the other scores between no residence and up to
one year’s residence, but kept rising at a noticeable rate after
one year. On this basis, it can be inferred that residence had a strong effect on
routines scores within the first year, and continued to have a strong effect after
the first year.

FIGURE 12.8 Estimated marginal means of MEASURE_1 (profile plot: estimated marginal means, from 50.00 to 90.00, plotted against collapsed residence—no residence, up to 1 year residence, more than 1 year residence—with separate lines for sections 1, 2, and 3)

On the basis of the two-way mixed-design ANOVA, the results may be written
up as follows.

A two-way mixed-design ANOVA was run with the TEP test section as
the within-subjects variable, and length of residence (none, up to one year,
more than one year) as the between-subjects variable. There was a signifi-
cant main effect for section with a small effect size (F(2,252) = 3.117, p <
0.05, partial η2 = 0.024). The main effect for length of residence was also
statistically significant and had a medium effect size (F(2,126) = 22.903,
p < 0.001, partial η2 = 0.267). The interaction term was also significant
(F(4,252) = 4.869, p < 0.001, partial η2 = 0.072), and a profile plot indi-
cated that length of residence led to much stronger increases in the routines
score than in the implicature or speech act scores.

Summary
The two-way mixed-design ANOVA is used to examine mean differences between
several independent groups with several repeated measures. A two-way mixed
design needs at least one between-subjects variable and one within-subjects variable.
A two-way mixed-design ANOVA can be used to investigate the interaction effect
between the between-subjects and within-subjects variables. This chapter completes
the presentation of inferential statistics for group differences. The next three chap-
ters return to the relationships among variables. Chapter 13 presents the chi-square
test, which is a nonparametric test for examining relationships between categorical
variables.

Review Exercises
To download review questions and exercises for this chapter, visit the Companion
Website: www.routledge.com/cw/roever.
13
CHI-SQUARE TEST

Introduction
The chi-square test (also written χ2 test) is a type of inferential statistic for ana-
lyzing frequency counts of nominal data. It is used to determine whether the
counts of two nominal variables are associated with each other. It is useful when
questions about relationships among variables cannot be answered by means of a
correlational analysis, such as Pearson’s correlation coefficient or Spearman’s rho, or
by means of a comparative analysis, such as a t-test or ANOVA. The following are
some examples of questions that have been answered by using the chi-square test:

• Liu (2011) investigated whether the occurrence of phrasal verbs relates to
register. Specifically, does English used in spoken interaction, fiction writing,
magazines, newspapers, and academic writing differ in terms of the frequency
of the use of phrasal verbs?
• Bell (2012) asked whether playful engagement with language facilitates the
learning of meaning and form as compared to non-playful engagement.
• Laufer and Waldman (2011) sought to address the question of whether native
speaker status relates to the production of verb-noun collocations, and whether
learners’ different levels of proficiency relate to their production of deviant
collocations.

In this chapter, two types of the chi-square test are presented: one-dimensional
and two-dimensional.

The One-Dimensional Chi-Square Test


This test is used to compare the observed frequencies of a single variable with
expected frequencies. For example, as part of a larger study, Liu (2011) investigated

TABLE 13.1 Frequency of phrasal verb use in five registers (adapted from Liu, 2011, p. 674)

Spoken Fiction Magazine Newspaper Academic Total

Phrasal verbs pmw∗ 5,219 6,006 3,028 2,949 1,239 18,441

∗ Standardized as phrasal verbs per million words; pmw = per million words

whether phrasal verb use was related to register. Liu searched corpora for phrasal
verbs used in different registers and then computed frequency counts. She then
used the chi-square test to check whether more phrasal verbs occur in certain
registers than in others, or whether the use of phrasal verbs was independent of
register.
The null hypothesis for this study would be that phrasal verb use is not depen-
dent on register, whereas the alternative hypothesis would be that phrasal verb use
is associated with register. This relationship cannot be investigated using a Pear-
son or Spearman correlation; the variable ‘register’ (i.e., spoken, fiction, magazine,
newspaper, and academic writing) was the only variable under investigation, so
there was no other variable to run a correlation with. All Liu could do was to
compare the frequencies of occurrence of phrasal verbs for the five levels of the
variable ‘register’. These frequencies are shown in Table 13.1.
According to Table 13.1, phrasal verbs occurred most frequently in fiction,
with spoken language second, then magazines and newspapers, and finally aca-
demic writing. From these figures, it is unclear if the differences in frequency
were statistically significant. Due to sampling error or random fluctuations, some
differences are to be expected, so it is necessary to test whether there is a genuine
relationship between register and the frequency of use of phrasal verbs or not.

Steps for Running the One-Dimensional Chi-Square Test


To answer this type of question, a one-dimensional chi-square test can be per-
formed. It is called one-dimensional because there is only one variable under
analysis. In Liu’s (2011) study, this variable is register, and the frequency of the use
of phrasal verbs is counted at each of its levels. The one-dimensional chi-square
test assumes the null hypothesis for this
question, which is ‘there is no relationship between the frequency of use of
phrasal verbs and register’. Under the null hypothesis, each register would show
an equal number of phrasal verbs (per million words). The first step in using the
chi-square test is to calculate this number. Using the data in Table 13.1, it can be
calculated to be 18,441 ÷ 5 = 3,688. This value is called the ‘expected count’ in
the chi-square test. It should be noted that the expected count is not normally
the researcher’s expectation; in this case, it is the average number of occurrences
across all registers.

TABLE 13.2 Chi-square observed and expected counts and residuals

Spoken Fiction Magazine Newspaper Academic Total

Observed phrasal verbs pmw 5,219 6,006 3,028 2,949 1,239 18,441
Expected phrasal verbs pmw 3,688 3,688 3,688 3,688 3,688
Residual 1,530 2,318 –660 –739 –2,449
Residual in % 42 63 –18 –20 –66

pmw = per million words

In the second step, the chi-square test is used to compare each observed count (i.e.,
the actual count) for each level of the variable with the corresponding expected
count, and to compute the difference, which is called the residual. The observed
counts, expected counts, and residuals for each level are shown in Table 13.2.
Table 13.2 shows that for spoken language, 5,219 phrasal verbs per million
words were observed (from the spoken corpus), whereas the expected number is
3,688 phrasal verbs per million words. The difference between the observed count
and the expected count is 1,530.36, and this residual amounts to 41.49% of
the expected count. Residuals can be computed similarly for the other registers. The
degrees of freedom for the chi-square test in the case of Liu’s (2011) study are ‘the
number of variable levels – 1’ = 5 – 1 = 4 (see Chapter 6).
freedom, the chi-square test uses the residuals and expected frequencies to arrive
at a chi-square value. It was found that this value was statistically significant (χ2(4)
= 3984, p < 0.0001).
To investigate the strength of the association between register and phrasal verb
use, the researcher needed to calculate the phi coefficient (ϕ). This is required as the
chi-square value itself does not express the strength of the association. Phi is calcu-
lated using the chi-square value through the following formula:

Phi = √ (χ2 ÷ N).

In this example, phi was calculated as follows:

Phi = √ (3983.65 ÷ 18,440.96) = 0.465 (or ≈ 0.47).

Based on Cohen (1988), phi can be considered small at about 0.10, medium at
about 0.30, and large at about 0.50, so this phi coefficient indicates a medium-to-strong
effect size. Overall, it can be concluded that there was a significant, medium-to-strong
association between register and the frequency of use of phrasal verbs.
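The whole one-dimensional procedure can be sketched in a few lines of Python. This is a minimal illustration of the hand computations described above (not the SPSS implementation), using the counts from Table 13.1:

```python
import math

# One-dimensional chi-square for the phrasal-verb counts (per million words)
# in Table 13.1, following the steps described in the text.
observed = {"spoken": 5219, "fiction": 6006, "magazine": 3028,
            "newspaper": 2949, "academic": 1239}

total = sum(observed.values())          # 18,441
expected = total / len(observed)        # equal expected count under the null

residuals = {register: count - expected for register, count in observed.items()}
chi_square = sum(r ** 2 / expected for r in residuals.values())   # about 3,984
df = len(observed) - 1                  # number of variable levels - 1 = 4

phi = math.sqrt(chi_square / total)     # effect size: about 0.47
```

The chi-square value and phi agree (within rounding) with those reported in the text.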
One useful feature of the one-dimensional chi-square test is that the residuals allow
researchers to see which categories contribute most to the final chi-square value. In
Liu’s (2011) study, the academic and fictional registers deviated most

from their expected values, though in opposite directions: academic language con-
tained fewer phrasal verbs than expected, while fictional language contained more.
Spoken language contained the second highest number of phrasal verbs, and the
actual count was higher than expected. Newspapers and magazines contained fewer
phrasal verbs than expected. On this basis, it can be concluded that students learning to
write academic texts should be cautioned against the overuse of phrasal verbs. How-
ever, students learning to write fictional texts should be encouraged to use phrasal verbs,
and students writing journalistic texts should not be advised to avoid phrasal verbs,
but to use them sparingly. The use of phrasal verbs in speaking classes should be
promoted since they are common in the spoken register. The one-dimensional chi-
square test can be used as a procedure in its own right, as in Liu’s study, but it is more
commonly used to evaluate the goodness of fit of a statistical model.

The Two-Dimensional Chi-Square Test


The two-dimensional chi-square test is used to test the association between two
variables, shown as the row and column variables in a two-dimensional table. For
example, Bell (2012) investigated whether adult learners’ recall of language-related
episodes (LREs) differed according to whether the metalinguistic discussion in the
LREs was conducted seriously or playfully. Bell (2012) used two variables, each
with two levels, as follows:

• recall, which could be correct or incorrect, and
• language-related episodes, which could be serious or playful.

The null hypothesis was that there is no association between the accuracy of
recall and the type of LRE. According to this hypothesis, playful and serious LREs
are recalled with similar levels of accuracy. The alternative hypothesis would claim
that there is a relationship between the accuracy of recall and the type of LRE.

Steps for Running the Two-Dimensional Chi-Square Test


In this example, the first step is to count how many regular LREs were recalled
correctly and incorrectly, and how many playful LREs were recalled correctly and
incorrectly. This leads to a cross-tabulation (which is known as a contingency table
in quantitative research). This cross-tabulation is presented in Table 13.3.

TABLE 13.3 Frequency counts of language-related episodes (LREs) by accuracy of recall


(adapted from Table 5 in Bell, 2012, p. 258)

Serious LREs Playful LREs

Correct 41 18
Incorrect 82 16

Table 13.3 shows that there were 41 serious LREs that were recalled correctly,
and 82 serious LREs that were recalled incorrectly. For the playful LREs, 18 were
recalled correctly and 16 were recalled incorrectly. Based on Table 13.3, it can be
argued that the playful LREs led to more accurate recall. Around twice as many
serious, non-playful LREs were recalled incorrectly as were recalled correctly;
fewer than half of the playful LREs were recalled incorrectly. There seems to be a
tendency for playful LREs to be recalled correctly more frequently than non-playful
LREs. To find out whether this observation is statistically significant, the
two-dimensional chi-square test needs to be performed.
The two-dimensional chi-square test follows the same principles as the one-
dimensional chi-square test. First, the totals for each row and column (the
marginal totals) are calculated. Then, for each cell, the product of the marginal
totals for that cell’s row and column is computed and divided by the overall total
to obtain the expected value. So, for example, the expected frequency
for correctly recalled serious LREs was 59 × 123 ÷ 157 = 46.22. Table 13.4 shows the
marginal totals, expected frequencies, and percentage residuals.
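The expected-frequency step can be sketched as follows. This is a minimal Python illustration of the ‘row total × column total ÷ overall total’ rule applied to Bell’s (2012) counts:

```python
# Expected cell frequencies for Bell's (2012) 2 x 2 contingency table:
# expected = row total * column total / overall total
table = [[41, 18],   # correct recall: serious LREs, playful LREs
         [82, 16]]   # incorrect recall

row_totals = [sum(row) for row in table]          # marginal totals: [59, 98]
col_totals = [sum(col) for col in zip(*table)]    # marginal totals: [123, 34]
n = sum(row_totals)                               # overall total: 157

expected = [[r * c / n for c in col_totals] for r in row_totals]
# expected[0][0] = 59 * 123 / 157, about 46.22, as computed in the text
```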
The data indicate that more playful LREs were recalled correctly than expected
and fewer were recalled incorrectly. Fewer serious LREs than expected were
recalled correctly, and more than expected were recalled incorrectly. The two-
way chi-square test can now be used to investigate whether these differences were
statistically significant.
Using the residuals, the chi-square test computes the chi-square value, which in
this case was χ2(1) = 4.37, p = 0.037. It should be noted that this value functions
similarly to the F-value in that it is compared with a critical values table and is not
an effect size.
When a two-dimensional chi-square test that uses a 2 × 2 table such as this one
(‘accurate/inaccurate’ by ‘serious/playful’) is performed, it is common to apply a
correction to the chi-square value, known as the Yates correction (Furr, 2010), which
is done to ensure that the chi-square value is not overestimated. Once the Yates

TABLE 13.4 Marginal totals, expected frequencies, and residuals for recall by type of LREs

Serious LREs Playful LREs Marginal total

Correct 41 18 59
Expected 46 13
Residual % –10% +37%
Incorrect 82 16 98
Expected 77 21
Residual % +6% –22%
Total 123 34 157

correction was applied in the earlier example (which SPSS will do automatically
with a 2 × 2 table), it was found that the chi-square test was nonsignificant: χ2(1)
= 3.57, p = 0.059, n.s. (nonsignificant). So after the Yates correction was applied, it
can no longer be claimed that playful LREs were significantly more likely to facili-
tate correct recall, but only that there appears to be a tendency for playful LREs
to facilitate correct recall. The Yates correction makes it more difficult to attain a
significant result in a 2 × 2 table. This issue is, however, controversial. In a widely
cited paper, Haviland (1990) argued against it, but Greenhouse (1990) and Mantel
(1990)—two long-standing proponents of the Yates correction—defended it. Furr
(2010) summarizes the debate by saying that there are mixed recommendations,
and that there is no clear consensus as to the appropriate use of the Yates correction.
Since it is still widely used, applying the Yates correction to 2 × 2 tables is recom-
mended, but researchers should also provide the uncorrected value. Should the
uncorrected value be significant but the corrected one not, authors could make an
explicit case for applying the Yates correction or not.
Again, the chi-square value does not tell researchers how strongly the variables
are related; to find this out, phi needs to be calculated, and this is done in the same
way as for the one-dimensional chi-square.
Without the Yates correction, the calculation looks as follows:

Phi = √ (4.37 ÷ 157) = 0.1668 (or ≈ 0.17).

With the Yates correction, phi was found to be slightly lower:

Phi = √ (3.57 ÷ 157) = 0.1508 (or ≈ 0.15).

Cohen (1988) classifies these values as indicating small effects. So it can be
inferred that even if there was an association between the type of language-
related episode and recall, it was not strong. Playful language-related episodes
related episode and recall, it was not strong. Playful language-related episodes
appear to help language learning, but they probably do not help enough to base
an instructional program on them and to encourage students to be playful about
metalinguistic reflection.
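The computations in this section can be sketched in Python. The following minimal illustration (not SPSS’s implementation) reproduces the uncorrected and Yates-corrected chi-square values and the corresponding phi coefficients for Bell’s (2012) table:

```python
import math

# Two-dimensional chi-square for a 2 x 2 table, with and without the
# Yates correction, plus phi, following the formulas in the text.
table = [[41, 18],   # correct recall: serious LREs, playful LREs
         [82, 16]]   # incorrect recall

def chi_square(table, yates=False):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:                     # subtract 0.5 from each |O - E|
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    return stat

n = sum(sum(row) for row in table)              # 157
uncorrected = chi_square(table)                 # about 4.37
corrected = chi_square(table, yates=True)       # about 3.57
phi_uncorrected = math.sqrt(uncorrected / n)    # about 0.17
phi_corrected = math.sqrt(corrected / n)        # about 0.15
```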

The Two-Dimensional Chi-Square Test: A Larger Table


When the chi-square test is used to investigate the association between two vari-
ables, these variables can have more than two levels each, potentially leading to
large chi-square tables. This was the case in Laufer and Waldman’s (2011) study,
which investigated whether L1 Hebrew-speaking learners of English differed in
their use of verb-noun collocations by proficiency level, and from native English

speakers. The researchers divided the proficiency level variable into four levels
(basic, intermediate, advanced, and native speaker) and the use of collocation vari-
able into two levels (collocation/non-collocation). They then used the chi-square
test to check whether there was an association between proficiency level and the
use of collocations.
The null hypothesis for this study would claim that proficiency is not related
to correct collocation use, whereas the alternative hypothesis would claim that
proficiency is associated with the correct use of collocations. The hypotheses
make no claim about whether higher proficiency leads to a higher level of cor-
rect collocation use.
As a first step, Laufer and Waldman (2011) cross-tabulated the frequencies of
collocations and non-collocations for each proficiency level, as shown in Table 13.5
(adapted from Table 2 in Laufer & Waldman, 2011, p. 660).
Due to the large numbers and differences between groups in the table, it
was difficult to know at a glance whether there was an association between the
variables. So the chi-square test used this 2 × 4 table to compute the expected
frequencies and residuals using the marginal totals, as illustrated in Table 13.6.
The calculation produced χ2(3) = 264.18, p < 0.0001, which is statistically sig-
nificant with three degrees of freedom. The effect of proficiency on the use of
collocations was that the native speakers used more collocations than expected,
whereas each of the learner groups used fewer. To find out how

TABLE 13.5 Collocation use by proficiency level (adapted from Table 2 in Laufer &
Waldman, 2011, p. 660)

NS Advanced Intermediate Basic Total

Collocations 2,527 852 162 68 3,609
Non-collocations 22,242 12,953 2,895 1,465 39,555

TABLE 13.6 Marginal totals, expected frequencies, and residuals for collocation type and
proficiency level (adapted from Table 2 in Laufer & Waldman, 2011, p. 660)

NS Advanced Intermediate Basic Total

Collocations 2,527 852 162 68 3,609
Expected 2,071 1,154 256 128
Residual % +22% –26% –37% –47%
Non-collocations 22,242 12,953 2,895 1,465 39,555
Expected 22,698 12,651 2,801 1,405
Residual % –2% +2% +3% +4%
Total 24,769 13,805 3,057 1,533 43,164

strong the effect is, Cramer’s V (also called ‘Cramer’s Phi’) can be calculated, which
is commonly used when the contingency table under consideration is larger than
2 × 2.
Cramer’s V is calculated in a similar manner to phi:

Φc = √ (χ2 ÷ (N × the smaller of (the number of rows – 1) and (the number of
columns – 1)))

In this example, Cramer’s V was:

Φc = √ (264.18 ÷ (43,164 × 1)) = 0.078

This low result shows a weak effect size. That is, while the chi-square test result
was statistically significant, in reality, the effect is small. On the basis of this finding,
it can be concluded that native speakers use more verb-noun collocations than L2
learners, but that the difference is minimal.
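The chi-square value and Cramer’s V for this larger table can be sketched in the same way. A minimal Python illustration using the counts from Table 13.5:

```python
import math

# Chi-square and Cramer's V for the 2 x 4 collocation table in Table 13.5,
# following the formula in the text.
table = [[2527, 852, 162, 68],          # collocations: NS, advanced, intermediate, basic
         [22242, 12953, 2895, 1465]]    # non-collocations

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)                     # 43,164

chi_square = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (observed - expected) ** 2 / expected   # totals about 264.18

k = min(len(table), len(table[0])) - 1  # the smaller of (rows - 1) and (columns - 1)
cramers_v = math.sqrt(chi_square / (n * k))   # about 0.078
```

Both values agree (within rounding) with those reported in the text.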

Assumptions of the Chi-Square Test


The chi-square test is a versatile procedure because it requires frequency counts
for nominal data only. However, it is not a precise test, so the calculation of a para-
metric statistic would be preferable. While it is flexible, the chi-square test also has
some conditions that need to be met:

1. The data must be nominal, and consist only of frequency counts.

The chi-square test is used for nominal data in which each nominal variable has
several levels. For each of the levels of a variable, there must be a frequency count.

2. Whenever the frequency of occurrence of an event is counted, the frequency
of nonoccurrence must also be counted.

This condition is related to the fact that the chi-square test evaluates propor-
tions by using marginal totals. If gender is one of the variables, both male and
female learners need to be included, because they make up the total for gender. If
the occurrence of the third person singular -s in learner texts is being examined,
all the cases where the third person singular -s was correctly used and all the
cases where it was either incorrectly provided or missing need to be included.
Whether this should be considered a three-level variable (i.e., used correctly, used
incorrectly, missing) or a two-level variable (i.e., used correctly, used incorrectly)
depends on the research question under investigation.

3. The cells must be mutually exclusive; that is, the same participant cannot be
in more than one cell.

The chi-square test is not appropriate for before-after comparisons, or any
cross-tabulations in which the same person occurs in two cells of the table. There
is an alternative test called McNemar’s test that may be used if this is the case, but it
is rarely used and is not covered in this book.

4. The sample size must be large enough.

The conclusions that can be drawn from the chi-square test will be weaker if
a small sample size is used. For example, if the number of participants is low (say
30), and the participants are subdivided into several groups (e.g., low beginners,
mid beginners, high beginners, low intermediate, mid intermediate, high inter-
mediate, low advanced, mid advanced, and high advanced), there will be very few
or no people in some of the cells. The convention is that the expected (not the
actual!) count for each cell should be at least five, and SPSS will provide a warn-
ing if that is not the case. One solution to the issue of a small sample size may be
to collapse categories. So instead of low/mid/high beginners, the single category
‘beginners’ could be used. However, the rationale employed in collapsing catego-
ries must be carefully considered as a too-broad category, such as ‘miscellaneous’,
is not useful for research purposes.
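The 'expected count of at least five' convention is easy to check by hand, because each expected count is simply the product of a cell's row and column totals divided by the grand total. A minimal sketch in Python (the function name is ours, not part of SPSS):

```python
def expected_counts(observed):
    # Expected count for each cell under independence:
    # (row total x column total) / grand total
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

# Observed counts from Table 13.6 (collocations / non-collocations)
observed = [[2527, 852, 162, 68],
            [22242, 12953, 2895, 1465]]
expected = expected_counts(observed)
# expected[0][0] is about 2,071, matching the 'Expected' row of the table
```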

How to Perform a Chi-Square Test in SPSS


SPSS performs the chi-square test on raw data only. If there is a preexisting cross-
tabulation table that needs to be checked, it cannot be done in SPSS. In such a
case, Microsoft Excel could be used; however, it is cumbersome to use and does
not give a chi-square value, but only the p-value of the chi-square. Alternatively, a
web-based statistics calculator could be used, for example, the one found at the
VassarStats website (http://vassarstats.net). The use of both SPSS and VassarStats for the
chi-square test will be illustrated in this chapter.
To illustrate how to perform the chi-square test in SPSS, the file Ch13TEP.sav
is used (downloadable from the Companion Website of this book). In the TEP
data set, the research question is whether duration of residence is related to gender.
In other words, are male or female language learners more likely to have spent
time in an English-speaking country? Gender is a nominal variable with two levels
(male/female), whereas residence is an interval variable with a potentially endless
number of possible levels (0 months, 1 month, 2 months . . .). For this reason, the
variable residence was collapsed into just two levels (residence/no residence). The
chi-square table is, therefore, two-dimensional.

SPSS Instructions: Chi-Square Test

Click Analyze, next Descriptive Statistics, and then Crosstabs (see
Figure 13.1).

FIGURE 13.1 Accessing the SPSS menu to launch the two-dimensional chi-square test

FIGURE 13.2 Crosstabs dialog

On selecting Crosstabs, a new dialog opens that allows the selection
of variables (see Figure 13.2). In this Crosstabs dialog, move 'Gender'
to the 'Row(s)' pane and 'Collapsed residence (yes/no)' to the
'Column(s)' pane.

After selecting the variables, click the Statistics button. In the Cross-
tab: Statistics dialog that appears, tick the Chi-square and Phi and
Cramer’s V checkboxes (see Figure 13.3).

Click on the Continue button to return to the Crosstabs dialog (see
Figure 13.2) and click Cells to call up the Crosstabs: Cell Display
dialog.

Tick the Observed, Expected, and the Unstandardized checkboxes to
display expected frequencies and deviations from expectations (see
Figure 13.4).

FIGURE 13.3 Crosstabs: Statistics settings

FIGURE 13.4 Crosstabs: Cell Display dialog

Click on the Continue button to return to the Crosstabs dialog (see
Figure 13.2) and then click the OK button.

TABLE 13.7 SPSS summary of the two-dimensional chi-square analysis

                                            Cases
                            Valid             Missing           Total
                            N      Percent    N      Percent    N      Percent
gender ∗ collapsed          122    73.50%     44     26.50%     166    100.00%
residence (yes/no)

TABLE 13.8 Cross-tabulation output based on gender and collapsed residence

Collapsed residence (yes/no) Total

No residence Residence

Gender Male Count 28.00 13.00 41.00


Expected Count 22.20 18.80 41.00
Residual 5.80 –5.80
Female Count 38.00 43.00 81.00
Expected Count 43.80 37.20 81.00
Residual –5.80 5.80
Total Count 66.00 56.00 122.00
Expected Count 66.00 56.00 122.00

Table 13.7 presents the output of the two-dimensional chi-square analysis.


Table 13.7 shows that 122 out of 166 participants had values on both variables;
in addition, there are missing data for 44 participants. Table 13.8 presents cross-
tabulation output based on gender and collapsed residence. Table 13.8 shows that
the chi-square test expected 22.20 male learners to have had no residence but found
28 to have had none. It expected 18.80 male learners to have had residence, but found
that only 13 had. It can be concluded that male learners were less likely to have had
residence than expected. For female learners, the chi-square test expected 43.80 not to
have had residence but found only 38 had had none, and it expected 37.20 to have had
residence, but found that 43 had had some. It can be concluded that female learners
were more likely to have had residence than expected.
According to Table 13.8 alone, it appears that there was an association between
gender and residence. That is, male learners were less likely to have had residence,
and female learners were more likely to have had residence. The chi-square test
will be used to investigate whether this association was significant. Table 13.9
presents the results of the chi-square test.
SPSS produces Yates correction outputs as ‘Continuity Corrections’. The result
with the Continuity Correction was a chi-square value of χ2(1) = 4.186, p = 0.041.
This was a statistically significant result and it can be concluded that gender influenced

TABLE 13.9 Outputs of the two-dimensional chi-square test

                               Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                              (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square             5.010a    1    0.025
Continuity Correctionb         4.186     1    0.041
Likelihood Ratio               5.106     1    0.024
Fisher's Exact Test                                          0.034        0.020
Linear-by-Linear Association   4.969     1    0.026
N of Valid Cases               122.000

a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 18.82.
b. Computed only for a 2 × 2 table

TABLE 13.10 Symmetric measures for the two-dimensional chi-square test

                                   Value     Approx. Sig.
Nominal by Nominal   Phi           0.203     0.025
                     Cramer's V    0.203     0.025
N of Valid Cases                   122.00

residence. It is important to note that the chi-square result in itself says nothing about
the direction of an effect, so residence might influence gender, but such a conclu-
sion does not seem plausible. Furthermore, the chi-square results say nothing about
the extent to which gender influences residence, but according to Table 13.8, female
learners are more likely to have had residence and male learners were less likely to
have had it. So the final question that remains is ‘how strong was the influence of
gender on residence?’ Table 13.10 presents the results for this question.
In Table 13.10, the Phi value can be seen to be 0.203, so the effect size was con-
sidered to be weak to medium. Based on this effect size, there was an effect of gender
on the likelihood of residence, but it was not strong. This finding can be reported as:

The chi-square test was used to investigate whether gender affects the likeli-
hood of residence in the target language country. It was found that female
learners were significantly more likely to have had residence than male
learners, but the effect size was weak-to-medium (χ2(1) = 4.186, p = 0.041,
ϕ = 0.203, weak-to-medium effect size).
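The SPSS output above can also be reproduced with a few lines of code. The sketch below (Python, outside the book's SPSS workflow; the function name is ours) computes the Pearson chi-square, the Yates-corrected value, and phi for the gender-by-residence table:

```python
import math

def chi_square_2x2(table, yates=False):
    # Pearson chi-square for a 2 x 2 table; with yates=True, 0.5 is
    # subtracted from each |observed - expected| (continuity correction)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(table[i][j] - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)
            chi2 += diff ** 2 / expected
    return chi2

# Counts from Table 13.8: rows = male/female, columns = no residence/residence
observed = [[28, 13], [38, 43]]
uncorrected = chi_square_2x2(observed)            # about 5.01
corrected = chi_square_2x2(observed, yates=True)  # about 4.19
phi = math.sqrt(uncorrected / 122)                # about 0.20
```

The three values match the Pearson Chi-Square, Continuity Correction, and Phi entries of Tables 13.9 and 13.10.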

Non-SPSS Method for the Chi-Square Test


As mentioned earlier, SPSS can perform the chi-square test only if the original raw
data are used. If there is a preexisting contingency table with frequencies, SPSS
cannot be used. However, the chi-square calculator on the VassarStats website may
be used in this case. Figure 13.5 presents a screenshot from this website.

FIGURE 13.5 VassarStats website's chi-square calculator (http://vassarstats.net/newcs.html)

In order to compute the chi-square statistic, the cells to be used have to be
selected first. The data from Bell (2012) are adapted to illustrate this. Figure 13.6
illustrates the selection of four cells to define a 2 × 2 table. This selection makes
the rest of the table unavailable.

FIGURE 13.6 Contingency table for two rows and two columns

FIGURE 13.7 Contingency table for two rows and two columns with data entered

The data from Table 13.3 need to be entered (see Figure 13.7). When ‘Calcu-
late’ is clicked, the results are shown at the bottom of the web page, as can be seen
in Figure 13.8. The box under the contingency table in Figure 13.8 shows the
chi-square result with the Yates correction, and Cramer’s V (which is the same as
phi for a 2 × 2 table). In addition, the textbox next to the chi-square value, df, and
significance level provides the uncorrected result. The tables shown at the bottom
of Figure 13.8 present deviations from expectations as percentage deviations or
standardized residuals.
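If no web calculator is to hand either, the same computation can be scripted directly from a preexisting contingency table. A hedged Python sketch (the function is ours; for readers with SciPy installed, scipy.stats.chi2_contingency provides the same statistic together with a p-value):

```python
def chi_square_test(observed):
    # Pearson chi-square statistic and degrees of freedom for an
    # r x c contingency table of frequency counts
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (count - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

# The 2 x 4 collocation table from earlier in the chapter
observed = [[2527, 852, 162, 68],
            [22242, 12953, 2895, 1465]]
chi2, df = chi_square_test(observed)  # chi2 about 264.2, df = 3
```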

FIGURE 13.8 Chi-square test results from VassarStats

Summary
The results of chi-square tests allow researchers to understand the characteristics
of language learners or language learning contexts that shape how learners may
differ in terms of learning success, acquisition rates, and behaviors and thoughts.
Chi-square tests can be used to compare the observed frequencies of a single
variable with their expected frequencies, and to examine whether two nominal
variables are associated with one another. In L2 research, there are various types of

data, so various statistical tools are required to analyze them. The chi-square test
is suitable for the analysis of nominal data, which cannot be analyzed by means of
correlational analysis. The one-dimensional chi-square test is the simplest form of
this statistic. However, to analyze L2 research data, two-dimensional (or higher)
chi-square tests may be required. SPSS uses raw data only, so if there is a preexist-
ing contingency table with frequencies available, the VassarStats website can be
used instead of SPSS. The next chapter presents multiple regression, which is used
to examine the extent to which a dependent (outcome) variable can be predicted
by several independent (predictor) variables.

Review Exercises
To download review questions and exercises for this chapter, visit the Companion
Website: www.routledge.com/cw/roever.
14
MULTIPLE REGRESSION

Introduction
The last few chapters have presented inferential statistics, such as t-tests and the
ANOVA, that are used to compare group means. However, if researchers aim to
understand how different variables might affect language learning or test per-
formance, then another kind of inferential statistic is required. For example, Jia,
Gottardo, Koh, Chen, and Pasquarella (2014) investigated the relative effect of
several reading and personal background variables on reading comprehension of
ESL learners in Canada, including word-level reading ability, vocabulary knowl-
edge, length of residence in Canada, enculturation in the mainstream culture, and
enculturation in the heritage culture. Not only were the researchers interested in
the effect of each variable, but they were also interested in the relative importance
of those variables. For example, they were interested in finding out which variable
had the strongest impact on reading comprehension, and which combination of
variables best explained reading comprehension. The statistical procedure used to
answer these questions is known as multiple regression. In order to illustrate how
multiple regression is performed, this chapter begins with a description of simple
regression.

Simple Regression
Correlational analysis was introduced in Chapter 5. Correlation expresses the
strength of the relationship between two variables. For example, if the correlation
between vocabulary knowledge and reading ability is strong, it may be expected
that the higher language learners’ vocabulary scores are, the higher their reading
scores will be. For this reason, if researchers know learners’ vocabulary scores, they
will be able to make an informed guess as to their reading scores. Predictions can
therefore be made on the basis of the existence of a strong correlation between
two variables, and the stronger the relationship between two variables, the better
the prediction will be.
As an illustration, the graph presented in Figure 14.1 plots the relationship
between chocolate consumption and vocabulary recall success for a fictitious
group of learners. It shows a correlation of 1.0, which means that predictions can
be made with a high level of certainty. If a learner does not consume any chocolate,
that learner is likely to be able to answer 10 vocabulary questions correctly. If
a learner consumes 10 pieces of chocolate, that learner is likely to be able to answer
30 vocabulary questions correctly, and so on.

FIGURE 14.1 A scatterplot of the relationship between chocolate consumption and
vocabulary recall success

In regression analysis, the independent variable that is used to predict a depen-
dent variable is called a predictor variable, and the dependent variable is the outcome
variable. The relationship between chocolate consumption (the predictor variable)
and vocabulary recall success (the outcome variable) can be expressed in the fol-
lowing formula:

Vocabulary recall = 10 + 2 × chocolate consumption



This formula can be used to predict how many vocabulary recall questions a
student will be able to answer correctly, given how many pieces of chocolate the
student consumes. If a student consumes 25 pieces of chocolate, that student is
likely to be able to answer 60 vocabulary recall questions correctly (i.e., 10 + 2 ×
25 = 60). Such formulae are the basis of simple regression models. Their more
general form is written as the standard regression equation:

Y = A + BX, where

• Y is the outcome variable that researchers aim to predict (e.g., vocabulary
recall success);
• X is the predictor variable (e.g., the amount of chocolate consumption);
• A is the Y-intercept, which is the value of the outcome variable when the
predictor variable is 0. In this illustration, A = 10, which means that a learner
who does not eat any chocolate will be expected to get 10 vocabulary items
right; and
• B is the multiplier for the predictor variable. For each unit increase in the
predictor variable, the outcome variable increases by B units. In the example
under discussion, B = 2, which can be interpreted as saying that for every
extra piece of chocolate that a learner eats, that learner will answer two extra
vocabulary items correctly. B is sometimes referred to as the B-coefficient.

There is also a standardized version of B, expressed in standard deviations, which
is known as the β-value (beta value), or β-coefficient, and this ranges from –1 to
+1. In simple regression, β is the correlation coefficient, so in this case, β = 1. The
β-value is also important in multiple regression (as discussed later in the chapter).
Finally, on the basis of the formula, how much of the variance in vocabulary
recall success is accounted for by chocolate consumption is known (i.e., all of it,
or 100%). Regression expresses the variance accounted for as R2, which was intro-
duced in Chapter 5 as the coefficient of determination, and which ranges from 0
(no variance explained) to 1 (all variance explained). In this illustration, R2 = 1.0.
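The standard regression equation is directly executable. A tiny sketch of the chocolate example (the function name is ours):

```python
def predict(x, a=10, b=2):
    # Standard regression equation Y = A + B * X, with the Y-intercept
    # (A = 10) and B-coefficient (B = 2) from the chocolate example
    return a + b * x

print(predict(0))   # 10 vocabulary items with no chocolate
print(predict(25))  # 60 vocabulary items with 25 pieces
```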
The example discussed earlier is intended only to illustrate the concept of
regression analysis, and shows a perfect correlation between the two variables. In
L2 research, it is uncommon to find a perfect correlation because a variable that
totally and completely accounts for another variable is rare (see e.g., Ellis, 2015).
One reason for this is that outcomes such as learning performance are not usually
monocausal. That is, they are not explained by just one predictor variable, but by
several predictor variables that all contribute to the explanation of an outcome.
A complex ability, such as L2 reading comprehension, is non-monocausal as it is
not determined by just one linguistic factor such as vocabulary knowledge, but
by several that may include grammar knowledge and discourse structure. It may
also be partly determined by cognitive factors (e.g., reading strategies, anxiety, and
motivation), and external factors (e.g., types of text, reading task complexity and
demand, and consequences of performing well or poorly). Therefore, it would
be more useful to investigate several predictor variables at the same time and
to evaluate how much each contributes to L2 reading comprehension. Such an
investigation can be achieved through the use of multiple regression.

Multiple Regression
In correlation and simple regression, the relationship between two variables can
be examined. Multiple regression, however, can examine the effect of several vari-
ables on an outcome variable simultaneously. The main task in multiple regression
analysis is to explain (i.e., predict) the outcome variable as precisely and efficiently
as possible on the basis of the predictor variables. The combination of predictor
variables that produces the best prediction of the outcome is called the regression
model, and the best regression model explains as much variance in the outcome as
possible with as few predictors as possible. One of the tasks of multiple regression
is to determine the unique, individual contribution of each independent variable
to the outcome, taking into account the fact that the predictor variables may cor-
relate with each other, and that some of the same variance in the outcome variable
may therefore be explained by each of two or more variables.
A good example of a study using multiple regression is Jia et al. (2014). The
researchers investigated the relative impact of several predictor variables on the
reading comprehension of 94 immigrant high school students of Chinese back-
ground in Canada. They were interested in the effect of vocabulary knowledge,
word-level reading ability, length of residence in Canada, acculturation to the
mainstream (Anglo-Canadian) culture, and enculturation to the heritage (Chi-
nese) culture. While multiple regression allows researchers to begin by including
all possible predictor variables, that is not the best approach in practice. It is more
efficient to begin by including the variables that previous research has shown to
have a significant impact on the outcome variable, and then to progressively add
more variables in an attempt to improve the model. If a variable is found not to
improve the model, it should be excluded from the multiple regression model.
This approach is known as hierarchical (or sequential) regression, and it was this
that Jia et al. used. When hierarchical regression is used, researchers enter variables
in steps (called blocks). Each block contains the variables of the previous block
and adds a new variable (or a small number of new variables) to check if the new
variable improves the prediction. This allows researchers to compare regression
models to decide how many variables are needed to make satisfactory predictions.
Table 14.1 shows the three models that Jia et al. compared, the variables in each
model, the β-value for the variables used, the R2 of each model, and the change in
R2 when one model is replaced by the next.
In Table 14.1, the researchers first entered the length of residence as the sole
predictor, and found that this one-variable regression model explained 59% of
the variance in reading comprehension. The length of residence variable had a

TABLE 14.1 Three hierarchical regression models (adapted from Jia et al., 2014, p. 257)

Block   Variables                    β value   Model R2
1       Length of residence          .77∗      .60
2       Length of residence          .20∗      .79
        Vocabulary                   .46∗
        Word-level reading           .29∗
        R2 change: .19∗
3       Length of residence          .16       .81
        Vocabulary                   .47∗
        Word-level reading           .27∗
        Heritage enculturation       .03
        Mainstream enculturation     .12∗
        R2 change: .02∗

∗ significant at p < .05

statistically significant β of 0.77, which is high. This, however, was not surprising
since it was the only predictor in the model. In this one-variable model, you can
also think of β as the Pearson correlation of residence and reading score (r = 0.77
with a coefficient of determination R2 of 0.59).
In the next block, Jia et al. added vocabulary and word-level reading scores
to the model, as can be seen in Table 14.1. Both new variables were also found
to be significant predictors of reading comprehension scores, with strong con-
tributions at β of 0.46 for vocabulary, and β of 0.29 for word-level reading. In
multipredictor models such as this one, the β values are not identical to correla-
tion coefficients as they are in a one-variable model. The new model with length
of residence, vocabulary, and word-level reading explained 79% of the variance
in reading comprehension scores. The increase from the previous model to this
one was significant (20%), so the new model was taken to be better than the first.
In this second model, the contribution of residence dropped; it had a β value of
0.20. It can be concluded that in the first model residence had covered some read-
ing comprehension variance that was actually due to vocabulary knowledge and
word-level reading ability.
Finally, Jia et al. added their two enculturation variables (i.e., mainstream and
heritage enculturation). This time, with all five variables, the model explained an
extra 2% of the variance in reading comprehension, which was a statistically sig-
nificant improvement over the previous model. However, heritage enculturation
was not significant, making only a very minor contribution (β = 0.03), and length
of residence was found to be no longer significant (β = 0.16). In the third regres-
sion model, it appears that mainstream enculturation (β = 0.12) explained some of
the variance that had previously been explained by residence. The contributions

of vocabulary scores (β = 0.47) and word-level reading scores (β = 0.27) did not
change much and remained significant.
In this study, Jia et al. accepted this final model as the best combination of
predictor variables, although it could be argued that their second model, which
explained 79% of the variance using three variables, is preferable to a model that
explained just over 80% with five variables (two of which were nonsignificant).
The reason the authors may have decided to use the final model was that they
were hoping to demonstrate a significant role for mainstream enculturation in
reading comprehension, which they managed to do, though its role is outweighed
by linguistic factors.
The example of Jia et al.’s study shows how multiple regression can be used to
help researchers evaluate different combinations of predictor variables to account
for an outcome. However, the final determination of which combination of pre-
dictor variables explains the outcome variable most strongly needs to be decided
by the researcher. Statistics can only provide supporting evidence for this decision.
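Model comparisons of this kind rest on the significance of the R2 change, which is tested with an F statistic. The sketch below applies the usual formula to the block 1 to block 2 comparison from Jia et al. (n = 94; the function name is ours, and the exact F value is illustrative rather than taken from the study):

```python
def f_change(r2_old, r2_new, n, k_old, k_new):
    # F statistic for the change in R-squared when predictors are added:
    # F = ((R2_new - R2_old) / number of added predictors)
    #     / ((1 - R2_new) / (n - k_new - 1))
    numerator = (r2_new - r2_old) / (k_new - k_old)
    denominator = (1 - r2_new) / (n - k_new - 1)
    return numerator / denominator

# Block 1 (1 predictor, R2 = .60) vs. block 2 (3 predictors, R2 = .79)
f = f_change(0.60, 0.79, n=94, k_old=1, k_new=3)
# f is large (about 40), consistent with a significant R2 change
```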

The Assumptions of Multiple Regression


In multiple regression, predictor variables can be interval, ordinal, or even nominal.
Nominal predictor variables can be used only if they have exactly two levels (e.g.,
male/female, residence/no residence). There is a way to use nominal predictor
variables with more than two levels, which is called dummy coding, but this is not
covered in this book.
As multiple regression is a parametric procedure, the outcome variable should
be interval and exhibit a broad range of values, which should be normally dis-
tributed. There are special types of regression for the analysis of ordinal outcome
variables (ordinal regression), and nominal outcome variables (logistic regression).
These two types are not covered in this book.
If the predictor variables correlate very strongly, it may be difficult to discern
which predictor actually affects the outcome variable. This problem is known as
collinearity. SPSS provides statistics to check for collinearity (see Table 14.3). If
there is an indication of collinearity, the selection of predictor variables should be
reconsidered. There are also more subtle and complex threats to the validity of
multiple regression, such as dependent residuals and heteroscedasticity. However, these
will not be elaborated on here.
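Short of SPSS's collinearity diagnostics, a quick screen is to inspect the pairwise correlations among the predictors; very high values (by a common rule of thumb, around .80 or above) are a warning sign. A minimal Pearson correlation in Python (function name and data are ours, for illustration only):

```python
import math

def pearson_r(x, y):
    # Pearson correlation between two predictor variables; |r| near 1
    # between predictors suggests possible collinearity
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Two made-up predictor columns that move almost in lockstep
vocab = [40, 55, 62, 48, 70, 66]
word_reading = [40, 52, 66, 47, 73, 69]
r = pearson_r(vocab, word_reading)  # close to 1: a collinearity warning
```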
Sample size is as important a consideration in multiple regression as in any
other statistical procedure, and there are several rules of thumb for how big the
sample should be. Stevens (2012) recommends 15 participants per predictor,
whereas Field (2013, see p. 313) stresses that the minimum number will depend
on the expected effect size (small, medium, or large) and the number of predictors.
See also Tabachnik and Fidell (2012), which provides a more in-depth discussion
of this question.

How to Perform Multiple Regression in SPSS


To illustrate how to perform multiple regression in SPSS, the file Ch14TEP.sav will
be used (downloadable from the Companion Website of this book). The research
question to be considered is ‘what are the relative contributions of background
variables, such as proficiency level, length of residence, and computer familiarity
to TEP performance’? For the purpose of illustration, multiple regression will be
run using total TEP scores. It is generally preferable to use a hierarchical approach
to the inclusion of predictor variables. In the absence of prior research indicating
which of these are likely to have the greatest impact on the outcome, the following
procedure may be followed:

1. Run a multiple regression with all predictors at the same time to obtain a
picture of their relative importance in predicting the outcome variable.
2. Based on the results of the first run of the regression analysis, re-run the mul-
tiple regression hierarchically, starting with the strongest predictor, adding the
second strongest, and so on until the weakest is reached.

SPSS Instructions: Multiple Regression

Click Analyze, next Regression, and then Linear (see Figure 14.2)

In the Linear Regression dialog (Figure 14.3), move 'totalscore' from
the left-hand pane into the Dependent field and 'computer familiarity',
'years in English-speaking countries', and 'proficiency level' into the
Independent(s) field.

Click the Statistics button to call up the Linear Regression: Statistics
dialog (Figure 14.4). Note that in Figure 14.4, Estimates and Model
fit are pre-selected.

Tick the other four checkboxes on the right-hand side (i.e., R squared
change, Descriptives, Part and partial correlations, and Collinearity
diagnostics).
FIGURE 14.2 Accessing the SPSS menu to launch multiple regression

FIGURE 14.3 Linear Regression dialog



FIGURE 14.4 Linear Regression: Statistics dialog

FIGURE 14.5 Linear Regression: Options dialog

Click the Continue button to return to the Linear Regression dialog,
then click the Options button.

In the Linear Regression: Options dialog (Figure 14.5), select Exclude
cases pairwise. This ensures that data are included in the analysis
even if participants do not have data on all variables.

Click the Continue button to return to the Linear Regression dialog
and click on the OK button to run the analysis.

Table 14.2 presents the descriptive statistics of the analysis, with the outcome variable
being the total TEP score, plus three predictor variables: computer familiarity,
proficiency level, and years in English-speaking country. The descriptives indicate
that there is a large variation in sample sizes, suggesting that it may be a good idea
to exclude cases of missing data pairwise rather than listwise. In this way, far fewer
participants will be excluded from the analysis, especially on the basis of their lack
of computer experience.
Table 14.3 shows the correlations among the predictor and outcome variables.
It can be seen that proficiency level was a strong predictor of total score, while

TABLE 14.2 Descriptive statistics

Mean Std. deviation N

total score 61.2319 22.26167 166


computer familiarity 2.33 .672 63
proficiency level 3.26 1.191 163
years in English-speaking country .77 1.848 128

TABLE 14.3 Correlations among the outcome and predictor variables

Total Computer Proficiency Years in english-


score familiarity level speaking country

Pearson total score 1.000 .017 .801 .415


Correlation computer familiarity .017 1.000 –.049 –.047
proficiency level .801 –.049 1.000 .332
years in English- .415 –.047 .332 1.000
speaking country
Sig. total score . .448 .000 .000
(1-tailed) computer familiarity .448 . .353 .362
proficiency level .000 .353 . .000
years in English- .000 .362 .000 .
speaking country
N total score 166 63 163 128
computer familiarity 63 63 61 60
proficiency level 163 61 163 126
years in English- 128 60 126 128
speaking country

TABLE 14.4 Variables entered/removeda

Model   Variables entered                         Variables removed   Method
1       years in English-speaking country,        .                   Enter
        computer familiarity, proficiency levelb

a. Dependent Variable: total score
b. All requested variables entered.

TABLE 14.5 Model summary

Model   R       R square   Adjusted R square   Std. Error of the Estimate   R square change   F change   df1   df2   Sig. F change
1       .819a   .670       .653                13.11776                     .670              37.974     3     56    .000

a. Predictors: (Constant), years in English-speaking country, computer familiarity, proficiency level

length of residence was a medium predictor of total score. Computer familiarity
was not a good predictor of total score.
Table 14.4 lists the predictor variables that were entered into the model. Table
14.5 shows information about the model as a whole. First, the R-value is the
correlation between all the predictors taken together and the outcome variable.
The three variables together correlate with the total score at 0.819, which is high.
Second, the R-squared (R2) is the square of that correlation and represents the
amount of variance explained by the model. At 0.67, it explains 67% of the vari-
ance, which is also a large amount. The adjusted R2 describes how well the model
generalizes from the sample to the population. The adjusted R2 is usually slightly
lower than R2. In this case, the model explains 65% of the variance in the popula-
tion. Third, the R2 change is the difference between the model without predictors
and this model, and the associated F-statistic, which is significant in this case, is
based on an ANOVA analysis between the hypothetical model without predictors
and this current model with the three predictors. The ANOVA result shows that
this model differs significantly from the hypothetical model that does not have any
predictor variables, which shows that the predictors help improve the model. It
should be noted that the R2 change is the same as the R2 of the model in this case
because the final model was compared to the hypothetical model that does not
have any predictor variables.
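The arithmetic behind these quantities can be reproduced outside SPSS. The following Python sketch (illustrative, not part of the book's SPSS workflow) recomputes the adjusted R2 and the overall F-statistic from the values reported in Table 14.5, with n = 60 cases and k = 3 predictors; small discrepancies with the SPSS output arise because the reported R2 is rounded.

```python
# Sketch (not SPSS output): the arithmetic behind Table 14.5,
# using the reported values R^2 = .670, n = 60 cases, k = 3 predictors.

def adjusted_r_squared(r2, n, k):
    # Adjusted R^2 shrinks R^2 to estimate variance explained in the population.
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def model_f(r2, n, k):
    # F-statistic comparing the k-predictor model to the model with no predictors,
    # with df1 = k and df2 = n - k - 1.
    return (r2 / k) / ((1 - r2) / (n - k - 1))

r2, n, k = 0.670, 60, 3
adj = adjusted_r_squared(r2, n, k)   # close to the .653 reported by SPSS
f = model_f(r2, n, k)                # close to the F = 37.974 in Table 14.5
df2 = n - k - 1                      # 56, matching the table
```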
Table 14.6 presents a more detailed description of the ANOVA result from
Table 14.5. Tables 14.7 and 14.8 show the model coefficients. It should be noted
that the table of the model coefficients in the SPSS output was large, so it was

TABLE 14.6 The ANOVA result

Model          Sum of squares  df  Mean square  F       Sig.
1  Regression     19,603.095    3    6,534.365  37.974  .000b
   Residual        9,636.242   56      172.076
   Total          29,239.338   59

a Dependent Variable: total score
b Predictors: (Constant), years in English-speaking country, computer familiarity, proficiency level

TABLE 14.7 Model coefficients output: Unstandardized and standardized Beta coefficients

                             Unstandardized      Standardized
                             coefficients        coefficients
Model                        B       Std. error  Beta          t      Sig.
1  (Constant)                9.243   7.957                     1.162  .250
   computer familiarity      2.044   2.546       .062           .803  .425
   proficiency level        13.983   1.522       .748          9.189  .000
   years in English-         2.040    .980       .169          2.081  .042
   speaking country

broken up into two tables (Tables 14.7 and 14.8) in this chapter. Table 14.7 contains the standardized beta coefficients and their statistical significance levels.
In this table, proficiency level had the largest beta coefficient (β = 0.748, p <
0.001). Length of residence had a smaller beta (β = 0.169, p = 0.042), and com-
puter familiarity had a nonsignificant beta (β = 0.062, p = 0.425). Table 14.7
also contains B-coefficients (column 3), which indicate the direct links between
the predictor variables and the outcome variable. Recall that the B-coefficient is
related to how much the outcome variable increases when the predictor variable
increases by one unit. So, for example, if Participant X has a proficiency level that
is one level higher than Participant Y, Participant X’s total test score would be
expected to be 13.983 higher than that of Participant Y. Similarly, if Participant X
has a residence one year greater than that of Participant Y, then Participant X’s total
score would be expected to be 2.040 higher than that of Participant Y. This is
interesting in itself, but because the units of the predictors differ greatly (i.e., years
versus proficiency levels versus levels of computer familiarity), the B-coefficient
is not commonly used.
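The way B-coefficients generate predictions can be sketched in a few lines of Python. The coefficient values below are taken from Table 14.7; the function and variable names are invented for this illustration.

```python
# Sketch: how the unstandardized B coefficients in Table 14.7 turn
# predictor values into a predicted total score.

B = {"constant": 9.243,
     "computer_familiarity": 2.044,
     "proficiency_level": 13.983,
     "years_in_country": 2.040}

def predict_total(computer_familiarity, proficiency_level, years_in_country):
    # Predicted score = intercept + sum of (B coefficient * predictor value).
    return (B["constant"]
            + B["computer_familiarity"] * computer_familiarity
            + B["proficiency_level"] * proficiency_level
            + B["years_in_country"] * years_in_country)

# One extra proficiency level raises the predicted score by 13.983;
# one extra year of residence raises it by 2.040.
diff_prof = predict_total(2, 3, 1) - predict_total(2, 2, 1)
diff_years = predict_total(2, 2, 2) - predict_total(2, 2, 1)
```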
Table 14.8 presents the correlations between the total score (the outcome
variable) and each of the predictor variables. The zero-order correlation is a Pear-
son correlation (as presented in Table 14.3). The partial correlation treats other

TABLE 14.8 Model coefficients output: Correlations and collinearity statistics

                                  Correlations                 Collinearity statistics
Model                             Zero-order  Partial  Part    Tolerance  VIF
1  (Constant)
   computer familiarity           .017        .107     .062     .997      1.003
   proficiency level              .801        .775     .705     .888      1.126
   years in English-              .415        .268     .160     .889      1.125
   speaking country

predictors as covariates, and removes their influence from the outcome variable
and the predictor in question. It shows how much of the remaining variance that
predictor explains. For example, in the case of proficiency level, the effects of
length of residence and computer familiarity are taken out of the total test score
and their overlaps with proficiency level are removed to create a ‘pure’ proficiency
level result. This pure proficiency level correlates at 0.775 with the outcome vari-
able. Finally, the part (sometimes also called semi-partial) correlations show how
much the purified predictor correlates with the unpurified outcome variable. In
the case of proficiency level, only overlaps between proficiency level and length of
residence and computer familiarity were removed from the proficiency level vari-
able, creating a purified proficiency level variable, but no changes were made to
the outcome variable. The resulting correlation is the unique contribution of the
predictor variable to the outcome variable, which is 0.705 for proficiency level.
The part correlation can be squared to understand how much of the variance in
the outcome variable a predictor explains. In this case, proficiency level explained
nearly 50% of the variance in the outcome variable.
Table 14.8 also presents the collinearity statistics. As mentioned previously, col-
linearity describes excessive correlations between the predictors to the point that
it becomes difficult to distinguish the contribution of each predictor variable to
the outcome variable. In the collinearity statistics, all tolerance values should be
larger than 0.2, and the VIF indicators should be below 10 to confirm the absence
of excessive collinearity (Field, 2013, p. 342). All tolerance and VIF values in
Table 14.8 are acceptable, meaning that the variables make independent contribu-
tions to the regression model.
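Both checks can be expressed compactly. The following Python sketch (illustrative, using the values reported in Table 14.8) squares the part correlations to obtain each predictor's unique share of variance, and recovers VIF as the reciprocal of tolerance.

```python
# Sketch: two checks from Table 14.8. Squaring a part correlation gives
# the unique share of outcome variance a predictor explains, and
# VIF = 1 / tolerance.

part = {"computer_familiarity": 0.062,
        "proficiency_level": 0.705,
        "years_in_country": 0.160}
tolerance = {"computer_familiarity": 0.997,
             "proficiency_level": 0.888,
             "years_in_country": 0.889}

unique_variance = {name: r ** 2 for name, r in part.items()}
vif = {name: 1 / t for name, t in tolerance.items()}

# Proficiency level uniquely explains about half the variance (0.705^2 ≈ .497).
# Every tolerance is above 0.2 and every VIF below 10, so collinearity is
# not a concern here.
collinearity_ok = (all(t > 0.2 for t in tolerance.values())
                   and all(v < 10 for v in vif.values()))
```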
In summary, this first exploratory multiple regression run with all variables
has suggested that proficiency level had the strongest effect on the outcome vari-
able, and that residence had a moderate effect. It has also suggested that computer
familiarity should be excluded from further analysis. A hierarchical regression
model will be run next, with the objective of determining a final model using
only significant predictors. When that has been done, it will be possible to deter-
mine the amount of variance in the outcome variable explained by each of these
predictors.

How to Perform a Hierarchical Regression in SPSS


In order to perform a hierarchical regression, the same procedures can be followed
as for a non-hierarchical regression, with some additional steps.

SPSS Instructions: Hierarchical Regression

Click Analyze, next Regression, and then Linear (see Figure 14.2).

In the resulting Linear Regression dialog, move the variable ‘total score’ from the left-hand pane into the Dependent field (Figure 14.6). Then move ‘proficiency level’ into the Independent(s) field. As this is a hierarchical regression, one predictor variable will be added at a time, beginning with the one that is the strongest predictor based on the previous initial analysis. Several regression models will be created. The first will use just the first predictor variable entered (called Model 1 in SPSS); the second will use the first two predictor variables entered, and so on.

Do not add any other variables at this stage. Instead, click the Next button. ‘Block 2 of 2’ will appear above the Independent(s) field. The Independent(s) field will be empty to allow another variable to be entered.

FIGURE 14.6 Linear Regression dialog for a hierarchical regression (Block 1 of 1)



Enter ‘years in English-speaking country’ as the next predictor (see Figure 14.7) and click the Next button.

Enter ‘computer familiarity’ as the final predictor (see Figure 14.8). Note that this predictor is added only to illustrate how hierarchical regression can be performed with more than two predictor variables. The earlier analysis suggests that it did not predict the outcome variable at all.

FIGURE 14.7 Linear Regression dialog for a hierarchical regression (Block 2 of 2)

FIGURE 14.8 Linear Regression dialog for a hierarchical regression (Block 3 of 3)



Finally, choose Statistics and Options (as in Figures 14.4 and 14.5 in the ‘SPSS Instructions: Multiple Regression’ section), and click on the OK button to run the analysis.

Some of the SPSS outputs (Descriptive statistics, Correlations) are the same as
those for the exploratory model. However, the model summary contains different
information, as presented in Table 14.9.
According to Table 14.9, Regression Model 1 used proficiency level only as the
predictor variable. It correlates with the outcome variable at 0.801, and accounted
for 63.6% of the variance in TEP performance. The F-statistic (from the ANOVA
that compares models) indicates that proficiency level was a significantly better
predictor than the model that did not use predictor variables. Regression Model 2
used the predictors ‘proficiency level’ and ‘length of residence’. Its correlation with
the outcome was slightly higher than that of Regression Model 1 at 0.816, and it
accounted for 65.5% of the variance in TEP performance. This was an improve-
ment of 1.9% over Regression Model 1. The F-statistic was based on a comparison
between Regression Model 2 and Regression Model 1. It indicates that Regression
Model 2 is significantly better at explaining the variance than Regression Model
1 (p = 0.044). Finally, Regression Model 3 was the full model based on profi-
ciency level, residence, and computer familiarity as the hierarchical predictors of
the outcome variable. It correlates at 0.819 with the outcome variable. At 65.3%,
it explains slightly less of the population variance than Regression Model 2. The
F-statistic was not statistically significant (p = 0.425), indicating that this model is
not significantly better than Regression Model 2.
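The F-change statistics in the model summary can be approximated from the reported R2 values. This Python sketch (not SPSS; results differ slightly from Table 14.9 because the R2 values are rounded to three decimals) applies the standard F-change formula for comparing nested models.

```python
# Sketch: the F-change test behind Table 14.9, computed from the
# reported (rounded) R^2 values, so the results only approximate
# SPSS's exact figures.

def f_change(r2_full, r2_reduced, n, k_full, q=1):
    # q = number of predictors added at this step;
    # denominator df = n - k_full - 1.
    return ((r2_full - r2_reduced) / q) / ((1 - r2_full) / (n - k_full - 1))

n = 60
# Step 2: adding residence to proficiency level (about 4.26 in Table 14.9).
f_step2 = f_change(0.667, 0.642, n, k_full=2)
# Step 3: adding computer familiarity (small and nonsignificant).
f_step3 = f_change(0.670, 0.667, n, k_full=3)
```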
This model summary shows that Regression Model 2 (proficiency and resi-
dence) was the best model in explaining the variance at 66%, and it also shows that
including residence leads to just a small improvement in the model over includ-
ing proficiency level alone. Computer familiarity was found not to be a helpful

TABLE 14.9 Model summary

                                                   Change statistics
Model  R      R square  Adjusted   Std. error     R square  F change  df1  df2  Sig. F
                        R square   of the         change                        change
                                   estimate
1      .801a  .642      .636       13.43893       .642      103.897   1    58   .000
2      .816b  .667      .655       13.07685       .025        4.256   1    57   .044
3      .819c  .670      .653       13.11776       .004         .645   1    56   .425

a Predictors: (Constant), proficiency level
b Predictors: (Constant), proficiency level, years in English-speaking country
c Predictors: (Constant), proficiency level, years in English-speaking country, computer familiarity

predictor. Regression Model 2 would be reported in a final write-up, whereas Regression Model 3 would be discarded.
Table 14.10 shows the ANOVA results of the comparison between each model
and the model that uses no predictor variables. Tables 14.11 and 14.12 present the
output showing the model coefficients. Table 14.11 can be used to compare the
β-coefficients of Regression Model 1 and Regression Model 2. The β-coefficient
of proficiency level can be seen to have declined by 0.055 (i.e., 0.801–0.746),
whilst the β-coefficient of the new predictor residence was 0.167, thereby claim-
ing variance that had previously been attributed to proficiency level. However,
when comparing regression models 2 and 3, it can be seen that the β-coefficients

TABLE 14.10 ANOVA results

Model          Sum of squares  df  Mean square  F        Sig.
1  Regression     18,764.256    1   18,764.256  103.897  .000b
   Residual       10,475.082   58      180.605
   Total          29,239.338   59
2  Regression     19,492.115    2    9,746.058   56.993  .000c
   Residual        9,747.222   57      171.004
   Total          29,239.338   59
3  Regression     19,603.095    3    6,534.365   37.974  .000d
   Residual        9,636.242   56      172.076
   Total          29,239.338   59

b Predictors: (Constant), proficiency level
c Predictors: (Constant), proficiency level, years in English-speaking country
d Predictors: (Constant), proficiency level, years in English-speaking country, computer familiarity

TABLE 14.11 Model coefficients output: Unstandardized and standardized Beta coefficients

                             Unstandardized      Standardized
                             coefficients        coefficients
Model                        B       Std. error  Beta          t       Sig.
1  (Constant)               12.346   5.100                      2.421  .019
   proficiency level        14.978   1.469       .801          10.193  .000
2  (Constant)               14.176   5.041                      2.812  .007
   proficiency level        13.939   1.516       .746           9.195  .000
   years in English-         2.015    .977       .167           2.063  .044
   speaking country
3  (Constant)                9.243   7.957                      1.162  .250
   proficiency level        13.983   1.522       .748           9.189  .000
   years in English-         2.040    .980       .169           2.081  .042
   speaking country
   computer familiarity      2.044   2.546       .062            .803  .425

of proficiency and residence changed little, and the β-coefficient of the new pre-
dictor computer familiarity was very small at 0.062.
Table 14.12 further suggests that partial and part correlations for proficiency
level changed strongly between regression models 1 and 2, but only minimally for
proficiency level and residence between regression models 2 and 3. In Regression
Model 2, proficiency level had a part correlation of 0.703 and hence accounted
for 49.4% of the variance in the outcome variable, whereas residence had a part
correlation of 0.158 and accounted for only 2.5% of the variance. On the basis of
this statistical finding, proficiency level was nearly 20 times as influential in deter-
mining total TEP scores as residence. The three aspects of pragmatic competence
being investigated were nearly entirely dependent on learners’ proficiency level,
and residence and computer familiarity were found to be almost irrelevant. In
Table 14.13, the collinearity diagnostics indicate that none of the models violate
the collinearity condition of multiple regression.

TABLE 14.12 Model coefficients output: Correlations and collinearity statistics

                                  Correlations                 Collinearity statistics
Model                             Zero-order  Partial  Part    Tolerance  VIF
1  (Constant)
   proficiency level              .801        .801     .801    1.000      1.000
2  (Constant)
   proficiency level              .801        .773     .703     .890      1.124
   years in English-              .415        .264     .158     .890      1.124
   speaking country
3  (Constant)
   proficiency level              .801        .775     .705     .888      1.126
   years in English-              .415        .268     .160     .889      1.125
   speaking country
   computer familiarity           .017        .107     .062     .997      1.003

TABLE 14.13 Excluded variables

                             Beta in  t      Sig.  Partial      Collinearity statistics
Model                                              correlation  Tolerance  VIF    Minimum
                                                                                  tolerance
1  years in English-         .167b    2.063  .044  .264         .890       1.124  .890
   speaking country
   computer familiarity      .057b     .716  .477  .094         .998       1.002  .998
2  computer familiarity      .062c     .803  .425  .107         .997       1.003  .888

b Predictors in the Model: (Constant), proficiency level
c Predictors in the Model: (Constant), proficiency level, years in English-speaking country

Table 14.13 shows statistics for the variables excluded from each model and
more detailed collinearity statistics, but they are not relevant here because there
were no problems with the collinearity conditions in the regression model.
On the basis of the multiple regression, the results may be written up as follows.

A hierarchical regression was conducted to examine the relative contri-


butions of language proficiency level, residence, and computer familiarity
to total performance on the TEP test. Each predictor was entered into a
regression model consecutively in the order indicated. The regression results
suggest that a regression model with the predictors proficiency level (β =
0.75) and residence (β = 0.17) was the best model, explaining 66% of the
variance in test performance. Computer familiarity was not a significant
predictor of TEP test performance.

Summary
Multiple regression is a useful statistical procedure to help researchers evaluate the
relative influences of several independent variables on an outcome variable, such
as language learning success and test performance. It is a procedure with many
options. However, in this chapter, only a hierarchical multiple regression option
for an interval outcome variable has been presented. Details of other multiple
regression procedures are presented in other texts (see Resources on Methods for
Second Language Research in the Epilogue section of this book). The next and final
chapter of this book will show how to analyze the reliability of research instru-
ments and data coder or rater agreement data in SPSS.

Review Exercises
To download review questions and exercises for this chapter, visit the Companion
Website: www.routledge.com/cw/roever.
15
RELIABILITY ANALYSIS

Introduction
This chapter explores the concept of reliability and illustrates how to conduct reli-
ability analysis in L2 research through the use of SPSS. The key aim of this chapter
is to discuss and present statistical methods for evaluating the reliability of quan-
titative research instruments (e.g., tests and Likert-type scale questionnaires), and
qualitative data coding. Researchers are expected to provide reliability estimates
of their research instruments because unreliable measures imply that data analysis
outcomes cannot be fully trusted.

Reliability
Reliability can be understood as the consistency or repeatability of observations of
behaviors, performance, and/or psychological attributes. A language test is reliable
when students with the required language knowledge and ability can consistently
answer the test questions correctly, while those with little or no knowledge of the
target language cannot. A Likert-type scale questionnaire is reliable when research
participants choose 5 when they strongly agree with a statement, but 1 when
they strongly disagree with it. The issue of the reliability of research instruments
is critical for good L2 research as researchers rely on them for the collection of
useable data.
The reliability of a test or research instrument is commonly expressed as a value
between 0 and 1. Unlike correlation coefficients, reliability coefficients can be
understood as coefficients of determination (R2), which were discussed in Chap-
ter 5. That is, a reliability coefficient of 0 indicates that the test or instrument

does not measure the target construct consistently (i.e., it is 0% reliable). That is,
the results are random and are not useful in drawing conclusions about the target
construct. If the reliability estimate of an instrument is 0, the data collected using
that instrument should not be used for statistical analysis to answer research ques-
tions. A reliability coefficient of 1 means that the test or research instrument is
perfectly precise with no measurement error (i.e., it is 100% reliable or consistent).
The extreme values of 0 or 1 are unlikely to be found in L2 research. Measuring
abstract constructs or indirectly observed attributes, such as language proficiency
and psychological attributes (e.g., motivation, learning style, and attitudes) is not
a precise science.
The level of reliability of a particular test or research instrument that is accept-
able depends on the seriousness of the consequences of the test results or research
outcome. For example, if test scores are used for a high-stakes purpose, such as in the
decision-making process for university admission, an overall reliability of around
0.90 would be needed though reliabilities of individual test sections can be lower,
in the 0.80 region. Test section reliabilities are generally acceptable if they are at
least 0.80. A reliability around 0.80 is also acceptable if the potential consequences
of the test scores are less serious; for example, the scores may be used as part of the
decision-making process for placement in a language program, or the test may be
one part of a course assessment. For low-stakes tests, such as self-assessments that
provide feedback to students, 0.70 is generally acceptable as a reliability estimate.
A reliability level below 0.70 means that more than 30% of the test result
or research outcome is effectively random, and this is acceptable only when the
process of making modifications to tests or research instruments is still ongoing,
and changes to the instrument can be made prior to the final collection of data.

Reliability as Internal Consistency


The reliability coefficient most commonly expresses the internal consistency of a
test or research instrument, such as a questionnaire or an elicitation task. Inter-
nal consistency manifests itself in several ways. First, when measuring students’
language ability through a language proficiency test, for example, it is expected
that high-ability learners should consistently obtain high scores, while low-ability
learners should consistently obtain low scores. The variation of the scores of learn-
ers with similar levels of ability should not be excessive. Second, difficult questions
should be answered correctly more frequently by high-ability learners than by
low-ability learners. Therefore, a test cannot have high internal consistency if
low-ability learners can frequently answer difficult test questions better than high-
ability learners can.
Third, on two tests that have been designed to measure the same ability con-
structs (e.g., two versions of official TOEFL tests), individual test takers should

obtain similar scores. If an instrument has high reliability, the data it elicits will be
consistent. In this chapter, split-half reliability, test-retest reliability, and Cronbach’s
alpha will be presented.

Test-Retest Reliability
While not actually a measure of the internal consistency of a single test, test-retest
reliability is conceptually important to understand consistency. It assumes that the
same test is administered to the same participants twice and the results correlated.
A highly reliable test that consistently measures the same attribute should produce
very similar results and a high correlation between administrations. However, a
practice effect from the first to the second administration is likely to distort results
so test-retest studies are not normally done. More practical approaches are split-
half reliability and Cronbach’s alpha reliability.

Split-Half Reliability
Split-half reliability can be obtained in different ways. The simplest method is
to split the test or instrument in half (first half/second half ), and correlate the
scores or results from the two halves. However, due to the possible effects of test
fatigue towards the end of the test, it is preferable to correlate scores from the
odd-numbered items with those from the even-numbered items. The resulting
correlation from this method can underestimate the actual reliability of the test or
instrument so the Spearman-Brown prophecy formula (see Brown, 2005, for details)
can be applied to obtain a more reliable measure. SPSS can compute the split-half
reliability and Spearman-Brown prophecy estimate for a test or questionnaire.
See the ‘Measures for Inter-Rater and Inter-Coder Reliability’ section for the SPSS
procedure that should be followed to do this.
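The odd-even procedure can be sketched in a few lines of Python. The data below are invented scores for five test takers on the two halves of a hypothetical test, not TEP data; the Spearman-Brown prophecy formula steps the half-test correlation up to an estimate for the full-length test.

```python
# Sketch (illustrative data): odd-even split-half reliability with the
# Spearman-Brown correction r_full = 2r / (1 + r).

def pearson(x, y):
    # Pearson correlation, written out so the sketch is self-contained.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_brown(r_half):
    # Steps a half-test correlation up to the estimated full-test reliability.
    return 2 * r_half / (1 + r_half)

# Each position: one test taker's score on the odd-item half and the
# even-item half of the same test.
odd_half = [1, 2, 3, 4, 5]
even_half = [2, 1, 4, 3, 5]

r_half = pearson(odd_half, even_half)   # correlation of the two halves
r_full = spearman_brown(r_half)         # corrected full-test estimate
```

Because r_full is always at least as large as r_half, the correction compensates for the underestimation that splitting the test in half produces.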

Cronbach’s Alpha
Cronbach’s alpha (α) is a standard measure of reliability for tests and question-
naires. It is most affected by how strongly test or questionnaire items correlate
with each other since this inter-item correlation reflects how well the items mea-
sure the same attribute. Cronbach’s alpha is also affected by how many items there
are in the test or questionnaire. As a general rule, the higher the number of items
used, the more reliable a research instrument is. A high Cronbach’s alpha provides
evidence that the instrument is internally consistent.
Cronbach’s alpha is high when questions or items are answered consistently.
Table 15.1 presents a simple (simulated) data matrix from a course feedback ques-
tionnaire answered by 10 students.

TABLE 15.1 A simple (simulated) data matrix for a course feedback questionnaire (N = 10)

ID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10

1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
3 2 2 2 2 2 2 2 2 2 2
4 2 2 2 2 2 2 2 2 2 2
5 3 3 3 3 3 3 3 3 3 3
6 3 3 3 3 3 3 3 3 3 3
7 4 4 4 4 4 4 4 4 4 4
8 4 4 4 4 4 4 4 4 4 4
9 5 5 5 5 5 5 5 5 5 5
10 5 5 5 5 5 5 5 5 5 5

TABLE 15.2 The reliability for the 12-item implicature section of the TEP

Cronbach’s alpha  Cronbach’s alpha based on standardized items  N of items
0.83              0.83                                          12

In Table 15.1, students can be seen to be consistent in their evaluations: participants 1 and 2 gave a score of 1 for every question, which indicates that they
did not rate their teacher highly. Participants 3 and 4 gave a score of 2 for every
question and so on, up to participants 9 and 10, who gave a score of 5 for every
question. It can be seen that each participant is homogeneous in their evaluation of
the teacher. Each item seems to measure the same construct, which is presumably
‘attitude towards the teacher’. With an adequate sample size, this very high level
of consistency will lead to a high Cronbach’s alpha. SPSS finds the Cronbach’s
alpha to be 1. This does not normally happen when real data are under analysis,
however.
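This result can be checked directly from the definition of Cronbach's alpha, α = k/(k−1) × (1 − Σ item variances / variance of total scores). The Python sketch below (illustrative, not SPSS) applies that formula to the simulated matrix in Table 15.1.

```python
# Sketch: Cronbach's alpha computed from its definition,
# alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores)),
# applied to the simulated course-feedback matrix in Table 15.1.

def variance(xs):
    # Sample variance (n - 1 denominator).
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

def cronbach_alpha(rows):
    # rows: one list of item scores per participant.
    k = len(rows[0])
    items = [[row[i] for row in rows] for i in range(k)]
    totals = [sum(row) for row in rows]
    return k / (k - 1) * (1 - sum(variance(it) for it in items) / variance(totals))

# Table 15.1: each participant gives the same rating to all 10 questions.
ratings = [[v] * 10 for v in (1, 1, 2, 2, 3, 3, 4, 4, 5, 5)]
alpha = cronbach_alpha(ratings)   # perfectly consistent answers give alpha = 1
```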
In analyzing reliability, SPSS provides a useful table column that allows research-
ers to see which items in a test or research instrument lower the overall reliability
of that test or instrument, entitled ‘Cronbach’s alpha if item deleted’. In particular,
it indicates what the reliability would be if an item were removed. Table 15.2
presents the reliability for the 12-item implicature section of the TEP. The Cron-
bach’s alpha is 0.83, which is high for a test with such a small number of items.
Table 15.3 presents the item-total statistics, which reflect item discrimination and
tell researchers which items lower the overall reliability coefficient of the scale.
The numbers in the final column, Cronbach’s Alpha if Item Deleted (shaded
in Table 15.3), would ideally be lower than the reliability estimates seen in the

TABLE 15.3 Item-total statistics of the 12-item implicature section of the TEP

          Scale mean if  Scale variance   Corrected item-    Squared multiple  Cronbach’s alpha
          item deleted   if item deleted  total correlation  correlation       if item deleted
imp1      6.44            9.85            0.55               0.37              0.81
imp2      6.56            9.48            0.60               0.49              0.81
imp3      6.64            9.61            0.53               0.48              0.81
imp4      6.60            9.94            0.42               0.24              0.82
imp5      6.86           10.32            0.30               0.16              0.83
imp6      6.58           10.06            0.38               0.26              0.83
imp7      6.68            9.74            0.48               0.38              0.82
imp8      6.78           10.05            0.37               0.25              0.83
imp9sc    6.52            9.53            0.61               0.48              0.81
imp10sc   6.61            9.96            0.41               0.29              0.82
imp11sc   6.54            9.32            0.67               0.55              0.80
imp12sc   6.50            9.71            0.55               0.45              0.81

previous table (i.e., 0.83). For example, if imp1 were removed, the alpha coefficient
would actually drop from 0.83 to 0.81, which indicates that imp1 should be kept
because it contributes to a higher reliability coefficient. That is also the case for
each of the other items so in this illustration, all the items should be kept for the
purpose of further data analysis.
However, if a similar analysis were performed, and it was found that several
items were contributing to a low Cronbach’s alpha, items may be removed to
make the coefficient higher, and so improve reliability. It is also important to keep
in mind that Cronbach’s alpha is sample dependent. If the sample size is large,
diverse, and heterogeneous, spreading across the entire ability spectrum, the alpha
is likely to be higher than one that could be obtained from a small, homogeneous
sample.
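The logic of 'Cronbach's alpha if item deleted' can be illustrated with a small invented data set in which one item works against the other three. This Python sketch uses the textbook formula for alpha; the data are hypothetical, not from the TEP.

```python
# Sketch: 'Cronbach's alpha if item deleted', using a small hypothetical
# data set in which item 4 runs in the opposite direction to items 1-3.

def variance(xs):
    # Sample variance (n - 1 denominator).
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

def cronbach_alpha(rows):
    # alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores)).
    k = len(rows[0])
    items = [[row[i] for row in rows] for i in range(k)]
    totals = [sum(row) for row in rows]
    return k / (k - 1) * (1 - sum(variance(it) for it in items) / variance(totals))

def alpha_if_deleted(rows, item):
    # Recompute alpha with one item column removed.
    return cronbach_alpha([[v for i, v in enumerate(row) if i != item]
                           for row in rows])

# Items 1-3 rise with ability; item 4 does the reverse.
data = [[b, b, b, 7 - b] for b in (1, 2, 3, 4, 5, 6)]

overall = cronbach_alpha(data)             # poor overall consistency
without_item4 = alpha_if_deleted(data, 3)  # deleting item 4 raises alpha sharply
```

Here 'alpha if item deleted' for the misbehaving item is far above the overall alpha, which is exactly the signal that the item should be reviewed or removed.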

How to Compute Cronbach’s Alpha in SPSS


In this section, the file Ch15TEPitemlevel.sav (downloadable from the Compan-
ion Website of this book) is used. For Cronbach’s alpha, the scores on individual
test items are needed.

SPSS Instructions: Cronbach’s Alpha

Click Analyze, next Scale, and then Reliability Analysis (Figure 15.1).

FIGURE 15.1 Accessing the SPSS menu to launch Cronbach’s alpha analysis

FIGURE 15.2 Reliability Analysis dialog for Cronbach’s alpha analysis

In the resulting Reliability Analysis dialog, choose the variables (items) from ‘imp1sc’ to ‘imp12sc’ in the left-hand pane and move them to the ‘Items:’ field on the right (Figure 15.2).

Click the Statistics button. The Reliability Analysis: Statistics dialog will appear (Figure 15.3).

FIGURE 15.3 Reliability Analysis: Statistics dialog

Tick the Item, Scale, Scale if item deleted, Mean, and Correlations checkboxes.

Click the Continue button to return to the Reliability Analysis dialog, then click on the OK button.

Table 15.4 presents the case processing summary. The case processing summary
indicates that only 100 out of 167 test takers’ scores were used in the calcula-
tion of Cronbach’s alpha. This is because the calculation of Cronbach’s alpha
in SPSS requires a complete data set for each participant. It ignores the data for
those participants who have missing scores on the items being analyzed (note that
missing data were coded ‘999’). Table 15.5 presents the overall reliability statistics.
Cronbach’s alpha based on standardized items is usually similar to that of the
unstandardized items.

TABLE 15.4 The case processing summary for items ‘imp1sc’ to ‘imp12sc’

                      N     %
Cases  Valid          100    59.90
       Excludeda       67    40.10
       Total          167   100.00

a Listwise deletion based on all variables in the procedure.

TABLE 15.5 The overall reliability statistics

Cronbach’s alpha  Cronbach’s alpha based on standardized items  No of items
0.83              0.83                                          12

TABLE 15.6 The item statistics

          Mean  SD    N
imp1sc    0.77  0.42  100
imp2sc    0.65  0.48  100
imp3sc    0.57  0.50  100
imp4sc    0.61  0.49  100
imp5sc    0.35  0.48  100
imp6sc    0.63  0.49  100
imp7sc    0.53  0.50  100
imp8sc    0.43  0.50  100
imp9sc    0.69  0.47  100
imp10sc   0.60  0.49  100
imp11sc   0.67  0.47  100
imp12sc   0.71  0.46  100

Table 15.6 presents the item statistics. The item statistics show information on
each item (the lower the mean, the more difficult the item). Table 15.7 presents
the summary statistics for each test item. An item mean of 0.60 indicates that this
section of the test was easy for the test takers (a mean of 0.50 would indicate that
the section was at an ideal level of difficulty). The average inter-item correlation
of 0.29 is appropriate according to Clark and Watson (1995), who suggest that an
inter-item correlation of between 0.15 and 0.5 is acceptable. Stronger inter-item
correlations suggest that the items are too similar to each other and the construct
under measurement may be narrow.
Table 15.8 presents the item-total statistics, which allow researchers to examine
‘Cronbach’s alpha if item deleted’. The last column shows what Cronbach’s alpha
would be if individual items were deleted (as discussed in Table 15.3). Ideally, all
the values in this column should be lower than the overall Cronbach’s alpha.

TABLE 15.7 The summary item statistics

                         Mean  Minimum  Maximum  Range  Maximum/  Variance  N of
                                                        minimum             items
Item Means               0.60  0.35     0.77     0.42    2.20     0.01      12
Inter-Item Correlations  0.29  0.05     0.59     0.54   11.57     0.01      12

TABLE 15.8 The item-total statistics

          Scale mean if  Scale variance   Corrected item-    Squared multiple  Cronbach’s alpha
          item deleted   if item deleted  total correlation  correlation       if item deleted
imp1sc    6.44            9.85            0.55               0.37              0.81
imp2sc    6.56            9.48            0.60               0.49              0.81
imp3sc    6.64            9.61            0.53               0.46              0.81
imp4sc    6.60            9.94            0.42               0.24              0.82
imp5sc    6.86           10.32            0.30               0.16              0.83
imp6sc    6.58           10.06            0.38               0.26              0.83
imp7sc    6.68            9.74            0.48               0.38              0.82
imp8sc    6.78           10.05            0.37               0.25              0.83
imp9sc    6.52            9.53            0.61               0.48              0.81
imp10sc   6.61            9.96            0.41               0.29              0.82
imp11sc   6.54            9.32            0.67               0.55              0.80
imp12sc   6.50            9.71            0.55               0.45              0.81

TABLE 15.9 The scale statistics

Mean  Variance  SD    N of items
7.21  11.48     3.39  12

Finally, the scale statistics (Table 15.9) show the mean score for all test takers for
this section. Overall, the Cronbach’s alpha analysis suggests that this test section is
reliable and that all the test questions worked together well to elicit implicature.
No items should be removed from the data set.

Rater Reliability
In performance assessment (such as the assessment of speaking and writing),
assigning scores to performance is subjective to a certain degree. That is, the same
rater may not always assign scores to performance of the same standard in exactly
the same way. Moreover, even two well-trained, highly experienced raters may
not always assign a similar score to the same piece of written work or spoken

FIGURE 15.4 A selection from Ch15analyticrater.sav (Data View)

performance. This makes the study of reliability in this area particularly impor-
tant. There are two common types of rater reliability: intra-rater reliability and
inter-rater reliability.

Measures for Intra-Rater Reliability


Intra-rater reliability is related to the consistency of the scores a rater assigns to stu-
dents’ work. Raters generally use an analytic rating-scale rubric, which is divided
into various components of ability (e.g., in a speaking test, the criteria may include
linguistic accuracy, linguistic complexity, fluency, vocabulary, and language con-
trol). In analytic scoring, the sum of the scores for each criterion makes up the
total score. Cronbach’s alpha can be employed to examine intra-rater reliability.
The data in the file Ch15analyticrater.sav (which can be downloaded from the
Companion Website of this book) will be used to calculate the intra-rater reliabil-
ity coefficient through the use of the Cronbach’s alpha. Figure 15.4 shows part of
an SPSS worksheet for this data file in Data View.

Measures for Inter-Rater and Inter-Coder Reliability


In performance assessment, a higher level of reliability can be achieved when each
student’s performance is rated by two or more raters. For example, in an essay test,
reliability concerns whether there is variation of scoring within an individual rater
(intra-rater reliability), as well as among different raters (inter-rater reliability).
Some raters may be more lenient towards students’ work and might apply scoring
rubrics more loosely than other raters. The resulting variation of scoring can be a
source of error that affects test reliability, because it limits the ability to consistently

distinguish performances of students with different abilities. Unlike intra-rater


reliability, inter-rater reliability can be used to indicate the extent to which two or
more raters assign scores to the same performance similarly.
In this chapter, the SPSS procedures for the Spearman-Brown prophecy coefficient,
Cohen’s kappa coefficient, and intraclass correlation are presented. However,
three other measures of inter-rater reliability, namely the percentage of agreement,
correlations, and Cronbach’s alpha will be discussed first.

Percentage of Agreement
In L2 research, some researchers may simply examine and report the percentage of
agreement among raters or coders (e.g., 90% agreement in assigning a test score).
While this report is more informative than not reporting the reliability at all, the
percentage agreement measure is not useful evidence of rater/coder reliability, and
therefore should be avoided. This is mainly because agreement between two peo-
ple depends on several complex factors. For example, first, the level of agreement
depends on the score ranges in the rating scales being used (e.g., 1–4, or 1–20), and
the nature of the feature being rated (e.g., surface features, such as factual informa-
tion and frequencies of occurrence versus content or thematic features, such as beliefs,
perceptions, attitudes, and cognitive processes). In performance assessment, raters are
more likely to agree with each other when the range of scores is low (e.g., between
1 and 4) than when the range is broad (e.g., 1–20). In qualitative data coding, cod-
ers are more likely to agree with each other when coding factual information than
when coding qualitative content, because of the complex nature of some constructs,
such as motivation, attitudes and beliefs, and cognitive processing.
Second, the percentage agreement, especially in rating scales, depends on
whether or not researchers adopt an exact or adjacent agreement method. An exact
agreement suggests no discrepancy between two scores as assigned by the two rat-
ers, whereas an adjacent agreement allows a 1 point difference between two scores
assigned by the two raters. Third, the percentage agreement depends on the sam-
ple size. When the sample size is small, agreement tends to be higher than when
the sample size is large. For example, the percentage agreement rate would be
higher for 10 participants than for 20 participants. On the basis of this discussion,
the percentage agreement provides an inflated measure of the relationship between
two scores assigned by two different people (see e.g., Keith, 2003; Williamson, Xi,
& Breyer, 2012; Yang, Buckendahl, Juszkewicz, & Bhola, 2002, who discuss this
concern in the context of automated essay scoring [AES]). Therefore, this type of
inter-rater or inter-coder reliability should be avoided or used with caution.
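Exact and adjacent agreement are nonetheless easy to compute and report alongside a proper reliability estimate. The sketch below is illustrative; the function name and the two rating lists are invented:

```python
def percent_agreement(ratings_a, ratings_b, tolerance=0):
    """Proportion of cases on which two raters agree.

    tolerance=0 gives exact agreement; tolerance=1 gives adjacent
    agreement, where scores may differ by one point.
    """
    pairs = list(zip(ratings_a, ratings_b))
    agreed = sum(abs(a - b) <= tolerance for a, b in pairs)
    return agreed / len(pairs)

# Invented ratings by two raters on a 1-5 scale
rater1 = [3, 2, 4, 1, 5]
rater2 = [3, 3, 4, 2, 5]
print(percent_agreement(rater1, rater2))               # exact agreement → 0.6
print(percent_agreement(rater1, rater2, tolerance=1))  # adjacent agreement → 1.0
```

Note how the adjacent method inflates the figure, which is one more reason to treat percentage agreement with caution.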

Correlation Coefficients
Another method that can be used as evidence of inter-rater reliability is corre-
lational analysis (see Chapter 5). This measure is preferable to the percentage of

agreement method. If the data are continuous and normally distributed, a Pearson
product moment correlation can be computed. If the data are ordinal or non-
normal, a Spearman’s rho correlation can be used. A strong correlation coefficient
(e.g., 0.80) suggests that two raters or coders agree on their ratings of the same
piece of performance or data. Nonetheless, it should be noted that a correlation
coefficient is not a reliability estimate. That is, a correlation coefficient of 0.70
does not equate to a reliability coefficient of 0.70. For several reliability mea-
sures, a correlation coefficient is just one ingredient of the reliability formula (see
e.g., “The Spearman-Brown Prophecy Coefficient” section). However, the use
of correlation coefficients as reliability indices is often seen in language assess-
ment practice. For example, in AES research, validators employ Pearson product
moment correlations between human raters and a computer rater as one measure
of AES reliability. For example, Williamson et al. (2012) recommend a threshold
of 0.70 for human-human and human-AES correlations. They point out that a
correlation of 0.70 nearly reaches the tipping point at which signal outweighs
noise in the prediction, so nearly 50% of the variance in the agreement between
two raters is accounted for. While the use of correlations is preferable to percent-
age agreements to analyze inter-rater reliability, it is recommended that correlation
coefficients be used only as complements to the inter-rater or coder reliability
estimates.
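For readers without SPSS at hand, both coefficients can be computed from their definitions. The sketch below implements Pearson's r directly and Spearman's rho as Pearson's r on average ranks; the function names and the score lists are invented for illustration:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    s = sorted(values)
    return [(s.index(v) + (len(s) - 1 - s[::-1].index(v))) / 2 + 1 for v in values]

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Invented ratings: monotonically related but not linear
scores_a = [1, 2, 3]
scores_b = [1, 4, 9]
print(round(pearson_r(scores_a, scores_b), 2))     # → 0.99
print(round(spearman_rho(scores_a, scores_b), 2))  # → 1.0
```

The example shows why the two coefficients can differ: the relationship is perfectly monotonic (rho = 1) but not perfectly linear (r < 1).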

Cronbach’s Alpha Coefficient


It can be viable to use Cronbach’s alpha coefficients as measures of inter-rater
reliability. For example, Ockey et al. (2015) used Cronbach’s alpha to determine
how TOEFL iBT speaking scores are related to Japanese university students’ per-
formance on the oral ability components of this test. The authors reported the
Cronbach’s alpha coefficients of three assessment tasks as rated by two trained rat-
ers in tables: the coefficients for the group oral discussion task ranged from 0.65
(pronunciation) to 0.74 (fluency), with an overall reliability estimate of 0.78; the
coefficients for the picture and graph description task ranged from 0.62 (pro-
nunciation) to 0.79 (fluency), with a total reliability estimate of 0.81; and the
coefficients for the oral presentation task ranged from 0.56 (vocabulary/grammar)
to 0.73 (fluency), with a total reliability estimate of 0.72. In this study, it was found
that raters found it more difficult to agree on scores awarded for pronunciation
than for fluency. The Cronbach’s alpha for the TOEFL iBT speaking test was
found to be 0.82.

The Spearman-Brown Coefficient


The Spearman-Brown prophecy formula is usually applied to examine test reli-
ability, particularly when researchers want to know if the reliability of a test will
increase if more items are added. The Spearman-Brown prophecy can also be

FIGURE 15.5 Excerpt from Ch15raters.sav (Data View)

applied to investigate whether two or more raters have assigned holistic scores
similarly to each other in subjective assessments, such as essay writing and speaking
tasks (see Brown, 2005, p. 187, for a discussion of the Spearman-Brown prophecy
formula). Holistic scoring provides an overall impression of a performance (e.g.,
1–5 for poor, 6–10 for average, 11–15 for good, and 16–20 for excellent). The
Spearman-Brown prophecy uses a Pearson correlation as part of its formula.

How to Compute a Spearman-Brown Coefficient in SPSS


The Spearman-Brown coefficient in SPSS is the same as the Spearman-Brown
prophecy coefficient. It can be calculated through the Split-half model in the Reli-
ability menu in SPSS. The file Ch15raters.sav (downloadable from the Companion
Website for this book) can be used to compute this coefficient. Figure 15.5 shows
an excerpt from a worksheet from this data file.

SPSS Instructions: Spearman-Brown Prophecy Coefficient

Click Analyze, next Scale, and then Reliability Analysis (Figure 15.6).

FIGURE 15.6 Accessing the SPSS menu to launch Reliability Analysis

In the resulting Reliability Analysis dialog, move ‘Rater1’ and ‘Rater2’ from the
left-hand pane to the ‘Items’ pane. In the Model drop-down menu, choose
Split-half (Figure 15.7).

Click on the OK button.

Table 15.10 presents the reliability statistics for these data. In this table, the
information on Cronbach’s alpha can be ignored as there are only two items
(two raters). Instead, focus on the ‘Spearman-Brown Coefficient’ rows, and on the
‘Equal Length’ result. According to this output, the Spearman-Brown coefficient

FIGURE 15.7 Reliability Analysis dialog for the split-half analysis

TABLE 15.10 The Spearman-Brown coefficient

Reliability statistics

Cronbach’s Alpha                 Part 1    Value         1.000
                                           N of Items    1a
                                 Part 2    Value         1.000
                                           N of Items    1b
                                 Total N of Items        2
Correlation Between Forms                                .632
Spearman-Brown Coefficient       Equal Length            .774
                                 Unequal Length          .774
Guttman Split-Half Coefficient                           .774

a The items are: Rater1
b The items are: Rater2

between the two raters was 0.77. This output also shows the Guttman Split-Half
coefficient, which is the split-half coefficient used in SPSS and is another index
for inter-rater reliability. On this occasion, it had the same value as the Spearman-
Brown prophecy coefficient. In Table 15.10, the correlation coefficient between
the two raters can be seen to be 0.63. This correlation was used in the calculation
of the Spearman-Brown coefficient. A Spearman-Brown prophecy coefficient of
0.77 indicates that 77% of the variance in the scores assigned by the two raters was shared. In a
medium-stakes test, this coefficient would be acceptable for use in judging stu-
dents’ abilities. However, in a high-stakes situation, a Spearman-Brown coefficient
of 0.90 or higher should be obtained. For the purposes of research, a coefficient

of 0.77 is acceptable for the data to be used to infer students’ performance, and to
perform other inferential statistics.
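The figures in Table 15.10 can be checked by hand. The prophecy formula predicts the reliability of a measure lengthened by a factor n; treating the two raters as a test of double length (n = 2) turns their correlation of 0.632 into the reported coefficient of 0.77. A small sketch (the function name is invented):

```python
def spearman_brown(r, n=2):
    """Predicted reliability of a measure lengthened by a factor n.

    With two raters of equal weight, n=2 converts the between-rater
    correlation into the reliability of their combined scores.
    """
    return n * r / (1 + (n - 1) * r)

# Correlation between forms reported in Table 15.10
print(round(spearman_brown(0.632), 2))  # → 0.77, the coefficient SPSS reports
```

The same function answers the test-lengthening question mentioned earlier: with n set to, say, 1.5, it predicts the reliability of a test made 50% longer with comparable items.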

Cohen’s Kappa for Performance Rating or Qualitative Data Coding
A popular inter-coder reliability estimate is Cohen’s kappa. Cohen’s kappa is
particularly suitable for estimating the reliability of data obtained through cod-
ing. It is also appropriate when researchers use categorical codes such as yes/no,
present/absent, and pass/fail. Cohen’s kappa is computed from a cross-tabulation of
the kind used for chi-square statistics (see Chapter 13): it is the proportion of agreement
between raters or coders with chance agreement subtracted. It is recommended for use over the percentage
agreement discussed earlier.
A piece of research that used Cohen’s kappa was Crossley, Cobb, and McNamara
(2013), in which the authors used software that analyzed word frequency in texts
written by ESL learners and native speakers to classify texts as belonging to one of
four proficiency levels (beginning, intermediate, advanced, native speaker). They
then compared the classifications assigned by the software with the participants’
actual proficiency levels and found absolute agreement of 47.7%, and a moderate
Cohen’s kappa coefficient of 0.37, which indicated that the software was moder-
ately reliable.
As an imaginary extension of Crossley et al., imagine that two raters need to
decide whether 25 ESL learners pass or fail a speaking course, based on a short oral
interview. A cross-tabulation of their pass or fail ratings is shown in Table 15.11.
In this table, it can be seen that the ratings were the same for 18 out of the 25
learners. That is, there were 11 students who were given a ‘pass’ by both raters,
and seven who were given a ‘fail’ by both raters. The raters disagreed on seven
occasions, so their agreement rate was 72% (18 out of 25). This level of agreement
may seem quite high, but Cohen’s kappa coefficient was found to be 0.43 because
it takes ‘chance agreement’ into account in its calculation. This result may appear
small, but is actually considered to indicate a moderate level of agreement between
the two raters. As with other reliability coefficients, a kappa coefficient of 1 indi-
cates perfect agreement between two raters. If there is no agreement between the
raters other than what would be expected by chance, then the kappa coefficient is
less than or equal to 0. A moderate agreement is indicated by a kappa coefficient

TABLE 15.11 Cross-tabulation of pass-fail ratings for 25 ESL learners

                      Rater 1
                      Pass   Fail
Rater 2     Pass      11     4
            Fail      3      7

FIGURE 15.8 Excerpt from Ch15kappa.sav

between 0.4 and 0.6, and a substantial agreement is indicated by a kappa coefficient
between 0.6 and 0.8. A strong agreement is indicated by a kappa of 0.8
or above.
Cohen’s kappa coefficient is influenced by the number of decision options
there are (e.g., two options for pass or fail; three options for poor, average or
good). The more categories there are, the higher the kappa coefficient will be.
This is because the likelihood of chance agreement between two raters drops as the
number of options increases. It is important to note that Cohen’s kappa coefficient
is designed for two raters or coders only, and is most appropriate for analyzing
categorical data that has been coded.
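Using the counts in Table 15.11, the reported kappa of 0.43 can be reproduced from the definition κ = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement and p_e the proportion expected by chance. A sketch (the function name is invented):

```python
def cohens_kappa(table):
    """Cohen's kappa from a square agreement table (rows: rater 2, columns: rater 1)."""
    n = sum(sum(row) for row in table)
    p_observed = sum(table[i][i] for i in range(len(table))) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    p_chance = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

# Table 15.11: rows are Rater 2 (pass, fail), columns are Rater 1 (pass, fail)
ratings = [[11, 4],
           [3, 7]]
print(round(cohens_kappa(ratings), 2))  # → 0.43, as reported above
```

Here p_o = 18/25 = 0.72 but p_e = 0.512, which is why the 72% raw agreement shrinks to a kappa of 0.43 once chance is subtracted.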

How to Compute Cohen’s Kappa in SPSS


In this section, the file Ch15kappa.sav (downloadable from the Companion Web-
site of this book) will be used. Using the example of pass/fail decisions, first set up

the data in three columns: candidate ID, ratings by Rater 1 (0 = fail, 1 = pass), and
ratings by Rater 2 (0 = fail, 1 = pass) as in Figure 15.8.

SPSS Instructions: Cohen’s kappa

Click Analyze, next Descriptive Statistics, and then Crosstabs (Figure 15.9).

When the Crosstabs dialog opens, move ‘Rater 1’ from the left pane to the
Columns field and ‘Rater 2’ to the Rows field, or vice versa (Figure 15.10).

Click the Statistics button, and in the resulting Crosstabs: Statistics dialog,
tick the Kappa checkbox (Figure 15.11).

Click on the Continue button to return to the Crosstabs dialog, and click
the OK button.

FIGURE 15.9 Accessing the SPSS menu to launch Crosstabs for kappa analysis
FIGURE 15.10 Crosstabs dialog

FIGURE 15.11 Crosstabs: Statistics dialog for choosing kappa



TABLE 15.12 Cross-tabulation of pass-fail ratings by raters 1 and 2

                      Rater 1
                      fail    pass    Total
Rater 2     fail      7       3       10
            pass      4       11      15
Total                 11      14      25

TABLE 15.13 Case processing summary for raters 1 and 2

                        Cases
              Valid            Missing          Total
              N     Percent    N     Percent    N     Percent
Rater 2 ∗ Rater 1    25   100.00%    0    0.00%     25   100.00%

TABLE 15.14 Measure of agreement (kappa value)

                                Value   Asymp.        Approx. Tb   Approx. Sig.
                                        std. errora
Measure of Agreement    Kappa   0.43    0.18          2.14         0.03
N of Valid Cases                25

a Not assuming the null hypothesis.
b Using the asymptotic standard error assuming the null hypothesis.

Table 15.12 presents a cross-tabulation of the pass-fail ratings by the two raters.
Table 15.13 presents the case processing summary, which shows that all cases in
the spreadsheet were included. Table 15.14 presents the measure of agreement
(i.e., the kappa coefficient). The kappa coefficient was found to be 0.43.

Intraclass Correlation
The last type of inter-rater reliability estimate to be discussed here is the intraclass
correlation coefficient (ICC), which is suitable for interval data and works for multiple
raters. This statistic indicates the consistency of the raters (i.e., the extent to which
the raters agree on which test takers deserve a high rating and which deserve a low
rating). ICC is useful for helping researchers assess rater quality and check that the
rating scale is interpreted in the same way by different raters. If several raters come
to the same conclusion regarding the rating scale, the scale can be assumed to be
clearly described and unambiguous.

Although the intraclass correlation coefficient is not common in L2 research, there
are some important examples in the literature. Derwing and Munro (2013) computed
intraclass correlations across the ratings of their 34 native English-speaking judges of L2
speakers’ comprehensibility, fluency, and accentedness. They found high intraclass
correlations of 0.96, 0.97, and 0.95 respectively, confirming that native speakers had
very consistent impressions of L2 speakers’ comprehensibility, fluency, and accented-
ness. This meant that their ratings were reliable and could be used for further analyses.
Isaacs, Trofimovich, Yu, and Munoz (2015, see pp. 18–19) employed intraclass
correlations to examine the internal consistency of eight IELTS examiners’ ratings
on the IELTS Speaking scales (i.e., component and overall scores) and seman-
tic differential scales, e.g., Comprehensibility: speech is painstakingly effortful to
understand ( ) <—————> ( ) speech is effortless to understand. As the
researchers predicted, it was found that IELTS examiners agreed strongly when
assigning scores using the IELTS Speaking scales. The intraclass correlations were
found to range from 0.84 to 0.87. However, the intraclass correlations were found
to be weaker for the semantic differential scales compared to the IELTS Speak-
ing scales (they ranged from 0.54 to 0.80). The researchers pointed out that the
examiners received no training on how to use the semantic differential scales, and
that the semantic differential scale lacked level demarcations, except for the scalar
endpoints. These might have been the key reasons for the low agreement among
the IELTS examiners.
Let us examine a simulated data set containing the ratings of two raters (Rater 1
and Rater 2), who rated 10 candidates on two tasks (Task A and Task B) on a scale
of 1–5 (Table 15.15). In this table, the averages for Rater 1 and Rater 2 are similar,
with the ratings within 0.50 points of each other, except for Candidates 5 and 9,
for whom the ratings are quite different. The intraclass correlation coefficient for
this data set was 0.74, which was reasonable. This coefficient, however, may not be
considered high enough for inter-rater reliability in a high-stakes scenario.

TABLE 15.15 Simulated data set for two raters (rater 1 and rater 2)

Candidate TaskARater1 TaskARater2 TaskBRater1 TaskBRater2 Rater1Avg Rater2Avg

1 5 4 4 4 4.50 4.00
2 2 3 3 3 2.50 3.00
3 3 3 2 2 2.50 2.50
4 3 4 3 3 3.00 3.50
5 4 2 3 1 3.50 1.50
6 2 1 1 2 1.50 1.50
7 1 2 1 1 1.00 1.50
8 5 5 4 5 4.50 5.00
9 2 4 2 5 2.00 4.50
10 3 3 2 3 2.50 3.00

How to Compute the Intraclass Correlation in SPSS


To illustrate how to compute intraclass correlation coefficients in SPSS, use the file
Ch15intraclasscorrelation.sav (downloadable from the Companion Website of this
book). The file contains the ratings of three raters for speech act responses by 40
TEP test takers on a scale of 0–3. To calculate the intraclass correlation coefficient,
the average of each rater’s rating for each test taker needs to be calculated, giving
three totals per test taker. These totals are then used to compute the intraclass cor-
relation coefficient.

SPSS Instructions: Intraclass Correlation


In order to perform the analysis, the steps used to obtain Cronbach’s alpha, as
presented earlier, are followed almost exactly.

Click Analyze, next Scale, and then Reliability Analysis (Figure 15.6).

In the Reliability Analysis dialog, move ‘Rater 1 total’, ‘Rater 2 total’, and
‘Rater 3 total’ from the left pane to the ‘Items’ pane (see Figure 15.12).

FIGURE 15.12 Reliability Analysis dialog for raters’ totals as selected variables

FIGURE 15.13 Reliability Analysis: Statistics dialog for intraclass correlation analysis

Click the Statistics button to call up the Reliability Analysis: Statistics dialog.
Then tick the Item, Scale, and Intraclass correlation coefficient checkboxes
(Figure 15.13).

Select Absolute Agreement in the Type drop-down menu (Figure 15.13).

Click the Continue button to return to the Reliability Analysis dialog, then
click on the OK button.

In relation to Figure 15.13, when there are only three raters, selecting Scale if item
deleted is of limited use, because removing any one of three raters is likely to
dramatically reduce reliability. However, if there are more raters (e.g., as in
Derwing & Munro, 2013), selecting Scale if item deleted could be

used to find poorly performing raters, who could then be excluded. Selecting
Absolute agreement in the Type drop-down (see Figure 15.13) renders an estimate
of the level of agreement between raters in addition to the reliability estimate.
Table 15.16 presents the case processing summary output.
The case processing summary in Table 15.16 indicates that the average ratings
of the three raters for all 40 participants were included. Table 15.17 presents the
reliability statistics output (the Cronbach’s alpha coefficient). Cronbach’s alpha is
found to be very high at 0.95, indicating a high degree of consistency among the
raters, which suggests that the rating scale is reliable.
Table 15.18 presents the item statistics output. The item statistics in this case
are average ratings for each rater, and they can be used to ensure that raters do not
diverge too much in their ratings. In this case, all the raters are within 0.35 of a
score level of each other. Table 15.19 presents the intraclass correlation coefficient
output.
Intraclass correlation shows the absolute agreement between raters, rather than
just consistency, as shown by Cronbach’s alpha. This can be relevant because raters
can be consistent without really being in agreement. That is, if one rater system-
atically rates one score level lower than the other raters, consistency will be high
but agreement will be low. Ideally, the Average Measures intraclass correlation (for
several raters) should be similar to Cronbach’s alpha. In this SPSS output, the Single Measures intraclass

TABLE 15.16 The case processing summary output

                      N     %
Cases   Valid         40    100.00
        Excludeda     0     0.00
        Total         40    100.00

a Listwise deletion based on all variables in the procedure.

TABLE 15.17 The reliability estimate output

Cronbach’s Alpha   N of items
0.95               3

TABLE 15.18 The item statistics output

                 Mean   SD     N
Rater 1 total    1.37   0.70   40
Rater 2 total    1.31   0.70   40
Rater 3 total    1.03   0.48   40

TABLE 15.19 The intraclass correlation coefficient

                     Intraclass       95% confidence      F test with true value 0
                     correlationb     interval
                                      Lower     Upper     Value    df1   df2   Sig.
                                      bound     bound
Single Measures      0.79a            0.57      0.89      18.06    39    78    0.00
Average Measures     0.92c            0.80      0.96      18.06    39    78    0.00

a The estimator is the same, whether the interaction effect is present or not.
b Type A intraclass correlation coefficients using an absolute agreement definition.
c This estimate is computed assuming the interaction effect is absent, because it is not estimable otherwise.

correlation does not need to be considered. In a research report, researchers should
include both the Cronbach’s alpha coefficient, which is the measure of rater consistency,
and the intraclass correlation based on absolute agreement.
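The difference between consistency and absolute agreement can be demonstrated with a small sketch. The functions below follow the standard two-way average-measures definitions (ICC(C,k) for consistency, which is equivalent to Cronbach's alpha, and ICC(A,k) for absolute agreement, as selected in Figure 15.13); the three-candidate data set is invented, with the second rater always scoring exactly one point above the first. This mirrors Table 15.19, where the absolute-agreement Average Measures value (0.92) sits slightly below Cronbach's alpha (0.95):

```python
def anova_mean_squares(ratings):
    """Two-way mean squares for a subjects-by-raters table of scores."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(col) / n for col in zip(*ratings)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return ms_rows, ms_cols, ms_error

def icc_consistency_k(ratings):
    """Average-measures consistency ICC; equivalent to Cronbach's alpha."""
    ms_rows, _, ms_error = anova_mean_squares(ratings)
    return (ms_rows - ms_error) / ms_rows

def icc_agreement_k(ratings):
    """Average-measures absolute-agreement ICC."""
    n = len(ratings)
    ms_rows, ms_cols, ms_error = anova_mean_squares(ratings)
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)

# Invented data: rater 2 always scores exactly one point above rater 1,
# so the raters are perfectly consistent but never in absolute agreement.
ratings = [[1, 2], [2, 3], [3, 4]]
print(icc_consistency_k(ratings))  # → 1.0
print(icc_agreement_k(ratings))    # → 0.8
```

The systematically harsher first rater leaves consistency perfect while pulling absolute agreement down, which is exactly the scenario described above.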

The Standard Error of Measurement (SEM)


It is useful for L2 researchers to be familiar with the concept of the standard error of
measurement (SEM). SEM is particular useful when language tests are used as part
of a research project. It is related to the reliability of a research instrument, but it
is applied to an individual participant. In principle, if the reliability estimate of a
language test is 1, the SEM is 0 by default. This means that there is no measure-
ment error associated with observed scores or data. Therefore, if a student obtains
30 out of 40, that student’s actual score is 30. However, if the reliability estimate
is 0.85, there is a possibility that the student’s score is not 30. SEM is used to help
researchers identify a score range for an individual and is developed around the
principle of a normal distribution, which was discussed in Chapter 3. To compute
an SEM, researchers need to have two statistics: the standard deviation of the mean
score (SD), and the reliability estimate of the research instrument or test (α). An
SEM can be computed as follows:

SEM = SD × √(1 − α)

For example, if the standard deviation of a test is 7 and the reliability coefficient
or Cronbach’s alpha is 0.85, the SEM can be computed as follows:

SEM = 7 × √(1 − 0.85) = 7 × √0.15 ≈ 7 × 0.39 ≈ 2.73

If a test taker obtains a score of 30 out of 40, the SEM can be added to and
subtracted from the raw score. This allows a computation of the lower and

upper bound of the test taker’s score at a 68% confidence band. In this case, the test
taker’s score could fall between 27.27 and 32.73. Knowing that the student’s score
could be within this range provides some confidence about the use of the test
score in research. If the SEM is found to be larger, say 5, this test taker’s score could
lie between 25 and 35, so this test would not be considered precise: if the score is
used for research, there is a chance that statistical inferences based on it would be
inaccurate. According to this discussion, a highly reliable test will produce a small
SEM value, which means that there is little error in measurement.
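The worked example can be scripted as follows (the function name is invented). Note that carrying √0.15 at full precision gives an SEM of about 2.71; the chapter's 2.73 comes from rounding √0.15 to 0.39 before multiplying:

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM = SD × √(1 − reliability)."""
    return sd * sqrt(1 - reliability)

sem = standard_error_of_measurement(7, 0.85)
score = 30
print(round(sem, 2))                                 # → 2.71 at full precision
print(round(score - sem, 2), round(score + sem, 2))  # 68% confidence band
```

A perfectly reliable test (reliability = 1) gives an SEM of 0, matching the principle stated above.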

Factors Affecting Reliability Estimates


In practice, there are multiple interrelated factors that affect reliability estimates
(see Brown, 2014; Phakiti, 2014). Such factors include:

1. The nature of the research construct of interest. Some constructs are less complex
than others, and so can be assessed more reliably than others. For example,
vocabulary knowledge can be measured by a vocabulary test more reliably
than by a general language proficiency test, which involves the ability to use
several language skills.
2. The quality of an instrument or coding scheme. A good research instrument that
is developed on the basis of robust theories and clearly defined and well-
informed test specifications is likely to have high reliability. An instrument
that is tested, piloted, and/or validated is likely to have a high reliability esti-
mate. The same applies to a data coding scheme or performance rating scale,
which needs to be developed and revised carefully before its actual use. Raters
or coders need to be trained in understanding descriptors and practice rating
different performance levels and coding qualitative data. This type of quality
control is likely to lead to a high inter-rater/coder reliability estimate. There-
fore, any research instruments that do not involve any of these elements are
unlikely to be very reliable.
3. Objective versus subjective scoring. Generally, it is easier to attain high reliability
when tests or instruments are scored objectively. Multiple-choice tests are a
typical example of objective scoring. Subjective scoring—such as the rating
of writing or speaking, or coding of qualitative data—requires more training
of raters or coders to ensure intra- and inter-rater/coder reliability.
4. Sample size. A large sample size can lead to more reliable measures. This means
that in a language test there will be a greater chance of having high-ability,
intermediate-ability, and low-ability learners in the same data set. A small
sample size often leads to restricted data ranges, which in turn affects observa-
tions of consistency.
5. The range of ability or attributes of participants. A motivation questionnaire that
is given to highly motivated learners only is likely to produce a low reliability
estimate because there is low variance in the data set. A well-developed lan-

guage proficiency test that is given to low-ability learners only is also likely
to produce a low reliability estimate.
6. Length of an instrument. The longer the test/questionnaire, the more reliable
it is. However, it is essential that items or questions are carefully designed to
measure the target construct, as tests or questionnaires that are very long will be
a burden for participants and can be less practical and more expensive to use.

Instrument Reliability Versus Research Validity


Reliability estimates are blind to the attribute under measurement. Reliability tells
researchers only if their test or instrument measures something consistently. It does
not tell them whether their test or instrument is measuring what they intend to
measure. To be sure of the validity of an instrument, validation work is required.
Test reliability analysis is only one aspect of validity analysis (see Chapelle, Enright,
& Jamieson, 2008; Kane, 2006; Messick, 1989).

Summary
This chapter has presented the important concept of reliability, which is inte-
gral to solid quantitative research. There are several measures of reliability. The
choice of reliability measure will be influenced by the source of the data being
analyzed; it may have been collected from research instruments such as tests and
questionnaires, or from subjective data coding by human beings. This chapter has
illustrated how to compute Cronbach’s alpha, the Spearman-Brown prophecy
coefficient, Cohen’s kappa, and intraclass correlations in SPSS. There are other
statistical methods that are not presented in this chapter (see Resources on Methods
for Second Language Research in the Epilogue for a list of publications that deal
with these methods). The SEM and factors affecting the reliability of research
instruments and data coding or rating have been discussed. Finally, the relation-
ship between reliability and validity has also been addressed in this chapter. In the
context of L2 research, reliability is important because it is a prerequisite condi-
tion for research validity. That is, research findings cannot be valid if the research
instruments being used are not reliable. The Epilogue, which concludes this book,
follows.

Review Exercises
To download review questions and exercises for this chapter, visit the Companion
Website: www.routledge.com/cw/roever.
EPILOGUE

This book has outlined the methodology required for sound quantitative research.
It has provided the reader with the essential tools needed to perform quantitative
research for many different purposes, and with various data types. The steps required
to produce reliable and valid results have been described. Researchers carry a heavy
responsibility as the results of their work may have real-world consequences for a
number of stakeholders. For this reason, this book should be viewed as the founda-
tion on which further training in research methodology should be built.

Resources on Methods for L2 Research


The following books and resources are recommended for further reading.

General Research Methods and Designs in L2 Research


Since the current book does not focus on research approaches and methods for L2
research, the following are useful and practical resources for further study.

Dörnyei, Z. (2007). Research methods in applied linguistics: Quantitative, qualitative, and mixed
methodologies. Oxford: Oxford University Press.
This book presents research methodologies and many other important considerations
in applied linguistics. It focuses on the various stages of academic research in applied
linguistics.
Mackey, A., & Gass, S. M. (Eds.). (2012). Research methods in second language acquisition: A
practical guide. Malden, MA: Wiley-Blackwell.
This edited volume focuses on methodological issues in L2 acquisition research related
to data collection and data analysis.

Mackey, A., & Gass, S. M. (2015). Second language research: Methodology and design (2nd ed.).
London: Routledge.
This book describes the essential principles of research methodology, methods and tech-
niques in L2 research.
Paltridge, B., & Phakiti, A. (Eds.). (2015). Research methods in applied linguistics: A practical
resource. London: Bloomsbury.
This edited volume presents several types of research in applied linguistics (e.g., general
quantitative, qualitative and mixed methods research, experimental research, and survey
research). The areas of research it focuses on include language skills, LTA, and classroom
practice.

Quantitative Methods
The following books present more sophisticated statistical concepts that are not
covered in the current book.

Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge
University Press.
This book discusses the basic concepts of quantitative data analysis for language assess-
ment purposes. It presents several statistical formulae and reliability analysis for various
types of language tests in a comprehensive fashion.
Field, A. (2013). Discovering statistics using IBM SPSS statistics (4th ed.). Los Angeles: Sage.
This book presents a comprehensive treatment of statistical analysis through the use of
IBM SPSS for Windows and MacOS. Many examples from educational and psychologi-
cal research are provided.
Larson-Hall, J. (2010). A guide to doing statistics in second language research using SPSS. New
York: Routledge.
This book is comprehensive in its treatment of statistics in L2 research. It presents both
statistical formulae and provides conceptual explanations with examples of studies in
applied linguistics. It also describes procedures that should be followed to perform sta-
tistical analyses in SPSS that are not covered in the current book.
Larson-Hall, J. (2016). A guide to doing statistics in second language research using SPSS and R
(2nd ed.). New York: Routledge.
This book shows how to use SPSS and R to analyze quantitative data in SLA research.
It covers common statistical tests, such as correlations, t-tests, and analysis of variance.
Lowie, W., & Seton, B. (2013). Essential statistics for applied linguistics. Hampshire, UK: Pal-
grave Macmillan.
This book presents both descriptive and inferential statistics in applied linguistics. It
presents how to use SPSS for common statistical tests with examples of how a particular
analysis can be done.
Norris, J. M., Ross, S. J., & Schoonen, R. (Eds.). (2015). Currents in language learning
series: Improving and extending quantitative reasoning in L2 research. Language Learn-
ing, 65(S1), v–vi, 1–264.
This supplement to Language Learning is a state-of-the-art volume designed to advance
the quality of future quantitative methodology in L2 research. The first section of this
volume contains chapters on current challenges in L2 research, and the second section
includes chapters on alternatives, advances, and the future of L2 quantitative methodol-
ogy. The authors of the chapters in this volume identify specific problems in L2 quanti-
tative research and recommend solutions to such problems, so that quantitative research
practices can meet the required assumptions of quantitative methodology.
Porte, G. K. (2010). Appraising research in second language learning: A practical approach to critical
analysis of quantitative research (2nd ed.). Amsterdam and Philadelphia: John Benjamins
Publishing Company.
This book provides guidance on how to evaluate quantitative research articles. It
explains the different components of research articles; in particular, it illustrates how to
interpret and evaluate quantitative findings.
Plonsky, L. (Ed.). (2015). Advancing quantitative methods in second language research. New York:
Routledge.
This edited volume is at a more advanced level than the current book, and extends the
content of the current book. It includes, for example, chapters on statistical power and
p-values, mixed effects modeling and longitudinal data analysis, cluster analysis, explor-
atory factor analysis, Rasch analysis, structural equation modeling, and the Bayesian
approach to hypothesis testing.
Woodrow, L. (2014). Writing quantitative research in applied linguistics. New York: Palgrave
Macmillan.
This book focuses on strategies that can be useful when writing quantitative research
reports. It illustrates how to write about specific statistical procedures and findings (e.g.,
t-tests, ANOVA, correlations, and nonparametric tests).

Key Academic Journals (Published in English)


The following are highly regarded journals that publish studies in L2 learning
and assessment and applied linguistics. These journals are published in English
only and hence the list cannot be considered comprehensive. The journals are
presented alphabetically.

• Applied Linguistics (http://applij.oxfordjournals.org/) publishes research on
language and language-related concerns in the various areas of applied
linguistics.
• Language Assessment Quarterly (www.tandfonline.com/toc/hlaq20/current)
publishes articles related to language assessment, language test development,
and testing practice.
• Language Learning (http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)
1467-9922) publishes research related to language learning and acquisition. It
includes systematic research reviews and research related to language teaching
and assessment.
• Language Testing (http://ltj.sagepub.com/) publishes theoretical, methodological
and empirical articles in the area of LTA.

• Second Language Research (http://slr.sagepub.com/) publishes theoretical and
experimental studies in SLA and L2 performance.
• Studies in Second Language Acquisition (http://journals.cambridge.org/action/
displayJournal?jid=SLA) publishes research and theoretical or methodological
discussion in second and foreign language acquisition of any language.
• TESOL Quarterly (www.tesol.org/read-and-publish/journals/tesol-quarterly)
publishes research related to English as a second or foreign language learning
and teaching.
• The Modern Language Journal (http://onlinelibrary.wiley.com/journal/10.1111/
(ISSN)1540-4781) publishes research and/or discussion articles about foreign
and L2 learning and teaching.

Applied Linguistics and L2 Research Associations


Finally, the activities of the various academic applied linguistics associations
provide a way not only to learn about research methods, but also to extend existing
research networks. The associations are also keen to promote an awareness of
current research areas. They have their own annual conferences, research seminars
and workshops.

• American Association for Applied Linguistics (AAAL): www.aaal.org/


• Applied Linguistics Association of Australia (ALAA): www.alaa.org.au/
• Applied Linguistics Association of New Zealand (ALANZ): www.alanz.ac.nz/
• British Association for Applied Linguistics (BAAL): www.baal.org.uk/
• Canadian Association for Applied Linguistics: www.aclacaal.org/AccueilAn.htm
• International Association of Applied Linguistics (AILA): www.aila.info/en/
• Language Testing Research Colloquium (LTRC): www.iltaonline.com/index.php/en/
REFERENCES

Alderson, J. C. (2000). Assessing reading. Cambridge: Cambridge University Press.


American Psychological Association (APA). (2010). Publication manual of the American Psy-
chological Association (6th ed.). Washington, DC: American Psychological Association.
Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge
University Press.
Bachman, L. F., & Kunnan, A. J. (2005). Statistical analyses for language assessment workbook
and CD ROM. Cambridge: Cambridge University Press.
Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford: Oxford
University Press.
Bell, N. (2012). Comparing playful and nonplayful incidental attention to form. Language
Learning, 62(1), 236–265.
Blair, E., & Blair, J. (2015). Applied survey sampling. Thousand Oaks: Sage.
Brown, J. D. (2005). Testing in language programs. New York: McGraw Hill.
Brown, J. D. (2011). Likert items and scales of measurement. SHIKEN: JALT Testing &
Evaluation SIG Newsletter, 15(1), 10–14.
Brown, J. D. (2014). Classical theory reliability. In A. J. Kunnan (Ed.), Companion to lan-
guage assessment (pp. 1165–1181). Oxford, UK: John Wiley & Sons.
Carifio, J., & Perla, R. J. (2007). Ten common misunderstandings, misconceptions, persis-
tent myths and urban legends about Likert scales and Likert response formats and their
antidotes. Journal of Social Sciences, 3(3), 106–116.
Carr, N. (2011). Designing and analysing language tests. Oxford: Oxford University Press.
Chapelle, C. A., Enright, M. K., & Jamieson, J. (Eds.). (2008). Building a validity argument
for the test of English as a foreign language. London: Routledge.
Cho, Y., & Bridgeman, B. (2012). Relationship of TOEFL iBT® scores to academic perfor-
mance: Some evidence from American universities. Language Testing, 29(3), 421–442.
Clark, L. A., & Watson, D. B. (1995). Constructing validity: Basic issues in objective scale
development. Psychological Assessment, 7, 309–319.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Newbury Park, CA: Sage.
Cook, R. D., & Weisberg, S. (1983). Diagnostics for heteroscedasticity in regression.
Biometrika, 70(1), 1–10.

Coombe, C. A., Davidson, P., O’Sullivan, B., & Stoynoff, S. (Eds.). (2012). Cambridge guide
to second language assessment. Cambridge: Cambridge University Press.
Corder, G. W., & Foreman, D. I. (2009). Non-parametric statistics for non-statisticians. Hoboken,
NJ: John Wiley.
Council of Europe (2001). Common European framework of reference for languages: Learning,
teaching, assessment. Cambridge: Cambridge University Press.
Crossley, S. A., Cobb, T., & McNamara, D. S. (2013). Comparing count-based and band-
based indices of word frequency: Implications for active vocabulary research and peda-
gogical applications. System, 41(4), 965–982.
Derwing, T. M., & Munro, M. J. (2013). The development of L2 oral language skills in two
L1 groups: A 7-year study. Language Learning, 63(2), 163–185.
Di Silvio, F., Donovan, A., & Malone, M. E. (2014). The effect of study abroad homestay
placements: Participant perspectives and oral proficiency gains. Foreign Language Annals,
47(1), 168–188.
Doolan, S. M., & Miller, D. (2012). Generation 1.5 written error patterns: A comparative
study. Journal of Second Language Writing, 21(1), 1–22.
Dörnyei, Z., & Taguchi, T. (2010). Questionnaires in second language research. London:
Routledge.
Douglas, D. (2010). Understanding language testing. London: Hodder Education.
Eisenhauer, J. G. (2008). Degrees of freedom. Teaching Statistics, 30(3), 75–78.
Elder, C., Knoch, U., & Zhang, R. (2009). Diagnosing the support needs of second lan-
guage writers: Does the time allowance matter? TESOL Quarterly, 43(2), 351–360.
Ellis, R. (2015). Understanding second language acquisition. Oxford: Oxford University Press.
Field, A. (2013). Discovering statistics using IBM SPSS statistics (4th ed.). Los Angeles: Sage.
Fulcher, G. (2010). Practical language testing. London: Hodder Education.
Furr, R. M. (2010). Yates correction. In N. J. Salkind (Ed.), Encyclopedia of research design
(Vol. 3, pp. 1645–1648). Los Angeles: Sage.
Fushino, K. (2010). Causal relationships between communication confidence, beliefs about
group work, and willingness to communicate in foreign language group work. TESOL
Quarterly, 44(4), 700–724.
Gass, S. M. with Behney, J., & Plonsky, L. (2013). Second language acquisition: An introductory
course (4th ed.). New York and London: Routledge.
Gass, S., Svetics, I., & Lemelin, S. (2003). Differential effects of attention. Language Learning,
53(3), 497–545.
Green, A. (2014). Exploring language assessment and testing: Language in action. New York:
Routledge.
Greenhouse, S. (1990). Yates’s correction for continuity and the analysis of 2×2 contingency
tables: Comment. Statistics in Medicine, 9(4), 371–372.
Guo, Y., & Roehrig, A. D. (2011). Roles of general versus second language (L2) knowledge
in L2 reading comprehension. Reading in a Foreign Language, 23(1), 42–64.
Haviland, M. G. (1990). Yates’s correction for continuity and the analysis of 2×2 contin-
gency tables. Statistics in Medicine, 9(4), 363–367.
House, J. (1996). Developing pragmatic fluency in English as a foreign language: Routines
and metapragmatic awareness. Studies in Second Language Acquisition, 18(2), 225–252.
Hudson, T., & Llosa, L. (2015). Design issues and inference in experimental L2 research.
Language Learning, 65(S1), 76–96.
Huff, D. (1954). How to lie with statistics. New York: Norton.

Isaacs, T., Trofimovich, P., Yu, G., & Munoz, B. M. (2015). Examining the linguistic aspects
of speech that most efficiently discriminate between upper levels of the revised IELTS
Pronunciation scale. IELTS Research Report, 4, 1–48.
Jamieson, S. (2004). Likert scales: How to (ab)use them. Medical Education, 38(12),
1212–1218.
Jia, F., Gottardo, A., Koh, P. W., Chen, X., & Pasquarella, A. (2014). The role of accultura-
tion in reading a second language: Its relation to English literacy skills in immigrant
Chinese adolescents. Reading Research Quarterly, 49(2), 251–261.
Kane, M. (2006). Validation. In R. Brennan (Ed.), Educational measurement (4th ed.,
pp. 17–64). Westport, CT: Greenwood Publishing.
Keith, T. Z. (2003). Validity of automated essay scoring systems. In M. D. Shermis &
J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 147–167).
Hillsdale, NJ: Lawrence Erlbaum Associates.
Kahng, J. (2014). Exploring utterance and cognitive fluency of L1 and L2 English speakers:
Temporal measures and stimulated recall. Language Learning, 64(4), 809–854.
Ko, M. H. (2012). Glossing and second language vocabulary learning. TESOL Quarterly,
46(1), 56–79.
Kormos, J., & Trebits, A. (2012). The role of task complexity, modality and aptitude in
narrative task performance. Language Learning, 62(2), 439–472.
Kunnan, A. J. (Ed.). (2014). The companion to language assessment. Oxford, UK: John Wiley
& Sons.
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A
practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 863.
Larson-Hall, J. (2010). A guide to doing statistics in second language research using SPSS. New
York: Routledge.
Larson-Hall, J. (2016). A guide to doing statistics in second language research using SPSS and R
(2nd ed.). New York: Routledge.
Laufer, B., & Waldman, T. (2011). Verb-noun collocations in second language writing: A
corpus analysis of learners’ English. Language Learning, 61(2), 647–672.
Laufer, B., & Rozovski-Roitblat, B. (2011). Incidental vocabulary acquisition: The effects of task
type, word occurrence and their combination. Language Teaching Research, 15(4), 391–411.
Lee, C. H., & Kalyuga, S. (2011). Effectiveness of different pinyin presentation formats
in learning Chinese characters: A cognitive load perspective. Language Learning, 61(4),
1099–1118.
Lightbown, P. M., & Spada, N. (2013). How languages are learned (4th ed.). Oxford: Oxford
University Press.
Liu, D. (2011). The most frequently used English phrasal verbs in American and British
English: A multicorpus examination. TESOL Quarterly, 45(4), 661–688.
Macaro, E. (2010). Continuum companion to second language acquisition. London: Continuum.
Mackey, A., & Gass, S. M. (2015). Second language research: Methodology and design (2nd ed.).
London: Routledge.
Mantel, N. (1990). Yates’s correction for continuity and the analysis of 2×2 contingency
tables: Comment. Statistics in Medicine, 9(4), 369–370.
Matsumoto, M. (2011). Second language learners’ motivation and their perception of their
teachers as an affecting factor. New Zealand Studies in Applied Linguistics, 17(2), 37–52.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103).
New York: Macmillan.

Miller, G. A., & Chapman, J. P. (2001). Misunderstanding analysis of covariance. Journal of
Abnormal Psychology, 110(1), 40–48.
Mora, J. C., & Valls-Ferrer, M. (2012). Oral fluency, accuracy, and complexity in formal
instruction and study abroad learning contexts. TESOL Quarterly, 46(4), 610–641.
Norris, J. M., Ross, S. J., & Schoonen, R. (Eds.). (2015). Improving and extending quan-
titative reasoning in second language research. Language Learning, 65(S1), v–vi, 1–264.
Ockey, G. J., Koyama, D., Setoguchi, E., & Sun, A. (2015). The extent to which TOEFL
iBT speaking scores are associated with performance on oral ability components for
Japanese university students. Language Testing, 32(1), 39–62.
Ortega, L. (2009). Understanding second language acquisition. London: Hodder.
Paltridge, B., & Phakiti, A. (Eds.). (2015). Research methods in applied linguistics: A practical
resource. London: Bloomsbury.
Pawlak, M., & Aronin, L. (2014). Essential topics in applied linguistics and multilingualism: Stud-
ies in honor of David Singleton. New York, NY: Springer.
Phakiti, A. (2006). Modeling cognitive and metacognitive strategies and their relationships
to EFL reading test performance. Melbourne Papers in Language Testing, 1(1), 53–96.
Phakiti, A. (2014). Experimental research methods in language learning. London: Bloomsbury.
Phakiti, A., Hirsh, D., & Woodrow, L. (2013). It’s not only English: Effects of other indi-
vidual factors on English language learning and academic learning of ESL Interna-
tional students in Australia. Journal of Research in International Education, 12(3), 239–258.
doi:10.1177/1475240913513520.
Phakiti, A., & Li, L. (2011). General academic difficulties and reading and writing dif-
ficulties among Asian ESL postgraduate students in TESOL at an Australian university.
RELC Journal, 42(3), 227–264.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and report-
ing practices in quantitative L2 research. Studies in Second Language Acquisition, 35(4),
655–687.
Plonsky, L. (2014). Study quality in quantitative L2 research (1990–2010): A methodologi-
cal synthesis and call for reform. The Modern Language Journal, 98(1), 450–470.
Plonsky, L., & Gass, S. (2011). Quantitative research methods, study quality, and outcomes:
The case of interaction research. Language Learning, 61(2), 325–366.
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes in L2
research. Language Learning, 64(4), 878–912.
Purpura, J. E. (2011). Quantitative research methods in assessment and testing. In E. Hinkel
(Ed.), Handbook of research in second language teaching and learning Vol. 2 (pp. 731–751).
London: Routledge.
Purpura, J. E. (2016). Second and foreign language assessment. The Modern Language Jour-
nal, 100(S), 190–208.
Qian, D. (2002). Investigating the relationship between vocabulary knowledge and academic
reading performance: An assessment perspective. Language Learning, 52(3), 513–536.
Read, J. (2000). Assessing vocabulary. Cambridge: Cambridge University Press.
Read, J. (2015). Researching language testing and assessment. In B. Paltridge, & A. Phakiti
(Eds.), Research methods in applied linguistics: A practical resource (pp. 471–486). London:
Bloomsbury.
Roever, C. (1995). Routine formulae in acquiring English as a foreign language. Unpub-
lished raw data.
Roever, C. (2005). Testing ESL pragmatics. Frankfurt: Peter Lang.
Roever, C. (2006). Validation of a web-based test of ESL pragmalinguistics. Language Testing,
23(2), 229–256.
Roever, C. (2012). What learners get for free: Learning of routine formulae in ESL and EFL
environments. ELT Journal, 66(1), 10–21.
Rutherford, A. (2011). ANOVA and ANCOVA: A GLM approach. Oxford: John Wiley &
Sons.
Scheaffer, R. L., Mendenhall, W., Ott, R. L., & Gerow, K. G. (2012). Elementary survey sam-
pling. Boston: Brooks/Cole.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental
designs for generalized causal inference. Boston: Houghton Mifflin.
Shintani, N., Ellis, R., & Suzuki, W. (2014). Effects of written feedback and revision on
learners’ accuracy in using two English grammatical structures. Language Learning, 64(1),
103–131.
Stevens, J. P. (2012). Applied multivariate statistics for the social sciences (5th ed.). New York:
Routledge.
Tabachnick, B. G., & Fidell, L. S. (2012). Using multivariate statistics. Boston: Pearson.
Taguchi, N., & Roever, C. (2017). Second language pragmatics. Oxford: Oxford University
Press.
Weir, C. J. (2003). Language testing and validation: An evidence-based approach. New York,
NY: Macmillan.
Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of
automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13.
Yang, Y., Buckendahl, C. W., Juszkewicz, P. J., & Bhola, D. S. (2002). A review of strategies
for validating computer-automated scoring. Applied Measurement in Education, 15(4),
391–412.
KEY RESEARCH TERMS IN
QUANTITATIVE METHODS

Alternative hypothesis (H1): A statistical hypothesis that is complementary to
the null hypothesis. This hypothesis is usually related to researchers’ expectation
of research findings.
Analysis of covariance (ANCOVA): A type of analysis of variance (ANOVA)
that attempts to minimize the effect of intervening variables on the dependent
variable.
Analysis of variance (ANOVA): A parametric test that functions similarly to
the independent-samples t-test, but instead of two groups, it can examine the
differences among three or more groups.
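As a supplementary illustration (not from the book, which relies on SPSS), the F statistic behind a one-way ANOVA can be sketched in plain Python; the three proficiency groups and their scores below are invented:

```python
# Hypothetical sketch: the F statistic for a one-way ANOVA, computed by
# hand. SPSS would also report the p-value for F at these degrees of freedom.
import statistics

def one_way_f(groups):
    """F = mean square between groups / mean square within groups."""
    all_scores = [s for g in groups for s in g]
    grand_mean = statistics.mean(all_scores)
    k, n = len(groups), len(all_scores)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum((s - statistics.mean(g)) ** 2 for g in groups for s in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

beginners = [55, 60, 58, 57]
intermediate = [65, 63, 68, 66]
advanced = [75, 72, 78, 74]
print(round(one_way_f([beginners, intermediate, advanced]), 2))  # 59.95
```

A large F suggests that at least one group mean differs from the others; a post hoc test then identifies which groups differ.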
Bar graph: Also called a bar chart, this is a diagram for displaying the compara-
tive sizes of values of an independent variable.
Between-groups design: An experimental research design that places different
groups of learners in different conditions and tests whether an outcome vari-
able differs significantly among the different conditions.
Categorical data: Nominal data that are sorted into categories (e.g., men versus
women; Form A versus Form B versus Form C).
Chi-square test (also written as ‘χ² test’): A type of inferential statistics for
analyzing nominal data. It is used to determine whether two nominal variables
are associated or related to each other.
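To make the computation concrete, here is a minimal Python sketch (not from the book) that derives the χ² statistic for an invented 2×2 table, such as gender by pass/fail:

```python
# Hypothetical sketch: chi-square statistic for a contingency table.
# chi2 = sum over cells of (observed - expected)^2 / expected, where each
# expected count is (row total * column total) / grand total.

def chi_square(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Invented data: 30 men (20 pass, 10 fail) and 30 women (12 pass, 18 fail)
print(round(chi_square([[20, 10], [12, 18]]), 2))  # 4.29, with df = 1
```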
Coefficient of determination (denoted by R²): The Pearson correlation
coefficient squared. It is expressed as a percentage and describes how much of
the variance in one variable is explained by the other variable.
Cohen’s d: A statistical effect size for t-tests or group comparisons. It indicates
the magnitude of group differences (small, medium or large). Unlike a correlation
coefficient, Cohen’s d has no upper bound; a value above 0.8 is considered large.
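A minimal Python sketch (not from the book) of Cohen’s d for two independent groups, using the pooled standard deviation; all scores are invented:

```python
# Hypothetical sketch: Cohen's d = (mean1 - mean2) / pooled SD.
import statistics

def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    mean_diff = statistics.mean(group1) - statistics.mean(group2)
    pooled_var = ((n1 - 1) * statistics.variance(group1)
                  + (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
    return mean_diff / pooled_var ** 0.5

treatment = [14, 16, 15, 18, 17]
control = [12, 13, 11, 14, 12]
print(round(cohens_d(treatment, control), 2))  # 2.61, a large effect
```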

Cohen’s kappa: An inter-rater or inter-coder reliability measure. Cohen’s kappa
is appropriate when researchers use categorical codes, such as yes/no, present/
absent and pass/fail.
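A small Python sketch (not from the book) for two raters who assigned pass/fail codes to the same ten performances; the codes are invented. Kappa corrects the raw agreement rate for the agreement expected by chance alone:

```python
# Hypothetical sketch: Cohen's kappa = (observed - chance) / (1 - chance).
from collections import Counter

def cohens_kappa(rater1, rater2):
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    counts1, counts2 = Counter(rater1), Counter(rater2)
    # Chance agreement from each rater's marginal proportions per category
    chance = sum(counts1[c] * counts2[c] for c in counts1) / n ** 2
    return (observed - chance) / (1 - chance)

r1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
r2 = ["pass", "pass", "fail", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
print(round(cohens_kappa(r1, r2), 2))  # 0.58, despite 80% raw agreement
```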
Construct: A research feature of interest that cannot be observed directly (e.g.,
language proficiency, motivation, anxiety, and beliefs).
Contiguity problem: This problem occurs whenever cut-off points are set arbi-
trarily, so that adjacent data points are classified into different groups.
Convenience sampling: A sampling technique in which researchers use easily
obtainable participants for their research (e.g., a group of students they are teach-
ing). Convenience samples are often not representative of a larger population.
Correlation coefficient (r): A measure of the strength of a linear relationship
between two variables. It has a value between –1 and 1.
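As a supplementary sketch (not from the book), Pearson’s r can be computed directly from its definition; the vocabulary and reading scores for six learners below are invented. Squaring r gives the coefficient of determination (R²):

```python
# Hypothetical sketch: Pearson's r = covariance term / sqrt(SSx * SSy).
import math

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

vocab = [2000, 3500, 4000, 5500, 6000, 7000]
reading = [45, 60, 50, 70, 68, 80]
r = pearson_r(vocab, reading)
print(round(r, 2), round(r ** 2, 2))  # a strong positive correlation
```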
Correlational research: A type of research design which aims to investigate to
what extent two variables are related to each other.
Cronbach’s alpha (α): A standard measure of reliability for tests and question-
naires. Cronbach’s alpha tends to be higher if samples are heterogenous, there
are many items, and the items correlate strongly with one another.
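A minimal Python sketch (not from the book) of the standard formula, α = k/(k−1) × (1 − Σ item variances / variance of total scores); the four Likert items answered by six respondents are invented:

```python
# Hypothetical sketch: Cronbach's alpha from item-level scores.
import statistics

def cronbach_alpha(item_scores):
    """item_scores: one list per item, scores in the same respondent order."""
    k = len(item_scores)
    sum_item_vars = sum(statistics.variance(item) for item in item_scores)
    totals = [sum(person) for person in zip(*item_scores)]
    return k / (k - 1) * (1 - sum_item_vars / statistics.variance(totals))

items = [
    [4, 3, 5, 2, 4, 3],
    [5, 3, 4, 2, 4, 4],
    [4, 2, 5, 3, 5, 3],
    [4, 3, 4, 2, 3, 3],
]
print(round(cronbach_alpha(items), 2))  # 0.90
```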
Data: Information collected for research. The word data is treated as plural in
quantitative research reports (e.g., the data are/were analyzed).
Degrees of freedom (df): A statistical approach for correcting a sample size for
statistical calculations. It has critical implications for statistical results when a
sample size is small.
Dependent variables: Variables (e.g., test performance) that are affected by
independent variables. Dependent variables are also known as outcome variables.
Dichotomous data: This term describes nominal data that can have only two
possible values (e.g., pass/fail; international student /domestic student; correct/
incorrect; 0/1).
Effect size: A magnitude of a relationship between two variables or a difference
among groups. For example, the correlation coefficient and R2 are measures of
effect size in correlational analysis. Cohen’s d is an effect size index for a t-test.
Eta squared (η²) or partial eta squared (partial η²): The effect size in
ANOVA and ANOVA-related tests. The η² effect size can never exceed the
value of 1.
Experimental research: A type of research design that aims to test a hypothesis
of the effect of one or more factors on another factor, under strictly controlled
conditions (e.g., the effects of feedback on writing accuracy, the effect of a
particular teaching approach on language learning). Also, participants must be
randomly assigned to conditions. Experimental research can allow researchers
to investigate causal relationships among independent and dependent variables.

External validity: The generalization of the current research findings of a quantitative
study to other participants or populations, and other contexts or settings.
Frequency count: The quantification of occurrences or frequencies of a feature
(e.g., number of males and females).
Hypothesis testing: A process of inferential statistics for assessing how well the
data support a null hypothesis and whether the null hypothesis can be rejected
with minimal error.
Independent-samples t-test: A parametric test for comparing the mean scores
of two different groups of participants. This test is called ‘independent’ because
no member of one group is also a member of the other.
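A supplementary Python sketch (not from the book) of the t statistic, assuming equal variances; the study-abroad and at-home scores are invented. SPSS would also run Levene’s test and report the p-value:

```python
# Hypothetical sketch: independent-samples t = mean difference / standard
# error, with the standard error based on the pooled variance.
import math
import statistics

def independent_t(group1, group2):
    n1, n2 = len(group1), len(group2)
    mean_diff = statistics.mean(group1) - statistics.mean(group2)
    pooled_var = ((n1 - 1) * statistics.variance(group1)
                  + (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return mean_diff / se  # evaluated against t with n1 + n2 - 2 df

study_abroad = [72, 68, 75, 70, 74]
at_home = [65, 70, 63, 66, 68]
print(round(independent_t(study_abroad, at_home), 2))  # 3.07, with 8 df
```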
Independent variables: Variables whose effect on other variables researchers are
investigating (e.g., the effect of teaching methods on language learning success,
or the effect of study-abroad on a fluency measure). Independent variables are
known as grouping variables in t-tests, factors in analysis of variance, and predictor
variables in regression analysis.
Individual differences research: Research that aims to determine the effects
or influences of individual factors (e.g., age, gender, native language, study-
abroad experience, a particular psychological aspect) on language learning, use
or performance.
Inferential statistics: Statistics used to draw conclusions about a population of
interest from a sample of that population.
Internal consistency: A statistical measure of the reliability of a test or research
instrument, such as a questionnaire or an elicitation task. Cronbach’s alpha is
the most common example of this type of measure.
Internal validity: The trustworthiness of a quantitative study, which is related
to all aspects of the research design of the study (e.g., theoretical framework,
instruments and data analysis) that lead to inferences and conclusions in the
study.
Inter-rater or inter-coder reliability: A measure of the reliability of rating
among different raters or data coders. It concerns the amount of variation of
scoring of the same learner performance by different raters.
Interval data: Data that take on continuous values that allow the difference
between data values to be calculated. Interval data do not have a true zero.
Interval data allow researchers to compute means and standard deviations.
Intervening variable: An independent variable that interferes with the influence
of the target independent variable on the dependent variable. This variable is
also known as a moderator variable or confounding variable. An example of an
intervening variable is pretest differences.

Intraclass correlation coefficient: A type of correlational analysis that can be
used as a measure of inter-rater reliability.
Intra-rater reliability: The consistency of the scores a rater assigns to learners’
performances, as well as that rater’s ability to consistently discriminate among
good, average, and poor work.
Introspection or think-aloud protocol: A form of verbal report of current,
ongoing thoughts while participants are engaged in language or cognitive
activities.
Kruskal-Wallis test: An alternative to a one-way analysis of variance (ANOVA)
if the conditions for ANOVA are not met.
Kurtosis statistic: A statistic that describes how peaked or flat a data distribution
is relative to the normal distribution.
Language aptitude tests: Measures of individuals’ inherent talent for language
learning. Language aptitude tests can be used to predict success in future lan-
guage learning.
Language tests: Tools to measure and sample learners’ ability or skills to use the
target language (e.g., reading, listening, speaking, writing, vocabulary, pragmat-
ics, and grammar).
Learner corpora: A large amount of natural language data that are produced
by language users in authentic communication (written and spoken language).
Corpus analysis can be conducted using a computer program (i.e., corpus lin-
guistic tools) to automatically examine quantitative features of language data
from linguistic units (e.g., morphemes) to syntactic structures (e.g., single
words, lexical phrases).
Leptokurtic distribution: A data distribution that is tall and narrow.
Levene’s test: A statistical test for checking the equality of variances in t-tests
and analysis of variance.
Likert-type scale: A measurement scale that allows research participants to
choose from a discrete set of responses (e.g., 1 [strongly disagree] to 5 [strongly
agree]).
Mann-Whitney U test: A nonparametric test alternative to the independent-
samples t-test.
Mean: The average of all the values in the data set.
Measures of central tendency: Each of these measures is a single value that
attempts to describe a data set by identifying a typical central value, i.e., mean,
median, and mode.
Measures of dispersion: Descriptive statistics that indicate how much variabil-
ity there is in the data set.
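As a supplementary sketch (not from the book), Python’s standard statistics module computes these measures of central tendency and dispersion for an invented set of nine test scores:

```python
# Hypothetical sketch: mean, median, and mode (central tendency) plus
# the sample standard deviation (dispersion) for invented test scores.
import statistics

scores = [60, 65, 65, 70, 72, 75, 80, 85, 88]

print(round(statistics.mean(scores), 2))   # 73.33: the average
print(statistics.median(scores))           # 72: the middle value
print(statistics.mode(scores))             # 65: the most frequent value
print(round(statistics.stdev(scores), 2))  # 9.54: sample standard deviation
```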

Median: The value that divides an ordered data set into two equal halves.
Meta-analysis: A systematic review of previous empirical studies through the use
of statistical analysis for an average effect size.
Mode: The value that occurs most frequently in a data set.
Multiple regression: An extension of simple regression to examine the effect of
several independent variables on an outcome variable simultaneously. Hierar-
chical regression is performed when researchers enter independent variables in
steps (called blocks).
Negative correlation coefficient: A correlation coefficient that indicates that
as one variable increases, the other decreases, and vice versa.
Nominal data: Data that consist of named categories. Nominal data can be
compared only in terms of sameness or difference, rather than size or strength
(e.g., gender, nationalities, first language). Nominal data allow frequency
counts, including raw frequencies and percentages, as well as visual
representations (e.g., pie charts).
Normal distribution: A data distribution that is bell-shaped.
Null hypothesis (H0): A statistical hypothesis that is tested against empirical
data. Inferential statistics aim to reject the null hypothesis. A null hypothesis
may have a word such as ‘no’ or ‘not’ (e.g., there is no relationship . . ., there is
no difference . . ., and there is no effect . . .). A rejection of a null hypothesis is
linked to a probability value being set (e.g., when p < 0.05).
Ordinal data: Data that are put into an order. Ordinal data can be obtained when
participants are rated or ranked according to their test performances or levels
of some feature. Ordinal data are more informative than nominal data since
they contain information about relative size or position (running at average
speed, high speed, very high speed), but are less informative than interval data,
which contain information about the exact size of the difference (race times of
25 seconds, 12 seconds, 10 seconds).
Paired-samples t-test: A parametric test for comparing the mean scores of two
tests or measurement instruments taken by the same group of participants. The
test is called ‘paired’ because each participant contributes a pair of scores (e.g.,
a pretest score compared with a posttest score). It is also called ‘a dependent
t-test’ because both mean scores depend on the same group of participants.
Parameters: The characteristics of the population of interest.
Participants: People who take part in a research study. Participants are sources
of data for analysis.
Pearson Product Moment correlation (Pearson’s r): A parametric statistic
for correlational analysis.
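What Pearson's r computes can be sketched directly in code. This is an illustration with invented variables (weekly study hours and test scores), not the book's SPSS procedure:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson Product Moment correlation between two interval variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Sum of cross-products of deviations, divided by the product of the
    # square roots of the summed squared deviations of each variable.
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

hours = [1, 2, 3, 4, 5]        # hypothetical weekly study hours
scores = [52, 58, 61, 67, 72]  # hypothetical test scores
r = pearson_r(hours, scores)
print(round(r, 3))       # 0.996 -- a strong positive correlation
print(round(r ** 2, 3))  # 0.992 -- the coefficient of determination
```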
Pie chart: A circular diagram for displaying the relative sizes of values of a variable.
Platykurtic distribution: A data distribution that is wide and flat.


Point-biserial correlation: A type of correlational analysis that is used when
one variable is dichotomous (a nominal variable with two categories) and the
other is interval or ordinal.
Population: A particular group of people or learners of interest.
Positive correlation coefficient: A correlation coefficient that indicates that as
one variable increases, the other also increases.
Post hoc test: A statistical test that is used to inform researchers which groups
differ significantly from one another. Post hoc tests are used when there are
more than two comparison groups (e.g., in ANOVA).
Psycholinguistic methods: Data elicitation methods that capture individuals’
online and/or offline mental representations and processes as they use and com-
prehend language in second language research. Psycholinguistic methods rely
on specialized software and hardware (e.g., E-Prime, Presentation for PC, and
eye-tracking systems) that can produce numerical data for statistical analysis.
Quantification: A numerical measurement process of the features of individual
data points.
Questionnaires: Research instruments used to measure individuals’ beliefs, per-
ceptions, cognitive processes, or self-knowledge by having them answer a series
of questions.
Random assignment: A required condition for a true experimental research
design. Research participants (identified through a sampling method) are ran-
domly assigned into groups (e.g., experimental or control groups).
Random sampling: A sampling technique that allows each member of the tar-
get population to have an equal chance of being chosen.
Ratio data: Data that can be expressed using an interval, continuous scale with
a true zero value.
Raw data: Data that are not yet processed or analyzed to answer research ques-
tions (e.g., answers or responses provided by research participants).
Regression analysis (or simple regression): An extension of correlation anal-
ysis in which the values of an independent variable are used to predict the
values of a dependent variable. Regression expresses the variance accounted
for as R2 (the coefficient of determination). It ranges from 0 (no variance
explained) to 1 (all variance explained).
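The logic of simple regression and of R² can be sketched with least squares in plain Python. This is a toy illustration with invented data, not a substitute for the SPSS procedures described in the book:

```python
def fit_line(x, y):
    """Least-squares intercept and slope for predicting y from x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return intercept, slope

x = [1, 2, 3, 4, 5]       # hypothetical independent (predictor) variable
y = [52, 58, 61, 67, 72]  # hypothetical dependent (outcome) variable
a, b = fit_line(x, y)
pred = [a + b * v for v in x]

# R^2 = 1 - (residual sum of squares / total sum of squares)
my = sum(y) / len(y)
ss_res = sum((obs - p) ** 2 for obs, p in zip(y, pred))
ss_tot = sum((obs - my) ** 2 for obs in y)
r_squared = 1 - ss_res / ss_tot
print(round(b, 2), round(a, 2), round(r_squared, 3))  # 4.9 47.3 0.992
```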
Reliability: Consistency or repeatability of observations of behaviors, perfor-
mance, and/or psychological attributes. It is related to the extent to which
data are free of random error. The reliability of a test or research instrument is
commonly expressed as a value between 0 and 1.
Repeated-measures analysis of variance (repeated-measures ANOVA):
An extension of the paired-samples t-test to the case in which the mean scores
are compared across three or more tests or measurement instruments.
Research design: A research plan, outline, and method to help researchers tackle
a particular research problem.
Research reliability: The confidence that similar findings or conclusions are
likely to be repeated in new studies (i.e., replicability).
Sample size: The number of participants who produce data for quantitative
analysis. Large samples are generally preferable to small samples.
Scatterplot: A diagram that visualizes the correlation between two variables.
Skewness statistic: A statistic that describes whether more of the data are at the
low end of the range or the high end of the range.
Spearman’s rho: A nonparametric (distribution-free) statistic for correlational
analysis between two variables, sometimes written with the Greek letter rho
(ρ) or spelled out (rho). It is an alternative to the Pearson Product Moment
correlation. Unlike the Pearson correlation, it does not have a coefficient of
determination.
Spearman-Brown prophecy coefficient: A reliability measure in the split-half
reliability test. It can be used to inform researchers whether the reliability of
the test will increase if more items are added. It can also be used to examine
inter-rater reliability.
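The prophecy formula itself is short: the predicted reliability of a test lengthened by a factor k is kr / (1 + (k − 1)r). A sketch with a hypothetical half-test correlation (not a value from the book):

```python
def spearman_brown(r, k):
    """Predicted reliability when a test is lengthened by a factor of k."""
    return (k * r) / (1 + (k - 1) * r)

# Hypothetical: the two halves of a test correlate at .60.
# Doubling the length (k = 2) predicts the full-test (split-half) reliability.
print(round(spearman_brown(0.60, 2), 2))  # 0.75
```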
Sphericity: A statistical assumption for the repeated-measures ANOVA that
refers to the condition that the variances of the differences between the indi-
vidual measurements should be roughly equal.
Standard deviation (SD or Std. Dev): A statistic that indicates how different
individual values are from the mean.
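The standard deviation can be computed by hand to make the definition concrete. A sketch with invented scores (the sample formula divides by n − 1):

```python
from math import sqrt

def standard_deviation(values):
    """Sample standard deviation: typical distance of values from the mean."""
    m = sum(values) / len(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

print(round(standard_deviation([70, 75, 75, 80, 100]), 2))  # 11.73
```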
Standard error of measurement (SEM): A statistical method for estimating
the lower and upper bound of an individual’s score through the use of a reli-
ability coefficient of a research instrument and the standard deviation of the
mean score. The higher the reliability coefficient, the lower the value of SEM.
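The calculation reduces to SEM = SD × √(1 − reliability). A sketch with hypothetical numbers (not from the book), showing the inverse relationship between reliability and SEM noted above:

```python
from math import sqrt

def sem(sd, reliability):
    """Standard error of measurement from an SD and a reliability coefficient."""
    return sd * sqrt(1 - reliability)

# Hypothetical test: SD = 10, reliability = .91, observed score = 75.
error = sem(10, 0.91)
print(round(error, 1))         # 3.0
print(75 - error, 75 + error)  # rough 68% band around the observed score
```

Consistent with the definition, raising the reliability to .99 shrinks the SEM from about 3 to about 1.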
Standardization: A procedure in which all research participants receive the same
conditions (e.g., same tasks and equal time allowance) during data collection.
Statistical Package for the Social Sciences (SPSS): A statistical program for
performing quantitative data analysis.
Statistical reasoning: The process of making inferences or drawing conclusions on the basis of statistical evidence.
Statistical significance: An index that shows how likely it is that a statistical
finding is due to chance. It is known as the significance level and is given as a
decimal (e.g., p < 0.05 or p = 0.032). In inferential statistics, it is insufficient to
report statistical significance alone (see ‘effect size’).
Stratified random sampling: A sampling technique in which researchers divide
the target population into sub-groups or strata and then randomly choose
equal numbers of participants from each sub-group to form a total sample.
Two-way mixed-design analysis of variance (ANOVA): A combination of
a repeated-measures ANOVA and a between-groups ANOVA. This test allows
researchers to simultaneously examine the effect of a between-subject variable,
a within-subject variable, and the interaction between these variables.
Type I error (or false positive): An error made when researchers reject the null
hypothesis when it is in fact true.
Type II error (or false negative): An error made when researchers retain (fail
to reject) the null hypothesis when it is in fact false.
Variability: The extent to which data values differ from the mean.
Variable: A feature that can vary in degree, value, or quantity (e.g., age or first
language).
Wilcoxon signed-rank test: A nonparametric test analogous to the paired-
samples t-test.
Within-groups design: An experimental research design that examines whether
an outcome variable changes following the application of a treatment condi-
tion. Measures taken on the same learners are compared, rather than different
groups. Pretest-posttest studies typically involve a within-groups design.
Z-score: A raw score that has been transformed into a standardized score.
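The transformation can be sketched directly: z = (raw score − mean) / SD. The mean and SD below are hypothetical, not from the book:

```python
def z_score(raw, mean, sd):
    """Standardize a raw score: how many SDs it lies from the mean."""
    return (raw - mean) / sd

# Hypothetical test: mean = 65, SD = 5.
print(z_score(75, 65, 5))  # 2.0 -- two SDs above the mean
print(z_score(60, 65, 5))  # -1.0 -- one SD below the mean
```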
INDEX

Italic page references indicate illustrations and tables

adjacent agreement method 229
alpha level, setting 89–90
alternative hypothesis 89, 89
analysis of covariance (ANCOVA): between-subjects factors/contrasts and 151, 151; case selection in SPSS program and 142, 143–53, 143–51; conditions of 139–40; conditions in SPSS, checking 140–3, 141–2; covariate and 138–40, 151; describing 139; gain scores and 136; homogeneity of regression slopes and 140, 148, 148; homoscedasticity and 150–1; intervening variables and, eliminating 135–9; overview 153; in second language research 135; in SPSS program 140–53, 141–52
analysis of variance (ANOVA): assumptions of, statistical 119–20; degrees of freedom in 86–7; describing 117–22; effect size for 121–2; F-statistic and 210; outcomes of 119–20; overview 117, 134; post hoc tests and 120–1, 126, 127; posttest and 118, 118; in second language research 117–22; in SPSS program 122–7, 122–7; steps in, key 119
ANOVA see analysis of variance; repeated-measures ANOVA; two-way mixed design ANOVA
assessment 11
Asymp. Sig (2-tailed) value 110, 115
bar graphs: in descriptive statistics 35, 35; in descriptive statistics in SPSS program 54–5, 55; SPSS program instructions for 54–5, 55
Becker’s effect size calculator 104
β value and coefficient 202–5, 211, 216–17
between-subjects factors/contrasts 151, 151, 174, 174, 176, 176
bimodal data 37–8
bivariate correlation 77
Bonferroni post hoc test 150, 156
case summaries, generating in SPSS program 20–2, 20–1
categorical data 7–8, 8, 39
central tendency measures 36–8
chi-square test: assumptions of 189–90; non-SPSS method for 195–8, 196–8; one-dimensional 182–5, 183; overview 198–9; in L2 research 182; in SPSS program 190–5, 191–5; two-dimensional 185–9, 185–6, 188
coding data 15, 234–8
coefficient of determination 67
Cohen’s d 88, 96–7, 101, 103–4, 111–12, 121
Cohen’s kappa 234–8, 236–8
collinearity 205, 212, 212, 217, 217
collocations 188–9, 188
Common European Reference Framework for Languages 6
composite 40
confounding variables 135–9
constructs 2
contingency table 185, 185, 188, 189–90, 196–7, 197, 234, 234
Continuity Correction 194
control group 30
convenience sampling 82
correlation: bivariate 77; coefficients 61–2, 229–30; defining 200; interval-interval 66–8; interval-nominal 68–9; interval-ordinal 68; negative 62–6, 66, 68; occurrence of 60; ordinal-ordinal 68; point-biserial 69; positive 60, 62–6, 64, 68; in simple regression 200–1, 201; Spearman 70–9, 72, 78
correlational analysis: application of, in real study 79; background information 60; conditions to be met for using 70; correlation coefficient and 61–2; inferential statistics and 60–1; interpreting correlation and 69; interval-interval relationship and 66–8; interval-nominal relationship and 68–9; interval-ordinal relationship and 68; negative correlations and 62–6, 66, 68; occurrence of correlation and 60; ordinal-ordinal relationship and 68; overview 79–80; Pearson Product Moment and 66, 70–9, 71, 77; point-biserial correlation and 69; positive correlations and 60, 62–6, 64, 68; scatterplots and 63, 64–7; in second language research 60–1, 70; Spearman correlation and 70–9, 72, 78
covariate 138–40, 151
Cramer’s Phi 189, 197
Cramer’s V 189, 197
Cronbach’s alpha: internal consistency measures and 221–3, 222–3; inter-rater reliability and 230; intraclass correlation coefficient and 242–3; intra-rater reliability and 228; in reliability analysis 90; setting the alpha level versus 89–90; in SPSS program 223–7, 223–7; taxonomy of questionnaire and 58, 59
cross-tabulation 185, 185, 188, 188, 190, 196–7, 197, 234, 234
data: bimodal 37–8; categorical 7–8, 8, 39; checking 14–15; clearing 15; coding 15, 234–8; demographic 25; dichotomous 8, 9; entering 15; heterogeneous 38; homogeneous 38; importing from Excel 15, 22–4, 22–4; interval 3–4, 4, 39; nominal 7–8, 8, 39; ordinal 5–7, 5–6, 39–40; organizing 14–15; in quantification 2; ratio 3–4, 4; screening 15; in second language research 4; in SPSS program, preparing 14–15; transforming, in real-life context 8–11, 9–11
data file: importing from Excel 22–4, 22–4; saving and naming SPSS program 22
degrees of freedom (df) 86–7, 120
demographic data 25
dependent residuals 205
dependent t-tests 93
dependent variables 7, 119
descriptive statistics: bar graphs and 35, 35; central tendency measures and 36–8; computing 72–3, 72; data in, summarizing 39–40; diagrams and 33–5, 34–5; dispersion measures and 38–9; distribution measures and 40–3, 41–2; frequency counts and 31–3, 31–3; graphs and 33–5, 34–5; histograms in 41–2; kurtosis statistics and 40, 43; in Laufer and Rozovski-Roitblat study 155; mean and 36; median and 36–7, 36, 39; mode and 37–9; in multiple regression 209, 209; overview 28, 43; pie charts and 33–5, 34; in quantification 28; quantification at group level and 28–30, 29–30; in repeated-measures ANOVA 155, 161, 162; in L2 research 28; skewness statistics and 40–3, 41–2; in SPSS program 52–4, 53–4; standard deviation and 38–9; in two-way mixed-design ANOVA 168, 168, 174, 175, 177, 177; in Wilcoxon Signed-rank test 114, 114
descriptive statistics in SPSS program: application of, in real quantitative study 58–9, 59; background information 44; bar graphs and 54–5, 55; computing descriptive statistics and 48–54, 49–52; descriptive statistics option and 52–4, 53–4; frequency option and 48–52, 49–52; graphs and 54–8, 55–8; histograms and 57, 57–8; missing values and, assigning 47–8, 47–8; nominal variables and, assigning values to 44–7, 45–7; overview 44, 59; pie charts and 56, 56
descriptors 5–7, 6
diagrams: in descriptive statistics 33–5, 34–5; in descriptive statistics in SPSS program 54–6, 55–6; SPSS program instructions for 54–8, 55–8
dichotomous data 8, 9
Dictogloss tasks 166
discrimination of test item 68
dispersion measures 38–9
distortion of mean 37
distribution measures 40–3, 41–2
dummy coding 205
Dunn-Bonferroni test 133
d-value 97
effect size: for analysis of variance 121–2; inferential statistics and 87–8, 88; magnitude of 88; for repeated-measures ANOVA 157–8; sample size versus 87–8, 88; for t-tests 96–7, 104
equal variances assumption 96
eta squared 121–2, 126, 151, 151
exact agreement method 229
Excel spreadsheet (Microsoft), importing data from 15, 22–4, 22–4
factor variable 119–20
false negative 90
false positive 90
file drawer problem 90
frequency counts 31–3, 31–3
F-statistic 210, 215
F-value 120, 186
gain scores 136
graphs: in descriptive statistics 33–5, 34–5; in descriptive statistics in SPSS program 54–6, 55–6; SPSS program instructions for 54–8, 55–8
Greenhouse-Geisser corrections 163
grouping variable 119–20, 128
heterogeneous data 38
heteroscedasticity 205
hierarchical regression: multiple regression and 203, 204; in SPSS program 213–18, 213–18
H-index 128
histograms: in descriptive statistics 41–2; in descriptive statistics in SPSS program 57, 57–8; SPSS program instructions for 57, 57
homogeneity of regression slopes 140, 148, 148
homogeneous data 38
homoscedasticity 150–1
Huynh-Feldt corrections 163
hypotheses 2, 25, 89, 89
importing data from Excel spreadsheet 15, 22–4, 22–4
independent-samples t-tests 93–4, 93, 96–102, 98–101, 106, 117, 121, 138
independent variables 7–8, 119, 128
inferential statistics: alternative hypothesis and 89, 89; correlational analysis and 60–1; degrees of freedom and 86–7; effect size and 87–8, 88; errors in statistical analysis and 83, 90; normal distribution and 85, 85; null hypothesis and 89, 89; overview 90–1; populations and 81–3; probability and 83–4; quantification and 81–2; samples and 81–3; sample size and 84–8, 88; in second language research 60–1, 81; statistical significance and 83–4, 89–90; see also correlational analysis
instrument reliability 245
inter-coder reliability measures 228–30
internal consistency measures: Cronbach’s alpha 221–3, 222–3; describing 219–20; split-half reliability 221; test-retest reliability 221
International English Language Testing System (IELTS) 4, 25, 94, 239
inter-rater reliability measures 228–30
interval data 3–4, 4, 39
interval-interval relationships 66–8
interval-nominal relationships 68–9
interval-ordinal relationships 68
intervening variables 135–9
intraclass correlation coefficient (ICC) 238–43, 239–43
intra-rater reliability measures 228
Kruskal-Wallis test: assumptions about, statistical 128; describing 117, 127–8; outcomes of 128; in SPSS program 128–34, 129–33
kurtosis statistics 40, 43
language-related episodes (LREs) 185–6, 185–6
language testing and assessment (LTA) research 12; see also second language (L2) research
leptokurtic distribution 43
Levene’s test 96, 100–1, 100, 125–6, 150, 175, 176, 178–9
Likert-type scale 3, 39–40, 40, 58, 108, 128, 219–20
Lower Bound corrections 163
low skewness statistic 41, 42
main effects 168
Mann-Whitney U test 106–11, 107–11
marginal totals 186, 188, 188
Mauchly’s Test of Sphericity 161–2, 162, 175, 175
mean 36–7, 118
measurements: central tendency 36–8; dispersion 38–9; distribution 40–3, 41–2; inter-coder reliability 228–30; inter-rater reliability 228–30; intra-rater reliability 228; normal distribution 40–3, 41–2; proficiency level 12, 188–9, 188; scales 3–8, 4–8; see also internal consistency measures
median 36–7, 36, 39
medium effect 97
Minitab software 14
missing values, assigning 47–8, 47–8
mode 37–9
moderator variables 135–9
multiple regression: ANOVA result and 210–11, 211, 216, 216; assumptions of 205; collinearity and 205, 212, 212, 217, 217; describing 203–5, 204; descriptive statistics in 209, 209; hierarchical regression and 203, 204; model coefficient outputs and 211–12, 211–12, 216–17, 216–17; overview 218; sample size in 205; in second language research 200; simple 200–3, 201; in SPSS program 206–12, 206–12
multivariate analysis of variance (MANOVA) 118, 160, 161
Multivariate Tests 164
negative correlations 62–6, 66, 68
negatively skewed distribution 41, 42
negative ranks 114
nominal data 7–8, 8, 39
nominal variables, assigning value to 44–7, 45–7
nonparametric tests: determining use of 106; Mann-Whitney U test 106–11, 107–11; overview 116; in second language research 106; Wilcoxon Signed-rank test 111–16, 112–15
non-SPSS method for chi-square test 195–8, 196–8
normal distribution 66, 85, 85
normal distribution measures 40–3, 41–2
null hypothesis 89, 89, 183, 185
one-dimensional chi-square test 182–5, 183
one-way analysis of variance see analysis of variance (ANOVA)
ordinal data 5–7, 5–6, 39–40
ordinal-ordinal relationships 68
outcome variable 119–20
outliers 36–7, 36, 106
paired-samples t-tests 93–5, 95, 102–4, 102–4
pairwise comparisons 133, 133, 164, 164
parameters 81
parametric statistic 66
partial eta squared 121, 157–8
partialing out covariate 139
Pearson: correlation analysis 43; correlation coefficient 121, 182, 204; Product Moment 66, 70–9, 71, 77, 230; Pearson’s r 66–8
percentage of agreement 229
performance rating 234–8
phi coefficients 68, 184
Phi value 184, 187, 195
pie charts: in descriptive statistics 33–5, 34; in descriptive statistics in SPSS program 56, 56; SPSS program instructions for 56, 56
platykurtic distribution 43
point-biserial correlation 69
populations 81–3
positive correlations 60, 62–6, 64, 68
positively skewed distribution 41, 41
positive ranks 114
post hoc tests 119–21, 126, 127, 140, 150, 156–7, 178–9, 178
predictor variable 201, 205, 209–10, 209
pre-post studies 154–6, 154–5
probability 83–4
PSPP software 14
purposive sampling 82
p-value 84, 88–90, 93, 100, 120
qualitative data coding 234–8
quantification: categorical data in 7–8, 8; constructs in 2; data in 2; describing 2–3; descriptive statistics in 28; at group level 28–30, 29–30; hypotheses in 2; inferential statistics and 81–2; interval data in 3–4, 4; issues in 2–3; measurement scales in 3–8; nominal data in 7–8, 8; ordinal data in 5–7, 5–6; overview 1, 13; ratio data in 3–4, 4; sample study 12–13; in second language research 1, 3; topics in second language research and 11–12; transforming data in real-life context 8–11, 9–11
random assignment 82
random sampling 82
ranks statistics 114, 115
rater reliability: Cohen’s kappa and 234–8, 234–8; correlation coefficients and 229–30; describing 227–8; inter-coder reliability and 228–30; inter-rater reliability and 228–30; intraclass correlation coefficient 238–9, 239; intra-rater reliability measures and 228; percentage of agreement and 229; Spearman-Brown coefficient and 230–4, 231–3
ratio data 3–4, 4
regression see multiple regression; simple regression
reliability 219–20
reliability analysis: Cronbach’s alpha in 90; estimates and, factors affecting 244–5; instrument reliability versus research validity and 245; internal consistency measures and 219–27, 222–7; overview 245; rater reliability and 227–43, 228–43; reliability and 219–20; reliability coefficient and 220; in second language research 219; standard error of measurement and 243–4
reliability coefficient 220
reliability estimates, factors affecting 244–5
repeated-measures ANOVA: assumptions of 156–7; between-subjects contrasts and 151, 151, 163; describing 154; descriptive statistics in 155, 161, 162; effect size for 157–8; Mauchly’s Test of Sphericity and 161–2, 162; overview 164–5; paired-samples t-test and 154; post hoc tests and 157; in second language research 154–5, 154–5; sphericity and 156, 161–2; in SPSS program 158–64, 158–64; statistical significance and 157; within-subjects factors/contrasts and 162, 163, 163
repeated measures t-tests 93
R software 14
R-squared 210
R-value 210
samples 81–3
sample size 84–8, 88, 205
SAS (Statistical Analysis Software) 14
scatterplots: correlational analysis and 63, 64–7; in simple regression 201, 201; SPSS program instructions for 73, 74–7
Scheffé post hoc test 119–20, 126, 127, 140, 178, 178
screening data 15
selective sampling 82
sequential regression see hierarchical regression
setting the alpha level 89–90
Sidak post hoc test 150, 156
significance level 83–4, 84
significance, statistical 2–3, 83–4, 86–7, 89–90, 140, 156–7
simple regression 200–3, 201
skewness statistics 40–3, 41–2
Spearman-Brown coefficient 230–4, 231–3
Spearman-Brown prophecy formula 221, 230
Spearman correlation 70–9, 72, 78
Spearman’s rho 67–8, 71, 182, 230
sphericity 156, 161–2, 175, 175
split-half reliability 221
spreadsheet, creating in SPSS 16–20, 16–19
SPSS program (IBM): analysis of covariance in 140–53, 141–52; analysis of variance in 122–7, 122–7; application of, in real study 24–7, 25–6; background information 14; bar graphs in 54–5, 55; case selection in 142, 143–53, 143–51; case summaries in, generating 20–2, 20, 21; chi-square test in 190–5, 191–5; Cohen’s d in 97; Cohen’s kappa in 235–8, 236–8; computing descriptive statistics in 48–54, 49–52; Cronbach’s alpha in 223–7, 223–7; data file in, saving and naming 22; data in, preparing 14–15; describing 14; descriptive statistics option in 52–4, 53–4; diagrams in 54–8, 55–8; Excel spreadsheet and, importing data from 15, 22–4, 22–4; frequency option in 48–52, 49–52, 54; graphs in 54–8, 55–8; hierarchical regression in 213–18, 213–18; importing data from Excel and 22–4, 22–4; independent-samples t-tests in 97–102, 98–101; intraclass correlation in 240–3, 240–3; Kendall’s tau in 68; Kruskal-Wallis test in 128–34, 129–33; Mann-Whitney U test in 108–11, 108–10; missing values in, assigning 47–8, 47–8; multiple regression in 206–12, 206–12; notes on, important 15–16; overview 14, 27; paired-samples t-tests in 102–4, 102–4; Pearson Product Moment in 78, 78, 79; pie charts in 56, 56; repeated-measures ANOVA in 158–64, 158–64; scatterplots in 73, 74–7; in second language research 14; Spearman-Brown coefficient in 231–4, 231–3; Spearman correlation in 76, 79, 78; Spearman’s rho and 68; spreadsheet in, creating 16–20, 16–19; standard deviation in 38; statistical significance and 86; Test of Between-Subjects Effect 163; Tests of Within-Subjects Contrast 163; two-way mixed design ANOVA in 170–80, 170–80; value labels in, assigning 44–7, 45–7; variables in, computing 136–7, 136–7; Wilcoxon Signed-rank test in 112–16, 112–15; see also descriptive statistics in SPSS program
standard deviation (SD) 38–9, 118, 243
standard error of measurement (SEM) 243–4
Statistical Package for Social Sciences program see SPSS program (IBM)
statistical significance 2–3, 83–4, 86–7, 89–90, 140, 156–7
stratified random sampling 82
Tamhane T2 post hoc test 120, 126, 140, 178, 178
Test of Between-Subjects Effects 163
Test of English as a Foreign Language (TOEFL) 4, 58, 61–2, 83
Test of English for International Communication (TOEIC) 4
Test of English Pragmatics (TEP) 30, 126, 140, 206, 223
test item discrimination 68
test-retest reliability 221
theories 2
transforming data in real-life context 8–11, 9–11
t-tests: assumptions of 96; Cohen’s d in 88, 96–7; dependent 93; effect size for 96–7, 104; equal variances assumption and 96; independent-samples 93–4, 93, 96–102, 98–101, 106, 117, 121, 138; Levene’s test and 96; overview 104–5; paired-samples 93–5, 95, 102–4, 102–4; repeated measures 93; in second language research 92–3; steps for using 97
t-value 93, 103
two-dimensional chi-square test 185–9, 185–7
two-way analysis of variance 117
two-way mixed-design ANOVA: between-subjects factors/contrasts and 174, 174, 176, 176; descriptive statistics in 168, 168, 174, 175, 177, 177; Levene’s test and 175, 176, 178–9; Mauchly’s Test of Sphericity and 175, 175; overview 180–1; pairwise comparisons and 177, 177, 179; post hoc tests and 178–9, 178; pretest-posttest control-group design and 166, 167; results, written 180; in second language research 166–9, 167–9; in SPSS program 170–80, 170–80; univariate tests and 177, 177; within-subjects factors/contrasts and 166–7, 174–5, 174, 176, 178
type I error 90
type II error 90
univariate analysis of variance see analysis of variance (ANOVA)
U-value 107
variables: confounding 135–9; dependent 7, 119–20; excluded 217, 218; factor 119–20; grouping 119–20, 128; independent 7–8, 119–20, 128; intervening 135–9; moderator 135–9; nominal, assigning values to 44–7, 45–7; outcome 119–20; predictor 201, 205, 209–10, 209; in quantification 2; SPSS program and computing 136–7, 136–7
VassarStats website 196–7, 196, 198
Wiseheart’s calculator 104
within-subjects factors/contrasts 162, 163, 163, 166–7, 174–5, 174, 176, 178
Yates correction 187, 194, 197
Z-value 107, 110–11, 114–15, 115
