Quantitative Methods for Second Language Research
Carsten Roever and Aek Phakiti
Routledge, 2018
Typeset in Bembo
by Apex CoVantage, LLC
CONTENTS

1 Quantification 1
2 Introduction to SPSS 14
3 Descriptive Statistics 28
5 Correlational Analysis 60
7 T-Tests 92
Epilogue 246
References 250
Key Research Terms in Quantitative Methods 255
Index 263
ILLUSTRATIONS
Figures
2.1 New SPSS spreadsheet 16
2.2 SPSS Variable View 17
2.3 Type Column 18
2.4 Variable Type dialog 18
2.5 Label Column 18
2.6 Creating student and score variables for the Data View 19
2.7 Adding variables named ‘placement’ and ‘campus’ 19
2.8 The SPSS spreadsheet in Data View mode 19
2.9 Accessing Case Summaries in the SPSS menus 20
2.10 Summarize Cases dialog 21
2.11 SPSS output based on the variables set in the Summarize Cases dialog 21
2.12 SPSS menu to open and import data 23
2.13 SPSS dialog to open a data file in SPSS 23
2.14 Illustrated example of an Excel data file to be imported into SPSS 24
2.15 SPSS dialog when opening an Excel data source 24
2.16 The personal factor questionnaire on demographic information 25
2.17 SPSS spreadsheet that shows the demographic data of Phakiti et al. (2013) 25
2.18 The questionnaires and types of scales and descriptors in Phakiti et al. (2013) 26
2.19 SPSS spreadsheet that shows questionnaire items of Phakiti et al. (2013) 26
3.1 A pie chart based on gender 34
15.12 Reliability Analysis dialog for raters’ totals as selected variables 240
15.13 Reliability Analysis: Statistics dialog for intraclass correlation analysis 241
Tables
1.1 Examples of learners and their scores 4
1.2 An example of learners’ scores converted into percentages 4
1.3 How learners are rated and ranked 5
1.4 How learners are scored on the basis of performance descriptors 6
1.5 How learners are scored on a different set of performance descriptors 6
1.6 Nominal data and their numerical codes 8
1.7 Essay types chosen by students 8
1.8 The three placement levels taught at three different locations 9
1.9 The students’ test scores, placement levels, and campuses 9
1.10 The students’ placement levels and campuses 10
1.11 The students’ campuses 11
1.12 Downward transformation of scales 11
3.1 IDs, gender, self-rated proficiency, and test score of the first 50 participants 29
3.2 Frequency counts based on gender 31
3.3 Frequency counts based on test takers’ self-assessment of their English proficiency 31
3.4 Frequency counts based on test takers’ test scores 32
3.5 Frequency counts based on test takers’ test score ranges 32
3.6 Test score ranges based on quartiles 33
3.7 Imaginary test taker sample with an outlier 36
4.1 SPSS output on the descriptive statistics 51
4.2 SPSS frequency table for gender 52
4.3 SPSS frequency table for the selfrate variable (self-rating of proficiency) 52
4.4 Taxonomy of the questionnaire and Cronbach’s alpha (N = 51) 59
4.5 Example of item-level descriptive statistics (N = 51) 59
5.1 Descriptive statistics of the listening, grammar, vocabulary, and reading scores (N = 50) 73
5.2 Pearson product moment correlation between the listening scores and grammar scores 78
5.3 Spearman correlation between the listening scores and grammar scores 78
6.1 Correlation between verb tenses and prepositions in a grammar test 84
6.2 Explanations of the relationship between the sample size and the effect 88
6.3 The null hypothesis versus alternative hypothesis 89
those scores across the sample, the results of which would be subject to one or
more statistical tests for subsequent interpretation. In each of these procedures, we
have made abstractions, tiny steps away from learner knowledge.
I realize these comments might make me appear skeptical of quantitative research.
Of course I am! Likewise, we should all approach the task of conducting, report-
ing, and understanding empirical research with a critical eye. And thankfully, that
is precisely what this very timely and well-crafted book will enable you to do,
thereby advancing our collective ability both to conduct and evaluate research.
The text, in my view, manages to balance on the one hand a conceptual grounding
that enlightens without overwhelming and, on the other, the need for a hands-
on tutorial—in other words, precisely the knowledge and skills needed to make
and justify your own decisions throughout the process of producing rigorous and
meaningful studies. I look forward to reading them!
Luke Plonsky
Georgetown University
PREFACE
Companion Website
A Companion Website hosted by the publisher houses online and up-to-date
materials such as exercises and activities: www.routledge.com/cw/roever
Comments/suggestions
The authors would be grateful to hear comments and suggestions regarding this
book. Please contact Carsten Roever at [email protected] or Aek Phakiti
at [email protected].
ACKNOWLEDGMENTS
In preparing and writing this book, we have benefitted greatly from the support of
many friends, colleagues, and students. First and foremost, we wish to acknowledge
Tim McNamara, whose brilliant pedagogical design of the course Quantitative
Methods in Language Studies at the University of Melbourne inspired us to write an
introductory statistical methods book that focuses on conceptual understanding
rather than mathematical intricacies. In addition, several colleagues, mentors, and
friends have helped us shape the book structure and content through invaluable
feedback and engaging discussion: Mike Baynham, Janette Bobis, Andrew Cohen,
Talia Isaacs, Antony Kunnan, Susy Macqueen, Lourdes Ortega, Brian Paltridge,
Luke Plonsky, Jim Purpura, and Jack Richards. We would like to thank Guy
Middleton for his exceptional work on editing the book chapter drafts. We also
greatly appreciate the feedback from Master of Arts (Applied Linguistics) students
at the University of Melbourne and Master of Education (TESOL) students at
the University of Sydney on an early draft. We would like to thank the staff at
Routledge for their assistance during this book project: Kathrene Binag, Rebecca
Novack, and the copy editors.
The support of our institutions and departments has allowed us time to con-
centrate on completing this book. The School of Languages and Linguistics at the
University of Melbourne supported Carsten with a sabbatical semester, which he
spent in the stimulating environment of the Teachers College, Columbia Uni-
versity. The Sydney School of Education and Social Work (formerly the Faculty
of Education and Social Work) supported Aek with a sabbatical semester at the
University of Bristol to complete this book project. Finally, Kevin Yang and Damir
Jambrek deserve our gratitude for their unflagging support while we worked on
this project over several years.
1
QUANTIFICATION
Introduction
Quantification is the use of numbers to represent facts about the world. It is used to
inform the decision-making process in countless situations. For example, a doctor
might prescribe some form of treatment if a patient’s blood pressure is too high.
Similarly, a university may accept the application of a student who has attained the
minimum required grades. In both these cases, numbers are used to inform deci-
sions. Quantification is used in the same way in L2 research.
Quantitative Research
Quantitative researchers aim to draw conclusions from their research that can be
generalized beyond the sample participants used in their research. To do this, they
must generate theories that describe and explain their research results. When a
theory is tested, specific, testable statements derived from it are referred to as
hypotheses. This testing process involves analyzing data collected from, for exam-
ple, research participants or databases. In language assessment research, researchers
may be interested in the interrelationships among test performances across various
language skills (e.g., reading, listening, speaking, and writing). Researchers may
hypothesize that there are positive relationships among these skills because there
are common linguistic aspects underlying each skill (e.g., vocabulary and syntac-
tic knowledge). To test this hypothesis, researchers may ask participants to take a
test for each of the skills. They may then perform statistical analysis to investigate
whether their hypothesis is supported by the collected data.
Issues in Quantification
For the results of a piece of quantitative research to be believable, a minimum number
of research participants is required, which will depend on the research question under
analysis, and, in particular, the expected effect size (to be discussed in Chapter 6).
In most cases, researchers need to use some type of instrument (e.g., a lan-
guage test, a rating scale, or a Likert-type scale questionnaire) to help them
quantify a construct that cannot be directly seen or observed (e.g., writing abil-
ity, reading skills, motivation, and anxiety). When researchers try to quantify
how well a student can write, it is not a matter of simply counting. Rather, it
involves the conversion of observations into numbers, for example, by applying a
scoring rubric that contains criteria which allow researchers to assign an overall
score to a piece of writing. That score then becomes the data used for further
analyses.
Measurement Scales
Different types of data contain different levels of information. These differences
are reflected in the concept of measurement scales. What is measured and how it is
measured determines the kind of data that results. Raw data may be interpreted
differently on different measurement scales. For example, suppose Heather and
Tom took the same language test. The results of the test may be interpreted in
different ways according to the measurement scale adopted. It may be said that
Heather got three more items correct than Tom, or that Heather performed better
than Tom. Alternatively, it may simply be said that their performances were not
identical. The amount of information in these statements about the relative abili-
ties of Heather and Tom is quite different and affects what kinds of conclusion can
be drawn about their abilities. The three statements about Heather and Tom relate
directly to the three types of quantitative data that are introduced in this chapter:
interval, ordinal, and nominal/categorical data.
Interval Data

TABLE 1.1 Examples of learners and their scores

Learner    Score
Heather    19
Tom        16
Phil       16
Jack       11
Mary       8

TABLE 1.2 An example of learners’ scores converted into percentages

Learner    Score    Percentage
Heather    19       95%
Tom        16       80%
Phil       16       80%
Jack       11       55%
Mary       8        40%

Because the scores in Table 1.1 are interval data, it can be said that:
• Heather got more questions right than Tom, and also that she got three more
right than Tom did;
• Tom got twice as many questions right as the lowest scorer, Mary; and,
• the difference between Heather and Jack’s scores was the same as the differ-
ence between Tom and Mary’s scores, namely eight points in each case.
Interval data contain a large amount of detailed information and they tell us exactly
how large the interval is between individual learners’ scores. They therefore lend them-
selves to conversion to percentages. Table 1.2 shows the learners’ scores in percentages.
Percentages allow researchers to compare results from tests with different maxi-
mum scores (via a transformation to a common scale). For example, if the next
test consists of only 15 items, and Tom gets 11 of them right, his percentage score
will have declined (as 11 out of 15 is 73%), even though in both cases he got
four questions wrong. In addition to allowing conversion to percentages, interval
data can also be used for a wide range of statistical computations (e.g., calculating
means) and analyses.
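For readers who want to verify such computations outside SPSS, here is a minimal Python sketch of the conversion behind Tables 1.1 and 1.2; the 20-item test length is inferred from the percentages (19/20 = 95%), not stated in the text.

```python
# Sketch: converting raw scores (Table 1.1) to percentages (Table 1.2).
scores = {"Heather": 19, "Tom": 16, "Phil": 16, "Jack": 11, "Mary": 8}
TOTAL_ITEMS = 20  # inferred from 19/20 = 95%

for learner, raw in scores.items():
    print(f"{learner}: {raw}/{TOTAL_ITEMS} = {raw / TOTAL_ITEMS:.0%}")

# Interval data also support arithmetic such as means:
print("Mean raw score:", sum(scores.values()) / len(scores))  # 14.0
```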
Typical real-world examples of interval data include age, annual income, weekly
expenditure, and the time it takes to run a marathon. In L2 research, interval data
include age, number of years learning the target language, and raw scores on lan-
guage tests. Scaled test scores on a language proficiency test, such as the Test of
English as a Foreign Language (TOEFL), International English Language Testing
System (IELTS), and Test of English for International Communication (TOEIC)
are also normally considered interval data.
Ordinal Data
For statistical purposes, ratio and interval data are normally considered desirable
because they are rich in information. Nonetheless, not all data can be classified as
interval data, and some data contain less precise information. Ordinal data contain
information about relative ranking but not about the precise size of a difference.
If the data in Tables 1.1 and 1.2 regarding students’ test scores were expressed as
ordinal data (i.e., they were on an ordinal scale of measurement), they would tell
the researchers that Heather performed better than Tom, but they would not indi-
cate by how much Heather outperformed Tom. Ordinal data are obtained when
participants are rated or ranked according to their test performances or levels of
some trait. For example, when language testers score learners’ written production
holistically using a scoring rubric that describes characteristics of performance,
they are assigning ratings to texts such as ‘excellent’, ‘good’, ‘adequate’, ‘support
needed’, or ‘major support needed’. Table 1.3 is an example of how the learners
discussed earlier are rated and ranked.
According to Table 1.3, it can be said that Heather received the highest rating, Tom and Phil share the second rank, Jack ranks third, and Mary ranks fourth.
While ordinal data contain useful information about the relative standings of
test takers, they do not show precisely how large the differences between test tak-
ers are. Phil and Tom performed better than Mary did, but it is unknown how
much better than her they performed. Consequently, with the data in Table 1.3,
it is impossible to see that Phil and Tom scored twice as high as Mary. Although
it could be said that Phil and Tom are two score levels above Mary, that is rather
vague.
Ordinal data can be used to put learners in order of ability, but they do little
beyond establishing that order. In other words, they do not give researchers as
much information about the extent of the differences between individual learn-
ers as interval data do. Ratings of students’ writing or speaking performance are
TABLE 1.3 How learners are rated and ranked

Learner    Rating           Rank
Heather    Excellent        1
Tom        Good             2
Phil       Good             2
Jack       Adequate         3
Mary       Support Needed   4
often expressed numerically; however, that does not mean that they are interval
data. For example, numerical values can be assigned to descriptors as follows:
Excellent (5), Good (4), Adequate (3), Support Needed (2), and Major Support
Needed (1). Table 1.4 presents how the learners are rated on the basis of perfor-
mance descriptors.
The numerical scores in Table 1.4 may look like interval data, but they are not.
They are only numbers that represent the descriptor, so it would not make sense
to say that Tom scored twice as high as Mary did. It makes sense to say only that
his score is two levels higher than Mary’s. This becomes even clearer if the rating
scales are changed as follows: Excellent (8), Good (6), Adequate (4), Support
Needed (2), and Major Support Needed (0). That would give the information in
Table 1.5.
As can be seen in Tables 1.4 and 1.5, the descriptors do not change, but
the numerical scores do. Tom and Phil’s scores are still two levels higher than
Mary’s, but now their numerical scores are three times as high as Mary’s score.
This illustration makes it clear that numerical representations of descriptors are
only symbols that say nothing about the size of the intervals between adjacent
levels. They indicate that Heather is a better writer than Tom, but since they are
not based on counts, they cannot indicate precisely how much of a better writer
Heather is than Tom.
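The point that descriptor codes carry only order, not interval size, can be checked with a short Python sketch based on the two codings from Tables 1.4 and 1.5:

```python
# Sketch: the same ordinal descriptors under two numeric codings.
# Rank order survives recoding; ratios between codes do not.
coding_a = {"Excellent": 5, "Good": 4, "Adequate": 3,
            "Support Needed": 2, "Major Support Needed": 1}
coding_b = {"Excellent": 8, "Good": 6, "Adequate": 4,
            "Support Needed": 2, "Major Support Needed": 0}

tom, mary = "Good", "Support Needed"
print(coding_a[tom] / coding_a[mary])  # 2.0 -- "twice as high"?
print(coding_b[tom] / coding_b[mary])  # 3.0 -- now "three times as high"
# Only the ordering (Good > Support Needed) is meaningful.
```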
In L2 research, rating scale data are an example of ordinal data. These are
commonly collected in relation to productive tasks (e.g., writing and speaking).
Whenever there are band levels, such as A1, A2, and B1, as in the Common Euro-
pean Reference Framework for Languages (see Council of Europe, 2001), or bands
TABLE 1.4 How learners are scored on the basis of performance descriptors

Learner    Rating           Score
Heather    Excellent        5
Tom        Good             4
Phil       Good             4
Jack       Adequate         3
Mary       Support Needed   2
TABLE 1.5 How learners are scored on a different set of performance descriptors

Learner    Rating           Score
Heather    Excellent        8
Tom        Good             6
Phil       Good             6
Jack       Adequate         4
Mary       Support Needed   2
1–9, as in the IELTS, researchers are dealing with ordinal data, rather than interval
data. Data collected by putting learners into ordered categories, such as ‘beginner’,
‘intermediate’, or ‘advanced’ are another case of ordinal data. Finally, ordinal data
occur when researchers rank learners relative to each other. For example, researchers
may say that in reference to a particular feature, Heather is the best, Tom and Phil
share second place, Jack is behind them, and Mary is the weakest. This ranking indi-
cates only that the first learner is better (e.g., stronger, faster, more capable) than the
second learner, but not by how much. Ordinal data can only provide information
about the relative strengths of the test takers in regard to the feature in question. The
final data type often used in L2 research (i.e., nominal or categorical data) does not
contain information about the strengths of learners, but rather about their attributes.
B versus Form C). When a variable can only have two possible values (pass/fail,
international student/domestic student, correct/incorrect), this type of data
is sometimes called dichotomous data. For example, students may be asked to com-
plete a free writing task in which they are limited to three types of essays: personal
experience (coded 1), argumentative essay (coded 2), and description of a process
(coded 3). Table 1.7 shows which student chose which type.
The data in the Type column do not provide any information about one learner
being more capable than another. They show only which learners chose which essay
type, from which frequency counts can be made. That is, the process description
and personal experience types were chosen two times each, and the argumenta-
tive essay was chosen once. How nominal data are used in statistical analysis for
research purposes will be addressed in the next few chapters.
TABLE 1.8 The three placement levels taught at three different locations
TABLE 1.9 The students’ test scores, placement levels, and campuses
Imagine that students entering a language program take a placement test consisting of, say, 60 multiple-choice questions assessing their
listening, reading, and grammar skills. Based on the test scores, the students are
placed in one of three levels: beginner, intermediate, or advanced. In addition, the
three levels are taught at three different locations, as presented in Table 1.8.
Table 1.9 presents the scores and placements of the five students introduced earlier.
The test scores are measured on an interval measurement scale that is based on
the count of correct answers in the placement test and provides detailed informa-
tion. It can be said that:
• Heather’s score is in the advanced range since her score is 11 points above the
cut-off, and her score is much higher than Tom’s, whose score was 14 points
lower than hers (11 points above plus three points below the same cut-off);
• Tom’s score is in the intermediate range, but it is close to the cut-off for the
advanced range, missing it by just three points;
• Tom’s score is far higher than Phil’s, with a difference of 17 points, yet both
scores are in the intermediate range;
• Phil’s score is just one point above the cut-off for the intermediate level, and
is only four points higher than Jack’s score. Despite the small difference in
their scores, Jack was placed in the beginner level and Phil was placed in the
intermediate level; and,
• Mary’s score is in the middle of the beginner level.
Because the information is detailed, the placement test can be evaluated criti-
cally. For example, Phil and Tom’s scores are 17 points apart whereas Phil and
Jack’s are only four points apart. Phil’s proficiency level is arguably closer to Jack’s
than to Tom’s. Yet, Phil and Tom are both classified as intermediate, but Jack is
classified in the beginner level. This is known as the contiguity problem, and it is
common whenever cut-off points are set arbitrarily: students close to each other
but on different sides of the cut-off point can be more similar to each other than
to students further away on the same side of the cut-off point.
Now imagine that there are no interval-level test-score data, but instead just the
ordinal-level placement levels data and the campus data, as in Table 1.10.
As can be seen in Table 1.10, the differences between Tom and Phil and the
problematic nature of the classification that were so apparent before are no longer
visible. The information about the size of the differences between learners has
been lost and all that can be deduced now is that some students are more profi-
cient than others. Tom and Phil have the same level of proficiency and Jack is
clearly different from both of them. This demonstrates why ordinal data are not as
precise as interval data. Information is lost, and the differences between the learn-
ers seen earlier are no longer as clear.
Highly informative interval data are often transformed into less informative
ordinal data to reduce the number of categories the data must be split into. No
language program can run with classes at 60 different proficiency levels; moreover,
some small differences are not meaningful, so it does not make sense to group
learners into such a large number of levels. However, setting the cut-off points is
often a problematic issue in practice.
While the ordinal proficiency level data are less informative than the interval
test-score data, they can be scaled down even further, namely to the nominal cam-
pus data (see Table 1.11).
If this is all that can be seen, it is impossible to know how campus assignment
is related to proficiency level. However, it can be said that Tom and Phil study
at the same campus, Mary and Jack study at the same campus, and Heather
studies at a third campus. This information does not indicate who is more
proficient, since nominal data do not contain information about the size or
direction of differences. They indicate only whether differences exist or not.
Transformation of types of data can happen downwards only, rather than
upwards, in the sense that interval data can be transformed into ordinal data and
TABLE 1.11 The students’ campuses

Student    Campus
Tom        Eastern
Mary       City
Heather    Ocean
Jack       City
Phil       Eastern
ordinal data can be transformed into nominal data (e.g., by using test scores to
place learners in classes based on proficiency levels and then by assigning classes to
campus locations). Table 1.12 illustrates the downward transformation of scales.
Transformation does not work the other way around. That is, if it is known
which campus a learner studies at, it is impossible to predict that learner’s profi-
ciency level. Similarly, if a learner’s proficiency level is known, it is impossible to
predict that learner’s exact test score.
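As a rough illustration of the downward transformation in Table 1.12, here is a Python sketch; the cut-off scores (45 and 24) and the five test scores are hypothetical values chosen to be consistent with the chapter’s description, not figures taken from the book’s tables.

```python
# Sketch: interval scores collapsed to ordinal levels, then to nominal campuses.
def placement(score: int) -> str:
    if score >= 45:          # hypothetical advanced cut-off
        return "advanced"
    if score >= 24:          # hypothetical intermediate cut-off
        return "intermediate"
    return "beginner"

CAMPUS = {"advanced": "Ocean", "intermediate": "Eastern", "beginner": "City"}

for name, score in [("Heather", 56), ("Tom", 42), ("Phil", 25),
                    ("Jack", 21), ("Mary", 12)]:
    level = placement(score)
    print(name, score, level, CAMPUS[level])
# The reverse transformation is impossible: knowing the campus does not
# recover the level, and knowing the level does not recover the score.
```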
Topics in L2 Research
It is useful to introduce some of the key topics in L2 research that can be examined
using a quantitative research methodology. Here, areas of research interests in SLA,
and language testing and assessment (LTA) research are presented.
SLA Research
There is a wide range of topics in SLA research that can be investigated using
quantitative methods, although the nature of SLA itself is qualitative. SLA research
aims to examine the nature of language learning and interlanguage processes (e.g.,
sequences of language acquisition; the order of morpheme acquisition; charac-
teristics of language errors and their sources; language use avoidance; cognitive
processes; and language accuracy, fluency, and complexity). SLA research also
aims to understand the factors that affect language learning and success. Such
factors may be internal or individual factors (e.g., age, first language or cross-
linguistic influences, language aptitude, motivation, anxiety, and self-regulation), or
external or social factors (e.g., language exposure and interactions, language and
A Sample Study
Khang (2014) will be used to further illustrate how L2 researchers apply the prin-
ciples of scales of measurement in their research. Khang (2014) investigated the
fluency of spoken English of 31 Korean English as a Foreign Language (EFL)
learners compared to that of 15 native English (L1) speakers. The research partici-
pants included high and low proficiency learners. Khang conducted a stimulated
recall study with a subset of this population (eight high proficiency learners and
nine low proficiency learners). This study exemplifies all three measurement scales.
The status of a learner as native or nonnative speaker of English was used as a
nominal variable. ‘Native’ was not in any way better or worse than ‘nonnative’; it
was just different. The only statistic applied to this variable was a frequency count
(15 native speakers and 31 nonnative speakers). Khang used this variable to estab-
lish groups for comparison. Proficiency level was used as an ordinal variable in
this study. High proficiency learners were assumed to have greater target language
competence than low proficiency learners had, but the degree of the difference
was not relevant. The researcher was interested only in comparing the issues that
high and low proficiency learners struggled with. Khang’s other measures were
interval variables (e.g., averaged syllable duration, number of corrections per min-
ute, and number of silent pauses per minute, which can all be precisely quantified).
Summary
It is essential that quantitative researchers consider the types of data and levels of
measurement that they use (i.e., the nature of the numbers used to measure the
variables). In this chapter, issues of quantification and measurement in L2 research,
particularly the types of data and scales associated with them, have been discussed.
The next chapter will turn to a practical concern: how to manage quantitative data
with the help of a statistical analysis program, namely the IBM Statistical Package
for Social Sciences (SPSS). The concept of measurement scales will be revisited
through SPSS in the next chapter.
Review Exercises
To download review questions and SPSS exercises for this chapter, visit the Com-
panion Website: www.routledge.com/cw/roever.
2
INTRODUCTION TO SPSS
Introduction
There are a number of statistical programs that can be used for statistical analysis
in L2 research, for example, SPSS (www-01.ibm.com/software/au/analytics/spss/),
SAS (Statistical Analysis Software; www.sas.com/en_us/software/analytics/stat.html),
Minitab (www.minitab.com/en-us/), R (www.r-project.org/), and PSPP
(www.gnu.org/software/pspp/).
In this book, SPSS is used as part of a problem-solving approach to quantita-
tive data analysis. IBM is the current owner of SPSS, and SPSS is available in both
PC and MacOS formats. SPSS is widely used by L2 researchers, partly because its
interface is designed to be user friendly: users can use the point-and-click options
to perform statistical analysis. There are both professional and student versions of
SPSS. At the time of writing, SPSS uses a licensing system under which the user
has to pay to renew his/her license every year. It is advised that readers check
whether their academic institution holds an institutional license, under which
SPSS can be freely accessed by staff and students. Alternatively, readers could con-
sider PSPP, a freeware program modeled on SPSS.
First, researchers should assign identification numbers (IDs) to each participant’s data. IDs are important in that they allow the data
in SPSS to be checked against the actual data. If the research instrument requires scor-
ing (e.g., a test), the scoring will need to be completed and checked before the data
can be entered into SPSS. The data should be stored in a secure place.
Second, SPSS can produce a statistical analysis output as per researchers’ instruc-
tions, but the output can be ‘meaningful’ or ‘meaningless’ depending on the types of
data used and how well the characteristics of the scales discussed in Chapter 1 are
understood. For example, SPSS will quite readily compute the average of two nomi-
nal data codes, such as gender coded as ‘1’ for male and ‘2’ for female. However, it does
not make sense to talk about ‘average gender’. SPSS will not stop researchers from
performing such meaningless computations, so knowledgeable quantitative research-
ers need to be aware of what computations will produce meaningful, useful results.
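The same pitfall is easy to reproduce in any statistics tool. A minimal Python sketch, using the gender coding from the text:

```python
# Sketch: software will happily average nominal codes, but the result is
# meaningless. Gender is coded 1 = male, 2 = female, as in the text.
from collections import Counter

genders = [1, 2, 2, 1, 2]
print(sum(genders) / len(genders))  # 1.6 -- an "average gender" with no meaning
print(Counter(genders))             # frequency counts are the meaningful statistic
```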
Open SPSS.
There are two tabs at the bottom left-hand side of the spreadsheet (Data View and
Variable View). When a new file in SPSS is created, you will automatically be in
Data View, and the data can be entered using this view. However, it is best to define
the variables that will be used first. To do this, click on the Variable View tab.
To illustrate how to define variables, the data from the five students in Table 1.9
in Chapter 1 will be used. There are four variables, namely student name, test
score, placement level, and campus. When the word ‘student’ is typed into a cell in
the Name Column, SPSS automatically populates the rest of the row with default
values (see Figure 2.2).
SPSS does not allow spaces in variable names. For example, ‘First Language’ (with a
space between the two words) cannot be typed into the Name Column, but
‘FirstLanguage’ (without a space) can. If a space is present in a variable name,
SPSS will indicate that the ‘variable name contains an illegal character’. Further
information on how to name variables can be found at: www.ibm.com/support/
knowledgecenter/SSLVMB_20.0.0/com.ibm.spss.statistics.help/syn_variables_
variable_names.htm
In the second column (Figure 2.3), Type is automatically set to Numeric, which
means that only numbers can be entered into the spreadsheet for that variable. If
researchers wish to enter the names of the research participants, they need to be
able to enter words (SPSS calls variables that take on values containing characters
other than numbers ‘string’ variables). To do this, click on the Type cell for the
variable and then on the blue square with ‘. . .’ that appears next to Numeric.
When the Variable Type dialog opens, choose ‘String’ and then click on the OK
button (see Figure 2.4).
The variable type is now set to be a string variable. The column width in SPSS
is set to a default of eight characters, but this can be increased. Another column
that is optional but useful to fill in is Label (see Figure 2.5). Labels are useful when
abbreviations or acronyms are used as variables (e.g., L1 = first language; EFL =
English as a foreign language).
For now, the other columns can be ignored (see further discussion in Chap-
ter 4). Each row in the Variable View (starting from 1) forms a variable column
in the Data View. Researchers can name a variable in each row (e.g., student and
score). Once added, the details of the ‘score’ variable can be adjusted to reflect its
characteristics (see Figure 2.6). To do that, click the cell in the Decimals Column
and adjust the number of decimals to zero since all the test scores are integers.
Then ‘Test Score’ can be entered as the variable label.
The number of decimal places used should not misrepresent the data. For
example, if the variable name is ‘gender’ (nominal data) and ‘1’ can be coded for
males and ‘2’ for females, decimals are not needed. However, for other data, there
is the possibility that there are digits after the decimal point (e.g., 3.49 and 3.50).
If the number of decimal places is set to zero, the score for 3.49 will be ‘3’, but for
3.50, it will be ‘4’. This can lead to a misrepresentation of the data, so choosing to
keep one or two decimal places will result in more accurate findings.
Let us return to the data from the five students. Two more string variables still
need to be entered: placement (level) and campus. Note that the column width of
eight characters will not be enough for placement level entries since the word ‘inter-
mediate’ has 12 characters, so 12 or higher is needed for the width of the Placement
Column. The final variable definition page appears as shown in Figure 2.7.
If the Data View tab is clicked, the program will return to Data View, which
is now set up for data entry (as shown in Figure 2.8). At this point the students’
names, their scores, their placement levels, and the campuses at which they study
can be entered.
FIGURE 2.6 Creating student and score variables for the Data View
1. If a variable is numeric, numbers only can be entered (as noted earlier). In this
case, no letters or nonnumeric characters should be entered. Every value in
that column must be a number.
2. String variables can contain any combination of letters, numbers, and special
characters.
3. SPSS is case sensitive, so it will consider ‘beginner’ and ‘Beginner’ as two
entirely different values. This can become an issue if a researcher later uses the
placement variable for calculations or asks SPSS to count how many begin-
ners there are. So how values of string variables are treated must be consistent.
Once the data have been entered, SPSS can be used to conduct statistical anal-
ysis on them. Many different pieces of analysis can be done using SPSS. The
following section illustrates how to generate a list of students and their test scores,
placement levels, and campuses.
Click Analyze, next Reports, and then Case Summaries (see Fig-
ure 2.9) to call up the Summarize Cases dialog.
In the Summarize Cases dialog shown in Figure 2.10, select all vari-
ables, then move them into the ‘Variables’ pane on the right-hand
side by clicking the arrow button to the left of that pane.
Do not worry about the Display cases options. Click on the OK but-
ton. A new dialog opens, showing the SPSS output (see Figure 2.11).
FIGURE 2.11 SPSS output based on the variables set in the Summarize Cases dialog
In Figure 2.11, the second table, labeled Case Summaries, shows the names of the
students, their test scores, their placement levels, and their campuses. SPSS output
tables can be copied and pasted into a Microsoft Word document.
1. The first row in the Excel spreadsheet must contain the names of the variables;
there cannot be a headline. All the other rows must be data.
2. All variable names must consist of letters or numbers; the only special charac-
ter allowed is the underscore.
3. The data for each person must be contained in a single row in the Excel file.
If data from the same participant is contained in two separate rows, SPSS will
consider it as coming from two different participants.
4. There can be no formulae, graphs, or results in the Excel spreadsheet, only
variable names and data.
If you have prepared the Excel spreadsheet as specified, importing it into SPSS
can be done as follows.
Click File, next Open, and then Data (see Figure 2.12).
In the dialog that opens, the default file type is SPSS.sav. Select Excel
(∗.xls, ∗.xlsx, ∗.xlsm) as the file type (Figure 2.13).
FIGURE 2.14 Illustrated example of an Excel data file to be imported into SPSS
Select the Excel spreadsheet and click on the Open button. SPSS
then displays the dialog shown in Figure 2.15.
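For comparison, a spreadsheet that follows the four rules above can also be read in Python with pandas; the file name used here is hypothetical.

```python
# Sketch: importing an Excel file laid out as the rules above require
# (variable names in row 1, one row per participant, no formulae or graphs).
import pandas as pd

df = pd.read_excel("students.xlsx")  # hypothetical file name
print(df.head())    # first rows, akin to SPSS Data View
print(df.dtypes)    # numeric vs. string ('object') variables
```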
FIGURE 2.17 SPSS spreadsheet that shows the demographic data of Phakiti et al. (2013)
According to Phakiti et al. (2013), the 341 ESL students were made up of 158
males and 179 females. Four participants did not report their gender. The majority
of the participants were from mainland China (N = 233). Their mean age was 19
with a standard deviation of 1.5.
Five personal factors were measured in this questionnaire: self-efficacy, personal
values, academic difficulty, motivation, and self-regulation. The questionnaire
measuring these factors comprised 61 items. Questions 1 to 5 asked
FIGURE 2.18 The questionnaires and types of scales and descriptors in Phakiti et al. (2013)
FIGURE 2.19 SPSS spreadsheet that shows questionnaire items of Phakiti et al. (2013)
Summary
This chapter has provided details of the step-by-step procedures that need to be
followed in entering data manually into SPSS, importing data files from Excel,
and creating Case Summaries. In later chapters, relevant studies and examples in
L2 research are provided to help the reader contextualize the meaningfulness of
quantitative analysis and inferences made based on statistical results.
Review Exercises
To download review questions and SPSS exercises for this chapter, visit the Com-
panion Website: www.routledge.com/cw/roever.
3
DESCRIPTIVE STATISTICS
Introduction
This chapter presents the descriptive statistics that are used in quantitative research.
In L2 research, quantification is useful not only for describing the attributes of
an individual learner, but also for describing how the attributes of different groups
of learners may differ. These differences can be handily summarized and highlighted
using descriptive statistics.
TABLE 3.1 IDs, gender, self-rated proficiency, and test score of the first 50 participants

ID   Gender   Self-rated proficiency   Test score
1    –        advanced                 86.11
2 female upper intermediate 61.11
3 male lower intermediate 41.67
4 female intermediate 69.44
5 male intermediate 75.00
6 male – 50.00
7 female advanced 90.74
8 male upper intermediate 77.78
9 female intermediate 72.22
10 female intermediate 75.00
11 male intermediate 63.89
12 female upper intermediate 44.44
13 male advanced 62.04
14 male intermediate 58.33
15 female intermediate 50.00
16 female advanced 80.05
17 female upper intermediate 71.30
18 female upper intermediate 63.89
19 female intermediate 50.00
20 female upper intermediate 80.56
21 male intermediate 63.89
22 female upper intermediate 71.03
23 female intermediate 72.22
24 male – 11.11
25 female upper intermediate 75.00
26 female intermediate 55.56
27 – upper intermediate 69.44
28 male upper intermediate 44.44
29 male upper intermediate 19.95
30 female intermediate 61.11
31 female upper intermediate 50.00
32 male upper intermediate 36.11
33 male upper intermediate 27.78
34 female intermediate 48.48
35 female intermediate 46.83
36 male upper intermediate 16.92
37 male upper intermediate 55.56
38 – upper intermediate 30.56
39 male upper intermediate 50.00
40 female advanced 83.33
41 male upper intermediate 75.00
42 male advanced 61.11
Frequency Counts
The simplest way to reduce a mass of data is to count how often individual values
or scores occur. For example, in the data set in Table 3.1, it is possible to count
how many male and female test takers there were. Table 3.2 presents the frequency
counts according to gender.
Table 3.2 shows that there were slightly more females than males. Counting
the frequency with which each value occurs is all that can be done with nominal
data, such as gender, first language, and country of origin.
Frequency counts can also be used for an ordinal variable, such as self-assessed
proficiency level; these show how many test takers assessed themselves as being at
the beginner, lower intermediate, intermediate, upper intermediate, or advanced
level. Table 3.3 presents the frequency counts based on test takers’ self-rated
proficiency levels.
Table 3.3 shows that the majority of the test takers rated themselves as
upper intermediate, and about a third as intermediate. Frequency counts for each
score could also be computed. These are shown in Table 3.4.
While it summarizes the data somewhat, Table 3.4 does not reduce the volume
of information greatly. There is still much information to process. Score ranges
TABLE 3.2 Frequency counts based on gender

Gender         Frequency   Percent
Male           22          44.0
Female         25          50.0
Missing data   3           6.0
Total          50          100.0

TABLE 3.3 Frequency counts based on test takers’ self-assessment of their English proficiency

Self-rated proficiency   Frequency   Percent
Beginner                 0           0.0
Lower intermediate       1           2.0
Intermediate             15          30.0
Upper intermediate       26          52.0
Advanced                 6           12.0
Missing data             2           4.0
Total                    50          100.0
TABLE 3.5 Frequency counts based on test takers’ test score ranges

Score range   Frequency   Percent   Cumulative percent
0–10          0           0.0       0.0
10–20         3           6.0       6.0
20–30         1           2.0       8.0
30–40         2           4.0       12.0
40–50         11          22.0      34.0
50–60         5           10.0      44.0
60–70         10          20.0      64.0
70–80         11          22.0      86.0
80–90         6           12.0      98.0
90–100        1           2.0       100.0
Total         50          100.0
(i.e., 0–10, 10–20, 20–30, etc.) could be used instead, which in effect transform
interval-level raw scores into ordinal-level ranges. Table 3.5 is the frequency table
for the score ranges.
Table 3.5 is easier to understand than Table 3.4 because it indicates where the
clusters are located: there are many test takers in the 40–50, 60–70, and 70–80
ranges. Smaller groups are in the 50–60 and 80–90 ranges, and the other ranges
have even fewer test takers in them. SPSS produces a Cumulative Percent Column
that allows readers to see the overall distribution of test takers. Only about a third
(34%) have scores lower than 50%, which implies that two thirds have scores
higher than 50% and indicates that this sample of the test takers did reasonably
well on this test.
These test results are skewed towards the higher score ranges. In a typical pro-
ficiency test (e.g., TOEFL, or a university designed placement test), it would be
expected that there would be equal numbers of test takers in the lower and upper
half of the score range; it would further be expected that most of the test takers would
be clustered around the 50% mark. The results of this group of test takers indicate
that either the test taker group was somewhat more proficient than expected,
or that the test might have been somewhat too easy for them. Generally speak-
ing, the assumption that most test takers cluster around the 50% mark is valid for
proficiency tests, but this assumption is not necessary for achievement tests, which
measure what students have learned in a course. In an achievement test, most of
the learners would be expected to fall into the high score ranges, otherwise they
would not have learned what they were supposed to, or the test could have been
too difficult.
Of course, how the score ranges are selected is somewhat arbitrary. It is com-
mon to divide the range of possible scores into four equal parts, or quartiles. The
test takers’ scores divided into these ranges are presented in Table 3.6.
The data as presented in Table 3.6 are easier to grasp than that presented in
Table 3.5. However, the frequency counts are still somewhat difficult to interpret.
A quicker, all-at-one-glance way of representing the data is to use graphs and
diagrams.
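Readers who like to experiment outside SPSS can reproduce frequency counts and range-based groupings with pandas; the handful of scores below is made up and is not the chapter’s full data set.

```python
# Sketch: frequency counts and quartile-style score ranges.
import pandas as pd

scores = pd.Series([86.11, 61.11, 41.67, 69.44, 75.00, 50.00, 90.74, 77.78,
                    72.22, 75.00, 63.89, 44.44])
print(scores.value_counts())  # frequency of each individual score

# Collapse the 0-100 scale into four equal ranges, as in Table 3.6.
quartile_bins = pd.cut(scores, bins=[0, 25, 50, 75, 100])
print(quartile_bins.value_counts().sort_index())
```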
Pie Chart
A simple diagram that is effective for displaying the relative sizes of a small number
of groups is the pie chart. For the gender information in Table 3.2, it would look
as displayed in Figure 3.1.
The pie chart in Figure 3.1 shows that there are slightly more female than male
test takers. Note that this pie chart ignores test takers who did not disclose their
gender, so the percentages differ slightly from Table 3.2. Pie charts are effective
when there are few values that a variable can take on (e.g., 2–4 values). Displaying
the data of a 10-point score range is less effective, as illustrated in Figure 3.2, which
is based on Table 3.5.

FIGURE 3.2 A pie chart based on a 10-point score range (number in each slice = the frequency count)
In Figure 3.2, the value in each slice refers to the frequency count. There is
much information in the chart, and pie charts do not demonstrate clearly the
ordering of categories by size. In summary, pie charts are good as visual representa-
tions of frequency counts for nominal data for which there are only a few values
that that data can take on.
Bar Graphs
Bar graphs are suitable for the representation of ordinal data. For example, the bar
graph in Figure 3.3 makes it obvious that the majority of test takers was in the
score ranges between 40 and 90, and visually demonstrates the skewing of the data
towards the higher scores.
The bar graph in Figure 3.3 is a better representation than a long list of raw
data because it summarizes the overall picture of the test scores. However, it is
not portable, and it would be difficult to recall the entire graph. So the pre-
ferred method of summarizing interval data is the mean. Descriptive statistics,
which are the foundations of the many statistical analyses in L2 research, are
now discussed.
The Mean
The mean is a frequently used numeric representation of a data set. It is the average
of all the values in the data set, and it is easy to compute by dividing the sum of all
the scores by the total number of scores. In the case of the data from Table 3.1, the
sum of all scores is 2,995.77, so the mean is 2,995.77 ÷ 50 = 59.92.
The mean provides less information than the bar graph in Figure 3.3 does. It
confirms the impression that the score distribution is skewed towards the higher
score range because it is larger than 50, and this allows readers to conclude that
either the test is on the easy side for this sample, or that the sample is slightly more
capable than the test assumed. Although the mean does not show how the data
are distributed, which the bar graph shows readers at a glance, it does have two
great advantages:
1. The mean is a portable summary. Researchers do not have to recall all the
details of a graph; instead, they just need to remember a single number.
2. The mean allows calculations and easy comparisons. If researchers have a sec-
ond sample of test takers and they want to check which group performed bet-
ter overall, they can just compare the means of the two samples. For example,
if a second group of learners had taken the same test and obtained a mean
score of 53.12, it can be deduced that they are, on average, less capable than
the first group.
The Median
The mean has one major shortcoming: it is sensitive to outliers (i.e., extreme scores),
which are scores far above or below the rest of the sample. For example, consider
a sample of five test takers, the scores for which are shown in Table 3.7.
TABLE 3.7 Imaginary test taker sample with an outlier

Test taker   Score
1            27
2            33
3            40
4            46
5            99
According to Table 3.7, the first four test takers have a mean score of 36.5 (i.e.,
27 + 33 + 40 + 46 = 146, and 146 ÷ 4 = 36.5). This indicates that this group is below aver-
age in its ability. But when the score of Test Taker 5 is added, the mean increases
by more than a third to 49 (i.e., 27 + 33 + 40 + 46 + 99 = 245, and 245 ÷ 5 = 49). This
group of five test takers now appears to be of average ability. Accordingly, the mean
for the group of five does not reflect the fact that the majority of the test takers
achieved scores far below the overall mean; the inclusion of one test taker’s score
pulls the group average score up to an average level. Often quantitative researchers
consider removing extreme cases from their data set to avoid inaccurate findings.
To avoid distortion of the mean by outliers, another statistic is sometimes used,
namely the median. The median is the value that divides a data set into two
groups, so that half the participants have a value lower than or equal to the median,
and half the participants have a value higher than or equal to the median. In the
data set in Table 3.7, the median would be 40 (i.e., 27, 33, 40, 46 and 99). The
median itself does not have to occur in the data set. For example, for the data set
(27, 33, 41, 43, 46, 99), the median is 42, which is the average of the two middle
values. In the data set in Table 3.1, the median is 61.57, which also does not occur
in the data set.
In data sets with extreme values or outliers, the median can be more representa-
tive of the overall data set. The median is not very commonly used or reported in
applied linguistics or L2 research statistics, but is commonly found in research in
economics that investigates people’s incomes, for example. Imagine a community
that is for the most part in a lower-middle class income range, but contains a small
number of billionaires. The very high incomes of the billionaires will make the
overall community look much wealthier than it really is, and using the median
to represent overall income gives a more realistic picture of the typical income in
this community.
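A quick Python check of the outlier effect, using the five scores from Table 3.7:

```python
# Sketch: mean vs. median on the outlier sample in Table 3.7.
import statistics

scores = [27, 33, 40, 46, 99]
print(sum(scores) / len(scores))   # 49.0 -- pulled up by the outlier 99
print(statistics.median(scores))   # 40  -- unaffected by the outlier
```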
The Mode
The final descriptor of the central tendency of a data set is the mode. The mode
is the value that occurs most frequently in the data set. For example, in a data
set in which the value 43 occurs more often than any other value, the mode is
43. In the larger data set in Table 3.4, the mode is 50 because the value 50
occurs most frequently (five times). In real research situations, it is possible for
a data set to have two or more modes. A data set with two modes (e.g., 43
and 82) is called bimodal. Bimodal data sets can sometimes be
suspicious because the sample may consist of some high-level and some low-level
learners, with few learners in between. This can affect some inferential statistics,
such as correlations and t-tests.
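In Python, the standard library can report all modes at once, which makes bimodal data easy to spot; the data set below is invented so that it has the two modes mentioned in the text (43 and 82).

```python
# Sketch: multimode() returns every most-frequent value, flagging bimodal data.
import statistics

data = [43, 43, 50, 62, 82, 82, 70]  # invented data with two modes
print(statistics.multimode(data))    # [43, 82] -- a bimodal data set
```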
Measures of central tendency have one major shortcoming. They do not indi-
cate the dispersion of the data (i.e., how they are spread out). To analyze the spread
of a data set, measures of dispersion are used.
Measures of Dispersion
Measures of dispersion give researchers an idea of how different from one another
the data points in a sample are, i.e., how much variability there is in the data.
Consider two data sets of test scores from two groups of students, Group 1 and Group 2.
The means of the test scores for each group are the same (each has a mean
of 50). However, the variability of the scores of the two groups is different. In
Group 1, the scores range from a minimum of 47 to a maximum of 53, whereas in
Group 2, they range from a much lower minimum of 12 to a much higher maxi-
mum of 88. In other words, in Group 1, the scores are very similar to one another,
meaning that all the students in that group are similar in their knowledge of the
subject matter. In contrast, the scores of the students in Group 2 are much more
diverse, which indicates that it contains students with very little knowledge and
some with much more extensive knowledge. This information may be important
for a teacher or curriculum designer. For example, teaching Group 1 may not
need much internal differentiation as all students are likely to benefit from the
same materials. Group 2, however, is much more challenging to teach because
some of the learners need a lot of basic instruction, while others require very little.
The most frequently used measure of dispersion is the standard deviation (SD
or Std. Dev in SPSS). Conceptually, the standard deviation indicates how different
individual values are from the mean. The more scores are spread out, the larger
the standard deviation will be. In the most extreme case, if all research participants
have the same score, the standard deviation is 0 because there is no difference at all
between individual values and the mean.
By looking at the data for Groups 1 and 2, it is easy to see that the standard
deviation for Group 1 will be much smaller than that for Group 2, because the
data for Group 1 are clustered much more tightly around the mean. In fact, that
is the case. The standard deviations of the two groups’ scores are very different:
• SD for Group 1 = 2.37, suggesting that the data set is quite homogeneous.
• SD for Group 2 = 27.32, suggesting that the data set is highly heterogeneous.
By including the standard deviation along with the mean, readers can get a bet-
ter idea of the general shape of a data set. In the case of Groups 1 and 2, the results
may be reported as follows: Group 1: M = 50, SD = 2.37; Group 2: M = 50,
SD = 27.32. These two numbers provide fairly similar information to what can be seen
in a graph, but in a much more precise and compact way.
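The mean-plus-SD report is easy to emulate; in the Python sketch below, the two score lists are invented so that both means are 50 (their SDs will differ from the 2.37 and 27.32 reported for the chapter’s data).

```python
# Sketch: identical means, very different spreads.
import statistics

group1 = [47, 48, 49, 50, 50, 51, 52, 53]  # tightly clustered around 50
group2 = [12, 25, 38, 50, 50, 62, 75, 88]  # widely spread around 50
for name, g in [("Group 1", group1), ("Group 2", group2)]:
    print(name, "M =", statistics.mean(g), "SD =", round(statistics.stdev(g), 2))
```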
An example set of Likert-type items on language anxiety, each rated from 5 to 1:

• Speaking English makes me nervous.
• I feel I can’t get my message across when I speak English.
• I worry that people won’t understand me when I speak English.
• I avoid speaking English whenever I can.

When responses to several Likert-type items measuring the same construct are added up
to obtain an overall score (also known as a composite), these overall scores can be
treated as interval or continuous data.
For example, Fushino (2010) used a Likert-type scale questionnaire to collect
information about six learner characteristics, including willingness to communi-
cate in L2 group work, self-perceived communicative competence in L2 group
work, beliefs about the usefulness of group work, and others. Each characteristic
was measured using between 6 and 20 items. For example, 10 items were related
to ‘willingness to communicate in L2 group work’. Each learner received a mean
score for each characteristic, which was obtained by adding up their individual
item scores (1–5) and dividing the sum obtained by the number of items measur-
ing that characteristic.
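A minimal sketch of this composite-building step, with ten invented item scores standing in for one of Fushino’s characteristics:

```python
# Sketch: a Likert composite -- item scores (1-5) summed and divided by the
# number of items. The ten values are invented for illustration.
items = [4, 5, 3, 4, 4, 5, 3, 4, 4, 5]   # e.g., 'willingness to communicate' items
composite = sum(items) / len(items)
print(composite)  # 4.1 -- treated as interval/continuous data
```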
Ordinal data should not be treated as interval data if researchers are dealing with
rankings of students (e.g., Tom is the best, Mary the second best, Jack the third
best) or discrete groups (e.g., beginner, intermediate, or advanced). With these
kinds of data, researchers can report only frequencies or a median rank. However,
if a piece of data is the result of adding up individual data points, ordinal data may
be treated as interval, and means and standard deviations can be computed.
Skewness Statistics
A skewness statistic describes whether more of the data are at the low end of the
range or the high end of the range. The greater the value of a skewness statistic, the
more skewed the distribution of the data set is. A value of 0 indicates no skewness
at all because the data are symmetrical. Conservatively, statisticians recommend
taking skewness values within ±1.00 to suggest normally distributed data. In L2 data,
however, it is acceptable to use skewness values within ±2.00 as an indicator that
the data are generally normally distributed. Skewness values outside of the ±3.00
range are a warning sign that the data are highly skewed and hence some statistical
tests that require that the data be normally distributed may not be used.
Figure 3.5 shows the distribution of length of residence in a sample of 68 ESL
learners living in Australia. This variable was collapsed from an interval variable
to an ordinal variable (0–3 months, 3–6 months, 6–9 months, etc.). The graph is
bunched on the left-hand side; the data are then said to be positively skewed because
the tail points towards the positive side of the scale. This distribution is positively
skewed with a skewness statistic of +1.86.
The distribution of speech act scores shown in Figure 3.6, by contrast, is nega-
tively skewed with the values bunched up on the right-hand side of the scale, with
the tail pointing towards the negative side of the scale. The skewness statistic is
small at –0.60.
Figure 3.7 shows a distribution with very little skewness and with a low skew-
ness statistic of –0.03.
Exploring a data set in this way helps researchers understand whether there are
outliers or whether the characteristics of the sample are unexpected (e.g., there
may be clusters at the extremes, but few in the middle).
Kurtosis Statistics
A kurtosis statistic describes how close the values in a data set are to the mean, and
whether the distribution is leptokurtic (i.e., tall and skinny) or platykurtic (i.e., wide
and flat). This is usually fairly obvious from the standard deviation, but a kurtosis
statistic gives a standardized value, which, like skewness statistics, conservatively
should be within ±1.00 but is acceptable within ±2.00. Values outside the
±3.00 range suggest that the data set may violate the assumptions of paramet-
ric statistical tests, such as Pearson correlation analysis and ANOVA. To address
a research question through inferential statistics, skewness and kurtosis statistics
are of concern only if they are extreme. In the case of skewness in particular, an
extreme value may suggest that there are outliers with the potential to distort the
data set. How skewness and kurtosis statistics can be computed from SPSS will be
presented in the next chapter.
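Both statistics can also be computed outside SPSS. In the pandas sketch below, the two small data sets are invented to show a right-pointing and a left-pointing tail:

```python
# Sketch: positive vs. negative skew, checked against the rule-of-thumb bounds.
# .skew() and .kurt() are pandas' sample skewness and (excess) kurtosis.
import pandas as pd

right_tailed = pd.Series([1, 1, 2, 2, 3, 3, 4, 5, 9, 15])        # tail points right
left_tailed = pd.Series([1, 7, 11, 12, 13, 13, 14, 14, 15, 15])  # tail points left

for name, s in [("right-tailed", right_tailed), ("left-tailed", left_tailed)]:
    skew, kurt = s.skew(), s.kurt()
    verdict = "roughly normal" if abs(skew) <= 2 and abs(kurt) <= 2 else "non-normal"
    print(f"{name}: skewness={skew:.2f}, kurtosis={kurt:.2f} -> {verdict}")
```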
Summary
A good quantitative study requires researchers to carefully examine the descrip-
tive statistics of their quantitative data set prior to any data analysis. This step of
quantitative analysis is to ensure that the characteristics of the data set are in order
and according to expectation. Finally, descriptive statistics including skewness and
kurtosis statistics should be presented in a research report because they allow read-
ers to evaluate the basic nature of the quantitative data that are used to address
research questions.
Review Exercises
To download review questions and SPSS exercises for this chapter, visit the Com-
panion Website: www.routledge.com/cw/roever.
4
DESCRIPTIVE STATISTICS
IN SPSS
Introduction
Although descriptive statistics can be calculated manually using a calculator, it is
more efficient to use SPSS to compute them. This is especially true when a data
set is large and when complex statistics are needed to answer research questions.
This chapter shows how to compute descriptive statistics using SPSS. Before tack-
ling complex statistical analysis, it is important to have a grasp of how to compute
the simplest descriptive statistics. In this chapter, the data set presented in Fig-
ure 4.1 will be used to illustrate how to compute descriptive statistics. The data
file (Ch4TEP.sav) can be downloaded from the Companion Website for this book.
In this data file, it can be seen that some values of the gender variable are 99.
That value does not indicate a gender score, which can only be 1 or 2; rather, this
value is used to indicate missing data, and this chapter will show how to set this
up. First, however, how to assign values to nominal variables will be presented.
Figure 4.2 shows SPSS in variable view for this data file.
While recent versions of SPSS can read strings, such as ‘male’, ‘female’, ‘Ger-
man’, ‘Thai’, and ‘English’ as values for nominal variables, it is more practical and
convenient for data entry purposes to use codes to represent them. In addition, SPSS
is strict about spelling: without assigning values to nominal variables, if you type
‘mael’ instead of ‘male’, or ‘Gemrn’ instead of ‘German’, SPSS will interpret this
misspelling as new information, and the subsequent analysis will not be accurate.
One disadvantage of using numbers to represent values of nominal variables is that
the numbers do not mean anything by themselves, so you need to program SPSS to
recognize what value is represented by a given number. This can be set up so that the
Values Column indicates what value is represented by a given number. Take gender
as an example, as illustrated in Figure 4.3, which shows the Value Labels dialog.
Click the Value column of the gender variable. This column will be
activated and a blue button inside this column will appear.
Click on this blue button and a pop-up dialog will appear (see Fig-
ure 4.3).
Type ‘1’ in ‘Value’ and ‘male’ in ‘Label’. Then click on the Add but-
ton. Then repeat the same procedure for female. This time, type ‘2’.
Then click on the OK button to return to Data View.
FIGURE 4.4 Defining selfrate (self-rating of proficiency) in the Value Labels dialog
The same can be done for the selfrate variable (see Figure 4.4). To check which
code is used for each value of selfrate, return to this Value Labels dialog where
they will be listed. While performing the data entry, type ‘1’ if participants rated
themselves as ‘beginner’ and ‘2’ if participants rated themselves as ‘lower intermedi-
ate’, and so on.
Click the Missing column of the gender variable. This column will be
activated and a blue button inside this column will appear.
Click this blue button and the Missing Values dialog will appear (see
Figure 4.5).
In the case of the test score variable having a maximum score of 100, you should
not use ‘99’, but ‘999’ to define a missing value.
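For readers who manage data outside SPSS, the same two conventions, numeric
codes with labels for nominal variables and a reserved code such as 99 for missing
data, can be reproduced in pandas. This is a minimal sketch with invented data,
not the book’s SPSS procedure:

    import pandas as pd

    # Invented data; 99 is the reserved missing-data code for gender
    df = pd.DataFrame({"gender": [1, 2, 99, 1], "score": [67, 82, 74, 59]})

    df["gender"] = df["gender"].replace(99, pd.NA)             # define the missing value
    df["gender"] = df["gender"].map({1: "male", 2: "female"})  # attach value labels

    print(df["gender"].value_counts(dropna=False))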
Click the Statistics button and a pop-up dialog will appear (see Fig-
ure 4.8). Tick the following checkboxes: Mean, Median, Mode, Std.
deviation, Minimum, Maximum, Skewness, and Kurtosis. Then click on the
Continue button to return to the Frequencies dialog.
Click on the Charts button and a pop-up dialog will appear (see Fig-
ure 4.9). Only one chart type can be chosen. In this illustration, the
Histograms checkbox is selected with the Show normal curve on histogram
option. Then click on the Continue button to return to the Frequencies dialog.
Several output tables will be produced. For the purpose of this chapter, not all the
tables are shown. Table 4.1 presents the descriptive statistics of the gender, age,
selfrate, and total score variables.
It should be noted that not all the information in Table 4.1 is useful. While
descriptive statistics make sense for the age and total score variables, they do not
make sense for the gender and selfrate variables, as discussed earlier. These two
variables were included to demonstrate that it is possible to calculate descriptive
TABLE 4.3 SPSS frequency table for the selfrate variable (self-rating of proficiency)
statistics for all variables, but that frequency tables are more useful for nominal
variables.
In Table 4.1, the mean age of the participants was 16.42 (SD = 1.70). The
mean, median, and mode were similar, suggesting that the age data were normally
distributed. The skewness and kurtosis statistics for the age variable were 0.03
and –1.12 respectively, which are within the acceptable range for the assumption
of a normal distribution to be valid. The minimum and maximum test scores
were 11.11 and 90.74 respectively. The mean test score was 59.92 (SD = 18.71).
Although the mode was 50, this value occurred only five times in the data set, so
it did not greatly affect the data distribution. The score 75 occurred four times, so
the data were close to being bimodal. The skewness and kurtosis statistics (–0.67
and 0.14, respectively) for the test scores were within the conservative limits of
±1.00. Finally, SPSS can also produce histograms. Tables 4.2 and 4.3 show the
frequency tables for the gender and selfrate variables.
Figure 4.10 shows the histogram for the selfrate variable, along with a normal
curve for the purpose of comparison.
FIGURE 4.10 A histogram of the self-rating of proficiency variable with a normal curve
Graphical Displays
Click Graphs, then Legacy Dialogs to find several options for the
graphical representation of data (see Figure 4.12).
To create a bar chart, click Bar. Choose Simple in the dialog that appears,
and then click Define (see Figure 4.13). In the Define Simple Bar . . .
dialog that pops up, move a variable of interest (e.g., ‘age’) from the pane on
the left-hand side to the Category Axis field in the pane on the right-hand side.
To create a pie chart, click Graphs, next Legacy Dialogs, and then Pie
(see Figure 4.12). Choose Summaries for Groups of Cases in the dialog
that appears, and then click the Define button to call up the Define Pie . . .
dialog (see Figure 4.14). Note that there are two other options (Summaries of
Separate Variables and Values of Individual Cases) that are not presented here.
Figure 4.16 shows the histogram for the total score variable.
TABLE 4.4 Taxonomy of the questionnaire and Cronbach’s alpha (N = 51) (adapted from
Phakiti & Li, 2011, p. 273)
TABLE 4.5 Example of item-level descriptive statistics (N = 51) (adapted from Phakiti &
Li, 2011, pp. 262–263)
reliability estimate (i.e., 0.45). Table 4.5 shows the descriptive statistics of five out
of 30 items.
When reporting descriptive statistics, minimum and maximum scores as well as
skewness and kurtosis statistics should be included. Skewness and kurtosis statistics
allow readers to evaluate whether the data for each variable were normally dis-
tributed. In the sample of items shown in Table 4.5, all the items have reasonable
skewness and kurtosis values.
Summary
This chapter has illustrated how SPSS can be used to perform basic analyses of
quantitative data using descriptive statistics. SPSS is practical as it allows research-
ers to handle a large data set, and the results it produces are reliable so long as
empirical data are entered into the data spreadsheet accurately. The next chapter
will present the concept of correlation in L2 research as well as how to perform
correlational analysis in SPSS.
Review Exercises
To download review questions and SPSS exercises for this chapter, visit the Com-
panion Website: www.routledge.com/cw/roever.
5
CORRELATIONAL ANALYSIS
Introduction
Correlation exists in many situations. For example, the further a car is driven, the
more fuel it will use, and the more the driver will have to spend on that fuel. In this
case, the distance driven and the amount of money spent on fuel would be said to
correlate: as distance increases, so do expenses for fuel. Correlation describes the
relationship between variables, and this chapter introduces and explores correla-
tional analysis for L2 research.
A correlation coefficient of 0 would mean that L2 reading and vocabulary
knowledge were entirely unrelated. This scenario is also unlikely because, theoretically
and empirically, L2 reading and vocabulary are related to each other
to some extent (see e.g., Alderson, 2000; Qian, 2002; Read, 2000). It is more
likely that the correlation coefficient between these two variables lies between
0 and 1. For example, Guo and Roehrig (2011) found a correlation coefficient
of 0.43 between depth of L2 vocabulary knowledge and scores on the TOEFL
reading section.
Whether a particular correlation coefficient indicates a strong or weak relation-
ship may depend on various factors, including theoretical issues and the expectations
of the researchers. If researchers believe that vocabulary is essential to reading and
should account for most of the success in reading performance, a correlation coef-
ficient of 0.43 might seem low because they might have expected it to be 0.70 or
higher. However, if their stance is that reading comprehension is co-determined by
a range of other factors, such as background knowledge, metalinguistic knowledge,
and syntactic knowledge, they might have expected a lower correlation coefficient
of 0.30, for example, and 0.43 would then seem high to them.
The following is a general guideline about the strength of the correlation coef-
ficient (Cohen, 1988): a coefficient of around 0.10 indicates a small (weak) relationship;
around 0.30, a medium relationship; and 0.50 or above, a large (strong) relationship.
Positive correlations are normally shown without the + sign, so when researchers
report ‘r = 0.43’, it is assumed to mean r = +0.43.
A negative correlation coefficient between two variables indicates that they
move in opposite directions, so that an increase in one variable is accompanied by
a decrease in the other variable. The following are examples in which the correla-
tion coefficients between the variables are likely to be negative:
• Learners’ general L2 proficiency and number of errors that they make when
completing a dictation test. This is because the higher learners’ proficiency
level is, the fewer errors they are likely to make.
• The amount of time learners take to read a text and their vocabulary knowl-
edge. This is because the more vocabulary learners know, the less time they
are likely to need to read a text.
Negative correlations are always shown with the – sign. For example, research-
ers report ‘r = –0.82’ or ‘r = –0.21’ in their journal articles.
Scatterplots are often used to visualize the direction of a correlation and the
strength of the relationship between two variables. In a scatterplot, the values of
the two variables are taken as coordinates on a pair of axes, and a dot is placed
for each data point. The closer the dots are to a ‘line of best fit’, the stronger the
correlation between the variables. The direction of the line indicates whether the
correlation is positive or negative:
• A line rising from the lower left-hand side to the upper right-hand side indi-
cates a positive correlation.
• A line falling from the upper left-hand side to the lower right-hand side indi-
cates a negative correlation.
Figure 5.1 presents a scatterplot (based on simulated data) that shows a perfect
positive correlation.
In Figure 5.1, a straight line can be drawn through all the dots. This scatter-
plot suggests that a value for Variable 1 can be predicted from the corresponding
value of Variable 2 with certainty. For example, in Figure 5.1, if the value of
Variable 1 is 20, then the value of Variable 2 is 60. The relationship between
the two variables is perfectly linear and deterministic. Finally, the line of best fit
goes from the lower left-hand side to the upper right-hand side, so the relation-
ship is positive. That is, as the values for Variable 1 increase, so do the values for
Variable 2.
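Outside SPSS, the same kind of display can be produced with matplotlib. The
sketch below uses invented values and numpy’s least-squares fit to draw a scatterplot
with a line of best fit:

    import numpy as np
    import matplotlib.pyplot as plt

    # Invented values for two positively related variables
    x = np.array([5, 8, 10, 12, 15, 18, 20])
    y = np.array([12, 20, 28, 33, 44, 50, 61])

    slope, intercept = np.polyfit(x, y, 1)  # least-squares line of best fit
    plt.scatter(x, y)
    plt.plot(x, slope * x + intercept)      # a rising line indicates a positive correlation
    plt.xlabel("Variable 1")
    plt.ylabel("Variable 2")
    plt.show()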
Figure 5.2 shows a scatterplot that indicates a high but not perfect correlation
(the correlation coefficient is 0.90). It can be seen that the dots are fairly close to
the line of best fit, which rises from the left-hand side to the right. Predictions of
the value of one variable based on the value of the other can be made using the
line of best fit, but there will inevitably be some degree of error.
FIGURE 5.1 A scatterplot displaying the values of two variables with a perfect positive
correlation of 1
FIGURE 5.2 A scatterplot displaying the values of two variables with a correlation coef-
ficient of 0.90
FIGURE 5.3 A scatterplot displaying the values of two variables with a correlation coef-
ficient of 0.33
Figure 5.3 shows a scatterplot that indicates a much weaker correlation (the correla-
tion coefficient is only 0.33). An accurate prediction of the value of one variable
using the value of the other variable would be difficult to achieve. For example, if
the value of variable 2 is between 10 and 20, the corresponding value for variable
1 may lie anywhere between 4 and 10. However, there is still a noticeable correla-
tion between the variables.
Figure 5.4 shows a scatterplot that illustrates a perfect negative correlation coef-
ficient (r = –1). As with the case of a perfectly positive correlation (Figure 5.1), all
the dots lie on the line of best fit. However, in this case, the relationship between
the variables is inverse, so that a high value for Variable 1 would imply a low value
of Variable 2, and vice versa. Finally, Figure 5.5 shows a scatterplot for a data set
in which there is virtually no relationship between the variables (the correlation
coefficient is 0.06). For this data set, no reasonable prediction can be made for a
value of one variable from a value of the other.
FIGURE 5.4 A scatterplot displaying the values of two variables with a perfect negative
correlation coefficient of –1
Types of Correlation
To calculate the correlation between two variables, the nature of the variables
(interval/ordinal/nominal) needs to be taken into account, and this will determine
which correlation analysis should be used.
Interval-Interval Relationships
Most correlations encountered in L2 research are between variables that are both
interval, and the statistic used for this is the Pearson Product Moment correlation or
Pearson’s r. This is a parametric statistic, which requires that the distribution of each
variable in the underlying population from which the sample is taken must be
normal. The normal distribution will be discussed in Chapter 6. It is not appro-
priate to use Pearson’s r if the data are not interval. Also, outliers can distort the
value of Pearson’s r, and it can become artificially inflated if there are clusters at the
extremes. Finally, Pearson’s r does not always give usable results if either of the score
ranges is restricted (e.g., the data for one variable consist only of ratings 1–5). In
such a case, it may be more appropriate to use a nonparametric statistic, such as
Spearman’s rho (as discussed in the “Interval-Ordinal or Ordinal-Ordinal Relation-
ships” section).
FIGURE 5.5 A scatterplot displaying the values of two variables with a low correlation
coefficient of 0.06
Pearson’s r can be converted to the coefficient of determination (denoted by R²).
R² is the correlation coefficient squared, often expressed as a percentage. This coef-
ficient expresses the shared variance between the two variables, which refers to
the overlapping content between the two variables (e.g., vocabulary knowledge
and reading comprehension). If an r coefficient is 0.43 (as is the correlation
coefficient between vocabulary and success in reading comprehension in Guo
and Roehrig, 2011), the coefficient of determination will be 18.49% (i.e., R² =
0.43² = 0.1849, or 18.49%), which indicates approximately 18.5% of overlap
between vocabulary knowledge and reading comprehension. This figure can be
interpreted as showing that nearly one fifth of the variance in reading comprehension
scores is accounted for by vocabulary knowledge alone. That is a sizeable amount, but of
course four fifths will still be accounted for by other variables, such as metalin-
guistic knowledge.
The coefficient of determination is useful because it allows researchers to
quantify the extent of the relationship between variables. Being able to say that
nearly one fifth of the variance in vocabulary scores is shared with reading com-
prehension scores is easier to understand than saying that the correlation between
the variables is 0.43.
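The arithmetic behind the coefficient of determination can be checked in two
lines; the value of r below is the one reported from Guo and Roehrig (2011) in
the text:

    r = 0.43
    r_squared = r ** 2          # 0.1849
    print(f"{r_squared:.2%}")   # prints '18.49%', the shared variance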
Interval-Nominal Relationships
When data on one or both variables are nominal, correlations are not frequently
calculated. The most commonly encountered example of correlating interval with
nominal data in LTA research is the computation of the discrimination of a test item.
Discrimination here refers to a test item’s usefulness in distinguishing between
high- and low-ability test takers. For a mid-difficulty question, researchers would
expect high-ability test takers to be more likely to answer the item correctly than
low-ability test takers. To find out if that is the case, researchers can correlate test
takers’ item scores (0 or 1) for a particular question with individual total test scores
excluding the item under consideration. In this case, researchers would be cor-
relating a nominal variable with an interval variable (the total test score). This can
be achieved through the use of a point-biserial correlation. The point-biserial cor-
relation is, however, not covered in this book (see instead Chapter 11 in Phakiti,
2014).
Interpreting Correlation
According to Guo and Roehrig (2011), for example, vocabulary scores and read-
ing comprehension scores correlate with a correlation coefficient of 0.43 and have
18.49% of shared variance. Statisticians often say that correlation is not causation.
Of course, causation is what quantitative researchers are ultimately interested in—
but merely because two variables systematically move in the same (or opposite)
direction does not necessarily mean that a change on one causes a change in the
other. The correlation coefficient only provides an idea of the strength and direc-
tion of the association between the variables; the exact nature of the relationship
between the variables has to be investigated in a different way.
To express the nature of correlation, it is said that two variables (such as
vocabulary knowledge and success in reading comprehension) ‘co-vary’ or ‘share
variance’. If one changes, the other also changes. Researchers also say that they
‘overlap’. Alternatively, they may say that ‘18.5% of success in reading comprehen-
sion is accounted for by vocabulary knowledge’. Expressing the relationship in
this way assumes a one-way relationship in which more vocabulary knowledge
implies better reading comprehension, but not necessarily the other way around.
To be able to make this claim, researchers need a good theoretical foundation
that supports their assumption that vocabulary knowledge supports reading com-
prehension. A different way of making the same claim is to say that vocabulary
knowledge explains 18.5% of success in reading comprehension. It may be implied
that vocabulary knowledge is the underlying factor and reading comprehension is
the outcome, which suggests a causal-like relationship.
Statisticians may be wary of causative explanations in correlational analysis as
there can be one or more underlying factors that explain both variables. For
example, suppose researchers take a random sample of teenagers aged between
12 and 18, give each of them the same IQ test, and then measure their shoe sizes.
A statistical analysis of the resulting data set may indicate that there is a correla-
tion between shoe size and IQ. However, age may explain the correlation as older
respondents are likely to have bigger feet and be able to score higher on the same
IQ test.
1. Data on each of the variables must come from the same group of people. If
researchers wish to correlate the scores on an ESL grammar test from a group
of students with scores on a listening test, this can be done only if the same
students took the two tests.
2. Data must be of the appropriate type for the specific correlation coefficient
being calculated. The Pearson Product Moment correlation requires interval
data or data resulting from the combination of ordinal numbers or scores.
Spearman’s rho requires ordinal data, as does Kendall’s tau. The point-biserial
correlation requires nominal data on one variable and interval or ordinal
data on the other. If there is nominal data on both variables (e.g., gender and
native language), the chi-square test should be used. This test is discussed in
Chapter 13.
3. To use the Pearson Product Moment correlation, the underlying population
from which the sample is taken should be normally distributed. If not, other
correlations such as Spearman’s rho and Kendall’s tau should be considered.
4. For the Pearson Product Moment correlation, it is preferable that the data be
spread across a wide range. The greater the variance in the data set, the more
suitable is the Pearson Product Moment correlation.
5. The relationship between the variables should be linear. Drawing a scatterplot
is an effective way to see if the relationship between variables is linear. If it is
nonlinear, the Pearson Product Moment and Spearman correlations are not
appropriate.
6. The paired variables to be correlated must not be dependent upon each other.
That is, researchers should not correlate scores on a subsection of an instru-
ment or a test, or even a single item with a total score, because the total score
is a result of the individual scores.
Correlations

                                      verb tenses    prepositions
verb tenses     Pearson Correlation   1              .719**
                Sig. (2-tailed)                      .000
                N                     104            104
prepositions    Pearson Correlation   .719**         1
                Sig. (2-tailed)       .000
                N                     104            104

** Correlation is significant at the 0.01 level (2-tailed).
FIGURE 5.6 SPSS output displaying the Pearson product moment correlation between
two subsections of a grammar test
The output in Figure 5.6 shows that the correlation between the two subsections
was 0.719, which was significant at 0.01 (p < 0.01) (to be discussed further in the
Probability and Statistical Significance section in Chapter 6). As SPSS does not compute
the coefficient of determination (R²), this needs to be computed manually: 0.719² =
0.517, or approximately 52%.
In the SPSS output shown in Figure 5.6, both the significance level (which
will be discussed in Chapter 6), and the N-size of the sample (i.e., the number of
participants involved in the correlational analysis) are shown.
1. Compute the descriptive statistics of the two variables to make sure that the
mean, median, mode, standard deviation, skewness, and kurtosis for each vari-
able of interest are within acceptable bounds (see Chapter 4 for the SPSS pro-
cedures for descriptive statistics). Examining descriptive statistics is a standard
practice prior to all inferential statistics. Recall that certain conditions need
to be fulfilled to be able to use the Pearson Product Moment correlation. If the data set has
strong kurtosis, outliers, or a bimodal distribution, it might be better to use
the Spearman correlation rather than the Pearson Product Moment.
2. Draw a scatterplot between the two variables to determine whether the two
variables have a linear relationship.
To illustrate how to perform these two correlational tests, the file Ch5correla-
tion.sav will be used (downloadable from the Companion Website for this book).
This data set comprises the scores of 50 students who took an English proficiency
test that focused on listening, grammar, vocabulary, and reading skills. Figure 5.7
presents a screenshot of one of the worksheets in this data file.
TABLE 5.1 Descriptive statistics of the listening, grammar, vocabulary, and reading scores
(N = 50)

                          Listening   Grammar   Vocabulary   Reading
N          Valid             50         50         50          50
           Missing            0          0          0           0
Mean                        8.60      14.42      13.82        8.58
Median                      7.00      13.00      12.00        7.00
Mode                        6.00      13.00      10.00        7.00
Std. Deviation              4.41       6.51       6.33        4.58
Skewness                    0.79       0.97       0.86        0.91
Std. Error of Skewness      0.34       0.34       0.34        0.34
Kurtosis                   –0.57      –0.13      –0.12       –0.13
Std. Error of Kurtosis      0.66       0.66       0.66        0.66
In the dialog that appears, choose Simple Scatter and click the
Define button to access the Simple Scatterplot dialog. Move ‘Lis-
tening Score’ to the Y Axis field and ‘Grammar Score’ to the X Axis field
(Figure 5.9).
FIGURE 5.10 A scatterplot displaying the values of the listening and grammar scores
Figure 5.10 shows the scatterplot obtained for the listening and grammar scores
variables.
SPSS does not produce a fit line by default. To add one, double-click
the scatterplot in the SPSS output. A new dialog will appear (see
Figure 5.11). In the Element menu of this new window, choose Add Fit Line at
Total.
TABLE 5.2 Pearson product moment correlation between the listening scores and grammar
scores
TABLE 5.3 Spearman correlation between the listening scores and grammar scores
The settings for Test of Significance and Flag significant correlations are preselected, and
these should be left unchanged. Chapter 6 will discuss the test of significance. At
this stage, it is sufficient to know that the default settings are acceptable.
Table 5.2 presents the SPSS output for the Pearson Product Moment correla-
tional analysis.
According to Table 5.2, the Pearson Product Moment correlation coefficient
was 0.82 (R2 = 0.67). For the purpose of comparison, Table 5.3 presents the SPSS
output for the Spearman correlational analysis.
In Table 5.3, the Spearman correlation coefficient was smaller than the Pearson
Product Moment coefficient (0.73 versus 0.82). This is because the Spearman
analysis ranked the variables before it analyzed them, and so some information
will have been lost. According to Tables 5.2 and 5.3, both the Pearson Product
Moment and Spearman correlations suggest a strong positive correlation between
the listening scores and the grammar scores (with correlation coefficients of 0.82
and 0.73, respectively).
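The contrast between the two coefficients can also be reproduced outside SPSS.
The following sketch uses invented listening and grammar scores (not the
Ch5correlation.sav data) to compute both statistics with scipy:

    from scipy.stats import pearsonr, spearmanr

    # Invented listening and grammar scores from the same participants
    listening = [4, 6, 6, 7, 9, 11, 13, 15, 16, 19]
    grammar = [8, 10, 13, 13, 14, 17, 20, 22, 25, 28]

    r, p_r = pearsonr(listening, grammar)
    rho, p_rho = spearmanr(listening, grammar)
    # Spearman ranks the data first, so rho typically differs from Pearson's r
    print(round(r, 2), round(rho, 2))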
Summary
This chapter has introduced correlation as a measure of the relationship between
two variables. There are different types of correlational analyses, which depend
on the nature of the variables (interval/ordinal/nominal). This chapter has pre-
sented how to compute Pearson Product Moment correlation and Spearman’s rho
Review Exercises
To download review questions and SPSS exercises for this chapter, visit the Com-
panion Website: www.routledge.com/cw/roever.
6
BASICS OF INFERENTIAL STATISTICS
Introduction
Inferential statistics are used in L2 research to draw conclusions about a popula-
tion of interest from a sample of that population. This chapter focuses on the basic
notions of inferential statistics, including sampling, correlation coefficients, and
how researchers can use probability to quantify how likely it is that their conclu-
sions about the population of interest are correct. It is important that researchers
are fully aware of and disclose the limitations of their research, so they need to
understand the factors that limit the validity of their results. These factors include
the way in which samples are selected, sample size, and the strength of the effect.
Researchers need to ensure that the sample they use in
their research is representative of the population, so how these samples are selected
(sampling) is a critical part of the quantitative research process (see e.g., Scheaffer,
Mendenhall, Ott & Gerow, 2012).
In quantitative research, it is frequently desirable that random sampling be
employed. In this type of sampling, each member of the target population has an
equal chance of being chosen. A random sampling technique is highly desirable
when researchers aim to generalize their research findings from a sample study to
the wider population. In L2 research, random sampling can be difficult to achieve
and the samples are often selected on the basis of how convenient they are for
researchers to obtain. When researchers use easily obtainable participants for their
research (e.g., a group of students they are teaching), the sampling technique may
be described as convenience sampling. Convenience sampling is unlikely to lead
to a representative sample, which is a major drawback when researchers wish to
make inferences about the population of interest. This problem can be avoided
by narrowly defining the target population on the basis of the sample and hence
treating this group of learners as the population of interest (e.g., EFL students in
an English for engineering course at a Vietnamese university), but the results
of such a study will have limited scope for generalization and usefulness, and will
be beset by bias as the researchers have no guarantee that the selected participants
are representative of students who typically take the course. Some quantitative
researchers may describe their convenience sampling method as purposive (i.e.,
selective) sampling, which underlines the fact that their claims or generalizations
from their research findings will be limited to populations comprised of members
very similar to the actual sample.
In practice, a population of interest may comprise different proportions of
sub-populations, and researchers may need to adopt a sampling method that ensures
that those sub-populations are represented equally in the research sample. This is
known as stratified random sampling. For example, researchers may wish to ensure that
a sample includes equal numbers of high, intermediate, and low proficiency levels.
Researchers may first divide students into sub-groups based on their proficiency lev-
els, and then randomly choose equal numbers of participants from each sub-group
to form a total sample. This technique allows researchers to ensure that the sample
contains all proficiency levels, which may not be achieved by using a random sam-
pling technique. It is important to note the distinction between random sampling
and random assignment. Random assignment is a required condition for experimental
research (see Phakiti, 2014). When random assignment is employed, research par-
ticipants are randomly assigned into groups (e.g., experimental or control groups),
but these groups need to be equivalent in every respect except for the experimental
treatment, which is given to the experimental group only. For a further discussion
of sampling techniques in an applied linguistics context, see Blair and Blair (2015),
or Hudson and Llosa (2015) for an in-depth discussion.
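The difference between simple random and stratified random sampling can be
made concrete in a short pandas sketch. The population, column names, and group
sizes below are invented for illustration:

    import pandas as pd

    # An invented population with unequal proportions of proficiency levels
    students = pd.DataFrame({
        "id": range(1, 91),
        "level": ["low"] * 50 + ["intermediate"] * 30 + ["high"] * 10,
    })

    # Simple random sampling: every student has an equal chance of selection
    simple = students.sample(n=15, random_state=1)

    # Stratified random sampling: 5 participants drawn at random per level
    stratified = students.groupby("level", group_keys=False).sample(n=5, random_state=1)
    print(stratified["level"].value_counts())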
It is important to stress that all sampling methods are prone to sampling error
or bias. That is, participants in a sample group can never perfectly represent the
population. Statistics are used as a tool to help researchers understand the char-
acteristics of the target population or research participants, but they are based on
probability analysis, so researchers cannot claim that their findings are absolute,
but merely likely. How this likelihood can be quantified will be seen later in this
chapter.
TABLE 6.1 Correlation between verb tenses and prepositions in a grammar test
Sample Size
Researchers conduct empirical studies because they want to draw conclusions
about the population of interest. Not all L2 learners can be included in a study
because there are too many of them, so samples are taken instead (i.e., researchers
take groups of L2 learners they believe to be representative of the larger popula-
tion). Findings are generally more trustworthy if they are based on large samples,
but such samples may be difficult to obtain due to resource limitations; it is both
time-consuming and costly to recruit and administer a large number of partici-
pants. Large samples are generally preferable to small samples as small samples may
not be able to capture a sufficiently wide range of characteristics of the popula-
tion. For example, in a normal distribution, around 95% of the data will lie within
two standard deviations of the mean. If the sample is too small, it is likely that the
data at the extremes (e.g., that associated with exceptionally strong or exception-
ally poor students) will be underrepresented. If a sample size of 10 is used, for
example, it is impossible for the population to be accurately represented as choos-
ing no exceptional participants would be an underestimation, and choosing one
or more would be an overestimation. Figure 6.1 shows a normal distribution. The
students represented on the far right-hand side are the extremely strong language
learners, while the ones on the far left-hand side are the extremely weak ones; the
vast majority of students lie between these two extremes. All parametric statistics
(e.g., Pearson’s r, t-test, or ANOVA) assume that the target construct (e.g., language
ability) is normally distributed in the population from which the sample is drawn.
Effect Size
There are two issues associated with the effect size to be considered in inferential
statistics. The first has to do with the chance of detecting a relationship or differ-
ence through statistical analysis when such a relationship actually exists. This is
strongly influenced by sample size and is closely related to statistical significance
(e.g., p < 0.05). The second has to do with the magnitude of the effect size that
needs to be reported and interpreted in research findings. This is related to the
question of whether the relationship or difference is meaningful or has practical
relevance. Both considerations are discussed in the next section.
TABLE 6.2 Explanations of the relationship between the sample size and the effect

If researchers want to detect:
• a small effect with a small likelihood of error, they need a large sample.
• a medium effect with a small likelihood of error, they need a medium sample.
• a strong effect with a small likelihood of error, they need a small sample.
There is an interaction between significance, sample size, and effect size. If a strict significance level is set, a large
sample will be required to be able to draw conclusions, or only effects that are strong
will be able to be investigated.
Given the interaction between sample size, effect size and significance, it is
impossible to say what the ‘perfect’, or even the ‘minimum’ sample size should be.
The general rule of thumb is that the sample should have at least 30 participants,
but this may not be necessary if the effect can be expected to be strong and the
significance level is liberal. Conversely, a much greater sample size may be required
when the effect is expected to be weak and a strict significance level has been set.
Some statistical procedures, especially highly complex ones, also often require large
samples to render stable results.
The null hypothesis usually contains a word such as ‘no’ or ‘not’. In all statistical
investigations, these two hypotheses exist. What they imply is shown in Table 6.3.
Technically, in a statistical study, researchers test the null hypothesis using the
empirical data they have collected. They assume initially that the null hypothesis
is correct, and then conduct the study to test that assumption. Only if they are
certain beyond a reasonable doubt that the null hypothesis is incorrect do they
reject it.
Since to accept or reject is an either-or decision, significance is also an either-or
proposition. That is, either a result is significant or it is not. There is no middle
ground, and therefore, it is not possible to talk meaningfully about a result being
really significant, nearly significant, almost significant, or totally insignificant. It is
significant or it is nonsignificant. Those are the only two options. The p-value of
each finding of a study needs to be below the significance level that has been set for
the study. So at the beginning of the study, the researcher might be satisfied with a
likelihood of error of 5%, and therefore set the significance level at a p-value of 0.05.
That is known as setting the alpha level (this must not be
confused with Cronbach’s alpha, which is used in reliability analysis). The p-value
associated with any inferential statistic, such as a correlation, needs to be below 0.05
for the result to be significant, but since significance is an either-or proposition, it
does not actually matter how far below 0.05 the result is. For example, p = 0.04 and
p = 0.001 both simply count as ‘significant’ at the 0.05 level; the latter is not ‘more
significant’ than the former.
These considerations explain why, traditionally, significance levels are not reported
to three decimal places (e.g., p = 0.029), but are reported only with regard to the
pre-set significance level (e.g., p < 0.05). From the point of view of statistical logic,
it makes more sense to report p < 0.05, but since computer programs provide sig-
nificance levels to a greater number of decimal places, it is becoming increasingly
common for researchers to report the p-value to three decimal places.
Summary
Inferential statistics seek to interpret raw empirical data. To use them effectively,
researchers require logical reasoning and an understanding of statistical probabil-
ity. Good quantitative research can be appropriately conducted when researchers
understand the conceptual basics of inferential statistics (e.g., population and sam-
pling, probability, statistical significance, sample size, and effect sizes). The next
chapter further discusses inferential statistics by focusing on t-tests, which are used
to compare the mean scores of two samples.
Review Exercises
To download review questions and exercises for this chapter, visit the Companion
Website: www.routledge.com/cw/roever.
7
T-TESTS
Introduction
The statistical procedures discussed in Chapter 5 are designed for research-
ers to find relationships between variables. Such relationships are investigated
by asking research questions, such as ‘are vocabulary knowledge and reading
comprehension related?’ (e.g., Guo & Roehrig, 2011), or ‘is proficiency related
to use of collocations?’ (e.g., Laufer & Waldman, 2011). However, sometimes
researchers are not interested in relationships, but in differences. For example,
Doolan and Miller (2012) examined whether generation 1.5 writers make more
errors in their English essay writing than L1 writers. In Doolan and Miller’s
study, a group of generation 1.5 students (i.e., L2 speakers who have resided in
the target-language country for an extended period), and a group of L1 English
speakers wrote an essay based on the same prompt. Essays were rated, analyzed
for errors, and then ratings and mean numbers of errors were compared between
the two groups of writers. Kormos and Trebits (2012) investigated whether
modality (i.e., written versus spoken) affected task performance by L2 learners.
The researchers asked a group of EFL learners to describe a cartoon orally, and
then a month later asked them to describe a similar cartoon in writing. Learners’
accuracy, fluency, syntactic complexity, and lexical variety were measured and
analyzed for each description; the oral and the written descriptions could then
be compared.
In these two examples, the researchers were interested in differences: in the
first case between the number of errors made by L1 and generation 1.5 writ-
ers, and in the second between two descriptions produced by the same group of
learners, one oral and one written. To make such comparisons, researchers can
run a procedure known as a t-test. Two types of t-tests will be presented in this
chapter. In Doolan and Miller’s (2012) study, the researchers used a t-test known
as the independent-samples t-test because the performances of two different groups
of participants in the completion of the same task were compared. In Kormos and
Trebits’s (2012) study, however, the researchers used the paired-samples t-test because
two performances on two different tasks by the same group of participants were
compared. The paired-samples t-test is also called the dependent t-test. The paired-
samples t-test is related to a repeated-measures research design (hence it is also
called repeated-measures t-test). However, in this book, the term paired-samples t-test
is used, as it is consistent with SPSS.
TABLE 7.1 Mean and standard deviation of error counts for generation 1.5 learners and
L1 writers (based on Doolan & Miller, 2012, p. 7)
A significant result for an independent-samples t-test implies that the group means are statistically
different. It could, therefore, be concluded that the difference in the background
variable (generation 1.5 status) affected the outcome measure (i.e., the error
count).
There is still the possibility that the background variable being used is a proxy
for another underlying variable that is the actual reason for the outcome. So, for
example, if an independent-samples t-test indicates significant differences in TEP
scores between test takers with and without residence, it might not be residence
itself that causes the difference in scores, but a host of associated factors, such as
higher proficiency going hand-in-hand with residence, or self-selection of high-
ability test takers going abroad. Which factors actually lead to this significant
difference cannot be answered simply by the use of the t-test, but requires further
thorough investigation.
In a pretest-posttest design, the same learners take both tests, and the paired-samples t-test shows whether learners’ scores on the posttest
are significantly different from those on the pretest (e.g., whether the posttest
performance is higher). If the posttest scores are significantly higher than the
pretest scores, the researchers may be able to conclude that the experimental
treatment was the reason for the increase. An example of the use of a depen-
dent t-test is Kormos and Trebits’s (2012) study, in which the researchers gave
a group of 44 Hungarian high school students a cartoon description task and
a picture narration task, first as oral tasks, and a month later as written tasks
with no intervening treatment. The researchers then conducted comparisons
on measures of lexical variety, syntactic complexity, fluency, and accuracy
between:
• the oral and the written cartoon description tasks; and
• the oral and the written picture narration tasks.
All these comparisons involve paired-samples t-tests because it was always the
same participants providing data on both tasks. The researchers found, as Table 7.2
shows, that participants produced significantly more error-free clauses in their
written cartoon descriptions than in their oral cartoon descriptions.
In general, a significant result for the paired-samples t-test suggests that the
means for the two measures are significantly different (with less than a 5% chance
of error). From this finding, it may be concluded that the students found writing
a cartoon description easier than providing the description orally. To explain this
finding, it could be hypothesized that the offline nature of writing, the possibility
of revising and correcting errors, and the more formal atmosphere of a written
test setting may have led to a greater focus on accuracy, resulting in fewer errors.
However, the statistical result does not inform the researchers what the reason for
this outcome was, and researchers would have to do further research to pinpoint
what it was about writing that made it more accurate than speaking, at least for
this type of task.
TABLE 7.2 Mean and standard deviations of ratios of error-free clauses in the cartoon
description task for both modalities (adapted from Kormos & Trebits, 2012, p. 455, Table 3)
Assumptions of T-Tests
Both types of t-tests require interval or continuous data, and the corresponding
data for the population to which the findings are to be generalized need to be
normally distributed. In independent-samples t-tests, the sizes of the two samples
should not differ greatly, and neither should their variances. SPSS can be used to
check for the violations of this equal variances assumption in independent t-tests
through the running of Levene’s test (to be discussed in the SPSS section). SPSS can
provide a corrected t-test result if the variances differ too greatly. It is desirable that
all samples have at least 30 participants so that small differences may be detected
(as discussed in Chapter 6).
The t-test compares the means of the sample scores, taking into account the sample
sizes and the standard deviations of the scores. The t-test is likely to be significant if:
• the difference between the two sample means is large;
• the samples are large; and
• the standard deviations of the scores are small.
Similar to the chi-square test (discussed in Chapter 12), the t-test produces a
value (simply known as t ) that can only be used to determine statistical significance;
it does not say anything about the size of the difference between the two mean
scores (i.e., the effect size). For that, researchers have to run a separate effect size
calculation to obtain what is known as Cohen’s d (discussed further in the ‘Effect
Size for T-Tests’ section). Since the t-test formula involves the subtraction of one
sample mean from the other, a negative t-test result can be found when the larger
mean is subtracted from the smaller mean. This is not problematic—it is the size
of the t-value that is important.
In the case of the error counts of generation 1.5 versus L1 students in Doolan
and Miller’s (2012) study, d was calculated by dividing the difference between the two
group means by the pooled standard deviation: d = (Mean1 – Mean2) ÷ SDpooled.
The general procedure for conducting an independent-samples t-test can be
summarized in the following steps:
• Step 1: examine and evaluate the descriptive statistics of the data from two
groups and the reliability of the research instrument(s) being used.
• Step 2: check whether the statistical assumptions for the particular t-test are
met. Levene’s test can be used to determine whether the two means have
equal variances (SPSS can perform this statistical test; see the ‘SPSS Instruc-
tions: Independent-Samples T-test’ section).
• Step 3: perform the t-test using SPSS.
• Step 4: determine whether the two group means are statistically significantly
different (e.g., p < 0.05)
• Step 5: Compute Cohen’s d if there is a statistically significant difference
between the two means.
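For readers working outside SPSS, Steps 2 to 4 can be approximated with scipy.
The sketch below uses invented scores (not the TEP data): it runs Levene’s test
first and applies the Welch correction to the t-test only if the equal-variances
assumption fails:

    from scipy.stats import levene, ttest_ind

    # Invented routines scores for two independent groups of test takers
    no_residence = [45, 50, 52, 48, 55, 49, 60, 51]
    residence = [78, 85, 80, 83, 79, 88, 81, 77]

    lev_stat, lev_p = levene(no_residence, residence)
    equal_var = lev_p > 0.05  # a nonsignificant Levene's test means the assumption is met
    t, p = ttest_ind(no_residence, residence, equal_var=equal_var)
    print(round(t, 2), round(p, 4))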
FIGURE 7.1 Accessing the SPSS menu to perform the independent-samples t-test
Click on the Define Groups button to tell SPSS how the two groups
are defined. In the resulting dialog, enter ‘0’ for Group 1 and ‘1’ for
Group 2 (see Figure 7.2).
Note: Defining groups may seem superfluous in the case of a dichotomous
variable (residence/no residence) but the t-test could be run with a group
variable that has several levels (e.g., learners’ L1s), and then it would be
important to define which groups to compare.
The following is the SPSS output from the independent-samples t-test. Table 7.3
presents the descriptive statistics of the two group means.
The group statistics are general descriptive statistics about the two groups. It
can be observed that there was a large difference in the routines scores of the two
groups. The test takers without residence had a mean score of 51.26%, whereas
the ones with residence had a mean score of 81.40%. The next SPSS output
will indicate whether the difference was statistically significant. SPSS presents one
large table with the statistics related to the independent-samples t-test (including
Levene’s test and the t-test for equality of means). For ease of presentation, this
output has been split into two tables. Table 7.4 presents the results from Levene’s
test. Note that this table does not yield the answer to the question regarding the
statistical significance of the difference in means.
In Table 7.4, SPSS reports results under both possible assumptions about the equality
of variances: one row (equal variances assumed) posits that the t-test
condition of equal (or at least similar) group variances was met, and the other row
(equal variances not assumed) assumes that it was not met. In the latter case, SPSS
corrects for the violation of this condition of equal variances. To know whether
the condition of equal variances holds, the result of Levene’s test can be examined.
Levene’s test has as its null hypothesis that variances are equal, so if it is nonsig-
nificant at 0.05, the t-test condition is met. This means that the t-test result for
equal variances can be used. In other words, Levene’s test must not be statistically
significant (i.e., the p-value must be larger than 0.05) in order to say that the data
have met the homogeneity assumption. In this particular SPSS output, Levene’s
test suggests that the p-value was 0.56, which is far above the threshold of the
p-value of 0.05, so it can safely be assumed that the group variances were similar
enough to run the independent-samples t-test without any corrections (hence in
Table 7.4, the row ‘Equal variances not assumed’ was left blank by SPSS). Table 7.5
presents the t-test for equality of means. This output can answer the question of
whether or not the two means were statistically different.
In Table 7.5, the first analysis row is based on the assumption of equal vari-
ances, whereas the second row is based on the assumption of unequal variances.
The second row can be ignored given that the result of Levene’s test was that the
condition of equality of variances was met. As can be seen in the column entitled
t, the t-test result is –9.86, which is not meaningful in itself. This t-value enables
researchers to determine the significance level only (if they use a critical value
table, as discussed in Chapter 6). In this output, SPSS reports the t-test result as
a negative because the higher mean was subtracted from the lower one. So the
negative sign in the t-value can be ignored when writing up a report. Also, df in
this table are needed only for reporting results. The entries in the Sig. (2-tailed)
column indicate whether the difference in means was significant, as it shows the
significance level. Even though SPSS reports the significance value here as 0.00,
it is important to be aware that this value cannot be the real value, but that it can
be assumed to be less than 0.001. This means that there is a low likelihood that it
would be wrong to claim an effect of residence on routines scores.
SPSS does not report values of Cohen’s d. The easiest way to calculate Cohen’s d
is to use an online calculator such as that found in www.uccs.edu/~lbecker/. In this
calculator, the means and standard deviations from the SPSS descriptive statistics
output can be entered (Table 7.3), after which the compute button in the online
calculator should be clicked. Figure 7.3 presents Cohen’s d as well as the effect size r
which is another effect size measure that is not focused on in this book. According
to Figure 7.3, the Cohen’s d effect size was 1.77, which is a large effect size.
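The pooled-standard-deviation formula that such calculators implement can also
be coded directly. In the sketch below, the two means are those shown in Table 7.3,
but the standard deviations and group sizes are invented for illustration:

    import math

    def cohens_d(m1, sd1, n1, m2, sd2, n2):
        # Pooled standard deviation across the two groups
        pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                              / (n1 + n2 - 2))
        return (m1 - m2) / pooled_sd

    # Means from Table 7.3; the SDs (16.0, 18.0) and ns (83, 83) are assumed values
    print(round(cohens_d(81.40, 16.0, 83, 51.26, 18.0, 83), 2))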
FIGURE 7.4 Accessing the SPSS menu to perform the paired-samples t-test
First, select the ‘Implicature score’ and ‘Routines score’ variables and
move them to the Variable 1 and Variable 2 columns in the ‘Paired
Variables’ pane.
In its output, SPSS presents the descriptive statistics of the two variables (Table 7.6).
According to Table 7.6, the implicature test section seems to be easier than the
routines test section. However, it cannot be said yet that the difference was statisti-
cally significant.
Table 7.7 presents the correlation coefficient between the two sets of scores.
It is provided mainly for the researchers’ information and need not be reported.
Table 7.8 presents the paired-samples t-test results. Examine the last three columns
of this table: the t-value was 1.81 with 165 degrees of freedom, and the signifi-
cance level was 0.07, which indicates that the difference between implicature test
scores and routines test scores was not statistically significant (p > 0.05). Therefore,
it cannot be concluded that one section is more difficult than the other. Given the
nonsignificant result, it is not necessary to calculate an effect size measure such as
Cohen’s d.
TABLE 7.8 Paired-samples t-test results (excerpt)

Pair 1: Implicature score – Routines score
Mean difference = 3.47, SD = 24.75, Std. Error Mean = 1.92,
95% CI of the difference [–0.32, 7.27], t = 1.81, df = 165, Sig. (2-tailed) = 0.07
This finding can be reported as follows: ‘the scores on the implicature and
routines test sections were compared through the use of the paired-samples t-test.
No significant difference was found between these two test sections (t(165) = 1.81,
p = 0.07)’.
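Outside SPSS, the same paired comparison takes a few lines with scipy; the paired
scores below are invented, not the implicature and routines data:

    from scipy.stats import ttest_rel

    # Invented paired scores from the same group of participants
    implicature = [70, 65, 80, 58, 74, 69, 77, 63]
    routines = [66, 60, 79, 55, 75, 64, 72, 61]

    t, p = ttest_rel(implicature, routines)
    print(round(t, 2), round(p, 3))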
It should be noted that in the case of statistical significance, researchers should
compute Cohen’s d. There is disagreement in the literature about whether
Cohen’s d for paired-samples t-tests needs to consider the correlation between the
two variables (see Lakens, 2013), which we showed in Table 7.7. Becker’s effect
size calculator does not take correlation into account, but Melody Wiseheart’s cal-
culator (www.cognitiveflexibility.org/effectsize/) can do so. For consistency and
simplicity, researchers can compute the paired-samples t-test effect sizes in the
same way they do for the independent-samples t-test. That is, there is no need to
integrate the correlation value into the computation.
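The two options can be contrasted in a short sketch with invented scores: one d
averages the two group standard deviations and ignores the correlation, while d_z
builds the correlation in by using the standard deviation of the difference scores
(the terms follow Lakens, 2013):

    import math
    import statistics

    pre = [70, 65, 80, 58, 74, 69, 77, 63]   # invented paired scores
    post = [66, 60, 79, 55, 75, 64, 72, 61]

    mean_diff = statistics.mean(pre) - statistics.mean(post)
    sd1, sd2 = statistics.stdev(pre), statistics.stdev(post)

    d_av = mean_diff / math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)  # ignores the correlation
    diffs = [a - b for a, b in zip(pre, post)]
    d_z = mean_diff / statistics.stdev(diffs)                # incorporates the correlation
    print(round(d_av, 2), round(d_z, 2))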
Summary
This chapter has explained the underlying principles behind the two types of
t-tests (the independent t-test and paired-samples t-test). It has illustrated how to
compute them in SPSS. Effect size calculations for t-tests have been explained and
presented. The next chapter presents the nonparametric versions of the two t-tests.
Review Exercises
To download review questions and exercises for this chapter, visit the Companion
Website: www.routledge.com/cw/roever.
8
MANN-WHITNEY U AND WILCOXON SIGNED-RANK TESTS
Introduction
Nonparametric tests are useful in the analysis of data that do not meet the condi-
tions required for parametric tests, for example, if researchers are working with
small sample sizes or ordinal / rank data, or if the assumption of normally distrib-
uted data may not be justified. This chapter presents nonparametric alternatives to
t-tests, namely the Mann-Whitney U test, which is analogous to the independent-
samples t-test, followed by the Wilcoxon Signed-rank test, which is analogous to
the paired-samples t-test.
whether high-, mid-, and low-ranked participants from each group are evenly
distributed in the pooled group. If each group has some high-ranking partici-
pants, some mid-ranking participants, and some low-ranking participants, the
two groups are not likely to be significantly different. However, if one group
has a lot of high-ranking participants but few mid- and low-ranking partici-
pants, and the other group has very few high-ranking participants, but a lot of
mid- and low-ranking ones, then the two groups are likely to be significantly
different. The U-value from this analysis is an index that helps researchers find
the relevant significance level. SPSS provides both the U-value and Z-value, and
both are reported in some studies. The Z-value can be used to compute an effect
size for the Mann-Whitney U test. Corder and Foreman (2009, p. 59) suggest
a simple formula to calculate the effect size for the Mann-Whitney U test (r)
as follows:
r = Z ÷ √N, where N is the total sample size
It should be noted that the effect size r here is not the same as a correlation coef-
ficient. According to Cohen (1988), r = 0.10 is considered a small effect size, r =
0.30 is considered a medium effect size, and r = 0.50 is considered a large effect size.
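The whole sequence can be sketched outside SPSS with scipy: run the test, derive
an approximate Z from U via the normal approximation (tie corrections omitted
for simplicity), and apply Corder and Foreman’s formula. The scores below are
invented:

    import math
    from scipy.stats import mannwhitneyu

    group1 = [12, 15, 14, 10, 18, 20, 16, 13]  # invented scores for two groups
    group2 = [8, 9, 11, 7, 10, 12, 9, 8]

    u, p = mannwhitneyu(group1, group2, alternative="two-sided")
    n1, n2 = len(group1), len(group2)
    # Normal approximation of Z from U (no tie correction)
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    r = abs(z) / math.sqrt(n1 + n2)
    print(round(z, 2), round(r, 2))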
Doolan and Miller (2012), for example, used the Mann-Whitney U test to
detect differences between generation 1.5 writers and L1 writers in the frequency
of occurrence of a variety of error types. They found no significant difference
between the groups in their word choice errors, although generation 1.5 writers
made more word choice errors. However, they did find a statistically significant
difference in verb errors. Table 8.1 presents their Mann-Whitney U test results
(adapted from Doolan & Miller, 2012).
The authors did not report the r effect sizes in this table, but they can be easily
calculated. The r effect size for the word choice errors was 0.14 (i.e., 1.13 ÷ √61),
which is small, and the r effect size in the case of the verb errors was 0.47 (i.e.,
3.64 ÷ √61), which is considered medium-to-large. It should be noted that when
the test does not produce a statistically significant result, the r effect size does not
have to be calculated.
TABLE 8.1 Mann-Whitney U test results (adapted from Doolan & Miller, 2012, Table 2,
p. 7)

Error type     Gen 1.5 M (SD)   L1 M (SD)     Mean rank     Mean rank   Z       p
                                              (Gen 1.5)     (L1)
Wrong word     2.63 (2.24)      2.00 (1.89)   32.76         27.40       –1.13   .260
Verb error     6.24 (5.61)      1.50 (1.50)   36.72         19.28       –3.64   .001∗
Click the Define Groups button to tell SPSS how the two groups are
defined. Enter ‘0’ for Group 1 and ‘1’ for Group 2. Note that 0 rep-
resents females and 1 represents males in this data set. Click on the Continue
button.
Table 8.2 presents the descriptive statistics produced by SPSS. You can ignore the
descriptive statistic for the gender variable, which makes no sense as it consists of
nominal data (see Chapter 3). In Table 8.2, the mean score for the total test score
was 48.48 (SD = 9.94).
Table 8.3 presents the mean ranks using the total test score. In this table, the
mean ranks for female and male test takers were 28.73 and 19.48 respectively.
Table 8.4 presents the Mann-Whitney U test statistics. In order to determine
whether the two groups significantly differed in their total test score, first examine
the Z-value and the Asymp. Sig (2-tailed) value. It can be seen that there was a sta-
tistically significant difference between the female and male test takers in the total
test score (Z = –2.32, p = 0.02, r = 0.34). SPSS does not produce the r-effect size,
so this needs to be calculated using the formula provided in the ‘Mann-Whitney U Test’ section. The effect size is medium in this case. It should be noted that, similar to Cohen’s d, the r-effect size can take a negative value, but the negative sign in the effect size can be ignored.

TABLE 8.4 Mann-Whitney U test statistics for the total score
Mann-Whitney U 155.50
Wilcoxon W 506.50
Z –2.318
Asymp. Sig. (2-tailed) 0.02
The Mann-Whitney U test result can be reported as follows: ‘The total scores of
female and male test takers were compared through the use of the Mann-Whitney
U test. Female test takers had significantly higher total test scores than their male
counterparts (Z = 2.32, p = 0.02, r = 0.34), and the effect size of the difference
was medium. It can be concluded that gender may play a role in determining suc-
cess in reading test performance’.
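For readers who also work outside SPSS, the same analysis can be sketched in Python with scipy. The scores below are hypothetical stand-ins, not the actual TEP data, and the normal approximation for Z shown here ignores tie corrections, so results may differ slightly from SPSS output:

```python
import math
from scipy.stats import mannwhitneyu

# Hypothetical total test scores (the actual TEP data are not reproduced here)
females = [52, 58, 49, 61, 55, 47, 63, 50]
males = [44, 51, 39, 48, 42, 46, 40, 37]

u, p = mannwhitneyu(females, males, alternative="two-sided")

# Z via the normal approximation (no tie correction), then r = Z / sqrt(N)
n1, n2 = len(females), len(males)
z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
r = abs(z) / math.sqrt(n1 + n2)
print(f"U = {u}, p = {p:.3f}, Z = {z:.2f}, r = {r:.2f}")
```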
The researchers reported Cohen’s d instead. In this study, Cohen’s d ranged from 0.97 to 1.57, which implied
large effect sizes. The findings suggest that when learners pay attention to the
language areas they are learning, they are likely to learn them more successfully
than when they do not pay attention. However, another intriguing finding from
this study was that there seemed to be an interaction between the effects of
attention and learners’ proficiency levels (as determined by the number of years
of study, i.e., first-, second-, and third-year levels). For example, the impacts of
+focused attention on the three grammatical areas were pronounced among the
first- and second-year students, but became more complex for the third-year
students. That is, for the third-year group, there was no significant gain in the
three grammatical areas, whether attention was given or not. The researchers suggested that, for the third-year students, when attention was not given, learners might have used their own learning resources to help them learn the target language features.
Click the Options button to open the dialog shown in Figure 8.4,
then select Descriptive and click the Continue button. There is no
need to change other defaults.
Table 8.5 presents the descriptive statistics produced by SPSS. In this table, the
mean scores ranged from 3.14 (memory) to 3.60 (monitoring).
Table 8.6 presents the ranks statistics, which compare comprehending strategy
use with the use of other strategies. In Table 8.6, the label negative ranks refers to
the observation that a test taker reported less use of the strategy being compared
(e.g., memory) than the use of the comprehending strategy. The label positive ranks
refers to the observation that a test taker reported higher use of the strategy being
compared than the use of the comprehending strategy. The label ties indicates that
the use of comprehending and the compared strategy were equal. According to
Table 8.6, the use of the comprehending strategy was reported to be higher than
the use of the memory, retrieval, and planning strategies, but lower than the use
of the monitoring and evaluating strategies. However, at this stage it is not known whether these differences were statistically significant.
Table 8.7 presents the Wilcoxon signed-rank test statistics. As SPSS does not
produce the r-effect sizes, these need to be calculated using the formula provided
for the Mann-Whitney U test in the ‘Mann-Whitney U Test’ section. In order
to determine whether a pair of strategies significantly differed from each other,
the Z-value and the Asymp. Sig (2-tailed) value should be examined. According
to Table 8.7, there was a statistically significant difference between the use of the
memory and comprehending strategies (Z = –3.07, p < 0.001, r = –0.45), and between the planning and comprehending strategies (Z = –2.25, p = 0.02, r = –0.33).
The Wilcoxon Signed-rank test results can be reported as follows: ‘Test takers’
reported use of comprehending strategies was compared to that of other reported
strategies (e.g., memory, retrieval, and planning) through the Wilcoxon Signed-
rank test. It was found that there was significantly higher use of comprehending
strategies compared to that of the memory (Z = 3.07, p < 0.001, r = 0.45, medium effect) and planning strategies (Z = 2.25, p = 0.02, r = 0.33, medium effect). Comprehending strategy use did not
significantly differ from retrieval, monitoring, or evaluating strategy use. Accord-
ing to the Wilcoxon Signed-rank test results, it may be concluded that in this
reading comprehension test, test takers reported using comprehending strategies
significantly more frequently than using memory and planning strategies, but the
frequency of use of comprehending strategies was not significantly different from
that of the retrieval, monitoring and evaluating strategies.’
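A comparable sketch for the Wilcoxon Signed-rank test is shown below, again with hypothetical paired ratings rather than the actual questionnaire data:

```python
from scipy.stats import wilcoxon

# Hypothetical paired strategy-use ratings (1-5 scale) from the same test takers
comprehending = [4, 3, 5, 4, 4, 3, 5, 4, 3, 4]
memory        = [3, 3, 4, 2, 3, 3, 4, 3, 2, 3]

stat, p = wilcoxon(comprehending, memory)
print(f"W = {stat}, p = {p:.3f}")
# SPSS reports a Z-value for this test; r can then be computed as
# r = Z / sqrt(N), where N is the number of paired observations.
```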
Summary
This chapter has explained two nonparametric tests analogous to the independent-
samples and paired-samples t-tests, and illustrated how to compute them in SPSS.
These two nonparametric tests are suitable for the analysis of ordinal and non-
normal data. If the sample size is small, researchers may use these tests to explore possible differences between data sets. These tests are useful as alternatives to the
t-tests if some assumptions of the t-tests cannot be met. The next chapter presents
the one-way ANOVA, which is an extension of the independent-samples t-test to compare three or more groups of people.
Review Exercises
To download review questions and exercises for this chapter, visit the Companion
Website: www.routledge.com/cw/roever.
9
ONE-WAY ANALYSIS OF
VARIANCE (ANOVA)
Introduction
The independent-samples t-test allows researchers to compare two different groups
of participants measured with the same instrument. However, if there are more
than two groups of language learners or test takers, the independent-samples
t-test cannot be used. In this chapter, an extension of the independent t-test to an
inferential statistic called analysis of variance, commonly abbreviated as ANOVA,
is introduced. This chapter focuses on one-way ANOVA and its alternative non-
parametric test, namely the Kruskal-Wallis test, which can be used for ordinal data
and data that do not exhibit a normal distribution.
One-Way ANOVA
One-way ANOVA functions in a similar way to the independent-samples t-test,
but instead of two groups, it can examine the differences among three or more
groups based on one background variable that distinguishes participants (e.g.,
native languages, proficiency levels, and experimental conditions). The presence
of only one independent variable to distinguish participants is the reason that this
test is called the ‘one-way’ ANOVA. If there are two independent variables (e.g.,
native language and gender), it is called a ‘two-way’ ANOVA, and so on. This
chapter deals only with one-way ANOVA.
Once learner groups have been created that differ on the independent
variable, one-way ANOVA compares research participants on one outcome
variable (e.g., test score) to see if differences exist between the groups on that outcome. One-way ANOVA requires a sufficient number of participants per level, so that each group should have at least 30 participants (as discussed in
Chapter 6). Although one-way ANOVA can be run with fewer than 30 partici-
pants per group, the results are generally more trustworthy if the sample size per
group is larger. The sizes of the groups should also be roughly similar.
The outcome variable in one-way ANOVA should be interval and have a
broad range of scores or data. The expected scores for the underlying popula-
tion should be normally distributed. Test scores are a typical outcome measure.
The score variances of each group should not be too different, although one-way
ANOVA can correct for unequal variances through its post hoc tests.
In its analysis, one-way ANOVA compares the differences between group
means with the differences between participants within groups. If there are large
differences between group means while the scores within groups are highly
homogeneous and have small standard deviations, the one-way ANOVA outcome
is likely to be significant. By contrast, the more similar the group means are, and
the more widely the individuals’ scores within groups are spread out, the less likely
it is that the one-way ANOVA outcome will be statistically significant.
The outcome of a one-way ANOVA is the F-value, which allows researchers to
determine whether the analysis is statistically significant. The F-value is a number
that researchers can use to look up the significance level in a table of critical val-
ues. However, in SPSS, the significance level is calculated automatically.
The F-value can be any positive value and does not take on negative values,
unlike the t-value. For strong effects, the F-value is often found to be quite high.
By convention, the degrees of freedom between groups (df1), and the degrees of
freedom within groups (df2) are reported with the F-value set apart by a comma
as F(df1,df2). For example, in Ko’s (2012) study, she had three groups and a total
of 90 participants, so she reported the F-value as F(2,87) = 73.21. Here, 2 is df1
(i.e., the total number of groups minus 1) and 87 is df2 (i.e., the total number of participants minus the number of groups being compared).
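The df arithmetic and the F-value can be illustrated with a short sketch; the group scores below are hypothetical, and only the degrees-of-freedom logic mirrors the reporting convention described above:

```python
from scipy.stats import f_oneway

# Three hypothetical groups of scores
g1 = [12, 15, 14, 10, 13]
g2 = [18, 17, 20, 16, 19]
g3 = [22, 25, 21, 23, 24]

f, p = f_oneway(g1, g2, g3)
k = 3                          # number of groups
n = len(g1) + len(g2) + len(g3)
df1, df2 = k - 1, n - k        # df between groups, df within groups
print(f"F({df1},{df2}) = {f:.2f}, p = {p:.4g}")
```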
Post hoc test results are normally reported in the text of a Results section. Post hoc tests are not needed when the one-way
ANOVA result is nonsignificant, as a nonsignificant ANOVA result means that
there are no significant differences between the comparison groups.
Researchers may obtain a significant ANOVA result, but subsequently the post
hoc test does not show there to be significant differences between any of the
groups. This usually occurs when the one-way ANOVA result itself is significant
by only a small margin. In that case, the additive group differences may be large
enough to reach significance, but when groups are considered pairwise, the dif-
ferences may not be sufficiently large for the post hoc test to show significance.
One-way ANOVA and post hoc tests need to be conducted instead of several
independent-samples t-tests, which would compare each group with each of the
others, leading to an increased chance of a type I error. This is because every time
the independent-samples t-test is performed, there is a 5% chance of a type I error
in rejecting the null hypothesis when it is true (see Chapter 6). In other words,
there is a 95% chance of being right in rejecting it. If two computations of the
independent-samples t-test are performed from the same data set, and the null
hypothesis is rejected both times, the likelihood of being right both times is now
about 0.9025 (i.e., 0.95 × 0.95), so that there is now a 90.25% chance of being
right in rejecting the null hypothesis. This likelihood leaves a 9.75% likelihood of
being wrong in rejecting the null hypothesis. If there are three comparison groups
and three consecutive independent-samples t-tests are performed on the same data
set, the likelihood of being correct in rejecting all the three null hypotheses is now
85.7% (i.e., 0.95 × 0.95 × 0.95), and the likelihood of falsely rejecting at least one null hypothesis is therefore 14.3%. As the number of comparisons increases, this chance of a type I error keeps growing: if there are five comparison groups, 10 independent-samples t-tests would need to be done, and there would be close to a 40% chance (i.e., 1 – 0.95^10 ≈ 0.40) of falsely rejecting at least one null hypothesis.
One-way ANOVA and its post hoc tests have a correction for multiple tests built in,
so the increased chance of a type I error occurring is better minimized.
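The inflation of the type I error rate across multiple tests is easy to verify numerically; a minimal sketch:

```python
# Probability of at least one type I error across k independent
# t-tests, each run at alpha = .05 (the familywise error rate)
alpha = 0.05
for k in (1, 2, 3, 10):  # 10 pairwise t-tests are needed for 5 groups
    print(k, round(1 - (1 - alpha) ** k, 4))
# 1 -> 0.05, 2 -> 0.0975, 3 -> 0.1426, 10 -> 0.4013
```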
• If the η2 value is 0.1, then 10% of the overall variance is due to the back-
ground variable, and 90% is due to other factors. This is normally considered
a small effect size.
• If the η2 value is 0.3, then 30% of the overall variance is due to the back-
ground variable, and 70% is due to other factors. This is normally considered
a medium effect size.
• If the η2 value is 0.5, then half the overall variance is due to the background
variable and half is due to other factors. A η2 value of 0.5 or above is nor-
mally considered a large effect size.
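Eta squared is conventionally computed from the ANOVA sums of squares as the between-groups sum of squares divided by the total sum of squares. A minimal sketch with hypothetical values:

```python
# Eta squared = SS_between / SS_total (values here are hypothetical)
ss_between = 30.0
ss_total = 100.0
eta_squared = ss_between / ss_total
print(eta_squared)  # 0.3 -> 30% of the variance, a medium effect size
```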
To illustrate how to perform one-way ANOVA, the TEP data are used (available
in Ch9TEP.sav, which can be found on the Companion Website). In L2 pragmatics
research, the impact of general proficiency on pragmatic knowledge is a recurring
research question (see Taguchi & Roever, 2017, for further details). Researchers may
ask the question ‘does proficiency level impact pragmatic knowledge?’ TEP scores for
test takers at five proficiency levels (i.e., beginner, advanced beginner, low intermedi-
ate, upper intermediate, and advanced) can be compared using one-way ANOVA.
That is, the background variable is the test takers’ proficiency level, and the outcome
variable is their TEP score. A one-way ANOVA with the proficiency level (level) as
the independent, or background, variable (or ‘factor’ in SPSS terms), and total score
(totalscore) as the dependent or outcome variable will be performed.
Click Analyze, next General Linear Model, and then Univariate (see
Figure 9.1).
In the dialog that appears, move ‘total score’ from the left pane to
the Dependent Variable field and ‘proficiency level’ to the Fixed
Factor(s) field (see Figure 9.2).
FIGURE 9.1 SPSS menu to launch a one-way ANOVA
Click the Post Hoc button. In the Univariate: Post Hoc . . . dialog that
appears, move ‘level’ from the ‘Factors’ pane to the ‘Post Hoc Tests
for:’ pane. Select Scheffé and Tamhane T2, as shown in Figure 9.3.
Notes: Which post hoc test is to be used depends on whether Levene’s test is significant. Both post hoc tests are chosen at this point because it is not yet known if Levene’s test will show significance (see the “Post Hoc Tests” sec-
tion). The post hoc tests are not needed if the ANOVA results are nonsignificant.
Click the Continue button to confirm these choices, then click the
Options button.
Di Silvio et al. (2014) collected data from 152 US college
students who spent a semester studying abroad in Peru, Chile, China, or Russia.
In addition to giving students a simulated oral proficiency test before and after the
study-abroad experience, the researchers administered a questionnaire on students’
perceptions of their study-abroad experiences. The questionnaire statements were
accompanied by five response options, ranging from ‘strongly agree’ to ‘strongly
disagree’. Cognizant of the ordinal nature of Likert-type scales, Di Silvio et al.
employed the Kruskal-Wallis test to compare participants’ perceptions after group-
ing participants by target language. For example, with the statement ‘I’m glad that
I lived with a host family’, there were statistically significant differences between
the three groups, with 94% of L2 Spanish learners agreeing, or strongly agreeing
with the statement, compared to 90% of L2 Mandarin learners, and only 74% of
L2 Russian learners. The authors do not show post hoc tests for the Kruskal-Wallis
test, but based on the frequency data, the Spanish and Mandarin learners appeared
to be happier with their host families than the Russian learners were.
The question to be addressed is: ‘do participants with residence (i.e., up to one year
residence and more than one year residence), regardless of length, have higher
proficiency levels than participants without residence?’
Figure 9.6 presents the SPSS default for a nonparametric test: two or more inde-
pendent samples.
Click the Fields tab, then move ‘proficiency level’ from the ‘Fields’
pane to the ‘Test Fields’ pane, and move ‘collapsed residence’ to the
Groups field (see Figure 9.7).
Notes: It is important that collapsed residence has the correct designation
(nominal/ordinal/scale) in the Measure Column of the SPSS Data View. Only
ordinal and nominal variables work as Group variables.
Next, click the Settings tab and tick the Customize tests checkbox.
Finally, click on the Run button to view the SPSS output for the
Kruskal-Wallis test.
In order to see more detail for the results in the SPSS output, double-click on
the Hypothesis Test Summary table (see Figure 9.9), and a Model Viewer window
will be activated (see Figure 9.10). According to Figure 9.10, the test was statisti-
cally significant (H(2) = 30.448, p < 0.001). Note that the degree of freedom was
2 (i.e., the number of groups minus 1).
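Outside SPSS, a Kruskal-Wallis test can be sketched with scipy; the ratings below are hypothetical placeholders, not the actual TEP residence data:

```python
from scipy.stats import kruskal

# Hypothetical proficiency levels for three residence groups
no_residence = [2, 3, 3, 2, 4, 3]
up_to_one_year = [3, 4, 4, 3, 5, 4]
more_than_one_year = [4, 5, 5, 4, 5, 4]

h, p = kruskal(no_residence, up_to_one_year, more_than_one_year)
print(f"H(2) = {h:.3f}, p = {p:.4f}")  # df = number of groups - 1
```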
The resulting Z-values can be used to compute the r-effect size (see Chapter 8), and the findings can then be written up in a similar manner to the Mann-Whitney U test example in Chapter 8.
Summary
This chapter has presented one-way ANOVA, which is the extension of the
independent-samples t-test to several groups. One-way ANOVA assumes that the
groups being compared differ in one dependent or outcome variable only (i.e.,
univariate). This chapter has also presented a nonparametric version of ANOVA,
namely the Kruskal-Wallis test. The next chapter presents the analysis of covari-
ance (ANCOVA), which is an extension of ANOVA that partials out the effect of
an intervening variable.
Review Exercises
To download review questions and exercises for this chapter, visit the Companion
Website: www.routledge.com/cw/roever.
10
ANALYSIS OF COVARIANCE
(ANCOVA)
Introduction
As discussed in the previous chapter, researchers use one-way ANOVA to inves-
tigate differences among three or more groups of individuals that vary on a
dependent variable. However, sometimes there are other variables that could
influence research results and that researchers have to take into account at the data
analysis stage. In a pre-post study, such as Ko (2012), which was discussed in the
previous chapter, it could have been the case that one of the groups had greater
knowledge of the target feature at pretest time, so that any differences at posttest
time might not have been due to the treatment alone. It is, therefore, important to
be able to account statistically for differences that may exist before the treatment
is applied. ANCOVA is one method that can be used for this purpose.
A variable that interferes with the influence of the target independent variable
on the dependent variable is called an intervening variable (also known as a moderator
variable or confounding variable). If researchers do not or cannot control its effect as
part of the research design, they can at least attempt to minimize its effect through
a statistical method such as using ANCOVA.
A nonsignificant ANOVA result at pretest time would allow a researcher such as Ko (2012) to argue that any differences at posttest time were mainly due to her treatment. Although the logic behind this research
strategy is sound, problems may arise, particularly when group sizes are small and
when the differences between the groups at the pretest times, although not statisti-
cally significant, are powerful enough to affect the results of the posttest analysis. If
groups have been chosen entirely via random sampling, and are large, these differ-
ences are likely to be negligible, but in the applied linguistics field, sampling is rarely
truly random and groups are seldom large. So even though researchers may obtain a
nonsignificant ANOVA result at the pretest time, any undetected differences that do
exist at the pretest time can affect the posttest outcome.
A simple way to deal with this issue in a pretest-posttest experimental design
(see Phakiti, 2014) is to run an ANOVA test on ‘gain scores’, rather than on the
posttest scores. Gain scores are easily computed by subtracting the pretest score
from the posttest score:

gain score = posttest score – pretest score
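A minimal sketch of the gain-score computation, with hypothetical scores:

```python
# Gain score = posttest score - pretest score (scores are hypothetical)
pretest = [35, 42, 38, 40]
posttest = [55, 60, 49, 58]
gains = [post - pre for pre, post in zip(pretest, posttest)]
print(gains)  # [20, 18, 11, 18]
```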
ANCOVA can perform comparison tests between two or more groups, and if a
covariate is involved, it can take that into account.
Conditions of ANCOVA
ANCOVA remains a controversial analysis. Miller and Chapman (2001) list a number of ways in which ANCOVA can be, and has been, misused by researchers.
In particular, researchers need to check carefully that the statistical conditions of
ANCOVA are met.
Field (2013) discusses two important conditions for ANCOVA. The first
condition concerns the independence of the independent variable (e.g., the
treatment condition or the proficiency level) and the covariate. That is, the
mean of the covariate should not differ significantly between the groups because
ANCOVA computes an overall value for the covariate across the whole sample,
rather than for each group or each participant. This condition can be checked
by running ANOVA with the grouping variable as the independent variable,
and the covariate as the dependent variable. The outcome should be nonsignificant (p > 0.05).
The second condition concerns the relationship between the covariate and
the dependent variable. This condition is known as the homogeneity of regression
slopes, in which the effect of the covariate on the scores for each group should
be similar. Field (2013) shows a way to check that this condition holds in
SPSS. If the homogeneity of regression slopes condition is violated, ANCOVA
should not be performed or performed only with complex adjustments (Ruth-
erford, 2011).
TABLE 10.1 ANOVA for the independent variable and covariate (tests of between-subjects effects)
Table 10.3 shows the difference more clearly and indicates that the members of
the no-residence group are generally at proficiency level 3, whereas the members
of the two residence groups are generally at level 4.
According to these post hoc test results, the first condition of independence
between the independent variable and covariate is not met. This is a statistical
reason to leave out the no-residence group from the ANCOVA. Therefore, the
TABLE 10.2 Post hoc tests for independence of covariate and independent variable
(multiple comparisons)
TABLE 10.3 Post hoc tests for the independence of covariate and independent variable
FIGURE 10.4 Accessing the SPSS menu to select Cases for analysis
Click the Continue button, and then the OK button in the Select
Cases dialog to return to Data View. It can be seen that a large number of participants have been crossed out (Figure 10.7).
Based on the previous post hoc test, researchers can be confident that the condi-
tion of independence between the independent variable and covariate holds for
the data relating to the two remaining groups (less than one year and more than
one year residence).
The next step is to check the second condition for ANCOVA, namely the
assumption of the homogeneity of regression slopes. This can be checked after the
main ANCOVA has been set up (Field, 2013).
In the Univariate: Model dialog, select Custom at the top. Move both
variables (i.e., collres and level) from the ‘Factor and Covariates’
pane into the ‘Model’ pane.
Highlight both variables at the same time and click the arrow to
build the interaction term, which appears as ‘collres ∗ level’ in the
‘Model’ pane (see Figure 10.10). Click on the Continue button.
FIGURE 10.9 Univariate dialog for choosing a model to examine an interaction among
factors and covariances
FIGURE 10.10 Univariate: Model dialog for defining the interaction term to check the
homogeneity of regression slopes
Table 10.4 presents the SPSS output for checking the homogeneity of regres-
sion slopes assumption. In Table 10.4, the entries in the collres ∗ level line indicate
whether or not the interaction is significant at the p-value of 0.05. The nonsig-
nificant value obtained here is desirable because it means that the condition of
homogeneity of regression slopes has been met. This means that ANCOVA can
be used in the final step.
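Outside SPSS, the same check can be sketched with the statsmodels formula API: the homogeneity of regression slopes holds if the group-by-covariate interaction term is nonsignificant. The file and column names below are assumptions for illustration, not part of the TEP materials:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names: collres = residence group,
# level = proficiency covariate, routines = outcome score
df = pd.read_csv("tep_subset.csv")

# Fit outcome ~ group * covariate; a nonsignificant interaction
# term (collres:level) supports homogeneity of regression slopes
model = smf.ols("routines ~ C(collres) * level", data=df).fit()
print(model.summary())
```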
Open the Univariate dialog and click the Model button to access the
Univariate: Model dialog (see Figure 10.11). Restore the model to
Full factorial by ticking that checkbox.
Click on the Continue button and then in the Univariate dialog select
Options to open the Univariate: Options dialog (Figure 10.12).
FIGURE 10.11 Changing the analysis setup back to the original setup
In Figure 10.12, it is important that means are displayed and that main effects
are compared for the independent variable (collapsed residence = collres). The
checkbox ‘Compare main effects’ should be ticked. These are the post hoc tests,
but there are fewer of them here than in the one-way ANOVA. In the Confidence
Interval Adjustment field, select either the Bonferroni or the Sidak adjustment; either will be sufficient.
For the purpose of this chapter, not all SPSS output will be presented. Only
the output important for doing ANCOVA will be shown. The first output is nor-
mally labeled as ‘Univariate Analysis of Variance’, which is the same as one-way
ANOVA. Table 10.5 shows the descriptive statistics for the two groups regarding
the routine scores.
According to Table 10.5, residence seems to make a difference to learners’
routines scores, with the mean score of the group of more than one year residence
being 15 points higher than that of the up to one-year residence group. However,
it cannot be determined whether the scores were significantly different until the
ANOVA result is examined. Table 10.6 presents the results of the Levene’s test
between the two comparison groups.
The Levene’s test result is statistically significant, which indicates that the
homogeneity of variance assumption has been violated. However, the result of the
Levene’s test performed by SPSS is actually not relevant when using ANCOVA
because it is not homogeneity of error variances across groups that is assumed
in ANCOVA, though it does matter in ANOVA. Instead, a condition known
as homoscedasticity is required, which means that error variances are similar for
each combination of predictor variables (see Rutherford, 2011, for further dis-
cussion). The Levene’s test does not evaluate homoscedasticity, so it can be
discounted in ANCOVA. Unfortunately, SPSS does not include statistical tests of homoscedasticity.

TABLE 10.5 Descriptive statistics of the routines scores between the two residence groups

TABLE 10.6 Levene’s test of equality of error variances: F = 7.396, df1 = 1, df2 = 53, Sig. = .009
According to Table 10.8, the means have changed little (compared to those
shown in Table 10.5). The mean for the up-to-one-year of residence group rose
from 75.5 to 75.8, and the mean for the more-than-one-year of residence group
fell from 90.5 to 90. The results in the final table of the first ANOVA with proficiency level as the dependent variable (Table 10.3) indicate that the members
of the more-than-one-year residence group had slightly higher proficiency than
the members of the up-to-one-year residence group, and as this proficiency effect
was minimized through the use of ANCOVA, the mean score for this group
subsequently decreased. Table 10.9 shows the post hoc comparisons between the
two groups.
In Table 10.9, it can be seen that the two residency groups significantly differ
from each other. This ANCOVA result could be written up as follows.
‘Being in an English-speaking country for more than one year helped test takers improve their knowledge of routines significantly more than being in the country for less than one year.’
Summary
In cases in which there are pre-treatment or pre-existing differences among research
participants, researchers can attempt to correct for them in two ways. The first
method is by analyzing gain scores; the second method is by employing ANCOVA.
As can be seen, conducting an ANCOVA requires several steps. This chapter has also shown that ANCOVA is not without issues. First, complex statistical
assumptions need to be met in order to arrive at a meaningful outcome. Second,
the choice of intervening variables requires careful justification, especially in the
case of L2 research, in which multiple independent factors often affect language
learning or use simultaneously. This choice depends on researchers’ understand-
ing of the research context. While it is technically possible to include multiple
covariates in ANCOVA, this is not recommended as it makes the analysis and
its outcomes overly complex. Wherever possible, intervening variables should be
controlled in the research design. The next chapter presents repeated-measures
ANOVA, which is an extension of the paired-samples t-test, in which a dependent
variable is measured three or more times.
Review Exercises
To download review questions and exercises for this chapter, visit the Companion
Website: www.routledge.com/cw/roever.
11
REPEATED-MEASURES ANOVA
Introduction
The previous chapters on one-way ANOVA and ANCOVA presented the exten-
sion of the independent-samples t-tests to three or more groups. In this chapter, we
introduce an extension of the paired-samples t-test, which compares the mean scores of the same group of participants on two occasions, to designs with three or more occasions. This analysis is known as a repeated-measures analysis of variance (hereafter,
repeated-measures ANOVA). This kind of analysis is common in pre-post studies in
which researchers first give participants a pretest, then administer a treatment, and
finally give participants one or more posttests (posttest/delayed posttest) to investi-
gate the effect of the treatment on the participants (see Figure 11.1).
Another application of the repeated-measures ANOVA is to examine the rela-
tive levels of difficulty of several test sections or types of language tasks completed
by the same learner group. A one-way (i.e., one independent variable) repeated-
measures ANOVA takes a single sample and compares several measures taken on
that group. A study by Laufer and Rozovski-Roitblat (2011) used a repeated-
measures ANOVA to compare the learning of vocabulary presented 2–3, 4–5,
or 6–7 times. They investigated whether different amounts of exposure affected
learning by comparing learners’ recall and recognition scores for each group of
words. In this study, the ‘amount of exposure’ was considered the independent,
within-subject variable with three levels.

FIGURE 11.1 Pretest (Time 1) → Treatment → Posttest (Time 2) → Delayed Posttest (Time 3)

In addition to varying the amount of exposure, the study compared two instructional conditions, Focus on Form and Focus on FormS. The repeated-measures ANOVA results were as follows:
• Recall scores under the Focus on Form condition: F(2,38) = 2.77, n.s.
• Recall scores under the Focus on FormS condition F(2,38) = 33.91, p < 0.001
• Recognition scores under the Focus on Form condition: F(2,38) = 1.45, n.s.
• Recognition scores under the Focus on FormS condition F(2,38) = 13.4,
p < 0.001
It can be seen that the differences under the Focus on FormS conditions were
significant at p < 0.001, while the differences under the Focus on Form conditions
were not statistically significant. The post hoc tests for the Focus on FormS condi-
tion indicated significant differences between all levels of exposure for the recall
scores, and between the 2–3 and 6–7 exposures levels for the recognition scores.
These post hoc tests were performed for the Focus on FormS condition only
because the repeated-measures ANOVA results for the Focus on Form condition
were not statistically significant. In summary, Laufer and Rozovski-Roitblat (2011)
found that students’ retention of vocabulary was better if they were exposed to it
more and received additional practice.
in a slightly different way from the η2 in the one-way ANOVA, but the interpreta-
tion of the effect size is similar.
The value of the partial η2 is between 0 and 1. It can be interpreted as the
percentage of overall variance accounted for by the variable under measurement.
In the case of Laufer and Rozovski-Roitblat’s (2011) study, this variable would be
the treatment or amount of vocabulary exposure. Whether the size of the partial
η2 is considered small or large depends on researchers’ expectations. Generally, a
partial eta squared over 0.5 would be considered large and one below 0.1 would
be considered small.
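Partial eta squared is computed from the effect and error sums of squares. A minimal sketch, with hypothetical SS values chosen to mirror the small effect reported later in this chapter:

```python
# Partial eta squared = SS_effect / (SS_effect + SS_error)
ss_effect = 12.5    # hypothetical value
ss_error = 487.5    # hypothetical value
partial_eta_sq = ss_effect / (ss_effect + ss_error)
print(partial_eta_sq)  # 0.025 -> only 2.5% of variance, a small effect
```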
Click Analyze, next General Linear Model, and then Repeated Mea-
sures (Figure 11.2).
The Define button will become active after clicking Add (shown in
Figure 11.3). Click this button to obtain the Repeated Measures dia-
log shown in Figure 11.4. To add the variables ‘implicature’, ‘routines’, and
‘speech acts,’ move each of these three variables one at a time from the left-
hand pane to the ‘Within-Subjects Variables’ pane. In Figure 11.4, ‘implica-
ture’ and ‘routines’ have been added to the Within-Subjects Variables pane,
while ‘speech acts score’ is still to be added.
Tick the Compare main effects checkbox, and select Bonferroni as the
post hoc test for the repeated-measures ANOVA from the Confidence
interval adjustment drop-down menu.
The output begins with the codes that SPSS assigns to the levels of the within-
subjects factor (Section) as shown in Table 11.2. Table 11.3 presents the descriptive
statistics, which indicate that implicature was the easiest section for the partici-
pants, followed by routines, and then speech acts. The differences do not appear large enough to be statistically significant.
Table 11.4 presents the multivariate test output. This is ordinarily used for a
multivariate ANOVA (also known as MANOVA), which has several dependent variables. SPSS produces this output here as well because the multivariate tests do not assume sphericity, so they can be consulted if Mauchly’s Test of Sphericity indicates a severe violation of sphericity.
Should sphericity be violated, results from these tests together with the corrections
for sphericity violations (see Table 11.6) can help researchers judge whether the
outcome is significant (see Chapter 14 in Field, 2013 for more discussion on this
strategy).
Table 11.5 presents the result of Mauchly’s Test of Sphericity, which has sphe-
ricity as its null hypothesis. Mauchly’s Test of Sphericity should be nonsignificant
in order for the sphericity assumption to be met.

TABLE 11.2 Codes assigned by SPSS to the levels of the within-subjects factor: 1 = implicature, 2 = routines, 3 = speechacts

The partial η2 for the section effect was small (0.025). The partial η2 value suggests that only 2.5% of the
overall variance was accounted for by the differences among the three test sections.
Therefore although the three test sections were significantly different in terms of dif-
ficulty for the test takers, the differences among them were small. The result of this
repeated-measures ANOVA can be written as F(2, 330) = 4.159, p < 0.05.
The other results in Table 11.6 include those for the Greenhouse-Geisser,
Huynh-Feldt and Lower Bound corrections. These are corrections for the statisti-
cal results when the sphericity assumption is violated. Greenhouse-Geisser is used
in cases of epsilon (ε) < 0.75 in Mauchly’s test and Huynh-Feldt if ε > 0.75 (see
Table 11.7). The Lower Bound test (also included in Table 11.7) is conservative,
and can be used if there are serious concerns about sphericity violations.
SPSS also produces extra outputs called Tests of Within-Subjects Contrast and Test
of Between-Subjects Effects, which can be ignored (these outputs are therefore not
included here). The within-subject contrasts are only relevant if there are specific
preexisting hypotheses, and the between-subjects effects are not of interest when a
repeated-measures ANOVA is performed. The next few tables in the SPSS output
show the post hoc test results that compare the various levels of the test section.
The estimates are shown in Table 11.7, which presents similar information to that in the descriptive statistics table (see Table 11.3). Table 11.7 also includes the lower and upper bounds of the 95% confidence interval.
Table 11.8 presents the pairwise comparisons from the results of the Bonferroni post hoc test. This table shows that the only significant difference was
between Sections 1 and 3 (implicature and speech acts); the former was signifi-
cantly easier than the latter. The routines section did not differ significantly from
either the implicature section or the speech act section. The final table in the SPSS
output, which is not presented in this section, is called Multivariate Tests. This out-
put is the same as that shown in Table 11.4.
On the basis of the repeated-measures ANOVA, the results may be written up
as follows.
Summary
The rationale behind repeated-measures ANOVA is the same as that of one-way
ANOVA discussed in Chapter 9. However, unlike one-way ANOVA, which compares the mean scores of several independent groups, repeated-measures ANOVA compares the same group of participants measured on three or more occasions.
Review Exercises
To download review questions and exercises for this chapter, visit the Companion
Website: www.routledge.com/cw/roever.
12
TWO-WAY MIXED-DESIGN
ANOVA
Introduction
L2 researchers can combine a repeated-measures ANOVA (Chapter 11) with a
between-groups ANOVA (Chapter 9). Such a combination allows researchers to
simultaneously examine the effect of a between-subject variable, such as length of
residence or type of treatment, a within-subject variable, such as test time (e.g., pre,
post, and delayed post) or task type (e.g., implicature, routines, and speech acts),
and the interaction among these variables. This chapter explains how a two-way
mixed-design ANOVA can be performed in SPSS and extends what has been
discussed in Chapter 11.
Figure 12.1 presents an experimental design that investigates the effect of a
treatment condition on an aspect of language learning, considering test time and
task types as independent factors.
The design of Shintani, Ellis, and Suzuki’s (2014) study was similar to that
shown in Figure 12.1, except that their study used five conditions. The study
employed a mixed-design ANOVA to address their research questions. The
researchers investigated the effect of different types of written feedback on ESL
learners’ use of English indefinite articles and hypothetical conditionals. Dictogloss
tasks were used in which the participants listened to a text twice, took notes, and
then tried to recreate the text as accurately as possible. After completing one dic-
togloss as a pretest, the learners received feedback, and a week after the feedback,
they completed another dictogloss task as a posttest, and a third one as a delayed
posttest two weeks following the first. In this study, ‘time’ was the independent,
within-subjects variable, and it had three levels since the researchers evaluated
performance at three points: pretest, posttest, and delayed posttest. Shintani et al.
(2014) divided their sample into five groups depending on the type of feedback
the groups received and whether or not they revised their texts; they therefore not only had a within-subjects variable, but also a between-subjects variable. This made their study a two-way mixed-design ANOVA involving the five groups shown in Table 12.1.

FIGURE 12.1 Pretest → Treatment → Posttest → Delayed Posttest design, with participants randomly assigned to treatment groups or to a control group that receives no treatment
TABLE 12.1 Descriptive statistics of the percentage scores for correct use for the five treatment conditions by three tasks (adapted from Shintani et al., 2014, p. 118)
Group N Pretest M (SD) Posttest M (SD) Delayed posttest M (SD)
Explanation, without revision 29 28 (28) 78 (26) 55 (38)
Correction, without revision 27 19 (20) 82 (28) 54 (36)
Explanation with revision 26 21 (29) 81 (23) 55 (41)
Correction with revision 24 20 (27) 84 (28) 61 (36)
Control 34 27 (30) 20 (27) 32 (33)
Table 12.1 shows the percentage scores for correct use, N sizes, and standard deviations for the five treatment conditions across the three task time points.
According to Table 12.1, all the treatments had a strong immediate effect. The
scores of the members of the treatment groups were between 20 and 30 at the
pretest time and increased to approximately 80 at the posttest time. The scores for
the members of the control group, however, dropped between the pretest and the
posttest. In the delayed posttest, the scores of the members of the four treatment
groups had fallen from the time of the posttest to the 50–60 range, while those
of the members of the control group had risen slightly. Figure 12.2 illustrates the
changes across time points among the five groups (created based on the means
reported in Table 12.1).
The line graph in Figure 12.2 demonstrates the steep increase and subsequent
drop in scores for the members of the four treatment groups, a pattern that differed markedly from that of the control group. However, the overall trends show some similarities
among the groups. Shintani et al. performed a two-way mixed-design ANOVA
that showed statistically significant effects. Effects for independent variables (i.e.,
groups and time—hence, a two-way ANOVA) are called main effects, and it was
found that the main effect for groups was statistically significant (F(4, 135) =
9.428, p < 0.05, η2 = 0.156, small effect size), as was that for time (F(2, 270) =
113.574, p < 0.05, η2 = 0.429, large effect size). There was also a statistically sig-
nificant interaction of the group and time factors (F(8, 270) = 11.331, p < 0.05,
η2 = 0.196, medium effect size).
The findings from this study can be summarized as follows. First, for their
within-subjects variable, time, the researchers found statistically significant and large
differences between the scores in the pre- and posttest for all of the experimen-
tal groups, except for the control group. They also found statistically significant,
but small, differences among the treatment groups with regard to the changes in the scores of the participants between the posttest and delayed posttest. The difference between the posttest and delayed posttest scores for members of the control group was not statistically significant.

FIGURE 12.2 Changes across time points among the five groups
Second, for the between-subject variable ‘Group’, the researchers found that
none of the groups differed significantly from any other at the pretest time. This
finding reduces the possibility that any differences that existed prior to the experi-
ment affected the final results. All four treatment groups significantly differed from
the control group at the posttest time. However, at the time of the delayed post-
test, only one treatment group (the group that received direct correction, and who
made revisions to their texts) had significantly higher scores than the control group
(with a medium effect size). None of the experimental groups differed significantly
from each other at any time.
Third, there are no post hoc tests for interactions. However, the significant
interaction term in the ANOVA shows that ‘time’ impacted the performance of
the groups differently. That is, the scores of the members of the experimental
groups increased greatly initially, then decreased over time, whereas the scores of
the members of the control groups first decreased, and then increased slightly. It
was this contrast between the control group and the experimental groups that
made the interaction term statistically significant. The pretest, posttest, and delayed
posttest changes for the control group might have been due to random fluctuation.
After clicking the active Add button, followed by Define (see Figure 11.3
in Chapter 11), the Repeated Measures dialog appears (Figure 12.3).
In the Repeated Measures dialog (Figure 12.3), click the Plots but-
ton. (Plots are useful to help visualize the interactions.)
Click on the Add button (see Figure 12.5, which shows collres∗section in the Plots field) and then click on the Continue button.
FIGURE 12.6 Repeated Measures: Post Hoc Multiple Comparisons for Observed Means dialog
In the following SPSS output, it can be seen that SPSS assigns codes to the levels of
the within-subjects factor (section) as shown in Table 12.2. The between-subject
factor (collapsed residence) is shown in Table 12.3, and Table 12.4 presents the
descriptive statistics.
According to the descriptive statistics, the scores on all sections increased
with length of residence. Such increases were much more drastic for the routines
section than for the other test sections. Length of residence might benefit the
knowledge of routine formulae more than it benefits other aspects of pragmatic
knowledge, which is in line with previous research (e.g., Roever, 2005). On the
basis of the descriptive statistics, a main effect for length of residence, and an inter-
action between residence and section type may be expected.
TABLE 12.2 Codes assigned by SPSS to the levels of the within-subjects factor: 1 = implicature, 2 = routines, 3 = speechacts
For the purpose of this chapter, the multivariate test table will be ignored.
According to Table 12.5, Mauchly’s test was nonsignificant ( p > 0.05), so the
sphericity assumption can be taken to hold. The tests of within-subjects effect in
Table 12.6 indicate a significant effect of the section variable (F(2,252) = 3.117, p < 0.05, partial η2 = 0.024, small effect size). On this basis, the type of test section has a significant but small impact on performance (regardless of residence).
In Table 12.6, the interaction effect was statistically significant (F(4,252) = 4.869,
p < 0.01, partial η2 = 0.072, small effect size). It can be inferred that the impact of
the residence variable was stronger on some sections than on others. Table 12.7 shows the result of Levene’s test and Table 12.8 displays the tests of between-subjects effects.
In Table 12.7, the results of Levene’s test indicate that the homogeneity of variance condition for the implicature and speech act sections was not met (p < 0.05). Since this concerns the within-subjects variable (rather than the
between-subjects variable), SPSS does not offer specific post-hoc tests to correct
for this violation of assumptions, so the result of comparison between sections
needs to be interpreted with caution. There is no option to run a Levene’s test for
the between-subject variable.
Table 12.9 presents the descriptive statistics for the between-subjects variable
(residence). The information in this table is similar to that shown in Table 12.4,
but with a 95% confidence interval. Table 12.10 presents the pairwise comparison
outcomes. Table 12.11 presents the results of the univariate tests.
The pairwise comparisons in Table 12.10 show no significant differences between the sections, despite the fact that the ANOVA result for sections was significant. The reason for this is that
the post hoc test (Bonferroni) is conservative, which means the 0.05 criterion is
harder to meet. Also, the violation of the assumption of equal error variances (as
indicated by Levene’s test in Table 12.7) may have made this result less stable.
The results of a mixed-design ANOVA show the interaction between the
variables. Figure 12.8 presents the plots for the mean scores for each residence
group in each of the three sections. The graphs for sections 1 and 3 (implicature
and speech acts) are similar to one another. The graphs both rose sharply from no
residence to up to one year’s residence, then rose less sharply to residence of more
than one year. This means that some residence might increase implicature and
speech act knowledge strongly, but that an extended residence might not further
influence such knowledge to the same degree. The routines score, by contrast, not only rose more steeply than the other scores between no residence and up to one year’s residence, it also kept rising at a noticeable rate after
one year. On this basis, it can be inferred that residence had a strong effect on
routines scores within the first year, and continued to have a strong effect after
the first year.
FIGURE 12.8 Estimated marginal means for the three test sections (1 = implicature, 2 = routines, 3 = speechacts) across the residence groups
On the basis of the two-way mixed-design ANOVA, the results may be written
up as follows.
A two-way mixed-design ANOVA was run with the TEP test section as
the within-subjects variable, and length of residence (none, up to one year,
more than one year) as the between-subjects variable. There was a signifi-
cant main effect for section with a small effect size (F(2,252) = 3.117, p <
0.05, partial η2 = 0.024). The main effect for length of residence was also
statistically significant and had a medium effect size (F(2,126) = 22.903,
p < 0.001, partial η2 = 0.267). The interaction term was also significant
(F(4,252) = 4.869, p < 0.001, partial η2 = 0.072), and a profile plot indi-
cated that length of residence led to much stronger increases in the routines
score than in the implicature or speech act scores.
Summary
The two-way mixed-design ANOVA is used to examine mean differences between
several independent groups with several repeated measures. A two-way mixed-
design needs at least one between-subject variable and one within-subjects variable.
A two-way mixed-design ANOVA can be used to investigate the interaction effect
between the between-subject and within-subject variables. This chapter completes
the presentation of inferential statistics for group differences. The next three chap-
ters return to the relationships among variables. Chapter 13 presents the chi-square
test, which is a nonparametric test for examining relationships between categorical
variables.
Review Exercises
To download review questions and exercises for this chapter, visit the Companion
Website: www.routledge.com/cw/roever.
13
CHI-SQUARE TEST
Introduction
The chi-square test (also written χ2 test) is a type of inferential statistic for ana-
lyzing frequency counts of nominal data. It is used to determine whether the
counts of two nominal variables are associated with each other. It is useful when
questions about relationships among variables cannot be answered by means of a
correlational analysis, such as Pearson’s correlation coefficient or Spearman’s rho, or
by means of a comparative analysis, such as a t-test or ANOVA. Typical questions include, for example, whether the use of phrasal verbs is associated with register, or whether correct collocation use is associated with proficiency level.
In this chapter, two types of the chi-square test are presented: one-dimensional
and two-dimensional.
TABLE 13.1 Frequency of phrasal verb use in five registers (adapted from Liu, 2011, p. 674)
∗ Standardized as phrasal verbs per million words; pmw = per million words
Liu (2011) investigated whether phrasal verb use was related to register. Liu searched corpora for phrasal
verbs used in different registers and then computed frequency counts. She then
used the chi-square test to check whether more phrasal verbs occur in certain
registers than in others, or whether the use of phrasal verbs was independent of
register.
The null hypothesis for this study would be that phrasal verb use is not depen-
dent on register, whereas the alternative hypothesis would be that phrasal verb use
is associated with register. This relationship cannot be investigated using a Pear-
son or Spearman correlation; the variable ‘register’ (i.e., spoken, fiction, magazine,
newspaper, and academic writing) was the only variable under investigation, so
there was no other variable to run a correlation with. All Liu could do was to
compare the frequencies of occurrence of phrasal verbs for the five levels of the
variable ‘register’. These frequencies are shown in Table 13.1.
According to Table 13.1, phrasal verbs occurred most frequently in fiction,
with spoken language second, then magazines and newspapers, and finally aca-
demic writing. From these figures, it is unclear if the differences in frequency
were statistically significant. Due to sampling error or random fluctuations, some
differences are to be expected, so it is necessary to test whether there is a genuine
relationship between register and the frequency of use of phrasal verbs or not.
In the first step, the expected count for each register is computed: under the null hypothesis of no association, the total count is divided evenly across the five registers. In the second step, the chi-square test is used to compare each observed count (i.e.,
the actual count) for each level of the variable with the corresponding expected
count, and to compute the difference, which is called the residual. The observed
counts, expected counts, and residuals for each level are shown in Table 13.2.
Table 13.2 shows that for spoken language, 5,216 phrasal verbs per million
words were observed (from the spoken corpus), whereas the expected number is
3,688 phrasal verbs per million words. The difference between the actual count
and the expected count is 1,530.36, and this residual amounts to 41.49% of the expected count. Residuals can be computed similarly for the other registers. The
degree of freedom for the chi-square test in the case of Liu’s (2011) study is ‘the
number of variable levels – 1’ = 5 – 1 = 4 (see Chapter 6). With four degrees of
freedom, the chi-square test uses the residuals and expected frequencies to arrive
at a chi-square value. It was found that this value was statistically significant (χ2(4)
= 3984, p < 0.0001).
To investigate the strength of the association between register and phrasal verb
use, the researcher needed to calculate the phi coefficient (ϕ). This is required as the chi-square value itself does not express the strength of the association. Phi is calculated from the chi-square value through the following formula:

ϕ = √(χ2 ÷ N), where N is the total frequency count
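As a quick sketch, the phi value for Liu's data can be reproduced from the reported chi-square value. Note that the total count of 18,440 used below is inferred from the expected value of 3,688 per register times five registers; it is not taken directly from Liu (2011):

```python
import math

chi2 = 3984
n_total = 5 * 3688   # inferred total frequency count, not Liu's reported N
phi = math.sqrt(chi2 / n_total)
print(round(phi, 2))  # ~0.46: a medium-to-strong association
```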
Based on Cohen (1988), phi can be considered small at about 0.10, medium at
about 0.30, and large at about 0.50, so this phi coefficient indicates a medium-to-strong
effect size. Overall, it can be concluded that there was a significant, medium-to-strong
association between register and the frequency of use of phrasal verbs.
One useful feature of the one-dimensional chi-square test is that the residuals allow
researchers to see which categories contribute most to the final chi-square value. In
the example in Liu’s (2011) study, the academic and fictional registers deviated most
from their expected values, though in opposite directions: academic language con-
tained fewer phrasal verbs than expected, while fictional language contained more.
Spoken language contained the second highest number of phrasal verbs, and the
actual count was higher than expected. Newspapers and magazines contained fewer
phrasal verbs than expected. On this basis, it can be concluded that students learning to
write academic texts should be cautioned against the overuse of phrasal verbs. How-
ever, students learning to write fictional texts should be encouraged to use phrasal verbs,
and students writing journalistic texts should not be advised to avoid phrasal verbs,
but to use them sparingly. The use of phrasal verbs in speaking classes should be
promoted since they are common in the spoken register. The one-dimensional chi-
square test can be used as a procedure in its own right, as in Liu’s study, but it is more
commonly used to evaluate the goodness of fit of a statistical model.
The null hypothesis was that there is no association between the accuracy of
recall and the type of LRE. According to this hypothesis, playful and serious LREs are recalled with similar levels of accuracy. The alternative hypothesis would claim
that there is a relationship between the accuracy of recall and the type of LRE.
TABLE 13.3 Frequency of correct and incorrect recall by type of LRE
Serious Playful
Correct 41 18
Incorrect 82 16
Table 13.3 shows that there were 41 serious LREs that were recalled correctly,
and 82 serious LREs that were recalled incorrectly. For the playful LREs, 18 were
recalled correctly and 16 were recalled incorrectly. Based on Table 13.3, it can be
argued that the playful LREs led to more accurate recall. Twice as many serious, non-playful LREs were recalled incorrectly as were recalled correctly, whereas fewer than half the playful LREs were recalled incorrectly. There seems to be a
tendency for playful LREs to be recalled correctly more frequently than non-
playful LREs. To find out whether this observation is statistically significant, the
two-dimensional chi-square test needs to be performed.
The two-dimensional chi-square test follows the same principles as the one-dimensional chi-square test. First, the totals for each row and column (the marginal totals) are calculated. Then, the expected value for each cell is computed as the product of the marginal totals for that cell’s row and column divided by the overall total. So, for example, the expected frequency for correctly recalled serious LREs was 59 × 123 ÷ 157 = 46.22. Table 13.4 shows the
marginal totals, expected frequencies, and percentage residuals.
The data indicate that more playful LREs were recalled correctly than expected
and fewer were recalled incorrectly. Fewer serious LREs than expected were
recalled correctly, and more than expected were recalled incorrectly. The two-
way chi-square test can now be used to investigate whether these differences were
statistically significant.
Using the residuals, the chi-square test computes the chi-square value, which in
this case was χ²(1) = 4.37, p = 0.037. It should be noted that this value functions
like an F-value in that it is compared against a table of critical values; it is not
an effect size.
When a two-dimensional chi-square test that uses a 2 × 2 table such as this one
(‘accurate/inaccurate’ by ‘serious/playful’) is performed, it is common to apply a
correction to the chi-square value, known as the Yates correction (Furr, 2010), which
is done to ensure that the chi-square value is not overestimated. Once the Yates
TABLE 13.4 Marginal totals, expected frequencies, and residuals for recall by type of LREs

                     Serious   Playful   Total
Correct              41        18        59
  Expected           46        13
  Residual %         –10%      +37%
Incorrect            82        16        98
  Expected           77        21
  Residual %         +6%       –22%
Total                123       34        157
correction was applied in the earlier example (which SPSS will do automatically
with a 2 × 2 table), it was found that the chi-square test was nonsignificant: χ²(1)
= 3.57, p = 0.059, n.s. (nonsignificant). Once the Yates correction had been applied,
it could no longer be claimed that playful LREs were significantly more likely to
facilitate correct recall, only that there appeared to be a tendency for playful LREs
to facilitate correct recall. The Yates correction makes it more difficult to attain a
significant result in a 2 × 2 table. This issue is, however, controversial. In a widely
cited paper, Haviland (1990) argued against the correction, but Greenhouse (1990)
and Mantel (1990), two long-standing proponents of the Yates correction, defended
it. Furr (2010) summarizes the debate by noting that recommendations are mixed
and that there is no clear consensus on the appropriate use of the Yates correction.
Since it is still widely used, applying the Yates correction to 2 × 2 tables is
recommended, but researchers should also report the uncorrected value. Should the
uncorrected value be significant but the corrected one not, authors can make an
explicit case for applying the correction or not.
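Readers who want to verify both results can do so outside SPSS; for example, scipy's chi2_contingency reproduces the uncorrected and Yates-corrected values reported above, with the correction argument toggling the Yates correction (scipy, like SPSS, applies it by default for a 2 × 2 table):

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[41, 18],
                     [82, 16]])      # Table 13.3: recall accuracy by LRE type

# Uncorrected test: chi2(1) = 4.37, p = 0.037 (significant)
chi2, p, dof, _ = chi2_contingency(observed, correction=False)
print(f"uncorrected:     chi2({dof}) = {chi2:.2f}, p = {p:.3f}")

# With the Yates continuity correction: chi2(1) = 3.57, p = 0.059 (n.s.)
chi2_y, p_y, dof_y, _ = chi2_contingency(observed, correction=True)
print(f"Yates-corrected: chi2({dof_y}) = {chi2_y:.2f}, p = {p_y:.3f}")
```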
Again, the chi-square value does not tell researchers how strongly the variables
are related; to find this out, phi needs to be calculated, and this is done in the same
way as for the one-dimensional chi-square.
Without the Yates correction, the calculation is ϕ = √(χ²/N) = √(4.37/157) ≈ 0.17,
a small effect by Cohen's (1988) benchmarks.
Another example of the two-dimensional chi-square test comes from Laufer and
Waldman (2011), who compared the use of collocations in the writing of L2 learners
and native speakers. The researchers divided the proficiency level variable into four levels
(basic, intermediate, advanced, and native speaker) and the use of collocation vari-
able into two levels (collocation/non-collocation). They then used the chi-square
test to check whether there was an association between proficiency level and the
use of collocations.
The null hypothesis for this study would claim that proficiency is not related
to correct collocation use, whereas the alternative hypothesis would claim that
proficiency is associated with the correct use of collocations. The hypotheses
make no claim about whether higher proficiency leads to a higher level of cor-
rect collocation use.
As a first step, Laufer and Waldman (2011) cross-tabulated the frequencies of
collocations and non-collocations for each proficiency level, as shown in Table 13.5
(adapted from Table 2 in Laufer & Waldman, 2011, p. 660).
Due to the large numbers and differences between groups in the table, it
was difficult to know at a glance whether there was an association between the
variables. So the chi-square test used this 2 × 4 table to compute the expected
frequencies and residuals using the marginal totals, as illustrated in Table 13.6.
The calculation produced χ²(3) = 264.18, p < 0.0001, which is highly significant
with three degrees of freedom. The pattern behind this effect of proficiency on the
use of collocations was that the native speakers used more collocations than expected,
whereas each of the learner groups used fewer. To find out how
TABLE 13.5 Collocation use by proficiency level (adapted from Table 2 in Laufer &
Waldman, 2011, p. 660)
TABLE 13.6 Marginal totals, expected frequencies, and residuals for collocation type and
proficiency level (adapted from Table 2 in Laufer & Waldman, 2011, p. 660)
strong the effect is, Cramer's V (also called 'Cramer's phi') can be calculated; it
is commonly used when the contingency table under consideration is larger than
2 × 2. Cramer's V is calculated in a similar manner to phi:
V = √(χ² ÷ (N × (k − 1))), where N is the total number of observations and k is
the number of levels of the variable with fewer levels.
The resulting Cramer's V value was low, indicating a weak effect size. That is, while the chi-square test result
was statistically significant, in reality, the effect is small. On the basis of this finding,
it can be concluded that native speakers use more noun-verb collocations than L2
learners, but that the difference is minimal.
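A small helper makes the Cramer's V formula concrete; for a 2 × 2 table, k − 1 = 1, so Cramer's V reduces to phi:

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramer's V = sqrt(chi2 / (N * (k - 1))), where k is the smaller
    of the number of rows and columns in the contingency table."""
    table = np.asarray(table)
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    k = min(table.shape)
    return np.sqrt(chi2 / (n * (k - 1)))

# For the 2 x 2 LRE table this equals phi:
print(round(cramers_v([[41, 18], [82, 16]]), 2))   # 0.17
```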
The chi-square test has several requirements:
1. The chi-square test is used for nominal data in which each nominal variable has
several levels.
2. For each of the levels of a variable, there must be a frequency count. This
condition is related to the fact that the chi-square test evaluates proportions
by using marginal totals. If gender is one of the variables, both male and
female learners need to be included, because they make up the total for gender. If
the occurrence of the third person singular -s in learner texts is being examined,
all the cases where the third person singular -s was correctly used and all the
cases where it was either incorrectly provided or missing need to be included.
Whether this should be considered a three-level variable (i.e., used correctly, used
incorrectly, missing) or a two-level variable (i.e., used correctly, used incorrectly)
depends on the research question under investigation.
3. The cells must be mutually exclusive; that is, the same participant cannot be
in more than one cell.
The conclusions that can be drawn from the chi-square test will be weaker if
a small sample size is used. For example, if the number of participants is low (say
30), and the participants are subdivided into several groups (e.g., low beginners,
mid beginners, high beginners, low intermediate, mid intermediate, high inter-
mediate, low advanced, mid advanced, and high advanced), there will be very few
or no people in some of the cells. The convention is that the expected (not the
actual!) count for each cell should be at least five, and SPSS will provide a warn-
ing if that is not the case. One solution to the issue of a small sample size may be
to collapse categories. So instead of low/mid/high beginners, the single category
‘beginners’ could be used. However, the rationale employed in collapsing catego-
ries must be carefully considered as a too-broad category, such as ‘miscellaneous’,
is not useful for research purposes.
FIGURE 13.1 Accessing the SPSS menu to launch the two-dimensional chi-square test
After selecting the variables, click the Statistics button. In the Crosstabs:
Statistics dialog that appears, tick the Chi-square and Phi and
Cramer's V checkboxes (see Figure 13.3).
For the example analysis of gender by residence in the target language country, the
chi-square test produced a significant result, χ²(1) = 4.186, p = 0.041, indicating an
association between gender and residence. It is important to note that the chi-square
result in itself says nothing about
the direction of an effect, so residence might influence gender, but such a conclu-
sion does not seem plausible. Furthermore, the chi-square results say nothing about
the extent to which gender influences residence, but according to Table 13.8, female
learners were more likely to have had residence and male learners were less likely to
have had it. So the final question that remains is ‘how strong was the influence of
gender on residence?’ Table 13.10 presents the results for this question.
In Table 13.10, the Phi value can be seen to be 0.203, so the effect size was con-
sidered to be weak to medium. Based on this effect size, there was an effect of gender
on the likelihood of residence, but it was not strong. This finding can be reported as:
The chi-square test was used to investigate whether gender affects the likeli-
hood of residence in the target language country. It was found that female
learners were significantly more likely to have had residence than male
learners, but the effect size was weak-to-medium (χ2(1) = 4.186, p = 0.041,
ϕ = 0.203, weak-to-medium effect size).
If researchers have only a preexisting contingency table of frequencies rather than
raw data, SPSS cannot be used. However, the chi-square calculator on the VassarStats
website may be used in this case. Figure 13.5 presents a screenshot from this website.
In order to compute the chi-square statistic, the cells to be used have to be
selected first. The data from Bell (2012) are adapted to illustrate this. Figure 13.6
illustrates the selection of four cells to define a 2 × 2 table. This selection makes
the rest of the table unavailable.
FIGURE 13.6 Contingency table for two rows and two columns
FIGURE 13.7 Contingency table for two rows and two columns with data entered
The data from Table 13.3 need to be entered (see Figure 13.7). When ‘Calcu-
late’ is clicked, the results are shown at the bottom of the web page, as can be seen
in Figure 13.8. The box under the contingency table in Figure 13.8 shows the
chi-square result with the Yates correction, and Cramer’s V (which is the same as
phi for a 2 × 2 table). In addition, the textbox next to the chi-square value, df, and
significance level provides the uncorrected result. The tables shown at the bottom
of Figure 13.8 present deviations from expectations as percentage deviations or
standardized residuals.
Summary
The results of chi-square tests allow researchers to understand the characteristics
of language learners or language learning contexts that shape how learners may
differ in terms of learning success, acquisition rates, and behaviors and thoughts.
Chi-square tests can be used to compare the observed frequencies of a single
variable with their expected frequencies, and to examine whether two nominal
variables are associated with one another. In L2 research, there are various types of
data, so various statistical tools are required to analyze them. The chi-square test
is suitable for the analysis of nominal data, which cannot be analyzed by means of
correlational analysis. The one-dimensional chi-square test is the simplest form of
this statistic. However, to analyze L2 research data, two-dimensional (or higher)
chi-square tests may be required. SPSS uses raw data only, so if there is a preexist-
ing contingency table with frequencies available, the VassarStats website can be
used instead of SPSS. The next chapter presents multiple regression, which is used
to examine the extent to which a dependent (outcome) variable can be predicted
by several independent (predictor) variables.
Review Exercises
To download review questions and exercises for this chapter, visit the Companion
Website: www.routledge.com/cw/roever.
14
MULTIPLE REGRESSION
Introduction
The last few chapters have presented inferential statistics, such as t-tests and the
ANOVA, that are used to compare group means. However, if researchers aim to
understand how different variables might affect language learning or test per-
formance, then another kind of inferential statistic is required. For example, Jia,
Gottardo, Koh, Chen, and Pasquarella (2014) investigated the relative effect of
several reading and personal background variables on reading comprehension of
ESL learners in Canada, including word-level reading ability, vocabulary knowl-
edge, length of residence in Canada, enculturation in the mainstream culture, and
enculturation in the heritage culture. Not only were the researchers interested in
the effect of each variable, but they were also interested in the relative importance
of those variables. For example, they were interested in finding out which variable
had the strongest impact on reading comprehension, and which combination of
variables best explained reading comprehension. The statistical procedure used to
answer these questions is known as multiple regression. In order to illustrate how
multiple regression is performed, this chapter begins with a description of simple
regression.
Simple Regression
Correlational analysis was introduced in Chapter 5. Correlation expresses the
strength of the relationship between two variables. For example, if the correlation
between vocabulary knowledge and reading ability is strong, it may be expected
that the higher language learners’ vocabulary scores are, the higher their reading
scores will be. For this reason, if researchers know learners’ vocabulary scores, they
will be able to make an informed guess as to their reading scores. Predictions can
therefore be made on the basis of the existence of a strong correlation between
two variables, and the stronger the relationship between two variables, the better
the prediction will be.
As an illustration, the graph presented in Figure 14.1 plots the relationship
between chocolate consumption and vocabulary recall success for a fictitious
group of learners. It shows a correlation of 1.0, which means that predictions can
be made with a high level of certainty. If a learner does not consume any choco-
late, that learner is likely to be able to answer 10 vocabulary questions correctly. If
a learner consumes 10 pieces of chocolate, that learner is likely to be able to answer
30 vocabulary questions correctly, and so on.
In regression analysis, the independent variable that is used to predict a depen-
dent variable is called a predictor variable, and the dependent variable is the outcome
variable. The relationship between chocolate consumption (the predictor variable)
and vocabulary recall success (the outcome variable) can be expressed in the fol-
lowing formula: recall score = 10 + 2 × pieces of chocolate consumed.
This formula can be used to predict how many vocabulary recall questions a
student will be able to answer correctly, given how many pieces of chocolate the
student consumes. If a student consumes 25 pieces of chocolate, that student is
likely to be able to answer 60 vocabulary recall questions correctly (i.e., 10 + 2 ×
25 = 60). Such formulae are the basis of simple regression models. Their more
general form is written as the standard regression equation:
Y = A + BX, where Y is the predicted value of the outcome variable, A is the
intercept (the value of Y when X is 0), B is the slope (the regression coefficient),
and X is the value of the predictor variable. In the chocolate example, A = 10
and B = 2.
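The standard regression equation can be recovered from data by least squares. A minimal sketch using the fictitious chocolate figures (numpy's polyfit estimates B and A):

```python
import numpy as np

# Fictitious data following the chocolate example: recall = 10 + 2 * pieces.
pieces = np.array([0, 5, 10, 15, 20, 25])
recall = np.array([10, 20, 30, 40, 50, 60])

# Fit Y = A + B*X by least squares; polyfit returns [slope B, intercept A].
b, a = np.polyfit(pieces, recall, deg=1)
print(f"A (intercept) = {a:.1f}, B (slope) = {b:.1f}")   # A = 10.0, B = 2.0

# Predicted recall for a learner who eats 25 pieces: 10 + 2 * 25 = 60.
print(a + b * 25)
```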
Multiple Regression
In correlation and simple regression, the relationship between two variables can
be examined. Multiple regression, however, can examine the effect of several vari-
ables on an outcome variable simultaneously. The main task in multiple regression
analysis is to explain (i.e., predict) the outcome variable as precisely and efficiently
as possible on the basis of the predictor variables. The combination of predictor
variables that produces the best prediction of the outcome is called the regression
model, and the best regression model explains as much variance in the outcome as
possible with as few predictors as possible. One of the tasks of multiple regression
is to determine the unique, individual contribution of each independent variable
to the outcome, taking into account the fact that the predictor variables may cor-
relate with each other, and that some of the same variance in the outcome variable
may therefore be explained by each of two or more variables.
A good example of a study using multiple regression is Jia et al. (2014). The
researchers investigated the relative impact of several predictor variables on the
reading comprehension of 94 immigrant high school students of Chinese back-
ground in Canada. They were interested in the effect of vocabulary knowledge,
word-level reading ability, length of residence in Canada, acculturation to the
mainstream (Anglo-Canadian) culture, and enculturation to the heritage (Chi-
nese) culture. While multiple regression allows researchers to begin by including
all possible predictor variables, that is not the best approach in practice. It is more
efficient to begin by including the variables that previous research has shown to
have a significant impact on the outcome variable, and then to progressively add
more variables in an attempt to improve the model. If a variable is found not to
improve the model, it should be excluded from the multiple regression model.
This approach is known as hierarchical (or sequential) regression, and it was this
that Jia et al. used. When hierarchical regression is used, researchers enter variables
in steps (called blocks). Each block contains the variables of the previous block
and adds a new variable (or a small number of new variables) to check if the new
variable improves the prediction. This allows researchers to compare regression
models to decide how many variables are needed to make satisfactory predictions.
Table 14.1 shows the three models that Jia et al. compared, the variables in each
model, the β-value for the variables used, the R2 of each model, and the change in
R2 when one model is replaced by the next.
In Table 14.1, the researchers first entered the length of residence as the sole
predictor, and found that this one-variable regression model explained 59% of
the variance in reading comprehension. The length of residence variable had a
TABLE 14.1 Three hierarchical regression models (adapted from Jia et al., 2014, p. 257)
statistically significant β of 0.77, which is high. This, however, was not surprising
since it was the only predictor in the model. In this one-variable model, you can
also think of β as the Pearson correlation of residence and reading score (r = 0.77
with a coefficient of determination R2 of 0.59).
In the next block, Jia et al. added vocabulary and word-level reading scores
to the model, as can be seen in Table 14.1. Both new variables were also found
to be significant predictors of reading comprehension scores, with strong con-
tributions at β of 0.46 for vocabulary, and β of 0.29 for word-level reading. In
multipredictor models such as this one, the β values are not identical to correla-
tion coefficients as they are in a one-variable model. The new model with length
of residence, vocabulary, and word-level reading explained 79% of the variance
in reading comprehension scores. The increase from the previous model to this
one was significant (20%), so the new model was taken to be better than the first.
In this second model, the contribution of residence dropped; it had a β value of
0.20. It can be concluded that in the first model residence had covered some read-
ing comprehension variance that was actually due to vocabulary knowledge and
word-level reading ability.
Finally, Jia et al. added their two enculturation variables (i.e., mainstream and
heritage enculturation). This time, with all five variables, the model explained an
extra 2% of the variance in reading comprehension, which was a statistically sig-
nificant improvement over the previous model. However, heritage enculturation
was not significant, making only a very minor contribution (β = 0.03), and length
of residence was found to be no longer significant (β = 0.16). In the third regres-
sion model, it appears that mainstream enculturation (β = 0.12) explained some of
the variance that had previously been explained by residence. The contributions
of vocabulary scores (β = 0.47) and word-level reading scores (β = 0.27) did not
change much and remained significant.
In this study, Jia et al. accepted this final model as the best combination of
predictor variables, although it could be argued that their second model, which
explained 79% of the variance using three variables, is preferable to a model that
explained just over 80% with five variables (two of which were nonsignificant).
The reason the authors may have decided to use the final model was that they
were hoping to demonstrate a significant role for mainstream enculturation in
reading comprehension, which they managed to do, though its role is outweighed
by linguistic factors.
The example of Jia et al.’s study shows how multiple regression can be used to
help researchers evaluate different combinations of predictor variables to account
for an outcome. However, the final determination of which combination of pre-
dictor variables explains the outcome variable most strongly needs to be decided
by the researcher. Statistics can only provide supporting evidence for this decision.
A practical two-step approach, illustrated in the rest of this chapter, is to:
1. Run a multiple regression with all predictors at the same time to obtain a
picture of their relative importance in predicting the outcome variable.
2. Based on the results of the first run of the regression analysis, re-run the mul-
tiple regression hierarchically, starting with the strongest predictor, adding the
second strongest, and so on until the weakest is reached.
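Outside SPSS, the same two-step logic can be sketched with the statsmodels package. The data below are simulated stand-ins (the variable names merely echo the TEP example that follows); the loop mimics entering predictors block by block and watching the change in R²:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated stand-in data (hypothetical; names only echo the TEP example).
rng = np.random.default_rng(1)
n = 150
df = pd.DataFrame({
    "proficiency": rng.integers(1, 5, n).astype(float),
    "residence": rng.uniform(0, 6, n),
    "computer": rng.integers(1, 6, n).astype(float),
})
df["total"] = 14 * df["proficiency"] + 2 * df["residence"] + rng.normal(0, 8, n)

# Enter predictors block by block, strongest first, and track the R2 change.
blocks = [["proficiency"],
          ["proficiency", "residence"],
          ["proficiency", "residence", "computer"]]

prev_r2 = 0.0
for i, cols in enumerate(blocks, start=1):
    model = sm.OLS(df["total"], sm.add_constant(df[cols])).fit()
    print(f"Model {i}: R2 = {model.rsquared:.3f} "
          f"(change = {model.rsquared - prev_r2:+.3f})")
    prev_r2 = model.rsquared
```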
Click Analyze, next Regression, and then Linear (see Figure 14.2)
Tick the other four checkboxes on the right-hand side (i.e., R squared
change, Descriptives, Part and partial correlations, and Collinearity
diagnostics).
FIGURE 14.2 Accessing the SPSS menu to launch multiple regression
Table 14.2 presents the descriptive statistics of the analysis, with the outcome variable
being the total TEP score, plus three predictor variables: computer familiarity,
proficiency level, and years in English-speaking country. The descriptives indicate
that there is large variation in sample sizes across variables, suggesting that it may be
a good idea to exclude cases with missing data pairwise rather than listwise. In this
way, far fewer participants will be excluded from the analysis, since most of the
missing data occur on the computer familiarity variable.
Table 14.3 shows the correlations among the predictor and outcome variables.
It can be seen that proficiency level was a strong predictor of the total score, while
the correlations of the other predictors with the total score were weaker.
TABLE 14.7 Model coefficients output: Unstandardized and standardized Beta coefficients
The SPSS coefficients output has been broken up into two tables (Tables 14.7 and 14.8)
in this chapter. Table 14.7 contains the standardized beta coefficients and their
statistical significance levels.
In this table, proficiency level had the largest beta coefficient (β = 0.748, p <
0.001). Length of residence had a smaller beta (β = 0.169, p = 0.042), and com-
puter familiarity had a nonsignificant beta (β = 0.062, p = 0.425). Table 14.7
also contains B-coefficients (column 3), which indicate the direct links between
the predictor variables and the outcome variable in raw units. Recall that the
B-coefficient indicates how much the outcome variable increases when the predictor
variable increases by one unit. So, for example, if Participant X has a proficiency level that
is one level higher than Participant Y, Participant X’s total test score would be
expected to be 13.983 higher than that of Participant Y. Similarly, if Participant X
has a residence one year greater than that of Participant Y, then Participant X’s total
score would be expected to be 2.044 higher than that of Participant Y. This is
interesting in itself, but because the units of the predictors differ greatly (i.e., years
versus proficiency levels versus levels of computer familiarity), the B-coefficient
is not commonly used.
Table 14.8 presents the correlations between the total score (the outcome
variable) and each of the predictor variables. The zero-order correlation is a Pear-
son correlation (as presented in Table 14.3). The partial correlation treats other
TABLE 14.8 Correlations and collinearity statistics for the exploratory model

Model 1                             Zero-order   Partial   Part   Tolerance   VIF
(Constant)
computer familiarity                .017         .107      .062   .997        1.003
proficiency level                   .801         .775      .705   .888        1.126
years in English-speaking country   .415         .268      .160   .889        1.125
predictors as covariates, and removes their influence from the outcome variable
and the predictor in question. It shows how much of the remaining variance that
predictor explains. For example, in the case of proficiency level, the effects of
length of residence and computer familiarity are taken out of the total test score
and their overlaps with proficiency level are removed to create a ‘pure’ proficiency
level result. This pure proficiency level correlates at 0.775 with the outcome vari-
able. Finally, the part (sometimes also called semi-partial) correlations show how
much the purified predictor correlates with the unpurified outcome variable. In
the case of proficiency level, only overlaps between proficiency level and length of
residence and computer familiarity were removed from the proficiency level vari-
able, creating a purified proficiency level variable, but no changes were made to
the outcome variable. The resulting correlation is the unique contribution of the
predictor variable to the outcome variable, which is 0.705 for proficiency level.
The part correlation can be squared to understand how much of the variance in
the outcome variable a predictor explains. In this case, proficiency level explained
nearly 50% of the variance in the outcome variable (0.705² = 0.497).
Table 14.8 also presents the collinearity statistics. As mentioned previously, col-
linearity describes excessive correlations between the predictors to the point that
it becomes difficult to distinguish the contribution of each predictor variable to
the outcome variable. In the collinearity statistics, all tolerance values should be
larger than 0.2, and the VIF indicators should be below 10 to confirm the absence
of excessive collinearity (Field, 2013, p. 342). All tolerance and VIF values in
Table 14.8 are acceptable, meaning that the variables make independent contribu-
tions to the regression model.
In summary, this first exploratory multiple regression run with all variables
has suggested that proficiency level had the strongest effect on the outcome vari-
able, and that residence had a moderate effect. It has also suggested that computer
familiarity should be excluded from further analysis. A hierarchical regression
model will be run next, with the objective of determining a final model using
only significant predictors. When that has been done, it will be possible to deter-
mine the amount of variance in the outcome variable explained by each of these
predictors.
Click Analyze, next Regression, and then Linear (see Figure 14.2).
Do not add any other variables at this stage. Instead, click the Next
button. ‘Block 2 of 2’ will appear above the Independent(s) field.
The Independent(s) field will be empty to allow another variable to be
entered.
Enter ‘computer familiarity’ as the final predictor (see Figure 14.8). Note
that this predictor is added only to illustrate how hierarchical regression can be
performed with more than two predictor variables; the earlier analysis suggested
that it did not predict the outcome variable at all.
Some of the SPSS outputs (Descriptive statistics, Correlations) are the same as
those for the exploratory model. However, the model summary contains different
information, as presented in Table 14.9.
According to Table 14.9, Regression Model 1 used proficiency level only as the
predictor variable. It correlates with the outcome variable at 0.801, and accounted
for 63.6% of the variance in TEP performance. The F-statistic (from the ANOVA
that compares models) indicates that proficiency level was a significantly better
predictor than the model that did not use predictor variables. Regression Model 2
used the predictors ‘proficiency level’ and ‘length of residence’. Its correlation with
the outcome was slightly higher than that of Regression Model 1 at 0.816, and it
accounted for 65.5% of the variance in TEP performance. This was an improve-
ment of 1.9% over Regression Model 1. The F-statistic was based on a comparison
between Regression Model 2 and Regression Model 1. It indicates that Regression
Model 2 is significantly better at explaining the variance than Regression Model
1 (p = 0.044). Finally, Regression Model 3 was the full model based on profi-
ciency level, residence, and computer familiarity as the hierarchical predictors of
the outcome variable. It correlates at 0.819 with the outcome variable. At 65.3%,
it explains slightly less of the population variance than Regression Model 2. The
F-statistic was not statistically significant (p = 0.425), indicating that this model is
not significantly better than Regression Model 2.
This model summary shows that Regression Model 2 (proficiency and resi-
dence) was the best model, explaining 65.5% of the variance, and it also shows that
including residence leads to just a small improvement in the model over includ-
ing proficiency level alone. Computer familiarity was found not to be a helpful predictor.
TABLE 14.11 Model coefficients output: Unstandardized and standardized Beta coefficients
In Table 14.11, it can be seen that the β-coefficients of proficiency and residence
changed little, and the β-coefficient of the new predictor, computer familiarity, was
very small at 0.062.
Table 14.12 further suggests that partial and part correlations for proficiency
level changed strongly between regression models 1 and 2, but only minimally for
proficiency level and residence between regression models 2 and 3. In Regression
Model 2, proficiency level had a part correlation of 0.703 and hence accounted
for 49.4% of the variance in the outcome variable, whereas residence had a part
correlation of 0.158 and accounted for only 2.5% of the variance. On the basis of
this statistical finding, proficiency level was nearly 20 times as influential in deter-
mining total TEP scores as residence. The three aspects of pragmatic competence
being investigated were nearly entirely dependent on learners’ proficiency level,
and residence and computer familiarity were found to be almost irrelevant. In
Table 14.13, the collinearity diagnostics indicate that none of the models violate
the collinearity condition of multiple regression.
TABLE 14.12 Correlations and collinearity statistics for the three hierarchical models

Model                                 Zero-order   Partial   Part   Tolerance   VIF
1 (Constant)
  proficiency level                   .801         .801      .801   1.000       1.000
2 (Constant)
  proficiency level                   .801         .773      .703   .890        1.124
  years in English-speaking country   .415         .264      .158   .890        1.124
3 (Constant)
  proficiency level                   .801         .775      .705   .888        1.126
  years in English-speaking country   .415         .268      .160   .889        1.125
  computer familiarity                .017         .107      .062   .997        1.003
Table 14.13 shows statistics for the variables excluded from each model and
more detailed collinearity statistics, but they are not relevant here because there
were no problems with the collinearity conditions in the regression model.
On the basis of the multiple regression, the results may be written up as follows:

Hierarchical multiple regression showed that proficiency level and length of
residence together explained 65.5% of the variance in total TEP scores
(R = 0.816). Proficiency level was by far the strongest predictor, uniquely
accounting for 49.4% of the variance, while length of residence uniquely
accounted for 2.5%. Computer familiarity did not significantly improve the
model and was excluded.
Summary
Multiple regression is a useful statistical procedure to help researchers evaluate the
relative influences of several independent variables on an outcome variable, such
as language learning success and test performance. It is a procedure with many
options. However, in this chapter, only a hierarchical multiple regression option
for an interval outcome variable has been presented. Details of other multiple
regression procedures are presented in other texts (see Resources on Methods for
Second Language Research in the Epilogue section of this book). The next and final
chapter of this book will show how to analyze the reliability of research instru-
ments and data coder or rater agreement data in SPSS.
Review Exercises
To download review questions and exercises for this chapter, visit the Companion
Website: www.routledge.com/cw/roever.
15
RELIABILITY ANALYSIS
Introduction
This chapter explores the concept of reliability and illustrates how to conduct reli-
ability analysis in L2 research through the use of SPSS. The key aim of this chapter
is to discuss and present statistical methods for evaluating the reliability of quan-
titative research instruments (e.g., tests and Likert-type scale questionnaires), and
qualitative data coding. Researchers are expected to provide reliability estimates
of their research instruments because unreliable measures imply that data analysis
outcomes cannot be fully trusted.
Reliability
Reliability can be understood as the consistency or repeatability of observations of
behaviors, performance, and/or psychological attributes. A language test is reliable
when students with the required language knowledge and ability can consistently
answer the test questions correctly, while those with little or no knowledge of the
target language cannot. A Likert-type scale questionnaire is reliable when research
participants choose 5 when they strongly agree with a statement, but 1 when
they strongly disagree with it. The issue of the reliability of research instruments
is critical for good L2 research as researchers rely on them for the collection of
useable data.
The reliability of a test or research instrument is commonly expressed as a value
between 0 and 1. Unlike correlation coefficients, reliability coefficients can be
understood as coefficients of determination (R2), which were discussed in Chap-
ter 5. That is, a reliability coefficient of 0 indicates that the test or instrument
does not measure the target construct consistently (i.e., it is 0% reliable). That is,
the results are random and are not useful in drawing conclusions about the target
construct. If the reliability estimate of an instrument is 0, the data collected using
that instrument should not be used for statistical analysis to answer research ques-
tions. A reliability coefficient of 1 means that the test or research instrument is
perfectly precise with no measurement error (i.e., it is 100% reliable or consistent).
The extreme values of 0 or 1 are unlikely to be found in L2 research. Measuring
abstract constructs or indirectly observed attributes, such as language proficiency
and psychological attributes (e.g., motivation, learning style, and attitudes) is not
a precise science.
The level of reliability of a particular test or research instrument that is accept-
able depends on the seriousness of the consequences of the test results or research
outcome. For example, if test scores are used for a high-stakes purpose, such as in the
decision-making process for university admission, an overall reliability of around
0.90 would be needed, though the reliabilities of individual test sections can be
lower and are generally acceptable if they are at least 0.80. A reliability around
0.80 is also acceptable if the potential consequences
of the test scores are less serious; for example, the scores may be used as part of the
decision-making process for placement in a language program, or the test may be
one part of a course assessment. For low-stakes tests, such as self-assessments that
provide feedback to students, 0.70 is generally acceptable as a reliability estimate.
A reliability level below 0.70 means that more than 30% of the test result
or research outcome is effectively random, and this is acceptable only when the
process of making modifications to tests or research instruments is still ongoing,
and changes to the instrument can be made prior to the final collection of data.
If an instrument has high reliability, the data it elicits will be consistent: learners
of the same ability should obtain similar scores on repeated measurements. In this
chapter, split-half reliability, test-retest reliability, and Cronbach's alpha will be
presented.
Test-Retest Reliability
While not actually a measure of the internal consistency of a single test, test-retest
reliability is conceptually important to understand consistency. It assumes that the
same test is administered to the same participants twice and the results correlated.
A highly reliable test that consistently measures the same attribute should produce
very similar results and a high correlation between administrations. However, a
practice effect from the first to the second administration is likely to distort results
so test-retest studies are not normally done. More practical approaches are split-
half reliability and Cronbach’s alpha reliability.
Split-Half Reliability
Split-half reliability can be obtained in different ways. The simplest method is
to split the test or instrument in half (first half/second half ), and correlate the
scores or results from the two halves. However, due to the possible effects of test
fatigue towards the end of the test, it is preferable to correlate scores from the
odd-numbered items with those from the even-numbered items. The resulting
correlation from this method can underestimate the actual reliability of the test or
instrument so the Spearman-Brown prophecy formula (see Brown, 2005, for details)
can be applied to obtain a more reliable measure. SPSS can compute the split-half
reliability and Spearman-Brown prophecy estimate for a test or questionnaire.
See the ‘Measures for Inter-Rater and Inter-Coder Reliability’ section for the SPSS
procedure that should be followed to do this.
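For reference, the standard two-half form of the Spearman-Brown prophecy formula is: r_SB = (2 × r) ÷ (1 + r), where r is the correlation between the two halves. For example, if the two halves correlate at 0.63, the corrected full-length reliability estimate is (2 × 0.63) ÷ (1 + 0.63) = 0.77.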
Cronbach’s Alpha
Cronbach’s alpha (α) is a standard measure of reliability for tests and question-
naires. It is most affected by how strongly test or questionnaire items correlate
with each other since this inter-item correlation reflects how well the items mea-
sure the same attribute. Cronbach’s alpha is also affected by how many items there
are in the test or questionnaire. As a general rule, the higher the number of items
used, the more reliable a research instrument is. A high Cronbach’s alpha provides
evidence that the instrument is internally consistent.
Cronbach’s alpha is high when questions or items are answered consistently.
Table 15.1 presents a simple (simulated) data matrix from a course feedback ques-
tionnaire answered by 10 students.
TABLE 15.1 A simple (simulated) data matrix for a course feedback questionnaire (N = 10)
ID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
1 1 1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1 1
3 2 2 2 2 2 2 2 2 2 2
4 2 2 2 2 2 2 2 2 2 2
5 3 3 3 3 3 3 3 3 3 3
6 3 3 3 3 3 3 3 3 3 3
7 4 4 4 4 4 4 4 4 4 4
8 4 4 4 4 4 4 4 4 4 4
9 5 5 5 5 5 5 5 5 5 5
10 5 5 5 5 5 5 5 5 5 5
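Cronbach's alpha can also be computed outside SPSS. A minimal sketch of the standard formula, applied to the perfectly consistent matrix in Table 15.1, where it returns 1.0:

```python
import numpy as np

def cronbach_alpha(items):
    """alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals),
    where rows are respondents and columns are items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# The matrix from Table 15.1: every student answers all ten questions
# identically, so the questionnaire is perfectly consistent.
responses = np.repeat([[1], [1], [2], [2], [3], [3], [4], [4], [5], [5]], 10, axis=1)
print(cronbach_alpha(responses))   # 1.0
```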
TABLE 15.2 The reliability for the 12-item implicature section of the TEP

Cronbach's alpha   Cronbach's alpha based on standardized items   N of items
0.83               0.83                                           12
TABLE 15.3 Item-total statistics of the 12-item implicature section of the TEP
The 'Cronbach's alpha if item deleted' column can be compared with the overall
coefficient in the previous table (i.e., 0.83). For example, if imp1 were removed, the alpha coefficient
would actually drop from 0.83 to 0.81, which indicates that imp1 should be kept
because it contributes to a higher reliability coefficient. That is also the case for
each of the other items so in this illustration, all the items should be kept for the
purpose of further data analysis.
However, if a similar analysis were performed, and it was found that several
items were contributing to a low Cronbach’s alpha, items may be removed to
make the coefficient higher, and so improve reliability. It is also important to keep
in mind that Cronbach’s alpha is sample dependent. If the sample size is large,
diverse, and heterogeneous, spreading across the entire ability spectrum, the alpha
is likely to be higher than one that could be obtained from a small, homogeneous
sample.
Click Analyze, next Scale, and then Reliability Analysis (Figure 15.1).
FIGURE 15.1 Accessing the SPSS menu to launch Cronbach’s alpha analysis
Tick the Item, Scale, Scale if item deleted, Mean, and Correlations
checkboxes.
Table 15.4 presents the case processing summary. The case processing summary
indicates that only 100 out of 167 test takers’ scores were used in the calcula-
tion of Cronbach’s alpha. This is because the calculation of Cronbach’s alpha
in SPSS requires a complete data set for each participant. It ignores the data for
those participants who have missing scores on the items being analyzed (note that
missing data were coded ‘999’). Table 15.5 presents the overall reliability statistics.
Cronbach’s alpha based on standardized items is usually similar to that of the
unstandardized items.
TABLE 15.4 The case processing summary for items ‘imp1sc’ to ‘imp12sc’

           N     %
Valid      100   59.9
Excluded   67    40.1
Total      167   100.0

TABLE 15.5 Reliability statistics

Cronbach's alpha   Cronbach's alpha based on standardized items   N of items
0.83               0.83                                           12

TABLE 15.6 Item statistics (mean, SD, and N for each item)
Table 15.6 presents the item statistics. The item statistics show information on
each item (the lower the mean, the more difficult the item). Table 15.7 presents
the summary statistics for each test item. An item mean of 0.60 indicates that this
section of the test was easy for the test takers (a mean of 0.50 would indicate that
the section was at an ideal level of difficulty). The average inter-item correlation
of 0.29 is appropriate according to Clark and Watson (1995), who suggest that an
inter-item correlation of between 0.15 and 0.5 is acceptable. Stronger inter-item
correlations suggest that the items are too similar to each other and the construct
under measurement may be narrow.
Table 15.8 presents the item-total statistics, which allow researchers to examine
‘Cronbach’s alpha if item deleted’. The last column shows what Cronbach’s alpha
would be if individual items were deleted (as discussed in Table 15.3). Ideally, all
the values in this column should be lower than the overall Cronbach’s alpha.
Finally, the scale statistics (Table 15.9) show the mean score for all test takers for
this section. Overall, the Cronbach’s alpha analysis suggests that this test section is
reliable and that all the test questions worked together well to elicit implicature.
No items should be removed from the data set.
Rater Reliability
In performance assessment (such as the assessment of speaking and writing),
assigning scores to performance is subjective to a certain degree. That is, the same
rater may not always assign scores to performance of the same standard in exactly
the same way. Moreover, even two well-trained, highly experienced raters may
not always assign a similar score to the same piece of written work or spoken
performance. This makes the study of reliability in this area particularly impor-
tant. There are two common types of rater reliability: intra-rater reliability and
inter-rater reliability.
Percentage of Agreement
In L2 research, some researchers may simply examine and report the percentage of
agreement among raters or coders (e.g., 90% agreement in assigning a test score).
While this report is more informative than not reporting the reliability at all, the
percentage agreement measure is not useful evidence of rater/coder reliability, and
therefore should be avoided. This is mainly because agreement between two peo-
ple depends on several complex factors. For example, first, the level of agreement
depends on the score ranges in the rating scales being used (e.g., 1–4, or 1–20), and
the nature of the feature being rated (e.g., surface features, such as factual informa-
tion and frequencies of occurrence versus content or thematic features, such as beliefs,
perceptions, attitudes, and cognitive processes). In performance assessment, raters are
more likely to agree with each other when the range of scores is narrow (e.g., between
1 and 4) than when the range is broad (e.g., 1–20). In qualitative data coding, cod-
ers are more likely to agree with each other when coding factual information than
when coding qualitative content, because of the complex nature of some constructs,
such as motivation, attitudes and beliefs, and cognitive processing.
Second, the percentage agreement, especially in rating scales, depends on
whether or not researchers adopt an exact or adjacent agreement method. An exact
agreement suggests no discrepancy between two scores as assigned by the two rat-
ers, whereas an adjacent agreement allows a 1 point difference between two scores
assigned by the two raters. Third, the percentage agreement depends on the sam-
ple size. When the sample size is small, agreement tends to be higher than when
the sample size is large. For example, the percentage agreement rate would be
higher for 10 participants than for 20 participants. On the basis of this discussion,
the percentage agreement provides an inflated measure of the relationship between
two scores assigned by two different people (see e.g., Keith, 2003; Williamson, Xi,
& Breyer, 2012; Yang, Buckendahl, Juszkewicz, & Bhola, 2002, who discuss this
concern in the context of automated essay scoring [AES]). Therefore, this type of
inter-rater or inter-coder reliability should be avoided or used with caution.
Correlation Coefficients
Another method that can be used as evidence of inter-rater reliability is corre-
lational analysis (see Chapter 5). This measure is preferable to the percentage of
agreement method. If the data are continuous and normally distributed, a Pearson
product moment correlation can be computed. If the data are ordinal or non-
normal, a Spearman’s rho correlation can be used. A strong correlation coefficient
(e.g., 0.80) suggests that two raters or coders agree on their ratings of the same
piece of performance or data. Nonetheless, it should be noted that a correlation
coefficient is not a reliability estimate. That is, a correlation coefficient of 0.70
does not equate to a reliability coefficient of 0.70. For several reliability mea-
sures, a correlation coefficient is just one ingredient of the reliability formula (see
e.g., “The Spearman-Brown Prophecy Coefficient” section). However, the use
of correlation coefficients as reliability indices is often seen in language assess-
ment practice. For example, in AES research, validators employ Pearson product
moment correlations between human raters and a computer rater as one measure
of AES reliability. For example, Williamson et al. (2012) recommend a threshold
of 0.70 for human-human and human-AES correlations. They point out that a
correlation of 0.70 nearly reaches the tipping point at which signal outweighs
noise in the prediction, so nearly 50% of the variance in the agreement between
two raters is accounted for. While the use of correlations is preferable to percent-
age agreements to analyze inter-rater reliability, it is recommended that correlation
coefficients be used only as complements to the inter-rater or coder reliability
estimates.
The Spearman-Brown Prophecy Coefficient

The Spearman-Brown prophecy formula can be applied to investigate whether two
or more raters have assigned holistic scores
similarly to each other in subjective assessments, such as essay writing and speaking
tasks (see Brown, 2005, p. 187, for a discussion of the Spearman-Brown prophecy
formula). Holistic scoring provides an overall impression of a performance (e.g.,
1–5 for poor, 6–10 for average, 11–15 for good, and 16–20 for excellent). The
Spearman-Brown prophecy uses a Pearson correlation as part of its formula.
Click Analyze, next Scale, and then Reliability Analysis (Figure 15.6).
Table 15.10 presents the reliability statistics for these data. In this table, the
information on Cronbach’s alpha can be ignored as there are only two items
(two raters). Instead, focus on the ‘Spearman-Brown Coefficient’ rows, and on the
‘Equal Length’ result. According to this output, the Spearman-Brown coefficient
between the two raters was 0.77. This output also shows the Guttman Split-Half
coefficient, which is the split-half coefficient used in SPSS and is another index
for inter-rater reliability. On this occasion, it had the same value as the Spearman-
Brown prophecy coefficient. In Table 15.10, the correlation coefficient between
the two raters can be seen to be 0.63. This correlation was used in the calculation
of the Spearman-Brown coefficient: (2 × 0.63) ÷ (1 + 0.63) = 0.77. A Spearman-Brown prophecy coefficient of
0.77 indicates that 77% of the test scores by the two raters were common. In a
medium-stakes test, this coefficient would be acceptable for use in judging stu-
dents’ abilities. However, in a high-stakes situation, a Spearman-Brown coefficient
of 0.90 or higher should be obtained. For the purposes of research, a coefficient
of 0.77 is acceptable for the data to be used to infer students’ performance, and to
perform other inferential statistics.
Cohen's Kappa

Cohen's kappa (κ) measures the agreement between two raters or coders while
correcting for the agreement that would be expected by chance. For example, two
raters might make the following pass-fail decisions for 25 candidates:

                Rater 1
                Pass   Fail
Rater 2  Pass   11     4
         Fail   3      7
A moderate agreement is indicated by a kappa coefficient between 0.4 and 0.6, and
a substantial agreement by a kappa coefficient between 0.6 and 0.7. A strong
agreement is indicated by a kappa of 0.8 or above.
Cohen’s kappa coefficient is influenced by the number of decision options
there are (e.g., two options for pass or fail; three options for poor, average or
good). The more categories there are, the higher the kappa coefficient will be.
This is because the likelihood of two raters agreeing purely by chance drops as the
number of options increases. It is important to note that Cohen's kappa coefficient
is designed for two raters or coders only, and is most appropriate for analyzing
categorical data that has been coded.
To compute Cohen's kappa in SPSS, enter the data in three columns: candidate ID,
ratings by Rater 1 (0 = fail, 1 = pass), and ratings by Rater 2 (0 = fail, 1 = pass),
as in Figure 15.8.
When the Crosstabs dialog opens, move ‘Rater 1’ from the left pane
to the Columns field and Rater 2 to the Rows field, or vice versa
(Figure 15.10).
FIGURE 15.9 Accessing the SPSS menu to launch Crosstabs for kappa analysis
FIGURE 15.10 Crosstabs dialog
TABLE 15.12 Cross-tabulation of the pass-fail ratings by the two raters

                 Rater 1
                 fail   pass   Total
Rater 2   fail   7      3      10
          pass   4      11     15
Total            11     14     25
Table 15.12 presents a cross-tabulation of the pass-fail ratings by the two raters.
Table 15.13 presents the case processing summary, which shows that all cases in
the spreadsheet were included. Table 15.14 presents the measure of agreement
(i.e., the kappa coefficient). The kappa coefficient was found to be 0.43.
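The same kappa value can be reproduced outside SPSS, for example with scikit-learn, by expanding the cross-tabulated counts into paired rating vectors:

```python
from sklearn.metrics import cohen_kappa_score

# Expand the cross-tabulated counts into paired ratings (0 = fail, 1 = pass):
# 7 both fail, 3 where only Rater 1 passes, 4 where only Rater 2 passes,
# and 11 where both pass (25 candidates in total).
rater1 = [0] * 7 + [1] * 3 + [0] * 4 + [1] * 11
rater2 = [0] * 7 + [0] * 3 + [1] * 4 + [1] * 11

print(round(cohen_kappa_score(rater1, rater2), 2))   # 0.43
```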
Intraclass Correlation
The last type of inter-rater reliability estimate to be discussed here is the intraclass
correlation coefficient (ICC), which is suitable for interval data and works for multiple
raters. This statistic indicates the consistency of the raters (i.e., the extent to which
the raters agree on which test takers deserve a high rating and which deserve a low
rating). ICC is useful for helping researchers assess rater quality and check that the
rating scale is interpreted in the same way by different raters. If several raters come
to the same conclusion regarding the rating scale, the scale can be assumed to be
clearly described and unambiguous.
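Outside SPSS, ICCs can be computed with, for example, the third-party pingouin package; the data below are hypothetical. Its ICC2k row corresponds to SPSS's two-way random, absolute-agreement, Average Measures intraclass correlation:

```python
import pandas as pd
import pingouin as pg

# Hypothetical long-format ratings: three raters each score four test takers.
df = pd.DataFrame({
    "taker":  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "rater":  ["A", "B", "C"] * 4,
    "rating": [4, 4, 5, 2, 3, 2, 5, 5, 5, 3, 2, 3],
})

icc = pg.intraclass_corr(data=df, targets="taker", raters="rater",
                         ratings="rating")
# The ICC2k row corresponds to SPSS's two-way random, absolute-agreement,
# Average Measures intraclass correlation.
print(icc[["Type", "ICC"]])
```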
TABLE 15.15 Simulated data set for two raters (rater 1 and rater 2)

ID   R1 rating 1   R2 rating 1   R1 rating 2   R2 rating 2   R1 average   R2 average
1    5             4             4             4             4.50         4.00
2    2             3             3             3             2.50         3.00
3    3             3             2             2             2.50         2.50
4    3             4             3             3             3.00         3.50
5    4             2             3             1             3.50         1.50
6    2             1             1             2             1.50         1.50
7    1             2             1             1             1.00         1.50
8    5             5             4             5             4.50         5.00
9    2             4             2             5             2.00         4.50
10   3             3             2             3             2.50         3.00
Click Analyze, next Scale, and then Reliability Analysis (Figure 15.6).
FIGURE 15.12 Reliability Analysis dialog for raters’ totals as selected variables
FIGURE 15.13 Reliability Analysis: Statistics dialog for intraclass correlation analysis
In relation to Figure 15.13, when there are only three raters, removing any one rater
(as explored via Scale if item deleted) is likely to dramatically reduce reliability.
However, if there are more raters (e.g., as in Derwing & Munro, 2013), Scale if item deleted could be
used to find poorly performing raters, who could then be excluded. Selecting
Absolute agreement in the Type drop-down (see Figure 15.13) renders an estimate
of the level of agreement between raters in addition to the reliability estimate.
Table 15.16 presents the case processing summary output.
The case processing summary in Table 15.16 indicates that the average ratings
of the three raters for all 40 participants were included. Table 15.17 presents the
reliability statistics output as the Cronbach’s alpha coefficient. Cronbach’s alpha is
found to be very high at 0.95, indicating a high degree of consistency among the
raters, which indicates that the rating scale is reliable.
Table 15.18 presents the item statistics output. The item statistics in this case
are average ratings for each rater, and they can be used to ensure that raters do not
diverge too much in their ratings. In this case, all the raters are within 0.35 of a
score level of each other. Table 15.19 presents the intraclass correlation coefficient
output.
Intraclass correlation shows the absolute agreement between raters, rather than
just consistency, as shown by Cronbach’s alpha. This can be relevant because raters
can be consistent without really being in agreement. That is, if one rater system-
atically rates one score level lower than the other raters, consistency will be high
but agreement will be low. Ideally, the Average Measures intraclass correlation (for
several raters) should be similar to Cronbach's alpha. In this SPSS output, the Single
Measures intraclass correlation estimates the reliability of a single rater, whereas the
Average Measures intraclass correlation estimates the reliability of the averaged
ratings of all the raters.
Standard Error of Measurement (SEM)

The SEM expresses, in raw-score units, how far a test taker's observed score is likely
to fall from his or her true score. It is calculated from the standard deviation of the
test scores and the reliability coefficient:

SEM = SD × √(1 − α)
For example, if the standard deviation of a test is 7 and the reliability coefficient
or Cronbach’s alpha is 0.85, the SEM can be computed as follows:
SEM = 7 × √(1 − 0.85) = 7 × √0.15 ≈ 7 × 0.39 = 2.73.
If a test taker obtains a score of 30 out of 40, the SEM score can be added to as
well as subtracted from the raw score. This allows a computation of the lower and
upper bound of the test taker’s score at a 68% confidence band. In this case, the test
taker’s score could fall between 27.27 and 32.73. Knowing that the student’s score
could be within this range provides some confidence about the use of the test
score in research. If the SEM is found to be larger, say 5, this test taker’s score could
lie between 25 and 35, so this test would not be considered precise: if the score is
used for research, there is a chance that statistical inferences based on it would be
inaccurate. According to this discussion, a highly reliable test will produce a small
SEM value, which means that there is little error in measurement.
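A two-line helper makes the SEM arithmetic explicit (full-precision arithmetic gives 2.71; the chapter's 27.27–32.73 band reflects rounding √0.15 to 0.39):

```python
import math

def sem(sd, alpha):
    """Standard error of measurement: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - alpha)

s = sem(sd=7, alpha=0.85)          # 2.71 at full precision
score = 30
print(score - s, score + s)        # the 68% confidence band around the score
```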
Several factors affect the size of a reliability estimate:
1. The nature of the research construct of interest. Some constructs are less complex
than others, and so can be assessed more reliably than others. For example,
vocabulary knowledge can be measured by a vocabulary test more reliably
than by a general language proficiency test, which involves the ability to use
several language skills.
2. The quality of an instrument or coding scheme. A good research instrument that
is developed on the basis of robust theories and clearly defined and well-
informed test specifications is likely to have high reliability. An instrument
that is tested, piloted, and/or validated is likely to have a high reliability esti-
mate. The same applies to a data coding scheme or performance rating scale,
which needs to be developed and revised carefully before its actual use. Raters
or coders need to be trained to understand the descriptors and to practice rating
different performance levels and coding qualitative data. This type of quality
control is likely to lead to a high inter-rater/coder reliability estimate. There-
fore, any research instruments that do not involve any of these elements are
unlikely to be very reliable.
3. Objective versus subjective scoring. Generally, it is easier to attain high reliability
when tests or instruments are scored objectively. Multiple-choice tests are a
typical example of objective scoring. Subjective scoring—such as the rating
of writing or speaking, or coding of qualitative data—requires more training
of raters or coders to ensure intra- and inter-rater/coder reliability.
4. Sample size. A large sample size can lead to more reliable measures. This means
that in a language test there will be a greater chance of having high-ability,
intermediate-ability, and low-ability learners in the same data set. A small
sample size often leads to restricted data ranges, which in turn affects observa-
tions of consistency.
5. The range of ability or attributes of participants. A motivation questionnaire that
is given to highly motivated learners only is likely to produce a low reliability
estimate because there is low variance in the data set. A well-developed lan-
guage proficiency test that is given to low-ability learners only is also likely
to produce a low reliability estimate.
6. Length of an instrument. The longer the test/questionnaire, the more reliable
it is. However, it is essential that items or questions are carefully designed to
measure the target construct, as tests or questionnaires that are very long will be
a burden for participants and can be less practical and more expensive to use.
Summary
This chapter has presented the important concept of reliability, which is inte-
gral to solid quantitative research. There are several measures of reliability. The
choice of reliability measure will be influenced by the source of the data being
analyzed; it may have been collected from research instruments such as tests and
questionnaires, or from subjective data coding by human beings. This chapter has
illustrated how to compute Cronbach’s alpha, the Spearman-Brown prophecy
coefficient, Cohen’s kappa, and intraclass correlations in SPSS. There are other
statistical methods that are not presented in this chapter (see Resources on Methods
for Second Language Research in the Epilogue for a list of publications that deal
with these methods). The SEM and factors affecting the reliability of research
instruments and of data coding or rating have been discussed. Finally, the relation-
ship between reliability and validity has also been addressed in this chapter. In the
context of L2 research, reliability is important because it is a prerequisite condi-
tion for research validity. That is, research findings cannot be valid if the research
instruments being used are not reliable. The Epilogue, which concludes this book,
follows.
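For readers who wish to check SPSS’s Cronbach’s alpha output by hand, the following is a minimal Python sketch of the alpha formula (not from the book, which computes it in SPSS); the function name and the item scores are hypothetical.

```python
def cronbach_alpha(scores):
    """Cronbach's alpha: (k/(k-1)) * (1 - sum(item variances) / variance(totals))."""
    k = len(scores[0])                       # number of items
    assert all(len(row) == k for row in scores)

    def variance(values):                    # sample variance (n - 1 denominator)
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical data: five participants answering four Likert-type items.
scores = [
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
]
print(round(cronbach_alpha(scores), 3))      # prints 0.936 for these data
```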
Review Exercises
To download review questions and exercises for this chapter, visit the Companion
Website: www.routledge.com/cw/roever.
EPILOGUE
This book has outlined the methodology required for sound quantitative research.
It has provided the reader with the essential tools needed to perform quantitative
research for many different purposes, and with various data types. The steps required
to produce reliable and valid results have been described. Researchers carry a heavy
responsibility as the results of their work may have real-world consequences for a
number of stakeholders. For this reason, this book should be viewed as the founda-
tion on which further training in research methodology should be built.
Resources on Methods for Second Language Research
Dörnyei, Z. (2007). Research methods in applied linguistics: Quantitative, qualitative, and mixed
methodologies. Oxford: Oxford University Press.
This book presents research methodologies and many other important considerations
in applied linguistics. It focuses on the various stages of academic research in applied
linguistics.
Mackey, A., & Gass, S. M. (Eds.). (2012). Research methods in second language acquisition: A
practical guide. Malden, MA: Wiley-Blackwell.
This edited volume focuses on methodological issues in L2 acquisition research related
to data collection and data analysis.
Mackey, A., & Gass, S. M. (2015). Second language research: Methodology and design (2nd ed.).
London: Routledge.
This book describes the essential principles of research methodology, methods, and tech-
niques in L2 research.
Paltridge, B., & Phakiti, A. (Eds.). (2015). Research methods in applied linguistics: A practical
resource. London: Bloomsbury.
This edited volume presents several types of research in applied linguistics (e.g., general
quantitative, qualitative and mixed methods research, experimental research, and survey
research). The areas of research it focuses on include language skills, LTA, and classroom
practice.
Quantitative Methods
The following books present more sophisticated statistical concepts that are not
covered in the current book.
volume contains chapters on current challenges in L2 research, and the second section
includes chapters on alternatives, advances, and the future of L2 quantitative methodol-
ogy. The authors of the chapters in this volume identify specific problems in L2 quanti-
tative research and recommend solutions to such problems, so that quantitative research
practices can meet the required assumptions of quantitative methodology.
Porte, G. K. (2010). Appraising research in second language learning: A practical approach to critical
analysis of quantitative research (2nd ed.). Amsterdam and Philadelphia: John Benjamins
Publishing Company.
This book provides guidance on how to evaluate quantitative research articles. It
explains the different components of research articles; in particular, it illustrates how to
interpret and evaluate quantitative findings.
Plonsky, L. (Ed.). (2015). Advancing quantitative methods in second language research. New York:
Routledge.
This edited volume is at a more advanced level than the current book and extends its
content. It includes, for example, chapters on statistical power and
p-values, mixed effects modeling and longitudinal data analysis, cluster analysis, explor-
atory factor analysis, Rasch analysis, structural equation modeling, and the Bayesian
approach to hypothesis testing.
Woodrow, L. (2014). Writing quantitative research in applied linguistics. New York: Palgrave
Macmillan.
This book focuses on strategies that can be useful when writing quantitative research
reports. It illustrates how to write about specific statistical procedures and findings (e.g.,
t-tests, ANOVA, correlations, and nonparametric tests).
Coombe, C. A., Davidson, P., O’Sullivan, B., & Stoynoff, S. (Eds.). (2012). Cambridge guide
to second language assessment. Cambridge: Cambridge University Press.
Corder, G. W., & Foreman, D. I. (2009). Non-parametric statistics for non-statisticians. Hoboken,
NJ: John Wiley.
Council of Europe (2001). Common European framework of reference for languages: Learning,
teaching, assessment. Cambridge: Cambridge University Press.
Crossley, S. A., Cobb, T., & McNamara, D. S. (2013). Comparing count-based and band-
based indices of word frequency: Implications for active vocabulary research and peda-
gogical applications. System, 41(4), 965–982.
Derwing, T. M., & Munro, M. J. (2013). The development of L2 oral language skills in two
L1 groups: A 7-year study. Language Learning, 63(2), 163–185.
Di Silvio, F., Donovan, A., & Malone, M. E. (2014). The effect of study abroad homestay
placements: Participant perspectives and oral proficiency gains. Foreign Language Annals,
47(1), 168–188.
Doolan, S. M., & Miller, D. (2012). Generation 1.5 written error patterns: A comparative
study. Journal of Second Language Writing, 21(1), 1–22.
Dörnyei, Z., & Taguchi, T. (2010). Questionnaires in second language research. London:
Routledge.
Douglas, D. (2010). Understanding language testing. London: Hodder Education.
Eisenhauer, J. G. (2008). Degrees of freedom. Teaching Statistics, 30(3), 75–78.
Elder, C., Knoch, U., & Zhang, R. (2009). Diagnosing the support needs of second lan-
guage writers: Does the time allowance matter? TESOL Quarterly, 43(2), 351–360.
Ellis, R. (2015). Understanding second language acquisition. Oxford: Oxford University Press.
Field, A. (2013). Discovering statistics using IBM SPSS statistics (4th ed.). Los Angeles: Sage.
Fulcher, G. (2010). Practical language testing. London: Hodder Education.
Furr, R. M. (2010). Yates correction. In N. J. Salkind (Ed.), Encyclopedia of research design
(Vol. 3, pp. 1645–1648). Los Angeles: Sage.
Fushino, K. (2010). Causal relationships between communication confidence, beliefs about
group work, and willingness to communicate in foreign language group work. TESOL
Quarterly, 44(4), 700–724.
Gass, S. M. with Behney, J., & Plonsky, L. (2013). Second language acquisition: An introductory
course (4th ed.). New York and London: Routledge.
Gass, S., Svetics, I., & Lemelin, S. (2003). Differential effects of attention. Language Learning,
53(3), 497–545.
Green, A. (2014). Exploring language assessment and testing: Language in action. New York:
Routledge.
Greenhouse, S. (1990). Yates’s correction for continuity and the analysis of 2×2 contingency
tables: Comment. Statistics in Medicine, 9(4), 371–372.
Guo, Y., & Roehrig, A. D. (2011). Roles of general versus second language (L2) knowledge
in L2 reading comprehension. Reading in a Foreign Language, 23(1), 42–64.
Haviland, M. G. (1990). Yates’s correction for continuity and the analysis of 2×2 contin-
gency tables. Statistics in Medicine, 9(4), 363–367.
House, J. (1996). Developing pragmatic fluency in English as a foreign language: Routines
and metapragmatic awareness. Studies in Second Language Acquisition, 18(2), 225–252.
Hudson, T., & Llosa, L. (2015). Design issues and inference in experimental L2 research.
Language Learning, 65(S1), 76–96.
Huff, D. (1954). How to lie with statistics. New York: Norton.
Isaacs, T., Trofimovich, P., Yu, G., & Munoz, B. M. (2015). Examining the linguistic aspects
of speech that most efficiently discriminate between upper levels of the revised IELTS
Pronunciation scale. IELTS Research Report, 4, 1–48.
Jamieson, S. (2004). Likert scales: How to (ab)use them. Medical Education, 38(12),
1212–1218.
Jia, F., Gottardo, A., Koh, P. W., Chen, X., & Pasquarella, A. (2014). The role of accultura-
tion in reading a second language: Its relation to English literacy skills in immigrant
Chinese adolescents. Reading Research Quarterly, 49(2), 251–261.
Kane, M. (2006). Validation. In R. Brennan (Ed.), Educational measurement (4th ed.,
pp. 17–64). Westport, CT: Greenwood Publishing.
Keith, T. Z. (2003). Validity of automated essay scoring systems. In M. D. Shermis &
J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 147–167).
Hillsdale, NJ: Lawrence Erlbaum Associates.
Khang, J. (2014). Exploring utterance and cognitive fluency of L1 and L2 English speakers:
Temporal measures and stimulated recall. Language Learning, 64(4), 809–854.
Ko, M. H. (2012). Glossing and second language vocabulary learning. TESOL Quarterly,
46(1), 56–79.
Kormos, J., & Trebits, A. (2012). The role of task complexity, modality and aptitude in
narrative task performance. Language Learning, 62(2), 439–472.
Kunnan, A. J. (Ed.). (2014). The companion to language assessment. Oxford, UK: John Wiley
& Sons.
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A
practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 863.
Larson-Hall, J. (2010). A guide to doing statistics in second language research using SPSS. New
York: Routledge.
Larson-Hall, J. (2016). A guide to doing research in second language acquisition with SPSS and R
(2nd ed.). New York: Routledge.
Laufer, B., & Waldman, T. (2011). Verb-noun collocations in second language writing: A
corpus analysis of learners’ English. Language Learning, 61(2), 647–672.
Laufer, B., & Rozovski-Roitblat, B. (2011). Incidental vocabulary acquisition: The effects of task
type, word occurrence and their combination. Language Teaching Research, 15(4), 391–411.
Lee, C. H., & Kalyuga, S. (2011). Effectiveness of different pinyin presentation formats
in learning Chinese characters: A cognitive load perspective. Language Learning, 61(4),
1099–1118.
Lightbown, P. M., & Spada, N. (2013). How languages are learned (4th ed.). Oxford: Oxford
University Press.
Liu, D. (2011). The most frequently used English phrasal verbs in American and British
English: A multicorpus examination. TESOL Quarterly, 45(4), 661–688.
Macaro, E. (2010). Continuum companion to second language acquisition. London: Continuum.
Mackey, A., & Gass, S. M. (2015). Second language research: Methodology and design (2nd ed.).
London: Routledge.
Mantel, N. (1990). Yates’s correction for continuity and the analysis of 2×2 contingency
tables: Comment. Statistics in Medicine, 9(4), 369–370.
Matsumoto, M. (2011). Second language learners’ motivation and their perception of their
teachers as an affecting factor. New Zealand Studies in Applied Linguistics, 17(2), 37–52.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103).
New York: Macmillan.
Median: The value that divides an ordered data set into two equal halves, with 50%
of the values below it and 50% above it.
Meta-analysis: A systematic review of previous empirical studies that uses statistical
analysis to estimate an average effect size across those studies.
Mode: The value that occurs most frequently in a data set.
Multiple regression: An extension of simple regression to examine the effect of
several independent variables on an outcome variable simultaneously. Hierar-
chical regression is performed when researchers enter independent variables in
steps (called blocks).
Negative correlation coefficient: A correlation coefficient that indicates that
as one variable increases, the other decreases, and vice versa.
Nominal data: Data that consist of named categories. Nominal data can be
compared only in terms of sameness or difference, rather than size or strength
(e.g., gender, nationality, first language). Nominal data allow frequency
counts, including raw frequencies and percentages, as well as visual representa-
tions (e.g., pie charts).
Normal distribution: A data distribution that is bell-shaped.
Null hypothesis (H0): A statistical hypothesis that is tested against empirical
data. Inferential statistics test whether the null hypothesis can be rejected. A null
hypothesis typically contains a word such as ‘no’ or ‘not’ (e.g., there is no
relationship . . ., there is no difference . . ., and there is no effect . . .). A null
hypothesis is rejected when the observed probability value falls below a preset
level (e.g., when p < 0.05).
Ordinal data: Data that are put into an order. Ordinal data can be obtained when
participants are rated or ranked according to their test performances or levels
of some feature. Ordinal data are more informative than nominal data since
they contain information about relative size or position (running at average
speed, high speed, very high speed), but are less informative than interval data,
which contain information about the exact size of the difference (race times of
25 seconds, 12 seconds, 10 seconds).
Paired-samples t-test: A parametric test for comparing the mean scores of two
tests or measurement instruments taken by the same group of participants. The
test is called ‘paired’ because each participant contributes a pair of scores (e.g., a
pretest score paired with a posttest score). It is also called ‘a dependent t-test’
because both mean scores depend on the same group of participants.
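For reference, a standard formulation of this statistic (not quoted from the book) computes t from the participants’ difference scores:

```latex
% Paired-samples t statistic computed from the difference scores
% d_i (e.g., posttest_i - pretest_i) of the n participants.
\[
t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad df = n - 1
\]
% \bar{d} is the mean difference and s_d the standard deviation of the differences.
```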
Parameters: The characteristics of the population of interest.
Participants: People who take part in a research study. Participants are sources
of data for analysis.
Pearson Product Moment correlation (Pearson’s r): A parametric statistic
for correlational analysis.
Pie chart: A circular diagram for displaying the relative sizes of values of a variable.
Research design: A research plan, outline, and method to help researchers tackle
a particular research problem.
Research reliability: The confidence that similar findings or conclusions are
likely to be repeated in new studies (i.e., replicability).
Sample size: The number of participants who produce data for quantitative
analysis. Large samples are generally preferable to small samples.
Scatterplot: A diagram that visualizes the correlation between two variables.
Skewness statistic: A statistic that describes whether more of the data are at the
low end of the range or the high end of the range.
Spearman’s rho: A nonparametric (distribution-free) statistic for correlational
analysis between two variables, sometimes written with the Greek letter rho (ρ)
and sometimes spelled out (rho). This statistic is an alternative to the Pearson
Product Moment correlation. Unlike the Pearson correlation, it does not have
a coefficient of determination.
Spearman-Brown prophecy coefficient: A reliability measure in the split-half
reliability test. It can be used to inform researchers whether the reliability of
the test will increase if more items are added. It can also be used to examine
inter-rater reliability.
Sphericity: A statistical assumption for the repeated-measures ANOVA that
refers to the condition that the variances of the differences between the
individual measurements should be roughly equal.
Standard deviation (SD or Std. Dev): A statistic that indicates how different
individual values are from the mean.
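For reference, a standard statement of the sample standard deviation formula (not quoted from this glossary) is:

```latex
% Sample standard deviation, using the n - 1 denominator that SPSS
% reports by default for sample data.
\[
SD = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
\]
```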
Standard error of measurement (SEM): A statistic for estimating the lower
and upper bounds of an individual’s score, computed from the reliability
coefficient of a research instrument and the standard deviation of the scores.
The higher the reliability coefficient, the lower the value of the SEM.
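A common computational form of the SEM (consistent with this definition, though the formula itself is not quoted here) is:

```latex
% SEM from the score standard deviation and the reliability coefficient r_xx;
% higher reliability shrinks the SEM, as the entry above states.
\[
SEM = SD \sqrt{1 - r_{xx}}
\]
% Hypothetical example: SD = 10 and r_xx = .91 give SEM = 3, so a learner
% scoring 60 has a band of roughly 57 to 63 (plus or minus one SEM).
```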
Standardization: A procedure in which all research participants receive the same
conditions (e.g., same tasks and equal time allowance) during data collection.
Statistical Package for the Social Sciences (SPSS): A statistical program for
performing quantitative data analysis.
Statistical reasoning: The process of making inferences or drawing conclusions
from data on the basis of statistical evidence.
Statistical significance: The index that shows how likely it is that a statistical
finding is due to chance. It is known as the significance level and it is given as a
decimal (e.g., p < 0.05 or p = 0.032). In inferential statistics, it is insufficient to
merely report statistical significance (see ‘effect size’).
Stratified random sampling: A sampling technique in which researchers divide
the target population into sub-groups or strata and then randomly choose
equal numbers of participants from each sub-group to form a total sample.
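To make the procedure concrete, here is a minimal Python sketch (not from the book, which uses SPSS); the population records and the ‘campus’ stratum are hypothetical.

```python
import random

# Hypothetical population: 60 learners spread across three campuses.
population = [
    {"id": i, "campus": campus}
    for i, campus in enumerate(["City", "Suburban", "Regional"] * 20)
]

def stratified_sample(records, stratum_key, n_per_stratum):
    """Divide records into strata, then randomly draw an equal number from each."""
    strata = {}
    for record in records:                        # group records by stratum
        strata.setdefault(record[stratum_key], []).append(record)
    sample = []
    for group in strata.values():                 # equal-sized random draw per stratum
        sample.extend(random.sample(group, n_per_stratum))
    return sample

sample = stratified_sample(population, "campus", 5)
print(len(sample))                                # 15 participants, 5 per campus
```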
adjacent agreement method 229
alpha level, setting 89–90
alternative hypothesis 89, 89
analysis of covariance (ANCOVA): between-subjects factors/contrasts and 151, 151; case selection in SPSS program and 142, 143–53, 143–51; conditions of 139–40; conditions in SPSS, checking 140–3, 141–2; covariate and 138–40, 151; describing 139; gain scores and 136; homogeneity of regression slopes and 140, 148, 148; homoscedasticity and 150–1; intervening variables and, eliminating 135–9; overview 153; in second language research 135; in SPSS program 140–53, 141–52
analysis of variance (ANOVA): assumptions of, statistical 119–20; degrees of freedom in 86–7; describing 117–22; effect size for 121–2; F-statistic and 210; outcomes of 119–20; overview 117, 134; post hoc tests and 120–1, 126, 127; posttest and 118, 118; in second language research 117–22; in SPSS program 122–7, 122–7; steps in, key 119
ANOVA see analysis of variance; repeated-measures ANOVA; two-way mixed design ANOVA
assessment 11
Asymp. Sig (2-tailed) value 110, 115
bar graphs: in descriptive statistics 35, 35; in descriptive statistics in SPSS program 54–5, 55; SPSS program instructions for 54–5, 55
Becker’s effect size calculator 104
ß value and coefficient 202–5, 211, 216–17
between-subjects factors/contrasts 151, 151, 174, 174, 176, 176
bimodal data 37–8
bivariate correlation 77
Bonferroni post hoc test 150, 156
case summaries, generating in SPSS program 20–2, 20–1
categorical data 7–8, 8, 39
central tendency measures 36–8
chi-square test: assumptions of 189–90; non-SPSS method for 195–8, 196–8; one-dimensional 182–5, 183; overview 198–9; in L2 research 182; in SPSS program 190–5, 191–5; two-dimensional 185–9, 185–6, 188
coding data 15, 234–8
coefficient of determination 67
Cohen’s d 88, 96–7, 101, 103–4, 111–12, 121
Cohen’s kappa 234–8, 236–8
collinearity 205, 212, 212, 217, 217
collocations 188–9, 188
Levene’s test 96, 100–1, 100, 125–6, 150, 175, 176, 178–9
Likert-type scale 3, 39–40, 40, 58, 108, 128, 219–20
Lower Bound corrections 163
low skewness statistic 41, 42
main effects 168
Mann-Whitney U test 106–11, 107–11
marginal totals 186, 188, 188
Mauchly’s Test of Sphericity 161–2, 162, 175, 175
mean 36–7, 118
measurements: central tendency 36–8; dispersion 38–9; distribution 40–3, 41–2; inter-coder reliability 228–30; inter-rater reliability 228–30; intra-rater reliability 228; normal distribution 40–3, 41–2; proficiency level 12, 188–9, 188; scales 3–8, 4–8; see also internal consistency measures
median 36–7, 36, 39
medium effect 97
Minitab software 14
missing values, assigning 47–8, 47–8
mode 37–9
moderator variables 135–9
multiple regression: ANOVA result and 210–11, 211, 216, 216; assumptions of 205; collinearity and 205, 212, 212, 217, 217; describing 203–5, 204; descriptive statistics in 209, 209; hierarchical regression and 203, 204; model coefficient outputs and 211–12, 211–12, 216–17, 216–17; overview 218; sample size in 205; in second language research 200; simple 200–3, 201; in SPSS program 206–12, 206–12
multivariate analysis of variance (MANOVA) 118, 160, 161
Multivariate Tests 164
negative correlations 62–6, 66, 68
negatively skewed distribution 41, 42
negative ranks 114
nominal data 7–8, 8, 39
nominal variables, assigning value to 44–7, 45–7
nonparametric tests: determining use of 106; Mann-Whitney U test 106–11, 107–11; overview 116; in second language research 106; Wilcoxon Signed-ranked test 111–16, 112–15
non-SPSS method for chi-square test 195–8, 196–8
normal distribution 66, 85, 85
normal distribution measures 40–3, 41–2
null hypothesis 89, 89, 183, 185
one-dimensional chi-square test 182–5, 183
one-way analysis of variance see analysis of variance (ANOVA)
ordinal data 5–7, 5–6, 39–40
ordinal-ordinal relationships 68
outcome variable 119–20
outliers 36–7, 36, 106
paired-samples t-tests 93–5, 95, 102–4, 102–4
pairwise comparisons 133, 133, 164, 164
parameters 81
parametric statistic 66
partial eta squared 121, 157–8
partialing out covariate 139
Pearson: correlation analysis 43; correlation coefficient 121, 182, 204; Product Moment 66, 70–9, 71, 77, 230; Pearson’s r 66–8
percentage of agreement 229
performance rating 234–8
phi coefficients 68, 184; Phi value 184, 187, 195
pie charts: in descriptive statistics 33–5, 34; in descriptive statistics in SPSS program 56, 56; SPSS program instructions for 56, 56
platykurtic distribution 43
point-biserial correlation 69
populations 81–3
positive correlations 60, 62–6, 64, 68
positively skewed distribution 41, 41
positive ranks 114
post hoc tests 119–21, 126, 127, 140, 150, 156–7, 178–9, 178
predictor variable 201, 205, 209–10, 209
pre-post studies 154–6, 154–5
probability 83–4
PSPP software 14
purposive sampling 82
p-value 84, 88–90, 93, 100, 120
qualitative data coding 234–8
quantification: categorical data in 7–8, 8; constructs in 2; data in 2; describing 2–3; descriptive statistics in 28; at group level 28–30, 29–30; hypotheses
t-tests in 97–102, 98–101; intraclass correlation in 240–3, 240–3; Kendall’s tau in 68; Kruskal-Wallis test in 128–34, 129–33; Mann-Whitney U test in 108–11, 108–10; missing values in, assigning 47–8, 47–8; multiple regression in 206–12, 206–12; notes on, important 15–16; overview 14, 27; paired-samples t-tests in 102–4, 102–4; Pearson Product Moment in 78, 78, 79; pie charts in 56, 56; repeated-measures ANOVA in 158–64, 158–64; scatterplots in 73, 74–7; in second language research 14; Spearman-Brown coefficient in 231–4, 231–3; Spearman correlation in 76, 79, 78; Spearman’s rho and 68; spreadsheet in, creating 16–20, 16–19; standard deviation in 38; statistical significance and 86; Test of Between-Subjects Effect 163; Tests of Within-Subjects Contrast 163; two-way mixed design ANOVA in 170–80, 170–80; value labels in, assigning 44–7, 45–7; variables in, computing 136–7, 136–7; Wilcoxon Signed-rank test in 112–16, 112–15; see also descriptive statistics in SPSS program
standard deviation (SD) 38–9, 118, 243
standard error of measurement (SEM) 243–4
Statistical Package for Social Sciences program see SPSS program (IBM)
statistical significance 2–3, 83–4, 86–7, 89–90, 140, 156–7
stratified random sampling 82
Tamhane T2 post hoc test 120, 126, 140, 178, 178
Test of Between-Subjects Effects 163
Test of English as a Foreign Language (TOEFL) 4, 58, 61–2, 83
Test of English for International Communication (TOEIC) 4
Test of English Pragmatics (TEP) 30, 126, 140, 206, 223
test item discrimination 68
test-retest reliability 221
theories 2
transforming data in real-life context 8–11, 9–11
t-tests: assumptions of 96; Cohen’s d in 88, 96–7; dependent 93; effect size for 96–7, 104; equal variances assumption and 96; independent-samples 93–4, 93, 96–102, 98–101, 106, 117, 121, 138; Levene’s test and 96; overview 104–5; paired-samples 93–5, 95, 102–4, 102–4; repeated measures 93; in second language research 92–3; steps for using 97
t-value 93, 103
two-dimensional chi-square test 185–9, 185–7
two-way analysis of variance 117
two-way mixed-design ANOVA: between-subjects factors/contrasts and 174, 174, 176, 176; descriptive statistics in 168, 168, 174, 175, 177, 177; Levene’s test and 175, 176, 178–9; Mauchly’s Test of Sphericity and 175, 175; overview 180–1; pairwise comparisons and 177, 177, 179; post hoc tests and 178–9, 178; pretest-posttest control-group design and 166, 167; results, written 180; in second language research 166–9, 167–9; in SPSS program 170–80, 170–80; univariate tests and 177, 177; within-subjects factors/contrasts and 166–7, 174–5, 174, 176, 178
type I error 90
type II error 90
univariate analysis of variance see analysis of variance (ANOVA)
U-value 107
variables: confounding 135–9; dependent 7, 119–20; excluded 217, 218; factor 119–20; grouping 119–20, 128; independent 7–8, 119–20, 128; intervening 135–9; moderator 135–9; nominal, assigning values to 44–7, 45–7; outcome 119–20; predictor 201, 205, 209–10, 209; in quantification 2; SPSS program and computing 136–7, 136–7
VassarStats website 196–7, 196, 198
Wiseheart’s calculator 104
within-subjects factors/contrasts 162, 163, 163, 166–7, 174–5, 174, 176, 178
Yates correction 187, 194, 197
Z-value 107, 110–11, 114–15, 115