Using Observation Checklists to Validate Speaking-Test Tasks
Barry O’Sullivan, Cyril J. Weir and Nick Saville
Test-task validation has been an important strand in recent revision projects for
University of Cambridge Local Examinations Syndicate (UCLES) examinations.
This article addresses the relatively neglected area of validating the match between
intended and actual test-taker language with respect to a blueprint of language
functions representing the construct of spoken language ability. An observation
checklist designed for both a priori and a posteriori analysis of speaking task
output has been developed. This checklist enables language samples elicited by
the task to be scanned for these functions in real time, without resorting to the
laborious and somewhat limited analysis of transcripts. The process and results of its development are discussed, together with implications and further applications.
Address for correspondence: Barry O’Sullivan, Testing and Evaluation Unit, School of Linguistics and Applied Language Studies, The University of Reading, PO Box 241, Whiteknights, Reading RG6 6WB, UK; email: b.e.osullivan@reading.ac.uk
1 By performance tests we are referring to direct tests where a test-taker’s ability is evaluated from their performance on a set task or tasks.
[Figure: a model of speaking-test performance, linking the examination developer (specifications and construct), the tasks and examination conditions, the candidates (knowledge and ability), the examiners (knowledge, ability and training), the assessment criteria and assessment conditions, the elicited sample of language and the resulting score.]
all, the theory on which all else rests; it is from there that the construct
is set up and it is on the construct that validity, of the content and
predictive kinds, is based.’ Kelly (1978: 8) supported this view, com-
menting that: ‘the systematic development of tests requires some
theory, even an informal, inexplicit one, to guide the initial selection
of item content and the division of the domain of interest into appro-
priate sub-areas.’
Because we lack an adequate theory of language in use, a priori attempts to determine the construct validity of proficiency tests involve us in matters that relate more evidently to content validity. We need to talk of the communicative construct in descriptive terms and, as a result, we become involved in questions of content relevance and content coverage. Thus, for Kelly (1978: 8) content validity seemed ‘an almost completely overlapping concept’ with construct validity, and for Moller (1982: 68): ‘the distinction between construct and content validity in language testing is not always very marked, particularly for tests of general language proficiency.’
Content validity is considered important as it is principally concerned with the extent to which the selection of test tasks is representative of the larger universe of tasks of which the test is assumed to be a sample (see Bachman and Palmer, 1981; Henning, 1987: 94; Messick, 1989: 16; Bachman, 1990: 244). Similarly, Anastasi (1988: 131) defined content validity as involving: ‘essentially the systematic examination of the test content to determine whether it covers a representative sample of the behaviour domain to be measured.’ She outlined (Anastasi, 1988: 132) the following guidelines for establishing content validity:
1) ‘the behaviour domain to be tested must be systematically analysed to make certain that all major aspects are covered by the test items, and in the correct proportions’;
2) ‘the domain under consideration should be fully described in advance, rather than being defined after the test has been prepared’;
3) ‘content validity depends on the relevance of the individual’s test responses to the behaviour area under consideration, rather than on the apparent relevance of item content.’
The directness of fit and adequacy of the test sample is thus dependent on the quality of the description of the target language behaviour being tested. In addition, where the responses to the items are invoked, Messick (1975: 961) suggests that ‘the concern with processes underlying test responses places this approach to content validity squarely in the realm of construct validity’. Davies (1990: 23) similarly notes that ‘content validity slides into construct validity’.
While there is clearly a great deal of potential for this detailed analy-
sis of transcribed performances, there are also a number of drawbacks,
the most serious of which involves the complexity of the transcription
process. In practice, this means that a great deal of time and expertise
is required in order to gain the kind of data that will answer the basic
question concerning validity. Even where this is done, it is impractical
to attempt to deal with more than a small number of test events;
therefore, the generalizability of the results may be questioned.
Clearly then, a more efficient methodology is required that allows the test designer to evaluate the procedures and, especially, the tasks in terms of the language produced by a larger number of candidates. Ideally this should be possible in ‘real’ time, so that the relationship of predicted outcome to specific outcome can be established using a data set that satisfactorily reflects the typical test-taking population.
The primary objective of this project, therefore, was to create an
instrument, built on a framework that describes the language of per-
formance in a way that can be readily accessed by evaluators who
are familiar with the tests being observed. This work is designed to
be complementary to the use of transcriptions and to provide an
additional source of validation evidence.
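The structure of the instrument is reproduced in Appendices 1 and 3. Purely by way of illustration, the sketch below shows one way in which such a checklist might be represented for real-time tallying; the Python class and its methods are hypothetical and form no part of the published instrument, although the function labels follow the operational checklist in Appendix 3.

```python
from collections import Counter

# Function categories follow the operational checklist in Appendix 3.
# The class itself is a hypothetical illustration, not the published instrument.
CHECKLIST = {
    "informational": [
        "providing personal information", "expressing opinions", "elaborating",
        "justifying opinions", "comparing", "speculating", "staging",
        "describing", "summarizing", "suggesting", "expressing preferences",
    ],
    "interactional": [
        "agreeing", "disagreeing", "modifying", "asking for opinions",
        "persuading", "asking for information", "conversational repair",
        "negotiating meaning",
    ],
    "managing interaction": [
        "initiating", "changing", "reciprocating", "deciding",
    ],
}

class ObservationChecklist:
    """Tally the functions observed for one candidate on one task, in real time."""

    def __init__(self) -> None:
        self.counts: Counter[str] = Counter()

    def mark(self, function: str) -> None:
        """Record one observation of a checklist function."""
        if not any(function in fs for fs in CHECKLIST.values()):
            raise ValueError(f"Unknown checklist function: {function}")
        self.counts[function] += 1

    def observed(self) -> set[str]:
        """Functions observed at least once (the binary view used in Phase 1)."""
        return {f for f, n in self.counts.items() if n > 0}

# Example: an observer marks functions while watching a recorded speaking task.
oc = ObservationChecklist()
for f in ["expressing opinions", "justifying opinions", "agreeing", "expressing opinions"]:
    oc.mark(f)
print(oc.counts["expressing opinions"])  # 2
print(sorted(oc.observed()))
```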
The FCE was chosen as the focus of this study for a number of reasons:
· It is ‘stable’, in that it is neither under review nor due to be reviewed.
· It represents the middle of the ALTE (and UCLES Main Suite) range, and is the most widely subscribed test in the battery.
· It offers the greatest likelihood of a wide range of performance of any Main Suite examination: as it is often used as an ‘entry point’ into the suite, candidates tend to range from below to above this level in terms of ability.
· As with all of the other Main Suite examinations, a database of recordings (audio and video) already existed.
Phase 1
The first attempt to examine how the draft checklists would be viewed, and applied, by a group of language teachers was conducted by ffrench (1999). Of the participants at the seminar, approximately 50% of the group reported that English (British English, American English or Australian English) was their first language, while the remaining 50% were native Greek speakers.
In their introduction to the application of the Observation Check-
lists (OCs), the participants were given a series of activities that
focused on the nature and use of those functions of language seen by
task designers at UCLES to be particularly applicable to their EFL
Main Suite Speaking Tests (principally FCE, CAE and CPE). Once
familiar with the nature of the functions (and where they might occur
in a test), the participants applied the OCs in ‘real’ time to an FCE
Speaking Test from the 1998 Standardization Video. This video fea-
tured a pair of French speakers who were judged by a panel of
‘expert’ raters (within UCLES) to be slightly above the criterion
(‘pass’) level.
Of the 37 participants, 32 completed the task successfully; that is, they attempted to make frequency counts of the items represented in the OCs. Among this group, there appear to be varying degrees of agreement as to the use of language functions, particularly in terms of the specific number of observations of each function. However, when the data are examined from the perspective of agreement on whether a particular function was observed or not (ignoring the count, which in retrospect was highly ambitious considering the lack of systematic training in the use of the questionnaires given to the teachers who attended), we find that there is a striking degree of agreement.
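To make this binary re-analysis concrete, the sketch below reduces each observer’s raw frequency counts to an observed/not-observed judgement and reports, for each function, the proportion of observers in the majority category. The helper and its aggregation rule are illustrative assumptions, not the analysis actually carried out in Phase 1.

```python
# Hypothetical sketch: reduce each observer's frequency counts to a binary
# observed / not-observed judgement and measure agreement per function.
# The aggregation rule here is an assumption, not the published analysis.

def agreement_by_function(counts_by_observer: list[dict[str, int]]) -> dict[str, float]:
    """For each function, the proportion of observers in the majority
    (observed vs. not observed) category."""
    functions = {f for counts in counts_by_observer for f in counts}
    n = len(counts_by_observer)
    result = {}
    for f in sorted(functions):
        observed = sum(1 for counts in counts_by_observer if counts.get(f, 0) > 0)
        result[f] = max(observed, n - observed) / n
    return result

# Three observers' raw counts for two functions:
observers = [
    {"speculating": 3, "summarizing": 0},
    {"speculating": 5, "summarizing": 1},
    {"speculating": 2, "summarizing": 0},
]
print(agreement_by_function(observers))
# speculating: 1.0 (all observed it); summarizing: ~0.67 (two of three did not)
```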
Phase 2
In this phase, a much smaller gathering was organized, this time involving members of the development team as well as the three UK-based UCLES Senior Team Leaders. In advance of this meeting all participants were asked to study the existing checklists and to illustrate each function with examples drawn from their experience of the various UCLES Main Suite examinations. The resulting data were collated and presented as a single document that formed the basis of discussion during a day-long session. Participants were not made aware of the findings from Phase 1.
During this session many questions were raised about all aspects of the checklist, and a more streamlined version of the three sections was suggested. In addition to a number of participants making a written record of the discussions, the entire session was recorded. This proved to be a valuable reminder of the way in which particular changes came about and was used when the final decisions regarding inclusion, conflation or omission were being made. Although it is beyond the scope of this project to analyse this recording, when coupled with the earlier and revised documents, it is in itself a valuable source of data in that it provides a significant record of the developmental process.
Among the many interesting outcomes of this phase were the decisions either to rethink, to reorganize or to omit items from the initial list. These decisions were seen to mirror the results of the Phase 1 application quite closely. Of the 13 items identified in Phase 1 as being in need of review (7 were rarely observed, indicating a high degree of agreement that they were not, in fact, present, and 6 appeared to be confused with very mixed reported observations), 7
Phase 3
In the third phase, the revised checklists were given to a group of 15 MA TEFL students who were asked to apply them to two FCE tests. Both of these tests involved a mixed-sex pair of learners, one pair of approximately average ability and the other pair above average. Before using the observation checklists (OCs), the students were asked first to attempt to predict which functions they might expect to find. To help in this pre-session task, the students were given details of the FCE format and tasks.
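The comparison of predicted with observed functions can be illustrated schematically as follows; the helper and the example function sets are hypothetical and do not reproduce the Phase 3 data.

```python
# Hypothetical sketch of the Phase 3 comparison: functions the observers
# predicted the task would elicit versus functions actually ticked on the
# checklist during the live observation.

def compare_prediction(predicted: set[str], observed: set[str]) -> dict[str, set[str]]:
    """Partition functions into those both predicted and observed,
    predicted only, and observed only."""
    return {
        "predicted_and_observed": predicted & observed,
        "predicted_only": predicted - observed,
        "observed_only": observed - predicted,
    }

predicted = {"expressing opinions", "comparing", "speculating", "summarizing"}
observed = {"expressing opinions", "comparing", "justifying opinions"}
print(compare_prediction(predicted, observed))
```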
1 Validities
We would not wish to claim that the checklists on their own offer a satisfactory demonstration of the construct validity of a spoken language test, for, as Messick argues (1989: 16): ‘the varieties of evidence supporting validity are not alternatives but rather supplements to one another.’ We recognize the necessity for a broad view of ‘the evidential basis for test interpretation’ (Messick, 1989: 20). Bachman (1990: 237) similarly concludes: ‘it is important to recognise that none of these [evidences of validity] by itself is sufficient to demonstrate the validity of a particular interpretation or use of test scores’ (see also Bachman, 1990: 243). Fulcher (1999: 224) adds a further caveat against an overly narrow interpretation of content validity when he quotes Messick (1989: 41):
the major problem is that so-called content validity is focused upon test forms rather than test scores, upon instruments rather than measurements . . . selecting content is an act of classification, which is in itself a hypothesis that needs to be confirmed empirically.
Acknowledgements
We would like to thank Don Porter and Rita Green for their early
input into the first version of the checklist. In addition, help was
received from members of the ELT division in UCLES, in particular
from Angela ffrench, Lynda Taylor and Christina Rimini, from a
group of UCLES Senior Team Leaders and from MA TEFL students
at the University of Reading. Finally, we would like to thank the
editors and anonymous reviewers of Language Testing for their
insightful comments and helpful suggestions for its improvement. The
faults that remain are, as ever, ours.
VIII References
Foster, P. and Skehan, P. 1996: The influence of planning and task type on second language performance. Studies in Second Language Acquisition 18, 299–323.
—— 1999: The influence of source of planning and focus of planning on task-based performance. Language Teaching Research 3, 215–47.
Fulcher, G. 1994: Some priority areas for oral language testing. Language Testing Update 15, 39–47.
—— 1996: Testing tasks: issues in task design and the group oral. Language Testing 13, 23–51.
—— 1999: Assessment in English for academic purposes: putting content validity in its place. Applied Linguistics 20, 221–36.
Hayashi, M. 1995: Conversational repair: a contrastive study of Japanese and English. MA Project Report, University of Canberra.
Henning, G. 1983: Oral proficiency testing: comparative validities of interview, imitation, and completion methods. Language Learning 33, 315–32.
—— 1987: A guide to language testing. Cambridge, MA: Newbury House.
Kelly, R. 1978: On the construct validation of comprehension tests: an exercise in applied linguistics. Unpublished PhD thesis, University of Queensland.
Kenyon, D. 1995: An investigation of the validity of task demands on performance-based tests of oral proficiency. In Kunnan, A.J., editor, Validation in language assessment: selected papers from the 17th Language Testing Research Colloquium, Long Beach. Mahwah, NJ: Lawrence Erlbaum, 19–40.
Kormos, J. 1999: Simulating conversations in oral-proficiency assessment: a conversation analysis of role plays and non-scripted interviews in language exams. Language Testing 16, 163–88.
Lazaraton, A. 1992: The structural organisation of a language interview: a conversational analytic perspective. System 20, 373–86.
—— 1996: A qualitative approach to monitoring examiner conduct in the Cambridge assessment of spoken English (CASE). In Milanovic, M. and Saville, N., editors, Performance testing, cognition and assessment: selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem. Studies in Language Testing 3. Cambridge: University of Cambridge Local Examinations Syndicate, 18–33.
—— 2000: A qualitative approach to the validation of oral language tests. Studies in Language Testing, Volume 14. Cambridge: Cambridge University Press.
Lumley, T. and O’Sullivan, B. 2000: The effect of speaker and topic variables on task performance in a tape-mediated assessment of speaking. Paper presented at the 2nd Annual Asian Language Assessment Research Forum, The Hong Kong Polytechnic University.
Luoma, S. 1997: Comparability of a tape-mediated and a face-to-face test of speaking: a triangulation study. Unpublished Licentiate Thesis, Centre for Applied Language Studies, Jyväskylä University, Finland.
Appendix 1 Draft observation checklist
Informational functions
Providing personal information:
· Give information on present circumstances?
· Give information on past experiences?
· Give information on future plans?
Providing nonpersonal information: Give information which does not relate to the individual?
Elaborating: Elaborate on an idea?
Expressing opinions: Express opinions?
Justifying opinions: Express reasons for assertions s/he has made?
Comparing: Compare things/people/events?
Complaining: Complain about something?
Speculating: Hypothesize or speculate?
Analysing: Separate out the parts of an issue?
Making excuses: Make excuses?
Explaining: Explain anything?
Narrating: Describe a sequence of events?
Paraphrasing: Paraphrase something?
Summarizing: Summarize what s/he had said?
Suggesting: Suggest a particular idea?
Expressing preferences: Express preferences?
Interactional functions
Challenging: Challenge assertions made by another speaker?
(Dis)agreeing: Indicate (dis)agreement with what another speaker says? (apart from ‘yeah’/‘no’ or simply nodding)
Justifying/Providing support: Offer justification or support for a comment made by another speaker?
Qualifying: Modify arguments or comments?
Asking for opinions: Ask for opinions?
Persuading: Attempt to persuade another person?
Asking for information: Ask for information?
Conversational repair: Repair breakdowns in interaction?
Negotiating meaning:
· Check understanding?
· Attempt to establish common ground or strategy?
· Respond to requests for clarification?
· Ask for clarification?
· Make corrections?
· Indicate purpose?
· Indicate understanding/uncertainty?
Managing interaction
Initiating: Start any interactions?
Changing: Take the opportunity to change the topic?
Reciprocity: Share the responsibility for developing the interaction?
Deciding: Come to a decision?
Terminating: Decide when the discussion should stop?
Appendix 2 Phase 1 results (summarized)
[Table: results reported by the Phase 1 participants for each checklist function (Make excuses, Terminate, Conversational repair, Summarize, Complain, Paraphrase, Persuade, Change topic, Challenge, Qualify, Ask for info, Suggest, Narrate, Reciprocate, Analyse, Elaborate, Initiate, Provide nonpersonal information, Explain, Justify opinions, Negotiate meaning, Decide, (Dis)agree, Justify/Support, Ask for opinions, Express preferences, Speculate, Compare, Express opinion); the data values did not survive extraction.]
Appendix 3 Operational checklist (used in Phase 3)
Informational functions
Providing personal information:
· Give information on present circumstances
· Give information on past experiences
· Give information on future plans
Expressing opinions: Express opinions
Elaborating: Elaborate on, or modify, an opinion
Justifying opinions: Express reasons for assertions s/he had made
Comparing: Compare things/people/events
Speculating: Speculate
Staging: Separate out or interpret the parts of an issue
Describing:
· Describe a sequence of events
· Describe a scene
Summarizing: Summarize what s/he has said
Suggesting: Suggest a particular idea
Expressing preferences: Express preferences
Interactional functions
Agreeing: Agree with an assertion made by another speaker (apart from ‘yeah’ or nonverbal)
Disagreeing: Disagree with what another speaker says (apart from ‘no’ or nonverbal)
Modifying: Modify arguments or comments made by other speaker or by the test-taker in response to another speaker
Asking for opinions: Ask for opinions
Persuading: Attempt to persuade another person
Asking for information: Ask for information
Conversational repair: Repair breakdowns in interaction
Negotiating meaning:
· Check understanding
· Indicate understanding of point made by partner
· Establish common ground/purpose or strategy
· Ask for clarification when an utterance is misheard or misinterpreted
· Correct an utterance made by other speaker which is perceived to be incorrect or inaccurate
· Respond to requests for clarification
Managing interaction
Initiating: Start any interactions
Changing: Take the opportunity to change the topic
Reciprocating: Share the responsibility for developing the interaction
Deciding: Come to a decision
Appendix 4 Summary of Phase 2 observation
Tape 1 (Tasks 1–4) Tape 2 (Tasks 1–4)
Informational functions
Providing personal
information
Present 12 (G) 1 (L) 1 (L) 1 (L) 12 (G) 1 (L) 4 (L)
Past 10 (G) 4 (S) 12 (G)
Future 11 (G) 3 (L) 6 (S) 12 (G)
Expressing opinions 12 (G) 11 (G) 9 (G) 8 (G) 11 (G) 10 (G) 10 (G) 11 (G)
Elaborating 9 (G) 11 (G) 9 (G) 7 (G) 3 (L) 9 (G) 7 (S) 12 (G)
Justifying opinions 10 (G) 7 (G) 9 (G) 7 (G) 4 (L) 8 (S) 6 (S) 8 (S)
Comparing 11 (G) 8 (G) 1 (L) 6 (S) 3 (L) 12 (G) 7 (S) 5 (S)
Speculating 7 (S) 11 (G) 8 (G) 3 (L) 7 (S) 10 (G) 10 (G) 5 (S)
Staging 6 (S) 1 (L) 3 (L) 6 (L)
Describing
Sequence of events 1 (L) 1 (L) 3 (L) 1 (L) 4 (L)
Scene 5 (S) 9 (G) 2 (S) 2 (S) 10 (G) 2 (S) 3 (S)
Summarizing 1 (L) 1 (L) 1 (L) 1 (L) 3 (L) 1 (L) 1 (L) 1 (L)
Suggesting 1 (L) 2 (L) 1 (L) 3 (L) 2 (L)
Expressing preferences 12 (G) 11 (G) 6 (S) 8 (G) 11 (G) 10 (G) 5 (S) 12 (G)
Interactional functions
Agreeing 6 (S) 9 (G) 2 (L) 10 (G) 4 (L)
Disagreeing 9 (G) 4 (S) 2 (L) 6 (S)
Modifying 1 (L) 5 (S) 4 (S) 7 (S) 1 (L)
Asking for opinions 1 (L) 8 (G) 2 (L) 11 (G)
Persuading 2 (L) 2 (L)
Asking for information 2 (L) 1 (L) 5 (S)
Conversational repair 5 (S) 4 (L) 1 (L)
Negotiating meaning
Check meaning 2 (L) 4 (S) 4 (L)
Understanding 5 (S) 3 (L) 3 (L)
Common ground 2 (L) 2 (L) 1 (L)
Ask clarification 2 (L) 1 (L) 2 (L)
Correct utterance 3 (L) 1 (L)
Respond to required clarification 4 (S) 1 (L)
Managing interaction
Initiating 8 (G) 1 (L) 10 (G) 5 (S)
Changing 8 (G) 7 (S)
Reciprocating 7 (G) 9 (G) 1 (L)
Deciding 3 (L) 1 (L) 1 (L) 2 (L)
Notes: The figures indicate the number of students recording the function in each case. L: Little agreement; S: Some agreement; G: Good agreement. For Tasks 3 and 4 in the first tape observed, the maximum was 9; for all others the maximum was 12. This is because 3 of the 12 MA students did not complete the task for these last two tasks. This was not a problem during the observation of the second tape, so there the maximum figure is 12 throughout.
Appendix 5 Transcript results and observation checklist results
Interactional functions
Agreeing T G T L
Disagreeing T S
Modifying T S T L
Asking for opinions T G
Persuading L
Asking for information S
Conversational repair T S T L L
Negotiating meaning
Check meaning L
Understanding L L
Common ground L L
Ask clarification L T L
Correct utterance L
Respond to required clarification L
Managing interaction
Initiating T G T S
Changing T S
Reciprocating T G L
Deciding L L
Notes: T indicates that this function has been identified as occurring in the transcript of the interaction. L, S and G indicate the degree of agreement among the raters using the checklists in real time (L: Little agreement; S: Some agreement; G: Good agreement).