Using Observation Checklists to Validate Speaking-Test Tasks
Barry O’Sullivan, Cyril J. Weir and Nick Saville
Test-task validation has been an important strand in recent revision projects for
University of Cambridge Local Examinations Syndicate (UCLES) examinations.
This article addresses the relatively neglected area of validating the match between
intended and actual test-taker language with respect to a blueprint of language
functions representing the construct of spoken language ability. An observation
checklist designed for both a priori and a posteriori analysis of speaking task
output has been developed. This checklist enables language samples elicited by
the task to be scanned for these functions in real time, without resorting to the
laborious and somewhat limited analysis of transcripts. The process and results of its development are discussed, together with implications and further applications.
Address for correspondence: Barry O’Sullivan, Testing and Evaluation Unit, School of Linguistics and Applied Language Studies, The University of Reading, PO Box 241, Whiteknights, Reading RG6 6WB, UK; email: b.e.osullivan@reading.ac.uk
1 By performance tests we are referring to direct tests where a test-taker’s ability is evaluated from their performance on a set task or tasks.
[Figure: a model of speaking-test performance, linking the examination developer (specifications and construct), the tasks and examination conditions, the candidates (knowledge and ability), the examiners (knowledge, ability and training), the assessment criteria and assessment conditions, the elicited sample of language and the resulting score.]
all, the theory on which all else rests; it is from there that the construct
is set up and it is on the construct that validity, of the content and
predictive kinds, is based.’ Kelly (1978: 8) supported this view, com-
menting that: ‘the systematic development of tests requires some
theory, even an informal, inexplicit one, to guide the initial selection
of item content and the division of the domain of interest into appro-
priate sub-areas.’
Because we lack an adequate theory of language in use, a priori attempts to determine the construct validity of proficiency tests involve us in matters that relate more evidently to content validity. We need to talk of the communicative construct in descriptive terms and, as a result, we become involved in questions of content relevance and content coverage. Thus, for Kelly (1978: 8) content validity seemed ‘an almost completely overlapping concept’ with construct validity, and for Moller (1982: 68): ‘the distinction between construct and content validity in language testing is not always very marked, particularly for tests of general language proficiency.’
Content validity is considered important as it is principally concerned with the extent to which the selection of test tasks is representative of the larger universe of tasks of which the test is assumed to be a sample (see Bachman and Palmer, 1981; Henning, 1987: 94; Messick, 1989: 16; Bachman, 1990: 244). Similarly, Anastasi (1988: 131) defined content validity as involving: ‘essentially the systematic examination of the test content to determine whether it covers a representative sample of the behaviour domain to be measured.’ She outlined (Anastasi, 1988: 132) the following guidelines for establishing content validity:
1) ‘the behaviour domain to be tested must be systematically analysed to make certain that all major aspects are covered by the test items, and in the correct proportions’;
2) ‘the domain under consideration should be fully described in advance, rather than being defined after the test has been prepared’;
3) ‘content validity depends on the relevance of the individual’s test responses to the behaviour area under consideration, rather than on the apparent relevance of item content.’
The directness of fit and adequacy of the test sample is thus dependent on the quality of the description of the target language behaviour being tested. In addition, where the responses to the items are invoked, Messick (1975: 961) suggests that ‘the concern with processes underlying test responses places this approach to content validity squarely in the realm of construct validity’. Davies (1990: 23) similarly notes that ‘content validity slides into construct validity’.
While there is clearly a great deal of potential for this detailed analy-
sis of transcribed performances, there are also a number of drawbacks,
the most serious of which involves the complexity of the transcription
process. In practice, this means that a great deal of time and expertise
is required in order to gain the kind of data that will answer the basic
question concerning validity. Even where this is done, it is impractical
to attempt to deal with more than a small number of test events;
therefore, the generalizability of the results may be questioned.
Clearly then, a more efficient methodology is required that allows the test designer to evaluate the procedures and, especially, the tasks in terms of the language produced by a larger number of candidates. Ideally this should be possible in ‘real’ time, so that the relationship of predicted outcome to specific outcome can be established using a data set that satisfactorily reflects the typical test-taking population.
The primary objective of this project, therefore, was to create an
instrument, built on a framework that describes the language of per-
formance in a way that can be readily accessed by evaluators who
are familiar with the tests being observed. This work is designed to
be complementary to the use of transcriptions and to provide an
additional source of validation evidence.
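The structure of the instrument is reproduced in Appendices 1 and 3. Purely by way of illustration, the sketch below shows one way in which such a checklist might be represented for real-time tallying; the Python class and its methods are hypothetical and form no part of the published instrument, although the function labels follow the operational checklist in Appendix 3.

```python
from collections import Counter

# Function categories follow the operational checklist in Appendix 3.
# The class itself is a hypothetical illustration, not the published instrument.
CHECKLIST = {
    "informational": [
        "providing personal information", "expressing opinions", "elaborating",
        "justifying opinions", "comparing", "speculating", "staging",
        "describing", "summarizing", "suggesting", "expressing preferences",
    ],
    "interactional": [
        "agreeing", "disagreeing", "modifying", "asking for opinions",
        "persuading", "asking for information", "conversational repair",
        "negotiating meaning",
    ],
    "managing interaction": [
        "initiating", "changing", "reciprocating", "deciding",
    ],
}

class ObservationChecklist:
    """Tally the functions observed for one candidate on one task, in real time."""

    def __init__(self) -> None:
        self.counts: Counter[str] = Counter()

    def mark(self, function: str) -> None:
        """Record one observation of a checklist function."""
        if not any(function in fs for fs in CHECKLIST.values()):
            raise ValueError(f"Unknown checklist function: {function}")
        self.counts[function] += 1

    def observed(self) -> set[str]:
        """Functions observed at least once (the binary view used in Phase 1)."""
        return {f for f, n in self.counts.items() if n > 0}

# Example: an observer marks functions while watching a recorded speaking task.
oc = ObservationChecklist()
for f in ["expressing opinions", "justifying opinions", "agreeing", "expressing opinions"]:
    oc.mark(f)
print(oc.counts["expressing opinions"])  # 2
print(sorted(oc.observed()))
```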
The FCE was chosen as the focus of this study for a number of reasons:
· It is ‘stable’, in that it is neither under review nor due to be reviewed.
· It represents the middle of the ALTE (and UCLES Main Suite) range, and is the most widely subscribed test in the battery.
· It offers the greatest likelihood of a wide range of performance of any Main Suite examination: as it is often used as an ‘entry point’ into the suite, candidates tend to range from below to above this level in terms of ability.
· As with all of the other Main Suite examinations, a database of recordings (audio and video) already existed.
Phase 1
The first attempt to examine how the draft checklists would be viewed, and applied, by a group of language teachers was conducted by ffrench (1999). Of the participants at the seminar, approximately 50% of the group reported that English (British English, American English or Australian English) was their first language, while the remaining 50% were native Greek speakers.
In their introduction to the application of the Observation Check-
lists (OCs), the participants were given a series of activities that
focused on the nature and use of those functions of language seen by
task designers at UCLES to be particularly applicable to their EFL
Main Suite Speaking Tests (principally FCE, CAE and CPE). Once
familiar with the nature of the functions (and where they might occur
in a test), the participants applied the OCs in ‘real’ time to an FCE
Speaking Test from the 1998 Standardization Video. This video fea-
tured a pair of French speakers who were judged by a panel of
‘expert’ raters (within UCLES) to be slightly above the criterion
(‘pass’) level.
Of the 37 participants, 32 completed the task successfully; that is, they attempted to make frequency counts of the items represented in the OCs. Among this group, there appear to be varying degrees of agreement as to the use of language functions, particularly in terms of the specific number of observations of each function. However, when the data are examined from the perspective of agreement on whether a particular function was observed or not (ignoring the count, which in retrospect was highly ambitious considering the lack of systematic training in the use of the questionnaires given to the teachers who attended), we find that there is a striking degree of agreement.
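To make this binary re-analysis concrete, the sketch below reduces each observer’s raw frequency counts to an observed/not-observed judgement and reports, for each function, the proportion of observers in the majority category. The helper and its aggregation rule are illustrative assumptions, not the analysis actually carried out in Phase 1.

```python
# Hypothetical sketch: reduce each observer's frequency counts to a binary
# observed / not-observed judgement and measure agreement per function.
# The aggregation rule here is an assumption, not the published analysis.

def agreement_by_function(counts_by_observer: list[dict[str, int]]) -> dict[str, float]:
    """For each function, the proportion of observers in the majority
    (observed vs. not observed) category."""
    functions = {f for counts in counts_by_observer for f in counts}
    n = len(counts_by_observer)
    result = {}
    for f in sorted(functions):
        observed = sum(1 for counts in counts_by_observer if counts.get(f, 0) > 0)
        result[f] = max(observed, n - observed) / n
    return result

# Three observers' raw counts for two functions:
observers = [
    {"speculating": 3, "summarizing": 0},
    {"speculating": 5, "summarizing": 1},
    {"speculating": 2, "summarizing": 0},
]
print(agreement_by_function(observers))
# speculating: 1.0 (all observed it); summarizing: ~0.67 (two of three did not)
```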
Phase 2
In this phase, a much smaller gathering was organized, this time involving members of the development team as well as the three UK-based UCLES Senior Team Leaders. In advance of this meeting all participants were asked to study the existing checklists and to illustrate each function with examples drawn from their experience of the various UCLES Main Suite examinations. The resulting data were collated and presented as a single document that formed the basis of discussion during a day-long session. Participants were not made aware of the findings from Phase 1.
During this session many questions were raised about all aspects of the checklist, and a more streamlined version of the three sections was suggested. In addition to a number of participants making a written record of the discussions, the entire session was recorded. This proved to be a valuable reminder of the way in which particular changes came about and was used when the final decisions regarding inclusion, conflation or omission were being made. Although it is beyond the scope of this project to analyse this recording, when coupled with the earlier and revised documents, it is in itself a valuable source of data in that it provides a significant record of the developmental process.
Among the many interesting outcomes of this phase were the decisions either to rethink, to reorganize or to omit items from the initial list. These decisions were seen to mirror the results of the Phase 1 application quite closely. Of the 13 items identified in Phase 1 as being in need of review (7 were rarely observed, indicating a high degree of agreement that they were not, in fact, present, and 6 appeared to be confused with very mixed reported observations), 7
Phase 3
In the third phase, the revised checklists were given to a group of 15 MA TEFL students who were asked to apply them to two FCE tests. Both of these tests involved a mixed-sex pair of learners, one pair of approximately average ability and the other pair above average. Before using the observation checklists (OCs), the students were asked first to attempt to predict which functions they might expect to find. To help in this pre-session task, the students were given details of the FCE format and tasks.
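The comparison of predicted with observed functions can be illustrated schematically as follows; the helper and the example function sets are hypothetical and do not reproduce the Phase 3 data.

```python
# Hypothetical sketch of the Phase 3 comparison: functions the observers
# predicted the task would elicit versus functions actually ticked on the
# checklist during the live observation.

def compare_prediction(predicted: set[str], observed: set[str]) -> dict[str, set[str]]:
    """Partition functions into those both predicted and observed,
    predicted only, and observed only."""
    return {
        "predicted_and_observed": predicted & observed,
        "predicted_only": predicted - observed,
        "observed_only": observed - predicted,
    }

predicted = {"expressing opinions", "comparing", "speculating", "summarizing"}
observed = {"expressing opinions", "comparing", "justifying opinions"}
print(compare_prediction(predicted, observed))
```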
1 Validities
We would not wish to claim that the checklists on their own offer a satisfactory demonstration of the construct validity of a spoken language test, for, as Messick argues (1989: 16): ‘the varieties of evidence supporting validity are not alternatives but rather supplements to one another.’ We recognize the necessity for a broad view of ‘the evidential basis for test interpretation’ (Messick, 1989: 20). Bachman (1990: 237) similarly concludes: ‘it is important to recognise that none of these [evidences of validity] by itself is sufficient to demonstrate the validity of a particular interpretation or use of test scores’ (see also Bachman, 1990: 243). Fulcher (1999: 224) adds a further caveat against an overly narrow interpretation of content validity when he quotes Messick (1989: 41):
the major problem is that so-called content validity is focused upon test forms rather than test scores, upon instruments rather than measurements . . . selecting content is an act of classification, which is in itself a hypothesis that needs to be confirmed empirically.
Acknowledgements
We would like to thank Don Porter and Rita Green for their early
input into the first version of the checklist. In addition, help was
received from members of the ELT division in UCLES, in particular
from Angela ffrench, Lynda Taylor and Christina Rimini, from a
group of UCLES Senior Team Leaders and from MA TEFL students
at the University of Reading. Finally, we would like to thank the
editors and anonymous reviewers of Language Testing for their
insightful comments and helpful suggestions for its improvement. The
faults that remain are, as ever, ours.
VIII References
Foster, P. and Skehan, P. 1996: The influence of planning and task type on second language performance. Studies in Second Language Acquisition 18, 299–323.
—— 1999: The influence of source of planning and focus of planning on task-based performance. Language Teaching Research 3, 215–47.
Fulcher, G. 1994: Some priority areas for oral language testing. Language Testing Update 15, 39–47.
—— 1996: Testing tasks: issues in task design and the group oral. Language Testing 13, 23–51.
—— 1999: Assessment in English for academic purposes: putting content validity in its place. Applied Linguistics 20, 221–36.
Hayashi, M. 1995: Conversational repair: a contrastive study of Japanese and English. MA Project Report, University of Canberra.
Henning, G. 1983: Oral proficiency testing: comparative validities of interview, imitation, and completion methods. Language Learning 33, 315–32.
—— 1987: A guide to language testing. Cambridge, MA: Newbury House.
Kelly, R. 1978: On the construct validation of comprehension tests: an exercise in applied linguistics. Unpublished PhD thesis, University of Queensland.
Kenyon, D. 1995: An investigation of the validity of task demands on performance-based tests of oral proficiency. In Kunnan, A.J., editor, Validation in language assessment: selected papers from the 17th Language Testing Research Colloquium, Long Beach. Mahwah, NJ: Lawrence Erlbaum, 19–40.
Kormos, J. 1999: Simulating conversations in oral-proficiency assessment: a conversation analysis of role plays and non-scripted interviews in language exams. Language Testing 16, 163–88.
Lazaraton, A. 1992: The structural organisation of a language interview: a conversational analytic perspective. System 20, 373–86.
—— 1996: A qualitative approach to monitoring examiner conduct in the Cambridge assessment of spoken English (CASE). In Milanovic, M. and Saville, N., editors, Performance testing, cognition and assessment: selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem. Studies in Language Testing 3. Cambridge: University of Cambridge Local Examinations Syndicate, 18–33.
—— 2000: A qualitative approach to the validation of oral language tests. Studies in Language Testing, Volume 14. Cambridge: Cambridge University Press.
Lumley, T. and O’Sullivan, B. 2000: The effect of speaker and topic variables on task performance in a tape-mediated assessment of speaking. Paper presented at the 2nd Annual Asian Language Assessment Research Forum, The Hong Kong Polytechnic University.
Luoma, S. 1997: Comparability of a tape-mediated and a face-to-face test of speaking: a triangulation study. Unpublished Licentiate Thesis, Centre for Applied Language Studies, Jyväskylä University, Finland.
Appendix 1 Draft observation checklist
Informational functions
Providing personal information:
· Give information on present circumstances?
· Give information on past experiences?
· Give information on future plans?
Providing nonpersonal information: Give information which does not relate to the individual?
Elaborating: Elaborate on an idea?
Expressing opinions: Express opinions?
Justifying opinions: Express reasons for assertions s/he has made?
Comparing: Compare things/people/events?
Complaining: Complain about something?
Speculating: Hypothesize or speculate?
Analysing: Separate out the parts of an issue?
Making excuses: Make excuses?
Explaining: Explain anything?
Narrating: Describe a sequence of events?
Paraphrasing: Paraphrase something?
Summarizing: Summarize what s/he had said?
Suggesting: Suggest a particular idea?
Expressing preferences: Express preferences?
Interactional functions
Challenging: Challenge assertions made by another speaker?
(Dis)agreeing: Indicate (dis)agreement with what another speaker says? (apart from ‘yeah’/‘no’ or simply nodding)
Justifying/Providing support: Offer justification or support for a comment made by another speaker?
Qualifying: Modify arguments or comments?
Asking for opinions: Ask for opinions?
Persuading: Attempt to persuade another person?
Asking for information: Ask for information?
Conversational repair: Repair breakdowns in interaction?
Negotiating meaning:
· Check understanding?
· Attempt to establish common ground or strategy?
· Respond to requests for clarification?
· Ask for clarification?
· Make corrections?
· Indicate purpose?
· Indicate understanding/uncertainty?
Managing interaction
Initiating: Start any interactions?
Changing: Take the opportunity to change the topic?
Reciprocity: Share the responsibility for developing the interaction?
Deciding: Come to a decision?
Terminating: Decide when the discussion should stop?
Appendix 2 Phase 1 results (summarized)
[Table: results reported by the Phase 1 participants for each checklist function (Make excuses, Terminate, Conversational repair, Summarize, Complain, Paraphrase, Persuade, Change topic, Challenge, Qualify, Ask for info, Suggest, Narrate, Reciprocate, Analyse, Elaborate, Initiate, Provide nonpersonal information, Explain, Justify opinions, Negotiate meaning, Decide, (Dis)agree, Justify/Support, Ask for opinions, Express preferences, Speculate, Compare, Express opinion); the data values did not survive extraction.]
Appendix 3 Operational checklist (used in Phase 3)
Informational functions
Providing personal information:
· Give information on present circumstances
· Give information on past experiences
· Give information on future plans
Expressing opinions: Express opinions
Elaborating: Elaborate on, or modify, an opinion
Justifying opinions: Express reasons for assertions s/he had made
Comparing: Compare things/people/events
Speculating: Speculate
Staging: Separate out or interpret the parts of an issue
Describing:
· Describe a sequence of events
· Describe a scene
Summarizing: Summarize what s/he has said
Suggesting: Suggest a particular idea
Expressing preferences: Express preferences
Interactional functions
Agreeing: Agree with an assertion made by another speaker (apart from ‘yeah’ or nonverbal)
Disagreeing: Disagree with what another speaker says (apart from ‘no’ or nonverbal)
Modifying: Modify arguments or comments made by other speaker or by the test-taker in response to another speaker
Asking for opinions: Ask for opinions
Persuading: Attempt to persuade another person
Asking for information: Ask for information
Conversational repair: Repair breakdowns in interaction
Negotiating meaning:
· Check understanding
· Indicate understanding of point made by partner
· Establish common ground/purpose or strategy
· Ask for clarification when an utterance is misheard or misinterpreted
· Correct an utterance made by other speaker which is perceived to be incorrect or inaccurate
· Respond to requests for clarification
Managing interaction
Initiating: Start any interactions
Changing: Take the opportunity to change the topic
Reciprocating: Share the responsibility for developing the interaction
Deciding: Come to a decision
Appendix 4 Summary of Phase 2 observation
Tape 1 (Tasks 1–4) Tape 2 (Tasks 1–4)
Informational functions
Providing personal
information
Present 12 (G) 1 (L) 1 (L) 1 (L) 12 (G) 1 (L) 4 (L)
Past 10 (G) 4 (S) 12 (G)
Future 11 (G) 3 (L) 6 (S) 12 (G)
Expressing opinions 12 (G) 11 (G) 9 (G) 8 (G) 11 (G) 10 (G) 10 (G) 11 (G)
Elaborating 9 (G) 11 (G) 9 (G) 7 (G) 3 (L) 9 (G) 7 (S) 12 (G)
Justifying opinions 10 (G) 7 (G) 9 (G) 7 (G) 4 (L) 8 (S) 6 (S) 8 (S)
Comparing 11 (G) 8 (G) 1 (L) 6 (S) 3 (L) 12 (G) 7 (S) 5 (S)
Speculating 7 (S) 11 (G) 8 (G) 3 (L) 7 (S) 10 (G) 10 (G) 5 (S)
Staging 6 (S) 1 (L) 3 (L) 6 (L)
Describing
Sequence of events 1 (L) 1 (L) 3 (L) 1 (L) 4 (L)
Scene 5 (S) 9 (G) 2 (S) 2 (S) 10 (G) 2 (S) 3 (S)
Summarizing 1 (L) 1 (L) 1 (L) 1 (L) 3 (L) 1 (L) 1 (L) 1 (L)
Suggesting 1 (L) 2 (L) 1 (L) 3 (L) 2 (L)
Expressing preferences 12 (G) 11 (G) 6 (S) 8 (G) 11 (G) 10 (G) 5 (S) 12 (G)
Interactional functions
Agreeing 6 (S) 9 (G) 2 (L) 10 (G) 4 (L)
Disagreeing 9 (G) 4 (S) 2 (L) 6 (S)
Modifying 1 (L) 5 (S) 4 (S) 7 (S) 1 (L)
Asking for opinions 1 (L) 8 (G) 2 (L) 11 (G)
Persuading 2 (L) 2 (L)
Asking for information 2 (L) 1 (L) 5 (S)
Conversational repair 5 (S) 4 (L) 1 (L)
Negotiating meaning
Check meaning 2 (L) 4 (S) 4 (L)
Understanding 5 (S) 3 (L) 3 (L)
Common ground 2 (L) 2 (L) 1 (L)
Ask clarification 2 (L) 1 (L) 2 (L)
Correct utterance 3 (L) 1 (L)
Respond to required clarification 4 (S) 1 (L)
Managing interaction
Initiating 8 (G) 1 (L) 10 (G) 5 (S)
Changing 8 (G) 7 (S)
Reciprocating 7 (G) 9 (G) 1 (L)
Deciding 3 (L) 1 (L) 1 (L) 2 (L)
Notes: The figures indicate the number of students recording the function in each case. L: Little agreement; S: Some agreement; G: Good agreement. For Tasks 3 and 4 in the first tape observed, the maximum was 9; for all others the maximum was 12. This is because 3 of the 12 MA students did not complete the task for these last two tasks. This was not a problem during the observation of the second tape, so there the maximum figure is 12 throughout.
Appendix 5 Transcript results and observation checklist results
Interactional functions
Agreeing T G T L
Disagreeing T S
Modifying T S T L
Asking for opinions T G
Persuading L
Asking for information S
Conversational repair T S T L L
Negotiating meaning
Check meaning L
Understanding L L
Common ground L L
Ask clarification L T L
Correct utterance L
Respond to required clarification L
Managing interaction
Initiating T G T S
Changing T S
Reciprocating T G L
Deciding L L
Notes: T indicates that this function has been identified as occurring in the transcript of the interaction. L, S and G indicate the degree of agreement among the raters using the checklists in real time (L: Little agreement; S: Some agreement; G: Good agreement).