Language Testing

 Melchor A. Tatlonghari, Ph.D.


 University of Santo Tomas
 Manila
I. Course Description
 This course covers the theoretical foundations of language testing and evaluation and the practices which have evolved from them. The various types of language tests, the process involved in their development, the various ways of interpreting test results, and the issues and concerns in language testing and evaluation will be discussed.
II. Objectives
At the end of the course, the graduate students are expected
to have:
– acquired knowledge of the concepts in language
testing and the processes involved in the development
and validation of language tests;
– developed the skills in developing and validating
various types of language tests for various purposes;
and
– developed skills in critiquing existing language tests
and/or language tests developed by themselves and
others.
III. Topics
Language testing and evaluation:
Foundations and trends
 Pre-scientific Trend
 Psychometric-structuralist Trend
 Integrative-sociolinguistic Trend
Test
Testing, assessing, and teaching
Language testing
A. Why test? Uses of language tests
B. Kinds of tests and testing
C. Characteristics of a good test
D. Testing linguistic macro-skills
1. Testing listening
2. Testing speaking
3. Testing reading
4. Testing writing
5. Testing grammar
E. Stages in test construction
F. Alternative assessment
G. Computer-based testing
H. Current concerns and issues in language testing and
future direction
I. Concluding remarks

 References:
 Alderson, J. Charles, et al. 1995. Language test
construction and evaluation. Cambridge: Cambridge
University Press.
 Brown, H. Douglas. 2004. Language assessment: principles and classroom practices. New York: Pearson Education, Inc.
 Hughes, Arthur. 2003. Testing for language teachers.
Cambridge: Cambridge University Press.
 Lynch, Brian K. 2003. Language assessment and
programme evaluation. Edinburgh: Edinburgh
University Press.
 McNamara, Tim. 1996. Measuring second language
performance. Essex: Addison Longman Ltd.
 _____________. 2000. Language testing. Oxford:
Oxford University Press.
 Weir, Cyril J. 2005. Language testing and evaluation. New York: Palgrave Macmillan.
 __________. 1993. Understanding and developing
language tests. Hertfordshire: Prentice Hall
International (UK) Ltd.
 __________. 1988. Communicative language testing.
Hertfordshire: Prentice Hall International (UK) Ltd.
IV. Means of Evaluation
 For all:
• Oral report/s of assigned topic/s
• Final examination
• Papers:
For all:
– A critique of an existing language test
– A sample TOS (table of specifications) with sample test items for each component (to be reported orally)
 For Ph.D. students only:
– A detailed proposal of a test development program
– A review of any language testing book/research
LANGUAGE TESTING AND EVALUATION
Approaches to Language Assessment: A Brief History
A brief history of language assessment over the past half-century will serve as a backdrop to understanding WHERE WE ARE NOW IN LANGUAGE ASSESSMENT.
Historically, language assessment trends and practices have followed the shifting developments of language teaching methodology.

Relationship between Theory and Practice


Language Testing: Foundations and Trends
 Pre-scientific Trend
 Psychometric-structuralist Trend
 Integrative-sociolinguistic Trend
I. Pre-scientific
Characterized by lack of concern for statistical
matters or for such notions as objectivity and
reliability.
Oral examinations were the exception. Examinations
were mostly open-ended written examinations and
would include:
– translation into or from the foreign language
– free composition in it
– selected items of grammatical, textual, or cultural
interest.
Language tests are the business of language teachers. No
expertise is required: if a person knows how to teach, it is
assumed he can judge the proficiency of his students.
II. Psychometric-structuralist
Marks the invasion of the field by experts.

Marked by interaction and conflict of two sets of experts, agreeing with each other mainly in their belief that testing can be made precise, objective, reliable and scientific.
C. The two sets of experts:
The testers: the psychologists responsible for the development of modern theories and techniques of educational measurement.
– Key concerns: To provide "objective" measures using various statistical techniques to assure reliability and validity.
– Developed short items, multiple choice, "objective" tests. Result: tests required written responses and so were limited to reading and listening.
The linguists
Observed that the test items chosen did not
reflect newer ideas about language teaching
and learning.
As Lado in Carroll (1953) said:
A number of conclusions are reached. They are:
(1) that a great lag exists in measurement
in English as a foreign language
(2) that the lag is connected with
unscientific views of language,
(3) that the science of language should be
used in defining what to teach…

This study by Lado gives procedures for the application of linguistics to the development of foreign language tests.
This new set of experts added notions from the science of language to those from the science of educational measurement. John B. Carroll's work (1940, 1978) shows his concern with psychologically and linguistically valid measures of verbal abilities, whether in native or learned languages.

– Lado's work emphasized that linguists, with their understanding of the nature of language, must be the ones to set the specifications for language tests.
 Brown (2004) says that in the 1950s, an era of behaviorism and special attention to contrastive analysis, testing focused on specific language elements such as the phonological, grammatical, and lexical contrasts between two languages.
There was at the time still an easy congruence between the American structuralists' view of language and the psychological theories and practical needs of testers.
The marriage of the two fields provided the basis
for the flourishing of the standardized test, with its
special emphasis on what Carroll (1961) labeled the
“discrete structure point” items.
Examples:
 The Graduate Record Examinations Advanced Test
 The MLA Foreign Language Tests for Teachers and
Advanced Students
 The TOEFL
 The College Entrance Exam Board Achievement Test
The Psychometric-structuralist trend has not completely
overcome the objections of the traditionalists (Pre-
scientific) who continue to feel that less specific
measures are still of great value.
They have therefore been instrumental in the
development of more reliable methods of judging the
more subjective kinds of performance like the judgment
of written proficiency and oral proficiency.
Discrete-point tests are constructed on the assumption
that language can be broken down into component parts
and that those parts can be tested successfully.

These components are the skills of listening, speaking, reading, and writing, and various units of language (discrete points) of phonology, graphology, morphology, lexicon, syntax, and discourse.
It was claimed that an overall language proficiency test, then, should sample all four skills and as many linguistic discrete points as possible.
Such an approach demanded a decontextualization that often confused the test-taker.

So, as the profession emerged into an era of emphasizing communication, authenticity, and context, new approaches were sought.

Discrete-Point and Integrative Testing
This historical perspective underscores two major approaches to language testing that were debated in the 1970s and early 1980s. These approaches still prevail today, even if in mutated form: the choice between discrete-point and integrative testing methods.

III. Integrative-sociolinguistic
There have been, however, increasingly strong attacks on the principles of the Psychometric-structuralist trend, associated with two trends in contemporary linguistics:
– The language competence trend - - connected to various views of psycholinguistics. It is based on a belief in such a thing as language proficiency, and a feeling that knowledge of a language is more than just the sum of a set of discrete parts.
– The communicative competence trend - - connected with views of modern sociolinguistics: it accepts the belief in integrative testing, but insists on the need to add a strong functional dimension to language testing.

Communicative competence is a term in linguistics which refers to a language user's grammatical knowledge of syntax, morphology, phonology and the like, as well as social knowledge about how and when to use utterances appropriately.
In the 1970s, research on communicative competence distinguished between linguistic and communicative competence (Hymes, 1967; Paulston, 1974) to highlight the difference between knowledge "about" language forms and knowledge that enables a person to communicate functionally and interactively.

The term communicative competence was coined by Dell Hymes (1967, 1972), a sociolinguist who was convinced that Chomsky's (1965) notion of competence was too limited. So Hymes referred to communicative competence as that aspect of our competence that enables us to convey and interpret messages and to negotiate meanings interpersonally within specific contexts.

Chomsky (1965) defines competence as the ideal user's knowledge of the rules of his language, and performance as the actual realisation of this knowledge in linguistic communication.
According to Chomsky, a speaker has internalised a set of rules about his language which enables him to produce and understand an infinitely large number of sentences and to recognise sentences that are ungrammatical and ambiguous.
However, not all linguists agree with Chomsky, and one of them is Dell Hymes (1966). He argues that language consists not only of Chomsky's (1957, 1965) grammatical competence but also of sociolinguistic or pragmatic competence.

Sociolinguistic or pragmatic competence covers all situated aspects of language use and related issues of appropriacy: the speaker (and, if different, the original author), the addressee, the message, the setting or event, the activity, the register, and so forth.
A more recent survey of communicative competence by Bachman (1990) divides it into the broad headings of "organizational competence," which includes both grammatical and discourse (or textual) competence, and "pragmatic competence," which includes both sociolinguistic and "illocutionary" competence.

Strategic competence is associated with the interlocutors' ability to use communication strategies (Faerch & Kasper, 1983; Lin, 2009).

Hymes' term "communicative competence" was taken up by those language teaching methodologists who contributed to the development of Communicative Language Teaching (Wilkins, 1976; Widdowson, 1978).
However, a pedagogical framework based explicitly on the notion of communicative competence was first proposed by Canale and Swain (1980) and Canale (1983).

Canale and Swain (1980) defined communicative competence in terms of three components:
– 1. Grammatical competence: words and rules
– 2. Sociolinguistic competence: appropriateness
– 3. Strategic competence: appropriate use of communication strategies
Later, Canale (1983) refined it into four components or subcategories of the construct of communicative competence. The first two subcategories reflect the use of the linguistic system itself; the last two define the functional aspects of communication.

Grammatical competence – that aspect of communicative competence that encompasses knowledge of lexical items and of rules of morphology, syntax, sentence-grammar semantics, and phonology. It is the competence that we associate with mastering the linguistic code of a language.
Discourse competence – the ability that we have to connect sentences in stretches of discourse and to form a meaningful whole out of a series of utterances. It focuses on intersentential relationships.
Sociolinguistic competence – the knowledge of the sociocultural rules of language and of discourse. This type of competence requires an understanding of the social context in which language is used: the roles of the participants, the information they share, and the function of the interaction. Only in a full context of this kind can judgments be made on the appropriateness of a particular utterance.

James Cummins (1979, 1980) proposed a distinction between:
– (1) CALP (cognitive academic language proficiency); and
– (2) BICS (basic interpersonal communication skills)

CALP is that dimension of proficiency in which the learner manipulates or reflects upon the surface features of language outside of the immediate interpersonal context.
– BICS, on the other hand, is the communicative capacity that all children acquire in order to be able to function in daily interpersonal exchange.
– Cummins later (1981) modified his notion of CALP and BICS in the form of context-reduced and context-embedded communication. A good share of classroom, school-oriented language is context-reduced, while face-to-face communication with people is context-embedded.
Carroll (1961) argued that the Psychometric-structuralists
fail to meet a number of basic criteria for the measurement
of language knowledge.

He stressed the need for what he called an "integrative approach" where one pays attention not to specific structures or lexical items, but to the "total communicative effect of an utterance."
– Advantages:
– An integrative approach is broader in its sampling and is less likely to be tied to a particular course of training.
– The difficulty of the task involved is more easily related to a subjective standard.

– It focuses on the general question of how well a learner is functioning in the target language, regardless of his own language background.
Carroll is thus the first to argue for the "integrative-sociolinguistic" trend: he refers in his 1961 paper not just to integrative testing but to "communicative effect" and "normal communicative situation".

This approach insists upon the specification of the communicative contexts in which the behavior to be assessed occurs, and can be justified on two grounds:
– (1) Language behavior and behavior toward language vary as a function of communicative context. Thus, global, uncontextualized measures of language proficiency, language usage, and language attitude may mask important systematic differences.
– (2) Language assessment procedures have been successfully contextualized so as to gather data reflecting systematic sociolinguistic variation.

Oller (1979) argued that language competence is a unified set of interacting abilities that cannot be tested separately. His claim was that communicative competence is so global and requires such integration (hence the term "integrative" testing) that it cannot be captured in additive tests of grammar, reading, vocabulary, and other discrete points of language. This view was supported by others like Cziko (1982) and Savignon (1982).

 What does an integrative test look like?
 Two types of tests have historically been claimed to be examples of integrative tests:
– Cloze tests (a brief construction sketch follows)
– Dictations
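The sketch below is not from the handout; it only illustrates how a simple fixed-ratio cloze passage can be produced by blanking every nth word of a text and keeping the deleted words as the answer key. The passage, the one-in-seven deletion ratio, and the function name are assumptions for illustration.

def make_cloze(passage: str, nth: int = 7):
    """Return the cloze text and the deleted words (the answer key)."""
    words = passage.split()
    answers = []
    for i in range(nth - 1, len(words), nth):   # every nth word
        answers.append(words[i])
        words[i] = "______"                     # blank replaces the deleted word
    return " ".join(words), answers

if __name__ == "__main__":
    sample = ("Language testing remains a complex and perplexing activity, "
              "and information about people's language ability is often very "
              "useful and sometimes necessary to make rational decisions.")
    cloze_text, key = make_cloze(sample)
    print(cloze_text)
    print("Answer key:", key)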
Proponents of integrative test methods centered their
arguments on what became known as the unitary trait
hypothesis.

The unitary trait hypothesis suggested an "indivisible" view of language proficiency: that vocabulary, grammar, phonology, the "four skills," and other discrete points of language could not be disentangled from each other in language performance.
It claimed that there is a general factor of language proficiency such that all the discrete points do NOT add up to that whole.

Others argued against the unitary trait position, such as Farhady (1982), who found significant and widely varying differences in performance on an ESL proficiency test in a study of students in Brazil and the Philippines. Farhady's contentions were supported by other research that seriously questioned the unitary trait hypothesis.
Finally, in the face of the evidence, Oller (1983) retreated from his earlier stand and admitted that the "unitary trait hypothesis" was wrong.

In the 1970s and 1980s, communicative theories brought with them a more integrative view of testing in which specialists claimed that "the whole of the communicative event was considerably greater than the sum of its linguistic elements" (Clark 1983).
So the quest for authenticity was launched as test designers centered on communicative performance. Following Canale and Swain's (1980) model of communicative competence, Bachman (1990) proposed a model of language competence consisting of organizational and pragmatic competence, as well as strategic competence.

Communicative testing presented challenges to test designers. Test constructors began to identify the kinds of real-world tasks that language learners were called upon to perform.
As Weir (1990) reminded us, "to measure language proficiency ... account must now be taken of: where, who, how, with whom, and why language is to be used, and on what topics, and with what effect."
The assessment field became more and more concerned with the authenticity of tasks and the genuineness of texts.

Performance-Based Assessment

In language courses and programs around the world, test designers are now tackling this new and more student-centered agenda (Alderson 2001, 2002). Instead of offering paper-and-pencil selective-response tests of a plethora of separate items, performance-based assessment of language typically involves oral production, written production, open-ended responses, integrated performance across skill areas, and other interactive tasks.

The design of communicative, performance-based assessment rubrics continues to challenge both assessment experts and classroom teachers. Such efforts to improve various facets of classroom testing are accompanied by stimulating issues, all of which are helping to shape our current understanding of effective assessment.
Among these issues are:
– The effect of new theories of intelligence
– The advent of what has come to be called "alternative assessment"
– The increasing popularity of computer-based testing
 Relationship between Theory and Practice
I. Our New Knowledge of the Human Brain: Classroom
Implications
Jensen (1996) emphasizes the three key points we
must remember if we are to create classrooms in
which the human brain can function at its highest
capability.
Jensen introduces these points by stating, “The
search for meaning is innate. All learners are trying to
make sense out of what is happening at all times.
Learners need time to ‘go internal’ and create
individual meaning for everything they learn.”

A summary of his three key points regarding those


conditions needed by the human brain in order to
construct meaning includes:
Relevance
This is the activation of existing connections
in the brain. It relates to something the learner
already currently knows. The more relevance
this has to the learner, the greater the meaning.
Relevance also includes the learner’s perceived
need or future use of the information.
Emotion
When the learner’s emotions are engaged,
the brain “codes” the content by triggering the
release of chemicals that single out and “mark”
the experience as important and meaningful. We
now know emotions activate many areas of the
whole body.
Patterns
The brain needs to place new information in
the context of an overall pattern. This context
may be social, intellectual, physical, economic,
geographic, political, or any other pieces of the
puzzle that create a complete picture.

New Views on Intelligence
Intelligence was once viewed strictly as the ability to perform (a) linguistic and (b) logical-mathematical problem solving. This "IQ" (intelligence quotient) concept of intelligence has permeated the Western world and its way of testing for almost a century.
For many years, we have lived in a world of standardized, norm-referenced tests that are timed, in a multiple-choice format, consisting of a multiplicity of logic-constrained items, many of which are inauthentic.
However, research on intelligence by psychologists like Howard Gardner, Robert Sternberg, and Daniel Goleman has begun to turn the psychometric world upside down. Gardner (1983, 1999), for example, extended the traditional view of intelligence to seven components:

Linguistic intelligence
Logical-mathematical intelligence
Spatial intelligence (the ability to find your way around an environment, to form mental images of reality)
Musical intelligence (the ability to perceive and create pitch and rhythmic patterns)
Bodily-kinesthetic intelligence (fine motor movement, athletic prowess)
Interpersonal intelligence (the ability to understand others and how they feel, and to interact effectively with them)
Intrapersonal intelligence (the ability to understand oneself and to develop a sense of self-identity)
Gardner maintained that by looking only at the first two mental abilities, we see only a portion of the total capacity of the human mind.

Moreover, he showed that our traditional definitions of intelligence are culture-bound. The "sixth sense" of a hunter in New Guinea or the navigational abilities of a sailor in Micronesia are not accounted for in the Western definitions of IQ.

Robert Sternberg (1988, 1997) also charted new territory in intelligence research in recognizing creative thinking and manipulative strategies as part of intelligence. Forms of smartness, for instance, are found in those who know how to manipulate their environment, namely, other people.
 Debaters, politicians, successful salespersons, smooth talkers, and con artists are all smart in their manipulative ability to persuade others to think their way, vote for them, make a purchase, or do something they might not otherwise do.

 More recently, Daniel Goleman's (1995) concept of "EQ" (emotional quotient) has spurred us to underscore the importance of the emotions in our cognitive processing.
 Those who manage their emotions – especially emotions that can be detrimental – tend to be more capable of fully intelligent processing. Anger, grief, resentment, self-doubt, and other feelings can easily impair peak performance in everyday tasks as well as higher-order problem solving.
 Although these new conceptualizations of intelligence have not been universally accepted by the academic community, they help to remind us NOT to rely exclusively on timed, discrete-point, analytical tests in measuring language.
 We are prodded to cautiously combat the potential tyranny of "objectivity" and its accompanying impersonal approach.
 Our challenge is to test interpersonal, creative, communicative, interactive skills, and in doing so to place trust in our subjectivity and intuition.
 Williams (1996) also made a comparison of the Traditional Perspectives, Brain Research, and Constructivist Perspectives.
Upshur (1972) suggests, "Trends in second language testing tend to follow trends in second language teaching and, in the United States, at least in recent times, trends in second language testing tended to follow trends in linguistics."
 The field of second language acquisition and teaching has enjoyed a half century of academic prosperity. Among the myriad of topics and issues related to it, assessment remains an area of intense fascination (Brown 2004).
 Questions like:
– What is the best way to assess learners’ ability?
– In an era of communicative language teaching, do
our classrooms measure up to standards of
authenticity and meaningfulness?

These and many more questions being addressed by


teachers, researchers, and specialists can be
overwhelming to the novice language teacher, who is
already baffled by linguistic and psychological
paradigms and by a multitude of methodological
options.

TEST
 What is a test?
 A test is a method of measuring a person’s ability,
knowledge, or performance in a given domain.

 A well-constructed test is an instrument that provides an


accurate measure of the test-taker’s ability within a
particular domain.
 The need for tests
 Why test?

 Information about people’s language ability is often very


useful and sometimes necessary to make rational
decisions.
 Testing, Assessing, and Teaching

What do the following terms mean to you?
testing
assessing
teaching
evaluation

What is the relationship among them?
A test is narrow in focus, designed to measure a set of skills or behaviors at one point in time.
 Assessment is broader in scope and involves gathering information over a period of time. This information might include formal tests, classroom observation, student self-assessment, or other data sources.

 Tests are prepared administrative procedures that occur at identifiable times in a curriculum, when learners muster all their faculties to offer peak performance, knowing that their responses are being measured and evaluated.
 Assessment, on the other hand, is an ongoing process that encompasses a much wider domain. Tests are a subset of assessment; they are certainly not the only form of assessment that a teacher can make.
 Evaluation applies assessment data that have been scored and analysed to make judgements, or draw inferences, about students and educational programmes.
Assessment and Teaching
 Assessment is a popular and sometimes misunderstood
term in current educational practice. Testing and
assessing are not synonymous terms.

 Teaching sets up language learning: the opportunities for


learners to listen, speak, read, and/or write in the target
language.

Where do you belong?


 Washback
 The effect of testing on teaching and learning is known as
washback/backwash, and can be harmful or beneficial.
 If a test is regarded as important, if the stakes are high,
preparation for it can come to dominate all teaching and
learning activities.

 And if the test content and testing techniques are at variance with the objectives of the course, there is likely to be harmful washback/backwash.
 An example of washback is "teaching to the test".
 Challenge to teachers:
 To create classroom tests that serve as learning devices through which positive washback is achieved.
 Give examples of how positive washback can be enhanced.

LANGUAGE TESTING
TYPES OF LANGUAGE TESTS
 Kinds of tests and testing
 Proficiency tests:

 Proficiency tests are designed to measure people's ability in a language, regardless of any training they may have had in that language.
 The content of a proficiency test, therefore, is not based on the content or objectives of language courses that people taking the test may have followed.
 Rather, it is based on a specification of what candidates have to be able to do in the language in order to be considered proficient.

 Achievement tests:
 In contrast to proficiency tests, achievement tests are directly related to language courses, their purpose being to establish how successful individual students, groups of students, or the courses themselves have been in achieving objectives.
 An achievement test is related directly to classroom lessons, units, or even a total curriculum. Achievement tests should be limited to particular material addressed in a curriculum within a particular time frame and should be offered after a course has focused on the objectives in question.
There is a fine line of difference between a diagnostic test and an achievement test.
– Achievement tests analyze the extent to which students have acquired language features that have already been taught.
– Diagnostic tests should elicit information on what students need to work on in the future.

 They are of two kinds:
– final achievement tests (summative assessment) or
– progress achievement tests (formative assessment).
 Diagnostic tests:
 Diagnostic tests are used to identify learners' strengths and weaknesses. They are intended primarily to ascertain what learning still needs to take place.
A diagnostic test can help a student become aware of errors and encourage the adoption of appropriate compensatory strategies.
A test of pronunciation, for example, might diagnose the phonological features of English that are difficult for learners and should therefore become part of a curriculum. Usually such tests offer a checklist of features for the administrator to use in pinpointing difficulties.
 Placement tests:
 Placement tests are intended to provide information that will help to place students at the stage of the teaching programme most appropriate to their abilities. Typically they are used to assign students to classes at different levels.
 The ultimate objective of a placement test is to correctly place a student into a course or level. Certain proficiency tests can act in the role of placement tests.
 A placement test usually includes a sampling of the material to be covered in the various courses in a curriculum.
In a placement test, a student should find the test material neither too easy nor too difficult but appropriately challenging.

 Direct versus indirect testing:
 Testing is said to be direct when it requires the candidate to perform precisely the skill we wish to measure. If we want to know how well candidates can write compositions, we get them to write compositions. If we want to know how well they pronounce a language, we get them to speak. The tasks and the texts that are used should be as authentic as possible.
 Indirect testing attempts to measure the abilities that underlie the skill in which we are interested.
 An example: Lado's (1961) proposed method of testing pronunciation ability by a paper-and-pencil test in which the candidate has to identify pairs of words which rhyme with each other.

 Discrete point versus integrative testing:
– Discrete point testing refers to the testing of one element at a time, item by item.
– Integrative testing, by contrast, requires the candidate to combine many language elements in the completion of the task. This might involve writing a composition, making notes while listening to a lecture, etc.

 Norm-referenced versus criterion-referenced testing
 In norm-referenced tests, each test-taker's score is interpreted in relation to a mean (average score) and/or a percentile rank.
 Criterion-referenced tests, on the other hand, are designed to give test-takers feedback, usually in the form of grades, on specific course or lesson objectives.
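A small hypothetical sketch of the two interpretations just described: the same raw score is reported as a percentile rank within the group (norm-referenced) and as a mastery decision against a fixed cut-off (criterion-referenced). The scores and the 80% cut-off below are invented for illustration only.

def percentile_rank(score, group_scores):
    """Percent of the group scoring at or below this score (norm-referenced view)."""
    return 100.0 * sum(1 for s in group_scores if s <= score) / len(group_scores)

def meets_criterion(score, max_score, cut_off=0.80):
    """Mastery decision against a fixed standard (criterion-referenced view)."""
    return score / max_score >= cut_off

group = [34, 41, 45, 47, 50, 52, 55, 58, 60, 66]    # raw scores out of 70 (invented)
learner_score = 52

print(f"Percentile rank: {percentile_rank(learner_score, group):.0f}")   # 60
print("Meets the 80% criterion:", meets_criterion(learner_score, 70))    # False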

 Objective testing versus subjective testing
 The distinction here is between methods of scoring. If no judgement is required on the part of the scorer, then the scoring is objective. If judgement is called for, the scoring is said to be subjective.
 Characteristics of a good test
 How do you know if a test is effective? The following are
the five cardinal criteria for “testing a test.”
– Practicality
– Reliability
– Validity
– Authenticity
– Washback

 Practicality:

 An effective test is practical. This means that it:


 Is not excessively expensive
 Stays within the appropriate time constraint
 Is relatively easy to administer, and
 Has a scoring/evaluation procedure that is
specific and time efficient
 Reliability

 A reliable test is consistent and dependable. If you give the same test to the same student or matched students on two different occasions, the test should yield similar results.
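As a minimal sketch of this consistency idea (the scores are invented), the agreement between two administrations of the same test can be estimated with a Pearson correlation; a coefficient near 1.0 suggests the test rank-orders the students in a stable way.

from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation between two score lists of equal length."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

first_sitting  = [55, 62, 70, 48, 80, 66]   # hypothetical scores, occasion 1
second_sitting = [58, 60, 73, 50, 78, 69]   # the same students, occasion 2

print(f"Test-retest reliability estimate: {pearson(first_sitting, second_sitting):.2f}")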

 Factors that may contribute to the unreliability of a test:
– Student-related reliability. The most common learner-related issue in reliability is caused by temporary illness, fatigue, a "bad day", anxiety, and other physical or psychological factors, which may make an "observed" score deviate from one's "true" score. Also related are such factors as a test-taker's "test-wiseness" or strategies for efficient test taking.
 Rater reliability.
– Human error, subjectivity, and bias may enter into the scoring process. Inter-rater unreliability occurs when two or more scorers yield inconsistent scores on the same test, possibly because of lack of attention to scoring criteria, inexperience, inattention, or even preconceived biases.

 Test administration reliability. Unreliability may also result from the conditions in which the test is administered.
 Test reliability. Sometimes the nature of the test itself can cause measurement errors. If the test is too long, test-takers may become fatigued by the time they reach the later items and hastily respond incorrectly. Timed tests may also discriminate against students who do not perform well on a test with a time limit.

 Validity is by far the most complex criterion of an effective test. Validity is "the extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment."
 A valid test of reading ability actually measures reading ability.
 Authenticity
 Bachman and Palmer (1996) define authenticity as "the degree of correspondence of the characteristics of a given language test task to the features of a target language task."
 In a test, authenticity may be present in the following ways:
– The language of the test is as natural as possible.
– Items are contextualised rather than isolated.
– Topics are meaningful (relevant, interesting) for the learner.
– Some thematic organization to items is provided, such as through a story line or episode.
– Tasks represent, or closely approximate, real-world tasks.

TESTING LINGUISTIC MACRO-SKILLS

LINGUISTIC MACRO-SKILLS

LISTENING
SPEAKING
READING
WRITING
VIEWING (?)

TESTING LISTENING

TESTING SPEAKING
TESTING READING

TESTING WRITING

TESTING GRAMMAR

STAGES IN TEST CONSTRUCTION


 1. Determining Purpose of Test
 2. Developing Test Specifications
 A test's specifications provide the official statement about what the test tests and how it tests it. The specifications are the blueprint to be followed by test and item writers, and they are also essential in the establishment of the test's construct validity.
 A test specification is a detailed document, and is often for internal purposes only. It is sometimes confidential to the examining body.

 Who needs test specifications?
Test specifications are needed by a range of different people. First and foremost, they are needed by those who produce the test itself.
Test constructors need to have clear statements about who the test is aimed at, what its purpose is, what content is to be covered, what methods are to be used, how many papers or sections there are, how long the test takes, and so on.
In addition, the specifications will need to be available to those responsible for editing and moderating the work of individual item writers or teams.
Such editors may operate in a committee or they may be individual chief examiners or board officials. In smaller institutions, they may simply be fellow teachers who have a responsibility for vetting a test before it is used.

 The specifications should be consulted when items and tests are reviewed, and therefore need to be clearly written so that they can be referred to easily during debate.
For test developers, the specifications document will need to be as detailed as possible, and may even be of a confidential nature, especially if the test is a 'high-stakes' test.
Test specifications are also needed by those responsible for or interested in establishing the test's validity (that is, whether the test tests what it is supposed to test).
Test specifications are a valuable source of information for publishers wishing to produce textbooks related to the test: textbook writers will wish to ensure that the practice tests they produce, for example, are of an appropriate level of difficulty, with appropriate content, topics, tasks and so on.

 Specifications for test writers
1. What is the purpose of the test?
2. What sort of learner will be taking the test — age, sex, level of proficiency/stage of learning, first language, cultural background, country of origin, level and nature of education, reason for taking the test, likely personal and, if applicable, professional interests, likely levels of background (world) knowledge?
3. How many sections/papers should the test have, how long should they be, and how will they be differentiated — one three-hour exam, five separate two-hour papers, three 45-minute sections, reading tested separately from grammar, listening and writing integrated into one paper, and so on?

4. What target language is envisaged for the test, and is this to be simulated in some way in the test content and method?
5. What text types should be chosen — written and/or spoken? What should be the sources of these, the supposed audience, the topics, the degree of authenticity? How difficult or long should they be? What functions should be embodied in the texts — persuasion, definition, summarising, etc.? How complex should the language be?
6. What language skills should be tested? Are enabling/micro skills specified, and should items be designed to test these individually or in some integrated fashion? Are distinctions made between items testing main idea, specific detail, and inference?

7. What language elements should be tested? Is there a list of grammatical structures/features to be included? Is the lexis specified in some way — frequency lists, etc.? Are notions and functions, speech acts, or pragmatic features specified?
8. What sort of tasks are required — discrete point, integrative, simulated 'authentic', objectively assessable?
9. How many items are required for each section? What is the relative weight for each item — equal weighting, extra weighting for more difficult items?
10. What test methods are to be used — multiple choice, gap filling, matching, transformation, short answer question, picture description, role play with cue cards, essay, structured writing?
11. What rubrics are to be used as instructions for candidates? Will examples be required to help candidates know what is expected? Should the criteria by which candidates will be assessed be included in the rubric?
12. Which criteria will be used for assessment by markers? How important are accuracy, appropriacy, spelling, length of utterance/script, etc.?
 Checklist
 Since specifications vary according to their uses, not all
the points in the following checklist will need to be
covered in all specifications.

 Specification writers must first decide who their


audience is and provide the appropriate information.

 Test specifications should include all or most of the


following:
– The test purpose
– Description of the test taker
– Test level
– Construct (theoretical framework for the test)
– Description of suitable language course or textbook

– Number of sections/papers
– Weighting for each section/paper
– Target language situation
– Text-types
– Text length
– Language skills to be tested
– Language elements to be tested
– Test tasks
– Test methods
– Rubrics
– Criteria for marking
– Description of typical performance at each level
– Description of what candidates at each level can do
in the world
3. Designing Test
4. Piloting/Trialling Test
 The test can be piloted to a group similar to that for
which the test is intended. Problems in administration
and scoring are noted.
5. Analysis of Results of the Trial
 There are two kinds of analysis that should be carried
out:
• Statistical
• Qualitative

• Statistical
• Measures of central tendency: the mean, the mode and the median
• Measures of dispersion: the standard deviation and the range
• The standard error of measurement
• Item analysis
The purpose of item analysis is to examine the contribution that each item is making to the test. Items that are identified as faulty or inefficient can be modified or rejected.
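A minimal sketch of these statistical analyses on an invented set of pilot scores; the reliability figure used for the standard error of measurement is also assumed, and the conventional formula SEM = SD x sqrt(1 - reliability) is applied.

import math
from statistics import mean, median, mode, stdev

trial_scores = [34, 41, 45, 45, 47, 50, 52, 55, 58, 60]   # hypothetical pilot data
reliability = 0.85                                         # assumed reliability estimate

sd = stdev(trial_scores)
print("Mean:", mean(trial_scores))
print("Median:", median(trial_scores))
print("Mode:", mode(trial_scores))
print("Standard deviation:", round(sd, 2))
print("Range:", max(trial_scores) - min(trial_scores))
print("SEM:", round(sd * math.sqrt(1 - reliability), 2))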

 Classical item analysis
This usually involves the calculation of facility values and discrimination indices, as well as an analysis of distractors in the case of multiple-choice items.
 The facility value
The facility value of an item on which only scores of zero or one can be awarded is simply the proportion of test takers who score one on it.
 Discrimination indices
A discrimination index is an indicator of how well an item discriminates between weak and strong candidates.
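A minimal sketch of the two indices just described, computed on invented 0/1 item responses: the facility value is the proportion scoring one on the item, and the discrimination index is taken here as the upper-group facility minus the lower-group facility, using the top and bottom thirds of candidates ranked by total score (one common way of computing it; other groupings are possible).

def facility_value(item_scores):
    """Proportion of test takers scoring one on a zero/one item."""
    return sum(item_scores) / len(item_scores)

def discrimination_index(item_scores, total_scores, fraction=1/3):
    """Upper-group facility minus lower-group facility (groups by total score)."""
    ranked = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    n = max(1, round(len(ranked) * fraction))
    lower, upper = ranked[:n], ranked[-n:]
    fv = lambda group: sum(item_scores[i] for i in group) / len(group)
    return fv(upper) - fv(lower)

item_responses = [1, 0, 1, 1, 0, 1, 1, 0, 1]              # one item, nine candidates (invented)
total_scores   = [55, 30, 60, 48, 35, 70, 66, 28, 52]     # their total test scores (invented)

print("Facility value:", round(facility_value(item_responses), 2))                           # 0.67
print("Discrimination index:", round(discrimination_index(item_responses, total_scores), 2))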

6. Finetuning or Revising Test
7. Administering Test
8. Marking Test

ALTERNATIVE ASSESSMENT
Traditional and “Alternative” Assessment

There is a trend to supplement traditional test designs with alternatives that are more authentic in their elicitation of meaningful communication. The following table highlights the differences between the two approaches (adapted from Armstrong, 1994 and Bailey, 1998).
While standard tests of all types provide the literacy teacher with some insights into a student's literacy strengths and weaknesses, the sole use of standardized devices to evaluate the performance of students has many inherent dangers.

Miller (1995) stresses the importance of using informal


assessment of literacy along with standard evaluation.
 She says that informal assessment must always be
considered an essential part of instruction and therefore
should occur continuously.
 Informal assessment can take many different forms. It
can be in the form of surveys, checklists, miscue analysis,
various types of informal inventories, conferences and
interviews of various types, retellings, dialogue and
response journals, creative book reports,
autobiographies of various kinds, holistic scoring of
writing, and portfolio assessment.

Advantages of Using Informal Assessment Devices
 They are more authentic in evaluating many literacy programs.
 They are often more relevant to the information that is being taught in the classroom or special reading program.
 They emphasize the process aspects of literacy rather than the product aspects, as is traditionally done by standardized tests of various types.
 They are able to assess the affective (emotional and attitudinal) aspects effectively.
 They usually reflect the accomplishments and attitudes of "at-risk" learners more accurately than do standardized tests.
 They usually reflect different styles of teaching and learning better than do standardized tests.
 They do not have the prescribed directions and time limits that typically are found on standardized tests.
As Gattegno (1972), the proponent of The Silent
Way says … “Teaching should be subordinated to
learning.” In other words he believes that to teach
means to serve the learning process rather than to
dominate it. To teach contrary to how the brain
works is counter-productive. Also we test
according to how we teach and we teach according
to how we learn.
Computer-Based Testing
 Recent years have seen a burgeoning of assessment in which the test-taker performs responses on a computer. Some computer-based tests (also known as "computer-assisted" or "web-based" tests) are small-scale "homegrown" tests available on websites. Others are standardized, large-scale tests.
 Advantages
– Classroom-based testing
– Self-directed testing on various aspects of a language
– Some individualization
– Can be administered and scored easily for rapid
reporting of scores

 Disadvantages
– Lack of security and the possibility of cheating if
unsupervised
– Some "home-grown" quizzes may be mistaken for validated assessments
– Open-ended responses are less likely to appear
because of the need for human scorers
– The human interactive element (especially in oral
interaction) is absent.

CURRENT CONCERNS AND ISSUES IN LANGUAGE TESTING AND FUTURE DIRECTION
1. Theories have advanced but practices in testing still lag behind.
The shift in perception from views which emphasized reading as a collection of specific skills to views of reading as a total process, with skills interrelated and individual strategies effectively directing learning, greatly affects research and practice.

Both research and practice are increasingly emphasizing cognitive processing, meaning-making, activation and use of prior knowledge, levels of questioning in comprehension, active response to what is read, and similar dimensions of cognition.
 Yet assessment of reading performance remains mired at the skills level. Increasingly, we seem to be trying to assess process-oriented learning with product-oriented measures (Squire, 1987).
Valencia & Pearson (1987) share some
issues/concerns in reading assessment in
relation to the current reading theories:
 A major contribution of recent research has been to articulate a strategic view of the process of reading (e.g., Collins, Brown, and Larkin, 1980; Pearson and Spiro, 1980). This view emphasizes the active role of readers as they use print clues to "construct" a model of the text's meaning.
 It deemphasizes the notion that progress toward
expert reading is the aggregation of component
skills. Instead, it suggests that at all levels of
sophistication, from kindergarten to research
scientist, readers use available resources (e.g.,
text, prior knowledge, environmental clues, and
potential helpers) to make sense of the text.

2. Reading assessment has not kept pace with advances in reading theory, research, or practice (Valencia and Pearson, 1987).
 The time has come to change the way we assess reading. The advances of the last 15-20 years in our knowledge of basic reading processes have begun to impact instructional research (Pearson, 1985) and are beginning to find a home in instructional materials and classroom practice (Pearson, 1986).
 Yet the tests used to monitor the abilities of individual students and to make policy decisions have remained remarkably impervious to advances in reading research (Farr and Carey, 1986; Johnston, in press; Pearson and Dunning, 1985).

4. What has happened, of course, is that with reading conceptualized as the mastery of small, separate enabling skills, there has been a great temptation to operationalize "skilled reading" as an aggregation – not even an integration – of all these skills; "instruction" becomes operationalized as opportunities for students to practice these discrete skills on worksheets, workbook pages, and Ditto sheets.
 As long as reading research and instructional innovations are based upon one view of the reading process while reading assessment instruments are based upon a contradictory point of view, we will nurture tension and confusion among those charged with the dual responsibility of instructional improvement and monitoring student achievement.

6. Current practices in reading assessment which run contrary to current reading theories pose some hidden dangers:
6.1 One danger lies in a false sense of security if we equate skilled reading with high scores on our current reading tests. A close inspection of the tasks involved in these tests would cast doubt upon any such conclusion.
6.2 A second danger stems from the potential insensitivity of current tests to changes in instruction motivated by a strategic view of reading.
6.3 A third danger is that given the strong
influence of assessment on curriculum, we
are likely to see little change in instruction
without an overhaul in tests. Conscientious
teachers want their students to succeed on
reading tests; not surprisingly, they look to
tests as guides for instruction. In the best
tradition of schooling, they teach to the test,
directly or indirectly. Tests that model an
inappropriate concept of skilled reading will
foster inappropriate instruction.

6.4 A fourth danger stems from the aura of objectivity associated with published tests and the corollary taint of subjectivity associated with informal assessment. For whatever reasons, teachers are taught that the data from either standardized or basal tests are somehow more trustworthy than the data that they collect each day as a part of teaching. The price we pay for such a lesson is high; it reduces the likelihood that teachers will use their own data for their own decision making.

 We live in a time of contradictions. The speed and impressiveness of technological advance suggest an era of great certainty and confidence.
 Yet at the same time, current social theories undermine our certainties and have engendered a profound questioning of existing assumptions about the self and its social construction.
 Aspects of these contradictory trends also define important points of change in language testing.

 Rapid developments in computer technology have had a major impact on test delivery. Already, many important national and international language tests, including TOEFL, are moving to computer-based testing (CBT).
 Stimulus texts and prompts are presented not in examination booklets but on the screen, with candidates being required to key in their responses. The advent of CBT has not necessarily involved any change in test content.
 The use of computers for the delivery of test materials raises questions of validity, as we might expect.
 For example, different levels of familiarity with computers will affect people's performance with them, and interaction with the computer may be a stressful experience for some.

 Attempts are usually made to reduce the impact of prior


experience by the provision of an extensive tutorial on
relevant skills as part of the test (that is, before the test
proper begins). Nevertheless, the question about the
impact of computer delivery still remains.
 While computers represent the most rapid point of
technological change, other less complex technologies,
which have been in use for some time, have led to similar
validity questions.

 Tape recorders can be used in the administration of speaking tests. Candidates are presented with a prompt on tape and are asked to respond as if they were talking to a person, the response being recorded on tape. This performance is then scored from the tape.
 Such a test is called a semi-direct test of speaking, as compared with a direct test format such as a live face-to-face interview.
 But not everybody likes speaking to tapes! We all know the difficulty many people experience in leaving messages on answering machines. Most test-takers prefer a direct rather than a semi-direct format if given the choice.
 But the question then arises as to whether these options are equivalent in testing terms. How far can you infer the same ability from performance on different formats? It is possible for somebody to be voluble in direct face-to-face interaction but tongue-tied when confronted with a machine, and vice versa.
 Research looking at the performance of the same candidates under each condition has shown that this is a complex issue, as not all candidates react in the same way (hardly surprising, of course).
 The speed of technological advances affecting language
testing sometimes gives an impression of a field
confidently moving ahead, notwithstanding the issues of
validity raised above. But concomitantly the change in
perspective from the individual to the social nature of
test performance has provoked something of an
intellectual crisis in the field.

 Developments in discourse analysis and pragmatics have revealed the essential interactivity of all communication. This is especially clear in relation to the assessment of speaking.
 The problem is that of isolating the contribution of a single individual (the candidate) in a joint communicative activity. As soon as you try to test use (as opposed to usage) you cannot confine yourself to the single individual. So whose performance are we assessing?
Concluding Remarks
The paradigm shift in language acquisition theory
causes language instruction to focus on the
learner so as to understand the relationships
among three variables – knowledge, thinking, and
behavior. Current research in testing argues for a
more direct connection between teaching and
testing. The same kinds of activities designed for
classroom interaction can serve as valid testing
formats, with instruction and evaluation more
closely integrated.

As Oller (1991) points out, "Perhaps the essential insight of a quarter of a century of language testing (both research and practice) is that good teaching and good testing are, or ought to be, nearly indistinguishable."

Shohamy (1990) suggests that language teachers


make extensive use of formative testing that is
integrated into the teaching and learning process.
Oller (1991) suggests the use of pragmatic tests,
in which the learner has the opportunity to process
authentic language within normal contextual
constraints and link that language to his/her own
experience.

There is a need to look into current language instruction and assessment practices. A scenario in which there is no difference between instruction and assessment is the ideal. While this model may never be fully integrated into large-scale testing, it holds enormous promise for classroom and individual student assessment.

Do your test/assessment practices provide positive experiences?
Do they build your students' confidence and become learning experiences?
Do they bring out the best in your students?

ONLY YOU CAN ANSWER THOSE QUESTIONS


Language testing remains a complex and perplexing activity.
While insights from evolving theories of communication may
be disconcerting, it is necessary to fully grasp them and the
challenge they pose if our assessments are to have any
chance of having the meaning we intend them to have.

WHERE ARE YOU NOW IN LANGUAGE ASSESSMENT?
