Unit 3 Stages of Test Development
1. Stating the problem
It cannot be said too many times that the essential first step in testing is to make
oneself perfectly clear about what it is one wants to know and for what purpose. The
following questions, the significance of which should be clear from previous chapters,
have to be answered:
Once the problem is clear, steps can be taken to solve it. It is to be hoped that a
handbook of the present kind will take readers a long way towards appropriate
solutions. In addition, however, efforts should be made to gather information on tests
that have been designed for similar situations. If possible, samples of such tests should be
obtained. There is nothing dishonourable in doing this; it is what professional testing
bodies do when they are planning a test of a kind for which they do not already have
first-hand experience. Nor does it contradict the claim made earlier that each testing
situation is unique. It is not intended that other tests should simply be copied; rather
that their development can serve to suggest possibilities and to help avoid the need to
‘reinvent the wheel’.
2. Writing specifications for the test
(i) Content
This refers not to the content of a single, particular version of a test, but to the
entire potential content of any number of versions. Samples of this content will appear in
individual versions of the test.
The fuller the information on content, the less arbitrary should be the
subsequent decisions as to what to include in the writing of any version of the test.
There is a danger, however, that in the desire to be highly specific, we may go beyond
our current understanding of what the components of language ability are and what
their relationship is to each other.
The way in which content is described will vary with its nature. The content of a
grammar test, for example, may simply list all the relevant structures. The content of a
test of a language skill, on the other hand, may be specified along a number of
dimensions. The following provides a possible framework for doing this. It is not meant
to be prescriptive; readers may wish to describe test content differently. The important
thing is that content should be as fully specified as possible.
Operations
Types of text
• Letters
• Forms
• Academic essays up to three pages in length
Length of text(s)
• For a reading test, this would be the length of the passages on which items are
set.
• For a listening test, it could be the length of the spoken texts
• For a writing test, the length of the pieces to be written
Topics- These may be specified quite loosely and selected according to suitability for the
candidates and the type of test.
Vocabulary Range- This may be loosely or closely specified. An example of the latter
is to be found in the handbook of the Cambridge Young Learners tests, where words
are listed.
Dialect, accent, style- This may refer to the dialects and accents that test takers are
meant to understand or those in which they are expected to write or speak. Style may
be formal, informal, conversational etc.
Speed of processing-
• For reading, this may be expressed in the number of words to be read per
minute
• For speaking, it will be the rate of speech, also expressed in words per minute.
• For listening, it will be the speed at which texts are spoken.
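A speed-of-processing specification like this can be turned into concrete planning figures. As a rough illustration (all the numbers below are invented, not drawn from any handbook), a reading-speed specification implies a maximum passage length for a given time allowance:

```python
# Hypothetical figures only: turning a speed-of-processing specification
# into a maximum passage length for a reading test.
target_wpm = 150          # specified reading speed, in words per minute
time_allowed_min = 10     # minutes allotted to the passage and its items
item_time_min = 4         # minutes assumed to be spent answering items

reading_time = time_allowed_min - item_time_min
max_words = target_wpm * reading_time
print(f"Maximum passage length: {max_words} words")  # 900 words
```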
(ii) Structure, timing, medium/channel and techniques
• Test structure
What sections will the test have and what will be tested in each? (for
example, 3 sections: grammar, careful reading, expeditious reading)
• Number of Items
(in total and in the various sections)
• Number of Passages
(and number of items associated with each)
• Medium/channel
(paper and pencil, tape, computer, face-to-face, telephone, etc.)
• Timing
(for each section and for entire test)
• Techniques
What techniques will be used to measure what skills or subskills?
(iii) Criterial levels of performance
The required level(s) of performance for (different levels of) success should be
specified. This may involve a simple statement to the effect that, to demonstrate
‘mastery’, 80 percent of the items must be responded to correctly.
For speaking or writing, however, one can expect a description of the criterial
level to be much more complex. For example, the handbook of the Cambridge
Certificates in Communicative Skills in English (CCSE) specifies the following degree of
skill for the award of the Certificate in Oral Interaction at level 2:
• Accuracy
Pronunciation must be clearly intelligible even if still obviously influenced by L1.
Grammatical/ lexical accuracy is generally high although some errors that do not
destroy communication are acceptable.
• Appropriacy
The use of language must be generally appropriate to function. The overall
intention of the speaker must be generally clear.
• Range
A fair range of language must be available to the candidate. Only in complex
utterances is there a need to search for words.
• Flexibility
There must be some evidence of the ability to initiate and concede a
conversation and to adapt to new topics or changes of direction.
• Size
Must be capable of responding with more than short-form answers where
appropriate. Should be able to expand simple utterances with occasional
prompting from the Interlocutor.
(iv) Scoring procedures
These are always important, but particularly so where scoring will be subjective.
The test developers should be clear as to how they will achieve high reliability and
validity in scoring. What rating scale will be used? How many people will rate each
piece of work? What happens if two or more raters disagree about a piece of work?
(One possible resolution rule is sketched below.)
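By way of illustration only, here is a minimal sketch of one possible resolution rule, assuming a numerical band scale: average two independent ratings, and refer the piece of work to a third rater when the first two differ by more than one band. The scale, threshold, and function names are all hypothetical:

```python
from statistics import median

# A hypothetical resolution rule for subjective scoring (0-9 band scale):
# average two independent ratings; if they differ by more than one band,
# refer the piece of work to a third rater and take the median.
def resolve(rating_1, rating_2, third_rater):
    """Return a final score from two ratings, escalating on disagreement."""
    if abs(rating_1 - rating_2) <= 1:
        return (rating_1 + rating_2) / 2
    rating_3 = third_rater()  # e.g. a senior examiner rates the same script
    return median([rating_1, rating_2, rating_3])

# Raters award bands 4 and 7; the third rater awards 6, so the score is 6.
print(resolve(4, 7, lambda: 6))
```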
3. Writing and moderating items
a. Sampling
➢ It is most unlikely that everything found under the heading of 'Content' in the
specifications can be covered by the items in any one version of the test.
➢ Choices have to be made. For content validity and for beneficial backwash, the
important thing is to choose widely from the whole area of content (one way of
doing this systematically is sketched below).
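As an illustration (the content areas and item labels below are purely hypothetical), one systematic approach is to draw a fixed number of items from every area of the specification rather than from a convenient few:

```python
import random

# Purely hypothetical content specification: a pool of candidate items
# for each area of content named in the specifications.
content_pools = {
    "grammar": ["g1", "g2", "g3", "g4", "g5"],
    "careful reading": ["c1", "c2", "c3", "c4"],
    "expeditious reading": ["e1", "e2", "e3", "e4"],
}

def sample_version(pools, per_area, seed=0):
    """Draw items from every area of content, so that each version of
    the test samples widely across the whole specification."""
    rng = random.Random(seed)  # seeded, so the selection is reproducible
    return {area: rng.sample(items, per_area) for area, items in pools.items()}

print(sample_version(content_pools, per_area=2))
```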
b. Writing Items
c. Moderating Items
4. Informal trialling of items on native speakers
➢ Items which have been through the process of moderation should be presented
in the form of a test to a number of native speakers - twenty or more, if possible.
➢ The native speakers should be similar to the people for whom the test is being
developed, in terms of age, education, and general background.
➢ Items that proved difficult for the native speakers almost certainly need revision
or replacement.
5. Trialling of the test
➢ Those items that have survived moderation and informal trialling on native
speakers should be put together into a test, which is then administered under
test conditions to a group similar to that for which the test is intended.
➢ Problems in administration and scoring are noted.
➢ It has to be accepted that, for a number of reasons, trialling of this kind is often
not feasible.
➢ It is often the case, therefore, that faults in a test are discovered only after it has
been administered to the target group.
6. Analysis of the results
➢ Unless it is intended that no part of the test should be used again, it is
worthwhile noting problems that become apparent during administration and
scoring, and afterwards carrying out statistical analysis of the kind referred to
below and treated more fully in Appendix 1.
➢ This will reveal qualities (such as reliability) of the test as a whole and of
individual items (for example, how difficult they are and how well they
discriminate between stronger and weaker candidates); a minimal sketch of such
item-level statistics follows this list.
➢ The second kind of analysis is qualitative. Responses should be examined in
order to discover misinterpretations, unanticipated but possibly correct
responses, and any other indicators of faulty items.
➢ Items that analysis shows to be faulty should be modified or dropped from the
test.
➢ Assuming that more items have been trialled than are needed for the final test, a
final selection can be made, basing decisions on the results of the analyses.
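As an illustration of the statistical analysis mentioned above, here is a minimal sketch of two classical item statistics: the facility value (the proportion of candidates answering an item correctly) and a simple discrimination index (the difference in facility between higher- and lower-scoring candidates). The response data are invented:

```python
# Invented response data: each row is one candidate, each column one item
# (1 = correct, 0 = incorrect).
results = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
]

def facility(item):
    """Proportion of candidates answering this item correctly."""
    return sum(row[item] for row in results) / len(results)

def discrimination(item):
    """Difference in facility between the higher- and lower-scoring halves
    of the candidates: a simple discrimination index."""
    ranked = sorted(results, key=sum, reverse=True)
    half = len(ranked) // 2
    top = sum(row[item] for row in ranked[:half]) / half
    bottom = sum(row[item] for row in ranked[-half:]) / half
    return top - bottom

for i in range(len(results[0])):
    print(f"Item {i + 1}: facility {facility(i):.2f}, "
          f"discrimination {discrimination(i):.2f}")
```

An item answered correctly by nearly everyone, or one on which weaker candidates do as well as stronger ones, would stand out in such output as a candidate for modification or removal.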
7. Calibration of Scales
➢ Where rating scales are going to be used for oral testing or the testing of writing,
these should be calibrated.
➢ Essentially this means collecting samples of performance (for example, pieces of
writing) which cover the full range of the scales. A team of "experts" then looks
at these samples and assigns each one of them to a point on the relevant scale.
➢ The assigned samples provide reference points for all future uses of the scale, as
well as being necessary training materials.
8. Evaluation
Multiple Choice
A further problem with multiple choice is that, even where items are possible,
good ones are extremely difficult to write. Professional test writers reckon to have to
write many more multiple choice items than they actually need for a test, and it is
only after trialling and statistical analysis of performance on the items that they can
recognize the ones that are usable.
Multiple choice tests that are produced for use within institutions are often shot
through with faults.
Savings in time for administration and scoring will be outweighed by the time
spent on successful test preparation.
It is true that item banks are worthwhile, but great demands are still made on time
and expertise.
It should hardly be necessary to point out that where a test that is important to
students is multiple choice in nature, there is a danger that practice for the test will
have a harmful effect on learning and teaching.
Practice at multiple choice items (especially when, as can happen, as much attention
is paid to improving one’s educated guessing as to the content of items) will not usually
be the best way for students to improve their command of language.
The fact that the responses on a multiple choice test (a, b, c, d) are so simple
makes them easy to communicate to other candidates non-verbally.
Some defence against this is to have at least two versions of the test, the only
difference between them being the order in which the options are presented.
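A small sketch of how such parallel versions might be generated (the items, seeds, and function name are hypothetical): the same items are presented with their options in a different order, and each version gets its own answer key:

```python
import random

# Hypothetical item bank: each item is a stem plus its options, with the
# correct option listed first for convenience.
items = [
    ("She ____ to school every day.", ["goes", "go", "going", "gone"]),
    ("They ____ arrived yet.", ["haven't", "hasn't", "didn't", "don't"]),
]

def make_version(items, seed):
    """Build one test version: the same items, with the options presented
    in a different order. Returns the items and the answer key."""
    rng = random.Random(seed)  # seeded, so each version is reproducible
    version, key = [], []
    for stem, options in items:
        correct = options[0]
        shuffled = options[:]
        rng.shuffle(shuffled)
        version.append((stem, shuffled))
        key.append("abcd"[shuffled.index(correct)])
    return version, key

version_a, key_a = make_version(items, seed=1)
version_b, key_b = make_version(items, seed=2)
print(key_a, key_b)  # each version needs its own answer key
```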
All in all, the multiple choice technique is best suited to relatively infrequent testing
of large numbers of candidates.
Yes/No and True/False items
• Items in which the test taker has merely to choose between Yes and No or
between True and False.
• The obvious weakness of such items is that the test taker has a 50% chance of
choosing the correct response by chance alone (the sketch after this list shows
what this implies for whole-test scores).
• True/False items are sometimes modified by requiring test takers to give a reason
for their choice.
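To see what this 50 per cent chance per item implies for whole-test scores, here is a small sketch (the number of items and the pass mark are invented for illustration) computing the probability of reaching a pass mark on a True/False test by guessing alone:

```python
from math import comb

# Hypothetical figures: a 20-item True/False test with a pass mark of 70%.
n = 20
k_needed = int(n * 0.7)  # 14 correct responses needed to pass

# Each blind guess is correct with probability 0.5, so the number of
# correct guesses follows a binomial distribution.
p_pass = sum(comb(n, k) for k in range(k_needed, n + 1)) / 2**n
print(f"P(pass by guessing alone) = {p_pass:.3f}")  # about 0.058
```

Even guessing blindly, roughly one candidate in seventeen would pass this hypothetical test, which helps explain why test writers sometimes require a reason for each choice, as noted above.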
Short-answer items
• Items in which the test taker has to provide a short answer are common,
particularly in listening and reading tests.
• Advantages over multiple choice:
1. guessing will contribute less to test scores;
2. the technique is not restricted by the need for distractors (though
there have to be potential alternative responses);
3. cheating may be more difficult;
4. though great care must still be taken, items should be easier to
write.
• Disadvantages are:
1. responses may take longer and so reduce the possible number of items;
2. the test taker has to produce language in order to respond;
3. scoring may be invalid or unreliable, if judgment is required;
4. scoring may take longer.