
Unit 3

Stages of Test Development

Stages of Test Development

Common Test Techniques

Multiple Choice

Stages of Test Development

1. Stating the problem

It cannot be said too many times that the essential first step in testing is to make oneself perfectly clear about what it is one wants to know and for what purpose. The following questions, the significance of which should be clear from previous chapters, have to be answered:

i. What kind of test is it to be? Achievement (final or progress), proficiency, diagnostic, or placement?
ii. What is its precise purpose?
iii. What abilities are to be tested?
iv. How detailed must the result be?
v. How accurate must the result be?
vi. How important is backwash?
vii. What constraints are set by unavailability of expertise, facilities, time (for construction, administration and scoring)?

Once the problem is clear, steps can be taken to solve it. It is to be hoped that a handbook of the present kind will take readers a long way towards appropriate solutions. In addition, however, efforts should be made to gather information on tests that have been designed for similar situations. If possible, samples of such tests should be obtained. There is nothing dishonourable in doing this; it is what professional testing bodies do when they are planning a test of a kind for which they do not already have first-hand experience. Nor does it contradict the claim made earlier that each testing situation is unique. It is not intended that other tests should simply be copied; rather that their development can serve to suggest possibilities and to help avoid the need to 'reinvent the wheel'.
2. Writing specifications for the test

A set of specifications for the test must be written at the outset.

(i)Content

This refers not to the content of a single, particular version of a test, but to the entire potential content of any number of versions. Samples of this content will appear in individual versions of the test.

The fuller the information on content, the less arbitrary should be the
subsequent decisions as to what to include in the writing of any version of the test.
There is a danger, however, that in the desire to be highly specific, we may go beyond our current understanding of what the components of language ability are and what their relationship is to each other.

The way in which content is described will vary with its nature. The content of a
grammar test, for example, may simply list all the relevant structures. The content of a
test of a language skill, on the other hand, may be specified along a number of dimensions. The following provides a possible framework for doing this. It is not meant
to be prescriptive; readers may wish to describe test content differently. The important
thing is that content should be as fully specified as possible.

Operations

• Scan text to locate specific information
• Guess meaning of unknown words from context

Types of text

• Letters
• Forms
• Academic essays up to three pages in length

Addressees of texts - this refers to the kinds of people that the candidate is expected to be able to write or speak to, or the people for whom reading and listening materials are primarily intended.

• Native speakers of the same status and age
• Native speaker university students

Length of text(s)
• For a reading test, this would be the length of the passages on which items are set.
• For a listening test, it could be the length of the spoken texts.
• For a writing test, it would be the length of the pieces to be written.

Topics - these may be specified quite loosely and selected according to suitability for the candidate and the type of test.

Readability - reading passages may be specified as being within a certain range of readability (one common readability formula is sketched after this list).

Structural Range - this could be:

• A list of structures which may occur in texts
• A list of structures which should be excluded
• A general indication of the range of structures

Vocabulary Range - This may be loosely or closely specified. An example of the latter
is to be found in the handbook of the Cambridge Young Learners tests, where words
are listed.

Dialect, accent, style - This may refer to the dialects and accents that test takers are
meant to understand or those in which they are expected to write or speak. Style may
be formal, informal, conversational etc.

Speed of processing:

• For reading, this may be expressed as the number of words to be read per minute.
• For speaking, it will be the rate of speech, also expressed in words per minute.
• For listening, it will be the speed at which the texts are spoken.
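
The text leaves the choice of readability measure open. As an illustration only, one widely used measure is the Flesch Reading Ease score; the sketch below assumes the word, sentence, and syllable counts for a passage have already been obtained.

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text.

    Rough bands: 90-100 very easy, 60-70 plain English, 0-30 very difficult.
    """
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# Example: a 180-word passage with 12 sentences and 240 syllables.
print(f"{flesch_reading_ease(180, 12, 240):.1f}")  # 78.8, an 'easy' passage

A specification might then state, for instance, that all passages must fall within a chosen score band; the band itself would be a decision for the test developers.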

(ii) Structure, timing, medium/channel and techniques

The following should be specified (a machine-readable sketch of such a specification follows the list):

• Test structure
What sections will the test have and what will be tested in each? (for
example: 3 sections - grammar, careful reading, expeditious reading)
• Number of Items
(in total and in the various sections)
• Number of Passages
(and number of items associated with each)
• Medium/channel
(paper and pencil, tape, computer, face-to-face, telephone, etc.)
• Timing
(for each section and for entire test)
• Techniques
What techniques will be used to measure what skills or subskills?
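
A specification of structure and timing like the one above is essentially structured data, so it can be useful to hold it in machine-readable form. The sketch below encodes a hypothetical specification as Python data; the section names echo the example above, but every field name and value is illustrative rather than part of any official format.

from dataclasses import dataclass, field

@dataclass
class SectionSpec:
    name: str              # e.g. "grammar", "careful reading"
    n_items: int           # number of items in the section
    n_passages: int        # passages the items are set on (0 if none)
    minutes: int           # time allowed for the section
    techniques: list[str] = field(default_factory=list)

test_spec = {
    "medium": "paper and pencil",
    "sections": [
        SectionSpec("grammar", 30, 0, 20, ["multiple choice"]),
        SectionSpec("careful reading", 20, 4, 35, ["short answer"]),
        SectionSpec("expeditious reading", 15, 3, 20, ["short answer"]),
    ],
}

# Totals for the whole test follow directly from the section specifications.
print(sum(s.n_items for s in test_spec["sections"]))   # 65 items
print(sum(s.minutes for s in test_spec["sections"]))   # 75 minutes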

(iii) Criterial levels of performance

The required level(s) of performance for (different levels of) success should be specified. This may involve a simple statement to the effect that, to demonstrate 'mastery', 80 percent of the items must be responded to correctly.

For speaking or writing, however, one can expect a description of the criterial
level to be much more complex. For example, the handbook of the Cambridge
Certificates in Communicative Skills in English (CCSE) specifies the following degree of
skill for the award of the Certificate in Oral Interaction at level 2:

• Accuracy
Pronunciation must be clearly intelligible even if still obviously influenced by L1.
Grammatical/ lexical accuracy is generally high although some errors that do not
destroy communication are acceptable.
• Appropriacy
The use of language must be generally appropriate to function. The overall
intention of the speaker must be generally clear.
• Range
A fair range of language must be available to the candidate. Only in complex
utterances is there a need to search for words.

• Flexibility
There must be some evidence of the ability to initiate and concede a
conversation and to adapt to new topics or changes of direction.
• Size
Must be capable of responding with more than short-form answers where
appropriate. Should be able to expand simple utterances with occasional
prompting from the Interlocutor.
(iv) Scoring procedures

These are always important, but particularly so where scoring will be subjective. The test developers should be clear as to how they will achieve high reliability and validity in scoring. What rating scale will be used? How many people will rate each piece of work? What happens if two or more raters disagree about a piece of work?
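
One way of answering these questions precisely is to encode the scoring rule itself. The procedure below is purely hypothetical, not one prescribed here: two raters' band scores are averaged when they are close, and the script is referred to a third rater when they differ by more than one band.

def combine_ratings(rater_a: int, rater_b: int, max_gap: int = 1) -> float | None:
    """Average two band scores if they agree closely enough.

    Returns the averaged score, or None to signal that a third
    rater should adjudicate (an assumed disagreement rule).
    """
    if abs(rater_a - rater_b) > max_gap:
        return None  # refer the script to a third rater
    return (rater_a + rater_b) / 2

print(combine_ratings(4, 5))  # 4.5
print(combine_ratings(2, 5))  # None: adjudication needed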

3. Writing and Moderating Items

a. Sampling

➢ It is most unlikely that everything found under the heading of 'Content' in the
specifications can be covered by the items in any one version of the test.
➢ Choices have to be made. For content validity and for beneficial backwash, the
important thing is to choose widely from the whole area of content.

b. Writing Items

➢ Items should always be written with the specifications in mind. It is no use writing 'good' items if they are not consistent with the specifications.
➢ As one writes an item, it is essential to try to look at it through the eyes of test takers and imagine how they might misinterpret it.
➢ The writing of successful items is extremely difficult. No one can expect to be able consistently to produce perfect items. Some items will have to be rejected, others reworked.
➢ The best way to identify items that have to be improved or abandoned is through the process of moderation.

c. Moderating Items

➢ Moderation is the scrutiny of proposed items by at least two colleagues, neither of whom is the author of the items being examined.
➢ Their task is to try to find weaknesses in the items and, where possible, remedy them.

4. Informal Trialling of Items on Native Speakers

➢ Items which have been through the process of moderation should be presented in the form of a test to a number of native speakers - twenty or more, if possible.
➢ The native speakers should be similar to the people for whom the test is being developed, in terms of age, education, and general background.
➢ Items that prove difficult for the native speakers almost certainly need revision or replacement.

5. Trialling of the Test on a group of non-native speakers similar to those for whom the test is intended

➢ Those items that have survived moderation and informal trialling on native
speakers should be put together into a test, which is then administered under
test conditions to a group similar to that for which the test is intended.
➢ Problems in administration and scoring are noted.
➢ It has to be accepted that, for a number of reasons, trialling of this kind is often not feasible.
➢ It is often the case, therefore, that faults in a test are discovered only after it has
been administered to the target group.
➢ Unless it is intended that no part of the test should be used again, it is
worthwhile noting problems that become apparent during administration and
scoring, and afterwards carrying out statistical analysis of the kind referred to
below and treated more fully in Appendix 1.

6. Analysis of the result of the trial; making of any necessary changes

There are two kinds of analysis that should be carried out.

➢ The first kind of analysis is statistical (a minimal sketch is given at the end of this section). This will reveal qualities (such as reliability) of the test as a whole and of individual items (for example, how difficult they are and how well they discriminate between stronger and weaker candidates).
➢ The second kind of analysis is qualitative. Responses should be examined in
order to discover misinterpretations, unanticipated but possibly correct
responses, and any other indicators of faulty items.
➢ Items that analysis shows to be faulty should be modified or dropped from the
test.
➢ Assuming that more items have been trialled than are needed for the final test, a
final selection can be made, basing decisions on the results of the analyses.
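
As a minimal sketch of the statistical analysis referred to above (and treated more fully in Appendix 1): an item's facility value is the proportion of candidates who answer it correctly, and a simple discrimination index compares the top- and bottom-scoring candidates. The code below assumes responses are stored as a 0/1 matrix and uses top and bottom thirds; both the data layout and the choice of thirds are illustrative.

def item_analysis(responses: list[list[int]]) -> list[tuple[float, float]]:
    """Facility value and discrimination index for each item.

    responses: 0/1 matrix with one row per candidate, one column per item.
    Discrimination = facility in the top third minus facility in the bottom third.
    """
    n = len(responses)
    third = max(n // 3, 1)
    ranked = sorted(responses, key=sum, reverse=True)  # strongest candidates first
    top, bottom = ranked[:third], ranked[-third:]
    stats = []
    for i in range(len(responses[0])):
        facility = sum(row[i] for row in responses) / n
        discrimination = (sum(row[i] for row in top) - sum(row[i] for row in bottom)) / third
        stats.append((facility, discrimination))
    return stats

# Six candidates, three items: the third item fails to discriminate.
matrix = [
    [1, 1, 1], [1, 1, 0], [1, 0, 1],
    [0, 1, 0], [0, 0, 1], [0, 0, 0],
]
for facility, discrimination in item_analysis(matrix):
    print(f"facility={facility:.2f} discrimination={discrimination:.2f}")

Items whose discrimination turns out to be near zero (or negative) are exactly the kind that the analysis would mark for modification or removal.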

7. Calibration of Scales

➢ Where rating scales are going to be used for oral testing or the testing of writing,
these should be calibrated.
➢ Essentially this means collecting samples of performance (for example, pieces of
writing) which cover the full range of the scales. A team of "experts" then looks
at these samples and assigns each one of them to a point on the relevant scale.
➢ The assigned samples provide reference points for all future uses of the scale, as
well as being necessary training materials.

8. Evaluation

• The final version of the test should be validated.
• This is regarded as essential for a high-stakes or published test.
• For low-stakes tests to be used within an institution, this may not be thought necessary, although where the test is likely to be used many times over a period of time, informal, small-scale validation is desirable.
9. Writing handbooks for test takers, test users and staff

Handbooks (each with rather different content, depending on the audience) may be expected to contain the following:

• the rationale for the test;
• an account of how the test was developed and validated;
• a description of the test (which may include a version of the specifications);
• sample items (or a complete sample test);
• advice on preparing for taking the test;
• an explanation of how test scores are to be interpreted;
• training materials (for interviewers, raters, etc.);
• details of test administration.

10. Training Staff

• All staff involved in the test process should be trained. This may include interviewers, raters, scorers, computer operators, invigilators (proctors).

COMMON TEST TECHNIQUES

Multiple Choice

• It is very difficult to write successful items

A further problem with multiple choice is that, even where items are possible,
good ones are extremely difficult to write. Professional test writers reckon to have to
write many more multiple choice items than they actually need for a test, and it is
only after trialling and statistical analysis of performance on the items that they can
recognize the ones that are usable.

Multiple choice tests that are produced for use within institutions are often shot
through with faults.

Common amongst these are:

• more than one correct answer
• no correct answer
• clues in the options as to which is correct
• ineffective distractors

Savings in the time needed for administration and scoring will be outweighed by the time spent on successful test preparation.

It is true that item banks are worthwhile, but great demands are still made on time and expertise.

• Backwash may be harmful

It should hardly be necessary to point out that where a test that is important to
students is multiple choice in nature, there is a danger that practice for the test will
have a harmful effect on learning and teaching.

Practice at multiple choice items (especially when, as can happen, as much attention is paid to improving one's educated guessing as to the content of the items) will not usually be the best way for students to improve their command of language.

• Cheating may be facilitated

The fact that the responses on a multiple choice test (a, b, c, d) are so simple makes them easy to communicate to other candidates non-verbally.

Some defence against this is to have at least two versions of the test, the only
difference between them being the order in which the options are presented.
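
This two-version defence is straightforward to mechanise. The sketch below is illustrative: a seeded random generator reorders each item's options, so both versions contain exactly the same items while the answer key for each version can be rebuilt automatically.

import random

def make_version(items, seed):
    """Build one test version by reordering each item's options.

    items: list of (stem, options, correct_option) tuples.
    Returns the shuffled items and the answer key as letters.
    """
    rng = random.Random(seed)  # seeded, so each version is reproducible
    version, key = [], []
    for stem, options, correct in items:
        shuffled = options[:]
        rng.shuffle(shuffled)
        version.append((stem, shuffled))
        key.append("abcd"[shuffled.index(correct)])
    return version, key

items = [("She ___ to school.", ["go", "goes", "going", "gone"], "goes")]
_, key_a = make_version(items, seed=1)
_, key_b = make_version(items, seed=2)
print(key_a, key_b)  # same item, but the correct letter may differ between versions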

All in all, the multiple choice technique is best suited to relatively infrequent testing
of large numbers of candidates.

Yes/No and True/False items

• Items in which the test taker has merely to choose between Yes and No or between True and False.
• The obvious weakness of such items is that the test taker has a 50% chance of choosing the correct response by chance alone (a worked example follows this list).
• True/False items are sometimes modified by requiring test takers to give a reason for their choice.
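
To make the 50% weakness concrete: on an n-item true/false test, blind guessing yields an expected score of n/2. One classical remedy sometimes applied is the correction-for-guessing formula R - W/(k-1), where R and W are the numbers of right and wrong answers and k is the number of choices; whether to apply it to any given test is a judgment call, so the sketch below is illustrative only.

def expected_guess_score(n_items: int, n_choices: int) -> float:
    """Expected raw score from blind guessing on every item."""
    return n_items / n_choices

def corrected_score(right: int, wrong: int, n_choices: int) -> float:
    """Classical correction for guessing: R - W / (k - 1)."""
    return right - wrong / (n_choices - 1)

print(expected_guess_score(50, 2))  # 25.0 on a 50-item true/false test
print(corrected_score(35, 15, 2))   # 20.0: wrong answers weigh heavily when k = 2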

Short-answer items
• Items in which the test taker has to provide a short answer are common, particularly in listening and reading tests.
• Advantages over multiple choice:
1. guessing will contribute less to test scores;
2. the technique is not restricted by the need for distractors (though there have to be potential alternative responses);
3. cheating is likely to be more difficult;
4. though great care must still be taken, items should be easier to write.
• Disadvantages are:
1. responses may take longer and so reduce the possible number of items;
2. the test taker has to produce language in order to respond;
3. scoring may be invalid or unreliable, if judgment is required;
4. scoring may take longer.
