Cynthia G. Parshall
Judith A. Spray
John C. Kalohn
Tim Davey

Practical Considerations in Computer-Based Testing

With 17 Illustrations

Springer-Science+Business Media, LLC
Cynthia G. Parshall
University of South Florida
4202 E. Fowler Avenue, EDU 162
Tampa, FL 33620-7750
USA
[email protected]

Judith A. Spray
ACT, Inc.
2202 N. Dodge Street
Iowa City, IA 52243-0168
USA
[email protected]
Contents

Preface

3 Examinee Issues
    Overall Reactions
    Reactions to Adaptive Delivery Methods
    Impact of Prior Computer Experience
    Effects of the User Interface
    Effects of Task Constraints
    Effects of the Administrative Process
    Examinees' Mental Models
    Summary
    References
    Additional Readings
4 Software Issues
    User Interfaces
    Usability Studies
    Software Evaluation
    Design of the User Interface
    Web-Based Tests
    Quality Control
    Summary
    References
    Additional Readings

Index
1 Considerations in Computer-Based Testing
items can be made. The amount of time an examinee spends on an item, or the
response latency, also can be collected. If an item is designed appropriately,
process data as well as product data can be collected.
Computer administration of tests also has potential for some important
technological provisions for handicapped examinees. Examples include the
provision of large print or audio for vision-impaired examinees, alternate input
devices for examinees who are unable to write, and other accommodations.
These are only some of the potential benefits of computer-based testing.
However, a number of challenges for the development and administration of a
psychometrically sound computer-based test accompany them. It is important
for any agency considering a computer-based testing program to study the
critical issues or features of computer-based testing in order to design a program
that best suits its needs. These features or issues of computer-based testing are
presented briefly in this chapter and discussed more thoroughly in those that
follow.
Test Administration
held must be set under conditions that differ considerably from paper-and-pencil
test administrations. Finally, new procedures may also be needed in the staffing
of test program support personnel. Given more frequent exam administrations,
support personnel need to be available to respond to problems on a more
constant basis.
Test Development
Examinee Issues
The perspective of the examinees taking computer-based tests needs to be
considered as well. Issues related to examinees include their affective reactions
to computer-based test administration, the impact of prior level of computer
experience on their test performance, effects of the user interface, effects of task
constraints, effects of the test administration process, and examinees' mental
models as they test. A few of these topics are introduced in the following
sections.
Examinee Reactions
Many early studies in the field of computerized testing found that examinees had
very positive reactions to the experience of taking an exam on computer.
However, the majority of these studies were conducted on tests under
unmotivated conditions, that is, the examinees had little reason or motivation to
perform well. As the number of operational computerized tests has increased, a
somewhat less optimistic picture has developed. Examinees tend to be more
cautious about offering positive reactions under any genuine test condition and
are sometimes very concerned or even negative about the experience under high-
stakes conditions (i.e., tests with critical educational or career implications).
Examinees have expressed some concerns about the use of the computer
itself. Generally, these include a fear that their prior experience is insufficient
for the task or a sense that the task constraints will be more binding. Different,
less familiar test-taking strategies appear to be optimal for the process of taking
a computerized test. Scanning the test or quickly moving about the exam is
difficult or impossible when only a single item can be displayed on the screen at
a time. Note taking or marking an item for later review is also more difficult
when entering responses with a keyboard or mouse than when using a pencil.
In addition to the general effect of testing on computer, examinees are
concerned about certain specific aspects of adaptive testing. For example, a
typical computerized adaptive test (CAT) may result in an examinee getting
only about half of the items administered correct. This may provide a more
difficult testing experience than is the norm for many test-takers. Some
examinees are also frustrated by constraints on adaptive tests, such as the lack of
opportunity to review or change their responses to previous items. Also, given
that adaptive tests frequently are shorter than the traditional format, examinees
can become much more anxious about their responses to each and every item.
Finally, examinees may be made more anxious, to the extent that they are aware
of their interaction with the computer. That is, because answering an item
correctly causes the computer to select and administer a more difficult item but
answering incorrectly causes the computer to select and administer an easier
item, examinees may become anxious about their perceived performance, based
on the "feedback" of the current item's difficulty level.
While test developers must consider every aspect of the computerized exam, the
only aspects of the computer-based test with which the examinee is directly
concerned are the item screens and the information screens. In both cases,
examinees need to know clearly to what part of the screen they must attend, how
to navigate, and how to indicate a response. The user interface comprises those
elements of a software program that a user sees and interacts with. The more
"intuitive" the computer test software is, the less the examinee needs to attend to
it rather than to the test questions. A good interface is clear and consistent and
should be based on sound software design principles in order to support the
overall goals of the program.
Typically , it is more difficult to read from the screen than from paper. This
fact has particular implications for longer reading passages, multiscreen items,
and items that require scrolling. Scanning an entire test and skipping, answering,
or returning to items may also be easier in pencil-and-paper mode. On the other
hand, some advantages for presenting items on screen have been found. For
example, the presentation of one item per screen has been seen as helpful to
distractible examinees. In addition, several potential sources of measurement
error are removed. For example, an examinee is only able to select a single
response to a multiple-choice item; no double-gridding can occur. Further, an
examinee cannot mistakenly provide an answer to the wrong question; if item 14
is displayed on screen, there is no possibility of gridding item 15 instead.
Software Issues
The selection or development of software to administer computer-based tests is
an important step in the process of developing an online exam program.
Usability studies and software evaluations help test developers ensure that the
software program is well designed and meets the needs of the exam. Critical
aspects of the software's interface include navigation, visual style, and writing
for the screen. These topics are introduced in the following sections and
discussed in greater detail in Chapter 4.
Usability Studies
Usability studies are the means by which software developers evaluate and
refine their software programs. Potential users of the program are asked to
interact with and comment on early versions or prototypes of the software. In
this way, the software developers obtain information about the program's value,
utility, and appeal.
Usability studies help ensure good software design; making decisions
without obtaining user input often leads to design flaws and operational
problems. Testing successive versions of the interface on actual users and
making changes based on the feedback provided have been shown to contribute
to a program that meets users' needs, and one that functions well in practice. For
computerized testing software, one important effect of good software design is
that examinees are able to spend less time and attention on how to take the test
and more on the actual test itself. This may remove a potential source of
measurement error and a limitation on test score validity. Testing programs that
develop their own software for computerized test delivery are well advised to
incorporate usability testing into the development process.
There are several different approaches to usability testing, ranging from very
expensive to very inexpensive. Usability studies can be conducted using very
elaborate, high-fidelity prototypes of the actual software interface. However,
they can also be conducted quite effectively using inexpensive, simple mockups
of the interface. The studies themselves can be conducted in numerous ways,
including direct observations, interviews, and talk-aloud protocols or focus
groups.
Software Evaluation
programs ensures that a testing agency will select the program that best meets its
needs and the needs of the examinees. Those testing agencies that are producing
their own computerized testing software can benefit from software evaluation.
Checklists and usability heuristics can be used to aid in the evaluation process.
Navigation
Visual Style
The visual style of the software interface is reflected in all of the individual
visual elements on the screen, from the choice of background color to the shape
and label of navigation buttons to the type and size of text fonts. The visual style
of the program is displayed through the use of borders, spacing, and the physical
layout of screen elements. The style and detail of any graphics included, as well
as any animation or video, also contribute to the program's visual style. Finally,
the visual effect of all of these elements combined contributes to the overall
visual style of the interface.
In any software package, the visual style should be appropriate for the target
audience and for the subject matter of the program. It also should be designed to
match the style and tone of the written communication. For computerized
testing, a simple uncluttered style is best. Clear, legible text with few screen
windows open at one time is the least confusing to novice users of the testing
software. Soft colors, with a good level of contrast between the font and the
background, are easiest on the eyes and thus help reduce potential examinee
fatigue during the exam.
Writing for the Screen
The topic of "writing for the screen" refers to important facets of written
communication with the software users. The term applies to the information on
software screens such as the introductory screen, instructional and tutorial
screens, any practice item screens, help screens, and closing screens. It does not
refer to the writing of the actual test items.
The (non-item) textual information included in a program should be well
written, communicating information clearly and at an appropriate reading level
for the target users. Reading extensive information on screen is more taxing to
the reader than reading the same information from print, so brevity is a critical
concern in writing for the screen. Concise communication also can be aided by
the use of formatting techniques, including physical layout, spacing, and
bulleted text.
The font used for textual information should be selected carefully, to be
visually pleasing on screen, large enough to be seen from a reasonable distance,
and fully legible. Next, the display of text should be carefully designed. It
should be formatted with attention to aesthetics and attractive placement on the
screen. Legibility and readability are improved when text is left-justified rather
than centered or full-justified. The information is easier to read if the text is not
overly cluttered; instead, visual breaks should be used, and the text should be
surrounded by white space.
Issues in Innovative Item Types

Items may be innovative (or depart from the traditional, discrete, text-based,
multiple-choice format) in many ways. Innovative items may include some
feature or function not easily provided in paper-and-pencil testing. This can
include the use of video, audio, or graphics in the item stems or response
options. Examinees might respond by clicking on or moving a screen graphic,
entering a free response, highlighting text, or another action. Other innovations
relate to the computer's ability to interact with examinees and to score complex
responses by the examinees.
The overarching purpose of innovative item types is to improve
measurement, through either improving the quality of existing measures or
through expanding measurement into new areas. In other words, innovative item
types enable us to measure something better or to measure something more.
Innovative item types may offer improved measurement by reducing guessing or
allowing more direct measurement than traditional item types provide.
Innovative item types also may improve measurement by expanding an exam's
coverage of content areas and cognitive processes.
Dimensions of Innovation
Task Complexity
Innovative item types span a very wide range of task complexity. Task
complexity summarizes the number and type of examinee interactions within a
given task or item. The level of task complexity has implications for item
development as well as for implementation within a test delivery method.
Items with low task complexity are the most similar to traditional, text-based
multiple-choice items in terms of the amount of examinee response time
required and the amount of measurement information the item may be expected
to provide. The modest differences between these innovative item types and
more traditional items mean that items with low task complexity can be
incorporated into an existing test structure fairly easily. Innovative item types
with high task complexity, however, are another matter. Items with high task
complexity tend to require far more examinee response time, whether they
Test Delivery Methods
The simplest test delivery method is the computerized fixed test (CFT). This
nonadaptive computer test design has a format similar to the conventional paper-
and-pencil test on which it is modeled. The CFT is sometimes referred to as a
"linear" computerized test, or simply as a CBT (although that tenn has come to
mean any computerized test). The computerized fixed test has a fixed length and
consists of a fixed test form; the fixed set of items can be administered in either
fixed or random order. The computerized test forms are usually constructed
based on classical item statistics (p-values and point-biserial correlations), and
the tests typically are scored by number correct or scaled number correct. When
necessary, alternate forms can be equated by conventional means, or a passing
score can be determined for each test form if a classification decision needs to
be made.
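
As a concrete illustration of the classical statistics mentioned above, the following sketch computes p-values, uncorrected point-biserial correlations, and number-correct scores from a matrix of dichotomously scored responses. The data layout and function name are assumptions made for the example, not a description of any particular program's procedures.

    # Illustrative sketch only: classical item statistics for a fixed computerized form.
    # `responses` is a list of examinee response vectors, each scored 0/1 per item.

    def item_statistics(responses):
        n_examinees = len(responses)
        n_items = len(responses[0])
        totals = [sum(r) for r in responses]                 # number-correct scores
        mean_total = sum(totals) / n_examinees
        sd_total = (sum((t - mean_total) ** 2 for t in totals) / n_examinees) ** 0.5

        stats = []
        for i in range(n_items):
            scores = [r[i] for r in responses]
            p = sum(scores) / n_examinees                    # p-value (proportion correct)
            sd_item = (p * (1 - p)) ** 0.5
            # Uncorrected point-biserial: correlation of the item score with the total score.
            cov = sum((s - p) * (t - mean_total)
                      for s, t in zip(scores, totals)) / n_examinees
            r_pb = cov / (sd_item * sd_total) if sd_item and sd_total else 0.0
            stats.append({"p_value": p, "point_biserial": r_pb})
        return stats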
Under the automated test assembly (ATA) test delivery method, computerized
exams are constructed by selecting items from an item pool in accord with a test
plan, using both content and statistical constraints. Like the CFT, ATA tests
have a fixed length and are not adaptive. In fact, although many examinees
receive a different test, the forms may be constructed in advance of the test
session. The use of multiple forms makes formal equating following
administration impossible. However, forms are usually constructed under
constraints that render them equivalent by a predetermined set of criteria based
on content and psychometric specifications.
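
The selection logic itself can range from simple heuristics to formal optimization. The deliberately simplified, hypothetical sketch below fills content-area quotas while steering the form toward a target p-value; the field names, quotas, and target are invented, and the routine is meant only to make the idea of content and statistical constraints concrete, not to represent any operational ATA system.

    # Illustrative sketch only: greedy form assembly under content and statistical constraints.

    def assemble_form(pool, content_quotas, target_p=0.60):
        """pool: list of dicts such as {"id": "A12", "content": "algebra", "p": 0.55}."""
        form, selected = [], set()
        for area, quota in content_quotas.items():
            candidates = [item for item in pool
                          if item["content"] == area and item["id"] not in selected]
            # Prefer items whose difficulty is closest to the statistical target.
            candidates.sort(key=lambda item: abs(item["p"] - target_p))
            for item in candidates[:quota]:
                form.append(item)
                selected.add(item["id"])
        return form

    # Example: a plan calling for three algebra items and two geometry items.
    # form = assemble_form(pool, {"algebra": 3, "geometry": 2})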
Tests constructed with this method can be administered frequently on fixed
dates or during fixed windows. Item pools can be used for several
administrations before the item pool is set aside or restocked, and the exams
may be scored by classical test theory methods or by estimates of each examinee's
ability as measured by the test. Either official or provisional test scores may be
provided at the conclusion of the test session, as discussed previously. The test
program is often self-sustaining through embedded pretest sections. More
information about item development and pretesting is provided in Chapter 2.
In the computerized adaptive test (CAT) delivery method, items are individually
selected for each examinee based on the responses to previous items. The goal
of this type of test is to obtain a precise and accurate estimate of each examinee's
proficiency on some underlying scale. The number of items, the specific items,
and the order of item presentation are all likely to vary from one examinee to
another. Unique tests are constructed for each examinee, matching the difficulty
of each item to the examinee's estimated ability. Immediate scoring is provided.
Examinee scores are put on the same scale through reliance on item response
theory (IRT) latent ability estimates. (More information on IRT is provided in
the Basics of IRT appendix and in Chapter 8.)
CAT test programs usually provide either continuous testing or frequent
testing windows. The item pools need to be refreshed regularly, and the exams
are scored by IRT ability estimates. The official scores therefore can be reported
immediately.
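
The following sketch assumes the one-parameter (Rasch) IRT model and invented item difficulties, and illustrates only the basic select-administer-update cycle of a CAT; operational programs add content balancing, exposure control, pretest seeding, and more defensible stopping rules.

    import math

    # Illustrative sketch only: one CAT cycle under the Rasch model.

    def prob_correct(theta, b):
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    def next_item(theta, pool, used):
        # Under the Rasch model, information is highest where difficulty is near ability.
        available = [i for i in range(len(pool)) if i not in used]
        return min(available, key=lambda i: abs(pool[i] - theta))

    def update_theta(theta, difficulties, answers, steps=10):
        # Newton-Raphson maximum-likelihood estimate from the responses given so far.
        # (The MLE is undefined for all-correct or all-incorrect patterns, so the
        # provisional estimate is kept within a plausible range.)
        for _ in range(steps):
            ps = [prob_correct(theta, b) for b in difficulties]
            gradient = sum(u - p for u, p in zip(answers, ps))
            information = sum(p * (1 - p) for p in ps)
            if information == 0:
                break
            theta = max(-4.0, min(4.0, theta + gradient / information))
        return theta

    # One hypothetical administration (administer() is a placeholder for item delivery):
    # pool = [-2.0 + 0.2 * i for i in range(20)]
    # theta, used, items, answers = 0.0, set(), [], []
    # for _ in range(10):
    #     i = next_item(theta, pool, used); used.add(i)
    #     u = administer(pool[i])
    #     items.append(pool[i]); answers.append(u)
    #     theta = update_theta(theta, items, answers)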
The computerized classification test (CCT) delivery method, like the CAT, is
adaptive. However, the goal in this delivery method is to classify examinees into
two or more broad categories, rather than to obtain an ability estimate for each
examinee. That is, examinees are placed into groups, such as pass or fail, but
may not be provided a score. This model is more efficient for making
classification decisions than a full CAT exam, although less information about
the examinee's ability is obtained. Both classical and IRT versions of the test are
possible.
Testing programs using the CCT delivery method often provide continuous
or year-round testing, and item pools may be changed or refreshed several times
annually. The official scores, or classification decisions, are often reported
immediately. The program is typically self-sustaining through embedded pretest
sections.
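
One common way to make such classification decisions is the sequential probability ratio test (SPRT). The sketch below assumes a Rasch model, hypothetical cut abilities, and nominal error rates, and is intended only to illustrate the logic of classifying an examinee without estimating a precise ability.

    import math

    # Illustrative sketch only: an SPRT pass/fail decision after each response.

    def sprt_decision(responses, difficulties, theta_fail=-0.5, theta_pass=0.5,
                      alpha=0.05, beta=0.05):
        def p(theta, b):
            return 1.0 / (1.0 + math.exp(-(theta - b)))

        log_ratio = 0.0
        for u, b in zip(responses, difficulties):
            p1, p0 = p(theta_pass, b), p(theta_fail, b)
            log_ratio += math.log(p1 / p0) if u else math.log((1 - p1) / (1 - p0))

        upper = math.log((1 - beta) / alpha)      # enough evidence to classify as pass
        lower = math.log(beta / (1 - alpha))      # enough evidence to classify as fail
        if log_ratio >= upper:
            return "pass"
        if log_ratio <= lower:
            return "fail"
        return "continue testing"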
2 Issues in Test Administration and Development

The processes of test administration and development are both critical elements
in any testing program. Chronologically, the development of any test occurs
before its administration, and thus the two are more commonly paired as "test
development and administration." However, in discussing computerized testing
programs, it is often useful to address the administration issues first and then
turn to the development considerations. This is the approach followed in this
chapter.
Security
Examinee Time
Administration Frequency
Cost Concerns
The costing structure for a computerized exam is likely to be very different from
that of a paper-and-pencil exam. There may be cost savings due to reduced
printing and shipping costs. Test programs where paper-and-pencil exams have
been administered individually may save on salaries for exam proctors after
switching to computerized administration. In general, however, computer-based
testing is more expensive. This higher cost is the result of such elements as the
test administration software (whether purchased or developed internally), the
additional items that must be developed, the psychometric research to support
the program, and the computer test site fees. Part or all of the additional cost is
often passed along to the examinees. Given this higher cost to the examinee, it is
helpful to ensure that examinees see advantages to the CBT, particularly when
both paper-and-pencil and CBT versions of an exam are offered. Otherwise,
very few examinees may choose the less familiar and more costly CBT version.
The most common advantages that examinees perceive are more frequent test
dates and immediate scoring. Additional discussion of CBT economics (and
other logistical matters) is provided in Vale (1995) and Clauser and Schuwirth
(in press).
There are a number of psychometric issues related to planning the test, including
developing the test blueprint, defining the test characteristics, developing the
item pool, obtaining the item statistics, and conducting computerized
simulations. These stages in CBT development are introduced next.
Steps 1 through 5 are discussed in more detail next. The remaining steps
are discussed further in Chapter 10.
CBT require little or no revision. For new testing programs, test specifications
need to be developed.
The first steps in developing any CBT are identical to those undertaken when
conventional tests are developed. Determining what a test is to measure and
developing items to effectively do so are processes well described in other
references (e.g., Crocker & Algina, 1986) and will not be detailed in this book.
One of the differences for CBTs results from the expanded range of item types
made possible by computerized test administration. More information about
these possibilities can be found in Chapter 5.
The product of these steps is a blueprint that specifies the type, format, and
content characteristics for each item on the test. The options and flexibility are
almost endless. The test can comprise discrete items, stimulus-based units, or a
combination of the two. Even discrete items can be formed into bundles that are
always administered collectively (see Wainer, 1990, for a discussion of the
benefits of these bundles, called testlets). Items can allow for any means of
responding, provided scoring is objective and immediate. Although multiple
choice is currently the dominant item type, alternatives are available and are
becoming more prevalent. Lastly, the content characteristics of the items and the
proportion of the test devoted to each content domain are specified. This in turn
specifies the constraints under which items are selected during test
administration.
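
For illustration only, the kind of information such a blueprint records might be captured in a simple structure like the one below; the content domains, proportions, and item types shown are invented, not taken from any particular testing program.

    # Hypothetical example of blueprint specifications and the counts they imply.
    blueprint = {
        "test_length": 60,
        "content_domains": {            # proportion of the test devoted to each domain
            "reading": 0.40,
            "writing": 0.35,
            "research_skills": 0.25,
        },
        "item_types": ["multiple_choice", "stimulus_based_set"],
        "testlets_allowed": True,       # discrete items may be bundled (Wainer, 1990)
    }

    def items_per_domain(bp):
        # Convert the blueprint proportions into per-domain item counts, which then
        # serve as constraints on item selection during test administration.
        return {domain: round(share * bp["test_length"])
                for domain, share in bp["content_domains"].items()}

    # items_per_domain(blueprint) -> {"reading": 24, "writing": 21, "research_skills": 15}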
Nevertheless, if item analyses show that the items are performing adequately,
they subsequently can be used operationally.
An item pool that has been used for a paper-and-pencil assessment program
may need additional items to be developed before it can go online, particularly if
any kind of adaptive test is planned. Further, CBTs often have a continuing need
for many more new items to be written and pretested. In fact, for some testing
programs the need to satisfy the voracious requirement for items has become
one of the most challenging aspects of computerizing the exam. One cause of
the increased need for items in computer-based testing is the more frequent
administration of tests. The continuous test administrations cause items to be
exposed over time, potentially affecting test security. Furthermore, in the
adaptive test delivery methods, items are individually administered at differing
frequencies. The most desirable items (e.g., for many CATs, items of middle
difficulty and high discrimination) tend to be exposed most frequently and may
need to be retired from the item pool particularly quickly. These requirements
for additional items are especially relevant for high-stakes exams, less so for
low-stakes test programs. In low-stakes applications, an item pool only needs to
be large enough to provide good measurement, without the additional need to
address test security.
The need for large numbers of items in many computer-based testing
programs is often coupled with a need for examinee response data from a large
number of appropriate examinees. In addition, continuous testing results in a
slower, more gradual accumulation of sufficient numbers of examinees to
compute item statistics. The number of pretest examinees needed is related to
the measurement model in use; test programs using classical test theory are less
demanding than those using IRT. Item parameter estimates that are computed
using an insufficient number of examinees are likely to be poor, leading to
lessened test score precision (Hambleton & Jones, 1994).
For these reasons, computerized fixed test programs, or CFTs, that use
classical methods and are applied in low-stakes settings have fewer demands
than other delivery methods. The automated test assembly method, or ATA,
requires a greater item writing and pretesting effort, due to its use of multiple
forms. The adaptive methods, computerized classification testing and
computerized adaptive testing (CCT and CAT) also experience greater demands
for additional items. An adaptive exam program often requires a large initial set
of items and a greater need to replenish the item pool regularly. Furthermore,
pretesting large numbers of items can be a particular issue for short adaptive
tests, given the proportion of pretest-to-operational items. These demands
typically are greater for a CAT than a CCT. Further discussion of pretesting
accommodation is provided in each of the chapters covering test delivery
methods. More discussion of the issues in CBT item development and pretesting
is provided by Parshall (2000).
Prior to implementing computer-based testing, it is important to conduct an
evaluation of the existing item pool. This evaluation includes an assessment of
the quality of the current items and, if a measurement model is assumed, a test
In summary, items within a CBT item pool are assumed to have statistical
characteristics associated with them that are valid for CBT situations. The best
way to ensure that this condition is met is to obtain the statistical item data under
CBT conditions. When this cannot be done, then some caution must be exercised
regarding the decisions obtained with CBTs assembled from item pools that
have been transferred directly from paper-and-pencil programs. This is also true
of item pools that are IRT-based. Item calibrations obtained from paper-and-
pencil forms will not necessarily reflect the performance of the items within a
CBT context.
Once an item has been calibrated online, however (i.e., as part of a CBT
program), it is usually assumed that these online calibrations will represent the
true performance of that item, even though the item will most likely appear in a
different order and with different items for each examinee. It is assumed that
these effects will cancel out across examinees so that an item's characteristics as
measured online will accurately reflect that item's performance overall (i.e.,
over all examinees and over all possible computer-based tests).
Reliability
Validity
Data Analyses
For testing programs that administer exams periodically (e.g., the traditional
paper-and-pencil format), as opposed to continuously, data analyses are easily
scheduled and performed. After each periodic, large-group administration, a
number of psychometric procedures are routinely conducted using the full set of
examinee test data (Crocker & Algina, 1986). Item performance can be
examined through computation of the classical item statistics measuring
difficulty (p-values) and discrimination (point-biserial correlation coefficient),
or through IRT calibration. Distractor analyses can be performed to determine
the proportion of examinees selecting each response option. The possibility of
item bias is usually investigated through an analysis of differential item
functioning (DIF). For fixed-form exams, the mean performance of examinees
and the reliability of each test form can be computed, and if necessary, each test
form can be equated to a base form, and standard setting can be conducted.
In contrast, for most computer-based testing programs, exams are
administered far more frequently and to far fewer examinees on any given test
date. Each of these CBT exams can be scored as soon as each individual
examinee has completed testing. However, the change to continuous test
administration necessitates a change in the procedures for the analysis of test
data. The testing program can establish regular intervals (e.g., quarterly) for item
and test data analysis. All test data accumulated during each of these periods can
be collapsed, and various group-level analyses can be conducted. Reports can
then be generated, summarizing the results of item, subtest, and test analyses for
the testing period (as opposed to the single test-administration date used in more
traditional test programs). For some analyses, particularly analyses on examinee
subgroups, an insufficient number of examinees may test during a reporting
interval. In these instances, data should be collected over time until sufficient
samples are available, at which point the additional analyses can be conducted
(ATP, 2000).
This delayed approach to item and test analysis is not entirely satisfactory. In
less frequently administered, standardized, paper-and-pencil testing, these
analyses can be conducted prior to the release of test scores. If problems are
found (e.g., a negative discrimination or a high DIF value), then the problematic
items can be removed before the test is scored. However, when these analyses
are delayed, as is the case in most CBT applications, the possibility exists that
flawed items remain in the operational pool for a much longer period and will be
included in many examinees' final test scores.
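
As an illustration of the DIF analyses referred to above, the sketch below computes the Mantel-Haenszel common odds ratio for a single item, reported on the ETS delta scale; the data layout and the matching of examinees on total score are assumptions made for the example.

    import math

    # Illustrative sketch only: Mantel-Haenszel DIF for one item.
    # `records` is a list of (total_score, group, correct) tuples, with group in
    # {"ref", "focal"} and correct in {0, 1}.

    def mantel_haenszel_dif(records):
        strata = {}
        for total, group, correct in records:          # match examinees on total score
            strata.setdefault(total, []).append((group, correct))

        num, den = 0.0, 0.0
        for stratum in strata.values():
            n = len(stratum)
            a = sum(1 for g, u in stratum if g == "ref" and u == 1)    # reference correct
            b = sum(1 for g, u in stratum if g == "ref" and u == 0)    # reference incorrect
            c = sum(1 for g, u in stratum if g == "focal" and u == 1)  # focal correct
            d = sum(1 for g, u in stratum if g == "focal" and u == 0)  # focal incorrect
            num += a * d / n
            den += b * c / n

        if num == 0 or den == 0:
            return float("nan")                         # not estimable from these data
        # ETS convention: MH D-DIF = -2.35 * ln(common odds ratio).
        return -2.35 * math.log(num / den)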
Comparability
There are a number of circumstances under which the comparability of test
scores needs to be established. An example in which comparability might be a
concern is the situation in which a computer-based test and a paper-and-pencil
version of the same test are offered for simultaneous administration. This
situation is referred to as maintaining an exam program across dual platforms.
There may also be versions of the CBT in multiple languages. In addition, for
security reasons a testing program may use more than one item pool. Finally, the
maintenance of a single item pool may produce substantive changes in the pool
over time, as items in the pool are removed and new items are used to replenish
the pool. In all of these cases, it may be important to investigate test
comparability (Wang & Kolen, 1997).
The initial concern for comparability was that of mode effect (APA, 1986).
Numerous studies compared examinee test scores obtained across computer and
paper-and-pencil test administration modes. Reviews of many of these early
comparability studies were conducted by Mazzeo and Harvey (1988) and by
Mead and Drasgow (1993). The overall results of these early comparability
studies suggested that test administration mode did not result in significant
differences for most power tests, although differences were found for speed
tests. However, some items and some item types did produce mode differences.
It is therefore recommended that any test program using scores as though they
were equivalent across dual platforms document the results of a comparability
study (APA, 1986; ATP, 2000). In fact, most of the early comparability studies
investigated tests composed of discrete, single-screen item types. There is a
concern that multiscreen items, graphical items, and innovative item types may
be more subject to mode effect differences. This issue is related to the potential
effect that specific item presentation elements may have on item performance
(Godwin, 1999; Pommerich & Burden, 2000). Pommerich and Burden (2000)
provide an example of research into the impact that subtle item formatting
differences may have across modes.
Another aspect of comparability may arise when two or more pools are used
for the construction of adaptive tests. If more than one pool is used to assemble
tests or if the items in a single pool change over time, then differences in item
content or statistical characteristics could create a lack of comparability (Wang
& Kolen, 1997) . Within adaptively administered exams, an additional
comparability issue is that of content balancing across individual examinees.
Content rules can be incorporated into the item selection algorithms to address
these issues, but comparability studies may be needed to document the level of
their effectiveness.
Summary
Table 2.1 provides a list of highlights from the topics related to test
administration and development introduced in this chapter. Several of these
topics are further addressed later in this book, particularly as they are related to
specific test delivery methods and other aspects of computer-based testing.
References
American Educational Research Association (AERA), American Psychological
Association (APA), and the National Council on Measurement in Education
(NCME). (1985). Standards for educational and psychological testing. Washington,
DC: APA.
American Educational Research Association (AERA), American Psychological
Association (APA), and the National Council on Measurement in Education
(NCME). (1999). Standards for educational and psychological testing. Washington,
DC: AERA.
American Psychological Association Committee on Professional Standards and
Committee on Psychological Tests and Assessment (APA). (1986). Guidelines for
computer-based tests and interpretations. Washington, DC: Author.
Association of Test Publishers (ATP). (2000). Computer-Based Testing Guidelines.
Clauser, B. E., & Schuwirth, L. W. T. (in press). The use of computers in assessment. In
G. Norman, C. van der Vleuten, & D. Newble (Eds.), The International Handbook
for Research in Medical Education. Boston: Kluwer Publishing.
Colton, G. D. (1997). High-tech approaches to breaching examination security. Paper
presented at the annual meeting of NCME, Chicago.
Crocker, L. & Algina , J. (1986) . Introduct ion to Classical and Modern Test Theory. Ft.
Worth: Holt, Rinehart & Winston.
Godwin, J. (1999, April). Designing the ACT ESL Listening Test. Paper presented at the
annual meeting of the National Council on Measurement in Education, Montreal,
Canada.
Green, B. F., Bock, R. D., Humphreys, L. G., Linn, R. L., & Reckase, M. D. (1984).
Technical guidelines for assessing computerized adaptive tests. Journal of
Educational Measurement, 21, 347-360.
Hambleton, R. K., & Jones, R. W. (1994). Item parameter estimation errors and their
influence on test information functions. Applied Measurement in Education, 7,
171-186.
Mazzeo, J., & Harvey, A. L. (1988) . The equivalence of scores from automated and
conventional educational and psychological tests: A review of the literature (College
Board Rep. No . 88-8, ETS RR No . 88-21) . Princeton, NJ : Educational Test ing
Service.
Mead, A. D., & Drasgow, F. (1993). Equivalence of computerized and paper-and-pencil
cognitive ability tests: A meta-analysis. Psychological Bulletin, 9, 287-304.
NCME Software Committee. (2000). Report of NCME Ad Hoc Committee on Software
Issues in Educational Measurement. Available online:
http://www.b-a-h.com/ncmesoft/report.html.
O'Neal, C. W. (1998). Surreptitious audio surveillance: The unknown danger to law
enforcement. FBI Law Enforcement Bulletin, 67, 10-13.
Parshall, C. G. (In press). Item development and pretesting. In C. Mills (Ed.), Computer-
Based Testing. Lawrence Erlbaum.
Pommerich, M., & Burden, T. (2000). From simulation to application: Examinees react to
computerized testing. Paper presented at the annual meeting of the National Council
on Measurement in Education, New Orleans.
Rosen, G.A. (2000, April). Computer-based testing: Test site security. Paper presented at
the annual meeting of the National Council on Measurement in Education, New
Orleans.
Shermis, M., & Averitt, 1. (2000, April). Where did al1 the data go? Internet security for
Web-based assessments. Paper presented at the annual meeting of the National
Council on Measurement in Education, New Orleans.
Vale, C. D. (1995). Computerized testing in licensure. In J. C. Impara (Ed.), Licensure
Testing: Purposes, Procedures, and Practices. Lincoln, NE: Buros Institute of
Mental Measurement.
Wainer, H. (Ed.) (1990). Computerized Adaptive Testing: A Primer. Hillsdale, NJ:
Lawrence Erlbaum.
Wang, T., & Kolen, M. J. (1997, March). Evaluating comparability in computerized
adaptive testing: A theoretical framework. Paper presented at the annual meeting of
the American Educational Research Association, Chicago.
Way, W. D. (1998). Protecting the integrity of computerized testing item pools.
Educational Measurement: Issues and Practice, 17, 17-27.
Additional Readings
Bugbee, A. c., & Bernt, F. M. (1990). Testing by computer : Findings in six years of use.
Journal ofResearch on Computing in Education, 23, 87-100.
Buhr, D. C., & Legg, S. M. (1989) . Development of an Adaptive Test Version of the
College Level Academic Skills Test . (Institute for Student Assessment and
Evaluation, Contract No. 88012704). GainesviIIe, FL: University of Florida.
Bunderson, C. V., Inouye, D. K., & Olsen, J. B. (1989). The four generations of
computerized educational measurement. In R. L. Linn (Ed.), Educational
Measurement (3rd ed., pp. 367-408). New York: Macmillan.
Eaves, R. C., & Smith, E. (1986) . The effect of media and amount of microcomputer
experience on examination scores. Journal ofExperimental Education, 55, 23-26.
Eignor, D. R. (1993, April). Deriving Comparable Scores for Computer Adaptive and
Conventional Tests: An Example Using the SAT. Paper presented at the annual
meeting of the National Council on Measurement in Education, Atlanta.
Greaud, V. A., & Green, B. F. (1986). Equivalence of conventional and computer
presentation of speed tests. Applied Psychological Measurement, 10, 23-34.
Green, B. F., Bock, R. D., Humphreys, L. G., Linn, R. L., & Reckase, M. D. (1984).
Technical guidelines for assessing computerized adaptive tests. Journal of
Educational Measurement, 21, 347-360.
Haynie, K. A., & Way, W. D. (1995 , April) . An Invest igation of Item Calibration
Procedures for a Computerized Licensure Exam ination. Paper presented at
symposium entitled Computer Adaptive Testing, at the annual meeting of NCME ,
San Francisco.
Heppner, F. H., Anderson, J. G. T., Farstrup, A. E., & Weiderman, N. H. (1985). Reading
performance on a standardized test is better from print than from computer display.
Journal of Reading, 28, 321-325.
Hoffman, K. I., & Lundberg, G. D. (1976) . A comparison of computer-mon itored group
tests with paper-and-pencil tests. Educational and Psychological Measurement, 36,
791-809.
3 Examinee Issues
Overall Reactions
scoring typical of many CBT programs. They are also positive about the shorter
testing times possible under adaptive test-delivery methods. Many examinees
prefer responding to multiple-choice items on a CBT, where they can click or
select an answer, rather than having to "bubble in" a response on a scannable
paper form. Examinees may also indicate that certain innovative item types
seem to provide for more accurate assessment of their knowledge and skills in a
given area.
Examinees have also reported negative reactions to computerized testing.
Some examinees have a general anxiety about the computer itself, while others
are more concerned about whether their level of computer experience is
adequate for the task (Wise, 1997). Sometimes examinees object to computer-
based tests because of built-in task constraints. They find it more difficult or
awkward within a CBT environment to do such things as take notes or perform
mathematical computations. More specifically, examinees may have difficulty
with certain aspects of the user interface, or they may object to particular
elements of adaptive test delivery.
The primary examinee objections are discussed in somewhat more detail
later, along with suggestions for alleviating their potential negative impact.
Impact of Prior Computer Experience

The level of computer experience that examinees possess as they take an exam
can be of critical importance due to its potential impact on test-score validity.
Obviously, to the greatest extent possible, the test should be designed so that
examinees' level of computer experience does not affect their test scores.
When an exam consists solely of simple, discrete multiple-choice items,
computer experience is likely to be only a minor issue for the majority of
examinees. In this type of application, there are no multiscreen items, no need
for scrolling, and no complex response actions required from the examinee.
Examinees with even minimal prior exposure to computers will usually be able
to respond to these item types without error, although not necessarily without
anxiety. Under these conditions, a computerized version of an exam is likely to
demonstrate similar performance to a paper-and-pencil version, and no mode
effect may be in evidence.
However, even a modest level of computer experience may be too much to
expect of some examinees. For example, a testing program with examinees from
countries with low levels of technology will need to be particularly concerned
about the potential detrimental effects of limited computer experience on
examinees' test scores.
A concern for test equity suggests that whenever different examinee sub-
groups have very different levels of opportunity for computer access and
experience, it is important to ensure that the CBT is designed to minimize these
differences (Gallagher, Bridgeman, & Cahalan, 1999; O'Neill & Powers, 1993;
Sutton, 1993). In other words, the skills and content knowledge of interest ought
to be measured in such a way that no more prior computer experience is needed
than the least experienced examinee subgroup may be expected to have.
Prior computer experience may also be a greater issue when more complex
item types are used. Some innovative item types require greater computer skills
than those needed to select a multiple-choice response option (Perlman, Berger,
& Tyler, 1993). Although it is often desirable to use these additional item types
in order to expand the test coverage, test developers need to be attentive to the
expected levels of computer experience in their test population as the item types
are developed and field tested (Bennett & Bejar, 1998).
In general, test developers can best address the issue of prior computer
experience by first obtaining a clear understanding of the computer experience
levels in their test population. That information should then be utilized as the
test is designed and developed. This will help ensure that the item types and test-
user interface are designed appropriately and that the resulting test scores will
not include measurement error arising from differing background computer
skills. The CBT typically should also include specific instructions and practice
items for each item type included on an exam. In some exam programs, more
extensive practice tests can be included with the advance test-preparation
materials to give examinees additional opportunity to become comfortable with
the CBT software. Taken together, these steps will help ensure that a modest
level of computer experience will be adequate to the task and that individual
examinees will not be unfairly disadvantaged.
The more "intuitive" the computer test software is, the less attention an
examinee needs to give to it-and the more attention he or she is free to give to
the test items instead. It is this point that makes user interface design in
computer-based tests so important.
The user interface primarily consists of the functions and navigation features
available to the user, along with the elements of screen layout and visual style.
The interface can be thought of as those components of a software program that
a user sees and interacts with. A good user interface should demonstrate
consistency and clarity and generally reflect good interface design principles.
While examinees may be unaware of the underlying psychometrics of an
exam, they will be aware of and immediately concerned with the test's interface.
In fact, an examinee is likely to perceive a CBT as consisting simply of the test
items and the software's user interface. If the user interface is confusing,
clumsy, or simply time-consuming, examinees will experience more frustration
and anxiety. Beyond these affective reactions, extensive research has shown that
a well-designed user interface makes software easier to learn, easier to use, and
less prone to user error (Landauer, 1996; Tullis, 1997). When these facts are
applied to CBT, they suggest that a good user interface can reduce measurement
error (and conversely that a poor interface can increase it).
The goal in CBT interface design is to make the user interface as transparent
as possible, so that only the test items remain consequential (Harmes & Parshall,
2000). Examinees primarily interact with a test's item screens and information
screens. As the examinees encounter these screens, they need to know clearly
what part of the screen to attend to and how to navigate from one screen to
another. For item screens, they also need to know how to respond. The interface
design can be used to guide the examinees to these elements in a clear and
simple manner. A user interface that successfully communicates this information
to the examinees will reduce the dependence on verbal instructions, practice
items, and help screens. It will also help reduce examinee frustration and
anxiety. (More information about user interface design, and the process of
usability testing to ensure a successful design, is provided in Chapter 4.)
Effects of Task Constraints

In any assessment, there are elements that constrain the kinds of actions an
examinee may take and the kinds of responses that he or she is able to give.
Some task constraints are directly and intentionally built into the task, perhaps
for content or cognitively based reasons or perhaps for reasons of scoring
convenience. In other cases, a task constraint arises more as a by-product of
some other aspect of the test mode environment. It is therefore evident that task
constraints can have more or less construct relevance. Ideally, test developers
want to design a test so that construct-relevant task constraints appropriately
guide the examinee's process of taking a test and responding to items, while the
effects of task constraints that are construct-irrelevant are reduced as much as
possible (MilIman & Greene, 1989).
paper at the test site to help address this issue, although this does not address the
inconvenience of switching between paper-and-pencil tools and computer input
devices.
The effects of task constraints can also be seen in the test-taking experience as a whole.
Examinees have access to one set of test-taking strategies with paper-and-pencil
tests, where scanning the entire test, placing a pencil mark next to an item for
later review, and jumping directly to another item-type section are all actions
easily taken. None of these facilities were intentionally designed into the paper-
and-pencil mode, although test-preparation materials frequently suggest that
examinees make use of them. With a CBT, viewing the entire test is difficult,
placing a computerized mark is possible only indirectly, and jumping to a given
section may not even be feasible if sections do not exist. Some of these task
constraints on CBT test-taking strategies are the result of test design elements,
such as item selection rules and navigation elements of the user interface. Others
are simply the by-product of the limitations on screen resolution compared to
print (i.e., more items at a time can be clearly displayed in a print format).
In brief, task constraints shape the ways in which examinees can respond to
items and tasks. They can focus and limit examinee response actions in ways
that are cognitively appropriate and that make the scoring process easier and
more reliable. Task constraints can also limit the types of questions that can be
asked or confound a task and make it inappropriately more challenging. More
subtly, they may even affect the ways in which examinees are able to
conceptualize the tasks (Bennett & Bejar, 1998). In any of these cases, they
diminish the value of the assessment.
To address CBT task constraints, test developers should first ensure that the
items and the test as a whole include those task constraints that are endorsed for
construct-relevant reasons. These intentional task constraints might apply to the
ways in which examinees respond to items, other actions the examinee must
make, and the test-taking process as a whole. A more challenging second step is
to thoroughly examine all the tasks an examinee is asked to do in the CBT in
order to identify additional sources of task constraints. These unintentional task
constraints should be analyzed for their potential negative impact on examinees
and their test performance. Once such problem areas are identified, the test
developers need to either change the task to make it more appropriate or find
ways to prepare the examinees to deal appropriately with the construct-irrelevant
elements.
Effects of the Administrative Process

In addition to aspects of the test delivery method, the software interface, and the
task constraints, examinees are impacted by, and have reactions to, aspects of
the test-administration process. While examinees tend to be pleased with certain
Examinees' Mental Models
for themselves, even when they only have very sketchy information. In the case
of computer-based tests, examinees hold mental models about the testing
processes that the computer follows. An examinee's reactions to computerized
testing can be conceptualized as partially resulting from the accuracy of his or
her mental model. Some mental models are inaccurate; there is a conflict or
discrepancy between what the examinee perceives as transpiring and the actual
process underway. Some negative reactions on the part of examinees are
actually due to these inaccurate mental models.
As mentioned previously, some examinees have expressed concern about
their own apparent poor performance when other examinees at the same test
administration site ended their exams and left the room early. An incorrect
mental model in this instance is that all examinees were being administered the
same test and the same number of items. Because of this internal
misrepresentation of the CBT administration process, some examinees felt
unnecessarily anxious about their "slower" performance-when in fact they
were likely to have been taking different tests of different lengths and
difficulties.
Another example of distress caused by an inaccurate mental model was the
anxiety felt by some examinees when their variable-length exam ended quickly.
Some of these examinees assumed that they failed the test and had not been
given an adequate opportunity to "show what they knew." In fact, testing may
well have ended quickly because their performance was very strong, and they
had already passed.
Another mental model that could cause examinees difficulty is related to
adaptive item selection. Examinees are sometimes made anxious by trying to use
the difficulty of the current item as a form of feedback about the correctness of
their previous response. That is, when the current item appears to be easier than
the last, they assume the previous answer was incorrect. This assumption may
well be wrong, however, because item selection is usually complicated by
content constraints, exposure control, and other factors unknown to the
examinee. In fact, examinees may not be particularly good at determining the
relative difficulty of one item from another. Given these additional elements of
item selection, an examinee's guesses about his or her test performance based on
the perceived difficulty ordering of the items is likely to be in error, although it
still has the power to create anxiety and distress.
Often, the best way to address examinees' mental models is to ensure that
they have accurate ones. This will be the result of giving the examinees accurate
information, correcting misrepresentations, and directly addressing particular
areas of concern.
Summary
Table 3.1 provides a summary of some of the affective reactions examinees have
expressed toward computer-based tests as discussed in this chapter, along with
possible courses of action.
Table 3.1. Some Examinees' Concerns about CBTs and Possible Courses of Action

Overall
    Reaction: Concern about testing on computer.
    Course of action: Provide instructional materials, sample items, and practice tests.

Adaptive tests
    Reaction: Dislike of the lack of flexibility regarding item omits, reviews, revisions, and previews.
    Course of action: Provide flexibility, either in the total test or in short sets.
    Reaction: Concerns about the average difficulty of the test.
    Course of action: Target the item selection to a probability of success greater than .5.
    Reaction: Concerns about test progress under variable-length exams.
    Course of action: Develop progress indicators to provide information to the examinee.

Prior computer experience
    Reaction: Anxiety about the potential impact of insufficient prior experience.
    Course of action: Provide good informational materials, both in advance and during testing.

User interface
    Reaction: Difficulty learning and using the CBT administration software.
    Course of action: Thoroughly test the exam and item user interfaces on the target population.

Task constraints
    Reaction: Frustrations with constraints on the test-taking process.
    Course of action: Carefully evaluate the test and item types for construct-irrelevant constraints.

Administrative process
    Reaction: Anxiety when other examinees start or end during their tests.
    Course of action: Inform examinees that others are taking different tests of different lengths.

Mental models
    Reaction: Concern about the correctness of an item response when the following item is perceived to be easier.
    Course of action: Inform examinees about the numerous factors used in item selection, beyond item difficulty.
References
Association of Test Publishers (ATP). (2000). Computer-Based Testing Guidelines.
Bennett, R. E., & Bejar, I. I. (1998). Validity and automated scoring : It's not only the
scoring. Educational Measurement Issues and Practice, 17, 9-17 .
Bunderson, C. V., Inouye, D. K., & Olsen, J. B. (1989). The four generations of
computerized educational measurement. In Linn, R. (Ed.), Educational
Measurement, 3rd edition. New York: American Council on Education and
Macmillan Publishing Co.
Gallagher, A., Bridgeman, B., & Cahalan, C. (1999, April). The Effect of CBT on Racial,
Gender, and Language Groups. Paper presented at the annual meeting of the
National Council on Measurement in Education, Montreal.
Harmes, J. C., & Parshall, C. G. (2000, November). An Iterative Process for
Computerized Test Development: Integrating Usability Methods. Paper presented at
the annual meeting of the Florida Educational Research Association, Tallahassee.
Kingsbury, G. G. (1996, April). Item Review and Adaptive Testing. Paper presented at
the annual meeting of the National Council on Measurement in Education, New
York.
Landauer, T. K. (1996). The Trouble with Computers: Usefulness, Usability, and
Productivity. Cambridge, Mass: MIT Press.
Lunz, M. E., & Bergstrom , B. A. (1994). An empirical study of computerized adaptive
test administration conditions. Journal ofEducational Measurement. 31,251-263.
Lunz, M. E., Bergstrom, B. A., & Wright, B. D. (1992). The effect of review on student
ability and test efficiency for computer adaptive testing. Applied Psychological
Measurement, 16, 33-40.
MilIman, J., & Greene, 1. (1989) . The specification and development of tests of
achievement and ability . In Linn, R. (ed.) Educational Measurement, 3rd edition.
New York: American Council on Education and Macmillan Publishing Co.
Norman, D. A. (1990). The Design of Everyday Things. New York: Doubleday.
O'Neill, K., & Kubiak, A. (1992 , April) . Lessons Learned from Examinees about
Computer-Based Tests: Attitude Analyses. Paper presented at the annual meeting of
the National Council on Measurement in Education, San Francisco.
O'Neill, K., & Powers, D. E. (1993, April). The Performance of Examinee Subgroups on
a Computer-Administered Test of Basic Academic Skills. Paper presented at the
annual meeting of the National Council on Measurement in Education, Atlanta.
Perlman, M., Berger, K., & Tyler, L. (1993). An Application of Multimedia Software to
Standardized Testing in Music. (Research Rep. No. 93-36). Princeton, NJ:
Educational Testing Service.
Pommerich, M., & Burden, T. (2000, April). From Simulation to Application: Examinees
React to Computerized Testing. Paper presented at the annual meeting of the
National Council on Measurement in Education, New Orleans.
Rosen, G. A. (2000, April). Computer-Based Testing: Test Site Security. Paper presented
at the annual meeting of the National Council on Measurement in Education, New
Orleans.
Stocking, M. L. (1997). Revising item responses in computerized adaptive tests : A
comparison of three models. Applied Psychological Measurement, 21, 129-142 .
Sutton, R. E. (1993, April). Equity Issues in High Stakes Computerized Testing. Paper presented at the annual meeting of the American Educational Research Association, Atlanta.
Tullis, T. (1997). Screen Design. In Helander, M., Landauer, T. K., & Prabhu, P. (eds.) Handbook of Human-Computer Interaction, 2nd completely revised edition, (503-531). Amsterdam: Elsevier.
Vispoel, W. P., Hendrickson, A. B., & Bleiler, T. (2000). Limiting answer review and change on computerized adaptive vocabulary tests: Psychometric and attitudinal results. Journal of Educational Measurement, 37, 21-38.
Vispoel, W. P., Rocklin, T. R., Wang, T., & Bleiler, T. (1999). Can examinees use a review option to obtain positively biased ability estimates on a computerized adaptive test? Journal of Educational Measurement, 36, 141-157.
Way, W. D. (1994). Psychometric Models for Computer-Based Licensure Testing. Paper
presented at the annual meeting of CLEAR, Boston.
Wise, S. (1996, April). A Critical Analysis of the Arguments for and Against Item
Review in Computerized Adaptive Testing. Paper presented at the annual meeting of
the National Council on Measurement in Education, New York.
Wise, S. (1997, April). Examinee Issues in CAT. Paper presented at the annual meeting
of the National Council on Measurement in Education, Chicago.
Additional Readings
Becker, H. J., & Sterling, C. W. (1987). Equity in school computer use: National data and neglected considerations. Journal of Educational Computing Research, 3, 289-311.
Burke, M. J., Normand, J., & Raju, N. S. (1987). Examinee attitudes toward computer-
administered ability testing. Computers in Human Behavior, 3, 95-107.
Koch, B. R., & Patience, W. M. (1978). Student attitudes toward tailored testing. In D. J.
Weiss (ed.), Proceedings of the 1977 Computerized Adaptive Testing Conference.
Minneapolis: University of Minnesota, Department of Psychology.
Llabre, M. M., & Froman, T. W. (1987). Allocation of time to test items: A study of ethnic differences. Journal of Experimental Education, 55, 137-140.
Moe, K. C., & Johnson, M. F. (1988). Participants' reactions to computerized testing. Journal of Educational Computing Research, 4, 79-86.
Ward, T. J. Jr., Hooper, S. R., & Hannafin, K. M. (1989). The effects of computerized
tests on the performance and attitudes of college students. Journal of Educational Computing Research, 5, 327-333.
Wise, S. L., Barnes, L. B., Harvey, A. L., & Plake, B. S. (1989). Effects of computer anxiety and computer experience on the computer-based achievement test performance of college students. Applied Measurement in Education, 2, 235-241.
4
Software Issues
User Interfaces
The user interface can be thought of as those components of a software program
that a user sees and interacts with. The software interface comprises the
functions available to the user, the forms of navigation, and the level and type of
interactivity, as well as the visual style, screen layout, and written
communication to the user. This chapter focuses on the user interfaces
encountered by the examinee, as this is the most critical measurement concern of
computer-based testing software. However, most CBT software programs also
include program modules for test development processes such as item entry and
test form assembly. These program elements also have user interfaces and good
design of these interfaces is of critical importance to the test developers who
must use them.
A program's interface can be either easy or difficult to learn. It can be either
easy or cumbersome to use and, likewise, it can be intuitive or confusing. A
well-designed user interface can result not only in software that is easier to learn
and use but also in software that is less prone to miskeys and other entry errors.
An easy, intuitive software interface is probably always desirable, but it is
particularly important for a critical application like computerized testing . If
examinees must focus too much attention or learning on using the software
program, the amount of their concentration available for the actual test may be
lessened. Examinees' reactions to taking a test on computer are very likely to be
affected by the quality of the user interface; their performance may also be
affected. This would clearly lessen the validity and usefulness of their test
scores.
Unlike more comprehensive software applications, the software needed for
computerized test administration often has a reduced number of features
available, resulting in a limited user interface. This is because, in general, there
is only a small set of actions that an examinee needs to make in order to take a
computerized test. For example, a simple CAT might allow an examinee to do
no more than click on a response option for each multiple-choice item and then
click to proceed to the next item. On the other hand, tests in which examinees
are free to review and revise their responses must provide additional
navigational functions, adding to the complexity of the user interface. And tests
that include innovative item types often require more complex response actions
on the part of the examinees, resulting in more elaborate user interfaces. Those
test programs that have examinee populations with little computer experience
need to exercise particular care to provide simple interfaces. Test programs that
include complex multiscreen items or simulation tasks have particular
challenges.
Test developers who are producing their own computerized testing software
need to make decisions about the features and functions to include, as well as
about the appearance of the software screens. These professionals are advised to
conduct usability studies throughout the software development process in order
to ensure the effectiveness of the software interface and to avoid costly
mistakes. Testing professionals who are purchasing commercial software for
computerized testing also need to make decisions about the kinds of
computerized test features their exam programs should have. They can then
evaluate existing commercial software packages to determine the program that
best meets their needs.
Brief discussions of usability studies and software evaluation are provided
next, followed by more details on some of the user interface design issues that
are most pertinent to computerized testing software.
Usability Studies
Usability may be defined as the degree to which a computerized application is
easy to learn, contains the necessary functionality to allow the user to complete
the tasks for which it was designed, and is easy and pleasant to use (Gould &
Lewis, 1985). Usability studies are the means by which software developers
evaluate and refine their software design, particularly the user interfaces.
Landauer (1997) argues that usability studies help produce software that is both
more useful, in that it performs more helpful functions, and more usable, in that
it is easier and more pleasant to learn and operate.
A wide variety of methods exist for examining the usability of a software
program, ranging from informal reviews to full-scale, laboratory-based
experiments (e.g., Kirakowski & Corbett, 1990; Nielsen & Mack, 1994;
Schneiderman, 1998). In informal studies, potential users of the program are
asked to interact with an early version or prototype of the software and either
attempt to undertake some realistic tasks or simply note problems with the
software design and features. In this way, the software developers obtain
information about the program's value, utility, and appeal. To be most effective,
usability studies should spring from a developmental focus on users, including a
thorough understanding of characteristics of the users, as well as of the nature of
the tasks to be performed (Gould, Boies, & Ukelson, 1997; Landauer, 1997).
Usability studies are important because they help ensure good software designs; making design decisions without obtaining user input, by contrast, often leads to
design flaws and operational problems. Testing successive versions of the
interface on actual users and making changes based on the feedback provided
lead to a program that meets users' needs and functions well in practice. This
can also result in reduced user frustration, anxiety, and error rates, as well as an
increased likelihood that the program will be selected for use. In fact, there is an
extensive research base that documents the effectiveness of even very simple
and low-cost usability methods in providing improvements so that the software
is easier to learn, easier to use, and less subject to user entry errors (see, e.g.,
Landauer, 1995; Harrison, Henneman, & Blatt, 1994; Mayhew & Mantei, 1994;
and Tullis, 1997). Furthermore, the inclusion of usability testing often results in
cost savings to the software development process by prioritizing product
features, reducing programming costs, and decreasing maintenance and support
expenses (Bias & Mayhew, 1994; Karat, 1997; Ehrlich & Rohn, 1994).
For computerized test administration software, an important result of a well-
designed user interface is that examinees have to spend less time and attention
on how to take the test and more on the actual test itself. A peripheral benefit of
an interface that can be quickly learned by new users is that it reduces the
amount of online training that needs to be developed and administered. More
critically, a poor or confusing interface is a potential source of measurement
error (Bennett & Bejar, 1998; Booth, 1991; Bunderson, Inouye, & Olsen, 1989).
Usability studies may be conducted in numerous ways and a number of
articles and books have described or compared methods (Gould, Boies, &
Ukelson, 1997; Kirakowski & Corbett, 1990; Landauer, 1997; Nielsen, 1993;
Nielsen & Mack, 1994). Users may be asked to perform specific tasks within the
program and then either be directly observed or videotaped as they attempt the
tasks. Questionnaires, interviews, and talk-aloud protocols or focus groups may
be used to determine the interpretation users are making about the software and
how it functions . The computer may store information about the actions users
take, in addition to frequencies or times (through such methods as keystroke
logging, etc.). Computers can even be used to track users' eye movements as they view each screen. Usability studies can be conducted using very elaborate,
high-fidelity prototypes of the actual software interface, but they can also be
conducted using very simple mock-ups of the interface. In fact, paper-and-pencil
prototypes can be used, with human interaction imitating any program
interactivity. (This approach has been referred to as the Wizard of Oz method of
user testing.) A simple but highly effective method is that of user testing. In
user tests, or user observation, a potential user of the software program is
observed as he or she attempts to use the software to carry out realistic tasks.
Nielsen (1994a, 1994b) has documented the effectiveness of low-cost, informal
usability methods, including what he has termed "discount user testing." He has
suggested that the greatest cost-benefit ratio actually comes from testing as few
as five users, with as many iterative stages as possible (Nielsen, 2000). One set
of basic steps for conducting a user test is provided in Table 4.1 (Apple, Inc.,
1995).
Table 4.1 Ten Steps for Conducting a User Observation (adapted from Apple,
Inc., 1995)
1. Introduce yourself and describe the purpose of the observation (in very
general terms). Most of the time, you shouldn't mention what you'll be
observing.
2. Tell the participant that it's okay to quit at any time.
3. Talk about the equipment in the room.
4. Explain how to think aloud.
5. Explain that you will not provide help.
6. Describe in general terms what the participant will be doing.
7. Ask if there are any questions before you start; then begin the observation.
8. During the observation, remember several pointers: Stay alert; ask
questions or prompt the participant; be patient.
9. Conclude the observation. Explain what you were trying to find out,
answer any remaining questions, and ask for suggestions on how to
improve the software.
10. Use the results.
Gould, Boies, and Ukelson (1997) have emphasized the importance of early user testing. An adaptation of their checklist for achieving early user testing is provided in Table 4.2.
Table 4.2. Checklist for Achieving Early User Testing (adapted from Gould,
Boies, & Ukelson, 1997)
We did follow-up studies on people who are now using our system.
In the most general sense, usability studies help software developers ensure
that all the necessary and desirable functions are present in the software and that
users are easily able to tell what functions are present and how to use them. The
specific nature of the necessary software features and how they might be
accessed are tied to the purpose of the software program. For computerized
testing software, an important consideration to keep in mind is that a typical user
of the software only encounters the program once. Users of the program should
be able to learn to use the program quickly because, with the exception of those examinees who can or must retest, they will not have the opportunity to learn and
use the software over repeated attempts.
Software Evaluation
Software evaluation is a process through which users can, somewhat formally,
consider a potential software program in terms of specific criteria (Alessi &
Trollip, 1991). Test developers who intend to purchase commercial software for
computerized test administration can use the software evaluation process to
analyze and compare existing commercial programs. Computerized test
administration software packages vary considerably in the features and functions
they provide, as well as in their cost and ease of use. For example, programs
vary in terms of their item banking capabilities, their ability to handle a variety
of graphics formats, and the number and kinds of item types they support. They
also differ in terms of test scoring and reporting functionality and in the quality
of their user manuals and technical support. A thorough evaluation of competing
software programs will help ensure that a testing agency selects the program that
best meets its needs and the needs of its examinees. Testing agencies that are producing their own computerized test software can also use software evaluation as part of the development process.
Testing professionals can begin the software evaluation process by
developing a list of features that they would find desirable in a computerized
testing software program selected for their use. While any software program is
the result of compromises or tradeoffs between often-contradictory potential
features, the process of evaluating multiple CBT software packages will enable
the test developers to identify the software program where those tradeoffs come
closest to matching the test program's most critical needs and desirable features.
The list of features can be prioritized and then used to help structure and guide
the test developers in evaluating the available commercial software programs.
The features list can be compared to software program descriptions provided in
brochures, manuals, company Web sites, and other documentation as an initial
test of a given program's ability to satisfy the testing agency's needs. The
software programs that appear to be at least minimally appropriate can then be
evaluated further.
The full evaluation of a potential software program involves actual use of the
program to carry out a variety of typical and realistic tasks. The evaluator should
take a sample test, responding to on-screen instructions and prompts as carefully
and correctly as possible. The evaluator should also use the software package,
responding the way a careless or inattentive user might respond. This enables
the evaluator to determine how the program behaves when a user keys or clicks
on the wrong options, enters misspellings, or makes mistakes in navigation. The
evaluator should also have representatives from the examinee population use the
software to determine further whether the instructions and interface are clear to
the target user group. These users can be observed as they interact with the
software in a similar approach to the process of user testing discussed earlier. (A
full evaluation of the program should also include a thorough examination of
those software components and modules that test development staff would use.
A variety of realistic test developer tasks, such as item entry, test form assembly,
and item analyses or score reporting, should also be undertaken.)
The evaluator should also look for a program that follows good software
design principles and is free of errors or "bugs." The program should provide
"forgiveness," or the ability to change user actions that were not intended. This
does not refer to the test delivery method but rather to user actions in terms of
navigation and keystrokes. Ideally, the program should provide clear
instructions, intuitive navigation, feedback to user actions, and an appealing
screen design. Table 4.3 displays an example of a checklist that can be used to
evaluate computer-based testing software for elements such as visual clarity,
clear instructions and help screens, consistency, and facility with error
prevention or correction (Harmes & Parshall, 2000).
Table 4.3. Checklist for Evaluating CBT Software (adapted from Harmes & Parshall, 2000). Each question is rated: Always = 4, Most of the time = 3, Some of the time = 2, Never = 1.

Consistency
    Are icons, symbols, and graphics used consistently throughout the test?
    Are different colors used consistently throughout the test (e.g., questions are always in the same color)?
    Is the same type of information (e.g., test questions, navigation, instructions) in the same location on each screen?
    Is the way the software responds to a particular user action consistent at all times?
    Comments on Consistency:

Error Prevention/Correction
    Can the examinee look through other items within a section (forward or backward)?
    Is there an easy way for the examinee to correct a mistake?
    Is the examinee able to check what they have entered before the answer is evaluated?
    Is there online help (in terms of using the software) available for the examinee?
    Comments on Error Prevention/Correction:
A number of guidelines for software evaluation exist, and there are formal user interface design principles (e.g., Alessi & Trollip, 1991; Ravden &
Johnson, 1989). One set of design principles, termed usability heuristics, was
compiled by Nielsen and Molich (1990) and adapted by Landauer (1995) ; these
are listed in Table 4.4. This list of usability heuristics can be used by the
software evaluator as another checklist to determine whether or not the software
follows good design principles.
The evaluation process should also consider the quality of supporting
materials, such as examinee score reports generated by the program, and
technical and users' manuals. Guidelines for evaluating software are useful for
testing professionals who are selecting software for purchase, as well as for
those who are developing their own test administration software. The process of
software evaluation can help ensure that a testing program's most critical needs are fully satisfied.
Table 4.4. Usability Heuristics (adapted from Nielsen & Molich, 1990, in
Landauer, 1995)
1. Use simple and natural dialogue. Tell only what is necessary, and tell it
in a natural and logical order. Ask only what users can answer.
2. Speak the users' language. Use words and concepts familiar to them in
their work, not jargon about the computer's innards.
3. Minimize the users' memory load by providing needed information
when it's needed.
4. Be consistent in terminology and required actions.
5. Keep the user informed about what the computer is doing.
6. Provide clearly marked exits so users can escape from unintended
situations.
7. Provide shortcuts for frequent actions and advanced users.
8. Give good, clear, specific, and constructive error messages in plain
language, not beeps or codes.
9. Wherever possible, prevent errors from occurring by keeping choices
and actions simple and easy.
10. Provide clear, concise, complete online help, instructions, and
documentation. Orient them to user tasks.
Design of the User Interface

The user interface of a computerized test includes the navigation functions available to the user, the visual style of the screens, and the written communication presented to the user. The visual style comprises all the components of the screen design, or "look" of the program. Finally, there is the aspect of written communication. This refers to all textual titles, instructions, and help screens through which the user is informed about the program.
These elements of the user interface are discussed in greater detail in the
following sections and are considered in terms of the design principles of
consistency, feedback, and forgiveness. Although there are many important
design principles that should be considered in the process of software
development, a few are most critical for computerized test administration
software. Consistency in the interface design makes learning to use the software
much easier and enables the user to concentrate on the content and purpose of
the program rather than on how to use it. Another important design principle is
that of feedback. When a software interface incorporates feedback, a
confirmation for every user action is provided, letting the user know that each
action was noted. Finally, the principle of forgiveness results in a software
interface that guides a user to take the correct action but also enables the user to
"back out" of an incorrect choice.
User interface elements are illustrated by figures provided in both this
chapter and Chapter 5.
Navigation
Navigation refers to the set of methods provided in the software interface for
users to move through the program. In instructional screens, for example, users
are often asked to click on a button or press the Enter key to advance to the next
screen. Sometimes, they are provided with a button or keystroke to move back
to the previous screen. A program with simple but limited navigation typically is
easy to learn; the tradeoff is often some reduction in user access and control.
A computerized test may have very simple navigation because very little
control over movement through the program is offered to the user. A short
adaptive test that consists entirely of single-screen items is likely to have very
minimal navigation. A test of this type provides a single button or keystroke for
an examinee to use to move forward in the test with no backward movement
allowed. Very simple navigation is illustrated later, in Figure 5.3. This item
screen is from an audio-based ESL Listening test. In this test, examinees are
restricted from returning to items after moving on; thus the only navigation on
this item screen is a right-arrow button, which moves the examinee forward to
the next item.
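To make the idea concrete, forward-only navigation of this kind might be implemented along the lines of the following JavaScript sketch. The fixed item list and console output are stand-ins for real item selection and display; the names used here are illustrative and are not drawn from any particular CBT package.

    // Sketch of forward-only navigation: the only control offered is a
    // "next" action, and previously seen items are never redisplayed.
    var items = ["Item text 1", "Item text 2", "Item text 3"];  // illustrative content
    var position = -1;

    function showItem(text) {
      // A real CBT would render the item screen; here we simply log it.
      console.log(text);
    }

    function onNextButtonClick() {
      position = position + 1;
      if (position < items.length) {
        showItem(items[position]);
      } else {
        console.log("End of test.");  // no backward movement is ever offered
      }
    }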
A nonadaptive test (e.g., a CFT) usually provides options for both forward
and backward movement through the test items. Often, this navigation is
provided through buttons labeled Next and Previous or displaying right and left
arrows. This type of navigation can be seen in Figure 5.1.
A test of moderate or greater length provides additional forms of navigation.
For example, examinees may be able to mark an item for later review; specific
keystrokes or buttons may then enable the user to page or move only through the
subset of items that have been marked. Some longer computerized tests provide
an item menu option. In this form of navigation, examinees can access a screen
that contains a list of item numbers and perhaps some information as to whether
each item has already been seen by the examinee, been answered, and/or been
marked for review. The examinee is able to move directly to any given item
from this screen by selecting the item number.
Figure 5.4 displays a library skills item in which examinees are asked to rank
order three article titles in terms of their relevance to a search. An added
navigational feature of this item is that examinees are able to click on any one of the titles to see the article's abstract. To lessen potential navigational confusion,
when an abstract is displayed it appears in a pop-up window that covers only
part of the item screen. In general, the most complex forms of navigation in
computerized tests are those provided in simulation exams. In these tests
numerous choices, or paths, may be available to the examinee for much of the
exam.
In navigation, the design principle of consistency is evidenced by software
that requires the same user action for the same software option throughout the
program. With inconsistent navigation, users may have to select the Enter key
at one point in the program, the right arrow key at another point, and a Continue
button at still another point, all to accomplish the same function of advancing to
the next screen. Another instance of software consistency is for buttons intended
for specific purposes to have the same look and be placed in the same location
on every relevant screen.
The principle of feedback is evidenced in navigation by such features as
buttons that make an audible click or that appear to depress when selected.
Actions that result in a noticeable delay may provide a display indicating that
the computer is processing the request (e.g., the cursor may turn into a clock face, or a tone or message may indicate the status of the task). Feedback is also
provided through such features as progress indicators. For example, a linear test
can include indications such as "item 6 of 20" on each item screen. Figure 5.1
displays this type of progress indicator.
Navigational forgiveness can be implemented in a number of ways. The
examinee should be able to undo actions, reverse direction in a series of
instructional screens, access help throughout the program, and change item
responses-unless there are good reasons to limit these actions. When reversing
an action is not possible, a program designed with the principle of forgiveness in
mind clearly informs the user of this fact in advance, along with any effects that
may result from a given choice. This type of informative forgiveness is
demonstrated in Figure 4.1. In this CBT, examinees are free to revisit items
within a test section, but they may not return to those items once they have left a
section. This screen clearly warns examinees before they take an action that they
will not be able to reverse.
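A warning of this kind might be implemented along the lines of the following sketch, here using the browser's built-in confirmation dialog. The wording and the closeSection() routine are illustrative only; an operational CBT would typically use its own dialog screens rather than a generic browser prompt.

    // Sketch of "informative forgiveness" before an irreversible action: the
    // examinee is warned that leaving the section cannot be undone and may
    // back out of the choice.
    function closeSection() {
      console.log("Section closed; its items can no longer be revisited.");
    }

    function onExitSectionClick() {
      var goOn = window.confirm(
        "If you leave this section you will not be able to return to its items. " +
        "Choose OK to go on, or Cancel to return to the section."
      );
      if (goOn) {
        closeSection();
      }
      // If the examinee cancels, nothing changes and the section remains open.
    }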
Figure 4.1. Warning screen displaying the prompt "Exit Section?" with buttons labeled Return and Go On.
Visual Style
The next aspect of user-interface design is the visual style of the interface. The
visual style is reflected in all the individual visual elements on the screen, from
the choice of background color to the shape and labels for navigation buttons to
the type and size of text fonts. One of the most basic aspects of visual style is
physical layout of screen elements, including the location and relative size of all
text, borders, and images. The style and detail of the screen graphics, as well as
any animation or video included, also contribute to the program's visual style.
Finally, the visual effect of all of these elements combined contributes to the
overall visual style of the interface.
A great deal of empirical research into the effect of various screen design
components has been conducted (see Tullis, 1997, for an excellent summary of
some of this work). Consistent effects on proportion of user errors and search
time (i.e., the amount of time it takes for a user to extract specific information
from a screen) have been found for screen design alternatives. Some of the most
critical visual design issues concern how much information a screen contains (or
information density), how the information is grouped, and the spatial
relationships among screen elements. For example, the evidence indicates that
search time is clearly linked to information density, although appropriate use of
grouping and layout can overcome this effect. In addition, good choice of
information grouping can improve the readability of the data as well as indicate
relationships between different groups of information.
In any software package, the visual style must be appropriate for the target
audience and the subject matter of the program. In some user interface
guidelines this is referred to as aesthetic integrity (e.g., Apple, Inc., 1995). It
should also be designed to match the style and tone of the written
communication . For computerized testing, a simple, uncluttered style is best so
that examinees' attention can be focused on the test items. Clear, legible text
with few screen windows open at one time is least confusing to users of the
testing software, particularly novices. Soft colors, with a good level of contrast
between the font and the background, are easiest on the eyes and thus help
reduce potential examinee fatigue during the exam. If specific information needs
to stand out, the use of an alternate color is often a better choice than use of
underlining, bold text, or reverse video. The technique of flashing (turning text
on and off) has been shown to be less effective than leaving the text undifferentiated, due to the increased difficulty of reading flashing text (Tullis,
1997). The item screen displayed in Figure 5.3 has a simple, clean layout. Black
text on a pale yellow background makes for very readable script. Although the
printed versions of the CBT screens provided in this text are all in black and
white, naturally the on-screen versions use a range of text and background
colors. A larger font size is used to distinguish the item instructions from the
item text. The "title" information is also clearly set off-in this instance through
use of underlining, although use of a different color or physical borders might
have been better choices. A problem with the item in Figure 5.1 is in the use of
the color red to outline an examinee's current selection. While the outline is very
visible on screen, it might lead to some confusion, given that red is often
associated with incorrect responses.
The layout, or physical placement of borders, instructional text, navigation
buttons, item stem and response options, and other screen elements should be
visually pleasing and, more critically, should lead the user to take the correct
action. Size can be used to indicate priority and can guide the eye to the most
important elements on a screen. For example, on an item screen the item itself
should probably take up the majority of the screen space. Item instructions and
navigation buttons should be available but unobtrusive. In Figure 5.2 the
graphical item dominates the screen space, making it easy to see each figural
component. The graphical elements that are movable are distinguished by their
placement and grouping within a box labeled "Tools."
Consistency in visual style may be implemented through the use of one
visual look for all item screens and a similar but distinct look for all
informational and help screens. This similar-but-distinct relationship is
illustrated in Figures 4.2 and 5.3, which are both from the same CBT and use the
same color scheme and fonts. Certain areas of the screen can also be consistently
devoted to specific purposes. For example, instructions can be displayed in a
text box, separated from the item stem and response options, permanent buttons
can be placed along one border of the screen, and temporary buttons can be
consistently located in another area of the screen. The interface's visual style
can provide feedback through such means as shadowing a button that has been
clicked or displaying a checkmark next to a selected item response option.
Navigation or other options that are temporarily unavailable can be "grayed out"
to inform the user visually that those options cannot currently be selected.
Forgiveness primarily may be reflected in a visual style that guides the user to
the important information or appropriate choice on each screen. For example, in
Figure 5.3 the button labeled Play is larger than average and is aligned with the item options but placed above them. This use of layout is intended to
help guide the examinee to listen to the audio stem prior to attempting the item.
Figure 4.3 displays heavy use of visual elements including layout, color,
fonts, borders, and size to clarify the examinee's moderately complex task in
responding to the item. A weakness in this item, however, is that the use of
black text on a dark blue-green background makes the instructions less legible
on screen than they should be.
Written Communication

The written communication of the interface includes all of the textual titles, instructions, and help screens in the program. Although much of the guidance in this section concerns the presentation of instructional text, it also applies to the text of the actual item stem and response options (the content of the item, of course, should be based on measurement considerations).
Figure 4.2.
Graphics, animation, and audio also can be used to help express information to the user
concisely. The information should be written in a style that is pleasing to the
eye, that has a good tone for the subject matter and the audience, and that flows
well from one screen to the next. Appropriate word choice is also critical. In the
progress indicator displayed in Figure 5.1, communication might be improved if
the word "Item" were used in place of "Screen." Naturally, the text must be
error-free, containing no typographical errors, grammatical mistakes, or
incorrect instructions.
Many aspects of the display of textual information on computer screens have
been empirically investigated. Tullis (1997) has summarized much of this
research, while additional information about on-screen written communication is
available in Nielsen (1997a, 1997b, 1999).
The font used for textual information should be selected carefully. It should
be a font that is visually pleasing on screen (as opposed to print), large enough
to be seen from a reasonable distance, and fully legible. The best font size for
most on-screen reading appears to be between 9 and 12 points; fonts smaller or
larger than this range tend to be less legible and to reduce reading speeds. There
is also evidence that dark characters on a light background are preferred over
light characters on a dark background.
The text should be carefully displayed. It should be formatted attractively
and well placed on the screen. For example, text is more readable when the
length of the lines is neither too short nor too long. The text should not be
broken up, but neither should it extend from one edge of the screen to the other.
Web-Based Tests
There are some additional, specific user interface issues that apply when a Web-
based test is developed. The term Web-based test (WBT) is used here to refer to
those computer-based tests that are administered not only via the Internet but also within Internet browser software (e.g., Netscape Navigator or Internet Explorer).
Other forms of CBT software also may be transmitted or delivered over the
Internet; however, their appearance and functionality will reflect the application
program, rather than elements of the browser interface.
Advantages of a WBT include use of a single development platform (e.g.,
HTML with JavaScript) to produce an exam that can be delivered across a wide
range of computer systems. However, the development and administration of a
quality WBT requires specific attention to interface design, as well as many
other issues.
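To illustrate the single-platform approach, the following is a minimal sketch of one multiple-choice item delivered as a Web page using HTML with embedded JavaScript. The item content, element names, and the recordResponse() function are invented for the example; a real WBT would add navigation, timing, data transmission, and security layers.

    <!-- Minimal sketch of a Web-based multiple-choice item (illustrative content). -->
    <html>
      <head><title>Sample WBT Item</title></head>
      <body>
        <p>Item 6 of 20</p>
        <p>6. Which of the following is an example of a constructed-response item format?</p>
        <form name="item6">
          <p><input type="radio" name="response" value="A"> A. True/false</p>
          <p><input type="radio" name="response" value="B"> B. Multiple choice</p>
          <p><input type="radio" name="response" value="C"> C. Short typed answer</p>
          <p><input type="radio" name="response" value="D"> D. Matching</p>
        </form>
        <button type="button" onclick="recordResponse()">Next</button>
        <script>
          // Read the selected option; a real WBT would transmit it to a server
          // and then load the next item page.
          function recordResponse() {
            var options = document.forms["item6"].elements["response"];
            for (var i = 0; i < options.length; i++) {
              if (options[i].checked) {
                alert("Response recorded: " + options[i].value);
                return;
              }
            }
            alert("Please select a response before continuing.");
          }
        </script>
      </body>
    </html>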
A primary, critical aspect of interface design for WBTs is that a Web page
may be rendered differently by different browsers, different hardware platforms,
and different browsers on different platforms. In other words, a page may have a
different appearance when accessed under Netscape Navigator on an IBM PC
and on an Apple Macintosh computer or Internet Explorer on a PC.
Furthermore, a given font used on the page may not be installed on an
examinee's computer; or a given user may have customized his or her computer
so that certain page elements are colored or displayed in ways other than what
the page designer specified. This state of affairs is quite contrary to the level of
control over layout and design typically required in standardized test
applications. The test developer has a limited number of options available in a
WBT environment to provide a truly standardized test administration, given that
standardization is contrary to the underlying principles of the World Wide Web.
At a minimum, the test developer can follow HTML and other Web standards in
the development of the WBT and then evaluate the exam pages on multiple
browser and hardware platforms, to ensure that it looks and functions
appropriately. This will not provide identical pages, but it will help prevent the
most serious inconsistencies or problems. In applications where standardization
and consistency of design have greater importance, the test developer can
constrain the WBT so that it will only run on a specified browser and hardware
system. While this is not the typical approach for most Web pages, it simulates
the hardware and system requirements used for most CBT development and
administration.
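Constraining delivery to a specified browser might be approached, in part, with a simple client-side check of the browser's identification string, as in the sketch below. The supported combination (Internet Explorer on Windows) is assumed purely for illustration, and a production system would use more robust detection along with server-side verification.

    // Crude, illustrative check that the WBT is being run under the single
    // supported browser/platform combination assumed for this example.
    var ua = navigator.userAgent;
    var supported = ua.indexOf("MSIE") !== -1 && ua.indexOf("Windows") !== -1;

    if (!supported) {
      alert("This test must be taken with Internet Explorer on a Windows " +
            "computer. Please contact the test administrator for assistance.");
      // window.location = "unsupported.html";  // hypothetical information page
    }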
Another critical difference between WBT interface design and other
computer-based tests is related to the use of what might be termed display space.
Some WBTs (using simple HTML coding) consist of a single, long, scrollable
document, with the items displayed one below another in a single, long column.
With this type of user interface, the design of the entire page must be
considered, rather than a single "item screen" at a time. A page may use
horizontal rulers to break up test sections or to separate instructional text from
actual items. The user should not have to scroll horizontally; rather, the entire
width of the WBT page should fit on the screen, even when the length does not.
The WBT page should also be designed so that each discrete item can be
displayed on a single screen in its entirety; an examinee should not have to scroll
back and forth to view an item's stem and response options. More complex
items, such as those that would typically function as multiscreen items in a
typical CBT, are always challenging to display clearly and require particular
care in a WBT format. The examinee should not have to struggle to access and
view each item element.
An option that test developers may elect to take is to design the WBT so that
it uses display space in a manner that is very similar to other computer-based
tests. In this instance, a browser window is opened and displayed according to
parameters that have been set by the developer (perhaps using JavaScript or
other programming code), such as 640 x 480 pixels. This constrains each page
of the WBT to specific elements that are somewhat comparable to the "item
screens" of other computer-based tests. The layout of these item pages can be
designed to include item-screen elements such as item instructions, a help
option, and navigation buttons for access to the previous and next items in the
exam.
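The 640 x 480 example mentioned above could be implemented along the following lines. The file name and window options shown are illustrative, and browsers differ in how strictly they honor such parameters.

    // Sketch of opening the test in a fixed-size browser window so that each
    // page behaves like the "item screen" of a conventional CBT.
    function launchTest() {
      window.open(
        "first_item.html",
        "cbtWindow",
        "width=640,height=480,scrollbars=no,resizable=no,toolbar=no,menubar=no"
      );
    }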
This brief consideration of WBT interface design is just an introduction to
some of the primary issues and options. These possible programming solutions
are intended to be illustrative of the ways in which test developers can currently
address the specific interface and design issues in Web-based forms of
assessment. Other programming options are currently available, and no doubt
more will become available in the future. Furthermore, additional interface
issues will arise for any WBT that utilizes other Web elements. These may
include the use of multimedia, interactivity, and navigation to other pages or
resources. Given the rapid state of change on the Web, any final resolution to
these issues is not likely in the near future. Attention to basic principles of
interface design and measurement standards provides the best guidelines for the
concerned test developer.
Quality Control
There is an additional aspect of software evaluation and testing that has not been
addressed yet. That is the process of quality control. In many CBT applications,
the quality control, or QC, phase of software development is a critical evaluation
stage. In this aspect of evaluation, exam developers use and test an extensive set
of software components to ensure that they are functioning as planned. This
includes the correct display of items and item types on the screen. Item scoring
must also be tested. For example, innovative item types that have more than one
correct response may be tested to ensure that all possible correct responses are
actually scored as correct. The correct functioning of psychometric components, such as the item selection and scoring algorithms, should also be verified.
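As one concrete illustration of this kind of scoring check, the short JavaScript sketch below runs every response the item writer intends to credit through a scoring routine and flags any key that fails to earn credit. The scoring rule and the keys are invented for the example.

    // Quality-control sketch for an innovative item with more than one
    // acceptable correct response.
    function scoreItem(response) {
      // Credit any response that selects options B and D, in either order.
      var selected = response.split(",").map(function (s) { return s.trim(); }).sort();
      return selected.join(",") === "B,D" ? 1 : 0;
    }

    var intendedCorrectResponses = ["B,D", "D,B", " B , D "];
    intendedCorrectResponses.forEach(function (r) {
      if (scoreItem(r) !== 1) {
        console.log("QC failure: intended correct response not credited: '" + r + "'");
      }
    });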
Summary
There are many varied requirements for the development and administration of a
good computerized exam program. Knowledge of the principles and process of
software design is likely to be a weak point for many testing and measurement
professionals. This chapter is not a comprehensive coverage of software
development issues, but it has introduced many important concepts and
emphasized some critical needs. Highlights of these issues are displayed in
Table 4.5. Further resources on this topic are included in the References and
Additional Readings.
References
Alessi, S. M., & Trollip, S. R. (1991). Computer-Based Instruction: Methods and Development. Englewood Cliffs, NJ: Prentice-Hall.
Apple, Inc. (1995). Human Interface Design and the Development Process. In Macintosh Human Interface Guidelines. Reading, MA: Addison-Wesley.
Bennett, R. E., & Bejar, I. I. (1998). Validity and automated scoring: It's not only the scoring. Educational Measurement: Issues and Practice, 17, 9-17.
Bias, R. G., & Mayhew, D. J. (1994). Cost-Justifying Usability. Boston: Academic Press.
Booth, J. (1991). The key to valid computer-based testing: The user interface. Revue Europeenne de Psychologie Appliquee, 41, 281-293.
Bunderson, V. C., Inouye, D. I., & Olsen, J. B. (1989). The four generations of computerized educational measurement. In Linn, R. (ed.) Educational Measurement, 3rd edition. New York: American Council on Education and Macmillan Publishing Co.
Ehrlich, K., & Rohn, J. A. (1994). Cost justification of usability engineering: A vendor's perspective. In Bias, R. G., & Mayhew, D. J. (eds.) Cost-Justifying Usability (pp. 73-110). Boston: Academic Press.
Gould, J. D., Boies, S. J., & Ukelson, J. (1997). How to design usable systems. In Helander, M., Landauer, T. K., & Prabhu, P. (eds.) Handbook of Human-Computer Interaction, 2nd completely revised edition (pp. 231-254). Amsterdam: Elsevier.
Gould, J. D., & Lewis, C. (1985). Designing for usability: Key principles and what designers think. In Baecker, R., & Buxton, W. (eds.), Readings in HCI (pp. 528-539). Association for Computing Machinery.
Harmes, J. C., & Parshall, C. G. (2000, November). An Iterative Process for Computerized Test Development: Integrating Usability Methods. Paper presented at the annual meeting of the Florida Educational Research Association, Tallahassee.
Harrison, M. C., Henneman, R. L., & Blatt, L. A. (1994). Design of a human factors cost-justification tool. In Bias, R. G., & Mayhew, D. J. (eds.) Cost-Justifying Usability (pp. 203-241). Boston: Academic Press.
Helander, M., Landauer, T. K., & Prabhu, P. (eds.). (1997). Handbook of Human-Computer Interaction, 2nd completely revised edition. Amsterdam: Elsevier.
Karat, C. (1997). Cost-justifying usability engineering in the software life cycle. In Helander, M., Landauer, T. K., & Prabhu, P. (eds.) Handbook of Human-Computer Interaction, 2nd completely revised edition (pp. 767-778). Amsterdam: Elsevier.
Kirakowski, J., & Corbett, M. (1990). Effective Methodology for the Study of HCI. New York: North-Holland.
Landauer, T. K. (1995). The Trouble with Computers: Usefulness, Usability, and Productivity. Cambridge, MA: MIT Press.
Landauer, T. K. (1997). Behavioral research methods in human-computer interaction. In Helander, M. G., Landauer, T. K., & Prabhu, P. (eds.) Handbook of Human-Computer Interaction, 2nd completely revised edition (pp. 203-227). Amsterdam: Elsevier.
Mayhew, D. J., & Mantei, M. (1994). A basic framework for cost-justifying usability engineering. In Bias, R. G., & Mayhew, D. J. (eds.) Cost-Justifying Usability (pp. 9-43). Boston: Academic Press.
Nielsen, J. (1993). Usability Engineering. Boston: Academic Press.
Nielsen, J. (1994a). Guerrilla HCI: Using discount usability engineering to penetrate the intimidation barrier. In Bias, R. G., & Mayhew, D. J. (eds.) Cost-Justifying Usability (pp. 245-272). Boston: Academic Press. [Also online at http://www.useit.com/papers/guerrilla_hci.html]
Additional Readings
CTB/McGraw-Hill. (1997). Technical Bulletin 1: TerraNova. Monterey, CA: Author.
CTB/McGraw-Hill. (1997). Usability: testing the test. In Inform: A Series of Special Reports from CTB/McGraw-Hill (pp. 1-4). Monterey, CA: Author.
Gould, J. D. (1997). How to design usable systems. In Helander, M. G., Landauer, T. K., & Prabhu, P. (eds.) Handbook of Human-Computer Interaction, 2nd completely revised edition. Amsterdam: Elsevier.
Harmes, J. C., & Kemker, K. (1999, October). Using JavaScript and Livestage to Create Online Assessments. Paper presented at the annual meeting of the International Conference on Technology and Education, Tampa, FL.
Karat, C. (1994). A business case approach to usability cost justification. In Bias, R. G., & Mayhew, D. J. (eds.) Cost-Justifying Usability (pp. 45-70). Boston: Academic Press.
Nielsen, J. (1998). Severity Ratings for Usability Problems. [Online at: http://www.useit.com/papers/heuristic/severityrating.html]
Nielsen, J. (1995). Technology Transfer of Heuristic Evaluation and Usability Inspection. Keynote address at IFIP INTERACT '95, Lillehammer, Norway. [Also online at: http://www.useit.com/papers/heuristic/learning_inspection.html]
Pressman, R. S. (1992). Software Engineering: A Practitioner's Approach. New York: McGraw-Hill.
Rubin, J. (1994). Handbook of Usability Testing: How to Plan, Design, and Conduct Effective Tests. New York: Wiley & Sons.
5
Issues in Innovative Item Types
The development of innovative item types, defined as items that depart from the
traditional, discrete, text-based, multiple-choice format, is perhaps the most
promising area in the entire field of computer-based testing. The reason for this
is the great potential that item innovations have for substantively improving
measurement. Innovative item types are items that include some feature or
function made available due to their administration on computer. Items may be
innovative in many ways. This chapter will include a discussion of the purpose
or value of innovative item types, the five dimensions in which items may be
innovative, the impact of level of complexity on the development and
implementation of these item types, and a view toward the future of innovative
item types.
Purpose
The overarching purpose of innovative item types is to improve measurement,
through either improving the quality of existing measures or expanding
measurement into new areas. In other words, innovative item types enable us to
measure something better or to measure something more. In general, most of the
innovative item types developed to date provide measurement improvements in
one or more of the following ways.
First of all, innovative item types may improve measurement by reducing
guessing. While a typical four- or five-option multiple-choice item can be
responded to correctly by simple guessing as much as 20-25% of the time, this
guessing factor can be greatly reduced through innovative item types. One way
in which they reduce the potential effect of guessing is by appropriately
increasing the number of options in a selected response item type. For example,
in a test of reading comprehension, the examinee can be asked to select the topic
sentence in a reading passage. Thus, every sentence in the passage becomes an
option rather than four or five passage sentences listed as the only response
options. Examinees also can be asked to select one part of a complex graphic
image. Again, the number of available responses is likely to be far greater than
the usual four or five.
The potential for guessing an answer correctly can be reduced even further
through the use of constructed response items. In a test of math skills, for
example, examinees can be asked to type a numerical response, rather than
selecting an option from a list. Short responses to verbal items also can be
collected and scored in a similar manner. Acceptable misspellings, or alternative
mathematical formulations, may be included as keys within the list of acceptable
responses.
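A minimal sketch of such a scoring rule is shown below, with JavaScript used purely for illustration. The key list, including an alternative formulation and an accepted misspelling, is invented for the example; operational programs would maintain these lists as part of the item record.

    // Sketch of dichotomous scoring for a typed response, where the key list
    // includes an alternative formulation and a common misspelling.
    var keys = ["0.75", ".75", "3/4", "three quarters", "three quaters"];

    function scoreTypedResponse(response) {
      var normalized = response.trim().toLowerCase();
      return keys.indexOf(normalized) !== -1 ? 1 : 0;
    }

    console.log(scoreTypedResponse(" 3/4 "));          // 1: alternative formulation
    console.log(scoreTypedResponse("Three quaters"));  // 1: accepted misspelling
    console.log(scoreTypedResponse("0.8"));            // 0: incorrect value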
A second way in which innovative item types are designed to provide better
measurement is by measuring some knowledge or skill more directly than
traditional item types allow. These items can be designed to avoid some of the
artificial constraints of traditional, multiple-choice items. One innovative item
type provides for multiple-correct responses (e.g., "Select two of the following,"
"Select all that apply"). For example, a medical test might prompt examinees to
select all of the elements listed that are symptoms of a particular illness. Another
way in which innovative item types provide for a more direct measure of
proficiency is in sequence items. In a history test, for example, examinees may
be asked to indicate the order or sequence in which a series of events occurred.
A traditional item type might provide a numbered set of events and response
options that indicated various possible orderings for those numbered events
(e.g., a: 2, 1, 4, 3; b: 4, 2, 1, 3). In a computerized innovative item, the examinee
instead could drag the events into the desired order. More direct measurement
can also be provided in graphical innovative items. Examinees may be asked to
click directly on a graphic or image. Or they may be given a set of icons and
tools, which they must use to construct a graphic as their response to the item.
Innovative item types that include nontext media can provide another approach
to more direct measurement. Rather than verbally describing some visual,
auditory, or dynamic element, these items can directly incorporate images,
sounds, animations, or videos into the item.
Innovative item types also improve measurement by expanding the content
coverage of a testing program. That is, use of computer technology may enable a
testing program to include areas of content that were logistically challenging,
even impossible, to assess in traditional paper-and-pencil administration. The
inclusion of high-quality graphics can expand assessment in many areas. One
example is the use of medical slide images in an item stem; a very different
example of the same technology is the use of fine-arts graphic images.
Another use of the technology to expand content areas is the inclusion of
sound. The computer's audio functions provide a fairly easy way to add
assessment of listening to numerous content areas. These include content fields
such as foreign languages and music as well as many others . For example,
medical and scientific equipment often produces sounds that need to be interpreted and understood; tests in these areas can use innovative items that
incorporate appropriate sounds. In all these examples, the innovative item types
use the test administration technology to measure something more than had been
feasible previously.
The final way in which innovative item types provide improved measurement is to expand
the cognitive skills measured on a test. For example, innovative item types have
been designed that require examinees to construct or assemble on-screen figures.
While this type of task remains completely scorable by the computer, it also
provides for the measurement of productive skills not included in traditional
assessments. An example of this use of innovative item types to expand the
measurement of cognitive processes is seen in writing-skills assessments.
Examinees are presented with an error-filled passage and then asked to identify
the errors . The examinees may even be presented with the opportunity to retype
sections of text or correct the errors. Both error identification and actual
correction are cognitively different from traditional multiple-choice writing-
skills items. Numerous other applications of innovative item types to expand the
measurement of cognitive skills are possible.
Further examples of innovative item types that provide each of these
measurement improvements are offered in the next sections.
Dimensions of Innovation
The phrase innovative item types encompasses a large number of innovations.
They will be discussed here in terms of a five-dimension classification system.
These five dimensions are not completely independent. However, in most cases,
items are innovative in only one or two of these dimensions at a time.
The five dimensions of item innovation are item format, response action,
media inclusion, level of interactivity, and scoring method. Item format defines
the sort of response collected from the examinee; major categories of item
format are selected response and constructed response. Response action refers to
the means by which examinees provide their responses, including key presses
and mouse clicks. Media inclusion covers the addition of nontext elements
within an item , including graphics, sound, animation, and video. Level of
interactivity describes the extent to which an item type reacts or responds to
examinee input. This can range from no interactivity through complex, multistep
items with branching. Finally, scoring method addresses how examinee
responses are converted into quantitative scores. This includes completely
automated dichotomous scoring along with scoring programs that need to be
"trained" to model human raters who assign polytomous scores.
Further discussion of each of these dimensions is provided next, along with
examples of innovative item types to illustrate the potential advantages inherent
in each area. (The topic of dimensions of item type innovation is addressed more
fully in Parshall, Davey, & Pashley, 2000).
Item Format
Examinees may be asked, for example, to arrange a list of numerical elements into size order, or even a list of alternatives into a degree-of-correctness order.
Another selected response item type is described in Davey, Godwin, and
Mittelholtz (1997). Their test of writing skills is designed to simulate the editing
stage of the writing process. Examinees are confronted with a passage that
contains various grammatical and stylistic errors, but no indication is given as to
the location of these errors. Examinees read the passage and use a cursor to
point to sections that they think should be corrected or changed. They then are
presented with a list of alternative ways of rewriting the suspect section.
Examinees can select one of the alternatives or choose to leave the section as
written . If an alternative is chosen, the replacement text is copied into the
passage so that the changes can be reviewed in their proper context. The
rationale behind the use of this item type is that the error-identification portion
of the task adds to the cognitive skills assessed by the items, even though
conventional multiple-choice items are used.
There are other selected-response item formats that have been used in paper-
and-pencil administrations that could easily be adapted for administration in
computer-based tests. One example is an item to which an examinee is able to
respond multiple times, possibly with feedback regarding the correctness of each
response. The final item score for the examinee is then based on percent correct
out of number attempted, or percent attempted until correct.
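These two scoring rules might be computed as in the following sketch; the reading of "percent attempted until correct" shown here is one reasonable interpretation, and the attempt records are invented for the example.

    // Two illustrative ways to turn a sequence of attempts at one item into a
    // score. Each attempt is recorded as 1 (correct) or 0 (incorrect), in the
    // order it was made.
    function proportionCorrectOfAttempted(attempts) {
      if (attempts.length === 0) { return 0; }
      var correct = attempts.filter(function (a) { return a === 1; }).length;
      return correct / attempts.length;
    }

    function proportionOfAttemptsUntilCorrect(attempts) {
      var firstCorrect = attempts.indexOf(1);   // -1 if the examinee was never correct
      return firstCorrect === -1 ? 0 : 1 / (firstCorrect + 1);
    }

    console.log(proportionCorrectOfAttempted([0, 1]));         // 0.5
    console.log(proportionOfAttemptsUntilCorrect([0, 0, 1]));  // 0.33...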
Response Action
While the dimension of item format discussed earlier defines what examinees
are asked, the dimension of response action defines how they are to respond.
Response action refers to the physical action that an examinee makes to respond
to an item . The most common response action required of examinees in
conventional paper-and-pencil assessments is to use a pencil to bubble in an oval
associated with an option.
Figure 5.2. Graphical item screen titled "Parallel Circuit," with a Tools box and the instruction: "Drag the pieces of the circuit in the boxes to make a parallel circuit. You do not necessarily have to use all the tools."
Media Inclusion
for resolving the conflict. (The interactive component of this exam will be
discussed in the next section.)
In addition to the single use of each of these forms of nontext media,
innovative items can appropriately include multiple forms of these media. An
item in a language listening skills test can display a still photo or video of two
people holding a dialogue while the sound file containing their conversation is
played. Bennett, Goodman, et al. (1997) developed a sample multimedia item
that included an image of a static electrocardiogram strip, an animation of a
heart monitor trace, and the sound of a heartbeat. Many more applications of
media-based innovative item types are likely to be developed in the near future.
[Figure: audio-based item. The screen shows a Play button and four response options: a. economic, b. controlled, c. sophisticated, d. emphasize.]
Level of Interactivity
once examinees have indicated their sequence or ordering for a set of elements,
they can see the elements rearranged in the new order. And when examinees
click on a histogram, scale, or dial they may see the bar or gauge move to reflect
their action. While the examinees are not being given feedback about the
correctness of their responses, this limited form of interactivity can be used to
help them decide whether a response should be changed. It allows them to
consider the item and the response within a particular context.
In another level of interactivity, an innovative item provides a set of online
tools for the examinee's use within the response. This application may be
regarded as a "passive" form of interactivity; the computer interacts with, or
responds to, the examinee's actions by displaying the results on screen in an
integrated way. An elaborate example of this type of item interactivity can be
seen in the computer-based credentialing test created by the National Council of
Architectural Registration Boards (NCARB; see Braun, 1994). This test offers a
computerized simulation of performance-based architectural problems and
assignments. The examinee must use computerized drawing tools to design a
solution to the problem "vignette" within specified criteria. The computer does
not directly react to or interact with the examinee, but it does provide a forum
for an extensive set of the examinee's actions to be made and displayed in an
integrated, contextual manner.
A number of certification exams in the information technology (IT) field display
this type of passive interactivity. These exams are often designed to simulate or
model the use of a given software program, and the examinees are provided with
software-use tasks to undertake. The tests may be designed so that examinees
are basically free to select any software command or option that would be
available in actual use of the program, or they may be constrained so that a more
limited subset of software features are "live." In either case, far more options are
likely to be available throughout the exam than would be typical of
noninnovative forms of testing. The examinees may be able to complete an
assigned task in a number of different ways, and scoring of the task may not
even distinguish between inefficient or optimal means of accomplishing the end
goal. An example of this type of interactivity is provided in Figure 5.4. A Web-
based library skills item provides the functionality of library search software and
displays additional information in a pop-up window upon examinee request
(Harmes & Parshall, 2000).
A further type or level of interactivity incorporates a two- or multistep
branching function within the item. In this type of innovative item, simple
branched structures are built into the item's response options, and the examinee
selection determines which branch is followed. An example of this type of
interactivity is provided in the interactive video assessment of conflict resolution
skills discussed earlier (Drasgow, Olson-Buchanan, & Moberg, 1999; Olson-
Buchanan, Drasgow, et al., 1998). In this application, an item begins with a
video-based scene of workplace conflict, followed by a multiple-choice question
relating to the best way to resolve the conflict. Once the examinee selects a
response, a second video scene is displayed, followed by another multiple-choice item.

[Figure 5.4. Web-based library skills item prototype (Harmes & Parshall, 2000). Examinees place a set of article titles (e.g., "Are Educational Computing Courses Effective?", "Machines that Teach," "Authentic Tasks for Authentic Learning") in order of relevance to a given topic, with additional bibliographic detail available in a pop-up window.]

The particular second scene displayed is based on the examinee's
selected action, and the action or conflict within the scenario moves forward
based on that choice. This assessment uses a two-stage branching level of
interactivity (although the sense of interactivity is probably enhanced further
through the use of the video prompts in which the characters interact with one
another).
In yet another level of interactivity, an online problem situation may be
accompanied by an extensive set of options or choices made available to the
examinee. In this case, the examinee may be able to select multiple options or
actions at any given time. The computer will then react to these choices,
updating and revising the on-screen problem. The number of steps to completion
of this type of innovative item may be variable, depending on the actions an
examinee takes and the computer reactions they engender. An example of this
type of interactivity can be seen in a test of physicians' skills in patient
management (Clauser, Margolis, Clyman, and Ross, 1997). In this computerized
Scoring Method
individual parts should be weighted and how they should be combined into a
polytomous item score.
Typically, scoring is more complex for constructed-response item types. For
example, simple open-ended items may be scored by comparing the examinee's
response to a key list, which includes alternate formations or acceptable
misspellings. For selected figural response items, scoring is based on the
position of the examinee's mouse click on the graphic image.
The more complex constructed-response items require even more complex
scoring algorithms. The scoring solutions for these item types are often rule-
based. The item types and their scoring algorithms are developed jointly for use
in computer-based tests. The item and component scores initially are developed
by human raters, even though the automated scoring may later be conducted by
computer.
A somewhat different approach to automated scoring can be taken with
essay items. Several distinct programs have been written to score essay
responses (e.g., PEG, E-rater, Intellimetric Engineer, Intelligent Essay Assessor,
InQuizit). The criteria for these programs vary greatly, from a consideration of
surface features, such as overall length, to the use of advanced computational
linguistics. Despite these differences, research on many of these programs has
shown that they are capable of producing a score that is as similar to one from a
human rater as a second human rater's score would be (Burstein et al., 1998;
Landauer, Laham, Rehder, & Schreiner, 1997; Page & Petersen, 1995). All of
the various automated-scoring programs require some number of human scores
to "train" the computer program and to handle any difficult responses.
There are a number of issues related to automated essay-scoring that remain
unresolved at the present time. For example, it is unclear how robust these essay
scoring programs may be to "cheating" (i.e., examinee responses that are
tailored to the computer program's criteria rather than to the prompt or question)
or how accepting the public may be of computerized scoring. For these reasons,
the most likely application of an automated essay-scoring program is as a
replacement for a second human rater in those testing programs where essay
responses are scored by two raters. Currently, operational use of these programs
is relatively limited, but that is likely to change dramatically in the next few
years.
Task Complexity
Innovative item types span a very wide range of task complexity, as can be seen
through the examples of innovative items presented in this chapter. This level of
task complexity has implications for item and item-type development as well as
for implementation with a test delivery method.
Items with low task complexity can be seen to have the greatest similarity to
traditional text-based multiple-choice items. They require a similar amount of
The majority of the research and development of innovative item types has
been conducted at these two extremes of task complexity. Relatively little work
has been accomplished in terms of developing item types of moderate task
complexity. Items or tasks at this level of complexity could be expected to take a
moderate amount of examinee response time, compared to the very low- and
very high-complexity item types. They would also utilize some form of partial-
credit or polytomous scoring, although the weighting of any single item with
moderate task complexity would probably be less than that of a single high-
complexity item. A relatively simple example of an item with moderate task
complexity is the circuit assembly item displayed in Figure 5.2. This item would
take more time and would require an examinee to provide more than the single
response of a multiple-choice item, but with the careful application of partial-
credit scoring it could also yield greater information about the examinee's
knowledge of circuitry.
Depending on where along the complexity continuum the items might fall
(and thus, the amount of examinee-response time required by these innovative
items), they might not be easily used in conjunction with traditional, multiple-
choice items. If they cannot, an exam could be developed to consist solely of
moderately complex innovative items.
Such an exam would include a moderate number of these tasks: fewer than would be feasible for a traditional multiple-choice exam, but more than might be reasonable for performance-based simulations. For this reason, exams composed of moderately complex items could offer a compromise between the
two extremes. Ideally, these moderately complex tasks could provide
measurement improvements over more traditional multiple-choice items. At the
same time, an exam of this sort might be better able to provide adequate content
coverage than the more time-consuming high-task-complexity exams, thus mitigating problems of task specificity and limited generalizability. A
test composed of moderately complex item types could be designed to use any
of the test-delivery methods presented in the chapters to follow.
Summary
This chapter has included a discussion of the benefits of innovative item types
along with a description of various dimensions of innovation. The foundational
purpose of item innovations was seen to be improved measurement, either by
measuring something better or by measuring something more. Innovative item
types that measure in better ways may reduce the effect of guessing, or they may
enable a more direct measure of the skill or attribute of interest. Innovative item
types that measure more may provide for the assessment of additional content
areas or for the assessment of additional cognitive processes. The dimensions in
which items may be innovative include item format, response action, media
inclusion, level of interactivity, and scoring method. The task complexity of
innovative item types was also discussed, and it was noted that most developmental work has been conducted at the two extremes of low and high task complexity. Some of the ways in which these elements of innovation can be
integrated are illustrated in Table 5.1. This table provides a few examples of
innovative item types taken from this chapter, categorized by the measurement
purpose of the innovation, one or more dimensions in which the item type is
innovative, and the level of task complexity that might be typical for that item
type.
The majority of the innovative item types presented in this chapter have been
used in either research or operational settings. They are all within the realm of
the possible for current development and implementation . Far more extensive
innovations are on the assessment horizon. One promising possibility concerns
assessment tasks that are embedded within instruction. Greater use of
interactivity can also be imagined. Fuller use of media, moving toward
immersed testing environments, can be envisioned. And, the development of
items or tasks that call for examinees to use online resources, perhaps including
the World Wide Web, may be a rich testing application.
An important point to recognize is that for many innovative item types,
special test administration software may be necessary. Existing CBT software
may be able to administer some of the low-task-complexity item types, but in
most other cases, a customized effort is needed. It is also important to
acknowledge that as great as the potential for these further innovative item types
may be, the need to do the foundational psychometric work is even greater. The
items and tasks should be developed in congruence with a testing area's
construct definition and test design. It is likely that a variety of innovative item
types need to be investigated to find those that are truly useful. Even then,
research will be needed to establish the validity of the item types and of their
associated scoring rubrics.
Table 5.1 Examples of Innovative Item Types by Measurement Purpose, Innovative Dimension, and Task Complexity

    Innovative Item Type               Measurement Purpose            Innovative Dimension              Task Complexity
    Numerical constructed response     To reduce guessing             Item format                       Low
    Click on an area of a graphical    To reduce guessing             Item format; Response action;     Low to moderate
      image                                                           Media inclusion
    Select all that apply              To measure more directly       Item format; Scoring method       Low
    Audio in stem                      To expand content coverage     Media inclusion                   Low to moderate
    Figural constructed response       To expand cognitive skills     Item format; Response action;     Low, moderate, or high
                                         measured                     Interactivity
    Two-stage branching                To expand cognitive skills     Interactivity                     Moderate to high
                                         measured
    Simulation of software use         To measure more directly;      Interactivity; Response action;   Moderate to high
                                         To expand content coverage   Scoring method
References
ACT, Inc. (1998). Assessing Listening Comprehension: A Review of Recent Literature
Relevant to an LSAT Listening Component. Unpublished manuscript, LSAC,
Newtown, PA.
ACT, Inc. (1999). Technical Manual for the ESL Exam. Iowa City: Author.
Baker, E. L., & O'Neil, H. F., Jr. (1995). Computer technology futures for the
improvement of assessment. Journal of Science Education and Technology, 4,
37-45.
Balizet, S., Treder, D. W., & Parshall, C. G. (1999, April). The development of an audio
computer-based classroom test of ESL listening skills. Paper presented at the annual
meeting of the American Educational Research Association, Montreal.
Bennett, R. E., & Bejar, I. I. (1998). Validity and automated scoring: It's not only the
scoring. Educational Measurement: Issues and Practice, 17, 9-17.
Bennett, R. E., Goodman, M., Hessinger, J., Ligget, J., Marshall, G., Kahn, H., & Zack, J.
(1997). Using Multimedia in Large-Scale Computer-Based Testing Programs
(Research Rep. No. RR-97-3). Princeton, NJ: Educational Testing Service.
Bennett, R. E., Morley, M., & Quardt, D. (1998, April). Three Response Types for
Broadening the Conception of Mathematical Problem Solving in Computerized-
Adaptive Tests. Paper presented at the annual meeting of the National Council on
Measurement in Education, San Diego.
Braun, H. (1994). Assessing technology in assessment. In Baker, E. L., & O'Neil, H. F.
(eds.), Technology Assessment in Education and Training (pp. 231-246). Hillsdale,
NJ: Lawrence Erlbaum Associates.
Breland, H. M. (1998, April). Writing Assessment Through Automated Editing. Paper
presented at the annual meeting of the National Council on Measurement in
Education, San Diego.
Burstein, J., Kukich, K., Wolff, S., Lu, C., & Chodorow, M. (1998, April). Computer
Analysis of Essays. Paper presented at the annual meeting of the National Council
on Measurement in Education, San Diego.
Clauser, B. E., Margolis, M. J., Clyman, S. G., & Ross, L. P. (1997). Development of
automated scoring algorithms for complex performance assessments: A comparison
of two approaches. Journal of Educational Measurement, 34, 141-161.
Davey, T., Godwin, J., & Mittelholtz, D. (1997). Developing and scoring an innovative
computerized writing assessment. Journal of Educational Measurement, 34, 21-41.
Drasgow, F., Olson-Buchanan, J. B., & Moberg, P. J. (1999). Development of an
interactive video assessment: Trials and tribulations. In Drasgow, F., & Olson-
Buchanan, J. B. (eds.), Innovations in Computerized Assessment (pp. 177-196).
Mahwah, NJ: Lawrence Erlbaum Associates.
Educational Testing Service (ETS). (1998). Computer-Based TOEFL Score User Guide.
Princeton, NJ: Author.
French, A., & Godwin, J. (1996, April). Using Multimedia Technology to Create
Innovative Items. Paper presented at the annual meeting of the National Council on
Measurement in Education, New York.
Harmes, J. C., & Parshall, C. G. (2000, November). An Iterative Process for
Computerized Test Development: Integrating Usability Methods. Paper presented at
the annual meeting of the Florida Educational Research Association, Tallahassee.
Landauer, T. K., Laham, D., Rehder, B., & Schreiner, M. E. (1997). How well can
passage meaning be derived without using word order? A comparison of latent
semantic analysis and humans. In Shafto, G., & Langley, P. (eds.), Proceedings of
the 19th Annual Meeting of the Cognitive Science Society (pp. 412-417). Mahwah,
NJ: Erlbaum.
Olson-Buchanan, J. B., Drasgow, F., Moberg, P. J., Mead, A. D., Keenan, P. A., &
Donovan, M. A. (1998). Interactive video assessment of conflict resolution skills.
Personnel Psychology, 51, 1-24.
Page, E. B., & Petersen, N. S. (1995, March). The computer moves into essay grading:
Updating the ancient test. Phi Delta Kappan, 76, 561-565.
Parshall, C. G. (1999, February). Audio CBTs: Measuring More Through the Use of
Speech and Nonspeech Sound. Paper presented at the annual meeting of the
National Council on Measurement in Education, Montreal.
Parshall, C. G., Davey, T., & Pashley, P. J. (2000). Innovative item types for
computerized testing. In Van der Linden, W. J., & Glas, C. A. W. (eds.),
Computerized Adaptive Testing: Theory and Practice (pp. 129-148). Norwell, MA:
Kluwer Academic Publishers.
Parshall, C. G., Stewart, R., & Ritter, J. (1996, April). Innovations: Sound, Graphics, and
Alternative Response Modes. Paper presented at the annual meeting of the National
Council on Measurement in Education, New York.
Shavelson, R. J., Baxter, G. P., & Pine, J. (1992). Performance assessments: Political
rhetoric and measurement reality. Educational Researcher, 21, 22-27.
Vispoel, W. P., Wang, T., & Bleiler, T. (1997). Computerized adaptive and fixed-item
testing of music listening skill: A comparison of efficiency, precision, and
concurrent validity. Journal of Educational Measurement, 34, 43-63.
6
Computerized Fixed Tests
include short exams, where the efficiency advantages of adaptive tests are
relatively unimportant, and low-stakes exams, in which security is of minor
concern. CFTs are suitable for exam programs with small numbers of examinees
or with rapidly changing content. A small sample size makes the IRT calibration required for adaptive testing more difficult, and it also reduces some security concerns. Exam programs
that have content that quickly becomes dated (e.g., many technology-related
exams) may also find the CFT method to be suitable. Given the need to replace
items frequently, it may be desirable to avoid the psychometric efforts involved
in the extensive item pretesting and calibrating required in adaptive exams.
Finally, the CFT method may be appealing to exam programs with limited
funds, due to its lower development costs. Of course, that motivation should not
lead test developers or boards to choose the CFT if it results in poorer
measurement for a given exam program.
Typical applications of the CFT include low-stakes educational assessments
such as placement tests and assessments in distance education applications.
CFTs are used for a number of voluntary certification programs and for many
low-volume certification and licensure exams. The information technology (IT)
field is making use of CFTs for a large variety of certification applications (e.g.,
Adair & Berkowitz, 1999). These IT exams are characterized by extensive
innovations in terms of simulating software functionality, very simple scoring
algorithms, and rapidly changing content, making the CFT a suitable delivery
method.
Test Procedures
Each test-delivery method can be considered in terms of the test procedures
involved in implementing the method. These procedures include the processes
followed for test assembly and scoring and the requirements or needs for the
item pool characteristics. These elements and testing procedures are discussed
next.
Test Assembly
Test assembly for the CFT delivery method is most frequently conducted using
classical test theory methods. In classical test theory, exam forms are
constructed according to test specifications, or test blueprints. A test blueprint is
a set of rules that specifies attributes of a test form, such as the overall length of
the form, content representation, and statistical characteristics. For example, test
specifications may require that all test forms for a given exam program be 100
items in length, with 25, 35, and 40 items each coming from three content areas.
The test specifications could include further requirements regarding the
statistical characteristics of the test form. For example, all of the item p-values
(i.e., difficulty indices) could be constrained to be greater than .3 and less than
.8, while no item biserial correlation could have a value less than .2. Further
discussion of test form assembly is provided later in this chapter.
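As a minimal sketch of how such a blueprint might be checked in practice (the item records and field names here are hypothetical, not part of any particular test assembly system), consider the 100-item, three-content-area specification just described:

    # Minimal sketch: checking a candidate fixed form against a classical blueprint.
    # The blueprint values mirror the example in the text; item records are hypothetical.
    BLUEPRINT = {
        "length": 100,
        "content_counts": {"A": 25, "B": 35, "C": 40},
        "p_range": (0.3, 0.8),       # p-values must lie strictly inside this range
        "min_biserial": 0.2,         # no biserial correlation below this value
    }

    def blueprint_violations(form, blueprint=BLUEPRINT):
        """Return a list of ways a form (a list of item dicts) violates the blueprint."""
        violations = []
        if len(form) != blueprint["length"]:
            violations.append("wrong test length: %d" % len(form))
        for area, required in blueprint["content_counts"].items():
            actual = sum(1 for item in form if item["content"] == area)
            if actual != required:
                violations.append("content area %s: %d items, need %d" % (area, actual, required))
        low, high = blueprint["p_range"]
        for item in form:
            if not (low < item["p"] < high):
                violations.append("item %s: p-value %.2f out of range" % (item["id"], item["p"]))
            if item["biserial"] < blueprint["min_biserial"]:
                violations.append("item %s: biserial %.2f too low" % (item["id"], item["biserial"]))
        return violations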
Clearly, this approach to test assembly can be used whether the test forms
are to be administered in paper-and-pencil or computer mode. Many CFTs are
assembled using the same procedures that have long been applied in paper-and-
pencil testing. At other times, CFTs are developed simply by importing an
existing paper-and-pencil exam into a computerized test administration software
program. (The issue of comparability of test scores obtained across the two test
administration modes is addressed in Chapter 2.)
While it is a less common approach, the test forms for the CFT also can be
constructed and scored using IRT methods. One IRT-based approach to test
construction is to specify a target Test Information Function (TIF) along with
the exam's content requirements. The specified target TIF should be related to
the test purpose. For example, a classification exam might have a target TIF that
is highly peaked around the cutoff or passing score, while a proficiency exam
might have a relatively flat target TIF across the full score range. Items are then
selected from the pool based on their individual item information functions
(IIFs). Because information is additive, it is a relatively straightforward matter
to select items that satisfy the content requirements and produce an actual TIF
that is closely related to the target TIF. (In any test assembly method, successful
forms construction is related, of course, to the availability of a sufficient
quantity of good-quality items.)
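The additivity that makes this selection process straightforward can be written out explicitly. Under the three-parameter logistic model (standard IRT notation, not specific to any one testing program), the item information function and the test information function are

    P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-D a_i (\theta - b_i)}},
    \qquad
    I_i(\theta) = \frac{D^2 a_i^2\,[P_i(\theta) - c_i]^2}{(1 - c_i)^2}
                  \cdot \frac{1 - P_i(\theta)}{P_i(\theta)},

    I(\theta) = \sum_{i=1}^{n} I_i(\theta),
    \qquad
    SE(\hat{\theta}) \approx \frac{1}{\sqrt{I(\theta)}},

so items can be added to a form until the assembled I(theta) meets or exceeds the target TIF at the ability values of interest (e.g., at the cutoff score for a classification exam).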
Scoring
As indicated earlier, the test forms in a CFT usually are constructed using
classical item statistics. Given the classical test theory methods of test
construction, the exams are scored based on the number or proportion of items
answered correctly. The examinee score may be reported as a scaled number-correct score. If there are multiple test forms, the scores can be equated using such methods as linear or equipercentile equating. And if necessary, cutoff
scores for each test form can be established by standard-setting committees.
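For reference, linear equating places a form X score x on the scale of form Y by matching standardized scores,

    y^{*} = \mu_Y + \frac{\sigma_Y}{\sigma_X}\,(x - \mu_X),

while equipercentile equating generalizes this by matching percentile ranks rather than only the means and standard deviations.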
IRT methods also can be used, for both constructing the test form and
scoring the examinee responses. The IRT item parameter estimates are not used
in a CFT during the actual exam for either item selection or scoring purposes. At
the conclusion of the fixed exam, however, the test can be scored based on an
IRT estimate of examinee ability, using the IRT item parameters and the
examinee's responses to the fixed set of items. This scoring procedure results in
an estimated ability score, which usually is converted to some type of scaled
score for reporting purposes. If parallel (i.e., equivalent) tests are constructed in
this manner, no equating is necessary. Cutoff or passing scores can still be set by
standard-setting committees.
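A minimal sketch of this kind of post hoc IRT scoring is shown below. It assumes previously calibrated two-parameter logistic item parameters and uses a simple grid search for the maximum-likelihood ability estimate; operational programs use more refined estimators (e.g., Newton-Raphson or Bayesian methods), so this is illustrative only, and the item parameters shown are hypothetical.

    import math

    def p_2pl(theta, a, b, D=1.7):
        """Probability of a correct response under the two-parameter logistic model."""
        return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

    def mle_theta(responses, items):
        """Grid-search maximum-likelihood ability estimate for a fixed set of items.

        responses: list of 0/1 item scores
        items:     list of (a, b) item parameter tuples, one per item
        """
        grid = [g / 100.0 for g in range(-400, 401)]      # theta from -4.00 to 4.00
        best_theta, best_loglik = None, float("-inf")
        for theta in grid:
            loglik = 0.0
            for u, (a, b) in zip(responses, items):
                p = p_2pl(theta, a, b)
                loglik += math.log(p) if u == 1 else math.log(1.0 - p)
            if loglik > best_loglik:
                best_theta, best_loglik = theta, loglik
        return best_theta

    # Hypothetical 5-item fixed form and one examinee's responses
    item_parameters = [(1.0, -1.0), (0.8, -0.5), (1.2, 0.0), (1.0, 0.5), (0.9, 1.0)]
    theta_hat = mle_theta([1, 1, 1, 0, 0], item_parameters)

The resulting theta estimate would then be converted to a scaled score for reporting, as described above.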
Whether they use classical or IRT methods, CFTs can be designed to provide
an immediate score. An official score may be offered at the conclusion of
testing, or a provisional score may be provided, with an official score to follow
by mail some weeks later. Provisional scores at the time of the exam are offered
when a testing agency or board wants the opportunity to confirm the examinee's
registration information, to conduct key validation, and to check for possible
problems such as test anomalies or test-center irregularities before providing an
official score. In these instances, the official score is released after the other
considerations have been addressed.
Measurement Characteristics
Each test-delivery method has certain measurement characteristics. These
measurement characteristics consider how each method addresses test length,
Test Length
When a testing program is developed, a decision must be made about the overall
length of the exams, considering the intended use of the test score and
reasonable limits on examinee time. The various test-delivery methods establish
test length and address the competing measurement concerns noted earlier in
different ways. For CFTs, as for paper-and-pencil exam administrations, the test
length is fixed, defined as the total number of items in the fixed form. Content
coverage, reliability, and security must then be achieved within that test specification.
For other test-delivery methods, test length may be defined as a range; that
is, in a variable-length adaptive test, a given exam may be constrained to have at
least a required minimum number of items and no more than a specified
maximum number of items. However, for any adaptive delivery method, even
the maximum number of items is likely to be less than the number of items in a
CFT or other fixed-form method. CFTs do not have the test-efficiency advantages of adaptive exams.
Content Constraints
The test specifications for most exam programs typically include requirements
for the content levels to be included in a test form, as well as the number or
proportion of items from each content level. For a fixed-form test-delivery
method like the CFT, these content constraints are addressed and satisfied
during the test-form assembly process within the requirement for test length. If
multiple forms are developed, each form is constructed to satisfy the content
rules specified in the test blueprint (while also meeting test length and reliability
requirements).
Other test-delivery methods may address content constraints in other ways.
For example, a CAT or CCT program may be designed to administer a
minimum number or proportion of items from each of the test's content
categories (Kingsbury & Zara, 1991). Alternatively, an adaptive test may be
designed to satisfy target information functions for each content level (Davey &
Thomas, 1996).
Reliability
Item/Test Security
Test security in the CBT environment potentially includes two new concerns not
typically present in standardized paper-and-pencil test administrations. First, the
more frequent administrations available in any CBT (i.e., continuous testing)
result in a new type of security concern : that items will become known over
time. (Actually, a paper-and-pencil exam program that continued to use a given
test form over multiple administration dates rather than regularly replacing used
test forms with new ones also could have this problem. But the problem may be
exacerbated by continuous testing environments typical of CBT programs,
where an examinee could pass on item information to another examinee who
would test the next day or next week, rather than three or four months later.)
For adaptive exams, an additional security issue is that individual items are
administered or exposed at different rates; some items are administered very
frequently and can become exposed quickly. This second factor is not an issue
for the fixed delivery methods. CFTs, like standardized paper-and-pencil tests,
administer items at a constant and predictable rate. Every item on a form is
exposed every time the form is administered. The overall exposure of an item is
easily predicted, based simply on the number of forms, the number of
administration dates, and the number of test-takers. If the item exposure is
regarded as too high, then additional test forms must be developed. In order to
keep exposure from becoming a security problem, multiple forms can be used,
and forms can be retired and replaced after a given number of uses. Item
exposure is far more complex for adaptive test-delivery methods. This problem
often is addressed statistically in adaptive delivery methods through the use of
item exposure control parameters. (The concept of an item exposure control
parameter will be detailed further in Chapter 8). Some operational exam
programs have begun to rotate entire item pools or to construct pools from larger
"vats" in order to limit item exposure and to ensure adequate security (Way,
1998).
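The exposure arithmetic for fixed forms really is this simple; the numbers below are purely illustrative assumptions, not taken from any particular program.

    # Illustrative exposure calculation for fixed forms (all numbers hypothetical).
    total_examinees = 12000          # examinees tested across all administration dates
    num_forms = 4                    # fixed forms rotated evenly across those dates
    examinees_per_form = total_examinees / num_forms

    # Every item on a form is exposed each time that form is administered, so an
    # item's exposure equals the number of examinees who received its form.
    item_exposure_count = examinees_per_form     # 3,000 examinees see each item
    item_exposure_rate = 1 / num_forms           # 0.25 of all examinees

If that rate is judged too high, adding forms (or retiring forms after a set number of uses) lowers it proportionally.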
There is one additional type of test security that is available to CFTs and not
to adaptive exams. The correct answers can be omitted from the test
administration software. This approach is only possible for fixed exams like the
CFT, as adaptive item selection procedures are dependent on interactive
computations of item scores and examinee ability estimates. However, a
limitation of test programs that do not store the correct answers is that they will
not be able to provide immediate scoring of the exams. Because of this, few
testing programs elect to address test security in this manner. In fact, it may not
be necessary as long as other security measures are taken. Restricted access to
the test software or to the machines on which the software has been installed,
password protection, and data encryption methods have proven to be effective
test security measures (Rosen, 2000; Shermis & Averitt, 2000; Way, 1998).
Set-Based Items
The CFT delivery method is fully able to incorporate set-based items. These sets
of items may be associated with a reading passage, graph, or other stimulus, or
for some other reason may have been designed to be administered together.
Although adaptive exams must modify their item selection procedures when test
material is set-based (see Chapter 8), fixed exams using classical test assembly
methods have no such difficulty. An exam can include one or more sets of items,
which can be scored using classical test scoring methods.
Practical Characteristics
There are a number of important considerations beyond those discussed thus far
that must be addressed when any exam is delivered through the computer mode.
These practical characteristics include the examinee volume, the initial
development effort required by a test-delivery method, the ease with which an
exam can be maintained across dual platforms, the accommodation of pretest items,
examinee reactions, and the probable cost of the CBT. Each of these factors is
discussed next for the CFT.
Examinee Volume
Initial Development
Dual Platform
Pretest Items
Each delivery method must accommodate the need for pretest items in some
fashion. Ideally, items for any computerized exam are pretested online. Over
time the items, test forms, and even the entire pool need to be retired and
additional items need to be developed. In order to support the pool
replenishment or replacement, continual item pretesting needs to be conducted.
In a CFT, the pretest items can be administered in several fashions. The non-
operational (and nonscored) items can be interspersed randomly within the
operational items, or they can be given as a set at either the beginning or the end
of the exam. After the items have been administered to a sufficient number of
examinees, item statistics are computed and the quality of the items is evaluated.
Those items that are found to be satisfactory can then be used to supplement the
pool and eventually replace used items. (Issues in item development and
pretesting are further discussed in Chapter 2.)
Examinee Reactions
The cost of developing and administering any CBT is often higher than typical
costs for paper-and-pencil exams. Some portion of that cost is usually passed on
to examinees in the form of higher examination fees. The amount charged an
examinee for any test is related to the developmental effort and expenses
incurred by the testing company or certification/licensure board. The amount
charged for a CBT is also related to the administrative expenses such as "seat
time." CFTs typically require a more modest developmental effort than other
test-delivery methods, but they do not have the efficiency advantage of adaptive
programs. The overall cost, and the cost borne by the examinee, will reflect
these and other specific expenses.
Summary
A summary of the test procedures, measurement characteristics, and practical
characteristics of the CFT delivery method is provided in Table 6.1. These
features highlight the method's strengths and weaknesses. Overall, a CFT
program is relatively easy to develop and maintain but cannot address security
well. In spite of this drawback, the CFT approach is the test-delivery method of
choice for many testing programs, particularly those that consist of low-volume, low-stakes tests without any particular need for measurement efficiency. For
exam programs that have greater security concerns or a need for testing
efficiency, however, it is a relatively poor choice.
An Example of the CFT Method

After the test blueprint for the exam program has been developed, items are
written to the specifications. Typically, it is a good idea to have many more
items written than are needed, as not all items produced will prove adequate.
Once a number of items have been generated, they can be reviewed by
additional subject matter experts, a fairness review panel, and test development
staff, to ensure that they appropriately address the subject matter, meet criteria
of cultural and ethnic group sensitivity, are clearly and correctly written, and are
consistent with the item format specifications for the exam. At this point, a test
form can be assembled and the items can be pretested or field-tested on
examinees.
Ideally, items are pretested on examinees similar in every characteristic to
the target test population, under motivated conditions, and in the mode in which
the items are to be delivered. For CBT programs, this means that it is best to
pretest items on computer, using the actual test-delivery software that will be
used during operational testing. Although this is the best-case scenario, many
CBT programs have had reasonable success using item statistics that were
obtained through paper-and-pencil administrations.
Once the items have been administered to a sufficient number of examinees,
item statistics can be computed. Table 6.3 displays the item statistics obtained
on a set of 15 pretest items for the sample application.
In addition to satisfying the content specifications detailed in the test
blueprint, the testing program may have statistical criteria. For example, no item
may be accepted for inclusion if the discrimination index is less than .20 (i.e., r(pbis) < .20), and each item difficulty measure must fall between .30 and .80 (i.e., .30 ≤ p-value ≤ .80).
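The sketch below shows one way these statistics could be computed and the acceptance criteria applied to a matrix of 0/1 responses; the point-biserial is computed here as a corrected item-total correlation (item score against the total on the remaining items), which is one common convention rather than a prescription from this chapter.

    import numpy as np

    def classical_item_stats(responses):
        """p-values and corrected point-biserial correlations.

        responses: examinees x items array of 0/1 scores.
        """
        responses = np.asarray(responses, dtype=float)
        p_values = responses.mean(axis=0)
        r_pbis = []
        for j in range(responses.shape[1]):
            rest_score = responses.sum(axis=1) - responses[:, j]   # total on the other items
            r_pbis.append(np.corrcoef(responses[:, j], rest_score)[0, 1])
        return p_values, np.array(r_pbis)

    def flag_items(p_values, r_pbis, p_low=0.30, p_high=0.80, r_min=0.20):
        """Flag any item whose statistics fall outside the acceptance criteria."""
        return [(p < p_low) or (p > p_high) or (r < r_min)
                for p, r in zip(p_values, r_pbis)]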
Given these requirements, several of the items in Table 6.3 have been
flagged with an asterisk (*). These items would be examined and, where
possible, revised to improve their performance. Typically, items that have been revised must then be pretested again to verify that the revision was effective and to obtain new item statistics. In the example, for illustrative purposes, three of
the flagged items (items 2, 8, and 9) were revised and new, satisfactory item
statistics were obtained. Upon examination, however, two of the items were
simply deleted from the pool (items 3 and 15). Table 6.4 displays the final item
pool (of 13 items) resulting from these changes.
In this test-development example, the final pool does not contain many more
items than are needed for a single (10-item) exam form. However, the items
available do allow full satisfaction of the test specifications for a single form. As
a final step, items are selected for inclusion in the computerized exam. These are
items 1, 2, and 4 for the three items required in the test blueprint (Table 6.2) for
content area A; items 6, 7, 9, and 10 for the four items needed from content area
B; and items 12, 13, and 14 as the three content area C items. These 10 items
will comprise the CFT. The items marked in Table 6.4 indicate the final test form.
Table 6.3 Statistics for the 15 pretest items (* = item flagged by the statistical criteria)

    Item   Content Area   p-value   r(pbis)
     1     A              .50       .55
     2     A              .40       .15    *
     3     A              .90       .05    *
     4     A              .80       .45
     5     A              .50       .40
     6     B              .70       .40
     7     B              .40       .35
     8     B              .20       .05    *
     9     B              .90       .40    *
    10     B              .70       .50
    11     B              .60       .60
    12     C              .30       .30
    13     C              .40       .35
    14     C              .50       .40
    15     C              .20       .15    *
Table 6.4 Final item pool of 13 items (X = item selected for the 10-item test form)

    Item   Content Area   p-value   r(pbis)   On Form
     1     A              .50       .55       X
     2     A              .40       .25       X
     4     A              .80       .45       X
     5     A              .50       .40
     6     B              .70       .40       X
     7     B              .40       .35       X
     8     B              .30       .30
     9     B              .80       .40       X
    10     B              .70       .50       X
    11     B              .60       .60
    12     C              .30       .30       X
    13     C              .40       .35       X
    14     C              .50       .40       X
If a second exam form were desired, only minor differences between the two forms could be achieved using the available items. A single item could be
changed in content area A and two items in content area B, but no item changes
could be made for content area C, given that the number of items available in the
pool exactly equals the number of items required in the test specifications. Due
to the constraints on content and the small pool, the overlap between test forms
would be very high. Within content area C, the overlap rate across two exam
forms would be 100%. Even though two exam forms were constructed, the
exposure rate for each item in content area C would also be 100%. It is
important to note that the item exposure rate is the rate at which an item is
administered; for fixed exam forms, an item is exposed once each time a form
on which it is present is administered.
In order to provide better test security by lowering item exposure and test
overlap rates, it would be necessary to develop a larger item pool. If a greater
number of items were written, reviewed, and pretested, then multiple forms
could be assembled. An additional goal in assembling multiple forms is to
ensure that the forms are equivalent in some sense. Ideally, it is desirable for the
average test form difficulty and variance to be identical across multiple forms.
This requirement can necessitate the availability of many more items in order to
satisfy it exactly.
Given a large enough item pool and a set of test specifications, one or more
test forms can be assembled manually. However, if more than a very few forms
are needed, a much better option would be to use the automated methods
discussed in Chapter 7.
Summary of Example
This example illustrated the test-delivery process for the CFT delivery method.
It is a fairly straightforward approach, having a great deal in common with the
test-delivery process used in most standardized, paper-and-pencil exam
programs. Additional considerations include preparing for examinee reactions,
addressing software issues, and managing administrative elements such as the
effects of continuous testing on test security.
The CFT method provides all of the advantages of computer-based test
administration, including more frequent exam administrations and the potential
to improve measurement through the use of innovative items. However, the CFT method provides none of the advantages of adaptive testing: tests are not tailored to the examinee and cannot be shortened. Furthermore, given that one or only a
few fixed forms are usually available and that they are offered on a more
frequent basis, no security advantage is typically present. Test-delivery methods
that provide greater security and/or efficiency advantages are covered in the next
chapters.
References
Adair, J. H., & Berkowitz, N. F. (1999, April). Live application testing: Performance
assessment with computer-based delivery. Paper presented at the annual meeting of
the American Educational Research Association, Montreal.
Crocker, L., & Algina, J. (1986). Introduction to Classical and Modern Test Theory. Fort
Worth: Holt, Rinehart & Winston.
Davey, T., & Thomas, L. (1996, April). Constructing adaptive tests to parallel
conventional programs. Paper presented at the annual meeting of the American
Educational Research Association, New York.
Kingsbury, G. G., & Zara, A. R. (1991). A comparison of procedures for content-
sensitive item selection in computerized adaptive tests. Applied Measurement in
Education, 4, 241-261.
Rosen, G. A. (2000, April). Computer-based testing: Test site security. Paper presented at
the annual meeting of the National Council on Measurement in Education, New
Orleans.
Shermis, M., & Averitt, J. (2000, April). Where did all the data go? Internet security for
Web-based assessments. Paper presented at the annual meeting of the National
Council on Measurement in Education, New Orleans.
Thissen, D. (1990). Reliability and measurement precision. In H. Wainer (ed.),
Computerized Adaptive Testing: A Primer (pp. 161-186). Hillsdale, NJ: Lawrence Erlbaum.
Way, W. D. (1998). Protecting the integrity of computerized testing item pools.
Educational Measurement: Issues and Practice, 17, 17-27.
7
Automated Test Assembly for
Online Administration
Test Procedures
Test Assembly
In general there are two types of ATA that can be used to build tests
automatically. The first is based on a mathematical concept often referred to as
Scoring
1. The 1998 Van der Linden reference also contains an excellent reference list of
many of the most recent methods and procedures for ATA or optimal test
assembly.
the CAT or CCT, in which variability in test length, item selection, and item
difficulty makes traditional scoring methods inappropriate in many situations.
The ATA number-correct score is easy to understand and requires no score
adjustment between forms. It also doesn't have to be converted into a scaled
score; the number-correct metric is readily interpretable.
As we have mentioned, constructing the tests with the ATA approach described does not guarantee that the tests will be strictly parallel for each examinee. Parallel tests are defined as having the same mean
difficulty and variability for each individual examinee within the population. For
the classical construction problem (given by example later in this chapter), the
constructed tests can only claim a general equivalence in terms of average test
difficulty and observed score variability.
In IRT construction, the use of a TTIF does not guarantee parallel tests in the
classical sense, but it does produce tests that are nominally or weakly parallel
(Samejima, 1977).
The only real requirement in terms of item pool size for ATA tests is that the
pool be large enough to support the construction of at least one test form.
Obviously, test overlap is a function of pool size. In general the larger the pool,
the smaller the test overlap rate. However, not every testing program has the
luxury of developing and maintaining a large item pool, and many programs
must tolerate fairly small pools, even when there are many test content
specifications that must be met.
Similarly, pool quality isn't always a major consideration for some testing
programs. Many programs simply construct multiple forms that follow a reference test (a previously administered test form), even though that form is less than ideal. However, if the multiple test forms were constructed to meet a
specific psychometric goal, such as to provide maximum accuracy and precision
at the passing score or pass point, then the item pool should consist of items that
discriminate well at that score or point.
Measurement Characteristics
in the ATA process. The type of test reliability reported or used in the ATA
process depends on whether a classical or IRT approach is used. In the classical
construction of equivalent test forms, internal reliability, as measured by such
indices as KR-20 or coefficient alpha, is appropriate. In an IRT setting, the
target test information function or target TIF can be used to construct a test with
a certain precision at various points along the ability continuum.
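For reference, coefficient alpha for a k-item form with item variances sigma_i^2 and total-score variance sigma_X^2 is

    \alpha = \frac{k}{k - 1}\left(1 - \frac{\sum_{i=1}^{k} \sigma_i^{2}}{\sigma_X^{2}}\right),

with KR-20 as the special case for dichotomously scored items, where sigma_i^2 = p_i(1 - p_i).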
Item/Test Security
In general, item or test security for ATA can be controlled by limiting the
number of times an item is included on multiple test forms. Ideally, each item
would be limited to a single form, thereby minimizing each item's exposure.
This would also ensure that the test overlap rate would be an optimal minimum,
in this case zero.
While an item exposure rate that results in a zero test-overlap rate is ideal, it
is rarely obtainable. Usually, item pool size is small relative to desired test
length, and some overlap between test forms is inevitable. In addition, because of
content requirements, items may be shared across test forms simply because
there are so few items available for inclusion in a particular content category.
There are two approaches to controlling item inclusion on test forms. The
first is to simply limit or restrict item usage on test forms or to use item
inclusion (or exclusion) as a constraint in the ATA process itself. For example,
in addition to the usual content classification and psychometric properties, an
item can have associated with it an inclusion number designating the maximum
percentage of times that the item can appear across multiple test forms. The
second approach is to allow each item to have an item exposure parameter or
rate to be associated with it. This parameter is an estimate of a conditional
probability, the probability that the item will be included on a test form given
that it has been selected to be included on the form, or P(I|S). This probability
can be estimated and stored with the item's other parameters (e.g., content
classification, psychometric properties, etc.) and used during the test assembly
process to limit the number of times any one item can be included across forms.
More discussion on this topic follows the ATA example later in this chapter.
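A minimal sketch of the second approach is given below. It interprets the stored exposure parameter as the probability that a selected item is actually admitted to the form, in the spirit of the exposure control parameters discussed further in Chapter 8; the item identifiers and parameter values are hypothetical.

    import random

    # Hypothetical exposure parameters, stored alongside each item's other properties.
    exposure_params = {"item_017": 0.60, "item_042": 1.00, "item_103": 0.35}

    def admit_selected_item(item_id, params, rng=random):
        """Probabilistic filter: a selected item is placed on the form only with
        probability equal to its exposure control parameter P(I | S)."""
        return rng.random() < params[item_id]

    selected = ["item_017", "item_042", "item_103"]    # items chosen by the assembly step
    admitted = [i for i in selected if admit_selected_item(i, exposure_params)]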
Set-Based Items
Items in an ATA pool can either be discrete items or occur as a member of a set,
where an item set is defined as a collection of items that usually refer to the
same stimulus. In many applications of test construction, general rules apply
whereby items enter the test individually or in some combination. For example,
a three-item set can be considered for test inclusion as (1) three individual items,
(2) one out of three, (3) two out of three, or (4) a complete set. Other constraints
can be used to exclude certain items from appearing with other items in the pool.
These antagonistic items are sometimes referred to as item enemies.
Practical Characteristics
Examinee Volume
Multiple equivalent test forms allow for increased testing volumes and/or
increased testing frequency and therefore are ideal for large numbers of
examinees. However, once the test forms have been constructed, large numbers
of examinees are not necessary to maintain the program unless new items are to
be added to the ATA item pools. In this case new item statistics must be fairly
stable, which usually implies that they be based on large numbers of examinees
per item.
Initial Development
Because tests can be constructed using classical item statistics, the development
of multiple forms can be accomplished quickly. Once the test forms are
delivered online, item responses can be collected and the pool eventually can be
calibrated using IRT methods. An IRT calibrated pool may be more stable
across different examinee populations because the item statistics can be scaled
to the same metric. Once put on the same scale, IRT item parameter estimates
are considered invariant across different examinee populations (i.e., they are
considered to be the same for all examinees regardless of their ability levels).
Dual Platform
ATA methods offer the best option for constructing multiple test forms that can be delivered online on a computer platform and by traditional paper-and-pencil methods. It is relatively easy to construct several test forms and designate
some for online administration and others for a paper-and-pencil format.
Pretest Items
well, they can be added to the item pool for future use as scored or operational
items.
Examinee Reactions
Examinees are not aware that they are taking test forms that have been
constructed to be equivalent to other forms. Therefore, their test-taking behavior
or patterns should not change from those they experience in the traditional
paper-and-pencil format. Thus, the examinees are allowed to review items
previously answered and even to change their answers.
The examinees may experience a sense that each test form administered is
somewhat unique because there will be multiple forms and each form may be
administered in a scrambled format (i.e., one in which item order is determined
randomly by the computer). Following the testing sessions, examinees that tend
to discuss the test with others may sense that they received different forms.
Thus, the perception of unique test forms can improve overall test security.
Cost to Examinee
Because ATA tests are fixed in terms of their length, examinee costs for online
computerized administration are easy to estimate. However, because the forms
are constructed to parallel the fixed length of a reference form, ATA tests
normally do not offer any savings in testing time to examinees.
Summary
A summary of the test procedures, measurement characteristics, and practical
characteristics of the ATA method is provided in Table 7.1. These features
highlight the method's strengths and weaknesses. Overall, an ATA program is
relatively easy to develop and maintain, and it addresses security issues well by
constructing multiple test forms. It also handles both classical and IRT
construction methods.
An Example of the ATA Method
The WDM heuristic is illustrated with a simple, classical ATA problem.²
Table 7.2 provides a sample item pool consisting of 10 test items, or N(pool) = 10.
Each item has a classical difficulty index or p-value and a classical
discrimination index, the point-biserial correlation coefficient or r_pbis. We will
assume that these indices were calculated on large samples of examinees and
that the examinee populations did not change over time. In addition to the
statistical characteristics, each item has been classified into one of two content
categories, A or B.
²For an example of an ATA problem using an IRT approach, see Stocking,
Swanson, and Pearlman (1993).
Table 7.2. Sample Item Pool

Item   Content     p     r_pbis
  1       A       .50     .30
  2       A       .40     .15
  3       A       .90     .05
  4       A       .80     .45
  5       A       .70     .40
  6       B       .50     .40
  7       B       .40     .35
  8       B       .20     .05
  9       B       .90     .40
 10       B       .70     .40
The test construction task is defined in terms of a set of conditions that must be
met for each test form selected. Earlier, these conditions were referred to as test
constraints. There are really two different types of constraints: (1) content
constraints and (2) statistical or psychometric constraints. A third condition, test
length or n, is sometimes referred to as a constraint, but in the current context it
is really only a constant. It is assumed that the test length is fixed rather than
variable.
A simple content outline for this test might consist of the following:

    Content area A: 40% of the items
    Content area B: 60% of the items

This content outline dictates that .4n items should be from content area A and
the remaining .6n items should come from content area B. These would be the
content constraints for this test construction problem.
In terms of the statistical or psychometric properties of the test, suppose that
it is desirable to have the overall test difficulty be defined as the sum of the p-values
across the n items. For example, we might state that we want the average item
difficulty of the test to be between .60 and .70. Or, stated another way, we want
the expected observed test score or first moment of the observed score
distribution to be between .60n and .70n. Another psychometric constraint might
involve the variability of this distribution. We might stipulate that tests must be
constructed in such a way that the standard deviation of observed test scores or
Sx is greater than some value but less than another.
Usually these upper (U) and lower (L) bounds of the psychometric
constraints are taken from previously administered test forms. A form that is
used as the basis for constructing other forms is sometimes referred to as the
target form , reference form, or domain-referenced item set. Again, referring to
our example from Table 7.2, suppose the following constraints have been
proposed:
Constraints
0. Test length, n = 5 items
1. 40% A items, or A = 2 (i.e., 2 ≤ A ≤ 2)
2. 60% B items, or B = 3 (i.e., 3 ≤ B ≤ 3)
3. 3.00 ≤ Σp ≤ 3.50 (test difficulty for a 5-item test)
4. .50 ≤ S_x ≤ .60 (standard deviation of observed test scores)
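As a concrete check on these constraints, the short sketch below evaluates a candidate five-item test against them using the Table 7.2 statistics. It assumes, following Gulliksen (1950), that the observed-score standard deviation can be computed from the item statistics as S_x = Σ r_pbis,i √(p_i(1 - p_i)); under that assumption, the test ultimately selected in this example (items 2, 9, 10, 7, and 3) satisfies all four constraints.

```python
import math

# Table 7.2: item -> (content, p, r_pbis)
POOL = {1: ("A", .50, .30), 2: ("A", .40, .15), 3: ("A", .90, .05),
        4: ("A", .80, .45), 5: ("A", .70, .40), 6: ("B", .50, .40),
        7: ("B", .40, .35), 8: ("B", .20, .05), 9: ("B", .90, .40),
        10: ("B", .70, .40)}

def meets_constraints(items):
    """Check a candidate test against constraints 1-4 of the example."""
    n_a = sum(1 for i in items if POOL[i][0] == "A")
    n_b = sum(1 for i in items if POOL[i][0] == "B")
    sum_p = sum(POOL[i][1] for i in items)
    # Gulliksen (1950): S_x = sum over items of r_pbis * item standard deviation
    s_x = sum(POOL[i][2] * math.sqrt(POOL[i][1] * (1 - POOL[i][1]))
              for i in items)
    return (n_a == 2 and n_b == 3
            and 3.00 <= sum_p <= 3.50
            and .50 <= s_x <= .60)

print(meets_constraints([2, 9, 10, 7, 3]))   # True for the selected test
```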
Random selection of the first item guarantees that the same test form won't be constructed repeatedly.
This is why the ATA procedure is not accurately described by the phrase
domain sampling. Except for the first item, selection from the item pool is not
the result of any statistical sampling process. For our example, we will assume
that item 2 has been selected at random to begin the process.
If this were the only test that had to be drawn, we might question whether this
particular test was the best one that could have been constructed. Swanson and
Stocking (1993) suggested that a replacement phase could be implemented
in which we could consider adding one of the remaining five items, t = 1, 4, 5, 6,
and 8, by first evaluating their sums of weighted deviations, as computed by q_j
and s_t. The only difference is that in the formula for q_j, the future term is
ignored because all n items are already in the test. Therefore, q_j = Σ_i a_ij x_i + a_tj, and
the summation is over the items in the pool as before. When s_t is evaluated for
the five remaining items, item 8 yields the smallest weighted sum of positive
deviations when it is added to the five-item test, so it will be the item considered
for addition to the test next, provided that another one can be removed (see
Table 7.7).
When item 8 is entered into the test, the length of the test is now (tentatively)
n + 1. We next compute q_j = Σ_i a_ij x_i + a_tj and s_t for the items provisionally in the
test, or for items 2, 3, 7, 8, 9, and 10. However, we will evaluate s_t as each of
these items is removed from the test in order to find the item whose removal will
most reduce the weighted sum of positive deviations.
When each item, t = 2, 3, 7, 8, 9, and 10, is removed from the calculation of s_t
in Table 7.8, we find that item 8 shows the smallest weighted sum of
positive deviations. Thus, item 8 would be the most logical item to remove from
the test, and the test would still consist of the original items selected, 2, 3, 7, 9,
and 10.
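The add-and-remove bookkeeping of the replacement phase can be sketched in a few lines. The code below is a simplified illustration, not Swanson and Stocking's full algorithm: the future term is omitted (as in the text, because a full-length test is being evaluated), and the constraint attributes are computed as in this example, namely two content counts, the sum of the p-values, and the Gulliksen-based S_x. The bounds, weights, and helper names are taken from, or assumed for, the example above.

```python
import math

POOL = {1: ("A", .50, .30), 2: ("A", .40, .15), 3: ("A", .90, .05),
        4: ("A", .80, .45), 5: ("A", .70, .40), 6: ("B", .50, .40),
        7: ("B", .40, .35), 8: ("B", .20, .05), 9: ("B", .90, .40),
        10: ("B", .70, .40)}

# (lower, upper) bounds for: count of A items, count of B items, sum of p, S_x
BOUNDS = [(2, 2), (3, 3), (3.00, 3.50), (.50, .60)]
WEIGHTS = [1.0, 1.0, 1.0, 1.0]        # set to [1, 1, 1, 5] to reweight constraint 4

def attributes(items):
    """Constraint attribute totals for a provisional test."""
    n_a = sum(POOL[i][0] == "A" for i in items)
    n_b = sum(POOL[i][0] == "B" for i in items)
    sum_p = sum(POOL[i][1] for i in items)
    s_x = sum(POOL[i][2] * math.sqrt(POOL[i][1] * (1 - POOL[i][1]))
              for i in items)
    return [n_a, n_b, sum_p, s_x]

def weighted_positive_deviations(items):
    """Sum of weighted deviations that fall outside the constraint bounds."""
    return sum(w * (max(0, lo - q) + max(0, q - hi))
               for q, (lo, hi), w in zip(attributes(items), BOUNDS, WEIGHTS))

test = [2, 3, 7, 9, 10]
# Replacement phase: which remaining item is least harmful to add?
candidates = [i for i in POOL if i not in test]
best_add = min(candidates, key=lambda t: weighted_positive_deviations(test + [t]))
# ...and which item of the enlarged test is best removed?
enlarged = test + [best_add]
best_drop = min(enlarged, key=lambda r:
                weighted_positive_deviations([i for i in enlarged if i != r]))
print(best_add, best_drop)
```

With the unit weights used here, the sketch reaches the same conclusion as the text: item 8 is the best candidate for addition, and item 8 is also the item whose removal most reduces the weighted deviations, so the original selection stands.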
We might wonder if the initial selection of items could have been better, in
terms of meeting the constraints, if other weight values had been used. We
mentioned previously that one reason for making the deviation weights unequal
is to emphasize one or more constraints over the others. In addition, Van der
Linden (1998) pointed out that another reason to make the weights unequal is to
balance the effect of the weights if the constraint metrics are unequal. In the
previous example, we note that the first three constraints have lower and upper
bounds that are approximately the same magnitude. However, the fourth
constraint, that of the standard deviation of the observed test score, is about one-
fifth the magnitude of the others. If we let w1 = w2 = w3 = 1.0 and w4 = 5.0, we
will ensure that all four constraint deviations will contribute to s_t in
approximately the same proportion. If the weights in the previous example are
changed accordingly, the reader can verify that the test selected would change so
that the items selected would have been 2, 9, 10, 8, and 4. The new test would
have specifications that again fall within all four constraint bounds.
Compared to the original item selection of items 2, 9, 10, 7, and 3, this test is
slightly more difficult and has a larger standard deviation of observed test
scores. However, both tests satisfy the constraints. After implementing the
replacement phase, the test would remain as originally constructed.
As mentioned previously, by making a weight or weights considerably large
relative to the remaining weights, the test constructionist also can emphasize one
or more test characteristics over the others . A word of caution is required,
however. If the goal is to construct many equivalent tests with minimal test
overlap (i.e., percentage of shared items), the emphasis of one or more
constraints may yield tests with many repeated items across test forms.
In our example, we showed how the Swanson and Stocking WDM heuristic
could be used to construct a test using classical item characteristics. The WDM
approach also can be used to construct a test using IRT characteristics. Here, the
goal is to achieve weak parallelism by constructing tests whose test information
function approximates a target test information function or TTIF (Samejima,
1977) while also meeting all of the other constraints (e.g., content
specifications). The IRT procedure differs from the classical procedure only by
the way in which the psychometric constraints are defined.
A number of points are defined on the unidimensional ability scale, θ. For
example, if 13 such points are defined, we might choose them to span [-3.0,
+3.0] in increments of .5 (i.e., θ = -3.0, -2.5, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5,
1.0, 1.5, 2.0, 2.5, and 3.0). Then there would be 13 psychometric constraints,
and these would be lower and upper bounds of the TTIF at these values of θ.
The magnitude of the vertical distance between boundaries of the TTIF need not
be equal. Table 7.9 contains values of a sample TTIF along with arbitrary lower
and upper bounds of the TTIF at 13 values of θ.
Once these psychometric characteristics and bounds have been defined, the
test can be constructed using steps 1, 2, and 3 as described previously.
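The psychometric check under the IRT formulation can be sketched as follows. The sketch assumes the usual three-parameter logistic information function with scaling constant D = 1.7; the item parameters, θ grid, and TTIF bounds shown are placeholders rather than the values of Table 7.9, which is not reproduced here.

```python
import math

THETAS = [x / 2 for x in range(-6, 7)]     # -3.0 to +3.0 in steps of .5

def info_3pl(theta, a, b, c, D=1.7):
    """Item information for the three-parameter logistic model."""
    p = c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def within_ttif_bounds(items, lower, upper):
    """Does the test information function stay inside the TTIF bounds at
    every theta point? items: list of (a, b, c); lower/upper: one bound
    per value in THETAS."""
    for theta, lo, hi in zip(THETAS, lower, upper):
        tif = sum(info_3pl(theta, a, b, c) for a, b, c in items)
        if not lo <= tif <= hi:
            return False
    return True

# Placeholder items and bounds, for illustration only
items = [(1.0, -1.0, .20), (1.2, 0.0, .15), (0.9, 1.0, .20)]
lower = [0.0] * len(THETAS)
upper = [5.0] * len(THETAS)
print(within_ttif_bounds(items, lower, upper))
```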
We have already discussed the fact that for our sample item pool in Table 7.2,
it is possible to create or construct (5-choose-2) × (5-choose-3) = 100 tests that would
meet content specifications (i.e., that would contain two items from content
area A and three items from content area B) regardless of the psychometric
properties of the items. We might be interested to know what the test overlap
rate would be (1) between any pair of tests and (2) on average over the entire
set of 100 tests. The test overlap rate is defined as the percentage of
shared items that occur between tests of a fixed length (i.e., in this case, n = 5).
In an ideal situation, we would like to minimize this rate between any two tests
that we construct. However, we must keep in mind that there exists this baseline
rate that we can estimate and use as a point of reference to achieving this goal as
we begin to apply the other constraints. Obviously, the overlap rate will increase
as we begin to consider other constraints because some items will be selected at
a higher rate. The rate at which an item is selected and then appears on an ATA
form is called the item's exposure rate; it is discussed later.
The expected value of the baseline overlap rate, or E[BOR], of a set of
fixed-length tests can be computed as E[BOR] = (1/n) Σ_j (m_j²/C_j), where m_j items
are drawn at random from the C_j pool items in content area j (see the chapter
appendix for the derivation).
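For the sample pool, in which two of the five content-A items and three of the five content-B items are drawn for each form, this baseline works out to roughly half of the test. The sketch below, with helper names of our own choosing, simply evaluates the expression given above.

```python
def expected_baseline_overlap(m, C, n):
    """E[BOR] = (1/n) * sum_j m_j**2 / C_j for independent random draws."""
    return sum(mj ** 2 / cj for mj, cj in zip(m, C)) / n

# Two items from the five A items, three from the five B items, n = 5
print(expected_baseline_overlap(m=[2, 3], C=[5, 5], n=5))   # 0.52
```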
Summary of Example
This example illustrated the major steps in the process of automated test
assembly or construction of a test form from a sample item pool. In practice
many such forms may be assembled so that they can be assumed to be
equivalent in some sense . Under certain conditions, the forms may even be
assumed to be parallel and have the same passing score or standard. The
construction of multiple equivalent test forms makes the dual platform format
(i.e., simultaneous paper-and-pencil administration plus computer-based
administration) much easier to implement.
Chapter 7 Appendix
Each content area j, j = 1, 2, ..., J, is independent of the other content areas. We
will show the expected value of the baseline overlap rate or E[BOR] for one
content area first and then show that E[BOR] for the entire test of length n is
just the sum of these expected values. We define the random variable Y to be the
number of identical items between any two (paired) tests, so that Y ÷ n is the
observed overlap rate for those tests. Possible values for the random variable Y
or y could be 0, 1, ..., n, where y = 0 would imply no shared items between any
two forms, while y = n would be complete overlap or identical test forms. We
desire the value of E[Y] and, ultimately, (1/n)E[Y].
There are C_j items in each content area, and we wish to draw m_j items at
random from each content area. We abbreviate these values (for a fixed j) as
simply C and m and recognize that within each content area j there are X_j
overlapping items (or simply X for abbreviation). Each X is distributed as a
hypergeometric random variable. For any content area, there are (C choose m) possible
combinations of the m items selected from the C items in that content area.
Repeated draws will produce X items that are identical. Therefore,

    Prob(X = x) = (m choose x)(C - m choose m - x) / (C choose m)

is the probability that X, the number of shared items in content area j, will be
equal to x = 0, 1, ..., m. The expected value of X (Ross, 1976) is

    E(X) = m(m/C) = m²/C,

where E(X) is actually equal to E(X_j) and E(X_j) = m_j²/C_j for the jth content area.
The expected value of Y over all J content areas is E(Y) = Σ_j E(X_j) = Σ_j (m_j²/C_j),
while the expected value of the baseline overlap rate, Y/n, is E(Y/n) = (1/n)E(Y) = (1/n) Σ_j (m_j²/C_j).
References
Chen, S., Ankenmann, R., & Spray, J. (1999). Exploring the Relationship Between Item
Exposure Rate and Test Overlap Rate in Computerized Adaptive Testing. (ACT
Research Report Series No. 99-5). Iowa City: ACT, Inc.
Davey, T., & Parshall, C. G. (1995, April). New Algorithms for Item Selection and
Exposure Control with Computerized Adaptive Testing. Paper presented at the
annual meeting of the American Educational Research Association, San Francisco.
Gulliksen, H. (1950). Theory of Mental Tests. New York: Wiley.
Ross, S. (1976). A First Course in Probability. New York: Macmillan Publishing Co., Inc.
Samejima, F. (1977). Weakly parallel tests in latent trait theory with some criticisms of
classical test theory. Psychometrika, 42, 193-198.
Stocking, M. L., Swanson, L., & Pearlman, M. (1993). Application of an automated item
selection method to real data. Applied Psychological Measurement, 17, 167-176.
Swanson, L., & Stocking, M. L. (1993). A method and heuristic for solving very large
item selection problems. Applied Psychological Measurement, 17, 151-166.
Van der Linden, W. (1998). Optimal assembly of psychological and educational tests.
Applied Psychological Measurement, 22, 195-211.
8
Computerized Adaptive Tests
Test Procedures
Test Assembly
maximum information criterion can be viewed as a special case under which all
weight is massed on the single column of the table that contains the provisional
ability estimate.) The WI is similar to the MPP in acknowledging that
provisional ability estimates are subject to error. However, while WI is not as
computationally simple as MI, it is much simpler than MPP. This is because
Owen's (1969, 1975) approximation to the posterior ability distribution can be
used efficiently to compute the weights.
Stopping Rules
Current CAT test administration methods fall into two basic categories. These
two types of CATs are defined by their stopping rules; they are fixed-length and
variable-length tests. A fixed-length CAT administers the same number of items
to each examinee. Different examinees therefore may be tested to different
levels of precision, just as they would be by a conventional nonadaptive test.
Examinees who are more easily "targeted" by their selected test, either because
they respond more predictably or because their ability falls where the CAT item
pool is strong, are measured more precisely than poorly targeted examinees. In
contrast, a variable-length CAT tests each examinee to a fixed level of precision
even if this requires administering different numbers of items to different
examinees. Well-targeted examinees generally receive shorter tests than poorly
targeted examinees.
Maximum Likelihood
The maximum likelihood estimate of ability is determined by finding the modal
or maximum value of the likelihood function. (Further details about likelihood
functions can be found in the IRT appendix.) The MLE is unstable for short tests
and even is often unbounded (e.g., it may take on values of ±∞). It has a
relatively minor outward (centrifugal) bias (i.e., the MLE tends to be slightly overestimated
for high abilities and slightly underestimated for low abilities), and multiple
modes are occasionally a problem. It also requires a lengthier computation than
Bayesian methods, although this is of minimal importance with fast computers.
3 "True" latent ability can be defined as the score an examinee would attain on a
test much longer than that actually being administered.
Bayes Estimates
An arguably more proper approach to estimating an examinee's ability from
the likelihood function is to make use of Bayes' theorem. This states that, in
general,

    Prob(B | A) = Prob(A | B) Prob(B) / Prob(A).
In the EAP, or Bayes mean, approach, the mean of the posterior distribution
is computed as the point estimate of ability. In the MAP, or Bayes mode,
approach, the mode or maximum value taken on by the posterior is used. The
EAP, unlike the MLE approach, is quite stable for short tests and is always
bounded, but it does have some inward (centripetal) bias. In other words, for high abilities
it produces underestimates, and for low abilities it produces overestimates.
While an unbiased method would be best, this bias is at least in the "right"
direction (i.e., it pulls the estimates back from being too extreme and more
toward the center of the distribution of θ). However, computation for the EAP is
quite lengthy. The MAP, like the EAP, is relatively stable for short tests and is
always bounded. It also has a moderate centripetal bias, and multiple modes are
infrequent. An advantage of the MAP over the EAP is that a quick computational
approximation, called Owen's Bayes method, is available (Owen, 1969, 1975).
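The contrast between the MLE and the Bayes mean can be made concrete with a small grid calculation. The sketch below is illustrative only: it assumes a three-parameter logistic model with D = 1.7, a standard normal prior for the EAP, and made-up item parameters and responses.

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def likelihood(theta, items, responses):
    """Product of P or Q over the administered items."""
    like = 1.0
    for (a, b, c), u in zip(items, responses):
        p = p_3pl(theta, a, b, c)
        like *= p if u == 1 else (1 - p)
    return like

def mle_and_eap(items, responses, grid=None):
    grid = grid or [g / 10 for g in range(-40, 41)]       # -4.0 to 4.0
    like = [likelihood(t, items, responses) for t in grid]
    mle = grid[like.index(max(like))]                     # modal value on the grid
    # EAP: posterior mean under a standard normal prior
    post = [l * math.exp(-t * t / 2) for l, t in zip(like, grid)]
    eap = sum(t * w for t, w in zip(grid, post)) / sum(post)
    return mle, eap

# Hypothetical three-item test and responses (1 = correct, 0 = incorrect)
items = [(1.0, -0.5, .20), (1.2, 0.0, .15), (0.8, 0.5, .25)]
print(mle_and_eap(items, [1, 1, 0]))
```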
A CAT testing program is often very demanding in terms of the size of the item
pool required. This is due most directly to the uneven item exposure typical of
CAT item selection algorithms. It is also the result of the fact that CATs are
frequently used for high-stakes exam programs and are offered in a continuous
or on-demand test setting. The high-stakes characteristic of an exam program
means that item exposure is a serious test security concern, while continuous
testing results in the continuous exposure of items. Stocking (1994) recommends
that the item pool for an exam program contain 12 times the number of items in
an average CAT. For licensure and certification testing, Way (1998) suggests,
for a number of reasons, that a pool size of six to eight times the average CAT
length might be adequate.
A large item writing and pretesting effort is needed to support the test
security needs of most CATs. Over time, exposed items need to be retired and
new items added to the pool. In some current exam programs, entire item pools
are rotated or retired. Elaborate methods are being developed for assembling
parallel item pools and for rotating, replenishing, and redistributing items across
these pools (Way, 1998).
For a latent ability estimation exam, such as a CAT, it is necessary for the
item pool to contain not merely a large number of items, but items that span a
wide range of difficulty and provide adequate content coverage. The quality of
the items is always important. And in testing programs that use the 3-PL model,
the distribution of the items' a-parameters is also important.
Measurement Characteristics
In adaptive tests, the concepts of test length and the reliability or precision of the
estimated score are closely linked. Therefore, these two concepts are presented
together in this chapter. The relative advantages and disadvantages of fixed- and
variable-length adaptive tests have been debated elsewhere. Arguments favoring
fixed-length tests cite the method's simplicity and its avoidance of a particular
type of measurement bias (Stocking, 1987). Proponents of variable-length tests
contend that such tests are more efficient and allow test measurement properties
to be precisely specified (Davey & Thomas, 1996; Thompson, Davey & Nering,
1998). Both views are briefly summarized herein.
As its label suggests, a fixed-length adaptive test administers the same
number of items to each examinee. The number of items administered is
determined by weighing such factors as content coverage, measurement
precision, and the time available for testing. Measurement precision is usually
specified in the aggregate, or averaged across examinees at different latent
ability levels (Thissen, 1990). However, the measurement models that underlie
adaptive tests recognize that precision varies across examinees. Examinees
whose latent ability levels are identified quickly and accurately can be
repeatedly targeted with items of an appropriate difficulty and consequently
measured very efficiently and reliably. Examinees whose performance levels are
located in a range where an item pool is particularly strong are also likely to be
well measured. Conversely, examinees that are difficult to target or whose latent
ability levels fall where the item pool is weak are measured more poorly.
The function traced by measurement precision over latent ability level can be
manipulated in limited ways by test developers. Item pools can be bolstered
where they are weak and weakened where they are unnecessarily strong. Test
length can be shortened or lengthened. Item selection and exposure control
procedures can be finessed. However, the level of control is far short of
complete , leaving conditional measurement precision more a function of chance
than of design.
Variable -length tests allow measurement precision to be addressed directly
by using it as the criterion to determine when a test ends. Rather than
Item/Test Security
Balancing item content through a test blueprint serves two general purposes.
First, it helps assure that the test shows content or construct validity evidence .
For example, test developers may decide that a geometry test simply must
include an item that uses the Pythagorean theorem. The required content is such
an important component of the domain from which the test is drawn that
reasonable judgment dictates its inclusion . A content blueprint also serves to
ensure that alternate forms of the same test are as nearly parallel as possible.
Balancing item content across forms increases the likelihood that each form
measures the same composite of skills and knowledge.
Blueprints developed for conventional tests often are applied directly to the
CAT delivery method in hopes of deriving the same benefits. However, care
should be taken to see that the conventional blueprint is either necessary or
sufficient to control item content during a CAT. For example, if the CAT item
pools are truly unidimensional, content balance is more a matter of "face
validity" than anything else. Absolute unidimensionality results from
responses being driven solely by the examinee's ability, whatever an item's
content. Thus, distinctions drawn between items on content grounds are wholly
artificial, existing in theory but not in the response data. Balancing item content
would result only in lowered test efficiency.
Truly unidimensional item pools are rarely, if ever, encountered in practice.
Most pools cover a range of item content and, as a result, are no more
unidimensional than conventional tests. Balancing the content of the tests drawn
from typical pools, therefore, can provide the same advantages to a CAT as it
provides to conventional tests. For example, suppose an item pool included two
content domains, with the first consisting of primarily easy items and the second
more difficult items. If item content were not balanced, lower-ability examinees
would be tested almost exclusively with items from the easier domain . The
converse would be true for high-ability examinees. Scores for these two groups
of examinees subsequently would have different meanings and interpretations,
reflecting the differences in content between the two domains. Balancing item
content makes it more likely that the tests selected for different examinees are
parallel and will produce comparable scores . This is directly analogous to
constructing parallel alternate forms of a conventional test by following the
same content blueprint.
Consider, for example, the test specifications presented in Table 8.1. With a
conventional test scored by number correct, 20% of every examinee's score is
determined by the examinee's performance on trigonometry items. However, in
a CAT, because these items tend to be difficult, they may never be selected for
low-ability examinees, while able examinees would see little else. To ensure that
each examinee is measured on the same ability composite, we need to force the
selection of what are often inappropriately easy or difficult items.
There are several possible approaches to content balancing. In the split pool
approach, the item pool can be divided into more unidimensional subpools and
separate CATs administered from each . In the menu approach, separate
information tables can be maintained and alternated in their use according to
some plan. In the optimal selection method , a penalty function or integer-
programming algorithm can be used to select an optimal test that balances both
statistical and content concerns.
With the most common form of CAT content balancing, the numbers or
relative proportions of items of each content type are constrained to certain
ranges (Stocking & Swanson, 1993). However, balancing the proportions of
items administered during a CAT does not necessarily balance the influence that
each content domain has in determining the final ability estimate. Ability
estimates are not influenced so much by the number of items selected from a
content domain but rather by the amount of information that those items provide
toward estimation. While the test administered to a high-ability examinee might
have half of its items drawn from the easy content domain, those items may
make a negligible contribution to the total information accumulated at the final
ability estimate. Because noninformative items exert very little influence on an
ability estimate, the easy content domain would be effectively excluded from the
test score.
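One simple way to impose proportional targets during selection, sketched below, is to pick the content area that is currently furthest below its target proportion and then select the most informative unused item within it. This is an illustration of the general idea only, not the constrained selection procedure of Stocking and Swanson (1993); the data structures and helper names are assumed.

```python
def next_item(pool, administered, targets, info_at_theta):
    """pool: dict item_id -> content area; targets: content -> target proportion;
    info_at_theta: item_id -> information at the provisional theta estimate."""
    counts = {c: 0 for c in targets}
    for i in administered:
        counts[pool[i]] += 1
    done = max(len(administered), 1)
    # Content area with the largest shortfall from its target proportion
    area = max(targets, key=lambda c: targets[c] - counts[c] / done)
    candidates = [i for i, c in pool.items()
                  if c == area and i not in administered]
    if not candidates:                 # area exhausted; fall back to any unused item
        candidates = [i for i in pool if i not in administered]
    return max(candidates, key=lambda i: info_at_theta[i])
```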
Because balancing numbers or proportions of items administered can be
ineffective, Segall and Davey (1995) developed and evaluated algorithms that
balance the amount of information provided by each content type . The
proportions of information provided by each content domain are not fixed but
rather are allowed to vary across ability levels in the same way they do on
conventional forms. As a result, low-ability examinees may take tests dominated
by one domain while tests administered to high-scoring examinees have a
different emphasis. While this means that interpretation of the obtained scores
changes across ability ranges, it does so in the same way with a CAT as it does
on a conventional test.
Controlling item exposure is of particular concern for testing programs that are available on more than a few scheduled test dates
distributed throughout the year. The concern is that items administered
frequently become compromised quickly and no longer provide valid
measurement. Some general approaches for addressing this concern include the
following:
1. The use of enormous item pools containing more than 5000 items.
   Such pools also could be organized into subpools used on a
   revolving schedule to minimize the possibility of the same items
   reappearing in the same time period or geographical area.
2. Restriction of testing to certain time "windows." This approach
   stops short of full testing on demand but offers examinees greater
   flexibility in scheduling test dates than could be provided with
   most conventional tests.
3. Direct control of item exposure rates through a statistical
   algorithm incorporated in the item selection procedures.
The use of either the "big pool" or restricted testing windows approach is
largely dictated by practical and policy issues. However, in any case, some
means of directly controlling item exposure likely is necessary. Neither large
item pools nor restricted testing windows alone are sufficient to ensure integrity
of the item pool. Accordingly, a number of statistical procedures for controlling
item exposure rates have been devised.
One simple method recommended early in the history of computerized
testing is the so-called 4-3-2-1 procedure. This simple procedure requires that an
item-selection algorithm identify not only the best (e.g., most informative) item
for administration at a given point but also the second-, third-, and fourth-best
items. Item exposure then is limited by allowing the best item to actually be
administered only 40% of the time it is selected. The second-, third-, and fourth-
best items are presented 30%, 20%, and 10% of the time, respectively . This
method is reasonably easy to implement but provides limited protection against
overexposure of those items that are more "popular," or most likely to be
selected for administration.
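A sketch of the 4-3-2-1 idea follows: the four most informative available items are identified, and one of them is administered with probability .4, .3, .2, or .1, respectively. The function and variable names here are ours.

```python
import random

def four_three_two_one(ranked_items, rng=random.random):
    """ranked_items: the best, second-, third-, and fourth-best items at the
    current ability estimate. Returns the item to administer."""
    r = rng()
    if r < .4:
        return ranked_items[0]
    elif r < .7:
        return ranked_items[1]
    elif r < .9:
        return ranked_items[2]
    return ranked_items[3]
```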
Another approach, the Sympson-Hetter method (Sympson & Hetter, 1985),
was developed to provide more specific exposure control through the use of
exposure parameters. The exposure control parameter for each item is a
probability value between zero and one. Selected items actually are administered
only with these probabilities. Items that are selected but not administered are set
aside until the pool is empty. These exposure parameters are obtained through
simulations conducted in advance of operational testing.
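In operation, the Sympson-Hetter gate amounts to a single probability check each time an item is selected, as in the sketch below. The exposure control parameters themselves come from simulation, which is not shown, and the bookkeeping of set-aside items is simplified here.

```python
import random

def administer_or_set_aside(selected_item, k, set_aside, rng=random.random):
    """k: dict of exposure control parameters (0 < k[i] <= 1). Returns True if
    the item is administered; otherwise it is recorded as set aside."""
    if rng() <= k[selected_item]:
        return True
    set_aside.add(selected_item)   # skipped; selection moves on to the next best item
    return False
```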
Even under Sympson-Hetter, we find that sets or clusters of items appear
together with unwelcome frequency and drive overlap rates upward. The
problem is that exposure probabilities are treated unconditionally; the
probability of an item being administered does not depend on which items have
appeared already. Conditional approaches factor in the other items that have
already appeared. Thus, they are useful because the real goal of exposure control
is not merely to limit the rates of item use. It is also to limit the extent of overlap
across tests administered. Recall that the item overlap rate is the percentage of
shared items that occurs between pairs of tests of a fixed length. More
specifically, there are two distinct test conditions in which overlap needs to be
considered. In test-retest overlap, the concern relates to examinees who retest
without intervening treatment to change their ability level (or examinees of
highly similar ability). In peer-to-peer overlap, the concern relates to tests
given to randomly paired examinees, whose abilities may be dissimilar.
Procedures have also been developed that build on the general Sympson-
Hetter framework but are conditional. One type of conditional exposure control
conditions on examinee ability (Stocking & Lewis, 1995; Thomasson, 1995). In
this conditional Sympson-Hetter approach, a matrix of item exposure parameters
is produced, with differing exposure parameters for each item, at each of a
number of discrete ability levels . The Davey-Parshall method (Davey &
Parshall, 1995; Parshall, Davey, & Nering, 1998) conditions on the items that
have already appeared during a given CAT on the grounds that item-pool
security may be protected best by directly limiting the extent that tests overlap
across examinees. The "hybrid," or Tri-Conditional method (Nering, Davey, &
Thompson, 1998; Parshall, Hogarty, & Kromrey, 1999), combines these
approaches and conditions on the individual item, examinee ability, and the
context of testing of those items that have already been administered.
Set-Based Items
The procedures outlined earlier extend readily to either set-based item units or
item bundles. The former are natural groups of items that each draw on the same
common stimulus. The most frequent example is a unit that asks examinees to
read a text passage and then answer a series of questions about that passage.
Item bundles are sets of discrete items that have no direct association with
one another but that are always presented collectively. Bundles may either be
formed as arbitrary collections of items or carefully defined in accord with
substantive and statistical considerations (see Wainer (1990) for a discussion of
the benefits of these bundles, also called testlets).
CAT administration procedures accommodate set-based units and item
bundles in a variety of ways. The simplest is to consider them as
indistinguishable from discrete items. A unit or bundle may provide more
information or have more complicated substantive properties , but it would be
selected and administered according to the same process . That is, the most
informative unit at the examinee's current ability estimate would be selected for
administration; provisional ability estimates would be computed after the
examinee answered all items in the unit, and units would be collectively
assigned a single parameter for protection against overexposure.
More sophisticated uses of item units are also possible. For example, the
items attached to a unit may themselves be adaptively selected from among a
small pool available when the unit is chosen. The item selection process here
would mirror that of the units themselves in best suiting the current ability
estimate. The more difficult items attached to a unit would then be presented
whenever that unit was administered to an able examinee. The same unit would
take on a different, easier set of items when administered to a poorly performing
examinee.
Practical Characteristics
CAT may not be the best choice for every testing program or in every situation.
A number of practical issues must be considered to determine where and when
adaptive testing is best applied. These include the examinee volume, the ease
with which an exam can be maintained both on computer and conventionally,
initial development costs and effort, accommodation of pretest items, examinee
reactions, and the probable cost to the examinee of the computerized exam. Each
of these factors is discussed herein.
Examinee Volume
Initial Development
CAT is not a test-delivery method that can be reasonably "eased into" over time.
Calibrated item pools of substantial size must be available from the onset of a
testing program. As such, considerable work must be done to establish an item
pool before any tests are administered. A number of important test development
decisions are also required. These decisions determine the size of the item pool,
whether the test is of fixed or variable length, how long the test will be, how
items will be selected, including whether content balancing will be imposed and
how items are protected from overexposure, how ability will be estimated, and
whether and how reported scores will be determined from ability estimates. The
Dual Platform
Pretest Items
Computerized tests generally, and CATs in particular, are well suited for
accommodating pretesting. Pretest items are new questions being evaluated to
determine whether they are suitable for future inclusion in a test form or item
pool. Because their properties are unknown, pretest items generally are not
allowed to contribute to examinee scoring. This is especially true of adaptive
tests, where newly administered pretest items would lack the IRT parameter
estimates needed to allow them to be included in an ability estimate. In fact, for
a CAT, the very reason items are pretested is to obtain the data needed for IRT
calibration.
A CAT allows for pretest items to be included either isolated from or
combined with the scored or operational items. An example of the former
approach would be to administer a distinct separately timed test section that
consists exclusively of pretest items. Alternatively, pretest items can be
appended to or embedded within a test section that also includes operational
items.
Examinee Reactions
Examinees have sometimes been found to enjoy computerized testing and even
to prefer it to conventional paper-and-pencil-based administration. Whether this
is due to the novelty of CBT or something deeper is as yet unclear. There are
some specific advantages to adaptive test-delivery methods such as the CAT.
Clearly , the prospect of shortened tests is viewed favorably . That the test is
geared to an examinee's level of ability is also appreciated, particularly by low-
performing examinees whose CATs appear easier than the conventional tests
they are used to encountering.
However, there are features of CATs that have been identified as
objectionable. Foremost among these is the usual policy applied to CATs that
prohibits examinees from returning to previously answered questions to review or
revise their responses. The concern behind this policy is that allowing review may offer some examinees an
opportunity to "game" the system and unfairly boost their scores. For example,
an examinee may recognize a newly presented item as easier than one just
answered. The inference would then be that the answer just given was incorrect
and should be revised. Other strategies include intentionally responding
incorrectly to most or all of the items as they are initially presented. This generally
results in successively easier items being administered. Once all items have been
administered, the examinee would return to revise the responses to each, ideally
answering most or all correctly. Depending on the sort of final ability estimate
employed, this can result in dramatically inflated scores. In any case, this
strategy would produce exceptionally unreliable or imprecise tests being
administered. Of course, the examinee would have to possess knowledge of the
adaptive item selection algorithm to use this strategy effectively.
Other negatives, related to any CBT, include difficulty in reading long text
passages from a computer monitor or graphics that are difficult to view due to
inadequate screen resolution . Some examinees also miss being able to make
notations or calculations in the test booklet. This is a particular problem when
figures need to be copied onto scratch paper or otherwise manipulated.
Cost to Examinee
Summary
A summary of the test procedures, measurement characteristics, and practical
characteristics of the CAT delivery method is shown in Table 8.2. This
highlights the method's strengths and weaknesses. Overall, CAT is a good
choice for testing programs that meet some or all of the following conditions:
1. Large numbers of examinees are tested each year.
2. Examinees naturally distribute themselves evenly across the year.
For example, a program certifying individuals who have completed
a home-study course at their own pace would thus enjoy a relatively
steady examinee volume and would be a good candidate for a CAT.
Measurement Characteristics
Test length:          Fixed or variable, usually shorter than conventional tests
Reliability:          Measured by the standard error of the ability estimates or their transformed values
Item/test security:   Goal is to minimize the extent to which tests overlap across examinees
Set-based items:      Easily implemented, but use of set-based items degrades test efficiency

Practical Characteristics
Examinee volume:      Large
Initial development:  Substantial effort
Dual platform:        Possible but difficult and expensive
Pretest items:        Easily handled
Examinee reactions:   Generally positive
Cost to examinee:     High relative to conventional tests
Item      a        b       c
  1      .397   -2.237    .139
  2      .537   -1.116    .139
  3     1.261    -.469    .023
  4      .857    -.103    .066
  5     1.471     .067    .093
  6      .920     .241    .061
  7     1.382     .495    .190
  8      .940     .801    .111
  9     1.290    1.170    .147
 10     1.440    1.496    .310
However, even for a fairly small pool, this table can become quite large. In
the 10-item example presented here, there would have to be 330 values of I(θ)
computed and then listed in the table. Following each item response, an updated
estimate of the examinee's ability is determined (see the explanation that
follows) and the next item to be administered to this examinee is located from
the table. It is the item that has the most information at a value of θ that is
closest to the newly updated estimate of θ for that examinee.
To simplify the example, the 10-by-33 table has been reduced to a 10-by-17
table with the values of θ incremented by .5 from -4.0 to +4.0. In an actual
CAT, this table would be insufficient, in terms of the increments of θ, to be
useful. However, it does simplify the example that follows. Table 8.4 gives the
information at these values of θ for each of the ten items in the pool.
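Table 8.4 can be regenerated from the item parameters above with the standard three-parameter logistic information function. The scaling constant D = 1.7 is an assumption, but it approximately reproduces the tabled values (e.g., .087 for item 1 at θ = -2.0). A sketch:

```python
import math

ITEMS = {1: (.397, -2.237, .139), 2: (.537, -1.116, .139),
         3: (1.261, -.469, .023), 4: (.857, -.103, .066),
         5: (1.471, .067, .093),  6: (.920, .241, .061),
         7: (1.382, .495, .190),  8: (.940, .801, .111),
         9: (1.290, 1.170, .147), 10: (1.440, 1.496, .310)}

THETAS = [t / 2 for t in range(-8, 9)]          # -4.0 to +4.0 in steps of .5

def info_3pl(theta, a, b, c, D=1.7):
    p = c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

# Look-up table: for each theta, items ranked from most to least informative
lookup = {theta: sorted(ITEMS, key=lambda i: info_3pl(theta, *ITEMS[i]),
                        reverse=True)
          for theta in THETAS}
print(round(info_3pl(-2.0, *ITEMS[1]), 3))       # 0.087, matching Table 8.4
```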
The steps taken for the administration of this sample CAT are as follows:
Step 3. The likelihood function (LF1) for the different values of θ for
item 1 is given in Table 8.5. The value of θ that gives the
maximum probability or likelihood of a single correct
response, or P1(θ), in terms of the θ values given in the table, is
θ = 4.0.
Table 8.4. Item Information Look-up Table for Determining Item Selection

        θ
Item   -4.0  -3.5  -3.0  -2.5  -2.0  -1.5  -1.0  -0.5   0.0   0.5   1.0   1.5   2.0   2.5   3.0   3.5   4.0
  1    .048  .062  .075  .084  .087  .085  .078  .068  .056  .045  .035  .027  .020  .015  .011  .008  .006
  2    .015  .029  .052  .083  .117  .146  .159  .154  .133  .106  .078  .055  .037  .024  .016  .010  .007
  3    .000  .000  .003  .020  .097  .332  .772 1.098  .876  .444  .177  .064  .022  .008  .003  .001  .000
  4    .000  .001  .005  .018  .054  .135  .268  .410  .468  .401  .273  .159  .084  .043  .021  .010  .005
  5    .000  .000  .000  .000  .002  .019  .147  .644 1.272 1.041  .453  .148  .044  .013  .004  .001  .000
  6    .000  .000  .001  .006  .021  .069  .177  .351  .511  .531  .405  .245  .129  .063  .030  .014  .006
  7    .000  .000  .000  .000  .000  .002  .017  .121  .505  .942  .758  .346  .122  .039  .012  .004  .001
  8    .000  .000  .000  .001  .002  .010  .039  .119  .277  .456  .513  .408  .250  .131  .063  .030  .013
  9    .000  .000  .000  .000  .000  .000  .002  .015  .093  .381  .819  .844  .481  .198  .071  .024  .008
 10    .000  .000  .000  .000  .000  .000  .000  .001  .008  .067  .358  .792  .662  .292  .098  .030  .009
Step 4. Based on the look-up table presented in Table 8.4, the next item
that should be administered to this examinee is item 8 because
this item has the largest amount of information for examinees
with θ = 4.0.
Step 5. Assume that this examinee misses item 8 so that the response
vector now contains the correct response to the initial item and
an incorrect response, or 0, to item 8. The examinee's responses
(i.e., the response vector) are now (1 0).
Step 7. For an individual with θ = -.5, the next item that should be
administered (i.e., the item with the largest amount of
information at θ = -.5) is item 3; see Table 8.4.
Step 9. The item with the most information for a value of θ = 0.0 is
item 5; see Table 8.4. Assume that the examinee correctly
answers item 5 so that the responses thus far are (1 0 1 1). The
likelihood function after four items (LF4) is P1(θ) Q2(θ) P3(θ)
P4(θ). The maximum of this likelihood occurs for a value of θ = .5.
This is the next ability update for this examinee; see Table 8.5.
Step 10. The item with the most information at θ = .5 is also item 5.
Because this item has already been administered to this
examinee, we administer the item with the next largest amount
of information, or item 7; see Table 8.4.
Step 11. Assume that the examinee answers this item incorrectly so that
the responses are (1 0 1 1 0). The likelihood function after five
items (LF5) is P1(θ) Q2(θ) P3(θ) P4(θ) Q5(θ). The maximum of
this likelihood occurs for a value of θ = 0.0. The sequence of
ability estimates for this examinee has been (4.0, -.5, .0, .5, .0).
The initial item (item # 1) was selected because it was the easiest
item and no ability estimate was required to select this item.
Step 12. For this small example, assume that the stopping rule of the test
is as follows: The test is stopped whenever the standard error of
the estimate of θ falls at or below .70;⁴ see Table 8.6 for the
standard error of each updated ability estimate of θ. This value
falls below the .70 criterion or standard only after the fifth
(and final) item has been administered. Recall that the standard
error of θ is simply the square root of the reciprocal of the
information function of the test administered so far, evaluated at the latest
updated value of θ.
The final ability estimate for this examinee using this procedure is θ = 0.0, with
a standard error of .58. In an actual CAT administration, the final estimate of
ability most likely would be refined using a computational procedure such as the
Newton-Raphson algorithm (Lord, 1980, pp. 180-181).
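The stopping calculation in Step 12 can be verified directly: summing the Table 8.4 information values of the five administered items at θ = 0.0 and taking the reciprocal square root gives the reported standard error of .58. A sketch, reusing the tabled values:

```python
import math

# Information at theta = 0.0 for the administered items (from Table 8.4)
info_at_final_theta = {1: .056, 8: .277, 3: .876, 5: 1.272, 7: .505}

test_information = sum(info_at_final_theta.values())
standard_error = 1 / math.sqrt(test_information)
print(round(standard_error, 2))        # 0.58, so the .70 stopping rule is met
```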
Summary of Example
This example illustrated the steps in the CAT delivery method. This is a
complex approach, requiring a fairly extensive test development and
psychometric effort to develop and maintain. However, this complexity results
in very quick, efficient testing, and will be worth the effort for some exam
programs. In addition to the developmental effort needed, the CAT method is
best suited to exam programs that have large item pools available, test large
numbers of examinees year-round, require only a moderate level of test security,
benefit from test efficiency, and need proficiency estimates (rather than
classification decisions) . For exam programs that do not fit this description,
other test-delivery methods may have advantages that outweigh those of the
CAT.
References
Brown, J. M., & Weiss, D. J. (1977). An Adaptive Testing Strategy for Achievement Test
Batteries. (Research Report 77-6). Minneapolis: University of Minnesota,
Psychometric Methods Program.
Davey, T., & Parshall, C. G. (1995, April). New algorithms for item selection and
exposure control with computerized adaptive testing. Paper presented at the annual
meeting of the American Educational Research Association, San Francisco.
Davey, T., & Thomas, L. (1996, April). Constructing adaptive tests to parallel
conventional programs. Paper presented at the annual meeting of the American
Educational Research Association, New York.
Lord, F. M. (1980). Applications of Item Response Theory to Testing Problems. Hillsdale,
NJ: Lawrence Erlbaum.
Owen, R. J. (1969). A Bayesian Approach to Tailored Testing. (Research Report 69-92).
Princeton, NJ: Educational Testing Service.
Owen, R. J. (1975). A Bayesian sequential procedure for quantal response in the context
of adaptive mental testing. Journal of the American Statistical Association, 70,
351-356.
Nering, M. L., Davey, T., & Thompson, T. (1998). A hybrid method for controlling item
exposure in computerized adaptive testing. Paper presented at the annual meeting of
the Psychometric Society, Champaign-Urbana.
Parshall, C. G., Davey, T., & Nering, M. L. (1998, April). Test development exposure
control for adaptive testing. Paper presented at the annual meeting of the National
Council on Measurement in Education, San Diego.
Parshall, C. G., Hogarty, K. Y., & Kromrey, J. D. (1999, June). Item exposure in adaptive
tests: An empirical investigation of control strategies. Paper presented at the annual
meeting of the Psychometric Society, Lawrence, KS.
Segall, D. O., & Davey, T. C. (1995). Some new methods for content balancing adaptive
tests. Presented at the annual meeting of the Psychometric Society, Minneapolis.
Stocking, M. L. (1987). Two simulated feasibility studies in computerized testing.
Applied Psychology: An International Review, 36(3), 263-277.
Stocking, M. L. (1994). Three Practical Issues for Modern Adaptive Testing Item Pools.
(Report No. ETS-RR-94-5). Princeton, NJ: ETS.
Stocking, M. L., & Lewis, C. (1995). Controlling Item Exposure Conditional on Ability
in Computerized Adaptive Testing. (Research Report 95-24). Princeton, NJ:
Educational Testing Service.
Stocking, M., & Swanson, L. (1993). A method for severely constrained item selection in
adaptive testing. Applied Psychological Measurement, 17, 277-292.
Sympson, J. B., & Hetter, R. D. (1985). Controlling item-exposure rates in computerized
adaptive testing. Proceedings of the 27th Annual Meeting of the Military Testing
Association (pp. 973-977). San Diego: Navy Personnel Research and Development
Center.
Thissen, D. (1990). Ability and measurement precision. In Wainer, H. (ed.), Computerized
Adaptive Testing: A Primer (chap. 7, pp. 161-186). Hillsdale, NJ: Lawrence
Erlbaum.
Thomasson, G. L. (1995). New item exposure control algorithms for computerized
adaptive testing. Paper presented at the annual meeting of the Psychometric Society,
Minneapolis.
Thompson, T., Davey, T. C., & Nering, M. L. (1998). Constructing adaptive tests to
parallel conventional programs. Presented at the annual meeting of the American
Educational Research Association, San Diego.
Wainer, H. (ed.) (1990). Computerized Adaptive Testing: A Primer. Hillsdale, NJ:
Lawrence Erlbaum.
Way, W. D. (1998). Protecting the integrity of computerized testing item pools.
Educational Measurement: Issues and Practice, 17, 17-27.
9
Computerized Classification Tests
Test Procedures
Test Assembly
selection algorithm contained within the test administration software. The test
assembly or item selection process for a CCT can be based on a variety of
methods, depending on the methodology used to implement it. If a CAT
approach is employed, the selection of items is guided by the current ability
estimate of the examinee (i.e., items are selected based on the amount of
information provided at the current ability estimate of the examinee).
Another method that attempts to measure an examinee's latent ability is a
procedure based on the estimation of latent class models. Latent class models
express the probability of a vector of item responses as a function of an
examinee's membership in the categories of a latent categorical
variable. They are similar in form to item response theory (IRT) models except
that the latent variable is discrete instead of continuous. Most of the research in
this area has been contributed by George B. Macready and C. Mitchell Dayton.
Those who are interested in CCT using latent class models should refer to their
work.
A second, general approach uses a statistical hypothesis framework and
includes computerized mastery testing (CMT) (Lewis & Sheehan, 1990;
Sheehan & Lewis, 1992) and the sequential probability ratio test (SPRT)
introduced by Abraham Wald (1947). The CMT approach (Lewis & Sheehan,
1990; Sheehan & Lewis, 1992) relies on item clusters or packets, called testlets,
that have been preassembled or constructed to have the same number of items
that match content and psychometric specifications. The testlets are selected
randomly during the administration of the CCT. In an SPRT classification test,
items are selected based on the amount of item information at the passing score
or decision point of the test. This feature changes the demands placed on the
requirements for the item pool characteristics when compared with other test-
delivery methods.
Scoring
The scoring of the CCT is determined by the method used to conduct the test.
However, all methods rely on comparing examinee performance to a
predetermined criterion level of performance. Two primary approaches have
been employed in the administration of adaptive CCTs. A CCT may be scored
using ability estimation as done in a CAT or using the SPRT. If a CAT approach
is used, IRT ability estimates are obtained, along with the standard error of the
ability estimate. These values are compared to the latent passing score to
determine a pass/fail decision. The stopping rule for a CAT is customarily
defined as the point in the test at which the classification decision point or
passing score falls outside of a given confidence level or credibility interval of
the examinee's estimated ability. Typically, examinees are not permitted to
change answers to previously answered items when items are selected to target
the examinee's current ability estimate.
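A minimal sketch of this stopping rule follows: testing ends with a pass or fail decision as soon as a confidence interval around the provisional ability estimate excludes the passing score. The interval width (here 1.96 standard errors) and the variable names are illustrative assumptions.

```python
def classification_decision(theta_hat, se, passing_score, z=1.96):
    """Return 'pass', 'fail', or None (keep testing) based on whether the
    confidence interval around the ability estimate excludes the cut score."""
    lower, upper = theta_hat - z * se, theta_hat + z * se
    if lower > passing_score:
        return "pass"
    if upper < passing_score:
        return "fail"
    return None      # cut score still inside the interval; administer another item
```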
Adaptive tests, in general, tend to require large numbers of test items. However,
the requirements for CCT are not as extensive as CAT when the selection
algorithm chooses items based on information at the passing score. In this
situation, it is best if the item pool has a large number of informative items that
have maximum item information near the passing score. Therefore, CCT item
pools do not need to be as large as CAT pools but they do need to be large
enough to accommodate the psychometric and content specifications of the test.
For most high-stakes testing programs , the acceptable or targeted maximum
exposure rate (i.e., the percentage of time that an item should be administered) is
20%. This value can be used to estimate the number of items required to
administer a CCT. If a CCT has a minimum test length of 100 items and a
maximum exposure control target of 20%, the item pool needs to have at least
500 items to meet the requirements for the minimum test length. This minimum
may be higher if the content specifications are complex. The item pool needs
more items if the exposure control target is lower or the test length is longer.
Measurement Characteristics
Test Length
A CCT can make pass/fail decisions statistically with a very small number of
items (e.g., less than 10). However, most test developers agree that tests of this
length are not acceptable in most circumstances. Test length can be reduced to
half the length of the paper-and-pencil format or even less and still maintain the
accuracy of the classification decisions. Reductions in test length are determined
by the quality of the item pool, whether the CCT is variable or fixed in length,
and the complexity of the content specifications required for the test. A large
item pool containing high-quality items allows test lengths to be considerably
shorter than the traditional paper-and-pencil tests that they replace. Content
specifications that include a high level of detail require more items to be
presented in order to meet these specifications.
The determination of an appropriate test length is accomplished by
performing computerized simulations of the testing environment. The simulation
process permits the estimation of passing rates, decision error rates, and the
degree to which content specifications have been met. Examinees whose
abilities are further away from the passing score require fewer items on which to
base a classification decision. On the other hand, examinees whose abilities are
near the passing score require longer tests. The minimum and maximum test
lengths can be manipulated during the simulation process to determine values
that assist in minimizing classification errors, given other considerations such as
item pool size and quality, seat time, and item exposure.
Research has shown that if the classification decision is made on the basis of
an examinee's latent ability (i.e., a CAT), the test is more powerful if the
selection of items is determined at the passing score level rather than at the
examinee's estimated ability (Spray & Reckase, 1996). However, the SPRT
method administers tests that are more efficient than the CAT. Specifically, the
SPRT method provides test results that have comparable pass rates and
classification errors to the CAT but requires fewer items.
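Wald's SPRT, as adapted to a classification test, compares the likelihood of the responses at two ability points bracketing the passing score and stops as soon as the ratio crosses one of two bounds determined by the tolerable error rates. The sketch below is generic: the indifference-region points, error rates, and the 3PL response model (with D = 1.7) are illustrative assumptions, not values from this chapter.

```python
import math

def p_3pl(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def sprt_decision(items, responses, theta_fail, theta_pass, alpha=.05, beta=.05):
    """Log-likelihood ratio SPRT for a pass/fail classification."""
    log_lr = 0.0
    for (a, b, c), u in zip(items, responses):
        p_hi, p_lo = p_3pl(theta_pass, a, b, c), p_3pl(theta_fail, a, b, c)
        log_lr += math.log(p_hi / p_lo) if u == 1 else math.log((1 - p_hi) / (1 - p_lo))
    upper = math.log((1 - beta) / alpha)      # exceed this bound: decide pass
    lower = math.log(beta / (1 - alpha))      # fall below this bound: decide fail
    if log_lr >= upper:
        return "pass"
    if log_lr <= lower:
        return "fail"
    return None                               # keep testing
```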
Reliability
The evaluation of reliability for a CCT is based on statistics that provide
information regarding the consistency of the decisions made in classifying
examinees into two or more categories. The estimation of decision consistency
and accuracy is facilitated during the computerized simulation of the CCT
environment. Decision consistency is a reliability measure for criterion-
referenced tests. It provides information concerning the stability or precision of
decisions. Most decision consistency measures indicate the percentage or
proportion of consistent decisions that would be made if the test could be
administered to the same examinees over two occasions. Decision accuracy , on
the other hand, is a measure of a test's ability to classify examinees correctly
into the appropriate categories . A test with a low classification error rate has a
high degree of decision accuracy.
Because an examinee 's true classification status is never known precisely, it
is impossible to determine a test's decision accuracy. However, based on
assumptions pertaining to the ability distribution of the population of examinees
taking the CCT, psychometricians can predict pass/fail rates, classification error
rates, and consistency of classifications. Measures such as the proportion of
consistent classifications, coefficient kappa (i.e., the proportion of consistent
classifications corrected for chance) (Cohen, 1960), and the proportion of
correct classification decisions made can be obtained through computer
simulations.
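Both indices can be computed from a cross-classification of the decisions made on two real or simulated administrations. The sketch below assumes a simple two-category (pass/fail) table; Cohen's (1960) kappa corrects the raw agreement for chance.

```python
def consistency_and_kappa(table):
    """table[i][j]: count of examinees classified i on form 1 and j on form 2,
    for categories 0 = fail, 1 = pass."""
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(len(table))) / n          # observed agreement
    row = [sum(r) / n for r in table]
    col = [sum(table[i][j] for i in range(len(table))) / n for j in range(len(table))]
    p_e = sum(r * c for r, c in zip(row, col))                      # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, kappa

# Hypothetical pass/fail decisions cross-classified over two occasions
print(consistency_and_kappa([[40, 10], [8, 42]]))   # (0.82, 0.64)
```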
Item/Test Security
A primary concern with any computerized test is the security of the items used
in the testing process. Any CCT must address how the items will be protected
from overuse. The determination of exposure control parameters can be
accomplished during the computerized simulation process. Lewis and Sheehan
(1990) and Sheehan and Lewis (1992) have employed the randomized selection
of parallel testlets to meet this goal. Another approach is a modification of the
Sympson-Hetter method, presented in detail in Chapter 10.
Set-Based Items
Practical Characteristics
There are a number of important considerations beyond those discussed so far
that must be addressed when any classification test is delivered via computer.
These include the examinee volume, the ease with which an exam can be
maintained across dual platforms, initial development effort, accommodation of
pretest items, examinee reactions, and the probable cost to the examinee of the
computerized exam. Each of these factors is discussed next for the CCT.
Examinee Volume
Testing programs that wish to use IRT methodology in a CCT require large
numbers of examinees to meet the requirements for item calibration. High
examinee volumes also have an impact on the number of times items are seen.
This, in turn, impacts designs regarding the CCT item pool to permit the rotation
of item pools so that the items are not overly exposed.
Initial Development
Dual Platform
Pretest Items
Examinee Reactions
Examinees are permitted to review items and even change answers throughout
the testing process for those computerized classification tests in which items
are selected based on the amount of information at the passing score. For
example, with CCT procedures such as the SPRT or CMT, examinee
performance does not impact the selection of items, and responses to previous
items can be changed at any time. Many examinees prefer to have this
flexibility, considering it a highly valued testing option. In addition, some
examinees prefer having shorter tests and the various CCT methods usually are
able to reduce test lengths, given a strong item pool.
Cost to Examinee
The cost of developing and administering CBTs is often higher than typical
costs for paper-and-pencil testing. Some portion of that cost is borne by higher
examinee fees. The amount charged an examinee for any test is related to the
developmental effort and expenses incurred by the testing company or
certification/licensure board. Given the higher developmental effort needed for a
CCT, the cost to the examinee often is higher than for a paper-and-pencil or CFT
format.
Summary
A summary of the test procedures, measurement characteristics, and practical
characteristics of the CCT delivery method is provided in Table 9.1. These
features highlight the method's strengths and weaknesses. Overall, a CCT
program requires some initial effort; this may be very intensive, especially if the
item pool must be calibrated. Many of the critical decisions regarding the CCT
can be answered only after extensive computer simulations. In spite of this
drawback, the CCT approach is the test-delivery method of choice for many
testing programs, particularly those concerned with classifying examinees
into two or more categories. The most common use for a CCT is in the
area of licensure and certification testing.
Practical Characteristics
Examinee volume: Large (if IRT is used)
Initial development: Moderate effort
Dual platform: Possible
Pretest items: Easily handled
Examinee reactions: Relatively positive
Cost to examinee: May be high
The sequential probability ratio test, or SPRT, was first proposed by Abraham
Wald during the latter stages of World War II as a means of conserving Allied
ammunition during test firing of production lots of ammunition. Prior to
sequential testing, a fairly large quantity of live ammunition had to be fired in
order to determine a simple binomial probability of unacceptable lots. This
traditional fixed sample approach to testing required a predetermined number of
rounds, N, to be fired and the proportion of unacceptable rounds to be observed.
Using the sequential approach, rounds were fired only until a number of
unacceptable rounds had been observed that would indicate, within a given
statistical power and acceptable Type I error rate, what the unacceptable
proportion was. In most cases, this (variable) number of rounds was smaller
than the fixed sample size required to reach the same conclusion at the same
power and error rate.
The SPRT was first suggested for mental testing via computer by Ferguson
(1969a, 1969b) and later by Reckase (1983). Since this early research, the SPRT
has been applied to a variety of classification tests.
Either

examinee j has latent ability θ = θ0 (the null hypothesis)

or

examinee j has latent ability θ = θ1 (the alternative hypothesis).
Arbitrarily, we let θ0 < θ1 and refer to θ0 as the level of ability below which
we are (1 - α)100% certain that an examinee should fail the examination. In other
words, the point θ0 corresponds to the lowest possible passing level on θ, or it is
the greatest lower bound of minimal competency that we are willing to tolerate.
Similarly, we refer to θ1 as the level of ability above which we are (1 - β)100%
certain that the examinee should pass the examination. In other words, the point
θ1 corresponds to the highest possible passing level on θ, or it is the least upper
bound of minimal competency that we are willing to tolerate.
Classification error rates are specified by α and β, where α is the rate of
false positive classification errors, or the frequency with which we classify an
examinee as passing the examination given that the examinee should truly fail,
while β is the rate of false negative classification errors, or the frequency with
which we classify an examinee as failing the examination given that the
examinee should truly pass. We note that all classification examinations yield
false positive and false negative errors. However, the SPRT is unique in that it
can directly control these errors by specifying them in advance of testing.
probability that an examinee with θj = θ0 has produced those responses thus far
during the test is

$$L(x_1, x_2, \ldots, x_k \mid \theta_0) = \pi_1(\theta_0)\,\pi_2(\theta_0)\cdots\pi_k(\theta_0), \qquad (9.1)$$
where

$$\pi_j(\theta_0) = \begin{cases} P_j(\theta_0), & \text{if the item is answered correctly,}\\ Q_j(\theta_0) = 1 - P_j(\theta_0), & \text{otherwise,}\end{cases} \qquad (9.2)$$

or equivalently,

$$\pi_j(\theta_0) = P_j(\theta_0)^{x_j}\, Q_j(\theta_0)^{1-x_j}, \qquad (9.3)$$

where

$$x_j = \begin{cases} 1, & \text{if the item is answered correctly,}\\ 0, & \text{otherwise.}\end{cases}$$
The likelihood of the same responses under the alternative hypothesis,
$L(x_1, x_2, \ldots, x_k \mid \theta_1)$, is defined analogously, with

$$\pi_j(\theta_1) = \begin{cases} P_j(\theta_1), & \text{if the item is answered correctly,}\\ Q_j(\theta_1), & \text{otherwise,}\end{cases} \qquad (9.5)$$

or equivalently,

$$\pi_j(\theta_1) = P_j(\theta_1)^{x_j}\, Q_j(\theta_1)^{1-x_j}. \qquad (9.6)$$

The ratio of the two likelihoods after k items is the likelihood ratio,

$$LR = \frac{L(x_1, x_2, \ldots, x_k \mid \theta_1)}{L(x_1, x_2, \ldots, x_k \mid \theta_0)}. \qquad (9.7)$$
The likelihood ratio, LR, is actually compared to two boundary values, A and
B. A is the upper boundary value and is approximately equal to (1 - β)/α, or
equivalently,

$$\log(A) = \log(1 - \beta) - \log \alpha. \qquad (9.8)$$

Similarly, B is the lower boundary value and is approximately equal to
β/(1 - α), or equivalently,

$$\log(B) = \log \beta - \log(1 - \alpha). \qquad (9.9)$$
The likelihood ratio, LR (or the log of LR), is updated following the response to
each item that an examinee is given, so that

$$\log(LR_k) = \log(LR_{k-1}) + \log \pi_k(\theta_1) - \log \pi_k(\theta_0). \qquad (9.10)$$
The test terminates when either LR > A or LR < B (or, equivalently, when either
log(LR) > log(A) or log(LR) < log(B)), or when a time limit or test length maximum
has been exceeded. If the examinee has not been classified by the SPRT
procedure when the time limit has been exceeded or the maximum number of
items has been answered, some method must be used to determine how the
examinee is classified. Typically, the (unsigned) distance between the LR and
the boundaries is computed, and the shorter distance is used as the criterion on
which to base the final classification decision.
The values of log(A) and log(B), the upper and lower boundaries of the test, can be
determined by applying Equations 9.8 and 9.9, respectively.

For the first item administered, which was answered correctly (xj = 1), the
probabilities of a correct response are P(θ0) = .487 and P(θ1) = .876, so that
log πj(θ0) = log(.487) = -.719 and

$$\log \pi_j(\theta_1) = (x_j)\log P(\theta_1) + (1 - x_j)\log Q(\theta_1)$$
$$\log \pi_j(\theta_1) = (1)\log(.876) + (0)\log(1 - .876)$$
$$\log \pi_j(\theta_1) = -.123.$$

Subsequently,

$$\log(LR) = \log \pi_j(\theta_1) - \log \pi_j(\theta_0)$$
$$\log(LR) = -.123 - (-.719)$$
$$\log(LR) = .596.$$
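The arithmetic above can be reproduced in a few lines of code. The following is a minimal sketch rather than the authors' software; the error rates α = β = .10 are assumed purely for illustration, and the probabilities of a correct response at θ0 and θ1 are taken from the first item of the example.

```python
import math

def log_boundaries(alpha, beta):
    """Upper and lower decision boundaries (Equations 9.8 and 9.9)."""
    log_A = math.log(1 - beta) - math.log(alpha)
    log_B = math.log(beta) - math.log(1 - alpha)
    return log_A, log_B

def sprt_step(log_lr, x, p0, p1):
    """Update the log likelihood ratio after one scored response x (1 or 0),
    given the probabilities of a correct response at theta_0 and theta_1."""
    if x == 1:
        log_lr += math.log(p1) - math.log(p0)
    else:
        log_lr += math.log(1 - p1) - math.log(1 - p0)
    return log_lr

# First item of the worked example: P(theta_0) = .487, P(theta_1) = .876, correct response.
log_lr = sprt_step(0.0, 1, 0.487, 0.876)
print(round(log_lr, 3))   # about .587 (the text's .596 uses logs rounded to three decimals)

log_A, log_B = log_boundaries(alpha=0.10, beta=0.10)   # assumed error rates for illustration
if log_lr > log_A:
    print("classify as pass")
elif log_lr < log_B:
    print("classify as fail")
else:
    print("administer another item")
```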
Summary of Example
This example illustrated the steps in the CCT delivery method. In addition to the
developmental effort needed, the CCT method is best suited to exam programs
that have moderately sized item pools available, test moderate to large numbers
of examinees year-round, require only a moderate level of test security, benefit
from test efficiency, and need classification decisions. For exam programs that
do not fit this description, other test-delivery methods may have advantages that
outweigh those of the CCT.
Item-by-item record for the CCT example (α = .050, β = .10, log(B) = -2.197, θp = -.40):

Item    a      b      c   I(θp)  P(θ0)  P(θ1)  P(θj = 1.0)  SR*  Correct  Incorrect  Total
  1   1.80   -.64   .12   1.70   .487   .876      .994       1     .59     -1.42      .59
  2   2.00   -.16   .15   1.56   .249   .652      .984       1     .96      -.77     1.55
  3   1.80   -.83   .23   1.14   .659   .935      .997       1     .35     -1.65     1.90
  4   1.88    .16   .19    .48   .228   .461      .948       0     .70      -.36     1.54
  5   1.71  -1.51   .20    .25   .920   .989      .999       1     .07     -1.94     1.61
  6    .88  -1.70   .31    .16   .866   .947      .988       1     .09      -.92     1.70
  7    .63  -1.22   .24    .16   .714   .830      .935       1     .15      -.52     1.85
  8    .68  -1.66   .22    .15   .797   .894      .965       1     .12      -.65     1.97
  9    .92  -2.09   .14    .13   .906   .966      .993       1     .06     -1.02     2.03
 10    .59  -1.74   .18    .13   .779   .873      .951       1     .11      -.55     2.14
 11    .47  -1.94   .15    .09   .763   .846      .926       1     .10      -.43     2.25

* SR = scored response; P(θj = 1.0) is the probability of a correct response for the simulated
examinee with true ability θj = 1.0. The last three columns give the log of the likelihood ratio
contributed by a correct response, by an incorrect response, and the running total for the
responses actually scored.
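The I(θp) column can be checked directly from the 3PL item parameters using the standard 3PL information function. The sketch below is a minimal illustration, not the authors' software; the scaling constant D = 1.7 is assumed, which is consistent with the tabled values.

```python
import numpy as np

def p3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct response."""
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def info3pl(theta, a, b, c, D=1.7):
    """Fisher information of a 3PL item at ability theta."""
    p = p3pl(theta, a, b, c, D)
    q = 1 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1 - c)) ** 2

# First item of the example: a = 1.80, b = -.64, c = .12, evaluated at theta_p = -.40.
print(round(info3pl(-0.40, 1.80, -0.64, 0.12), 2))   # about 1.70, as in the I(theta_p) column
```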
References
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and
Psychological Measurement, 20, 37-46.
Ferguson, R. L. (1969a). Computer-Assisted Criterion-Referenced Measurement (Working
Paper No. 41). Pittsburgh: University of Pittsburgh, Learning Research and Development
Center. (ERIC Document Reproduction Service No. ED 037 089).
Ferguson, R. L. (1969b). The development, implementation, and evaluation of a
computer-assisted branched test for a program of individually prescribed instruction.
Unpublished doctoral dissertation, University of Pittsburgh. (University Microfilms
No. 70-4530).
Glaser, R. (1963). Instructional technology and the measurement of learning outcomes.
American Psychologist, 18, 519-521.
Glaser, R., & Nitko, A. J. (1971). Measurement in learning and instruction. In R. L.
Thorndike (ed.), Educational Measurement. Washington, DC: American Council on
Education.
Hambleton, R. K., Swaminathan, H., & Algina, J. (1976). Some contributions to the
theory and practice of criterion-referenced testing. In DeGruijter, D. N. M., & van
der Kamp, L. J. T. (eds.), Advances in Psychology and Educational Measurement
(pp. 51-62). New York: John Wiley & Sons.
Hambleton, R. K., Swaminathan, H., Algina, J., & Coulson, D. B. (1978). Criterion-
referenced testing and measurement: A review of technical issues and developments.
Review of Educational Research, 48, 1-47.
Kalohn, J. C., & Huang, C. (1997, June). The effect of changing item responses on the
accuracy of an SPRT CAT for pass/fail classification decisions. Paper presented at
the annual meeting of the Psychometric Society, Gatlinburg, TN.
Kalohn, J. C., & Spray, J. A. (1999). The effect of model misspecification on
classifications made using a computerized classification test. Journal of Educational
Measurement, 36, 46-58.
Lewis, C., & Sheehan, K. (1990). Using Bayesian decision theory to design a
computerized mastery test. Applied Psychological Measurement, 14, 367-386.
Popham, W. J. (1974). Selecting objectives and generating test items for objectives-based
items. In Harris, C. W., & Popham, W. J. (eds.), Problems in Criterion-Referenced
Measurement (pp. 13-25). Los Angeles: University of California, Center for the
Study of Evaluation.
Reckase, M. D. (1983). A procedure for decision making using tailored testing. In Weiss,
D. J. (ed.), New Horizons in Testing: Latent Trait Test Theory and Computerized
Adaptive Testing (pp. 237-255). New York: Academic Press.
Safrit, M. J. (1977). Criterion-referenced measurement: Applications in physical
education. Motor Skills: Theory into Practice, 2, 21-35.
Sheehan, K., & Lewis, C. (1992). Computerized mastery testing with nonequivalent
testlets. Applied Psychological Measurement, 16, 65-76.
Spray, J. A., & Reckase, M. D. (1996). Comparison of SPRT and sequential Bayes
procedures for classifying examinees into two categories using a computerized test.
Journal of Educational and Behavioral Statistics, 21, 405-414.
Wald, A. (1947). Sequential Analysis. New York: Wiley.
10
Item Pool Evaluation and Maintenance
Introduction
When testing programs indicate that they would like to move their current
multiple-choice paper-and-pencil examinations to a computer-based testing
program, they can consider several CBT options. These include CFT, ATA,
CAT, and CCT. Ultimately, the decision to move a testing program to a CBT
environment depends on many factors. These include the purpose of the test, the
type of test, the status of current test specifications and classification of items,
whether current exams are voluntary, current testing volumes, and the results of
an item pool evaluation. This chapter focuses on the item pool evaluation
process and item pool maintenance.
The evaluation of an item pool requires the involvement of two groups of
experts, content and psychometric. Content experts are responsible for the
evaluation of the item pool, paying particular attention to the quality of items
from a content perspective. Psychometric experts are responsible for the
evaluation of the quality of the statistical characteristics of the item pool in order
to determine if a CBT is feasible. The item pool evaluation process involves
three steps or processes. These are:
1. Item pool content review by content experts;
2. Assessment of item pool statistical characteristics; and
3. Simulation of the CBT environment.
It is expected that most testing programs have the resources to complete the
first step of the item pool evaluation process. However, many testing programs
may not be in a position to evaluate their own item pools thoroughly and
rigorously for possible implementation of CBT applications that are included in
the remaining steps of the evaluation process. In this case, psychometricians
who can perform this complex task should be consulted. The presentation of the
material for Steps 2 and 3 assumes that such expert advice and technical
assistance are available.
Once the CBT program is online and the program is operational, the item
pool must be maintained. Item pool maintenance consists of updating the item
calibrations from CBT administrations, calibrating and evaluating pretest items,
monitoring item exposures, and possibly updating the standard reference set if or
when a new test blueprint or content outline is introduced. Ongoing pool
maintenance is a necessary process to ensure the best possible conditions for a
CBT program.
During an item pool review, content experts review all items in the pool to
determine what items are suitable for inclusion in a CBT from a content
perspective. For most testing programs considering CBT, this review process is
lengthy and requires planning and the establishment of guidelines to be efficient
and successful. One strategy is to assign small groups of content experts to
review a portion of the item pool and to come to a consensus about the status of
those items that they are assigned to review. Typically, guidelines for reviewing
items consist of questions that help focus the efforts of the content experts.
Query-based guidelines are essential to assist the content experts in making
decisions efficiently and accurately. Some suggested guidelines for this portion
of the item pool evaluation process are presented in Table 10.1.
Each of the queries presented in Table 10.1 is a fairly common question that
should be evaluated for any type of testing program conducting an item pool
evaluation, regardless of presentation format (i.e., paper or computer
presentation). The last question relates to items that may clue answers to other
items in an item pool. Many testing programs refer to these "cueing" items as
enemies. Items are enemies in the sense that they should not appear together on
the same test for a given examinee. This is a relatively simple task for a fixed-
form examination, given the limited number of items that are presented on an
exam of this type. However, for CBTs constructed in real time, the computer
program administering the test must have information that indicates item enemy
pairs. The content experts also must review the item pool with the item enemy
concept in mind throughout the entire evaluation process in order to identify
these items. It is easy to see how this step of the evaluation process becomes
very complicated with large item pools.
Table 10.1. Suggested guidelines for the item pool content review:
1. Does the item measure content that is germane to the current test
blueprint or test specifications?
2. Does the item contain information that is accurate and up to date (i.e.,
does it represent current content, laws, regulations, technology, etc.)?
3. Does the item possess correct content classification code(s) based on the
current test specifications?
4. Does the item possess a correct response (i.e., is the key correct)?
5. Does the item cue the correct response for another item in the item pool?
If so, those items should be identified to prevent the administration of
these items on the same examination to an examinee.
In addition, the content experts must give special consideration to item sets.
Recall that item sets are characterized by items that are preceded by a common
stimulus or reading passage. The content experts have the discretion to
determine how an item set is to be presented in the CBT environment.
Specifically, the content expert can identify what group of items to present
together in a set or how the set is to be split in order to produce multiple item
sets that are attached to a common stimulus or reading passage . Based on this
configuration the psychometric staff can incorporate each item set into the item
selection and presentation process.
The content review process is a monumental task that requires a fair amount
of commitment by numerous content experts . For the most part, this step can
occur concurrently with Step 2, permitting some streamlining of the process .
This allows the two groups of experts to perform these initial steps
simultaneously. However, these first two steps must precede the process of
simulating the CBT environment. This is essential to prevent repetition of the
simulation process, because any changes in the composition of the item pool
require repetition of the simulation process from the beginning.
The process of reviewing the statistical information about items in an item pool
depends on the type of CBT to be developed and the type of item statistics that
are used in assembling and scoring the CBT (i.e., classical versus IRT item
statistics). This process is different for each of the four methods that have been
presented in earlier chapters. The next sections present relevant information
related to this statistical review process for each CBT method, with an emphasis
on CCT and CAT.
Computerized Fixed Tests
The statistical review process for CFTs is driven by the types of statistics that
are maintained and used in the test assembly process. An important part of the
review process is to determine if the item pool consists of enough items to
assemble a form that is similar to the original or base form and meet the test
specifications in terms of content and statistical characteristics. This review
process requires a close look at each item's difficulty, discrimination, and
response analysis, comparing the item's performance to items from the base
form. Items that deviate significantly from the statistical item characteristics
of the base form most likely would not be used in the creation of a new CFT.
IRT Model Selection
In any item pool evaluation for CBT, it is necessary to review the status of the
items in the pool with respect to the quality of the IRT item calibrations.
Although a testing program does not have to be based on any particular IRT
model, there are advantages to having all of the items in the pool calibrated with
an IRT model and scaled to a common metric. The selection of an IRT model is
influenced by many factors, including psychometric theory, item format, testing
philosophy, organization politics, and testing organization preference. Prior to
beginning this step, it is beneficial for the psychometric consultants to meet with
representatives of the testing program to share information about possible CBT
options, given the factors that may impact the selection of the IRT model. This
prevents any misunderstanding about the intent of the testing program's
examination and assists in focusing the future work of the psychometric experts.
It is important to determine which model is the best given all of the factors that
may impact the final decision. Regardless of the final decision, it is crucial that
the implementation of the selected model adhere to the model's assumptions and
rules as closely as possible.
An important point to consider in the selection of an IRT model is that one
IRT model is not, by itself, better than another. The most important aspects of
selecting a model are (1) understanding what is required to implement the model
and (2) recognizing from the examination that the data and model are
compatible. Once a model is selected, the model has assumptions and rules that
must be followed . Violations of these assumptions or rules have an impact on
the accuracy of the examination process. The magnitude of the error depends on
the model's sensitivity to a particular violation. One commonly observed
violation is to permit items that don't fit a model to be used in a CAT or CCT.
Using a model inappropriately could result in poor measurement and inaccurate
decisions once the testing program is online. For example, Kalohn and Spray
(1999) found that misclassification errors increased when a one-parameter IRT
model was used to fit item responses from a three-parameter logistic model
when misfitting items were permitted to be used in a CCT. This is problematic
because the primary purpose of a CCT is to make decisions regarding the
classification status of examinees . Increases in classification errors should not
occur when errors can be prevented through correct IRT model specification or
the correct application of the IRT model that has been selected.
If the item pool has been calibrated, it is always a good idea to determine if
the model used in the calibration process was appropriate. It may not be possible
to obtain item response data from previously calibrated items in order to check
the goodness of fit of the model to the data. In this case, the psychometric staff
may have to assume that the model is appropriate and simply proceed from that
point. However, if response data are available, or if the pool has not yet been
calibrated, a careful assessment of the quality of the calibrations is a necessary
initial step in the item pool evaluation process.
Evaluating the Convergence of the Calibration Process . An evaluation of the
overall quality of the calibration process is helpful in deciding whether to accept
the fitted estimates. BILOG and other item calibration programs provide output
pertaining to the estimation process itself. This output should be evaluated
before a decision to accept a model is reached. For example, it is helpful to
know whether the calibration process, which consists of numerous
computational iterations to reach a solution, converged or ended properly.
Evaluation of the convergence process, either by careful study of the
convergence indicators or by visual inspection of a plot such as the one in Figure
10.1, provides further evidence of good model fit. Important features to observe
in Figure 10.1 are that the log of the likelihood function continues to decrease
during the calibration process and that the largest change in the item parameter
estimates decreases with each iteration. Instances have been observed in which
this function has decreased during the early iterations of the calibration process
but then begins to increase. Usually, this is indicative of items that are either
poorly performing or miskeyed. The removal of these items or the correction of
the answer key usually corrects the problem, resulting in a likelihood function
that converges appropriately.
Assessing Goodness-of-Model-Fit. Several statistical indices can be used to help
determine whether an item fits a particular item response model. However, like
many statistical tests, these measures are sensitive to sample size. Large samples
almost always reject all models for lack of fit, while small samples rarely reject
any model. An additional approach to statistical tests of evaluating model fit
involves a graphical evaluation.
Plots of the fitted item characteristic curve (ICC), or P(θ), versus the observed
response curve are very helpful in evaluating overall model fit quality. They are
also helpful in ascertaining whether the model fits in crucial areas along the
latent ability or θ scale. For example, it may be important to determine if the
model fits around the area of the latent passing score, θp, for a CCT. Figure 10.2
shows a plot of an estimated ICC (i.e., the fitted model) versus the empirical
response curve for a given item. These curves were plotted from output
generated by the calibration software program, BILOG . This output includes the
expected frequencies of correct responses as predicted from the fitted model and
the proportions of correct responses actually observed. Although a test of model
fit is also part of that computer output, many researchers prefer to perform a
simple visual inspection of the fit before accepting or rejecting the parameter
estimates.

[Figure 10.1. Convergence of the calibration process: the largest change in the item
parameter estimates and the likelihood criterion plotted by iteration.]
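When raw response data are available, a rough version of this visual check can be produced without specialized software by grouping examinees on estimated ability and overlaying the observed proportions correct on the fitted 3PL curve. The sketch below is a minimal illustration with simulated data and assumed variable names, not the BILOG output described in the text.

```python
import numpy as np
import matplotlib.pyplot as plt

def p3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct response."""
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def plot_fit(theta_hat, responses, a, b, c, n_bins=10):
    """Overlay observed proportion correct (by ability group) on the fitted ICC."""
    bins = np.quantile(theta_hat, np.linspace(0, 1, n_bins + 1))
    mids, observed = [], []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (theta_hat >= lo) & (theta_hat <= hi)
        if mask.any():
            mids.append(theta_hat[mask].mean())
            observed.append(responses[mask].mean())
    grid = np.linspace(-4, 4, 200)
    plt.plot(grid, p3pl(grid, a, b, c), label="Expected (fitted 3PL)")
    plt.plot(mids, observed, "o--", label="Observed")
    plt.xlabel("theta"); plt.ylabel("Proportion correct"); plt.legend()
    plt.show()

# Simulated check for a well-fitting item like that in Figure 10.3 (a=.72, b=-.38, c=.14).
rng = np.random.default_rng(1)
theta = rng.normal(size=2607)
resp = (rng.random(2607) < p3pl(theta, 0.72, -0.38, 0.14)).astype(float)
plot_fit(theta, resp, 0.72, -0.38, 0.14)
```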
A visual inspection of the individual calibrated items reveals how well a
particular model fits each item's responses. A plot of a poorly fitting ICC is
illustrated in Figure 10.2. It is apparent from the plot that this item did not
differentiate among candidates across the entire ability continuum and did not fit
the model. Additional information about the item, in terms of test form
information, sample size, classical item statistics indices, IRT item parameter
estimates, standard errors of estimate, and a goodness-of-fit index also are
included in the summary information. Further analysis of this item by content
experts indicated that this item was miskeyed. In contrast, Figure 10.3 shows an
item with good fit characteristics.
Figures 10.2 and 10.3 illustrate items where it is relatively easy to determine
the status of fit for each item. However, many items require a more detailed
review of the plot and associated data. Figure 10.4 depicts such an item. The
observed curve is approximated fairly well by the estimated curve. The fit is not
as good as in Figure 10.3, but it is still acceptable. This conclusion is based not
only on a review of the plot but also on the item statistics presented above the
plot. In light of this information, it is apparent that this item should perform
well.
Approximating Item Calibrations. An item pool may consist of a mixed
collection of items in which some have actual item calibrations while others
only have classical item statistics, such as p-values and point-biserial correlation
coefficients. When this situation occurs, it is possible to use approximation
methods to link or scale the items without calibrations to those that have been
calibrated. The resulting calibrations are only approximate, but Huang, Kalohn ,
Lin, and Spray (2000) have shown that these approximations work reasonably
well in some testing situations.
[Figure 10.2. Estimated (fitted 3PL) versus observed item response curve for a poorly
fitting item.
N = 2607, P = .607, E(P) = .609, point biserial = -.042, biserial = -.054.
IRT item parameter estimates (SEE): a = .05 (.013), b = 1.01 (1.68), c = .25 (.053);
fit probability = .000.
Flags: Poor item fit; negative biserial correlation coefficient.]

[Figure 10.3. Estimated versus observed item response curve for an item with good fit.
N = 2607, P = .651, E(P) = .646, point biserial = .381, biserial = .490.
IRT item parameter estimates (SEE): a = .72 (.056), b = -.38 (.103), c = .14 (.041);
fit probability = .937.
Flags: None.]

[Figure 10.4. Estimated versus observed item response curve for an item with acceptable fit.
N = 2607, P = .673, E(P) = .672, point biserial = .264, biserial = .343.
IRT item parameter estimates (SEE): a = .35 (.038), b = -.78 (.294), c = .170 (.064);
fit probability = .013.
Flags: None.]
The fixed-form simulation proceeds as follows:
1. Use the k calibrated items that comprise the fixed form of the exam
for administration.
2. Draw a random sample of N θ-values that represent the examinees
from the specified distribution used to obtain the item parameter
estimates in the calibration computer program (usually denoted by
the phrase prior ability distribution).
3. Administer the items one by one to each examinee as described
previously by comparing the probability (P) of a correct response to
that item to a uniform random deviate, or U(0,1), drawn pseudorandomly
by the computer.
4. Score the test for each examinee based on total number of items
correct (or whatever scoring procedure is normally used for this
exam); if the test is a pass/fail test, compare the resulting scores, X,
for each examinee against the pass/fail score, Px (i.e., the
percentage-correct rate required to pass) and make the pass/fail
decision.
5. Compare the expected score distribution with the simulated observed
score distribution.
6. For CCT, evaluate the errors in classification (i.e., the number of
examinees with true passing abilities who were scored as failing the
test and the number of true failing examinees who were scored as
passing the test). For CAT, determine errors in the estimation of
latent ability.
7. Compute test statistics such as the average observed test score, standard
deviation of observed test scores, reliability (KR-20), and
distribution of observed test scores, and compare these to the actual
results from previous test form administrations. Additionally, for
CCT compute pass/fail rates and decision consistency estimates for
the proportion of agreement and Cohen's kappa. (A minimal sketch of
these steps appears after this list.)
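The following is a bare-bones sketch of the core of Steps 1 through 4 and 7 for a pass/fail fixed form. It uses made-up 3PL item parameters and an assumed standard normal prior ability distribution; it is an illustration of the general procedure rather than the simulation software described in the text.

```python
import numpy as np

rng = np.random.default_rng(42)

def p3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response (vectorized over items)."""
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

# Step 1: the k calibrated items on the fixed form (illustrative values only).
a = rng.uniform(0.5, 2.0, size=100)
b = rng.normal(0.0, 1.0, size=100)
c = rng.uniform(0.1, 0.25, size=100)

# Step 2: draw simulated examinees from the assumed prior ability distribution.
theta = rng.normal(0.0, 1.0, size=5000)

# Step 3: compare P(correct) to a uniform random deviate for each examinee-item pair.
prob = p3pl(theta[:, None], a, b, c)              # 5000 x 100 matrix of probabilities
responses = (rng.random(prob.shape) < prob).astype(int)

# Step 4: number-correct scoring and the pass/fail decision at P_x = 67%.
scores = responses.sum(axis=1)
passed = scores >= 0.67 * len(a)

# Step 7: summary statistics to compare with operational results.
p_items = responses.mean(axis=0)
kr20 = (len(a) / (len(a) - 1)) * (1 - (p_items * (1 - p_items)).sum() / scores.var())
print(f"mean = {scores.mean():.1f}, sd = {scores.std():.1f}, "
      f"KR-20 = {kr20:.3f}, pass rate = {passed.mean():.3f}")
```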
If the results of the fixed form simulation appear to match those that were
actually observed in a previous operational administration of the form, then the
model can be assumed to be adequately describing the examinee-item
interaction. If the results do not match, then it will be important to review
assumptions that were included in this process. Differing results may be due to
an incorrect assumption related to the latent ability distribution or the selection
of an incorrect model for the calibration process. The cause of the differences
needs to be resolved before going forward with any further work. Once this has
been resolved, the results of any subsequent CBT simulations based on this
model can be accepted, and classification error rates or the accuracy of ability
estimates can be compared to those achieved in the paper-and-pencil fixed-form
format.
representative of the SRS, provided that the revised pool has a sufficient number
of items. The plot in Figure 10.6 illustrates an item pool and SRS that are
acceptable in terms of their similarities.
[Figure 10.5 (two panels). Maximum item information plotted against θ for items
classified into content Domains 1 through 4.]

[Figure 10.6. Information functions of the item pool and the standard reference set (SRS).]
for item i, and X is the response to the item: either X = x = 1 (for a
correct response) or X = x = 0 if the item is answered incorrectly.
3. Draw a random number from the uniform distribution, U[0,1]. Call
this value u.
4. Compare u to Pi(x | θj, ai, bi, ci).
If u ≤ Pi(x | θj, ai, bi, ci), X = x = 1.
If u > Pi(x | θj, ai, bi, ci), X = x = 0.
5. Continue to "administer" n items to this examinee. When the test is
complete for this examinee, return to Step 1 and administer the test
to another examinee.
reasonable error tolerance (e.g., .05). The indifference region boundaries θ0 and
θ1 are usually set to be equidistant from θp.
Determination of the Latent Passing Score (θp)
The next step involves establishing the passing score on the latent scale, θp, that
will be used in evaluation of the quality of the item pool. The latent passing
score usually is assumed to be equivalent to the passing percentage used in the
fixed-length simulation described earlier, or Px. There are two approaches that
may be used to determine θp: a graphical procedure and a computational
procedure.
Graphical Procedure. First, plot the function $\frac{1}{k}\sum_i P_i(\theta)$ versus θ, where the
sum is over the k items from the fixed-length test described earlier for the SRS.
This function is called the test characteristic function, or TCF. The
SRS serves as the reference form on which θp is established.
Next, locate the point Px on the vertical axis of the TCF plot (i.e., the percentage-
correct score required to pass the test; 67% in Figure 10.7), and draw a horizontal
line until it intersects the TCF. Then draw a vertical line from that intersection
perpendicular to the θ-axis. The point at which the vertical or perpendicular line
intersects the θ-axis is θp, or approximately -.40, the latent passing score that
corresponds to Px.
Computational Procedure. The second method requires calculation of the
probabilities of the TCF to a much finer degree in order to determine what value
of θ corresponds to the passing standard, Px. Values of θ are substituted into the
TCF until a match to the Px value occurs (see Table 10.2). This is comparable to
finding the θ solution to the equation

$$\frac{1}{k}\sum_{i=1}^{k} P_i(\theta_p) = P_x. \qquad (10.1)$$

In Table 10.2, two values of θ have been marked with asterisks to identify where
the probability is 67%. The comparable passing score on θ, the latent ability
metric, is equivalent to -.397.
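Computationally, Equation 10.1 is a one-dimensional root-finding problem, because the TCF increases monotonically in θ. The sketch below is a minimal illustration with hypothetical item parameters standing in for the standard reference set; bisection is only one of several search methods that could be used.

```python
import numpy as np

def p3pl(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def latent_passing_score(a, b, c, p_x, lo=-4.0, hi=4.0, tol=1e-4):
    """Solve (1/k) * sum_i P_i(theta) = p_x for theta by bisection (Equation 10.1).
    The test characteristic function is monotonically increasing in theta."""
    tcf = lambda theta: p3pl(theta, a, b, c).mean()
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tcf(mid) < p_x:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical standard reference set; P_x = .67 as in the text's example.
rng = np.random.default_rng(7)
a, b, c = rng.uniform(0.5, 2.0, 80), rng.normal(-0.5, 1.0, 80), rng.uniform(0.1, 0.25, 80)
print(round(latent_passing_score(a, b, c, p_x=0.67), 3))
```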
[Figure 10.7. Test characteristic function for the standard reference set, used to locate
the latent passing score θp corresponding to the passing percentage Px.]
maximum item exposure rate. This reduces the number of iterations to find the
optimal exposure controls by about 50%. Software developed to determine the
exposure control parameters essentially takes the test administration software
and provides a computing environment that repeats the simulation process until
the item exposure parameters begin to stabilize, as evidenced by the minimal
differences in the parameters for two consecutive simulation runs.
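The general shape of such an iteration can be sketched as follows. This is a simplified, Sympson-Hetter-style illustration rather than the software described in the text, and the function `simulate_selection_rates` is a hypothetical placeholder for the full test administration simulation.

```python
import numpy as np

def adjust_exposure_parameters(simulate_selection_rates, n_items, r_max,
                               max_iters=25, tol=0.001):
    """Sympson-Hetter-style iteration for exposure control parameters.

    `simulate_selection_rates(k)` is a hypothetical placeholder: given the current
    exposure parameters k, it must return the proportion of simulated examinees
    for whom each item is *selected*.  A selected item is actually administered
    with probability k_i, so its administration rate is (selection rate) * k_i."""
    k = np.ones(n_items)                      # start with no exposure control
    for _ in range(max_iters):
        selection = simulate_selection_rates(k)
        # Tighten the control parameter of any item whose selection rate would
        # push its administration rate above the target r_max.
        new_k = np.where(selection > r_max,
                         r_max / np.maximum(selection, 1e-9),
                         1.0)
        if np.max(np.abs(new_k - k)) < tol:   # parameters have stabilized
            return new_k
        k = new_k
    return k
```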
Once the acceptable exposure controls have been established, the simulation
can be run to determine classification error rates, average test length,
consistency of classification, percentage of examinees force-classified, actual
item exposure rates, test overlap, and how well the test specifications were met.
At this point, the parameters for the SPRT CCT can be adjusted to improve the
decision accuracy. Then exposure control parameters must be determined again
before a simulation can be run to evaluate how the new parameters function. It is
important to note that any changes to any of the SPRT CCT parameters or to the
item pool itself necessitate the estimation of new item exposure controls. This is
critical because any change that impacts how items are selected and
administered has a direct impact on final total exposure rate of items.
but it is the amount of information at the current ability estimate of the examinee
that dictates item selection. During the simulation process, the exposure control
parameters must be evaluated at each θ point to prevent examinees at the same
ability level (i.e., at a given value of θ) from receiving similar sets of items,
which would result in a high degree of overlap between tests. This is of particular
concern in areas where the density of the θ distribution is low (e.g., near the tails
of the θ distribution). This process is referred to as a conditional approach to item
exposure control.
Prior to conducting CAT simulations, not only must the acceptable level of
accuracy be determined for the ability estimates, but there are other test
parameters that must be established. These include (1) the minimum and
maximum test lengths, (2) maximum item exposure control target, (3) a method
for establishing the exposure control parameters, and (4) the maximum
acceptable test overlap rate.
The results of the item pool evaluation can be used to help testing programs
make important decisions regarding the type of CBT program that would be best
for their program. If the item pool evaluation suggests that an adaptive CBT
program cannot be supported, one possible alternative is to have multiple fixed-
length exams available for online administration. Recall that the process of
constructing multiple fixed-length test forms is called automated test assembly
or ATA (see Chapter 7). An evaluation of an item pool for a possible ATA
application simply consists of running the ATA software program or programs
to determine if multiple forms can be constructed from the pool and, once
constructed, if these multiple forms (a) have adequate protection from item
overexposure and high test overlap rates, (b) have the appropriate content
distribution in terms of the test blueprint or content outline, and (c) can be
considered equivalent in terms of defined psychometric criteria.
Online Calibration
Concerns about obtaining item calibrations from online CBT programs depend
on the type of CBT being used . For CAT programs, items are selected for
administration based on current estimates of ability. Therefore, there is a
restriction of ability range for some of the CAT items, which may affect the item
calibrations (Stocking & Swanson, 1993) . For CCT programs, items are
administered at the passing score, and theoretically all examinees can receive all
items. Therefore, restricted range is not an issue with CCT.
A potentially more serious problem arises when the initial item pool has
been calibrated from paper-and-pencil administrations, and the items somehow
interact with the CBT environment significantly to change the examinee-item
interaction. For traditional multiple-choice text-based items, this may not be
much of a problem. However, items that depend on extensive graphics or those
that require responses other than the traditional multiple-choice alternatives may
operate differently online . If the item calibrations from online administrations
appear to differ significantly from those in the initial pool, the online
calibrations should replace the paper-and-pencil calibrations in the item pool. If
this happens frequently, the entire item pool may change, requiring a
reevaluation of the entire pool. It is suggested that online calibrations be
performed and evaluated, in terms of their discrepancies from the original item
10. ItemPool Evaluation and Maintenance 191
parameter estimates, only after enough data have been collected to ensure that
calibration differences are real and not simply due to examinee sample sizes or
estimation error. This may translate into an item calibration cycle that spans
several years.
Pretest or tryout items are usually introduced into online CBT administrations in
order to stock the item pool with new items that reflect changes in content or are
intended to replace or augment weak areas within the pool. The problem of
calibrating pretest items so that they can be used online differs from those
associated with the calibration of operational items. Usually, pretest items can
be administered to all examinees, regardless of their ability levels so that there is
no problem with range restriction. However, pretest items tend to be
administered to fewer examinees than operational items, and small sample sizes
can lead to large item parameter estimation errors.
The typical method of calibrating pretest items is to perform the calibration
along with the calibration of the operational items that also were administered.
This can lead to a sparse matrix situation where the responses from the N
examinees who have taken the operational items plus some number of pretest
items are calibrated together in the same computer run. The matrix of item
responses has to consist of N rows and o + p columns, where o equals the
number of operational items administered in total to the N examinees and p
equals the number of pretest items administered in total to the N examinees.
Because each examinee does not receive the same operational items nor
(usually) the same pretest items, the number of columns, o + p, exceeds the
number of actual item responses from each examinee. The remainder of the
columns must be identified as not-presented items, or items to be ignored in the
calibration process. This N by (o + p) calibration problem may become too large
for many calibration software programs to handle. In addition, there is always a
danger that the calibration of the p pretest items somehow may contaminate the
calibration of the o operational items. After all, pretest items are, by their very
nature, in the early stages of development and refinement. Calibrations on items
that do not function well or may be miskeyed can affect the entire process.
An alternative approach is to consider the o operational item parameter
estimates as fixed or known (by using the previously known estimates of the
parameters) and to calibrate the pretest items one at a time, using the posterior
distribution of θ, given the responses to the operational items, to estimate the
ability of the examinee taking the pretest item. This amounts to solving the
pretest item calibration as a regression problem, with the single value of θ
replaced by an estimate of the posterior density of θ given the operational item
responses, Xi, or h(θ | Xi), i = 1, 2, ..., o. Not only does this approach eliminate
the sparse matrix problem; it also allows the calibration of a single
pretest item to be performed on relatively small sample sizes. This is due to the
fact that (1) only a single item actually is being calibrated at any one time and
(2) ability is assumed to be fixed or known (i.e., h(θ | X) is given). The approach
just described frequently produces estimates of item parameters that cannot be
satisfactorily obtained by other more traditional methods.
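A simplified sketch of this idea is shown below. It plugs in point estimates of θ from the operational items rather than the full posterior density h(θ | X) described above, and the function and parameter names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def p3pl(theta, a, b, c, D=1.7):
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def calibrate_pretest_item(theta_hat, x, start=(1.0, 0.0, 0.2)):
    """Estimate (a, b, c) for a single pretest item, treating examinee abilities
    obtained from the operational items as fixed and known."""
    def neg_log_lik(params):
        a, b, c = params
        p = np.clip(p3pl(theta_hat, a, b, c), 1e-6, 1 - 1e-6)
        return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    result = minimize(neg_log_lik, start, method="L-BFGS-B",
                      bounds=[(0.1, 3.0), (-4.0, 4.0), (0.0, 0.5)])
    return result.x

# Simulated check: recover parameters of a pretest item seen by 600 examinees.
rng = np.random.default_rng(3)
theta = rng.normal(size=600)
x = (rng.random(600) < p3pl(theta, 1.2, -0.3, 0.15)).astype(float)
print(calibrate_pretest_item(theta, x))
```

Because only one item is calibrated at a time, the estimation remains feasible even with the relatively small pretest sample sizes mentioned above.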
Once the pretest items have been calibrated, they can be put aside until it is
convenient to integrate them into the operational pool. This integration process
occurs infrequently because changes in the operational pool require new pool
simulations to update item exposure rates. It makes more sense to withhold all of
the pretest items and introduce them into the operational pool systematically, for
example, on a yearly basis. Once they have been included in the operational
pool, the pretest items can be used in subsequent simulations to determine their
individual item exposure rates. The inclusion of these new items into the pool
also will affect the exposure rates of those items already in the pool (i.e., these
will have to be reestimated using the entire new pool).
If the test blueprint or content outline has been changed, or if a new passing
standard has been established, a new standard reference set must be used to
locate the new passing standard on the θ-scale. If new content areas have been
added to the test blueprint, or even if current areas have been retained but only
the percentages of items to be used on the CBT have been changed, the changes
can alter the reference set and, hence, the passing standard. Obviously, once a
new standard reference set has been developed or a new passing standard has
been established, computer simulations will have to be run again to estimate
new item exposure rates as described earlier.
Summary
A complete and intensive evaluation of an item pool is required before a
decision to move an existing program to CBT is made. Essential steps in this
evaluation process have been considered in this chapter. Major steps include:
item pool content review by content experts, assessment of item pool statistical
characteristics, and simulation of the CBT environment. This process helps to
identify the strengths and weaknesses of an item pool and facilitates the
selection of an appropriate CBT method. Once a testing program is online, it is
imperative that planning take place to ensure that the item pool is maintained by
periodic evaluation and stocked with newly pretested items.
References
Davey, T., & Parshall, C. G. (1995, April). New algorithms for item selection and
exposure control with computerized adaptive testing. Paper presented at the annual
meeting of the American Educational Research Association, San Francisco.
Huang, C., Kalohn, J. C., Lin, C., & Spray, J. A. (2000). Estimating Item Parameters
from Classical Indices for Item Pool Development with a Computerized
Classification Test (ACT Research Report). Iowa City: ACT, Inc.
Kalohn, J. C., & Spray, J. A. (1999). The effect of model misspecification on
classifications made using a computerized classification test. Journal of Educational
Measurement, 36, 46-58.
Nering, M. L., Davey, T., & Thompson, T. (1998). A hybrid method for controlling item
exposure in computerized adaptive testing. Paper presented at the annual meeting of
the Psychometric Society, Champaign-Urbana.
Parshall, C. G., Davey, T., & Nering, M. L. (1998, April). Test development exposure
control for adaptive testing. Paper presented at the annual meeting of the National
Council on Measurement in Education, San Diego.
Stocking, M. L., & Lewis, C. (1995). A New Method of Controlling Item Exposure in
Computerized Adaptive Testing (Research Report 95-25). Princeton, NJ:
Educational Testing Service.
Stocking, M., & Swanson, L. (1993). A method for severely constrained item selection in
adaptive testing. Applied Psychological Measurement, 17, 277-292.
Sympson, J. B., & Hetter, R. D. (1985). Controlling item-exposure rates in computerized
adaptive testing. Proceedings of the 27th Annual Meeting of the Military Testing
Association (pp. 973-977). San Diego: Navy Personnel Research and Development
Center.
Thomasson, G. L. (1995). New item exposure control algorithms for computerized
adaptive testing. Paper presented at the annual meeting of the Psychometric Society,
Minneapolis.
11
Comparison of the
Test-Delivery Methods
Overview
The purpose of this chapter is to highlight some of the considerations in
choosing a test-delivery method. First, a brief summary of test-delivery methods
is provided, followed by a discussion of aspects of a testing program on which
the test-delivery models can be compared . The various elements of a testing
program that were discussed in each of the individual test-delivery method
chapters are used here as grounds for comparison across methods.
These considerations are test procedures, measurement characteristics, and
practical characteristics. The test procedures include test form assembly, scoring
method, and item pool characteristics. Measurement characteristics include
elements such as test length, reliability, test security, and the ability to
accommodate set-based items . Practical characteristics include examinee
volume, initial efforts required for development, ability to sustain dual
platforms, ability to accommodate pretest items, and cost.
In addition to these aspects of a testing program, other considerations are
used as grounds for comparison. These include relevant pool maintenance
issues, examinee concerns , software considerations, and the ease with which
innovative item types can be supported.
The automated test assembly process produces multiple test forms that are
equivalent in some sense. If the test assembly process has required the test forms
to have comparable difficulty and variability for each examinee, the tests, and
thus the results, would be interchangeable. This would eliminate the need for
separate passing scores for each form or post-administration equating. Such tests
are called preequated tests. ATAs have the same advantages over paper-and-
pencil testing that the computerized fixed tests offer. Furthermore , they offer
improved test security when testing volumes are high and item sharing among
examinees is a concern.
The CFT and ATA test-delivery methods are not equipped to meet the needs of
the individual test taker. However, the very nature of the computerized adaptive
test, or CAT, makes it ideal for constructing or tailoring the items to the
individual examinee. A CAT is particularly ideal for obtaining a norm-
referenced score, one that distinguishes between examinees along an interval
score scale. The CAT usually is variable in length, although there are exceptions
to this.
In terms of adaptive variable-length tests, the choice between a CAT and a
CCT often comes down to (1) the purpose of the test; (2) the strength, width,
and depth of the item pool; and (3) examinee preferences for item review. If the
primary purpose of the CBT is to classify examinees into mutually exclusive
categories, the CCT accomplishes this task more efficiently than a traditional
adaptive CAT (Spray & Reckase, 1996). On the other hand, if the primary
purpose is to rank-order examinees over the entire score scale, then a CAT is
preferred.
Test Procedures
The foundational test procedures of test form assembly, scoring method, and
requirements for item pool characteristics are discussed here. The differences
across test-delivery methods in terms of these test procedures are considered.
Test Assembly
Test assembly refers to the psychometric methods used to construct a test. For
example, CFT forms usually are constructed using classical item statistics, such
as item difficulty indices or p-values and discrimination or point-biserial
correlation coefficients. A CFT also may be constructed on the basis of the
content specifications of an existing test blueprint with little attention paid to
item statistics. Because the test form is fixed (i.e., because items are not selected
to be best in any sense), the items and therefore the test score may not be
optimal for all examinees, the "one test fits all" philosophy. Such a test form
could be constructed to measure best at, say, a particular point on the score
metric, such as at the passing score. However, this is not typically done. Thus,
the CFT is not likely to be the most efficient delivery method to use, in terms of
measurement precision.
In the ATA method, multiple test forms constructed to be equivalent in some
sense are assembled offline for online delivery later. These test forms are
assembled according to content and statistical requirements. Like the CFT, tests
that have been constructed using ATA methods may not be optimal for
individual examinees but might be constructed to measure well at a particular
point on the ability or score scale (e.g., at the passing score). ATA methods
allow for both classical and IRT construction methods, while the adaptive tests,
CAT and CCT, usually require IRT calibrations of the items.
The ramifications of an IRT-based program are many. First, if they have not
been previously calibrated, items must be analyzed using an item calibration
program that provides estimates of the items' characteristics. This implies that
the item response data from previously administered test forms be accessible. In
addition, calibration requires fairly large examinee samples and access to
calibration software. IRT-based programs typically also involve some input and
oversight from trained psychometric personnel. There are ways in which an item
pool can be calibrated even if only a portion of the items have response data
(Huang, Kalohn, Lin, & Spray, 2000). However, these calibrations must be
considered temporary until enough data have been collected from online testing
to update the item calibrations.
For adaptive delivery methods, test assembly may more directly be
conceptualized in terms of item selection. In CATs, items are selected to
produce a maximally efficient and informative test, resulting in examinee scores.
Conversely, in CCT item selection the goal is to classify examinees into one or
more mutually exclusive groups, and individual examinee scores are usually not
provided. In both cases, tests are assembled interactively, as the examinee
responds to individual items. Further, tests may be designed to be either
variable- or fixed-length.
Scoring

Item Pools
Many times the type of CBT delivery method to be used is dictated solely by
characteristics of the current item pool, primarily the quality and quantity of
items. CFTs have the simplest item pool requirements in that many CFTs are
constructed from a relatively small number of available items. Furthermore, if
there is no item pool per se but rather only enough items for one form, then
the CFT is the obvious choice. (Alternatively, the test program could be
maintained in paper-and-pencil mode as additional items are developed and
pretested.)
On the other hand, if a pool appears to be able to support multiple forms with
a tolerable amount of test overlap (or if a sufficient number of additional items
are written), then the ATA method makes sense. In the ATA method there is an
assumption concerning the item pool from which the tests are constructed. It is
assumed that the item statistics on which the ATA process depends are
representative of the items as they will appear online. This assumption may be
untenable in situations in which items near the end of a test were not reached by
a majority of examinees. Such items would have higher difficulty or p-values
(i.e., would be easier) if these items appeared earlier on the computerized
version of the test. Thus, the item would no longer function as its difficulty
index would predict. It is usually assumed that the item statistics on ATA forms
are invariant to item position on the test and to test context effects. A context
effect could include the way in which a graphic image might appear on the
computer display or the interaction of other items with a particular item (e.g.,
one item providing a clue to the answer of another item).
Requirements are greater for computerized adaptive testing programs. The
CAT item pool must measure well across a wide range of abilities . In other
words it must include many items that measure high- and low-ability examinees
as well as those in the middle of the ability range. Depth (or a large number of
items in the pool that actually have a chance to appear on a CBT) is also critical
for test security reasons. CAT programs tend to be very demanding in terms of
the size of the item pool required, due to uneven item exposures, to the fact that
CATs are frequently used for high-stakes exam programs, and to their
availability in continuous or on-demand test settings.
In terms of the item pool, a CCT requires items to measure well (i.e., to
discriminate well between those who should pass and those who should fail) at
the passing score (or scores, if more than one decision point is used). Therefore,
the items in a CCT pool should have the greatest amount of information at or
near the passing score (or scores). In other words the CCT pool should have
depth at or near the passing scores. Items that measure well at the extremes of
the ability scale are not important in the CCT decision. Pool depth also is critical
in ensuring that test overlap and individual item exposure are minimized.
For the CCT item pool, items that are optimal at distinguishing between the
two or more score classifications are preferred. Items do not have to measure
well at all potential ability levels; however, items that measure well at the
decision point or points are critical to the CCT process. The basic rule of thumb
is, "The more items that measure well at the passing score or decision point, the
better."
Measurement Characteristics
Test-delivery methods can be analyzed in terms of the ease with which they can
support testing programs with specific measurement characteristics. The
measurement characteristics considered here as grounds for comparison across
delivery method include aspects of test length, methods for estimating test
reliability, needs for test security, and support for set-based items.
Test Length
The CFT and ATA methods produce tests of fixed length, while CAT and
CCT can produce either fixed- or variable-length tests. The concept of adaptivity
implies that items are selected to best address the CBT measurement problem for
each examinee. This in turn leads to selection of a minimum number of test items
for each examinee, resulting in different test lengths for different individuals. If a
fixed test length is imposed on the CCT or CAT, it results in different levels of
precision for different individuals, which may or may not be acceptable to a
given testing program. Given the interaction between test length and reliability,
and the conflict between measurement efficiency and content constraints,
satisfactory test lengths are often determined through simulations conducted
during the test development process.
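As an illustration of that kind of simulation, the following minimal Python sketch estimates the distribution of variable-length test lengths under a precision-based stopping rule. Everything in it is assumed for the example: the per-item information distribution, the target standard error of 0.30, and the 60-item ceiling are hypothetical, and the standard error is approximated as one over the square root of the accumulated information.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_test_lengths(n_examinees=1000, target_se=0.30, max_items=60):
    """Crude simulation of variable-length adaptive test lengths.

    Each administered item is assumed to contribute a random amount of
    information; the test stops when the standard error of the ability
    estimate, 1/sqrt(total information), falls below target_se or the
    item maximum is reached. All values here are illustrative.
    """
    lengths = []
    for _ in range(n_examinees):
        info, n_items = 0.0, 0
        while n_items < max_items:
            info += rng.uniform(0.2, 0.8)  # assumed information of the next item
            n_items += 1
            if 1.0 / np.sqrt(info) <= target_se:
                break
        lengths.append(n_items)
    return np.array(lengths)

lengths = simulate_test_lengths()
print(lengths.mean(), np.percentile(lengths, 95))  # typical and near-worst-case lengths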
Reliability
The delivery methods also differ in how measurement precision is typically
characterized. For the fixed-length CFT and ATA methods, reliability is usually
estimated with internal consistency indices; for the CAT, precision is expressed
through the standard errors of the ability estimates; and for the CCT, the relevant
index is the consistency of the classification decisions (see Table 11.1).
Item/Test Security
Item or test security is threatened by anything that tends to result in the same
items being administered to many examinees. For example, deeper item pools
usually result in better security than shallower pools, for obvious reasons. Also,
having few content requirements (and, in fact, fewer constraints in general)
results in more item choices for administration. By its very nature, a CFT will
reveal the same items to all examinees unless more than one fixed form is used.
Consequently, it is important to consider examinee volumes and security needs
when using a CFT. The ATA method can control the test overlap rate by
producing multiple forms. Within both the CAT and CCT methods, the item
selection algorithms include item exposure control methods that protect test
security to some extent.
Set-Based Items
An item set consists of a group of items that usually refer to the same stimulus
material or stem. Items in a set can either be forced to appear together in their
entirety or be selected to appear as a subset. All of the CBT methods allow
set-based items to appear on tests. They are easily included in the fixed-form
CFT and ATA methods. For the adaptive CAT and CCT methods, forcing items
to be administered solely because they belong to a set (whether all of the items
or a subset) may result in longer tests or tests of less precision.
Practical Characteristics
Additional elements of a test program concern certain practical characteristics,
such as the examinee volume, the extent of the initial development likely to be
required, the ease with which the program can be maintained on dual platforms,
aspects of item pretesting, and the cost to the examinee. Test-delivery methods
vary in their needs and methods for addressing these practical concerns.
Examinee Volume
Examinee volume refers to the number of examinees who take a test. Typically,
volumes are reported by year, testing cycle, or administration. Small-volume
programs (e.g., fewer than 500 examinees) may be able to tolerate a single, fixed
test form, especially if there is little danger of item sharing among examinees or
candidates. Further, even if a single fixed form is used, the items on that form
can be scrambled or presented in random order to each examinee to increase
security. For a CFT form to remain secure, examinee volumes usually have to be
quite small to justify a single form. Security would be enhanced if several forms
of the test, such as those provided under the ATA method, could be constructed
and administered randomly to examinees. Thus, for programs with small
examinee volumes, either the CFT or the ATA method would be suitable.
Programs with moderate volumes (e.g., around 1000 to 2000 examinees) or
large volumes (more than 2000) require multiple fixed forms or a larger pool of
items from which to assemble multiple forms online. Therefore, either the ATA
method (for fixed forms) or an adaptive method (CAT or CCT) is appropriate for
these larger volumes.
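These volume bands are rough guidelines rather than fixed cutoffs. A minimal sketch of how they might be encoded, with the thresholds of 500 and 2,000 examinees taken from the examples above and the function name purely illustrative:

```python
def candidate_delivery_methods(annual_volume):
    """Map an examinee volume to candidate test-delivery methods, using the
    rough volume bands discussed above (thresholds are illustrative)."""
    if annual_volume < 500:
        # A single fixed form (possibly scrambled) or a few ATA forms may suffice.
        return ["CFT", "ATA"]
    # Moderate (about 1000-2000) and large (over 2000) volumes call for
    # multiple fixed forms or online assembly from a larger item pool.
    return ["ATA", "CAT", "CCT"]

print(candidate_delivery_methods(300))   # ['CFT', 'ATA']
print(candidate_delivery_methods(5000))  # ['ATA', 'CAT', 'CCT']
```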
Initial Development
Initial development activities refer to the amount of work, time, and expense
required to begin constructing or assembling computer-based tests. Often, items
already exist from previous administrations of paper-and-pencil tests. Although
there is no guarantee that the items will behave when administered online as
they did in the paper-and-pencil format, at least they provide some foundation
on which to build an item pool.
The number of items that a testing program requires is based on many
elements. Among the most critical are the stakes of the testing program and how
frequently the exam is offered (i.e., a few times a year, occasional testing
windows, or continuous testing). The number of examinees needed to pretest
items is based primarily on the psychometric test model used. IRT methods
require more examinees for pretesting than do classical test methods, and the
three-parameter IRT model requires more examinees than the one-parameter, or
Rasch, model. Because of these elements, the CFT method is often the least
demanding test-delivery method, requiring the fewest items and examinees,
while the adaptive, variable-length CAT and CCT are often the most demanding.
The ATA method offers a viable compromise because it can be used to construct
tests based on classical item characteristics.
Initial development activities also include conducting any preparatory
computerized simulations. These simulations of test conditions are not necessary
for the CFT but are needed for the ATA, CAT, and CCT methods. Among these
delivery methods, the most extensive simulations are probably needed for CATs
and the least extensive for ATAs.
Dual-Platform Capabilities
Pretest Accommodations
Recall that a pretest item is one that is undergoing a tryout phase. Usually, the
tryout or pretest item is not integrated into the examinee's final test score. The
pretest item must be administered to a sufficient number of examinees before
stable estimates of the item's characteristics, such as its difficulty and
discrimination, can be calculated and the item becomes operational for future
administrations.
The key to the development of good items via pretesting is to ensure that the
examinee responds to the pretest item as though it were an operational or scored
item. That is, the examinee should not be able to detect the tryout nature of the
item from its position on the test or from its stem or list of alternatives. For many
paper-and-pencil tests, pretest items are placed at the end of the operational test
so that, if the test is timed and an examinee runs out of time, only the pretest
items, and not the operational or scored items, are affected. This placement is
easy for tests of fixed length, but it is more difficult for variable-length tests such
as the CAT or CCT.
However, it is still possible to pretest a certain number of items on a
variable-length CAT or CCT. For example, if a program's CCT required a
minimum of 50 and a maximum of 70 operational items, then examinees who
completed the test in 50 items would be required to respond to 30 pretest items,
while those who finished after 70 items would answer only 10 pretest items.
Examinees who completed the test in more than 50 but fewer than 70 items
would be given the appropriate number of pretest items to bring the total to 80.
This ensures that every examinee responds to a total of 80 items.
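A minimal sketch of this allocation rule, assuming (as in the example) a fixed total of 80 administered items; the function name is hypothetical:

```python
def pretest_count(operational_items, total_items=80):
    """Number of pretest items an examinee receives so that every examinee
    answers the same total number of items.

    operational_items: items used before the variable-length test stopped
    total_items: fixed total each examinee must answer (assumed here: 80)
    """
    return total_items - operational_items

# Values from the example: a CCT with a 50-item minimum and a 70-item maximum.
print(pretest_count(50))  # 30 pretest items for an examinee stopping at 50
print(pretest_count(70))  # 10 pretest items for an examinee stopping at 70
print(pretest_count(63))  # 17 pretest items for an examinee stopping at 63
```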
Interspersing or embedding pretest items throughout a variable-length
adaptive test is difficult because the number of items required of a particular
examinee is not known until the test administration has been completed.
Research on predicting CAT length online for the purpose of administering
embedded pretest items has produced equivocal results (Davey, Pommerich, &
Thompson, 1999).
Cost
The cost of a CBT (borne initially by the testing program, but often passed on
indirectly to the examinee) consists of many elements, such as item development,
software programming, and administration fees. Some cost sources, including
these three, differ across test-delivery methods. To determine the cost of
computerizing an exam, either overall or per examinee, these and additional
factors need to be analyzed.
The number, and thus the cost, of items needed for a testing program is
related to the test-delivery method. In most cases, a CFT is the least demanding
in terms of item development cost, followed by the ATA, the CCT, and finally
the CAT.
The cost of software programming is also likely to vary across delivery
methods. More complex software, such as that required for delivery methods
using complex, adaptive item selection, typically results in greater cost. In this
case, the adaptive tests (i.e., the CAT and CCT) will have higher development
costs.
Finally, the cost of administering a computerized test may also vary across
delivery method, as it is often related to seat time, the amount of time scheduled
for an examinee to use a computer and take a computerized exam. In fixed-
length exams such as those provided in the CFT and ATA methods, it is easier
to determine the cost per examinee. While variable test length sometimes makes
it more difficult to estimate computerized administration time and costs, the
more efficient delivery methods, such as the CAT and CCT, are likely to have
an advantage over nonadaptive methods.
Other Considerations
Finally, selection of a test-delivery method for a given testing program should
also take into account several additional considerations, including pool
maintenance issues, examinee issues, software issues, and the inclusion of
innovative item types.
Pool Maintenance
The CFT method is typically the least demanding test-delivery method in terms
of pool maintenance activities, due to its small pool size and its frequent use in
low-stakes settings. ATA procedures require more extensive maintenance.
Examinee Issues
Software Issues
Innovative Item Types
The relative ease with which innovative item types can be incorporated into a
CBT program depends on a number of elements, including the dimension of the
innovation and the facility with which that dimension can be handled by the
CBT administration software and hardware. The aspect of innovative item types
that is most relevant in comparing test-delivery methods, however, may be task
complexity. Innovative item types with low task complexity can be used
relatively freely across delivery methods. Conversely, item innovations that
result in high task complexity can be used most easily in a nonadaptive exam. In
fact, a highly complex computerized simulation of a performance-based task can
be conceived of as a novel test-delivery method in its own right. In any case, it is
clearly a more challenging proposition to add highly complex, perhaps
interactive, items to an adaptive CBT than to a fixed-form program.
Summary
One of the basic premises of this book is that every test-delivery method has
particular strengths and weaknesses, and that a testing program should carefully
select the delivery method that best satisfies its goals and needs. The four
methods of CFT, ATA, CAT, and CCT are only some of the test-delivery
methods that may be considered. Within this chapter these methods have been
compared on several important elements, including testing procedures,
measurement characteristics, practical characteristics, and other considerations.
Table 11.1 provides a brief summary of some of the highlights of this discussion.
It should make clear some of the advantages and disadvantages of each delivery
method and help guide test developers in selecting an optimal method for a
specific exam program.
Table 11.1. Summary of Features of the Test-Delivery Methods

Test Procedures
  Test assembly
    CFT: Classical or IRT methods
    ATA: Classical or IRT methods
    CAT: IRT methods; tests are assembled in real time
    CCT: IRT preferred, classical possible
  Scoring
    CFT: Number correct or proportion correct
    ATA: Number correct or proportion correct
    CAT: IRT ability estimates or scaled scores
    CCT: Classification decision alone, or scaled score
  Item pool size
    CFT: Typically small
    ATA: Small or large
    CAT: Large
    CCT: Medium

Measurement Characteristics
  Test length
    CFT: Fixed
    ATA: Fixed
    CAT: Fixed or variable
    CCT: Usually variable
  Reliability
    CFT: Usually internal consistency
    ATA: Usually internal consistency
    CAT: Standard error of ability estimates
    CCT: Consistency of classification
  Item/test security
    CFT: Minimal provisions
    ATA: Creates multiple forms to minimize test overlap
    CAT: Minimize test overlap across examinees
    CCT: Item exposure control
  Set-based items
    CFT: Easily addressed
    ATA: Easily addressed
    CAT: Easily implemented, but degrades efficiency
    CCT: Easily implemented, but degrades efficiency
References
Davey, T. C., Pommerich, M., & Thompson, T. D. (1999). Pretesting alongside an
operational adaptive test. Paper presented at the annual meeting of the National
Council on Measurement in Education, Montreal.
Huang, C., Kalohn, J. C., Lin, C., & Spray, J. A. (2000). Estimating Item Parameters from
Classical Indices for Item Pool Development with a Computerized Classification Test
(ACT Research Report). Iowa City, IA: ACT, Inc.
Spray, J. A., & Reckase, M. D. (1996). Comparison of SPRT and sequential Bayes
procedures for classifying examinees into two categories using a computerized test.
Journal of Educational and Behavioral Statistics, 21, 405-414.
Appendix
Basics of Item Response Theory
Introduction
Although mental testing has a long history, it acquired a rigorous statistical
foundation only during the first half of the last century. This introduced the
concepts of parallel test forms, true scores, and reliability. By quantifying
certain aspects of how tests perform, these developments, which have come to
be called classical test theory, allow us to determine whether tests are useful,
accurate, or better or worse than one another. Classical test theory focuses
primarily on test scores rather than on the individual test questions, or items, that
make up those scores. Furthermore, examinees are dealt with in the aggregate, as
members of groups or populations, rather than as individuals.
Some simple examples may make these distinctions clearer.
Reliability is the classical test theory index of how precisely a test measures
examinee performance. Loosely defined, it is the expected correlation between
pairs of scores that would be obtained if each member of a group were tested
twice. A reliability coefficient is dependent on a particular examinee group; a
test can be more reliable with some groups than with others. However, reliability
cannot be attached to any particular member of a group. Surely some examinees
are measured more precisely than others are, but reliability does not recognize
these differences.
The same is true of the classical measure of item difficulty. The item
difficulty index, or p-value, is simply the proportion of examinees in a given
population who would be expected to answer an item correctly. An item with a
p-value of .60 is expected to be answered correctly by roughly 60% of the
examinees who attempt it. But this is not to say that 60% of any group of
examinees will answer correctly. Clearly, sixth-grade and first-grade students
will approach the same problem with differing degrees of success. So the
p-value also depends on a particular reference population. Neither can the
p-value be attached to any particular examinee. The difficulty index is averaged
over all examinees in a population. But brighter examinees obviously answer
any item correctly more often than less able examinees. Again, classical test
theory has few answers to offer.
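As a concrete illustration of how the p-value is tied to a group, the sketch below computes classical difficulty indices from a small, entirely hypothetical 0/1 response matrix; the same items yield different p-values for a more able subgroup of the same examinees.

```python
import numpy as np

# Hypothetical scored responses (1 = correct, 0 = incorrect);
# rows are examinees, columns are items.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
])

# The classical difficulty index (p-value) is simply the proportion of this
# group answering each item correctly.
p_values = responses.mean(axis=0)
print(p_values)  # [0.6 0.6 0.2 1.0]

# The same items appear easier when computed only for higher-scoring examinees.
able_subgroup = responses[responses.sum(axis=1) >= 3]
print(able_subgroup.mean(axis=0))  # [1.0 1.0 0.5 1.0]
```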
Figure A.1. Item Response Function (probability of a correct response plotted
against latent ability, θ).
Several features of this function are notable. The first is that it is continually
(monotonically) increasing, meaning that the probability of a correct response
increases uniformly with the latent ability level of the examinee. Although this
is not the case with every item, it's true often enough to make the assumption
tenable. Second, the function asymptotes at 1.0 on the upper end and .20 on the
lower. The upper asymptote implies that very capable examinees will always
answer correctly. Again, this does not hold uniformly, but it is generally
reasonable to assume. The nonzero lower asymptote indicates that even very
low-performing examinees have some probability of a correct response. This is
characteristic of multiple-choice items, which can be answered correctly by
chance guessing. Finally, the function is seen to rise at different rates across
different latent ability levels. An item is said to be discriminating over regions
where the response function ascends steeply. This means that small differences
in latent ability map to large differences in the probability of a correct response.
A discriminating item is able to distinguish between two examinees who differ
little in latent ability.
Response functions, also known as item characteristic curves or ICCs, are
usually modeled as logistic ogives, which take the form
P(u_j = 1 \mid \theta) = P_j(\theta) = c_j + \frac{1 - c_j}{1 + \exp\{-1.7\, a_j(\theta - b_j)\}}          (A.1)

where the examinee parameter, θ, indexes the latent ability of the examinee,
with higher values indicating higher performance. Each of the item parameters,
a_j, b_j, and c_j, controls some aspect of the item response function's shape. The
attributes of these item parameters are described next.
Item Parameters
Difficulty (b)
The b parameter is often considered the IRT analogue of the classical p-value as
an index of item difficulty. The b value sets the location of the ICC's inflection
point, or the point at which the curve rises most steeply. Easier items have lower
b values and response functions that are generally shifted left along the θ scale.
Probabilities rise to nearly one even at lower latent ability levels. Conversely,
difficult items have larger b values and response functions that are shifted to the
right.
Discrimination (a)
The a parameter dictates how steeply the ICC rises at its point of maximum
discrimination, at θ = b. Larger a values produce the steeper response
functions associated with more discriminating items.
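A minimal Python sketch of Equation A.1; the parameter values are illustrative only. The c_j parameter (the nonzero lower asymptote seen in Figure A.1) fixes the probability of a correct response for very low-ability examinees, as with guessing on a multiple-choice item.

```python
import numpy as np

def three_pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic item response function (Equation A.1).

    theta: latent ability
    a: discrimination (steepness at the inflection point, theta = b)
    b: difficulty (location of the inflection point)
    c: lower asymptote (chance of a correct response for very low theta)
    D: the 1.7 scaling constant used in Equation A.1
    """
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

theta = np.linspace(-3, 3, 7)
# An ICC like the one in Figure A.1, with a lower asymptote of .20.
print(three_pl(theta, a=1.2, b=0.0, c=0.2))
```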
Likelihood Function

Consider a simple two-item test. The vector of item responses, u, can take one of
four possible patterns:

u = <1,1>,   u = <0,0>,   u = <1,0>,   u = <0,1>          (A.2)
From the response functions for these two items, we know the probabilities
of a correct answer to each item, given that the latent ability level θ takes on
some value t. These are denoted as P_1(t) and P_2(t). The other important
probabilities are those of incorrect answers to these items. Because an answer
must be either right or wrong, the probability of an incorrect answer is the
complement of the probability of a correct answer. Notationally, Q_i(t) = 1 − P_i(t).
Then, under the circumstances outlined later, the joint probability of a series of
responses can be computed as the product of the probabilities of the individual
responses. In the simple example:
Prob(u = <1,1> | θ = t) = P_1(t) P_2(t)
Prob(u = <0,0> | θ = t) = Q_1(t) Q_2(t)
Prob(u = <1,0> | θ = t) = P_1(t) Q_2(t)
Prob(u = <0,1> | θ = t) = Q_1(t) P_2(t)

A notational trick allows this to be extended to tests of any length n as:

\mathrm{Prob}(\mathbf{u} \mid \theta = t) = L(t) = \prod_{i=1}^{n} P_i(t)^{u_i}\, Q_i(t)^{1 - u_i}

The P_i(t) terms are included in the product when u_i = 1, while the Q_i(t) terms are
included when u_i = 0.
Figure A.2 shows that likelihood functions provide a means for estimating
the latent ability of examinees based on their pattern of responses across a test.
The graph sketches the likelihood functions for the four possible response
patterns on the simple two-item test. Very proficient examinees are most likely to
produce the (1,1) pattern, while examinees of low latent ability will probably get
both items wrong. Examinees with middle-range true proficiencies
are about equally likely to answer in any of the four possible ways.
Figure A.2. Likelihood Functions for Four Response Patterns (likelihoods of
u = <1,1>, <1,0>, <0,1>, and <0,0> plotted against latent ability, θ).
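A minimal sketch of these likelihood computations for the two-item example, using the product form given above; the item parameters are made up for the illustration, and the grid search is only a crude stand-in for proper maximum likelihood estimation.

```python
import numpy as np

def three_pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def likelihood(t, u, a, b, c):
    """L(t): product of P_i(t) for correct responses and Q_i(t) for incorrect ones."""
    p = three_pl(t, a, b, c)
    return np.prod(np.where(u == 1, p, 1.0 - p))

# Hypothetical two-item test.
a = np.array([1.0, 1.5])
b = np.array([-0.5, 0.5])
c = np.array([0.2, 0.2])

grid = np.linspace(-3, 3, 121)
for pattern in ([1, 1], [0, 0], [1, 0], [0, 1]):
    u = np.array(pattern)
    L = np.array([likelihood(t, u, a, b, c) for t in grid])
    # The grid point with the largest likelihood is a rough ability estimate;
    # for (1,1) and (0,0) the likelihood keeps rising toward the ends of the scale.
    print(pattern, "likelihood largest near theta =", round(float(grid[np.argmax(L)]), 2))
```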
Figure A.3. Item Information Function (item information plotted against latent
ability, θ).
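Figure A.3 plots an item information function, which indicates how much an item contributes to measurement precision at each ability level. Below is a minimal sketch assuming the standard three-parameter logistic information formula (not derived here) and the same illustrative parameter values as before.

```python
import numpy as np

def three_pl(theta, a, b, c, D=1.7):
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c, D=1.7):
    """Standard 3PL item information (assumed form):
    I(theta) = D^2 a^2 * (Q/P) * ((P - c) / (1 - c))^2."""
    p = three_pl(theta, a, b, c, D)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

theta = np.linspace(-3, 3, 61)
info = item_information(theta, a=1.2, b=0.0, c=0.2)
# Information peaks slightly above b when c > 0, as in Figure A.3.
print(round(float(theta[np.argmax(info)]), 2))
```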
Model Assumptions
Like most models, item response theory makes a number of assumptions to
simplify the very complicated interactions between examinees and test items. It
cannot be reasonably argued that any of these assumptions are strictly true.
However, it's not necessary for a model to conform exactly with reality in order
to be useful. Indeed, most models are useful precisely because they offer
idealized and therefore clearer descriptions of what are, in fact, messy and
complex situations. It is only necessary that the predictions and inferences
drawn from a model prove accurate and valid enough to be valuable. The
importance of assumptions lies in the fact that predictions and inferences will
tend to be accurate to the extent that the model fits real data. Fortunately,
decades of experience in applying IRT in a variety of contexts and situations
have generally revealed that the assumptions made are true enough, often
enough.
Three major assumptions are generally identified. These are described and
their importance evaluated in turn.
Summary
The model assumptions, along with other concepts introduced here, such as item
response functions, likelihood functions, and information functions, are the basic
elements of IRT. They can be used to handle many complex measurement
problems, not the least of which is adaptive computer-based testing. Further
information on IRT can be found in Lord (1980) and in Hambleton and
Swaminathan (1985). Discussion of how IRT is used within computer-based
testing is included throughout this text.
References
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's
ability. In F. M. Lord & M. R. Novick, Statistical Theories of Mental Test Scores.
Reading, MA: Addison-Wesley.
Hambleton, R. K., & Swaminathan, H. (1985). Item Response Theory: Principles and
Applications. Boston: Kluwer-Nijhoff.
Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems.
Hillsdale, NJ: Lawrence Erlbaum.
Index
Dual platform, 29-30, 98-99, 101, 112, 124, 144, 157-158, 160, 194, 200
EAP (expected a posteriori), 129, 131-132
Examinee volume, 98, 101, 106, 112, 139, 143-144, 157, 160, 194, 200
Exposure control, 16, 42, 97, 133-134, 137-138, 155, 157, 166, 183-184, 186-189, 204
Exposure rate, 30, 104, 122-123, 155, 186-188, 192
Goodness-of-fit, 175
Hardware, 2, 13-15, 27, 30, 65, 143, 205
High-stakes, 4, 16, 22, 95, 99, 132, 134, 153, 155, 198
Indifference region, 162, 164, 183
Initial development, 98-99, 101, 112, 139, 143-144, 157, 158, 160, 200, 201
Innovative item types, 1, 8-9, 12, 14, 29, 35, 37, 47, 49, 66, 70-72, 77, 79, 82-86, 101, 194, 203, 205
Item enemy, 110
Item exposure control, 20, 97, 145, 158, 160-161, 166, 184, 186-188, 200, 204
Item exposure rate, 104, 109, 123, 137, 186, 192
Item format, 9-10, 29, 72-75, 85, 101-102, 140, 173
Item information, 94, 126, 215
Item pool, 11-12, 15, 18-24, 26, 29-30, 93, 95, 98-99, 101, 102, 104, 107-114, 121-123, 126-129, 132-133, 135-137, 139-141, 144, 149, 153-161, 166, 169-175, 177, 179, 182, 184, 186, 188-192, 194-201, 204
IRT (item response theory), 11-12, 22, 25-26, 28, 92, 93-95, 97, 107-112, 121, 126, 128-130, 139, 140-141, 144-145, 154, 157, 160-162, 172-173, 175-177, 182, 196-197, 201, 204, 211-212, 215-218
Item set, 109, 114, 171, 200
Item statistics, 10, 18, 19, 22-23, 28, 30, 94, 98-99, 102, 110, 172, 175, 196, 198, 204
Item/test security, 101, 112, 144, 160
Latent ability, 11, 126, 129-131, 133-134, 154-156, 161-162, 174, 178-179, 182, 184, 211-217
Level of interactivity, 9, 72, 80-82, 85
Likelihood function, 130-131, 146, 148, 163, 174, 212-214, 217, 218
MAP (maximum a posteriori), 129, 131-132
Mastery testing, 153-154
Maximum information, 127-129, 145, 179
Maximum likelihood estimation, 129
Maximum posterior precision, 127-128
Media inclusion, 9, 72, 77, 85
Mental model, 3, 41-42
Menu approach, 136
MLE, 129-130, 132
Navigation, 6-7, 38, 40, 48, 53-61, 66, 67
Newton-Raphson algorithm, 149
Optimal selection method, 136
Passing score, 10, 94, 106-108, 123, 154-157, 159, 161, 164, 174, 179, 184, 186, 190, 195, 196, 198-199
Point-biserial correlation coefficient, 23, 28, 112, 175, 196
Posterior ability distribution, 128
Pretest item calibration, 191
Pretest items, 21, 24, 99, 101-102, 110, 112, 139, 140-141, 144, 157-158, 160, 170, 190-192, 194, 201-202