1.0 Brief Overview of Educational Assessment
Constructs are abstractions of mental processes that are related to behavior or experience (Murphy & Davidshofer).
“Achievement, defined as the extent to which students can demonstrate mastery of a scholastic
curriculum, is the most frequently assessed construct in the classroom” (Chatterji, p. 27).
A test is a sample of behavior obtained under standardized conditions with established rules for scoring (Murphy & Davidshofer, p. 3; Standards, p. 3). (Please note that the terms “test,” “exam,” and “assessment” will be used
interchangeably.) Tests vary in the precision and detail of their scoring, from the exact scoring
of multiple-choice tests to the more subjective judgment entailed by short answer or essay
tests. Tests may be used to assess maximal performance, such as aptitude or achievement
tests (examinees are asked to “do their best”), or may be used to assess typical performance,
such as an attitude survey or personality inventory (respondents are asked to report their typical feelings or behavior). Norm-referenced tests compare an examinee’s score to the scores of a norm group. Norm groups vary, depending on the purpose of the assessment. For
example, the scores of a child on an intelligence test may be compared to a group of children of
the same age, which would indicate the child’s standing compared to their age group. Well-
known norm-referenced aptitude tests are the Scholastic Assessment Test and the Graduate
Record Examinations. Grading “on a curve” is also a norm-referenced procedure in which the
class itself serves as the norm. So, for example, the top 20% may receive an A, the second 20% a B, and so on.
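For illustration, the following Python sketch implements such a curve (a minimal sketch, assuming the five equal 20% bands of the example above; the function name and scores are invented):

    def grade_on_a_curve(scores):
        # Rank examinees from highest to lowest raw score; each grade depends
        # on standing within the class, not on an absolute standard.
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        bands = "ABCDF"
        grades = [None] * len(scores)
        for rank, i in enumerate(order):
            # Map each rank into one of five equal bands (top 20% = A, etc.).
            grades[i] = bands[min(rank * 5 // len(scores), 4)]
        return grades

    scores = [95, 88, 76, 71, 64]
    print(list(zip(scores, grade_on_a_curve(scores))))
    # [(95, 'A'), (88, 'B'), (76, 'C'), (71, 'D'), (64, 'F')]

Because the grade distribution is fixed in advance, some examinees must receive low grades even when every raw score is high; this is a consequence of letting the class itself serve as the norm.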
In contrast, criterion-referenced tests interpret scores in terms of the content mastered. “The focus is on what test takers can do and what they know, not on how
they compare to others” (Anastasi, p. 102). Many educational and licensing tests are criterion-
referenced tests used to establish knowledge or competency. For example, the
typical academic grade scale (90% to 100% = A, 80% to 89% = B, 70% to 79% = C, 60% to 69% = D, and below 60% = F) is criterion-referenced.
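By contrast with the curve, a criterion-referenced scale depends only on the examinee’s own performance, so every student could in principle earn an A. A minimal Python sketch of the scale above (the function name is an invented convenience):

    def criterion_grade(percent_correct):
        # Compare the score to fixed standards, not to other examinees.
        if percent_correct >= 90:
            return "A"
        elif percent_correct >= 80:
            return "B"
        elif percent_correct >= 70:
            return "C"
        elif percent_correct >= 60:
            return "D"
        return "F"

    print([criterion_grade(p) for p in (95, 84, 72, 65, 40)])
    # ['A', 'B', 'C', 'D', 'F']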
Assessment results inform both formative and summative decisions. Formative decisions are used to shape the instructional design or delivery process, e.g., several of my students need additional training on using the calculator. Summative decisions describe the level of learning attained at the end of instruction, e.g., assigning final grades. High-stakes decisions determine “who will and who will not gain access to employment, education, and . . .” (Sackett, Schmitt, Ellingson, & Kabin, p. 302). Although many classroom assessments are used for low-stakes decision making, the stakes tend to be higher when making summative decisions, in which case, greater attention to the technical quality of the assessment is warranted.
Psychometrics is “the science of the assessment of individual differences”; the term often refers to the quantitative aspects of psychological measurement (Whitney & Shultz, p. 425).
Two long-standing hallmarks of a test’s quality are its validity and reliability.¹ “The validity of a
test concerns what the test measures and how well it does so. It tells us what can be inferred
from test scores” (Anastasi, p. 139). “Reliability refers to the consistency of scores obtained by
the same persons when reexamined . . . .” (Anastasi, p. 109). Because an unreliable test cannot measure anything consistently, it cannot be valid; reliability is a necessary but not sufficient condition for validity.
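For illustration, one widely used index of score consistency is Cronbach’s alpha, computed as k/(k − 1) × (1 − sum of item variances / variance of total scores). The Python sketch below is minimal and the item data are invented (rows are examinees, columns are items):

    def cronbach_alpha(item_scores):
        # item_scores: one row of item scores per examinee.
        k = len(item_scores[0])  # number of items

        def variance(values):
            m = sum(values) / len(values)
            return sum((v - m) ** 2 for v in values) / (len(values) - 1)

        item_vars = [variance([row[j] for row in item_scores]) for j in range(k)]
        total_var = variance([sum(row) for row in item_scores])
        # alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
        return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

    data = [
        [1, 1, 1, 0],  # 1 = item answered correctly, 0 = incorrectly
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 0],
    ]
    print(round(cronbach_alpha(data), 2))  # 0.8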
Tests also differ in the manner in which individuals respond to items. Probably the most
common form of testing is with items that call for a written response (even if only indicating an answer choice). Written items may call for a structured response, in which there is only one correct answer (e.g., multiple choice, true/false), or an open response, in which the length and content of the response varies, as with short answer or essay items. Structured-response items allow more of the achievement domain to be covered in a relatively short period of time (improving reliability
and validity), may be administered to large groups, can be quickly graded, and consist of
objectively correct answers (improving reliability). However, if structured-response items are not well written, e.g., the distractors increase the likelihood of correct guessing, both the reliability and validity of the resulting scores suffer.

¹ Precise speakers and writers would note the following: tests are not valid or invalid; the inferences that we make from their use are valid or invalid. Likewise, tests are not reliable or unreliable. Reliability coefficients are specific to a sample of the population, so the best we can say is that the use of the test is likely reliable, particularly across similar populations. It should also be noted that reliability results for criterion-referenced tests are atypical: scores on a criterion-referenced test show less variability, which limits reliability coefficients.
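The footnote’s final point can be illustrated with a short Python sketch (all scores are invented, and pearson_r is a plain Pearson correlation standing in for a test-retest reliability coefficient): restricting the sample to examinees above a mastery cut reduces score variability, and the coefficient drops even though the testing procedure is unchanged.

    def pearson_r(xs, ys):
        # Pearson correlation, used here as a test-retest reliability estimate.
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    first  = [55, 62, 68, 71, 75, 80, 84, 88, 92, 97]   # first testing
    second = [61, 57, 73, 66, 80, 74, 89, 83, 97, 92]   # retest
    print(round(pearson_r(first, second), 2))           # ~0.92: wide range

    # Keep only examinees at or above an 80-point mastery cut.
    kept = [(x, y) for x, y in zip(first, second) if x >= 80]
    xs, ys = zip(*kept)
    print(round(pearson_r(xs, ys), 2))                  # ~0.77: restricted range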
Although some argue that structured-response items can be written to assess higher-order cognitive skills, they also admit that training and practice in creating such items is required (Anastasi, p. 417). Most agree that open-response items can be created that require higher-level cognitive functioning, providing access to areas of the achievement domain that structured-response items cannot reach. However, because open-response items take longer to answer and to score, assessments consisting of open-response items allow less of the domain to be tested (reducing reliability and validity).
Because Chatterji was a course text, it is not listed below; the references that follow are non-course sources.
References