Lecture Notes On Characteristics of Tests
IN BASIC SCHOOLS
EBS 234
UNIT 3
Characteristics of Tests
Definition: Nitko (1996, p. 36) defined validity as the “soundness of the interpretations and use of
students’ assessment results”. Validity emphasizes the interpretations and use of the results and not
the test instrument. Evidence needs to be provided that the interpretations and use are appropriate.
Nature of validity:
In using the term validity in relation to testing and assessment, five points have to be borne in
mind.
1. Validity refers to the appropriateness of the interpretations of the results of an assessment
procedure for a group of individuals. It does not refer to the procedure or instrument itself.
2. Validity is a matter of degree. Results have different degrees of validity for different purposes and
for different situations. Assessment results may have high, moderate or low validity.
3. Validity is always specific to some particular use or interpretation. No assessment is valid for all
purposes.
4. Validity is a unitary concept that is based on various kinds of evidence.
5. Validity involves an overall evaluative judgment. Several types of validity evidence should be
studied and combined.
There are four principles that help a test user/giver to decide the degree to which his/her
assessment results are valid.
1. The interpretations (meanings) given to students’ assessment results are valid only to the degree
that evidence can be produced to support their appropriateness.
2. The uses made of assessment results are valid only to the degree that evidence can be produced to
support their appropriateness and correctness.
3. The interpretations and uses of assessment results are valid only when the educational and social
values implied by them are appropriate.
4. The interpretations and uses made of assessment results are valid only when the consequences of
these interpretations and uses are consistent with appropriate values.
Types of validity evidence
1. Content-related evidence
This type of evidence refers to the content representativeness and relevance of tasks or items on an
instrument. Judgments of content representativeness focus on whether the assessment tasks are a
representative sample from a larger domain of performance. Judgments of content relevance focus
on whether the assessment tasks are included in the test user’s domain definition when
standardized tests are used.
Content-related evidence answers questions like:
i. How well do the assessment tasks represent the domain of important content?
ii. How well do the assessment tasks represent the curriculum as defined?
iii. How well do the assessment tasks reflect current thinking about what should be taught and
assessed?
iv. Are the assessment tasks worthy of being learned?
To obtain answers to these questions, a description of the curriculum and the content to be learned (or
already learned) is obtained. Each assessment task is checked to see if it matches important content and
learning outcomes. Each assessment task is rated for its relevance, importance, accuracy and
meaningfulness.
One way to ascertain content-related validity is to inspect the table of specification which is a two-way
chart showing the content coverage and the instructional objectives to be measured.
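For illustration only, a table of specifications for a hypothetical 20-item test might look like the one
below (the content areas, objective levels and item counts are invented for this example):

Content area      Knowledge   Application   Analysis   Total
Fractions             3            2            1         6
Decimals              2            2            1         5
Percentages           3            4            2         9
Total                 8            8            4        20

Inspecting such a chart shows whether the items cover the intended content areas and objectives in the
planned proportions.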
2. Criterion-related evidence
This type of evidence pertains to the empirical technique of studying the relationship between the test
scores or some other measures (predictors) and some independent external measures (criteria) such as
intelligence test scores and university grade point average. Criterion-related evidence answers the
question: How well can the results of an assessment be used to infer or predict an individual's standing
on one or more outcomes other than the assessment procedure itself? The outcome is called the
criterion.
There are two types of criterion-related evidence. These are concurrent validity and predictive
validity.
Concurrent validity evidence refers to the extent to which individuals’ current status on a criterion can
be estimated from their current performance on an assessment instrument. For concurrent validity,
data are collected at approximately the same time and the purpose is to substitute the assessment result
for the score of a related variable, e.g. a test of swimming ability versus actual swimming performance scored at about the same time.
Predictive validity evidence refers to the extent to which individuals’ future performance on a criterion can
be predicted from their prior performance on an assessment instrument. For predictive validity, data
are collected at different times. Scores on the predictor variable are collected prior to the scores on the
criterion variable. The purpose is to predict the future performance of a criterion variable. e.g. Using
WASSCE results to predict the first year GPA in the University of Cape Coast.
Criterion-related validation is done by computing the coefficient of correlation between the assessment
result and the criterion. The correlation coefficient is a statistical index that quantifies the degree of
relationship between the scores from one assessment and the scores from another. This coefficient is
often called the validity coefficient and takes values from –1.0 to +1.0.
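As a minimal sketch of how such a validity coefficient could be computed (the scores below are
hypothetical and the helper function is written here only for illustration), the Pearson correlation
between predictor scores and criterion scores can be calculated as follows:

def pearson_r(x, y):
    # Pearson correlation between two lists of scores of equal length
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sy = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

predictor = [45, 52, 60, 63, 70, 75, 78, 82]            # e.g. entrance examination scores
criterion = [2.1, 2.4, 2.6, 3.0, 2.9, 3.3, 3.4, 3.6]    # e.g. first-year GPA of the same students
print(round(pearson_r(predictor, criterion), 2))        # the validity coefficient

A value close to +1.0 would indicate that the predictor scores can be used with confidence to estimate
standing on the criterion.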
Expectancy tables can also be used for validation. An expectancy table is a two-way table that allows
one to say how likely it is for a person with a specific assessment result to attain each criterion score
level.
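For example, a hypothetical expectancy table relating entrance test scores to first-year GPA levels might
look like the one below, where each cell shows the percentage of students in a score band who attained
each GPA level (all figures are invented for this example):

Entrance test score     Low GPA     Average GPA     High GPA
70 and above               5%           25%            70%
50 - 69                   20%           55%            25%
Below 50                  60%           30%            10%

A student scoring 72 on the entrance test would therefore be expected to have about a 70% chance of
obtaining a high GPA.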
3. Construct-related evidence: This type of evidence refers to how well the assessment results can be
interpreted as reflecting an individual’s status regarding an educational or psychological trait,
attribute or mental process. Examples of constructs are mathematical reasoning, reading
comprehension, creativity, honesty and sociability.
14. Fear of the assessment situation. Students can be frightened by the assessment situation and are
unable to perform normally. This reduces their actual level of performance and consequently,
lowers validity.
Test Reliability
Definition
Reliability is the degree of consistency of assessment results. It is the degree to which assessment
results are the same when (1) the same tasks are completed on two different occasions (2) different but
equivalent tasks are completed on the same or different occasions, and (3) two or more raters mark
performance on the same tasks.
Points to note when applying the concept of reliability to testing and assessment.
i. Reliability refers to the results obtained with an assessment instrument and not to the instrument
itself.
ii. An estimate of reliability refers to a particular type of consistency.
iii. Reliability is a necessary condition but not a sufficient condition for validity.
iv. Reliability is primarily statistical. It is determined by the reliability coefficient, which is defined
as a correlation coefficient that indicates the degree of relationship between two sets of scores
intended to be measures of the same characteristic. It ranges from 0.0 - 1.0
Definition of terms:
Obtained (Observed) score: Actual scores obtained in a test or assessment.
Error score: The amount of error in an obtained score.
True score: The difference between the obtained and the error scores. It is the portion of the observed
score that is not affected by random error. An estimate of the true score of a student is the mean score
obtained after repeated assessments under the same conditions.
X = T + E (observed score = true score + error score)
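For example, if a student’s true score is 100 and the error on a particular occasion is −3, the obtained score is X = 100 + (−3) = 97.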
Reliability can be defined theoretically as the ratio of the true score variance to the observed score
variance, i.e. rxx = sT² / sX², where sT² is the true score variance and sX² is the observed score variance.
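For instance, if the true score variance is 16 and the observed score variance is 20, then rxx = 16/20 = 0.80.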
Standard error of measurement: It is a measure of the variation within individuals on a test. It is an
estimate of the standard deviation of the errors of measurement. It is obtained by the formula
SEM = Sx√(1 − rxx), where Sx is the standard deviation of the obtained scores and rxx is the reliability
coefficient. For example, given that rxx = 0.8 and Sx = 4.0,
SEM = 4√(1 − 0.8) = 4√0.2 = 4 × 0.447 ≈ 1.79
[Figure: number lines of scores (about 25 to 36) with bands drawn around observed scores such as 35 and 30; some bands overlap while others do not.]
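As a minimal sketch of the idea behind the diagram (using the hypothetical standard deviation and
reliability from the example above), a band of about one SEM can be drawn around each observed score;
when two bands do not overlap, the difference between the scores is unlikely to be due to measurement
error alone:

def sem(sd, reliability):
    # standard error of measurement: SD x sqrt(1 - reliability)
    return sd * (1 - reliability) ** 0.5

def band(observed, sd, reliability):
    # lower and upper limits one SEM around an observed score
    e = sem(sd, reliability)
    return observed - e, observed + e

print(round(sem(4.0, 0.8), 2))    # about 1.79, as computed above
print(band(35, 4.0, 0.8))         # about (33.2, 36.8)
print(band(30, 4.0, 0.8))         # about (28.2, 31.8), which does not overlap the band around 35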
Reliability coefficient: A correlation coefficient that indicates the degree of relationship between two
sets of scores intended to be measures of the same characteristic (e.g. correlation between scores
assigned by two different raters or scores obtained from administration of two forms of a test)
Factors influencing reliability
1. Test length. Longer tests give more reliable scores. A test consisting of 40 items will give a more
reliable score than a test consisting of 25 items. Wherever practicable, use more items.
2. Group variability. The more heterogeneous the group, the higher the reliability. The narrower the
range of a group’s ability, the lower the reliability. Differentiate among students. Use items that
differentiate the best students from the less able students.
3. Difficulty of items. Too difficult or too easy items produce little variation in the test scores. This
in turn lowers reliability. The difficulty of the assessment tasks should be matched to the ability
level of the students.
4. Scoring objectivity. Subjectively scored items introduce scorer inconsistency, which lowers reliability. More objectively scored
assessment results are more reliable. For subjectively-scored items, multiple markers are
preferred.
5. Speed. Tests in which most students do not complete the items because of inadequate time allocation
result in lower reliability. Sufficient time should be provided to students to respond to the items.
6. Sole marking. Using multiple markers improves the reliability of the assessment results. A single
person grading may lead to low reliability especially of essay tests, term papers, and performances.
Averaging the results of several markers increases reliability.
7. Testing conditions. Where test administrators do not adhere strictly to uniform test regulations and
practices, students’ scores may not represent their actual level of performance, and this tends to
reduce reliability. This issue is of particular concern when the test-retest method of estimating
reliability is used.
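As a minimal sketch (the scores below are hypothetical), a test-retest reliability coefficient is simply
the correlation between the scores obtained from two administrations of the same test:

import numpy as np

first_admin  = [12, 15, 18, 20, 22, 25, 28, 30]    # scores on the first administration
second_admin = [13, 14, 19, 21, 20, 26, 27, 31]    # scores of the same students on the second administration

reliability = float(np.corrcoef(first_admin, second_admin)[0, 1])
print(round(reliability, 2))                        # the test-retest reliability estimate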
Sample Test Items
1. The process of assigning numbers to the attributes or traits possessed by persons according to specific
rules is
A. assessment.
B. evaluation.
C. measurement.
D. test.
5. One role of the Ghanaian teacher in the implementation of the continuous assessment programme is
that the teacher must
A. concentrate on the cognitive domain objectives.
B. constantly evaluate the assessment programme.
C. make an end-of-year time table for assessments.
D. provide information to place students in courses.
6. One purpose of the school-based assessment (SBA) in Ghana is to provide teachers with
A. a list of objectives in constructing assessment items.
B. guidance in dealing with social and psychological problems.
C. sources of motivation for directing students’ learning.
D. standards of achievement in each class of the school system.
9. Learning objectives are important for classroom assessment because they help teachers to
A. assess students’ performances through the knowledge of specific outcomes.
B. design appropriate assessment procedures based on unknown outcomes.
C. evaluate existing assessment instruments when specific outcomes are known.
D. obtain information for judging the reliability of assessment procedures.
10. In the cognitive domain classification of educational objectives, the ability to put parts together to form
a new whole is referred to as
A. analysis.
B. application.
C. evaluation.
D. synthesis.
12. One of the principles for deciding on the degree to which an assessment result is valid is that evidence
can be produced to support the
A. appropriateness of the test instrument.
B. appropriateness of educational values.
C. consequences of the uses of the test items.
D. content-relevance of the test items.
14. Which of the following factors does NOT decrease the degree to which a test is valid for a particular
purpose?
A. Ambiguous statements in tasks.
B. Difficult reading vocabulary.
C. Unclear directions to students.
D. Unidentifiable patterns of answers.
15. A small value of the standard error of measurement indicates that the
A. observed score is low.
B. error score is positive.
C. reliability is high.
D. reliability is low.
16. Amina obtained 97 in a Measurement & Evaluation quiz. It is known that the error score for the
quiz was −3.0. An estimate of her true score is
A. 100
B. 97
C. 94
D. 3
1. Explain the following terms as used in educational assessment to a nonprofessional teacher
i. Educational goals
ii. Educational outcomes
iii. Learning outcomes
2. Assessment is done to help take decisions on students, curricula, programmes and
educational policy. Discuss any four (4) decisions that can be taken on the variables
listed.