Lecture Notes On Characteristics of Tests


ASSESSMENT IN BASIC SCHOOLS

EBS 234
UNIT 3
Characteristics of Tests

TEST VALIDITY AND RELIABILITY


Test Validity

Definition: Nitko (1996, p. 36) defined validity as the “soundness of the interpretations and use of
students’ assessment results”. Validity emphasizes the interpretations and use of the results and not
the test instrument. Evidence needs to be provided that the interpretations and use are appropriate.

Nature of validity:
In using the term validity in relation to testing and assessment, five points have to be borne in
mind.
1. Validity refers to the appropriateness of the interpretations of the results of an assessment
procedure for a group of individuals. It does not refer to the procedure or instrument itself.
2. Validity is a matter of degree. Results have different degrees of validity for different purposes and
for different situations. Assessment results may have high, moderate or low validity.
3. Validity is always specific to some particular use or interpretation. No assessment is valid for all
purposes.
4. Validity is a unitary concept that is based on various kinds of evidence.
5. Validity involves an overall evaluative judgment. Several types of validity evidence should be
studied and combined.

Principles for validation

There are four principles that help a test user/giver to decide the degree to which his/her
assessment results are valid.

1. The interpretations (meanings) given to students’ assessment results are valid only to the degree
that evidence can be produced to support their appropriateness.
2. The uses made of assessment results are valid only to the degree that evidence can be produced to
support their appropriateness and correctness.
3. The interpretations and uses of assessment results are valid only when the educational and social
values implied by them are appropriate.
4. The interpretations and uses made of assessment results are valid only when the consequences of
these interpretations and uses are consistent with appropriate values.

Categories of validity evidence

There are three major categories of validity evidence.

1. Content-related evidence
This type of evidence refers to the content representativeness and relevance of tasks or items on an
instrument. Judgments of content representativeness focus on whether the assessment tasks are a
representative sample from a larger domain of performance. Judgments of content relevance focus
on whether the assessment tasks are included in the test user’s domain definition when
standardized tests are used.
Content-related evidence answers questions like:
i. How well do the assessment tasks represent the domain of important content?
ii. How well do the assessment tasks represent the curriculum as defined?
iii. How well do the assessment tasks reflect current thinking about what should be taught and
assessed?
iv. Are the assessment tasks worthy of being learned?
To obtain answers to these questions, a description of the curriculum and the content to be learned (or
already learned) is obtained. Each assessment task is checked to see whether it matches important content and
learning outcomes. Each assessment task is rated for its relevance, importance, accuracy and
meaningfulness.

One way to ascertain content-related validity is to inspect the table of specifications, which is a two-way
chart showing the content coverage and the instructional objectives to be measured.

2. Criterion-related evidence
This type of evidence pertains to the empirical technique of studying the relationship between the test
scores or some other measures (predictors) and some independent external measures (criteria) such as
intelligence test scores and university grade point average. Criterion-related evidence answers the
question: How well can the results of an assessment be used to infer or predict an individual's standing
on one or more outcomes other than the assessment procedure itself? The outcome is called the
criterion.

There are two types of criterion-related evidence. These are concurrent validity and predictive
validity.

Concurrent validity evidence refers to the extent to which individuals’ current status on a criterion can
be estimated from their current performance on an assessment instrument. For concurrent validity,
data are collected at approximately the same time and the purpose is to substitute the assessment result
for the score of a related variable, e.g. a test of swimming ability versus actual swimming that is scored.

Predictive validity evidence refers to the extent to which individuals’ future performance on a criterion can
be predicted from their prior performance on an assessment instrument. For predictive validity, data
are collected at different times. Scores on the predictor variable are collected prior to the scores on the
criterion variable. The purpose is to predict the future performance of a criterion variable. e.g. Using
WASSCE results to predict the first year GPA in the University of Cape Coast.

Criterion-related validation is done by computing the coefficient of correlation between the assessment
result and the criterion. The correlation coefficient is a statistical index that quantifies the degree of
relationship between the scores from one assessment and the scores from another. This coefficient is
often called the validity coefficient and takes values from
–1.0 to +1.0.
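As an illustration only (not part of the original notes), the short Python sketch below computes a validity coefficient as a Pearson correlation between a set of hypothetical predictor scores and criterion scores.

# Sketch: a validity coefficient computed as a Pearson correlation (hypothetical data).
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical data: predictor test scores and first-year GPA as the criterion.
predictor = [65, 72, 80, 58, 90, 75]
criterion = [2.8, 3.0, 3.4, 2.5, 3.8, 3.1]
print(round(pearson(predictor, criterion), 2))  # validity coefficient, between -1.0 and +1.0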

Expectancy tables can also be used for validation. An expectancy table is a two-way table that allows
one to say how likely it is for a person with a specific assessment result to attain each criterion score
level.

An example of an expectancy table

Predictor       Percent of pupils receiving each grade
test score      F      D      C      B      A      Totals
90-99                         20     60     20     100
80-89                  8      33     42     17     100
70-79                  20     33     40     7      100
60-69                  22     44     28     6      100
50-59           6      28     44     22            100
40-49           7      40     33     20            100
30-39           17     42     33     8             100
20-29           25     50     25                   100
10-19           100                                100
The degree of success is defined by a chosen grade, e.g. C or better. A typical question is: What is the
probability that a person with a score of 65 will succeed in this course (i.e. obtain grade C or
better)? The score of 65 lies in the 60-69 class, and for this class 78% (44 + 28 + 6) are successful, so
the person has a 78% chance of success.
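A minimal sketch (not part of the original notes) of how this lookup could be done if the expectancy table were stored as a Python dictionary; only the 60-69 row from the table above is shown, and the grade order is assumed to be F, D, C, B, A.

# Sketch: probability of success (grade C or better) from an expectancy table.
# Each row maps a predictor score band to the percent of pupils receiving each grade.
expectancy = {
    (60, 69): {"F": 0, "D": 22, "C": 44, "B": 28, "A": 6},
    # ... remaining bands omitted for brevity
}

def chance_of_success(score, table, passing=("C", "B", "A")):
    """Percent chance that a pupil with this predictor score obtains a passing grade."""
    for (low, high), grades in table.items():
        if low <= score <= high:
            return sum(grades[g] for g in passing)
    raise ValueError("score outside the table")

print(chance_of_success(65, expectancy))  # 78, i.e. a 78% chance of C or better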

3. Construct-related evidence: This type of evidence refers to how well the assessment results can be
interpreted as reflecting an individual’s status regarding an educational or psychological trait,
attribute or mental process. Examples of constructs are mathematical reasoning, reading
comprehension, creativity, honesty and sociability.

Methods of construct validation


i. Define the domain or tasks to be measured. Specifications must be very well defined so that the
meaning of the construct is clear. Expert judgment is then used to judge the extent to which the
assessment provides a relevant and representative measure of the task domain.
ii. Analyze the mental processes required by the assessment tasks. Examine the assessment tasks or
administer the tasks to individual students and have them “think aloud” as they perform the tasks.
iii. Compare the scores of known groups. Groups that are expected to differ on the construct (e.g.
by age or training) may be given the same assessment tasks and their scores compared.
iv. Correlate the assessment scores with other measures. Measures of similar constructs are expected
to correlate highly, e.g. two assessments of creativity are expected to produce a high
correlation.

Factors affecting validity


1. Unclear directions. Validity is reduced if students do not clearly understand how to respond to the
items, how to record their responses, or how much time is available.
2. Reading vocabulary and sentence structure that are too difficult. These tend to reduce validity because
the assessment may end up measuring reading comprehension, which is not what is intended to be measured.
3. Ambiguous statements in assessment tasks and items. These confuse students and allow
different interpretations, thus reducing validity.
4. Inadequate time limits. Students do not have enough time to respond and may therefore
perform below their actual level of achievement. This reduces validity.
5. Inappropriate level of difficulty of the test items. Items that are too easy or too difficult do not
provide high validity.
6. Poorly constructed test items. These items may provide unintentional clues which may cause
students to perform above their actual level of achievement. This lowers validity.
7. Test items being inappropriate for the outcomes being measured lowers validity.
8. Test being too short. If a test is too short, it does not provide a representative sample of the
performance of interest, and this lowers validity.
9. Improper arrangement of items. Placing difficult items at the beginning of the test may put some
students off and unsettle them, causing them to perform below their actual level of
performance and thus reducing validity.
10. Identifiable pattern of answers. Placing the answers to multiple-choice and true/false items in an
identifiable pattern enables students to guess the correct answers more easily, and this lowers validity.
11. Cheating. When students cheat by copying answers or helping their friends with answers to test
items, validity is reduced.
12. Unreliable scoring. Test items, especially essay tests, that are not scored consistently lower
validity.
13. Student emotional disturbances. These interfere with their performance thus reducing validity.

14. Fear of the assessment situation. Students can be frightened by the assessment situation and are
unable to perform normally. This reduces their actual level of performance and consequently,
lowers validity.

Test Reliability

Definition
Reliability is the degree of consistency of assessment results. It is the degree to which assessment
results are the same when (1) the same tasks are completed on two different occasions (2) different but
equivalent tasks are completed on the same or different occasions, and (3) two or more raters mark
performance on the same tasks.

Points to note when applying the concept of reliability to testing and assessment.
i. Reliability refers to the results obtained with an assessment instrument and not to the instrument
itself.
ii. An estimate of reliability refers to a particular type of consistency.
iii. Reliability is a necessary condition but not a sufficient condition for validity.
iv. Reliability is primarily statistical. It is determined by the reliability coefficient, which is defined
as a correlation coefficient that indicates the degree of relationship between two sets of scores
intended to be measures of the same characteristic. It ranges from 0.0 to 1.0.

Definition of terms:
Obtained (Observed) score: Actual scores obtained in a test or assessment.
Error score: The amount of error in an obtained score.
True score: The difference between the obtained and the error scores. It is the portion of the observed
score that is not affected by random error. An estimate of the true score of a student is the mean score
obtained after repeated assessments under the same conditions.
X=T+E
Reliability can be defined theoretically as the ratio of the true score variance to the observed score
variance, i.e. rxx = st² / sx².
Standard error of measurement: It is a measure of the variation within individuals on a test. It is an
estimate of the standard deviation of the errors of measurement. It is obtained by the formula:
Se = Sx√(1 − rxx), or SEM = SDx × √(1 − reliability coefficient), where Sx or SDx is the standard deviation of
the obtained scores. For example, given that rxx = 0.8 and Sx = 4.0,
SEM = 4√(1 − 0.8) = 4√0.2 = 4 x 0.447 = 1.788
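A minimal sketch of this computation (not part of the original notes), using only the Python standard library:

# Sketch: standard error of measurement from the reliability coefficient.
from math import sqrt

def sem(sd_x, r_xx):
    """Standard error of measurement: SEM = SDx * sqrt(1 - reliability coefficient)."""
    return sd_x * sqrt(1 - r_xx)

print(round(sem(4.0, 0.8), 3))  # about 1.789 (the notes round sqrt(0.2) to 0.447, giving 1.788)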

Interpreting standard errors of measurement


1. It estimates the amount by which a student’s obtained score is likely to deviate from her/his true score,
e.g. SEM = 4 indicates that a student’s obtained score lies about 4 points above or below the true score.
For an obtained score of 75, the true score therefore lies between 71 and 79, and the band 71-79
provides a confidence band for interpreting the obtained score. A small
standard error of measurement indicates high reliability, providing greater confidence that the
obtained score is near the true score.
2. In interpreting the scores of two students, if the ends of the bands do overlap as in Example 1, then
there is no real difference between the two scores. However, if the two bands do not overlap as in
Example 2, there is a real difference between the scores.
[Figure: number lines showing the two pairs of confidence bands, one overlapping and one not.]

Example 1 (SEM = 5): Suppose Grace had 34 and Fiifi 32 in a quiz. The two confidence bands overlap,
so there is no real difference between the scores.

Example 2 (SEM = 2): Suppose George had 38 and Aku 32 in a quiz. The two confidence bands do not
overlap, so there is a real difference between the scores.
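A small sketch (not in the original notes) of the overlap check described above, using the scores from the two examples:

# Sketch: confidence bands (obtained score +/- SEM) and an overlap check.
def band(score, sem):
    return (score - sem, score + sem)

def overlap(b1, b2):
    """True if the two confidence bands overlap (no real difference between the scores)."""
    return b1[0] <= b2[1] and b2[0] <= b1[1]

print(overlap(band(34, 5), band(32, 5)))  # Example 1: True  -> no real difference
print(overlap(band(38, 2), band(32, 2)))  # Example 2: False -> a real difference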

Reliability coefficient: A correlation coefficient that indicates the degree of relationship between two
sets of scores intended to be measures of the same characteristic (e.g. correlation between scores
assigned by two different raters or scores obtained from administration of two forms of a test)

Methods of estimating reliability


1. Test-retest method. This is a measure of the stability of scores over a period of time. The same
test is given to a group of students twice within an interval ranging from several minutes to years.
The scores on the two administrations are correlated and the result is the estimate of the reliability
of the test. The interval should be reasonable, neither too short nor too long.
2. Equivalent forms method. Two test forms, which are alternate or parallel with the same content
and level of difficulty for each item, are administered to the same group of students. The forms
may be given on the same or nearly the same occasion or a time interval will elapse before the
second form is given. The scores on the two administrations are correlated and the result is the
estimate of the reliability of the test.
3. Split-half method. This is a measure of internal consistency. A single test is given to the students.
The test is then divided into two halves for scoring. The two scores for each student are correlated
to obtain the estimate of reliability. The test can be split into two halves in several ways. These
include using (i) odd-even numbered items, and (ii) first half-second half. The Spearman-Brown
prophecy formula is often used to obtain the reliability coefficient. This is given by:

ryy (whole-test reliability) = (2 × correlation between half-test scores) / (1 + correlation between half-test scores)

Suppose the correlation between half-test scores was 0.75. Then

ryy = (2 × 0.75) / (1 + 0.75) = 1.50 / 1.75 = 0.86

(A short computational sketch of this method follows the list below.)

4. Kuder-Richardson method. This is also a measure of internal consistency. A single administration


of the test is used. Kuder-Richardson Formulas 20 and 21 (KR20 & KR21) are used mostly for
dichotomously scored items (i.e. right or wrong). KR20 can be generalized to items that are not scored
simply right or wrong (e.g. attitude scales scored from 5 to 1 on a 5-point scale). Such estimates
are called Coefficient Alpha. The KR21 formula is:

KR21 = [n / (n − 1)] × [1 − x̄(n − x̄) / (n·sx²)]

where n is the number of items, x̄ is the mean of the obtained scores and sx² is the variance of the
obtained scores. (The sketch after this list also illustrates KR21.)
5. Inter-rater reliability. Two raters each score a student’s paper. The two scores for all the students
are correlated. This estimate of reliability is called scorer reliability or inter-rater reliability. It is
an index of the extent to which the raters were consistent in rating the same students.
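As an illustration only (not part of the original notes), the sketch below shows how the split-half/Spearman-Brown and KR21 estimates described in items 3 and 4 could be computed in Python. The item and score data are hypothetical, and statistics.correlation requires Python 3.10 or later.

# Sketch: split-half reliability (with Spearman-Brown correction) and KR21.
from statistics import correlation, mean, pvariance

def spearman_brown(r_half):
    """Whole-test reliability from the correlation between half-test scores."""
    return (2 * r_half) / (1 + r_half)

def kr21(scores, n_items):
    """KR21 = [n/(n-1)] * [1 - mean(n - mean) / (n * variance)] for dichotomously scored items."""
    m, var = mean(scores), pvariance(scores)
    return (n_items / (n_items - 1)) * (1 - (m * (n_items - m)) / (n_items * var))

# Split-half method: hypothetical odd-item and even-item totals for five pupils.
odd_totals = [12, 15, 9, 18, 14]
even_totals = [11, 16, 10, 17, 13]
print(round(spearman_brown(correlation(odd_totals, even_totals)), 2))

print(round(spearman_brown(0.75), 2))  # 0.86, the worked example in item 3

# KR21: hypothetical totals of six pupils on a 20-item test scored right/wrong.
print(round(kr21([14, 11, 17, 9, 15, 12], 20), 2))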

Factors influencing reliability
1. Test length. Longer tests give more reliable scores. A test consisting of 40 items will give a more
reliable score than a test consisting of 25 items. Wherever practicable, use more items.
2. Group variability. The more heterogeneous the group, the higher the reliability. The narrower the
range of a group’s ability, the lower the reliability. Differentiate among students. Use items that
differentiate the best students from the less able students.
3. Difficulty of items. Too difficult or too easy items produce little variation in the test scores. This
in turn lowers reliability. The difficulty of the assessment tasks should be matched to the ability
level of the students.
4. Scoring objectivity. Subjectively scored items introduce scorer variability and result in lower
reliability. More objectively scored assessment results are more reliable. For subjectively scored items,
multiple markers are preferred.
5. Speed. Tests, where most students do not complete the items due to inadequate allocation of time,
result in lower reliability. Sufficient time should be provided to students to respond to the items.
6. Sole marking. Using multiple markers improves the reliability of the assessment results. A single
person grading may lead to low reliability especially of essay tests, term papers, and performances.
Averaging the results of several markers increases reliability.
7. Testing conditions. Where test administrators do not adhere strictly to uniform test regulations and
practices, students’ scores may not represent their actual level of performance, and this tends to
reduce reliability. With the test-retest method of estimating reliability, this issue is of great
concern.

Sample Test Items

1. The process of assigning numbers to the attributes or traits possessed by persons according to specific
rules is
A. assessment.
B. evaluation.
C. measurement.
D. test.

2. Which of the following situations is a norm-referenced interpretation of a test score?


A. Comfort obtained Distinction in her teaching practice.
B. Jane obtained 8 As in the WASSCE.
C. Joseph scored 75% in his Statistics examination.
D. Stephen won the 1st prize in EPS 311 course.

3. The main purpose of formative evaluation is to


A. attain total growth and development of the student.
B. determine extent of achievement of objectives of education.
C. plan the types of traits and behaviours to be assessed.
D. provide feedback about the progress being made in school.

4. One of the characteristics of continuous assessment is that it is


A. apprehensive.
B. diagnostic.
C. selective.
D. systemic.

5. One role of the Ghanaian teacher in the implementation of the continuous assessment programme is
that the teacher must
A. concentrate on the cognitive domain objectives.
B. constantly evaluate the assessment programme.
C. make an end-of-year time table for assessments.
D. provide information to place students in courses.

6. One purpose of the school-based assessment (SBA) in Ghana is to provide teachers with
A. a list of objectives in constructing assessment items.
B. guidance in dealing with social and psychological problems.
C. sources of motivation for directing students’ learning.
D. standards of achievement in each class of the school system.

7. Which of the following constitutes a placement decision in teaching and learning?


A. Acquiring certificates for employment in the world of work.
B. Assigning grades to students as a record of progress and achievement.
C. Grouping individuals for instruction in view of individual differences.
D. Selecting students for award of prizes in the measurement class.

8. One of the general principles of assessment is that


A. users become aware of assessment techniques available.
B. assessment techniques require knowledge about student learning.
C. good assessment techniques must serve the needs of the teachers.
D. good assessments are provided by multiple indicators of performance.

9. Learning objectives are important for classroom assessment because they help teachers to
A. assess students’ performances through the knowledge of specific outcomes.
B. design appropriate assessment procedures based on unknown outcomes.
C. evaluate existing assessment instruments when specific outcomes are known.
D. obtain information for judging the reliability of assessment procedures.

10. In the cognitive domain classification of educational objectives, the ability to put parts together to form
a new whole is referred to as
A. analysis.
B. application
C. evaluation.
D. synthesis.

11. Content-related evidence for the validity of an assessment result is obtained by


A. analyzing mental processes.
B. computing correlation coefficients.
C. examining course objectives.
D. inspecting specification tables.

12. One of the principles for deciding on the degree to which an assessment result is valid is that evidence
can be produced to support the
A. appropriateness of the test instrument.
B. appropriateness of educational values.
C. consequences of the uses of the test items.
D. content-relevance of the test items.

13. One method of obtaining construct-related evidence is to


A. analyze behavioral processes required by assessment tasks.
B. compare scores of groups that differ on the construct.
C. correlate assessment scores of two different constructs.
D. use novice judgment on tasks that have been defined.

14. Which of the following factors does NOT decrease the degree to which a test is valid for a particular
purpose?
A. Ambiguous statements in tasks.
B. Difficult reading vocabulary.
C. Unclear directions to students.
D. Unidentifiable patterns of answers.

15. A small value of the standard error of measurement indicates that the
A. observed score is low.
B. error score is positive.
C. reliability is high.
D. reliability is low.

16. Amina obtained 97 in a Measurement & Evaluation quiz. It is known that the error score for the
quiz was − 3.0. An estimate of her true score is
A. 100
B. 97
C. 94
D. 3

1. Explain the following terms as used in educational assessment to a nonprofessional teacher

i. Educational goals
ii. Educational outcomes
iii. Learning outcomes
2. Assessment is done to help take decisions on students, curricula, programmes and
educational policy. Discuss any four (4) decisions that can be taken on the variables
listed.

3. Differentiate between assessment of learning, assessment as learning and assessment for learning,
giving examples to support your explanation.
4. Given the equation X = T + E, explain the variables involved by giving one example each.

