3.4. Validity, Reliability and Fairness
A test may consist of a single item or a combination of items. Regardless of the number of items, each one should possess certain characteristics. In addition to good items, however, the test as a whole should have certain qualities. The following are the most important characteristics of a good test:
Validity
One important consideration when conducting an assessment is how well its results will serve the purpose for which it was intended. Answering this question is the basis of validity in assessment. Validity is the primary requirement of any assessment: a test that is not valid has no value, because its results cannot be put to use. The validity of an assessment concerns what it is intended to measure and how well it measures it. For instance, an educator cannot determine how conversant a student is in a particular knowledge area without conducting an evaluation test. If the assessment is conducted but its results do not measure what was intended, the educator cannot accurately determine the students' strengths, and may struggle to judge whether a student is ready for a higher level of instruction.
Validity is thus a prerequisite for an assessment to be good. An assessment is valid if it serves the purpose for which it was designed. If a pen is good-looking but does not write, it does not serve the purpose for which it is meant. Similarly, if an intelligence test does not measure intelligence, it does not serve its purpose. Any measuring instrument, therefore, is valid to the extent that it measures what it is supposed to measure; simply put, a valid test measures what it is supposed to measure. Validity is also a relative term: a test may be highly valid for measuring one trait but completely invalid for measuring another.
There are as many types of validity as there are purposes of evaluation. For instance, a teacher may want to measure the academic achievement of his students after completion of a course in a particular subject, such as Economics. He will develop an assessment whose items reflect the knowledge and skills in that area of content; such an assessment is said to have content validity. Content validity is subjective in nature, and it is estimated by the judgment of subject experts and test specialists. Similarly, a researcher may wish to develop an assessment of creativity to suit the requirements of a specific research problem. To ensure that the assessment measures creativity, he may correlate the scores on his test with the scores on an existing test of the same trait, taken as a concurrently available criterion. A high positive correlation between the two tests indicates concurrent validity. Likewise, an engineering college or any other institution of professional education gives an admission test to candidates seeking admission. The purpose of such an assessment is to select the candidates most likely to succeed in the examinations conducted by the college after completion of the course; in other words, the purpose of the test is to predict future success. Such an assessment should therefore have what we call predictive validity. Predictive validity is established by correlating the scores on the admission test (the predictor variable) with the scores on the test administered after completion of the course (the criterion variable) for which the admission test was given.
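The correlation described here can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the score lists are hypothetical, and the coefficient is the standard Pearson product-moment correlation between predictor and criterion scores.

```python
# Sketch: predictive validity as the Pearson correlation between
# admission-test scores (predictor) and end-of-course scores (criterion).
# All score data below are hypothetical illustration values.

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

admission = [55, 62, 70, 48, 80, 66]   # predictor variable (hypothetical)
course    = [58, 65, 74, 50, 85, 63]   # criterion variable (hypothetical)

validity_coefficient = pearson_r(admission, course)
```

A coefficient near +1 would suggest the admission test predicts later success well; a value near 0 would suggest it has little predictive validity for that criterion.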
Still another purpose of an evaluation may be to assess some psychological trait such as reasoning, imagination or anxiety. The degree to which an assessment measures the underlying psychological trait is an indicator of its construct validity. A construct is defined as a basic psychological trait that is not directly observable.
Reliability
In an achievement test, reliability refers to how consistently the assessment produces the same results on repeated measurement. A reliable assessment is one whose outcomes are trustworthy. For an achievement test to be considered accurate and valid, it must be consistent: it must measure what it is intended to measure at its true value. The degree to which a test is free from measurement error is thus one characteristic of a good achievement test. When a test is repeated, if the value obtained is close to the one initially obtained, the test is said to be reliable.
Reliability is one of the most important elements of a quality assessment. It refers to the assessment's consistency over repeated trials: the extent to which the results are consistent when the test is administered more than once to the same sample with a reasonable time gap. Simply put, a test that gives the same result on different occasions is reliable. For example, if an assessment yields almost the same scores for the same examinees on two different occasions, it is highly reliable.
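This idea of "almost the same scores on two occasions" can be quantified as a correlation between the two sets of scores. The sketch below uses hypothetical scores for six examinees; the helper is the ordinary Pearson correlation.

```python
# Sketch: quantifying consistency across two administrations of the
# same test to the same group. Score lists are hypothetical.

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

first_attempt  = [40, 55, 63, 71, 48, 80]  # occasion 1 (hypothetical)
second_attempt = [42, 53, 66, 70, 50, 78]  # occasion 2 (hypothetical)

reliability = pearson_r(first_attempt, second_attempt)
```

Because examinees keep nearly the same rank order and similar scores across the two occasions, the coefficient comes out close to 1, indicating high reliability.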
There are several methods of determining the reliability of an assessment. The first is the test-retest method (stability over time), in which the same test is administered on two different occasions, with a short interval of time, to the same group. The second is the parallel-forms method (stability over the sample of items), in which two forms of a test covering the same content, with equal numbers of items and the same item difficulty levels, are used. Parallel-forms reliability is obtained by administering the two versions (both of which must contain items that probe the same construct, skill or knowledge base) to the same group of individuals. While the time gap between the two administrations should be short, it must be long enough that examinees' scores are not affected by fatigue. The third is the split-half method (stability or homogeneity of items), in which reliability is determined from a single administration of a test. The test is divided into two equivalent halves, either randomly or between even- and odd-numbered items, and the scores on one half are correlated with the scores on the other half. If this correlation is not high, the test is not consistent from beginning to end. The last is the inter-rater method (stability over scorers), in which reliability is determined by the consensus among raters: two or more persons independently score the same set of test papers. Inter-rater reliability provides a measure of the dependability or consistency of scores that might be expected across raters.
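The split-half method described above can be sketched concretely. This is an illustration under assumed data: the item-score matrix is hypothetical, the halves are formed from odd- and even-numbered items, and the half-test correlation is stepped up to full-test length with the Spearman-Brown prophecy formula, the usual correction applied with this method.

```python
# Sketch: split-half reliability from a single test administration.
# Each row of `scores` holds one student's item scores (1 = correct,
# 0 = incorrect) on a hypothetical 8-item test.

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

scores = [  # hypothetical item scores for 6 students
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, 1],
]

# Split between odd- and even-numbered items.
odd_totals  = [sum(row[0::2]) for row in scores]  # items 1, 3, 5, 7
even_totals = [sum(row[1::2]) for row in scores]  # items 2, 4, 6, 8

r_half = pearson_r(odd_totals, even_totals)

# Spearman-Brown correction: estimated reliability of the full-length
# test from the correlation between its two halves.
reliability = 2 * r_half / (1 + r_half)
```

Note that the corrected coefficient is always at least as large as the half-test correlation, because a longer test (here, both halves together) is more reliable than either half alone.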
Fairness
A fair assessment is one that provides all students an equal opportunity to demonstrate achievement, is transparent about learning expectations and the criteria for judging student performance, and yields unbiased scores (Tierney, 2013). We want to allow students to show us what they have learned from instruction. Fair assessments are unbiased and non-discriminatory, uninfluenced by irrelevant or subjective factors; that is, neither the assessment task nor the scoring is differentially affected by race, gender, ethnic background, handicapping conditions or other factors unrelated to what is being assessed. Key components of fairness include:
- Transparency: student knowledge of learning targets and assessments
- Opportunity to learn
- Prerequisite knowledge and skills
- Avoiding student stereotyping
- Avoiding bias in assessment tasks and procedures
- Accommodating special needs