Lesson 6 Establishing Test Validity and Reliability: Learning Instructional Modules For CPE 105
Learning Objectives
● use procedures and statistical analysis to establish test validity and reliability;
● decide whether a test is valid or reliable; and
● decide which test items are easy and difficult.
At the end of the lesson, you should be able to demonstrate your knowledge and skills
in determining whether the test and its items are valid and reliable. You are considered
successful in this culminating performance task if you have satisfied at least the following
indicators of success:
Task: Use the appropriate procedure in determining test validity and reliability.
Indicator of success: Provided the detailed steps, decisions, and rationale in the use of appropriate validity and reliability measures.

Task: Show the procedure on how to establish test validity and reliability.
Indicator of success: Provided the detailed procedure from the preparation of the instrument, the procedure in pretesting, and the analysis in determining the test's validity and reliability.

Task: Provide accurate results in the analysis of item difficulty and reliability.
Indicator of success: Made the appropriate computations, used software, reported the results, and interpreted the results for the tests of validity and reliability.
To be able to successfully perform this culminating performance task, you should have prepared a test following the proper procedure, with clear learning targets (objectives) and a table of specifications.
Reliability is the consistency of responses to a measure under three conditions: (1) when the same person is retested; (2) when retested on the same or an equivalent measure; and (3) when responses are similar across items that measure the same characteristic. In the first condition, consistent responses are expected when the test is given again to the same participants. In the second condition, reliability is attained if the responses to the test are consistent with the responses to its equivalent form, or to another test that measures the same characteristic, when administered at a different time. In the third condition, there is reliability when the person responds in the same way, or consistently, across items that measure the same characteristic.
There are different factors that affect the reliability of a measure. The reliability of a
measure can be high or low, depending on the following factors:
1. The number of items in a test - The more items a test has, the higher the likelihood of reliability. The probability of obtaining consistent scores is higher because of the larger pool of items.
2. Individual differences of participants - Every participant possesses characteristics that
affect their performance in a test, such as fatigue, concentration, innate ability,
perseverance, and motivation. These individual factors change over time and affect the
consistency of the answers in a test.
3. External environment - The external environment may include room temperature, noise level, depth of instruction, exposure to materials, and quality of instruction, which could cause changes in the responses of examinees to a test.
There are different ways of determining the reliability of a test. The specific kind of reliability will depend on (1) the variable you are measuring, (2) the type of test, and (3) the number of versions of the test. The methods are summarized below; notice that statistical analysis is needed to determine test reliability.
Method in Testing: Test-Retest
How is this reliability done? You have a test and you administer it at one time to a group of examinees. Administer it again at another time to the same group of examinees, with a time interval of not more than six months between the two administrations.
What statistics is used? Correlate the test scores from the first and the second administration. A significant and positive correlation indicates that the test has temporal stability over time.

Method in Testing: Parallel Forms
How is this reliability done? There are two versions of a test, and the items need to measure exactly the same skill. Each test version is called a "form." Administer one form at one time and the other form at another time to the same group of participants. The responses on the two forms should be more or less the same. Parallel forms are applicable if there are two versions of the test. This is usually done when a test is repeatedly used for different groups, such as entrance and licensure examinations, where different versions of the test are given to different groups of examinees.
What statistics is used? Correlate the test results for the first form and the second form. A significant and positive correlation coefficient is expected, indicating that the responses on the two forms are the same or consistent. Pearson r is usually used for this analysis.

Method in Testing: Inter-rater Reliability
How is this reliability done? This procedure is used to determine the consistency of multiple raters when using rating scales and rubrics to judge performance. The reliability here refers to the similar or consistent ratings provided by more than one rater or judge when they use an assessment tool. Inter-rater reliability is applicable when the assessment requires the use of multiple raters.
What statistics is used? A statistical analysis called Kendall's coefficient of concordance (Kendall's W) is used to determine if the ratings provided by multiple raters agree with each other. A significant Kendall's W value indicates that the raters concur or agree with each other in their ratings.
You will notice from the methods above that statistical analysis is required to determine the reliability of a measure. The basis of these statistical analyses is linear regression.
1. Linear regression
Linear regression is demonstrated when you have two measured variables, such as two sets of scores on a test taken at two different times by the same participants. When the two sets of scores are plotted in a graph (with an X-axis and a Y-axis), the points tend to form a straight line, and a line can be fitted to them through linear regression. When a straight line is formed, we can say that there is a correlation between the two sets of scores. The graph is called a scatterplot, and each point in the scatterplot is a respondent with two scores (one for each test). The index of this linear relationship is called a correlation coefficient. When the points in a scatterplot tend to fall close to the line, the correlation is said to be strong. When the trend of the scatterplot is directly proportional (the line rises), the correlation coefficient has a positive value; when the trend is inverse (the line falls), the correlation coefficient has a negative value. The statistical analysis used to determine the correlation coefficient is called the Pearson r. How the Pearson r is obtained is illustrated below.
Suppose that a teacher gave a 20-item spelling test on two-syllable words on Monday and again on Tuesday. The teacher wanted to determine the reliability of the two sets of scores by computing the Pearson r.
Formula:

r = [NΣXY − (ΣX)(ΣY)] / √{[NΣX² − (ΣX)²][NΣY² − (ΣY)²]}

The two sets of scores are laid out in a table with the columns X (Monday test), Y (Tuesday test), X², Y², and XY.

ΣX – add all the X scores (Monday scores)
ΣY – add all the Y scores (Tuesday scores)
X² – square the value of each X score
Y² – square the value of each Y score
XY – multiply each X and Y score
ΣX² – add all the squared values of X
ΣY² – add all the squared values of Y
ΣXY – add all the products of X and Y
N – the number of pairs of scores (examinees)
The value of a correlation coefficient ranges from −1.00 to +1.00. A value of 1.00 or −1.00 indicates a perfect correlation. In tests of reliability, though, we aim for a high positive correlation, which means that there is consistency in the way the students answered the tests they took.
When the value of the correlation coefficient is positive, it means that the higher the scores in
X, the higher the scores in Y. This is called a positive correlation. In the case of the two spelling
scores, a positive correlation is obtained. When the value of the correlation coefficient is negative, it
means that the higher the scores in X, the lower the scores in Y, and vice versa. This is called a
negative correlation. When the same test is administered to the same group of participants, usually a
positive correlation indicates reliability or consistency of the scores.
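To make the computation concrete, here is a minimal Python sketch of the Pearson r formula above. The Monday and Tuesday scores are hypothetical values, not data from the module.

```python
# Pearson r between two administrations of the same test (hypothetical scores)
monday = [15, 18, 12, 20, 17, 14, 19, 16]    # X: Monday spelling scores
tuesday = [14, 19, 13, 20, 16, 15, 18, 17]   # Y: Tuesday spelling scores

n = len(monday)
sum_x = sum(monday)
sum_y = sum(tuesday)
sum_x2 = sum(x ** 2 for x in monday)
sum_y2 = sum(y ** 2 for y in tuesday)
sum_xy = sum(x * y for x, y in zip(monday, tuesday))

# r = [N*ΣXY − (ΣX)(ΣY)] / √{[N*ΣX² − (ΣX)²][N*ΣY² − (ΣY)²]}
numerator = n * sum_xy - sum_x * sum_y
denominator = ((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)) ** 0.5
r = numerator / denominator
print(f"Pearson r = {r:.2f}")  # a high positive value suggests consistent (reliable) scores
```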
Reliability across items that measure the same characteristic (internal consistency) is determined using Cronbach's alpha. The responses are laid out in a table with the columns Student, Item 1 to Item 5, the total for each case (X), Score − Mean, and (Score − Mean)². For example, Student A obtained item scores of 5, 5, 4, 4, and 1, for a total of 19, a Score − Mean of 2.8, and a squared difference of 7.84. The number of items (n), the sum of the item variances (5.2), and the variance of the total scores (5.7) are then substituted in the formula:

Cronbach's α = (n / (n − 1)) × ((variance of total scores − sum of item variances) / variance of total scores)
             = (5 / (5 − 1)) × ((5.7 − 5.2) / 5.7)
Cronbach's α ≈ 0.11
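The same computation can be automated. Below is a minimal Python sketch of Cronbach's α using NumPy; the five-student response matrix is hypothetical, so the resulting value will not match the figures above.

```python
import numpy as np

# Hypothetical item scores: one row per student, one column per item (5 items)
scores = np.array([
    [5, 5, 4, 4, 1],
    [4, 4, 4, 3, 2],
    [3, 3, 2, 2, 1],
    [5, 4, 5, 4, 3],
    [2, 3, 2, 1, 1],
])

n_items = scores.shape[1]
item_variances = scores.var(axis=0, ddof=1)      # variance of each item
total_variance = scores.sum(axis=1).var(ddof=1)  # variance of the total scores

# alpha = (n / (n - 1)) * (1 - sum of item variances / variance of totals)
alpha = (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")
```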
For inter-rater reliability, the ratings given by the three raters are first summed for each demonstration. The mean of these sums of ratings is obtained (mean = 8.4). The mean is then subtracted from each sum of ratings (D), each difference is squared (D²), and the sum of the squared differences is computed (ΣD² = 33.2). These values are substituted in the Kendall's W formula, where m is the number of raters and n is the number of demonstrations rated.
W = 12ΣD² / [m²n(n² − 1)]
  = 12(33.2) / [3²(5)(5² − 1)]
W = 0.37
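The same steps can be expressed in a short Python sketch using NumPy. The ratings matrix below is hypothetical (it is not the raters' table referred to above), so it will not reproduce W = 0.37.

```python
import numpy as np

# Hypothetical ratings: one row per demonstration, one column per rater (3 raters, 5 demonstrations)
ratings = np.array([
    [4, 3, 4],
    [2, 2, 3],
    [5, 4, 5],
    [1, 2, 1],
    [3, 3, 2],
])

m = ratings.shape[1]               # number of raters
n = ratings.shape[0]               # number of demonstrations rated
sums = ratings.sum(axis=1)         # sum of ratings for each demonstration
d = sums - sums.mean()             # deviation of each sum from the mean sum
sum_d2 = (d ** 2).sum()            # ΣD²

# W = 12ΣD² / [m²n(n² − 1)], as in the formula above
w = 12 * sum_d2 / (m ** 2 * n * (n ** 2 - 1))
print(f"Kendall's W = {w:.2f}")    # values near 1 indicate strong agreement among raters
```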
The different types of test validity and the procedures for establishing them are summarized below.

Content Validity
Definition: The items represent the domain being measured.
How it is established: The items are compared with the objectives of the program. The items need to measure the objectives directly (for achievement tests) or the definition of the characteristic (for scales). A reviewer conducts the checking.

Face Validity
Definition: The test is presented well, free of errors, and administered well.
How it is established: The test items and layout are reviewed and tried out on a small group of respondents. A manual for administration can be made as a guide for the test administrator.

Predictive Validity
Definition: The measure should predict a future criterion. An example is an entrance exam predicting the grades of the students after the first semester.
How it is established: A correlation coefficient is obtained where the X-variable is used as the predictor and the Y-variable as the criterion.

Construct Validity
Definition: The components or factors of the test should contain items that are strongly correlated.
How it is established: The Pearson r can be used to correlate the items for each factor. However, there is a technique called factor analysis to determine which items are highly correlated to form a factor.

Concurrent Validity
Definition: Two or more measures of the same characteristic are present for each examinee.
How it is established: The scores on the measures should be correlated.

Convergent Validity
Definition: The components or factors of a test are hypothesized to have a positive correlation.
How it is established: Correlation is done for the factors of the test.

Divergent Validity
Definition: The components or factors of a test are hypothesized to have a negative correlation.
How it is established: Correlation is done for the factors of the test.
Cases are provided for each type of validity to illustrate how it is conducted. After reading the cases and references about the different kinds of validity, partner with a seatmate and answer the following questions. Discuss your answers. You may use other references and browse the internet.
1. Content Validity
A coordinator in science is checking the science test paper for grade 4. She asked the
grade 4 science teacher to submit the table of specifications containing the objectives of the
lesson and the corresponding items. The coordinator checked whether each item is aligned with
the objectives.
● How are the objectives used when creating test items?
● How is content validity determined when given the objectives and the items in a test?
● What should be present in a test table of specifications when determining content
validity?
● Who checks the content validity of items?
2. Face Validity
The assistant principal browsed the test paper made by the math teacher. She checked if the contents of the items are about mathematics. She examined if the instructions are clear. She browsed through the items to check if the grammar is correct and if the vocabulary is within the students' level of understanding.
● What can be done in order to ensure that the assessment appears to be effective?
● What practices are done in conducting face validity?
● Why is face validity the weakest form of validity?
3. Predictive Validity
The school admissions office developed an entrance examination. The officials wanted to determine if the results of the entrance examination are accurate in identifying good students. They took the grades of the students accepted for the first quarter and correlated the entrance exam results with the first-quarter grades. They found a significant and positive correlation between the entrance examination scores and the grades. The entrance examination results predicted the grades of students after the first quarter. Thus, there was predictive validity.
● Why are two measures needed in predictive validity?
● What is the assumed connection between these two measures?
● How can we determine if a measure has predictive validity?
● What statistical analysis is done to determine predictive validity?
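As a rough illustration of the admissions case above, the analysis can be run with scipy.stats.pearsonr, which returns both the correlation coefficient and a p-value for judging significance. The entrance exam scores and first-quarter grades below are hypothetical.

```python
from scipy.stats import pearsonr

# Hypothetical data: entrance exam scores (predictor, X) and first-quarter grades (criterion, Y)
entrance_exam = [78, 85, 92, 67, 88, 73, 95, 81]
first_quarter = [80, 86, 90, 70, 85, 75, 93, 84]

r, p_value = pearsonr(entrance_exam, first_quarter)
print(f"r = {r:.2f}, p = {p_value:.3f}")

# A positive r that is significant (e.g., p < .05) supports predictive validity:
# higher entrance exam scores go with higher first-quarter grades.
```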
4. Concurrent Validity
A school guidance counselor administered a math achievement test to grade 6 students.
She also has a copy of the students' grades in math. She wanted to verify if the math grades of
the students are measuring the same competencies as the math achievement test. The school
counselor correlated the math achievement scores and math grades to determine if they are
measuring the same competencies.
● What needs to be available when conducting concurrent validity?
● At least how many tests are needed for conducting concurrent validity?
● What statistical analysis can be used to establish concurrent validity?
● How are the results of a correlation coefficient interpreted for concurrent validity?
5. Construct Validity
A science test made by a grade 10 teacher is composed of four domains: matter, living things, force and motion, and earth and space. There are 10 items under each domain. The teacher wanted to determine if the 10 items written under each domain really belonged to that domain. The teacher consulted an expert in test measurement, and they conducted a procedure called factor analysis. Factor analysis is a statistical procedure done to determine if the items written will load under the domain to which they belong.
● What type of test requires construct validity?
● What should the test have in order to verify its constructs?
● What are constructs and factors in a test?
● How are these factors verified if they are appropriate for the test?
● What results come out in construct validity?
● How are the results in construct validity interpreted?
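For readers who want to try the procedure, here is a minimal sketch of an exploratory factor analysis using scikit-learn's FactorAnalysis. The response matrix is simulated so that the first four items share one underlying ability and the next four share another; a dedicated package such as factor_analyzer could be used instead.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulate 100 students answering 8 items: items 1-4 draw on one ability, items 5-8 on another
rng = np.random.default_rng(0)
ability_a = rng.normal(size=(100, 1))
ability_b = rng.normal(size=(100, 1))
responses = np.hstack([
    ability_a + rng.normal(scale=0.5, size=(100, 4)),  # items expected to load on factor 1
    ability_b + rng.normal(scale=0.5, size=(100, 4)),  # items expected to load on factor 2
])

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(responses)

# Rows of components_ are factors, columns are items; items with high absolute
# loadings on the same factor are interpreted as belonging to that domain.
print(np.round(fa.components_, 2))
```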
The construct validity of a measure is reported in journal articles. The following are guide
questions used when searching for the construct validity of a measure from reports:
● What was the purpose of construct validity?
● What type of test was used?
● What are the dimensions or factors that were studied using construct validity?
● What procedure was used to establish the construct validity?
● What statistics was used for the construct validity?
● What were the results of the test's construct validity?
6. Convergent Validity
A math teacher developed a test to be administered at the end of the school year, which
measures number sense, patterns and algebra, measurement, geometry, and statistics. It is
assumed by the math teacher that students' competencies in number sense improves their
capacity to learn patterns and algebra and other concepts. After administering the test, the
scores were separated for each area, and these five domains were intercorrelated using
Pearson r. The positive correlation between number sense and patterns and algebra indicates
that, when number sense scores increase, the patterns and algebra scores also increase. This shows that students' learning of number sense scaffolds their patterns and algebra competencies.
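A sketch of how the intercorrelation could be computed with pandas is shown below; the domain subtotals are hypothetical.

```python
import pandas as pd

# Hypothetical domain subtotals for eight students
domains = pd.DataFrame({
    "number_sense":     [12, 15, 9, 18, 14, 11, 17, 13],
    "patterns_algebra": [11, 16, 8, 17, 13, 10, 18, 12],
    "measurement":      [10, 14, 9, 16, 12, 11, 15, 13],
    "geometry":         [13, 15, 10, 17, 14, 12, 16, 11],
    "statistics":       [9, 13, 8, 15, 12, 10, 14, 11],
})

# Pearson intercorrelations among the five domains; a positive value between
# number_sense and patterns_algebra supports convergent validity.
print(domains.corr(method="pearson").round(2))
```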
7. Divergent Validity
An English teacher taught a metacognitive awareness strategy for comprehending a paragraph to her grade 11 students. She wanted to determine if the performance of her students in reading
comprehension would reflect well in the reading comprehension test. She administered the
same reading comprehension test to another class which was not taught the metacognitive
awareness strategy. She compared the results using a t-test for independent samples and
found that the class that was taught metacognitive awareness strategy performed significantly
better than the other group. The test has divergent validity.
● What conditions are needed to conduct divergent validity?
● What assumption is being proved in divergent validity?
● What statistical analysis can be used to establish divergent validity?
● How are the results of divergent validity interpreted?
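The comparison described in the case can be carried out with an independent-samples t-test, for example through scipy.stats.ttest_ind. The comprehension scores below are hypothetical.

```python
from scipy.stats import ttest_ind

# Hypothetical reading comprehension scores for the two classes
taught_strategy = [18, 20, 17, 19, 16, 18, 19, 17]    # class taught the metacognitive strategy
comparison_class = [14, 15, 13, 16, 12, 15, 14, 13]   # class not taught the strategy

t_stat, p_value = ttest_ind(taught_strategy, comparison_class)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# A significantly higher mean for the strategy group is taken in the case above
# as evidence of divergent validity.
```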
An item is difficult if the majority of students are unable to provide the correct answer. The item is easy if the majority of the students are able to answer it correctly. An item can discriminate if the examinees who score high on the test answer the item correctly more often than the examinees who get low scores.
Below is a dataset of five items on the addition and subtraction of integers. Follow the
procedure to determine the difficulty and discrimination of each item.
1. Get the total score of each student and arrange the scores from highest to lowest.
2. Identify the upper 27% group and the lower 27% group of students based on their total scores.
3. Tally, for each item, the number of students in each group who answered it correctly.
4. Obtain the proportion correct for each item. This is computed separately for the upper 27% group (pH) and the lower 27% group (pL) by dividing the number of correct answers on the item by the number of students in the group.
Computation:
Item discrimination = pH – pL
Application:
1. ____________________
2. ____________________
3. ____________________
4. ____________________
5. ____________________
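Since the module's dataset is not reproduced here, the sketch below uses a hypothetical 0/1 score matrix for the five items to show how the upper and lower group computation can be carried out. The difficulty index shown, (pH + pL) / 2, is a common convention and is included as an assumption rather than as the module's prescribed formula.

```python
import numpy as np

# Hypothetical item scores (1 = correct, 0 = wrong): one row per student, one column per item
scores = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
])

totals = scores.sum(axis=1)
order = np.argsort(totals)[::-1]       # students arranged from highest to lowest total score
k = max(1, round(0.27 * len(scores)))  # size of the upper and lower 27% groups
upper = scores[order[:k]]
lower = scores[order[-k:]]

p_high = upper.mean(axis=0)            # proportion correct per item in the upper group (pH)
p_low = lower.mean(axis=0)             # proportion correct per item in the lower group (pL)

discrimination = p_high - p_low        # item discrimination = pH - pL
difficulty = (p_high + p_low) / 2      # assumed difficulty index: average of pH and pL

for i, (disc, diff) in enumerate(zip(discrimination, difficulty), start=1):
    print(f"Item {i}: discrimination = {disc:.2f}, difficulty = {diff:.2f}")
```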