Learning Instructional Modules for CPE 105

Lesson 6

Establishing Test Validity and Reliability

Learning Objectives

In this lesson, you are expected to:

● use procedures and statistical analysis to establish test validity and reliability;
● decide whether a test is valid or reliable; and
● decide which test items are easy and difficult.

Significant Culminating Performance Task and Success Indicators

At the end of the lesson, you should be able to demonstrate your knowledge and skills
in determining whether the test and its items are valid and reliable. You are considered
successful in this culminating performance task if you have satisfied at least the following
indicators of success:

Specific Performance Tasks and Success Indicators

1. Use the appropriate procedure in determining test validity and reliability.
Success indicator: Provided the detailed steps, decision, and rationale in the use of appropriate validity and reliability measures.

2. Show the procedure on how to establish test validity and reliability.
Success indicator: Provided the detailed procedure from the preparation of the instrument, the pretesting, and the analysis in determining the test's validity and reliability.

3. Provide accurate results in the analysis of item difficulty and reliability.
Success indicator: Made the appropriate computations, used software, reported the results, and interpreted the results for the tests of validity and reliability.

Prerequisite of this Lesson

To successfully perform this culminating performance task, you should have prepared a test following the proper procedure, with clear learning targets (objectives), a table of specifications, and pretest data per item. In the previous lesson, you were provided with guidelines in constructing tests in different formats. You also learned that an assessment becomes valid when the test items represent a good set of objectives, which should be reflected in the table of specifications. The learning targets will help you construct appropriate test items.

What is test reliability?

Reliability is the consistency of responses to a measure under three conditions: (1) when the same person is retested; (2) when an equivalent measure is used; and (3) when similar responses are given across items that measure the same characteristic. In the first condition, consistent responses are expected when the test is given to the same participants at another time. In the second condition, reliability is attained when the responses to a test are consistent with the responses to its equivalent form, or to another test that measures the same characteristic, when administered at a different time. In the third condition, there is reliability when a person responds consistently across items that measure the same characteristic.

There are different factors that affect the reliability of a measure. The reliability of a
measure can be high or low, depending on the following factors:

1. The number of items in a test - The more items a test has, the higher the likelihood of reliability. The probability of obtaining consistent scores is high because of the large pool of items.
2. Individual differences of participants - Every participant possesses characteristics that
affect their performance in a test, such as fatigue, concentration, innate ability,
perseverance, and motivation. These individual factors change over time and affect the
consistency of the answers in a test.
3. External environment - The external environment may include room temperature, noise
level, depth of instruction, exposure to materials and quality of instruction, which could
affect changes in the responses of examinees in a test.

What are the different ways to establish test reliability?

There are different ways of determining the reliability of a test. The specific kind of reliability will depend on the (1) variable being measured, (2) type of test, and (3) number of versions of the test.

Notice in each method below that statistical analysis is needed to determine the test reliability.

Test-Retest
How is this reliability done? Administer a test at one time to a group of examinees, then administer it again at another time to the same group of examinees. For tests that measure stable characteristics, such as standardized aptitude tests, the time interval between the first and second administration is not more than 6 months; the retest can be given after a minimum time interval of 30 minutes. The responses should be more or less the same across the two points in time. Test-retest is applicable for tests that measure stable variables, such as aptitude and psychomotor measures.
What statistics is used? Correlate the test scores from the first and second administrations. A significant and positive correlation indicates that the test has temporal stability over time. Correlation refers to a statistical procedure in which a linear relationship is expected between two variables. You may use Pearson r because test data are usually on an interval scale.

Parallel Forms
How is this reliability done? There are two versions of a test; each version is called a "form". The items in the two forms need to measure exactly the same skill. Administer one form at one time and the other form at another time to the same group of participants. The responses on the two forms should be more or less the same. Parallel forms are applicable when there are two versions of the test. This is usually done when the test is used repeatedly for different groups, such as entrance and licensure examinations, where different versions of the test are given to different groups of examinees.
What statistics is used? Correlate the test results of the first form and the second form. A significant and positive correlation coefficient is expected, indicating that the responses in the two forms are the same or consistent. Pearson r is usually used for this analysis.

Split-Half
How is this reliability done? Administer a test to a group of examinees. The items are split into halves, usually using the odd-even technique: get the sum of the points on the odd-numbered items and correlate it with the sum of the points on the even-numbered items. Each examinee will thus have two scores coming from the same test, and the two scores should be close or consistent. Split-half is applicable when the test has a large number of items.
What statistics is used? Correlate the two sets of scores using Pearson r. After the correlation, apply another formula called the Spearman-Brown coefficient. The coefficients obtained using Pearson r and Spearman-Brown should be significant and positive to mean that the test has internal consistency reliability.

Test of Internal Consistency Using Kuder-Richardson and Cronbach's Alpha
How is this reliability done? This procedure involves determining whether the scores for each item are consistently answered by the examinees. After administering the test to a group of examinees, determine and record the score for each item. The idea is to see whether the responses per item are consistent with one another. This technique works well when the assessment tool has a large number of items. It is also applicable for scales and inventories, such as Likert scales.
What statistics is used? A statistical analysis called Cronbach's alpha (or the Kuder-Richardson formula, for items scored right or wrong) is used to determine the internal consistency of the items. A Cronbach's alpha value of 0.60 and above indicates that the test items have internal consistency.

Inter-rater Reliability
How is this reliability done? This procedure is used to determine the consistency of multiple raters when using rating scales and rubrics to judge performance. Reliability here refers to similar or consistent ratings provided by more than one rater or judge using the same assessment tool. Inter-rater reliability is applicable when the assessment requires the use of multiple raters.
What statistics is used? A statistical analysis called Kendall's coefficient of concordance (Kendall's w) is used to determine whether the ratings provided by multiple raters agree with each other. A significant Kendall's w value indicates that the raters concur or agree with each other in their ratings.

You will notice that statistical analysis is required to determine the reliability of a measure. The very basis of the statistical analysis used to determine reliability is linear regression.

1. Linear regression

Linear regression is demonstrated when you have two measured variables, such as two sets of scores on a test taken at two different times by the same participants. When the two sets of scores are plotted on a graph (with an X and a Y axis), they tend to form a straight line, and the straight line formed by the two sets of scores can produce a linear regression. When a straight line is formed, we can say that there is a correlation between the two sets of scores. This correlation can be shown in a graph called a scatterplot, where each point is a respondent with two scores (one for each test).



2. Computation of Pearson r correlation

The index of the linear regression is called a correlation coefficient. When the points in a scatterplot tend to fall along a straight line, the correlation is said to be strong. When the trend of the scatterplot is directly proportional, the correlation coefficient has a positive value; when it is inversely proportional, the correlation coefficient has a negative value. The statistical analysis used to obtain the correlation coefficient is called the Pearson r. How the Pearson r is obtained is illustrated below.

Suppose that a teacher gave a 20-item spelling test of two-syllable words on Monday and again on Tuesday. The teacher wanted to determine the reliability of the two sets of scores by computing the Pearson r.

Formula:

r = [N(ΣXY) − (ΣX)(ΣY)] / √{[N(ΣX²) − (ΣX)²][N(ΣY²) − (ΣY)²]}
Monday (X)   Tuesday (Y)     X²      Y²      XY
    10           20         100     400     200
     9           15          81     225     135
     6           12          36     144      72
    10           18         100     324     180
    12           19         144     361     228
     4            8          16      64      32
     5            7          25      49      35
     7           10          49     100      70
    16           17         256     289     272
     8           13          64     169     104
 ΣX = 87      ΣY = 139   ΣX² = 871  ΣY² = 2125  ΣXY = 1328

ΣX - add all the X scores (Monday scores)
ΣY - add all the Y scores (Tuesday scores)
X² - square each X score
Y² - square each Y score
XY - multiply each X score by its Y score
ΣX² - add all the squared values of X
ΣY² - add all the squared values of Y
ΣXY - add all the products of X and Y

r = [10(1328) − (87)(139)] / √{[10(871) − 87²][10(2125) − 139²]}
  = (13280 − 12093) / √[(1141)(1929)]
  = 1187 / 1483.57

r = 0.80

The value of a correlation coefficient does not exceed 1.00 or -1.00. A value of 1.00 or -1.00 indicates a perfect correlation. In tests of reliability, though, we aim for a high positive correlation to mean that there is consistency in the way the students answered the tests.
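The computation above can be checked with a short script. This is a sketch using only the standard library, with the Monday and Tuesday scores taken from the worked example.

```python
from math import sqrt

# Monday (X) and Tuesday (Y) spelling scores from the worked example.
x = [10, 9, 6, 10, 12, 4, 5, 7, 16, 8]
y = [20, 15, 12, 18, 19, 8, 7, 10, 17, 13]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)             # 871
sum_y2 = sum(v * v for v in y)             # 2125
sum_xy = sum(a * b for a, b in zip(x, y))  # 1328

# Pearson r from the raw-score formula used above.
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r, 2))  # prints 0.8
```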

3. Difference between a positive and a negative correlation

When the value of the correlation coefficient is positive, it means that the higher the scores in
X, the higher the scores in Y. This is called a positive correlation. In the case of the two spelling
scores, a positive correlation is obtained. When the value of the correlation coefficient is negative, it
means that the higher the scores in X, the lower the scores in Y, and vice versa. This is called a
negative correlation. When the same test is administered to the same group of participants, usually a
positive correlation indicates reliability or consistency of the scores.

4. Determining the strength of a correlation



The strength of the correlation also indicates the strength of the reliability of the test.
This is indicated by the value of the correlation coefficient. The closer the value to 1.00 or -1.00,
the stronger is the correlation. Below is the guide:

Numerical Value Interpretation


0.80-1.00 Very strong relationship

0.6-0.79 Strong relationship

0.40-0.59 Substantial/marked relationship

0.2-0.39 Weak relationship

0.00-0.19 Negligible relationship
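The interpretation table can be encoded directly as a small helper. The function name is illustrative; the thresholds are the ones given in the table, applied to the absolute value of r.

```python
def interpret_correlation(r):
    """Map |r| to the verbal interpretation in the guide table above."""
    strength = abs(r)
    if strength >= 0.80:
        return "Very strong relationship"
    elif strength >= 0.60:
        return "Strong relationship"
    elif strength >= 0.40:
        return "Substantial/marked relationship"
    elif strength >= 0.20:
        return "Weak relationship"
    return "Negligible relationship"
```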

5. Determining the significance of the correlation


The correlation obtained between two variables could be due to chance. To determine whether the correlation is real rather than due to chance, it is tested for significance. When a correlation is significant, it means that the relationship between the two variables is very unlikely to have occurred by chance.

To determine whether a correlation coefficient value is significant, it is compared with a tabled value called a critical value. When the computed value is greater than the critical value, the correlation is significant at the 95% confidence level; that is, there is less than a 5% probability that the observed relationship occurred by chance.
Another statistical analysis used to determine the internal consistency of a test is Cronbach's alpha. Follow the procedure below to determine internal consistency.

Suppose that five students answered a five-item checklist about their hygiene on a scale of 1 to 5, with the following corresponding scores:

5 - always, 4 - often, 3 - sometimes, 2 - rarely, 1 - never

The teacher wanted to determine if the items have internal consistency.

Student   Item 1  Item 2  Item 3  Item 4  Item 5   Total (X)   Score−Mean   (Score−Mean)²
A            5       5       4       4       1        19           2.8            7.84
B            3       4       3       3       2        15          -1.2            1.44
C            2       5       3       3       3        16          -0.2            0.04
D            1       4       2       3       3        13          -3.2           10.24
E            3       3       4       4       4        18           1.8            3.24

Mean of the totals = 16.2                              Σ(Score−Mean)² = 22.8

Total for each item:      14      21      16      17      13
ΣX² for each item:        48      91      54      59      39
Item variance (SD²):     2.2     0.7     0.7     0.3     1.3       ΣSD² = 5.2

The variance of the total scores is

σ²t = Σ(Score−Mean)² / (n − 1) = 22.8 / (5 − 1) = 5.7

where n is the number of students. Cronbach's alpha is then computed, where k is the number of items:

Cronbach's α = (k / (k − 1)) × ((σ²t − ΣSD²) / σ²t)
             = (5 / (5 − 1)) × ((5.7 − 5.2) / 5.7)

Cronbach's α ≈ 0.11

Because 0.11 is well below the 0.60 threshold, the checklist items do not have internal consistency.
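The same computation can be verified in code. This sketch uses the checklist responses from the table and sample variances (division by n − 1), matching the worked example.

```python
# Cronbach's alpha for the five-student hygiene checklist above,
# using sample variances (division by n - 1), as in the worked example.
def variance(scores):
    m = sum(scores) / len(scores)
    return sum((s - m) ** 2 for s in scores) / (len(scores) - 1)

responses = [
    [5, 5, 4, 4, 1],  # Student A
    [3, 4, 3, 3, 2],  # Student B
    [2, 5, 3, 3, 3],  # Student C
    [1, 4, 2, 3, 3],  # Student D
    [3, 3, 4, 4, 4],  # Student E
]

k = len(responses[0])                                    # number of items
totals = [sum(row) for row in responses]                 # 19, 15, 16, 13, 18
item_vars = [variance(col) for col in zip(*responses)]   # 2.2, 0.7, 0.7, 0.3, 1.3
alpha = (k / (k - 1)) * (1 - sum(item_vars) / variance(totals))
print(round(alpha, 2))  # prints 0.11
```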

For inter-rater reliability, suppose three raters each scored five demonstrations. The ratings given by the three raters are first summed for each demonstration. The mean of the sums of ratings is obtained (here, 8.4). The mean is subtracted from each sum of ratings to get a deviation (D), each deviation is squared (D²), and the squared deviations are summed (ΣD² = 33.2). The sum of squared deviations is then substituted in the formula for Kendall's w, where m is the number of raters and n is the number of demonstrations:

W = 12ΣD² / [m²n(n² − 1)]
  = 12(33.2) / [3²(5)(5² − 1)]
  = 398.4 / 1080

W = 0.37
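The formula can be checked with a short function. The name `kendall_w` is my own; the values substituted are the ones given above (ΣD² = 33.2, three raters, five demonstrations).

```python
# Kendall's coefficient of concordance, following the formula above:
# W = 12 * sum(D^2) / (m^2 * n * (n^2 - 1)),
# where m is the number of raters and n the number of demonstrations.
def kendall_w(sum_d_squared, m, n):
    return 12 * sum_d_squared / (m ** 2 * n * (n ** 2 - 1))

# Values from the worked example: sum of squared deviations = 33.2,
# m = 3 raters, n = 5 demonstrations.
print(round(kendall_w(33.2, 3, 5), 2))  # prints 0.37
```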



A Kendall's w coefficient value of 0.37 indicates the degree of agreement of the three raters on the five demonstrations. There is moderate concordance among the three raters because the value is far from 1.00.

What is test validity?

A measure is valid when it measures what it is supposed to measure. If a quarterly exam is valid, then its contents should directly measure the objectives of the curriculum. If a scale that measures personality is composed of five factors, then each factor should contain items that are highly correlated with it. If an entrance exam is valid, it should predict students' grades after the first semester.

What are the different ways to establish test validity?

There are different ways to establish test validity.

Types of Validity, their Definitions, and Procedures

Content Validity
Definition: When the items represent the domain being measured.
Procedure: The items are compared with the objectives of the program. The items need to measure the objectives directly (for achievement tests) or the definition (for scales). A reviewer conducts the checking.

Face Validity
Definition: When the test is presented well, free of errors, and administered well.
Procedure: The test items and layout are reviewed and tried out on a small group of respondents. A manual for administration can be made as a guide for the test administrator.

Predictive Validity
Definition: When a measure predicts a future criterion. An example is an entrance exam predicting the grades of the students after the first semester.
Procedure: A correlation coefficient is obtained, where the X variable is used as the predictor and the Y variable as the criterion.

Construct Validity
Definition: When the components or factors of the test contain items that are strongly correlated.
Procedure: The Pearson r can be used to correlate the items within each factor. There is also a technique called factor analysis to determine which items are highly correlated and thus form a factor.

Concurrent Validity
Definition: When two or more measures of the same characteristic are available for each examinee.
Procedure: The scores on the measures should be correlated.

Convergent Validity
Definition: When the components or factors of a test are hypothesized to have a positive correlation.
Procedure: Correlation is done across the factors of the test.

Divergent Validity
Definition: When the components or factors of a test are hypothesized to have a negative correlation. An example is correlating the scores in a test on intrinsic and extrinsic motivation.
Procedure: Correlation is done across the factors of the test.

There are cases for each type of validity provided that illustrate how it is conducted. After
reading the cases and references about the different kinds of validity, partner with a seatmate
and answer the following questions. Discuss your answers. You may use other references and
browse the internet.

1. Content Validity
A coordinator in science is checking the science test paper for grade 4. She asked the
grade 4 science teacher to submit the table of specifications containing the objectives of the
lesson and the corresponding items. The coordinator checked whether each item is aligned with
the objectives.
● How are the objectives used when creating test items?
● How is content validity determined when given the objectives and the items in a test?
● What should be present in a test table of specifications when determining content
validity?
● Who checks the content validity of items?

2. Face Validity
The assistant principal browsed the test paper made by the math teacher. She checked
if the contents of the items are about mathematics. She examined if instructions are clear. She
browsed through the items if the grammar is correct and if the vocabulary is within the students'
level of understanding.
● What can be done in order to ensure that the assessment appears to be effective?
● What practices are done in conducting face validity?
● Why is face validity the weakest form of validity?

3. Predictive Validity
The school admissions office developed an entrance examination. The officials wanted to determine if the results of the entrance examination are accurate in identifying good students. They took the grades of the accepted students for the first quarter and correlated the entrance exam results with the first-quarter grades. They found significant and positive correlations between the entrance examination scores and the grades: the entrance examination results predicted the grades of students after the first quarter. Thus, there was predictive validity.
● Why are two measures needed in predictive validity?
● What is the assumed connection between these two measures?
● How can we determine if a measure has predictive validity?
● What statistical analysis is done to determine predictive validity?



● How are the test results of predictive validity interpreted?

4. Concurrent Validity
A school guidance counselor administered a math achievement test to grade 6 students.
She also has a copy of the students' grades in math. She wanted to verify if the math grades of
the students are measuring the same competencies as the math achievement test. The school
counselor correlated the math achievement scores and math grades to determine if they are
measuring the same competencies.
● What needs to be available when conducting concurrent validity?
● At least how many tests are needed for conducting concurrent validity?
● What statistical analysis can be used to establish concurrent validity?
● How are the results of a correlation coefficient interpreted for concurrent validity?

5. Construct Validity
A grade 10 teacher made a science test composed of four domains: matter, living things, force and motion, and earth and space. There are 10 items under each domain. The teacher wanted to determine if the 10 items written under each domain really belonged to that domain. The teacher consulted an expert in test measurement, and they conducted a procedure called factor analysis. Factor analysis is a statistical procedure done to determine whether the items written load on the domain they belong to.
● What type of test requires construct validity?
● What should the test have in order to verify its constructs?
● What are constructs and factors in a test?
● How are these factors verified if they are appropriate for the test?
● What results come out in construct validity?
● How are the results in construct validity interpreted?
The construct validity of a measure is reported in journal articles. The following are guide
questions used when searching for the construct validity of a measure from reports:
● What was the purpose of construct validity?
● What type of test was used?
● What are the dimensions or factors that were studied using construct validity?
● What procedure was used to establish the construct validity?
● What statistics was used for the construct validity?
● What were the results of the test's construct validity?

6. Convergent Validity
A math teacher developed a test to be administered at the end of the school year, which measures number sense, patterns and algebra, measurement, geometry, and statistics. The math teacher assumed that students' competencies in number sense improve their capacity to learn patterns and algebra and the other areas. After administering the test, the scores were separated for each area, and the five domains were intercorrelated using Pearson r. The positive correlation between number sense and patterns and algebra indicates that, when number sense scores increase, the patterns and algebra scores also increase. This shows that students' learning of number sense scaffolds their patterns and algebra competencies.



● What should a test have in order to conduct convergent validity?
● What is done with the domains in a test on convergent validity?
● What analysis is used to determine convergent validity?
● How are the results in convergent validity interpreted?

7. Divergent Validity
An English teacher taught metacognitive awareness strategy to comprehend a paragraph for
grade 11 students. She wanted to determine if the performance of her students in reading
comprehension would reflect well in the reading comprehension test. She administered the
same reading comprehension test to another class which was not taught the metacognitive
awareness strategy. She compared the results using a t-test for independent samples and
found that the class that was taught metacognitive awareness strategy performed significantly
better than the other group. The test has divergent validity.
● What conditions are needed to conduct divergent validity?
● What assumption is being proved in divergent validity?
● What statistical analysis can be used to establish divergent validity?
● How are the results of divergent validity interpreted?

How to determine if an item is easy or difficult

An item is difficult if the majority of students are unable to provide the correct answer; it is easy if the majority of students are able to answer it correctly. An item can discriminate if the examinees who score high on the test can answer more items correctly than the examinees who got low scores.

Below is a dataset of five items on the addition and subtraction of integers. Follow the
procedure to determine the difficulty and discrimination of each item.

1. Get the total score of each student and arrange the scores from highest to lowest.

Item 1 Item 2 Item 3 Item 4 Item 5


Student 1 0 0 1 1 1
Student 2 1 1 1 0 1
Student 3 0 0 0 1 1
Student 4 0 0 0 0 1
Student 5 0 1 1 1 1
Student 6 1 0 1 1 0
Student 7 0 0 1 1 0
Student 8 0 1 1 0 0
Student 9 1 0 1 1 1
Student 10 1 0 1 1 0



2. Obtain the upper and lower 27% of the group: multiply 0.27 by the total number of students to get 2.7, which rounds to 3. Get the top three students and the bottom three students based on their total scores. The top three students are students 2, 5, and 9. The bottom three students are students 7, 8, and 4. The rest of the students are not included in the item analysis.

Item 1 Item 2 Item 3 Item 4 Item 5 Total


Score
Student 2 1 1 1 0 1 4
Student 5 0 1 1 1 1 4
Student 9 1 0 1 1 1 4
Student 1 0 0 1 1 1 3
Student 6 1 0 1 1 0 3
Student 10 1 0 1 1 0 3
Student 3 0 0 0 1 1 2
Student 7 0 0 1 1 0 2
Student 8 0 1 1 0 0 2
Student 4 0 0 0 0 1 1

3. Obtain the proportion of correct answers for each item, computed separately for the upper 27% group and the lower 27% group. This is done by summing the correct answers per item within the group and dividing by the number of students in that group.

Item 1 Item 2 Item 3 Item 4 Item 5 Total


Score
Student 2 1 1 1 0 1 4
Student 5 0 1 1 1 1 4
Student 9 1 0 1 1 1 4
Proportion 0.67 0.67 1.00 0.67 1.00
of the high
group (PH)



Student 7 0 0 1 1 0 2
Student 8 0 1 1 0 0 2
Student 4 0 0 0 0 1 1
Proportion 0.00 0.33 0.67 0.33 0.33
of the low
group (PL)

4. The item difficulty is obtained using the following formula:

Item Difficulty = (pH + pL) / 2
The difficulty is interpreted using the table:

Difficulty Index Remark


0.76 or higher Easy item
0.25 to 0.75 Average item
0.24 or lower Difficult item

Computation:

              Item 1          Item 2          Item 3          Item 4          Item 5
           (0.67+0.00)/2   (0.67+0.33)/2   (1.00+0.67)/2   (0.67+0.33)/2   (1.00+0.33)/2
Index of
difficulty     0.33            0.50            0.83            0.50            0.67
Item
difficulty   Difficult        Average          Easy           Average         Average
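The difficulty indices above can be reproduced from the raw 0/1 responses of the two groups. This is a sketch; the `proportions` helper name is my own.

```python
# Item difficulty from the raw 0/1 responses of the upper and lower 27% groups
# (students 2, 5, 9 and students 7, 8, 4 in the tables above).
upper = [[1, 1, 1, 0, 1],   # Student 2
         [0, 1, 1, 1, 1],   # Student 5
         [1, 0, 1, 1, 1]]   # Student 9
lower = [[0, 0, 1, 1, 0],   # Student 7
         [0, 1, 1, 0, 0],   # Student 8
         [0, 0, 0, 0, 1]]   # Student 4

def proportions(group):
    """Proportion of correct answers per item within a group."""
    n = len(group)
    return [sum(col) / n for col in zip(*group)]

p_high, p_low = proportions(upper), proportions(lower)
difficulty = [(ph + pl) / 2 for ph, pl in zip(p_high, p_low)]
print([round(d, 2) for d in difficulty])  # prints [0.33, 0.5, 0.83, 0.5, 0.67]
```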

5. The index of discrimination is obtained using the formula:

Item discrimination = pH – pL

The value is interpreted using the table:

Index of discrimination Remark



0.40 and above Very good item
0.30-0.39 Good item
0.20-0.29 Reasonably Good item
0.10-0.19 Marginal item
Below 0.10 Poor item

                  Item 1        Item 2        Item 3        Item 4        Item 5
                =0.67−0.00    =0.67−0.33    =1.00−0.67    =0.67−0.33    =1.00−0.33
Discrimination
index              0.67          0.33          0.33          0.33          0.67
Discrimination   Very good     Good item     Good item     Good item     Very good
                   item                                                    item
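The discrimination indices and remarks can be computed the same way from the raw group responses. This sketch reuses the same upper- and lower-group data; the helper names are illustrative.

```python
# Item discrimination (pH - pL) from the raw 0/1 responses of the upper and
# lower 27% groups (students 2, 5, 9 and students 7, 8, 4 in the tables above).
upper = [[1, 1, 1, 0, 1], [0, 1, 1, 1, 1], [1, 0, 1, 1, 1]]  # students 2, 5, 9
lower = [[0, 0, 1, 1, 0], [0, 1, 1, 0, 0], [0, 0, 0, 0, 1]]  # students 7, 8, 4

def proportions(group):
    """Proportion of correct answers per item within a group."""
    n = len(group)
    return [sum(col) / n for col in zip(*group)]

def discrimination_remark(d):
    """Map a discrimination index to the remark table above."""
    if d >= 0.40:
        return "Very good item"
    elif d >= 0.30:
        return "Good item"
    elif d >= 0.20:
        return "Reasonably good item"
    elif d >= 0.10:
        return "Marginal item"
    return "Poor item"

p_high, p_low = proportions(upper), proportions(lower)
for i, (ph, pl) in enumerate(zip(p_high, p_low), start=1):
    d = ph - pl
    print(f"Item {i}: {d:.2f} ({discrimination_remark(d)})")
```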

Application:

This shall be done individually and must be handwritten.

A. Indicate the type of reliability applicable for each case.


1. ____________________
2. ____________________
3. ____________________
4. ____________________
5. ____________________



B. Indicate the type of validity applicable for each case.

1. ____________________
2. ____________________
3. ____________________
4. ____________________
5. ____________________



C. Determine whether the spelling test is reliable and valid using the data, applying the following: (1) split-half, (2) Cronbach's alpha, (3) predictive validity with the English grade, (4) convergent validity between words with single and double stresses, and (5) the difficulty index of each item.

An English teacher administered a spelling test to 15 students. The spelling test is composed of 10 items. Each item is encoded, wherein a correct answer is marked "1" and an incorrect answer is marked "0". The grade in English is also provided in the last column. The first five items are words with two stresses, and the next five are words with a single stress. The recording is indicated in the table.
