Lesson 6: Establishing Test Validity and Reliability



Balagtas, M.U., David, A.P., Golla, E.F., Magno, C.P., & Valladolid, V.C. (2019). Assessment in
Learning 1: Outcomes-based Workstext. Quezon City: Rex Book Store, Inc.

This is an exclusive and copyrighted property of REX Book Store Inc. All rights reserved.
No part of this material shall be reproduced, distributed, or transmitted in any form or by
any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior written consent of REX Book Store Inc.


What is test reliability?
Reliability is the consistency of responses to a measure under three conditions:
(1) when the same person is retested: consistent responses are expected when
the same test is given to the same set of participants;
(2) when the same measure is retested: responses are expected to be consistent
with the same test, or with another test that measures the same characteristic,
when administered at a different time; and
(3) when responses are similar across items that measure the same
characteristic: there is reliability when a person responds in the same way,
or consistently, across items that measure the same characteristic.
Factors Affecting the Reliability of a Measure
1. The number of items in a test – The more items a test has, the more likely
it is that reliability is high. A large pool of items raises the probability of
obtaining consistent scores.
2. Individual differences of participants – Every participant possesses
characteristics that affect performance on a test, such as fatigue,
concentration, innate ability, perseverance, and motivation. These individual
factors change over time and affect the consistency of the answers on a test.
3. External environment – The external environment may include the room
temperature, noise level, depth of instruction, exposure to materials, and quality
of instruction, all of which may change the responses of examinees on a test.

What are the different ways to establish test reliability?
There are different ways of determining the reliability of a test. The
specific kind of reliability depends on (1) the variable you are
measuring, (2) the type of test, and (3) the number of versions the test has.
The table below indicates the different types of reliability and how each is
done. Notice in the third column that statistical analysis is needed to
determine test reliability.

Different Ways to Establish Test Reliability
For each type of reliability: how it is done, and what statistics are used.

1. Test-retest
How is this reliability done? Administer the test to a group of examinees, then administer it again at another time to the same group of examinees. For tests that measure stable characteristics, such as standardized aptitude tests, the time interval between the first and the next administration should be not more than 6 months. The post-test can be given with a minimum time interval of 30 minutes. The responses on the test should be more or less the same across the two points in time.
Test-retest is applicable for tests that measure stable variables, such as aptitude and psychomotor measures (e.g., a typing test, tasks in physical education).
What statistics are used? Correlate the test scores from the first and the next administration. A significant and positive correlation indicates that the test has temporal stability over time.
Correlation refers to a statistical procedure where a linear relationship is expected between two variables. You may use the Pearson correlation coefficient because test data are usually on an interval scale.

2. Parallel Forms
How is this reliability done? There are two versions of a test. The items need to measure exactly the same skill. Each test version is called a "form." Administer one form at one time and the other form at another time to the same group of participants. The responses on the two forms should be more or less the same.
Parallel forms is applicable if there are two versions of the test. This is usually done when the test is used repeatedly for different groups, as in entrance examinations and licensure examinations. Different versions of the test are given to different groups of examinees.
What statistics are used? Correlate the test results for the first form and the second form. A significant and positive correlation coefficient is expected. It indicates that the responses on the two forms are the same or consistent.

3. Split-half
How is this reliability done? Administer a test to a group of examinees. The items need to be split into halves. The odd-numbered items can be separated from the even-numbered items. Each examinee will then have two scores coming from the same test. The scores on the two sets should be close or consistent.
Split-half is applicable when the test has a large number of items.
What statistics are used? Correlate the scores on the two sets. After the correlation, apply another formula called the Spearman-Brown coefficient. The correlation coefficient for the two sets of scores should be significant and positive. A significant and positive correlation indicates that the scores of the examinees are consistent, or the same, across halves of the test.

4. Internal Consistency
How is this reliability done? This procedure determines whether the items are answered consistently by the examinees. After administering the test to a group of examinees, it is necessary to determine and record the scores on each item. The idea here is to see whether the responses per item are consistent with each other.
This technique works well when the assessment tool has a large number of items. It is also applicable to scales and inventories (e.g., a Likert scale from "strongly agree" to "strongly disagree").
What statistics are used? A statistical analysis called Cronbach's alpha or Kuder-Richardson is used to determine the internal consistency of the items. A Cronbach's alpha value of .60 and above indicates that the test items have internal consistency.

5. Inter-rater reliability
How is this reliability done? This procedure is used to determine the consistency of multiple raters when they use rating scales and rubrics to judge a performance. Reliability here refers to similar or consistent ratings provided by more than one rater or judge using the same assessment tool.
Inter-rater reliability is applicable when the assessment requires the use of multiple raters.
What statistics are used? A statistical analysis called Kendall's coefficient of concordance (written ω in this lesson) is used to determine whether the ratings provided by multiple raters agree with each other. A significant Kendall's ω value indicates that the raters concord, or agree with each other, in their ratings.
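The Spearman-Brown step named for the split-half method can be sketched as follows. This is a minimal illustration, assuming the usual two-half form of the formula (full-test reliability = 2r / (1 + r), where r is the correlation between the two halves). The function name `spearman_brown` and the half-test correlation of 0.60 are hypothetical and do not come from the lesson.

```python
def spearman_brown(r_half: float) -> float:
    """Estimate full-test reliability from the correlation between two halves."""
    return 2 * r_half / (1 + r_half)


# Hypothetical value: suppose the odd and even halves correlate at 0.60.
full_test_reliability = spearman_brown(0.60)
print(round(full_test_reliability, 2))  # 0.75
```

The adjustment is needed because each half has only half as many items as the full test, and shorter tests tend to show lower reliability.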

Basis of Statistical Analysis to Determine Reliability
1. Linear regression
Linear regression is demonstrated when two variables are measured, such as
two sets of scores on a test taken at two different times by the same
participants. When the two sets of scores are plotted on a graph (with X and Y
axes), the points tend to form a straight line. The straight line formed by the
two sets of scores represents a linear regression. When the points form a
straight line, we can say that there is a correlation between the two sets of scores.


The graph is called a scatterplot. Each point in a scatterplot is a respondent with two scores (one for each test).

2. Computation of Pearson r correlation

The index of the linear regression is called a correlation coefficient. The
closer the points in a scatterplot fall to the straight line, the stronger the
correlation. When the points rise together (a directly proportional pattern),
the correlation coefficient has a positive value. If the pattern is inverse, the
correlation coefficient has a negative value. The statistical analysis used to
determine the correlation coefficient is called the Pearson r. The example
below illustrates how the Pearson r is obtained.

Suppose a teacher gave a 20-item spelling test of two-syllable words on
Monday and again on Tuesday. The teacher wanted to determine the reliability
of the two sets of scores by computing the Pearson r.

Monday Test (X)   Tuesday Test (Y)   X²     Y²     XY
10                20                 100    400    200
9                 15                 81     225    135
6                 12                 36     144    72
10                18                 100    324    180
12                19                 144    361    228
4                 8                  16     64     32
5                 7                  25     49     35
7                 10                 49     100    70
16                17                 256    289    272
8                 13                 64     169    104
ΣX = 87           ΣY = 139           ΣX² = 871   ΣY² = 2125   ΣXY = 1328

ΣX – Add all the X scores (Monday scores)
ΣY – Add all the Y scores (Tuesday scores)
X² – Square the value of each X score (Monday scores)
Y² – Square the value of each Y score (Tuesday scores)
XY – Multiply each X score by its Y score
ΣX² – Add all the squared values of X
ΣY² – Add all the squared values of Y
ΣXY – Add all the products of X and Y
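Substitute the values in the formula, where N is the number of examinees (10):

$$r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}}$$

$$r = \frac{10(1328) - (87)(139)}{\sqrt{\left[10(871) - 87^2\right]\left[10(2125) - 139^2\right]}} = \frac{1187}{\sqrt{(1141)(1929)}}$$

r = 0.80
The value of a correlation coefficient ranges from -1.00 to 1.00. A value of 1.00 or -1.00
indicates perfect correlation.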
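The computation for the spelling example can be checked with a short script. This is a sketch, assuming the standard raw-score form of the Pearson r built from the sums listed above (ΣX, ΣY, ΣX², ΣY², ΣXY). The function and variable names are illustrative only.

```python
from math import sqrt

monday = [10, 9, 6, 10, 12, 4, 5, 7, 16, 8]        # X scores
tuesday = [20, 15, 12, 18, 19, 8, 7, 10, 17, 13]   # Y scores


def pearson_r(x, y):
    """Raw-score Pearson r computed from the five sums in the worked table."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(monday, tuesday)
print(round(r, 2))  # 0.8
```

The sums produced inside the function (87, 139, 871, 2125, 1328) match the totals in the table, and the result rounds to the reported r = 0.80.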

3. Difference between a positive and negative correlation

When the value of the correlation coefficient is positive, it means that the higher
the scores in X, the higher the scores in Y. This is called a positive
correlation. In the case of the two spelling scores, a positive correlation is obtained.
When the value of the correlation coefficient is negative, it means that the higher
the scores in X, the lower the scores in Y, or vice versa. This is called a negative
correlation. When the same test is administered to the same group of participants,
a positive correlation usually indicates reliability or consistency of the scores.

4. Determining the strength of a correlation
The strength of the correlation is determined by the value of the correlation
coefficient. The closer the value is to 1.00 or -1.00, the stronger the correlation.
Below is the guide:
0.80 – 1.00 Very high relationship
0.60 – 0.79 High relationship
0.40 – 0.59 Substantial/marked relationship
0.20 – 0.39 Low relationship
0.00 – 0.19 Negligible relationship
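The guide can be turned into a small lookup, sketched below. The function name `correlation_strength` is hypothetical. Taking the absolute value is an assumption based on the statement that values close to either 1.00 or -1.00 are strong, so the sign is treated as direction only.

```python
def correlation_strength(r: float) -> str:
    """Return the guide's label for a correlation coefficient."""
    size = abs(r)  # assumption: the sign gives direction, not strength
    if size >= 0.80:
        return "Very high relationship"
    if size >= 0.60:
        return "High relationship"
    if size >= 0.40:
        return "Substantial/marked relationship"
    if size >= 0.20:
        return "Low relationship"
    return "Negligible relationship"


print(correlation_strength(0.80))  # Very high relationship
```

Applied to the spelling example, r = 0.80 falls in the "very high relationship" band.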

5. Determining the significance of the correlation
The obtained correlation of two variables may be due to chance. To determine whether
the correlation is free of such error, it is tested for significance. When a correlation is
significant, it means that the relationship between the two variables is unlikely to be
a product of chance.

To determine whether a correlation coefficient is significant, it is compared with a
critical value taken from a table of critical values of the correlation coefficient. When
the computed value is greater than the critical value, it means that there is more than
a 95% chance that the two variables are correlated, and the correlation is significant.
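As a rough check of significance for the spelling example, the sketch below converts r to a t statistic and compares it with a tabled critical value. This route is a substitute for looking up a table of critical r values, not the procedure shown in the lesson. The two approaches lead to the same decision. The critical t of 2.306 (df = 8, two-tailed, α = .05) is taken from a standard t table, and the variable names are illustrative.

```python
from math import sqrt

r = 0.80   # Pearson r from the spelling example
n = 10     # number of examinees

# t statistic for testing whether a correlation differs from zero.
t_statistic = r * sqrt(n - 2) / sqrt(1 - r ** 2)

# Critical t for df = n - 2 = 8, two-tailed, alpha = .05 (standard t table).
T_CRITICAL = 2.306

is_significant = abs(t_statistic) > T_CRITICAL
print(round(t_statistic, 2), is_significant)
```

Because the computed t (about 3.77) exceeds the critical value, the correlation of 0.80 would be judged significant at the .05 level under these assumptions.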

Another statistical analysis used to determine the internal consistency of a test is
Cronbach's alpha. Follow the procedure below to determine internal consistency.
Suppose that five students answered a checklist about their hygiene on a scale of 1 to 5,
where the scores correspond to:
5 – always, 4 – often, 3 – sometimes, 2 – rarely, 1 – never
The checklist has five items. The teacher wanted to determine whether the items
have internal consistency.

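Student   Item 1   Item 2   Item 3   Item 4   Item 5   Total for each case (X)   Score - Mean   (Score - Mean)²
A         5        5        4        4        1        19                        2.8            7.84
B         3        4        3        3        2        15                        -1.2           1.44
C         2        5        3        3        3        16                        -0.2           0.04
D         1        4        2        3        3        13                        -3.2           10.24
E         3        3        4        4        4        18                        1.8            3.24

Mean of the case totals = 16.2; Σ(Score - Mean)² = 22.8
Total for each item (ΣX): 14, 21, 16, 17, 13 (mean of the item totals = 16.2)
ΣX² for each item: 48, 91, 54, 59, 39
Variance of each item (SDi²): 2.2, 0.7, 0.7, 0.3, 1.3; ΣSDi² = 5.2
Variance of the total scores (SDt²) = 22.8 / (5 - 1) = 5.7

With k = 5 items:

$$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum SD_i^2}{SD_t^2}\right) = \frac{5}{4}\left(1 - \frac{5.2}{5.7}\right) \approx .11$$

Cronbach's α = .11
The internal consistency of the responses on the hygiene checklist is .11, indicating low internal consistency.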
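The Cronbach's alpha procedure for the hygiene checklist can be reproduced in a few lines. This is a sketch, assuming the usual form α = k/(k − 1) × (1 − Σ item variances / variance of totals) with sample variances (dividing by n − 1), which is what yields the 5.2 and 5.7 in the table. The variable names are illustrative.

```python
from statistics import variance

# Rows are students A to E; columns are items 1 to 5.
responses = [
    [5, 5, 4, 4, 1],  # A
    [3, 4, 3, 3, 2],  # B
    [2, 5, 3, 3, 3],  # C
    [1, 4, 2, 3, 3],  # D
    [3, 3, 4, 4, 4],  # E
]

k = len(responses[0])                                      # number of items
item_variances = [variance(col) for col in zip(*responses)]
total_variance = variance([sum(row) for row in responses])

alpha = (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
print(round(sum(item_variances), 1), round(total_variance, 1), round(alpha, 2))
```

Without rounding intermediate values, alpha comes out at roughly 0.11. A difference of about .01 from a hand computation is a rounding effect, and the interpretation of low internal consistency is unchanged.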
The consistency of ratings can be obtained using a coefficient of
concordance. Kendall's ω coefficient of concordance is used to test the
agreement among raters.
Suppose a performance task was demonstrated by five students and judged by three
raters. The rubric used a scale of 1 to 4, where 4 is the highest and 1 is the lowest.
Five demonstrations   Rater 1   Rater 2   Rater 3   Sum of Ratings   D      D²
A                     4         4         3         11               2.6    6.76
B                     3         2         3         8                -0.4   0.16
C                     3         4         4         11               2.6    6.76
D                     3         3         2         8                -0.4   0.16
E                     1         1         2         4                -4.4   19.36
Mean of the sums of ratings = 8.4; ΣD² = 33.2

The ratings given by the three raters are first summed for each
demonstration. The mean of the sums of ratings is then obtained
(mean = 8.4). The mean is subtracted from each sum of ratings to get the
difference (D). Each difference is squared (D²), and the sum of the squared
differences is computed (ΣD² = 33.2). The mean and the sum of squared
differences are substituted in the Kendall's ω formula. In the formula, m is
the number of raters.
A Kendall's ω coefficient of .38 estimates the agreement of the three
raters across the five demonstrations. There is only moderate concordance
among the three raters because the value is far from 1.00.
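The steps described above can be traced in code. The lesson's own formula did not survive in this copy, so the sketch below assumes the standard form of Kendall's coefficient of concordance, 12 × ΣD² / (m² × (n³ − n)), with m raters and n demonstrations, applied directly to the raw ratings as the table does. The coefficient is conventionally computed on ranks, so this is an approximation under stated assumptions rather than the author's exact method.

```python
ratings = {
    "A": [4, 4, 3],
    "B": [3, 2, 3],
    "C": [3, 4, 4],
    "D": [3, 3, 2],
    "E": [1, 1, 2],
}

m = 3                 # number of raters
n = len(ratings)      # number of demonstrations

sums = [sum(row) for row in ratings.values()]          # sum of ratings
mean_sum = sum(sums) / n                               # 8.4
sum_d2 = sum((s - mean_sum) ** 2 for s in sums)        # sum of squared differences

# Assumed standard form of Kendall's coefficient of concordance.
w = 12 * sum_d2 / (m ** 2 * (n ** 3 - n))
print(round(mean_sum, 1), round(sum_d2, 1), round(w, 2))
```

Under this assumed formula the coefficient is about 0.37, close to the reported .38 and leading to the same reading of moderate agreement.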


What is test validity?
A measure is valid when it measures what it is supposed to measure.
If a quarterly exam is valid, its contents should directly measure the
objectives of the curriculum. If a scale that measures personality is
composed of five factors, the items under each factor should be
highly correlated. If an entrance exam is valid, it
should predict students' grades after the first semester.

Different Ways to Establish Test Validity
For each type of validity: its definition and the procedure for establishing it.

Content validity
Definition: When the items represent the domain being measured.
Procedure: The items are compared with the objectives of the program. The items need to directly measure the objectives (for achievement tests) or the definition (for scales). A reviewer conducts the ...

Face validity
Definition: When the test is presented well, free of errors, and administered well.
Procedure: The test items and layout are reviewed and tried out on a small group of respondents. A manual for administration can be made as a guide for the test administrator.

Predictive validity
Definition: A measure should predict a future criterion. An example is an entrance exam predicting the grades of the students after the first semester.
Procedure: A correlation coefficient is computed where the X variable is used as the predictor and the Y variable as the criterion.

Concurrent validity
Definition: When two or more measures available at the present time describe the same characteristic.
Procedure: The scores on the measures should be correlated.

Construct validity
Definition: The components or factors of the test should contain items that are strongly correlated.
Procedure: The Pearson r can be used to correlate the items for each factor. However, there is a technique called factor analysis that determines which items are highly correlated to form a factor.

Convergent validity
Definition: When the components or factors of a test are hypothesized to have a positive correlation.
Procedure: Correlation is done for the factors of the test.

Divergent validity
Definition: When the components or factors of a test are hypothesized to have a negative correlation. An example is the items on intrinsic and extrinsic motivation.
Procedure: Correlation is done for the factors of the test.

Cases to Illustrate the Types of Validity
1. Content Validity
A science coordinator is checking the science test paper for grade 4. She asked the
grade 4 science teacher to submit the table of specifications containing the objectives
of the lesson and the corresponding items. The coordinator checked whether each
item is aligned with the objectives.
• How are the objectives used when creating test items?
• How is content validity determined when given the objectives and the items in a test?
• What should be present in a test table of specifications when determining content validity?
• Who checks the content validity of items?

2. Face Validity
The assistant principal browsed the test paper made by the math
teacher. She checked whether the contents of the items are about
mathematics. She examined whether the instructions are clear. She read
through the items to check whether the grammar is correct and whether the
vocabulary is within the students' level of understanding.
• What can be done in order to ensure that the assessment appears
to be effective?
• What practices are done in conducting face validity?
• Why is face validity the weakest form of validity?

3. Concurrent Validity
A school guidance counselor administered a math achievement test to the grade
6 students. She also has a copy of the students' grades in math. She wanted to verify
whether the math grades of the students measure the same competencies as the math
achievement test. The school counselor correlated the math achievement scores and the
math grades to determine whether they measure the same competencies.
• What needs to be available when conducting concurrent validity?
• At least how many tests need to be present for conducting concurrent validity?
• What statistical analysis can be used to establish concurrent validity?
• How are the results of a correlation coefficient interpreted for concurrent validity?

4. Predictive Validity
The school admissions office developed an entrance examination. The officials
wanted to determine whether the results of the entrance examination are accurate in
identifying good students. They took the first-quarter grades of the students who were
accepted and correlated the entrance exam results with those grades. They
found a significant and positive correlation between the entrance
examination scores and the grades. The entrance examination results predicted the grades
of students after the first quarter. The test has predictive validity.
• Why are two measures needed in predictive validity?
• What is the assumed connection between these two measures?
• How can we determine if a measure has predictive validity?
• What statistical analysis is done to determine predictive validity?
• How are the results of predictive validity interpreted?

5. Construct Validity
A grade 10 teacher made a science test composed of four domains: matter, living
things, force and motion, and earth and space. There are 10 items under each domain. The
teacher wanted to determine whether the 10 items made under each domain really belong to that
domain. The teacher consulted an expert in test measurement. They conducted a procedure
called factor analysis. Factor analysis is a statistical procedure for determining whether the items
written will load under the domain to which they belong.
• For what type of tests can construct validity be used?
• What should the test have in order to verify its constructs?
• What are constructs and factors in a test?
• How are these factors verified if they are appropriate for the test?
• What results come out in construct validity?
• How are the results in construct validity interpreted?

The construct validity of a measure is reported in journal articles. The
following are guide questions used when searching for the construct validity
of a measure from reports:
• What was the purpose for doing construct validity?
• What type of test was used?
• What are the dimensions or factors that were studied using construct validity?
• What procedure was used to establish the construct validity?
• What statistics were used for the construct validity?
• What were the results of the test's construct validity?

6. Convergent Validity
A math teacher developed a math test, to be administered at the end of the school
year, that measures number sense, patterns and algebra, measurement, geometry, and
statistics. The math teacher assumed that students' competencies in number sense
help them learn patterns and algebra and the other areas better. After
administering the test, the scores were separated for each area, and these five domains
were inter-correlated using the Pearson r. The positive correlation between number sense
and patterns and algebra indicates that when number sense scores increase, patterns and
algebra scores also increase. This shows that students' learning in number sense scaffolds
their patterns and algebra competencies.
• What should a test have to conduct convergent validity?
• What is done with the domains in a test in convergent validity?
• What analysis is used to determine convergent validity?
• How are the results in convergent validity interpreted?
7. Divergent Validity
An English teacher taught grade 11 students a metacognitive awareness strategy for
comprehending a paragraph. She wanted to determine whether the performance of
her students in reading comprehension would be reflected in the reading
comprehension test. She administered the same reading comprehension test to
another class that was not taught the metacognitive awareness strategy.
She compared the results using a t-test for independent samples and found that
the class that was taught the metacognitive awareness strategy performed
significantly higher than the other group. The test has divergent validity.
• What conditions need to be present to conduct divergent validity?
• What assumption is being proved in divergent validity?
• What statistical analysis can be used to establish divergent validity?
• How are the results of divergent validity interpreted?

How to determine if an item is easy or difficult?
An item is difficult if the majority of students are not able to provide the
correct answer. An item is easy if the majority of students are able to
answer it correctly.
An item can discriminate if the examinees who scored high on the test
answer it correctly more often than the examinees who got low scores.

Below is a data set of 5 items on addition and subtraction of integers. Follow the procedure to
determine the difficulty and discrimination of each item.

  Item 1 Item 2 Item 3 Item4 Item 5

Student 1 0 0 1 1 1
Student 2 1 1 1 0 1
Student 3 0 0 0 1 1
Student 4 0 0 0 0 1
Student 5 0 1 1 1 1
Student 6 1 0 1 1 0
Student 7 0 0 1 1 0
Student 8 0 1 1 0 0
Student 9 1 0 1 1 1
Student 10 1 0 1 1 0

1. Get the total score of each student and arrange the students from highest
to lowest.
             Item 1   Item 2   Item 3   Item 4   Item 5   Total score
Student 2    1        1        1        0        1        4
Student 5    0        1        1        1        1        4
Student 9    1        0        1        1        1        4
Student 1    0        0        1        1        1        3
Student 6    1        0        1        1        0        3
Student 10   1        0        1        1        0        3
Student 3    0        0        0        1        1        2
Student 7    0        0        1        1        0        2
Student 8    0        1        1        0        0        2
Student 4    0        0        0        0        1        1

2. Obtain the upper and lower 27% of the group. Multiply
0.27 by the total number of students (10) to get a value of
2.7, which rounds to 3. Get the top 3
students and the bottom 3 students based on their total scores.
The top 3 students are students 2, 5, and 9. The bottom 3
students are students 7, 8, and 4. The rest of the students are
not included in the item analysis.

3. Obtain the proportion correct for each item. This is computed separately for the
upper 27% group and the lower 27% group by summing the
correct answers per item and dividing by the number of students in the group (3).
                                        Item 1   Item 2   Item 3   Item 4   Item 5
Student 2                               1        1        1        0        1
Student 5                               0        1        1        1        1
Student 9                               1        0        1        1        1
Total                                   2        2        3        2        3
Proportion of the high group (pH)       0.67     0.67     1.00     0.67     1.00

Student 7                               0        0        1        1        0
Student 8                               0        1        1        0        0
Student 4                               0        0        0        0        1
Total                                   0        1        2        1        1
Proportion of the low group (pL)        0.00     0.33     0.67     0.33     0.33

4. The item difficulty is obtained using the following formula:
Item difficulty = (pH + pL) / 2

The difficulty is interpreted using the table:

Difficulty Index   Remark
.76 or higher      Easy Item
.25 to .75         Average Item
.24 or lower       Difficult Item

                      Item 1    Item 2    Item 3   Item 4    Item 5
Index of difficulty   0.33      0.50      0.83     0.50      0.67
Item difficulty       Average   Average   Easy     Average   Average

5. The index of discrimination is obtained using the formula:
Item discrimination = pH - pL
The value is interpreted using the table:

Index of discrimination   Remark
.40 and above             Very good item
.30 - .39                 Good item
.20 - .29                 Reasonably good item
.10 - .19                 Marginal item
Below .10                 Poor item

                 Item 1           Item 2        Item 3        Item 4        Item 5
pH - pL          = 0.67 - 0.00    = 0.67 - 0.33 = 1.00 - 0.67 = 0.67 - 0.33 = 1.00 - 0.33
Discrimination   0.67             0.33          0.33          0.33          0.67
Remark           Very good item   Good item     Good item     Good item     Very good item
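The five item-analysis steps can be sketched end to end with the data set above. This is an illustrative script, not the authors' procedure verbatim. It takes item difficulty as the mean of pH and pL, which reproduces the difficulty indices in the table, and it relies on a stable sort so that tied students keep their original order, which gives the same upper and lower groups as the worked example. All names are hypothetical.

```python
scores = {
    1: [0, 0, 1, 1, 1],
    2: [1, 1, 1, 0, 1],
    3: [0, 0, 0, 1, 1],
    4: [0, 0, 0, 0, 1],
    5: [0, 1, 1, 1, 1],
    6: [1, 0, 1, 1, 0],
    7: [0, 0, 1, 1, 0],
    8: [0, 1, 1, 0, 0],
    9: [1, 0, 1, 1, 1],
    10: [1, 0, 1, 1, 0],
}
n_items = 5

# Step 1: rank students by total score, highest first (ties keep input order).
ranked = sorted(scores, key=lambda student: sum(scores[student]), reverse=True)

# Step 2: take the upper and lower 27% of the group.
group_size = round(0.27 * len(scores))          # 2.7 rounds to 3
upper, lower = ranked[:group_size], ranked[-group_size:]


# Step 3: proportion correct per item within a group.
def proportion_correct(group, item):
    return sum(scores[student][item] for student in group) / len(group)


p_high = [proportion_correct(upper, i) for i in range(n_items)]
p_low = [proportion_correct(lower, i) for i in range(n_items)]

# Steps 4 and 5: difficulty and discrimination indices.
difficulty = [(h + l) / 2 for h, l in zip(p_high, p_low)]
discrimination = [h - l for h, l in zip(p_high, p_low)]

print([round(d, 2) for d in difficulty])      # [0.33, 0.5, 0.83, 0.5, 0.67]
print([round(d, 2) for d in discrimination])  # [0.67, 0.33, 0.33, 0.33, 0.67]
```

With a different tie-breaking rule the lower group could include student 3 instead of student 7 or 8, so results for small classes are sensitive to how ties at the cut-off are handled.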

