
Item Analysis and Validation
LEARNING OUTCOMES

➢ Explain the meaning of item analysis, item validity, reliability, item difficulty, and discrimination index.
➢ Determine the validity and reliability of given test items.
➢ Determine the quality of a test item by its difficulty index, discrimination index, and plausibility of options (for a selected-response test).
INTRODUCTION

The teacher normally prepares a draft of the test. Such a draft is subjected to item analysis and validation in order to ensure that the final version of the test will be useful and functional. First, the teacher tries out the draft test on a group of students with characteristics similar to those of the intended test takers (try-out phase). From the try-out group, each item is analyzed in terms of its ability to discriminate between those who know and those who do not know the material, as well as its level of difficulty (item analysis phase). Finally, the final draft of the test is subjected to validation if the intent is to use the test as a standard test for the particular unit or grading period. We shall be concerned with these concepts in this lesson.
6.1 Item Analysis: Difficulty Index and Discrimination Index
There are two important characteristics of an item that will be of interest to the teacher.
These are: (a) item difficulty and (b) discrimination index. We shall learn how to measure these
characteristics and apply our knowledge in making a decision about the item in question.
The difficulty of an item, or item difficulty, is defined as the number of students who are able to answer the item correctly divided by the total number of students. Thus:

Item difficulty = (number of students with correct answer) / (total number of students)

The item difficulty is usually expressed as a percentage.

Example: What is the item difficulty index of an item if 25 students are unable to answer it correctly while 75 answered it correctly?

Here, the total number of students is 100, hence the item difficulty is 75/100 or 75%.

Another example: 25 students answered the item correctly while 75 students did not. The total number of students is 100, so the difficulty index is 25/100 or 0.25, which is 25%. This is a more difficult item than one with a difficulty index of 75%.

A high percentage indicates an easy item or question, while a low percentage indicates a difficult item.
One problem with this type of difficulty index is that it may not actually indicate that the
item is difficult (or easy). A student who does not know the subject matter will naturally be
unable to answer the item correctly even if the question is easy. How do we decide on the basis
of this index whether the item is too difficult or too easy?

The following arbitrary rule is often used in the literature:

Range of Difficulty Index    Interpretation      Action
0 – 0.25                     Difficult           Revise or discard
0.26 – 0.75                  Right difficulty    Retain
0.76 and above               Easy                Revise or discard
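To make the computation concrete, here is a minimal Python sketch of the difficulty index and of the rule-of-thumb interpretation above (the function names are illustrative, not from the text):

```python
def difficulty_index(num_correct, num_students):
    """Proportion of students who answered the item correctly."""
    return num_correct / num_students

def interpret_difficulty(p):
    """Apply the rule-of-thumb table above."""
    if p <= 0.25:
        return "Difficult - revise or discard"
    elif p <= 0.75:
        return "Right difficulty - retain"
    return "Easy - revise or discard"

# Example from the text: 75 of 100 students answered the item correctly.
p = difficulty_index(75, 100)
print(p)                        # 0.75
print(interpret_difficulty(p))  # Right difficulty - retain
```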

Difficult items tend to discriminate between those who know and those who do not
know the answer. Conversely, easy items cannot discriminate between these two groups of
students. We are therefore interested in deriving a measure that will tell us whether an item can
discriminate between these two groups of students. Such a measure is called an index of
discrimination.
6.1.1 Discrimination Index
The discrimination index is the power of an item to discriminate between students who scored high and students who scored low on the overall test. In other words, it is the power of the item to discriminate between students who know the lesson and those who do not. The discrimination index is the basis for measuring the validity of an item. This index can be interpreted as an indication of the extent to which overall knowledge of the content area or mastery of the skills is related to the response on an item.

Types of Discrimination Index


There are three kinds of discrimination index: positive discrimination, negative discrimination and zero
discrimination.
1. Positive discrimination happens when more students in the upper group than in the lower group get the item correct.
2. Negative discrimination occurs when more students in the lower group than in the upper group get the item correct.
3. Zero discrimination happens when the numbers of students in the upper group and the lower group who answer the item correctly are equal; hence, the item cannot distinguish between students who performed well on the overall test and students whose performance was very poor.
Steps in Solving Difficulty Index and Discrimination Index
1. Arrange the scores from highest to lowest.
2. Separate the scores into an upper group and a lower group. There are different methods to do this: (a) if a class of 30 students takes an exam, arrange their scores from highest to lowest and divide them into two halves; the higher scores form the upper group and the lower scores form the lower group. (b) Other literature suggests using 27%, 30%, or 33% of the students for the upper and lower groups. In the Licensure Examination for Teachers (LET), the test developers use 27% of the students who participated in the examination for the upper and lower groups. (A sketch of this split appears after this list.)
3. Count the number of those who chose the alternatives in the upper and lower group for each item and
record the information using the template:

Options        A    B    C    D    E
Upper Group
Lower Group

*Note: Put an asterisk beside the correct answer.
4. Compute the difficulty index and the discrimination index, and analyze the responses to each distracter.
5. Make an analysis for each item.
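The split into upper and lower groups described in step 2 can be sketched in Python as follows; the 27% figure comes from the text, while the sample scores are invented for illustration:

```python
def split_upper_lower(scores, fraction=0.27):
    """Rank scores from highest to lowest and take the top and bottom
    fraction of examinees as the upper and lower groups."""
    ranked = sorted(scores, reverse=True)
    n = max(1, round(len(ranked) * fraction))
    return ranked[:n], ranked[-n:]

# Hypothetical scores for a class of 30 students
scores = [45, 44, 42, 40, 39, 38, 38, 37, 36, 35,
          34, 33, 33, 32, 31, 30, 29, 28, 27, 26,
          25, 24, 23, 22, 21, 20, 18, 17, 15, 12]
upper, lower = split_upper_lower(scores)
print(len(upper), len(lower))   # 8 8 (27% of 30, rounded)
```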
An easy way to derive such a measure is to measure how difficult an item is with respect to those in the upper 25% of the class and how difficult it is with respect to those in the lower 25% of the class. If the upper 25% of the class found the item easy yet the lower 25% found it difficult, then the item can discriminate properly between these two groups. Thus:

Index of discrimination = DU − DL (where U = upper group and L = lower group)
Example: Obtain the index of discrimination of an item if the upper 25% of the class had a difficulty index of 0.60 (i.e., 60% of the upper 25% got the correct answer) while the lower 25% of the class had a difficulty index of 0.20.

Here, DU = 0.60 while DL = 0.20, thus index of discrimination = 0.60 − 0.20 = 0.40.

The discrimination index is the difference between the proportion of the top scorers who got an item correct and the proportion of the lowest scorers who got the item correct. The discrimination index ranges between -1 and +1. The closer the discrimination index is to +1, the more effectively the item can discriminate or distinguish between the two groups of students. A negative discrimination index means that more students from the lower group got the item correct; such an item is not good and should be discarded.
Theoretically, the index of discrimination can range from -1.0 (when DU = 0 and DL = 1) to 1.0 (when DU = 1 and DL = 0). When the index of discrimination is equal to -1, all of the upper 25% of the students got the wrong answer while all of the lower 25% got the right answer. In a sense, such an index discriminates correctly between the two groups, but the item itself is highly questionable: why do the bright ones get the wrong answer while the poor ones get the right answer? On the other hand, if the index of discrimination is 1.0, then all of the lower 25% failed to get the correct answer while all of the upper 25% got the correct answer. This is a perfectly discriminating item and is the ideal item to include in the test. From these discussions, let us agree to discard or revise all items that have a negative discrimination index, for although they discriminate correctly between the upper and lower 25% of the class, the content of the item itself may be highly dubious or doubtful. As in the case of the index of difficulty, we have the following rule of thumb:

Index Range       Interpretation                               Action
-1.0 – -0.50      Can discriminate but item is questionable    Discard
-0.49 – 0.45      Non-discriminating                           Revise
0.46 – 1.0        Discriminating item                          Include
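A minimal Python sketch of the discrimination index and of the rule of thumb above, assuming DU and DL have already been computed for the upper and lower groups:

```python
def discrimination_index(d_upper, d_lower):
    """Index of discrimination: DU - DL."""
    return d_upper - d_lower

def interpret_discrimination(d):
    """Apply the rule-of-thumb table above."""
    if d <= -0.50:
        return "Can discriminate but item is questionable - discard"
    elif d <= 0.45:
        return "Non-discriminating - revise"
    return "Discriminating item - include"

d = discrimination_index(0.60, 0.20)   # example from the text
print(round(d, 2))                     # 0.4
print(interpret_discrimination(d))     # Non-discriminating - revise
```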
Example: Consider a multiple-choice test item for which the following data were obtained:

Item 1
Options        A     B*    C     D
Total          0     40    20    20
Upper 25%      0     15    5     0
Lower 25%      0     5     10    5

The correct response is B. Let us compute the difficulty index and index of discrimination:

Difficulty index = (no. of students getting correct response) / (total)
                 = 40/80
                 = 50%, within the range of a "good item"

The discrimination index can similarly be computed:

DU = (no. of students in upper 25% with correct response) / (no. of students in the upper 25%)
   = 15/20
   = 0.75 or 75%

DL = (no. of students in lower 25% with correct response) / (no. of students in the lower 25%)
   = 5/20
   = 0.25 or 25%

Discrimination Index = DU − DL = 0.75 − 0.25 = 0.50 or 50%

Thus, the item also has "good discriminating power."

It is also instructive to note that the distracter A is not an effective distracter since
this was never selected by the students. It is an implausible distracter. Distracters C and D
appear to have good appeal as distracters. They are plausible distracters.
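Putting the pieces together, the following sketch (function and data names are illustrative) takes the option counts from the table above and reports the difficulty index, the discrimination index, and any distracter that nobody selected:

```python
def analyze_item(total_counts, upper_counts, lower_counts, key):
    """Each argument maps an option letter to the number of examinees
    who chose it (whole class, upper 25%, lower 25%); key is the answer."""
    difficulty = total_counts[key] / sum(total_counts.values())
    d_upper = upper_counts[key] / sum(upper_counts.values())
    d_lower = lower_counts[key] / sum(lower_counts.values())
    discrimination = d_upper - d_lower
    # Distracters that nobody chose are implausible and should be revised.
    implausible = [opt for opt, n in total_counts.items()
                   if opt != key and n == 0]
    return difficulty, discrimination, implausible

# Data from the worked example above (correct answer is B)
total = {"A": 0, "B": 40, "C": 20, "D": 20}
upper = {"A": 0, "B": 15, "C": 5, "D": 0}
lower = {"A": 0, "B": 5, "C": 10, "D": 5}

p, d, weak = analyze_item(total, upper, lower, key="B")
print(p, d, weak)   # 0.5 0.5 ['A']
```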
Index of Difficulty

P = (RU + RL) / T × 100

Where:
RU = the number in the upper group who answered the item correctly
RL = the number in the lower group who answered the item correctly
T = the total number who tried the item

Index of Item Discriminating Power

D = (RU − RL) / (T/2)

Where:
RU = the number in the upper group who answered the item correctly
RL = the number in the lower group who answered the item correctly
T = the total number who tried the item

Example: For an item answered correctly by 6 students in the upper group (RU = 6) and 2 students in the lower group (RL = 2), out of T = 20 students who tried the item, the index of difficulty is

P = (6 + 2)/20 × 100 = 40%

The smaller the percentage figure, the more difficult the item.

Estimate the item discriminating power using the formula above:

D = (RU − RL) / (T/2) = (6 − 2)/10 = 0.40
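A quick check of the two formulas in Python, using the figures from the example above (RU = 6, RL = 2, T = 20):

```python
def index_of_difficulty(r_upper, r_lower, total):
    """P = (RU + RL) / T x 100."""
    return (r_upper + r_lower) / total * 100

def discriminating_power(r_upper, r_lower, total):
    """D = (RU - RL) / (T / 2)."""
    return (r_upper - r_lower) / (total / 2)

print(index_of_difficulty(6, 2, 20))    # 40.0
print(discriminating_power(6, 2, 20))   # 0.4
```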

The discriminating power of an item is reported as a decimal fraction; maximum discriminating power is indicated by an index of 1.00. Maximum discrimination is usually found at the 50 percent level of difficulty.

Index of difficulty interpretation:
0.00 – 0.20 = Very difficult
0.21 – 0.80 = Moderately difficult
0.81 – 1.00 = Very easy
For classroom achievement tests, most test constructors desire items with indices of
difficulty no lower than 20 nor higher than 80, with an average index of difficulty from 30 or 40 to a
maximum of 60.
The INDEX OF DISCRIMINATION is the difference between the proportion of the upper
group who got an item right and the proportion of the lower group who got the item right. This
index is dependent upon the difficulty of an item. It may reach a maximum value of 100 for an item
with an index of difficulty of 50, that is, when 100% of the upper group and none of the lower
group answer the item correctly. For items of less than or greater than 50 difficulty, the index of
discrimination has a maximum value of less than 100.
6.2 Validation and Validity
After performing the item analysis and revising the items which need revision, the next step is
to validate the instrument. The purpose of validation is to determine the characteristics of the whole
test itself, namely, the validity and reliability of the test. Validation is the process of collecting and
analyzing evidence to support the meaningfulness and usefulness of the test.
Validity. Validity is the extent to which a test measures what it purports to measure; it also refers to the appropriateness, correctness, meaningfulness, and usefulness of the specific decisions a teacher makes based on the test results. These two definitions of validity differ in the sense that the first refers to the test itself while the second refers to the decisions made by the teacher based on the test. A test is valid when it is aligned with the learning outcome.
A teacher who conducts test validation might want to gather different kinds of evidence.
There are essentially three main types of evidence that may be collected: content-related evidence of
validity, criterion-related evidence of validity and construct-related evidence of validity. Content-
related evidence of validity refers to the content and format of the instrument. How appropriate is the
content? How comprehensive? Does it logically get at the intended variable? How adequately does the
sample of items or questions represent the content to be assessed?
Criterion-related evidence of validity refers to the relationship between scores obtained using the instrument and scores obtained using one or more other tests (often called the criterion). How strong is this relationship? How well do such scores estimate present performance or predict future performance of a certain type?
Construct-related evidence of validity refers to the nature of the psychological construct or
characteristics being measured by the test. How well does a measure of the construct explain
differences in the behavior of the individuals or their performance of a certain task?
The usual procedure for determining content validity may be described as follows: The teacher writes out the objectives of the test based on the Table of Specifications and then gives these, together with the test items, to at least two (2) experts, along with a description of the intended test takers. The experts look at the objectives, read over the items in the test, and place a check mark in front of each question or item that they feel does not measure one or more objectives. They also place a check mark in front of each objective not assessed by any item in the test. The teacher then rewrites any item checked and resubmits it to the experts, and/or writes new items to cover those objectives not covered by the existing test. This continues until the experts approve all items and agree that all of the objectives are sufficiently covered by the test.
In order to obtain evidence of criterion-related validity, the teacher usually compares scores on the test in question with scores on some other independent criterion test which presumably already has high validity. For example, if a test is designed to measure the mathematics ability of students and it correlates highly with a standardized mathematics achievement test (external criterion), then we say we have high criterion-related evidence of validity. In particular, this type of criterion-related validity is called concurrent validity. Another type of criterion-related validity is predictive validity, wherein the test scores on the instrument are correlated with scores on a later performance (criterion measure) of the students. For example, the mathematics ability test constructed by the teacher may be correlated with the students' later performance in a Division-wide mathematics achievement test.
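Criterion-related evidence is usually summarized with a correlation coefficient between the two sets of scores. A minimal Python sketch using Pearson's r (the score lists are invented for illustration):

```python
from statistics import correlation  # available in Python 3.10+

# Hypothetical scores: teacher-made math test vs. a standardized achievement test
teacher_test = [35, 42, 28, 50, 39, 31, 45, 25, 48, 37]
standardized = [60, 72, 55, 88, 70, 58, 80, 50, 85, 66]

r = correlation(teacher_test, standardized)
print(round(r, 2))  # an r close to +1 supports concurrent validity
```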
In summary, content validity refers to how well the test items reflect the knowledge actually required for a given topic area (e.g., math). It requires the use of recognized subject matter experts to evaluate whether test items assess defined outcomes. Does a pre-employment test measure effectively and comprehensively the abilities required to perform the job? Does an English grammar test measure effectively the ability to write good English?
Criterion-related validity is also known as concrete validity because it refers to a test's correlation with a concrete outcome. In the case of a pre-employment test, the two variables compared are test scores and employee performance.
There are two main types of criterion validity: concurrent validity and predictive validity. Concurrent validity refers to a comparison between the measure in question and an outcome assessed at the same time. An example of concurrent validity is a comparison of scores on the NAT Math exam with course grades in Grade 12 Math. In predictive validity, we ask: do scores on the NAT Math exam predict the Grade 12 Math grade?
6.3 Reliability
Reliability refers to the consistency of the scores obtained – how consistent they are for each individual from one set of items to another. Formulas for computing the reliability of a test were given earlier; for internal consistency, for instance, we could use the split-half method or the Kuder-Richardson formulae (KR-20 or KR-21).
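For dichotomously scored items (1 = correct, 0 = wrong), KR-20 can be computed directly from the item responses. A minimal Python sketch, assuming the scores are stored one row per examinee (the data set is made up for illustration):

```python
def kr20(item_scores):
    """KR-20 = (k / (k - 1)) * (1 - sum(p*q) / variance of total scores),
    where p is the proportion answering an item correctly and q = 1 - p."""
    n = len(item_scores)           # number of examinees
    k = len(item_scores[0])        # number of items
    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in item_scores) / n
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Tiny illustrative data set: 5 examinees answering 4 items
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
]
print(round(kr20(scores), 2))   # 0.31 for this made-up data
```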
Reliability and validity are related concepts. If an instrument is unreliable, it cannot yield valid outcomes. As reliability improves, validity may improve (or it may not). However, if an instrument is shown scientifically to be valid, then it is almost certain that it is also reliable.
Predictive validity compares the questions with an outcome assessed at a later time. An example of predictive validity is a comparison of scores on the National Achievement Test (NAT) with first-semester grade point average (GPA) in college. Do NAT scores predict college performance?

Construct validity refers to the ability of a test to measure what it is supposed to measure. If, as a researcher, you intend to measure depression but actually measure anxiety, your research is compromised.
The following table is a standard followed almost universally in educational tests and measurement.

Reliability       Interpretation
0.90 and above    Excellent reliability; at the level of the best standardized tests
0.80 – 0.90       Very good for a classroom test
0.70 – 0.80       Good for a classroom test; in the range of most. There are probably a few items which could be improved.
0.60 – 0.70       Somewhat low. This test needs to be supplemented by other measures (e.g., more tests) to determine grades. There are probably some items which could be improved.
0.50 – 0.60       Suggests need for revision of the test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading.
0.50 or below     Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision.
Exercises 5
A. Write TRUE if the statement is correct and FALSE if it is wrong.

1. Difficulty index indicates the proportion of students who got the item right.
2. Difficulty index indicates the proportion of students who got the item wrong.
3. A high percentage indicates an easy item/question, and a low percentage indicates a difficult
item.
4. Authors agree, in general, that items should have values of difficulty no less than 20% correct
and no greater than 80%.
5. Very difficult or very easy items contribute greatly to the discriminating power of a test.
6. The discrimination index range is between -1 and +2.
7. The farther the index is to +1, the more effectively the item distinguishes between the two
groups of students.
8. When an item discriminates negatively, such item should be revised and eliminated from
scoring.
9. A positive discrimination index indicates that the lower performing students actually selected
the key or correct response more frequently than the top performers.
10. If no one selects a distracter, it is important to revise the option and attempt to make the distracter a more plausible choice.
