How To Read and Really Use An Item Analysis: Nurse Educator
A frequent challenge for nursing faculty is to write a test that effectively evaluates learning and prepares students to be
successful on the NCLEX-RN examination. Item analysis is often used to provide an objective evaluation
of examinations. Interpreting these analyses, however, can be frustrating. The authors provide an explanation of the various
components of an item analysis, how to make an analysis useful for faculty, and how to use the components of an item analysis
in revising tests.
Writing the perfect examination is an idealized and unrealistic goal for faculty, but one for which to continually strive. Faculty often spend hours writing and rewriting test questions each semester, only to find some new problem or issue with each version of the test. Test banks are frequently used in generating questions, but items are sometimes found to be poorly written,1 so the frustration continues. Faculty often think they have written an excellent test, but without an objective item analysis it is really impossible to have any assurance of its evaluative competence. Using a computer-generated item analysis can be extremely useful to faculty in their constant endeavor to write the perfect test.

A well-written test serves to confirm that students are appropriately challenged, have a good grasp of the material that was taught, and are prepared to progress. Although many alternative forms of teaching and evaluation are available and frequently used in nursing curricula, the use of multiple-choice tests is the most common mode of objectively evaluating course work. This format is also used for initial licensure examinations, so creating good tests helps to prepare students to successfully take the NCLEX-RN licensing examination. Because licensure is the ultimate goal for a graduating nursing student, faculty have an obligation to prepare them well for this.

It has become common practice for nursing faculty to have an item analysis performed after administering an examination. A standard item analysis report yields a wealth of information, but to many faculty, the data are not meaningful, and therefore, the report is not used. When used appropriately, an item analysis can guide the faculty in revising and improving tests. There is, however, a dearth of nursing literature on this subject.

What Is an Item Analysis?
Slight variations exist in the statistics used in an item analysis, depending on the software used, but the general elements of the analysis are the same. Item analysis is a process of statistically examining both the test questions and the students’ answers to assess the quality of the questions and the test as a whole.2 The analysis assists in determining the extent to which individual test items contribute to the overall reliability, or internal consistency, of the test. The basic elements of an item analysis include measures of central tendency (mean, median, mode, standard deviation), correct group responses, response frequencies, nondistracters, point biserial, and a reliability coefficient. Looking at the item analysis in this order will provide a clear process to follow and enable faculty to systematically examine tests.

Measures of Central Tendency
Measures of central tendency are among the most basic of all the statistical results specified in an item analysis. The mean is simply the average of all individual student scores for a particular test or examination. The median is the score below which 50% of all scores on that test fall. The range is the difference between the highest and lowest scores. Standard deviation is the measurement of variability. In other words, it is the measure of dispersion of student scores or how much on average scores vary around the mean. Although these statistics are easily understood by most, their interpretation may be somewhat skewed in a population of nursing students. For instance, in upper-level nursing courses, it is not expected that a percentage of students will fail the test or the course. In fact, the further along in the nursing curriculum the students are, the greater the expectation that students will
investigation is required. It could be that there are 1 or 2 very low grades that are outliers and have skewed the mean.

Item Difficulty/Item Discrimination
In addition to measures of central tendency, other rather simple descriptive elements are in an item analysis. These include the correct group responses, response frequencies, and nondistracters (Table 1).

Table 1. ParSCORE Analyses for Two 50-Item Examinations

The correct group responses category is divided into 3 columns. One column gives the total percentage of students who answer an item correctly. This is a basic indicator of item difficulty. The greater the percentage of students answering a question correctly, generally, the easier that question is. By contrast, if a question has a zero percentage of students answering it correctly, that item does not contribute to distinguishing between individual differences among the students and is a question that needs to be revised. If a question has more than 50% incorrect responses, the faculty needs to examine the item and determine whether it needs to be revised or deleted or if there was a coding error.

The next column indicates the percentage of the upper third of the students answering an item correctly. The last of these 3 columns shows the percentage of the lower third of the students answering an item correctly. These last 2 columns of the correct group responses can be very useful. They tell you how the students making the highest grades and the lowest grades on this examination did on a particular question. This analysis is the first step in what is often called item discrimination. It would be expected that the upper third would do the best on all test items. However, if the reverse is true and students in the lower third do best on the item, it means that the question was not a good one or was worded poorly, thus misleading or discriminating against the students in the upper group. Any question that favors the lower third of the students and not the upper third needs to be revised. In addition, if an analysis shows low percentages in the upper third of the test takers selecting the correct answer, it warrants looking at those questions and determining if they need to be revised. A professor may have taught the content and the students just missed it, or it could be that the content explanation was not clear. In any event, if the upper third missed a test item, the professor needs to go back and reteach the material. The best test questions discriminate between those students who do well on the examination and those who do not.

The response frequencies are merely a tally of how many students responded to each of the possible answers to a particular question. The report also usually marks the correct response to that question with an asterisk. This is particularly useful when trying to understand why the majority of a class misses a question. For example, if no students answered an item correctly, it may be that the answer sheet was keyed incorrectly. Teacher error in keying is always a possibility. The response frequencies also help to illustrate which distracters are most challenging.

The nondistracters are the potential answers that none of the students chose as correct. When there are nondistracters, it means that the number of plausible answers is more limited than intended. In other words, if there are 4 potential answers in a multiple choice question, and 2 of those are nondistracters, the students were in essence answering a question with only 2 potential answers. Nondistracters are often too easy. All alternative answers should be plausible. It is important to remember that when writing questions for nursing examinations, they should mimic NCLEX-RN questions as closely as possible, so silly or obviously incorrect potential responses should not be used.

Point Biserial
The point biserial is the second and more complete calculation of test item discrimination and is used to judge item quality. It tells how much predictive power a test item has and whether the students who would be expected to answer a question correctly are actually doing so.2 The point biserial is a correlational calculation determined by the dichotomous variable of student responses to a particular test question (1 = right or 0 = wrong) and the continuous variable of their total score on the overall test. In other words, it is a correlation between item score and total score. This coefficient is an interaction between item discrimination and item difficulty. It is the measurement that illustrates how well an item separates, or differentiates, between those students who answer an item correctly or incorrectly, and have a high or low test score, respectively.3 This number can range from -1 to +1. Very easy or very difficult test items will have little discrimination. Items of moderate difficulty (60%-80% answering correctly) generally are more discriminating.

The point biserial is designed to reflect the degree to which an item and the examination as a whole are measuring a single attribute topic and will be lower for examinations that measure a wider range of content. The higher the point biserial, the better that examination item is at discriminating among students on the basis of how well they really know the material.4 A positive biserial indicates that those scoring higher on the test were more likely to answer that question correctly. If the students in the lower third answer an item correctly more frequently than the upper third of the students, the point biserial will have a negative value. This usually indicates that that test item is flawed and should be revised. A low value usually means that the question was too easy. There are no universal guidelines as to what point biserial value is most desirable on a nursing examination, but there are common ranges considered to be acceptable. As a general rule, anything below 0.20 is considered a poor question and in need of revision; items with a value between 0.20 and 0.30 are considered fair and could be improved upon, and items between 0.40 and 0.70 are considered good.2,4,5 However, each question should always be evaluated in terms of the purpose of the test and the purpose of the individual question. For example, there may be a question that is so critical to the knowledge base of the students that the professor desires and expects 100% of the students to answer it correctly. In that case, a point biserial of 0 may be the goal.

Reliability Coefficient
The overall reliability of an examination is analyzed using a reliability coefficient. It may be reported as a Cronbach α
or a Kuder-Richardson Formula 20 (KR-20) coefficient. This is a measure of the stability or consistency among the test scores or the internal consistency of a test.4 The higher the reliability coefficient, the more likely a test will produce consistent scores when administered to similar groups. It is designed to measure how well a test measures a single cognitive factor. A low reliability coefficient may be reflected when a test covers multiple topics. The KR-20 is used for tests that have right and wrong answers. Cronbach α can be used for instruments that have right, wrong, and no right-wrong answer, such as in an attitude survey. For this reason, the KR-20 is most often used in education.

The KR-20 index ranges from 0 to 1, and reflects 4 different things: (1) the total number of test questions, (2) the proportion of the responses to an item that are correct, (3) the proportion of responses to an item that are incorrect, and (4) the variance for that set of scores.6 Low reliability may mean that the test questions are unrelated to each other in terms of who answered them correctly and that the test scores reflect peculiarities of the test items more than the students’ knowledge of the subject. The most common cause of a low reliability score is that the questions are too easy. Other reasons for a low reliability include an excessive number of very difficult items, unclear or poorly written items that do not discriminate, or test items that do not test a unified body of content.7 A high reliability coefficient indicates that the individual questions on a test tended to pull together to measure 1 topic, and the students who did well overall were likely to answer each question correctly.

Practically speaking, a reliability coefficient of greater than 0.50 can be considered a good coefficient for a nursing examination because most nursing examinations cover multiple concepts and topics. Even if a test does cover multiple areas of content, such as cardiovascular and respiratory systems, the central construct for most examinations is still nursing knowledge, so the KR-20 is useful in analyzing the reliability of the test items relative to this construct. There are several ways to improve test reliability: administer longer tests, have a more heterogeneous group of students, or attempt to change the questions and thus the item difficulty to where 70% to 80% of the students answer test items correctly. However, none of these may be a viable option in nursing courses.

Examples of Test Analyses
Table 1 is an example of 2 different item analysis reports for 50-item examinations. Only the first 10 items are used for illustration. The first example has a high KR-20 of 0.81, indicating strong internal consistency. The mean of this examination is 79.84. The passing grade for this particular course is 80. There was 1 outlier of a grade of 52, which brought the mean down. Point biserials indicate that items 1 to 4 are good questions that discriminate well between the upper third of those scoring on this examination and the lower third. Item 5 is an example of the lower third scoring better than the upper, thus rendering a negative biserial. When this item was examined, it was found to be a knowledge-level question, so it is possible that the upper third of the class overanalyzed the question and the lower third did better because they simply took it at face value. Because the entire class answered item 9 correctly, the biserial is 0. The second example in Table 1 has a lower KR-20, but a higher mean. There is also more variability in the correct group responses.

Table 2 is an example of an item analysis of a 10-item quiz. As would be expected, most of the upper third of the students did very well on this quiz. Eight of the point biserials are above 0.20, indicating good discrimination between the upper and lower thirds of the class.

Table 2. ParSCORE Analysis for a 10-Item Quiz

Now That I Can Read and Interpret an Item Analysis, So What?
The general interpretation of an item analysis can be more difficult when used with nursing students in the upper-level courses of the curriculum. The typical normal distribution of grades in a bell curve that might be expected in a freshman-level course should not be seen in this population of students. Nursing students in a baccalaureate nursing program should show a negatively skewed distribution, with scores clustered toward the high end, because they have already completed approximately 3 or 4 semesters of general education and science courses before they are accepted into a nursing program. The nursing major is a rigorous one, and the criteria for acceptance into nursing programs are more stringent than those of many other majors. There is nothing "average" about nursing students. This makes interpreting and using an item analysis even more complex because the basic rules of interpretation may not be valid for this population.

In nursing schools, everything has to be examined in context. A faculty member may administer an examination and have an item analysis that indicates that the test is an excellent one. It may have a reliability coefficient of 0.85, but a mean score of 76. If the cutoff for passing at a particular school is 80 and the mean score on an examination is 76, it means that the average score is not a passing one. This may be an indicator that the material for this examination was not grasped well by the students as a whole and needs to be reinforced or taught in a different way. It may also mean that the students simply did not prepare adequately. A good way to distinguish between these 2 causes is to administer the same or very similar test each semester and compare the analyses over time. Item analysis interpretations are usually normed with typical grading scales, so each nursing school, each course, and each examination must be considered when deciding what is acceptable on the item analysis and what indicates a need for a change in a particular examination. Caution should always be used when interpreting statistics based on a small sample size because the results may simply be random chance.

Evaluating examinations is difficult. Faculty members tend to like questions they write and want to think it is strictly the fault of the students if they miss questions. An item analysis is an objective measure to scrutinize examinations more critically and determine which questions really do need to be revised, what material needs to be revisited, and which questions need to be kept. If an item analysis indicates a poor question, it does not mean that the question must be discarded and everyone given credit for the item on that examination. It is most useful to look at the item analysis along with the test blueprint to see the
whole picture. Looking at the whole picture is helpful to see where to focus to develop new questions. It is also helpful to have colleagues look at tests and give feedback. An objective outsider will be able to discern awkward wording or poorly written questions.

Statistics from an item analysis are useful in understanding student performance on that examination, but the purpose of the examination should always remain paramount. There are both theoretical/conceptual and practical reasons for writing examination items and designing the test as a whole.8 Results of an item analysis should always be carefully examined and not used as the sole determinant for rewriting or revising a test. Data from the analyses are always going to be influenced by the number of students taking the examination, the type of students, the variances in teaching, and the inevitable errors attributable to chance. Table 3 gives a brief summary of suggestions of when to revise and when not to revise questions.

When using a software package to do item analysis, it is possible to add results of a current examination to prior examinations to increase the sample size. This is helpful only if examination questions have not been changed. If the questions have been changed, it is helpful to compare the analyses to determine if the changes made the examination better.

Once a faculty member has learned how to read and interpret an item analysis, the analysis can be very helpful both in refining tests and also in indicating what may need strengthening or deleting in the teaching of the course. The analysis helps to find flaws or errors in a test so it can be adjusted before grades are posted. For instance, it may indicate that there are 2 right answers and both should be accepted or that a question was keyed incorrectly and the tests must be rescored. The analysis is also useful in determining which questions are too difficult or too easy so that those questions can be reworded or revised to be more appropriately challenging. When questions are found to have high levels of difficulty, it may be necessary for the material to be stressed more carefully in class or more fully explained. Identifying the common misconceptions
of the students by looking at the frequencies of the distracters also helps to determine material that needs to be further clarified in class.

Summary
Writing the perfect examination is a lifelong challenge, but the goal should be continual improvement. Interpreting examination results by use of an item analysis yields a wealth of information that is useful in both improving test items and in improving teaching. Because one of the goals of every nursing school is to have its students pass the NCLEX-RN examination on the first try, the time and effort given to interpreting examinations and using this information are invaluable.

References
1. Masters JC, Hulsmeyer BS, Pike MaryE, Leichty K, Tiller MT, Verst AL. Assessment of multiple-choice questions in selected test banks accompanying text books used in nursing education. J Nurs Educ. 2001;40(1):25-32.
2. Matlock-Hetzel S. Basic concepts in item and test analysis. Paper presented at Annual Meeting of the Southwest Educational Research Association; January 23-25, 1997; Austin, Texas.
3. McGill exam results. Available at products/exams/. Accessed January 14, 2009.
4. Oermann MH, Gaberson KB. Evaluation and Testing in Nursing Education. 2nd ed. New York, NY: Springer Publishing Co;
5. Introduction to Item Analysis. Scoring Office: Academic Technology Services. Available at soweb/itanhand.html. Accessed January 14, 2009.
6. Bodner GM. Statistical analysis of multiple-choice exams. J Chem Educ. 1980;57(3):188-190.
7. Kehoe J. Basic item analysis for multiple-choice tests. Practical Assessment, Research & Evaluation. 1995;4(10). Available at http:// Accessed January 14, 2009.
8. Brown JD. Questions and answers about language testing statistics. Shiken: JALT Testing & Evaluation SIG Newsletter. 2001;5(3):12-15.
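The item-level statistics described in the Item Difficulty/Item Discrimination and Point Biserial sections can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used by ParSCORE or any other scoring package. The function name `item_statistics` and the small `responses` matrix are hypothetical, the point biserial is computed as a plain correlation between item score and total score (with the item itself included in the total), and an item with no variation is reported as 0 to match the article's item 9 example, although the correlation is strictly undefined in that case.

```python
import math


def item_statistics(responses, item):
    """Summarize one item from a matrix of 0/1 scores (rows = students)."""
    n = len(responses)
    if n < 3:
        raise ValueError("need at least 3 students to form upper and lower thirds")
    totals = [sum(row) for row in responses]        # each student's total score
    item_scores = [row[item] for row in responses]  # 1 = right, 0 = wrong

    # Item difficulty: proportion of all students answering correctly.
    difficulty = sum(item_scores) / n

    # Rank students by total score, then take the top and bottom thirds.
    # Students tied at a boundary are split arbitrarily (assumption).
    ranked = sorted(range(n), key=lambda i: totals[i], reverse=True)
    third = n // 3
    upper, lower = ranked[:third], ranked[-third:]
    pct_upper = sum(item_scores[i] for i in upper) / third
    pct_lower = sum(item_scores[i] for i in lower) / third

    # Point biserial: Pearson correlation between item score and total score.
    mean_total = sum(totals) / n
    cov = sum((x - difficulty) * (t - mean_total)
              for x, t in zip(item_scores, totals))
    ss_item = sum((x - difficulty) ** 2 for x in item_scores)
    ss_total = sum((t - mean_total) ** 2 for t in totals)
    if ss_item == 0 or ss_total == 0:
        point_biserial = 0.0  # no variation: reported as 0, as in the article
    else:
        point_biserial = cov / math.sqrt(ss_item * ss_total)

    return {
        "difficulty": difficulty,
        "pct_upper": pct_upper,
        "pct_lower": pct_lower,
        "point_biserial": point_biserial,
    }


# Hypothetical 6-student, 6-item quiz (made-up data for illustration only).
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
    [1, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
]

for item in range(6):
    print(item + 1, item_statistics(responses, item))
```

In this made-up data, the first item is answered correctly by everyone and so gets a point biserial of 0, the second item favors the upper third, and the last item favors the lower third and gets a negative value. With only 6 students the numbers are far too unstable to act on, which is the article's caution about small samples.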

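The test-level numbers discussed in the Measures of Central Tendency and Reliability Coefficient sections can be sketched the same way. The KR-20 is computed here from its standard formula, KR-20 = (k / (k - 1)) × (1 - Σpq / variance), where k is the number of items, p and q are the proportions of correct and incorrect responses to each item, and the variance is that of the total scores. This is a sketch under stated assumptions: the function name `exam_summary` and the data are hypothetical, and the population variance is used, whereas some software uses the sample variance and will report slightly different values.

```python
import statistics


def exam_summary(responses):
    """Descriptive statistics and KR-20 for a matrix of 0/1 item scores."""
    n_students = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]  # each student's total score

    # Population variance of the total scores (assumption; see lead-in).
    variance = statistics.pvariance(totals)
    if n_items < 2 or variance == 0:
        raise ValueError("KR-20 needs at least 2 items and some spread in scores")

    # Sum over items of p * q (proportion correct times proportion incorrect).
    sum_pq = 0.0
    for item in range(n_items):
        p = sum(row[item] for row in responses) / n_students
        sum_pq += p * (1 - p)

    kr20 = (n_items / (n_items - 1)) * (1 - sum_pq / variance)

    return {
        "mean": statistics.mean(totals),
        "median": statistics.median(totals),
        "range": max(totals) - min(totals),
        "sd": statistics.pstdev(totals),
        "kr20": kr20,
    }


# Hypothetical 6-student, 6-item quiz (made-up data for illustration only).
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
    [1, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
]

summary = exam_summary(responses)
print(summary)
```

For this tiny made-up quiz the KR-20 comes out at roughly 0.12, well under the 0.50 that the article treats as good for a nursing examination. That is unsurprising for 6 items and 6 students, and it illustrates why the article lists longer tests among the ways to improve reliability.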