
Best practices in quantitative methods

3 Best Practices in Interrater Reliability: Three Common Approaches

Contributors: Steven E. Stemler & Jessica Tsai


Editors: Jason Osborne
Book Title: Best practices in quantitative methods
Chapter Title: "3 Best Practices in Interrater Reliability: Three Common Approaches"
Pub. Date: 2008
Access Date: May 10, 2014
Publishing Company: SAGE Publications, Inc.
City: Thousand Oaks
Print ISBN: 9781412940658
Online ISBN: 9781412995627
DOI: http://dx.doi.org/10.4135/9781412995627.d5
Print pages: 29-50
2008 SAGE Publications, Inc. All Rights Reserved.
This PDF has been generated from SAGE Research Methods. Please note that the
pagination of the online version will vary from the pagination of the print book.
[p. 29 ]

3 Best Practices in Interrater Reliability: Three Common Approaches

Steven E. Stemler

Jessica Tsai
The concept of interrater reliability permeates many facets of modern society.

For example, court cases based on a trial by jury require unanimous agreement from jurors regarding the verdict, life-threatening medical diagnoses often require a second or third opinion from health care professionals, student essays written in the context of high-stakes standardized testing receive points based on the judgment of multiple readers, and Olympic competitions, such as figure skating, award medals to participants based on quantitative ratings of performance provided by an international panel of judges.

Any time multiple judges are used to determine important outcomes, certain technical
and procedural questions emerge. Some of the more common questions are as follows:
How many raters do we need to be confident in our results? What is the minimum level of agreement that our raters should achieve? And is it necessary for raters to agree exactly, or is it acceptable for them to differ from each other so long as their difference is systematic and can therefore be corrected?

Key Questions to Ask Before Conducting an Interrater Reliability Study
If you are at the point in your research where you are considering conducting an
interrater reliability study, then there are three important questions worth considering:

1. What is the purpose of conducting your interrater reliability study?
2. What is the nature of your data?
3. What resources do you have at your disposal (e.g., technical expertise, time, money)?

The answers to these questions will help determine the best statistical approach to use
for your study.

[p. 30 ]

What Is the Purpose of Conducting Your Interrater Reliability Study?
There are three main reasons why people may wish to conduct an interrater reliability
study. Perhaps the most popular reason is that the researcher is interested in getting
a single final score on a variable (such as an essay grade) for use in subsequent data
analysis and statistical modeling but first must prove that the scoring is not subjective
or biased. For example, this is often the goal in the context of educational testing, where large-scale state testing programs might use multiple raters to grade student essays for the ultimate purpose of providing an overall appraisal of each student's current level of academic achievement. In such cases, the documentation of interrater reliability is usually just a means to an end (the end of creating a single summary score for use in subsequent data analyses), and the researcher may have little inherent interest in the details of the interrater reliability analysis per se. This is a perfectly acceptable reason for wanting to conduct an interrater reliability study; however,
researchers must be particularly cautious about the assumptions they are making when
summarizing the data from multiple raters to generate a single summary score for each
student. For example, simply taking the mean of the ratings of two independent raters
may, in some circumstances, actually lead to biased estimates of student ability, even
when the scoring by independent raters is highly correlated (we return to this point later
in the chapter).

A second common reason for conducting an interrater reliability study is to evaluate a newly developed scoring rubric to see if it is working or if it needs to be modified.
For example, one may wish to evaluate the accuracy of multiple ratings in the absence
of a gold standard. Consider a situation in which independent judges must rate
the creativity of a piece of artwork. Because there is no objective rule to indicate the
true creativity of a piece of art, a minimum first step in establishing that there is such
a thing as creativity is to demonstrate that independent raters can at least reliably
classify objects according to how well they meet the assumptions of the construct.
Thus, independent observers must subjectively interpret the work of art and rate the
degree to which an underlying construct (e.g., creativity) is present. In situations such
as these, the establishment of interrater reliability becomes a goal in and of itself. If a
researcher is able to demonstrate that independent parties can reliably rate objects
along the continuum of the construct, this provides some good objective evidence for
the existence of the construct. A natural subsequent step is to analyze individual scores
according to the criteria.

Finally, a third reason for conducting an interrater reliability study is to validate how
well ratings reflect a known true state of affairs (e.g., a validation study). For example,
suppose that a researcher believes that he or she has developed a new colon cancer
screening technique that should be highly predictive. The first thing the researcher
might do is train another provider to use the technique and compare the extent to
which the independent rater agrees with him or her on the classification of people
who have cancer and those who do not. Next, the researcher might attempt to predict
the prevalence of cancer using a formal diagnosis via more traditional methods (e.g.,
biopsy) to compare the extent to which the new technique is accurately predicting the
diagnosis generated by the known technique. In other words, the reason for conducting
an interrater reliability study in this circumstance is because it is not enough that
independent raters have high levels of interrater reliability; what really matters is the
level of reliability in predicting the actual occurrence of cancer as compared with a gold standard, in this case, the rate of classification based on an established technique.

Once you have determined the primary purpose for conducting an interrater reliability
study, the next step is to consider the nature of the data that you have or will collect.

What Is the Nature of Your Data?


There are four important points to consider with regard to the nature of your data. First,
it is important to know whether your data are considered nominal, ordinal, interval, or
ratio (Stevens, 1946). Certain statistical techniques are better suited to certain types
of data. For example, if the data you are evaluating are nominal (i.e., the differences
between the categories you are rating are qualitative), then there are relatively few
statistical methods for you to choose from (e.g., percent agreement, Cohen's kappa).
If, on the other hand, the data are measured at [p. 31 ] the ratio level, then the data
meet the criteria for use by most of the techniques discussed in this chapter.

Once you have determined the type of data used for the rating scale, you should then
examine the distribution of your data using a histogram or bar chart. Are the ratings
of each rater normally distributed, uniformly distributed, or skewed? If the rating data
exhibit restricted variability, this can severely affect consistency estimates as well
as consensus-based estimates, threatening the validity of the interpretations made
from the interrater reliability estimates. Thus, it is important to have some idea of the
distribution of ratings in order to select the best statistical technique for analyzing the
data.

The third important thing to investigate is whether the judges who rated the data agreed
on the underlying trait definition. For example, if two raters are judging the creativity
of a piece of artwork, one rater may believe that creativity is 50% novelty and 50%
task appropriateness. By contrast, another rater may judge creativity to consist of
50% novelty, 35% task appropriateness, and 15% elaboration. These differences in
perception will introduce extraneous error into the ratings. The extent to which your
raters are defining the construct in a similar way can be empirically evaluated using
measurement approaches to interrater reliability (e.g., factor analysis, a procedure that is further described later in this chapter).

Finally, even if the raters agree as to the structure, do they assign people into the same category along the continuum, or does one judge rate a person as "poor" in mathematics while another judge classifies that same person as "good"? In other words, are they using the rating categories the same way? This can be evaluated using consensus estimates (e.g., via tests of marginal homogeneity).

After specifying the purpose of the study and thinking about the nature of the data that
will be used in the analysis, the final question to ask is the pragmatic question of what
resources you have at your disposal.

What Resources Do You Have at Your Disposal?
As most people know from their life experience, "best" does not always mean most expensive or most resource-intensive. Similarly, within the context of interrater reliability, it is not always necessary to choose a technique that yields the maximum amount of information or that requires sophisticated statistical analyses in order to gain useful information. There are times when a crude estimate may yield sufficient information, for example, within the context of a low-stakes, exploratory research study. There are other times when the estimates must be as precise as possible, for example, within the context of situations that have direct, important stakes for the participants in the study.

The question of resources often has an influence on the way that interrater reliability
studies are conducted. For example, if you are a newcomer who is running a pilot
study to determine whether to continue on a particular line of research, and time and
money are limited, then a simpler technique such as the percent agreement, kappa, or
even correlational estimates may be the best match. On the other hand, if you are in a
situation where you have a high-stakes test that needs to be graded relatively quickly, and money is not a major issue, then a more advanced measurement approach (e.g., the many-facets Rasch model) is most likely the best selection.

As an additional example, if the goal of your study is to understand the underlying nature of a construct that to date has no objective, agreed-on definition (e.g., wisdom),
then achieving consensus among raters in applying a scoring criterion will be of
paramount importance. By contrast, if the goal of the study is to generate summary
scores for individuals that will be used in later analyses, and it is not critical that
raters come to exact agreement on how to use a rating scale, then consistency or
measurement estimates of interrater reliability will be sufficient.

Summary
Once you have answered the three main questions discussed in this section, you will be in a much better position to choose a suitable technique for your project. In the next section of this chapter, we will discuss (a) the most popular statistics used to compute interrater reliability, (b) the computation and interpretation of the results of statistics using worked examples, (c) the implications for summarizing data that follow from each technique, and (d) the advantages and disadvantages of each technique.

[p. 32 ]

Choosing the Best Approach for the Job


Many textbooks in the field of educational and psychological measurement and
statistics (e.g., Anastasi & Urbina, 1997; Cohen, Cohen, West, & Aiken, 2003; Crocker
& Algina, 1986; Hopkins, 1998; von Eye & Mun, 2004) describe interrater reliability
as if it were a unitary concept lending itself to a single, best approach across all
situations. Yet, the methodological literature related to interrater reliability constitutes
a hodgepodge of statistical techniques, each of which provides a particular kind of
solution to the problem of establishing interrater reliability.

Building on the work of Uebersax (2002) and J. R. Hayes and Hatch (1999), Stemler
(2004) has argued that the wide variety of statistical techniques used for computing
interrater reliability coefficients may be theoretically classified into one of three broad
categories: (a) consensus estimates, (b) consistency estimates, and (c) measurement
estimates. Statistics associated with these three categories differ in their assumptions
about the purpose of the interrater reliability study, the nature of the data, and the
implications for summarizing scores from various raters.

Consensus Estimates of Interrater Reliability
Consensus estimates are often used when one is attempting to demonstrate that
a construct that traditionally has been considered highly subjective (e.g., creativity,
wisdom, hate) can be reliably captured by independent raters. The assumption is that
if independent raters are able to come to exact agreement about how to apply the
various levels of a scoring rubric (which operationally defines behaviors associated
with the construct), then this provides some defensible evidence for the existence
of the construct. Furthermore, if two independent judges demonstrate high levels of
agreement in their application of a scoring rubric to rate behaviors, then the two judges
may be said to share a common interpretation of the construct.

Consensus estimates tend to be the most useful when data are nominal in nature and
different levels of the rating scale represent qualitatively different ideas. Consensus
estimates also can be useful when different levels of the rating scale are assumed to
represent a linear continuum of the construct but are ordinal in nature (e.g., a Likert-type scale). In such cases, the judges must come to exact agreement about each of the
quantitative levels of the construct under investigation.

The three most popular types of consensus estimates of interrater reliability found in
the literature include (a) percent agreement and its variants, (b) Cohen's kappa and its
variants (Agresti, 1996; Cohen, 1960, 1968; Krippendorff, 2004), and (c) odds ratios.
Other less frequently used statistics that fall under this category include Jaccard's J and the G-Index (see Barrett, 2001).

Percent Agreement. Perhaps the most popular method for computing a consensus
estimate of interrater reliability is through the use of the simple percent agreement
statistic. For example, in a study examining creativity, Sternberg and Lubart (1995)
asked sets of judges to rate the level of creativity associated with each of a number of
products generated by study participants (e.g., draw a picture illustrating Earth from an insect's point of view, write an essay based on the title "2983"). The goal of their
study was to demonstrate that creativity could be detected and objectively scored with
high levels of agreement across independent judges. The authors reported percent
agreement levels across raters of .92 (Sternberg & Lubart, 1995, p. 31).

The percent agreement statistic has several advantages. For example, it has a strong
intuitive appeal, it is easy to calculate, and it is easy to explain. The statistic also has
some distinct disadvantages, however. If the behavior of interest has a low or high
incidence of occurrence in the population, then it is possible to get artificially inflated
percent agreement figures simply because most of the values fall under one category of
the rating scale (J. R. Hayes & Hatch, 1999). Another disadvantage to using the simple
percent agreement figure is that it is often time-consuming and labor-intensive to train
judges to the point of exact agreement.

One popular modification of the percent agreement figure found in the testing literature
involves broadening the definition of agreement by including the adjacent scoring
categories on the rating scale. For example, some testing programs include writing
sections that are scored by judges using a rating scale with levels ranging from 1 (low)
to 6 (high) (College Board, 2006). If a percent adjacent agreement approach were used
to score this section of the exam, this would [p. 33 ] mean that the judges would not
need to come to exact agreement about the ratings they assign to each participant;
rather, so long as the ratings did not differ by more than one point above or below the
other judge, then the two judges would be said to have reached consensus. Thus, if
Rater A assigns an essay a score of 3 and Rater B assigns the same essay a score of
4, the two raters are close enough together to say that they agree, even though their
agreement is not exact.

The rationale for the adjacent percent agreement approach is often a pragmatic one. It
is extremely difficult to train independent raters to come to exact agreement, no matter
how good one's scoring rubric. Yet, raters often give scores that are pretty close to
the same, and we do not want to discard this information. Thus, the thinking is that if
we have a situation in which two raters never differ by more than one score point in
assigning their ratings, then we have a justification for taking the average score across
all ratings. This logic holds under two conditions. First, the difference between raters
must be randomly distributed across items. In other words, Rater A should not give
systematically lower scores than Rater B. Second, the scores assigned by raters must
be evenly distributed across all possible score categories. In other words, both raters
should give equal numbers of 1s, 2s, 3s, 4s, 5s, and 6s across the population of essays
that they have read. If both of these assumptions are met, then the adjacent percent
agreement approach is defensible. If, however, either of these assumptions is violated,
this could lead to a situation in which the validity of the resultant summary scores is
dubious (see the box below).

Consider a situation in which Rater A systematically assigns scores that are one point lower than Rater B. Assume that they have each rated a common set of 100 essays. If
we average the scores of the two raters across all essays to arrive at individual student
scores, this seems, on the surface, to be defensible because it really does not matter whether Rater A or Rater B is assigning the high or low score; the average score would be the same even if Rater A and Rater B had no systematic difference in severity of ratings. However, suppose that dozens of raters are used to score the essays.
Imagine that Rater C is also called in to rate the same essay for a different sample of
students. Rater C is paired up with Rater B within the context of an overlapping design
to maximize rater efficiency (e.g., McArdle, 1994). Suppose that we find a situation in
which Rater B is systematically lower than Rater C in assigning grades. In other words,
Rater A is systematically one point lower than Rater B, and Rater B is systematically
one point lower than Rater C.

On the surface, again, it seems logical to average the scores assigned by Rater B and
Rater C. Yet, we now find ourselves in a situation in which the students rated by the
Rater B/C pair score systematically one point higher than the students rated by the
Rater A/B pair, even though neither combination of raters differed by more than one
score point in their ratings, thereby demonstrating interrater reliability. Which student
would you rather be? The one who was lucky enough to draw the B/C rater combination
or the one who unfortunately was scored by the A/B combination?

Thus, in order to make a validity argument for summarizing the results of multiple raters,
it is not enough to demonstrate adjacent percent agreement between rater pairs; it must
also be demonstrated that there is no systematic difference in rater severity between
the rater set pairs.

This can be demonstrated (and corrected for in the final score) through the use of the many-facets Rasch model.

Now let us examine what happens if the second assumption of the adjacent percent
agreement approach is violated. If you are a rater for a large testing company, and
you are told that you will be retained only if you are able to demonstrate interrater
reliability with everyone else, you would naturally look for your best strategy to maximize
interrater reliability. If you are then told that your scores can differ by no more than
one point from the other raters, you would quickly discover that your best bet then is to
avoid giving any ratings at the extreme ends of the scale (i.e., a rating of 1 or a rating
of 6). Why? Because a rating at the extreme end of the scale (e.g., 6) has two potential
scores with which it can overlap (i.e., 5 or 6), whereas a rating of 5 would allow you to
potentially agree with three scores

[p. 34 ] (i.e., 4, 5, or 6), thereby maximizing your chances of agreeing with the second
rater. Thus, it is entirely likely that the scale will go from being a 6-point scale to a 4-point scale, reducing the overall variability in scores given across the spectrum of participants. If only four categories are used, then the percent agreement statistics will be artificially inflated due to chance factors. For example, when a scale is 1 to 6, two raters are expected to agree on ratings by chance alone only 17% of the time. When the scale is reduced to 1 to 4, the percent agreement expected by chance jumps to 25%. If three categories are used, a 33% chance agreement is expected; if two categories, a 50% chance agreement is expected. In other words, a 6-point scale that uses adjacent percent agreement scoring is most likely functionally equivalent to a 4-point scale that uses exact agreement scoring.
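These chance-agreement figures are easy to verify. The short Python sketch below (an illustrative aside, not part of the chapter's SPSS workflow) computes the probability that two raters land on exactly the same category purely by chance, assuming both raters use every category of a k-point scale equally often:

```python
# Expected exact agreement by chance for two raters who each use all k
# categories of a scale with equal frequency: the pair lands in the same
# one of the k x k equally likely cells with probability k * (1/k)^2 = 1/k.
for k in (6, 4, 3, 2):
    print(f"{k}-point scale: {1 / k:.0%} exact agreement expected by chance")
```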

This approach is advantageous in that it relaxes the strict criterion that the judges agree
exactly. On the other hand, percent agreement using adjacent categories can lead to
inflated estimates of interrater reliability if there are only a limited number of categories
to choose from (e.g., a 1-4 scale). If the rating scale has a limited number of points,
then nearly all points will be adjacent, and it would be surprising to find agreement lower
than 90%.

Cohen's Kappa. Another popular consensus estimate of interrater reliability is Cohen's kappa statistic (Cohen, 1960, 1968). Cohen's kappa was designed to estimate
the degree of consensus between two judges and determine whether the level of
agreement is greater than would be expected to be observed by chance alone (see
Stemler, 2001, for a practical example with calculation). The interpretation of the kappa
statistic is slightly different from the interpretation of the percent agreement figure
(Agresti, 1996). A value of zero on kappa does not indicate that the two judges did not
agree at all; rather, it indicates that the two judges did not agree with each other any
more than would be predicted by chance alone. Consequently, it is possible to have
negative values of kappa if judges agree less often than chance would predict. Kappa is
a highly useful statistic when one is concerned that the percent agreement statistic may
be artificially inflated due to the fact that most observations fall into a single category.

Kappa is often useful within the context of exploratory research. For example, Stemler
and Bebell (1999) conducted a study aimed at detecting the various purposes of
schooling articulated in school mission statements. Judges were given a scoring rubric
that listed 10 possible thematic categories under which the main idea of each mission
statement could be classified (e.g., social development, cognitive development, civic
development). Judges then read a series of mission statements and attempted to
classify each sampling unit according to the major purpose of schooling articulated.
If both judges consistently rated the dominant theme of the mission statement as
representing elements of citizenship, then they were said to have communicated with
each other in a meaningful way because they had both classified the statement in the
same way. If one judge classified the major theme as social development, and the
other judge classified the major theme as citizenship, then a breakdown in shared
understanding occurred. In that case, the judges were not coming to a consensus
on how to apply the levels of the scoring rubric. The authors chose to use the kappa
statistic to evaluate the degree of consensus because they did not expect the frequency
of the major themes of the mission statements to be evenly distributed across the 10
categories of their scoring rubric.

Although some authors (Landis & Koch, 1977) have offered guidelines for interpreting
kappa values, other authors (Krippendorff, 2004; Uebersax, 2002) have argued that
the kappa values for different items or from different studies cannot be meaningfully
compared unless the base rates are identical. Consequently, these authors suggest
that although the statistic gives some indication as to whether the agreement is better
than that predicted by chance alone, it is difficult to apply rules of thumb for interpreting
kappa across different circumstances. Instead, Uebersax (2002) suggests that
researchers using the kappa coefficient look at it [p. 35 ] for an up-or-down evaluation of whether ratings are different from chance, but they should not get too invested in its interpretation.

Krippendorff (2004) has introduced a new coefficient, alpha, into the literature that claims to be superior to kappa because alpha is capable of incorporating the information from multiple raters, dealing with missing data, and yielding a chance-corrected estimate of interrater reliability. The major disadvantage of Krippendorff's alpha is that it is computationally complex; however, statistical macros that compute Krippendorff's alpha have been created and are freely available (K. Hayes, 2006). In addition, some research suggests that in practice, alpha values tend to be nearly identical to kappa values (Dooley, 2006).

Odds Ratios. A third consensus estimate of interrater reliability is the odds ratio. The
odds ratio is most often used in circumstances where raters are making dichotomous
ratings (e.g., presence/absence of a phenomenon), although it can be extended to
ordered category ratings. In a 2 × 2 contingency table, the odds ratio indicates how
much the odds of one rater making a given rating (e.g., positive/negative) increase for
cases when the other rater has made the same rating. For example, suppose that in
a music competition with 100 contestants, Rater 1 gives 90 of them a positive score
for vocal ability, while in the same sample of 100 contestants, Rater 2 only gives 20 of
them a positive score for vocal ability. The odds of Rater 1 giving a positive vocal ability
score are 90 to 10, or 9:1, while the odds of Rater 2 giving a positive vocal ability score
are only 20 to 80, or 1:4 = 0.25:1. Now, 9/0.25 = 36, so the odds ratio is 36. Within the
context of interrater reliability, the important idea captured by the odds ratio is whether it
deviates substantially from 1.0. From the perspective of interrater reliability, it would be
most desirable to have an odds ratio that is close to 1.0, which would indicate that Rater
1 and Rater 2 rated the same proportion of contestants as having high vocal ability. The
larger the odds ratio value, the larger the discrepancy there is between raters in terms
of their level of consensus.

The odds ratio has the advantage of being easy to compute and is familiar from
other statistical applications (e.g., logistic regression). The disadvantage to the odds
ratio is that it is most intuitive within the context of a 2 × 2 contingency table with
dichotomous rating categories. Although the technique can be generalized to ordered
category ratings, it involves extra computational complexity that undermines its intuitive
advantage. Furthermore, as Osborne (2006) has pointed out, although the odds ratio
is straightforward to compute, the interpretation of the statistic is not always easy to
convey, particularly to a lay audience.

Computing Common Consensus Estimates of Interrater Reliability
Let us now turn to a practical example of how to calculate each of these coefficients. As
an example data set, we will draw from Stemler, Grigorenko, Jarvin, and Sternberg's
(2006) study in which they developed augmented versions of the Advanced Placement
Psychology Examination. Participants were required to complete a number of essay
items that were subsequently scored by different sets of raters. Essay Question 1, Part
d was a question that asked participants to give advice to a friend who is having trouble
sleeping, based on what they know about various theories of sleep. The item was
scored using a 5-point scoring rubric. For this particular item, 75 participants received
scores from two independent raters.

Percent Agreement. Percent agreement is calculated by adding up the number of cases that received the same rating by both judges and dividing that number by the
total number of cases rated by the two judges. Using SPSS, one can run the crosstabs
procedure and generate a table to facilitate the calculation (see Table 3.1). The percent
agreement on this item is 42%; however, the percent adjacent agreement is 87%.
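For readers who prefer to check the calculation outside SPSS, the following Python sketch computes exact and adjacent percent agreement from two vectors of ratings. The rating values here are hypothetical stand-ins, not the actual study data, so the printed figures will not match the 42% and 87% reported above.

```python
import numpy as np

# Hypothetical ratings from two raters on the same essays (0-4 rubric).
rater1 = np.array([0, 1, 2, 3, 4, 2, 3, 1, 4, 2])
rater2 = np.array([0, 2, 2, 4, 4, 1, 3, 1, 3, 2])

exact_agreement = np.mean(rater1 == rater2)                 # identical scores
adjacent_agreement = np.mean(np.abs(rater1 - rater2) <= 1)  # within one point

print(f"Percent exact agreement:    {exact_agreement:.0%}")
print(f"Percent adjacent agreement: {adjacent_agreement:.0%}")
```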

Cohen's Kappa. The formula for computing Cohen's kappa is listed in Formula 1.


\[
\kappa = \frac{P_A - P_C}{1 - P_C}
\]

where \( P_A \) = proportion of units on which the raters agree, and \( P_C \) = the proportion of units for which agreement is expected by chance.

It is possible to compute Cohen's kappa in SPSS by simply specifying in the crosstabs procedure the desire to produce Cohen's kappa (see Table 3.1). For this data set, the
kappa value is [p. 36 ] .23, which indicates that the two raters agreed on the scoring
only slightly more often than we would predict based on chance alone.
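The same quantities can also be scripted directly from Formula 1. The Python sketch below builds the raters' contingency table, derives the observed and chance agreement proportions from its marginals, and applies the kappa formula; the ratings are hypothetical, so the result will differ from the .23 reported for the study item.

```python
import numpy as np

# Hypothetical 0-4 ratings from two raters; not the actual study data.
rater1 = np.array([0, 1, 2, 3, 4, 2, 3, 1, 4, 2])
rater2 = np.array([0, 2, 2, 4, 4, 1, 3, 1, 3, 2])

categories = np.arange(5)
# Contingency table: rows index rater 1's categories, columns rater 2's.
table = np.array([[np.sum((rater1 == i) & (rater2 == j)) for j in categories]
                  for i in categories], dtype=float)
n = table.sum()

p_a = np.trace(table) / n                                   # observed agreement
p_c = np.sum(table.sum(axis=1) * table.sum(axis=0)) / n**2  # chance agreement from the marginals
kappa = (p_a - p_c) / (1 - p_c)                             # Formula 1
print(f"Cohen's kappa = {kappa:.2f}")
```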

Table 3.1 SPSS Code and Output for Percent Agreement, Percent Adjacent Agreement, and Cohen's Kappa

Odds Ratios. The formula for computing an odds ratio is shown in Formula 2.

The SPSS code for computing the odds ratio is shown in Table 3.2. In order to compute
the odds ratio using the crosstabs procedure in SPSS, it was necessary to recode
the data so that the ratings were dichotomous. Consequently, ratings of 0, 1, and 2
were assigned a value of 0 (failing) while ratings of 3 and 4 were assigned a value of
1 (passing). The odds ratio for the current data set is 30, indicating that there was a
substantial difference between the raters in terms of the proportion of students classified
as passing versus failing.
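Because Formula 2 is not reproduced here, the Python sketch below assumes the usual cross-product form of the odds ratio for a 2 × 2 table, OR = (a × d) / (b × c), and mirrors the recoding step described above; the ratings are hypothetical, so the printed value will not be the 30 obtained for the study data.

```python
import numpy as np

# Hypothetical 0-4 ratings from two raters; not the actual study data.
rater1 = np.array([0, 1, 2, 3, 4, 2, 3, 1, 4, 2, 0, 3, 4, 1, 2])
rater2 = np.array([0, 2, 2, 4, 4, 1, 3, 1, 3, 2, 1, 2, 4, 0, 3])

# Recode as described above: ratings of 0-2 become 0 (failing), 3-4 become 1 (passing).
pass1 = (rater1 >= 3).astype(int)
pass2 = (rater2 >= 3).astype(int)

# Cells of the 2 x 2 table.
a = np.sum((pass1 == 1) & (pass2 == 1))   # both raters: passing
b = np.sum((pass1 == 1) & (pass2 == 0))   # rater 1 passing, rater 2 failing
c = np.sum((pass1 == 0) & (pass2 == 1))   # rater 1 failing, rater 2 passing
d = np.sum((pass1 == 0) & (pass2 == 0))   # both raters: failing

# Cross-product odds ratio; note it is undefined when b or c equals zero.
odds_ratio = (a * d) / (b * c)
print(f"odds ratio = {odds_ratio:.2f}")
```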

Implications for Summarizing Scores From Various Raters
If raters can be trained to the point where they agree on how to assign scores from a
rubric, then scores given by the two raters may be treated as equivalent. This fact has
practical implications for determining the number of raters needed to complete a study.
Thus, the remaining work of rating subsequent items can be split between the raters
without both raters having to score all items. Furthermore, the summary scores may
be calculated by simply taking the score from one of the judges or by averaging the
scores given by all of the judges, since high interrater reliability indicates that the judges
agree about how to apply the rating scale. A typical guideline found in the literature
for evaluating the quality of interrater reliability based on consensus estimates is that
they should be 70% or greater. If raters are shown to reach high levels of consensus,
then adding more raters adds little extra information from a statistical perspective and is
probably not justified from the perspective of resources.

[p. 37 ]

Table 3.2 SPSS Code and Output for Odds Ratios

Advantages of Consensus Estimates


One particular advantage of the consensus approach to estimating interrater reliability is
that the calculations are easily done by hand.

A second advantage is that the techniques falling within this general category are well
suited to dealing with nominal variables whose levels on the rating scale represent
qualitatively different categories. A third advantage is that consensus estimates can
be useful in diagnosing problems with judges' interpretations of how to apply the rating
scale. For example, inspection of the information from a crosstab table may allow the
researcher to realize that the judges may be unclear about the rules for when they are
supposed to score an item as zero as opposed to when they are supposed to score the
item as missing. A visual analysis of the output allows the researcher to go back to the
data and clarify the discrepancy or retrain the judges.

When judges exhibit a high level of consensus, it implies that both judges are
essentially providing the same information. One implication of a high [p. 38 ]
consensus estimate of interrater reliability is that both judges need not score all
remaining items. For example, if there were 100 tests to be scored after the interrater
reliability study was finished, it would be most efficient to ask Judge A to rate exams 1
to 50 and Judge B to rate exams 51 to 100 because the two judges have empirically
demonstrated that they share a similar meaning for the scoring rubric. In practice,
however, it is usually a good idea to build in a 30% overlap between judges even after
they have been trained, in order to provide evidence that the judges are not drifting from
their consensus as they read more items.

Disadvantages of Consensus Estimates


One disadvantage of consensus estimates is that interrater reliability statistics must
be computed separately for each item and for each pair of judges. Consequently,
when reporting consensusbased interrater reliability estimates, one should report the
minimum, maximum, and median estimates for all items and for all pairs of judges.

A second disadvantage is that the amount of time and energy it takes to train judges to
come to exact agreement is often substantial, particularly in applications where exact
agreement is unnecessary (e.g., if the exact application of the levels of the scoring
rubric is not important, but rather a means to the end of getting a summary score for
each respondent).

Third, as Linacre (2002) has noted, training judges to a point of forced consensus may
actually reduce the statistical independence of the ratings and threaten the validity of
the resulting scores.

Finally, consensus estimates can be overly conservative if two judges exhibit systematic
differences in the way that they use the scoring rubric but simply cannot be trained to
come to a consensus. As we will see in the next section, it is possible to have a low
consensus estimate of interrater reliability while having a high consistency estimate and
vice versa. Consequently, sole reliance on consensus estimates of interrater reliability
might lead researchers to conclude that interrater reliability is low when it may be more
precisely stated that the consensus estimate of interrater reliability is low.

Consistency Estimates of Interrater Reliability
Consistency estimates of interrater reliability are based on the assumption that it is not
really necessary for raters to share a common interpretation of the rating scale, so long
as each judge is consistent in classifying the phenomenon according to his or her own
definition of the scale. For example, if Rater A assigns a score of 3 to a certain group of
essays, and Rater B assigns a score of 1 to that same group of essays, the two raters
have not come to a consensus about how to apply the rating scale categories, but the
difference in how they apply the rating scale categories is predictable.

Consistency approaches to estimating interrater reliability are most useful when the
data are continuous in nature, although the technique can be applied to categorical
data if the rating scale categories are thought to represent an underlying continuum
along a unidimensional construct. Values greater than .70 are typically acceptable for
consistency estimates of interrater reliability (Barrett, 2001).

The three most popular types of consistency estimates are (a) correlation coefficients
(e.g., Pearson, Spearman), (b) Cronbach's alpha (Cronbach, 1951), and (c) intraclass
correlation. For information regarding additional consistency estimates of interrater
reliability, see Bock, Brennan, and Muraki (2002); Burke and Dunlap (2002); LeBreton,
Burgess, Kaiser, Atchley, and James (2003); and Uebersax (2002).

Correlation Coefficients. Perhaps the most popular statistic for calculating the degree
of consistency between raters is the Pearson correlation coefficient. Correlation
coefficients measure the association between independent raters. Values approaching
+1 or -1 indicate that the two raters are following a systematic pattern in their ratings,
while values approaching zero indicate that it is nearly impossible to predict the score
one rater would give by knowing the score the other rater gave. It is important to note
that even though the correlation between scores assigned by two judges may be nearly
perfect, there may be substantial mean differences between the raters. In other words,
two raters may differ in the absolute values they assign to each rating by two points;
however, so long as there is a 2point difference for each rating they assign, the raters
will have achieved high consistency estimates of interrater reliability. Thus, a large
value for a measure of association does not imply that the raters are agreeing on the
actual application of the rating scale, only that they are consistent in applying the ratings
according to their own unique understanding of the scoring rubric.
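The point is easy to demonstrate with a toy example. In the hypothetical sketch below, Rater B is assumed to score every essay exactly 2 points higher than Rater A: the Pearson correlation is perfect even though the two raters never agree exactly.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical scenario: Rater B is exactly 2 points more lenient on every essay.
rater_a = np.array([1, 2, 3, 2, 4, 1, 3])
rater_b = rater_a + 2

r, _ = pearsonr(rater_a, rater_b)
exact_agreement = np.mean(rater_a == rater_b)

print(f"Pearson r = {r:.2f}")                      # perfect consistency (1.00)
print(f"Exact agreement = {exact_agreement:.0%}")  # no consensus at all (0%)
```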

[p. 39 ] The Pearson correlation coefficient can be computed by hand (Glass &
Hopkins, 1996) or can easily be computed using most statistical packages. One
beneficial feature of the Pearson correlation coefficient is that the scores on the rating
scale can be continuous in nature (e.g., they can take on partial values such as 1.5).
Like the percent agreement statistic, the Pearson correlation coefficients can be
calculated only for one pair of judges at a time and for one item at a time.

A potential limitation of the Pearson correlation coefficient is that it assumes that the
data underlying the rating scale are normally distributed. Consequently, if the data from
the rating scale tend to be skewed toward one end of the distribution, this will attenuate
the upper limit of the correlation coefficient that can be observed. The Spearman rank
coefficient provides an approximation of the Pearson correlation coefficient but may be
used in circumstances where the data under investigation are not normally distributed.
For example, rather than using a continuous rating scale, each judge may rank order
the essays that he or she has scored from best to worst. In this case, then, since both
ratings being correlated are in the form of rankings, a correlation coefficient can be
computed that is governed by the number of pairs of ratings (Glass & Hopkins, 1996).

The major disadvantage to Spearman's rank coefficient is that it requires both judges to
rate all cases.

Cronbach's Alpha. In situations where more than two raters are used, another approach
to computing a consistency estimate of interrater reliability would be to compute
Cronbach's alpha coefficient (Crocker & Algina, 1986). Cronbach's alpha coefficient
is a measure of internal consistency reliability and is useful for understanding the
extent to which the ratings from a group of judges hold together to measure a common
dimension. If the Cronbach's alpha estimate among the judges is low, then this implies
that the majority of the variance in the total composite score is really due to error
variance and not true score variance (Crocker & Algina, 1986).

The major advantage of using Cronbach's alpha comes from its capacity to yield a
single consistency estimate of interrater reliability across multiple judges. The major
disadvantage of the method is that each judge must give a rating on every case, or
else the alpha will only be computed on a subset of the data. In other words, if just one
rater fails to score a particular individual, that individual will be left out of the analysis. In
addition, as Barrett (2001) has noted, because of this averaging of ratings, "we reduce the variability of the judges' ratings such that when we average all judges' ratings, we effectively remove all the error variance for judges" (p. 7).

Intraclass Correlation. A third popular approach to estimating interrater reliability is through the use of the intraclass correlation coefficient. An interesting feature of the intraclass correlation coefficient is that it confounds two ways in which raters differ: (a) consensus (or bias, i.e., mean differences) and (b) consistency (or association). As a
result, the value of the intraclass correlation coefficient will be decreased in situations
where there is a low correlation between raters and in situations where there are large
mean differences between raters. For this reason, the intraclass correlation may be
considered a conservative estimate of interrater reliability. If the intraclass correlation
coefficient is close to 1, then chances are good that this implies that excellent interrater
reliability has been achieved.

The major advantage of the intraclass correlation is its capacity to incorporate information from different types of rater reliability data. On the other hand, as Uebersax (2002) has noted, "If the goal is to give feedback to raters to improve future ratings,
one should distinguish between these two sources of disagreement" (p. 5). In addition, because the intraclass correlation depends on the ratio of between-subject variance to within-subject variance on a rating scale, the results may not look the same if raters are rating a homogeneous subpopulation as opposed to the general population. Simply by restricting the between-subject variance, the intraclass correlation will be lowered. Therefore, it is important to pay special attention to the population being assessed and to understand that this can influence the value of the intraclass correlation coefficient (ICC). For this reason, ICCs are not directly comparable across populations. Finally, it is important to note that, like the Pearson correlation coefficient, the intraclass correlation coefficient will be attenuated if assumptions of normality in rating data are violated.

Computing Common Consistency Estimates of Interrater Reliability
Let us now turn to a practical example of how to calculate each of these coefficients.
We will use the same data set and compute each estimate on the data.

[p. 40 ] Correlation Coefficients. The formula for computing the Pearson correlation
coefficient is listed in Formula 3.

Using SPSS, one can run the correlate procedure and generate a table similar to
Table 3.3. One may request both Pearson and Spearman correlation coefficients.
The Pearson correlation coefficient on this data set is .76; the Spearman correlation
coefficient is .74.
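Outside SPSS, both coefficients are available in SciPy. The sketch below uses hypothetical ratings, so the values will not reproduce the .76 and .74 reported for the study item.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical 0-4 ratings from two raters; not the actual study data.
rater1 = np.array([0, 1, 2, 3, 4, 2, 3, 1, 4, 2])
rater2 = np.array([0, 2, 2, 4, 4, 1, 3, 1, 3, 2])

pearson_r, _ = pearsonr(rater1, rater2)      # linear association
spearman_rho, _ = spearmanr(rater1, rater2)  # rank-order association

print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_rho:.2f}")
```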

Cronbach's Alpha. The Cronbach's alpha value is calculated using Formula 4,

\[
\alpha = \frac{N}{N - 1}\left(1 - \frac{\sum_{i=1}^{N} \sigma^{2}_{Y_i}}{\sigma^{2}_{X}}\right)
\]

where N is the number of components (raters), \( \sigma^{2}_{X} \) is the variance of the observed total scores, and \( \sigma^{2}_{Y_i} \) is the variance of component i.

In order to compute Cronbach's alpha using SPSS, one may simply specify in the reliability procedure the desire to produce Cronbach's alpha (see Table 3.4). For this example, the alpha value is .86.
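Formula 4 is also simple to implement directly. In the sketch below, rows are essays and columns are raters; the data are hypothetical, so the result will not match the .86 reported above.

```python
import numpy as np

# Hypothetical ratings: rows are essays, columns are three raters; not the study data.
ratings = np.array([
    [0, 0, 1], [1, 2, 2], [2, 2, 3], [3, 4, 3], [4, 4, 4],
    [2, 1, 2], [3, 3, 4], [1, 1, 2], [4, 3, 4], [2, 2, 1],
], dtype=float)

n_raters = ratings.shape[1]
rater_variances = ratings.var(axis=0, ddof=1)     # variance of each rater's column of scores
total_variance = ratings.sum(axis=1).var(ddof=1)  # variance of the summed (total) scores

# Formula 4: alpha = N/(N - 1) * (1 - sum of component variances / variance of totals).
alpha = (n_raters / (n_raters - 1)) * (1 - rater_variances.sum() / total_variance)
print(f"Cronbach's alpha = {alpha:.2f}")
```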

Table 3.3 SPSS Code and Output for Pearson and Spearman Correlations

[p. 41 ]

Table 3.4 SPSS Code and Output for Cronbach's Alpha

Intraclass Correlation. Formula 5 presents the equation used to compute the intraclass correlation value, where \( \sigma^{2}(b) \) is the variance of the ratings between judges, and \( \sigma^{2}(w) \) is the pooled variance within raters.
In order to compute intraclass correlation, one may specify the procedure in SPSS
using the code listed in Table 3.5. The intraclass correlation coefficient for this data set
is .75.
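SPSS offers several ICC models, and the chapter does not specify which variant produced the .75 above, so the sketch below implements one common choice, the one-way random-effects ICC for a single rating, from the ANOVA mean squares; the ratings are hypothetical.

```python
import numpy as np

# Hypothetical ratings: rows are essays (targets), columns are two raters; not the study data.
ratings = np.array([
    [0, 0], [1, 2], [2, 2], [3, 4], [4, 4],
    [2, 1], [3, 3], [1, 1], [4, 3], [2, 2],
], dtype=float)

n_targets, k_raters = ratings.shape
grand_mean = ratings.mean()
target_means = ratings.mean(axis=1)

# One-way ANOVA decomposition into between-target and within-target mean squares.
ms_between = k_raters * np.sum((target_means - grand_mean) ** 2) / (n_targets - 1)
ms_within = np.sum((ratings - target_means[:, None]) ** 2) / (n_targets * (k_raters - 1))

# One-way random-effects ICC for a single rating: ICC(1,1).
icc = (ms_between - ms_within) / (ms_between + (k_raters - 1) * ms_within)
print(f"ICC(1,1) = {icc:.2f}")
```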

Implications for Summarizing Scores From Various Raters
It is important to recognize that although consistency estimates may be high, the
means and medians of the different judges may be very different. Thus, if one judge
consistently gives scores that are 2 points lower on the rating scale than does a second
judge, the scores will ultimately need to be corrected for this difference in judge severity
if the final scores are to be summarized or subjected to further analyses.

Table 3.5 SPSS Code and Output for Intraclass Correlation

[p. 42 ]

Advantages of Consistency Estimates


There are three major advantages to using consistency estimates of interrater reliability.
First, the approach places less stringent demands on the judges in that they need
not be trained to come to exact agreement with one another so long as each judge is
consistent within his or her own definition of the rating scale (i.e., exhibits high intrarater
reliability). It is sometimes the case that the exact application of the levels of the scoring
rubric is not important in itself. Instead, the scoring rubric is a means to the end of
creating scores for each participant that can be summarized in a meaningful way. If
summarization is the goal, then what is most important is that each judge apply the
rating scale consistently within his or her own definition of the rating scale, regardless
of whether the two judges exhibit exact agreement. Consistency estimates allow for
the detection of systematic differences between judges, which may then be adjusted
statistically. For example, if Judge A consistently gives scores that are 2 points lower
than Judge B does, then adding 2 extra points to the exams of all students who were
scored by Judge A would provide an equitable adjustment to the raw scores.
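As a minimal illustration of such a statistical adjustment (hypothetical scores, assuming the 2-point severity gap has already been established in the interrater reliability study):

```python
import numpy as np

# Hypothetical raw scores from the more severe judge (Judge A).
judge_a_scores = np.array([1, 3, 2, 4, 2])
severity_gap = 2  # Judge A assumed to score 2 points lower than Judge B on average

# Put Judge A's scores on Judge B's scale before pooling them for later analyses.
adjusted_scores = judge_a_scores + severity_gap
print(adjusted_scores)  # [3 5 4 6 4]
```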

A second advantage of consistency estimates is that certain methods within this category (e.g., Cronbach's alpha) allow for an overall estimate of consistency among
multiple judges. The third advantage is that consistency estimates readily handle
continuous data.

Disadvantages of Consistency Estimates


One disadvantage of consistency estimates is that if the construct under investigation
has some objective meaning, then it may not be desirable for the two judges to agree
to disagree. Instead, it may be important for the judges to come to an exact agreement
on the scores that they are generating.

A second disadvantage of consistency estimates is that judges may differ not only
systematically in the raw scores they apply but also in the number of rating scale
categories they use. In that case, a mean adjustment for a severe judge may provide a
partial solution, but the two judges may also differ on the variability in scores they give.
Thus, a mean adjustment alone will not effectively correct for this difference.

A third disadvantage of consistency estimates is that they are highly sensitive to the
distribution of the observed data. In other words, if most of the ratings fall into one or
two categories, the correlation coefficient will necessarily be deflated due to restricted
variability. Consequently, a reliance on the consistency estimate alone may lead the
researcher to falsely conclude that interrater reliability was poor without specifying more
precisely that the consistency estimate of interrater reliability was poor and providing an
appropriate rationale.

Measurement Estimates of Interrater Reliability
Measurement estimates are based on the assumption that one should use all of the
information available from all judges (including discrepant ratings) when attempting to
create a summary score for each respondent. In other words, each judge is seen as
providing some unique information that is useful in generating a summary score for
a person. As Linacre (2002) has noted, "It is the accumulation of information, not the ratings themselves, that is decisive" (p. 858). Consequently, under the measurement
approach, it is not necessary for two judges to come to a consensus on how to apply a
scoring rubric because differences in judge severity can be estimated and accounted for
in the creation of each participant's final score.

Measurement estimates are also useful in circumstances where multiple judges are
providing ratings, and it is impossible for all judges to rate all items. They are best used
when different levels of the rating scale are intended to represent different levels of an
underlying unidimensional construct (e.g., mathematical competence).

The two most popular types of measurement estimates are (a) factor analysis and (b) the many-facets Rasch model (Linacre, 1994; Linacre, Englehard, Tatem, & Myford, 1994; Myford & Cline, 2002) or log-linear models (von Eye & Mun, 2004).

Factor Analysis. One popular measurement estimate of interrater reliability is computed using factor analysis (Harman, 1967). Using this method, multiple judges may rate a set of participants. The judges' scores are then subjected to a common factor analysis in
order to determine the amount of shared variance in the ratings [p. 43 ] that could be
accounted for by a single factor. The percentage of variance that is explainable by the
first factor gives some indication of the extent to which the multiple judges are reaching
agreement. If the shared variance is high (e.g., greater than 60%), then this gives some
indication that the judges are rating a common construct. The technique can also be
used to check the extent to which judges agree on the number of underlying dimensions
in the data set.

Once interrater reliability has been established in this way, each participant may then
receive a single summary score corresponding to his or her loading on the first principal
component underlying the set of ratings. This score can be computed automatically by
most statistical packages.

The advantage of this approach is that it assigns a summary score for each participant
that is based only on the relevance of the strongest dimension underlying the data. The
disadvantage to the approach is that it assumes that ratings are assigned without error
by the judges.

Many-Facets Rasch Measurement and Log-Linear Models. A second measurement approach to estimating interrater reliability is through the use of the many-facets Rasch model² (Linacre, 1994). Recent advances in the field of measurement have led to an extension of the standard Rasch measurement model (Rasch, 1960/1980; Wright & Stone, 1979). This new, extended model, known as the many-facets Rasch model, allows judge severity to be derived using the same scale (i.e., the logit scale) as person ability and item difficulty. In other words, rather than simply assuming that a score of 3 from Judge A is equally difficult for a participant to achieve as a score of 3 from Judge B, the equivalence of the ratings between judges can be empirically determined. Thus, it could be the case that a score of 3 from Judge A is really closer to a score of 5 from Judge B (i.e., Judge A is a more severe rater). Using a many-facets analysis, each essay item or behavior that was rated can be directly compared.
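For readers who want to see the model itself, a common way of writing the many-facets (rating scale) formulation described above is shown below; the notation is the conventional one (person ability B, item difficulty D, judge severity C, rating scale step F) rather than anything reproduced from this chapter.

\[
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
\]

Here P_{nijk} is the probability that judge j awards person n a rating in category k on item i, and all four parameters are expressed in logits, which is what allows judge severity to be compared directly with person ability and item difficulty.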


In addition, the difficulty of each item, as well as the severity of all judges who rated the
items, can also be directly compared. For example, if a history exam included five essay
questions and each of the essay questions was rated by 3 judges (2 unique judges per
item and 1 judge who scored all items), the facets approach would allow the researcher
to directly compare the severity of a judge who rated only Item 1 with the severity of
a judge who rated only Item 4. Each of the 11 judges (2 unique judges per item + 1
judge who rated all items = 5 × 2 + 1 = 11) could be directly compared. The mathematical representation of the many-facets Rasch model is fully described in Linacre (1994).

Finally, in addition to providing information that allows for the evaluation of the severity
of each judge in relation to all other judges, the facets approach also allows one to
evaluate the extent to which each of the individual judges is using the scoring rubric in
a manner that is internally consistent (i.e., an estimate of intrarater reliability). In other
words, even if judges differ in their interpretation of the rating scale, the fit statistics will
indicate the extent to which a given judge is faithful to his or her own definition of the
scale categories across items and people.

The many-facets Rasch approach has several advantages. First, the technique puts rater severity on the same scale as item difficulty and person ability (i.e., the logit scale). Consequently, this feature allows for the computation of a single final summary score that is already corrected for rater severity. As Linacre (1994) has noted, this provides a distinct advantage over generalizability studies since the goal of a generalizability study is to determine

the error variance associated with each judge's ratings, so that correction can be made to ratings awarded by a judge when he is the only one to rate an examinee. For this to be useful, examinees must be regarded as randomly sampled from some population of examinees which means that there is no way to correct an individual examinee's score for judge behavior, in a way which would be helpful to an examining board. This approach, however, was developed for use in contexts in which only estimates of population parameters are of interest to researchers. (p. 29)


Second, the item fit statistics provide some estimate of the degree to which each
individual rater was applying the scoring rubric in an internally consistent manner. In
other words, highfit statistic values are an indication of rater drift over time.

Third, the technique works with multiple raters and does not require all raters to
evaluate all objects. In other words, the technique is well suited to overlapping research
designs, which [p. 44 ] allows the researcher to use resources more efficiently. So
long as there is sufficient connectedness in the data set (Engelhard, 1997), the severity
of all raters can be evaluated relative to each other.

The major disadvantage to the many-facets Rasch approach is that it is computationally intensive and therefore is best implemented using specialized statistical software (Linacre, 1988). In addition, this technique is best suited to data that are ordinal in nature.

Computing Common Measurement Estimates of Interrater Reliability
Measurement estimates of interrater reliability tend to be much more computationally
complex than consensus or consistency estimates. Consequently, rather than present
the detailed formulas for each technique in this section, we instead refer to some
excellent sources that are devoted to fully expounding the detailed computations
involved. This will allow us to focus on the interpretation of the results of each of these
techniques.

Factor Analysis. The mathematical formulas for computing factor-analytic solutions are
expounded in several excellent texts (e.g., Harman, 1967; Kline, 1998). When using
factor analysis to estimate interrater reliability, the data set should be structured in
such a way that each column in the data set corresponds to the score given by Rater X on Item Y to each object in the data set (objects each receive their own row). Thus,
if five raters were to score three essays from 100 students, the data set should contain
15 columns (e.g., Rater1_Item1, Rater2_Item1, Rater1_Item2) and 100 rows. In this
example, we would run a separate factor analysis for each essay item (e.g., a 5 × 100


data matrix). Table 3.6 shows the SPSS code and output for running the factor analysis
procedure.

There are two important pieces of information generated by the factor analysis. The first
important piece of information is the value of the explained variance in the first factor.
In the example output, the shared variance of the first factor is 76%, indicating that
independent raters agree on the underlying nature of the construct being rated, which
is also evidence of interrater reliability. In some cases, it may turn out that the variance
in ratings is distributed over more than one factor. If that is the case, then this provides
some evidence to suggest that the raters are not interpreting the underlying construct
in the same manner (e.g., recall the example about creativity mentioned earlier in this
chapter).

The second important piece of information comes from the factor loadings. Each object
that has been rated will have a loading on each underlying factor. Assuming that the
first factor explains most of the variance, the score to be used in subsequent analyses
should be the loading on the primary factor.
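For readers working outside SPSS, the following is a minimal sketch of the same computation in Python; the ratings array (100 students by 5 raters for a single essay item) is simulated here purely to make the example self-contained.

import numpy as np

# Simulated stand-in for one essay item: rows = 100 students, columns = 5 raters.
rng = np.random.default_rng(0)
true_score = rng.normal(size=(100, 1))
ratings = true_score + 0.5 * rng.normal(size=(100, 5))

# Principal components of the raters' correlation matrix.
R = np.corrcoef(ratings, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# First piece of information: share of variance explained by the first component.
print(f"First component explains {eigvals[0] / eigvals.sum():.0%} of the variance")

# Second piece of information: each student's score on the primary component,
# which serves as the single summary score described above.
z = (ratings - ratings.mean(axis=0)) / ratings.std(axis=0, ddof=1)
summary_scores = z @ eigvecs[:, 0]

With real data, these two quantities correspond to the explained-variance figure and the saved factor scores discussed in the text and in Table 3.6.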

Many-Facets Rasch Measurement. The mathematical formulas for computing results using the many-facets Rasch model may be found in Linacre (1994). In practice, the many-facets Rasch model is best implemented through the use of specialized software (Linacre, 1988). An example output of a many-facets Rasch analysis is listed in Table
3.7. The example output presented here is derived from the larger Stemler et al. (2006)
data set.

The key values to interpret within the context of the manyfacets Rasch approach
are rater severity measures and fit statistics. Rater severity indices are useful for
estimating the extent to which systematic differences exist between raters with regard
to their level of severity. For example, rater CL was the most severe rater, with an
estimated severity measure of +0.89 logits. Consequently, students whose test items
were scored by CL would be more likely to receive lower raw scores than students who
had the same test item scored by any of the other raters used in this project. At the
other extreme, rater AP was the most lenient rater, with a rater severity measure of −0.91 logits. Consequently, simply using raw scores would lead to biased estimates of
student proficiency since student estimates would depend, to an important degree, on


which rater scored their essay. The facets program corrects for these differences and
incorporates them into student ability estimates. If these differences were not taken into
account when calculating student ability, students who had their exams scored by AP
would be more likely to receive substantially higher raw scores than if the same item
were rated by any of the other raters.

[p. 45 ]

Table 3.6 SPSS Code and Output for Factor Analysis


The results presented in Table 3.7 show that there is about a 1.8-logit spread in systematic differences in rater severity (from −0.91 to +0.89). Consequently, assuming
that all raters are defining the rating scales they are using in the same way is not a
tenable assumption, and differences in rater severity must be taken into account in
order to come up with precise estimates of student ability.


In addition to providing information that allows us to evaluate the severity of each rater
in relation to all other raters, the facets approach also allows us to evaluate the extent
to which each of the individual raters is using the scoring rubric in a manner that is
internally consistent (i.e., intrarater reliability). In other words, even if raters differ in their
own definition of how they use the scale, the fit statistics will indicate the extent to which
a given rater is faithful to his or her own definition of the scale categories across items
and people. Rater fit statistics are presented in columns 5 and 6 of Table 3.7.

Table 3.7 Output for a Many-Facets Rasch Analysis

Fit statistics provide an empirical estimate of the extent to which the expected response
patterns for each individual match the observed response patterns. These fit statistics
are interpreted much the same way as item or person infit statistics are interpreted
(Bond & Fox, 2001; Wright & Stone, 1979). An infit value greater than 1.4 indicates that
there is 40% more variation in the data than predicted by the Rasch model. Conversely,
an infit value of 0.5 indicates that there is 50% less [p. 46 ] variation in the data than
predicted by the Rasch model. Infit mean squares that are greater than 1.3 indicate
that there is more unpredictable variation in the raters' responses than we would expect


based on the model. Infit mean square values that are less than 0.7 indicate that there is less variation in the raters' responses than we would predict based on the model. Myford and Cline (2002) note that high infit values may suggest that ratings are noisy as a result of the raters' overuse of the extreme scale categories (i.e., the lowest and highest values on the rating scale), while low infit mean square indices may be a consequence of overuse of the middle scale categories (e.g., moderate response bias).
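For reference, the two mean-square statistics interpreted here are conventionally defined as follows (this is the standard Rasch definition, not a formula taken from Table 3.7), where z_n is the standardized residual of the nth rating contributing to a given rater's fit statistic and W_n is the model variance of that rating:

\[
\text{Outfit MS} = \frac{1}{N}\sum_{n=1}^{N} z_n^{2},
\qquad
\text{Infit MS} = \frac{\sum_{n=1}^{N} W_n\, z_n^{2}}{\sum_{n=1}^{N} W_n}
\]

The information weighting is why infit is less influenced by a few highly unexpected ratings on off-target items than outfit is.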

The infit and outfit mean-square indices are unstandardized, information-weighted indices; by contrast the infit and outfit standardized indices are unweighted indices that are standardized toward a unit-normal distribution. These standardized indices are sensitive to sample size and, consequently, the accuracy of the standardization is data dependent. The expectation for the mean square index is 1.0; the range is 0 to infinity (Myford & Cline, 2002, p. 14).

The results in Table 3.7 reveal that 6 of the 12 raters had infit meansquare indices
that exceeded 1.3. Raters CL (infit of 3.4), JW (infit of 2.4), and AM (infit of 2.2) appear
particularly problematic. Their high infit values suggest that these raters are not using
the scoring rubrics in a consistent way. The table of misfitting ratings provided by the
facets computer program output allowed for an investigation of the exact nature of the
highly unexpected response patterns associated with each of these raters. The table of
misfitting ratings provides information on discrepant ratings based on two criteria: (a)
how the other raters scored the item and (b) the particular rater's typical level of severity
in scoring items of similar difficulty.

Implications for Summarizing Scores From Various Raters
Measurement estimates allow for the creation of a summary score for each participant
that represents that participant's score on the underlying factor of interest, taking into
account the extent to which each judge influences the score.


Advantages of Measurement Estimates


There are several advantages to estimating interrater reliability using the measurement
approach. First, measurement estimates can take into account errors at the level of
each judge or for groups of judges. Consequently, the summary scores generated
from measurement [p. 47 ] estimates of interrater reliability tend to more accurately
represent the underlying construct of interest than do the simple raw score ratings from
the judges.

Second, measurement estimates effectively handle ratings from multiple judges by simultaneously computing estimates across all of the items that were rated, as opposed to calculating estimates separately for each item and each pair of judges.

Third, measurement estimates have the distinct advantage of not requiring all judges
to rate all items in order to arrive at an estimate of interrater reliability. Rather, judges
may rate a particular subset of items, and as long as there is sufficient connectedness
(Linacre, 1994; Linacre et al., 1994) across the judges and ratings, it will be possible to
directly compare judges.
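Connectedness here simply means that every judge can be linked to every other judge through chains of shared ratings. As a rough illustration (the judges, essays, and assignments below are entirely hypothetical), a rating plan can be checked for connectedness with a simple graph traversal before any facets analysis is run:

from collections import deque

# Hypothetical incomplete rating design: each judge scores only some essays.
design = {
    "Judge_A": {"essay1", "essay2"},
    "Judge_B": {"essay2", "essay3"},
    "Judge_C": {"essay3", "essay4"},
    "Judge_D": {"essay5"},            # shares no essays with anyone
}

def judges_reachable_from_first(design):
    # Breadth-first search over the bipartite judge/essay graph.
    essay_to_judges = {}
    for judge, essays in design.items():
        for essay in essays:
            essay_to_judges.setdefault(essay, set()).add(judge)
    start = next(iter(design))
    seen, queue = {start}, deque([start])
    while queue:
        for essay in design[queue.popleft()]:
            for other in essay_to_judges[essay]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
    return seen

print(judges_reachable_from_first(design) == set(design))  # False: Judge_D is unlinked

If such a check fails, adding a few strategically shared essays (or a common linking rater) restores the comparability of the severity estimates.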

Disadvantages of Measurement Estimates


The major disadvantage of measurement estimates is that they are unwieldy to
compute by hand. Unlike the percent agreement figure or correlation coefficient,
measurement approaches typically require the use of specialized software to compute.

A second disadvantage is that certain methods for computing measurement estimates (e.g., facets) can handle only ordinal-level data. Furthermore, the file structure required to use facets is somewhat counterintuitive.


Summary and Conclusion


In this chapter, we have attempted to outline a framework for thinking about interrater
reliability as a multifaceted concept. Consequently, we believe that there is no "silver bullet" best approach for its computation. There are multiple techniques for computing interrater reliability, each with its own assumptions and implications. As Snow, Cook, Lin, Morgan, and Magaziner (2005) have noted, "Percent/proportion agreement is affected by chance; kappa and weighted kappa are affected by low prevalence of condition of interest; and correlations are affected by low variability, distribution shape, and mean shifts" (p. 1682). Yet each technique (and class of techniques) has its own
strengths and weaknesses.

Consensus estimates of interrater reliability (e.g., percent agreement, Cohen's kappa, odds ratios) are generally easy to compute and useful for diagnosing rater disparities; however, training raters to exact consensus requires substantial time and energy and may not be entirely necessary, depending on the goals of the study.

Consistency estimates of interrater reliability (e.g., Pearson and Spearman correlations, Cronbach's alpha, and intraclass correlations) are familiar and fairly easy to compute. They have the additional advantage of not requiring raters to perfectly agree with each other but only require consistent application of a scoring rubric within raters; systematic variance between raters is easily tolerated. The disadvantage to consistency estimates, however, is that they are sensitive to the distribution of the data (the more it departs from normality, the more attenuated the results). Furthermore, even if one achieves high consistency estimates, further adjustment to an individual's raw scores may be required in order to arrive at an unbiased final score that may be used in subsequent data analyses.

Measurement estimates of interrater reliability (e.g., factor analysis, many-facets Rasch measurement) can deal effectively with multiple raters, easily derive adjusted summary scores that are corrected for rater severity, and allow for highly efficient designs (e.g., not all raters need to rate all objects); however, this comes at the expense of added computational complexity and increased demands on resources (e.g., time and expertise).


In the end, the best technique will always depend on (a) the goals of the analysis (e.g.,
the stakes associated with the study outcomes), (b) the nature of the data, and (c)
the desired level of information based on the resources available. The answers to
these three questions will help to determine how many raters one needs, whether the
raters need to be in perfect agreement with each other, and how to approach creating
summary scores across raters.

We conclude this chapter with a brief table that is intended to provide rough interpretive
guidance with regard to acceptable interrater reliability values (see Table 3.8). These
values simply represent conventions the authors have encountered in the literature
and via discussions with colleagues and reviewers; however, keep in mind that these
guidelines are just rough estimates and will vary depending on the purpose of the study
and the stakes associated with the [p. 48 ] outcomes. The conventions articulated
here assume that the interrater reliability study is part of a lowstakes, exploratory
research study.

Table 3.8 General Guidelines for Interpreting Various Interrater Reliability Coefficients

Notes


1. Also known as interobserver or interjudge reliability or agreement.

2. Readers interested in this model can refer to Chapters 4 and 5 on Rasch measurement for more information.

References
Agresti, A. (1996). An introduction to categorical data analysis (2nd ed.). New York: John Wiley.

Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice Hall.

Barrett, P. (2001, March). Assessing the reliability of rating data. Retrieved June 16, 2003, from http://www.liv.ac.uk/~pbarrett/rater.pdf

Bock, R., Brennan, R. L., & Muraki, E. (2002). The information in multiple ratings. Applied Psychological Measurement, 26(4), 364-375.

Bond, T., & Fox, C. (2001). Applying the Rasch model. Mahwah, NJ: Lawrence Erlbaum.

Burke, M. J., & Dunlap, W. P. (2002). Estimating interrater agreement with the average deviation index: A user's guide. Organizational Research Methods, 5(2), 159-172.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.

Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213-220.

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.

College Board. (2006). How the essay is scored. Retrieved November 4, 2006, from http://www.collegeboard.com/student/testing/sat/about/sat/essay_scoring.html

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Orlando, FL: Harcourt Brace Jovanovich.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

Dooley, K. (2006). Questionnaire Programming Language: Interrater reliability report. Retrieved November 4, 2006, from http://qpl.gao.gov/ca050404.htm

Engelhard, G. (1997). Constructing rater and task banks for performance assessment. Journal of Outcome Measurement, 1(1), 19-33.

Glass, G. V., & Hopkins, K. H. (1996). Statistical methods in education and psychology. Boston: Allyn & Bacon.

Harman, H. H. (1967). Modern factor analysis. Chicago: University of Chicago Press.

Hayes, J. R., & Hatch, J. A. (1999). Issues in measuring reliability: Correlation versus percentage of agreement. Written Communication, 16(3), 354-367.

Hayes, K. (2006). SPSS macro for computing Krippendorff's alpha. Retrieved from http://www.comm.ohio-state.edu/ahayes/SPSS%20programs/kalpha.htm

Hopkins, K. H. (1998). Educational and psychological measurement and evaluation (8th ed.). Boston: Allyn & Bacon.

Kline, R. (1998). Principles and practice of structural equation modeling. New York: Guilford.

Krippendorff, K. (2004). Reliability in content analysis: Some common misconceptions and recommendations. Human Communication Research, 30(3), 411-433.

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174.

LeBreton, J. M., Burgess, J. R., Kaiser, R. B., Atchley, E., & James, L. R. (2003). The restriction of variance hypothesis and interrater reliability and agreement: Are ratings from multiple sources really dissimilar? Organizational Research Methods, 6(1), 80-128.

Linacre, J. M. (1988). FACETS: A computer program for many-facet Rasch measurement (Version 3.3.0). Chicago: MESA Press.

Linacre, J. M. (1994). Many-facet Rasch measurement. Chicago: MESA Press.

Linacre, J. M. (2002). Judge ratings with forced agreement. Rasch Measurement Transactions, 16(1), 857-858.

Linacre, J. M., Englehard, G., Tatem, D. S., & Myford, C. M. (1994). Measurement with judges: Many-faceted conjoint measurement. International Journal of Educational Research, 21(4), 569-577.

McArdle, J. J. (1994). Structural factor analysis experiments with incomplete data. Multivariate Behavioral Research, 29(4), 409-454.

Myford, C. M., & Cline, F. (2002, April 1-5). Looking for patterns in disagreements: A facets analysis of human raters' and e-rater's scores on essays written for the Graduate Management Admission Test (GMAT). Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Osborne, J. W. (2006). Bringing balance and technical accuracy to reporting odds ratios and the results of logistic regression analyses. Practical Assessment, Research & Evaluation, 11(7). Retrieved from http://pareonline.net/getvn.asp?v=11&n=17

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests (Expanded ed.). Chicago: University of Chicago Press. (Original work published 1960)

Snow, A. L., Cook, K. F., Lin, P.-S., Morgan, R. O., & Magaziner, J. (2005). Proxies and other external raters: Methodological considerations. Health Services Research, 40(5), 1676-1693.

Stemler, S. E. (2001). An overview of content analysis. Practical Assessment, Research and Evaluation, 7(17). Retrieved from http://ericae.net/pare/getvn.asp?v=7&n=17

Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation, 9(4). Retrieved from http://pareonline.net/getvn.asp?v=9&n=4

Stemler, S. E., & Bebell, D. (1999, April). An empirical approach to understanding and analyzing the mission statements of selected educational institutions. Paper presented at the New England Educational Research Organization (NEERO), Portsmouth, NH.

Stemler, S. E., Grigorenko, E. L., Jarvin, L., & Sternberg, R. J. (2006). Using the theory of successful intelligence as a basis for augmenting AP exams in psychology and statistics. Contemporary Educational Psychology, 31(2), 75-108.

Sternberg, R. J., & Lubart, T. I. (1995). Defying the crowd: Cultivating creativity in a culture of conformity. New York: Free Press.

Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677-680.

Uebersax, J. (2002). Statistical methods for rater agreement. Retrieved August 9, 2002, from http://ourworld.compuserve.com/homepages/jsuebersax/agree.htm

von Eye, A., & Mun, E. Y. (2004). Analyzing rater agreement: Manifest variable methods. Mahwah, NJ: Lawrence Erlbaum.

Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.

http://dx.doi.org/10.4135/9781412995627.d5

Quantitative Methods for Estimating the Reliability of Qualitative Data

Jason W. Davey
Fenwal, Inc.

P. Cristian Gugiu
Western Michigan University

Chris L. S. Coryn
Western Michigan University

Journal of MultiDisciplinary Evaluation, Volume 6, Number 13, February 2010 (ISSN 1556-8180). http://www.jmde.com/

Background: Measurement is an indispensable aspect of conducting both quantitative and qualitative research and evaluation. With respect to qualitative research, measurement typically occurs during the coding process.

Purpose: This paper presents quantitative methods for determining the reliability of conclusions from qualitative data sources. Although some qualitative researchers disagree with such applications, a link between the qualitative and quantitative fields is successfully established through data collection and coding procedures.

Setting: Not applicable.

Intervention: Not applicable.

Research Design: Case study.

Data Collection and Analysis: Narrative data were collected from a random sample of 528 undergraduate students and 28 professors.

Findings: The calculation of the kappa statistic, weighted kappa statistic, ANOVA Binary Intraclass Correlation, and Kuder-Richardson 20 is illustrated through a fictitious example. Formulae are presented so that the researcher can calculate these estimators without the use of sophisticated statistical software.

Keywords: qualitative coding; qualitative methodology; reliability coefficients

The rejection of using quantitative methods for assessing the reliability of qualitative findings by some qualitative researchers is both frustrating and perplexing from the vantage point of quantitative and mixed method researchers. While the need to distinguish one methodological approach from another is understandable, the reasoning sometimes used to justify the wholesale rejection of all concepts associated with quantitative analysis is replete with mischaracterizations, overreaching arguments, and inadequate substitutions.

One of the first lines of attack against the use of quantitative analysis centers on philosophical arguments. Healy and Perry (2000), for example, characterize qualitative methods as flexible, inductive, and multifaceted, whereas quantitative methods are often characterized as inflexible and fixed. Moreover, most qualitative researchers view quantitative methods as characteristic of a positivist paradigm (e.g., Stenbacka, 2001; Davis, 2008; Paley, 2008), a term that has come to take on a derogatory connotation. Paley (2008) states that doing quantitative research entails commitment to a particular ontology and, specifically, to a belief in a single, objective reality that can be described by universal laws (p. 649).

However, quantitative analysis should not be synonymous with the positivist paradigm because statistical inference is concerned with probabilistic, as opposed to deterministic, conclusions. Nor do statisticians believe in a universal law measured free of error. Rather, statisticians believe that multiple truths may exist and that despite the best efforts of the researcher these truths are measured with some degree of error. If that were not the case, statisticians would ignore interaction effects, assume that measurement errors do not exist, and fail to consider whether differences may exist between groups. Yet, most statisticians consider all these factors before formulating their conclusions. While statisticians may be faulted for paying too much attention to measures of central tendency (e.g., mean, median) at the expense of interesting outliers, this is not the same as believing in one all-inclusive truth.

The distinction between objective research and subjective research also appears to emerge from this paradigm debate. Statisticians are portrayed as detached and neutral investigators while qualitative researchers are portrayed as embracing personal viewpoints and even biases to describe and interpret the subjective experience of the phenomena they study (Miller, 2008). While parts of these characterizations do, indeed, differentiate between the two groups of researchers, they fail to explain why a majority of qualitative researchers dismiss the use of statistical methods. After all, the formulas used to conduct such analyses do not know or care whether the data were gathered using an objective rather than a subjective method. Moreover, certain statistical methods lend themselves to, and were even specifically developed for, the analysis of qualitative data (e.g., reliability analysis). Other qualitative researchers have come to equate positivism, and by extension quantitative analysis, with causal explanations (Healy & Perry, 2000). To date, the gold standard for substantiating causal claims is through the use of a well-conducted experimental design. However, the implementation of an experimental design does not necessitate the use of quantitative analysis. Furthermore, quantitative analysis may be conducted for any type of research design, including
qualitative research, as is the central premise of this paper.

For some qualitative researchers (e.g., Miller, 2008; Stenbacka, 2001), the wholesale rejection of all concepts perceived to be quantitative has extended to general research concepts like reliability and validity. According to Stenbacka (2001), reliability has no relevance in qualitative research, where it is impossible to differentiate between researcher and method (p. 552). From the perspective of quantitative research, this statement is inaccurate because several quantitative methods have been developed for differentiating between the researcher, data collection method, and informant (e.g., generalizability theory), provided, of course, data are available for two or more researchers and/or methods. Stenbacka (2001) also objected to traditional forms of validity because the purpose in qualitative research never is to measure anything. A qualitative method seeks for a certain quality that is typical for a phenomenon or that makes the phenomenon different from others (p. 551). It would seem to the present authors, however, that this notion is inconsistent with traditional qualitative research. Measurement is an indispensable aspect of conducting research, regardless of whether it is quantitative or qualitative.

With respect to qualitative research, measurement occurs during the coding process. Illustrating the integral nature of coding in qualitative research, Benaquisto (2008) noted:

The coding process refers to the steps the researcher takes to identify, arrange, and systematize the ideas, concepts, and categories uncovered in the data. Coding consists of identifying potentially interesting events, features, phrases, behaviors, or stages of a process and distinguishing them with labels. These are then further differentiated or integrated so that they may be reworked into a smaller number of categories, relationships, and patterns so as to tell a story or communicate conclusions drawn from the data. (p. 85)

Clearly, in the absence of utilizing a coding process, researchers would be forced to provide readers with all of the data, which, in turn, would place the burden of interpretation on the reader. However, while the importance of coding to qualitative research is self-evident to all those who have conducted such research, the role of measurement may not be as obvious. In part, this may be attributed to a misunderstanding on the part of many researchers as to what measurement is.

Measurement is the process of assigning numbers, symbols, or codes to phenomena (e.g., events, features, phrases, behaviors) based on a set of prescribed rules (i.e., a coding rubric). There is nothing inherently quantitative about this process or, at least, there does not need to be. Moreover, it does not limit qualitative research in any way. In fact, many times, measurement may only be performed in a qualitative context.

For example, suppose that a researcher conducts an interview with an informant who states that the bathrooms in the school are very dirty. Now further suppose that the researcher developed a coding rubric, which, for the sake of simplicity, only contained two levels: cleanliness and academic performance. Clearly, the informant's statement addressed the first level (cleanliness) and not the second. Whether the researcher chooses to assign this statement a checkmark for the cleanliness category or a 1, and an X or 0 (zero) for the academic performance category, does not make a difference. The researcher clearly used his or her judgment to transform the raw
statement made by the informant into a code. However, when the researcher decided that the statement best represented cleanliness and not academic performance, he or she also performed a measurement process. Therefore, if one accepts this line of reasoning, qualitative research depends upon measurement to render judgments. Furthermore, three questions may be asked. First, does statement X fit the definition of code Y? Second, how many of the statements collected fit the definition of code Y? And third, how reliable is the definition of code Y for differentiating between statements within and across researchers (i.e., intrarater and interrater reliability, respectively)?

Fortunately, not every qualitative researcher has accepted Stenbacka's notion, in part, because qualitative researchers, like quantitative researchers, compete for funding and therefore must persuade funders of the accuracy of their methods and results (Cheek, 2008). Consequently, the concepts of reliability and validity permeate qualitative research. However, owing to the desire to differentiate itself from quantitative research, qualitative researchers have espoused the use of interpretivist alternative terms (Seale, 1999). Some of the most popular terms substituted for reliability include confirmability, credibility, dependability, and replicability (Coryn, 2007; Golafshani, 2003; Healy & Perry, 2000; Morse, Barrett, Mayan, Olson, & Spiers, 2002; Miller, 2008; Lincoln & Guba, 1985).

In the qualitative tradition, confirmability is concerned with confirming that the researcher's interpretations and conclusions are grounded in actual data that can be verified (Jensen, 2008; Given & Saumure, 2008). Researchers may address this reliability indicator through the use of multiple coders, transparency, audit trails, and member checks. Credibility, on the other hand, is concerned with the research methodology and data sources used to establish a high degree of harmony between the raw data and the researcher's interpretations and conclusions. Various means can be used to enhance credibility, including accurately and richly describing data, citing negative cases, using multiple researchers to review and critique the analysis and findings, and conducting member checks (Given & Saumure, 2008; Jensen, 2008; Saumure & Given, 2008). Dependability recognizes that the most appropriate research design cannot be completely predicted a priori. Consequently, researchers may need to alter their research design to meet the realities of the research context in which they conduct the study, as compared to the context they predicted to exist a priori (Jensen, 2008). Dependability can be addressed by providing a rich description of the research procedures and instruments used so that other researchers may be able to collect data in similar ways. The idea is that if a different set of researchers use similar methods then they should reach similar conclusions (Given & Saumure, 2008). Finally, replicability is concerned with repeating a study on participants from a similar background as the original study. Researchers may address this reliability indicator by conducting the new study on participants with similar demographic variables, asking similar questions, and coding data in a similar fashion to the original study (Firmin, 2008).

Like qualitative researchers, quantitative researchers have developed numerous definitions of reliability, including interrater and intrarater
reliability, test-retest reliability, internal consistency, and interclass correlations, to name a few (Crocker & Algina, 1986; Hopkins, 1998). A review of the qualitative alternative terms revealed them to be indirectly associated with quantitative notions of reliability. However, although replicability is conceptually equivalent to test-retest reliability, the other three terms appear to describe research processes tangentially related to reliability. Moreover, they have two major liabilities. First, they place the burden of assessing reliability squarely on the reader. For example, if a reader wanted to determine the confirmability of a finding, they would need to review the audit trail and make an independent assessment. Similar reviews of the data would be necessary if a reviewer wanted to assess the credibility of a finding or the dependability of a study design.

Second, they fail to consider interrater reliability, which, in our experience, accounts for a considerable amount, if not a majority, of the variability in findings in qualitative studies. Interrater reliability is concerned with the degree to which different raters or coders appraise the same information (e.g., events, features, phrases, behaviors) in the same way (van den Hoonaard, 2008). In other words, do different raters interpret qualitative data in similar ways? The process of conducting an interrater reliability analysis, which is detailed in the next section, is relatively straightforward. Essentially, the only additional step beyond development and finalization of a coding rubric is that at least two or more raters must independently rate all of the qualitative data using the coding rubric. Although collaboration, in the form of consensus agreement, may be used to finalize ratings after each rater has had an opportunity to rate all data, each rater must work independently of the other to reduce bias in the first phase of analysis. Often, this task is greatly facilitated by use of a database system that, for example, (1) displays the smallest codable unit of a transcript (e.g., a single sentence), (2) presents the available coding options, and (3) records the rater's code before displaying the next codable unit.

While it is likely that qualitative researchers who subscribe to a constructionist paradigm may object to the constraint of forcing qualitative researchers to use the same coding rubric for a study, rather than developing their own, this is an indispensable process for attaining a reasonable level of interrater reliability. An example of the perils of not attending to this issue may be found in an empirical study conducted by Armstrong, Gosling, Weinman, and Marteau (1997). Armstrong and his colleagues invited six experienced qualitative researchers from Britain and the United States to analyse a transcript (approximately 13,500 words long) from a focus group comprised of adults living with cystic fibrosis that was convened to discuss the topic of genetic screening. In return for a fee, each researcher was asked to prepare an independent report in which they identified and described the main themes that emerged from the focus group discussion, up to a maximum of five. Beyond these instructions, each researcher was permitted to use any method for extracting the main themes they felt was appropriate. Once the reports were submitted, they were thematically analyzed by one of the authors, who deliberately abstained from reading the original transcript to reduce external bias.

The results uncovered by Armstrong and his colleagues paint a troubling picture. On the surface, it was clear that a
reasonable level of consensus in the identification of themes was achieved. Five of the six researchers identified five themes, while one identified four themes. Consequently, only four themes are discussed in the article: visibility; ignorance; health service provision; and genetic screening. With respect to the presence of each theme, there was unanimous agreement for the visibility and genetic screening themes, while the agreement rates were slightly lower for the ignorance and health service provision themes (83% and 67%, respectively). Overall, these are good rates of agreement. However, a deeper examination of the findings revealed two troubling issues. First, a significant amount of disagreement existed with respect to how the themes were organized. Some researchers classified a theme as a basic structure whereas others organized it under a larger basic structure (i.e., gave it less importance than the overarching theme they assigned it to). Second, a significant amount of disagreement existed with respect to the manner in which themes were interpreted. For example, some of the researchers felt that the ignorance theme suggested a need for further education, other researchers raised concern about the eugenic threat, and the remainder thought it provided parents with choice. Similar inconsistencies with regard to interpretability occurred for the genetic screening theme, where three researchers indicated that genetic screening provided parents with choice while one linked it with the eugenic threat.

These results serve as an example of how reality is relative to the researcher doing the interpretation. However, they also demonstrate how the quality of a research finding requires knowledge of the degree to which consensus is reached by knowledgeable researchers. Clearly, by this statement, we are assuming that reliability of findings across different researchers is a desirable quality. There certainly may be instances in which reliability is not important because one is only interested in the findings of a specific researcher, and the perspectives of others are not desired. That being the case, one may consider examining intrarater reliability. In all other instances, however, it is reasonable to assume that it is desirable to differentiate between the perspectives of the informants and those of the researcher. In other words, are the researcher's findings truly grounded in the data, or do they reflect his or her personal ideological perspectives? For a politician, for example, knowing the answer to this question may mean the difference between passing and rejecting a policy that allows parents to genetically test embryos.

Although qualitative researchers can address interrater reliability by following the method used by Armstrong and his colleagues, the likelihood of achieving a reasonable level of reliability will be low simply due to researcher differences (e.g., the labels used to describe themes, structural organization of themes, importance accorded to themes, interpretation of data). In general, given the importance of reducing the variability in research findings attributed solely to researcher variability, it would greatly benefit qualitative researchers to utilize a common coding rubric. Furthermore, use of a common coding rubric does not greatly interfere with normal qualitative procedures, particularly if consensus is reached beforehand by all the researchers on the rubric that will be used to code all the data. Of equal importance, this procedure permits the researcher to
remain the instrument by which data are interpreted (Brodsky, 2008). Reporting the results of, to this point, this qualitative process should considerably improve the credibility of research findings. However, three issues still remain. First, reporting the findings of multiple researchers places the burden of synthesis on the reader. Therefore, researchers should implement a method to synthesize all the findings through a consensus-building procedure or averaging results, where appropriate and possible. Second, judging the reliability of a study requires that deidentified data are made available to anyone who requests it. While no one, to the best of our knowledge, has studied the degree to which this is practiced, our experience suggests it is not prevalent in the research community. Third, reporting the findings of multiple researchers will only permit readers to get an approximate sense of the level of interrater reliability or whether it meets an acceptable standard. Moreover, comparisons between the reliability of one study and another qualitative study are impractical for complex studies.

Fortunately, simple quantitative solutions exist that enable researchers to report the reliability of their conclusions rather than shift the burden to the reader. The present paper will expound upon four quantitative methods for calculating interrater reliability that can be specifically applied to qualitative data and thus should not be regarded as products of a positivist position. In fact, reliability estimates, which can roughly be conceptualized as the degree to which the variability of research findings is or is not due to differences in researchers, illustrate the degree to which reality is socially constructed or not. Data that are subject to a wide range of interpretations will likely produce low reliability estimates, whereas data whose interpretations are consistent will likely produce high reliability estimates. Finally, calculating interrater reliability in addition to reporting a narrative of the discrepancies and consistencies between researchers can be thought of as a form of methodological triangulation.

Method

Data Collection Process

Narrative data were collected from 528 undergraduate students and 28 professors randomly selected from a university population. Data were collected with the help of an open-ended survey that asked respondents to identify the primary challenges facing the university that should be immediately addressed by the university's administration. Data were transcribed from the surveys to an electronic database (Microsoft Access) programmed to resemble the original questionnaire. Validation checks were performed by upper-level graduate students to assess the quality of the data entry process. Corrections to the data entered into the database were made by the graduate students in the few instances in which discrepancies were found between the responses noted on the survey and those entered in the database.

Due to the design of the original questionnaire, which encouraged respondents to bullet their responses, little additional work was necessary to further break responses into the smallest codable units (typically 1-3 sentences). That said, it was possible for the smallest codable units to contain multiple themes, although the average number of themes was less than two per unit of analysis.
Coding Procedures

Coding qualitative data is an arduous task that requires iterative passes through the raw data in order to generate a reliable and comprehensive coding rubric. This task was conducted by two experienced qualitative researchers who independently read the original narratives and identified primary and secondary themes, categorized these themes based on their perception of the internal structure (selective coding; Benaquisto, 2008), and produced labels for each category and subcategory based on the underlying data (open coding; Benaquisto, 2008). Following this initial step, the two researchers further differentiated or integrated their individual coding rubrics (axial coding; Benaquisto, 2008) into a unified coding rubric. Using the unified coding rubric, the two researchers attempted an initial coding of the raw data to determine (1) the ease with which the coding rubric could be applied, (2) problem areas that needed further clarification, (3) the trivial categories that could be eliminated or integrated with other categories, (4) the extensive categories that could be further refined to make important distinctions, and (5) the overall coverage of the coding rubric. Not surprisingly, several iterations were necessary before the coding rubric was finalized. In the following section, for ease of illustration, reliability estimates are presented only for a single category.

Statistical Procedures

Very often, coding schemes follow a binomial distribution. That is, coders indicate whether a particular theme either is or is not present in the data. When two or more individuals code data to identify such themes and patterns, the reliability of coders' efforts can be determined, typically by coefficients of agreement. This type of estimate can be used as a measure that objectively permits a researcher to substantiate that his or her coding scheme is replicable.

Most estimators for gauging the reliability of continuous agreement data predominately evolved from psychometric theory (Cohen, 1968; Lord & Novick, 1968; Gulliksen, 1950; Rozeboom, 1966). Similar methods for binomial agreement data shortly followed (Cohen, 1960; Lord & Novick, 1968). Newer forms of these estimators, called binomial intraclass correlation coefficients (ICC), were later developed to handle more explicit patterns in agreement data (Fleiss & Cuzick, 1979; Kleinman, 1973; Lipsitz, Laird, & Brennan, 1994; Mak, 1988; Nelder & Pregibon, 1987; Smith, 1983; Tamura & Young, 1987; Yamamoto & Yanagimoto, 1992).

In this paper four methods that can be utilized to assess the reliability of binomial coded agreement data are presented. These estimators are the kappa statistic (κ), the weighted kappa statistic (κW), the ANOVA binary ICC, and the Kuder-Richardson 20 (KR-20). The kappa statistic was one of the first statistics developed for assessing the reliability of binomial data between two or more coders (Cohen, 1960; Fleiss, 1971). A modified version of this statistic introduced the use of numerical weights. This statistic allows the user to apply different probability weights to cells in a contingency table (Fleiss, Cohen, & Everitt, 1969) in order to apply different levels of importance to various coding frequencies. The ANOVA binary ICC is based on the mean squares from an analysis of variance (ANOVA) model modified for binomial data (Elston, Hill, &
The last estimator was developed by Kuder and Richardson (1937), and is commonly known as KR-20 or KR(20), because it was the 20th numbered formula in their seminal article. This estimator is based on the ratio of agreement to the total discrete variance.

These four reliability statistics are functions of i x j contingency tables, also known as cross-tabulation tables. The current paper will illustrate the use of these estimators for a study dataset that comprises the binomial coding patterns of two investigators. Because these coding patterns are from two coders and the coded responses are binomial (i.e., a theme either is or is not present in a given interview response), the contingency table has two rows (i = 2) and two columns (j = 2).

The layout of this table is provided in Table 1. The first cell, denoted (i1 = Present, j1 = Present), consists of the total frequency of cases where Coder 1 and Coder 2 both agree that a theme is present in the participant interview responses. The second cell, denoted (i1 = Present, j2 = Not Present), consists of the total frequency of cases where Coder 1 feels that a theme is present in the interview responses and the second coder does not agree with this assessment. The third cell, denoted (i2 = Not Present, j1 = Present), consists of the total frequency of cases where Coder 2 feels that a theme is present and the first coder does not agree with this assessment. The fourth cell, denoted (i2 = Not Present, j2 = Not Present), consists of the total frequency of cases where both Coder 1 and Coder 2 agree that a theme is not present in the interview responses (Soeken & Prescott, 1986).

Table 1
General Layout of Binomial Coder Agreement Patterns for Qualitative Data

                                        Coder 1
                                        Theme Present (j1)   Theme Not Present (j2)
Coder 2   Theme Present (i1)            Cell11               Cell21
          Theme Not Present (i2)        Cell12               Cell22

Participants

Interview data were collected for and transcribed from 28 professors and 528 undergraduate students randomly selected from a university population. The binomial coding agreement patterns for these two groups of interview participants are provided in Table 2 and Table 3. For the professor and student interview participants, the coders agreed that one professor and 500 students provided a response that pertains to overall satisfaction with university facilities. Coder 1 felt that an additional seven professors and two students made a response pertinent to overall satisfaction, whereas Coder 2 did not feel that the responses from these participants pertained to the interview response of interest. Coder 2 felt that one professor and one student made a response pertinent to overall satisfaction, whereas Coder 1 did not feel that the responses from this professor and student pertained to the interview response of interest. Coder 1 and Coder 2 agreed that responses from the final 19 professors and 25 students did not pertain to the topic of interest.
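Before turning to those tables, here is a minimal Python sketch of how such a 2 x 2 agreement pattern can be tabulated from two coders' raw binomial codes. The function name and the reconstructed 0/1 vectors are illustrative assumptions, not part of the original study; the vectors are simply built to reproduce the professor counts reported in Table 2.

    from collections import Counter

    def agreement_table(coder1, coder2):
        """Cross-tabulate two binomial coding vectors (1 = theme present, 0 = not present)."""
        counts = Counter(zip(coder1, coder2))
        return {(c1, c2): counts.get((c1, c2), 0) for c1 in (1, 0) for c2 in (1, 0)}

    # Illustrative vectors matching Table 2: 1 response coded present by both coders,
    # 7 coded present only by Coder 1, 1 only by Coder 2, and 19 coded not present by both.
    coder1 = [1] + [1] * 7 + [0] + [0] * 19
    coder2 = [1] + [0] * 7 + [1] + [0] * 19
    print(agreement_table(coder1, coder2))
    # {(1, 1): 1, (1, 0): 7, (0, 1): 1, (0, 0): 19}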


Table 2
Binomial Coder Agreement Patterns for Professor Interview Participants

                                        Coder 1
                                        Theme Present (j1)   Theme Not Present (j2)
Coder 2   Theme Present (i1)            1                    1
          Theme Not Present (i2)        7                    19

Table 3
Binomial Coder Agreement Patterns for Student Interview Participants

                                        Coder 1
                                        Theme Present (j1)   Theme Not Present (j2)
Coder 2   Theme Present (i1)            500                  1
          Theme Not Present (i2)        2                    25

Four Estimators for Calculating the Reliability of Qualitative Data

Kappa

According to Brennan and Hays (1992), the κ statistic determines the extent of agreement between two or more judges "exceeding that which would be expected purely by chance" (p. xx). This statistic is based on the observed and expected level of agreement between two or more raters with two or more levels. The observed level of agreement (po) equals the frequency of records where both coders agree that a theme is present, plus the frequency of records where both coders agree that a theme is not present, divided by the total number of ratings. The expected level of agreement (pe) equals the summation of the cross products of the marginal probabilities. In other words, this is the expected rate of agreement by random chance alone. The kappa statistic (κ) then equals (po - pe)/(1 - pe). The traditional formulae for po and pe are

$p_o = \sum_{i=1}^{c} p_{ii}$  and  $p_e = \sum_{i=1}^{c} p_{i.}\,p_{.i}$,

where c denotes the number of coding categories, $p_{ii}$ denotes the proportion of cases in the ith agreement (diagonal) cell, and $p_{i.}$ and $p_{.i}$ denote the marginal row and column proportions (Fleiss, 1971; Soeken & Prescott, 1986). These formulae are illustrated in Table 4.


Table 4
2 x 2 Contingency Table for the Kappa Statistic

                                  Coder 1
                                  Theme present           Theme not present        Marginal row probabilities pi.
Coder 2  Theme present            c11                     c21                      p1. = (c11 + c21)/N
         Theme not present        c12                     c22                      p2. = (c12 + c22)/N
Marginal column
probabilities p.j                 p.1 = (c11 + c12)/N     p.2 = (c21 + c22)/N      N = c11 + c21 + c12 + c22

$p_o = \sum_{i=1}^{c} p_{ii} = \frac{c_{11} + c_{22}}{N}$  and  $p_e = \sum_{i=1}^{c} p_{i.}\,p_{.i} = p_{1.}p_{.1} + p_{2.}p_{.2}$.

Estimates from professor interview participants for calculating the kappa statistic are provided in Table 5. The observed level of agreement for professors is (1 + 19)/556 = 0.0360. The expected level of agreement for professors is 0.0036(0.0144) + 0.0468(0.0360) = 0.0017.

Table 5
Estimates from Professor Interview Participants for Calculating the Kappa Statistic

                                  Coder 1
                                  Theme present           Theme not present        Marginal row probabilities pi.
Coder 2  Theme present            1                       1                        p1. = 2/556 = 0.0036
         Theme not present        7                       19                       p2. = 26/556 = 0.0468
Marginal column
probabilities p.j                 p.1 = 8/556 = 0.0144    p.2 = 20/556 = 0.0360    N = 28 + 528 = 556

Estimates from student interview participants for calculating the kappa statistic are provided in Table 6. The observed level of agreement for students is (500 + 25)/556 = 0.9442. The expected level of agreement for students is 0.9011(0.9029) + 0.0486(0.0468) = 0.8160.


Table 6
Estimates from Student Interview Participants for Calculating the Kappa Statistic

                                  Coder 1
                                  Theme present             Theme not present        Marginal row probabilities pi.
Coder 2  Theme present            500                       1                        p1. = 501/556 = 0.9011
         Theme not present        2                         25                       p2. = 27/556 = 0.0486
Marginal column
probabilities p.j                 p.1 = 502/556 = 0.9029    p.2 = 26/556 = 0.0468    N = 556

The total observed level of agreement for the professor and student interview groups is po = 0.0360 + 0.9442 = 0.9802. The total expected level of agreement for the professor and student interview groups is pe = 0.0017 + 0.8160 = 0.8177. For the combined professor and student groups, the kappa statistic equals κ = (0.9802 - 0.8177)/(1 - 0.8177) = 0.891. The level of agreement between the two coders is 0.891 beyond that which is expected purely by chance.
level of agreement (pew) equals the
Weighted Kappa summation of the cross product of the
marginal probabilities, where each cell in
The reliability coefficient, W, has the the contingency table has its own weight.
same interpretation as the kappa statistic, The weighted kappa statistic W then
, but the researcher can differentially equals (pow-pew)/(1-pew). The traditional
weight each cell to reflect varying levels of formulae for pow and pew are
c c c c
importance. According to Cohen (1968),
pow = wij pij and pew = wij pi. p. j ,
W is the proportion of weighted i =1 j =1 i =1 j =1
agreement corrected for chance, to be where c denotes the total number of cells,
used when different kinds of i denoted the ith row, j denotes the jth
disagreement are to be differentially column, and wij denotes the i, jth cell
weighted in the agreement index (p. xx). weight (Fleiss, Cohen, & Everitt, 1969;
As an example, the frequencies of coding Everitt, 1968). These formulae are
patterns where both raters agree that a illustrated in Table 7.
theme is present can be given a larger
weight than patterns where both raters


Table 7
2 x 2 Contingency Table for the Weighted Kappa Statistic

                                  Coder 1
                                  Theme present                Theme not present           Marginal row probabilities pi.
Coder 2  Theme present            w11c11                       w21c21                      p1. = (w11c11 + w21c21)/N
         Theme not present        w12c12                       w22c22                      p2. = (w12c12 + w22c22)/N
Marginal column
probabilities p.j                 p.1 = (w11c11 + w12c12)/N    p.2 = (w21c21 + w22c22)/N   N = c11 + c21 + c12 + c22

$p_{ow} = \sum_{i=1}^{c}\sum_{j=1}^{c} w_{ij}\,p_{ij} = \frac{w_{11}c_{11} + w_{22}c_{22}}{N}$  and  $p_{ew} = \sum_{i=1}^{c}\sum_{j=1}^{c} w_{ij}\,p_{i.}\,p_{.j}$.

Karlin, Cameron, and Williams (1981) provided three methods for weighting probabilities as applied to the calculation of a kappa statistic. The first method equally weights each pair of observations. This weight is calculated as $w_i = n_i / N$, where ni is the sample size of each cell and N is the sum of the sample sizes from all of the cells of the contingency table. The second method equally weights each group (e.g., undergraduate students and professors) irrespective of its size. These weights can be calculated as $w_i = 1/[k\,n_i(n_i - 1)]$, where k is the number of groups (e.g., k = 2). The last method weights each cell according to the sample size in each cell. The formula for this weighting option is $w_i = 1/[N(n_i - 1)]$.

There is no single standard for applying probability weights to each cell in a contingency table. For this study, the probability weights used are provided in Table 8. In the first row and first column, the probability weight is 0.80. This weight was chosen arbitrarily to reflect the overall level of importance of agreement on a theme being present as identified by both coders. In the second row and first column, the probability weight is 0.10. In the first row and second column, the probability weight is 0.09. These two weights were used to reduce the impact of differing levels of experience in qualitative research between the two raters. In the second row and second column, the probability weight is 0.01. This weight was employed to reduce the effect of the lack of existence of a theme in the interview data.


Table 8
Probability Weights on Binomial Coder Agreement Patterns for Professor and Student Interview Participants

                                        Coder 1
                                        Theme Present (j1)   Theme Not Present (j2)
Coder 2   Theme Present (i1)            0.80                 0.09
          Theme Not Present (i2)        0.10                 0.01

Estimates from professor interview participants for calculating the weighted kappa statistic are provided in Table 9. The observed level of agreement for professors is [0.8(1) + 0.01(19)]/556 = 0.0018. The expected level of agreement for professors is 0.0016(0.0027) + 0.0016(0.0005) = 0.00001.

Table 9
Estimates from Professor Interview Participants for Calculating the Weighted Kappa Statistic

                                  Coder 1
                                  Theme present              Theme not present         Marginal row probabilities pi.
Coder 2  Theme present            0.8(1) = 0.8               0.09(1) = 0.09            p1. = 0.89/556 = 0.0016
         Theme not present        0.1(7) = 0.7               0.01(19) = 0.19           p2. = 0.89/556 = 0.0016
Marginal column
probabilities p.j                 p.1 = 1.5/556 = 0.0027     p.2 = 0.28/556 = 0.0005   N = 28 + 528 = 556

Estimates from student interview participants for calculating the weighted kappa statistic are provided in Table 10. The observed level of agreement for students is [0.8(500) + 0.01(25)]/556 = 0.7199. The expected level of agreement for students is 0.7196(0.7198) + 0.0008(0.0006) = 0.5180.


Table 10
Estimates from Student Interview Participants for Calculating the Weighted Kappa Statistic

                                  Coder 1
                                  Theme present               Theme not present         Marginal row probabilities pi.
Coder 2  Theme present            0.8(500) = 400              0.09(1) = 0.09            p1. = 400.09/556 = 0.7196
         Theme not present        0.1(2) = 0.2                0.01(25) = 0.25           p2. = 0.45/556 = 0.0008
Marginal column
probabilities p.j                 p.1 = 400.2/556 = 0.7198    p.2 = 0.34/556 = 0.0006   N = 28 + 528 = 556

The total observed level of agreement for the professor and student interview groups is pow = 0.0018 + 0.7199 = 0.7217. The total expected level of agreement for the professor and student interview groups is pew = 0.00001 + 0.5180 = 0.5181. For the combined professor and student groups, the weighted kappa statistic equals κW = (0.7217 - 0.5181)/(1 - 0.5181) = 0.423. The level of agreement between the two coders is 0.423 beyond that which is expected purely by chance after applying importance weights to each cell. This reliability statistic is notably smaller than the unadjusted kappa statistic because of the number of down-weighted cases where both coders agreed that the theme is not present in the interview responses.
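As a hedged sketch, the weighted version of the earlier Python example applies the Table 8 weights cell by cell and then pools the two groups exactly as in the worked example (the names and structure are again illustrative assumptions, not code from the paper):

    WEIGHTS = {"both_present": 0.80, "c2_only": 0.09, "c1_only": 0.10, "both_absent": 0.01}

    def weighted_group_po_pe(cells, grand_n, w=WEIGHTS):
        """Weighted observed (pow) and expected (pew) agreement for one group."""
        both, c1_only, c2_only, absent = cells
        wb, w2 = w["both_present"] * both, w["c2_only"] * c2_only    # Coder 2 "present" row
        w1, wa = w["c1_only"] * c1_only, w["both_absent"] * absent   # Coder 2 "not present" row
        pow_ = (wb + wa) / grand_n                                   # only the agreement cells enter pow
        p1_dot, p2_dot = (wb + w2) / grand_n, (w1 + wa) / grand_n
        p_dot1, p_dot2 = (wb + w1) / grand_n, (w2 + wa) / grand_n
        return pow_, p1_dot * p_dot1 + p2_dot * p_dot2

    N = 556
    groups = [(1, 7, 1, 19), (500, 2, 1, 25)]                        # professors, students
    pow_total = sum(weighted_group_po_pe(g, N)[0] for g in groups)   # about 0.7217
    pew_total = sum(weighted_group_po_pe(g, N)[1] for g in groups)   # about 0.518
    print(round((pow_total - pew_total) / (1 - pew_total), 3))       # about 0.423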
ANOVA Binary ICC

From the writings of Shrout and Fleiss (1979), the currently available ANOVA binary ICC that is appropriate for the current data set is based on what they refer to as ICC(3,1). More specifically, this version of the ICC is based on within mean squares and between mean squares for two or more coding groups/categories from an analysis of variance model modified for binary response variables by Elston (1977). This reliability statistic measures the consistency of the two ratings (Shrout & Fleiss, 1979), and is appropriate when two or more raters rate the same interview participants for some item of interest. ICC(3,1) assumes that the raters are fixed; that is, the same raters are utilized to code multiple sets of data. The statistic ICC(2,1), which assumes the coders are randomly selected from a larger population of raters (Shrout & Fleiss, 1979), is recommended for use but is not currently available for binomial response data.

The traditional formulae for these mean squares within and between, along with an adjusted sample size estimate, are provided in Table 11. In these formulae, k denotes the total number of groups or categories, Yi denotes the frequency of agreements (both coders indicate a theme is present, or both coders indicate a theme is not present) between coders for the ith group or category, ni is the total sample size for the ith group or category, and N is the total sample size across all groups or categories.


Using these estimates, the reliability estimate equals

$\rho_{AOV} = \frac{MS_B - MS_W}{MS_B + (n_0 - 1)\,MS_W}$

(Elston, Hill, & Smith, 1977; Ridout, Demétrio, & Firth, 1999).

Estimates from professor and student interview participants for calculating the ANOVA binary ICC are provided in Table 11. Given that k = 2 and N = 556, the adjusted sample size equals 54.2857. The within and between mean squares equal 0.0157 and 2.0854, respectively. Using these estimates, the ANOVA binary ICC equals

$\frac{MS_B - MS_W}{MS_B + (n_0 - 1)\,MS_W} = \frac{2.0854 - 0.0157}{2.0854 + (54.5827 - 1)(0.0157)} = 0.714,$

which denotes the consistency of coding between the two coders on the professor and student interview responses.

Table 11
Formulae and Estimates from Professor and Student Interview Participants for Calculating the ANOVA Binary ICC

Mean squares within (MSW):    $\frac{1}{N - k}\left[\sum_{i=1}^{k} Y_i - \sum_{i=1}^{k}\frac{Y_i^2}{n_i}\right] = \frac{1}{556 - 2}[545 - 536.303] = 0.0157$

Mean squares between (MSB):   $\frac{1}{k - 1}\left[\sum_{i=1}^{k}\frac{Y_i^2}{n_i} - \frac{1}{N}\left(\sum_{i=1}^{k} Y_i\right)^2\right] = \frac{1}{2 - 1}\left[536.303 - \frac{545^2}{556}\right] = 2.0854$

Adjusted sample size (n0):    $\frac{1}{k - 1}\left[N - \frac{1}{N}\sum_{i=1}^{k} n_i^2\right] = \frac{1}{2 - 1}\left[556 - \frac{528^2 + 28^2}{556}\right] = 54.5827$

Note: Yi denotes the total number of cases where both coders indicate that a theme either is or is not present in a given response.
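The Table 11 quantities can also be computed directly from the per-group agreement counts. The following Python sketch is an illustration of those formulae under the names invented here; because it carries full precision rather than the rounded intermediate values printed above, its output differs slightly from the reported figures but lands close to the 0.714 ICC.

    def anova_binary_icc(Y, n):
        """ANOVA binary ICC from per-group agreement counts, following the Table 11 formulae.

        Y[i] = responses in group i on which the coders agree (theme present or absent);
        n[i] = total responses in group i.
        """
        k, N = len(Y), sum(n)
        sum_y = sum(Y)
        sum_y2_over_n = sum(y * y / m for y, m in zip(Y, n))
        ms_w = (sum_y - sum_y2_over_n) / (N - k)
        ms_b = (sum_y2_over_n - sum_y ** 2 / N) / (k - 1)
        n0 = (N - sum(m * m for m in n) / N) / (k - 1)
        return (ms_b - ms_w) / (ms_b + (n0 - 1) * ms_w)

    # Professors: 20 of 28 responses coded identically; students: 525 of 528.
    print(round(anova_binary_icc([20, 525], [28, 528]), 2))  # about 0.71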
Kuder-Richardson 20

In their landmark article, Kuder and Richardson (1937) presented the derivation of the KR-20 statistic, a coefficient that they used to determine the reliability of test items. This estimator is a function of the sample size, the summation of item variances, and the total variance. Two observations about these formulae require further inquiry. First, these authors do not appear to discuss the distributional requirements of the data in relation to the calculation of the correlation rii, possibly due to the statistic's time of development in relation to the infancy of mathematical statistics. This vagueness has led to some incorrect calculations of the KR-20. Crocker and Algina (1986) present examples of the calculation of the KR-20 in Table 7.2 based on data from Table 7.1 (pp. 136-140). In Table 7.1, the correlation between the two split-halves is presented as ρAB = 0.34. It is not indicated that this statistic is the Pearson correlation.


This is problematic because this statistic assumes that the two random variables are continuous, when in actuality they are discrete. An appropriate statistic is Kendall's τc, and this correlation equals 0.35. As can be seen, the correlation may be notably underestimated, as may the KR-20, if the incorrect distribution is assumed. For the remainder of this paper, the Pearson correlation will be substituted with the Kendall τc correlation.

Second, Kuder and Richardson (1937) present formulae for the calculation of $\sigma_t^2$ and rii that are not mutually exclusive. This lack of exclusiveness has caused some confusion in appropriate calculations of the total variance $\sigma_t^2$. Lord and Novick (1968) indicated that this statistic is equal to coefficient alpha (continuous) under certain circumstances, and Crocker and Algina (1986) elaborated on this statement by indicating "This formula is identical to coefficient alpha with the substitution of piqi for $\sigma_i^2$" (p. 139). This is unfortunately incomplete.

Not only must this substitution be made for the numerator's variances, the denominator variances must also be adjusted in the same manner. That is, if the underlying distribution of the data is binomial, all estimators should be based on the level of measurement appropriate for the distribution. Otherwise, the KR-20 formula will be based on a ratio of a discrete variance to a continuous variance. The resulting total variance will be notably to substantially inflated. For the current paper, the KR-20 will be a function of a total variance based on the discrete level of measurement. This variance will equal the summation of the main and off diagonals of a variance-covariance matrix. These calculations are further detailed in the next section.

The KR-20 will be computed using the formula

$\frac{N}{N - 1}\left[1 - \frac{1}{\sigma_T^2}\sum_{i=1}^{k}\frac{Y_i}{n_i}\left(1 - \frac{Y_i}{n_i}\right)\right],$

where k denotes the total number of groups or categories, Yi denotes the number of agreements between coders for the ith group or category, ni is the total sample size for the ith group or category, and N is the total sample size across all groups or categories (Lord & Novick, 1968). The total variance ($\sigma_T^2$) for coder agreement patterns equals the summation of the elements of a variance-covariance matrix for binomial data (i.e., $\sigma_1^2 + \sigma_2^2 + 2\,COV(X_1, X_2) = \sigma_1^2 + \sigma_2^2 + 2\rho_{12}\sigma_1\sigma_2$) (Stapleton, 1995). The variance-covariance matrix takes the general form

$\Sigma = \begin{bmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 & \cdots & \rho_{ij}\sigma_i\sigma_j \\ \rho_{21}\sigma_2\sigma_1 & \sigma_2^2 & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{ij}\sigma_i\sigma_j & \cdots & \cdots & \sigma_n^2 \end{bmatrix}$

(Kim & Timm, 2007), and reduces to

$\Sigma = \begin{bmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 \\ \rho_{21}\sigma_2\sigma_1 & \sigma_2^2 \end{bmatrix}$

for a coding scheme comprised of two raters. In this matrix, the variances ($\sigma_1^2, \sigma_2^2$) of agreement for the ith group or category should be based on discrete expectations (Hogg, McKean, & Craig, 2004). The form of this variance equals the second moment minus the square of the first moment; that is, E(X²) - [E(X)]² (Ross, 1997). For continuous data, $E(X^2) = \int_{-\infty}^{+\infty} x^2 f(x)\,dx$ and $E(X) = \int_{-\infty}^{+\infty} x f(x)\,dx$, where f(x) denotes the probability density function (pdf) for continuous data. For the normal pdf, for example, E(X) = μ and E(X²) - [E(X)]² = σ².


For discrete data, $E(X^2) = \sum_i x_i^2\,\mathrm{Prob}(X = x_i)$ and $E(X) = \sum_i x_i\,\mathrm{Prob}(X = x_i)$, where Prob(X = xi) denotes the pdf for discrete data (Hogg & Craig, 1995). For the binomial pdf, E(X) = np and E(X²) - [E(X)]² = np(1 - p) (Efron & Tibshirani, 1993). If a discrete distribution cannot be assumed or is unknown, it is most appropriate to use the distribution-free expectation (Hettmansperger & McKean, 1998). Basic algebra is all that is needed to solve for E(X²) and E(X). For this last scenario it is also important to note that if the underlying distribution is discrete, methods assuming continuity for calculating E(X²) and E(X) should not be utilized, because the standard error can be substantially inflated, reducing the accuracy of statistical inference (Bartoszynski & Niewiadomska-Bugaj, 1996).

As with the calculation of E(X²) and E(X), the distribution of the data must also be considered in the calculation of correlations. Otherwise, standard errors will be inflated. For data that take the form of either the presence or absence of a theme, which clearly have a discrete distribution, the correlation should be based on distributions suitable for this type of data. In this paper, the correlation ρ12 for agreement patterns between the coders will be Kendall's τc (Bonett & Wright, 2000). This correlation can be readily estimated using the PROC CORR procedure in the statistical software package SAS.

Estimates for calculating the KR-20 based on coder agreement patterns for the professor and student interview groups are provided in Table 12. Letting x2 = 2 for non-agreed responses, the variance is 0.816 and 0.023, respectively, for the professor and undergraduate student groups. The Kendall τc correlation equals 0.881. Using these estimates, the covariance between the groups equals 0.121. The total variance then equals 1.081. The final component of the KR-20 formula is the proportion of agreement times one minus this proportion (i.e., pi(1 - pi)) for each of the groups. This estimate for the professor and undergraduate student interview groups equals 0.204 and 0.006, respectively. The sum of these values is 0.210. The KR-20 reliability estimate thus equals

$\frac{556}{556 - 1}\left[1 - \frac{0.210}{1.081}\right] = 0.807,$

which equals the reliability between professor and student interview responses on the theme of interest.
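How the per-group variances of 0.816 and 0.023 arise is not spelled out; a scoring of 0 for agreed responses and 2 for non-agreed responses reproduces them. The short Python sketch below rests on that assumption (and on names chosen here), so treat it as an illustration rather than the authors' computation:

    def discrete_variance(agreements, n, agreed_score=0, disagreed_score=2):
        """E(X^2) - [E(X)]^2 for one group's agreement indicator under the assumed scoring."""
        p_disagree = (n - agreements) / n
        e_x = agreed_score * (1 - p_disagree) + disagreed_score * p_disagree
        e_x2 = agreed_score ** 2 * (1 - p_disagree) + disagreed_score ** 2 * p_disagree
        return e_x2 - e_x ** 2

    print(round(discrete_variance(20, 28), 3))    # 0.816 for the professor group
    print(round(discrete_variance(525, 528), 3))  # 0.023 for the student group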


Table 12
Estimates from Professor and Student Interview Participants for Calculating the KR-20

Estimate                   Professor Group                            Student Group
Individual variances       0.816                                      0.023
Kendall τc correlation     0.881
Covariance                 (0.881)(0.816)^(1/2)(0.023)^(1/2) = 0.121
Total variance             0.816 + 0.023 + 2(0.121) = 1.081
pi(1 - pi)                 0.714(1 - 0.714) = 0.204                   0.994(1 - 0.994) = 0.006
Σ pi(1 - pi)               0.210
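A compact Python sketch of the Table 12 arithmetic (the names are assumptions made here; the inputs are the printed component estimates):

    def kr20_from_components(N, var1, var2, r12, agree_props):
        """KR-20 for two coder-agreement groups from the Table 12 components.

        var1, var2  : per-group discrete variances of the agreement indicator
        r12         : correlation between the groups' agreement patterns (Kendall tau-c here)
        agree_props : proportion of agreed-on responses in each group
        """
        covariance = r12 * (var1 ** 0.5) * (var2 ** 0.5)
        total_variance = var1 + var2 + 2 * covariance
        sum_pq = sum(p * (1 - p) for p in agree_props)
        return (N / (N - 1)) * (1 - sum_pq / total_variance)

    print(round(kr20_from_components(556, 0.816, 0.023, 0.881, [20/28, 525/528]), 3))  # about 0.807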

Discussion

This paper presented four quantitative methods for gauging the interrater reliability of qualitative findings following a binomial distribution (theme is present, theme is absent). The κ statistic is a measure of observed agreement beyond the expected agreement between two or more coders. The κW statistic has the same interpretation as the kappa statistic, but permits differential weighting of cell frequencies reflecting patterns of coder agreement. The ANOVA (binary) ICC measures the degree to which two or more ratings are consistent. The KR-20 statistic is a reliability estimator based on a ratio of variances. That being said, it is important to note that the reliability of binomial coding patterns is invalid if based on continuous agreement statistics (Maclure & Willett, 1987).

Some researchers have developed tools for interpreting reliability coefficients, but do not provide guidelines for determining the sufficiency of such statistics. According to Landis and Koch (1977), coefficients of 0.41-0.60, 0.61-0.80, and 0.81-1.00 indicate "Moderate," "Substantial," and "Almost Perfect" agreement, in that order. George and Mallery (2003) indicate that reliability coefficients of 0.9-1.0 are "Excellent," of 0.8-0.9 are "Good," of 0.7-0.8 are "Acceptable," of 0.6-0.7 are "Questionable," of 0.5-0.6 are "Poor," and less than 0.5 are "Unacceptable," where coefficients of at least 0.8 should be a researcher's target.

According to these tools, the obtained κ of 0.891 demonstrates Almost Perfect to Good agreement between the coders. The κW statistic of 0.423 demonstrates Fair to Unacceptable agreement between the coders. The obtained ANOVA ICC of 0.714 demonstrates Substantial to Acceptable agreement between the coders. Last, the obtained KR-20 of 0.807 demonstrates Substantial to Good agreement between the coders.

The resulting question from these findings is "Are these reliability estimates sufficient?" The answer is dependent upon the focus of the study, the complexity of the theme(s) under investigation, and the comfort level of the researcher. The more complicated the topic being investigated, the lower the proportion of observed agreement between the coders may be. According to Nunnally (1978), Cascio (1991), and Schmitt (1996), reliabilities of at least 0.70 are typically sufficient for use. The κ statistic, ANOVA ICC, and KR-20 meet this cutoff, demonstrating acceptable reliability coefficients.


What happens if the researcher has an acceptable level of reliability in mind, but does not meet the requirement? What methods should be employed in this situation? If a desired reliability coefficient is not achieved, it is recommended that the coders revisit their coding decisions on patterns of disagreement about the presence of themes in the binomial data (e.g., interview responses). After the coders revisit their coding decisions, the reliability coefficient would be re-estimated. This process would be repeated until a desired reliability coefficient is achieved. Although this process may seem tedious, the confidence with which the coders identified themes increases and thus improves the interpretability of the data.

Future Research

Three areas of research are recommended for furthering the use of reliability estimators for discrete coding patterns of binomial responses (e.g., qualitative interview data). In the current paper, estimators that can be used to gauge agreement pattern reliability within a theme were presented. Quality reliability estimators applicable across themes should be further developed and investigated. This would allow researchers to determine the reliability of one's grounded theory, for example, as opposed to a component of the theory.

Sample size estimation methods also should be further developed for reliability estimators, but these are presently limited to the κ statistic (Bonett, 2002; Feldt & Ankenmann, 1998). Sample size estimation would inform the researcher, in the example of the current paper, as to how many interviews should be conducted in order to achieve a desired reliability coefficient on their coded qualitative interview data with a certain likelihood prior to the initiation of data collection.

The current study simulated coder agreement data that follow a binomial probability density function. Further investigation should be conducted to determine whether there are more appropriate discrete distributions to model agreement data. Possible densities may include the geometric, negative binomial, beta-binomial, and Poisson, for example. This development could lead to better estimators of reliability coefficients (e.g., for the investigation of rare events).

References

Armstrong, D., Gosling, A., Weinman, J., & Marteau, T. (1997). The place of inter-rater reliability in qualitative research: An empirical study. Sociology, 31(3), 597-606.
Bartoszynski, R., & Niewiadomska-Bugaj, M. (1996). Probability and statistical inference. New York, NY: John Wiley.
Benaquisto, L. (2008). Axial coding. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, pp. 51-52). Thousand Oaks, CA: Sage.
Benaquisto, L. (2008). Coding frame. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (pp. 88-89). Thousand Oaks, CA: Sage.
Benaquisto, L. (2008). Open coding. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, pp. 581-582). Thousand Oaks, CA: Sage.
Benaquisto, L. (2008). Selective coding. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods. Thousand Oaks, CA: Sage.


Bonett, D. G. (2002). Sample size requirements for testing and estimating coefficient alpha. Journal of Educational and Behavioral Statistics, 27, 335-340.
Bonett, D. G., & Wright, T. A. (2000). Sample size requirements for estimating Pearson, Kendall, and Spearman correlations. Psychometrika, 65, 23-28.
Brodsky, A. E. (2008). Researcher as instrument. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, p. 766). Thousand Oaks, CA: Sage.
Burla, L., Knierim, B., Barth, J., Liewald, K., Duetz, M., & Abel, T. (2008). From text to codings: Intercoder reliability assessment in qualitative content analysis. Nursing Research, 57, 113-117.
Cascio, W. F. (1991). Applied psychology in personnel management (4th ed.). Englewood Cliffs, NJ: Prentice-Hall International.
Cheek, J. (2008). Funding. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, pp. 360-363). Thousand Oaks, CA: Sage.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213-220.
Coryn, C. L. S. (2007). The holy trinity of methodological rigor: A skeptical view. Journal of MultiDisciplinary Evaluation, 4(7), 26-31.
Creswell, J. W. (2007). Qualitative inquiry & research design: Choosing among five approaches (2nd ed.). Thousand Oaks, CA: Sage.
Crocker, L., & Algina, J. (1986). Introduction to classical & modern test theory. Fort Worth, TX: Holt, Rinehart, & Winston.
Davis, C. S. (2008). Hypothesis. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, pp. 408-409). Thousand Oaks, CA: Sage.
Dillon, W. R., & Mulani, N. (1984). A probabilistic latent class model for assessing inter-judge reliability. Multivariate Behavioral Research, 19, 438-458.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York, NY: Chapman & Hall/CRC.
Elston, R. C., Hill, W. G., & Smith, C. (1977). Query: Estimating heritability of a dichotomous trait. Biometrics, 33, 231-236.
Everitt, B. S. (1968). Moments of the statistics kappa and weighted kappa. The British Journal of Mathematical and Statistical Psychology, 21, 97-103.
Feldt, L. S., & Ankenmann, R. D. (1998). Appropriate sample size for comparison alpha reliabilities. Applied Psychological Measurement, 22, 170-178.
Firmin, M. W. (2008). Replication. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, pp. 754-755). Thousand Oaks, CA: Sage.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378-382.
Fleiss, J. L., Cohen, J., & Everitt, B. S. (1969). Large sample standard errors of kappa and weighted kappa. Psychological Bulletin, 72, 323-327.
Fleiss, J. L., & Cuzick, J. (1979). The reliability of dichotomous judgments: Unequal numbers of judges per subject. Applied Psychological Measurement, 3, 537-542.


George, D., & Mallery, P. (2003). SPSS for Windows step by step: A simple guide and reference, 11.0 update (4th ed.). Boston, MA: Allyn & Bacon.
Given, L. M., & Saumure, K. (2008). Trustworthiness. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, pp. 895-896). Thousand Oaks, CA: Sage.
Golafshani, N. (2003). Understanding reliability and validity in qualitative research. The Qualitative Report, 8(4), 597-607.
Greene, J. C. (2007). Mixed methods in social inquiry. Thousand Oaks, CA: Sage.
Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.
Hettmansperger, T. P., & McKean, J. (1998). Kendall's library of statistics 5: Robust nonparametric statistical models. London: Arnold.
Hogg, R. V., & Craig, A. T. (1995). Introduction to mathematical statistics (5th ed.). Upper Saddle River, NJ: Prentice Hall.
Hogg, R. V., McKean, J. W., & Craig, A. T. (2004). Introduction to mathematical statistics (6th ed.). Upper Saddle River, NJ: Prentice Hall.
Hopkins, K. D. (1998). Educational and psychological measurement and evaluation (8th ed.). Boston, MA: Allyn and Bacon.
Jensen, D. (2008). Confirmability. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, p. 112). Thousand Oaks, CA: Sage.
Jensen, D. (2008). Credibility. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, pp. 138-139). Thousand Oaks, CA: Sage.
Jensen, D. (2008). Dependability. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, pp. 208-209). Thousand Oaks, CA: Sage.
Karlin, S., Cameron, P. E., & Williams, P. (1981). Sibling and parent-offspring correlation with variable family age. Proceedings of the National Academy of Sciences, U.S.A., 78, 2664-2668.
Kim, K., & Timm, N. (2007). Univariate and multivariate general linear models: Theory and applications with SAS (2nd ed.). New York, NY: Chapman & Hall/CRC.
Kleinman, J. C. (1973). Proportions with extraneous variance: Single and independent samples. Journal of the American Statistical Association, 68, 46-54.
Krippendorf, K. (2004). Content analysis: An introduction to its methodology (2nd ed.). Thousand Oaks, CA: Sage.
Kuder, G. F., & Richardson, M. W. (1937). The theory of estimation of test reliability. Psychometrika, 2, 151-160.
Landis, J. R., & Koch, G. C. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174.
Lincoln, Y. S., & Guba, E. G. (1985). Naturalistic inquiry. Newbury Park, CA: Sage.
Lipsitz, S. R., Laird, N. M., & Brennan, T. A. (1994). Simple moment estimates of the κ-coefficient and its variance. Applied Statistics, 43, 309-323.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Maclure, M., & Willett, W. C. (1987). Misinterpretation and misuse of the kappa statistic. Journal of Epidemiology, 126, 161-169.
Magee, B. (1985). Popper. London: Routledge Falmer.


Mak, T. K. (1988). Analyzing intraclass correlation for dichotomous variables. Applied Statistics, 37, 344-352.
Marshall, C., & Rossman, G. B. (2006). Designing qualitative research (4th ed.). Thousand Oaks, CA: Sage.
Maxwell, A. E. (1977). Coefficients of agreement between observers and their interpretation. British Journal of Psychiatry, 130, 79-83.
Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook (2nd ed.). Thousand Oaks, CA: Sage.
Miller, P. (2008). Reliability. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, pp. 753-754). Thousand Oaks, CA: Sage.
Mitchell, S. K. (1979). Interobserver agreement, reliability, and generalizability of data collected in observational studies. Psychological Bulletin, 86, 376-390.
Morse, J. M., Barrett, M., Mayan, M., Olson, K., & Spiers, J. (2002). Verification strategies for establishing reliability and validity in qualitative research. International Journal of Qualitative Methods, 1(2), 13-22.
Nelder, J. A., & Pregibon, D. (1987). An extended quasi-likelihood function. Biometrika, 74, 221-232.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Paley, J. (2008). Positivism. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, pp. 646-650). Thousand Oaks, CA: Sage.
Ridout, M. S., Demétrio, C. G. B., & Firth, D. (1999). Estimating intraclass correlations for binary data. Biometrics, 55, 137-148.
Ross, S. (1997). A first course in probability (5th ed.). Upper Saddle River, NJ: Prentice Hall.
Rozzeboom, W. W. (1966). Foundations of the theory of prediction. Homewood, IL: Dorsey.
Saumure, K., & Given, L. M. (2008). Rigor in qualitative research. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, pp. 795-796). Thousand Oaks, CA: Sage.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8, 81-84.
Seale, C. (1999). Quality in qualitative research. Qualitative Inquiry, 5(4), 465-478.
Smith, D. M. (1983). Algorithm AS189: Maximum likelihood estimation of the parameters of the beta binomial distribution. Applied Statistics, 32, 196-204.
Soeken, K. L., & Prescott, P. A. (1986). Issues in the use of kappa to estimate reliability. Medical Care, 24, 733-741.
Stapleton, J. H. (1995). Linear statistical models. New York, NY: John Wiley & Sons.
Stenbacka, C. (2001). Qualitative research requires quality concepts of its own. Management Decision, 39(7), 551-555.
Tamura, R. N., & Young, S. S. (1987). A stabilized moment estimator for the beta-binomial distribution. Biometrics, 43, 813-824.
van den Hoonaard, W. C. (2008). Inter- and intracoder reliability. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, pp. 445-446). Thousand Oaks, CA: Sage.
Yamamoto, E., & Yanagimoto, T. (1992). Moment estimators for the binomial distribution. Journal of Applied Statistics, 19, 273-283.

JOURNAL OF CONSUMER PSYCHOLOGY, 10(1&2), 71-73
Copyright © 2001, Lawrence Erlbaum Associates, Inc.

Iacobucci, Dawn (Ed.) (2001), Journal of Consumer Psychology's Special Issue on Methodological and Statistical Concerns of the Experimental Behavioral Researcher, 10(1&2), Mahwah, NJ: Lawrence Erlbaum Associates, 71-73.

Interrater Reliability

IV. INTERRATER RELIABILITY ASSESSMENT IN CONTENT ANALYSIS

What is the best way to assess reliability in content analysis? Is percentage agreement between judges best (NO!)?

Or, stated in a slightly different manner from another researcher: There are several tests that give indexes of rater agreement for nominal data and some other tests or coefficients that give indexes of interrater reliability for metric scale data. For my data based on metric scales, I have established rater reliability using the intraclass correlation coefficient, but I also want to look at interrater agreement (for two raters). What appropriate test is there for this? I have hunted around but cannot find anything. I have thought that a simple percentage of agreement (i.e., 1 point difference using a 10-point scale is 10% disagreement) adjusted for the amount of variance for each question may be suitable.

Professor Kent Grayson
London Business School

Kolbe and Burnett (1991) offered a nice (and pretty damning) critique of the quality of content analysis in consumer research. They highlight a number of criticisms, one of which is this concern about percentage agreement as a basis for judging the quality of content analysis.

The basic concern is that percentages do not take into account the likelihood of chance agreement between raters. Chance is likely to inflate agreement percentages in all cases, but especially with two coders and low degrees of freedom on each coding choice (i.e., few coding categories). That is, if Coder A and Coder B have to decide yes-no whether a coding unit has property X, then mere chance will have them agreeing at least 50% of the time (i.e., in a 2 x 2 table with codes randomly distributed, 25% in each cell, there would be 50% of the scores already randomly along the diagonal, which would represent spurious apparent agreement).

Several scholars have offered statistics that try to correct for chance agreement. The one that I have been using lately is "Krippendorff's alpha" (Krippendorff, 1980), which he described in his chapter on reliability. I use it because the math seems intuitive; it seems to be roughly based on the observed and expected logic that underlies chi-square. Alternatively, Hughes and Garrett (1990) outlined a number of different options (including Krippendorff's, 1980) and then offered their own solution based on a generalizability theory approach. Rust and Cooil (1994) took a "proportional reduction in loss" approach and provided a general framework for reliability indexes for quantitative and qualitative data.

Professor Roland Rust
Vanderbilt University

Recent work in psychometrics (Cooil & Rust, 1994, 1995; Rust & Cooil, 1994) has set forth the concept of proportional reduction of loss (PRL) as a general criterion of reliability that subsumes both the categorical case and the metric case. This criterion considers the expected loss to a researcher from wrong decisions, and it turns out to include some popularly used methods (e.g., Cronbach's alpha: Cronbach, 1951; generalizability theory: Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Perreault and Leigh's, 1989, measure) as special cases.

Simply using percentage of agreement between judges is not so good because some agreement is sure to occur, if only by chance, and the fewer the number of categories, the more random agreement is likely to occur, thus making the reliability appear better than it really is. This random agreement was the conceptual basis of Cohen's kappa (Cohen, 1960), a popular measure that is not a PRL measure and has other bad properties (e.g., overconservatism and, under some conditions, inability to reach one, even if there is perfect agreement).

Editor: For the discussion that follows, imagine a concrete example. Perhaps two experts have been asked to code whether 60 print advertisements are "emotional imagery," "rational informative," or "mixed ambiguous" in appeal. The ratings table that follows depicts the coding scheme with the three categories; let us call the number of coding categories "c." The rows and columns represent the codes assigned to the advertisements by the two independent judges. For example, n21 represents one kind of disagreement: the number of ads that Rater 1 judged as rational that Rater 2 thought emotional. (The notation n1+ represents the sum for the first row, having aggregated over the columns.) If the raters agreed completely, all ads would fall into the n11, n22, or n33 cells, with zeros off the main diagonal.

                          Rater 2
Rater 1        emotional    rational    mixed     row sums
emotional      n11          n12         n13       n1+
rational       n21          n22         n23       n2+
mixed          n31          n32         n33       n3+
column sums    n+1          n+2         n+3       n++ (e.g., 60)

Cohen (1960) proposed κ, kappa, the "coefficient of agreement," drawing the analogy (pp. 37-38) between interrater agreement and item reliability as pursuits of evaluating the quality of data. His index was intended as an improvement on the simple (he says "primitive") computation of percentage agreement. Percentage agreement is computed as the sum of the diagonal (agreed-on) ratings divided by the number of units being coded: (n11 + n22 + n33)/n++. Cohen stated, "It takes relatively little in the way of sophistication to appreciate the inadequacy of this solution" (p. 38). The problem to which he speaks is that this index does not correct for the fact that there will be some agreement simply due to chance. Hughes and Garrett (1990) and Kolbe and Burnett (1991), respectively, reported 65% and 32% of the articles they reviewed as relying on percentage agreement as the primary index of interrater reliability. Thus, although Cohen's criticism was clear 40 years ago, these reviews, only 10 years ago, suggest that the issues and solutions still have not permeated the social sciences.

Cohen (1960, pp. 38-39) also criticized the use of the chi-square test of association in this application, because the requirement of agreement is more stringent. That is, agreement requires all non-zero frequencies to be along the diagonal; association could have all frequencies concentrated in off-diagonal cells.

Hence, Cohen (1960) proposed kappa:

$\kappa = \frac{p_a - p_c}{1 - p_c}$

where pa is the proportion of agreed-on judgments (in our example, pa = (n11 + n22 + n33)/n++). The term pc is the proportion of agreements one would expect by chance; pc = (e11 + e22 + e33)/n++, where eii = (ni+/n++)(n+i/n++)(n++); for example, the number of agreements expected by chance in the (2, 2) (rational code) cell would be e22 = (n2+/n++)(n+2/n++)(n++) (just as you would compute expected frequencies in a chi-square test of independence in a two-way table; it is just that here, we care only about the diagonal entries in the matrix). Cohen also provided his equation in terms of frequencies rather than proportions and an equation for an approximate standard error for the index.

Researchers have criticized the kappa index for some of its properties and proposed extensions (e.g., Brennan & Prediger, 1981; Fleiss, 1971; Hubert, 1977; Kaye, 1980; Kraemer, 1980; Tanner & Young, 1985). To be fair, Cohen (1960, p. 42) anticipated some of these qualities (e.g., that the upper bound for kappa can be less than 1.0, depending on the marginal distributions), and so he provided an equation to determine the maximum kappa one might achieve.

If Cohen's (1960) kappa has some problems, what might serve as a superior index? Perreault and Leigh (1989) reasoned through expected levels of chance agreement in a way that did not depend on the marginal frequencies. They defined an "index of reliability," Ir, as follows (p. 141):

$I_r = \sqrt{\left(p_a - \frac{1}{c}\right)\left(\frac{c}{c - 1}\right)}$

when pa > (1/c). If pa < (1/c), Ir is set to zero. (Recall that c is the number of coding categories as defined previously.) They also provide an estimated standard error (p. 143):

$s_I = \sqrt{\frac{I_r(1 - I_r)}{n_{++}}},$

which is important, because when the condition holds that Ir x n++ > 5, these two statistics may be used in conjunction to form a confidence interval, in essence a test of the significance of the reliability index:

$I_r \pm (1.96)\,s_I.$
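As a minimal sketch of the index just described (the 48-of-60 agreement count is a made-up illustration, not data from this discussion, and the function name is an assumption):

    import math

    def perreault_leigh_ir(p_a, c, n):
        """Perreault and Leigh's (1989) reliability index, its standard error, and a 95% CI.

        p_a = observed proportion of agreement, c = number of coding categories,
        n = number of coded units (n++ in the notation above).
        """
        if p_a < 1.0 / c:
            return 0.0, None, None
        ir = math.sqrt((p_a - 1.0 / c) * (c / (c - 1.0)))
        se = math.sqrt(ir * (1.0 - ir) / n)
        return ir, se, (ir - 1.96 * se, ir + 1.96 * se)

    # Hypothetical illustration: 60 ads, 3 categories, judges agree on 48 of them.
    ir, se, ci = perreault_leigh_ir(p_a=48 / 60, c=3, n=60)
    print(round(ir, 3), round(se, 3), [round(x, 3) for x in ci])  # 0.837 0.048 [0.743, 0.93]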
Rust and Cooil (1994) is another level of achievement, extending Perreault and Leigh's (1989) index to the situation of three or more raters and creating a framework that subsumes quantitative and qualitative indexes of reliability (e.g., coefficient alpha for rating scales and interrater agreement for categorical coding). Hughes and Garrett (1990) used generalizability theory, which is based on a random-effects analysis of a variance-like modeling approach to apportion variance due to rater, stimuli, coding conditions, and so on. (Hughes & Garrett also criticized the use of intraclass correlation coefficients as insensitive to differences between coders due to mean or variance; p. 187.) Ubersax (1988) attempted to simultaneously estimate reliability and validity from coding judgments using a latent class approach, which is prevalent in marketing.

In conclusion, perhaps we can at least agree to finally banish the simple percentage agreement as an acceptable index of interrater reliability. In terms of an index suited for general endorsement, Perreault and Leigh's (1989) index (discussed earlier) would seem to fit many research circumstances (e.g., two raters). Furthermore, it appears sufficiently straightforward that one could compute the index without a mathematically induced coronary.

REFERENCES

Brennan, Robert L., & Prediger, Dale J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41, 687-699.
Cohen, Jacob. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Cooil, Bruce, & Rust, Roland T. (1994). Reliability and expected loss: A unifying principle. Psychometrika, 59, 203-216.
Cooil, Bruce, & Rust, Roland T. (1995). General estimators for the reliability of qualitative data. Psychometrika, 60, 199-220.
Cronbach, Lee J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Cronbach, Lee J., Gleser, Goldine C., Nanda, Harinder, & Rajaratnam, Nageswari. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Fleiss, Joseph L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378-382.
Hubert, Lawrence. (1977). Kappa revisited. Psychological Bulletin, 84, 289-297.
Hughes, Marie Adele, & Garrett, Dennis E. (1990). Intercoder reliability estimation approaches in marketing: A generalizability theory framework for quantitative data. Journal of Marketing Research, 27, 185-195.
Kaye, Kenneth. (1980). Estimating false alarms and missed events from interobserver agreement: A rationale. Psychological Bulletin, 88, 458-468.
Kolbe, Richard H., & Burnett, Melissa S. (1991). Content-analysis research: An examination of applications with directives for improving research reliability and objectivity. Journal of Consumer Research, 18, 243-250.
Kraemer, Helena Chmura. (1980). Extension of the kappa coefficient. Biometrics, 36, 207-216.
Krippendorff, Klaus. (1980). Content analysis: An introduction to its methodology. Newbury Park, CA: Sage.
Perreault, William D., Jr., & Leigh, Laurence E. (1989). Reliability of nominal data based on qualitative judgments. Journal of Marketing Research, 26, 135-148.
Rust, Roland T., & Cooil, Bruce. (1994). Reliability measures for qualitative data: Theory and implications. Journal of Marketing Research, 31, 1-14.
Tanner, Martin A., & Young, Michael A. (1985). Modeling agreement among raters. Journal of the American Statistical Association, 80(389), 175-180.
Ubersax, John S. (1988). Validity inferences from inter-observer agreement. Psychological Bulletin, 104, 405-416.
Literature Review of Inter-rater Reliability

Inter-rater reliability, simply defined, is the extent to which information is being collected in a consistent manner (Keyton, et al, 2004). That is, are the information-collecting mechanism and the procedures used to collect the information solid enough that the same results can repeatedly be obtained? This should not be left to chance. A good inter-rater reliability rate (combined with solid survey/interview construction procedures) allows project managers to state with confidence that the information they have collected is dependable.

Statistical measures are used to assess inter-rater reliability in order to demonstrate that the similarity of the answers collected reflects more than simple chance (Krippendorf, 2004a).

Inter-rater reliability also alerts project managers to problems that may occur in

the research process (Capwell, 1997; Keyton, et al, 2004; Krippendorf, 2004a,b;

Neuendorf, 2002). These problems include poorly executed coding procedures in

qualitative surveys/interviews (such as a poor coding scheme, inadequate coder training,

coder fatigue, or the presence of a rogue coder all examined in a later section of this

literature review) as well as problems regarding poor survey/interview administration

(facilitators rushing the process, mistakes on part of those recording answers, the

presence of a rogue administrator) or design (see Survey Methods, Interview/Re-Interview

Methods, or Interview/Re-Interview Design literature reviews). From all of the potential

problems listed here alone, it is evident measuring inter-rater reliability is important in

the interview and re-interview process.


Preparing qualitative/open-ended data for inter-rater reliability checks

If closed data was not collected for the survey/interview, then the data will have

to be coded before it is analyzed for inter-rater reliability. Even if closed data was

collected, then coding may be important because in many cases closed-ended data has a

large amount of possibilities. A common consideration, the YES/NO priority (Green,

2004), requires answers to be placed into yes or no paradigms as a simple data coding

mechanism for determining inter-rater reliability. For instance, it would be difficult to

determine inter-rater reliability for information such as birthdates. Instead of recording

birthdates, then, it can be determined whether the two data collections netted the same

result. If so, then YES can be recorded for each respective survey. If not, then YES

should be recorded for one survey and NO for the other (do not enter NO for both, as that

would indicate agreement). While placing qualitative data into a YES/NO priority could

be a working method for the information collected in the ConQIR Consortium given the

high likelihood that interview data will match, the forced categorical separation is not

considered to be the best available practice and could prove faulty in accepting or

rejecting hypotheses (or for applying analyzed data toward other functions). It should,

however, be sufficient in evaluating whether reliable survey data is being obtained for

agency use. For best results, the survey design should be created with reliability checks in

mind, employing either a YES/NO choice option (this is different than what is reviewed

above; a YES/NO option would include questions like "Were you born before July 13,

1979?" where the participant would have to answer yes or no) or a Likert-scale type

mechanism. See the Interview/Re-Interview Design literature review for more details.
How to compute inter-rater reliability

Fortunately, computing inter-rater reliability is a relatively easy process involving

a simple mathematical formula based on a complicated statistical proof (Keyton, et al,

2004). In the case of qualitative studies, where survey or interview questions are open-

ended, some sort of coding scheme will need to be put into place before using this

formula (Friedman, et al, 2003; Keyton, et al, 2004). For closed-ended surveys or

interviews where participants are forced to choose one choice, then the collected data is

immediately ready for inter-rater checks (although quantitative checks often produce

lower reliability scores, especially when the likert scale is used) (Friedman, et al, 2003).

To compute inter-rater reliability in quantitative studies (where closed-answer

question data is collected using a likert scale, a series of options, or yes/no answers),

follow these steps to determine Cohen's kappa (1960), a statistical measure of

inter-rater reliability:

1. Arrange the responses from the two different surveys/interviews into a

contingency table. This means you will create a table that demonstrates,

essentially, how many of the answers agreed and how many answers

disagreed (and how much they disagreed, even). For example, if two different

survey/interview administrators asked ten yes or no questions, their answers

would first be laid out and observed:

Question Number 1 2 3 4 5 6 7 8 9 10

Interviewer #1 Y N Y N Y Y Y Y Y Y

Interviewer #2 Y N Y N Y Y Y N Y N
From this data, a contingency table would be created:

RATER #1 (Going across)

RATER #2 (Going down) YES NO

YES 6 0

NO 2 2

Notice that the number six (6) is entered in the first column because when looking

at the answers there were six times when both interviewers found a YES answer to

the same question. Accordingly, they are placed where the two YES answers

overlap in the table (with the YES going across the top of the table representing

Rater/Interviewer #1 and the YES going down the left side of the table

representing Rater/Interviewer #2). A zero (0) is entered in the second column in

the first row because for that particular intersection in the table there were no

occurrences (that is, Interviewer/Rater #1 never found a NO answer when

Interviewer/Rater #2 found a YES). The number two (2) is entered in the first

column of the second row since Interviewer/Rater #1 found a YES answer two

times when Interviewer/Rater #2 found a NO; and a two (2) is entered in the

second column of the second row because both Interviewer/Rater #1 and

Interviewer/Rater #2 found NO answers to the same question two different times.

NOTE: It is important to consider that the above table is for a YES/NO type survey. If a

different number of answers are available for the questions in a survey, then the number

of answers should be taken into consideration in creating the table. For instance, if a five
question likert-scale were used in a survey/interview, then the table would have five rows

and five columns (and all answers would be placed into the table accordingly).

2. Sum the row and column totals for the items. To find the sum for the first

row in the previous example, the number six would be added to the number

zero for a first row total of six. The number two would be added to the

number two for a second row total of four. Then the columns would be added.

The first column would find six being added to two for a total of eight; and the

second column would find zero being added to two for a total of two.

3. Add the respective sums from step two together. For the running example,

six (first row total) would be added to four (second row total) for a row total

of ten (10). Eight (first column total) would be added to two (second column

total) for a column total of ten (10). At this point, it can be determined

whether the data has been entered and computed correctly by whether or not

the row total matches the column total. In the case of this example, it can be

seen that the data seems to be in order since both the row and column total

equal ten.

4. Add all of the agreement cells from the contingency table together. In the

running example, this would lead to six being added to two for a total of eight

because there were six times where the YES answers matched from both

interviewers/raters (as designated by the first column in the first row) and two

times where the NO answers matched from both interviewers/raters (as

designated by the second column in the second row). The sum of agreement

then, and the answer to this step, would be eight (8). The agreement cells will
always appear in a diagonal pattern across the chart so, for instance, if

participants had five possibilities for answers then there should be five cells

going across and down the chart in a diagonal pattern that will be added.

NOTE: At this point simple agreement can be computed by dividing the answer in step

four by the total number of items (the row/column total from step three). In the case of this example, that would lead to eight being

divided by ten for a result of 0.8. This number would be rejected by many researchers,

however, since it does not take into account the probability that some of these agreements

in answers could have been by chance. That is why the rest of the steps must be followed

to determine a more accurate assessment of inter-rater reliability.

5. Compute the expected frequency for each of the agreement cells

appearing in the diagonal pattern going across the chart. To do this, find

the row total for the first agreement cell (row one column one) and multiply

that by the column total for the same cell. Divide this by the total number

possible for all answers (this is the row/column total from step three). So, for

this example, first the cell containing the number six would be located (since

it is the first agreement cell located in row one column one) and the column

and row totals would be multiplied by each other (these were found in step

two) and then divided by the total: 6 x 8 = 48; 48/10 = 4.8. The next diagonal
cell (one over to the right and one down) is computed in the same way:
2 x 4 = 8; 8/10 = 0.8. Since this is the final cell in the diagonal, this is the final

computation that needs to be made in this step for the sample problem;

however, if more answers were possible, then the step would be repeated as

many times as there are answer options. For a five-point Likert scale, for instance,
the process would be repeated for five agreement cells going across the chart

diagonally in order to consider how those answers matched up and provide a

full account of inter-rater reliability.

6. Add all of the expected frequencies found in step five together. This

represents the expected frequencies of agreement by chance. For the example

used in this literature review, that would be 4.8 + 0.8 for a sum of 5.6. For a

five-point Likert scale, all five of the totals found in step five would be added

together.

7. Compute kappa. To do this, take the answer from step four and subtract the

answer from step six. Place the result of that computation aside. Then take the

total number of items from the survey/interview and subtract the answer from

step six. After this has been completed, take the first computation from this

step (the one that was set aside) and divide it by the second computation from

this step. The resulting computation represents kappa; in other words, kappa = (observed
agreements - expected chance agreements) / (total items - expected chance agreements). For the running
example that has been provided in this literature review, it would look like
this: 8 - 5.6 = 2.4; 10 - 5.6 = 4.4; 2.4/4.4 = 0.545 (a worked code sketch of the full
procedure appears after this list of steps).

8. Determine whether the reliability rate is satisfactory. If kappa is at 0.7 or

higher, then the inter-rater reliability rate is generally considered satisfactory

(CITE). If not, then it is often rejected.
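To make the eight steps above concrete, here is a minimal Python sketch (added for illustration; it is not part of the original procedure, and names such as cohens_kappa, rater1, and rater2 are hypothetical choices). It builds the contingency table from the two interviewers' answers, sums the agreement cells, computes the expected chance agreements, and then computes kappa for the running example.

from collections import Counter

def cohens_kappa(ratings_1, ratings_2):
    # Cohen's kappa for two raters' categorical ratings of the same items.
    assert len(ratings_1) == len(ratings_2)
    n = len(ratings_1)
    categories = sorted(set(ratings_1) | set(ratings_2))

    # Step 1: build the contingency table (rows = rater 2, columns = rater 1).
    table = Counter(zip(ratings_2, ratings_1))

    # Steps 2-3: row and column totals.
    row_totals = {c: sum(table[(c, k)] for k in categories) for c in categories}
    col_totals = {c: sum(table[(k, c)] for k in categories) for c in categories}

    # Step 4: observed agreements (the diagonal cells).
    observed = sum(table[(c, c)] for c in categories)

    # Steps 5-6: expected chance agreements for the diagonal cells.
    expected = sum(row_totals[c] * col_totals[c] / n for c in categories)

    # Step 7: kappa.
    return (observed - expected) / (n - expected)

rater1 = ["Y", "N", "Y", "N", "Y", "Y", "Y", "Y", "Y", "Y"]
rater2 = ["Y", "N", "Y", "N", "Y", "Y", "Y", "N", "Y", "N"]
print(round(cohens_kappa(rater1, rater2), 3))  # prints 0.545

Because the sketch works directly from category labels, the same code would handle a five-point Likert scale without modification: the contingency table would simply have five rows and five columns.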

What to do if inter-rater reliability is not at an appropriate level

Unfortunately, if inter-rater reliability is not at the appropriate level (generally

0.7) then it is often recommended that the data be thrown out (Krippendorff, 2004a). In
cases such as these, it is often wise to administer an additional round of data collection so a third
set of information can be compared to the other collected data (and calculated against both

in order to determine if an acceptable inter-rater reliability level has been achieved with

either of the previous data collecting attempts). If many cases of inter-rater issues are

occurring, then the data from these cases can often be observed in order to determine

what the problem may be (Keyton, et al, 2004). If data has been prepared for inter-rater

checks from qualitative collection measures, for instance, the coding scheme used to

prepare the data may be examined.

It may also be helpful to check with the person who coded the data to make sure

they understood the coding procedure (Keyton, et al, 2004). This inquiry can also include

questions about whether they became fatigued during the coding process (often those

coding large sets of information tend to make more mistakes) and whether or not they

agree with the process selected for coding (Keyton, et al, 2004). In some cases a rogue

coder may be the culprit for failure to achieve inter-rater reliability (Neuendorf, 2002).

Rogue coders are coders who disapprove of the methods used for analyzing the data and

who assert their own coding paradigms. Facilitators of projects may also be to blame for

the low inter-rater reliability, especially if they have rushed the process (causing rushed

and hasty coding), required one individual to code a large amount of information (leading

to fatigue), or if the administrator has tampered with the data (Keyton, et al, 2004).
References

Capwell, A. (1997). Chick flicks: An analysis of self-disclosure in friendships. Cleveland:

Cleveland State.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and

Psychological Measurement, 20, 37-46.

Friedman, P. G., Chidester, P. J., Kidd, M. A., Lewis, J. L., Manning, J. M., Morris, T.

M., Pilgram, M. D., Richards, K., Menzie, K., & Bell, J. (2003). Analysis of

ethnographic interview research procedures in communication studies: Prevailing

norms and exciting innovations. National Communication Association, Miami,

FL.

Green, B. (2004). Personal construct psychology and content analysis. Personal

Construct Theory and Practice, 1, 82-91.

Keyton, J., King, T., Mabachi, N. M., Manning, J., Leonard, L. L., & Schill, D. (2004).

Content analysis procedure book. Lawrence, KS: University of Kansas.

Krippendorff, K. (2004a). Content analysis: An introduction to its methodology. Thousand

Oaks, CA: Sage.

Krippendorff, K. (2004b). Reliability in content analysis: Some common misconceptions

and recommendations. Human Communication Research, 30, 411-433.

Neuendorf, K. A. (2002). The content analysis guidebook. Thousand Oaks, CA: Sage.
The Qualitative Report Volume 10 Number 3 September 2005 439-462
http://www.nova.edu/ssss/QR/QR10-3/marques.pdf

The Application of Interrater Reliability as a Solidification


Instrument in a Phenomenological Study

Joan F. Marques
Woodbury University, Burbank, California

Chester McCall
Pepperdine University, Malibu, California

Interrater reliability has thus far not been a common application in


phenomenological studies. However, once the suggestion was brought up
by a team of supervising professors during the preliminary orals of a
phenomenological study, the utilization of this verification tool turned out
to be vital to the credibility level of this type of inquiry, where the
researcher is perceived as the main instrument and where bias may,
hence, be difficult to eliminate. With creativeness and the appropriate
calculation approach the researcher of the here reviewed qualitative study
managed to apply this verification tool and found that the establishment of
interrater reliability served as a great solidification to the research
findings. Key Words: Phenomenology, Interrater Reliability, Applicability,
Bias Reduction, Qualitative Study, Research Findings, and Study
Solidification

Introduction

This paper intends to serve as support for the assertion that interrater reliability
should not merely be limited to being a verification tool for quantitative research, but that
it should be applied as a solidification strategy in qualitative analysis as well. This should
be applied particularly in a phenomenological study, where the researcher is considered
the main instrument and where, for that reason, the elimination of bias may be more
difficult than in other study types.
A verification tool, as interrater reliability is often referred to in quantitative
studies, is generally perceived as a means of verifying coherence in the understanding of
a certain topic, while the term solidification strategy, as referred to in this case of a
qualitative study, reaches even further: Not just as a means of verifying coherence in
understanding, but at the same time a method of strengthening the findings of the entire
qualitative study. The following provides clarification of the distinction in using interrater
reliability as a verification tool in quantitative studies versus using this test as a
solidification tool in qualitative studies. Quantitative studies, which are traditionally
regarded as more scientifically based than qualitative studies, mainly apply interrater
reliability as a percentage-based agreement in findings that are usually fairly
straightforward in their interpretability. The interraters in a quantitative study are not
necessarily required to engage deeply into the material in order to obtain an

understanding of the studys findings for rating purposes. The findings are usually
obvious and require a brief review from the interraters in order to state their
interpretations. The entire process can be a very concise and insignificant one, easily
understandable among the interraters, due to the predominantly numerical-based nature
of the quantitative findings.
However, in a qualitative study the findings are usually not represented in plain
numbers. This type of study is regarded as less scientific and its findings are perceived in
a more imponderable light. Applying interrater reliability in such a study requires the
interraters to engage in attentive reading of the material, which then needs to be
interpreted, while at the same time the interraters are expected to display a similar or
basic understanding of the topic. The use of interrater reliability in these studies as more
than just a verification tool because qualitative studies are thus far not unanimously
considered scientifically sophisticated. It is seen more as a solidification tool that can
contribute to the quality of these types of studies and the level of seriousness with which
they will be considered in the future. As explained earlier, the researcher is usually
considered the instrument in a qualitative study. By using interrater reliability as a
solidification tool, the interraters could become true validators of the findings of the
qualitative study, thereby elevating the level of believability and generalizability of the
outcomes of this type of study. As a clarification to the above, as the instrument in the
study the researcher can easily fall into the trap of having his or her bias influence the
studys findings. This may happen even though the study guidelines assume that he or
she will dispose of all preconceived opinions before immersing himself or herself into the
research. Hence, the act of involving independent interraters, who have no prior
connection with the study, in the analysis of the obtained data will provide substantiation
of the instrument and significantly reduce the chance of bias influencing the outcome.
Regarding the generalizability enhancement Myers (2000) asserts

Despite the many positive aspects of qualitative research, [these] studies


continue to be criticized for their lack of objectivity and generalizability.
The word 'generalizability' is defined as the degree to which the findings
can be generalized from the study sample to the entire population. (para. 9)

Myers continues that

The goal of a study may be to focus on a selected contemporary


phenomenon [] where in-depth descriptions would be an essential
component of the process. (para. 9)

This author subsequently suggests that, in such situations, small


qualitative studies can gain a more personal understanding of the phenomenon
and the results can potentially contribute valuable knowledge to the community
(para. 9).
It is exactly for this purpose, the potential contribution of valuable knowledge to
the community, that the researcher mentioned the elevation of generalizability in
qualitative studies, through the application of interrater reliability as a solidification and
thus bias-reducing tool.

Before delving into specifics, it might be appropriate to explain that there are
two main prerequisites considered when applying interrater reliability to qualitative
research: (1) The data to be reviewed by the interraters should only be a segment of the
total amount, since data in qualitative studies are usually rather substantial and interraters
usually only have limited time and (2) It needs to be understood that there may be
different configurations in the packaging of the themes, as listed by the various
interraters, so that the researcher will need to review the context in which these themes
are listed in order to determine their correspondence (Armstrong, Gosling, Weinman, &
Marteau, 1997). It may also be important to emphasize here that most definitions and
explanations about the use of interrater reliability to date are mainly applicable to the
quantitative field, which suggests that the application of this solidification strategy in the
qualitative area still needs significant review and subsequent formulation regarding its
possible applicability.
This paper will first explain the two main terms to be used, namely interrater
reliability and phenomenology, after which the application of interrater reliability in a
phenomenological study will be discussed. The phenomenological study that will be used
for analysis in this paper is one that was conducted to establish a broadly acceptable
definition of spirituality in the workplace. In this study the researcher interviewed six
selected participants in order to obtain a listing of the vital themes of spirituality in the
workplace. This process was executed as follows: First, the researcher formulated the
criteria, which each participant should meet. Subsequently, she identified the participants.
The six participants were selected through a snowball sampling process: Two participants
referred two other participants who each referred to yet another eligible person. The
researcher interviewed each participant in a similar way, using an interview protocol that
was validated on its content by two recognized authors on the research topic, Drs. Ian
Mitroff and Judi Neal.

Ian Mitroff is distinguished professor of business policy and founder of the USC
Center for Crisis Management at the Marshall School of Business, University of Southern
California, Los Angeles. (Ian I. Mitroff, 2005, para. 1). He has published over two hundred
and fifty articles and twenty-one books of which his most recent are Smart Thinking for
Difficult Times: The Art of Making Wise Decisions, A Spiritual Audit of Corporate
America, and Managing Crises Before They Happen (Ian I. Mitroff, para. 4).

Judi Neal is the founder of the Association for Spirit at Work and the author of
several books and numerous academic journal articles on spirituality in the workplace
(Association for Spirit at Work, 2005, para. 10-11). She has also established her authority in
the field of spirituality in the workplace in her position of executive director of The
Center for Spirit at Work at the University of New Haven, [] a membership
organization and clearinghouse that supports personal and organizational transformation
through coaching, education, research, speaking, and publications (School of Business at
the University of New Haven, 2005, para. 2).

After transcribing the six interviews the researcher developed a horizonalization


table; all six answers to each question were listed horizontally. She subsequently
eliminated redundancies in the answers and clustered the themes that emerged from this

process, which in phenomenological terms is referred to as phenomenological


reduction. This process was fairly easy, as the majority of questions in the interview
protocol were worded in such a way that they solicited enumerations of topical
phenomena from the participants. To clarify this with an example one of the questions
was What are some words that you consider to be crucial to a spiritual workplace? This
question solicited a listing of words that the participants considered identifiable with a
spiritual workplace. From six listings of words, received from six participants, it was
relatively uncomplicated to distinguish overlapping words and eliminate them. Hence,
phenomenological reduction is much easier to execute these types of answers when
compared to answers provided in essay-form. This, then, is how the themes emerged.
To provide the reader with even more clarification regarding the question formulations,
the interview protocol that was used in this study is included as an appendix (see
Appendix A).

Interrater Reliability

Interrater reliability is the extent to which two or more individuals (coders or


raters) agree. Although widely used in quantitative analyses, this verification strategy has
been practically barred from qualitative studies since the 1980s because a number of
leading qualitative researchers argued that reliability and validity were terms pertaining
to the quantitative paradigm and were not pertinent to qualitative inquiry (Morse,
Barrett, Mayan, Olson, & Spiers, 2002, p. 1). Interrater reliability addresses the
consistency of the implementation of a rating system (Colorado State University, 1997,
para. 1). The CSU on-line site further clarifies interrater reliability as follows:

A test of interrater reliability would be the following scenario: Two or


more researchers are observing a high school classroom. The class is
discussing a movie that they have just viewed as a group. The researchers
have a sliding rating scale (1 being most positive, 5 being most negative)
with which they are rating the student's oral responses. Interrater reliability
assesses the consistency of how the rating system is implemented. For
example, if one researcher gives a "1" to a student response, while another
researcher gives a "5," obviously the interrater reliability would be
inconsistent. Interrater reliability is dependent upon the ability of two or
more individuals to be consistent. Training, education and monitoring
skills can enhance interrater reliability. (para. 2)

Tashakkori and Teddlie (1998) refer to this type of reliability as interjudge or


interobserver, describing it as the degree to which ratings of two or more raters or
observations of two or more observers are consistent with each other. According to these
authors, interrater reliability can be determined by calculating the correlation between a
set of ratings done by two raters ranking an attribute in a group of individuals. Tashakkori
and Teddlie continue for qualitative observations, interrater reliability is determined by
evaluating the degree of agreement of two observers observing the same phenomena in
the same setting (p. 85).
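As a minimal illustration of the correlation-based approach Tashakkori and Teddlie describe (this sketch and its invented ratings are not from the original article; the function and variable names are hypothetical), the Pearson correlation between two raters' numeric ratings of the same six individuals can be computed as follows.

def pearson_r(x, y):
    # Pearson correlation between two equally long lists of ratings.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

rater_a = [4, 3, 5, 2, 4, 1]  # hypothetical ratings of six individuals
rater_b = [5, 3, 4, 2, 4, 2]
print(round(pearson_r(rater_a, rater_b), 2))  # about 0.86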

In the past several years interrater reliability has rarely been used as a verification
tool in qualitative studies. A variety of new criteria were introduced for the assurance of
credibility in these research types instead. According to Morse et al. (2002), this was
particularly the case in the United States. The main argument against using verification
tools with the stringency of interrater reliability in qualitative research has, so far, been
that expecting another researcher to have the same insights from a limited data base is
unrealistic (Armstrong et al., 1997, p. 598). Many of the researchers that oppose the use
of interrater reliability in qualitative analysis argue that it is practically impossible to
obtain consistency in qualitative data analysis because a qualitative account cannot be
held to represent the social world, rather it evokes it, which means, presumably, that
different researchers would offer different evocations (Armstrong et al., p. 598).
On the other hand, there are qualitative researchers who maintain that
responsibility for reliability and validity should be reclaimed in qualitative studies,
through the implementation of verification strategies that are integral and self-correcting
during the conduct of inquiry itself (Morse et al., 2002). These researchers claim that the
currently used verification tools for qualitative research are more of an evaluative (post
hoc) than of a constructive (during the process) nature (Morse et al.), which leaves room
for assumptions that qualitative research must therefore be unreliable and invalid,
lacking in rigor, and unscientific (Morse et al., p. 4). These investigators further explain
that post-hoc evaluation does little to identify the quality of [research] decisions, the
rationale behind those decisions, or the responsiveness and sensitivity of the investigator
to data (Morse et al., p. 7) and can therefore not be considered a verification strategy.
The above-mentioned researchers emphasize that the currently used post-hoc procedures
may very well evaluate rigor but do not ensure it (Morse et al.).
The concerns addressed by Morse et al. (2002) above about verification tools in
qualitative research being more of an evaluative nature (post hoc) than of a constructive
(during the process) nature can be avoided by utilizing interrater reliability as it was
applied in this study, that is, right after the initial attainment of themes by the
researcher yet before formulating conclusions based on the themes registered. This
method of verifying the study's findings represents a constructive way (during the
process) of measuring the consistency in the interpretation of the findings rather than an
evaluative (post hoc) way. It therefore avoids the problem of concluding insufficient
consistency in the interpretations after the study has been completed and it leaves room
for the researcher to further substantiate the study before it is too late. The substantiation
can happen in various ways. For instance, this might be done by seeking additional study
participants, adding their answers to the material to be reviewed, performing a new cycle
of phenomenological reduction, or resubmitting the package of text to the interraters for
another round of theme listing.
As suggested on the Colorado State University (CSU) website (1997) interrater
reliability should preferably be established outside of the context of the measurement in
your study. This source claims that interrater reliability should preferably be executed as
a side study or pilot study. The suggestion of executing interrater reliability as a side
study corresponds with the above-presented perspective from Morse et al. (2002) that
verification tools should not be executed post-hoc, but constructively during the
execution of the study. As stated before, the results from establishing interrater reliability
as a side study at a critical point during the execution of the main study (see

explanation above) will enable the researcher, in case of insufficient consistency between
the interraters, to perform some additional research in order to obtain greater consensus.
In the opinion of the researcher of this study, the second option suggested by CSU, using
interrater reliability as a pilot study, would mainly establish consistency in the
understandability of the instrument. In this case such would be the interview protocol to
be used in the research, since there would not be any findings to be evaluated at that time.
However, the researcher perceives no difference between this interpretation of interrater
reliability and the content validation here applied to the interview protocol by Mitroff and
Neal. The researcher further questions the value of such a measurement without the
additional review of study findings, or a part thereof. For this reason, the researcher
decided that interrater reliability in this qualitative study would deliver optimal value if
performed on critical parts of the study findings. This, then, is what was implemented in
the here reviewed case.

Phenomenology

A phenomenological study entails the research of a phenomenon by obtaining


authorities' verbal descriptions based on their perceptions of this phenomenon, aiming to
find common themes or elements that comprise the phenomenon. The study is intended to
discover and describe the elements (texture) and the underlying factors (structure) that
comprise the experience of the researched phenomenon.
Phenomenology is regarded as one of the frequently used traditions in qualitative
studies. According to Creswell (1998) a phenomenological study describes the meaning
of the lived experiences for several individuals about a concept or the phenomenon.
Blodgett-McDeavitt (1997) presents the following definition,

Phenomenology is a research design used to study deep human


experience. Not used to create new judgments or find new theories,
phenomenology reduces rich descriptions of human experience to
underlying, common themes, resulting in a short description in which
every word accurately depicts the phenomenon as experienced by co-
researchers. (para. 10)

Creswell suggests for a phenomenological study the process of collecting


information should involve primarily in-depth interviews with as many as 10 individuals.
According to Creswell, Dukes recommends studying 3 to 10 subjects, and the Riemen
study included 10. The important point is to describe the meaning of a small number of
individuals who have experienced the phenomenon (p. 122).
Given these recommendations, the researcher of the phenomenological study
described here chose to interview a number of participants between 3 and 10 and ended
up with the voluntary choice of 6.
Creswell (1998) describes the procedure that is followed in a phenomenological
approach to be undertaken:

In a natural setting where the researcher is an instrument of data collection


who gathers words or pictures, analyzes them inductively, focuses on the

meaning of participants, and describes a process that is expressive and


persuasive in language. (p. 14)

As in all qualitative studies, the researcher who engages in the phenomenological
approach should realize that phenomenology is an influential and complex philosophic
tradition (Van Manen, 2002a, para. 1) as well as a human science method (Van Manen,
2002a, para. 2), which draws on many types and sources of meaning (Van Manen, 2002b,
para. 1).
Creswell (1998) presents the procedure in a phenomenological study as follows:

1. The researcher begins [the study] with a full description of his or her own experience
of the phenomenon (p. 147).
2. The researcher then finds statements (in the interviews) about how individuals are
experiencing the topic, lists out these significant statements (horizonalization of the
data) and treats each statement as having equal worth, and works to develop a list of
nonrepetitive, nonoverlapping statements (p. 147).
3. These statements are then grouped into meaning units: the researcher lists these
units, and he or she writes a description of the textures (textural description) of the
experience - what happened - including verbatim examples (p. 150).
4. The researcher next reflects on his or her own description and uses imaginative
variation or structural description, seeking all possible meanings and divergent
perspectives, varying the frames of reference about the phenomenon, and constructing
a description of how the phenomenon was experienced (p. 150).
5. The researcher then constructs an overall description of the meaning and the essence
of the experience (p. 150).
6. This process is followed first for the researchers account of the experience and then
for that of each participant. After this, a composite description is written (p. 150).

Based on the above-presented explanations and their subsequent incorporation in


a study on workplace spirituality, the researcher developed the following model (Figure
1), which may serve as an example of a possible phenomenological process with
incorporation of interrater reliability as a constructive solidification tool.

Figure 1. Research process in picture. [Flowchart, not reproducible in text; its boxes read: Interviewee A through Interviewee F; Horizonalization Table; Phenomenological Reduction; Meaning Clusters; Emergent Themes, reviewed by Interrater 1 and Interrater 2; Internal aspects, External aspects, and Integrated external/internal aspects; Leadership-imposed aspects and Employee-imposed aspects; Meaning of this Phenomenon; Textural and Structural Description; Possible Structural Meanings of the Experience; Definition of Spirituality in the Workplace; Underlying Themes and Contexts; Precipitating Factors; Invariant Themes; Implications of Findings; Recommendations for Individuals and Organizations.]

In the here-discussed phenomenological study, which aimed to establish a broadly


acceptable definition of spirituality in the workplace and therefore sought to obtain vital
themes that would be applicable in such a work environment, the researcher considered
the application of interrater reliability most appropriate at the time when the
phenomenological reduction was completed. The meaning clusters also had been formed.
Since the most important research findings would be derived from the emergent themes,
this seemed to be the most crucial as well as the most applicable part for soliciting
interrater reliability. However, the researcher did not submit any pre-classified
information to the interraters, but instead provided them the entirety of raw transcribed
data with highlights of 3 topical questions from which common themes needed to be
derived. In other words, the researcher first performed phenomenological reduction,
concluded which questions provided the largest numbers of theme listings, and then
submitted the raw version of the answers to these questions to the interraters to find out
whether they would come up with a decent amount of similar theme findings. This
process will be explained in more detail later in the paper.
Blodgett-McDeavitt (1997) cites one of the prominent researchers in
phenomenology, Moustakas, in a presentation of the four main steps of
phenomenological processes: epoche, reduction, imaginative variation, and synthesis of
composite textural and composite structural descriptions. The way Moustakas steps can

be considered to correspond with the earlier presented procedure, as formulated by


Creswell, is that epoche (which is the process of bracketing previous knowledge of the
researcher on the topic) happens when the researcher describes his or her own
experiences of the phenomenon and thereby symbolically empties his or her mind (see
Creswell step 1); reduction occurs when the researcher finds nonrepetitive,
nonoverlapping statements, and groups them into meaning units (Creswell step 2 and 3);
imaginative variation takes place when the researcher engages in reflection (Creswell
step 4); and synthesis is applied when the researcher constructs an overall description
and formulates his or her own accounts as well as those of the participants (Creswell
steps 5 and 6).
Elaborating on the interpretation of epoche, Blodgett-McDeavitt (1997) explains,

Epoche clears the way for a researcher to comprehend new insights into
human experience. A researcher experienced in phenomenological
processes becomes able to see data from new, naive perspective from
which fuller, richer, more authentic descriptions may be rendered.
Bracketing biases is stressed in qualitative research as a whole, but the
study of and mastery of epoche informs how the phenomenological
researcher engages in life itself. (p. 3)

Although epoche may be considered an effective way for the experienced


phenomenologist to empty him or herself and subsequently see the obtained data from a
naive perspective, the chance is that bias is still very present for the less experienced
investigator. The inclusion of interrater reliability as a bias reduction tool could therefore
lead to significant quality enhancement of the study's findings (as will be discussed
below).

Using Interrater Reliability in a Phenomenological Study

Interrater reliability has thus far not been a common application in


phenomenological studies. However, once the suggestion was brought up by a team of
supervising professors about vital themes in a spiritual workplace, the utilization of this
constructive verification tool turned into an interesting challenge and, at the same time,
required a high level of creativeness from the researcher in charge. Because of the
uncommonness of using this verification strategy in a qualitative study, especially a
phenomenology where the researcher is highly involved in the formulation of the
research findings, it was fairly difficult to determine the applicability and positioning of
this tool in the study. It was even more complicated to formulate the appropriate
approach in calculating this rate, since there were various ways possible for computing it.
The first step for the researcher in this study was to find a workable definition for
this verification tool. It was rather obvious that the application of this solidification
strategy toward the typical massive amount of descriptive data of a phenomenology
would have to differ significantly from the way this tool was generally used in
quantitative analysis where kappa coefficients are the common way to go. After in-depth
source reviews, the researcher concluded that there was no established consistency to
date in defining interrater reliability, since the appropriateness of its outcome depends on

the purpose it is used for. Isaac and Michael (1997) illuminate this by stating that there
are various ways of calculating interrater reliability, and that different levels of
determining the reliability coefficient take account of different sources of error (p. 134).
McMillan and Schumacher (2001) elaborate on the inconsistency issue by explaining that
researchers often ask how high a correlation should be for it to indicate satisfactory
reliability. McMillan and Schumacher conclude that this question is not answered easily.
According to them, it depends on the type of instrument (personality questionnaires
generally have lower reliability than achievement tests), the purpose of the study
(whether it is exploratory research or research that leads to important decisions), and
whether groups or individuals are affected by the results (since action affecting
individuals requires a higher correlation than action affecting groups).
Aside from the above presented statements about the divergence in opinions with
regards to the appropriate correlation coefficient to be used, as well as the proper
methods of applying interrater reliability, it is also a fact that most or all of these
discussions pertain to the quantitative field. This suggests that there is still intense review
and formulation needed in order to determine the applicability of interrater reliability in
qualitative analyses, and that every researcher that takes on the challenge of applying this
solidification strategy in his or her qualitative study will therefore be a pioneer.
The first step for the researcher of this phenomenological study was attempting to
find the appropriate degree of coherence that should exist in the establishment of
interrater reliability. It was the intention of the researcher to use a generally agreed upon
percentage, if existing, as a guideline in her study. However, after assessing multiple
electronic (online) and written sources regarding the application of interrater reliability in
various research disciplines, the researcher did not succeed in finding a consistent
percentage for use of this solidification strategy. Sources included Isaac and Michael's
(1997) Handbook in Research and Evaluation, Tashakkori and Teddlie's (1998) Mixed
Methodology, and McMillan and Schumacher's (2001) Research in Education;
ProQuest's extensive article and paper database as well as its digital dissertations file; and
other common search engines such as Google. Consequently, this researcher presented
the following illustrations of the observed basic inconsistency in applying interrater
reliability, as she perceived it throughout a variety of studies, which were not
necessarily qualitative in nature.

1. Mott, Etsler, and Drumgold (2003) presented the following reasoning for their
interrater reliability findings in their study, Applying an Analytic Writing Rubric to
Children's Hypermedia Narratives.

A comparative approach to the examination of the technical qualities of a


pen and paper writing assessment for elementary students' hypermedia-
created products ... Pearson correlations averaged across 10 pairs of raters
found acceptable interrater reliability for four of the five subscales. For the
four subscales, theme, character, setting, plot and communication, the r
values were .59, .55, .49, .50 and .50, respectively (Mott, Etsler, &
Drumgold, 2003, para. 1).

2. Butler and Strayer (1998) assert the following in their online-presented research
document, administered by Stanford University and titled The Many Faces of
Empathy.

Acceptable interrater reliability was established across both dialogues and


monologues for all of the verbal behaviors coded. The Pearson
correlations for each variable, as rated by two independent raters, are as
follows: Average intimacy of disclosure, r =.94, t (8) = 7.79 p < .05;
Focused empathy, r =.78, t (14) = 4.66 p < .05; and Shared Affect, r =.85,
t (27) = 8.38, p < .05 (para. 1).

3. Srebnik, Uehara, Smukler, Russo, Comtois, and Snowden (2002) approach interrater
reliability in their study on Psychometric Properties and Utility of the Problem
Severity Summary for Adults with Serious Mental Illness as follows: Interrater
reliability: A priori, we interpreted the intraclass correlations in the following manner:
.60 or greater, strong; .40 to .59, moderate; and less than .40, weak (para. 15).
Through multiple reviews of accepted reliability rates in various studies, this
researcher finally concluded that the acceptance rate for interrater reliability varies
between 50% and 90%, depending on the considerations mentioned above in the citation
of McMillan and Schumacher (2001). The researcher did not succeed in finding a fixed
percentage for interrater reliability in general and definitely not for phenomenological
research. She contacted the guiding committee of this study to agree upon a usable rate.
The researcher found that in the phenomenological studies she reviewed through the
Proquest digital dissertation database, interrater reliability had not been applied, although
she did find a masters thesis from the Trinity Western University that briefly mentioned
the issue of using reliability in a phenomenological study by stating

Phenomenological research must concern itself with reliability for its


results to have applied meaning. Specifically, reliability is concerned with
the ability of objective, outside persons to classify meaning units with the
appropriate primary themes. A high degree of agreement between two
independent judges will indicate a high level of reliability in classifying
the categories. Generally, a level of 80 percent agreement indicates an
acceptable level of reliability. (Graham, 2001, p. 66)

Graham (2001) then states the percent agreement between researcher and the
student [the external judge] was 78 percent (p. 67). However, in the explanation
afterwards it becomes apparent that this percentage was not obtained by comparing the
findings from two independent judges aside from the researcher, but by comparing the
findings from the researcher to one external rater. Considering the fact that the researcher
in a phenomenological study always ends up with an abundance of themes on his or her
list (since he or she manages the entirety of the data, while the external rater only reviews
a limited part of the data), a score as high as 78% should not be difficult to
obtain depending on the calculation method (as will be demonstrated later in this paper).
The citation Graham used as a guideline in his thesis referred to the agreement between

two independent judges and not to the agreement between one independent judge and the
researcher.
The researcher of the here-discussed phenomenological study on spirituality in the
workplace also learned that the application of this solidification tool in qualitative studies
has been a subject of ongoing discussion (without resolution) in recent years, which may
explain the lack of information and consistent guidelines currently available.
The guiding committee for this particular research agreed upon an acceptable
interrater reliability of two thirds, or 66.7% at the time of the suggestion for applying this
solidification tool. The choice for 66.7% was based on the fact that, in this team, there
were quantitative as well as qualitative oriented authorities, who after thorough
discussion came to the conclusion that there were variable acceptable rates for interrater
reliability in use. The team also considered the nature of the study and the multi-
interpretability of the themes to be listed and subsequently decided the following: Given
the study type and the fact that the interraters would only review part of the data, it
should be understood that a correspondence percentage higher than 66.7% between two
external raters might be hard to attain. This correspondence percentage becomes even
harder to achieve if one considers that there might also be such a high number of themes
to be listed, even in the limited data provided, that one rater could list entirely different
themes than the other, without necessarily having a different understanding of the text;
The researcher subsequently performed the following measuring procedure:

1. The data gained for the purpose of this study were first transcribed and saved. This
was done by obtaining a listing of the vital themes applicable to a spiritual workplace
and consisted of interviews taken with a pre-validated interview protocol from 6
participants.
2. Since one of the essential procedures in phenomenology is to find common themes in
participants statements, the transcribed raw data were presented to two pre-identified
interraters. The interraters were both university professors and administrators, each
with an interest in spirituality in the workplace and, expectedly, with a fairly
compatible level of comprehensive ability. These individuals were approached by the
researcher and, after their approval for participation, separately visited for an
instructional session. During this session, the researcher handed each interrater a form
she had developed, in which the interrater could list the themes he found when
reviewing the 6 answers to each of the three selected questions. Each interrater was
thoroughly instructed with regards to the philosophy behind being an interrater, as
well as with regards to the vitality of detecting themes that were common (either
through direct wording or interpretative formulation by the 6 participants). The
interraters, although acquainted with each other, were not aware of each others
assignment as an interrater. The researcher chose this option to guarantee maximal
individual interpretation and eliminate mutual influence. The interraters were thus
presented with the request to list all the common themes they could detect from the
answers to three particular interview questions. For this procedure, the researcher
made sure to select those questions that solicited a listing of words and phrases from
the participants. The reason for selecting these questions and their answers was to
provide the interraters with a fairly clear and obvious overview of possible themes to
choose from.

3. The interraters were asked to list the common themes per highlighted question on a
form that the researcher developed for this purpose and enclosed in the data package.
Each interrater thus had to produce three lists of common themes: one for each
highlighted topical question.
The highlighted questions in each of the six interviews were: (1) What are some
words that you consider to be crucial to a spiritual workplace? (2) If a worker was
operating at his or her highest level of spiritual awareness, what would he or she actually
do? and (3) If an organization is consciously attempting to nurture spirituality in the
workplace, what will be present? One reason for selecting these particular responses was
that the questions that preceded these answers asked for a listing of words from the
interviewees, which could easily be translated into themes. Another important reason was
that these were also the questions from which the researcher derived most of the themes
she listed. However, the researcher did not share any of the classifications she had
developed with the interraters, but had them list their themes individually instead in order
to be able to compare their findings with hers.
4. The purpose of having the interraters list these common themes was to distinguish the
level of coordinating interpretations between the findings of both interraters, as well
as the level of coordinating interpretations between the interraters findings and those
of the researcher. The computation methods that the researcher applied in this study
will be explained further in this paper.
5. After the forms were filled out and received from the interraters, the researcher
compared their findings to each other and subsequently to her own. Interrater
reliability would be established, as recommended by the dissertation committee for
this particular study, if at least 66.7% (2/3) agreement was found between interraters
and between interraters and researchers findings. Since the researcher serves as the
main instrument in a phenomenological study, and even more because this researcher
first extracted themes from the entire interviews, her list was much more extensive
than those of the interraters who only reviewed answers to a selected number of
questions. It may therefore not be very surprising that there was 100% agreement
between the limited numbers of themes submitted by the interraters and the
abundance of themes found by the researcher. In other words, all themes of interrater
1 and all themes of interrater 2 were included in the theme-list of the researcher. It is
for this reason that the agreement between the researchers findings and the
interraters findings was not used as a measuring scale in the determination of the
interrater reliability percentage.
A complication occurred when the researcher found that the interraters did not
return an equal number of common themes per question. This could happen because the
researcher did not set a mandatory number of themes to be submitted. In other words,
the researcher did not set a fixed number of themes for the interraters to come up with,
but rather left it up to them to find as many themes they considered vital in the text
provided. The reason for refraining from limiting the interraters to a predetermined
number of themes was because the researcher feared that a restriction could prompt
random choices by each interrater among a possible abundance of available themes,
ultimately leading to entirely divergent lists and an unrealistic conclusion of low or no
interrater reliability.

To clarify the researchers considerations a simple example would be if there was


a total of 100 obvious themes to choose from and the researcher required the submission
of only 15 themes per interrater, there would be no guarantee which part of the 100
available themes each interrater would choose. It could very well be that interrater 1
would select the first 15 themes encountered, while interrater 2 would choose the last 15.
If this were the case there would be zero percent interrater reliability, even though the
interraters may have actually had a perfect common understanding of the topic.
Therefore, the researcher decided to just ask each interrater to list as many common
themes he could find among the highlighted statements from the 6 participants. It may
also be appropriate to stress here that the researcher explained well in advance to the
raters what the purpose of the study was, so there would be no confusion with regards to
the understanding of what exactly were considered to be themes.
Dealing with the problem of establishing interrater reliability with an unequal
amount of submissions from the interraters was thus another interesting challenge. Before
illustrating how the researcher calculated interrater reliability for this particular case, note
the following information:

Interrater 1 (I1) submitted a total of 13 detected themes for the selected questions.
Interrater 2 (I2) submitted a total of 17 detected themes for the selected questions.
The researcher listed a total of 27 detected themes for the selected questions.

Between both interraters there were 10 common themes found. The agreement
was determined on two counts: (1) On the basis of exact listing, which was the case with
7 of these 10 themes and (2) on the basis of similar interpretability, such as giving to
others and contributing; encouraging and motivating; aesthetically pleasing
workplace; and beauty of which the latter was mentioned in the context of a nice
environment. The researcher color-coded the themes that corresponded with the two
interraters (yellow) and subsequently color-coded the additional themes that she shared
with either interrater (green for additional corresponding themes between the researcher
and interrater 1 and blue for additional corresponding themes between the researcher and
interrater 2). All of the corresponding themes between both interraters (the yellow
category) were also on the list of the researcher and therefore also colored yellow on her
list.
Before discussing the calculation methods reviewed by this researcher about
spirituality in the workplace, it may be useful to clarify that phenomenology is a very
divergent and complicated study type, entailing various sub-disciplines and oftentimes
described as the study of essences, including the essence of perception and of
consciousness (Scott, 2002, para. 1). In his presentation of Merleau-Ponty's Phenomenology
of Perception Scott explains, phenomenology is a method of describing the nature of our
perceptual contact with the world. Phenomenology is concerned with providing a direct
description of human experience (para. 1). This may clarify to the reader that the
phenomenological researcher is aware that reality is a subjective phenomenon,
interpretable in many different ways. Based on this conviction, this researcher did not
make any pre-judgments on the quality of the various calculation methods presented
below, but merely utilized them on the basis of their perceived applicability to this study
type.

The researcher came across various possible methods described for calculating interrater
reliability.

Calculation Method 1

Various electronic sources, among which a website from Richmond University


(n.d.), mention the percent agreement between two or more raters as the easy way to
calculate interrater reliability. In this case, reliability would be calculated as: (Total #
agreements) / (Total # observations) x 100. In the case of this study, the outcome would
be: 20/30 x 100 = 66.7%, whereby 20 equals the added number of agreements from both
interraters (10 + 10) and 30 equals the added number of observations from both
interraters (13 + 17). The recommendation from Posner, Sampson, Ward, and Cheney
(1990) is that interrater reliability, R = number of agreements / (number of agreements +
number of disagreements), also leads to the same outcome. This calculation would be
executed as follows: 20 / (20+10) = 2/3 = 66.7%.
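A minimal Python sketch of Calculation Method 1 (added for illustration; the function and variable names are hypothetical), using the totals reported in this study, reproduces both the percent-agreement version and the Posner et al. version:

def percent_agreement(agreements, observations):
    # (Total # agreements) / (Total # observations) x 100
    return agreements / observations * 100

agreements = 10 + 10      # the 10 common themes, counted once for each interrater
disagreements = 3 + 7     # non-corresponding themes from Interrater 1 and Interrater 2
observations = 13 + 17    # 13 themes from Interrater 1 plus 17 from Interrater 2
print(round(percent_agreement(agreements, observations), 1))      # 66.7
print(round(agreements / (agreements + disagreements) * 100, 1))  # Posner et al. version: 66.7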
Various authors recommend the confusion matrix, which is a standard
classification matrix, as a valid option for calculating interrater reliability. A confusion
matrix, according to Hamilton, Gurak, Findlater, and Olive (2003), contains information
about actual and predicted classifications done by a classification system. Performance of
such systems is commonly evaluated using the data in the matrix (para. 1). According to
these authors, the meaning of the entries in the confusion matrix should be specified as
they pertain to the context of the study. In this study the following meanings will be
ascribed to the various entries, a is the number of agreeing themes that Interrater 1 listed
in comparison with Interrater 2; b is the number of disagreeing themes that Interrater 1
listed in comparison with Interrater 2; c is the number of disagreeing themes that
Interrater 2 listed in comparison with Interrater 1; and d is the total number of disagreeing
themes that both interraters listed.
The confusion matrix that Hamilton et al. (2003) present is similar to the one
displayed in Table 1. However, this researcher has specified the entries as recommended
by these authors for the purpose of this study.

Table 1

Confusion Matrix 1

                               Interrater 1
                            Agree      Disagree
Interrater 2   Agree          a            b
               Disagree       c            d

Hamilton et al. (2003) subsequently present a number of equations relevant to their


specific study. The researcher of this study substituted the actual values pertaining to this
particular study in the authors equations and came to some interesting findings:

1. The rate that these authors label as the accuracy rate (AC), named this way because
it measures the proportion of the total number of findings from Interrater 1 -- the one
with the lowest number of themes submitted -- that are accurate. In this case

accurate means in agreement with the submissions of Interrater 2 (adopted from


Hamilton et al., 2003, para. 5, and modified toward the values used in this particular
study), is calculated as seen below.

AC = (a + d) / (a + b + c + d)
= (10 + 10) / (10 + 3 + 7 + 10)
= 20/30 = 66.7%

2. The rate these authors label as the true agreement rate: The title of this rate has
been modified by substituting the names of values applicable in this particular study.
The true agreement rate was named this way because it measures the proportion of
agreed upon themes (10) perceived from the entire number of submitted themes from
Interrater 1, the one with the lowest number of submissions (adopted from Hamilton
et al., 2003, para. 8, and modified toward the values used in this particular study), is
calculated as seen below.

TA = a / (a + b)
= 10 / (10 + 3)
= 10/13 = 76.9%
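Both rates above can be reproduced directly from the cell values. The following minimal Python sketch is an illustration added here (the function names accuracy_rate and true_agreement_rate are hypothetical), using the substituted values a = 10, b = 3, c = 7, and d = 10.

def accuracy_rate(a, b, c, d):
    # AC = (a + d) / (a + b + c + d)
    return (a + d) / (a + b + c + d)

def true_agreement_rate(a, b):
    # TA = a / (a + b)
    return a / (a + b)

a, b, c, d = 10, 3, 7, 10   # values substituted in this study
print(round(accuracy_rate(a, b, c, d) * 100, 1))    # 66.7
print(round(true_agreement_rate(a, b) * 100, 1))    # 76.9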

Dr. Brian Dyre (2003), associate professor of experimental psychology at the


University of Idaho also uses the Confusion Matrix for determining interrater reliability.
Dyre recommends the following computation under the heading: Establishing Reliable
Measures for Non-Experimental Research. As mentioned above, this researcher inserted
the values that were derived from the interrater reliability test for this particular study
about spirituality in the workplace in the recommended columns and rows, presented
below as Table 2. The interraters are referred to as R1 and R2.

Table 2

Confusion Matrix 2 with Substitution of Actual Values

                                  R1
                         Agree      Disagree      Total
R2      Agree             10            3           13
        Disagree           7        10 (=3+7)       17
        Total             17           13           30

According to Dyre (2003), interrater reliability = [(Number of agreeing themes) +
(Number of disagreeing themes)] / (Total number of observed themes) = (10 + 10) / 30 =
2/3 = 66.7%, which is similar to the earlier discussed accuracy rate (AC) from Hamilton
et al. (2003).

Calculation Method 2

Since the interraters did not submit an equal number of observations, as is general
practice in interrater reliability measures, the above-calculated rate of 66.7% can be
disputed. Although the researcher did not manage to find any written source to base the
following computation on, she considered it logical that in case of unequal submissions,
the lowest submitted number of findings from similar data by any of two or more
interraters used in a study should be used as the denominator in measuring the level of
agreement. Based on this observation, interrater reliability would be: (Number of
common themes) / (lowest number of submissions) x 100 = 10/13 x 100% = 76.9%.
Rationale for this calculation: if the numbers of submissions by both interraters
had varied even more, say 13 for interrater 1 versus 30 for interrater 2, interrater
reliability would be impossible to establish even if all 13 themes submitted by
interrater 1 were also on the list of interrater 2. With the calculations as presented under
calculation method 1, the outcome would then be: (13 +13) / (30 + 13) = 26/43 = 60.5%,
whereby 13 would be the number of agreements and 43 the total number of observations.
This does not correspond at all with the logical conclusion that a total level of agreement
from one interraters list onto the other should equal 100%.
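
As a further illustration of the logic behind calculation method 2, the hypothetical Python sketch below computes agreement as the number of common themes divided by the smaller of the two submission counts. It is a rendering of the rule described above rather than code used in the study, and the function name method2_agreement is invented for this example; the inputs are the two scenarios discussed in this section.

    def method2_agreement(common_themes: int, n_rater1: int, n_rater2: int) -> float:
        """Proportion of common themes relative to the shorter of the two theme lists."""
        return common_themes / min(n_rater1, n_rater2)

    # Actual case in this study: 10 common themes, 13 vs. 17 submissions.
    print(f"{method2_agreement(10, 13, 17):.1%}")   # 76.9%

    # Hypothetical case from the rationale: 13 common themes, 13 vs. 30 submissions.
    # Method 2 yields 100%, whereas pooling both lists as in calculation method 1
    # would give (13 + 13) / (30 + 13) = 60.5%.
    print(f"{method2_agreement(13, 13, 30):.1%}")   # 100.0%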
If, therefore, the rational justification of calculation method 2 is accepted, then
interrater reliability is 76.9%, which exceeds the minimum required rate of 66.7%.
Expanding on this reasoning, further comparison leads to the following findings: All 13
listed themes from Interrater 1 (13/13 x 100% = 100%) were on the researcher's list, and
16 of the 17 themes on Interrater 2's list (16/17 x 100% = 94.1%) were also on the list of
the researcher. These calculations are based on calculation method 2.
The researcher found it interesting that the 76.9% agreement between
the interraters was also reached in the true agreement rate (TA) as presented earlier by
Hamilton et al. (2003).

Calculation Method 3

Calculation method 3 elaborates on Hamilton et al.'s (2003) true agreement rate (TA), the
proportion of corresponding themes identified between both interraters. It is calculated
using the equation TA = a / (a + b), whereby a equals the number of corresponding
themes between both interraters and b equals the number of non-corresponding themes
submitted by the interrater with the lowest number of themes. The researcher found
it interesting to examine the calculated outcomes if the names of the
two interraters had been placed differently in the confusion matrix. When
the interraters' places in the matrix were exchanged, the outcome of this rate turned out to be
different, since the value substituted for b now became the number of non-corresponding
themes submitted by the interrater with the highest number of themes.
In fact, the new computation led to an unfavorably low, and also unrealistic, interrater
reliability rate of 58.8%. The rate is unrealistic because, under this substitution, interrater
reliability turns out extremely low as the submission numbers of the two
interraters diverge to an increasing degree. In such a case, it no longer matters
whether the two interraters fully correspond on the submissions
of the lowest submitter: the interrater reliability percentage, which is
supposed to reflect the common understanding of both interraters, will decrease to almost
zero.
To illustrate this assertion, the confusion matrix is presented in Table 3 with the
names of the interraters switched.

Table 3

Confusion Matrix with Names of Interraters Switched


                              Interrater 2
                         Agree      Disagree
Interrater 1   Agree       a            b
               Disagree    c            d

With this exchange, the outcome for TA changes significantly:

1. The rate that these authors label as the accuracy rate (AC) remains the same:
AC = (a + d) / (a + b + c + d)
= (10 + 10) / (10 + 3 + 7 + 10)
= 20/30 = 66.7%

2. The true agreement rate (title substituted with the names of values applicable in
this study) is now calculated as follows.
TA = a / (a + b)
   = 10 / (10 + 7)
   = 10/17 = 58.8%

In this study, TA rationally presented a rate of 76.9%, which was higher than the
minimum requirement of 66.7% under both calculation methods 1 and 2. On the other hand,
the new true agreement rate demonstrates that the less logical process of
exchanging the interraters' positions, so that the highest number of submissions rather
than the lowest is used as the denominator (see the first part of calculation
method 3), delivered a percentage below the minimum requirement. As a reminder of the
irrationality of using the highest number of submissions as the denominator,
the reader may refer to the example given under the rationale for calculation method 2, in
which the numbers of submissions diverged significantly (30 vs. 13). It is the
researcher's opinion that a moderated computation, averaging the true agreement rate from
each interrater's perspective, would lead to the following outcome for the true agreement
reliability rate (TAR):

TAR = ((TA-1) + (TA-2)) / 2
    = (76.9% + 58.8%) / 2
    = 135.7% / 2 = 67.9%
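
One compact way to express this moderated rate is to compute the true agreement rate from each interrater's perspective and then average the two, as in the illustrative Python sketch below. This simply restates the author's proposed TAR using the confusion-matrix cells from Table 2; it is not taken from the original paper.

    a, b, c = 10, 3, 7   # common themes, unique to Interrater 1, unique to Interrater 2

    ta_1 = a / (a + b)   # TA with the lower submission total (13) as the denominator
    ta_2 = a / (a + c)   # TA with the higher submission total (17) as the denominator

    tar = (ta_1 + ta_2) / 2
    print(f"TAR = {tar:.1%}")   # TAR = 67.9% (approximately)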

It was the researcher's conclusion that, whether the reader considers calculation
method 1, calculation method 2, or calculation method 3 the most appropriate one for
this particular study, all three methods demonstrated that there was sufficient common
understanding and interpretation of the essence of the interviewees' declarations, as they
all resulted in outcomes equal to, or greater than, 66.7%. Hence, for this study, interrater
reliability could be considered established.

Recommendations

The researcher of this study has found that, although interraters in a phenomenological
study, and presumably in qualitative studies generally, can very well select themes
with a similar understanding of the essentials in the data, there are several major
attention points to address in order to enhance the success rate and swiftness of the process:

1. The data to be reviewed by the interraters should be only a segment of the total amount,
since data in qualitative studies are usually rather substantial and interraters usually have
only limited time.

2. The researcher will need to understand that different configurations are possible in the
packaging of the themes as listed by the various interraters, so that he or she will need to
review the context in which these themes are listed in order to determine their correspondence
(Armstrong et al., 1997). In this paper the researcher gave examples of themes that could be
considered similar although they were packaged differently by the interraters, such as
"giving to others" and "contributing"; "encouraging" and "motivating"; and "aesthetically
pleasing workplace" and "beauty," the latter of which was mentioned in the context of a
nice environment.

3. In order to obtain results with similar depth from all raters, the researcher should set
standards for the number of observations to be listed by the interraters as well as the
time allotted to them. The fact that these confines were not specified to the interraters
in this study resulted in a divergent level of input: One interrater spent only two days listing the
words and came up with a total of 13 themes, while the other interrater spent
approximately one week preparing his list and consequently came up with a more
detailed list of 17 themes. Although there was a majority of congruent themes
between the two interraters (there were 10 common themes between both lists), the
calculation of interrater reliability was complicated by the unequal numbers of
submissions. All of the interrater reliability calculation methods discussed here assume
equal numbers of submissions by the interraters. The officially recognized reliability rate
of 66.7% for this study is therefore lower than it would have been if both interraters had been
limited to a pre-specified number of themes. If, for example, both
interraters had been required to select 15 themes within an equal time span of, say,
one week, the puzzle regarding the use of either the lowest or the highest common
denominator would be resolved, because there would be only one denominator as
well as an equal level of input from both interraters. If, in this case, the interraters
came up with 12 common themes out of 15, the interrater reliability rate could
easily be calculated as 12/15 = .8 = 80%. Even in the case of only 10 common themes out of
a total required submission of 15, the rate would still meet the minimum
requirement: 10/15 = .67 = 66.7% (see the short sketch following this list). This may be
valuable advice for future applications of this valuable tool to qualitative studies.

4. The solicited number of submissions from the interraters should be set as high as possible,
especially if there is a multiplicity of themes to choose from. If the solicited number is kept
too low, it may be that two raters have a perfectly similar understanding of the text yet submit
different themes, which may erroneously elicit the idea that there was not enough
coherence in the raters' perceptions and, thus, no sufficient interrater reliability.

5. The interraters should have at least a reasonable degree of similarity in intelligence,
background, and interest level in the topic in order to ensure a decent degree of
interpretative coherence. It would further be advisable to attune the educational and
interest level of the interraters to the target group of the study, so that the reader could
encounter a greater level of recognition with the study topic as well as the findings.
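
As a purely illustrative aid to recommendation 3 above, the short Python sketch below contrasts the fixed-denominator situation described there with the unequal-submission situation encountered in this study. The function name and the figures (12 or 10 common themes out of a required 15) are the hypothetical ones used in that recommendation, not new data from the study.

    def fixed_denominator_agreement(common_themes: int, required_themes: int) -> float:
        """Agreement when both raters must submit the same pre-specified number of themes."""
        return common_themes / required_themes

    # Hypothetical scenarios from recommendation 3:
    print(f"{fixed_denominator_agreement(12, 15):.1%}")   # 80.0%
    print(f"{fixed_denominator_agreement(10, 15):.1%}")   # 66.7% -- still meets the minimum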

Conclusion

As mentioned previously, interrater reliability is not a commonly used tool in
phenomenological studies. Of the eight phenomenology dissertations that this researcher
reviewed prior to embarking on her own experiential journey, none applied this
instrument of control and solidification. This was possibly attributable to the fact that
various qualitatively oriented scholars have asserted in past years that it is difficult to
obtain consistency in qualitative data analysis and interpretation (Armstrong et al., 1997).
These scholars instead introduced "a variety of new criteria for determining reliability
and validity, and hence ensuring rigor, in qualitative inquiry" (Morse et al., 2002, p. 2).
Unfortunately, the majority of these criteria are either of a post hoc (evaluative) nature,
which entails that they are applied after the study has been executed, when correction is
no longer possible, or of a non-rigorous nature, such as member checks, which are merely
used as a confirmation tool for the study participants regarding the authenticity of the
provided raw data but have nothing to do with the data analysis procedures (Morse et
al.). However, having been asked by the guiding committee, in a phenomenological
study on spirituality in the workplace, to apply this tool as an enhancement
of the reliability of the findings as well as a bias-reduction mechanism, the researcher
found that the establishment of interrater reliability, or interrater agreement, was a major
solidification of the themes that were ultimately listed as the most significant ones in this
study.
It is the researcher's opinion that the process of interrater reliability should be
applied more often to phenomenological studies in order to provide them with a more
scientifically recognizable basis. Up to now, there is still a general perception that qualitative
study, a category to which phenomenology belongs, is less scientifically grounded than
quantitative study. This perception is supported by the arguments of various scholars
that different reviewers cannot coherently analyze a single package of qualitative data.
However, the researcher of this particular study has found that the interraters, given the
prerequisite of a certain minimal similarity in educational and cultural background as
well as interest, could very well select themes with a similar understanding of the essentials
in the data. This conclusion is shared with Armstrong et al. (1997), who came to similar
findings in an empirical study in which they attempted to determine the extent to which
various external raters could detect themes in the same data and demonstrate similar
interpretations. The two main prerequisites presented by Armstrong et al., namely data
limitation and contextual interpretability, were similar to those of the researcher in this
phenomenological study. These prerequisites were presented in the recommendations
section of this paper.

An interesting lesson from this experience for the researcher was that the number
of observations to be listed by the interraters, as well as the time allotted to them,
should preferably be kept the same for all interraters. At the same time, one might attempt
to set the number of solicited submissions as high as possible, because when many themes are
available there is a risk that the interraters will select widely varied choices. This may happen
in spite of perfect common understanding between the interraters and may therefore wrongly
suggest that there is not enough consistency in comprehension between the raters and, thus,
no interrater reliability. The justifications for this argument are also presented in the
recommendations section of this paper.

References

Armstrong, D., Gosling, A., Weinman, J., & Martaeu, T. (1997). The place of inter-rater
reliability in qualitative research: An empirical study. Sociology, 31(3), 597-606.
Association for Spirit at Work (2005). The professional association for people involved
with spirituality in the workplace. Retrieved February 20, 2005, from
http://www.spiritatwork.com/aboutSAW/profile_JudiNeal.htm
Blodgett-McDeavitt, C. (1997, October). Meaning of participating in technology
training: A phenomenology. Paper presented at the meeting of the Midwest
Research-to-Practice Conference in Adult, Continuing and Community
Education, Michigan State University, East Lansing, MI. Retrieved January 25,
2003, from http://www.iupui.edu/~adulted/mwr2p/prior/blodgett.htm
Butler, E. A., & Strayer, J. (1998). The many faces of empathy. Poster presented at the
annual meeting of the Canadian Psychological Association, Edmonton, Alberta,
Canada.
Colorado State University. (1997). Interrater reliability. Retrieved April 8, 2003, from
http://writing.colostate.edu/guides/research/relval/com2a5.cfm
Creswell, J. (1998). Qualitative inquiry and research design: Choosing among five
traditions. Thousand Oaks, CA: Sage.
Dyre, B. (2003, May 6). Dr. Brian Dyre's pages. Retrieved November 12, 2003, from
http://129.101.156.107/brian/218%20Lecture%20Slides/L10%20research%20designs.pdf
A phenomenological study of quest-oriented religion. Retrieved September 5, 2004, from
http://www.twu.ca/cpsy/Documents/Theses/Matt%20Thesis.pdf
Hamilton, H., Gurak, E., Findlater, L., & Olive, W. (2003, February 7). The confusion
matrix. Retrieved November 16, 2003, from
http://www2.cs.uregina.ca/~hamilton/courses/831/notes/confusion_matrix/confusion_matrix.html
Isaac, S., & Michael, W. (1997). Handbook in research and evaluation (Vol. 3). San
Diego, CA: Edits.
McMillan, J., & Schumacher, S. (2001). Research in education (5th ed.). New York:
Longman.
Ian I. Mitroff. (2005). Retrieved February 20, 2005, from the University of Southern
California Marshall School of Business web site:
http://www.marshall.usc.edu/web/MOR.cfm?doc_id=3055

Morse, J. M., Barrett, M., Mayan, M., Olson, K., & Spiers, J. (2002). Verification
strategies for establishing reliability and validity in qualitative research.
International Journal of Qualitative Methods, 1(2), 1-19.
Mott, M. S., Etsler, C., & Drumgold, D. (2003). Applying an analytic writing rubric to
children's hypermedia "narratives." Early Childhood Research & Practice, 5(1).
Retrieved September 25, 2003, from http://ecrp.uiuc.edu/v5n1/mott.html
Myers, M. (2000, March). Qualitative research and the generalizability question:
Standing firm with Proteus. The Qualitative Report, 4(3/4). Retrieved March 10,
2005, from http://www.nova.edu/ssss/QR/QR4-3/myers.html
Posner, K. L., Sampson, P. D., Ward, R. J., & Cheney, F. W. (1990, September).
Measuring interrater reliability among multiple raters: An example of methods
for nominal data. Retrieved November 13, 2003, from
http://schatz.sju.edu/multivar/reliab/interrater.html
Richmond University. (n.d.). Interrater reliability. Retrieved November 13, 2003, from
http://www.richmond.edu/~pli/psy200_old/measure/interrater.html
School of Business at the University of New Haven. (2005). Judi Neal Associate
Professor. Retrieved February 20, 2005, from
http://www.newhaven.edu/faculty/neal/
Scott, A. (2002). Merleau-Ponty's phenomenology of perception. Retrieved September 5,
2004, from http://www.angelfire.com/md2/timewarp/merleauponty.html
Srebnik, D. S., Uehara, E., Smukler, M., Russo, J. E., Comtois, K. A., & Snowden, M.
(2002, August). Psychometric properties and utility of the problem severity
summary for adults with serious mental illness. Psychiatric Services 53, 1010-
1017. Retrieved March 4, 2005, from
http://ps.psychiatryonline.org/cgi/content/full/53/8/1010
Tashakkori, A., & Teddlie, C. (1998). Mixed methodology (Vol. 46). Thousand Oaks,
CA: Sage.
Van Manen, M. (2002a). Phenomenological inquiry. Retrieved September 4, 2004, from
http://www.phenomenologyonline.com/inquiry/1.html
Van Manen, M. (2002b). Sources of meaning. Retrieved September 4, 2004, from
http://www.phenomenologyonline.com/inquiry/49.html

Appendix A

Interview Protocol

Project: Spirituality in the Workplace: Establishing a Broadly Acceptable Definition of this Phenomenon

Time of interview:
Date:
Place:
Interviewer:
Interviewee:
Position of interviewee:

To the interviewee:
Thank you for participating in this study and for committing your time and effort.
I value the unique perspective and contribution that you will make to this study.

My study aims to establish a broadly acceptable definition of spirituality in the
workplace by exploring the experiences and perceptions of a small group of recognized
interviewees who have had significant exposure to the phenomenon, either through
practical or theoretical experience. You are one of the icons identified. You will be
asked for your personal definitions and perceived essentials (meanings, thoughts, and
backgrounds) regarding spirituality in the workplace. I am looking for accurate and
comprehensive portrayals of what these essentials are like for you: your thoughts,
feelings, insights, and recollections that might illustrate your statements. Your
participation will hopefully help me understand the essential elements of spirituality in
the workplace.

Questions

1. Definition of Spirituality in the Workplace


1.1 How would you describe spirituality in the workplace?
1.2 What are some words that you consider to be crucial to a spiritual workplace?
1.3 Do you consider these words applicable to all work environments that meet your
personal standards of a spiritual workplace?
1.4 What is essential for the experience of a spiritual workplace?

2. Possible structural meanings of experiencing spirituality in the workplace


2.1 If a worker was operating at his or her highest level of spiritual awareness, what
would he or she actually do?
2.2 If a worker was operating at his or her highest level of spiritual awareness, what
would he or she not do?
2.3 What is easy about living in alignment with spiritual values in the workplace?
2.4 What is difficult about living in alignment with spiritual values in the workplace?

3. Underlying themes and contexts for the experience of a spiritual workplace


3.1 If an organization is consciously attempting to nurture spirituality in the workplace,
what will be present?
3.2 If an organization is consciously attempting to nurture spirituality in the workplace,
what will be absent?

4. General structures that precipitate feelings and thoughts about the experience of
spirituality in the workplace.
4.1 What are some of the organizational reasons that could influence the transformation
from a workplace that does not consciously attempt to nurture spirituality and the human
spirit to one that does?
4.2 From the employee's perspective, what are some of the reasons to transform from a
worker who does not attempt to live and work with spiritual values and practices to one
who does?

5. Conclusion
Would you like to add, modify, or delete anything significant from the interview that
would give a better or fuller understanding toward the establishment of a broadly
acceptable definition of spirituality in the workplace?

Thank you very much for your participation.

Author Note

Joan Marques was born in Suriname, South America, where she made a career in
advertising, public relations, and program hosting. She founded and managed an
advertising and P.R. company as well as a foundation for women's awareness issues. In
1998 she immigrated to California and embarked upon a journey of continuing education
and inspiration. She holds a Bachelor's degree in Business Economics from M.O.C. in
Suriname, a Master's degree in Business Administration from Woodbury University, and
a Doctorate in Organizational Leadership from Pepperdine University. Her recently
completed dissertation was centered on the topic of spirituality in the workplace. Dr.
Marques is currently affiliated with Woodbury University as an instructor of Business &
Management. She has authored a wide variety of articles pertaining to workplace
contentment for audiences on different continents of the globe. Joan Marques, 712 Elliot
Drive # B, Burbank, CA 91504; E-mail: [email protected]; Telephone: (818)
845 3063
Chester H. McCall, Jr., Ph.D. entered Pepperdine University after 20 years of
consulting experience in such fields as education, health care, and urban transportation.
He has served as a consultant to the Research Division of the National Education
Association, several school districts, and several emergency health care programs,
providing survey research, systems evaluation, and analysis expertise. He is the author of
two introductory texts in statistics and more than 25 articles, and has served on the faculty of
The George Washington University. At Pepperdine, he teaches courses in data analysis,
research methods, and a comprehensive exam seminar, and also serves as chair for
numerous dissertations. Email: [email protected]

Copyright 2005: Joan F. Marques, Chester McCall, and Nova Southeastern University

Article Citation

Marques, J. F. (2005). The application of interrater reliability as a solidification
instrument in a phenomenological study. The Qualitative Report, 10(3), 439-462.
Retrieved [Insert date], from http://www.nova.edu/ssss/QR/QR10-4/marques.pdf
The place of inter-rater reliability in qualitative research: an empirical study
by David Armstrong, Ann Gosling, Josh Weinman and Theresa Martaeu
Sociology, August 1997, v31 n3, p597(10)

Assessing inter-rater reliability, whereby data are independently coded and the codings compared
for agreement, is a recognised process in quantitative research. However, its applicability to
qualitative research is less clear: should researchers be expected to identify the same codes or
themes in a transcript or should they be expected to produce different accounts? Some
qualitative researchers argue that assessing inter-rater reliability is an important method for
ensuring rigour, others that it is unimportant; and yet it has never been formally examined in an
empirical qualitative study. Accordingly, to explore the degree of inter-rater reliability that might
be expected, six researchers were asked to identify themes in the same focus group transcript.
The results showed close agreement on the basic themes but each analyst packaged the themes
differently.

Key words: inter-rater reliability, qualitative research, research methods.

COPYRIGHT 1997 British Sociological Association Publication Ltd. (BSA)

Reliability and validity are fundamental concerns of the quantitative researcher but seem to have an uncertain place in the repertoire of the qualitative methodologist. Indeed, for some researchers the problem has apparently disappeared: as Denzin and Lincoln have observed, "Terms such as credibility, transferability, dependability and confirmability replace the usual positivist criteria of internal and external validity, reliability and objectivity" (1994:14). Nevertheless, the ghost of reliability and validity continues to haunt qualitative methodology and different researchers in the field have approached the problem in a number of different ways.

One strategy for addressing these concepts is that of triangulation. This device, it is claimed, follows from navigation science and the techniques deployed by surveyors to establish the accuracy of a particular point (though it bears remarkable similarities to the psychometric concepts of convergent and construct validity). In this way, it is argued, diverse confirmatory instances in qualitative research lend weight to findings. Denzin (1978) suggested that triangulation can involve a variety of data sources; multiple theoretical perspectives to interpret a single set of data; multiple methodologies to study a single problem; and several different researchers or evaluators. This latter form of triangulation implies that the difference between researchers can be used as a method for promoting better understanding. But what role is there for the more traditional concept of reliability? Should the consistency of researchers' interpretations, rather than their differences, be used as a support for the status of any findings?

In general, qualitative methodologies do not make explicit use of the concept of inter-rater reliability to establish the consistency of findings from an analysis conducted by two or more researchers. However, the concept emerges implicitly in descriptions of procedures for carrying out the analysis of qualitative data. The frequent stress on an analysis being better conducted as a group activity suggests that results will be improved if one view is tempered by another. Waitzkin described meeting with two research assistants to discuss and negotiate agreements and disagreements about coding in a process he described as "hashing out" (1991:69). Another example is afforded by Olesen and her colleagues (1994) who described how they (together with their graduate students - a standard resource in these reports) debriefed and brainstormed to pull our first-order statements from respondents' accounts and agree them. Indeed, in commenting on Olesen and her colleagues' work, Bryman and Burgess (1994) wondered whether members of teams should produce separate analyses and then resolve any discrepancies, or whether joint meetings should generate a single, definitive coded set of materials.

Qualitative methodologists are keen on stressing the transparency of their technique, for example, in carefully documenting all steps, presumably so that they can be checked by another researcher: "by keeping all collected data in well-organized, retrievable form, researchers can make them available easily if the findings are challenged or if another researcher wants to reanalyze the data" (Marshall and Rossman 1989:146). Although there is no formal description of how any reanalysis of data might be used, there is clearly an assumption that comparison with the original findings can be used to reject, or sustain, any challenge to the original interpretations. In other words, there is an implicit notion of reliability within the call for transparency of technique.

Unusually for a literature that is so opaque about the importance of independent analyses of a single dataset, Mays and Pope explicitly use the term reliability and, moreover, claim that it is a significant criterion for assessing the value of a piece of qualitative research: "the analysis of qualitative data can be enhanced by organising an independent assessment of transcripts by additional skilled qualitative researchers and comparing agreement between the raters" (1995:110). This approach, they claim, was used by Daly et al. (1992) in a study of clinical encounters between cardiologists and their patients when the transcripts were analysed by the principal researcher and an independent panel, and the level of agreement assessed. However, ironically, the procedure described by Daly et al. was actually one of ascribing quantitative weights to pregiven variables which were then subjected to statistical analysis (1992:204).

A contrary position is taken by Morse who argues that the use of external raters is more suited to quantitative research; expecting another researcher to have the same insights from a limited data base is unrealistic: "No-one takes a second reader to the library to check that indeed he or she is interpreting the original sources correctly, so why does anyone need a reliability checker for his or her data?" (Morse 1994:231). This latter position is taken further by those so-called post-modernist qualitative researchers (Vidich and Lyman 1994) who would challenge the whole notion of consistency in analysing data. The researcher's analysis bears no direct correspondence with any underlying reality and different researchers would be expected to offer different accounts as reality itself (if indeed it can be accessed) is characterised by multiplicity. For example, Tyler (1986) claims that a qualitative account cannot be held to represent the social world, rather it evokes it - which means, presumably, that different researchers would offer different evocations. Hammersely (1991) by contrast argues that this position risks privileging the rhetorical over the scientific and argues that quality of argument and use of evidence should remain the arbiters of qualitative accounts; in other words, a place remains for some sort of correspondence between the description and reality that would allow a role for consistency. Presumably this latter position would be supported by most qualitative researchers, particularly those drawing inspiration from Glaser and Strauss's seminal text which claimed that the virtue of inductive processes was that they ensured that theory was closely related to "the daily realities (what is actually going on) of substantive areas" (1967:239).

In summary, the debates within qualitative methodology on the place of the traditional concept of reliability (and validity) remain confused. On the one hand are those researchers such as Mays and Pope who believe reliability should be a benchmark for judging qualitative research; and, more commonly, those who reject the term but allow the concept to creep into their work. On the other hand are those who adopt such a relativist position that issues of consistency are meaningless as all accounts have some validity whatever their claims. A theoretical resolution of these divergent positions is impossible as their core ontological assumptions are so different. Yet this still leaves a simple empirical question: do qualitative researchers actually show consistency in their accounts? The answer to this question may not resolve the methodological confusion but it may clarify the nature of the debate. If accounts do diverge then for the modernists there is a methodological problem and for the postmodernists a confirmation of diversity; if accounts are similar, the modernists' search for measures of consistency is reinforced and the postmodernists need to recognise that accounts do not necessarily recognise the multiple character of reality.

The purpose of the study was to see the extent to which researchers show consistency in their accounts and involved asking a number of qualitative researchers to identify themes in the same data set. These accounts were then themselves subjected to analysis to identify the degree of concordance between them.

Method

As part of a wider study of the relationship between perceptions of disability and genetic screening, a number of focus groups were organised. One of these focus groups consisted of adults with cystic fibrosis (CF), a genetic disorder affecting the secretory tissues of the body, particularly the lung. Not only might these adults with cystic fibrosis have particular views of disability but theirs was a condition for which widespread genetic screening was being advocated. The aim of such a screening programme was to identify carriers of the gene so that their reproductive decisions might be influenced to prevent the birth of children with the disorder.

The focus group was invited to discuss the topic of genetic screening. The session was introduced with a brief summary of what screening techniques were currently available and then discussion from the group on views of genetic screening was invited and facilitated. The ensuing discussion was tape recorded and transcribed. Six experienced qualitative investigators in Britain and the United States who had some interest in this area of work were approached and asked if they would analyse the transcript and prepare an independent report on it, identifying, and where possible rank ordering, the main themes emerging from the discussion (with a maximum of five themes). The analysts were offered a fee for this work.

The choice of method for examining the six reports was made on pragmatic grounds. One method, consistent with the general approach, would have been to ask a further six researchers to write reports on the degree of consistency that they perceived in the initial accounts. But then, these accounts themselves would have needed yet further researchers to be recruited for another assessment, and so on. At some point a final judgement of consistency needed to be made and it was thought that this could just as easily be made on the first set of reports. Accordingly, one of the authors (DA) scrutinised all six reports and deliberately did not read the original focus group transcript. The approach involved listing the themes that were identified by the six researchers and making judgements from the background justification whether or not there were similarities and differences between them.

Results

The focus group interview with the adults with cystic fibrosis was transcribed into a document 13,500 words long and sent to the six designated researchers. All six researchers returned reports. Five of the reports, as requested, described themes: four analysts identified five each, the other four. The sixth analyst returned a lengthy and discursive report that commented extensively on the dynamics of the focus group, but then discussed a number of more thematic issues. Although not explicitly described, five themes could be abstracted from this text.

In broad outline, the six analysts did identify similar themes but there were significant differences in the way they were packaged. These differences can be illustrated by examining four different themes that the researchers identified in the transcript, namely, visibility, ignorance, health service provision and genetic screening.

Visibility. All six analysts identified a similar constellation of themes around such issues as the relative invisibility of genetic disorders, people's ignorance, the eugenic debate and health care choices. However, analysts frequently differed in the actual label they applied to the theme. For example, while "misperceptions of the disabled", "relative deprivation in relation to visibly disabled", and "images of disability" were worded differently, it was clear from the accompanying description that they all related to the same phenomenon, namely the fact that the general public were prepared to identify - and give consideration to - disability that was overt, whereas genetic disorders such as CF were more hidden and less likely to elicit a sympathetic response.

Further, although each theme was given a label it was more than a simple descriptor; the theme was placed in a context that gave it coherence. At its simplest this can be illustrated by the way that the theme of the relative invisibility of genetic disorders as forms of disability was handled. All six analysts agreed that it was an important theme and in those instances when the analysts attempted a ranking, most placed it first. For example, according to the third rater:

The visibility of the disability is the single most important element in its representation. [R3]

But while all analysts identified an invisibility theme, all also expressed it as a comparative phenomenon: traditional disability is visible while CF is invisible.

The stereotypes of the disabled person in the wheelchair; the contrast between visible, e.g. gross physical, and invisible, e.g. specific genetic, disabilities; and the special problems posed by the general invisibility of so many genetic disabilities. [R2]

In short, the theme was contextualised to make it coherent, and give it meaning. Perhaps because the invisibility theme came with an implicit package of a contrast with traditional images of deviance, there was general agreement on the theme and its meaning across all the analysts. Even so, the theme of invisibility was also used by some analysts as a vehicle for other issues that they thought were related: a link with stigma was mentioned by two analysts; another pointed out the difficulty of managing invisibility by CF sufferers.

Ignorance. Whereas the theme of invisibility had a clear referent of visibility against which there could be general consensus, other themes offered fewer such natural backdrops. Thus, the theme of people's ignorance about genetic matters was picked up by five of the six analysts, but presented in different ways. Only one analyst expressed it as a basic theme while others chose to link ignorance with other issues to make a broader theme. One linked it explicitly with the need for education.

The main attitudes expressed were of great concern at the low levels of public awareness and understanding of disability, and of great concern that more educational effort should be put into putting this right. [R2]

Three other analysts tied the population's ignorance to the eugenic threat. For example:

Ignorance and fear about genetic disorders and screening, and the future outcomes for society. The group saw the public as associating genetic technologies with Hitler, eugenics, and sex selection, and confusing minor gene
