3. Best Practices in Interrater Reliability: Three Common Approaches
http://dx.doi.org/10.4135/9781412995627.d5
The concept of interrater reliability permeates many facets of modern society.
For example, court cases based on a trial by jury require unanimous agreement from
jurors regarding the verdict, life-threatening medical diagnoses often require a second
or third opinion from health care professionals, student essays written in the context
of high-stakes standardized testing receive points based on the judgment of multiple
readers, and Olympic competitions, such as figure skating, award medals to participants
based on quantitative ratings of performance provided by an international panel of
judges.
Any time multiple judges are used to determine important outcomes, certain technical
and procedural questions emerge. Some of the more common questions are as follows:
How many raters do we need to be confident in our results? What is the minimum level
of agreement that my raters should achieve? And is it necessary for raters to agree
exactly, or is it acceptable for them to differ from each other so long as their difference
is systematic and can therefore be corrected?
The answers to these questions will help determine the best statistical approach to use
for your study.
Researchers must be particularly cautious about the assumptions they are making when
summarizing the data from multiple raters to generate a single summary score for each
student. For example, simply taking the mean of the ratings of two independent raters
may, in some circumstances, actually lead to biased estimates of student ability, even
when the scoring by independent raters is highly correlated (we return to this point later
in the chapter).
Finally, a third reason for conducting an interrater reliability study is to validate how
well ratings reflect a known true state of affairs (e.g., a validation study). For example,
suppose that a researcher believes that he or she has developed a new colon cancer
screening technique that should be highly predictive. The first thing the researcher
might do is train another provider to use the technique and compare the extent to
which the independent rater agrees with him or her on the classification of people
who have cancer and those who do not. Next, the researcher might attempt to predict
the prevalence of cancer using a formal diagnosis via more traditional methods (e.g.,
biopsy) to compare the extent to which the new technique is accurately predicting the
diagnosis generated by the known technique. In other words, the reason for conducting
an interrater reliability study in this circumstance is that it is not enough that
independent raters have high levels of interrater reliability; what really matters is the
level of reliability in predicting the actual occurrence of cancer as compared with a gold
standard: in this case, the rate of classification based on an established technique.
Once you have determined the primary purpose for conducting an interrater reliability
study, the next step is to consider the nature of the data that you have or will collect.
Once you have determined the type of data used for the rating scale, you should then
examine the distribution of your data using a histogram or bar chart. Are the ratings
of each rater normally distributed, uniformly distributed, or skewed? If the rating data
exhibit restricted variability, this can severely affect consistency estimates as well
as consensus-based estimates, threatening the validity of the interpretations made
from the interrater reliability estimates. Thus, it is important to have some idea of the
distribution of ratings in order to select the best statistical technique for analyzing the
data.
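A quick way to perform this check is to tabulate and plot each rater's distribution of scores. The sketch below is my own illustration (hypothetical column names rater_a and rater_b, hypothetical 1-6 ratings), not part of the original chapter:

    # Examine each rater's rating distribution for skew or restricted variability.
    import pandas as pd
    import matplotlib.pyplot as plt

    ratings = pd.DataFrame({
        "rater_a": [3, 4, 4, 5, 3, 2, 4, 3, 5, 4],   # hypothetical 1-6 ratings
        "rater_b": [3, 3, 4, 5, 4, 2, 4, 3, 6, 4],
    })

    # Frequency of each scale point for each rater.
    counts = ratings.apply(pd.Series.value_counts).fillna(0)
    print(counts)

    # Side-by-side bar chart of the two distributions.
    counts.plot(kind="bar")
    plt.xlabel("Rating category")
    plt.ylabel("Frequency")
    plt.show()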
The third important thing to investigate is whether the judges who rated the data agreed
on the underlying trait definition. For example, if two raters are judging the creativity
of a piece of artwork, one rater may believe that creativity is 50% novelty and 50%
task appropriateness. By contrast, another rater may judge creativity to consist of
50% novelty, 35% task appropriateness, and 15% elaboration. These differences in
perception will introduce extraneous error into the ratings. The extent to which your
raters are defining the construct in a similar way can be empirically evaluated (e.g.,
using the factor-analytic approach discussed later in this chapter).
Finally, even if the raters agree as to the structure, do they assign people into the same
category along the continuum, or does one judge assign a person poor in mathematics
while another judge classifies that same person as good? In other words, are they
using the rating categories the same way? This can be evaluated using consensus
estimates (e.g., via tests of marginal homogeneity).
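As a rough illustration of this kind of check, the sketch below (mine, with hypothetical category labels) simply compares how often each rater uses each category; a formal test of marginal homogeneity would be the natural next step:

    # Descriptive check: do two raters use the rating categories at similar rates?
    from collections import Counter

    rater_1 = ["poor", "good", "good", "fair", "poor", "good"]   # hypothetical labels
    rater_2 = ["fair", "good", "fair", "fair", "poor", "good"]

    print("Rater 1 category use:", Counter(rater_1))
    print("Rater 2 category use:", Counter(rater_2))
    # Large differences in these marginal counts suggest the raters are not
    # applying the rating categories in the same way.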
After specifying the purpose of the study and thinking about the nature of the data that
will be used in the analysis, the final question to ask is the pragmatic question of what
resources you have at your disposal.
The question of resources often has an influence on the way that interrater reliability
studies are conducted. For example, if you are a newcomer who is running a pilot
study to determine whether to continue on a particular line of research, and time and
money are limited, then a simpler technique such as the percent agreement, kappa, or
even correlational estimates may be the best match. On the other hand, if you are in a
situation where you have a high-stakes test that needs to be graded relatively quickly,
and money is not a major issue, then a more advanced measurement approach (e.g.,
the many-facets Rasch model) is most likely the best selection.
Summary
Once you have answered the three main questions discussed in this section, you will
be in a much better position to choose a suitable technique for your project. In the next
section of this chapter, we will discuss (a) the most popular statistics used to compute
interrater reliability,
(b) the computation and interpretation of the results of statistics using worked examples,
(c) the implications for summarizing data that follow from each technique, and (d) the
advantages and disadvantages of each technique.
Building on the work of Uebersax (2002) and J. R. Hayes and Hatch (1999), Stemler
(2004) has argued that the wide variety of statistical techniques used for computing
interrater reliability coefficients may be theoretically classified into one of three broad
categories: (a) consensus estimates, (b) consistency estimates, and (c) measurement
estimates. Statistics associated with these three categories differ in their assumptions
about the purpose of the interrater reliability study, the nature of the data, and the
implications for summarizing scores from various raters.
Consensus estimates tend to be the most useful when data are nominal in nature and
different levels of the rating scale represent qualitatively different ideas. Consensus
estimates also can be useful when different levels of the rating scale are assumed to
represent a linear continuum of the construct but are ordinal in nature (e.g., a Likert
type scale). In such cases, the judges must come to exact agreement about each of the
quantitative levels of the construct under investigation.
The three most popular types of consensus estimates of interrater reliability found in
the literature include (a) percent agreement and its variants, (b) Cohen's kappa and its
variants (Agresti, 1996; Cohen, 1960, 1968; Krippendorff, 2004), and (c) odds ratios.
Other less frequently used statistics that fall under this category include Jaccard's J and
the G-Index (see Barrett, 2001).
Percent Agreement. Perhaps the most popular method for computing a consensus
estimate of interrater reliability is through the use of the simple percent agreement
statistic. For example, in a study examining creativity, Sternberg and Lubart (1995)
asked sets of judges to rate the level of creativity associated with each of a number of
products generated by study participants (e.g., draw a picture illustrating Earth from
an insect's point of view, write an essay based on the title 2983). The goal of their
study was to demonstrate that creativity could be detected and objectively scored with
high levels of agreement across independent judges. The authors reported percent
agreement levels across raters of .92 (Sternberg & Lubart, 1995, p. 31).
The percent agreement statistic has several advantages. For example, it has a strong
intuitive appeal, it is easy to calculate, and it is easy to explain. The statistic also has
some distinct disadvantages, however. If the behavior of interest has a low or high
incidence of occurrence in the population, then it is possible to get artificially inflated
percent agreement figures simply because most of the values fall under one category of
the rating scale (J. R. Hayes & Hatch, 1999). Another disadvantage to using the simple
percent agreement figure is that it is often time-consuming and labor-intensive to train
judges to the point of exact agreement.
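For readers who want to compute the statistic directly, a minimal sketch follows (my own illustration with hypothetical ratings, not data from the chapter):

    # Simple percent agreement: the proportion of cases scored identically by two raters.
    rater_a = [3, 4, 4, 5, 3, 2, 4, 3, 5, 4]   # hypothetical ratings on a 1-6 scale
    rater_b = [3, 3, 4, 5, 4, 2, 4, 3, 6, 4]

    exact_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
    print(f"Percent exact agreement: {exact_agreement:.2f}")   # 0.70 for these data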
One popular modification of the percent agreement figure found in the testing literature
involves broadening the definition of agreement by including the adjacent scoring
categories on the rating scale. For example, some testing programs include writing
sections that are scored by judges using a rating scale with levels ranging from 1 (low)
to 6 (high) (College Board, 2006). If a percent adjacent agreement approach were used
to score this section of the exam, this would mean that the judges would not
need to come to exact agreement about the ratings they assign to each participant;
rather, so long as the ratings did not differ by more than one point above or below the
other judge, then the two judges would be said to have reached consensus. Thus, if
Rater A assigns an essay a score of 3 and Rater B assigns the same essay a score of
4, the two raters are close enough together to say that they agree, even though their
agreement is not exact.
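Using the same hypothetical ratings as the previous sketch, the adjacent-agreement variant only requires the two scores to fall within one point of each other:

    # Percent adjacent agreement: ratings within one scale point count as agreement.
    rater_a = [3, 4, 4, 5, 3, 2, 4, 3, 5, 4]   # hypothetical ratings on a 1-6 scale
    rater_b = [3, 3, 4, 5, 4, 2, 4, 3, 6, 4]

    adjacent_agreement = sum(abs(a - b) <= 1 for a, b in zip(rater_a, rater_b)) / len(rater_a)
    print(f"Percent adjacent agreement: {adjacent_agreement:.2f}")   # 1.00 for these data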
The rationale for the adjacent percent agreement approach is often a pragmatic one. It
is extremely difficult to train independent raters to come to exact agreement, no matter
how good one's scoring rubric. Yet, raters often give scores that are pretty close to
the same, and we do not want to discard this information. Thus, the thinking is that if
we have a situation in which two raters never differ by more than one score point in
assigning their ratings, then we have a justification for taking the average score across
all ratings. This logic holds under two conditions. First, the difference between raters
must be randomly distributed across items. In other words, Rater A should not give
systematically lower scores than Rater B. Second, the scores assigned by raters must
be evenly distributed across all possible score categories. In other words, both raters
should give equal numbers of 1s, 2s, 3s, 4s, 5s, and 6s across the population of essays
that they have read. If both of these assumptions are met, then the adjacent percent
agreement approach is defensible. If, however, either of these assumptions is violated,
this could lead to a situation in which the validity of the resultant summary scores is
dubious (see the box below).
Consider a situation in which Rater A systematically assigns scores that are one point
lower than Rater B. Assume that they have each rated a common set of 100 essays. If
we average the scores of the two raters across all essays to arrive at individual student
scores, this seems, on the surface, to be defensible: it really does not matter whether
Rater A or Rater B assigns the high or low score, because even if Rater A and Rater B
had no systematic difference in severity of ratings, the average score would be the
same. However, suppose that dozens of raters are used to score the essays.
Imagine that Rater C is also called in to rate the same essay for a different sample of
students. Rater C is paired up with Rater B within the context of an overlapping design
to maximize rater efficiency (e.g., McArdle, 1994). Suppose that we find a situation in
which Rater B is systematically lower than Rater C in assigning grades. In other words,
Rater A is systematically one point lower than Rater B, and Rater B is systematically
one point lower than Rater C.
On the surface, again, it seems logical to average the scores assigned by Rater B and
Rater C. Yet, we now find ourselves in a situation in which the students rated by the
Rater B/C pair score systematically one point higher than the students rated by the
Rater A/B pair, even though neither combination of raters differed by more than one
score point in their ratings, thereby demonstrating interrater reliability. Which student
would you rather be? The one who was lucky enough to draw the B/C rater combination
or the one who unfortunately was scored by the A/B combination?
Thus, in order to make a validity argument for summarizing the results of multiple raters,
it is not enough to demonstrate adjacent percent agreement between rater pairs; it must
also be demonstrated that there is no systematic difference in rater severity between
the rater set pairs.
This can be demonstrated (and corrected for in the final score) through the use of the
many-facet Rasch model.
Now let us examine what happens if the second assumption of the adjacent percent
agreement approach is violated. If you are a rater for a large testing company, and
you are told that you will be retained only if you are able to demonstrate interrater
reliability with everyone else, you would naturally look for your best strategy to maximize
interrater reliability. If you are then told that your scores can differ by no more than
one point from the other raters, you would quickly discover that your best bet then is to
avoid giving any ratings at the extreme ends of the scale (i.e., a rating of 1 or a rating
of 6). Why? Because a rating at the extreme end of the scale (e.g., 6) has two potential
scores with which it can overlap (i.e., 5 or 6), whereas a rating of 5 would allow you to
potentially agree with three scores
(i.e., 4, 5, or 6), thereby maximizing your chances of agreeing with the second
rater. Thus, it is entirely likely that the scale will go from being a 6-point scale to a
4-point scale, reducing the overall variability in scores given across the spectrum of
participants. If only four categories are used, then the percent agreement statistics will
be artificially inflated due to chance factors. For example, when a scale is 1 to 6, two
raters are expected to agree on ratings by chance alone only 17% of the time.
When the scale is reduced to 1 to 4, the percent agreement expected by chance jumps
to 25%. With three categories, a 33% chance agreement is expected; with two categories, a
50% chance agreement is expected. In other words, a 6-point scale that uses adjacent
percent agreement scoring is most likely functionally equivalent to a 4-point scale that
uses exact agreement scoring.
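The chance-agreement figures quoted above follow from a simple calculation. If two raters each use the k categories independently and with equal probability (an assumption of mine for this illustration; it is not stated in the chapter), the probability that they land on the same category is

\[
P(\text{chance agreement}) = \sum_{c=1}^{k} \frac{1}{k} \cdot \frac{1}{k} = \frac{1}{k},
\]

which gives 1/6 ≈ 17% for a 6-point scale, 1/4 = 25% for a 4-point scale, 1/3 ≈ 33% for three categories, and 1/2 = 50% for two. With unequal category use, the chance rate is instead the sum of the products of the two raters' marginal proportions, which is exactly the chance term used by the kappa statistic discussed below.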
This approach is advantageous in that it relaxes the strict criterion that the judges agree
exactly. On the other hand, percent agreement using adjacent categories can lead to
inflated estimates of interrater reliability if there are only a limited number of categories
to choose from (e.g., a 1-4 scale). If the rating scale has a limited number of points,
then nearly all points will be adjacent, and it would be surprising to find agreement lower
than 90%.
Kappa is often useful within the context of exploratory research. For example, Stemler
and Bebell (1999) conducted a study aimed at detecting the various purposes of
schooling articulated in school mission statements. Judges were given a scoring rubric
that listed 10 possible thematic categories under which the main idea of each mission
statement could be classified (e.g., social development, cognitive development, civic
development). Judges then read a series of mission statements and attempted to
classify each sampling unit according to the major purpose of schooling articulated.
If both judges consistently rated the dominant theme of the mission statement as
representing elements of citizenship, then they were said to have communicated with
each other in a meaningful way because they had both classified the statement in the
same way. If one judge classified the major theme as social development, and the
other judge classified the major theme as citizenship, then a breakdown in shared
understanding occurred. In that case, the judges were not coming to a consensus
on how to apply the levels of the scoring rubric. The authors chose to use the kappa
statistic to evaluate the degree of consensus because they did not expect the frequency
of the major themes of the mission statements to be evenly distributed across the 10
categories of their scoring rubric.
Although some authors (Landis & Koch, 1977) have offered guidelines for interpreting
kappa values, other authors (Krippendorff, 2004; Uebersax, 2002) have argued that
the kappa values for different items or from different studies cannot be meaningfully
compared unless the base rates are identical. Consequently, these authors suggest
that although the statistic gives some indication as to whether the agreement is better
than that predicted by chance alone, it is difficult to apply rules of thumb for interpreting
kappa across different circumstances. Instead, Uebersax (2002) suggests that
researchers using the kappa coefficient look at it for an up-or-down evaluation
of whether ratings are different from chance, but they should not get too invested in its
interpretation.
Krippendorff (2004) has introduced a new coefficient alpha into the literature that claims
to be superior to kappa because alpha is capable of incorporating the information from
multiple raters, dealing with missing data, and yielding a chancecorrected estimate
of interrater reliability. The major disadvantage of Krippendorff's alpha is that it is
computationally complex; however, statistical macros that compute Krippendorff's alpha
have been created and are freely available (K. Hayes, 2006). However,
some research suggests that in practice, alpha values tend to be nearly identical to
kappa values (Dooley, 2006).
Odds Ratios. A third consensus estimate of interrater reliability is the odds ratio. The
odds ratio is most often used in circumstances where raters are making dichotomous
ratings (e.g., presence/absence of a phenomenon), although it can be extended to
ordered category ratings. In a 2 × 2 contingency table, the odds ratio indicates how
much the odds of one rater making a given rating (e.g., positive/negative) increase for
cases when the other rater has made the same rating. For example, suppose that in
a music competition with 100 contestants, Rater 1 gives 90 of them a positive score
for vocal ability, while in the same sample of 100 contestants, Rater 2 only gives 20 of
them a positive score for vocal ability. The odds of Rater 1 giving a positive vocal ability
score are 90 to 10, or 9:1, while the odds of Rater 2 giving a positive vocal ability score
are only 20 to 80, or 1:4 = 0.25:1. Now, 9/0.25 = 36, so the odds ratio is 36. Within the
context of interrater reliability, the important idea captured by the odds ratio is whether it
deviates substantially from 1.0. From the perspective of interrater reliability, it would be
most desirable to have an odds ratio that is close to 1.0, which would indicate that Rater
1 and Rater 2 rated the same proportion of contestants as having high vocal ability. The
larger the odds ratio value, the larger the discrepancy there is between raters in terms
of their level of consensus.
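The arithmetic of this example can be scripted directly; the sketch below simply reproduces the computation just described (it is my illustration, not output from the chapter):

    # Odds ratio comparing two raters' marginal odds of a positive vocal-ability rating,
    # using the music-competition counts from the example above.
    rater1_pos, rater1_neg = 90, 10   # Rater 1: 90 positive, 10 negative
    rater2_pos, rater2_neg = 20, 80   # Rater 2: 20 positive, 80 negative

    odds_1 = rater1_pos / rater1_neg          # 9.0, i.e., 9:1
    odds_2 = rater2_pos / rater2_neg          # 0.25, i.e., 1:4
    print(f"Odds ratio: {odds_1 / odds_2:.0f}")   # 36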
The odds ratio has the advantage of being easy to compute and is familiar from
other statistical applications (e.g., logistic regression). The disadvantage to the odds
ratio is that it is most intuitive within the context of a 2 × 2 contingency table with
dichotomous rating categories. Although the technique can be generalized to ordered
category ratings, it involves extra computational complexity that undermines its intuitive
advantage. Furthermore, as Osborne (2006) has pointed out, although the odds ratio
is straightforward to compute, the interpretation of the statistic is not always easy to
convey, particularly to a lay audience.
Cohen's Kappa. The formula for computing Cohen's kappa is listed in Formula 1:

\[
\kappa = \frac{P_A - P_C}{1 - P_C} \quad \text{(Formula 1)}
\]

where P_A is the proportion of units on which the raters agree and P_C is the proportion
of agreement expected by chance.

Table 3.1 SPSS Code and Output for Percent Agreement, Percent Adjacent
Agreement, and Cohen's Kappa
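To make Formula 1 concrete, here is a small Python sketch (my own illustration, not the SPSS code referenced in Table 3.1) that computes kappa from two hypothetical rating vectors and cross-checks the result against scikit-learn:

    # Cohen's kappa computed directly from Formula 1: kappa = (P_A - P_C) / (1 - P_C).
    from collections import Counter
    from sklearn.metrics import cohen_kappa_score   # optional cross-check

    rater_a = [1, 2, 2, 3, 3, 3, 1, 2, 3, 1]   # hypothetical categorical ratings
    rater_b = [1, 2, 3, 3, 3, 2, 1, 2, 3, 2]

    n = len(rater_a)
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n           # observed agreement

    marg_a, marg_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    p_c = sum((marg_a[c] / n) * (marg_b[c] / n) for c in categories)  # chance agreement

    kappa = (p_a - p_c) / (1 - p_c)
    print(f"kappa by hand:     {kappa:.3f}")
    print(f"kappa via sklearn: {cohen_kappa_score(rater_a, rater_b):.3f}")  # same value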
Odds Ratios. The formula for computing an odds ratio is shown in Formula 2.
The SPSS code for computing the odds ratio is shown in Table 3.2. In order to compute
the odds ratio using the crosstabs procedure in SPSS, it was necessary to recode
the data so that the ratings were dichotomous. Consequently, ratings of 0, 1, and 2
were assigned a value of 0 (failing) while ratings of 3 and 4 were assigned a value of
1 (passing). The odds ratio for the current data set is 30, indicating that there was a
substantial difference between the raters in terms of the proportion of students classified
as passing versus failing.
A second advantage is that the techniques falling within this general category are well
suited to dealing with nominal variables whose levels on the rating scale represent
qualitatively different categories. A third advantage is that consensus estimates can
be useful in diagnosing problems with judges' interpretations of how to apply the rating
scale. For example, inspection of the information from a crosstab table may allow the
researcher to realize that the judges may be unclear about the rules for when they are
supposed to score an item as zero as opposed to when they are supposed to score the
item as missing. A visual analysis of the output allows the researcher to go back to the
data and clarify the discrepancy or retrain the judges.
When judges exhibit a high level of consensus, it implies that both judges are
essentially providing the same information. One implication of a high
consensus estimate of interrater reliability is that both judges need not score all
remaining items. For example, if there were 100 tests to be scored after the interrater
reliability study was finished, it would be most efficient to ask Judge A to rate exams 1
to 50 and Judge B to rate exams 51 to 100 because the two judges have empirically
demonstrated that they share a similar meaning for the scoring rubric. In practice,
however, it is usually a good idea to build in a 30% overlap between judges even after
they have been trained, in order to provide evidence that the judges are not drifting from
their consensus as they read more items.
A second disadvantage is that the amount of time and energy it takes to train judges to
come to exact agreement is often substantial, particularly in applications where exact
agreement is unnecessary (e.g., if the exact application of the levels of the scoring
rubric is not important, but rather a means to the end of getting a summary score for
each respondent).
Third, as Linacre (2002) has noted, training judges to a point of forced consensus may
actually reduce the statistical independence of the ratings and threaten the validity of
the resulting scores.
Finally, consensus estimates can be overly conservative if two judges exhibit systematic
differences in the way that they use the scoring rubric but simply cannot be trained to
come to a consensus. As we will see in the next section, it is possible to have a low
consensus estimate of interrater reliability while having a high consistency estimate and
vice versa. Consequently, sole reliance on consensus estimates of interrater reliability
might lead researchers to conclude that interrater reliability is low when it may be more
precisely stated that the consensus estimate of interrater reliability is low.
Consistency approaches to estimating interrater reliability are most useful when the
data are continuous in nature, although the technique can be applied to categorical
data if the rating scale categories are thought to represent an underlying continuum
along a unidimensional construct. Values greater than .70 are typically acceptable for
consistency estimates of interrater reliability (Barrett, 2001).
The three most popular types of consistency estimates are (a) correlation coefficients
(e.g., Pearson, Spearman), (b) Cronbach's alpha (Cronbach, 1951), and (c) intraclass
correlation. For information regarding additional consistency estimates of interrater
reliability, see Bock, Brennan, and Muraki (2002); Burke and Dunlap (2002); LeBreton,
Burgess, Kaiser, Atchley, and James (2003); and Uebersax (2002).
Correlation Coefficients. Perhaps the most popular statistic for calculating the degree
of consistency between raters is the Pearson correlation coefficient. Correlation
coefficients measure the association between independent raters. Values approaching
+1 or −1 indicate that the two raters are following a systematic pattern in their ratings,
while values approaching zero indicate that it is nearly impossible to predict the score
one rater would give by knowing the score the other rater gave. It is important to note
that even though the correlation between scores assigned by two judges may be nearly
perfect, there may be substantial mean differences between the raters. In other words,
two raters may differ in the absolute values they assign to each rating by two points;
however, so long as there is a 2-point difference for each rating they assign, the raters
will have achieved high consistency estimates of interrater reliability. Thus, a large
value for a measure of association does not imply that the raters are agreeing on the
actual application of the rating scale, only that they are consistent in applying the ratings
according to their own unique understanding of the scoring rubric.
The Pearson correlation coefficient can be computed by hand (Glass &
Hopkins, 1996) or can easily be computed using most statistical packages. One
beneficial feature of the Pearson correlation coefficient is that the scores on the rating
scale can be continuous in nature (e.g., they can take on partial values such as 1.5).
Like the percent agreement statistic, the Pearson correlation coefficient can be
calculated only for one pair of judges at a time and for one item at a time.
A potential limitation of the Pearson correlation coefficient is that it assumes that the
data underlying the rating scale are normally distributed. Consequently, if the data from
the rating scale tend to be skewed toward one end of the distribution, this will attenuate
the upper limit of the correlation coefficient that can be observed. The Spearman rank
coefficient provides an approximation of the Pearson correlation coefficient but may be
used in circumstances where the data under investigation are not normally distributed.
For example, rather than using a continuous rating scale, each judge may rank order
the essays that he or she has scored from best to worst. In this case, then, since both
ratings being correlated are in the form of rankings, a correlation coefficient can be
computed that is governed by the number of pairs of ratings (Glass & Hopkins, 1996).
The major disadvantage to Spearman's rank coefficient is that it requires both judges to
rate all cases.
Cronbach's Alpha. In situations where more than two raters are used, another approach
to computing a consistency estimate of interrater reliability would be to compute
Cronbach's alpha coefficient (Crocker & Algina, 1986). Cronbach's alpha coefficient
is a measure of internal consistency reliability and is useful for understanding the
extent to which the ratings from a group of judges hold together to measure a common
dimension. If the Cronbach's alpha estimate among the judges is low, then this implies
that the majority of the variance in the total composite score is really due to error
variance and not true score variance (Crocker & Algina, 1986).
The major advantage of using Cronbach's alpha comes from its capacity to yield a
single consistency estimate of interrater reliability across multiple judges. The major
disadvantage of the method is that each judge must give a rating on every case, or
else the alpha will only be computed on a subset of the data. In other words, if just one
rater fails to score a particular individual, that individual will be left out of the analysis. In
addition, as Barrett (2001) has noted, because of this averaging of ratings, we reduce
the variability of the judges' ratings such that "when we average all judges' ratings, we
effectively remove all the error variance for judges" (p. 7).
one should distinguish between these two sources of disagreement (p. 5). In addition,
because the intraclass correlation represents the ratio of within-subject variance to
between-subject variance on a rating scale, the results may not look the same if raters
are rating a homogeneous subpopulation as opposed to the general population. Simply
by restricting the between-subject variance, the intraclass correlation will be lowered.
Therefore, it is important to pay special attention to the population being assessed and
to understand that this can influence the value of the intraclass correlation coefficient
(ICC). For this reason, ICCs are not directly comparable across populations. Finally, it is
important to note that, like the Pearson correlation coefficient, the intraclass correlation
coefficient will be attenuated if assumptions of normality in rating data are violated.
Correlation Coefficients. The formula for computing the Pearson correlation
coefficient is listed in Formula 3.
Using SPSS, one can run the correlate procedure and generate a table similar to
Table 3.3. One may request both Pearson and Spearman correlation coefficients.
The Pearson correlation coefficient on this data set is .76; the Spearman correlation
coefficient is .74.
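The same coefficients can be obtained outside SPSS. A minimal sketch with SciPy follows (hypothetical scores, so the values will not match the .76 and .74 reported for the chapter's data):

    # Pearson and Spearman correlations between two raters' scores.
    from scipy.stats import pearsonr, spearmanr

    rater_a = [2.0, 3.5, 4.0, 4.5, 3.0, 2.5, 5.0, 3.5]   # hypothetical continuous ratings
    rater_b = [2.5, 3.0, 4.5, 4.0, 3.5, 2.0, 5.5, 4.0]

    r_pearson, _ = pearsonr(rater_a, rater_b)
    rho_spearman, _ = spearmanr(rater_a, rater_b)
    print(f"Pearson r = {r_pearson:.2f}, Spearman rho = {rho_spearman:.2f}")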
Cronbach's Alpha. The formula for computing Cronbach's alpha is listed in Formula 4:

\[
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma^2_{y_i}}{\sigma^2_x}\right) \quad \text{(Formula 4)}
\]

where k is the number of judges, \sigma^2_{y_i} is the variance of the ratings assigned by
judge i, and \sigma^2_x is the variance of the total (summed) ratings.
In order to compute Cronbach's alpha using SPSS, one may simply specify in the
reliability procedure the desire to produce Cronbach's alpha (see Table 3.4). For this
example, the alpha value is .86.
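Cronbach's alpha can also be computed directly from a ratee-by-rater matrix using the formula above; the following sketch is my illustration with hypothetical ratings, so the value will differ from the .86 reported for the chapter's data:

    # Cronbach's alpha across k raters:
    # alpha = k/(k-1) * (1 - sum of rater variances / variance of summed scores).
    import numpy as np

    # Rows = ratees, columns = raters (hypothetical 1-6 ratings).
    ratings = np.array([
        [3, 3, 4],
        [5, 4, 5],
        [2, 2, 3],
        [4, 5, 4],
        [3, 4, 4],
        [6, 5, 6],
    ])

    k = ratings.shape[1]
    rater_variances = ratings.var(axis=0, ddof=1)      # variance of each rater's scores
    total_variance = ratings.sum(axis=1).var(ddof=1)   # variance of the summed scores
    alpha = (k / (k - 1)) * (1 - rater_variances.sum() / total_variance)
    print(f"Cronbach's alpha = {alpha:.2f}")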
Table 3.3 SPSS Code and Output for Pearson and Spearman Correlations
Intraclass Correlation. Formula 5 presents the equation used to compute the intraclass
correlation value.
where \sigma^2(b) is the variance of the ratings between judges, and \sigma^2(w) is the
pooled variance within raters.
In order to compute intraclass correlation, one may specify the procedure in SPSS
using the code listed in Table 3.5. The intraclass correlation coefficient for this data set
is .75.
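Because there are several variants of the intraclass correlation, the following sketch should be read as one common formulation (a one-way ICC computed from mean squares between and within ratees) rather than necessarily the same quantity as Formula 5; the data are hypothetical:

    # One-way intraclass correlation, ICC(1,1), from a ratee-by-rater matrix.
    import numpy as np

    ratings = np.array([            # rows = ratees, columns = raters (hypothetical)
        [3, 3, 4],
        [5, 4, 5],
        [2, 2, 3],
        [4, 5, 4],
        [3, 4, 4],
        [6, 5, 6],
    ], dtype=float)

    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)

    ms_between = k * ((row_means - grand_mean) ** 2).sum() / (n - 1)         # between ratees
    ms_within = ((ratings - row_means[:, None]) ** 2).sum() / (n * (k - 1))  # within ratees

    icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    print(f"ICC(1,1) = {icc:.2f}")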
A second disadvantage of consistency estimates is that judges may differ not only
systematically in the raw scores they apply but also in the number of rating scale
categories they use. In that case, a mean adjustment for a severe judge may provide a
partial solution, but the two judges may also differ on the variability in scores they give.
Thus, a mean adjustment alone will not effectively correct for this difference.
A third disadvantage of consistency estimates is that they are highly sensitive to the
distribution of the observed data. In other words, if most of the ratings fall into one or
two categories, the correlation coefficient will necessarily be deflated due to restricted
variability. Consequently, a reliance on the consistency estimate alone may lead the
researcher to falsely conclude that interrater reliability was poor without specifying more
precisely that the consistency estimate of interrater reliability was poor and providing an
appropriate rationale.
Measurement estimates are also useful in circumstances where multiple judges are
providing ratings, and it is impossible for all judges to rate all items. They are best used
when different levels of the rating scale are intended to represent different levels of an
underlying unidimensional construct (e.g., mathematical competence).
The two most popular types of measurement estimates are (a) factor analysis and (b)
the many-facets Rasch model (Linacre, 1994; Linacre, Englehard, Tatem, & Myford,
1994; Myford & Cline, 2002) or loglinear models (von Eye & Mun, 2004).
Once interrater reliability has been established in this way, each participant may then
receive a single summary score corresponding to his or her loading on the first principal
component underlying the set of ratings. This score can be computed automatically by
most statistical packages.
The advantage of this approach is that it assigns a summary score for each participant
that is based only on the relevance of the strongest dimension underlying the data. The
disadvantage to the approach is that it assumes that ratings are assigned without error
by the judges.
In addition, the difficulty of each item, as well as the severity of all judges who rated the
items, can also be directly compared. For example, if a history exam included five essay
questions and each of the essay questions was rated by 3 judges (2 unique judges per
item and 1 judge who scored all items), the facets approach would allow the researcher
to directly compare the severity of a judge who rated only Item 1 with the severity of
a judge who rated only Item 4. Each of the 11 judges (2 unique judges per item across
5 items + 1 judge who rated all items = 2 × 5 + 1 = 11) could be directly compared. The
mathematical representation of the many-facets Rasch model is fully described in
Linacre (1994).
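For reference, the rating-scale form of the model commonly presented in the Rasch literature can be sketched as follows (my paraphrase of the standard notation, not a quotation from Linacre, 1994):

\[
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k,
\]

where P_{nijk} is the probability that person n receives a rating in category k from judge j on item i, B_n is the ability of the person, D_i is the difficulty of the item, C_j is the severity of the judge, and F_k is the difficulty of the step from category k − 1 to category k. The judge term C_j is what allows rater severity to be estimated on the same logit scale as person ability and item difficulty.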
Finally, in addition to providing information that allows for the evaluation of the severity
of each judge in relation to all other judges, the facets approach also allows one to
evaluate the extent to which each of the individual judges is using the scoring rubric in
a manner that is internally consistent (i.e., an estimate of intrarater reliability). In other
words, even if judges differ in their interpretation of the rating scale, the fit statistics will
indicate the extent to which a given judge is faithful to his or her own definition of the
scale categories across items and people.
The many-facets Rasch approach has several advantages. First, the technique puts
rater severity on the same scale as item difficulty and person ability (i.e., the logit scale).
Consequently, this feature allows for the computation of a single final summary score
that is already corrected for rater severity. As Linacre (1994) has noted, this provides a
distinct advantage over generalizability studies since the goal of a generalizability study
is to determine
Second, the item fit statistics provide some estimate of the degree to which each
individual rater was applying the scoring rubric in an internally consistent manner. In
other words, high fit statistic values are an indication of rater drift over time.
Third, the technique works with multiple raters and does not require all raters to
evaluate all objects. In other words, the technique is well suited to overlapping research
designs, which allows the researcher to use resources more efficiently. So
long as there is sufficient connectedness in the data set (Engelhard, 1997), the severity
of all raters can be evaluated relative to each other.
Factor Analysis. The mathematical formulas for computing factor-analytic solutions are
expounded in several excellent texts (e.g., Harman, 1967; Kline, 1998). When using
factor analysis to estimate interrater reliability, the data set should be structured in
such a way that each column in the data set corresponds to the score given by Rater X
on Item Y to each object in the data set (objects each receive their own row). Thus,
if five raters were to score three essays from 100 students, the data set should contain
15 columns (e.g., Rater1_Item1, Rater2_Item1, Rater1_Item2) and 100 rows. In this
example, we would run a separate factor analysis for each essay item (e.g., a 5 × 100
data matrix). Table 3.6 shows the SPSS code and output for running the factor analysis
procedure.
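As an illustration of the same analysis outside SPSS, the sketch below (hypothetical, simulated data rather than the data behind Table 3.6) extracts principal components from a 100 × 5 matrix of rater scores for a single essay item and reports how much variance the first component explains:

    # Principal components of five raters' scores on one essay item (100 ratees).
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    true_quality = rng.normal(size=100)          # latent essay quality (simulated)
    # Each rater's score = latent quality plus independent rater error.
    ratings = np.column_stack([true_quality + rng.normal(scale=0.5, size=100)
                               for _ in range(5)])

    pca = PCA().fit(ratings)
    print("Proportion of variance explained by the first component:",
          round(pca.explained_variance_ratio_[0], 2))
    # A dominant first component suggests the raters are tapping a single underlying
    # construct; variance spread across several components suggests they are not.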
There are two important pieces of information generated by the factor analysis. The first
important piece of information is the value of the explained variance in the first factor.
In the example output, the shared variance of the first factor is 76%, indicating that
independent raters agree on the underlying nature of the construct being rated, which
is also evidence of interrater reliability. In some cases, it may turn out that the variance
in ratings is distributed over more than one factor. If that is the case, then this provides
some evidence to suggest that the raters are not interpreting the underlying construct
in the same manner (e.g., recall the example about creativity mentioned earlier in this
chapter).
The second important piece of information comes from the factor loadings. Each object
that has been rated will have a loading on each underlying factor. Assuming that the
first factor explains most of the variance, the score to be used in subsequent analyses
should be the loading on the primary factor.
The key values to interpret within the context of the many-facets Rasch approach
are rater severity measures and fit statistics. Rater severity indices are useful for
estimating the extent to which systematic differences exist between raters with regard
to their level of severity. For example, rater CL was the most severe rater, with an
estimated severity measure of +0.89 logits. Consequently, students whose test items
were scored by CL would be more likely to receive lower raw scores than students who
had the same test item scored by any of the other raters used in this project. At the
other extreme, rater AP was the most lenient rater, with a rater severity measure of
−0.91 logits. Consequently, simply using raw scores would lead to biased estimates of
student proficiency since student estimates would depend, to an important degree, on
which rater scored their essay. The facets program corrects for these differences and
incorporates them into student ability estimates. If these differences were not taken into
account when calculating student ability, students who had their exams scored by AP
would be more likely to receive substantially higher raw scores than if the same item
were rated by any of the other raters.
The results presented in Table 3.7 show that there is about a 1.8-logit spread in
systematic differences in rater severity (from −0.91 to +0.89). Consequently, assuming
that all raters are defining the rating scales they are using in the same way is not a
tenable assumption, and differences in rater severity must be taken into account in
order to come up with precise estimates of student ability.
In addition to providing information that allows us to evaluate the severity of each rater
in relation to all other raters, the facets approach also allows us to evaluate the extent
to which each of the individual raters is using the scoring rubric in a manner that is
internally consistent (i.e., intrarater reliability). In other words, even if raters differ in their
own definition of how they use the scale, the fit statistics will indicate the extent to which
a given rater is faithful to his or her own definition of the scale categories across items
and people. Rater fit statistics are presented in columns 5 and 6 of Table 3.7.
Fit statistics provide an empirical estimate of the extent to which the expected response
patterns for each individual match the observed response patterns. These fit statistics
are interpreted much the same way as item or person infit statistics are interpreted
(Bond & Fox, 2001; Wright & Stone, 1979). An infit value of 1.4 indicates that
there is 40% more variation in the data than predicted by the Rasch model. Conversely,
an infit value of 0.5 indicates that there is 50% less variation in the data than
predicted by the Rasch model. Infit mean squares that are greater than 1.3 indicate
that there is more unpredictable variation in the raters' responses than we would expect
based on the model. Infit mean square values that are less than 0.7 indicate that there
is less variation in the raters' responses than we would predict based on the model.
Myford and Cline (2002) note that high infit values may suggest that ratings are noisy
as a result of the raters' overuse of the extreme scale categories (i.e., the lowest and
highest values on the rating scale), while low infit mean square indices may be a
consequence of overuse of the middle scale categories (e.g., moderate response bias).
The results in Table 3.7 reveal that 6 of the 12 raters had infit mean-square indices
that exceeded 1.3. Raters CL (infit of 3.4), JW (infit of 2.4), and AM (infit of 2.2) appear
particularly problematic. Their high infit values suggest that these raters are not using
the scoring rubrics in a consistent way. The table of misfitting ratings provided by the
facets computer program output allowed for an investigation of the exact nature of the
highly unexpected response patterns associated with each of these raters. The table of
misfitting ratings provides information on discrepant ratings based on two criteria: (a)
how the other raters scored the item and (b) the particular rater's typical level of severity
in scoring items of similar difficulty.
Third, measurement estimates have the distinct advantage of not requiring all judges
to rate all items in order to arrive at an estimate of interrater reliability. Rather, judges
may rate a particular subset of items, and as long as there is sufficient connectedness
(Linacre, 1994; Linacre et al., 1994) across the judges and ratings, it will be possible to
directly compare judges.
In the end, the best technique will always depend on (a) the goals of the analysis (e.g.,
the stakes associated with the study outcomes), (b) the nature of the data, and (c)
the desired level of information based on the resources available. The answers to
these three questions will help to determine how many raters one needs, whether the
raters need to be in perfect agreement with each other, and how to approach creating
summary scores across raters.
We conclude this chapter with a brief table that is intended to provide rough interpretive
guidance with regard to acceptable interrater reliability values (see Table 3.8). These
values simply represent conventions the authors have encountered in the literature
and via discussions with colleagues and reviewers; however, keep in mind that these
guidelines are just rough estimates and will vary depending on the purpose of the study
and the stakes associated with the outcomes. The conventions articulated
here assume that the interrater reliability study is part of a low-stakes, exploratory
research study.
Table 3.8 General Guidelines for Interpreting Various Interrater Reliability Coefficients
References
Agresti, A. (1996). An introduction to categorical data analysis (2nd ed.). New York: John Wiley.
Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice Hall.
Barrett, P. (2001, March). Assessing the reliability of rating data. Retrieved June 16, 2003, from http://www.liv.ac.uk/~pbarrett/rater.pdf
Bock, R., Brennan, R. L., & Muraki, E. (2002). The information in multiple ratings. Applied Psychological Measurement, 26(4), 364-375.
Bond, T., & Fox, C. (2001). Applying the Rasch model. Mahwah, NJ: Lawrence Erlbaum.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213-220.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
College Board. (2006). How the essay is scored. Retrieved November 4, 2006, from http://www.collegeboard.com/student/testing/sat/about/sat/essay_scoring.html
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Orlando, FL: Harcourt Brace Jovanovich.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Engelhard, G. (1997). Constructing rater and task banks for performance assessment. Journal of Outcome Measurement, 1(1), 19-33.
Glass, G. V., & Hopkins, K. H. (1996). Statistical methods in education and psychology. Boston: Allyn & Bacon.
Hayes, K. (2006). SPSS macro for computing Krippendorff's alpha. Retrieved from http://www.comm.ohio-state.edu/ahayes/SPSS%20programs/kalpha.htm
Kline, R. (1998). Principles and practice of structural equation modeling. New York: Guilford.
LeBreton, J. M., Burgess, J. R., Kaiser, R. B., Atchley, E., & James, L. R. (2003). The restriction of variance hypothesis and interrater reliability and agreement: Are ratings from multiple sources really dissimilar? Organizational Research Methods, 6(1), 80-128.
Linacre, J. M., Englehard, G., Tatem, D. S., & Myford, C. M. (1994). Measurement with judges: Many-faceted conjoint measurement. International Journal of Educational Research, 21(4), 569-577.
Myford, C. M., & Cline, F. (2002, April 1-5). Looking for patterns in disagreements: A facets analysis of human raters' and e-raters' scores on essays written for the Graduate Management Admission Test (GMAT). Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
Osborne, J. W. (2006). Bringing balance and technical accuracy to reporting odds ratios and the results of logistic regression analyses. Practical Assessment, Research & Evaluation, 11(7). Retrieved from http://pareonline.net/getvn.asp?v=11&n=17
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests (Expanded ed.). Chicago: University of Chicago Press. (Original work published 1960)
Snow, A. L., Cook, K. F., Lin, P.-S., Morgan, R. O., & Magaziner, J. (2005). Proxies and other external raters: Methodological considerations. Health Services Research, 40(5), 1676-1693.
Stemler, S. E., & Bebell, D. (1999, April). An empirical approach to understanding and analyzing the mission statements of selected educational institutions. Paper presented at the New England Educational Research Organization (NEERO), Portsmouth, NH.
Stemler, S. E., Grigorenko, E. L., Jarvin, L., & Sternberg, R. J. (2006). Using the theory of successful intelligence as a basis for augmenting AP exams in psychology and statistics. Contemporary Educational Psychology, 31(2), 75-108.
Sternberg, R. J., & Lubart, T. I. (1995). Defying the crowd: Cultivating creativity in a culture of conformity. New York: Free Press.
Uebersax, J. (2002). Statistical methods for rater agreement. Retrieved August 9, 2002, from http://ourworld.compuserve.com/homepages/jsuebersax/agree.htm
von Eye, A., & Mun, E. Y. (2004). Analyzing rater agreement: Manifest variable methods. Mahwah, NJ: Lawrence Erlbaum.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Jason W. Davey
Fenwal, Inc.
P. Cristian Gugiu
Western Michigan University
Chris L. S. Coryn
Western Michigan University
Purpose: This paper presents quantitative methods for determining the reliability of conclusions from qualitative data sources. Although some qualitative researchers disagree with such applications, a link between the qualitative and quantitative fields is successfully established through data collection and coding procedures.

Setting: Not applicable.

Intervention: Not applicable.

Findings: The calculation of the kappa statistic, weighted kappa statistic, ANOVA Binary Intraclass Correlation, and Kuder-Richardson 20 is illustrated through a fictitious example. Formulae are presented so that the researcher can calculate these estimators without the use of sophisticated statistical software.

Keywords: qualitative coding; qualitative methodology; reliability coefficients
statement made by the informant into a code. However, when the researcher decided that the statement best represented cleanliness and not academic performance, he or she also performed a measurement process. Therefore, if one accepts this line of reasoning, qualitative research depends upon measurement to render judgments. Furthermore, three questions may be asked. First, does statement X fit the definition of code Y? Second, how many of the statements collected fit the definition of code Y? And third, how reliable is the definition of code Y for differentiating between statements within and across researchers (i.e., intrarater and interrater reliability, respectively)?

Fortunately, not every qualitative researcher has accepted Stenbacka's notion, in part, because qualitative researchers, like quantitative researchers, compete for funding and therefore must persuade funders of the accuracy of their methods and results (Cheek, 2008). Consequently, the concepts of reliability and validity permeate qualitative research. However, owing to the desire to differentiate itself from quantitative research, qualitative researchers have espoused the use of interpretivist alternative terms (Seale, 1999). Some of the most popular terms substituted for reliability include confirmability, credibility, dependability, and replicability (Coryn, 2007; Golafshani, 2003; Healy & Perry, 2000; Morse, Barrett, Mayan, Olson, & Spiers, 2002; Miller, 2008; Lincoln & Guba, 1985).

In the qualitative tradition, confirmability is concerned with confirming that the researcher's interpretations and conclusions are grounded in actual data that can be verified (Jensen, 2008; Given & Saumure, 2008). Researchers may address this reliability indicator through the use of multiple coders, transparency, audit trails, and member checks. Credibility, on the other hand, is concerned with the research methodology and data sources used to establish a high degree of harmony between the raw data and the researcher's interpretations and conclusions. Various means can be used to enhance credibility, including accurately and richly describing data, citing negative cases, using multiple researchers to review and critique the analysis and findings, and conducting member checks (Given & Saumure, 2008; Jensen, 2008; Saumure & Given, 2008). Dependability recognizes that the most appropriate research design cannot be completely predicted a priori. Consequently, researchers may need to alter their research design to meet the realities of the research context in which they conduct the study, as compared to the context they predicted to exist a priori (Jensen, 2008). Dependability can be addressed by providing a rich description of the research procedures and instruments used so that other researchers may be able to collect data in similar ways. The idea is that if a different set of researchers use similar methods, then they should reach similar conclusions (Given & Saumure, 2008). Finally, replicability is concerned with repeating a study on participants from a similar background as the original study. Researchers may address this reliability indicator by conducting the new study on participants with similar demographic variables, asking similar questions, and coding data in a similar fashion to the original study (Firmin, 2008).

Like qualitative researchers, quantitative researchers have developed numerous definitions of reliability, including interrater and intrarater
Table 4
2 × 2 Contingency Table for the Kappa Statistic

                                       Coder 1
                            Theme present          Theme not present       Marginal row probabilities (pi.)
Coder 2  Theme present          c11                     c21                p1. = (c11 + c21) / N
         Theme not present      c12                     c22                p2. = (c12 + c22) / N
Marginal column
probabilities (p.j)         p.1 = (c11 + c12) / N   p.2 = (c21 + c22) / N  N = c11 + c21 + c12 + c22
\[
p_o = \sum_{i=1}^{c} p_{ii} = \frac{c_{11} + c_{22}}{N} \quad \text{and} \quad p_e = \sum_{i=1}^{c} p_{i.}\, p_{.i} = p_{1.}\,p_{.1} + p_{2.}\,p_{.2}.
\]

Estimates from professor interview participants for calculating the kappa statistic are provided in Table 5. The observed level of agreement for professors is (1 + 19)/556 = 0.0360. The expected level of agreement for professors is 0.0036(0.0144) + 0.0468(0.0360) = 0.0017.
Table 5
Estimates from Professor Interview Participants for Calculating the Kappa Statistic

                                       Coder 1
                            Theme present          Theme not present       Marginal row probabilities (pi.)
Coder 2  Theme present           1                       1                 p1. = 2/556 = 0.0036
         Theme not present       7                      19                 p2. = 26/556 = 0.0468
Marginal column
probabilities (p.j)         p.1 = 8/556 = 0.0144    p.2 = 20/556 = 0.0360   N = 28 + 528 = 556
Table 6
Estimates from Student Interview Participants for Calculating the Kappa Statistic

                                       Coder 1
                            Theme present           Theme not present       Marginal row probabilities (pi.)
Coder 2  Theme present          500                       1                 p1. = 501/556 = 0.9011
         Theme not present       2                       25                 p2. = 27/556 = 0.0486
Marginal column
probabilities (p.j)         p.1 = 502/556 = 0.9029   p.2 = 26/556 = 0.0468   N = 556
The total observed level of agreement for the professor and student interview groups is po = 0.0360 + 0.9442 = 0.9802. The total expected level of agreement for the professor and student interview groups is pe = 0.0017 + 0.8160 = 0.8177. For the professor and student groups, the kappa statistic equals κ = (0.9802 − 0.8177)/(1 − 0.8177) = 0.891. The level of agreement between the two coders is 0.891 beyond that which is expected purely by chance.

Weighted Kappa

The reliability coefficient, κW, has the same interpretation as the kappa statistic, κ, but the researcher can differentially weight each cell to reflect varying levels of importance. According to Cohen (1968), κW is "the proportion of weighted agreement corrected for chance, to be used when different kinds of disagreement are to be differentially weighted in the agreement index" (p. xx). As an example, the frequencies of coding patterns where both raters agree that a theme is present can be given a larger weight than patterns where both raters agree that a theme is not present. The same logic can be applied where the coders disagree on the presence of a theme in participant responses.

The weighted observed level of agreement (pow) equals the frequency of records where both coders agree that a theme is present times a weight, plus the frequency of records where both coders agree that a theme is not present times another weight, divided by the total number of ratings. The weighted expected level of agreement (pew) equals the summation of the cross product of the marginal probabilities, where each cell in the contingency table has its own weight. The weighted kappa statistic κW then equals (pow − pew)/(1 − pew). The traditional formulae for pow and pew are

\[
p_{ow} = \sum_{i=1}^{c}\sum_{j=1}^{c} w_{ij}\, p_{ij} \quad \text{and} \quad p_{ew} = \sum_{i=1}^{c}\sum_{j=1}^{c} w_{ij}\, p_{i.}\, p_{.j},
\]

where c denotes the total number of cells, i denotes the ith row, j denotes the jth column, and wij denotes the i,jth cell weight (Fleiss, Cohen, & Everitt, 1969; Everitt, 1968). These formulae are illustrated in Table 7.
Table 7
2 × 2 Contingency Table for the Weighted Kappa Statistic

                                       Coder 1
                            Theme present              Theme not present         Marginal row probabilities (pi.)
Coder 2  Theme present         w11c11                      w21c21               p1. = (w11c11 + w21c21) / N
         Theme not present     w12c12                      w22c22               p2. = (w12c12 + w22c22) / N
Marginal column
probabilities (p.j)        p.1 = (w11c11 + w12c12) / N  p.2 = (w21c21 + w22c22) / N  N = c11 + c21 + c12 + c22
In terms of the cell quantities in Table 7, the observed and expected levels of agreement are

p_o = \sum_{i=1}^{c} \sum_{j=1}^{c} p_{ij} = (w_{11} c_{11} + w_{22} c_{22}) / N   and   p_e = \sum_{i=1}^{c} \sum_{j=1}^{c} w_{ij} p_{i.} p_{.j}.

Karlin, Cameron, and Williams (1981) provided three methods for weighting probabilities as applied to the calculation of a kappa statistic. The first method equally weights each pair of observations. This weight is calculated as w_i = n_i / N, where n_i is the sample size of each cell and N is the sum of the sample sizes from all cells of the contingency table. The second method equally weights each group (e.g., undergraduate students and professors) irrespective of its size. These weights can be calculated as w_i = 1 / (k n_i (n_i - 1)), where k is the number of groups (e.g., k = 2). The last method weights each cell according to the sample size in each cell. The formula for this weighting option is w_i = 1 / (N (n_i - 1)).

There is no single standard for applying probability weights to each cell in a contingency table. For this study, the probability weights used are provided in Table 8. In the first row and first column, the probability weight is 0.80. This weight was chosen arbitrarily to reflect the overall level of importance of the agreement of a theme being present as identified by both coders. In the second row and first column, the probability weight is 0.10. In the first row and second column, the probability weight is 0.09. These two weights were used to reduce the impact of differing levels of experience in qualitative research between the two raters. In the second row and second column, the probability weight is 0.01. This weight was employed to reduce the effect of the lack of existence of a theme from the interview data.
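As an illustration of the three weighting options described above, the following is a minimal Python sketch; the function name karlin_weights is an assumption for the example, and the group sizes used are the professor (n = 28) and student (n = 528) interview groups reported in Tables 9 and 10.

    # The three weighting options described above (Karlin, Cameron, & Williams, 1981),
    # evaluated for a two-group design: professors (n = 28) and students (n = 528).

    def karlin_weights(group_sizes, k):
        N = sum(group_sizes)
        pair_weights = [n / N for n in group_sizes]                   # w_i = n_i / N
        group_weights = [1 / (k * n * (n - 1)) for n in group_sizes]  # w_i = 1 / (k n_i (n_i - 1))
        cell_weights = [1 / (N * (n - 1)) for n in group_sizes]       # w_i = 1 / (N (n_i - 1))
        return pair_weights, group_weights, cell_weights

    print(karlin_weights([28, 528], k=2))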
Table 8
Probability Weights on Binomial Coder Agreement Patterns for Professor and Student Interview Participants

                                              Coder 1
                                Theme Present (j1)    Theme Not Present (j2)
Coder 2  Theme Present (i1)          0.80                    0.09
         Theme Not Present (i2)      0.10                    0.01

Estimates from professor interview participants for calculating the weighted kappa statistic are provided in Table 9. The observed level of agreement for professors is [0.8(1) + 0.01(19)]/556 = 0.0018. The expected level of agreement for professors is 0.0016(0.0027) + 0.0016(0.0005) = 0.00001.
Table 9
Estimates from Professor Interview Participants for Calculating the Weighted Kappa Statistic

                                             Coder 1
                               Theme present       Theme not present    Marginal Row Probabilities (pi.)
Coder 2  Theme present         0.8(1) = 0.8        0.09(1) = 0.09       p1. = 0.89/556 = 0.0016
         Theme not present     0.1(7) = 0.7        0.01(19) = 0.19      p2. = 0.89/556 = 0.0016
Marginal Column
Probabilities (p.j)            p.1 = 1.5/556 = 0.0027    p.2 = 0.28/556 = 0.0005    N = 28 + 528 = 556
Table 10
Estimates from Student Interview Participants for Calculating the Weighted Kappa Statistic

                                             Coder 1
                               Theme present        Theme not present    Marginal Row Probabilities (pi.)
Coder 2  Theme present         0.8(500) = 400       0.09(1) = 0.09       p1. = 400.09/556 = 0.7196
         Theme not present     0.1(2) = 0.2         0.01(25) = 0.25      p2. = 0.45/556 = 0.0008
Marginal Column
Probabilities (p.j)            p.1 = 400.2/556 = 0.7198    p.2 = 0.34/556 = 0.0006    N = 28 + 528 = 556
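The arithmetic in Tables 8 through 10 can be checked with a short Python sketch that mirrors the worked computation (weighted diagonal cells for the observed agreement, products of the weighted marginals for the expected agreement); up to rounding, it reproduces the totals reported in the paragraph that follows.

    # Mirrors the worked computation in Tables 8-10: weighted diagonal cells give the
    # observed agreement; products of the weighted marginals give the expected agreement.
    # Cell counts: [[both present, present/not present], [not present/present, both not present]].

    weights = [[0.80, 0.09], [0.10, 0.01]]           # Table 8
    counts = {"professors": [[1, 1], [7, 19]],       # Table 9
              "students":   [[500, 1], [2, 25]]}     # Table 10
    N = 556

    pow_total, pew_total = 0.0, 0.0
    for cells in counts.values():
        w_cells = [[weights[i][j] * cells[i][j] for j in range(2)] for i in range(2)]
        pow_total += (w_cells[0][0] + w_cells[1][1]) / N
        row = [sum(w_cells[0]) / N, sum(w_cells[1]) / N]             # weighted p_i.
        col = [(w_cells[0][0] + w_cells[1][0]) / N,
               (w_cells[0][1] + w_cells[1][1]) / N]                  # weighted p_.j
        pew_total += row[0] * col[0] + row[1] * col[1]

    kappa_w = (pow_total - pew_total) / (1 - pew_total)
    print(round(pow_total, 4), round(pew_total, 4), round(kappa_w, 3))
    # prints roughly 0.7217 0.518 0.423 with these inputs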
The total observed level of agreement for the professor and student interview groups is pow = 0.0018 + 0.7199 = 0.7217. The total expected level of agreement for the professor and student interview groups is pew = 0.00001 + 0.5180 = 0.5181. For the professor and student interview groups, the weighted kappa statistic equals κW = (0.7217 - 0.5181)/(1 - 0.5181) = 0.423. The level of agreement between the two coders is 0.423 beyond that which is expected purely by chance after applying importance weights to each cell. This reliability statistic is notably smaller than the unadjusted kappa statistic because of the number of down-weighted cases where both coders agreed that the theme is not present in the interview responses.

ANOVA Binary ICC

From the writings of Shrout and Fleiss (1979), the currently available ANOVA Binary ICC that is appropriate for the current data set is based on what they refer to as ICC(3,1). More specifically, this version of the ICC is based on within mean squares and between mean squares for two or more coding groups/categories from an analysis of variance model modified for binary response variables by Elston (1977). This reliability statistic measures the consistency of the two ratings (Shrout and Fleiss, 1979) and is appropriate when two or more raters rate the same interview participants on some item of interest. ICC(3,1) assumes that the raters are fixed; that is, the same raters are utilized to code multiple sets of data. The statistic ICC(2,1), which assumes the coders are randomly selected from a larger population of raters (Shrout and Fleiss, 1979), is recommended for use but is not currently available for binomial response data.

The traditional formulae for these within and between mean squares, along with an adjusted sample size estimate, are provided in Table 11. In these formulae, k denotes the total number of groups or categories, Yi denotes the frequency of agreements (both coders indicate a theme is present, or both coders indicate a theme is not present) between coders for the ith group or category, ni is the total sample size for the ith group or category, and N is the total sample size across all groups or categories. Using these estimates, the reliability estimate equals

\rho_{AOV} = \frac{MS_B - MS_W}{MS_B + (n_0 - 1) MS_W}

(Elston, Hill, & Smith, 1977; Ridout, Demetrio, & Firth, 1999). Estimates from professor and student interview participants for calculating the ANOVA Binary ICC are provided in Table 11. Given that k = 2 and N = 556, the adjusted sample size equals 54.5827. The within and between mean squares equal 0.0157 and 2.0854, respectively. Using these estimates, the ANOVA binary ICC equals

\frac{MS_B - MS_W}{MS_B + (n_0 - 1) MS_W} = \frac{2.0854 - 0.0157}{2.0854 + (54.5827 - 1)(0.0157)} = 0.714,

which denotes the consistency of coding between the two coders on the professor and student interview responses.
Table 11
Formulae and Estimates from Professor and Student Interview Participants for Calculating the ANOVA Binary ICC

Mean Squares Within:
  MS_W = \frac{1}{N - k} \left[ \sum_{i=1}^{k} Y_i - \sum_{i=1}^{k} \frac{Y_i^2}{n_i} \right] = \frac{1}{556 - 2} [545 - 536.303] = 0.0157

Mean Squares Between:
  MS_B = \frac{1}{k - 1} \left[ \sum_{i=1}^{k} \frac{Y_i^2}{n_i} - \frac{1}{N} \left( \sum_{i=1}^{k} Y_i \right)^2 \right] = \frac{1}{2 - 1} \left[ 536.303 - \frac{545^2}{556} \right] = 2.0854

Adjusted Sample Size:
  n_0 = \frac{1}{k - 1} \left[ N - \frac{1}{N} \sum_{i=1}^{k} n_i^2 \right] = \frac{1}{2 - 1} \left[ 556 - \frac{528^2 + 28^2}{556} \right] = 54.5827

Note: Yi denotes the total number of cases where both coders indicate that a theme either is or is not present in a given response.
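The computation in Table 11 can be reproduced with a minimal Python sketch; the agreement counts Y_i = (20, 525) are read off Tables 9 and 10 (1 + 19 professor agreements and 500 + 25 student agreements). The adjusted sample size and ICC that the sketch prints come out close to, but not identical with, the values reported above.

    # ANOVA binary ICC following the formulae in Table 11.
    # Y[i]: agreements (both coders present or both absent) in group i; n[i]: group size.

    Y = [20, 525]          # professors: 1 + 19; students: 500 + 25 (Tables 9 and 10)
    n = [28, 528]
    k, N = len(Y), sum(n)

    sum_Y = sum(Y)
    sum_Y2_over_n = sum(y * y / m for y, m in zip(Y, n))

    ms_w = (sum_Y - sum_Y2_over_n) / (N - k)
    ms_b = (sum_Y2_over_n - sum_Y ** 2 / N) / (k - 1)
    n0 = (N - sum(m * m for m in n) / N) / (k - 1)

    icc = (ms_b - ms_w) / (ms_b + (n0 - 1) * ms_w)
    print(round(ms_w, 4), round(ms_b, 4), round(n0, 2), round(icc, 3))
    # prints 0.0157 2.0851 53.18 0.713 with these inputs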
Kuder-Richardson 20

In their landmark article, Kuder and Richardson (1937) presented the derivation of the KR-20 statistic, a coefficient that they used to determine the reliability of test items. This estimator is a function of the sample size, the summation of the item variances, and the total variance. Two observations about these formulae require further inquiry. First, these authors do not appear to discuss the distributional requirements of the data in relation to the calculation of the correlation r_{ii}, possibly due to its time of development in relation to the infancy of mathematical statistics. This vagueness has led to some incorrect calculations of the KR-20. Crocker and Algina (1986) present examples of the calculation of the KR-20 in their Table 7.2, based on data from their Table 7.1 (pp. 136-140). In Table 7.1, the correlation on the two split-halves is presented as \rho_{AB} = 0.34. It is not indicated that this statistic is the Pearson correlation. This is problematic because this statistic assumes that the two random variables are continuous, when in actuality they are discrete. An appropriate statistic is Kendall's tau-c, and this correlation equals 0.35. As can be seen, the correlation, as well as the KR-20, may be notably underestimated if the incorrect distribution is assumed. For the remainder of this paper, the Pearson correlation will be substituted with the Kendall tau-c correlation.

Second, Kuder and Richardson (1937) present formulae for the calculation of \sigma_t^2 and r_{ii} that are not mutually exclusive. This lack of exclusiveness has caused some confusion in appropriate calculations of the total variance \sigma_t^2. Lord and Novick (1968) indicated that this statistic is equal to coefficient alpha (continuous) under certain circumstances, and Crocker and Algina (1986) elaborated on this statement by indicating "This formula is identical to coefficient alpha with the substitution of p_i q_i for \sigma_i^2" (p. 139). This is unfortunately incomplete. Not only must this substitution be made for the numerator's variances, the denominator variances must also be adjusted in the same manner. That is, if the underlying distribution of the data is binomial, all estimators should be based on the appropriate level of measurement. The KR-20 will be computed using the formula

\frac{N}{N - 1} \left[ 1 - \frac{1}{\sigma_T^2} \sum_{i=1}^{k} \frac{Y_i}{n_i} \left( 1 - \frac{Y_i}{n_i} \right) \right],

where k denotes the total number of groups or categories, Yi denotes the number of agreements between coders for the ith group or category, ni is the total sample size for the ith group or category, and N is the total sample size across all groups or categories (Lord & Novick, 1968). The total variance (\sigma_T^2) for coder agreement patterns equals the summation of the elements in a variance-covariance matrix for binomial data, that is,

\sigma_T^2 = \sigma_1^2 + \sigma_2^2 + 2COV(X_1, X_2) = \sigma_1^2 + \sigma_2^2 + 2\rho_{12}\sigma_1\sigma_2

(Stapleton, 1995). The variance-covariance matrix takes the general form

\Sigma = \begin{bmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 & \cdots & \rho_{1j}\sigma_1\sigma_j \\ \rho_{21}\sigma_2\sigma_1 & \sigma_2^2 & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{ij}\sigma_i\sigma_j & \cdots & \cdots & \sigma_n^2 \end{bmatrix}

(Kim & Timm, 2007), and reduces to

\Sigma = \begin{bmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 \\ \rho_{21}\sigma_2\sigma_1 & \sigma_2^2 \end{bmatrix}

for a coding scheme comprised of two raters. In this matrix, the variances (\sigma_1^2, \sigma_2^2) of agreement for the ith group or category
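A minimal Python sketch of this binomial KR-20 follows, assuming the total variance has already been obtained from the variance-covariance matrix described above; the function name and the sigma_T2 value are placeholders for illustration, not estimates from the study (the study's estimates appear in Table 12).

    # KR-20 for binomial coder-agreement data, following the formula above:
    # KR20 = (N / (N - 1)) * (1 - sum_i (Y_i/n_i)(1 - Y_i/n_i) / sigma_T^2).
    # sigma_T2 is a placeholder; in the study it comes from the variance-covariance
    # matrix of the two coders' binary ratings.

    def kr20_binomial(Y, n, sigma_T2):
        N = sum(n)
        item_var = sum((y / m) * (1 - y / m) for y, m in zip(Y, n))
        return (N / (N - 1)) * (1 - item_var / sigma_T2)

    print(round(kr20_binomial(Y=[20, 525], n=[28, 528], sigma_T2=0.5), 3))  # illustrative only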
Table 12
Estimates from Professor and Student Interview Participants for Calculating the KR 20
all ads would fall into the n11, n22, or n33 cells, with zeros off the main diagonal.

                         Rater 2
Rater 1      emotional   rational   mixed   row sums
emotional       n11         n12       n13      n1+
rational        n21         n22       n23      n2+

than proportions and an equation for an approximate standard error for the index.

Researchers have criticized the kappa index for some of its properties and proposed extensions (e.g., Brennan & Prediger, 1981; Fleiss, 1971; Hubert, 1977; Kaye, 1980; Kraemer, 1980; Tanner & Young, 1985). To be fair, Cohen (1960, p. 42) anticipated some of these qualities (e.g., that the upper bound for kappa can be less than 1.0, depending on the marginal distributions), and so he provided an equation to de-

two raters). Furthermore, it appears sufficiently straightforward that one could compute the index without a mathematically induced coronary.
REFERENCES

Brennan, Robert L., & Prediger, Dale J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41, 687-699.
Cohen, Jacob. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Cooil, Bruce, & Rust, Roland T. (1994). Reliability and expected loss: A unifying principle. Psychometrika, 59, 203-216.
Cooil, Bruce, & Rust, Roland T. (1995). General estimators for the reliability of qualitative data. Psychometrika, 60, 199-220.
Cronbach, Lee J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Cronbach, Lee J., Gleser, Goldine C., Nanda, Harinder, & Rajaratnam, Nageswari. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Fleiss, Joseph L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378-382.
Hubert, Lawrence. (1977). Kappa revisited. Psychological Bulletin, 84, 289-297.
Hughes, Marie Adele, & Garrett, Dennis E. (1990). Intercoder reliability estimation approaches in marketing: A generalizability theory framework for quantitative data. Journal of Marketing Research, 27, 185-195.
Kaye, Kenneth. (1980). Estimating false alarms and missed events from interobserver agreement: A rationale. Psychological Bulletin, 88, 458-468.
Kolbe, Richard H., & Burnett, Melissa S. (1991). Content-analysis research: An examination of applications with directives for improving research reliability and objectivity. Journal of Consumer Research, 18, 243-250.
Kraemer, Helena Chmura. (1980). Extension of the kappa coefficient. Biometrics, 36, 207-216.
Krippendorff, Klaus. (1980). Content analysis: An introduction to its methodology. Newbury Park, CA: Sage.
Perreault, William D., Jr., & Leigh, Laurence E. (1989). Reliability of nominal data based on qualitative judgments. Journal of Marketing Research, 26, 135-148.
Rust, Roland T., & Cooil, Bruce. (1994). Reliability measures for qualitative data: Theory and implications. Journal of Marketing Research, 31, 1-14.
Tanner, Martin A., & Young, Michael A. (1985). Modeling agreement among raters. Journal of the American Statistical Association, 80(389), 175-180.
Ubersax, John S. (1988). Validity inferences from inter-observer agreement. Psychological Bulletin, 104, 405-416.
Literature Review of Inter-rater Reliability
Inter-rater reliability, simply defined, is the extent to which information is being collected in a consistent manner (Keyton et al., 2004). That is, are the information-collecting mechanism and the procedures being used to collect the information solid enough that the same results can repeatedly be obtained? This should not be left to chance, either. Having a good measure of inter-rater reliability allows researchers to state with confidence that they can trust the information they have collected; it provides proof that the similar answers collected reflect more than simple chance (Krippendorf, 2004a).
Inter-rater reliability also alerts project managers to problems that may occur in the research process (Capwell, 1997; Keyton et al., 2004; Krippendorf, 2004a, b), such as facilitators rushing the process, mistakes on the part of those recording answers, coder fatigue, or the presence of a rogue coder, all examined in a later section of this review.
If closed data was not collected for the survey/interview, then the data will have to be coded before it is analyzed for inter-rater reliability. Even if closed data was collected, coding may still be important because in many cases closed-ended data has a form that requires answers to be placed into yes or no paradigms as a simple data coding step. With two collections of birthdates, for example, it can be determined whether the two data collections netted the same result. If so, then YES can be recorded for each respective survey. If not, then YES
should be recorded for one survey and NO for the other (do not enter NO for both, as that
would indicate agreement). While placing qualitative data into a YES/NO dichotomy could be a working method for the information collected in the ConQIR Consortium, given the high likelihood that interview data will match, the forced categorical separation is not considered to be the best available practice and could prove faulty in accepting or rejecting hypotheses (or for applying analyzed data toward other functions). It should, however, be sufficient for evaluating whether reliable survey data are being obtained for agency use. For best results, the survey design should be created with reliability checks in mind, employing either a YES/NO choice option (this is different from what is reviewed above: a YES/NO option would include questions like "Were you born before July 13, 1979?" where the participant would have to answer yes or no) or a Likert-scale type mechanism. See the Interview/Re-Interview Design literature review for more details.
How to compute inter-rater reliability
In the case of qualitative studies, where survey or interview questions are open-ended, some sort of coding scheme will need to be put into place before using this formula (Friedman et al., 2003; Keyton et al., 2004). For closed-ended surveys or interviews where participants are forced to choose one option, the collected data are immediately ready for inter-rater checks (although quantitative checks often produce lower reliability scores, especially when a Likert scale is used) (Friedman et al., 2003). Whether the question data are collected using a Likert scale, a series of options, or yes/no answers, follow these steps to determine Cohen's kappa (1960), a statistical measure of inter-rater reliability:
1. Create a contingency table. This means you will create a table that demonstrates, essentially, how many of the answers agreed and how many answers disagreed (and how much they disagreed, even). For example, suppose two different interviewers recorded the following yes/no answers to the same ten questions:

Question Number   1  2  3  4  5  6  7  8  9  10
Interviewer #1    Y  N  Y  N  Y  Y  Y  Y  Y  Y
Interviewer #2    Y  N  Y  N  Y  Y  Y  N  Y  N

From this data, a contingency table would be created:

                        Interviewer #1
                        YES    NO
Interviewer #2   YES     6      0
                 NO      2      2
Notice that the number six (6) is entered in the first column because when looking at the answers there were six times when both interviewers found a YES answer to the same question. Accordingly, they are placed where the two YES answers overlap in the table (with the YES going across the top of the table representing Rater/Interviewer #1 and the YES going down the left side of the table representing Rater/Interviewer #2). A zero (0) is entered in the second column of the first row because for that particular intersection in the table there were no instances where Interviewer/Rater #1 found a NO answer while Interviewer/Rater #2 found a YES. The number two (2) is entered in the first column of the second row since Interviewer/Rater #1 found a YES answer two times when Interviewer/Rater #2 found a NO; and a two (2) is entered in the second column of the second row since both interviewers found a NO answer two times.

NOTE: It is important to consider that the above table is for a YES/NO type survey. If a different number of answers is available for the questions in a survey, then the number of answers should be taken into consideration in creating the table. For instance, if a five-point Likert scale were used in a survey/interview, then the table would have five rows and five columns (and all answers would be placed into the table accordingly).
2. Sum the row and column totals for the items. To find the sum for the first
row in the previous example, the number six would be added to the number
zero for a first row total of six. The number two would be added to the
number two for a second row total of four. Then the columns would be added.
The first column would find six being added to two for a total of eight; and the
second column would find zero being added to two for a total of two.
3. Add the respective sums from step two together. For the running example,
six (first row total) would be added to four (second row total) for a row total
of ten (10). Eight (first column total) would be added to two (second column
total) for a column total of ten (10). At this point, it can be determined
whether the data has been entered and computed correctly by whether or not
the row total matches the column total. In the case of this example, it can be
seen that the data seems to be in order since both the row and column total
equal ten.
4. Add all of the agreement cells from the contingency table together. In the running example, this would lead to six being added to two for a total of eight, because there were six times where the YES answers matched from both interviewers/raters (as designated by the first column in the first row) and two times where the NO answers matched (as designated by the second column in the second row). The sum of agreement, then, and the answer to this step, would be eight (8). The agreement cells will always appear in a diagonal pattern across the chart so, for instance, if participants had five possibilities for answers then there should be five cells going across and down the chart in a diagonal pattern that will be added.
NOTE: At this point simple agreement can be computed by dividing the answer in step four by the total found in step three. In the case of this example, that would lead to eight being divided by ten for a result of 0.8. This number would be rejected by many researchers, however, since it does not take into account the probability that some of these agreements in answers could have been by chance. That is why the rest of the steps must be followed.
5. Compute the expected frequency for each of the agreement cells appearing in the diagonal pattern going across the chart. To do this, find
the row total for the first agreement cell (row one column one) and multiply
that by the column total for the same cell. Divide this by the total number
possible for all answers (this is the row/column total from step three). So, for
this example, first the cell containing the number six would be located (since
it is the first agreement cell located in row one column one) and the column
and row totals would be multiplied by each other (these were found in step
two) and then divided by the total: 6 x 8 = 48; 48/10 = 4.8. The next diagonal cell (one over to the right and one down) is the next to be computed: 2 x 4 = 8; 8/10 = 0.8. Since this is the final cell in the diagonal, this is the final
computation that needs to be made in this step for the sample problem;
however, if more answers were possible, then the step would be repeated as
many times as there are answers. For a five answer likert scale, for instance,
the process would be repeated for five agreement cells going across the chart.
6. Add all of the expected frequencies found in step five together. For the running example used in this literature review, that would be 4.8 + 0.8 for a sum of 5.6. For a five-answer Likert scale, all five of the totals found in step five would be added together.
7. Compute kappa. To do this, take the answer from step four and subtract the answer from step six. Place the result of that computation aside. Then take the total number of items from the survey/interview and subtract the answer from step six. After this has been completed, take the first computation from this step (the one that was set aside) and divide it by the second computation from this step. The resulting computation represents kappa. For the running example that has been provided in this literature review, it would look like this: (8 - 5.6) / (10 - 5.6) = 2.4 / 4.4 = 0.545.
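The same arithmetic can be scripted; the following is a minimal Python sketch (variable names are illustrative) that builds the contingency table from the two interviewers' answers above and reproduces the simple agreement of 0.8 and the kappa of approximately 0.545.

    # Cohen's kappa for the worked example: two interviewers' yes/no answers.
    from collections import Counter

    interviewer_1 = list("YNYNYYYYYY")
    interviewer_2 = list("YNYNYYYNYN")

    n = len(interviewer_1)
    table = Counter(zip(interviewer_1, interviewer_2))               # contingency table (step 1)

    observed = sum(v for (a, b), v in table.items() if a == b) / n   # steps 2-4: 8/10
    categories = sorted(set(interviewer_1) | set(interviewer_2))
    expected = sum((interviewer_1.count(c) / n) * (interviewer_2.count(c) / n)
                   for c in categories)                              # steps 5-6: 5.6/10
    kappa = (observed - expected) / (1 - expected)                   # step 7
    print(round(observed, 2), round(expected, 2), round(kappa, 3))   # 0.8 0.56 0.545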
If the resulting kappa falls below an acceptable level (such as 0.7), then it is often recommended that the data be thrown out (Krippendorf, 2004a). In cases such as these, it is often wise to administer an additional data collection so a third set of information can be compared to the other collected data (and calculated against both in order to determine if an acceptable inter-rater reliability level has been achieved with either of the previous data collecting attempts). If many cases of inter-rater issues are occurring, then the data from these cases can often be observed in order to determine what the problem may be (Keyton et al., 2004). If data has been prepared for inter-rater checks from qualitative collection measures, for instance, the coding scheme used to prepare the data can be reviewed.
It may also be helpful to check with the person who coded the data to make sure
they understood the coding procedure (Keyton et al., 2004). This inquiry can also include
questions about whether they became fatigued during the coding process (often those
coding large sets of information tend to make more mistakes) and whether or not they
agree with the process selected for coding (Keyton, et al, 2004). In some cases a rogue
coder may be the culprit for failure to achieve inter-rater reliability (Neuendorf, 2002).
Rogue coders are coders who disapprove of the methods used for analyzing the data and
who assert their own coding paradigms. Facilitators of projects may also be to blame for
the low inter-rater reliability, especially if they have rushed the process (causing rushed
and hasty coding), required one individual to code a large amount of information (leading
to fatigue), or if the administrator has tampered with the data (Keyton, et al, 2004).
References
Cleveland State.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Friedman, P. G., Chidester, P. J., Kidd, M. A., Lewis, J. L., Manning, J. M., Morris, T.
M., Pilgram, M. D., Richards, K., Menzie, K., & Bell, J. (2003). Analysis of
FL.
Keyton, J., King, T., Mabachi, N. M., Manning, J., Leonard, L. L., & Schill, D. (2004).
Neuendorf, K. A. (2002). The content analysis guidebook. Thousand Oaks, CA: Sage.
The Qualitative Report Volume 10 Number 3 September 2005 439-462
http://www.nova.edu/ssss/QR/QR10-3/marques.pdf
Joan F. Marques
Woodbury University, Burbank, California
Chester McCall
Pepperdine University, Malibu, California
Introduction
This paper intends to serve as support for the assertion that interrater reliability
should not merely be limited to being a verification tool for quantitative research, but that
it should be applied as a solidification strategy in qualitative analysis as well. This should
be applied particularly in a phenomenological study, where the researcher is considered
the main instrument and where, for that reason, the elimination of bias may be more
difficult than in other study types.
A verification tool, as interrater reliability is often referred to in quantitative
studies, is generally perceived as a means of verifying coherence in the understanding of
a certain topic, while the term solidification strategy, as referred to in this case of a
qualitative study, reaches even further: Not just as a means of verifying coherence in
understanding, but at the same time a method of strengthening the findings of the entire
qualitative study. The following provides clarification of the distinction in using interrater
reliability as a verification tool in quantitative studies versus using this test as a
solidification tool in qualitative studies. Quantitative studies, which are traditionally
regarded as more scientifically based than qualitative studies, mainly apply interrater
reliability as a percentage-based agreement in findings that are usually fairly
straightforward in their interpretability. The interraters in a quantitative study are not
necessarily required to engage deeply into the material in order to obtain an
understanding of the study's findings for rating purposes. The findings are usually
obvious and require a brief review from the interraters in order to state their
interpretations. The entire process can be a very concise and insignificant one, easily
understandable among the interraters, due to the predominantly numerical-based nature
of the quantitative findings.
However, in a qualitative study the findings are usually not represented in plain
numbers. This type of study is regarded as less scientific and its findings are perceived in
a more imponderable light. Applying interrater reliability in such a study requires the
interraters to engage in attentive reading of the material, which then needs to be
interpreted, while at the same time the interraters are expected to display a similar or
basic understanding of the topic. The use of interrater reliability in these studies serves as more than just a verification tool, because qualitative studies are thus far not unanimously considered scientifically sophisticated. It is seen more as a solidification tool that can
contribute to the quality of these types of studies and the level of seriousness with which
they will be considered in the future. As explained earlier, the researcher is usually
considered the instrument in a qualitative study. By using interrater reliability as a
solidification tool, the interraters could become true validators of the findings of the
qualitative study, thereby elevating the level of believability and generalizability of the
outcomes of this type of study. As a clarification to the above, as the instrument in the
study the researcher can easily fall into the trap of having his or her bias influence the
study's findings. This may happen even though the study guidelines assume that he or
she will dispose of all preconceived opinions before immersing himself or herself into the
research. Hence, the act of involving independent interraters, who have no prior
connection with the study, in the analysis of the obtained data will provide substantiation
of the instrument and significantly reduce the chance of bias influencing the outcome.
Regarding the generalizability enhancement Myers (2000) asserts
Before immersing into specifics it might be appropriate to explain that there are
two main prerequisites considered when applying interrater reliability to qualitative
research: (1) The data to be reviewed by the interraters should only be a segment of the
total amount, since data in qualitative studies are usually rather substantial and interraters
usually only have limited time and (2) It needs to be understood that there may be
different configurations in the packaging of the themes, as listed by the various
interraters, so that the researcher will need to review the context in which these themes
are listed in order to determine their correspondence (Armstrong, Gosling, Weinman, &
Marteau, 1997). It may also be important to emphasize here that most definitions and
explanations about the use of interrater reliability to date are mainly applicable to the
quantitative field, which suggests that the application of this solidification strategy in the
qualitative area still needs significant review and subsequent formulation regarding its
possible applicability.
This paper will first explain the two main terms to be used, namely interrater
reliability and phenomenology, after which the application of interrater reliability in a
phenomenological study will be discussed. The phenomenological study that will be used
for analysis in this paper is one that was conducted to establish a broadly acceptable
definition of spirituality in the workplace. In this study the researcher interviewed six
selected participants in order to obtain a listing of the vital themes of spirituality in the
workplace. This process was executed as follows: First, the researcher formulated the
criteria, which each participant should meet. Subsequently, she identified the participants.
The six participants were selected through a snowball sampling process: Two participants
referred two other participants who each referred to yet another eligible person. The
researcher interviewed each participant in a similar way, using an interview protocol that
was validated on its content by two recognized authors on the research topic, Drs. Ian
Mitroff and Judi Neal.
Ian Mitroff is distinguished professor of business policy and founder of the USC
Center for Crisis Management at the Marshall School of Business, University of Southern
California, Los Angeles. (Ian I. Mitroff, 2005, 1). He has published over two hundred
and fifty articles and twenty-one books of which his most recent are Smart Thinking for
Difficult Times: The Art of Making Wise Decisions, A Spiritual Audit of Corporate
America, and Managing Crises Before They Happen (Ian I. Mitroff, 4).
Judi Neal is the founder of the Association for Spirit at Work and the author of
several books and numerous academic journal articles on spirituality in the workplace
(Association for Spirit at Work, 2005, 10-11). She has also established her authority in
the field of spirituality in the workplace in her position of executive director of The
Center for Spirit at Work at the University of New Haven, [] a membership
organization and clearinghouse that supports personal and organizational transformation
through coaching, education, research, speaking, and publications (School of Business at
the University of New Haven, 2005, 2).
Interrater Reliability
In the past several years interrater reliability has rarely been used as a verification
tool in qualitative studies. A variety of new criteria were introduced for the assurance of
credibility in these research types instead. According to Morse et al. (2002), this was
particularly the case in the United States. The main argument against using verification
tools with the stringency of interrater reliability in qualitative research has, so far, been
that expecting another researcher to have the same insights from a limited data base is
unrealistic (Armstrong et al., 1997, p. 598). Many of the researchers that oppose the use
of interrater reliability in qualitative analysis argue that it is practically impossible to
obtain consistency in qualitative data analysis because a qualitative account cannot be
held to represent the social world, rather it evokes it, which means, presumably, that
different researchers would offer different evocations (Armstrong et al., p. 598).
On the other hand, there are qualitative researchers who maintain that
responsibility for reliability and validity should be reclaimed in qualitative studies,
through the implementation of verification strategies that are integral and self-correcting
during the conduct of inquiry itself (Morse et al., 2002). These researchers claim that the
currently used verification tools for qualitative research are more of an evaluative (post
hoc) than of a constructive (during the process) nature (Morse et al.), which leaves room
for assumptions that qualitative research must therefore be unreliable and invalid,
lacking in rigor, and unscientific (Morse et al., p. 4). These investigators further explain
that post-hoc evaluation does little to identify the quality of [research] decisions, the
rationale behind those decisions, or the responsiveness and sensitivity of the investigator
to data (Morse et al., p. 7) and can therefore not be considered a verification strategy.
The above-mentioned researchers emphasize that the currently used post-hoc procedures
may very well evaluate rigor but do not ensure it (Morse et al.).
The concerns addressed by Morse et al. (2002) above about verification tools in
qualitative research being more of an evaluative nature (post hoc) than of a constructive
(during the process) nature can be omitted by utilizing interrater reliability as it was
applied to this study, which is, right after the initial attainment of themes by the
researcher yet before formulating conclusions based on the themes registered. This
method of verifying the studys findings represents a constructive way (during the
process) of measuring the consistency in the interpretation of the findings rather than an
evaluative (post hoc) way. It therefore avoids the problem of concluding insufficient
consistency in the interpretations after the study has been completed and it leaves room
for the researcher to further substantiate the study before it is too late. The substantiation
can happen in various ways. For instance, this might be done by seeking additional study
participants, adding their answers to the material to be reviewed, performing a new cycle
of phenomenological reduction, or resubmitting the package of text to the interraters for
another round of theme listing.
As suggested on the Colorado State University (CSU) website (1997) interrater
reliability should preferably be established outside of the context of the measurement in
your study. This source claims that interrater reliability should preferably be executed as
a side study or pilot study. The suggestion of executing interrater reliability as a side
study corresponds with the above-presented perspective from Morse et al. (2002) that
verification tools should not be executed post-hoc, but constructively during the
execution of the study. As stated before, the results from establishing interrater reliability
as a side study at a critical point during the execution of the main study (see
explanation above) will enable the researcher, in case of insufficient consistency between
the interraters, to perform some additional research in order to obtain greater consensus.
In the opinion of the researcher of this study, the second option suggested by CSU, using
interrater reliability as a pilot study, would mainly establish consistency in the
understandability of the instrument. In this case such would be the interview protocol to
be used in the research, since there would not be any findings to be evaluated at that time.
However, the researcher perceives no difference between this interpretation of interrater
reliability and the content validation here applied to the interview protocol by Mitroff and
Neal. The researcher further questions the value of such a measurement without the
additional review of study findings, or a part thereof. For this reason, the researcher
decided that interrater reliability in this qualitative study would deliver optimal value if
performed on critical parts of the study findings. This, then, is what was implemented in
the here reviewed case.
Phenomenology
Like all qualitative studies, the researcher who engages in the phenomenological
approach should realize that phenomenology is an influential and complex philosophic
tradition (Van Manen, 2002a, 1) as well as a human science method (Van Manen,
2002a, 2), which draws on many types and sources of meaning (Van Manen, 2002b,
1).
Creswell (1998) presents the procedure in a phenomenological study as follows:
1. The researcher begins [the study] with a full description of his or her own experience
of the phenomenon (p. 147).
2. The researcher then finds statements (in the interviews) about how individuals are
experiencing the topic, lists out these significant statements (horizonalization of the
data) and treats each statement as having equal worth, and works to develop a list of
nonrepetitive, nonoverlapping statements (p. 147).
3. These statements are then grouped into meaning units: the researcher lists these
units, and he or she writes a description of the textures (textural description) of the
experience - what happened - including verbatim examples (p. 150).
4. The researcher next reflects on his or her own description and uses imaginative
variation or structural description, seeking all possible meanings and divergent
perspectives, varying the frames of reference about the phenomenon, and constructing
a description of how the phenomenon was experienced (p. 150).
5. The researcher then constructs an overall description of the meaning and the essence
of the experience (p. 150).
6. This process is followed first for the researchers account of the experience and then
for that of each participant. After this, a composite description is written (p. 150).
[Figure: overview of the data analysis flow, from the horizonalization table through phenomenological reduction and meaning clusters (leadership-imposed and employee-imposed aspects) to precipitating factors, invariant themes, implications of findings, and recommendations for individuals and organizations.]
Epoche clears the way for a researcher to comprehend new insights into
human experience. A researcher experienced in phenomenological
processes becomes able to see data from new, naive perspective from
which fuller, richer, more authentic descriptions may be rendered.
Bracketing biases is stressed in qualitative research as a whole, but the
study of and mastery of epoche informs how the phenomenological
researcher engages in life itself. (p. 3)
the purpose it is used for. Isaac and Michael (1997) illuminate this by stating that there
are various ways of calculating interrater reliability, and that different levels of
determining the reliability coefficient take account of different sources of error (p. 134).
McMillan and Schumacher (2001) elaborate on the inconsistency issue by explaining that
researchers often ask how high a correlation should be for it to indicate satisfactory
reliability. McMillan and Schumacher conclude that this question is not answered easily.
According to them, it depends on the type of instrument (personality questionnaires
generally have lower reliability than achievement tests), the purpose of the study
(whether it is exploratory research or research that leads to important decisions), and
whether groups or individuals are affected by the results (since action affecting
individuals requires a higher correlation than action affecting groups).
Aside from the above presented statements about the divergence in opinions with
regards to the appropriate correlation coefficient to be used, as well as the proper
methods of applying interrater reliability, it is also a fact that most or all of these
discussions pertain to the quantitative field. This suggests that there is still intense review
and formulation needed in order to determine the applicability of interrater reliability in
qualitative analyses, and that every researcher that takes on the challenge of applying this
solidification strategy in his or her qualitative study will therefore be a pioneer.
The first step for the researcher of this phenomenological study was attempting to
find the appropriate degree of coherence that should exist in the establishment of
interrater reliability. It was the intention of the researcher to use a generally agreed upon
percentage, if existing, as a guideline in her study. However, after assessing multiple
electronic (online) and written sources regarding the application of interrater reliability in
various research disciplines, the researcher did not succeed in finding a consistent
percentage for use of this solidification strategy. Sources included Isaac and Michael's (1997) Handbook in Research and Evaluation, Tashakkori and Teddlie's (1998) Mixed Methodology, and McMillan and Schumacher's (2001) Research in Education; ProQuest's extensive article and paper database as well as its digital dissertations file; and other common search engines such as Google. Consequently, this researcher presented the following illustrations of the observed basic inconsistency in applying interrater reliability, as she perceived it throughout a variety of studies, which were not necessarily qualitative in nature.
1. Mott, Etsler, and Drumgold (2003) presented the following reasoning for their
interrater reliability findings in their study, Applying an Analytic Writing Rubric to
Children's Hypermedia Narratives.
2. Butler and Strayer (1998) assert the following in their online-presented research
document, administered by Stanford University and titled The Many Faces of
Empathy.
3. Srebnik, Uehara, Smukler, Russo, Comtois, and Snowden (2002) approach interrater
reliability in their study on Psychometric Properties and Utility of the Problem
Severity Summary for Adults with Serious Mental Illness as follows: Interrater
reliability: A priori, we interpreted the intraclass correlations in the following manner:
.60 or greater, strong; .40 to .59, moderate; and less than .40, weak (15).
Through multiple reviews of accepted reliability rates in various studies, this
researcher finally concluded that the acceptance rate for interrater reliability varies
between 50% and 90%, depending on the considerations mentioned above in the citation
of McMillan and Schumacher (2001). The researcher did not succeed in finding a fixed
percentage for interrater reliability in general and definitely not for phenomenological
research. She contacted the guiding committee of this study to agree upon a usable rate.
The researcher found that in the phenomenological studies she reviewed through the
Proquest digital dissertation database, interrater reliability had not been applied, although
she did find a master's thesis from Trinity Western University that briefly mentioned
the issue of using reliability in a phenomenological study by stating
Graham (2001) then states the percent agreement between researcher and the
student [the external judge] was 78 percent (p. 67). However, in the explanation
afterwards it becomes apparent that this percentage was not obtained by comparing the
findings from two independent judges aside from the researcher, but by comparing the
findings from the researcher to one external rater. Considering the fact that the researcher
in a phenomenological study always ends up with an abundance of themes on his or her
list (since he or she manages the entirety of the data, while the external rater only reviews
a limited part of the data), a score as high as 78% should not be difficult to obtain, depending on the calculation method (as will be demonstrated later in this paper).
The citation Graham used as a guideline in his thesis referred to the agreement between
two independent judges and not to the agreement between one independent judge and the
researcher.
The researcher of the here-discussed phenomenological study on spirituality in the
workplace also learned that the application of this solidification tool in qualitative studies
has been a subject of ongoing discussion (without resolution) in recent years, which may
explain the lack of information and consistent guidelines currently available.
The guiding committee for this particular research agreed upon an acceptable
interrater reliability of two thirds, or 66.7% at the time of the suggestion for applying this
solidification tool. The choice for 66.7% was based on the fact that, in this team, there
were quantitative as well as qualitative oriented authorities, who after thorough
discussion came to the conclusion that there were variable acceptable rates for interrater
reliability in use. The team also considered the nature of the study and the multi-
interpretability of the themes to be listed and subsequently decided the following: Given
the study type and the fact that the interraters would only review part of the data, it
should be understood that a correspondence percentage higher than 66.7% between two
external raters might be hard to attain. This correspondence percentage becomes even
harder to achieve if one considers that there might also be such a high number of themes
to be listed, even in the limited data provided, that one rater could list entirely different
themes than the other, without necessarily having a different understanding of the text.
The researcher subsequently performed the following measuring procedure:
1. The data gained for the purpose of this study were first transcribed and saved. This
was done by obtaining a listing of the vital themes applicable to a spiritual workplace
and consisted of interviews taken with a pre-validated interview protocol from 6
participants.
2. Since one of the essential procedures in phenomenology is to find common themes in
participants statements, the transcribed raw data were presented to two pre-identified
interraters. The interraters were both university professors and administrators, each
with an interest in spirituality in the workplace and, expectedly, with a fairly
compatible level of comprehensive ability. These individuals were approached by the
researcher and, after their approval for participation, separately visited for an
instructional session. During this session, the researcher handed each interrater a form
she had developed, in which the interrater could list the themes he found when
reviewing the 6 answers to each of the three selected questions. Each interrater was
thoroughly instructed with regards to the philosophy behind being an interrater, as
well as with regards to the vitality of detecting themes that were common (either
through direct wording or interpretative formulation by the 6 participants). The
interraters, although acquainted with each other, were not aware of each other's
assignment as an interrater. The researcher chose this option to guarantee maximal
individual interpretation and eliminate mutual influence. The interraters were thus
presented with the request to list all the common themes they could detect from the
answers to three particular interview questions. For this procedure, the researcher
made sure to select those questions that solicited a listing of words and phrases from
the participants. The reason for selecting these questions and their answers was to
provide the interraters with a fairly clear and obvious overview of possible themes to
choose from.
3. The interraters were asked to list the common themes per highlighted question on a
form that the researcher developed for this purpose and enclosed in the data package.
Each interrater thus had to produce three lists of common themes: one for each
highlighted topical question.
The highlighted questions in each of the six interviews were: (1) What are some
words that you consider to be crucial to a spiritual workplace? (2) If a worker was
operating at his or her highest level of spiritual awareness, what would he or she actually
do? and (3) If an organization is consciously attempting to nurture spirituality in the
workplace, what will be present? One reason for selecting these particular responses was
that the questions that preceded these answers asked for a listing of words from the
interviewees, which could easily be translated into themes. Another important reason was
that these were also the questions from which the researcher derived most of the themes
she listed. However, the researcher did not share any of the classifications she had
developed with the interraters, but had them list their themes individually instead in order
to be able to compare their findings with hers.
4. The purpose of having the interraters list these common themes was to distinguish the
level of coordinating interpretations between the findings of both interraters, as well
as the level of coordinating interpretations between the interraters' findings and those
of the researcher. The computation methods that the researcher applied in this study
will be explained further in this paper.
5. After the forms were filled out and received from the interraters, the researcher
compared their findings to each other and subsequently to her own. Interrater
reliability would be established, as recommended by the dissertation committee for
this particular study, if at least 66.7% (2/3) agreement was found between interraters
and between the interraters' and the researcher's findings. Since the researcher serves as the
main instrument in a phenomenological study, and even more because this researcher
first extracted themes from the entire interviews, her list was much more extensive
than those of the interraters who only reviewed answers to a selected number of
questions. It may therefore not be very surprising that there was 100% agreement
between the limited numbers of themes submitted by the interraters and the
abundance of themes found by the researcher. In other words, all themes of interrater
1 and all themes of interrater 2 were included in the theme-list of the researcher. It is
for this reason that the agreement between the researcher's findings and the interraters' findings was not used as a measuring scale in the determination of the
interrater reliability percentage.
A complication occurred when the researcher found that the interraters did not
return an equal number of common themes per question. This could happen because the researcher omitted setting a mandatory number of themes to be submitted. In other words, the researcher did not set a fixed number of themes for the interraters to come up with, but rather left it up to them to find as many themes as they considered vital in the text
provided. The reason for refraining from limiting the interraters to a predetermined
number of themes was because the researcher feared that a restriction could prompt
random choices by each interrater among a possible abundance of available themes,
ultimately leading to entirely divergent lists and an unrealistic conclusion of low or no
interrater reliability.
Interrater 1 (I1) submitted a total of 13 detected themes for the selected questions.
Interrater 2 (I2) submitted a total of 17 detected themes for the selected questions.
The researcher listed a total of 27 detected themes for the selected questions.
Between both interraters there were 10 common themes found. The agreement
was determined on two counts: (1) On the basis of exact listing, which was the case with
7 of these 10 themes and (2) on the basis of similar interpretability, such as giving to
others and contributing; encouraging and motivating; aesthetically pleasing
workplace; and beauty of which the latter was mentioned in the context of a nice
environment. The researcher color-coded the themes that corresponded with the two
interraters (yellow) and subsequently color-coded the additional themes that she shared
with either interrater (green for additional corresponding themes between the researcher
and interrater 1 and blue for additional corresponding themes between the researcher and
interrater 2). All of the corresponding themes between both interraters (the yellow
category) were also on the list of the researcher and therefore also colored yellow on her
list.
Before discussing the calculation methods reviewed by this researcher about
spirituality in the workplace, it may be useful to clarify that phenomenology is a very
divergent and complicated study type, entailing various sub-disciplines and oftentimes
described as the study of essences, including the essence of perception and of
consciousness (Scott, 2002, 1). In his presentation of Merleau-Ponty's Phenomenology of Perception, Scott explains, "Phenomenology is a method of describing the nature of our perceptual contact with the world. Phenomenology is concerned with providing a direct description of human experience" (1). This may clarify to the reader that the
phenomenological researcher is aware that reality is a subjective phenomenon,
interpretable in many different ways. Based on this conviction, this researcher did not
make any pre-judgments on the quality of the various calculation methods presented
below, but merely utilized them on the basis of their perceived applicability to this study
type.
The researcher came across various possible methods for calculating interrater reliability, which are described below.
Calculation Method 1
Table 1
Confusion Matrix 1

                             Interrater 1
                           Agree    Disagree
Interrater 2   Agree         a          b
               Disagree      c          d
1. The rate that these authors label as the accuracy rate (AC), named this way because
it measures the proportion of the total number of findings from Interrater 1 -- the one
with the lowest number of themes submitted -- that are accurate. In this case
Joan F. Marques and Chester McCall 454
AC = (a + d) / (a + b + c + d)
= (10 + 10) / (10 + 3 + 7 + 10)
= 20/30 = 66.7%
2. The rate these authors label as the true agreement rate: The title of this rate has
been modified by substituting the names of values applicable in this particular study.
The true agreement rate was named this way because it measures the proportion of
agreed upon themes (10) perceived from the entire number of submitted themes from
Interrater 1, the one with the lowest number of submissions (adopted from Hamilton
et al., 2003, 8, and modified toward the values used in this particular study), is
calculated as seen below.
TA = a / (a + b)
= 10 / (10 + 3)
= 10/13 = 76.9%
Table 2
Confusion Matrix 1 with the Study's Values

                             Interrater 1
                           Agree    Disagree    Totals
Interrater 2   Agree        10          3         13
               Disagree      7         10         17
               Totals       17         13         30
Calculation Method 2
Since the interraters did not submit an equal number of observations, as is general
practice in interrater reliability measures, the above-calculated rate of 66.7% can be
disputed. Although the researcher did not manage to find any written source to base the
following computation on, she considered it logical that in case of unequal submissions,
the lowest submitted number of findings from similar data by any of two or more
interraters used in a study should be used as the denominator in measuring the level of
agreement. Based on this observation, interrater reliability would be: (number of common themes) / (lowest number of submissions) x 100% = 10/13 x 100% = 76.9%.
Rationale for this calculation: if the numbers of submissions by both interraters
had varied even more, say 13 for interrater 1 versus 30 for interrater 2, interrater
reliability would be impossible to be established even if all the 13 themes submitted by
interrater 1 were also on the list of interrater 2. With the calculations as presented under
calculation method 1, the outcome would then be: (13 +13) / (30 + 13) = 26/43 = 60.5%,
whereby 13 would be the number of agreements and 43 the total number of observations.
This does not correspond at all with the logical conclusion that a total level of agreement
from one interraters list onto the other should equal 100%.
If, therefore, the rational justification of calculation method 2 is accepted, then
interrater reliability is 76.9%, which exceeds the minimally consented rate of 66.7%.
Expanding on this reasoning, further comparison leads to the following findings: All 13
listed themes from interrater 1 (13/13 x 100% = 100%) were on the researchers list and
16 of the 17 themes on interrater 2s list (16/17 X 100% = 94.1%) were also on the list of
the researcher. These calculations are based on calculation method 2.
The researcher thought it to be interesting that the percentage of 76.9 between
both interraters was also reached in the true agreement rate (TA) as presented earlier by
Hamilton et al. (2003).
Calculation Method 3
Elaborating on Hamilton et al.s (2003) true agreement rate (TA), which is the
proportion of corresponding themes identified between both interraters, it is calculated
using the equation: TA = a / (a+b), whereby a equals the amount of corresponding
themes between both interraters and b equals the amount of non-corresponding themes
as submitted by the interrater with the lowest number of themes. The researcher thought
it to be interesting to examine the calculated outcomes in the case that the names of the
two interraters would have been placed differently in the confusion matrix. When
exchanging the interraters places in the matrix the outcome of this rate turned out to be
different, since the value substituted for b now became that of the number of non-
corresponding themes, as submitted by the interrater with the highest number of themes.
In fact, the new computation led to an unfavorable, but also unrealistic, interrater reliability rate of 58.8%. The rate is unrealistic because, in the case of the above-mentioned substitution, the interrater reliability rate turns out extremely low as the submission numbers of the two interraters differ to an increasing degree. In such a case, it does not even matter
anymore whether the two interraters have full correspondence as far as the submissions
of the lowest submitter goes: The percentage of the interrater reliability, which is
supposed to reflect the common understanding of both interraters, will decrease to almost
zero.
To illustrate this assertion, the confusion matrix is presented in Table 3 with the
names of the interraters switched.
Table 3. Confusion Matrix With the Names of the Interraters Switched (cell values as used in the calculations below)
1. The rate that these authors label the accuracy rate (AC) remains the same:
AC = (a + d) / (a + b + c + d) = (10 + 10) / (10 + 3 + 7 + 10) = 20/30 = 66.7%
2. The true agreement rate (with the cell labels replaced by the values applicable in this study) now becomes:
TA = a / (a + b) = 10 / (10 + 7) = 10/17 = 58.8%
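The effect of exchanging the interraters' positions can also be retraced directly from the confusion-matrix cells. The brief Python sketch below is illustrative only; the cell values are those used in the calculations above (a = 10 common themes, d = 10 as in the accuracy-rate computation, and 13 - 10 = 3 or 17 - 10 = 7 non-corresponding themes, depending on which interrater occupies cell b). It shows that the accuracy rate is unaffected by the switch while the true agreement rate drops from 76.9% to 58.8%.

    def accuracy_rate(a, b, c, d):
        """AC = (a + d) / (a + b + c + d); symmetric in b and c."""
        return (a + d) / (a + b + c + d) * 100

    def true_agreement_rate(a, b):
        """TA = a / (a + b); b is the non-corresponding count of the
        interrater whose themes occupy cell b of the matrix."""
        return a / (a + b) * 100

    a, d = 10, 10          # a = common themes; d as used in the AC calculation above
    b_low, b_high = 3, 7   # non-corresponding themes of the lowest and highest submitter

    # Original placement: the lowest submitter defines b.
    print(round(accuracy_rate(a, b_low, b_high, d), 1))  # 66.7
    print(round(true_agreement_rate(a, b_low), 1))       # 76.9

    # Switched placement (Table 3): the highest submitter defines b.
    print(round(accuracy_rate(a, b_high, b_low, d), 1))  # 66.7
    print(round(true_agreement_rate(a, b_high), 1))      # 58.8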
In this study, the rationally computed TA presented a rate of 76.9%, which was higher than the
minimum requirement of 66.7% applied in both calculation methods 1 and 2. On the other hand, the
new true agreement rate presented here demonstrates that the less logical process of exchanging the
interraters' positions, so that the highest number of submissions is used as the denominator
instead of the lowest (see the first part of calculation method 3), delivered a percentage below
the minimum requirement. As a reminder to the reader, the example given under the rationale for
calculation method 2, in which the numbers of submissions diverge significantly (30 vs. 13), may
serve to illustrate the irrationality of using the highest number of submissions as the
denominator. It is the researcher's opinion that this newly suggested moderation of the
computation would lead to the following outcome for the true agreement reliability rate (TAR):
It was the researcher's conclusion that, whether the reader considers calculation method 1,
calculation method 2, or calculation method 3 the most appropriate one for this particular study,
all three methods demonstrated that there was sufficient common understanding between the
interraters.
Recommendations
The researcher of this study has found that, although interraters in a phenomenological study, and presumably in qualitative studies generally, can very well select themes with a similar understanding of the essentials in the data, there are three major attention points to address in order to enhance the success rate and swiftness of the process. The following recommendations address these points:

1. The data to be reviewed by the interraters should be only a segment of the total amount, since data in qualitative studies are usually rather substantial and interraters usually have only limited time.

2. The researcher will need to understand that different configurations are possible in the packaging of the themes listed by the various interraters, so that he or she will need to review the context in which these themes are listed in order to determine their correspondence (Armstrong et al., 1997). In this paper the researcher gave examples of themes that could be considered similar although they were packaged differently by the interraters, such as "giving to others" and "contributing"; "encouraging" and "motivating"; and "aesthetically pleasing workplace" and "beauty," the latter of which was mentioned in the context of a nice environment.

3. In order to obtain results with similar depth from all raters, the researcher should set standards for the number of observations to be listed by the interraters as well as the time allotted to them (see the sketch following this list). Because these confines were not specified to the interraters in this study, the level of input diverged: one interrater spent only two days listing the words and came up with a total of 13 themes, while the other spent approximately one week preparing his list and consequently came up with a more detailed list of 17 themes. Although the majority of themes were congruent (there were 10 common themes between the two lists), the calculation of interrater reliability was complicated by the unequal numbers of submissions, since the interrater reliability calculation methods assume equal numbers of submissions by the interraters. The officially recognized reliability rate of 66.7% for this study is therefore lower than it would have been if both interraters had been limited to a pre-specified number of themes. If, for example, both interraters had been required to select 15 themes within an equal time span of, say, one week, the puzzle regarding the use of either the lowest or the highest denominator would be resolved, because there would be only one denominator as well as an equal level of input from both interraters. If, in that case, the interraters came up with 12 common themes out of 15, the interrater reliability rate could easily be calculated as 12/15 = 80%. Even with only 10 common themes out of a total required submission of 15, the rate would still meet the minimum requirement: 10/15 = 66.7%. This may be valuable advice for future applications of this tool in qualitative studies.

4. The solicited number of submissions from the interraters should be set as high as possible, especially if there is a multiplicity of themes to choose from. If the solicited number is kept too low, two raters may have a perfectly similar understanding of the text and yet submit different themes, which may erroneously elicit the idea that there was not enough coherence in the raters' perceptions and, thus, insufficient interrater reliability.

5. The interraters should have at least a reasonable degree of similarity in intelligence, background, and interest level in the topic in order to ensure a decent degree of interpretative coherence. It would further be advisable to attune the educational and interest level of the interraters to the target group of the study, so that readers may encounter a greater level of recognition in the study topic as well as the findings.
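As a hypothetical illustration of these recommendations (the numbers follow the 15-theme example in point 3 above, and the function name is assumed), the Python sketch below shows that when both interraters submit the same pre-specified number of themes, calculation methods 1 and 2 and the true agreement rate all collapse to the same ratio of common themes to required themes, so the question of which denominator to use disappears.

    def rates_for_equal_submissions(common, required):
        """With equal submission counts, every denominator collapses to
        the required number of themes."""
        method_1 = 2 * common / (required + required) * 100   # agreements / total observations
        method_2 = common / min(required, required) * 100     # common / lowest submission count (both equal)
        true_agreement = common / (common + (required - common)) * 100  # TA = a / (a + b)
        return method_1, method_2, true_agreement

    # 12 common themes out of a required 15 for each interrater.
    print(rates_for_equal_submissions(12, 15))  # (80.0, 80.0, 80.0)

    # 10 common themes out of 15 still meets the 66.7% minimum.
    print(tuple(round(r, 1) for r in rates_for_equal_submissions(10, 15)))  # (66.7, 66.7, 66.7)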
Conclusion
An interesting lesson from this experience for the researcher was that the number of observations
to be listed by the interraters, as well as the time allotted to them, should preferably be kept
synchronous. At the same time, one might attempt to set the required number of submissions as high
as possible, because of the risk that interraters select too widely varied themes when many themes
are available. This may happen in spite of a perfect common understanding between the interraters
and may therefore wrongfully create the impression that there is not enough consistency in
comprehension between the raters and, thus, no interrater reliability. The justifications for this
argument are also presented in the recommendations section of this paper.
References
Armstrong, D., Gosling, A., Weinman, J., & Marteau, T. (1997). The place of inter-rater
reliability in qualitative research: An empirical study. Sociology, 31(3), 597-606.
Association for Spirit at Work (2005). The professional association for people involved
with spirituality in the workplace. Retrieved February 20, 2005, from
http://www.spiritatwork.com/aboutSAW/profile_JudiNeal.htm
Blodgett-McDeavitt, C. (1997, October). Meaning of participating in technology
training: A phenomenology. Paper presented at the meeting of the Midwest
Research-to-Practice Conference in Adult, Continuing and Community
Education, Michigan State University, East Lansing, MI. Retrieved January 25,
2003, from http://www.iupui.edu/~adulted/mwr2p/prior/blodgett.htm
Butler, E. A., & Strayer, J. (1998). The many faces of empathy. Poster presented at the
annual meeting of the Canadian Psychological Association, Edmonton, Alberta,
Canada.
Colorado State University. (1997). Interrater reliability. Retrieved April 8, 2003, from
http://writing.colostate.edu/guides/research/relval/com2a5.cfm
Creswell, J. (1998). Qualitative inquiry and research design: Choosing among five
traditions. Thousand Oaks, CA: Sage.
Dyre, B. (2003, May 6). Dr. Brian Dyre's pages. Retrieved November 12, 2003, from
http://129.101.156.107/brian/218%20Lecture%20Slides/L10%20research%20designs.pdf
A phenomenological study of quest-oriented religion. Retrieved September 5, 2004, from
http://www.twu.ca/cpsy/Documents/Theses/Matt%20Thesis.pdf
Hamilton, H., Gurak, E., Findlater, L., & Olive, W. (2003, February 7). The confusion
matrix. Retrieved November 16, 2003, from
http://www2.cs.uregina.ca/~hamilton/courses/831/notes/confusion_matrix/confusion_matrix.html
Isaac, S., & Michael, W. (1997). Handbook in research and evaluation (Vol. 3). San
Diego, CA: Edits.
McMillan, J., & Schumacher, S. (2001). Research in education (5th ed.). New York:
Longman.
Ian I. Mitroff. (2005). Retrieved February 20, 2005, from the University of Southern
California Marshall School of Business web site:
http://www.marshall.usc.edu/web/MOR.cfm?doc_id=3055
Morse, J. M., Barrett, M., Mayan, M., Olson, K., & Spiers, J. (2002). Verification
strategies for establishing reliability and validity in qualitative research.
International Journal of Qualitative Methods, 1(2), 1-19.
Mott, M. S., Etsler, C., & Drumgold, D. (2003). Applying an analytic writing rubric to
children's hypermedia "narratives". Early Childhood Research & Practice, 5(1).
Retrieved September 25, 2003, from http://ecrp.uiuc.edu/v5n1/mott.html
Myers, M. (2000, March). Qualitative research and the generalizability question:
Standing firm with Proteus. The Qualitative Report, 4(3/4). Retrieved March 10,
2005, from http://www.nova.edu/ssss/QR/QR4-3/myers.html
Posner, K. L., Sampson, P. D., Ward, R. J., & Cheney, F. W. (1990, September).
Measuring interrater reliability among multiple raters: An example of methods
for nominal data. Retrieved November 13, 2003, from
http://schatz.sju.edu/multivar/reliab/interrater.html
Richmond University. (n.d.). Interrater reliability. Retrieved November 13, 2003, from
http://www.richmond.edu/~pli/psy200_old/measure/interrater.html
School of Business at the University of New Haven. (2005). Judi Neal Associate
Professor. Retrieved February 20, 2005, from
http://www.newhaven.edu/faculty/neal/
Scott, A. (2002). Merleau-Ponty's phenomenology of perception. Retrieved September 5,
2004, from http://www.angelfire.com/md2/timewarp/merleauponty.html
Srebnik, D. S., Uehara, E., Smukler, M., Russo, J. E., Comtois, K. A., & Snowden, M.
(2002, August). Psychometric properties and utility of the problem severity
summary for adults with serious mental illness. Psychiatric Services, 53, 1010-
1017. Retrieved March 4, 2005, from
http://ps.psychiatryonline.org/cgi/content/full/53/8/1010
Tashakkori, A., & Teddlie, C. (1998). Mixed methodology (Vol. 46). Thousand Oaks,
CA: Sage.
Van Manen, M. (2002a). Phenomenological inquiry. Retrieved September 4, 2004, from
http://www.phenomenologyonline.com/inquiry/1.html
Van Manen, M. (2002b). Sources of meaning. Retrieved September 4, 2004, from
http://www.phenomenologyonline.com/inquiry/49.html
Appendix A
Interview Protocol
Time of interview:
Date:
Place:
Interviewer:
Interviewee:
Position of interviewee:
To the interviewee:
Thank you for participating in this study and for committing your time and effort.
I value the unique perspective and contribution that you will make to this study.
Questions
4. General structures that precipitate feelings and thoughts about the experience of
spirituality in the workplace.
4.1 What are some of the organizational reasons that could influence the transformation
from a workplace that does not consciously attempt to nurture spirituality and the human
spirit to one that does?
4.2 From the employee's perspective, what are some of the reasons to transform from a
worker who does not attempt to live and work with spiritual values and practices to one
who does?
5. Conclusion
Would you like to add, modify, or delete anything significant from the interview that
would give a better or fuller understanding toward the establishment of a broadly
acceptable definition of spirituality in the workplace?
Author Note
Joan Marques was born in Suriname, South America, where she made a career in
advertising, public relations, and program hosting. She founded and managed an
advertising and P.R. company as well as a foundation for women's awareness issues. In
1998 she immigrated to California and embarked upon a journey of continuing education
and inspiration. She holds a Bachelor's degree in Business Economics from M.O.C. in
Suriname, a Master's degree in Business Administration from Woodbury University, and
a Doctorate in Organizational Leadership from Pepperdine University. Her recently
completed dissertation was centered on the topic of spirituality in the workplace. Dr.
Marques is currently affiliated with Woodbury University as an instructor of Business &
Management. She has authored a wide variety of articles pertaining to workplace
contentment for audiences on different continents of the globe. Joan Marques, 712 Elliot
Drive # B, Burbank, CA 91504; E-mail: [email protected]; Telephone: (818)
845 3063
Chester H. McCall, Jr., Ph.D. entered Pepperdine University after 20 years of
consulting experience in such fields as education, health care, and urban transportation.
He has served as a consultant to the Research Division of the National Education
Association, several school districts, and several emergency health care programs,
providing survey research, systems evaluation, and analysis expertise. He is the author of
two introductory texts in statistics, more than 25 articles, and has served on the faculty of
The George Washington University. At Pepperdine, he teaches courses in data analysis,
research methods, and a comprehensive exam seminar, and also serves as chair for
numerous dissertations. Email: [email protected]
The place of inter-rater reliability in qualitative research: An empirical study
by David Armstrong, Ann Gosling, Josh Weinman, and Theresa Marteau
Sociology, August 1997, v31 n3 p597(10)
Assessing inter-rater reliability, whereby data are independently coded and the codings compared
for agreement, is a recognised process in quantitative research. However, its applicability to
qualitative research is less clear: should researchers be expected to identify the same codes or
themes in a transcript or should they be expected to produce different accounts? Some
qualitative researchers argue that assessing inter-rater reliability is an important method for
ensuring rigour, others that it is unimportant; and yet it has never been formally examined in an
empirical qualitative study. Accordingly, to explore the degree of inter-rater reliability that might
be expected, six researchers were asked to identify themes in the same focus group transcript.
The results showed close agreement on the basic themes but each analyst packaged the themes
differently.
COPYRIGHT 1997 British Sociological Association Publication Ltd. (BSA)

Reliability and validity are fundamental concerns of the quantitative researcher but seem to have an uncertain place in the repertoire of the qualitative methodologist. Indeed, for some researchers the problem has apparently disappeared: as Denzin and Lincoln have observed, "In general, qualitative methodologies do not make explicit use of the concept of inter-rater reliability to establish the consistency of findings from an analysis conducted by two or more researchers." However, the concept emerges implicitly in descriptions of procedures for carrying out the analysis of qualitative data. The frequent stress on an analysis being better conducted as a group activity suggests that results will be improved if one view is tempered by another. Waitzkin described meeting with two research assistants to discuss and negotiate agreements ...
Unusually for a literature that is so opaque about the importance of independent analyses of a single dataset, Mays and Pope explicitly use the term reliability and, moreover, claim that it is a significant criterion for assessing the value of a piece of qualitative research: "the analysis of qualitative data can be enhanced by organising an independent assessment of transcripts by additional skilled qualitative researchers and comparing agreement between the raters" (1995:110). This approach, they claim, was used by Daly et al. (1992) in a study of clinical encounters between cardiologists and their patients when the transcripts were analysed by the principal researcher and an independent panel, and the level of agreement assessed. However, ironically, the procedure described by Daly et al. was actually one of ascribing quantitative weights to pregiven variables which were then subjected to statistical analysis (1992:204).

A contrary position is taken by Morse who argues that the use of external raters is more suited to quantitative research; expecting another researcher to have the same insights from a limited data base is unrealistic: "No-one takes a second reader to the library to check that indeed he or she is interpreting the original sources correctly, so why does anyone need a reliability checker for his or her data?" (Morse 1994:231). This latter position is taken further by those so-called post-modernist qualitative researchers (Vidich and Lyman 1994) who would challenge the whole notion of consistency in analysing data. The researcher's analysis bears no direct correspondence with any underlying reality, and different researchers would be expected to offer different accounts as reality itself (if indeed it can be accessed) is characterised by multiplicity. For example, Tyler (1986) claims that a qualitative account cannot be held to represent the social world, rather it evokes it, which means, presumably, that different researchers would offer different evocations. Hammersley (1991), by contrast, argues that this position risks privileging the rhetorical over the scientific and argues that quality of argument and use of evidence should remain the arbiters of qualitative accounts; in other words, a place remains for some sort of correspondence between the description and reality that would allow a role for consistency. Presumably this latter position would be supported by most qualitative researchers, particularly those drawing inspiration from Glaser and Strauss's seminal text which claimed that the virtue of inductive processes was that they ensured that theory was closely related to the daily realities (what is actually going on) of substantive areas (1967:239).

In summary, the debates within qualitative methodology on the place of the traditional concept of reliability (and validity) remain confused. On the one hand are those researchers such as Mays and Pope who believe reliability should be a benchmark for judging qualitative research; and, more commonly, those who reject the term but allow the concept to creep into their work. On the other hand are those who adopt such a relativist position that issues of consistency are meaningless as all accounts have some validity whatever their claims. A theoretical resolution of these divergent positions is impossible as their core ontological assumptions are so different. Yet this still leaves a simple empirical question: do qualitative researchers actually show consistency in their accounts? The answer to this question may not resolve the methodological confusion but it may clarify the nature of the debate. If accounts do diverge then for the modernists there is a methodological problem and for the postmodernists a confirmation of diversity; if accounts are similar, the modernists' search for measures of consistency is reinforced and the postmodernists need to recognise that accounts do not necessarily recognise the multiple character of reality.

The purpose of the study was to see the extent to which researchers show consistency in their accounts and involved asking a number of qualitative researchers to identify themes in the same data set. These accounts were then themselves subjected to analysis to identify the degree of concordance between them.

Method

As part of a wider study of the relationship between perceptions of disability and genetic screening, a number of focus groups were organised. One of these focus groups consisted of adults with cystic fibrosis (CF), a genetic disorder affecting the secretory tissues of the body, particularly the lung. Not only might these adults with cystic fibrosis have particular views of disability, but theirs was a condition for which widespread genetic screening was being advocated. The aim of such a screening programme was to identify carriers of the gene so that their reproductive decisions might be influenced to prevent the birth of children with the disorder.

The focus group was invited to discuss the topic of genetic screening. The session was introduced with a brief summary of what screening techniques were currently available, and then discussion from the group on views of genetic screening was invited and facilitated. The ensuing discussion was tape recorded and transcribed. Six experienced qualitative investigators in Britain and the United States who had some interest in this area of work were approached and asked if they would analyse the transcript and prepare an independent report on it, identifying, and where possible rank ordering, the main themes emerging from the discussion (with a maximum of five themes). The analysts were offered a fee for this work.
The choice of method for examining the six reports was made on pragmatic grounds. One method, consistent with the general approach, would have been to ask a further six researchers to write reports on the degree of consistency that they perceived in the initial accounts. But then these accounts themselves would have needed yet further researchers to be recruited for another assessment, and so on. At some point a final judgement of consistency needed to be made, and it was thought that this could just as easily be made on the first set of reports. Accordingly, one of the authors (DA) scrutinised all six reports and deliberately did not read the original focus group transcript. The approach involved listing the themes that were identified by the six researchers and making judgements from the background justification whether or not there were similarities and differences between them.

Results

The focus group interview with the adults with cystic fibrosis was transcribed into a document 13,500 words long and sent to the six designated researchers. All six researchers returned reports. Five of the reports, as requested, described themes: four analysts identified five each and the other identified four. The sixth analyst returned a lengthy and discursive report that commented extensively on the dynamics of the focus group, but then discussed a number of more thematic issues. Although not explicitly described, five themes could be abstracted from this text.

In broad outline, the six analysts did identify similar themes, but there were significant differences in the way they were packaged. These differences can be illustrated by examining four different themes that the researchers identified in the transcript, namely, visibility, ignorance, health service provision, and genetic screening.

Visibility. All six analysts identified a similar constellation of themes around such issues as the relative invisibility of genetic disorders, people's ignorance, the eugenic debate, and health care choices. However, analysts frequently differed in the actual label they applied to the theme. For example, while "misperceptions of the disabled," "relative deprivation in relation to visibly disabled," and "images of disability" were worded differently, it was clear from the accompanying description that they all related to the same phenomenon, namely the fact that the general public were prepared to identify - and give consideration to - disability that was overt, whereas genetic disorders such as CF were more hidden and less likely to elicit a sympathetic response.

Further, although each theme was given a label, it was more than a simple descriptor; the theme was placed in a context that gave it coherence. At its simplest this can be illustrated by the way that the theme of the relative invisibility of genetic disorders as forms of disability was handled. All six analysts agreed that it was an important theme, and in those instances when the analysts attempted a ranking, most placed it first. For example, according to the third rater:

The visibility of the disability is the single most important element in its representation. [R3]

But while all analysts identified an invisibility theme, all also expressed it as a comparative phenomenon: traditional disability is visible while CF is invisible.

The stereotypes of the disabled person in the wheelchair; the contrast between visible, e.g. gross physical, and invisible, e.g. specific genetic, disabilities; and the special problems posed by the general invisibility of so many genetic disabilities. [R2]

In short, the theme was contextualised to make it coherent and give it meaning. Perhaps because the invisibility theme came with an implicit package of a contrast with traditional images of deviance, there was general agreement on the theme and its meaning across all the analysts. Even so, the theme of invisibility was also used by some analysts as a vehicle for other issues that they thought were related: a link with stigma was mentioned by two analysts; another pointed out the difficulty of managing invisibility by CF sufferers.

Ignorance. Whereas the theme of invisibility had a clear referent of visibility against which there could be general consensus, other themes offered fewer such natural backdrops. Thus, the theme of people's ignorance about genetic matters was picked up by five of the six analysts, but presented in different ways. Only one analyst expressed it as a basic theme, while others chose to link ignorance with other issues to make a broader theme. One linked it explicitly with the need for education.

The main attitudes expressed were of great concern at the low levels of public awareness and understanding of disability, and of great concern that more educational effort should be put into putting this right. [R2]

Three other analysts tied the population's ignorance to the eugenic threat. For example:

Ignorance and fear about genetic disorders and screening, and the future outcomes for society. The group saw the public as associating genetic technologies with Hitler, eugenics, and sex selection, and confusing minor gene ...