Selecting and Interpreting Measures of Thematic Classification Accuracy

Stephen V. Stehman

An error matrix is frequently employed to organize and display information used to assess the thematic accuracy of a land-cover map, and numerous accuracy measures have been proposed for summarizing the information contained in this error matrix. No one measure is universally best for all accuracy assessment objectives, and different accuracy measures may lead to conflicting conclusions because the measures do not represent accuracy in the same way. Choosing appropriate accuracy measures that address objectives of the mapping project is critical. Characteristics of some commonly used accuracy measures are described, and relationships among these measures are provided to aid the user in choosing an appropriate measure. Accuracy measures that are directly interpretable as probabilities of encountering certain types of misclassification errors or correct classifications should be selected in preference to measures not interpretable as such. User's and producer's accuracy and the overall proportion of area correctly classified are examples of accuracy measures possessing the desired probabilistic interpretation. The kappa coefficient of agreement does not possess such a probabilistic interpretation because of the adjustment for hypothetical chance agreement incorporated into this measure, and the strong dependence of kappa on the marginal proportions of the error matrix makes the utility of kappa for comparisons suspect. Normalizing an error matrix results in estimates that are not consistent for accuracy parameters of the map being assessed, so that this procedure is generally not warranted for most applications. ©Elsevier Science Inc., 1997

SUNY College of Environmental Science and Forestry, Syracuse, New York

Address correspondence to S. V. Stehman, SUNY Coll. of Environmental Science & Forestry, 320 Bray Hall, Syracuse, NY 13210.

INTRODUCTION

Accuracy assessment of land-cover maps constructed from remotely sensed data contributes important data quality information to map users. Accuracy assessment is usually conducted by selecting a sample of reference locations, and comparing the classifications at these reference locations to the classifications provided by the land-cover map. The reference sample should be selected independently of data used for training and/or developing the classification procedure. The reference sample data are then summarized in an error matrix (Congalton et al., 1983; Story and Congalton, 1986), and various accuracy statistics are computed from this error matrix.

A variety of measures have been suggested for describing the accuracy of land-cover classifications. The overall proportion of area, pixels, or polygons classified correctly for the entire map, various forms of kappa ($\kappa$) coefficients of agreement, the $\tau$ coefficient, user's and producer's accuracy, and conditional $\kappa$ are commonly used accuracy measures. No consensus has been reached on which measures are appropriate for a given objective of accuracy assessment, although the kappa statistic seems to be generally favored. Rosenfield and Fitzpatrick-Lins (1986, p. 226) recommended "that coefficients of Kappa and conditional Kappa be adopted by the remote sensing community as a measure of accuracy for thematic classification as a whole, and for the individual categories." Fitzgerald and Lees (1994, p. 368) proposed "that the Kappa test statistic be used in preference to the overall accuracy as a means of testing classification accuracy based on error matrices." Fung and LeDrew (1988, p. 1453) concluded that "accuracy indices based on the producer's accuracy and overall accuracy may tend to be biased towards the category with a large number of samples," and recommended that $\kappa$ be used because "all cells of the error matrix are considered." But Foody (1992) suggested that the usual $\kappa$ overestimated chance agreement and recommended a modified form of $\kappa$.
Ma and Redmond (1995, p. 435) showed that this modified $\kappa$ could be viewed as another type of agreement measure, called $\tau$, and then stated that $\tau$ "is a better measure of classification accuracy for use with remote sensing data than either Kappa or percentage agreement." Lark (1995) mentions the use of $\kappa$ in accuracy assessment, but does not describe an application in which $\kappa$ is the parameter of choice among the several examples presented.
Each accuracy measure provides a different summary of the information contained in an error matrix, and each may be applicable for a particular user in a given project. Lark (1995) describes specific circumstances in which one accuracy measure may be more relevant than others for a particular objective. Congalton (1991) suggested calculating various accuracy measures, and if the interpretations differed, evaluating the nature of the differences. Given that different accuracy measures are appropriate for different objectives, it is important to understand the characteristics of each measure. The objective of this article is to describe properties of and relationships among these accuracy measures and to illustrate why they differ in certain circumstances. Supplied with this information, a user can better decide which accuracy measure is appropriate for a given application and objective.

Uses of Map Accuracy Measures

Because an error matrix is such an effective descriptive tool for organizing and presenting accuracy assessment information, the error matrix should be reported whenever feasible. In addition to the valuable descriptive role of the full error matrix, various map accuracy measures will be of interest to summarize the error matrix information. Uses of these summary measures may be grouped into two broad application classes, reporting the accuracy of a final map product, and comparing maps.

In the first class of applications, a user has available a final-product land-cover map, and the accuracy assessment objective is to provide a description of classification error (e.g., Congalton et al., 1993; Dicks and Lo, 1990; Fiorella and Ripple, 1993; Lauver and Whistler, 1993; Lawrence et al., 1996; Vujakovic, 1987). In the second class of applications, the map producers may still be in the process of creating a map for the region of interest. In this application, the identity and number of land-cover classes is often the same, and the objective is to determine which map constructed from the candidate imagery dates, classification algorithms, or other available options results in the highest accuracy. For example, Gong and Howarth (1990) evaluated several factors influencing classification accuracy, Treitz et al. (1992) compared the accuracies of maps derived from different methods for classifying training data, and Stenback and Congalton (1990) compared accuracies of different TM band combinations. The first class of applications requires selection of the most appropriate descriptor of map accuracy, whereas the map comparison application requires an appropriate descriptor, an ability to rank the maps, and a measure of the magnitude of the difference in accuracy.

Sometimes comparisons are needed for maps of different regions and/or different land-cover classification schemes. Such applications present perhaps the most difficult task because of the confounding of accuracy with regional and land-cover scheme differences. For example, even if the same region is represented by the two maps but the land-cover classification schemes differ, the user likely has different objectives motivating the two schemes, and a direct comparison based only on a map accuracy measure may not capture this difference in objectives.

Notation and Description of Accuracy Parameters

The accuracy assessment measures can be defined as parameters of a population error matrix. That is, given complete reference data for the study region, both the reference and image classifications for all areas on the map are available. A population error matrix (Table 1) could be constructed from this census, where $p_{ij}$ is the probability that a randomly selected area is classified as category $i$ by the image and as category $j$ by the reference data. Summary measures computed from this error matrix are then population parameters. Recognizing that these parameters are characteristics of a real population is a useful device when interpreting the various accuracy measures. In practice, a sampling design is implemented, and the sample data obtained are used to estimate the population error matrix and associated parameters. However, to examine the various accuracy measures, it is simpler to operate at the population level and ignore the specifics of how to estimate these parameters for a particular sampling design.

The accuracy parameters that will be discussed are listed below. In the Table 1 notation, lower case $p$ will be used to denote characteristics of the population error matrix such as the individual cell probabilities ($p_{ij}$) and row and column marginal proportions ($p_{i+}$ and $p_{+j}$, respectively).

Upper case $P$ will be used to denote summary parameters such as the overall proportion of area correctly classified ($P_c$), and user's ($P_{U_i}$) and producer's ($P_{A_j}$) accuracy.

1. Overall proportion of area correctly classified,
$$P_c = \sum_{k=1}^{q} p_{kk}, \qquad (1)$$
where $q$ = number of land-cover categories.

2. Kappa,
$$\kappa = \frac{P_c - \sum_{k=1}^{q} p_{k+}p_{+k}}{1 - \sum_{k=1}^{q} p_{k+}p_{+k}}, \qquad (2)$$
where $p_{k+} = \sum_{j=1}^{q} p_{kj}$ and $p_{+k} = \sum_{i=1}^{q} p_{ik}$.

3. Kappa with random chance agreement as defined by Foody (1992),
$$\kappa_e = \frac{P_c - 1/q}{1 - 1/q}, \qquad (3)$$
the subscript $e$ used to indicate that each land-cover class is equally probable under this definition of hypothetical chance agreement.

4. Tau (Ma and Redmond, 1995),
$$\tau = \frac{P_c - \sum_{k=1}^{q} \beta_k p_{+k}}{1 - \sum_{k=1}^{q} \beta_k p_{+k}}, \qquad (4)$$
where $\beta_k$ is the user-specified a priori probability of membership in map class $k$.

5. User's accuracy for cover type $i$, the conditional probability that an area classified as category $i$ by the map is classified as category $i$ by the reference data,
$$P_{U_i} = p_{ii}/p_{i+}. \qquad (5)$$

6. Producer's accuracy for cover type $j$, the conditional probability that an area classified as category $j$ by the reference data is classified as category $j$ by the map,
$$P_{A_j} = p_{jj}/p_{+j}. \qquad (6)$$

7. Conditional kappa for the map classifications in category (row) $i$,
$$\kappa_i = \frac{p_{ii} - p_{i+}p_{+i}}{p_{i+} - p_{i+}p_{+i}} = \frac{P_{U_i} - p_{+i}}{1 - p_{+i}}. \qquad (7)$$

8. Conditional kappa for the reference classifications in category (column) $j$,
$$\kappa_j = \frac{p_{jj} - p_{j+}p_{+j}}{p_{+j} - p_{j+}p_{+j}} = \frac{P_{A_j} - p_{j+}}{1 - p_{j+}}. \qquad (8)$$
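The following sketch (added here for illustration; it is not part of the original article) shows how these population parameters can be computed with NumPy from an error matrix p whose entry p[i, j] is the probability that an area is mapped as class i and belongs to reference class j. The function name and the optional beta argument (the a priori class probabilities of Eq. (4)) are illustrative choices.

```python
# Illustrative sketch: accuracy parameters of Eqs. (1)-(8) from a
# population error matrix p (rows = map classes, columns = reference classes).
import numpy as np

def accuracy_parameters(p, beta=None):
    p = np.asarray(p, dtype=float)
    q = p.shape[0]
    row = p.sum(axis=1)                  # p_i+ (map marginal proportions)
    col = p.sum(axis=0)                  # p_+j (reference marginal proportions)
    diag = np.diag(p)

    Pc = diag.sum()                                      # Eq. (1)
    chance = np.sum(row * col)
    kappa = (Pc - chance) / (1.0 - chance)               # Eq. (2)
    kappa_e = (Pc - 1.0 / q) / (1.0 - 1.0 / q)           # Eq. (3)
    if beta is None:
        beta = np.full(q, 1.0 / q)                       # tau then reduces to kappa_e
    tau_chance = np.sum(np.asarray(beta) * col)
    tau = (Pc - tau_chance) / (1.0 - tau_chance)         # Eq. (4)

    users = diag / row                                   # Eq. (5), user's accuracy P_Ui
    producers = diag / col                               # Eq. (6), producer's accuracy P_Aj
    kappa_row = (users - col) / (1.0 - col)              # Eq. (7), conditional kappa by row
    kappa_col = (producers - row) / (1.0 - row)          # Eq. (8), conditional kappa by column
    return dict(Pc=Pc, kappa=kappa, kappa_e=kappa_e, tau=tau,
                users=users, producers=producers,
                kappa_row=kappa_row, kappa_col=kappa_col)
```

With beta omitted, tau in this sketch reduces to kappa_e, matching the relationship noted later in the text that $\kappa_e$ is a special case of $\tau$ (Ma and Redmond, 1995).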
These parameters use different information contained in the error matrix (Congalton, 1991) and summarize the error matrix at various levels. For example, $P_c$, $\kappa$, $\kappa_e$, and $\tau$ provide a single summary measure for the entire error matrix, whereas $P_{A_j}$ and $\kappa_j$ provide a summary by columns of the error matrix, and $P_{U_i}$ and $\kappa_i$ provide a summary by the rows of the error matrix. Each of these summary measures obscures potentially important detail contained in the error matrix, so that the full error matrix should be reported whenever possible.

SINGLE SUMMARY MEASURES OF THE ERROR MATRIX

Of the single number summary measures, the most basic is $P_c$. Other accuracy measures are derived from $P_c$ as the starting point. The motivation for $\kappa$ and $\tau$ is to account for agreement between the map and reference classifications that could be attributable to random chance. These parameters start with an observed measure of agreement, $P_c$, and then adjust $P_c$ by "a hypothetical expected probability of agreement under an appropriate set of baseline constraints" (Landis and Koch, 1977, p. 163). "Adjusting" for chance agreement is better terminology than "correcting" for chance agreement because the latter has the connotation that $P_c$ is somehow an incorrect representation of accuracy. $P_c$ represents a legitimate probability describing one aspect of map data quality. Users may choose to represent accuracy by another parameter, but it is not relevant to claim that $P_c$ is incorrect or that it provides a biased measure of accuracy.

$\kappa$ and $\tau$ measure the extent to which the observed probability of agreement exceeds the probability of agreement expected under the specified hypothetical baseline constraints. As shown by Ma and Redmond (1995), $\kappa$ uses marginal proportions of the observed map, and $\tau$ uses marginal proportions specified prior to constructing the map. The hypothetical nature of the adjustment for "chance agreement" needs to be emphasized. In reality, areas classified correctly "by random chance" are indistinguishable from areas classified correctly because of some more favorable aspect of the classification procedure. That is, it is impossible to go to the map and identify those areas (e.g., pixels or polygons) that have been correctly classified by random chance. If the objective is to describe the accuracy of a final map product, the user of this particular map is probably not concerned with the hypothetical proportion of area classified correctly by random chance. Such areas, even if they could be identified, are still classified correctly, and attributing a hypothetical reason for the classification being correct is irrelevant to applications requiring this map. If the overall map accuracy is 80% ($P_c = 0.80$), the user holds a map for which a randomly selected area has an 80% chance of being correctly classified. Thus estimating $\kappa$ or $\tau$ for a final map product is not an informative accuracy measure, and $P_c$, $P_{A_j}$, and $P_{U_i}$ are more relevant accuracy parameters because of their direct interpretation as probabilities characterizing data quality of this particular map.

Reporting $\kappa$ or $\tau$ to summarize a final map product provides a misleading representation of the probability that an area on the map is correctly classified because both $\kappa$ and $\tau$ are always smaller than $P_c$. This is seen by noting that both $\kappa$ and $\tau$ may be written as $[P_c - a]/(1 - a)$, where $a$ is the hypothetical adjustment for chance agreement. Then it follows that $[P_c - a]/(1 - a) - P_c = [P_c - a - P_c(1 - a)]/(1 - a) = a(P_c - 1)/(1 - a) \le 0$ because $(P_c - 1) \le 0$ and $(1 - a) \ge 0$. Therefore, both $\kappa$ and $\tau$ underrepresent the true probability of a correct classification. A similar argument applies to the relationship between conditional kappa and user's and producer's accuracy; that is, $\kappa_i \le P_{U_i}$ and $\kappa_j \le P_{A_j}$.

When adjusting $P_c$ for chance agreement, $\kappa$, $\kappa_e$, and $\tau$ incorporate different adjustments. Which adjustment is best is difficult to discern and will probably remain a matter of debate. In the detailed discussion by Brennan and Prediger (1981, p. 690) of agreement measures, a key distinction they proposed is whether the marginal proportions are considered fixed or free to vary: "A margin is 'fixed' whenever the marginal proportions are known to the assigner before classifying the objects into categories," and "a margin is 'free' whenever the marginal proportions are not known to the assigner beforehand." The identification of the marginal proportions of an error matrix as fixed or free determines the measure of chance agreement used to adjust $P_c$.

The adjustment for chance agreement used in $\kappa$ is $\sum_{k=1}^{q} p_{k+}p_{+k}$. Agresti (1996, p. 246) cites the dependence of the $\kappa$ definition of chance agreement on the marginal proportions as a primary source of controversy on the utility of this measure. The $\kappa$ adjustment is tantamount to assuming fixed map marginal proportions. But this assumption imposes a circularity in reasoning because the map (row) marginal proportions are the result of the classification, not a fixed set of marginal proportions the map construction process was required to match. The adjustment incorporated into $\kappa$ would be appropriate if the a priori row marginal proportions were specified, and the classification were forced to result in a map with those exact marginal proportions. An example from another application illustrates the point. Suppose a physician will evaluate 100 patients, and each patient must be classified into one of five disease categories. The true disease category is known (although not by the physician) for each patient, and the physician is provided with the proportion of patients in each category and told to match those same marginal proportions in his or her evaluation. In this scenario, the $\kappa$ measure of chance agreement is justified because the marginal proportions provided by the classification are fixed, and the classification (the physician's evaluation) is constrained to match those marginal proportions. The random chance adjustment takes into account the imposed constraint.

Foody (1992) argued that the measure of chance agreement incorporated into $\kappa$ is not the proper representation for most accuracy assessment problems because the map margins are not fixed, but free to vary. That is, the classification is not constrained to match specified row marginal proportions. In the absence of any information about the land-cover class of a given area, that area would be classified into one of the $q$ classes with equal probability, so $p_{k+} = 1/q$. Then assuming independence of the map and reference classifications, chance agreement is still $\sum_{k=1}^{q} p_{+k}p_{k+}$, but substituting $p_{k+} = 1/q$ into the equation leads to $(1/q)\sum_{k=1}^{q} p_{+k} = 1/q$. This result is the same regardless of how the reference (column) marginal proportions are viewed, either fixed or free, and this chance agreement adjustment results in $\kappa_e$. Chance agreement defined for $\kappa_e$ is smaller than that defined for $\kappa$, and this is the basis of Foody's (1992) statement that chance agreement is overestimated by $\kappa$.

Ma and Redmond (1995) show that $\kappa_e$ is a special case of $\tau$, and claim that $\kappa_e$ is an appropriate measure if unsupervised classification is employed, or if a supervised classification is employed with no a priori specification of class membership. In both cases, the row marginal proportions of the error matrix used in the chance agreement adjustment are $1/q$; that is, in the absence of any information about the true class of a particular area of the map, the area will be classified into one of the $q$ land-cover categories with equal probability. If a supervised classification is employed and the a priori class membership probabilities specified are not equal, then the measure of chance agreement used by $\tau$ is $\sum_{k=1}^{q} \beta_k p_{+k}$, where $\beta_k$ is the a priori probability of classifying an area into category $k$. In this case, the measure of chance agreement used in $\tau$ is based on the premise that some information about the true class of an area exists. That information is contained in the $\beta_k$ values specified. So now random agreement does not imply the complete absence of prior information about the true class of an area, which is how chance agreement is defined for $\kappa_e$.

Although a priori probabilities are specified in a supervised classification, the classification procedure is not constrained to match these specified probabilities, so that the measure of chance agreement used in $\tau$ is independent of the row marginal proportions of the population error matrix obtained. Ma and Redmond (1995) claim this independence is a desirable feature of $\tau$. But this independence leads to the conceptually disconcerting consequence that two maps having the exact same population error matrices may result in different $\tau$ coefficients simply because the a priori probabilities are different for the two maps. Further, if the map marginal proportions
are not forced to match $\beta_k$, it is unclear how to interpret $\tau$ in the Brennan and Prediger (1981) framework. The map (row) marginal proportions are still free to vary, so that perhaps the interpretation is that if the map marginals ($p_{k+}$) had been forced to match the a priori probabilities ($\beta_k$), then random agreement would be as measured by $\tau$.

Because chance agreement is a hypothetical construct, the various definitions invoked lead to the different accuracy parameters $\kappa$, $\kappa_e$, and $\tau$. Each parameter assumes different a priori information about the true land-cover class of an area, so that it is difficult to claim that one measure is better than another. These measures are simply different. Choosing among these accuracy parameters raises the question of how to represent accuracy, whether to adjust for hypothetical chance agreement at all, and if an adjustment is incorporated, which measure to use. In some cases, all parameters lead to similar conclusions, but in other applications, the conclusions will differ (see the first subsection of the section on Testing Overall Map Accuracy). Numerous other accuracy measures could be defined, and some are reviewed by Kalkhan et al. (1995; 1996). Bishop et al. (1975) and Agresti (1990) distinguish between measures of association and measures of agreement, and state that strong association in a contingency table (error matrix) does not imply high agreement. Therefore, measures of association should not be applied to accuracy assessment problems.

FURTHER DISCUSSION OF $\kappa$

Because $\kappa$ has generally been accepted and frequently used to summarize the results of an accuracy assessment, it is worth further exploring the properties of this measure. Dicks and Lo (1990), Fung and LeDrew (1988), Janssen and van der Wel (1994), and Rosenfield and Fitzpatrick-Lins (1986) all state that $\kappa$ uses all cells of the error matrix, not just the diagonal entries used by $P_c$. Others apparently disagree. Zhuang et al. (1995, p. 427) stated that $\kappa$ does "not directly include the effects of off-diagonal entries on the accuracies of individual classification categories and overall classification." Interpreting such conflicting views is difficult, and some of the difficulty is attributable to authors defining terms differently.

The observed marginal proportions of the error matrix are obviously incorporated into $\kappa$, so that $\kappa$ does use some of the off-diagonal information in the error matrix. But different internal configurations of the error matrix can result in the same row and column marginal proportions, so that, in that sense, $\kappa$ does not use all cells of the error matrix. Consider the first two error matrices shown in Table 2. Both have $P_c = 0.636$ and the same marginal proportions. The two error matrices are obviously different internally, but both have $\kappa = 0.450$. User's and producer's accuracy differ for the two matrices, and the decision of which error matrix represents a better classification depends on the relative importance of each land-cover class, and the relative importance of user's and producer's accuracy to the objectives of the particular mapping project. Both $P_c$ and $\kappa$ obscure class-level differences, and this example illustrates an inherent problem with summarizing the error matrix by a single number.

Another feature of $\kappa$ is that when the corresponding row and column marginal proportions of the population error matrix are closer to each other, more observed agreement (higher $P_c$) is needed to attain the same value of $\kappa$ (Lee and Tu, 1994). That is, if $p_{k+}$ and $p_{+k}$ are similar for each class $k$, $P_c$ must be higher to achieve the same $\kappa$ as a map with a greater disparity between $p_{k+}$ and $p_{+k}$. A highly desirable feature of a land-cover map is for the proportion of area in each land-cover class identified by the map to match the proportion for that class that exists on the ground ($p_{k+} = p_{+k}$). Yet $\kappa$ penalizes a map achieving this desirable feature. This is demonstrated numerically with the third and fourth error matrices in Table 2.

Table 2. Example Population Error Matrices and Associated Accuracy Parameters

Error Matrix 1 (P_c = 0.636, kappa = 0.450)
  Class     A        B        C        p_i+     P_Ui
  A         0.2727   0.0000   0.0909   0.3636   0.750
  B         0.1818   0.1818   0.0000   0.3636   0.500
  C         0.0000   0.0909   0.1818   0.2727   0.667
  p_+j      0.4545   0.2727   0.2727
  P_Aj      0.600    0.667    0.667

Error Matrix 2 (P_c = 0.636, kappa = 0.450)
  A         0.3636   0.0000   0.0000   0.3636   1.000
  B         0.0909   0.1364   0.1364   0.3637   0.375
  C         0.0000   0.1364   0.1364   0.2728   0.500
  p_+j      0.4545   0.2728   0.2728
  P_Aj      0.800    0.500    0.500

Error Matrix 3 (P_c = 0.660, kappa = 0.370)
  A         0.4500   0.1100   0.0400   0.60     0.75
  B         0.1500   0.1500   0.0000   0.30     0.50
  C         0.0000   0.0400   0.0600   0.10     0.60
  p_+j      0.60     0.30     0.10
  P_Aj      0.75     0.50     0.60

Error Matrix 4 (P_c = 0.660, kappa = 0.469)
  A         0.3600   0.1000   0.1400   0.60     0.60
  B         0.0400   0.2000   0.0600   0.30     0.67
  C         0.0000   0.0000   0.1000   0.10     1.000
  p_+j      0.40     0.30     0.30
  P_Aj      0.90     0.67     0.33

Error Matrix 5 (P_c = 0.660, kappa = 0.490)
  A         0.2644   0.0600   0.0089   0.3333   0.793
  B         0.0422   0.1811   0.1100   0.3333   0.543
  C         0.0267   0.0922   0.2144   0.3333   0.643
  p_+j      0.3333   0.3333   0.3333
  P_Aj      0.793    0.543    0.643
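As a small numerical check (not in the original text), the sketch below reproduces the Table 2 result that error matrices 1 and 2 share the same overall accuracy, the same marginals, and the same kappa, while their user's accuracies differ.

```python
# Sketch (not part of the original article): error matrices 1 and 2 of
# Table 2 have identical P_c and kappa but different user's accuracies.
import numpy as np

m1 = np.array([[0.2727, 0.0000, 0.0909],
               [0.1818, 0.1818, 0.0000],
               [0.0000, 0.0909, 0.1818]])
m2 = np.array([[0.3636, 0.0000, 0.0000],
               [0.0909, 0.1364, 0.1364],
               [0.0000, 0.1364, 0.1364]])

def kappa(p):
    pc = np.trace(p)
    chance = np.sum(p.sum(axis=1) * p.sum(axis=0))
    return (pc - chance) / (1.0 - chance)

for m in (m1, m2):
    print("P_c = %.3f  kappa = %.3f  user's = %s"
          % (np.trace(m), kappa(m), np.round(np.diag(m) / m.sum(axis=1), 3)))
# Both matrices print P_c = 0.636 and kappa = 0.450, while the user's
# accuracies differ (0.75, 0.50, 0.667 versus 1.0, 0.375, 0.50).
```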
Error matrices 3 and 4 both have $P_c = 0.66$. Error matrix 3 has row proportions ($p_{k+}$) 0.6, 0.3, and 0.1 and matching column proportions ($p_{+k}$) 0.6, 0.3, and 0.1, yielding the highly favorable result that area estimates for each land-cover class from the map match the actual areas as given by the reference data. Error matrix 4 has row proportions of 0.6, 0.3, and 0.1 and column proportions of 0.4, 0.3, and 0.3. The classification resulting in error matrix 4 has poorer agreement between the reference and map marginal proportions, yet error matrix 4 has a higher $\kappa$ ($\kappa = 0.47$) than error matrix 3 ($\kappa = 0.37$), even though both maps have $P_c = 0.66$. Error matrix 3 is penalized with higher chance agreement despite possessing a desirable match between row and column marginal proportions.

Consider still another classification represented by error matrix 5, which also has $P_c = 0.66$ and, similar to error matrix 3, has matching row and column proportions. In error matrix 5, all map classes are equally represented (1/3), and user's and producer's accuracy are slightly higher than those shown for error matrix 3. With $P_c$ the same as error matrix 3, and user's and producer's accuracy only slightly higher than in error matrix 3, error matrix 5 has a much higher $\kappa$ ($\kappa = 0.49$ versus $\kappa = 0.37$) because it has smaller chance agreement (0.333) than that of error matrix 3 (0.46). Given that error matrices 3 and 5 both possess the desirable feature that the row proportions match the column proportions, it is not clear why chance agreement should differ between the two just because the configuration of marginal proportions is different in the two error matrices.

$\kappa_e$ resolves some of the confusion in interpreting $\kappa$ because it is based only on the number of land-cover categories, $q$, not on the marginal proportions. For $\kappa_e$, as $q$ increases, chance agreement ($1/q$) decreases. This is intuitively appealing because if more land-cover categories are added, we would expect fewer correct guesses if the map were classifying area completely at random. When several maps, all with the same number of categories, are compared, $\kappa_e$ orders these maps from best to worst in exactly the same way as $P_c$ because $\kappa_e$ is a linear rescaling of $P_c$. That is, $\kappa_e = a + bP_c$, where $a = 1/(1 - q)$ and $b = q/(q - 1)$, so the difference between $\kappa_e$ values for two error matrices is $b$ times the difference between the $P_c$ values. Thus, in this setting, $\kappa_e$ rescales the magnitude of the difference in accuracy, but does not alter the ordering or ranking.

CLASS-LEVEL ACCURACY MEASURES

When interest focuses on the accuracy for particular land-cover classes, attention shifts to row and column accuracy measures such as user's and producer's accuracy, and conditional kappa. Although the labels user's and producer's accuracy are not universally accepted (cf. Janssen and van der Wel, 1994; Lark, 1995), those terms will be applied to the conditional probabilities defined by Eqs. (5) and (6). User's accuracy is related to commission error, and producer's accuracy is related to omission error (Janssen and van der Wel, 1994).

Rosenfield and Fitzpatrick-Lins (1986) suggested using conditional $\kappa$ for the same reason motivating $\kappa$, which is to incorporate an adjustment for hypothetical chance agreement. The relationship between conditional kappa ($\kappa_i$) and user's accuracy can be illustrated by writing $\kappa_i$ as
$$\kappa_i = \frac{p_{ii} - p_{i+}p_{+i}}{p_{i+} - p_{i+}p_{+i}} = \frac{P_{U_i} - p_{+i}}{1 - p_{+i}}.$$
Thus conditional $\kappa$ defines random agreement as $p_{+i}$, and therefore adjusts user's accuracy by this column proportion for reference class $i$. Recall that $p_{+i}$ is the true (reference) proportion of area in land-cover class $i$, not the mapped area proportion. The result of this adjustment is that those land-cover classes that are common in the mapped region must have higher user's accuracy to achieve the same conditional $\kappa$ as a less common cover class. The justification for this penalty invoked by $\kappa_i$ on common cover classes seems tenuous. Conditional $\kappa$ is apparently based on a premise that the map somehow "knows" the proportion of pixels in category $i$, and therefore the map should assign pixels to that category according to those known proportions. That is, $\kappa_i$ assumes the margin $p_{+i}$ is fixed in the Brennan and Prediger (1981) framework. If the map indeed has such "knowledge" of $p_{+i}$, that is a favorable feature of the classification process, and the accuracy of such a map should not be penalized by higher chance agreement. Instead of using $p_{+i}$ to represent random agreement, it seems preferable to use $1/q$, resulting in a conditional $\kappa$ parameter analogous to $\kappa_e$, $(P_{U_i} - 1/q)/(1 - 1/q)$. That is, if the map is truly classifying area completely at random, areas should be assigned to the land-cover classes with equal probability, $1/q$. Similar arguments apply to the relationship between producer's accuracy and $\kappa_j$.
ing or ranking. ferent measures. The error matrices presented in Fitz-
gerald and Lees (1994, their Tables 4 and 5) provide the
source material. The two error matrices are based on dif-
CLASS-LEVEL ACCURACY MEASURES
ferent classifiers, a neural network (NN) classifier, and a
When interest focuses on the accuracy for particular decision tree (DT) classifier. In addition to assessing the
land-cover classes, attention shifts to row and column ac- overall accuracy of the two classifiers, there is also inter-
curacy measures such as user’s and producer’s accuracy, est in evaluating and comparing accuracy for the individ-
and conditional kappa. Although the labels user’s and ual land-cover classes.
producer’s accuracy are not universally accepted (cf. Based on their analyses, Fitzgerald and I,ers (1994,
Janssen and van der Wel, 1994; Lark, 1995), those terms p. 362) expressed a strong preference for K, and pur-
purported to have demonstrated "that the accepted method of assessing classification accuracy, the overall accuracy percentage [$P_c$], is misleading especially so when applied at the class comparison level." They further stated that $\kappa$ is a more "sophisticated measure of interclassifier agreement than the overall accuracy and gives better interclass discrimination than the overall accuracy." Fitzgerald and Lees' (hereafter FL) preference for $\kappa$ was based on their class-level comparisons. For each land-cover category, they collapsed the full error matrix into a 2x2 table. For example, to estimate their overall agreement proportion for the class dry sclerophyll, they collapsed the entire error matrix into two classes, "dry sclerophyll" and "not dry sclerophyll." The diagonal entries of this collapsed error matrix are then summed and divided by the total sample size to get the estimated overall proportion correct, $\hat{P}_c$. Their $\hat{\kappa}$ statistic is also computed from the collapsed 2x2 table. The FL results are shown in the last two columns of Table 3, and the discrepancy between $\hat{\kappa}$ and $\hat{P}_c$ is apparent.

Table 3. Accuracy Statistics Computed from Fitzgerald and Lees (1994, Table 4) Including the Sea Class

  Class                User's     Conditional    Producer's   Conditional    Overall
                       Accuracy   Kappa (row)    Accuracy     Kappa (col)    Agreement   Kappa
  1. Dry sclerophyll   0.349      0.346          0.678        0.675          0.992       0.457
  2. E. botryoides     0.300      0.299          0.529        0.529          0.998       0.382
  3. Lower wet slope   0.026      0.025          0.077        0.075          0.997       0.037
  4. Wet E. maculata   0.477      0.475          0.372        0.370          0.996       0.416
  5. Dry E. maculata   0.451      0.449          0.681        0.680          0.997       0.541
  6. RF Ecotone        0.593      0.592          0.154        0.153          0.998       0.244
  7. Rainforest        0.337      0.336          0.348        0.347          0.998       0.342
  8. Paddocks          0.446      0.446          0.962        0.961          0.999       0.609
  9. Sea               1.000      0.999          0.992        0.695          0.992       0.820

  The last two columns (Overall Agreement and Kappa) are the FL estimates computed from collapsed 2x2 tables.

Estimating $P_c$ from this collapsed error matrix creates one representation of class-level accuracy and is the approach taken by Fleiss (1981, Chap. 13). For the collapsed 2x2 table, $P_c$ represents the accuracy for a dichotomous classification, for example, accuracy of a "dry sclerophyll" and "not dry sclerophyll" classification. The dichotomous classification results in a significant loss of information, and the objective motivating this perspective of class-level accuracy is substantially different from the objective of evaluating the accuracy of the dry sclerophyll class within the context of the nine-category classification represented by the full error matrix. If the objectives call for a two-category classification system, then the FL assessment is appropriate. The class-level $\hat{P}_c$ values computed by FL are high because the sea class dominates the sample size, and this class has extremely high accuracy. This makes any dichotomous classification very accurate. The FL class-level $\hat{\kappa}$ statistics differ greatly from the $\hat{P}_c$ values also because of the dominance of the sea class in the sample. Chance agreement defined by $\kappa$ will be very high in these collapsed 2x2 tables, so that the $\hat{\kappa}$ values are much smaller than the $\hat{P}_c$ values.

For these same data, user's and producer's accuracy and conditional kappa are computed (first four columns of Table 3). These measures evaluate class-level accuracy within the context of the full nine-category classification scheme. In this representation, a much different conclusion from that presented by FL is obtained. The proportion correct, as measured by user's and producer's accuracy, differs little from the corresponding conditional $\hat{\kappa}$, and the kappa statistic does not provide a better or even different discrimination based on agreement among the classes. The relative rankings of the different classes are exactly the same whether the proportion correctly classified or the kappa statistic is used, and "the disparity in the relative rankings of the overall accuracy values and the Kappa values" noted by FL (p. 366) is an artifact of the collapsed 2x2 tables formed in their definition of class-level accuracy. In general, the FL error matrices do not demonstrate that kappa is "a more rigorous and discerning statistical tool for measuring the classification accuracy of different classifiers" except when class-level accuracy is defined according to their collapsed-class representation. Their conclusions do not generalize to other common representations of class-level accuracy.

Analysis of the error matrices with the dominating sea class excluded (Table 4) provides additional interesting insights into uses of the information in an error matrix. For this eight-category classification (land classes only), the estimated values for $P_c$ are 0.511 for the NN classifier and 0.508 for the DT classifier, and the estimated $\kappa$ is 0.395 for NN and 0.389 for DT. Although the $P_c$ and $\hat{\kappa}$ values differ, both measures suggest little difference between the two classifiers for the overall error matrix. The class-level accuracies are represented by user's accuracy and conditional $\kappa$ (for rows). The class-level accuracies achieved by the NN and DT classifiers are generally similar, but the accuracy for classes 3 and possibly 8 are sufficiently higher for the NN classifier that this might convey an important practical advantage relative to the DT classifier. Such accuracy differences may be important, but they are not evident from the comparison of the single number summary measures, $P_c$
and $\kappa$. This illustrates why class-level evaluations are often important.

Table 4. Class-level accuracy statistics (user's accuracy and conditional kappa) for the NN and DT classifiers with the sea class excluded. [Original caption and column headings not recovered.]

  Class
  1     0.575   0.613   0.406   0.459
  2     0.396   0.383   0.356   0.344
  3     0.571   0.407   0.550   0.378
  4     0.475   0.433   0.324   0.265
  5     0.451   0.420   0.344   0.303
  6     0.593   0.554   0.551   0.509
  7     0.337   0.355   0.279   0.300
  8     0.943   0.879   0.941   0.873

If the NN classifier is compared to the DT classifier category by category, the ordering of the classifiers is the same whether accuracy is measured by user's accuracy or conditional $\kappa$. For a particular classifier, the ordering of the land-cover classes obtained from user's accuracy and conditional $\kappa$ is not the same. Not surprisingly, differences in the order occur for those classes that have relatively close values of $P_{U_i}$ or $\hat{\kappa}_i$.

These examples illustrate that project objectives may dictate that user's accuracy or producer's accuracy for one or more classes is a high priority, and certain misclassifications may be extremely critical while other errors are less important. A more detailed analysis of the error matrix focusing on selected user's and producer's accuracies or particular cell probabilities ($p_{ij}$) can be done to customize the assessment more closely to project objectives. A weighted kappa statistic (Fleiss, 1981; Naesset, 1996) has been proposed as an overall measure of agreement in which the importance of different misclassifications to the user's objectives can be incorporated into the accuracy measure. This weighting feature is exactly the type of approach that should be employed to link accuracy measures more closely to mapping objectives. Unfortunately, embedding the weighting within the context of a kappa framework results in a measure that suffers from all of the same definition and interpretation problems inherent in $\kappa$ (see the section Further Discussion of $\kappa$). Consequently, weighted kappa cannot be recommended as a useful accuracy measure.
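For reference, a common form of the weighted kappa (not reproduced in the original text; see, e.g., Fleiss, 1981) uses agreement weights $w_{ij}$, with $w_{ii} = 1$ and $0 \le w_{ij} \le 1$ reflecting how serious it is to confuse map class $i$ with reference class $j$:
$$\kappa_w = \frac{\sum_{i=1}^{q}\sum_{j=1}^{q} w_{ij}\,p_{ij} - \sum_{i=1}^{q}\sum_{j=1}^{q} w_{ij}\,p_{i+}p_{+j}}{1 - \sum_{i=1}^{q}\sum_{j=1}^{q} w_{ij}\,p_{i+}p_{+j}}.$$
Because the chance-agreement term retains the product $p_{i+}p_{+j}$, this weighted form inherits the same dependence on the observed marginal proportions discussed above.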
TESTING OVERALL MAP ACCURACY

Hypothesis testing may sometimes be required to address accuracy assessment objectives. For example, if a map must have an overall accuracy of $P_c = 0.70$ to meet contractual specifications, a test of the null hypothesis $H_0$: $P_c \le 0.70$ can be made against the alternative hypothesis $H_a$: $P_c > 0.70$. Hypothesis tests can be constructed for $P_{A_j}$, $\kappa$, $\kappa_e$, $\tau$, or other accuracy parameters. Some general issues pertaining to hypothesis testing in accuracy assessment are reviewed here, with the specific focus being tests based on $\kappa$.

Testing the null hypothesis $H_0$: $\kappa = 0$ evaluates if overall accuracy exceeds that of chance agreement. Fleiss (1981), Agresti (1990), and Janssen and van der Wel (1994) suggest that because map accuracy is anticipated to exceed agreement expected by random chance, testing the hypothesis that $\kappa = 0$ is often not relevant. A more informative test is to determine if $\kappa$ exceeds some hypothesized value, say $\kappa_0$. The values of $\kappa_0$ may be various cutoffs such as those suggested by Landis and Koch (1977) for moderate (0.41-0.60), substantial (0.61-0.80), and almost perfect (0.81-1.0) agreement. For example, a test of the hypothesis that accuracy beyond that expected by random chance may be considered "substantial" translates into testing $H_0$: $\kappa \le 0.61$ versus $H_a$: $\kappa > 0.61$. Fleiss (1981, p. 221) presents the formulas for carrying out such a test.
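A minimal sketch of such a large-sample test is given below (not from the original article). It assumes the reference data are a simple random sample, so that $\mathrm{SE} = \sqrt{\hat{P}(1-\hat{P})/n}$, and the counts shown are hypothetical; an analogous test for $\kappa$ against $\kappa_0$ would use the estimated standard error of $\hat{\kappa}$ given by Fleiss (1981).

```python
# Sketch (not part of the original article): one-sided z test of
# H0: P_c <= 0.70 versus Ha: P_c > 0.70 under simple random sampling,
# with an approximate 95% confidence interval as a by-product.
import math

def test_overall_accuracy(correct, n, p0=0.70):
    p_hat = correct / n
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    z = (p_hat - p0) / se
    reject = z > 1.645                              # upper 5% point of standard normal
    ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)     # approximate 95% interval
    return p_hat, z, reject, ci

print(test_overall_accuracy(correct=380, n=500))    # hypothetical counts
# p_hat = 0.76, z about 3.14: reject H0 at the 5% level; 95% CI about (0.72, 0.80)
```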
As with any hypothesis test, the power of the test and practical importance of statistically significant differences found must be considered (Aronoff, 1982). If the sample size is large, the null hypothesis may be rejected even when the observed kappa statistic ($\hat{\kappa}$) does not exceed the hypothesized value ($\kappa_0$) by a large amount. For example, Fitzgerald and Lees (1994) reported numerous tests of the null hypothesis $H_0$: $\kappa = 0$, and rejected this hypothesis for an observed $\hat{\kappa}$ as small as 0.077. $\hat{\kappa} = 0.077$ is clearly not indicative of practically better agreement beyond that expected by random chance, yet this small $\hat{\kappa}$ is evidence to claim $\kappa$ is statistically separable from 0. Because the sample size in the Fitzgerald and Lees (1994) example is large ($n = 62,727$), this particular test is extremely powerful, so that even practically unimportant differences from $\kappa = 0$ will result in rejection of $H_0$. It should be routine data analysis practice to consider both the statistical significance and practical importance of any hypothesis test result.

Confidence intervals provide important descriptive information and can be used to conduct hypothesis tests. Basic description for accuracy assessment should include estimates of parameters of interest (e.g., $P_c$, $\kappa$, $\kappa_e$, $P_{U_i}$, or $\kappa_i$) accompanied by standard errors. An approximate confidence interval is constructed via the formula $\hat{P} \pm z\,\mathrm{SE}(\hat{P})$, where $\hat{P}$ is an estimate of the parameter of interest, $z$ is a percentile from the standard normal distribution corresponding to the specified confidence level, and $\mathrm{SE}(\hat{P})$ is the standard error of $\hat{P}$. Both $\hat{P}$ and $\mathrm{SE}(\hat{P})$ depend on the sampling design used to collect the reference data. These confidence intervals assume that the reference sample size is large enough to justify use of a normal approximation for the sampling distribution of $\hat{P}$. For estimates based on the entire error matrix such as $P_c$ and $\hat{\kappa}$, sample sizes are likely to be adequately large to satisfy this assumption. For those estimates based on
rows or columns of the error matrix such as $\hat{\kappa}_i$, $\hat{P}_{U_i}$, and $\hat{P}_{A_j}$, sample sizes may be small for rare classes and the normal approximation will not be justified.

Comparison of Error Matrices: Tests Based on a Single Summary Measure

When comparing the accuracy of two or more maps, the primary focus is to rank or order the maps, and to provide some measure of the magnitude of differences in accuracy among the maps. Such comparisons could be based on $P_c$, $\tau$, $\kappa$, or $\kappa_e$. Once again, differing opinions have been proffered on which parameter to use. For example, Janssen and van der Wel (1994, p. 424) state that "PCC values [$P_c$] cannot be compared in a straightforward way" and suggest normalizing the error matrix or using $\kappa$ to make such comparisons. Although the meaning of "straightforward" is open to interpretation, different error matrices can in fact be compared using $P_c$, and normalizing an error matrix is shown in the next section to be a questionable analysis strategy. Assuming that the reference samples for the error matrices are independent simple random samples of size $n_1$ and $n_2$, so that the estimates $\hat{P}_{c1}$ and $\hat{P}_{c2}$ are independent, a test of $H_0$: $P_{c1} = P_{c2}$ is obtained from
$$z = \frac{\hat{P}_{c1} - \hat{P}_{c2}}{\sqrt{\dfrac{\hat{P}_{c1}(1-\hat{P}_{c1})}{n_1} + \dfrac{\hat{P}_{c2}(1-\hat{P}_{c2})}{n_2}}}, \qquad (9)$$
where $z$ is distributed as a standard normal random variable. This is a standard test for comparing two population proportions (cf. Snedecor and Cochran, 1980, Sec. 7.10).
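A short sketch of Eq. (9) follows (not part of the original article; the two maps, sample sizes, and accuracy values used below are hypothetical).

```python
# Sketch (not from the original article): two-sample z test of Eq. (9)
# for comparing the overall accuracies of two maps assessed with
# independent simple random reference samples.
import math

def compare_overall_accuracy(p1, n1, p2, n2):
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se

z = compare_overall_accuracy(0.82, 400, 0.74, 350)   # hypothetical maps
print(round(z, 2))   # about 2.64; |z| > 1.96, so the accuracies differ at the 5% level
```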
The comparison could also be based on $\kappa$ (Congalton et al., 1983), and such comparisons have been used in analyses (e.g., Fung and LeDrew, 1988; Gong and Howarth, 1990; Marsh et al., 1994). This test incorporates the adjustment for chance agreement provided by $\kappa$. In practice, the same conclusions will often be reached whether $P_c$ or $\kappa$ is used for the comparison. Congalton et al. (1983) presented estimates for $P_c$ and $\kappa$ for error matrices derived from four different classification algorithms, and then conducted tests to determine if the $\kappa$ values for the different population pairs differed (Table 5). The ordering of the four algorithms is different depending on whether $P_c$ or $\kappa$ is used, but the discrepancy is minor because algorithms 1 and 2 are nearly similar in accuracy. Both accuracy parameters ($\kappa$ and $P_c$) reflect this similarity, even though the ordering differs. The z-statistics for the pairwise comparisons of accuracy based on $\kappa$ and $P_c$ lead to the same general conclusions concerning the statistical significance of differences in accuracy of the different algorithms. Only the comparison between algorithms 1 and 3 might be affected by the choice of parameter depending on the Type I error level chosen.

Table 5. Accuracy Statistics and Pairwise Comparison Tests for Four Classification Algorithms Reported in Congalton et al. (1983)

  Algorithm                      n     P_c     kappa
  4 modified clustering          632   0.859   0.718
  2 nonsupervised 20 clusters    659   0.785   0.586
  1 nonsupervised 10 clusters    659   0.766   0.605
  3 modified supervised          646   0.714   0.476

  z Statistics for Pairwise Comparisons of Accuracy of the Four Algorithms
  Comparison   kappa    P_c
  1 vs. 2       0.48    0.79
  1 vs. 3       3.01    2.17
  1 vs. 4      -2.94   -4.32
  2 vs. 3       2.43    2.96
  2 vs. 4      -3.28   -3.53
  3 vs. 4      -5.62   -6.46

Jakubauskas et al. (1992) reported $P_c$ and $\kappa$ values for six different classification methods, all using a four-category land-cover scheme (Table 6). The ordering of the six methods is slightly different depending on whether $P_c$ or $\kappa$ is used, but the discrepancies again occur when accuracy of two methods is nearly similar to begin with. Rosenfield and Fitzpatrick-Lins (1986, p. 224) reported conditional kappa and user's accuracy for a five-category classification and obtained the same ordering in terms of class-level accuracy from the two measures. The tables presented by Fung and LeDrew (1988) and Dikshit and Roy (1996) provide numerous additional examples for comparing the ordering of maps on the basis of $P_c$ and $\kappa$.

Table 6. Accuracy Measures for Six Classification Approaches (Ordered by kappa) Reported in Jakubauskas et al. (1992)

  Data/Technique          kappa    P_c
  TM/SPOT supervised      0.618    0.802
  TM/SPOT unsupervised    0.606    0.801
  SPOT supervised         0.603    0.802
  TM unsupervised         0.535    0.785
  TM supervised           0.492    0.768
  SPOT unsupervised       0.440    0.742

When differences in the ordering obtained by $P_c$ and $\kappa$ among different maps are extreme, this implies potentially important structural differences in the land-cover of the maps being compared. Brennan and Prediger (1981, p. 696) suggest that if the marginals vary from map to map ("assigner to assigner" in their terminology), it is difficult to compare the values of $\kappa$ that result because accuracy is confounded with chance agreement. That is, comparisons based on $\kappa$ will yield the same conclusion as comparisons based on $P_c$ unless the marginal proportions ($p_{i+}$ or $p_{+k}$) of the two error matrices are very different. The question arises, does it make sense to compare numerically maps with such fundamental
differences in land cover? For example, if one region has five land-cover classes, all approximately equally distributed, and another region has 10 classes with one class representing 91% of the area, what objective motivates a comparison of accuracy for these very different regions? Would there be reason to expect accuracy of the two regions to be similar? To compare maps with such fundamental structural differences, perhaps some measure of map value defined relative to the user's objectives is a more appropriate basis for the comparison than a measure of map accuracy.

Campbell (1987, p. 351) discusses map accuracy comparisons for the objective of determining which classifications using different images from different dates, classification algorithms, or individuals are best for a given region. In this case, the objective motivating comparison of the maps is clear, but differences in $\kappa$ are still difficult to interpret. If the same region is being classified, then presumably the same land-cover categories are being used, and the reference proportions ($p_{+k}$) must be the same for each map. But $\kappa$ also uses the row proportions ($p_{k+}$) in the adjustment for chance agreement, so that maps constructed from different classification algorithms, interpreters, and dates will have different chance agreement, even though they all classify the same region. The user must decide if it is reasonable to assign these maps different chance agreement, even though they are classifying the exact same region. What is it about the different classification procedures that justifies regarding them as having a different probability of correctly classifying areas by "random chance"? Similar issues were discussed in the second section relative to the definition of chance agreement employed in $\kappa$.

$\kappa_e$ circumvents the confounding problem present in $\kappa$ attributable to differing marginal proportions. Because chance agreement does not depend on the realized map proportions ($p_{k+}$), maps classifying the same region using the same land-cover classes, as in Campbell's (1987) application, will have the same chance agreement. If the number of categories is the same for the two error matrices, a test based on $\kappa_e$ turns out to be equivalent to a test based on $P_c$. To see this, we begin with the variance for an estimated $\tau$ coefficient (the generalization of $\kappa_e$) and the z-statistic specified by Ma and Redmond (1995),
$$z = \frac{\hat{\kappa}_{e1} - \hat{\kappa}_{e2}}{\sqrt{\hat{\sigma}_1^2 + \hat{\sigma}_2^2}},$$
where $\hat{\kappa}_{e1}$ and $\hat{\kappa}_{e2}$ are the estimated $\kappa_e$ for the two error matrices and $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$ are the estimated variances for the two sample estimates of $\kappa_e$, each based on a sample of size $n$,
$$\hat{\sigma}^2 = \frac{\hat{P}_c(1 - \hat{P}_c)}{n(1 - 1/q)^2}.$$
Then substituting the estimates for $\kappa_e$ and $\hat{\sigma}^2$ into $z$ gives exactly the test statistic based on $P_c$ [Eq. (9)].
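The substitution can be spelled out as follows (this intermediate step is added here for completeness and is not shown in the surviving text): with $\hat{\kappa}_{ei} = (\hat{P}_{ci} - 1/q)/(1 - 1/q)$ and $\hat{\sigma}_i^2 = \hat{P}_{ci}(1 - \hat{P}_{ci})/[n_i(1 - 1/q)^2]$,
$$z = \frac{\hat{\kappa}_{e1} - \hat{\kappa}_{e2}}{\sqrt{\hat{\sigma}_1^2 + \hat{\sigma}_2^2}}
  = \frac{(\hat{P}_{c1} - \hat{P}_{c2})/(1 - 1/q)}{\dfrac{1}{1 - 1/q}\sqrt{\dfrac{\hat{P}_{c1}(1 - \hat{P}_{c1})}{n_1} + \dfrac{\hat{P}_{c2}(1 - \hat{P}_{c2})}{n_2}}}
  = \frac{\hat{P}_{c1} - \hat{P}_{c2}}{\sqrt{\dfrac{\hat{P}_{c1}(1 - \hat{P}_{c1})}{n_1} + \dfrac{\hat{P}_{c2}(1 - \hat{P}_{c2})}{n_2}}},$$
so the factor $(1 - 1/q)$ cancels and the two tests coincide whenever the two maps use the same number of land-cover classes $q$.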
If two maps have different numbers of classes, the comparison based on $\kappa_e$ is confounded with a basic structural change in the classification scheme. The user must factor in how these differences in the classification schemes impact the objectives of the project. The decision of which map is better depends on the value of each map to the user, and a map's value will depend both on the map's accuracy and the relevance of the land-cover classification scheme to the mapping objectives. How $P_c$ and $\kappa_e$ incorporate this latter component of map value is not clear.

Comparisons based on $\tau$ suffer from the same confounding as comparisons based on $\kappa$. When using $\tau$, hypothetical chance agreement will differ if the a priori probabilities used in a supervised classification differ. In such cases, the comparison is of two fundamentally different classification procedures possessing different a priori information. Recall that the measure of chance agreement employed by $\tau$ uses hypothetical marginal proportions ($\beta_k$) that are not reflected in the actual map proportions. That is, $\tau$ adjusts for chance agreement as if the map had been forced to match the specified a priori proportions of the supervised classification. The user must decide if basing the comparison on a measure adjusting for this a priori information is relevant to the objectives.

Finally, an additional complication with comparisons of two maps based on the same reference data should be noted. Even though the two maps may be constructed independently, the reference sample used for the comparison is often the same for both maps, so that the test statistic employed for the hypothesis test should take this lack of independence into account. That is, the test statistics usually provided for comparing $\kappa$, $\tau$, or $P_c$ assume independent reference samples, not independent maps, so the independence assumption underlying the statistical comparison is routinely violated. Some type of paired comparison is appropriate if only a single reference sample is obtained, and a test statistic based on an assumption of two independent samples represents at best an approximation to the correct statistical test.
Standardized Error Matrices

Standardizing or normalizing an error matrix (Congalton, 1991) has been proposed as a way to compare individual cell probabilities of an error matrix because it eliminates the influence of different marginal proportions. Another advantage of standardization cited is that it uses all the information in the error matrix to estimate the cell probabilities. Zhuang et al. (1995) advocate routine use of standardized error matrices.

An example illustrating the result of standardizing an error matrix is presented in Table 7 [data from Bishop et al. (1975), p. 99]. Suppose the original error matrix in Table 7 represents a census of reference data for all $N = 225$ pixels in a small region. Summary measures calculated from this error matrix are parameters of the population represented by the map. For example, $P_c = \sum_{k=1}^{q} p_{kk} = 0.329$ is the overall proportion of pixels correctly classified in the population. When the population error matrix is standardized, the accuracy parameters differ markedly from those computed from the original matrix. For example, $P_c$ of the standardized error matrix is 0.278 (sum of the diagonal elements divided by 3).

Table 7. Effect of Standardizing an Error Matrix on Various Accuracy Parameters

a) Original Population Error Matrix
                    Reference
           A        B        C        p_i+     P_Ui
  Map A    0.271    0.053    0.267    0.591    0.459
      B    0.076    0.027    0.004    0.107    0.252
      C    0.173    0.098    0.031    0.302    0.103
  p_+j     0.520    0.178    0.302             (N = 225)
  P_Aj     0.521    0.152    0.103

b) Standardized Population Error Matrix [Values in Bishop et al. (1975) Divided by 100]
                    Reference
           A        B        C        Total    P_Ui
  Map A    0.201    0.102    0.697    1.000    0.201
      B    0.474    0.428    0.098    1.000    0.428
      C    0.325    0.470    0.205    1.000    0.205
  Total    1.000    1.000    1.000             (N = 225)
  P_Aj     0.201    0.428    0.205

In reality, standardization is applied to a sample error matrix, but evaluating the result of standardizing a population error matrix is relevant for the following reason. An important statistical property of a sample-based estimator of a population parameter is consistency. Cochran (1977, p. 21) defines a method of estimation as consistent "if the estimate becomes exactly equal to the population value when n = N, that is, when the sample consists of the whole population." The example calculations for the Table 7 standardized error matrix demonstrate that estimates obtained after standardization are not consistent for the parameters of the actual population error matrix. This raises the question of what the parameters estimated following standardization actually represent, and whether these parameters are meaningful to accuracy assessment objectives.

Zhuang et al. (1995) claimed that because user's and producer's accuracies differ, neither is the appropriate estimator of class-level accuracy. If the usual calculations for user's and producer's accuracies are applied to a standardized error matrix, the two accuracies are equal and represented by the diagonal element for that category. This is the effect of standardizing to homogeneous margins. But there is no reason why user's and producer's accuracies should be the same for a particular land-cover class. Story and Congalton (1986) argue that both user's and producer's accuracies may be needed to address project objectives. Both measures represent well-defined conditional probabilities, and this is a compelling reason for retaining them as appropriate accuracy measures. These conditional probabilities are not constrained to be equal, so evaluating both row and column conditional probabilities is part of a thorough analysis of the error matrix. The diagonal probabilities of a standardized matrix must in some sense combine user's and producer's accuracy. But based on the consistency argument, the diagonal cell probability from a standardized error matrix is not a consistent estimator of the parameter $p_{kk}$ of the actual population represented by the map.

Standardization has been employed in contingency table analyses to enhance the interpretability of interaction patterns (see Bishop et al., 1975, examples 3.6-2 and 3.6-3), but the value of standardization to enhance the interpretability or comparability of error matrices in an accuracy assessment setting is questionable. The lack of consistency of estimates from a standardized error matrix is a critical problem. Further, Bishop et al. (1975, p. 97) state that standardization scales the contingency table to fit hypothetical margins, which for an error matrix means scaling to hypothetical homogeneous margins. Scaling the error matrix to homogeneous margins is a valid statistical procedure, but the real populations that are the subject of accuracy assessment projects do not have these hypothetical equal margins. Consequently, standardizing leads to estimates of parameters for a hypothetical population that has little relevance to the reality of the accuracy assessment. In their discussions of measures of agreement, neither Agresti (1990), Bishop et al. (1975), nor Fleiss (1981) suggest standardizing a contingency table prior to computing the agreement measures. Unless the parameters estimated from a standardized error matrix can be identified and shown to be relevant to the objectives of accuracy assessment, this procedure should not be used.
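The following sketch (not from the original article) applies iterative proportional fitting, the usual way an error matrix is normalized to homogeneous margins, to the Table 7a matrix. If the Table 7b values were produced by this standard procedure, the code should reproduce them to rounding, and it shows directly how the diagonal, and hence the apparent overall accuracy, changes (0.329 for the original matrix versus about 0.278 after standardization).

```python
# Sketch (not part of the original article): iterative proportional
# fitting to homogeneous (equal) row and column margins, i.e., the
# "normalization" discussed above, applied to the Table 7a error matrix.
import numpy as np

def normalize(p, iters=100):
    p = np.asarray(p, dtype=float).copy()
    for _ in range(iters):
        p *= (1.0 / p.sum(axis=1))[:, None]   # force each row sum toward 1
        p *= (1.0 / p.sum(axis=0))[None, :]   # force each column sum toward 1
    return p

original = np.array([[0.271, 0.053, 0.267],
                     [0.076, 0.027, 0.004],
                     [0.173, 0.098, 0.031]])
std = normalize(original)
print(np.round(std, 3))
print(round(np.trace(original), 3), round(np.trace(std) / 3, 3))
# The original P_c is 0.329; the "P_c" of the standardized matrix
# (diagonal sum divided by 3) is about 0.278, as reported in the text.
```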
SUMMARY

The variety of accuracy parameters available creates a seemingly bewildering array of options from which to choose. Different accuracy measures use different information contained in the error matrix. Selecting an appropriate accuracy measure depends on the objectives of the assessment, which are in turn determined by the objectives of the mapping project. If the objective is to describe the accuracy of a final map product, the overall proportion correct (P_C), user's accuracy (P_Ui), and producer's accuracy (P_Aj) have a direct probabilistic interpretation in terms of the actual population represented by that map. The appeal of these measures is that they correspond to probabilities of the map user "drawing a correct conclusion from the map (or making a particular type of error) when using it to make a particular prediction" (Lark, 1995, p. 1465). Adjustments for hypothetical chance agreement are unnecessary when the objective is to report the accuracy of a single, final map product, and standardizing an error matrix does not lead to interpretable parameters for the actual population represented by this map.
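For readers who want the computations spelled out, a minimal sketch follows (Python with NumPy) of how P_C, user's accuracy, and producer's accuracy are obtained from an error matrix of cell proportions. The matrix values are invented for illustration, and the convention of rows as map classes and columns as reference classes is an assumption consistent with common practice rather than a detail taken from this paper.

    import numpy as np

    # Illustrative error matrix of cell proportions p_ij (entries sum to 1);
    # rows = map classes, columns = reference classes.
    p = np.array([[0.20, 0.03, 0.02],
                  [0.04, 0.30, 0.06],
                  [0.01, 0.07, 0.27]])

    P_C = np.trace(p)                        # overall proportion correctly classified
    users = np.diag(p) / p.sum(axis=1)       # P_Ui = p_ii / p_i+ : correct given map class i
    producers = np.diag(p) / p.sum(axis=0)   # P_Aj = p_jj / p_+j : correct given reference class j

    print(f"P_C = {P_C:.3f}")
    for k, (u, a) in enumerate(zip(users, producers), start=1):
        print(f"class {k}: user's accuracy = {u:.3f}, producer's accuracy = {a:.3f}")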
When the assessment objectives require comparing error matrices, the choice of an appropriate parameter becomes less clear. If a single summary measure of the error matrix is employed, any of the parameters P_C, κ, κ_e, or τ is potentially applicable, but none of these parameters directly takes into account specific objectives of a mapping project. Any implication that one accuracy measure is best for all applications is misleading. P_C is the simplest measure to interpret, but if a user wishes to make the comparison adjusting for hypothetical chance agreement, then the question of how to define chance agreement arises. κ, κ_e, and τ each measure chance agreement in different ways, so a comparison of two maps obviously depends on how chance agreement is defined. κ_e results in exactly the same test statistic as P_C if the two maps being compared have the same number of categories. Conclusions from tests based on P_C, κ, κ_e, and τ may differ if one or more of the following occur: 1) the number of land-cover categories in the two maps differs; 2) the land-cover categories themselves differ in the two maps; 3) the marginal proportions (p_k+ or p_+k) differ in the two error matrices (in the case of κ); and 4) the a priori probabilities (β_k) for each category differ in the two maps in a supervised classification (in the case of τ). In the first three situations, the comparison is of two fundamentally different classification schemes, and the user must ask whether it makes sense to employ a statistical comparison to evaluate what is already clearly a different classification scenario. Is a numerical comparison necessary when the two map products being compared represent regions with fundamentally different land cover? This question is relevant regardless of the accuracy measure chosen to make the comparison. The fourth situation provides a comparison based on chance agreement as defined by a feature of the map construction process. Does the user want to base the comparison on a feature of the classification process itself, or on the outcome of the classification process? These are the types of questions that should arise when choosing a map accuracy measure for comparing error matrices.
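As a small illustration of how the chance-agreement adjustment depends on its definition, the sketch below computes kappa with the usual marginal-product chance term and a kappa-type coefficient with a uniform 1/q chance term (the form associated with Brennan and Prediger, 1981), using the same illustrative proportions as in the previous sketch. The matrix values are assumptions for demonstration, and τ is omitted because it additionally requires the a priori class probabilities β_k.

    import numpy as np

    # Same illustrative error matrix of cell proportions as in the previous sketch.
    p = np.array([[0.20, 0.03, 0.02],
                  [0.04, 0.30, 0.06],
                  [0.01, 0.07, 0.27]])
    q = p.shape[0]

    P_C = np.trace(p)                                        # observed agreement
    chance_margins = (p.sum(axis=1) * p.sum(axis=0)).sum()   # chance term used by kappa
    kappa = (P_C - chance_margins) / (1.0 - chance_margins)

    # Uniform chance term 1/q: for a fixed number of categories this coefficient
    # is a monotone function of P_C, so it orders maps the same way P_C does.
    kappa_e = (P_C - 1.0 / q) / (1.0 - 1.0 / q)

    print(f"P_C = {P_C:.3f}, kappa = {kappa:.3f}, kappa_e = {kappa_e:.3f}")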
Using a single accuracy parameter to summarize an error matrix may not satisfy the objectives of an accuracy assessment. Aronoff (1982) stated that a single measure of map quality does not provide the information needed to understand the relative advantages of two land-cover maps, and that the error matrix is a valuable tool for such comparisons. Story and Congalton (1986) similarly argued that using only a single value can be extremely misleading, and recommended reporting both user's and producer's accuracies as well as the error matrix. Because it is difficult to anticipate the objectives and accuracy needs of all eventual users of a map product, the best course of action is to report the full error matrix along with the sampling design used to collect the reference data. This generally provides sufficient information for each user to estimate and compare the accuracy parameters of interest to satisfy the assessment objectives of that project.

This research has been supported by cooperative agreement CR821782 between the Environmental Protection Agency and SUNY-ESF. This manuscript has not been subjected to EPA's peer and policy review, and does not necessarily reflect the views of the Agency. David Verbyla and two reviewers provided several helpful suggestions for improving the manuscript. The Department of Statistics at Oregon State University is acknowledged for supporting this work.

REFERENCES

Agresti, A. (1990), Categorical Data Analysis, Wiley, New York.
Agresti, A. (1996), An Introduction to Categorical Data Analysis, Wiley, New York.
Aronoff, S. (1982), Classification accuracy: a user approach. Photogramm. Eng. Remote Sens. 48:1299-1307.
Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. (1975), Discrete Multivariate Analysis: Theory and Practice, MIT Press, Cambridge, MA.
Brennan, R. L., and Prediger, D. J. (1981), Coefficient kappa: some uses, misuses, and alternatives. Educ. Psychol. Meas. 41:687-699.
Campbell, J. B. (1987), Introduction to Remote Sensing, Guilford, New York.
Cochran, W. G. (1977), Sampling Techniques, 3rd ed., Wiley, New York.
Congalton, R. G. (1991), A review of assessing the accuracy of classifications of remotely sensed data. Remote Sens. Environ. 37:35-46.
Congalton, R. G., Oderwald, R. G., and Mead, R. A. (1983), Assessing Landsat classification accuracy using discrete multivariate analysis statistical techniques. Photogramm. Eng. Remote Sens. 49:1671-1678.
Congalton, R. G., Green, K., and Teply, J. (1993), Mapping old growth forests on national forest and park lands in the Pacific Northwest from remotely sensed data. Photogramm. Eng. Remote Sens. 59:529-535.
Dicks, S. E., and Lo, T. H. C. (1990), Evaluation of thematic map accuracy in a land-use and land-cover mapping program. Photogramm. Eng. Remote Sens. 56:1247-1252.
Dikshit, O., and Roy, D. P. (1996), An empirical investigation of image resampling effects upon the spectral and textural supervised classification of a high spatial resolution multispectral image. Photogramm. Eng. Remote Sens. 62:1085-1092.
Fiorella, M., and Ripple, W. J. (1993), Determining successional stage of temperate coniferous forests with Landsat satellite data. Photogramm. Eng. Remote Sens. 59:239-246.
Fitzgerald, R. W., and Lees, B. W. (1994), Assessing the classification accuracy of multisource remote sensing data. Remote Sens. Environ. 47:362-368.
Fleiss, J. L. (1981), Statistical Methods for Rates and Proportions, 2nd ed., Wiley, New York.
Foody, G. M. (1992), On the compensation for chance agreement in image classification accuracy assessment. Photogramm. Eng. Remote Sens. 58:1459-1460.
Fung, T., and LeDrew, E. (1988), The determination of optimal threshold levels for change detection using various accuracy indices. Photogramm. Eng. Remote Sens. 54:1449-1454.
Gong, P., and Howarth, P. J. (1990), An assessment of some factors influencing multispectral land-cover classification. Photogramm. Eng. Remote Sens. 56:597-603.
Jakubauskas, M. E., Whistler, J. L., Dillworth, M. E., and Martinko, E. A. (1992), Classifying remotely sensed data for use in an agricultural nonpoint-source pollution model. J. Soil Water Conservation 47:179-183.
Janssen, L. L. F., and van der Wel, F. J. M. (1994), Accuracy assessment of satellite derived land-cover data: a review. Photogramm. Eng. Remote Sens. 60:419-426.
Kalkhan, M. A., Reich, R. M., and Czaplewski, R. L. (1995), Statistical properties of five accuracy indices in assessing the accuracy of remotely sensed data using simple random sampling. In Proceedings of the 1995 ACSM/ASPRS Annual Convention, ASPRS Technical Papers, Vol. 1, pp. 246-257.
Kalkhan, M. A., Reich, R. M., and Czaplewski, R. L. (1996), Statistical properties of measures of association and the kappa statistic for assessing the accuracy of remotely sensed data using double sampling. In Spatial Accuracy Assessment in Natural Resources and Environmental Sciences (H. T. Mowrer, R. L. Czaplewski, and R. H. Hamre, Eds.), General Technical Report RM-GTR-277, USDA Forest Service, Fort Collins, CO, pp. 467-476.
Landis, J. R., and Koch, G. G. (1977), The measurement of observer agreement for categorical data. Biometrics 33:159-174.
Lark, R. M. (1995), Components of accuracy of maps with special reference to discriminant analysis on remote sensor data. Int. J. Remote Sens. 16:1461-1480.
Lauver, C. L., and Whistler, J. L. (1993), A hierarchical classification of Landsat TM imagery to identify natural grassland areas and rare species habitat. Photogramm. Eng. Remote Sens. 59:627-634.
Lawrence, R. L., Means, J. E., and Ripple, W. J. (1996), An automated method for digitizing color thematic maps. Photogramm. Eng. Remote Sens. 62:1245-1248.
Lee, J. J., and Tu, Z. N. (1994), A better confidence interval for kappa (κ) on measuring agreement between two raters with binary outcomes. J. Comput. Graph. Stat. 3:301-321.
Ma, Z., and Redmond, R. L. (1995), Tau coefficients for accuracy assessment of classification of remote sensing data. Photogramm. Eng. Remote Sens. 61:435-439.
Marsh, S. E., Walsh, J. L., and Sobrevila, C. (1994), Evaluation of airborne video data for land-cover classification accuracy assessment in an isolated Brazilian forest. Remote Sens. Environ. 48:61-69.
Naesset, E. (1996), Use of the weighted Kappa coefficient in classification error assessment of thematic maps. Int. J. Geogr. Inf. Syst. 10:591-604.
Rosenfield, G. H., and Fitzpatrick-Lins, K. (1986), A coefficient of agreement as a measure of thematic classification accuracy. Photogramm. Eng. Remote Sens. 52:223-227.
Snedecor, G. W., and Cochran, W. G. (1980), Statistical Methods, 7th ed., Iowa State University Press, Ames, IA.
Stenback, J. M., and Congalton, R. G. (1990), Using thematic mapper imagery to examine forest understory. Photogramm. Eng. Remote Sens. 56:1285-1290.
Story, M., and Congalton, R. G. (1986), Accuracy assessment: a user's perspective. Photogramm. Eng. Remote Sens. 52:397-399.
Treitz, P. M., Howarth, P. J., Suffling, R. C., and Smith, P. (1992), Application of detailed ground information to vegetation mapping with high spatial resolution digital imagery. Remote Sens. Environ. 42:65-82.
Vujakovic, P. (1987), Monitoring extensive 'buffer zones' in Africa: an application of satellite imagery. Biol. Conservation 39:195-208.
Zhuang, X., Engel, B. A., Xiong, X., and Johannsen, C. J. (1995), Analysis of classification results of remotely sensed data and evaluation of classification algorithms. Photogramm. Eng. Remote Sens. 61:427-433.