Selecting and Interpreting Measures of Thematic Classification Accuracy
Stephen V. Stehman
spectively). Upper case P will be used to denote summary parameters such as the overall proportion of area correctly classified (P_c), and user's (P_Ui) and producer's (P_Aj) accuracy.

1. Overall proportion of area correctly classified,

   P_c = \sum_{k=1}^{q} p_{kk},   (1)

   where q = number of land-cover categories.

2. Kappa,

   \kappa = \frac{P_c - \sum_{k=1}^{q} p_{k+}\,p_{+k}}{1 - \sum_{k=1}^{q} p_{k+}\,p_{+k}},   (2)

   where p_{k+} = \sum_{j=1}^{q} p_{kj} and p_{+k} = \sum_{i=1}^{q} p_{ik}.

3. Kappa with random chance agreement as defined by Foody (1992),

   \kappa_e = \frac{P_c - 1/q}{1 - 1/q},   (3)

   the subscript e used to indicate that each land-cover class is equally probable under this definition of hypothetical chance agreement.

4. Tau (Ma and Redmond, 1995),

   \tau = \frac{P_c - \sum_{k=1}^{q} \beta_k\,p_{+k}}{1 - \sum_{k=1}^{q} \beta_k\,p_{+k}},   (4)

   where \beta_k is the user-specified a priori probability of membership in map class k.

5. User's accuracy for cover type i, the conditional probability that an area classified as category i by the map is classified as category i by the reference data,

   P_{Ui} = p_{ii}/p_{i+}.   (5)

6. Producer's accuracy for cover type j, the conditional probability that an area classified as category j by the reference data is classified as category j by the map,

   P_{Aj} = p_{jj}/p_{+j}.   (6)

7. Conditional kappa for the map classifications in category (row) i,

   \kappa_i = \frac{p_{ii} - p_{i+}\,p_{+i}}{p_{i+} - p_{i+}\,p_{+i}} = \frac{P_{Ui} - p_{+i}}{1 - p_{+i}}.   (7)

8. Conditional kappa for the reference classifications in category (column) j,

   \kappa_j = \frac{P_{Aj} - p_{j+}}{1 - p_{j+}}.   (8)
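A minimal computational sketch of Eqs. (1)-(8) follows (illustrative NumPy code; the function and variable names are not from the paper, and the convention of rows = map labels, columns = reference labels matches the definitions above). Error Matrix 1 of Table 2 below, reported with P_c = 0.636 and κ = 0.450, is used as a check.

```python
# Sketch of Eqs. (1)-(8) for a population error matrix of proportions.
# Rows index the map classification, columns the reference classification.
import numpy as np

def accuracy_measures(p, beta=None):
    p = np.asarray(p, dtype=float)
    q = p.shape[0]
    row = p.sum(axis=1)          # p_{k+}, map (row) marginal proportions
    col = p.sum(axis=0)          # p_{+k}, reference (column) marginal proportions
    diag = np.diag(p)

    p_c = diag.sum()                                    # Eq. (1)
    chance_kappa = float(np.sum(row * col))
    kappa = (p_c - chance_kappa) / (1 - chance_kappa)   # Eq. (2)
    kappa_e = (p_c - 1 / q) / (1 - 1 / q)               # Eq. (3)
    if beta is None:                                    # default: equal a priori probabilities
        beta = np.full(q, 1 / q)
    chance_tau = float(np.sum(beta * col))
    tau = (p_c - chance_tau) / (1 - chance_tau)         # Eq. (4)
    users = diag / row                                  # Eq. (5), P_{Ui}
    producers = diag / col                              # Eq. (6), P_{Aj}
    kappa_rows = (users - col) / (1 - col)              # Eq. (7), conditional kappa by row
    kappa_cols = (producers - row) / (1 - row)          # Eq. (8), conditional kappa by column
    return dict(p_c=p_c, kappa=kappa, kappa_e=kappa_e, tau=tau,
                users=users, producers=producers,
                kappa_rows=kappa_rows, kappa_cols=kappa_cols)

# Error Matrix 1 of Table 2 (P_c = 0.636, kappa = 0.450):
m1 = [[0.2727, 0.0000, 0.0909],
      [0.1818, 0.1818, 0.0000],
      [0.0000, 0.0909, 0.1818]]
print(accuracy_measures(m1))
```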
These parameters use different information contained in the error matrix (Congalton, 1991) and summarize the error matrix at various levels. For example, P_c, κ, κ_e, and τ provide a single summary measure for the entire error matrix, whereas P_Aj and κ_j provide a summary by columns of the error matrix, and P_Ui and κ_i provide a summary by the rows of the error matrix. Each of these summary measures obscures potentially important detail contained in the error matrix, so that the full error matrix should be reported whenever possible.

SINGLE SUMMARY MEASURES OF THE ERROR MATRIX

Of the single number summary measures, the most basic is P_c. Other accuracy measures are derived from P_c as the starting point. The motivation for κ and τ is to account for agreement between the map and reference classifications that could be attributable to random chance. These parameters start with an observed measure of agreement, P_c, and then adjust P_c by "a hypothetical expected probability of agreement under an appropriate set of baseline constraints" (Landis and Koch, 1977, p. 163). "Adjusting" for chance agreement is better terminology than "correcting" for chance agreement because the latter has the connotation that P_c is somehow an incorrect representation of accuracy. P_c represents a legitimate probability describing one aspect of map data quality. Users may choose to represent accuracy by another parameter, but it is not relevant to claim that P_c is incorrect or that it provides a biased measure of accuracy.

κ and τ measure the extent to which the observed probability of agreement exceeds the probability of agreement expected under the specified hypothetical baseline constraints. As shown by Ma and Redmond (1995), κ uses marginal proportions of the observed map, and τ uses marginal proportions specified prior to constructing the map. The hypothetical nature of the adjustment for "chance agreement" needs to be emphasized. In reality, areas classified correctly "by random chance" are indistinguishable from areas classified correctly because of some more favorable aspect of the classification procedure. That is, it is impossible to go to the map and identify those areas (e.g., pixels or polygons) that have been correctly classified by random chance. If the objective is to describe the accuracy of a final map product, the user of this particular map is probably not concerned with the hypothetical proportion of area classified correctly by random chance. Such areas, even if they could be identified, are still classified correctly, and attributing a hypothetical reason for the classification being correct is irrelevant to applications requiring this map. If the overall map accuracy is 80% (P_c = 0.80), the user holds a map for which a randomly selected area has an 80% chance of being correctly classified. Thus estimating κ or τ for a final map product is not an informative accuracy measure, and P_c, P_Ui, and P_Aj are more relevant accuracy parameters because of their direct interpretation as probabilities characterizing data quality of this particular map.
Reporting κ or τ to summarize a final map product provides a misleading representation of the probability that an area on the map is correctly classified because both κ and τ are always smaller than P_c. This is seen by noting that both κ and τ may be written as (P_c − a)/(1 − a), where a is the hypothetical adjustment for chance agreement. Then it follows that

   \frac{P_c - a}{1 - a} - P_c = \frac{P_c - a - P_c(1 - a)}{1 - a} = \frac{a(P_c - 1)}{1 - a} \le 0

because (P_c − 1) ≤ 0 and (1 − a) ≥ 0. Therefore, both κ and τ underrepresent the true probability of a correct classification. A similar argument applies to the relationship between conditional kappa and user's and producer's accuracy; that is, κ_i ≤ P_Ui and κ_j ≤ P_Aj.
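A quick numerical check of this inequality (illustrative values only, not taken from the paper):

```python
# Any chance adjustment a in [0, 1) shrinks the reported accuracy:
# (P_c - a)/(1 - a) <= P_c.
p_c = 0.80
for a in (0.10, 1.0 / 3.0, 0.50, 0.75):
    adjusted = (p_c - a) / (1.0 - a)
    print(f"a = {a:.3f}  adjusted value = {adjusted:+.3f}  <=  P_c = {p_c}")
```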
When adjusting P_c for chance agreement, κ, κ_e, and τ incorporate different adjustments. Which adjustment is best is difficult to discern and will probably remain a matter of debate. In their detailed discussion of agreement measures, Brennan and Prediger (1981, p. 690) proposed as a key distinction whether the marginal proportions are considered fixed or free to vary: "A margin is 'fixed' whenever the marginal proportions are known to the assigner before classifying the objects into categories," and "a margin is 'free' whenever the marginal proportions are not known to the assigner beforehand." The identification of the marginal proportions of an error matrix as fixed or free determines the measure of chance agreement used to adjust P_c.

The adjustment for chance agreement used in κ is \sum_{k=1}^{q} p_{k+}\,p_{+k}. Agresti (1996, p. 246) cites the dependence of the κ definition of chance agreement on the marginal proportions as a primary source of controversy on the utility of this measure. The κ adjustment is tantamount to assuming fixed map marginal proportions. But this assumption imposes a circularity in reasoning because the map (row) marginal proportions are the result of the classification, not a fixed set of marginal proportions the map construction process was required to match. The adjustment incorporated into κ would be appropriate if the a priori row marginal proportions were specified, and the classification were forced to result in a map with those exact marginal proportions. An example from another application illustrates the point. Suppose a physician will evaluate 100 patients, and each patient must be classified into one of five disease categories. The true disease category is known (although not by the physician) for each patient, and the physician is provided with the proportion of patients in each category and told to match those same marginal proportions in his or her evaluation. In this scenario, the κ measure of chance agreement is justified because the marginal proportions provided by the classification are fixed, and the classification (the physician's evaluation) is constrained to match those marginal proportions. The random chance adjustment takes into account the imposed constraint.

Foody (1992) argued that the measure of chance agreement incorporated into κ is not the proper representation for most accuracy assessment problems because the map margins are not fixed, but free to vary. That is, the classification is not constrained to match specified row marginal proportions. In the absence of any information about the land-cover class of a given area, that area would be classified into one of the q classes with equal probability, so p_{k+} = 1/q. Then assuming independence of the map and reference classifications, chance agreement is still \sum_{k=1}^{q} p_{+k}\,p_{k+}, but substituting p_{k+} = 1/q into the equation leads to (1/q)\sum_{k=1}^{q} p_{+k} = 1/q. This result is the same regardless of how the reference (column) marginal proportions are viewed, either fixed or free, and this chance agreement adjustment results in κ_e. Chance agreement defined for κ_e is smaller than that defined for κ, and this is the basis of Foody's (1992) statement that chance agreement is overestimated by κ.

Ma and Redmond (1995) show that κ_e is a special case of τ, and claim that κ_e is an appropriate measure if unsupervised classification is employed, or if a supervised classification is employed with no a priori specification of class membership. In both cases, the row marginal proportions of the error matrix used in the chance agreement adjustment are 1/q; that is, in the absence of any information about the true class of a particular area of the map, the area will be classified into one of the q land-cover categories with equal probability. If a supervised classification is employed and the a priori class membership probabilities specified are not equal, then the measure of chance agreement used by τ is \sum_{k=1}^{q} \beta_k\,p_{+k}, where \beta_k is the a priori probability of classifying an area into category k. In this case, the measure of chance agreement used in τ is based on the premise that some information about the true class of an area exists. That information is contained in the \beta_k values specified. So now random agreement does not imply the complete absence of prior information about the true class of an area, which is how chance agreement is defined for κ_e.
Although a priori probabilities are specified in a supervised classification, the classification procedure is not constrained to match these specified probabilities, so that the measure of chance agreement used in τ is independent of the row marginal proportions of the population error matrix obtained. Ma and Redmond (1995) claim this independence is a desirable feature of τ. But this independence leads to the conceptually disconcerting consequence that two maps having the exact same population error matrices may result in different τ coefficients simply because the a priori probabilities are different for the two maps. Further, if the map marginal proportions
are not forced to match β_k, it is unclear how to interpret τ in the Brennan and Prediger (1981) framework. The map (row) marginal proportions are still free to vary, so that perhaps the interpretation is that if the map marginals (p_{k+}) had been forced to match the a priori probabilities (β_k), then random agreement would be as measured by τ.

Because chance agreement is a hypothetical construct, the various definitions invoked lead to the different accuracy parameters κ, κ_e, and τ. Each parameter assumes different a priori information about the true land-cover class of an area, so that it is difficult to claim that one measure is better than another. These measures are simply different. Choosing among these accuracy parameters raises the question of how to represent accuracy, whether to adjust for hypothetical chance agreement at all, and if an adjustment is incorporated, which measure to use. In some cases, all parameters lead to similar conclusions, but in other applications, the conclusions will differ (see the first subsection of the section on Testing Overall Map Accuracy). Numerous other accuracy measures could be defined, and some are reviewed by Kalkhan et al. (1995; 1996). Bishop et al. (1975) and Agresti (1990) distinguish between measures of association and measures of agreement, and state that strong association in a contingency table (error matrix) does not imply high agreement. Therefore, measures of association should not be applied to accuracy assessment problems.

Table 2. Example Population Error Matrices and Associated Accuracy Parameters

Class      A        B        C        p_{i+}   P_{Ui}

Error Matrix 1 (P_c = 0.636, κ = 0.450)
A        0.2727   0.0000   0.0909   0.3636   0.750
B        0.1818   0.1818   0.0000   0.3636   0.500
C        0.0000   0.0909   0.1818   0.2727   0.667
p_{+j}   0.4545   0.2727   0.2727
P_{Aj}   0.600    0.667    0.667

Error Matrix 2 (P_c = 0.636, κ = 0.450)
A        0.3636   0.0000   0.0000   0.3636   1.000
B        0.0909   0.1364   0.1364   0.3637   0.375
C        0.0000   0.1364   0.1364   0.2728   0.500
p_{+j}   0.4545   0.2728   0.2728
P_{Aj}   0.800    0.500    0.500

Error Matrix 3 (P_c = 0.660, κ = 0.370)
A        0.4500   0.1100   0.0400   0.60     0.75
B        0.1500   0.1500   0.0000   0.30     0.50
C        0.0000   0.0400   0.0600   0.10     0.60
p_{+j}   0.60     0.30     0.10
P_{Aj}   0.75     0.50     0.60

Error Matrix 4 (P_c = 0.660, κ = 0.469)
A        0.3600   0.1000   0.1400   0.60     0.60
B        0.0400   0.2000   0.0600   0.30     0.67
C        0.0000   0.0000   0.1000   0.10     1.000
p_{+j}   0.40     0.30     0.30
P_{Aj}   0.90     0.67     0.33

Error Matrix 5 (P_c = 0.660, κ = 0.490)
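The contrast between the two chance-agreement baselines can be checked directly from Error Matrix 3 above (an illustrative computation, not part of the original table): the κ chance term \sum p_{k+} p_{+k} = 0.46 exceeds the equal-probability baseline 1/q = 1/3, so κ = 0.370 while κ_e = 0.490 for the same matrix.

```python
# Chance-agreement baselines for Error Matrix 3 of Table 2
# (rows = map, columns = reference; entries are population proportions).
import numpy as np

m3 = np.array([[0.45, 0.11, 0.04],
               [0.15, 0.15, 0.00],
               [0.00, 0.04, 0.06]])
q = m3.shape[0]
p_c = np.trace(m3)                          # 0.66
row, col = m3.sum(axis=1), m3.sum(axis=0)

chance_kappa = float(np.sum(row * col))     # sum of p_{k+} p_{+k} = 0.46
chance_equal = 1.0 / q                      # 1/3, the kappa_e baseline

kappa = (p_c - chance_kappa) / (1 - chance_kappa)     # 0.370, as reported in Table 2
kappa_e = (p_c - chance_equal) / (1 - chance_equal)   # 0.490
print(round(chance_kappa, 3), round(kappa, 3), round(kappa_e, 3))
```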
Table 3. Accuracy Statistics Computed from Fitzgerald and Lees (1994, Table 4) Including the Sea Class
Fitzgerald and Lees (1994) purported to have demonstrated "that the accepted method of assessing classification accuracy, the overall accuracy percentage [P_c], is misleading especially so when applied at the class comparison level." They further stated that κ is a more "sophisticated measure of interclassifier agreement than the overall accuracy and gives better interclass discrimination than the overall accuracy." Fitzgerald and Lees' (hereafter FL) preference for κ was based on their class-level comparisons. For each land-cover category, they collapsed the full error matrix into a 2×2 table. For example, to estimate their overall agreement proportion for the class dry sclerophyll, they collapsed the entire error matrix into two classes, "dry sclerophyll" and "not dry sclerophyll." The diagonal entries of this collapsed error matrix are then summed and divided by the total sample size to get the estimated overall proportion correct, P̂_c. Their κ̂ statistic is also computed from the collapsed 2×2 table. The FL results are shown in the last two columns of Table 3, and the discrepancy between κ̂ and P̂_c is apparent.
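The collapsing procedure is straightforward to state computationally; the sketch below (illustrative code using a small hypothetical three-class count matrix, not FL's data) collapses an error matrix to a 2×2 table for one focal class and computes P̂_c and κ̂ from the collapsed table.

```python
# Collapse a q x q error matrix (counts or proportions) to a 2 x 2 table for one
# focal class ("class i" vs "not class i"), then compute P_c and kappa from it.
import numpy as np

def collapse_to_2x2(matrix, focal):
    m = np.asarray(matrix, dtype=float)
    others = [k for k in range(m.shape[0]) if k != focal]
    collapsed = np.zeros((2, 2))
    collapsed[0, 0] = m[focal, focal]                 # mapped focal, reference focal
    collapsed[0, 1] = m[focal, others].sum()          # mapped focal, reference other
    collapsed[1, 0] = m[others, :][:, [focal]].sum()  # mapped other, reference focal
    collapsed[1, 1] = m[np.ix_(others, others)].sum() # mapped other, reference other
    return collapsed

def pc_and_kappa(m):
    p = m / m.sum()
    p_c = np.trace(p)
    chance = float(np.sum(p.sum(axis=1) * p.sum(axis=0)))
    return p_c, (p_c - chance) / (1 - chance)

# Hypothetical 3-class matrix of sample counts (not FL's data):
counts = np.array([[50, 5, 3],
                   [4, 60, 6],
                   [2, 7, 63]])
print(pc_and_kappa(collapse_to_2x2(counts, focal=0)))
```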
Estimating P_c from this collapsed error matrix creates one representation of class-level accuracy and is the approach taken by Fleiss (1981, Chap. 13). For the collapsed 2×2 table, P_c represents the accuracy for a dichotomous classification, for example, accuracy of a "dry sclerophyll" and "not dry sclerophyll" classification. The dichotomous classification results in a significant loss of information, and the objective motivating this perspective of class-level accuracy is substantially different from the objective of evaluating the accuracy of the dry sclerophyll class within the context of the nine-category classification represented by the full error matrix. If the objectives call for a two-category classification system, then the FL assessment is appropriate. The class-level P̂_c values computed by FL are high because the sea class dominates the sample size, and this class has extremely high accuracy. This makes any dichotomous classification very accurate. The FL class-level κ̂ statistics differ greatly from the P̂_c values also because of the dominance of the sea class in the sample. Chance agreement defined by κ will be very high in these collapsed 2×2 tables, so that the κ̂ values are much smaller than the P̂_c values.

For these same data, user's and producer's accuracy and conditional kappa are computed (first four columns of Table 3). These measures evaluate class-level accuracy within the context of the full nine-category classification scheme. In this representation, a much different conclusion from that presented by FL is obtained. The proportion correct, as measured by user's and producer's accuracy, differs little from the corresponding conditional κ̂, and the kappa statistic does not provide a better or even different discrimination based on agreement among the classes. The relative rankings of the different classes are exactly the same whether the proportion correctly classified or the kappa statistic is used, and "the disparity in the relative rankings of the overall accuracy values and the Kappa values" noted by FL (p. 366) is an artifact of the collapsed 2×2 tables formed in their definition of class-level accuracy. In general, the FL error matrices do not demonstrate that kappa is "a more rigorous and discerning statistical tool for measuring the classification accuracy of different classifiers" except when class-level accuracy is defined according to their collapsed-class representation. Their conclusions do not generalize to other common representations of class-level accuracy.

Analysis of the error matrices with the dominating sea class excluded (Table 4) provides additional interesting insights into uses of the information in an error matrix. For this eight-category classification (land classes only), the estimated values for P_c are 0.511 for the NN classifier and 0.508 for the DT classifier, and the estimated κ is 0.395 for NN and 0.389 for DT. Although the P̂_c and κ̂ values differ, both measures suggest little difference between the two classifiers for the overall error matrix. The class-level accuracies are represented by user's accuracy and conditional κ (for rows). The class-level accuracies achieved by the NN and DT classifiers are generally similar, but the accuracies for classes 3 and possibly 8 are sufficiently higher for the NN classifier that this might convey an important practical advantage relative to the DT classifier. Such accuracy differences may be important, but they are not evident from the comparison of the single number summary measures, P̂_c and κ̂. This illustrates why class-level evaluations are often important.
Table 4. Accuracy Statistics Computed from Fitzgerald and Lees (1994) Excluding the Sea Class

            P̂_Ui              κ̂_i
Class     NN      DT        NN      DT
1        0.575   0.613     0.406   0.459
2        0.396   0.383     0.356   0.344
3        0.571   0.401     0.550   0.378
4        0.475   0.433     0.324   0.265
5        0.451   0.420     0.344   0.303
6        0.593   0.554     0.551   0.509
7        0.337   0.355     0.279   0.300
8        0.943   0.879     0.941   0.873
If the NN classifier is compared to the DT classifier category by category, the ordering of the classifiers is the same whether accuracy is measured by user's accuracy or conditional κ. For a particular classifier, the ordering of the land-cover classes obtained from user's accuracy and conditional κ is not the same. Not surprisingly, differences in the order occur for those classes that have relatively close values of P̂_Ui or κ̂_i.

These examples illustrate that project objectives may dictate that user's accuracy or producer's accuracy for one or more classes is a high priority, and certain misclassifications may be extremely critical while other errors are less important. A more detailed analysis of the error matrix focusing on selected user's and producer's accuracies or particular cell probabilities (p_ij) can be done to customize the assessment more closely to project objectives. A weighted kappa statistic (Fleiss, 1981; Naesset, 1996) has been proposed as an overall measure of agreement in which the importance of different misclassifications to the user's objectives can be incorporated into the accuracy measure. This weighting feature is exactly the type of approach that should be employed to link accuracy measures more closely to mapping objectives. Unfortunately, embedding the weighting within the context of a kappa framework results in a measure that suffers from all of the same definition and interpretation problems inherent in κ (see the third section on Further Discussion of κ). Consequently, weighted kappa cannot be recommended as a useful accuracy measure.

TESTING OVERALL MAP ACCURACY

Hypothesis testing may sometimes be required to address accuracy assessment objectives. For example, if a map must have an overall accuracy of P_c = 0.70 to meet contractual specifications, a test of the null hypothesis H₀: P_c ≤ 0.70 can be made against the alternative hypothesis H_a: P_c > 0.70. Hypothesis tests can be constructed for P_c, κ, κ_e, τ, or other accuracy parameters. Some general issues pertaining to hypothesis testing in accuracy assessment are reviewed here, with the specific focus being tests based on κ.

Testing the null hypothesis H₀: κ = 0 evaluates whether overall accuracy exceeds that of chance agreement. Fleiss (1981), Agresti (1990), and Janssen and van der Wel (1994) suggest that because map accuracy is anticipated to exceed agreement expected by random chance, testing the hypothesis that κ = 0 is often not relevant. A more informative test is to determine if κ exceeds some hypothesized value, say κ₀. The values of κ₀ may be various cutoffs such as those suggested by Landis and Koch (1977) for moderate (0.41-0.60), substantial (0.61-0.80), and almost perfect (0.81-1.0) agreement. For example, a test of the hypothesis that accuracy beyond that expected by random chance may be considered "substantial" translates into testing H₀: κ ≤ 0.61 versus H_a: κ > 0.61. Fleiss (1981, p. 221) presents the formulas for carrying out such a test.
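A generic large-sample version of such a one-sided test is sketched below (an illustrative implementation with hypothetical estimates and standard errors; it is not necessarily the exact formulation given by Fleiss, and the standard error must come from the sampling design actually used):

```python
# One-sided, large-sample test of H0: theta <= theta0 vs Ha: theta > theta0,
# where theta is an accuracy parameter (e.g., P_c or kappa) with estimate
# theta_hat and design-based standard error se.
from math import erf, sqrt

def one_sided_z_test(theta_hat, theta0, se):
    z = (theta_hat - theta0) / se
    p_value = 1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0)))   # 1 - Phi(z)
    return z, p_value

# Hypothetical numbers for illustration only:
print(one_sided_z_test(theta_hat=0.74, theta0=0.70, se=0.02))   # test of P_c against 0.70
print(one_sided_z_test(theta_hat=0.66, theta0=0.61, se=0.03))   # test of kappa against "substantial"
```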
same whether accuracy is measured by user’s accuracy
As with any hypothesis test, the power of the test
or conditional IC. For a particular classifier, the ordering
and practical importance of statistically significant differ-
of the land-cover classes obtained from user’s accuracy
ences found must be considered (Aronoff’, 1982). If the
and conditional K is not the same. Not surprisingly, dif-
sample size is large, the null hypothesis may be rejected
ferences in the order occur for those classes that have
even when the observed kappa statistic (12) does not ex-
relatively close values of P, or &.
ceed the hypothesized value (K,) by a large amount. For
These examples illustrate that project objectives may example, Fitzgerald and Lees (1994) reported numerous
dictate that user’s accuracy or producer’s accuracy for tests of the null hypothesis H,,: K=O, and rejected this
one or more classes is a high priority, and certain mis- hypothesis for an observed R as small as 0.077. ri-rO.077
classifications may be extremely critical while other er- is clearly not indicative of practically better agreement
rors are less important. A more detailed analysis of the beyond that expected by random chance, yet this small
error matrix focusing on selected user’s and producer’s K is evidence to claim K is statistically separable from 0.
accuracies or particular cell probabilities (pJ can be Because the sample size in the Fitzgerald and Lees
done to customize the assessment more closely-to project (1994) example is large (n=62.727), this particular test is
objectives. A weighted kappa statistic (Fleiss, 1981; extremely powerful so that even practically unimportant
Naesset, 1996) has been proposed as an overall measure differences from K=O will result in rejection of H,,. It
of agreement in which the importance of different mis- should be routine data analysis practice to consider both
classifications to the user’s objectives can be incorporated the statistical significance and practical importance of any
into the accuracy measure. This weighting feature is ex- hypothesis test result.
actly the type of approach that should be employed to Confidence intervals provide important descriptive
link accuracy measures more closely to mapping objec- information and can be used to conduct hypothesis tests.
tives. Unfortunately, embedding the weighting within the Basic description for accuracy assessment should include
context of a kappa framework results in a measure that estimates of parameters of interest (e.g., P,., K, &,, PI,,,
suffers from all of the same definition and interpretation or K,) accompanied by standard errors. An approximate
problems inherent in K (see the third section on Further confidence interval is constructed via the formula
Discussion of K). Consequently, weighted kappa cannot J?tz*SE(i), where j? is an estimate of the parameter of
be recommended as a useful accuracy measure. interest, z is a percentile from the standard normal distri-
bution corresponding to the specified confidence level,
and SE(@ is the standard error of i?. Both i? and SE(g)
TESTING OVERALL h&U’ GCCUaACY
depend OII the sampling design used to collect the refer-
Hypothesis testing may sometimes be required to ad- ence data. These confidence intervals assume that the
dress accuracy assessment objectives. For example, if a reference sample size is large enough to justify use of a
map must have an overall accuracy of ~,=O.70 to meet normal approximation for the sampling distribution of B.
contractual specifications, a test of the null hypothesis For estimates based on the entire error matrix such as
H,,: ~,.~0.70, can be made against the alternative hypoth- PC and R, sample sizes are likely to be adequately large
esis H,,: P,.>O.70. Hypothesis tests can be constructed for to satisfy this assumption. For those csstimatc,s based OIJ
rows or columns of the error matrix, such as κ̂_i, P̂_Ui, and P̂_Aj, sample sizes may be small for rare classes and the normal approximation will not be justified.
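A brief sketch of the interval calculation (illustrative only; the binomial standard error shown assumes the overall proportion correct is estimated from a simple random sample, and other designs or parameters require different standard errors):

```python
# Approximate confidence interval: theta_hat +/- z * SE(theta_hat).
# The SE below uses the binomial form sqrt(p(1-p)/n), appropriate for an overall
# proportion estimated from a simple random sample of n reference sites.
from math import sqrt

def normal_ci(theta_hat, se, z=1.96):          # z = 1.96 for approximately 95% confidence
    return theta_hat - z * se, theta_hat + z * se

n = 500                                        # hypothetical reference sample size
p_c_hat = 0.80                                 # hypothetical estimate of P_c
se = sqrt(p_c_hat * (1.0 - p_c_hat) / n)
print(normal_ci(p_c_hat, se))                  # roughly (0.765, 0.835)
```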
Comparison of Error Matrices: Tests Based on a Single Summary Measure

Table 5. Accuracy Statistics and Pairwise Comparison Tests for Four Classification Algorithms Reported in Congalton et al. (1983)

Algorithm                          n      P̂_c     κ̂
4  modified clustering            632    0.859   0.718
2  nonsupervised, 20 clusters     659    0.785   0.586
1  nonsupervised, 10 clusters     659    0.766   0.605
3  modified supervised            646    0.714   0.476

z Statistics for Pairwise Comparisons of Accuracy of the Four Algorithms
             κ̂       P̂_c
1 vs. 2     0.48     0.79
1 vs. 3     3.01     2.17
1 vs. 4    -2.94    -4.32
2 vs. 3     2.43     2.96
2 vs. 4    -3.28    -3.53
3 vs. 4    -5.62    -6.46

When comparing the accuracy of two or more maps, the primary focus is to rank or order the maps, and to provide some measure of the magnitude of differences in accuracy among the maps. Such comparisons could be based on P_c, τ, κ, or κ_e. Once again, differing opinions have been proffered on which parameter to use. For example, Janssen and van der Wel (1994, p. 424) state that "PCC values [P_c] cannot be compared in a straightforward way" and suggest normalizing the error matrix or using κ to make such comparisons. Although the meaning of "straightforward" is open to interpretation, different error matrices can in fact be compared using P_c, and normalizing an error matrix is shown in the next section to be a questionable analysis strategy. Assuming that the reference samples for the error matrices are independent simple random samples of size n₁ and n₂, so that the estimates P̂_c1 and P̂_c2 are independent, a test of H₀: P_c1 = P_c2 can be constructed.
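One standard large-sample form of this two-map comparison is sketched below (an illustrative implementation under the independence and simple random sampling assumptions stated above; the estimates and sample sizes are hypothetical, and this is not necessarily the statistic used to produce Table 5):

```python
# Two-sample z test of H0: P_c1 = P_c2 for overall proportions correct estimated
# from independent simple random samples of sizes n1 and n2.
from math import erf, sqrt

def compare_pc(p1, n1, p2, n2):
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)             # unpooled standard error
    z = (p1 - p2) / se
    p_value = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))  # two-sided p-value
    return z, p_value

print(compare_pc(p1=0.82, n1=400, p2=0.76, n2=350))                # hypothetical maps 1 and 2
```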
… land-cover scheme (Table 6). The ordering of the six methods is slightly different depending on whether P̂_c or κ̂ is used, but the discrepancies again occur when the accuracy of two methods is nearly similar …
Congalton, R. G., Oderwald, R. G., and Mead, R. A. (1983), Assessing Landsat classification accuracy using discrete multivariate analysis statistical techniques. Photogramm. Eng. Remote Sens. 49:1671-1678.
Congalton, R. G., Green, K., and Teply, J. (1993), Mapping old growth forests on national forest and park lands in the Pacific Northwest from remotely sensed data. Photogramm. Eng. Remote Sens. 59:529-535.
Dicks, S. E., and Lo, T. H. C. (1990), Evaluation of thematic map accuracy in a land-use and land-cover mapping program. Photogramm. Eng. Remote Sens. 56:1247-1252.
Dikshit, O., and Roy, D. P. (1996), An empirical investigation of image resampling effects upon the spectral and textural supervised classification of a high spatial resolution multispectral image. Photogramm. Eng. Remote Sens. 62:1061092.
Fiorella, M., and Ripple, W. J. (1993), Determining successional stage of temperate coniferous forests with Landsat satellite data. Photogramm. Eng. Remote Sens. 59:239-246.
Fitzgerald, R. W., and Lees, B. W. (1994), Assessing the classification accuracy of multiresource remote sensing data. Remote Sens. Environ. 47:362-368.
Fleiss, J. L. (1981), Statistical Methods for Rates and Proportions, 2nd ed., Wiley, New York.
Foody, G. M. (1992), On the compensation for chance agreement in image classification accuracy assessment. Photogramm. Eng. Remote Sens. 58:1459-1460.
Fung, T., and LeDrew, E. (1988), The determination of optimal threshold levels for change detection using various accuracy indices. Photogramm. Eng. Remote Sens. 54:1449-1454.
Gong, P., and Howarth, P. J. (1990), An assessment of some factors influencing multispectral land-cover classification. Photogramm. Eng. Remote Sens. 56:597-603.
Jakubauskas, M. E., Whistler, J. L., Dillworth, M. E., and Martinko, E. A. (1992), Classifying remotely sensed data for use in an agricultural nonpoint-source pollution model. J. Soil Water Conservation 47:179-183.
Janssen, L. L. F., and van der Wel, F. J. M. (1994), Accuracy assessment of satellite derived land-cover data: a review. Photogramm. Eng. Remote Sens. 60:419-426.
Kalkhan, M. A., Reich, R. M., and Czaplewski, R. L. (1995), Statistical properties of five accuracy indices in assessing the accuracy of remotely sensed data using simple random sampling. In Proceedings of the 1995 ACSM/ASPRS Annual Convention, ASPRS Technical Papers, Vol. 1, pp. 246-257.
Kalkhan, M. A., Reich, R. M., and Czaplewski, R. L. (1996), Statistical properties of measures of association and the kappa statistic for assessing the accuracy of remotely sensed data using double sampling. In Spatial Accuracy Assessment in Natural Resources and Environmental Sciences (H. T. Mowrer, R. L. Czaplewski, and R. H. Hamre, Eds.), General Technical Report RM-GTR-277, USDA Forest Service, Fort Collins, CO, pp. 467-476.
Landis, J. R., and Koch, G. G. (1977), The measurement of observer agreement for categorical data. Biometrics 33:159-174.
Lark, R. M. (1995), Components of accuracy of maps with special reference to discriminant analysis on remote sensor data. Int. J. Remote Sens. 16:1461-1480.
Lauver, C. L., and Whistler, J. L. (1993), A hierarchical classification of Landsat TM imagery to identify natural grassland areas and rare species habitat. Photogramm. Eng. Remote Sens. 59:627-634.
Lawrence, R. L., Means, J. E., and Ripple, W. J. (1996), An automated method for digitizing color thematic maps. Photogramm. Eng. Remote Sens. 62:1245-1248.
Lee, J. J., and Tu, Z. N. (1994), A better confidence interval for kappa (κ) on measuring agreement between two raters with binary outcomes. J. Comput. Graph. Stat. 3:301-321.
Ma, Z., and Redmond, R. L. (1995), Tau coefficients for accuracy assessment of classification of remote sensing data. Photogramm. Eng. Remote Sens. 61:435-439.
Marsh, S. E., Walsh, J. L., and Sobrevila, C. (1994), Evaluation of airborne video data for land-cover classification accuracy assessment in an isolated Brazilian forest. Remote Sens. Environ. 48:61-69.
Naesset, E. (1996), Use of the weighted Kappa coefficient in classification error assessment of thematic maps. Int. J. Geogr. Inf. Syst. 10:591-604.
Rosenfield, G. H., and Fitzpatrick-Lins, K. (1986), A coefficient of agreement as a measure of thematic classification accuracy. Photogramm. Eng. Remote Sens. 52:223-227.
Snedecor, G. W., and Cochran, W. G. (1980), Statistical Methods, 7th ed., Iowa State University Press, Ames, IA.
Stenback, J. M., and Congalton, R. G. (1990), Using thematic mapper imagery to examine forest understory. Photogramm. Eng. Remote Sens. 56:1285-1290.
Story, M., and Congalton, R. G. (1986), Accuracy assessment: a user's perspective. Photogramm. Eng. Remote Sens. 52:397-399.
Treitz, P. M., Howarth, P. J., Suffling, R. C., and Smith, P. (1992), Application of detailed ground information to vegetation mapping with high spatial resolution digital imagery. Remote Sens. Environ. 42:65-82.
Vujakovic, P. (1987), Monitoring extensive 'buffer zones' in Africa: an application of satellite imagery. Biol. Conservation 39:195-208.
Zhuang, X., Engel, B. A., Xiong, X., and Johannsen, C. J. (1995), Analysis of classification results of remotely sensed data and evaluation of classification algorithms. Photogramm. Eng. Remote Sens. 61:427-433.