Remote Sensing of Environment 148 (2014) 42–57

Contents lists available at ScienceDirect

Remote Sensing of Environment


journal homepage: www.elsevier.com/locate/rse

Review

Good practices for estimating area and assessing accuracy of land change
Pontus Olofsson a,⁎, Giles M. Foody b, Martin Herold c, Stephen V. Stehman d, Curtis E. Woodcock a, Michael A. Wulder e

a Department of Earth and Environment, Boston University, 685 Commonwealth Avenue, Boston, MA 02215, USA
b School of Geography, University of Nottingham, University Park, Nottingham NG7 2RD, UK
c Laboratory of Geo-Information Science and Remote Sensing, Wageningen University, Droevendaalsesteeg 3, 6708 Wageningen, The Netherlands
d Department of Forest and Natural Resources Management, State University of New York, 1 Forestry Drive, Syracuse, NY 13210, USA
e Canadian Forest Service (Pacific Forestry Centre), Natural Resources Canada, Victoria, BC 12 V8Z 1M5, Canada

Article info

Article history:
Received 30 May 2013
Received in revised form 15 January 2014
Accepted 22 February 2014
Available online xxxx

Keywords:
Accuracy assessment
Sampling design
Response design
Area estimation
Land change
Remote sensing

Abstract

The remote sensing science and application communities have developed increasingly reliable, consistent, and robust approaches for capturing land dynamics to meet a range of information needs. Statistically robust and transparent approaches for assessing accuracy and estimating area of change are critical to ensure the integrity of land change information. We provide practitioners with a set of “good practice” recommendations for designing and implementing an accuracy assessment of a change map and estimating area based on the reference sample data. The good practice recommendations address the three major components: sampling design, response design and analysis. The primary good practice recommendations for assessing accuracy and estimating area are: (i) implement a probability sampling design that is chosen to achieve the priority objectives of accuracy and area estimation while also satisfying practical constraints such as cost and available sources of reference data; (ii) implement a response design protocol that is based on reference data sources that provide sufficient spatial and temporal representation to accurately label each unit in the sample (i.e., the “reference classification” will be considerably more accurate than the map classification being evaluated); (iii) implement an analysis that is consistent with the sampling design and response design protocols; (iv) summarize the accuracy assessment by reporting the estimated error matrix in terms of proportion of area and estimates of overall accuracy, user's accuracy (or commission error), and producer's accuracy (or omission error); (v) estimate area of classes (e.g., types of change such as wetland loss or types of persistence such as stable forest) based on the reference classification of the sample units; (vi) quantify uncertainty by reporting confidence intervals for accuracy and area parameters; (vii) evaluate variability and potential error in the reference classification; and (viii) document deviations from good practice that may substantially affect the results. An example application is provided to illustrate the recommended process.
© 2014 Elsevier Inc. All rights reserved.
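
As a rough illustration of recommendations (iv) to (vi), the following Python sketch computes an error matrix in terms of area proportions, the three accuracy measures, and an area estimate with a 95% confidence interval. It assumes a stratified random sample in which the strata are the map classes, and it uses the standard stratified estimators for that design, which the abstract itself does not spell out. The function name, the sample counts, the stratum weights and the total area are all hypothetical values chosen for easy arithmetic.

```python
import math


def stratified_estimates(counts, weights, total_area, z=1.96):
    """Sketch of accuracy and area estimation for a map-stratified sample.

    counts[i][j] : sample units mapped as class i with reference class j
    weights[i]   : proportion of the region mapped as class i (sums to 1)
    total_area   : area of the region of interest (any unit, e.g. ha)
    """
    q = len(counts)
    n_i = [sum(row) for row in counts]  # sample size per map stratum

    # Error matrix as estimated area proportions: p_ij = W_i * n_ij / n_i
    p = [[weights[i] * counts[i][j] / n_i[i] for j in range(q)]
         for i in range(q)]
    row_tot = [sum(p[i]) for i in range(q)]                       # mapped
    col_tot = [sum(p[i][j] for i in range(q)) for j in range(q)]  # reference

    overall = sum(p[k][k] for k in range(q))
    users = [p[k][k] / row_tot[k] for k in range(q)]      # 1 - commission error
    producers = [p[k][k] / col_tot[k] for k in range(q)]  # 1 - omission error

    # Standard error of each estimated reference-class area proportion
    se = []
    for j in range(q):
        var = 0.0
        for i in range(q):
            f = counts[i][j] / n_i[i]
            var += weights[i] ** 2 * f * (1.0 - f) / (n_i[i] - 1)
        se.append(math.sqrt(var))

    return {
        "proportions": p,
        "overall": overall,
        "users": users,
        "producers": producers,
        "area": [total_area * c for c in col_tot],
        "ci_half_width": [z * total_area * s for s in se],
    }


# Hypothetical two-class example: rows are map classes, columns are
# reference classes, in the order (change, no change).
counts = [[40, 10],
          [5, 95]]
weights = [0.1, 0.9]   # 10% of the region is mapped as change
result = stratified_estimates(counts, weights, total_area=1000.0)

print("overall accuracy:", round(result["overall"], 3))
print("user's accuracy:", [round(u, 3) for u in result["users"]])
print("producer's accuracy:", [round(v, 3) for v in result["producers"]])
print("area of change (ha):", round(result["area"][0], 1),
      "+/-", round(result["ci_half_width"][0], 1))
```

With these made-up counts the map shows 100 ha of change (a weight of 0.1 applied to 1000 ha), whereas the sample-based estimate is 125 ha with a 95% confidence half-width of about 40 ha. The gap between the two figures is the kind of pixel-counting bias that an area estimate based on the reference classification is meant to correct.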

Contents

1. Introduction
1.1. Good practice recommendations
1.2. Context of good practice recommendations
2. Sampling design
2.1. Choosing the sampling design
2.1.1. Strata
2.1.2. Cluster sampling
2.1.3. Systematic vs. random selection
2.2. A recommended good practice sampling design
3. Response design
3.1. Spatial assessment unit
3.2. Sources of reference data
3.3. Reference labeling protocol
3.4. Defining agreement
3.5. Reference classification uncertainty: geolocation and interpreter variability
4. Analysis
4.1. The error matrix
4.2. General principles of estimation for good practice
4.3. Estimating accuracy
4.4. Estimating area
5. Example of good practices: estimating area and assessing accuracy of forest change
5.1. Sampling design
5.1.1. Determining the sample size
5.1.2. Determine sample allocation to strata
5.2. Estimating accuracy, area and confidence intervals
5.2.1. Estimating accuracy
5.2.2. Estimating area and uncertainty
6. Summary
6.1. General
6.2. Sampling design
6.3. Response design
6.4. Analysis
Acknowledgments
References

⁎ Corresponding author. Tel.: +1 617 353 9734; fax: +1 617 353 8399.
E-mail address: [email protected] (P. Olofsson).
URL: http://people.bu.edu/olofsson (P. Olofsson).

http://dx.doi.org/10.1016/j.rse.2014.02.015
0034-4257/© 2014 Elsevier Inc. All rights reserved.

1. Introduction

Land change maps quantify a wide range of processes including wildfire (Schroeder, Wulder, Healey, & Moisen, 2011), forest harvest (Olofsson et al., 2011), forest disturbance (Huang et al., 2010), land use pressure (Drummond & Loveland, 2010) and urban expansion (Jeon, Olofsson, & Woodcock, 2013). Map users and producers are acutely interested in communicating and understanding the quality of these maps. Accordingly, guidance on how to assess accuracy of these maps in a consistent and transparent manner is a necessity. Uses of remote sensing products depicting change for scientific, management, or policy support activities all require quantitative accuracy statements to buttress the confidence in the information generated and in any subsequent reporting or inferences made. Area estimation, whether of change in land cover/use or of status of land cover/use at a single date, is a natural value-added use of land change maps in many local, national and global land accounting applications. For example, the amount of land area allocated for a specific use is a key country reporting requirement to the United Nations (UN) Food and Agriculture Organization (FAO) statistics and the global forest resources' assessment (FAO, 2010) as well as for countries reporting under the Kyoto protocol and the evolving activities for the UN Collaborative Programme on Reducing Emissions from Deforestation and Forest Degradation, UN-REDD (Grassi, Monni, Federici, Achard, & Mollicone, 2008; UN-REDD, 2008). Estimates of forest extent or deforestation are often derived via remote sensing (cf. Achard et al., 2002; DeFries et al., 2002; Hansen, Stehman, & Potapov, 2010), and area estimation also plays a prominent role in ongoing efforts to establish scientifically valid protocols for forest change monitoring in the context of specific accounting applications to policy approaches for reducing greenhouse gas emissions from forests (DeFries et al., 2007; GOFC-GOLD, 2011).

A key strength of remote sensing is that it enables spatially exhaustive, wall-to-wall coverage of the area of interest. However, as might be expected with any mapping process, the results are rarely perfect. Placing spatially and categorically continuous conditions into discrete classes may result in confusion at the categorical transitions. Error can also result from the change mapping process, the data used, and analyst biases (Foody, 2010). Change detection and mapping approaches using remotely sensed data are increasingly robust, with improvements aimed at the mitigation of these sources of error. However, any map made from remotely sensed data can be assumed to contain some error, with the areas calculated from the map (e.g., pixel counting) also potentially subject to bias. An accuracy assessment identifies the errors of the classification, and the sample data can be used for estimating both accuracy and area along with the uncertainty of these estimates. While the notion of accuracy assessment is well-established within the remote sensing community (Foody, 2002; Strahler et al., 2006), studies of land change routinely fail to assess the accuracy of the final change maps and few published studies of land change make full use of the information obtained from accuracy assessments (Olofsson, Foody, Stehman, & Woodcock, 2013).

1.1. Good practice recommendations

In this article, we synthesize the current status of key steps and methods that are needed to complete an accuracy assessment of a land change map and to estimate area of land change. This article addresses the fundamental protocols required to produce scientifically rigorous and transparent estimates of accuracy and area. The set of good practice recommendations provides guidelines to assist both scientists and practitioners in the design and implementation of accuracy assessment and area estimation methods applied to land change assessments using remote sensing. The accuracy and area estimation objectives are linked via a map of change. A change map provides a spatially explicit depiction of change and this spatial information can be readily aggregated to calculate the total mapped area or the proportion of mapped area of change for the region of interest (ROI). Accuracy assessment addresses questions related to how well locations of mapped change correspond to actual areas of change. A fundamental premise of the recommended good practices methodology is that the change map will be subject to an accuracy assessment based on a sample of higher quality change information (i.e., the reference classification). The higher quality reference classification is compared to the map classification on a location-specific basis to quantify accuracy of the change map and to estimate area. Although it is possible to estimate area of change without producing a change map (Achard et al., 2002; FAO, 2010; Hansen et al., 2010), we will assume that a map of change exists (although there will not necessarily be a map for each date). The focus for this document is change between two dates.

Before any detailed planning of the response and sampling designs is undertaken, a basic visual assessment should be conducted to identify obvious errors and concerns in the remotely sensed product. This assessment provides an evaluation of the map's suitability for the intended application and should detect if a map is so unsuitable for
use that there is no value in proceeding to a more detailed assessment. The visual assessment should also highlight errors that are easy to remove, enabling the map to be refined prior to initiating a detailed assessment, or confirm that no obvious concerns exist and the map is ready for further rigorous evaluation.

We separate the accuracy assessment methodology into three major components: the response design, sampling design, and analysis (Stehman & Czaplewski, 1998). The response design encompasses all aspects of the protocol that lead to determining whether the map and reference classifications are in agreement. Because it is often impractical to apply the response design to the entire ROI, a subset of the area is sampled. The sampling design is the protocol for selecting that subset of the ROI. The analysis includes protocols for defining how to quantify accuracy along with the formulas and inference framework for estimating accuracy and area and quantifying uncertainty of these estimates. A separate section of this guidance document is devoted to each of these three major components of accuracy assessment methodology. These sections are followed by an example of the recommended workflow.

1.2. Context of good practice recommendations

The good practice recommendations are intended to represent a synthesis of the current science of accuracy assessment and area estimation. We fully anticipate that improved methods will be developed over time. As the designation of “best practice” implies a singular approach, we prefer the use of “good practice” to indicate that “best” is relative and will vary, with one hard-coded approach not always appropriate. In communicating good practices, desirable features and selection criteria can be followed to ensure that the protocol applied satisfies, as thoroughly as possible, the accuracy and area estimation recommendations. The good practice recommendations do not preclude the existence of other acceptable practices, but instead represent protocols that, if implemented correctly, would ensure scientific credibility of the results. Furthermore, the recommendations presented herein allow flexibility to choose specific details of the different components of the methodology. For example, while the general recommendation for the sampling design is to implement a probability sampling protocol, there are numerous sampling designs that meet this criterion (Stehman, 2009). Similarly, the response design protocol allows flexibility to use a variety of different sources for determining the reference classification and multiple options exist for defining agreement between the map and reference classifications. The good practice recommendations represent an ideal to strive for, but it is likely that most projects will not satisfy every recommendation. Documenting and justifying deviations from good practices are expected features of many accuracy assessment and area estimation studies. For the most part, the good practice recommendations consist of methods for which there is considerable experience of practical use in the remote sensing community.

These good practice recommendations for area estimation and accuracy assessment of land change build on earlier guidelines for single-date land-cover maps described by Strahler et al. (2006). Strahler et al. (2006) presented general guiding principles of good practices with less emphasis on details of methodology. In the intervening years since Strahler et al. (2006), additional theory and practical application related to accuracy assessment and area estimation have accumulated, and this document draws on these developments to delve more deeply into methodological details. We do not attempt to provide an exhaustive description of methods given the range of issues and the highly application-specific nature of the topic. Instead, our purpose is to focus upon the main issues needed to establish a common basis of good practice methodology that will be generally applicable and result in transparent methods and rigorous estimates of accuracy and area. A list of recommendations for all components of the process (sampling design, response design, and analysis) is presented in the Summary section (Section 6).

Estimating area and accuracy of change maps introduces additional methodological challenges that were not within the scope addressed by Strahler et al. (2006). In particular, the area estimation objective was not addressed at all by Strahler et al. (2006). Accuracy assessment of change highlights many unique challenges, including the dynamic nature of the reference data, and aspects of the change features including type, severity, persistence, and area. Another challenge is that change is usually a rare feature over a given landscape. The accuracy of a map and the area estimates derived with its aid are a function of the land-cover mosaic under study, the underlying imagery and the methods applied. Accuracy and area estimates for the same region will, for example, vary if using a per-pixel or object-based classification or if the spatial resolution of the imagery is altered (cf. Baker, Warner, Conley, & McNeil, 2013; Duro, Franklin, & Duba, 2012; Johnson, 2013).

Our recommendations also focus on methods for providing rigorous estimates of land (area) change and its uncertainties. A primary use of such estimates is in analysis and accounting frameworks such as national inventories. In evolving frameworks compensating for successful climate change mitigation actions in the forest sector (such as REDD+, DeFries et al., 2007), the consideration of uncertainties is likely linked with financial incentives and is subject to critical international political negotiations on reporting and verification (Sanz-Sanchez, Herold, & Penman, 2013). Understanding and management of uncertainties in area change are essential, particularly because data and capacity gaps in forest monitoring are large in many developing countries (Romijn, Herold, Kooistra, Murdiyarso, & Verchot, 2012). Accuracy assessments should also focus on identifying and addressing error sources, and prioritize capacity development needs to provide continuous improvements and reduce uncertainties in the estimates over time. This also includes assessing the value of data streams from evolving monitoring technologies (de Sy et al., 2012; Pratihast, Herold, de Sy, Murdiyarso, & Skutsch, 2013) where the ultimate impact on lower uncertainties needs to be proven in operational contexts. Thus, the methods of good practice presented here are generic for providing rigorous estimates, and having agreed-upon tools to do so will provide the saliency and legitimacy for using them in quantifying improvements in monitoring systems, and for dealing with uncertainties in financial compensation schemes (e.g., for climate change mitigation actions).

This article synthesizes key steps and methods needed to complete an accuracy assessment of a change map and to estimate area and accuracy of the map classes. It addresses the protocols required to produce scientifically rigorous and transparent estimates of accuracy and area.

2. Sampling design

The sampling design is the protocol for selecting the subset of spatial units (e.g., pixels or polygons) that will form the basis of the accuracy assessment. Choosing a sampling design requires a consideration of the specific objectives of the accuracy assessment and a prioritized list of desirable design criteria. The most critical recommendation is that the sampling design should be a probability sampling design. An essential element of probability sampling is that randomization is incorporated in the sample selection protocol. Probability sampling is defined in terms of inclusion probabilities, where an inclusion probability relates the likelihood of a given unit being included in the sample (Stehman, 2000). The two conditions defining a probability sample are that the inclusion probability must be known for each unit selected in the sample and the inclusion probability must be greater than zero for all units in the ROI (Stehman, 2001).

A variety of probability sampling designs are applicable to accuracy assessment and area estimation, with the most commonly used designs being simple random, stratified random, and systematic (Stehman, 2009). Non-probability sampling protocols include purposely selecting sample units (e.g., choosing units that are convenient to access), restricting the sample to homogeneous areas, and implementing a complex or ad hoc selection protocol for which it is not possible to derive the
inclusion probabilities. The condition that the inclusion probabilities must be known for the units selected in the sample must be adhered to. These inclusion probabilities are the basis of the estimates of accuracy and area, so if they are not known, the probabilistic basis for design-based inference (see Section 4.2) is forfeited. It is difficult to envision a circumstance in which a deviation from this condition of probability sampling (i.e., known inclusion probabilities) would be acceptable for a scientifically rigorous assessment of accuracy.

In practice, it is not always possible to adhere perfectly to a probability sampling protocol (Stehman, 2001). For example, if the response design specifies field visits to sample locations, it may be too dangerous or too expensive to access some of the sample units. Similarly, persistent cloud coverage or lack of usable imagery for portions of the ROI may prevent obtaining the reference classification for some sample units. The reference data are often derived from another set of imagery and the spatial and temporal coverage of reference data might be different from the coverage of the imagery used to create the map. If the reference classification for a sample unit cannot be obtained, the inclusion probability is zero for that unit. All deviations from the probability sampling protocol should be documented and quantified to the greatest extent possible. For example, the proportion of the selected sample units for which cloud cover prevented assessment of the unit should be reported, or the proportion of area of the ROI for which the reference imagery is not available should be documented. Whereas probability sampling ensures representation of the population via the rigorous probabilistic basis of inference established, when a large proportion of the ROI is not available to be sampled, the question of how well the sample represents the population must be addressed by subjective judgment.

2.1. Choosing the sampling design

The major decisions in choosing a sampling design relate to trade-offs among different designs in terms of advantages to meet specified accuracy objectives and priority desirable design criteria. The objectives commonly specified are to estimate overall accuracy, user's accuracy (or commission error), producer's accuracy (or omission error), and area of each class (e.g., area of each type of land change). Estimates for subregions of the ROI are also often of interest (cf. Scepan, 1999). Desirable sampling design criteria include: probability sampling design, ease and practicality of implementation, cost effectiveness, representative spatial distribution across the ROI, small standard errors in the accuracy and area estimates, ease of accommodating a change in any step in the implementation of the design, and availability of an approximately unbiased estimator of variance. Determining whether any or all of these desirable design criteria have been satisfied by the chosen sampling design may be subjective. For example, determining what constitutes a small standard error will depend on the application and may vary for different estimates within the same project. There are also precedents for defining an accuracy target and desired error bounds as a means for determination of sample size using standard statistical theory (Wulder, Franklin, White, Linke, & Magnussen, 2006) (see also Section 5.1.1).

Stehman and Foody (2009) provide an overview and comparison of the basic sampling designs typically applied to accuracy assessment. Stehman (2009) provides a more expansive review of sampling design options and discusses how these designs fulfill different objectives and desirable design criteria. A variety of sampling designs will satisfy good practice guidelines so the key is to choose a design well suited for a given application. Three key decisions that strongly influence the choice of sampling design are whether to use strata, whether to use clusters, and whether to implement a systematic or simple random selection protocol (Stehman, 2009). Each of these decisions will be discussed in the following subsections.

2.1.1. Strata

There is often a desire to partition the ROI into discrete, mutually exclusive subsets or strata (e.g., a global map could be stratified geographically by continents). Stratification is a partitioning of the ROI in which each assessment unit is assigned to a single stratum. The two most common attributes used to construct strata are the classes determined from the map and geographic subregions within the ROI. Stratification is implemented for two primary purposes. The first purpose is when the strata are of interest for reporting results (e.g., accuracy and area are reported by land-cover class or by geographic subregion). The second use of stratification is to improve the precision of the accuracy and area estimates. For example, when strata are created for the objective of reporting accuracy by strata, the stratified design allows specifying a sample size for each stratum to ensure that a precise estimate is obtained for each stratum. Land change often occupies a small proportion of the landscape, so a change stratum can be identified and the sample size allocated to this stratum can be large enough to produce a small standard error for the change user's accuracy estimate.

The practical reality is that limited resources will likely be available for the reference sample and this constraint will strongly impact sample allocation decisions because different allocations favor different estimation objectives. For example, allocating equal sample sizes to all strata favors estimation of user's accuracy over estimation of overall and producer's accuracies (Stehman, 2012). Conversely, the standard errors for estimating producer's and overall accuracies are typically smaller for proportional allocation (i.e., the sample size allocated to each stratum is proportional to the area of the stratum) relative to equal allocation. As a compromise between favoring user's versus producer's and overall accuracies, the allocation recommended is to shift the allocation slightly away from proportional allocation by increasing the sample size in the rarer classes, but the sample size for the rare classes should not be increased to the point where the final allocation is equal allocation (see Section 5 for an example). The sample size allocation decision can be informed by calculating the anticipated standard errors (see Sections 4.3 and 4.4) for different sample sizes and different allocations. An ineffective allocation of sample size to strata will not result in biased estimators of accuracy or area, but it may result in larger standard errors (see Section 5 for an example).

When stratified sampling is applied to a single-date land-cover map, it is usually feasible to define a stratum for each land-cover class (Wulder, White, Magnussen, & McDonald, 2007). Identifying an effective stratification for change can be more challenging. A common approach is to use a map of change to identify the strata, and such strata are effective for estimating user's accuracy of change precisely. However, the number of different types of change may be so large that defining every change type as a stratum is not advisable. For example, in a post-classification comparison of two land-cover maps that each include 8 land-cover classes, there are 56 possible types of change in the final change map (8 × 7 ordered from-to class pairs). If each stratum must receive a relatively large sample to achieve a precise user's accuracy estimate, the overall sample size may be unaffordable. The trade-offs between precision of user's accuracy, producer's accuracy, and area estimates from different sample size allocations become exacerbated as the number of strata increases. Some types of change may be very unlikely to occur and consequently could be eliminated as strata. To further reduce the number of strata, strata could be defined on the basis of generalized change categories (Wickham et al., 2013). For example, a stratum could be change from any class to urban (i.e., urban gain), and another stratum could be change to any class from forest (i.e., forest loss). These generalized or aggregated change strata are obviously less focused on all possible individual change types. For example, the forest loss stratum could include forest to developed, forest to water, or forest to cropland. These generalized change strata would allow for specifying the sample size allocated to different general change types, but within one of the generalized strata, the sample size allocated to the individual change types would be proportional to the area of that change type. For example, if the most common type of forest loss is to cropland and the least common change is forest loss to water, many more of the sample units within the forest loss stratum will be forest-to-cropland conversion. Strahler et al. (2006, Fig. 5.2, p. 32) provide
46 P. Olofsson et al. / Remote Sensing of Environment 148 (2014) 42–57

additional examples of aggregated change classes that could be used as strata.
The desire to limit the number of strata motivates discussion of subpopulation estimation as it relates to sampling design. A subpopulation is any subset of the ROI, for example a particular type of change or a particular subregion. Subpopulations can be defined as strata, but it is not necessary for a subpopulation to be defined as a stratum to produce an estimate for that subpopulation. For example, when aggregating multiple types of change into a generalized change stratum, it would still be possible to estimate accuracy of each of the subpopulations representing the individual types of change making up the aggregated change stratum. However, if these subpopulations are not defined as strata, the sample size representing the subpopulation may not be large enough to obtain a precise estimate. Resources available for accuracy assessment may require limiting the number of strata used in the design, so prioritizing subpopulations may be necessary to establish which subpopulations are defined as strata.
It is sometimes the case that several maps will be assessed based on a common accuracy assessment sample. This forces a decision on whether the strata should be based on a single map (and if so, which map) or if the strata should be defined by a combination of the multiple maps. Once strata are defined and the sample is selected using these strata, the strata become a fixed feature of the design because the analysis is dependent on the estimation weight associated with each sample unit and this weight is determined by the sampling design. Fortunately, whatever the decision is to define strata when multiple maps are to be assessed, the sample reference data are still valid to assess any of the maps, even if the strata are defined on the basis of a single map. The principles of estimation outlined in the Analysis section (Section 4) must be adhered to, and this simply requires using the estimation weights for the sample units determined by the original stratified selection protocol. The impact of the choice of strata will be reflected in the standard errors of the estimates. Olofsson et al. (2012) and Stehman, Olofsson, Woodcock, Herold, and Friedl (2012) discuss sampling design issues associated with constructing a reference validation database that would allow assessment of multiple maps.
To summarize the recommendations related to the important question of whether to incorporate stratification in the sampling design, stratifying by mapped change and by subregions is justified to achieve the objective of precise class-specific accuracy and to report accuracy by subregion. If the overall sample size is not adequate to support both class-specific and subregion accuracy estimates, the subregional stratification may be omitted and accuracy by subregion relegated to the status of subpopulation estimation. The recommended allocation of sample size to the strata defined by the map classes is to increase the sample size for the rarer classes making the sample size per stratum more equitable than what would result from proportional allocation, but not pushing to the point of equal allocation. The rationale for this recommendation is that user's accuracy is often a priority objective and we can control the precision of the user's accuracy estimates by the choice of sample allocation. However, the trade-off is that a design allocation chosen solely for the objective of user's accuracy precision (i.e., equal allocation) may be detrimental to precision of estimates of overall accuracy, producer's accuracy, and area, so a compromise allocation is in order. Lastly, defining aggregations of change types as strata may be necessary if the number of strata needs to be limited, and accuracy and area estimates for the individual change types would be obtained as subpopulation estimates.

2.1.2. Cluster sampling
A cluster is a sampling unit that consists of one or more of the basic assessment units specified by the response design. For example, a cluster could be a 3 × 3 block of 9 pixels or a 1 km × 1 km cluster containing 100 1 ha assessment units. In cluster sampling, a sample of clusters is selected and the spatial units within each cluster are therefore selected as a group rather than selected as individual entities. Each of the spatial units within a cluster is still interpreted as a separate unit even though it is selected into the sample as part of a cluster. For example, a 3 × 3 pixel cluster would require obtaining the reference classification for individual pixels within the cluster.
The primary motivation for cluster sampling is to reduce the cost of data collection. For example, if field visits are required to obtain the reference classification, transit time and costs may be reduced if the sample units are grouped spatially into clusters. Zimmerman et al. (2013) used cluster sampling to reduce the number of raster images (i.e., clusters) required because the primary cost of the sampling protocol was associated with processing the very high resolution images used for reference data. As another example, Stehman and Selkowitz (2010) used a 27 km × 27 km cluster sampling unit to constrain sample locations to a single day of flight time per cluster when the reference data were collected by aircraft. Cluster sampling may also be motivated by the objectives of an accuracy assessment. For example, a cluster sampling unit becomes necessary to assess accuracy at multiple spatial supports (e.g., single pixel, 1 ha unit, and 1 km² unit).
The cost savings gained by cluster sampling should be substantial before choosing this design because the correlation among units within a cluster (i.e., intracluster correlation) often reduces precision relative to a simple random sample of equal size. Focusing on the specific example of estimating land-cover area in Europe, Gallego (2012) showed that a 10 km × 10 km sampling unit produced equivalent information to that of a simple random sample of only 25 points or fewer. The low yield of information per cluster diminishes the cost advantage of cluster sampling if the intracluster correlation is high. Another potential disadvantage of cluster sampling is that it complicates stratification when the strata are the map classes and the assessment unit is a pixel. In the simplest setting, each cluster would be assigned to a stratum, but rules have to be established for assigning a cluster to a stratum when the cluster includes area of several different classes. Cluster sampling can be combined with stratification of pixels by the map class of each pixel in a two-stage stratified cluster sampling approach (Stehman, Sohl & Loveland, 2003; Stehman, Wickham, Wade & Smith, 2008), but such designs require more complex analysis and implementation protocols than what are required of a stratified design without clusters. Because of the added complexity cluster sampling introduces for sampling design (e.g., accommodating stratification within a cluster sampling design) and estimation (e.g., estimating standard errors), we recommend this design only in cases for which the objectives require a cluster sampling unit or in which the cost savings or practical advantages of cluster sampling are substantial.

2.1.3. Systematic vs. random selection
The two most common selection protocols implemented in accuracy assessment are simple random and systematic sampling (we define “systematic” as selecting a starting point at random with equal probability and then sampling with a fixed distance between sample locations). Both protocols can be implemented to select units from within strata or to select clusters, and both can be applied to a ROI that is not partitioned into strata or clusters. Unbiased estimators of the various accuracy parameters are available from either systematic or simple random selection, so the bias criterion is not a basis for choosing between these options. Instead, the choice of simple random versus systematic depends on how each selection protocol satisfies the priority desirable design criteria (Stehman, 2009). For example, systematic sampling is often simpler to implement when the response design is based on field visits, but the greater convenience of systematic versus simple random is diminished when working with imagery or aerial photographs as a source of the reference data. Typically, systematic selection will yield more precise estimates than simple random selection, but systematic sampling requires use of a variance approximation so if unbiased variance estimation is a priority criterion, simple random is preferred. Simple random selection also is advantageous if it is likely that the sample size will need to be modified during the course of the accuracy

assessment (Stehman et al., 2012). A scenario in which systematic selection opportunistically arises is when accuracy assessment reference data can be simultaneously obtained in conjunction with another field sampling activity. For example, many national forest inventories employ a systematic sample of field plots (Tomppo, Gschwantner, Lawrence, & McRoberts, 2010) and these field plot data may be an inexpensive, high quality source of reference data. In general, the simple random selection protocol will better satisfy the desirable design criteria and is the recommended option. However, systematic selection is also nearly always acceptable.

2.2. A recommended good practice sampling design

Stratified random sampling is a practical design that satisfies the basic accuracy assessment objectives and most of the desirable design criteria. Stratified random sampling affords the option to increase the sample size in classes that occupy a small proportion of area to reduce the standard errors of the class-specific accuracy estimates for these rare classes. Thus this design addresses the key objective of estimating class-specific accuracy. In regard to the desirable design criteria, stratified random sampling is a probability sampling design and it is one of the easier designs to implement. Stratified sampling is commonly used in accuracy assessment so it has an advantage of being familiar to the remote sensing community (cf. Cakir, Khorram, & Nelson, 2006; Huang et al., 2010; Mayaux et al., 2006; Olofsson et al., 2011). Increasing or decreasing the sample size after the data collection has begun is readily accommodated by stratified random sampling, and unbiased variance estimators are available thus avoiding the need to use variance approximations. An assumption implicit in this recommendation is that change between two dates is of interest. Little work has been done to investigate the effective use of strata for multiple change periods. In the case of stratification based on a change map, it is assumed that reference data for the sampled locations exists for the initial date of the change period (e.g., archived imagery or aerial photography is available). If the reference data must be obtained in real time (e.g., via ground visit), it would not be possible to stratify by a change map that does not yet exist at the initial date. An alternative would be to stratify by anticipated change or predicted change, with the effectiveness of such strata dependent on how well the predicted change matched with the ensuing reality of change.

3. Response design

For the accuracy assessment objective, the response design encompasses all steps of the protocol that lead to a decision regarding agreement of the reference and map classifications. For area estimation, the response design provides the best available classification of change for each spatial unit sampled. The four major features of the response design are the spatial unit, the source or sources of information used to determine the reference classification, the labeling protocol for the reference classification, and a definition of agreement. Each of these major features is discussed in the following subsections.

3.1. Spatial assessment unit

The spatial unit that serves as the basis for the location-specific comparison of the reference classification and map classification can be a pixel, polygon (or segment), or block (Stehman & Wickham, 2011). The ROI is partitioned based on the chosen spatial unit (i.e., the region is completely tiled by these non-overlapping spatial units). Commonly, the pixel is selected as the spatial unit. The pixel is an arbitrary unit defined mainly by the properties of the sensing system used to acquire the remotely sensed data or a function of the grid used to sub-divide space in a raster based data set. A polygon is defined as a unit of area, perhaps irregular in shape, representing a meaningful feature of land cover. For example, a polygon may be delineated from a map such that the area within the polygon has the same map classification (e.g., the entire polygon is stable forest or the entire polygon represents an area of change from forest to urban). Polygons defined on the basis of a map will be called “map polygons.” Alternatively, a polygon could be delineated on the basis of the reference classification as an area within which the reference class is the same. A polygon delineated on the basis of the reference classification will be called a “reference polygon”. A “block” spatial assessment unit is defined as a rectangular array of pixels (e.g., a 3 × 3 block of pixels). Irrespective of the spatial unit selected, it is important to note that some spatial units may be impure, i.e., they represent an area of more than one class. Mixed pixels are common, especially in coarse spatial resolution data. Similarly, it is possible that a map polygon is not internally homogeneous in terms of the reference classification, and a reference polygon may not be internally homogeneous in terms of the map classification. A polygon defined by a segmentation algorithm would not necessarily be homogeneous in terms of either the map or the reference classifications.
Pixels, polygons, or blocks can be used as the spatial unit in accuracy assessment. Regardless of the unit chosen, a critical feature of the response design protocol is that the spatially explicit character of the accuracy assessment must be retained. Practitioners should aim to have reference data with an equal or finer level of detail than the data used to create the map, but we make no recommendation regarding the choice of spatial assessment unit. However, once the spatial assessment unit has been chosen, there will be good practice recommendations associated with that specific unit and the choice of spatial unit also has implications on the sampling design (Stehman & Wickham, 2011) and analysis. Estimates of accuracy and area derived from the same map but through the use of different spatial units may be unequal.

3.2. Sources of reference data

The reference classification can be determined from a variety of sources ranging from actual ground visits to the sample locations or the use of aerial photography or satellite imagery. There are two ways to ensure that the reference classification is of higher quality than the map classification: 1) the reference source has to be of higher quality than what was used to create the map classification, and 2) if using the same source material for both the map and reference classifications, the process to create the reference classification has to be more accurate than the process used to create the classification being evaluated. For example, if Landsat imagery is used to create the map and Landsat is the only available imagery for the accuracy assessment, then the process for obtaining the reference classification has to be more accurate than the process for obtaining the map classification. Additionally, other spatial data may be used to improve the quality of the reference classification, such as forest inventory data or some form of vector data (e.g., roads, pipelines, or crop records). In this subsection, different potential sources of reference data for assessing accuracy of change are identified and strengths and weaknesses of these sources are described.
Possible reference data sources include field plots, aerial photography, forest inventory data, airborne video, lidar, and satellite imagery (Table 1). Additional sources of freely accessible reference data may also be opportunistically available from data mining and crowdsourcing (Foody & Boyd, 2013; Iwao, Nishida, Kinoshita, & Yamagata, 2006).

Table 1
Possible reference data sources.

Reference data source    Exemplar citation
Field plots              Hyyppä et al. (2000)
Air photography          Skirvin et al. (2004)
Forest inventory data    McRoberts (2011); Wulder, White, et al. (2006)
Airborne video           Wulder et al. (2007)
Lidar                    Lindberg, Olofsson, Holmgren, and Olsson (2012)
Satellite imagery        Scepan (1999); Cohen et al. (2010)
Crowdsourcing            Iwao et al. (2006); Foody and Boyd (2013)
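
As an illustration of the stratified random design recommended in Section 2.2, the following minimal Python sketch selects a simple random sample without replacement within each stratum and records an estimation weight for every sampled unit. The function name, the toy map, and the per-stratum sample sizes are hypothetical and chosen only for illustration; the weight N_h / n_h (stratum size divided by stratum sample size) is the standard weight for stratified random sampling and is assumed here rather than taken from the text.

```python
import random
from collections import defaultdict


def stratified_random_sample(strata_labels, allocation, seed=0):
    """Draw a simple random sample without replacement within each stratum.

    strata_labels: sequence giving the stratum (e.g., map class) of each
                   assessment unit, indexed by unit id.
    allocation:    dict mapping stratum -> sample size n_h set by the analyst.
    Returns a list of (unit_id, stratum, weight) tuples, where
    weight = N_h / n_h is the number of units each sampled unit represents.
    """
    rng = random.Random(seed)

    # Group unit ids by stratum so each stratum can be sampled separately.
    units_by_stratum = defaultdict(list)
    for unit_id, stratum in enumerate(strata_labels):
        units_by_stratum[stratum].append(unit_id)

    sample = []
    for stratum, n_h in allocation.items():
        units = units_by_stratum[stratum]
        if not 0 < n_h <= len(units):
            raise ValueError(f"invalid sample size {n_h} for stratum {stratum!r}")
        weight = len(units) / n_h
        for unit_id in rng.sample(units, n_h):
            sample.append((unit_id, stratum, weight))
    return sample


# Hypothetical map of 10,000 pixels in which forest loss is a rare class.
labels = ["forest"] * 9000 + ["nonforest"] * 900 + ["loss"] * 100

# Hypothetical allocation that gives the rare class more than its
# proportional share without going all the way to equal allocation.
sample = stratified_random_sample(
    labels, {"forest": 60, "nonforest": 40, "loss": 25}
)
```

Under proportional allocation the rare "loss" stratum would receive only about 1 of the 125 sample units; the allocation above raises it to 25, and the recorded weights are what keep later estimates consistent with the design.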

Practical considerations regarding costs often influence the selection of reference data, or the use of existing data. While existing or lower cost data may be desirable from a purchase perspective, the use of disparate data sources will result in additional effort by project analysts to deal with exceptions and inconsistencies. A key to using disparate data sources is to have the reference data that are actually used in the accuracy assessment be, as much as possible, invariant to source. For example, the creation of attributed change polygons makes the polygon the common denominator, rather than the source data. Creating polygonal change units in a portable format and populating a minimum set of fields to support a consistent labeling protocol is desirable. The information to be recorded for each change unit is itemized in Table 2.

Table 2
Example characteristics to record for each change polygon. Some attributes can be generated in the GIS; others will need to be entered by the analyst. The notion is that information is captured and carried to provide insights and a record regarding the changes captured. The aim is that the change polygons can be used in a manner that is invariant to source, but that metadata is captured to explain or better understand any data related anomalies that may emerge.

Attribute           Definition/comments
Change area         Area changed, e.g., polygon size in hectares
Change perimeter    Perimeter of polygon, in meters
Change type         Notation of change type, harvest, fire, insect, urban expansion, agricultural development
Change date         As possible, note the change date. May be available from other records, e.g., when a fire occurred, or the acquisition date of the image or photography used.
Data source         Note the data source from which the change polygon is made
Analyst             Name or code to denote the interpreter
Date interpreted    Note the date when the interpretation occurred

Ideally a data source is available for the entire ROI, representing the change types and dates of interest, at a low cost. The realities versus the ideal result in a series of considerations that are detailed in Table 3. For instance, if the ROI is small, cost may be less of an issue and access may not be relevant. For large area projects over poorly monitored areas, existing data sources are not often available so data purchase and interpretation costs become the dominant criteria. The ease of interpretation and consistency of source reference data permits economies in the project flow for the analysts and also promotes automation of repeated activities. Further, the development of a well-documented and consistent change validation data set will have utility for multiple projects and purposes.
Both high- and very high spatial resolution satellite data are viable candidates for reference data. Imagery is typically considered as very high spatial resolution (VHSR) with a spatial resolution of <1 m and high spatial resolution (HSR) with a spatial resolution of <10 m. Both data sources provide information that is finer than the data used in most large area monitoring projects, which would typically have a spatial resolution of greater than 10 m. At the fine spatial resolution of satellite-borne VHSR imagery, panchromatic is often the only spectral information collected. The typical 400 to 900 nm panchromatic data with small pixels (0.50 m in the case of WorldView-1) closely resemble large scale aerial photography and can be interpreted using established aerial photograph interpretation techniques (Wulder, White, Hay, & Castilla, 2008) or subject to digital analyses (cf. Falkowski, Wulder, White, & Gillis, 2009). Both the SPOT Image® and DigitalGlobe® archives can be accessed through Google Earth™, with the image extents by year portrayed. The presence of freely accessible high spatial resolution imagery online through Google Earth™ also presents low cost interpretation options. Limitations of this approach include a lack of data prior to the initiation of the high spatial resolution satellite commercial era (circa 2000), spatial distribution of available imagery, and the actual temporal revisit of the images available. The reported temporal revisit can be on the order of days based upon an ability to point the sensor head. For instance, IKONOS has off-nadir revisit of 3 to 5 days, with 144 days required for nadir revisit (Wulder, White, Coops, & Butson, 2008). The implication is that when the sun-surface-sensor viewing geometry changes, the structure captured changes, such that trees evident on one image may be occluded in another. For a given on-line accessible source of satellite imagery, it should not be expected that historical, archival, global coverage from launch to present exists. Regardless, the ability to view images from multiple years can help determine the date when a change (e.g., a disturbance) occurred. The additional context provided around particular change events aids with interpretation of change type (e.g., determination of harvesting versus forest removal in support of agricultural expansion).
There are few, if any, reference data sources that are available with a uniform likelihood globally. There are some archival datasets with wide global coverage (e.g., Kompsat), although the utility of these data sets may be limited. The utility of any given reference data source when used to capture and relate change depends on the date or period represented by the data. While less of an issue with satellite data, air photos and maps may not be of a known vintage. Acquisition dates of historic photos are often lost, plus maps are often representative of a period, not a singular date. Knowing the conditions that previously existed may not be helpful if the date of change occurrence is not known.
Over some regions, land use change and silvicultural records may also be available to inform on the land-cover change. Note that forest harvesting is a land-cover change relating to a successional stage, rather than a land use change (which implies a permanent change in how a particular parcel of land is used, e.g., forestry to agriculture). This distinction is important for both monitoring and reporting purposes as the permanent removal of forests has differing carbon consequences than forest harvesting (Kurz, 2010).
While the good practice guidelines advocate for use of reference data of finer spatial resolution than the map product, this is especially so for single date interpretations of the reference data. Following the opening of the Landsat archive by the USGS (Woodcock et al., 2008), time series of imagery created new opportunities for using imagery of the same spatial resolution (e.g., Landsat) when archival data are available. Simple visual approaches may be applied, such as in Fig. 1, where a change event (fire) that is evident in 2010 can be timed quite precisely by the evidence captured (smoke plume) showing when the fire occurred. This type of change dating is rather opportunistic and not to be commonly expected.
A more reliable means for determining the timing of change events can come from developing and interrogating time series of images (Kennedy, Yang, & Cohen, 2010). To ensure the quality of time series transitions developed, Cohen, Yang, and Kennedy (2010) created a logic and tool for determining the timing and nature of changes captured (TimeSync, http://timesync.forestry.oregonstate.edu/). Based upon the image collection and archiving protocols present through the history of Landsat, the spatial and temporal coverage of imagery is not uniform. The temporal precision possible for dating changes based upon time series analysis is likely weaker for locations that already have a paucity of data. This situation is due to the historic practices followed at given Landsat receiving stations through to the commercial era (during the 1980s) when fewer images were collected and archived (Wulder, Masek, Cohen, Loveland, & Woodcock, 2012). It should not be assumed that the temporal density possible for the conterminous United States is possible for all other regions (Schroeder et al., 2011).
Another critical aspect of the response design is that the change period represented by the reference classification must be synchronous with the change period of the classification. Consider a map representing change between 2000 and 2010. To capture the northern hemisphere peak photosynthetic period, the imagery used for this hypothetical project was collected July 15, 2000, and 10 years later, July 15, 2010. The reference data should be collected in 2010, but ideally not after July 15 (assuming similar satellite overpass times) to avoid confusion. Data collected after July 15, 2010 will have to be vetted to ensure that the change present in the reference data did not occur

Table 3
Elements for consideration when selecting reference data.

Element                              Considerations
Cost                                 What is the budget? What amount per unit of reference data can be purchased? Is the interpretation/labeling protocol efficient?
Ease of access                       Varies by data type. Can field visits be made? Is archival image data available?
Ease of use                          Is the data produced in a consistent fashion? Is it in formats that are commonly used?
Opportunity for consistency          Can protocols be developed and applied in a systematic and repetitive fashion? Can some tasks be automated?
Vintage (temporal representation)    Is the data representative of a time or time period that is relevant to the change product under consideration?
Spatial coverage                     Are there opportunities for multiple reference sites from a given reference data source?
Interpretability of change types     Does the data source capture and portray the change types of interest? E.g., is the spatial resolution sufficiently fine to enable interpretation?
Geolocation                          Can the candidate reference data source be assumed to be accurately positioned? Will additional geolocation activities be required?
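
The "Vintage" element in Table 3 lends itself to a simple mechanical check. The minimal Python sketch below labels each reference acquisition date relative to the change period of the map, using the July 15, 2000 to July 15, 2010 period of the hypothetical project described in the text. The function name, the status labels, and the three image dates are hypothetical, introduced only for illustration.

```python
from datetime import date


def classify_vintage(acquired, period_start, period_end):
    """Label a reference acquisition date relative to the map's change period."""
    if acquired < period_start:
        return "before period"
    if acquired <= period_end:
        return "within period"
    # Later acquisitions may show change that postdates the map product.
    return "after period: vet manually"


# Change period of the hypothetical project in the text.
PERIOD_START = date(2000, 7, 15)
PERIOD_END = date(2010, 7, 15)

# Hypothetical acquisition dates for three reference images.
reference_dates = {
    "image_a": date(2010, 6, 30),
    "image_b": date(2010, 8, 3),
    "image_c": date(1999, 9, 1),
}

status = {
    name: classify_vintage(acquired, PERIOD_START, PERIOD_END)
    for name, acquired in reference_dates.items()
}
```

A flag of this kind only identifies which reference units need a closer look; deciding whether an observed change occurred inside the map's change period still requires interpretation by an analyst.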

after the product date of the change map. Imagery from the same year is desired but may not always be possible. As such, it is required that the change reference data approximate the date the change occurred as precisely as possible. Multiple images help refine the timing of the change event. Mismatched change periods between the map and reference classifications would be a major source of reference data error.

Fig. 1. Landsat data can be used for the visual dating of change, with the fire event in progress in inset A, August 3, 2010, with the burned forest outcome evident in inset B, September 20, 2010, Yukon, Canada (Landsat Path 55, Row 18).

3.3. Reference labeling protocol

The labeling protocol refers to the steps in the response design that take the information provided by the reference data and convert that information to the label or labels constituting the reference classification. Labeling is far from trivial with numerous definitions for land-cover classes in use (cf. Comber, Wadsworth, & Fisher, 2008) although recent developments such as the FAO's Land Cover Classification system (LCCS) may act to enhance interoperability (Ahlqvist, 2008). The labeling protocol should also include specification of a minimum mapping unit (MMU) for the reference classification. The MMU can have important implications for accuracy assessment and area estimation. For example, increasing the size of the MMU will lead to a reduction in the representation of classes that occupy small, often fragmented, patches (Saura, 2002). Changing the MMU can also impact accuracy estimates, although the effect is most apparent when a large change is made (Knight & Lunetta, 2003). Small patches present a challenge to mapping (cf. He, Franklin, Guo, & Stenhouse, 2011) and the accuracy of their mapping will degrade as the MMU is increased. However, it is possible that overall map accuracy may increase with a larger MMU, making it important to ensure that attention is focused on an appropriate measure of accuracy for the application in-hand. The precise effects of the MMU will vary as a function of the land-cover mosaic under study and the imagery used. The MMU specified for the response design does not necessarily have to match the MMU specified for the map. In fact, if the reference classification is intended to apply to a variety of maps, it would be likely that the MMU of the reference classification does not match the map classification for all maps that might be assessed. Often the reference imagery or information will permit distinguishing smaller patches or features than can be distinguished from the map so a smaller MMU will be possible for the reference classification.
The easiest case for the labeling protocol occurs when the assessment unit is homogeneous and a single reference class label can be assigned (the reference class could be a type of change). Often, however, the situation will be more complex making class labeling less certain. For example, the assessment unit may contain a mixture of classes, and even if the unit is homogeneous, it may be difficult to assign a single label (e.g., change type) because the unit is not unambiguously one of the classes in the legend but instead falls between two of the discrete class options in the legend (i.e., land-cover classes are a continuum represented on a discrete scale). A variety of options exist for labeling a unit when a single reference label does not adequately represent the uncertainty of a unit. One or more alternate reference class labels can be assigned to account for ambiguity in the reference classification. Another option when defining agreement is to construct a weighted agreement based on how closely the different classes are related. For example, in the GlobCover assessment, a “matrix” of class relationships was established (Mayaux et al., 2006, GLC2000). A fuzzy reference labeling protocol may also be employed, such as the linguistic scale devised by Gopal and Woodcock (1994) or a fuzzy membership vector in which the reference label for a unit specifies a membership value for each class (Binaghi, Brivio, Ghezzi, & Rampini, 1999; Foody, 1996). Another option for mixed units is to specify the proportion of area of each class present in the unit (Foody, Campbell, Trodd, & Wood, 1992; Lewis & Brown, 2001). A different characterization of uncertainty in the reference classification is obtained by assigning a confidence rating that represents the interpreter's perception of uncertainty in the reference classification for that unit. For example, low, moderate and high confidence ratings would indicate increasing confidence on the part of the interpreter that the reference classification is correct. Typically this information can then be used in the analysis to subset results by

confidence rating (Powell et al., 2004; Wickham, Stehman, Fry, Smith, & Homer, 2001, Table 4).

The response design should include protocols to enhance consistency of the reference class labeling. For example, interpretation keys should be created if visual assessment is used to obtain the reference classification (Kelly, Estes, & Knight, 1999) and specific instructions to translate quantitative field data into reference labels should be provided and documented. If multiple interpreters are used, training interpreters to ensure consistency is critical. Interpreters should be in communication throughout the process to discuss and review difficult cases and to agree upon a common approach to labeling such cases. Difficult cases should be noted for future reference and consensus development (e.g., the imagery is retained and accessible, and the decision process leading to the reference label of the case is documented). Rather than solely visual approaches, entire high spatial resolution images can be classified, with the underlying imagery also maintained and accessible as support information to the accuracy assessment (that is, to gain/ensure confidence in the categories selected for a given location).

3.4. Defining agreement

Once the map and reference classifications have been obtained for a given spatial unit, rules for defining agreement must be specified before proceeding to the analyses that quantify accuracy. In the simplest case, a single class label is present for the map and a single label is provided by the reference classification. If these labels agree, the map class is correct for that unit; if the labels disagree, the type of misclassification is readily identified. Defining agreement becomes more complex if the assessment unit is not homogeneous or if more than one class label is assigned by the map or reference classification. For example, if the reference classification provides a primary and secondary reference label, agreement can be defined as a match between the map label and either the primary or secondary reference label. If the reference classification consists of a vector of proportions of area of the classes present in the assessment unit (e.g., the area proportions of the classes are 0.2, 0.5, and 0.3), agreement can be defined as the proportion of area for which the map and reference labels are the same. The critical feature of the protocol for defining agreement is that it allows construction of an error matrix in which the elements of the matrix represent proportion of area of agreement and disagreement between the map and reference classifications. These proportions (in terms of area) achieve the necessary spatially explicit assessment of map accuracy and the requirements for area estimation.

3.5. Reference classification uncertainty: geolocation and interpreter variability

In an ideal case, the reference classification is based on a reference data set of such quality that the sample labels represent the ground truth (i.e., a “gold standard” reference data set). However, the reference classification is subject to uncertainty, and an assessment of this uncertainty should be conducted. Small errors in the reference data set can lead to large biases of the estimators of both classification accuracy and class area (Foody, 2010, 2013). Two potential sources of uncertainty in the reference classification are the uncertainty associated with spatial co-registration of the map and reference location (Pontius, 2000) and uncertainty associated with the interpretation of the reference data (Pontius & Lippitt, 2006).

Geolocation error is defined as a mismatch between the location of the spatial assessment unit identified from the map and the location identified from the reference data. The response design should be constructed to minimize geolocation error. For instance, it is common for plots to have a GPS position. The quality of the GPS position can be related to the type of instrument used, which can provide an indication of spatial precision. The length of time, number of position measures to resolve the location, and the number of satellites are also aspects that can be recorded. The magnitude of geolocation error should be characterized by documenting the spatial location quality of the map and reference data sources (e.g., GPS units, aerial photography, or satellite imagery). If airborne imagery is to be used, aircraft positioning and pointing information should be collected. The GPS location of the aircraft does not necessarily indicate the position of the point on the ground that is captured in photographic or video data. A slight roll of the aircraft can create a mismatch between the recorded and actual positions. Error in the classification may be incorrectly indicated due to these spatial mismatches, especially for smaller change events or rare classes.

Interpreter uncertainty can be separated into two parts: 1) interpreter bias is defined as an error in the assignment of the reference class to the spatial unit; 2) interpreter variability is a difference between the reference class assigned to the same spatial unit by different interpreters (i.e., interpreter variability is the complement of among-interpreter agreement). Ideally an assessment of both interpreter bias and interpreter variability would be conducted; in practice, assessing only interpreter variability may be feasible. The difficulty hindering assessment of interpreter bias is whether a “gold standard” of truth exists against which the interpreted reference classification can be compared. For example, on-the-ground reference data may serve to establish the gold standard of truth for land cover at a single date, but a gold standard for change based on field visits would be much more difficult and costly to establish. Comparison of interpreters to an “expert” interpreter is a practical but less satisfying option for quantifying interpreter bias and the success of this approach depends on how closely the expert classification mimics the gold standard. A distinction between the accuracy assessment of land cover and change does exist, whereby the continuous nature of land cover benefits more from field visits. Depending on the change categories of interest, field visits may not be as informative. For example, slower continuous changes may benefit from field visits, but rapid stand-replacing disturbances may not. The date of change, if not captured in silvicultural records or fire maps, may actually be better captured from imagery of known vintage than through field visits (Cohen et al., 2010).

A number of issues arise when using multiple interpreters to obtain the reference classification (Wulder et al., 2007). Disagreements among interpreters evaluating the same sampling unit are likely. These disagreements may be resolved by a consensus agreement on the reference class; for example, Powell et al. (2004) required five interpreters to agree upon a specific class, with the outcome then treated as a “gold standard”. Constant communication among the multiple interpreters to discuss and document difficult cases is important to foster enhanced consistency and accuracy of the reference labeling process (Wickham et al., 2013).

Table 4
Population error matrix of four classes with cell entries (pij) expressed in terms of proportion of area as suggested by good practice recommendations.

                 Reference
                 Class 1   Class 2   Class 3   Class 4   Total
Map   Class 1    p11       p12       p13       p14       p1⋅
      Class 2    p21       p22       p23       p24       p2⋅
      Class 3    p31       p32       p33       p34       p3⋅
      Class 4    p41       p42       p43       p44       p4⋅
      Total      p⋅1       p⋅2       p⋅3       p⋅4       1

The response design protocols described in this section have focused on land-cover changes that can be characterized by a complete change in class type: conversions of cover. In some studies attention is focused on more subtle changes or modifications of land cover, as changes in land cover can be considered as processes (Gómez, White, & Wulder, 2011) with gains and losses in vegetation captured and possible to assign a label (Kennedy et al., 2010). Cohen et al. (2010) show how investigation of time series of satellite imagery supported by period

photography can illuminate subtle changes in forest conditions such as decline due to insects or water stress and converse recovery of forests following disturbance. The response design protocols presented also do not address the situation in which the map provides information as a continuous variable. Although many of the basic concepts underlying the good practice recommendations would apply to a continuous variable, the details of the accuracy assessment methodology (cf. Riemann, Wilson, Lister, & Parks, 2010) and area estimation would likely be considerably different from the methods presented herein.

4. Analysis

The analysis protocol specifies the measures to be used to express accuracy and class area as well as the procedures to estimate the selected measures from the sample data. In the context of studies of land change, there are two key objectives of the analysis: 1) accuracy assessment of the change classification, and 2) estimation of area of change. The confusion or error matrix (hereafter noted as the error matrix) plays a central role in meeting both the accuracy assessment and area estimation objectives (Foody, 2013; Stehman, 2013).

4.1. The error matrix

The error matrix is a simple cross-tabulation of the class labels allocated by the classification of the remotely sensed data against the reference data for the sample sites. The error matrix organizes the acquired sample data in a way that summarizes key results and aids the quantification of accuracy and area. The main diagonal of the error matrix highlights correct classifications while the off-diagonal elements show omission and commission errors. The cell entries and marginal values of the error matrix are fundamental to both accuracy assessment and area estimation. Table 4 illustrates a four-class example error matrix of the type often used in studies of land change.

The rows of the error matrix represent the labels shown in a map derived from the classification of the remote sensing data and the columns represent the labels depicted in the reference data. This layout is not a universal requirement and some may wish to reverse the contents of the rows and columns. In the matrix, $p_{ij}$ represents the proportion of area for the population that has map class $i$ and reference class $j$, where “population” is defined as the full region of interest, and $p_{ij}$ is therefore the value that would result if a census of the population was obtained (i.e., complete coverage reference classification).

Accuracy parameters derived from a population error matrix of $q$ classes include overall accuracy

$$O = \sum_{j=1}^{q} p_{jj} \tag{1}$$

user's accuracy of class $i$ (the proportion of the area mapped as class $i$ that has reference class $i$)

$$U_i = p_{ii}/p_{i\cdot} \tag{2}$$

or its complementary measure, commission error of class $i$, $1 - p_{ii}/p_{i\cdot}$, and producer's accuracy of class $j$ (the proportion of the area of reference class $j$ that is mapped as class $j$),

$$P_j = p_{jj}/p_{\cdot j} \tag{3}$$

or its complementary measure, omission error of class $j$, $1 - p_{jj}/p_{\cdot j}$. A variety of other measures of accuracy has been used in remote sensing (Liu, Frazier, & Kumar, 2007). A commonly used measure is the kappa coefficient of agreement (Congalton & Green, 2009). The problems associated with kappa include but are not limited to: 1) the correction for hypothetical chance agreement produces a measure that is not descriptive of the accuracy a user of the map would encounter (kappa would underestimate the probability that a randomly selected pixel is correctly classified); 2) the correction for chance agreement used in the common formulation of kappa is based on an assumption of random chance that is not reasonable because it uses the map marginal proportions of area in the definition of chance agreement and these proportions are clearly not simply random; and 3) kappa is highly correlated with overall accuracy so reporting kappa is redundant with overall accuracy (Foody, 1992; Liu et al., 2007; Pontius & Millones, 2011; Stehman, 1997). Consistent with the recommendation in Strahler et al. (2006) the use of kappa is strongly discouraged as, despite its widespread use, it actually does not serve a useful role in accuracy assessment or area estimation.

4.2. General principles of estimation for good practice

The analysis protocol is designed to achieve the objectives of estimating accuracy and area from the sample data. Analysis thus requires statistical inference as the underlying scientific support for generalizing from the sample data to the population parameters and for quantifying uncertainty of the sample-based estimators. We recommend design-based inference (Särndal, Swensson, & Wretman, 1992) as the framework within which estimation is conducted. A fundamental tenet of design-based inference is that the specific estimators for accuracy, area, and the variances of these estimators depend on the sampling design implemented; different estimators are appropriate for different sampling designs. Therefore, it is essential that only unbiased or consistent estimators be used. In practical terms, this means that only formulas for estimating parameters and variances that account for the inclusion probabilities associated with the sampling design implemented should be used. All recommended good practice estimators meet this condition, but the versions of the estimators presented are usually forms where the individual inclusion probabilities do not appear explicitly.

4.3. Estimating accuracy

The cell entries of the population error matrix and the parameters derived from it must be estimated from a sample. Suppose the sample-based estimator of $p_{ij}$ is denoted as $\hat{p}_{ij}$. Once $\hat{p}_{ij}$ is available for each element of the error matrix, parameters can be estimated by substituting $\hat{p}_{ij}$ for $p_{ij}$ in the formulas for the parameters. Accordingly, the error matrix should be reported in terms of these estimated area proportions, $\hat{p}_{ij}$, and not in terms of sample counts, $n_{ij}$. The specific formula for estimating $p_{ij}$ depends on the sampling design used. For equal probability sampling designs (e.g., simple random and systematic sampling) and for stratified random sampling in which the strata correspond to the map classes,

$$\hat{p}_{ij} = W_i \frac{n_{ij}}{n_{i\cdot}} \tag{4}$$

where $W_i$ is the proportion of area mapped as class $i$. For simple random and systematic sampling, Eq. (4) is a poststratified estimator of $p_{ij}$ (Card, 1982) and for these sampling designs the poststratified estimator is recommended because it will have better precision than the estimators commonly used (cf. Stehman & Foody, 2009). Substituting $\hat{p}_{ij}$ of Eq. (4) into Eqs. (1)–(3) yields estimators of overall, user's, and producer's accuracies. These formulas are simpler special cases of a more general estimation approach described in Strahler et al. (2006, Eq. (3.1)).

The sampling variability associated with the accuracy estimates should be quantified by reporting standard errors. The variance estimators are provided below, and taking the square root of the estimated variance results in the standard error of the estimator. For overall accuracy, the estimated variance is

$$\hat{V}(\hat{O}) = \sum_{i=1}^{q} W_i^2 \hat{U}_i (1 - \hat{U}_i)/(n_{i\cdot} - 1). \tag{5}$$
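Eqs. (1)–(5) can be illustrated with a minimal Python sketch, assuming stratified random sampling with the map classes as strata. The two-class sample counts and area weights below are hypothetical illustration values, not data from this paper, and the function names are illustrative only.

```python
import math

def estimate_error_matrix(counts, weights):
    """Eq. (4): p_hat[i][j] = W_i * n_ij / n_i. (rows are map classes / strata)."""
    p_hat = []
    for row, w in zip(counts, weights):
        n_i = sum(row)  # sample size in stratum i
        p_hat.append([w * n_ij / n_i for n_ij in row])
    return p_hat

def accuracy_measures(p_hat):
    """Eqs. (1)-(3) applied to the estimated area proportions."""
    q = len(p_hat)
    row_tot = [sum(p_hat[i]) for i in range(q)]
    col_tot = [sum(p_hat[i][j] for i in range(q)) for j in range(q)]
    overall = sum(p_hat[j][j] for j in range(q))             # Eq. (1)
    users = [p_hat[i][i] / row_tot[i] for i in range(q)]      # Eq. (2)
    producers = [p_hat[j][j] / col_tot[j] for j in range(q)]  # Eq. (3)
    return overall, users, producers

def se_overall(counts, weights, users):
    """Square root of Eq. (5): standard error of estimated overall accuracy."""
    var = sum(
        w ** 2 * u * (1 - u) / (sum(row) - 1)
        for row, w, u in zip(counts, weights, users)
    )
    return math.sqrt(var)

# Hypothetical inputs: 2 map classes, 50 sample units per stratum.
counts = [[40, 10],   # map class 1: reference class 1, reference class 2
          [5, 45]]    # map class 2
weights = [0.1, 0.9]  # W_i, mapped area proportions (assumed values)

p_hat = estimate_error_matrix(counts, weights)
overall, users, producers = accuracy_measures(p_hat)
se_o = se_overall(counts, weights, users)
print(f"overall={overall:.3f}, users={users}, producers={producers}, SE={se_o:.4f}")
```

Note that the rare class has a much lower producer's accuracy than user's accuracy in this hypothetical case, because the few omission errors found in the large stratum carry a large area weight.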

For user's accuracy of map class $i$, the estimated variance is

$$\hat{V}(\hat{U}_i) = \hat{U}_i (1 - \hat{U}_i)/(n_{i\cdot} - 1). \tag{6}$$

For producer's accuracy of reference class $j = k$, the estimated variance is

$$\hat{V}(\hat{P}_j) = \frac{1}{\hat{N}_{\cdot j}^2} \left[ \frac{N_{j\cdot}^2 (1 - \hat{P}_j)^2 \hat{U}_j (1 - \hat{U}_j)}{n_{j\cdot} - 1} + \hat{P}_j^2 \sum_{i \neq j}^{q} N_{i\cdot}^2 \frac{n_{ij}}{n_{i\cdot}} \left(1 - \frac{n_{ij}}{n_{i\cdot}}\right) / (n_{i\cdot} - 1) \right] \tag{7}$$

where $\hat{N}_{\cdot j} = \sum_{i=1}^{q} \frac{N_{i\cdot}}{n_{i\cdot}} n_{ij}$ is the estimated marginal total number of pixels of reference class $j$, $N_{j\cdot}$ is the marginal total of map class $j$ and $n_{j\cdot}$ is the total number of sample units in map class $j$. These are the usual variance estimators applied to the stratified sampling, and the estimators would be viewed as poststratified variance estimators for simple random and systematic sampling. For systematic sampling, the variance estimators are approximations that usually result in overestimation of variance. These variance estimators are also based on assumptions that the assessment unit for the response design is a pixel and each pixel has a hard classification for the map and a hard classification for the reference data. The variance estimators would not apply to a polygon assessment unit or to a mixed pixel situation.

4.4. Estimating area

The error matrix also provides the basis for estimating the area of classes such as those representing change and no-change. The population error matrix (Table 4) provides two different approaches for estimating the proportion of area. Suppose we are interested in estimating the proportion of area of class $k$. The row and column totals are the sums of the $p_{ij}$ values in the respective rows and columns. Thus, the row total $p_{k\cdot}$ represents the proportion of area mapped as class $k$ (e.g., if $k$ is a change class such as forest loss then $p_{k\cdot}$ is the proportion of area mapped as forest loss) and the column total $p_{\cdot k}$ represents the proportion of area of class $k$ as determined from the reference classification (e.g., $p_{\cdot k}$ would be the proportion of area of forest loss as determined from the reference classification).

The two area proportion parameters for class $k$ (i.e., $p_{k\cdot}$ and $p_{\cdot k}$) are unlikely to have the same value, so a decision arises as to which parameter should be the focus. Once a change map is complete, $p_{k\cdot}$ is known, but because the reference classification is available only for a sample, $p_{\cdot k}$ must be estimated from the sample. Consequently, the need to estimate $p_{\cdot k}$ introduces uncertainty in the form of sampling variability, whereas $p_{k\cdot}$ is not subject to sampling variability (Stehman, 2005). The map-based parameter $p_{k\cdot}$ is known with certainty but likely biased because of classification error. Conversely, $p_{\cdot k}$ is determined from the reference classification. Therefore, $p_{\cdot k}$ should have smaller bias than $p_{k\cdot}$ (i.e., the bias attributable to reference data error is smaller than the bias attributable to map classification error). The “good practice” guidelines are founded on the premise that the reference classification is superior in quality relative to the map classification and that the sampling design implemented yields estimates with small standard errors. Consequently, we recommend that area estimation should be based on $p_{\cdot k}$, the proportion of area derived from the reference classification.

A variety of estimators has been proposed for estimating $p_{\cdot k}$ from the error matrix. For any sampling design and response design leading to an estimated error matrix with $\hat{p}_{ij}$ in terms of proportion of area, a direct estimator of the proportion of area of class $k$ is

$$\hat{p}_{\cdot k} = \sum_{i=1}^{q} \hat{p}_{ik}. \tag{8}$$

This estimator is simply the sum of the estimated area proportions of class $k$ as determined from the reference classification (i.e., the sum of column $k$ of the estimated error matrix). If the sampling design is simple random, systematic, or stratified random with the map classes defined as the strata, Eq. (8) would be computed using $\hat{p}_{ij} = W_i \frac{n_{ij}}{n_{i\cdot}}$ leading to the often used special case estimator

$$\hat{p}_{\cdot k} = \sum_{i=1}^{q} W_i \frac{n_{ik}}{n_{i\cdot}}. \tag{9}$$

This estimator is a poststratified estimator for simple random and systematic sampling, and it is the direct stratified estimator of $p_{\cdot k}$ for stratified random sampling when the map classes are the strata. For these sampling designs, the stratified estimator (Eq. (9)) generally has better precision than a variety of alternative estimators of area (Stehman, 2013) and consequently the stratified estimator is recommended.

For the stratified estimator of proportion of area (Eq. (9)), the standard error is estimated by

$$S(\hat{p}_{\cdot k}) = \sqrt{\sum_{i} W_i^2 \frac{\frac{n_{ik}}{n_{i\cdot}} \left(1 - \frac{n_{ik}}{n_{i\cdot}}\right)}{n_{i\cdot} - 1}} = \sqrt{\sum_{i} \frac{W_i \hat{p}_{ik} - \hat{p}_{ik}^2}{n_{i\cdot} - 1}} \tag{10}$$

where $n_{ik}$ is the sample count at cell $(i,k)$ in the error matrix, $W_i$ is the area proportion of map class $i$, $\hat{p}_{ik} = W_i \frac{n_{ik}}{n_{i\cdot}}$ and the summation is over the $q$ classes. For systematic sampling, Eq. (10) is an approximation that is typically an overestimate for the actual standard error of systematic sampling. The estimated area of class $k$ is $\hat{A}_k = A \hat{p}_{\cdot k}$, where $A$ is the total map area. The standard error of the estimated area is given by

$$S(\hat{A}_k) = A \times S(\hat{p}_{\cdot k}). \tag{11}$$

An approximate 95% confidence interval is obtained as $\hat{A}_k \pm 1.96 \times S(\hat{A}_k)$.

5. Example of good practices: estimating area and assessing accuracy of forest change

The following hypothetical example illustrates the workflow of assessing accuracy of a forest change map and estimating area. Consider a change map for 2000 to 2010 consisting of two change classes and two stable classes: deforestation, forest gain, stable forest and stable non-forest. The map was produced by supervised classification of data from Landsat ETM+ with the objective of estimating the gross rates of forest loss and gain. The first step in the assessment was to visually inspect the change map and identify obvious errors by comparing the classified results to the Landsat data of 2000 and 2010. Misclassified regions were relabeled before proceeding to the rigorous evaluation of the map. After obvious errors were removed, the areas of the map classes were 200,000 Landsat pixels (18,000 ha) of deforestation, 150,000 pixels (13,500 ha) of forest gain, 3,200,000 pixels (288,000 ha) of stable forest, and 6,450,000 pixels (580,500 ha) of stable non-forest. The two change classes thus occupy 3.5% of the total map area. The accuracy assessment was designed for the objectives of estimating overall and class-specific accuracies, areas of the individual classes (as determined by the reference classification), and confidence intervals for each accuracy and area parameter. The spatial assessment unit in this example is a Landsat pixel (30 m × 30 m).

5.1. Sampling design

A stratified random sampling design with the four map classes as strata adheres to the recommended practices outlined in Section 2 and satisfies the accuracy assessment and area estimation objectives. In the next two subsections, we present sample size and sample allocation planning calculations for the stratified design. Sample size planning is an inexact science because it is dependent on accuracy and area information that must be speculative prior to conducting the actual accuracy

assessment. Nevertheless, these planning calculations can provide informative insight into the choices of sample size and sample allocation to strata.

5.1.1. Determining the sample size

For simple random sampling and targeting overall accuracy as the estimation objective, Cochran (1977, Eq. (4.2)) suggests using a sample size of

$$n = \frac{z^2 O (1 - O)}{d^2} \tag{12}$$

where $O$ is the overall accuracy expressed as a proportion, $z$ is a percentile from the standard normal distribution ($z = 1.96$ for a 95% confidence interval, $z = 1.645$ for a 90% confidence interval), and $d$ is the desired half-width of the confidence interval of $O$. Eq. (12) provides a starting point for assessing sample size for the limited scope of estimating overall accuracy.

For stratified random sampling, Cochran (1977, Eq. (5.25)) provides the following sample size formula (the cost of sampling each stratum is assumed the same):

$$n = \frac{\left(\sum W_i S_i\right)^2}{\left[S(\hat{O})\right]^2 + (1/N) \sum W_i S_i^2} \approx \left( \frac{\sum W_i S_i}{S(\hat{O})} \right)^2 \tag{13}$$

where $N$ = number of units in the ROI, $S(\hat{O})$ is the standard error of the estimated overall accuracy that we would like to achieve, $W_i$ is the mapped proportion of area of class $i$, and $S_i$ is the standard deviation of stratum $i$, $S_i = \sqrt{U_i (1 - U_i)}$ (Cochran, 1977, Eq. (5.55)). Because $N$ is typically large (e.g., over 10 million pixels in this example), the second term in the denominator of Eq. (13) can be ignored. We specify a target standard error for overall accuracy of 0.01. Suppose from past experience with similar change mapping efforts we know that errors of commission are relatively common for the change classes while the stable classes are more accurate (e.g., Olofsson et al., 2010, 2011). Consequently, we conjecture that user's accuracies of the two change classes will be 0.70 for deforestation and 0.60 for forest gain, and user's accuracies of the stable classes will be 0.90 for stable forest and 0.95 for stable non-forest. The resulting sample size from Eq. (13) is $n = 641$. These sample size calculations should be repeated for a variety of choices of $S(\hat{O})$ and $U_i$ before reaching a final decision.

5.1.2. Determine sample allocation to strata

Once the overall sample size is chosen, we determine the allocation of the sample to strata. It is important that the sample size allocation results in precise estimates of accuracy and area. Stehman (2012) identifies four different approaches to sample allocation: proportional, equal, optimal and power allocation. In proportional allocation, the sample size per map class is proportional to the relative area of the map class. In this example, as is usually the case when mapping land change, the mapped areas of change are small relative to other classes so proportional allocation will lead to small sample sizes in the rare classes (unless $n$ is very large) and imprecise estimates of user's accuracy for these rare classes. Allocating an equal sample size to all strata targets estimation of user's accuracy of each map class but equal allocation is not optimized for estimating area and overall accuracy. Neyman optimal allocation (Cochran, 1977) can be used to minimize the variance of the estimator of overall accuracy or the estimator of area, but optimal allocation becomes difficult to implement when multiple estimation objectives are of interest as will be the case when estimating accuracy and area of several land-cover classes or land-cover change types.

We suggest the following simplified approach to sample size allocation. Allocate a sample size of 50–100 for each change stratum, using the variance estimator for user's accuracy (Eq. (6)) to decide the sample size needed to achieve certain standard errors for the assumed estimated user's accuracy for that class. A small overall sample size might allow for only 50 sample units per rare class stratum. Suppose that $n - r$ sample units remain after a sample size of $r$ units has been allocated to the rare class strata. The sample size of $n - r$ is then allocated proportionally to the area of each remaining stratum. The anticipated estimated variances can then be computed (based on the sample size allocation) for user's and overall accuracy and area using Eqs. (5), (6) and (10). The sample size allocation process can be iterated until an allocation is found that yields satisfactory anticipated standard errors for the key accuracy and area estimates. The effect of the choice of sample allocation will be observed in the standard errors of the estimates; however, a poor allocation of sample size to strata will not result in biased estimators.

In this example, we know the mapped areas of the four map classes ($W_i$), we have conjectured values of user's accuracies and standard deviations of the strata, and we have estimated a total sample size of 641 (Table 5). The resulting sample sizes for proportional and equal allocation are shown in Table 5. As described above, neither of these is optimal and we want to find a compromise between the two. We start by allocating 100 sample units each to the change classes and then allocate the remainder of the sample size proportionally to the stable classes. This gives the allocation in column “Alloc1”. Since the recommendation is to allocate between 50 and 100 sample units in the change strata, we introduce two additional allocations with 75 and 50 sample units in the change strata, respectively (“Alloc2” and “Alloc3”). To determine which of these allocations to use, we need to examine the standard errors of the estimated user's accuracy, estimated overall accuracy, and estimated areas using Eqs. (5), (6) and (10).

It is necessary to speculate on the outcome of the accuracy assessment to compute the anticipated standard errors for each sample allocation considered. The hypothesized error matrix in Table 6 reflects the anticipated outcome that the change classes will be rare and have lower class-specific accuracies than the two stable classes. The population error matrix was also constructed to yield the hypothesized accuracies input into the sample size planning calculations of the previous section. When creating the hypothesized error matrix used for sample size and sample allocation planning, we should draw upon any past experience for insight into the accuracy of the map to be produced.

Table 7 shows the standard errors of the user's and overall accuracies and estimated areas of both deforestation and stable forest for each of the five sample allocations in Table 5 and the hypothetical population error matrix of Table 6. No single allocation is best for all estimation objectives, so a choice among competing objectives is necessary. The emphasis on prioritizing objectives during the planning stage (Section 2) becomes particularly relevant to the decision of sample allocation because different allocations favor different estimation objectives. For example, equal allocation gives the smallest standard error of the user's accuracy of deforestation but a high standard error of the estimated area of deforestation. Proportional allocation will result in smaller standard errors of overall accuracy and area of stable forest but the standard error for estimated user's accuracy of deforestation is two to four times larger than the corresponding standard errors for other sample allocations. In this case, “Alloc1–3” provide allocations that generate relatively small standard errors for the different estimates. We will choose “Alloc2” with 75 sample units in the two change classes.

Table 5
Information needed to decide allocation of sample size to strata. The information includes the mapped area proportions (Wi), conjectured values of user's accuracies (Ui) and standard deviations (Si) of the strata. Columns 5–9 contain five different allocations.

Strata (i)             Wi      Ui      Si      Equal   Alloc1   Alloc2   Alloc3   Prop
1 Deforestation        0.020   0.700   0.458   160     100      75       50       13
2 Forest gain          0.015   0.600   0.490   160     100      75       50       10
3 Stable forest        0.320   0.900   0.300   160     149      165      182      205
4 Stable non-forest    0.645   0.950   0.218   160     292      325      358      413

Table 6
Hypothetical population error matrix expressed in terms of proportion of area (see Section 4) used for sample size and sample allocation planning calculations. Rows are map classes; columns are reference classes.

Map \ Reference      Deforestation   Forest gain   Stable forest   Stable non-forest   Total (Wi)   Ui
Deforestation        0.014           0             0.003           0.003               0.020        0.70
Forest gain          0               0.009         0.003           0.003               0.015        0.60
Stable forest        0.002           0             0.288           0.030               0.320        0.90
Stable non-forest    0.004           0.002         0.025           0.614               0.645        0.95
Total                0.020           0.011         0.319           0.650               1
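
The planning calculation for user's accuracy can be sketched in a few lines. The snippet below is an illustration rather than the authors' tool: it assumes Eq. (6) has the usual stratified form, with the variance of user's accuracy equal to U(1 − U)/(n − 1), and it takes the conjectured accuracies and the per-stratum sample sizes for deforestation (i = 1) and stable forest (i = 3) from Table 5. The function and variable names are hypothetical.

```python
import math


def se_users_accuracy(u, n):
    """Standard error of user's accuracy for a stratum.

    Assumes the variance form U * (1 - U) / (n - 1), where u is the
    conjectured user's accuracy and n the stratum sample size.
    """
    return math.sqrt(u * (1.0 - u) / (n - 1))


# Conjectured user's accuracies from Table 5.
U_DEFORESTATION = 0.70
U_STABLE_FOREST = 0.90

# Sample sizes (deforestation, stable forest) for each allocation in Table 5.
allocations = {
    "Equal": (160, 160),
    "Alloc1": (100, 149),
    "Alloc2": (75, 165),
    "Alloc3": (50, 182),
    "Proportional": (13, 205),
}

se_table = {
    name: (
        round(se_users_accuracy(U_DEFORESTATION, n1), 3),
        round(se_users_accuracy(U_STABLE_FOREST, n3), 3),
    )
    for name, (n1, n3) in allocations.items()
}

for name, (se1, se3) in se_table.items():
    print(f"{name:13s} S(U1) = {se1:.3f}   S(U3) = {se3:.3f}")
```

Under these assumptions the rounded values agree with the S(Û1) and S(Û3) columns of Table 7, which shows how quickly the precision of the rare-class user's accuracy degrades as the change strata receive fewer sample units.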

Table 7
Standard errors of selected accuracy and area estimates for different sample size allocations to strata (Table 5) and the hypothetical population error matrix (Table 6). Standard errors are shown for estimated overall accuracy, estimated user's accuracy for the rare class deforestation (i = 1) and the common class stable forest (i = 3), and estimated area (in units of hectares) of deforestation and area of stable forest.

Allocation     S(Ô)    S(Û1)   S(Û3)   S(Â1)   S(Â3)
Equal          0.013   0.036   0.024   4035    11,306
Alloc1         0.011   0.046   0.025   3307    9744
Alloc2         0.011   0.053   0.023   3138    9270
Alloc3         0.010   0.065   0.022   3125    8860
Proportional   0.010   0.132   0.021   3600    8614

5.2. Estimating accuracy, area and confidence intervals

To create the reference classification for labeling each sample unit, a combination of Landsat data from the USGS open archive together with Google Earth™ provides a source of cost-free reference data. Our hypothetical map was produced using Landsat, and the good practice recommendations stipulate that if the same data are used to create both the map and the reference classifications, the process of creating the latter should be of higher quality than the map-making process. The process of labeling the sample units thus has to be more accurate than supervised classification. A manual inspection by three analysts of each of the sample units, using a set of Landsat images together with Google Earth™ imagery acquired around the same time as the images used to make the map, is assumed to be a more accurate process than supervised classification. The error matrix resulting from this response design and sample is presented in terms of the sample counts displayed in Table 8, and the computations for the accuracy and area estimates are detailed in the following two subsections.

5.2.1. Estimating accuracy
Because the sampling design is stratified random using the map classes as strata, the cell entries of the error matrix are estimated using Eq. (4). We can now estimate user's accuracy $\hat{U}_i = \hat{p}_{ii} / \hat{p}_{i\cdot}$, producer's accuracy $\hat{P}_j = \hat{p}_{jj} / \hat{p}_{\cdot j}$, and overall accuracy $\hat{O} = \sum_{j=1}^{q} \hat{p}_{jj}$ using the estimated area proportions. Variances for these accuracy measures are estimated using Eqs. (5)–(7). 95% confidence intervals are estimated as $\pm 1.96 \sqrt{\hat{V}(\hat{U}_i)}$ (replace $\hat{U}_i$ with $\hat{P}_j$ and $\hat{O}$ for the producer's and overall accuracies). In this case, the estimated user's accuracy (±95% confidence interval) is 0.88 ± 0.07 for deforestation, 0.73 ± 0.10 for forest gain, 0.93 ± 0.04 for stable forest, and 0.96 ± 0.02 for stable non-forest. The estimated producer's accuracy is 0.75 ± 0.21 for deforestation, 0.85 ± 0.23 for forest gain, 0.93 ± 0.03 for stable forest, and 0.96 ± 0.01 for stable non-forest. The estimated overall accuracy is 0.95 ± 0.02.

5.2.2. Estimating area and uncertainty
The next step is to use the estimated area proportions in Table 9 to estimate the area of each class. The row totals of the error matrix in Table 9 give the mapped area proportions (which are also given by Wi), while the column totals give the estimated area proportions according to the reference data. Multiplying the latter by the total map area gives the stratified area estimate of each class according to the reference data. For example, the estimated area of deforestation according to the reference data is $\hat{A}_1 = \hat{p}_{\cdot 1} \times A_{tot} = 0.0235086 \times 10{,}000{,}000$ pixels = 235,086 pixels = 21,158 ha. The mapped area of deforestation ($A_{m,1}$ = 200,000 pixels) thus underestimates the area of deforestation by 35,086 pixels or 3158 ha.

The second step is to estimate a confidence interval for the area of each class. From Eq. (10), $S(\hat{p}_{\cdot 1}) \approx 0.0035$, and the standard error for the estimated area of deforestation is $S(\hat{A}_1) = S(\hat{p}_{\cdot 1}) \times A_{tot} \approx 0.0035 \times 10{,}000{,}000$ = 34,907 pixels. The margin of error of the confidence interval is 1.96 × 34,907 = 68,418 pixels = 6158 ha. We have thus estimated the area of deforestation with a 95% confidence interval: 21,158 ± 6158 ha. The area estimate with a 95% confidence interval of the forest gain class is 11,686 ± 3756 ha; stable forest is 285,770 ± 15,510 ha and stable non-forest 581,386 ± 16,282 ha.

This example has illustrated the workflow of assessing accuracy, and estimating area and confidence intervals of area of the classes of a change map. While this is fairly straightforward once the error matrix has been constructed, the example highlights the need to consider different objectives when designing the sample.

A tool for estimating unbiased accuracy measures and areas with 95% confidence intervals can be downloaded from www.people.bu.edu/olofsson/ (click ‘Research’ > ‘Accuracy/Uncertainty’). The tool is implemented in Matlab™.

6. Summary

Conducting an accuracy assessment of a land change map serves multiple purposes. In addition to the obvious purpose of quantifying the accuracy of the map, the reference sample serves as the basis of estimates of area of each class, where area is defined by the reference classification. The accuracy assessment sample data also contribute to estimates of uncertainty of the area estimates. Without an accuracy assessment, there is no way to communicate map quality in a quantitative and meaningful fashion. We acknowledge that there is no singular “best” approach and the recommendations provided do not preclude the existence of other acceptable practices. However, following the “good practice” recommendations presented in this paper ensures the scientific credibility of the accuracy and area estimates. The “good practice” recommendations are summarized as follows, organized by the three major components of the accuracy assessment methodology: the sampling design, response design, and analysis.

6.1. General

• Visually inspect the map and correct obvious errors before conducting the accuracy assessment.
• Accuracy and area estimates will be determined from a classification (i.e., the reference classification) that is of higher quality than the land change map being evaluated.
• A sampling approach is needed because the cost of obtaining the reference classification for the entire region of interest will be prohibitive.

Table 8
Description of sample data as an error matrix of sample counts, nij (see Table 9 for recommended estimated error matrix used to report accuracy results). Rows are map classes; columns are reference classes.

Map \ Reference      Deforestation   Forest gain   Stable forest   Stable non-forest   Total   Am,i [pixels]   Wi
Deforestation        66              0             5               4                   75      200,000         0.020
Forest gain          0               55            8               12                  75      150,000         0.015
Stable forest        1               0             153             11                  165     3,200,000       0.320
Stable non-forest    2               1             9               313                 325     6,450,000       0.645
Total                69              56            175             340                 640     10,000,000      1

• The sample used for accuracy assessment and area estimation is separate from (independent of) the data used to train or develop the classification.

6.2. Sampling design

• Implement a probability sampling design to provide a rigorous foundation via design-based sampling inference.
• Document and quantify any deviations from the probability sampling protocol.
• Choose a sampling design on the basis of specified accuracy objectives and prioritized desirable design criteria.
• Sampling design guidelines.
  ○ Stratify by map class to reduce standard errors of class-specific accuracy estimates.
  ○ If resources are adequate, stratify by subregions to reduce standard errors of subregion-specific estimates.
  ○ Use cluster sampling if it provides a substantial cost savings or if the objectives require a cluster unit for the assessment.
  ○ Both simple random and systematic selection protocols are acceptable options.
• The recommended allocation of sample size to strata (assuming the map classes are the strata) is to increase the sample size for rare change classes to achieve an acceptable standard error for estimated user's accuracies and to allocate the remaining sample size roughly proportional to the area occupied by the common classes.
• Use sample size and optimal allocation planning calculations as a guide to decisions on total sample size and sample allocation.
• Evaluate the potential outcome of sample size and sample allocation decisions on the standard errors of accuracy and area estimates for hypothetical error matrices based on the anticipated accuracy of the map.
• Stratified random sampling using the map classification to define strata is a simple but generally applicable design that will typically satisfy most accuracy and area estimation objectives and desirable design criteria.

6.3. Response design

• Reference data should be of higher quality than the data used for creating the map, or if using the same source, the process of creating the reference classification should be more accurate than the process of creating the map.
• High overhead cost may eliminate field visits as a source of reference data.
• The reference data should provide sufficient temporal representation consistent with the change period of the map.
• Data from the Landsat open archive in combination with high spatial resolution imagery provide a low-cost and often useful source of reference data (national photograph archives, satellite photo archives (e.g., Kompsat), and the collections available through Google Earth™ are possible high resolution imagery sources).
• Specify protocols for accounting for uncertainty in assigning the reference classifications.
• Assign each sample unit a primary and secondary label (secondary not required if there is high confidence in the primary label).
• Include an interpreter-specified confidence for each reference label (e.g., high, medium, or low confidence).
• Implement protocols to ensure consistency among individual interpreters or teams of interpreters.
• Specify a protocol for defining agreement between the map and reference classifications that will lead to an error matrix expressed in terms of proportion of area.

6.4. Analysis

• Report the error matrix in terms of estimated area proportions.
• Report the area (or proportion of area) of each class as determined from the map.
• Report user's accuracy (or commission error), producer's accuracy (or omission error), and overall accuracy (Eqs. (1)–(3)).
• Avoid use of the kappa coefficient of agreement for reporting accuracy of land change maps.
• Estimate the area of each class according to the classification determined from the reference data.
• Use estimators of accuracy and area that are unbiased or consistent.
• For simple random, systematic, and stratified random sampling when the map classes are defined as strata, use stratified estimators of accuracy (Eqs. (5)–(7)) and a stratified estimator of area (Eq. (9)).
• Quantify sampling variability of the accuracy and area estimates by reporting standard errors or confidence intervals.
• Use design-based inference to define estimator properties and to quantify uncertainty.
• Assess the impact of reference data uncertainty on the accuracy and area estimates.
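
The allocation rule recommended in Section 6.2 (a fixed sample size for each rare change class, with the remainder spread over the common classes in proportion to mapped area) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the function name `allocate_sample` and the largest-remainder rounding are choices made here, and the inputs are the mapped area proportions and total sample size (n = 640) of the worked example.

```python
def allocate_sample(n_total, weights, rare_classes, n_rare):
    """Give each rare class n_rare units, then split the remaining
    units among the other classes in proportion to mapped area."""
    allocation = {name: n_rare for name in rare_classes}
    remaining = n_total - n_rare * len(rare_classes)

    common = {k: w for k, w in weights.items() if k not in rare_classes}
    weight_sum = sum(common.values())
    exact = {k: remaining * w / weight_sum for k, w in common.items()}

    # Largest-remainder rounding so the allocation sums to n_total.
    rounded = {k: int(v) for k, v in exact.items()}
    leftover = remaining - sum(rounded.values())
    by_fraction = sorted(exact, key=lambda k: exact[k] - rounded[k], reverse=True)
    for k in by_fraction[:leftover]:
        rounded[k] += 1

    allocation.update(rounded)
    return allocation


# Mapped area proportions (Wi) of the four strata in the example.
weights = {
    "Deforestation": 0.020,
    "Forest gain": 0.015,
    "Stable forest": 0.320,
    "Stable non-forest": 0.645,
}

allocation = allocate_sample(
    n_total=640,
    weights=weights,
    rare_classes=["Deforestation", "Forest gain"],
    n_rare=75,
)
print(allocation)
```

With 75 units in each change class, strict proportionality assigns 162 and 328 units to stable forest and stable non-forest, close to but not identical with the 165 and 325 units of “Alloc2”, which is consistent with the recommendation that the remainder be allocated only roughly in proportion to area.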

Table 9
The error matrix in Table 8 populated by estimated proportions of area. Rows are map classes; columns are reference classes.

Map \ Reference      Deforestation   Forest gain   Stable forest   Stable non-forest   Total (Wi)   Am,i [pixels]
Deforestation        0.0176          0             0.0013          0.0011              0.020        200,000
Forest gain          0               0.0110        0.0016          0.0024              0.015        150,000
Stable forest        0.0019          0             0.2967          0.0213              0.320        3,200,000
Stable non-forest    0.0040          0.0020        0.0179          0.6212              0.645        6,450,000
Total                0.0235          0.0130        0.3175          0.6460              1            10,000,000
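
The step from the sample counts of Table 8 to the area proportions of Table 9, and on to the accuracy and area estimates of Section 5.2, can be sketched as below. The snippet is an illustration under stated assumptions, not the authors' Matlab tool: it assumes the stratified estimator p̂_ij = W_i n_ij / n_i· for the cell proportions, the usual stratified variance for a column total, and a pixel area of 0.09 ha (30 m Landsat pixels), which is consistent with the pixel-to-hectare conversions quoted in the text. All names are hypothetical.

```python
import math

classes = ["Deforestation", "Forest gain", "Stable forest", "Stable non-forest"]

# Sample counts n_ij from Table 8 (rows: map class, columns: reference class).
counts = [
    [66, 0, 5, 4],
    [0, 55, 8, 12],
    [1, 0, 153, 11],
    [2, 1, 9, 313],
]
mapped_pixels = [200_000, 150_000, 3_200_000, 6_450_000]  # A_m,i
HA_PER_PIXEL = 0.09  # assumption: 30 m x 30 m pixels

q = len(classes)
total_pixels = sum(mapped_pixels)
weights = [a / total_pixels for a in mapped_pixels]  # W_i
row_n = [sum(row) for row in counts]                 # n_i.

# Estimated area proportions (assumed form of Eq. (4)).
p = [[weights[i] * counts[i][j] / row_n[i] for j in range(q)] for i in range(q)]
col_total = [sum(p[i][j] for i in range(q)) for j in range(q)]  # p_.j

users = [p[i][i] / weights[i] for i in range(q)]
producers = [p[j][j] / col_total[j] for j in range(q)]
overall = sum(p[j][j] for j in range(q))


def se_area_proportion(j):
    """Standard error of the estimated area proportion of reference class j."""
    variance = 0.0
    for i in range(q):
        share = counts[i][j] / row_n[i]
        variance += weights[i] ** 2 * share * (1.0 - share) / (row_n[i] - 1)
    return math.sqrt(variance)


print(f"Overall accuracy: {overall:.2f}")
for j, name in enumerate(classes):
    area_ha = col_total[j] * total_pixels * HA_PER_PIXEL
    margin_ha = 1.96 * se_area_proportion(j) * total_pixels * HA_PER_PIXEL
    print(
        f"{name}: user's {users[j]:.2f}, producer's {producers[j]:.2f}, "
        f"area {area_ha:,.0f} +/- {margin_ha:,.0f} ha"
    )
```

Under these assumptions the printed accuracies and area estimates agree with the values reported in Sections 5.2.1 and 5.2.2, for example 21,158 ± 6158 ha for deforestation.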

The recommendations provided are intended to serve as guidelines for choosing from among options of sampling design, response design, and analysis that will yield rigorous and defensible accuracy and area estimates. But good practice is not static. As improvements in technology become available and new methods are developed, good practice recommendations will evolve over time. Also, as practical experience accumulates with using new technology and methodologies, good practice recommendations will be further amended to provide even more efficient yet still rigorous methods to estimate accuracy and area of land change.

Acknowledgments

This research was funded by the USGS Award Support for SilvaCarbon and NASA through its support for the Carbon Monitoring System to Boston University, and NASA Grant Number NNX13AP48G to State University of New York. We acknowledge the European Space Agency (ESA) and NASA for their support to GOFC-GOLD and the CEOS working group of calibration and validation. We thank the anonymous reviewers for the comments that helped improve the manuscript.

References

Achard, F., Eva, H., Stibig, H.-J., Mayaux, P., Gallego, J., Richards, T., et al. (2002). Determination of deforestation rates of the world's humid tropical forests. Science, 297, 999–1002.
Ahlqvist, O. (2008). In search of classification that supports the dynamics of science: The FAO Land Cover Classification System and proposed modifications. Environment and Planning B: Planning and Design, 35, 169–186.
Baker, B. A., Warner, T. A., Conley, J. F., & McNeil, B. E. (2013). Does spatial resolution matter? A multi-scale comparison of object-based and pixel-based methods for detecting change associated with gas well drilling operations. International Journal of Remote Sensing, 34, 1633–1651.
Binaghi, E., Brivio, P. A., Ghezzi, P., & Rampini, A. (1999). A fuzzy set-based accuracy assessment of soft classification. Pattern Recognition Letters, 20, 935–948.
Cakir, H. I., Khorram, S., & Nelson, S. A. C. (2006). Correspondence analysis for detecting land cover change. Remote Sensing of Environment, 102, 306–317.
Card, D. H. (1982). Using map category marginal frequencies to improve estimates of thematic map accuracy. Photogrammetric Engineering and Remote Sensing, 49, 431–439.
Cochran, W. G. (1977). Sampling techniques (3rd ed.). New York: John Wiley & Sons.
Cohen, W. B., Yang, Z., & Kennedy, R. (2010). Detecting trends in forest disturbance and recovery using yearly Landsat time series: 2. TimeSync – Tools for calibration and validation. Remote Sensing of Environment, 114, 2911–2924.
Comber, A. J., Wadsworth, R. A., & Fisher, P. F. (2008). Using semantics to clarify the conceptual confusion between land cover and land use: The example of ‘forest’. Journal of Land Use Science, 3, 185–198.
Congalton, R., & Green, K. (2009). Assessing the accuracy of remotely sensed data: Principles and practices (2nd ed.). Boca Raton: CRC/Taylor & Francis.
de Sy, V., Herold, M., Achard, F., Asner, G. P., Held, A., Kellndorfer, J., et al. (2012). Synergies of multiple remote sensing data sources for REDD+ monitoring. Current Opinion in Environmental Sustainability, 4, 696–706.
DeFries, R., Achard, F., Brown, S., Herold, M., Murdiyarso, D., Schlamadinger, B., et al. (2007). Earth observations for estimating greenhouse gas emissions from deforestation in developing countries. Environmental Science and Policy, 10, 385–394.
DeFries, R., Houghton, R. A., Hansen, M., Field, C., Skole, D. L., & Townshend, J. (2002). Carbon emissions from tropical deforestation and regrowth based on satellite observations for the 1980s and 90s. Proceedings of the National Academy of Sciences, 99, 14256–14261.
Drummond, M. A., & Loveland, T. R. (2010). Land-use pressure and a transition to forest-cover loss in the eastern United States. BioScience, 60, 286–298.
Duro, D. C., Franklin, S. E., & Duba, M. G. (2012). A comparison of pixel-based and object-based image analysis with selected machine learning algorithms for the classification of agricultural landscapes using SPOT-5 HRG imagery. Remote Sensing of Environment, 118, 259–272.
Falkowski, M. J., Wulder, M. A., White, J. C., & Gillis, M. D. (2009). Supporting large-area, sample-based forest inventories with very high spatial resolution satellite imagery. Progress in Physical Geography, 33, 403–423.
FAO (2010). Global forest resources assessment 2010. Food and Agriculture Organization of the United Nations.
Foody, G. M. (1992). On the compensation for chance agreement in image classification accuracy assessment. Photogrammetric Engineering and Remote Sensing, 58, 1459–1460.
Foody, G. M. (1996). Approaches for the production and evaluation of fuzzy land cover classifications from remotely sensed data. International Journal of Remote Sensing, 17, 1317–1340.
Foody, G. M. (2002). Status of land cover classification accuracy assessment. Remote Sensing of Environment, 80, 185–201.
Foody, G. M. (2010). Assessing the accuracy of land cover change with imperfect ground reference data. Remote Sensing of Environment, 114, 2271–2285.
Foody, G. M. (2013). Ground reference data error and the mis-estimation of the area of land cover change as a function of its abundance. Remote Sensing Letters, 4, 783–792.
Foody, G. M., & Boyd, D. S. (2013). Using volunteered data in land cover map validation: Mapping West African forests. IEEE Journal of Selected Topics in Applied Earth Observation and Remote Sensing, 6, 1305–1312.
Foody, G. M., Campbell, N. A., Trodd, N. M., & Wood, T. F. (1992). Derivation and applications of probabilistic measures of class membership from the maximum likelihood classification. Photogrammetric Engineering and Remote Sensing, 58, 1335–1341.
Gallego, F. J. (2012). The efficiency of sampling very high resolution images for area estimation in the European Union. International Journal of Remote Sensing, 33, 1868–1880.
GOFC-GOLD (2011). A sourcebook of methods and procedures for monitoring and reporting anthropogenic greenhouse gas emissions and removals caused by deforestation, gains and losses of carbon stocks in forests remaining forests, and forestation. GOFC-GOLD Report version COP17-1, (GOFC-GOLD Project Office, Natural Resources Canada, Alberta, Canada).
Gómez, C., White, J. C., & Wulder, M. A. (2011). Characterizing the state and processes of change in a dynamic forest environment using hierarchical spatio-temporal segmentation. Remote Sensing of Environment, 115, 1665–1679.
Gopal, S., & Woodcock, C. (1994). Theory and methods for accuracy assessment of thematic maps using fuzzy sets. Photogrammetric Engineering and Remote Sensing, 60, 181–188.
Grassi, G., Monni, S., Federici, S., Achard, F., & Mollicone, D. (2008). Applying the conservativeness principle to REDD to deal with the uncertainties of the estimates. Environmental Research Letters, 3, 3.
Hansen, M. C., Stehman, S. V., & Potapov, P. V. (2010). Quantification of global gross forest cover loss. Proceedings of the National Academy of Sciences, 107, 8650–8655.
He, Y. H., Franklin, S. E., Guo, X. L., & Stenhouse, G. B. (2011). Object-orientated classification of multi-resolution images for the extraction of narrow linear forest disturbance. Remote Sensing Letters, 2, 147–155.
Huang, C., Goward, S. N., Masek, J. G., Thomas, N., Zhu, Z., & Vogelmann, J. E. (2010). An automated approach for reconstructing recent forest disturbance history using dense Landsat time series stacks. Remote Sensing of Environment, 114, 183–198.
Hyyppä, J., Hyyppä, H., Inkinen, M., Engdahl, M., Linko, S., & Zhu, Y. H. (2000). Accuracy comparison of various remote sensing data sources in the retrieval of forest stand attributes. Forest Ecology and Management, 128, 109–120.
Iwao, K., Nishida, K., Kinoshita, T., & Yamagata, Y. (2006). Validating land cover maps with Degree Confluence Project information. Geophysical Research Letters, 33 (L23404).
Jeon, S. B., Olofsson, P., & Woodcock, C. E. (2013). Land use change in New England: A reversal of the forest transition. Journal of Land Use Science. http://dx.doi.org/10.1080/1747423X.2012.754962.
Johnson, B. A. (2013). High-resolution urban land-cover classification using a competitive multi-scale object-based approach. Remote Sensing Letters, 4, 131–140.
Kelly, M., Estes, J. E., & Knight, K. A. (1999). Image interpretation keys for validation of global land-cover data sets. Photogrammetric Engineering & Remote Sensing, 65, 1041–1050.
Kennedy, R., Yang, Z., & Cohen, W. B. (2010). Detecting trends in forest disturbance and recovery using yearly Landsat time series: 1. LandTrendr – Temporal segmentation algorithms. Remote Sensing of Environment, 114, 2897–2910.
Knight, J. F., & Lunetta, R. S. (2003). An experimental assessment of minimum mapping unit size. IEEE Transactions on Geoscience and Remote Sensing, 40, 2132–2134.
Kurz, W. A. (2010). An ecosystem context for global gross forest cover loss estimates. Proceedings of the National Academy of Science, 107, 9025–9026.
Lewis, H. G., & Brown, M. (2001). A generalized confusion matrix for assessing area estimates from remotely sensed data. International Journal of Remote Sensing, 22, 3223–3235.
Lindberg, E., Olofsson, K., Holmgren, J., & Olsson, H. (2012). Estimation of 3D vegetation structure from waveform and discrete return airborne laser scanning data. Remote Sensing of Environment, 118, 151–161.
Liu, C., Frazier, P., & Kumar, L. (2007). Comparative assessment of the measures of thematic classification accuracy. Remote Sensing of Environment, 107, 606–616.
Mayaux, P., Eva, H., Gallego, J., Strahler, A. H., Herold, M., Agrawal, S., et al. (2006). Validation of the Global Land Cover 2000 map. IEEE Transactions on Geoscience and Remote Sensing, 44, 1728–1739.
McRoberts, R. E. (2011). Satellite image-based maps: Scientific inference or pretty pictures? Remote Sensing of Environment, 115, 715–724.
Olofsson, P., Foody, G. M., Stehman, S. V., & Woodcock, C. E. (2013). Making better use of accuracy data in land change studies: Estimating accuracy and area and quantifying uncertainty using stratified estimation. Remote Sensing of Environment, 129, 122–131.
Olofsson, P., Kuemmerle, T., Griffiths, P., Knorn, J., Baccini, A., Gancz, V., et al. (2011). Carbon implications of forest restitution in post-socialist Romania. Environmental Research Letters, 6, 045202.
Olofsson, P., Stehman, S. V., Woodcock, C. E., Sulla-Menashe, D., Sibley, A. M., Newell, J. D., et al. (2012). A global land cover validation dataset, I: Fundamental design principles. International Journal of Remote Sensing, 33, 5768–5788.
Olofsson, P., Torchinava, P., Woodcock, C. E., Baccini, A., Houghton, R. A., Ozdogan, M., et al. (2010). Implications of land use change on the national terrestrial carbon budget of Georgia. Carbon Balance and Management, 5, 4.
Pontius, R. G. (2000). Quantification error versus location error in comparison of categorical maps. Photogrammetric Engineering & Remote Sensing, 66, 1011–1016.
Pontius, R. G., & Lippitt, C. D. (2006). Can error explain map differences over time? Cartography and Geographic Information Science, 33, 159–171.
Pontius, R. G., & Millones, M. (2011). Death to kappa: Birth of quantity disagreement and allocation disagreement for accuracy assessment. International Journal of Remote Sensing, 32, 4407–4429.
Powell, R., Matzke, N., de Souza, C., Clark, M., Numata, I., Hess, L., et al. (2004). Sources of error in accuracy assessment of thematic land-cover maps in the Brazilian Amazon. Remote Sensing of Environment, 90, 221–234.

Pratihast, A. K., Herold, M., de Sy, V., Murdiyarso, D., & Skutsch, M. (2013). Linking community-based and national REDD+ monitoring: A review of the potential. Carbon Management, 4, 91–104.
Riemann, R., Wilson, B. T., Lister, A., & Parks, S. (2010). An effective assessment protocol for continuous geospatial datasets of forest characteristics using USFS Forest Inventory and Analysis (FIA) data. Remote Sensing of Environment, 114, 2337–2352.
Romijn, J. E., Herold, M., Kooistra, L., Murdiyarso, D., & Verchot, L. (2012). Assessing capacities of non-Annex I countries for national forest monitoring in the context of REDD+. Environmental Science and Policy, 20, 33–48.
Sanz-Sanchez, M., Herold, M., & Penman, J. (2013). REDD+ related forest monitoring remains key issue: A report following the recent UN climate convention in Doha. Carbon Management, 4, 125–127.
Särndal, C., Swensson, B., & Wretman, J. (1992). Model assisted survey sampling. New York: Springer.
Saura, S. (2002). Effects of minimum mapping unit on land cover data spatial configuration and composition. International Journal of Remote Sensing, 23, 4853–4880.
Scepan, J. (1999). Thematic validation of high-resolution global land-cover data sets. Photogrammetric Engineering & Remote Sensing, 65, 1051–1060.
Schroeder, T. A., Wulder, M. A., Healey, S. P., & Moisen, G. G. (2011). Mapping wildfire and clearcut harvest disturbances in boreal forests with Landsat time series data. Remote Sensing of Environment, 115, 1421–1433.
Skirvin, S. M., Kepner, W. G., Marsh, S. E., Drake, S. E., Maingi, J. K., Edmonds, C. M., et al. (2004). Assessing the accuracy of satellite-derived land-cover classification using historical aerial photography, digital orthophoto quadrangles, and airborne video data. In R. S. Lunetta, & J. G. Lyon (Eds.), Remote sensing and GIS accuracy assessment. Boca Raton: CRC Press.
Stehman, S. V. (1997). Selecting and interpreting measures of thematic classification accuracy. Remote Sensing of Environment, 62, 77–89.
Stehman, S. V. (2000). Practical implications of design-based sampling inference for thematic map accuracy assessment. Remote Sensing of Environment, 72, 35–45.
Stehman, S. V. (2001). Statistical rigor and practical utility in thematic map accuracy. Photogrammetric Engineering and Remote Sensing, 67, 727–734.
Stehman, S. V. (2005). Comparing estimators of gross change derived from complete coverage mapping versus statistical sampling of remotely sensed data. Remote Sensing of Environment, 96, 466–474.
Stehman, S. V. (2009). Sampling designs for accuracy assessment of land cover. International Journal of Remote Sensing, 30, 5243–5272.
Stehman, S. V. (2012). Impact of sample size allocation when using stratified random sampling to estimate accuracy and area of land-cover change. Remote Sensing Letters, 3, 111–120.
Stehman, S. V. (2013). Estimating area from an accuracy assessment error matrix. Remote Sensing of Environment, 132, 202–211.
Stehman, S. V., & Czaplewski, R. L. (1998). Design and analysis for thematic map accuracy assessment: Fundamental principles. Remote Sensing of Environment, 64, 331–344.
Stehman, S. V., & Foody, G. M. (2009). Accuracy assessment. In T. A. Warner, M. D. Nellis, & G. M. Foody (Eds.), The SAGE handbook of remote sensing. London: Sage Publications.
Stehman, S. V., Olofsson, P., Woodcock, C. E., Herold, M., & Friedl, M. A. (2012). A global land cover validation dataset, II: Augmenting a stratified sampling design to estimate accuracy by region and land-cover class. International Journal of Remote Sensing, 33, 6975–6993.
Stehman, S. V., & Selkowitz, D. J. (2010). A spatially stratified, multi-stage cluster sampling design for assessing accuracy of the Alaska (USA) National Land-Cover Data (NLCD). International Journal of Remote Sensing, 31, 1877–1896.
Stehman, S. V., Sohl, T. L., & Loveland, T. R. (2003). Statistical sampling to characterize recent United States land-cover change. Remote Sensing of Environment, 86, 517–529.
Stehman, S. V., & Wickham, J. D. (2011). Pixels, blocks of pixels, and polygons: Choosing a spatial unit for thematic accuracy assessment. Remote Sensing of Environment, 115, 3044–3055.
Stehman, S. V., Wickham, J. D., Wade, T. G., & Smith, J. H. (2008). Designing a multi-objective, multi-support accuracy assessment of the 2001 National Land Cover Data (NLCD 2001) of the conterminous United States. Photogrammetric Engineering & Remote Sensing, 74, 1561–1571.
Strahler, A. H., Boschetti, L., Foody, G. M., Friedl, M. A., Hansen, M. C., Herold, M., et al. (2006). Global land cover validation: Recommendations for evaluation and accuracy assessment of global land cover maps. EUR 22156 EN – DG. Luxembourg: Office for Official Publications of the European Communities (48 pp.).
Tomppo, E. O., Gschwantner, T., Lawrence, M., & McRoberts, R. E. (2010). National forest inventories: Pathways for common reporting. New York: Springer.
UN-REDD (2008). UN Collaborative Programme on Reducing Emissions from Deforestation and Forest Degradation in Developing Countries (UN-REDD). FAO, UNDP, UNEP Framework Document.
Wickham, J. D., Stehman, S. V., Fry, J. A., Smith, J. H., & Homer, C. G. (2001). Thematic accuracy of the NLCD 2001 land cover for the conterminous United States. Remote Sensing of Environment, 114, 1286–1296.
Wickham, J. D., Stehman, S. V., Gass, L., Dewitz, J., Fry, J. A., & Wade, T. G. (2013). Accuracy assessment of NLCD 2006 land cover and impervious surface. Remote Sensing of Environment, 130, 294–304.
Woodcock, C. E., Allen, R., Anderson, M., Belward, A., Bindschadler, R., Cohen, W., et al. (2008). Free access to Landsat imagery. Science, 320, 1011.
Wulder, M. A., Franklin, S., White, J. C., Linke, J., & Magnussen, S. (2006). An accuracy assessment framework for large-area land cover classification products derived from medium resolution satellite data. International Journal of Remote Sensing, 27, 663–683.
Wulder, M. A., Masek, J. G., Cohen, W. B., Loveland, T. R., & Woodcock, C. E. (2012). Opening the archive: How free data has enabled the science and monitoring promise of Landsat. Remote Sensing of Environment, 122, 2–10.
Wulder, M. A., White, J. C., Coops, N. C., & Butson, C. R. (2008). Multi-temporal analysis of high spatial resolution imagery for disturbance monitoring. Remote Sensing of Environment, 112, 2729–2740.
Wulder, M. A., White, J. C., Hay, G. J., & Castilla, G. (2008). Towards automated segmentation of forest inventory polygons on high spatial resolution satellite imagery. The Forestry Chronicle, 84, 221–230.
Wulder, M. A., White, J. C., Luther, J. E., Strickland, L. G., Remmel, T. K., & Mitchell, S. W. (2006). Use of vector polygons for the accuracy assessment of pixel-based land cover maps. Canadian Journal of Remote Sensing, 32, 268–279.
Wulder, M. A., White, J. C., Magnussen, S., & McDonald, S. (2007). Validation of a large area land cover product using purpose-acquired airborne video. Remote Sensing of Environment, 106, 480–491.
Zimmerman, P. L., Housman, I. W., Perry, C. H., Chastain, R. A., Webb, J. B., & Finco, M. V. (2013). An accuracy assessment of forest disturbance mapping in the western Great Lakes. Remote Sensing of Environment, 128, 176–185.
