Journal of Applied Ecology 2006, 43, 405–412
Blackwell Publishing, Ltd., Oxford, UK

Review Article

Modelling distribution and abundance with presence-only data
JENNIE L. PEARCE* and MARK S. BOYCE†
*Canadian Forest Service, Great Lakes Forestry Centre, Landscape Analysis and Application Section, 1219 Queen St
E, Sault St Marie, ON P6A 2E5 Canada; and †Department of Biological Sciences, University of Alberta, Edmonton,
AB T6G 2E9 Canada
Summary
1. Presence-only data, for which there is no information on locations where the species
is absent, are common in both animal and plant studies. In many situations, these may
be the only data available on a species. We need effective ways to use these data to explore
species distribution or species use of habitat.
2. Many analytical approaches have been used to model presence-only data, some inap-
propriately. We provide a synthesis and critique of statistical methods currently in use
to both estimate and evaluate these models, and discuss the critical importance of study
design in models where only presence can be identified.
3. Profile or envelope methods exist to characterize environmental covariates that
describe the locations where organisms are found. Predictions from profile approaches
are generally coarse, but may be useful when species records, environmental predictors
and biological understanding are scarce.
4. Alternatively, one can build models to contrast environmental attributes associated
with known locations with a sample of random landscape locations, termed either
‘pseudo-absences’ or ‘available’. Great care needs to be taken when selecting random
landscape locations, because the way in which they are selected determines the modelling
techniques that can be applied.
5. Regression-based models can provide predictions of the relative likelihood of occur-
rence, and in some situations predictions of the probability of occurrence. The logistic
model is frequently applied, but can rarely be used directly to estimate these models;
instead, case–control or logistic discrimination should be used depending on the sample
design.
6. Cross-validation can be used to evaluate model performance and to assess how effec-
tively the model reflects a quantity proportional to the probability of occurrence. How-
ever, more research is needed to develop a single measure or statistic that summarizes
model performance for presence-only data.
7. Synthesis and applications. A number of statistical procedures are available to explore
patterns in presence-only data; the choice among them depends on the quality of the
presence-only data. Presence-only records can provide insight into the vulnerability,
historical distribution and conservation status of species. Models developed using these
data can inform management. Our caveat is that researchers must be mindful of study
design and the biases inherent in presence data, and be cautious in the interpretation of
model predictions.
Key-words: case–control, distribution, habitats, logistic discrimination, logistic regression,
presence-only studies, pseudo-absences, resource selection functions, RSF, sampling
Journal of Applied Ecology (2006) 43, 405–412
doi: 10.1111/j.1365-2664.2005.01112.x
© 2005 British Ecological Society
Correspondence: Jennie Pearce, 1405 Third Line East, Sault Ste Marie, ON P6A 6J8, Canada (e-mail: [email protected]).
Introduction

To manage a species effectively, conservation projects may require a description of a species’ geographical distribution or use of habitats. Examples include reserve design (Araújo & Williams 2000), population viability analysis (Boyce et al. 1994; Akçakaya et al. 2004) and species or resource management (Johnson et al. 2004). Rarely are survey data available to describe species presence at every location on the landscape. Thus models are used to interpolate, or extrapolate beyond the locations where species presence is known, by relating species presence to environmental variables. This has been facilitated by remotely sensed data, allowing assessment of the distribution of resources over large, and even inaccessible, areas.

Many approaches have been used to model ‘presence–absence’ or ‘used–unused’ data (see Guisan & Zimmermann 2000 for a review). However, there is growing interest in making use of ‘presence-only’ data, consisting only of observations of the organism but with no reliable data on where the species was not found. Sources for these data include atlases, museum and herbarium records, species lists, incidental observation databases and radio-tracking studies.

Developing models of species distribution for presence-only data is challenging (Graham et al. 2004). Several approaches have been used; however, the choice among them is not clear. Terminology also differs between studies. For example, some studies refer to the ‘presence’ of a species, whereas faunal studies on wide-ranging species often refer to ‘used’ locations, rather than species presence. Here, we refer to the presence of a species for consistency. We review the various steps in modelling the distribution of species when we know some of the locations where they occur on the landscape, but have no information on where they do not occur.

We provide a synthesis and critique of statistical methods currently in use to both estimate and evaluate these models, and discuss the critical importance of study design in models where only presence can be identified. Our objective is to provide ecologists and managers with a wide range of approaches to explore patterns in presence-only data, and to identify analytical aspects that require further development.

Statistical model formulation

We review four approaches taken to describe the presence of a species in relation to environmental predictors when only presence is known. These are:
1. Describing the distribution of the presence-only records.
2. Contrasting the distribution of presence records with that of pseudo-absences.
3. Contrasting the distributions of presence records and available sites.
4. Modelling abundance when abundance given presence is known.

The modelling approaches for (2) and (3) derive from different sampling motivations. In (2), biologists wish to contrast used or consumed resource units such as plots of land, denning or nesting sites, prey or food items, with characteristics of resource units that have not been used or where use has not been recorded. Plants provide the clearest example of this view, where individuals are either present or truly absent at any given point on the landscape, within a given time-frame. Models provide predictions of the relative probability of a resource unit being used, given its characteristics. This differs from the motivation behind (3), where all resource units within the sampling domain are assumed to be available to be used, but some are used more frequently than others. Radiotelemetry studies of species such as grizzly bears Ursus arctos provide an example of this view, where bears might potentially be recorded at any point within their home range, but some locations are used more frequently than others. The difference between these sampling motivations is subtle, but explains the historical development of different approaches for similar problems.

Describing the distribution of presence-only records

This first group of modelling techniques, termed profile techniques, seeks to characterize environmental conditions associated with the presence records without reference to other data points. Environmental envelope techniques are the most widely applied (e.g. Busby 1986; Caughley et al. 1987; Lindenmayer et al. 1991; Law 1994; Pearce & Lindenmayer 1998; Walther, Wisz & Rahbek 2004). Chief among these techniques have been BIOCLIM (Busby 1986, 1991) and HABITAT (Walker & Cocks 1991). Environmental envelopes enclose presence records into a multidimensional envelope within environmental space. The various techniques use different classification algorithms, but often provide similar results. Predictions are typically summarized as the degree of classification within subenvelopes.

A recent variation on this approach has been the development of support vector machines (SVM) for one-class problems (e.g. Guo et al. 2005). SVMs seek to identify an environmental envelope or hyperspace containing the data points, in which the envelope is optimized with respect to the number of points in the envelope and to the number of outliers. The distance between the point and the centre of hyperspace determines membership of the hyperspace. The advantage of this approach over BIOCLIM, for example, is that the hyperspace can be any shape, whereas BIOCLIM uses hyperboxes to enclose the presence data (Guo et al. 2005). HABITAT also is more flexible than BIOCLIM, defining the environmental envelope using a convex hull and the relative density of observations within environmental space. SVM, therefore, may be considered a refinement of the HABITAT approach.
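
As a rough illustration of the envelope idea, the following Python sketch builds a rectilinear (hyperbox) envelope from presence records and flags candidate sites that fall inside it. It is not an implementation of any of the packages cited above; the function names, the percentile limits and the toy temperature and rainfall values are assumptions made only for this example.

```python
import numpy as np

def fit_envelope(presence_env, lower_pct=5.0, upper_pct=95.0):
    """Return per-variable (lower, upper) limits of a rectilinear envelope.

    presence_env: array of shape (n_records, n_variables) holding the
    environmental values at presence locations. The percentile cut-offs
    are illustrative assumptions, not values taken from the paper.
    """
    presence_env = np.asarray(presence_env, dtype=float)
    lower = np.percentile(presence_env, lower_pct, axis=0)
    upper = np.percentile(presence_env, upper_pct, axis=0)
    return lower, upper

def inside_envelope(sites_env, lower, upper):
    """True for each site whose every variable lies within the limits."""
    sites_env = np.asarray(sites_env, dtype=float)
    return np.all((sites_env >= lower) & (sites_env <= upper), axis=1)

# Toy data (hypothetical): columns are mean annual temperature and rainfall.
presence = np.array([[10.0, 800.0], [11.0, 900.0], [12.0, 1000.0], [13.0, 1100.0]])

# Percentiles 0 and 100 give the full observed range of each variable.
lower, upper = fit_envelope(presence, lower_pct=0.0, upper_pct=100.0)

candidates = np.array([[11.5, 950.0], [20.0, 950.0]])
in_env = inside_envelope(candidates, lower, upper)
```

Every presence record carries equal weight here, so with full-range limits a single extreme record enlarges the box; trimming the limits with percentiles is one simple way to reduce that sensitivity.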
Multivariate association methods such as DOMAIN (Carpenter, Gillison & Winter 1993) also require only presence data. DOMAIN defines the degree of similarity among presence sites in terms of environmental conditions. The method can be used to determine either environmental envelopes or a continuous map of similarity.

At a finer scale, utilization distributions (UD) can be used to characterize the distribution of animals. The UD is a probability density function that quantifies an individual’s or group’s relative use of space (van Winkle 1975). Marzluff et al. (2004) have extended this approach by modelling the intensity of use relative to environmental covariates.

Profile techniques summarize environmental characteristics at presence locations, and typically each record has equal weight within the model. Because of this, these techniques are highly dependent on biases in the presence records. Some approaches can be highly sensitive to the inclusion of outliers. Elith & Burgman (2002) provide a discussion of the pros and cons of geographical and climatic envelope-based techniques. Predictions from presence-only approaches are generally coarse, but may be useful at meso-scales to describe poorly understood species when species records, environmental predictors, and biological understanding are scarce.

Contrasting presence records with pseudo-absences

Many studies have sought to apply presence–absence techniques to presence-only data by generating pseudo-absence data from background areas from which species data are missing. These sites may be selected without replacement from within the study region either randomly (Stockwell & Peterson 2002), randomly with case-weighting to reduce the effective sample size of pseudo-absences (Ferrier & Watson 1996; Ferrier et al. 2002), or by using environmentally weighted random sampling (Zaniewski, Lehmann & Overton 2002). Pseudo-absences are assumed to represent true absences, although because sites were not searched some pseudo-absences might represent presence locations (Graham et al. 2004). Generalized linear models and generalized additive models have been the most widely applied statistical methods (e.g. Ferrier et al. 2002). However, other approaches such as tree-based methods (e.g. Ferrier & Watson 1996) and genetic algorithms (e.g. GARP; Stockwell & Peters 1999) also have been considered. Regression models have generally performed better than tree-based methods or genetic algorithms in predicting species presence (Ferrier & Watson 1996). Tree-based methods are expected to be highly sensitive to biases within the sample data (Hastie et al. 2001), and the underlying model used to make predictions in GARP is largely inaccessible and difficult to interpret (Elith & Burgman 2002).

When using presence-only data it is generally not possible to calculate probabilities of presence; instead we aim to predict the relative likelihood of presence. There are two reasons for this: (a) separate samples of presence and pseudo-absence data have been selected where sampling fractions are not known, and (b) the pseudo-absence data contain an unknown number of presences, and are thus a contaminated sample of absences. To understand this we examine the logistic function and its assumptions. The logistic regression model assumes that a sample is selected, and that this sample contains observations of either the presence (y = 1) or the absence (y = 0) of a species. For each observation there is a set of habitat measurements x. From this the probability of occurrence [P(y = 1 | x)] can be estimated:

$$P(y = 1 \mid x) = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)} \qquad \text{eqn 1}$$

This assumes that presence and absence observations were recorded from a sample of resource units in which the presence of the species at a resource unit was not known prior to sampling. Thus the sample contains presence and absence sites in approximate proportion to their occurrence on the landscape. In the absence of habitat information, the probability of occurrence then can be estimated directly from the proportion of observations in the sample at which the species was present. For example, if in a sample of 100 observations, 20 contain the species, the probability of occurrence is 0·2 [= 20/(20 + 80)]. However, with presence-only data, we sample the presence locations independently and then select a sample of pseudo-absence locations, and so the proportion of presences within the sample does not represent the true prevalence of the species in the population, but rather the relative proportion chosen by the researcher. For example, we have a sample of 20 presence records and we select independently a set of 80 ‘pseudo-absence’ records. In this case the probability of occurrence is also 0·2 [= 20/(20 + 80)]. However, if we select 200 pseudo-absence locations, then the probability of occurrence is 0·09 [= 20/(20 + 200)].

When samples for y = 1 and y = 0 are selected in advance, we need to modify the logistic model to account for the probability that a location has been sampled to obtain probabilities of occurrence. We do this by correcting the model using P1 and P0, the proportion of occupied and unoccupied locations, respectively, selected from the total number of occupied and unoccupied locations in the landscape. This also is known as a case–control design.

$$P(y = 1 \mid x, \text{sampled}) = \frac{\exp\left(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \ln\frac{P_1}{P_0}\right)}{1 + \exp\left(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \ln\frac{P_1}{P_0}\right)} \qquad \text{eqn 2}$$

In practice we rarely know what proportion of the used and unused locations we have selected in our samples, and so P0 and P1 are unknown. Model predictions using
the uncorrected logistic function are therefore only relative predictions. Alternatively, we can interpret model coefficients in terms of odds ratios, where the odds that a species will be present given covariate pattern x are compared with the odds for a reference habitat, usually one in which the values for x1 to xp are set to zero (Keating & Cherry 2004). Thus:

$$\frac{P(y = 1 \mid x) / P(y = 0 \mid x)}{P(y = 1 \mid x_{\text{reference}}) / P(y = 0 \mid x_{\text{reference}})} = \exp(\beta_1 x_1 + \dots + \beta_p x_p) \qquad \text{eqn 3}$$

A further complication of this sampling scheme is that the process of generating pseudo-absences randomly from the landscape of interest means that these locations are actually an unknown mixture of presence and absence locations, unless the species is very rare on the landscape. Keating & Cherry (2004) discuss the difficulties of deriving probabilities of occurrence in case–control designs under these circumstances. However, unless the level of contamination (proportion of presences within the absence sample) is very high, the model may provide acceptable predictions of the relative likelihood of occurrence, or odds ratios. Based on simulations, Lancaster & Imbens (1996) obtained unbiased estimates of the βi with contamination rates less than 20%. Also, they provide an algorithm for dealing with situations where greater contamination rates exist. This approach seeks to calculate the predicted probability of species presence where presence locations are contrasted with control sites, which are an unknown mixture of occupied and unoccupied locations. The implementation of this approach is complex, not available in standard statistical packages, and frequently fails to converge to a unique solution (Keating & Cherry 2004). Barry, Elith & Pearce (unpublished data) provide a worked example of this approach for habitat studies.

Contrasting presence records with available sites

A slightly different approach has been applied in studies of wide-ranging animals. These studies do not refer to the presence or absence of a species, but rather to how well a habitat is ‘used’, usually determined through radiotelemetry studies (Frair et al. 2004). In these studies, the landscape is considered to be available to the species of interest and potentially used to some extent, but some habitats are occupied more frequently than others within a given time period. These models describe the relative probability of use for different resource units (e.g. a pixel) over the study area, as described by habitat characteristics. The distinction between this approach and the pseudo-absence approach is subtle, because in practice the sampling schemes are similar. However, the underlying conceptual difference between contrasting unoccupied-vs.-occupied locations, and used-vs.-available locations has resulted in the development of a wide range of alternative modelling approaches.

Four approaches have been used to model presence-availability. The first of these, ecological niche factor analysis (ENFA) implemented in the BIOMAPPER package (Hirzel, Hausser & Perrin 2004) is similar to profile techniques. ENFA uses factor analysis to quantify the environmental conditions of the presence sites by comparing them to the environmental conditions of the entire region of interest, and predictions are provided as a habitat suitability index (Hirzel et al. 2002; Dettki, Löfstrand & Edenius 2003; Reutter et al. 2003; Brotons et al. 2004; Chefaoui, Hortal & Lobo 2005). ENFA considers the density of points within subenvelopes of data and is therefore an improvement on presence-only approaches. This technique is generally optimistic regarding species distribution, which may be an advantage when a species does not occupy all suitable habitats on the landscape (Hirzel, Helfer & Metral 2001; Brotons et al. 2004). The two-class SVM model uses a similar approach to ENFA, except that it does not assume a particular probability distribution for the data (Guo et al. 2005).

A second approach to modelling presence-availability involves using case–control logistic regression where used resource units are contrasted with random locations within an activity area available to individuals. There are different sampling designs available to conduct this, where cases may be matched or unmatched with controls (Collett 1991; Arthur et al. 1996; Manly et al. 2002). Examples of this approach include contrasting wood turtle Clemmys insculpta locations with paired random locations (Compton, Rhymer & McCollough 2002) and contrasting superb parrot Polytelis swainsonii nest trees with paired random trees (Manning, Lindenmayer & Barry 2004). Models estimated using case–control logistic regression are based on the contrasts between used and control resource units and can be interpreted as odds ratios or relative likelihoods of occurrence (Keating & Cherry 2004). The discussion in the previous section about the contamination of controls also applies here.

A third approach proposed by Manly et al. (2002) uses logistic regression to estimate relative likelihoods using an exponential model:

$$P(y = 1 \mid x) = \exp(\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p) \qquad \text{eqn 4}$$

This model has been used widely in resource selection studies (e.g. Campos et al. 1997; Johnson et al. 2002; Nielsen et al. 2002; Boyce et al. 2003), rather than the logistic function, because it avoids the problem of different denominators encountered in the logistic model. However, as Manly et al. (2002: 101–102) point out, this approach assumes a particular sampling scheme. In particular, this approach requires that one sample of presence locations and one sample of available locations be taken, and that any single location selected that occurs in both the presence and the available samples
be included only in the available sample. McDonald (2003) shows that the duplicate records can be removed from the available sample rather than the observed sample unless the number of duplicates is high. Manly et al. (2002) show how, with known sampling frequencies of presence and available samples, probabilities of occurrence can be calculated, although Keating & Cherry (2004) question this model. However, in practice sampling probabilities are unknown, and irrespective of the validity of the model formulation, model predictions provide relative likelihoods of occurrence (i.e. the RSF). When interpreted as relative likelihoods, it is not necessary that the predictions are constrained to lie below 1, a concern raised by Keating & Cherry (2004).

A fourth approach is to use the logistic regression algorithm to approximate a logistic discrimination model. Here we use the logistic model to estimate a function that discriminates between two distributions of habitat covariates, one set associated with locations where the species is present f_{y=1}(x) and another set associated with random (available) locations f_{y=0}(x) (Keating & Cherry 2004). We sample independently from each distribution, with probability π1 of a sampled observation (from the joint distribution of presence and available sites) being a presence record, and π2 of it being an available record. We can assume (Seber 1984: 308) that the probability of a species being present at a location with covariates x, given that it was sampled, is:

$$\log\frac{f_{y=1}(x)}{f_{y=0}(x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \log\frac{\pi_1}{\pi_2} \qquad \text{eqn 5}$$

We can combine the sampling constant log(π1/π2) with the intercept term β0. Because we have no information on the sampling proportions we can calculate only the relative probability of occurrence (dropping the intercept term). This approach is suitable for discriminating between random sites and sites at which the species has been observed. Naturally the discriminant function cannot discriminate between sites at which a species was present and sites at which it was absent (from a contaminated sample of occupied and unoccupied locations) (Keating & Cherry 2004). Again, predictions need not be constrained to lie below 1, because predictions are relative likelihoods rather than probabilities of occurrence.

The logistic discrimination model is very similar to the exponential model suggested by Manly et al. (2002), and in practice its application differs only because resource units that appear in the used sample also can appear in the sample of available units (Johnson et al. 2006). The logistic discrimination model does not require as many assumptions as the exponential model: assumptions that Keating & Cherry (2004) suggest might sometimes be violated. Seber (1984: 309) suggests that the logistic discrimination model may be relatively robust to observations occurring in both the presence and the available sample.

Modelling abundance given presence

Often, estimates of relative abundance are made at locations where the species has been detected. Examples are counts of individuals, indices of abundance such as the Braun–Blanquet scale for plants (Kent & Coker 1992) and density measurements. Few studies have tried to model data of this type, even when data were acquired through systematic surveys. Regression approaches modelling abundance given only presence are possible using a truncated Poisson or negative binomial distribution. Alternatively, it may be possible to modify zero-inflated Poisson or negative binomial (ZIP or ZINB) regression models (Welsh et al. 1996; Barry & Welsh 2002; Dirnböck & Dullinger 2004; Nielsen et al. 2005) to model abundance given availability, where available locations are assigned a value of zero. This requires further investigation. We are unaware of any application explicitly modelling abundance given presence only.

Sampling issues

Knowledge of only the presence of a species presents a number of data-quality issues. Central among these are difficulties presented by choice of scale. Models of distribution or abundance can be highly sensitive to the scale of resolution (grain) as well as the extent (domain) (Soberón & Peterson 2005). There are no obvious guidelines about which choice of scale is appropriate, because such choice will depend on the ecology of the organism at hand and the objectives of the investigation (Boyce et al. 2003). If the intent is to model the global distribution of a species, obviously one should be using a very different scale than if the objective were to model use of habitats within a species’ home range (Johnson 1980).

However, selection of extent can be a difficult question when using presence-only data. Implicit in that selection is an understanding of the sampling design by which the presence records were obtained. In studies where the data were obtained by survey, such as in the study of Phytophthora ramorum (Guo et al. 2005) or caribou Rangifer tarandus (Johnson et al. 2004), the geographical, temporal and environmental boundaries of the study are known. However, presence-only data might be ‘found’ data, collated from multiple sources such as herbarium or museum records, for which there is no information on survey effort. Not knowing the sampling extent prevents us from defining available habitat adequately. For example, many herbarium databases are biased towards roads. A model of sampling effort would identify that only locations close to roads would have a high probability of being sampled; therefore only sites near roads should be included in the available sample. Not accounting for these biases may complicate model interpretation because the resulting model might describe sampling effort more than resource selection.
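
One way to act on the roadside example is to draw the available sample only from locations that could plausibly have been searched. The Python sketch below is a minimal illustration under stated assumptions: a hypothetical straight road along y = 0, a rectangular study area, and a 1 km search buffer. The function name and all numeric values are invented for the example and do not come from any cited method.

```python
import numpy as np

def sample_available_near_road(n, x_range, y_range, road_y, max_dist, rng):
    """Draw n random available points within max_dist of a horizontal road.

    Simple rejection sampling: propose uniform points over the study area
    and keep those whose distance to the line y = road_y is at most max_dist.
    """
    kept = []
    while len(kept) < n:
        x = rng.uniform(*x_range, size=n)
        y = rng.uniform(*y_range, size=n)
        near = np.abs(y - road_y) <= max_dist
        kept.extend(zip(x[near], y[near]))
    return np.array(kept[:n])

rng = np.random.default_rng(42)
available = sample_available_near_road(
    n=500,
    x_range=(0.0, 50.0),   # study area extent in km (hypothetical)
    y_range=(-10.0, 10.0),
    road_y=0.0,            # road modelled as the line y = 0 (hypothetical)
    max_dist=1.0,          # assumed search buffer around the road, km
    rng=rng,
)
```

In a real analysis the buffer would be replaced by whatever is known about collecting effort, and distances would be computed to the actual road network in a GIS rather than to a straight line.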
410 Once sampling scale and extent have been identified, An important feature of this approach is being able
J. L. Pearce & the question then arises as to how to choose random to examine how well model predictions are related to
M. S. Boyce locations from a potentially large area to contrast with the probability of occurrence. A good model is one in
the presence records. Little guidance exists in the liter- which model predictions are proportional to the prob-
ature; however, as Manly et al. (2002) argue, it is most ability of occurrence (Manly et al. 2002). In the k-fold
important to minimize sampling errors, selecting data cross-validation graph, this would imply linear corre-
in such a way as to be fully representative of the study spondence between the test-case area-adjusted fre-
area. This implies that a large number of locations be quencies and model predictions. There is no guarantee
selected randomly from the landscape to contrast with that any of the models described above will capture the
presence locations. McDonald (2003) suggests that true shape of the selection function, and thus might
several orders of magnitude more available units than not be proportional to the probability of occurrence.
used units be employed when applying the exponential Standard transformations of model predictions, e.g.
model. Using GIS databases, such high sampling logarithmic, square root, etc., might be necessary to
intensity for random landscape locations is feasible. scale the resource selection function appropriately.
Modern biotelemetry systems such as GPS radio- Proportionality is important because it allows model
telemetry (Frair et al. 2004) permit the collection of huge predictions to be used explicitly, such as when linking
data sets of animal locations, with short time intervals habitats to populations (Boyce & McDonald 1999;
between locations. Similarly, atlas data are usually McDonald & McDonald 2001).
obtained using grid coverage of the entire region,
and so adjacent sampling squares are not independent
Conclusion
(Augustin, Mugglestone & Buckland 1996). Such data
are inherently plagued with both temporal and spatial A number of statistical procedures are available for
autocorrelation because such frequent locations are exploring patterns in presence-only data; the choice
not independent in time or space (Nielsen et al. 2002). among them depends on the quality of the presence-
To avoid committing a Type I error, adjustments for only data. Profile techniques are most useful when spe-
autocorrelation can be achieved using post-hoc methods cies records, environmental predictors and biological
of variance inflation (Nielsen et al. 2002) such as the understanding are scarce. However, when data quality
Newey–West method (Newey & West 1987), or auto- is higher, regression-based techniques have generally
correlation can be modelled more explicitly using proved more informative than profile techniques. The
mixed models (Laidre et al. 2004). choice among regression modelling strategies depends
on the sampling scheme for the ‘absence’ or ‘control’
records. The logistic regression model should not be
Validating presence-only models
used directly in most instances. Instead, either a case–
Models based on presence-only data can be validated control or discrimination approach should be adopted
with data composed of presences and absences using to contrast presence records with available resource
existing evaluation statistics for presence–absence data [such as the area under the receiver operating characteristic (ROC) curve, or the kappa statistic]. However, when validation data consist only of presence data, model evaluation is more difficult because of the absence of a truly binary statistic. These issues are discussed by Boyce et al. (2002), who present an approach based on use-availability data to explore model performance; this approach has been developed further by Hirzel (unpublished data).

In this method k-fold cross-validation is used to correlate prediction ranks with area-adjusted frequencies of predicted values. Prediction ranks are obtained by breaking the range of predicted values into 10 (or some arbitrary number) evenly spaced bins. Area-adjusted frequencies of predicted values are then obtained by counting the number of occupied sites within each predicted-value bin and dividing that count by the area of the study area assigned the predicted values associated with that bin. This graphical approach holds great promise as a method to visualize predictive performance and to assign thresholds of prediction. However, as yet there is no suitable single measure of performance (or statistic) available to compare and contrast models.

units. All techniques (profile and regression-based) may effectively rank habitats. Regression-based approaches might also provide predictions describing the relative likelihood of occurrence. If information on the relative proportions of presence and ‘available’ locations that were sampled is known, then predictions of the probabilities of occurrence are possible. k-Fold cross-validation can be used to examine model performance and proportionality.

Many conservation projects require a complete description of a species’ geographical distribution or use of habitats to manage the species or environment effectively. However, for rare and endangered species, newly introduced species, or species requiring large geographical areas to meet all their life requirements, presence–absence data can be difficult or impossible to collect. Presence-only records can provide insight into the vulnerability, historical distribution and conservation status of species; models developed using these data can inform management. Our caveat is that researchers must be mindful of study design and the biases inherent in the presence data and be cautious in the interpretation of model predictions.
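The binned evaluation described in this section (prediction ranks correlated with area-adjusted frequencies, after Boyce et al. 2002) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the use of predictions over all map cells as a stand-in for the area assigned to each bin, the Spearman rank correlation as the measure of association, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def area_adjusted_frequencies(pred_presence, pred_area, n_bins=10):
    """Occupied sites per bin divided by the area (cell count) in that bin.

    pred_presence: predictions at withheld presence sites (one fold).
    pred_area:     predictions for every map cell in the study area
                   (assumed proxy for the area assigned to each bin).
    """
    pred_presence = np.asarray(pred_presence, dtype=float)
    pred_area = np.asarray(pred_area, dtype=float)
    lo = min(pred_presence.min(), pred_area.min())
    hi = max(pred_presence.max(), pred_area.max())
    edges = np.linspace(lo, hi, n_bins + 1)        # evenly spaced bins
    occupied, _ = np.histogram(pred_presence, bins=edges)
    area, _ = np.histogram(pred_area, bins=edges)
    ratio = np.full(n_bins, np.nan)                # NaN where a bin has no area
    nonzero = area > 0
    ratio[nonzero] = occupied[nonzero] / area[nonzero]
    return ratio

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def binned_rank_correlation(pred_presence, pred_area, n_bins=10):
    """Spearman correlation between bin rank and area-adjusted frequency."""
    ratio = area_adjusted_frequencies(pred_presence, pred_area, n_bins)
    bin_rank = np.arange(1, n_bins + 1)
    keep = ~np.isnan(ratio)
    rx = average_ranks(bin_rank[keep])
    ry = average_ranks(ratio[keep])
    return float(np.corrcoef(rx, ry)[0, 1])

# Simulated example (hypothetical data): presences concentrate where
# predicted values are high, so the correlation should be strongly positive.
rng = np.random.default_rng(0)
map_predictions = rng.uniform(0.0, 1.0, 20000)
presence_predictions = rng.beta(3.0, 1.0, 1000)
rho = binned_rank_correlation(presence_predictions, map_predictions)
print(round(rho, 3))
```

In a k-fold setting this calculation would be repeated for each withheld fold, using a model fitted to the remaining folds, and the resulting correlations examined together. Plotting the area-adjusted frequencies against bin rank gives the graphical view described in the text.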
Acknowledgements

We thank Bryan Manly, Lisa Venier and colleagues at the Species Modelling workshop in Riederalp, Switzerland for invaluable discussion on the use of presence-only data to model species distribution, and Lisa Venier, Dan McKenney, Bryan Manly and Simon Ferrier for their comments on drafts of this manuscript. MSB acknowledges support from the Natural Sciences and Engineering Research Council of Canada (NSERC), the National Science Foundation (grant no. DEB-0078130) and the Alberta Conservation Association.

References

Akçakaya, H.R., Burgman, M.A., Kindvall, O., Wood, C.C., Sjögren-Gulve, P., Hatfield, J.S. & McCarthy, M.A. (2004) Species Conservation and Management. Oxford University Press, Oxford, UK.
Araújo, M.B. & Williams, P.H. (2000) Selecting areas for species persistence using occurrence data. Biological Conservation, 96, 331–345.
Arthur, S.M., Manly, B.F.J., McDonald, L.L. & Garner, G.W. (1996) Assessing habitat selection when availability changes. Ecology, 77, 215–227.
Augustin, N.H., Mugglestone, M.A. & Buckland, S.T. (1996) An autologistic model for the spatial distribution of wildlife. Journal of Applied Ecology, 33, 339–347.
Barry, S.C. & Welsh, A.H. (2002) Generalised additive modelling and zero inflated count data. Ecological Modelling, 157, 179–188.
Boyce, M.S., Mao, J.S., Merrill, E.H., Fortin, D., Turner, M.G., Fryxell, J. & Turchin, P. (2003) Scale and heterogeneity in habitat selection by elk in Yellowstone National Park. Ecoscience, 10, 321–332.
Boyce, M.S. & McDonald, L.L. (1999) Relating populations to habitats using resource selection functions. Trends in Ecology and Evolution, 14, 268–272.
Boyce, M.S., Meyer, J.S. & Irwin, L.L. (1994) Habitat-based PVA for the northern spotted owl. Statistics in Ecology and Environmental Monitoring (eds D.J. Fletcher & B.F.J. Manly), pp. 63–85. Otago Conference Series no. 2. University of Otago Press, Dunedin, New Zealand.
Boyce, M.S., Vernier, P.R., Nielsen, S.E. & Schmiegelow, F.K.A. (2002) Evaluating resource selection functions. Ecological Modelling, 157, 281–300.
Brotons, L., Thuiller, W., Araújo, M.B. & Hirzel, A.H. (2004) Presence–absence versus presence-only modelling methods for predicting bird habitat suitability. Ecography, 27, 437–448.
Busby, J.R. (1986) A biogeoclimatic analysis of Nothofagus cunninghamii (Hook.) Oerst. in southeastern Australia. Australian Journal of Ecology, 11, 1–7.
Busby, J.R. (1991) BIOCLIM – a bioclimatic analysis and prediction system. Nature Conservation: Cost Effective Biology Survey and Data Analysis (eds C.R. Margules & M.P. Austin), pp. 64–68. CSIRO, Australia.
Campos, D., Kaur, A., Patil, G.P., Ripple, W.J. & Taillie, C. (1997) Resource selection by animals: the statistical analysis of binary response. Coenoses, 12, 1–21.
Carpenter, G., Gillison, A.N. & Winter, J. (1993) DOMAIN: a flexible modelling procedure for mapping potential distributions of plants and animals. Biodiversity and Conservation, 2, 667–680.
Caughley, G., Short, J., Grigg, G.C. & Nix, H. (1987) Kangaroos and climate: an analysis of distribution. Journal of Animal Ecology, 56, 751–761.
Chefaoui, R.M., Hortal, J. & Lobo, J.M. (2005) Potential distribution modelling, niche characterisation and conservation status assessment using GIS tools: a case study of Iberian Copris species. Biological Conservation, 122, 327–338.
Collett, D. (1991) Modelling Binary Data. Chapman & Hall, London, UK.
Compton, B.W., Rhymer, J.M. & McCollough, M. (2002) Habitat selection by wood turtles (Clemmys insculpta): an application of paired logistic regression. Ecology, 83, 833–843.
Dettki, H., Löfstrand, R. & Edenius, L. (2003) Modelling habitat suitability for moose in coastal northern Sweden: empirical vs. process-oriented approaches. Ambio, 32, 549–556.
Dirnböck, T. & Dullinger, S. (2004) Habitat distribution models, spatial autocorrelation, functional traits and dispersal capacity of alpine plant species. Journal of Vegetation Science, 15, 77–84.
Elith, J. & Burgman, M.A. (2002) Habitat models for PVA. Population Viability in Plants (eds C.A. Brigham & M.W. Schwartz), Springer-Verlag, New York, NY.
Ferrier, S. & Watson, G. (1996) An Evaluation of the Effectiveness of Environmental Surrogates and Modelling Techniques in Predicting the Distribution of Biological Diversity. Consultancy report prepared by the New South Wales National Parks and Wildlife Service for the Department of Environment, Sport and Territories.
Ferrier, S., Watson, G., Pearce, J. & Drielsma, M. (2002) Extended statistical approaches to modelling spatial pattern in biodiversity in northeast New South Wales. I. Species-level modelling. Biodiversity and Conservation, 11, 2275–2307.
Frair, J.L., Nielsen, S.E., Merrill, E.H., Lele, S., Boyce, M.S., Munro, R.H.M., Stenhouse, G.B. & Beyer, H.L. (2004) Removing habitat-induced, GPS-collar bias from inferences of habitat selection. Journal of Applied Ecology, 41, 201–212.
Graham, C.H., Ferrier, S., Huettman, F., Moritz, C. & Peterson, A.T. (2004) New developments in museum-based informatics and applications in biodiversity analysis. Trends in Ecology and Evolution, 19, 497–503.
Guisan, A. & Zimmermann, N.E. (2000) Predictive habitat distribution models in ecology. Ecological Modelling, 135, 147–186.
Guo, Q., Kelly, M. & Graham, C.H. (2005) Support vector machines for predicting distribution of Sudden Oak Death in California. Ecological Modelling, 182, 75–90.
Hastie, T., Tibshirani, R. & Friedman, J. (2001) The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Verlag, New York.
Hirzel, A.H., Hausser, J., Chessel, D. & Perrin, N. (2002) Ecological-niche factor analysis: how to compute habitat-suitability maps without absence data? Ecology, 83, 2027–2036.
Hirzel, A.H., Hausser, J. & Perrin, N. (2004) Biomapper 3·0, User’s Manual [online]. Available at: http://www.unil.ch/biomapper [accessed September 2004].
Hirzel, A.H., Helfer, V. & Metral, F. (2001) Assessing habitat-suitability models with a virtual species. Ecological Modelling, 145, 111–121.
Johnson, C.J., Nielsen, S.E., Merrill, E.H., McDonald, T.L. & Boyce, M.S. (2006) Resource selection functions based on use-availability data: theoretical motivation and evaluation methods. Journal of Wildlife Management, in press.
Johnson, C.J., Parker, K.L., Heard, D.C. & Gillingham, M.P. (2002) A multiscale behavioural approach to understanding the movements of woodland caribou. Ecological Applications, 12, 1840–1860.
Johnson, C.J., Seip, D.R. & Boyce, M.S. (2004) A quantitative approach to conservation planning: using resource selection functions to map the distribution of mountain caribou at multiple spatial scales. Journal of Applied Ecology, 41, 238–251.
Johnson, D.H. (1980) The comparison of usage and availability measurements for evaluating resource preference. Ecology, 61, 65–71.
Keating, K.A. & Cherry, S. (2004) Use and interpretation of logistic regression in habitat selection studies. Journal of Wildlife Management, 68, 774–789.
Kent, M. & Coker, P. (1992) Vegetation Description and Analysis: a Practical Approach. John Wiley and Sons, Chichester, West Sussex, UK.
Laidre, K.L., Heide-Jorgensen, M.P., Logdson, M.L., Hobbs, R.C., Heagerty, P., Dietz, R., Jorgensen, O.A. & Treble, M.A. (2004) Seasonal narwhal habitat associations in the high Arctic. Marine Biology, 145, 821–831.
Lancaster, T. & Imbens, G. (1996) Case–control studies with contaminated controls. Journal of Econometrics, 71, 145–160.
Law, B.S. (1994) Climatic limitation of the southern distribution of the common blossom bat Syconycteris australis, New South Wales. Australian Journal of Ecology, 19, 366–374.
Lindenmayer, D.B., Nix, H.A., McMahon, J.P., Hutchinson, M.F. & Tanton, M.T. (1991) The conservation of Leadbeater’s possum, Gymnobelideus leadbeateri (McCoy): a case study of the use of bioclimatic modelling. Journal of Biogeography, 18, 371–383.
Manly, B.F.J., McDonald, L.L., Thomas, D.L., McDonald, T.L. & Erickson, W.P. (2002) Resource Selection by Animals, 2nd edn. Kluwer Academic Publishers, Dordrecht, the Netherlands.
Manning, A.D., Lindenmayer, D.B. & Barry, S.C. (2004) The conservation implications of bird reproduction in the agricultural ‘matrix’: a case study of the vulnerable superb parrot of south-eastern Australia. Biological Conservation, 120, 363–374.
Marzluff, J.M., Millspaugh, J.J., Hurvitz, P. & Handcock, M.S. (2004) Relating resources to a probabilistic measure of space use: forest fragments and Steller’s jays. Ecology, 85, 1411–1427.
McDonald, T.L. (2003) Estimation of resource selection functions when used and available samples overlap. Resource Selection Methods and Applications (ed. S.V. Huzurbazar), pp. 35–39. Omnipress, Laramie, WY.
McDonald, T.E. & McDonald, L.L. (2001) A new ecological risk assessment procedure using resource selection models and geographical information systems. Wildlife Society Bulletin, 30, 1015–1021.
Newey, W.K. & West, K.D. (1987) A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica, 55, 703–708.
Nielsen, S.E., Boyce, M.S., Stenhouse, G.B. & Munro, R.H.M. (2002) Modeling grizzly bear habitats in the Yellowhead ecosystem of Alberta: taking autocorrelation seriously. Ursus, 13, 45–56.
Nielsen, S.E., Johnson, C.J., Heard, D.C. & Boyce, M.S. (2005) Can models of presence–absence be used to scale abundance? Two case studies considering extremes in life history. Ecography, 28, 1–12.
Pearce, J. & Lindenmayer, D. (1998) Bioclimatic analysis to enhance reintroduction biology of the endangered Helmeted Honeyeater (Lichenostomus melanops cassidix) in southeastern Australia. Restoration Ecology, 6, 238–243.
Reutter, B.A., Helfer, V., Hirzel, A.H. & Vogel, P. (2003) Modelling habitat-suitability using museum collections: an example with three sympatric Apodemus species from the Alps. Journal of Biogeography, 30, 581–590.
Seber, G.A.F. (1984) Multivariate Observations. Wiley, New York, NY.
Soberón, J. & Peterson, A.T. (2005) Interpretation of models of fundamental ecological niches and species’ distributional areas. Biodiversity Informatics, 2, 1–10.
Stockwell, D. & Peters, D. (1999) The GARP modelling system: problems and solutions to automated spatial prediction. International Journal of Geographical Information Science, 13, 143–158.
Stockwell, D.R.B. & Peterson, A.T. (2002) Controlling bias in biodiversity data. Predicting Species Occurrences: Issues of Accuracy and Scale (eds J.M. Scott, P.J. Heglund, M.L. Morrison, J.B. Haufler, M.G. Raphael, W.A. Wall & F.B. Samson), pp. 537–546. Island Press, Washington, DC.
Walker, P.A. & Cocks, K.D. (1991) HABITAT: a procedure for modelling a disjoint environmental envelope for a plant or animal species. Global Ecology and Biogeography Letters, 1, 108–118.
Walther, B., Wisz, M. & Rahbek, C. (2004) Known and predicted African winter distributions and habitat use of the endangered Basra reed warbler (Acrocephalus griseldis) and the near-threatened cinereous bunting (Emberiza cineracea). Journal of Ornithology, 145, 287–299.
Welsh, A.H., Cunningham, R.B., Donnelly, C.F. & Lindenmayer, D.B. (1996) Modelling the abundance of rare species: statistical models for counts with extra zeros. Ecological Modelling, 88, 297–308.
van Winkle, W. (1975) Comparison of several probabilistic home-range models. Journal of Wildlife Management, 39, 118–123.
Zaniewski, A.E., Lehmann, A. & Overton, J.M. (2002) Predicting species spatial distributions using presence-only data: a case study of native New Zealand ferns. Ecological Modelling, 157, 261–280.

Received 12 May 2005; final copy received 1 June 2005
Editor: Rob Freckleton
© 2005 British Ecological Society, Journal of Applied Ecology, 43, 405–412