ENM Tutorial
A tutorial
21-05-2024
Contents
Preamble 2
Chapter 8 - Model projections 54
Projecting Vipera aspis models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Projecting Vipera latastei models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Projecting Vipera seoanei models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Final considerations for projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Conclusion 64
Preamble
This guide will help you build ecological niche models using the R package biomod2. It provides a basic guide
to data preparation, model building, and ensembling.
The purpose of this text is to support the practical classes associated with the Ecological Modelling course
and to serve as a guide for the code in the scripts. The guide’s structure follows the structure of the code
provided in the practical classes, with additional information to help students follow along more easily.
The study system is composed of three related species of European vipers: Vipera aspis, Vipera latastei, and
Vipera seoanei, which meet in the Iberian Peninsula in a narrow three-way contact zone. In this guide, we
will model each of these species and predict the contact zone where the three vipers intersect. Reptiles are
often considered good study systems for ecological niche modeling because they rely on external factors such
as temperature to regulate their internal temperature. Most of our predictor variables will be climate-related,
but we will also include a vegetation-related variable. Since climate variables are available for different time
periods, we will use ecological niche models to answer the following question:
• How will climate change impact the contact zone between the three vipers in Iberia?
Although the code is divided into several scripts for logical clarity, it could also be written as a single long
script. In this document, I have opted to divide the chapters according to each script. For example, script 01
will be discussed in chapter 1, and so on.
Disclaimer: Many methods and processes have been simplified for teaching purposes. While the code
provides a basic and correct modeling approach, it may lack certain details and additional efforts needed to
make the models more robust. These points will be noted where applicable in the text.
NOTE: This is a work in progress. If you find any errors, please comment and send them
back to me!
Course folder structure and data
• data: where all the needed data is saved and where scripts are generating data
• models: where model data from Biomod will be saved
NOTE: there should be some folders named original within the data folder. These hold the
raw data used by the modelling scripts, obtained along the tutorial. Do not change
these folders.
• V. seoanei: doi:10.15468/dl.cvfxu6
All scripts start by setting the working directory. This is important because all paths in the scripts are
relative to this directory. For example, if you placed your folder “Practical_EcoMod” on the desktop, the
paths would be:
• Windows: C:\Users\Peter\Desktop\Practical_EcoMod
• Mac: /Users/Peter/Desktop/Practical_EcoMod
• Linux: /home/peter/Desktop/Practical_EcoMod
Therefore, the full path to the file data/other/iberia.shp would be:
• Windows: C:\Users\Peter\Desktop\Practical_EcoMod\data\other\iberia.shp
• Mac: /Users/Peter/Desktop/Practical_EcoMod/data/other/iberia.shp
• Linux: /home/peter/Desktop/Practical_EcoMod/data/other/iberia.shp
We set the working directory as follows:
• Windows (notice the use of double \\ ):
setwd("C:\\Users\\Peter\\Desktop\\Practical_EcoMod")
• Mac:
setwd("/Users/Peter/Desktop/Practical_EcoMod")
• Linux:
setwd("/home/peter/Desktop/Practical_EcoMod")
Once the working directory is set, all files can be referenced relative to this directory. For example, instead
of providing the full path to the file, you can simply use the relative path data/other/iberia.shp for R to
locate the file.
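Once the working directory is set, a quick sanity check (a minimal sketch) confirms that R can resolve the relative path:
file.exists("data/other/iberia.shp")  # TRUE once the file is in place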
Each individual script begins by setting this working directory, but in this document, this step is omitted
after being set once here.
NOTE: Besides here, only in Chapter 6 will you be asked to install some packages in this
document. However, be aware that if a package is not installed, R will generate an error
indicating that the package cannot be found or is not available. In such cases, install the
necessary package with install.packages() (changing the name of the package to
the one you need!).
Packages only need to be installed once. However, to use the functionality of an installed package in an R
session, you must load it with the library() command. This command needs to be executed in every new R
session where you want to use the additional package.
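For example (a minimal sketch, using biomod2 as a stand-in for any package you need):
install.packages("biomod2")  # run once per machine
library(biomod2)             # run at the start of every R session that uses it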
# open the package
library(CoordinateCleaner)
We need to open the raw GBIF downloaded data into R. Since we downloaded it in the simple text CSV
format, we can use native R functions to open it. If you check the data with a simple text editor, you would
see that it has a header (the first line has column names), and each column is separated by a TAB. We
specify these settings in the arguments of the read.table() function to correctly read the data. Additionally,
we set two other arguments: quote="" and comment.char="". These settings prevent
characters such as double quotes (") or hash symbols (#) in the data from breaking the
reading process.
vasp <- read.table("data/species/original/Vaspis.csv", sep="\t", header=TRUE, quote="")
vlat <- read.table("data/species/original/Vlatastei.csv", sep="\t", header=TRUE, quote="")
vseo <- read.table("data/species/original/Vseoanei.csv", sep="\t", header=TRUE, quote="")
We can keep track of the dimensions of each dataset to check how many points we remove in the process.1
dim(vasp)
## [1] 35929 50
dim(vlat)
## [1] 4477 50
dim(vseo)
## [1] 2070 50
Some records lack coordinates. We can flag them by summing latitude and longitude, which yields NA whenever
either value is missing, and then filter them out:
mask_vasp <- is.na(vasp$decimalLatitude + vasp$decimalLongitude)
vasp <- vasp[!mask_vasp,]
The exclamation point (!) inverts the logical values: TRUE becomes FALSE, and FALSE becomes TRUE. This
is useful because the rows marked TRUE are preserved when filtering with [ ], and we want to preserve the non-NA rows.
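A minimal illustration of how the negation drives the filter (x is a toy vector, not part of the tutorial data):
x <- c(1, NA, 3)
is.na(x)       # FALSE  TRUE FALSE
!is.na(x)      # TRUE  FALSE  TRUE
x[!is.na(x)]   # keeps 1 and 3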
We repeat for the other species:
mask_vlat <- is.na(vlat$decimalLatitude + vlat$decimalLongitude)
vlat <- vlat[!mask_vlat,]
mask_vseo <- is.na(vseo$decimalLatitude + vseo$decimalLongitude)
vseo <- vseo[!mask_vseo,]
We check the dimensions again to track how many points remain after this first filter:
dim(vasp)
## [1] 33813 50
1 In RStudio, you can monitor the Environment panel, usually located at the top right corner of the interface, for updates on
the dataset dimensions.
dim(vlat)
## [1] 4152 50
dim(vseo)
## [1] 1862 50
flags_vlat <- clean_coordinates(x = vlat, lon = "decimalLongitude", lat = "decimalLatitude",
species = "species", countries = "countryCode",
country_refcol = "iso_a2", tests = tests)
vlat <- vlat[flags_vlat$.summary,]
flags_vseo <- clean_coordinates(x = vseo, lon = "decimalLongitude", lat = "decimalLatitude",
                                species = "species", countries = "countryCode",
                                country_refcol = "iso_a2", tests = tests)
## Flagged 24 of 1862 records, EQ = 0.01.
vseo <- vseo[flags_vseo$.summary,]
For some records, an individual count is reported. This can be higher than one (for instance, the number of
times a species was seen in a camera trap), but it can also be zero, indicating absence. We only keep records
with a count of 1 or more, or where this information is not reported.
# Sometimes there are records of absences. Remove them.
i_vasp <- vasp$individualCount
vasp <- vasp[i_vasp > 0 | is.na(i_vasp) , ]
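Presumably the same filter applies to the other two species (a sketch mirroring the V. aspis step):
i_vlat <- vlat$individualCount
vlat <- vlat[i_vlat > 0 | is.na(i_vlat), ]
i_vseo <- vseo$individualCount
vseo <- vseo[i_vseo > 0 | is.na(i_vseo), ]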
To preserve data quality, we remove very old records, which might refer to populations that no longer exist or
might have very imprecise locations. Here we arbitrarily chose 1970 as the cutoff for reliable records.
# Remove very old records
vasp <- vasp[vasp$year > 1970 | is.na(vasp$year), ]
vlat <- vlat[vlat$year > 1970 | is.na(vlat$year), ]
vseo <- vseo[vseo$year > 1970 | is.na(vseo$year), ]
# Select only coordinates and remove duplicated coordinates
vasp <- data.frame(species="Vaspis", x=vasp$decimalLongitude, y=vasp$decimalLatitude)
vasp <- unique(vasp)
vlat <- data.frame(species="Vlatastei", x=vlat$decimalLongitude, y=vlat$decimalLatitude)
vlat <- unique(vlat)
vseo <- data.frame(species="Vseoanei", x=vseo$decimalLongitude, y=vseo$decimalLatitude)
vseo <- unique(vseo)
dim(vasp)
## [1] 14323 3
dim(vlat)
## [1] 2193 3
dim(vseo)
## [1] 1356 3
Check map
We can merge the three species in the same dataframe to have only one file.
species <- rbind(vasp, vlat, vseo)
and we can plot our points to check in the map if we spot some obvious errors:
sp_factor <- as.factor(species$species)
levels(sp_factor)
## [1] "Vaspis"    "Vlatastei" "Vseoanei"
# plot call reconstructed; the original point styling may differ
plot(species$x, species$y, col=as.integer(sp_factor), pch=19, cex=0.3)
abline(v=-0.5, col='blue')
abline(h=41, col='blue')
[Figure: map of the presence points (species$x vs species$y) with blue reference lines at -0.5° longitude and 41° latitude]
As shown by the plot with the reference lines, V. seoanei has a few suspicious points east of its main
distribution and also to the south, while V. aspis has some erroneous points southwest of its main distribution.
We can remove those points based on the coordinates.
For V. seoanei, everything east of -0.5° longitude OR south of 41° latitude is removed:
mask <- species$species == "Vseoanei" & ( species$x > -0.5 | species$y < 41)
species <- species[!mask,]
For V. aspis, we remove everything west of 5° longitude AND south of 41° latitude.
mask <- species$species == "Vaspis" & ( species$x < 5 & species$y < 41)
species <- species[!mask,]
We set a file name and we write our data set:
filename <- "data/species/speciesPresence_v1.csv"
write.table(species, filename, sep="\t", row.names=FALSE, col.names=TRUE)
We saved as version 1 because we will have to further remove more data to match our model spatial resolution
(Chapter 3).
Chapter 2 - Processing the raster data
In this chapter, we are going to obtain and process the predictors in the data/raster folder. In the process
of building an ecological niche model, we need to select and retrieve variables related to the niche of our
species of interest and our specific hypothesis or question. Often, these variables can be found in several
public databases (e.g., WorldClim climate, satellite products) or other sources, but they usually require some
level of processing. Our goal here is to obtain these variables and perform some GIS raster processing inside R
to ensure uniform geographical properties among the raster layers. These steps include:
• Setting the same extent for all layers (aligning the rasters)
• Setting the same resolution (in this case, the study will be performed at a 10km resolution)
• Applying the same No Data mask to all layers
Defining the resolution and extent of the study area requires careful analysis. In this example, we set an
extent that covers the combined distribution of the three species. We opt for a 10km resolution because many
presence data from national atlases and other sources are available at this resolution. This allows us to use
these data in our model (in the previous chapter, we removed points that had an uncertainty higher than
10km). We also choose this resolution for practical reasons. A very high resolution would require much more
processing time and storage space, which may not be feasible within the available time to run this example.
However, the resolution should always be carefully balanced between data availability (both predictors and
presence data), processing limitations, and, most importantly, the requirements of the study system. For
example, it might not make sense to study elephant distributions at a very high spatial resolution.
Note that in this chapter we are limiting the training area, i.e., the area where the models will retrieve
presence and absence data to calibrate and find the optimal statistical solution. Later, in chapter 5, we will
process the variables to focus only on the area of model projections, which is the Iberian Peninsula.
For this chapter, we need to obtain the 19 bioclimatic variables and the EVI variable. All files should be
downloaded to data/rasters/original under the respective folders climate and evi.
The terra library is essential for niche modeling. It provides functionality to handle spatial data such as
rasters and vectors, as well as perform geoprocessing. This transforms R into a very competent GIS tool.
The library geodata provides access to climate data from WorldClim and other spatial data.
library(terra)
library(geodata)
General layers
Throughout the tutorial, we will define a training area and a projection area for the models. We will explore
each in detail later, but here’s a brief overview: the training area is where we gather data to build the models,
usually covering the species’ distribution range. The projection area is where we aim to generate model
predictions. In this case, the training area includes most of Western and Central Europe and part of North
Africa, aligning with our focal species’ distribution. The projection area is the Iberian Peninsula (Portugal
and Spain), where we will analyze potential contact zones.
Obtain a vector of country polygons with geodata and save it in the appropriate folder.
countries <- world(path = "data/other")
Now we use the information of the countries within the layer to get our projection area. Since we want only
the continental landmass, we have to separate (disaggregate) the polygons with disagg command and keep
the two largest polygons.
iberia <- countries[countries$GID_0 %in% c("PRT", "ESP")]
iberia <- disagg(iberia)
areas <- expanse(iberia)
iberia <- iberia[order(areas, decreasing=TRUE)[1:2]]
writeVector(iberia, "data/other/iberia.shp", overwrite=TRUE)
Processing climate layers
First, we are going to obtain the layers from WorldClim through the geodata library. These are the 19 bioclimatic
variables (see above for details). The function allows us to set a priori the resolution we want (5 arc-minutes
= 5/60 ≈ 0.0833 degrees, roughly 10 km) and the path where the files are written.
clim <- worldclim_global(var = 'bio', res = 5, path = 'data/rasters/original')
plot(clim)
[Figure: global maps of the 19 bioclimatic layers with their original WorldClim names (wc2.1_5m_bio_1 ... wc2.1_5m_bio_19)]
The original layer names are lengthy and complex. Since we’ll frequently reference these variables by name
throughout the tutorial, we’ll simplify them with easily identifiable names. Given that we’re opening all files
sequentially from bio1 to bio19, we can rename the layers accordingly.
names(clim)
names(clim) <- paste0("BIO_", 1:19)
plot(clim)
[Figure: the same 19 layers after renaming (BIO_1 ... BIO_19)]
The study area (training area for models) needs to encompass the full distribution of the three species. We
could either check the extreme coordinates of the presence data or use coarse distribution polygons (e.g.,
IUCN distributions) to define the study area. Here, we opt for a simpler approach by defining a rectangle
that we know includes the species’ distributions. This rectangle extends from -11.0 to 27.0 degrees longitude
and from 25 to 55 degrees latitude.
e <- ext(-11, 27, 25, 55)
plot(clim[[1]])
plot(e, add=T)
[Figure: BIO_1 global map with the study-area rectangle overlaid]
We can now use the rectangle extent to crop the climate layers.
clim <- crop(clim, e)
plot(clim)
[Figure: the 19 bioclimatic layers cropped to the study area]
Process EVI
The Enhanced Vegetation Index (EVI) indicates land productivity and serves here as a continuous variable,
acting as a proxy for habitat. We will use 2020 data available from the OpenGeoHub website, which hosts
a variety of free spatial data useful for ecological niche modeling. The EVI data is global with a 250m
resolution, resulting in large file sizes. We need to manually download the six files for 2020 and save them in
the data/rasters/original/evi folder. Once processed, these files can be deleted to save space. The links to each file
are provided in the table below.
Name                                                                            Month
evi_mod13q1.tmwm.inpaint_p.90_250m_s_20200101_20200228_go_epsg.4326_v20230608  Jan-Feb
evi_mod13q1.tmwm.inpaint_p.90_250m_s_20200301_20200430_go_epsg.4326_v20230608  Mar-Apr
evi_mod13q1.tmwm.inpaint_p.90_250m_s_20200501_20200630_go_epsg.4326_v20230608  May-Jun
evi_mod13q1.tmwm.inpaint_p.90_250m_s_20200701_20200831_go_epsg.4326_v20230608  Jul-Aug
evi_mod13q1.tmwm.inpaint_p.90_250m_s_20200901_20201031_go_epsg.4326_v20230608  Sep-Oct
evi_mod13q1.tmwm.inpaint_p.90_250m_s_20201101_20201231_go_epsg.4326_v20230608  Nov-Dec
As these are huge files, to avoid processing unneeded areas, we open them and crop to the study area defined above.
# Open the six downloaded EVI files (location assumed from the download step above)
evi_files <- list.files("data/rasters/original/evi", pattern="\\.tif$", full.names=TRUE)
evi <- rast(evi_files)
evi <- crop(evi, e)
## |---------|---------|---------|---------|=========================================
plot(evi)
[Figure: the six bimonthly EVI layers cropped to the study area]
We need to further summarize the data into a single file by obtaining the maximum EVI value per pixel for
the year 2020. We can use the app function from terra, which applies a function to summarize all layers on a
pixel-by-pixel basis.
evi <- app(evi, max, na.rm=T)
## |---------|---------|---------|---------|=========================================
plot(evi)
[Figure: maximum EVI for 2020 over the study area]
## [1] "max"
names(evi) <- "evi"
Although it is not mandatory, it is good practice to have the same No Data mask for all variables used in
modeling. We need to check the No Data areas of each layer and set them to a common area. The climate
layers already share the same No Data mask since they were created using the same process. However, it is
not guaranteed that the EVI has the same mask. Here, any pixel set as No Data in a single layer will be set
as No Data in all layers.
Let’s extract and check the combination of all No Data masks. The app command applies a function (in this
case, the sum function) to all corresponding pixels in all layers. By using is.na(), we get TRUE or FALSE
if the pixel is NA or not in each layer. By summing those values, we will know in how many layers that pixel
is set to NA.
na.clim <- app(is.na(clim), sum)
plot(na.clim)
[Figure: number of layers with NA per pixel in the climate stack (values 0 or 19)]
As expected, each pixel is either set to NA in all 19 layers or has data in all of them.
Now we check the EVI No Data mask to see whether additional pixels are missing. If so, we will have to set a
common mask for all layers.
na.evi <- is.na(evi)
plot(na.evi)
[Figure: EVI No Data mask (TRUE/FALSE)]
plot(na.clim + na.evi)
[Figure: summed No Data counts across all 20 layers (values 0, 1, 19, and 20)]
Notice that a few pixels have the value of 1. This means that only the EVI had those pixels set as No Data.
We need to set a final mask where we detect pixels set to NA in, at least, one layer and apply the mask to all
rasters processed.
na.mask <- (na.clim + na.evi) > 0
plot(na.mask)
55
False
True
50
45
40
35
30
25
−10 0 10 20
clim[na.mask] <- NA
evi[na.mask] <- NA
We have finished the processing of the raster variables. The climate and EVI data sets are now fully aligned
and share the same resolution. We just need to save them into different TIF files.
# Write Rasters to file
# Climate is a single file with 19 bands/layers
writeRaster(clim, "data/rasters/climate.tif", overwrite=TRUE)
# EVI is a single file with single band/layer
writeRaster(evi, "data/rasters/evi.tif", overwrite=TRUE)
Note that these files are full georeferenced raster images. They can also be visualised in any GIS
(e.g., QGIS).
values. By setting a resolution and removing duplicate presences, we also reduce bias towards areas with
more sampling effort (an area with more effort tends to have more duplicates at wider spatial resolutions).
We will only need the terra library for this chapter.
library(terra)
We can now open the relevant data for this chapter. Since in the last chapter we fully aligned the raster
variables, we only need to open a single one now as it will provide enough information for this process. The
presence data set is that produced at the end of chapter 1.
evi <- rast("data/rasters/evi.tif")
pres <- read.table("data/species/speciesPresence_v1.csv", sep="\t", header=TRUE)
Now we will extract raster values at each presence location point. For that, we will use the extract function
provided by the terra package. This function will detect in which pixel/cell the presence is located and
extract that information. If a presence is located in a No Data pixel, we will be able to identify it and remove
it. However, here we are also interested in identifying if the pixel is the same or not. We cannot rely on EVI
values for that, as different pixels might have the same value.
We can set the cells=TRUE parameter in the extract function. For each presence, this will provide a unique
pixel/cell identifier. Thus, if two presences are in the same pixel, they will have the same identifier, and we
can keep only one, removing the duplicates.
dt <- extract(evi, pres[,c("x", "y")], cells=TRUE)
head(dt)
## ID evi cell
## 1 1 0.5606292 70845
## 2 2 0.5706229 69470
## 3 3 0.5267091 71757
## 4 4 0.6034736 68556
## 5 5 0.5350373 70382
## 6 6 0.4996728 68569
Now we remove those presences that fall in No Data by checking the evi column of the extracted data table:
mask <- is.na(dt$evi)
sum(mask)
## [1] 10
dt <- dt[!mask,]
pres <- pres[!mask,]
Now that both the presence and extracted datasets are free of missing data, we need to check for pixel
duplicates. However, there is a detail that adds a bit of complexity to the process. Since we are working with
three species, we have to check for duplicates independently for each species. This is because the three species
can coexist in the same pixel (sympatry), and we do not want to confuse sympatry with pixel duplicates.
The easiest way to do this is to create a loop over the species so that we can detect duplicates for each species
independently. The code is organized as follows:
1. Check the names of the species to loop over (three different species in this case).
2. Create a column in the presence data set that stores TRUE if the presence is duplicated in the pixel
(i.e., another presence already occupied the same pixel) or FALSE if the presence is unique or the first
record for a given pixel.
3. Loop over species:
   1. Identify the rows of the presence table referring to the current species in the loop.
   2. Detect duplicates only for the current species.
   3. Update the duplicated column created in step 2 with the relevant information for the species.
4. Print to the console how many duplicates were found for each species.
sps <- unique(pres$species)
pres$duplicated <- NA
# loop reconstructed from the steps described above
for (sp in sps) {
    rows <- which(pres$species == sp)
    dup <- duplicated(dt$cell[rows])
    pres$duplicated[rows] <- dup
    print(paste(sp, "duplicates:", sum(dup)))
}
head(pres)
## species x y duplicated
## 1 Vaspis 2.70 42.05 FALSE
## 2 Vaspis 2.09 42.31 FALSE
## 3 Vaspis 2.70 41.87 FALSE
## 4 Vaspis 1.97 42.49 FALSE
## 5 Vaspis 2.09 42.13 FALSE
## 6 Vaspis 3.06 42.50 FALSE
sum(pres$duplicated)
## [1] 11058
Everything seems coherent, so we proceed to remove the duplicated rows. We have to invert the logical values,
as we want to keep the rows that are not duplicated (set as FALSE). Subsetting with [ ] keeps only the rows
set as TRUE; thus, the exclamation point (!) turns TRUE into FALSE and FALSE into TRUE.
final_pres <- pres[!pres$duplicated, 1:3]
We only kept the first three columns (species name, longitude, and latitude), as they are the information we need
to model.
We can get a summary of how many presence data points we kept at the end of presence processing.
dim(final_pres)
## [1] 6796 3
table(final_pres$species)
##
## Vaspis Vlatastei Vseoanei
## 4969 1227 600
Finally, we end this chapter by saving a new file with the optimized data set that will be used for modelling.
# write to file
filename <- "data/species/speciesPresence_v2.csv"
write.table(final_pres, filename, sep="\t", row.names=FALSE, col.names=TRUE)
Chapter 4 - Variable selection for modelling
It is common practice to collect all predictor variables available that are deemed relevant for our study system
and hypotheses. These include direct variables (like temperature) and processed variables (like distances to
habitat categories). However, it is rarely a good idea to include all variables in the modeling process.
To address this, we can reduce the number of variables by checking the pairwise correlation among them.
Correlated variables will bring mostly the same information to the model and might add a confounding effect
when analyzing the variable importance in the models. There is no universal maximum correlation value for
modeling, but avoiding correlations higher than 0.7 is generally considered sufficient. We will set this target
for this example.
We will only need the terra library for this.
library(terra)
We need the raster variables created in Chapter 2 and the optimised presence data from Chapter 3.
evi <- rast("data/rasters/evi.tif")
clim <- rast("data/rasters/climate.tif")
pres <- read.table("data/species/speciesPresence_v2.csv", sep="\t", header=TRUE)
The rasters are fully aligned (Chapter 2) and we can stack them together. Having the 20 variables (19
bioclimatic + 1 EVI) in the same object will simplify the process of selection.
rst <- c(clim, evi)
names(rst)
NOTE: The buffer defined here will be important for Chapter 6 when we build the models.
We will have to define the same buffer size.
NOTE on spatial data in R: When we open the presence data in R with the read.table
command, we just create a data frame (like a spreadsheet in Excel). Although we might have
columns for longitude and latitude, they are just numbers organized in two columns. Since
creating buffers is a spatial operation, we have to inform R that the presences are spatial
points. In other words, we have to formalize that the data frame is, in fact, a list of spatial
points. This can be done with the vect command from the terra package, which is also used
to open spatial data files such as shapefiles. We must define which columns store the longitude
and latitude (in our case, the 'x' and 'y' columns).
We first convert the presence table to spatial points, as described in the note, and then define buffers with a
radius of roughly one degree (110 km; terra's buffer takes the radius in meters for lon/lat data). The buffer
function creates an individual buffer around each presence, and we then aggregate them into a single polygon.
v <- vect(pres, geom=c("x", "y"), crs="EPSG:4326")
bsize <- 110000
buf <- buffer(v, bsize)
buf <- aggregate(buf)
To check how everything is looking, we can plot presence and buffers over the EVI variable.
plot(evi)
plot(buf, cex=0.25, add=T)
plot(v, add=T)
[Figure: presence points and aggregated buffers plotted over the EVI layer]
# Extract the raster values as a table for the correlation analysis
# (this step was reconstructed; the original call is not shown)
dt <- as.data.frame(rst, na.rm=FALSE)
head(dt)
## BIO_1 BIO_2 BIO_3 BIO_4 BIO_5 BIO_6 BIO_7 BIO_8 BIO_9 BIO_10 BIO_11 BIO_12
## 1 NA NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA NA NA NA NA NA
## BIO_13 BIO_14 BIO_15 BIO_16 BIO_17 BIO_18 BIO_19 evi
## 1 NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA NA
There are many NAs that come from ocean pixels. We should remove those rows and keep only the valid
data before proceeding with correlations. Since the rasters are fully aligned (Chapter 2), including the No
Data mask, we can use the NAs from a single raster to filter all other variables.
dt <- dt[!is.na(dt[,1]),]
head(dt)
## BIO_1 BIO_2 BIO_3 BIO_4 BIO_5 BIO_6 BIO_7 BIO_8 BIO_9 BIO_10
corr <- cor(dt)
round(corr, 3)
## BIO_1 1.000 0.429 0.516 -0.087 0.870 0.896 0.217 0.410 0.737 0.956
## BIO_2 0.429 1.000 0.675 0.433 0.767 0.073 0.855 0.129 0.422 0.566
## BIO_3 0.516 0.675 1.000 -0.345 0.520 0.458 0.202 0.020 0.538 0.420
## BIO_4 -0.087 0.433 -0.345 1.000 0.325 -0.476 0.831 0.201 -0.181 0.207
## BIO_5 0.870 0.767 0.520 0.325 1.000 0.585 0.661 0.360 0.676 0.957
## BIO_6 0.896 0.073 0.458 -0.476 0.585 1.000 -0.222 0.275 0.690 0.737
## BIO_7 0.217 0.855 0.202 0.831 0.661 -0.222 1.000 0.178 0.174 0.468
## BIO_8 0.410 0.129 0.020 0.201 0.360 0.275 0.178 1.000 -0.046 0.448
## BIO_9 0.737 0.422 0.538 -0.181 0.676 0.690 0.174 -0.046 1.000 0.685
## BIO_10 0.956 0.566 0.420 0.207 0.957 0.737 0.468 0.448 0.685 1.000
## BIO_11 0.962 0.304 0.587 -0.349 0.739 0.967 -0.006 0.314 0.764 0.844
## BIO_12 -0.514 -0.489 -0.263 -0.240 -0.630 -0.340 -0.443 -0.319 -0.382 -0.583
## BIO_13 -0.242 -0.348 -0.096 -0.292 -0.387 -0.100 -0.373 -0.219 -0.131 -0.326
## BIO_14 -0.734 -0.544 -0.472 -0.034 -0.766 -0.585 -0.380 -0.256 -0.678 -0.746
## BIO_15 0.627 0.385 0.431 -0.118 0.583 0.539 0.201 0.134 0.609 0.595
## BIO_16 -0.272 -0.375 -0.101 -0.322 -0.425 -0.112 -0.407 -0.251 -0.155 -0.365
## BIO_17 -0.719 -0.530 -0.468 -0.019 -0.747 -0.582 -0.359 -0.270 -0.636 -0.725
## BIO_18 -0.751 -0.511 -0.512 0.059 -0.760 -0.658 -0.304 -0.069 -0.777 -0.736
## BIO_19 -0.100 -0.301 0.103 -0.497 -0.293 0.114 -0.458 -0.426 0.097 -0.242
## evi -0.230 -0.487 -0.269 -0.193 -0.392 -0.039 -0.435 -0.055 -0.344 -0.302
## BIO_11 BIO_12 BIO_13 BIO_14 BIO_15 BIO_16 BIO_17 BIO_18 BIO_19 evi
## BIO_1 0.962 -0.514 -0.242 -0.734 0.627 -0.272 -0.719 -0.751 -0.100 -0.230
## BIO_2 0.304 -0.489 -0.348 -0.544 0.385 -0.375 -0.530 -0.511 -0.301 -0.487
## BIO_3 0.587 -0.263 -0.096 -0.472 0.431 -0.101 -0.468 -0.512 0.103 -0.269
## BIO_4 -0.349 -0.240 -0.292 -0.034 -0.118 -0.322 -0.019 0.059 -0.497 -0.193
## BIO_5 0.739 -0.630 -0.387 -0.766 0.583 -0.425 -0.747 -0.760 -0.293 -0.392
## BIO_6 0.967 -0.340 -0.100 -0.585 0.539 -0.112 -0.582 -0.658 0.114 -0.039
## BIO_7 -0.006 -0.443 -0.373 -0.380 0.201 -0.407 -0.359 -0.304 -0.458 -0.435
## BIO_8 0.314 -0.319 -0.219 -0.256 0.134 -0.251 -0.270 -0.069 -0.426 -0.055
## BIO_9 0.764 -0.382 -0.131 -0.678 0.609 -0.155 -0.636 -0.777 0.097 -0.344
## BIO_10 0.844 -0.583 -0.326 -0.746 0.595 -0.365 -0.725 -0.736 -0.242 -0.302
## BIO_11 1.000 -0.430 -0.154 -0.701 0.640 -0.174 -0.689 -0.742 0.039 -0.192
## BIO_12 -0.430 1.000 0.899 0.714 -0.251 0.915 0.773 0.741 0.784 0.369
## BIO_13 -0.154 0.899 1.000 0.371 0.151 0.990 0.453 0.480 0.864 0.182
## BIO_14 -0.701 0.714 0.371 1.000 -0.800 0.396 0.984 0.891 0.255 0.541
## BIO_15 0.640 -0.251 0.151 -0.800 1.000 0.133 -0.766 -0.617 0.186 -0.501
## BIO_16 -0.174 0.915 0.990 0.396 0.133 1.000 0.468 0.502 0.885 0.196
## BIO_17 -0.689 0.773 0.453 0.984 -0.766 0.468 1.000 0.891 0.318 0.515
## BIO_18 -0.742 0.741 0.480 0.891 -0.617 0.502 0.891 1.000 0.204 0.423
## BIO_19 0.039 0.784 0.864 0.255 0.186 0.885 0.318 0.204 1.000 0.177
## evi -0.192 0.369 0.182 0.541 -0.501 0.196 0.515 0.423 0.177 1.000
The correlation matrix provides the correlation score for each pair of variables. It is symmetric (the correlation
of A to B is the same as B to A) and the diagonal gives the self-correlation, which is obviously 1.
This matrix is not easy to use for the variable elimination process. Remember that we set a goal of not
having an absolute correlation higher than 0.7, so we should find a way to select important variables that will
provide a new set where the maximum pairwise correlation won’t exceed that value.
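Before clustering, a quick cross-check can list the offending pairs directly from the matrix (a base-R sketch; the high helper object is ours, not part of the original scripts):
# List variable pairs whose absolute correlation exceeds the 0.7 target
high <- which(abs(corr) > 0.7 & upper.tri(corr), arr.ind=TRUE)
data.frame(var1 = rownames(corr)[high[, 1]],
           var2 = colnames(corr)[high[, 2]],
           r = round(corr[high], 3))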
One of the best and simplest ways is to organize the variables in a dendrogram by using a hierarchical
clustering method. We should group in the same cluster variables that are similar to each other (highly
correlated). To do this, we need to transform the correlation score into a dissimilarity or distance matrix,
where increasing values will reflect that variables are less similar (low correlation). For that we need to:
1. Get the absolute value of the correlation, because we only care about the magnitude of the correlation
and not its direction (two variables correlated at -0.87 provide the same information as two correlated
at +0.87). At this point, the matrix has values from 0 (low correlation) to 1 (high correlation).
2. Ensure that lower values reflect similarity (high correlation) and higher values describe dissimilarity
(low correlation). To achieve this, we need to invert the values by subtracting the absolute value from 1.
# Convert correlation to distance
dist <- 1 - abs(corr)
We need to discard one of the symmetrical sides and the diagonal. An easy way to do this is to convert the
matrix to a dist object:
dist <- as.dist(dist)
hc <- hclust(dist, method="single")
plot(hc)
[Figure: cluster dendrogram of the 20 variables (hclust, method "single")]
We can select 5 variables, one from each major group in the tree. Variables can be selected based on how
easily they can be interpreted in terms of the niche of the species. In this case, we select:
• evi
• BIO_8: Mean Temperature of Wettest Quarter
• BIO_3: Isothermality
• BIO_12: Annual Precipitation
• BIO_1: Annual Mean Temperature
We select the variables using the names defined previously and create a new raster with only those layers.
sel_vars <- c("evi", "BIO_8", "BIO_3", "BIO_12", "BIO_1")
sel_rst <- rst[[sel_vars]]
Finally, we check the correlations within the selected set to confirm that no pair exceeds our 0.7 target.
# Check correlation of final dataset
sel_dt <- dt[,sel_vars]
cor(sel_dt)
Chapter 5 - Processing the projection layers
library(terra)
library(geodata)
To crop and cut the rasters, we use the Iberian Peninsula shapefile
# Open Iberia shape file
v <- vect("data/other/iberia.shp")
Cropping reduces the extent, while masking sets No Data for all pixels outside the Iberian Peninsula.
# Cut current data and save ('vars' holds the raster of selected variables from Chapter 4)
vars <- crop(vars, v, mask=TRUE)
The current dataset for projection is ready, and we can write it to a file. All projection files will be saved in
the data/rasters folder, and filenames will start with a “proj_” prefix.
writeRaster(vars, "data/rasters/proj_current.tif", overwrite=TRUE)
We can now loop over ages and ssp scenarios. The process inside the nested loop is:
1. Download the dataset for the combination of GCM model + age + SSP (~10 km spatial resolution) using
geodata, writing to data/rasters/original.
2. Crop the rasters to the study area (same extent as vars).
3. Rename layers to follow our name convention.
4. Select only the variables of interest from the 19 bioclimatic variables.
5. Join the EVI layer (static) to the set of selected bioclimatic variables.
6. Write the produced raster (the four selected bioclimatic layers plus EVI) to the data/rasters/ folder,
following a filename convention that starts with proj_age_ssp, in TIF format.
ages <- c("2021-2040", "2041-2060", "2061-2080")
scenarios <- c("126", "585")
bios <- c("BIO_8", "BIO_3", "BIO_12", "BIO_1")  # selected bioclimatic variables (Chapter 4)
for (age in ages) {
for (ssp in scenarios) {
# download and rename
clim_fut <- cmip6_world(model='MPI-ESM1-2-HR', ssp=ssp, time=age, var='bioc', res=5,
                        path='data/rasters/original')
clim_fut <- crop(clim_fut, vars[[1]], mask=TRUE)
names(clim_fut) <- paste0("BIO_", 1:19)
#filter the selected vars
clim_fut <- clim_fut[[bios]]
# merge with evi
fut <- c(vars[["evi"]], clim_fut)
# Write raster to file
writeRaster(fut, paste0("data/rasters/proj_", age, "_", ssp, ".tif"), overwrite=TRUE)
}
}
At the end of this loop, we should have 6 projection rasters for future time periods plus 1 projection for
current period.
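As a quick sanity check (a sketch), we can list the files that now follow the naming convention:
list.files("data/rasters", pattern="^proj_")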
install.packages("xgboost")
install.packages("maxnet")
install.packages("gam")
install.packages("tidyterra")
Recall that the presence table has data for the 3 species and that the selected variables are in a single raster
file.
Model building
In this chapter, we will build models for three species. We’ll start with a detailed walkthrough for Vipera
aspis, and then apply the same process to the other two species. All models will be saved in the models
directory, which should be empty at the start. A folder for each species, organized by biomod2, will be created
automatically.
The first step is to provide the data to a formatting function that prepares it in a way that biomod2
can understand, and that keeps track of the parameters.
We need a response variable, which is a vector of 1s with the same length as the number of presence locations
for the species. This is because we don't have absences (zeros): we are using a presence-only modeling strategy
and will create pseudo-absences.
We also need a table of coordinates for each presence, which is present in our presence data file.
Finally, we need the explanatory variables, or predictors, that we prepared before.
# How many presences for Vaspis?
n_va <- sum(pres$species == "Vaspis")
resp_va <- rep(1, n_va)
coords_va <- pres[pres$species == "Vaspis", 2:3]
In the formatting step, we also provide some other modeling parameters. We need to define the strategy for
selecting pseudo-absences, which is based on a buffer around presences. We use the same buffer as before
(Chapter 4), which is 110,000 m. We ask for three different sets of 10,000 pseudo-absences following the disk
strategy (same as the buffer), which allows us to capture some variation in the initial conditions of each
model.
The dir.name sets the folder where models are to be saved and the resp.name sets a name for the project and
folder to be created inside the models folder.
vaData <- BIOMOD_FormatingData(resp.var = resp_va,
expl.var = vars,
resp.xy = coords_va,
resp.name = "Vaspis",
dir.name = "models",
PA.nb.rep = 3,
PA.nb.absences = 10000,
PA.strategy = "disk",
PA.dist.max = 110000)
##
## -=-=-=-=-=-=-=-=-=-=-=-=-=-= Vaspis Data Formating -=-=-=-=-=-=-=-=-=-=-=-=-=-=
##
## ! No data has been set aside for modeling evaluation
## ! No data has been set aside for modeling evaluation
##
## Checking Pseudo-absence selection arguments...
##
## ! No data has been set aside for modeling evaluation
## > Disk pseudo absences selection
## > Pseudo absences are selected in explanatory variables|---------|---------|---------|---------|==
## > random pseudo absences selection
## > Pseudo absences are selected in explanatory variables
##
## ! No data has been set aside for modeling evaluation
## ! No data has been set aside for modeling evaluation
## -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Done -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
We have the data formatted, so we can proceed to the modeling step. Here, we need to define the model
algorithms to run and the train/test strategy. This allows us to control overfitting by providing examples
to train the model and a set of independent examples to test its predictive ability. We choose K-fold
cross-validation for this purpose. In this case, we divide the dataset into five equal-sized folds and test four
against one for each fold.
As models should be evaluated for performance, we choose two common techniques for this: TSS (True Skill
Statistic) and ROC (Receiver Operating Characteristic)
There are many algorithms available for modelling. We limit ourselves to four different algorithms, which are
generally fast to train and still provide a good example of the ensemble modelling strategy. We choose:
• Generalized Linear Model (GLM): A general regression model using a binomial family of distributions
(logistic regression)
• Generalized Additive Models (GAM): A regression-based approach where smoothing functions are
applied to each predictor.
• Maximum Entropy (Maxnet): A machine learning method sharing some similarities with GAM; maxnet is
an R implementation of the common MaxEnt approach.
• Extreme Gradient Boosting (XGBOOST): A machine learning method related to decision trees, known
to be fast and perform well.
Each of these algorithms has specific parameters to tune. The biomod2 library offers different strategies
for tuning them. We are choosing the ‘bigboss’ strategy, which consists of a list of the best parameters as
defined by the package authors. This means that the authors tested several algorithms under different data
and provided a list of options for each algorithm that generally performs well in most common situations.
Variable importance is a measure of how much each predictor influences each model. In the context of
Ecological Niche Modeling (ENM), it identifies the variables that most determine the distribution of the species.
This is done via permutations, set in var.import. We set this to 3, which is a low number of permutations,
but since they take some time to run, we need to keep this number low for this example.
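Conceptually, the permutation importance of a variable is one minus the correlation between the model's reference predictions and its predictions after shuffling that variable; a simplified sketch of the idea (our own toy function, not biomod2 internals):
# Toy permutation importance: shuffle one predictor, re-predict, and measure
# how much the predictions change (1 - correlation), averaged over n permutations
perm_importance <- function(model, data, var, n = 3) {
  ref <- predict(model, data, type = "response")
  mean(replicate(n, {
    shuffled <- data
    shuffled[[var]] <- sample(shuffled[[var]])
    1 - cor(ref, predict(model, shuffled, type = "response"))
  }))
}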
vaModel <- BIOMOD_Modeling(bm.format = vaData,
modeling.id = "EcoMod",
models = c("GAM", "GLM", "MAXNET", "XGBOOST"),
CV.strategy = "kfold",
CV.k = 5,
CV.do.full.models = FALSE,
OPT.strategy = "bigboss",
var.import = 3,
metric.eval = c("TSS", "ROC"))
vaModel
##
## -=-=-=-=-=-=-=-=-=-=-=-=-=-=-= BIOMOD.models.out -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
##
## Modeling folder : models
##
## Species modeled : Vaspis
##
## Modeling id : EcoMod
##
## Considered variables : evi BIO_8 BIO_3 BIO_12 BIO_1
##
##
## Computed Models : Vaspis_PA1_RUN1_GAM Vaspis_PA1_RUN1_GLM
## Vaspis_PA1_RUN1_MAXNET Vaspis_PA1_RUN1_XGBOOST Vaspis_PA1_RUN2_GAM
## Vaspis_PA1_RUN2_GLM Vaspis_PA1_RUN2_MAXNET Vaspis_PA1_RUN2_XGBOOST
## Vaspis_PA1_RUN3_GAM Vaspis_PA1_RUN3_GLM Vaspis_PA1_RUN3_MAXNET
## Vaspis_PA1_RUN3_XGBOOST Vaspis_PA1_RUN4_GAM Vaspis_PA1_RUN4_GLM
## Vaspis_PA1_RUN4_MAXNET Vaspis_PA1_RUN4_XGBOOST Vaspis_PA1_RUN5_GAM
## Vaspis_PA1_RUN5_GLM Vaspis_PA1_RUN5_MAXNET Vaspis_PA1_RUN5_XGBOOST
## Vaspis_PA2_RUN1_GAM Vaspis_PA2_RUN1_GLM Vaspis_PA2_RUN1_MAXNET
## Vaspis_PA2_RUN1_XGBOOST Vaspis_PA2_RUN2_GAM Vaspis_PA2_RUN2_GLM
## Vaspis_PA2_RUN2_MAXNET Vaspis_PA2_RUN2_XGBOOST Vaspis_PA2_RUN3_GAM
## Vaspis_PA2_RUN3_GLM Vaspis_PA2_RUN3_MAXNET Vaspis_PA2_RUN3_XGBOOST
## Vaspis_PA2_RUN4_GAM Vaspis_PA2_RUN4_GLM Vaspis_PA2_RUN4_MAXNET
## Vaspis_PA2_RUN4_XGBOOST Vaspis_PA2_RUN5_GAM Vaspis_PA2_RUN5_GLM
## Vaspis_PA2_RUN5_MAXNET Vaspis_PA2_RUN5_XGBOOST Vaspis_PA3_RUN1_GAM
## Vaspis_PA3_RUN1_GLM Vaspis_PA3_RUN1_MAXNET Vaspis_PA3_RUN1_XGBOOST
## Vaspis_PA3_RUN2_GAM Vaspis_PA3_RUN2_GLM Vaspis_PA3_RUN2_MAXNET
## Vaspis_PA3_RUN2_XGBOOST Vaspis_PA3_RUN3_GAM Vaspis_PA3_RUN3_GLM
## Vaspis_PA3_RUN3_MAXNET Vaspis_PA3_RUN3_XGBOOST Vaspis_PA3_RUN4_GAM
## Vaspis_PA3_RUN4_GLM Vaspis_PA3_RUN4_MAXNET Vaspis_PA3_RUN4_XGBOOST
## Vaspis_PA3_RUN5_GAM Vaspis_PA3_RUN5_GLM Vaspis_PA3_RUN5_MAXNET
## Vaspis_PA3_RUN5_XGBOOST
##
##
## Failed Models : none
##
## -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
This command will take some time to run as it needs to build all the combinations of models we specified. It
displays general information on the process, allowing us to check if any models are failing. If models do fail,
we may need to investigate the reasons and adjust some parameters. If everything runs successfully, we can
plot the results to evaluate the models’ performance.
plt <- bm_PlotEvalMean(vaModel, dataset="calibration")
[Figure: mean ROC vs TSS per algorithm on the calibration set]
plt <- bm_PlotEvalMean(vaModel, dataset="validation")
[Figure: mean ROC vs TSS per algorithm on the validation set]
NOTE: We are assigning the results of the functions to a plt object that gets replaced with
each plot. Since the functions used to plot also return the data used to build the plots, assigning
the results to an object prevents displaying the entire dataset in the console. Although we are
not using this data here, you can check the contents of plt at any time!
These plots show the model performance for the calibration and validation sets. Remember that we used the
K-fold strategy with K = 5 folds. This means that all five combinations of four folds (calibration/training)
against one fold (validation) were used.
The plots show the TSS (True Skill Statistic) and the ROC/AUC (Receiver Operating Characteristic/Area
Under Curve) values. The closer these values are to 1, the better the model performance. Of course, different
algorithms, different sets of pseudo-absences, and different sets of training data (folds) will provide slight
variations in performance. Generally, the models we built are acceptable, as indicated by ROC/AUC values
higher than 0.65 and TSS greater than 0.25. In general, XGBoost and Maxnet, both machine learning
methods, outperform the simpler regression-based methods.
Both calibration and validation show acceptable performance, indicating good model fitting, predictive ability,
and a low level of overfitting.
We can compare together the different runs and algorithms:
plt <- bm_PlotEvalBoxplot(vaModel, group.by = c('algo', 'run'))
[Figure: boxplots of ROC and TSS per algorithm (GAM, GLM, MAXNET, XGBOOST) and run]
An important step in the modeling results analysis is to understand how much each predictor variable is
contributing to each model.
plt <- bm_PlotVarImpBoxplot(vaModel, group.by = c('expl.var', 'run', 'algo'))
[Figure: variable importance boxplots per algorithm, grouped by explanatory variable and run]
In general, EVI is a low contributor for this species, while the climate variables are more strongly associated
with the distribution of V. aspis. Although precipitation (BIO_12) ranks high for Maxnet and GLM, a
temperature-related variable (BIO_3) is generally important across all models.
We can plot the response curves for each variable, which show how the predictions of the model vary with
the range of values for each variable. This involves using the models to predict across the entire range of
values for one variable while keeping the other variables fixed at a constant value. In this example, the other
variables are fixed at their median value.
plt <- bm_PlotResponseCurves(vaModel, fixed.var = 'median')
[Figure: response curves for Vaspis's models for evi, BIO_8, BIO_3, BIO_12, and BIO_1]
We can directly plot all models, but it might take a long time to finish since there are many models (PA x
Runs x Algorithm = 3 x 5 x 4 = 60 models) to show. However, all models are built and saved into a single
file, models/Vaspis/proj_Present/proj_Present_Vaspis.tif, which can be inspected in any GIS.
Vipera latastei
The input data is now for Vipera latastei.
n_vl <- sum(pres$species == "Vlatastei")
coords_vl <- pres[pres$species == "Vlatastei", 2:3]
resp_vl <- rep(1, n_vl)
# Formatting and modeling reconstructed with the same settings used for V. aspis
vlData <- BIOMOD_FormatingData(resp.var = resp_vl,
                               expl.var = vars,
                               resp.xy = coords_vl,
                               resp.name = "Vlatastei",
                               dir.name = "models",
                               PA.nb.rep = 3,
                               PA.nb.absences = 10000,
                               PA.strategy = "disk",
                               PA.dist.max = 110000)
vlModel <- BIOMOD_Modeling(bm.format = vlData,
                           modeling.id = "EcoMod",
                           models = c("GAM", "GLM", "MAXNET", "XGBOOST"),
                           CV.strategy = "kfold",
                           CV.k = 5,
                           CV.do.full.models = FALSE,
                           OPT.strategy = "bigboss",
                           var.import = 3,
                           metric.eval = c("TSS", "ROC"))
# Plotting examples
plt <- bm_PlotEvalMean(vlModel, dataset="calibration")
[Figure: mean ROC vs TSS per algorithm on the calibration set]
plt <- bm_PlotEvalMean(vlModel, dataset="validation")
[Figure: mean ROC vs TSS per algorithm on the validation set]
plt <- bm_PlotEvalBoxplot(vlModel, group.by = c('algo', 'run'))
[Figure: boxplots of ROC and TSS per algorithm and run]
plt <- bm_PlotVarImpBoxplot(vlModel, group.by = c('expl.var', 'run', 'algo'))
[Figure: variable importance boxplots per algorithm, grouped by explanatory variable and run]
plt <- bm_PlotResponseCurves(vlModel, fixed.var = 'median')
[Figure: response curves for Vlatastei's models for evi, BIO_8, BIO_3, BIO_12, and BIO_1]
Vipera seoanei
The input data is now for Vipera seoanei.
n_vs <- sum(pres$species == "Vseoanei")
coords_vs <- pres[pres$species == "Vseoanei", 2:3]
resp_vs <- rep(1, n_vs)
# Formatting and the start of the modeling call reconstructed with the same settings as before
vsData <- BIOMOD_FormatingData(resp.var = resp_vs,
                               expl.var = vars,
                               resp.xy = coords_vs,
                               resp.name = "Vseoanei",
                               dir.name = "models",
                               PA.nb.rep = 3,
                               PA.nb.absences = 10000,
                               PA.strategy = "disk",
                               PA.dist.max = 110000)
vsModel <- BIOMOD_Modeling(bm.format = vsData,
                           modeling.id = "EcoMod",
                           models = c("GAM", "GLM", "MAXNET", "XGBOOST"),
                           CV.strategy = "kfold",
CV.k = 5,
CV.do.full.models = FALSE,
OPT.strategy = "bigboss",
var.import = 3,
metric.eval = c("TSS", "ROC"))
# Plotting examples
plt <- bm_PlotEvalMean(vsModel, dataset="calibration")
[Figure: mean ROC vs TSS per algorithm on the calibration set]
plt <- bm_PlotEvalMean(vsModel, dataset="validation")
[Figure: mean ROC vs TSS per algorithm on the validation set]
plt <- bm_PlotEvalBoxplot(vsModel, group.by = c('algo', 'run'))
[Figure: boxplots of ROC and TSS per algorithm and run]
plt <- bm_PlotVarImpBoxplot(vsModel, group.by = c('expl.var', 'run', 'algo'))
[Figure: variable importance boxplots per algorithm, grouped by explanatory variable and run]
plt <- bm_PlotResponseCurves(vsModel, fixed.var = 'median')
[Figure: response curves for Vseoanei's models for evi, BIO_8, BIO_3, BIO_12, and BIO_1]
Chapter 7 - Ensemble modelling
For this chapter, we will again need the biomod2 package.
library(biomod2)
The object vaModel contains all the models built in the previous chapter. Now, we can proceed with
ensembling. The package provides a straightforward function with customizable arguments for ensembling.
We specify the models we want to ensemble and choose the type of ensembling. In this example, we are
ensembling all models together, meaning that all combinations of pseudo-absences, algorithms, and runs are
merged. However, we could choose to ensemble only specific subsets, for example per algorithm by setting
em.by = "algo", which would result in a separate ensemble for each algorithm. This approach could
be useful for evaluating the performance of each algorithm individually. Here, we are ensembling all models
together using em.by = "all", while models.chosen = 'all' ensures that all built models are available
for ensembling.
We select the median as the ensembling statistic. Other options available include mean and weighted mean,
where weights are typically based on model performance metrics or confidence intervals.
Since the resulting ensemble model provides predictions that have not yet been tested, we evaluate its
performance using the same metrics as before (TSS and ROC). Variable importance is also assessed using 3
permutations, as in previous steps.
vaEnsbl <- BIOMOD_EnsembleModeling(bm.mod = vaModel,
models.chosen = 'all',
em.by = 'all',
em.algo = 'EMmedian',
metric.eval = c('TSS', 'ROC'),
var.import = 3)
vaEnsbl
##
## -=-=-=-=-=-=-=-=-=-=-=-=-= BIOMOD.ensemble.models.out -=-=-=-=-=-=-=-=-=-=-=-=-=
##
## sp.name : Vaspis
##
## expl.var.names : evi BIO_8 BIO_3 BIO_12 BIO_1
##
##
## models computed:
## Vaspis_EMmedianByTSS_mergedData_mergedRun_mergedAlgo, Vaspis_EMmedianByROC_mergedData_mergedRun_merge
##
## models failed: none
##
## -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
At this stage, we should have a new file in the models/Vaspis folder containing the ensembling results, which
we can access later as needed.
We can evaluate the ensemble model by plotting its performance metrics. Note that we no longer have separate
calibration and validation datasets; the ensemble is tested on the full data.
plt <- bm_PlotEvalBoxplot(vaEnsbl, group.by = c('algo', 'algo'))
[Figure: ensemble ROC and TSS for the EMmedian model]
Both metrics show reasonable performance. However, at this stage, we should consider adding more predictors
or potentially removing some to improve model performance. If an important predictor defining the species
niche is missing, it could significantly affect performance. Another strategy could involve adjusting the width
of the buffer for pseudo-absences. Increasing environmental variability might facilitate model fitting, but this
approach should be carefully balanced and justified (as it might be a contentious subject!).
We can also check the variable importance.
plt <- bm_PlotVarImpBoxplot(vaEnsbl, group.by = c('expl.var', 'algo', 'algo'))
[Figure: ensemble variable importance (EMmedian)]
In this model, BIO_12 is clearly among the most important variables, while EVI contributes the least to the
models.
And finally we can check variable response curves.
plt <- bm_PlotResponseCurves(vaEnsbl, fixed.var = 'median')
[Figure: ensemble response curves for Vaspis (evi, BIO_8, BIO_3, BIO_12, BIO_1) for the EMmedianByTSS and EMmedianByROC models]
The ensemble models provide a single response curve per variable, which is easier to interpret than before.
We should pay most attention to the curves of the variables that contribute most to the models, such as
BIO_12. The predicted probability seems to increase with the value of these variables. Notice how EVI,
the least important variable, appears to show a trend, but the range of predictions associated with it is small.
# Ensemble
vlEnsbl <- BIOMOD_EnsembleModeling(bm.mod = vlModel,
models.chosen = 'all',
em.by = 'all',
em.algo = 'EMmedian',
metric.eval = c('TSS', 'ROC'),
var.import = 3)
# Plotting examples
plt <- bm_PlotEvalBoxplot(vlEnsbl, group.by = c('algo', 'algo'))
[Figure: ensemble ROC and TSS for Vlatastei (EMmedian)]
plt <- bm_PlotVarImpBoxplot(vlEnsbl, group.by = c('expl.var', 'algo', 'algo'))
[Figure: ensemble variable importance for Vlatastei]
plt <- bm_PlotResponseCurves(vlEnsbl, fixed.var = 'median')
[Figure: ensemble response curves for Vlatastei (evi, BIO_8, BIO_3, BIO_12, BIO_1)]
# Ensemble
vsEnsbl <- BIOMOD_EnsembleModeling(bm.mod = vsModel,
models.chosen = 'all',
em.by = 'all',
em.algo = 'EMmedian',
metric.eval = c('TSS', 'ROC'),
var.import = 3)
# Plotting examples
plt <- bm_PlotEvalBoxplot(vsEnsbl, group.by = c('algo', 'algo'))
[Figure: ensemble ROC and TSS for Vseoanei (EMmedian)]
plt <- bm_PlotVarImpBoxplot(vsEnsbl, group.by = c('expl.var', 'algo', 'algo'))
[Figure: ensemble variable importance for Vseoanei]
plt <- bm_PlotResponseCurves(vsEnsbl, fixed.var = 'median')
[Figure: ensemble response curves for Vseoanei (evi, BIO_8, BIO_3, BIO_12, BIO_1)]
Chapter 8 - Model projections
We need both the terra library, for opening the variables, and biomod2, for interpreting the models.
library(terra)
library(biomod2)
We can define here some variables that will be useful for looping over ages and SSPs (similar to Chapter 5)
to read and project the models onto different time periods.
ages <- c("2021-2040", "2041-2060", "2061-2080")
ssps <- c("126", "585")
Again, we will proceed with the first species and then run the same code for the other two to obtain all
projections.
We can open the raster for the current period variables as this will be constant for all current projections.
curp <- rast("data/rasters/proj_current.tif")
Projecting Vipera aspis models
Until now, we have been focusing on the continuous probability predictions. However, models can be converted
to binary (presence vs. absence) by applying a threshold chosen to maximize performance. We can check which
thresholds biomod2 uses for these models.
# Load the ensemble model built in Chapter 7 (mirrors the loading code used for the other species below)
vaName <- load("models/Vaspis/Vaspis.EcoMod.ensemble.models.out")
vaEnsbl <- eval(str2lang(vaName))
get_evaluations(vaEnsbl)
## full.name merged.by.PA
## 1 Vaspis_EMmedianByTSS_mergedData_mergedRun_mergedAlgo mergedData
## 2 Vaspis_EMmedianByTSS_mergedData_mergedRun_mergedAlgo mergedData
## 3 Vaspis_EMmedianByROC_mergedData_mergedRun_mergedAlgo mergedData
## 4 Vaspis_EMmedianByROC_mergedData_mergedRun_mergedAlgo mergedData
## merged.by.run merged.by.algo filtered.by algo metric.eval cutoff
## 1 mergedRun mergedAlgo TSS EMmedian TSS 498.0
## 2 mergedRun mergedAlgo TSS EMmedian ROC 500.5
## 3 mergedRun mergedAlgo ROC EMmedian TSS 498.0
## 4 mergedRun mergedAlgo ROC EMmedian ROC 500.5
## sensitivity specificity calibration validation evaluation
## 1 72.650 66.797 0.394 NA NA
## 2 71.926 67.609 0.738 NA NA
## 3 72.650 66.797 0.394 NA NA
## 4 71.926 67.609 0.738 NA NA
Depending on the metric, the threshold varies, as it has to maximize the performance of each metric separately.
For TSS, the threshold (cutoff) is 498.0, and for ROC it is 500.5. Note that biomod2 multiplies
the probabilities by 1000, so they range between 0 and 1000 (roughly 0.498 and 0.5 on a 0-1 scale).
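As an illustration (a sketch; cont stands for a hypothetical SpatRaster of continuous ensemble predictions on the 0-1000 scale), the binary conversion that BIOMOD_EnsembleForecasting performs via metric.binary amounts to:
cutoff <- 498              # TSS cutoff reported by get_evaluations()
binary <- cont >= cutoff   # 'cont' is a hypothetical continuous prediction raster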
We can project the model to current conditions in the Iberian Peninsula. The function will create both continuous and binary predictions, the latter using the thresholds mentioned above. It is important to name the projections appropriately for easy identification of the rasters. We set proj.name to Current, and a new folder named proj_Current will be created inside models/Vaspis. In this folder, two rasters will be saved: one with the continuous and one with the binary predictions. The argument new.env refers to the new environmental conditions, i.e., the raster variables we want to project onto.
vaProj <- BIOMOD_EnsembleForecasting(bm.em = vaEnsbl,
                                     proj.name = 'Current',
                                     new.env = curp,
                                     models.chosen = 'all',
                                     metric.binary = 'TSS')
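We can plot the projection object for a quick look (biomod2 provides a plot method for projection objects). As a hedged sketch, the binarization can also be reproduced manually with terra, reading back the continuous raster with the file naming pattern used in Chapter 9 (contp and binp are hypothetical names):
plot(vaProj)
# Manually thresholding the continuous prediction at the TSS cutoff
contp <- rast("models/Vaspis/proj_Current/proj_Current_Vaspis_ensemble.tif")[[1]]
binp <- contp >= 498  # cutoff from get_evaluations()
plot(binp)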
We will have to do the same for each combination of age and SSP projections. For that, we use a nested for
loop as before (chapter 5). The logic is as follows:
• For each combination of age and ssp
1. Construct the respective filename using age and SSP and open the raster.
2. Project the ensembled model to the respective time period, correctly naming the projection with
age_ssp.
for (age in ages) {
  for (ssp in ssps) {
    projVars <- rast(paste0("data/rasters/proj_", age, "_", ssp, ".tif"))
    vaProj <- BIOMOD_EnsembleForecasting(bm.em = vaEnsbl,
                                         proj.name = paste0(age, "_", ssp),
                                         new.env = projVars,
                                         models.chosen = 'all',
                                         metric.binary = 'TSS')
  }
}
At this stage, we should have a folder inside models/Vaspis for each projection that we built. We now need
to run the code for the other two species to complete all projections.
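We can list the projection folders to confirm they were all created; a minimal sketch using base R:
list.dirs("models/Vaspis", recursive = FALSE)
Projecting Vipera latastei models
Now for Vipera latastei. As before, the saved ensemble model may need to be loaded back into the session; a minimal sketch, assuming the same file naming used for V. seoanei below:
vlName <- load("models/Vlatastei/Vlatastei.EcoMod.ensemble.models.out")
vlEnsbl <- eval(str2lang(vlName))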
get_evaluations(vlEnsbl)
Projecting Vipera seoanei models
Now for Vipera seoanei.
vsName <- load("models/Vseoanei/Vseoanei.EcoMod.ensemble.models.out")
vsEnsbl <- eval(str2lang(vsName))
get_evaluations(vsEnsbl)
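After running the same forecasting code as for the other two species, we can compare the projections across time periods. A minimal sketch of loading the projection rasters into p1 through p4, assuming the SSP 585 scenario and the folder naming pattern used above:
p1 <- rast("models/Vseoanei/proj_Current/proj_Current_Vseoanei_ensemble.tif")
p2 <- rast("models/Vseoanei/proj_2021-2040_585/proj_2021-2040_585_Vseoanei_ensemble.tif")
p3 <- rast("models/Vseoanei/proj_2041-2060_585/proj_2041-2060_585_Vseoanei_ensemble.tif")
p4 <- rast("models/Vseoanei/proj_2061-2080_585/proj_2061-2080_585_Vseoanei_ensemble.tif")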
layout(matrix(1:4, 1))
plot(p1[[1]], main="Present", range=c(0,1000))
plot(p2[[1]], main="2021-2040", range=c(0,1000))
plot(p3[[1]], main="2041-2060", range=c(0,1000))
plot(p4[[1]], main="2061-2080", range=c(0,1000))
[Figure: V. seoanei ensemble projections for the Present, 2021-2040, 2041-2060, and 2061-2080 periods (probability scale 0-1000).]
Chapter 9 - Finally, answering the question
In this chapter, we will open all projection rasters created previously and analyze how these predictions
can help identify sympatric areas for the three species. As the results showed, the models reveal different
relationships with the predictors, resulting in distinct distributions. How do these potential distributions
intersect with each other?
We will use the terra package since most operations in this chapter involve geoprocessing with rasters. While
any GIS software could be used, using terra allows us to automate some processes.
library(terra)
We can also prepare some variables to loop over future periods, as we did before.
ages <- c("2021-2040", "2041-2060", "2061-2080")
ssps <- c("126", "585")
We can begin by importing the rasters of the current projections for the three species. We divide by 1000 to
get the projections in the range of 0 to 1.
va <- rast("models/Vaspis/proj_Current/proj_Current_Vaspis_ensemble.tif")[[1]]
va <- va / 1000
vl <- rast("models/Vlatastei/proj_Current/proj_Current_Vlatastei_ensemble.tif")[[1]]
vl <- vl / 1000
vs <- rast("models/Vseoanei/proj_Current/proj_Current_Vseoanei_ensemble.tif")[[1]]
vs <- vs / 1000
We need some processing to determine the sympatric zone between the vipers. Since we are using continuous probability maps, an easy way to estimate the probability of finding all three species together is to calculate the product of the three maps (assuming independence, the product approximates the joint probability). The maximum possible score is 1, where all species have the highest probability, while any location with a probability of 0 for any species will have a sympatry probability of zero.
Additionally, there are other methods to calculate sympatry. For instance, using the binary presence/absence
maps produced earlier, the sum of all maps would provide a range of values from 0 to 3, indicating how many
species are present at each location.
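As a hedged sketch of this alternative, assuming the binary rasters saved by BIOMOD_EnsembleForecasting carry a TSSbin suffix (check the actual file names inside each proj_Current folder, as they may vary with the biomod2 version):
# Species richness from the binary maps: 0-3 species present per cell
vaBin <- rast("models/Vaspis/proj_Current/proj_Current_Vaspis_ensemble_TSSbin.tif")[[1]]
vlBin <- rast("models/Vlatastei/proj_Current/proj_Current_Vlatastei_ensemble_TSSbin.tif")[[1]]
vsBin <- rast("models/Vseoanei/proj_Current/proj_Current_Vseoanei_ensemble_TSSbin.tif")[[1]]
richness <- vaBin + vlBin + vsBin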
Here, we will calculate the product of the probabilities:
CZ <- va*vl*vs
names(CZ) <- "Present"
plot(CZ, main= "Current contact zone probability")
[Figure: current contact zone probability map (scale 0.00-0.30).]
We have to calculate the same for all combinations of age and ssp projections. For that, we use the same style of nested loop as before, adding layers to the CZ raster with correct names following the age_ssp pattern.
The logic of the loop is the following:
• For each combination of age and ssp:
1. Reconstruct the general name of the file.
2. Reconstruct the path for the V. aspis projection file.
3. Import the raster and divide by 1000.
4. Reconstruct the path for the V. latastei projection file.
5. Import the raster and divide by 1000.
6. Reconstruct the path for the V. seoanei projection file.
7. Import the raster and divide by 1000.
8. Create the future prediction of the contact zone as futCZ.
9. Name the layer correctly with age_ssp.
10. Stack this layer onto the CZ raster.
for (ssp in ssps) {
  for (age in ages) {
    nm <- paste0("proj_", age, "_", ssp)
    vaFile <- paste0("models/Vaspis/", nm, "/", nm, "_Vaspis_ensemble.tif")
    va <- rast(vaFile)[[1]]/1000
    vlFile <- paste0("models/Vlatastei/", nm, "/", nm, "_Vlatastei_ensemble.tif")
    vl <- rast(vlFile)[[1]]/1000
    vsFile <- paste0("models/Vseoanei/", nm, "/", nm, "_Vseoanei_ensemble.tif")
    vs <- rast(vsFile)[[1]]/1000
    # Produce the future contact zone and rename it according to age and ssp
    futCZ <- va*vl*vs
    names(futCZ) <- paste0(age, "_", ssp)
    # Stack this layer onto the CZ raster
    CZ <- c(CZ, futCZ)
  }
}
We can now plot the raster of contact zone predictions.
layout(matrix(c(1,1,1,1:7), 5, byrow=TRUE))
plot(CZ[[1]], main=names(CZ)[1], col=hcl.colors(25), range=c(0, 0.3))
plot(CZ[[2]], main=names(CZ)[2], col=hcl.colors(25), range=c(0, 0.3))
plot(CZ[[3]], main=names(CZ)[3], col=hcl.colors(25), range=c(0, 0.3))
plot(CZ[[4]], main=names(CZ)[4], col=hcl.colors(25), range=c(0, 0.3))
plot(CZ[[5]], main=names(CZ)[5], col=hcl.colors(25), range=c(0, 0.3))
plot(CZ[[6]], main=names(CZ)[6], col=hcl.colors(25), range=c(0, 0.3))
plot(CZ[[7]], main=names(CZ)[7], col=hcl.colors(25), range=c(0, 0.3))
[Figure: contact zone probability maps for Present, 2021-2040_126, 2041-2060_126, 2061-2080_126, 2021-2040_585, 2041-2060_585, and 2061-2080_585 (scale 0.00-0.30).]
We can opt for other visualization strategies to show the change in the contact zone probabilities. For instance,
we can show histograms of the probabilities to highlight how they will decrease for certain scenarios. The
easiest way to do this is to extract the raster data into a data frame:
CZdata <- data.frame(CZ)
head(CZdata)
For the histograms, we need to define the width of the bars and the range of values. In this example, the values vary between 0 and 0.28, so we choose a width of 0.01 and create a sequence of cut points (breaks), extending the range to 0.35 to be safe.
brk <- seq(0, 0.35, 0.01)
We can retrieve the names of the layers in the raster to use as titles for plotting. The order of the layers
corresponds to the order of columns in the data frame.
titles <- names(CZ)
Instead of plotting each column individually, as we did above for plotting the rasters, we can use a for loop
that iterates over the columns. We set some nice colors for the bars and plot!
layout(matrix(c(1,1,1,1:7), 5, byrow=TRUE))
for (i in 1:ncol(CZdata)) {
hist(CZdata[,i], breaks=brk, main=titles[i], col="steelblue1", border="white")
}
[Figure: histograms of contact zone probabilities for Present and each age_ssp combination.]
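To complement the histograms, we can quantify the change with a summary statistic per layer; a minimal sketch using terra's global():
# Mean contact zone probability for the Present and each future scenario
global(CZ, "mean", na.rm = TRUE)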
We can see that the worst-case scenario predicts a contraction of the contact zone to a narrow area by the end of the century. This narrow area is where all three vipers are currently found together, while other regions show high potential for all species even though not all of them reach those areas. One reason could be that the species are at the limits of their ecological niches there, and one species may be excluded by another that is more dominant in the region.
We can write the raster to a file that can be opened in any GIS for further exploration of the results.
writeRaster(CZ, "models/sympatry.tif")
Conclusion
We have seen the process of ecological niche modeling using R, focusing on viper species in the Iberian
Peninsula as a case study. We covered topics such as data collection, preprocessing predictors and presence
data, model building, ensembling, and projection to future climate scenarios. The general steps we covered
were:
• Data Collection: Gather presence data for the species of interest along with environmental variables like bioclimatic variables and NDVI.
• Data Preprocessing: Process the data by checking for duplicate presence records and other erroneous coordinates; making sure the predictors are fully geographically aligned and that correlations among variables are acceptable; and preparing a training and projection area.
• Model Building: Use the biomod2 package to build ecological niche models, selecting appropriate
pseudo-absences, algorithms, and resampling strategies.
• Model Evaluation: Evaluate model performance using metrics like True Skill Statistic (TSS) and
Receiver Operating Characteristic (ROC) curves.
• Model Ensembling: Ensemble multiple models to create a more robust prediction by combining different pseudo-absence sets, algorithms, and resampling strategies.
• Projection: Project the ensembled models to future climate scenarios to assess potential distribution
changes under different climate conditions.
• Spatial Analysis: Analyze spatial patterns in the model predictions, for instance, identifying sympatric zones and assessing changes in contact zones over time.
At this point, you should understand that:
• Ecological niche modeling provides valuable insights into species distributions and responses to environmental changes.
• Model performance depends on various factors such as data quality, variable selection, and algorithm choice, and can vary between model evaluation metrics.
• Ensembling multiple models improves prediction accuracy and robustness by capturing a wider range of situations and simplifying the analysis of complex model results.
• Future climate projections highlight potential shifts in species distributions and can provide valuable insight for conservation planning and management strategies.
• Spatial analysis is an important step in ecological modelling and, in this example, was essential to identify areas of overlap and potential interactions between species.