Geostatistical Learning

Air pollution models in epidemiologic studies with geostatistics and machine learning

Manuel Ribeiro
Centro de Recursos Naturais e Ambiente (CERENA), Instituto Superior Técnico,
Universidade de Lisboa
Av. Rovisco Pais, 1049-001 Lisboa, Portugal
[email protected]

Abstract. Development of air pollution models for large regions is a priority
for population-based epidemiologic studies. The rapid development of big data
information systems and machine learning algorithms has opened new ground for
refinements of current model frameworks. This commentary overviews recent
contributions and outlines extensions from geostatistics and machine learning
perspectives. In the coming years, expected advances will expand the use of
learning algorithms to model spatial trends and incorporate spatial covariance
models in the learning processes. These extensions will refine existing
modelling frameworks, contributing to improved accuracy of air pollution models
for exposure assessment.

Keywords: geostatistics · machine learning · air pollution · exposure

1 Overview
Cancers, respiratory and cardiovascular diseases have been the most common
reasons for premature death attributable to air pollution around the globe. Still,
apart from a few exceptions, there is no definite knowledge about the impacts
of air pollution on populations. Therefore, the development of accurate models
predicting air pollution is critical to better understand the health impacts of
exposure to air pollution.
Air pollution models require data collection, which often comes from monitoring
by local or national institutions [6], using a set of ground-based stations
geographically dispersed over locations selected for regulatory purposes.
Although these stations provide air pollutant information with high accuracy and
high temporal resolution, they are sparse in space and have limited area coverage.
In such cases, monitoring data can be combined with statistical models that
predict pollution gradients between ground-based monitoring stations and in
areas where no stations exist at all [32]. To this end, remotely sensed data [14],
chemical transport models [15] or Copernicus Atmosphere Monitoring Service
(CAMS) products [26] have been used. Similarly, meteorological data [35] and
other auxiliary information such as land-use data or distance to roads are useful
for modelling, predicting and describing smaller-scale variations [5, 30] and
offer a way to obtain high-resolution air pollution maps [29, 33].
More traditional modelling approaches rely on linear combinations of potential
predictors while considering their geographic locations. More specifically,
land use regressions [17, 21] employ linear regression combining geographic
data collected on air pollutants and on potential predictors, and are a relatively
cheap and practical approach to predict air pollution exposure in urban areas;
spatial regressions [36, 38] allow linear regression parameters to vary smoothly
as a function of spatial neighborhoods and increase the potential to capture
non-stationary relations between predictors and air pollutants; geostatistical
models [24, 33] account for spatial trends and spatial autocorrelation, capturing
the intensity and direction of the spatial processes underlying air pollution
concentrations, which is especially relevant in large regions, where variations in
their intensity and direction are likely to occur.
In the last decade, approaches integrating machine learning techniques have
been applied to handle non-linear interactions between predictors [2], and hybrid
models have been developed to account for spatial dependence of air pollutants
[28]. These approaches mostly rely on supervised learning; popular algorithms
include decision trees [7], random forests [19, 37] and artificial neural
networks [1]. Compared to the abovementioned linear models, the advantage of
these algorithms lies in their capability to deal better with such complex
non-linearities [18].
In addition to air pollution mapping, assessment of the spatial uncertainty of
predictions (hereinafter referred to as "uncertainty") is also required, since
predictions have uncertainty that typically varies spatially and temporally. In fact,
quantifying uncertainty (e.g. prediction intervals) provides important information
about the prediction error in the air pollutant values used in exposure
assessment, and addresses spatial misalignment of pollutant and health data [9].
Not taking uncertainty into account may produce misleading conclusions about
the potential impacts on population health and weaken the scientific validity
of the findings. Uncertainty can be assessed in a Bayesian framework using a
hierarchical formulation [3], or with geostatistical algorithms using stochastic
modelling [23]. Yet, quantification of spatial uncertainty is still in its infancy.
After this overview, the commentary is divided into two further sections:
Section 2 presents some basics of geostatistical and machine learning models,
and Section 3 presents some possible ways ahead from the perspective of
combining geostatistical and machine learning methods.

2 Models

While geostatistics is widely used in environmental applications of spatial data
modelling, it has difficulty modelling non-linear and complex dependency
structures. Machine learning algorithms, in turn, are powerful tools to solve
complex real-world problems. Yet, they usually do not consider sample locations
and spatial autocorrelation in the learning process. Combining the strengths of
both approaches extends the existing methods to model air pollution, which is
challenging due to the complex non-linear physical and chemical underlying
processes and interactions affecting air pollution concentrations at different
spatiotemporal scales.

2.1 Geostatistical models

Geostatistics includes a set of statistical techniques suited to model spatially
correlated data, providing optimal unbiased predictions with minimum mean
squared prediction error. Geostatistics is based on the assumption that sample
data are a single realization of a spatial random process, Y(s). In (1), S represents
a finite, continuous spatial domain and s is a spatial index.

{Y(s) : s ∈ S}    (1)
Variogram functions are used to model the spatial correlation in the data, and
Kriging (univariate case) or Cokriging (multivariate case) estimators are used
for optimal linear prediction. There are different types of Kriging techniques,
and the choice depends on the data available and the aims of the analysis. For a
comprehensive review of Kriging methods, please refer to [8].
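As a rough illustration of the variogram step, the empirical semivariogram can be estimated by averaging half the squared differences between pairs of observations grouped by separation distance. The sketch below is a minimal example under stated assumptions: the function name `empirical_semivariogram`, the toy station coordinates and values, and the lag width are all hypothetical, and anisotropy is ignored.

```python
import math
from collections import defaultdict

def empirical_semivariogram(coords, values, lag_width, n_lags):
    """Return {lag_index: (mean_distance, semivariance, n_pairs)}.

    gamma(h) = 1 / (2 N(h)) * sum of (y_i - y_j)^2 over the N(h) pairs whose
    separation distance falls in the bin [k * lag_width, (k + 1) * lag_width).
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # distance sum, squared-diff sum, count
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            k = int(d // lag_width)
            if k >= n_lags:
                continue  # pair is farther apart than the largest lag considered
            acc = sums[k]
            acc[0] += d
            acc[1] += (values[i] - values[j]) ** 2
            acc[2] += 1
    return {k: (s[0] / s[2], s[1] / (2 * s[2]), s[2]) for k, s in sorted(sums.items())}

# Toy data: four stations on a line, one unit apart (illustrative values only).
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
values = [10.0, 12.0, 11.0, 15.0]
gamma = empirical_semivariogram(coords, values, lag_width=1.0, n_lags=3)
```

In practice a parametric model (e.g., spherical or exponential) would then be fitted to these empirical points before Kriging; that step is not shown here.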
When the objective of the analysis is to assess spatial uncertainty, Kriging
techniques are not adequate. Specifically, Kriging techniques aim at minimizing
the prediction error, and this involves smoothing data variability. To assess
spatial uncertainty, geostatistical algorithms using stochastic simulations should
be preferred, as they reproduce the fluctuations observed in the sample data
instead of producing an optimal prediction. These algorithms generate simulated
maps with statistical properties similar to those of the observed data (e.g.,
histogram, spatial covariance), and have been applied to quantify air quality
uncertainty when assessing the health impacts of exposure [23, 24, 34].

2.2 Machine learning models

In the past decade, many different learning algorithms have been applied to
model air pollution for epidemiology studies [2]. Currently, these models use
modelling frameworks with high-dimensional input spaces including features
such as air pollution data (e.g., remotely sensed data, chemical transport
models (CTM), ground-based stations), meteorological model outputs (e.g.,
temperature, wind, pressure, humidity), and other relevant biophysical and
socio-economic information (e.g., land-use land cover, seasonality, elevation,
distance to roads, population density).
These approaches mostly rely on supervised regression algorithms, in which the
learning process uses the input features, x, for training and chooses a function
f(x) with parameters w that best fits the output values y. To decide which
function f(x, w) provides the best approximation to y, a measure of loss is
computed. A popular choice for regression problems is to tune w from data with
the following squared-error loss function, L:

L[y, f(x, w)] = [y - f(x, w)]^2    (2)
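
To make the role of the loss in (2) concrete, the following sketch tunes the parameters w of a deliberately simple model, f(x, w) = w0 + w1 x, by gradient descent on the mean squared loss. The linear form of f, the learning rate, the number of iterations and the toy data are assumptions chosen for illustration only; the algorithms discussed in this section use far richer function classes.

```python
def squared_loss(y, y_hat):
    """Loss of equation (2) for a single observation."""
    return (y - y_hat) ** 2

def f(x, w):
    """Hypothetical model: a straight line with parameters w = [w0, w1]."""
    return w[0] + w[1] * x

def fit(xs, ys, lr=0.05, n_iter=2000):
    """Tune w by gradient descent on the mean squared loss."""
    w = [0.0, 0.0]
    n = len(xs)
    for _ in range(n_iter):
        # Gradient of the mean squared loss with respect to w0 and w1.
        g0 = sum(-2 * (y - f(x, w)) for x, y in zip(xs, ys)) / n
        g1 = sum(-2 * (y - f(x, w)) * x for x, y in zip(xs, ys)) / n
        w = [w[0] - lr * g0, w[1] - lr * g1]
    return w

# Toy data generated from y = 1 + 2x without noise (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w = fit(xs, ys)
mean_loss = sum(squared_loss(y, f(x, w)) for x, y in zip(xs, ys)) / len(xs)
```

With noise-free data the fitted parameters approach the generating values and the mean loss approaches zero; with real data the minimum loss stays positive.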

Random forests are among the most used algorithms to predict air pollution
[25], as they provide an excellent trade-off between interpretability and
performance when compared with other learning methods (e.g., neural networks,
support vector machines). Simpler learning methods such as linear regression
approaches have a long tradition in science and are also widely used [18]. Following
the principle of Occam's razor, when no major interactions exist and associations
are approximately linear, these models perform well and should be preferred
over more complex algorithms. Neural networks are more complex than the
abovementioned algorithms and are better suited to non-linear problems.
Several algorithmic variants (e.g., artificial and convolutional neural networks)
are well established in air pollution modelling [27]. In complex non-linear and
high-dimensional feature spaces, these algorithms can achieve high model
performance [4].
The trade-off between model performance and the ability to generalize is a
relevant issue to be taken into consideration, as models should avoid overfitting
the data (which leads to low performance on new data). Moreover, the better
performance of some complicated models may come at the cost of interpretability.
For example, in neural networks or in support vector machine algorithms, the
transfer functions and kernels used to fit the data are rather artificial
constructs and hard to interpret. A clearer understanding of the internal
mechanisms leading to a given output will contribute to improving their use in
environmental applications.

3 Geostatistical learning models

A modelling framework that combines geostatistical and machine learning methods,
and has been explored in recent years, builds on supervised machine learning
with Regression Kriging (RK) [12], also known as Kriging with External Drift (KED).
Typically, KED decomposes a spatial random process Y(s) into a function
representing the deterministic part of the variation (e.g. a linear model), m(s),
and a stochastic residual, r(s), describing the spatially dependent part of the
variability:

Y(s) = m(s) + r(s)    (3)

Real-world environments are highly non-linear and exhibit spatially dependent
underlying processes. Therefore, the assumptions of simple linear models, such
as linearity and independence, are a limitation of the traditional KED technique.
The performance of the deterministic part, m(s), can be improved with machine
learning algorithms, as they do not require normally distributed data and are
able to handle interactions and non-linear relationships between input features.
In regression settings, random forests [19] or neural networks [27] are some of
the algorithmic models that could be applied to model the deterministic part of
RK.
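
The decomposition Y(s) = m(s) + r(s) can be sketched in a few lines. The example below is only an illustration under several assumptions: the trend learner is passed in as a hypothetical `fit_trend` callable (an ordinary least squares placeholder is used here, where a random forest or neural network could be swapped in), the residuals are interpolated by simple kriging with zero mean, and the exponential covariance parameters are assumed rather than inferred from a fitted variogram.

```python
import numpy as np

def exp_cov(h, sill=1.0, corr_range=2.0):
    """Exponential covariance C(h) = sill * exp(-h / range); parameters are assumed."""
    return sill * np.exp(-h / corr_range)

def regression_kriging(coords, X, y, coords_new, X_new, fit_trend, cov=exp_cov):
    """Predict Y(s) = m(s) + r(s): learned trend plus simple kriging of residuals."""
    predict_trend = fit_trend(X, y)      # m(s): any regressor returning a predictor
    resid = y - predict_trend(X)         # r(s) at the sampled locations
    d_obs = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    d_new = np.linalg.norm(coords_new[:, None] - coords[None, :], axis=-1)
    weights = np.linalg.solve(cov(d_obs), cov(d_new).T)  # simple kriging weights
    return predict_trend(X_new) + weights.T @ resid

def fit_linear(X, y):
    """Placeholder trend model: ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ beta

# Toy data: five stations, one covariate (illustrative values only).
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
X = np.array([[0.2], [0.5], [0.1], [0.9], [0.7]])
y = np.array([10.0, 14.0, 9.0, 20.0, 16.0])

pred_at_samples = regression_kriging(coords, X, y, coords, X, fit_linear)
pred_new = regression_kriging(coords, X, y, np.array([[0.5, 0.5]]),
                              np.array([[0.4]]), fit_linear)
```

Because no nugget effect is included, the kriged residuals honour the data exactly, so predictions at the sampled locations reproduce the observations; adding a nugget would relax this.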

Uncertainty. The spatial covariance model of the residual part, r(s), can be
inferred from data and incorporated in a geostatistical simulation algorithm to
predict (map) a spatially continuous surface of air pollution residuals. In fact,
these simulation algorithms characterize the spatial parameters of interest,
providing the means to generate realizations that reproduce the anisotropic
spatial correlation structure and the empirical histogram of the stochastic
residuals, r(s). Then, simulated results are added to the learned/tuned model
outputs to create the final simulated air pollutant maps. Two key maps may be
drawn from the set of geostatistical simulations: a pointwise median map of the
variable of interest and a map of the attached spatial uncertainty, which can be
quantified by the pointwise interquartile range (IQR).
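
The two summary maps can be obtained by computing quartiles node by node across the set of realizations. The sketch below assumes that each realization is stored as a flat list with one value per grid node; the function name `summarize_realizations` and the numbers are hypothetical, and the realizations themselves would come from a simulation algorithm that is not shown.

```python
import statistics

def summarize_realizations(realizations):
    """Pointwise median and interquartile range across simulated maps.

    `realizations` is a list of maps, each a flat list with one value per node.
    """
    medians, iqrs = [], []
    for node_values in zip(*realizations):  # all values simulated at one node
        q1, q2, q3 = statistics.quantiles(node_values, n=4, method="inclusive")
        medians.append(q2)
        iqrs.append(q3 - q1)
    return medians, iqrs

# Five hypothetical realizations on a three-node grid (illustrative numbers only).
realizations = [
    [10.0, 20.0, 5.0],
    [12.0, 20.0, 9.0],
    [11.0, 20.0, 7.0],
    [13.0, 20.0, 1.0],
    [14.0, 20.0, 3.0],
]
median_map, iqr_map = summarize_realizations(realizations)
```

Nodes where the realizations agree (the second node here) receive an IQR of zero, while nodes where they disagree receive a wider IQR, which is what makes the IQR map a local measure of uncertainty.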
Parallelization and multithreading can be used to speed up geostatistical
simulation algorithms. In fact, these algorithms provide a measure of spatial
uncertainty, but at a high computational cost [10]. Several geostatistical
simulation algorithms are available to assess spatial uncertainty of predictions
(e.g., Sequential Gaussian Simulation, Direct Sequential Simulation, or Turning
Bands). The performance of different algorithms can be measured using a loss
function in order to select the best model. Readers may refer to
Gómez-Hernández and Srivastava [11] for a comprehensive review of simulation
algorithms.

Further extensions. The ability to model and predict air pollution accurately
with spatial data can be further refined by extending research on the tuning of
machine learning parameters. It is well known that air pollution exhibits spatial
trends and spatial autocorrelation [22]. Relying on random samples of spatial
data to tune model parameters fails to assess a model's performance in terms of
spatial mapping and only validates its ability to reproduce the sampled data [20].
Therefore, instead of leaving spatial interactions to be learned from data,
spatial parameters or functions (e.g., the geostatistical semi-variogram) could
be used as tuning parameters within the algorithmic model to control the optimal
space and time ranges to sample [13]. The idea can easily be extended to models
aiming at prediction in the multipollutant case [31], by extending the spatial
tuning parameters to cross-variograms.
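
One simple way to let a spatial parameter steer the resampling, offered here only as an illustrative sketch and not as the procedure of the cited works, is to build cross-validation folds from spatial blocks whose size is tied to the variogram range. The function name `spatial_block_folds`, the block size and the coordinates below are hypothetical.

```python
def spatial_block_folds(coords, block_size, n_folds):
    """Assign each sample to a fold through the square block that contains it.

    `block_size` would typically be at least as large as the variogram range,
    so that nearby (autocorrelated) samples are kept in the same fold.
    """
    blocks = [(int(x // block_size), int(y // block_size)) for x, y in coords]
    block_to_fold = {b: i % n_folds for i, b in enumerate(sorted(set(blocks)))}
    return [block_to_fold[b] for b in blocks]

# Six hypothetical stations; block size stands in for an assumed variogram range.
coords = [(0.1, 0.2), (0.4, 0.3), (5.2, 0.1), (5.4, 0.6), (0.3, 5.5), (5.1, 5.9)]
folds = spatial_block_folds(coords, block_size=5.0, n_folds=2)

# Indices used for testing and training when fold 1 is held out.
test_idx = [i for i, f in enumerate(folds) if f == 1]
train_idx = [i for i, f in enumerate(folds) if f != 1]
```

Compared with purely random splits, holding out whole blocks means the model is evaluated at locations that are spatially separated from the training data, which is closer to the mapping task.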
The influential work of Kanevski [16] offers a solid perspective on machine
learning developments for the analysis and modelling of spatial environmental
data, which can be easily transferred to air pollution modelling. In future
work, researchers could extend the model framework by considering refinements in
the assessment of spatial uncertainty and by optimizing parameter tuning through
the incorporation of spatial properties of the data.

Acknowledgements. Manuel Ribeiro acknowledges Fundação para a Ciência
e Tecnologia for the research contract IF2018/CP1384/IST-ID/175/2018 and
CERENA pluriannual funding FCT-UIDB/04028/2020.

References
1. Araujo, L.N., Belotti, J.T., Alves, T.A., Tadano, Y.d.S., Siqueira, H.:
Ensemble method based on artificial neural networks to estimate air pollution
health risks. Environmental Modelling and Software 123, 104567 (2020),
https://doi.org/10.1016/J.ENVSOFT.2019.104567
2. Bellinger, C., Jabbar, M., M. S., Z., O., O.V.: A systematic review of data mining
and machine learning for air pollution epidemiology. BMC Public Health (2017),
https://doi.org/10.1186/s12889-017-4914-3
3. Beloconi, A., Vounatsou, P.: Substantial reduction in particulate matter
air pollution across Europe during 2006-2019: A spatiotemporal modeling
analysis. Environmental Science and Technology 55, 15505–15518 (2021),
https://doi.org/10.1021/acs.est.1c03748
4. Cabaneros, S.M., Calautit, J.K., Hughes, B.R.: A review of artificial neural network
models for ambient air pollution prediction. Environmental Modelling and Software
(2019), https://doi.org/10.1016/j.envsoft.2019.06.014
5. Cai, J., Ge, Y., Li, H., Yang, C., Liu, C., Meng, X., Wang, W., Niu,
C., Kan, L., Schikowski, T., Yan, B., Chillrud, S.N., Kan, H., Jin, L.:
Application of land use regression to assess exposure and identify potential
sources in PM2.5, BC, NO2 concentrations. Atmospheric Environment 223 (2020),
https://doi.org/10.1016/j.atmosenv.2020.117267
6. WHO Regional Office for Europe: Health risk assessment of air pollution: general
principles. Copenhagen (2016)
7. Gao, Y., Wang, Z., Li, C.y., Zheng, T., Peng, Z.R.: Assessing neighborhood
variations in ozone and PM2.5 concentrations using decision tree method. Building and
Environment 188 (2021), https://doi.org/10.1016/j.buildenv.2020.107479
8. Goovaerts, P.: Geostatistics for Natural Resources Evaluation. Oxford University
Press, New York (1997)
9. Gryparis, A., Paciorek, C.J., Zeka, A., Schwartz, J., Coull, B.A.: Measurement
error caused by spatial misalignment in environmental epidemiology. Biostatistics
10, 258–274 (2009), https://doi.org/10.1093/biostatistics/kxn033
10. Guan, Q., Kyriakidis, P.C., Goodchild, M.F.: A parallel computing approach to fast
geostatistical areal interpolation. International Journal of Geographical
Information Science 25, 1241–1267 (2011), https://doi.org/10.1080/13658816.2011.563744
11. Gómez-Hernández, J.J., Srivastava, R.M.: One step at a time: The origins of
sequential simulation and beyond. Mathematical Geosciences 53, 193–209 (2021),
https://doi.org/10.1007/s11004-021-09926-0
12. Hengl, T., Heuvelink, G.B.M., Stein, A.: A generic framework for spatial
prediction of soil variables based on regression-kriging. Geoderma 120, 75–93 (2004),
https://doi.org/10.1016/j.geoderma.2003.08.018
13. Hoffimann, J., Zortea, M., de Carvalho, B., Zadrozny, B.: Geostatistical learning:
Challenges and opportunities. arXiv:2102.08791 (2021)
14. de Hoogh, K., Saucy, A., Shtein, A., Schwartz, J., West, E.A., Strassmann, A.,
Puhan, M., Röösli, M., Stafoggia, M., Kloog, I.: Predicting fine-scale daily NO2
for 2005-2016 incorporating OMI satellite data across Switzerland. Environmental
Science and Technology (2019), https://doi.org/10.1021/acs.est.9b03107
15. Jerrett, M., Turner, M.C., Beckerman, B.S., Pope III, C.A., Donkelaar, A.v., Martin,
R.v., Serre, M., Crouse, D.L., Gapstur, S.M.S., Krewski, D., Diver, W.R., Coogan,
P.F.P., Thurston, G.D., Burnett, R.T.: Comparing the health effects of ambient
particulate matter estimated using ground-based versus remote sensing exposure
estimates. Environmental Health Perspectives 125, 552–559 (2017)
16. Kanevski, M.: Machine Learning for Spatial Environmental Data: Theory,
Applications, and Software. EPFL Press (2009)
17. Kerckhoffs, J., Hoek, G., Portengen, L., Brunekreef, B., Vermeulen, R.C.H.:
Performance of prediction algorithms for modeling outdoor air pollution
spatial surfaces. Environmental Science and Technology 53, 1413–1421 (2019),
https://doi.org/10.1021/acs.est.8b06038
18. Liu, X., Lu, D., Zhang, A., Liu, Q., Jiang, G.: Data-driven machine learning in
environmental pollution: Gains and problems. Environmental Science and Technology
(2022), https://doi.org/10.1021/acs.est.1c06157
19. Ma, R., Ban, J., Wang, Q., Zhang, Y., Yang, Y., He, M.Z., Li, S., Shi, W., Li,
T.: Random forest model based fine scale spatiotemporal O3 trends in the
Beijing-Tianjin-Hebei region in China, 2010 to 2017. Environmental Pollution 276 (2021),
https://doi.org/10.1016/j.envpol.2021.116635
20. Meyer, H., Reudenbach, C., Wöllauer, S., Nauss, T.: Importance of spatial
predictor variable selection in machine learning applications moving
from data reproduction to spatial prediction. Ecological Modelling 411 (2019),
https://doi.org/10.1016/j.ecolmodel.2019.108815
21. Morley, D.W., Gulliver, J.: A land use regression variable generation, modelling
and prediction tool for air pollution exposure assessment. Environmental Modelling
and Software 105, 17–23 (2018), https://doi.org/10.1016/j.envsoft.2018.03.030
22. Pak, U., Ma, J., Ryu, U., Ryom, K., Juhyok, U., Pak, K., Pak, C.: Deep
learning-based PM2.5 prediction considering the spatiotemporal correlations: A case study
of Beijing, China. Science of The Total Environment 699, 133561 (2020),
https://doi.org/10.1016/J.SCITOTENV.2019.07.367
23. Ribeiro, M.C., Pereira, M.J.: Modelling local uncertainty in relations between birth
weight and air quality within an urban area: combining geographically weighted
regression with geostatistical simulation. Environmental Science and Pollution
Research 2554, 25942 (2018), https://doi.org/10.1007/s11356-018-2614-x
24. Ribeiro, M.C., Pinho, P., Llop, E., Branquinho, C., Pereira, M.J.: Geostatistical
uncertainty of assessing air quality using high-spatial-resolution lichen data: A
health study in the urban area of Sines, Portugal. Science of the Total Environment
562, 740–750 (2016), https://doi.org/10.1016/j.scitotenv.2016.04.081
25. Rybarczyk, Y., Zalakeviciute, R.: Machine learning approaches for outdoor air
quality modelling: A systematic review. Applied Sciences (Switzerland) 8 (2018),
https://doi.org/10.3390/app8122570
26. Schratz, P., Becker, M., Lang, M., Brenning, A.: mlr3spatiotempcv:
Spatiotemporal resampling methods for machine learning in R. arXiv:2110.12674 (2021)
27. Shams, S.R., Jahani, A., Kalantary, S., Moeinaddini, M., Khorasani, N.: The
evaluation on artificial neural networks (ANN) and multiple linear regressions (MLR)
models for predicting SO2 concentration. Urban Climate 37, 100837 (2021),
https://doi.org/10.1016/J.UCLIM.2021.100837
28. Shao, Y., Ma, Z., Wang, J., Bi, J.: Estimating daily ground-level PM2.5 in China
with random-forest-based spatiotemporal kriging. Science of the Total
Environment 740 (2020), https://doi.org/10.1016/j.scitotenv.2020.139761
29. Silibello, C., Carlino, G., Stafoggia, M., Gariazzo, C., Finardi, S., Pepe, N.,
Radice, P., Forastiere, F., Viegi, G.: Spatial-temporal prediction of ambient
nitrogen dioxide and ozone levels over Italy using a random forest model for
population exposure assessment. Air Quality, Atmosphere & Health 14, 817–829 (2021),
https://doi.org/10.1007/s11869-021-00981-4
30. Son, Y., Osornio-Vargas, A., Neill, M., Hystad, P., Texcalac-Sangrador, J.,
Ohman-Strickland, P., Meng, Q., Schwander, S.: Land use regression models
to assess air pollution exposure in Mexico City using finer spatial and
temporal input parameters. Science of the Total Environment 639, 40–48 (2018),
https://doi.org/10.1016/j.scitotenv.2018.05.144
31. Song, J., Stettler, M.E.J.: A novel multi-pollutant space-time learning network for
air pollution inference. Science of The Total Environment 811, 152254 (2022),
https://doi.org/10.1016/J.SCITOTENV.2021.152254
32. Sorek-Hamer, M., Chatfield, R., Liu, Y.: Review: Strategies for using
satellite-based products in modeling PM2.5 and short-term pollution episodes. Environment
International 144, 106057 (2020), https://doi.org/10.1016/j.envint.2020.106057
33. Xu, H., Bechle, M.J., Wang, M., Szpiro, A.A., Vedal, S., Bai, Y., Marshall, J.D.:
National PM2.5 and NO2 exposure models for China based on land use regression,
satellite measurements, and universal kriging. Science of the Total Environment
655, 423–433 (2019), https://doi.org/10.1016/j.scitotenv.2018.11.125
34. Young, L.J., Gotway, C.A., Yang, J., Kearney, G., DuClos, C.: Linking
health and environmental data in geographical analysis: It's so much more
than centroids. Spatial and Spatio-temporal Epidemiology 1, 73–84 (2009),
https://doi.org/10.1016/j.sste.2009.07.008
35. Yáñez, M.A., Baettig, R., Cornejo, J., Zamudio, F., Guajardo, J., Fica, R.: Urban
airborne matter in central and southern Chile: Effects of meteorological conditions
on fine and coarse particulate matter. Atmospheric Environment 161, 221–234
(2017), https://doi.org/10.1016/j.atmosenv.2017.05.007
36. Zhai, L., Li, S., Zou, B., Sang, H., Fang, X., Xu, S.: An improved
geographically weighted regression model for PM2.5 concentration
estimation in large areas. Atmospheric Environment 181, 145–154 (2018),
https://doi.org/10.1016/j.atmosenv.2018.03.017
37. Zhan, Y., Luo, Y., Deng, X., Zhang, K., Zhang, M., Grieneisen, M.L.,
Di, B.: Satellite-based estimates of daily NO2 exposure in China using hybrid
random forest and spatiotemporal kriging model. Environmental Science and
Technology (2018), https://doi.org/10.1021/acs.est.7b05669
38. Zou, B., Fang, X., Feng, H., Zhou, X.: Simplicity versus accuracy for
estimation of the PM2.5 concentration: a comparison between LUR and GWR
methods across time scales. Journal of Spatial Science pp. 1–19 (2019),
https://doi.org/10.1080/14498596.2019.1624203