Geostatistical Learning
1 Overview
Cancers, respiratory and cardiovascular diseases have been the most common
reasons for premature death attributable to air pollution around the globe. Still,
apart from a few exceptions, there is no definite knowledge about the impacts
of air pollution on populations. Therefore, the development of accurate models
predicting air pollution is critical to better understand the health impacts of
exposure to air pollution.
Air pollution models require data collection, which often comes from monitoring
by local or national institutions [6], using a set of ground-based stations
geographically dispersed over locations selected for regulatory purposes.
Although these stations provide air pollutant information with high accuracy and
high temporal resolution, they are sparse in space and have limited areal coverage.
In such cases, monitoring data can be combined with statistical models
predicting pollution gradients between ground-based monitoring stations and in
areas where no stations exist at all [32]. To this end, remotely sensed data [14],
chemical transport models (CTM) [15] or Copernicus Atmospheric Monitoring Service
(CAMS) products [26] have been used. Similarly, meteorological data [35] and
other auxiliary information such as land-use data or distance to roads are useful
for modelling, predicting and describing smaller-scale variations [5, 30] and are a
solution to obtain high-resolution air pollution maps [29, 33].
2 Manuel Ribeiro
2 Models
{Y (s) : s ∈ S} (1)
To model spatially correlated data, variogram functions are used, and Kriging
(univariate case) or Cokriging (multivariate case) estimators are used for optimal
linear prediction. There are different types of Kriging techniques, and the choice
depends on the data available and the aims of the analysis. For a comprehensive
review of Kriging methods, please refer to [8].
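As a minimal sketch of optimal linear prediction, the snippet below solves the ordinary Kriging system in its variogram form for a single target location. The exponential variogram, its parameter values, the station coordinates and the concentrations are all hypothetical and chosen only for illustration; in practice the variogram model would be fitted to the data.

```python
import math


def exp_variogram(h, sill=25.0, practical_range=15.0):
    """Exponential semivariogram without nugget (hypothetical parameters)."""
    return sill * (1.0 - math.exp(-3.0 * h / practical_range))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ordinary_kriging(coords, values, target, gamma=exp_variogram):
    """Return (prediction, kriging variance, weights) at one target location."""
    n = len(coords)

    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    # Kriging system: variogram matrix bordered by the unbiasedness constraint.
    a = [[gamma(dist(coords[i], coords[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [gamma(dist(c, target)) for c in coords] + [1.0]

    sol = solve(a, b)
    weights, mu = sol[:n], sol[n]  # mu is the Lagrange multiplier
    prediction = sum(w * v for w, v in zip(weights, values))
    variance = sum(w * g for w, g in zip(weights, b[:n])) + mu
    return prediction, variance, weights


# Hypothetical stations (x, y in km) and concentrations (ug/m3).
stations = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0), (5.0, 4.0)]
no2 = [12.0, 18.0, 15.0, 22.0, 16.0]

prediction, variance, weights = ordinary_kriging(stations, no2, (4.0, 6.0))
print(f"prediction={prediction:.2f}, kriging variance={variance:.2f}")
```

The weights sum to one because of the unbiasedness constraint, and with no nugget the predictor reproduces the observed value at a station location.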
When the objective of the analysis is to assess spatial uncertainty, Kriging
techniques are not adequate. Specifically, Kriging techniques aim at minimizing
the prediction error, and this involves smoothing data variability. To assess
spatial uncertainty, geostatistical algorithms using stochastic simulations should
be preferred, as they reproduce the fluctuations observed in the sample data,
instead of producing an optimal prediction. These algorithms generate simulated
maps with statistical properties similar to those of the observed data (e.g.,
histogram, spatial covariance), and have been applied to quantify air quality
uncertainty to assess health impacts of exposure [23, 24, 34].
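The contrast between a smoothed Kriging prediction and simulated realizations can be sketched with a one-dimensional sequential Gaussian simulation, which is one of several stochastic simulation algorithms and is used here only as an example. The sketch assumes standardized Gaussian values with a known mean of 0 and an exponential covariance; the covariance parameters, the transect and the conditioning data are hypothetical.

```python
import math
import random


def cov(h, sill=1.0, practical_range=10.0):
    """Exponential covariance model (hypothetical parameters)."""
    return sill * math.exp(-3.0 * abs(h) / practical_range)


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def simple_kriging(xs, ys, x0, mean):
    """Simple Kriging mean and variance at x0 given values ys at locations xs."""
    a = [[cov(xi - xj) for xj in xs] for xi in xs]
    c0 = [cov(xi - x0) for xi in xs]
    lam = solve(a, c0)
    pred = mean + sum(l * (y - mean) for l, y in zip(lam, ys))
    var = cov(0.0) - sum(l * c for l, c in zip(lam, c0))
    return pred, max(var, 0.0)  # guard against tiny negative round-off


def sequential_gaussian_simulation(data, nodes, mean, seed):
    """Visit unsampled nodes in random order and draw each value from its
    conditional distribution given the data and previously simulated nodes."""
    rnd = random.Random(seed)
    known = dict(data)
    path = [x for x in nodes if x not in known]
    rnd.shuffle(path)
    for x0 in path:
        xs = list(known)
        ys = [known[x] for x in xs]
        pred, var = simple_kriging(xs, ys, x0, mean)
        known[x0] = rnd.gauss(pred, math.sqrt(var))
    return [known[x] for x in nodes]


# Hypothetical conditioning data on a 30-node transect (standardized values).
data = {5.0: 0.8, 12.0: -0.4, 20.0: 1.1, 27.0: -0.9}
nodes = [float(i) for i in range(30)]

kriged = [simple_kriging(list(data), list(data.values()), x, 0.0)[0]
          for x in nodes]
realizations = [sequential_gaussian_simulation(data, nodes, 0.0, seed=s)
                for s in range(3)]
```

Every realization honours the data at the conditioning nodes. Between them, the kriged profile is smooth while the realizations fluctuate, and the spread of many realizations at a node is what quantifies local uncertainty.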
In the past decade, many different learning algorithms have been applied to
model air pollution for epidemiology studies [2]. Currently, these models use
modelling frameworks with high-dimensional input spaces, including features
such as air pollution data (e.g., remotely sensed data, CTM, ground-based stations),
meteorological information (e.g., temperature, wind, pressure, humidity), and other
relevant biophysical and socio-economic information (e.g., land-use/land cover,
seasonality, elevation, distance to roads, population density).
Mostly focused on supervised learning regression algorithms, the learning
processes use the input features, x, for training and choose a function f(x) with
parameters w that best fits the output values y. To decide which function
f(x, w) provides the best approximation to y, a measure of loss is computed. A
popular choice for regression problems is to tune w from data with the following
loss function, L:
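The form of L depends on the application. As a minimal sketch, the example below assumes the mean squared error, L(w) = (1/n) Σ_i (y_i − f(x_i, w))², which is a common choice for regression but is an assumption here, as are the linear form of f and the toy data.

```python
def predict(x, w):
    """Linear model f(x, w) = w[0] + w[1] * x (an assumed, illustrative form)."""
    return w[0] + w[1] * x


def mse_loss(xs, ys, w):
    """Mean squared error between observations ys and predictions f(xs, w)."""
    return sum((y - predict(x, w)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_least_squares(xs, ys):
    """Closed-form minimizer of the MSE for the one-feature linear model."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return (mean_y - slope * mean_x, slope)


# Hypothetical feature (e.g., a satellite-derived value) and station measurements.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

w_hat = fit_least_squares(xs, ys)
print(w_hat, mse_loss(xs, ys, w_hat))
```

Tuning w then amounts to finding the parameters with the smallest loss; any other w, such as (0.0, 2.0) here, gives a loss at least as large as that of `w_hat`.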
Random forests are one of the most used algorithms to predict air pollution
[25], as they provide an excellent trade-off between interpretability and
performance when compared with other learning methods (e.g., neural networks,
support vector machines). Simpler learning methods such as linear regression
approaches have a long tradition in science and are also widely used [18]. Following
the principle of Occam's razor, when no major interactions exist and associations
are approximately linear, these models perform well and should be preferred
over more complex algorithms. Neural networks are more complex than the
abovementioned algorithms and are more efficient at solving non-linear problems.
Several algorithmic variants are well established (e.g., artificial, convolutional
neural networks) in air pollution modelling [27]. In complex non-linear and
high-dimensional feature spaces, these algorithms can achieve high model performance
[4].
The trade-off between model performance and the ability to generalize is a
relevant issue to be taken into consideration, as the models should avoid
overfitting the data (which leads to low performance on new data). Moreover, the
better performance of some complicated models may come at the cost of
interpretability. For example, in neural networks or in support vector machine
algorithms, the transfer functions and kernels used to fit the data are rather
artificial models and hard to interpret. A clearer understanding of the internal
mechanisms leading to a given output will help improve their use in environmental
applications.
Uncertainty The spatial covariance model of the residual part, r(s), can be
inferred from data and incorporated in a geostatistical simulation algorithm to
Air pollution models with geostatistics & machine learning 5
Further extensions The ability to model and predict air pollution accurately
with spatial data can be further refined by extending research on the tuning of
machine learning parameters. It is well known that air pollution
exhibits spatial trends and spatial autocorrelation [22]. Relying on learning
approaches based on random samples of spatial data to tune model parameters
fails to assess a model's performance in terms of spatial mapping and only
validates its ability to reproduce sampling data [20]. Therefore, instead of leaving
spatial interactions to be learned from data, spatial parameters or functions (e.g.,
geostatistical semi-variogram) could be used as tuning parameters within the
algorithmic model to control the optimal space and time ranges to sample [13]. The
idea can easily be extended to models aiming at predicting in the multipollutant
case [31], by extending the spatial tuning parameters to cross-variograms.
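One way to sketch this idea is spatially blocked cross-validation, in which whole blocks of nearby stations are held out together so that test points are not immediate neighbours of training points. The block size plays the role of the spatial tuning parameter; tying it to a fitted variogram range is an assumption made here for illustration, and the function names, the range value and the coordinates are hypothetical. This is not the specific scheme proposed in [13].

```python
import random


def block_id(x, y, block_size):
    """Index of the square block that contains the point (x, y)."""
    return (int(x // block_size), int(y // block_size))


def spatial_block_folds(coords, block_size, n_folds, seed=0):
    """Split point indices into folds made of whole spatial blocks."""
    blocks = {}
    for idx, (x, y) in enumerate(coords):
        blocks.setdefault(block_id(x, y, block_size), []).append(idx)

    # Shuffle the occupied blocks, then deal them out to the folds in turn.
    ids = sorted(blocks)
    random.Random(seed).shuffle(ids)

    folds = []
    for k in range(n_folds):
        test = sorted(i for b in ids[k::n_folds] for i in blocks[b])
        test_set = set(test)
        train = [i for i in range(len(coords)) if i not in test_set]
        folds.append((train, test))
    return folds


# Hypothetical station coordinates (km) in a 100 km x 100 km domain.
rng = random.Random(42)
coords = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(60)]

# Assumed variogram range (km), used here as the block size.
variogram_range = 25.0
folds = spatial_block_folds(coords, block_size=variogram_range, n_folds=4)

for k, (train, test) in enumerate(folds):
    print(f"fold {k}: {len(train)} training points, {len(test)} test points")
```

Points on either side of a block boundary can still be close to each other, so a buffer zone around the held-out blocks could be added when stricter spatial separation is needed.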
The influential work of Kanevski [16] offers a solid perspective on the foundations
of machine learning developments for the analysis and modelling of spatial
environmental data, which can be easily transferred to air pollution modelling. In
future work, researchers could extend the model framework, considering refinements
in the assessment of spatial uncertainty, and optimization of parameter tuning
by incorporating spatial properties of data.
References
1. Araujo, L.N., Belotti, J.T., Alves, T.A., Tadano, Y.d.S., Siqueira, H.: Ensemble
method based on artificial neural networks to estimate air pollution
health risks. Environmental Modelling and Software 123, 104567 (2020),
https://doi.org/10.1016/J.ENVSOFT.2019.104567
2. Bellinger, C., Jabbar, M., M. S., Z., O., O.V.: A systematic review of data mining
and machine learning for air pollution epidemiology. BMC Public Health (2017),
https://doi.org/10.1186/s12889-017-4914-3
3. Beloconi, A., Vounatsou, P.: Substantial reduction in particulate matter
air pollution across Europe during 2006–2019: A spatiotemporal modeling
analysis. Environmental Science and Technology 55, 15505–15518 (2021),
https://doi.org/10.1021/acs.est.1c03748
4. Cabaneros, S.M., Calautit, J.K., Hughes, B.R.: A review of artificial neural network
models for ambient air pollution prediction. Environmental Modelling and Software
(2019), https://doi.org/10.1016/j.envsoft.2019.06.014
5. Cai, J., Ge, Y., Li, H., Yang, C., Liu, C., Meng, X., Wang, W., Niu,
C., Kan, L., Schikowski, T., Yan, B., Chillrud, S.N., Kan, H., Jin, L.:
Application of land use regression to assess exposure and identify potential
sources in PM2.5, BC, NO2 concentrations. Atmospheric Environment 223 (2020),
https://doi.org/10.1016/j.atmosenv.2020.117267
6. WHO Regional Office for Europe: Health risk assessment of air pollution: general
principles. Copenhagen (2016)
7. Gao, Y., Wang, Z., Li, C.y., Zheng, T., Peng, Z.R.: Assessing neighborhood
variations in ozone and PM2.5 concentrations using decision tree method. Building and
Environment 188 (2021), https://doi.org/10.1016/j.buildenv.2020.107479
8. Goovaerts, P.: Geostatistics for Natural Resources Evaluation. Oxford University
Press, New York (1997)
9. Gryparis, A., Paciorek, C.J., Zeka, A., Schwartz, J., Coull, B.A.: Measurement
error caused by spatial misalignment in environmental epidemiology. Biostatistics
10, 258–274 (2009), https://doi.org/10.1093/biostatistics/kxn033
10. Guan, Q., Kyriakidis, P.C., Goodchild, M.F.: A parallel computing approach to fast
geostatistical areal interpolation. International Journal of Geographical
Information Science 25, 1241–1267 (2011), https://doi.org/10.1080/13658816.2011.563744
11. Gómez-Hernández, J.J., Srivastava, R.M.: One step at a time: The origins of
sequential simulation and beyond. Mathematical Geosciences 53, 193–209 (2021),
https://doi.org/10.1007/s11004-021-09926-0
12. Hengl, T., Heuvelink, G.B.M., Stein, A.: A generic framework for spatial
prediction of soil variables based on regression-kriging. Geoderma 120, 75–93 (2004),
https://doi.org/10.1016/j.geoderma.2003.08.018
13. Hoffimann, J., Zortea, M., de Carvalho, B., Zadrozny, B.: Geostatistical learning:
Challenges and opportunities. arXiv:2102.08791 (2021)
14. de Hoogh, K., Saucy, A., Shtein, A., Schwartz, J., West, E.A., Strassmann, A.,
Puhan, M., Röösli, M., Stafoggia, M., Kloog, I.: Predicting fine-scale daily NO2
for 2005–2016 incorporating OMI satellite data across Switzerland. Environmental
Science and Technology (2019), https://doi.org/10.1021/acs.est.9b03107
15. Jerrett, M., Turner, M.C., Beckerman, B.S., Iii, C.A.P., Donkelaar, A.v., Martin,
R.v., Serre, M., Crouse, D.L., Gapstur, S.M.S., Krewski, D., Diver, W.R., Coogan,
P.F.P., Thurston, G.D., Burnett, R.T.: Comparing the health effects of ambient
particulate matter estimated using ground-based versus remote sensing exposure
estimates. Environmental Health Perspectives 125, 552–559 (2017)
16. Kanevski, M.: Machine Learning for Spatial Environmental Data: Theory,
Applications, and Software. EPFL Press (2009)
17. Kerckhoffs, J., Hoek, G., Portengen, L., Brunekreef, B., Vermeulen, R.C.H.:
Performance of prediction algorithms for modeling outdoor air pollution
spatial surfaces. Environmental Science and Technology 53, 1413–1421 (2019),
https://doi.org/10.1021/acs.est.8b06038
18. Liu, X., Lu, D., Zhang, A., Liu, Q., Jiang, G.: Data-driven machine learning in
environmental pollution: Gains and problems. Environmental Science and Technology
(2022), https://doi.org/10.1021/acs.est.1c06157
19. Ma, R., Ban, J., Wang, Q., Zhang, Y., Yang, Y., He, M.Z., Li, S., Shi, W., Li,
T.: Random forest model based fine scale spatiotemporal O3 trends in the
Beijing-Tianjin-Hebei region in China, 2010 to 2017. Environmental Pollution 276 (2021),
https://doi.org/10.1016/j.envpol.2021.116635
20. Meyer, H., Reudenbach, C., Wöllauer, S., Nauss, T.: Importance of
spatial predictor variable selection in machine learning applications moving
from data reproduction to spatial prediction. Ecological Modelling 411 (2019),
https://doi.org/10.1016/j.ecolmodel.2019.108815
21. Morley, D.W., Gulliver, J.: A land use regression variable generation, modelling
and prediction tool for air pollution exposure assessment. Environmental Modelling
and Software 105, 17–23 (2018), https://doi.org/10.1016/j.envsoft.2018.03.030
22. Pak, U., Ma, J., Ryu, U., Ryom, K., Juhyok, U., Pak, K., Pak, C.: Deep
learning-based PM2.5 prediction considering the spatiotemporal correlations: A case study
of Beijing, China. Science of The Total Environment 699, 133561 (2020),
https://doi.org/10.1016/J.SCITOTENV.2019.07.367
23. Ribeiro, M.C., Pereira, M.J.: Modelling local uncertainty in relations between birth
weight and air quality within an urban area: combining geographically weighted
regression with geostatistical simulation. Environmental Science and Pollution
Research 2554, 25942 (2018), https://doi.org/10.1007/s11356-018-2614-x
24. Ribeiro, M.C., Pinho, P., Llop, E., Branquinho, C., Pereira, M.J.: Geostatistical
uncertainty of assessing air quality using high-spatial-resolution lichen data: A
health study in the urban area of Sines, Portugal. Science of the Total Environment
562, 740–750 (2016), https://doi.org/10.1016/j.scitotenv.2016.04.081
25. Rybarczyk, Y., Zalakeviciute, R.: Machine learning approaches for outdoor air
quality modelling: A systematic review. Applied Sciences (Switzerland) 8 (2018),
https://doi.org/10.3390/app8122570
26. Schratz, P., Becker, M., Lang, M., Brenning, A.: mlr3spatiotempcv:
Spatiotemporal resampling methods for machine learning in R. arXiv:2110.12674 (2021)
27. Shams, S.R., Jahani, A., Kalantary, S., Moeinaddini, M., Khorasani, N.: The
evaluation on artificial neural networks (ANN) and multiple linear regressions (MLR)
models for predicting SO2 concentration. Urban Climate 37, 100837 (2021),
https://doi.org/10.1016/J.UCLIM.2021.100837
28. Shao, Y., Ma, Z., Wang, J., Bi, J.: Estimating daily ground-level PM2.5 in China
with random-forest-based spatiotemporal kriging. Science of the Total
Environment 740 (2020), https://doi.org/10.1016/j.scitotenv.2020.139761
29. Silibello, C., Carlino, G., Stafoggia, M., Gariazzo, C., Finardi, S., Pepe, N.,
Radice, P., Forastiere, F., Viegi, G.: Spatial-temporal prediction of ambient
nitrogen dioxide and ozone levels over Italy using a random forest model for
population exposure assessment. Air Quality, Atmosphere & Health 14, 817–829 (2021),
https://doi.org/10.1007/s11869-021-00981-4
30. Son, Y., Osornio-Vargas, A., Neill, M., Hystad, P., Texcalac-Sangrador, J.,
Ohman-Strickland, P., Meng, Q., Schwander, S.: Land use regression models
to assess air pollution exposure in Mexico City using finer spatial and
temporal input parameters. Science of the Total Environment 639, 40–48 (2018),
https://doi.org/10.1016/j.scitotenv.2018.05.144
31. Song, J., Stettler, M.E.J.: A novel multi-pollutant space-time learning network for
air pollution inference. Science of The Total Environment 811, 152254 (2022),
https://doi.org/10.1016/J.SCITOTENV.2021.152254
32. Sorek-Hamer, M., Chatfield, R., Liu, Y.: Review: Strategies for using
satellite-based products in modeling PM2.5 and short-term pollution episodes. Environment
International 144, 106057 (2020), https://doi.org/10.1016/j.envint.2020.106057
33. Xu, H., Bechle, M.J., Wang, M., Szpiro, A.A., Vedal, S., Bai, Y., Marshall, J.D.:
National PM2.5 and NO2 exposure models for China based on land use regression,
satellite measurements, and universal kriging. Science of the Total Environment
655, 423–433 (2019), https://doi.org/10.1016/j.scitotenv.2018.11.125
34. Young, L.J., Gotway, C.A., Yang, J., Kearney, G., DuClos, C.: Linking
health and environmental data in geographical analysis: It's so much more
than centroids. Spatial and Spatio-temporal Epidemiology 1, 73–84 (2009),
https://doi.org/10.1016/j.sste.2009.07.008
35. Yáñez, M.A., Baettig, R., Cornejo, J., Zamudio, F., Guajardo, J., Fica, R.: Urban
airborne matter in central and southern Chile: Effects of meteorological conditions
on fine and coarse particulate matter. Atmospheric Environment 161, 221–234
(2017), https://doi.org/10.1016/j.atmosenv.2017.05.007
36. Zhai, L., Li, S., Zou, B., Sang, H., Fang, X., Xu, S.: An improved
geographically weighted regression model for PM2.5 concentration
estimation in large areas. Atmospheric Environment 181, 145–154 (2018),
https://doi.org/10.1016/j.atmosenv.2018.03.017
37. Zhan, Y., Luo, Y., Deng, X., Zhang, K., Zhang, M., Grieneisen, M.L.,
Di, B.: Satellite-based estimates of daily NO2 exposure in China using hybrid
random forest and spatiotemporal kriging model. Environmental Science and
Technology (2018), https://doi.org/10.1021/acs.est.7b05669
38. Zou, B., Fang, X., Feng, H., Zhou, X.: Simplicity versus accuracy for
estimation of the PM2.5 concentration: a comparison between LUR and GWR
methods across time scales. Journal of Spatial Science pp. 1–19 (2019),
https://doi.org/10.1080/14498596.2019.1624203