Science of the Total Environment 916 (2024) 170209


A machine learning-based ensemble model for estimating diurnal variations of nitrogen oxide concentrations in Taiwan

Aji Kusumaning Asri a, Hsiao-Yun Lee b, Yu-Ling Chen a, Pei-Yi Wong c, Chin-Yu Hsu d, e,
Pau-Chung Chen f, g, h, i, Shih-Chun Candice Lung j, k, l, Yu-Cheng Chen f, m, Chih-Da Wu a, f, n, *, **, ***
a Department of Geomatics, College of Engineering, National Cheng Kung University, Tainan, Taiwan
b Department of Leisure Industry and Health Promotion, National Taipei University of Nursing and Health Sciences, Taipei, Taiwan
c Department of Environmental and Occupational Health, National Cheng Kung University, Tainan, Taiwan
d Department of Safety, Health and Environmental Engineering, Ming Chi University of Technology, Taiwan
e Center for Environmental Sustainability and Human Health, Ming Chi University of Technology, Taiwan
f National Institute of Environmental Health Sciences, National Health Research Institutes, Miaoli, Taiwan
g Institute of Environmental and Occupational Health Sciences, National Taiwan University College of Public Health, Taipei, Taiwan
h Department of Environmental and Occupational Medicine, National Taiwan University Hospital, Taipei, Taiwan
i Department of Public Health, National Taiwan University College of Public Health, Taipei, Taiwan
j Research Center for Environmental Changes, Academia Sinica, Taipei, Taiwan
k Department of Atmospheric Sciences, National Taiwan University, Taipei, Taiwan
l Institute of Environmental Health, School of Public Health, National Taiwan University, Taipei, Taiwan
m Department of Occupational Safety and Health, China Medical University, Taichung, Taiwan
n Innovation and Development Center of Sustainable Agriculture, National Chung Hsing University, Taichung City 402, Taiwan

HIGHLIGHTS

• Diurnal variations of NOx were assessed using a machine learning-based ensemble model.
• Longitudinal data over 27 years were incorporated to develop an ensemble model.
• An ensemble model was selected to estimate NOx concentrations in Taiwan.
• The ensemble model exhibited good ability to predict diurnal NOx variations (R² > 0.90).
• Cross validations and internal-external verifications were used for model evaluation.

* Corresponding author at: Department of Geomatics, College of Engineering, National Cheng Kung University, Tainan 70101, Taiwan.
** Corresponding author at: National Institute of Environmental Health Sciences, National Health Research Institutes, Miaoli 35053, Taiwan.
*** Corresponding author at: Innovation and Development Center of Sustainable Agriculture, National Chung Hsing University, Taichung City 402, Taiwan.
E-mail addresses: [email protected] (A.K. Asri), [email protected] (H.-Y. Lee), [email protected] (Y.-L. Chen), [email protected]
(P.-Y. Wong), [email protected] (C.-Y. Hsu), [email protected] (P.-C. Chen), [email protected] (S.-C.C. Lung), [email protected]
(Y.-C. Chen), [email protected] (C.-D. Wu).

https://doi.org/10.1016/j.scitotenv.2024.170209
Received 8 September 2023; Received in revised form 2 January 2024; Accepted 14 January 2024
Available online 24 January 2024
0048-9697/© 2024 Elsevier B.V. All rights reserved.
A.K. Asri et al. Science of the Total Environment 916 (2024) 170209

ARTICLE INFO

Editor: Pavlos Kassomenos

Keywords: Diurnal concentrations; Ensemble learning; Hybrid kriging-LUR; Machine-learning algorithms; NOx

ABSTRACT

Air pollution is inextricable from human activity patterns. This is especially true for nitrogen oxide (NOx), a pollutant that exists naturally and also as a result of anthropogenic factors. Assessing exposure by considering diurnal variation is a challenge that has not been widely studied. Incorporating 27 years of data, we attempted to estimate diurnal variations in NOx across Taiwan. We developed a machine learning-based ensemble model that integrated hybrid kriging-LUR, machine-learning, and an ensemble learning approach. Hybrid kriging-LUR was performed to select the most influential predictors, and machine-learning algorithms were applied to improve model performance. The three best machine-learning algorithms were selected and reassessed to develop ensemble learning that was designed to improve model performance. Our ensemble model resulted in estimates of daytime, nighttime, and daily NOx with high explanatory powers (Adj-R²) of 0.93, 0.98, and 0.94, respectively. These explanatory powers increased from the initial model that used only hybrid kriging-LUR. Additionally, the results depicted the temporal variation of NOx, with concentrations higher during the daytime than the nighttime. Regarding spatial variation, the highest NOx concentrations were identified in northern and western Taiwan. Model evaluations confirmed the reliability of the models. This study could serve as a reference for regional planning supporting emission control for environmental and human health.

1. Introduction

Nitrogen oxides (NOx) come from natural sources such as volcanoes, oceans, biological decay, and lightning strikes (United States Environmental Protection Agency, 2023; Zhang et al., 2003), as well as from anthropogenic combustion processes. NOx is often connected to several health issues, including chronic neurological disorders, vascular adventitia inflammation, and deaths caused by respiratory diseases (Barua et al., 2019; César et al., 2015; Meijles and Pagano, 2016). The European Union under the air quality directive 2008/50/EC established a critical level of NOx concentrations for protecting vegetation. Specifically, the annual average NOx value should not exceed 30 μg/m³ (UNION, 2008). According to a 2015 EPA statistical report, Taiwan emitted >430,000 metric tons of NOx, which reflects the seriousness of the situation. The Taiwan Environmental Protection Administration (Taiwan EPA) has reported that Taiwan’s annual NO2 concentration, one component of NOx, exceeds 40 μg/m³. Because NOx is one of the most harmful air pollutants affecting both environmental and human health, the Taiwan EPA aims to reduce NOx emissions by 4000 metric tons per year, explicitly targeting emission reductions from industrial factors (Taiwan EPA, 2019).

Air pollution assessments recently have been performed with time considerations ranging from daily to monthly levels (He et al., 2022; Lyu et al., 2022; Wong et al., 2021a; Zhang et al., 2021). For example, a study of Beijing-Tianjin-Hebei used monthly data from 2014 to 2021 to estimate the spatiotemporal variation in air pollution and ozone concentrations (Lyu et al., 2022). A study by Wong and colleagues used a dataset comprised of 11 years of daily data to estimate fine particulate matter exposure in Taiwan (Wong et al., 2021a), and a study developed in China modeled six years of data from 2015 to 2020 to estimate spatiotemporal particulate matter (He et al., 2022). Even though the estimations in previous studies that used long-term data did describe temporal variations of pollutant exposure, studies to this point have not focused on estimating diurnal levels, where differences in daytime and nighttime concentrations may arise. NOx is strongly related to human activity patterns, and human activity patterns differ during the day and night (Korhonen et al., 2021; Li et al., 2017). In addition, emissions from industry and manufacturing that operate throughout the day and/or night fluctuate (Chen et al., 2023), so calculating estimates of diurnal variations may be important to consider. Diurnal variation estimates in NOx exposure were supported by Wang and Song’s (2018) study. They reported a difference in daytime and nighttime concentrations of polycyclic aromatic hydrocarbons, organic compounds that react with several pollutants including NOx (Wang and Song, 2018). Estimating the diurnal variation of pollutant exposure is essential because people of all lifestyles risk exposure, including those who are active at nighttime (Wu and Song, 2019). As such, diurnal estimates of NOx are needed for the creation of prudent policies and regulations.

Data collected from air monitoring stations has routinely been used to estimate air pollution. However, the uneven geographic distribution of air monitoring stations can lead to unreliable estimates at locations where the concentrations are not observed (Hofstra et al., 2009). To address the shortcomings of monitoring station placement, kriging and inverse distance weighting (IDW) have been applied (Chen and Lin, 2022; Rivera-González et al., 2015). Kriging is a conventional regression-based geostatistical interpolation method that is often used to predict the concentration of pollution in areas not covered by monitoring stations (Lai et al., 2020; Meik and Lawing, 2017). However, kriging lacks consideration of distinct spatial characteristics, which reduces the accuracy of the exposure assessment, and this is problematic when analyzing large areas. Regarding IDW, this interpolation method does not have an error assessment, it is sensitive to outliers, and the interpolation results depend on the weighting parameters and the size of the search window (Boke, 2017; Wu et al., 2017). In order to adjust for the spatial characteristics of study areas, land use regression (LUR) has widely been applied, and it allows for the assessment of multiple variables over a long-term period (Beelen et al., 2011; Cai et al., 2020; de Hoogh et al., 2014; Meng et al., 2015). In addition to depicting spatial variations in air pollution, LUR, which employs multiple regression concepts, can also determine the potential variables that may influence air pollution concentrations (Eeftens et al., 2016; Wu et al., 2017). Nevertheless, LUR only refers to the linear relationship between predictors and outcomes, so this model may be less powerful in estimation when the association between predictors and air pollution becomes non-linear.

Recently, machine learning algorithms have been developed to handle complex linear and non-linear interactions between factors (Ren et al., 2020). Machine learning is not only able to deal with big data but also able to construct model estimates with high accuracy (Sarker, 2021; Zhou et al., 2017). Several studies have combined multiple machine learning algorithms to optimize model performance, which is known as the ensemble model (Arowosegbe et al., 2022; Huang et al., 2022a; Lai et al., 2022; Li et al., 2020; Liu and Chen, 2022). This approach is very effective because each algorithm offers its own advantages in handling data. A previous study has reported that the predictive power of each algorithm can be summed in the ensemble model, so there is an additive effect that leads to a more accurate model performance than the single use of an algorithm (Pintelas and Livieris, 2020). For example, a study by Lai and colleagues used the ensemble model to integrate the recurrent neural network, long short-term memory, and gated recurrent unit algorithms to develop an estimation model for CO, O3, and NO2 gases in Taiwan (Lai et al., 2022). Their ensemble model resulted in coefficients of determination (R²) reaching 0.73, 0.51, and 0.37 for CO, O3, and NO2, respectively. Similarly, a study conducted in China was able to estimate NO2 with an R² value of 0.85 by integrating independent machine learning algorithms (i.e., random forest, extremely randomized trees,


extreme gradient boosting) and a generalized additive model (Huang et al., 2022b). Although the proposed ensemble model can yield moderate- to high-level performance, several drawbacks must be noted. Huang et al. (2022b) acknowledged that the ability of their developed model was limited to predicting temporal variation, and that a scaling approach with fine resolution was not feasible in wider areas due to the constraint of computing power. These issues are inseparable from the main shortcoming of machine learning-based models that make it difficult to accurately select the most critical predictors before training each algorithm (Sarker, 2021). Because of this, selecting irrelevant predictors can lead to overfitting problems, which decrease computational efficiency as the model may take a long time in data processing due to its overly complex nature (Morapedi and Obagbuwa, 2023). Furthermore, by hindering the ability of a model to accurately estimate and interpret the complex air pollution patterns, overfitting can also provide a limited interpretation of the spatiotemporal variation of air pollution (Hasnain et al., 2022). Accordingly, it is necessary to apply a predictor selection method before training a machine learning algorithm (Zhong et al., 2021). Knowing that selecting appropriate predictors can be addressed by LUR through a systematic stepwise regression process, we hypothesized that applying this approach before training machine-learning algorithms was a feasible way to overcome the drawbacks. Since the potential predictor selections using conventional stepwise regression with a manual process may take an extensive amount of time, an automation process can then be considered.

In addition to developing a single model, this study proposed an enhancement of a machine learning-based ensemble model to estimate the diurnal variation (i.e., daytime, nighttime, and daily) of NOx concentrations across the main island of Taiwan. This method integrated the advantages of hybrid kriging land-use regression (hybrid kriging-LUR), machine learning algorithms, and an ensemble learning technique. The hybrid kriging-LUR was built in advance to select the most important predictors, and those selected predictors were evaluated using an ensemble learning approach, which was then integrated into machine-learning algorithms. Several validation tests, including an overfitting test, 10-fold-cross-validation, spatial cross-validation, peak validation, and internal-external verification, were performed to evaluate the reliability of the model’s performance. Then, the estimations from the ensemble model were used to generate high-accuracy NOx maps in various circumstances. In contrast to a previous investigation by Wong et al. (2023), this study offered an automated approach that could boost the efficiency of data processing, starting from the selection of influential predictors to the development of estimation models. To our knowledge, previous studies have not built ensemble models with an automated approach in a series of model developments. Policies could be created using the validated results to regulate NOx emission.

2. Materials and methods

2.1. Characteristics of the study area

Taiwan is located at the junction of the South China Sea and the Pacific Ocean, and it has an area of 36,197 km² that is inhabited by >23 million people (National Statistics, Republic of China-Taiwan, 2023). On this populous island, domestic combustion from vehicle emissions and rapid industrial development are the main sources of air pollution (Huang et al., 2022a). The Ministry of the Interior reported in 2016 that the industrial area in Taiwan is 215.40 km² (Ministry of the Interior,

Fig. 1. The air quality monitoring stations in Taiwan.


2023), and the latest data from June 2022 reflects 22.7 million registered vehicles in Taiwan (Taiwan Environmental Protection Administration, 2023a). The main island of Taiwan is dominated by hills and mountains in the east and lowlands in the west (Fig. 1), and the divergence in topographical landscape is conducive to air pollutants being retained within the island with little dispersion. Although the four seasons in Taiwan are not as distinct as those in the temperate zone, different meteorological characteristics of the seasons result in air pollution concentrations varying in Taiwan during the year (Lee et al., 2022; Lee et al., 2020; Yang, 2002).

2.2. Nitrogen oxides database

The hourly nitrogen oxides (NOx) data from 1994 through 2020 used in this study were provided by the Taiwan EPA. Data were collected from 76 national air quality monitoring stations deployed across Taiwan. In general, the monitoring stations are located in major cities and urban areas across the country. The types of monitoring stations include general, traffic, industrial, national park, and background stations, which are classified based on the area characteristics. Although the stations are located in areas with different characteristics, the instrument and monitoring technique, chemiluminescence detection, are the same. Chemiluminescence detection is an air pollution monitoring technique that is based on the production of electromagnetic radiation due to chemical reactions. One of the products of the reactions enters an excited state and radiates light when it returns to its initial state (Fontijn, 1976). Regarding data quality and reliability, the accuracy (deviation) of NOx measurement is ≤15 % and the capture efficiency of NOx concentrations is nearly 99.9 % (Taiwan Environmental Protection Administration, 2023b). To illustrate the diurnal variations of NOx (i.e., daytime and nighttime exposure), NOx concentrations from 06:00 to 17:00 were calculated to represent daytime exposure, while NOx concentrations from 18:00 to 05:00 were calculated to represent nighttime exposure. This study divided the diurnal time of day and night based on the average sunrise and sunset times in Taiwan (Central Weather Bureau, 2022). On average, sunrise is at 5:49 am and sunset is at 5:58 pm (17:58). In Taiwan, sunrise occurs later and sunset occurs earlier in winter than in summer. Nonetheless, sunrise still occurs between 05:00 and 06:00 and sunset occurs between 17:00 and 18:00 in all seasons of Taiwan. Daily exposure was calculated by averaging the hourly values of each day. Nearly one and a half million observations of NOx were assessed as the dependent variable in constructing the predictive models. Adapting methods from a prior study, we applied the leave-one-out ordinary kriging to estimate spatial variations in NOx concentrations, which was then used as a variable in the model development (Chen et al., 2020). Adopting the main concept of leave-one-out cross-validation, this technique can estimate the prediction variance of an ordinary kriging model. It involves leaving out a subset of observations and predicting them using the remaining observations. In the context of kriging, this approach is referred to as “leave-one-out kriging” (Pang et al., 2023). Technically, this method includes the following steps: (1) fit a kriging model to the entire dataset; (2) remove one observation from the dataset and refit the kriging model to the remaining data; (3) use the refitted kriging model to estimate the value of the removed observation; (4) re-run steps 2 and 3 for each observation in the dataset.

2.3. Potential predictors

2.3.1. Land use inventories
Previous studies have indicated that land use is a determinant of air pollution (Fujitani et al., 2021; Huang et al., 2021; Zhao and Wang, 2022). Land use data provided in 2007–2009 and 2015 by the National Land Surveying and Mapping Center (NLSMC), the Ministry of the Interior (MOI) were used in this study. The available data included: (1) built-up areas: industrial, commercial, funeral industry, manufacturing, and residential areas; (2) green spaces: paddy field, upland cropping, fruit orchard, mixed farming, forest, and park; (3) public infrastructures: international airport and railway; and (4) others: water body, port, and gravel fields. Rather than using the greenspace data provided by the Ministry of the Interior-Taiwan, this study used satellite-based vegetation index to estimate greenness exposure. Greenness exposure was represented by the normalized difference vegetation index (NDVI), acquired every 16 days using MOD13Q1 and a Terra sensor with a spatial resolution of 250 m × 250 m. The data were provided by the National Aeronautics and Space Administration (NASA) and were collected from 1994 to 2020. NDVI with 250 m spatial resolution was upscaled in this study to 50 m by applying the “nearest-neighbor resampling” function provided by ArcGIS 10.7.1. This resampling technique is used in image processing that involves discovering the nearest value of the original data (Brandsma and Können, 2006). The advantages of this technique are maintaining the original values while avoiding interpolation between the data, meaning this technique ensures that the original values can be preserved in the resampled data. By maintaining correctness after upscaling, this function replicates the original values as closely as possible in the resampled data.

2.3.2. Geospatial databases
Some geospatial information was considered, including road network, distribution of industries, thermal power plants, Chinese restaurants, and temples. The road network was represented by the density of major and local roads, and the data were obtained from the Ministry of Transportation and Communications in 2001, 2006, 2013, 2017, and 2020. Moreover, industrial park data provided by the Ministry of Economic Affairs were examined because industrial combustion has been reported to be a significant source of NOx (Wang et al., 2021; Zhang and Zhang, 2016). Other major sources such as thermal power plants, Chinese restaurants, and incense burning from temples were included based on the findings of prior studies (Skoulidou et al., 2021; Tian et al., 2013). Google Maps offered the thermal power plant distribution, and Kingwaytek Technology Co., Ltd. provided 2006, 2008, 2010, and 2012 data of Chinese restaurants and temples.

2.3.3. Topographical and meteorological factors
Meteorological factors and topographical conditions have been identified as the cause of weak pollutant dispersion and strong pollutant trapping in Taiwan (Yang, 2002). Accordingly, we assessed elevation and slope using Digital Terrain Model (DTM) data to represent the topographic landscape of the study area. The data, with 20 m × 20 m spatial resolution, were obtained from the Ministry of the Interior and resampled to 50 m × 50 m spatial resolution. This resampling process was performed considering all variables that were used to develop the model, and it was generated at 50-m spatial resolution. In this case, the resampling technique was the same as the process for greenness-NDVI data (Section 2.3.1, Land use inventories). In addition, meteorological factors such as temperature, relative humidity, air pressure, rainfall, wind direction, and wind speed were included. These data were collected from 1994 to 2020 through the bank for atmospheric and hydrologic resources and were provided by the Central Weather Bureau of Taiwan.

In estimating some meteorological data, e.g., temperature, relative humidity, air pressure, and rainfall, the kriging method was applied. Meanwhile, for wind data estimates, the inverse distance weighting (IDW) method was used. To assess wind direction, we initially calculated U and V components by assessing the east/west, north/south, southeast/southwest, and northeast/northwest wind magnitudes. Then, those component estimates were interpolated separately using IDW. Slightly different from wind direction data that requires U and V component calculations, wind speed data at unsampled locations was directly interpolated using IDW. The IDW procedure was chosen to interpolate wind data referring to previous studies (Ali et al., 2012; Zhao et al., 2022). IDW was chosen since this method has advantages for interpolating wind data due to its simplicity, local influence assumption, and


ability to represent complex spatial patterns in the interpolation data. The results of the interpolation process were used to construct the model. Further, knowing that air pollution in terms of NOx concentration fluctuates seasonally (Roberts-Semple et al., 2012; van der A. et al., 2008), a dummy variable was created to assess seasonal impact.

2.3.4. Co-pollutant emissions
Referring to previous studies that identified the linkage between NOx concentrations and other pollutants, the emissions from fine particulate matter (PM2.5), sulfur dioxide (SO2), and ozone (O3) were included as predictors (Guo et al., 2022; Nguyen et al., 2022; Thunis et al., 2021). The data were obtained from the Taiwan EPA and were collected during the study period. The ordinary kriging interpolation method was applied to generate an estimated map of each pollutant, which then were examined in the model development.

2.3.5. Sociological factors
Burning incense and joss paper during festival celebrations is a Chinese tradition that worsens air quality (Khezri et al., 2015; Wang and Yu, 2022). In Zhang’s study there was some evidence to suggest that burning incense and joss paper could contribute to NOx emissions, especially if they are burned in large quantities and in poorly ventilated areas (Zhang et al., 2019). Thus, we considered the Chinese holiday festival that is thought to contribute to NOx emissions in Taiwan. In this study, we used raster grid data for the total census population of townships across Taiwan, provided by the Ministry of the Interior, Department of Household Registration, to calculate the number of households celebrating the festival. Furthermore, temples as the main contributor of burning incense and joss paper were considered as an emitter of air pollution considering their large prevalence, with 12,000 Buddhist and Taoist temples in Taiwan (Yeh, 2023).

2.4. The machine learning-based ensemble model

To estimate NOx concentrations, several algorithms were employed with processes, i.e., hybrid kriging-LUR, machine learning, and an ensemble algorithm. As illustrated in Fig. 2, the steps that were completed include: (1) selection of variables potentially related to NOx concentration considering diurnal variations using a correlation test; (2) development of stepwise regression under hybrid kriging-LUR to select influential predictors; (3) coupling of the selected predictors from the hybrid kriging-LUR with five algorithms to generate the machine learning models; (4) selection of the three best algorithms based on the model evaluations of the five machine learning algorithms to develop a machine learning-based ensemble model; (5) validation tests; and (6) spatiotemporal-based mapping of NOx diurnal variation using predictor parameters selected in the final ensemble model. The following sections are further explanations of each of these steps.

2.4.1. Data preprocessing and development of hybrid kriging-LUR models
In the early stages of data processing, geospatial-based variables such as land cover data around the stations were estimated at buffer ranges of 50, 150, 250, 500, 750, 1000, 1250, 1500, 1750, 2000, 2500, 3000, 4000, and 5000 m and generated at 50 m × 50 m spatial resolution. In this case, circular buffer types were used to create multiple buffers at those specific distances around input features. Focal statistics were then applied to estimate the density of these variables. We applied the Euclidean distance method to measure the distance between two variables in Euclidean space (Liu, 2015; Selby and Kockelman, 2013), such as the distance between land cover data from the air quality monitoring station. Euclidean distance estimates the distance from each cell in the raster to the closest source. Using the geoprocessing tool in ArcGIS, this technique calculated, for each cell, the distance of the central point of emission sources from the air quality monitoring stations (ESRI, 2023). As listed in Table S1, there were nearly 400 generated variables set as predictors to estimate NOx concentrations.

Further, we conducted a correlation analysis using Pearson’s algorithm to select the variables that were most likely to be significantly associated with NOx. This algorithm assumed that the relationship between variables was linear and normally distributed. For this analysis, p-value <0.1 was adopted as the criterion to select predictors for the model development. Meanwhile, a variance inflation factor (VIF) with a threshold of <3 was applied to avoid multicollinearity issues and to filter the predictors. The predictors that met the multicollinearity test criteria were then used in the development of the hybrid kriging-LUR model for daytime, nighttime, and daily NOx concentrations. In this case, we adopted the basic procedure introduced by Wu and his colleagues to model air pollution by using LUR model (Wu et al., 2017).

In this study, hybrid kriging-LUR was applied to determine the most likely predictors contributing to NOx concentrations. Hybrid kriging-LUR is defined as an estimation technique that integrates leave-one-out-kriging interpolation and a land-use regression (LUR) model to assess the spatial-temporal variability of air pollution (i.e., NOx). Leave-one-out kriging is a technique to estimate the predicted variance of a kriging model. It involves leaving out a subset of observations and predicting them using the remaining observations (Chen et al., 2020).
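
The leave-one-out kriging step described in Sections 2.2 and 2.4.1 can be illustrated with a short sketch. This is not the authors' implementation: the exponential semivariogram and its sill, range, and nugget values are assumptions chosen only for the example (in practice they would be fitted to the monitoring data), and the function names, station coordinates, and NOx values are hypothetical.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def variogram(h, sill=1.0, range_=50.0, nugget=0.0):
    """Exponential semivariogram. The parameter values are illustrative assumptions."""
    if h == 0.0:
        return 0.0
    return nugget + sill * (1.0 - math.exp(-h / range_))

def ordinary_kriging(points, values, target):
    """Ordinary kriging estimate at `target` from the given stations."""
    n = len(points)
    # Semivariances between stations, bordered by the unbiasedness constraint
    # that forces the kriging weights to sum to one.
    a = [[variogram(math.dist(points[i], points[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [variogram(math.dist(p, target)) for p in points] + [1.0]
    weights = solve(a, b)[:n]  # the last unknown is the Lagrange multiplier
    return sum(w * v for w, v in zip(weights, values))

def leave_one_out_kriging(points, values):
    """For each station, krige its value from all the other stations."""
    estimates = []
    for i in range(len(points)):
        other_points = points[:i] + points[i + 1:]
        other_values = values[:i] + values[i + 1:]
        estimates.append(ordinary_kriging(other_points, other_values, points[i]))
    return estimates

# Hypothetical stations (x, y in km) and NOx readings (ppb).
stations = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0), (5.0, 5.0)]
nox = [10.0, 20.0, 30.0, 40.0, 26.0]
loo = leave_one_out_kriging(stations, nox)
```

Each entry of `loo` is the estimate for one station computed without that station's own measurement, so the kriged value can serve as a predictor in the regression without simply reproducing the observation it is meant to explain.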

Fig. 2. Study framework.
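
The multicollinearity filter in Section 2.4.1 (a VIF threshold of < 3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the paper does not say how predictors exceeding the threshold were removed, so the iterative "drop the predictor with the highest VIF" rule is an assumption, as are the function names and the toy predictor columns. The VIFs are read from the diagonal of the inverse correlation matrix, which is equivalent to computing 1 / (1 − R²) for each predictor regressed on the others. The Pearson p-value screening step is not covered here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def invert(matrix):
    """Gauss-Jordan matrix inverse with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """Variance inflation factor of each predictor (diagonal of the inverse correlation matrix)."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}

def filter_by_vif(columns, threshold=3.0):
    """Assumed rule: drop the highest-VIF predictor until all VIFs are below the threshold."""
    kept = dict(columns)
    while len(kept) > 1:
        scores = vif(kept)
        worst = max(scores, key=scores.get)
        if scores[worst] < threshold:
            break
        del kept[worst]
    return list(kept)

# Hypothetical predictors: two nearly collinear road-density buffers and one unrelated variable.
predictors = {
    "road_density_500m": [1, 2, 3, 4, 5, 6],
    "road_density_750m": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],
    "ndvi_250m": [5, 1, 4, 2, 6, 3],
}
selected = filter_by_vif(predictors)
```

With buffer-based variables generated at many radii, neighboring buffers of the same land-use type tend to be strongly correlated, which is the situation the toy data imitates.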


LUR refers to a statistical method to model the relationship between air pollution and land use characteristics (Ryan and LeMasters, 2007). This model typically uses multiple regression analysis to develop estimation models and determines the potential variables (Mölter and Lindley, 2021). In the hybrid kriging-LUR development, the results of the leave-one-out kriging are assessed in the LUR model to determine the potential predictors that contribute to NOx. Those selected predictors are then used to develop a machine learning-based ensemble model. Unlike Wu et al. (2017), who employed conventional stepwise regression to manually select influential predictors, this study adopted an automated stepwise regression process under the hybrid kriging-LUR model. This automation approach has enhanced time efficiency for the predictor selection process, particularly for big data processing involving many variables. In this case, Python 3.7 integrated with the Jupyter Notebook platform was used for the automation process in the hybrid kriging-LUR development.

2.4.2. Machine learning algorithms

After the potential predictors across time variations of the hybrid kriging-LUR model were determined, five machine learning algorithms were integrated, including random forest (RFR), gradient boosting (GBR), light gradient boosting machine (LGBMR), categorical boosting (CBR), and extreme gradient boosting (XGBR). First was RFR, a machine learning algorithm used for both classification and decision tree regression (Abedin et al., 2021). RFR’s advantages are that it is robust against overfitting, is flexible to perform regression, can effectively handle large datasets with a large number of variables, and offers a reliable alternative to traditional parametric-semiparametric statistical analysis. Second was GBR, an ensemble method that combines several decision trees and applies gradient descent functions. This algorithm was used to optimize a smoothed estimation of the compatibility index, which enhances accuracy (Chen et al., 2013; Friedman, 2001). Third was LGBMR, a type of gradient boosting-based machine learning algorithm that can effectively reduce memory usage, thereby increasing data processing speed (Wang et al., 2021). LGBMR was used because this algorithm applies a leaf-wise tree growth approach that can prevent overfitting and provides better model accuracy than other gradient-boosting branches (Yang et al., 2018). Fourth was CBR, which has advantages in dealing with gradient bias because it adopts the advantages of general gradient boosting and applies an ordered boosting approach to training data (Prokhorenkova et al., 2017). Unlike GBR, XGBR, LGBMR, and RFR, which require one-hot encoding to handle categorical variables, CBR can handle categorical variables natively. By using this native approach, CBR does not require one-hot encoding and can support target encoding, a powerful solution for encoding categorical variables that avoids generating a large number of features (Hancock and Khoshgoftaar, 2020; Löw et al., 2019). Fifth was XGBR, a branch of boosting algorithm that integrates multiple gradient-boosted decision trees (Chen and Guestrin, 2016). We included this machine learning algorithm because it is flexible, overcomes missing data, increases model performance by applying cross-validation for each iteration, works well even when the sample of a dataset is small, and its data processing is faster than in original gradient boosting (Rusdah and Murfi, 2020).

To reduce manual intervention and to improve the efficiency of data processing, modeling, and hyperparameter optimization, this study integrated five algorithms with an automation framework known as automated machine learning, or AutoML (Elshawi and Sakr, 2020). We applied AutoML libraries and frameworks that are freely available in repositories for public use. By using one of the AutoML frameworks, “auto_ml” developed by Parry (2019), automated machine learning models including Auto-RFR, Auto-GBR, Auto-LGBMR, Auto-CBR, and Auto-XGBR were developed. AutoML was selected because this framework supports analysis of tree-based algorithms (e.g., decision trees, random forests, XGBoost/Gradient Boosted Trees, etc.) and regression-based models. Suitable for both analysis and real-time estimates, this framework allows big data processing, performs robust data scaling to overcome outliers, can select useful features, and can better determine the best model to generate the most accurate estimates. To support the analyses, several statistical software packages were used for model calculations, namely SPSS 17.0 and R 3.6.3. For the automation process in the machine learning analyses, this study used Python 3.7 integrated with the Jupyter Notebook platform.

2.4.3. A machine learning-based ensemble model

A machine learning-based ensemble model procedure was developed in this study to improve the model performance in estimating NOx concentrations. Among the above-mentioned five machine learning algorithms, the three best algorithms were integrated to develop the final ensemble model. The three best algorithms were selected based on the ranking of the models’ performance metrics, which included the highest adjusted coefficient of determination (Adj. R2) and overfitting criteria with residual value of R <0.1. The training models were generated using 80 % of the datasets, and the remaining 20 % of the datasets were used for model testing. The estimations from the selected three best models without an overfitting issue were set as predictors to refit machine learning algorithms, including RFR, GBR, LGBMR, CBR, and XGBR, using the AutoML framework. The performance of the selected-as-best algorithms was again checked to develop the final machine learning-based ensemble model of NOx concentrations.

2.5. Validation methods

To evaluate the performance of the ensemble model, we applied several validation tests, including an overfitting test, 10-fold cross-validation (10-fold CV), spatial cross-validation (spatial CV), internal verification (spatial and temporal verification), external verification, and peak validation. First, for the overfitting test, we used 80 % of the total datasets (1994–2019) for training, and the remaining 20 % were employed to perform testing validation. Second, for the 10-fold CV, the data were divided into 10 subsets that were then assigned, with nine subsets used for developing the training model and the one remaining subset used for the testing model. This step was repeated until all subsets were assessed. Third, regarding the spatial CV, we split the data based on the location of the monitoring station into 10 subsets or 10 groups. In this case, we divided the observation data of all stations into 10 subsets equally. Since this study used the data from 76 stations, each group included observation data from 7 or 8 stations. In the data processing, we trained the model on 9 subsets and validated it on the one remaining subset. We repeated this process 10 times, each time using a different subset as the test. Fourth, we conducted spatial verification by executing the algorithm to test the model for the different monitoring stations (i.e., station-based validation), six air quality zones (AQZ), and spatial patterns. To validate the spatial pattern, we used three randomly selected time points to ensure that the main estimation model could depict the spatial variations in NOx exposure at different time periods. Temporal verification was examined by developing the model according to different time variations, including day-of-year, seasonal, and annual. Fifth, in terms of external verification, this study assessed data from 2020 to develop the models. Sixth, to ensure the accuracy and reliability of the ensemble model and reduce the subjectivity in parameter selection, peak validation was developed by using different datasets. This validation was completed using data with high daytime, nighttime, and daily concentrations of NOx, specifically those >75 %, 80 %, 85 %, 90 %, 95 %, and 99 %. Lastly, the coefficients of determination (R2; Adj. R2), mean square error (MSE), root mean square error (RMSE), and mean absolute error (MAE) were examined to determine the robustness of model estimates. The validation models were confirmed to be robust if the performance metrics did not change significantly from the training model. We used the SHapley Additive exPlanations (SHAP) visualization function to further investigate the influential predictors of NOx from the ensemble model. SHAP applies the Shapley value that is optimal


based on game theory, and it has the advantage of being able to demonstrate the effect size of each predictor more intuitively (Molnar, 2020).

2.6. Mapping of NOx diurnal variations

Spatiotemporal maps of daytime, nighttime, and daily NOx with a grid resolution of 50-m x 50-m were generated using the best selected model, which was the ensemble model. In addition to generating diurnal variation maps, we created a linear trend map, a thematic map of NOx that depicts the long-term annual linear trend. This thematic map was used to visualize the trends and patterns of NOx over a specific time. This map was generated using the significant value of the linear regression estimation of NOx that was assessed from annual anomalies during the


Fig. 3. (a) Spatial distribution of the average NOx concentrations during the full study period (1994 to 2020) at each observation station in the three timeframes, i.e.,
(I) daytime; (II) nighttime; (III) daily, coupled with the basic statistical estimates of NOx at all stations during the study period; (b) The annual average trend of NOx
concentrations during the study period (1994 to 2020) at all stations, depicted for each season and presented for the three timeframes (daytime; nighttime; daily).
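
The station-based split used for the spatial CV in Section 2.5 (76 stations divided into 10 groups of 7 or 8 stations, with each group held out once) can be sketched as follows. This is an illustrative sketch only: the station identifiers, the record layout, the random seed, and the function names are assumptions, and the study does not state how stations were assigned to groups.

```python
import random

def station_folds(station_ids, n_folds=10, seed=0):
    """Shuffle station ids and deal them into n_folds groups of near-equal size."""
    ids = list(station_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]

def spatial_cv_splits(records, folds):
    """Yield (train, test) record lists, holding out whole stations per fold."""
    for held_out in folds:
        held = set(held_out)
        train = [r for r in records if r["station"] not in held]
        test = [r for r in records if r["station"] in held]
        yield train, test

# Hypothetical data: 76 stations with two observations each
stations = [f"S{i:02d}" for i in range(76)]
records = [{"station": s, "nox": 25.0} for s in stations for _ in range(2)]

folds = station_folds(stations)
splits = list(spatial_cv_splits(records, folds))
```

Because every observation from a held-out station is removed from training, the test score reflects how well the model transfers to locations it has not seen, which is the purpose of a spatial CV.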


study period. The value on each grid was calculated by applying the linear equation, y = mx + b, where y denoted the NOx concentration in each grid, m was the slope of the NOx trend, x was the period/year of the data, and b was the y-intercept. To evaluate the model performance, this study also generated several maps of validation analyses in terms of spatial pattern maps. ArcGIS 10.7.1 was then utilized for spatial mapping and analysis.

3. Results

3.1. Statistical summaries of NOx concentrations: Daytime, nighttime, and daily

As illustrated in Fig. 3 (a), the average concentrations of NOx during the study period (1994–2020) in the three timeframes were not significantly different, with means of 25.30 ppb, 25.38 ppb, and 25.77 ppb for daytime, nighttime, and daily, respectively. However, when comparing daytime and nighttime levels, the maximum concentration during the day was higher (138.87 ppb) than at night (104.29 ppb). The observation station data revealed that the highest concentrations of NOx were in Taipei City, New Taipei City, and Kaohsiung City, which were visualized in Fig. 3 (a) with red dots. The overall trend of NOx concentrations from 1994 to 2020 was a decrease for all seasons and across all time settings (Fig. 3 (b)). In this case, NOx was highest in spring (March to May) and lowest in autumn (September to November).

3.2. Overall performance of model estimates

As illustrated in Fig. 4, a machine learning-based ensemble model provided NOx estimations with better model performance than did hybrid kriging-LUR, and it slightly improved the single machine learning models, especially for nighttime and daily data. First, for estimating the NOx concentration during the daytime, the results demonstrate that the ensemble model significantly improved model performance compared to the initial estimation that used hybrid kriging-LUR. This is indicated by an increase in the determination coefficient (Adj. R2) from 0.57 to 0.93 and a decrease in the error (RMSE) from 17.95 ppb to 7.20 ppb. Accordingly, 11 potential predictors from the hybrid kriging-LUR model were selected to develop machine learning algorithms. Among the five algorithms, Auto-GBR, Auto-XGBR, and Auto-RFR were considered the three best algorithms with the greatest explanatory power for developing the ensemble model, and they explained 93 % of daytime NOx variation. Second, in estimating nighttime NOx concentrations, this study yielded the same trend as for estimating daytime NOx concentrations. From the hybrid kriging-LUR to the ensemble model, the adjusted R2 increased from 0.68 to 0.98 and the RMSE decreased from 13.46 ppb to 3.74 ppb. Sixteen predictors from the hybrid kriging-LUR were included in the machine learning algorithms for the main estimation. After that, Auto-GBR, Auto-LGBMR, and Auto-RFR were identified as the three best algorithms for developing the ensemble model, and they explained 98 % of nighttime NOx variation. Third, we examined daily NOx concentrations. The results indicate that the ensemble model increased the model performance, with an adjusted R2 increase from 0.65 to 0.94 and an RMSE decrease from 14.49 ppb to 6.00 ppb. To develop the final ensemble model, 14 variables from the hybrid kriging-LUR result were included in the machine learning calculation. From this, Auto-LGBMR, Auto-XGBR, and Auto-RFR were identified as the three best algorithms for developing the ensemble model, and they explained 94 % of daily NOx variation. All potential predictors from the hybrid kriging-LUR result that were selected to train

Fig. 4. Model performance represented by R2 (bar diagram) and RMSE (orange line) of daytime, nighttime, and daily NOx estimates for each algorithm, i.e., a) hybrid
kriging-LUR; b) GBR; c) LGBMR; d) CBR; e) XGBR; f) RFR; and g) machine learning-based ensemble models. Model performance is presented for training data, testing
data, 10-fold-CV, and external validation analysis.
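
The metrics reported in Fig. 4 and the selection of the three best algorithms (Section 2.4.3) can be sketched in a few lines. The formulas for R2, adjusted R2, and RMSE are the standard definitions. The selection rule is an interpretation: it assumes the overfitting criterion means that the gap between training and testing Adj. R2 must stay below 0.1, which the text does not spell out. All scores in the example are made-up numbers, not results from the study.

```python
import math

def r2_score(obs, pred):
    """Coefficient of determination."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R2 for a model with n_predictors fitted on n_samples."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def select_top_models(scores, max_gap=0.1, k=3):
    """Pick the k models with the highest test score among those that do not overfit.

    scores maps model name -> (train_adj_r2, test_adj_r2).
    Assumption: 'overfitting' means |train - test| >= max_gap.
    """
    eligible = {name: test for name, (train, test) in scores.items()
                if abs(train - test) < max_gap}
    return sorted(eligible, key=eligible.get, reverse=True)[:k]

# Toy observed/predicted values (ppb) to exercise the metric functions
obs = [10.0, 20.0, 30.0, 40.0]
pred = [12.0, 18.0, 33.0, 37.0]
r2 = r2_score(obs, pred)
adj = adjusted_r2(r2, n_samples=4, n_predictors=1)
err = rmse(obs, pred)

# Hypothetical (train, test) Adj. R2 per algorithm
scores = {
    "Auto-RFR": (0.97, 0.92),
    "Auto-GBR": (0.95, 0.93),
    "Auto-LGBMR": (0.94, 0.90),
    "Auto-CBR": (0.99, 0.85),   # gap of 0.14, treated as overfitting
    "Auto-XGBR": (0.96, 0.91),
}
top3 = select_top_models(scores)
```

In the study, the estimates from the three retained models are then used as predictors to refit the algorithms; that refitting step is not reproduced here.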


the machine learning algorithms are described in detail in Fig. S1 and Table S2.

Compared to the use of a single machine learning algorithm, the use of the proposed ensemble model was not fully capable of significantly improving model performance. However, the ensemble model did serve a role in increasing the performance of NOx estimation compared to a single model, as reflected in the decrease in error as represented by lower RMSE values, especially for training data and 10-fold-CV. Further, deciding on the use of an ensemble model to estimate diurnal variations of NOx also referred to the performance ranking analysis of each algorithm that was presented in Table S3 and Table S4. NOx concentration estimates based on time variations had better explanatory power in the final ensemble at night and were followed in explanatory power by daily and then daytime NOx estimates. Fig. S2 illustrates the most influential variables in developing the NOx estimation model in terms of daytime, nighttime, and daily concentrations. We identified that residential areas within a 1500 m buffer zone, manufacturing areas within a 4000 m buffer zone, and temple density within a 500 m circular buffer were the main predictors contributing to NOx concentrations for all time variations. In addition, examining the direction of correlation presented by SHAP plotting (Fig. S3), we identified positive associations for those potential predictors of daytime, nighttime, and daily NOx concentrations. The positive direction is indicated by the increasing number of red dots on the right.

3.3. Validation of selected ensemble model

To discern the capability of the ensemble model, internal validation for the spatial pattern of NOx was performed. As shown in Fig. 5, the spatial pattern of NOx was depicted as point estimates at each station (d, e, f) and raster grid estimates that were generated throughout the study area (a, b, c). Using three randomly selected time points, the result confirmed that the ensemble model had a fine ability to depict changes in spatial patterns of NOx concentrations at different periods. Further, when conducting peak validation, we examined NOx concentrations >75 %, 80 %, 85 %, 90 %, 95 %, and 99 %. As illustrated in Fig. 6, the mid-high performance for the daytime NOx estimation model had adjusted R2 of 0.56, 0.56, 0.86, 0.87, 0.88, and 0.89, respectively. For nighttime estimation, the adjusted R2 values were 0.80, 0.80, 0.93, 0.94, 0.95, and 0.95, respectively. For daily NOx estimation, the adjusted R2 values were 0.58, 0.58, 0.85, 0.87, 0.88, and 0.89, respectively.

This study also performed internal verification, which included spatial and temporal validation. As illustrated in Fig. S4, spatial validation involving six air quality zones yielded an R2 value that was nearly the same as the ensemble model, though there was a decrease in Yilan and Hua-Tung AQZ. Meanwhile, the temporal validation by seasons revealed that the model performance did not differ significantly across seasons, although changes occurred in RMSE, MSE, and MAE. The adjusted R2 values were 0.92, 0.98, and 0.93 for daytime, nighttime, and daily, respectively. Verification by day-of-year and by year is illustrated in Fig. S5 and Fig. S6. Based on the adjusted R2 and RMSE, the results indicate there was no substantial change in model performance among the validation models associated with the ensemble model. Because spatial dependence may occur in developed models, spatial CV was performed, and the results confirmed the robustness of the model, with relatively similar adjusted R2 values for all subsets (Fig. S7).

3.4. Spatiotemporal estimation of NOx diurnal variations

3.4.1. Map of NOx concentrations

Fig. 7 depicts the spatiotemporal estimates of NOx concentrations at different timeframes across Taiwan. Generated by multi-temporal data from 1994 to 2019, the highest averages of NOx concentration (red spots) during daytime, nighttime, and daily were identified in northern Taiwan (i.e., Taipei City; New Taipei City), central Taiwan (i.e., Taichung City), and southern Taiwan (i.e., Kaohsiung City). Moreover, the western part of Taiwan has a darker distribution pattern than the eastern part, reflecting that the western region is an urban area and has a higher concentration of NOx than the eastern region that is dominated by mountainous areas. Comparing time variations, NOx concentrations during the daytime were higher than the nighttime, especially in hotspot areas, which may also affect the daily averages of NOx concentrations. Additionally, the spatial patterns of daytime, nighttime, and daily NOx concentrations by season were generated. As illustrated in Fig. 8, the highest NOx levels were observed in winter (December–February) and spring (March–May), while the lowest concentrations were observed in summer (June–August) and autumn (September–November). Kaohsiung, the biggest city in the southwestern region of Taiwan, had the highest NOx concentrations in spring and winter. Nonetheless, it was noted that NOx concentrations remained high throughout all seasons in Taipei and New Taipei City.

3.4.2. Linear trend of NOx concentrations during the study period

The linear trend analysis revealed that daily NOx concentrations increased significantly in almost all regions during the time range (orange areas), with an average increase of 3.7 ppb/year and a maximum increase of up to 7.2 ppb/year. Regarding daytime, NOx concentration significantly increased (except in the mid-east region), with an average

Fig. 5. Map of estimated NOx concentrations during (I) daytime, (II) nighttime, and (III) daily, generated using the ensemble model. The spatial pattern of NOx was
depicted as point estimates at each station (d, e, f) and raster grid estimates that were generated throughout the study area (a, b, c). Some plots represent specific
conditions, i.e. (a, d) when almost all areas in Taiwan have low concentrations; (b, e) when almost all regions in Taiwan have high concentrations; and (c, f) when
northern Taiwan has a slightly higher concentration than the southern region.


Fig. 6. Peak validation to evaluate the performance of the ensemble model for data with high daytime, nighttime, and daily NOx concentrations. The assessed
concentration values were: >75 %, 80 %, 85 %, 90 %, 95 %, and 99 %.
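
The peak validation summarized in Fig. 6 restricts the evaluation to high-concentration observations. A minimal sketch of how such subsets could be formed is given below, assuming the cutoffs (>75 %, 80 %, 85 %, 90 %, 95 %, 99 %) are percentiles of the observed concentrations, as the Discussion suggests. The linear-interpolation percentile rule, the function names, and the synthetic data are assumptions for illustration.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)

def peak_subsets(pairs, cutoffs=(75, 80, 85, 90, 95, 99)):
    """Group (observed, predicted) pairs whose observed value exceeds each cutoff.

    Returns {cutoff: [pairs with observed > that percentile of all observed values]}.
    """
    observed = [o for o, _ in pairs]
    return {q: [p for p in pairs if p[0] > percentile(observed, q)] for q in cutoffs}

# Synthetic observed/predicted NOx pairs (ppb), observed values 1..100
pairs = [(float(v), float(v) + 1.0) for v in range(1, 101)]
subsets = peak_subsets(pairs)
```

Each subset can then be scored with the same metrics as the full dataset (for example, Adj. R2) to check whether performance holds up at the highest concentrations.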

Fig. 7. Diurnal variability maps of NOx concentrations (ppb) in Taiwan at different exposure windows (daytime, nighttime, and daily) over the study period.

increase of 6.2 ppb/year and a maximum increase of 13.0 ppb/year. Meanwhile, nighttime NOx concentration significantly increased evenly in all areas, with an average increase of 1.5 ppb/year and a maximum increase of 3.1 ppb/year. As illustrated in Fig. 9, the linear trends of daytime, nighttime, and daily NOx exhibit the same pattern, although the significance of the increase in daytime exposure concentrations was lower, especially in mid-eastern Taiwan.

4. Discussion

Observing the trend over time in exposure levels, we identified that NOx concentrations in Taiwan gradually decreased from 1994 to 2020. This environmental improvement may be attributed to the Taiwanese government’s policy to promote environmentally friendly power plants (Tsai et al., 2021a, 2021b). Examining long-term datasets, the machine learning-based ensemble model demonstrated an efficient estimation of air pollution concentrations, which vary during specific time spans due to the different community activity patterns that exist in a day. Our study findings yielded estimates of daytime, nighttime, and daily NOx concentrations with high explanatory performance. The adj-R2 of daytime, nighttime, and daily NOx estimates were 0.93, 0.98, and 0.94, respectively. Kriging interpolation was employed in the model development process to efficiently estimate air pollution, accounting for spatial autocorrelation among stations and assigning weights based on the distance between the stations and the target site (Wong et al., 2023). In terms of spatial autocorrelation, kriging can be useful for preserving spatial variability that would be lost using a simpler method (Auchincloss et al., 2007). Further, the stochastic interpolation in kriging weights the distance between the measurement and the predicted areas to improve the estimation (Choi and Chong, 2022). The hybrid kriging-


Fig. 8. Diurnal variability maps of NOx concentrations (ppb) during the different seasons, i.e., winter, spring, summer, and autumn.


Fig. 9. Linear trends of NOx concentration (ppb/year) for (a) daytime; (b) nighttime; and (c) daily, estimated from annual anomalies from 1994 to 2019 using linear
trend analysis. Orange and blue represent a positive (increasing) and negative (decreasing) trend, respectively, and green color represents the statistical p-value of
the trends.
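
The per-grid linear trend mapped in Fig. 9 follows the equation y = mx + b given in Section 2.6, where m is the slope in ppb/year. A minimal least-squares sketch for a single grid cell is shown below. The function name and the four-year toy series are assumptions, and the sketch omits both the anomaly calculation and the significance test (p-value) that the study reports.

```python
def linear_trend(years, values):
    """Ordinary least-squares fit of values = m * years + b; returns (m, b)."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical annual NOx values (ppb) for one 50-m grid cell
years = [1994, 1995, 1996, 1997]
cell = [30.0, 28.5, 27.0, 25.5]
m, b = linear_trend(years, cell)  # m is the trend in ppb/year
```

Applying the same fit to every grid cell yields a slope raster; a positive m would be drawn as an increasing trend and a negative m as a decreasing one.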

LUR model was then constructed to identify the most influential predictors of NOx. The model was further improved by including gradient boost-based machine learning algorithms to train the selected predictors. As reported in a previous study, the utilization of machine learning algorithms not only improves spatial modeling but also reduces processing time (Marsland, 2014; Meyer et al., 2019). As for the integration, the developed machine learning-based ensemble models presented better model performance and were able to minimize estimation errors. Examining an independent dataset, our peak validation depicted the predictive ability in estimating the diurnal variation of NOx at the 75th to 99th percentile, with Adj. R2 ranging from 0.56 to 0.95.

Prior studies have used ensemble models to estimate air pollution; however, our study proposed a better integration approach than did any of those (Huang et al., 2022b; Guo et al., 2020; Requia et al., 2020). A prior study constructed an ensemble model to estimate ground-level ozone in the United States and focused solely on the use of machine learning without identifying the most influential predictors, which resulted in lengthy data processing time (Requia et al., 2020). In contrast, our study selected potential predictors by using hybrid kriging-LUR before integrating them with machine learning algorithms. This allowed for the extraction of the most important predictors as well as a reduction in data processing time. Furthermore, our model was able to estimate long-term variations in air pollution concentration in terms of NOx at a finer spatial resolution (50-m x 50-m grid size), while Huang et al. were not able to obtain a better resolution of NO2 estimation over a wider area in China due to limited computing power and computer storage (Huang et al., 2022b). Guo et al. combined a predictor selection method and several neural network algorithms to estimate fine particulate matter (PM2.5) in Shanghai, China. They used only the Pearson algorithm to select predictors, though the Pearson algorithm cannot fully address complex linear and non-linear issues (Guo et al., 2020). Instead, the current study combined geographical statistical approaches, machine learning algorithms, and the compilation of an ensemble algorithm to construct a spatial model that helped to select predictors. Because the performances of the predictive models varied, it was necessary to determine the most suitable algorithm for data processing and modeling. The algorithms with the best estimation of model performance were chosen in this study, which led to improved accuracy and reduced processing time.

In developing the machine learning-based ensemble models, several variables were identified as strongly predicting NOx concentrations in Taiwan, and these included the density of manufacturing industries, temples, and residential areas. Doubtlessly, the manufacturing industry is the main source of NOx due to chemical combustion (Tsai et al., 2021a, 2021b). Temples were also recognized as a NOx emission source. This finding can be attributed to the common worship activities in Taiwan such as burning incense and joss paper, and this assumption was supported in Lee and Wang’s study (Lee and Wang, 2004). They reported that the level of air pollution and volatile compounds, including NOx, increase due to the burning of incense (Lee and Wang, 2004). Lastly, residential areas with dense population activities were also considered to contribute to an increase in tropospheric pollutants such as NOx via the use of gas stoves and vehicles (Silveira et al., 2018). Nonetheless, there were no differences in predictors affecting NOx concentrations at different time settings (i.e., daytime, nighttime, and daily). It is noted that although NOx concentrations differ at time settings, the aforementioned three variables were the primary sources of NOx exposure in Taiwan for all settings. Furthermore, land-use variables that can identify the spatial variation of NOx concentrations were established by the built model. Among the predictors that were examined, NOx observations that were predicted by automated spatial leave-one-out kriging had an important influence on concentration variations. This indicates that the exposure of NOx around the monitoring stations was significantly related to the exposure measured at the point of the


monitoring station.

Interpreting the spatiotemporal distribution of exposure, generated maps illustrated that there was a high concentration of NOx in the northern region of Taiwan, especially in Taipei and New Taipei City. Meanwhile, there was a medium concentration of NOx in the central region, notably Taichung City, and the southern region, notably Kaohsiung City. This is reasonable considering the given cities are urbanization centers where people live and work (Ritchie and Roser, 2018). In addition, Taipei, the nation’s capital, and Kaohsiung City are areas experiencing rapid development of manufacturing industries. When comparing day and night exposure levels, our findings reflect that the average NOx concentration during the day is 30 % higher than at night. This may be because the main emission sources are more active during the daytime than the nighttime. For instance, residential areas, known as a significant emission source of NOx, host large community activities during the daytime (Crippa et al., 2020). Likewise, worship activities in temples are often performed during the day rather than at night (Hien et al., 2022; Lung and Kao, 2003). Considering working hours, the manufacturing industry mostly operates during the day, affecting the daily average NOx concentrations in Taiwan. The difference in NOx by the time of day emphasizes the necessity to estimate diurnal variation of NOx concentration in future studies. When assessing concentrations by season, we noted that winter and spring had the highest peaks in NOx concentrations, especially in urban lowlands such as Taipei, New Taipei City, and Kaohsiung. This may be because the cold seasons (i.e., winter and spring) exhibit stable atmospheric conditions of high pressure, cold air, and low wind speed that enable pollutants to stay and accumulate, especially in areas with lower altitude (Maurer et al., 2019). A study conducted by Lin and colleagues reported that long-range transport (LRT) of air pollutants influences the northern part of Taiwan during winter and spring. Specifically, the northeast winds shift particulate matter and air pollutants including NOx to Taiwan, resulting in the concentrations of NOx peaking during the cold seasons (Lin et al., 2004).

To our knowledge, the current study is one of the few studies to estimate the diurnal variation of air pollution by applying an approach to enhance ensemble models. Although previous studies have employed ensemble models to estimate PM2.5, NO2, and benzene, most of them have determined the most influential predictors by using manual techniques (Hsu et al., 2022; Wong et al., 2021a; Wong et al., 2021b; Wong et al., 2023). Now that the potential predictors contributing to NOx in Taiwan have been recognized, it is suggested that the government revise emission control policies for emission sources such as manufacturing industries. Moreover, to reduce emissions from incense and joss paper burning in temples, a regulation of the amount of incense and joss paper burning in temples should be established. Further, merely using hands to pray or worship should be encouraged. This study noted that NOx also comes from residential activities, and one example of this is vehicle emissions. Accordingly, the development of a green public transportation system should be pursued. Using the study’s results, policymakers and community stakeholders will be able to establish more effective environmental regulations for developing sustainable cities and communities. By comparing several algorithms, this study provided a better estimate of NOx concentrations, and it has been validated by a series of tests. Additionally, this approach can be modified and adopted to estimate variations in other pollutants in other regions or countries that may have different characteristics.

We acknowledge several limitations of this study. First, this study found that the ensemble model only slightly improved the predictive performance of the single algorithm in some cases. We assumed that this condition may occur due to the impact of model diversity and a stochastic learning algorithm. Thus, it is important to carefully consider the characteristics of the data and the algorithms when using ensemble techniques to ensure optimal performance. Second, this study focused only on estimating the diurnal variation of NOx in Taiwan. In order to enrich research findings, considering estimates of air pollution variations at critical times such as rush hour traffic is needed. Third, we did not consider satellite-based predictor variables for the assessed datasets. This was because there was a large number of missing values in the satellite image data due to weather factors such as clouds and rain, and those missing values could have affected the model development. Fourth, this study examined a limited land-use dataset that was only available for a few periods. We suggest using variable data in the future that is more robust since it could improve the performance of the estimation model.

5. Conclusions

This study demonstrated a machine learning-based ensemble model for estimating NOx concentrations, targeting the main island of Taiwan. The proposed method yielded excellent predictivity in estimating the spatiotemporal variation of NOx, accomplishing coefficients of determination of 93 %, 98 %, and 94 % for daytime, nighttime, and daily concentrations, respectively. After conducting a series of validation tests, it was confirmed that the ensemble model in this study provided more efficient and accurate estimates of NOx concentrations than the approaches of previous studies. This study focused on NOx in Taiwan, but the procedures proposed in this study could be adapted to estimate other air pollutants in other areas. In sum, this study provided important information about air pollution estimation and could be used as a reference in supporting emission control for environmental and human health.

CRediT authorship contribution statement

Aji Kusumaning Asri: Writing – review & editing, Writing – original draft, Visualization, Formal analysis, Conceptualization. Hsiao-Yun Lee: Writing – review & editing, Writing – original draft, Formal analysis, Conceptualization. Yu-Ling Chen: Visualization, Validation, Software, Methodology, Conceptualization. Pei-Yi Wong: Writing – review & editing, Software. Chin-Yu Hsu: Writing – review & editing, Data curation. Pau-Chung Chen: Writing – review & editing, Resources, Data curation. Shih-Chun Candice Lung: Writing – review & editing, Funding acquisition. Yu-Cheng Chen: Writing – review & editing, Resources, Data curation. Chih-Da Wu: Writing – review & editing, Validation, Supervision, Resources, Project administration, Methodology, Investigation, Funding acquisition, Data curation, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgement

This study was funded by the National Science and Technology Council, R.O.C.- Taiwan (NSTC 109WFA0910475; NSTC 112-2123-M-001-008; NSTC 112-2121-M-006-004-) and Academia Sinica, Taiwan, under “Trans-disciplinary PM2.5 Exposure Research in Urban Areas for Health-oriented Preventive Strategies (II),” Project No.: AS-SS-110-02. We acknowledge that this research was supported in part by the Higher Education Sprout Project, Ministry of Education to the Headquarters of University Advancement at National Cheng Kung University (NCKU). This work was financially supported by the “Innovation and Development Center of Sustainable Agriculture” from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan. This study was also supported by the National Aeronautics and Space

13
A.K. Asri et al. Science of the Total Environment 916 (2024) 170209

Administration (NASA) and the United States Geological Survey (USGS), who provided satellite-derived datasets.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.scitotenv.2024.170209.

References

Abedin, M.Z., Moon, M.H., Hassan, M.K., Hajek, P., 2021. Deep learning-based exchange rate prediction during the COVID-19 pandemic. Ann. Oper. Res. 1–52 https://doi.org/10.1007/s10479-021-04420-6.
Ali, S.M., Mahdi, A.S., Shaban, A.H., 2012. Wind speeds estimation on the ground level for windmills site selection. Iraqi Journal of Science 53 (4), 965–970.
Arowosegbe, O.O., Röösli, M., Künzli, N., Saucy, A., Adebayo-Ojo, T.C., Schwartz, J., Kebalepile, M., Jeebhay, M.F., Dalvie, M.A., de Hoogh, K., 2022. Ensemble averaging using remote sensing data to model spatiotemporal PM10 concentrations in sparsely monitored South Africa. Environ. Pollut. (Barking, Essex : 1987) 310. https://doi.org/10.1016/J.ENVPOL.2022.119883.
Auchincloss, A.H., Diez Roux, A.V., Brown, D.G., Raghunathan, T.E., Erdmann, C.A., 2007. Filling the gaps: spatial interpolation of residential survey data in the estimation of neighborhood characteristics. Epidemiology (Cambridge, Mass.) 18 (4), 469–478. https://doi.org/10.1097/EDE.0B013E3180646320.
Barua, S., Kim, J.Y., Yenari, M.A., Lee, J.E., 2019. The role of NOx inhibitors in neurodegenerative diseases. IBRO Reports 7, 59–69. https://doi.org/10.1016/J.IBROR.2019.07.1721.
Beelen, R., de Hoogh, K., Eeftens, M., Meliefste, K., Cirach, M., de Nazelle, A., Nieuwenhuijsen, M., Molter, A., Cyrys, J., Birk, M., Bellander, T., Sugiri, D., Tsai, M.-Y., Ineichen, A., Madsen, C., Gryparis, A., Modig, L., Mosler, G., Vienneau, D., Hoek, G., 2011. Estimating long-term exposure to air pollution in 38 study areas in Europe in a harmonized way using land use regression modeling (ESCAPE project). Epidemiology 22, S82. https://doi.org/10.1097/01.EDE.0000391915.46823.B9.
Boke, A.S., 2017. Comparative evaluation of spatial interpolation methods for estimation of missing meteorological variables over Ethiopia. J. Water Resour. Protect. 9 (8), 945–959. https://doi.org/10.4236/JWARP.2017.98063.
Brandsma, T., Können, G.P., 2006. Application of nearest-neighbor resampling for homogenizing temperature records on a daily to sub-daily level. Int. J. Climatol. 26 (1), 75–89.
Cai, J., Ge, Y., Li, H., Yang, C., Liu, C., Meng, X., Wang, W., Niu, C., Kan, L., Schikowski, T., Yan, B., Chillrud, S. N., Kan, H., & Jin, L. (2020). Application of land use regression to assess exposure and identify potential sources in PM2.5, BC, NO2 concentrations. 9, 223. https://doi.org/10.1016/J.ATMOSENV.2020.117267.
Central Weather Bureau. (2022). Timetable of sunrise and sunset in the 110th year of the Republic of China. https://www.cwb.gov.tw/Data/astronomy/2021/sundat/01taipei.pdf (Accessed July 2023).
César, A.C.G., Carvalho, J.A., Nascimento, L.F.C., 2015. Association between NOx exposure and deaths caused by respiratory diseases in a medium-sized Brazilian city. Braz. J. Med. Biol. Res. 48 (12), 1130. https://doi.org/10.1590/1414-431X20154396.
Chen, P.C., Lin, Y.T., 2022. Exposure assessment of PM2.5 using smart spatial interpolation on regulatory air quality stations with clustering of densely-deployed microsensors. Environ. Pollut. 292, 118401 https://doi.org/10.1016/J.ENVPOL.2021.118401.
Chen, T., Guestrin, C., 2016. XGBoost: A scalable tree boosting system. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 13–17-August-2016, pp. 785–794. https://doi.org/10.1145/2939672.2939785.
Chen, T.H., Hsu, Y.C., Zeng, Y.T., Candice Lung, S.C., Su, H.J., Chao, H.J., Wu, C.Da., 2020. A hybrid kriging/land-use regression model with Asian culture-specific sources to assess NO2 spatial-temporal variations. Environ. Pollut. 259, 113875 https://doi.org/10.1016/J.ENVPOL.2019.113875.
Chen, Y., Jia, Z., Mercola, D., Xie, X., 2013. A gradient boosting algorithm for survival analysis via direct optimization of concordance index. Comput. Math. Methods Med. 2013 https://doi.org/10.1155/2013/873595.
Chen, Y.C., Chou, C.C.K., Liu, C.Y., Chi, S.Y., Chuang, M.T., 2023. Evaluation of the nitrogen oxide emission inventory with TROPOMI observations. Atmos. Environ. 298, 119639 https://doi.org/10.1016/J.ATMOSENV.2023.119639.
Choi, K., Chong, K., 2022. Modified inverse distance weighting interpolation for particulate matter estimation and mapping. Atmosphere 13 (5). https://doi.org/10.3390/atmos13050846.
Crippa, M., Solazzo, E., Huang, G., Guizzardi, D., Koffi, E., Muntean, M., Schieberle, C., Friedrich, R., Janssens-Maenhout, G., 2020. High resolution temporal profiles in the emissions database for global atmospheric research. Scientific Data 2020 7:1 7 (1), 1–17. https://doi.org/10.1038/s41597-020-0462-2.
de Hoogh, K., Korek, M., Vienneau, D., Keuken, M., Kukkonen, J., Nieuwenhuijsen, M.J., Badaloni, C., Beelen, R., Bolignano, A., Cesaroni, G., Pradas, M.C., Cyrys, J., Douros, J., Eeftens, M., Forastiere, F., Forsberg, B., Fuks, K., Gehring, U., Gryparis, A., Bellander, T., 2014. Comparing land use regression and dispersion modelling to assess residential exposure to ambient air pollution for epidemiological studies. Environ. Int. 73, 382–392. https://doi.org/10.1016/J.ENVINT.2014.08.011.
Eeftens, M., Meier, R., Schindler, C., Aguilera, I., Phuleria, H., Ineichen, A., Davey, M., Ducret-Stich, R., Keidel, D., Probst-Hensch, N., Künzli, N., Tsai, M.Y., 2016. Development of land use regression models for nitrogen dioxide, ultrafine particles, lung deposited surface area, and four other markers of particulate matter pollution in the Swiss SAPALDIA regions. Environ. Health 15 (1), 1–14. https://doi.org/10.1186/S12940-016-0137-9/TABLES/6.
Elshawi, R., Sakr, S., 2020. Automated machine learning: techniques and frameworks. Lecture Notes in Business Information Processing 390, 40–69. https://doi.org/10.1007/978-3-030-61627-4_3/FIGURES/6.
ESRI. (2023). Understanding Euclidean distance analysis. ArcGIS Desktop: Release 10.7.1. Redlands, CA: Environmental Systems Research Institute. Retrieved from https://desktop.arcgis.com/en/arcmap/latest/tools/spatial-analyst-toolbox/understanding-euclidean-distance-analysis.htm#:~:text=The%20Euclidean%20distance%20output%20raster%20contains%20the%20measured%20distance%20from,cell%20center%20to%20cell%20center. (Accessed in June 2023).
Fontijn, A., 1976. Chemiluminescence techniques in air pollutant monitoring. In: Modern Fluorescence Spectroscopy, pp. 159–192. https://doi.org/10.1007/978-1-4684-2583-3_6.
Friedman, J.H., 2001. Greedy function approximation: A gradient boosting machine. Ann. Stat. 29 (5), 1189–1232. https://doi.org/10.1214/AOS/1013203451.
Fujitani, Y., Takahashi, K., Saitoh, K., Fushimi, A., Hasegawa, S., Kondo, Y., Tanabe, K., Takami, A., Kobayashi, S., 2021. Contribution of industrial and traffic emissions to ultrafine, fine, coarse particles in the vicinity of industrial areas in Japan. Environ. Adv. 5, 100101 https://doi.org/10.1016/J.ENVADV.2021.100101.
Guo, C., Liu, G., Chen, C.H., 2020. Air pollution concentration forecast method based on the deep ensemble neural network. Wirel. Commun. Mob. Comput. 2020 https://doi.org/10.1155/2020/8854649.
Guo, Y., Zhu, L., Wang, X., Qiu, X., Qian, W., Wang, L., 2022. Assessing environmental impact of NOX and SO2 emissions in textiles production with chemical footprint. Sci. Total Environ. 831, 154961 https://doi.org/10.1016/j.scitotenv.2022.154961.
Hancock, J.T., Khoshgoftaar, T.M., 2020. CatBoost for big data: an interdisciplinary review. Journal of Big Data 7 (1), 1–45. https://doi.org/10.1186/S40537-020-00369-8/FIGURES/9.
Hasnain, A., Sheng, Y., Hashmi, M.Z., Bhatti, U.A., Hussain, A., Hameed, M., Zha, Y., 2022. Time series analysis and forecasting of air pollutants based on prophet forecasting model in Jiangsu province, China. Front. Environ. Sci. 10, 945628 https://doi.org/10.3389/fenvs.2022.945628.
He, W., Meng, H., Han, J., Zhou, G., Zheng, H., Zhang, S., 2022. Spatiotemporal PM2.5 estimations in China from 2015 to 2020 using an improved gradient boosting decision tree. Chemosphere 296, 134003. https://doi.org/10.1016/J.CHEMOSPHERE.2022.134003.
Hien, T.T., Ngo, T.H., Lung, S.C.C., Ngan, T.A., Minh, T.H., Cong-Thanh, T., Nguyen, L.S.P., Chi, N.D.T., 2022. Characterization of particulate matter (PM1 and PM2.5) from incense burning activities in temples in Vietnam and Taiwan. Aerosol Air Qual. Res. 22 (11), 220193 https://doi.org/10.4209/AAQR.220193.
Hofstra, N., New, M., McSweeney, C., 2009. The influence of interpolation and station network density on the distributions and trends of climate variables in gridded daily data. Clim. Dyn. 2009 35:5 35 (5), 841–858. https://doi.org/10.1007/S00382-009-0698-1.
Hsu, C.Y., Xie, H.X., Wong, P.Y., Chen, Y.C., Chen, P.C., Wu, C.Da., 2022. A mixed spatial prediction model in estimating spatiotemporal variations in benzene concentrations in Taiwan. Chemosphere 301. https://doi.org/10.1016/J.CHEMOSPHERE.2022.134758.
Huang, C., Sun, K., Hu, J., Xue, T., Xu, H., Wang, M., 2022b. Estimating 2013–2019 NO2 exposure with high spatiotemporal resolution in China using an ensemble model. Environ. Pollut. 292, 118285 https://doi.org/10.1016/j.envpol.2021.118285.
Huang, D., He, B., Wei, L., Sun, L., Li, Y., Yan, Z., Wang, X., Chen, Y., Li, Q., Feng, S., 2021. Impact of land cover on air pollution at different spatial scales in the vicinity of metropolitan areas. Ecol. Indic. 132, 108313 https://doi.org/10.1016/J.ECOLIND.2021.108313.
Huang, S.K., Chen, S.-Y., Chou, K.-L., Hsu, W.C., Lai, K.-H., Chueh, T.-H., Kuo, L., Lu, W., 2022a. Optimizing the PM2.5 tradeoffs: the case of Taiwan. Aerosol Air Qual. Res. 22 (10), 210315 https://doi.org/10.4209/AAQR.210315.
Khezri, B., Chan, Y.Y., Tiong, L.Y.D., Webster, R.D., 2015. Annual air pollution caused by the hungry ghost festival. Environ Sci Process Impacts 17 (9), 1578–1586. https://doi.org/10.1039/C5EM00312A.
Korhonen, A., Relvas, H., Miranda, A.I., Ferreira, J., Lopes, D., Rafael, S., Almeida, S.M., Faria, T., Martins, V., Canha, N., Diapouli, E., Eleftheriadis, K., Chalvatzaki, E., Lazaridis, M., Lehtomäki, H., Rumrich, I., Hänninen, O., 2021. Analysis of spatial factors, time-activity and infiltration on outdoor generated PM2.5 exposures of school children in five European cities. Sci. Total Environ. 785, 147111 https://doi.org/10.1016/J.SCITOTENV.2021.147111.
Lai, H. C., Hsiao, M. C., Liou, J. L., Lai, L. W., Wu, P. C., & Fu, J. S. (2020). Using costs and health benefits to estimate the priority of air pollution control action plan: a case study in Taiwan. Applied Sciences 2020, Vol. 10, Page 5970, 10(17), 5970. https://doi.org/10.3390/APP10175970.
Lai, W. I., Chen, Y. Y., & Sun, J. H. (2022). Ensemble machine learning model for accurate air pollution detection using commercial gas sensors. Sensors (Basel, Switzerland), 22(12). https://doi.org/10.3390/S22124393.
Lee, C.-H., Brimblecombe, P., Lee, C.-L., 2022. Fifty-Year Change in Air Pollution in Kaohsiung. Environmental Science and Pollution Research, Taiwan. https://doi.org/10.1007/s11356-022-21756-z.
Lee, M., Lin, L., Chen, C.Y., Tsao, Y., Yao, T.H., Fei, M.H., Fang, S.H., 2020. Forecasting air quality in Taiwan by using machine learning. Scientific Reports 2020 10:1 10 (1), 1–13. https://doi.org/10.1038/s41598-020-61151-7.
Lee, S.C., Wang, B., 2004. Characteristics of emissions of air pollutants from burning of incense in a large environmental chamber. Atmos. Environ. 38 (7), 941–951. https://doi.org/10.1016/j.atmosenv.2003.11.002.
Li, M., Zhang, Q., Kurokawa, J.I., Woo, J.H., He, K., Lu, Z., Ohara, T., Song, Y., Streets, D.G., Carmichael, G.R., Cheng, Y., Hong, C., Huo, H., Jiang, X., Kang, S., Liu, F., Su, H., Zheng, B., 2017. MIX: A mosaic Asian anthropogenic emission inventory under the international collaboration framework of the MICS-Asia and HTAP. Atmos. Chem. Phys. 17 (2), 935–963. https://doi.org/10.5194/ACP-17-935-2017.
Li, Z., Yim, S.H.L., Ho, K.F., 2020. High temporal resolution prediction of street-level PM2.5 and NOx concentrations using machine learning approach. J. Clean. Prod. 268, 121975 https://doi.org/10.1016/j.jclepro.2020.121975.
Lin, C.Y., Liu, S.C., Chou, C.C., Liu, T.H., Lee, C.T., Yuan, C.S., Young, C.Y., 2004. Long-range transport of Asian dust and air pollutants to Taiwan. Terr. Atmos. Ocean. Sci. 15 (5), 759–784.
Liu, J., Chen, W., 2022. First satellite-based regional hourly NO2 estimations using a space-time ensemble learning model: A case study for Beijing-Tianjin-Hebei region, China. Sci. Total Environ. 820 https://doi.org/10.1016/J.SCITOTENV.2022.153289.
Liu, X., 2015. Methods and Applications of Longitudinal Data Analysis: Patterns of Residual Covariance Structure. Chapter 5- Spatial Power Model – SP(POW). Elsevier, p. 140.
Löw, N., Hesser, J., Blessing, M., 2019. Multiple retrieval case-based reasoning for incomplete datasets. J. Biomed. Inform. 92, 103127 https://doi.org/10.1016/j.jbi.2019.103127.
Lung, S.-C.C., Kao, M.-C., 2003. Worshippers’ exposure to particulate matter in two temples in Taiwan. J. Air Waste Manage. Assoc. 53 (2), 130–135. https://doi.org/10.1080/10473289.2003.10466140.
Lyu, Y., Ju, Q., Lv, F., Feng, J., Pang, X., Li, X., 2022. Spatiotemporal variations of air pollutants and ozone prediction using machine learning algorithms in the Beijing-Tianjin-Hebei region from 2014 to 2021. Environ. Pollut. 306, 119420 https://doi.org/10.1016/J.ENVPOL.2022.119420.
Marsland, S., 2014. Machine learning: an algorithmic perspective. Mach. Learn. 1–452 https://doi.org/10.1201/B17476/MACHINE-LEARNING-STEPHEN-MARSLAND.
Maurer, M., Klemm, O., Lokys, H.L., Lin, N.H., 2019. Trends of fog and visibility in Taiwan: climate change or air quality improvement? Aerosol Air Qual. Res. 19 (4), 896–910. https://doi.org/10.4209/AAQR.2018.04.0152.
Meijles, D.N., Pagano, P.J., 2016. NOx and inflammation in the vascular adventitia. Hypertension 67 (1), 14–19. https://doi.org/10.1161/HYPERTENSIONAHA.115.03622.
Meik, J.M., Lawing, A.M., 2017. Considerations and pitfalls in the spatial analysis of water quality data and its association with hydraulic fracturing. Advances in Chemical Pollution, Environmental Management and Protection 1, 227–256. https://doi.org/10.1016/bs.apmp.2017.08.013.
Meng, X., Chen, L., Cai, J., Zou, B., Wu, C.F., Fu, Q., Zhang, Y., Liu, Y., Kan, H., 2015. A land use regression model for estimating the NO2 concentration in shanghai, China. Environ. Res. 137, 308–315. https://doi.org/10.1016/J.ENVRES.2015.01.003.
Meyer, H., Reudenbach, C., Wöllauer, S., Nauss, T., 2019. Importance of spatial predictor variable selection in machine learning applications – moving from data reproduction to spatial prediction. Ecol. Model. 411, 108815 https://doi.org/10.1016/J.ECOLMODEL.2019.108815.
Ministry of the Interior, 2023. Taiwan Developed Area: Urban: Industrial Zone. Construction and Planning Agency, Ministry of The Interior. Retrieved from. https://www.ceicdata.com/en/taiwan/developed-area-by-region-annual/developed-area-urban-industrial-zone. (Accessed January 2023).
Molnar, C., 2020. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. Leanpub, United States.
Mölter, A., Lindley, S., 2021. Developing land use regression models for environmental science research using the XLUR tool–more than a one-trick pony. Environ. Model Softw. 143, 105108 https://doi.org/10.1016/j.envsoft.2021.105108.
National Statistics, Republic of China - Taiwan. (2023). Total Population. Retrieved from https://eng.stat.gov.tw/Point.aspx?sid=t.9&n=4208&sms=11713 (Accessed January 2023).
Morapedi, T.D., Obagbuwa, I.C., 2023. Air pollution particulate matter (PM2.5) prediction in South African cities using machine learning techniques. Front. Artif. Intell. 6 https://doi.org/10.3389/frai.2023.1230087.
Nguyen, D.H., Lin, C., Vu, C.T., Cheruiyot, N.K., Nguyen, M.K., Le, T.H., Bui, X.T., 2022. Tropospheric ozone and NOx: A review of worldwide variation and meteorological influences. Environ. Technol. Innov. 102809 https://doi.org/10.1016/j.eti.2022.102809.
Pang, Y., Wang, Y., Lai, X., Zhang, S., Liang, P., Song, X., 2023. Enhanced kriging leave-one-out cross-validation in improving model estimation and optimization. Comput. Methods Appl. Mech. Eng. 414, 116194 https://doi.org/10.1016/j.cma.2023.116194.
Parry, P. (2019). Automated machine learning for production and analytics: auto_ml. PyPI. MIT license. Retrieved from https://pypi.org/project/auto_ml/#files (accessed in June 2023).
Pintelas, P., & Livieris, I. E. (2020). Ensemble Algorithms and Their Applications. Retrieved November 2, 2022, from www.mdpi.com/journal/algorithms.
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., & Gulin, A. (2017). CatBoost: unbiased boosting with categorical features. Advances in neural information processing systems, 2018-December, 6638–6648. Doi:10.48550/arxiv.1706.09516.
Ren, X., Mi, Z., Georgopoulos, P.G., 2020. Comparison of machine learning and land use regression for fine scale spatiotemporal estimation of ambient air pollution: modeling ozone concentrations across the contiguous United States. Environ. Int. 142, 105827 https://doi.org/10.1016/J.ENVINT.2020.105827.
Requia, W.J., Di, Q., Silvern, R., Kelly, J.T., Koutrakis, P., Mickley, L.J., Sulprizio, M.P., Amini, H., Shi, L., Schwartz, J., 2020. An ensemble learning approach for estimating high spatiotemporal resolution of ground-level ozone in the contiguous United States. Environ. Sci. Technol. 54 (18), 11037. https://doi.org/10.1021/ACS.EST.0C01791.
Ritchie, H. & Roser, M. (2018). Urbanization. Our World in Data. Retrieved from: https://ourworldindata.org/urbanization.
Rivera-González, L.O., Zhang, Z., Sánchez, B.N., Zhang, K., Brown, D.G., Rojas-Bracho, L., Osornio-Vargas, A., Vadillo-Ortega, F., O’Neill, M.S., 2015. An assessment of air pollutant exposure methods in Mexico City, Mexico. J. Air Waste Manage. Assoc. (1995), 65(5), 581. https://doi.org/10.1080/10962247.2015.1020974.
Roberts-Semple, D., Song, F., Gao, Y., 2012. Seasonal characteristics of ambient nitrogen oxides and ground–level ozone in metropolitan northeastern New Jersey. Atmos. Pollut. Res. 3 (2), 247–257. https://doi.org/10.5094/APR.2012.027.
Rusdah, D.A., Murfi, H., 2020. XGBoost in handling missing values for life insurance risk prediction. SN Applied Sciences 2 (8), 1–10. https://doi.org/10.1007/S42452-020-3128-Y/TABLES/5.
Ryan, P.H., LeMasters, G.K., 2007. A review of land-use regression models for characterizing intraurban air pollution exposure. Inhal. Toxicol. 19 (sup1), 127–133. https://doi.org/10.1080/08958370701495998.
Sarker, I.H., 2021. Machine learning: algorithms, real-world applications and research directions. SN Computer Science 2 (3), 1–21. https://doi.org/10.1007/S42979-021-00592-X/FIGURES/11.
Selby, B., Kockelman, K.M., 2013. Spatial prediction of traffic levels in unmeasured locations: applications of universal kriging and geographically weighted regression. J. Transp. Geogr. 29, 24–32. https://doi.org/10.1016/j.jtrangeo.2012.12.009.
Silveira, C., Ferreira, J., Monteiro, A., et al., 2018. Emissions from residential combustion sector: how to build a high spatially resolved inventory. Air Qual. Atmos. Health 11, 259–270. https://doi.org/10.1007/s11869-017-0526-4.
Skoulidou, I., Koukouli, M. E., Segers, A., Manders, A., Balis, D., Stavrakou, T., van Geffen, J., & Eskes, H. (2021). Changes in power plant NOx emissions over Northwest Greece using a data assimilation technique. Atmosphere 2021, Vol. 12, Page 900, 12(7), 900. https://doi.org/10.3390/ATMOS12070900.
Taiwan EPA. (2019). Taiwan poised to introduce new regulations on NOx emissions. Environmental Protection Administration: Central News Agency. Retrieved from www.taiwannews.com.tw/en/news/3728101 (Accessed: January 2023).
Taiwan EPA, 2023a. Control of Mobile Sources of Air Pollution -Air Quality-Air-EPA Topics. Environmental Protection Administration. Retrieved from. https://www.epa.gov.tw/eng/6DFB84DC8C65F910. (Accessed January 2023).
Taiwan EPA, 2023b. Air Quality Observation and Forecast Network. Environmental Protection Administration. Retrieved from. https://www.epa.gov.tw/eng/72014105B1616E36. (Accessed January 2023).
Thunis, P., Clappier, A., Beekmann, M., Putaud, J.P., Cuvelier, C., Madrazo, J., de Meij, A., 2021. Non-linear response of PM2.5 to changes in NOx and NH3 emissions in the Po basin (Italy): consequences for air quality plans. Atmos. Chem. Phys. 21, 9309–9327. https://doi.org/10.5194/acp-21-9309-2021.
Tian, H., Liu, K., Hao, J., Wang, Y., Gao, J., Qiu, P., Zhu, C., 2013. Nitrogen oxides emissions from thermal power plants in China: current status and future predictions. Environ. Sci. Technol. 47 (19), 11350–11357. https://doi.org/10.1021/ES402202D/SUPPL_FILE/ES402202D_SI_001.PDF.
Tsai, J. H., Chen, S. H., Chen, S. F., & Chiang, H. L. (2021a). Air pollutant emission abatement of the fossil-fuel power plants by multiple control strategies in Taiwan. Energies 2021, Vol. 14, page 5716, 14(18), 5716. https://doi.org/10.3390/EN14185716.
Tsai, J.-H., Lee, M.-Y., Chiang, H.-L., 2021b. Effectiveness of SOx, NOx, and primary particulate matter control strategies in the improvement of ambient PM concentration in Taiwan. Atmosphere 12 (4), 460. https://doi.org/10.3390/atmos12040460.
UNION, P., 2008. Directive 2008/50/EC of the European Parliament and of the council of 21 may 2008 on ambient air quality and cleaner air for Europe. Off. J. Eur. 29, 169–212.
United States Environmental Protection Agency, 2023. Nitrogen Oxides Control Regulations | Ground-Level Ozone. New England, US EPA. Retrieved from. https://www3.epa.gov/region1/airquality/nox.html. (Accessed February 2023).
van der A., R.J., Eskes, H.J., Boersma, K.F., van Noije, T.P.C., Van Roozendael, M., De Smedt, I., Peters, D.H.M.U., Meijer, E.W., 2008. Trends, seasonal variability and dominant NOx source derived from a ten-year record of NO2 measured from space. J. Geophys. Res. Atmos. 113 (D4), 4302. https://doi.org/10.1029/2007JD009021.
Wang, J., Song, G., 2018. A deep spatial-temporal ensemble model for air quality prediction. Neurocomputing 314, 198–206. https://doi.org/10.1016/J.NEUCOM.2018.06.049.
Wang, K.Y., Yu, J., 2022. Policy compliance and ritual maintenance dilemma: can Chinese folk Temples’ air pollution control measures ensure visitor satisfaction? Front. Environ. Sci. 10, 793. https://doi.org/10.3389/FENVS.2022.907701/BIBTEX.
Wang, Y., Chen, J., Chen, X., Zeng, X., Kong, Y., Sun, S., Guo, Y., Liu, Y., 2021. Short-term load forecasting for industrial customers based on TCN-LightGBM. IEEE Trans. Power Syst. 36 (3), 1984–1997. https://doi.org/10.1109/TPWRS.2020.3028133.
Wong, P.Y., Su, H.J., Lee, H.Y., Chen, Y.C., Hsiao, Y.P., Huang, J.W., Teo, T.A., Wu, C.Da, Spengler, J.D., 2021a. Using land-use machine learning models to estimate daily NO2 concentration variations in Taiwan. J. Clean. Prod. 317, 128411 https://doi.org/10.1016/J.JCLEPRO.2021.128411.
Wong, P.Y., Lee, H.Y., Zeng, Y.T., Chern, Y.R., Chen, N.T., Candice Lung, S.C., Su, H.J., Wu, C.Da., 2021b. Using a land use regression model with machine learning to estimate ground level PM2.5. Environ. Pollut. 277, 116846 https://doi.org/10.1016/J.ENVPOL.2021.116846.
Wong, P.Y., Su, H.J., Lung, S.C.C., Wu, C.D., 2023. An ensemble mixed spatial model in estimating long-term and diurnal variations of PM2.5 in Taiwan. Sci. Total Environ. 866, 161336 https://doi.org/10.1016/J.SCITOTENV.2022.161336.
Wu, C.-D., Chen, Y.-C., Pan, W.-C., Zeng, Y.-T., Chen, M.-J., Guo, Y.L., Lung, S.-C.C., 2017. Land-use regression with long-term satellite-based greenness index and culture-specific sources to model PM2.5 spatial-temporal variability. Environ. Pollut. 224, 148–157. https://doi.org/10.1016/j.envpol.2017.01.074.
Wu, Y., Song, G., 2019. The impact of activity-based mobility pattern on assessing fine-grained traffic-induced air pollution exposure. Int. J. Environ. Res. Public Health 16 (18). https://doi.org/10.3390/IJERPH16183291.
Yang, K.L., 2002. Spatial and seasonal variation of PM10 mass concentrations in Taiwan. Atmos. Environ. 36 (21), 3403–3411. https://doi.org/10.1016/S1352-2310(02)00312-6.
Yang, S., Zhang, H., Yang, S., Zhang, H., 2018. Comparison of several data mining methods in Credit card default prediction. Intell. Inf. Manag. 10 (5), 115–122. https://doi.org/10.4236/IIM.2018.105010.
Yeh, M.C., 2023. The development of Temple culture in Taiwan. SHS Web Conf. 168, 02001. https://doi.org/10.1051/shsconf/202316802001.
Zhang, P., Ma, W., Wen, F., Liu, L., Yang, L., Song, J., Wang, N., Liu, Q., 2021. Estimating PM2.5 concentration using the machine learning GA-SVM method to improve the land use regression model in Shaanxi, China. Ecotoxicol. Environ. Saf. 225 https://doi.org/10.1016/J.ECOENV.2021.112772.
Zhang, R., Tie, X., Bond, D.W., 2003. Impacts of anthropogenic and natural NOx sources over the U.S. on tropospheric chemistry. Proc. Natl. Acad. Sci. U. S. A. 100 (4), 1505–1509. https://doi.org/10.1073/PNAS.252763799/ASSET/1D35C036-0140.
Zhang, S., Zhong, L., Chen, X., Liu, Y., Zhai, X., Xue, Y., Wang, W., et al., 2019. Emissions characteristics of hazardous air pollutants from the incineration of sacrificial offerings. Atmosphere 10 (6), 332. https://doi.org/10.3390/atmos10060332.
Zhang, S.-Y., Zhang, H.-W., 2016. Regional Differences of NOx Emission and Its Causes in China, pp. 661–667. https://doi.org/10.2991/EESED-16.2017.91.
Zhao, C., Wang, B., 2022. How does new-type urbanization affect air pollution? Empirical evidence based on spatial spillover effect and spatial Durbin model. Environ. Int. 165, 107304 https://doi.org/10.1016/J.ENVINT.2022.107304.
Zhao, W., Zhong, Y., Li, Q., Li, M., Liu, J., Tang, L., 2022. Comparison and correction of IDW based wind speed interpolation methods in urbanized Shenzhen. Front. Earth Sci. 16 (3), 798–808.
Zhong, S., Zhang, K., Bagheri, M., Burken, J.G., Gu, A.Z., Li, B., Zhang, H., 2021. Machine learning: New ideas and tools in environmental science and engineering. Environ. Sci. Technol. https://doi.org/10.1021/acs.est.1c01339.
Zhou, L., Pan, S., Wang, J., Vasilakos, A.V., 2017. Machine learning on big data: opportunities and challenges. Neurocomputing 237, 350–361. https://doi.org/10.1016/J.NEUCOM.2017.01.026.