
Neural Computing and Applications

https://doi.org/10.1007/s00521-024-09637-7

ORIGINAL ARTICLE

Self-supervised air quality estimation with graph neural network assistance and attention enhancement

Viet Hung Vu · Duc Long Nguyen · Thanh Hung Nguyen · Quoc Viet Hung Nguyen · Phi Le Nguyen · Thanh Trung Huynh

Received: 17 November 2022 / Accepted: 21 February 2024


© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024

Abstract
The rapid progress of industrial development, urbanization, and traffic has caused air quality degradation that negatively affects human health and environmental sustainability, especially in developing countries. However, due to the limited number of sensors available, the air quality index at many locations is not monitored. Therefore, many studies, covering both statistical and machine learning approaches, have been proposed to estimate the air quality value at an arbitrary location. Most existing studies perform the interpolation using traditional techniques that leverage distance information. In this work, we propose a novel deep-learning-based model for air quality value estimation. The approach follows the encoder–decoder paradigm, with the encoder and decoder trained separately using different training mechanisms. For the encoder component, we propose a new self-supervised graph representation learning approach for spatio-temporal data. For the decoder component, we design a deep interpolation layer that employs two attention mechanisms and a fully connected layer, using air quality data at known stations, distance information, and meteorological information at the target point to predict air quality at arbitrary locations. The experimental results demonstrate significant improvements in estimation accuracy achieved by our proposed model compared to state-of-the-art approaches. For the MAE indicator, our model enhances the estimation accuracy by 4.93% to 34.88% on the UK dataset and by 6.89% to 31.94% on the Beijing dataset. In terms of RMSE, the average improvements of our method on the two datasets are 13.33% and 14.37%, respectively. The statistics for MAPE are 36.05% and 13.25%, while for MDAPE they are 24.48% and 36.33%, respectively. Furthermore, the R² score attained by our proposed model also shows considerable improvement, with increases of 5.39% and 32.58% compared to the comparison benchmarks. Our source code and data are available at https://github.com/duclong1009/Unsupervised-Air-Quality-Estimation.

Keywords Air quality interpolation · Graph neural network · Time-series prediction

1 Introduction

Air pollution and air quality monitoring. Increasing industrialization and urbanization, especially in developing nations, have resulted in severe air pollution in many cities. According to the World Health Organization, air pollution is the leading cause of 36% of lung cancer deaths, 27% of heart attacks, 34% of strokes, and 35% of respiratory failure deaths [1]. In addition, an earlier study [2] has shown a substantial link between air pollution and meteorological variables. Understanding the characteristics of pollutants is, therefore, essential for dealing with issues related to poor air quality and evaluating the efficacy of environmental initiatives. This need has fostered the worldwide development of air quality monitoring solutions. Traditional air quality surveillance relies on stationary monitoring stations that follow stringent criteria [3, 4]. The outstanding advantage of this method lies in its high accuracy. Owing to the high installation and maintenance expenses, however, relatively few stationary monitoring stations are constructed, resulting in extremely sparse coverage. To this end, low-cost sensor-based air quality monitoring systems have emerged as a viable alternative for increasing the granularity of monitoring [5–8]. However, the sensor-based solution suffers from an inherent limitation of low precision. Indeed, low-cost

Viet Hung Vu and Duc Long Nguyen have contributed equally to this work.

Extended author information available on the last page of the article


sensor measurements are impacted by several external meteorological factors and have extremely poor accuracy compared to stationary monitoring stations. Furthermore, although the sensor-based strategy may considerably improve the observation density, covering all regions with sensors is impractical; thus, a comprehensive picture of the geographical distribution of air pollution is often unavailable. In light of this, several efforts have been devoted to a new approach known as spatial air quality estimation, which utilizes data acquired from nearby monitored places to forecast air quality at unmonitored locations.

1.1 Spatial air quality estimation: existing approaches and their limitations

Unlike temporal air quality prediction, i.e., utilizing historical air quality data gathered from a particular area to predict the air quality at the same location in the future [9–13], the spatial air quality estimation problem has not been substantially studied. The conventional approach to the spatial air quality estimation problem relies on geostatistical methods, including inverse distance weighting (IDW) [14], ordinary kriging (OK) [15], ordinary co-kriging (OCK) [16], land-use regression [17], and Biased Sentinel Hospitals Areal Disease Estimation (STPI-BSHADE) [18]. Such non-learning methods, however, often require domain expertise to engineer special parameters to achieve the optimal solution. Moreover, these techniques cannot model the complicated relationships between air quality indicators and other variables (such as geographical distance and meteorological factors), leading to poor estimation accuracy. Also, the performance of these methods is strongly affected by factors such as the sampling density and the data variation [16].

Deep learning, which has recently emerged as a viable method for capturing nonlinear and complicated relationships, is being used to tackle a broad range of issues, including classification and regression tasks. Several early attempts have been devoted to utilizing deep learning for estimating air quality indicators spatially, such as [19–21]. The authors of [19] compared the performance of an ANN and multivariate linear regression in spatially interpolating five regulated air pollutants (nitrogen dioxide, nitrogen monoxide, ozone, carbon monoxide, and sulfur dioxide). The results showed the superiority of the ANN in most cases, especially when the density of the monitoring network is limited. In [20], the authors proposed a spatial interpolation/extrapolation method consisting of a geo-layer and LSTM layers. The geo-layer is responsible for selecting stations that strongly correlate to the targeted location, while the LSTM is in charge of deriving temporal correlations from the historical data collected by the selected stations. Qi et al. [21] offered an autoencoder-based deep learning model to predict the air quality indicators at unlabeled positions. Li et al. [22] constructed a composite deep learning model that leverages an autoencoder-based full residual network and bootstrap aggregation; this model is designed to estimate PM2.5 concentrations with high spatial and temporal precision. Rijal et al. [23] developed an ensemble of CNN and feed-forward neural networks to estimate PM2.5 concentrations from satellite imagery.

Despite these attempts, deep learning-based spatial air quality estimation is still in its infancy: most existing approaches fail to model the spatial relationships of air quality indicators at nearby locations, as well as the correlation between air quality indicators and external factors (e.g., humidity, temperature, wind speed, and wind direction). Indeed, most estimation methods proposed so far have relied on the inverse distance paradigm to model the spatial relationship and have not considered the correlation between air quality indicators and other external variables. These limitations have hampered estimation accuracy and left this topic an unsolved challenge.

1.2 Our solution

This study focuses on spatial estimation of the PM2.5 indicator (one of the most significant pollutants) and proposes a novel approach for addressing the above-mentioned issues. Specifically, we develop a deep learning model that captures three essential properties of air quality indicators: (1) the spatial correlation between air quality indicators gathered at different places; (2) the temporal correlation among air quality indicators obtained from the same location; and (3) the relationship between air quality indicators and external meteorological factors.

The main contributions of our paper are fourfold:

• Firstly, we leverage a graph neural network to model the spatial correlation of the PM2.5 indicator collected from various sites. Additionally, we offer a novel adversarial training technique that strengthens the induction capability and noise resistance, thereby enhancing the representation capability of the graph neural network. Moreover, we design a location-based attention mechanism to emphasize significant locations.
• Secondly, we employ recurrent units within the graph neural network to capture the temporal correlation between air quality series gathered from the same site.
• Thirdly, we develop a data fusion mechanism that integrates multiple meteorological data sources to improve estimation accuracy. Specifically, we introduce a method for representing wind-related information


(considered one of the elements affecting PM2.5 the most) and incorporating it into the input features. Additionally, we develop a feature-based attention mechanism that models the correlations between latent features, thereby highlighting the most significant ones.
• Finally, we conduct extensive experiments on real datasets to assess the performance of our model against existing approaches. The empirical results show that our technique outperforms the others.

The remainder of the paper is organized as follows. Section 2 introduces related works. In Sect. 3, we first formulate the targeted problem and then provide the motivation for and an overview of our proposed architecture. In Sect. 4, the proposed self-supervised graph representation learning approach is discussed thoroughly, followed by a detailed description of the deep interpolation layer in Sect. 5. In Sect. 6, we detail the experiment settings and raise several research questions, which are then answered by the experimental results. Finally, we summarize our work in Sect. 7.

2 Related works

In this section, we introduce related works on air quality prediction and self-supervised graph representation learning in Sects. 2.1 and 2.2, respectively.

2.1 Air quality prediction

Air quality prediction and monitoring have received great interest from both industry and academia over the last two decades. Most existing works focus on predicting the air quality in the near future at the monitoring stations using their historical data [24, 25]. In particular, recent techniques leverage advanced architectures such as recurrent neural networks [11, 26] and graph neural networks [27, 28] to better extract the underlying patterns from past data and achieve remarkable prediction accuracy. For instance, Zhao et al. [29] developed a method combining long short-term memory (LSTM) with a fully connected neural network (LSTM-FC) to predict the PM2.5 value of a specific monitoring station over the next 48 h. Qi et al. combined a graph convolutional network and LSTM in a novel hybrid model [30] to extract the temporal and spatial characteristics of the input data. Liang et al. [28] developed a novel method leveraging a multi-level attention mechanism to model the dynamic spatiotemporal dependencies.

However, these methods cannot be applied to estimate the air quality in unmonitored areas, which is essential in reality due to the shortage of monitoring stations. To predict the air quality indicator at arbitrary areas, earlier techniques often apply deterministic interpolation methods such as Inverse Distance Weighting (IDW) and Ordinary Kriging (OK) [16]. However, these approaches ignore the impact of historical data; moreover, they apply a simple linear/nonlinear equation that cannot model the spatiotemporal dynamics of the air quality data. Recent techniques attempt to address this issue by combining traditional interpolation methods with deep learning architectures. For example, in [31], Ma et al. developed a method combining a bidirectional long short-term memory network (BLSTM) with Inverse Distance Weighting (IDW) to fill in the areas without monitoring stations. Guo et al. [32] proposed KIDW-TCGRU, which first employs K-nearest Inverted Distance Weighting to generate interpolated data before passing it to a Time-Distributed Convolutional Gated Recurrent Unit (TCGRU) model to extract the spatial–temporal characteristics and estimate the air quality. The authors of [21] proposed a deep air model that predicts the PM2.5 value in unmonitored areas using an autoencoder and a fully connected network.

Our work goes beyond the existing works by developing an attention-based graph convolution network (GCN) [33] to capture the nonlinear spatial relationship of air quality between geographic locations. We exploit the noise-resistant properties of the GCN and augment it with an adversarial training process to enhance the induction capability of the prediction model. We also integrate rich meteorological information, especially wind strength and wind direction, to model their dynamic effect on the propagation of air quality from monitored areas to the target location.

2.2 Self-supervised graph representation learning

Self-supervised graph representation learning can be classified into three main genres: generation-based methods, auxiliary property-based methods, and contrast-based methods [34]. In the first paradigm, existing works (e.g., GAE/VGAE [35], MGAE [36], EdgeMask [37]) follow the encoder–decoder method, where the input to the encoder is the output of a graph perturbation process. The generative decoder tries to reproduce the graph from the encoder output, with a loss that minimizes the difference between the original and the reconstructed graphs. The second paradigm, auxiliary property-based methods, has the same training method as the supervised learning paradigm, as both require "sample-label" pairs. However, these methods automatically generate pseudo-labels from several hand-crafted auxiliary graph properties using unsupervised clustering/partitioning algorithms. Then, the decoder tries to perform classification or regression


using the representation learned from the encoder, with a loss that minimizes the difference between the predicted labels and the pseudo-labels. Representative models belonging to this genre are Distance2Cluster, Centrality Score Ranking, Cluster Preserving [38], etc. The disadvantage of this paradigm is its dependence on the accuracy of the pseudo-labels and on the selection of auxiliary properties. The third paradigm, contrast-based methods, is built on the idea of mutual information maximization, where the estimated MI between two views of the same object (e.g., node, subgraph, and graph) is maximized, and otherwise minimized. Based on the pretext tasks used, these algorithms fall into two categories: same-scale and cross-scale contrastive learning. Same-scale approaches such as DeepWalk [39], node2vec [40], GRACE [41], and GraphSAGE [42] use peer instances (e.g., node versus node) to perform discrimination. Cross-scale methods discriminate views across various graph topologies (e.g., node versus graph). For example, DGI [43] learns the contrast between the graph-level representation and the node-level representation. STDGI [44] improved DGI by analyzing spatiotemporal graphs and maximizing the mutual information between node features at adjacent time steps.

3 Problem formulation and approach

In this section, we first give a formulation of the spatial air quality estimation problem in Sects. 3.1–3.3. We then discuss the design principles and the overview of our approach in Sects. 3.4 and 3.5.

3.1 Problem formulation

Monitoring station grid. We assume that there are n monitoring stations located at different places, forming a station grid S. Each monitoring station Si ∈ S in the grid is associated with a coordinate Ci = (φi, λi), where φi and λi represent the latitude and longitude, respectively. We denote by D(Si, Sj) the geographical distance between two stations Si, Sj ∈ S. In this study, the Haversine function [45], a popular method for calculating distances on a sphere, is employed. We constrain that there exists a maximum distance threshold D* such that for every station Si, there exists at least one station Sj within its radius D*, i.e., D(Si, Sj) < D*. In real-world settings, D* often does not exceed 200 km [30]. Given this restriction, our proposed method is best suited for country-level applications.

3.2 Multivariate air quality data

The interpolation of air quality in an unmonitored area relies on data collected from the nearby monitoring stations. The monitoring stations gather two kinds of data: air quality indicators (e.g., PM2.5, CO2, NO2) and meteorological data (e.g., temperature, evaporation, wind speed, and precipitation).

The multivariate historical data at station Si, denoted as Xi, have the form Xi = {Xi^0, ..., Xi^(t-1), Xi^t, ..., Xi^T}, with Xi^t = (Qi^t, Mi^t), where T is the current timestamp, and the vectors Qi^t and Mi^t represent the air quality indicators and meteorological data collected at location Si at timestamp t, respectively. Furthermore, we assume that Qi^t = {qi^(t,0), qi^(t,1), ..., qi^(t,m)}, where qi^(t,0) denotes the target air quality indicator, e.g., PM2.5, and qi^(t,1), ..., qi^(t,m) represent the other indicators. Specifically, the air quality vector Qi^t contains information on PM2.5, AQI, PM10, CO, NO2, and O3 for the Beijing dataset, and PM2.5, PM10, O3, NO2, and SO2 for the UK dataset. The meteorological vector Mi^t includes temperature, surface pressure, evaporation, and precipitation data for both datasets. The length of the vector Qi^t is thus 6 for the Beijing dataset and 5 for the UK dataset, while the vector Mi^t has a constant length of 4 for both datasets. Since the vector Xi^t is the concatenation of the vectors Qi^t and Mi^t, its length is 10 for the Beijing dataset and 9 for the UK dataset.

3.3 Problem formulation

Given a monitoring station grid S = {S1, ..., Sn} with historical multivariate data X = {X1, ..., Xn}, and a target location S* satisfying the distance constraint D(S*, Si) < D* (for all i = 1, ..., n), our objective is to estimate the current PM2.5 indicator at S*. We assume that meteorological data (e.g., temperature, evaporation, wind speed, and precipitation) are available for any region, including the target location S*. This information can easily be collected from publicly available sources such as Copernicus [46]. We denote the vector representing the meteorological data at the target location at time step t as M^t. Furthermore, we also have the detailed coordinate of the target location, denoted by C*, as well as those of the monitoring stations, denoted by C = {C1, ..., Cn}.

The aim of this study is to develop a predictive model, denoted as P, which takes into account the historical data from the monitoring stations, the meteorological data at the target location, and the coordinates of both the monitoring stations and the target location. The model is designed to output an estimate of the PM2.5 indicator at the target location.
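To make the distance computation concrete, the Haversine distance D(Si, Sj) can be sketched as below. This is our own illustrative implementation, not taken from the paper's released code; the function name and the mean Earth radius of 6371 km are our choices.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle (Haversine) distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lmb = math.radians(lon2 - lon1)
    # hav(theta) = sin^2(d_phi/2) + cos(phi1) * cos(phi2) * sin^2(d_lmb/2)
    h = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lmb / 2) ** 2
    return 2.0 * radius_km * math.asin(math.sqrt(h))
```

Station pairs whose distance exceeds the threshold D* (about 200 km in the setting above) would then be excluded when forming the grid.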


The problem can be mathematically represented as follows.

Input:
X = {X1, ..., Xn}: historical data at the monitoring stations.
C = {C1, ..., Cn}: coordinates of the monitoring stations.
M^T: meteorological data at the target location at the current time step.
C*: coordinate of the target location.

Output: p = P(X, C, M^T, C*), where p is the estimated PM2.5 indicator at the target location S* at the current time step.

The notations and operators used in the paper are summarized in Tables 1 and 2, respectively.

Table 2 Important operations

Operator   Definition
·          Dot product
⊙          Hadamard product
[;]        Concatenation

3.4 Design principle

We argue that a framework tackling the above problem should overcome the following challenges:

• C1: Spatiotemporal dependency: Fine-grained air quality exhibits both temporal and spatial dependency [47], which means that the current air quality indicators are often related to their historical values as well as to the air quality indicators at nearby locations. The challenge here is to design an architecture that is capable of capturing both kinds of information at the same time.
• C2: Multi-modal information: The air quality data consist of multivariate features, including different air quality indicators and meteorological features. Some features might impact the prediction of the PM2.5 indicator more than others. Thus, the challenge is how to effectively integrate this varied range of features into the model to boost the accuracy of the target indicator estimation.
• C3: Interpolation capability: In the problem of estimating air quality at an arbitrary unknown location, the lack of historical air quality data raises a significant challenge. To overcome this issue, the framework should be able to model the correlation between locations based on the available stations and generalize to arbitrary places. The direct application of an existing interpolation technique such as Inverse Distance Weighting (IDW) may not fully model the non-Euclidean characteristics of the input data.
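For reference, the classical IDW scheme mentioned in C3 estimates the target value as a distance-weighted average of the station values. A minimal sketch follows (our own illustrative code; the exponent `power=2` is the textbook default, and `eps` is a guard against zero distances, both our choices):

```python
def idw_estimate(station_values, station_distances_km, power=2.0, eps=1e-12):
    """Inverse Distance Weighting: nearer stations receive larger weights."""
    weights = [1.0 / (d ** power + eps) for d in station_distances_km]
    return sum(w * v for w, v in zip(weights, station_values)) / sum(weights)
```

Because the weights depend only on distance, such a kernel cannot react to wind or other meteorological conditions, which is exactly the limitation the learned interpolation layer in this paper targets.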

Table 1 Important notations

Symbol                 Definition
Si                     i-th monitoring station
Ci = (φi, λi)          coordinates of Si
S*                     target location
C* = (φ, λ)            the coordinate of S*
S                      monitoring station grid
C                      coordinates of the monitoring stations
n                      the number of monitoring stations
Qi^t                   air quality indicators collected by Si at time step t
Mi^t                   meteorological information collected by Si at time step t
Xi^t = (Qi^t, Mi^t)    air quality and meteorological information collected by Si at time step t
Xi                     historical data collected by Si up to the current time step
X^t                    data collected by all monitoring stations at time step t
X                      historical data collected by all monitoring stations up to the current time step
M^T                    meteorological information at the target location at the current time step T
s^T                    wind score vector at the target location at the current time step T
A                      monitoring stations' adjacency matrix
G^t = (X^t, A)         the graph representing the information of all monitoring stations at time step t
D(Si, Sj)              distance between Si and Sj
D                      distances between all monitoring stations' locations and the target location
p                      estimated PM2.5 level at the target location S*


3.5 Framework overview

In this work, we aim to address the above-mentioned challenges by developing a framework that performs interpolation at any specified location. To address C1, we propose a temporal graph convolution network (T-GCN) that can learn the spatiotemporal dynamics of the input data. The T-GCN comprises a Gated Recurrent Unit (GRU) [48], which is well known for its strong capability in handling sequence data, and a Graph Convolutional Network [33], a deep network that can effectively capture the relationships between nodes in the spatial domain using node features. We then propose a novel approach based on the contrastive learning paradigm, in which we design a corruption function that applies both global-view corruption and feature-level corruption. This learning mechanism aims to increase the induction capability of the proposed network. To solve challenge C2, we propose a preprocessing step that analyzes the wind direction from the neighboring stations to the target location, which is then used to modify the current wind strength based on how directly it points at the location of interest. Besides, each input feature has a different influence on the target feature. To address this, we offer a feature-aware attention mechanism that adaptively scores the importance of each feature and highlights the relevant ones. Concerning C3, we develop a deep-learning-based interpolation method that leverages a location-aware attention mechanism to learn the inter-station dependency. Furthermore, this technique also considers the meteorological features at the target point as additional input features to further enhance the performance of the proposed model.

To realize the functions discussed above, we design the framework following the self-supervised training paradigm, as shown in Fig. 1. In particular, our model comprises three major components: multivariate input data representation, graph-based spatiotemporal modeling, and attention-based interpolation.

• Multivariate input data representation: The main objective of this block is to receive the input data and organize it into a graph-based structure. Specifically, the input data consist of the historical data acquired from the monitoring network, the coordinates of the monitoring stations and the location of interest, the meteorological information at the target location, and the distances between the target location and the monitoring stations. The data of the monitoring network span from the previous k-th time step to the current time step and are represented by k graphs {G^(T-k+1), ..., G^T}. Each graph G^t (t = T-k+1, ..., T) is represented by (X^t, A), where X^t is the attribute matrix (representing the data collected at all stations at time step t), and A is the adjacency matrix representing the inverse distances between the stations. The details are provided in Sect. 4.1.
• Graph-based spatiotemporal modeling: This component is responsible for modeling the spatial–temporal correlation of the data obtained from the multivariate input data representation block and transforming it into the latent vector space. Specifically, the graph-based spatiotemporal modeling block comprises multiple graph neural networks and a recurrent neural network. The former captures the spatial relationships between the monitoring stations, while the latter extracts the sequential features of the air quality data series across multiple time steps. The features retrieved by this block are then fused with the information of the target location (i.e., meteorological and distance data) before being passed to the attention-based interpolation module. This procedure is described in detail in Sect. 4.
• Attention-based interpolation: This module acts as a decoder that receives information from the graph-based spatiotemporal modeling block and applies two attention mechanisms, namely location-based and feature-based attention, to capture the correlations between the locations and between the latent features and to highlight the most significant ones. The output of the attention-based interpolation module is the final estimation result. The details of this block are discussed in Sect. 5.

4 Spatiotemporal graph representation learning

In this section, we first describe the construction of the spatiotemporal input graph in Sect. 4.1. After that, we go through the structure of the spatiotemporal embedding network in Sect. 4.2. Finally, we elaborate on the process of training the embedding network following the contrastive learning paradigm in Sect. 4.5.

4.1 Construction of the input graph network

The historical data of the monitoring stations over k time steps (from T-k+1 to T) are represented by k graphs. Specifically, at each time step t (T-k+1 ≤ t ≤ T), we construct a complete weighted graph G^t whose nodes represent the monitoring stations and where the weight of each edge reflects the correlation between its two endpoints. As mentioned in Sect. 3.5, G^t can be represented as


Fig. 1 Overview of our proposed framework, which comprises three major components: multivariate input data representation, graph-based spatiotemporal modeling, and attention-based interpolation

G^t = (X^t, A), where X^t ∈ R^(n×l) conveys the air quality and meteorological data of all monitoring stations at time step t, A ∈ R^(n×n) is the adjacency matrix, n is the number of monitoring stations, and l is the total number of air quality indicators and meteorological features. As the correlation between two stations' air quality measurements tends to be inversely related to their geographical distance [49], we utilize the geographical distance to define the weighted adjacency matrix as follows:

A_ij = 1 / D(Si, Sj)   if D(Si, Sj) > 1,
A_ij = 1               if D(Si, Sj) ≤ 1,        (1)

where D(Si, Sj) is the geographical distance, measured in kilometers, between the two stations Si and Sj. Note that when the distance is extremely small, its inverse value tends to approach infinity. Therefore, to avoid the situation where some elements of the adjacency matrix become excessively large, we define a geographical distance threshold; for every station pair whose distance is below this lower bound, the weight of their connecting edge is set to a constant. In this paper, we set both the threshold and the constant to 1. It is important to note that the same distance constraint as described in [30] is applied to the stations in the network; specifically, we only consider stations with adjacent stations located within a maximum distance of 200 km.

Concerning the node features X^t, besides the common scalar factors used in existing works, such as NO2, O3, temperature, and precipitation, we also introduce wind-related information, as this is an important factor that affects air quality indicators, especially PM2.5 [50]. For instance, air pollution is more significant when the wind blows directly from the contaminated source to the target area. However, the wind vector, including the strength and direction, is often unavailable at the monitoring stations. Therefore, we propose a method to measure the influence of the wind blowing from a monitoring station to the targeted location, taking the wind direction into account.

Let C* = (λ, φ) and Ci = (λi, φi) be the coordinates of the targeted location and the monitoring station Si, respectively. Using the Haversine formula, we determine the angle between Si and S*, denoted as θi, as follows:

hav(θi) = hav(φ - φi) + cos(φi) cos(φ) hav(λ - λi),  where  hav(θi) = sin²(θi / 2).        (2)

Intuitively, the smaller θi is, the more impact the wind from Si imposes on the targeted location. Motivated by this intuition, we define the wind score si of a monitoring station Si with respect to the targeted location as follows:

si = cos(θi)   if 0° ≤ θi < 90°,
si = 0         if θi ≥ 90°.        (3)


This hand-crafted feature is then combined with the other air quality indicators and the meteorological data to form the node attributes of the input graphs.

4.2 Spatiotemporal embedding network

We apply a deep network architecture to the constructed graphs to learn spatial and temporal properties of the input data simultaneously. The spatiotemporal network consists of two primary components: a spatial extractor, which leverages a graph neural network to capture spatial correlation patterns from the input stations, and a temporal extractor, which utilizes a recurrent neural network to encode the temporal correlation. Together, the two extractors capture the spatiotemporal dynamics. First, each input graph G^t (carrying information about the monitoring station grid at time step t) is fed to a GCN unit to extract the spatial correlation between the stations at that time step. The output of each GCN unit at time step t, along with the output of the GRU unit at the previous time step, is then sent into the GRU unit to capture the temporal properties up to time step t. Finally, the output of the last GRU unit is fused with the node attributes of the last input graph (i.e., G^T) to generate latent vectors encapsulating the spatiotemporal information of all historical data acquired from the monitoring station grid. More specifically, the operation inside each GRU cell at time step t is defined as follows:

r^t = σ(W_r f_gc(X^t, A) + U_r h^(t-1) + b_r),
z^t = σ(W_z f_gc(X^t, A) + U_z h^(t-1) + b_z),
c^t = tanh(W_c f_gc(X^t, A) + U_c (r^t ⊙ h^(t-1)) + b_c),        (5)
h^t = (1 - z^t) ⊙ c^t + z^t ⊙ h^(t-1),

where the W and U terms denote the weight matrices of the gates, the b terms are bias vectors, ⊙ denotes the Hadamard product, σ is the activation function, and r^t, z^t, c^t, and h^t are the reset gate, update gate, candidate hidden state, and hidden state of the GRU unit, respectively.
structure of the real-world dataset and to capture the spatial gate and the b terms are bias vectors,  depicts the
dependency between data points in various locations. Hadamard product, r represents the activation function,
Hence, we propose using GCN [33], a multi-layer graph and rt , zt , ct , and ht are the reset gate, update gate, candi-
neural network that applies a neighborhood aggregation date hidden state, and hidden state of the GRU unit,
scheme f ð:Þ on each layer. Considering the k-layer GCN respectively.
network, the feed-forward pass process is:

ðkÞ ðk1Þ
 ð1Þ

f f . . .f ðGÞ. . . , where G is the input network. 4.3 Adversarial self-supervised training
The function f ð:Þ takes the hidden features of the previous
layer as input and produces the features of the subsequent We design an adversarial self-supervised learning mecha-
layer: nism (see Fig. 2) to train the embedding network effi-
1 1 ciently. Besides the embedding network, our mechanism
Hlþ1 ¼ f ðG; Hl ; W l Þ ¼ rðD^2 A^D^2 Hl W l Þ; ð4Þ
consists of two components: the graph corruptor and the
where A^ ¼ A þ In is the adjacency matrix of the with added discriminator. The former is responsible for tweaking the
self-connections, In is the identity matrix, D^ is the diagonal original graph to generate a negative sample, while the
matrix representing the node degree of the adjacency latter distinguishes between the positive and negative
matrix A, and W l represents the weight matrix in the l-th examples. Our training flow is performed as follows. Each
layer, Hl is the l-th layer output embedding with H0 is the training sample Xt ¼ fxt1 ; :::; xtn g (i.e., the graph repre-
matrix comprised of the attributes of all nodes, r denotes senting information of all monitoring stations at time step t)
the activation function, which is ReLU in our proposed is fed into the graph embedding network, which generates
approach. Aiming to balance between the expressive power embedding vectors Ht ¼ fht1 ; :::; htn g with hti representing
and the computation efficiency of the model, we choose the information about node xti . On the other hand, the original
number of network layers k equals 2. graph Xt is also corrupted by the Graph Corruptor to yield
X~ t . Every pair of an embedding vector ht and its corre-
i
4.2.2 Temporal extractor sponding node in the original graph, i.e., xti , is considered
positive, whereas a pair of an embedding htj and its corre-
We leverage GRU network [48] to extract temporal char- sponding node in the corrupted graph, i.e., x~tj is considered
acteristics of data series collected from monitoring stations negative. The graph embedding network and discriminator
in multiple time steps. GRU is chosen due to its simplicity are trained adversarially to minimize the loss associated
and capability in dealing with gradient-related problems with positive pairs and maximize the loss contributed by
(i.e., gradient vanishing, gradient explosion). negative pairs.
Inspired by T-GCN [51], we combine the two extractors
in the following manner to capture the spatiotemporal

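A minimal NumPy sketch of Eqs. (4)–(5) follows. The parameter shapes, the weight dictionary, and the placement of the reset gate inside the candidate state follow the standard GRU formulation; they are our assumptions, not the paper's exact implementation.

```python
import numpy as np

def gcn_layer(A, H, W):
    """Eq. (4): ReLU(D^-1/2 (A + I) D^-1/2 H W) with symmetric normalization."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tgcn_gru_step(A, X_t, h_prev, p):
    """Eq. (5): one GRU update whose input is the GCN output f_gc(X_t, A)."""
    f = gcn_layer(A, X_t, p["W_gc"])                                 # spatial features at step t
    r = sigmoid(f @ p["W_r"] + h_prev @ p["U_r"] + p["b_r"])         # reset gate
    z = sigmoid(f @ p["W_z"] + h_prev @ p["U_z"] + p["b_z"])         # update gate
    c = np.tanh(f @ p["W_c"] + (r * h_prev) @ p["U_c"] + p["b_c"])   # candidate state
    return (1.0 - z) * c + z * h_prev                                # new hidden state h_t
```

Iterating `tgcn_gru_step` over the input graphs G_1, ..., G_T yields the latent vectors that the decoder consumes.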
Fig. 2 An illustration of our proposed self-supervised training process, which consists of two components: the graph corruptor and the discriminator. The former is responsible for tweaking the original graph and generating a negative sample, while the latter is for distinguishing between positive and negative pairs
In the following, we provide the details of the graph corruptor and the discriminator.

4.3.1 Graph corruptor

The objective of the corruptor is to manipulate the original graph in order to generate a so-called negative sample that is distinct from the original. Notably, the more diverse the corrupted graphs, the more generalized the data used to train the graph embedding network. Therefore, we design the graph corruptor module to augment the original graph at both the structure and attribute levels.

For the structural perturbation, we apply a row-wise shuffling process as in [44]. Since we randomly swap the values of the node features, this process is equivalent to changing the topological structure of the graph. Then, we apply attribute corruption by adding Gaussian noise to each node feature. More specifically, we sample a random vector m̃ ∈ R^l, where l is the number of node attributes. Each element of this vector is drawn from a Gaussian distribution over the range [0, 1]. Then, the corrupted node features X̃ are computed by:

    X̃ = [x_1 ⊙ m̃, x_2 ⊙ m̃, ..., x_n ⊙ m̃]^T,        (6)

where ⊙ is the Hadamard product operator.

4.3.2 Discriminator

Based on the concept of mutual information (MI) maximization [44], the discriminator is trained such that the mutual information between each node embedding and its respective node feature is maximized, and minimized otherwise. Let X = {x_1, ..., x_n} be an input graph. Besides, let us denote by X̃ = {x̃_1, ..., x̃_n} and H = {h_1, ..., h_n} the corrupted version and the embedding vectors of X, respectively. We generate several positive pairs (h_i, x_i) and negative pairs (h_j, x̃_j) (1 ≤ i, j ≤ n). For each (h_i, x_i), the discriminator D predicts the probability of (h_i, x_i) being a positive pair by applying a bilinear scoring function:

    D(x_i, h_i) = σ(h_i^T W x_i).        (7)

Similarly, the probability of a negative pair (h_j, x̃_j) being a positive pair is defined by

    D(x̃_j, h_j) = σ(h_j^T W x̃_j),        (8)

where W is a learnable matrix and σ is the logistic sigmoid activation function. We design a contrastive loss L_ssl as follows:

    L_ssl = (1 / 2N) ( E_(X,A) [ Σ_{i=1}^{N} log D(x_i, h_i) ] + E_(X̃,Ã) [ Σ_{j=1}^{N} log(1 − D(x̃_j, h_j)) ] ),

where N is the number of positive and negative samples. The encoder is then trained separately following this loss function and updates its parameters after each epoch using the Adam optimizer [52].
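The corruptor, the bilinear discriminator, and the loss can be sketched as below. The Gaussian parameters (mean 0.5, std 0.15, clipped to [0, 1]) are our assumption, since the paper only specifies a Gaussian over [0, 1]; the loss is written in its negated form, as a quantity to minimize.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(X):
    """Graph corruptor: row-wise shuffling (structural perturbation)
    followed by attribute corruption with a noise vector m~, Eq. (6)."""
    X_neg = X[rng.permutation(X.shape[0])]                          # shuffle node rows
    m = np.clip(rng.normal(0.5, 0.15, size=X.shape[1]), 0.0, 1.0)   # assumed noise parameters
    return X_neg * m                                                # Hadamard product per row

def discriminator(x, h, W):
    """Eqs. (7)-(8): bilinear score sigma(h^T W x)."""
    return 1.0 / (1.0 + np.exp(-(h @ W @ x)))

def ssl_loss(X, X_neg, H, W, eps=1e-9):
    """Negated L_ssl: binary cross-entropy over positive (h_i, x_i)
    and negative (h_j, x~_j) pairs."""
    pos = np.array([discriminator(x, h, W) for x, h in zip(X, H)])
    neg = np.array([discriminator(x, h, W) for x, h in zip(X_neg, H)])
    n = X.shape[0]
    return float(-(np.log(pos + eps).sum() + np.log(1.0 - neg + eps).sum()) / (2 * n))
```

In a real training loop the gradients of this loss would flow back into both W and the encoder producing H; the sketch only shows the forward computation.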
relationship within latent characteristics that pertain to the same locations.

5.1 Location-aware attention

After passing the input graphs through the graph-based spatiotemporal modeling module, we acquire embedding vectors, each of which retrieves information related to a monitoring station. Obviously, the correlation of each monitoring station with the target location is different. Consequently, the contribution of each monitoring station in determining PM2.5 at the target location is also distinct. Typically, nearby stations have a more significant impact on the targeted site than distant stations. However, these effects vary throughout time and are influenced by numerous circumstances. For instance, if a strong wind blows from a monitoring station to the target location, the air quality values acquired at the monitoring station may exhibit a substantial association with those at the target location. Inspired by this observation, we design a new attention mechanism that highlights the most relevant monitoring stations. The architecture of this mechanism is depicted in Fig. 3. Specifically, we first create a so-called interpolated embedding vector h̄, defined as the weighted sum of the embedding vectors obtained from the graph modeling module:

    h̄ = (1 / Σ_{k=1}^{n} d_k^(−1)) Σ_{j=1}^{n} d_j^(−1) h_j,        (9)

where h_j is the embedding vector corresponding to the monitoring station S_j, d_j is the distance between the target location S and the monitoring station S_j, and n is the number of monitoring stations. Intuitively, h̄ can be seen as a rough estimation of the air quality at the targeted location using the information from the monitoring stations.

Fig. 3 An illustration of our location-aware attention mechanism, which highlights monitoring stations with the highest correlation to the target site

The interpolated embedding vector h̄ is then combined with the meteorological data at the target location to form the query, while the embedding vectors of the monitoring stations are employed as the keys and values for the attention block. For each key h_i, its attention weight g_i is defined as follows:

    g_i = W_g h_i ⊙ U_g [h̄, m],
    m = W'_g [M_T, S_T],        (10)

where W_g, U_g, W'_g are learnable parameters, M_T and S_T are the meteorological data and the wind score at the targeted location at the current time step T, and h̄ is the interpolated feature vector. Finally, the attention weights g_i are normalized and employed to calculate the final context vector as follows:

    β_i = exp(g_i) / Σ_{j=1}^{n} exp(g_j),
    h̃_location = Σ_{i=1}^{n} β_i · h_i.        (11)

5.2 Feature-aware attention

As described in the previous section, the output of the graph-based spatiotemporal modeling block is a set of latent vectors. However, not every component in these latent vectors contributes equally to the prediction of the PM2.5 indicator. To reduce the impact of irrelevant components and emphasize those of highly correlated components, we design a so-called feature-aware attention mechanism, as shown in Fig. 4.

Fig. 4 An illustration of our proposed feature-aware attention mechanism, which highlights the most significant latent features

The feature-aware attention uses the same
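Eqs. (9)–(11) translate directly into NumPy. Because Eq. (10) leaves the reduction of the projected key and the projected query to a scalar score implicit, the dot-product scoring used below is our assumption, as are all parameter shapes.

```python
import numpy as np

def interpolated_embedding(H, d):
    """Eq. (9): inverse-distance-weighted average of station embeddings."""
    w = 1.0 / np.asarray(d, dtype=float)
    return (w[:, None] * H).sum(axis=0) / w.sum()

def location_aware_attention(H, h_bar, m, Wg, Ug):
    """Eqs. (10)-(11): score each station embedding h_i against the query
    built from [h_bar, m], then softmax-average the embeddings."""
    q = Ug @ np.concatenate([h_bar, m])     # projected query
    g = (H @ Wg.T) @ q                      # one scalar score per station (assumed form)
    beta = np.exp(g - g.max())
    beta /= beta.sum()                      # normalized weights, Eq. (11)
    return beta @ H                         # context vector h~_location
```

Since the weights β form a convex combination, the resulting context vector always lies within the component-wise range of the station embeddings.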
query as the location-aware attention, i.e., the concatenation of the interpolated feature vector and the meteorological data of the targeted location. On the other hand, there are m keys/values, created by concatenating components from the embedding vectors produced by the graph-based spatiotemporal modeling block. Specifically, the j-th key is the combination of the j-th latent features of the embedding vectors h_1, ..., h_n. Let k_j be the j-th key; then its attention weight is defined as follows:

    e_j = W_f k_j ⊙ U_f [h̄, m],
    m = W'_f [M_T, S_T],        (12)

where W_f, U_f, W'_f are learnable parameters and M_T, S_T are the meteorology feature vector and the wind score vector at the target location at the current time step T, respectively. Finally, the attention weights are normalized and utilized to calculate a context vector as follows:

    c_j = exp(e_j) / Σ_{j'=1}^{m} exp(e_j'),
    h̃_feature = Σ_{j=1}^{m} c_j ⊙ k_j.        (13)

5.3 Air quality interpolator training

Given the context vectors generated by the attention blocks, we now present the last step in our model, which generates the estimation result. We observe that meteorological factors, including precipitation, temperature, evaporation, and wind, play a substantial role in determining the PM2.5 indicator. Therefore, we propose integrating these meteorological data and the wind score vector at the target location with the context vectors via a fully connected layer. The output of our model can be represented mathematically as follows:

    O = FC(σ(FC([h̃_location, h̃_feature, M_T, S_T]; W_i)); W_o),        (14)

where W_i and W_o are the learnable matrices of the two fully connected layers and σ is the ReLU activation function.

We leverage the MSE (mean square error) loss function to train the model, whose formula is as follows:

    MSE = (1/N) Σ_{t=1}^{N} (y_t − ỹ_t)²,        (15)

where y_t and ỹ_t are the estimation result and the ground truth, respectively, and N is the number of observations in the input data batch. The overall training process of our framework is summarized in Algorithm 1.

Algorithm 1 Training process
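A sketch of Eqs. (12)–(15), under the same assumptions as the location-aware block (scalar dot-product scoring, assumed shapes): the j-th key is the j-th column of the embedding matrix H, i.e., one latent feature gathered across all stations.

```python
import numpy as np

def feature_aware_attention(H, q, Wf, Uf):
    """Eqs. (12)-(13): keys k_j gather the j-th latent feature of all
    stations (columns of H); keys are softmax-weighted and summed."""
    K = H.T                                  # m keys, one per latent dimension
    s = (K @ Wf.T) @ (Uf @ q)                # scalar score e_j per key (assumed form)
    c = np.exp(s - s.max())
    c /= c.sum()
    return c @ K                             # context vector h~_feature

def predict(h_loc, h_feat, M, S, Wi, Wo):
    """Eq. (14): fuse both context vectors with meteorology M and wind
    score S through two fully connected layers with a ReLU in between."""
    z = np.concatenate([h_loc, h_feat, M, S])
    return float(np.maximum(z @ Wi, 0.0) @ Wo)

def mse(y, y_hat):
    """Eq. (15): mean squared error training loss."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean((y - y_hat) ** 2))
```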
5.4 Air quality interpolation process for unsupervised places

Given the trained model, the historical data from the monitoring stations, the meteorological data at the target location, and the distances from the target location to the monitoring stations, we then perform the interpolation process. It is important to highlight that during the training process of the encoder, we use all input features, including PM-related features such as PM2.5, PM10, and AQI. Conversely, during the training of the interpolator, we restrict our focus to meteorological features at the target location (e.g., temperature, precipitation, surface pressure) and historical input features from neighboring stations. This training scheme aligns with the methodologies employed in related works [31, 32]. Firstly, the historical data of the monitoring stations are fed into the graph-based spatiotemporal modeling block to extract embedding vectors. These embedding vectors are then combined with the meteorology information at the target location and passed through the two attention blocks to generate the feature-aware and location-aware context vectors. Finally, the context vectors are fused with the meteorological data and the wind score vector at the target location by a fully connected layer to produce the final prediction result. The whole inference process is summarized in Algorithm 2.

Algorithm 2 Testing process

6 Performance evaluation

In this section, we conduct experiments with the aim of answering the following research questions:

(RQ1) Does our proposed model outperform the baseline methods?
(RQ2) How important is each design choice affecting our model?
(RQ3) How important is each input feature to the estimation of air quality?
(RQ4) Is our technique interpretable?

In the following, we first describe the datasets used throughout the experiments in 6.1 and the experimental setting in 6.3. We then conduct our empirical evaluations to answer the stated questions, including an end-to-end comparison in 6.5, an ablation study in 6.6, a feature importance analysis in 6.7, and an analysis of the relationship between distance and the predictive performance in 6.10.

6.1 Study area and experimental setup

Study area We conducted our study in two distinct geographical locations: Beijing, China, and the United Kingdom. Further details regarding the specifics of each dataset are provided below.

• UK Dataset In 2021, Reani et al. [53] published a dataset of UK daily meteorology, air quality, and pollen measurements for four consecutive years from 2016 to 2019. This dataset covers an area of 242,295 km², including varied kinds of topography consisting of rugged, undeveloped hills and low mountains, and rolling plains. The authors collected daily data of temperature, evaporation, precipitation, wind speed, O3, NO2, SO2, PM10, and PM2.5 over a period of 1462 days. The dataset provides data from 141 air quality stations across the United Kingdom.
• Beijing Dataset The Beijing dataset [54] collects the air quality and meteorological information of 35 stations across Beijing in 2018, comprising 8643 data points. This dataset covers an area of 16,441 km², mostly including urban areas and industrial areas with dense traffic networks; hence, high air pollutant indices are usually recorded. It includes hourly recordings of 6 types of pollutants, namely PM2.5, PM10, NO2, CO, SO2, and O3, over a period of 8643 h. Besides, meteorology features including temperature, evaporation, precipitation, and wind speed are also recorded.
Table 3 summarizes the dataset statistical information, e.g., the mean, median, standard deviation (std), and max and min values of the processed datasets.

6.1.1 Experimental setup

In accordance with the restriction outlined in [30], our analysis is restricted to stations with adjacent stations located within a distance of 200 km. As a result, the number of stations meeting this criterion is reduced to 30 for the UK dataset, while the Beijing dataset retains 35 stations. These stations are then randomly partitioned into training, validation, and testing sets. During the training phase, the encoder model is trained to learn embeddings from all features, including the target feature (e.g., PM2.5), from all training stations. The decoder then utilizes current information from all training stations to predict the target feature value for the validation stations. In the testing phase, the model's performance is evaluated by predicting the current target feature for the testing stations. The training/validation/testing ratio is set to 7:5:4 for the Beijing dataset and 9:4:4 for the UK dataset, respectively. Figures 5 and 6 illustrate the distribution of monitoring stations within the training, validation, and testing datasets for both the Beijing and UK datasets.

6.2 Setting

Evaluation indicators In this work, we use five statistical indicators to evaluate the performance of models: root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), median absolute percentage error (MdAPE), and R2 score (R2). The formulas of these indicators are presented in (16):

    RMSE = sqrt( (1/N) Σ_{t=1}^{N} (y_t − ỹ_t)² ),
    MAE = (1/N) Σ_{t=1}^{N} |y_t − ỹ_t|,
    MdAPE = median( |(y_t − ỹ_t) / ỹ_t| ),        (16)
    MAPE = (1/N) Σ_{t=1}^{N} |(y_t − ỹ_t) / ỹ_t|,
    R² = 1 − Σ_{t=1}^{N} (y_t − ỹ_t)² / Σ_{t=1}^{N} (y_t − ȳ_t)²,

where y_t and ỹ_t are the ground truth and the predicted result of the model, respectively. We also use ȳ_t and ỹ̄_t to denote the mean values of y_t and ỹ_t, respectively.

6.2.1 Benchmarks

In order to verify the performance of our proposed model, we compare our method with deep learning-based techniques for a fair comparison, namely KIDW-TCGRU, BiLSTM-IDW, AttPolling FCNN, and FCNN.

• BiLSTM-IDW: introduced by Ma et al. [31] in 2019. It is a two-phased model using BiLSTM to learn the output feature embedding and inverse distance weighting (IDW) to aggregate the feature embedding using a distance-based linear function. The aggregated output feature is then forwarded to a prediction layer to output the final interpolation value.
• KIDW-TCGRU: proposed by Guo et al. [32] in 2020. This approach is a combination of inverse-distance-weighting KNN (IDW-KNN) and the TCGRU model. The IDW-KNN method aims to select the nearest stations to perform interpolation. The TCGRU model, which is the combination of a time-distributed convolutional neural network (TCNN) and a gated recurrent network (GRU), otherwise helps the model learn spatial and temporal characteristics.
• AttPolling FCNN: proposed by Colchado et al. [55]. This method introduces a deep learning model based on an attention mechanism that learns the impact scores of neighbor nodes regardless of distance information. Then, the prediction layer, consisting of multiple fully connected layers, combines the weighted feature vectors from neighboring nodes to calculate the PM2.5 concentration value at the target point.
• FCNN: This method is a simplified version of the AttPolling FCNN method. However, instead of automatically finding the neighboring nodes, these stations are determined using the distance information. This approach then uses the meteorology information and PM2.5 values from neighbor nodes as input for the prediction layer, which consists of multiple fully connected layers, to output the interpolated value.

6.2.1.1 End-to-end comparison In this section, we evaluate the accuracy of our method (GEDE) in contrast to the baseline methods.

Table 4 shows the detailed performances of the proposed method and the baseline models on both datasets. The best experimental results are highlighted in bold for ease of reference. Overall, our proposed model's result exceeds the performance of the other baseline models on both datasets.
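The indicators in Eq. (16) are straightforward to compute; note that, as written in the paper, the percentage errors are taken relative to the prediction ỹ_t. A minimal sketch:

```python
import numpy as np

def metrics(y, y_hat):
    """Eq. (16): RMSE, MAE, MAPE, MdAPE, and R2 between ground truth y
    and prediction y_hat (percentage errors divided by y_hat, as in the paper)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    err = y - y_hat
    ape = np.abs(err / y_hat)
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "MAPE": float(np.mean(ape)),
        "MdAPE": float(np.median(ape)),
        "R2": float(1.0 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)),
    }
```

For a perfect prediction this returns RMSE = MAE = 0 and R2 = 1; R2 can be negative when the model is worse than predicting the mean.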
Table 3 A detailed statistical description of the Beijing and UK datasets

Dataset  Feature              Max     Min     Mean   Median  Std
Beijing  CO (µg/m³)           1.9     0.1     0.83   0.7     0.48
         NO2 (µg/m³)          104.95  3       48.42  43      28.04
         O3 (µg/m³)           191     1       57.59  46      53.9
         SO2 (µg/m³)          18      1       5.6    3       4.82
         PM10 (µg/m³)         232     1       90.4   77.11   60.85
         PM2.5 (µg/m³)        150.93  3.6     55.3   46.6    39.34
         Temperature (°C)     39      −16.9   12.17  13.14   13.07
         Evaporation (rH³)    99.18   47.8    83.49  84.38   7.6
         Precipitation (mm)   18      0       0.195  0       0.97
         Wind speed (m/s)     8.25    0.03    1.93   1.72    1.16
UK       O3 (µg/m³)           48.5    0.56    28.24  29.02   7.93
         NO2 (µg/m³)          23.97   2.31    7.51   6.5     3.28
         SO2 (µg/m³)          3.76    0.04    0.43   0.36    0.29
         NOx (µg/m³)          75.25   3.29    9.68   8.25    5.3
         PM2.5 (µg/m³)        71.85   2.93    10.76  8.04    7.58
         PM10 (µg/m³)         92.49   5.45    22.88  21.47   10.26
         Temperature (°C)     25.74   −7.79   7.19   7.53    4.95
         Evaporation (rH³)    170     10.65   39.4   36      24.8
         Precipitation (mm)   45      0       0.3    0.06    0.57
         Wind speed (m/s)     12.2    0.09    4.52   4.36    2.11
Fig. 5 An illustration of the distribution of the Beijing dataset

Specifically, in terms of MAE error, our proposed model improves from 4.93 to 34.88% on the UK dataset, while this number for the Beijing dataset ranges from 6.89 to 31.94%. For the other indicators, the proposed method's performance still exceeds the others'. For example, the average improvement of our proposed model in terms of RMSE is 13.33%
Fig. 6 An illustration of the distribution of the UK dataset

and 14.37% on the Beijing and UK datasets, respectively. For the MAPE indicator, these statistics are 36.05% and 13.25%, while for MdAPE they are 24.48% and 36.33%. Finally, considering the R2 score, the average improvement of our proposed model on the two datasets is 5.39% and 32.58%, respectively.

To further facilitate understanding, we visualize the predicted and ground-truth values in Figs. 7 and 8. Specifically, we choose two examples from the two datasets. The first example was obtained from a station named "New North Zone" from 14/12/2018 to 23/12/2018, while the second was obtained from a station named "EDNS" from 10/06/2019 to 07/11/2019. In both figures, it is noticeable that the proposed method predicts low points significantly better than the others. However, due to the nature of the method, which heavily relies on neighbor stations' PM2.5 values, the predicted PM2.5 indicator's tendency is strongly affected by the tendencies of the neighboring stations. An example of this problem is the surge in the predicted values on the Beijing dataset between the 720th and 750th timesteps.

6.2.1.2 Ablation study Aiming to answer research question RQ2, we conduct experiments in which we remove each component from the model and record the variation in performance. The details of each case are as follows:

• GEDE-1: We remove the local attention mechanism to examine the impact of this mechanism on the final output result.
• GEDE-2: The global attention mechanism is removed from the architecture to explore the importance of this mechanism.
• GEDE-3: The graph neural network is removed from the architecture of the encoder. The encoder is then used to train the embeddings for the nodes using fully connected layers and the recurrent neural network.
• GEDE-4: The recurrent neural network is removed from the architecture of the encoder. Specifically, we want to monitor the recurrent neural network's impact on the encoder's learning.
• GEDE-5: In this variant, we measure the impact of the meteorology feature embedding layer by removing it from the original architecture. In this specific experiment, the decoder estimates the output value using only the air quality features.
• GEDE-6: We remove the node-feature corruption from the corrupt function to examine the impact of this approach in learning a highly representative feature embedding.

Table 5 illustrates the results of the mentioned ablation models for several evaluation indicators. We notice that our full model GEDE outperforms the other variants, which shows our design choices' positive impact. In
particular, the final GEDE model's performance surpasses the two attention-ablated variants, GEDE-1 and GEDE-2, by 3.52% and 4.86% in the MAE metric, averaged over both datasets. This illustrates the effectiveness of combining both attention mechanisms compared to using only one. A similar drop of averaged MAE, by 17.43% and 18.78%, can be seen for GEDE-3 and GEDE-4, respectively, which highlights the benefits of using the graph neural network and the recurrent neural network to learn the spatiotemporal characteristics. However, as these statistics indicate, it is noticeable that the spatial characteristics have more impact than the temporal characteristics. The full model outperforms the variant GEDE-5 by 12.5% in the MAE metric, which shows the positive impact of the meteorology feature embedding layer. Last but not least, the variant GEDE-6 has an averaged MAE 16.47% higher than the full model. This indicates the robustness of our node-feature corrupt function compared to previous approaches, which allows the final model to learn a general feature embedding and adapt well to unseen data.

6.2.1.3 Feature importance analysis To address research question RQ3, we conduct two analyses:

1. First, we measure the correlation coefficient of each input feature (excluding PM-related indicators) to the target feature (i.e., PM2.5).
2. Second, we select the five features that correlate most with PM2.5 and measure their Shapley values in predicting PM2.5 using our proposed model.

6.2.2 Correlation coefficient

The coefficient of each input feature with PM2.5 is plotted in Fig. 9. For the Beijing dataset, the most correlated features are CO, NO2, SO2, surface pressure, and temperature. For the UK dataset, they are NO2, SO2, O3, surface pressure, and temperature. These findings align with existing works on relatedness to PM2.5 [56–58]. Furthermore, as Fig. 9 illustrates, there is a clear and robust correlation between CO, NO2, O3, SO2, and the PM2.5 index, which is also confirmed in [59] and [60]. This strong correlation is likely due to their common sources, stemming from vehicle emissions, industrial processes, and fossil fuel combustion. This connection is particularly evident in densely populated, industrialized, and heavily trafficked areas like Beijing. On the other hand, the temperature feature shows a weaker correlation in both datasets; however, it still positively influences the model's predictive ability, as stated in [58]. It is worth noting that, apart from the air quality indices, the other indices do not strongly correlate with PM2.5. Overall, this experiment provides valuable insights into the correlation between PM2.5, air quality indices, and weather features in the UK and Beijing datasets.

However, the correlation coefficients can only answer the question of which features are most correlated with PM2.5, not which features are most influential in estimating PM2.5. To address this question, we adopt the Shapley value [61] to evaluate the influence of each individual input feature on the performance of our proposed model.

6.2.3 Shapley values

The Shapley value associated with each feature represents the average marginal impact of that feature's value across all possible feature combinations. Let us denote by F_k an input feature, and by Z = {F_1, ..., F_m} the set of all input features (excluding PM-related indicators). Moreover, for each subset S ⊆ Z, let us denote by val(S) the MAE of our proposed model when using S and PM2.5 from the monitoring stations to predict PM2.5 at the targeted location. The Shapley value φ_Fk(val) of a feature F_k (k = 1, ..., m) is calculated using the following formula:
Table 4 A detailed comparison of the average accuracy of fine-grained air quality estimation methods

Dataset  Model            MAE    RMSE   MAPE  MdAPE  R2
Beijing  FCNN             11.7   15.36  0.67  0.26   0.86
         BiLSTM-IDW       13.25  18.28  0.61  0.322  0.85
         KIDW-TCGRU       16.28  20.38  0.78  0.432  0.76
         AttPolling FCNN  11.15  15.03  0.63  0.252  0.87
         GEDE             10.6   14.12  0.43  0.239  0.88
UK       FCNN             2.33   3.58   0.38  0.297  0.45
         BiLSTM-IDW       2.59   3.6    0.39  0.39   0.42
         KIDW-TCGRU       2.85   4.18   0.52  0.52   0.43
         AttPolling FCNN  2.32   3.35   0.37  0.257  0.48
         GEDE             2.16   3.19   0.36  0.233  0.59

The best results are highlighted in bold
Fig. 7 Visualization of prediction result against the ground-truth on the Beijing dataset

    φ_Fk(val) = Σ_{S ⊆ Z \ {F_k}} ( |S|! (m − |S| − 1)! / m! ) · ( val(S ∪ {F_k, PM2.5}) − val(S ∪ {PM2.5}) ).        (17)

It is worth noting that our model utilizes PM2.5 and other indicators from monitoring stations to predict PM2.5 at an arbitrary location of interest; thus, the values of PM2.5 from the monitoring stations must be included in the input data. A lower Shapley value indicates a higher influence of the corresponding feature on the estimation accuracy of the PM2.5 indicator. As the number of subsets S increases exponentially with the cardinality of Z, in this experiment we only investigate the Shapley values of the five input features that correlate most with PM2.5. Specifically, for the Beijing dataset, the selected features encompass CO, NO2, SO2, surface pressure, and temperature. In contrast, the UK dataset's chosen features consist of O3, SO2, NO2, surface pressure, and temperature. As illustrated in Fig. 10, the most influential factors impacting the performance of the estimation model on the Beijing dataset are NO2 and SO2, ranking first and second, respectively, in terms of influence. In the UK dataset, O3 exerts the greatest influence while NO2 holds the second most significant effect, which is noteworthy given O3's close association with the PM2.5 index, as mentioned in [62]. Moreover, SO2 and CO also make positive contributions to the model's predictive capabilities, consistent with their relationship with PM2.5 as indicated in [63], where it is stated that "PM2.5 was positively correlated with CO at both daily and monthly scales." Given that these features are all linked to air quality characteristics, their positive impact on predicting the PM2.5 index is unsurprising. Furthermore, temperature exhibits a favorable effect in both datasets; however, its contribution to the model's estimation performance is not on par with the air quality index features.

6.2.3.1 Impacts of geographical distance on estimation accuracy In this section, we investigate the impacts of geographical distance on estimation accuracy. Specifically, we fix the set of monitoring stations used in the training phase, which we refer to as training stations. We then vary
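For a handful of features, Eq. (17) can be evaluated exactly. In the sketch below, val is any callable mapping a feature subset to the model's MAE (the PM2.5 input is left implicit), and the helper names are ours:

```python
from itertools import combinations
from math import factorial

def shapley(features, val):
    """Eq. (17): exact Shapley value of each feature. val(S) returns the
    model's MAE when using subset S (plus PM2.5) as input, so a lower
    (more negative) phi means the feature reduces the error more."""
    m = len(features)
    phi = {}
    for fk in features:
        rest = [f for f in features if f != fk]
        total = 0.0
        for r in range(m):                      # |S| ranges over 0 .. m-1
            for S in combinations(rest, r):
                w = factorial(len(S)) * factorial(m - len(S) - 1) / factorial(m)
                total += w * (val(frozenset(S) | {fk}) - val(frozenset(S)))
        phi[fk] = total
    return phi
```

With an additive toy val, each feature's Shapley value recovers exactly its own marginal contribution, and the values sum to val(Z) − val(∅) (the efficiency property).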
Fig. 8 Visualization of prediction result against the ground-truth on the UK dataset

Table 5 Effects of different components on model's performance

Dataset  Metric  GEDE   GEDE-1  GEDE-2  GEDE-3  GEDE-4  GEDE-5  GEDE-6
Beijing  MAE     10.21  10.44   10.93   12.91   13.37   12.23   13.46
         RMSE    15.22  16.06   17.20   17.89   18.89   17.60   25.54
         MdAPE   0.21   0.21    0.25    0.31    0.31    0.28    0.33
         MAPE    0.25   0.48    0.52    0.71    0.70    0.60    0.83
         R2      0.89   0.88    0.86    0.85    0.83    0.85    0.85
UK       MAE     2.16   2.27    2.23    2.51    2.31    2.36    2.35
         RMSE    3.19   3.37    3.20    3.65    3.39    3.52    4.96
         MdAPE   0.23   0.24    0.24    0.28    0.26    0.24    0.24
         MAPE    0.26   0.39    0.39    0.48    0.37    0.39    0.39
         R2      0.39   0.34    0.31    0.18    0.29    0.22    0.21

The best results are highlighted in bold
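For reference, the metrics reported in Table 5 can be computed as follows. This is a minimal pure-Python sketch; the function name and the toy numbers are ours, not taken from the paper's code.

```python
import math
from statistics import mean, median

def evaluation_metrics(y_true, y_pred):
    """Compute the five metrics used in the ablation study."""
    err = [p - t for p, t in zip(y_pred, y_true)]
    ape = [abs(e) / abs(t) for e, t in zip(err, y_true)]  # absolute percentage errors
    mu = mean(y_true)
    ss_res = sum(e * e for e in err)                      # residual sum of squares
    ss_tot = sum((t - mu) ** 2 for t in y_true)           # total sum of squares
    return {
        "MAE": mean(abs(e) for e in err),
        "RMSE": math.sqrt(mean(e * e for e in err)),
        "MdAPE": median(ape),
        "MAPE": mean(ape),
        "R2": 1.0 - ss_res / ss_tot,
    }

# Toy ground truth and predictions, purely for illustration.
m = evaluation_metrics([10, 20, 30, 40], [12, 18, 33, 37])
```

Note that MdAPE (the median of the absolute percentage errors) is far less sensitive to a few large relative errors than MAPE, which is why the two can diverge sharply in Table 5.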

the locations of interest used in the testing phase. (We name these locations testing locations.) Our objective is to investigate the potential impact of the distance between the testing locations and the training stations on prediction accuracy. The experiment is conducted using the Beijing dataset. Specifically, we randomly choose 7 training stations in the central region and 12 testing stations spread throughout the network. The testing stations are categorized into three clusters based on their average distance to the training stations, namely Cluster 1, Cluster 2, and


Fig. 9 Correlation between PM2.5 and other indicators

Fig. 10 Shapley value on Beijing and UK datasets
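As a concrete illustration of why the computation is restricted to five features, exact Shapley values require enumerating every feature subset, which is exponential in the number of features. The sketch below uses a toy value function in place of the model's subset-restricted estimation quality; all names and numbers here are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values of `value` over all subsets of `features`.
    Cost grows exponentially with len(features), hence the restriction
    to the five inputs most correlated with PM2.5."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):  # subsets of the remaining features, by size
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                s = frozenset(subset)
                total += weight * (value(s | {f}) - value(s))  # marginal contribution
        phi[f] = total
    return phi

# Toy value function standing in for the model's quality on a feature subset.
phi = shapley_values(["CO", "NO2", "SO2"], lambda s: len(s) ** 2)
```

A useful sanity check is the efficiency property: the Shapley values always sum to the value of the full feature set minus the value of the empty set.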

Cluster 3. We illustrate the locations of each target station in Fig. 11. The reason for choosing many testing locations (instead of only two) is to guarantee the generalizability of the results.

The results, shown in Table 6, indicate that the estimation accuracy is inversely proportional to the distance between testing locations and training stations. Cluster 1, with the shortest average distance of 3.83 km, achieves significantly better performance, with an MAE of 7.12; the MAEs of Cluster 2 and Cluster 3 are 62.07% and 143.96% higher, respectively.

7 Conclusion

This paper presents a novel framework for fine-grained air quality estimation that leverages graph self-supervised representation learning to effectively capture the spatial and temporal dynamics. Specifically, we leverage the T-GCN model and the contrastive learning paradigm to embed the spatial and temporal characteristics of the input graph before applying a deep-learning-based interpolation method to estimate the target indicator at an arbitrary target location. Furthermore, two attention mechanisms are introduced: Location-aware attention and Feature-aware attention, which capture interstation relationships and emphasize the stations most significant to the location of interest, respectively. Despite its state-of-the-art performance in prediction tasks compared to other baselines, our proposed model does exhibit certain limitations, such as extended training time and large model size. These are primarily due to the requirement of two training steps for the spatiotemporal graph encoder and the multi-level attention interpolator. In future work, we aim to refine our current methodology to mitigate these weaknesses and enhance the model's estimation accuracy. We hope our work will encourage and facilitate future research on the air quality estimation problem.


Fig. 11 Visualization of the distribution of training and target stations in the experiment by clusters
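The average-distance grouping behind these clusters can be sketched with the haversine formula [45]. The coordinates below are hypothetical, not the actual station locations, and serve only to illustrate the computation.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius

def mean_distance_km(test_station, training_stations):
    """Average distance from one testing location to every training station."""
    dists = [haversine_km(*test_station, *s) for s in training_stations]
    return sum(dists) / len(dists)

# Hypothetical coordinates for two training stations and one testing location.
train = [(39.90, 116.40), (39.95, 116.35)]
d = mean_distance_km((39.92, 116.45), train)
```

Each testing station would then be assigned to Cluster 1, 2, or 3 by thresholding this mean distance.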

Table 6 Mean MAE of stations with different distances

Cluster  Mean distance (km)  Mean MAE
1        3.83                7.12
2        14.97               11.54
3        46.01               17.37

Acknowledgements This work was funded by Vingroup Joint Stock Company (Vingroup JSC), Vingroup, and supported by the Vingroup Innovation Foundation (VINIF) under project code VINIF.2020.DA09. This research is partially funded by Hanoi University of Science and Technology (HUST) under grant number T2022-PC-049. Viet Hung Vu and Duc Long Nguyen were funded by Vingroup Joint Stock Company and supported by the Domestic Master/PhD Scholarship Programme of Vingroup Innovation Foundation (VINIF), Vingroup Big Data Institute (VINBIGDATA), under Grant VINIF.2022.Ths.BK.05 and VINIF.2022.Ths.BK.07, respectively.

Data availability The code and datasets generated during and/or analyzed during the current study are available at https://github.com/duclong1009/Unsupervised-Air-Quality-Estimation

Declarations

Conflict of interest All authors declare that they have no conflicts of interest.

Appendix A Details of hyper-parameter settings

All our experiments are conducted on an NVIDIA GeForce RTX 2080 Ti graphics card. The CUDA version is 11.4. The approach is implemented with the PyTorch deep-learning framework under Python 3.8. In our implementation, we use the default batch size of 32 with the Adam optimizer [52]. The self-supervised training of the embedding is carried out for 30 epochs with an initial learning rate of 1e-3. The supervised models are also trained for 30 epochs with an initial learning rate of 1e-3. We use early stopping to obtain the best model weights, with a patience of 10 epochs.

References

1. World Health Organization (WHO) (2016) Ambient air pollution: a global assessment of exposure and burden of disease
2. Tai AP, Mickley LJ, Jacob DJ (2010) Correlations between fine particulate matter (PM2.5) and meteorological variables in the United States: implications for the sensitivity of PM2.5 to climate change. Atmos Environ 44(32):3976–3984. https://doi.org/10.1016/j.atmosenv.2010.06.060
3. Kulmala M (2018) Build a global Earth observatory. Nature Publishing Group
4. Rahmati Aidinlou H, Nikbakht AM (2022) Fuzzy-based modeling of thermohydraulic aspect of solar air heater roughened with inclined broken roughness. Neural Comput Appl 34(3):2393–2412. https://doi.org/10.1007/s00521-021-06547-w
5. Liu X, Jayaratne R, Thai P, Kuhn T, Zing I, Christensen B, Lamont R, Dunbabin M, Zhu S, Gao J, Wainwright D, Neale D, Kan R, Kirkwood J, Morawska L (2020) Low-cost sensors as an alternative for long-term air quality monitoring. Environ Res 185:109438. https://doi.org/10.1016/j.envres.2020.109438
6. deSouza P, Anjomshoaa A, Duarte F, Kahn R, Kumar P, Ratti C (2020) Air quality monitoring using mobile low-cost sensors mounted on trash-trucks: methods development and lessons learned. Sustain Cities Soc 60:102239. https://doi.org/10.1016/j.scs.2020.102239
7. Motlagh NH, Lagerspetz E, Nurmi P, Li X, Varjonen S, Mineraud J, Siekkinen M, Rebeiro-Hargrave A, Hussein T, Petaja T, Kulmala M, Tarkoma S (2020) Toward massive scale air quality monitoring. IEEE Commun Mag 58(2):54–59. https://doi.org/10.1109/MCOM.001.1900515
8. Idrees Z, Zheng L (2020) Low cost air pollution monitoring systems: a review of protocols and enabling technologies. J Ind Inf Integr 17:100123. https://doi.org/10.1016/j.jii.2019.100123
9. Lin Y-C, Lee S-J, Ouyang C-S, Wu C-H (2020) Air quality prediction by neuro-fuzzy modeling approach. Appl Soft Comput 86:105898. https://doi.org/10.1016/j.asoc.2019.105898
10. Xiao X, Jin Z, Wang S, Xu J, Peng Z, Wang R, Shao W, Hui Y (2022) A dual-path dynamic directed graph convolutional network for air quality prediction. Sci Total Environ 827:154298. https://doi.org/10.1016/j.scitotenv.2022.154298
11. Wang J, Li J, Wang X, Wang J, Huang M (2021) Air quality prediction using CT-LSTM. Neural Comput Appl 33(10):4779–4792. https://doi.org/10.1007/s00521-020-05535-w
12. Wang J, Song G (2018) A deep spatial-temporal ensemble model for air quality prediction. Neurocomputing 314:198–206. https://doi.org/10.1016/j.neucom.2018.06.049
13. Han J, Liu H, Zhu H, Xiong H, Dou D (2021) Joint air quality and weather prediction based on multi-adversarial spatiotemporal networks. Proc AAAI Conf Artif Intell 35:4081–4089. https://doi.org/10.1609/aaai.v35i5.16529
14. Chen P-C, Lin Y-T (2022) Exposure assessment of PM2.5 using smart spatial interpolation on regulatory air quality stations with clustering of densely-deployed microsensors. Environ Pollut 292:118401. https://doi.org/10.1016/j.envpol.2021.118401
15. Beauchamp M, Malherbe L, de Fouquet C, Létinois L, Tognet F (2018) A polynomial approximation of the traffic contributions for kriging-based interpolation of urban air quality model. Environ Modell Softw 105:132–152. https://doi.org/10.1016/j.envsoft.2018.03.033
16. Li J, Heap AD (2011) A review of comparative studies of spatial interpolation methods in environmental sciences: performance and impact factors. Eco Inform 6(3):228–241. https://doi.org/10.1016/j.ecoinf.2010.12.003
17. Noi E, Murray AT (2022) Interpolation biases in assessing spatial heterogeneity of outdoor air quality in Moscow, Russia. Land Use Policy 112:105783. https://doi.org/10.1016/j.landusepol.2021.105783
18. Xu C, Wang J, Hu M, Wang W (2022) A new method for interpolation of missing air quality data at monitor stations. Environ Int 169:107538. https://doi.org/10.1016/j.envint.2022.107538
19. Alimissis A, Philippopoulos K, Tzanis C, Deligiorgi D (2018) Spatial estimation of urban air pollution with the use of artificial neural network models. Atmos Environ 191:205–213. https://doi.org/10.1016/j.atmosenv.2018.07.058
20. Ma J, Ding Y, Cheng JC, Jiang F, Wan Z (2019) A temporal-spatial interpolation and extrapolation method based on geographic long short-term memory neural network for PM2.5. J Clean Prod 237:117729. https://doi.org/10.1016/j.jclepro.2019.117729
21. Qi Z, Wang T, Song G, Hu W, Li X, Zhang Z (2018) Deep air learning: interpolation, prediction, and feature analysis of fine-grained air quality. IEEE Trans Knowl Data Eng 30(12):2285–2297. https://doi.org/10.1109/TKDE.2018.2823740
22. Li L, Girguis M, Lurmann F, Pavlovic N, McClure C, Franklin M, Wu J, Oman LD, Breton C, Gilliland F, Habre R (2020) Ensemble-based deep learning for estimating PM2.5 over California with multisource big data including wildfire smoke. Environ Int 145:106143. https://doi.org/10.1016/j.envint.2020.106143
23. Rijal N, Gutta RT, Cao T, Lin J, Bo Q, Zhang J (2018) Ensemble of deep neural networks for estimating particulate matter from images. In: 2018 IEEE 3rd international conference on image, vision and computing (ICIVC), pp 733–738. https://doi.org/10.1109/ICIVC.2018.8492790
24. Dixit E, Jindal V (2022) IEESEP: an intelligent energy efficient stable election routing protocol in air pollution monitoring WSNs. Neural Comput Appl. https://doi.org/10.1007/s00521-022-07027-5
25. Ari D, Alagoz BB (2022) An effective integrated genetic programming and neural network model for electronic nose calibration of air pollution monitoring application. Neural Comput Appl. https://doi.org/10.1007/s00521-022-07129-0
26. Al-Janabi S, Alkaim A, Al-Janabi E, Aljeboree A, Mustafa M (2021) Intelligent forecaster of concentrations (PM2.5, PM10, NO2, CO, O3, SO2) caused air pollution (IFCSAP). Neural Comput Appl 33(21):14199–14229. https://doi.org/10.1007/s00521-021-06067-7
27. Wardana I, Gardner JW, Fahmy SA (2022) Estimation of missing air pollutant data using a spatiotemporal convolutional autoencoder. Neural Comput Appl. https://doi.org/10.1007/s00521-022-07224-2
28. Liang Y, Ke S, Zhang J, Yi X, Zheng Y (2018) GeoMAN: multi-level attention networks for geo-sensory time series prediction. In: Proceedings of the twenty-seventh international joint conference on artificial intelligence, IJCAI-18, pp 3428–3434. https://doi.org/10.24963/ijcai.2018/476
29. Zhao J, Deng F, Cai Y, Chen J (2018) Long short-term memory–fully connected (LSTM-FC) neural network for PM2.5 concentration prediction. Chemosphere. https://doi.org/10.1016/j.chemosphere.2018.12.128
30. Qi Y, Li Q, Karimian H, Liu D (2019) A hybrid model for spatiotemporal forecasting of PM2.5 based on graph convolutional neural network and long short-term memory. Sci Total Environ. https://doi.org/10.1016/j.scitotenv.2019.01.333
31. Ma J, Ding Y, Gan VJL, Lin C, Wan Z (2019) Spatiotemporal prediction of PM2.5 concentrations at different time granularities using IDW-BLSTM. IEEE Access 7:107897–107907
32. Guo C, Liu G, Lyu L, Chen CH (2020) An unsupervised PM2.5 estimation method with different spatio-temporal resolutions based on KIDW-TCGRU. IEEE Access 8:190263–190276. https://doi.org/10.1109/ACCESS.2020.3032420
33. Kipf TN, Welling M (2017) Semi-supervised classification with graph convolutional networks. In: Proceedings of the 5th international conference on learning representations, ICLR '17. https://doi.org/10.48550/ARXIV.1609.02907
34. Liu Y, Jin M, Pan S, Zhou C, Zheng Y, Xia F, Yu P (2022) Graph self-supervised learning: a survey. IEEE Trans Knowl Data Eng. https://doi.org/10.1109/TKDE.2022.3172903
35. Kipf TN, Welling M (2016) Variational graph auto-encoders. CoRR abs/1611.07308. https://doi.org/10.48550/ARXIV.1611.07308
36. Wang C, Pan S, Long G, Zhu X, Jiang J (2017) MGAE: marginalized graph autoencoder for graph clustering. In: Proceedings of the 2017 ACM on conference on information and knowledge management, CIKM '17, pp 889–898. https://doi.org/10.1145/3132847.3132967
37. Jin W, Derr T, Liu H, Wang Y, Wang S, Liu Z, Tang J (2020) Self-supervised learning on graphs: deep insights and new direction. CoRR abs/2006.10141. https://doi.org/10.48550/ARXIV.2006.10141
38. Hu Z, Fan C, Chen T, Chang K-W, Sun Y (2019) Pre-training graph neural networks for generic structural feature extraction. In: ICLR 2019 workshop: representation learning on graphs and manifolds. https://doi.org/10.48550/ARXIV.1905.13728
39. Perozzi B, Al-Rfou R, Skiena S (2014) DeepWalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, KDD '14, pp 701–710. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/2623330.2623732
40. Grover A, Leskovec J (2016) node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, KDD '16, pp 855–864. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/2939672.2939754
41. Zhu Y, Xu Y, Yu F, Liu Q, Wu S, Wang L (2020) Deep graph contrastive representation learning. In: ICML workshop on graph representation learning and beyond. https://doi.org/10.48550/ARXIV.2006.04131
42. Hamilton WL, Ying R, Leskovec J (2017) Inductive representation learning on large graphs. In: NIPS'17, pp 1025–1035. Curran Associates Inc., Red Hook, NY, USA. https://doi.org/10.48550/ARXIV.1706.02216
43. Velickovic P, Fedus W, Hamilton WL, Liò P, Bengio Y, Hjelm RD (2019) Deep graph infomax. In: 7th international conference on learning representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. https://doi.org/10.48550/ARXIV.1809.10341
44. Opolka FL, Solomon A, Cangea C, Velickovic P, Liò P, Hjelm RD (2019) Spatio-temporal deep graph infomax. ICLR 2019 abs/1904.06316. https://doi.org/10.48550/ARXIV.1904.06316
45. Winarno E, Hadikurniawati W, Rosso RN (2017) Location based service for presence system using haversine method. In: 2017 international conference on innovative and creative information technology (ICITech), pp 1–4. IEEE. https://doi.org/10.1109/INNOCIT.2017.8319153
46. Copernicus: ERA5 hourly data on single levels from 1959 to present. https://doi.org/10.24381/cds.adbb2d47. https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels. Accessed 2019-09-30
47. Li S, Xie G, Ren J, Guo L, Yang Y, Xu X (2020) Urban PM2.5 concentration prediction via attention-based CNN-LSTM. Appl Sci. https://doi.org/10.3390/app10061953
48. Chung J, Gulcehre C, Cho K, Bengio Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. In: NIPS 2014 workshop on deep learning, December. https://doi.org/10.48550/ARXIV.1412.3555
49. Tobler WR (1970) A computer movie simulating urban growth in the Detroit region. Econ Geogr 46:234–240. https://doi.org/10.2307/143141
50. Cichowicz R, Wielgosinski G, Fetter W (2020) Effect of wind speed on the level of particulate matter PM10 concentration in atmospheric air during winter season in vicinity of large combustion plant. J Atmos Chem 77:1–14. https://doi.org/10.1007/s10874-020-09401-w
51. Zhao L, Song Y, Zhang C, Liu Y, Wang P, Lin T, Deng M, Li H (2020) T-GCN: a temporal graph convolutional network for traffic prediction. IEEE Trans Intell Transp Syst 21(9):3848–3858. https://doi.org/10.1109/tits.2019.2935152
52. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Bengio Y, LeCun Y (eds) 3rd international conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, conference track proceedings. https://doi.org/10.48550/ARXIV.1412.6980
53. Reani M, Lowe D, Gledson A, Topping D, Jay C (2022) UK daily meteorology, air quality, and pollen measurements for 2016–2019, with estimates for missing data. Sci Data 9(1):43. https://doi.org/10.1038/s41597-022-01135-6
54. Wang H: Air pollution and meteorological data in Beijing 2017–2018. https://doi.org/10.7910/DVN/USXCAK
55. Colchado LE, Villanueva E, Ochoa-Luna J (2021) A neural network architecture with an attention-based layer for spatial prediction of fine particulate matter. In: 2021 IEEE 8th international conference on data science and advanced analytics (DSAA), pp 1–10. https://doi.org/10.1109/DSAA53316.2021.9564200
56. Chen Y, Zang L, Du W, Xu D, Shen G, Zhang Q, Zou Q, Chen J, Zhao M, Yao D (2018) Ambient air pollution of particles and gas pollutants, and the predicted health risks from long-term exposure to PM2.5 in Zhejiang province, China. Environ Sci Pollut Res 25(24):23833–23844. https://doi.org/10.1007/s11356-018-2420-5
57. Chen Z, Xie X, Cai J, Chen D, Gao B, He B, Cheng N, Xu B (2018) Understanding meteorological influences on PM2.5 concentrations across China: a temporal and spatial perspective. Atmos Chem Phys 18(8):5343–5358
58. Wang J, Ogawa S (2015) Effects of meteorological conditions on PM2.5 concentrations in Nagasaki, Japan. Int J Environ Res Public Health 12:9089–9101. https://doi.org/10.3390/ijerph120809089
59. Mi K, Zhuang R, Zhang Z, Gao J, Pei Q (2019) Spatiotemporal characteristics of PM2.5 and its associated gas pollutants, a case in China. Sustain Cities Soc 45:287–295. https://doi.org/10.1016/j.scs.2018.11.004
60. Li K, Bai K (2019) Spatiotemporal associations between PM2.5 and SO2 as well as NO2 in China from 2015 to 2018. Int J Environ Res Public Health 16(13):2352. https://doi.org/10.3390/ijerph16132352
61. Hart S (1989) Shapley value. In: Eatwell J, Milgate M, Newman P (eds), pp 210–216. Palgrave Macmillan UK, London. https://doi.org/10.1007/978-1-349-20181-5_25
62. Jia M, Zhao T, Cheng X, Gong S, Zhang X, Tang L, Liu D, Wu X, Wang L, Chen Y (2017) Inverse relations of PM2.5 and O3 in air compound pollution between cold and hot seasons over an urban area of east China. Atmosphere. https://doi.org/10.3390/atmos8030059
63. Fu H, Zhang Y, Liao C, Mao L, Wang Z, Hong N (2020) Investigating PM2.5 responses to other air pollutants and meteorological factors across multiple temporal scales. Sci Rep 10(1):15639. https://doi.org/10.1038/s41598-020-72722-z

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Authors and Affiliations

Viet Hung Vu¹ · Duc Long Nguyen¹ · Thanh Hung Nguyen¹ · Quoc Viet Hung Nguyen² · Phi Le Nguyen¹ · Thanh Trung Huynh³

Corresponding author: Thanh Hung Nguyen, hungnt@soict.hust.edu.vn

Viet Hung Vu: hung.vv221026m@sis.hust.edu.vn
Duc Long Nguyen: long.nd222179m@sis.hust.edu.vn
Quoc Viet Hung Nguyen: quocviethung1@gmail.com
Phi Le Nguyen: lenp@soict.hust.edu.vn
Thanh Trung Huynh: thanh.huynh@epfl.ch

¹ Hanoi University of Science and Technology, Hanoi, Vietnam
² Griffith University, Gold Coast, Queensland, Australia
³ The École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
