1. Introduction
With the rapid development of industrialization and urbanization, air quality issues have become a focus of social concern, especially in rapidly developing urban areas [1]. PM2.5, a major factor affecting air quality, poses a serious threat to public health and environmental conservation [2]. Thus, accurate prediction of PM2.5 concentration is crucial for both government agencies and the public. Variations in PM2.5 concentration are significantly influenced by socio-economic factors, human activities, and the spatial distribution of urban structures. Currently, most cities nationwide (such as Beijing, Guangzhou, and Haikou) have established air monitoring stations that record hourly data on various air pollutants and meteorological factors [3]. However, although these monitoring stations provide real-time air pollution data, they cannot predict pollutant concentrations in advance. Consequently, accurate advance prediction of PM2.5 concentration has become essential for managing environmental health and preventing severe pollution events.
Previous studies on PM2.5 concentration prediction have mainly emphasized short-term point prediction. This approach focuses solely on momentary values of air pollutant concentration, overlooking the long-term trends and predictive uncertainty of PM2.5 concentration [4,5]. Such short-term point-prediction methods struggle to offer comprehensive information for decision making and constrain a deep understanding of future air quality conditions [6]. Therefore, it is particularly important to develop an air quality prediction framework that can simultaneously perform long-term point and interval prediction. However, this remains a challenging topic, and its core issues can be summarized as follows:
- (1) How to fully exploit the interactions and impacts among air pollutants, meteorological factors, and spatial and temporal factors [7,8]. Meteorological factors have an important influence on the formation, transport, and dispersion of air pollutants. In addition, there is a degree of correlation between different monitoring stations. Therefore, it is crucial to fully consider the correlation between multiple monitoring stations and to exploit the effects between multiple air pollutants and meteorological factors during air quality prediction modelling.
- (2) How to improve the accuracy and reliability of long-term predictions. Accurate long-term predictions provide sufficient time to take measures against air pollution. However, complex nonlinear relationships exist among the factors affecting air pollutants, and current prediction models for air pollution are mainly designed for short-term tasks, which makes it challenging to capture the long-term dependencies among air pollution time series effectively [9]. Therefore, fully exploiting the spatial and temporal effects between air pollutant concentrations and meteorological factors is the key to achieving accurate long-term PM2.5 prediction.
- (3) How to effectively use interval prediction to quantify uncertainty in PM2.5 concentration changes. Most previous studies on PM2.5 concentration prediction have focused on point prediction, but point prediction often fails to convey fluctuation information (e.g., uncertainty, variability, and trends) [10]. The key to achieving interval prediction is modelling the point-prediction error distribution; therefore, choosing an appropriate method to fit this distribution is essential.
As is well known, the formation and variation of PM2.5 concentration are influenced by multiple factors, including meteorological conditions, environmental parameters, and human activities. For instance, meteorological conditions such as temperature and wind speed not only significantly impact the transport and dispersion of pollutants but also determine the stability and reactivity of pollutants in the air [11]. Additionally, changes in human activities and the distribution of points of interest (POIs) can have direct or indirect impacts on air quality [12]. Pollutant emissions from these activities can lead to correlated and synergistic PM2.5 concentrations at different monitoring stations. However, many current studies consider only the relationship between neighboring stations in the actual geographic area, ignoring geospatial similarity [13,14]. For example, two stations may be geographically distant yet exhibit similar patterns [15]. Therefore, it is imperative to fully consider the geographic similarity of all stations to enhance the accuracy of air quality prediction.
In recent years, machine learning and deep learning techniques have shown significant performance in short-term prediction of PM2.5 [8,16,17]. However, long-term prediction tasks pose a greater challenge to existing models [18]. The core of long-term prediction modelling lies in choosing a multi-step prediction strategy [19]. The strategies commonly used in current research can be categorized into recursive strategies [20] and direct multi-output strategies [21]. Recursive prediction strategies have the advantage of incorporating the extraction of time dependence within the predicted sequence into the modelling. However, feeding predicted values back into the model leads to a severe error-accumulation problem [22]. Conversely, the direct multi-output strategy generates predictions at multiple time points simultaneously during training, effectively improving prediction efficiency and mitigating error accumulation [23]. However, this strategy typically relies on complex network architectures to capture long-term temporal dependencies, and existing deep learning models designed for short-term prediction struggle to capture such dependencies. Recently, the Transformer model has performed well in long-term time-series prediction owing to its strength in capturing long-term dependencies, providing a new direction for long-term PM2.5 prediction [24]. It is worth noting that although the Transformer has significant advantages in establishing remote dependencies within data, it still has limitations in handling complex dependencies among multiple variables [25]. In addition, CNNs have powerful grid-data processing capabilities that can effectively capture localized patterns and features in time series. Therefore, combining a CNN with the Transformer to construct a hybrid prediction framework that can effectively integrate multivariate information and deeply mine long-term dependencies is crucial for improving the accuracy and reliability of PM2.5 prediction.
Although point prediction of PM2.5 concentration plays an important role in air pollution control, errors are inevitable due to the volatility and non-stationarity of PM2.5 concentration changes [6]. To account for this uncertainty, interval prediction of PM2.5 can effectively cover a range of PM2.5 concentrations at different confidence levels, providing more practical information for decision makers. A commonly adopted strategy for interval prediction is to use deep learning models for point prediction and then model the distribution of the prediction errors [26]. Error distribution analysis usually takes the form of a probability density function. Parametric [27] and nonparametric methods [10] are the two main techniques for extracting the probability density function of the error distribution. Parametric methods require specific presuppositions about the error distribution, such as a normal or exponential distribution. In practice, however, the error distribution may be skewed, and these assumptions may bias the estimate. In contrast, nonparametric methods are more flexible and adaptable, as they infer the shape of the error distribution directly from the data without specific assumptions. Among them, kernel density estimation (KDE) is a commonly used nonparametric method that has been widely applied in interval prediction of wind power generation [28], wave height [29], and other fields.
Based on the above analysis of the literature, this study introduces a long-term point-and-interval-prediction framework for PM2.5 concentration that integrates a convolutional neural network (CNN) and the Transformer model. The proposed approach comprehensively accounts for the interactions among air pollutants, meteorological factors, and PM2.5 data from strongly correlated stations. The main contributions of this study are as follows:
- (1) In selecting influencing factors, this study considers both the interactions among various air pollutants and meteorological factors and the correlations and synergies between monitoring stations across different geographic areas. The PM2.5 concentrations at strongly correlated stations are used as input features to mine their potential relationship with the target station.
- (2) For long-term point prediction, this study leverages the advantage of the Transformer in mining the long-term dependence of time series. The overall structure incorporates both CNN and Transformer models to effectively capture the long-term dependencies among multidimensional variables, thereby producing stable and reliable PM2.5 predictions.
- (3) For long-term interval prediction, this study further utilizes KDE to obtain prediction intervals for PM2.5 concentration at different confidence levels based on the point-prediction results, providing more information about uncertainty.
The rest of this paper is organized as follows. Section 2 provides an overview of related work. Section 3 reviews the theoretical principles of the proposed methods. Section 4 presents the dataset and experimental results. Section 5 summarizes the discussion. Finally, Section 6 gives conclusions and outlines future work.
3. Methodology
3.1. The Overall Framework
This study combines air pollutant and meteorological data from target stations and strongly correlated stations to exploit intricate spatial and temporal relationships for long-term point and interval prediction of PM2.5. The overall framework is depicted in Figure 1. First, multi-source data are collected and preprocessed. POI data in the study area are used to perform a spatial clustering analysis of all monitoring stations to screen for strongly correlated stations. The Pearson correlation coefficient is used to analyze the correlation between all features and determine the feature variables finally input into the model. For model training and testing, the dataset is split into training, validation, and test sets in a 7:1:2 ratio. Second, a hybrid deep learning model based on a convolutional neural network and the Transformer is applied to achieve accurate long-term point predictions of PM2.5. Finally, KDE-based interval prediction is performed based on point-prediction error estimation to obtain the prediction intervals of PM2.5 at different confidence levels.
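The preprocessing steps above (Pearson-based feature screening followed by a chronological 7:1:2 split) can be sketched as follows. This is a minimal illustration, not the authors' code: the 0.2 correlation threshold, the column names, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def screen_features(df: pd.DataFrame, target: str = "PM2.5", threshold: float = 0.2):
    """Keep features whose absolute Pearson correlation with the target
    reaches the threshold (0.2 is an illustrative cutoff, not the paper's)."""
    corr = df.corr(method="pearson")[target].abs()
    keep = corr[corr >= threshold].index.tolist()
    return df[keep]

def chronological_split(df: pd.DataFrame, ratios=(0.7, 0.1, 0.2)):
    """Split a time-ordered dataset into train/validation/test (7:1:2)
    without shuffling, so no future information leaks into training."""
    n = len(df)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (df.iloc[:n_train],
            df.iloc[n_train:n_train + n_val],
            df.iloc[n_train + n_val:])

# Toy example with synthetic hourly records
rng = np.random.default_rng(0)
pm25 = rng.normal(50, 10, 1000)
data = pd.DataFrame({
    "PM2.5": pm25,
    "PM10": pm25 * 1.2 + rng.normal(0, 5, 1000),   # strongly correlated feature
    "temperature": rng.normal(20, 5, 1000),          # weakly correlated feature
})
train, val, test = chronological_split(screen_features(data))
print(len(train), len(val), len(test))  # 700 100 200
```

The split is deliberately chronological rather than random, matching the usual practice for time-series prediction tasks.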
3.2. Preliminaries
Assume that there are $N$ stations in the study area, denoted by the set $S = \{s_1, s_2, \ldots, s_N\}$. Each station contains three attributes: station id, longitude, and latitude. The counts of different types of POIs around the stations are denoted by $P \in \mathbb{R}^{N \times C}$, where $N$ denotes the number of stations and $C$ denotes the total number of POI categories. Let $X_t^i$ represent all the features of station $s_i$ at historical time $t$, encompassing air pollution data (PM2.5, PM10, CO, etc.), meteorological data (wind speed, wind direction, temperature, etc.), and the PM2.5 concentration at strongly correlated stations. PM2.5 is the target pollutant in this study. For the target station $s_i$, the historical observation data $\{X_{t-T+1}^i, \ldots, X_t^i\}$ are used to predict the point and interval concentration of PM2.5 for the future time interval from $t+1$ to $t+H$, where $T$ denotes the historical time step. The point prediction of PM2.5 concentration from $t+1$ to $t+H$ is denoted by $\hat{Y} = \{\hat{y}_{t+1}, \ldots, \hat{y}_{t+H}\}$. For a given confidence level $1-\alpha$, the interval prediction is denoted by $[\hat{y}_{t+h}^{L}, \hat{y}_{t+h}^{U}]$, $h = 1, \ldots, H$.
3.3. Spatial Clustering Based on POIs
In order to examine the similarity and geographical association patterns across monitoring stations, this method obtains POI data for the study area using Baidu’s open API. The POI data cover a range of geographical entities, such as business areas, cultural facilities, and transportation hubs. Afterwards, hierarchical clustering is utilized to spatially group all monitoring stations. Hierarchical clustering builds a dendrogram by grouping stations that are both spatially close and similar in nature into clusters. This approach eliminates the requirement to pre-determine the number of clusters, making it easier to explore potential geographical patterns within the study region without prior information. Examining the clustering outcomes enhances the overall comprehension of the geographical connections between monitoring stations, uncovering groups of stations that display close spatial correlations and synergistic variations. The spatial clustering module is presented as pseudo-code in Algorithm 1, and the formulas used in the algorithm are provided in Equations (1)–(3).
Algorithm 1 Proposed spatial clustering approach
- Input: station location information $L$ and POI information $P$;
- Output: clustering result $C$;
- 1: normalize $P$ over all $m$ dimensions;
- 2: for each POI $p$ do
- 3:   for each station $s$ do
- 4:     compute the distance $d(p, s)$ according to Equation (1);
- 5:   if $d(p, s)$ is the minimum over all stations then
- 6:     update the POI count vector of station $s$;
- 7: each station is regarded as a separate cluster;
- 8: while more than one cluster remains do
- 9:   for each cluster $C_a$ do
- 10:    for each cluster $C_b \neq C_a$ do
- 11:      compute $Sim(C_a, C_b)$ according to Equations (2) and (3);
- 12:      record $Sim(C_a, C_b)$;
- 13:  find the most similar pair of clusters $(C_a^{*}, C_b^{*})$;
- 14:  merge $C_a^{*}$ and $C_b^{*}$ into a new cluster;
- 15:  for each remaining cluster $C_c$ do
- 16:    update the similarity between $C_c$ and the merged cluster;
- 17:  update the cluster set $C$;
- 18: return $C$;
$$d(p, s) = 2 \times 6378.137 \times \arcsin\!\left(\sqrt{\sin^2\!\frac{lat_p - lat_s}{2} + \cos(lat_p)\cos(lat_s)\sin^2\!\frac{lon_p - lon_s}{2}}\right) \quad (1)$$

$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{m} \left(\frac{x_{ik} - x_{jk}}{\sigma_k}\right)^{2}} \quad (2)$$

$$Sim(C_a, C_b) = \frac{1}{|C_a|\,|C_b|} \sum_{x_i \in C_a} \sum_{x_j \in C_b} d(x_i, x_j) \quad (3)$$

where $(lat_p, lon_p)$ denote the latitude and longitude of the POI and $(lat_s, lon_s)$ denote the latitude and longitude of the monitoring station. The value 6378.137 is the radius of the Earth’s equator in kilometers. $d(x_i, x_j)$ represents the normalized Euclidean distance between data points $x_i$ and $x_j$, $m$ represents the number of dimensions of a data point, $x_{ik}$ and $x_{jk}$, respectively, represent the values of data points $x_i$ and $x_j$ in the $k$-th dimension, and $\sigma_k$ is the standard deviation in the $k$-th dimension. $Sim(C_a, C_b)$ represents the similarity between clusters $C_a$ and $C_b$, and $|C_a|$ and $|C_b|$, respectively, represent the number of samples in each cluster.
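The clustering pipeline above can be sketched in a few lines with `scipy`. This is an illustrative sketch, not the authors' implementation: the POI-count vectors are synthetic, and average linkage is assumed as the concrete form of the cluster-similarity computation in Equation (3).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

EARTH_RADIUS_KM = 6378.137  # equatorial radius used in Equation (1)

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between a POI and a station (Equation (1))."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Synthetic POI-count vectors: 4 stations (rows) x 3 POI categories (columns).
# Stations 0/1 and 2/3 are given deliberately similar POI profiles.
poi_counts = np.array([
    [120, 30, 5],
    [115, 28, 6],
    [10, 90, 40],
    [12, 85, 38],
], dtype=float)

# Normalized Euclidean distance (Equation (2)): divide each dimension
# by its standard deviation before computing distances.
normalized = poi_counts / poi_counts.std(axis=0)

# Average linkage averages pairwise distances between clusters,
# matching the |C_a||C_b|-normalized similarity of Equation (3).
Z = linkage(normalized, method="average", metric="euclidean")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # stations 0/1 share a cluster, as do stations 2/3
```

Cutting the dendrogram with `fcluster` is only needed to inspect a flat grouping; the dendrogram itself lets the number of clusters be chosen after the fact, which is the stated advantage of hierarchical clustering here.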
3.4. ConvFormer Network
The structure of the proposed ConvFormer network is shown in
Figure 2. The Transformer has a significant advantage in capturing long-term dependencies between time series; however, it struggles to capture relationships among multiple variables. Therefore, this study combines a CNN, which mines local patterns and short-term dependencies among the multivariate inputs, with the Transformer, which captures long-term dependencies within the time series. Additionally, this study adopts a direct multi-output strategy for long-term point prediction.
CNNs, representing a robust class of deep learning models, have demonstrated successful applications in image analysis, natural language processing, and various other fields. In multivariate time-series prediction, a CNN automatically learns complex patterns and regularities in the data through its convolutional and pooling layers, which can effectively handle the interactions and temporal relationships among multiple variables. Therefore, the proposed method utilizes a CNN to process the historical observation data. The input multivariate time series is converted into a two-dimensional feature matrix $X \in \mathbb{R}^{T \times F}$, where $T$ denotes the sliding-window step size and $F$ denotes the dimension of the input features, and a convolution operation is performed to obtain the feature map. The computation of each element in the feature map is shown in Equation (4). Then, a max-pooling operation retains the most significant features in the multivariate data and discards less important information. The resulting matrix $X'$ serves as the input to the Transformer.
$$y_{ij} = f\!\left(\sum_{m}\sum_{n} w_{mn}\, x_{i+m,\, j+n} + b\right) \quad (4)$$

where $y_{ij}$ denotes the feature output value in row $i$ and column $j$ of the feature map, $x_{i+m,\, j+n}$ denotes the value in row $i+m$ and column $j+n$ of the input feature matrix, $f$ denotes the chosen activation function, $w_{mn}$ denotes the weight in row $m$ and column $n$ of the convolution kernel, and $b$ denotes the bias of the convolution kernel.
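The convolution of Equation (4) followed by max pooling can be written out explicitly in a few lines of numpy. This is a didactic sketch under assumptions: the kernel size, weights, input shape, and the `tanh` activation are all illustrative choices, not values from the paper.

```python
import numpy as np

def conv2d_valid(x, w, b, activation=np.tanh):
    """Equation (4): y_ij = f(sum_m sum_n w_mn * x_{i+m, j+n} + b),
    computed as a 'valid' 2-D convolution over the (time x feature) matrix."""
    T, F = x.shape
    kh, kw = w.shape
    out = np.empty((T - kh + 1, F - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = activation(np.sum(w * x[i:i + kh, j:j + kw]) + b)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling: keep only the strongest response
    in each size x size patch, discarding less important information."""
    T, F = x.shape
    x = x[:T - T % size, :F - F % size]
    return x.reshape(T // size, size, F // size, size).max(axis=(1, 3))

# Toy input: window of 8 time steps x 6 features; 3x3 averaging kernel
x = np.arange(48, dtype=float).reshape(8, 6) / 48.0
w = np.ones((3, 3)) / 9.0
feat = conv2d_valid(x, w, b=0.0)   # feature map, shape (6, 4)
pooled = max_pool(feat)            # pooled matrix X', shape (3, 2)
print(feat.shape, pooled.shape)    # (6, 4) (3, 2)
```

In practice this operation would be performed by a deep learning framework's convolution layer; the explicit loops simply make the index arithmetic of Equation (4) visible.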
The Transformer model is a feed-forward neural network architecture whose core is the self-attention mechanism, which can effectively capture the relationship between any two points in a time series. In particular, the self-attention mechanism computes correlation weights between each position in the input sequence and all other positions, and then applies these weights to generate a representation of each position. The self-attention mechanism is defined by the following formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \quad (5)$$

where $Q$, $K$, and $V$ denote the query, key, and value matrices, respectively, $d_k$ denotes the dimensionality of the keys, and $\mathrm{softmax}$ is the activation function that transforms the input to the interval [0, 1]. The self-attention mechanism derives the attention weights by evaluating the similarity between the queries and the keys, and it produces the final representation through a weighted sum.
In contrast to the original Transformer architecture, this model eliminates the final probability calculation using softmax. Instead, the final predicted value of the target pollutant concentration at the target station is derived by mapping the generated feature maps directly to the output values.
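The scaled dot-product attention of Equation (5) can be sketched directly in numpy. The sequence length, dimensionality, and random projection matrices below are illustrative assumptions; a real implementation would use a framework's multi-head attention layer.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax: maps scores to weights in [0, 1] summing to 1."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Equation (5): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (seq_len, seq_len) attention map
    return weights @ V, weights                # weighted sum over the values

# Toy sequence: 4 positions, d_k = 8; projections are random placeholders
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)                              # (4, 8)
print(np.allclose(weights.sum(axis=1), 1.0))  # True: each row is a distribution
```

Because every row of the attention map relates one position to all others, the mechanism can link arbitrarily distant time steps in a single step, which is the property exploited here for long-term dependencies.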
3.5. Interval-Prediction Method: Non-Parametric Kernel Density Estimation
Interval prediction of PM2.5 builds on point prediction, followed by the delineation of upper and lower bounds to define the prediction intervals. This approach quantifies the uncertainty in PM2.5 concentration changes, offering comprehensive early-warning information on future PM2.5 variations. As a non-parametric estimation method, KDE is not constrained by a specific form of probability distribution, which enables it to fit the sample data accurately and reliably; it is therefore widely applied in interval prediction. Accordingly, the proposed method uses KDE to quantitatively analyze and estimate the point-prediction results for PM2.5. First, the error sequence $e = \{e_1, e_2, \ldots, e_n\}$ is derived from the differences between predicted and actual values on the training set. The optimal bandwidth $h$ of the KDE is then determined by a grid search combined with five-fold cross-validation. Based on the optimal bandwidth $h$, the KDE model is fitted on the error sequence $e$, where the estimation function of KDE is described as follows:

$$\hat{f}(e) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{e - e_i}{h}\right) \quad (6)$$
where $n$ denotes the number of samples and $K(\cdot)$ denotes the kernel function. Commonly used kernel functions include the Gaussian, Epanechnikov, and rectangular kernels. Compared with the alternatives, the Gaussian kernel generates a smoother density-estimation curve, which is conducive to capturing the overall characteristics of the data distribution. Therefore, the Gaussian kernel is used here, and its expression is as follows:

$$K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^{2}}{2}\right) \quad (7)$$
Based on the fitted KDE model, the probability density function (PDF) and cumulative distribution function (CDF) of the error are calculated. For a given confidence level $1-\alpha$, the lower and upper bounds of the confidence interval are $F^{-1}(\alpha/2)$ and $F^{-1}(1-\alpha/2)$, where $F^{-1}(\cdot)$ denotes the inverse CDF of the error distribution. Finally, the interval-prediction result for the test set is obtained through Equation (8):

$$\left[\hat{y}_{t+h} + F^{-1}\!\left(\frac{\alpha}{2}\right),\ \hat{y}_{t+h} + F^{-1}\!\left(1-\frac{\alpha}{2}\right)\right] \quad (8)$$
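The KDE-based interval construction of Equations (6)–(8) can be sketched as follows. This is a simplified illustration: the bandwidth falls back to Silverman's rule of thumb rather than the grid search with five-fold cross-validation described above, the CDF is inverted numerically on a grid, and the error data are synthetic.

```python
import numpy as np

def gaussian_kde_pdf(grid, errors, h):
    """Equation (6) with the Gaussian kernel of Equation (7)."""
    u = (grid[:, None] - errors[None, :]) / h
    return np.exp(-u ** 2 / 2).sum(axis=1) / (len(errors) * h * np.sqrt(2 * np.pi))

def prediction_interval(point_preds, errors, alpha=0.1, h=None):
    """Equation (8): shift each point prediction by the alpha/2 and
    1 - alpha/2 quantiles of the KDE-fitted error distribution."""
    if h is None:
        # Silverman's rule of thumb; the paper instead tunes h by
        # grid search + 5-fold cross-validation.
        h = 1.06 * errors.std() * len(errors) ** (-1 / 5)
    grid = np.linspace(errors.min() - 3 * h, errors.max() + 3 * h, 2000)
    pdf = gaussian_kde_pdf(grid, errors, h)
    cdf = np.cumsum(pdf)
    cdf /= cdf[-1]                       # normalize to a proper CDF
    lo = grid[np.searchsorted(cdf, alpha / 2)]      # F^{-1}(alpha/2)
    hi = grid[np.searchsorted(cdf, 1 - alpha / 2)]  # F^{-1}(1 - alpha/2)
    return point_preds + lo, point_preds + hi

# Toy example: skewed training errors, 90% prediction interval
rng = np.random.default_rng(1)
errors = rng.normal(0.0, 5.0, 500) + rng.exponential(2.0, 500)
preds = np.array([40.0, 55.0, 62.0])
lower, upper = prediction_interval(preds, errors, alpha=0.1)
print(lower, upper)
```

Because the quantiles come from the fitted error distribution rather than a symmetric assumption, the interval can be asymmetric around the point prediction, which is precisely the benefit of the non-parametric approach when errors are skewed.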
5. Discussion
This study proposes a prediction framework based on the ConvFormer-KDE model, which combines CNN, Transformer, and KDE techniques to obtain long-term point and interval predictions of PM2.5 concentration. In selecting influencing factors, certain meteorological factors and other pollutants cannot be ignored. Therefore, PM10, CO, O3, wind speed, temperature, pressure, and wind direction, which are highly correlated with PM2.5, were included in the modelling. In addition, PM2.5 values from stations strongly correlated with the target station were used as model inputs. In the modelling process, unlike previous studies in which the final result is obtained by integrating the outputs of two separate models, this study converts the CNN-extracted features into the input dimensions required by the Transformer, which then mines the long-term dependencies in the time series to produce the predictions. ConvFormer-KDE thus takes full advantage of the different deep learning modules: the CNN learns the temporal relationships and interactions between multiple variables, while the multi-head attention mechanism of the Transformer relates each data point to every other data point, allowing the model to capture long-term dependencies between temporal sequences. In terms of output strategy, this method directly outputs predictions for multiple future time steps simultaneously rather than generating them recursively. The key to recursive multi-step prediction lies in continuously updating the dataset with predicted values and using the updated dataset for the next prediction; error accumulation therefore worsens as the prediction horizon grows, since each prediction builds on the previous one. The direct multi-output approach chosen in this study alleviates this problem, and the model structure is simpler and more computationally efficient.
In terms of interval prediction, directly predicting the upper and lower bounds of intervals often necessitates specifying a fixed interval width, making it challenging to calculate the uncertainty of prediction results. Therefore, implementing interval prediction based on point-prediction results is more widely used. The point-prediction model used in this study plays an important role in interval prediction. A preliminary analysis of the point-prediction error is performed and a probability density function (PDF) of the point-prediction error is constructed using the KDE method. Subsequently, the cumulative distribution function (CDF) is employed to depict the distribution of the error at a specific confidence level, and the upper and lower bounds of the interval prediction are ultimately derived at the designated confidence level. In KDE, the selection of the kernel function holds significant importance as it directly impacts the level of smoothing and the bias–variance trade-off of the estimation. In this study, the Gaussian function was selected as the kernel function for KDE. Typically, the Gaussian function offers smoother characteristics compared with other kernel functions, resulting in a more continuous and smoother distribution of weights within the observations and yielding a smaller bias.