
Hindawi

Adsorption Science & Technology


Volume 2022, Article ID 5086622, 15 pages
https://doi.org/10.1155/2022/5086622

Research Article
Intelligent Forecasting of Air Quality and Pollution Prediction
Using Machine Learning

D. Kothandaraman,1 N. Praveena,2 K. Varadarajkumar,3 B. Madhav Rao,4 Dharmesh Dhabliya,5 Shivaprasad Satla,6 and Worku Abera7

1 School of Computer Science and Artificial Intelligence, SR University, Warangal, Telangana, India
2 Department of Information Technology, Velagapudi Ramakrishna Siddhartha Engineering College, Vijayawada, India
3 Department of Computer Science and Engineering, Malla Reddy University, Hyderabad, 500043 Telangana, India
4 Department of Computer Science and Engineering, SIR C R Reddy College of Engineering, Eluru, India
5 Department of Computer Engineering, Vishwakarma Institute of Information Technology, India
6 Department of Computer Science and Engineering, Malla Reddy Engineering College, Secunderabad, 500100 Telangana, India
7 Department of Food Process Engineering, College of Engineering and Technology, Wolkite University, Wolkite, Ethiopia

Correspondence should be addressed to D. Kothandaraman; [email protected], N. Praveena; [email protected], and Worku Abera; [email protected]

Received 28 March 2022; Revised 24 April 2022; Accepted 6 May 2022; Published 27 June 2022

Academic Editor: Lakshmipathy R

Copyright © 2022 D. Kothandaraman et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.

Air pollution consists of harmful gases and fine Particulate Matter (PM2.5) which affect the quality of air. It has not only become a key issue in scientific research but has also turned into an important social issue for public life. Therefore, many experts and scholars at different R&D centres and universities, both at home and abroad, are involved in extensive research on PM2.5 pollutant prediction. In this scenario, the authors applied various machine learning models, such as linear regression, random forest, KNN, ridge and lasso, XGBoost, and AdaBoost, to predict PM2.5 pollutants in polluted cities. The experiment was carried out using Jupyter Notebook in Python 3.7.3. From the results with respect to the MAE, MAPE, and RMSE metrics, the XGBoost, AdaBoost, random forest, and KNN models (8.27, 0.40, and 13.85; 9.23, 0.45, and 10.59; 39.84, 1.94, and 54.59; and 49.13, 2.40, and 69.92, respectively) are observed to be the more reliable models. The PM2.5 pollutant concentration (PClow-PChigh) range observed for these models is 0-18.583 μg/m3, 18.583-25.023 μg/m3, 25.023-28.234 μg/m3, and 28.234-49.032 μg/m3, respectively, so these models can both predict the PM2.5 pollutant and forecast the air quality levels in a better way. On comparing the proposed models with various existing models, it was observed that the proposed models predict the PM2.5 pollutant with better performance and a reduced error rate.

1. Introduction

Nowadays, accurate air pollution prediction and forecasting have become a challenging and significant task because of increased air pollution, which is a fundamental problem in many parts of the world. Generally, pollution is divided into two types: (1) natural pollution, caused by volcanic eruptions and forest fires that emit SO2, CO2, CO, NO2, and sulfate as air pollutants, and (2) man-made pollution, caused by human activities such as the burning of oils, discharges from industrial production processes, and transportation emissions, whose major air pollutant is PM2.5 [1]. PM2.5 has received much attention due to its destructive effects on human health, other kinds of creatures, and the environment [2]. Various studies testify that air pollution leads to respiratory and cardiovascular disease, death of animals and plants, acid rain, climate change, global warming, etc., thus causing economic losses and making the life of society difficult [3].

Regarding the effects of PM2.5 investigated over the last 25 years using a comparative analysis of ML techniques, Ameer et al. [4] have estimated that approximately 4.2 million people have died due to long-term exposure to PM2.5 in the atmosphere, while an additional 250,000 deaths have occurred due to ozone exposure [1]. In worldwide rankings of mortality risk factors, PM2.5 was ranked 5th and accounted for 7.6% of total deaths all over the world. From 1990 to 2015, the number of deaths due to air pollution increased, especially in China and India, with more than 20% of 1.1 million deaths worldwide attributed to respiratory diseases [5]. Hence, a huge amount of research has been carried out worldwide on topics like air pollution levels and air quality forecasts to control air pollution more effectively. Extensive research indicates that air pollution forecasting approaches can be roughly divided into three traditional classes: (1) statistical forecasting methods, (2) artificial intelligence methods [6], and (3) numerical forecasting methods [4].

PM2.5 pollutants are fine particles made up of a combination of gases and particles which are hazardous when released into the atmosphere [2]. These pollutants are mainly responsible for causing human respiratory diseases in one way or another and, when severe, can further aggravate the COVID-19 pandemic [7, 8], resulting in an increased death toll. The present models focus only on the PM2.5 pollutant because, from the survey, it is obvious that PM2.5 causes more severe issues in human beings than other pollutants, and it is the one that creates other pollutants. Statistical analysis for PM2.5 pollutant prediction is done using historical meteorological datasets. However, existing models are constrained to a few basic standard classification techniques; few models are used for forecasting, and the results showed poor error rate performance.

In this proposed approach, six different machine learning models [9], namely the linear regression model (LR), random forest model (RF), KNN model, ridge and lasso model (RL), XGBoost model (Xgb), and AdaBoost model (Adab), have been implemented to predict the PM2.5 pollutant using meteorological and PM2.5 pollutant historical datasets downloaded for the period 1st Jan 2014 to 1st Dec 2019. These data have been monitored continuously for 24 h at an hourly interval using the following meteorological features: temperature (T in °C), minimum temperature (Tm in °C), maximum temperature (TM in °C), total rain/snowmelt (PP in mm), humidity (H in %), wind speed (V in km/h), visibility (VV in km), and maximum sustained wind speed (VM in km/h). Also, the proposed machine learning models have been evaluated using statistical metrics such as Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Mean Square Error (MSE), Root Mean Square Error (RMSE), and R2. The results show better performance with decreased error rates when compared to traditional prediction models. This paper has been organized as follows. Section 2 discusses the related works, Section 3 introduces machine learning models for predicting PM2.5 and forecasting air quality, Section 4 presents model results and analysis, and Section 5 concludes the paper.

2. Related Works

In recent years, many prediction models have been developed for solving PM2.5 pollutant issues. Zhang et al. [10] used a light gradient boosting decision tree model to process high-dimensional data to predict PM2.5 within 24 h based on historical and predictive datasets and then compared it with various models using evaluation metrics such as Symmetric Mean Absolute Percentage Error (SMAPE), MAE, and RMSE.

Xu and Liu [11] reported a spatial ensemble model to predict PM2.5 for the Beijing railway station, but it is not reliable for other locations. Kim et al. [12] reported effects of the indoor PM2.5 pollutant, i.e., asthma attacks in children, based on peak breath flow rates, using deep learning methods for predicting respiratory disease risk. Caraka et al. [13] reported prediction of PM2.5 using the Markov chain stochastic process and VAR-NN-PSO. Using the PM2.5 feature of higher probability to pass through the lower respiratory tract, its range can be categorized into no risk (1-30), medium risk (30-48), and moderate risk (>49) in Chaozhou and Pingtung for the datasets obtained from Jan 2014 to May 2019.

Beelen et al. [14] established a multicenter cohort study in Europe to study the positive correlation between PM2.5 concentration and heart disease mortality during a long exposure period to PM2.5 [1, 15]. Tiwari et al. [16] considered an XGBoost model built on atmospheric data of Velachery and the database of the central control room collected from a commercial station in Tamil Nadu for air quality management. This model also considers highly unstable meteorological parameters such as relative humidity, wind speed, pressure, temperature, and wind direction of the geographic region.

Bing et al. [17] and Pasha et al. [18] reported a new model for forecasting the air quality index in China using support vector regression, and the results showed a decrease in MAPE when there is a robust interaction. Lin et al. [19] proposed a novel system based on a cloud model granulation algorithm for air quality forecasting through data exploration in three monitoring localities in Wuhan City with high accuracy.

Xiao et al. [20] identified a novel hybrid model by combining air mass trajectory analysis and wavelet transformation to improve the artificial neural network for forecasting the daily average concentrations of PM2.5. Soh et al. [21] recognized the data-driven model ST-DNN to predict PM2.5 time series data and other pollutants in seven locations for only 48 h using real-time Taiwan and Beijing datasets. Heni et al. [22] and Li et al. [23] used multivariate multistep time series prediction with random forest models to improve the performance and reduce the time complexity of air pollutant prediction models.

Regarding the effects of PM2.5 over the last 25 years, Ameer et al. [4] discussed a comparison among various regression techniques such as decision tree, random forest, gradient boosting, and ANN [24] multilayer perceptron regression with respect to error rate and processing time for forecasting air quality in smart cities.
In [25], a deep learning model consisting of a recurrent neural network with long short-term memory is used to predict local 8 h averaged surface ozone concentrations for 72 h based on hourly air quality and meteorological data measurements, as a tool to forecast air pollution values with a decreased error rate.

Deters et al. [26] and Sallauddin et al. [27] considered a machine learning method based on six years of meteorological and pollution data analyses in Belisario and Cotocollao to predict the concentrations of PM2.5 using wind direction, wind speed, and rainfall levels and then compared it to various ML algorithms such as BT, L-SVM [28], and ANN regression models. The high correlation between estimated and real data for a time series analysis during the wet season confirms a better prediction of PM2.5 when the climatic conditions are getting more dangerous or there are high-level conditions of precipitation or strong winds. Zhao et al. [29] and Ni et al. [30] introduced a multivariate linear regression model to achieve short-period prediction of PM2.5; the parameters included are aerosol optical depth data obtained through remote sensing and meteorological factors from ground monitoring: temperature, relative humidity, and wind velocity.

The present paper investigated different prediction models related to the PM2.5 pollutant which are statistically analyzed. The existing approaches have mostly implemented prediction models such as NN [31], L-SVM (Linear Support Vector Machines), BT (Boosted Trees), CGM, and NN (neural network) [26]; deep learning consisting of a recurrent neural network with long short-term memory [25]; decision tree, gradient boosting, random forest, ANN multilayer perceptron regression [4, 15], and a multivariate linear regression model [29]; AdaBoost, XGBoost, GBDT, LightGBM, and DNN [10]; and a predictive data feature exploration-based air quality prediction approach. In the proposed PM2.5 pollutant prediction, six different machine learning models have been used, and the results were compared with those of the above-mentioned existing models.

3. Machine Learning Models for Predicting PM2.5 and Forecasting Air Quality

In these proposed machine learning models to predict the PM2.5 pollutant, meteorological datasets were collected for 24 hours of the day from 1st Jan 2014 to 31st Dec 2019. The main objective of the proposed models is to apply various machine learning models to predict the PM2.5 pollutant range and its level of air quality in any polluted city. Though not more than three or four techniques in existing models have predicted the PM2.5 pollutant [4, 10, 25, 26, 29], here six different machine learning models, namely LR, RF, KNN, RL, Xgb, and Adab, were implemented to predict the PM2.5 pollutant with different hyperparameter tuning to increase the accuracy with a reduced error rate. The present models were initially preprocessed with various meteorological and PM2.5 pollutant datasets. During model creation, the datasets were split into a training set of 70% and a testing set of 30%. When compared with existing models' performance, the machine learning models achieve a better performance with minimum error rates.

3.1. Architecture for Machine Learning Models. Figure 1 represents the machine learning model for predicting the PM2.5 pollutant in the affected cities. Figure 1 consists of three layers: (1) the first layer is an input layer which has the PM2.5 pollutant and meteorological datasets for preprocessing and feature extraction, (2) the second layer contains six different machine learning models which are used to predict the PM2.5 pollutant along with its working principle, and (3) the output layer consists of certain steps like training models and testing models and then the final step to predict the PM2.5 pollutant range and to forecast its air quality level among the various categories.

3.2. Flowchart Representation. Figure 2 represents the flowchart for predicting the PM2.5 pollutant with the assistance of machine learning models. Here, the prediction process starts with real-time meteorological and PM2.5 pollutant historical datasets. The data are then preprocessed and feature extracted to remove unwanted data and obtain cleaned datasets for training the models. Six different models are then integrated for training and testing with real-time data. Finally, the predicted PM2.5 pollutant range is checked and used to forecast whether air quality levels are good or satisfactory before the models are deployed; otherwise, the models and datasets should be enhanced again.

3.3. Implementation of PM2.5 Pollutant Prediction Models. For all the models, the performance of training and testing was evaluated using metrics such as R2 (equation (1)), Mean Absolute Error (MAE) (equation (2)), Mean Absolute Percentage Error (MAPE) (equation (3)), Mean Square Error (MSE) (equation (4)), and Root Mean Square Error (RMSE) (equation (5)); the PM2.5 pollutant was evaluated in the same way.

R^2 = \left( \frac{\frac{1}{m}\sum_{i=1}^{m}\left(x_{\mathrm{observed}}(i)-\bar{x}_{\mathrm{observed}}\right)\left(x_{\mathrm{predicted}}(i)-\bar{x}_{\mathrm{predicted}}\right)}{\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(x_{\mathrm{observed}}(i)-\bar{x}_{\mathrm{observed}}\right)^{2}}\;\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(x_{\mathrm{predicted}}(i)-\bar{x}_{\mathrm{predicted}}\right)^{2}}} \right)^{2},  (1)

\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|x_{\mathrm{observed}}(i)-x_{\mathrm{predicted}}(i)\right|,  (2)

\mathrm{MAPE} = \frac{1}{m}\sum_{i=1}^{m}\frac{\left|x_{\mathrm{observed}}(i)-x_{\mathrm{predicted}}(i)\right|}{x_{\mathrm{observed}}(i)} \times 100,  (3)

\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(x_{\mathrm{observed}}(i)-x_{\mathrm{predicted}}(i)\right)^{2},  (4)

\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(x_{\mathrm{observed}}(i)-x_{\mathrm{predicted}}(i)\right)^{2}}.  (5)

3.4. Model Deployment for Forecasting Air Quality. To evaluate the PM2.5 pollutant concentration for forecasting the air quality level, equation (6) is used [4].

\mathrm{AQR} = \frac{\mathrm{AQR}_{\mathrm{high}}-\mathrm{AQR}_{\mathrm{low}}}{\mathrm{PC}_{\mathrm{high}}-\mathrm{PC}_{\mathrm{low}}}\left(\mathrm{PC}-\mathrm{PC}_{\mathrm{low}}\right)+\mathrm{AQR}_{\mathrm{low}},  (6)

where AQR is the air quality range, PC is the pollutant concentration, PClow is the concentration break point ≤ PC, PChigh is the concentration break point ≥ PC, AQRlow is the AQR break point corresponding to PClow, and AQRhigh is the AQR break point corresponding to PChigh.
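The evaluation metrics in equations (1)-(5) and the air quality range interpolation in equation (6) can be reproduced with a few lines of NumPy. The sketch below is illustrative only; the array names y_obs and y_pred and the example break points are assumptions for demonstration, not values taken from the study.

import numpy as np

def r2(y_obs, y_pred):
    # Equation (1): squared Pearson correlation between observed and predicted values
    num = np.mean((y_obs - y_obs.mean()) * (y_pred - y_pred.mean()))
    den = np.sqrt(np.mean((y_obs - y_obs.mean()) ** 2)) * np.sqrt(np.mean((y_pred - y_pred.mean()) ** 2))
    return (num / den) ** 2

def mae(y_obs, y_pred):
    # Equation (2): mean absolute error
    return np.mean(np.abs(y_obs - y_pred))

def mape(y_obs, y_pred):
    # Equation (3): mean absolute percentage error
    return np.mean(np.abs(y_obs - y_pred) / y_obs) * 100

def mse(y_obs, y_pred):
    # Equation (4): mean square error
    return np.mean((y_obs - y_pred) ** 2)

def rmse(y_obs, y_pred):
    # Equation (5): root mean square error
    return np.sqrt(mse(y_obs, y_pred))

def aqr(pc, pc_low, pc_high, aqr_low, aqr_high):
    # Equation (6): linear interpolation of a pollutant concentration onto the AQR scale
    return (aqr_high - aqr_low) / (pc_high - pc_low) * (pc - pc_low) + aqr_low

# Example with made-up numbers: a predicted concentration of 22 μg/m3 on the 0-30 "good" break points
print(aqr(22.0, 0.0, 30.0, 0.0, 30.0))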
Figure 1: Machine learning model for PM2.5 pollutant prediction and air quality forecasting. (Air quality categories shown in the figure: G = Good (0-30), Sa = Satisfactory (31-60), MP = Moderately Polluted (61-90), P = Poor (91-120), VP = Very Poor (121-250), Se = Severe (250+).)

Figure 2: Flowchart representations for predicting PM2.5 and air quality forecasting.
Figure 3: (a) Sample study area map for experimental purpose. (b) Variation among meteorological data.

4. Results and Analysis

4.1. Experiment Setup. This experiment was carried out using Jupyter Notebook on a computing system with an Intel(R) Core(TM) i5-2450M CPU at 2.50 GHz and 12 GB of RAM. The proposed machine learning models are exposed to data cleaning and feature extraction for training and testing using Python 3.7.3.
Figure 4: Overall PM2.5 variation with respect to time series.

Table 1: Meteorological and PM2.5 dataset analysis.

Observed datasets (years) | Samples obtained (from and to, months) | Samples not obtained (from and to, months) | Total samples obtained (24 h per day) | Mean of PM2.5 per year (μg/m3) | SD of PM2.5 per year
2014 | 01-01-2014, 1:00 AM to 01-12-2014, 24:00 PM | Nil | 6360 | 258 | 119.3437
2015 | 01-01-2015, 1:00 AM to 01-12-2015, 24:00 PM | Nil | 7584 | 228 | 90.30255
2016 | 01-01-2016, 1:00 AM to 01-12-2016, 24:00 PM | Nil | 8136 | 229 | 107.5823
2017 | 01-01-2017, 1:00 AM to 01-12-2017, 24:00 PM | Nil | 8616 | 221 | 94.87083
2018 | Data of all months are available except for the 7th month | 01-07-2018, 1:00 AM to 31-07-2018, 24:00 PM | 7536 | 215 | 88.63759
2019 | 01-01-2019, 1:00 AM to 01-12-2019, 24:00 PM | Nil | 8664 | 261 | 92.81299

4.2. Details about Meteorological and PM2.5 Datasets. Meteorological and PM2.5 historical datasets (anand-vihar, delhi-air-quality) were collected from the Delhi Pollution Control Committee (http://aqicn.org) for experimental purposes only, as shown in Figures 3(a), 3(b), and 4. These datasets include various climatic conditions based on T (°C), Tm (°C), TM (°C), PP (mm), H (%), V (km/h), VV (km), and VM (km/h) (Figure 3). The PM2.5 pollutant is shown in Figure 4. The data were obtained for 24 hours at an hourly interval from 1st Jan 2014 (1:00 AM) to 31st Dec 2019 (24:00 PM), and the data sources are stored in CSV file format. On average, the file stores 2044 * 24 = 49056 PM2.5 samples. For a year, approximately 8176 samples are observed, and for an hour, a maximum of two samples (approximately) is appended depending on climatic conditions. The remaining data are considered to be null values or improper data, which are removed using data preprocessing techniques. Further information about the datasets is presented in Table 1.

Using the datasets in Table 1, the variation of the ith PM2.5 daily concentration was measured in terms of statistical features such as mean and standard deviation, as shown in Figure 5, where "N" is the number of samples and "i" is a single sample in the ith PM2.5 range.

4.3. Statistical Information about Datasets. Table 2 represents the statistical analysis of both the meteorological and PM2.5 datasets, considering features such as T, TM, Tm, H, PP, VV, V, VM, and PM2.5. The datasets are evaluated using statistical features such as count, mean, SD, MIN, 25%, 50%, 75%, and MAX. The overall PM2.5 varies from 78 to 824 μg/m3 for 2014, from 61 to 494 μg/m3 for 2015, from 70 to 694 μg/m3 for 2016, from 71 to 612 μg/m3 for 2017, from 57 to 538 μg/m3 for 2018, from 38 to 658 μg/m3 for 2019, and from 38 to 824 μg/m3 for 2014-2019. Based on these statistics, the maximum PM2.5 pollutant range exceeds the default air quality forecasting limit levels, and this is indicated as "severe" in Table 2. So in this work, six different machine learning models were applied to minimize the PM2.5 pollutant range and are observed to predict air quality levels in a better way.
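As a rough sketch of the preprocessing and the 70%/30% split described above, the hourly CSV data could be handled with pandas and scikit-learn as shown below. The file name delhi_pm25.csv and the column names are assumptions made for illustration and may differ from the actual exported dataset schema.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names; the real CSV exported from the monitoring source may differ
df = pd.read_csv("delhi_pm25.csv")
features = ["T", "TM", "Tm", "H", "PP", "VV", "V", "VM"]

# Remove null or improper rows, as done in the preprocessing step
df = df.dropna(subset=features + ["PM2.5"])

X, y = df[features], df["PM2.5"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)  # 70% training, 30% testing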
Figure 5: Years vs. PM2.5 mean and SD.

Table 2: Statistical analysis of both meteorological and PM2.5 datasets (2014 to 2019).

Statistical features | T | TM | Tm | H | PP | VV | V | VM | PM2.5
Count | 2044 | 2044 | 2044 | 2044 | 2038 | 2044 | 2044 | 2044 | 2044
Mean | 23.98728 | 30.4362 | 19.60274 | 66.01761 | 3.085113 | 6.75093 | 4.114335 | 7.037818 | 219.8787
SD | 2.318939 | 2.879207 | 2.268557 | 14.38204 | 10.13789 | 0.637014 | 2.324433 | 3.311582 | 100.0151
MIN | 19.1 | 23.8 | 13.7 | 25 | 0 | 4 | 0.2 | 1.9 | 38
25% | 22.43359 | 28.50713 | 18.08281 | 56.38164 | -3.70727 | 6.32413 | 2.556964 | 4.819058 | 152.8685
50% | 22.48728 | 28.9362 | 18.10274 | 64.51761 | 1.585113 | 5.25093 | 2.614335 | 5.537818 | 218.3787
75% | 25.54097 | 32.36527 | 21.12267 | 75.65358 | 9.877499 | 7.177729 | 5.671705 | 9.256578 | 286.8888
MAX | 29.9 | 37.6 | 24.8 | 94 | 132.33 | 9.2 | 12.4 | 22.2 | 824
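A summary of the kind shown in Table 2 (count, mean, SD, minimum, quartiles, and maximum for every feature) can be produced directly with pandas, continuing the hypothetical DataFrame df from the earlier sketch:

# Count, mean, std, min, 25%, 50%, 75%, and max for every feature, as in Table 2
summary = df[["T", "TM", "Tm", "H", "PP", "VV", "V", "VM", "PM2.5"]].describe()
print(summary.round(4))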

4.4. Feature Extraction. Figure 6 represents the pair plot of feature extraction for the meteorological and PM2.5 pollutant datasets, in which the null values are cleared during preprocessing using the mean and SD. The x- and y-axes represent eight different meteorological features (T, TM, Tm, H, PP, VV, V, and VM) and the PM2.5 pollutant. Figure 7 represents the feature extraction using regression.

4.4.1. Heat Map for Correlating Coefficient between Features. Figure 8 represents the heat map used to find the cross-correlation between the different meteorological and PM2.5 pollutant features; values near 1 indicate a strong positive correlation, values near -1 indicate a negative correlation, and values near 0 mean the features are neutral, i.e., independent. Thus, the heat map is used to remove the unwanted (i.e., strongly correlated) features from the PM2.5 pollutant datasets.

4.4.2. Normal Distribution Curve Fitting (NDCF) for PM2.5. Figure 9 represents the curve fitting using a normal distribution for the PM2.5 pollutant datasets. The perfect fit range for the normal distribution curve is observed to be 0.0085, and this value can be satisfactorily considered near to 0.01. The x-axis shows the correlation coefficient features, and the y-axis shows the dependent feature of PM2.5.

4.5. Comparing NDCF among Machine Learning Models. Figure 10(a) represents the LR model curve fitting, showing a value of about 0.0085 with the correlation coefficient on the x-axis and the dependent feature of PM2.5 on the y-axis. Figure 10(b) represents the KNN model without hyperparameter tuning, which shows overfitting of the curve, while the curve fitting value is 0.0095 for the KNN model using hyperparameter tuning, shown in Figure 10(c). Figure 10(d) represents the RF model without hyperparameter tuning, which shows overfitting of the curve, while the curve fitting value is 0.0094 for the RF model using hyperparameter tuning, shown in Figure 10(e). Figure 10(f) represents the RL model without hyperparameter tuning, which likewise shows overfitting of the curve, while the curve fitting value is 0.0075 for the RL model using hyperparameter tuning, shown in Figure 10(g). Figure 10(h) represents the Xgb model without hyperparameter tuning, which likewise shows overfitting of the curve, while the curve fitting value is 0.0086 for the Xgb model using hyperparameter tuning, shown in Figure 10(i). Figure 10(j) represents the curve fitting for the Adab model with tuning, which is observed to be 0.0095, a perfect fit.
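A cross-correlation heat map of the kind shown in Figure 8 and a normal-distribution fit to PM2.5 as in Figure 9 can be sketched as follows. The snippet again assumes the hypothetical DataFrame df from the earlier sketch and uses seaborn and SciPy as one plausible tooling choice, not necessarily the exact tools used in this study.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

cols = ["T", "TM", "Tm", "H", "PP", "VV", "V", "VM", "PM2.5"]

# Pairwise correlation coefficients between the meteorological features and PM2.5
corr = df[cols].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()

# Fit a normal distribution to the PM2.5 values and overlay its density on the histogram
mu, sigma = stats.norm.fit(df["PM2.5"])
sns.histplot(df["PM2.5"], stat="density")
xs = np.linspace(df["PM2.5"].min(), df["PM2.5"].max(), 200)
plt.plot(xs, stats.norm.pdf(xs, mu, sigma))
plt.show()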
Figure 6: Feature extraction of PM2.5.

4.6. Performance Measures. Table 3 represents the performance results of the different machine learning models used to predict the PM2.5 pollutant. The results of LR, RF, KNN, RL, Xgb, and Adab for the various performance metrics are as follows: for MAE, their values are 55.12, 39.84, 49.13, 55.12, 8.27, and 9.23, respectively; for MAPE, their values are 2.69, 1.94, 2.40, 2.69, 0.40, and 0.45, respectively; for MSE, their values are 5157.17, 2980.71, 4889.74, 5157.17, 192.08, and 112.15, respectively; and for RMSE, their values are 71.81, 54.59, 69.92, 71.81, 13.85, and 10.59, respectively. From the above results, the Xgb, Adab, RF, and KNN models are considered to achieve better performance results by all measures when compared to the other models.

Table 4 represents the coefficient of determination in terms of R2 for LR, RF, KNN, RL, Xgb, and Adab. From Table 4, when the training set value is nearer to one, the performance is considered better. So the better performance results are the KNN train and test set values of 1.0 and -0.228, respectively; the Xgb train and test set values of 0.999 and 0.3072, respectively; and the RF train and test set values of 0.904 and 0.382, respectively.
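The six regressors can be trained and scored with scikit-learn and the xgboost package roughly as shown below, continuing the earlier 70%/30% split. The hyperparameter values are placeholders rather than the tuned settings behind Tables 3 and 4, and the ridge and lasso (RL) model is represented here as two separate estimators.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from xgboost import XGBRegressor

models = {
    "LR": LinearRegression(),
    "RF": RandomForestRegressor(n_estimators=100, random_state=42),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "Xgb": XGBRegressor(n_estimators=300, learning_rate=0.1),
    "Adab": AdaBoostRegressor(n_estimators=200, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name,
          "MAE=%.2f" % mean_absolute_error(y_test, pred),
          "RMSE=%.2f" % np.sqrt(mean_squared_error(y_test, pred)),
          "R2 train=%.3f" % model.score(X_train, y_train),
          "R2 test=%.3f" % r2_score(y_test, pred))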
Figure 7: Feature extraction using regression.

4.7. Comparative Analysis

4.7.1. Comparison in Terms of RMSE and MAE. Among all pollutants, only the PM2.5 pollutant is considered in the existing Xgb and Adab models [10] for comparison with the proposed models in terms of performance metrics like RMSE and MAE, because other types of models were not reported in the existing work.
Figure 8: Correlation coefficient matrix of PM2.5.

Figure 9: Normal distribution curve fitting for the PM2.5 pollutant.

In the case of the existing work, the RMSE for Xgb and Adab is observed to be 38.8253 and 38.825, respectively, while the MAE for Xgb and Adab is 27.054 and 32.957, respectively; in the case of the proposed models, the RMSE for Xgb and Adab is 13.85 and 10.59, respectively, while the MAE for Xgb and Adab is 8.27 and 9.23, respectively. On comparing these two sets of data, the proposed models give better results than the existing work, and regarding error rate, the existing model shows increased error rates compared to the proposed model, as represented in Table 5(a).

In the case of the existing work that uses the trajectory model and the trajectory-with-wavelet model to predict the PM2.5 pollutant [20], 2 days for each monitoring station (a, b, c, and d) are considered with RMSE and MAE as evaluation metrics. But for comparison with the present model, only one station with one day is considered, because the error rate for the remaining days and other stations is higher than the proposed value. On comparing these two sets of data, the proposed models (Xgb and Adab) give better results than the existing work, and also regarding error rate, the existing model shows increased error rates compared to the proposed model, as represented in Table 5(b).
Figure 10: (a) LR model curve fitting. (b) KNN model without hyperparameter tuning. (c) KNN model using hyperparameter tuning.
(d) RF models without hyperparameter tuning. (e) RF model using hyperparameter tuning. (f) RL models without hyperparameter
tuning. (g) RL models using hyperparameter tuning. (h) Xgb models without hyperparameter tuning. (i) Xgb models using
hyperparameter tuning. (j) Curve fitting for the Adab model with tuning.

Table 3: Statistical validation of the proposed models using the following metrics.

S. no | Proposed models | MAE | MAPE | MSE | RMSE
1. | LR | 55.12 | 2.69 | 5157.17 | 71.81
2. | RF | 39.84 | 1.94 | 2980.71 | 54.59
3. | KNN | 49.13 | 2.40 | 4889.74 | 69.92
4. | RL | 55.12 | 2.69 | 5157.17 | 71.81
5. | Xgb | 8.27 | 0.40 | 192.08 | 13.85
6. | Adab | 9.23 | 0.45 | 112.15 | 10.59

Table 4: Statistical validation in terms of the correlation coefficient R2.

S. no | Proposed models | R2 train set | R2 test set
1. | LR | 0.401 | 0.320
2. | RF | 0.904 | 0.382
3. | KNN | 1.0 | -0.228
4. | RL | 0.4013 | 0.320
5. | Xgb | 0.999 | 0.3072
6. | Adab | 0.6055 | 0.4290

4.7.2. Comparison in Terms of MAPE. In the existing paper, the MAPE values for Linear Support Vector Machines (L-SVM), Boosted Trees (BT), the Convolutional Generalization Model (CGM), and neural networks (NN) are observed to be 41.8, 44.4, 15.0, and 40.7, respectively [26], while in the case of the proposed models, the MAPE values for LR, RF, KNN, RL, Xgb, and Adab are observed to be 2.69, 1.94, 2.40, 2.69, 0.40, and 0.45, respectively. This result clearly shows that the proposed models achieve better MAPE with decreased error rates for all six models when compared with the existing models, as shown in Table 6(a).

The proposed models use 2190 days of data for predicting PM2.5 with better results, while the existing VAR-NN-PSO model [13] shows a MAPE value of 3.57% for 180 days of PM2.5 data in Pingtung and a MAPE value of 4.87% in Chaozhou. This is shown in Table 6(b).
Table 5

(a) Comparison in terms of RMSE and MAE

Proposed models | Present RMSE | Present MAE | Existing RMSE | Existing MAE
Xgb | 13.85 | 8.27 | 33.0947 | 27.054
Adab | 10.59 | 9.23 | 38.825 | 32.957

(b) Comparison in terms of RMSE and MAE

Proposed models | Present RMSE | Present MAE | Existing models | Existing RMSE value for 1 day | Existing MAE value for 1 day
Xgb | 13.85 | 8.27 | Trajectory | 28.98 | 21.52
Adab | 10.59 | 9.23 | Trajectory with wavelet | 19.75 | 11.58

Table 6

(a) Comparison in terms of MAPE

Proposed models | Present MAPE | Existing models | Existing MAPE
LR | 2.69 | L-SVM | 41.8
RF | 1.94 | BT | 44.4
KNN | 2.40 | CGM | 15.0
RL | 2.69 | NN | 40.7
Xgb | 0.40 | |
Adab | 0.45 | |

(b) Comparison in terms of MAPE

Proposed models | Present MAPE | Existing MAPE
LR | 2.69 |
RF | 1.94 | 3.57
KNN | 2.40 |
RL | 2.69 |
Xgb | 0.40 | 4.87
Adab | 0.45 |

(c) Comparison in terms of MAPE

Proposed models | Present MAPE | Existing model | Existing MAPE
LR | 2.69 | | 5.70
RF | 1.94 | Spatial ensemble model | 13.90
KNN | 2.40 | | 28.78
RL | 2.69 | | 9.80
Xgb | 0.40 | |
Adab | 2.55 | |

In the case of the existing spatial ensemble model [11], one location with 4 quadrants is considered for the PM2.5 data, and the MAPE values obtained for the 1st, 2nd, 3rd, and 4th quarters are 5.7034%, 13.9070%, 28.7859%, and 9.8086%, respectively. But in the case of the proposed models, data from all polluted locations are considered for predicting PM2.5, and this is done in a better way than in the existing models, as shown in Table 6(c).

4.8. Deployment of the Models. To test the proposed models, various meteorological data are randomly selected from the datasets, such as T (25.3), TM (31.6), Tm (22.4), H (74), PP (0), VV (6.3), V (3.9), and VM (9.4), to predict the PM2.5 pollutant range. For Xgb, KNN, and Adab, the results obtained are 0-18.583 μg/m3, 18.583-25.023 μg/m3, and 25.023-28.234 μg/m3, respectively, which fall in the category of "good" air quality levels. Similarly, the RF range of 28.234-49.032 μg/m3 and the RL range of 49.032-51.334 μg/m3 fall in the category of "satisfactory" air quality levels. The LR range of 51.334-65.345 μg/m3 falls in the "moderately polluted" category. In the remaining default PM2.5 pollutant ranges, i.e., 91-120, 121-250, and 250+, none of the proposed machine learning models forecasts air quality levels. Comparing the models in the "good" air quality category, Xgb comes first, followed by KNN and then Adab, as shown in Table 7.
Table 7: Forecasting air quality levels.

S. no | Deployment models | Predicted PM2.5 range (PClow-PChigh) | Default PM2.5 range (AQRlow-AQRhigh) | Air quality levels | Impact on health
1. | Xgb | 0-18.583 | 0~30.0 | Good | Air is good for health
2. | KNN | 18.583-25.023 | 0~30.0 | Good | Air is good for health
3. | Adab | 25.023-28.234 | 0~30.0 | Good | Air is good for health
4. | RF | 28.234-49.032 | 31.0~60.0 | Satisfactory | Air is acceptable
5. | RL | 49.032-51.334 | 31.0~60.0 | Satisfactory | Air is acceptable
6. | LR | 51.334-65.345 | 61.0~90.0 | Moderately polluted | Irritation symptoms occur
- | No models were found | Not in predicted range | 91.0~120.0 | Poor | Cause respiratory diseases
- | No models were found | Not in predicted range | 121.0~250.0 | Very poor | Cause respiratory diseases
- | No models were found | Not in predicted range | 250+ | Severe | Cause respiratory diseases
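The deployment step described in Section 4.8 amounts to mapping a predicted PM2.5 concentration onto the default break points listed in Table 7. A minimal sketch of that lookup is given below; the function name and structure are illustrative and are not the authors' code.

# Default PM2.5 break points (AQRlow-AQRhigh) and categories from Table 7
CATEGORIES = [
    (0.0, 30.0, "Good"),
    (31.0, 60.0, "Satisfactory"),
    (61.0, 90.0, "Moderately polluted"),
    (91.0, 120.0, "Poor"),
    (121.0, 250.0, "Very poor"),
]

def air_quality_level(pm25):
    # Return the Table 7 category for a predicted PM2.5 concentration in μg/m3
    for low, high, label in CATEGORIES:
        if pm25 <= high:
            return label
    return "Severe"  # 250+ and anything above the listed break points

print(air_quality_level(18.6))  # a value in the KNN prediction range falls in "Good"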

5. Conclusions

Air pollution is harmful to both the environment and human existence. When some substances in the atmosphere exceed a certain concentration, the result is air pollution. One of the effective pollution control measures is to predict PM2.5 and to forecast the air quality. In the proposed work, the PM2.5 pollutant is predicted using meteorological datasets, and six different models (LR, RF, KNN, RL, Xgb, and Adab) are used for forecasting air quality levels. The results were evaluated using statistical metrics such as MAE, MAPE, MSE, RMSE, and R2. The better performance results for the coefficient of determination R2 are the KNN train and test set values of 1.0 and -0.228, respectively; the Xgb train and test set values of 0.999 and 0.3072, respectively; and the RF train and test set values of 0.904 and 0.382, respectively. Among the proposed models, from the results with respect to the MAE, MAPE, and RMSE metrics (8.27, 0.40, and 13.85; 9.23, 0.45, and 10.59; 39.84, 1.94, and 54.59; and 49.13, 2.40, and 69.92, respectively, for Xgb, Adab, RF, and KNN), it is evident that Xgb, Adab, KNN, and RF are reliable models when compared to the existing models. The PM2.5 pollutant (PClow-PChigh) range observed for these models is 0-18.583 μg/m3, 25.023-28.234 μg/m3, 18.583-25.023 μg/m3, and 28.234-49.032 μg/m3, respectively. It can be concluded that by using the proposed models, the PM2.5 pollutant can be predicted and, thereby, the air quality levels can be forecast in a better way. Finally, this research is useful for society, since forecasting air quality levels acts as an important tool to prevent air pollution by taking the necessary actions and steps to control the pollutants.

Data Availability

The data used to support the findings of this study are included within the article. Should further data or information be required, these are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there is no conflict of interest. The study was performed as a part of the employment.

Acknowledgments

The authors acknowledged the characterization support to complete this research work.

References

[1] L. Bai, J. Wang, X. Ma, and H. Lu, "Air pollution forecasts: an overview," International Journal of Environmental Research and Public Health, vol. 15, no. 4, p. 780, 2018.
[2] A. C. Kemp, B. P. Horton, J. P. Donnelly, M. E. Mann, M. Vermeer, and S. Rahmstorf, "Climate related sea-level variations over the past two millennia," Proceedings of the National Academy of Sciences, vol. 108, no. 27, pp. 11017-11022, 2011.
[3] J. Wang, H. Jiang, Q. Zhou, J. Wu, and S. Qin, "China's natural gas production and consumption analysis based on the multi-cycle Hubbert model and rolling Grey model," Renewable and Sustainable Energy Reviews, vol. 53, pp. 1149-1167, 2016.
[4] S. Ameer, M. Ali Shah, A. Khan et al., "Comparative analysis of machine learning techniques for predicting air quality in smart cities," IEEE Access, vol. 7, pp. 128325-128338, 2019.
[5] D. Zhu, C. Cai, T. Yang, and X. Zhou, "A machine learning approach for air quality prediction: model regularization and optimization," Big Data Cognitive Computing, vol. 2, no. 1, pp. 5-15, 2018.
[6] D. Ramesh, "Enhancements of artificial intelligence and machine learning," International Journal of Advanced Science and Technology, vol. 28, no. 17, pp. 16-23, 2019.
[7] M. G. H. David, R. Faner, O. Sibila, J. R. Badia, and A. Agusti, Do Chronic Respiratory Diseases or Their Treatment Affect the Risk of SARS-CoV-2 Infection, Elsevier Ltd. Science Direct, 2020.
[8] Y. Ying, L. Chang, and L. Wang, "Laboratory testing of SARS-CoV, MERS-CoV, and SARS-CoV-2 (2019-nCoV): current status, challenges, and countermeasures," Reviews in Medical Virology, vol. 30, no. 3, article e2106, 2020.
[9] M. Chitty, Artificial Intelligence, Machine Learning & Machine Learning Glossary & Taxonomy, Cambridge Health Institute, 2020.
[10] Y. Zhang, Y. Wang, M. Gao et al., "A predictive data feature exploration-based air quality prediction approach," IEEE Access, vol. 7, pp. 30732-30743, 2019.
[11] Y. Xu and H. Liu, "Spatial ensemble prediction of hourly PM2.5 concentrations around Beijing railway station in China," Air Quality, Atmosphere and Health, vol. 13, no. 5, pp. 563-573, 2020.
[12] D. Kim, S. Cho, L. Tamil, D. J. Song, and A. S. Seo, "Predicting asthma attacks: effects of indoor PM concentrations on peak expiratory flow rates of asthmatic children," IEEE Access, vol. 8, pp. 8791-8797, 2020.
[13] R. E. Caraka, R. C. Chen, T. Toharudin, B. Pardamean, H. Yasin, and S. H. Wu, "Prediction of status particulate matter 2.5 using state Markov chain stochastic process and HYBRID VAR-NN-PSO," IEEE Access, vol. 2, pp. 161654-161665, 2019.
[14] R. Beelen, O. Raaschounielsen, M. Stafoggia et al., "Effects of long-term exposure to air pollution on natural-cause mortality: an analysis of 22 European cohorts within the multicentre ESCAPE project," The Lancet, vol. 383, no. 9919, pp. 785-795, 2014.
[15] Y. Bai, Y. Li, X. Wang, J. Xie, and C. Li, "Air pollutants concentrations forecasting using back propagation neural network based on wavelet decomposition with meteorological conditions," Atmospheric Pollution Research, vol. 7, no. 3, pp. 557-566, 2016.
[16] R. Tiwari, S. Upadhyay, P. Singhal, U. Garg, and S. Bisht, "Air pollution level prediction system," International Journal of Innovative Technology and Exploring Engineering, vol. 8, no. 6C, 2019.
[17] C. L. Bing, B. Arihant, C. Pei-Chann, K. T. Manoj, and T. Cheng-Chin, "Urban air quality forecasting based on multi-dimensional collaborative support vector regression (SVR): a case study of Beijing-Tianjin-Shijiazhuang," PloS One, vol. 12, no. 7, article e0179763, 2017.
[18] S. N. Pasha, A. Harshavardhan, D. Ramesh, and S. S. Md, "Variation analysis of artificial intelligence, machine learning and advantages of deep architectures," International Journal of Advanced Science and Technology, vol. 28, no. 17, pp. 488-495, 2019.
[19] Y. Lin, L. Zhao, L. Haiyan, and Y. Sun, "Air quality forecasting based on cloud model granulation," EURASIP Journal on Wireless Communications and Networking, vol. 2018, no. 1, 10 pages, 2018.
[20] F. Xiao, Y. Li, J. Zhu, L. Hou, and J. W. Jin, "Artificial neural networks forecasting of PM2.5 pollution using air mass trajectory based geographic model and wavelet transformation," Atmospheric Environment, vol. 107, pp. 118-128, 2015.
[21] C. J. Soh and J. Huang, "Adaptive deep learning-based air quality prediction model using the most relevant spatial-temporal relations," IEEE Access, vol. 6, pp. 38186-38199, 2018.
[22] P. Heni and S. Saket, "Air pollution prediction system for smart city using data mining technique: a survey," Health, vol. 6, no. 12, pp. 990-999, 2019.
[23] J. Li, L. Xiaoli, and K. Wang, "Atmospheric PM2.5 concentration prediction based on time series and interactive multiple model approach," Advances in Meteorology, vol. 2019, article 1279565, pp. 1-11, 2019.
[24] V. Veeramsetty and R. Deshmukh, "Electric power load forecasting on a 33/11 kV substation using artificial neural networks," SN Applied Sciences, vol. 2, no. 5, pp. 1-10, 2020.
[25] B. S. Freeman, G. Taylor, B. J. Gharabaghi, and J. Thé, "Forecasting air quality time series using deep learning," Journal of the Air & Waste Management Association, vol. 68, no. 8, pp. 866-886, 2018.
[26] J. K. Deters, R. Zalakeviciute, M. Gonzalez, and Y. Rybarczyk, "Modeling PM2.5 urban pollution using machine learning and selected meteorological parameters," Journal of Electrical and Computer Engineering, vol. 2017, Article ID 5106045, 14 pages, 2017.
[27] M. Sallauddin, D. Ramesh, A. Harshavardhan, and S. N. S. Pasha, "A comprehensive study on traditional AI and ANN architecture," International Journal of Advanced Science and Technology, vol. 28, no. 17, pp. 479-487, 2019.
[28] A. Harshavardhan and B. Suresh, "An improved brain tumor segmentation and classification method using SVM with various kernels," Journal of International Pharmaceutical Research, vol. 46, no. 2, pp. 489-495, 2019.
[29] R. Zhao, X. Gu, B. Xue, J. Zhang, and W. Ren, "Short period PM2.5 prediction based on multivariate linear regression model," PloS One, vol. 13, no. 7, article e0201011, 2018.
[30] X. Ni, H. Huang, and W. Du, "Relevance analysis and short-term prediction of PM2.5 concentrations in Beijing based on multi-source data," Atmospheric Environment, vol. 150, pp. 146-161, 2017.
[31] N. H. A. Rahman, M. H. Lee, Suhartono, and M. T. Latif, "Artificial neural networks and fuzzy time series forecasting: an application to air quality," Quality and Quantity, vol. 49, no. 6, pp. 2633-2647, 2015.
