Air Quality Forecasting Using Machine Learning

Ms. S. Lavanya
Assistant Professor, Dept of IT,
Velammal College of Engineering and Technology,
Viraganoor, Madurai - 625 009.
[email protected]

Senthil Kumar S, Anandhaselvam M
Dept of Information Technology,
Velammal College of Engineering and Technology,
Viraganoor, Madurai - 625 009.
[email protected], [email protected]

Abstract—Air quality index (AQI) forecasting is a critical component of environmental monitoring and public health management. Accurate predictions of AQI levels are essential for providing timely information to the public and relevant authorities to mitigate the health impacts of poor air quality. This research focuses on the application of the Facebook Prophet (FB Prophet) model, a powerful time series forecasting tool, to predict AQI values. The study explores the potential of FB Prophet in capturing complex patterns and trends in air quality data, providing reliable short-term and long-term forecasts. FB Prophet's ability to handle seasonality, outliers, and missing data makes it well-suited for modeling AQI, a time series dataset with inherent variations. The research involves data collection from various air quality monitoring stations, preprocessing, and feature engineering. FB Prophet is then utilized to train forecasting models using historical AQI data. Performance evaluation is conducted using quantitative indices to assess the accuracy and efficiency of the FB Prophet model in predicting AQI values. Preliminary results suggest that FB Prophet can provide highly accurate and efficient forecasts of AQI, with the potential to contribute significantly to early warning systems and public health initiatives aimed at mitigating the adverse effects of air pollution. The study underscores the importance of utilizing advanced forecasting techniques in the field of air quality management and provides valuable insights for future research and applications in environmental and public health sectors.

Keywords—Air Pollution Monitoring, Machine Learning, Predictive Models, Forecasting, FB Prophet Model, Regression.

I. INTRODUCTION
Air pollution is dangerous to human health and should be reduced quickly in both urban and rural areas, so it is necessary to predict air quality accurately. There are many types of pollution, such as water, air, and soil pollution, but air pollution is the most important of these and should be controlled immediately, because humans inhale oxygen from the air.

There are various causes of air pollution. Outdoor air pollution is caused by industries, factories, and vehicles; indoor air pollution arises when the air inside a house is contaminated by smoke, chemicals, or odors. The two types of pollutants that cause air pollution are primary pollutants and secondary pollutants.

Primary Pollutants include: -
Carbon dioxide (CO2): Carbon dioxide plays an important role in air pollution. It is a greenhouse gas, and its increase in the air is a major cause of global warming. CO2 is exhaled by humans and is also released by the burning of fossil fuels.
Sulphur oxide (SOx): Sulphur dioxide (SO2) is released by burning coal and petroleum and by various industries. When it reacts in the presence of a catalyst (NO2), it forms H2SO4, which causes acid rain, a major form of air pollution.
Nitrogen oxide (NOx): Most commonly nitrogen dioxide (NO2), which is produced by thunderstorms and by rises in temperature.
Carbon monoxide (CO): Carbon monoxide is produced by burning coal and wood and is released by vehicles. It is an odorless, colorless, toxic gas. It contributes to smog and is therefore a primary air pollutant.
Toxic metals: Examples are lead and mercury.
Chlorofluorocarbons (CFC): Chlorofluorocarbons are released by air conditioners and refrigerators. They react with other gases and damage the ozone layer, so that ultraviolet rays reach the earth's surface and harm human beings.
Garbage, sewage, and industrial processes also cause air pollution.
Particles in solid or liquid form originating from dust storms, forests, and volcanoes also cause air pollution.

Secondary Pollutants include: -
Ground-level ozone: It occurs just above the earth's surface and forms when hydrocarbons react with nitrogen oxides in the presence of sunlight.
Acid rain: It forms when sulphur dioxide reacts with nitrogen dioxide, oxygen, and water in the air, and it falls to the ground in dry or wet form.

The difference between the two is that primary pollutants are released into the air directly from a source, whereas secondary pollutants are formed by reactions with either primary pollutants or other atmospheric components. Many pollutants contribute to air pollution, but PM2.5 is the major air pollutant according to J. Angelin Jebamalar and A. Sasi Kumar (2019), who report the best results in predicting the level of PM2.5 in their research [13]. Logistic regression and autoregression help in determining the level of PM2.5 [4]. Day-wise prediction of pollutant levels [1] was later refined by various authors, who predicted hourly data using different algorithms. Benzene concentration also contributes to air pollution, and it can be determined together with CO [7]. These are the causes of air pollution. Air pollution has harmful effects on human beings and plants. It causes conditions ranging from less threatening ones, such as irritation of the throat and nose and headache, to the most severe diseases, such as respiratory
problems, shortness of breath, lung cancer, brain disease, and kidney disease, and it can even lead to death. Masks offer some protection against rising air pollution, and various acts exist to control it. It is also necessary to create awareness about air pollution among people.

It is necessary to predict air quality accurately. Various traditional methods exist to measure it, but their results are not accurate and they involve many mathematical calculations. Machine learning, a subset of artificial intelligence, plays an important role in predicting air quality, and much research is being done on measuring the air quality index with machine learning algorithms. The first necessary step in controlling air pollution is therefore to measure the air quality index accurately, and machine learning algorithms play an important role in doing so. In this paper, various algorithms are compared under different conditions in different areas, and the neural network gives the best results [1]. Section II presents the literature survey, Section III describes the methodology and the results, Section IV outlines future work, and Section V concludes.

II. LITERATURE SURVEY
1. Wang, J. S., Wang, Y., Zhao, M. X., et al. (2019): Wang et al. applied the ARIMA model to predict the air quality index in Suzhou. Their study provided insights into time-series forecasting techniques for air quality assessment.

2. Yang, S., and L. Zhao (2017): Yang and Zhao utilized the Random Forest algorithm for urban air quality forecasting, demonstrating the effectiveness of ensemble learning methods in predicting air pollutant levels.

3. Chang Tianjun et al. (2019): Tianjun and colleagues proposed a Prophet-Stochastic Forest Optimization model for predicting air quality index size. Their approach aimed to optimize prediction accuracy by integrating machine learning with stochastic forest optimization techniques.

4. S. De Vito et al. (2008): De Vito et al. conducted on-field calibration of an electronic nose for benzene estimation in urban pollution monitoring scenarios. Their research contributed to the development of sensor-based methods for air pollutant detection.

5. Gong, Feixiang et al. (2020): Gong and his team analyzed building power consumption trends using the Prophet algorithm. While not directly related to air quality prediction, their work demonstrated the versatility of machine learning algorithms in analyzing environmental data.

6. Zunic, Emir et al. (2020): Zunic and collaborators applied Facebook's Prophet algorithm for successful sales forecasting based on real-world data. Although not specific to air quality prediction, their study showcased the applicability of machine learning algorithms in diverse domains.

7. Kourou, Konstantina et al. (2015): Kourou et al. reviewed machine learning applications in cancer prognosis and prediction, highlighting the potential of advanced algorithms in healthcare. While not directly related to air quality prediction, their work underscored the importance of machine learning in various fields.

8. Argue, C. J. et al. (2022): Argue and colleagues proposed robust Secretary and Prophet algorithms for packing integer programs. While not directly related to air quality prediction, their research demonstrated advancements in algorithm development that could be applicable to machine learning in various domains.

9. Yu, Q et al. (2020): Yu and his team developed an improved denoising autoencoder for maritime image denoising and semantic segmentation. While not directly related to air quality prediction, their work showcased advancements in deep learning techniques applicable to environmental data analysis.

10. Monil Patel et al. (2020): Patel et al. explored customer segmentation using machine learning techniques. While not specific to air quality prediction, their study demonstrated the applicability of machine learning in analyzing and segmenting large datasets for various purposes.

11. Zhili Zhao et al. (2020): Zhao et al. proposed combining forward with recurrent neural networks for hourly air quality prediction in Northwest China. Their study addressed the challenge of hourly prediction accuracy by leveraging the memory capabilities of recurrent neural networks.

12. Shu Wang et al. (2020): Wang and his team developed a model for predicting gas concentration using gated recurrent neural networks. Their research contributed to improving the accuracy of gas concentration prediction, which is crucial for air quality assessment.

13. Aditya C R et al. (2018): Aditya et al. worked on the detection and prediction of air pollution using various machine learning models. Their study provided insights into the application of machine learning for air quality assessment.

14. Timothy M. Amado & Jennifer C. Dela Cruz (2018): Amado and Dela Cruz developed machine learning-based predictive models for air quality monitoring and characterization. Their research aimed to provide accurate predictions of air pollutant levels for better environmental management.

III. METHODOLOGY
A. Dataset
The dataset used in this study is sourced from a research investigation conducted by De Vito et al. [4]. It comprises a total of 9,357 samples of hourly averaged responses from a set of five metal oxide chemical sensors integrated into a multi-sensor device for air quality assessment. The data collection took place in a highly polluted urban area within an Italian city, where the device was deployed in the field at street level. The data collection spanned from March 2004 to February 2005, encompassing nearly a year, making it one of the most extensive publicly available datasets for this
period. The dataset includes responses recorded by chemical sensors deployed in the field over an extended duration. Ground truth data was collected concurrently and consists of hourly averaged measurements for various air quality indicators, including Carbon Monoxide (CO), Non-Methane Hydrocarbons (NMHC), Benzene, Total Nitrogen Oxides (NOx), and Nitrogen Dioxide (NO2). The certified reference analyzer was co-located with the sensor device to provide reliable measurements. As noted in the associated paper [4], the dataset exhibits indications of cross-sensitivities among the sensors, concept drifts, and sensor drifts. These factors can potentially impact a sensor's accuracy in estimating air pollutant concentrations. The dataset also contains missing values, which are marked with the specific value -200.

In the data preprocessing phase, various steps were undertaken to ensure that the data was meaningful and usable for practical applications. Initially, missing data points were addressed using the data.dropna() function to eliminate entries with missing values. Subsequently, the dataset's columns were parsed and transformed, converting the measurements into floating-point values for further analysis.

B. Prophet Algorithm
Prophet is a highly efficient forecasting method that has been implemented in both R and Python [5-7]. The tool is known for its speed and for producing fully automated forecasts, which can subsequently be fine-tuned manually by analysts and data scientists. Prophet's primary strength lies in predicting time series data, particularly series with complex seasonal patterns occurring annually, monthly, weekly, daily, and during holiday periods. It is especially well-suited for datasets with strong seasonality and historical records spanning multiple seasons. One of the key advantages of Prophet is its resilience to outliers and missing data, as well as its ability to adapt to changing trends. This open-source software was contributed by Facebook's Core Data Science team, making it widely accessible to the community.

Prophet is fundamentally an additive regression model with a piecewise linear or logistic growth curve trend. It employs a weekly seasonal component represented through dummy variables, as well as a yearly seasonal component modeled using Fourier series. This dual approach allows Prophet to handle datasets that cover extended time periods with detailed historical observations, whether they occur at hourly, daily, or weekly intervals. The tool excels in scenarios involving multiple strong seasonal patterns, known significant but irregular events, missing data points, large outliers, or non-linear growth trends that approach an upper limit. Moreover, Prophet is both fast and accurate in its forecasts. It provides the flexibility to adjust parameters and also allows the creation of custom seasonality components, which can significantly enhance forecast accuracy. The tool also handles outliers and various other data-related challenges automatically. For holidays or significant events that may influence predictions, Prophet offers a holiday feature that adapts its forecasts accordingly. Additionally, the tool is equipped with automatic changepoint detection, enhancing its adaptability and reliability in time series forecasting tasks.

C. Metrics
In this research, after data cleaning, the Prophet algorithm was imported. Next, a training set was defined and the model was fitted to it. The model then outputs predictions together with their lower and upper limits, known as 'yhat_lower' and 'yhat_upper'. Finally, a line chart of the fitted trend with the predicted outcomes can be plotted. To evaluate the predictions, two typical metrics, MAPE and MAE, are employed. Mean absolute percentage error (MAPE) measures the accuracy of a forecasting process. It expresses the average deviation between the expected amount and the actual amount as the absolute percentage error averaged over all entries in the data set:

\mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right|    (1)

where A_t is the actual amount, F_t is the expected (forecast) amount, and n is the number of entries.

Mean absolute error (MAE) is a measure of errors between paired observations expressing the same phenomenon. Examples of Y versus X include comparisons of predicted versus observed, subsequent time versus initial time, and one technique of measurement versus an alternative technique of measurement. MAE is calculated as the sum of absolute errors divided by the sample size:

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - x_i \right|    (2)

D. Result and Discussion

No  ds          trend       yhat_lower  yhat_upper
1   2004-01-04  946.009417  790.067680  1104.635574
2   2004-01-11  943.188093  787.485341  1088.642583
3   2004-01-18  940.366770  776.806100  1094.341314
4   2004-02-08  931.902799  770.489706  1085.760666
5   2004-02-15  929.081475  777.884113  1100.201295

Table 1. The concentration of NOx

The dataset contains several air quality indicators, including CO, C6H6, NO2, and NMHC. In this research, NOx was used as an example to plot its trends. Table 1 above shows the trend measurements of the concentration of NOx together with the upper and lower predictions. The two line charts in Figure 1 and Figure 2 plot the daily and weekly trend measurements, respectively. Both are nearly smooth lines after the regression. It can be seen clearly from the scatter plot that the observed values fluctuate in a rather irregular way. Therefore, some errors cannot be avoided, which affects the accuracy of the result. This is acceptable because the environment surrounding people is changeable and not strictly regular.
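The fit, predict, and score workflow described in Section C can be sketched roughly as follows. This is a minimal illustration rather than the authors' code: the function names `mape`, `mae`, and `forecast_with_prophet` are hypothetical, the third-party `prophet` package is assumed to be installed for the forecasting step, and the sample values are the five yhat/y pairs reported in Table 2.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Actual values must be non-zero."""
    pairs = list(zip(actual, forecast))
    return 100.0 * sum(abs((a - f) / a) for a, f in pairs) / len(pairs)


def mae(actual, forecast):
    """Mean absolute error: sum of absolute errors divided by the sample size."""
    pairs = list(zip(actual, forecast))
    return sum(abs(a - f) for a, f in pairs) / len(pairs)


def forecast_with_prophet(history, periods, freq="W"):
    """Fit Prophet and return predictions with their lower and upper bounds.

    `history` is assumed to be a pandas DataFrame with the two columns
    Prophet expects: 'ds' (timestamps) and 'y' (observed values).
    """
    # Lazy import so the metric helpers work without the optional dependency.
    from prophet import Prophet

    model = Prophet()
    model.fit(history)
    future = model.make_future_dataframe(periods=periods, freq=freq)
    forecast = model.predict(future)
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]


# Predicted (yhat) and observed (y) NOx values copied from Table 2.
yhat = [946.0094, 943.1880, 940.3667, 931.9027, 929.0814]
y = [880.666667, 760.484990, 1490.333333, 869.108333, 706.395833]

mae_value = mae(y, yhat)
mape_value = mape(y, yhat)
print(f"MAE  = {mae_value:.2f}")
print(f"MAPE = {mape_value:.2f}%")
```

The metric helpers are kept in plain Python so they can be checked by hand against the tables; with a real forecast, the `yhat` column returned by `forecast_with_prophet` would be aligned with the held-out observations on `ds` before scoring.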
Fig 1. The daily prediction results.

Fig 2. The weekly prediction results.

yhat      yhat_lower  yhat_upper   y
946.0094  790.067680  1104.635574  880.666667
943.1880  787.485341  1088.642583  760.484990
940.3667  776.806100  1094.341314  1490.333333
931.9027  770.489706  1085.760666  869.108333
929.0814  777.884113  1100.201295  706.395833

Table 2. Comparison of the predictions and the realistic data

The line chart in Figure 3 illustrates the upper and lower predictions generated by the Prophet algorithm. These predictions exhibit a predominantly linear trend, even though there is inherent uncertainty in the actual values. The data points form a polygonal chain, and the predictions do not align closely with the actual outcomes. While the prediction plot captures the overall trend of the original dataset, it struggles to predict turning points and substantial changes accurately. To address these limitations, enhancements to the Prophet algorithm are warranted. Firstly, the algorithm can be refined to identify potential changepoints and respond to them proactively. Furthermore, it should offer adjustability in the changepoint detection scale. It is crucial to consider that the environment is dynamic and subject to continuous change, which makes the ability to adapt to significant events and transitions a valuable asset.

IV. FUTURE WORK
In the future, our project aims to improve the way it predicts air quality. We will work on making the predictions more accurate by using smarter methods and adjusting the settings. We also plan to make the predictions in real time and to team up with special devices to get better results in specific areas. In addition, we will create a user-friendly phone app that gives advice on what to do based on the air quality. We will look at different types of data sources and explore better computational techniques to improve our predictions further. We hope to use our predictions to help make good decisions about the environment. We will also listen to feedback and try to make the project work for more people and in more places, and we will make it easier for everyone to understand what our predictions mean.

V. CONCLUSION
In this research, a model for predicting air quality is proposed. It can also assist the government in learning about regional environmental problems and, to some extent, in developing useful policies by advising or cautioning individuals about their outdoor activities. The Prophet algorithm and Python are the main tools in this research. A dataset was employed to train the model, and from it the model successfully produced prediction results for a weekly air quality index. The prediction results were output by the code in the form of line charts. The accuracy of the model is acceptable and reliable. In the future, more specific issues can be discussed. For example, drastic changes related to specific festivals and seasons should be considered in order to avoid the limitations posed by the dataset and to further enhance the model's predictive capabilities.

REFERENCES
[1] Wang, J. S., Wang, Y., Zhao, M. X., et al. (2019). Application of ARIMA model in the prediction of air quality index in Suzhou. Journal of Public Health and Preventive Medicine, 30(2), 18-20.
[2] Yang, S., and L. Zhao. Application of Random Forest Algorithm in Urban Air Quality Forecast. Stat. Decis 20, 2017, 83-86.
[3] Chang Tianjun, et al. Prediction of air quality index size based on Prophet-Stochastic Forest Optimization model. Environmental Pollution and Prevention, 41.07, 2019.
[4] S. De Vito, E. Massera, M. Piga, L. Martinotto, G. Di Francia. On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sensors and Actuators B: Chemical, Volume 129, Issue 2, 22 February 2008, Pages 750-757, ISSN 0925-4005.
[5] Gong, Feixiang, et al. Trend analysis of building power consumption based on prophet algorithm. 2020 Asia Energy and Electrical Engineering Symposium (AEEES). IEEE, 2020.
[6] Zunic, Emir, et al. Application of facebook's prophet algorithm for successful sales forecasting based on real-world data. arXiv preprint arXiv:2005.07575, 2020.
[7] Kourou, Konstantina, et al. Machine learning applications in cancer prognosis and prediction. Computational and Structural Biotechnology Journal 13, 2015, 8-17.
[8] Argue, C. J., et al. Robust Secretary and Prophet Algorithms for Packing Integer Programs. Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). Society for Industrial and Applied Mathematics, 2022.
[9] Yu, Q, et al. Improved denoising autoencoder for maritime image denoising and semantic segmentation of USV. China Communications 17.3, 2020, 46-57.
[10] Monil, Patel, et al. Customer Segmentation Using Machine Learning. International Journal for Research in Applied Science and Engineering Technology (IJRASET) 8.6, 2020, 2104-2108.
[11] Zhili Zhao, Jian Qin, Zhaoshuang He, Huan Li, Yi Yang and Ruisheng Zhang. Combining forward with recurrent neural networks for hourly air quality prediction in Northwest of China. Environmental Science and Pollution Research, 2020.
[12] Shu Wang, Yuhuang Hu, Javier Burgués, Santiago Marco and Shih-Chii Liu. Prediction of Gas Concentration Using Gated Recurrent Neural Networks. IEEE, 2020.
[13] Aditya C R, Chandana R Deshmukh, Nayana D K and Praveen Gandhi Vidyavastu. Detection and Prediction of Air Pollution using Machine Learning Models. International Journal of Engineering Trends and Technology (IJETT), 2018.
[14] Timothy M. Amado & Jennifer C. Dela Cruz. Development of Machine Learning-based Predictive Models for Air Quality Monitoring and Characterization. IEEE, 2018.