Prediction of Air Quality Index Using Supervised Machine Learning
https://doi.org/10.22214/ijraset.2022.43993
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 10 Issue VI June 2022- Available at www.ijraset.com
Abstract: The proposed system describes strategies for predicting the Air Quality Index (AQI) using supervised machine learning techniques. The system examines machine learning algorithms for AQI prediction by computing the accuracy of each one and identifying the algorithm that yields the best precision. In addition, the system reports the accuracy figures of the different machine learning algorithms on the given dataset, together with a classification report and the corresponding confusion matrix. The results show the effectiveness of the recommended algorithm, whose performance can be compared in terms of accuracy, precision, recall and F1 score. The air pollution database contains data for each state of India. Four supervised machine learning algorithms, decision tree, random forest, Naïve Bayes and K-nearest neighbors, are compared and evaluated.
I. INTRODUCTION
Recent years have seen remarkable technological advancement. Instead of systems simply following written instructions, the philosophy of artificial intelligence, in which the system makes its own decisions, is gradually affecting every aspect of society. From early-stage start-ups to the major platform vendors, machine learning has become a focus area for companies of every size.
Machine learning is the field in which an artificial intelligence system collects data, for example from sensors, and learns the behaviour of its environment. The ability to train machine learning (ML) algorithms on historical data is one of the reasons machine learning is used to predict air quality indicators. The four machine learning algorithms used in the following system, decision tree, random forest, Naïve Bayes and k-nearest neighbors, are compared.
Many researchers have applied algorithms of this kind individually, but few compare them in a single study under the same conditions and on the same data. More generally, by collecting data and correlating it with behaviour over time, machine learning algorithms can learn associations and help teams tailor their products to demand.
In India, air pollution is a widespread problem. Daily air quality readings are frequently well above the levels considered safe for public health. The situation is worse in large urban areas such as Delhi, where the AQI has at times reached its maximum reported value of 999. The central government and state authorities have implemented a number of measures to reduce air pollution.
For the situation to improve, the first step is to predict the air quality index. In this project, six different classifiers have been built, based on different algorithms. The database is analysed with supervised machine learning techniques (SMLT) to capture information such as variable identification, univariate analysis, bivariate and multivariate analysis, missing-value treatment and exploratory data analysis.
In India, as in many other countries, the index is built around six key pollutants: particulate matter less than 10 micrometers in diameter (PM10), particulate matter less than 2.5 micrometers in diameter (PM2.5), carbon monoxide (CO), ammonia (NH3), nitrogen dioxide (NO2) and ozone (O3).
For each pollutant, a monitoring station supplies its concentration averaged over a standard period: for CO and O3 the averaging period is generally eight hours, while for the remaining pollutants a 24-hour average is used. The unit of measurement is micrograms per cubic metre (milligrams per cubic metre in the case of CO).
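As a rough illustration of these averaging rules, the sketch below applies 8-hour rolling means to CO and O3 and 24-hour rolling means to the other pollutants; the synthetic hourly readings and column names are assumptions, not the paper's actual monitoring data.

```python
import numpy as np
import pandas as pd

# Synthetic hourly readings (one row per hour); column names are assumptions,
# not the paper's actual monitoring-station schema.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((72, 6)) * 100,
    columns=["PM2.5", "PM10", "NH3", "NO2", "CO", "O3"],
)

# CO and O3 use an 8-hour average; the remaining pollutants use a 24-hour average.
averaged = pd.concat(
    [df[["CO", "O3"]].rolling(window=8).mean(),
     df[["PM2.5", "PM10", "NH3", "NO2"]].rolling(window=24).mean()],
    axis=1,
)
print(averaged.tail())
```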
A. System Architecture
In the first step, the user provides a dataset to the prediction model. The dataset supplied to the machine learning model is used to train it, and every new record entered by the user is treated as test data. The newly obtained data and the historical data are kept in a data warehouse. Pre-processing and validation form the next stage; pre-processing refers to the transformations applied to the data before it is passed to the algorithm.
Data pre-processing is a method for transforming raw data into a clean dataset. In other words, data gathered from different sources is raw and cannot be analysed directly. Validation begins by importing the dataset with the required library packages and then identifies the variables by examining the data shape and data types and by checking for missing and duplicate values.
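A minimal sketch of these validation checks with pandas might look as follows; the file name is a placeholder for whichever air quality dataset is used.

```python
import pandas as pd

# Placeholder file name; substitute the actual air quality dataset.
df = pd.read_csv("air_quality.csv")

print(df.shape)                 # number of rows and columns
print(df.dtypes)                # data type of each variable
print(df.isnull().sum())        # missing values per column
print(df.duplicated().sum())    # number of duplicate rows
```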
A validation dataset is a sample of data held back from model training and used to estimate model skill while tuning the model's hyperparameters; used carefully, validation and test datasets make it possible to evaluate models reliably.
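One common way to obtain such hold-out sets, sketched here with scikit-learn on synthetic data standing in for the real feature matrix and AQI labels, is to split the data twice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and AQI class labels.
rng = np.random.default_rng(0)
X = rng.random((1000, 6))            # six pollutant features
y = rng.integers(0, 3, size=1000)    # three illustrative AQI categories

# First split off 30% of the data, then divide it into validation and test halves.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```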
After pre-processing the data, several machine learning algorithms such as decision tree, random forest, Naïve Bayes, logistic regression and SVM are trained on the dataset to predict the AQI. The machine learning algorithm that gives the best accuracy is selected to build the prediction model.
Each model will have different performance characteristics. Using resampling methods such as cross-validation, you can estimate how accurate each model is likely to be on unseen data. A purely photographic method is not sufficient to estimate PM2.5, since it captures the concentration of only one pollutant. These accuracy estimates can then be used to choose the one or two best models from the suite that has been created.
It is useful to regularly compare the output of several distinctly different learning algorithms, and to build a test harness in Python with scikit-learn that can evaluate multiple algorithms side by side. The present work is based on PM2.5 only; collecting monitoring data from more cities, and including additional factors such as geomorphic conditions, would help verify how well the approach generalizes. Finally, an interactive graphical user interface (GUI) is developed using the Tkinter library in Python.
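As an illustration only, and not the paper's actual interface, a minimal Tkinter window of the kind described might look like the sketch below; predict_aqi is a hypothetical stand-in for the trained model's prediction routine.

```python
import tkinter as tk

def predict_aqi():
    # Hypothetical stand-in for calling the trained model with the entered value.
    pm25 = float(pm25_entry.get())
    result_label.config(text=f"Predicted AQI category for PM2.5={pm25}: (model output here)")

root = tk.Tk()
root.title("AQI Prediction")

tk.Label(root, text="PM2.5 concentration (µg/m³):").pack(padx=10, pady=5)
pm25_entry = tk.Entry(root)
pm25_entry.pack(padx=10, pady=5)

tk.Button(root, text="Predict", command=predict_aqi).pack(pady=5)
result_label = tk.Label(root, text="")
result_label.pack(padx=10, pady=10)

root.mainloop()
```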
B. Objective
The goal is to develop a machine learning model for air quality forecasting: an updatable supervised classification model whose predictions are reported together with the best accuracy obtained by comparing the supervised algorithms.
C. Problem Statement
The management and protection of air quality in many industrial and urban areas is a critical task, owing to emissions from transport, electric power generation and other sources; the accumulation of toxic gases presents a significant danger to the quality of life in smart cities. As air pollution rises, effective air quality monitoring models are needed to gather information on pollutant concentrations and to measure air pollution in all areas.
Data is of little use until it can be readily visualized, for example as graphs and charts. Being able to visualize samples of the data quickly is essential in both applied statistics and machine learning. When visualizing data in Python it helps to recognise the different styles of plot and how each can be used to understand your own data:
How to plot time series data with line charts and categorical quantities with bar charts.
How to summarise data distributions with histograms and box plots.
How to summarise the relationship between two variables with a scatter plot.
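A brief sketch of these plot types with matplotlib, using synthetic values in place of the actual pollution dataset, might look as follows:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
pm25 = rng.gamma(shape=2.0, scale=30.0, size=365)      # synthetic daily PM2.5 values
no2 = 0.3 * pm25 + rng.normal(0, 5, size=365)          # synthetic correlated NO2 values

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].plot(pm25)                    # line chart of the time series
axes[0, 0].set_title("PM2.5 over time")
axes[0, 1].hist(pm25, bins=30)           # histogram of the distribution
axes[0, 1].set_title("PM2.5 distribution")
axes[1, 0].boxplot(pm25)                 # box plot summarising spread and outliers
axes[1, 0].set_title("PM2.5 box plot")
axes[1, 1].scatter(pm25, no2, s=10)      # scatter plot of the relationship
axes[1, 1].set_title("PM2.5 vs NO2")
plt.tight_layout()
plt.show()
```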
Most machine learning algorithms are sensitive to the range and distribution of attribute values in the input data. Outliers in the input data can skew and mislead the training process, leading to longer training times, less reliable models and ultimately poorer performance.
Even before predictive models are trained, outliers can produce misleading representations and misleading interpretations of the collected data. In descriptive statistics such as the mean and standard deviation, and in plots such as histograms and scatter plots, outliers distort the summary of the distribution of attribute values.
A histogram is a bar representation of data that varies over a range: the range of values is plotted along the x-axis and the number of data points falling within each interval along the y-axis. (The classic iris flower dataset is often used to illustrate such histograms.) At the same time, outliers may be important cases in their own right, such as anomalies in fraud detection or computer security. If outliers and noise are not handled, the model cannot be fitted properly to the training data and there is no guarantee it will behave correctly on real data; the model needs the correct data characteristics and not too much noise.
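One simple and commonly used way to flag such outliers, shown here on a small hypothetical set of PM2.5 readings, is the interquartile range rule:

```python
import pandas as pd

# Hypothetical pollutant readings; in practice this would be a dataset column.
pm25 = pd.Series([12, 18, 25, 30, 28, 22, 400, 19, 27, 31])

q1, q3 = pm25.quantile(0.25), pm25.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = pm25[(pm25 < lower) | (pm25 > upper)]
print(outliers)   # the reading of 400 is flagged as an outlier
```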
Cross-validation is a method in which a model is trained on one subset of the data and then evaluated on the complementary subset.
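A minimal cross-validation sketch with scikit-learn, again on synthetic data rather than the paper's dataset, is:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((500, 6))               # six synthetic pollutant features
y = rng.integers(0, 3, size=500)       # three illustrative AQI categories

# 5-fold cross-validation: train on four folds, validate on the remaining fold.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores.mean(), scores.std())
```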
1) Data Validation
Data processing, data analysis, and the work of addressing data content, consistency and structure can add up to a time-consuming list of tasks. Understanding the data and its properties during this data recognition process helps in selecting which algorithm to use to construct the model. Time series data, for example, can be analysed with regression algorithms, while classification algorithms can be used to evaluate discrete data.
2) Data Pre-Processing
Pre-processing refers to the preparation of the data before it is passed to the algorithm. It is a method for transforming raw data into a clean collection of data; in other words, data obtained in raw form from various sources is not suitable for analysis. Proper pre-processing is needed for the machine learning pipeline to obtain improved outcomes from the fitted model. For example, the random forest algorithm does not accept null values, so null values in the initial raw data must be handled before a random forest can be trained. Certain machine learning models also require the data in a defined format, and it is important that the dataset is structured so that more than one machine learning or deep learning algorithm can be applied to it.
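A short sketch of such null-value handling before training a random forest is given below; the column names, labels and fill strategy are illustrative assumptions, not the paper's actual schema.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Illustrative records with missing readings; column names are assumptions.
df = pd.DataFrame({
    "PM2.5": [45.0, None, 120.0, 60.0, 300.0],
    "NO2":   [20.0, 25.0, None,  30.0, 80.0],
    "label": ["Good", "Good", "Poor", "Moderate", "Severe"],
})

# Fill missing pollutant readings with the column median (one simple strategy).
features = df[["PM2.5", "NO2"]].fillna(df[["PM2.5", "NO2"]].median())

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(features, df["label"])
print(model.predict(features[:2]))
```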
3) Decision Tree
The decision tree algorithm falls into the supervised learning category. It works for both continuous and categorical output variables.
Assumptions of the decision tree:
At the beginning, the whole training set is considered as the root.
Attributes are assumed to be categorical when computing information gain; continuous attributes are discretized first.
Records are distributed recursively on the basis of attribute values.
Statistical measures are used to select attributes as the root or as internal nodes.
Decision trees are a technique for classifying and analysing collected data. A decision tree forms the basic recursive structure of the classification process: a case described by a collection of attributes is passed through a structure of nodes and leaves and assigned to one of a set of disjoint classes. Each internal node tests a specific attribute, and each leaf denotes a class; the test usually compares an attribute value with a constant. A leaf either assigns every case reaching it to a single class or spreads a probability distribution over the possible classes.
The process continues until a termination condition is met; the tree is built in a recursive, divide-and-conquer, top-down fashion. All attributes should be categorical, otherwise they are discretized in advance. Attributes near the top of the tree have the greatest effect on the classification, and the information gain principle is used to select them. A tree with too many branches can easily over-fit and reflect disturbances caused by noise or outliers.
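A minimal scikit-learn sketch of such a tree, using the entropy criterion as a stand-in for information gain and synthetic data in place of the pollution dataset, is:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.random((500, 6))                       # six synthetic pollutant features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)      # synthetic two-class AQI label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# criterion="entropy" selects splits by information gain;
# max_depth limits branching to reduce over-fitting.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0)
tree.fit(X_train, y_train)
print(accuracy_score(y_test, tree.predict(X_test)))
```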
4) Random Forest
A random forest is an ensemble of decision trees in which each tree is grown using a random selection of features. The generalization error of the forest depends on the strength of the individual trees and on the correlation between them, so the number and diversity of the trees matter. Randomly selecting the features considered for splitting at each node yields error rates comparable to other methods while being more robust to noise.
The basic steps in carrying out the random forest algorithm are:
From the given dataset, choose N random records.
Build a decision tree on the basis of those N records.
Choose the number of trees required and repeat the two steps above for each tree.
For a regression problem, each tree in the forest predicts a value of Y for a new record; the final value is obtained by averaging the predictions of all the trees in the forest.
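For the regression case just described, a small sketch with scikit-learn (synthetic values standing in for the real measurements) could be:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 6))                                    # six synthetic pollutant features
y = 200 * X[:, 0] + 50 * X[:, 1] + rng.normal(0, 5, 500)    # synthetic AQI values

# 100 trees, each grown on a bootstrap sample; the prediction is
# the average of the individual trees' predictions.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))
```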
For the estimation of results, the data is split into two groups, a training set and a test set, typically in a 7:3 ratio. Models built with the random forest, logistic regression, decision tree, K-nearest neighbors, support vector classifier (SVC) and Naïve Bayes algorithms are fitted on the training set, and prediction on the test set is evaluated on the basis of accuracy.
The raw recorded data contains meteorological information for various cities in India. The idea is to use various machine learning techniques such as logistic regression, support vector machines, random forests, K-nearest neighbors, Naïve Bayes classifier and decision tree for model building and comparison.
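A sketch of such a comparison, a simple harness over these six model families with a 70/30 split and synthetic data in place of the real meteorological records, might look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 6))                      # six synthetic pollutant features
y = (X[:, 0] + X[:, 2] > 1.0).astype(int)      # synthetic two-class AQI label

# 70/30 split, matching the 7:3 ratio described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.4f}")
```

Swapping the synthetic arrays for the real feature matrix and AQI labels would reproduce the kind of comparison summarised later in Table 1.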
D. Decision Tree
As described above, a decision tree assigns a case, described by a collection of attributes, to one of a set of disjoint classes by passing it through a structure of nodes and leaves, where each internal node tests a specific attribute and each leaf denotes a class. After applying the decision tree algorithm, the prediction accuracy obtained was 99.88%.
Figure 5.7: Classification report of Decision Tree
E. Testing
The classification reports of all four algorithms were analysed and evaluated on the dataset, and the accuracy of each algorithm was compared with the others, as shown in Table 1.
Algorithm            Accuracy (%)
Naive Bayes          97.38
Random Forest        99.16
K-Nearest Neighbors  97.61
Decision Tree        99.88
Table 1: Comparison of the accuracy of each algorithm
The decision tree, with 99.88 percent accuracy, is the most effective of the methods compared, whereas the least accurate algorithm is Naïve Bayes with 97.38 percent. Random forest achieves 99.16 percent, very close to the decision tree; this is expected, since a random forest is itself an ensemble of decision trees and therefore behaves similarly. Logistic regression, Naïve Bayes and KNN all achieve approximately 97 percent accuracy.
5) Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test
data for which the true values are known. It allows the visualization of the performance of an algorithm.
6) F1 Score
The F1 score (also F-score or F-measure) is a measure of a test's accuracy. It takes into account both the precision p and the recall r of the test: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all samples that should have been identified as positive. The F1 score is the harmonic mean of precision and recall, F1 = 2pr / (p + r).
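A short sketch of these metrics with scikit-learn, using small illustrative label vectors rather than the paper's actual predictions, is:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Illustrative true and predicted labels (not the paper's results).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print(p, r, f1_score(y_true, y_pred))   # f1 equals 2*p*r / (p + r)
```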
7) PM2.5
PM2.5 refers to atmospheric particulate matter smaller than 2.5 micrometers in diameter, roughly 3 percent of the diameter of a human hair. Particles in this category are so small that they can only be observed with an electron microscope. They are even smaller than PM10 particles, which are 10 micrometers or less in diameter.