Prediction of Air Quality Index Using Supervised Machine Learning
https://doi.org/10.22214/ijraset.2022.43993
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 10 Issue VI June 2022- Available at www.ijraset.com
Abstract: The proposed system describes strategies for predicting the Air Quality Index (AQI) using supervised machine learning techniques. The system examines machine learning algorithms for AQI prediction by computing the accuracy of each one and identifying the algorithm that yields the best precision. In addition, the system reports the accuracy figures of the different machine learning algorithms on the given dataset, together with a classification report and the corresponding confusion matrix. The results show the effectiveness of the recommended algorithm, whose performance can be compared in terms of accuracy, precision, recall and F1 score. The air pollution database contains data for each state of India. Four supervised machine learning algorithms, decision tree, random forest, Naïve Bayes and K-nearest neighbors, are compared and evaluated.
I. INTRODUCTION
Recent years have seen remarkable technological advancement. Instead of systems simply following written instructions, the philosophy of artificial intelligence, in which the system makes its own decisions, is gradually affecting every aspect of society. From early-stage start-ups to the major platform vendors, machine learning has become a focus area for companies of every size.
Machine learning is the field in which an artificial intelligence system collects data, for example from sensors, and learns the behaviour of its environment. The ability to train machine learning (ML) algorithms on historical data is one of the reasons machine learning is used to predict air quality indicators. The four machine learning algorithms used in the following system, decision tree, random forest, Naïve Bayes and k-nearest neighbors, are compared.
Many researchers have applied algorithms of this kind individually, but few compare them in a single study under the same conditions and on the same data. More generally, by collecting data and correlating it with behaviour over time, machine learning algorithms can learn associations and help teams tailor their products to demand.
In India, air pollution is a widespread problem. Daily air quality readings are frequently well above the levels considered safe for public health. The situation is worse in large urban areas such as Delhi, where the AQI has at times reached its maximum reported value of 999. The central government and state authorities have implemented a number of measures to reduce air pollution.
For the situation to improve, the first step is to predict the air quality index. In this project, six different classifiers have been built, based on different algorithms. The database is analysed with supervised machine learning techniques (SMLT) to capture information such as variable identification, univariate analysis, bivariate and multivariate analysis, missing-value treatment and exploratory data analysis.
In India, as in many other countries, the index is built around six key pollutants: particulate matter less than 10 micrometers in diameter (PM10), particulate matter less than 2.5 micrometers in diameter (PM2.5), carbon monoxide (CO), ammonia (NH3), nitrogen dioxide (NO2) and ozone (O3).
For each pollutant, a monitoring station supplies its concentration averaged over a standard period: for CO and O3 the averaging period is generally eight hours, while for the remaining pollutants a 24-hour average is used. The unit of measurement is micrograms per cubic metre (milligrams per cubic metre in the case of CO).
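As a rough illustration of these averaging rules, the sketch below applies 8-hour rolling means to CO and O3 and 24-hour rolling means to the other pollutants; the synthetic hourly readings and column names are assumptions, not the paper's actual monitoring data.

```python
import numpy as np
import pandas as pd

# Synthetic hourly readings (one row per hour); column names are assumptions,
# not the paper's actual monitoring-station schema.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((72, 6)) * 100,
    columns=["PM2.5", "PM10", "NH3", "NO2", "CO", "O3"],
)

# CO and O3 use an 8-hour average; the remaining pollutants use a 24-hour average.
averaged = pd.concat(
    [df[["CO", "O3"]].rolling(window=8).mean(),
     df[["PM2.5", "PM10", "NH3", "NO2"]].rolling(window=24).mean()],
    axis=1,
)
print(averaged.tail())
```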
A. System Architecture
In the first step, the user provides a dataset to the prediction model. The dataset supplied to the machine learning model is used to train it, and every new record entered by the user is treated as test data. The newly obtained data and the historical data are kept in a data warehouse. Pre-processing and validation form the next stage; pre-processing refers to the transformations applied to the data before it is passed to the algorithm.
Data pre-processing is a method for transforming raw data into a clean dataset. In other words, data gathered from different sources is raw and cannot be analysed directly. Validation begins by importing the dataset with the required library packages and then identifies the variables by examining the data shape and data types and by checking for missing and duplicate values.
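A minimal sketch of these validation checks with pandas might look as follows; the file name is a placeholder for whichever air quality dataset is used.

```python
import pandas as pd

# Placeholder file name; substitute the actual air quality dataset.
df = pd.read_csv("air_quality.csv")

print(df.shape)                 # number of rows and columns
print(df.dtypes)                # data type of each variable
print(df.isnull().sum())        # missing values per column
print(df.duplicated().sum())    # number of duplicate rows
```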
A validation dataset is a sample of data held back from model training and used to estimate model skill while tuning the model's hyperparameters; used carefully, validation and test datasets make it possible to evaluate models reliably.
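One common way to obtain such hold-out sets, sketched here with scikit-learn on synthetic data standing in for the real feature matrix and AQI labels, is to split the data twice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and AQI class labels.
rng = np.random.default_rng(0)
X = rng.random((1000, 6))            # six pollutant features
y = rng.integers(0, 3, size=1000)    # three illustrative AQI categories

# First split off 30% of the data, then divide it into validation and test halves.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```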
After pre-processing the data, several machine learning algorithms such as decision tree, random forest, Naïve Bayes, logistic regression and SVM are trained on the dataset to predict the AQI. The machine learning algorithm that gives the best accuracy is selected to build the prediction model.
Each model will have different performance characteristics. Using resampling methods such as cross-validation, you can estimate how accurate each model is likely to be on unseen data. A purely photographic method is not sufficient to estimate PM2.5, since it captures the concentration of only one pollutant. These accuracy estimates can then be used to choose the one or two best models from the suite that has been created.
It is useful to regularly compare the output of several distinctly different learning algorithms, and to build a test harness in Python with scikit-learn that can evaluate multiple algorithms side by side. The present work is based on PM2.5 only; collecting monitoring data from more cities, and including additional factors such as geomorphic conditions, would help verify how well the approach generalizes. Finally, an interactive graphical user interface (GUI) is developed using the Tkinter library in Python.
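As an illustration only, and not the paper's actual interface, a minimal Tkinter window of the kind described might look like the sketch below; predict_aqi is a hypothetical stand-in for the trained model's prediction routine.

```python
import tkinter as tk

def predict_aqi():
    # Hypothetical stand-in for calling the trained model with the entered value.
    pm25 = float(pm25_entry.get())
    result_label.config(text=f"Predicted AQI category for PM2.5={pm25}: (model output here)")

root = tk.Tk()
root.title("AQI Prediction")

tk.Label(root, text="PM2.5 concentration (µg/m³):").pack(padx=10, pady=5)
pm25_entry = tk.Entry(root)
pm25_entry.pack(padx=10, pady=5)

tk.Button(root, text="Predict", command=predict_aqi).pack(pady=5)
result_label = tk.Label(root, text="")
result_label.pack(padx=10, pady=10)

root.mainloop()
```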
B. Objective
The goal is to develop a machine learning model for air quality forecasting: an updatable supervised classification model whose predictions are reported together with the best accuracy obtained by comparing the supervised algorithms.
C. Problem Statement
The management and protection of air quality in many industrial and urban areas is a critical task, owing to emissions from transport, electric power generation and other sources; the accumulation of toxic gases presents a significant danger to the quality of life in smart cities. As air pollution rises, effective air quality monitoring models are needed to gather information on pollutant concentrations and to measure air pollution in all areas.
Data is of little use until it can be readily visualized, for example as graphs and charts. Being able to visualize samples of the data quickly is essential in both applied statistics and machine learning. When visualizing data in Python it helps to recognise the different styles of plot and how each can be used to understand your own data:
How to plot time series data with line charts and categorical quantities with bar charts.
How to summarise data distributions with histograms and box plots.
How to summarise the relationship between two variables with a scatter plot.
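A brief sketch of these plot types with matplotlib, using synthetic values in place of the actual pollution dataset, might look as follows:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
pm25 = rng.gamma(shape=2.0, scale=30.0, size=365)      # synthetic daily PM2.5 values
no2 = 0.3 * pm25 + rng.normal(0, 5, size=365)          # synthetic correlated NO2 values

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].plot(pm25)                    # line chart of the time series
axes[0, 0].set_title("PM2.5 over time")
axes[0, 1].hist(pm25, bins=30)           # histogram of the distribution
axes[0, 1].set_title("PM2.5 distribution")
axes[1, 0].boxplot(pm25)                 # box plot summarising spread and outliers
axes[1, 0].set_title("PM2.5 box plot")
axes[1, 1].scatter(pm25, no2, s=10)      # scatter plot of the relationship
axes[1, 1].set_title("PM2.5 vs NO2")
plt.tight_layout()
plt.show()
```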
Most machine learning algorithms are sensitive to the range and distribution of attribute values in the input data. Outliers in the input data can skew and mislead the training process, leading to longer training times, less reliable models and ultimately poorer performance.
Even before predictive models are trained, outliers can produce misleading representations and misleading interpretations of the collected data. In descriptive statistics such as the mean and standard deviation, and in plots such as histograms and scatter plots, outliers distort the summary of the distribution of attribute values.
A histogram is a bar representation of data that varies over a range: the range of values is plotted along the x-axis and the number of data points falling within each interval along the y-axis. (The classic iris flower dataset is often used to illustrate such histograms.) At the same time, outliers may be important cases in their own right, such as anomalies in fraud detection or computer security. If outliers and noise are not handled, the model cannot be fitted properly to the training data and there is no guarantee it will behave correctly on real data; the model needs the correct data characteristics and not too much noise.
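One simple and commonly used way to flag such outliers, shown here on a small hypothetical set of PM2.5 readings, is the interquartile range rule:

```python
import pandas as pd

# Hypothetical pollutant readings; in practice this would be a dataset column.
pm25 = pd.Series([12, 18, 25, 30, 28, 22, 400, 19, 27, 31])

q1, q3 = pm25.quantile(0.25), pm25.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = pm25[(pm25 < lower) | (pm25 > upper)]
print(outliers)   # the reading of 400 is flagged as an outlier
```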
Cross-validation is a method in which a model is trained on one subset of the data and then evaluated on the complementary subset.
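A minimal cross-validation sketch with scikit-learn, again on synthetic data rather than the paper's dataset, is:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((500, 6))               # six synthetic pollutant features
y = rng.integers(0, 3, size=500)       # three illustrative AQI categories

# 5-fold cross-validation: train on four folds, validate on the remaining fold.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores.mean(), scores.std())
```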
1) Data Validation
Data processing, data analysis, and the work of addressing data content, consistency and structure can add up to a time-consuming list of tasks. Understanding the data and its properties during this data recognition process helps in selecting which algorithm to use to construct the model. Time series data, for example, can be analysed with regression algorithms, while classification algorithms can be used to evaluate discrete data.
2) Data Pre-Processing
Pre-processing refers to the preparation of the data before it is passed to the algorithm. It is a method for transforming raw data into a clean collection of data; in other words, data obtained in raw form from various sources is not suitable for analysis. Proper pre-processing is needed for the machine learning pipeline to obtain improved outcomes from the fitted model. For example, the random forest algorithm does not accept null values, so null values in the initial raw data must be handled before a random forest can be trained. Certain machine learning models also require the data in a defined format, and it is important that the dataset is structured so that more than one machine learning or deep learning algorithm can be applied to it.
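A short sketch of such null-value handling before training a random forest is given below; the column names, labels and fill strategy are illustrative assumptions, not the paper's actual schema.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Illustrative records with missing readings; column names are assumptions.
df = pd.DataFrame({
    "PM2.5": [45.0, None, 120.0, 60.0, 300.0],
    "NO2":   [20.0, 25.0, None,  30.0, 80.0],
    "label": ["Good", "Good", "Poor", "Moderate", "Severe"],
})

# Fill missing pollutant readings with the column median (one simple strategy).
features = df[["PM2.5", "NO2"]].fillna(df[["PM2.5", "NO2"]].median())

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(features, df["label"])
print(model.predict(features[:2]))
```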
3) Decision Tree
The decision tree algorithm falls into the supervised learning category. It works for both continuous and categorical output variables.
Assumptions of the decision tree:
At the beginning, the whole training set is considered as the root.
Attributes are assumed to be categorical when computing information gain; continuous attributes are discretized first.
Records are distributed recursively on the basis of attribute values.
Statistical measures are used to select attributes as the root or as internal nodes.
Decision trees are a technique for classifying and analysing collected data. A decision tree forms the basic recursive structure of the classification process: a case described by a collection of attributes is passed through a structure of nodes and leaves and assigned to one of a set of disjoint classes. Each internal node tests a specific attribute, and each leaf denotes a class; the test usually compares an attribute value with a constant. A leaf either assigns every case reaching it to a single class or spreads a probability distribution over the possible classes.
The process continues until a termination condition is met; the tree is built in a recursive, divide-and-conquer, top-down fashion. All attributes should be categorical, otherwise they are discretized in advance. Attributes near the top of the tree have the greatest effect on the classification, and the information gain principle is used to select them. A tree with too many branches can easily over-fit and reflect disturbances caused by noise or outliers.
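A minimal scikit-learn sketch of such a tree, using the entropy criterion as a stand-in for information gain and synthetic data in place of the pollution dataset, is:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.random((500, 6))                       # six synthetic pollutant features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)      # synthetic two-class AQI label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# criterion="entropy" selects splits by information gain;
# max_depth limits branching to reduce over-fitting.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0)
tree.fit(X_train, y_train)
print(accuracy_score(y_test, tree.predict(X_test)))
```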
4) Random Forest
A random forest is an ensemble of decision trees in which each tree is grown using a random selection of features. The generalization error of the forest depends on the strength of the individual trees and on the correlation between them, so the number and diversity of the trees matter. Randomly selecting the features considered for splitting at each node yields error rates comparable to other methods while being more robust to noise.
The basic steps in carrying out the random forest algorithm are:
From the given dataset, choose N random records.
Build a decision tree on the basis of those N records.
Choose the number of trees required and repeat the two steps above for each tree.
For a regression problem, each tree in the forest predicts a value of Y for a new record; the final value is obtained by averaging the predictions of all the trees in the forest.
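For the regression case just described, a small sketch with scikit-learn (synthetic values standing in for the real measurements) could be:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 6))                                    # six synthetic pollutant features
y = 200 * X[:, 0] + 50 * X[:, 1] + rng.normal(0, 5, 500)    # synthetic AQI values

# 100 trees, each grown on a bootstrap sample; the prediction is
# the average of the individual trees' predictions.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:3]))
```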
For the estimation of results, the data is split into two groups, a training set and a test set, typically in a 7:3 ratio. Models built with the random forest, logistic regression, decision tree, K-nearest neighbors, support vector classifier (SVC) and Naïve Bayes algorithms are fitted on the training set, and prediction on the test set is evaluated on the basis of accuracy.
The raw recorded data contains meteorological information for various cities in India. The idea is to use various machine learning techniques such as logistic regression, support vector machines, random forests, K-nearest neighbors, Naïve Bayes classifier and decision tree for model building and comparison.
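A sketch of such a comparison, a simple harness over these six model families with a 70/30 split and synthetic data in place of the real meteorological records, might look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 6))                      # six synthetic pollutant features
y = (X[:, 0] + X[:, 2] > 1.0).astype(int)      # synthetic two-class AQI label

# 70/30 split, matching the 7:3 ratio described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.4f}")
```

Swapping the synthetic arrays for the real feature matrix and AQI labels would reproduce the kind of comparison summarised later in Table 1.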
D. Decision Tree
As described above, a decision tree assigns a case, described by a collection of attributes, to one of a set of disjoint classes by passing it through a structure of nodes and leaves, where each internal node tests a specific attribute and each leaf denotes a class. After applying the decision tree algorithm, the prediction accuracy obtained was 99.88%.
Figure 5.7: Classification report of Decision Tree
E. Testing
The classification reports of all four algorithms were analysed and evaluated on the dataset, and the accuracy of each algorithm was compared with the others, as shown in Table 1.
Algorithm            Accuracy (%)
Naive Bayes          97.38
Random Forest        99.16
K-Nearest Neighbors  97.61
Decision Tree        99.88
Table 1: Comparison of the accuracy of each algorithm
The decision tree, with 99.88 percent accuracy, is the most effective of the methods compared, whereas the least accurate algorithm is Naïve Bayes with 97.38 percent. Random forest achieves 99.16 percent, very close to the decision tree; this is expected, since a random forest is itself an ensemble of decision trees and therefore behaves similarly. Logistic regression, Naïve Bayes and KNN all achieve approximately 97 percent accuracy.
5) Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test
data for which the true values are known. It allows the visualization of the performance of an algorithm.
6) F1 Score
The F1 score (also F-score or F-measure) is a measure of a test's accuracy. It takes into account both the precision p and the recall r of the test: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all samples that should have been identified as positive. The F1 score is the harmonic mean of precision and recall, F1 = 2pr / (p + r).
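A short sketch of these metrics with scikit-learn, using small illustrative label vectors rather than the paper's actual predictions, is:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Illustrative true and predicted labels (not the paper's results).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print(p, r, f1_score(y_true, y_pred))   # f1 equals 2*p*r / (p + r)
```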
7) PM2.5
PM2.5 refers to atmospheric particulate matter smaller than 2.5 micrometers in diameter, roughly 3 percent of the diameter of a human hair. Particles in this category are so small that they can only be observed with an electron microscope. They are even smaller than PM10 particles, which are 10 micrometers or less in diameter.