Report 4
Bachelor of Engineering
Submitted in partial fulfillment of the requirements for the award of the degree of
by
Abhishek K USN:4JN22EC400
Chandana H B USN:4JN22EC403
Varsha G S USN:4JN22EC420
Vaishnavi R USN:4JN22EC422
Assistant Professor,
December 2024
VISVESVARAYA TECHNOLOGICAL UNIVERSITY
CERTIFICATE
This is to certify that the project work entitled Air Quality Index Prediction is
carried out by Abhishek K (4JN22EC400), Chandana H B (4JN22EC403),
Varsha G S (4JN22EC420), Vaishnavi R (4JN22EC422), the bona fide students
of JNN College of Engineering, Shimoga in partial fulfillment for the award of Bachelor
that all the corrections/suggestions indicated for internal assessment have been incorpo-
rated in the report deposited in the departmental library. The project report has been
External Viva
1.
2.
ABSTRACT
The project "Air Quality Index Prediction" aims to forecast air quality using machine
learning techniques. This involves analyzing pollutants such as CO2, NO2, PM2.5, and
O3, alongside meteorological factors like temperature, humidity, and wind speed. The
project addresses challenges like complex environmental factors and data accuracy to
provide actionable insights for public health and governmental planning. Data is collected
from cloud-based sources, preprocessed for consistency, and divided into a training
set (80%) and a testing set (20%). Various machine learning models, such as
Linear Regression, KNN, and Lasso Regression, are applied to predict the Air Quality
Index (AQI) based on extracted features. The models are validated using metrics such
as Mean Squared Error (MSE) and Mean Absolute Error (MAE). The project uses tools
like Jupyter and Spyder for coding and analysis. The outcome is a predictive model that
provides accurate AQI trends, enabling proactive measures during high-pollution periods.
This work also highlights the potential of machine learning in improving environmental
ACKNOWLEDGEMENTS
The satisfaction and euphoria that accompany the successful completion of any task
would be incomplete without the mention of the people who made it possible whose
constant guidance and encouragement crowned the efforts with success. We would like
to acknowledge the help and encouragement given by various people during the course
of the mini project and thankful to our beloved professor and Principal Dr Y Vijaya
Kumar for providing excellent academic climate. We would also like to thank our dean
Electronics and Communication Engineering, Shimogga for his kind support and guidance
and encouragement throughout the course of this work. We are deeply indebted and very
grateful for the invaluable guidance and support given by our Assistant Professors
Mrs. Ujwala B S and Mrs. Sumathi k during this project work. We would like to
thank all the teaching and non-teaching staff of Dept. of ECE for their kind co-operation
during the course of the work. The support provided by the college and departmental
library is greatly acknowledged. And lastly, I would hereby acknowledge and thank my
parents who have been a source of inspiration and also instrumental in the successful
Abhishek K 4JN22EC400
Chandana H B 4JN22EC403
Varsha G S 4JN22EC420
Vaishnavi R 4JN22EC422
Contents
Abstract
Acknowledgements
List of Figures
List of Tables
1 Introduction
1.1 General Introduction
1.3 Methodology
1.5 Limitations
2 Theoretical Background
2.1 Literature survey
2.2 Summary
3.2.6 Testing
3.3 Implementation
3.3.1 Spyder
3.3.3 Algorithm
3.4.2 Flowchart of Extract Combine Data
3.5 Summary
6 References
7 Appendix
7.1 Programme of HTML
List of Figures
3.1 Block diagram of the project
4.3 Pairgrid
4.4 HeatMap
List of Tables
4.1 Independent features in dataset
Chapter 1
Introduction
1.1 General Introduction
Air pollution poses a significant threat to public health, ecosystems, and the environment,
making its monitoring and mitigation a critical global priority. Rapid urbanization, in-
air quality in both urban and rural areas. The Air Quality Index (AQI) serves as a
standardized metric to represent the levels of air pollution, helping authorities and the
public to understand its implications on daily life. AQI values are determined based on
the concentration of key pollutants such as PM2.5, PM10, NO2, SO2, CO, and O3, each
of which has distinct health impacts. Traditional methods of monitoring air quality, such
as manual sampling and analysis, are often time-consuming, limited in spatial coverage,
and unable to predict future conditions effectively. Additionally, these methods fail to
humidity, and wind speed with pollutant levels. Therefore, there is an urgent need for
Advancements in Machine Learning (ML) and data analytics have provided new oppor-
tunities for enhancing air quality monitoring systems. ML algorithms can process large
datasets efficiently, identify patterns, and predict future AQI levels with high accuracy.
provide actionable insights, such as early warnings about hazardous pollution events.
This project focuses on predicting AQI using advanced ML techniques, including Ran-
dom Forest and Neural Networks. By integrating real-time data from various sources,
such as remote sensors and meteorological records, the proposed system aims to over-
come the limitations of traditional approaches. Accurate AQI predictions can empower
policymakers, urban planners, and individuals to take timely measures to mitigate
pollution impacts. The project also identifies critical parameters such as PM2.5 and NO2,
which significantly influence AQI variations. Predictive models are trained and validated
using datasets collected from the Central Pollution Control Board (CPCB) and other
reliable sources. In conclusion, this research aims to establish an efficient, reliable, and
scalable AQI prediction system using state-of-the-art ML methods. It not only contributes to public health
awareness but also supports sustainable urban development by informing strategies for
pollution control and urban planning. The outcomes of this project hold the potential
to drive impactful environmental policies and enhance the quality of life for communities
worldwide.
This project aims to leverage advanced machine learning techniques to develop a robust
model for real-time Air Quality Index (AQI) prediction, enhancing decision-making for
pollution mitigation.
1.3 Methodology
The project involves collecting air quality and meteorological data, preprocessing it to
handle inconsistencies, and applying feature selection methods to retain the most
relevant parameters. The refined data is split into training and testing sets, followed by
the development of predictive models using machine learning algorithms such as Linear
Regression, KNN, and Lasso Regression. The models are evaluated using metrics like
Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) to ensure accuracy
in forecasting AQI.
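The preprocessing stage of this pipeline (imputation, outlier handling, normalization) can be illustrated with a minimal sketch. The sample values, the column names, the median imputation, the 1.5 × IQR clipping rule, and min-max scaling are all assumptions chosen for illustration; the report does not state which specific techniques were used.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; column names and values are illustrative, not the project's data.
raw = pd.DataFrame({
    "PM2.5": [80.0, np.nan, 95.0, 1200.0, 70.0, 88.0],
    "NO2": [40.0, 42.0, np.nan, 39.0, 45.0, 41.0],
    "temperature": [22.0, 24.0, 23.0, 21.0, np.nan, 25.0],
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing values, clip outliers, and min-max normalize each column."""
    out = df.copy()
    # 1. Imputation: fill gaps with the column median (an assumed choice).
    out = out.fillna(out.median())
    # 2. Outlier handling: clip values lying outside 1.5 * IQR of each column.
    q1, q3 = out.quantile(0.25), out.quantile(0.75)
    iqr = q3 - q1
    out = out.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)
    # 3. Normalization: rescale every column to the range [0, 1].
    return (out - out.min()) / (out.max() - out.min())


clean = preprocess(raw)
print(clean.round(3))
```

Clipping rather than dropping outliers keeps every time step in the dataset, which matters when the rows represent consecutive days.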
Data Collection: The first step involves gathering air quality and meteorological
data from reliable sources such as the Central Pollution Control Board (CPCB).
Key features include pollutant concentrations (e.g., PM2.5, NO2, CO) and mete-
orological variables (e.g., temperature, humidity, wind speed). These features are
Data Preprocessing: The collected raw data is cleaned through imputation techniques
to fill missing values and aggregation methods to remove redundancy. Data
normalization ensures consistency, and outliers are handled to improve model
performance.
Neighbors (KNN), and Decision Tree Regressor are employed to build predictive
Performance Evaluation: The models are validated using metrics like Mean Absolute
Error (MAE) and R² Score to ensure reliability. Comparative analysis between
standalone and hybrid models highlights the most effective approach. The
final model provides actionable insights for policymakers by offering accurate AQI
Mean Absolute Error (MAE): MAE is the arithmetic average of the absolute difference
between the ground truth and the predicted values. It can also be defined
It tells us how far the predictions diered from the actual result. Mathematical
\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \]
Mean Squared Error (MSE): MSE is a common metric used to evaluate the
tween the predicted values and the actual values.Mathematical representation for
\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \]
Root Mean Square Error (RMSE): RMSE is the square root of the average
of the squared difference between the target value and the value predicted by the
model. It is the square root of the mean squared error (MSE). The implementation is very
\[ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \]
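where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations.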
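The three metrics defined above can be computed directly with NumPy. The following is a minimal sketch; the AQI values are made up for illustration and the function names are not taken from the project code.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean Absolute Error: average of |y_i - y_hat_i|."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))


def mse(y_true, y_pred):
    """Mean Squared Error: average of (y_i - y_hat_i)^2."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))


def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the MSE."""
    return float(np.sqrt(mse(y_true, y_pred)))


# Hypothetical actual and predicted AQI values.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 190.0, 270.0]

print("MAE :", mae(y_true, y_pred))   # 12.5
print("MSE :", mse(y_true, y_pred))   # 175.0
print("RMSE:", rmse(y_true, y_pred))
```

Because the errors are squared before averaging, the single 20-point miss contributes more to MSE and RMSE than the three 10-point misses combined, which is why RMSE exceeds MAE here.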
Develop a robust machine learning-based system to predict the Air Quality Index (AQI)
using pollutant and meteorological data. Create visual outputs such as graphs and maps
for better interpretation of predicted air quality levels across locations. Provide
actionable recommendations for policymakers to mitigate air pollution and improve urban
air quality standards.
Potential Impact: Enhanced public health awareness by providing
timely warnings about hazardous air quality levels. Support for policy makers in imple-
1.5 Limitations
quality of historical air quality and meteorological data, which may not always be
comprehensive or consistent.
2. Regional Constraints: Models trained on specific datasets may not generalize well
3. Real-time Challenges: Integrating real-time data into the system can face challenges
sions, or local vegetation effects might not always be represented adequately in the
dataset.
7. Model Interpretability: Advanced models such as neural networks can act as black
8. Temporal Resolution: The system might fail to provide high temporal resolution
weather patterns and pollution trends, which might not hold true in rapidly chang-
ing environments.
10. Scalability Issues: Implementing the system across larger regions or globally re-
Theoretical Background
Air quality prediction involves analyzing the complex interactions between pollutants
like PM2.5, NO2, and CO and meteorological factors such as temperature, humidity,
and wind speed. Traditional statistical methods, while useful, often fail to account for
including Regression Models, K-Nearest Neighbors (KNN), and Neural Networks, provide
advanced algorithms with environmental data, this project aims to improve the reliability
nitrogen dioxide, and particulate matter (PM2.5, PM10) across different urban and
urban areas. It provides insights into air quality fluctuations influenced by both
air quality parameters in Qatar, leveraging sensors for real-time data collection and
analysis.
Data is processed into an accessible format for the general public via websites and
mobile applications.
MGMS stations are powered by solar panels, their reliance on clear weather can
Result: The paper shows that the wireless sensor network effectively monitors
and analyses air quality in real-time, providing valuable insights into pollution levels
In this paper Author have Implemented a smart system designed for outdoor pol-
environmental sensors and machine learning models to predict pollution levels and
provide real-time data, which can help individuals and authorities take preventive
pollution data with weather conditions to predict pollution levels and inform health-
tem by integrating smart outdoor pollution monitoring sensors with predictive mod-
nation of linear regression and artificial neural networks (ANN) to predict various
pollutants, achieving high accuracy for pollutants such as PM 2.5 and PM 10, which
is essential for real time monitoring and forecasting. User Friendly Data Access:
The system includes web and mobile interfaces to display air quality data, making
it accessible to both experts and the general public, thus promoting environmental
is limited by the specific sensor types it uses, which may restrict its application to
only a certain set of pollutants, leaving out other significant air quality parameters.
Result: The study demonstrates that the pollution weather prediction system accu-
rately monitors and forecasts air quality, enabling proactive measures for healthier
living.
In this paper, the authors describe the development of a wireless sensor
network system for real-time indoor air quality monitoring. The system is de-
like carbon dioxide and particulate matter and transmitting this data wirelessly for
Technical Method: The paper presents a real-time indoor air quality monitoring
system using a wireless sensor network, where multiple low-cost sensor nodes
measure parameters like carbon dioxide, temperature, and humidity, transmitting the
data wirelessly for real-time analysis and visualization, enabling efficient air quality
management.
Advantages: Spatial Coverage: Uses spatial prediction to cover areas with fewer
sensors. Real-time Data: Offers immediate access to IAQ data, which can
support timely decisions about ventilation. The system can be applied in residential
to air quality prediction. It highlights the importance of monitoring air quality due
to its impact on public health and environmental sustainability. The authors discuss
the strengths and weaknesses of various machine learning models in predicting air
quality indices (AQI) and other related metrics. The paper reviews various machine
Forest, and Linear Regression, for predicting air quality indices (AQI). It highlights
mitigating health impacts associated with poor air quality. The study emphasizes
the strengths and limitations of each algorithm, concluding that ensemble methods
like Random Forest show significant promise for accurate and reliable predictions.
Advantages: Machine learning algorithms like Random Forest and Support Vector
Machines provide high accuracy and efficiency in predicting air quality by
processing large datasets automatically. They can adapt to different regions and pollution
scenarios. These models are scalable and can integrate additional data sources, such
quire high-quality historical data, which may not be available in all regions.
Overfitting: Complex models like Random Forest and SVM are prone to overfitting and
Result: The paper reviews various machine learning algorithms used for air quality
PM2.5, carbon dioxide, and nitrogen dioxide, and discussing their strengths and
This paper discusses the application of machine learning techniques for predicting
and analyzing the Air Quality Index (AQI). The study evaluates the performance of
(LR), for atmospheric modeling and pollution level forecasting. The authors lever-
age machine learning to: Enhance the accuracy of AQI predictions using historical
and environmental data. Analyze the role of pollutants (e.g., PM2.5, PM10) and
Technical Methods: The paper employs machine learning algorithms such as De-
Regression (SVR), and Linear Regression (LR) to predict the Air Quality Index
(AQI) by analyzing historical pollution data and environmental factors, with
evaluation based on metrics like R-squared score, Mean Absolute Error (MAE), and Mean
2.2 Summary
The reviewed papers explore various approaches to air quality monitoring and prediction,
emphasizing real-time data collection through wireless sensor networks and advanced
machine learning techniques. They focus on pollutants like PM2.5, NO2, and CO2,
such as Random Forest, SVM, and Decision Trees are highlighted for their predictive
accuracy. Challenges include hardware limitations, scalability, and the need for high-
quality data. These systems aim to enhance public health by providing actionable insights
and enabling proactive pollution management. The studies collectively underscore the
The design and implementation of the Air Quality Index Prediction System involve in-
tegrating data collection, processing, and predictive modeling. Real-time air quality
data is gathered from sensors measuring pollutants like PM2.5, NO2, and CO2, along
missing values, remove redundancies, and extract relevant features. Machine learning
algorithms, such as Linear Regression, Decision Tree, and Random Forest, are applied
to build predictive models. The system architecture ensures efficient data flow, model
training, and validation. Finally, results are displayed via user-friendly platforms, aiding
Fig. 3.1 shows the system design for AQI prediction, which integrates a range of technologies
and methodologies to deliver real-time, accurate air quality assessments. The design
begins with the incorporation of real-time sensors and cloud-based data sources, which
work in tandem to continuously collect crucial air quality and meteorological parameters.
These parameters may include pollutants such as PM2.5, PM10, CO2, NO2, and O3, as
well as weather conditions like temperature, humidity, and wind speed, which are all
module is employed to clean and refine the data. This preprocessing step ensures that
any noise or irrelevant information is removed, and that the dataset is formatted for use
the most pertinent variables that influence air quality, which allows for more accurate
predictions.
Data Preprocessing involves preparing raw data for accurate AQI prediction by cleaning,
normalizing, and refining it. Missing values are imputed, outliers are removed, and
redundant data is consolidated. Key features like PM2.5, NO2, and temperature are
extracted, while irrelevant ones are discarded to improve efficiency. The dataset is then
split into training and testing sets, ensuring reliable model performance. Effective data
preprocessing is critical for accurate Air Quality Index (AQI) predictions. The steps
undertaken include:
1. Data Collection: Data was acquired from cloud-based climate repositories. This in-
ter (PM2.5), Nitrogen Dioxide (NO2), Carbon Monoxide (CO), temperature, wind
2. Data Cleaning: Missing Value Imputation: Gaps in data were filled using imputation
Feature Engineering:
The input layer accepts various meteorological and pollutant-related features that inu-
Particulate Matter: PM2.5 and PM10 concentrations, the primary indicators of air
pollution. Gaseous Pollutants: NO2, CO, and other harmful gases that contribute
pressure, and rainfall. The purpose is to feed raw data into the model. Proper
normalization ensures that features with varying scales (e.g., temperature in °C vs. PM2.5 in
The primary goal is to predict the AQI value based on meteorological and pollutant data
to help monitor air quality and guide decision-making. The model usage involves several
steps:
PM2.5, PM10, CO, NO2, temperature, humidity, wind speed, and other features,
is processed and input into the system. Each feature provides critical insights into
2. Feature Selection: Not all collected data points may be relevant. Feature selection
methods identify the most significant variables affecting AQI prediction. For
example, PM2.5 and PM10 are more significant than less impactful features.
3. Training the Model: During training, the model learns to map input features to the
AQI. A diverse range of data points, representing different times, locations, and
pollution scenarios, ensures the model captures complex patterns in the data.
4. Prediction Process: Once trained, the model predicts AQI for new or unseen data
by applying the learned relationships between input features and AQI. For instance,
given data on PM2.5, NO2, and temperature for a particular day, the model esti-
Validation ensures that the model is accurate, reliable, and generalizable to unseen data.
Data Splitting:
1. Training Set: 80% of the dataset is used for training, enabling the model to learn
2. Test Set: 20% of the dataset is reserved for testing, used to evaluate how well the
model performs on unseen data. This splitting ensures that the model is not over-
Predictive Analysis: The machine learning model forecasts AQI using historical pollution
Algorithms Employed:
AQI.
3.2.6 Testing
Independent Test Data: Uses a reserved dataset (unseen during training) to test the
model's generalization and to assess its ability to predict AQI for varied climatic and
pollutant conditions.
Simulate extreme environmental conditions (e.g., high pollution during dust storms)
to test robustness. Introduce missing or noisy data to evaluate the model's handling
of real-world challenges. Compare the model's predictions with actual AQI values
obtained from trusted monitoring systems. Benchmark against existing models or methods
to assess relative performance. Visualization: Create plots for predicted vs. actual AQI
3.3 Implementation
The "Air Quality Index Prediction" project uses Python for data analysis and machine
learning implementation due to its versatile libraries like Pandas and Scikit-learn. Jupyter
Notebook facilitates exploratory data analysis, while Matplotlib and Seaborn create
visualizations of AQI trends and model performance. Spyder IDE is used for coding and
debugging the project efficiently. For deployment, Flask or Django builds a web interface
to display AQI predictions, supported by cloud platforms like AWS or Google Cloud
for scalability. Version control tools like GitHub ensure collaborative development and
3.3.1 Spyder
grated Development Environment tailored for scientific computing and data analysis. It
is written in Python and is often included in the Anaconda distribution. Spyder features
a powerful editor with advanced features like syntax highlighting, code introspection, and
code and view results in real-time. The variable explorer in Spyder provides an intuitive
way to inspect data and variables during runtime. It supports integration with libraries
like NumPy, pandas, Matplotlib, and SciPy for scientic computing tasks. Spyder also
includes debugging tools, making it easier to identify and fix errors in code. The IDE is
customizable with support for plugins and layouts. It is widely used for machine learning,
data visualization, and statistical analysis. Overall, Spyder is ideal for researchers, data
designed for writing and executing code. It supports multiple programming languages,
with Python being the most commonly used. Jupyter allows users to create and share
documents that contain live code, equations, visualizations, and explanatory text. It
is widely used for data analysis, machine learning, and statistical modeling due to its
interactivity and ease of use. Each notebook consists of cells, which can contain code or
Jupyter supports visualization libraries like Matplotlib and Seaborn, making it ideal for
analyzing and presenting data. Its modular nature allows users to execute code incre-
in various formats, such as HTML, PDF, or Python scripts, for sharing and collabora-
tion. Jupyter integrates well with scientific computing and machine learning frameworks,
making it popular among researchers and data scientists. Overall, it is a powerful tool
1. Pandas: Pandas is a powerful open-source Python library used for data
manipulation and analysis. It provides two primary data structures: Series (1D) and
DataFrame (2D), which allow users to work with labeled and tabular data
efficiently. It supports reading and writing data to various file formats, including CSV,
Excel, SQL, and JSON. It offers functionalities for data cleaning, filtering, grouping,
cessing to prepare datasets for machine learning or statistical analysis. Pandas also
supports time series analysis, enabling functionalities like resampling, shifting, and
handling date ranges. The library is optimized for performance, making it suitable
for large datasets, and provides support for multi-indexing, which allows users to
create complex data hierarchies. The integration of Pandas with other libraries such
as NumPy and Matplotlib allows users to easily transform and visualize data. It can
handle a wide variety of data types, including text, integers, floats, and categorical
support for large, multi-dimensional arrays and matrices, along with a collection of
NumPy is the ndarray (N-dimensional array), which allows for efficient storage and
manipulation of large datasets. Unlike traditional Python lists, NumPy arrays are
much more efficient in terms of both memory and computation. One of the key fea-
simplifies complex mathematical calculations. For example, you can perform
arithmetic operations on entire arrays without needing to loop through each element
to integrate well with other libraries like Pandas, SciPy, and Scikit-learn, making
it a foundational library in the Python data science stack. It provides tools for
efficient data manipulation and analysis, especially for tasks involving large datasets or
supports advanced indexing and slicing techniques that allow users to manipulate
data in complex ways. This functionality is particularly useful when working with
including line plots, scatter plots, bar charts, histograms, and more. The library is
highly customizable, allowing users to adjust plot attributes such as colors, labels,
and styles to suit their needs. It is built on NumPy and integrates well with other
libraries in the Python ecosystem, such as pandas and SciPy. The primary inter-
interface for ease of use.Matplotlib supports multiple backends for rendering, mak-
It enables the creation of publication-quality plots with detailed control over
elements like figure size, resolution, and layout. The library supports saving plots in
various formats, such as PNG, SVG, PDF, and more. Advanced users can
leverage its object-oriented API for finer control over plot elements. With its extensive
documentation and active community, Matplotlib remains a go-to tool for data
visualization in Python.
for exploratory data analysis (EDA) and visualizing relationships between vari-
ing higher-level functions for creating visualizations such as box plots, violin plots,
pair plots, heatmaps, and more. It automatically handles many of the plot aes-
thetics, such as colors, labels, and styles, providing a polished appearance without
plot allows users to create a grid of scatter plots showing pairwise relationships
other matrix-like data. Seaborn also provides functionality for dealing with
categorical data. Plots like bar plots, count plots, and categorical scatter plots help users
offers various options for customizing plot styles and color palettes, enhancing the
cludes tools for visualizing regression models, distributions, and uncertainty in data,
which is particularly useful for analyzing patterns, trends, and outliers. It can also
3.3.3 Algorithm
computes the linear relationship between the dependent variable and one or more
the relationship between one or more independent variables (Xx) and a dependent
to minimize the difference (error) between the predicted values and actual values by
dence of errors and normal distribution of residuals. It uses the Mean Squared Error
(MSE) as a loss function to evaluate the difference between actual and predicted
values. The model optimizes the slope and intercept using techniques like Gradient
ing AQI, the dependent variable is the AQI value, and the independent variables
are environmental features such as PM2.5, temperature, humidity, wind speed, and
the form:
Y = mX + C
where Y represents the AQI, X is the independent variable, m is the slope, and C is the
intercept. Data: Historical air quality and meteorological data are used to
train the linear regression model, ensuring it learns the patterns and trends in AQI.
Prediction: Once trained, the model predicts AQI values for unseen data based on
the linear relationship it has derived. The model assumes a linear relationship
between variables, which might not fully capture the complexity of AQI, which is influenced by
multiple nonlinear factors. In this project, linear regression is one of the algorithms
explored alongside more complex methods like LASSO regression and decision tree
regression for better accuracy. Metrics such as Mean Squared Error (MSE) and
Mean Absolute Error (MAE) are calculated to assess the performance and
accuracy of the linear regression model. By analyzing the coefficients of the regression
equation, the project identifies which features (e.g., temperature, PM2.5) have the
most significant impact on AQI. It provides a foundational method for AQI predic-
the project.
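As an illustration of the single-feature form Y = mX + C described above, the sketch below fits a slope and intercept by ordinary least squares with NumPy. The PM2.5 and AQI numbers are hypothetical. The closed-form least-squares solver is used here for brevity instead of gradient descent; both minimize the same squared-error loss.

```python
import numpy as np

# Hypothetical PM2.5 readings (X) and AQI values (Y); illustrative numbers only.
pm25 = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
aqi = np.array([52.0, 68.0, 91.0, 108.0, 131.0])

# Design matrix [X, 1] so that least squares solves for slope m and intercept C together.
A = np.column_stack([pm25, np.ones_like(pm25)])
(m, C), *_ = np.linalg.lstsq(A, aqi, rcond=None)

# Predictions from the fitted line and the resulting mean squared error.
predicted = m * pm25 + C
mse = float(np.mean((aqi - predicted) ** 2))

print(f"slope m = {m:.2f}, intercept C = {C:.2f}, MSE = {mse:.3f}")
```

With several input features, the same approach applies with one column per feature in the design matrix, and each fitted coefficient indicates how strongly that feature moves the predicted AQI.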
hance the model's generalization by shrinking the coefficients of less important fea-
regression can be applied in air quality monitoring to predict pollutant levels (like
PM2.5) based on various environmental factors while selecting the most relevant
most relevant environmental and pollutant factors while reducing the impact of
dataset where not all features contribute equally to the prediction. Example: Pre-
Lasso regression helps us identify which factors are most important while ignoring
strongly influence PM2.5 levels. Wind Speed might have a moderate effect. CO
Levels might turn out to be irrelevant and removed by Lasso. May not perform
levels (e.g., PM2.5) by identifying key environmental factors like temperature and
humidity. Helps avoid overfitting and selects the most important features.
Example: Helps exclude less relevant features like wind speed when predicting PM2.5
concentrations.
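The feature-selection behaviour described above can be demonstrated on synthetic data using scikit-learn's `Lasso`. In this sketch the data are generated, not measured: temperature and humidity are constructed to drive PM2.5, while CO is constructed to have no effect. The value `alpha=0.5` is an assumed regularization strength, not one reported by the project.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic features (hypothetical, standardized units).
temperature = rng.normal(size=n)
humidity = rng.normal(size=n)
co = rng.normal(size=n)  # deliberately unrelated to the target

# Target depends only on temperature and humidity, plus small noise.
pm25 = 3.0 * temperature + 2.0 * humidity + rng.normal(scale=0.1, size=n)

# Lasso penalizes coefficient magnitude, so features are scaled first.
X = StandardScaler().fit_transform(np.column_stack([temperature, humidity, co]))
model = Lasso(alpha=0.5).fit(X, pm25)

for name, coef in zip(["temperature", "humidity", "CO"], model.coef_):
    print(f"{name:12s} coefficient = {coef:.3f}")
```

The L1 penalty shrinks every coefficient toward zero and sets the coefficient of the uninformative CO column exactly to zero, which is how Lasso performs feature selection while fitting.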
machine learning (ML) algorithm that can be used for classification or regression
tasks and is also frequently used in missing value imputation. It is based on the
idea that the observations closest to a given data point are the most similar
observations in a data set, and we can therefore classify unseen points based
on the values of the closest existing points. By choosing K, the user can select
expensive for large datasets and sensitive to irrelevant or unscaled features. The
and adaptable to non-linear relationships. Example: Predicts AQI for a given day
The flowchart 3.4 outlines the steps involved in visualizing yearly data trends.
Initialization: The process begins by initializing the years and data structures.
Iteration End Check: The process checks if it has iterated through all years. If not,
Data Iteration: Once all years are processed, it iterates through the data dictionary.
Line Plot Creation: For each year, a line plot is created to visualize the trend.
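The steps of this flowchart can be sketched with Matplotlib as follows. This is a minimal illustration, not the project's script: the PM2.5 series are synthetic, the years match those plotted later in the report, and the names `data`, `days`, and `years` are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Initialization: the years to plot and an empty data dictionary.
years = [2013, 2014, 2015]
data = {}
days = np.arange(1, 366)
rng = np.random.default_rng(0)

# Iterate over all years, filling the dictionary with one daily series per year.
# Synthetic values: higher at the start and end of the year, lower mid-year.
for year in years:
    seasonal = 100 + 60 * np.cos(2 * np.pi * (days - 1) / 365)
    data[year] = seasonal + rng.normal(scale=10, size=days.size)

# Data iteration and line plot creation: one line per year on shared axes.
fig, ax = plt.subplots(figsize=(10, 4))
for year, values in data.items():
    ax.plot(days, values, label=str(year))

ax.set_xlabel("Day of year")
ax.set_ylabel("PM2.5")
ax.legend()
```

Calling `fig.savefig("pm25_by_year.png")` afterwards would write the figure to disk; the file name is a placeholder.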
The flowchart 3.5 outlines the steps involved in processing and analyzing air quality
specified directory.
Data Processing: For each year, it processes the air quality data.
PM 2.5 Data Extraction: It extracts the PM 2.5 data for the year.
Essentially, this flowchart details the workflow for organizing, combining, and extracting
The usage of model validation, selection, testing, and algorithms plays a crucial role
in developing a robust system for Air Quality Index (AQI) prediction. Model valida-
tion ensures that the predictive models perform reliably on unseen data by splitting the
dataset into training and testing subsets and evaluating the model's performance using
metrics like Mean Squared Error (MSE) and Mean Absolute Error (MAE). The selec-
and Decision Tree Regression is driven by their ability to handle specific patterns and
3.5 Summary
narios, refining the model to address overfitting or underfitting issues. The flowchart of
the process typically begins with data collection and preprocessing, followed by feature
extraction and selection, model training, validation, and deployment. This systematic
approach enables the integration of machine learning techniques into a streamlined
workflow, ensuring that the model provides reliable and actionable predictions for air quality
Model training is a critical step in the development of the Air Quality Index prediction
system. It uses machine learning techniques to learn patterns and relationships
between air quality parameters (e.g., PM2.5, PM10, CO, NO2) and meteorological
factors (e.g., temperature, humidity, wind speed). The process begins with splitting the
dataset into training and testing sets: 80 percent of the data is used for training
the model and the remaining 20 percent is reserved for evaluation. Various algorithms
such as Linear Regression, Lasso Regression and K-Nearest Neighbors are trained on the
processed dataset. These models analyze the data to identify trends and predict
AQI values. During training, each model adjusts its parameters to minimize error metrics
such as Mean Squared Error (MSE) or Root Mean Squared Error (RMSE). Hyperparameter
tuning is also performed to optimize the model's performance, using techniques
like Grid Search or Random Search. Cross-validation is employed to check that the model
generalizes well across different subsets of data. The trained model is then
validated on the testing set to check how accurately it predicts AQI under diverse conditions.
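The training procedure described above can be illustrated with a minimal sketch. It assumes scikit-learn is available and uses a synthetic dataset in place of the project data; the hyperparameter grids, the 5-fold cross-validation, the feature scaling step and the variable names (`candidates`, `results`) are illustrative assumptions, not the settings used in this project.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed dataset (assumption, not project data):
# 8 feature columns and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=500)

# 80 percent of the rows for training, 20 percent held back for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate models, each scaled first, with small illustrative grids.
candidates = {
    "linear": (
        make_pipeline(StandardScaler(), LinearRegression()),
        {"linearregression__fit_intercept": [True, False]},
    ),
    "lasso": (
        make_pipeline(StandardScaler(), Lasso(max_iter=10000)),
        {"lasso__alpha": [0.001, 0.01, 0.1, 1.0]},
    ),
    "knn": (
        make_pipeline(StandardScaler(), KNeighborsRegressor()),
        {"kneighborsregressor__n_neighbors": [3, 5, 7, 9]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    # Grid search with 5-fold cross-validation on the training set only.
    search = GridSearchCV(model, grid, cv=5, scoring="neg_mean_squared_error")
    search.fit(X_train, y_train)

    # Evaluate the refitted best model on the held-out test set.
    pred = search.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    results[name] = {
        "best_params": search.best_params_,
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "mae": mean_absolute_error(y_test, pred),
    }
    print(name, results[name])
```

Keeping the test set out of the grid search means the reported MSE, RMSE and MAE are measured on data that played no part in choosing the hyperparameters.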
Fig. 4.1 represents the daily variations in PM2.5 levels over three years (2013, 2014,
and 2015). The x-axis represents the day of the year (1 to 365), and the y-axis shows the
PM2.5 concentration levels. The lines for each year (blue for 2013, orange for 2014, and
green for 2015) illustrate seasonal trends in air pollution. Key observations include high
PM2.5 levels at the beginning and end of each year, indicating seasonal peaks, likely due
to winter pollution caused by heating systems and weather conditions. The mid-year
months show lower PM2.5 concentrations, reflecting cleaner air during warmer months,
possibly due to better dispersion and reduced sources of pollution. This visualization
highlights year-to-year variations while showing consistent seasonal patterns, crucial for
Table 4.1 shows the data collected from cloud-based climate sources that is relevant to air quality
analysis.
Implemented the Machine learning techniques by using AQI data to extract the
Variable   Description
T          Average temperature (°C)
TM         Maximum temperature (°C)
Tm         Minimum temperature (°C)
SLP        Atmospheric pressure at sea level (hPa)
H          Average relative humidity
PP         Total rainfall and/or snowmelt (mm)
VV         Average visibility (km)
V          Average wind speed (km/h)
VM         Maximum sustained wind speed (km/h)
VG         Maximum speed of wind (km/h)
RA         Indicates whether there was rain or drizzle
SN         Snow indicator
TS         Indicates whether there was a storm
FG         Indicates whether there was fog
over a specific period, which is likely used in the context of Air Quality Index (AQI)
such as PM2.5, PM10, CO, NO2, or O3, with their respective concentrations plotted
against time.
1. Red Line: The consistently high value suggests it might represent a parameter with
2. Yellow Line: The spiking values indicate a pollutant with significant temporal
variations, possibly PM2.5 or PM10, which are known to fluctuate based on envi-
3. Purple and Other Lines: These might represent secondary pollutants or meteoro-
The visualization is crucial for understanding the variability and trends of pollutants
over time, highlighting peaks that can correlate with poor air quality episodes. Such
plots are used during data analysis to identify relationships between pollutants and AQI,
The figure below shows a PairGrid, a Seaborn tool that creates a grid of
plots to compare every combination of variables in a dataset. It can show scatter plots
between PM 2.5 and each of the other independent variables. It can show
Fig. 4.4 shows a correlation heatmap created using the Seaborn library in Python.
a matrix are represented as colors. In this case, the heatmap visualizes the correlation
The code begins by importing the Seaborn library, which is widely used for statistical
data visualization. It then computes the correlation matrix of a dataset using the `.corr()`
function, which measures the relationship between different numerical features. The
correlation values range from -1 to 1, where a value close to 1 indicates a strong positive
correlation, a value close to -1 represents a strong negative correlation, and values around
To enhance readability, the heatmap is plotted with a figure size of (20,20). The
colors in the heatmap are determined by the `"RdYlGn"` colormap, where red represents
negative correlations, yellow indicates weak or no correlation, and green signifies strong
positive correlations. Additionally, the `annot=True` parameter ensures that the actual
By analyzing the heatmap, one can identify which features are highly correlated with
each other, either positively or negatively. This is useful in various data analysis tasks,
such as feature selection in machine learning, where highly correlated features might
negative correlations can provide insights into inverse relationships between variables.
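The heatmap steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the DataFrame is synthetic stand-in data (not the project dataset), the column names are borrowed from the variable table for readability, and the off-screen `Agg` backend is chosen only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is required
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (assumption): temperature drives humidity and
# PM 2.5 downwards, while wind speed is independent noise.
rng = np.random.default_rng(0)
t = rng.normal(25, 5, 200)
df = pd.DataFrame({
    "T": t,
    "H": 80 - 1.5 * t + rng.normal(0, 3, 200),
    "V": rng.normal(10, 2, 200),
    "PM 2.5": 300 - 6 * t + rng.normal(0, 10, 200),
})

# Pairwise Pearson correlation between all numerical columns.
corrmat = df.corr()

# Plot the matrix with the settings named in the text:
# figure size (20,20), the "RdYlGn" colormap and annotated cell values.
fig, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(corrmat, annot=True, cmap="RdYlGn", ax=ax)
plt.close(fig)

# Correlation of every feature with the target column.
print(corrmat["PM 2.5"].sort_values())
```

In this synthetic example the temperature column shows a strong negative correlation with PM 2.5, which would appear as a red cell in the plotted heatmap.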
1. The Output of Linear Regression: The data is preprocessed and trained with
linear regression algorithm to predict the AQI. The figure shows how the linear
The figure 4.5 is a scatter plot graph and X-axis and Y-axis are observed AQI value
shown in Fig. 4.6. These metrics are statistical criteria that can be used to measure
and monitor the performance of a model. As our thesis deals with prediction, we've
2. The Output of Lasso Regression: The data is preprocessed and trained with
The figure 4.7 is a scatter plot graph and X-axis and Y-axis are observed AQI value
Fig. 4.8 shows the performance of the Lasso Regression machine learning model.
3. The Output of KNN Regressor: The data is preprocessed and trained with
The figure 4.9 is a scatter plot graph and X-axis and Y-axis are observed AQI value
Three Algorithms
The Air Quality Index (AQI) Prediction project effectively demonstrates the application
of machine learning techniques to address one of the most pressing environmental
issues: air pollution. By leveraging historical data, real-time monitoring, and advanced
predictive models, the project facilitates accurate AQI forecasting. This enables proactive
measures to mitigate health risks, enhance public safety, and support environmental
efforts.
The use of diverse algorithms, such as Linear Regression, Random Forest, and K-
reliable pollution predictions. By utilizing data from sensors and meteorological sources,
the system ensures timely information, empowering individuals and authorities to respond
effectively to pollution threats. The project supports public health measures, urban
disciplinary impact. Scalability and Future. The project sets a foundation for integrating
additional data sources, refining predictive accuracy, and expanding to regional or global
scales. Looking forward, the system's adaptability to new data types, advanced machine
learning methods like deep learning, and collaboration with wearable technology can
significantly enhance its utility. This work lays a robust groundwork for smarter urban
such as Recurrent Neural Networks (RNNs) and Transformers for improved time-
series predictions. Leveraging additional data sources, like satellite imagery and
vide real-time AQI alerts. Launching mobile applications with personalized health
Policy and Decision Support: Assisting governments with dynamic policy imple-
optimized placement of renewable energy projects like solar and wind farms. Supporting
Public Awareness and Education: Engaging the community through gamified learning
platforms about air pollution's effects. Encouraging collective action by em-
AQI trends.
import os
import time
import requests
import sys


def retrieve_html():
    # Download one climate page per month for each year from 2013 to 2018
    for year in range(2013, 2019):
        for month in range(1, 13):
            # Months below 10 need a leading zero in the URL
            if month < 10:
                url = 'http://en.tutiempo.net/climate/0{}-{}/ws-421820.html'.format(month, year)
            else:
                url = 'http://en.tutiempo.net/climate/{}-{}/ws-421820.html'.format(month, year)
            texts = requests.get(url)
            text_utf = texts.text.encode('utf-8')
            # Create the per-year output directory on first use
            if not os.path.exists("Data/Html_Data/{}".format(year)):
                os.makedirs("Data/Html_Data/{}".format(year))
            with open("Data/Html_Data/{}/{}.html".format(year, month), "wb") as output:
                output.write(text_utf)
        sys.stdout.flush()


if __name__ == "__main__":
    start_time = time.time()
    retrieve_html()
    stop_time = time.time()
    print("Time taken {}".format(stop_time - start_time))
        # Remove the columns that are not used from each row
        finalD[a].pop(12)
        finalD[a].pop(11)
        finalD[a].pop(10)
        finalD[a].pop(9)
        finalD[a].pop(0)
    return finalD


def data_combine(year, cs):
    # Read the yearly CSV in chunks of `cs` rows; the rows of the last chunk
    # read are returned (a chunk size of 600 covers a full year in one chunk)
    for a in pd.read_csv('Data/Real-Data/real_' + str(year) + '.csv', chunksize=cs):
        df = pd.DataFrame(data=a)
        mylist = df.values.tolist()
    return mylist


if __name__ == "__main__":
    if not os.path.exists("Data/Real-Data"):
        os.makedirs("Data/Real-Data")
    for year in range(2013, 2017):
        final_data = []
        # Write the header row of the yearly file
        with open('Data/Real-Data/real_' + str(year) + '.csv', 'w') as csvfile:
            wr = csv.writer(csvfile, dialect='excel')
            wr.writerow(['T', 'TM', 'Tm', 'SLP', 'H', 'VV', 'V', 'VM', 'PM 2.5'])
        for month in range(1, 13):
            temp = met_data(month, year)
            final_data = final_data + temp
        # Look up avg_data_<year>() in this module and call it
        pm = getattr(sys.modules[__name__], 'avg_data_{}'.format(year))()
        if len(pm) == 364:
            pm.insert(364, '-')
        for i in range(len(final_data)-1):
            # final[i].insert(0, i + 1)
            final_data[i].insert(8, pm[i])
        # Append only the rows that have no missing values
        with open('Data/Real-Data/real_' + str(year) + '.csv', 'a') as csvfile:
            wr = csv.writer(csvfile, dialect='excel')
            for row in final_data:
                flag = 0
                for elem in row:
                    if elem == "" or elem == "-":
                        flag = 1
                if flag != 1:
                    wr.writerow(row)
    data_2013 = data_combine(2013, 600)
    data_2014 = data_combine(2014, 600)
    data_2015 = data_combine(2015, 600)
    data_2016 = data_combine(2016, 600)
    total = data_2013 + data_2014 + data_2015 + data_2016
    # Write all years into a single combined file
    with open('Data/Real-Data/Real_Combine.csv', 'w') as csvfile:
        wr = csv.writer(csvfile, dialect='excel')
        wr.writerow(['T', 'TM', 'Tm', 'SLP', 'H', 'VV', 'V', 'VM', 'PM 2.5'])
        wr.writerows(total)
    df = pd.read_csv('Data/Real-Data/Real_Combine.csv')
import pandas as pd
import matplotlib.pyplot as plt


def avg_data_2013():
    temp_i = 0
    average = []
    # Each chunk of 24 hourly rows corresponds to one day
    for rows in pd.read_csv('Data/AQI/aqi2013.csv', chunksize=24):
        add_var = 0
        avg = 0.0
        data = []
        df = pd.DataFrame(data=rows)
        for index, row in df.iterrows():
            data.append(row['PM2.5'])
        for i in data:
            if type(i) is float or type(i) is int:
                add_var = add_var + i
            elif type(i) is str:
                # Skip the sensor status codes and convert numeric strings
                if i != 'NoData' and i != 'PwrFail' and i != '---' and i != 'InVld':
                    temp = float(i)
                    add_var = add_var + temp
        avg = add_var / 24
        temp_i = temp_i + 1
        average.append(avg)
    return average


def avg_data_2018():
    temp_i = 0
    average = []
    # Each chunk of 24 hourly rows corresponds to one day
    for rows in pd.read_csv('Data/AQI/aqi2018.csv', chunksize=24):
        add_var = 0
        avg = 0.0
        data = []
        df = pd.DataFrame(data=rows)
        for index, row in df.iterrows():
            data.append(row['PM2.5'])
        for i in data:
            if type(i) is float or type(i) is int:
                add_var = add_var + i
            elif type(i) is str:
                # Skip the sensor status codes and convert numeric strings
                if i != 'NoData' and i != 'PwrFail' and i != '---' and i != 'InVld':
                    temp = float(i)
                    add_var = add_var + temp
        avg = add_var / 24
        temp_i = temp_i + 1
        average.append(avg)
    return average


if __name__ == "__main__":
    lst2013 = avg_data_2013()
    lst2014 = avg_data_2014()
    lst2015 = avg_data_2015()
    lst2016 = avg_data_2016()
    lst2017 = avg_data_2017()
    lst2018 = avg_data_2018()
    # Plot the daily PM 2.5 averages, one line per year
    plt.plot(range(0, 365), lst2013, label="2013 data")
    plt.plot(range(0, 364), lst2014, label="2014 data")
    plt.plot(range(0, 365), lst2015, label="2015 data")
    plt.plot(range(0, 121), lst2016, label="2016 data")
    plt.xlabel('Day')
    plt.ylabel('PM 2.5')
    plt.legend(loc='upper right')
    plt.show()