
Report 4

The project report on 'Air Quality Index Prediction' details the use of machine learning techniques to forecast air quality by analyzing pollutants and meteorological factors. It highlights the challenges of data accuracy and environmental complexity, while employing various models like Linear Regression and KNN to predict AQI trends. The project aims to provide actionable insights for public health and environmental management, utilizing tools such as Jupyter and Spyder for analysis.


VISVESVARAYA TECHNOLOGICAL UNIVERSITY

Jnana Sangama, Belagavi, Karnataka - 590 018.

A Project Work Report on

Air Quality Index Prediction

Submitted in partial fulfillment of the requirements for the award of the degree of

Bachelor of Engineering

in

Electronics and Communication Engineering

by

Abhishek K USN:4JN22EC400
Chandana H B USN:4JN22EC403
Varsha G S USN:4JN22EC420
Vaishnavi R USN:4JN22EC422

Under the Guidance of

Mrs Sumathi K M.Tech.

Assistant Professor,

Dept. of ECE, JNNCE, Shimoga-577 204.

Department of Electronics and Communication Engineering


JNN College of Engineering, Shimoga - 577 204.

December 2024
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

Jnana Sangama, Belagavi-590 018.

JNN College of Engineering

Department of Electronics and Communication Engineering


Shimoga-577 204.

CERTIFICATE
This is to certify that the project work entitled "Air Quality Index Prediction" is carried out by Abhishek K (4JN22EC400), Chandana H B (4JN22EC403), Varsha G S (4JN22EC420), and Vaishnavi R (4JN22EC422), bona fide students of JNN College of Engineering, Shimoga, in partial fulfillment for the award of "Bachelor of Engineering" in the department of "Electronics and Communication Engineering" of the Visvesvaraya Technological University, Belagavi, during the year 2024-25. It is certified that all the corrections/suggestions indicated for internal assessment have been incorporated in the report deposited in the departmental library. The project report has been approved as it satisfies the academic requirements in respect of project work prescribed for the said degree.

Signature of the Guide: Sumathi K, Assistant Professor, Dept. of ECE, JNNCE, Shimoga.

Signature of the Coordinator: Ujwala B S, Assistant Professor, Dept. of ECE, JNNCE, Shimoga.

Signature of the HoD: Dr. S.V. Sathyanarayana, Professor & HOD of ECE, Dean (R&D), Dept. of ECE, JNNCE, Shimoga.

Signature of the Principal: Dr. Y. Vijaya Kumar, Principal, JNNCE, Shimoga.

External Viva

Name of the examiner Signature with date

1.

2.
ABSTRACT

The project "Air Quality Index Prediction" aims to forecast air quality using machine learning techniques. This involves analyzing pollutants such as CO2, NO2, PM2.5, and O3, alongside meteorological factors like temperature, humidity, and wind speed. The project addresses challenges like complex environmental factors and data accuracy to provide actionable insights for public health and governmental planning. Data is collected from cloud-based sources, preprocessed for consistency, and divided into training (80%) and testing (20%) sets. Various machine learning models, such as Linear Regression, KNN, and Lasso Regression, are applied to predict the Air Quality Index (AQI) based on extracted features. The models are validated using metrics such as Mean Squared Error (MSE) and Mean Absolute Error (MAE). The project uses tools like Jupyter and Spyder for coding and analysis. The outcome is a predictive model that provides accurate AQI trends, enabling proactive measures during high-pollution periods. This work also highlights the potential of machine learning in improving environmental management and public health awareness.

Keywords: Machine Learning, MSE, MAE, RMSE
ACKNOWLEDGEMENTS

The satisfaction and euphoria that accompany the successful completion of any task would be incomplete without the mention of the people who made it possible, whose constant guidance and encouragement crowned the efforts with success. We would like to acknowledge the help and encouragement given by various people during the course of the mini project, and we are thankful to our beloved professor and Principal, Dr. Y. Vijaya Kumar, for providing an excellent academic climate. We would also like to thank our Dean (Academics), Dr. Manjunatha P, for helping us to make this mini project successful. We would like to express our sincere gratitude to Dr. S. V. Sathyanarayana, Head of the Department of Electronics and Communication Engineering, Shimoga, for his kind support, guidance, and encouragement throughout the course of this work. We are deeply indebted and very grateful to our Assistant Professors, Mrs. Ujwala B S and Mrs. Sumathi K, for their invaluable guidance and support during this project work. We would like to thank all the teaching and non-teaching staff of the Dept. of ECE for their kind co-operation during the course of the work. The support provided by the college and the departmental library is gratefully acknowledged. Lastly, we would like to thank our parents, who have been a source of inspiration and instrumental in the successful completion of this work.

Thank You Everyone,

Abhishek K 4JN22EC400

Chandana H B 4JN22EC403

Varsha G S 4JN22EC420

Vaishnavi R 4JN22EC422

Contents

Abstract

Acknowledgements

List of Figures

List of Tables

1 Introduction
1.1 General Introduction
1.2 Problem Statement
1.3 Methodology
1.3.1 Calculation of the Error Rate
1.4 Scope of the Project
1.5 Limitations

2 Theoretical Background
2.1 Literature survey
2.2 Summary

3 Design and Implementation
3.1 Introduction
3.2 System Design
3.2.1 Data Preprocessing
3.2.2 Model Architecture
3.2.3 Model Building
3.2.4 Model Validation
3.2.5 Model Usage
3.2.6 Testing
3.3 Implementation
3.3.1 Spyder
3.3.2 Jupyter Notebook
3.3.3 Algorithm
3.4 Flowchart of Implementation Process
3.4.1 Flowchart of the AQI Data
3.4.2 Flowchart of Extract Combine Data
3.5 Summary

4 Results and Discussion
4.1 Model Training
4.1.1 Experimental Output
4.1.2 Algorithms Output

5 Conclusion & Future Scope
5.1 Conclusion
5.2 Future Scope

6 References

7 Appendix
7.1 Programme of HTML
7.2 Programme of Extract Combine Data
7.3 Programme of Plot AQI
List of Figures
3.1 Block diagram of the project
3.2 Spyder Logo
3.3 Jupyter Logo
3.4 Flowchart of the AQI Data
3.5 Flowchart of Extract Combine Data
4.1 The output of AQI Dataset
4.2 The output of Extract Combine Data
4.3 Pairgrid
4.4 HeatMap
4.5 Scatter plot graph for linear regression model
4.6 Comparison of performance metrics for linear regression model
4.7 Scatter plot graph for Lasso regression model
4.8 Comparison of performance metrics for Lasso regression model
4.9 Scatter plot graph for KNN regressor model
4.10 Comparison of performance metrics for KNN regressor model
List of Tables
4.1 Independent features in dataset
4.2 Comparison of performance metrics for all models
Chapter 1

Introduction
1.1 General Introduction

Air pollution poses a significant threat to public health, ecosystems, and the environment, making its monitoring and mitigation a critical global priority. Rapid urbanization, industrialization, and vehicular emissions have significantly contributed to deteriorating air quality in both urban and rural areas. The Air Quality Index (AQI) serves as a standardized metric to represent the levels of air pollution, helping authorities and the public to understand its implications for daily life. AQI values are determined based on the concentration of key pollutants such as PM2.5, PM10, NO2, SO2, CO, and O3, each of which has distinct health impacts. Traditional methods of monitoring air quality, such as manual sampling and analysis, are often time-consuming, limited in spatial coverage, and unable to predict future conditions effectively. Additionally, these methods fail to capture the nonlinear relationships between pollutant levels and environmental variables like temperature, humidity, and wind speed. Therefore, there is an urgent need for innovative solutions to improve air quality prediction, enabling proactive decision-making.

Advancements in Machine Learning (ML) and data analytics have provided new opportunities for enhancing air quality monitoring systems. ML algorithms can process large datasets efficiently, identify patterns, and predict future AQI levels with high accuracy. These predictive models leverage meteorological data and pollutant concentrations to provide actionable insights, such as early warnings about hazardous pollution events. This project focuses on predicting AQI using advanced ML techniques, including Random Forest and Neural Networks. By integrating real-time data from various sources, such as remote sensors and meteorological records, the proposed system aims to overcome the limitations of traditional approaches. Accurate AQI predictions can empower policymakers, urban planners, and individuals to take timely measures to mitigate pollution impacts. The work also identifies critical parameters like PM2.5 and NO2, which significantly influence AQI variations. Predictive models are trained and validated using datasets collected from the Central Pollution Control Board (CPCB) and other reliable sources.

In conclusion, this research aims to establish an efficient, reliable, and scalable AQI prediction system using state-of-the-art ML methods. It not only contributes to public health awareness but also supports sustainable urban development by informing strategies for pollution control and urban planning. The outcomes of this project hold the potential to drive impactful environmental policies and enhance the quality of life for communities worldwide.

1.2 Problem Statement

This project aims to leverage advanced machine learning techniques to develop a robust

model for real-time Air Quality Index (AQI) prediction, enhancing decision-making for

pollution mitigation.

1.3 Methodology

The project involves collecting air quality and meteorological data, preprocessing it to handle inconsistencies, and applying feature selection methods to retain the most relevant parameters. The refined data is split into training and testing sets, followed by the development of predictive models using machine learning algorithms such as Linear Regression, KNN, and Lasso Regression. The models are evaluated using metrics like Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) to ensure accuracy in forecasting AQI.

• Data Collection: The first step involves gathering air quality and meteorological data from reliable sources such as the Central Pollution Control Board (CPCB). Key features include pollutant concentrations (e.g., PM2.5, NO2, CO) and meteorological variables (e.g., temperature, humidity, wind speed). These features are crucial for predicting the Air Quality Index.

• Data Preprocessing: The collected raw data is cleaned through imputation techniques to fill missing values and aggregation methods to remove redundancy. Data normalization ensures consistency, and outliers are handled to improve model performance.

• Model Development: Machine learning algorithms like Linear Regression, K-Nearest Neighbors (KNN), and Decision Tree Regressor are employed to build predictive models. These algorithms identify the complex relationships between pollutants and environmental factors. Advanced techniques, such as ensemble learning, are used to enhance prediction accuracy.

• Performance Evaluation: The models are validated using metrics like Mean Absolute Error (MAE) and R² Score to ensure reliability. Comparative analysis between standalone and hybrid models highlights the most effective approach. The final model provides actionable insights for policymakers by offering accurate AQI forecasts and identifying critical contributors to pollution levels.
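The workflow in the bullets above can be illustrated with a minimal Python sketch. It is not the project's actual code: the data is synthetic, the three feature columns (PM2.5, NO2, temperature) and the linear target are placeholders chosen for illustration, and a small hand-written KNN regressor stands in for a library implementation. The sketch assumes an 80/20 split, min-max scaling fitted on the training rows only, and evaluation with MAE and R².

```python
import math
import random


def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle row indices and hold out a fraction of rows for testing."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(X) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return ([X[i] for i in train_idx], [X[i] for i in test_idx],
            [y[i] for i in train_idx], [y[i] for i in test_idx])


def fit_min_max(rows):
    """Return per-column (min, max) pairs learned from the given rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def apply_min_max(rows, bounds):
    """Scale each column with the given bounds (constant columns map to 0)."""
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, (lo, hi) in zip(row, bounds)]
            for row in rows]


def knn_predict(X_train, y_train, query, k=5):
    """Predict by averaging the targets of the k nearest training rows."""
    nearest = sorted(zip(X_train, y_train),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    return sum(target for _, target in nearest) / len(nearest)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in data: columns are PM2.5, NO2, temperature (placeholders).
rng = random.Random(42)
X_raw = [[rng.uniform(10, 250), rng.uniform(5, 80), rng.uniform(15, 40)]
         for _ in range(200)]
# Hypothetical target for illustration only, not a real AQI formula.
y = [0.8 * pm + 0.3 * no2 + 0.5 * temp + rng.gauss(0, 5)
     for pm, no2, temp in X_raw]

# 80/20 split, then scale using statistics from the training rows only.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X_raw, y)
bounds = fit_min_max(X_train_raw)
X_train = apply_min_max(X_train_raw, bounds)
X_test = apply_min_max(X_test_raw, bounds)

y_pred = [knn_predict(X_train, y_train, row) for row in X_test]
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE = {mae:.2f}, R2 = {r2:.3f}")
```

Fitting the scaling bounds on the training rows alone keeps information from the test set out of the model. The same split and metrics could be reused to compare other regressors, such as Linear Regression or Lasso, on equal terms.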

1.3.1 Calculation of the Error Rate

• Mean Absolute Error (MAE): MAE is the arithmetic average of the absolute differences between the ground truth and the predicted values. It can also be defined as a measure of the errors between paired observations expressing the same phenomenon. It tells us how far the predictions differed from the actual results. The mathematical representation of MAE is given below.

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

• Mean Squared Error (MSE): MSE is a common metric used to evaluate the performance of regression models. It quantifies the average squared difference between the predicted values and the actual values. The mathematical representation of MSE is given below.

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

• Root Mean Square Error (RMSE): RMSE is the square root of the average of the squared differences between the target values and the values predicted by the model, that is, the square root of the mean squared error (MSE). Its computation is very similar to that of MSE. The mathematical representation of RMSE is given below.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$

where

$n$ is the number of observations,

$y_i$ is the actual value, and

$\hat{y}_i$ is the predicted value.
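As a small worked example of the three error metrics defined above, the following Python sketch computes MAE, MSE, and RMSE directly from their formulas. The `actual` and `predicted` lists are made-up numbers used only for illustration. Libraries such as scikit-learn provide equivalent functions (for example `mean_absolute_error` and `mean_squared_error` in `sklearn.metrics`), which would normally be preferred in practice.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average of |y_i - y_hat_i|."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error: average of (y_i - y_hat_i)^2."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


# Illustrative values only (not taken from the project's dataset).
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 190.0, 270.0]

print(mae(actual, predicted))   # 12.5  (errors 10, 10, 10, 20)
print(mse(actual, predicted))   # 175.0 (squared errors 100, 100, 100, 400)
print(rmse(actual, predicted))  # about 13.23
```

Because the errors are squared before averaging, the single 20-unit miss contributes more to MSE and RMSE than to MAE, which is why RMSE (13.23) exceeds MAE (12.5) here.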

1.4 Scope of the Project

The project aims to develop a robust machine learning-based system to predict the Air Quality Index (AQI) using pollutant and meteorological data; to create visual outputs such as graphs and maps for better interpretation of predicted air quality levels across locations; and to provide actionable recommendations for policymakers to mitigate air pollution and improve urban air quality standards.

Potential Impact: Enhanced public health awareness through timely warnings about hazardous air quality levels; support for policymakers in implementing data-driven strategies to combat air pollution; and a contribution to sustainable urban planning by integrating environmental data into decision-making processes.

1.5 Limitations

1. Data Availability: The accuracy of predictions depends on the availability and quality of historical air quality and meteorological data, which may not always be comprehensive or consistent.

2. Regional Constraints: Models trained on specific datasets may not generalize well to regions with different environmental conditions or pollutant patterns.

3. Real-time Challenges: Integrating real-time data into the system can face challenges such as delays in data acquisition or hardware limitations in monitoring sensors.

4. Non-Linear Relationships: While machine learning models handle non-linear relationships better than traditional methods, extreme fluctuations in pollutants due to sudden events (e.g., wildfires, dust storms) remain difficult to predict.

5. Computational Demand: Complex models like neural networks require significant computational resources, which might not be feasible in low-resource settings.

6. Limited Feature Representation: Factors like human activities, industrial emissions, or local vegetation effects might not always be represented adequately in the dataset.

7. Model Interpretability: Advanced models such as neural networks can act as black boxes, making it challenging to interpret the reasons behind certain predictions.

8. Temporal Resolution: The system might fail to provide high temporal resolution predictions, such as hourly changes, due to constraints in data granularity.

9. Dependency on Assumptions: Predictive models often rely on assumptions about weather patterns and pollution trends, which might not hold true in rapidly changing environments.

10. Scalability Issues: Implementing the system across larger regions or globally requires robust infrastructure, consistent data streams, and significant resources.



Chapter 2

Theoretical Background
Air quality prediction involves analyzing the complex interactions between pollutants

like PM2.5, NO2, and CO and meteorological factors such as temperature, humidity,

and wind speed. Traditional statistical methods, while useful, often fail to account for

the nonlinear relationships inherent in these interactions. Machine learning techniques,

including Regression Models, K-Nearest Neighbors (KNN), and Neural Networks, provide

more accurate predictions by identifying hidden patterns in large datasets. By integrating

advanced algorithms with environmental data, this project aims to improve the reliability

and scalability of AQI prediction systems.

2.1 Literature survey

1. E. Yaacoub, A. Kadri, M. Mushtaha and A. Abudayya, "Air quality monitoring and analysis in Qatar using a wireless sensor network deployment," published in Sensors in 2020, IEEE.

In this paper, the authors propose the deployment of a wireless sensor network to monitor air quality in Qatar. The primary goal is to develop a real-time, efficient, and scalable monitoring system to track pollutants such as carbon dioxide, carbon, nitrogen dioxide, and particulate matter (PM2.5, PM10) across different urban and industrial zones in the country. The paper provides insights into air quality fluctuations influenced by both human activities, such as industrial emissions and traffic, and natural factors, such as weather and dust storms.

Technical Method: Deploying a wireless sensor network to monitor air quality parameters in Qatar, leveraging sensors for real-time data collection and analysis.

Advantages: Real-time Monitoring: The system provides real-time data collection, allowing immediate analysis of air quality. User-Friendly Data Presentation: Data is processed into an accessible format for the general public via websites and mobile applications.

Disadvantages: Dependency on External Power for Some Operations: Though the MGMS stations are powered by solar panels, their reliance on clear weather can limit operation during extended periods of overcast conditions or at night. Scalability Constraints in Urban Areas: The deployment of multiple sensors in dense urban settings could face scalability challenges.

Result: The wireless sensor network effectively monitors and analyses air quality in real time, providing valuable insights into pollution levels and helping identify trends for better environmental management.

2. S. Pandya, H. Ghayvat, A. Sur, M. Awais, K. Kotecha et al., "Pollution weather prediction system: Smart outdoor pollution monitoring and prediction for healthy breathing and living," published in 2020.

In this paper, the authors implement a smart system designed for outdoor pollution monitoring and weather prediction. The system leverages a combination of environmental sensors and machine learning models to predict pollution levels and provide real-time data, which can help individuals and authorities take preventive actions to minimize exposure to harmful pollutants. It combines real-time outdoor pollution data with weather conditions to predict pollution levels and inform healthier living practices.

Technical Method: Development of a pollution weather prediction system by integrating smart outdoor pollution monitoring sensors with predictive models to forecast air quality and support healthier living.

Advantages: Accurate Prediction of Multiple Pollutants: The system uses a combination of linear regression and artificial neural networks (ANN) to predict various pollutants, achieving high accuracy for pollutants such as PM2.5 and PM10, which is essential for real-time monitoring and forecasting. User-Friendly Data Access: The system includes web and mobile interfaces to display air quality data, making it accessible to both experts and the general public, thus promoting environmental awareness and timely actions.

Disadvantages: Hardware Limitations with Sensor Types: The prediction system is limited by the specific sensor types it uses, which may restrict its application to only a certain set of pollutants, leaving out other significant air quality parameters.

Result: The study demonstrates that the pollution weather prediction system accurately monitors and forecasts air quality, enabling proactive measures for healthier living.

3. N. Salman, A. H. Kemp, A. Khan and C. Noakes, "Real time wireless sensor network (WSN) based indoor air quality monitoring system," published in IFAC-Papers in 2019.

In this paper, the authors describe the development of a wireless sensor network system for real-time indoor air quality (IAQ) monitoring. The system is designed to ensure healthy indoor environments by continuously detecting pollutants like carbon dioxide and particulate matter and transmitting this data wirelessly for analysis and action.

Technical Method: The paper presents a real-time indoor air quality monitoring system using a wireless sensor network, in which multiple low-cost sensor nodes measure parameters like carbon dioxide, temperature, and humidity and transmit the data wirelessly for real-time analysis and visualization, enabling efficient air quality management.

Advantages: Spatial Coverage: Uses spatial prediction to cover areas with fewer sensors. Real-time Data: Offers immediate access to IAQ data, which can support timely decisions about ventilation. Applicability: The system can be applied in residential buildings, workplaces, schools, and healthcare facilities.

Disadvantages: Limited Sensor Types: Currently lacks particulate matter sensors, which limits its ability to monitor a broader range of pollutants. Power Consumption: Requires further optimization to extend operational time between charges, especially in battery-powered setups.

Result: The system successfully demonstrates real-time monitoring and management of indoor air quality, providing continuous data on environmental parameters and enabling prompt actions to improve air quality.

4. T. Madan, S. Sagar, D. Virmani, "Air Quality Prediction using Machine Learning Algorithms: A Review," published in 2020.

The paper provides a comprehensive review of machine learning algorithms applied to air quality prediction, including Support Vector Machines, Decision Trees, Random Forest, and Linear Regression. It highlights the importance of monitoring air quality due to its impact on public health and environmental sustainability, and the role of these models in monitoring air pollution, forecasting trends, and mitigating the health impacts associated with poor air quality. The authors discuss the strengths and limitations of each algorithm in predicting air quality indices (AQI) and related metrics, concluding that ensemble methods like Random Forest show significant promise for accurate and reliable predictions.

Technical Method: Data collection and preprocessing, review of algorithms, evaluation metrics, comparison, and feature analysis.

Advantages: Machine learning algorithms like Random Forest and Support Vector Machines provide high accuracy and efficiency in predicting air quality by processing large datasets automatically. They can adapt to different regions and pollution scenarios. These models are scalable and can integrate additional data sources, such as meteorological conditions or emissions inventories, to improve predictions.

Disadvantages: Data and Computational Demands: Machine learning models require high-quality historical data, which may not be available in all regions. Overfitting: Complex models like Random Forest and SVM are prone to overfitting and are less interpretable than simpler models.

Result: The paper reviews various machine learning algorithms used for air quality prediction, highlighting their effectiveness in forecasting air quality parameters like PM2.5, carbon dioxide, and nitrogen dioxide, and discussing their strengths and challenges in real-world application.

5. B. D. Parameshachari, G. M. Siddesh, V. Sridhar, M. Latha, K. N. A. Sattar, and G. Manjula, "Prediction and Analysis of Air Quality Index using Machine Learning Algorithms," published in 2022.

This paper discusses the application of machine learning techniques for predicting and analyzing the Air Quality Index (AQI). The study evaluates the performance of various regression-based models, such as Decision Tree Regression (DTR), Random Forest Regression (RFR), Support Vector Regression (SVR), and Linear Regression (LR), for atmospheric modeling and pollution level forecasting. The authors leverage machine learning to enhance the accuracy of AQI predictions using historical and environmental data, to analyze the role of pollutants (e.g., PM2.5, PM10) and other environmental factors in determining air quality, and to provide actionable insights to policymakers and stakeholders for mitigating pollution.

Technical Method: The paper employs machine learning algorithms such as Decision Tree Regression (DTR), Random Forest Regression (RFR), Support Vector Regression (SVR), and Linear Regression (LR) to predict the Air Quality Index (AQI) by analyzing historical pollution data and environmental factors, with evaluation based on metrics like the R-squared score, Mean Absolute Error (MAE), and Mean Squared Error (MSE).

2.2 Summary

The reviewed papers explore various approaches to air quality monitoring and prediction,

emphasizing real-time data collection through wireless sensor networks and advanced

machine learning techniques. They focus on pollutants like PM2.5, NO2, and CO2,

integrating meteorological factors for accurate forecasting. Machine learning algorithms

such as Random Forest, SVM, and Decision Trees are highlighted for their predictive

accuracy. Challenges include hardware limitations, scalability, and the need for high-

quality data. These systems aim to enhance public health by providing actionable insights

and enabling proactive pollution management. The studies collectively underscore the

potential of combining technology and analytics for environmental sustainability.



Chapter 3

Design and Implementation


3.1 Introduction

The design and implementation of the Air Quality Index Prediction System involve integrating data collection, processing, and predictive modeling. Real-time air quality data is gathered from sensors measuring pollutants like PM2.5, NO2, and CO2, along with meteorological parameters. The collected data undergoes preprocessing to handle missing values, remove redundancies, and extract relevant features. Machine learning algorithms, such as Linear Regression, Decision Tree, and Random Forest, are applied to build predictive models. The system architecture ensures efficient data flow, model training, and validation. Finally, results are displayed via user-friendly platforms, aiding in timely decision-making for pollution management.

3.2 System Design

Figure 3.1 shows the system design for AQI prediction, which integrates a range of technologies and methodologies to deliver real-time, accurate air quality assessments. The design begins with the incorporation of real-time sensors and cloud-based data sources, which work in tandem to continuously collect crucial air quality and meteorological parameters. These parameters may include pollutants such as PM2.5, PM10, CO2, NO2, and O3, as well as weather conditions like temperature, humidity, and wind speed, which are all significant factors in determining air quality. Once the data is collected, a preprocessing module is employed to clean and refine the data. This preprocessing step ensures that any noise or irrelevant information is removed and that the dataset is formatted for use in predictive modeling. Additionally, feature extraction techniques are applied to select the most pertinent variables that influence air quality, which allows for more accurate predictions.



Air Quality Index Prediction

Figure 3.1: Block diagram of the project

3.2.1 Data Preprocessing

Data preprocessing prepares raw data for accurate AQI prediction by cleaning, normalizing, and refining it. Missing values are imputed, outliers are removed, and redundant data is consolidated. Key features like PM2.5, NO2, and temperature are extracted, while irrelevant ones are discarded to improve efficiency. The dataset is then split into training and testing sets so that model performance can be evaluated reliably. Effective data preprocessing is critical for accurate Air Quality Index (AQI) predictions. The steps undertaken include:

1. Data Collection: Data was acquired from cloud-based climate repositories. This included key meteorological and pollution-related features such as Particulate Matter (PM2.5), Nitrogen Dioxide (NO2), Carbon Monoxide (CO), temperature, wind speed, and humidity.

2. Data Cleaning (Missing Value Imputation): Gaps in the data were filled using imputation techniques to ensure dataset completeness and consistency.

3. Data Aggregation: Redundant data entries were consolidated to maintain relevance and reduce computational overhead.
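The cleaning and aggregation steps above can be sketched in plain Python. The report does not state which imputation technique was used, so mean imputation is an assumption here, and the records, column names, and helper functions are hypothetical.

```python
from statistics import mean


def drop_duplicates(rows):
    """Keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_mean(rows, columns):
    """Fill None entries in each listed column with that column's mean."""
    filled = [dict(row) for row in rows]  # copy so the input stays untouched
    for col in columns:
        col_mean = mean(r[col] for r in rows if r[col] is not None)
        for r in filled:
            if r[col] is None:
                r[col] = col_mean
    return filled


# Hypothetical readings; None marks a missing value.
readings = [
    {"day": 1, "pm25": 80.0, "temp": 20.0},
    {"day": 2, "pm25": None, "temp": 22.0},
    {"day": 2, "pm25": None, "temp": 22.0},  # redundant entry
    {"day": 3, "pm25": 100.0, "temp": None},
]
clean = impute_mean(drop_duplicates(readings), ["pm25", "temp"])
```

With pandas, the same two steps would typically be written with `DataFrame.drop_duplicates()` and `DataFrame.fillna()`.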

Feature Engineering:

1. Feature Importance: Permutation Feature Importance was applied to rank features

based on their predictive relevance.


2. Feature Selection: Low-importance features were removed to streamline the model and improve efficiency.

3. Data Normalization: All features were normalized to ensure uniformity and enhance the performance of machine learning algorithms.

4. Dataset Splitting: The dataset was divided into two subsets: 80% for training and 20% for testing.
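Permutation feature importance, named in the first step of this list, can be illustrated with a small sketch: each feature column is shuffled in turn, and the resulting increase in prediction error is taken as that feature's importance. The `toy_model` function below is a hypothetical stand-in for a trained model, and the data is synthetic.

```python
import random


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical stand-in for a trained model: the output depends on PM2.5 only.
def toy_model(row):
    pm25, humidity = row
    return 2.0 * pm25


X = [[float(i), float((7 * i) % 10)] for i in range(20)]
y = [toy_model(row) for row in X]
importances = permutation_importance(toy_model, X, y)
```

Here the second feature receives an importance of zero because the stand-in model ignores it, which is the kind of feature the selection step would remove.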

3.2.2 Model Architecture

The input layer accepts the meteorological and pollutant-related features that influence AQI. These features include:

Particulate matter: PM2.5 and PM10 concentrations, the primary indicators of air pollution.

Gaseous pollutants: NO2, CO, and other harmful gases that affect air quality.

Meteorological data: temperature, humidity, wind speed, atmospheric pressure, and rainfall.

The purpose of this layer is to feed raw data into the model. Proper normalization ensures that features measured on different scales (e.g., temperature in °C vs. PM2.5 in µg/m³) do not dominate others.
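As an illustration of the normalization mentioned above, the following sketch applies min-max scaling so that every feature falls in the range 0 to 1. The report does not say which normalization method was used, so min-max scaling is an assumption, and the sample values are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


temperature_c = [10.0, 20.0, 30.0]   # hypothetical readings in °C
pm25 = [50.0, 150.0, 250.0]          # hypothetical readings in µg/m³

scaled_temp = min_max_scale(temperature_c)
scaled_pm25 = min_max_scale(pm25)
```

After scaling, both features span the same range even though the raw PM2.5 values are an order of magnitude larger. In practice the minimum and maximum are usually computed on the training set only and then reused for the test set.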

3.2.3 Model Building

The primary goal is to predict the AQI value based on meteorological and pollutant data

to help monitor air quality and guide decision-making. The model usage involves several

steps:

1. Input Data Integration: Data from real-time sensors or historical datasets, including PM2.5, PM10, CO, NO2, temperature, humidity, wind speed, and other features, is processed and input into the system. Each feature provides insight into air pollution levels and atmospheric conditions.

2. Feature Selection: Not all collected data points may be relevant. Feature selection methods identify the most significant variables affecting AQI prediction. For example, PM2.5 and PM10 typically carry more predictive weight than weakly related features.

3. Training the Model: During training, the model learns to map input features to the AQI. A diverse range of data points, representing different times, locations, and pollution scenarios, helps the model capture complex patterns in the data.


4. Prediction Process: Once trained, the model predicts AQI for new or unseen data by applying the learned relationships between input features and AQI. For instance, given data on PM2.5, NO2, and temperature for a particular day, the model estimates the AQI for that day.

5. Real-Time Applications: The model can be integrated into real-time systems, such as mobile apps, dashboards, or IoT devices, to provide live AQI updates. Alerts or recommendations (e.g., "Reduce outdoor activities" or "Wear a mask") can be issued based on AQI thresholds.
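The threshold-based alerts described in the last step could look like the sketch below. The threshold values and the first message are hypothetical and chosen only for illustration; a deployed system would use the AQI bands published by the relevant air quality authority.

```python
# Hypothetical (upper bound, message) pairs, ordered from cleanest to worst.
ADVISORIES = [
    (100, "Air quality is acceptable"),
    (200, "Reduce outdoor activities"),
    (float("inf"), "Wear a mask"),
]


def advisory_for(aqi):
    """Return the advisory message for a predicted AQI value."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    for upper_bound, message in ADVISORIES:
        if aqi <= upper_bound:
            return message


message = advisory_for(150)
```

Keeping the thresholds in a single table makes it straightforward to swap in official bands without touching the lookup logic.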

3.2.4 Model Validation

Validation ensures that the model is accurate, reliable, and generalizable to unseen data.

The validation process includes the following:

Data Splitting:

1. Training Set: 80% of the dataset is used for training, enabling the model to learn

relationships between features and AQI.

2. Test Set: 20% of the dataset is reserved for testing and is used to evaluate how well the model performs on unseen data. Holding out this data makes it possible to detect overfitting and to check that the model generalizes its predictions.
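A minimal sketch of the 80/20 split is shown below, assuming a random shuffle with a fixed seed so the split is reproducible. The function name and the seed are illustrative; scikit-learn's `train_test_split` offers the same functionality.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Randomly assign rows to a training set and a held-out test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [rows[i] for i in range(len(rows)) if i not in test_idx]
    test = [rows[i] for i in range(len(rows)) if i in test_idx]
    return train, test


records = list(range(100))            # stand-in for 100 daily records
train, test = split_train_test(records)
```

Every record ends up in exactly one of the two subsets, so the test set is never seen during training.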

3.2.5 Model Usage

Predictive Analysis: The machine learning model forecasts AQI using historical pollution

data, meteorological factors (temperature, humidity, etc.), and environmental conditions.

Algorithms Employed:

1. Linear Regression: For establishing a linear relationship between pollutants and

AQI.

2. Lasso Regression: To improve prediction accuracy by feature selection.

3. K-Nearest Neighbors (KNN): To estimate AQI based on neighboring data points.
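To make the first algorithm concrete, the sketch below fits a straight line AQI = m * x + c to a single feature using the closed-form least-squares solution. It is an illustration on synthetic numbers, not the project's code; the project's models use several features at once.

```python
def fit_line(x, y):
    """Ordinary least squares for one feature: returns slope m and intercept c."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    m = sxy / sxx
    c = y_mean - m * x_mean
    return m, c


# Synthetic data constructed so that aqi = 2 * pm25 + 15 exactly.
pm25 = [10.0, 20.0, 30.0, 40.0]
aqi = [35.0, 55.0, 75.0, 95.0]

m, c = fit_line(pm25, aqi)
predicted = m * 25.0 + c   # prediction for an unseen PM2.5 value of 25
```

Because the synthetic data lies exactly on a line, the fit recovers the slope and intercept used to generate it. Real AQI data is noisy, so the fitted line only minimizes the squared error rather than passing through every point.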


3.2.6 Testing

Independent Test Data: A reserved dataset (unseen during training) is used to test the model's generalization and to assess its ability to predict AQI under varied climatic and pollutant conditions.

Scenario Testing: Simulate extreme environmental conditions (e.g., high pollution during dust storms) to test robustness. Introduce missing or noisy data to evaluate the model's handling of real-world challenges. Compare the model's predictions with actual AQI values obtained from trusted monitoring systems. Benchmark against existing models or methods to assess relative performance.

Visualization: Create plots of predicted vs. actual AQI values. Develop trend analyses to demonstrate the model's temporal prediction accuracy.

Post-Validation Improvements: Refine the model based on identified weaknesses, such as underperforming regions or pollutant types. Enhance feature engineering to include additional environmental factors (e.g., rainfall or wind direction). Experiment with advanced algorithms like ensemble methods for better accuracy.

3.3 Implementation

The "Air Quality Index Prediction" project uses Python for data analysis and machine learning because of its versatile libraries such as Pandas and Scikit-learn. Jupyter Notebook facilitates exploratory data analysis, while Matplotlib and Seaborn create visualizations of AQI trends and model performance. Spyder IDE is used for coding and debugging the project efficiently. For deployment, a web interface built with Flask or Django can display AQI predictions, supported by cloud platforms like AWS or Google Cloud for scalability. Version control tools like GitHub support collaborative development and efficient project management.

3.3.1 Spyder

Figure 3.2: Spyder Logo


Spyder, short for Scientific Python Development Environment, is an open-source Integrated Development Environment tailored for scientific computing and data analysis. It is written in Python and is often included in the Anaconda distribution. Spyder features a powerful editor with advanced features like syntax highlighting, code introspection, and auto-completion. It includes an interactive IPython console, allowing users to execute code and view results in real time. The variable explorer in Spyder provides an intuitive way to inspect data and variables during runtime. It supports integration with libraries like NumPy, pandas, Matplotlib, and SciPy for scientific computing tasks. Spyder also includes debugging tools, making it easier to identify and fix errors in code. The IDE is customizable, with support for plugins and layouts. It is widely used for machine learning, data visualization, and statistical analysis. Overall, Spyder is well suited to researchers, data scientists, and developers working in Python.

3.3.2 Jupyter Notebook

Figure 3.3: Jupyter Logo

Jupyter Notebook is an open-source, web-based interactive computing environment designed for writing and executing code. It supports multiple programming languages, with Python being the most commonly used. Jupyter allows users to create and share documents that contain live code, equations, visualizations, and explanatory text. It is widely used for data analysis, machine learning, and statistical modeling due to its interactivity and ease of use. Each notebook consists of cells, which can contain code or Markdown text, enabling a seamless combination of code execution and documentation. Jupyter supports visualization libraries like Matplotlib and Seaborn, making it well suited to analyzing and presenting data. Its modular nature allows users to execute code incrementally, facilitating debugging and iterative development. Notebooks can be exported in various formats, such as HTML, PDF, or Python scripts, for sharing and collaboration. Jupyter integrates well with scientific computing and machine learning frameworks, making it popular among researchers and data scientists. Overall, it is a powerful tool for exploring, documenting, and sharing computational workflows.

1. Pandas: Pandas is a powerful open-source Python library for data manipulation and analysis. It provides two primary data structures, Series (1D) and DataFrame (2D), which allow users to work efficiently with labeled and tabular data. It supports reading and writing data in various file formats, including CSV, Excel, SQL, and JSON, and offers functionality for data cleaning, filtering, grouping, merging, reshaping, and handling missing values. Pandas also supports time series analysis, including resampling, shifting, and handling date ranges. The library is optimized for performance, making it suitable for large datasets, and supports multi-indexing, which allows users to create complex data hierarchies. Its integration with other libraries such as NumPy and Matplotlib makes it easy to transform and visualize data. It can handle a wide variety of data types, including text, integers, floats, and categorical data, making it versatile for applications ranging from finance to engineering. It is commonly used in data preprocessing, exploratory data analysis (EDA), and machine learning tasks.

2. NumPy: NumPy is a core library for numerical computing in Python, providing support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays. The central data structure in NumPy is the ndarray (N-dimensional array), which allows for efficient storage and manipulation of large datasets. Compared with ordinary Python lists, NumPy arrays are much more efficient in terms of both memory and computation. One of the key features of NumPy is its ability to perform element-wise operations on arrays, which simplifies complex mathematical calculations. For example, arithmetic operations can be performed on entire arrays without looping through each element individually. NumPy also provides a wide variety of mathematical functions, such as linear algebra operations (e.g., matrix multiplication, eigenvalues, and eigenvectors), random number generation, Fourier transforms, and more. NumPy is designed to integrate well with other libraries like Pandas, SciPy, and Scikit-learn, making it a foundational library in the Python data science stack. It provides tools for efficient data manipulation and analysis, especially for tasks involving large datasets or high-performance computations. NumPy also supports broadcasting, which allows operations between arrays of different shapes and dimensions. Additionally, NumPy supports advanced indexing and slicing techniques that allow users to manipulate data in complex ways. This functionality is particularly useful when working with high-dimensional data, such as images or time-series data, where users need to extract specific parts of the array or modify values in place.

3. Matplotlib: Matplotlib is a popular Python library for creating static, interactive, and animated visualizations. It provides a variety of plotting functions, including line plots, scatter plots, bar charts, histograms, and more. The library is highly customizable, allowing users to adjust plot attributes such as colors, labels, and styles to suit their needs. It is built on NumPy and integrates well with other libraries in the Python ecosystem, such as pandas and SciPy. The primary interface of Matplotlib is its `pyplot` module, which provides a MATLAB-like interface for ease of use. Matplotlib supports multiple backends for rendering, making it versatile across platforms and environments, including Jupyter Notebooks. It enables the creation of publication-quality plots with detailed control over elements like figure size, resolution, and layout. The library supports saving plots in various formats, such as PNG, SVG, PDF, and more. Advanced users can use its object-oriented API for finer control over plot elements. With its extensive documentation and active community, Matplotlib remains a go-to tool for data visualization in Python.

4. Seaborn: Seaborn is a data visualization library built on top of Matplotlib that provides an interface for creating attractive and informative statistical plots. It is designed to work seamlessly with Pandas DataFrames, making it particularly useful for exploratory data analysis (EDA) and visualizing relationships between variables. Seaborn simplifies many of the complex plotting tasks in Matplotlib by offering higher-level functions for creating visualizations such as box plots, violin plots, pair plots, heatmaps, and more. It automatically handles many of the plot aesthetics, such as colors, labels, and styles, providing a polished appearance without requiring extensive customization. One of Seaborn's standout features is its ability to visualize complex relationships between multiple variables. For example, a pair plot creates a grid of scatter plots showing pairwise relationships between several variables, while a heatmap can visualize correlation matrices or other matrix-like data. Seaborn also provides functionality for dealing with categorical data. Plots like bar plots, count plots, and categorical scatter plots help users quickly visualize the distribution of categorical variables. Additionally, Seaborn offers various options for customizing plot styles and color palettes, enhancing the aesthetic appeal of visualizations. In addition to basic statistical plots, Seaborn includes tools for visualizing regression models, distributions, and uncertainty in data, which is particularly useful for analyzing patterns, trends, and outliers. Its plots can also be used to present data that has been analyzed with statistical tests like t-tests or ANOVA.

3.3.3 Algorithm

1. Linear Regression: Linear regression is a fundamental supervised machine learning method that models the relationship between a dependent variable (Y) and one or more independent variables (X) by fitting a linear equation to observed data. It is commonly used for predicting continuous outcomes. The goal is to minimize the difference (error) between the predicted and actual values by finding the best-fitting line. Linear regression assumes a linear relationship, independence of errors, and normally distributed residuals. It uses the Mean Squared Error (MSE) as a loss function to evaluate the difference between actual and predicted values, and the model optimizes the slope and intercept using techniques like Gradient Descent or the Normal Equation.

Dependent and Independent Variables: In predicting AQI, the dependent variable is the AQI value, and the independent variables are environmental features such as PM2.5, temperature, humidity, wind speed, and pollutant concentrations. For a single feature, the linear regression model establishes a relationship of the form:

Y = mX + c

where Y represents the AQI, X is the independent variable, m is the slope, and c is the intercept.

Data: Historical air quality and meteorological data are used to train the linear regression model so that it learns the patterns and trends in AQI.

Prediction: Once trained, the model predicts AQI values for unseen data based on the linear relationship it has derived.

The model assumes a linear relationship between variables, which might not fully capture the complexity of AQI, since AQI is influenced by multiple, nonlinear factors. In this project, linear regression is one of the algorithms explored alongside more complex methods like LASSO regression and decision tree regression for better accuracy. Metrics such as Mean Squared Error (MSE) and Mean Absolute Error (MAE) are calculated to assess the performance and accuracy of the linear regression model. By analyzing the coefficients of the regression equation, the project identifies which features (e.g., temperature, PM2.5) have the most significant impact on AQI. Linear regression provides a foundational method for AQI prediction and is complemented by other advanced algorithms for improved precision in the project.

2. Lasso Regression: Lasso is a type of linear regression that uses L1 regularization to improve the model's generalization by shrinking the coefficients of less important features, in some cases exactly to zero. It is especially useful for high-dimensional data, where not all features contribute equally to the prediction. In this project, Lasso regression is used to predict AQI by identifying and emphasizing the most relevant environmental and pollutant factors while reducing the impact of irrelevant or less significant ones. It can equally be applied in air quality monitoring to predict individual pollutant levels (like PM2.5) from various environmental factors.

Example (Predicting PM2.5 Levels): Suppose we want to predict PM2.5 concentration from factors such as temperature, humidity, wind speed, traffic volume, and CO levels. Lasso regression helps identify which factors are most important while ignoring irrelevant ones. For example, temperature, humidity, and traffic volume might strongly influence PM2.5 levels. Wind speed might have only a moderate effect, so its coefficient would be shrunk or, if it adds little, excluded. CO levels might turn out to be irrelevant and be removed by Lasso.

Lasso helps avoid overfitting and selects the most important features. However, it may not perform well if all predictors are equally important or highly correlated.

3. K-Nearest Neighbours (KNN) Regression: KNN is a simple, supervised machine learning (ML) algorithm that can be used for classification or regression tasks, and is also frequently used in missing value imputation. It is based on the idea that the observations closest to a given data point are the most similar observations in a data set, so the value for an unseen point can be estimated from the values of the closest existing points. By choosing K, the user selects the number of nearby observations to use in the algorithm. In this project, KNN predicts air quality indices using nearby data points with similar environmental conditions. Its main advantages are that it is simple to understand, makes no assumptions about the data distribution, and adapts to non-linear relationships. Its limitations are that it requires high-quality, well-distributed training data for accurate predictions, is computationally expensive for large datasets, and is sensitive to irrelevant or unscaled features. Example: KNN predicts the AQI for a given day by averaging the AQI values of the most similar past days.
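The KNN example above, averaging the AQI of the most similar past days, can be written as a short from-scratch sketch. It assumes Euclidean distance on features that have already been scaled; the feature values and AQI figures are hypothetical. In a library-based workflow the equivalent estimator is scikit-learn's `KNeighborsRegressor`.

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Average the targets of the k training rows closest to the query."""
    distances = sorted(
        (math.dist(row, query), target) for row, target in zip(train_X, train_y)
    )
    nearest = distances[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical past days described by two scaled features,
# e.g. (PM2.5, humidity), with the AQI recorded on each day.
past_days = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.15], [0.90, 0.80], [0.80, 0.90]]
past_aqi = [50.0, 60.0, 55.0, 300.0, 320.0]

estimate = knn_predict(past_days, past_aqi, [0.12, 0.18], k=3)
```

The query day resembles the three low-pollution days, so the estimate is their average and the two high-pollution days have no influence. Every prediction scans all training rows, which is the computational cost noted above.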

3.4 Flowchart of Implementation Process

3.4.1 Flowchart of the AQI Data

Figure 3.4: Flowchart of the AQI Data

The flowchart in Fig. 3.4 outlines the steps involved in visualizing yearly data trends.


• Initialization: The process begins by initializing the years and data structures.

• Iteration: It then iterates through each year in the dataset.

• Calculation: For each year, it calculates the yearly average.

• Storage: The calculated average is stored in a data dictionary.

• Iteration End Check: The process checks whether it has iterated through all years. If not, it continues the iteration.

• Data Iteration: Once all years are processed, it iterates through the data dictionary.

• Line Plot Creation: For each year, a line plot is created to visualize the trend.

• Labeling and Titling: The plot is labeled and titled appropriately.

• Visualization: The final plot is displayed.

• End: The process concludes.
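The first half of this flowchart, up to the data dictionary, can be sketched as follows. The records are hypothetical, and the plotting steps are replaced by a simple printout so the sketch needs no plotting library; with Matplotlib, the final loop would call `plt.plot` instead.

```python
from statistics import mean

# Hypothetical (year, PM2.5) readings standing in for the real dataset.
records = [
    (2013, 120.0), (2013, 80.0),
    (2014, 90.0), (2014, 110.0),
    (2015, 70.0),
]

years = [2013, 2014, 2015]     # initialization
data = {}                      # data dictionary for the yearly averages

for year in years:             # iterate through each year
    values = [value for y, value in records if y == year]
    data[year] = mean(values)  # calculate and store the yearly average

for year, average in data.items():   # iterate through the data dictionary
    print(f"{year}: {average:.1f}")
```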

3.4.2 Flowchart of Extract Combine Data

Figure 3.5: Flowchart of Extract Combine Data


The flowchart in Fig. 3.5 outlines the steps involved in processing and analyzing air quality data, likely focusing on PM2.5 levels.

• Start: The process begins.

• Directory Check: It checks for the existence of a specified directory.

• Directory Creation: If the directory doesn't exist, it creates one.

• Year Loop: The process loops through each year's data.

• Data Processing: For each year, it processes the air quality data.

• Data Combination: It combines the processed data from all years.

• CSV Creation: The combined data is saved in a CSV file.

• DataFrame Loading: The CSV data is loaded into a DataFrame.

• PM2.5 Data Extraction: It extracts the PM2.5 data for the year.

• End: The process concludes.

Essentially, this flowchart details the workflow for organizing, combining, and extracting relevant air quality data for analysis.
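A minimal sketch of this workflow using only the Python standard library is given below. The directory name, file name, column names, and sample rows are all hypothetical, and the CSV is read back with `csv.DictReader` rather than into a DataFrame; with pandas, the loading step would be `pandas.read_csv`.

```python
import csv
import os
import tempfile


def combine_years(yearly_rows, out_dir, filename="combined.csv"):
    """Merge per-year rows into one CSV file and return its path."""
    os.makedirs(out_dir, exist_ok=True)        # directory check and creation
    combined = []
    for year in sorted(yearly_rows):           # year loop
        for row in yearly_rows[year]:          # data processing
            combined.append({"year": year, **row})
    path = os.path.join(out_dir, filename)
    with open(path, "w", newline="") as f:     # CSV creation
        writer = csv.DictWriter(f, fieldnames=["year", "day", "pm25"])
        writer.writeheader()
        writer.writerows(combined)
    return path


def load_pm25(path, year):
    """Load the combined CSV and extract the PM2.5 values for one year."""
    with open(path, newline="") as f:
        return [float(r["pm25"]) for r in csv.DictReader(f) if int(r["year"]) == year]


yearly_rows = {
    2013: [{"day": 1, "pm25": 120.5}, {"day": 2, "pm25": 98.0}],
    2014: [{"day": 1, "pm25": 110.0}],
}
out_dir = os.path.join(tempfile.mkdtemp(), "aqi_output")
csv_path = combine_years(yearly_rows, out_dir)
pm25_2013 = load_pm25(csv_path, 2013)
```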

The usage of model validation, selection, testing, and algorithms plays a crucial role in developing a robust system for Air Quality Index (AQI) prediction. Model validation ensures that the predictive models perform reliably on unseen data by splitting the dataset into training and testing subsets and evaluating the model's performance using metrics like Mean Squared Error (MSE) and Mean Absolute Error (MAE). The selection of algorithms such as Linear Regression, Lasso Regression, K-Nearest Neighbors, and Decision Tree Regression is driven by their ability to handle specific patterns and complexities in the data, ensuring accurate predictions of AQI.

3.5 Summary

Testing is conducted to evaluate the effectiveness of these algorithms in real-world scenarios, refining the model to address overfitting or underfitting issues. The flowchart of the process typically begins with data collection and preprocessing, followed by feature extraction and selection, model training, validation, and deployment. This systematic approach enables the integration of machine learning techniques into a streamlined workflow, ensuring that the model provides reliable and actionable predictions for air quality management and health advisories.



Chapter 4

Results and Discussion


4.1 Model Training

Model training is a critical step in the development of the Air Quality Index prediction system. It involves using machine learning techniques to learn patterns and relationships between air quality parameters (e.g., PM2.5, PM10, CO, NO2) and meteorological factors (e.g., temperature, humidity, wind speed). The process begins with splitting the dataset into training and testing sets, where 80% of the data is used for training the model and the remaining 20% is reserved for evaluation. Various algorithms, namely Linear Regression, Lasso Regression, and K-Nearest Neighbors, are trained on the processed dataset. These models analyze large datasets to identify trends and predict AQI values. During training, the model adjusts its parameters to minimize error metrics such as Mean Squared Error (MSE) or Root Mean Squared Error (RMSE). Hyperparameter tuning is also performed to optimize the model's performance, using techniques like Grid Search or Random Search. Cross-validation is employed to ensure the model generalizes well across different subsets of data. The trained model is then ready to be validated on the testing set, ensuring it accurately predicts AQI under diverse conditions.
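The cross-validation step can be sketched as below: the data is divided into k folds, each fold serves once as the validation set, and the validation errors are averaged. The helper names and the mean-predictor baseline are hypothetical and stand in for the project's models; a grid search would repeat this scoring for each candidate hyperparameter value and keep the one with the lowest score.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


def cross_val_mse(fit, predict, X, y, k=5):
    """Average validation MSE over k folds for a fit/predict pair."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        errors = [(predict(model, X[i]) - y[i]) ** 2 for i in val_idx]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / k


# Hypothetical baseline "model": always predict the training mean.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x):
    return model


X = [[float(i)] for i in range(10)]
y = [100.0] * 10
score = cross_val_mse(fit_mean, predict_mean, X, y, k=5)
```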

4.1.1 Experimental Output

Fig. 4.1 shows the daily variations in PM2.5 levels over three years (2013, 2014, and 2015). The x-axis represents the day of the year (1 to 365), and the y-axis shows the PM2.5 concentration levels. The lines for each year (blue for 2013, orange for 2014, and green for 2015) illustrate seasonal trends in air pollution. Key observations include high PM2.5 levels at the beginning and end of each year, indicating seasonal peaks, likely due to winter pollution caused by heating systems and weather conditions. The mid-year months show lower PM2.5 concentrations, reflecting cleaner air during warmer months, possibly due to better dispersion and reduced sources of pollution. This visualization highlights year-to-year variations while showing consistent seasonal patterns, which are crucial for understanding and predicting AQI trends.

• Table 4.1 lists the features collected from the cloud-based climate data that are relevant to air quality analysis.

• Machine learning techniques were applied to the AQI data using features such as temperature, humidity, wind speed, and rainfall.


Feature   Description
T         Average temperature (°C)
TM        Maximum temperature (°C)
Tm        Minimum temperature (°C)
SLP       Atmospheric pressure at sea level (hPa)
H         Average relative humidity
PP        Total rainfall and/or snowmelt (mm)
VV        Average visibility (km)
V         Average wind speed (km/h)
VM        Maximum sustained wind speed (km/h)
VG        Maximum speed of wind (km/h)
RA        Indicator of whether there was rain or drizzle
SN        Snow indicator
TS        Indicator of whether there was a storm
FG        Indicator of whether there was fog

Table 4.1: Independent features in dataset

Figure 4.1: The output of AQI Dataset

Fig. 4.2 shows a time series visualization of various pollutant concentrations over a specific period, which is likely used in the context of Air Quality Index (AQI) prediction. Each line in the graph corresponds to a different parameter or pollutant, such as PM2.5, PM10, CO, NO2, or O3, with their respective concentrations plotted against time.


Figure 4.2: The output of Extract Combine Data

1. Red Line: The consistently high value suggests it might represent a parameter with minimal fluctuation, such as a normalized index or a pollutant with steady levels.

2. Yellow Line: The spiking values indicate a pollutant with significant temporal variations, possibly PM2.5 or PM10, which are known to fluctuate based on environmental activities like traffic or industrial emissions.

3. Purple and Other Lines: These might represent secondary pollutants or meteorological factors like temperature or humidity, which influence AQI indirectly.

The visualization is crucial for understanding the variability and trends of pollutants

over time, highlighting peaks that can correlate with poor air quality episodes. Such

plots are used during data analysis to identify relationships between pollutants and AQI,

facilitating model training and feature selection for prediction.

4.1.2 Algorithms Output

Fig. 4.3 shows a PairGrid, a Seaborn tool that creates a grid of plots to compare every combination of variables in a dataset. Here it shows scatter plots between PM2.5 and each of the other independent variables, and it can show histograms or other charts on the diagonal.


Figure 4.3: Pairgrid

Fig. 4.4 shows a correlation heatmap created using the Seaborn library in Python. A heatmap is a graphical representation of data in which the individual values of a matrix are represented as colors. In this case, the heatmap visualizes the correlation between different features in the dataset.

The code begins by importing the Seaborn library, which is widely used for statistical data visualization. It then computes the correlation matrix of a dataset using the `.corr()` function, which measures the relationship between different numerical features. The correlation values range from -1 to 1, where a value close to 1 indicates a strong positive correlation, a value close to -1 represents a strong negative correlation, and values around 0 suggest little to no correlation between features.
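To show what each cell of the correlation matrix contains, the sketch below computes the Pearson correlation coefficient, which is the default method of the pandas `.corr()` function, for a pair of features. The sample values are hypothetical and chosen to be perfectly inversely related.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    cov = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    x_spread = math.sqrt(sum((a - x_mean) ** 2 for a in x))
    y_spread = math.sqrt(sum((b - y_mean) ** 2 for b in y))
    return cov / (x_spread * y_spread)


temperature = [10.0, 15.0, 20.0, 25.0]   # hypothetical readings
pm25 = [200.0, 150.0, 100.0, 50.0]       # falls as temperature rises

r_negative = pearson(temperature, pm25)  # strong negative correlation
r_self = pearson(pm25, pm25)             # a feature against itself
```

A feature compared with itself always gives a coefficient of 1, which is why the diagonal of a correlation heatmap shows the maximum value.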

To enhance readability, the heatmap is plotted with a figure size of (20,20). The colors in the heatmap are determined by the `"RdYlGn"` colormap, where red represents negative correlations, yellow indicates weak or no correlation, and green signifies strong positive correlations. Additionally, the `annot=True` parameter ensures that the actual correlation values are displayed within the heatmap cells.

By analyzing the heatmap, one can identify which features are highly correlated with

each other, either positively or negatively. This is useful in various data analysis tasks,


such as feature selection in machine learning, where highly correlated features might

be redundant and can be removed to avoid multicollinearity. Similarly, understanding

negative correlations can provide insights into inverse relationships between variables.

Figure 4.4: HeatMap

1. The Output of Linear Regression: The data is preprocessed and used to train a linear regression model to predict the AQI. The figure shows how the linear regression model is configured.

Figure 4.5: Scatter plot graph for linear regression model

Figure 4.5 is a scatter plot in which the X-axis and Y-axis show the observed AQI value and the predicted AQI value, respectively.

To analyze the performance of a machine learning model, we need metrics such as those shown in Fig. 4.6. These metrics are statistical criteria that can be used to measure and monitor the performance of a model. As our work deals with prediction, we have considered MAE and RMSE as the performance metrics.

Figure 4.6: Comparison of performance metrics for linear regression model

2. The Output of Lasso Regression: The data is preprocessed and used to train a Lasso regression model to predict the AQI.

Figure 4.7 is a scatter plot in which the X-axis and Y-axis show the observed AQI value and the predicted AQI value, respectively.

Figure 4.7: Scatter plot graph for Lasso regression model

Fig. 4.8 shows the performance metrics of the Lasso regression model.

Figure 4.8: Comparison of performance metrics for Lasso regression model

3. Output of KNN regressor: The data is preprocessed and then used to train a
KNN regressor model to predict the AQI.

Figure 4.9 is a scatter plot in which the X-axis shows the observed AQI value and
the Y-axis shows the predicted AQI value.

Figure 4.9: Scatter plot graph for KNN regressor model

Figure 4.10 shows the performance metrics of the KNN regressor model.

Figure 4.10: Comparison of performance metrics for KNN regressor model
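The idea behind a KNN regressor can be sketched briefly: the prediction for a new sample is the mean target value of its k nearest training samples. The example below uses Euclidean distance and a tiny hand-made dataset; the function name, the choice of k = 3, and the data are assumptions for illustration and do not reflect the project's actual configuration.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k=3):
    """Predict each query row as the mean target of its k nearest training rows."""
    preds = []
    for q in X_query:
        dist = np.sqrt(((X_train - q) ** 2).sum(axis=1))  # Euclidean distance
        nearest = np.argsort(dist)[:k]                    # indices of the k closest rows
        preds.append(y_train[nearest].mean())
    return np.array(preds)


# Toy data (assumption): two well-separated clusters of feature values.
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_train = np.array([50.0, 60.0, 70.0, 200.0, 210.0, 220.0])

preds = knn_predict(X_train, y_train, np.array([[2.0], [11.0]]), k=3)
print(preds)
```

Because the prediction is a local average, KNN can follow non-linear relationships that a single global linear fit cannot capture.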


Table 4.2 compares the performance metrics of the three algorithms. KNN regression
gives the lowest error on all three metrics.

Algorithm            MAE        MSE          RMSE
Linear Regression    44.8362    3687.5430    60.7251
Lasso Regression     44.508     3627.8109    60.2313
KNN Regression       25.2455    1681.8142    41.0099

Table 4.2: Comparison of performance metrics for all models
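The figures in Table 4.2 can be cross-checked with a few lines of arithmetic. The sketch below copies the reported values, confirms that each RMSE equals the square root of the corresponding MSE (to the reported precision), and computes how much lower the KNN errors are relative to linear regression. The dictionary layout and the helper name `reduction` are assumptions made for this illustration.

```python
import math

# Values copied from Table 4.2.
table = {
    "Linear Regression": {"MAE": 44.8362, "MSE": 3687.5430, "RMSE": 60.7251},
    "Lasso Regression": {"MAE": 44.508, "MSE": 3627.8109, "RMSE": 60.2313},
    "KNN Regression": {"MAE": 25.2455, "MSE": 1681.8142, "RMSE": 41.0099},
}

# RMSE should be the square root of MSE, up to rounding in the table.
for name, m in table.items():
    assert abs(math.sqrt(m["MSE"]) - m["RMSE"]) < 1e-3, name


def reduction(metric, baseline="Linear Regression", model="KNN Regression"):
    """Fractional reduction of `metric` for `model` relative to `baseline`."""
    base = table[baseline][metric]
    return (base - table[model][metric]) / base


mae_reduction = reduction("MAE")
rmse_reduction = reduction("RMSE")
print(f"MAE reduction: {mae_reduction:.1%}, RMSE reduction: {rmse_reduction:.1%}")
```

On these numbers, KNN regression lowers the MAE by roughly 44% and the RMSE by roughly 32% compared with linear regression, while Lasso is only marginally better than plain linear regression.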



Chapter 5

Conclusion & Future Scope


5.1 Conclusion

The Air Quality Index (AQI) Prediction project effectively demonstrates the application
of machine learning techniques to address one of the most pressing environmental
issues: air pollution. By leveraging historical data, real-time monitoring, and advanced
predictive models, the project facilitates accurate AQI forecasting. This enables proactive
measures to mitigate health risks, enhance public safety, and support environmental
efforts.

The use of diverse algorithms, such as Linear Regression, Lasso Regression, and
K-Nearest Neighbors, underscores the potential of computational methods in providing
reliable pollution predictions. By utilizing data from sensors and meteorological sources,
the system ensures timely information, empowering individuals and authorities to respond
effectively to pollution threats. The project supports public health measures, urban
planning, disaster management, and environmental policy-making, highlighting its
multidisciplinary impact.

In terms of scalability, the project sets a foundation for integrating additional data
sources, refining predictive accuracy, and expanding to regional or global scales. Looking
forward, the system's adaptability to new data types, advanced machine learning methods
like deep learning, and collaboration with wearable technology can significantly enhance
its utility. This work lays a robust groundwork for smarter urban environments, informed
public decision-making, and sustainable ecosystem management.

5.2 Future Scope

• Enhanced Prediction Models: Incorporating advanced machine learning models
such as Recurrent Neural Networks (RNNs) and Transformers for improved time-series
predictions. Leveraging additional data sources, like satellite imagery and
real-time traffic data, to enhance prediction accuracy.


• Global and Regional Adaptability: Scalability to various geographic locations with
customization for specific environmental and climatic conditions. Offering hyper-local
AQI predictions by integrating dense IoT sensor networks.

• Real-Time Applications: Developing wearable health-monitoring devices that provide
real-time AQI alerts. Launching mobile applications with personalized health
recommendations based on AQI data.

• Policy and Decision Support: Assisting governments with dynamic policy implementation
during high-pollution periods, such as vehicle restrictions. Providing
insights for long-term urban planning to design sustainable, eco-friendly cities.

• Integration with Renewable Energy Initiatives: Identifying high-pollution zones for
optimized placement of renewable energy projects like solar and wind farms. Supporting
global carbon credit systems by monitoring and reducing emissions.

• Public Awareness and Education: Engaging the community through gamified learning
platforms about air pollution's effects. Encouraging collective action by empowering
citizens to crowdsource pollution data.

• Cross-Disciplinary Research: Collaborating with health experts to assess the long-term
impacts of air pollution. Studying environmental conservation impacts using
AQI trends.






Appendix
5.3 Programme of HTML

import os
import sys
import time

import requests


def retrieve_html():
    """Download the monthly climate pages for 2013-2018 into Data/Html_Data/<year>/."""
    for year in range(2013, 2019):
        for month in range(1, 13):
            # The site expects a zero-padded month in the URL, e.g. 01-2013.
            if month < 10:
                url = 'http://en.tutiempo.net/climate/0{}-{}/ws-421820.html'.format(month, year)
            else:
                url = 'http://en.tutiempo.net/climate/{}-{}/ws-421820.html'.format(month, year)
            texts = requests.get(url)
            text_utf = texts.text.encode('utf-8')
            # Create the per-year output folder on first use.
            if not os.path.exists("Data/Html_Data/{}".format(year)):
                os.makedirs("Data/Html_Data/{}".format(year))
            with open("Data/Html_Data/{}/{}.html".format(year, month), "wb") as output:
                output.write(text_utf)
        sys.stdout.flush()


if __name__ == "__main__":
    start_time = time.time()
    retrieve_html()
    stop_time = time.time()
    print("Time taken {}".format(stop_time - start_time))
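The two URL branches in the program above differ only in how the month is zero-padded. As a small optional simplification, a format specifier can handle both cases in one expression. The helper name `build_url` and the constant `BASE_URL` are assumptions introduced for this sketch; the URL pattern itself is the one used in the program.

```python
# URL pattern from the retrieval program; {:02d} zero-pads the month to two digits.
BASE_URL = 'http://en.tutiempo.net/climate/{:02d}-{}/ws-421820.html'


def build_url(month, year):
    """Return the monthly climate page URL for the given month and year."""
    return BASE_URL.format(month, year)


urls = [build_url(month, 2013) for month in (1, 12)]
print(urls)
```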


5.4 Programme of Extract Combine Data

import csv
import os
import sys

import pandas as pd
from bs4 import BeautifulSoup

# Imported by name so that getattr() below can look them up on this module.
from Plot_AQI import avg_data_2013, avg_data_2014, avg_data_2015, avg_data_2016


def met_data(month, year):
    """Parse one saved monthly HTML page and return its daily rows of weather values."""
    with open('Data/Html_Data/{}/{}.html'.format(year, month), 'rb') as file_html:
        plain_text = file_html.read()
    tempD = []
    finalD = []
    soup = BeautifulSoup(plain_text, "lxml")
    for table in soup.findAll('table', {'class': 'medias mensuales numspan'}):
        for tbody in table:
            for tr in tbody:
                a = tr.get_text()
                tempD.append(a)
    # The cells were collected as one flat list; regroup them 15 per table row.
    rows = len(tempD) / 15
    for times in range(round(rows)):
        newtempD = []
        for i in range(15):
            newtempD.append(tempD[0])
            tempD.pop(0)
        finalD.append(newtempD)
    # Drop the first and last rows, which are not daily records.
    length = len(finalD)
    finalD.pop(length - 1)
    finalD.pop(0)
    # Remove the columns that are not used as features.
    for a in range(len(finalD)):
        finalD[a].pop(6)
        finalD[a].pop(13)
        finalD[a].pop(12)
        finalD[a].pop(11)
        finalD[a].pop(10)
        finalD[a].pop(9)
        finalD[a].pop(0)
    return finalD


def data_combine(year, cs):
    """Read the cleaned CSV for one year and return its rows as a list of lists."""
    for a in pd.read_csv('Data/Real-Data/real_' + str(year) + '.csv', chunksize=cs):
        df = pd.DataFrame(data=a)
        mylist = df.values.tolist()
    return mylist


if __name__ == "__main__":
    if not os.path.exists("Data/Real-Data"):
        os.makedirs("Data/Real-Data")
    for year in range(2013, 2017):
        final_data = []
        # Start the yearly file with the header row.
        with open('Data/Real-Data/real_' + str(year) + '.csv', 'w') as csvfile:
            wr = csv.writer(csvfile, dialect='excel')
            wr.writerow(['T', 'TM', 'Tm', 'SLP', 'H', 'VV', 'V', 'VM', 'PM 2.5'])
        for month in range(1, 13):
            temp = met_data(month, year)
            final_data = final_data + temp
        # Look up avg_data_<year>() by name to get the daily PM 2.5 averages.
        pm = getattr(sys.modules[__name__], 'avg_data_{}'.format(year))()
        if len(pm) == 364:
            pm.insert(364, '-')
        for i in range(len(final_data) - 1):
            # final[i].insert(0, i + 1)
            final_data[i].insert(8, pm[i])
        with open('Data/Real-Data/real_' + str(year) + '.csv', 'a') as csvfile:
            wr = csv.writer(csvfile, dialect='excel')
            # Write only the rows that have no empty or placeholder cell.
            for row in final_data:
                flag = 0
                for elem in row:
                    if elem == "" or elem == "-":
                        flag = 1
                if flag != 1:
                    wr.writerow(row)
    data_2013 = data_combine(2013, 600)
    data_2014 = data_combine(2014, 600)
    data_2015 = data_combine(2015, 600)
    data_2016 = data_combine(2016, 600)
    total = data_2013 + data_2014 + data_2015 + data_2016
    with open('Data/Real-Data/Real_Combine.csv', 'w') as csvfile:
        wr = csv.writer(csvfile, dialect='excel')
        wr.writerow(['T', 'TM', 'Tm', 'SLP', 'H', 'VV', 'V', 'VM', 'PM 2.5'])
        wr.writerows(total)
    df = pd.read_csv('Data/Real-Data/Real_Combine.csv')

5.5 Programme of Plot AQI

import pandas as pd
import matplotlib.pyplot as plt

# Placeholder strings that mark missing or invalid readings in the raw files.
INVALID = ('NoData', 'PwrFail', '---', 'InVld')


def avg_data(year):
    """Return the daily average PM2.5 for one year (24 hourly rows per day)."""
    average = []
    for rows in pd.read_csv('Data/AQI/aqi{}.csv'.format(year), chunksize=24):
        add_var = 0
        for i in rows['PM2.5']:
            if isinstance(i, (int, float)):
                add_var = add_var + i
            elif isinstance(i, str) and i not in INVALID:
                add_var = add_var + float(i)
        # The sum is divided by 24 even when some hourly readings are invalid.
        average.append(add_var / 24)
    return average


# Per-year wrappers, kept because other modules look these functions up by name.
# The per-year logic was identical, so it is shared through avg_data(); the
# 2015 and 2016 wrappers follow the same file-naming pattern as the other years.
def avg_data_2013():
    return avg_data(2013)


def avg_data_2014():
    return avg_data(2014)


def avg_data_2015():
    return avg_data(2015)


def avg_data_2016():
    return avg_data(2016)


def avg_data_2017():
    return avg_data(2017)


def avg_data_2018():
    return avg_data(2018)


if __name__ == "__main__":
    lst2013 = avg_data_2013()
    lst2014 = avg_data_2014()
    lst2015 = avg_data_2015()
    lst2016 = avg_data_2016()
    lst2017 = avg_data_2017()
    lst2018 = avg_data_2018()
    # Use each list's own length so the x and y sizes always match.
    plt.plot(range(len(lst2013)), lst2013, label="2013 data")
    plt.plot(range(len(lst2014)), lst2014, label="2014 data")
    plt.plot(range(len(lst2015)), lst2015, label="2015 data")
    plt.plot(range(len(lst2016)), lst2016, label="2016 data")
    plt.xlabel('Day')
    plt.ylabel('PM 2.5')
    plt.legend(loc='upper right')
    plt.show()
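One point worth noting about the averaging in the program above: the daily sum is always divided by 24, so a day with invalid readings ('NoData', 'PwrFail', '---', 'InVld') gets an average that is biased low. A possible alternative, shown here only as a sketch and not as the method used in the project, is to divide by the number of valid readings. The helper name `daily_mean` and the sample values are assumptions.

```python
# Placeholder strings taken from the program above.
INVALID_MARKERS = {'NoData', 'PwrFail', '---', 'InVld'}


def daily_mean(readings):
    """Mean of the valid readings in one day's block; None if none are valid."""
    values = []
    for r in readings:
        if isinstance(r, str):
            if r in INVALID_MARKERS:
                continue
            r = float(r)
        if r != r:  # skip NaN (NaN is the only value not equal to itself)
            continue
        values.append(float(r))
    return sum(values) / len(values) if values else None


# Illustrative block with one invalid reading (not project data).
sample = ['10', 'NoData', 20.0]
mean_valid = daily_mean(sample)
print(mean_valid)
```

With this approach, days on which every reading is invalid yield `None` and can be dropped explicitly rather than entering the dataset as an average of zero.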
