22J TMA03 Solution Document
2.
Since 2011, there has been an exponential increase in research articles about the use of
machine learning in environmental studies. Over 50% of these articles come from the US
and China. For all countries, articles on air-pollution are most common (>40%), followed by
water pollution, and then soil pollution. However, for Iran, research interest in water
pollution exceeds that in air-pollution. There are very few articles on solid-waste pollution,
but China leads this field, having published the most articles on it. Meanwhile, articles from
the US and UK are more evenly distributed across air, water, and soil pollution.
Applying machine learning to the assessment of air-pollution has led to good PM2.5
predictions from non-linear models. This is useful in areas with no PM2.5 monitoring
facilities. However, although non-linear models predict air-pollution better than linear
models, they only provide an increase of 0.1–0.19 in R², the goodness of fit, which was just
0.6 to begin with. When estimating health outcomes from pollutant exposure, linear
regression models offer great interpretability, but ensemble models ultimately provide
better results. However, ensembles are not interpretable, since they form a complex
combination of the results of individual models.
Machine learning techniques can be divided into two categories: “supervised learning” and
“unsupervised learning”. In supervised learning, models are trained on labelled data,
meaning that the model is provided with many pairs of inputs and their correct outputs. The
model uses this data to learn to predict the correct output when a new, unseen input is
given. In unsupervised learning, by contrast, models are not provided with the correct
output; the model itself has to find relationships in the data.
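As a minimal illustration of the distinction (the toy readings, labels, and values below are hypothetical; only scikit-learn is assumed):

from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Toy inputs (e.g. two pollutant readings per observation) and their labels (air-quality level).
X = [[10, 0.3], [12, 0.4], [80, 2.1], [95, 2.5]]
y = [1, 1, 3, 3]

# Supervised: the model is given input/output pairs and learns to predict the label.
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[85, 2.2]]))  # predicts a level for a new, unseen input

# Unsupervised: no labels are given; the model has to find structure (here, two clusters) itself.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)  # cluster assignments discovered from the inputs alone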
This report focusses on different machine learning techniques for classifying air-pollution
levels and compares their performance. Wider implications of deploying these models are
also considered, and further recommendations are provided. The research question is
therefore: how effective are decision-tree and ensemble models at predicting air-quality
levels? Decision-trees will be considered first, and then ensemble models, which combine
the outputs of multiple smaller decision-trees and return the overall consensus among
them, will be explored.
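As a rough sketch of this consensus idea, and not the implementation used later in the report, the following trains a few small decision-trees on random samples of hypothetical data and takes the majority vote of their predictions:

import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two pollutant readings per observation and an air-quality level.
X = [[10, 0.3], [12, 0.4], [80, 2.1], [95, 2.5], [50, 1.0], [55, 1.2]]
y = [1, 1, 3, 3, 2, 2]

# Train several small trees, each on a random (bootstrap) sample of the data.
rng = random.Random(0)
trees = []
for _ in range(5):
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    trees.append(DecisionTreeClassifier(max_depth=2).fit([X[i] for i in idx], [y[i] for i in idx]))

# The ensemble's answer is the overall consensus (majority vote) of the individual trees.
votes = [tree.predict([[60, 1.1]])[0] for tree in trees]
print(Counter(votes).most_common(1)[0][0])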
Methods
The data was created by Bagul (2020) and obtained from Kaggle. It includes environmental
and air-quality observations in Frankfurt, Germany, between 31/12/2018 and 20/02/2022.
The data was split into training, validation, and testing sets, following a 40:20:20 split.
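A minimal sketch of producing the three sets with scikit-learn, assuming pandas is used to load the downloaded CSV (the file name, column names, and exact proportions below are illustrative assumptions, not taken from the report):

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names; the real CSV downloaded from Kaggle may differ.
df = pd.read_csv("air_quality.csv")
X = df.drop(columns=["AQI_level"])  # environmental observations used as features (assumed name)
y = df["AQI_level"]                 # the air-quality level to be predicted (assumed name)

# Hold out a test set, then carve a validation set out of the remainder.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)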
Ensemble models, namely Random Forest and Bagging, were then used to see whether
better results could be obtained. Ensemble models aim to reduce the problem of over-fitting
(a model fine-tuning itself to the training data, leading to poor performance on unseen data)
by combining smaller individual models. Additionally, to experiment strategically with other
hyper-parameters, an automated search over the different combinations of hyper-
parameters was employed: all possible combinations were iterated over, and the
combination with the highest validation accuracy was recorded.
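A minimal sketch of setting up the two ensemble models with scikit-learn, assuming the training and validation sets produced above (the Random Forest values are the tuned values reported in the Results section; the Bagging values are only examples from the searched ranges):

from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

# Random Forest: an ensemble of decision-trees, each trained on a bootstrap sample of the
# training data and considering a random subset of the features at each split.
rf = RandomForestClassifier(n_estimators=14, max_depth=25, min_samples_split=2)
rf.fit(X_train, y_train)
print("Random Forest validation accuracy:", rf.score(X_val, y_val))

# Bagging: the base estimator defaults to a decision-tree; each tree sees at most
# max_samples training points and max_features features (example values only).
bag = BaggingClassifier(n_estimators=14, max_samples=1000, max_features=5)
bag.fit(X_train, y_train)
print("Bagging validation accuracy:", bag.score(X_val, y_val))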
Bagging
The following hyper-parameters were searched for the Bagging model:

Hyper-parameter: n_estimators
Description: The number of individual decision-trees in the ensemble.
Reason: Adjusting this affects the classifier, because a greater number of decision-trees could lead to more accurate classifications, but too large a number could also have a negative impact.
Range of values: 10 to 14, in increments of 2.

Hyper-parameter: max_samples
Description: The maximum number of training data points sampled for each decision-tree.
Reason: Having too many samples to optimise over could lead to over-fitting to the training set, in turn leading to reduced accuracy.
Range of values: 1000 to 5000, in increments of 1000.

Hyper-parameter: max_features
Description: The maximum number of features sampled for each decision-tree.
Reason: Adjusting this affects the quality of each tree, as sampling fewer features leads to more unique decision-trees.
Range of values: 1 to 9, in increments of 2.
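A minimal sketch of the automated search for the Bagging model, assuming the training and validation sets from earlier and that the dataset has enough rows and feature columns for the ranges above (the loop illustrates the approach rather than reproducing the exact code used):

from itertools import product
from sklearn.ensemble import BaggingClassifier

# Hyper-parameter grid taken from the table above.
n_estimators_values = range(10, 15, 2)        # 10, 12, 14
max_samples_values = range(1000, 5001, 1000)  # 1000, 2000, ..., 5000
max_features_values = range(1, 10, 2)         # 1, 3, 5, 7, 9

best_score, best_params = 0.0, None
# Iterate over every combination and keep the one with the highest validation accuracy.
for n_est, max_s, max_f in product(n_estimators_values, max_samples_values, max_features_values):
    model = BaggingClassifier(n_estimators=n_est, max_samples=max_s, max_features=max_f)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_params = score, (n_est, max_s, max_f)

print("Best validation accuracy:", best_score, "with (n_estimators, max_samples, max_features) =", best_params)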
Results
The results from the four models explored are summarised below:

Decision-Tree (max_depth = 3): training accuracy 0.73, validation accuracy 0.73.
Decision-Tree (max_depth = 10): training accuracy 0.95, validation accuracy 0.93.
Random Forest (n_estimators = 14, max_depth = 25, min_samples_split = 2): training accuracy 0.947, validation accuracy 0.94.
Bagging (n_estimators = 14, max_samples = 5000, max_features = 9): training accuracy 0.819, validation accuracy 0.816.
Figure 1 - The confusion matrix of the Random Forest model (left) and the Decision-Tree model (with
maximum depth of 10) (right) when predicting inputs from the test data.
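A minimal sketch of how these confusion matrices can be computed with scikit-learn, assuming the test split and the fitted random forest (rf) from the earlier sketches:

from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Refit the depth-10 decision-tree from the results table for comparison with rf.
dt = DecisionTreeClassifier(max_depth=10).fit(X_train, y_train)

for name, model in [("Random Forest", rf), ("Decision-Tree (max_depth = 10)", dt)]:
    # Rows correspond to the true air-quality levels, columns to the predicted levels;
    # sklearn.metrics.ConfusionMatrixDisplay can render these as plots like Figure 1.
    print(name)
    print(confusion_matrix(y_test, model.predict(X_test)))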
Evaluation of results
From the results, the random forest model performed the best out of all four models, but
only slightly better than the decision-tree classifier with a maximum depth of 10.
Further analysis of the confusion matrices shows that the random forest model has slightly
more true positives than the decision-tree for all levels except level 1. However, in both
models the high validation accuracy mostly stems from predicting air-quality level 1 very
accurately; predictions for the other levels are much less accurate. This could be because
there are almost six times as many samples for level 1 as for level 2, with even larger
disparities for the other levels. Having more data for the other levels would help the models
learn the patterns arising in those levels and therefore perform better than they currently
do. This could also explain why Bagging is not effective, as sampling from an already
unbalanced dataset exacerbates the problem.
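A quick check of this imbalance, plus one possible mitigation using class weights (an option not used in the report), under the same assumptions about the training and validation sets:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Count how many training samples fall into each air-quality level.
print(pd.Series(y_train).value_counts())

# Possible mitigation: weight classes inversely to their frequency so that the
# rarer levels contribute more to the fit.
rf_balanced = RandomForestClassifier(n_estimators=14, max_depth=25, class_weight="balanced")
rf_balanced.fit(X_train, y_train)
print("Validation accuracy with balanced class weights:", rf_balanced.score(X_val, y_val))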
It is interesting to see that better results are obtained in the random forest model when the
maximum depth of each tree is 25, rather than 30. This suggests that random forest models
work optimally with smaller decision-trees, which makes sense, as their inherent purpose is
to reduce over-fitting by considering multiple smaller trees.
It is difficult to compare these results to those obtained by Choi and Kim (2021), since the
problem in question here is a classification problem, i.e. predicting discrete air-quality
levels, whereas Choi and Kim (2021) explore a regression problem. Additionally, Choi and
Kim (2021) examine air-pollution data from the Far East, which differs greatly from
observations in Frankfurt owing to differences in climate, regulations, and demographics. As
such, a direct comparison between the results holds little meaning.
Further, the predictions made by the model may reveal disparities in air-quality across
different communities. Certain neighbourhoods could have naturally higher pollution values
because they are closer to industrial areas. It is important that government decisions based
on these predictions are taken in a way that does not harm such communities.
Conclusion
As mentioned by Liu et al. (2022), there are many complex factors surrounding air-pollution,
which depend not only on the location being predicted for, but also on surrounding
locations (Choi and Kim, 2021). For machine learning to be used reliably for air-quality
prediction, better-quality data and more accurate predictions are required. Training data
should include more information about other environmental factors and about surrounding
areas, and should be captured under consistent conditions. Further, models should achieve
more than three-nines (99.9%) accuracy before being used for serious applications such as
informing governmental decisions, as false positives or negatives can have grave
consequences.
References
Bagul, A. (2020) Air-quality dataset. Available at:
https://www.kaggle.com/datasets/avibagul80/air-quality-dataset.
Choi, S. and Kim, B. (2021) ‘Applying PCA to deep learning forecasting models for predicting
PM2.5’, Sustainability, 13(7), p. 3726. Available at: http://doi.org/10.3390/su13073726.
Liu, X., Lu, D., Zhang, A., Liu, Q. and Jiang, G. (2022) ‘Data-driven machine learning in
environmental pollution: gains and problems’, Environmental Science and Technology,
56(4), pp. 2124–2133. Available at: https://doi.org/10.1021/acs.est.1c06157.