
22J TMA03 Solution Document

The document discusses research on using machine learning techniques to predict air quality levels. It finds that nonlinear models like random forests and neural networks provide better predictions than linear regression models. An ensemble random forest model with hyperparameters of n_estimators=14, max_depth=25, and min_samples_split=2 achieved the highest validation accuracy of 0.94 for predicting air quality levels based on environmental data from Frankfurt, Germany. However, both decision tree and random forest models predicted level 1 air quality most accurately and struggled more with other levels.


TMA03 – Parth Shah – E395923X

2.

Since 2011, there has been an exponential increase in research articles about the use of
machine learning in environmental studies. Over 50% of these articles come from the US
and China. For all countries, articles on air-pollution are most common (>40%), followed by
water pollution, and then soil pollution. However, for Iran, research interest on water
pollution exceeds air-pollution. There are very few articles on solid waste pollution, but
China leads this field, having published the most articles on it. Meanwhile, the US and UK
have more equally distributed articles on air, water, and soil pollution.

Applying machine learning to the assessment of air-pollution has led to good PM2.5
predictions by non-linear models. This is useful in areas with no PM2.5 monitoring
facilities. However, although non-linear models predict air-pollution better than linear
models, they only increase R², the goodness of fit, by 0.1–0.19, from a baseline of just
0.6. In the application of estimating health outcomes from pollutant exposure, linear
regression models provide great interpretability, but ensemble models ultimately provide
better results. However, ensemble models are not interpretable, since they form a complex
combination of the results of individual models.

Word Count: 200 words


3.
Machine Learning Techniques for Predicting Air-Quality

Background and context


Machine learning is a powerful tool in environmental science. By analysing large datasets,
machine learning models can pick up on hidden patterns. This insight can be used in many
applications, such as the problem of classifying air-quality levels. This is an important
problem as it can help governments make key environmental decisions like imposing traffic
limitations on days where air-pollution is predicted to be very high, benefitting citizens as it
can reduce exposure to harmful pollutants.

Machine learning techniques can be divided into two categories: supervised learning and
unsupervised learning. In supervised learning, models are trained on labelled data,
meaning the model is provided with many pairs of inputs and their correct outputs. The
model uses this knowledge to learn to predict the correct answer when a new, unseen
input is given. In unsupervised learning, by contrast, models are not provided with the
correct output; the model itself has to find relationships in the data.
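The distinction above can be illustrated with a minimal scikit-learn sketch; the toy data and class boundaries below are invented purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Supervised: the model sees inputs paired with correct outputs (labels).
X = [[0, 0], [0, 1], [10, 10], [10, 11]]
y = [0, 0, 1, 1]  # labels we provide
clf = DecisionTreeClassifier().fit(X, y)
pred = clf.predict([[9, 9]])  # predict the label of a new, unseen input

# Unsupervised: no labels are given; the model must find structure itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
groups = km.labels_  # cluster assignments discovered from the data alone
```

The supervised model can be scored against known answers; the unsupervised clustering can only be judged by the structure it finds.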

This report focusses on different machine learning techniques for classifying air-pollution
levels and compares their performance against one another. Wider implications of
deploying these models are also considered, and further recommendations are provided.

Aim and Objectives


From the work produced by Liu et al. (2022), it was found that linear, regression-based
machine learning models were not effective for predicting air-pollution levels, as
air-pollution depends on many complex environmental factors which exhibit non-linear
relationships. It was shown that non-linear models such as Random Forests, and deep
neural networks such as RNNs and LSTMs, provided better results. The work of Choi and
Kim (2021) concluded that using PCA improved performance, but the underlying models,
which used RNNs, LSTMs, and BiLSTMs, did not provide satisfactory results.

Therefore, the research question is to find the effectiveness of decision-tree and ensemble
models in air-quality predictions. Decision-trees will be considered as the starting point, and
then ensemble models, which are models that consider the different outputs of multiple
smaller decision-trees and return the overall consensus among them, will be explored.
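The consensus idea can be sketched by hand with a majority vote over several small trees trained on bootstrap resamples (a simplified illustration on invented data; scikit-learn's `RandomForestClassifier` does this, with additional feature sub-sampling, internally):

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 3, 200)  # stand-in air-quality levels

# Train several shallow trees, each on a bootstrap resample of the data.
trees = []
for seed in range(5):
    idx = rng.integers(0, len(X), len(X))  # sample rows with replacement
    t = DecisionTreeClassifier(max_depth=3, random_state=seed).fit(X[idx], y[idx])
    trees.append(t)

# Ensemble prediction = the majority vote among the individual trees.
def ensemble_predict(x):
    votes = [int(t.predict([x])[0]) for t in trees]
    return Counter(votes).most_common(1)[0][0]

label = ensemble_predict(X[0])
```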

Methods
The data was created by Bagul (2020) and obtained from Kaggle. It includes environmental and
air-quality observations in Frankfurt, Germany, between 31/12/2018 and 20/02/2022. The
data is split into training, validation, and testing sets, following a 40:20:20 split.
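The three-way split can be done in two stages with `train_test_split`. The feature matrix below is a random stand-in for the Kaggle data (loading code omitted), and the exact proportions are an assumption: 20% is carved off for testing, 20% for validation, and the remainder forms the training set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in features and labels in place of the real Kaggle CSV.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 6, size=1000)  # discrete air-quality levels

# First carve off the test set, then split validation from the remainder.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)
# 0.25 of the remaining 80% leaves a 60:20:20 train/validation/test split.
```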

A Decision-Tree classifier can be thought of as a flowchart. Depending on the value of each
column of the dataset, the final output can vary accordingly. For example, if the air pressure
is very high, it is more likely that air-pollution will be very high too. The first experiment was
to use a decision-tree classifier with a maximum depth of 3. This, however, did not provide
satisfactory results on the validation set. Therefore, the maximum depth was then changed
to 10, which provided slightly better results. With the knowledge that a greater maximum
depth provided better results, it was decided to test increasing maximum depth values,
ranging from 10 to 40 and incrementing by 2 at each iteration.
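The depth sweep described above can be sketched as follows. The random arrays stand in for the real training and validation splits, whose loading is not shown here:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy stand-ins for the actual training/validation data.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(400, 5)); y_train = rng.integers(0, 6, 400)
X_val = rng.normal(size=(100, 5));   y_val = rng.integers(0, 6, 100)

best_depth, best_acc = None, -1.0
for depth in range(10, 41, 2):  # 10 to 40, incrementing by 2
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_depth, best_acc = depth, acc  # keep the best depth so far
```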

Ensemble models were then used, namely Random Forest and Bagging, to see whether
better results could be obtained. Ensemble models aim to reduce the problem of over-fitting
(a model tuning itself too closely to the training data, leading to poor performance on
unseen data) by combining smaller individual models. Additionally, to experiment
strategically with other hyper-parameters, an automated search over the different
combinations of hyper-parameters was employed. All possible combinations were iterated
over, and the combination with the highest validation accuracy was recorded.

The chosen hyper-parameters are explained below in more detail:


Random Forest

| Hyper-parameter | Description | Reason | Range of Values |
|---|---|---|---|
| n_estimators | The number of individual decision-trees in the random forest. | When it is large, the Random Forest can make a more informed decision. If too large, however, it could lead to un-confident classifications. | 12 to 18, in increments of 2. |
| max_depth | The maximum depth of each tree. | Increasing/decreasing this has a large effect on accuracy, as it changes the output of each individual tree. | 20 to 30, in increments of 5. |
| min_samples_split | The minimum number of samples required for a split. | Adjusting this affects how split decisions are made, impacting the structure of each decision-tree. | 2 to 22, in increments of 10. |

Bagging

| Hyper-parameter | Description | Reason | Range of Values |
|---|---|---|---|
| n_estimators | The number of individual decision-trees in the ensemble. | A greater number of decision-trees could lead to more accurate classifications, but too large a number could also negatively impact the classifier. | 10 to 14, in increments of 2. |
| max_samples | The maximum number of training data points to sample for each decision-tree. | Having too many samples could lead to over-fitting to the training set, in turn leading to reduced accuracy. | 1000 to 5000, in increments of 1000. |
| max_features | The maximum number of features to sample for each decision-tree. | Adjusting this affects the quality of each tree, as sampling fewer features leads to more unique decision-trees. | 1 to 9, in increments of 2. |
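The exhaustive search over hyper-parameter combinations can be sketched with `itertools.product`, keeping the combination with the highest validation accuracy. The ranges below follow the Random Forest table above; the data arrays are invented stand-ins for the real splits:

```python
from itertools import product
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Stand-in training/validation data.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(300, 5)); y_train = rng.integers(0, 3, 300)
X_val = rng.normal(size=(100, 5));   y_val = rng.integers(0, 3, 100)

# Ranges from the table: n_estimators 12-18 (step 2),
# max_depth 20-30 (step 5), min_samples_split 2-22 (step 10).
grid = product(range(12, 19, 2), range(20, 31, 5), range(2, 23, 10))

best_params, best_acc = None, -1.0
for n_est, depth, min_split in grid:  # every combination is tried
    model = RandomForestClassifier(n_estimators=n_est, max_depth=depth,
                                   min_samples_split=min_split,
                                   random_state=0)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_params, best_acc = (n_est, depth, min_split), acc
```

The same loop structure applies to the Bagging search, swapping in `BaggingClassifier` and its hyper-parameter ranges.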
Results
The results from the four models explored are summarised in the table below:

| Method | Hyper-Parameters | Training Accuracy | Validation Accuracy |
|---|---|---|---|
| Decision-Tree | max_depth = 3 | 0.73 | 0.73 |
| Decision-Tree | max_depth = 10 | 0.95 | 0.93 |
| Random Forest | n_estimators = 14, max_depth = 25, min_samples_split = 2 | 0.947 | 0.94 |
| Bagging | n_estimators = 14, max_samples = 5000, max_features = 9 | 0.819 | 0.816 |

Figure 1 - The confusion matrix of the Random Forest model (left) and the Decision-Tree model (with
maximum depth of 10) (right) when predicting inputs from the test data.

Evaluation of results
From the results, the random forest model performed the best out of all four models, but
only slightly better than the decision-tree classifier with a maximum depth of 10.

Upon further analysis of the confusion matrices, the random forest model has slightly more
true-positives than the decision-tree for all levels except level 1. However, in both models,
the high validation accuracy mostly stems from predicting air-quality level 1 very accurately;
predictions for the other levels are much less accurate. This could be because there are
almost six times as many samples for level 1 as for level 2, with even larger disparities for
the other levels. More data for the other levels would help the models learn the patterns
arising in those levels and therefore perform better. This could also explain why Bagging is
not effective: taking a sample of the already unbalanced dataset exacerbates the problem.
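The effect described above, where overall accuracy is inflated by the majority class, can be checked with per-class recall. The labels below are invented to mimic the imbalance, not taken from the report's actual confusion matrix:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Invented labels mimicking the imbalance: level 1 ~6x as common as level 2.
y_true = np.array([1] * 600 + [2] * 100 + [3] * 50)
# A degenerate classifier that always predicts the majority class (level 1):
y_pred = np.ones_like(y_true)

overall = accuracy_score(y_true, y_pred)  # looks respectable despite no skill
per_class = recall_score(y_true, y_pred, average=None, labels=[1, 2, 3])
# per_class: perfect recall on level 1, zero on every other level.
```

Reporting per-class recall alongside overall accuracy makes this failure mode visible.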

It is interesting that better results are obtained in the random forest model when the
maximum depth of each tree is 25 rather than 30. This suggests that random forest models
work optimally with smaller decision-trees, which makes sense, as their inherent purpose is
to reduce over-fitting by combining multiple smaller trees.

It is difficult to compare these results to those obtained by Choi and Kim (2021), since the
problem in question is a classification problem, i.e. predicting discrete air-quality levels,
whereas Choi and Kim (2021) explore a regression problem. Additionally, Choi and Kim
(2021) study air-pollution data from the Far-East region, which differs greatly from
observations in Frankfurt, owing to differing climatic activity and differences in regulations
and demographics. As such, a direct comparison between the results holds little meaning.

Discussion of wider implications


COVID restrictions were in place during a significant part of the data-capture period.
During this time, air-pollution levels decreased due to reduced use of vehicles, public
transport, and factories. Since the dataset only records the day of the week, and not the
actual date, a sunny Wednesday pre-COVID may have had high pollution, while a similarly
sunny Wednesday during COVID may have had much lower pollution. This inherent bias
within the dataset can lead to poor accuracy in the classification model.

Further, the predictions made by the model may reveal disparities in air-quality across
different communities. Certain neighbourhoods could have naturally high pollution values as
they could be closer to industrial areas. It is important that government decisions based on
these predictions are taken in a way which does not harm such communities.

Conclusion
As mentioned by Liu et al. (2022), there are many complex factors surrounding air-pollution,
which not only depend on the location concerning the prediction, but also on surrounding
locations (Choi and Kim, 2021). For machine learning to be reliably used in the application of
air-quality prediction, better quality data, and more accurate predictions are required.
Training data should include more information about other environmental factors, data
from surrounding areas, and also be captured under consistent conditions. Further, models
should achieve more than three-nines (99.9%) accuracy before being used for serious
applications such as informing governmental decisions, as false positives or negatives can
have grave consequences.

References
Bagul, A. (2020) Air-quality dataset. Available at:
https://www.kaggle.com/datasets/avibagul80/air-quality-dataset.

Choi, S. and Kim, B. (2021) ‘Applying PCA to deep learning forecasting models for predicting
PM2.5’, Sustainability, 13(7), p. 3726. Available at: http://doi.org/10.3390/su13073726.

Liu, X., Lu, D., Zhang, A., Liu, Q. and Jiang, G. (2022) ‘Data-driven machine learning in
environmental pollution: gains and problems’, Environmental Science and Technology,
56(4), pp. 2124–2133. Available at: https://doi.org/10.1021/acs.est.1c06157.

Word Count: 1500 words.
