22J TMA03 Solution Document
2.
Since 2011, there has been an exponential increase in research articles about the use of
machine learning in environmental studies. Over 50% of these articles come from the US
and China. For all countries, articles on air-pollution are most common (>40%), followed by
water pollution, and then soil pollution. However, for Iran, research interest in water
pollution exceeds that in air-pollution. There are very few articles on solid-waste pollution,
but China leads this field, having published the most articles on it. Meanwhile, articles from
the US and UK are more evenly distributed across air, water, and soil pollution.
Applying machine learning to the assessment of air-pollution has led to good PM2.5
predictions from non-linear models. This is useful in areas with no PM2.5 monitoring
facilities. However, although non-linear models predict air-pollution better than linear
models, they only provide an increase of 0.1–0.19 in R², the goodness of fit, which was just
0.6 to begin with. When estimating health outcomes from pollutant exposure, linear
regression models offer great interpretability, but ensemble models ultimately provide
better results. However, ensembles are not interpretable, since they form a complex
combination of the results of individual models.
Machine learning techniques can be divided into two categories: “supervised learning” and
“unsupervised learning”. In supervised learning, models are trained on labelled data,
meaning that the model is provided with many pairs of inputs and their correct outputs. The
model uses this data to learn to predict the correct output when a new, unseen input is
given. In unsupervised learning, by contrast, models are not provided with the correct
output; the model itself has to find relationships in the data.
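As a minimal illustration of the distinction (the toy readings, labels, and values below are hypothetical; only scikit-learn is assumed):

from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Toy inputs (e.g. two pollutant readings per observation) and their labels (air-quality level).
X = [[10, 0.3], [12, 0.4], [80, 2.1], [95, 2.5]]
y = [1, 1, 3, 3]

# Supervised: the model is given input/output pairs and learns to predict the label.
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[85, 2.2]]))  # predicts a level for a new, unseen input

# Unsupervised: no labels are given; the model has to find structure (here, two clusters) itself.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)  # cluster assignments discovered from the inputs alone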
This report focusses on different machine learning techniques for classifying air-pollution
levels and compares their performance. Wider implications of deploying these models are
also considered, and further recommendations are provided. The research question is
therefore: how effective are decision-tree and ensemble models at predicting air-quality
levels? Decision-trees will be considered first, and then ensemble models, which combine
the outputs of multiple smaller decision-trees and return the overall consensus among
them, will be explored.
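As a rough sketch of this consensus idea, and not the implementation used later in the report, the following trains a few small decision-trees on random samples of hypothetical data and takes the majority vote of their predictions:

import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two pollutant readings per observation and an air-quality level.
X = [[10, 0.3], [12, 0.4], [80, 2.1], [95, 2.5], [50, 1.0], [55, 1.2]]
y = [1, 1, 3, 3, 2, 2]

# Train several small trees, each on a random (bootstrap) sample of the data.
rng = random.Random(0)
trees = []
for _ in range(5):
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    trees.append(DecisionTreeClassifier(max_depth=2).fit([X[i] for i in idx], [y[i] for i in idx]))

# The ensemble's answer is the overall consensus (majority vote) of the individual trees.
votes = [tree.predict([[60, 1.1]])[0] for tree in trees]
print(Counter(votes).most_common(1)[0][0])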
Methods
The data was created by Bagul (2020) and obtained from Kaggle. It includes environmental
and air-quality observations in Frankfurt, Germany, between 31/12/2018 and 20/02/2022.
The data was split into training, validation, and testing sets, following a 40:20:20 split.
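A minimal sketch of producing the three sets with scikit-learn, assuming pandas is used to load the downloaded CSV (the file name, column names, and exact proportions below are illustrative assumptions, not taken from the report):

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names; the real CSV downloaded from Kaggle may differ.
df = pd.read_csv("air_quality.csv")
X = df.drop(columns=["AQI_level"])  # environmental observations used as features (assumed name)
y = df["AQI_level"]                 # the air-quality level to be predicted (assumed name)

# Hold out a test set, then carve a validation set out of the remainder.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)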
Ensemble models, namely Random Forest and Bagging, were then used to see whether
better results could be obtained. Ensemble models aim to reduce the problem of over-fitting
(a model fine-tuning itself to the training data, leading to poor performance on unseen data)
by combining smaller individual models. Additionally, to experiment strategically with other
hyper-parameters, an automated search over the different combinations of hyper-
parameters was employed: all possible combinations were iterated over, and the
combination with the highest validation accuracy was recorded.
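A minimal sketch of setting up the two ensemble models with scikit-learn, assuming the training and validation sets produced above (the Random Forest values are the tuned values reported in the Results section; the Bagging values are only examples from the searched ranges):

from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

# Random Forest: an ensemble of decision-trees, each trained on a bootstrap sample of the
# training data and considering a random subset of the features at each split.
rf = RandomForestClassifier(n_estimators=14, max_depth=25, min_samples_split=2)
rf.fit(X_train, y_train)
print("Random Forest validation accuracy:", rf.score(X_val, y_val))

# Bagging: the base estimator defaults to a decision-tree; each tree sees at most
# max_samples training points and max_features features (example values only).
bag = BaggingClassifier(n_estimators=14, max_samples=1000, max_features=5)
bag.fit(X_train, y_train)
print("Bagging validation accuracy:", bag.score(X_val, y_val))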
Bagging
The following hyper-parameters were searched for the Bagging model:

Hyper-parameter: n_estimators
Description: The number of individual decision-trees in the ensemble.
Reason: Adjusting this affects the classifier, because a greater number of decision-trees could lead to more accurate classifications, but too large a number could also have a negative impact.
Range of values: 10 to 14, in increments of 2.

Hyper-parameter: max_samples
Description: The maximum number of training data points sampled for each decision-tree.
Reason: Having too many samples to optimise over could lead to over-fitting to the training set, in turn leading to reduced accuracy.
Range of values: 1000 to 5000, in increments of 1000.

Hyper-parameter: max_features
Description: The maximum number of features sampled for each decision-tree.
Reason: Adjusting this affects the quality of each tree, as sampling fewer features leads to more unique decision-trees.
Range of values: 1 to 9, in increments of 2.
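A minimal sketch of the automated search for the Bagging model, assuming the training and validation sets from earlier and that the dataset has enough rows and feature columns for the ranges above (the loop illustrates the approach rather than reproducing the exact code used):

from itertools import product
from sklearn.ensemble import BaggingClassifier

# Hyper-parameter grid taken from the table above.
n_estimators_values = range(10, 15, 2)        # 10, 12, 14
max_samples_values = range(1000, 5001, 1000)  # 1000, 2000, ..., 5000
max_features_values = range(1, 10, 2)         # 1, 3, 5, 7, 9

best_score, best_params = 0.0, None
# Iterate over every combination and keep the one with the highest validation accuracy.
for n_est, max_s, max_f in product(n_estimators_values, max_samples_values, max_features_values):
    model = BaggingClassifier(n_estimators=n_est, max_samples=max_s, max_features=max_f)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_params = score, (n_est, max_s, max_f)

print("Best validation accuracy:", best_score, "with (n_estimators, max_samples, max_features) =", best_params)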
Results
The results from the four models explored are summarised below:

Decision-Tree (max_depth = 3): training accuracy 0.73, validation accuracy 0.73.
Decision-Tree (max_depth = 10): training accuracy 0.95, validation accuracy 0.93.
Random Forest (n_estimators = 14, max_depth = 25, min_samples_split = 2): training accuracy 0.947, validation accuracy 0.94.
Bagging (n_estimators = 14, max_samples = 5000, max_features = 9): training accuracy 0.819, validation accuracy 0.816.
Figure 1 - The confusion matrix of the Random Forest model (left) and the Decision-Tree model (with
maximum depth of 10) (right) when predicting inputs from the test data.
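A minimal sketch of how these confusion matrices can be computed with scikit-learn, assuming the test split and the fitted random forest (rf) from the earlier sketches:

from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Refit the depth-10 decision-tree from the results table for comparison with rf.
dt = DecisionTreeClassifier(max_depth=10).fit(X_train, y_train)

for name, model in [("Random Forest", rf), ("Decision-Tree (max_depth = 10)", dt)]:
    # Rows correspond to the true air-quality levels, columns to the predicted levels;
    # sklearn.metrics.ConfusionMatrixDisplay can render these as plots like Figure 1.
    print(name)
    print(confusion_matrix(y_test, model.predict(X_test)))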
Evaluation of results
From the results, the random forest model performed the best out of all four models, but
only slightly better than the decision-tree classifier with a maximum depth of 10.
Further analysis of the confusion matrices shows that the random forest model has slightly
more true positives than the decision-tree for all levels except level 1. However, in both
models the high validation accuracy mostly stems from predicting air-quality level 1 very
accurately; predictions for the other levels are much less accurate. This could be because
there are almost six times as many samples for level 1 as for level 2, with even larger
disparities for the other levels. Having more data for the other levels would help the models
learn the patterns arising in those levels and therefore perform better than they currently
do. This could also explain why Bagging is not effective, as sampling from an already
unbalanced dataset exacerbates the problem.
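A quick check of this imbalance, plus one possible mitigation using class weights (an option not used in the report), under the same assumptions about the training and validation sets:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Count how many training samples fall into each air-quality level.
print(pd.Series(y_train).value_counts())

# Possible mitigation: weight classes inversely to their frequency so that the
# rarer levels contribute more to the fit.
rf_balanced = RandomForestClassifier(n_estimators=14, max_depth=25, class_weight="balanced")
rf_balanced.fit(X_train, y_train)
print("Validation accuracy with balanced class weights:", rf_balanced.score(X_val, y_val))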
It is interesting to see that better results are obtained in the random forest model when the
maximum depth of each tree is 25, rather than 30. This suggests that random forest models
work optimally with smaller decision-trees, which makes sense, as their inherent purpose is
to reduce over-fitting by considering multiple smaller trees.
It is difficult to compare these results to those obtained by Choi and Kim (2021), since the
problem in question here is a classification problem, i.e. predicting discrete air-quality
levels, whereas Choi and Kim (2021) explore a regression problem. Additionally, Choi and
Kim (2021) examine air-pollution data from the Far East, which differs greatly from
observations in Frankfurt owing to differences in climate, regulations, and demographics. As
such, a direct comparison between the results holds little meaning.
Further, the predictions made by the model may reveal disparities in air-quality across
different communities. Certain neighbourhoods could have naturally higher pollution values
because they are closer to industrial areas. It is important that government decisions based
on these predictions are taken in a way that does not harm such communities.
Conclusion
As mentioned by Liu et al. (2022), there are many complex factors surrounding air-pollution,
which depend not only on the location being predicted for, but also on surrounding
locations (Choi and Kim, 2021). For machine learning to be used reliably for air-quality
prediction, better-quality data and more accurate predictions are required. Training data
should include more information about other environmental factors and about surrounding
areas, and should be captured under consistent conditions. Further, models should achieve
more than three-nines (99.9%) accuracy before being used for serious applications such as
informing governmental decisions, as false positives or negatives can have grave
consequences.
References
Bagul, A. (2020) Air-quality dataset. Available at:
https://www.kaggle.com/datasets/avibagul80/air-quality-dataset.
Choi, S. and Kim, B. (2021) ‘Applying PCA to deep learning forecasting models for predicting
PM2.5’, Sustainability, 13(7), p. 3726. Available at: http://doi.org/10.3390/su13073726.
Liu, X., Lu, D., Zhang, A., Liu, Q. and Jiang, G. (2022) ‘Data-driven machine learning in
environmental pollution: gains and problems’, Environmental Science and Technology,
56(4), pp. 2124–2133. Available at: https://doi.org/10.1021/acs.est.1c06157.