A Flexible and Lightweight Deep Learning Weather Forecasting Model
https://doi.org/10.1007/s10489-023-04824-w
Abstract
Numerical weather prediction is an established weather forecasting technique in which equations describing wind, temperature, pressure and humidity are solved using the current atmospheric state as input. This study examines deep learning to forecast weather given historical data from two London-based locations. Two distinct Bi-LSTM recurrent neural network models were developed in the TensorFlow deep learning framework and trained to make predictions over the next 24 and 72 h, given the past 120 h. The first trained neural network predicted temperature at Kew Gardens with a forecast accuracy of ±2 °C in 73% of instances over a whole unseen year, and a root mean squared error of 1.45 °C. The second network predicted 72-h air temperature and relative humidity at Heathrow with root mean squared errors of 2.26 °C and 14% respectively; 80% of the temperature predictions were within ±3 °C while 80% of relative humidity predictions were within ±20%. Both networks were trained with five years of historical data, with cloud training times of over a minute (24-h network) and three minutes (72-h network).
purely data-driven, without any kind of data assimilation or hybridisation. The model is tested using historical data from two London-based locations to train a Bi-LSTM recurrent neural network to predict temperature and relative humidity.
The main contributions of this article are:
• The creation of a Deep Neural Network framework to use historical weather data to create forecasts of selected weather features over a desired length.
• The development of two models to predict the hourly evolution of temperature and humidity over 24 and 72 h at two locations in London.
• The study of forecasting errors, investigating seasonal variations and forecast length.
The rest of the paper is structured as follows. In Section 2, the relevant literature related to the use of Machine Learning in weather forecasting is discussed, while in Section 3, the architecture and the dataset used for testing are described. In Section 4, the results with the two models developed are presented, while Section 5 concludes the paper and outlines future research directions.

2 Related work

Machine Learning (ML) is showing large potential in fluid mechanics [6, 7], where it can be used to model sub-grid stress [8, 9] or extract turbulent structures [10]. One of the first ML applications in weather forecasting was Schizas et al. [11] in 1991, where Artificial Neural Networks (ANN) were used to predict minimum temperatures. Similarly, Ochiai et al. [12] used ANN in 1995 to predict rainfall and snowfall. These models were able to improve the forecasting accuracy compared to statistical models [13]. However, the limited forecast of 30–180 min and difficulties in obtaining solution convergence made practical application impossible. Traditional machine learning examples include support vector machines or linear regression, which are typically far less computationally
Fig. 1 Joint probability density functions of two features (off-diagonal) and single-feature probability density functions (diagonal) for the two locations
demanding than neural networks and have been investigated as forecasting candidates. For example, Ma et al. [14] deployed a traditional machine learning model known as XGBoost, which is comprised of gradient-boosted decision trees, to predict air temperature and humidity over a 3-h period with a resulting root mean square error (RMSE) for temperature of 1.77 °C. Despite the relatively good results of traditional machine learning approaches, there are several reasons why a deep learning approach is preferred for weather prediction. Traditional algorithms are unable to model non-linearity, which is essential in predicting the evolution of the weather [15, 16]. Similarly, Shao et al. [17] reported that statistical and traditional ML techniques are not well-suited for complex wind forecasting and attribute this to the turbulent and chaotic behaviour of wind. Recent efforts have focused on using Support Vector Machines and variations for short-term series forecasting and classification of non-linear data and time series [18–20]. Deep Learning (DL) leverages the growing volume and accessibility of data. While traditional machine learning models reach a point beyond which additional training data no longer improves model performance, deep learning models have been observed to benefit from the increase in data [21]. DL networks have been increasingly used in time series forecasting in several applications; examples include finance [22], sugarcane yield prediction [23] and power load forecasting [24], among others. DL has the potential to significantly improve the accuracy of weather forecasting and its applications have increased exponentially. Bauer et al. [4] showed that their Convolutional Neural Network (CNN) ensemble forecasting model can predict anomalies such as Hurricane Irma. Weyn et al. [25] increased the accuracy of weather prediction by applying ensemble modelling of separate CNN models, each with different starting conditions and sets of weights. Roy et al. [26] evaluated a multilayer perceptron, a long short-term memory (LSTM) model and a hybrid CNN/LSTM model and concluded that models with more complex architectures in general improve performance, while Ravuri et al. [27] demonstrated that their neural network model can predict precipitation more accurately in 89% of instances compared to existing weather prediction techniques. Hewage et al. [13] report that their ML models predict weather conditions 12 h into the future with higher accuracy than conventional weather forecasting.
Neural networks have been identified as being particularly promising in precipitation forecasting. A MetNet model developed at Google [28] was shown to predict precipitation accurately over the course of eight hours. In this hybrid approach, several models were used at different stages, including LSTMs and CNNs. Despite its good performance, the model requires large volumes of data. An improvement was obtained by MetNet-2 [29], outperforming state-of-the-art weather models operating in the Continental United States for forecasts up to 12 h. Fu et al. [30], upon evaluating many neural network architectures, settled on a combined Bidirectional-LSTM (Bi-LSTM) and a one-dimensional CNN to predict ground air temperature, relative humidity and wind speed over seven days. They used weather station data from ten weather stations in Beijing and the final model contained more than a million nodes. Despite its size and complexity, the quantitative performance relative to the local weather observations was uncertain. The latest trends include, among others, the use of hybrid LSTM/GAN [31] to predict cloud movement and LSTM/CNN for drought forecasts [32]. Wind forecasting is of great importance in wind power and load estimations, and DL has recently been applied there [33–36]. Most of these applications focused on short-term predictions of up to 24 h.
The recent literature shows that DL applications in weather forecasting are accelerating, with large-scale forecasts using CNN-variant architectures and LSTMs dominating point forecasts. However, there are clearly several research bottlenecks associated with short-term forecasting. Most applications have been in wind-farm sites with "simple" weather patterns, while urban environments are more complex to predict as the turbulence content of the signal is larger. Moreover, there is a deterioration of predictions after several hours and there is no optimal forecast length, which seems to depend on the application.

Table 1 Architecture of the Bi-LSTM used in Model A, which includes the number and type of layers and the number of nodes in each layer

Layer  | Type    | Value                    | Shape      | Parameters
Input  | -       | -                        | (120 × 6)  | 0
Hidden | Bi-LSTM | Tanh activation function | (32 × 512) | 538,624
Hidden | Dropout | 0.25                     | (32 × 512) | 0
Output | Linear  | -                        | (32 × 6)   | 3,078
Total  |         |                          |            | 541,702

Table 2 Parameters used in Model A including number of epochs and optimiser settings

Parameter                    | Value
Context length               | 120 h
Gradient optimisation        | Adaptive moment estimation (ADAM)
Learning rate                | 0.001
Model optimised metric       | Mean squared error
Performance metric           | Root mean squared error
Epochs                       | 2
Batch size                   | 32
Runtime                      | 78 s
Train, validate, test ratios | 0.7, 0.15 and 0.15
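The parameter counts in Table 1 can be cross-checked against the standard LSTM parameter count; a worked verification with n = 256 units per direction and m = 6 input features:

```latex
% One LSTM direction with n units and m input features:
% four gates, each with an n x (m+n) weight matrix and an n-vector of biases.
P_{\text{LSTM}}(n,m) = 4\bigl(n(m+n) + n\bigr) = 4\bigl(256(6+256) + 256\bigr) = 269{,}312
% Two directions in the Bi-LSTM layer:
P_{\text{Bi-LSTM}} = 2 \times 269{,}312 = 538{,}624
% Linear output layer, mapping the 512 concatenated states to 6 features:
P_{\text{out}} = 512 \times 6 + 6 = 3{,}078
% Total trainable parameters, matching Table 1:
P_{\text{total}} = 538{,}624 + 3{,}078 = 541{,}702
```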
Fig. 2 Comparison between predicted and measured temperature at Kew Gardens using a forecast length of one hour and a context length of 120 h. Scatter plot (left), one-year predictions (right)

3 Methodology and data processing

LSTMs are applied frequently in sequential problems as they address the issue of loss of long-term memory [37]. The Bi-LSTM recurrent neural network builds upon the LSTM structure. In a Bi-LSTM model a duplicate layer is produced, where sequential information flows in chronological order through the first layer while the duplicate layer is fed the same sequential information, but in reversed order. This provides the model with far more context, as key information at both the start and end of the sequence is available.
The training data is made openly available by the Met Office from two London weather observation stations: Kew Gardens (51.482, -0.294) and Heathrow (51.479, -0.451). The data was extracted from the Centre for Environmental Data Analysis [38] and contains weather information from 2015–2021 with dozens of hourly weather parameters, hereinafter referred to as features for consistency. However, not all features are available for all weather stations and so the selection was limited to six unique features (three per weather station). The features of particular interest are air temperature, relative humidity and wind speed at both Heathrow and Kew Gardens, see Fig. 1.
With the features selected, the dataset is normalised. This is performed using the mean and standard deviation of each feature. The mean and standard deviation are calculated from the training dataset alone, as including data from the validation and test sets may result in overfitting [39].
The training, validation and test datasets are split in fractions of 0.7, 0.15 and 0.15 respectively, with the chronological sequence of the data maintained. This corresponds to sample sizes of 36,825, 7,891 and 7,892 observations respectively.
Two networks were created: Model A, to forecast 24 h, and Model B, to predict 72 h.
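A minimal sketch of this preparation step, assuming the hourly observations sit in a NumPy array of shape (n_hours, 6); the function and variable names are illustrative, not taken from the paper's (unpublished) code:

```python
import numpy as np

def prepare(data: np.ndarray, context: int = 120, horizon: int = 24):
    """Chronological 0.7/0.15/0.15 split, z-score normalisation with
    training-set statistics only, and sliding-window samples."""
    n = len(data)
    i_tr, i_va = int(0.7 * n), int(0.85 * n)
    train, val, test = data[:i_tr], data[i_tr:i_va], data[i_va:]

    # Statistics from the training split only, to avoid leaking
    # validation/test information into the model.
    mu, sigma = train.mean(axis=0), train.std(axis=0)
    train, val, test = [(x - mu) / sigma for x in (train, val, test)]

    def windows(x):
        # Each sample: `context` hours in -> the following `horizon` hours out.
        X = np.stack([x[i:i + context]
                      for i in range(len(x) - context - horizon)])
        y = np.stack([x[i + context:i + context + horizon]
                      for i in range(len(x) - context - horizon)])
        return X, y

    return windows(train), windows(val), windows(test), (mu, sigma)
```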
Table 3 Root mean squared error (RMSE), mean absolute error (MAE) and maximum error between hourly and 24-h temperature predictions in Fig. 3

                | RMSE [°C] | MAE [°C] | Max. Error [°C]
Single timestep | 0.86      | 0.63     | 2.19
Multi-timestep  | 1.74      | 1.33     | 4.76

The same dataset, with the same split ratio for training, validation and testing, was used in both models. However, Model B is deeper, with a denser Bi-LSTM with more cells and an additional Feed Forward neural network (FNN) in the second hidden layer.
The architecture of Model A is characterised in Table 1 and determines the number of calculations performed. The input layer shape is defined by the length of the context and the number of features. The hidden layer shape is defined by the batch size and the number of Bi-LSTM units; 256 forward and 256 backward units. A batch size of 32 results in 1,151 batches from a total of 36,825 training observations, with the final batch containing the remainder. Finally, the output layer shape is defined by the number of features and the batch size. The total number of parameters to be trained in the model is the sum of those in the hidden layer and output layer, totalling 541,702.
A dropout layer is included to minimise the impact of overfitting by randomly setting the output of 25% of the units in the hidden layer to zero. Dropout is a well-established technique in neural network modelling to overcome overfitting and is considered a more practical approach than regularisation, which is a common approach to reduce overfitting in traditional machine learning problems (Table 2) [40].
The training process was performed using Jupyter Notebook within a Google Colaboratory environment. The complete runtime was 78 s, after which predictions could be made within 10 s. The maximum memory usage during training was less than 16 GB. The entire test dataset corresponds to roughly one year of data in 2020 (while training uses 2015–2019). The model uses 120 h of measured hourly data as input and the output is the desired forecast hours. A benefit of having a context length greater than the forecast length is that some measured data will always be used in making the prediction. However, the returns are diminished as the temporal gap between the measured data and the forecast increases. A model with a larger context of 240 h captured the data trend but failed to express the peaks and troughs accurately. The approach was first tested by doing a single-hour forecast (see Fig. 2). This process is repeated across the entire test dataset and 7,772 single-hour predictions are generated. The root mean squared, mean absolute and maximum errors were 0.89 °C, 0.62 °C and 12.81 °C respectively.
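Model A can be reconstructed in TensorFlow/Keras from the shapes and settings in Tables 1 and 2; the following is an illustrative reimplementation (the original code is not published), together with a sketch of the iterated roll-out used for multi-step forecasts:

```python
import numpy as np
import tensorflow as tf

# Model A: 120 h of 6 features in, one step of 6 features out.
model_a = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(120, 6)),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(256, activation="tanh")),  # 538,624 parameters
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Dense(6),                           # linear output, 3,078 parameters
])
model_a.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="mse",
    metrics=[tf.keras.metrics.RootMeanSquaredError()])
model_a.summary()  # total: 541,702 trainable parameters, matching Table 1

# Training with the Table 2 settings (2 epochs, batch size 32):
# model_a.fit(X_train, y_train, epochs=2, batch_size=32,
#             validation_data=(X_val, y_val))

def rollout(model, context: np.ndarray, steps: int = 24) -> np.ndarray:
    """Iterated multi-step forecast: feed each prediction back into the window."""
    window = context.copy()                           # shape (120, 6), normalised
    preds = []
    for _ in range(steps):
        y = model.predict(window[None], verbose=0)[0]  # one step, shape (6,)
        preds.append(y)
        window = np.vstack([window[1:], y])            # slide the context window
    return np.array(preds)                             # shape (steps, 6)
```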
Fig. 4 24-h forecast of the air temperature at Kew Gardens during four days in different seasons
4 Results

4.1 24-h temperature forecast

To predict 24 h ahead, a comparison was initially made between the single-step (predicting 24 h in one step) and multi-step prediction models to assess the impact of error propagation, see Fig. 3. Table 3 shows that the multi-step model prediction error according to all three metrics is approximately twice as large as the single-step error.
To quantify how well our 24-h model generalises to different time periods and seasons, four prediction windows spaced 90 days apart are illustrated in Fig. 4. A benchmark model, the naive model, is used for comparison. The naive model uses the last measured temperature for the entire 24-h forecast. The naive model does not make assumptions about the future state and is completely uninformed. The root mean squared errors confirm the neural network performs significantly better than the naive model in all instances (Table 4), with average errors of 1.45 °C and 6.00 °C for the neural network and naive forecast respectively.

Table 4 Root mean squared error (RMSE), mean absolute error (MAE) and maximum errors for the 24-h temperature prediction (Fig. 4), values in parentheses are normalised RMSE

         | RMSE [°C]   | RMSE Naïve [°C] | MAE [°C]    | Max. Error [°C]
Summer   | 1.33 (0.91) | 3.30 (2.27)     | 1.74 (1.20) | 4.76 (3.27)
Autumn   | 1.12 (0.77) | 10.4 (7.15)     | 1.36 (0.93) | 2.39 (1.64)
Winter   | 1.64 (1.13) | 7.30 (5.02)     | 2.03 (1.40) | 4.01 (2.76)
Spring   | 1.73 (1.19) | 3.00 (2.06)     | 2.25 (1.55) | 4.24 (2.91)
Mean     | 1.45        | 6.00            | 1.84        | 3.85
Std. dev | 0.244       | 3.06            | 0.333       | 0.886

To contextualise the performance, the neural network was compared to performance metrics from the Met Office. The 24-h predictions produced by the neural network were accurate to ±2 °C in 72.9% of all instances. By comparison, the Met Office states 92.5% of its 24-h temperature predictions are accurate to ±2 °C while 92% of 24-h wind speed predictions are within 5 knots [41]. Note that measurements used in the weather stations were acquired with a resolution of ±0.1 °C (Fig. 5).
A better statistical comparison is obtained by looking at the probability density functions of the predicted and measured data. The 96 individual forecasts are derived from the four windows in Fig. 4. These points were used to compute a distribution function and are compared to the measured temperature distribution for the same period, while the entire yearly data was used to create a benchmark.
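A sketch of the naive persistence baseline and the error metrics reported in Table 4; the per-window loop is illustrative, assuming (context, measured) pairs and a temperature column index `TEMP_COL` that are not defined in the paper:

```python
import numpy as np

def naive_forecast(context: np.ndarray, horizon: int = 24) -> np.ndarray:
    """Persistence baseline: repeat the last measured value for every hour."""
    return np.repeat(context[-1:], horizon, axis=0)

def error_metrics(pred: np.ndarray, meas: np.ndarray) -> dict:
    """RMSE, MAE and maximum absolute error between two series."""
    err = pred - meas
    return {"rmse": float(np.sqrt(np.mean(err ** 2))),
            "mae": float(np.mean(np.abs(err))),
            "max": float(np.max(np.abs(err)))}

# Per-window (e.g. per-season) comparison of the model against the baseline:
# for context, measured in windows:
#     model_pred = rollout(model_a, context, steps=24)[:, TEMP_COL]
#     naive_pred = naive_forecast(context)[:, TEMP_COL]
#     print(error_metrics(model_pred, measured),
#           error_metrics(naive_pred, measured))
```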
The 96-sample measured temperature peak is wider than the predicted peak, indicating that the predictions are conservative, with both curves demonstrating bimodal behaviour. Nonetheless, the predicted and measured distributions agree very well, except for the tails on very hot days. Outlier temperatures above 40 °C were measured that are not predicted.
Using the same network (Model A), the length of forecast was varied next to understand the deterioration of the predictions without adapting the model and parameters. Ten different forecast lengths were tested, ranging from one to 168 h (seven days). The RMSE mean and standard deviation are plotted against forecast length in Fig. 6 to indicate uncertainty for increasing forecast lengths. For consistency, each prediction was run with a single epoch rather than attempting to optimise performance by identifying the most suitable number of epochs for each forecast length. The single-hour prediction has the smallest mean and standard deviation, both of which increase as the forecast length increases, but become more stable after 24 h. 1–24 h predictions have a mean error of less than 3 °C. Beyond 24 h, the prediction uncertainty continues to increase before rapidly converging around 4 °C. While there are many caveats to this information, the results suggest that, without further optimisation, the model should not be used for predictions exceeding one day.

4.2 72-h temperature, relative humidity and wind velocity forecasts

The Model B setup is shown in Table 5. The main difference with Model A is the addition of a linear layer within the hidden layer and a reduction in the dropout percentage to 10%. The hyperparameters used in the optimised model are recorded in Table 6.
As with the first model, an increase in the number of epochs resulted in a reduction in the error and an increase in the r-squared value. However, there was no direct correlation between optimisation of these two parameters and how the 72-h forecast performed over different time periods. Therefore, once a capable architecture was identified, a similar trial-and-error approach began to optimise the hyperparameters and context length based on the RMSE from the four windows. Initially, 120 h were used for the context length, but this was later changed to 168 h as this gave optimal performance. After upwards of twenty iterations with different conditions, the hyperparameters listed in Table 6 resulted in the best performance.

Table 6 The finalised hyperparameters used to train Model B, including the number of epochs and optimiser settings
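Only the differences from Model A are stated in the text: a denser Bi-LSTM, an additional feed-forward (linear) layer in the second hidden layer, 10% dropout and a 168-h context. A speculative sketch consistent with that description follows; the layer widths are assumptions, not the published configuration:

```python
import tensorflow as tf

# Model B (illustrative): 168 h context, 12 features (6 per location),
# a denser Bi-LSTM plus an extra feed-forward layer, and 10% dropout.
model_b = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(168, 12)),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(512, activation="tanh")),  # denser than Model A
    tf.keras.layers.Dense(512),                         # additional FNN / linear layer
    tf.keras.layers.Dropout(0.10),
    tf.keras.layers.Dense(12),                          # one step of all 12 features
])
model_b.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                loss="mse")
```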
Fig. 7 72-h forecast of the air temperature at Heathrow during four days in different seasons. Symbols same as Fig. 4
Once the model was trained, it was possible to make new predictions rapidly, within 15 s. The single-step hourly prediction RMSE was 0.94 °C, MAE 0.68 °C and maximum error 14.94 °C when calculated over the entire test dataset. While the numbers are comparable to the single-hour predictions generated by Model A, the model did not perform quite as well over three days as over one day. This is to be expected, as the forecast window is three times longer and the likelihood of error propagation is much higher.
The four windows in Fig. 7 illustrate how the Bi-LSTM and linear model is highly capable of making predictions with excellent generalisability across different periods and seasons. The three-day forecast resulted in an RMSE mean and standard deviation of 2.26 °C and 0.316 °C respectively, with 79.5% of the temperature forecasts within ±3 °C when making a 72-h forecast (compared to 1.45 °C and 0.244 °C for the single-day prediction) (Table 7).
Figure 8 shows the predicted distribution for 72-h forecasts. Despite the qualitatively good agreement, the modelled distribution has a narrower peak, with extreme high temperatures underestimated (similarly to Model A), showcasing the difficulty of representing the tails of the distribution.
The model takes in all features from both locations, resulting in six unique features and 12 features in total. As before, it is possible to generate a prediction for any one of the features introduced to the model in training.
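A comparison in the style of Fig. 8 can be sketched by estimating the two probability density functions with a kernel density estimate; `pred` and `meas` stand for one-dimensional arrays of forecast and measured temperatures:

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

def plot_pdfs(pred: np.ndarray, meas: np.ndarray):
    """Kernel density estimates of predicted vs measured temperature."""
    grid = np.linspace(min(meas.min(), pred.min()),
                       max(meas.max(), pred.max()), 200)
    plt.plot(grid, gaussian_kde(meas)(grid), label="measured")
    plt.plot(grid, gaussian_kde(pred)(grid), label="predicted")
    plt.xlabel("Temperature [°C]")
    plt.ylabel("Probability density")
    plt.legend()
    plt.show()
```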
Fig. 8 Temperature probability density functions (left) and scatter plot (right) at Heathrow
While the model does take all inputs into consideration during training and seeks to minimise the loss function with respect to all features, the performance arising from this approach does not necessarily translate into good generalisability across all timescales. When training the model, the weighted sum of all 12 features is used when minimising the loss, assigning different levels of importance to each feature. As the objective during the training of Model B was to optimise the 72-h temperature predictions, there was no guarantee that this performance would translate into comparable performance for another feature, in this case relative humidity.
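A sketch of such a feature-weighted objective; the weight values below are placeholders, as the paper does not report the weights it assigned to each feature:

```python
import tensorflow as tf

# Hypothetical per-feature weights, emphasising the temperature columns.
feature_weights = tf.constant([3.0, 1.0, 1.0, 3.0, 1.0, 1.0,
                               1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

def weighted_mse(y_true, y_pred):
    """Mean squared error with a different weight for each of the 12 features."""
    sq = tf.square(y_true - y_pred)              # shape (batch, 12)
    return tf.reduce_mean(sq * feature_weights)  # weighted mean over features

# model_b.compile(optimizer="adam", loss=weighted_mse)
```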
Fig. 9 72-h forecast of the relative humidity at Heathrow during four days in different seasons
Table 8 Root mean squared error (RMSE), mean absolute error (MAE) and maximum errors for the 72-h relative humidity prediction (Fig. 9)

        | RMSE [%]         | MAE [%]          | Max. Error [%]
        | Kew G | Heathrow | Kew G | Heathrow | Kew G | Heathrow
Winter  | 8.78  | 7.46     | 6.33  | 5.4      | 22.9  | 20.6
Autumn  | 9.48  | 8.25     | 7.30  | 6.21     | 21.9  | 19.6
Summer  | 28.0  | 29.1     | 22.2  | 23.43    | 58.6  | 61.6
Spring  | 11.9  | 11.5     | 8.43  | 8.51     | 36.2  | 33.4
Total   | 58.1  | 56.3     | 44.3  | 43.6     | 139.6 | 135.2
Average | 14.5  | 14.0     | 11.1  | 10.9     | 34.9  | 33.8
The accuracy of the results in Fig. 9 is a byproduct of the process to optimise the air temperature. If the relative humidity were the focus of the optimisation, the forecast prediction would probably show considerable improvement (Table 8).

5 Conclusions and future work

This paper presented a novel, flexible, deep learning local weather forecasting model. The approach is capable of rapidly predicting weather features and generating cheap, reliable short-duration forecasts. The model is purely data-driven, in contrast with earlier approaches that required varying degrees of data assimilation or hybridisation. A total of two models were trained and used to predict air temperature and relative humidity. The dataset used to train the models contained six years of historical weather observations from the Kew Gardens and Heathrow weather observation stations in London. The objective of having multiple locations is to infer a topographical representation for the model to learn from. As the two weather observation stations are positioned 11 km apart, it is expected that they would share similar weather characteristics. Discrepancies in wind speed and humidity between the locations could be explained by local land features and artificial structures. Kew Gardens is positioned near the river Thames in a built-up area, while the nearest body of water to Heathrow is several kilometres away. The Heathrow observation station is situated within the airport boundaries with few obstructions.
Model A is a 24-h prediction network designed to predict air temperature. This model was intended to demonstrate proof of concept and was trained with wet bulb, air and dew point temperatures. Model A achieved its objective of establishing a baseline for further predictions. It showed that air temperature could be predicted with reasonable accuracy compared to the Met Office, predicting the air temperature within a range of 2 °C in 72.9% of instances, with a maximum error of 3.85 °C occurring mostly on very hot days. Model B is a 72-h prediction network that attempted to predict air temperature, relative humidity and wind speed. Despite a three-fold increase in the forecast length, the model was able to accurately predict air temperature with an RMSE of 2.26 °C at Heathrow and was able to predict the temperature accurately to within ±3 °C in 79.5% of instances. It was able to predict the relative humidity in the same location with an RMSE of 14%. However, Model B was optimised with respect to air temperature, which impacted the accuracy.
The flexibility and speed of the model make it attractive for short-term local forecasts in locations where weather stations are present but where it may be difficult to obtain accurate weather predictions (due to topography, local effects, etc.). The results show that predictions up to three days have accuracy comparable to expensive numerical weather predictions. However, feature-based optimisation may be required to improve the accuracy of features such as wind speed or humidity. Future lines of research will be in this direction.

Authors contribution G.Z and S.NM contributed to the conception of the presented idea. G.Z. wrote the main text, performed the simulations and prepared all figures. S.NM revised the manuscript. All authors reviewed the manuscript.

Data availability The weather hourly data that support the findings of this study are publicly available in the NCAS British Atmospheric Data Centre, https://catalogue.ceda.ac.uk/

Declarations

Ethical and informed consent for data used Not Applicable.

Competing Interests The authors have no conflict of interest.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
39. Wang JQ, Du Y, Wang J (2020) LSTM based long-term energy consumption prediction with periodicity. Energy 197:117197
40. Kreuzer D, Munz M, Schlüter S (2020) Short-term temperature forecasts using a convolutional neural network - an application to different weather stations in Germany. Mach Learn Appl 2:100007
41. Met-Office (2022) How accurate are our public forecasts? https://www.metoffice.gov.uk/about-us/what/accuracy-and-trust/how-accurate-are-our-public-forecasts. Accessed 30/08/2022

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Gabriel Zenkner received his MSc at Imperial College London in Advanced Mechanical Engineering in 2022. His current research interests include data science, data engineering and prediction of physical and social phenomena using machine learning techniques.

Salvador Navarro-Martinez is a reader at Imperial College London in the Department of Mechanical Engineering. His research interests lie in the modelling of turbulent flows using stochastic and machine learning techniques.