BERT4ST: Wind power forecasting
Keywords: Wind power forecasting; Spatial–temporal forecasting; Large language model

Accurate forecasting of wind power generation is essential for ensuring power safety, scheduling various energy sources, and improving energy utilization. However, the elusive nature of wind, influenced by various meteorological and geographical factors, greatly complicates the wind power forecasting task. To improve the forecasting accuracy of wind power (WP), we propose a BERT-based model for spatio-temporal forecasting (BERT4ST), which is the first approach to fine-tune a large language model for the spatio-temporal modeling of WP. To deal with the inherent characteristics of WP, BERT4ST exploits the individual spatial and temporal dependency of patches and redesigns a set of spatial and temporal encodings. By analyzing the connection between bidirectional attention networks and WP spatio-temporal data, BERT4ST employs a pre-trained BERT encoder as the backbone network to learn the individual spatial and temporal dependency of patches of WP data. Additionally, BERT4ST fine-tunes the pre-trained backbone in a multi-stage manner, i.e., first aligning the language model with the spatio-temporal data and then fine-tuning on the downstream tasks while maintaining the stability of the backbone network. Experimental results demonstrate that our BERT4ST achieves desirable performance compared to some state-of-the-art methods.
1. Introduction

With the increasing global demand for renewable energy, wind power has attracted more and more attention. However, the actual wind power generation strictly depends on the wind speed, which may change violently over time, bringing significant challenges to the safety of the power grid and the efficiency of energy utilization [1,2]. What is worse, the elusive nature of wind speed, influenced by various meteorological and geographical factors, greatly complicates the wind power forecasting task. Therefore, accurate forecasting of WP generation is essential for ensuring power safety, scheduling various energy sources, and improving energy utilization. Due to the necessity of low computational complexity and high precision in short-term wind power forecasting, it is challenging to promptly deliver numerical weather prediction (NWP) data through physical modeling methods [3,4]. Furthermore, wind speed may change very fast, and it is difficult for NWP to predict it accurately. Consequently, short-term and ultra-short-term WP forecasting mainly relies on historical data. By capturing the temporal and spatial correlations of historical WP data, we can model the evolution of WP and predict the future WP.

Previous WP forecasting research employed statistical models, such as ARIMAs [5–7], and machine learning techniques [8–10], which are easy to implement but often fall short in performance. Deep learning-based models can consistently outperform those statistical alternatives by leveraging deeper neural networks to learn intricate meteorological features. There are many deep learning-based models, such as convolutional neural networks (CNN), long short-term memory networks (LSTM), and self-attention networks. Recently, because of its strong power to represent temporal characteristics in various scenarios, self-attention has gained more and more popularity.

The above forecasting models can be combined to further improve forecasting performance. Specifically, forecast combination aims to find a suitable forecasting model pool and accurately assign appropriate weights to each model to achieve accurate and stable prediction. Nikodinoska et al. [11] applied the dynamic elastic net as a combination scheme to reduce the variance of renewable energy forecasting errors. Lu et al. [12] devised a variance-based strategy to distribute weights among individual wind power forecasting models. In [13], various models, which offer different probability distributions, were amalgamated to enhance the performance of probabilistic wind power prediction. Raza et al. [14] utilized trimmed aggregation to combine five neural networks with various structural complexities for day-ahead forecasting.
such as addressing differences between different domains, need to be taken into account. The significance of cross-modality transfer learning lies in its ability to fine-tune well-trained models for given data, enabling the adaptation of the final results to domains with insufficient training data. For instance, some studies have successfully transferred networks trained for visual tasks to infrared image recognition tasks [49–51]. Pre-trained NLP models have also demonstrated effectiveness when transferred to tasks regarding audio [52,53] and time series data [54].

Studies of large language models for time series forecasting can be categorized into three types. The first type applies pre-trained large language models (LLMs) directly to forecasting through prompt settings [55–58]. Inspired by the impressive performance of chatbots, these works leverage the language understanding capability of LLMs to analyze the characteristics of time series data. The second type aggregates diverse time series data from various domains into a large model, aiming to create a model applicable across a broad spectrum of scenarios [59]. The third type focuses on fine-tuning large language models. Zhou et al. [54] made an initial attempt to fine-tune a pre-trained GPT model and achieved optimal results in all of the time series tasks except for time series forecasting. Chang et al. [60] proposed a multi-stage fine-tuning method, first fine-tuning the backbone network to align time series and text data, and then training the downstream forecasting layer. Sun et al. [61] introduced contrastive learning to enhance the model's temporal representation capability. Jin et al. [62] suggested fine-tuning the input layer of the large language model, instead of the backbone network, to effectively transform temporal input into textual semantic input, making it compatible with pre-trained models. However, to the best of our knowledge, large language models for spatio-temporal forecasting have not been explored.

3. Preliminaries and problem formulation

3.1. Problem formulation

We consider a region with M wind power stations, and denote their locations as S = (s_1, s_2, …, s_M) ∈ R^{M×2}, where s_i ∈ R^2 represents the longitude and latitude of station i. We denote the historical wind power data as X = (x_1, x_2, …, x_M) ∈ R^{M×P}, where P is the length of the historical data, and x_i = (x_{i,1}, x_{i,2}, …, x_{i,P}) ∈ R^P represents the historical wind power data of station i. Based on the historical data X, our goal is to forecast the wind power at the next Q time steps, denoted as Ŷ = (ŷ_1, ŷ_2, …, ŷ_M) ∈ R^{M×Q}, where ŷ_i = (ŷ_{i,P+1}, ŷ_{i,P+2}, …, ŷ_{i,P+Q}) ∈ R^Q is the forecast power of station i for the upcoming Q time steps.
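For concreteness, the following Python sketch shows one common way to organize such station series into (history, target) pairs with the shapes defined above. The sliding-window sampling and the variable names are our illustrative assumptions, not details specified in this section.

```python
import numpy as np

def make_samples(series, P, Q, stride=1):
    """Build (history, target) pairs from station series.

    series : array of shape (M, T), wind power of M stations over T time steps
    P      : length of the historical window (input X, shape (M, P))
    Q      : forecasting horizon (target Y, shape (M, Q))
    """
    M, T = series.shape
    X, Y = [], []
    for start in range(0, T - P - Q + 1, stride):
        X.append(series[:, start:start + P])          # historical data X
        Y.append(series[:, start + P:start + P + Q])  # future wind power Y
    return np.stack(X), np.stack(Y)  # shapes: (n, M, P) and (n, M, Q)
```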
3.2. Bidirectional encoder representations from transformers (BERT)

BERT is a pre-trained language representation model. Different from traditional methods that involve unidirectional language models or shallow concatenation of two unidirectional models for pre-training [63], BERT leverages a masked language model (MLM) to generate comprehensive bidirectional language representations. After pre-training, an extra output layer is added to BERT for performance improvement across various downstream tasks, without necessitating task-specific structural modifications to BERT.

The backbone of the BERT network comprises a stacked transformer encoder, which utilizes layers of self-attention mechanisms to capture intricate associations between tokens and sentences and can thereby facilitate language understanding. As shown in Fig. 1, the input to the BERT network includes the token ID, position ID, and sentence ID. Accordingly, the embedding layer generates the token embeddings, position embeddings, and sentence embeddings, and sums them to create a representation of the input data. During the pre-training of BERT, a masked language model (MLM) mechanism is applied, which randomly masks some input tokens with a certain probability and predicts the original tokens.

3.3. Individual spatial and temporal dependency of patches

Fig. 2 illustrates three types of feature dependency in spatio-temporal modeling. Traditional spatio-temporal data forecasting relies on constructing or learning a graph that represents the inter-site correlations at a specific time moment, and models dependencies in the temporal domain and spatial domain. However, recent research on modeling efficiency has led to subsequent attempts to avoid over-learning of spatial correlations in a channel-independent manner. Wind power data, as a special type of spatio-temporal data, may exhibit rich correlations between different time moments and stations due to the influence of atmospheric flow. For example, if the wind flows from station D to station A, then we can infer the wind speed or wind power of station A from the historical data of station D. Moreover, these correlations between stations constantly change. Therefore, this paper takes a BERT backbone network to model the individual spatial and temporal dependency [64] of patches of wind data.

3.4. The relationship between wind speed and wind power

Wind power (WP) is generated by wind turbines, where the wind is deflected by the turbine blades, converted into mechanical energy through rotation, and then used to drive generators to produce electricity [65]. The power generation of a wind turbine can be described by the following wind power function, which illustrates the relationship between the turbine output power and wind speed (WS) [65],

P(v) = \begin{cases} 0, & v < v_{in} \ \text{or} \ v > v_{out} \\ \rho A C_p v^{3}/2, & v_{in} \le v \le v_{rated} \\ P_{rated}, & v_{rated} < v \le v_{out} \end{cases} \qquad (1)

where P(v) is the turbine output power at the WS of v, P_rated is the rated WP, ρ is the air density, A is the sweeping area of an impeller, C_p is the WP coefficient, and v_in, v_rated, v_out denote the cut-in, rated, and cut-out WS, respectively. From (1), we observe that WP is strongly correlated with WS, so the forecasting models of wind power may share similar structures with the forecasting models of wind speed. Of course, the trained models of wind power forecasting and wind speed forecasting may have different parameters.
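The piecewise power curve in (1) can be written as a short function. The following sketch is for illustration only; the default parameter values are hypothetical and not taken from this paper.

```python
import numpy as np

def wind_power_curve(v, rho=1.225, area=5000.0, cp=0.4,
                     v_in=3.0, v_rated=12.0, v_out=25.0):
    """Piecewise turbine power curve of Eq. (1).

    v       : wind speed (m/s), scalar or numpy array
    rho     : air density (kg/m^3)
    area    : rotor sweeping area (m^2)
    cp      : power coefficient
    v_in, v_rated, v_out : cut-in, rated, and cut-out wind speeds (m/s)
    """
    v = np.asarray(v, dtype=float)
    p_rated = 0.5 * rho * area * cp * v_rated ** 3           # rated power
    power = 0.5 * rho * area * cp * v ** 3                   # cubic region
    power = np.where(v > v_rated, p_rated, power)            # rated region
    power = np.where((v < v_in) | (v > v_out), 0.0, power)   # outside operating range
    return power
```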
Fig. 2. Three types of feature dependency in spatio-temporal modeling. (a) Traditional spatio-temporal models tend to use spatial and temporal dependency mechanisms to model
the data characteristics in the spatial and temporal domains, respectively. (b) Recent research applies channel independence to avoid the over-learning of the interactions between
channels. (c) In this paper, we apply individual spatial and temporal dependency of patches to accomplish the wind power forecasting task, a special case of spatio-temporal
modeling.
Fig. 3. The framework of our proposed BERT4ST. Stage 1: supervised fine-tuning uses the MLM approach to align the backbone model with time-series data. Stage 2: downstream fine-tuning for forecasting with initial linear probing. Stage 3: full fine-tuning for forecasting.
4. Methodology

The framework of our proposed BERT4ST is shown in Fig. 3. We perform three-stage fine-tuning on BERT4ST. In the first stage, we apply the MLM mechanism to align the backbone model with spatio-temporal data. The latter stages perform downstream fine-tuning of BERT4ST. Particularly, the second stage trains the output layer and the third stage fine-tunes all of the modules.

The flowchart of BERT4ST is shown in Fig. 4. We first collect and preprocess the source data. Then we implement the spatial–temporal modeling with a pre-trained BERT network. To further improve the forecasting accuracy, we implement a three-stage fine-tuning.

4.1. Training stages

We use pre-trained BERT as the backbone network because of its bidirectional structure, which is suitable for aligning the large language model with spatio-temporal data. It is observed that in spatio-temporal data, the state of a specific station at a given time moment is not only related to other time moments at the same station but also to different time moments at other stations [66,67]. Therefore, when encoding spatio-temporal correlations, it is necessary to consider the forecasting performance of different stations at different time intervals. As illustrated in Fig. 3, we represent the input with a combination of station ID and patch ID. For example, A1 represents the first patch of station A, A2 represents the second patch of station A, and B1 represents the first patch of station B. Subsequently, the model learns rich spatio-temporal information through the stacked self-attention layers.
Fig. 4. The flowchart of BERT4ST. We collect source data and preprocess it. Then we implement the spatial–temporal modeling with a pre-trained BERT network. To achieve
more accurate forecasting, we implement a three-stage fine-tuning.
Suppose there are M stations, and we set the patch size as p and the historical data length as P. Therefore, the historical data of each station is segmented into N = P/p patches. Consequently, a total of M × N patches are taken as the input tokens. Then we randomly mask these tokens at a specific probability, where the values of the masked tokens are set to 0. In the first stage, the backbone network is applied to reconstruct the values of these masked tokens. As shown in Fig. 3(a), we mask the patches A3, B2 and C5, and fine-tune the backbone network's parameters by minimizing the MSE loss between the reconstructed values and the true values of the masked patches.

After alignment, the backbone is further fine-tuned on the downstream forecasting task. As shown in [68], performing initial linear probing and later full fine-tuning can achieve optimal results. Thus, our downstream fine-tuning is also conducted in two stages. We initially perform linear probing in the second stage, training only the final linear layer. In the third stage, we conduct complete fine-tuning to adjust all modules. In these two stages, the task involves collecting the spatio-temporal encodings of all patches and producing future WP forecasts through a fully connected layer.
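To make the patch tokenization and stage-1 masking concrete, here is a minimal PyTorch-style sketch. The function names are ours; setting masked values to 0 and the 20% masking ratio follow the descriptions in this paper.

```python
import torch

def patchify(x, p):
    """Split station series into patches.

    x : tensor of shape (M, P), historical data of M stations
    p : patch size; assumes P is divisible by p
    returns a tensor of shape (M, N, p) with N = P // p
    """
    M, P = x.shape
    return x.reshape(M, P // p, p)

def mask_patches(patches, mask_ratio=0.2):
    """Randomly mask patches (set their values to 0) for stage-1 alignment."""
    M, N, p = patches.shape
    mask = torch.rand(M, N) < mask_ratio      # True where a patch is masked
    masked = patches.clone()
    masked[mask] = 0.0
    return masked, mask

# Stage-1 objective: reconstruct only the masked patches with an MSE loss, e.g.
# recon = model(masked)                       # hypothetical backbone + output layer
# loss = ((recon[mask] - patches[mask]) ** 2).mean()
```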
4.2. Patching and instance normalization

Statistical properties, such as the mean and variance, often vary over time in time series data, leading to a distribution shift problem. This temporal distribution shift poses a significant challenge for accurate time series forecasting. To tackle this issue, Kim et al. [64] introduced a straightforward yet powerful normalization method, known as Reversible Instance Normalization (RevIN). RevIN is a widely applicable normalization-and-denormalization technique to address the challenges posed by temporal distribution shifts. In the instance normalization of a time series, the temporal average and standard deviation are calculated, and the time series is scaled into a standard normal distribution. In the instance denormalization, the output of the forecasting layer is transformed back to obtain the final result.

When self-attention models are applied to long time series, there is an over-fitting risk to noise [69]. To avoid that risk, Nie et al. [31] proposed a method to independently patch the time series of each channel, which has become a common practice in time series learning. For each station, the data series is split into patches, each of which contains the time series information of a short interval. In this way, the input data becomes a sequence of time intervals, rather than time steps. The subsequent backbone network avoids the over-fitting risk to noise while enhancing the ability to model time series at a larger scale. We also adopt this patching mechanism in this paper. However, in contrast to the channel-independent patching of [31], we combine the information of all channels to capture the complex spatio-temporal correlations of WP data.
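The instance normalization and denormalization described above can be sketched as follows. This is a simplified view of RevIN [64] without its optional learnable affine parameters, and the class and variable names are ours.

```python
import torch

class SimpleRevIN:
    """Per-instance normalization and denormalization along the time axis."""

    def __init__(self, eps=1e-5):
        self.eps = eps

    def normalize(self, x):
        # x: tensor of shape (batch, stations, time)
        self.mean = x.mean(dim=-1, keepdim=True)
        self.std = x.std(dim=-1, keepdim=True) + self.eps
        return (x - self.mean) / self.std

    def denormalize(self, y):
        # y: forecasting-layer output with the same leading dimensions as x
        return y * self.std + self.mean
```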
4.3. Input encoding layer

Appropriate time series encoding is crucial for aligning the LLM with time series data. Unlike the BERT model, which employs pre-trained embedding layers to separately map the token ID, position ID, and sentence ID, we redesign a set of encodings tailored for the spatial and temporal characteristics of spatio-temporal data, as shown in Fig. 5. The geographical coordinate of a station is denoted as s = (x, y) ∈ R^2, the starting time of the current patch is denoted as [hh:mm], where 0 ≤ hh < 24 is the hour value in a day and 0 ≤ mm < 60 is the minute value in an hour, and the patch values are denoted as v = (v_1, v_2, …, v_p) ∈ R^p, where p is the length of each patch. We calculate the patch encoding E_v, the temporal encoding E_t and the position encoding E_s as

E_v = v \cdot W_v, \qquad (2)

E_t = \left( \sin\frac{2\pi(hh \cdot 60 + mm)}{24 \times 60}, \; \cos\frac{2\pi(hh \cdot 60 + mm)}{24 \times 60} \right) \cdot W_t, \qquad (3)

E_s = s \cdot W_s, \qquad (4)

where W_v ∈ R^{p×d}, W_t ∈ R^{2×d}, W_s ∈ R^{2×d} are trainable parameters, and d is the dimension of the pre-trained model. We set the temporal encoding in (3) to ensure that it takes one day as a cycle and that correlations exist between nearby time moments. Then we sum the three encodings as the input of the backbone network.
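A compact sketch of Eqs. (2)-(4) as a PyTorch module is given below; the module and argument names are ours, and summing the three encodings follows the text above.

```python
import math
import torch
import torch.nn as nn

class SpatioTemporalEncoding(nn.Module):
    """Patch, temporal, and position encodings of Eqs. (2)-(4), summed as the input."""

    def __init__(self, patch_size, d_model):
        super().__init__()
        self.w_v = nn.Linear(patch_size, d_model, bias=False)  # W_v in Eq. (2)
        self.w_t = nn.Linear(2, d_model, bias=False)            # W_t in Eq. (3)
        self.w_s = nn.Linear(2, d_model, bias=False)            # W_s in Eq. (4)

    def forward(self, patch_values, start_hh, start_mm, coords):
        # patch_values: (..., patch_size); start_hh, start_mm: float tensors (...,);
        # coords: (..., 2) longitude/latitude of the station
        minutes = start_hh * 60.0 + start_mm
        phase = 2.0 * math.pi * minutes / (24.0 * 60.0)          # one-day cycle
        e_v = self.w_v(patch_values)
        e_t = self.w_t(torch.stack([torch.sin(phase), torch.cos(phase)], dim=-1))
        e_s = self.w_s(coords)
        return e_v + e_t + e_s                                    # input to the backbone
```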
4.4. Backbone and the output layer

We employ BERT as the backbone network out of the consideration of spatio-temporal modeling. When fine-tuning the backbone network, we utilize Low-Rank Adaptation (LoRA). As plotted in Fig. 6(a), LoRA utilizes the product of two low-rank matrices to update the original high-rank weight matrix. Denote the dimension of the backbone as d. Then the original weight matrix has the size of d × d. We set the dimension of LoRA to d′ (d′ ≪ d), and the sizes of the two low-rank matrices are d × d′ and d′ × d. Therefore, we only need to train 2·d·d′ parameters, rather than the original d² ones, which greatly enhances the training efficiency.

As shown in Figs. 3 and 6(b), the affine transformations and feed-forward layers (FFN) of the original backbone network are frozen, while in stages 1 and 3, fine-tuning is applied to the LoRA networks on Query and Key, as well as to the layer normalization layers.
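The following sketch illustrates the LoRA update applied to a frozen d × d projection. It is a generic, minimal LoRA layer under our naming assumptions, not the authors' exact implementation; the default rank of 128 matches the LoRA dimension reported in Section 5.1.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen d x d projection plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r=128):
        super().__init__()
        d = base.in_features
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False                       # original weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(r, d) * 0.01)  # d' x d
        self.lora_b = nn.Parameter(torch.zeros(d, r))          # d x d'
        # Only 2 * d * r parameters are trained instead of d * d.

    def forward(self, x):
        return self.base(x) + x @ self.lora_a.T @ self.lora_b.T
```

In BERT4ST, this style of update is applied to the query and key projections of the self-attention layers, while the rest of the backbone stays frozen.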
The fine-tuning in the first stage aims to align the LLM with time series data by reconstructing the masked patches. Therefore, in this stage, we feed the output layer with the spatio-temporal representation of each patch from the backbone and reconstruct the masked patches. The output layer of the first stage is a fully connected layer with the dimension of (d, p), where p denotes the patch size and d is the model dimension. The fine-tuning at the first stage is achieved by minimizing the difference between the reconstructed values and the ground truth.
Fig. 5. The inputs and encodings of the BERT4ST network. We redesign a set of encodings tailored for the spatial and temporal characteristics of spatio-temporal data and sum
them as the input encoding.
Fig. 6. LoRA fine-tuning of BERT4ST. (a) LoRA fine-tuning greatly enhances the training efficiency by reducing the number of trainable parameters. (b) LoRA is applied to the query and key affine transformations of the backbone network, which improves the training efficiency.
The fine-tuning at stages 2 and 3 aims to forecast the future WP by integrating the spatio-temporal representations of a station's historical data. Thus, in the output layers of these two stages, a fully connected layer with the dimension of (N·d, l) is employed, where N is the number of historical patches and l is the length of the forecast horizon. The fine-tuning is accomplished by minimizing the difference between the forecast values and the ground truth.
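The three-stage schedule can be summarized as a choice of trainable parameter groups per stage. This is a schematic reading of the descriptions above; names such as backbone, recon_head, and forecast_head are our placeholders.

```python
def trainable_parameters(stage, backbone, recon_head, forecast_head):
    """Select which modules are updated in each fine-tuning stage.

    Stage 1: masked-patch alignment -> LoRA (query/key) and layer-norm parameters
             of the backbone, plus the (d, p) reconstruction head.
    Stage 2: linear probing -> only the (N*d, l) forecasting head.
    Stage 3: full fine-tuning -> LoRA and layer-norm parameters plus the forecasting head.
    """
    lora_and_ln = [p for name, p in backbone.named_parameters()
                   if "lora" in name or "LayerNorm" in name]
    if stage == 1:
        return lora_and_ln + list(recon_head.parameters())
    if stage == 2:
        return list(forecast_head.parameters())
    if stage == 3:
        return lora_and_ln + list(forecast_head.parameters())
    raise ValueError("stage must be 1, 2, or 3")
```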
5. Experiments

5.1. Experimental setup

In this paper, we use a wind power dataset, the Chuzhou dataset. In addition, to demonstrate the applicability of our BERT4ST to wind speed forecasting tasks, we introduce a wind speed dataset, the NorthSea dataset. The Chuzhou dataset collected historical power generation data from 12 wind power stations in the Chuzhou area of Anhui Province, China.¹ The NorthSea dataset collected wind data from 15 meteorological observation sites in the North Sea of Europe.² The distribution of the sites of both datasets is shown in Fig. 7, and the relevant data information is presented in Table 1. In our experiments, the first 70% of samples are used as the training set, the next 10% as the validation set, and the rest as the test set. In this section we perform experiments at the horizons of 10 min, 1 h, and 4 h, to verify the forecasting performance of the proposed BERT4ST.

Table 1
The details of datasets.
Dataset    #sites   Start time   End time     Time resolution   Type
Chuzhou    12       2021/1/1     2021/12/31   5 min             Wind power
NorthSea   15       2016/1/1     2018/12/31   10 min            Wind speed

¹ https://github.com/zflai/DST.
² https://frost.met.no/index.html.
³ https://huggingface.co/bert-base-uncased.

This paper adopts the pre-trained bert-base-uncased model,³ which employs a 12-layer transformer encoder with 110 million parameters and a model dimension of 768. We set the number of layers of BERT4ST to 4, i.e., we select the first 4 layers of the pre-trained BERT encoder as the backbone network. The dimension of LoRA fine-tuning is 128. At stage 1, 20% of the patches were randomly masked and 30 training epochs were chosen with an initial learning rate of 1e−3. At stage 2, 20 training epochs were chosen with an initial learning rate of 1e−4. At stage 3, 20 training epochs were chosen with an initial learning rate of 3e−6. A CosineAnnealingLR strategy was employed to adjust the learning rate during training, and the batch size is 64. At each stage, early stopping with a patience of 5 is implemented, i.e., if the loss on the validation set does not decrease for 5 consecutive epochs, the training of the current stage is terminated. The first 70% of the samples were used as the training set, the next 10% as the validation set, and the remaining 20% as the test set. We coded our BERT4ST with PyTorch and performed experiments on an NVIDIA RTX 3080 GPU. We run each experiment 5 times, and calculate the averages of the mean absolute error (MAE) and root mean square error (RMSE) as

MAE = \frac{1}{n \cdot M \cdot Q} \sum_{i=1}^{n} \sum_{j=1}^{M} \sum_{t=P+1}^{P+Q} \left| y_{i,j,t} - \hat{y}_{i,j,t} \right|, \qquad (5)

RMSE = \sqrt{\frac{1}{n \cdot M \cdot Q} \sum_{i=1}^{n} \sum_{j=1}^{M} \sum_{t=P+1}^{P+Q} \left( y_{i,j,t} - \hat{y}_{i,j,t} \right)^{2}}, \qquad (6)

where n is the number of samples, M is the number of stations, P is the historical data length, Q is the forecasting length, y denotes the true value, and ŷ denotes the forecast one.
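Eqs. (5) and (6) average the absolute and squared errors over samples, stations, and forecast steps; a direct numpy rendering (with our notation) is:

```python
import numpy as np

def mae_rmse(y_true, y_pred):
    """MAE and RMSE of Eqs. (5)-(6).

    y_true, y_pred : arrays of shape (n, M, Q)
                     n samples, M stations, Q forecast steps
    """
    err = y_true - y_pred
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    return mae, rmse
```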
Table 2
Hyperparameter settings of baseline models.
Model              d_model   n_graph   n_layer   head_num   batch_size   lr
DLinear [69]       64        –         1         –          64           0.001
LSTM [21]          32        –         2         –          64           0.001
Transformer [25]   128       –         2         8          32           0.001
Informer [28]      128       –         2         8          32           0.001
Autoformer [29]    128       –         2         8          32           0.001
STGCN [40]         32        2         2         –          64           0.001
GWNet [41]         32        3         2         –          64           0.001
MTGNN [70]         128       2         3         –          32           0.001
STID [19]          64        –         3         –          32           0.001
GPT4TS [54]        768       –         4         12         64           0.0001
LLM4TS [60]        768       –         4         12         64           0.0001

5.2. Forecasting results

We compare BERT4ST with baselines, including classic time series forecasting models, spatio-temporal forecasting models, and LLM-based models. The adopted time series forecasting models include LSTM [21], DLinear [69], Transformer [25], Informer [28] and Autoformer [29]. The adopted spatio-temporal forecasting models include GMAN [18], STGCN [40], MTGNN [70], GWNet [41] and STID [19]. Additionally, the adopted LLM-based models include GPT4TS [54] and LLM4TS [60]. We conduct a grid search on the main hyper-parameters of the baseline models, and the finally chosen hyper-parameter settings are listed in Table 2. Among the fields of Table 2, d_model denotes the dimension of models, n_graph denotes the depth of graph networks, n_layer denotes the depth of temporal networks, head_num denotes the head number of self-attention networks, batch_size denotes the batch size, and lr denotes the learning rate. Moreover, we train the models separately for different forecasting horizons. For example, we train models for 10-min-ahead forecasting with Q = 2, 1-h-ahead forecasting with Q = 12, and 4-h-ahead forecasting with Q = 48.

Table 3 shows the forecasting results of various methods. The best results are shown in bold, and the second-best results are underlined. It can be observed that our proposed BERT4ST achieves the best results in all scenarios. In addition, we observe that the models based on self-attention mechanisms for time series forecasting (Transformer, Informer, Autoformer) perform poorly due to the risk of over-learning noise. It is encouraging that all of the fine-tuned large language models (GPT4TS, LLM4TS, BERT4ST) achieve good results. Except for STID, the spatio-temporal models do not perform well, although they did well in many other fields. The main reason may lie in that the spatial correlation of WP data changes over time and there are no strong global distribution characteristics, which are usually assumed by traditional spatio-temporal models. Furthermore, BERT4ST achieves a significant performance improvement over GPT4TS and LLM4TS due to its introduction of spatio-temporal modeling. In the subsequent ablation studies, we explore the performance contributions of the detailed model designs.

We first explore the impact of spatio-temporal modeling on the forecasting performance. We implement two variants of BERT4ST, namely BERT4TS and BERT4ST-ss. To compare with the single-stage trained GPT4TS, these variants do not adopt multi-stage training. Similar to GPT4TS and LLM4TS, BERT4TS employs a channel-independent processing approach. With such a preprocessing approach, patches from different stations are treated independently, without modeling the interactions between them. The parameter settings are the same as those illustrated in Section 5.1. The experimental results are shown in Table 4, indicating that the variant BERT4TS achieves results similar to GPT4TS, while BERT4ST-ss significantly outperforms both of them. This suggests that introducing spatio-temporal correlation is crucial for WP forecasting.

Now we verify the performance improvement of multi-stage training. Table 5 compares the forecasting results of BERT4ST and its variant BERT4ST-ss on the Chuzhou dataset. The parameter settings are the same as those illustrated in Section 5.1. We record the test set performance of BERT4ST after the training of stages 2 and 3. We observe that after the training of stage 2, the performance of BERT4ST surpasses that of BERT4ST-ss, and there is further improvement after the completion of stage 3. This indicates that multi-stage training is necessary for fine-tuning large language models. Initially, self-supervised learning is employed to align the language model with time series data. Subsequently, more refined adjustments are carried out while keeping the backbone network stable and fully training the forecasting layer.

Then we investigate the impact of the spatial and temporal encoding designs on the performance improvement. We implement variants that differ in the use of temporal and spatial encodings, and present the results on the Chuzhou dataset in Table 6. The parameter settings are the same as those of Section 5.1. We observe that incorporating either temporal or spatial encoding can enhance the forecasting performance. Moreover, the performance is superior when using only temporal encoding compared to using only spatial encoding, indicating that temporal encoding plays a dominant role. Furthermore, combining both temporal and spatial encodings achieves the best performance.
Table 3
Forecasting results on the two datasets.

Chuzhou            10 min (2 steps)    1 h (12 steps)     4 h (48 steps)
                   MAE     RMSE        MAE     RMSE       MAE     RMSE
DLinear [69]       1.27    2.36        3.38    6.06       7.35    12.26
LSTM [21]          1.24    2.28        3.29    6.06       7.80    13.03
Transformer [25]   1.46    2.41        3.39    6.20       7.18    12.44
Informer [28]      1.42    2.39        3.46    6.19       7.50    12.77
Autoformer [29]    2.01    3.23        4.23    6.87       7.86    12.68
STGCN [40]         2.07    3.42        5.44    8.75       10.83   16.55
GWNet [41]         1.50    2.48        3.80    6.57       8.75    13.40
MTGNN [70]         2.87    4.29        5.26    8.12       9.41    14.94
STID [19]          1.25    2.31        3.29    5.95       7.38    12.16
GPT4TS [54]        1.37    2.48        3.28    6.13       7.05    12.42
LLM4TS [60]        1.24    2.29        3.24    6.03       7.12    12.45
BERT4ST            1.20    2.19        3.03    5.58       6.78    11.91

NorthSea           10 min (1 step)     1 h (6 steps)      4 h (24 steps)
                   MAE     RMSE        MAE     RMSE       MAE     RMSE
DLinear [69]       0.403   0.589       0.624   0.903      1.075   1.534
LSTM [21]          0.386   0.564       0.616   0.888      1.120   1.601
Transformer [25]   0.392   0.570       0.619   0.892      1.116   1.573
Informer [28]      0.401   0.581       0.625   0.897      1.238   1.551
Autoformer [29]    0.610   0.834       0.672   0.961      1.971   2.513
STGCN [40]         0.552   0.802       1.026   1.373      1.748   2.284
GWNet [41]         0.452   0.656       0.786   1.101      1.502   2.014
MTGNN [70]         0.408   0.595       0.609   0.876      1.030   1.463
STID [19]          0.387   0.576       0.612   0.885      1.068   1.523
GPT4TS [54]        0.392   0.576       0.618   0.898      1.089   1.569
LLM4TS [60]        0.389   0.573       0.615   0.892      1.085   1.564
BERT4ST            0.378   0.550       0.575   0.827      0.980   1.395

Table 4
Ablation studies on spatio-temporal modeling.

Chuzhou            10 min (2 steps)    1 h (12 steps)     4 h (48 steps)
                   MAE     RMSE        MAE     RMSE       MAE     RMSE
GPT4TS [54]        1.37    2.48        3.28    6.13       7.05    12.42
BERT4ST-ss         1.23    2.27        3.07    5.70       7.02    12.25
BERT4TS            1.25    2.32        3.21    6.00       7.14    12.45
BERT4ST            1.20    2.19        3.03    5.58       6.78    11.91

NorthSea           10 min (1 step)     1 h (6 steps)      4 h (24 steps)
                   MAE     RMSE        MAE     RMSE       MAE     RMSE
GPT4TS [54]        0.392   0.576       0.618   0.898      1.089   1.569
BERT4ST-ss         0.392   0.576       0.588   0.844      0.999   1.433
BERT4TS            0.391   0.574       0.615   0.892      1.088   1.570
BERT4ST            0.378   0.550       0.575   0.827      0.980   1.395

Table 5
Ablation studies on the multi-stage training. BERT4ST outperforms BERT4ST-ss after the training of stage 2, and there is further improvement after the completion of stage 3.

                             10 min (2 steps)    1 h (12 steps)     4 h (48 steps)
                             MAE     RMSE        MAE     RMSE       MAE     RMSE
BERT4ST-ss: single stage     1.23    2.27        3.07    5.70       7.02    12.25
BERT4ST: after stage 2       1.23    2.25        3.09    5.66       6.82    11.91
BERT4ST: after stage 3       1.20    2.19        3.03    5.58       6.78    11.91

Table 6
Ablation studies on spatial and temporal encodings. Both of them boost the performance. Temporal encoding plays a dominant role.

Temporal   Spatial   10 min (2 steps)    1 h (12 steps)     4 h (48 steps)
                     MAE     RMSE        MAE     RMSE       MAE     RMSE
✗          ✗         1.21    2.22        3.06    5.64       6.83    12.03
✓          ✗         1.20    2.21        3.06    5.62       6.83    11.97
✗          ✓         1.21    2.22        3.06    5.63       6.86    12.03
✓          ✓         1.20    2.19        3.03    5.58       6.78    11.91

In addition, we investigate the impact of different depths of the network on the forecasting performance. We conduct experiments with various depths on the Chuzhou dataset, and the other parameter settings are the same as those of Section 5.1. The results are plotted in Fig. 8. It can be observed that the model performs optimally at a depth of 4, followed by a depth of 6. However, as the depth increases further, the performance tends to decline. We infer that a model that is too deep easily over-fits.

The previous experiments focused on the forecasting accuracy over continuous time intervals. Now we aim to verify the stability of the model's performance by separately evaluating the prediction accuracy at specific time steps. We trained all models on the 4-h-ahead forecasting task and calculated their prediction accuracy at specific time steps. The results are plotted in Fig. 9, showing that BERT4ST achieved the best performance at all time steps.

6. Conclusion

In this paper, we propose the BERT4ST method, which is the first to utilize a pre-trained language model for spatio-temporal wind power (WP) forecasting. By analyzing the similarity between bidirectional attention networks and WP spatio-temporal data, we propose to employ a pre-trained BERT encoder as the backbone network to learn the individual spatial and temporal dependency of patches of WP data. To handle the spatio-temporal correlation of WP data, we also redesign a set of spatial and temporal encodings. During the fine-tuning process, we adopt a multi-stage training manner, first aligning the language model with time series data and then fine-tuning on the downstream tasks while maintaining the stability of the backbone network. Experimental results demonstrate that our BERT4ST achieves desirable performance compared to some state-of-the-art methods in most scenarios.
Fig. 8. Forecasting performance under different model depths. A 4-layer backbone is the best setting of BERT4ST.
Fig. 9. Prediction performance at different time steps. BERT4ST achieves the best performance at most time steps.
CRediT authorship contribution statement

Zefeng Lai: Methodology, Investigation, Formal analysis, Conceptualization. Tangjie Wu: Resources, Formal analysis, Data curation. Xihong Fei: Software, Resources, Project administration. Qiang Ling: Writing – review & editing, Writing – original draft, Supervision, Investigation, Funding acquisition, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Public datasets are used. The sources of these datasets are provided.

Acknowledgment

This work was supported in part by the Key Science and Technology Program of Anhui under Grant 202203f07020002, in part by the Key Common Technology Development Program of Hefei (Research on Multi-sensor Perception and Fusion Algorithms for Autonomous Driving) under Grant GJ2022GX35, and in part by the Natural Science Foundation of Hefei under Grant 2022003.
References

[1] Jonkman JM. Dynamics modeling and loads analysis of an offshore floating wind turbine. University of Colorado at Boulder; 2007.
[2] Fernandez LM, Saenz JR, Jurado F. Dynamic models of wind farms with fixed speed wind turbines. Renew Energy 2006;31(8):1203–30.
[3] Li Z, Ye L, Zhao Y, Pei M, Lu P, Li Y, Dai B. A spatiotemporal directed graph convolution network for ultra-short-term wind power prediction. IEEE Trans Sustain Energy 2022;14(1):39–54.
[4] Bentsen LØ, Warakagoda ND, Stenbro R, Engelstad P. Spatio-temporal wind speed forecasting using graph networks and novel transformer architectures. Appl Energy 2023;333:120565.
[5] Kavasseri RG, Seetharaman K. Day-ahead wind speed forecasting using f-ARIMA models. Renew Energy 2009;34(5):1388–93.
[6] Hodge B-M, Zeiler A, Brooks D, Blau G, Pekny J, Reklatis G. Improved wind power forecasting with ARIMA models. In: Computer aided chemical engineering, vol. 29, Elsevier; 2011, p. 1789–93.
[7] Liu X, Lin Z, Feng Z. Short-term offshore wind speed forecast by seasonal ARIMA-A comparison against GRU and LSTM. Energy 2021;227:120492.
[8] Haque AU, Nehrir MH, Mandal P. A hybrid intelligent model for deterministic and quantile regression approach for probabilistic wind power forecasting. IEEE Trans Power Syst 2014;29(4):1663–72.
[9] Li Y, Shi H, Han F, Duan Z, Liu H. Smart wind speed forecasting approach using various boosting algorithms, big multi-step forecasting strategy. Renew Energy 2019;135:540–53.
[10] Carneiro TC, Rocha PA, Carvalho PC, Fernández-Ramírez LM. Ridge regression ensemble of machine learning models applied to solar and wind forecasting in Brazil and Spain. Appl Energy 2022;314:118936.
[11] Nikodinoska D, Käso M, Müsgens F. Solar and wind power generation forecasts using elastic net in time-varying forecast combinations. Appl Energy 2022;306:117983.
[12] Lu P, Ye L, Zhao Y, Dai B, Pei M, Li Z. Feature extraction of meteorological factors for wind power prediction based on variable weight combined method. Renew Energy 2021;179:1925–39.
[13] Lin Y, Yang M, Wan C, Wang J, Song Y. A multi-model combination approach for probabilistic wind power forecasting. IEEE Trans Sustain Energy 2018;10(1):226–37.
[14] Raza MQ, Mithulananthan N, Li J, Lee KY, Gooi HB. An ensemble framework for day-ahead forecast of PV output power in smart grids. IEEE Trans Ind Inform 2018;15(8):4624–34.
[15] Wang J, Zhou Y, Zhang Y, Lin F, Wang J. Risk-averse optimal combining forecasts for renewable energy trading under CVaR assessment of forecast errors. IEEE Trans Power Syst 2023.
[16] Li Y, Yu R, Shahabi C, Liu Y. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. 2017, arXiv preprint arXiv:1707.01926.
[17] Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. 2016, arXiv preprint arXiv:1609.02907.
[18] Zheng C, Fan X, Wang C, Qi J. GMAN: A graph multi-attention network for traffic prediction. Proc AAAI Conf Artif Intell 2020;34(01):1234–41.
[19] Shao Z, Zhang Z, Wang F, Wei W, Xu Y. Spatial-temporal identity: A simple yet effective baseline for multivariate time series forecasting. In: Proceedings of the 31st ACM international conference on information & knowledge management. 2022, p. 4454–8.
[20] Olaofe ZO. A 5-day wind speed & power forecasts using a layer recurrent neural network (LRNN). Sustain Energy Technol Assess 2014;6:1–24.
[21] Shahid F, Zameer A, Muneeb M. A novel genetic LSTM model for wind power forecast. Energy 2021;223:120069.
[22] Ju Y, Sun G, Chen Q, Zhang M, Zhu H, Rehman MU. A model combining convolutional neural network and LightGBM algorithm for ultra-short-term wind power forecasting. IEEE Access 2019;7:28309–18.
[23] Yaghoubirad M, Azizi N, Farajollahi M, Ahmadi A. Deep learning-based multistep ahead wind speed and power generation forecasting using direct method. Energy Convers Manage 2023;281:116760.
[24] Wang Y, Chen T, Zhou S, Zhang F, Zou R, Hu Q. An improved wavenet network for multi-step-ahead wind energy forecasting. Energy Convers Manage 2023;278:116709.
[25] Yoo J, Kang U. Attention-based autoregression for accurate and efficient multivariate time series forecasting. In: Proceedings of the 2021 SIAM international conference on data mining.
[26] Sun S, Liu Y, Li Q, Wang T, Chu F. Short-term multi-step wind power forecasting based on spatio-temporal correlations and transformer neural networks. Energy Convers Manage 2023;283:116916.
[27] Child R, Gray S, Radford A, Sutskever I. Generating long sequences with sparse transformers. 2019, arXiv preprint arXiv:1904.10509.
[28] Zhou H, Zhang S, Peng J, Zhang S, Li J, Xiong H, Zhang W. Informer: Beyond efficient transformer for long sequence time-series forecasting. Proc AAAI Conf Artif Intell 2021;35(12):11106–15.
[29] Wu H, Xu J, Wang J, Long M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Adv Neural Inf Process Syst 2021;34:22419–30.
[30] Liu S, Yu H, Liao C, Li J, Lin W, Liu AX, Dustdar S. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In: International conference on learning representations. 2021.
[31] Nie Y, Nguyen NH, Sinthong P, Kalagnanam J. A time series is worth 64 words: Long-term forecasting with transformers. 2022, arXiv preprint arXiv:2211.14730.
[32] Wu B, Wang L, Zeng Y-R. Interpretable wind speed prediction with multivariate time series and temporal fusion transformers. Energy 2022;252:123990.
[33] Xiong B, Lou L, Meng X, Wang X, Ma H, Wang Z. Short-term wind power forecasting based on attention mechanism and deep learning. Electr Power Syst Res 2022;206:107776.
[34] Shilin S, Yuekai L, Qi L, Tianyang W, Fulei C. Deep learning-based multistep ahead wind speed and power generation forecasting using direct method. Energy Convers Manage 2023;283:116916.
[35] Wu T, Ling Q. Mixformer: Mixture transformer with hierarchical context for spatio-temporal wind speed forecasting. Energy Convers Manage 2024;299:117896.
[36] Fei X, Ling Q. Attention-based global and local spatial-temporal graph convolutional network for vehicle emission prediction. Neurocomputing 2023;521:41–55.
[37] Guo S, Lin Y, Feng N, Song C, Wan H. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. Proc AAAI Conf Artif Intell 2019;33(01):922–9.
[38] Chengqing Y, Guangxi Y, Chengming Y, Yu Z, Xiwei M. A multi-factor driven spatiotemporal wind power prediction model based on ensemble deep graph attention reinforcement learning networks. Energy 2023;263:126034.
[39] Pan X, Wang L, Wang Z, Huang C. Short-term wind speed forecasting based on spatial-temporal graph transformer networks. Energy 2022;253:124095.
[40] Yu B, Yin H, Zhu Z. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. 2017, arXiv preprint arXiv:1709.04875.
[41] Wu Z, Pan S, Long G, Jiang J, Zhang C. Graph wavenet for deep spatial-temporal graph modeling. 2019, arXiv preprint arXiv:1906.00121.
[42] Liu Y, Liu Q, Zhang J-W, Feng H, Wang Z, Zhou Z, Chen W. Multivariate time-series forecasting with temporal polynomial graph neural networks. In: Advances in neural information processing systems. 2022.
[43] Hu T, Wu W, Guo Q, Sun H, Shi L, Shen X. Very short-term spatial and temporal wind power forecasting: A deep learning approach. CSEE J Power Energy Syst 2019;6(2):434–43.
[44] Zhu Q, Chen J, Shi D, Zhu L, Bai X, Duan X, Liu Y. Learning temporal and spatial correlations jointly: A unified framework for wind speed prediction. IEEE Trans Sustain Energy 2019;11(1):509–23.
[45] Wang Y, Zou R, Liu F, Zhang L, Liu Q. A review of wind speed and wind power forecasting with deep neural networks. Appl Energy 2021;304:117766.
[46] Ren Y, Li Z, Xu L, Yu J. The data-based adaptive graph learning network for analysis and prediction of offshore wind speed. Energy 2023;126590.
[47] Yu X, Tang B, Zhang K. Fault diagnosis of wind turbine gearbox using a novel method of fast deep graph convolutional networks. IEEE Trans Instrum Meas 2021;70:1–14.
[48] Wang L, He Y. M2STAN: Multi-modal multi-task spatiotemporal attention network for multi-location ultra-short-term wind power multi-step predictions. Appl Energy 2022;324:119672.
[49] Pu N, Chen W, Liu Y, Bakker EM, Lew MS. Dual gaussian-based variational subspace disentanglement for visible-infrared person re-identification. In: Proceedings of the 28th ACM international conference on multimedia. 2020, p. 2149–58.
[50] Fu C, Hu Y, Wu X, Shi H, Mei T, He R. CM-NAS: Cross-modality neural architecture search for visible-infrared person re-identification. In: Proceedings of the IEEE/CVF international conference on computer vision. 2021, p. 11823–32.
[51] Huang N, Liu J, Miao Y, Zhang Q, Han J. Deep learning for visible-infrared cross-modality person re-identification: A comprehensive review. Inf Fusion 2023;91:396–411.
[52] Ghosal D, Majumder N, Mehrish A, Poria S. Text-to-audio generation using instruction-tuned LLM and latent diffusion model. 2023, arXiv preprint arXiv:2304.13731.
[53] Lyu C, Wu M, Wang L, Huang X, Liu B, Du Z, Shi S, Tu Z. Macaw-LLM: Multi-modal language modeling with image, audio, video, and text integration. 2023, arXiv preprint arXiv:2306.09093.
[54] Zhou T, Niu P, Wang X, Sun L, Jin R. One fits all: Power general time series analysis by pretrained LM. 2023, arXiv preprint arXiv:2302.11939.
[55] Gruver N, Finzi M, Qiu S, Wilson AG. Large language models are zero-shot time series forecasters. 2023, arXiv preprint arXiv:2310.07820.
[56] Xue H, Salim FD. Promptcast: A new prompt-based learning paradigm for time series forecasting. IEEE Trans Knowl Data Eng 2023.
[57] Cao D, Jia F, Arik SO, Pfister T, Zheng Y, Ye W, Liu Y. TEMPO: Prompt-based generative pre-trained transformer for time series forecasting. 2023, arXiv preprint arXiv:2310.04948.
[58] Yu X, Chen Z, Ling Y, Dong S, Liu Z, Lu Y. Temporal data meets LLM–explainable financial time series forecasting. 2023, arXiv preprint arXiv:2306.11025.
[59] Garza A, Mergenthaler-Canseco M. TimeGPT-1. 2023, arXiv preprint arXiv:2310.03589.
[60] Chang C, Peng W-C, Chen T-F. LLM4TS: Two-stage fine-tuning for time-series forecasting with pre-trained llms. 2023, arXiv preprint arXiv:2308.08469.
[61] Sun C, Li Y, Li H, Hong S. TEST: Text prototype aligned embedding to activate LLM's ability for time series. 2023, arXiv preprint arXiv:2308.08241.
[62] Jin M, Wang S, Ma L, Chu Z, Zhang JY, Shi X, Chen P-Y, Liang Y, Li Y-F, Pan S, et al. Time-LLM: Time series forecasting by reprogramming large language models. 2023, arXiv preprint arXiv:2310.01728.
[63] Devlin J, Chang M-W, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. 2018, arXiv preprint arXiv:1810.04805.
[64] Kim T, Kim J, Tae Y, Park C, Choi J-H, Choo J. Reversible instance normalization for accurate time-series forecasting against distribution shift. In: International conference on learning representations. 2021.
[65] Wang Y, Hu Q, Li L, Foley AM, Srinivasan D. Approaches to wind power curve modeling: A review and discussion. Renew Sustain Energy Rev 2019;116:109422.
[66] Fang Y, Qin Y, Luo H, Zhao F, Zeng L, Hui B, Wang C. CDGNet: A cross-time dynamic graph-based deep learning model for traffic forecasting. 2021, arXiv preprint arXiv:2112.02736.
[67] Song J, Son J, Seo D-h, Han K, Kim N, Kim S-W. ST-GAT: A spatio-temporal graph attention network for accurate traffic speed prediction. In: Proceedings of the 31st ACM international conference on information & knowledge management. 2022, p. 4500–4.
[68] Kumar A, Raghunathan A, Jones R, Ma T, Liang P. Fine-tuning can distort pretrained features and underperform out-of-distribution. 2022, arXiv preprint arXiv:2202.10054.
[69] Zeng A, Chen M, Zhang L, Xu Q. Are transformers effective for time series forecasting? Proc AAAI Conf Artif Intell 2023;37(9):11121–8.
[70] Wu Z, Pan S, Long G, Jiang J, Chang X, Zhang C. Connecting the dots: Multivariate time series forecasting with graph neural networks. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 2020, p. 753–63.