BERT4ST: Wind power forecasting
Keywords: Wind power forecasting; Spatial–temporal forecasting; Large language model

Accurate forecasting of wind power generation is essential for ensuring power safety, scheduling various energy sources, and improving energy utilization. However, the elusive nature of wind, influenced by various meteorological and geographical factors, greatly complicates the wind power forecasting task. To improve the forecasting accuracy of wind power (WP), we propose a BERT-based model for spatio-temporal forecasting (BERT4ST), which is the first approach to fine-tune a large language model for the spatio-temporal modeling of WP. To deal with the inherent characteristics of WP, BERT4ST exploits the individual spatial and temporal dependency of patches and redesigns a set of spatial and temporal encodings. By analyzing the connection between bidirectional attention networks and WP spatio-temporal data, BERT4ST employs a pre-trained BERT encoder as the backbone network to learn the individual spatial and temporal dependency of patches of WP data. Additionally, BERT4ST fine-tunes the pre-trained backbone in a multi-stage manner, i.e., first aligning the language model with the spatio-temporal data and then fine-tuning on the downstream tasks while maintaining the stability of the backbone network. Experimental results demonstrate that our BERT4ST achieves desirable performance compared to some state-of-the-art methods.
1. Introduction

With the increasing global demand for renewable energy, wind power has attracted more and more attention. However, the actual wind power generation strictly depends on the wind speed, which may change violently over time, bringing significant challenges to the safety of the power grid and the efficiency of energy utilization [1,2]. What is worse, the elusive nature of wind speed, influenced by various meteorological and geographical factors, greatly complicates the wind power forecasting task. Therefore, accurate forecasting of WP generation is essential for ensuring power safety, scheduling various energy sources, and improving energy utilization. Due to the necessity of low computational complexity and high precision in short-term wind power forecasting, it is challenging to promptly deliver numerical weather prediction (NWP) data through physical modeling methods [3,4]. Furthermore, wind speed may change very fast, and it is difficult for NWP to predict it accurately. Consequently, short-term and ultra-short-term WP forecasting mainly relies on historical data. By capturing the temporal and spatial correlations of historical WP data, we can model the evolution of WP and predict the future WP.

Previous WP forecasting research employed statistical models, such as ARIMAs [5–7], and machine learning techniques [8–10], which are easy to implement but often fall short in performance. Deep learning-based models can consistently outperform those statistical alternatives by leveraging deeper neural networks to learn intricate meteorological features. There are many deep learning-based models, such as convolutional neural networks (CNN), long short-term memory networks (LSTM), and self-attention networks. Recently, because of its strong power to represent temporal characteristics in various scenarios, self-attention has gained more and more popularity.

The above forecasting models can be combined to further improve forecasting performance. Specifically, forecast combination aims to find a suitable forecasting model pool and accurately assign appropriate weights to each model to achieve accurate and stable prediction. Nikodinoska et al. [11] applied the dynamic elastic net as a combination scheme to reduce the variance of renewable energy forecasting errors. Lu et al. [12] devised a variance-based strategy to distribute weights among individual wind power forecasting models. In [13], various models, which offer different probability distributions, were amalgamated to enhance the performance of probabilistic wind power prediction. Raza et al. [14] utilized trimmed aggregation to combine five neural networks with various structural complexities for day-ahead forecasting.
such as addressing differences between different domains, need to be taken into account. The significance of cross-modality transfer learning lies in its ability to fine-tune well-trained models for given data, enabling the adaptation of the final results to domains with insufficient training data. For instance, some studies have successfully transferred networks trained for visual tasks to infrared image recognition tasks [49–51]. Pre-trained NLP models have also demonstrated effectiveness when transferred to tasks regarding audio [52,53] and time series data [54].

Studies of large language models for time series forecasting can be categorized into three types. The first type applies pre-trained large language models (LLMs) directly to forecasting through prompt settings [55–58]. Inspired by the impressive performance of chatbots, these works leverage the language understanding capability of LLMs to analyze the characteristics of time series data. The second type aggregates diverse time series data from various domains into a large model, aiming to create a model applicable across a broad spectrum of scenarios [59]. The third type focuses on fine-tuning large language models. Zhou et al. [54] made an initial attempt to fine-tune a pre-trained GPT model and achieved optimal results in all of the time series tasks except for time series forecasting. Chang et al. [60] proposed a multi-stage fine-tuning method, first fine-tuning the backbone network to align time series and text data, and then training the downstream forecasting layer. Sun et al. [61] introduced contrastive learning to enhance the model's temporal representation capability. Jin et al. [62] suggested fine-tuning the input layer of the large language model, instead of the backbone network, to effectively transform temporal input into textual semantic input, making it compatible with pre-trained models. However, to the best of our knowledge, large language models for spatio-temporal forecasting have not been explored.

3. Preliminaries and problem formulation

3.1. Problem formulation

We consider a region with M wind power stations, and denote their locations as S = (s_1, s_2, …, s_M) ∈ R^{M×2}, where s_i ∈ R^2 represents the longitude and latitude of station i. We denote the historical wind power data as X = (x_1, x_2, …, x_M) ∈ R^{M×P}, where P is the length of the historical data, and x_i = (x_{i,1}, x_{i,2}, …, x_{i,P}) ∈ R^P represents the historical wind power data of station i. Based on the historical data X, our goal is to forecast the wind power at the next Q time steps, denoted as Ŷ = (ŷ_1, ŷ_2, …, ŷ_M) ∈ R^{M×Q}, where ŷ_i = (ŷ_{i,P+1}, ŷ_{i,P+2}, …, ŷ_{i,P+Q}) ∈ R^Q is the forecast power of station i for the upcoming Q time steps.
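For concreteness, the following Python sketch shows one common way to organize such station series into (history, target) pairs with the shapes defined above. The sliding-window sampling and the variable names are our illustrative assumptions, not details specified in this section.

```python
import numpy as np

def make_samples(series, P, Q, stride=1):
    """Build (history, target) pairs from station series.

    series : array of shape (M, T), wind power of M stations over T time steps
    P      : length of the historical window (input X, shape (M, P))
    Q      : forecasting horizon (target Y, shape (M, Q))
    """
    M, T = series.shape
    X, Y = [], []
    for start in range(0, T - P - Q + 1, stride):
        X.append(series[:, start:start + P])          # historical data X
        Y.append(series[:, start + P:start + P + Q])  # future wind power Y
    return np.stack(X), np.stack(Y)  # shapes: (n, M, P) and (n, M, Q)
```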
3.2. Bidirectional encoder representations from transformers (BERT)

BERT is a pre-trained language representation model. Different from traditional methods that involve unidirectional language models or shallow concatenation of two unidirectional models for pre-training [63], BERT leverages a masked language model (MLM) to generate comprehensive bidirectional language representations. After pre-training, an extra output layer is added to BERT for performance improvement across various downstream tasks, without necessitating task-specific structural modifications to BERT.

The backbone of the BERT network comprises a stacked transformer encoder, which utilizes layers of self-attention mechanisms to capture intricate associations between tokens and sentences and can thereby facilitate language understanding. As shown in Fig. 1, the input to the BERT network includes the token ID, position ID, and sentence ID. Accordingly, the embedding layer generates the token embeddings, position embeddings, and sentence embeddings, and sums them to create a representation of the input data. During the pre-training of BERT, a masked language model (MLM) mechanism is applied, which randomly masks some input tokens with a certain probability and predicts the original tokens.

3.3. Individual spatial and temporal dependency of patches

Fig. 2 illustrates three types of feature dependency in spatio-temporal modeling. Traditional spatio-temporal data forecasting relies on constructing or learning a graph that represents the inter-site correlations at a specific time moment, and models dependencies in the temporal domain and spatial domain. However, recent research on modeling efficiency has led to subsequent attempts to avoid over-learning of spatial correlations in a channel-independent manner. Wind power data, as a special type of spatio-temporal data, may exhibit rich correlations between different time moments and stations due to the influence of atmospheric flow. For example, if the wind flows from station D to station A, then we can infer the wind speed or wind power of station A from the historical data of station D. Moreover, these correlations between stations constantly change. Therefore, this paper takes a BERT backbone network to model the individual spatial and temporal dependency [64] of patches of wind data.

3.4. The relationship between wind speed and wind power

Wind power (WP) is generated by wind turbines, where the wind is deflected by the turbine blades, converted into mechanical energy through rotation, and then used to drive generators to produce electricity [65]. The power generation of a wind turbine can be described by the following wind power function, which illustrates the relationship between the turbine output power and wind speed (WS) [65],

P(v) = \begin{cases} 0, & v < v_{in} \ \text{or} \ v > v_{out} \\ \rho A C_p v^{3}/2, & v_{in} \le v \le v_{rated} \\ P_{rated}, & v_{rated} < v \le v_{out} \end{cases} \qquad (1)

where P(v) is the turbine output power at the WS of v, P_rated is the rated WP, ρ is the air density, A is the sweeping area of an impeller, C_p is the WP coefficient, and v_in, v_rated, v_out denote the cut-in, rated, and cut-out WS, respectively. From (1), we observe that WP is strongly correlated with WS, so the forecasting models of wind power may share similar structures with the forecasting models of wind speed. Of course, the trained models of wind power forecasting and wind speed forecasting may have different parameters.
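The piecewise power curve in (1) can be written as a short function. The following sketch is for illustration only; the default parameter values are hypothetical and not taken from this paper.

```python
import numpy as np

def wind_power_curve(v, rho=1.225, area=5000.0, cp=0.4,
                     v_in=3.0, v_rated=12.0, v_out=25.0):
    """Piecewise turbine power curve of Eq. (1).

    v       : wind speed (m/s), scalar or numpy array
    rho     : air density (kg/m^3)
    area    : rotor sweeping area (m^2)
    cp      : power coefficient
    v_in, v_rated, v_out : cut-in, rated, and cut-out wind speeds (m/s)
    """
    v = np.asarray(v, dtype=float)
    p_rated = 0.5 * rho * area * cp * v_rated ** 3           # rated power
    power = 0.5 * rho * area * cp * v ** 3                   # cubic region
    power = np.where(v > v_rated, p_rated, power)            # rated region
    power = np.where((v < v_in) | (v > v_out), 0.0, power)   # outside operating range
    return power
```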
Fig. 2. Three types of feature dependency in spatio-temporal modeling. (a) Traditional spatio-temporal models tend to use spatial and temporal dependency mechanisms to model
the data characteristics in the spatial and temporal domains, respectively. (b) Recent research applies channel independence to avoid the over-learning of the interactions between
channels. (c) In this paper, we apply individual spatial and temporal dependency of patches to accomplish the wind power forecasting task, a special case of spatio-temporal
modeling.
Fig. 3. The framework of our proposed BERT4ST. Stage 1: supervised fine-tuning uses the MLM approach to align the backbone model with time-series data. Stage 2: downstream fine-tuning for forecasting with initial linear probing. Stage 3: full fine-tuning for forecasting.
4. Methodology

The framework of our proposed BERT4ST is shown in Fig. 3. We perform three-stage fine-tuning on BERT4ST. In the first stage, we apply the MLM mechanism to align the backbone model with spatio-temporal data. The latter stages perform downstream fine-tuning of BERT4ST. Particularly, the second stage trains the output layer and the third stage fine-tunes all of the modules.

The flowchart of BERT4ST is shown in Fig. 4. We first collect and preprocess the source data. Then we implement the spatial–temporal modeling with a pre-trained BERT network. To further improve the forecasting accuracy, we implement a three-stage fine-tuning.

4.1. Training stages

We use pre-trained BERT as the backbone network because of its bidirectional structure, which is suitable for aligning the large language model with spatio-temporal data. It is observed that in spatio-temporal data, the state of a specific station at a given time moment is not only related to other time moments at the same station but also to different time moments at other stations [66,67]. Therefore, when encoding spatio-temporal correlations, it is necessary to consider the forecasting performance of different stations at different time intervals. As illustrated in Fig. 3, we represent the input with a combination of station ID and patch ID. For example, A1 represents the first patch of station A, A2 represents the second patch of station A, and B1 represents the first patch of station B. Subsequently, the model learns rich spatio-temporal information through the stacked self-attention layers.
Fig. 4. The flowchart of BERT4ST. We collect source data and preprocess it. Then we implement the spatial–temporal modeling with a pre-trained BERT network. To achieve
more accurate forecasting, we implement a three-stage fine-tuning.
Suppose there are M stations, and we set the patch size as p and the historical data length as P. Therefore, the historical data of each station is segmented into N = P/p patches. Consequently, a total of M × N patches are taken as the input tokens. Then we randomly mask these tokens at a specific probability, where the values of the masked tokens are set to 0. In the first stage, the backbone network is applied to reconstruct the values of these masked tokens. As shown in Fig. 3(a), we mask the patches A3, B2 and C5, and fine-tune the backbone network's parameters by minimizing the MSE loss between the reconstructed values and the true values of the masked patches.

After alignment, the backbone is further fine-tuned on the downstream forecasting task. As shown in [68], performing initial linear probing and later full fine-tuning can achieve optimal results. Thus, our downstream fine-tuning is also conducted in two stages. We initially perform linear probing in the second stage, training only the final linear layer. In the third stage, we conduct complete fine-tuning to adjust all modules. In these two stages, the task involves collecting the spatio-temporal encodings of all patches and producing future WP forecasts through a fully connected layer.
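To make the patch tokenization and stage-1 masking concrete, here is a minimal PyTorch-style sketch. The function names are ours; setting masked values to 0 and the 20% masking ratio follow the descriptions in this paper.

```python
import torch

def patchify(x, p):
    """Split station series into patches.

    x : tensor of shape (M, P), historical data of M stations
    p : patch size; assumes P is divisible by p
    returns a tensor of shape (M, N, p) with N = P // p
    """
    M, P = x.shape
    return x.reshape(M, P // p, p)

def mask_patches(patches, mask_ratio=0.2):
    """Randomly mask patches (set their values to 0) for stage-1 alignment."""
    M, N, p = patches.shape
    mask = torch.rand(M, N) < mask_ratio      # True where a patch is masked
    masked = patches.clone()
    masked[mask] = 0.0
    return masked, mask

# Stage-1 objective: reconstruct only the masked patches with an MSE loss, e.g.
# recon = model(masked)                       # hypothetical backbone + output layer
# loss = ((recon[mask] - patches[mask]) ** 2).mean()
```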
4.2. Patching and instance normalization

Statistical properties, such as the mean and variance, often vary over time in time series data, leading to a distribution shift problem. This temporal distribution shift poses a significant challenge for accurate time series forecasting. To tackle this issue, Kim et al. [64] introduced a straightforward yet powerful normalization method, known as Reversible Instance Normalization (RevIN). RevIN is a widely applicable normalization-and-denormalization technique to address the challenges posed by temporal distribution shifts. In the instance normalization of a time series, the temporal average and standard deviation are calculated, and the time series is scaled into a standard normal distribution. In the instance denormalization, the output of the forecasting layer is transformed back to obtain the final result.

When self-attention models are applied to long time series, there is an over-fitting risk to noise [69]. To avoid that risk, Nie et al. [31] proposed a method to independently patch the time series of each channel, which has become a common practice in time series learning. For each station, the data series is split into patches, each of which contains the time series information of a short interval. In this way, the input data becomes a sequence of time intervals, rather than time steps. The subsequent backbone network avoids the over-fitting risk to noise while enhancing the ability to model time series at a larger scale. We also adopt this patching mechanism in this paper. However, in contrast to the channel-independent patching of [31], we combine the information of all channels to capture the complex spatio-temporal correlations of WP data.
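The instance normalization and denormalization described above can be sketched as follows. This is a simplified view of RevIN [64] without its optional learnable affine parameters, and the class and variable names are ours.

```python
import torch

class SimpleRevIN:
    """Per-instance normalization and denormalization along the time axis."""

    def __init__(self, eps=1e-5):
        self.eps = eps

    def normalize(self, x):
        # x: tensor of shape (batch, stations, time)
        self.mean = x.mean(dim=-1, keepdim=True)
        self.std = x.std(dim=-1, keepdim=True) + self.eps
        return (x - self.mean) / self.std

    def denormalize(self, y):
        # y: forecasting-layer output with the same leading dimensions as x
        return y * self.std + self.mean
```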
4.3. Input encoding layer

Appropriate time series encoding is crucial for aligning the LLM with time series data. Unlike the BERT model, which employs pre-trained embedding layers to separately map the token ID, position ID, and sentence ID, we redesign a set of encodings tailored for the spatial and temporal characteristics of spatio-temporal data, as shown in Fig. 5. The geographical coordinate of a station is denoted as s = (x, y) ∈ R^2, the starting time of the current patch is denoted as [hh:mm], where 0 ≤ hh < 24 is the hour value in a day and 0 ≤ mm < 60 is the minute value in an hour, and the patch values are denoted as v = (v_1, v_2, …, v_p) ∈ R^p, where p is the length of each patch. We calculate the patch encoding E_v, the temporal encoding E_t and the position encoding E_s as

E_v = v \cdot W_v, \qquad (2)

E_t = \left( \sin\frac{2\pi(hh \cdot 60 + mm)}{24 \times 60}, \; \cos\frac{2\pi(hh \cdot 60 + mm)}{24 \times 60} \right) \cdot W_t, \qquad (3)

E_s = s \cdot W_s, \qquad (4)

where W_v ∈ R^{p×d}, W_t ∈ R^{2×d}, W_s ∈ R^{2×d} are trainable parameters, and d is the dimension of the pre-trained model. We set the temporal encoding in (3) to ensure that it takes one day as a cycle and that correlations exist between nearby time moments. Then we sum the three encodings as the input of the backbone network.
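A compact sketch of Eqs. (2)-(4) as a PyTorch module is given below; the module and argument names are ours, and summing the three encodings follows the text above.

```python
import math
import torch
import torch.nn as nn

class SpatioTemporalEncoding(nn.Module):
    """Patch, temporal, and position encodings of Eqs. (2)-(4), summed as the input."""

    def __init__(self, patch_size, d_model):
        super().__init__()
        self.w_v = nn.Linear(patch_size, d_model, bias=False)  # W_v in Eq. (2)
        self.w_t = nn.Linear(2, d_model, bias=False)            # W_t in Eq. (3)
        self.w_s = nn.Linear(2, d_model, bias=False)            # W_s in Eq. (4)

    def forward(self, patch_values, start_hh, start_mm, coords):
        # patch_values: (..., patch_size); start_hh, start_mm: float tensors (...,);
        # coords: (..., 2) longitude/latitude of the station
        minutes = start_hh * 60.0 + start_mm
        phase = 2.0 * math.pi * minutes / (24.0 * 60.0)          # one-day cycle
        e_v = self.w_v(patch_values)
        e_t = self.w_t(torch.stack([torch.sin(phase), torch.cos(phase)], dim=-1))
        e_s = self.w_s(coords)
        return e_v + e_t + e_s                                    # input to the backbone
```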
4.4. Backbone and the output layer

We employ BERT as the backbone network out of the consideration of spatio-temporal modeling. When fine-tuning the backbone network, we utilize Low-Rank Adaptation (LoRA). As plotted in Fig. 6(a), LoRA utilizes the product of two low-rank matrices to update the original high-rank weight matrix. Denote the dimension of the backbone as d. Then the original weight matrix has the size of d × d. We set the dimension of LoRA to d′ (d′ ≪ d), and the sizes of the two low-rank matrices are d × d′ and d′ × d. Therefore, we only need to train 2·d·d′ parameters, rather than the original d² ones, which greatly enhances the training efficiency.

As shown in Figs. 3 and 6(b), the affine transformations and feed-forward layers (FFN) of the original backbone network are frozen, while in stages 1 and 3, fine-tuning is applied to the LoRA networks on Query and Key, as well as to the layer normalization layers.
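The following sketch illustrates the LoRA update applied to a frozen d × d projection. It is a generic, minimal LoRA layer under our naming assumptions, not the authors' exact implementation; the default rank of 128 matches the LoRA dimension reported in Section 5.1.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen d x d projection plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r=128):
        super().__init__()
        d = base.in_features
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False                       # original weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(r, d) * 0.01)  # d' x d
        self.lora_b = nn.Parameter(torch.zeros(d, r))          # d x d'
        # Only 2 * d * r parameters are trained instead of d * d.

    def forward(self, x):
        return self.base(x) + x @ self.lora_a.T @ self.lora_b.T
```

In BERT4ST, this style of update is applied to the query and key projections of the self-attention layers, while the rest of the backbone stays frozen.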
The fine-tuning in the first stage aims to align the LLM with time series data by reconstructing the masked patches. Therefore, in this stage, we feed the output layer with the spatio-temporal representation of each patch from the backbone and reconstruct the masked patches. The output layer of the first stage is a fully connected layer with the dimension of (d, p), where p denotes the patch size and d is the model dimension. The fine-tuning at the first stage is achieved by minimizing the difference between the reconstructed values and the ground truth.
Fig. 5. The inputs and encodings of the BERT4ST network. We redesign a set of encodings tailored for the spatial and temporal characteristics of spatio-temporal data and sum
them as the input encoding.
Fig. 6. LoRA fine-tuning of BERT4ST. (a) LoRA fine-tuning greatly enhances the training efficiency by reducing the number of trainable parameters. (b) LoRA is applied to the query and key affine transformations of the backbone network, which improves the training efficiency.
The fine-tuning at stages 2 and 3 aims to forecast the future WP by integrating the spatio-temporal representations of a station's historical data. Thus, in the output layers of these two stages, a fully connected layer with the dimension of (N·d, l) is employed, where N is the number of historical patches and l is the length of the forecast horizon. The fine-tuning is accomplished by minimizing the difference between the forecast values and the ground truth.
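The three-stage schedule can be summarized as a choice of trainable parameter groups per stage. This is a schematic reading of the descriptions above; names such as backbone, recon_head, and forecast_head are our placeholders.

```python
def trainable_parameters(stage, backbone, recon_head, forecast_head):
    """Select which modules are updated in each fine-tuning stage.

    Stage 1: masked-patch alignment -> LoRA (query/key) and layer-norm parameters
             of the backbone, plus the (d, p) reconstruction head.
    Stage 2: linear probing -> only the (N*d, l) forecasting head.
    Stage 3: full fine-tuning -> LoRA and layer-norm parameters plus the forecasting head.
    """
    lora_and_ln = [p for name, p in backbone.named_parameters()
                   if "lora" in name or "LayerNorm" in name]
    if stage == 1:
        return lora_and_ln + list(recon_head.parameters())
    if stage == 2:
        return list(forecast_head.parameters())
    if stage == 3:
        return lora_and_ln + list(forecast_head.parameters())
    raise ValueError("stage must be 1, 2, or 3")
```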
5. Experiments

5.1. Experimental setup

In this paper, we use a wind power dataset, the Chuzhou dataset. In addition, to demonstrate the applicability of our BERT4ST to wind speed forecasting tasks, we introduce a wind speed dataset, the NorthSea dataset. The Chuzhou dataset collected historical power generation data from 12 wind power stations in the Chuzhou area of Anhui Province, China.¹ The NorthSea dataset collected wind data from 15 meteorological observation sites in the North Sea of Europe.² The distribution of the sites of both datasets is shown in Fig. 7, and the relevant data information is presented in Table 1. In our experiments, the first 70% of samples are used as the training set, the next 10% as the validation set, and the rest as the test set. In this section we perform experiments at the horizons of 10 min, 1 h, and 4 h, to verify the forecasting performance of the proposed BERT4ST.

Table 1
The details of datasets.
Dataset    #sites   Start time   End time     Time resolution   Type
Chuzhou    12       2021/1/1     2021/12/31   5 min             Wind power
NorthSea   15       2016/1/1     2018/12/31   10 min            Wind speed

¹ https://github.com/zflai/DST.
² https://frost.met.no/index.html.
³ https://huggingface.co/bert-base-uncased.

This paper adopts the pre-trained bert-base-uncased model,³ which employs a 12-layer transformer encoder with 110 million parameters and a model dimension of 768. We set the number of layers of BERT4ST to 4, i.e., we select the first 4 layers of the pre-trained BERT encoder as the backbone network. The dimension of LoRA fine-tuning is 128. At stage 1, 20% of the patches were randomly masked and 30 training epochs were chosen with an initial learning rate of 1e−3. At stage 2, 20 training epochs were chosen with an initial learning rate of 1e−4. At stage 3, 20 training epochs were chosen with an initial learning rate of 3e−6. A CosineAnnealingLR strategy was employed to adjust the learning rate during training, and the batch size is 64. At each stage, early stopping with a patience of 5 is implemented, i.e., if the loss on the validation set does not decrease for 5 consecutive epochs, the training of the current stage is terminated. The first 70% of the samples were used as the training set, the next 10% as the validation set, and the remaining 20% as the test set. We coded our BERT4ST with PyTorch and performed experiments on an NVIDIA RTX 3080 GPU. We run each experiment 5 times, and calculate the averages of the mean absolute error (MAE) and root mean square error (RMSE) as

MAE = \frac{1}{n \cdot M \cdot Q} \sum_{i=1}^{n} \sum_{j=1}^{M} \sum_{t=P+1}^{P+Q} \left| y_{i,j,t} - \hat{y}_{i,j,t} \right|, \qquad (5)

RMSE = \sqrt{\frac{1}{n \cdot M \cdot Q} \sum_{i=1}^{n} \sum_{j=1}^{M} \sum_{t=P+1}^{P+Q} \left( y_{i,j,t} - \hat{y}_{i,j,t} \right)^{2}}, \qquad (6)

where n is the number of samples, M is the number of stations, P is the historical data length, Q is the forecasting length, y denotes the true value, and ŷ denotes the forecast one.
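Eqs. (5) and (6) average the absolute and squared errors over samples, stations, and forecast steps; a direct numpy rendering (with our notation) is:

```python
import numpy as np

def mae_rmse(y_true, y_pred):
    """MAE and RMSE of Eqs. (5)-(6).

    y_true, y_pred : arrays of shape (n, M, Q)
                     n samples, M stations, Q forecast steps
    """
    err = y_true - y_pred
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    return mae, rmse
```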
Table 2
Hyperparameter settings of baseline models.
Model              d_model   n_graph   n_layer   head_num   batch_size   lr
DLinear [69]       64        –         1         –          64           0.001
LSTM [21]          32        –         2         –          64           0.001
Transformer [25]   128       –         2         8          32           0.001
Informer [28]      128       –         2         8          32           0.001
Autoformer [29]    128       –         2         8          32           0.001
STGCN [40]         32        2         2         –          64           0.001
GWNet [41]         32        3         2         –          64           0.001
MTGNN [70]         128       2         3         –          32           0.001
STID [19]          64        –         3         –          32           0.001
GPT4TS [54]        768       –         4         12         64           0.0001
LLM4TS [60]        768       –         4         12         64           0.0001

5.2. Forecasting results

We compare BERT4ST with baselines, including classic time series forecasting models, spatio-temporal forecasting models, and LLM-based models. The adopted time series forecasting models include LSTM [21], DLinear [69], Transformer [25], Informer [28] and Autoformer [29]. The adopted spatio-temporal forecasting models include GMAN [18], STGCN [40], MTGNN [70], GWNet [41] and STID [19]. Additionally, the adopted LLM-based models include GPT4TS [54] and LLM4TS [60]. We conduct a grid search on the main hyper-parameters of the baseline models, and the finally chosen hyper-parameter settings are listed in Table 2. Among the fields of Table 2, d_model denotes the dimension of models, n_graph denotes the depth of graph networks, n_layer denotes the depth of temporal networks, head_num denotes the head number of self-attention networks, batch_size denotes the batch size, and lr denotes the learning rate. Moreover, we train the models separately for different forecasting horizons. For example, we train models for 10-min-ahead forecasting with Q = 2, 1-h-ahead forecasting with Q = 12, and 4-h-ahead forecasting with Q = 48.

Table 3 shows the forecasting results of various methods. The best results are shown in bold, and the second-best results are underlined. It can be observed that our proposed BERT4ST achieves the best results in all scenarios. In addition, we observe that the models based on self-attention mechanisms for time series forecasting (Transformer, Informer, Autoformer) perform poorly due to the risk of over-learning noise. It is encouraging that all of the fine-tuned large language models (GPT4TS, LLM4TS, BERT4ST) achieve good results. Except for STID, the spatio-temporal models do not perform well, although they did well in many other fields. The main reason may lie in that the spatial correlation of WP data changes over time and there are no strong global distribution characteristics, which are usually assumed by traditional spatio-temporal models. Furthermore, BERT4ST achieves a significant performance improvement over GPT4TS and LLM4TS due to its introduction of spatio-temporal modeling. In the subsequent ablation studies, we explore the performance contributions of the detailed model designs.

We first explore the impact of spatio-temporal modeling on the forecasting performance. We implement two variants of BERT4ST, namely BERT4TS and BERT4ST-ss. To compare with the single-stage trained GPT4TS, these variants do not adopt multi-stage training. Similar to GPT4TS and LLM4TS, BERT4TS employs a channel-independent processing approach. With such a preprocessing approach, patches from different stations are treated independently, without modeling the interactions between them. The parameter settings are the same as those illustrated in Section 5.1. The experimental results are shown in Table 4, indicating that the variant BERT4TS achieves results similar to GPT4TS, while BERT4ST-ss significantly outperforms both of them. This suggests that introducing spatio-temporal correlation is crucial for WP forecasting.

Now we verify the performance improvement of multi-stage training. Table 5 compares the forecasting results of BERT4ST and its variant BERT4ST-ss on the Chuzhou dataset. The parameter settings are the same as those illustrated in Section 5.1. We record the test set performance of BERT4ST after the training of stages 2 and 3. We observe that after the training of stage 2, the performance of BERT4ST surpasses that of BERT4ST-ss, and there is further improvement after the completion of stage 3. This indicates that multi-stage training is necessary for fine-tuning large language models. Initially, self-supervised learning is employed to align the language model with time series data. Subsequently, more refined adjustments are carried out while keeping the backbone network stable and fully training the forecasting layer.

Then we investigate the impact of the spatial and temporal encoding designs on the performance improvement. We implement variants that differ in the use of temporal and spatial encodings, and present the results on the Chuzhou dataset in Table 6. The parameter settings are the same as those of Section 5.1. We observe that incorporating either temporal or spatial encoding can enhance the forecasting performance. Moreover, the performance is superior when using only temporal encoding compared to using only spatial encoding, indicating that temporal encoding plays a dominant role. Furthermore, combining both temporal and spatial encodings achieves the best performance.
Table 3
Forecasting results on the two datasets.

Chuzhou            10 min (2 steps)    1 h (12 steps)     4 h (48 steps)
                   MAE     RMSE        MAE     RMSE       MAE     RMSE
DLinear [69]       1.27    2.36        3.38    6.06       7.35    12.26
LSTM [21]          1.24    2.28        3.29    6.06       7.80    13.03
Transformer [25]   1.46    2.41        3.39    6.20       7.18    12.44
Informer [28]      1.42    2.39        3.46    6.19       7.50    12.77
Autoformer [29]    2.01    3.23        4.23    6.87       7.86    12.68
STGCN [40]         2.07    3.42        5.44    8.75       10.83   16.55
GWNet [41]         1.50    2.48        3.80    6.57       8.75    13.40
MTGNN [70]         2.87    4.29        5.26    8.12       9.41    14.94
STID [19]          1.25    2.31        3.29    5.95       7.38    12.16
GPT4TS [54]        1.37    2.48        3.28    6.13       7.05    12.42
LLM4TS [60]        1.24    2.29        3.24    6.03       7.12    12.45
BERT4ST            1.20    2.19        3.03    5.58       6.78    11.91

NorthSea           10 min (1 step)     1 h (6 steps)      4 h (24 steps)
                   MAE     RMSE        MAE     RMSE       MAE     RMSE
DLinear [69]       0.403   0.589       0.624   0.903      1.075   1.534
LSTM [21]          0.386   0.564       0.616   0.888      1.120   1.601
Transformer [25]   0.392   0.570       0.619   0.892      1.116   1.573
Informer [28]      0.401   0.581       0.625   0.897      1.238   1.551
Autoformer [29]    0.610   0.834       0.672   0.961      1.971   2.513
STGCN [40]         0.552   0.802       1.026   1.373      1.748   2.284
GWNet [41]         0.452   0.656       0.786   1.101      1.502   2.014
MTGNN [70]         0.408   0.595       0.609   0.876      1.030   1.463
STID [19]          0.387   0.576       0.612   0.885      1.068   1.523
GPT4TS [54]        0.392   0.576       0.618   0.898      1.089   1.569
LLM4TS [60]        0.389   0.573       0.615   0.892      1.085   1.564
BERT4ST            0.378   0.550       0.575   0.827      0.980   1.395

Table 4
Ablation studies on spatio-temporal modeling.

Chuzhou            10 min (2 steps)    1 h (12 steps)     4 h (48 steps)
                   MAE     RMSE        MAE     RMSE       MAE     RMSE
GPT4TS [54]        1.37    2.48        3.28    6.13       7.05    12.42
BERT4ST-ss         1.23    2.27        3.07    5.70       7.02    12.25
BERT4TS            1.25    2.32        3.21    6.00       7.14    12.45
BERT4ST            1.20    2.19        3.03    5.58       6.78    11.91

NorthSea           10 min (1 step)     1 h (6 steps)      4 h (24 steps)
                   MAE     RMSE        MAE     RMSE       MAE     RMSE
GPT4TS [54]        0.392   0.576       0.618   0.898      1.089   1.569
BERT4ST-ss         0.392   0.576       0.588   0.844      0.999   1.433
BERT4TS            0.391   0.574       0.615   0.892      1.088   1.570
BERT4ST            0.378   0.550       0.575   0.827      0.980   1.395

Table 5
Ablation studies on the multi-stage training. BERT4ST outperforms BERT4ST-ss after the training of stage 2, and there is further improvement after the completion of stage 3.

                             10 min (2 steps)    1 h (12 steps)     4 h (48 steps)
                             MAE     RMSE        MAE     RMSE       MAE     RMSE
BERT4ST-ss: single stage     1.23    2.27        3.07    5.70       7.02    12.25
BERT4ST: after stage 2       1.23    2.25        3.09    5.66       6.82    11.91
BERT4ST: after stage 3       1.20    2.19        3.03    5.58       6.78    11.91

Table 6
Ablation studies on spatial and temporal encodings. Both of them boost the performance. Temporal encoding plays a dominant role.

Temporal   Spatial   10 min (2 steps)    1 h (12 steps)     4 h (48 steps)
                     MAE     RMSE        MAE     RMSE       MAE     RMSE
✗          ✗         1.21    2.22        3.06    5.64       6.83    12.03
✓          ✗         1.20    2.21        3.06    5.62       6.83    11.97
✗          ✓         1.21    2.22        3.06    5.63       6.86    12.03
✓          ✓         1.20    2.19        3.03    5.58       6.78    11.91

In addition, we investigate the impact of different depths of the network on the forecasting performance. We conduct experiments with various depths on the Chuzhou dataset, and the other parameter settings are the same as those of Section 5.1. The results are plotted in Fig. 8. It can be observed that the model performs optimally at a depth of 4, followed by a depth of 6. However, as the depth increases further, the performance tends to decline. We infer that a model that is too deep easily over-fits.

The previous experiments focused on the forecasting accuracy over continuous time intervals. Now we aim to verify the stability of the model's performance by separately evaluating the prediction accuracy at specific time steps. We trained all models on the 4-h-ahead forecasting task and calculated their prediction accuracy at specific time steps. The results are plotted in Fig. 9, showing that BERT4ST achieved the best performance at all time steps.

6. Conclusion

In this paper, we propose the BERT4ST method, which is the first to utilize a pre-trained language model for spatio-temporal wind power (WP) forecasting. By analyzing the similarity between bidirectional attention networks and WP spatio-temporal data, we propose to employ a pre-trained BERT encoder as the backbone network to learn the individual spatial and temporal dependency of patches of WP data. To handle the spatio-temporal correlation of WP data, we also redesign a set of spatial and temporal encodings. During the fine-tuning process, we adopt a multi-stage training manner, first aligning the language model with time series data and then fine-tuning on the downstream tasks while maintaining the stability of the backbone network. Experimental results demonstrate that our BERT4ST achieves desirable performance compared to some state-of-the-art methods in most scenarios.
Fig. 8. Forecasting performance under different model depths. A 4-layer backbone is the best setting of BERT4ST.
Fig. 9. Prediction performance at different time steps. BERT4ST achieves the best performance at most time steps.
CRediT authorship contribution statement

Zefeng Lai: Methodology, Investigation, Formal analysis, Conceptualization. Tangjie Wu: Resources, Formal analysis, Data curation. Xihong Fei: Software, Resources, Project administration. Qiang Ling: Writing – review & editing, Writing – original draft, Supervision, Investigation, Funding acquisition, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Public datasets are used. The sources of these datasets are provided.

Acknowledgment

This work was supported in part by the Key Science and Technology Program of Anhui under Grant 202203f07020002, in part by the Key Common Technology Development Program of Hefei (Research on Multi-sensor Perception and Fusion Algorithms for Autonomous Driving) under Grant GJ2022GX35, and in part by the Natural Science Foundation of Hefei under Grant 2022003.
References

[1] Jonkman JM. Dynamics modeling and loads analysis of an offshore floating wind turbine. University of Colorado at Boulder; 2007.
[2] Fernandez LM, Saenz JR, Jurado F. Dynamic models of wind farms with fixed speed wind turbines. Renew Energy 2006;31(8):1203–30.
[3] Li Z, Ye L, Zhao Y, Pei M, Lu P, Li Y, Dai B. A spatiotemporal directed graph convolution network for ultra-short-term wind power prediction. IEEE Trans Sustain Energy 2022;14(1):39–54.
[4] Bentsen LØ, Warakagoda ND, Stenbro R, Engelstad P. Spatio-temporal wind speed forecasting using graph networks and novel transformer architectures. Appl Energy 2023;333:120565.
[5] Kavasseri RG, Seetharaman K. Day-ahead wind speed forecasting using f-ARIMA models. Renew Energy 2009;34(5):1388–93.
[6] Hodge B-M, Zeiler A, Brooks D, Blau G, Pekny J, Reklatis G. Improved wind power forecasting with ARIMA models. In: Computer aided chemical engineering, vol. 29, Elsevier; 2011, p. 1789–93.
[7] Liu X, Lin Z, Feng Z. Short-term offshore wind speed forecast by seasonal ARIMA-A comparison against GRU and LSTM. Energy 2021;227:120492.
[8] Haque AU, Nehrir MH, Mandal P. A hybrid intelligent model for deterministic and quantile regression approach for probabilistic wind power forecasting. IEEE Trans Power Syst 2014;29(4):1663–72.
[9] Li Y, Shi H, Han F, Duan Z, Liu H. Smart wind speed forecasting approach using various boosting algorithms, big multi-step forecasting strategy. Renew Energy 2019;135:540–53.
[10] Carneiro TC, Rocha PA, Carvalho PC, Fernández-Ramírez LM. Ridge regression ensemble of machine learning models applied to solar and wind forecasting in Brazil and Spain. Appl Energy 2022;314:118936.
[11] Nikodinoska D, Käso M, Müsgens F. Solar and wind power generation forecasts using elastic net in time-varying forecast combinations. Appl Energy 2022;306:117983.
[12] Lu P, Ye L, Zhao Y, Dai B, Pei M, Li Z. Feature extraction of meteorological factors for wind power prediction based on variable weight combined method. Renew Energy 2021;179:1925–39.
[13] Lin Y, Yang M, Wan C, Wang J, Song Y. A multi-model combination approach for probabilistic wind power forecasting. IEEE Trans Sustain Energy 2018;10(1):226–37.
[14] Raza MQ, Mithulananthan N, Li J, Lee KY, Gooi HB. An ensemble framework for day-ahead forecast of PV output power in smart grids. IEEE Trans Ind Inform 2018;15(8):4624–34.
[15] Wang J, Zhou Y, Zhang Y, Lin F, Wang J. Risk-averse optimal combining forecasts for renewable energy trading under CVaR assessment of forecast errors. IEEE Trans Power Syst 2023.
[16] Li Y, Yu R, Shahabi C, Liu Y. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. 2017, arXiv preprint arXiv:1707.01926.
[17] Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. 2016, arXiv preprint arXiv:1609.02907.
[18] Zheng C, Fan X, Wang C, Qi J. GMAN: A graph multi-attention network for traffic prediction. Proc AAAI Conf Artif Intell 2020;34(01):1234–41.
[19] Shao Z, Zhang Z, Wang F, Wei W, Xu Y. Spatial-temporal identity: A simple yet effective baseline for multivariate time series forecasting. In: Proceedings of the 31st ACM international conference on information & knowledge management. 2022, p. 4454–8.
[20] Olaofe ZO. A 5-day wind speed & power forecasts using a layer recurrent neural network (LRNN). Sustain Energy Technol Assess 2014;6:1–24.
[21] Shahid F, Zameer A, Muneeb M. A novel genetic LSTM model for wind power forecast. Energy 2021;223:120069.
[22] Ju Y, Sun G, Chen Q, Zhang M, Zhu H, Rehman MU. A model combining convolutional neural network and LightGBM algorithm for ultra-short-term wind power forecasting. IEEE Access 2019;7:28309–18.
[23] Yaghoubirad M, Azizi N, Farajollahi M, Ahmadi A. Deep learning-based multistep ahead wind speed and power generation forecasting using direct method. Energy Convers Manage 2023;281:116760.
[24] Wang Y, Chen T, Zhou S, Zhang F, Zou R, Hu Q. An improved wavenet network for multi-step-ahead wind energy forecasting. Energy Convers Manage 2023;278:116709.
[25] Yoo J, Kang U. Attention-based autoregression for accurate and efficient multivariate time series forecasting. In: Proceedings of the 2021 SIAM international conference on data mining.
[26] Sun S, Liu Y, Li Q, Wang T, Chu F. Short-term multi-step wind power forecasting based on spatio-temporal correlations and transformer neural networks. Energy Convers Manage 2023;283:116916.
[27] Child R, Gray S, Radford A, Sutskever I. Generating long sequences with sparse transformers. 2019, arXiv preprint arXiv:1904.10509.
[28] Zhou H, Zhang S, Peng J, Zhang S, Li J, Xiong H, Zhang W. Informer: Beyond efficient transformer for long sequence time-series forecasting. Proc AAAI Conf Artif Intell 2021;35(12):11106–15.
[29] Wu H, Xu J, Wang J, Long M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Adv Neural Inf Process Syst 2021;34:22419–30.
[30] Liu S, Yu H, Liao C, Li J, Lin W, Liu AX, Dustdar S. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In: International conference on learning representations. 2021.
[31] Nie Y, Nguyen NH, Sinthong P, Kalagnanam J. A time series is worth 64 words: Long-term forecasting with transformers. 2022, arXiv preprint arXiv:2211.14730.
[32] Wu B, Wang L, Zeng Y-R. Interpretable wind speed prediction with multivariate time series and temporal fusion transformers. Energy 2022;252:123990.
[33] Xiong B, Lou L, Meng X, Wang X, Ma H, Wang Z. Short-term wind power forecasting based on attention mechanism and deep learning. Electr Power Syst Res 2022;206:107776.
[34] Shilin S, Yuekai L, Qi L, Tianyang W, Fulei C. Deep learning-based multistep ahead wind speed and power generation forecasting using direct method. Energy Convers Manage 2023;283:116916.
[35] Wu T, Ling Q. Mixformer: Mixture transformer with hierarchical context for spatio-temporal wind speed forecasting. Energy Convers Manage 2024;299:117896.
[36] Fei X, Ling Q. Attention-based global and local spatial-temporal graph convolutional network for vehicle emission prediction. Neurocomputing 2023;521:41–55.
[37] Guo S, Lin Y, Feng N, Song C, Wan H. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. Proc AAAI Conf Artif Intell 2019;33(01):922–9.
[38] Chengqing Y, Guangxi Y, Chengming Y, Yu Z, Xiwei M. A multi-factor driven spatiotemporal wind power prediction model based on ensemble deep graph attention reinforcement learning networks. Energy 2023;263:126034.
[39] Pan X, Wang L, Wang Z, Huang C. Short-term wind speed forecasting based on spatial-temporal graph transformer networks. Energy 2022;253:124095.
[40] Yu B, Yin H, Zhu Z. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. 2017, arXiv preprint arXiv:1709.04875.
[41] Wu Z, Pan S, Long G, Jiang J, Zhang C. Graph wavenet for deep spatial-temporal graph modeling. 2019, arXiv preprint arXiv:1906.00121.
[42] Liu Y, Liu Q, Zhang J-W, Feng H, Wang Z, Zhou Z, Chen W. Multivariate time-series forecasting with temporal polynomial graph neural networks. In: Advances in neural information processing systems. 2022.
[43] Hu T, Wu W, Guo Q, Sun H, Shi L, Shen X. Very short-term spatial and temporal wind power forecasting: A deep learning approach. CSEE J Power Energy Syst 2019;6(2):434–43.
[44] Zhu Q, Chen J, Shi D, Zhu L, Bai X, Duan X, Liu Y. Learning temporal and spatial correlations jointly: A unified framework for wind speed prediction. IEEE Trans Sustain Energy 2019;11(1):509–23.
[45] Wang Y, Zou R, Liu F, Zhang L, Liu Q. A review of wind speed and wind power forecasting with deep neural networks. Appl Energy 2021;304:117766.
[46] Ren Y, Li Z, Xu L, Yu J. The data-based adaptive graph learning network for analysis and prediction of offshore wind speed. Energy 2023;126590.
[47] Yu X, Tang B, Zhang K. Fault diagnosis of wind turbine gearbox using a novel method of fast deep graph convolutional networks. IEEE Trans Instrum Meas 2021;70:1–14.
[48] Wang L, He Y. M2STAN: Multi-modal multi-task spatiotemporal attention network for multi-location ultra-short-term wind power multi-step predictions. Appl Energy 2022;324:119672.
[49] Pu N, Chen W, Liu Y, Bakker EM, Lew MS. Dual gaussian-based variational subspace disentanglement for visible-infrared person re-identification. In: Proceedings of the 28th ACM international conference on multimedia. 2020, p. 2149–58.
[50] Fu C, Hu Y, Wu X, Shi H, Mei T, He R. CM-NAS: Cross-modality neural architecture search for visible-infrared person re-identification. In: Proceedings of the IEEE/CVF international conference on computer vision. 2021, p. 11823–32.
[51] Huang N, Liu J, Miao Y, Zhang Q, Han J. Deep learning for visible-infrared cross-modality person re-identification: A comprehensive review. Inf Fusion 2023;91:396–411.
[52] Ghosal D, Majumder N, Mehrish A, Poria S. Text-to-audio generation using instruction-tuned LLM and latent diffusion model. 2023, arXiv preprint arXiv:2304.13731.
[53] Lyu C, Wu M, Wang L, Huang X, Liu B, Du Z, Shi S, Tu Z. Macaw-LLM: Multi-modal language modeling with image, audio, video, and text integration. 2023, arXiv preprint arXiv:2306.09093.
[54] Zhou T, Niu P, Wang X, Sun L, Jin R. One fits all: Power general time series analysis by pretrained LM. 2023, arXiv preprint arXiv:2302.11939.
[55] Gruver N, Finzi M, Qiu S, Wilson AG. Large language models are zero-shot time series forecasters. 2023, arXiv preprint arXiv:2310.07820.
[56] Xue H, Salim FD. Promptcast: A new prompt-based learning paradigm for time series forecasting. IEEE Trans Knowl Data Eng 2023.
[57] Cao D, Jia F, Arik SO, Pfister T, Zheng Y, Ye W, Liu Y. TEMPO: Prompt-based generative pre-trained transformer for time series forecasting. 2023, arXiv preprint arXiv:2310.04948.
[58] Yu X, Chen Z, Ling Y, Dong S, Liu Z, Lu Y. Temporal data meets LLM–explainable financial time series forecasting. 2023, arXiv preprint arXiv:2306.11025.
[59] Garza A, Mergenthaler-Canseco M. TimeGPT-1. 2023, arXiv preprint arXiv:2310.03589.
[60] Chang C, Peng W-C, Chen T-F. LLM4TS: Two-stage fine-tuning for time-series forecasting with pre-trained llms. 2023, arXiv preprint arXiv:2308.08469.
[61] Sun C, Li Y, Li H, Hong S. TEST: Text prototype aligned embedding to activate LLM's ability for time series. 2023, arXiv preprint arXiv:2308.08241.
[62] Jin M, Wang S, Ma L, Chu Z, Zhang JY, Shi X, Chen P-Y, Liang Y, Li Y-F, Pan S, et al. Time-LLM: Time series forecasting by reprogramming large language models. 2023, arXiv preprint arXiv:2310.01728.
[63] Devlin J, Chang M-W, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. 2018, arXiv preprint arXiv:1810.04805.
[64] Kim T, Kim J, Tae Y, Park C, Choi J-H, Choo J. Reversible instance normalization for accurate time-series forecasting against distribution shift. In: International conference on learning representations. 2021.
[65] Wang Y, Hu Q, Li L, Foley AM, Srinivasan D. Approaches to wind power curve modeling: A review and discussion. Renew Sustain Energy Rev 2019;116:109422.
[66] Fang Y, Qin Y, Luo H, Zhao F, Zeng L, Hui B, Wang C. CDGNet: A cross-time dynamic graph-based deep learning model for traffic forecasting. 2021, arXiv preprint arXiv:2112.02736.
[67] Song J, Son J, Seo D-h, Han K, Kim N, Kim S-W. ST-GAT: A spatio-temporal graph attention network for accurate traffic speed prediction. In: Proceedings of the 31st ACM international conference on information & knowledge management. 2022, p. 4500–4.
[68] Kumar A, Raghunathan A, Jones R, Ma T, Liang P. Fine-tuning can distort pretrained features and underperform out-of-distribution. 2022, arXiv preprint arXiv:2202.10054.
[69] Zeng A, Chen M, Zhang L, Xu Q. Are transformers effective for time series forecasting? Proc AAAI Conf Artif Intell 2023;37(9):11121–8.
[70] Wu Z, Pan S, Long G, Jiang J, Chang X, Zhang C. Connecting the dots: Multivariate time series forecasting with graph neural networks. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 2020, p. 753–63.