1 Introduction

A time series is a set of random variables observed and recorded sequentially over time. Key research directions for time-series data include classification [1, 2], anomaly detection [3,4,5], event prediction [6,7,8], and time series forecasting [9,10,11]. Time series forecasting (TSF) predicts future trends of a time series from large amounts of data across various fields. With the development of data collection technology, the task has gradually evolved into using more historical data to predict a longer-term future, which is long-term time series forecasting (LTSF) [12, 13]. Precise LTSF can support decision makers in planning for the future by forecasting outcomes further in advance, with applications including meteorology prediction [14], noise cancellation [15], long-term financial strategic guidance [16], power load forecasting [17, 18], and traffic road condition prediction [19].

Formerly, traditional statistical approaches were applied to time series forecasting, such as the autoregressive (AR) [20], moving average (MA) [21], autoregressive moving average (ARMA) [22], and AR integrated MA (ARIMA) [23] models, as well as spectral analysis techniques [24]. However, these traditional statistical methods require many a priori assumptions about the time series, such as stationarity, normal distribution, linear correlation, and independence. For example, the AR, MA, and ARMA models assume that the time series is stationary, but in many real cases, time-series data exhibit non-stationarity. These assumptions limit the effectiveness of traditional methods in real-world applications.

As it is difficult to capture the nonlinear relationships within time series using traditional statistical approaches, many researchers have studied LTSF from the perspective of machine learning (ML) [25,26,27,28,29]. Support vector machines (SVMs) [30] and adaptive boosting (AdaBoost) [31] were employed in the field of TSF. They compute statistics such as the minimum, maximum, mean, and variance within a sliding window as new features for prediction, as sketched below. These models partially address the problem of predicting multivariate, heteroskedastic time series with nonlinear relationships. However, they suffer from poor generalization, which leads to limited prediction accuracy.
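To make the windowed-feature idea above concrete, the following minimal sketch (in Python with NumPy) turns a raw series into min/max/mean/variance features that a classical regressor could consume; the window size and the feature set are illustrative choices, not taken from the cited works.

```python
import numpy as np

def sliding_window_features(series: np.ndarray, window: int = 24) -> np.ndarray:
    """Build min/max/mean/variance features from a sliding window.

    Each row of the returned matrix describes one window of the input series
    and could be fed to a classical ML regressor (e.g. an SVM) to predict the
    value that follows the window.
    """
    feats = []
    for start in range(len(series) - window):
        w = series[start:start + window]
        feats.append([w.min(), w.max(), w.mean(), w.var()])
    return np.asarray(feats)

# Toy usage: targets are the values immediately after each window.
series = np.sin(np.linspace(0, 20, 500)) + 0.1 * np.random.randn(500)
X = sliding_window_features(series, window=24)
y = series[24:]
print(X.shape, y.shape)  # (476, 4) (476,)
```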

Deep learning (DL) models (Fig. 1) have greatly improved the nonlinear modeling capabilities of TSF in recent years. These models use neural network structures with powerful nonlinear modeling capabilities to learn complex patterns and feature representations in time series automatically. Therefore, DL is an effective solution for TSF and many related problems, such as hierarchical time series forecasting [32], intermittent time series forecasting [33], sparse multivariate time series forecasting [34], and asynchronous time series forecasting [35, 36]. It has also been extended to multi-objective, multi-granular forecasting scenarios [37] and multi-modal time series forecasting scenarios [38, 39]. The advantage of deep learning models can be attributed to their flexibility and their ability to capture long-term dependencies and handle large-scale data.

It is noteworthy that recurrent neural networks (RNNs) [40] and their variants, such as long short-term memory networks (LSTMs) [41] and gated recurrent units (GRUs) [42,43,44], are widely employed among deep learning models to process sequence data. These models process batches of data sequentially, using a gradient descent algorithm to optimize the unknown model parameters, and the gradients of the model parameters are computed by back-propagation through time [45]. However, due to the sequential processing of input data and back-propagation through time, they suffer from some limitations, especially when dealing with datasets with long dependencies. The training of LSTM and GRU models also suffers from vanishing and exploding gradients. Though architectural modifications and training techniques can help LSTMs and GRUs alleviate the gradient-related problems to some extent, the effectiveness and efficiency of RNN-based models may still be compromised [46]. Furthermore, models such as the convolutional neural network (CNN) can also be applied to time-series analysis.

On the other hand, the Transformer [47] is a model that combines several mechanisms, such as attention, embedding, and an encoder-decoder structure, originally developed for natural language processing. Later studies improved the Transformer and gradually applied it to TSF, imaging, and other areas, making Transformer-based models a distinct family of their own. Recent advancements in Transformer-based models have shown substantial progress [12, 48, 49]. The self-attention mechanism of the Transformer allows for adaptive learning of short-term and long-term dependencies through pairwise (query-key) interactions. This feature grants the Transformer a significant advantage in learning long-term dependencies on sequential data, enabling the creation of more robust and expressive models [50]. The performance of Transformers on LTSF is impressive, and they have gradually become the current mainstream approach.

The two main tasks on time-series data are forecasting and classification. Forecasting aims to predict real values from given time-series data, while classification categorizes given time-series data into one or more target categories. Many advances have been made in time-series Transformers for forecasting [12, 49, 51,52,53,54,55,56,57,58,59] and classification tasks [1, 60,61,62]. However, real-world time-series data tend to be noisy and non-stationary; without incorporating time-series-specific knowledge, models may learn spurious dependencies and lack interpretability. Thus, despite the notable achievements of Transformer-based models in accurate long-term forecasting, challenges still need to be addressed.

Fig. 1 The development history of TSF algorithms based on deep learning

Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) standard, we conducted a systematic review using digital databases including Google Scholar, Elsevier, and Springer Link. We used the most pertinent keywords, such as “vision transformer” and “long-term time series forecasting”, to select the articles most relevant to our study. The numbers of articles yielded through Google Scholar, Elsevier, and Springer Link since 2020 were 958, 1143, and 703, respectively, for a total of 2804. We first removed duplicate papers, then evaluated the titles and abstracts of these articles, and finally reviewed the full text. A total of 59 articles were included in the study. The summary of search results for research articles is shown in Fig. 2.

Fig. 2 Article retrieval and selection process based on the PRISMA standard

In this review, we commence with a comprehensive overview of Transformer architecture in Sect. 2. Section 3 presents Transformer-based architectures for LTSF in recent research. In Sect. 4, we analyze Transformer effectiveness for LTSF. Subsequently, Sect. 5 summarizes the public datasets and evaluation metrics in LTSF tasks. Section 6 introduces several training strategies in existing Transformer-based LTSF solutions. Finally, we conclude this review in Sect. 7.

2 Transformer

In this section, we begin by analyzing the inherent mechanics of the Transformer, proposed by Vaswani et al. [63] in 2017 as a solution to the challenge of neural machine translation. Figure 3 shows the Transformer architecture. Subsequently, we delve into the operations within each constituent of the Transformer and the underlying principles that inform these operations. Several variants of the Transformer architecture have been proposed for time-series analysis; however, our discussion in this section is limited to the original architecture [64, 65].

The Transformer network has two parts, the encoder and the decoder, and uses self-attention for neural sequence transduction. The encoder comprises two primary sub-networks: the multi-head attention mechanism and a two-layer feed-forward neural network. It maps an input sequence of symbol representations to a sequence of continuous representations. The decoder is similar to the encoder, albeit with an extra multi-head attention mechanism that interacts with the encoder output. Unlike the encoder, the decoder comprises three sub-networks. The top and bottom segments resemble the encoder, while the middle section attends to the encoder’s output and is referred to as “encoder-decoder attention”. The decoder generates the output sequence one element at a time; it is auto-regressive, consuming the previously generated outputs as additional input when producing the next element.

Fig. 3 Schematic diagram of the Transformer

2.1 Self-attention

The self-attention mechanism maps a query and a sequence of key-value pairs to an output vector. The output is a weighted sum of the values, where the weights are computed from the query and the keys. A schematic representation of the self-attention mechanism is depicted in Fig. 4.

Fig. 4 Schematic diagram of self-attention

As shown in Fig. 4, the core of the self-attention mechanism is to obtain the attention weights from Q and K and then apply them to V to produce the output. Q, K, and V are the Query, Key, and Value matrices obtained from the input sequence after linear transformation. For an input sequence denoted as X, the parameters Q, K, and V are given by

$$Q = W_{q}X,\quad K = W_{k}X,\quad \text{and}\quad V = W_{v}X.$$
(1)

Q, K, and V are computed by multiplying the input X by three different matrices (this applies only to the self-attention used by the encoder and decoder over their respective inputs; the Q, K, and V in the encoder-decoder interaction are defined differently). The computed Q, K, and V can be interpreted as three different linear transformations of the same input, representing three different states of it. Once Q, K, and V are computed, the attention weights can be calculated. Specifically, for the inputs Q, K, and V, the weighted output is calculated as:

$$\text{Attention}\left(Q,K,V\right)=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V.$$
(2)

The dimension of the query and key is denoted by \(d_k\). The attention for each position is normalized using the softmax function. The formula shows that the attention score matrix is derived by taking the dot product between the query and key matrices and dividing by the scaling factor \(\sqrt{d_k}\). Subsequently, the attention weights for each position are obtained by applying a softmax operation to the attention score matrix. The final self-attention representation is obtained by multiplying the attention weights with the value matrix. The compatibility function employed in this process is a scaled dot product, which makes the computation efficient, while the linear transformation of the inputs provides ample expressive power. As illustrated in Fig. 3, the scale step corresponds to the division by \(\sqrt{d_k}\) in Eq. 2. Scaling is essential because, for larger \(d_k\), the values of \(QK^{T}\) become excessively large, which pushes the softmax into regions with diminutive gradients. Such small gradients hinder the training of the network and are thus not conducive to the overall outcome.
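As a concrete illustration of Eqs. (1) and (2), the following minimal NumPy sketch computes scaled dot-product attention for a toy sequence; the row-vector convention (X multiplied on the left of the weight matrices) and all dimensions are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V as in Eq. (2)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (L, L) attention score matrix
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights

# Toy input: a length-L sequence of d_model-dimensional embeddings.
L, d_model, d_k = 8, 16, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(L, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, attn = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape, attn.shape)  # (8, 16) (8, 8)
```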

2.2 Multi-head attention

The self-attention mechanism solves the sequential encoding challenge encountered in conventional sequence models. It generates a final encoded vector that incorporates attention information from multiple positions through a finite number of matrix transformations on the initial inputs. However, a single attention head may overemphasize a position's own encoding and neglect the importance of other positions. To address this issue, the multi-head attention mechanism was introduced.

Fig. 5 Multi-head attention

As shown in Fig. 5, the multi-head attention mechanism applies self-attention to multiple groups of projections of the original input sequence; the results of each group are then concatenated and passed through a linear transformation to obtain the final output. Specifically, its calculation formula is:

$$\text{Multi-Head}\left(Q,K,V\right)=\text{Concat}\left(\text{head}_{1},\dots,\text{head}_{h}\right)W_{O}$$
$$\text{head}_{i}=\text{Attention}\left(QW_{Qi},KW_{Ki},VW_{Vi}\right).$$
(3)

In this context, the matrices Q, K, and V refer to the input sequence’s query, key, and value matrices, respectively, after linear transformation. The variable h denotes the number of attention heads. The weight matrices \(W_{Qi}\), \(W_{Ki}\), and \(W_{Vi}\) carry out the linear transformation on Q, K, and V for head i, and \(W_{O}\) denotes the output weight matrix of the multi-head attention. The computation of a single attention head, denoted by Attention in Eq. 2, is equivalent to the previously described self-attention mechanism. Each attention head maps the inputs through independent linear transformations and applies the attention mechanism to obtain its representation. The final output of the multi-head attention is obtained by concatenating the representations of all attention heads and applying a linear transformation with the output weight matrix \(W_{O}\).
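The following NumPy sketch illustrates Eq. (3) by splitting the projected Q, K, and V into heads along the feature dimension, attending per head, and concatenating the results; the head-splitting layout and dimensions are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Split Q, K, V into heads, attend per head, concatenate, project (Eq. 3)."""
    L, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for i in range(n_heads):
        s = slice(i * d_head, (i + 1) * d_head)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(scores) @ V[:, s])
    return np.concatenate(heads, axis=-1) @ W_o     # (L, d_model)

rng = np.random.default_rng(1)
L, d_model, n_heads = 8, 16, 4
X = rng.normal(size=(L, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads).shape)  # (8, 16)
```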

3 Transformer-based architectures in LTSF

The design of a network needs to consider the characteristics and nature of problems. In this section, we first analyze the key problems in the LTSF tasks, followed by discussing some popular recent Transformer-based architectures in LTSF tasks.

3.1 LTSF’s key problems

LTSF is usually defined as forecasting a more distant future [12, 66]. Given the current status of existing work, there are two main problems in the field of LTSF: complexity and dependency. LTSF requires processing a large amount of data [67], which may lead to longer training times and demand more computational resources [68], as the computational complexity of self-attention grows quadratically with the length of the sequence. Additionally, storing the entire sequence in memory may be challenging due to the computer’s limited memory [69]. This may limit the length of the time series available for prediction [70].

Meanwhile, LTSF models need the ability to accurately capture the temporal relationship between past and future observations in a time series [71,72,73]. Long time series exhibit long-term dependence [74, 75], challenging the models’ ability to capture dependencies [12]. Moreover, LTSF data are characterized by inherent periodicity and non-stationarity [76], and thus LTSF models need to learn a mixture of short-term and long-term repeated patterns in a given time series [67]. A practical model should capture both kinds of repeated patterns to make accurate predictions, which imposes more stringent requirements on the prediction model regarding learning dependencies.

3.2 Transformer variants

A Transformer [47] mainly captures correlations among sequence data through a self-attention mechanism. Compared with traditional deep learning architectures, the self-attention mechanism in a Transformer is more interpretable. We compare the Transformer-related methods proposed in the last five years. All of these Transformer variants enhance the original Transformer to some extent and can be used for LTSF. Wu et al. [53] introduced the vanilla Transformer to the field of temporal prediction for influenza disease forecasting. However, as mentioned above, Transformers have a large computational complexity, leading to high computational costs. Moreover, the utilization of positional information is limited: the position embedding used in the model's embedding process is of limited effectiveness, and long-distance information cannot be fully captured. A brief summary of recent Transformer-based architectures is given in Table 1.

Table 1 Transformer-based architectures

The time complexity of self-attention computation in a vanilla Transformer is \(O(L^2)\), leading to high computational cost. Subsequent works have been developed to optimize this time complexity and improve the long-term dependency modeling of Transformer-based models.

The LogSparse Transformer [49] was among the first models to bring the Transformer to the field of TSF, making it more feasible for time series with long-term dependencies. LogSparse attention allows each time step to attend only to previous time steps selected with an exponential step size. It also proposed convolutional self-attention, employing causal convolutions to produce queries and keys, reducing the time complexity from \(O(L^2)\) to \(O(L(\log L)^2)\). The prediction accuracy achieved for fine-grained, long-term dependent time series can thus be improved in cases with limited memory.
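The sparsity pattern can be sketched as follows; this is a simplified illustration of exponentially spaced attention indices, not the authors' exact implementation.

```python
def logsparse_indices(t: int) -> set[int]:
    """Positions a query at step t may attend to under a log-sparse pattern:
    the step itself plus exponentially spaced earlier steps
    (t-1, t-2, t-4, t-8, ...), giving O(log t) keys per query."""
    allowed, step = set(), 1
    while t - step >= 0:
        allowed.add(t - step)
        step *= 2
    allowed.add(t)  # a step can always attend to itself
    return allowed

for t in (1, 7, 16):
    print(t, sorted(logsparse_indices(t)))
# 1 [0, 1]
# 7 [3, 5, 6, 7]
# 16 [0, 8, 12, 14, 15, 16]
```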

Informer [12] uses the ProbSparse self-attention mechanism, further reducing the computational complexity of the traditional Transformer model to \(O(L\log L)\). At the same time, inspired by the dilated convolutions in [83] and [84], it introduced a self-attention distilling operation to remove redundant combinations of value vectors and reduce the total space complexity of the model. In addition, it designed a generative-style decoder that produces long sequence outputs in a single forward step to avoid accumulating error. The Informer architecture was tested on various datasets and performed better than models such as Autoregressive Integrated Moving Average (ARIMA) [85], Prophet [86], LSTMa [87], LSTNet [88], and DeepAR [89].

The Autoformer [67] is a seasonal-trend decomposition architecture with an auto-correlation mechanism serving as the attention module. It achieves \(O(L\log L)\) computational complexity. This deep decomposition architecture embeds the sequence decomposition strategy into the encoder-decoder structure as an internal unit of Autoformer.

In contrast, TCCT [51] designs a CSP attention module that merges CSPNet with the self-attention mechanism and replaces the typical convolutional layer with a dilated causal convolutional layer, thereby modifying the distillation operation employed by Informer to achieve exponential receptive field growth. In addition, the model develops a passthrough mechanism for stacking self-attention blocks to obtain finer-grained information at negligible additional computational cost.

Pyraformer [70] is a novel model based on hierarchical pyramidal attention. By keeping the maximum length of the signal traversal path constant with respect to the sequence length L, it achieves a theoretical \(O(L)\) complexity. Pyraformer conducts both intra-scale and inter-scale attention, which respectively capture temporal dependencies at an individual resolution and build a multi-resolution representation of the original series. Similarly, Triformer [13] proposed a triangular, variable-specific attention architecture, which achieves linear complexity through patch attention while proposing a lightweight approach to enable variable-specific model parameters.

FEDformer [79] achieves \(O(L)\) linear computational complexity by designing two attention modules that perform the attention operation in the frequency domain using the Fourier transform [90] and the wavelet transform [91], respectively. Instead of applying attention in the time domain, it applies it in the frequency domain, which helps better expose potential periodic information in the input data.

The Conformer [82] model uses the fast Fourier transform to extract correlation features among multivariate variables. It employs a sliding-window approach to improve the efficiency of long-horizon forecasting, sacrificing some global information extraction and complex sequence modeling capability. As a result, the time complexity is reduced to \(O(L)\).

To address the problems of long-term dependency, Lin et al. [77] established SpringNet for solar prediction. They proposed a DTW attention layer to capture the local correlations of time-series data, which helps capture repeatable fluctuation patterns and provide accurate predictions. For the same purpose, Chu et al. [80] combined Autoformer, Informer, and Reformer to propose a prediction model based on stacking ensemble learning.

Chen et al. [81] proposed the Quatformer framework, in which learning-to-rotate attention (LRA) introduces learnable period and phase information to describe complex periodic patterns, trend normalization models the normalization of the sequence representation in the hidden layers, and the LRA is decoupled using a global memory. Together, these components efficiently fit multi-periodic complex patterns in LTSF while achieving linear complexity without loss of prediction accuracy.

To alleviate the problem of redundant information in LTSF inputs, the Muformer proposed by Zeng et al. [68] enhances input features with a multi-perceptual-domain processing mechanism, while a multi-cornered attention head mechanism and an attention head pruning mechanism enhance the expressiveness of multi-head attention. Each of these efforts optimizes the parametric part of the model from a different perspective, but a general architecture or component that reduces the number of required model parameters has not yet emerged.

In addition to the previously mentioned Transformer-based architectures, other architectural modifications have emerged in recent years. For example, the Bidirectional Encoder Representations from Transformers (BERT) [92] model is built by stacking Transformer encoder modules and introducing a new training scheme. Pre-training the encoder modules is task-independent, and decoder modules can be added later and fine-tuned to the task. This scheme allows BERT models to be trained on large amounts of unlabeled data. The BERT architecture has inspired many new Transformer models for time-series data [1, 57, 60]. However, compared to NLP tasks, time-series data include various data types [1, 12, 93]. Thus, the pre-training process has to be different for each task. This task-dependent pre-training contrasts with NLP tasks, which can all start from the same pre-trained models, assuming all tasks are based on the same language semantics and structure.

Generative adversarial networks (GANs) consist of a generator and a discriminator that learn adversarially from each other. This generator-discriminator learning principle has been applied to the time-series forecasting task [56]. The authors use a generative adversarial encoder-decoder framework to train a sparse Transformer model for time-series forecasting, addressing the inability to predict long series due to error accumulation. The adversarial training process improves the model’s robustness and generalization ability by directly shaping the output distribution of the network to avoid error accumulation through one-step-ahead inference.

TranAD [94] applied GAN-style adversarial training with two Transformer encoders and two Transformer decoders to gain stability. As a simple Transformer-based network tends to miss slight anomalous deviations, an adversarial training procedure can amplify reconstruction errors.

TFT [54] designs a multi-horizon model with static covariate encoders, a gating-based feature selection module, and a temporal self-attention decoder. It encodes and selects valuable information from various covariates to perform forecasting, and it preserves interpretability by incorporating global and temporal dependencies and events. SSDNet [95] combines the Transformer with state space models (SSM): the Transformer part learns the temporal pattern and estimates the SSM parameters, while the SSM part performs the seasonal-trend decomposition to maintain interpretability. MT-RVAE [96] combines the Transformer with a Variational AutoEncoder (VAE) and focuses on data with few dimensions or sparse relationships; a multi-scale Transformer is designed to extract global time-series information at different levels. AnomalyTrans [60] combines the Transformer with a Gaussian prior association to make rare anomalies more distinguishable. Prior association and series association are modeled simultaneously, and a minimax strategy optimizes the anomaly model to constrain the prior and series associations for more distinguishable association discrepancies.

GTA [3] contains a graph convolution structure to model the influence propagation process. It replaces vanilla multi-head attention with a multi-branch attention mechanism that combines globally learned attention, multi-head attention, and neighborhood convolution. GTN [62] applies a two-tower Transformer, with one tower performing time-step-wise attention and the other channel-wise attention; a learnable weighted concatenation merges the features of the two towers. Aliformer [57] performs time-series sales forecasting using knowledge-guided attention, with a branch that revises and denoises the attention map.

In addition, some researchers have made network improvements for specific applications. In transportation, the spatiotemporal graph Transformer [97] proposes an attention-based graph convolution mechanism for learning more complex spatial-temporal attention patterns, applied to pedestrian trajectory prediction. Traffic Transformer [55] designs an encoder-decoder structure that uses a self-attention module to capture temporal dependencies and a graph neural network (GNN) module to capture spatial dependencies. Spatial-temporal Transformer networks introduce a temporal Transformer block to capture temporal dependencies and a spatial Transformer block that assists a graph convolution network in capturing spatial dependencies [98].

There are also applications in event prediction. Event forecasting or prediction aims to predict the times and marks of future events given the history of past events, which is often modeled by temporal point processes (TPP) [6]. The self-attentive Hawkes process (SAHP) [7] and the Transformer Hawkes process (THP) [8] adopt the Transformer encoder architecture to summarize the influence of historical events and compute the intensity function for event prediction. They modify the positional encoding by translating time intervals into sinusoidal functions so that the intervals between events can be utilized. Later, a more flexible model named attentive neural datalog through time (ANDTT) [99] was proposed to extend the SAHP/THP schemes by embedding all possible events and times with attention.

4 Transformer effectiveness for LTSF

Is the Transformer effective in the time series forecasting domain? The response we provide is affirmative. Since the publication of Zeng et al.'s article, “Are Transformers effective for time series forecasting?” [100], the feasibility of utilizing Transformer models for time series forecasting has become a significant subject of scholarly discourse. This is particularly noteworthy because a straightforward model outperformed considerably more intricate Transformer models, prompting substantial academic debate. Zeng et al. claimed that Transformer-based models are not effective for time series forecasting. They compared the Transformer-based models with a simple linear model, DLinear, which uses the decomposition layer structure of Autoformer and which, they claim, outperforms the Transformer-based models. A Transformer with various positional and temporal embeddings retains only limited temporal relationships and is prone to overfitting on noisy data, whereas a linear model naturally preserves the temporal order and, with fewer parameters, can avoid overfitting. However, Nie et al. [101] present a novel solution to the loss of temporal information induced by the self-attention mechanism. This approach is rooted in Transformer-based time-series prediction and involves transforming the time-series data into a patch format akin to that of the Vision Transformer. This conversion preserves the locality of the time series, with each patch serving as the smallest unit for attention computation. The findings in Table 2 demonstrate that research focused on Transformer-based time-series prediction underscores the significance of integrating temporal information to improve prediction performance.

Table 2 Multivariate long-term forecasting results on electricity dataset
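To illustrate the patching idea described above, the following sketch splits a univariate look-back window into overlapping patches that would serve as attention tokens; the patch length and stride values are illustrative assumptions, not the settings of the cited work.

```python
import numpy as np

def patchify(series: np.ndarray, patch_len: int = 16, stride: int = 8) -> np.ndarray:
    """Split a univariate series into overlapping patches; each patch is later
    treated as one token for attention, so the attention length shrinks from
    L to roughly L / stride."""
    n_patches = (len(series) - patch_len) // stride + 1
    return np.stack([series[i * stride : i * stride + patch_len]
                     for i in range(n_patches)])

x = np.arange(336, dtype=float)   # a typical LTSF look-back window
patches = patchify(x)             # tokens fed to the Transformer encoder
print(patches.shape)              # (41, 16)
```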

A straightforward linear model may have advantages in specific circumstances; however, it may not be able to handle extensive time-series information at the same level as a more intricate model such as the Transformer. In summary, the Transformer remains a viable model for time series forecasting. Nonetheless, abundant training data are crucial to unlocking its full potential. Unfortunately, there is currently a scarcity of publicly available datasets that are sufficiently large for time series forecasting. Most existing pre-trained time-series models use public datasets such as Traffic and Electricity. Although these benchmark datasets serve as the foundation for developing time series forecasting, their limited size and lack of generalizability pose significant challenges for large-scale pre-training. Thus, in the context of time-series prediction, the most pressing matter is the development of large and highly generalized datasets (similar to ImageNet in computer vision). This crucial step will undoubtedly propel the advancement of time-series analysis and training models while enhancing the capacity of trained models in time-series prediction. Such a development would further underscore the Transformer model’s effectiveness in capturing long-term dependencies within a sequence while maintaining superior computational efficiency and a more comprehensive feature representation capability.

On the other hand, the Transformer’s effectiveness is reflected in Large Language Models (LLMs). LLMs are powerful Transformer-based models, and numerous previous studies have shown that Transformer-based models are capable of learning potentially complex relationships among textual sequences [102, 103]. It is reasonable to expect LLMs to have the potential to understand complex dependencies among numeric time series augmented by temporal textual sequences.

Current endeavors toward time series LLMs encompass two primary strategies. The first involves creating and pre-training a comprehensive foundation model specifically tailored for time series, which can subsequently be fine-tuned for various downstream tasks. This path represents the most direct solution, drawing upon a substantial volume of data and imbuing the model with time-series-related knowledge through pre-training. The second strategy involves fine-tuning within the LLM framework, wherein corresponding mechanisms are devised to adapt time series for application to existing language models, thereby enabling diverse time-series tasks to be processed using pre-existing language models. This path poses challenges and necessitates going beyond the original language model.

5 Public datasets and evaluation metrics

In this section, we summarize some typical applications and relevant public LTSF datasets. We also discuss the prediction evaluation metrics in LTSF.

5.1 Common applications and public datasets

5.1.1 Finance

LTSF is commonly used in finance to predict economic cycles [104], fiscal cycles, and long-term stock trends [105]. LTSF can predict future trends and stock price fluctuations in the stock market, helping investors develop more accurate investment strategies. In financial planning, LTSF can predict future economic conditions, such as income, expenses, and profitability, to help individuals or businesses better plan their financial goals and capital operations [106]. In addition, LTSF can predict a borrower’s repayment ability and credit risk [107] or predict future interest rate trends to help financial institutions conduct loan risk assessments for better monetary and interest rate policies. We summarized the open-source LTSF datasets in the finance field in recent years in Table 3.

Table 3 Finance LTSF dataset

5.1.2 Energy

In the energy field, LTSF is often used to assist in developing long-term resource planning strategies [118]. It can help companies and governments forecast future energy demand to better plan energy production and supply. It can also help power companies predict future power generation, ensuring a sufficient and stable power supply [119]. In addition, LTSF can help governments and enterprises to develop energy policy planning or manage the energy supply chain [120]. These applications can help enterprises and governments better plan, manage, reduce risks, improve efficiency, and realize sustainable development. We summarized the energy field’s open-source datasets in recent years in Table 4.

Table 4 Energy LTSF dataset

5.1.3 Transportation

In urban transportation, LTSF can help urban traffic management predict future traffic flow [123] for better traffic planning and management. It can also be used to predict future traffic congestion [124], future traffic accident risks, and traffic safety issues [125] for better traffic safety management and accident prevention. We summarized the open-source datasets in the transportation field in recent years in Table 5.

Table 5 Transportation LTSF dataset

5.1.4 Meteorology and medicine

The application of LTSF in meteorology mainly focuses on predicting long-term climate trends. For example, LTSF can be used to predict long-term climate change [133], providing a scientific basis for national decision-making in response to climate change. It can also issue early warnings for natural climate disasters [134] to mitigate potential hazards to human lives and properties. In addition, LTSF can predict information such as sea surface temperature and marine meteorology for the future [135], providing decision support for industries such as fisheries and marine transportation. We summarized the open-source datasets in the meteorology and medicine fields in recent years in Tables 6 and 7, respectively.

Table 6 Meteorology LTSF dataset

In the medical field, LTSF can be applied to various stages of drug development. For example, predicting a drug’s toxicity, pharmacokinetics, pharmacodynamics, and other parameters helps researchers optimize the drug design and screening process [137]. In addition, LTSF can predict medical needs over a certain period [138]. These predictions can be used to allocate and plan medical resources rationally.

Table 7 Medicine LTSF dataset

5.2 Evaluation metrics

In this section, we discuss prediction performance evaluation metrics in the field of TSF. According to [141], the prediction accuracy metrics can be divided into three groups: scale-dependent, scale-independent, and scaled error metrics, based on whether the evaluation metrics are affected by the data scale and how the data scale effects are eliminated.

Let \(Y_t\) denote the observation at time t (t = 1, …, n) and \(F_t\) denote the forecast of \(Y_t\). Then the forecast error is defined as \(e_t = Y_t - F_t\).

5.2.1 Scale-dependent measures

Scale-dependent measures are the most widely used evaluation metrics in forecasting; their values depend on the scale of the original data. This type of metric is computationally simple. They are useful when comparing different methods applied to the same dataset, but should not be used, for example, when comparing across datasets with different scales.

The most commonly used scale-dependent measures are based on the absolute error or squared errors:

$$\text{Mean Square Error (MSE)} = \text{mean}\left(e_{t}^{2}\right)$$
(7)
$$\text{Root Mean Square Error (RMSE)} = \sqrt{\text{MSE}}$$
(8)
$$\text{Mean Absolute Error (MAE)} = \text{mean}\left(\left|e_{t}\right|\right)$$
(9)
$$\text{Median Absolute Error (MdAE)} = \text{median}\left(\left|e_{t}\right|\right)$$
(10)

Historically, the RMSE and MSE have been popular because of their theoretical relevance in statistical modeling. The RMSE is valued for its simplicity and close relationship with statistical modeling; for unbiased forecasts, the MSE equals the forecast error variance. However, these measures are more sensitive to outliers than the MAE or MdAE [142]. The MAE reflects the typical error magnitude better than the RMSE.
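A minimal implementation of Eqs. (7)-(10), assuming NumPy arrays of aligned observations and forecasts, is given below; the toy numbers are illustrative.

```python
import numpy as np

def scale_dependent_metrics(y_true, y_pred):
    """MSE, RMSE, MAE and MdAE as defined in Eqs. (7)-(10)."""
    e = y_true - y_pred
    mse = np.mean(e ** 2)
    return {"MSE": mse,
            "RMSE": np.sqrt(mse),
            "MAE": np.mean(np.abs(e)),
            "MdAE": np.median(np.abs(e))}

y_true = np.array([10.0, 12.0, 13.0, 12.5])
y_pred = np.array([11.0, 11.5, 13.5, 12.0])
print(scale_dependent_metrics(y_true, y_pred))
```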

5.2.2 Scale-independent measures

Scale-independent measures are evaluation metrics not affected by the size of the original data. They can be divided more specifically into three subcategories: measures based on percentage errors, measures based on relative errors, and relative measures.

The percentage error is \(p_t = 100\,e_t / Y_t\). The most commonly used measures are:

$$\text{Mean Absolute Percentage Error (MAPE)} = \text{mean}\left(\left|p_{t}\right|\right)$$
(11)
$$\text{Median Absolute Percentage Error (MdAPE)} = \text{median}\left(\left|p_{t}\right|\right)$$
(12)
$$\text{Root Mean Square Percentage Error (RMSPE)} = \sqrt{\text{mean}\left(p_{t}^{2}\right)}$$
(13)
$$\text{Root Median Square Percentage Error (RMdSPE)} = \sqrt{\text{median}\left(p_{t}^{2}\right)}$$
(14)

Percentage errors have the advantage of being scale-independent and are therefore frequently used to compare forecast performance across different datasets. However, these measures have the disadvantage of being infinite or undefined if \(Y_t = 0\) for any t in the period of interest, and they have an extremely skewed distribution when any value of \(Y_t\) is close to zero. The MAPE and MdAPE also have the disadvantage of putting a heavier penalty on positive errors than on negative errors. Measures based on percentage errors are often highly skewed, and therefore transformations (such as logarithms) can make them more stable [143].
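The following short sketch computes the MAPE of Eq. (11) and illustrates the caveat above: the toy numbers are arbitrary, and the second call shows how a single observation near zero inflates the metric.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error (Eq. 11); undefined if any y_true is 0."""
    p = 100.0 * (y_true - y_pred) / y_true
    return np.mean(np.abs(p))

y_true = np.array([100.0, 102.0, 98.0])
y_pred = np.array([101.0, 100.0, 99.0])
print(mape(y_true, y_pred))  # well behaved far from zero

# The same metric explodes when an observation is close to zero:
print(mape(np.array([0.01, 100.0]), np.array([1.0, 100.0])))
```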

An alternative scaling method is to divide each error by the error obtained using another standard forecasting method. Let \(r_t = e_t / e_t^{*}\) denote the relative error, where \(e_t^{*}\) is the forecast error obtained from the benchmark method. Usually, the benchmark method is the random walk, where \(F_t\) equals the last observation.

$$\text{Mean Relative Absolute Error (MRAE)} = \text{mean}\left(\left|r_{t}\right|\right)$$
(15)
$$\text{Median Relative Absolute Error (MdRAE)} = \text{median}\left(\left|r_{t}\right|\right)$$
(16)
$$\text{Geometric Mean Relative Absolute Error (GMRAE)} = \text{gmean}\left(\left|r_{t}\right|\right)$$
(17)

A serious deficiency of relative error measures is that \(e_t^{*}\) can be small. In fact, \(r_t\) has infinite variance because \(e_t^{*}\) has positive probability density at 0. “Winsorizing” can trim extreme values and thus avoid the difficulties associated with small values of \(e_t^{*}\) [144], but it adds complexity to the calculation and a level of arbitrariness, as the amount of trimming must be specified.

Rather than use relative errors, one can use relative measures. For example, let \(\text{MAE}_b\) denote the MAE from the benchmark method. Then, a relative MAE is given by

$$\text{relMAE} = \text{MAE}/\text{MAE}_{b}.$$
(18)

An advantage of these methods is their interpretability. However, they require several forecasts on the same series to compute MAE (or MSE).

5.2.3 Scaled errors

Scaled errors were first proposed in [141] and can be used to eliminate the effect of data scale by comparing the prediction errors with those of a baseline method (usually the naïve method). The following scaled error is commonly used:

$$q_{t} = \frac{e_{t}}{\frac{1}{n-1}\sum_{i=2}^{n}\left|Y_{i}-Y_{i-1}\right|}.$$
(19)

Therefore, the Mean Absolute Scaled Error is simply \(\text{MASE} = \text{mean}\left(\left|q_{t}\right|\right)\).

The denominator can be considered the average error of naïve one-step-ahead predictions. If MASE > 1, the method under evaluation is worse than the naïve forecast, and vice versa. Because MASE is calculated using the mean, it is more susceptible to outliers, while MdASE, computed using the median, is more robust. However, such metrics only reflect the comparison with the baseline method and cannot directly visualize the error of the prediction results.
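A minimal MASE computation is sketched below; following common practice, the naïve-forecast MAE in the denominator is computed over a history (training) segment, and all numbers are illustrative.

```python
import numpy as np

def mase(y_true, y_pred, y_history):
    """Mean Absolute Scaled Error (Eq. 19): errors are scaled by the MAE of the
    one-step naive forecast (each value predicted by its predecessor) computed
    on a history segment. MASE > 1 means worse than the naive benchmark."""
    naive_mae = np.mean(np.abs(np.diff(y_history)))
    q = (y_true - y_pred) / naive_mae
    return np.mean(np.abs(q))

y_history = np.array([10.0, 11.0, 12.0, 11.5, 12.5])
y_true = np.array([13.0, 12.0])
y_pred = np.array([12.5, 12.5])
print(mase(y_true, y_pred, y_history))  # < 1: better than the naive forecast
```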

6 Training strategies

Recent Transformer variants introduce various time-series features into the models for improvements [67, 70]. In this section, we summarize several training strategies of existing Transformer-based models for LTSF.

6.1 Preprocessing and embedding

In the preprocessing stage, normalization to zero mean is often applied in time-series tasks. Moreover, seasonal-trend decomposition is a standard method for making raw data more predictable [145, 146]; Autoformer [67] first incorporated it inside a Transformer-based forecasting architecture, applying a moving-average kernel to the input sequence to extract the trend-cyclical component of the time series. The seasonal component is the difference between the original sequence and the trend component. FEDformer [79] further proposed a mixture-of-experts strategy to mix the trend components extracted by moving-average kernels of various sizes.
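A minimal sketch of such a moving-average decomposition is shown below; the padding scheme and kernel size are illustrative assumptions rather than the exact Autoformer implementation.

```python
import numpy as np

def series_decomp(x: np.ndarray, kernel: int = 25):
    """Moving-average decomposition: the moving average gives the
    trend-cyclical component; the residual is the seasonal component."""
    pad = kernel // 2
    padded = np.concatenate([np.full(pad, x[0]), x, np.full(pad, x[-1])])
    trend = np.convolve(padded, np.ones(kernel) / kernel, mode="valid")
    seasonal = x - trend
    return seasonal, trend

t = np.arange(200)
x = 0.05 * t + np.sin(2 * np.pi * t / 24) + 0.1 * np.random.randn(200)
seasonal, trend = series_decomp(x, kernel=25)
print(seasonal.shape, trend.shape)  # (200,) (200,)
```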

The self-attention layer in the Transformer architecture cannot preserve the positional information of the time series by itself. However, local positional information, i.e., the ordering of the time series, is essential. Furthermore, global time information is also informative, such as hierarchical timestamps (weeks, months, years) and agnostic timestamps (holidays and events) [12]. To enhance the temporal context of the time-series input, a practical design is to inject multiple embeddings into the input sequence, such as fixed positional encodings and learnable temporal embeddings. Additionally, temporal embeddings accompanied by temporal convolutional layers [49] or learnable timestamps [67] have been proposed to further enhance the temporal context of the input data.
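The following sketch combines a fixed sinusoidal positional encoding with a toy learnable timestamp embedding (hour-of-day and day-of-week lookup tables); the choice of calendar features and dimensions are illustrative assumptions.

```python
import datetime as dt
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal positional encoding added to the value embedding."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def temporal_embedding(timestamps, d_model, rng):
    """Toy learnable timestamp embedding: one lookup table per calendar
    feature (here hour-of-day and day-of-week), summed together."""
    hour_table = rng.normal(size=(24, d_model))
    dow_table = rng.normal(size=(7, d_model))
    hours = np.array([ts.hour for ts in timestamps])
    dows = np.array([ts.weekday() for ts in timestamps])
    return hour_table[hours] + dow_table[dows]

rng = np.random.default_rng(0)
stamps = [dt.datetime(2021, 1, 1) + dt.timedelta(hours=h) for h in range(96)]
emb = positional_encoding(96, 64) + temporal_embedding(stamps, 64, rng)
print(emb.shape)  # (96, 64)
```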

6.2 Iterated multi-step and direct multi-step

The time series forecasting task is to predict the values at T future time steps. When T > 1, iterated multi-step (IMS) forecasting [147] learns a single-step forecaster and iteratively applies it to obtain multi-step predictions. Alternatively, direct multi-step (DMS) forecasting [148] optimizes the multi-step forecasting objective directly. The variance of IMS predictions is smaller thanks to the autoregressive estimation procedure, but IMS is inevitably subject to error accumulation. Therefore, IMS forecasting is preferable when a highly accurate single-step forecaster exists and T is relatively small, whereas DMS forecasting produces more accurate forecasts when unbiased single-step forecast models are difficult to obtain or when T is large.
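The two strategies can be contrasted with the following toy sketch, where the one-step and multi-step forecasters are placeholder functions standing in for trained models.

```python
import numpy as np

def ims_forecast(history, one_step_model, horizon):
    """Iterated multi-step: apply a one-step forecaster recursively,
    feeding each prediction back as input (errors can accumulate)."""
    window = list(history)
    preds = []
    for _ in range(horizon):
        y_hat = one_step_model(np.asarray(window))
        preds.append(y_hat)
        window = window[1:] + [y_hat]  # slide the window forward
    return np.asarray(preds)

def dms_forecast(history, multi_step_model, horizon):
    """Direct multi-step: one model call emits all T future values at once."""
    return multi_step_model(np.asarray(history), horizon)

# Toy forecasters standing in for trained models.
history = [1.0, 2.0, 3.0, 4.0]
print(ims_forecast(history, lambda w: w.mean(), horizon=3))
print(dms_forecast(history, lambda w, T: np.full(T, w[-1]), horizon=3))
```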

Applying the vanilla Transformer model to the LTSF problem has some limitations, including the quadratic time/memory complexity of the original self-attention scheme and the error accumulation caused by the autoregressive decoder design. Transformer variants have been developed to overcome these challenges, each employing distinct strategies. For instance, LogTrans [49] introduces a dedicated decoder for IMS forecasting, while Informer [12] leverages a generative-style decoder. Additionally, Pyraformer [70] incorporates a fully connected layer that concatenates spatiotemporal axes as its decoder. Autoformer [67] sums the two refined decomposition features, the trend-cyclical component and the seasonal component produced by the stacked auto-correlation mechanism, to obtain the final prediction. Similarly, FEDformer [79] applies a decomposition scheme and employs its proposed frequency attention block to compute the final results.

7 Conclusion

The Transformer architecture has been found applicable to various time-series tasks. Based on self-attention and positional encoding, it offers performance better than or similar to RNNs and LSTM/GRU variants, while being more efficient in computation time and overcoming other shortcomings of RNNs/LSTMs/GRUs.

In this paper, we summarized the application of the Transformer to LTSF. First, we provided a thorough examination of the fundamental structure of the Transformer. Subsequently, we analyzed and summarized the advantages of the Transformer on LTSF tasks. Given that the Transformer encounters complexity and dependency issues when confronting LTSF tasks, numerous adaptations of the original architecture have been introduced, equipping Transformers with the capacity to handle LTSF tasks effectively. This architectural augmentation, however, brings certain challenges during the training process. To address this, we have compiled a compendium of practices that facilitate the practical training of Transformers. Additionally, we have collected abundant resources on TSF and LTSF, including datasets, application fields, and evaluation metrics.

In summary, our comprehensive review examines recent advancements in Transformer-based architectures for LTSF and imparts valuable insights to researchers seeking to improve their models. The Transformer architecture is renowned for its remarkable modeling capacity and aptitude for capturing long-term dependencies. However, it encounters challenges regarding time complexity when applied to LTSF tasks, and efforts to reduce complexity may inadvertently discard certain interdependencies between data points, thereby compromising prediction accuracy. Consequently, the amalgamation of various techniques within a compound model, leveraging the strengths of each, emerges as a promising avenue for future research in Transformer-based LTSF models. This paves the way for innovative model designs, data processing techniques, and benchmarking approaches to tackle the intricate LTSF problems. In future research, progressive trend and seasonal decomposition mechanisms can be introduced, as multiple cycles and trends are hidden and repeated among the data. At the same time, Transformer-based models have some inherent limitations in LTSF, such as time complexity. When reducing the complexity of self-attention computations in situations with many feature variables and complex correlations, Transformer-based models may unavoidably lose some of the dependencies between data points, resulting in reduced prediction accuracy. Therefore, Transformer-based models are not suitable for all LTSF tasks. In addition, pre-trained Transformer models for different time-series tasks and more architecture-level designs for Transformers can be investigated in depth in the future. Notably, researchers have recently explored the integration of Large Language Models (LLMs) in time series forecasting, wherein LLMs exhibit the capability to generate forecasts while offering human-readable explanations for predictions, outperforming traditional statistical models and machine learning approaches. These encouraging findings present a compelling impetus for further exploration, aiming to enhance the precision, comprehensibility, and transparency of forecasting results.