
LLM4TS: Aligning Pre-Trained LLMs as Data-Efficient Time-Series Forecasters

Ching Chang, Wei-Yao Wang, Wen-Chih Peng, and Tien-Fu Chen


National Yang Ming Chiao Tung University, Hsinchu, Taiwan
[email protected], [email protected], {wcpeng, tfchen}@cs.nycu.edu.tw
arXiv:2308.08469v5 [cs.LG] 18 Jan 2024

Abstract

Multivariate time-series forecasting is vital in various domains, e.g., economic planning and weather prediction. Deep train-from-scratch models have exhibited effective performance yet require large amounts of data, which limits real-world applicability. Recently, researchers have leveraged the representation learning transferability of pre-trained Large Language Models (LLMs) to handle limited non-linguistic datasets effectively. However, incorporating LLMs with time-series data presents challenges of limited adaptation due to different compositions between time-series and linguistic data, and the inability to process multi-scale temporal information. To tackle these challenges, we propose LLM4TS, a framework for time-series forecasting with pre-trained LLMs. LLM4TS consists of a two-stage fine-tuning strategy: the time-series alignment stage to align LLMs with the nuances of time-series data, and the forecasting fine-tuning stage for downstream time-series forecasting tasks. Furthermore, our framework features a novel two-level aggregation method that integrates multi-scale temporal data within pre-trained LLMs, enhancing their ability to interpret time-specific information. In experiments across 7 time-series forecasting datasets, LLM4TS is superior to existing state-of-the-art methods compared with trained-from-scratch models in full-shot scenarios, and also achieves an average improvement of 6.84% in MSE in few-shot scenarios. In addition, evaluations compared with different self-supervised learning approaches highlight LLM4TS's effectiveness with representation learning in forecasting tasks.

1 Introduction

Forecasting is a vital task in multivariate time-series analysis, not only for its ability to operate without manual labeling but also for its importance in practical applications such as economic planning [Lai et al., 2018] and weather prediction [Zhou et al., 2021]. Recently, numerous deep train-from-scratch models have been developed for time-series forecasting [Nie et al., 2023], although some lean towards unsupervised representation learning [Chang et al., 2023] and transfer learning [Zhang et al., 2022; Zhou et al., 2023]. Generally, these approaches aim to employ adept representation learners: first extracting rich representations from the time-series data and then using these representations for forecasting.

Achieving an adept representation learner requires sufficient training data [Hoffmann et al., 2022], yet in real-world scenarios, there is often a lack of large-scale time-series datasets. For instance, in industrial manufacturing, the sensor data for different products cannot be combined for further analysis, leading to limited data for each product type [Yeh et al., 2019]. Recent research has pivoted towards pre-trained LLMs in Natural Language Processing (NLP) [Radford et al., 2019; Touvron et al., 2023], exploiting their robust representation learning and few-shot learning capabilities. Moreover, these LLMs can adapt to non-linguistic datasets (e.g., images [Lu et al., 2021], audio [Ghosal et al., 2023], tabular data [Hegselmann et al., 2023], and time-series data [Zhou et al., 2023]) by fine-tuning with only a few parameters and limited data. While LLMs are renowned for their exceptional transfer learning capabilities across various fields, the domain-specific nuances of time-series data introduce two challenges in leveraging these models for time-series forecasting.

The first challenge of employing LLMs for time-series forecasting is their limited adaptation to the unique characteristics of time-series data due to LLMs' initial pre-training focus on the linguistic corpus. While LLMs have been both practically and theoretically proven [Zhou et al., 2023] to be effective in transfer learning across various modalities thanks to their data-independent self-attention mechanism, their primary focus on general text during pre-training causes a shortfall in recognizing key time-series patterns and nuances crucial for accurate forecasting. This limitation is evident in areas such as meteorology and electricity forecasting [Zhou et al., 2021], where failing to account for weather patterns and energy consumption trends leads to inaccurate predictions.

The second challenge lies in the limited capacity to process multi-scale temporal information. While LLMs are adept at understanding the sequence and context of words, they struggle to understand temporal information due to the lack of utilizing multi-scale time-related data such as time units (e.g., seconds, minutes, hours, etc.) and specific dates (e.g., holidays, significant events). This temporal information is vital in time-series analysis for identifying and predicting patterns [Wu et al., 2021]; for instance, in energy management, it is used to address consumption spikes during daytime and in summer/winter, in contrast to the lower demand during the night and in milder seasons [Zhou et al., 2021]. This underscores the importance of models adept at interpreting multi-scale temporal patterns (hourly to seasonal) for precise energy demand forecasting. However, most LLMs (e.g., [Radford et al., 2019; Touvron et al., 2023]) built on top of the Transformer architecture do not naturally incorporate multi-scale temporal information, leading to models that fail to capture crucial variations across different time scales.

Figure 1: Model performance comparison on few-shot forecasting. (a) 10% training data. (b) 5% training data.
To address the above issues, we propose LLM4TS, a framework for time-series forecasting with pre-trained LLMs. Regarding the first challenge, our framework introduces a two-stage fine-tuning approach: the time-series alignment stage and the forecasting fine-tuning stage. The first stage focuses on aligning the LLMs with the characteristics of time-series data by utilizing the autoregressive objective, enabling the fine-tuned LLMs to adapt to time-series representations. The second stage is incorporated to learn corresponding time-series forecasting tasks. In this manner, our model supports effective performance in full- and few-shot scenarios. Notably, throughout both stages, most parameters in the pre-trained LLMs are frozen, thus preserving the model's inherent representation learning capability. To overcome the limitation of LLMs in integrating multi-scale temporal information, we introduce a novel two-level aggregation strategy. This approach embeds multi-scale temporal information into the patched time-series data, ensuring that each patch not only represents the series values but also encapsulates the critical time-specific context. Consequently, LLM4TS emerges as a data-efficient time-series forecaster, demonstrating robust few-shot performance across various datasets (Figure 1).

In summary, the paper's main contributions are as follows:

• Aligning LLMs Toward Time-Series Data: To the best of our knowledge, LLM4TS is the first method that aligns pre-trained Large Language Models with time-series characteristics, effectively utilizing existing representation learning and few-shot learning capabilities.

• Multi-Scale Temporal Information in LLMs: To adapt to time-specific information, a two-level aggregation method is proposed to integrate multi-scale temporal data within pre-trained LLMs.

• Robust Performance in Forecasting: LLM4TS excels in 7 real-world time-series forecasting benchmarks, outperforming state-of-the-art methods, including those trained from scratch. It also demonstrates strong few-shot capabilities, particularly with only 5% of data, where it surpasses the best baseline that uses 10% of data. This efficiency makes LLM4TS highly relevant for practical, real-world forecasting applications.

2 Related Work

2.1 Transfer Learning Across Various Modalities with LLMs

LLMs have demonstrated their effectiveness in transfer learning across a variety of modalities, such as images [Lu et al., 2021], audio [Ghosal et al., 2023], tabular data [Hegselmann et al., 2023], and time-series data [Zhou et al., 2023]. A key motivation for employing LLMs in various modalities is their ability to achieve notable performance with limited data [Zhou et al., 2023]. To preserve their data-independent representation learning capability, most parameters in these LLMs are kept fixed. Empirical evidence [Lu et al., 2021; Zhou et al., 2023] indicates that LLMs keeping most parameters unchanged often outperform those trained from scratch, underscoring the value of maintaining these models' pre-existing representation learning strengths. Theoretically, it is shown that the self-attention modules in these pre-trained transformers develop the capacity for data-independent operations (akin to principal component analysis [Zhou et al., 2023]), enabling them to function effectively as universal compute engines [Lu et al., 2021] or general computation calculators [Giannou et al., 2023]. In the time-series domain, GPT4TS [Zhou et al., 2023] utilizes the pre-trained GPT-2 and demonstrates strong performance in time-series forecasting under few-shot conditions without modifying most parameters. With our LLM4TS, we address the challenges of limited adaptation to time-series characteristics and the difficulty in processing multi-scale temporal information, thereby enhancing performance in time-series forecasting.

2.2 Long-term Time-Series Forecasting

Numerous efforts have been dedicated to employing Transformer models for long-term time-series forecasting [Zhou et al., 2021; Wu et al., 2021; Zhou et al., 2022; Nie et al., 2023]. While Transformer-based models have gained traction, DLinear [Zeng et al., 2023] reveals that a single-layer linear model can surpass many of these sophisticated Transformer-based approaches. These deep train-from-scratch models exhibit outstanding performance when trained on sufficient datasets, but their efficacy decreases in limited-data scenarios. In contrast, LLM4TS sets new benchmarks alongside these state-of-the-art approaches in both full- and few-shot scenarios.
2.3 Time-Series Representation Learning

In the time-series domain, self-supervised learning emerges as a prominent approach to representation learning. While Transformers are widely recognized as prime candidates for end-to-end time-series analysis [Xu et al., 2021; Liu et al., 2021; Nie et al., 2023], CNN-based [Yue et al., 2022] or RNN-based [Tonekaboni et al., 2021] backbones consistently stand out as the preferred architecture in time-series self-supervised learning. However, the inherent capability of Transformers to model long-range dependencies and capture patterns aligns perfectly with time-series data, which involve complex sequential relationships. Since the time-series alignment stage in LLM4TS can be seen as a self-supervised learning approach, we evaluate LLM4TS's representation learning capability and demonstrate the full potential of Transformers in self-supervised learning, surpassing the performance of conventional CNN and RNN-based models.

Figure 2: LLM4TS framework. The numbers in the patched time series (e.g., 1, 2, ..., 16 in the first patch) indicate the sequential order of the timestamps. The framework consists of two stages: (a) Time-series alignment, which uses the autoregressive approach to align the pre-trained LLM with patched time-series data. (b) Forecasting fine-tuning, which starts with linear probing (i.e., only the output layer is unfrozen), followed by full fine-tuning (all the layers and PEFT components in the LLM are unfrozen).

3 Problem Formulation

Given a complete and evenly-sampled multivariate time series, we use a sliding data window to extract sequential samples. This window moves with a stride of 1 and has a total length of T_in + T_out, comprising past data x_in = (d_1, ..., d_{T_in}) with a look-back window length T_in and future data x_out = (d_{T_in+1}, ..., d_{T_in+T_out}) with a prediction length T_out. For each time step t, d_t represents a C-dimensional vector, where C denotes the number of features. Our objective is to use the past data x_in ∈ R^{T_in × C} to predict the future data x_out ∈ R^{T_out × C}.
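To make the sampling procedure concrete, the following is a minimal sketch of the sliding-window extraction described above, assuming the series is held in a NumPy array of shape (timesteps, C); the function name and toy data are ours, not from the official implementation.

```python
import numpy as np

def sliding_window_samples(series: np.ndarray, t_in: int, t_out: int):
    """Extract (x_in, x_out) pairs from a (timesteps, C) array with stride 1.

    x_in has shape (t_in, C) and x_out has shape (t_out, C), matching the
    look-back window and prediction horizon defined in Section 3.
    """
    total = t_in + t_out
    for start in range(series.shape[0] - total + 1):
        x_in = series[start : start + t_in]
        x_out = series[start + t_in : start + total]
        yield x_in, x_out

# Example: a toy multivariate series with C = 7 features.
toy = np.random.randn(1000, 7)
x_in, x_out = next(sliding_window_samples(toy, t_in=512, t_out=96))
print(x_in.shape, x_out.shape)  # (512, 7) (96, 7)
```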
4 The Proposed LLM4TS

Figure 2 illustrates our LLM4TS framework, leveraging the pre-trained GPT-2 [Radford et al., 2019] as the backbone model. We first introduce the time-series alignment stage, which focuses on aligning the LLMs with the characteristics of time-series data using an autoregressive objective (Section 4.1). Subsequently, the forecasting fine-tuning stage is designed to further enhance the model's ability to handle time-series forecasting tasks (Section 4.2).

4.1 Time-Series Alignment

Existing LLMs are pre-trained on a general language corpus, which means they fail to learn contextualized information outside linguistic domains; therefore, the time-series alignment stage is proposed to align LLMs with the characteristics of time-series data. Given our selection of GPT-2 [Radford et al., 2019] as the backbone model, which is a causal language model, we ensure that this stage adopts the same autoregressive training methodology used during its pre-training phase. Figure 2(a) illustrates the autoregressive objective in the time-series alignment stage: given an input sequence of patched time-series data (e.g., 1st patch, 2nd patch, 3rd patch, etc.), the backbone model generates an output sequence shifted one patch to the right (e.g., 2nd patch, 3rd patch, 4th patch, etc.).

Instance Normalization. Data normalization is essential for stable performance when adapting pre-trained models across various modalities. Alongside the layer normalization used in the pre-trained LLM, we incorporate instance normalization to improve consistency and reliability in handling diverse time-series datasets. In our model, instance normalization is employed without incorporating a trainable affine transformation. This is crucial because when a batch of data is gathered and instance normalization is applied with a trainable affine transformation, the resulting transformed data becomes unsuitable to be the ground truth for the output. Given that an autoregressive objective is used at this stage, applying a trainable affine transformation is not feasible.

Given an input time-series sample x_in ∈ R^{T_in × C}, we apply instance normalization (IN) to produce a normalized time-series sample x_normed ∈ R^{T_in × C} with zero mean and unit standard deviation:

x_normed = IN(x_in).    (1)
Time-Series Tokenization The context window sizes in attributes (e.g., seconds, minutes, hours, holidays, etc.).
pre-trained LLMs (e.g., 1024 in GPT-2) are sufficient for NLP
tasks but are inadequate for long-term time-series forecasting. 2. Each patch encompasses multiple timestamps.
In our experiments, a prediction length of 720 combined with To address the first challenge associated with diverse tempo-
a look-back window size of 512 easily exceeds these limits. ral attributes within a timestamp, we employ Level 1 aggre-
To address this, we adopt channel-independence along with gation: a trainable lookup table for each temporal attribute
patching [Nie et al., 2023] for time-series tokenization, ef- (e.g., Esec , Emin , ...), mapping it into a high-dimensional
fectively resolving the context window size constraint and si- space, and then summing them to produce a singular tempo-
multaneously reducing the time and space complexity of the ral embedding. In response to the second challenge of mul-
Transformer quadratically. Channel-independence converts tiple timestamps within a patch, we use Level 2 aggregation:
multivariate time-series data into multiple univariate time- a pooling method to extract the final temporal embedding.
series data, thus transforming the data’s dimension to RTin ×1 , For the pooling method, we opt for the “select first” method,
with the channel dimension C merged into the batch size di- where the initial timestamp is designated as representative of
mension. The subsequent patching step groups adjacent time the entire patch. This is because the first timestamp often car-
steps into a singular patch-based token, reducing the input ries the most significant and representative information for
sample’s time dimension from Tin to Tp , where Tp denotes the entire duration covered by the patch, especially in time-
the number of patches, and concurrently expanding the fea- series data where earlier events can have a substantial influ-
ture dimension from 1 to P , with P representing the patch ence on the subsequent sequence. This process generates the
length. final temporal embedding etemp ∈ RTp ×D :
Given a normalized time-series sample xnormed ∈  
RTin ×C , we first apply channel-independence (CI), and then X
patching to produce a series of patches p ∈ RTp ×P : etemp = Pooling  Ea (ta ) , (5)
a∈{sec,min,hour,...}
p = patching(CI(xnormed )). (2)
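The following sketch illustrates channel-independence and patching from Eq. (2) with the experimental settings P = 16 and stride 8; it uses torch.Tensor.unfold and, as a simplifying assumption, omits any end-padding that patching implementations such as PatchTST may apply.

```python
import torch

def channel_independence(x: torch.Tensor) -> torch.Tensor:
    """(batch, T_in, C) -> (batch * C, T_in, 1): each channel becomes a
    univariate series and the channel dimension is folded into the batch."""
    b, t, c = x.shape
    return x.permute(0, 2, 1).reshape(b * c, t, 1)

def patching(x: torch.Tensor, patch_len: int = 16, stride: int = 8) -> torch.Tensor:
    """(batch * C, T_in, 1) -> (batch * C, T_p, P): group adjacent time steps
    into patch tokens of length P with the given stride."""
    x = x.squeeze(-1)                                       # (batch * C, T_in)
    return x.unfold(dimension=-1, size=patch_len, step=stride)

x_normed = torch.randn(32, 512, 7)
p = patching(channel_independence(x_normed))  # Eq. (2): p = patching(CI(x_normed))
print(p.shape)                                # torch.Size([224, 63, 16])
```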
Three Encodings for Patched Time-Series Data. Given our goal to adapt a pre-trained LLM for time-series data, the original token encoding layer (designed for text) becomes unsuitable due to the mismatched modalities. Additionally, we design a new temporal encoding layer to address the inability to process multi-scale temporal information.

Given a series of tokens, applying token encoding is necessary to align their dimensions with the latent embedding dimension of the pre-trained LLM. In standard NLP practices, this encoding uses a trainable lookup table to map tokens into a high-dimensional space. However, this method only suits scalar tokens, whereas our patched time-series data are vectors. Therefore, we drop the original token encoding layer in the LLM, and employ a one-dimensional convolutional layer Conv_token as our new token encoding layer. As opposed to employing a linear layer [Zhou et al., 2023], we choose a convolutional layer due to its superior ability to retain local semantic information within the time-series data. This results in the generation of the token embedding e_token ∈ R^{T_p × D}, where D denotes the dimension of the embeddings:

e_token = Conv_token(p).    (3)

For the positional encoding layer, we employ a trainable lookup table E_pos to map patch locations. This results in the generation of the positional embedding e_pos ∈ R^{T_p × D}:

e_pos = E_pos(i),    (4)

where i ∈ R^{T_p} represents the indices of the patch locations.

To address the challenge LLMs face in processing multi-scale temporal information, we introduce a temporal encoding layer. When processing time-related data, we face two challenges due to the need to aggregate multiple pieces of information into one unified representation (Figure 3):

1. Each timestamp includes a range of multi-scale temporal attributes (e.g., seconds, minutes, hours, holidays, etc.).

2. Each patch encompasses multiple timestamps.

To address the first challenge associated with diverse temporal attributes within a timestamp, we employ Level 1 aggregation: a trainable lookup table for each temporal attribute (e.g., E_sec, E_min, ...), mapping it into a high-dimensional space, and then summing them to produce a singular temporal embedding. In response to the second challenge of multiple timestamps within a patch, we use Level 2 aggregation: a pooling method to extract the final temporal embedding. For the pooling method, we opt for the "select first" method, where the initial timestamp is designated as representative of the entire patch. This is because the first timestamp often carries the most significant and representative information for the entire duration covered by the patch, especially in time-series data where earlier events can have a substantial influence on the subsequent sequence. This process generates the final temporal embedding e_temp ∈ R^{T_p × D}:

e_temp = Pooling( Σ_{a ∈ {sec, min, hour, ...}} E_a(t_a) ),    (5)

where a represents different temporal attributes (seconds, minutes, hours, holidays, etc.), E_a denotes the trainable lookup table for each temporal attribute, t_a ∈ R^{T_p × P} are the series of patches containing temporal information for that temporal attribute, and Pooling applies the pooling method to the aggregated embeddings.

Finally, the token, positional, and temporal embeddings are summed to yield the final embedding e ∈ R^{T_p × D}, which is then fed into the pre-trained Transformer blocks:

e = e_token + e_pos + e_temp.    (6)
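A condensed PyTorch sketch of the three encoding layers (Eqs. (3)-(6)) follows: a one-dimensional convolution for token encoding, a trainable positional lookup table, and a two-level temporal aggregation with "select first" pooling. The module name, the chosen temporal attributes, and their vocabulary sizes are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ThreeEncodings(nn.Module):
    """Token + positional + temporal encodings summed into e (Eq. 6)."""
    def __init__(self, patch_len=16, d_model=768, max_patches=64, attr_sizes=None):
        super().__init__()
        attr_sizes = attr_sizes or {"minute": 60, "hour": 24, "weekday": 7, "month": 13}
        # Token encoding: 1-D convolution over each patch vector (Eq. 3).
        self.conv_token = nn.Conv1d(patch_len, d_model, kernel_size=3, padding=1)
        # Positional encoding: trainable lookup over patch indices (Eq. 4).
        self.pos_table = nn.Embedding(max_patches, d_model)
        # Temporal encoding: one lookup table per temporal attribute (Eq. 5).
        self.attr_tables = nn.ModuleDict(
            {name: nn.Embedding(size, d_model) for name, size in attr_sizes.items()})

    def forward(self, p, temporal_attrs):
        # p: (B, T_p, P); temporal_attrs[name]: (B, T_p, P) integer indices.
        e_token = self.conv_token(p.transpose(1, 2)).transpose(1, 2)     # (B, T_p, D)
        idx = torch.arange(p.size(1), device=p.device)
        e_pos = self.pos_table(idx).unsqueeze(0)                         # (1, T_p, D)
        # Level 1: embed each attribute per timestamp and sum the embeddings.
        summed = sum(self.attr_tables[n](t_a) for n, t_a in temporal_attrs.items())
        # Level 2: "select first" pooling keeps the first timestamp of each patch.
        e_temp = summed[:, :, 0, :]                                      # (B, T_p, D)
        return e_token + e_pos + e_temp                                  # Eq. (6)

enc = ThreeEncodings()
p = torch.randn(4, 63, 16)
sizes = {"minute": 60, "hour": 24, "weekday": 7, "month": 13}
attrs = {name: torch.randint(0, size, (4, 63, 16)) for name, size in sizes.items()}
print(enc(p, attrs).shape)   # torch.Size([4, 63, 768])
```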
Pre-Trained LLM. To preserve LLMs' data-independent representation learning capability, most parameters in these LLMs are kept fixed. Empirical evidence [Lu et al., 2021; Zhou et al., 2023] shows that training these LLMs from scratch often hurts performance, highlighting the importance of fixing most parameters to retain the LLM's representation learning capability. To that end, we opt for freezing most parameters, particularly those associated with the multi-head attention and feed-forward layers in the Transformer block, as they are the most responsible for representation learning [Zhou et al., 2023].

For the remaining trainable parameters in the pre-trained LLM, we employ two Parameter-Efficient Fine-Tuning (PEFT) methods to selectively adjust or introduce a limited set of trainable parameters. Specifically, we utilize Layer Normalization Tuning [Lu et al., 2021] to adjust pre-existing parameters in Transformer blocks, making the affine transformation in layer normalization trainable. Concurrently, we employ Low-Rank Adaptation (LoRA) [Hu et al., 2021], which introduces trainable low-rank matrices that are applied to the query (Q) and key (K) matrices in the self-attention mechanism. With these two PEFT techniques, only 1.5% of the pre-trained LLM's total parameters are trained.
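The sketch below illustrates this parameter-freezing scheme under stated assumptions: it loads GPT-2 through the Hugging Face transformers package, keeps only the first 6 blocks, re-enables the layer-normalization parameters, and shows a hand-rolled LoRA adapter on a stand-in linear layer (in GPT-2 the query/key projections sit inside the fused c_attn module, so wiring LoRA into them would require extra surgery or a PEFT library).

```python
import torch
import torch.nn as nn
from transformers import GPT2Model   # assumption: Hugging Face transformers is installed

class LoRALinear(nn.Module):
    """Minimal LoRA adapter: y = W x + (B A) x, with only A and B trainable."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # frozen pre-trained weight
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.lora_a.T @ self.lora_b.T

# Keep only the first 6 Transformer blocks of GPT-2, freeze everything,
# then re-enable the layer-normalization affine parameters (LayerNorm Tuning).
gpt2 = GPT2Model.from_pretrained("gpt2")
gpt2.h = gpt2.h[:6]
for name, param in gpt2.named_parameters():
    param.requires_grad = "ln" in name                    # ln_1, ln_2, ln_f stay trainable

# Stand-in linear layer; a real implementation would wrap the attention projections.
query_proj = LoRALinear(nn.Linear(768, 768), rank=4)

trainable = sum(p.numel() for p in gpt2.parameters() if p.requires_grad)
total = sum(p.numel() for p in gpt2.parameters())
print(f"trainable fraction of the backbone (LayerNorm only): {trainable / total:.3%}")
```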
Given the embedding e (which is adjusted to the required embedding dimension D by the three encoding layers), we pass it into the pre-trained LLM, which comprises a series of pre-trained Transformer blocks (TBs) (with L blocks in total). This process yields the final embeddings z ∈ R^{T_p × D}:

z = TBs(e).    (7)

After being processed by the pre-trained LLM, we employ a linear output layer W_tsa ∈ R^{P × D} to transform the output embedding back to patched time-series data:

p̂_shifted = z W_tsa^⊤,    (8)

where p̂_shifted ∈ R^{T_p × P} represents our prediction target, corresponding to the original time-series patches (p) shifted one patch to the right, in line with the autoregressive objective of this stage. To ensure the prediction precisely reconstructs the actual shifted patched data p_shifted ∈ R^{T_p × P}, we use the Mean Squared Error (MSE) as the loss function:

L_tsa = MSE(p_shifted, p̂_shifted).    (9)
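To illustrate the autoregressive objective of Eqs. (7)-(9), the following sketch uses placeholder modules in place of the encodings and the frozen GPT-2 blocks; only the shift-by-one-patch target construction and the MSE loss mirror the description above.

```python
import torch
import torch.nn as nn

d_model, patch_len = 768, 16
encode = nn.Linear(patch_len, d_model)        # stand-in for the three encodings
backbone = nn.TransformerEncoder(             # stand-in for the frozen GPT-2 blocks
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
output_layer = nn.Linear(d_model, patch_len)  # W_tsa in Eq. (8)

def alignment_loss(p: torch.Tensor) -> torch.Tensor:
    """Autoregressive alignment objective (Eq. 9).

    p: (B, T_p, P) patched series. The model reads patches 1..T_p-1 under a
    causal mask and is trained to reproduce patches 2..T_p, i.e. the input
    shifted one patch to the right.
    """
    inputs, targets = p[:, :-1], p[:, 1:]
    t = inputs.size(1)
    causal_mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
    z = backbone(encode(inputs), mask=causal_mask)   # (B, T_p - 1, D), Eq. (7)
    p_hat_shifted = output_layer(z)                  # (B, T_p - 1, P), Eq. (8)
    return nn.functional.mse_loss(p_hat_shifted, targets)

loss = alignment_loss(torch.randn(4, 63, patch_len))
loss.backward()
print(float(loss))
```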
4.2 Forecasting Fine-tuning

After aligning the pre-trained LLM with patched time-series data in the time-series alignment stage, we transfer the trained weights of the backbone model, including those from the encoding layers, to the forecasting fine-tuning stage. When fine-tuning the backbone model for the forecasting task, two primary training strategies are available: full fine-tuning (where all model parameters are updated) and linear probing (where only the final linear output layer is updated). Studies have shown that a sequential approach—initial linear probing followed by full fine-tuning (LP-FT, as illustrated in Figure 2(b))—consistently surpasses strategies exclusively employing either method [Kumar et al., 2022]. The superiority of LP-FT is due to its dual-phase approach: first finding an optimized output layer to minimize later adjustments in fine-tuning (preserving feature extractor efficacy for out-of-distribution (OOD) scenarios), and then employing full fine-tuning to adapt the model to the specific task (enhancing in-distribution (ID) accuracy) [Kumar et al., 2022].

For the model architecture in the forecasting fine-tuning stage, we preserve most of the structure as in the time-series alignment stage, including the three encoding layers and the pre-trained LLM. However, there are two architectural differences in this stage: instance normalization and the output layer.

The first architectural difference is in the instance normalization, where we adopt Reversible Instance Normalization (RevIN) [Kim et al., 2021] to enhance forecasting accuracy. RevIN involves batch-specific instance normalization and subsequent denormalization, both sharing the same trainable affine transformation. The additional denormalization step addresses distribution shifts between training and testing data, which is a common challenge in the time-series domain (e.g., seasonal changes). Therefore, during the time-series tokenization step, we apply RevIN's normalization, succeeded by channel-independence and patching:

p = patching(CI(RevIN_norm(x_in))).    (10)

Notably, the denormalization step is applicable only to unpatched time-series data; hence, in the time-series alignment stage, standard instance normalization is employed.

The second architectural difference lies in the output layer, whose function is to transform the final embedding z into the predicted future data, presented in the general (unpatched) time-series format. This involves flattening the data and passing it through the linear output layer W_fft ∈ R^{T_out × (T_p · D)}, followed by rearrangement, and then applying RevIN's denormalization to obtain the final prediction x̂_out ∈ R^{T_out × C}:

x̂_out = RevIN_denorm(Rearrange(Flatten(z) W_fft^⊤)).    (11)

To ensure that this prediction accurately reconstructs the future data x_out ∈ R^{T_out × C}, we use MSE as the loss function:

L_fft = MSE(x_out, x̂_out).    (12)
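A minimal sketch of the LP-FT schedule described above is shown below; the learning rates, epoch counts, and the rule used to pick the trainable backbone parameters are illustrative assumptions, not the paper's exact settings.

```python
import torch.nn as nn
from torch.optim import AdamW

def lp_ft(model: nn.Module, head: nn.Module, train_fn, lp_epochs=5, ft_epochs=10):
    """Linear probing followed by full fine-tuning (LP-FT).

    Phase 1 freezes the backbone and fits only the output layer; phase 2
    re-enables the trainable parts of the backbone (the PEFT parameters in
    LLM4TS) and fine-tunes them jointly with the head at a smaller learning
    rate. `train_fn(params, lr, epochs)` is assumed to run the training loop.
    """
    # Phase 1: linear probing - only the output head is updated.
    for p in model.parameters():
        p.requires_grad = False
    train_fn(list(head.parameters()), lr=1e-3, epochs=lp_epochs)

    # Phase 2: full fine-tuning - re-enable the adapter / LayerNorm parameters.
    for name, p in model.named_parameters():
        p.requires_grad = ("ln" in name) or ("lora" in name)
    trainable = [p for p in model.parameters() if p.requires_grad] + list(head.parameters())
    train_fn(trainable, lr=1e-4, epochs=ft_epochs)

def train_fn(params, lr, epochs):
    optimizer = AdamW(params, lr=lr)
    # Placeholder: the real loop would iterate over the forecasting data for
    # `epochs` passes, minimizing the MSE objective of Eq. (12) with `optimizer`.

# Example wiring with toy modules, just to show the call shape.
lp_ft(nn.Linear(8, 8), nn.Linear(8, 8), train_fn)
```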
5 Experiments

Datasets. We use 7 widely used multivariate time-series datasets to test our LLM4TS framework in long-term forecasting, including Weather, Traffic, Electricity, and 4 ETT sets (ETTh1, ETTh2, ETTm1, ETTm2). Detailed statistics for these datasets are provided in Table 1.

Table 1: Statistical overview of the 7 datasets for long-term time-series forecasting.

Datasets        Features  Timesteps  Granularity
Weather         21        52,696     10 min
Traffic         862       17,544     1 hour
Electricity     321       26,304     1 hour
ETTh1 & ETTh2   7         17,420     1 hour
ETTm1 & ETTm2   7         69,680     5 min
Table 2: Long-term forecasting for multivariate time-series tasks. Results are averaged over prediction lengths T_out ∈ {96, 192, 336, 720} for all datasets. The best average results are in bold, while the second-best results are underlined. Details are reported in Appendix A.1.

Methods LLM4TS GPT4TS DLinear PatchTST FEDformer Autoformer Informer


Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
Weather 0.223 0.260 0.237 0.271 0.249 0.300 0.226 0.264 0.309 0.360 0.338 0.382 0.634 0.521
ETTh1 0.404 0.418 0.428 0.426 0.423 0.437 0.413 0.431 0.440 0.460 0.496 0.487 1.040 0.795
ETTh2 0.333 0.383 0.355 0.395 0.431 0.447 0.330 0.379 0.437 0.449 0.450 0.459 4.431 1.729
ETTm1 0.343 0.378 0.352 0.383 0.357 0.379 0.351 0.381 0.448 0.452 0.588 0.517 0.961 0.734
ETTm2 0.251 0.313 0.267 0.326 0.267 0.334 0.255 0.315 0.305 0.349 0.327 0.371 1.410 0.810
ECL 0.159 0.253 0.167 0.263 0.166 0.264 0.162 0.253 0.214 0.327 0.227 0.338 0.311 0.397
Traffic 0.401 0.273 0.414 0.295 0.434 0.295 0.391 0.264 0.610 0.376 0.628 0.379 0.764 0.416
Avg. Rank 1.286 1.429 3.286 3.000 3.714 3.714 1.714 1.857 5.000 5.000 6.000 6.000 7.000 7.000

Table 3: Few-shot long-term forecasting using 10% and 5% of the training data. For most datasets, results are averaged over prediction lengths T_out ∈ {96, 192, 336, 720}. However, for datasets marked with * (ETTh1, ETTh2, and Traffic) in the 5% setting, only T_out ∈ {96, 192, 336} are used because there are insufficient data to constitute a training set when T_out = 720. The best average results are in bold, while the second-best results are underlined. Details are reported in Appendix A.2.
(a) Few-shot long-term forecasting using 10% of the training data.
Methods LLM4TS GPT4TS DLinear PatchTST FEDformer Autoformer Informer
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
Weather 0.235 0.270 0.238 0.275 0.241 0.283 0.242 0.279 0.284 0.324 0.300 0.342 0.597 0.495
ETTh1 0.525 0.493 0.590 0.525 0.691 0.600 0.633 0.542 0.639 0.561 0.702 0.596 1.199 0.809
ETTh2 0.366 0.407 0.397 0.421 0.605 0.538 0.415 0.431 0.466 0.475 0.488 0.499 3.872 1.513
ETTm1 0.408 0.413 0.464 0.441 0.411 0.429 0.501 0.466 0.722 0.605 0.802 0.628 1.192 0.821
ETTm2 0.276 0.324 0.293 0.335 0.316 0.368 0.296 0.343 0.463 0.488 1.342 0.930 3.370 1.440
ECL 0.172 0.264 0.176 0.269 0.180 0.280 0.180 0.273 0.346 0.427 0.431 0.478 1.195 0.891
Traffic 0.432 0.303 0.440 0.310 0.447 0.313 0.430 0.305 0.663 0.425 0.749 0.446 1.534 0.811
Avg. Rank 1.143 1.000 2.286 2.286 3.857 4.286 3.143 3.000 4.714 4.714 5.857 5.714 7.000 7.000
(b) Few-shot long-term forecasting using 5% of the training data.
Methods LLM4TS GPT4TS DLinear PatchTST FEDformer Autoformer Informer
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
Weather 0.256 0.292 0.264 0.302 0.264 0.309 0.270 0.304 0.310 0.353 0.311 0.354 0.584 0.528
ETTh1* 0.651 0.551 0.682 0.560 0.750 0.611 0.695 0.569 0.659 0.562 0.722 0.599 1.225 0.817
ETTh2* 0.359 0.405 0.401 0.434 0.828 0.616 0.439 0.448 0.441 0.457 0.470 0.489 3.923 1.654
ETTm1 0.413 0.417 0.472 0.450 0.401 0.417 0.527 0.476 0.731 0.593 0.796 0.621 1.163 0.792
ETTm2 0.286 0.332 0.308 0.346 0.399 0.426 0.315 0.353 0.381 0.405 0.389 0.434 3.659 1.490
ECL 0.173 0.266 0.179 0.273 0.177 0.276 0.181 0.277 0.267 0.353 0.346 0.405 1.281 0.930
Traffic* 0.418 0.295 0.434 0.305 0.451 0.317 0.418 0.297 0.677 0.424 0.833 0.502 1.591 0.832
Avg. Rank 1.143 1.000 2.571 2.286 4.000 4.143 3.429 3.286 4.286 4.429 5.571 5.714 7.000 7.000

Baselines. For long-term time-series forecasting, we focus on a range of state-of-the-art models. GPT4TS [Zhou et al., 2023] is distinct in leveraging a pre-trained LLM (GPT-2), while DLinear [Zeng et al., 2023], PatchTST [Nie et al., 2023], FEDformer [Zhou et al., 2022], Autoformer [Wu et al., 2021], and Informer [Zhou et al., 2021] are recognized as train-from-scratch time-series forecasting models. The same set of models is used for few-shot learning and ablation studies. For self-supervised learning, we choose PatchTST, BTSF [Yang and Hong, 2022], TS2Vec [Yue et al., 2022], TNC [Tonekaboni et al., 2021], and TS-TCC [Eldele et al., 2021]. Consistent with prior research [Zhou et al., 2023], we rely on Mean Squared Error (MSE) and Mean Absolute Error (MAE) as evaluation metrics across all experiments.

Implementation Details. For our experiments in long-term time-series forecasting, few-shot learning, and ablation studies, we use the settings from PatchTST [Nie et al., 2023] for a consistent comparison. We set our look-back window length T_in to either 336 or 512 (reporting the best results), and configure the patch length P as 16 with a stride S of 8. For self-supervised learning, the settings are slightly adjusted to T_in = 512, P = 12, and S = 12. Aligned with the GPT4TS configuration [Zhou et al., 2023], we utilize only the first 6 layers of the 12-layer GPT-2 base [Radford et al., 2019].
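For reference, these settings can be collected into a small configuration object; the key names below are ours and only the values mirror the stated implementation details.

```python
# Illustrative experiment configuration; key names are ours, not from the official code.
FORECASTING_CONFIG = {
    "lookback_T_in": [336, 512],          # best of the two is reported
    "patch_length_P": 16,
    "stride_S": 8,
    "backbone": "gpt2",                   # first 6 of the 12 GPT-2 base layers
    "num_layers": 6,
    "prediction_lengths": [96, 192, 336, 720],
}

SELF_SUPERVISED_CONFIG = {**FORECASTING_CONFIG,
                          "lookback_T_in": 512, "patch_length_P": 12, "stride_S": 12}
```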
5.1 Long-Term Time-Series Forecasting

Table 2 presents the results of long-term time-series forecasting averaged over a consistent prediction length set T_out ∈ {96, 192, 336, 720} for all datasets. For each dataset, we train a single model in the time-series alignment stage, which is then applied consistently across all prediction lengths. In contrast, in the forecasting fine-tuning stage, we fine-tune a distinct model for each prediction length, while ensuring that all these models share the same hyperparameters. Although the primary intent of using the pre-trained LLM is for few-shot learning, LLM4TS still surpasses all baseline methods even when given access to the full dataset. By using two-stage fine-tuning and incorporation of multi-scale temporal information, LLM4TS achieves the highest rank in 9 of the 14 evaluations, covering 7 datasets and 2 metrics.

5.2 Few-Shot Learning

Table 3 shows the results of using only 10% and 5% of the training data in long-term time-series forecasting. In our experiments, we maintain consistent splits for training, validation, and test sets in both full- and few-shot learning, and in few-shot scenarios, we intentionally limit the training data percentage. Both LLM4TS and GPT4TS [Zhou et al., 2023] consistently surpass most train-from-scratch models in limited-data scenarios across various datasets, thanks to the pre-existing representation learning capability encapsulated in GPT-2. With the additional time-series alignment and multi-scale temporal information integration, LLM4TS emerges as a better data-efficient time-series forecaster against GPT4TS, achieving better performance across all datasets. Notably, LLM4TS with only 5% of data outperforms the best baseline that uses 10% of data.

5.3 Self-Supervised Learning

Given that the autoregressive objective used in the time-series alignment stage can be seen as a pretext task in self-supervised learning, we aim to assess LLM4TS's representation learning capability. To evaluate the effectiveness of self-supervised learning, we conduct a linear evaluation on time-series forecasting. This involves pre-training the backbone model using the pretext task, freezing its weights, and then training an attached linear layer on the downstream forecasting task. With the backbone model's parameters fixed, strong performance in forecasting depends on the expressiveness of the learned representations. Figure 4 shows LLM4TS's superior performance over competitors on the ETTh1 dataset, highlighting the effectiveness of adapting the LLM to time-series characteristics in the time-series alignment stage.

Figure 4: Self-supervised learning evaluation in forecasting with linear evaluation. We use results averaged over prediction lengths T_out ∈ {24, 48, 168, 336, 720} for the ETTh1 dataset. The best average results are in bold. Details are reported in Appendix A.3.

5.4 Ablation Study

Key Components in LLM4TS. Figure 5a explores the effects of time-series alignment, temporal encoding, and PEFT in LLM4TS, assessing both full- and few-shot scenarios on the ETTh1 dataset. A comparative analysis—with and without these components—highlights their individual importance in enhancing forecasting accuracy in both scenarios. Notably, LLM4TS delivers exceptional performance in few-shot learning, averaging a 6.2% reduction in MSE with each incorporation of these components.

Training Strategies in Forecasting Fine-Tuning. As discussed in Section 4.2, while full fine-tuning (FT) shows superior performance in out-of-distribution (OOD) scenarios and linear probing (LP) excels in in-distribution (ID) scenarios, LP-FT can surpass FT and LP in both OOD and ID scenarios. Figure 5b shows that LP-FT enhances performance in both full- and few-shot learning on the ETTh1 dataset, achieving an average improvement of 0.7% in MSE for full-shot learning and 2.51% for few-shot learning. The subtle improvements in both scenarios can be attributed to the limited number of trainable parameters in LLM4TS's backbone model even when using FT, which narrows the distinction between LP and FT. Notably, few-shot learning benefits more from LP-FT than full-shot learning, mainly because it is more susceptible to severe OOD issues.

Figure 5: Ablation study. (a) Key Components in LLM4TS. (b) Training Strategies in Forecasting Fine-Tuning. Each ablation is conducted under both full- and few-shot learning with 10% training data. We report results averaged over prediction lengths T_out ∈ {96, 192, 336, 720} for the ETTh1 dataset. The best average results are in bold. Details are reported in Appendix A.4.

6 Conclusion

In this paper, we present LLM4TS, a framework for time-series forecasting utilizing pre-trained LLMs. LLM4TS employs a two-stage fine-tuning strategy, beginning with the time-series alignment stage to adapt LLMs to the characteristics of time-series data, followed by the forecasting fine-tuning stage designed for time-series forecasting tasks. Our framework also introduces a novel two-level aggregation method, integrating multi-scale temporal data within pre-trained LLMs to improve their interpretation of time-related information. Through experiments on 7 time-series forecasting datasets, LLM4TS demonstrates superior performance over existing state-of-the-art methods, including those trained from scratch, in both full and few-shot scenarios.

In future work, we plan to extend our research in two directions. First, while we chose GPT-2 as our primary LLM in this paper for a fair comparison over GPT4TS, we plan to evaluate more recent LLMs like GPT-3.5 and LLaMA-2 to assess their advancements. Second, we aim to explore other tasks, such as classification and anomaly detection. Although forecasting is highly relevant to real-world applications without the need for manual labeling, extending it to other tasks enables the broader applicability of our LLM4TS framework.
References

[Chang et al., 2023] Ching Chang, Chiao-Tung Chan, Wei-Yao Wang, Wen-Chih Peng, and Tien-Fu Chen. Timedrl: Disentangled representation learning for multivariate time-series. arXiv preprint arXiv:2312.04142, 2023.

[Eldele et al., 2021] Emadeldeen Eldele, Mohamed Ragab, Zhenghua Chen, Min Wu, C. Kwoh, Xiaoli Li, and Cuntai Guan. Time-series representation learning via temporal and contextual contrasting. In International Joint Conference on Artificial Intelligence, 2021.

[Ghosal et al., 2023] Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, and Soujanya Poria. Text-to-audio generation using instruction-tuned llm and latent diffusion model. ArXiv, abs/2304.13731, 2023.

[Giannou et al., 2023] Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. arXiv preprint arXiv:2301.13196, 2023.

[Hegselmann et al., 2023] Stefan Hegselmann, Alejandro Buendia, Hunter Lang, Monica Agrawal, Xiaoyi Jiang, and David Sontag. Tabllm: Few-shot classification of tabular data with large language models. In International Conference on Artificial Intelligence and Statistics, pages 5549–5581. PMLR, 2023.

[Hoffmann et al., 2022] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

[Hu et al., 2021] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

[Kim et al., 2021] Taesung Kim, Jinhee Kim, Yunwon Tae, Cheonbok Park, Jang-Ho Choi, and Jaegul Choo. Reversible instance normalization for accurate time-series forecasting against distribution shift. In International Conference on Learning Representations, 2021.

[Kumar et al., 2022] Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. arXiv preprint arXiv:2202.10054, 2022.

[Lai et al., 2018] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long- and short-term temporal patterns with deep neural networks. In The 41st international ACM SIGIR conference on research & development in information retrieval, pages 95–104, 2018.

[Liu et al., 2021] Minghao Liu, Shengqi Ren, Siyuan Ma, Jiahui Jiao, Yizhou Chen, Zhiguang Wang, and Wei Song. Gated transformer networks for multivariate time series classification. arXiv preprint arXiv:2103.14438, 2021.

[Lu et al., 2021] Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Pretrained transformers as universal computation engines. arXiv preprint arXiv:2103.05247, 1, 2021.

[Nie et al., 2023] Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In ICLR. OpenReview.net, 2023.

[Radford et al., 2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1:9, 2019.

[Tonekaboni et al., 2021] Sana Tonekaboni, Danny Eytan, and Anna Goldenberg. Unsupervised representation learning for time series with temporal neighborhood coding. In ICLR. OpenReview.net, 2021.

[Touvron et al., 2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. ArXiv, abs/2302.13971, 2023.

[Wu et al., 2021] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34:22419–22430, 2021.

[Xu et al., 2021] Jiehui Xu, Haixu Wu, Jianmin Wang, and Mingsheng Long. Anomaly transformer: Time series anomaly detection with association discrepancy. arXiv preprint arXiv:2110.02642, 2021.

[Yang and Hong, 2022] Ling Yang and Shenda Hong. Unsupervised time-series representation learning with iterative bilinear temporal-spectral fusion. In International Conference on Machine Learning, pages 25038–25054. PMLR, 2022.

[Yeh et al., 2019] Cheng-Han Yeh, Yao-Chung Fan, and Wen-Chih Peng. Interpretable multi-task learning for product quality prediction with attention mechanism. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 1910–1921. IEEE, 2019.

[Yue et al., 2022] Zhihan Yue, Yujing Wang, Juanyong Duan, Tianmeng Yang, Congrui Huang, Yunhai Tong, and Bixiong Xu. Ts2vec: Towards universal representation of time series. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8980–8987, 2022.

[Zeng et al., 2023] Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 11121–11128, 2023.

[Zhang et al., 2022] Xiang Zhang, Ziyuan Zhao, Theodoros Tsiligkaridis, and Marinka Zitnik. Self-supervised contrastive pre-training for time series via time-frequency consistency. Advances in Neural Information Processing Systems, 35:3988–4003, 2022.

[Zhou et al., 2021] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 11106–11115, 2021.

[Zhou et al., 2022] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International Conference on Machine Learning, pages 27268–27286. PMLR, 2022.

[Zhou et al., 2023] Tian Zhou, Peisong Niu, Xue Wang, Liang Sun, and Rong Jin. One fits all: Power general time series analysis by pretrained lm. arXiv preprint arXiv:2302.11939, 2023.
A Detailed Results of Experiments

We include detailed tables for the entire set of experiments in the appendices: long-term forecasting (Appendix A.1), few-shot learning (Appendix A.2), self-supervised learning (Appendix A.3), and ablation studies (Appendix A.4). All results for various prediction lengths are provided in these tables.

A.1 Long-term Time-Series Forecasting

Table 4 showcases the outcomes of long-term time-series forecasting, maintaining a consistent prediction length set T_out ∈ {96, 192, 336, 720} across all datasets. While the primary focus of using pre-trained LLMs is on few-shot learning, LLM4TS notably outperforms all deep train-from-scratch methods, even with full dataset access. Its two-stage fine-tuning approach and integration of multi-scale temporal information enable LLM4TS to secure the top rank in 9 out of 14 evaluations, spanning 7 datasets and 2 metrics. Interestingly, for the largest dataset (Traffic), PatchTST emerges as the leading model. This suggests that with complete dataset access and sufficient data volume, traditional deep train-from-scratch models may sometimes outshine those leveraging pre-trained LLMs.

A.2 Few-Shot Learning

Table 5 shows the results of long-term time-series forecasting using only 10% of the training data, while Table 6 presents similar results but with merely 5% of the training data utilized. In our experiments, consistent splits for training, validation, and test sets are maintained across both full- and few-shot learning scenarios. We deliberately limited the training data percentage to 10% and 5% to evaluate model performance in few-shot scenarios. LLM4TS and GPT4TS [Zhou et al., 2023] consistently outperformed most deep train-from-scratch models in these limited-data scenarios across various datasets, a success attributed to the inherent representation learning capabilities of GPT-2. Enhanced by additional time-series alignment and multi-scale temporal information integration, LLM4TS proves to be a better data-efficient time-series forecaster compared to GPT4TS, delivering superior performance across all datasets. Remarkably, with only 5% of data, LLM4TS surpasses the top baseline using 10% of data.

For the largest dataset (Traffic), PatchTST emerges as the leading model in the full-shot scenario, though this trend does not extend to few-shot scenarios. With only 10% training data, LLM4TS outperforms PatchTST in 5 out of 8 evaluations, and with just 5% training data, it leads in 5 out of 6 evaluations. This suggests that in few-shot scenarios, traditional deep train-from-scratch models generally still underperform compared to those leveraging pre-trained LLMs.

A.3 Self-Supervised Learning

Table 7 showcases LLM4TS's outstanding performance on the ETTh1 dataset, underscoring the success of adapting the LLM to time-series characteristics during the time-series alignment stage. This comparison exclusively includes self-supervised learning methods, thereby excluding deep train-from-scratch models designed explicitly for time-series forecasting. Similarly, GPT4TS is not part of this experiment as it lacks a distinct stage of representation learning. The variant of PatchTST used here differs from that in the forecasting experiments; this variant focuses on representation learning. PatchTST incorporates an MLM (Masked Language Model) approach, akin to BERT, for learning representations. Despite this, LLM4TS still emerges as the top method in representation learning capability among all evaluated methods, achieving an average improvement of 6.02% in MSE.

A.4 Ablation Study

Key Components in LLM4TS. Table 8a explores the effects of time-series alignment, temporal encoding, and PEFT in LLM4TS, assessing both full- and few-shot scenarios on the ETTh1 dataset. A comparative analysis—with and without these components—highlights their individual importance in enhancing forecasting accuracy in both scenarios. Notably, LLM4TS delivers exceptional performance in few-shot learning, averaging a 6.2% reduction in MSE with each incorporation of these components. In the experimental results, we observe two key insights. First, there is a notable trend where the MSE improvement increases as the prediction length extends. This indicates that the core elements of LLM4TS become increasingly beneficial in situations where a higher level of predictive capability is required, particularly evident with longer prediction lengths. Second, few-shot scenarios exhibit more substantial gains than full-shot scenarios upon integrating these main components into LLM4TS. It emphasizes LLM4TS's strength as a data-efficient time-series forecaster, a quality primarily attributed to its intrinsic components.

Training Strategies in Forecasting Fine-Tuning. Table 8b demonstrates that LP-FT boosts performance on the ETTh1 dataset in both full- and few-shot learning scenarios, with an average MSE improvement of 0.7% in full-shot and 2.51% in few-shot learning. The subtle enhancements in both cases are likely due to the limited number of trainable parameters in LLM4TS's backbone model even with FT, leading to a more negligible differentiation between LP and FT. The results further reveal that few-shot learning derives a greater advantage from LP-FT, primarily due to its higher vulnerability to severe OOD issues. Additionally, consistent with observations in the ablation study of LLM4TS's main components, we note a similar trend where longer prediction lengths yield more significant benefits in few-shot scenarios.
Table 4: Long-term forecasting for multivariate time-series data. We use prediction lengths T ∈ {96, 192, 336, 720} for all datasets. The
best results are in bold, while the second-best results are underlined.

Methods LLM4TS GPT4TS DLinear PatchTST FEDformer Autoformer Informer


Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
96 0.147 0.196 0.162 0.212 0.176 0.237 0.149 0.198 0.217 0.296 0.266 0.336 0.300 0.384
192 0.191 0.238 0.204 0.248 0.220 0.282 0.194 0.241 0.276 0.336 0.307 0.367 0.598 0.434
Weather
336 0.241 0.277 0.254 0.286 0.265 0.319 0.245 0.282 0.339 0.380 0.359 0.395 0.578 0.523
720 0.313 0.329 0.326 0.337 0.333 0.362 0.314 0.334 0.403 0.428 0.419 0.428 1.059 0.741
96 0.371 0.394 0.376 0.397 0.375 0.399 0.370 0.399 0.376 0.419 0.449 0.459 0.865 0.713
192 0.403 0.412 0.416 0.418 0.405 0.416 0.413 0.421 0.420 0.448 0.500 0.482 1.008 0.792
ETTh1
336 0.420 0.422 0.442 0.433 0.439 0.443 0.422 0.436 0.459 0.465 0.521 0.496 1.107 0.809
720 0.422 0.444 0.477 0.456 0.472 0.490 0.447 0.466 0.506 0.507 0.514 0.512 1.181 0.865
96 0.269 0.332 0.285 0.342 0.289 0.353 0.274 0.336 0.358 0.397 0.346 0.388 3.755 1.525
192 0.328 0.377 0.354 0.389 0.383 0.418 0.339 0.379 0.429 0.439 0.456 0.452 5.602 1.931
ETTh2
336 0.353 0.396 0.373 0.407 0.448 0.465 0.329 0.380 0.496 0.487 0.482 0.486 4.721 1.835
720 0.383 0.425 0.406 0.441 0.605 0.551 0.379 0.422 0.463 0.474 0.515 0.511 3.647 1.625
96 0.285 0.343 0.292 0.346 0.299 0.343 0.290 0.342 0.379 0.419 0.505 0.475 0.672 0.571
192 0.324 0.366 0.332 0.372 0.335 0.365 0.332 0.369 0.426 0.441 0.553 0.496 0.795 0.669
ETTm1
336 0.353 0.385 0.366 0.394 0.369 0.386 0.366 0.392 0.445 0.459 0.621 0.537 1.212 0.871
720 0.408 0.419 0.417 0.421 0.425 0.421 0.416 0.420 0.543 0.490 0.671 0.561 1.166 0.823
96 0.165 0.254 0.173 0.262 0.167 0.269 0.165 0.255 0.203 0.287 0.255 0.339 0.365 0.453
192 0.220 0.292 0.229 0.301 0.224 0.303 0.220 0.292 0.269 0.328 0.281 0.340 0.533 0.563
ETTm2
336 0.268 0.326 0.286 0.341 0.281 0.342 0.274 0.329 0.325 0.366 0.339 0.372 1.363 0.887
720 0.350 0.380 0.378 0.401 0.397 0.421 0.362 0.385 0.421 0.415 0.433 0.432 3.379 1.338
96 0.128 0.223 0.139 0.238 0.140 0.237 0.129 0.222 0.193 0.308 0.201 0.317 0.274 0.368
192 0.146 0.240 0.153 0.251 0.153 0.249 0.157 0.240 0.201 0.315 0.222 0.334 0.296 0.386
ECL
336 0.163 0.258 0.169 0.266 0.169 0.267 0.163 0.259 0.214 0.329 0.231 0.338 0.300 0.394
720 0.200 0.292 0.206 0.297 0.203 0.301 0.197 0.290 0.246 0.355 0.254 0.361 0.373 0.439
96 0.372 0.259 0.388 0.282 0.410 0.282 0.360 0.249 0.587 0.366 0.613 0.388 0.719 0.391
192 0.391 0.265 0.407 0.290 0.423 0.287 0.379 0.256 0.604 0.373 0.616 0.382 0.696 0.379
Traffic
336 0.405 0.275 0.412 0.294 0.436 0.296 0.392 0.264 0.621 0.383 0.622 0.337 0.777 0.420
720 0.437 0.292 0.450 0.312 0.466 0.315 0.432 0.286 0.626 0.382 0.660 0.408 0.864 0.472
Table 5: Few-shot long-term forecasting using 10% of the training data. We use prediction lengths T ∈ {96, 192, 336, 720} for all datasets.
The best results are in bold, while the second-best results are underlined.

Methods LLM4TS GPT4TS DLinear PatchTST FEDformer Autoformer Informer


Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
96 0.158 0.207 0.163 0.215 0.171 0.224 0.165 0.215 0.188 0.253 0.221 0.297 0.374 0.401
192 0.204 0.249 0.210 0.254 0.215 0.263 0.210 0.257 0.250 0.304 0.270 0.322 0.552 0.478
Weather
336 0.254 0.288 0.256 0.292 0.258 0.299 0.259 0.297 0.312 0.346 0.320 0.351 0.724 0.541
720 0.322 0.336 0.321 0.339 0.320 0.346 0.332 0.346 0.387 0.393 0.390 0.396 0.739 0.558
96 0.417 0.432 0.458 0.456 0.492 0.495 0.516 0.485 0.512 0.499 0.613 0.552 1.179 0.792
192 0.469 0.468 0.570 0.516 0.565 0.538 0.598 0.524 0.624 0.555 0.722 0.598 1.199 0.806
ETTh1
336 0.505 0.499 0.608 0.535 0.721 0.622 0.657 0.550 0.691 0.574 0.750 0.619 1.202 0.811
720 0.708 0.572 0.725 0.591 0.986 0.743 0.762 0.610 0.728 0.614 0.721 0.616 1.217 0.825
96 0.282 0.351 0.331 0.374 0.357 0.411 0.353 0.389 0.382 0.416 0.413 0.451 3.837 1.508
192 0.364 0.400 0.402 0.411 0.569 0.519 0.403 0.414 0.478 0.474 0.474 0.477 3.856 1.513
ETTh2
336 0.374 0.416 0.406 0.433 0.671 0.572 0.426 0.441 0.504 0.501 0.547 0.543 3.952 1.526
720 0.445 0.461 0.449 0.464 0.824 0.648 0.477 0.480 0.499 0.509 0.516 0.523 3.842 1.503
96 0.360 0.388 0.390 0.404 0.352 0.392 0.410 0.419 0.578 0.518 0.774 0.614 1.162 0.785
192 0.386 0.401 0.429 0.423 0.382 0.412 0.437 0.434 0.617 0.546 0.754 0.592 1.172 0.793
ETTm1
336 0.415 0.417 0.469 0.439 0.419 0.434 0.476 0.454 0.998 0.775 0.869 0.677 1.227 0.908
720 0.470 0.445 0.569 0.498 0.490 0.477 0.681 0.556 0.693 0.579 0.810 0.630 1.207 0.797
96 0.184 0.265 0.188 0.269 0.213 0.303 0.191 0.274 0.291 0.399 0.352 0.454 3.203 1.407
192 0.240 0.301 0.251 0.309 0.278 0.345 0.252 0.317 0.307 0.379 0.694 0.691 3.112 1.387
ETTm2
336 0.294 0.337 0.307 0.346 0.338 0.385 0.306 0.353 0.543 0.559 2.408 1.407 3.255 1.421
720 0.386 0.393 0.426 0.417 0.436 0.440 0.433 0.427 0.712 0.614 1.913 1.166 3.909 1.543
96 0.135 0.231 0.139 0.237 0.150 0.253 0.140 0.238 0.231 0.323 0.261 0.348 1.259 0.919
192 0.152 0.246 0.156 0.252 0.164 0.264 0.160 0.255 0.261 0.356 0.338 0.406 1.160 0.873
ECL
336 0.173 0.267 0.175 0.270 0.181 0.282 0.180 0.276 0.360 0.445 0.410 0.474 1.157 0.872
720 0.229 0.312 0.233 0.317 0.223 0.321 0.241 0.323 0.530 0.585 0.715 0.685 1.203 0.898
96 0.402 0.288 0.414 0.297 0.419 0.298 0.403 0.289 0.639 0.400 0.672 0.405 1.557 0.821
192 0.416 0.294 0.426 0.301 0.434 0.305 0.415 0.296 0.637 0.416 0.727 0.424 1.454 0.765
Traffic
336 0.429 0.302 0.434 0.303 0.449 0.313 0.426 0.304 0.655 0.427 0.749 0.454 1.521 0.812
720 0.480 0.326 0.487 0.337 0.484 0.336 0.474 0.331 0.722 0.456 0.847 0.499 1.605 0.846
Table 6: Few-shot long-term forecasting using 5% of the training data. For most datasets, results are reported over prediction lengths T_out ∈ {96, 192, 336, 720}. However, for datasets marked with * (ETTh1, ETTh2, and Traffic), only T_out ∈ {96, 192, 336} are used because there are insufficient data to constitute a training set when T_out = 720. The best results are in bold, while the second-best results are underlined.

Methods LLM4TS GPT4TS DLinear PatchTST FEDformer Autoformer Informer


Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
96 0.173 0.227 0.175 0.230 0.184 0.242 0.171 0.224 0.229 0.309 0.227 0.299 0.497 0.497
192 0.218 0.265 0.227 0.276 0.228 0.283 0.230 0.277 0.265 0.317 0.278 0.333 0.620 0.545
Weather
336 0.276 0.310 0.286 0.322 0.279 0.322 0.294 0.326 0.353 0.392 0.351 0.393 0.649 0.547
720 0.355 0.366 0.366 0.379 0.364 0.388 0.384 0.387 0.391 0.394 0.387 0.389 0.570 0.522
96 0.509 0.484 0.543 0.506 0.547 0.503 0.557 0.519 0.593 0.529 0.681 0.570 1.225 0.812
192 0.717 0.581 0.748 0.580 0.720 0.604 0.711 0.570 0.652 0.563 0.725 0.602 1.249 0.828
ETTh1*
336 0.728 0.589 0.754 0.595 0.984 0.727 0.816 0.619 0.731 0.594 0.761 0.624 1.202 0.811
720 - - - - - - - - - - - - - -
96 0.314 0.375 0.376 0.421 0.442 0.456 0.401 0.421 0.390 0.424 0.428 0.468 3.837 1.508
192 0.365 0.408 0.418 0.441 0.617 0.542 0.452 0.455 0.457 0.465 0.496 0.504 3.975 1.933
ETTh2*
336 0.398 0.432 0.408 0.439 1.424 0.849 0.464 0.469 0.477 0.483 0.486 0.496 3.956 1.520
720 - - - - - - - - - - - - - -
96 0.349 0.379 0.386 0.405 0.332 0.374 0.399 0.414 0.628 0.544 0.726 0.578 1.130 0.775
192 0.374 0.394 0.440 0.438 0.358 0.390 0.441 0.436 0.666 0.566 0.750 0.591 1.150 0.788
ETTm1
336 0.411 0.417 0.485 0.459 0.402 0.416 0.499 0.467 0.807 0.628 0.851 0.659 1.198 0.809
720 0.516 0.479 0.577 0.499 0.511 0.489 0.767 0.587 0.822 0.633 0.857 0.655 1.175 0.794
96 0.192 0.273 0.199 0.280 0.236 0.326 0.206 0.288 0.229 0.320 0.232 0.322 3.599 1.478
192 0.249 0.309 0.256 0.316 0.306 0.373 0.264 0.324 0.394 0.361 0.291 0.357 3.578 1.475
ETTm2
336 0.301 0.342 0.318 0.353 0.380 0.423 0.334 0.367 0.378 0.427 0.478 0.517 3.561 1.473
720 0.402 0.405 0.460 0.436 0.674 0.583 0.454 0.432 0.523 0.510 0.553 0.538 3.896 1.533
96 0.139 0.235 0.143 0.241 0.150 0.251 0.145 0.244 0.235 0.322 0.297 0.367 1.265 0.919
192 0.155 0.249 0.159 0.255 0.163 0.263 0.163 0.260 0.247 0.341 0.308 0.375 1.298 0.939
ECL
336 0.174 0.269 0.179 0.274 0.175 0.278 0.183 0.281 0.267 0.356 0.354 0.411 1.302 0.942
720 0.222 0.310 0.233 0.323 0.219 0.311 0.233 0.323 0.318 0.394 0.426 0.466 1.259 0.919
96 0.401 0.285 0.419 0.298 0.427 0.304 0.404 0.286 0.670 0.421 0.795 0.481 1.557 0.821
192 0.418 0.293 0.434 0.305 0.447 0.315 0.412 0.294 0.653 0.405 0.837 0.503 1.596 0.834
Traffic*
336 0.436 0.308 0.449 0.313 0.478 0.333 0.439 0.310 0.707 0.445 0.867 0.523 1.621 0.841
720 - - - - - - - - - - - - - -

Table 7: Self-supervised learning evaluation in forecasting with linear probing. We use prediction lengths T_out ∈ {24, 48, 168, 336, 720} for the ETTh1 dataset. The best average results are in bold, while the second-best results are underlined.

Methods LLM4TS PatchTST BTSF TS2Vec TNC TS-TCC


Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
24 0.315 0.365 0.322 0.369 0.541 0.519 0.599 0.534 0.632 0.596 0.653 0.61
48 0.342 0.384 0.354 0.385 0.613 0.524 0.629 0.555 0.705 0.688 0.72 0.693
168 0.401 0.415 0.419 0.424 0.64 0.532 0.755 0.636 1.097 0.993 1.129 1.044
ETTh1
336 0.421 0.427 0.445 0.446 0.864 0.689 0.907 0.717 1.454 0.919 1.492 1.076
720 0.426 0.447 0.487 0.478 0.993 0.712 1.048 0.79 1.604 1.118 1.603 1.206
Avg. 0.381 0.408 0.405 0.420 0.730 0.595 0.788 0.646 1.098 0.863 1.119 0.926
Table 8: Ablation study. Each ablation is conducted under standard and few-shot learning with 10% training data. IMP. denotes the average improvement achieved by incorporating one of LLM4TS's core components. We use prediction lengths T_out ∈ {96, 192, 336, 720} for the ETTh1 dataset. The best average results are in bold.

(a) Ablation study of time-series alignment, temporal encoding, and PEFT in LLM4TS.
Methods IMP. LLM4TS w/o Time-Series Fine-Tuning w/o Temporal Encoding w/o PEFT
Metric MSE MSE MAE MSE MAE MSE MAE MSE MAE
96 0.89% 0.371 0.394 0.372 0.395 0.378 0.397 0.373 0.393
192 1.14% 0.403 0.412 0.404 0.411 0.411 0.416 0.408 0.412
ETTh1 (full) 336 1.16% 0.42 0.422 0.422 0.423 0.433 0.43 0.42 0.421
720 2.21% 0.422 0.444 0.424 0.448 0.442 0.457 0.429 0.45
Avg. 1.37% 0.404 0.418 0.406 0.419 0.416 0.425 0.408 0.419
96 1.80% 0.417 0.432 0.43 0.438 0.422 0.434 0.422 0.433
192 1.01% 0.469 0.468 0.488 0.474 0.463 0.465 0.471 0.462
ETTh1 (10%) 336 4.03% 0.505 0.499 0.538 0.506 0.516 0.508 0.525 0.504
720 11.89% 0.708 0.572 0.762 0.589 0.714 0.584 0.98 0.672
Avg. 6.20% 0.525 0.493 0.555 0.502 0.529 0.498 0.600 0.518

(b) Ablation study of training strategies during forecasting fine-tuning.


Methods IMP. LLM4TS w/ LP FT (Ours) LLM4TS w/ FT LLM4TS w/ LP
Metric MSE MSE MAE MSE MAE MSE MAE
96 0.80% 0.371 0.394 0.371 0.394 0.377 0.398
192 0.98% 0.403 0.412 0.404 0.413 0.41 0.416
ETTh1 (full) 336 0.70% 0.42 0.422 0.42 0.423 0.426 0.424
720 0.35% 0.422 0.444 0.42 0.444 0.427 0.447
Avg. 0.70% 0.404 0.418 0.404 0.419 0.410 0.421
96 0.12% 0.421 0.435 0.423 0.436 0.42 0.433
192 2.52% 0.454 0.457 0.477 0.474 0.455 0.454
ETTh1 (10%) 336 2.36% 0.515 0.504 0.545 0.524 0.511 0.507
720 3.92% 0.711 0.574 0.743 0.589 0.737 0.589
Avg. 2.51% 0.525 0.493 0.547 0.506 0.531 0.496
