LLM4TS: Aligning Pre-Trained LLMs as Data-Efficient Time-Series Forecasters
Figure 2: LLM4TS framework. The numbers in the patched time series (e.g., 1, 2, ..., 16 in the first patch) indicate the sequential order
of the timestamps. The framework consists of two stages: (a) Time-series alignment, which uses the autoregressive approach to align the
pre-trained LLM with patched time-series data. (b) Forecasting fine-tuning, which starts with linear probing (i.e., only the output layer is
unfrozen), followed by full fine-tuning (all the layers and PEFT components in the LLM are unfrozen).
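To make stage (a) concrete, the following is a minimal sketch of an autoregressive patch-level alignment objective, assuming a generic PyTorch backbone; the PatchedBackbone wrapper, its dimensions, and the reconstruction head are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

# Illustrative wrapper (not the paper's exact architecture): patch embedding ->
# pre-trained LLM blocks -> linear head that reconstructs a patch.
class PatchedBackbone(nn.Module):
    def __init__(self, patch_len=16, d_model=768, llm_blocks=None):
        super().__init__()
        self.embed = nn.Linear(patch_len, d_model)   # patch -> token embedding
        self.blocks = llm_blocks or nn.Identity()    # stand-in for the pre-trained LLM layers
        self.head = nn.Linear(d_model, patch_len)    # token -> reconstructed patch

    def forward(self, patches):                      # patches: (batch, num_patches, patch_len)
        return self.head(self.blocks(self.embed(patches)))

def alignment_loss(model, patches):
    """Stage (a): autoregressively predict patch t+1 from the patches up to t."""
    inputs, targets = patches[:, :-1, :], patches[:, 1:, :]
    return nn.functional.mse_loss(model(inputs), targets)

# Example: a batch of 4 series, each split into 63 patches of length 16.
loss = alignment_loss(PatchedBackbone(), torch.randn(4, 63, 16))
```

Stage (b) would reuse the same backbone but swap the reconstruction head for a forecasting head, trained with the LP-FT schedule described in the caption.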
Table 3: Few-shot long-term forecasting using 10% and 5% of the training data. For most datasets, results are averaged over prediction
lengths Tout ∈ {96, 192, 336, 720}. However, for datasets marked with * (ETTh1, ETTh2, and Traffic) in the 5% setting, only Tout ∈
{96, 192, 336} are used because there are insufficient data to constitute a training set when Tout = 720. The best average results are in bold,
while the second-best results are underlined. Details are reported in Appendix A.2.
(a) Few-shot long-term forecasting using 10% of the training data.
Methods LLM4TS GPT4TS DLinear PatchTST FEDformer Autoformer Informer
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
Weather 0.235 0.270 0.238 0.275 0.241 0.283 0.242 0.279 0.284 0.324 0.300 0.342 0.597 0.495
ETTh1 0.525 0.493 0.590 0.525 0.691 0.600 0.633 0.542 0.639 0.561 0.702 0.596 1.199 0.809
ETTh2 0.366 0.407 0.397 0.421 0.605 0.538 0.415 0.431 0.466 0.475 0.488 0.499 3.872 1.513
ETTm1 0.408 0.413 0.464 0.441 0.411 0.429 0.501 0.466 0.722 0.605 0.802 0.628 1.192 0.821
ETTm2 0.276 0.324 0.293 0.335 0.316 0.368 0.296 0.343 0.463 0.488 1.342 0.930 3.370 1.440
ECL 0.172 0.264 0.176 0.269 0.180 0.280 0.180 0.273 0.346 0.427 0.431 0.478 1.195 0.891
Traffic 0.432 0.303 0.440 0.310 0.447 0.313 0.430 0.305 0.663 0.425 0.749 0.446 1.534 0.811
Avg. Rank 1.143 1.000 2.286 2.286 3.857 4.286 3.143 3.000 4.714 4.714 5.857 5.714 7.000 7.000
(b) Few-shot long-term forecasting using 5% of the training data.
Methods LLM4TS GPT4TS DLinear PatchTST FEDformer Autoformer Informer
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
Weather 0.256 0.292 0.264 0.302 0.264 0.309 0.270 0.304 0.310 0.353 0.311 0.354 0.584 0.528
ETTh1* 0.651 0.551 0.682 0.560 0.750 0.611 0.695 0.569 0.659 0.562 0.722 0.599 1.225 0.817
ETTh2* 0.359 0.405 0.401 0.434 0.828 0.616 0.439 0.448 0.441 0.457 0.470 0.489 3.923 1.654
ETTm1 0.413 0.417 0.472 0.450 0.401 0.417 0.527 0.476 0.731 0.593 0.796 0.621 1.163 0.792
ETTm2 0.286 0.332 0.308 0.346 0.399 0.426 0.315 0.353 0.381 0.405 0.389 0.434 3.659 1.490
ECL 0.173 0.266 0.179 0.273 0.177 0.276 0.181 0.277 0.267 0.353 0.346 0.405 1.281 0.930
Traffic* 0.418 0.295 0.434 0.305 0.451 0.317 0.418 0.297 0.677 0.424 0.833 0.502 1.591 0.832
Avg. Rank 1.143 1.000 2.571 2.286 4.000 4.143 3.429 3.286 4.286 4.429 5.571 5.714 7.000 7.000
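As an aside on how the "Avg. Rank" rows can be read, the sketch below ranks the methods within each dataset (rank 1 = best MSE) and averages the ranks over the seven datasets; the pandas layout and the subset of three methods are our simplification for brevity, not part of the paper.

```python
import pandas as pd

# MSE values transcribed from Table 3(a), 10% setting (three methods shown for brevity).
mse = pd.DataFrame(
    {"LLM4TS":   [0.235, 0.525, 0.366, 0.408, 0.276, 0.172, 0.432],
     "GPT4TS":   [0.238, 0.590, 0.397, 0.464, 0.293, 0.176, 0.440],
     "PatchTST": [0.242, 0.633, 0.415, 0.501, 0.296, 0.180, 0.430]},
    index=["Weather", "ETTh1", "ETTh2", "ETTm1", "ETTm2", "ECL", "Traffic"],
)

# Rank methods within each dataset (lower MSE = rank 1), then average across datasets.
print(mse.rank(axis=1).mean())
# LLM4TS comes out at 1.143 (first on six datasets, second on Traffic),
# consistent with the Avg. Rank row in Table 3(a).
```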
al., 2021], and Informer [Zhou et al., 2021] are recognized as train-from-scratch time-series forecasting models. The same set of models is used for few-shot learning and ablation studies. For self-supervised learning, we choose PatchTST, BTSF [Yang and Hong, 2022], TS2Vec [Yue et al., 2022], TNC [Tonekaboni et al., 2021], and TS-TCC [Eldele et al., 2021]. Consistent with prior research [Zhou et al., 2023], we rely on Mean Squared Error (MSE) and Mean Absolute Error (MAE) as evaluation metrics across all experiments.

Implementation Details For our experiments in long-term time-series forecasting, few-shot learning, and ablation studies, we use the settings from PatchTST [Nie et al., 2023] for a consistent comparison. We set our look-back window length Tin to either 336 or 512 (reporting the best results), and configure the patch length P as 16 with a stride S of 8. For self-supervised learning, the settings are slightly adjusted to Tin = 512, P = 12, and S = 12. Aligned with the GPT4TS configuration [Zhou et al., 2023], we utilize only the first 6 layers of the 12-layer GPT-2 base [Radford et al., 2019].
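As a concrete illustration of these settings, here is a minimal sketch, assuming PyTorch and the Hugging Face transformers library, of splitting a look-back window into overlapping patches with P = 16 and S = 8 and of retaining only the first 6 transformer blocks of GPT-2; the variable names and loading code mirror the described configuration but are our assumptions, not the released implementation.

```python
import torch
from transformers import GPT2Model

# Split a look-back window into overlapping patches
# (illustrative values from the paper: Tin = 512, patch length P = 16, stride S = 8).
T_in, P, S = 512, 16, 8
series = torch.randn(32, T_in)        # (batch, look-back window)
patches = series.unfold(-1, P, S)     # (32, 63, 16): 63 overlapping patches per series

# Keep only the first 6 of the 12 transformer blocks in GPT-2 base,
# mirroring the GPT4TS configuration.
gpt2 = GPT2Model.from_pretrained("gpt2")
gpt2.h = gpt2.h[:6]                   # truncate the block list
gpt2.config.n_layer = 6
```

Each patch would then be linearly embedded and fed to the truncated backbone, as depicted in Figure 2.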
5.1 Long-Term Time-Series Forecasting
Table 2 presents the results of long-term time-series forecasting averaged over a consistent prediction length set Tout ∈ {96, 192, 336, 720} for all datasets. For each dataset, we train a single model in the time-series alignment stage, which is then applied consistently across all prediction lengths. In contrast, in the forecasting fine-tuning stage, we fine-tune a distinct model for each prediction length, while ensuring that all these models share the same hyperparameters. Although the primary intent of using the pre-trained LLM is for few-shot learning, LLM4TS still surpasses all baseline methods even when given access to the full dataset. Through its two-stage fine-tuning and the incorporation of multi-scale temporal information, LLM4TS achieves the highest rank in 9 of the 14 evaluations, covering 7 datasets and 2 metrics.

5.2 Few-Shot Learning
Table 3 shows the results of using only 10% and 5% of the training data in long-term time-series forecasting. In our experiments, we maintain consistent splits for training, validation, and test sets in both full- and few-shot learning; in few-shot scenarios, we intentionally limit the training data percentage. Both LLM4TS and GPT4TS [Zhou et al., 2023] consistently surpass most train-from-scratch models in limited-data scenarios across various datasets, thanks to the pre-existing representation learning capability encapsulated in GPT-2. With the additional time-series alignment and multi-scale temporal information integration, LLM4TS emerges as a more data-efficient time-series forecaster than GPT4TS, achieving better performance across all datasets. Notably, LLM4TS with only 5% of the data outperforms the best baseline that uses 10% of the data.

Figure 4: Self-supervised learning evaluation in forecasting with linear evaluation. We use results averaged over prediction lengths Tout ∈ {24, 48, 168, 336, 720} for the ETTh1 dataset. The best average results are in bold. Details are reported in Appendix A.3.

Figure 5: Ablation study, covering (a) key components in LLM4TS and (b) training strategies in forecasting fine-tuning. Each ablation is conducted under both full- and few-shot learning with 10% training data. We report results averaged over prediction lengths Tout ∈ {96, 192, 336, 720} for the ETTh1 dataset. The best average results are in bold. Details are reported in Appendix A.4.

Notably, LLM4TS delivers exceptional performance in few-shot learning, averaging a 6.2% reduction in MSE with each incorporation of these components.

Training Strategies in Forecasting Fine-Tuning As discussed in Section 4.2, while full fine-tuning (FT) shows superior performance in out-of-distribution (OOD) scenarios and linear probing (LP) excels in in-distribution (ID) scenarios, LP-FT can surpass both FT and LP in OOD and ID scenarios alike. Figure 5b shows that LP-FT enhances performance in both full- and few-shot learning on the ETTh1 dataset, achieving an average improvement of 0.7% in MSE for full-shot learning and 2.51% for few-shot learning. The subtle improvements in both scenarios can be attributed to the limited number of trainable parameters in LLM4TS's backbone model even when using FT, which narrows the distinction between LP and FT. Notably, few-shot learning benefits more from LP-FT than full-shot learning, mainly because it is more susceptible to severe OOD issues.
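As a concrete reading of this LP-FT schedule, here is a minimal sketch assuming a PyTorch model that exposes `backbone` (the partially trainable LLM) and `output_head` (the forecasting layer); these attribute names, the optimizer, and the epoch split are illustrative assumptions, not the authors' training code.

```python
import torch

def lp_ft(model, train_loader, loss_fn, lp_epochs=2, ft_epochs=8, lr=1e-4):
    """Two-step fine-tuning: linear probing (LP) followed by full fine-tuning (FT)."""
    # --- Linear probing: freeze everything except the output head. ---
    for p in model.parameters():
        p.requires_grad = False
    for p in model.output_head.parameters():
        p.requires_grad = True
    run_epochs(model, train_loader, loss_fn, lr, lp_epochs)

    # --- Full fine-tuning: also unfreeze the backbone's trainable components. ---
    for p in model.backbone.parameters():
        p.requires_grad = True
    run_epochs(model, train_loader, loss_fn, lr, ft_epochs)

def run_epochs(model, loader, loss_fn, lr, epochs):
    # Optimize only the parameters that are currently trainable.
    params = [p for p in model.parameters() if p.requires_grad]
    optim = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            optim.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optim.step()
```

Running LP before FT lets the randomly initialized head settle before the pre-trained representations are perturbed, which is the intuition behind LP-FT's robustness in both ID and OOD settings.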
Table 7: Self-supervised learning evaluation in forecasting with linear probing. We use prediction lengths Tout ∈ {24, 48, 168, 336, 720} for
the ETTh1 dataset. The best average results are in bold, while the second-best results are underlined.
(a) Ablation study of time-series alignment, temporal encoding, and PEFT in LLM4TS.
Methods IMP. LLM4TS w/o Time-Series Alignment w/o Temporal Encoding w/o PEFT
Metric MSE MSE MAE MSE MAE MSE MAE MSE MAE
96 0.89% 0.371 0.394 0.372 0.395 0.378 0.397 0.373 0.393
192 1.14% 0.403 0.412 0.404 0.411 0.411 0.416 0.408 0.412
ETTh1 (full) 336 1.16% 0.420 0.422 0.422 0.423 0.433 0.430 0.420 0.421
720 2.21% 0.422 0.444 0.424 0.448 0.442 0.457 0.429 0.450
Avg. 1.37% 0.404 0.418 0.406 0.419 0.416 0.425 0.408 0.419
96 1.80% 0.417 0.432 0.430 0.438 0.422 0.434 0.422 0.433
192 1.01% 0.469 0.468 0.488 0.474 0.463 0.465 0.471 0.462
ETTh1 (10%) 336 4.03% 0.505 0.499 0.538 0.506 0.516 0.508 0.525 0.504
720 11.89% 0.708 0.572 0.762 0.589 0.714 0.584 0.980 0.672
Avg. 6.20% 0.525 0.493 0.555 0.502 0.529 0.498 0.600 0.518