
TIME-LLM: TIME SERIES FORECASTING BY REPROGRAMMING LARGE LANGUAGE MODELS

Ming Jin1∗, Shiyu Wang2∗, Lintao Ma2, Zhixuan Chu2, James Y. Zhang2, Xiaoming Shi2,
Pin-Yu Chen3, Yuxuan Liang6, Yuan-Fang Li1, Shirui Pan4†, Qingsong Wen5†
1Monash University  2Ant Group  3IBM Research  4Griffith University  5Alibaba Group
6The Hong Kong University of Science and Technology (Guangzhou)
∗Equal Contribution  †Corresponding Authors
{ming.jin, yuanfang.li}@monash.edu, [email protected]
[email protected], [email protected], [email protected]
{weiming.wsy,lintao.mlt,chuzhixuan.czx,james.z,peter.sxm}@antgroup.com

arXiv:2310.01728v1 [cs.LG] 3 Oct 2023

ABSTRACT

Time series forecasting holds significant importance in many real-world dynamic systems and has been extensively studied. Unlike natural language processing (NLP) and computer vision (CV), where a single large model can tackle multiple tasks, models for time series forecasting are often specialized, necessitating distinct designs for different tasks and applications. While pre-trained foundation models have made impressive strides in NLP and CV, their development in time series domains has been constrained by data sparsity. Recent studies have revealed that large language models (LLMs) possess robust pattern recognition and reasoning abilities over complex sequences of tokens. However, the challenge remains in effectively aligning the modalities of time series data and natural language to leverage these capabilities. In this work, we present TIME-LLM, a reprogramming framework to repurpose LLMs for general time series forecasting with the backbone language models kept intact. We begin by reprogramming the input time series with text prototypes before feeding it into the frozen LLM to align the two modalities. To augment the LLM's ability to reason with time series data, we propose Prompt-as-Prefix (PaP), which enriches the input context and directs the transformation of reprogrammed input patches. The transformed time series patches from the LLM are finally projected to obtain the forecasts. Our comprehensive evaluations demonstrate that TIME-LLM is a powerful time series learner that outperforms state-of-the-art, specialized forecasting models. Moreover, TIME-LLM excels in both few-shot and zero-shot learning scenarios.

1 INTRODUCTION
Time series forecasting is a critical capability across many real-world dynamic systems, with appli-
cations ranging from demand planning (Leonard, 2001) and inventory optimization (Li et al., 2022)
to energy load forecasting (Liu et al., 2023a) and climate modeling (Schneider & Dickinson, 1974).
Each time series forecasting task typically requires extensive domain expertise and task-specific
model designs. This stands in stark contrast to foundation language models like GPT-3 (Brown
et al., 2020), GPT-4 (OpenAI, 2023), Llama (Touvron et al., 2023), inter alia, which can perform
well on a diverse range of NLP tasks in a few-shot or even zero-shot setting.
Pre-trained foundation models, such as large language models (LLMs), have driven rapid progress
in computer vision (CV) and natural language processing (NLP). While time series modeling has
not benefited from the same significant breakthroughs, LLMs’ impressive capabilities have inspired
their application to time series forecasting. Several desiderata exist for leveraging LLMs to ad-
vance forecasting techniques: Generalizability. LLMs have demonstrated a remarkable capability
for few-shot and zero-shot transfer learning (Brown et al., 2020). This suggests their potential for
generalizable forecasting across domains without requiring per-task retraining from scratch. In con-
trast, current forecasting methods are often rigidly specialized by domain. Data efficiency. By
leveraging pre-trained knowledge, LLMs have shown the ability to perform new tasks with only
a few examples. This data efficiency could enable forecasting for settings where historical data is
limited. In contrast, current methods typically require abundant in-domain data. Reasoning. LLMs
exhibit sophisticated reasoning and pattern recognition capabilities (Mirchandani et al., 2023; Wang
et al., 2023; Chu et al., 2023). Harnessing these skills could allow making highly precise fore-
casts by leveraging learned higher-level concepts. Existing non-LLM methods are largely statistical
without much innate reasoning. Multimodal knowledge. As LLM architectures and training tech-
niques improve, they gain more diverse knowledge across modalities like vision, speech, and text
(Ma et al., 2023). Tapping into this knowledge could enable synergistic forecasting that fuses dif-
ferent data types. Conventional tools lack ways to jointly leverage multiple knowledge bases. Easy
optimization. LLMs are trained once on massive compute and can then be applied to forecasting
tasks without learning from scratch. Optimizing existing forecasting models often requires signifi-
cant architecture search and hyperparameter tuning (Zhou et al., 2023b). In summary, LLMs offer
a promising path to make time series forecasting more general, efficient, synergistic, and accessible
compared to current specialized modeling paradigms. Thus, adapting these powerful models for
time series data can unlock significant untapped potential.
The realization of the above benefits hinges on the effective alignment of the modalities of time
series data and natural language. However, this is a challenging task as LLMs operate on discrete
tokens, while time series data is inherently continuous. Furthermore, the knowledge and reasoning
capabilities to interpret time series patterns are not naturally present within LLMs’ pre-training.
Therefore, it remains an open challenge to unlock the knowledge within large language models and
activate their ability for general time series forecasting in a way that is accurate, data-efficient, and
task-agnostic.
In this work, we propose TIME-LLM, a reprogramming framework to adapt large language models
for time series forecasting while keeping the backbone model intact. The core idea is to reprogram
the input time series into text prototype representations that are more naturally suited to language
models’ capabilities. To further augment the model’s reasoning about time series concepts, we
introduce Prompt-as-Prefix (PaP), a novel idea of enriching the input time series with additional
context and providing task instructions in the modality of natural language. This provides declarative
guidance about desired transformations to apply to the reprogrammed input. The output of the
language model is then projected to generate time series forecasts. Our comprehensive evaluation
demonstrates that large language models can act as effective few-shot and zero-shot time series
learners when adapted through this reprogramming approach, outperforming specialized forecasting
models. By leveraging LLMs’ reasoning capability while keeping the models intact, our work points
the way toward multimodal foundation models that can excel on both language and sequential data
tasks. Our proposed reprogramming framework offers an extensible paradigm for imbuing large
models with new capabilities beyond their original pre-training. Our main contributions in this work
can be summarized as follows:
• We introduce a novel concept of reprogramming large language models for time series forecast-
ing without altering the pre-trained backbone model. In doing so, we show that forecasting can
be cast as yet another “language” task that can be effectively tackled by an off-the-shelf LLM.
• We propose a new framework, TIME-LLM, which encompasses reprogramming the input time
series into text prototype representations that are more natural for the LLM, and augmenting the
input context with declarative prompts (e.g., domain expert knowledge and task instructions) to
guide LLM reasoning. Our technique points towards multimodal foundation models excelling
in both language and time series.
• TIME-LLM consistently exceeds state-of-the-art performance in mainstream forecasting tasks,
especially in few-shot and zero-shot scenarios. Moreover, this superior performance is achieved
while maintaining excellent model reprogramming efficiency. Thus, our research is a concrete
step in unleashing LLMs’ untapped potential for time series and perhaps other sequential data.

2 RELATED WORK
Task-specific Learning. Most time series forecasting models are crafted for specific tasks and
domains (e.g., traffic prediction), and trained end-to-end on small-scale data. An illustration is in
Fig. 1(a). For example, ARIMA models are designed for univariate time series forecasting (Box
et al., 2015), LSTM networks are tailored for sequence modeling (Hochreiter & Schmidhuber, 1997),
and temporal convolutional networks (Bai et al., 2018) and transformers (Wen et al., 2023) are
developed for handling longer temporal dependencies. While achieving good performance on narrow
tasks, these models lack versatility and generalizability to diverse time series data.

Figure 1: Schematic illustration of reprogramming large language models (LLMs) in comparison
with (a) task-specific learning and (b) model fine-tuning. Our proposal investigates and demonstrates
(c) how to effectively reprogram open-sourced LLMs as powerful time series learners where well-
developed time series pre-trained models are not readily available.
In-modality Adaptation. Relevant research in CV and NLP has demonstrated the effectiveness of
pre-trained models that can be fine-tuned for various downstream tasks without the need for costly
training from scratch (Devlin et al., 2018; Brown et al., 2020; Touvron et al., 2023). Inspired by
these successes, recent studies have focused on the development of time series pre-trained models
(TSPTMs). The first step among them involves time series pre-training using different strategies like
supervised (Fawaz et al., 2018) or self-supervised learning (Zhang et al., 2022b; Deldari et al., 2022;
Zhang et al., 2023). This allows the model to learn to represent various input time series. Once pre-
trained, it can be fine-tuned on similar domains to learn how to perform specific tasks (Tang et al.,
2022). An example is in Fig. 1(b). The development of TSPTMs leverages the success of pre-
training and fine-tuning in NLP and CV but remains limited on smaller scales due to data sparsity.
Cross-modality Adaptation. Building on in-modality adaptation, recent work has further explored
transferring knowledge from powerful pre-trained foundation models in NLP and CV to time series
modeling, through techniques such as multimodal fine-tuning (Yin et al., 2023) and model repro-
gramming (Chen, 2022). Our approach aligns with this category; however, there is limited pertinent
research available on time series. An example is Voice2Series (Yang et al., 2021), which adapts an
acoustic model (AM) from speech recognition to time series classification. It transforms a time se-
ries into a format suitable for the AM and maps the output to new labels, allowing it to leverage the
representation learning ability of AMs trained on massive voice datasets for quick adaptation to time
series. Recently, Chang et al. (2023) proposes LLM4TS for time series forecasting using LLMs. It
designs a two-stage fine-tuning process on the LLM: first supervised pre-training on time series,
then task-specific fine-tuning. Zhou et al. (2023a) leverages pre-trained language models without al-
tering their self-attention and feedforward layers. This model is fine-tuned and evaluated on various
time series analysis tasks and demonstrates comparable or state-of-the-art performance by transfer-
ring knowledge from natural language pre-training. Distinct from these approaches, we neither edit
the input time series directly nor fine-tune the backbone LLM. Instead, as illustrated in Fig. 1(c), we
propose reprogramming time series with the source data modality along with prompting to unleash
the potential of LLMs as effective time series machines.

3 METHODOLOGY
Our model architecture is depicted in Fig. 2. We focus on reprogramming an embedding-visible
language foundation model, such as Llama (Touvron et al., 2023) and GPT-2 (Radford et al., 2019),
for general time series forecasting without requiring any fine-tuning of the backbone model. Specif-
ically, we consider the following problem: given a sequence of historical observations $\mathbf{X} \in \mathbb{R}^{N \times T}$
consisting of N different 1-dimensional variables across T time steps, we aim to reprogram a large
language model $f(\cdot)$ to understand the input time series and accurately forecast the readings at H
future time steps, denoted by $\hat{\mathbf{Y}} \in \mathbb{R}^{N \times H}$, with the overall objective to minimize the mean square
errors between the ground truths $\mathbf{Y}$ and predictions, i.e., $\frac{1}{H}\sum_{h=1}^{H} \|\hat{\mathbf{Y}}_h - \mathbf{Y}_h\|_F^2$.
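As a concrete reference, the following minimal sketch (our own illustration under the notation above, not the authors' code) computes this objective for a single sample.

```python
import torch

# Hedged sketch of the forecasting objective: the squared Frobenius norm of the
# per-step error, summed over the H forecast steps and divided by H.
# Shapes follow the text: Y_hat and Y are [N, H] (N variables, H future steps).
def forecasting_loss(Y_hat: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    H = Y.shape[-1]
    per_step_sq_err = ((Y_hat - Y) ** 2).sum(dim=0)  # ||Y_hat_h - Y_h||_F^2 for each h
    return per_step_sq_err.sum() / H
```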
Our method encompasses three main components: (1) input transformation, (2) a pre-trained and
frozen LLM, and (3) output projection. Initially, a multivariate time series is partitioned into N
univariate time series, which are subsequently processed independently (Nie et al., 2023). The i-th
series is denoted as $\mathbf{X}^{(i)} \in \mathbb{R}^{1 \times T}$, which undergoes normalization, patching, and embedding prior
to being reprogrammed with learned text prototypes to align the source and target modalities. Then,
we augment the LLM's time series reasoning ability by prompting it together with reprogrammed
patches to generate output representations, which are projected to the final forecasts $\hat{\mathbf{Y}}^{(i)} \in \mathbb{R}^{1 \times H}$.
We note that only the parameters of the lightweight input transformation and output projection are
updated, while the backbone language model is frozen. In contrast to vision-language and other
multimodal language models, which usually fine-tune with paired cross-modality data, TIME-LLM
is directly optimized and becomes readily available with only a small set of time series and a few
training epochs, maintaining high efficiency and imposing fewer resource constraints compared to
building large domain-specific models from scratch or fine-tuning them. To further reduce memory
footprints, various off-the-shelf techniques (e.g., quantization) can be seamlessly integrated for
slimming TIME-LLM.

Figure 2: The model framework of TIME-LLM. Given an input time series, we first tokenize and
embed it via (1) patching along with a (2) customized embedding layer. (3) These patch embeddings
are then reprogrammed with condensed text prototypes to align two modalities. To augment the
LLM's reasoning ability, (4) additional prompt prefixes are added to the input to direct the transfor-
mation of input patches. (5) The output patches from the LLM are projected to generate the forecasts.

3.1 MODEL STRUCTURE

Input Embedding. Each input channel $\mathbf{X}^{(i)}$ is first individually normalized to have zero mean and
unit standard deviation via reversible instance normalization (RevIN) to mitigate the time series
distribution shift (Kim et al., 2021). Then, we divide $\mathbf{X}^{(i)}$ into several consecutive overlapped or
non-overlapped patches (Nie et al., 2023) with length $L_p$; thus the total number of input patches is
$P = \lfloor (T - L_p)/S \rfloor + 2$, where $S$ denotes the horizontal sliding stride. The underlying motivations
are two-fold: (1) better preserving local semantic information by aggregating local information into
each patch and (2) serving as tokenization to form a compact sequence of input tokens, reducing
computational burdens. Given these patches $\mathbf{X}_P^{(i)} \in \mathbb{R}^{P \times L_p}$, we embed them as $\hat{\mathbf{X}}_P^{(i)} \in \mathbb{R}^{P \times d_m}$,
adopting a simple linear layer as the patch embedder to create dimensions $d_m$.
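A minimal PyTorch sketch of this input embedding step is given below. It is our own illustration under stated assumptions (replication-padding the last value by the stride so that $P = \lfloor (T - L_p)/S \rfloor + 2$, and a plain linear patch embedder); the class and argument names are ours rather than the released implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedder(nn.Module):
    """Instance-normalize a univariate channel, patch it, and embed each patch."""

    def __init__(self, patch_len: int = 16, stride: int = 8, d_model: int = 32):
        super().__init__()
        self.patch_len, self.stride = patch_len, stride
        self.proj = nn.Linear(patch_len, d_model)  # simple linear patch embedder

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, T] -- a batch of univariate channels X^(i)
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True) + 1e-5
        x = (x - mean) / std                                  # RevIN-style instance normalization
        pad = x[:, -1:].repeat(1, self.stride)                # replicate the last value S times
        x = torch.cat([x, pad], dim=-1)                       # yields P = floor((T - Lp)/S) + 2 patches
        patches = x.unfold(1, self.patch_len, self.stride)    # [B, P, Lp]
        return self.proj(patches)                             # X_hat_P: [B, P, d_m]
```

With, e.g., T = 512 (the input length used in Sec. 4.1) and illustrative values Lp = 16 and S = 8, this produces P = ⌊(512 − 16)/8⌋ + 2 = 64 patches.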
Patch Reprogramming. Here we reprogram patch embeddings into the source data representation
space to align the modalities of time series and natural language to activate the backbone’s time
series understanding and reasoning capabilities. A common practice is learning a form of “noise”
that, when applied to target input samples, allows the pre-trained source model to produce the desired
target outputs without requiring parameter updates. This is technically feasible for bridging data

modalities that are identical or similar. Examples include repurposing a vision model to work with
cross-domain images (Misra et al., 2023) or reprogramming an acoustic model to handle time series
data (Yang et al., 2021). In both cases, there are explicit, learnable transformations between the
source and target data, allowing for the direct editing of input samples. However, time series can
neither be directly edited nor described losslessly in natural language, posing significant challenges
to directly bootstrap the LLM for understanding time series without resource-intensive fine-tuning.

Figure 3: Illustration of (a) patch reprogramming and (b) Patch-as-Prefix versus Prompt-as-Prefix.
To close this gap, we propose reprogramming $\hat{\mathbf{X}}_P^{(i)}$ using pre-trained word embeddings $\mathbf{E} \in \mathbb{R}^{V \times D}$
in the backbone, where $V$ is the vocabulary size. Nevertheless, there is no prior knowledge indicating
which source tokens are directly relevant. Thus, simply leveraging $\mathbf{E}$ will result in a large and
potentially dense reprogramming space. A simple solution is to maintain a small collection of text
prototypes by linearly probing $\mathbf{E}$, denoted as $\mathbf{E}' \in \mathbb{R}^{V' \times D}$, where $V' \ll V$. An illustration is in
Fig. 3(a). Text prototypes learn connecting language cues, e.g., “long steady” (blue lines) and “short
up” (red lines), which are then combined to represent the local patch information (e.g., “short up
then down steadily” for characterizing patch 5) without leaving the space where the language model
is pre-trained. This approach is efficient and allows for the adaptive selection of relevant source
information. To realize this, we employ a multi-head cross-attention layer. Specifically, for each
(i) (i) (i)
head k = {1, · · · , K}, we define query matrices Qk = X̂P WkQ , key matrices Kk = E′ WkK ,
(i) Q
and value matrices Vk = E′ WkV , where Wk ∈ Rdm ×d and WkK , WkV ∈ RD×d . Specifically,
D is the hidden dimension of the backbone model, and d = ⌊ dKm ⌋. Then, we have the operation to
reprogram time series patches in each attention head defined as:
(i) (i)⊤
(i) (i) (i) (i) Qk Kk (i)
Zk = ATTENTION(Qk , Kk , Vk ) = S OFTMAX( √ )Vk . (1)
dk
### Domain Knowledge: We usually observe that
(i) electricity consumption
(i) usuallyP peaks at noon, with a
By aggregating each Zk ∈ RP ×d in every head, we obtain Z ∈ R . This is then linearly ×dm
significant increase in transformer load …
projected to align the hidden dimensions with the backbone model, yielding O(i) ∈ RP ×D .
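Below is a hedged PyTorch sketch of this reprogramming layer: the text prototypes $\mathbf{E}'$ are obtained by linearly probing the frozen word embeddings $\mathbf{E}$, and the patch embeddings attend over them with multi-head cross-attention as in Eq. 1. The module names, the default number of prototypes, and the batched handling are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class PatchReprogramming(nn.Module):
    """Cross-attention from patch embeddings (queries) to text prototypes (keys/values)."""

    def __init__(self, d_model: int, d_llm: int, n_heads: int, vocab_size: int, n_prototypes: int = 1000):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.prototype_map = nn.Linear(vocab_size, n_prototypes)    # E' by linearly probing E
        self.W_q = nn.Linear(d_model, n_heads * self.d_head)
        self.W_k = nn.Linear(d_llm, n_heads * self.d_head)
        self.W_v = nn.Linear(d_llm, n_heads * self.d_head)
        self.out = nn.Linear(n_heads * self.d_head, d_llm)          # project to the backbone dimension D

    def forward(self, patches: torch.Tensor, word_emb: torch.Tensor) -> torch.Tensor:
        # patches: [B, P, d_m]; word_emb: frozen word embeddings E of shape [V, D]
        B, P, _ = patches.shape
        prototypes = self.prototype_map(word_emb.T).T               # E': [V', D]
        Q = self.W_q(patches).view(B, P, self.n_heads, self.d_head)
        K = self.W_k(prototypes).view(-1, self.n_heads, self.d_head)
        V = self.W_v(prototypes).view(-1, self.n_heads, self.d_head)
        scores = torch.einsum("bphd,vhd->bhpv", Q, K) / self.d_head ** 0.5
        Z = torch.einsum("bhpv,vhd->bphd", scores.softmax(dim=-1), V)
        return self.out(Z.reshape(B, P, -1))                        # O^(i): [B, P, D]
```

In this sketch, word_emb would come from the frozen backbone's embedding table, and only this module, the patch embedder, and the output projection carry trainable parameters, consistent with the text.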
Prompt-as-Prefix. Prompting serves as a straightforward yet effective approach to task-specific
activation of LLMs (Yin et al., 2023). However, the direct translation of time series into natural
language presents considerable challenges, hindering both the creation of instruction-following
datasets and the effective utilization of on-the-fly prompting without performance compromise (Xue
& Salim, 2022). Recent advancements indicate that other data modalities, such as images, can be
seamlessly integrated as the prefixes of prompts, thereby facilitating effective reasoning based on
these inputs (Tsimpoukelli et al., 2021). Motivated by these findings, and to render our approach
directly applicable to real-world time series, we pose an alternative question: can prompts act as
prefixes to enrich the input context and guide the transformation of reprogrammed time series
patches? We term this concept as Prompt-as-Prefix (PaP) and observe that it significantly enhances
the LLM's adaptability to downstream tasks while complementing patch reprogramming (see Sec.
4.5 later).

Figure 4: Prompt example. The prompt packs a dataset description (e.g., the ETT context, with
domain knowledge such as "electricity consumption usually peaks at noon, with a significant
increase in transformer load"), a task instruction ("Predict the next <...> steps given the previous
<...> steps information attached"), and input statistics (minimum, maximum, median, overall trend,
and top-five lags). The <...> fields are task-specific configurations and calculated input statistics.

An illustration of the two prompting approaches is in Fig. 3(b). In Patch-as-Prefix, a language model
is prompted to predict subsequent values in a time series, articulated in natural language. This ap-
proach encounters certain constraints: (1) language models typically exhibit reduced sensitivity in
processing high-precision numerals without the aid of external tools, thereby presenting substantial
challenges in accurately addressing practical forecasting tasks over long horizons; (2) intricate, cus-
tomized post-processing is required for different language models, given that they are pre-trained
on diverse corpora and may employ different tokenization types in generating high-precision nu-
merals with precision and efficiency. This results in forecasts being represented in disparate natural
language formats, such as [‘0’, ‘.’, ‘6’, ‘1’] and [‘0’, ‘.’, ‘61’], to denote the decimal 0.61.
Prompt-as-Prefix, on the other hand, tactfully avoids these constraints. In practice, we identify three
pivotal components for constructing effective prompts: (1) dataset context, (2) task instruction, and
(3) input statistics. A prompt example is in Fig. 4. The dataset context furnishes the LLM with
essential background information concerning the input time series, which often exhibits distinct
characteristics across various domains. Task instruction serves as a crucial guide for the LLM in
the transformation of patch embeddings for specific tasks. We also enrich the input time series with
additional crucial statistics, such as trends and lags, to facilitate pattern recognition and reasoning.
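Sketched below is one way such a prefix could be assembled programmatically from the three components. The exact field wording, the delimiter tokens, and the use of autocorrelation to pick the "top lags" are our assumptions modeled on Fig. 4, not the authors' exact template.

```python
import numpy as np

def build_prompt(x: np.ndarray, dataset_context: str, pred_len: int, seq_len: int, top_k: int = 5) -> str:
    """Build a Prompt-as-Prefix string from dataset context, task instruction, and input statistics."""
    trend = "upward" if np.polyfit(np.arange(len(x)), x, 1)[0] > 0 else "downward"
    # top-k autocorrelation lags as a simple proxy for the "top lags" statistic
    ac = np.array([np.corrcoef(x[:-lag], x[lag:])[0, 1] for lag in range(1, len(x) // 2)])
    lags = (np.argsort(-ac)[:top_k] + 1).tolist()
    return (
        f"<|start_prompt|>Dataset description: {dataset_context} "
        f"Task: forecast the next {pred_len} steps given the previous {seq_len} steps. "
        f"Input statistics: min value {x.min():.3f}, max value {x.max():.3f}, "
        f"median value {np.median(x):.3f}, the overall trend is {trend}, "
        f"top {top_k} lags are {lags}.<|end_prompt|>"
    )
```

The resulting string is tokenized and embedded with the backbone's frozen embedder, and the prompt embeddings are prepended to the reprogrammed patch embeddings before the frozen LLM, as depicted in Fig. 2.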
Output Projection. Upon packing and feedforwarding the prompt and patch embeddings $\mathbf{O}^{(i)}$
through the frozen LLM as shown in Fig. 2, we discard the prefixal part and obtain the output
representations. Following this, we flatten and linearly project them to derive the final forecasts $\hat{\mathbf{Y}}^{(i)}$.
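A minimal sketch of this head is below (our own illustration); it assumes the number of prompt tokens is known so that the prefixal positions can be dropped before the flatten-and-linear projection.

```python
import torch
import torch.nn as nn

class OutputProjection(nn.Module):
    """Discard the prompt prefix, then flatten the patch dimension and map to the horizon H."""

    def __init__(self, n_patches: int, d_llm: int, horizon: int):
        super().__init__()
        self.head = nn.Linear(n_patches * d_llm, horizon)

    def forward(self, llm_out: torch.Tensor, n_prompt_tokens: int) -> torch.Tensor:
        # llm_out: [B, n_prompt_tokens + P, D] from the frozen backbone
        patch_out = llm_out[:, n_prompt_tokens:, :]          # drop the prefixal part
        return self.head(patch_out.flatten(start_dim=1))     # Y_hat^(i): [B, H]
```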

4 MAIN RESULTS
TIME-LLM consistently outperforms state-of-the-art forecasting methods by large margins across
multiple benchmarks and settings, especially in few-shot and zero-shot scenarios. We compared our
approach against a broad collection of up-to-date models, including a recent study that fine-tunes a
language model for time series analysis (Zhou et al., 2023a). To ensure a fair comparison, we adhere
to the experimental configurations in (Wu et al., 2022) across all baselines with a unified evaluation
pipeline1. We use Llama-7B (Touvron et al., 2023) as the default backbone unless stated otherwise.
Baselines. We compare with the SOTA time series models, and we cite their performance from
(Zhou et al., 2023a) if applicable. Our baselines include a series of Transformer-based methods:
PatchTST (2023), ETSformer (2022), Non-Stationary Transformer (2022), FEDformer (2022), Aut-
oformer (2021), Informer (2021), and Reformer (2020). We also select a set of recent competitive
models, including GPT4TS (2023a), DLinear (2023), TimesNet (2022), and LightTS (2022a). In
short-term forecasting, we further compare our model with N-HiTS (2023b) and N-BEATS (2019).
More details are in Appendix A.

4.1 LONG-TERM FORECASTING


Setups. We evaluate on four ETT datasets (i.e., ETTh1, ETTh2, ETTm1, and ETTm2), which
have been extensively adopted for benchmarking long-term forecasting models (Zhou et al., 2021).
Details of the implementation and datasets can be found in Appendix B. The input time series
length T is set as 512, and we use four different prediction horizons H ∈ {96, 192, 336, 720}. The
evaluation metrics include mean square error (MSE) and mean absolute error (MAE).
Results. Our results are shown in Tab. 1, where TIME-LLM outperforms all baselines in most cases
and significantly so to the majority of them. The comparison with GPT4TS (Zhou et al., 2023a) is
particularly noteworthy. GPT4TS is a very recent work that involves fine-tuning on the backbone
language model. We note average performance gains of 13% and 15% over GPT4TS and Times-
Net, respectively. When compared with the SOTA task-specific Transformer model PatchTST, by
reprogramming the smallest Llama, TIME-LLM realizes an average MSE reduction of 2%. Relative
to the other models, e.g., DLinear, our improvements are also pronounced, exceeding 10%.

4.2 SHORT-TERM FORECASTING


Setups. We choose the M4 benchmark (Makridakis et al., 2018) as the testbed, which contains a
collection of marketing data in different sampling frequencies. More details are provided in Ap-
pendix B. The prediction horizons in this case are relatively small and in [6, 48]. The input lengths
1 https://github.com/thuml/Time-Series-Library

Table 1: Long-term forecasting results. We use forecasting horizons H ∈ {96, 192, 336, 720}. A lower value
indicates better performance. Red: the best, Blue: the second best.
Methods TIME-LLM GPT4TS DLinear PatchTST TimesNet FEDformer Autoformer Stationary ETSformer LightTS Informer Reformer
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
96 0.362 0.392 0.376 0.397 0.375 0.399 0.370 0.399 0.384 0.402 0.376 0.419 0.449 0.459 0.513 0.491 0.494 0.479 0.424 0.432 0.865 0.713 0.837 0.728
ETTh1

192 0.398 0.418 0.416 0.418 0.405 0.416 0.413 0.421 0.436 0.429 0.420 0.448 0.500 0.482 0.534 0.504 0.538 0.504 0.475 0.462 1.008 0.792 0.923 0.766
336 0.430 0.427 0.442 0.433 0.439 0.443 0.422 0.436 0.491 0.469 0.459 0.465 0.521 0.496 0.588 0.535 0.574 0.521 0.518 0.488 1.107 0.809 1.097 0.835
720 0.442 0.457 0.477 0.456 0.472 0.490 0.447 0.466 0.521 0.500 0.506 0.507 0.514 0.512 0.643 0.616 0.562 0.535 0.547 0.533 1.181 0.865 1.257 0.889
Avg 0.408 0.423 0.465 0.455 0.422 0.437 0.413 0.430 0.458 0.450 0.440 0.460 0.496 0.487 0.570 0.537 0.542 0.510 0.491 0.479 1.040 0.795 1.029 0.805
96 0.268 0.328 0.285 0.342 0.289 0.353 0.274 0.336 0.340 0.374 0.358 0.397 0.346 0.388 0.476 0.458 0.340 0.391 0.397 0.437 3.755 1.525 2.626 1.317
ETTh2

192 0.329 0.375 0.354 0.389 0.383 0.418 0.339 0.379 0.402 0.414 0.429 0.439 0.456 0.452 0.512 0.493 0.430 0.439 0.520 0.504 5.602 1.931 11.12 2.979
336 0.368 0.409 0.373 0.407 0.448 0.465 0.329 0.380 0.452 0.452 0.496 0.487 0.482 0.486 0.552 0.551 0.485 0.479 0.626 0.559 4.721 1.835 9.323 2.769
720 0.372 0.420 0.406 0.441 0.605 0.551 0.379 0.422 0.462 0.468 0.463 0.474 0.515 0.511 0.562 0.560 0.500 0.497 0.863 0.672 3.647 1.625 3.874 1.697
Avg 0.334 0.383 0.381 0.412 0.431 0.446 0.330 0.379 0.414 0.427 0.437 0.449 0.450 0.459 0.526 0.516 0.439 0.452 0.602 0.543 4.431 1.729 6.736 2.191
96 0.272 0.334 0.292 0.346 0.299 0.343 0.290 0.342 0.338 0.375 0.379 0.419 0.505 0.475 0.386 0.398 0.375 0.398 0.374 0.400 0.672 0.571 0.538 0.528
ETTm1

192 0.310 0.358 0.332 0.372 0.335 0.365 0.332 0.369 0.374 0.387 0.426 0.441 0.553 0.496 0.459 0.444 0.408 0.410 0.400 0.407 0.795 0.669 0.658 0.592
336 0.352 0.384 0.366 0.394 0.369 0.386 0.366 0.392 0.410 0.411 0.445 0.459 0.621 0.537 0.495 0.464 0.435 0.428 0.438 0.438 1.212 0.871 0.898 0.721
720 0.383 0.411 0.417 0.421 0.425 0.421 0.416 0.420 0.478 0.450 0.543 0.490 0.671 0.561 0.585 0.516 0.499 0.462 0.527 0.502 1.166 0.823 1.102 0.841
Avg 0.329 0.372 0.388 0.403 0.357 0.378 0.351 0.380 0.400 0.406 0.448 0.452 0.588 0.517 0.481 0.456 0.429 0.425 0.435 0.437 0.961 0.734 0.799 0.671
96 0.161 0.253 0.173 0.262 0.167 0.269 0.165 0.255 0.187 0.267 0.203 0.287 0.255 0.339 0.192 0.274 0.189 0.280 0.209 0.308 0.365 0.453 0.658 0.619
ETTm2

192 0.219 0.293 0.229 0.301 0.224 0.303 0.220 0.292 0.249 0.309 0.269 0.328 0.281 0.340 0.280 0.339 0.253 0.319 0.311 0.382 0.533 0.563 1.078 0.827
336 0.271 0.329 0.286 0.341 0.281 0.342 0.274 0.329 0.321 0.351 0.325 0.366 0.339 0.372 0.334 0.361 0.314 0.357 0.442 0.466 1.363 0.887 1.549 0.972
720 0.352 0.379 0.378 0.401 0.397 0.421 0.362 0.385 0.408 0.403 0.421 0.415 0.433 0.432 0.417 0.413 0.414 0.413 0.675 0.587 3.379 1.338 2.631 1.242
Avg 0.251 0.313 0.284 0.339 0.267 0.333 0.255 0.315 0.291 0.333 0.305 0.349 0.327 0.371 0.306 0.347 0.293 0.342 0.409 0.436 1.410 0.810 1.479 0.915
1st Count 18 0 1 4 0 0 0 0 0 0 0 0

Table 2: Short-term time series forecasting results on M4. The forecasting horizons are in [6, 48] and the
three rows provided are weighted averaged from all datasets under different sampling intervals. A lower value
indicates better performance. Red: the best, Blue: the second best. Appendix E shows our full results.
Methods TIME-LLM GPT4TS TimesNet PatchTST N-HiTS N-BEATS ETSformer LightTS DLinear FEDformer Stationary Autoformer Informer Reformer
Average

SMAPE 11.983 12.69 12.88 12.059 12.035 12.25 14.718 13.525 13.639 13.16 12.780 12.909 14.086 18.200
MASE 1.595 1.808 1.836 1.623 1.625 1.698 2.408 2.111 2.095 1.775 1.756 1.771 2.718 4.223
OWA 0.859 0.94 0.955 0.869 0.869 0.896 1.172 1.051 1.051 0.949 0.930 0.939 1.230 1.775

are twice the prediction horizons. The evaluation metrics are the symmetric mean absolute percentage
error (SMAPE), the mean absolute scaled error (MASE), and the overall weighted average (OWA).
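For reference, hedged per-series implementations of the first two metrics are sketched below (M4-style definitions; the seasonal period m used by MASE and the naive-2 baseline required by OWA are dataset-dependent and omitted here).

```python
import numpy as np

def smape(y: np.ndarray, y_hat: np.ndarray) -> float:
    # symmetric mean absolute percentage error, in percent
    return 200.0 * float(np.mean(np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat))))

def mase(y: np.ndarray, y_hat: np.ndarray, y_train: np.ndarray, m: int = 1) -> float:
    # mean absolute scaled error: forecast error scaled by the in-sample seasonal-naive error
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return float(np.mean(np.abs(y - y_hat)) / scale)
```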
Results. Our brief results with unified seeds across all methods are in Tab. 2. TIME-LLM consis-
tently surpasses all baselines, outperforming GPT4TS by 8.7%. TIME-LLM remains competitive
even when compared with the SOTA model, N-HiTS (Challu et al., 2023b), w.r.t. MASE and OWA.

4.3 FEW-SHOT FORECASTING


Setups. LLMs have recently demonstrated remarkable few-shot learning capabilities (Liu et al.,
2023b). In this section, we assess whether our reprogrammed LLM retains this ability in forecasting
tasks. We adhere to the setups in (Zhou et al., 2023a) for fair comparisons, and we evaluate on
scenarios with limited training data (i.e., ≤ first 10% training time steps).
Results. The 10% and 5% few-shot learning results are depicted in Tab. 3 and Tab. 4 respectively.
TIME-LLM remarkably excels over all baseline methods, and we attribute this to the successful

Table 3: Few-shot learning on 10% training data. We use the same protocol and notations as in Tab. 1.
Methods TIME-LLM GPT4TS DLinear PatchTST TimesNet FEDformer Autoformer Stationary ETSformer LightTS Informer Reformer
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
96 0.448 0.460 0.458 0.456 0.492 0.495 0.516 0.485 0.861 0.628 0.512 0.499 0.613 0.552 0.918 0.639 1.112 0.806 1.298 0.838 1.179 0.792 1.184 0.790
ETTh1

192 0.484 0.483 0.570 0.516 0.565 0.538 0.598 0.524 0.797 0.593 0.624 0.555 0.722 0.598 0.915 0.629 1.155 0.823 1.322 0.854 1.199 0.806 1.295 0.850
336 0.589 0.540 0.608 0.535 0.721 0.622 0.657 0.550 0.941 0.648 0.691 0.574 0.750 0.619 0.939 0.644 1.179 0.832 1.347 0.870 1.202 0.811 1.294 0.854
720 0.700 0.604 0.725 0.591 0.986 0.743 0.762 0.610 0.877 0.641 0.728 0.614 0.721 0.616 0.887 0.645 1.273 0.874 1.534 0.947 1.217 0.825 1.223 0.838
Avg. 0.556 0.522 0.590 0.525 0.691 0.600 0.633 0.542 0.869 0.628 0.639 0.561 0.702 0.596 0.915 0.639 1.180 0.834 1.375 0.877 1.199 0.809 1.249 0.833
96 0.275 0.326 0.331 0.374 0.357 0.411 0.353 0.389 0.378 0.409 0.382 0.416 0.413 0.451 0.389 0.411 0.678 0.619 2.022 1.006 3.837 1.508 3.788 1.533
ETTh2

192 0.374 0.373 0.402 0.411 0.569 0.519 0.403 0.414 0.490 0.467 0.478 0.474 0.474 0.477 0.473 0.455 0.785 0.666 2.329 1.104 3.856 1.513 3.552 1.483
336 0.406 0.429 0.406 0.433 0.671 0.572 0.426 0.441 0.537 0.494 0.504 0.501 0.547 0.543 0.507 0.480 0.839 0.694 2.453 1.122 3.952 1.526 3.395 1.526
720 0.427 0.449 0.449 0.464 0.824 0.648 0.477 0.480 0.510 0.491 0.499 0.509 0.516 0.523 0.477 0.472 1.273 0.874 3.816 1.407 3.842 1.503 3.205 1.401
Avg. 0.370 0.394 0.397 0.421 0.605 0.538 0.415 0.431 0.479 0.465 0.466 0.475 0.488 0.499 0.462 0.455 0.894 0.713 2.655 1.160 3.872 1.513 3.485 1.486
96 0.346 0.388 0.390 0.404 0.352 0.392 0.410 0.419 0.583 0.501 0.578 0.518 0.774 0.614 0.761 0.568 0.911 0.688 0.921 0.682 1.162 0.785 1.442 0.847
ETTm1

192 0.373 0.416 0.429 0.423 0.382 0.412 0.437 0.434 0.630 0.528 0.617 0.546 0.754 0.592 0.781 0.574 0.955 0.703 0.957 0.701 1.172 0.793 1.444 0.862
336 0.413 0.426 0.469 0.439 0.419 0.434 0.476 0.454 0.725 0.568 0.998 0.775 0.869 0.677 0.803 0.587 0.991 0.719 0.998 0.716 1.227 0.908 1.450 0.866
720 0.485 0.476 0.569 0.498 0.490 0.477 0.681 0.556 0.769 0.549 0.693 0.579 0.810 0.630 0.844 0.581 1.062 0.747 1.007 0.719 1.207 0.797 1.366 0.850
Avg. 0.404 0.427 0.464 0.441 0.411 0.429 0.501 0.466 0.677 0.537 0.722 0.605 0.802 0.628 0.797 0.578 0.980 0.714 0.971 0.705 1.192 0.821 1.426 0.856
96 0.177 0.261 0.188 0.269 0.213 0.303 0.191 0.274 0.212 0.285 0.291 0.399 0.352 0.454 0.229 0.308 0.331 0.430 0.813 0.688 3.203 1.407 4.195 1.628
ETTm2

192 0.241 0.314 0.251 0.309 0.278 0.345 0.252 0.317 0.270 0.323 0.307 0.379 0.694 0.691 0.291 0.343 0.400 0.464 1.008 0.768 3.112 1.387 4.042 1.601
336 0.274 0.327 0.307 0.346 0.338 0.385 0.306 0.353 0.323 0.353 0.543 0.559 2.408 1.407 0.348 0.376 0.469 0.498 1.031 0.775 3.255 1.421 3.963 1.585
720 0.417 0.390 0.426 0.417 0.436 0.440 0.433 0.427 0.474 0.449 0.712 0.614 1.913 1.166 0.461 0.438 0.589 0.557 1.096 0.791 3.909 1.543 3.711 1.532
Avg. 0.277 0.323 0.293 0.335 0.316 0.368 0.296 0.343 0.320 0.353 0.463 0.488 1.342 0.930 0.332 0.366 0.447 0.487 0.987 0.756 3.370 1.440 3.978 1.587
1st Count 20 3 2 0 0 0 0 0 0 0 0 0

Table 4: Few-shot learning on 5% training data. We use the same protocol and notations as in Tab. 1. ’-’ means
that 5% time series is not sufficient to constitute a training set.
Methods TIME-LLM GPT4TS DLinear PatchTST TimesNet FEDformer Autoformer Stationary ETSformer LightTS Informer Reformer
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
96 0.483 0.464 0.543 0.506 0.547 0.503 0.557 0.519 0.892 0.625 0.593 0.529 0.681 0.570 0.952 0.650 1.169 0.832 1.483 0.91 1.225 0.812 1.198 0.795
ETTh1

192 0.629 0.540 0.748 0.580 0.720 0.604 0.711 0.570 0.940 0.665 0.652 0.563 0.725 0.602 0.943 0.645 1.221 0.853 1.525 0.93 1.249 0.828 1.273 0.853
336 0.768 0.626 0.754 0.595 0.984 0.727 0.816 0.619 0.945 0.653 0.731 0.594 0.761 0.624 0.935 0.644 1.179 0.832 1.347 0.87 1.202 0.811 1.254 0.857
720 - - - - - - - - - - - - - - - - - - - - - - - -
Avg. 0.627 0.543 0.681 0.560 0.750 0.611 0.694 0.569 0.925 0.647 0.658 0.562 0.722 0.598 0.943 0.646 1.189 0.839 1.451 0.903 1.225 0.817 1.241 0.835
96 0.336 0.397 0.376 0.421 0.442 0.456 0.401 0.421 0.409 0.420 0.390 0.424 0.428 0.468 0.408 0.423 0.678 0.619 2.022 1.006 3.837 1.508 3.753 1.518
ETTh2

192 0.406 0.425 0.418 0.441 0.617 0.542 0.452 0.455 0.483 0.464 0.457 0.465 0.496 0.504 0.497 0.468 0.845 0.697 3.534 1.348 3.975 1.933 3.516 1.473
336 0.405 0.432 0.408 0.439 1.424 0.849 0.464 0.469 0.499 0.479 0.477 0.483 0.486 0.496 0.507 0.481 0.905 0.727 4.063 1.451 3.956 1.520 3.312 1.427
720 - - - - - - - - - - - - - - - - - - - - - - - -
Avg. 0.382 0.418 0.400 0.433 0.694 0.577 0.827 0.615 0.439 0.448 0.463 0.454 0.441 0.457 0.470 0.489 0.809 0.681 3.206 1.268 3.922 1.653 3.527 1.472
96 0.316 0.377 0.386 0.405 0.332 0.374 0.399 0.414 0.606 0.518 0.628 0.544 0.726 0.578 0.823 0.587 1.031 0.747 1.048 0.733 1.130 0.775 1.234 0.798
ETTm1

192 0.450 0.464 0.440 0.438 0.358 0.390 0.441 0.436 0.681 0.539 0.666 0.566 0.750 0.591 0.844 0.591 1.087 0.766 1.097 0.756 1.150 0.788 1.287 0.839
336 0.450 0.424 0.485 0.459 0.402 0.416 0.499 0.467 0.786 0.597 0.807 0.628 0.851 0.659 0.870 0.603 1.138 0.787 1.147 0.775 1.198 0.809 1.288 0.842
720 0.483 0.471 0.577 0.499 0.511 0.489 0.767 0.587 0.796 0.593 0.822 0.633 0.857 0.655 0.893 0.611 1.245 0.831 1.200 0.799 1.175 0.794 1.247 0.828
Avg. 0.425 0.434 0.472 0.450 0.400 0.417 0.526 0.476 0.717 0.561 0.730 0.592 0.796 0.620 0.857 0.598 1.125 0.782 1.123 0.765 1.163 0.791 1.264 0.826
96 0.174 0.261 0.199 0.280 0.236 0.326 0.206 0.288 0.220 0.299 0.229 0.320 0.232 0.322 0.238 0.316 0.404 0.485 1.108 0.772 3.599 1.478 3.883 1.545
ETTm2

192 0.215 0.287 0.256 0.316 0.306 0.373 0.264 0.324 0.311 0.361 0.394 0.361 0.291 0.357 0.298 0.349 0.479 0.521 1.317 0.850 3.578 1.475 3.553 1.484
336 0.273 0.330 0.318 0.353 0.380 0.423 0.334 0.367 0.338 0.366 0.378 0.427 0.478 0.517 0.353 0.380 0.552 0.555 1.415 0.879 3.561 1.473 3.446 1.460
720 0.433 0.412 0.460 0.436 0.674 0.583 0.454 0.432 0.509 0.465 0.523 0.510 0.553 0.538 0.475 0.445 0.701 0.627 1.822 0.984 3.896 1.533 3.445 1.460
Avg. 0.274 0.323 0.308 0.346 0.399 0.426 0.314 0.352 0.344 0.372 0.381 0.404 0.388 0.433 0.341 0.372 0.534 0.547 1.415 0.871 3.658 1.489 3.581 1.487
1st Count 15 1 3 1 0 1 0 0 0 0 0 0

knowledge activation in our reprogrammed LLM. Interestingly, both our approach and GPT4TS
consistently surpass other competitive baselines, further underscoring the potential prowess of lan-
guage models as proficient time series machines.
In the realm of 10% few-shot learning, our methodology realizes a 7.7% MSE reduction in com-
parison to GPT4TS, without necessitating any fine-tuning on the LLM. In relation to recent SOTA
models such as PatchTST, DLinear, and TimesNet, our average enhancements surpass 12%, 18%,
and 28% w.r.t. MSE. Analogous trends are discernible in the 5% few-shot learning scenarios, where
our average advancement over GPT4TS exceeds 8%. When compared with PatchTST, DLinear, and
TimesNet, TIME-LLM manifests a striking average improvement of over 24%.

4.4 ZERO-SHOT FORECASTING

Table 5: Zero-shot learning results. Red: the best, Blue: the second best. Appendix D shows complete results.
Methods TIME-LLM GPT4TS DLinear PatchTST TimesNet Autoformer
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
ETTh1 → ETTh2 0.353 0.387 0.406 0.422 0.493 0.488 0.380 0.405 0.421 0.431 0.582 0.548
ETTh1 → ETTm2 0.273 0.340 0.325 0.363 0.415 0.452 0.314 0.360 0.327 0.361 0.457 0.483
ETTh2 → ETTh1 0.479 0.474 0.757 0.578 0.703 0.574 0.565 0.513 0.865 0.621 0.757 0.608
ETTh2 → ETTm2 0.272 0.341 0.335 0.370 0.328 0.386 0.325 0.365 0.342 0.376 0.366 0.411
ETTm1 → ETTh2 0.381 0.412 0.433 0.439 0.464 0.475 0.439 0.438 0.457 0.454 0.470 0.479
ETTm1 → ETTm2 0.268 0.320 0.313 0.348 0.335 0.389 0.296 0.334 0.322 0.354 0.469 0.484
ETTm2 → ETTh2 0.354 0.400 0.435 0.443 0.455 0.471 0.409 0.425 0.435 0.443 0.423 0.439
ETTm2 → ETTm1 0.414 0.438 0.769 0.567 0.649 0.537 0.568 0.492 0.769 0.567 0.755 0.591

Setups. Beyond few-shot learning, LLMs hold potential as effective zero-shot reasoners (Kojima
et al., 2022). In this section, we evaluate the zero-shot learning capabilities of the reprogrammed
LLM within the framework of cross-domain adaptation. Specifically, we examine how well a model
performs on a dataset ♣ when it is optimized on another dataset ♠, where the model has not en-
countered any data samples from the dataset ♣. Similar to few-shot learning, we use long-term
forecasting protocol and evaluate on various cross-domain scenarios utilizing the ETT datasets.
Results. Our brief results are in Tab. 5. TIME-LLM consistently outperforms the most competitive
baselines by a large margin, over 14.2% w.r.t. the second-best in MSE reduction. Considering the
few-shot results, we observe that reprogramming an LLM tends to yield significantly better results
in data scarcity scenarios. For example, our overall error reductions w.r.t. GPT4TS in 10% few-shot
forecasting, 5% few-shot forecasting, and zero-shot forecasting are increasing gradually: 7.7%,
8.4%, and 22%. We attribute this to our approach being better at activating the LLM’s knowledge
transfer and reasoning capabilities in a resource-efficient manner when performing time series tasks.

Table 6: Ablations on ETTh1 and ETTm1 in predicting 96 and 192 steps ahead (MSE reported). Red: the best.
Long-term Forecasting Few-shot Forecasting
Variant
ETTh1-96 ETTh1-192 ETTm1-96 ETTm1-192 ETTh1-96 ETTh1-192 ETTm1-96 ETTm1-192
A.1 Llama (Default; 32) 0.362 0.398 0.272 0.310 0.448 0.484 0.346 0.373
A.2 Llama (8) 0.389 0.412 0.297 0.329 0.567 0.632 0.451 0.490
A.3 GPT-2 (12) 0.385 0.419 0.306 0.332 0.548 0.617 0.447 0.509
A.4 GPT-2 (6) 0.394 0.427 0.311 0.342 0.571 0.640 0.468 0.512
B.1 w/o Patch Reprogramming 0.410 0.412 0.310 0.342 0.498 0.570 0.445 0.487
B.2 w/o Prompt-as-Prefix 0.398 0.423 0.298 0.339 0.521 0.617 0.432 0.481
C.1 w/o Dataset Context 0.402 0.417 0.298 0.331 0.491 0.538 0.392 0.447
C.2 w/o Task Instruction 0.388 0.420 0.285 0.327 0.476 0.529 0.387 0.439
C.3 w/o Statistical Context 0.391 0.419 0.279 0.347 0.483 0.547 0.421 0.461

Table 7: Efficiency analysis of TIME-LLM on ETTh1 in forecasting different steps ahead.
Length ETTh1-96 ETTh1-192 ETTh1-336 ETTh1-512
Metric Param. (M) Mem. (MiB) Speed(s/iter) Param. (M) Mem. (MiB) Speed(s/iter) Param. (M) Mem. (MiB) Speed(s/iter) Param. (M) Mem.(MiB) Speed(s/iter)
D.1 Llama (32) 3404.53 32136 0.517 3404.57 33762 0.582 3404.62 37988 0.632 3404.69 39004 0.697
D.2 Llama (8) 975.83 11370 0.184 975.87 12392 0.192 975.92 13188 0.203 976.11 13616 0.217
D.3 w/o LLM 6.39 3678 0.046 6.42 3812 0.087 6.48 3960 0.093 6.55 4176 0.129

4.5 MODEL ANALYSIS


Language Model Variants. We compare two representative backbones with varying capacities
(A.1-4 in Tab. 6). Our results indicate that the scaling law is retained after the LLM reprogramming. We
adopt Llama-7B by default in its full capacity, which manifestly outperforms its 1/4 capacity variant
(A.2; inclusive of the first 8 Transformer layers) by 14.5%. An average MSE reduction of 14.7% is
observed over GPT-2 (A.3), which slightly outperforms its variant GPT-2 (6) (A.4) by 2.7%.
Cross-modality Alignment. Our results in Tab. 6 indicate that ablating either patch reprogramming
or Prompt-as-Prefix hurts knowledge transfer in reprogramming the LLM for effective time series
forecasting. In the absence of representation alignment (B.1), we observe a notable average perfor-
mance degradation of 9.2%, which becomes more pronounced (exceeding 17%) in few-shot tasks.
In TIME-LLM, the act of prompting stands as a pivotal element in harnessing the LLM's capacity
for understanding the inputs and tasks. Ablation of this component (B.2) results in over 8% and
19% degradation in standard and few-shot forecasting tasks, respectively. We find that removing
the input statistics (C.3) hurts the most, resulting in an average increase of 10.2% MSE. This is an-
ticipated as external knowledge can be naturally incorporated via prompting to facilitate the learning
and inference. Additionally, providing the LLM with clear task instructions and input context (e.g.,
dataset captioning) is also beneficial (i.e., C.2 and C.1; removing them elicits over 7.7% and 9.6% degradation, respectively).
Reprogramming Interpretation. We provide a case study on ETTh1 of reprogramming 48 time
series patches with 100 text prototypes in Fig. 5. The top 4 subplots visualize the optimization of
reprogramming space from randomly-initialized (a) to well-optimized (d). We find only a small
set of prototypes (columns) participated in reprogramming the input patches (rows) in subplot (e).
Also, patches undergo different representations through varying combinations of prototypes. This
indicates: (1) text prototypes learn to summarize language cues, and a select few are highly relevant
for representing information in local time series patches, which we visualize by randomly selecting
10 in subplot (f). Our results suggest a high relevance to the words that describe time series proper-
ties (i.e., word sets 1 and 2); (2) patches usually have different underlying semantics, necessitating
different prototypes to represent them.

Figure 5: A showcase of time series patch reprogramming. Panels (a)-(d) visualize the reprogramming
space from random initialization (epoch 0) through epochs 1, 5, and 10; panel (e) illustrates the
well-optimized space in (d), where only a few text prototypes are significant and time series patches
are reprogrammed differently; panel (f) visualizes 10 learned text prototypes against word sets, e.g.,
word set 1 {'periodic', 'seasonal', 'increase', ...}, word set 2 {'quantile', 'average', 'short', ...}, and
word set 3 {'outspoken', 'galiee', 'analogue', ...}.
Reprogramming Efficiency. Tab. 7 provides an overall efficiency analysis of TIME-LLM with
and without the backbone LLM. Our proposed reprogramming network itself (D.3) is lightweight
in activating the LLM’s ability for time series forecasting (i.e., fewer than 6.6 million trainable
parameters; only around 0.2% of the total parameters in Llama-7B), and the overall efficiency of
TIME-LLM is actually capped by the leveraged backbones (e.g., D.1 and D.2). This is favorable
even compared to the parameter-efficient fine-tuning methods (e.g., QLoRA (Dettmers et al., 2023))
in balancing task performance and efficiency.

5 CONCLUSION AND FUTURE WORK


TIME-LLM shows promise in adapting frozen large language models for time series forecasting
by reprogramming time series data into text prototypes more natural for LLMs and providing nat-
ural language guidance via Prompt-as-Prefix to augment reasoning. Evaluations demonstrate that the
adapted LLMs can outperform specialized expert models, indicating their potential as effective time
series machines. Our results also provide a novel insight that time series forecasting can be cast as
yet another “language” task that can be tackled by an off-the-shelf LLM to achieve state-of-the-art
performance through our TIME-LLM framework. Further research should explore optimal repro-
gramming representations, enrich LLMs with explicit time series knowledge through continued pre-
training, build towards multi-modal models with joint reasoning across modalities, and apply the
reprogramming framework to imbue LLMs with additional new capabilities.

REFERENCES
Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional
and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. Time series analysis:
forecasting and control. John Wiley & Sons, 2015.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are
few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Cristian Challu, Kin G Olivares, Boris N Oreshkin, Federico Garza, Max Mergenthaler, and Artur
Dubrawski. N-hits: Neural hierarchical interpolation for time series forecasting. AAAI, 2023a.
Cristian Challu, Kin G Olivares, Boris N Oreshkin, Federico Garza Ramirez, Max Mergenthaler
Canseco, and Artur Dubrawski. Nhits: neural hierarchical interpolation for time series forecast-
ing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 6989–6997,
2023b.
Ching Chang, Wen-Chih Peng, and Tien-Fu Chen. Llm4ts: Two-stage fine-tuning for time-series
forecasting with pre-trained llms. arXiv preprint arXiv:2308.08469, 2023.
Pin-Yu Chen. Model reprogramming: Resource-efficient cross-domain machine learning. arXiv
preprint arXiv:2202.10629, 2022.
Zhixuan Chu, Hongyan Hao, Xin Ouyang, Simeng Wang, Yan Wang, Yue Shen, Jinjie Gu, Qing Cui,
Longfei Li, Siqiao Xue, et al. Leveraging large language models for pre-trained recommender
systems. arXiv preprint arXiv:2308.10837, 2023.
Shohreh Deldari, Hao Xue, Aaqib Saeed, Jiayuan He, Daniel V Smith, and Flora D Salim. Beyond
just vision: A review on self-supervised representation learning on multimodal and temporal data.
arXiv preprint arXiv:2206.02353, 2022.

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning
of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep
bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain
Muller. Transfer learning for time series classification. In 2018 IEEE international conference on
big data, pp. 1367–1376. IEEE, 2018.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):
1735–1780, 1997.
Taesung Kim, Jinhee Kim, Yunwon Tae, Cheonbok Park, Jang-Ho Choi, and Jaegul Choo. Re-
versible instance normalization for accurate time-series forecasting against distribution shift. In
International Conference on Learning Representations, 2021.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv
preprint arXiv:2001.04451, 2020.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large
language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022.
Michael Leonard. Promotional analysis and forecasting for demand planning: a practical time series
approach. with exhibits, 1, 2001.
Na Li, Donald M Arnold, Douglas G Down, Rebecca Barty, John Blake, Fei Chiang, Tom Courtney,
Marianne Waito, Rick Trifunov, and Nancy M Heddle. From demand forecasting to inventory
ordering decisions for red blood cells through integrating machine learning, statistical modeling,
and inventory optimization. Transfusion, 62(1):87–99, 2022.
Hengbo Liu, Ziqing Ma, Linxiao Yang, Tian Zhou, Rui Xia, Yi Wang, Qingsong Wen, and Liang
Sun. Sadi: A self-adaptive decomposed interpretable framework for electric load forecasting
under extreme events. In 2023 IEEE International Conference on Acoustics, Speech and Signal
Processing, 2023a.
Xin Liu, Daniel McDuff, Geza Kovacs, Isaac Galatzer-Levy, Jacob Sunshine, Jiening Zhan, Ming-
Zher Poh, Shun Liao, Paolo Di Achille, and Shwetak Patel. Large language models are few-shot
health learners. arXiv preprint arXiv:2305.15525, 2023b.
Yong Liu, Haixu Wu, Jianmin Wang, and Mingsheng Long. Non-stationary transformers: Exploring
the stationarity in time series forecasting. Advances in Neural Information Processing Systems,
35:9881–9893, 2022.
Ziyang Ma, Wen Wu, Zhisheng Zheng, Yiwei Guo, Qian Chen, Shiliang Zhang, and Xie Chen.
Leveraging speech ptm, text llm, and emotional tts for speech emotion recognition. arXiv preprint
arXiv:2309.10294, 2023.
Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. The m4 competition: Re-
sults, findings, conclusion and way forward. International Journal of Forecasting, 34(4):802–808,
2018.
Suvir Mirchandani, Fei Xia, Pete Florence, Brian Ichter, Danny Driess, Montserrat Gonzalez Are-
nas, Kanishka Rao, Dorsa Sadigh, and Andy Zeng. Large language models as general pattern
machines, 2023.
Diganta Misra, Agam Goyal, Bharat Runwal, and Pin Yu Chen. Reprogramming under constraints:
Revisiting efficient and reliable transferability of lottery tickets. arXiv preprint arXiv:2308.14969,
2023.
Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth
64 words: Long-term forecasting with transformers. In the Eleventh International Conference on
Learning Representations, 2023.

OpenAI. Gpt-4 technical report, 2023.
Boris N Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. N-beats: Neural basis
expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437,
2019.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor
Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-
performance deep learning library. Advances in neural information processing systems, 32, 2019.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language
models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Stephen H Schneider and Robert E Dickinson. Climate modeling. Reviews of Geophysics, 12(3):
447–493, 1974.
Yihong Tang, Ao Qu, Andy HF Chow, William HK Lam, SC Wong, and Wei Ma. Domain adver-
sarial spatial-temporal network: a transferable framework for short-term traffic forecasting across
cities. In Proceedings of the 31st ACM International Conference on Information & Knowledge
Management, pp. 1905–1915, 2022.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée
Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and
efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Mul-
timodal few-shot learning with frozen language models. Advances in Neural Information Pro-
cessing Systems, 34:200–212, 2021.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural informa-
tion processing systems, 30, 2017.
Yan Wang, Zhixuan Chu, Xin Ouyang, Simeng Wang, Hongyan Hao, Yue Shen, Jinjie Gu, Siqiao
Xue, James Y Zhang, Qing Cui, et al. Enhancing recommender systems with large language
model reasoning graphs. arXiv preprint arXiv:2308.10835, 2023.
Qingsong Wen, Tian Zhou, Chaoli Zhang, Weiqi Chen, Ziqing Ma, Junchi Yan, and Liang Sun.
Transformers in time series: A survey. In International Joint Conference on Artificial Intelligence,
2023.
Gerald Woo, Chenghao Liu, Doyen Sahoo, Akshat Kumar, and Steven Hoi. Etsformer: Exponential
smoothing transformers for time-series forecasting. arXiv preprint arXiv:2202.01381, 2022.
Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition trans-
formers with auto-correlation for long-term series forecasting. Advances in Neural Information
Processing Systems, 34:22419–22430, 2021.
Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. Timesnet: Tem-
poral 2d-variation modeling for general time series analysis. arXiv preprint arXiv:2210.02186,
2022.
Hao Xue and Flora D Salim. Prompt-based time series forecasting: A new task and dataset. arXiv
preprint arXiv:2210.08964, 2022.
Chao-Han Huck Yang, Yun-Yun Tsai, and Pin-Yu Chen. Voice2series: Reprogramming acoustic
models for time series classification. In International conference on machine learning, pp. 11808–
11819. PMLR, 2021.
Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on
multimodal large language models. arXiv preprint arXiv:2306.13549, 2023.
Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series
forecasting? In Proceedings of the AAAI conference on artificial intelligence, volume 37, pp.
11121–11128, 2023.

Kexin Zhang, Qingsong Wen, Chaoli Zhang, Rongyao Cai, Ming Jin, Yong Liu, James Zhang,
Yuxuan Liang, Guansong Pang, Dongjin Song, et al. Self-supervised learning for time series
analysis: Taxonomy, progress, and prospects. arXiv preprint arXiv:2306.10125, 2023.
Tianping Zhang, Yizhuo Zhang, Wei Cao, Jiang Bian, Xiaohan Yi, Shun Zheng, and Jian Li. Less is
more: Fast multivariate time series forecasting with light sampling-oriented mlp structures. arXiv
preprint arXiv:2207.01186, 2022a.
Xiang Zhang, Ziyuan Zhao, Theodoros Tsiligkaridis, and Marinka Zitnik. Self-supervised con-
trastive pre-training for time series via time-frequency consistency. Advances in Neural Informa-
tion Processing Systems, 35:3988–4003, 2022b.
Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang.
Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings
of the AAAI conference on artificial intelligence, volume 35, pp. 11106–11115, 2021.
Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. Fedformer: Frequency
enhanced decomposed transformer for long-term series forecasting. In International Conference
on Machine Learning, pp. 27268–27286. PMLR, 2022.
Tian Zhou, Peisong Niu, Xue Wang, Liang Sun, and Rong Jin. One fits all: Power general time
series analysis by pretrained lm. Advances in Neural Information Processing Systems, 36, 2023a.
Yunyi Zhou, Zhixuan Chu, Yijia Ruan, Ge Jin, Yuchen Huang, and Sheng Li. ptse: A multi-model
ensemble method for probabilistic time series forecasting. arXiv preprint arXiv:2305.11304,
2023b.

A MORE RELATED WORK

We provide an extended discussion of related work on task-specific learning, focusing on the models
most closely related to those used in our comparisons. Recent works improve the Transformer (Vaswani
et al., 2017) for time series forecasting by incorporating signal processing principles like patching,
exponential smoothing, decomposition, and frequency analysis. For example, PatchTST (Nie et al.,
2023) segments a time series into patches that serve as input tokens to the Transformer. Patching retains local semantics,
reduces the computation and memory required by attention, and allows a longer input history, which improves long-term forecasting
accuracy over other Transformer-based models. It also achieves strong performance in self-supervised
pretraining and transfer learning. ETSformer (Woo et al., 2022) incorporates exponential smooth-
ing principles into Transformer attention to improve accuracy and efficiency. It uses exponential
smoothing attention and frequency attention to replace standard self-attention. FEDformer (Zhou
et al., 2022) combines Transformer with seasonal-trend decomposition. The decomposition captures
the global profile while Transformer captures detailed structures. It also uses frequency enhance-
ment for long-term prediction. This provides better performance and efficiency than the standard
Transformer. Autoformer (Wu et al., 2021) uses a decomposition architecture with auto-correlation
to enable progressive decomposition capacities for complex series. Auto-correlation is designed
based on series periodicity to conduct dependency discovery and representation aggregation. It out-
performs self-attention in efficiency and accuracy.
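To make the patching mechanism concrete, the following PyTorch sketch (our own illustration, not the
PatchTST reference implementation) segments a batch of univariate series into overlapping patches;
the patch length and stride are example values rather than the settings used in this paper.

    import torch

    def patchify(series: torch.Tensor, patch_len: int = 16, stride: int = 8) -> torch.Tensor:
        # series: (batch, time). unfold slides a window of size patch_len over the
        # time axis with the given stride, yielding (batch, num_patches, patch_len).
        return series.unfold(dimension=-1, size=patch_len, step=stride)

    # Example: 4 series of length 96 -> patches of shape (4, 11, 16); each patch is
    # later embedded and treated as one input token to the Transformer.
    patches = patchify(torch.randn(4, 96))
    print(patches.shape)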
Although these methods enhance efficiency and accuracy compared to vanilla Transformer, they are
mostly designed and optimized for narrow prediction tasks within specific domains. These models
are typically trained end-to-end on small, domain-specific datasets. While achieving strong perfor-
mance on their target tasks, such specialized models sacrifice versatility and generalizability across
the diverse range of time series data encountered in the real world. The narrow focus limits their
applicability to new datasets and tasks. To advance time series forecasting, there is a need for more
flexible, widely applicable models that can adapt to new data distributions and tasks without exten-
sive retraining. Ideal models would learn robust time series representations that transfer knowledge
across domains. Developing such broadly capable forecasting models remains an open challenge.
As discussed in the related work above, recent studies have begun to explore model versatility through
pre-training and architectural innovations, but further efforts are needed to realize the truly
general-purpose forecasting systems that we advance in this research.

B EXPERIMENTAL DETAILS

B.1 IMPLEMENTATION

We mainly follow the experimental configurations of (Wu et al., 2022) across all baselines, using the
unified evaluation pipeline at https://github.com/thuml/Time-Series-Library for
fair comparisons. We use Llama-7B (Touvron et al., 2023) as the default backbone model unless
stated otherwise. All our experiments are repeated three times and we report the averaged results.
Our model implementation is on PyTorch (Paszke et al., 2019) with all experiments conducted on
NVIDIA A100-80G GPUs. Our detailed model configurations are in Appendix B.4.
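For reproducibility, a minimal sketch of this protocol (three independent runs with the averaged result
reported) is given below; the seed values and the run_experiment routine are placeholders of our own,
not part of the released code.

    import random
    import numpy as np
    import torch

    def set_seed(seed: int) -> None:
        # fix all relevant random number generators for one run
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)

    def run_experiment() -> float:
        # placeholder for one full training/evaluation run; returns a metric (e.g., MSE)
        return float(torch.rand(1))

    scores = []
    for seed in (0, 1, 2):  # placeholder seeds
        set_seed(seed)
        scores.append(run_experiment())
    print(np.mean(scores), np.std(scores))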

B.2 DATASET DETAILS

Dataset statistics are summarized in Tab. 8. We evaluate the long-term forecasting performance on
the well-established ETT datasets (Zhou et al., 2021) (i.e., ETTh1, ETTh2, ETTm1, and ETTm2).
Furthermore, we evaluate the performance of short-term forecasting on the M4 benchmark (Makri-
dakis et al., 2018).
The Electricity Transformer Temperature (ETT; an indicator of long-term electric power
deployment) benchmark comprises two years of data sourced from two counties in China,
and is subdivided into four distinct datasets, each with varying sampling rates: ETTh1 and ETTh2,
which are sampled at a 1-hour level, and ETTm1 and ETTm2, which are sampled at a 15-minute
level. Each entry within the ETT datasets includes six power load features and a target variable,
termed “oil temperature”.

Table 8: Dataset statistics are from (Wu et al., 2022). The dimension indicates the number of time
series (i.e., channels), and the dataset size is organized in (training, validation, testing).

Tasks                   Dataset       Dim.  Series Length         Dataset Size           Frequency  Information
Long-term Forecasting   ETTm1         7     {96, 192, 336, 720}   (34465, 11521, 11521)  15 min     Temperature
Long-term Forecasting   ETTm2         7     {96, 192, 336, 720}   (34465, 11521, 11521)  15 min     Temperature
Long-term Forecasting   ETTh1         7     {96, 192, 336, 720}   (8545, 2881, 2881)     1 hour     Temperature
Long-term Forecasting   ETTh2         7     {96, 192, 336, 720}   (8545, 2881, 2881)     1 hour     Temperature
Short-term Forecasting  M4-Yearly     1     6                     (23000, 0, 23000)      Yearly     Demographic
Short-term Forecasting  M4-Quarterly  1     8                     (24000, 0, 24000)      Quarterly  Finance
Short-term Forecasting  M4-Monthly    1     18                    (48000, 0, 48000)      Monthly    Industry
Short-term Forecasting  M4-Weekly     1     13                    (359, 0, 359)          Weekly     Macro
Short-term Forecasting  M4-Daily      1     14                    (4227, 0, 4227)        Daily      Micro
Short-term Forecasting  M4-Hourly     1     48                    (414, 0, 414)          Hourly     Other

The M4 benchmark comprises 100K time series, amassed from various domains commonly present
in business, financial, and economic forecasting. These time series are partitioned into six
distinct datasets, with sampling frequencies ranging from yearly to hourly.

B.3 EVALUATION METRICS

For evaluation metrics, we utilize the mean square error (MSE) and mean absolute error (MAE) for
long-term forecasting. For short-term forecasting on the M4 benchmark, we adopt the sym-
metric mean absolute percentage error (SMAPE), mean absolute scaled error (MASE), and overall
weighted average (OWA) as in N-BEATS (Oreshkin et al., 2019). Note that OWA is a specific metric
utilized in the M4 competition. The calculations of these metrics are as follows:

\begin{align*}
\mathrm{MSE} &= \frac{1}{H}\sum_{h=1}^{H} (Y_h - \hat{Y}_h)^2, &
\mathrm{MAE} &= \frac{1}{H}\sum_{h=1}^{H} |Y_h - \hat{Y}_h|, \\
\mathrm{SMAPE} &= \frac{200}{H}\sum_{h=1}^{H} \frac{|Y_h - \hat{Y}_h|}{|Y_h| + |\hat{Y}_h|}, &
\mathrm{MAPE} &= \frac{100}{H}\sum_{h=1}^{H} \frac{|Y_h - \hat{Y}_h|}{|Y_h|}, \\
\mathrm{MASE} &= \frac{1}{H}\sum_{h=1}^{H} \frac{|Y_h - \hat{Y}_h|}{\frac{1}{H-s}\sum_{j=s+1}^{H} |Y_j - Y_{j-s}|}, &
\mathrm{OWA} &= \frac{1}{2}\left(\frac{\mathrm{SMAPE}}{\mathrm{SMAPE}_{\text{Naïve2}}} + \frac{\mathrm{MASE}}{\mathrm{MASE}_{\text{Naïve2}}}\right),
\end{align*}

where s is the periodicity of the time series data, H denotes the number of data points (i.e., the
prediction horizon in our case), and Y_h and Ŷ_h are the h-th ground truth and prediction, where h ∈ {1, . . . , H}.
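For concreteness, below is a minimal NumPy sketch of these metrics (our own illustration, not the
evaluation code of the unified pipeline); the function names are ours, and the Naïve2 reference
scores required by OWA must be supplied from the M4 benchmark.

    import numpy as np

    def mse(y, yhat):
        return np.mean((y - yhat) ** 2)

    def mae(y, yhat):
        return np.mean(np.abs(y - yhat))

    def smape(y, yhat):
        # symmetric MAPE in percent, as used on the M4 benchmark
        return 200.0 * np.mean(np.abs(y - yhat) / (np.abs(y) + np.abs(yhat)))

    def mase(y, yhat, s):
        # literal transcription of the formula above; note that the M4 competition
        # computes the seasonal-naive scaling term on the in-sample (training) series
        scale = np.mean(np.abs(y[s:] - y[:-s]))
        return np.mean(np.abs(y - yhat)) / scale

    def owa(smape_model, mase_model, smape_naive2, mase_naive2):
        # overall weighted average relative to the Naïve2 benchmark scores
        return 0.5 * (smape_model / smape_naive2 + mase_model / mase_naive2)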

B.4 MODEL CONFIGURATIONS

The configurations of our models for the different tasks and datasets are consolidated in Tab. 9.
By default, the Adam optimizer (Kingma & Ba, 2015) is employed in all experiments. The number
of text prototypes V′ is fixed at 100 and 1000 for short-term and long-term forecasting tasks,
respectively. We use the Llama-7B model at full capacity, keeping all 32 backbone layers across
tasks by default. The input length T denotes the number of time steps in the original input time
series. The patch dimension dm is the hidden dimension of the embedded time series patches prior
to reprogramming. Lastly, heads K is the number of heads in the multi-head cross-attention used
for patch reprogramming. The four rightmost columns of Tab. 9 detail the configurations related
to model training.

Table 9: An overview of the experimental configurations for TIME-LLM. “LTF” and “STF” denote
long-term and short-term forecasting, respectively.

(Columns 2–6 list the model hyperparameters; the last four columns describe the training process.)

Task-Dataset   Text Prototypes V′   Backbone Layers   Input Length T   Patch Dim. dm   Heads K   LR∗     Loss    Batch Size   Epochs
LTF - ETTh1    1000                 32                512              16              8         10^-3   MSE     16           50
LTF - ETTh2    1000                 32                512              16              8         10^-3   MSE     16           50
LTF - ETTm1    1000                 32                512              16              8         10^-3   MSE     16           100
LTF - ETTm2    1000                 32                512              16              8         10^-3   MSE     16           100
STF - M4       100                  32                2×H              32              8         10^-4   SMAPE   32           50
† H represents the forecasting horizon of the M4 datasets.
∗ LR means the initial learning rate.
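For illustration, the long-term forecasting configuration for ETTh1 in Tab. 9 can be written as a
plain Python dictionary; the key names below are our own shorthand, not identifiers from the
released code.

    # Minimal sketch of the LTF - ETTh1 configuration from Tab. 9 (key names are illustrative).
    config_ltf_etth1 = {
        "text_prototypes": 1000,   # V'
        "backbone_layers": 32,     # full Llama-7B
        "input_length": 512,       # T
        "patch_dim": 16,           # d_m
        "heads": 8,                # K, cross-attention heads for patch reprogramming
        "learning_rate": 1e-3,     # initial LR
        "loss": "MSE",
        "batch_size": 16,
        "epochs": 50,
    }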

C HYPERPARAMETER SENSITIVITY

We conduct a hyperparameter sensitivity analysis focusing on the four important hyperparameters
within TIME-LLM: namely, the number of backbone model layers, the number of text prototypes
V ′ , the time series input length T , and the number of patch reprogramming cross-attention heads
K. The correlated results can be found in Fig. 6. From our analysis, we derive the following
observations: (1) There is a positive correlation between the number of Transformer layers in the
backbone LLM and the performance of TIME-LLM, affirming that the scaling law is preserved after
reprogramming the LLM; (2) Generally, acquiring more text prototypes enhances performance. We
hypothesize that a limited number of prototypes V ′ might induce noise when aggregating language
cues, consequently obstructing the efficient learning of highly representative prototypes essential for
characterizing the input time series patches; (3) The input time length T exhibits a direct relation
with forecasting accuracy, particularly evident when predicting extended horizons. This observation
is logical and is in congruence with conventional time series models; (4) Increasing the number of
attention heads during the reprogramming of input patches proves to be advantageous.

Figure 6: Analysis of hyperparameter sensitivity on ETTh1 dataset.

D ZERO-SHOT FORECASTING

The full results of zero-shot forecasting are summarized in Tab. 10. It can be seen that TIME-LLM
remarkably surpasses the five most competitive time series models in zero-shot adaptation. Overall,
we observe over 23.5% and 12.4% MSE and MAE reductions across all baselines on average.
Our improvements are consistently significant in typical cross-domain scenarios (e.g., ETTh2
→ ETTh1 and ETTm2 → ETTm1), exceeding 20.8% and 11.3% on average w.r.t. MSE and MAE.
We attribute this success to our reprogramming framework being better at activating the LLM’s
knowledge transfer and reasoning capabilities in a resource-efficient manner when performing time
series tasks.

Table 10: Full zero-shot learning results on ETT datasets. A lower value indicates better performance. Red:
the best, Blue: the second best.

Methods TIME-LLM GPT4TS DLinear PatchTST TimesNet Autoformer


Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
ETTh1 → ETTh2
96 0.279 0.337 0.335 0.374 0.347 0.400 0.304 0.350 0.358 0.387 0.469 0.486
192 0.351 0.374 0.412 0.417 0.447 0.460 0.386 0.400 0.427 0.429 0.634 0.567
336 0.388 0.415 0.441 0.444 0.515 0.505 0.414 0.428 0.449 0.451 0.655 0.588
720 0.391 0.420 0.438 0.452 0.665 0.589 0.419 0.443 0.448 0.458 0.570 0.549
Avg. 0.353 0.387 0.406 0.422 0.493 0.488 0.380 0.405 0.421 0.431 0.582 0.548
ETTh1 → ETTm2
96 0.189 0.293 0.236 0.315 0.255 0.357 0.215 0.304 0.239 0.313 0.352 0.432
192 0.237 0.312 0.287 0.342 0.338 0.413 0.275 0.339 0.291 0.342 0.413 0.460
336 0.291 0.365 0.341 0.374 0.425 0.465 0.334 0.373 0.342 0.371 0.465 0.489
720 0.372 0.390 0.435 0.422 0.640 0.573 0.431 0.424 0.434 0.419 0.599 0.551
Avg. 0.273 0.340 0.325 0.363 0.415 0.452 0.314 0.360 0.327 0.361 0.457 0.483
ETTh2 → ETTh1
96 0.450 0.452 0.732 0.577 0.689 0.555 0.485 0.465 0.848 0.601 0.693 0.569
192 0.465 0.461 0.758 0.559 0.707 0.568 0.565 0.509 0.860 0.610 0.760 0.601
336 0.501 0.482 0.759 0.578 0.710 0.577 0.581 0.515 0.867 0.626 0.781 0.619
720 0.501 0.502 0.781 0.597 0.704 0.596 0.628 0.561 0.887 0.648 0.796 0.644
Avg. 0.479 0.474 0.757 0.578 0.703 0.574 0.565 0.513 0.865 0.621 0.757 0.608
ETTh2 → ETTm2
96 0.174 0.276 0.253 0.329 0.240 0.336 0.226 0.309 0.248 0.324 0.263 0.352
192 0.233 0.315 0.293 0.346 0.295 0.369 0.289 0.345 0.296 0.352 0.326 0.389
336 0.291 0.337 0.347 0.376 0.345 0.397 0.348 0.379 0.353 0.383 0.387 0.426
720 0.392 0.417 0.446 0.429 0.432 0.442 0.439 0.427 0.471 0.446 0.487 0.478
Avg. 0.272 0.341 0.335 0.370 0.328 0.386 0.325 0.365 0.342 0.376 0.366 0.411
ETTm1 → ETTh2
96 0.321 0.369 0.353 0.392 0.365 0.415 0.354 0.385 0.377 0.407 0.435 0.470
192 0.389 0.410 0.443 0.437 0.454 0.462 0.447 0.434 0.471 0.453 0.495 0.489
336 0.408 0.433 0.469 0.461 0.496 0.494 0.481 0.463 0.472 0.484 0.470 0.472
720 0.406 0.436 0.466 0.468 0.541 0.529 0.474 0.471 0.495 0.482 0.480 0.485
Avg. 0.381 0.412 0.433 0.439 0.464 0.475 0.439 0.438 0.457 0.454 0.470 0.479
ETTm1 → ETTm2
96 0.169 0.257 0.217 0.294 0.221 0.314 0.195 0.271 0.222 0.295 0.385 0.457
192 0.227 0.318 0.277 0.327 0.286 0.359 0.258 0.311 0.288 0.337 0.433 0.469
336 0.290 0.338 0.331 0.360 0.357 0.406 0.317 0.348 0.341 0.367 0.476 0.477
720 0.375 0.367 0.429 0.413 0.476 0.476 0.416 0.404 0.436 0.418 0.582 0.535
Avg. 0.268 0.320 0.313 0.348 0.335 0.389 0.296 0.334 0.322 0.354 0.469 0.484
ETTm2 → ETTh2
96 0.298 0.356 0.360 0.401 0.333 0.391 0.327 0.367 0.360 0.401 0.353 0.393
192 0.359 0.397 0.434 0.437 0.441 0.456 0.411 0.418 0.434 0.437 0.432 0.437
336 0.367 0.412 0.460 0.459 0.505 0.503 0.439 0.447 0.460 0.459 0.452 0.459
720 0.393 0.434 0.485 0.477 0.543 0.534 0.459 0.470 0.485 0.477 0.453 0.467
Avg. 0.354 0.400 0.435 0.443 0.455 0.471 0.409 0.425 0.435 0.443 0.423 0.439
ETTm2 → ETTm1
96 0.359 0.397 0.747 0.558 0.570 0.490 0.491 0.437 0.747 0.558 0.735 0.576
192 0.390 0.420 0.781 0.560 0.590 0.506 0.530 0.470 0.781 0.560 0.753 0.586
336 0.421 0.445 0.778 0.578 0.706 0.567 0.565 0.497 0.778 0.578 0.750 0.593
720 0.487 0.488 0.769 0.573 0.731 0.584 0.686 0.565 0.769 0.573 0.782 0.609
Avg. 0.414 0.438 0.769 0.567 0.649 0.537 0.568 0.492 0.769 0.567 0.755 0.591

E SHORT-TERM FORECASTING

Our complete results on short-term forecasting are presented in Tab. 11. TIME-LLM consistently
outperforms the majority of baseline models in most cases. Notably, we surpass GPT4TS by a
large margin (e.g., 8.7% overall, 13.4% on M4-Yearly, and an average of 21.5% on M4-Hourly,
M4-Daily, and M4-Weekly), as well as TimesNet (e.g., 10% overall, 14.1% on M4-Yearly, and
an average of 30.1% on M4-Hourly, M4-Daily, and M4-Weekly). Compared to the recent state-of-the-art forecasting models, N-HiTS and PatchTST, TIME-LLM exhibits comparable or superior performance without any parameter updates to the backbone LLM.

Table 11: Full short-term time series forecasting results. The forecasting horizons are in [6, 48], and the
Average rows are weighted averages over all datasets under different sampling intervals. A lower value indicates
better performance. Red: the best, Blue: the second best.

Methods  TIME-LLM  GPT4TS  TimesNet  PatchTST  N-HiTS  N-BEATS  ETSformer  LightTS  DLinear  FEDformer  Stationary  Autoformer  Informer  Reformer
Yearly
  SMAPE 13.419 15.11 15.378 13.477 13.422 13.487 18.009 14.247 16.965 14.021 13.717 13.974 14.727 16.169
  MASE 3.005 3.565 3.554 3.019 3.056 3.036 4.487 3.109 4.283 3.036 3.078 3.134 3.418 3.800
  OWA 0.789 0.911 0.918 0.792 0.795 0.795 1.115 0.827 1.058 0.811 0.807 0.822 0.881 0.973
Quarterly
  SMAPE 10.110 10.597 10.465 10.38 10.185 10.564 13.376 11.364 12.145 11.1 10.958 11.338 11.360 13.313
  MASE 1.178 1.253 1.227 1.233 1.18 1.252 1.906 1.328 1.520 1.35 1.325 1.365 1.401 1.775
  OWA 0.889 0.938 0.923 0.921 0.893 0.936 1.302 1.000 1.106 0.996 0.981 1.012 1.027 1.252
Monthly
  SMAPE 12.980 13.258 13.513 12.959 13.059 13.089 14.588 14.014 13.514 14.403 13.917 13.958 14.062 20.128
  MASE 0.963 1.003 1.039 0.97 1.013 0.996 1.368 1.053 1.037 1.147 1.097 1.103 1.141 2.614
  OWA 0.903 0.931 0.957 0.905 0.929 0.922 1.149 0.981 0.956 1.038 0.998 1.002 1.024 1.927
Others
  SMAPE 4.795 6.124 6.913 4.952 4.711 6.599 7.267 15.880 6.709 7.148 6.302 5.485 24.460 32.491
  MASE 3.178 4.116 4.507 3.347 3.054 4.43 5.240 11.434 4.953 4.041 4.064 3.865 20.960 33.355
  OWA 1.006 1.259 1.438 1.049 0.977 1.393 1.591 3.474 1.487 1.389 1.304 1.187 5.879 8.679
Average
  SMAPE 11.983 12.69 12.88 12.059 12.035 12.25 14.718 13.525 13.639 13.16 12.780 12.909 14.086 18.200
  MASE 1.595 1.808 1.836 1.623 1.625 1.698 2.408 2.111 2.095 1.775 1.756 1.771 2.718 4.223
  OWA 0.859 0.94 0.955 0.869 0.869 0.896 1.172 1.051 1.051 0.949 0.930 0.939 1.230 1.775


F EFFICIENCY COMPARISON WITH MODEL FINE-TUNING

Setups. We compare the efficiency of model fine-tuning (with QLoRA (Dettmers et al., 2023)) and our
proposed model reprogramming with two different backbones: Llama at 1/4 capacity (the first 8
Transformer layers) and at full capacity. We adhere to the long-term forecasting protocol on ETTh1
and forecast two different horizons ahead (96 and 336 steps). As evaluation metrics, we report the
total number of trainable parameters (in millions), GPU memory (in MiB), and running time (in
seconds per iteration).

Results. Our results are given in Tab. 12. Model reprogramming is markedly more efficient than
parameter-efficient fine-tuning (PEFT) with QLoRA on long-range forecasting tasks in terms of the
total number of trainable parameters, GPU memory overhead, and training speed. Quantitatively,
there is a 71.2% reduction in trainable parameters on average over the four scenarios, leading to
23.1% lower memory consumption and 25.3% faster training.

Table 12: Efficiency comparison between model reprogramming and parameter-efficient fine-tuning (PEFT)
with QLoRA (Dettmers et al., 2023) on ETTh1 dataset in forecasting two different steps ahead.

Backbone    Method      ETTh1-96                                         ETTh1-336
                        Trainable Param. (M)  Mem. (MiB)  Speed (s/iter)  Trainable Param. (M)  Mem. (MiB)  Speed (s/iter)
Llama (8)   QLoRA       12.60                 14767       0.237           12.69                 15982       0.335
Llama (8)   Reprogram    5.62                 11370       0.184            5.71                 13188       0.203
Llama (32)  QLoRA       50.29                 45226       0.697           50.37                 49374       0.732
Llama (32)  Reprogram    6.39                 32136       0.517            6.48                 37988       0.632
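As an illustration of how such numbers can be measured (a sketch under our own assumptions, not the
exact profiling code behind Tab. 12), the following PyTorch snippet counts trainable parameters and
records peak GPU memory and per-iteration time; model, batch, loss_fn, and optimizer are placeholders.

    import time
    import torch

    def count_trainable_params_in_millions(model: torch.nn.Module) -> float:
        # only parameters with requires_grad=True are updated during training
        return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

    def profile_one_iteration(model, batch, loss_fn, optimizer):
        # rough per-iteration wall-clock time and peak memory (MiB) on a CUDA device
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
        start = time.time()
        optimizer.zero_grad()
        loss = loss_fn(model(batch["x"]), batch["y"])
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()
        seconds_per_iter = time.time() - start
        peak_mib = torch.cuda.max_memory_allocated() / 2**20
        return seconds_per_iter, peak_mib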

G ERROR BARS

All experiments have been conducted three times, and we present the standard deviations of our
model and the runner-up model here. The comparisons between our method and the second-best
method, PatchTST (Nie et al., 2023), on long-term forecasting tasks, are delineated in Tab. 13.
In this table, the average MSE and MAE have been reported across four ETT datasets, complete
with standard deviations. Furthermore, Tab. 14 contrasts the effectiveness of our method with that
of the second-best method, N-HiTS (Challu et al., 2023a), employing varying M4 datasets for the
comparison.

Table 13: Standard deviations of our approach and the second-best method (PatchTST) on ETTh1, ETTh2,
ETTm1, and ETTm2 for long-term forecasting.

Model TIME-LLM PatchTST (2023)


Dataset MSE MAE MSE MAE
ETTh1 0.408 ± 0.011 0.423 ± 0.012 0.413 ± 0.001 0.430 ± 0.002
ETTh2 0.334 ± 0.005 0.383 ± 0.009 0.330 ± 0.002 0.379 ± 0.007
ETTm1 0.329 ± 0.006 0.372 ± 0.007 0.351 ± 0.006 0.380 ± 0.002
ETTm2 0.251 ± 0.002 0.313 ± 0.003 0.255 ± 0.003 0.315 ± 0.002

Table 14: Standard deviations of our T IME -LLM and the second-best method (N-HiTS) on M4 datasets for
short-term forecasting.

Model TIME-LLM N-HiTS (2023a)


Dataset SMAPE MASE OWA SMAPE MASE OWA
Yearly 13.419 ± 0.117 3.005 ± 0.011 0.789 ± 0.003 13.422 ± 0.009 3.056 ± 0.017 0.795 ± 0.010
Quarterly 10.110 ± 0.107 1.178 ± 0.009 0.889 ± 0.007 10.185 ± 0.107 1.180 ± 0.007 0.893 ± 0.001
Monthly 12.980 ± 0.102 0.963 ± 0.005 0.903 ± 0.001 13.059 ± 0.101 1.013 ± 0.007 0.929 ± 0.005
Others 4.795 ± 0.117 3.178 ± 0.012 1.006 ± 0.009 4.711 ± 0.117 3.054 ± 0.011 0.997 ± 0.012
Averaged 11.983 ± 0.011 1.595 ± 0.021 0.859 ± 0.002 12.035 ± 0.111 1.625 ± 0.012 0.869 ± 0.005

H VISUALIZATION
In this part, we visualize the forecasting results of TIME-LLM compared with state-of-the-art
and representative methods (e.g., GPT4TS (Zhou et al., 2023a), PatchTST (Nie et al., 2023),
and Autoformer (Wu et al., 2021)) in various scenarios to demonstrate the superior performance of
TIME-LLM.
In Fig. 7 and Fig. 8, the long-term (input-96-predict-96) and short-term (input-36-predict-36) fore-
casts of various approaches are compared with the ground truth. Here, TIME-LLM showcases
forecasting accuracy that is notably superior to that of GPT4TS, PatchTST, and a classical
Transformer-based method, Autoformer.
We also offer visual comparisons of the forecasting results in both few-shot and zero-shot scenarios,
as depicted in Fig. 9 and Fig. 10. We adhere to the long-term (input-96-predict-96) forecasting setup
in both cases. TIME-LLM exhibits remarkable superiority in forecasting with limited data, which
becomes particularly salient when compared to GPT4TS.

(a) Time-LLM  (b) GPT4TS  (c) PatchTST  (d) Autoformer

Figure 7: Long-term forecasting cases from ETTh1 by different models under the input-96-predict-
96 settings. Blue lines are the ground truths and orange lines are the model predictions.

(a) Time-LLM  (b) GPT4TS  (c) TimesNet  (d) Autoformer

Figure 8: Short-term forecasting from the M4 dataset by different models under the input-36-predict-
18 settings.

(a) Time-LLM  (b) GPT4TS  (c) PatchTST  (d) Autoformer

Figure 9: Few-shot forecasting cases from ETTm1 by different models under the input-96-predict-
96 settings. Blue lines are the ground truths and orange lines are the model predictions.

(a) Time-LLM  (b) GPT4TS  (c) PatchTST  (d) Autoformer

Figure 10: Zero-shot forecasting cases from ETTh1→ETTh2 by different models under the input-
96-predict-96 settings. Blue lines are the ground truths and orange lines are the model predictions.
