TimesFM
A Zero-Shot Foundation for Time-Series Modeling
Understanding Zero-Shot Learning in Decoder-Only LLMs
Before a transformer-based model can make sense of an input, the raw data (whether text or time-series) needs to be structured into discrete components. In NLP, this means converting words into tokens, which serve as the fundamental units of computation. Each token is mapped to a high-dimensional embedding vector, capturing both semantic meaning and syntactic relationships.
● The sentence "The stock market is volatile today." is tokenized into: ["The", "stock", "market", "is", "volatile", "today", "."]
● Mapped to a sequence of embeddings: [E1, E2, E3, E4, E5, E6, E7]
These embeddings flow through the transformer layers, undergoing complex transformations that help the model understand context and relationships between words.
The Transformer Processing Pipeline: Self-Attention and Predicting the Next Token
Finally, an output layer predicts the probability distribution over the vocabulary for the next token (i+1) based on all previously observed tokens. The model learns to generate text autoregressively, meaning each output depends only on past information.
● Context: ["The", "stock", "market", "is", "volatile"]
● The model generates "today" as the most probable next word.
This architecture allows LLMs to generate coherent text, translate languages, and even answer questions, all in a zero-shot setting, without task-specific training.
While transformers have proven their dominance in NLP, the question arises:
"Can a similar architecture be adapted for time-series forecasting?"
● LLMs (e.g., GPT): input text tokenization
● TimesFM: time-series patching (breaking the series into structured chunks)
● Day 1: $100
● Day 2: $105
● Day 3: $110
● Day 4: $108
● Day 5: ???
In the next sections, we will delve deeper into the specific architecture of TimesFM, discussing how it incorporates transformers for time-series forecasting and how it outperforms traditional models through self-attention-driven temporal modeling.
To understand how the input flows through the system, we will explore three key components: patching the time-series data, processing it through a residual block, and applying a padding mask to handle variable-length sequences.
For instance, consider a time-series representing daily stock prices over 15 days:
Original sequence:
100, 102, 105, 110, 108, 112, 115, 117, 119, 123, 125, 127, 130, 133, 135
With a patch length of 5, this 15-point sequence is divided into three non-overlapping patches of five values each. Instead of handling 15 individual data points, the model now processes only three patches. This method reduces computational complexity while ensuring that local patterns within each segment are preserved.
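A minimal sketch of this patching step in Python, assuming non-overlapping patches of length 5 as in the example above:

import numpy as np

series = np.array([100, 102, 105, 110, 108, 112, 115, 117, 119, 123,
                   125, 127, 130, 133, 135], dtype=float)
patch_len = 5

# Reshape the 15-point series into 3 non-overlapping patches of 5 values each.
patches = series.reshape(-1, patch_len)
print(patches)
# [[100. 102. 105. 110. 108.]
#  [112. 115. 117. 119. 123.]
#  [125. 127. 130. 133. 135.]]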
Once the time-series data is split into patches, each patch undergoes further processing to enhance its representation before being passed into the transformer layers. This is achieved using a residual block, which refines the patch embeddings while preserving the original structure.
A residual block is a small neural network module that consists of two main components: a feedforward transformation and a skip connection. The feedforward transformation applies linear projections and non-linear activations to extract complex features from the raw patch data, while the skip connection ensures that the original information is retained. This prevents the transformation from distorting the input representation too aggressively.
To illustrate this, consider the first patch from our stock price example:
Patch 1: [100, 102, 105, 110, 108]
This processed patch now carries both the original numerical structure and enriched features extracted from the transformation. The residual connection prevents loss of information while allowing the model to capture deeper relationships in the data.
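As a rough sketch of this idea (not the exact TimesFM implementation), the block below combines a feedforward transformation with a skip connection in PyTorch; the hidden size, output dimension, and SiLU activation are illustrative assumptions.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Feedforward transformation plus a skip connection that preserves the input."""
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)   # linear projection
        self.act = nn.SiLU()                             # non-linear activation (assumed)
        self.out = nn.Linear(hidden_dim, output_dim)     # project to the model dimension
        self.skip = nn.Linear(input_dim, output_dim)     # skip path keeps the original signal

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(self.act(self.hidden(x))) + self.skip(x)

# Embed the first patch [100, 102, 105, 110, 108] into an illustrative model dimension of 16.
patch = torch.tensor([[100., 102., 105., 110., 108.]])
block = ResidualBlock(input_dim=5, hidden_dim=32, output_dim=16)
embedding = block(patch)   # shape: (1, 16)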
A padding mask is a binary tensor that marks which parts of the input should be ignored. This is particularly useful when dealing with batches of time-series that have different lengths, as it ensures that the model does not allocate unnecessary attention to artificially padded values.
For example, consider two series of different lengths:
● Series A: [100, 102, 105, 110, 108, 112, 115] (7 time points)
● Series B: [50, 52, 53, 54] (4 time points)
After padding Series B to the patch length of 5:
Patch 1: [50, 52, 53, 54, 0]
Mask 1: [1, 1, 1, 1, 0]
In this mask, 1 indicates valid data, while 0 indicates padded regions that the transformer should ignore. Without this mechanism, the model might treat padded values as actual observations, leading to incorrect pattern recognition.
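A short sketch of how such a mask could be built for the padded patch of Series B:

import numpy as np

# Series B has only 4 real observations; the patch is padded to length 5 with a zero.
patch_b = np.array([50., 52., 53., 54., 0.])
valid_len = 4

# 1 marks valid data, 0 marks padded positions the transformer should ignore.
mask = np.zeros_like(patch_b)
mask[:valid_len] = 1
print(mask)   # [1. 1. 1. 1. 0.]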
Another crucial aspect of input processing in TimesFM is patch masking, which forces the model to generalize beyond simply memorizing training sequences. Instead of always feeding complete patches into the network, the model randomly masks certain sections, requiring it to infer missing values based on the available context.
For example, a masked version of the second patch might look like:
Patch 2 (masked): [112, ?, 117, ?, 123]
Here, the question marks represent masked values that the model must learn to predict without direct supervision. This approach mimics real-world scenarios where data may be incomplete or unavailable, improving the model's ability to make robust zero-shot predictions.
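A rough sketch of random masking during training; the 20% masking rate and element-wise strategy are illustrative simplifications, not TimesFM's exact masking scheme.

import numpy as np

rng = np.random.default_rng(0)
patches = np.array([[100., 102., 105., 110., 108.],
                    [112., 115., 117., 119., 123.],
                    [125., 127., 130., 133., 135.]])

# Randomly mask ~20% of positions (illustrative rate); masked values are hidden from the model.
mask = rng.random(patches.shape) < 0.2
masked_patches = np.where(mask, 0.0, patches)   # 0.0 stands in for the "?" placeholder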
At this stage, the raw time-series data has been structured into patches, enriched through residual blocks, and optimized for variable-length inputs using padding masks and patch masking. This prepares the data for the stacked transformer layers, where self-attention mechanisms will extract deeper relationships across time.
Once the input time-series has been segmented into patches, processed through residual blocks, and appropriately masked, it is ready to be passed through the stacked transformer layers. These layers form the computational core of TimesFM, allowing it to capture intricate dependencies across time.
In natural language processing, a transformer processes a sequence of words and learns
the contextual relationships between them. The key mechanism behind this is
self-attention, where the model dynamically determines which words contribute the most to
understanding a given token. TimesFM repurposes this mechanism for time-series data.
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

where:
● Q (Query) represents the current patch's embedding,
● K (Key) represents all past patches' embeddings,
● V (Value) holds the learned transformations of past patches.
This equation determines which past patches should influence the current forecast, and the attention scores dynamically adjust based on the sequence's structure. The softmax operation ensures that attention weights sum to 1, allowing the model to allocate importance proportionally.
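The following sketch implements scaled dot-product attention with a causal mask, so each patch attends only to itself and earlier patches as in decoder-only training; the tensor shapes are illustrative.

import torch
import torch.nn.functional as F

def causal_self_attention(Q, K, V):
    """Scaled dot-product attention over patches, restricted to past positions."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5          # similarity between patches
    causal = torch.tril(torch.ones(scores.shape[-2:], dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))    # block attention to future patches
    weights = F.softmax(scores, dim=-1)                    # attention weights sum to 1
    return weights @ V

# Three patch embeddings of dimension 16 (illustrative).
x = torch.randn(3, 16)
out = causal_self_attention(x, x, x)   # each row mixes only current and earlier patches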
A single attention head might not be sufficient to capture all relevant features in a time-series. Some dependencies might be short-term, while others could extend across long-term trends. To resolve this, multi-head self-attention is employed, where multiple independent attention mechanisms operate in parallel, each capturing different aspects of the data.
● One attention head might focus on the previous 24-hour cycle (daily pattern),
● Another head might detect weekly trends,
● A third head might capture anomalous spikes due to unexpected events.
By stacking multiple transformer layers, each layer refines the representations learned in the previous step. Lower layers might capture local variations, while deeper layers focus on broad seasonal structures. The final layer outputs a processed version of each patch, embedding all the learned dependencies.
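A compact sketch of stacked multi-head self-attention over patch embeddings using PyTorch's nn.MultiheadAttention; the number of heads, layers, and embedding size are illustrative assumptions, not TimesFM's published configuration.

import torch
import torch.nn as nn

embed_dim, num_heads, num_layers = 16, 4, 2   # illustrative sizes
patches = torch.randn(1, 3, embed_dim)        # (batch, num_patches, embed_dim)

# Causal mask: True above the diagonal blocks attention to future patches.
causal_mask = torch.triu(torch.ones(3, 3, dtype=torch.bool), diagonal=1)

layers = nn.ModuleList(
    [nn.MultiheadAttention(embed_dim, num_heads, batch_first=True) for _ in range(num_layers)]
)

x = patches
for attn in layers:
    refined, _ = attn(x, x, x, attn_mask=causal_mask)
    x = x + refined   # residual connection; each layer refines the previous representation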
Once the transformer layers have processed the input patches, the output layer maps the final transformed embeddings to actual forecasted values. The output layers of TimesFM are structured to achieve three crucial objectives:
1. Mapping the output token to an actual numerical prediction
2. Training in decoder-only mode, where each output token predicts the next segment of the time-series
3. Allowing the output patch length to be larger than the input patch length, which enables long-range forecasting in a single forward pass
ŷ_{pj+1 : pj+h} = OutputResidualBlock(o_j)

where:
● ŷ_{pj+1 : pj+h} is the predicted sequence following patch j,
● o_j is the output embedding corresponding to that patch.
During training, the model is given only the observed values (50, 55, 60) and must predict the next three missing values (65, 70, 75). Once these values are predicted, they are fed back into the model to predict the next segment (80, 85, 90). This auto-regressive structure mirrors how LLMs generate text, producing one token at a time.
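A schematic of this feedback loop; predict_next_patch is a hypothetical stand-in for a trained model's forward pass.

def autoregressive_forecast(predict_next_patch, context, horizon):
    """Roll the model forward: each predicted output patch is appended to the context."""
    forecast = []
    while len(forecast) < horizon:
        next_patch = predict_next_patch(context)     # one forward pass -> one output patch
        forecast.extend(next_patch)
        context = list(context) + list(next_patch)   # feed predictions back as new context
    return forecast[:horizon]

# With observed values [50, 55, 60], one pass might yield [65, 70, 75]; feeding those back
# produces the next segment [80, 85, 90], mirroring token-by-token generation in LLMs.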
Unlike text-based models, where each token corresponds to a single word, in TimesFM each output patch can be larger than the input patch. This means the model is trained to predict larger time spans instead of step-by-step values.
Predicting Larger Chunks Than Seen: The Key Difference Between LLMs and TimesFM
One fundamental difference between TimesFM and LLMs is that the input patch length does not have to match the output patch length. In language models, each token corresponds to a single word, and generation happens word-by-word. In contrast, TimesFM can predict a much longer horizon in a single forward pass, making it more efficient for time-series forecasting.
Suppose we have stock price data for the past 100 days. A traditional auto-regressive model would generate one step at a time, requiring 100 iterative steps to forecast the next 100 days. TimesFM, however, can take 50 days of input and generate 100 days of output in one go.
Input Patch: [100, 102, 105, 110, 108, 112, 115, ..., 150]
Output Patch: [152, 155, 157, 160, ..., 200] (generated in one step)
During training, TimesFM randomly masks parts of the sequence, forcing the model to learn how to interpolate missing values and generalize beyond the seen context. This ensures that, at inference time, the model can make reliable long-range forecasts without needing extensive fine-tuning.
Consider a case where the input patch covers the past 32 time-steps, but the output patch is 128 time-steps long. The model learns to:
1. Attend to the most relevant historical data (e.g., past trends, periodic fluctuations).
2. Infer missing patterns using self-attention across long-range dependencies.
3. Generate four times the number of time-steps in a single pass, reducing computational cost (see the sketch below).
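A minimal sketch of the idea that one patch embedding can be decoded into a longer output patch. TimesFM maps the output embedding through a residual block, but a plain linear head keeps this sketch short; all sizes are illustrative.

import torch
import torch.nn as nn

input_patch_len, output_patch_len, embed_dim = 32, 128, 16   # illustrative sizes

# The output head maps one patch embedding to a whole 128-step forecast in a single pass.
output_head = nn.Linear(embed_dim, output_patch_len)

patch_embedding = torch.randn(1, embed_dim)   # embedding of the most recent 32-step input patch
forecast = output_head(patch_embedding)       # shape: (1, 128) -> 4x the input patch length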
The primary objective in training TimesFM is to minimize the error between its predicted future values and the actual observed values. Given that TimesFM focuses on point forecasting rather than probabilistic forecasting, the loss function primarily used is the Mean Squared Error (MSE):
TrainLoss = (1/N) Σⱼ MSE(ŷ_{pj+1 : pj+h}, y_{pj+1 : pj+h})

where:
● ŷ_{pj+1 : pj+h} is the predicted future sequence from the model,
● y_{pj+1 : pj+h} is the actual ground-truth sequence,
● N is the total number of patches in a batch.
This loss function ensures that the model penalizes larger errors more aggressively than smaller ones, making it particularly effective for minimizing large deviations in time-series predictions.
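A small numeric illustration of the loss on one short patch:

import numpy as np

def mse(y_pred, y_true):
    """Mean Squared Error: squares each deviation, so large errors dominate the loss."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.mean((y_pred - y_true) ** 2)

# Averaged over all N predicted patches in a batch, as in the training loss above.
print(mse([152, 155, 157], [150, 158, 157]))   # (4 + 9 + 0) / 3 ≈ 4.33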
In some cases, alternative loss functions such as Mean Absolute Error (MAE) or quantile-based losses can be used if probabilistic forecasting is desired. However, for a foundation model like TimesFM, MSE serves as a robust metric to ensure that the model learns an accurate representation of temporal patterns across diverse datasets.
TimesFM is pretrained on a large and diverse corpus of real-world time-series, including:
● Web search trends
● Wikipedia page visits
● Financial market data
● Retail sales records
● Weather and traffic patterns
Key training settings include:
● Batch size: ensures the model sees a variety of time-series in each step.
● Maximum context length: trained across multiple horizon lengths to adapt to different datasets.
● Dropout regularization: prevents overfitting by introducing randomness in attention mechanisms.
Given a new, unseen time-series, the inference process follows three key steps:
Recent research has shown that directly predicting the entire forecast horizon in one step can yield better results than autoregressive decoding on long-horizon benchmarks. However, this is not always practical in a foundation model setting, where the forecasting horizon is unknown before inference. Since the model must be general enough to handle varying horizon lengths at runtime, a pure one-shot decoding approach is not feasible.
To demonstrate the impact of this optimization, TimesFM was evaluated on the ETT dataset, a widely used benchmark for time-series forecasting. The model was tested on predicting 512 time-steps into the future, with varying output patch lengths ranging from 8 to 128. Results showed a monotonic decrease in the Mean Absolute Error (MAE) as the output patch length increased. This validates the idea that longer output patches lead to better performance by minimizing autoregressive decoding steps.
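The mechanics behind this are easy to quantify: the number of autoregressive decoding steps is simply the horizon divided by the output patch length.

import math

horizon = 512
for output_patch_len in (8, 16, 32, 64, 128):
    steps = math.ceil(horizon / output_patch_len)
    print(f"output patch {output_patch_len:>3} -> {steps:>2} autoregressive steps")
# Longer output patches mean fewer decoding steps (64 down to 4), and therefore
# fewer opportunities for autoregressive errors to accumulate over the 512-step horizon.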
Choosing the Optimal Input Patch Length
The length of the input patch plays a crucial role in determining how much historical context the model uses to generate forecasts. Increasing the input patch length generally improves performance, as the model can leverage more historical information. However, making the input patch too long introduces practical challenges.
If the input patch is too short, the model may not capture long-term dependencies effectively, leading to weaker generalization on long-horizon forecasting tasks. Conversely, if the input patch is too long, the training dynamics begin to resemble encoder-decoder architectures, which are computationally more expensive and not optimized for decoder-only training.
In an ablation comparing a model pretrained with synthetic data to one pretrained without it:
● On hourly ETT datasets, there was almost no difference in performance between the two models, as hourly granularity was well-represented in real datasets.
● However, on 15-minute ETT datasets, the model trained without synthetic data performed significantly worse than the one trained with synthetic data.
● Similarly, on the Monash dataset, which includes datasets with quarterly and yearly granularities, the model trained without synthetic data failed to generalize well, leading to degraded performance.
These design choices enable TimesFM to serve as a true foundation model for time-series forecasting, capable of performing well across diverse datasets without fine-tuning. Unlike traditional forecasting models that require extensive retraining, TimesFM can generalize in a zero-shot setting, making it a valuable tool for real-world applications in finance, healthcare, climate modeling, and supply chain optimization.