TimesFM
A Zero-Shot Foundation for Time-Series Modeling
Understanding Zero-Shot Learning in Decoder-Only LLMs
Before a transformer-based model can make sense of an input, the raw data (whether text or time-series) needs to be structured into discrete components. In NLP, this means converting words into tokens, which serve as the fundamental units of computation. Each token is mapped to a high-dimensional embedding vector, capturing both semantic meaning and syntactic relationships.
● The sentence "The stock market is volatile today." is tokenized into: ["The", "stock", "market", "is", "volatile", "today", "."]
● Mapped to a sequence of embeddings: [E1, E2, E3, E4, E5, E6, E7]
These embeddings flow through the transformer layers, undergoing complex transformations that help the model understand context and relationships between words.
The Transformer Processing Pipeline: Self-Attention and Predicting the Next Token
Finally, an output layer predicts the probability distribution over the vocabulary for the next token (i+1) based on all previously observed tokens. The model learns to generate text autoregressively, meaning each output depends only on past information.
● Context: ["The", "stock", "market", "is", "volatile"]
● The model generates "today" as the most probable next word.
This architecture allows LLMs to generate coherent text, translate languages, and even answer questions, all in a zero-shot setting, without task-specific training.
While transformers have proven their dominance in NLP, the question arises:
"Can a similar architecture be adapted for time-series forecasting?"
● LLMs (e.g., GPT): input text tokenization
● TimesFM: time-series patching (breaking the series into structured chunks)
● Day 1: $100
● Day 2: $105
● Day 3: $110
● Day 4: $108
● Day 5: ???
In the next sections, we will delve deeper into the specific architecture of TimesFM, discussing how it incorporates transformers for time-series forecasting and how it outperforms traditional models through self-attention-driven temporal modeling.
To understand how the input flows through the system, we will explore three key components: patching the time-series data, processing it through a residual block, and applying a padding mask to handle variable-length sequences.
For instance, consider a time-series representing daily stock prices over 15 days:
Original sequence:
100, 102, 105, 110, 108, 112, 115, 117, 119, 123, 125, 127, 130, 133, 135
With a patch length of 5, this 15-point sequence is divided into three non-overlapping patches of five values each. Instead of handling 15 individual data points, the model now processes only three patches. This method reduces computational complexity while ensuring that local patterns within each segment are preserved.
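A minimal sketch of this patching step in Python, assuming non-overlapping patches of length 5 as in the example above:

import numpy as np

series = np.array([100, 102, 105, 110, 108, 112, 115, 117, 119, 123,
                   125, 127, 130, 133, 135], dtype=float)
patch_len = 5

# Reshape the 15-point series into 3 non-overlapping patches of 5 values each.
patches = series.reshape(-1, patch_len)
print(patches)
# [[100. 102. 105. 110. 108.]
#  [112. 115. 117. 119. 123.]
#  [125. 127. 130. 133. 135.]]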
Once the time-series data is split into patches, each patch undergoes further processing to enhance its representation before being passed into the transformer layers. This is achieved using a residual block, which refines the patch embeddings while preserving the original structure.
A residual block is a small neural network module that consists of two main components: a feedforward transformation and a skip connection. The feedforward transformation applies linear projections and non-linear activations to extract complex features from the raw patch data, while the skip connection ensures that the original information is retained. This prevents the transformation from distorting the input representation too aggressively.
To illustrate this, consider the first patch from our stock price example:
Patch 1: [100, 102, 105, 110, 108]
This processed patch now carries both the original numerical structure and enriched features extracted from the transformation. The residual connection prevents loss of information while allowing the model to capture deeper relationships in the data.
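As a rough sketch of this idea (not the exact TimesFM implementation), the block below combines a feedforward transformation with a skip connection in PyTorch; the hidden size, output dimension, and SiLU activation are illustrative assumptions.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Feedforward transformation plus a skip connection that preserves the input."""
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)   # linear projection
        self.act = nn.SiLU()                             # non-linear activation (assumed)
        self.out = nn.Linear(hidden_dim, output_dim)     # project to the model dimension
        self.skip = nn.Linear(input_dim, output_dim)     # skip path keeps the original signal

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(self.act(self.hidden(x))) + self.skip(x)

# Embed the first patch [100, 102, 105, 110, 108] into an illustrative model dimension of 16.
patch = torch.tensor([[100., 102., 105., 110., 108.]])
block = ResidualBlock(input_dim=5, hidden_dim=32, output_dim=16)
embedding = block(patch)   # shape: (1, 16)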
A padding mask is a binary tensor that marks which parts of the input should be ignored. This is particularly useful when dealing with batches of time-series that have different lengths, as it ensures that the model does not allocate unnecessary attention to artificially padded values.
For example, consider two series of different lengths:
● Series A: [100, 102, 105, 110, 108, 112, 115] (7 time points)
● Series B: [50, 52, 53, 54] (4 time points)
After padding Series B to the patch length of 5:
Patch 1: [50, 52, 53, 54, 0]
Mask 1: [1, 1, 1, 1, 0]
In this mask, 1 indicates valid data, while 0 indicates padded regions that the transformer should ignore. Without this mechanism, the model might treat padded values as actual observations, leading to incorrect pattern recognition.
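A short sketch of how such a mask could be built for the padded patch of Series B:

import numpy as np

# Series B has only 4 real observations; the patch is padded to length 5 with a zero.
patch_b = np.array([50., 52., 53., 54., 0.])
valid_len = 4

# 1 marks valid data, 0 marks padded positions the transformer should ignore.
mask = np.zeros_like(patch_b)
mask[:valid_len] = 1
print(mask)   # [1. 1. 1. 1. 0.]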
Another crucial aspect of input processing in TimesFM is patch masking, which forces the model to generalize beyond simply memorizing training sequences. Instead of always feeding complete patches into the network, the model randomly masks certain sections, requiring it to infer missing values based on the available context.
For example, a masked version of the second patch might look like:
Patch 2 (masked): [112, ?, 117, ?, 123]
Here, the question marks represent masked values that the model must learn to predict without direct supervision. This approach mimics real-world scenarios where data may be incomplete or unavailable, improving the model's ability to make robust zero-shot predictions.
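A rough sketch of random masking during training; the 20% masking rate and element-wise strategy are illustrative simplifications, not TimesFM's exact masking scheme.

import numpy as np

rng = np.random.default_rng(0)
patches = np.array([[100., 102., 105., 110., 108.],
                    [112., 115., 117., 119., 123.],
                    [125., 127., 130., 133., 135.]])

# Randomly mask ~20% of positions (illustrative rate); masked values are hidden from the model.
mask = rng.random(patches.shape) < 0.2
masked_patches = np.where(mask, 0.0, patches)   # 0.0 stands in for the "?" placeholder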
At this stage, the raw time-series data has been structured into patches, enriched through residual blocks, and optimized for variable-length inputs using padding masks and patch masking. This prepares the data for the stacked transformer layers, where self-attention mechanisms will extract deeper relationships across time.
Once the input time-series has been segmented into patches, processed through residual blocks, and appropriately masked, it is ready to be passed through the stacked transformer layers. These layers form the computational core of TimesFM, allowing it to capture intricate dependencies across time.
In natural language processing, a transformer processes a sequence of words and learns
the contextual relationships between them. The key mechanism behind this is
self-attention, where the model dynamically determines which words contribute the most to
understanding a given token. TimesFM repurposes this mechanism for time-series data.
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

where:
● Q (Query) represents the current patch's embedding,
● K (Key) represents all past patches' embeddings,
● V (Value) holds the learned transformations of past patches.
This equation determines which past patches should influence the current forecast, and the attention scores dynamically adjust based on the sequence's structure. The softmax operation ensures that attention weights sum to 1, allowing the model to allocate importance proportionally.
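The following sketch implements scaled dot-product attention with a causal mask, so each patch attends only to itself and earlier patches as in decoder-only training; the tensor shapes are illustrative.

import torch
import torch.nn.functional as F

def causal_self_attention(Q, K, V):
    """Scaled dot-product attention over patches, restricted to past positions."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5          # similarity between patches
    causal = torch.tril(torch.ones(scores.shape[-2:], dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))    # block attention to future patches
    weights = F.softmax(scores, dim=-1)                    # attention weights sum to 1
    return weights @ V

# Three patch embeddings of dimension 16 (illustrative).
x = torch.randn(3, 16)
out = causal_self_attention(x, x, x)   # each row mixes only current and earlier patches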
A single attention head might not be sufficient to capture all relevant features in a time-series. Some dependencies might be short-term, while others could extend across long-term trends. To resolve this, multi-head self-attention is employed, where multiple independent attention mechanisms operate in parallel, each capturing different aspects of the data.
● One attention head might focus on the previous 24-hour cycle (daily pattern),
● Another head might detect weekly trends,
● A third head might capture anomalous spikes due to unexpected events.
By stacking multiple transformer layers, each layer refines the representations learned in the previous step. Lower layers might capture local variations, while deeper layers focus on broad seasonal structures. The final layer outputs a processed version of each patch, embedding all the learned dependencies.
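A compact sketch of stacked multi-head self-attention over patch embeddings using PyTorch's nn.MultiheadAttention; the number of heads, layers, and embedding size are illustrative assumptions, not TimesFM's published configuration.

import torch
import torch.nn as nn

embed_dim, num_heads, num_layers = 16, 4, 2   # illustrative sizes
patches = torch.randn(1, 3, embed_dim)        # (batch, num_patches, embed_dim)

# Causal mask: True above the diagonal blocks attention to future patches.
causal_mask = torch.triu(torch.ones(3, 3, dtype=torch.bool), diagonal=1)

layers = nn.ModuleList(
    [nn.MultiheadAttention(embed_dim, num_heads, batch_first=True) for _ in range(num_layers)]
)

x = patches
for attn in layers:
    refined, _ = attn(x, x, x, attn_mask=causal_mask)
    x = x + refined   # residual connection; each layer refines the previous representation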
Once the transformer layers have processed the input patches, the output layer maps the final transformed embeddings to actual forecasted values. The output layers of TimesFM are structured to achieve three crucial objectives:
1. Mapping the output token to an actual numerical prediction
2. Training in decoder-only mode, where each output token predicts the next segment of the time-series
3. Allowing the output patch length to be larger than the input patch length, which enables long-range forecasting in a single forward pass
ŷ_{pj+1 : pj+h} = OutputResidualBlock(o_j)

where:
● ŷ_{pj+1 : pj+h} is the predicted sequence following patch j,
● o_j is the output embedding corresponding to that patch.
During training, the model is given only the observed values (50, 55, 60) and must predict the next three missing values (65, 70, 75). Once these values are predicted, they are fed back into the model to predict the next segment (80, 85, 90). This auto-regressive structure mirrors how LLMs generate text, producing one token at a time.
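A schematic of this feedback loop; predict_next_patch is a hypothetical stand-in for a trained model's forward pass.

def autoregressive_forecast(predict_next_patch, context, horizon):
    """Roll the model forward: each predicted output patch is appended to the context."""
    forecast = []
    while len(forecast) < horizon:
        next_patch = predict_next_patch(context)     # one forward pass -> one output patch
        forecast.extend(next_patch)
        context = list(context) + list(next_patch)   # feed predictions back as new context
    return forecast[:horizon]

# With observed values [50, 55, 60], one pass might yield [65, 70, 75]; feeding those back
# produces the next segment [80, 85, 90], mirroring token-by-token generation in LLMs.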
Unlike text-based models, where each token corresponds to a single word, in TimesFM each output patch can be larger than the input patch. This means the model is trained to predict larger time spans instead of step-by-step values.
Predicting Larger Chunks Than Seen: The Key Difference Between LLMs and TimesFM
One fundamental difference between TimesFM and LLMs is that the input patch length does not have to match the output patch length. In language models, each token corresponds to a single word, and generation happens word-by-word. In contrast, TimesFM can predict a much longer horizon in a single forward pass, making it more efficient for time-series forecasting.
Suppose we have stock price data for the past 100 days. A traditional auto-regressive model would generate one step at a time, requiring 100 iterative steps to forecast the next 100 days. TimesFM, however, can take 50 days of input and generate 100 days of output in one go.
Input Patch: [100, 102, 105, 110, 108, 112, 115, ..., 150]
Output Patch: [152, 155, 157, 160, ..., 200] (generated in one step)
During training, TimesFM randomly masks parts of the sequence, forcing the model to learn how to interpolate missing values and generalize beyond the seen context. This ensures that, at inference time, the model can make reliable long-range forecasts without needing extensive fine-tuning.
Consider a case where the input patch covers the past 32 time-steps, but the output patch is 128 time-steps long. The model learns to:
1. Attend to the most relevant historical data (e.g., past trends, periodic fluctuations).
2. Infer missing patterns using self-attention across long-range dependencies.
3. Generate four times the number of time-steps in a single pass, reducing computational cost (see the sketch below).
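A minimal sketch of the idea that one patch embedding can be decoded into a longer output patch. TimesFM maps the output embedding through a residual block, but a plain linear head keeps this sketch short; all sizes are illustrative.

import torch
import torch.nn as nn

input_patch_len, output_patch_len, embed_dim = 32, 128, 16   # illustrative sizes

# The output head maps one patch embedding to a whole 128-step forecast in a single pass.
output_head = nn.Linear(embed_dim, output_patch_len)

patch_embedding = torch.randn(1, embed_dim)   # embedding of the most recent 32-step input patch
forecast = output_head(patch_embedding)       # shape: (1, 128) -> 4x the input patch length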
The primary objective in training TimesFM is to minimize the error between its predicted future values and the actual observed values. Given that TimesFM focuses on point forecasting rather than probabilistic forecasting, the loss function primarily used is the Mean Squared Error (MSE):
TrainLoss = (1/N) Σⱼ MSE(ŷ_{pj+1 : pj+h}, y_{pj+1 : pj+h})

where:
● ŷ_{pj+1 : pj+h} is the predicted future sequence from the model,
● y_{pj+1 : pj+h} is the actual ground-truth sequence,
● N is the total number of patches in a batch.
This loss function ensures that the model penalizes larger errors more aggressively than smaller ones, making it particularly effective for minimizing large deviations in time-series predictions.
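A small numeric illustration of the loss on one short patch:

import numpy as np

def mse(y_pred, y_true):
    """Mean Squared Error: squares each deviation, so large errors dominate the loss."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.mean((y_pred - y_true) ** 2)

# Averaged over all N predicted patches in a batch, as in the training loss above.
print(mse([152, 155, 157], [150, 158, 157]))   # (4 + 9 + 0) / 3 ≈ 4.33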
In some cases, alternative loss functions such as Mean Absolute Error (MAE) or quantile-based losses can be used if probabilistic forecasting is desired. However, for a foundation model like TimesFM, MSE serves as a robust metric to ensure that the model learns an accurate representation of temporal patterns across diverse datasets.
TimesFM is pretrained on a large and diverse corpus of real-world time-series, including:
● Web search trends
● Wikipedia page visits
● Financial market data
● Retail sales records
● Weather and traffic patterns
Key training settings include:
● Batch size: ensures the model sees a variety of time-series in each step.
● Maximum context length: trained across multiple horizon lengths to adapt to different datasets.
● Dropout regularization: prevents overfitting by introducing randomness in attention mechanisms.
Given a new, unseen time-series, the inference process follows three key steps:
Recent research has shown that directly predicting the entire forecast horizon in one step can yield better results than autoregressive decoding on long-horizon benchmarks. However, this is not always practical in a foundation model setting, where the forecasting horizon is unknown before inference. Since the model must be general enough to handle varying horizon lengths at runtime, a pure one-shot decoding approach is not feasible.
To demonstrate the impact of this optimization, TimesFM was evaluated on the ETT dataset, a widely used benchmark for time-series forecasting. The model was tested on predicting 512 time-steps into the future, with varying output patch lengths ranging from 8 to 128. Results showed a monotonic decrease in the Mean Absolute Error (MAE) as the output patch length increased. This validates the idea that longer output patches lead to better performance by minimizing autoregressive decoding steps.
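The mechanics behind this are easy to quantify: the number of autoregressive decoding steps is simply the horizon divided by the output patch length.

import math

horizon = 512
for output_patch_len in (8, 16, 32, 64, 128):
    steps = math.ceil(horizon / output_patch_len)
    print(f"output patch {output_patch_len:>3} -> {steps:>2} autoregressive steps")
# Longer output patches mean fewer decoding steps (64 down to 4), and therefore
# fewer opportunities for autoregressive errors to accumulate over the 512-step horizon.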
Choosing the Optimal Input Patch Length
The length of the input patch plays a crucial role in determining how much historical context the model uses to generate forecasts. Increasing the input patch length generally improves performance, as the model can leverage more historical information. However, making the input patch too long introduces practical challenges.
If the input patch is too short, the model may not capture long-term dependencies effectively, leading to weaker generalization on long-horizon forecasting tasks. Conversely, if the input patch is too long, the training dynamics begin to resemble encoder-decoder architectures, which are computationally more expensive and not optimized for decoder-only training.
In an ablation comparing a model pretrained with synthetic data to one pretrained without it:
● On hourly ETT datasets, there was almost no difference in performance between the two models, as hourly granularity was well-represented in real datasets.
● However, on 15-minute ETT datasets, the model trained without synthetic data performed significantly worse than the one trained with synthetic data.
● Similarly, on the Monash dataset, which includes datasets with quarterly and yearly granularities, the model trained without synthetic data failed to generalize well, leading to degraded performance.
These design choices enable TimesFM to serve as a true foundation model for time-series forecasting, capable of performing well across diverse datasets without fine-tuning. Unlike traditional forecasting models that require extensive retraining, TimesFM can generalize in a zero-shot setting, making it a valuable tool for real-world applications in finance, healthcare, climate modeling, and supply chain optimization.