Agentic Retrieval-Augmented Generation For Time Series Analysis
knowledge repositories tailored to each sub-agent's time series task, the prompt pools store both domain- and task-specific knowledge as key-value pairs. This facilitates easy reuse and sharing within and across datasets, promoting knowledge sharing and transfer and reducing the need to relearn or rediscover patterns from scratch. Each ‘key’ represents a specific pattern (seasonality, cyclicality, etc.), and the ‘value’ contains details about that pattern. When processing new input data, the sub-agent retrieves the most relevant prompts from the pool based on similarity. These prompts provide contextual knowledge about related historical patterns and trends, improving generalization to new scenarios. This knowledge-augmentation approach, by conditioning on past patterns, gives the sub-agent access to a broad spectrum of task-specific knowledge regardless of historical occurrence, enabling it to learn and adapt to diverse trends within complex data for improved predictions. Each sub-agent utilizes a pre-trained SLM such as Gemma[40] or Llama 3[1]. We fine-tune each SLM using instruction-tuning on task-specific datasets and optimize them for time series tasks such as forecasting, imputation, or other related tasks. Additionally, we fine-tune using DPO[8] through a dynamic masking technique to align the SLMs' task-specific outputs with preferred and non-preferred outcomes, providing adversarial feedback[47] through a binary classification task. The master agent for sub-agent orchestration utilizes the ‘ReAct’ prompting technique[45], encouraging the general-purpose SLM to think step-by-step and use external tools (sub-agents, each utilizing a fine-tuned SLM for specific time series tasks) to generate responses. The master agent can even chain sub-agents together to handle complex, multi-step time series analysis tasks, addressing more intricate challenges. However, in this work, the sub-agents operate in isolation, each handling only a single, specific task.

In summary, the master agent orchestrates sub-agents, selects the most appropriate sub-agent, and allocates the task to the specialized sub-agent. The sub-agent retrieves relevant information from a shared knowledge base of prompt pools and generates an output based on the retrieved information. The differentiable prompt pools for each sub-agent, acting as specialized dynamic knowledge repositories, provide the necessary historical context and understanding to effectively analyze new input data for their designated tasks. The master agent gathers responses from the chosen sub-agent and synthesizes them to produce a comprehensive answer for the end-user query. The hierarchical, multi-agent architecture for time series analysis offers key advantages. It enables modularity, flexibility, and accuracy by allowing specialized sub-agents to focus on specific tasks, be updated independently, and be dynamically allocated by the master agent to generate comprehensive results. Extensive empirical studies demonstrate that the Agentic-RAG framework achieves performance on par with, or even surpassing, state-of-the-art methods across multiple time series analysis tasks for both univariate and multivariate datasets. The multi-agent approach tackles the diverse and complex challenges of time series analysis, unlike a single, universal agent that attempts to be a jack-of-all-trades for all time series tasks.
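For illustration, the short Python sketch below shows one way the master agent's routing step could be wired up; the agent names, the keyword-based task classifier, and the SubAgent interface are hypothetical placeholders standing in for the ReAct-driven SLM, not the framework's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical sub-agent: wraps a fine-tuned SLM for one time series task
# plus a prompt pool used to retrieve related historical context.
@dataclass
class SubAgent:
    name: str
    run: Callable[[str], str]  # query -> task-specific answer

def make_sub_agent(task: str) -> SubAgent:
    # Placeholder for "retrieve prompts -> condition fine-tuned SLM -> answer".
    return SubAgent(name=task, run=lambda query, t=task: f"[{t}] answer for: {query}")

SUB_AGENTS: Dict[str, SubAgent] = {
    task: make_sub_agent(task)
    for task in ("forecasting", "imputation", "classification", "anomaly_detection")
}

def classify_task(query: str) -> str:
    # Stand-in for the ReAct-style reasoning step of the master agent,
    # which would normally be produced by a general-purpose SLM.
    keywords = {
        "forecast": "forecasting",
        "missing": "imputation",
        "impute": "imputation",
        "anomal": "anomaly_detection",
        "regime": "classification",
        "class": "classification",
    }
    q = query.lower()
    for key, task in keywords.items():
        if key in q:
            return task
    return "forecasting"  # default tool

def master_agent(query: str) -> str:
    task = classify_task(query)                 # 1) reason about which tool to use
    sub_answer = SUB_AGENTS[task].run(query)    # 2) act: call the specialized sub-agent
    return f"Task: {task}. {sub_answer}"        # 3) synthesize the final response

if __name__ == "__main__":
    print(master_agent("Are there anomalies in the last hour of sensor readings?"))
```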
Figure 1: The figure illustrates the proposed agentic RAG framework, designed to handle diverse time series analysis tasks. The framework employs a hierarchical, multi-agent architecture. A master agent receives end-user questions and routes them to appropriate specialized sub-agents based on the specific time series task (e.g., forecasting, imputation, classification, anomaly detection). The sub-agents utilize pre-trained SLMs fine-tuned on task-specific datasets using techniques like instruction tuning and direct preference optimization to capture spatio-temporal dependencies within and across the time series datasets. Each sub-agent maintains its own prompt pool of ‘key-value’ pairs, which stores relevant historical knowledge related to specific trends and patterns within its respective specialized domain. This allows the sub-agents to leverage related past experience for improved task-specific predictions on new, similar data, which is then relayed back to the user through the master agent.

2 PROBLEM FORMULATION
Consider a time series dataset characterized by N univariate time series, with sequential data collected over T timestamps, represented as a data matrix X ∈ R^{N×T}. Each row in this matrix represents a univariate time series, and each column corresponds to data collected at a specific timestamp. To refer to data from a specific time series or timestamp, we use subscripts and superscripts, respectively. For instance, X_i = X_{i,:} denotes the data from the i-th time series, and X^t = X_{:,t} denotes the data at timestamp t.

2.1 Forecasting
We utilize a sliding window[10, 46] of size τ to construct time series subsequences S^t = X^{t−τ+1:t} ∈ R^{N×τ}, which have been observed over the τ steps prior to the current time step t, to predict the future values for the next ν steps, S^{t+1} = X^{t+1:t+ν} ∈ R^{N×ν}.

2.2 Missing Data Imputation
We utilize a binary mask matrix M ∈ {0, 1}^{N×T}, where M_{i,t} = 0 indicates that the value X_{i,t} is missing, and M_{i,t} = 1 indicates that it is observed. The missing values follow random or block patterns[4, 26, 27] across the N univariate time series and T timestamps. We utilize the observed values X_obs = X ⊙ M to estimate the missing values X_miss = X ⊙ (1−M), where ⊙ denotes element-wise multiplication. We utilize a sliding window of size τ over the observed samples X_obs to construct subsequences S_obs^t = X_obs^{t−τ+1:t} ∈ R^{N×τ}, which have been observed over the τ steps prior to the current time step t. These observed samples are used to predict the missing values for the next ν steps, S_miss^{t+1} = X_miss^{t+1:t+ν} ∈ R^{N×ν}, by leveraging spatio-temporal dependencies within the data.
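As a concrete illustration of the formulation above, the following NumPy sketch builds the sliding-window subsequences and the masked observations used for imputation; the array shapes follow Section 2, while the function names and the synthetic data are illustrative only.

```python
import numpy as np

def sliding_windows(X: np.ndarray, tau: int, nu: int):
    """Build (history, target) pairs from X in R^{N x T}.

    Returns inputs of shape (num_windows, N, tau) covering X^{t-tau+1:t}
    and targets of shape (num_windows, N, nu) covering X^{t+1:t+nu}.
    """
    N, T = X.shape
    inputs, targets = [], []
    for t in range(tau - 1, T - nu):
        inputs.append(X[:, t - tau + 1 : t + 1])
        targets.append(X[:, t + 1 : t + 1 + nu])
    return np.stack(inputs), np.stack(targets)

rng = np.random.default_rng(0)
N, T, tau, nu = 4, 200, 12, 3
X = rng.normal(size=(N, T))

# Forecasting: predict the next nu steps from the previous tau steps.
S_hist, S_future = sliding_windows(X, tau, nu)

# Imputation: M[i, t] = 0 marks a missing value; only X_obs = X * M is available.
M = (rng.random(size=(N, T)) > 0.2).astype(float)  # roughly 20% point missing
X_obs = X * M
S_obs_hist, S_miss_future = sliding_windows(X_obs, tau, nu)

print(S_hist.shape, S_future.shape)  # (num_windows, N, tau), (num_windows, N, nu)
```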
2.3 Anomaly Detection
Assuming the time series dataset exhibits normal behavior during the initial T_train timestamps, any pattern deviating from the normal behavior in subsequent timestamps t > T_train is anomalous. Data observed after T_train is considered the test dataset. We use a sliding window to construct samples from previous time steps, S^t ∈ R^{N×τ}, to predict the future values of the multiple time series, S^{t+1} ∈ R^{N×ν}. The framework predictions are denoted by Ŝ^{t+1} ∈ R^{N×ν}. In the unsupervised anomaly detection task, it computes the robust normalized anomaly scores A_i^{t+1} for each variable i across the time steps in the training set T_train. This information regarding the variables helps in accurately localizing the anomalies within the test set.

A_i^{t+1} = S_i^{t+1} − Ŝ_i^{t+1}

We compute the simple moving average of the maximum anomaly score across the multiple variables at time point t + 1 over the validation set as

Th = max_{t ∈ T_val} (1/w_a) Σ_{t−(w_a+1)}^{t+1} A^{t+1},  where A^{t+1} = max_{i ∈ |N|} A_i^{t+1},   (1)

where w_a denotes the number of time points in the moving-average calculation and T_val denotes the time points in the validation set. We set the anomaly detection threshold Th as the moving-averaged maximum anomaly value for time t + 1, A^{t+1}, over the validation data. During inference, time points with an anomaly score above the threshold are flagged as anomalies.
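The following NumPy sketch mirrors the scoring and thresholding of Eq. (1); it uses the absolute prediction error as the per-variable anomaly score, which is an assumption on our part, and the helper names are illustrative.

```python
import numpy as np

def anomaly_scores(S_true: np.ndarray, S_pred: np.ndarray) -> np.ndarray:
    """Per-variable anomaly score A_i^{t+1}; here, the absolute prediction error."""
    return np.abs(S_true - S_pred)  # shape (T_val, N)

def detection_threshold(A: np.ndarray, w_a: int) -> float:
    """Th = max over validation time points of the w_a-step moving average
    of A^{t+1} = max_i A_i^{t+1} (Eq. 1)."""
    A_max = A.max(axis=1)                      # A^{t+1}, shape (T_val,)
    kernel = np.ones(w_a) / w_a
    moving_avg = np.convolve(A_max, kernel, mode="valid")
    return float(moving_avg.max())

rng = np.random.default_rng(1)
T_val, N, w_a = 500, 8, 10
S_true = rng.normal(size=(T_val, N))
S_pred = S_true + rng.normal(scale=0.1, size=(T_val, N))

A_val = anomaly_scores(S_true, S_pred)
Th = detection_threshold(A_val, w_a)

# Inference: flag test time points whose maximum score exceeds the threshold.
A_test = anomaly_scores(S_true, S_pred + rng.normal(scale=0.3, size=(T_val, N)))
flags = A_test.max(axis=1) > Th
print(Th, flags.sum(), "time points flagged")
```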
2.4 Classification
We perform unsupervised K-means clustering, identifying K optimal clusters or regimes and assigning cluster labels C ∈ R^T to each time point in the data matrix X ∈ R^{N×T}. Then, a sliding window approach is employed to predict the cluster labels for the next ν steps, S^{t+1} = X^{t+1:t+ν} ∈ R^{N×ν}, based on the observed sample S^t = X^{t−τ+1:t} ∈ R^{N×τ} over the previous τ time steps.
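A minimal sketch of this labeling step, assuming scikit-learn's KMeans; the choice of K is fixed here for brevity rather than selected by an elbow or silhouette criterion, and all names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
N, T, tau, nu, K = 4, 300, 12, 3, 3
X = rng.normal(size=(N, T))

# Assign a regime label to every time point from its cross-sensor snapshot X^t.
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0)
C = kmeans.fit_predict(X.T)  # C in {0, ..., K-1}^T, one label per timestamp

# Supervised samples: observe S^t = X^{t-tau+1:t}, predict labels for the next nu steps.
inputs, label_targets = [], []
for t in range(tau - 1, T - nu):
    inputs.append(X[:, t - tau + 1 : t + 1])
    label_targets.append(C[t + 1 : t + 1 + nu])
inputs, label_targets = np.stack(inputs), np.stack(label_targets)
print(inputs.shape, label_targets.shape)  # (num_windows, N, tau), (num_windows, nu)
```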
3 PROPOSED METHOD
The proposed framework offers a novel approach to time series analysis by leveraging a hierarchical, multi-agent architecture. It comprises a master agent that coordinates specialized sub-agents, each dedicated to a specific time series task such as forecasting, anomaly detection, or imputation. These sub-agents employ pre-trained language models and utilize prompt pools as internal knowledge bases, storing key-value pairs representing historical patterns and trends. By retrieving relevant prompts from these pools, the sub-agents can augment their predictions with contextual knowledge about related past patterns, enabling them to adapt to diverse trends within complex time series data. The framework's modular design, combined with the strengths of the individual sub-agents, allows for improved performance across various time series analysis tasks, surpassing the limitations of traditional fixed-window methods.

3.1 Dynamic Prompting Mechanism
Current time series methods typically utilize past data within a predefined window length to understand historical trends and predict task-specific outcomes. However, this approach may not be optimal because there is no universally ideal window length for all time series data. A larger window length might obscure short-range dependencies, while a smaller window length might fail to capture long-range dependencies. Existing methods fail to capture the full complexity of diverse trends and patterns within the complex data required for accurate time series modeling. Adjusting the window length in real-world scenarios can be challenging and computationally expensive, and achieving this goal remains an ambitious task given the current state of research in this field. To address the challenges of non-stationarity and distributional shifts in real-world data, we utilize a differentiable dynamic prompting mechanism[3]. This mechanism allows traditional time series methods to access related past knowledge by retrieving the same group of prompts from the prompt pool for effective adaptive learning on new, similar input data. The dynamic prompting approach utilizes a shared pool of prompts stored as key-value pairs. For time series applications, each prompt is represented by a key vector encoding the essential global characteristics associated with that prompt. The corresponding value matrix contains specific knowledge related to those trends or patterns, such as seasonality, cyclicality, irregularities, and other effects. The key vector acts as an identifier or query vector to retrieve relevant prompts from the pool based on similarity to the new input data, providing a form of conditioning or context about historical patterns to enhance the predictions. This allows the time series methods to effectively leverage encoded knowledge from past experiences, enhancing their predictions by recognizing and applying learned patterns from the shared prompt pool to the new input data. The pool of prompts P contains a set of M distinct key-value pairs as follows:

P = {(k_1, v_1), (k_2, v_2), . . . , (k_M, v_M)}

Here, M is the total number of prompts in the pool, k_m ∈ R^d is the key vector of the m-th prompt, and v_m ∈ R^{l×d} is the corresponding prompt value matrix with length l and dimensionality d. In order to retrieve the most relevant prompts for a given input time series S_i^t = X_i^{t−τ+1:t} ∈ R^τ, we first linearly project it into a d-dimensional embedding S_i^t ∈ R^d. We then utilize a score-matching function γ to measure the similarity between the input and each prompt key:

γ(S_i^t, k_m) = (S_i^t · k_m) / (|S_i^t| |k_m|)

where γ computes the cosine similarity between the input embedding S_i^t and the prompt key k_m. The top-K prompts with the highest similarity scores are selected, where 1 ≤ K ≤ M. Let J = {j_1, j_2, . . . , j_K} be the set of indices corresponding to the top-K most relevant prompts retrieved from the pool P for the given input time series S_i^t. The selected prompts, along with the original input, are concatenated to form the input embedding s_i^t as follows:

s_i^t = [v_{j_1}; . . . ; v_{j_K}; S_i^t]

where s_i^t ∈ R^{(Kl+1)×d}. We then linearly project s_i^t to a d-dimensional representation as follows:

s_i^t = W s_i^t

where W ∈ R^{d×(Kl+1)d} is a learnable weight matrix. In summary, this mechanism aims to improve task-specific time series modeling performance by allowing the framework to recognize and apply learned patterns across non-stationary datasets with distributional shifts via the shared prompt-representation pool.
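A minimal PyTorch sketch of the retrieval step described above, assuming a learnable key matrix and value tensor for the pool; the module name and hyperparameter values are illustrative, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptPool(nn.Module):
    """Shared pool of M (key, value) prompts with top-K cosine-similarity retrieval."""

    def __init__(self, num_prompts: int, prompt_len: int, dim: int, tau: int, top_k: int):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_prompts, dim))                # k_m in R^d
        self.values = nn.Parameter(torch.randn(num_prompts, prompt_len, dim))  # v_m in R^{l x d}
        self.embed = nn.Linear(tau, dim)                                       # S_i^t in R^tau -> R^d
        self.project = nn.Linear((top_k * prompt_len + 1) * dim, dim)          # W in R^{d x (Kl+1)d}
        self.top_k = top_k

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: (batch, tau) raw subsequences for one variable.
        emb = self.embed(s)                                                     # (batch, d)
        sim = F.cosine_similarity(emb.unsqueeze(1), self.keys.unsqueeze(0), dim=-1)  # (batch, M)
        _, idx = sim.topk(self.top_k, dim=-1)                                   # indices J, (batch, K)
        retrieved = self.values[idx]                                            # (batch, K, l, d)
        concat = torch.cat([retrieved.flatten(1, 2), emb.unsqueeze(1)], dim=1)  # (batch, Kl+1, d)
        return self.project(concat.flatten(1))                                  # (batch, d)

pool = PromptPool(num_prompts=16, prompt_len=4, dim=32, tau=12, top_k=3)
out = pool(torch.randn(8, 12))
print(out.shape)  # torch.Size([8, 32])
```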
3.2 Fine-Tuning/Preference Optimization of SLMs
Current pretrained SLMs, such as Google's Gemma and Meta's Llama-3 models, are designed with a context length of 8K tokens. However, they struggle to process long input sequences that exceed their pretraining context window, because the limited context length seen during pretraining restricts their effectiveness during inference on longer inputs. SLMs with an improved context length can better capture long-term spatio-temporal dependencies and complex patterns that unfold over extended periods, which is essential for accurate predictions and for understanding seasonal or cyclic trends. We build upon recent work [19] to improve how SLMs handle long sequences without fine-tuning. A two-tiered attention mechanism (grouped and neighbor attention) allows SLMs to process unseen long-range dependencies, enabling them to naturally handle extended inputs and maintain performance. It outperforms fine-tuning methods on multiple NLP benchmarks, demonstrating a significant step forward for SLMs in managing long text sequences. Nevertheless, fine-tuning general-purpose SLMs on task-specific data and objectives can still provide significant performance gains and allows for customization and adaptation to the unique challenges and requirements of different time series analysis tasks. Instruction-tuning of SLMs captures complex task-specific spatio-temporal dependencies and improves prediction accuracy. We perform instruction-tuning of SLMs with an improved context length of 32K tokens [19], using parameter-efficient fine-tuning (PEFT) techniques on their associated tasks (e.g., forecasting, imputation) with the corresponding time series datasets. This approach can significantly enhance the effectiveness of SLMs in processing extensive time series data. We leverage Direct Preference Optimization (DPO; [32]), which involves randomly masking 50% of the data and performing a binary classification task to predict the corresponding correct task-specific outcomes. This is done to steer the predictions of the SLMs toward more reliable outcomes in the specific context of time series analysis, favoring preferred responses over dispreferred responses.

4 EXPERIMENTS
Datasets: We evaluate the proposed Agentic-RAG framework on four tasks: forecasting, classification, anomaly detection, and imputation. To comprehensively evaluate the framework's performance against several baselines, we conducted experiments using both univariate and multivariate benchmark datasets across multiple time series tasks. The variants include Agentic-RAG with SelfExtend-Gemma-2B-instruct, Gemma-7B-instruct, and Llama 3-8B-instruct. We utilized several real-world traffic-related datasets (PeMSD3, PeMSD4, PeMSD7, PeMSD7(M), PeMSD8) obtained from the Caltrans Performance Measurement System (PeMS)[5] for forecasting, classification, and imputation. To ensure consistency with prior research[7], these datasets are preprocessed by aggregating 30-second data points into 5-minute averages. Additionally, publicly available traffic prediction datasets (METR-LA, PEMS-BAY)[22] are utilized, with data aggregated into 5-minute intervals, resulting in 288 observations per day. Table 1 provides comprehensive details regarding the spatio-temporal multivariate datasets. For anomaly detection, we evaluate the proposed Agentic-RAG framework on publicly available multivariate datasets, conducting a comprehensive benchmark comparison against baseline methods. Table 2 provides an overview of the datasets used in this study. SWaT and WADI¹ are real-world datasets on water treatment facilities and distribution networks, respectively. SMAP and MSL are expert-annotated open-source datasets of telemetry data sourced from NASA[18]. The Tennessee Eastman Process (TEP)² dataset is a simulated industrial benchmark designed for process monitoring and control, comprising 20 distinct fault types. The HAI³ dataset comprises time series data from an industrial testbed for detecting adversarial attacks on industrial control systems, involving steam-turbine power generation and pumped-storage hydropower generation processes, with 38 different attack scenarios. In addition, we discuss the univariate datasets for forecasting and imputation in the technical appendix.

¹ https://itrust.sutd.edu.sg/itrust-labs/datasets/
² https://dataverse.harvard.edu/dataverse/harvard
³ https://github.com/icsdataset/hai

Table 1: Summary of the spatio-temporal datasets.
Dataset     Sensors  Timesteps  Time-Range          Data Split  Granularity
PeMSD3      358      26,208     09/2018 - 11/2018   6/2/2       5 mins
PeMSD4      307      16,992     01/2018 - 02/2018   6/2/2       5 mins
PeMSD7      883      28,224     05/2017 - 08/2017   6/2/2       5 mins
PeMSD8      170      17,856     07/2016 - 08/2016   6/2/2       5 mins
PeMSD7(M)   228      12,672     05/2012 - 06/2012   6/2/2       5 mins
METR-LA     207      34,272     03/2012 - 06/2012   7/1/2       5 mins
PEMS-BAY    325      52,116     01/2017 - 05/2017   7/1/2       5 mins

Table 2: Statistical summary of benchmark datasets. τ is the length of subsequences or the historical window length.
Dataset  SWaT  WADI  SMAP  MSL  TEP  HAI
Sensors  51    123   25    55   52   59
τ        25    25    50    55   35   30

Evaluation Metrics: For forecasting and imputation tasks, the performance of the proposed framework is evaluated using MAE, RMSE, and MAPE metrics on the original scale of the time series data. For classification tasks, we use accuracy. For anomaly detection, we utilize the standard evaluation metrics of precision (P, in %), recall (R, in %), and F1-score (F1, in %). We utilize a multi-metric approach for a fair and rigorous comparison with baseline models. To do this, we compute the confusion matrix: true positives (TP) for correctly detected anomalies, false negatives (FN) for undetected anomalies, true negatives (TN) for correctly identified normal points, and false positives (FP) for normal points mistakenly identified as anomalies. Precision (TP/(TP + FP)) represents the proportion of correctly detected anomalies among all identified anomalies, while recall (TP/(TP + FN)) represents the proportion of all true anomalies that were correctly detected. The F1-score is calculated as the harmonic mean of precision and recall. The threshold for identifying anomalies is set to the highest anomaly score (refer to Section 2.3) from the validation dataset. For the SWaT and WADI datasets, which contain contiguous anomaly segments, we adopt the point adjustment strategy[36, 51] to flag the entire subsequence as an anomaly if the model predicts one. On the Tennessee Eastman dataset, we utilize the Fault Detection Rate (FDR, in %), defined as the ratio of the number of faults detected to the total number of faults that occur, to evaluate the effectiveness of our framework.
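To make the anomaly-detection evaluation concrete, the sketch below computes precision, recall, and F1 with the point adjustment strategy described above; it is our own reading of that strategy [36, 51], not code from the paper.

```python
import numpy as np

def point_adjust(pred: np.ndarray, label: np.ndarray) -> np.ndarray:
    """If any point inside a contiguous anomaly segment is detected,
    mark the whole segment as detected (point adjustment)."""
    adjusted = pred.copy()
    in_segment, start = False, 0
    for t in range(len(label)):
        if label[t] == 1 and not in_segment:
            in_segment, start = True, t
        if in_segment and (t == len(label) - 1 or label[t + 1] == 0):
            if pred[start : t + 1].any():
                adjusted[start : t + 1] = 1
            in_segment = False
    return adjusted

def prf1(pred: np.ndarray, label: np.ndarray):
    tp = int(((pred == 1) & (label == 1)).sum())
    fp = int(((pred == 1) & (label == 0)).sum())
    fn = int(((pred == 0) & (label == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

label = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1, 0])
pred  = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
print(prf1(point_adjust(pred, label), label))  # segment [2:5] fully credited
```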
Experimental Settings: To reduce the memory footprint and computational complexity, we segment the time series datasets using a sliding-window technique with a predefined historical window size to obtain time series subsequences (smaller, overlapping sequences of a fixed length). We performed instruction-tuning (fine-tuning) of the small-scale language models, namely the SelfExtend-Instruct Llama 3-8B, Gemma-2B, and Gemma-7B models, using PEFT techniques[44] such as QLoRA[12], on their specific associated time series tasks with the corresponding datasets. We set the following hyperparameters: a batch size of 16, a sequence length of 32K, a learning rate of 1e-5, training for 15 epochs, 500 warmup steps, a weight decay of 0.01, and gradient accumulation of 2 steps. We used the AdamW optimizer[25] and a linear scheduler to adjust the learning rate during training. We utilized 4-bit quantization for QLoRA.
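A minimal sketch of this instruction-tuning setup, assuming the Hugging Face transformers and peft libraries; the model identifier and adapter targets are placeholders, while the numeric hyperparameters follow the values stated in the Experimental Settings.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

model_name = "google/gemma-2b-it"  # placeholder; any of the SLM variants could be used

# 4-bit quantization for QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)

# Low-rank adapters (r=16, alpha=32, dropout=0.05, as stated in the Experimental Settings).
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Training hyperparameters stated in the Experimental Settings.
training_args = TrainingArguments(
    output_dir="agentic-rag-sft",
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    num_train_epochs=15,
    warmup_steps=500,
    weight_decay=0.01,
    gradient_accumulation_steps=2,
    lr_scheduler_type="linear",
    optim="adamw_torch",
)
# training_args and the PEFT-wrapped model would then be passed to a Trainer
# together with the instruction-formatted time series dataset.
```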
Table 5: Experimental results on the anomaly detection benchmark datasets in terms of precision, recall, and F1-score
Methods          SWaT                WADI                SMAP                MSL                 HAI
                 P(%)  R(%)  F1(%)   P(%)  R(%)  F1(%)   P(%)  R(%)  F1(%)   P(%)  R(%)  F1(%)   P(%)  R(%)  F1(%)
GAN-Li 81.03 84.97 77.32 76.25 80.33 77.95 67.10 87.06 75.19 71.02 87.06 78.23 19.83 18.36 17.45
LSTM-NDT 79.12 75.08 78.75 81.25 78.64 75.18 89.65 88.46 89.05 59.44 53.74 56.40 22.46 23.45 20.32
MTAD-GAT 82.01 76.84 72.47 82.58 84.94 80.25 89.06 91.23 90.41 87.54 94.40 90.84 24.75 21.78 20.14
MAD-GAN 98.97 63.74 77.0 41.44 33.92 37.0 80.49 82.14 81.31 85.17 89.91 87.47 25.27 23.34 21.87
GDN 99.35 68.12 81.0 97.50 40.19 57.0 86.62 84.27 83.24 89.92 87.24 86.84 43.41 46.27 44.59
GTA 74.91 96.41 84.0 74.56 90.50 82.0 89.11 91.76 90.41 91.04 91.17 91.11 44.91 41.63 40.29
LOF 72.15 65.43 68.62 57.02 61.17 53.46 58.93 56.33 57.60 47.72 85.25 61.18 31.27 29.93 26.48
Deep-SVDD 80.42 84.45 82.39 74.18 70.82 73.43 89.93 56.02 69.04 91.92 76.63 83.58 34.81 31.26 30.94
DAGMM 89.92 57.84 70.4 54.44 26.99 36.0 86.45 56.73 68.51 89.60 63.93 74.62 35.56 37.12 33.77
MMPCACD 82.52 68.29 74.73 74.29 75.01 71.48 88.61 75.84 81.73 81.42 61.31 69.95 31.58 29.46 27.33
VAR 81.59 60.29 69.34 75.59 69.36 66.21 81.38 53.88 64.83 74.68 81.42 77.90 34.42 36.28 31.97
LSTM 86.15 83.27 84.69 68.73 62.47 65.74 89.41 78.13 83.39 85.45 82.50 83.95 35.61 32.84 31.92
CL-MPPCA 76.78 81.50 79.07 69.72 65.23 67.32 86.13 63.16 72.88 73.71 88.54 80.44 33.82 31.74 30.05
ITAD 63.13 52.08 57.08 71.95 69.39 65.76 82.42 66.89 73.85 69.44 84.09 76.07 36.72 33.42 32.47
LSTM-VAE 76.00 89.50 82.20 87.79 14.45 25.0 92.20 67.75 78.10 85.49 79.94 82.62 38.25 37.94 35.04
BeatGAN 64.01 87.46 73.92 74.46 70.71 76.52 92.38 55.85 69.61 89.75 85.42 87.53 39.41 38.03 35.47
OmniAnomaly 81.42 84.30 82.83 78.18 80.13 77.24 92.49 81.99 86.92 89.02 86.37 87.67 46.29 43.75 42.73
InterFusion 80.59 85.58 83.01 81.78 84.37 80.21 89.77 88.52 89.14 81.28 92.70 86.62 45.72 43.15 42.55
THOC 83.94 86.36 85.13 84.24 81.32 80.09 92.06 89.34 90.68 88.45 90.97 89.69 43.72 45.82 43.67
GRELEN 95.60 83.50 89.10 77.30 61.30 68.20 94.45 98.16 97.29 94.36 94.04 91.58 47.31 43.12 40.58
Agentic-RAG W/Gemma-2B 99.35 98.00 92.45 98.50 91.85 89.95 98.10 98.85 98.90 97.95 97.25 96.90 58.10 56.00 53.10
Agentic-RAG W/Gemma-7B 99.42 98.08 92.53 98.58 91.93 90.03 98.18 98.93 98.98 98.03 97.33 96.98 58.18 56.08 53.18
Agentic-RAG W/Llama-8B 99.47 98.15 92.59 98.63 91.97 90.08 98.24 98.97 99.04 98.11 97.37 97.04 58.27 56.13 53.24
Best performance in bold; second-best underlined (excluding the Agentic-RAG framework variants).
Table 6: Experimental results on the simulated Tennessee Eastman dataset in terms of fault detection rate (FDR, %)
Base Model 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Transformer 99.64 98.45 5.00 99.96 28.86 100 100 96.43 5.19 17.48 77.51 98.20 94.01 99.97 5.39 13.43 91.53 93.76 25.13 48.05
TCN 99.61 97.93 5.12 100 26.46 100 100 94.68 5.19 35.57 80.51 96.63 93.48 99.97 5.36 21.10 96.14 93.90 23.39 47.92
FNet 99.67 98.64 4.86 99.18 25.82 100 100 96.76 18.87 18.87 76.08 98.11 94.07 99.96 5.48 13.74 91.05 93.70 24.43 45.59
GTA 98.12 99.35 5.88 98.04 55.82 100 100 97.34 20.18 34.33 79.81 98.72 96.03 98.21 7.64 16.69 92.25 94.78 26.57 47.31
GDN 99.81 99.27 6.72 99.56 41.07 100 100 95.04 16.46 41.22 79.57 99.64 95.71 97.58 7.83 15.64 92.79 95.27 27.17 48.81
MTAD-GAT 99.78 98.91 8.92 99.81 39.33 100 100 98.57 20.37 43.93 82.47 99.51 96.84 99.74 10.13 16.98 94.47 94.60 30.79 58.90
GRELEN 99.67 98.64 10.86 99.18 51.82 100 100 96.76 18.87 48.87 76.08 98.11 94.07 99.96 5.48 13.74 91.05 93.70 24.43 62.59
Agentic-RAG W/Gemma-2B 99.60 99.75 16.10 99.85 75.20 99.85 99.85 99.30 28.90 68.00 87.00 99.30 98.50 99.60 13.80 29.20 99.70 98.05 41.10 79.20
Agentic-RAG W/Gemma-7B 99.66 99.82 16.18 99.90 75.28 99.90 99.90 99.40 29.00 68.12 87.10 99.35 98.58 99.68 13.88 29.30 99.78 98.13 41.18 79.28
Agentic-RAG W/Llama-8B 99.72 99.89 16.23 100 75.38 100 100 99.47 29.04 68.16 87.15 99.46 98.64 99.75 13.96 29.37 99.83 98.21 41.23 79.35
Best performance in bold; second-best underlined (excluding the Agentic-RAG framework variants).
The QLoRA hyperparameters include a low-rank r of 16, an α of 32, and a dropout of 0.05 to ensure efficient parameter updates. We performed preference tuning on the SLMs using Direct Preference Optimization (DPO[32]) along with QLoRA, minimizing the binary cross-entropy (BCE) loss with the following hyperparameters: a learning rate of 5.0e-7 with a cosine scheduler and gradient accumulation of 2 steps. β was set to 0.2 to better align the SLMs with the desired preferences. We conducted training for 3 epochs using the AdamW optimizer, with a batch size of 8 for both the training and evaluation phases. These hyperparameters were chosen to balance the trade-off between the SLMs' performance on the specific time series task and computational resources. Optimal hyperparameter values are highly task-specific and depend on the dataset and language model architecture; extensive experimentation is crucial to find the best configurations. We discuss the hyperparameter optimization results in the appendix. To ensure efficient and consistent framework training, we preprocess the time series data by standardizing each variable (zero mean, unit variance) and calculate the evaluation metrics on the original scale. We leverage NVIDIA GPUs and PyTorch for accelerated training, enabling the use of small-scale models and datasets. For robust evaluation, we conduct multiple independent runs and report ensemble averages.
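For reference, the sketch below spells out the DPO objective [32] with the β = 0.2 used here; the log-probabilities are placeholders that would come from the policy and reference SLMs scoring the preferred and dispreferred task outputs produced by the dynamic-masking scheme.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logp: torch.Tensor,
    policy_rejected_logp: torch.Tensor,
    ref_chosen_logp: torch.Tensor,
    ref_rejected_logp: torch.Tensor,
    beta: float = 0.2,
) -> torch.Tensor:
    """Direct Preference Optimization loss: a binary (logistic) classification of
    preferred vs. dispreferred responses in terms of implicit reward margins."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()

# Toy example with sequence log-likelihoods for a batch of preference pairs.
policy_chosen = torch.tensor([-12.3, -10.1])
policy_rejected = torch.tensor([-15.8, -14.2])
ref_chosen = torch.tensor([-13.0, -10.9])
ref_rejected = torch.tensor([-14.9, -13.5])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```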
5 RESULTS
Tables 3-4 present a performance comparison of the Agentic-RAG framework variants with baseline methods on seven benchmark datasets (PeMSD3, PeMSD4, PeMSD7, PeMSD7(M), PeMSD8, METR-LA, and PEMS-BAY) on the forecasting task. We report experimental results from a previous study [7] for a fair and rigorous comparison. Tables 5-6 show the performance of the Agentic-RAG framework variants on time series anomaly detection on benchmark datasets. We present experimental results of baseline methods from earlier studies [6, 11, 13, 43]. Our proposed framework outperforms baseline methods across the benchmark datasets, showing significant improvements on the forecasting and anomaly detection tasks. We present experimental results on the missing data imputation and classification tasks in the appendix. Experimental results on univariate datasets across all time series tasks are also discussed in the appendix.

6 CONCLUSION
In this work, we propose an Agentic RAG framework to address the challenges of distribution shifts and fixed-length subsequences in time series analysis. The framework overcomes these challenges by leveraging a hierarchical, multi-agent architecture with specialized sub-agents for various time series tasks. Each sub-agent utilizes a prompt pool as its internal knowledge base to store historical patterns and trends. The sub-agent retrieves relevant prompts and utilizes the corresponding knowledge to improve predictions on new, unseen data. This modular design with task-specific sub-agents and knowledge augmentation outperforms traditional methods in handling complex time series analysis tasks.
REFERENCES
[1] AI@Meta. 2024. Llama 3 Model Card. https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md
[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[3] Defu Cao, Furong Jia, Sercan O Arik, Tomas Pfister, Yixiang Zheng, Wen Ye, and Yan Liu. 2024. TEMPO: Prompt-based Generative Pre-trained Transformer for Time Series Forecasting. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=YH5w12OUuU
[4] Wei Cao, Dong Wang, Jian Li, Hao Zhou, Lei Li, and Yitan Li. 2018. BRITS: Bidirectional recurrent imputation for time series. Advances in Neural Information Processing Systems 31 (2018).
[5] Chao Chen, Karl Petty, Alexander Skabardonis, Pravin Varaiya, and Zhanfeng Jia. 2001. Freeway performance measurement system: mining loop detector data. Transportation Research Record 1748, 1 (2001), 96–102.
[6] Zekai Chen, Dingshuo Chen, Xiao Zhang, Zixuan Yuan, and Xiuzhen Cheng. 2021. Learning graph structures with transformer for multivariate time series anomaly detection in IoT. IEEE Internet of Things Journal (2021).
[7] Jeongwhan Choi, Hwangyong Choi, Jeehyun Hwang, and Noseong Park. 2022. Graph neural controlled differential equations for traffic forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 6367–6374.
[8] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems 30 (2017).
[9] Andrea Cini, Ivan Marisca, and Cesare Alippi. 2021. Multivariate Time Series Imputation by Graph Neural Networks. arXiv e-prints (2021), arXiv–2108.
[10] Andrea Cini, Ivan Marisca, Daniele Zambon, and Cesare Alippi. 2024. Taming local effects in graph-based spatiotemporal forecasting. Advances in Neural Information Processing Systems 36 (2024).
[11] Ailin Deng and Bryan Hooi. 2021. Graph neural network-based anomaly detection in multivariate time series. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 4027–4035.
[12] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2024. QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems 36 (2024).
[13] Yiwei Fu and Feng Xue. 2022. MAD: Self-Supervised Masked Anomaly Detection Task for Multivariate Time Series. arXiv preprint arXiv:2205.02100 (2022).
[14] Nate Gruver, Marc Finzi, Shikai Qiu, and Andrew G Wilson. 2024. Large language models are zero-shot time series forecasters. Advances in Neural Information Processing Systems 36 (2024).
[15] Han Guo, Philip Greengard, Eric P Xing, and Yoon Kim. 2023. LQ-LoRA: Low-rank plus quantized matrix decomposition for efficient language model finetuning. arXiv preprint arXiv:2311.12023 (2023).
[16] Zeyu Han, Chao Gao, Jinyang Liu, Sai Qian Zhang, et al. 2024. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608 (2024).
[17] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
[18] Kyle Hundman, Valentino Constantinou, Christopher Laporte, Ian Colwell, and Tom Soderstrom. 2018. Detecting spacecraft anomalies using LSTMs and nonparametric dynamic thresholding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 387–395.
[19] Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Zirui Liu, Chia-Yuan Chang, Huiyuan Chen, and Xia Hu. 2024. LLM Maybe LongLM: Self-extend LLM context window without tuning. arXiv preprint arXiv:2401.01325 (2024).
[20] Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, et al. 2023. Time-LLM: Time series forecasting by reprogramming large language models. arXiv preprint arXiv:2310.01728 (2023).
[21] Michael Leonard. 2001. Promotional analysis and forecasting for demand planning: a practical time series approach. With exhibits 1 (2001).
[22] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2018. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In ICLR.
[23] Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, et al. 2023. RA-DIT: Retrieval-augmented dual instruction tuning. arXiv preprint arXiv:2310.01352 (2023).
[24] Hengbo Liu, Ziqing Ma, Linxiao Yang, Tian Zhou, Rui Xia, Yi Wang, Qingsong Wen, and Liang Sun. 2023. SADI: A Self-Adaptive Decomposed Interpretable Framework for Electric Load Forecasting Under Extreme Events. In IEEE International Conference on Acoustics, Speech and Signal Processing.
[25] Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
[26] Ivan Marisca, Cesare Alippi, and Filippo Maria Bianchi. 2024. Graph-based Forecasting with Missing Data through Spatiotemporal Downsampling. arXiv preprint arXiv:2402.10634 (2024).
[27] Ivan Marisca, Andrea Cini, and Cesare Alippi. 2022. Learning to reconstruct missing data from spatiotemporal graphs with sparse observations. Advances in Neural Information Processing Systems 35 (2022), 32069–32082.
[28] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. 2023. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=Jbdc0vTOcol
[29] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[30] Boris N Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. 2020. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In International Conference on Learning Representations.
[31] Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, et al. 2022. FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. arXiv preprint arXiv:2202.11214 (2022).
[32] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36 (2024).
[33] Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics 11 (2023), 1316–1331.
[34] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024).
[35] Andreas Roth and Thomas Liebig. 2022. Forecasting Unobserved Node States with spatio-temporal Graph Neural Networks. arXiv preprint arXiv:2211.11596 (2022).
[36] Lifeng Shen, Zhuocong Li, and James Kwok. 2020. Timeseries anomaly detection using temporal hierarchical one-class network. Advances in Neural Information Processing Systems 33 (2020), 13016–13026.
[37] Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. REPLUG: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652 (2023).
[38] Shamane Siriwardhana, Rivindu Weerasekera, Elliott Wen, Tharindu Kaluarachchi, Rajib Rana, and Suranga Nanayakkara. 2023. Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering. Transactions of the Association for Computational Linguistics 11 (2023), 1–17.
[39] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023).
[40] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295 (2024).
[41] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
[42] Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. 2023. TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=ju_Uqw384Oq
[43] Jiehui Xu, Haixu Wu, Jianmin Wang, and Mingsheng Long. 2021. Anomaly transformer: Time series anomaly detection with association discrepancy. arXiv preprint arXiv:2110.02642 (2021).
[44] Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, and Fu Lee Wang. 2023. Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment. arXiv preprint arXiv:2312.12148 (2023).
[45] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629 (2022).
[46] Kun Yi, Qi Zhang, Wei Fan, Hui He, Liang Hu, Pengyang Wang, Ning An, Longbing Cao, and Zhendong Niu. 2024. FourierGNN: Rethinking multivariate time series forecasting from a pure graph perspective. Advances in Neural Information Processing Systems 36 (2024).
[47] Jinsung Yoon, Daniel Jarrett, and Mihaela Van der Schaar. 2019. Time-series generative adversarial networks. Advances in Neural Information Processing Systems 32 (2019).
[48] Tianping Zhang, Yizhuo Zhang, Wei Cao, Jiang Bian, Xiaohan Yi, Shun Zheng, and Jian Li. 2022. Less is more: Fast multivariate time series forecasting with light sampling-oriented MLP structures. arXiv preprint arXiv:2207.01186 (2022).
[49] Weiqi Zhang, Chen Zhang, and Fugee Tsung. 2022. GRELEN: Multivariate Time Series Anomaly Detection from the Perspective of Graph Relational Learning. In IJCAI. 2390–2397.
[50] Yunhao Zhang and Junchi Yan. 2022. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In The Eleventh International Conference on Learning Representations.
[51] Hang Zhao, Yujing Wang, Juanyong Duan, Congrui Huang, Defu Cao, Yunhai Tong, Bixiong Xu, Jing Bai, Jie Tong, and Qi Zhang. 2020. Multivariate time-series anomaly detection via graph attention network. In 2020 IEEE International Conference on Data Mining (ICDM). IEEE, 841–850.
[52] Helen Zhou, Sercan O Arik, and Jingtao Wang. 2023. Business Metric-Aware Forecasting for Inventory Management. arXiv preprint arXiv:2308.13118 (2023).
[53] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 2021. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 11106–11115.
[54] Qihang Zhou, Shibo He, Haoyu Liu, Jiming Chen, and Wenchao Meng. 2024. Label-free multivariate time series anomaly detection. IEEE Transactions on Knowledge and Data Engineering (2024).
[55] Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. 2022. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proc. 39th International Conference on Machine Learning (ICML 2022), Baltimore, Maryland.
[56] Tian Zhou, Peisong Niu, Liang Sun, Rong Jin, et al. 2024. One Fits All: Power general time series analysis by pretrained LM. Advances in Neural Information Processing Systems 36 (2024).
[57] Tian Zhou, Peisong Niu, Xue Wang, Liang Sun, and Rong Jin. 2023. One Fits All: Power General Time Series Analysis by Pretrained LM. In Thirty-seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=gMS6FVZvmF
A MULTIVARIATE SPATIO-TEMPORAL DATASETS
A.1 Missing Data Imputation
Time series imputation is a critical step in time series analysis. It addresses a common issue in this field: missing values within datasets. These missing values can arise from sensor failures, data transmission errors, or incomplete records. By imputing these gaps, time series imputation ensures the quality and reliability of subsequent analyses. The Agentic-RAG framework achieves this by handling seasonality and trends and by capturing the inherent spatio-temporal dependencies within the data. Ultimately, imputation improves data quality, enabling more accurate analysis, modeling, and decision-making; in essence, it plays a vital role in maintaining data integrity and enabling reliable analysis. To evaluate the Agentic-RAG framework's ability to handle missing data, we simulated two types of missingness patterns: point missing and block missing[9, 35]. These patterns represent varying degrees of data availability. To achieve this, we introduced synthetic missingness into the time series datasets following these patterns. For point missing, individual values were randomly omitted with a probability threshold p, controlling the overall percentage of missing data. The block missing pattern involves removing contiguous, multi-period, multi-time-series segments. This is done by randomly selecting start and end times, as well as start and end time series, to define uniform blocks with an average length ℓ. All data points within each block are then omitted. Furthermore, two block missing patterns are considered: temporal and spatial. For temporal block missing, contiguous multi-period segments are removed from a given time series. This is done by randomly selecting start and end times, creating stretches of unavailable temporal data. For spatial block missing, contiguous blocks are removed across multiple related time series at specific time points. This involves randomly selecting the start and end time series, resulting in missing spatial data at the chosen time points. Both patterns show varying levels of missing information in the time series data. In summary, point missing refers to sporadic gaps in the data, while block missing involves the absence of entire contiguous multi-period and multi-series segments; block missing can be further categorized into temporal block missing, where contiguous segments are removed within a single time series, and spatial block missing, where contiguous blocks are removed across multiple related time series, mimicking realistic scenarios of faulty data collection. In the context of time series imputation, "in-sample" and "out-of-sample" imputation refer to distinct evaluation settings. In-sample imputation involves the imputation method reconstructing missing values within a given fixed input sequence S^t, using all available observed data within that sequence. Out-of-sample imputation involves training the imputation method using the fixed sequence S^t to impute missing points in a future sequence S^{t+1}. In this work, we utilize the out-of-sample setting, as this approach mimics real-world scenarios and rigorously assesses the Agentic-RAG framework's robustness and generalizability by evaluating its ability to handle new, unseen data. The simulated datasets with missing values were then used to evaluate the missing-data handling capabilities of the proposed Agentic-RAG framework. We split multiple benchmark datasets in chronological order, with a ratio of 7:1:2 for the METR-LA and PEMS-BAY datasets and a ratio of 6:2:2 for the other datasets, into training, validation, and test sets. We evaluated the Agentic-RAG framework's performance on simulated data using multiple imputation metrics (e.g., RMSE, MAE, and MAPE). This analysis helps us understand how well the framework handles time series data with missing values, particularly how its performance changes as the percentage of missing data increases. We establish the Agentic-RAG framework, trained on complete data (no missing values), as a strong performance benchmark. This benchmark allows us to evaluate the framework's effectiveness in imputing missing data under different conditions of data incompleteness. Tables 7 and 8 present the imputation results on standard benchmark datasets with different missingness patterns; while the framework performs only slightly worse than this benchmark for minimal missing data, its accuracy degrades more significantly as the data becomes more incomplete, regardless of the specific missingness pattern. Our proposed Agentic-RAG framework demonstrates robustness to missing data by focusing on the available observations for imputing missing values, thereby avoiding the introduction of potentially inaccurate estimates that could obscure the underlying trends and patterns within the time series data. Additionally, the Agentic-RAG framework effectively captures the complex non-linear intra- and inter-time-series dependencies, which leads to more reliable imputation. The experiments show that our framework can learn the spatio-temporal dependencies from partially observed data with various missingness patterns, resulting in lower imputation errors.
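A small NumPy sketch of how such point- and block-missing masks could be simulated; the parameter names (p for the point-missing probability, the block length, and the number of blocks) are illustrative choices rather than the paper's exact protocol.

```python
import numpy as np

def point_missing_mask(shape, p, rng):
    """M[i, t] = 0 with probability p (sporadic, independent gaps)."""
    return (rng.random(shape) >= p).astype(float)

def block_missing_mask(shape, num_blocks, mean_len, rng):
    """Drop contiguous blocks spanning random ranges of series and timestamps."""
    N, T = shape
    M = np.ones(shape)
    for _ in range(num_blocks):
        length = max(1, int(rng.poisson(mean_len)))
        t0 = rng.integers(0, max(1, T - length))
        i0, i1 = sorted(rng.integers(0, N, size=2))
        M[i0 : i1 + 1, t0 : t0 + length] = 0.0  # spatial x temporal block
    return M

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 1000))

M_point = point_missing_mask(X.shape, p=0.2, rng=rng)
M_block = block_missing_mask(X.shape, num_blocks=20, mean_len=25, rng=rng)
print(1 - M_point.mean(), 1 - M_block.mean())  # achieved missing rates
```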
A.2 Time Series Classification
Time series classification is a crucial task with applications across various domains. In time series analysis, regimes or clusters represent distinct behavioral modes, operating conditions, or states of the system underlying the data. Identifying and characterizing these regimes is crucial for understanding the complex patterns and dynamics within the data. This allows for more accurate modeling, forecasting, and decision-making in applications where time series analysis is essential. The emergence of different regimes or clusters can stem from changes in the data generation process, external conditions, or the inherent non-stationarity and multivariate nature of the time series. This reflects the rich information content and complexity often encountered in real-world time series data. To evaluate the proposed Agentic-RAG framework's ability to handle time series classification tasks, an unsupervised clustering approach was employed for data labeling. We first applied k-means clustering to the original time series datasets, determining the optimal number of clusters k using established techniques such as the elbow method or silhouette analysis. The optimal clusters were treated as class labels, representing distinct regimes within the time series, and each time point was assigned the corresponding cluster label, creating a labeled classification dataset. We adopted a time-based division strategy to split multiple benchmark datasets into training, validation, and testing sets. The METR-LA and PEMS-BAY datasets were split at a 7:1:2 ratio, while the other datasets used a 6:2:2 split. We evaluated the framework's performance on the held-out test set using standard classification metrics: accuracy, precision, and recall.
This methodology allowed us to assess the framework's ability to learn the underlying patterns and relationships associated with each cluster/class and its overall effectiveness in classifying time series data based on inherent, complex spatio-temporal regimes, paving the way for its practical application in real-world scenarios. The experimental results, presented in Tables 9 and 10, show a comparison with simple baselines.

Table 7: The table presents the Agentic-RAG framework's evaluation results on various metrics for missing data imputation across the PeMSD3, PeMSD4, PeMSD7, and METR-LA benchmark datasets with diverse missing-data patterns.

B UNIVARIATE DATASETS
We conducted several experiments to evaluate the proposed Agentic-RAG framework variants, SelfExtend-Agentic-RAG with Gemma-2B, SelfExtend-Agentic-RAG with Gemma-7B, and SelfExtend-Agentic-RAG with Llama-8B, on univariate datasets for multiple time series analysis tasks such as forecasting and imputation.

B.1 Forecasting and Imputation
The ETT (Electricity Transformer) datasets[53], ETTh1, ETTh2, ETTm1, and ETTm2, are popular benchmarks used for evaluating and benchmarking univariate time series forecasting methods. They provide a challenging benchmark due to the presence of complex patterns, such as trends, seasonality, and irregularities, which are commonly found in real-world time series data. ETTh1 and ETTh2 are two hourly time series datasets containing observations of electricity transformers from two different locations, while ETTm1 and ETTm2 are two 15-minute-resolution time series datasets containing observations of electricity transformers from two different locations. In this work, we utilize the ETT datasets[53] to evaluate the Agentic-RAG framework for both forecasting and missing data imputation tasks. Table 11 shows the performance of various methods on the multi-horizon forecasting task using a lookback window of size 512. It presents the mean squared error (MSE) and mean absolute error (MAE) for nine models (GPT4TS[57], PatchTST[28], TimesNet[42], FEDformer[55], LightTS[48], N-BEATS[30], Agentic-RAG w/Gemma-2B, Agentic-RAG w/Gemma-7B, and Agentic-RAG w/Llama-8B) across four datasets (ETTh1, ETTh2, ETTm1, ETTm2) at different time horizons (96, 192, 336, 720). This allows for a comprehensive analysis of the forecasting accuracy and robustness of the Agentic-RAG framework across varying prediction lengths. The performance of various methods for imputing missing data (point and block missing) and their effectiveness in out-of-sample imputation settings are compared in Tables 12 and 13.
The evaluated methods include GPT4TS[57], PatchTST[28], TimesNet[42], FEDformer[55], LightTS[48], N-BEATS[30], Agentic-RAG with Gemma-2B, Agentic-RAG with Gemma-7B, and Agentic-RAG with Llama-8B. The evaluation employs a 512-step historical window for imputing 96-step-ahead (short-term prediction) and 720-step-ahead (long-term prediction) missing values in future data. The tables show results for the four datasets (ETTh1, ETTh2, ETTm1, ETTm2) under three missing-data scenarios: 0% missing (no missing data), 20% point missing, and 20% block missing. The proposed Agentic-RAG framework variants demonstrate strong performance on the benchmark datasets for both the forecasting and imputation tasks, with lower errors.

C ENVIRONMENTAL IMPACT
Our Agentic-RAG framework training process, involving multiple variants running for extended periods, increases our energy consumption and carbon footprint. Accurate quantification of the carbon footprint of deep learning experiments is essential for promoting sustainable practices in artificial intelligence research and development. A crucial aspect of this endeavor is estimating the energy consumption and associated greenhouse gas emissions during the computationally intensive training processes. This is calculated by determining the Total Graphics Power (TGP), which represents the maximum power draw of the GPU, including the GPU chip itself and other components like memory and additional circuitry. For example, the NVIDIA P100 GPU has a TGP of 300 watts, while the NVIDIA T4 GPU has a TGP of 70 watts. By multiplying the TGP by the training time, we can estimate the energy consumption, which is then converted to carbon emissions using a region-specific carbon intensity factor. This factor accounts for the energy mix (coal, natural gas, renewables, etc.) used to generate electricity in the geographic area where the computations are performed. Considering a 725-GPU-hour training experiment and using an estimated carbon intensity factor of 0.0007 metric tons CO2e per kWh for the year 2024 (for more information on the carbon intensity of electricity, see CO2 Intensity - Our World in Data), the calculated carbon footprint would be 152.25 kg CO2e for the NVIDIA P100 GPU and 35.525 kg CO2e for the NVIDIA T4 GPU. Note: kg CO2e stands for kilograms of carbon dioxide equivalent. The average person in the United States emits approximately 43.8 kg of CO2e per day. Given the emissions of 152.25 kg CO2e for the NVIDIA P100 GPU and 35.525 kg CO2e for the NVIDIA T4 GPU, it would take a single person's emissions approximately 3.5 days to match the emissions of the P100 GPU and approximately 0.8 days (or 19 hours) to match the emissions of the T4 GPU. While the calculated carbon footprint provides valuable insight, the actual energy consumption and resulting emissions may vary due to factors like GPU utilization and regional energy sources. Nonetheless, quantifying the carbon footprint is a crucial step towards understanding and mitigating the environmental impact of deep learning research, paving the way for more sustainable and responsible practices in artificial intelligence.
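The figures above follow directly from TGP × training time × carbon intensity; the short calculation below reproduces them (the helper name is ours).

```python
def carbon_footprint_kg(tgp_watts: float, gpu_hours: float,
                        intensity_t_per_kwh: float = 0.0007) -> float:
    """Energy (kWh) = TGP [kW] * hours; emissions = energy * carbon intensity."""
    energy_kwh = (tgp_watts / 1000.0) * gpu_hours
    return energy_kwh * intensity_t_per_kwh * 1000.0  # metric tons -> kg CO2e

print(carbon_footprint_kg(300, 725))  # NVIDIA P100: 152.25 kg CO2e
print(carbon_footprint_kg(70, 725))   # NVIDIA T4:   35.525 kg CO2e
```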
D HYPERPARAMETER OPTIMIZATION
Hyperparameter optimization involves training the Agentic-RAG framework variants multiple times with different hyperparameter settings. This can be computationally expensive, especially for complex pre-trained language models or large datasets. We optimized the hyperparameters for the best-performing Agentic-RAG w/Llama-8B variant. For simplicity and in the interest of time, we utilized the same settings when evaluating the performance of the Agentic-RAG w/Gemma-2B and w/Gemma-7B variants for both multivariate and univariate datasets across all tasks. In our experiments, we optimized the training process for supervised fine-tuning using a batch size from {16, 32, 64} and a learning rate from {1e-5, 5e-5, 1e-4}. The training was conducted over epochs in the range of {10, 15, 20}, with a warmup step count from {500, 1000, 1500} and a weight decay for regularization from {0.01, 0.05, 0.1}. We used gradient accumulation steps for stabilized training convergence from {2, 4, 8} and employed the AdamW optimizer. To manage memory and computational efficiency, we applied 4-bit quantization for QLoRA, with hyperparameters including a low rank r from {16, 32, 64}, an α from {32, 64, 128}, and a dropout from {0.05, 0.1, 0.2}. For
Methods GPT4TS PatchTST TimesNet FEDFormer LightTS N-BEATS ARAG w/-2B ARAG w/-7B ARAG-w/8B
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
ETTh1
96 0.376 0.397 0.370 0.399 0.384 0.402 0.376 0.419 0.424 0.432 0.399 0.428 0.410 0.435 0.407 0.433 0.369 0.396
192 0.416 0.418 0.413 0.421 0.436 0.429 0.420 0.448 0.475 0.462 0.451 0.464 0.448 0.461 0.445 0.459 0.412 0.417
336 0.442 0.433 0.422 0.436 0.491 0.469 0.459 0.465 0.518 0.488 0.498 0.500 0.487 0.476 0.484 0.473 0.421 0.434
720 0.477 0.456 0.447 0.466 0.521 0.500 0.506 0.507 0.547 0.533 0.608 0.573 0.496 0.482 0.491 0.478 0.446 0.464
ETTh2
96 0.285 0.342 0.274 0.336 0.340 0.374 0.358 0.397 0.397 0.437 0.327 0.387 0.345 0.378 0.342 0.374 0.273 0.335
192 0.354 0.389 0.339 0.379 0.402 0.414 0.429 0.439 0.520 0.504 0.400 0.435 0.387 0.410 0.384 0.406 0.338 0.378
336 0.373 0.407 0.329 0.380 0.452 0.452 0.496 0.487 0.626 0.559 0.747 0.599 0.465 0.468 0.462 0.465 0.328 0.379
720 0.406 0.441 0.379 0.422 0.462 0.468 0.463 0.474 0.863 0.672 1.454 0.847 0.473 0.472 0.469 0.469 0.371 0.420
ETTm1
96 0.292 0.346 0.290 0.342 0.338 0.375 0.379 0.419 0.374 0.400 0.318 0.367 0.354 0.369 0.351 0.366 0.289 0.340
192 0.332 0.372 0.332 0.369 0.374 0.387 0.426 0.441 0.400 0.407 0.355 0.391 0.368 0.383 0.365 0.380 0.331 0.367
336 0.366 0.394 0.366 0.392 0.410 0.411 0.445 0.459 0.438 0.438 0.401 0.419 0.396 0.404 0.392 0.400 0.365 0.388
720 0.417 0.421 0.416 0.420 0.478 0.450 0.543 0.490 0.527 0.502 0.448 0.448 0.435 0.427 0.431 0.423 0.411 0.419
ETTm2
96 0.173 0.262 0.165 0.255 0.187 0.267 0.203 0.287 0.209 0.308 0.197 0.271 0.190 0.265 0.187 0.262 0.164 0.254
192 0.229 0.301 0.220 0.292 0.249 0.309 0.269 0.328 0.311 0.382 0.285 0.328 0.276 0.318 0.273 0.315 0.219 0.290
336 0.286 0.341 0.274 0.329 0.321 0.351 0.325 0.366 0.442 0.466 0.338 0.366 0.319 0.354 0.316 0.351 0.273 0.328
720 0.378 0.401 0.362 0.385 0.408 0.403 0.421 0.415 0.675 0.587 0.395 0.419 0.410 0.411 0.407 0.408 0.361 0.384
Table 11: The table compares various methods for the multi-horizon forecasting task with a lookback window of size 512.
Methods GPT4TS PatchTST TimesNet FEDFormer LightTS N-BEATS ARAG w/Gemma-2B ARAG w/Gemma-7B ARAG w/Llama-8B
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
ETTh1
0% 0.376 0.397 0.370 0.399 0.384 0.402 0.376 0.419 0.424 0.432 0.399 0.428 0.410 0.435 0.407 0.433 0.369 0.396
20% PM 0.460 0.480 0.450 0.475 0.460 0.490 0.455 0.485 0.470 0.500 0.465 0.495 0.468 0.498 0.465 0.495 0.450 0.475
20% BM 0.550 0.570 0.545 0.565 0.550 0.580 0.548 0.575 0.560 0.590 0.555 0.585 0.558 0.588 0.555 0.585 0.545 0.565
ETTh2
0% 0.285 0.342 0.274 0.336 0.340 0.374 0.358 0.397 0.397 0.437 0.327 0.387 0.345 0.378 0.342 0.374 0.273 0.335
20% PM 0.370 0.420 0.360 0.415 0.380 0.440 0.375 0.435 0.390 0.450 0.380 0.440 0.383 0.443 0.380 0.440 0.360 0.415
20% BM 0.460 0.510 0.450 0.505 0.470 0.530 0.465 0.525 0.480 0.540 0.470 0.530 0.473 0.533 0.470 0.530 0.450 0.505
ETTm1
0% 0.292 0.346 0.290 0.342 0.338 0.375 0.379 0.419 0.374 0.400 0.318 0.367 0.354 0.369 0.351 0.366 0.289 0.340
20% PM 0.380 0.430 0.375 0.425 0.390 0.450 0.385 0.445 0.400 0.460 0.395 0.455 0.398 0.458 0.395 0.455 0.375 0.425
20% BM 0.470 0.520 0.465 0.515 0.480 0.540 0.475 0.535 0.490 0.550 0.485 0.545 0.488 0.548 0.485 0.545 0.465 0.515
ETTm2
0% 0.173 0.262 0.165 0.255 0.187 0.267 0.203 0.287 0.209 0.308 0.197 0.271 0.190 0.265 0.187 0.262 0.164 0.254
20% PM 0.250 0.330 0.245 0.325 0.260 0.345 0.255 0.340 0.270 0.355 0.265 0.350 0.268 0.353 0.265 0.350 0.245 0.325
20% BM 0.340 0.420 0.335 0.415 0.350 0.435 0.345 0.430 0.360 0.445 0.355 0.440 0.358 0.443 0.355 0.440 0.335 0.415
Table 12: The table compares different methods for imputing missing data, specifically for point missing (PM) and block missing
(BM) scenarios, using a 512-step lookback window for forecasting 96 steps ahead.
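The captions do not specify how the PM and BM corruptions are generated. As a hedged illustration only, point missing can be read as values dropped uniformly at random and block missing as contiguous segments dropped; the 512-step window length comes from the caption, while the block length of 16 and the uniform sampling are assumptions.

import numpy as np

def point_missing_mask(length: int, ratio: float, rng: np.random.Generator) -> np.ndarray:
    """True where a value is observed; drops `ratio` of points uniformly at random."""
    mask = np.ones(length, dtype=bool)
    drop = rng.choice(length, size=int(ratio * length), replace=False)
    mask[drop] = False
    return mask

def block_missing_mask(length: int, ratio: float, block: int, rng: np.random.Generator) -> np.ndarray:
    """True where a value is observed; drops contiguous blocks until `ratio` is reached."""
    mask = np.ones(length, dtype=bool)
    while (~mask).sum() < ratio * length:
        start = rng.integers(0, length - block)
        mask[start:start + block] = False
    return mask

rng = np.random.default_rng(0)
pm = point_missing_mask(512, 0.2, rng)       # 20% point missing over a 512-step window
bm = block_missing_mask(512, 0.2, 16, rng)   # 20% block missing with a hypothetical block length of 16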
Methods GPT4TS PatchTST TimesNet FEDFormer LightTS N-BEATS ARAG w/Gemma-2B ARAG w/Gemma-7B ARAG w/Llama-8B
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
ETTh1
0% 0.477 0.456 0.447 0.466 0.521 0.500 0.506 0.507 0.547 0.533 0.608 0.573 0.496 0.482 0.491 0.478 0.446 0.464
20% PM 0.580 0.560 0.550 0.570 0.620 0.600 0.605 0.605 0.645 0.630 0.710 0.670 0.595 0.580 0.590 0.575 0.550 0.570
20% BM 0.690 0.670 0.660 0.680 0.740 0.720 0.725 0.725 0.765 0.750 0.830 0.790 0.715 0.700 0.710 0.695 0.670 0.680
ETTh2
0% 0.406 0.441 0.379 0.422 0.462 0.468 0.463 0.474 0.863 0.672 1.454 0.847 0.473 0.472 0.469 0.469 0.371 0.420
20% PM 0.510 0.545 0.483 0.526 0.566 0.572 0.567 0.578 0.967 0.776 1.558 0.947 0.577 0.576 0.573 0.573 0.475 0.524
20% BM 0.620 0.655 0.593 0.636 0.676 0.682 0.677 0.688 1.067 0.876 1.658 1.047 0.677 0.676 0.673 0.673 0.575 0.624
ETTm1
0% 0.417 0.421 0.416 0.420 0.478 0.450 0.543 0.490 0.527 0.502 0.448 0.448 0.435 0.427 0.431 0.423 0.411 0.419
20% PM 0.520 0.525 0.519 0.523 0.581 0.553 0.646 0.593 0.630 0.602 0.551 0.551 0.538 0.530 0.534 0.526 0.514 0.522
20% BM 0.630 0.635 0.629 0.633 0.691 0.663 0.756 0.703 0.740 0.712 0.661 0.661 0.648 0.640 0.644 0.636 0.624 0.632
ETTm2
0% 0.378 0.401 0.362 0.385 0.408 0.403 0.421 0.415 0.675 0.587 0.395 0.419 0.410 0.411 0.407 0.408 0.361 0.384
20% PM 0.480 0.503 0.464 0.487 0.510 0.505 0.523 0.517 0.777 0.689 0.495 0.519 0.510 0.511 0.507 0.508 0.461 0.484
20% BM 0.590 0.613 0.574 0.597 0.620 0.615 0.633 0.627 0.877 0.789 0.595 0.619 0.610 0.611 0.607 0.608 0.561 0.584
Table 13: The table evaluates the effectiveness of various methods for out-of-sample imputation under point missing (PM) and block missing (BM) scenarios, using a 512-step historical window to predict missing values in the subsequent 720-step future window.
For preference tuning, the hyperparameter (β) was set in the range of {0.2, 0.4, 0.6} and the learning rate from {5.0e−7, 1.0e−6, 5.0e−6}. The optimal hyperparameters for training were chosen to achieve a balance between performance and computational efficiency. The optimal hyperparameters for supervised fine-tuning were a batch size of 16 and a learning rate of 1e-5, trained over 15 epochs with 500 warmup steps and a weight decay of 0.01, utilizing the AdamW optimizer. Gradient accumulation steps were set to 2. QLoRA quantization was applied with 4-bit precision, and its specific hyperparameters included a low rank (r) of 16, an alpha (α) of 32, and a dropout rate of 0.05. Preference optimization was performed with a learning rate of 5.0e-7 over 3 epochs and a beta value of 0.2.
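The chosen settings can be summarized in code. The sketch below is a non-authoritative illustration assuming a Hugging Face transformers/peft-style stack; the class names come from those libraries, the output path is hypothetical, and the DPO values are shown as a plain dictionary since the exact preference-optimization trainer is not specified here.

import torch
from transformers import TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig

# Supervised fine-tuning: the optimal settings reported in the text.
sft_args = TrainingArguments(
    output_dir="agentic_rag_sft",          # hypothetical output path
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    num_train_epochs=15,
    warmup_steps=500,
    weight_decay=0.01,
    gradient_accumulation_steps=2,
    optim="adamw_torch",                   # AdamW optimizer
)

# QLoRA: 4-bit base-model quantization plus low-rank adapters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# Direct preference optimization settings, e.g. for a DPO-style trainer such as trl's DPOTrainer.
dpo_settings = {"learning_rate": 5e-7, "num_train_epochs": 3, "beta": 0.2}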
E ABLATION STUDY
To understand the contribution of each component within our pro-
posed Agentic-RAG framework, we designed an ablation study.
By systematically evaluating the impact of removing individual
components, we gain valuable insights into their role in the frame-
work’s overall performance. The following ablation experiments
were conducted (an illustrative configuration sketch follows the list):
• (a) Effect of dynamic prompting mechanism (DPM):
– We compared the performance of the Agentic-RAG frame-
work with and without the dynamic prompting mecha-
nism.
• (b) Role of sub-agent specialization (SAS):
– We evaluated the Agentic-RAG framework using a single,
universal sub-agent for all tasks versus specialized sub-
agents for each task.
• (c) Instruction-tuning (IT) vs. no fine-tuning (NIT):
– We compared the performance of SLMs with instruction-
tuning against their performance without any fine-tuning.
• (d) Effectiveness of direct preference optimization (DPO):
– We evaluated the framework’s performance with and with-
out DPO and assessed how aligning SLMs with preferred
outcomes impacts the accuracy and reliability of predic-
tions.
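The sketch below is purely illustrative: the flag and class names are hypothetical and not from the paper; it only shows how the four ablation variants correspond to switching one component off at a time while keeping the rest of the pipeline fixed.

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AgenticRAGConfig:
    dynamic_prompting: bool = True          # DPM: retrieve prompts from the shared prompt pools
    sub_agent_specialization: bool = True   # SAS: task-specific sub-agents vs. one universal agent
    instruction_tuning: bool = True         # IT: task-specific supervised fine-tuning of each SLM
    dpo: bool = True                        # DPO: preference alignment of task-specific outputs

full = AgenticRAGConfig()
ablations = {
    "W/O DPM": replace(full, dynamic_prompting=False),
    "W/O SAS": replace(full, sub_agent_specialization=False),
    "W/O IT": replace(full, instruction_tuning=False),
    "W/O DPO": replace(full, dpo=False),
}
# Each configuration would be trained and evaluated with the same protocol as the full framework.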
Our study investigates the impact of different components on the
overall performance of the framework, 'SelfExtend-Agentic-RAG W/Llama 3-8B', in time series forecasting, anomaly detection, and
classification tasks across various benchmark datasets. We system-
atically disable each component (dynamic prompting mechanism
(DPM), sub-agent specialization (SAS), instruction-tuning (IT), or di-
rect preference optimization (DPO)) and compare the results to the
full framework. Tables 14 and 15 detail the forecasting performance,
highlighting that the original framework consistently achieves the
lowest error rates in MAE, RMSE, and MAPE across different hori-
zons and datasets. This indicates the crucial role of each component
in improving forecasting accuracy. Table 16 focuses on anomaly de-
tection tasks, showing the original framework’s superior precision,
recall, and F1-score compared to its ablated variants. The origi-
nal framework consistently achieves higher metrics scores across
anomaly benchmark datasets such as SWaT, WADI, SMAP, MSL,
and HAI. The significant performance drop observed in the ablated
variants underscores the importance of the integrated components,
demonstrating their synergistic contribution to enhancing anomaly detection capabilities. For classification tasks, the original framework excels, as demonstrated in Tables 17 and 18, achieving the highest accuracy, precision, and recall across datasets like PeMSD3, PeMSD4, PeMSD7, METR-LA, PeMSD7(M), PeMSD8, and PEMS-BAY. The superior performance in classification tasks, coupled with the significant drop observed in ablated variants, highlights the critical role each component plays in the original framework's success. This comprehensive analysis underscores the importance of integrating all components to maximize performance across forecasting, anomaly detection, and classification tasks. The synergistic contribution of the dynamic prompting mechanism, sub-agent specialization, instruction-tuning, and direct preference optimization is evident in the consistent superiority of the Agentic-RAG framework compared to its ablated variants.
Table 17: The table presents the ablation study results, evaluating the performance across various metrics for time series
classification tasks on the PeMSD3, PeMSD4, PeMSD7, and METR-LA benchmark datasets.
Dataset PeMSD7(M) PeMSD8 PEMS-BAY
Metric Accuracy Precision Recall Accuracy Precision Recall Accuracy Precision Recall
Baseline W/O DPM 75.41% 73.21% 74.42% 76.02% 74.81% 75.23% 76.81% 75.42% 76.02%
Baseline W/O SAS 82.23% 80.52% 81.14% 83.14% 81.32% 82.01% 83.62% 82.11% 82.73%
Baseline W/O IT 37.61% 36.12% 36.54% 38.02% 36.81% 37.23% 38.61% 37.42% 37.92%
Baseline W/O DPO 90.02% 88.73% 89.21% 90.54% 89.32% 89.83% 91.01% 89.73% 90.32%
SelfExtend-Agentic-RAG W/Llama-8B 94.02% 92.54% 93.02% 95.04% 94.03% 94.52% 96.01% 95.01% 95.53%
Table 18: This table presents the results of an ablation study comparing the performance of various Agentic-RAG framework
variants. The study evaluates performance on three benchmark datasets – PeMSD7(M), PeMSD8, and PEMS-BAY – across
different metrics for time series classification tasks.