RITA: Group Attention is All You Need for Timeseries Analytics

Jiaming Liang (University of Pennsylvania, Philadelphia, PA, USA)
Lei Cao∗ (Massachusetts Institute of Technology, Cambridge, MA, USA)
Samuel Madden (Massachusetts Institute of Technology, Cambridge, MA, USA)
Zachary Ives (University of Pennsylvania, Philadelphia, PA, USA)
Guoliang Li (Tsinghua University, Beijing, China)
∗ Corresponding Author

arXiv:2306.01926v1 [cs.LG] 2 Jun 2023

ABSTRACT

Timeseries analytics is of great importance in many real-world applications. Recently, the Transformer model, popular in natural language processing, has been leveraged to learn high-quality feature embeddings from timeseries, core to the performance of various timeseries analytics tasks. However, the quadratic time and space complexities limit Transformers' scalability, especially for long timeseries. To address these issues, we develop a timeseries analytics tool, RITA, which uses a novel attention mechanism, named group attention, to address this scalability issue. Group attention dynamically clusters the objects based on their similarity into a small number of groups and approximately computes the attention at the coarse group granularity. It thus significantly reduces the time and space complexity, yet provides a theoretical guarantee on the quality of the computed attention. The dynamic scheduler of RITA continuously adapts the number of groups and the batch size in the training process, ensuring group attention always uses the fewest groups needed to meet the approximation quality requirement. Extensive experiments on various timeseries datasets and analytics tasks demonstrate that RITA outperforms the state-of-the-art in accuracy and is significantly faster, with speedups of up to 63X.

1 INTRODUCTION

Motivation. Many data-driven applications involve processing massive timeseries data, including IoT [11], medical AI [14], the stock market [27], and so on. As such, there is a great need for timeseries analytics, such as forecasting [8], classification [20], clustering [31], similarity search [39], and anomaly detection [50], with applications ranging from automatically diagnosing diseases [5] and recognizing human activities [29] to stopping financial fraud [59].

Effective feature extraction [40] lies at the core of almost all these timeseries analytics tasks. Recently, researchers [61] have started leveraging the self-supervised pre-training methodology of Transformers [4, 16, 52], which has proven remarkably successful in natural language processing (NLP), to automatically learn high-quality feature embeddings from timeseries. In NLP, self-supervised pre-training exploits the sequential patterns (correlations) among the words in sentences to produce contextualized feature embeddings. Timeseries bear similarity to natural language, because in timeseries data the sequential order among the values (stock price, volume, etc.) over time matters. That is, each value is highly correlated with other values observed before or after it. Therefore, pre-training a Transformer model which takes the correlations among different observations into account is a natural idea for learning feature embeddings from timeseries. Indeed, the experiments in [61] confirm that Transformer-based methods outperform traditional timeseries analytics techniques.

However, existing work [61] that directly applies Transformers to learn features from timeseries data has been shown not to be scalable to long timeseries [30]. The idea of self-attention [52] is central to pre-training methods in NLP: it computes pairwise correlations among different semantic units in a sequence (in NLP, a sentence); as such, it has quadratic time and space complexity in the length of the input sequence. Such an approach places limits on the model's scalability, especially when handling large sequences, which are common in real-world timeseries applications such as IoT, medical AI, and finance [6, 34, 62]. Predictions about timeseries may need to look at months or years of historical data, spanning hundreds of thousands of samples. As an example, in collaboration with a research hospital we have been developing a seizure classifier that automatically detects seizures based on EEG signals (timeseries) collected during the clinical observation of patients. As seizures last only a few seconds, we chunk long EEG data into many 2-second segments and detect seizures at a segment level. However, the classification of a particular segment depends on up to 12 hours of prior signal to determine whether one 2-second segment indicates a seizure or not, because seizure diagnosis needs to consider long-term trends in the EEG data [6]. The number of segments in 12 hours is more than 21k. This is far larger than the number of semantic units typical NLP tasks expect. For example, BERT [16] limits the number of units to 512, and even massive models like GPT-3 [4] limit the number of units to 2048.

Although in NLP some lower-complexity methods have been proposed to approximately compute self-attention [10, 26, 54], their performance degrades dramatically when used on timeseries, due to the gap between natural language and timeseries, as we will show in our experiments.

Proposed Approach. To tackle the aforementioned problem, we develop RITA, a Transformer-based timeseries analytics tool, which uses a novel attention mechanism, called group attention, to scale to long timeseries.

Leveraging the periodicity of timeseries, RITA chunks the input timeseries into segments and dynamically clusters the segments into a small number (denoted as N) of groups. Segments in the same group possess similar feature embeddings during the current training iteration, thus enabling them to approximately share the
computation of attention. As the timeseries increases in length, more sharing opportunities become available. RITA then computes the self-attention at a group level and produces a compressed group attention matrix. In this way, group attention eliminates both the computation and memory bottlenecks in Transformer-style models and is thus more scalable to long timeseries.

However, making this idea effective and efficient in Transformer architectures is challenging for several reasons:

• Efficiently Producing High Quality Feature Embeddings. Although RITA computes the attention matrix at a group level, to preserve the quality of the feature embeddings, it still has to produce different embeddings for different segments. This is because even if some segments share the attention score temporarily, it does not mean they should have the same feature embedding. However, using the group attention matrix, the existing self-attention mechanism will only produce a single feature vector for each group. A naive solution would be to restore the original attention matrix from the group attention matrix. However, in this case we again get an attention matrix with quadratic space complexity. Because GPUs have limited memory, GPU memory will remain a bottleneck in group attention.

• The Number of Groups N. In RITA, the number of groups N is a crucial factor that balances the speedup and the quality of the attention approximation. A small N will lead to a large speedup, but the approximation errors can also be significant. On the other hand, although a large N tends to produce high-quality approximations, it inevitably slows down the training process. Therefore, an appropriate N is essential to the performance of group attention. However, N depends on the distributional properties of the dataset. Furthermore, like the classical Transformer models, RITA stacks multiple attention layers to produce better embeddings. Ideally, different layers should also use different values of N. In addition, during the model training phase, group attention should use different values of N at different iterations to adapt to the varying feature embeddings. This makes manually setting an appropriate N almost impossible.

• Batch Size. Moreover, as we want to dynamically adjust N during training, a fixed batch size is sub-optimal: as N decreases, the memory usage of a single sample decreases. This allows a larger batch size, which is beneficial because (1) it makes full use of GPU memory, and (2) high parallelism across the samples in a big batch brings better performance. Our experimental study shows that doubling the batch size reduces the training time by 30%, while still preserving the quality of the model. Thus, RITA should dynamically adjust the batch size as N changes.

To address the above problems, we first propose an embedding aggregation strategy and a customized group softmax function to replace the classical softmax function [52]. Together they ensure RITA is able to directly use the compressed attention matrix to produce different feature embeddings for different segments. We theoretically show that the embeddings RITA produces in this way are identical to those produced by first restoring the original large attention matrix. Thus RITA is able to produce high quality embeddings without introducing extra overhead. Further, we design a GPU-friendly algorithm to group the segments in parallel, effectively minimizing the grouping cost.

[Figure 1: RITA Architecture. The raw timeseries is scaled and chunked by the time-aware convolution into window embeddings W[CLS], W1, ..., Wn, which are combined with position embeddings P0, P1, ..., Pn and fed into the RITA Encoder to produce the output embeddings O0, O1, ..., On.]

Second, we design an adaptive scheduler which dynamically decides an appropriate N for each group attention layer during the training process. It starts with a large N and iteratively merges groups that are similar to each other. Guided by an error bound on the approximated self-attention that users can tolerate, it automatically determines if two groups are mergeable, performing merging efficiently in a GPU-friendly way.

Moreover, we propose a learning-based method to model the correlation between the number of groups N and the batch size B. This model is used to predict B for a given N when training RITA. Specifically, we first sample some N values in a reasonable range. For each sampled N, we find a batch size that consumes up to a certain percentage of GPU memory in a cost-efficient way. Using a small set of mathematical functions as a prior, RITA learns a model with only a few <N, B> pairs as ground truth labels.

Our experiments on public timeseries benchmarks and the MGH EEG data [6] confirm that RITA outperforms state-of-the-art methods in accuracy on various timeseries analytics tasks, while our group attention mechanism achieves a 63X speedup with much less memory required, compared to existing self-attention mechanisms [10, 52, 54].

Contributions. The key contributions of this work include:
• Our group attention mechanism leverages the periodicity of timeseries, reducing the time and space complexity of the self-attention mechanism with accuracy guarantees, allowing RITA to scale to long timeseries data.
• Guided by an approximation error bound, our adaptive scheduler dynamically adapts the number of groups and the batch size to the distributional properties of the evolving feature embeddings, making group attention efficient and easily tunable.
• We conduct experiments on various datasets and different analytics tasks, demonstrating that RITA is 4 to 63 times faster than the state-of-the-art while achieving better accuracy when handling long timeseries (length ≥ 2000).
2 BACKGROUND

We provide some background on the canonical self-attention module in the Transformer [52]. A self-attention module takes n hidden embedding vectors H ∈ R^(n×d_h) as input, then projects them to queries (Q), keys (K), and values (V) and performs Scaled Dot-Product Attention, which, given the input hidden state H, is computed as:

Q = H W_Q,  K = H W_K,  V = H W_V
O = A V = SoftMax(Q K^T / √d_k) V    (1)

where W_Q ∈ R^(d_h×d_k), W_K ∈ R^(d_h×d_k), and W_V ∈ R^(d_h×d_v) are the projection matrices for generating Q, K, and V. Q ∈ R^(n×d_k) can be regarded as the packing of n query vectors {q_1, ..., q_n} with dimension d_k into a matrix. K ∈ R^(n×d_k) and V ∈ R^(n×d_v) are regarded as the packing of the key vectors {k_1, ..., k_n} and the value vectors {v_1, ..., v_n} in the same way.

Given a matrix M ∈ R^(L×n), the softmax function normalizes M to ensure the sum of each row equals 1, as shown below:

SoftMax(M_i,j) = exp(M_i,j) / Σ_{k=0}^{n−1} exp(M_i,k)    (2)

Note that the attention matrix A is an n×n matrix, where n represents the number of elements in the input sequence (e.g., words in NLP).

3 RITA OVERVIEW

Given a collection of unlabeled timeseries, RITA first pre-trains a Transformer-style model to produce high quality feature embeddings for timeseries data. This pre-trained model is then used to support various downstream tasks, similar to BERT [16]. Next, we overview the model architecture of RITA. We show how RITA supports various downstream tasks in Appendix A.7.

As shown in Fig. 1, RITA consists of two components: (1) a Time-aware Convolution Layer and (2) the RITA Encoder.

Time-aware Convolution Layer fills the gap between timeseries and natural language. Despite their high-level similarity, there is a big gap between timeseries and natural language. First, in natural language each word, as a discrete semantic unit, has an independent meaning, while each element in a timeseries is a continuous, numerical value and does not necessarily constitute an independent event. Furthermore, the input sequences are single-channeled in NLP, but often multi-channeled in timeseries (i.e., sensor data often consists of several related channels).

RITA leverages the classical convolution [28] strategy to solve this problem. Convolution is widely used to capture the local structures of an image. We use convolution to chunk one input timeseries into a sequence of windows and learn the local structure of each window, similar to the discrete semantic units in natural language. It also discovers the correlations across different channels, thus naturally solving the multi-channel problem.

More specifically, treating a multi-variate timeseries of length n and with m variables as an n × m matrix T, RITA uses d convolution kernels to chunk T into n windows and produce one d-dimensional embedding per window using the convolution operation [28]. Each convolution kernel corresponds to a w × m matrix, where w defines the number of timestamps that each convolution kernel covers, identical to the window size in a sliding window.

RITA Encoder functions as the Transformer Encoder described in the original Transformer work [52]. It takes the embeddings of n semantic units X_1, X_2, ..., X_n (X_i ∈ R^d) as input (e.g., the embeddings of n windows for a timeseries), then models the correlations between the semantic units and outputs Y_1, ..., Y_n (Y_i ∈ R^d) as the context-aware embedding of each unit.

What makes the RITA Encoder different from the Transformer Encoder is that at the core of the Transformer Encoder lies the self-attention mechanism, which incurs O(n^2) time complexity and memory usage. This quadratic cost becomes prohibitive for long timeseries and limits the scalability of Transformer-based models. To make the attention computation efficient yet high-quality, we replace the canonical self-attention with our proposed group attention.

Self-supervised Pretraining. Inspired by the "cloze text" pretraining task in NLP, we designed a mask-and-predict task as the pretraining task for our model. The timeseries is randomly masked and the model should recover the masked values based on the corresponding contextual information.

To be specific, we generate masks on timestamps, with a mask rate p. The timeseries is scaled to be non-negative and the values across all the channels on the masked timestamps are set to -1, an impossible value on normal timestamps. Then the masked timeseries is fed into RITA and the output representation is translated to the recovered timeseries by a transpose convolution layer.
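To make the two components above concrete, the following PyTorch-style sketch chunks an n × m timeseries into n window embeddings with a 1-D convolution and applies the mask-and-predict corruption. It is our own minimal rendering, not RITA's released implementation; the kernel width w, the number of kernels d, the scaling choice, and the per-timestamp masking scheme are illustrative assumptions.

```python
# Minimal sketch (not RITA's released code): time-aware convolution that
# chunks a multi-variate timeseries into window embeddings, plus the
# mask-and-predict corruption used for self-supervised pretraining.
# Hyper-parameters (w, d, p) below are illustrative assumptions.
import torch
import torch.nn as nn

class TimeAwareConvolution(nn.Module):
    def __init__(self, m_channels: int, d_kernels: int, w: int):
        super().__init__()
        # Each of the d kernels spans w timestamps across all m channels,
        # mirroring the w x m kernel matrix described in Sec. 3.
        self.conv = nn.Conv1d(m_channels, d_kernels, kernel_size=w,
                              stride=1, padding=w // 2)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (batch, n, m) -> window embeddings: (batch, n, d)
        return self.conv(t.transpose(1, 2)).transpose(1, 2)[:, : t.shape[1], :]

def mask_for_pretraining(t: torch.Tensor, p: float = 0.2):
    # Shift each sample/channel to be non-negative, then set every channel
    # of the masked timestamps to -1, a value impossible after scaling.
    t = t - t.amin(dim=1, keepdim=True)
    masked = t.clone()
    mask = torch.rand(t.shape[0], t.shape[1], device=t.device) < p
    masked[mask] = -1.0
    return masked, mask, t  # corrupted input, mask positions, recovery target

# Usage: embeddings = TimeAwareConvolution(m_channels=3, d_kernels=64, w=5)(x)
```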
4 GROUP ATTENTION MECHANISM

Group attention, a novel and efficient approximate attention mechanism, addresses the performance bottleneck of self-attention in the vanilla Transformer. In this section, we first introduce the framework of group attention and then theoretically establish the bound of its approximation error.

4.1 The Idea of Group Attention

As periodicity is a natural property of timeseries [56], similar windows frequently occur. Similar windows result in similar queries/keys for attention computation, bringing opportunities for saving computation.

As discussed in Sec. 2, A_i,j, the attention score of window i onto window j, is determined by the inner product between the query vector of window i and the key vector of window j, that is, q_i · k_j. Given another window x, if window x has a similar key vector to window j, that is, k_j ≈ k_x, then q_i · k_j ≈ q_i · k_x. In other words, A_i,j ≈ A_i,x when k_j ≈ k_x.

This observation inspires our group attention mechanism. That is, we group the windows by their similarity in keys. Assuming all windows in the same group have the same attention score onto another window k, we then only compute the attention once by using one single key to represent this group, for example the centroid of the group of keys. This saves significant computation cost.

Better yet, after grouping n windows into N groups, group attention compresses the attention matrix from an n×n matrix to an n×N matrix. Because N (the number of groups) tends to be much smaller than n (the number of windows) due to the periodicity of timeseries, group attention consumes much less memory than the original self-attention mechanism, successfully eliminating the memory bottleneck. Note that it also does not hurt quality much, as confirmed in our experiments (Sec. 6.2).

[Figure 2: Group Attention. The keys K are grouped; Q is multiplied with the transposed group keys and normalized with the group softmax to produce the group attention matrix, which is then multiplied with the aggregated values V to produce the output.]

4.2 Computing the Output Feature Embedding

We now discuss how to efficiently compute the output feature embeddings using the small compressed group attention matrix.

4.2.1 Problem: Producing Embeddings w/ Group Attention Matrix
As described in the Background, once we have acquired the attention matrix A, canonical self-attention computes the output embedding O as O = AV. Because A is an n × n matrix and V is an n × d_v matrix, the matrix product operation still produces an n × d_v matrix O. That is, it produces a d_v-dimensional feature vector for each window. However, our group attention will produce an n × N attention matrix Ã, where N corresponds to the number of groups. In this case the matrix product will produce an N × d_v matrix Õ. That is, it produces a feature vector for each group. However, our goal is to produce different embeddings for different windows, because even if some windows share the attention score temporarily, it does not mean they should have the same feature embedding.

A Naive Solution. A naive solution would be to restore the full attention matrix A from the group attention matrix Ã. For example, given one group composed of win_i and win_j, we map its group attention vector in Ã into two rows that correspond to win_i and win_j in A. However, in this case we again get an n × n attention matrix, and GPU memory remains a bottleneck in group attention.

4.2.2 Solution: Embedding Aggregation and Group SoftMax
Using an embedding aggregation operation and a group softmax function, RITA produces n embeddings without restoring the full attention matrix. Fig. 2 shows the workflow of group attention.

Embedding Aggregation. The idea is inspired by an observation on the matrix product operation O = AV conducted on the fully restored attention matrix A. Given an element O_i,j of O corresponding to the j-th dimension of win_i's feature vector, O_i,j = a_i · v_j, where the vector a_i ∈ R^n denotes the i-th row of the attention matrix A and the vector v_j ∈ R^n denotes the j-th dimension of all the n feature vectors. Given a_i = <a_i1, a_i2, ..., a_in> and v_j = <v_1j, v_2j, ..., v_nj>, O_i,j = Σ_{k=1}^{n} a_ik v_kj.

As an example, assume win_1 and win_2 belong to the same group G_1. Then a_i1 = a_i2 = ã_i1, where ã_i1 ∈ Ã corresponds to the attention of group G_1 onto win_i. Therefore, a_i1 v_1j + a_i2 v_2j = ã_i1 (v_1j + v_2j).

As an immediate generalization of the above analysis, if we aggregate the windows that belong to the same group and convert the n-dimensional feature vector v_j into an N-dimensional group feature vector ṽ_j beforehand, we can directly use the group attention vector ã_i and the group feature vector ṽ_j to compute O_i,j.

Using embedding aggregation, RITA is able to produce a feature embedding Õ that is identical to the embedding O produced by using the full attention matrix A and the embedding matrix V.

Group Softmax Function. In canonical self-attention the attention matrix A is computed as A = SoftMax(Q K^T / √d_k). To compute A, we have to first compute Q K^T (denoted as P), which is an n × n matrix. Then normalizing the P matrix with softmax produces the attention matrix A.

Group attention follows the same procedure. But after grouping the keys into K̃, Q K̃^T produces an n × N matrix P̃. Due to the non-linearity of the softmax function, applying softmax directly on P̃ will result in a group attention matrix Ã from which we are not able to recover a full attention matrix that is identical to first restoring P̃ to P and then applying softmax on P. The A matrix produced by the latter is desirable, as we want to approximate the original attention matrix as accurately as possible. However, restoring the small n × N matrix P̃ is not memory efficient, as it will end up with a full n × n matrix P.

To solve the above problems, we introduce a new group softmax function to replace the original softmax function (Eq. 2):

GroupSoftMax(P̃_i,j) = exp(P̃_i,j) / Σ_{k=0}^{N−1} count_k · exp(P̃_i,k)    (3)

In Eq. 3, count_k represents the number of windows that group G_k contains. Compared to the original softmax, our group softmax considers each group G_k as count_k elements and counts it count_k times when summing up the exponential of each group's P̃_i,k. In this way, the group softmax function operating on the small P̃ matrix produces exactly the same result as the softmax function operating on the full P matrix.

Theoretical Guarantee. In Appendix A.4, we prove that the group softmax function and the embedding aggregation operation produce the same output feature embedding as the naive method that has to first restore the big full attention matrix.

We show an efficient implementation of the embedding aggregation operation and the group softmax function in Appendix A.2, Alg. 1.

Time Complexity. The time complexity of Alg. 1 is O(nNd) and the space complexity is O(nN), while the time and space complexities of the original self-attention mechanism are O(n^2 d) and O(n^2).
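The sketch below is our own minimal PyTorch rendering of this computation (it is not the Alg. 1 listed in Appendix A.2, which we do not reproduce here): it applies the group softmax of Eq. 3 to the n × N score matrix P̃, multiplies the result with the aggregated group values, and checks the output against the naive route that first restores the full n × n matrix. The toy sizes and random inputs are illustrative only.

```python
# Sketch of Sec. 4.2.2 (our illustration): group softmax (Eq. 3) plus
# embedding aggregation, verified against the naive "restore the full
# n x n matrix, then softmax" computation.
import torch

def group_attention(Q, K_tilde, V, assign, counts):
    """Q: (n, d_k); K_tilde: (N, d_k) group key representatives;
    V: (n, d_v); assign: (n,) group id of each window; counts: (N,)."""
    d_k = Q.shape[-1]
    P_tilde = Q @ K_tilde.T / d_k ** 0.5                  # (n, N)
    # Group softmax (Eq. 3): each group is counted count_k times.
    weights = torch.exp(P_tilde)
    A_tilde = weights / (weights * counts).sum(-1, keepdim=True)
    # Embedding aggregation: V_tilde[g] = sum of V over the windows in group g.
    V_tilde = torch.zeros(len(counts), V.shape[1]).index_add_(0, assign, V)
    return A_tilde @ V_tilde                              # (n, d_v), O(n*N*d)

def naive_restored(Q, K_tilde, V, assign):
    d_k = Q.shape[-1]
    P = Q @ K_tilde[assign].T / d_k ** 0.5                # restored n x n scores
    return torch.softmax(P, dim=-1) @ V

n, N, d = 8, 3, 4
Q, V = torch.randn(n, d), torch.randn(n, d)
assign = torch.randint(0, N, (n,))
counts = torch.bincount(assign, minlength=N).float()
K_tilde = torch.randn(N, d)                               # stand-in group keys
out = group_attention(Q, K_tilde, V, assign, counts)
assert torch.allclose(out, naive_restored(Q, K_tilde, V, assign), atol=1e-5)
```

The assertion holds because, as discussed above, multiplying the group attention row ã_i with the pre-aggregated group values is algebraically identical to multiplying the restored row a_i with V.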
4.3 Error Bound

Group attention produces a group attention matrix Ã which approximates the attention matrix A produced by the classical self-attention with a bounded error, as shown in Lemma 1.

Lemma 1. Let R be the radius of the ball where all key vectors live, and let k̃_i be the representative of the group that contains key k_i. Let Ã denote the full attention matrix restored from the group attention matrix. Suppose the distance between k_i and k̃_i satisfies ||k_i − k̃_i|| ≤ d. Then, for all ε > 1, if d ≤ ln(ε) / (2R), then 1/ε ≤ Ã_i,j / A_i,j ≤ ε.

Lemma 1 shows that the error bound ε of the group attention is determined by the distance d. As discussed in Sec. 5.1, it inspires us to design a strategy to dynamically determine the number of groups N, the most critical parameter of group attention. Please refer to Appendix A.5 for the proof.

4.4 GPU Friendly Grouping Method

In this section, we discuss the implementation of a grouping method. To make group attention efficient and effective, the grouping method has to satisfy the following requirements:
(1) Tight distance bound: to ensure the approximation quality, the distance between each key and its group representative should be minimized according to Lemma 1.
(2) Lightweight: to ensure the performance gain, the grouping method must be lightweight, at worst not exceeding the complexity of group attention itself (O(nN)).
(3) GPU friendly: to take advantage of GPUs, we prefer a grouping method that mainly consists of matrix operations, which can be efficiently executed on a GPU.

To satisfy the above requirements, after a thorough investigation of various clustering algorithms, we design a GPU-friendly K-means [35] as the grouping method.

First, K-means minimizes the overall distance between any object and its cluster center, hence naturally satisfying Requirement 1.

Second, given N centers, in each iteration the time and space complexity of K-means is O(nN). Usually, the iterations continue until convergence. However, we observe that rather than seeking a perfect K-means clustering, running a few iterations is sufficient to get a good grouping for group attention, because typically the later iterations only slightly update the clustering and group attention is robust to such imperfection.

Third, we design a GPU-friendly implementation of K-means. The performance bottleneck of K-means is the distance computation between each vector and its center, that is, |v_i − c_j| = √((v_i − c_j)^2), i ∈ [1, n], j ∈ [1, N], where the bottleneck is the pairwise difference v_i − c_j. We instead use a different formulation: |v_i − c_j| = √(|v_i|^2 + |c_j|^2 − 2 v_i · c_j), i ∈ [1, n], j ∈ [1, N]. In this formulation, the performance bottleneck is v_i · c_j, which can be implemented as a matrix product operation. Although the complexity of the two formulations is the same, on GPUs a matrix product is much more efficient than computing pairwise differences.
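The following sketch (our illustration, with assumed tensor shapes) shows this distance computation written as a single matrix product plus broadcasting, together with one plain K-means assignment/update step; RITA's actual grouping code may differ.

```python
# Sketch of the GPU-friendly distance step in Sec. 4.4 (our illustration).
# Instead of materializing pairwise differences v_i - c_j, we use
# |v_i - c_j|^2 = |v_i|^2 + |c_j|^2 - 2 v_i . c_j, so the bottleneck is a
# single matrix product that runs efficiently on a GPU.
import torch

def assign_to_centers(keys: torch.Tensor, centers: torch.Tensor):
    """keys: (n, d) key vectors; centers: (N, d) cluster centers."""
    sq_dist = (keys * keys).sum(-1, keepdim=True) \
            + (centers * centers).sum(-1) \
            - 2.0 * keys @ centers.T                     # (n, N), one MatMul
    return sq_dist.argmin(dim=-1)                         # group id per key

def kmeans_step(keys, centers):
    # A few iterations suffice in practice (Sec. 4.4): later iterations only
    # slightly update the clustering, and group attention tolerates that.
    assign = assign_to_centers(keys, centers)
    for g in range(centers.shape[0]):                      # recompute centers
        members = keys[assign == g]
        if len(members) > 0:
            centers[g] = members.mean(dim=0)
    return assign, centers
```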
5 ADAPTIVE SCHEDULER

Next, we present the adaptive scheduler of RITA, which addresses the challenges of determining an appropriate number of groups N and, accordingly, the batch size B, as described in the Introduction. Using the dynamic scheduling method we propose, the scheduler automatically determines and adjusts N and B based on the distributional properties of the feature embeddings produced over the iterative training process, while guaranteeing a high-quality attention approximation that meets the requirement of users.

In Sec. 5.1 we show how RITA automatically determines N. Then we introduce in Sec. 5.2 the learning-based method which, given an N, immediately predicts a good batch size.

5.1 Dynamically Determining the Number of Groups N

Without loss of generality, we use one group attention module as an example to show how RITA automatically gets an appropriate N. The adaptive scheduler of RITA starts with a large N and decreases it dynamically. This is because in the training process of RITA, the feature embeddings produced epoch by epoch tend to become increasingly stable and gradually converge, so there is no need to increase N.

RITA reduces the number of groups by merging similar groups. Intuitively, given two groups, we could measure their similarity based on the distance of their centers. If the distance between their centers is smaller than a distance threshold, then the two groups could be merged. However, setting an appropriate distance threshold seems hard, as difficult as setting an appropriate N.

To solve this problem, RITA leverages the error bound of group attention introduced in Sec. 4.3. It only requires users to set an error bound ε, and then uses Lemma 1 to translate ε into a distance threshold d. RITA then uses Lemma 2 to determine if merging some given clusters still meets the error bound threshold ε.

Lemma 2. Denote by c_k the cluster center of cluster_k. Assume the existing grouping satisfies ∀k, max_{x ∈ cluster_k} |c_k − x| ≤ d, thus satisfying an error bound ε by Lemma 1. If there exist m clusters, namely cluster_{k1}, cluster_{k2}, ..., cluster_{km}, satisfying

max_{x ∈ cluster_{ki}} |c_{ki} − c_{kj}| + |x − c_{ki}| ≤ d,  i, j ∈ [1, m]    (4)

then merging them into one cluster still meets the error bound ε.

Please refer to Appendix A.6 for the proof.

Finding the Mergeable Clusters. We formulate the problem of finding mergeable clusters using graph theory:
(1) each cluster is a node in the graph;
(2) if cluster_i and cluster_j satisfy max_{x ∈ cluster_i} |c_i − c_j| + |x − c_i| ≤ d and max_{x ∈ cluster_j} |c_j − c_i| + |x − c_j| ≤ d, there is an undirected edge between node_i and node_j.

In this scenario, finding the maximum number of mergeable clusters is equivalent to finding the minimal clique cover in the corresponding graph, which is an NP-hard problem [24]. Such heavy computation overhead is not acceptable for RITA. We thus offer a simplified solution:
(1) Halve the clusters into two sets S_1, S_2;
(2) If cluster_i ∈ S_1 and cluster_j ∈ S_2 satisfy

max_{x ∈ cluster_i} |c_i − c_j| + |x − c_i| ≤ d,  max_{x ∈ cluster_j} |c_j − c_i| + |x − c_j| ≤ d/2    (5)

then cluster_j is marked;
(3) Decrease the number of clusters by counting the marks in S_2.

In this solution, clusters in S_1 can be regarded as transfer nodes. If (5) holds for (cluster_i ∈ S_1, cluster_{j1} ∈ S_2) and (cluster_i ∈ S_1, cluster_{j2} ∈ S_2), respectively, we have

max_{x ∈ cluster_{j1}} |c_{j1} − c_{j2}| + |x − c_{j1}|
≤ max_{x ∈ cluster_{j1}} |c_{j1} − c_i| + |c_i − c_{j2}| + |x − c_{j1}|
≤ max_{x ∈ cluster_{j1}} |c_{j1} − c_i| + |c_i − c_{j2}| + |x − c_{j1}| + |x − c_{j2}| ≤ d    (6)

Thus (4) holds when merging several clusters in S_2 with one cluster in S_1. As a result, we can greedily merge clusters in S_2, as illustrated in step (3).

Assuming the number of clusters decreases by D after merging, we apply a momentum update [42] on the number of clusters N, as is commonly done in machine learning, to smooth the changes of N and avoid sample selection bias. To be specific: N_new = α(N − D) + (1 − α)N, where α is a hyper-parameter for momentum.

5.2 Dynamically Determining the Batch Size

Because of the dynamic grouping operation, the computational graph in deep learning training [1] varies from sample to sample. As a result, it is impossible to precisely compute a batch's GPU memory usage without actually feeding it into the model. To overcome this problem, RITA learns a batch size prediction function offline; then, at RITA training time, given a number of groups N, RITA uses this function to predict a proper batch size.

When the model architecture and hardware are fixed, the batch size depends on the length of the timeseries L and the average group number among all attention modules N. So RITA samples several (L_i, N_i) pairs and estimates a proper batch size for each pair. More specifically, given a user-defined maximal timeseries length L_max, we randomly sample integral points (L_i, N_i) from the plane {1 ≤ L ≤ L_max, 1 ≤ N ≤ L}. Then we use a binary-search-based algorithm to find the maximal batch size B_i that consumes less than 90% of the available GPU memory, aiming to avoid wasting GPU memory and the risk of out-of-memory (OOM) errors.

Treating these pairs as ground truth labels, we use function fitting [18] to learn the batch size prediction function B = f(L, N), where B is a function of the two variables L and N.

Learning the Prediction Function. We apply curve_fit from SciPy [53] as the function fitting tool to fit the two-variable function B_i = f(L_i, N_i) on the plane {1 ≤ L ≤ L_max, 1 ≤ N ≤ L}.

We observe that applying one function to the whole plane incurs a huge estimation error. So we develop a dynamic programming (DP) method to divide the plane into several sub-planes and apply a distinct function to each sub-plane. It is optimal in minimizing the total estimation error over all sub-planes.

With the learned prediction function f, we can estimate a proper batch size for any (L, N) during training, even if it is not seen in the sampled (L_i, N_i) pairs.

The Algorithms and Optimality Proof. Please refer to Appendix A.3 for the pseudocode of the binary-search-based algorithm, the description of the DP method for plane division, and the proof of its optimality.
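As an illustration of the function-fitting step, the sketch below feeds a handful of (L_i, N_i, B_i) samples to SciPy's curve_fit. The inverse-bilinear functional form and the sample values are made-up assumptions for the example; the paper's actual candidate function set and the DP-based sub-plane division are not reproduced here.

```python
# Sketch of the offline batch-size fitting in Sec. 5.2 (our illustration).
# The functional form below is an assumed prior, and the (L, N, B) samples
# are made up; in RITA they come from the binary-search probe that finds the
# largest batch size using < 90% of GPU memory.
import numpy as np
from scipy.optimize import curve_fit

def batch_size_model(LN, a, b, c):
    L, N = LN
    return a / (L * N + b * L + c)          # assumed functional form

L = np.array([2000, 2000, 6000, 6000, 10000, 10000], dtype=float)
N = np.array([64, 512, 64, 512, 64, 512], dtype=float)
B = np.array([256, 96, 80, 28, 48, 16], dtype=float)     # made-up labels

params, _ = curve_fit(batch_size_model, (L, N), B,
                      p0=(1e8, 100.0, 1.0), maxfev=10000)
predict_B = lambda L, N: int(batch_size_model((L, N), *params))
print(predict_B(8000, 256))                 # batch size for an unseen (L, N)
```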
6 EVALUATION

Our experimental study focuses on the following questions:
1. Effectiveness and efficiency of RITA: How does RITA compare with other Transformer-based methods and traditional timeseries representation learning methods in accuracy and efficiency?
2. Ablation Study: How do the key techniques of RITA work?

6.1 Experimental Setup

Datasets. We evaluate RITA on classification and imputation tasks using 5 multi-variate and 3 uni-variate timeseries datasets.
• WISDM [55] is a popular multivariate timeseries dataset generated from the accelerometer in a mobile phone. The subjects performed 18 daily activities (e.g., walking, jogging). The dataset was collected from 51 subjects and the sampling rate is 20 Hz.
• HHAR [46] contains accelerometer sensing data collected from 9 users performing 5 activities with 12 different smartphones (varying in sampling rate). This increases the complexity of the task and thus tests the model's robustness.
• RWHAR The RealWorld HAR dataset [48] covers 15 subjects performing 8 locomotion-style activities. Each subject wears the sensors for approximately ten minutes. The sampling rate is 50 Hz.
• ECG [34] consists of 10,000 ECG recordings for arrhythmia classification. Each recording has an uncertain length ranging from 6 to 60 seconds, sampled at 500 Hz. The ECG recordings correspond to 9 types of heart problems such as atrial fibrillation (AF) and premature atrial contraction (PAC).
• MGH [6] is an EEG dataset collected by Mass. General Hospital. Each timeseries corresponds to the EEG data observed from one patient during their stay in the ICU for a couple of days. The EEG monitoring produced data with 20 channels. The sampling rate is 200 Hz, so it produces very long timeseries.
• WISDM*/HHAR*/RWHAR* are three uni-variate datasets derived by picking one channel from WISDM/HHAR/RWHAR.

Training/Validation Data Generation. We apply a sliding window on the raw timeseries to get training/validation samples. The size of the sliding window is set to 200 on the small datasets (WISDM, HHAR, RWHAR), 2000 on the medium-size dataset (ECG), and 10,000 on the large dataset (MGH). Table 1 shows the statistics of the generated datasets. They are randomly split into training/validation sets in a proportion of 0.9/0.1. In the "pretraining + few-label finetuning" scenario, we use 100 labeled samples per class for finetuning. We guarantee that the training set does not overlap with the validation set.

Dataset | Train. Size | Valid. Size | Length | Channels | Classes
WISDM | 28,280 | 3,112 | 200 | 3 | 18
HHAR | 20,484 | 2,296 | 200 | 3 | 5
RWHAR | 27,253 | 3,059 | 200 | 3 | 8
ECG | 31,091 | 3,551 | 2000 | 12 | 9
MGH | 8,550 | 950 | 10000 | 21 | N/A
Table 1: The statistics of the datasets.

Alternative Methods. We compare RITA against the SOTA Transformer-based timeseries representation learning method TST [61]. To evaluate our group attention (referred to as Group Attn.), we develop three baselines by replacing the group attention component in RITA with the classic vanilla self-attention [52] (referred to as Vanilla) and two SOTA methods that reduce the complexity of self-attention by approximation in NLP, namely Performer [10] (referred to as Performer) and Linformer [54] (referred to as Linformer). Similar to our proposed Group Attn., Vanilla, Performer, and Linformer all use RITA's time-aware convolution operation (Sec. 3) to turn timeseries segments into input feature vectors.

We also compare Group Attn. against GRAIL [40], which is the SOTA of the non-deep learning methods for timeseries representation learning. GRAIL supports classification tasks by feeding the learned representations into a Support Vector Machine [12] or K-Nearest Neighbor [17] classifier. Note that GRAIL only targets uni-variate timeseries and cannot support imputation tasks.

Methodology. We mainly focus on two downstream tasks:
(1) Classification. First, we train Group Attn. and the baselines with full labels from scratch to test the effectiveness of the RITA framework and the approximation quality of our group attention. Second, to measure the effectiveness of self-supervised pretraining, we evaluate the accuracy of training on few labeled timeseries with/without pretraining on large scales of unlabeled timeseries. To be specific, we split the training set into a pretraining set and a finetuning set, with very few data in the latter (100 labeled samples per class in our experiment). We train the model on the cloze pretraining task with a mask rate p = 0.2. Then we train two classification models using the finetuning set, either based on the pretrained version or from scratch. We repeat the experiment 5 times with random data splits and report the median accuracy.
(2) Imputation. We run the imputation task on the datasets used in classification as well as the large unlabeled MGH dataset, and measure the mean square error and the absolute imputation error. To get timeseries with missing values, we randomly mask the values with an expected mask rate of p = 0.2. The masked values are replaced with a special value.

Finally, to evaluate Group Attn.'s benefit in efficiency, the total time of forward computation, backward propagation, and grouping is measured for all methods in all the experiments. To save space, we only report the average training time per epoch here and refer readers to Appendix A.8 for the inference time.

We first compare against the Transformer-based methods on multi-variate datasets (Sec. 6.2, 6.3), then compare against the non-deep learning method GRAIL on uni-variate datasets (Sec. 6.4).

Configuration. Please refer to Appendix A.1 for the experiment configuration and hyper-parameter settings.

[Figure 3: Full-label classification results (multi-variate data). (a) Effectiveness (accuracy); (b) Efficiency (training time/sec).]

6.2 Effectiveness: Transformer-Based Methods

We first evaluate the quality of the models trained with full labels from scratch. We then show how the pretraining of RITA increases the accuracy of the downstream tasks.

6.2.1 Full-label Training (Multi-variate Classification)
The results shown in Figure 3(a) give us the following observations:
(1) RITA's advantage over TST. On all four datasets for the classification tasks, Group Attn. and the other three baselines that use the RITA architecture (Vanilla, Performer, and Linformer) outperform TST. In particular, Group Attn. outperforms TST by 49 percentage points on the ECG dataset (88.48% vs 39.93%) with long timeseries. Two deficiencies in TST may cause its poor performance on long timeseries. First, TST concatenates the output embedding vector of each time stamp, then uses a linear classifier to do classification on the concatenated vector. When the timeseries is long, the linear classifier has so many parameters that it tends to overfit easily. Second, TST replaces Layer Normalization in the vanilla Transformer with Batch Normalization. When the timeseries is long, it can only accommodate a small number of timeseries in each batch, leading to bias in Batch Normalization.
(2) Group attention's advantage over other attention mechanisms. Group Attn. is better than Performer and Linformer on 3 out of 4 datasets for classification. Although Linformer works slightly better than Group Attn. on the ECG dataset (90.37% vs 88.84%), its performance is the worst in all other cases compared to any other RITA-based method. Vanilla computes the attention scores precisely, so it is expected to work well. However, Group Attn. outperforms Vanilla on WISDM (87.50% vs 86.95%) and is very close to it on the other 3 datasets. This suggests that group attention's approximation quality is good.

6.2.2 Pretraining + Few-label Finetuning (Multi-variate Classification)
The results shown in Table 3 give us the following observations:
(1) Pretraining is effective. Pretraining always leads to better accuracy than training with a few labels from scratch. In particular, on the WISDM data all the methods using the RITA architecture increase the accuracy by at least 10%. This is impressive considering we do not have a very large unlabeled pre-training set to use.
(2) RITA's advantage over TST. Our Group Attn. and the other three baselines using the RITA architecture (Vanilla, Performer, and Linformer) significantly outperform TST on all four classification datasets, by up to 25 percentage points.
(3) Group attention's advantage over other attention mechanisms. Group Attn. is better than Performer and Linformer on 3 out of 4 datasets. When compared to Vanilla, Group Attn. is better on HHAR and ECG, and comparable on the other two, further confirming its high approximation quality. Further, we notice that Linformer struggles in this setting: on average its accuracy is worse than Vanilla's by 8.22% and Group Attn.'s by 8.01%. This is because the low-rank projection operation introduces extra model parameters, making Linformer more prone to overfitting, which is especially harmful when there are only a few labeled training samples.

6.2.3 Full-dataset Training (Multi-variate Imputation)
Similar to the classification tasks, the results of the imputation tasks (Table 2) show that Group Attn. consistently outperforms the baselines in training time while achieving comparable or better MSE. Again, on the large MGH dataset (length = 10,000), TST and Vanilla fail due to out-of-memory (OOM) errors. The methods using the RITA framework (Group Attn., Performer, Linformer) all achieve very low MSE (are highly accurate). Among them Linformer is the worst.

6.3 Efficiency: Transformer-based Methods

We measure efficiency by the average training time per epoch, including the cost of the forward computation, the backward propagation, and the grouping overhead. We first show the results on all 5 datasets in Sec. 6.3.1. We then vary the length of the timeseries on the MGH dataset to show group attention's scalability on long timeseries in Sec. 6.3.2.

6.3.1 Training Time: All Multi-variate Datasets
The results in Fig. 3(b) and Table 2 lead to the observations below:
(1) Vanilla self-attention is not scalable. On average, it takes 2-3 minutes to train one epoch when the length of the timeseries is only 200 (WISDM, HHAR, RWHAR), takes over 15 minutes when the length increases to 2,000 (ECG), and fails on the long MGH data when the length reaches 10,000 due to running out of GPU memory.
Dataset | Length | TST [61] (MSE, Time/s) | Vanilla (MSE, Time/s) | Performer (MSE, Time/s) | Linformer (MSE, Time/s) | Group Attn. (MSE, Time/s)
WISDM | 200 | 13.30, 150.3 | 3.240, 178.1 | 3.449, 162.6 | 3.852, 141.9 | 3.277, 136.7
HHAR | 200 | 1.085, 78.2 | 0.2968, 97.4 | 0.2980, 82.6 | 0.3198, 81.1 | 0.2974, 73.3
RWHAR | 200 | 0.0882, 83.9 | 0.0478, 108.1 | 0.0489, 89.1 | 0.0572, 98.4 | 0.0478, 81.3
ECG | 2000 | 0.0905, 696.3 | 0.0037, 857.9 | 0.0033, 270.2 | 0.0035, 291.38 | 0.0038, 164.36
MGH | 10000 | N/A, N/A | N/A, N/A | 0.00014, 356.2 | 0.00088, 404.9 | 0.00042, 54.4
Table 2: Imputation results (multi-variate data). The best results are marked with bold.

Dataset | Pretrain Size | TST [61] (Scratch, Pre.) | Vanilla (Scratch, Pre.) | Performer (Scratch, Pre.) | Linformer (Scratch, Pre.) | Group Attn. (Scratch, Pre.)
WISDM | 62,231 | 49.13%, 50.03% | 66.16%, 75.89% | 66.09%, 73.97% | 50.12%, 67.44% | 62.56%, 75.06%
HHAR | 68,294 | 72.56%, 75.30% | 75.60%, 81.35% | 76.52%, 80.70% | 65.94%, 76.52% | 76.17%, 82.62%
RWHAR | 63,599 | 69.46%, 80.41% | 85.68%, 91.14% | 87.54%, 91.33% | 81.03%, 86.33% | 86.13%, 89.63%
ECG | 561,358 | 20.98%, 27.99% | 42.05%, 46.16% | 43.34%, 45.58% | 27.19%, 31.34% | 42.58%, 46.39%
Table 3: Pretrain + few-label finetuning results. The best results are marked with bold.

(2) Group Attn.'s advantage over all other attention mechanisms. As we have shown in Sec. 6.2, Group Attn. is more accurate than Performer and Linformer in classification and imputation tasks, while Group Attn. is always faster than Performer, Linformer, and all other baselines on all 5 multi-variate datasets, thus a win-win.
(3) The longer the timeseries, the larger the speedup. On the medium-sized ECG dataset with a length of 2,000, Group Attn. has a speedup of 3.86/1.36/2.27 compared to Vanilla/Performer/Linformer. When the length increases to 10,000, the speedup on the MGH dataset increases to 6.59/7.48 compared to Performer/Linformer (Vanilla and TST fail in this case) on the imputation task (Table 2). However, even on the short WISDM, HHAR, and RWHAR datasets, Group Attn. still consistently outperforms the other methods, confirming that it does not introduce much overhead. This is because when the length of the timeseries gets longer, Group Attn. gets more opportunities to find windows with similar properties.

6.3.2 Training Time: Varying the Length
In this experiment, we truncate the original MGH timeseries into sequences with lengths of 2000/4000/6000/8000/10000, and compare Group Attn. against Vanilla and the other attention mechanisms. Vanilla cannot handle sequences longer than 8000.

The results in Fig. 4 again show that the longer the timeseries, the larger the speedup. With comparable MSE, Group Attn. outperforms Vanilla by 63X. Moreover, as the length increases from 2000 to 10000, the training time of Group Attn. only increases from 31.2 seconds to 54.4 seconds per epoch. The reason is that as the timeseries becomes longer, there are more grouping opportunities because of the similarity of the timeseries segments.

[Figure 4: Varying the lengths of timeseries. (a) Effectiveness (MSE); (b) Efficiency (training time/sec).]

[Figure 5: Comparison to non-deep learning method (uni-variate data): accuracy and training time/sec.]

6.4 Comparison to Non-deep Learning Methods

We compare against GRAIL, the SOTA of non-deep learning timeseries representation learning. We use the three uni-variate datasets, because GRAIL only targets uni-variate timeseries.

The results in Fig. 5 show that on all 3 datasets RITA significantly outperforms GRAIL in accuracy, by 45, 16, and 21 percentage points, because of the expressive power of the Transformer. Moreover, thanks to the GPU-friendly design of RITA, it is at least 2× faster than GRAIL in training time.

6.5 Ablation Study

6.5.1 Adaptive Scheduler
To evaluate the effectiveness of RITA's adaptive scheduler (Sec. 5), we compare it against a baseline using a fixed group number N. We vary N and the error bound threshold ε used by RITA.

From the results in Table 4 we get the following observations:
(1) Adaptive Scheduler is better than fixed N. Training with the Adaptive Scheduler already achieves better or comparable performance compared to the best performing N. More specifically, on the MGH dataset, the dynamic scheduler always achieves better accuracy and is much faster compared to fixed N. On the ECG dataset, although fixed N is slightly better than the adaptive scheduler in accuracy when setting N to 512, it runs much slower than the adaptive scheduler. Of course, finding the best N that balances accuracy and running time requires careful tuning.
(2) Adaptive Scheduler is tuning free. It is robust in both accuracy and running time when ε varies, while the results of fixed N vary significantly when the value of N changes. Therefore, the Adaptive Scheduler frees users from carefully tuning the ε threshold, while it is hard to find an appropriate N for a given dataset.
Dataset | Task | Scheduler | Parameter | Metric | Time/s
ECG | Class. | Dynamic | ε = 1.5 | 88.34% | 292.5
ECG | Class. | Dynamic | ε = 2 | 88.48% | 236.8
ECG | Class. | Dynamic | ε = 3 | 87.83% | 216.8
ECG | Class. | Fixed | N = 64 | 87.50% | 255.2
ECG | Class. | Fixed | N = 128 | 88.96% | 297.2
ECG | Class. | Fixed | N = 256 | 88.82% | 414.1
ECG | Class. | Fixed | N = 512 | 90.03% | 662.6
ECG | Class. | Fixed | N = 1024 | 88.65% | 873.7
MGH | Imput. | Dynamic | ε = 1.5 | 0.00041 | 60.7
MGH | Imput. | Dynamic | ε = 2 | 0.00040 | 57.9
MGH | Imput. | Dynamic | ε = 3 | 0.00042 | 54.4
MGH | Imput. | Fixed | N = 128 | 0.00054 | 128.6
MGH | Imput. | Fixed | N = 256 | 0.00053 | 190.2
MGH | Imput. | Fixed | N = 512 | 0.00049 | 240.8
MGH | Imput. | Fixed | N = 1024 | 0.00046 | 323.3
Table 4: Adaptive Scheduling vs. Fixed N. (Metric: accuracy for the classification task, MSE for the imputation task; Time: training time per epoch in seconds.)

Pretrain Data Size | Few-label Accuracy
N/A (no pretraining) | 62.56%
12,446 | 72.94%
24,892 | 72.78%
37,338 | 74.10%
49,784 | 74.22%
62,231 | 75.06%
Table 5: RITA Pretraining: increasing sizes of the pretrain set.

6.5.2 The Sizes of the Pretraining Data
Next, we evaluate how the amount of unlabeled data influences the effectiveness of pretraining. To get empirical results, we pretrain RITA on the WISDM dataset with 20%/40%/60%/80% of the pretraining data and finetune each pretrained model with 100 labels per class.

The results in Table 5 show that: (1) The more pretraining data, the larger the improvement. The accuracy increases with the size of the pretraining data. (2) Diminishing marginal utility. The first 20% of the pretraining data gives a 10.38% improvement in accuracy (72.94% vs 62.56%), while the remaining 80% of the pretraining data only gives an additional improvement of 2.12% (75.06% vs 72.94%).

7 RELATED WORK

7.1 Timeseries Analytics

There is a great deal of prior work on timeseries analytics methods. This work can be divided into three categories: (1) non-deep learning methods; (2) CNN/RNN-based deep learning methods; and (3) Transformer-based deep learning methods.

Traditional Methods. These methods, such as TS-CHIEF [45], HIVE-COTE [33], and ROCKET [15], have achieved notable performance on public datasets. Despite that, traditional methods suffer from one or more issues: they (1) rely on expert knowledge for feature extraction; (2) incur heavy computation cost and are inappropriate for GPU devices; (3) support only uni-variate timeseries; or (4) perform classification solely. Some work [61] shows that Transformer-based methods outperform these traditional methods, especially on multi-variate timeseries.

In particular, as the SOTA of timeseries representation learning, GRAIL [40] extracts landmarks from the data and computes the representations with a combination of the landmarks. However, GRAIL only supports uni-variate timeseries. Our experiments (Sec. 6.4) show that RITA significantly outperforms GRAIL in both effectiveness and efficiency on uni-variate timeseries.

CNN/RNN-based Deep Learning Methods. CNN-based methods, such as InceptionTime [21] and ResNet [19], are good at classification tasks, but cannot handle generative tasks such as forecasting because of the inductive bias of convolutional networks. RNN-based methods, such as BRITS [7] and DeepAR [44], are capable of classification, regression, and generation. However, the recurrent structure brings a lot of problems: (1) it limits the model's ability to capture long-range correlations; (2) it is notoriously difficult to train [41] because of the gradient vanishing and exploding problem. As a result, such methods can hardly scale to very long timeseries.

Transformer-based Deep Learning Methods. Given that the Transformer is the backbone of choice in almost all sequence modeling tasks, some effort has been made to apply Transformers to timeseries analytics. Targeting forecasting of uni-variate timeseries, LogTrans [30] introduced a log-sparsity assumption to attention computation. Informer [62] pushes LogTrans a step further and scales forecasting to multi-variate timeseries. Autoformer [57] performs forecasting by decomposing timeseries into two parts, i.e., the trend part and the seasonal part.

For imputation tasks, CDSA [37] outperforms statistical methods and the SOTA RNN-based method BRITS [7] on 3 public and 2 competition datasets. For timeseries classification, AutoTransformer [43] performs architecture search to adapt to the tasks in different domains. For timeseries anomaly detection, Anomaly Transformer [58] outperforms many widely-used methods such as OmniAnomaly [47], assuming the attention score maps show a Gaussian distribution.

All of these works are designed for specific tasks, rather than functioning as a representation learning framework to serve different downstream tasks. To fill this gap, some researchers proposed a Transformer-based architecture called TST [61]. Like RITA, TST supports regression, classification, and unsupervised learning through the "cloze test" pretraining task on timeseries. However, TST directly uses the classical vanilla self-attention, and is thus not scalable to long timeseries, as shown in our experiments (Sec. 6.3.2).

7.2 Efficient Transformers

The need to improve the scalability of Transformers has led to more efficient variations of Transformers, especially for accommodating long text data in NLP [49].

Introducing fixed/random patterns to the self-attention mechanism is an intuitive idea. Sparse Transformer [9] and Longformer [3] only compute attention at fixed intervals. ETC [2] and BigBird [60] use global-local attention: the attention computation is limited within a fixed radius, while some auxiliary tokens are added to attend/get attended to globally. The deficiency of fixed attention patterns is obvious: they heavily depend on users to provide an optimal setting.

To decrease the reliance on human labor, some works seek to introduce learnable/adaptive attention patterns instead of fixed patterns. Reformer [26] proposed only computing the dominant attention terms based on the observation of sparsity in the attention matrix of language/image data. Such sparsity is intuitive in language data, in which a word's attention mainly focuses on the nearby sentences. However, attention in timeseries data shows strong seasonal patterns rather than sparse patterns, mainly as
result of the periodicity of timeseries data. Therefore, such works [13] David R Cox. 1958. The regression analysis of binary sequences. Journal of the
do not work well for timeseries.

Apart from introducing attention patterns, some works seek to solve this problem with applied mathematics techniques. Linformer [54] performs a projection to decrease the size of the query, key, and value matrices before the attention computation, because the attention matrix tends to be low-rank. Performer [10] uses linear functions to approximate the softmax kernel, making the attention computation commutative. When the sequence length is far greater than the dimension of the embedding vectors, Performer benefits from changing the order of the matrix multiplications. Linformer and Performer do not depend on properties unique to language data and thus potentially fit timeseries better than other techniques, which is why we compared against them in our experiments. However, as shown in Sec. 6, our group attention significantly outperforms both of them in accuracy and efficiency (training time), because group attention fully leverages the periodicity of timeseries.

8 CONCLUSION

In this work, we presented RITA, an automatic, self-supervised, and scalable timeseries analytics tool. RITA effectively adapts the Transformer, popular in NLP, to timeseries analytics. As the key component of RITA, group attention eliminates the performance bottleneck of the classical self-attention mechanism, successfully scaling RITA to highly complex, long timeseries data. Our experiments confirm that RITA speeds up the state-of-the-art by up to 63X while delivering better accuracy.
REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016).
[2] Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. 2020. ETC: Encoding long and structured inputs in transformers. arXiv preprint arXiv:2004.08483 (2020).
[3] Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150 (2020).
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[5] C Bui, N Pham, A Vo, A Tran, A Nguyen, and T Le. 2017. Time series forecasting for healthcare diagnosis and prognostics with the focus on cardiovascular diseases. In International Conference on the Development of Biomedical Engineering in Vietnam. Springer, 809–818.
[6] Lei Cao, Wenbo Tao, Sungtae An, Jing Jin, Yizhou Yan, Xiaoyu Liu, Wendong Ge, Adam Sah, Leilani Battle, Jimeng Sun, Remco Chang, M. Brandon Westover, Samuel Madden, and Michael Stonebraker. 2019. Smile: A System to Support Machine Learning on EEG Data at Scale. Proc. VLDB Endow. 12, 12 (2019), 2230–2241.
[7] Wei Cao, Dong Wang, Jian Li, Hao Zhou, Lei Li, and Yitan Li. 2018. Brits: Bidirectional recurrent imputation for time series. Advances in Neural Information Processing Systems 31 (2018).
[8] Chris Chatfield. 2000. Time-series forecasting. Chapman and Hall/CRC.
[9] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509 (2019).
[10] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. 2020. Rethinking attention with performers. arXiv preprint arXiv:2009.14794 (2020).
[11] Andrew A Cook, Göksel Mısırlı, and Zhong Fan. 2019. Anomaly detection for IoT time-series data: A survey. IEEE Internet of Things Journal 7, 7 (2019), 6481–6494.
[12] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning 20, 3 (1995), 273–297.
[13] … Royal Statistical Society: Series B (Methodological) 20, 2 (1958), 215–232.
[14] Benjamin F Crabtree, Subhash C Ray, Priscilla M Schmidt, Patrick T O'Connor, and David D Schmidt. 1990. The individual over time: time series applications in health care research. Journal of Clinical Epidemiology 43, 3 (1990), 241–260.
[15] Angus Dempster, François Petitjean, and Geoffrey I. Webb. 2020. ROCKET: exceptionally fast and accurate time series classification using random convolutional kernels. Data Min. Knowl. Discov. 34, 5 (2020), 1454–1495.
[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers). 4171–4186.
[17] Evelyn Fix and Joseph Lawson Hodges. 1989. Discriminatory analysis. Nonparametric discrimination: Consistency properties. International Statistical Review/Revue Internationale de Statistique 57, 3 (1989), 238–247.
[18] Philip George Guest and Philip George Guest. 2012. Numerical Methods of Curve Fitting. Cambridge University Press.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[20] Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. 2019. Deep learning for time series classification: a review. Data Mining and Knowledge Discovery 33, 4 (2019), 917–963.
[21] Hassan Ismail Fawaz, Benjamin Lucas, Germain Forestier, Charlotte Pelletier, Daniel F Schmidt, Jonathan Weber, Geoffrey I Webb, Lhassane Idoumghar, Pierre-Alain Muller, and François Petitjean. 2020. InceptionTime: Finding AlexNet for time series classification. Data Mining and Knowledge Discovery 34, 6 (2020), 1936–1962.
[22] Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2010. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 1 (2010), 117–128.
[23] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7, 3 (2019), 535–547.
[24] Richard M Karp. 1972. Reducibility among combinatorial problems. In Complexity of Computer Computations. Springer, 85–103.
[25] Eamonn Keogh, Kaushik Chakrabarti, Michael Pazzani, and Sharad Mehrotra. 2001. Dimensionality reduction for fast similarity search in large time series databases. Knowledge and Information Systems 3, 3 (2001), 263–286.
[26] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451 (2020).
[27] John Kraft and Arthur Kraft. 1977. Determinants of common stock prices: A time series analysis. The Journal of Finance 32, 2 (1977), 417–425.
[28] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger (Eds.), Vol. 25. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
[29] Oscar D Lara and Miguel A Labrador. 2012. A survey on human activity recognition using wearable sensors. IEEE Communications Surveys & Tutorials 15, 3 (2012), 1192–1209.
[30] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. 2019. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Advances in Neural Information Processing Systems 32 (2019).
[31] T Warren Liao. 2005. Clustering of time series data—a survey. Pattern Recognition 38, 11 (2005), 1857–1874.
[32] Rakesh Agrawal, King-Ip Lin, Harpreet S. Sawhney, and Kyuseok Shim. 1995. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In Proceedings of the 21st International Conference on Very Large Data Bases. 490–501.
[33] Jason Lines, Sarah Taylor, and Anthony Bagnall. 2018. Time Series Classification with HIVE-COTE: The Hierarchical Vote Collective of Transformation-Based Ensembles. ACM Trans. Knowl. Discov. Data 12, 5, Article 52 (2018), 35 pages.
[34] Feifei Liu, Chengyu Liu, Lina Zhao, Xiangyu Zhang, Xiaoling Wu, Xiaoyan Xu, Yulin Liu, Caiyun Ma, Shoushui Wei, Zhiqiang He, et al. 2018. An open access database for evaluating the algorithms of electrocardiogram rhythm and morphology abnormality detection. Journal of Medical Imaging and Health Informatics 8, 7 (2018), 1368–1373.
[35] Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28, 2 (1982), 129–137.
[36] Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
[37] Jiawei Ma, Zheng Shou, Alireza Zareian, Hassan Mansour, Anthony Vetro, and Shih-Fu Chang. 2019. CDSA: cross-dimensional self-attention for multivariate, geo-tagged time series imputation. arXiv preprint arXiv:1905.09904 (2019).
[38] Yu A Malkov and Dmitry A Yashunin. 2018. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 4 (2018), 824–836.
[39] Tripti Negi and Veena Bansal. 2005. Time series: Similarity search and its applications. In Proceedings of the International Conference on Systemics, Cybernetics and Informatics: ICSCI-04, Hyderabad, India. 528–533.
[40] John Paparrizos and Michael J Franklin. 2019. Grail: efficient time-series representation learning. Proceedings of the VLDB Endowment 12, 11 (2019), 1762–1777.
[41] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning. PMLR, 1310–1318.
[42] Ning Qian. 1999. On the momentum term in gradient descent learning algorithms. Neural Networks 12, 1 (1999), 145–151.
[43] Yankun Ren, Longfei Li, Xinxing Yang, and Jun Zhou. 2022. AutoTransformer: Automatic Transformer Architecture Design for Time Series Classification. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 143–155.
[44] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. 2020. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting 36, 3 (2020), 1181–1191.
[45] Ahmed Shifaz, Charlotte Pelletier, François Petitjean, and Geoffrey I. Webb. 2020. TS-CHIEF: a scalable and accurate forest algorithm for time series classification. Data Mining and Knowledge Discovery 34 (2020), 742–775.
[46] Allan Stisen, Henrik Blunck, Sourav Bhattacharya, Thor Siiger Prentow, Mikkel Baun Kjærgaard, Anind Dey, Tobias Sonne, and Mads Møller Jensen. 2015. Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems. 127–140.
[47] Ya Su, Youjian Zhao, Chenhao Niu, Rong Liu, Wei Sun, and Dan Pei. 2019. Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2828–2837.
[48] Timo Sztyler and Heiner Stuckenschmidt. 2016. On-body localization of wearable devices: An investigation of position-aware activity recognition. In 2016 IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE, 1–9.
[49] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient transformers: A survey. ACM Computing Surveys (CSUR) (2020).
[50] Mingyan Teng. 2010. Anomaly detection on time series. In 2010 IEEE International Conference on Progress in Informatics and Computing, Vol. 1. IEEE, 603–608.
[51] Patrick A Thompson. 1990. An MSE statistic for comparing forecast accuracy across series. International Journal of Forecasting 6, 2 (1990), 219–227.
[52] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. 5998–6008.
[53] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. 2020. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17 (2020), 261–272. https://doi.org/10.1038/s41592-019-0686-2
[54] Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768 (2020).
[55] Gary M Weiss, Kenichi Yoneda, and Thaier Hayajneh. 2019. Smartphone and smartwatch-based biometrics using activities of daily living. IEEE Access 7 (2019), 133190–133202.
[56] Qingsong Wen, Kai He, Liang Sun, Yingying Zhang, Min Ke, and Huan Xu. 2021. RobustPeriod: Robust Time-Frequency Mining for Multiple Periodicity Detection. In Proceedings of the 2021 International Conference on Management of Data (Virtual Event, China) (SIGMOD '21). Association for Computing Machinery, New York, NY, USA, 2328–2337. https://doi.org/10.1145/3448016.3452779
[57] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. 2021. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems 34 (2021), 22419–22430.
[58] Jiehui Xu, Haixu Wu, Jianmin Wang, and Mingsheng Long. 2021. Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy. arXiv preprint arXiv:2110.02642 (2021).
[59] Dianmin Yue, Xiaodan Wu, Yunfeng Wang, Yue Li, and Chao-Hsien Chu. 2007. A review of data mining-based financial fraud detection research. In 2007 International Conference on Wireless Communications, Networking and Mobile Computing. IEEE, 5519–5522.
[60] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems 33 (2020), 17283–17297.
[61] George Zerveas, Srideepika Jayaraman, Dhaval Patel, Anuradha Bhamidipaty, and Carsten Eickhoff. 2021. A Transformer-based Framework for Multivariate Time Series Representation Learning. In KDD '21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14-18, 2021. 2114–2124.
[62] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 2021. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of AAAI.

A APPENDIX: SUPPLEMENTARY MATERIAL

A.1 Experiment Configuration and Hyper-parameter Settings

Configuration. All models were trained on an NVIDIA Tesla V100 16GB GPU. All methods are optimized with AdamW [36], with both the starting learning rate and the weight decay parameter set to 1e-4. In the full-label training scenario, we train the models for 100 epochs. In the “pretraining + few-label fine-tuning” scenario, as the pretrained models require fewer epochs to converge [61], we train the models for 50 epochs. For a fair comparison, the baselines use the maximal batch size that fits within the GPU's capacity during training.

As for the model hyper-parameters, RITA and the baselines use a Transformer structure that balances Vanilla's accuracy and efficiency: an 8-layer stack of 2-head attention with hidden vectors of dimension 64. The convolution kernel size is set to 5 by default. We set the error bound threshold (ε, Sec. 5.1) of Group Attention to 2, as it balances accuracy and efficiency well across all datasets. Because Linformer requires users to set the size of its projection matrix, in each setting we choose an accuracy-efficiency balancing value among {64, 128, 256, 512}.
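For reference, the optimizer setup above amounts to only a few lines of PyTorch. The following is a minimal sketch rather than RITA's released training script; model, train_loader, and loss_fn are assumed placeholders standing in for the task-specific pieces.

import torch

def make_optimizer(model):
    # AdamW with starting learning rate and weight decay both 1e-4, as in A.1.
    return torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

def train(model, train_loader, loss_fn, epochs=100):  # 100 epochs for full-label training
    opt = make_optimizer(model)
    model.train()
    for _ in range(epochs):
        for x, y in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()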
A.2 Efficient Computation of Group Attention

Algorithm 1 Efficient Computation of Group Attention
Require: Q, V, R, COUNT, BELONG
Ensure: Q, V ∈ R^{n×d}, R ∈ R^{N×d}, COUNT ∈ N^N, BELONG ∈ N^n
1: function GROUP_ATTENTION(Q, V, R)
2:   for i = 0 → N − 1 do
3:     ṽ_i ← \sum_{j=0}^{n−1} (BELONG_j == i) v_j
4:   P̃ ← Q R^T
5:   for i = 0 → n − 1 do
6:     for j = 0 → N − 1 do
7:       w_{i,j} ← exp(P̃_{i,j}) · COUNT_j
8:   for i = 0 → n − 1 do
9:     s_i ← \sum_{j=0}^{N−1} w_{i,j}
10:  for i = 0 → n − 1 do
11:    o_i ← \sum_{j=0}^{N−1} (exp(P̃_{i,j}) / s_i) · ṽ_j
12:  return O

In Alg. 1, we denote COUNT_i as the size of the i-th group, N as the number of groups, r_i as the representative key of the i-th group, R as the matrix consisting of all r_i, and BELONG_i as the group that k_i belongs to. Q and V are the packing matrices of query vectors and value vectors as described in Sec. 2. Alg. 1 outputs the packing matrix O of the new feature embeddings {o_1, ..., o_n}, where o_i corresponds to the feature embedding of win_i. Lines 2-3 implement the embedding aggregation operation, while Lines 8-11 implement the group softmax function.
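Alg. 1 translates directly into matrix operations. The NumPy sketch below is an illustrative, unoptimized rendering rather than the optimized implementation used in RITA; it assumes the grouping outputs R, COUNT, and BELONG have already been produced by the clustering step.

import numpy as np

def group_attention(Q, V, R, COUNT, BELONG):
    # Q, V: (n, d); R: (N, d) representative keys; COUNT: (N,) group sizes;
    # BELONG: (n,) group id of each key. Returns O: (n, d), as in Alg. 1.
    n, d = Q.shape
    N = R.shape[0]
    # Lines 2-3: aggregate the value vectors of each group.
    V_tilde = np.zeros((N, d))
    np.add.at(V_tilde, BELONG, V)
    # Line 4: n x N score matrix against the representative keys.
    P_tilde = Q @ R.T
    # Lines 5-9: group softmax denominator, weighting each group by its size.
    W = np.exp(P_tilde) * COUNT            # w_{i,j} = exp(P~_{i,j}) * COUNT_j
    s = W.sum(axis=1, keepdims=True)       # s_i
    # Lines 10-11: output embeddings.
    return (np.exp(P_tilde) / s) @ V_tilde

As a sanity check, when every window forms its own group (N = n, R equal to the key matrix, and COUNT all ones), this sketch reduces to ordinary (unscaled) softmax attention.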
A.3 The Algorithms and Optimality Proof for Dynamically Determining Batch Size

Algorithm 2 Binary Search for Batch Size
Require: L, N
Ensure: 1 ≤ L ≤ L_max, 1 ≤ N ≤ L
1: function BINARY_SEARCH(L, N)
2:   L ← 1
3:   R ← MaxBatchSize
4:   data ← RandomTimeSeries of length L
5:   B_temporal ← ⌊(L + R)/2⌋
6:   while L ≤ R do
7:     Input ← data × B_temporal
8:     ModelForward(Input)
9:     ModelBackward
10:    u ← PeakMemoryUsage / TotalMemory
11:    if 0.9 > u then
12:      L ← B_temporal + 1
13:      B ← B_temporal
14:    else
15:      R ← B_temporal − 1
16:    B_temporal ← ⌊(L + R)/2⌋
17:  return B
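For readers who prefer code to pseudocode, a Python rendering of Alg. 2 follows. The memory-probing helpers run_forward_backward and peak_memory_fraction are assumptions of this sketch (stand-ins for framework-specific calls that run one training step on random data and query peak GPU memory); they are not part of RITA's published API.

def binary_search_batch_size(length, max_batch_size, run_forward_backward,
                             peak_memory_fraction, threshold=0.9):
    lo, hi = 1, max_batch_size
    best = 1
    b = (lo + hi) // 2
    while lo <= hi:
        run_forward_backward(length, b)   # one forward/backward pass at this batch size
        u = peak_memory_fraction()        # peak GPU memory used / total GPU memory
        if u < threshold:                 # still under the 0.9 budget: try larger batches
            lo = b + 1
            best = b
        else:                             # over budget: try smaller batches
            hi = b - 1
        b = (lo + hi) // 2
    return best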
Algorithm 3 Dynamic Programming for Plane Division
Require: L_i, N_i, B_i, L_max
Ensure: 1 ≤ L_i ≤ L_max, 1 ≤ N_i ≤ L_i
1: function COST(S)
2:   if |S| < M then return +∞
3:   L, N, B ← points in S
4:   f ← FunctionFitting(B | L, N)
     return E(B, L, N | f)
5: function DYNAMIC_PROGRAMMING(L_i, N_i, L_max)
6:   for l_1 = 1 → L_max do
7:     for l_2 = 1 → l_1 do
8:       for n = 1 → l_1 do
9:         S ← point set in {l_2 ≤ L ≤ l_1, N ≤ n}
10:        g(n) ← COST(S)
11:        for i = 1 → n do
12:          S ← point set in {l_2 ≤ L ≤ l_1, i ≤ N ≤ n}
13:          g(n) ← min(g(n), g(i) + COST(S))
14:      f_{l_2,l_1} ← g(l_1)
16:  for l = 1 → L_max do
17:    dp(l) ← f(1, l)
18:    for i = 1 → l do
19:      dp(l) ← min(dp(l), dp(i) + f(i, l))
     return dp(L_max)

We describe Alg. 3 and intuitively show its optimality. We assume that SciPy [53] learns an optimal function in Line 4, so that function COST gives the optimal estimation error when fitting the points in a set S. When fitting very few points, we assign an infinite cost to prevent a biased fitting function (Line 2). g(n) denotes the minimal estimation error for the points in the sub-plane {l_2 ≤ L ≤ l_1, N ≤ n}. In Lines 11-13, we enumerate all possible ways of cutting {l_2 ≤ L ≤ l_1, N ≤ n} horizontally into two sub-planes {l_2 ≤ L ≤ l_1, N ≤ i} and {l_2 ≤ L ≤ l_1, i ≤ N ≤ n} by iterating i from 1 to n. Choosing the cutting strategy that minimizes the estimation error yields a g(l_1) with minimal estimation error for the sub-plane {l_2 ≤ L ≤ l_1, N ≤ l_1}, which is recorded as f_{l_2,l_1} in Line 14. dp(l) denotes the minimal estimation error for the sub-plane {L ≤ l}. We enumerate all possible ways of cutting {L ≤ l} vertically into two sub-planes {L ≤ i} and {i ≤ L ≤ l} by iterating i from 1 to l (Lines 17-19). Finally, we obtain the minimal estimation error for the whole plane as dp(L_max). Based on the above discussion, the algorithm is guaranteed not to miss any better solution, and is hence optimal.
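The plane-division procedure can likewise be sketched in a few dozen lines. In the sketch below the COST routine fits a simple least-squares plane with NumPy as a stand-in for the SciPy-based curve fitting assumed in Line 4; it is illustrative and unoptimized, not RITA's implementation.

import numpy as np

def cost(points, min_points=5):
    # Fitting error for one region; a least-squares plane B ~ a*L + b*N + c
    # stands in for the curve fitting of Alg. 3, Line 4.
    if len(points) < min_points:
        return float("inf")               # too few points would bias the fit (Line 2)
    arr = np.asarray(points, dtype=float)
    L, N, B = arr[:, 0], arr[:, 1], arr[:, 2]
    A = np.stack([L, N, np.ones_like(L)], axis=1)
    coef, *_ = np.linalg.lstsq(A, B, rcond=None)
    return float(np.sum((A @ coef - B) ** 2))

def plane_division(samples, L_max):
    # samples: list of (L, N, B) observations. Returns the minimal total fitting
    # error over the plane divisions considered by Alg. 3.
    f = {}
    for l1 in range(1, L_max + 1):
        for l2 in range(1, l1 + 1):
            strip = [p for p in samples if l2 <= p[0] <= l1]
            g = {}
            for n in range(1, l1 + 1):                    # horizontal cuts (Lines 8-13)
                g[n] = cost([p for p in strip if p[1] <= n])
                for i in range(1, n):
                    g[n] = min(g[n], g[i] + cost([p for p in strip if i <= p[1] <= n]))
            f[(l2, l1)] = g[l1]                           # Line 14
    dp = {}
    for l in range(1, L_max + 1):                         # vertical cuts (Lines 16-19)
        dp[l] = f[(1, l)]
        for i in range(1, l):
            dp[l] = min(dp[l], dp[i] + f[(i, l)])
    return dp[L_max]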
A.4 The Correctness of Group Attention

Lemma 3. Assuming the windows belonging to the same group G_i have the same key vector, i.e., k_j = r_i (win_j ∈ G_i), then the feature embedding O produced by the original self-attention mechanism is identical to the output of our group attention mechanism implemented in Algorithm 1.

Proof. Denote \widetilde{k}_j as the representative vector of k_j, i.e., \widetilde{k}_j = r_i = k_j (win_j ∈ G_i). Algorithm 1 gives

\widetilde{v}_i = \sum_{j=0}^{n-1} (BELONG_j == i) v_j, \quad \widetilde{P}_{i,j} = q_i \cdot r_j, \quad s_i = \sum_{j=0}^{N-1} \exp(\widetilde{P}_{i,j}) COUNT_j, \quad \widetilde{o}_i = \sum_{j=0}^{N-1} \frac{\exp(\widetilde{P}_{i,j})}{s_i} \widetilde{v}_j.   (7)

By the canonical self-attention mechanism introduced in Sec. 2, we get

P_{i,j} = q_i \cdot k_j, \quad A_{i,j} = \frac{\exp(P_{i,j})}{\sum_{k=0}^{n-1} \exp(P_{i,k})}, \quad o_i = \sum_{j=0}^{n-1} A_{i,j} v_j.   (8)

With (7) and (8), we have

\sum_{j=0}^{n-1} \exp(P_{i,j}) = \sum_{j=0}^{n-1} \exp(q_i \cdot k_j)
= \sum_{j=0}^{N-1} \sum_{x=0}^{n-1} (BELONG_x == j) \exp(q_i \cdot k_x)
= \sum_{j=0}^{N-1} \exp(q_i \cdot r_j) \sum_{x=0}^{n-1} (BELONG_x == j)
= \sum_{j=0}^{N-1} \exp(q_i \cdot r_j) COUNT_j
= \sum_{j=0}^{N-1} \exp(\widetilde{P}_{i,j}) COUNT_j = s_i.   (9)

Further,

o_i = \sum_{j=0}^{n-1} A_{i,j} v_j
= \sum_{j=0}^{N-1} \sum_{x=0}^{n-1} (BELONG_x == j) A_{i,x} v_x
= \sum_{j=0}^{N-1} \sum_{x=0}^{n-1} (BELONG_x == j) \frac{\exp(P_{i,x})}{\sum_{k=0}^{n-1} \exp(P_{i,k})} v_x
= \sum_{j=0}^{N-1} \sum_{x=0}^{n-1} (BELONG_x == j) \frac{\exp(q_i \cdot k_x)}{\sum_{k=0}^{n-1} \exp(P_{i,k})} v_x
= \sum_{j=0}^{N-1} \frac{\exp(q_i \cdot r_j)}{\sum_{k=0}^{n-1} \exp(P_{i,k})} \sum_{x=0}^{n-1} (BELONG_x == j) v_x
= \sum_{j=0}^{N-1} \frac{\exp(q_i \cdot r_j)}{\sum_{k=0}^{n-1} \exp(P_{i,k})} \widetilde{v}_j.   (10)

Combining (7), (9), and (10), we have o_i = \sum_{j=0}^{N-1} \frac{\exp(\widetilde{P}_{i,j})}{s_i} \widetilde{v}_j = \widetilde{o}_i. This concludes that the output of our group attention is identical to vanilla self-attention's. □
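Lemma 3 is also easy to check numerically: when every key equals its group representative, Alg. 1 and standard softmax attention produce the same output. The script below is such a check (an illustration, not part of the paper's artifact); it inlines the group-attention computation of Alg. 1.

import numpy as np

rng = np.random.default_rng(0)
n, N, d = 12, 3, 4
BELONG = rng.integers(0, N, size=n)       # group id of each window
R = rng.normal(size=(N, d))               # representative keys
K = R[BELONG]                             # premise of Lemma 3: each key equals its representative
Q, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))
COUNT = np.bincount(BELONG, minlength=N)

# Vanilla self-attention, as in (8).
P = Q @ K.T
A = np.exp(P) / np.exp(P).sum(axis=1, keepdims=True)
O_vanilla = A @ V

# Group attention, as in Alg. 1 / (7).
V_tilde = np.zeros((N, d))
np.add.at(V_tilde, BELONG, V)
P_tilde = Q @ R.T
s = (np.exp(P_tilde) * COUNT).sum(axis=1, keepdims=True)
O_group = (np.exp(P_tilde) / s) @ V_tilde

assert np.allclose(O_vanilla, O_group)    # identical outputs, as Lemma 3 states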
A.5 The Proof of Error Bound (Lemma 1)

Proof. We have

\frac{\exp(\overline{P}_{i,j})}{\exp(P_{i,j})} = \frac{\exp(q_i \cdot \widetilde{k}_j)}{\exp(q_i \cdot k_j)} = \exp(q_i \cdot (\widetilde{k}_j - k_j)) = \exp(\|q_i\| \cdot \|\widetilde{k}_j - k_j\| \cdot \cos(q_i, \widetilde{k}_j - k_j)).   (11)

So

\exp(-dR) \le \frac{\exp(\overline{P}_{i,j})}{\exp(P_{i,j})} \le \exp(dR).   (12)

Then we have

\frac{\overline{A}_{i,j}}{A_{i,j}} = \frac{\exp(\overline{P}_{i,j})}{\sum_{k=1}^{n} \exp(\overline{P}_{i,k})} \Big/ \frac{\exp(P_{i,j})}{\sum_{k=1}^{n} \exp(P_{i,k})} = \frac{\exp(\overline{P}_{i,j})}{\exp(P_{i,j})} \cdot \frac{\sum_{k=1}^{n} \exp(P_{i,k})}{\sum_{k=1}^{n} \exp(\overline{P}_{i,k})}.   (13)

Combining (12) and (13), the error is bounded by

\exp(-2dR) \le \frac{\overline{A}_{i,j}}{A_{i,j}} \le \exp(2dR).   (14)

Thus, if d \le \frac{\ln(\epsilon)}{2R}, then \frac{1}{\epsilon} \le \frac{\overline{A}_{i,j}}{A_{i,j}} \le \epsilon. This proves Lemma 1. □

A.6 The Proof of Merge Operation (Lemma 2)

Proof. Denote the size of cluster_k as n_k. After merging, the new center is

c' = \frac{\sum_{i=1}^{m} n_{k_i} c_{k_i}}{\sum_{i=1}^{m} n_{k_i}}.

For all i ∈ [1, m] and all x ∈ cluster_{k_i}, it holds that:

|x - c'| \le |x - c_{k_i}| + |c_{k_i} - c'|   (triangle inequality)
= |x - c_{k_i}| + \Big| \frac{\sum_{j=1}^{m} n_{k_j}}{\sum_{j=1}^{m} n_{k_j}} c_{k_i} - \frac{\sum_{j=1}^{m} n_{k_j} c_{k_j}}{\sum_{j=1}^{m} n_{k_j}} \Big|
= |x - c_{k_i}| + \Big| \frac{\sum_{j=1}^{m} n_{k_j} (c_{k_i} - c_{k_j})}{\sum_{j=1}^{m} n_{k_j}} \Big|
= |x - c_{k_i}| + \frac{\big| \sum_{j=1}^{m} n_{k_j} (c_{k_i} - c_{k_j}) \big|}{\sum_{j=1}^{m} n_{k_j}}
\le |x - c_{k_i}| + \frac{\sum_{j=1}^{m} n_{k_j} |c_{k_i} - c_{k_j}|}{\sum_{j=1}^{m} n_{k_j}}
= \frac{\sum_{j=1}^{m} n_{k_j} (|c_{k_i} - c_{k_j}| + |x - c_{k_i}|)}{\sum_{j=1}^{m} n_{k_j}}
\le \frac{\sum_{j=1}^{m} n_{k_j} d}{\sum_{j=1}^{m} n_{k_j}} = d.   (15)
□

A.7 Downstream Tasks

RITA supports a variety of downstream tasks. In this section, we show that with minimal modification RITA can effectively support classification, imputation, and forecasting tasks. Other unsupervised tasks such as similarity search and clustering are naturally supported by extracting feature embeddings from RITA.

A.7.1 Classification

To classify timeseries, we feed the timeseries into the model as described in Sec. 3 and attach a special token [CLS] as the first input embedding. [CLS]'s embedding acts as the embedding of the entire timeseries, and the output representation of [CLS] is fed into a classifier: y = Softmax(W_cls Z_[CLS] + B_cls), where Z_[CLS] ∈ R^d is the output representation of [CLS], C is the number of classes, and W_cls ∈ R^{C×d}, B_cls ∈ R^C are learnable parameters of the classification task. The resulting vector y ∈ R^C represents the probability that the input timeseries belongs to each class.

We apply cross-entropy loss as the loss function of the classification task [13]: L = \frac{1}{C} \sum_{i=1}^{C} -\hat{y}(i) \log(y(i)), where \hat{y} is a binary indicator of the ground truth label:

\hat{y}(i) = 1 if i is the ground truth label, and 0 otherwise.   (16)
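A minimal PyTorch sketch of this classification head is shown below; it assumes the backbone already returns the [CLS] representation Z_[CLS] and is only an illustration of the head described above, not RITA's released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationHead(nn.Module):
    # Maps the [CLS] representation in R^d to class logits (W_cls, B_cls above).
    def __init__(self, d: int, num_classes: int):
        super().__init__()
        self.proj = nn.Linear(d, num_classes)

    def forward(self, z_cls: torch.Tensor) -> torch.Tensor:
        return self.proj(z_cls)   # raw logits; softmax is folded into the loss below

head = ClassificationHead(d=64, num_classes=5)
z_cls = torch.randn(8, 64)                 # a batch of [CLS] representations (assumed given)
labels = torch.randint(0, 5, (8,))
# Standard cross-entropy over the logits, matching the loss above up to its 1/C scaling.
loss = F.cross_entropy(head(z_cls), labels)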
A.7.2 Imputation

Timeseries are mainly generated by sensors, and a common problem is missing values. This becomes a challenge because many downstream analytics require the missing values to be recovered. This recovery task is imputation.

Denote the real timeseries as T_r ∈ R^{t×m}, the observed timeseries with missing values as T_o ∈ R^{t×m}, and the set of missing-value positions as M. We scale the values of all timeseries to be non-negative and use a special value (-1) to indicate missing values:

T_o(i, j) = -1 if (i, j) ∈ M, and T_o(i, j) = T_r(i, j) if (i, j) ∉ M.   (17)

T_o is fed into RITA as input, and the output representations are concatenated and fed into a transpose convolution layer which decodes the output embedding vectors from the hidden space back to timeseries values, corresponding to the convolution operation in the input stage, i.e., Y = TransposeCNN(Z_1 ⊕ Z_2 ⊕ ... ⊕ Z_n), where Y ∈ R^{t×m} is the recovered timeseries and Z_i ∈ R^d is the output at each position.

Mean squared error is chosen as the loss function [51]: L = \frac{1}{|M|} \sum_{(i,j) \in M} (Y(i, j) - T_r(i, j))^2.

A.7.3 Forecasting

Forecasting can be regarded as a special case of imputation in which all missing values are at the end of the timeseries. So, as in the imputation task, we scale the timeseries to be non-negative and use a special value (-1) to indicate the values to be predicted:

T_observed(i, j) = T_real(i, j) if i ≤ t_observed, and -1 otherwise,   (18)

where t_observed is the last observed timestamp. The output representations are then fed into a transpose convolution layer using mean squared error as the loss function, as described above.
A.7.4 Other Unsupervised Tasks

RITA naturally supports other unsupervised tasks, such as similarity search and clustering [25, 31, 32], by producing an embedding of a whole timeseries (the output representation of the special token [CLS]). Clustering can be performed on the embeddings with a flexible choice of distance metrics. Similarly, a high-dimensional similarity search system [22, 23, 38] can be built on the embeddings.

A.8 Inference Time

In this section, we present the average inference time on the validation sets. The results in Tables 6 and 7 correspond to the average inference time on the validation sets of the classification and imputation tasks, respectively. Consistent with the results in Section 6.3, our method Group Attn. outperforms the baselines on both classification and imputation tasks, particularly on the datasets comprising long timeseries (ECG and MGH).

Dataset  Length  TST [61]  Vanilla  Performer  Linformer  Group Attn.
WISDM    200     2.18      2.26     2.35       2.22       2.17
HHAR     200     1.19      1.23     1.28       1.21       1.18
RWHAR    200     1.32      1.37     1.42       1.34       1.31
ECG      2000    18.44     15.26    5.80       6.08       5.16
Table 6: Inference time: Classification on multi-variate data (seconds).

Dataset  Length  TST [61]  Vanilla  Performer  Linformer  Group Attn.
WISDM    200     2.03      2.11     2.19       2.07       2.02
HHAR     200     1.11      1.14     1.19       1.12       1.10
RWHAR    200     1.23      1.27     1.32       1.25       1.22
ECG      2000    17.22     14.32    4.73       4.99       4.11
MGH      10000   N/A       N/A      6.58       6.88       1.35
Table 7: Inference time: Imputation on multi-variate data (seconds).
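To make the embedding-based similarity search and clustering of Sec. A.7.4 concrete, the sketch below runs a brute-force cosine search over [CLS] embeddings. Here embed() is an assumed helper that returns RITA's [CLS] representation for a timeseries; a production system would hand the same vectors to a dedicated index such as those in [22, 23, 38].

import numpy as np

def cosine_top_k(query_emb: np.ndarray, corpus_embs: np.ndarray, k: int = 5):
    # Brute-force cosine-similarity search over [CLS] embeddings.
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Usage, assuming embed() and a collection of timeseries are available:
# corpus = np.stack([embed(ts) for ts in timeseries_collection])
# neighbors, sims = cosine_top_k(embed(query_ts), corpus)
# The same embeddings can be clustered with any distance-based method (e.g., k-means).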