Fast Approximate Correlation For Massive Time-Series Data
Previous works have shown the importance of computing cross correlation of a large number of signals. In [20], authors mention five important queries in a data center management application, out of which three queries related to server dependency analysis, load balancing, and anomaly detection involve computing correlation matrices. For example, Figure 1(a) shows a correlation matrix of 350 signals, where each signal represents the number of TCP connections in a server measured every 30 seconds in a day. The matrix shows the existence of several load balanced clusters of machines. If such a correlation matrix is periodically computed, any deviation between successive matrices can indicate possible anomalies. Similarly, many eScience questions are better understood by correlation matrices [9]. In [21] and [27], authors provide examples of sensing and stock trading applications requiring cross correlation of tens of thousands of signals.

[Figure 1: Correlation matrices; (a) correlation matrix C, (b) threshold correlation matrix C^T. Darker pixels represent higher correlation coefficients (e.g., a black pixel represents a corr. coeff. of 1).]

We consider ad-hoc queries, where individual signals are stored on disk and a target set of signals and a target time window are defined at query time. We assume that no precomputed index specifically designed for correlation computation (such as [1, 27]) exists, because of its large overhead and its inefficiency in handling ad-hoc queries. Fast computation of a large correlation matrix in this setting is challenging because of its high I/O and CPU overhead. High I/O costs are encountered because, due to limited memory, signals may need to be read from disk to memory multiple times. High CPU costs result from examining all pairs of signals and doing expensive floating point operations. As we will show in Section 3, computing a correlation matrix for 10,000 signals of length 2880 using a naive approach can take several hours on a standard desktop computer. If multiple such matrices are to be computed (e.g., one matrix for each day of signals), this can take up to several days on a single machine. Such large response times are unacceptable for interactive data exploration tasks in a SWS.

To address the above challenges, we consider a slightly relaxed, yet almost equally useful, version of the problem. In many applications, including the data center management application, users are typically interested in correlated (e.g., correlation above a given threshold T) signal pairs—uncorrelated signal pairs are typically not of much interest. Therefore, while the exact correlations of correlated pairs need to be computed, those of uncorrelated pairs can often be safely ignored. The general problem we consider in this paper is as follows.

PROBLEM 1 (THRESHOLD CORRELATION MATRIX). Given n signals of equal length m and a threshold T, for 1 ≤ i, j ≤ n, compute an n × n threshold correlation matrix C^T such that C^T[i, j] = corr(i, j) if |corr(i, j)| ≥ T and 0 otherwise.

Figure 1(b) shows an example of a threshold correlation matrix with a threshold of T = 0.5. As shown, some of the gray pixels with correlation < 0.5 in Figure 1(a) are absent in Figure 1(b); yet, Figure 1(b) shows the prominent clusters of similar signals.
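For concreteness, the following is a minimal NumPy sketch of Problem 1. It is our own illustration, not code from the paper, and it is exactly the naive O(n²m) computation that the techniques below are designed to avoid.

```python
import numpy as np

def threshold_corr_matrix(signals, T):
    """Naive threshold correlation matrix C^T (Problem 1): compute all
    pairwise Pearson correlations, then zero out entries with |corr| < T."""
    C = np.corrcoef(signals)      # n x n correlation matrix, O(n^2 m) work
    C[np.abs(C) < T] = 0.0        # keep only correlated pairs
    return C

# Example: 4 synthetic random-walk signals of length 2880
rng = np.random.default_rng(0)
sigs = np.cumsum(rng.standard_normal((4, 2880)), axis=1)
print(threshold_corr_matrix(sigs, T=0.5))
```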
To reduce I/O costs, we propose a novel data partitioning algorithm that divides a given dataset into batches such that each batch fits in the available memory. More importantly, batches are carefully created such that most signal pairs within a single batch are correlated, while many pairs with signals in different batches are uncorrelated. Thus, as signals are read into the memory one batch at a time, signals that are mutually correlated with each other reside in the memory at the same time—in this way, the correlation coefficients of a cached signal with a large number of other cached signals can be computed without additional I/Os. We show that the problem can be modeled as a graph partitioning problem and be solved using efficient heuristics.

To reduce CPU costs, we propose two novel approximation algorithms for computing a correlation matrix. Our first approximation algorithm computes correlation coefficients within a given error bound, while the second algorithm identifies signal pairs with correlation above a threshold without any false positives or false negatives. Our algorithms take advantage of computational shortcuts in the Discrete Fourier Transformation (DFT) space and are significantly faster than an exact algorithm; yet, their approximation guarantees keep them useful for many real-world applications, including our running example of the data center management application.

Our approximation algorithms can be used in combination (with or without an exact algorithm) during interactive data exploration. For example, a data center operator can first use our first approximation algorithm to identify highly correlated signal pairs. If the correlation pattern looks interesting, he can further use our second approximation algorithm to remove any false positives. If further needed, he can use an exact algorithm to compute exact correlations. Such an approach gives a user the flexibility to stop data exploration early (e.g., after the first step) or continue for greater fidelity answers at the cost of increased computational complexity.

In summary, we make the following contributions.

• We propose a novel caching algorithm for computing a threshold correlation matrix. The algorithm uses DFT and graph partitioning to optimize overall I/O cost (§ 4).

• We propose two efficient approximation algorithms. The first algorithm approximates entries in a threshold correlation matrix within a given error bound. The second algorithm efficiently identifies all correlated signal pairs without any false positives or negatives (§ 5).

• We propose extensions to our basic algorithms to support anti-correlation and lagged correlation (§ 6).

• Using several real datasets, we evaluate our proposed algorithms. Our evaluation shows that our algorithms are up to 17× and 71× faster than existing algorithms for synchronous and lagged correlation, respectively (§ 7).

This work is a part of a broader project, called DataGarage [16], for building a data-driven data center management system. DataGarage aims to collect and archive monitoring data from tens of thousands of servers, to enable users to run ad-hoc and routine data mining queries on massive data, and to provide useful control feedback to data center operators for load balancing, energy optimization, anomaly detection, etc. In that context, this work provides an important building block to mine massive data and to gather valuable insights for data center operators.

2. RELATED WORK

Correlation is a similarity measure, and prior works have extensively considered the problem of discovering similar sequences from a large number of sequences. Traditionally, the Euclidean distance is used to capture the similarity. The original work by Agrawal et al. [1] considered discovering similarity between an online sequence and an indexed database of previously obtained sequence information. The proposed techniques focused on whole sequence matching and utilized the DFT to transform data from the
time domain into the frequency domain, and used a multidimensional index structure to index the first few DFT coefficients. The work was later generalized to allow subsequence matching [4] and transformations such as scaling and shifting [19]. Our work differs from them in that i) we consider the Pearson correlation coefficient, which is optimal for detecting linear relationships, ii) we do not assume the existence of any precomputed index, since our sequence set and the target time window are defined ad-hoc during query time, and iii) we consider computing correlation of all sequence pairs, instead of only the pairs involving a given sequence. From an algorithmic point of view, our caching and approximation algorithms are novel and were not considered by this previous work.

StatStream [27] and HierarchyScan [12], like our work, consider correlation coefficients as a similarity measure. StatStream uses DFT to maintain a grid data structure to quickly identify similar streams within a sliding window in real-time. HierarchyScan considers a stream warehouse setting and performs correlation between the stored sequences and a given template pattern in the transformed domain (e.g., using DFT or DWT) hierarchically. [2] uses sketches to correlate uncooperative (i.e., noisy) signals, which are not prevalent in our target applications. Our work differs from these works in that: i) none of the works considers I/O optimization, and ii) none of the works considers bounded approximation of correlation (e.g., HierarchyScan may output false negatives; sketches may output both false positives and negatives).

Our use of DFT is in the similar spirit of reducing the dimensionality of signals. Previous works have used similar ideas to achieve tight lower bounds for pruning signals (e.g., to answer similarity queries [22, 23]). Some existing dimensionality reduction techniques (e.g., APCA [11], PAA [10], MSM [15]) provide better lower bounds for pruning than DFT and DWT. However, none of these techniques allows computing correlation (with bounded error) in the reduced dimensionality space. More specifically, even though these techniques reduce dimensionality to prune uncorrelated (or dissimilar) signals, correlation coefficients (or some other similarity metrics) of signals are computed in the time domain; in contrast, we use DFT to compute approximate correlation in the frequency domain.

Many other prior works consider similarity or related queries in a streaming scenario [7, 13, 14, 26]. These techniques are not adequate for our target applications since they are not designed for stream warehouse settings, and/or do not explicitly consider correlation coefficients, and/or are not shown to scale to tens of thousands of streams.

3. MOTIVATION

We use a real dataset to demonstrate the I/O and CPU complexities of computing a large correlation matrix C in a stream warehouse environment. The dataset, called DataCenter1, records a performance counter from a Microsoft data center (more details of the dataset are in Section 7). It consists of n = 10,752 sequences, each of length m = 2880. Each sequence is stored on disk as a separate file.¹ We use a 2.67 GHz quad core machine with 6GB RAM and a 750 GB Hitachi hard disk of 7200 rpm for this experiment.

¹ Our conclusions hold if all data is stored in a single file.

Table 1: Total time for computing a correlation matrix of n = 10,752 signals for various cache sizes.

  Cache size (# signals)    CPU time (mins)    I/O time (mins)
  n     (Full cache)        93                 4
  n/32  (Partial cache)     93                 40
  2     (No cache)          93                 7639

Our experiments show that computing a correlation matrix for the above dataset is very expensive, especially with limited or no caching (Table 1). In the worst case, without any caching, a signal needs to be read from the disk (n − 1) times, to correlate with all (n − 1) other signals. In such a case, computing a correlation matrix for our dataset takes approximately 129 hours! Obviously, most of the time is spent in reading signals from the disk. Our experiments show an average CPU utilization of only 2% without any cache. On the other hand, if the entire dataset can be cached in DRAM, each signal needs to be read from disk only once and hence the I/O cost is minimized. In that case, the correlation matrix can be computed in around 1 hour 37 minutes, highlighting the fact that more than 127 hours is spent on I/O in the no-cache scenario.

However, in practice, it may not be possible to completely cache a large dataset. For example, the DataCenter1 dataset for 20,000 signals for a month is more than 24GB, which is significantly bigger than the available memory. Thus, only a fraction of the signals can be cached in the DRAM at a time. In such a case, a single signal may need to be read from the disk multiple times to correlate with all other signals. For example, a signal x may be evicted from the limited cache before another signal y is read from the disk; then x must be reread from disk later to compute corr(x, y).

An Optimal Baseline Caching Algorithm. Consider the following optimal baseline caching algorithm for dealing with a limited cache. We define the cache size as the number of signals it can hold at a time. Given n signals and a cache of size (n/q + 1), signals are partitioned into batches {B_i} of size n/q each, except for the last batch, which can be smaller. Batches are determined according to signal IDs; i.e., the first n/q signals are in the first batch B_1, the second n/q signals are in the second batch B_2, and so on. Each batch B_i is brought into the cache at once, and correlation coefficients of all pairs of signals (x, y), x ∈ B_i, y ∈ B_i, are computed. Before the batch B_i is evicted from the cache, all remaining signals z ∈ B_{j>i} are read from the disk one at a time. When such a signal z is read, correlation coefficients of all pairs (x, z), x ∈ B_i, are computed. After that, the next batch of signals is loaded into the cache and the process continues. This simple caching strategy is optimal because every time a signal is read from the disk, the number of correlation coefficients computed with it is exactly n/q, which is the maximum possible with a cache of size (n/q + 1).
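The baseline's I/O pattern is easy to simulate. The small sketch below (ours, not the paper's code) replays the schedule and confirms the n(1 + q)/2 disk-read count derived in the back-of-the-envelope calculation that follows.

```python
def baseline_io_cost(n, q):
    """Disk reads of the baseline caching strategy: q batches of n/q
    signals each, with a cache that holds n/q + 1 signals."""
    batch = n // q
    reads = 0
    for i in range(q):                  # process batch i
        reads += batch                  # load the whole batch at once
        reads += (q - 1 - i) * batch    # stream every later signal past it
    return reads

# Matches the closed form n(1 + q)/2 used in the text:
for n, q in [(10752, 32), (10752, 16)]:
    assert baseline_io_cost(n, q) == n * (1 + q) // 2
    print(n, q, baseline_io_cost(n, q))
```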
We use the above baseline strategy in our experiments with a limited cache. As shown in Table 1, even if we have a cache large enough to hold 1/32 of the entire dataset (a partial cache scenario), the I/O cost remains significant (40 minutes). A careful back-of-the-envelope calculation shows that the I/O cost to compute an all-pair correlation matrix of n signals with a cache of size n/q + 1 is proportional to n(1 + q)/2. Thus, the I/O cost decreases linearly with increasing cache size. We empirically validate this in Section 7.

The above empirical results highlight two main components of the total execution time.

• High I/O cost: With a limited cache, a significant amount of the time is spent reading data from disk.

• High CPU cost: Even if there is enough cache to hold the entire dataset, computation remains expensive, as shown by the Full cache scenario in Table 1.

We next present techniques to reduce these two costs. For simplicity, we assume only positive correlation and synchronized signals in the next two sections. We consider anti-correlation and lagged correlation in Section 6. Table 2 summarizes the symbols we use.
Table 2: Symbols and definitions

  Symbol       Definition
  n            Number of signals
  m            Length of each signal
  x, X         A signal and its DFT
  x̂, X̂         Normalization of x and the DFT of x̂
  d(x, y)      Euclidean distance of x and y
  d_k(x, y)    Euclidean distance of the first k elements of x and y
  T            Correlation threshold
4. REDUCING I/O COSTS

In this section, we present a novel technique to reduce the total I/O cost required to answer a threshold correlation matrix query. As a shorthand, we call two signals correlated if their correlation coefficient is above the given threshold, and uncorrelated otherwise. Thus, we need to compute correlation coefficients for correlated signal pairs only. A naïve algorithm would require computing the correlation of all pairs of n signals and would have an O(n²) I/O cost. A hypothetical optimal algorithm can reduce the cost in at least two ways. The first technique is pruning. If the algorithm magically knew which pairs of signals are correlated, it could compute correlation coefficients (and read relevant data from disk) for those pairs only and ignore uncorrelated pairs. The second technique is intelligent caching. The algorithm can read signals from the disk in an optimal order such that signals that are mutually correlated with each other reside in the cache at the same time—thus, a cached signal can be compared with a large number of other cached signals, reducing the amortized I/O cost of reading a signal from the disk.

Realizing such an algorithm requires answering two questions. First, how does the algorithm know which signals are correlated? This must be done with an I/O cost significantly smaller than O(n²)—pruning becomes useless if it itself has an I/O cost close to the O(n²) I/O cost of a naïve algorithm. Without any background knowledge about the nature of the input signals, the algorithm must examine all the signals at least once (i.e., read each signal from disk at least once), implying an O(n) lower bound on I/O cost.

Even after all correlated pairs are identified, the I/O cost to compute correlation coefficients of correlated pairs can still be significantly high if a good caching strategy is not used. We define a caching strategy as the order in which signals are read into and evicted from the cache. The impact of different caching strategies on I/O cost is illustrated by an example in Figure 2. In Figure 2(a), a black cell (i, j) implies that signal i and signal j are correlated (e.g., their correlation coefficient is above a given threshold), and hence we need to compute their exact correlation coefficient. Knowing this information (e.g., from an oracle), an algorithm can read the signals from disk in many different orders, including the two strategies shown in Figure 2(b) and (c). Each step of a strategy shows the signals that are read from the disk (1st column), the cache content after the signals are read (2nd column), and the pairs of signals whose correlation coefficients are computed at this step. We assume the cache can hold at most 4 signals at a time. As shown, both strategies compute the correlations of the same set of signal pairs, but Strategy 2 does so with almost half the I/O cost of Strategy 1.² The example illustrates the importance of choosing a good caching strategy. This leads us to the second challenge: how can one find a good caching strategy to minimize I/O costs?

We next address these two challenges.

² Note that correlation is symmetric; i.e., computing corr(x, y) gives corr(y, x).

4.1 Identifying Correlated Pairs

We use the Discrete Fourier Transform (DFT) to identify correlated signal pairs in an I/O efficient manner. The DFT of a signal x = x_0, x_1, ..., x_{m−1} is a sequence X = X_0, X_1, ..., X_{m−1} = DFT(x) of complex numbers given by

$$X_f = \frac{1}{\sqrt{m}} \sum_{k=0}^{m-1} x_k \, e^{-2\pi i f k / m}, \qquad f = 0, 1, \ldots, m-1.$$

We also define the normalization of x as x̂ = x̂_0, x̂_1, ..., x̂_{m−1}, such that x̂_k = (x_k − µ_x)/σ_x, where µ_x and σ_x are the mean and standard deviation of the values x_0, x_1, ..., x_{m−1}.

As the following lemma suggests, the correlation coefficient of two signals can be reduced to the Euclidean distance between their normalized series.

LEMMA 1 ([18]). The correlation coefficient of two signals x and y is corr(x, y) = 1 − (1/2m)·d²(x̂, ŷ), where d(x̂, ŷ) is the Euclidean distance between x̂ and ŷ.

By reducing the correlation coefficient to Euclidean distance, we can apply the techniques in [27] to report signals with correlation coefficients higher than a specific threshold.

LEMMA 2 ([27]). Let the DFTs of the normalized forms of two signals x and y be X̂ and Ŷ. Then,

$$\mathrm{corr}(x, y) \ge T \;\Rightarrow\; d_k(\hat{X}, \hat{Y}) \le \sqrt{2m(1-T)},$$

where d_k(X̂, Ŷ) is the Euclidean distance between the sequences X̂_0, X̂_1, ..., X̂_{k−1} and Ŷ_0, Ŷ_1, ..., Ŷ_{k−1} for some k ≤ m/2.

Lemma 2 implies that we can safely ignore signal pairs for which d_k(X̂, Ŷ) > √(2m(1 − T)), since they cannot have correlation coefficients above the given threshold T. By ignoring such pairs, we get a set of likely correlated signal pairs. This is a superset of the correlated signal pairs, but there will be no false negatives. A similar technique has been used in previous works [1, 27]. For a large class of real-world signals (called cooperative signals [2]), including our data center data and stock prices, the first few low frequency DFT coefficients are sufficient to capture the overall shape of a signal. For such signals, computing only a small number of low frequency coefficients, e.g., using k = 5, is sufficient for identifying likely correlated signal pairs. The number of false positives can be reduced by using a larger k, which comes at the cost of increased computational overhead.

The above properties of DFT lay the foundation of our I/O efficient detection of correlated pairs. Algorithm 1 shows the details. Given n signals of length m on disk, we read one signal at a time to compute the first k DFT coefficients of each signal, resulting in an O(n) I/O cost.

Algorithm 1 PruneUncorrelated(S, k)
Require: A set S of n signals, where each signal s_i ∈ S has length m
Ensure: Report likely correlated signal pairs
 1: for each signal s_i ∈ S, 0 ≤ i < n do
 2:    Read s_i from disk
 3:    Normalize s_i to ŝ_i
 4:    DFT[i] ← first k DFT coefficients of ŝ_i
 5: for each signal s_i ∈ S, 0 ≤ i < n do
 6:    for each signal s_j ∈ S, i < j < n do
 7:       if d_k(DFT[i], DFT[j]) ≤ √(2m(1 − T)) then
 8:          Report the pair (i, j) as likely correlated
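A direct NumPy transcription of Algorithm 1 might look as follows. This is our own sketch (assuming the 1/√m DFT normalization above, with an in-memory array standing in for the per-signal disk reads), with a quick numerical check of Lemma 1 at the end.

```python
import numpy as np

def znorm(a, axis=-1):
    """Normalization from Section 4.1: zero mean, unit standard deviation."""
    return (a - a.mean(axis=axis, keepdims=True)) / a.std(axis=axis, keepdims=True)

def prune_uncorrelated(signals, k, T):
    """Algorithm 1 (PruneUncorrelated): report likely correlated pairs using
    only the first k DFT coefficients of each normalized signal. By Lemma 2
    there are no false negatives; false positives are filtered later."""
    n, m = signals.shape
    F = np.fft.fft(znorm(signals), axis=1)[:, :k] / np.sqrt(m)
    bound = np.sqrt(2 * m * (1 - T))            # Lemma 2 threshold
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if np.linalg.norm(F[i] - F[j]) <= bound]

rng = np.random.default_rng(5)
base = np.cumsum(rng.standard_normal(2880))
sigs = np.vstack([base,
                  base + 0.1 * rng.standard_normal(2880),  # near-copy of base
                  rng.standard_normal(2880)])              # unrelated noise
print(prune_uncorrelated(sigs, k=5, T=0.9))   # typically reports only (0, 1)

# Sanity check of Lemma 1: corr(x, y) = 1 - d^2(x_hat, y_hat) / (2m)
x, y = sigs[0], sigs[2]
d2 = np.sum((znorm(x) - znorm(y)) ** 2)
assert np.isclose(np.corrcoef(x, y)[0, 1], 1 - d2 / (2 * len(x)))
```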
Figure 2: Computing a threshold correlation matrix with two different caching strategies. The cache can hold 4 signals at a time. Strategy 2 is 1.75× more I/O efficient than Strategy 1.

(a) Pruning Matrix: an 8 × 8 matrix over signals 1–8; a black cell (i, j) marks a likely correlated pair.

(b) Caching Strategy 1, 14 disk reads:

  Signals read   In cache      Computed pairs
  1, 2, 3, 4     1, 2, 3, 4    (1, 2), (3, 4)
  5, 6, 7, 8     5, 6, 7, 8    (5, 6), (6, 7), (7, 8)
  1, 2           1, 2, 5, 6    (1, 5), (2, 6)
  3, 4, 7, 8     3, 4, 7, 8    (3, 7), (4, 8)

(c) Caching Strategy 2, 8 disk reads:

  Signals read   In cache      Computed pairs
  1, 2, 5, 6     1, 2, 5, 6    (1, 2), (1, 5), (2, 6), (5, 6)
  7              2, 5, 6, 7    (6, 7)
  3, 4, 8        3, 4, 7, 8    (3, 4), (3, 7), (4, 8), (7, 8)
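The disk-read counts of Figure 2 can be replayed mechanically; the sketch below (ours) verifies that both schedules cover every required pair while costing 14 and 8 reads, respectively.

```python
def replay(schedule, required_pairs, cache_size=4):
    """Replay a caching schedule: each step is (signals fetched, signals
    evicted). Checks that every required pair co-resides in the cache at
    some step, and returns the total number of disk reads."""
    cache, covered, reads = set(), set(), 0
    for fetch, evict in schedule:
        cache -= set(evict)
        cache |= set(fetch)
        assert len(cache) <= cache_size
        reads += len(fetch)
        covered |= {(a, b) for a in cache for b in cache if a < b}
    assert required_pairs <= covered
    return reads

pairs = {(1, 2), (3, 4), (5, 6), (6, 7), (7, 8),
         (1, 5), (2, 6), (3, 7), (4, 8)}
s1 = [([1, 2, 3, 4], []), ([5, 6, 7, 8], [1, 2, 3, 4]),
      ([1, 2], [7, 8]), ([3, 4, 7, 8], [1, 2, 5, 6])]
s2 = [([1, 2, 5, 6], []), ([7], [1]), ([3, 4, 8], [2, 5, 6])]
print(replay(s1, pairs), replay(s2, pairs))   # 14 8
```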
4.2 Caching Strategy

A caching strategy involves deciding which set of signals to bring into the cache together and how to evict them from the cache. We use the same general framework we used for the optimal baseline algorithm in Section 3: we divide signals into batches and bring them into the cache one batch at a time. However, we introduce two optimizations in the baseline algorithm. First, before a batch is evicted from the cache, the baseline algorithm brings in all signals in the remaining batches, one at a time, to compute correlation coefficients of all signal pairs having exactly one signal in the currently cached batch. In contrast, we use the Pruning Matrix to ignore the uncorrelated pairs; thus we bring a signal into the cache only if it is likely correlated with at least one signal in the current batch. In the best case, if a batch is not correlated with any other signals, no additional signals need to be read before eviction of the batch.

Our second, and the most important, caching optimization carefully chooses the batches. Note that, for each likely correlated signal pair whose two signals are in two different batches, we need to incur an additional disk read. Suppose the Pruning Matrix suggests that signals x and y are likely correlated and hence we need to compute corr(x, y). If they are put in the same batch, they will be read into the cache together and hence corr(x, y) can be computed without additional disk I/O. In contrast, if they are put in different batches, and if the batch containing x is read to cache before the batch containing y, y will be read from disk at least twice—once just before the batch containing x is evicted, to compute corr(x, y), and again when the batch containing y is read into the cache. Thus, computing correlation between signals in different batches incurs additional I/O costs, and therefore we aim to partition the signals into batches such that this additional I/O cost is minimized. In Figure 2, caching strategy 1 uses the two batches {1, 2, 3, 4} and {5, 6, 7, 8}, which results in four likely correlated signal pairs across batches. In contrast, caching strategy 2, which outperforms caching strategy 1, uses the two batches {1, 2, 5, 6} and {3, 4, 7, 8}, resulting in only one such pair, (6, 7), across different batches.

Optimal data partitioning. Fortunately, we can formulate the above optimization problem as the node capacitated graph partitioning problem [5]. Given a graph G = (V, E), edge weights w_e for e ∈ E, and a capacity B, the goal is to find a partition (V_1, V_2, ..., V_φ) of V such that |V_i| < B for 1 ≤ i ≤ φ and such that Σ_{e∈∆} w_e is minimized, where ∆ is the set of edges whose end nodes belong to different elements of the partition, typically called a multicut. The resulting partitioning is called minimum multicut size partitioning, or min-cut partitioning in short.

In our setting, the cache size B defines the capacity, the graph has the set of signals as the nodes, and the weight w_e of the edge e between node i and node j is P[i, j], where P is the Pruning Matrix. Thus, different elements of the resulting partition denote different batches of signals that are read into the cache together. Figure 3 shows an example graph (only edges with weight 1 are shown) representing the Pruning Matrix in Figure 2(a), and the two batches of signals resulting in caching strategy 2 in Figure 2(c). Intuitively, we try to avoid placing pairs of signals that are likely correlated with each other (as indicated by the Pruning Matrix) in different batches.

[Figure 3: Partitioning signals into two batches to minimize the multicut size.]

The above graph partitioning problem is NP-complete [5]. There are many heuristics-based and approximation algorithms for balanced graph partitioning [3, 6, 8, 25]. Many of the algorithms are used in offline circuit partitioning in VLSI design, and hence they optimize for accuracy at the cost of increased execution time. In contrast, we need to partition the graph online, during query execution. Hence, we have chosen the simplest and the fastest of these existing algorithms: the F-M (Fiduccia-Mattheyses) algorithm [6]. F-M is a bi-partitioning algorithm that partitions a given graph into two equal size partitions while minimizing the size of the multicut. We use it recursively to get a multi-way balanced partitioning such that each partition is smaller than or equal to the cache size B and the multicut size is minimized. (Such a recursive approach has been shown to yield a smaller multicut size than iterative approaches [24].)

Since graph bi-partitioning is NP-Hard, the F-M algorithm uses heuristics for bi-partitioning. It starts with a random balanced bi-partitioning and iteratively reduces the multicut size. It defines the gain of a vertex as the difference between the number of its adjacent vertices in its opposite partition and the number of its adjacent vertices in its current partition. In each iteration, the algorithm considers each vertex in the descending order of gains and tentatively moves it to the opposite partition. After a vertex is moved to the opposite partition, the gains of all its adjacent vertices are updated. Finally, the algorithm finalizes the first k moves such that the total gain of the first k vertices is maximized and the resulting partitions are balanced. The algorithm stops whenever an iteration cannot improve the current bi-partitioning. The algorithm is shown to converge in very few iterations (< 10) [6].
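The following is a compact, simplified rendering of one F-M pass — our sketch, with unit edge weights, a dense gain computation instead of F-M's bucket structure, and balance maintained by always moving a vertex from the larger side.

```python
import numpy as np

def fm_pass(adj, side):
    """One Fiduccia-Mattheyses-style pass. Every vertex is tentatively
    moved exactly once, taking the best-gain vertex from the larger side
    so moves alternate sides; afterwards only the prefix of moves with
    the highest cumulative gain (and balanced sides) is kept."""
    side = side.copy()
    free, moves, gains = list(range(len(side))), [], []
    while free:
        big = 1 if (side == 1).sum() >= (side == -1).sum() else -1
        cand = [v for v in free if side[v] == big]
        # gain(v) = neighbors in the opposite partition - neighbors in own
        v = max(cand, key=lambda u: adj[u] @ (side != side[u])
                                    - adj[u] @ (side == side[u]))
        gains.append(int(adj[v] @ (side != side[v])
                         - adj[v] @ (side == side[v])))
        side[v] = -side[v]                  # tentative move
        moves.append(v)
        free.remove(v)
    best, keep, total = 0, 0, 0
    for i, g in enumerate(gains, 1):        # even prefixes restore balance
        total += g
        if i % 2 == 0 and total > best:
            best, keep = total, i
    for v in moves[keep:]:                  # roll back the unprofitable tail
        side[v] = -side[v]
    return side

# Pruning Matrix of Figure 2(a); vertices 0..7 stand for signals 1..8
edges = [(0, 1), (2, 3), (4, 5), (5, 6), (6, 7), (0, 4), (1, 5), (2, 6), (3, 7)]
adj = np.zeros((8, 8), dtype=int)
for a, b in edges:
    adj[a, b] = adj[b, a] = 1
side = np.array([1, 1, 1, 1, -1, -1, -1, -1])   # batches {1,2,3,4} / {5,6,7,8}
print(fm_pass(adj, side))   # one pass recovers {1,2,5,6} / {3,4,7,8}, multicut 1
```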
For multi-way partitioning, we recursively use the F-M algorithm to partition an input graph with n vertices into M partitions such that the size of each partition is ≤ B. Ideally, we should continue recursive partitioning until we get M′ = ⌈n/B⌉ partitions. However, such a restriction does not give the partitioning algorithm enough flexibility to find good partitions. So, we continue partitioning until we get M > M′, say M = 2⌈n/B⌉, partitions. This results in good partitions, along with a few partitions significantly smaller than B. At the end, we merge those small partitions together to produce larger partitions of size ≤ B. The final partitions determine the batches of signals in our caching strategy mentioned before.

[Figure 4: (a) An approximate threshold corr. matrix C^ε (ε = 0.04, T = 0.5), and (b) a Boolean threshold corr. matrix C^B (T = 0.5), corresponding to the matrix C in Figure 1(a).]

PROBLEM 2 (APPROXIMATE THRESHOLD CORRELATION MATRIX). Given n signals of equal length m, a threshold T, and an error bound ε, compute an n × n approximate threshold correlation matrix C^ε such that C^ε[i, j] = corr^ε(i, j) if corr^ε(i, j) > T, and C^ε[i, j] = 0 otherwise, where corr^ε(i, j) is within ε of corr(i, j).
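The ε-approximation relies on the fact that, for cooperative signals, a short prefix of DFT coefficients carries most of the energy. The sketch below is our simplification — the paper's algorithm additionally chooses the number of coefficients so that the ignored energy keeps the error below ε — estimating a correlation from only the first k coefficients:

```python
import numpy as np

def approx_corr(x, y, k):
    """Estimate corr(x, y) from the first k DFT coefficients of the
    z-normalized signals, via Lemma 1. Ignoring the high-frequency
    coefficients slightly underestimates the distance, so the estimate
    is close for signals with energy concentrated in low frequencies."""
    m = len(x)
    X = np.fft.fft((x - x.mean()) / x.std()) / np.sqrt(m)
    Y = np.fft.fft((y - y.mean()) / y.std()) / np.sqrt(m)
    # coefficient 0 is zero after normalization; coefficients 1..k-1 and
    # their conjugate mirrors carry equal energy, hence the factor 2
    d2 = 2 * np.sum(np.abs(X[1:k] - Y[1:k]) ** 2)
    return 1 - d2 / (2 * m)

t = np.linspace(0, 1, 2880)
rng = np.random.default_rng(2)
x = np.sin(2 * np.pi * 3 * t) + 0.05 * rng.standard_normal(t.size)
y = np.sin(2 * np.pi * 3 * t + 0.3) + 0.05 * rng.standard_normal(t.size)
print(np.corrcoef(x, y)[0, 1], approx_corr(x, y, k=5))
```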
Two signals have a lagged correlation with lag l if they look very similar when one signal is delayed by l time ticks. Formally, given two signals x and y of equal length m, their correlation with lag l is

$$\mathrm{corr}_l(x, y) = \frac{1}{m-l} \sum_{i=0}^{m-l-1} \frac{(x_i - \mu_x)}{\sigma_x} \cdot \frac{(y_{i+l} - \mu_y)}{\sigma_y},$$

where µ_x, µ_y, σ_x, σ_y are defined on the overlapping parts of the two signals. Note that synchronous correlation is a special form of lagged correlation with l = 0.

We further define the maximum lagged correlation of two signals as the maximum of their correlations over all possible lags. Note that in the above definition, only one signal has been lagged or shifted. If both signals are periodic, shifting either signal yields the same maximum lagged correlation. Otherwise, if either signal is aperiodic, both signals need to be shifted to find the maximum lagged correlation. For simplicity of description, we here consider shifting only one signal.

We now show how to extend our previous algorithms to consider maximum lagged correlations instead of synchronous correlation. Our approach is similar to BRAID [21], which discovers the maximum lagged correlation of a signal pair in O(lg m) time, where m is the maximum possible lag. Instead of computing correlations for all possible lags to find the maximum, BRAID probes in geometric progression and interpolates the remaining values of the cross correlation function. Although BRAID is designed for streaming applications, it can easily be adapted for use in a stream warehousing scenario. However, BRAID computes correlations in the time domain, which can be significantly expensive for a large number of long signals. We now show how BRAID can be used in the frequency domain to avoid such cost.

Note that to compute the lagged correlation of x and y with a lag l, one signal is first shifted (or, lagged) by l time ticks while keeping the other signal fixed, and then the correlation is computed over their trimmed, common parts of length (m − l). Without loss of generality, assume that the common parts include a prefix of signal x and a suffix of signal y, both of length (m − l); i.e., x_0 is aligned with y_l to compute lagged correlation with a lag of l. To work in the frequency domain, a naïve solution would compute the DFT of all prefixes of x and suffixes of y. However, the following lemma shows that we can compute the DFT coefficients of a signal once, and then reuse them to compute the coefficients of any prefix or suffix of the signal. For the length-(m − l) prefix, its r-th DFT coefficient can be expressed in terms of the full-signal coefficients X_p as

$$\frac{1}{m-l}\left[\, X_{\frac{mr}{m-l}} + \sum_{p=0,\; p \neq \frac{mr}{m-l}}^{m-1} X_p \, \frac{e^{2\pi i \left(p - \frac{pl}{m} - r\right)} - 1}{e^{2\pi i \left(\frac{p}{m} - \frac{r}{m-l}\right)} - 1} \right]$$

The proof for the suffix is similar.

Thus, once the DFT coefficients of x̂ and ŷ are computed, they can be reused to compute the DFT coefficients of all their prefixes and suffixes, and hence be used for computing correlations with arbitrary lags.
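As a time-domain reference point for these definitions, the sketch below (ours) computes corr_l directly on the overlap and scans all lags for the maximum — exactly the per-lag work that BRAID's geometric probing and our frequency-domain reuse of coefficients avoid:

```python
import numpy as np

def lagged_corr(x, y, l):
    """corr_l(x, y): Pearson correlation of x's prefix with y's suffix,
    both of length m - l, with means and stds taken on the overlap."""
    xs, ys = x[: len(x) - l], y[l:]
    return np.corrcoef(xs, ys)[0, 1]

def max_lagged_corr(x, y, max_lag):
    """Naive maximum lagged correlation over lags 0..max_lag."""
    return max(lagged_corr(x, y, l) for l in range(max_lag + 1))

# y is x delayed by 100 ticks, so the maximum (1.0) occurs at lag 100
rng = np.random.default_rng(4)
x = np.cumsum(rng.standard_normal(2880))
y = np.roll(x, 100)
print(max_lagged_corr(x, y, max_lag=200))
```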
The above result enables us to efficiently compute lagged versions of C^ε and C^B in the frequency domain. In a lagged C^ε, an entry C^ε[i, j] is an ε-approximation of the maximum lagged correlation of signals i and j. Similarly, in a lagged C^B, an entry C^B[i, j] is 1 iff the maximum lagged correlation of signals i and j is above the given threshold.

However, there is a caveat. The basic idea above requires all DFT coefficients of an original signal; in contrast, our approximation algorithms compute only the first few DFT coefficients. Thus, we need to approximate the remaining coefficients with zeros, which introduces errors in the DFT coefficients we compute for prefixes and suffixes. In general, this may cause our algorithms to violate their approximation guarantees. In practice, however, the effect is very small because our algorithms compute as many DFT coefficients as required to capture most of the energy of a signal; hence, ignoring the remaining coefficients does not affect the accuracy much. We will experimentally validate this in Section 7.

7. EVALUATION

We evaluate our algorithms using the same machine described in Section 3. We use four datasets. To perform experiments on massive-sized data, we replicate every signal in a dataset an equal number of times with small additive white noise. This preserves the pairwise correlation structure of the original dataset.

• DataCenter1 contains measurements of the number of TCP connections established over a day to 350 servers in a real data center. One measurement is made every 30 seconds, and so a signal for a day consists of m = 2,880 samples. The dataset contains n = 11,200 signals.
• DataCenter2 is a collection of measurements of the CPU utilization of 120 servers in a real data center. One measurement is made every 30 seconds, and a signal for a day consists of m = 2,880 samples. The dataset contains n = 4,745 signals.

• Chlorine [17] is a collection of signals representing chlorine concentration at different junctions in a water distribution network. The original dataset has 166 signals of two-week-long traces (one sample every 5 minutes). We use week-long (m = 2,155 samples) traces and replicate them to create a dataset of n = 10,624 signals.

• RandomWalk is a collection of n = 16,384 random walk signals generated synthetically using Gaussian steps. Each signal is m = 2,880 samples long.

[Figure 5: Impact of I/O and CPU optimizations for computing a threshold correlation matrix. (a) I/O cost reduction over a random-partition baseline (reduction factors between 1.8× and 3.5×); (b) impact of cache size (I/O time vs. the data size : cache size ratio); (c) CPU cost reduction (speedup vs. threshold). DC1 = DataCenter1, DC2 = DataCenter2, RW = RandomWalk.]

7.1 Impacts of I/O Optimizations

To show the benefit of our min-cut partitioning based caching strategy, we compare it with a baseline caching strategy where signals are randomly partitioned. The only difference between the two caching strategies is how they partition signals into batches—both are used for computing a C^T, and both prune signal pairs based on the same Pruning Matrix. We assume the cache is big enough to hold 1/32 of a dataset. The result is shown in Figure 5(a). As shown, our min-cut partitioning significantly (1.8×–3.5×) reduces the I/O time for all the datasets (the factor of reduction in I/O time is shown at the top of the second bar). This reduction is attributed to our careful partitioning, which reduces the number of disk I/Os required to compute the correlation of signal pairs across different batches. However, the overhead of partitioning is never more than 30 seconds, which is tiny compared to the end-to-end response time.

The I/O cost of our caching strategy can be reduced by using a bigger cache. To show the impact of cache size, we vary the cache size keeping the data size fixed. Figure 5(b) shows I/O costs as a function of the ratio of data size to cache size. As shown, the I/O time decreases linearly with the increase in available cache size, and becomes < 10 minutes for a cache of size n/16. This is due to the fact that with a larger cache, data is partitioned into fewer batches, and hence fewer disk I/Os are required to compute the correlation of signal pairs across batches. The slopes of the different lines demonstrate the amount of correlation present in different datasets. The more correlated pairs a dataset has, the larger the slope. The RandomWalk dataset has the smallest slope among all datasets, as it has the least correlation among signals.

7.2 CPU Speedup due to Approximation

We now show how much our approximation algorithms reduce the CPU cost of computing a correlation matrix. As a shorthand, we use the following notations:

• A: an algorithm to compute an all-pair exact correlation matrix C in the time domain,

• A^T: an algorithm, described below, for computing an exact threshold correlation matrix C^T,

• A^ε: our algorithm for computing an approximate threshold correlation matrix C^ε (Section 5.1),

• A^B: our algorithm for computing a threshold Boolean correlation matrix C^B (Section 5.2).

For different algorithms, we report speedup factors. The speedup of an algorithm is the ratio of the end-to-end CPU time of a baseline algorithm to that of the algorithm. The higher the speedup, the faster the algorithm. For our approximation algorithms A^ε and A^B, we use A^T as the baseline. Before reporting the speedups of our algorithms, we first report the speedup factor of A^T, with A as the baseline. This allows us to interpret the speedup factors of A^ε and A^B with respect to A as well.

► Threshold Correlation Matrix. To compute C^T for a given threshold, A^T prunes uncorrelated signal pairs based on the distances of their first k = 5 DFT coefficients (similar to the methods in [1, 27]). The exact correlations of likely correlated signal pairs are then computed in the time domain. Figure 5(c) shows, for different correlation thresholds, the speedup factors of A^T with respect to A (which takes more than 90 minutes of CPU time for all the datasets). As shown, A^T can be several times faster than A, specifically with high thresholds (e.g., with T = 0.9, A^T is > 3× faster than A). The speedup increases as the threshold increases; this is because more and more uncorrelated signal pairs can be pruned as the threshold increases.

► Approximate Threshold Correlation. Figure 6(a) shows the speedup of A^ε, with respect to A^T, for different approximation error bounds ε. As before, we use k = 5 DFT coefficients for pruning and T = 0.9 as the threshold. The graph shows that even with a very small error, e.g., ε = 0.02, A^ε is significantly faster than A^T for all real datasets. For example, with ε = 0.02, the speedups for the Chlorine and DataCenter2 datasets are 17 and 9, respectively. The speedup is small for RandomWalk because most of its energy is captured by very few of its leading coefficients [1], helping the baseline algorithm A^T to perform extremely well (also shown in Figure 5(c)) with such data. The speedup increases with the error tolerance, as this allows A^ε to compute fewer DFT coefficients.

Figure 6(b) shows the speedup of A^ε for different thresholds. In general, the absolute execution time of A^ε is not affected much by different thresholds. In contrast, the baseline algorithm A^T runs faster with bigger thresholds (as shown in Figure 5(c)). Therefore, the speedup of A^ε with respect to A^T decreases with increasing threshold, as shown in Figure 6(b).

[Figure 6: Speedup for computing an approximate and a Boolean threshold correlation matrix, with respect to computing an exact threshold correlation matrix. (a) Speedup vs. error ε; (b), (c) speedup vs. threshold; one line per dataset (Chlorine, DataCenter1, DataCenter2, RandomWalk).]
Since the speedup of A^T is reported with respect to A, and other speedups are reported with respect to A^T, we can combine the speedups. For example, for the DataCenter1 dataset, A^ε is 18.75 times faster than A, for a threshold T = 0.9 and an error bound ε = 0.06. Similarly, A^B is 8.34 times faster than A, for a threshold T = 0.9. For the DataCenter2 dataset these numbers are 26.4 and 4.1.⁵

⁵ We assume a large cache, so that I/O cost is small, to compute the speedup.

► Absolute savings. Table 3 shows the absolute CPU time of different algorithms on different datasets. This highlights that, in addition to significant relative speedups, our algorithms deliver significant absolute savings in the execution time of different correlation queries.

Table 3: Absolute CPU time (minutes) of different algorithms on different datasets.

  Dataset       A      A^T     A^ε     A^B
  Chlorine      98     67.5    3.5     10.4
  RandomWalk    207    35.2    6.6     29.0

In none of the experiments above do our algorithms result in a speedup less than 1. This highlights that our algorithms are never slower than A^T or A with our datasets.

7.3 Lagged Correlation

► Error. As mentioned in Section 6.2, our algorithm for computing a lagged C^ε may violate the ε-approximation guarantee. We now experimentally measure this effect for computing a lagged C^ε with ε = 0.04. After computing the lagged C^ε, we count the number of signal pairs that violate the ε-approximation guarantee; i.e., for which the true maximum lagged correlation is more than ε away from our estimated maximum lagged correlation. We define the percentage of signal pairs violating the approximation guarantee as the error of our algorithm due to lag. Note that, without any lag, our algorithm has an error of 0, as it never violates the approximation guarantee.

Figure 7 shows the error of our algorithm as a function of the maximum lag. Lags are shown as percentages of the entire signal length (= m). As shown, the error is very small for all datasets for reasonable lag values. In particular, Chlorine and RandomWalk incur close to zero error with a maximum lag of 5% of the entire signal length. All datasets have errors < 2% even for a large maximum lag of 20% of an entire signal. Lags are typically much smaller in practice; e.g., for DataCenter1, a signal represents an entire day, and hence a 10% lag means a lag of 2.4 hours, which is extremely unlikely in a data center. Thus, for practical values of the maximum lag, our algorithm incurs a very small error.

[Figure 7: Error (%) due to lag as a function of the maximum lag, per dataset.]

► Speedup. The small error above comes with a significant benefit of speedup. In Figure 8, we report the speedup of our algorithm to compute a lagged C^ε for ε = 0.04, with respect to computing C^T with BRAID [21], the state-of-the-art algorithm for computing lagged correlation. Both BRAID and our algorithm are configured to compute correlation coefficients for 16 different lags (recall that BRAID considers a logarithmic number of lags). Figure 8 shows that our algorithm is 35× to 71× faster than BRAID for different datasets.
This huge speedup comes because our algorithm works in the frequency domain and reuses DFT coefficients across different lags.

[Figure 8: Speedup for lagged correlation (DC1 = DataCenter1, DC2 = DataCenter2, RW = RandomWalk).]

8. CONCLUSION

We have proposed novel algorithms, based on the Discrete Fourier Transform (DFT) and graph partitioning, to reduce the end-to-end response time of an all-pair correlation query. To optimize I/O cost, we intelligently partition a massive input signal set into smaller batches such that caching the signals one batch at a time minimizes disk I/O. To optimize CPU cost, we have proposed two approximation algorithms. Our algorithms have strict error guarantees, which makes them as useful as the corresponding exact solutions for many real applications. Moreover, compared to the state-of-the-art exact solution, our algorithms are up to 17× faster for several real datasets.

9. REFERENCES

[1] R. Agrawal, C. Faloutsos, and A. N. Swami. Efficient similarity search in sequence databases. In 4th International Conference on Foundations of Data Organization and Algorithms (FODO), 1993.
[2] R. Cole, D. Shasha, and X. Zhao. Fast window correlations over uncooperative time series. In SIGKDD, pages 743–749, 2005.
[3] G. Even. Fast approximate graph partitioning algorithms. SIAM J. Comput., 28(6):2187–2214, 1999.
[4] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In SIGMOD, 1994.
[5] C. E. Ferreira, A. Martin, C. C. de Souza, R. Weismantel, and L. A. Wolsey. The node capacitated graph partitioning problem: a computational study. Math. Program., 81(2):229–256, 1998.
[6] C. M. Fiduccia and R. M. Mattheyses. A linear-time heuristic for improving network partitions. In DAC '82: Proceedings of the 19th Design Automation Conference, pages 175–181, 1982.
[7] J. Gehrke, F. Korn, and D. Srivastava. On computing correlated aggregates over continual data streams. In SIGMOD, 2001.
[8] L. Hagen and A. B. Kahng. Fast spectral methods for ratio cut partitioning and clustering. In Computer-Aided Design, 1991, pages 10–13, 1991.
[9] T. Hey, S. Tansley, and K. Tolle. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.
[10] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. KAIS, 3:263–286, 2000.
[11] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Locally adaptive dimensionality reduction for indexing large time series databases. SIGMOD Rec., 30:151–162, 2001.
[12] C.-S. Li, P. S. Yu, and V. Castelli. HierarchyScan: A hierarchical similarity search algorithm for databases of long sequences. In ICDE, 1996.
[13] X. Lian and L. Chen. Efficient similarity search over future stream time series. TKDE, 20(1):40–54, 2008.
[14] X. Lian, L. Chen, and B. Wang. Approximate similarity search over multiple stream time series. In Advances in Databases: Concepts, Systems and Applications, pages 962–968, 2008.
[15] X. Lian, L. Chen, J. X. Yu, J. Han, and J. Ma. Multiscale representations for fast pattern matching in stream time series. TKDE, 21(4):568–581, 2009.
[16] C. Loboz, S. Smyl, and S. Nath. DataGarage: Warehousing massive amounts of performance data on commodity servers. Technical Report MSR-TR-2010-22, Microsoft Research, March 2010.
[17] S. Papadimitriou, J. Sun, and C. Faloutsos. Streaming pattern discovery in multiple time-series. In VLDB, pages 697–708, 2005.
[18] D. Rafiei. On similarity-based queries for time series data. In ICDE, 1999.
[19] D. Rafiei and A. Mendelzon. Similarity-based queries for time series data. In SIGMOD, 1997.
[20] G. Reeves, J. Liu, S. Nath, and F. Zhao. Managing massive time series streams with multiscale compressed trickles. In VLDB, 2009.
[21] Y. Sakurai, S. Papadimitriou, and C. Faloutsos. BRAID: stream mining through group lag correlations. In SIGMOD, 2005.
[22] M. Vlachos, S. S. Kozat, and P. S. Yu. Optimal distance bounds on time-series data. In SDM, 2009.
[23] M. Vlachos, C. Meek, Z. Vagena, and D. Gunopulos. Identifying similarities, periodicities and bursts for online search queries. In SIGMOD, 2004.
[24] M. Wang, S. Lim, J. Cong, and M. Sarrafzadeh. Multi-way partitioning using bi-partition heuristics. In Proceedings of the 2000 Asia and South Pacific Design Automation Conference, 2000.
[25] H. Yang and D. F. Wong. Efficient network flow based min-cut balanced partitioning. In Proceedings of the 1994 IEEE/ACM International Conference on Computer-Aided Design, pages 50–55, 1994.
[26] B.-K. Yi, N. Sidiropoulos, T. Johnson, A. Biliris, H. Jagadish, and C. Faloutsos. Online data mining for co-evolving time sequences. In ICDE, 2000.
[27] Y. Zhu and D. Shasha. StatStream: statistical monitoring of thousands of data streams in real time. In VLDB, 2002.