Fast Approximate Correlation For Massive Time-Series Data
Previous works have shown the importance of computing cross correlation of a large number of signals. In [20], authors mention five important queries in a data center management application, out of which three queries related to server dependency analysis, load balancing, and anomaly detection involve computing correlation matrices. For example, Figure 1(a) shows a correlation matrix of 350 signals, where each signal represents the number of TCP connections in a server measured every 30 seconds in a day. The matrix shows the existence of several load balanced clusters of machines. If such a correlation matrix is periodically computed, any deviation between successive matrices can indicate possible anomalies. Similarly, many eScience questions are better understood by correlation matrices [9]. In [21] and [27], authors provide examples of sensing and stock trading applications requiring cross correlation of tens of thousands of signals.

[Figure 1: Correlation matrices; (a) correlation matrix C, (b) threshold correlation matrix C^T. Darker pixels represent higher correlation coefficients (e.g., a black pixel represents a corr. coeff. of 1).]

We consider ad-hoc queries, where individual signals are stored on disk and a target set of signals and a target time window are defined at query time. We assume that no precomputed index specifically designed for correlation computation (such as [1, 27]) exists, because of its large overhead and its inefficiency in handling ad-hoc queries. Fast computation of a large correlation matrix in this setting is challenging because of its high I/O and CPU overhead. High I/O costs are encountered because, due to limited memory, signals may need to be read from disk to memory multiple times. High CPU costs result from examining all pairs of signals and doing expensive floating point operations. As we will show in Section 3, computing a correlation matrix for 10,000 signals of length 2880 using a naive approach can take several hours on a standard desktop computer. If multiple such matrices are to be computed (e.g., one matrix for each day of signals), this can take up to several days on a single machine. Such large response times are unacceptable for interactive data exploration tasks in a SWS.

To address the above challenges, we consider a slightly relaxed, yet almost equally useful, version of the problem. In many applications, including the data center management application, users are typically interested in correlated (e.g., correlation above a given threshold T) signal pairs—uncorrelated signal pairs are typically not of much interest. Therefore, while the exact correlations of correlated pairs need to be computed, those of uncorrelated pairs can often be safely ignored. The general problem we consider in this paper is as follows.

PROBLEM 1 (THRESHOLD CORRELATION MATRIX). Given n signals of equal length m and a threshold T, for 1 ≤ i, j ≤ n, compute an n × n threshold correlation matrix C^T such that C^T[i, j] = corr(i, j) if |corr(i, j)| ≥ T and 0 otherwise.

Figure 1(b) shows an example of a threshold correlation matrix with a threshold of T = 0.5. As shown, some of the gray pixels with correlation < 0.5 in Figure 1(a) are absent in Figure 1(b); yet, Figure 1(b) shows the prominent clusters of similar signals.
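For concreteness, the following is a minimal NumPy sketch of Problem 1. It is our own illustration, not code from the paper, and it is exactly the naive O(n²m) computation that the techniques below are designed to avoid.

```python
import numpy as np

def threshold_corr_matrix(signals, T):
    """Naive threshold correlation matrix C^T (Problem 1): compute all
    pairwise Pearson correlations, then zero out entries with |corr| < T."""
    C = np.corrcoef(signals)      # n x n correlation matrix, O(n^2 m) work
    C[np.abs(C) < T] = 0.0        # keep only correlated pairs
    return C

# Example: 4 synthetic random-walk signals of length 2880
rng = np.random.default_rng(0)
sigs = np.cumsum(rng.standard_normal((4, 2880)), axis=1)
print(threshold_corr_matrix(sigs, T=0.5))
```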
To reduce I/O costs, we propose a novel data partitioning algorithm that divides a given dataset into batches such that each batch fits in the available memory. More importantly, batches are carefully created such that most signal pairs within a single batch are correlated, while many pairs with signals in different batches are uncorrelated. Thus, as signals are read into the memory one batch at a time, signals that are mutually correlated with each other reside in the memory at the same time—in this way, the correlation coefficients of a cached signal with a large number of other cached signals can be computed without additional I/Os. We show that the problem can be modeled as a graph partitioning problem and be solved using efficient heuristics.

To reduce CPU costs, we propose two novel approximation algorithms for computing a correlation matrix. Our first approximation algorithm computes correlation coefficients within a given error bound, while the second algorithm identifies signal pairs with correlation above a threshold without any false positives or false negatives. Our algorithms take advantage of computational shortcuts in the Discrete Fourier Transformation (DFT) space and are significantly faster than an exact algorithm; yet, their approximation guarantees keep them useful for many real-world applications, including our running example of the data center management application.

Our approximation algorithms can be used in combination (with or without an exact algorithm) during interactive data exploration. For example, a data center operator can first use our first approximation algorithm to identify highly correlated signal pairs. If the correlation pattern looks interesting, he can further use our second approximation algorithm to remove any false positives. If further needed, he can use an exact algorithm to compute exact correlations. Such an approach gives a user the flexibility to stop data exploration early (e.g., after the first step) or continue for greater fidelity answers at the cost of increased computational complexity.

In summary, we make the following contributions.

• We propose a novel caching algorithm for computing a threshold correlation matrix. The algorithm uses DFT and graph partitioning to optimize overall I/O cost (§ 4).

• We propose two efficient approximation algorithms. The first algorithm approximates entries in a threshold correlation matrix within a given error bound. The second algorithm efficiently identifies all correlated signal pairs without any false positives or negatives (§ 5).

• We propose extensions to our basic algorithms to support anti-correlation and lagged correlation (§ 6).

• Using several real datasets, we evaluate our proposed algorithms. Our evaluation shows that our algorithms are up to 17× and 71× faster than existing algorithms for synchronous and lagged correlation, respectively (§ 7).

This work is a part of a broader project, called DataGarage [16], for building a data-driven data center management system. DataGarage aims to collect and archive monitoring data from tens of thousands of servers, to enable users to run ad-hoc and routine data mining queries on massive data, and to provide useful control feedback to data center operators for load balancing, energy optimization, anomaly detection, etc. In that context, this work provides an important building block to mine massive data and to gather valuable insights for data center operators.

2. RELATED WORK

Correlation is a similarity measure, and prior works have extensively considered the problem of discovering similar sequences from a large number of sequences. Traditionally, the Euclidean distance is used to capture the similarity. The original work by Agrawal et al. [1] considered discovering similarity between an online sequence and an indexed database of previously obtained sequence information. The proposed techniques focused on whole sequence matching and utilized the DFT to transform data from the
time domain into the frequency domain, and used a multidimensional index structure to index the first few DFT coefficients. The work was later generalized to allow subsequence matching [4] and transformations such as scaling and shifting [19]. Our work differs from them in that i) we consider the Pearson correlation coefficient, which is optimal for detecting linear relationships, ii) we do not assume the existence of any precomputed index, since our sequence set and the target time window are defined ad-hoc during query time, and iii) we consider computing correlation of all sequence pairs, instead of only the pairs involving a given sequence. From an algorithmic point of view, our caching and approximation algorithms are novel and were not considered by this previous work.

StatStream [27] and HierarchyScan [12], like our work, consider correlation coefficients as a similarity measure. StatStream uses DFT to maintain a grid data structure to quickly identify similar streams within a sliding window in real-time. HierarchyScan considers a stream warehouse setting and performs correlation between the stored sequences and a given template pattern in the transformed domain (e.g., using DFT or DWT) hierarchically. [2] uses sketches to correlate uncooperative (i.e., noisy) signals, which are not prevalent in our target applications. Our work differs from these works in that: i) none of the works considers I/O optimization, and ii) none of the works considers bounded approximation of correlation (e.g., HierarchyScan may output false negatives; sketches may output both false positives and negatives).

Our use of DFT is in the similar spirit of reducing the dimensionality of signals. Previous works have used similar ideas to achieve tight lower bounds for pruning signals (e.g., to answer similarity queries [22, 23]). Some existing dimensionality reduction techniques (e.g., APCA [11], PAA [10], MSM [15]) provide better lower bounds for pruning than DFT and DWT. However, none of these techniques allows computing correlation (with bounded error) in the reduced dimensionality space. More specifically, even though these techniques reduce dimensionality to prune uncorrelated (or dissimilar) signals, correlation coefficients (or some other similarity metrics) of signals are computed in the time domain; in contrast, we use DFT to compute approximate correlation in the frequency domain.

Many other prior works consider similarity or related queries in a streaming scenario [7, 13, 14, 26]. These techniques are not adequate for our target applications since they are not designed for stream warehouse settings, and/or do not explicitly consider correlation coefficients, and/or are not shown to scale to tens of thousands of streams.

3. MOTIVATION

We use a real dataset to demonstrate the I/O and CPU complexities of computing a large correlation matrix C in a stream warehouse environment. The dataset, called DataCenter1, records a performance counter from a Microsoft data center (more details of the dataset are in Section 7). It consists of n = 10,752 sequences, each of length m = 2880. Each sequence is stored on disk as a separate file.¹ We use a 2.67 GHz quad core machine with 6GB RAM and a 750 GB Hitachi hard disk of 7200 rpm for this experiment.

¹ Our conclusions hold if all data is stored in a single file.

Table 1: Total time for computing a correlation matrix of n = 10,752 signals for various cache sizes.

  Cache size (# signals)    CPU time (mins)    I/O time (mins)
  n     (Full cache)        93                 4
  n/32  (Partial cache)     93                 40
  2     (No cache)          93                 7639

Our experiments show that computing a correlation matrix for the above dataset is very expensive, especially with limited or no caching (Table 1). In the worst case, without any caching, a signal needs to be read from the disk (n − 1) times, to correlate with all (n − 1) other signals. In such a case, computing a correlation matrix for our dataset takes approximately 129 hours! Obviously, most of the time is spent in reading signals from the disk. Our experiments show an average CPU utilization of only 2% without any cache. On the other hand, if the entire dataset can be cached in DRAM, each signal needs to be read from disk only once and hence the I/O cost is minimized. In that case, the correlation matrix can be computed in around 1 hour 37 minutes, highlighting the fact that more than 127 hours is spent on I/O in the no-cache scenario.

However, in practice, it may not be possible to completely cache a large dataset. For example, the DataCenter1 dataset for 20,000 signals for a month is more than 24GB, which is significantly bigger than the available memory. Thus, only a fraction of the signals can be cached in the DRAM at a time. In such a case, a single signal may need to be read from the disk multiple times to correlate with all other signals. For example, a signal x may be evicted from the limited cache before another signal y is read from the disk; then x must be reread from disk later to compute corr(x, y).

An Optimal Baseline Caching Algorithm. Consider the following optimal baseline caching algorithm for dealing with a limited cache. We define the cache size as the number of signals it can hold at a time. Given n signals and a cache of size (n/q + 1), signals are partitioned into batches {B_i} of size n/q each, except for the last batch, which can be smaller. Batches are determined according to signal IDs; i.e., the first n/q signals are in the first batch B_1, the second n/q signals are in the second batch B_2, and so on. Each batch B_i is brought into the cache at once, and correlation coefficients of all pairs of signals (x, y), x ∈ B_i, y ∈ B_i, are computed. Before the batch B_i is evicted from the cache, all remaining signals z ∈ B_{j>i} are read from the disk one at a time. When such a signal z is read, correlation coefficients of all pairs (x, z), x ∈ B_i, are computed. After that, the next batch of signals is loaded into the cache and the process continues. This simple caching strategy is optimal because every time a signal is read from the disk, the number of correlation coefficients computed with it is exactly n/q, which is the maximum possible with a cache of size (n/q + 1).
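The baseline's I/O pattern is easy to simulate. The small sketch below (ours, not the paper's code) replays the schedule and confirms the n(1 + q)/2 disk-read count derived in the back-of-the-envelope calculation that follows.

```python
def baseline_io_cost(n, q):
    """Disk reads of the baseline caching strategy: q batches of n/q
    signals each, with a cache that holds n/q + 1 signals."""
    batch = n // q
    reads = 0
    for i in range(q):                  # process batch i
        reads += batch                  # load the whole batch at once
        reads += (q - 1 - i) * batch    # stream every later signal past it
    return reads

# Matches the closed form n(1 + q)/2 used in the text:
for n, q in [(10752, 32), (10752, 16)]:
    assert baseline_io_cost(n, q) == n * (1 + q) // 2
    print(n, q, baseline_io_cost(n, q))
```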
We use the above baseline strategy in our experiments with a limited cache. As shown in Table 1, even if we have a cache large enough to hold 1/32 of the entire dataset (a partial cache scenario), the I/O cost remains significant (40 minutes). A careful back-of-the-envelope calculation shows that the I/O cost to compute an all-pair correlation matrix of n signals with a cache of size n/q + 1 is proportional to n(1 + q)/2. Thus, the I/O cost decreases linearly with increasing cache size. We empirically validate this in Section 7.

The above empirical results highlight two main components of the total execution time.

• High I/O cost: With a limited cache, a significant amount of the time is spent reading data from disk.

• High CPU cost: Even if there is enough cache to hold the entire dataset, computation remains expensive, as shown by the Full cache scenario in Table 1.

We next present techniques to reduce these two costs. For simplicity, we assume only positive correlation and synchronized signals in the next two sections. We consider anti-correlation and lagged correlation in Section 6. Table 2 summarizes the symbols we use.
Table 2: Symbols and definitions

  Symbol       Definition
  n            Number of signals
  m            Length of each signal
  x, X         A signal and its DFT
  x̂, X̂         Normalization of x and the DFT of x̂
  d(x, y)      Euclidean distance of x and y
  d_k(x, y)    Euclidean distance of the first k elements of x and y
  T            Correlation threshold
4. REDUCING I/O COSTS

In this section, we present a novel technique to reduce the total I/O cost required to answer a threshold correlation matrix query. As a shorthand, we call two signals correlated if their correlation coefficient is above the given threshold, and uncorrelated otherwise. Thus, we need to compute correlation coefficients for correlated signal pairs only. A naïve algorithm would require computing the correlation of all pairs of n signals and would have an O(n²) I/O cost. A hypothetical optimal algorithm can reduce the cost in at least two ways. The first technique is pruning. If the algorithm magically knew which pairs of signals are correlated, it could compute correlation coefficients (and read relevant data from disk) for those pairs only and ignore uncorrelated pairs. The second technique is intelligent caching. The algorithm can read signals from the disk in an optimal order such that signals that are mutually correlated with each other reside in the cache at the same time—thus, a cached signal can be compared with a large number of other cached signals, reducing the amortized I/O cost of reading a signal from the disk.

Realizing such an algorithm requires answering two questions. First, how does the algorithm know which signals are correlated? This must be done with an I/O cost significantly smaller than O(n²)—pruning becomes useless if it itself has an I/O cost close to the O(n²) I/O cost of a naïve algorithm. Without any background knowledge about the nature of the input signals, the algorithm must examine all the signals at least once (i.e., read each signal from disk at least once), implying an O(n) lower bound on I/O cost.

Even after all correlated pairs are identified, the I/O cost to compute correlation coefficients of correlated pairs can still be significantly high if a good caching strategy is not used. We define a caching strategy as the order in which signals are read into and evicted from the cache. The impact of different caching strategies on I/O cost is illustrated by an example in Figure 2. In Figure 2(a), a black cell (i, j) implies that signal i and signal j are correlated (e.g., their correlation coefficient is above a given threshold), and hence we need to compute their exact correlation coefficient. Knowing this information (e.g., from an oracle), an algorithm can read the signals from disk in many different orders, including the two strategies shown in Figure 2(b) and (c). Each step of a strategy shows the signals that are read from the disk (1st column), the cache content after the signals are read (2nd column), and the pairs of signals whose correlation coefficients are computed at this step. We assume the cache can hold at most 4 signals at a time. As shown, both strategies compute the correlations of the same set of signal pairs, but Strategy 2 does so with almost half the I/O cost of Strategy 1.² The example illustrates the importance of choosing a good caching strategy. This leads us to the second challenge: how can one find a good caching strategy to minimize I/O costs?

We next address these two challenges.

² Note that correlation is symmetric; i.e., computing corr(x, y) gives corr(y, x).

4.1 Identifying Correlated Pairs

We use the Discrete Fourier Transform (DFT) to identify correlated signal pairs in an I/O efficient manner. The DFT of a signal x = x_0, x_1, ..., x_{m−1} is a sequence X = X_0, X_1, ..., X_{m−1} = DFT(x) of complex numbers given by

$$X_f = \frac{1}{\sqrt{m}} \sum_{k=0}^{m-1} x_k \, e^{-2\pi i f k / m}, \qquad f = 0, 1, \ldots, m-1.$$

We also define the normalization of x as x̂ = x̂_0, x̂_1, ..., x̂_{m−1}, such that x̂_k = (x_k − µ_x)/σ_x, where µ_x and σ_x are the mean and standard deviation of the values x_0, x_1, ..., x_{m−1}.

As the following lemma suggests, the correlation coefficient of two signals can be reduced to the Euclidean distance between their normalized series.

LEMMA 1 ([18]). The correlation coefficient of two signals x and y is corr(x, y) = 1 − (1/2m)·d²(x̂, ŷ), where d(x̂, ŷ) is the Euclidean distance between x̂ and ŷ.

By reducing the correlation coefficient to Euclidean distance, we can apply the techniques in [27] to report signals with correlation coefficients higher than a specific threshold.

LEMMA 2 ([27]). Let the DFTs of the normalized forms of two signals x and y be X̂ and Ŷ. Then,

$$\mathrm{corr}(x, y) \ge T \;\Rightarrow\; d_k(\hat{X}, \hat{Y}) \le \sqrt{2m(1-T)},$$

where d_k(X̂, Ŷ) is the Euclidean distance between the sequences X̂_0, X̂_1, ..., X̂_{k−1} and Ŷ_0, Ŷ_1, ..., Ŷ_{k−1} for some k ≤ m/2.

Lemma 2 implies that we can safely ignore signal pairs for which d_k(X̂, Ŷ) > √(2m(1 − T)), since they cannot have correlation coefficients above the given threshold T. By ignoring such pairs, we get a set of likely correlated signal pairs. This is a superset of the correlated signal pairs, but there will be no false negatives. A similar technique has been used in previous works [1, 27]. For a large class of real-world signals (called cooperative signals [2]), including our data center data and stock prices, the first few low frequency DFT coefficients are sufficient to capture the overall shape of a signal. For such signals, computing only a small number of low frequency coefficients, e.g., using k = 5, is sufficient for identifying likely correlated signal pairs. The number of false positives can be reduced by using a larger k, which comes at the cost of increased computational overhead.

The above properties of DFT lay the foundation of our I/O efficient detection of correlated pairs. Algorithm 1 shows the details. Given n signals of length m on disk, we read one signal at a time to compute the first k DFT coefficients of each signal, resulting in an O(n) I/O cost.

Algorithm 1 PruneUncorrelated(S, k)
Require: A set S of n signals, where each signal s_i ∈ S has length m
Ensure: Report likely correlated signal pairs
 1: for each signal s_i ∈ S, 0 ≤ i < n do
 2:    Read s_i from disk
 3:    Normalize s_i to ŝ_i
 4:    DFT[i] ← first k DFT coefficients of ŝ_i
 5: for each signal s_i ∈ S, 0 ≤ i < n do
 6:    for each signal s_j ∈ S, i < j < n do
 7:       if d_k(DFT[i], DFT[j]) ≤ √(2m(1 − T)) then
 8:          Report the pair (i, j) as likely correlated
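A direct NumPy transcription of Algorithm 1 might look as follows. This is our own sketch (assuming the 1/√m DFT normalization above, with an in-memory array standing in for the per-signal disk reads), with a quick numerical check of Lemma 1 at the end.

```python
import numpy as np

def znorm(a, axis=-1):
    """Normalization from Section 4.1: zero mean, unit standard deviation."""
    return (a - a.mean(axis=axis, keepdims=True)) / a.std(axis=axis, keepdims=True)

def prune_uncorrelated(signals, k, T):
    """Algorithm 1 (PruneUncorrelated): report likely correlated pairs using
    only the first k DFT coefficients of each normalized signal. By Lemma 2
    there are no false negatives; false positives are filtered later."""
    n, m = signals.shape
    F = np.fft.fft(znorm(signals), axis=1)[:, :k] / np.sqrt(m)
    bound = np.sqrt(2 * m * (1 - T))            # Lemma 2 threshold
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if np.linalg.norm(F[i] - F[j]) <= bound]

rng = np.random.default_rng(5)
base = np.cumsum(rng.standard_normal(2880))
sigs = np.vstack([base,
                  base + 0.1 * rng.standard_normal(2880),  # near-copy of base
                  rng.standard_normal(2880)])              # unrelated noise
print(prune_uncorrelated(sigs, k=5, T=0.9))   # typically reports only (0, 1)

# Sanity check of Lemma 1: corr(x, y) = 1 - d^2(x_hat, y_hat) / (2m)
x, y = sigs[0], sigs[2]
d2 = np.sum((znorm(x) - znorm(y)) ** 2)
assert np.isclose(np.corrcoef(x, y)[0, 1], 1 - d2 / (2 * len(x)))
```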
Figure 2: Computing a threshold correlation matrix with two different caching strategies. The cache can hold 4 signals at a time. Strategy 2 is 1.75× more I/O efficient than Strategy 1.

(a) Pruning Matrix: an 8 × 8 matrix over signals 1–8; a black cell (i, j) marks a likely correlated pair.

(b) Caching Strategy 1, 14 disk reads:

  Signals read   In cache      Computed pairs
  1, 2, 3, 4     1, 2, 3, 4    (1, 2), (3, 4)
  5, 6, 7, 8     5, 6, 7, 8    (5, 6), (6, 7), (7, 8)
  1, 2           1, 2, 5, 6    (1, 5), (2, 6)
  3, 4, 7, 8     3, 4, 7, 8    (3, 7), (4, 8)

(c) Caching Strategy 2, 8 disk reads:

  Signals read   In cache      Computed pairs
  1, 2, 5, 6     1, 2, 5, 6    (1, 2), (1, 5), (2, 6), (5, 6)
  7              2, 5, 6, 7    (6, 7)
  3, 4, 8        3, 4, 7, 8    (3, 4), (3, 7), (4, 8), (7, 8)
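The disk-read counts of Figure 2 can be replayed mechanically; the sketch below (ours) verifies that both schedules cover every required pair while costing 14 and 8 reads, respectively.

```python
def replay(schedule, required_pairs, cache_size=4):
    """Replay a caching schedule: each step is (signals fetched, signals
    evicted). Checks that every required pair co-resides in the cache at
    some step, and returns the total number of disk reads."""
    cache, covered, reads = set(), set(), 0
    for fetch, evict in schedule:
        cache -= set(evict)
        cache |= set(fetch)
        assert len(cache) <= cache_size
        reads += len(fetch)
        covered |= {(a, b) for a in cache for b in cache if a < b}
    assert required_pairs <= covered
    return reads

pairs = {(1, 2), (3, 4), (5, 6), (6, 7), (7, 8),
         (1, 5), (2, 6), (3, 7), (4, 8)}
s1 = [([1, 2, 3, 4], []), ([5, 6, 7, 8], [1, 2, 3, 4]),
      ([1, 2], [7, 8]), ([3, 4, 7, 8], [1, 2, 5, 6])]
s2 = [([1, 2, 5, 6], []), ([7], [1]), ([3, 4, 8], [2, 5, 6])]
print(replay(s1, pairs), replay(s2, pairs))   # 14 8
```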
4.2 Caching Strategy

A caching strategy involves deciding which set of signals to bring into the cache together and how to evict them from the cache. We use the same general framework we used for the optimal baseline algorithm in Section 3: we divide signals into batches and bring them into the cache one batch at a time. However, we introduce two optimizations in the baseline algorithm. First, before a batch is evicted from the cache, the baseline algorithm brings in all signals in the remaining batches, one at a time, to compute correlation coefficients of all signal pairs having exactly one signal in the currently cached batch. In contrast, we use the Pruning Matrix to ignore the uncorrelated pairs; thus we bring a signal into the cache only if it is likely correlated with at least one signal in the current batch. In the best case, if a batch is not correlated with any other signals, no additional signals need to be read before eviction of the batch.

Our second, and the most important, caching optimization carefully chooses the batches. Note that, for each likely correlated signal pair whose two signals are in two different batches, we need to incur an additional disk read. Suppose the Pruning Matrix suggests that signals x and y are likely correlated and hence we need to compute corr(x, y). If they are put in the same batch, they will be read into the cache together and hence corr(x, y) can be computed without additional disk I/O. In contrast, if they are put in different batches, and if the batch containing x is read to cache before the batch containing y, y will be read from disk at least twice—once just before the batch containing x is evicted, to compute corr(x, y), and again when the batch containing y is read into the cache. Thus, computing correlation between signals in different batches incurs additional I/O costs, and therefore we aim to partition the signals into batches such that this additional I/O cost is minimized. In Figure 2, caching strategy 1 uses the two batches {1, 2, 3, 4} and {5, 6, 7, 8}, which results in four likely correlated signal pairs across batches. In contrast, caching strategy 2, which outperforms caching strategy 1, uses the two batches {1, 2, 5, 6} and {3, 4, 7, 8}, resulting in only one such pair, (6, 7), across different batches.

Optimal data partitioning. Fortunately, we can formulate the above optimization problem as the node capacitated graph partitioning problem [5]. Given a graph G = (V, E), edge weights w_e for e ∈ E, and a capacity B, the goal is to find a partition (V_1, V_2, ..., V_φ) of V such that |V_i| < B for 1 ≤ i ≤ φ and such that Σ_{e∈∆} w_e is minimized, where ∆ is the set of edges whose end nodes belong to different elements of the partition, typically called a multicut. The resulting partitioning is called minimum multicut size partitioning, or min-cut partitioning in short.

In our setting, the cache size B defines the capacity, the graph has the set of signals as the nodes, and the weight w_e of the edge e between node i and node j is P[i, j], where P is the Pruning Matrix. Thus, different elements of the resulting partition denote different batches of signals that are read into the cache together. Figure 3 shows an example graph (only edges with weight 1 are shown) representing the Pruning Matrix in Figure 2(a), and the two batches of signals resulting in caching strategy 2 in Figure 2(c). Intuitively, we try to avoid placing pairs of signals that are likely correlated with each other (as indicated by the Pruning Matrix) in different batches.

[Figure 3: Partitioning signals into two batches to minimize the multicut size.]

The above graph partitioning problem is NP-complete [5]. There are many heuristics-based and approximation algorithms for balanced graph partitioning [3, 6, 8, 25]. Many of the algorithms are used in offline circuit partitioning in VLSI design, and hence they optimize for accuracy at the cost of increased execution time. In contrast, we need to partition the graph online, during query execution. Hence, we have chosen the simplest and the fastest of these existing algorithms: the F-M (Fiduccia-Mattheyses) algorithm [6]. F-M is a bi-partitioning algorithm that partitions a given graph into two equal size partitions while minimizing the size of the multicut. We use it recursively to get a multi-way balanced partitioning such that each partition is smaller than or equal to the cache size B and the multicut size is minimized. (Such a recursive approach has been shown to yield a smaller multicut size than iterative approaches [24].)

Since graph bi-partitioning is NP-Hard, the F-M algorithm uses heuristics for bi-partitioning. It starts with a random balanced bi-partitioning and iteratively reduces the multicut size. It defines the gain of a vertex as the difference between the number of its adjacent vertices in its opposite partition and the number of its adjacent vertices in its current partition. In each iteration, the algorithm considers each vertex in the descending order of gains and tentatively moves it to the opposite partition. After a vertex is moved to the opposite partition, the gains of all its adjacent vertices are updated. Finally, the algorithm finalizes the first k moves such that the total gain of the first k vertices is maximized and the resulting partitions are balanced. The algorithm stops whenever an iteration cannot improve the current bi-partitioning. The algorithm is shown to converge in very few iterations (< 10) [6].
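The following is a compact, simplified rendering of one F-M pass — our sketch, with unit edge weights, a dense gain computation instead of F-M's bucket structure, and balance maintained by always moving a vertex from the larger side.

```python
import numpy as np

def fm_pass(adj, side):
    """One Fiduccia-Mattheyses-style pass. Every vertex is tentatively
    moved exactly once, taking the best-gain vertex from the larger side
    so moves alternate sides; afterwards only the prefix of moves with
    the highest cumulative gain (and balanced sides) is kept."""
    side = side.copy()
    free, moves, gains = list(range(len(side))), [], []
    while free:
        big = 1 if (side == 1).sum() >= (side == -1).sum() else -1
        cand = [v for v in free if side[v] == big]
        # gain(v) = neighbors in the opposite partition - neighbors in own
        v = max(cand, key=lambda u: adj[u] @ (side != side[u])
                                    - adj[u] @ (side == side[u]))
        gains.append(int(adj[v] @ (side != side[v])
                         - adj[v] @ (side == side[v])))
        side[v] = -side[v]                  # tentative move
        moves.append(v)
        free.remove(v)
    best, keep, total = 0, 0, 0
    for i, g in enumerate(gains, 1):        # even prefixes restore balance
        total += g
        if i % 2 == 0 and total > best:
            best, keep = total, i
    for v in moves[keep:]:                  # roll back the unprofitable tail
        side[v] = -side[v]
    return side

# Pruning Matrix of Figure 2(a); vertices 0..7 stand for signals 1..8
edges = [(0, 1), (2, 3), (4, 5), (5, 6), (6, 7), (0, 4), (1, 5), (2, 6), (3, 7)]
adj = np.zeros((8, 8), dtype=int)
for a, b in edges:
    adj[a, b] = adj[b, a] = 1
side = np.array([1, 1, 1, 1, -1, -1, -1, -1])   # batches {1,2,3,4} / {5,6,7,8}
print(fm_pass(adj, side))   # one pass recovers {1,2,5,6} / {3,4,7,8}, multicut 1
```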
For multi-way partitioning, we recursively use the F-M algorithm to partition an input graph with n vertices into M partitions such that the size of each partition is ≤ B. Ideally, we should continue recursive partitioning until we get M′ = ⌈n/B⌉ partitions. However, such a restriction does not give the partitioning algorithm enough flexibility to find good partitions. So, we continue partitioning until we get M > M′, say M = 2⌈n/B⌉, partitions. This results in good partitions, along with a few partitions significantly smaller than B. At the end, we merge those small partitions together to produce larger partitions of size ≤ B. The final partitions determine the batches of signals in our caching strategy mentioned before.

[Figure 4: (a) An approximate threshold corr. matrix C^ε (ε = 0.04, T = 0.5), and (b) a Boolean threshold corr. matrix C^B (T = 0.5), corresponding to the matrix C in Figure 1(a).]

PROBLEM 2 (APPROXIMATE THRESHOLD CORRELATION MATRIX). Given n signals of equal length m, a threshold T, and an error bound ε, compute an n × n approximate threshold correlation matrix C^ε such that C^ε[i, j] = corr^ε(i, j) if corr^ε(i, j) > T, and C^ε[i, j] = 0 otherwise, where corr^ε(i, j) is within ε of corr(i, j).
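The ε-approximation relies on the fact that, for cooperative signals, a short prefix of DFT coefficients carries most of the energy. The sketch below is our simplification — the paper's algorithm additionally chooses the number of coefficients so that the ignored energy keeps the error below ε — estimating a correlation from only the first k coefficients:

```python
import numpy as np

def approx_corr(x, y, k):
    """Estimate corr(x, y) from the first k DFT coefficients of the
    z-normalized signals, via Lemma 1. Ignoring the high-frequency
    coefficients slightly underestimates the distance, so the estimate
    is close for signals with energy concentrated in low frequencies."""
    m = len(x)
    X = np.fft.fft((x - x.mean()) / x.std()) / np.sqrt(m)
    Y = np.fft.fft((y - y.mean()) / y.std()) / np.sqrt(m)
    # coefficient 0 is zero after normalization; coefficients 1..k-1 and
    # their conjugate mirrors carry equal energy, hence the factor 2
    d2 = 2 * np.sum(np.abs(X[1:k] - Y[1:k]) ** 2)
    return 1 - d2 / (2 * m)

t = np.linspace(0, 1, 2880)
rng = np.random.default_rng(2)
x = np.sin(2 * np.pi * 3 * t) + 0.05 * rng.standard_normal(t.size)
y = np.sin(2 * np.pi * 3 * t + 0.3) + 0.05 * rng.standard_normal(t.size)
print(np.corrcoef(x, y)[0, 1], approx_corr(x, y, k=5))
```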
Two signals have a lagged correlation with lag l if they look very similar when one signal is delayed by l time ticks. Formally, given two signals x and y of equal length m, their correlation with lag l is

$$\mathrm{corr}_l(x, y) = \frac{1}{m-l} \sum_{i=0}^{m-l-1} \frac{(x_i - \mu_x)}{\sigma_x} \cdot \frac{(y_{i+l} - \mu_y)}{\sigma_y},$$

where µ_x, µ_y, σ_x, σ_y are defined on the overlapping parts of the two signals. Note that synchronous correlation is a special form of lagged correlation with l = 0.

We further define the maximum lagged correlation of two signals as the maximum of their correlations over all possible lags. Note that in the above definition, only one signal has been lagged or shifted. If both signals are periodic, shifting either signal yields the same maximum lagged correlation. Otherwise, if either signal is aperiodic, both signals need to be shifted to find the maximum lagged correlation. For simplicity of description, we here consider shifting only one signal.

We now show how to extend our previous algorithms to consider maximum lagged correlations instead of synchronous correlation. Our approach is similar to BRAID [21], which discovers the maximum lagged correlation of a signal pair in O(lg m) time, where m is the maximum possible lag. Instead of computing correlations for all possible lags to find the maximum, BRAID probes in geometric progression and interpolates the remaining values of the cross correlation function. Although BRAID is designed for streaming applications, it can easily be adapted for use in a stream warehousing scenario. However, BRAID computes correlations in the time domain, which can be significantly expensive for a large number of long signals. We now show how BRAID can be used in the frequency domain to avoid such cost.

Note that to compute the lagged correlation of x and y with a lag l, one signal is first shifted (or, lagged) by l time ticks while keeping the other signal fixed, and then the correlation is computed over their trimmed, common parts of length (m − l). Without loss of generality, assume that the common parts include a prefix of signal x and a suffix of signal y, both of length (m − l); i.e., x_0 is aligned with y_l to compute lagged correlation with a lag of l. To work in the frequency domain, a naïve solution would compute the DFT of all prefixes of x and suffixes of y. However, the following lemma shows that we can compute the DFT coefficients of a signal once, and then reuse them to compute the coefficients of any prefix or suffix of the signal. For the length-(m − l) prefix, its r-th DFT coefficient can be expressed in terms of the full-signal coefficients X_p as

$$\frac{1}{m-l}\left[\, X_{\frac{mr}{m-l}} + \sum_{p=0,\; p \neq \frac{mr}{m-l}}^{m-1} X_p \, \frac{e^{2\pi i \left(p - \frac{pl}{m} - r\right)} - 1}{e^{2\pi i \left(\frac{p}{m} - \frac{r}{m-l}\right)} - 1} \right]$$

The proof for the suffix is similar.

Thus, once the DFT coefficients of x̂ and ŷ are computed, they can be reused to compute the DFT coefficients of all their prefixes and suffixes, and hence be used for computing correlations with arbitrary lags.
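As a time-domain reference point for these definitions, the sketch below (ours) computes corr_l directly on the overlap and scans all lags for the maximum — exactly the per-lag work that BRAID's geometric probing and our frequency-domain reuse of coefficients avoid:

```python
import numpy as np

def lagged_corr(x, y, l):
    """corr_l(x, y): Pearson correlation of x's prefix with y's suffix,
    both of length m - l, with means and stds taken on the overlap."""
    xs, ys = x[: len(x) - l], y[l:]
    return np.corrcoef(xs, ys)[0, 1]

def max_lagged_corr(x, y, max_lag):
    """Naive maximum lagged correlation over lags 0..max_lag."""
    return max(lagged_corr(x, y, l) for l in range(max_lag + 1))

# y is x delayed by 100 ticks, so the maximum (1.0) occurs at lag 100
rng = np.random.default_rng(4)
x = np.cumsum(rng.standard_normal(2880))
y = np.roll(x, 100)
print(max_lagged_corr(x, y, max_lag=200))
```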
The above result enables us to efficiently compute lagged versions of C^ε and C^B in the frequency domain. In a lagged C^ε, an entry C^ε[i, j] is an ε-approximation of the maximum lagged correlation of signals i and j. Similarly, in a lagged C^B, an entry C^B[i, j] is 1 iff the maximum lagged correlation of signals i and j is above the given threshold.

However, there is a caveat. The basic idea above requires all DFT coefficients of an original signal; in contrast, our approximation algorithms compute only the first few DFT coefficients. Thus, we need to approximate the remaining coefficients with zeros, which introduces errors in the DFT coefficients we compute for prefixes and suffixes. In general, this may cause our algorithms to violate their approximation guarantees. In practice, however, the effect is very small because our algorithms compute as many DFT coefficients as required to capture most of the energy of a signal; hence, ignoring the remaining coefficients does not affect the accuracy much. We will experimentally validate this in Section 7.

7. EVALUATION

We evaluate our algorithms using the same machine described in Section 3. We use four datasets. To perform experiments on massive-sized data, we replicate every signal in a dataset an equal number of times with small additive white noise. This preserves the pairwise correlation structure of the original dataset.

• DataCenter1 contains measurements of the number of TCP connections established over a day to 350 servers in a real data center. One measurement is made every 30 seconds, and so a signal for a day consists of m = 2,880 samples. The dataset contains n = 11,200 signals.
• DataCenter2 is a collection of measurements of the CPU utilization of 120 servers in a real data center. One measurement is made every 30 seconds, and a signal for a day consists of m = 2,880 samples. The dataset contains n = 4,745 signals.

• Chlorine [17] is a collection of signals representing chlorine concentration at different junctions in a water distribution network. The original dataset has 166 signals of two-week-long traces (one sample every 5 minutes). We use week-long (m = 2,155 samples) traces and replicate them to create a dataset of n = 10,624 signals.

• RandomWalk is a collection of n = 16,384 random walk signals generated synthetically using Gaussian steps. Each signal is m = 2,880 samples long.

[Figure 5: Impact of I/O and CPU optimizations for computing a threshold correlation matrix. (a) I/O cost reduction over a random-partition baseline (reduction factors between 1.8× and 3.5×); (b) impact of cache size (I/O time vs. the data size : cache size ratio); (c) CPU cost reduction (speedup vs. threshold). DC1 = DataCenter1, DC2 = DataCenter2, RW = RandomWalk.]

7.1 Impacts of I/O Optimizations

To show the benefit of our min-cut partitioning based caching strategy, we compare it with a baseline caching strategy where signals are randomly partitioned. The only difference between the two caching strategies is how they partition signals into batches—both are used for computing a C^T, and both prune signal pairs based on the same Pruning Matrix. We assume the cache is big enough to hold 1/32 of a dataset. The result is shown in Figure 5(a). As shown, our min-cut partitioning significantly (1.8×–3.5×) reduces the I/O time for all the datasets (the factor of reduction in I/O time is shown at the top of the second bar). This reduction is attributed to our careful partitioning, which reduces the number of disk I/Os required to compute the correlation of signal pairs across different batches. However, the overhead of partitioning is never more than 30 seconds, which is tiny compared to the end-to-end response time.

The I/O cost of our caching strategy can be reduced by using a bigger cache. To show the impact of cache size, we vary the cache size keeping the data size fixed. Figure 5(b) shows I/O costs as a function of the ratio of data size to cache size. As shown, the I/O time decreases linearly with the increase in available cache size, and becomes < 10 minutes for a cache of size n/16. This is due to the fact that with a larger cache, data is partitioned into fewer batches, and hence fewer disk I/Os are required to compute the correlation of signal pairs across batches. The slopes of the different lines demonstrate the amount of correlation present in different datasets. The more correlated pairs a dataset has, the larger the slope. The RandomWalk dataset has the smallest slope among all datasets, as it has the least correlation among signals.

7.2 CPU Speedup due to Approximation

We now show how much our approximation algorithms reduce the CPU cost of computing a correlation matrix. As a shorthand, we use the following notations:

• A: an algorithm to compute an all-pair exact correlation matrix C in the time domain,

• A^T: an algorithm, described below, for computing an exact threshold correlation matrix C^T,

• A^ε: our algorithm for computing an approximate threshold correlation matrix C^ε (Section 5.1),

• A^B: our algorithm for computing a threshold Boolean correlation matrix C^B (Section 5.2).

For different algorithms, we report speedup factors. The speedup of an algorithm is the ratio of the end-to-end CPU time of a baseline algorithm to that of the algorithm. The higher the speedup, the faster the algorithm. For our approximation algorithms A^ε and A^B, we use A^T as the baseline. Before reporting the speedups of our algorithms, we first report the speedup factor of A^T, with A as the baseline. This allows us to interpret the speedup factors of A^ε and A^B with respect to A as well.

► Threshold Correlation Matrix. To compute C^T for a given threshold, A^T prunes uncorrelated signal pairs based on the distances of their first k = 5 DFT coefficients (similar to the methods in [1, 27]). The exact correlations of likely correlated signal pairs are then computed in the time domain. Figure 5(c) shows, for different correlation thresholds, the speedup factors of A^T with respect to A (which takes more than 90 minutes of CPU time for all the datasets). As shown, A^T can be several times faster than A, specifically with high thresholds (e.g., with T = 0.9, A^T is > 3× faster than A). The speedup increases as the threshold increases; this is because more and more uncorrelated signal pairs can be pruned as the threshold increases.

► Approximate Threshold Correlation. Figure 6(a) shows the speedup of A^ε, with respect to A^T, for different approximation error bounds ε. As before, we use k = 5 DFT coefficients for pruning and T = 0.9 as the threshold. The graph shows that even with a very small error, e.g., ε = 0.02, A^ε is significantly faster than A^T for all real datasets. For example, with ε = 0.02, the speedups for the Chlorine and DataCenter2 datasets are 17 and 9, respectively. The speedup is small for RandomWalk because most of its energy is captured by very few of its leading coefficients [1], helping the baseline algorithm A^T to perform extremely well (also shown in Figure 5(c)) with such data. The speedup increases with the error tolerance, as this allows A^ε to compute fewer DFT coefficients.

Figure 6(b) shows the speedup of A^ε for different thresholds. In general, the absolute execution time of A^ε is not affected much by different thresholds. In contrast, the baseline algorithm A^T runs faster with bigger thresholds (as shown in Figure 5(c)). Therefore, the speedup of A^ε with respect to A^T decreases with increasing threshold, as shown in Figure 6(b).

[Figure 6: Speedup for computing an approximate and a Boolean threshold correlation matrix, with respect to computing an exact threshold correlation matrix. (a) Speedup vs. error ε; (b), (c) speedup vs. threshold; one line per dataset (Chlorine, DataCenter1, DataCenter2, RandomWalk).]
Since the speedup of A^T is reported with respect to A, and other speedups are reported with respect to A^T, we can combine the speedups. For example, for the DataCenter1 dataset, A^ε is 18.75 times faster than A, for a threshold T = 0.9 and an error bound ε = 0.06. Similarly, A^B is 8.34 times faster than A, for a threshold T = 0.9. For the DataCenter2 dataset these numbers are 26.4 and 4.1.⁵

⁵ We assume a large cache, so that I/O cost is small, to compute the speedup.

► Absolute savings. Table 3 shows the absolute CPU time of different algorithms on different datasets. This highlights that, in addition to significant relative speedups, our algorithms deliver significant absolute savings in the execution time of different correlation queries.

Table 3: Absolute CPU time (minutes) of different algorithms on different datasets.

  Dataset       A      A^T     A^ε     A^B
  Chlorine      98     67.5    3.5     10.4
  RandomWalk    207    35.2    6.6     29.0

In none of the experiments above do our algorithms result in a speedup less than 1. This highlights that our algorithms are never slower than A^T or A with our datasets.

7.3 Lagged Correlation

► Error. As mentioned in Section 6.2, our algorithm for computing a lagged C^ε may violate the ε-approximation guarantee. We now experimentally measure this effect for computing a lagged C^ε with ε = 0.04. After computing the lagged C^ε, we count the number of signal pairs that violate the ε-approximation guarantee; i.e., for which the true maximum lagged correlation is more than ε away from our estimated maximum lagged correlation. We define the percentage of signal pairs violating the approximation guarantee as the error of our algorithm due to lag. Note that, without any lag, our algorithm has an error of 0, as it never violates the approximation guarantee.

Figure 7 shows the error of our algorithm as a function of the maximum lag. Lags are shown as percentages of the entire signal length (= m). As shown, the error is very small for all datasets for reasonable lag values. In particular, Chlorine and RandomWalk incur close to zero error with a maximum lag of 5% of the entire signal length. All datasets have errors < 2% even for a large maximum lag of 20% of an entire signal. Lags are typically much smaller in practice; e.g., for DataCenter1, a signal represents an entire day, and hence a 10% lag means a lag of 2.4 hours, which is extremely unlikely in a data center. Thus, for practical values of the maximum lag, our algorithm incurs a very small error.

[Figure 7: Error (%) due to lag as a function of the maximum lag, per dataset.]

► Speedup. The small error above comes with a significant benefit of speedup. In Figure 8, we report the speedup of our algorithm to compute a lagged C^ε for ε = 0.04, with respect to computing C^T with BRAID [21], the state-of-the-art algorithm for computing lagged correlation. Both BRAID and our algorithm are configured to compute correlation coefficients for 16 different lags (recall that BRAID considers a logarithmic number of lags). Figure 8 shows that our algorithm is 35× to 71× faster than BRAID for different datasets.
This huge speedup comes because our algorithm works in the frequency domain and reuses DFT coefficients across different lags.

[Figure 8: Speedup for lagged correlation (DC1 = DataCenter1, DC2 = DataCenter2, RW = RandomWalk).]

8. CONCLUSION

We have proposed novel algorithms, based on the Discrete Fourier Transform (DFT) and graph partitioning, to reduce the end-to-end response time of an all-pair correlation query. To optimize I/O cost, we intelligently partition a massive input signal set into smaller batches such that caching the signals one batch at a time minimizes disk I/O. To optimize CPU cost, we have proposed two approximation algorithms. Our algorithms have strict error guarantees, which makes them as useful as the corresponding exact solutions for many real applications. Moreover, compared to the state-of-the-art exact solution, our algorithms are up to 17× faster for several real datasets.

9. REFERENCES

[1] R. Agrawal, C. Faloutsos, and A. N. Swami. Efficient similarity search in sequence databases. In 4th International Conference on Foundations of Data Organization and Algorithms (FODO), 1993.
[2] R. Cole, D. Shasha, and X. Zhao. Fast window correlations over uncooperative time series. In SIGKDD, pages 743–749, 2005.
[3] G. Even. Fast approximate graph partitioning algorithms. SIAM J. Comput., 28(6):2187–2214, 1999.
[4] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In SIGMOD, 1994.
[5] C. E. Ferreira, A. Martin, C. C. de Souza, R. Weismantel, and L. A. Wolsey. The node capacitated graph partitioning problem: a computational study. Math. Program., 81(2):229–256, 1998.
[6] C. M. Fiduccia and R. M. Mattheyses. A linear-time heuristic for improving network partitions. In DAC '82: Proceedings of the 19th Design Automation Conference, pages 175–181, 1982.
[7] J. Gehrke, F. Korn, and D. Srivastava. On computing correlated aggregates over continual data streams. In SIGMOD, 2001.
[8] L. Hagen and A. B. Kahng. Fast spectral methods for ratio cut partitioning and clustering. In Computer-Aided Design, 1991, pages 10–13, 1991.
[9] T. Hey, S. Tansley, and K. Tolle. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.
[10] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. KAIS, 3:263–286, 2000.
[11] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Locally adaptive dimensionality reduction for indexing large time series databases. SIGMOD Rec., 30:151–162, 2001.
[12] C.-S. Li, P. S. Yu, and V. Castelli. HierarchyScan: A hierarchical similarity search algorithm for databases of long sequences. In ICDE, 1996.
[13] X. Lian and L. Chen. Efficient similarity search over future stream time series. TKDE, 20(1):40–54, 2008.
[14] X. Lian, L. Chen, and B. Wang. Approximate similarity search over multiple stream time series. In Advances in Databases: Concepts, Systems and Applications, pages 962–968, 2008.
[15] X. Lian, L. Chen, J. X. Yu, J. Han, and J. Ma. Multiscale representations for fast pattern matching in stream time series. TKDE, 21(4):568–581, 2009.
[16] C. Loboz, S. Smyl, and S. Nath. DataGarage: Warehousing massive amounts of performance data on commodity servers. Technical Report MSR-TR-2010-22, Microsoft Research, March 2010.
[17] S. Papadimitriou, J. Sun, and C. Faloutsos. Streaming pattern discovery in multiple time-series. In VLDB, pages 697–708, 2005.
[18] D. Rafiei. On similarity-based queries for time series data. In ICDE, 1999.
[19] D. Rafiei and A. Mendelzon. Similarity-based queries for time series data. In SIGMOD, 1997.
[20] G. Reeves, J. Liu, S. Nath, and F. Zhao. Managing massive time series streams with multiscale compressed trickles. In VLDB, 2009.
[21] Y. Sakurai, S. Papadimitriou, and C. Faloutsos. BRAID: stream mining through group lag correlations. In SIGMOD, 2005.
[22] M. Vlachos, S. S. Kozat, and P. S. Yu. Optimal distance bounds on time-series data. In SDM, 2009.
[23] M. Vlachos, C. Meek, Z. Vagena, and D. Gunopulos. Identifying similarities, periodicities and bursts for online search queries. In SIGMOD, 2004.
[24] M. Wang, S. Lim, J. Cong, and M. Sarrafzadeh. Multi-way partitioning using bi-partition heuristics. In Proceedings of the 2000 Asia and South Pacific Design Automation Conference, 2000.
[25] H. Yang and D. F. Wong. Efficient network flow based min-cut balanced partitioning. In Proceedings of the 1994 IEEE/ACM International Conference on Computer-Aided Design, pages 50–55, 1994.
[26] B.-K. Yi, N. Sidiropoulos, T. Johnson, A. Biliris, H. Jagadish, and C. Faloutsos. Online data mining for co-evolving time sequences. In ICDE, 2000.
[27] Y. Zhu and D. Shasha. StatStream: statistical monitoring of thousands of data streams in real time. In VLDB, 2002.