ements and different error thresholds. We also compared the performance of our algorithms against prior algorithms for arbitrary-sized streams. In practice, our algorithm is able to achieve up to 300× speedup over prior algorithms.

Organization of the paper: The rest of the paper is organized as follows. We describe the related work in Section 2. In Section 3, we present our algorithms and analysis for both fixed-sized streams and arbitrary-sized streams. In Section 4, we demonstrate our implementation results. Section 5 concludes the paper.

2 Related Work

Quantile computation has been studied extensively in the database literature. At a broad level, quantile algorithms can be classified as exact algorithms and approximate algorithms.

Exact Algorithms: Several algorithms have been proposed for computing exact quantiles efficiently. There is also considerable work on deriving lower and upper bounds on the number of comparisons needed for finding exact quantiles. Paterson [13] reviewed the history of the theoretical results on this question. The current upper bound is 2.9423N comparisons, and the lower bound is (2 + α)N, where α is on the order of 2^−40. Munro and Paterson [12] also showed that algorithms which compute the exact φ-quantile of a sequence of N data elements in p passes need Ω(N^{1/p}) space. Under the single-pass requirement of stream applications, this means Ω(N) space. Therefore, approximation algorithms that require sublinear space are needed for online quantile computations on large data streams.

Approximate Algorithms: Approximate algorithms are either deterministic with guaranteed error, or randomized with guaranteed error of a certain probability. These algorithms can further be classified as uniform, biased or targeted quantile algorithms. Moreover, based on the underlying model, they can be further classified as quantile computations on the entire stream history, on sliding windows, or on distributed streams.

Jain and Chlamtac [7] and Agrawal and Swami [1] proposed algorithms to compute uniform quantiles in a single pass. However, neither algorithm provides a priori guarantees on the error. Manku et al. [10] proposed a single-pass algorithm for computing an ε-approximate uniform quantile summary. Their algorithm requires prior knowledge of N. The space complexity of their algorithm is O((1/ε) log²(εN)). Manku et al. [11] also presented a randomized uniform quantile approximation algorithm which does not require prior knowledge of N. The space requirement is O((1/ε)(log²(1/ε) + log² log(1/δ))) with a failure probability of δ. Greenwald et al. [5] improved Manku's [11] algorithm to achieve a storage bound of O((1/ε) log(εN)). Their algorithm can deterministically compute an ε-approximate quantile summary without prior knowledge of N. Lin et al. [8] presented algorithms to compute uniform quantiles over sliding windows. Arasu and Manku [2] improved the space bound using a novel exponential histogram-based data structure.

More recently, Cormode et al. [3] studied the problem of biased quantiles. They proposed an algorithm with poly-log space complexity based on [5]. However, it is shown in [15] that the space requirement of their algorithm can grow linearly with the input size on carefully crafted data. Cormode et al. [4] presented a better algorithm with an improved space bound of O((log U/ε) log(εN)) and an amortized update time complexity of O(log log U), where U is the size of the universe from which data elements are chosen and N is the size of the data stream.

Recent work has also focused on approximate quantile computation in distributed streams and sensor networks. Greenwald et al. [6] proposed an algorithm for computing ε-approximate quantiles in a distributed fashion for sensor network applications. Their algorithm guarantees that the summary structure at each sensor is of size O(log²(n)/ε). Shrivastava et al. [14] presented an algorithm to compute medians and other quantiles in sensor networks using a space complexity of O((1/ε) log(U)), where U is the size of the universe.

To deal with massive quantile queries, Lin et al. [9] proposed an algorithm to reduce the number of distinct queries by clustering the queries and treating each cluster as a single query. For relative-error order statistics, Zhang et al. [15] proposed an algorithm with confidence 1 − δ using O((1/ε²) log(1/δ) log(ε²N)) space, which improved the previous best space bound of O((1/ε³) log(1/δ) log N).

3 Algorithms

In this section, we describe our algorithms for fast computation of approximate quantiles on large high-speed data streams. We present our data structures and algorithms for both fixed-sized (with known size) and arbitrary-sized (with unknown size) streams. Furthermore, we analyze the computational complexity and the memory requirements of our algorithms.

Let N denote the total number of values in the data stream and n denote the number of values in the data stream seen so far. Given a user-defined error ε and any rank r ∈ [1, n], an ε-approximate quantile is an element in the data stream whose rank r′ is within
[r − εn, r + εn]. We maintain a summary structure to continuously answer ε-approximate quantile queries.

3.1 Fixed Size Streams

We first present our approximate quantile computation algorithm for the case where N is given in advance, and generalize it to unknown N in the following subsection. In practice, the former algorithm can be used for applications such as summarizing large databases that do not fit in main memory; the latter is useful for continuous streams whose size cannot be predicted beforehand. We introduce our summary structure and describe the algorithm to construct the summary.

3.1.1 Multi-level Quantile Summary

We maintain a multi-level ε-summary S of the stream as data elements arrive. An ε-summary is a sketch of the stream which can provide an ε-approximate answer to a quantile query of any rank r ≤ n.

We maintain a multi-level summary structure S = {s0, . . . , sl, . . . , sL}, where sl is the summary at level l and L is the total number of levels (see Fig. 1). Basically, we divide the incoming stream into blocks of size b (b = ⌊log(εN)/ε⌋). Each level l covers a disjoint bag Bl of consecutive blocks in the stream, and all levels together ∪Bl cover the whole stream. Specifically, B0 always contains the most recent block (whether it is complete or not), B1 contains the older two blocks, and BL consists of the oldest 2^L blocks. Each sl is an εl-summary of Bl, where εl ≤ ε.

The multi-level summary construction and maintenance is performed as follows. Initially, all levels are empty. Whenever a data element in the stream arrives, we perform the following update procedure.

1. Insert the element into s0.

2. If s0 is not full (|s0| < b), stop; the update procedure is done for the current element. If s0 becomes full (|s0| = b), we reduce the size of s0 by computing a sketch sc of size ⌈b/2⌉ + 1 on s0. We refer to this sketch computation operation as COMPRESS, which we describe in detail below. Considering s0 as an ε0-summary of B0 with ε0 = 0, the COMPRESS operation guarantees that sc is an (ε0 + 1/b)-summary. After the COMPRESS operation, sc is sent to level 1.

3. If s1 is empty, we set s1 to be sc and the update procedure is done. Otherwise, we merge s1 with sc, which was sent up by level 0, and empty s0. We refer to these operations as MERGE on s1, sc and EMPTY on s0. Generally, the MERGE(sl+1, sc) operation merges sl+1 with the sketch sc by performing a merge sort, and the EMPTY(sl) operation empties sl after the MERGE operation is finished. Finally, we perform COMPRESS on the result of MERGE, and send the resulting sketch sc to level 2.

4. If s2 is empty, we set s2 to be sc and the update procedure is done. Otherwise, we perform the operations s2 = MERGE(s2, sc), sc = COMPRESS(s2), EMPTY(s1) in the given order, and send the new sc to level 3.

5. We repeat step 4 for si, i = 3, . . . , L, until we find a level L where sL is empty.

The pseudo code of the entire update procedure when an element e arrives is shown in Algorithm 1. In the following discussion, we describe the operations COMPRESS and MERGE in detail.

Assume that s is an ε-summary of a stream B. For each element e in s, we maintain rmax(e) and rmin(e), which represent the maximum and minimum possible ranks of e in B, respectively. Therefore, we can answer an ε-approximate quantile query of any rank r by returning the value e which satisfies rmax(e) ≤ r + ε|B| and rmin(e) ≥ r − ε|B|. Initially, rmin(e) = rmax(e) = rank(e). rmin(e) and rmax(e) are updated during the COMPRESS and MERGE operations.

COMPRESS(s, 1/b): The COMPRESS operation takes at most ⌈b/2⌉ + 1 values from s, which are quantile(s, 1), quantile(s, ⌊2|B|/b⌋), quantile(s, ⌊2 · 2|B|/b⌋), . . . , quantile(s, ⌊i · 2|B|/b⌋), . . . , quantile(s, |B|), where quantile(s, r) queries summary s for the quantile of rank r. According to [6], the result of COMPRESS(s, 1/b) is an (ε + 1/b)-summary, assuming s is an ε-summary.

MERGE(s, s′): The MERGE operation combines s and s′ by performing a merge-sort on s and s′. According to [6], if s is an ε-summary of B and s′ is an ε′-summary of B′, the result of MERGE(s, s′) is an ε″-summary of B ∪ B′, where ε″ = max(ε, ε′).

Lemma 1. The number of levels in the summary structure is less than log(εN).

Proof. In the entire summary structure construction, s0 becomes full at most N/b times, sl becomes full at most N/(2^l b) times, and the highest level sL becomes full at most once. Therefore,

L ≤ log(N/b) < log(εN) − log(log(εN)) < log(εN)    (1)
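To make the structure and the cascading update concrete, the following C++ sketch implements the procedure of Algorithm 1 below. It is an illustrative reconstruction rather than the authors' implementation: the names (Tuple, Summary, MultiLevelSummary) are ours, the rmin/rmax bookkeeping of Section 3.3 is elided, and COMPRESS is realized as a simple keep-every-other-element selection over a sorted summary.

#include <algorithm>
#include <cstddef>
#include <vector>

// One summary entry: a value plus its minimum/maximum possible rank.
// (The rmin/rmax bookkeeping is elided here; see Section 3.3.)
struct Tuple { float value; long rmin; long rmax; };
using Summary = std::vector<Tuple>;

static bool byValue(const Tuple& x, const Tuple& y) { return x.value < y.value; }

// COMPRESS(s, 1/b): keep every other element of a sorted summary, always
// retaining the last one, so at most ceil(b/2) + 1 entries survive.
static Summary compress(const Summary& s) {
    Summary out;
    for (std::size_t i = 0; i < s.size(); i += 2) out.push_back(s[i]);
    if (!s.empty() && s.size() % 2 == 0) out.push_back(s.back());
    return out;
}

// MERGE(s, s'): merge-sort two summaries by value.
static Summary merge(const Summary& a, const Summary& b) {
    Summary out(a.size() + b.size());
    std::merge(a.begin(), a.end(), b.begin(), b.end(), out.begin(), byValue);
    return out;
}

// Multi-level summary S = {s0, ..., sL} with the cascading update of steps
// 1-5: fill s0, then compress-and-promote until an empty level absorbs sc.
class MultiLevelSummary {
  public:
    explicit MultiLevelSummary(std::size_t blockSize) : b(blockSize), levels(1) {}

    void update(float e) {                        // one incoming element
        levels[0].push_back(Tuple{e, 0, 0});
        if (levels[0].size() < b) return;         // step 2: s0 not yet full
        std::sort(levels[0].begin(), levels[0].end(), byValue);
        Summary sc = compress(levels[0]);         // sketch of the full block
        levels[0].clear();                        // EMPTY(s0)
        for (std::size_t l = 1;; ++l) {           // steps 3-5
            if (l == levels.size()) levels.emplace_back();
            if (levels[l].empty()) { levels[l] = std::move(sc); break; }
            sc = compress(merge(levels[l], sc));  // promote to level l+1
            levels[l].clear();                    // EMPTY(sl)
        }
    }

  protected:
    std::size_t b;                  // block size, b = floor(log(eps N)/eps)
    std::vector<Summary> levels;    // levels[l] is s_l
};

Constructed with b = ⌊log₂(εN)/ε⌋, a loop of update(x) calls maintains the sketch with O(log b) amortized work per element, matching the analysis of Section 3.1.2.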
Figure 1. Multi-level summary S: This figure highlights the multi-level structure of our ε-summary S = {s0, s1, . . . , sL}. The incoming data is divided into equi-sized blocks of size b, and the blocks are grouped into disjoint bags B0, B1, . . . , Bl, . . . , BL, with Bl for level l. B0 contains the most recent block, B1 contains the older two blocks, and BL consists of the oldest 2^L blocks. At each level, we maintain sl as the εl-summary for Bl. The total number of levels L is no more than log(N/b).
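As a worked instance of these definitions (our own arithmetic, with base-2 logarithms): for the fixed-size algorithm with ε = 0.001 and N = 10^6,

\[ b = \lfloor \log_2(\epsilon N)/\epsilon \rfloor = \lfloor \log_2(1000)/0.001 \rfloor = 9965, \qquad L \le \log_2(N/b) \approx \log_2(100.4) \approx 6.6, \]

so the structure has at most 7 levels, each holding on the order of b elements.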
Algorithm 1 Update(e, S, ε)
Input e: current data element to be inserted, S: current summary structure S = {s0, . . . , sl, . . . , sL}, ε: required approximation factor of S
1: insert e into s0
2: if |s0| = b then
3:   sort s0
4:   sc ← compress(s0, 1/b)
5:   empty(s0)
6: else
7:   exit
8: end if
9: for l = 1 to L do
10:   if |sl| = 0 then
11:     sl ← sc
12:     break
13:   else
14:     sc = compress(merge(sl, sc), 1/b)
15:     empty(sl)
16:   end if
17: end for

Lemma 2. Each level in our summary maintains an error less than ε.

Proof. During the construction process of S, the error at each level l depends on the COMPRESS and MERGE operations. Initially, ε0 = 0. At each level, the COMPRESS(sl, 1/b) operation generates a new sketch sc with error εl + 1/b, which is added to level l + 1; MERGE does not increase the error. Therefore, the error of the summary in sl+1 is given by

εl+1 = εl + 1/b = ε0 + (l + 1)/b = (l + 1)/b    (2)

From equations 2 and 1, it is easy to verify that

εl = l/b < log(εN)/b = ε    (3)

To answer a query of any rank r using S, we first sort s0 and merge the summaries at all levels {sl} using the MERGE operation; denote the result MERGE(S). Then the ε-approximate quantile for any rank r is the element e in MERGE(S) which satisfies rmin(e) ≥ r − εN and rmax(e) ≤ r + εN.

Theorem 1. For multi-level summary S, MERGE(S) is an ε-approximate summary of the entire stream.

Proof. The MERGE operation on all sl generates a summary for ∪Bl with approximation factor εU = max(ε1, ε2, . . . , εL). According to Lemma 2, εU < ε. Since the union of all the Bl is the entire stream, MERGE(S) is an ε-approximate summary of the entire stream.
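Continuing the illustrative sketch from Section 3.1.1 (same hypothetical Tuple/Summary types, with valid rmin/rmax values assumed to have been maintained as described in Section 3.3), the query procedure reduces to a single scan of MERGE(S):

// eps-approximate rank-r query over the merged summary MERGE(S): return an
// element whose rank bounds bracket r within eps*n. Assumes merged is
// non-empty, sorted by value, and carries valid rmin/rmax values.
float query(const Summary& merged, long r, double eps, long n) {
    const double slack = eps * static_cast<double>(n);
    for (const Tuple& t : merged)
        if (t.rmin >= r - slack && t.rmax <= r + slack)
            return t.value;
    return merged.back().value;   // unreachable for a valid eps-summary
}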
3.1.2 Performance Analysis

Our summary structure maintains at most b + 3 elements in each level (after the MERGE operation) and there are L levels in the summary structure. Therefore, the storage requirement for constructing the summary is bounded by (b + 3)L = O((1/ε) log²(εN)). This is higher than the best storage bound proposed by Greenwald and Khanna [5], which is O((1/ε) log(εN)). However, the goal behind our algorithm is to achieve faster computational time with reasonable storage. In practice, the memory requirement of the algorithm is a small fraction of the RAM on most PCs even for petabyte-sized datasets (see Table 1).

Theorem 2. The average update cost of our algorithm is O(log((1/ε) log(εN))).

Proof. At level 0, for each block we perform a sort and a COMPRESS operation. The cost of the sort per block is b log b and the cost of COMPRESS per block is b/2. In total there are N/b blocks, so the total cost at level 0 is N log b + N/2. At each level i, i > 0, we perform a COMPRESS and a MERGE operation. Each COMPRESS costs b, since a linear scan is required to batch-query all the values needed (refer to the COMPRESS operation). Each MERGE costs b with a merge sort. In fact, the computation cost of MERGE also includes the updates of rmin and rmax (discussed in Sec. 3.3), which can be done in linear time. Thus the cost of a MERGE adds up to 2b. Therefore, the total expected cost of computing the summary structure is

N log b + N/2 + Σ_{i=1}^{L} (N/(2^i b)) · 3b = O(N log((1/ε) log(εN)))

The average update time per element is O(log((1/ε) log(εN))).

In practice, for a fixed ε, the average per-element computation cost of our algorithm is O(log log N) and the overall computation is almost linear in N. The algorithm proposed by Greenwald and Khanna [5] has a best-case computation time per element of O(log s) and a worst-case computation time per element of O(s), where s is (1/ε) log(εN). We demonstrate the performance comparison in our experiment section.

The computation in the summary construction is dominated by the sort operations on blocks. Although sorting is computationally intensive, it is fast on small blocks which fit in the CPU L2 cache. Table 1 shows the block size and memory requirement as a function of the stream size N with error bound 0.001, using the generalized streaming algorithm of the next section. In practice, the size of the blocks in our algorithm is smaller than the CPU cache size even for petabyte-sized data streams.

3.2 Generalized Streaming Algorithm

We generalize our algorithm for fixed size streams to compute approximate quantiles in streams without prior knowledge of the size N. The basic idea of our algorithm is as follows. We partition the input stream P into disjoint sub-streams P0, P1, . . . , Pm of increasing size. Specifically, sub-stream Pi has size 2^i and covers the elements whose location is in the interval [2^i − 1, 2^{i+1} − 1). By partitioning the incoming stream into sub-streams of known size, we are able to construct a multi-level summary Si on each sub-stream Pi using our algorithm for fixed size streams. Our summary construction algorithm is as follows.

1. For the latest sub-stream Pk, which has not yet completed, we maintain a multi-level ε′-summary SC using Algorithm 1 by performing Update(e, SC, ε′) whenever an element comes. Here ε′ = ε/2.

2. Once the last element of sub-stream Pk arrives, we apply COMPRESS with parameter ε/2 to MERGE(SC), the merged set of all levels in SC. The resulting summary S^k = COMPRESS(MERGE(SC), ε/2) is an ε-summary of Pk, and it consists of 2/ε elements.

3. The ordered set of the summaries of all complete sub-streams so far, S = {S^0, S^1, . . . , S^{k−1}}, is the current multi-level ε-summary of the entire stream except the incomplete sub-stream Pk.

The pseudo code for the update algorithm for a stream of unknown size is shown in Algorithm 2. Initially, S = φ. Whenever an element comes, gUpdate is performed to update the summary structure S.

Algorithm 2 gUpdate(e, S, ε, SC)
Input e: current data element, S: current summary structure, S = {S^0, S^1, . . . , S^{k−1}} (sub-streams P0, . . . , Pk−1 have completely arrived), ε: required approximation factor of S, SC: the fixed size multi-level summary corresponding to the current sub-stream Pk, SC = {s0, s1, . . . , sL}
1: if e is the last element of Pk then
2:   apply merge on all the levels of SC: sall = merge(SC) = merge(s0, s1, . . . , sL)
3:   S^k = compress(sall, ε/2)
4:   S = S ∪ {S^k}
5:   SC ← φ
6: else
7:   update SC: SC = Update(e, SC, ε/2)
8: end if

To answer a query of any rank r using S, if SC is not empty, we first compute S^k for the incomplete sub-stream Pk: S^k = compress(merge(SC), ε/2); then we merge all the ε-summaries S^0, S^1, . . . , S^{k−1} in S together with S^k using the MERGE operation. The final summary is the ε-summary for P.
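The sub-stream bookkeeping can be sketched as a thin driver over the fixed-size structure from our earlier illustrative code. Again, this is a hedged sketch, not the paper's implementation: MultiLevelSummary is the hypothetical class from the Section 3.1.1 sketch, and flushCompressed, a helper that would merge all levels and compress the result down to about 2/ε entries, is assumed rather than defined in the paper.

#include <cmath>
#include <optional>

// Algorithm 2 (gUpdate) as a driver over the fixed-size structure: elements
// [2^i - 1, 2^{i+1} - 1) form sub-stream P_i of size 2^i; each completed P_i
// is frozen into an eps-summary S^i.
class GeneralizedSummary {
  public:
    explicit GeneralizedSummary(double eps) : eps(eps) {}

    void gUpdate(float e) {
        const std::size_t nk = std::size_t{1} << k;       // |P_k| = 2^k
        if (!current) {                                   // open a new P_k
            // Block size of the fixed-size run with eps' = eps/2, clamped
            // because log2(eps' * n_k) < 0 for the tiny early sub-streams.
            double b = std::log2(eps / 2 * nk) / (eps / 2);
            current.emplace(b < 2 ? 2 : static_cast<std::size_t>(b));
            seen = 0;
        }
        current->update(e);                               // Update(e, S_C, eps/2)
        if (++seen == nk) {                               // last element of P_k
            // S^k = COMPRESS(MERGE(S_C), eps/2); flushCompressed is the
            // assumed helper that merges all levels and compresses the result.
            complete.push_back(current->flushCompressed(eps / 2));
            current.reset();                              // S_C <- empty
            ++k;                                          // move on to P_{k+1}
        }
    }

  private:
    double eps;
    std::size_t k = 0, seen = 0;
    std::optional<MultiLevelSummary> current;   // S_C for the open P_k
    std::vector<Summary> complete;              // S = {S^0, ..., S^{k-1}}
};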
Stream size (N) | Maximum block size | Bound on number of tuples | Bound on summary size
10^6  | 191.2KB | 161K  | 1.9MB
10^9  | 420.4KB | 717K  | 8.6MB
10^12 | 669.6KB | 1.67M | 20MB
10^15 | 908.8KB | 3.03M | 36.4MB
Table 1. This table shows the memory size requirements of our generalized algorithm (with unknown stream size) for large data streams with an error of 0.001. Each tuple consists of a data value and its minimum and maximum rank in the stream, 12 bytes in total. Observe that the block size is less than a MB and fits in the L2 cache of most CPUs; the sorting is therefore in-memory and can be conducted very fast. Also, the maximum memory requirement for our algorithm is a few MB even for handling streams with 10^15 values.
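The block sizes above can be sanity-checked against the analysis (a back-of-envelope calculation of ours, assuming base-2 logarithms, ε′ = ε/2 for the generalized algorithm, and a largest sub-stream of roughly N/2 elements). For the first row,

\[ b \approx \frac{\log_2(\epsilon' n_k)}{\epsilon'} = \frac{\log_2(0.0005 \cdot 5 \times 10^5)}{0.0005} \approx \frac{7.97}{0.0005} \approx 1.6 \times 10^4 \ \text{tuples} \approx 191\,\text{KB} \]

at 12 bytes per tuple, consistent with the 191.2KB entry.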
3.2.1 Performance Analysis

We first present the storage analysis and then analyze the computational complexity of our algorithm.

Theorem 3. The space requirement of Algorithm 2 is O((1/ε) log²(εn)).

Proof. At any point of time, assume that the number of data elements arrived so far is n. According to Algorithm 1, we compute and maintain a multi-level ε/2-approximate summary SC for the current sub-stream Pk. For each of the previous sub-streams Pi, i = 1, . . . , k − 1, which are complete, we maintain an ε-summary S^i of size 2/ε. Since k ≤ ⌊log(n + 1)⌋, in total we need O((1/ε) log n) space for these summaries. According to the space bound for fixed size streams, we need O((1/ε) log²(εn)) space for computing the summary SC for the current sub-stream. Therefore, the space requirement of the entire algorithm at any point of time is O((1/ε) log²(εn)).

Theorem 4. The average update cost of Algorithm 2 is O(log((1/ε) log(εn))).

Proof. According to Theorem 2, the computational complexity for each sub-stream Pi, i = 0, 1, . . . , ⌊log(n + 1)⌋, is O(ni log((1/ε′) log(ε′ni))), where ni = |Pi| = 2^i, Σni = n, and ε′ = ε/2. After each sub-stream Pi is complete, we perform an additional MERGE and COMPRESS operation, each of cost O((1/ε′) log²(ε′ni)), to construct S^i.

Given the above observations, the total computational cost of our algorithm is

Σ_{i=0}^{⌊log(n+1)⌋} ( 2^i log(2(i − 1)/ε) + (2/ε)(i − 1)² )    (4)

Simplifying equation 4 (the geometric first term sums to O(n log((1/ε) log(εn))) since Σ 2^i = O(n), and the second term adds only O((1/ε) log³ n), which is dominated), the total computational cost of our algorithm is O(n log((1/ε) log(εn))), and the average updating time per element is O(log((1/ε) log(εn))), which is O(log log n) if ε is fixed.

3.3 Update rmin(e) and rmax(e)

For both fixed size and arbitrary size streams, to answer a quantile query we need to properly update the rmin and rmax values of each element e in the summary. rmin(e) and rmax(e) are updated during the COMPRESS and MERGE operations as follows (as in [6]):

Rank update in MERGE: Let S′ = x1, x2, . . . , xa and S″ = y1, y2, . . . , yb be two quantile summaries, and let S = z1, z2, . . . , za+b = MERGE(S′, S″). Assume zi corresponds to some element xr in S′. Let ys be the largest element in S″ that is smaller than xr (ys is undefined if no such element exists), and let yt be the smallest element in S″ that is larger than xr (yt is undefined if no such element exists). Then

rminS(zi) = rminS′(xr)                 if ys is undefined
rminS(zi) = rminS′(xr) + rminS″(ys)    otherwise

rmaxS(zi) = rmaxS′(xr) + rmaxS″(ys)        if yt is undefined
rmaxS(zi) = rmaxS′(xr) + rmaxS″(yt) − 1    otherwise

Rank update in COMPRESS: Assume COMPRESS(S′) = S. For any element e ∈ S, we define rminS(e) = rminS′(e) and rmaxS(e) = rmaxS′(e).
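These rules slot directly into the merge routine of our earlier illustrative sketch (same hypothetical Tuple/Summary types; note that ties between the two inputs are broken toward the first summary here, a choice the rules above leave unspecified):

// MERGE with rank bookkeeping: when z is drawn from one input summary, add
// the rank mass contributed by the other one (the "S''" of the rules above):
// rmin gains rmin(y_s) when y_s exists; rmax gains rmax(y_t) - 1 when y_t
// exists, and rmax(y_s) when only y_s exists.
static Summary mergeWithRanks(const Summary& a, const Summary& b) {
    Summary out;
    std::size_t i = 0, j = 0;
    while (i < a.size() || j < b.size()) {
        const bool fromA =
            j == b.size() || (i < a.size() && a[i].value <= b[j].value);
        const Summary& other = fromA ? b : a;
        const std::size_t oidx = fromA ? j : i;   // first element of other > z
        Tuple z = fromA ? a[i++] : b[j++];
        if (oidx > 0)                             // y_s = other[oidx - 1]
            z.rmin += other[oidx - 1].rmin;
        if (oidx < other.size())                  // y_t = other[oidx]
            z.rmax += other[oidx].rmax - 1;
        else if (oidx > 0)                        // y_t undefined, y_s defined
            z.rmax += other[oidx - 1].rmax;
        out.push_back(z);
    }
    return out;
}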
4 Implementation and Results

We implemented our algorithms in C++ on an Intel 1.8 GHz Pentium PC with 2GB main memory. We tested our algorithm against a C++ implementation of the algorithm in [5] (referred to as GK01 in the remainder of the paper) obtained from the authors.

4.1 Results

We measured the performance of GK01 and our algorithm on different data sets. Specifically, we studied the computational performance as a function of the size of the incoming stream, the error ε, and the input data distribution. In all experiments, we do not assume knowledge of the stream size, and we use float as the data type, which takes 4 bytes.

4.1.1 Sorted Input

We tested our algorithms using an input stream with either sorted or reverse sorted data. Fig. 2(a) shows the performance of GK01 and our algorithm as the input data stream size varies from 10^6 to 10^7 with a guaranteed error bound of 0.001. For these experiments, as the data stream size increases, the block size in the largest sub-stream varies from 191.2KB to 270.9KB. In practice, our algorithm is able to compute the summary on a stream of size 10^7 (40MB) using less than 2MB of RAM. Our algorithm is able to achieve a 200-300× speedup over GK01. Note that the sorted and reverse sorted curves for GK01 are almost overlapping due to the log-scale presentation and the small difference between them (1.16% on average); the same holds for the sorted and reverse sorted curves of our algorithm, where the average difference is 2.1%.

We also measured the performance of our algorithm and GK01 by varying the error bound from 10^−3 to 10^−2 on sorted and reverse sorted streams. Fig. 2(b) shows the performance of our algorithm and GK01 on an input stream of 10^7 data elements. We observe that the performance of our algorithm is almost constant even when the approximation accuracy of the quantiles increases by 10×. Note that GK01 is around 60× slower for large error bounds and around 300× slower for higher precision quantiles compared with our algorithm.

requirement of MRL [10] and higher than GK01. Although the storage requirement is comparatively high, for many practical applications the storage used by our algorithm is small enough to manage. For example, a stream with 100 million values and error bound 0.001 has a worst-case storage requirement of 5MB, which is practical on most PCs. Although our algorithm has a higher storage requirement than GK01, it can construct the summary up to two orders of magnitude faster. In terms of computational cost, our algorithm has an expected cost of O(n log((1/ε) log(εN))). Therefore, for a fixed error bound, the algorithm shows an almost linear increase in computational time with n, and a near-logarithmic increase in time as the error bound decreases. Thus, our algorithm is able to handle high accuracy and large data streams efficiently.
[Figure 2 plots omitted: (a) summary construction time vs. data stream size (1M-10M); (b) summary construction time vs. error (0.001-0.01); time in seconds on a log scale]
Figure 2. Sorted Data: We used sorted and reverse sorted input data to measure the best possible performance of summary construction using our algorithm and GK01. Fig. 2(a) shows the computational time as a function of the stream size on a log-scale for a fixed ε of 0.001. We observe that the sorted and reverse sorted computation time curves for GK01 are almost overlapping due to the log-scale presentation and the small difference between them (1.16% on average); the same holds for the sorted and reverse sorted curves of our algorithm, where the average difference is 2.1%. We also observe that the performance of our algorithm is almost linear, and that it is almost two orders of magnitude faster than GK01. Fig. 2(b) shows the computational time as a function of the error. We observe the higher performance of our algorithm, which is 60-300× faster than GK01. Moreover, GK01 has a significant performance overhead as the error becomes smaller.
[Figure 3 plots omitted: (a) summary construction time vs. data stream size (1M-10M); (b) summary construction time vs. error (0.001-0.01); time in seconds on a log scale; curves shown for our algorithm and GK01]
Figure 3. Random Data: We used random input data to measure the performance of summary construction using our algorithm and GK01. Fig. 3(a) shows the computational time as a function of the stream size on a log-scale for a fixed ε of 0.001. We observe that the performance of our algorithm is almost linear. Furthermore, the log-scale plot indicates that our algorithm is almost two orders of magnitude faster than GK01. Fig. 3(b) shows the computational time as a function of the error. We observe that our algorithm's time is almost constant, whereas GK01 has a significant performance overhead as the error becomes smaller.

5 Conclusion and Future Work