
A Fast Algorithm for Approximate Quantiles in High Speed Data Streams

Qi Zhang and Wei Wang


University of North Carolina, Chapel Hill
Department of Computer Science
Chapel Hill, NC 27599, USA
[email protected], [email protected]

19th International Conference on Scientific and Statistical Database Management (SSDBM 2007)
0-7695-2868-6/07 $25.00 © 2007

Abstract

We present a fast algorithm for computing approximate quantiles in high speed data streams with deterministic error bounds. For data streams of size N, where N is unknown in advance, our algorithm partitions the stream into sub-streams of exponentially increasing size as they arrive. For each sub-stream, which has a fixed size, we compute and maintain a multi-level summary structure using a novel algorithm. In order to achieve high speed performance, the algorithm uses simple block-wise merge and sample operations. Overall, our algorithms for fixed-size streams and arbitrary-size streams have a computational cost of O(N log((1/ε) log(εN))) and an average per-element update cost of O(log log N) if ε is fixed.

1 Introduction

Quantile computation has a wide range of applications in database and scientific computing. Computing exact quantiles on large datasets or unlimited data streams requires either huge memory or relatively slow disk-based sorting. It has been proven that at least Ω(N^(1/p)) storage is needed for exact median (0.5-quantile) computation in p passes over a data set of size N [12]. Recently, researchers have studied the problem of computing approximate quantiles with a guaranteed error bound to improve both memory and speed performance [10, 11, 5, 6, 2, 8, 3, 4].

Streaming quantile computation has several constraints. Data streams are transient and can arrive at a high speed. Furthermore, the stream size may not be known a priori. Streaming computation therefore requires single-pass algorithms with a small space requirement which are able to handle arbitrary-sized streams. In order to guarantee the precision of the result, the algorithm should ensure a randomized or deterministic error bound for the quantile computation. The best reported storage bound for approximate quantile computation is O((1/ε) log(εN)) [5]. Many algorithms have been developed for computing approximate quantiles over the entire stream history [10, 5] or over a sliding window [8, 2], with uniform error [10, 5] or with biased error [3, 4]. However, most of these algorithms focus on reducing the space requirement and may trade off computational cost to do so. This can be an issue in stream applications such as streaming music, streaming video, voice over IP, etc., which require real-time performance. For high-speed streams on OC-768 links with 40 Gbps capacity, many algorithms may not have enough time to process the data, even if the storage requirement is satisfied.

In this paper, we present fast algorithms for computing approximate quantiles with uniform error over the entire stream history. Specifically, for a fixed error ε, our algorithm has a computational cost of O(N log((1/ε) log(εN))), where N is the size of the data stream. An average per-element cost of O(log log N) for fixed ε is achieved, which significantly reduces the computational bandwidth requirement. Our algorithm is based on block-wise sampling and merging operations. For a fixed-size stream with known size, we maintain a multi-level summary structure online which can answer an ε-approximate quantile query for any rank r ∈ [1, n]. For an arbitrary-sized stream with unknown size, our algorithm divides the stream into sub-streams of exponentially increasing sizes, so that the fixed-size stream algorithm can be applied to each sub-stream. The storage requirement of our algorithms is O((1/ε) log²(εN)).

We have tested the performance of our algorithms on arbitrary-sized streams with tens of millions of elements and different error thresholds. We also compared the performance of our algorithms against prior algorithms for arbitrary-sized streams. In practice, our algorithm is able to achieve up to a 300× speedup over prior algorithms.

Organization of the paper: The rest of the paper is organized as follows. We describe related work in Section 2. In Section 3, we present our algorithms and analysis for both fixed-sized and arbitrary-sized streams. In Section 4, we demonstrate our implementation results. Section 5 concludes the paper.

2 Related Work

Quantile computation has been studied extensively in the database literature. At a broad level, existing methods can be classified as exact algorithms and approximate algorithms.

Exact Algorithms: Several algorithms have been proposed for computing exact quantiles efficiently. There is also considerable work on deriving lower and upper bounds on the number of comparisons needed for finding exact quantiles. Mike Paterson [13] reviewed the history of the theoretical results on this topic. The current upper bound is 2.9423N comparisons, and the lower bound is (2 + α)N, where α is on the order of 2^(−40). Munro and Paterson [12] also showed that algorithms which compute the exact φ-quantile of a sequence of N data elements in p passes need Ω(N^(1/p)) space. For the single-pass requirement of stream applications, this implies Ω(N) space. Therefore, approximation algorithms that require sublinear space are needed for online quantile computation on large data streams.

Approximate Algorithms: Approximate algorithms are either deterministic with guaranteed error or randomized with guaranteed error of a certain probability. These algorithms can further be classified as uniform, biased or targeted quantile algorithms. Moreover, based on the underlying model, they can be further classified as quantile computations over the entire stream history, over sliding windows, or over distributed streams.

Jain and Chlamtac [7] and Agrawal and Swami [1] proposed algorithms to compute uniform quantiles in a single pass. However, neither of these two algorithms has an a priori guarantee on error. Manku et al. [10] proposed a single-pass algorithm for computing an ε-approximate uniform quantile summary. Their algorithm requires prior knowledge of N. The space complexity of their algorithm is O((1/ε) log²(εN)). Manku et al. [11] also presented a randomized uniform quantile approximation algorithm which does not require prior knowledge of N. Its space requirement is O((1/ε)(log²(1/ε) + log² log(1/δ))) with a failure probability of δ. Greenwald and Khanna [5] improved Manku's [11] algorithm to achieve a storage bound of O((1/ε) log(εN)). Their algorithm can deterministically compute an ε-approximate quantile summary without prior knowledge of N. Lin et al. [8] presented algorithms to compute uniform quantiles over sliding windows. Arasu and Manku [2] improved the space bound using a novel exponential histogram-based data structure.

More recently, Cormode et al. [3] studied the problem of biased quantiles. They proposed an algorithm with poly-log space complexity based on [5]. However, it is shown in [15] that the space requirement of their algorithm can grow linearly with the input size on carefully crafted data. Cormode et al. [4] presented a better algorithm with an improved space bound of O((log U / ε) log(εN)) and an amortized update time complexity of O(log log U), where U is the size of the universe from which the data elements are chosen and N is the size of the data stream.

Recent work has also focused on approximate quantile computation in distributed streams and sensor networks. Greenwald et al. [6] proposed an algorithm for computing ε-approximate quantiles in a distributed setting for sensor network applications. Their algorithm guarantees that the summary structure at each sensor is of size O(log²(n)/ε). Shrivastava et al. [14] presented an algorithm to compute medians and other quantiles in sensor networks using a space complexity of O((1/ε) log(U)), where U is the size of the universe.

To deal with massive quantile queries, Lin et al. [9] proposed an algorithm to reduce the number of distinct queries by clustering the queries and treating each cluster as a single query. For relative-error order statistics, Zhang et al. [15] proposed an algorithm with confidence 1 − δ using O((1/ε²) log(1/δ) log²(εN)) space, which improved the previous best space bound of O((1/ε³) log(1/δ) log(εN)).

3 Algorithms

In this section, we describe our algorithms for fast computation of approximate quantiles on large high-speed data streams. We present our data structures and algorithms for both fixed-sized (with known size) and arbitrary-sized (with unknown size) streams. Furthermore, we analyze the computational complexity and the memory requirements of our algorithms.

Let N denote the total number of values in the data stream and n denote the number of values in the data stream seen so far. Given a user-defined error ε and any rank r ∈ [1, n], an ε-approximate quantile is an element in the data stream whose rank r′ is within

[r − εn, r + εn]. We maintain a summary structure to continuously answer ε-approximate quantile queries.

3.1 Fixed Size Streams

We first present our approximate quantile computation algorithm for the case where N is given in advance. We generalize the algorithm to unknown N in the following subsection. In practice, the former algorithm can be used for applications such as summarizing large databases that do not fit in main memory. The latter algorithm is useful for continuous streams whose size cannot be predicted beforehand. We introduce our summary structure and describe the algorithm to construct the summary.

3.1.1 Multi-level Quantile Summary

We maintain a multi-level ε-summary S of the stream as data elements come in. An ε-summary is a sketch of the stream which can provide an ε-approximate answer for a quantile query of any rank r ≤ n.

We maintain a multi-level summary structure S = {s0, ..., sl, ..., sL}, where sl is the summary at level l and L is the total number of levels (see Fig. 1). Basically, we divide the incoming stream into blocks of size b (b = ⌊log(εN)/ε⌋). Each level l covers a disjoint bag Bl of consecutive blocks in the stream, and all levels together, ∪Bl, cover the whole stream. Specifically, B0 always contains the most recent block (whether it is complete or not), B1 contains the older two blocks, and BL consists of the oldest 2^L blocks. Each sl is an εl-summary of Bl, where εl ≤ ε.

The multi-level summary construction and maintenance is performed as follows. Initially, all levels are empty. Whenever a data element in the stream arrives, we perform the following update procedure.

1. Insert the element into s0.

2. If s0 is not full (|s0| < b), stop; the update procedure is done for the current element. If s0 becomes full (|s0| = b), we reduce the size of s0 by computing a sketch sc of size ⌈b/2⌉ + 1 on s0. We refer to this sketch computation operation as COMPRESS, which we describe in detail below. Considering s0 as an ε0-summary of B0 where ε0 = 0, the COMPRESS operation guarantees that sc is an (ε0 + 1/b)-summary. After the COMPRESS operation, sc is sent to level 1.

3. If s1 is empty, we set s1 to be sc and the update procedure is done. Otherwise, we merge s1 with the sketch sc sent by level 0 and empty s0. We refer to these operations as MERGE on s1, sc and EMPTY on s0. Generally, the MERGE(sl+1, sc) operation merges sl+1 with the sketch sc by performing a merge sort, and the EMPTY(sl) operation empties sl after the MERGE operation is finished. Finally, we perform COMPRESS on the result of the MERGE and send the resulting sketch sc to level 2.

4. If s2 is empty, we set s2 to be sc and the update procedure is done. Otherwise, we perform the operations s2 = MERGE(s2, sc), sc = COMPRESS(s2, 1/b), EMPTY(s1) in the given order, and send the new sc to level 3.

5. We repeatedly perform step 4 for si, i = 3, ..., L, until we find a level L where sL is empty.

The pseudo code of the entire update procedure performed whenever an element e arrives is shown in Algorithm 1. In the following discussion, we describe the operations COMPRESS and MERGE in detail.

Assume that s is an ε-summary of stream B. For each element e in s, we maintain rmax(e) and rmin(e), which represent the maximum and minimum possible ranks of e in B, respectively. Therefore, we can answer the ε-approximate quantile query of any rank r by returning the value e which satisfies rmax(e) ≤ r + ε|B| and rmin(e) ≥ r − ε|B|. Initially, rmin(e) = rmax(e) = rank(e). rmin(e) and rmax(e) are updated during the COMPRESS and MERGE operations.

COMPRESS(s, 1/b): The COMPRESS operation takes at most ⌈b/2⌉ + 1 values from s, which are: quantile(s, 1), quantile(s, ⌊2|B|/b⌋), quantile(s, ⌊2 · 2|B|/b⌋), ..., quantile(s, ⌊i · 2|B|/b⌋), ..., quantile(s, |B|), where quantile(s, r) queries summary s for the quantile of rank r. According to [6], the result of COMPRESS(s, 1/b) is an (ε + 1/b)-summary, assuming s is an ε-summary.

MERGE(s, s′): The MERGE operation combines s and s′ by performing a merge-sort on s and s′. According to [6], if s is an ε-summary of B and s′ is an ε′-summary of B′, the result of MERGE(s, s′) is an ε″-summary of B ∪ B′, where ε″ = max(ε, ε′).

Lemma 1. The number of levels in the summary structure is less than log(εN).

Proof. During the entire summary structure construction, s0 becomes full at most N/b times, sl becomes full N/(2^l b) times, and the highest level sL becomes full at most once. Therefore,

L ≤ log(N/b) = log(εN) − log(log(εN)) < log(εN)    (1)

Figure 1. Multi-level summary S: This figure highlights the multi-level structure of our ε-summary S = {s0, s1, ..., sL}. The incoming data is divided into equi-sized blocks of size b, and the blocks are grouped into disjoint bags B0, B1, ..., Bl, ..., BL, with Bl for level l. B0 contains the most recent block, B1 contains the older two blocks, and BL consists of the oldest 2^L blocks. At each level, we maintain sl as the εl-summary for Bl. The total number of levels L is no more than log(N/b).

Algorithm 1 Update(e, S, ε)
Input: e: current data element to be inserted; S: current summary structure S = {s0, ..., sl, ..., sL}; ε: required approximation factor of S
1: insert e into s0
2: if |s0| = b then
3:   sort s0
4:   sc ← compress(s0, 1/b)
5:   empty(s0)
6: else
7:   exit
8: end if
9: for l = 1 to L do
10:   if |sl| = 0 then
11:     sl ← sc
12:     break
13:   else
14:     sc ← compress(merge(sl, sc), 1/b)
15:     empty(sl)
16:   end if
17: end for

Lemma 2. Each level in our summary maintains an error less than ε.

Proof. During the construction of S, the error at each level l depends on the COMPRESS and MERGE operations. Initially, ε0 = 0. At each level, the COMPRESS(sl, 1/b) operation generates a new sketch sc with error εl + 1/b, which is added to level l + 1, and MERGE does not increase the error. Therefore, the error of the summary in sl+1 is given by

εl+1 = εl + 1/b = ε0 + (l + 1)/b = (l + 1)/b    (2)

From equations (2) and (1), it is easy to verify that

εl = l/b < log(εN)/b = ε    (3)

To answer a query of any rank r using S, we first sort s0 and merge the summaries at all levels {sl} using the MERGE operation; denote the result MERGE(S). Then the ε-approximate quantile for any rank r is the element e in MERGE(S) which satisfies rmin(e) ≥ r − εN and rmax(e) ≤ r + εN.

Theorem 1. For the multi-level summary S, MERGE(S) is an ε-approximate summary of the entire stream.

Proof. The MERGE operation on all sl generates a summary for ∪Bl with approximation factor εU = max(ε1, ε2, ..., εL). According to Lemma 2, εU < ε. Since the union of all the Bl is the entire stream, MERGE(S) is an ε-approximate summary of the entire stream.

3.1.2 Performance Analysis

Our summary structure maintains at most b + 3 elements in each level (after the MERGE operation), and there are L levels in the summary structure. Therefore, the storage requirement for constructing the summary is bounded by (b + 3)L = O((1/ε) log²(εN)). The storage requirement of our algorithm is higher than the best storage bound proposed by Greenwald and Khanna [5], which is O((1/ε) log(εN)). However, the goal behind our algorithm is to achieve faster computational time with reasonable storage. In practice, the memory requirement of the algorithm can be a small fraction of the RAM on most PCs even for petabyte-sized datasets (see Table 1).

Theorem 2. The average update cost of our algorithm is O(log((1/ε) log(εN))).

Proof. At level 0, for each block we perform a sort and a COMPRESS operation. The cost of the sort per block is b log b; the cost of COMPRESS per block is b/2. In total, there are N/b
blocks, so the total cost at level 0 is N log b + N/2. At each level l, l > 0, we perform a COMPRESS and a MERGE operation. Each COMPRESS costs b, since a linear scan is required to batch-query all the values needed (refer to the COMPRESS operation). Each MERGE costs b with a merge sort. In fact, the computation cost of MERGE also includes the updates of rmin and rmax (discussed in Section 3.3), which can be done in linear time. Thus the cost of a MERGE adds up to 2b. Therefore, the total expected cost of computing the summary structure is

N log b + N/2 + Σ_{i=1}^{L} (N / (2^i b)) · 3b = O(N log((1/ε) log(εN)))

The average update time per element is O(log((1/ε) log(εN))).

In practice, for a fixed ε, the average per-element computation cost of our algorithm is O(log log N), and the overall computation is almost linear in N. The algorithm proposed by Greenwald and Khanna [5] has a best-case computation time per element of O(log s) and a worst-case computation time per element of O(s), where s is (1/ε) log(εN). We demonstrate the performance comparison in our experiment section.

The majority of the computation in the summary construction is dominated by the sort operations on blocks. Although sorting is computationally intensive, it is fast on small blocks which fit in the CPU L2 cache. Table 1 shows the block size and memory requirement as a function of the stream size N with error bound 0.001, using the generalized streaming algorithm of the next section. In practice, the size of the blocks in our algorithm is smaller than the CPU cache size even for petabyte-sized data streams.

3.2 Generalized Streaming Algorithm

We generalize our algorithm for fixed-size streams to compute approximate quantiles in streams without prior knowledge of the size N. The basic idea of our algorithm is as follows. We partition the input stream P into disjoint sub-streams P0, P1, ..., Pm of increasing size. Specifically, sub-stream Pi has size 2^i/ε and covers the elements whose locations are in the interval [(2^i − 1)/ε, (2^(i+1) − 1)/ε). By partitioning the incoming stream into sub-streams of known size, we are able to construct a multi-level summary Si on each sub-stream Pi using our algorithm for fixed-size streams. Our summary construction algorithm is as follows.

1. For the latest sub-stream Pk, which has not yet completed, we maintain a multi-level ε′-summary SC using Algorithm 1 by performing Update(e, SC, ε′) whenever an element arrives. Here ε′ = ε/2.

2. Once the last element of sub-stream Pk arrives, we compute an ε/2-summary on MERGE(SC), which is the merged set of all levels in SC. The resulting summary S^k = COMPRESS(MERGE(SC), ε/2) is an ε-summary of Pk, and it consists of 2/ε elements.

3. The ordered set of the summaries of all complete sub-streams so far, S = {S^0, S^1, ..., S^(k−1)}, is the current multi-level ε-summary of the entire stream except the incomplete sub-stream Pk.

Algorithm 2 gUpdate(e, S, ε, SC)
Input: e: current data element; S: current summary structure S = {S^0, S^1, ..., S^(k−1)} (sub-streams P0, ..., Pk−1 have completely arrived); ε: required approximation factor of S; SC: the fixed-size multi-level summary corresponding to the current sub-stream Pk, SC = {s0, s1, ..., sL}
1: if e is the last element of Pk then
2:   apply merge on all the levels of SC: sall = merge(SC) = merge(s0, s1, ..., sL)
3:   S^k = compress(sall, ε/2)
4:   S = S ∪ {S^k}
5:   SC ← φ
6: else
7:   update SC: SC = Update(e, SC, ε/2)
8: end if

The pseudo code of the update algorithm for a stream of unknown size is shown in Algorithm 2. Initially, S = φ. Whenever an element arrives, gUpdate is performed to update the summary structure S.

To answer a query of any rank r using S, if SC is not empty, we first compute S^k for the incomplete sub-stream Pk: S^k = compress(merge(SC), ε/2); then we merge all the ε-summaries S^0, S^1, ..., S^(k−1) in S together with S^k using the MERGE operation. The final summary is the ε-summary for P.

3.2.1 Performance Analysis

We first present the storage analysis and then analyze the computational complexity of our algorithm.

Theorem 3. The space requirement of Algorithm 2 is O((1/ε) log²(εn)).

Proof. At any point of time, assume that the number of data elements that have arrived so far is n. According to Algorithm 1, we compute and maintain a multi-level ε/2-approximate summary SC for the current sub-stream Pk. For each of the previous sub-streams Pi, i = 1, ..., k − 1, which are complete, we maintain an ε-summary S^i of size 2/ε. Since k ≤ ⌊log(εn + 1)⌋, in total we need O((1/ε) log(εn)) space for these summaries. According to the space bound for fixed-size streams, we need O((1/ε) log²(εn)) space for computing the summary SC for the current sub-stream. Therefore, the space requirement of the entire algorithm at any point of time is O((1/ε) log²(εn)).

Stream size (N)   Maximum Block Size (Bytes)   Bound of Number of Tuples   Bound of Summary Size (Bytes)
10^6              191.2KB                      161K                        1.9MB
10^9              420.4KB                      717K                        8.6MB
10^12             669.6KB                      1.67M                       20MB
10^15             908.8KB                      3.03M                       36.4MB

Table 1. This table shows the memory requirements of our generalized algorithm (with unknown size) for large data streams with an error of 0.001. Each tuple consists of a data value and its minimum and maximum rank in the stream, 12 bytes in total. Observe that the block size is less than a MB and fits in the L2 cache of most CPUs. Therefore, the sorting will be in-memory and can be conducted very fast. Also, the maximum memory requirement of our algorithm is a few MB even for handling streams of one petabyte of data.

Theorem 4. The average update cost of Algorithm 2 is O(log((1/ε) log(εn))).

Proof. According to Theorem 2, the computational complexity for each sub-stream Pi, i = 0, 1, ..., ⌊log(εn + 1)⌋, is O(ni log((1/ε′) log(ε′ni))), where ni = |Pi| = 2^i/ε, Σni = n, and ε′ = ε/2. After each sub-stream Pi is complete, we perform an additional MERGE and COMPRESS operation, each of cost O((1/ε′) log²(ε′ni)), to construct S^i.

Given the above observations, the total computational cost of our algorithm is

Σ_{i=0}^{⌊log(εn+1)⌋} ( (2^i/ε) · log(2(i − 1)/ε) + (i − 1)² · (2/ε) )    (4)

Simplifying equation (4), the total computational cost of our algorithm is O(n log((1/ε) log(εn))); the average update time per element is O(log((1/ε) log(εn))), which is O(log log n) if ε is fixed.

3.3 Updating rmin(e) and rmax(e)

For both fixed-size streams and arbitrary-size streams, to answer the quantile query we need to properly update the rmin and rmax values of each element e in the summary.

rmin(e) and rmax(e) are updated during the COMPRESS and MERGE operations as follows (as in [6]):

Rank update in MERGE: Let S′ = x1, x2, ..., xa and S″ = y1, y2, ..., yb be two quantile summaries, and let S = z1, z2, ..., za+b = MERGE(S′, S″). Assume zi corresponds to some element xr in S′. Let ys be the largest element in S″ that is smaller than xr (ys is undefined if no such element exists), and let yt be the smallest element in S″ that is larger than xr (yt is undefined if no such element exists). Then

rminS(zi) = rminS′(xr),                    if ys is undefined;
rminS(zi) = rminS′(xr) + rminS″(ys),       otherwise.

rmaxS(zi) = rmaxS′(xr) + rmaxS″(ys),       if yt is undefined;
rmaxS(zi) = rmaxS′(xr) + rmaxS″(yt) − 1,   otherwise.

Rank update in COMPRESS: Assume COMPRESS(S′) = S; for any element e ∈ S, we define rminS(e) = rminS′(e) and rmaxS(e) = rmaxS′(e).

4 Implementation and Results

We implemented our algorithms in C++ on an Intel 1.8 GHz Pentium PC with 2GB of main memory. We compared our algorithm with a C++ implementation of the algorithm in [5] (referred to as GK01 in the remainder of the paper) obtained from the authors.

4.1 Results

We measured the performance of GK01 and our algorithm on different data sets. Specifically, we studied the computational performance as a function of the size of the incoming stream, the error ε and the input data distribution. In all experiments, we do not assume knowledge of the stream size, and we use float as the data type, which takes 4 bytes.

4.1.1 Sorted Input

We tested our algorithms using an input stream with either sorted or reverse sorted data. Fig. 2(a) shows the performance of GK01 and our algorithm as the input data stream size varies from 10^6 to 10^7 with a guaranteed error bound of 0.001. For these experiments,
as the data stream size increases, the block size in the largest sub-stream varies from 191.2K to 270.9K. In practice, our algorithm is able to compute the summary on a stream of size 10^7 (40MB) using less than 2MB of RAM. Our algorithm is able to achieve a 200−300× speedup over GK01. Note that the sorted and reverse sorted curves for GK01 are almost overlapping due to the log-scale presentation and the small difference between them (1.16% on average). The same holds for the sorted and reverse sorted curves of our algorithm, where the average difference is 2.1%.

We also measured the performance of our algorithm and GK01 by varying the error bound from 10^−3 to 10^−2 on sorted and reverse sorted streams. Fig. 2(b) shows the performance of our algorithm and GK01 on an input stream of 10^7 data elements. We observe that the performance of our algorithm is almost constant even when the approximation accuracy of the quantiles increases by 10×. Note that GK01 is around 60× slower for large error and around 300× slower for higher precision quantiles compared with our algorithm.

4.1.2 Random Input

In order to measure the average-case performance, we measured the performance of our algorithm and GK01 on random data. Fig. 3(a) shows the performance of GK01 and our algorithm as the input data stream size varies from 10^6 to 10^7 with an error bound of 0.001. As the data size increases, the time taken by our algorithm increases almost linearly, since the computational requirement of our algorithm is O(n log log n). We observe that our algorithm is able to achieve about a 200−300× speedup over GK01.

In Fig. 3(b), we evaluated our algorithms on a data stream of size 10^7 by varying the error bound from 10^−2 to 10^−3. We observe that the performance of our algorithm degrades by less than 10% while computing a summary with 10× higher accuracy. This indicates that the performance of our algorithm is sublinear in the inverse of the error bound. In comparison, the performance of the GK01 algorithm degrades by over 500% as the accuracy of the computed summary increases by 10×. In practice, the computational time increase for computing a higher accuracy summary using our algorithm is significantly lower than that using GK01.

4.2 Analysis

The worst-case storage requirement of our algorithm is O((1/ε) log²(εN)). It is comparable to the storage requirement of MRL [10] and higher than that of GK01. Although the storage requirement is comparatively high, for many practical applications the storage used by our algorithm is small enough to manage. For example, a stream with 100 million values and error bound 0.001 has a worst-case storage requirement of 5MB, which is practical on most PCs. Although our algorithm has a higher storage requirement than GK01, it can construct the summary up to two orders of magnitude faster. In terms of computational cost, our algorithm has an expected cost of O(n log((1/ε) log(εN))). Therefore, for a fixed error bound, the algorithm has an almost linear increase in computational time in n. Our algorithm also has a near-logarithmic increase in time as the error bound decreases. Therefore, our algorithm is able to handle high-accuracy queries on large data streams efficiently.

5 Conclusion and Future Work

We presented fast algorithms for computing approximate quantiles for streams. Our algorithms are based on simple block-wise merge and sort operations, which significantly reduces the update cost incurred for each incoming element in the stream. In order to handle streams of unknown size, we divide the incoming stream into sub-streams of exponentially increasing sizes. We construct summaries efficiently using limited space on the sub-streams. For both fixed-sized and arbitrary-sized streams, our algorithm has an average update time complexity of O(log((1/ε) log(εN))). We also analyzed the performance of prior algorithms. We evaluated our algorithms on different data sizes and compared them with optimized implementations of prior algorithms. In practice, our algorithm can achieve up to a 300× improvement in performance. Moreover, our algorithm exhibits almost linear performance with respect to stream size and performs well on large data streams.

There are many interesting problems for future investigation. We would like to extend our block-wise merge and compress scheme to compute quantiles quickly over sliding windows. Another interesting problem is to extend our current algorithm to handle biased quantiles. We are also interested in designing biased quantile algorithms on distributed streams and sensor networks. We would also like to design better streaming algorithms with higher computational performance for other problems such as k-median clustering, histograms, etc.
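The exponential sub-stream partitioning described in Section 3.2 can be made concrete with a short sketch. The following Python helpers (our own illustrative names, assuming 1/ε is an integer) compute the sub-stream index of an element position and the size of a sub-stream:

```python
import math

def substream_index(pos, eps):
    # Element at 0-indexed position pos falls in sub-stream P_i, where
    # P_i covers positions [(2**i - 1)/eps, (2**(i+1) - 1)/eps).
    return int(math.floor(math.log2(pos * eps + 1)))

def substream_size(i, eps):
    # |P_i| = 2**i / eps, so sub-stream sizes grow exponentially.
    return round(2 ** i / eps)
```

With ε = 0.1, P0 covers positions 0–9, P1 covers 10–29, P2 covers 30–69, and so on; since only O(log(εn)) sub-streams exist at any time, storing one 2/ε-element summary per completed sub-stream stays within the stated space bound.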


(a) Summary construction time vs stream size (b) Summary construction time vs error

Figure 2. Sorted Data: We used the sorted and reverse sorted input data to measure the best possible
performance of the summary construction time using our algorithm and GK01. Fig. 2(a) shows the
computational time as a function of the stream size on a log-scale for a fixed epsilon of 0.001. We observe
that the sorted and reverse sorted computation time curves for GK01 are almost overlapping due to
the log-scale presentation and the small difference between them (1.16% on average). The same holds
for the sorted and reverse sorted curves of our algorithm, where the average difference is
2.1%. We also observe that the performance of our algorithm is almost linear and the computational
performance is almost two orders of magnitude faster than GK01. Fig. 2(b) shows the computational
time as a function of the error. We observe the higher performance of our algorithm which is 60 − 300×
faster than GK01. Moreover, GK01 has a significant performance overhead as the error becomes
smaller.
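The operations being timed here are the simple block-wise merge and sample steps over sorted summaries that the algorithm is built on. As a rough illustration only (not the authors' implementation: a real ε-approximate summary also tracks rank bounds for each kept element, and the sampling rate is chosen per level), merging two sorted blocks and down-sampling can be sketched as:

```python
import heapq

def merge_and_sample(block_a, block_b, rate=2):
    """Merge two sorted blocks and keep every `rate`-th element.

    Hypothetical simplification: only the merge/sample skeleton that makes
    the algorithm fast is shown; quantile-summary bookkeeping is omitted."""
    merged = list(heapq.merge(block_a, block_b))   # O(|a| + |b|), no re-sorting
    return merged[::rate]                          # uniform down-sampling

sample = merge_and_sample([1, 4, 7, 9], [2, 3, 8, 10])
print(sample)  # every 2nd element of [1, 2, 3, 4, 7, 8, 9, 10] -> [1, 3, 7, 9]
```

Because both inputs are already sorted, each element is touched a constant number of times per level, which is why the measured construction time grows almost linearly with the stream size.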

Acknowledgements

We would like to thank Dr. Michael B. Greenwald for providing the optimized GK01 code and many useful suggestions.

References

[1] R. Agrawal and A. Swami. A one-pass space-efficient algorithm for finding quantiles. International Conference on Management of Data (COMAD), 1995.
[2] A. Arasu and G. S. Manku. Approximate counts and quantiles over sliding windows. ACM Symposium on Principles of Database Systems (PODS), pages 286–296, 2004.
[3] G. Cormode, F. Korn, S. Muthukrishnan, and D. Srivastava. Effective computation of biased quantiles over data streams. International Conference on Data Engineering (ICDE), pages 20–32, 2005.
[4] G. Cormode, F. Korn, S. Muthukrishnan, and D. Srivastava. Space- and time-efficient deterministic algorithms for biased quantiles over data streams. ACM Symposium on Principles of Database Systems (PODS), pages 263–272, 2006.
[5] M. B. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. ACM SIGMOD, pages 58–66, 2001.
[6] M. B. Greenwald and S. Khanna. Power-conserving computation of order-statistics over sensor networks. ACM Symposium on Principles of Database Systems (PODS), 2004.
[7] R. Jain and I. Chlamtac. The P² algorithm for dynamic calculation of quantiles and histograms without storing observations. Communications of the ACM, 28(10):1076–1085, 1985.
[8] X. Lin, H. Lu, J. Xu, and J. X. Yu. Continuously maintaining quantile summaries of the most recent n elements over a data stream. In ICDE '04: Proceedings of the 20th International Conference on Data Engineering, page 362, Washington, DC, USA, 2004. IEEE Computer Society.
[9] X. Lin, J. Xu, Q. Zhang, H. Lu, J. X. Yu, X. Zhou, and Y. Yuan. Approximate processing of massive continuous quantile queries over high-speed data streams. IEEE Transactions on Knowledge and Data Engineering, 18(5):683–698, 2006.
[10] G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. ACM SIGMOD, pages 426–435, June 1998.

[Figure 3: two log-scale plots of summary construction time in seconds, with curves for our algorithm and GK01 on random input; (a) time vs. data stream size (1M–10M), (b) time vs. error (0.001–0.01).]

Figure 3. Random Data: We used the random input data to measure the performance of the summary construction time using our algorithm and GK01. Fig. 3(a) shows the computational time as a function of the stream size on a log-scale for a fixed ε of 0.001. We observe that the performance of our algorithm is almost linear. Furthermore, the log-scale plot indicates that our algorithm is almost two orders of magnitude faster than GK01. Fig. 3(b) shows the computational time as a function of the error. We observe that our algorithm is almost constant, whereas GK01 incurs a significant performance overhead as the error becomes smaller.

[11] G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. ACM SIGMOD, pages 251–262, 1999.
[12] J. I. Munro and M. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, 12:315–323, 1980.
[13] M. Paterson. Progress in selection. Scandinavian Workshop on Algorithm Theory, 1997.
[14] N. Shrivastava, C. Buragohain, D. Agrawal, and S. Suri. Medians and beyond: new aggregation techniques for sensor networks. In SenSys '04: Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems, pages 239–249, New York, NY, USA, 2004. ACM Press.
[15] Y. Zhang, X. Lin, J. Xu, F. Korn, and W. Wang. Space-efficient relative error order sketch over data streams. International Conference on Data Engineering (ICDE), 2006.

