Optimal Quantile Estimation
April 8, 2024
Abstract
Estimating quantiles is one of the foundational problems of data sketching. Given n elements
x1 , x2 , . . . , xn from some universe of size U arriving in a data stream, a quantile sketch estimates
the rank of any element with additive error at most εn. A low-space algorithm solving this task
has applications in database systems, network measurement, load balancing, and many other
practical scenarios.
Current quantile estimation algorithms described as optimal include the GK sketch (Green-
wald and Khanna 2001) using O(ε−1 log n) words (deterministic) and the KLL sketch (Karnin,
Lang, and Liberty 2016) using O(ε−1 log log(1/δ)) words (randomized, with failure probabil-
ity δ). However, both algorithms are only optimal in the comparison-based model, whereas
most typical applications involve streams of integers that the sketch can use aside from making
comparisons.
If we go beyond the comparison-based model, the deterministic q-digest sketch (Shrivastava,
Buragohain, Agrawal, and Suri 2004) achieves a space complexity of O(ε−1 log U ) words, which
is incomparable to the previously-mentioned sketches. It has long been asked whether there is
a quantile sketch using O(ε−1 ) words of space (which is optimal as long as n ≤ poly(U )). In
this work, we present a deterministic algorithm using O(ε−1 ) words, resolving this line of work.
∗ E-mail: [email protected]. This author was supported by an NSF GRFP Fellowship.
† E-mail: [email protected]. This author was supported by an NSF GRFP Fellowship.
‡ E-mail: [email protected]. This author was supported by Avishay Tal’s Sloan Research Fellowship, NSF CAREER Award CCF-2145474, and Jelani Nelson’s ONR grant N00014-18-1-2562.
Contents

1 Introduction
1.1 Discussion and further directions
1.2 Related works
2 Preliminaries
2.1 Definitions for streams
2.2 Other notation
3 Technical overview
6 Practical considerations
7 Lower bounds
1 Introduction
Estimating basic statistics such as the mean, median, minimum/maximum, and variance of large
datasets is a fundamental problem of wide practical interest. Nowadays, the massive amount of
data often exceeds the memory capacity of the algorithm. This is captured by the streaming model:
The bounded-memory algorithm makes one pass over the data stream x1 , x2 , . . . , xn from a universe
[U ] = {1, . . . , U } and, in the end, outputs the statistic of interest. The memory state of the algorithm
is therefore a sketch of the data set that contains the information about the statistic and allows
future insertions. Here, memory consumption is conventionally measured in units of words, where
1 word equals log n + log U bits.
Most of these simple statistics can be computed exactly with a constant number of words. But
the median, or more generally, the ϕ-quantile, is one exception. In their pioneering paper, Munro
and Paterson [MP80] showed that even an algorithm that makes p passes over the data stream still
needs Ω(n1/p ) space to find the median. Fortunately, for many practical applications, it suffices
to find the ε-approximate ϕ-quantile: Instead of outputting the element of rank exactly ϕn, the
algorithm only has to output an element of rank (ϕ ± ε)n. Such algorithms are called approximate
quantile sketches. They are actually implemented in practice, appearing in systems or libraries
such as Spark-SQL [AXL+ 15], the Apache DataSketches project [Apa], GoogleSQL [Goo], and the
popular machine learning library XGBoost [CG16].
There are also other queries the sketch could need to answer: For example, online queries asked
in the middle of the stream, or rank queries, where the algorithm is asked to estimate the rank of an
element up to εn error. As finding approximate quantiles is equivalent to answering rank queries, to solve all of these tasks it suffices to solve the following strongest definition.
Problem 1.1 (Quantile sketch). The problem of quantile sketching (or specifically, ε-approximate
quantile sketching) is to find a data structure A taking as little space as possible in order to solve
the following problem: Given a stream of elements π = x1 , x2 , . . . , xn ∈ [U ], we define the partial
stream πt = x1 , x2 , . . . , xt . For element x ∈ [U ], let rankπt (x) be the number of elements in πt
that are at most x. When a query x arrives at time t, then A must output an approximate rank r,
such that |r − rankπt (x)| ≤ εt.
Two notable quantile sketches include the Greenwald and Khanna (GK) sketch [GK01] using
O(ε−1 log n) words (deterministic) and KLL sketch [KLL16] using O(ε−1 log log(1/δ)) words (ran-
domized, with failure probability δ). Both algorithms follow the comparison-based paradigm, where
the sketch cannot see anything about the elements themselves and can only make black-box com-
parisons between elements it has stored. They are known to be optimal in this paradigm ([CV20]
shows that the GK sketch is optimal for deterministic algorithms and [KLL16] shows that KLL is optimal for
randomized algorithms).
However, most typical applications of quantile sketches apply to streams of integers (or elements
of some finite universe), rather than just to black-box comparable objects. For example, the elements
in the universe could be one of the following: network response times (with a preset timeout), IP
addresses, file sizes, or any other data with fixed precision. This may allow for a better quantile
sketch than in the comparison-based model. The best previously-known non-comparison-based
algorithm is the q-digest sketch introduced in [SBAS04], which is a deterministic sketch using
O(ε−1 log U ) words. Unfortunately, this isn’t really better than the GK sketch, as n is typically
much less than poly(U ). On the other hand, the only lower bound we know is the trivial lower
bound of Ω(ε−1 ) words in the regime where n ≤ poly(U ) (which holds for both deterministic and
randomized algorithms). Motivated by this gap, Greenwald and Khanna, in their survey [GK16],
asked if the q-digest algorithm is already optimal, and as such, one cannot substantially improve
upon comparison-based sketches.
In this work, we resolve this question fully and provide a deterministic quantile sketch that uses
the optimal O(ε−1 ) words. This is the first quantile sketch that goes beyond the comparison-based
lower bound (in the natural regime of n ≤ poly(U )) and is the first direct improvement on the
q-digest sketch in the 20 years since it was proposed.
Theorem 1.2. There exists a deterministic streaming algorithm for Problem 1.1 using O(ε−1 ) words
(more specifically, O(ε−1 (log(εn) + log(εU ))) bits) of space1 .
Our sketch uses less space than not only the deterministic q-digest and GK sketches but also
the randomized KLL sketch, when compared in words. Note that randomized algorithms, like KLL
sketch, have failure probabilities and retain their theoretical guarantee only against non-adaptive
adversaries. The fact that our algorithm is deterministic provides stronger robustness. As these
sketches are already implemented in practice, we hope that our algorithm can help improve the
performance of these libraries.
Optimality of our algorithm. As we discussed earlier, the quantile sketch lower bound of Ω(ε−1 )
words only holds in the regime where n ≤ poly(U ). However, we conjecture that our algorithm is
optimal in general for deterministic algorithms. Specifically, there is a simple example showing any
sketch for Problem 1.1 requires at least ε−1 log(εU ) bits (see Section 7), but we also need to show
a lower bound of ε−1 log(εn) bits. We make the following conjecture about deterministic parallel
counting, which would imply our lower bound because any algorithm for Problem 1.1 can also solve
the k-parallel counters problem for k = Θ(1/ε).
Conjecture 1.3 (Deterministic parallel counters). We define the k-parallel counters problem as
follows: There are k counters initialized to 0. Given a stream of increments i1 , i2 , . . . , in ∈ [k]
where it means to increment the it -th counter by 1, the algorithm has to output the final count of
each counter up to an additive error of n/k.
Footnote 1: Here, technically, when we write log(εn) and log(εU ), it really should be max{log(εn), 1} and max{log(εU ), 1} to avoid the uninteresting corner cases.
We conjecture that any deterministic algorithm for this problem requires at least Ω(k log(n/k))
bits of memory.
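As a concrete illustration (ours, not the paper’s), the following Python sketch states the k-parallel counters problem together with the trivial algorithm that simply keeps every counter exactly; the conjecture asserts that, up to constants, one cannot beat the roughly k log(n/k) bits this costs.

```python
# Illustration of the k-parallel counters problem (names are ours).
# The trivial algorithm keeps all k counters exactly, which needs about
# k * log(n/k) bits once each counter is only required up to error n/k.

def parallel_counters(stream, k):
    counts = [0] * k
    for i in stream:             # each i in {0, ..., k-1} increments counter i
        counts[i] += 1
    tolerance = len(stream) / k  # allowed additive error per counter
    return counts, tolerance

counts, tol = parallel_counters([0, 1, 1, 2, 2, 2], k=3)
print(counts, tol)  # [1, 2, 3] with allowed error 2.0
```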
This conjecture essentially says that to maintain k counters in parallel, one needs to maintain
each counter independently. A recent paper by Aden-Ali, Han, Nelson, and Yu [AHNY22] studies
this problem for randomized algorithms. The authors of that paper proved that any randomized
algorithm with failure probability δ must use at least k · min(log(n/k), log log(1/δ)) bits when log log(1/δ) = Ω(k). Setting 1/δ = 2^(2^k), this directly implies an Ω(min{ε−1 log(εn), ε−2 })-bit lower bound for any algorithm solving Problem 1.1. Thus, our algorithm is also optimal in the regime when ε−1 > log(εn).
Improvements in the randomized setting. Deterministic algorithms are used at the heart
of the randomized ones. Many randomized algorithms (including the algorithm by Felber and
Ostrovsky [FO17], the KLL sketch [KLL16], and the mergeable summary of [ACH+ 13]) follow the
paradigm of first sampling a number of elements from the stream and then maintaining them with
a careful combination of deterministic sketches.
As long as n ≤ poly(U ), our algorithm is optimal even in the randomized setting, but when this condition is not met, it is possible to do better in the randomized setting. If n is known in advance, one can simply sample (log 1/δ)/ε2 elements and feed them into our sketch.2 It uses a memory of O(ε−1 (log log(1/δ) + log U ) + log n) bits, which strictly improves that of the KLL sketch. We note that, in the most common regime where δ > 1/2^(εn), there is an Ω(ε−1 (log log(1/δ) + log εU ))-bit lower bound for streaming quantile sketches.3 So our algorithm is very close to optimal in the randomized setting as well.
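A hedged Python sketch of this sampling reduction (our code; `QuantileSketch` stands in for the O(ε−1)-word deterministic sketch of Theorem 1.2 and is assumed to expose insert and rank methods):

```python
import math
import random

def randomized_quantile_sketch(stream, eps, delta, QuantileSketch):
    """Sample ~log(1/delta)/eps^2 elements and feed them into a deterministic
    quantile sketch; rank answers are rescaled back to the full stream."""
    n = len(stream)                                    # n assumed known in advance
    m = min(n, math.ceil(math.log(1 / delta) / eps ** 2))
    sample = random.sample(stream, m)
    sk = QuantileSketch(eps)
    for x in sample:
        sk.insert(x)
    return lambda x: sk.rank(x) * n / m                # approximate rank of x
```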
More on quantile sketches. Early works on quantile sketches include [MP80, ARS97, MRL98].
Among them, the MRL sketch [MRL98] and its randomized variant from [ACH+ 13] lead to the
aforementioned KLL sketch. Another variant of the problem is the biased quantile sketches (also
called relative error quantile sketches), meaning that for queries of rank r, the algorithm can only
have an error of εr instead of εn. That is, we require that the 0.1% quantiles are extremely accurate,
while the 50% quantile can allow much more error. This question was raised in [GZ03]; since then,
people have proposed deterministic [CKMS06, ZW07] and randomized [CKL+ 21] algorithms for this
problem. There are also other variants such as sliding windows [AM04], weighted streams [AJPS23]
and relative value error [MRL19]. In practice, there are also the t-digest sketch [DE19] and the
moment-based sketch [GDT+ 18], which do not have strict theoretical guarantees. In particular,
[CMRV21] shows that there exists a data distribution, such that even i.i.d. samples from that
distribution can cause t-digest to have arbitrarily large error.
Footnote 2: If n is not known in advance, instead of simple sampling, one can replace the use of the GK sketch in KLL with our algorithm. As the compactor hierarchy part of KLL stores only O(1/ε) elements, this results in the same space complexity as the known-n case.
Footnote 3: This follows from the ε−1 log(εU ) lower bound in Section 7 (which holds for both deterministic and randomized algorithms), and the aforementioned k · min(log(n/k), log log(1/δ)) lower bound in [AHNY22] (setting k = 1/ε).
2 Preliminaries
Define the rank of an element x in a stream π, denoted rankπ (x), to be the total number of elements
of π that are less than or equal to x. We also define a notion of distance between two streams. For
two streams π, π ′ of equal length, define their distance as follows:
d(π, π ′ ) = max_{x∈[1,U ]} |rankπ (x) − rankπ′ (x)|.
We observe that this distance satisfies some basic properties, i.e., the triangle inequality, and sub-
additivity under concatenation of streams:
Observation 2.1 (Triangle inequality). For all streams π, π ′ , π ′′ of the same length,
d(π, π ′ ) ≤ d(π, π ′′ ) + d(π ′′ , π ′ )
Observation 2.2. For all streams π, π ′ of the same length and ρ, ρ′ of the same length,
d(π ◦ ρ, π ′ ◦ ρ′ ) ≤ d(π, π ′ ) + d(ρ, ρ′ ),
where π ◦ ρ denotes concatenation of the streams π and ρ.
Throughout this paper, we use standard asymptotic notation, including big O and little o. For
clarity, we sometimes omit floor and ceiling signs where they might technically be required.
All logarithms in this paper are considered to be in base 2, and we define the iterated logarithm
log∗ (m) to be the number of times we need to apply a logarithm to the number m to bring its value
below 1.
We also define the function VxW, for any x ∈ R+ , to be the smallest power of 2 that is at least
x. In particular, we always have x ≤ VxW ≤ 2x.
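The following small Python helpers (ours, purely illustrative) spell out the three pieces of notation just defined: rankπ(x), the stream distance d(π, π′), and the power-of-2 rounding VxW.

```python
def rank(stream, x):
    """rank_pi(x): number of elements of the stream that are <= x."""
    return sum(1 for y in stream if y <= x)

def distance(pi, pi_prime, U):
    """d(pi, pi') = max over x in [1, U] of |rank_pi(x) - rank_pi'(x)|."""
    assert len(pi) == len(pi_prime)
    return max(abs(rank(pi, x) - rank(pi_prime, x)) for x in range(1, U + 1))

def round_up_pow2(x):
    """Smallest power of 2 that is at least x (so x <= result <= 2x)."""
    p = 1
    while p < x:
        p *= 2
    return p

print(distance([1, 2, 3], [1, 2, 4], U=4))  # 1
print(round_up_pow2(5))                     # 8
```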
3 Technical overview
The eager q-digest sketch. Before explaining our algorithm, it would be instructive to first review the q-digest algorithm because our algorithm is based on it. At a high level, this data
structure is a tree where every node represents some subset of the stream elements received so far.
The node doesn’t store each element exactly, but only an interval that contains all of the elements it
represents and a count of how many elements it represents. The version we describe slightly differs
from the typical treatment, and we call it eager q-digest. The data structure has the following
structure and supports the following operations.
• Structure: The eager q-digest is a binary tree of depth log U . The nodes in the bottom level
of the tree (which we call the base level ) correspond left-to-right to each element 1, 2, . . . , U
in the universe. Each non-base level node corresponds to a subinterval of [1, U ] consisting of
the base level nodes below it. Each node u represents a subset of W [u] elements (W [u] is the
weight/count of the node) that have been received so far; that is, when an element is inserted,
it increments the counter W [u] at some node. The W [u] elements that u represents must all
be within the node’s interval.
• Insertion: We insert elements into the tree top-down as follows: upon receiving an element
x ∈ [1, U ], look at the path from the root to x and increment the counter W [u] of the first
non-full node u. A node is full if its weight is already at capacity, which we set to be α := εn/ log U .
Base level nodes are permitted to exceed capacity.
• Rank queries: We are given an element x ∈ [U ] for which we want to return the rank. To
do this, answer with the total weight of everything on the path from the root to the base node
x and everything to the left of that path in the tree. All the elements inserted in nodes to the
left of this path must have been less than x (since their intervals only contain elements less
than x) and all the elements inserted to the right must be larger. As such, the error in the rank estimate is at most the total weight of the nodes along the path (not including the base node x), which is bounded by the depth of the tree times the capacity of each node, at most α log U = εn.
• Quantile queries: We are given a rank r ∈ [n] for which we want to return an element
between the rank-(r − εn) and the rank-(r + εn) element of the stream. The ability to do this
follows from the ability to answer rank queries, since we can simply perform a binary search.4
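To make the operations above concrete, here is a minimal Python sketch of the eager q-digest (our illustration, not the paper’s optimized construction); U is assumed to be a power of 2 and the capacity α is passed in explicitly.

```python
# A minimal, illustrative implementation of the eager q-digest described above.

class EagerQDigest:
    def __init__(self, U, alpha):
        self.U = U
        self.alpha = alpha
        self.weight = {}  # maps an interval (lo, hi) to its weight W[u]

    def insert(self, x):
        # Walk from the root [1, U] toward the base node [x, x], incrementing
        # the first non-full node (base-level nodes may exceed capacity).
        lo, hi = 1, self.U
        while True:
            w = self.weight.get((lo, hi), 0)
            if lo == hi or w < self.alpha:
                self.weight[(lo, hi)] = w + 1
                return
            mid = (lo + hi) // 2
            if x <= mid:
                hi = mid
            else:
                lo = mid + 1

    def rank(self, x):
        # Total weight of nodes on the root-to-x path and to its left, i.e.
        # all nodes whose interval contains some element <= x.
        return sum(w for (lo, hi), w in self.weight.items() if lo <= x)

# Example: the estimate is off by at most alpha * log2(U) = 2 * 3 = 6 here.
sketch = EagerQDigest(U=8, alpha=2)
for v in [1, 1, 2, 3, 5, 5, 6, 8]:
    sketch.insert(v)
print(sketch.rank(3))  # true rank is 4; the estimate is within 6 of it
```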
Let us look at an example of an eager q-digest. Each node has capacity (maximum weight)
α = 5 for this example.
[Figure: an example eager q-digest with node capacity α = 5; the triangle node covers the interval [1, 4] and has weight W = 5, with the square and star nodes below it.]
In this example, triangle represents 5 elements in the interval [1, 4], square represents 5 more
elements in the interval [3, 4], and star represents 3 more elements exactly equal to 3. If we insert
the number 3 into the example, it would not get inserted into triangle or square because they are
full, and so it would be put into star and increment the count by 1. If we want to then find the
rank of the number 3 (in the pictured tree exactly, before the insertion), we return the sum of the
weights on the circled nodes plus the path to x, which is 9 + 5 + 5 + 3 = 22. This can be off by at
most 10 – we know the 9 elements represented by the circled nodes are definitely less than 3, the
ones inserted directly to the star are exactly 3, the ones to the right are definitely more than 3. The
ones inserted to the triangle and square are the only unknowns.
Footnote 4: This is true in a black-box way; see Section 6 for details.
Analyzing the space complexity of eager q-digest. The space complexity (in bits) of q-
digest (and similarly of eager q-digest) is well known to be O(ε−1 ((log U )2 + log U log n)). Let
us understand why, so we can see where we might improve upon this. The space complexity is
approximately the product of the following two things:
(1) The number of non-empty nodes. This is at most O(ε−1 log U ) since the number of full nodes
(which is within a constant factor of the number of non-empty nodes) is n/α = ε−1 log U .
(2) The amount of space necessary per non-empty node. Naively, we would need to store the
location of each nonempty node (the interval it corresponds to) and the weight of the node
(the number of stream elements it corresponds to). This would take log U + log n space.
As such, in total the space complexity is O(ε−1 ((log U )2 + log U log n)). In our sketch, we do not
reduce (1), the number of nodes. Therefore, we must reduce the storage in (2) substantially. This
has two parts: efficiently storing the corresponding interval (location in the binary tree) of each
node and efficiently storing the count.
It is actually quite simple to store the interval/location of each node: To see this, notice that
the non-empty nodes form a connected tree of their own within the large binary tree. Since the tree
is binary, storing the edge from a parent to child in the tree of nonempty nodes takes only O(1)
space. This observation is quite straightforward from the way we formulated q-digest, but the usual
implementation of q-digest doesn’t push to the top eagerly, and so is unable to directly save this
log U term.
The main challenge: avoiding storing counters. The second challenge is to avoid storing a
counter W [u] at each node. One useful observation about the structure of the tree of non-empty
nodes is that all internal nodes are full (at capacity) and only its leaves, which we call exposed nodes,
need counters. Unfortunately, a constant fraction of the non-empty nodes are exposed nodes, so
this doesn’t actually save on space.
Another idea is to store only an approximate count at each node. Unfortunately, we cannot
just store an independent approximate count at each node, or even only a counter that estimates
when the count surpasses the threshold α; this is impossible to do deterministically without using
log α space (which is too large). Even in the randomized setting, approximately counting each node
independently does not improve upon KLL.
The situation is summarized in Figure 2. At each of the exposed nodes, denoted
v1 , v2 , . . . , vℓ , we want to store some approximate version of counters W [v1 ], W [v2 ], . . . , W [vℓ ] that
represent how many elements are inserted into that node using significantly less than log n space,
ideally O(1) space.
For simplicity, assume that elements are received in “batches” of size n̂ (to be determined later), which we can use unlimited space to process. Our only constraint is to minimize storage space between batches. Let us assume that before the batch, all the counters W [v1 ], W [v2 ], . . . , W [vℓ ] are less than α/2 and set n̂ = α/2 so the set of exposed nodes won’t change within the batch. At the end of the batch, we need to find suitable approximate values Ŵ [v1 ], Ŵ [v2 ], . . . , Ŵ [vℓ ] to increment the counters by, based on the true counts C[v1 ], C[v2 ], . . . , C[vℓ ] of the stream elements.
Let us quantify how “inaccurate” these approximate counts can be compared to the true counts.
The amount of additional error (in rank-space) introduced by answering a rank query for some
universe element below a node vi should be at most εn̂ – we can tolerate this much because it only doubles ε and we could have chosen ε to be half as big at the start. The value of this rank query, or the total weight of all the nodes to the left of vi and the path to vi , changes by (Ŵ [v1 ] + · · · + Ŵ [vi ]) − (C[v1 ] + · · · + C[vi ]), and so we need to ensure that, for all i,

|(Ŵ [v1 ] + · · · + Ŵ [vi ]) − (C[v1 ] + · · · + C[vi ])| < εn̂.    (1)

[Figure 2: The tree formed by non-empty nodes in eager q-digest: full nodes with intervals [1, 8], [1, 4], [5, 8] and weight α sit above exposed nodes v1 , v2 , . . . whose weights are to be determined. (The filled nodes are the full nodes.)]
Here is a simple way to make that happen: Take the 0-th element, the (εn̂)-th element, the (2εn̂)-th element and so on, and increment the counters W [vi ] corresponding to those elements each by εn̂. Then, Equation (1) is satisfied, and also the counters can be stored in O(log(ε−1 )) bits since they are always multiples of εn̂ = εα/2 and so only have 2ε−1 possibilities.
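A small Python sketch of this rounding rule (ours; `C` holds the true batch counts C[v1], . . . , C[vℓ] listed left to right, and `eps_nhat` is εn̂, assumed to be a positive integer):

```python
# Given the exact batch counts C, increment W-hat[v_i] by eps*n_hat for every
# element of the batch at position 0, eps*n_hat, 2*eps*n_hat, ... that falls
# inside v_i. Prefix sums of W-hat then track prefix sums of C to within
# eps*n_hat, which is Equation (1).

def round_batch_counts(C, eps_nhat):
    W_hat = [0] * len(C)
    prefix, i = 0, 0
    for pos in range(0, sum(C), eps_nhat):    # positions 0, eps*n_hat, 2*eps*n_hat, ...
        while prefix + C[i] <= pos:            # find the node containing this position
            prefix += C[i]
            i += 1
        W_hat[i] += eps_nhat
    return W_hat

# Example with eps*n_hat = 2: prefix sums of the returned increments stay
# within 2 of the true prefix sums.
print(round_batch_counts([3, 1, 4, 0, 2], eps_nhat=2))  # [4, 0, 4, 0, 2]
```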
The main idea: recursive quantile sketch. Of course, the glaring issue is how to find (an approximation of) the 0-th element, the (εn̂)-th element, the (2εn̂)-th element and so on, or at least which vi each one corresponds to, without storing the entire batch of n̂ elements. In particular, we have reduced to the following problem: We receive n̂ elements in a stream in the universe {v1 , . . . , vℓ }, and we need to return the approximate 0-th element, the (εn̂)-th element, the (2εn̂)-th element and so on. These are just quantile queries! In particular, we need a quantile sketch on a universe of size ℓ receiving n̂ elements. The new universe size ℓ is at most the number of exposed
nodes of the eager q-digest, which is at most ε−1 log U , and so we have a big saving – the new
quantile sketch is on a logarithmically smaller universe, and so even naively using eager q-digest for
the inner sketch will save space.
This solves the problem. The outer quantile sketch requires only O(ε−1 log(ε−1 ) log U ) space because it needs O(log(ε−1 )) space per node, and the inner sketch requires only ε−1 log log U · (log log U + log n̂) space because its universe size is log U . Both of these are within O(ε−1 log(ε−1 ) log log U ) words of memory. An illustration of the recursive step is shown in Figure 3, where we build a new eager q-digest whose universe is the exposed nodes of our original eager q-digest. This new eager q-digest will process n̂ elements and ultimately return the 0-th element, the εn̂-th element, the 2εn̂-th element, and so on.
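A conceptual Python sketch of this recursive step (ours), reusing the EagerQDigest class and round_up_pow2 helper from the earlier examples as the inner sketch; node_index is an assumed helper mapping a stream element to the left-to-right index of the exposed node whose interval contains it.

```python
def process_batch(batch, num_exposed, eps_nhat, node_index):
    # Inner quantile sketch over the small universe {1, ..., num_exposed};
    # the capacity is chosen loosely, purely for illustration.
    inner = EagerQDigest(U=round_up_pow2(num_exposed),
                         alpha=max(1, eps_nhat // 4))
    for x in batch:
        inner.insert(node_index(x))          # insert the *index* of the exposed node
    # Read off the (approximate) 0-th, eps*n_hat-th, 2*eps*n_hat-th, ... elements
    # and turn each into an increment of eps*n_hat on the corresponding counter.
    increments = [0] * (num_exposed + 1)     # 1-indexed
    for pos in range(0, len(batch), eps_nhat):
        i = next(i for i in range(1, num_exposed + 1) if inner.rank(i) > pos)
        increments[i] += eps_nhat
    return increments
```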
Modifications to get the optimal bounds. We can iterate this construction recursively by
building a new eager q-digest on the exposed nodes of the second eager q-digest. This process will
[Figure 3: An inner eager q-digest tree whose universe is the exposed nodes of the original tree. (The filled nodes are the full nodes.)]
continue to reduce the universe size nearly logarithmically each time. The number of layers before
reaching a constant sized universe is roughly log∗ U , and so to get constant error and constant space,
we will need to be careful with how we set the error fraction εi for each recursive layer and argue
that the total size of the sketches converges.
We also made an assumption that when we started receiving the batch of n̂ elements, all the exposed nodes had weight at most α/2. However, an exposed node could start with any weight below its capacity α. To deal with this, we need the lower-level q-digest to deal with nodes getting “overfilled.”
Our final algorithm also manages to get rid of log(ε−1 ) factors in the space complexity. This
takes a number of additional considerations. One is that the nodes cannot even store counts that require O(log(ε−1 )) bits, but truly need to just be either empty or full. To deal with this, we will increase the batch size n̂ so that εn̂ is at least α, but now we will need to deal with nodes getting overfilled again. A second
issue is that, as described, at the last layer of recursion, the number of nodes would be ε−1 log(ε−1 ),
which is slightly too large. To deal with this, we will have to use an optimized eager q-digest, which
we discuss in Section 4.
4 Optimized eager q-digest

In this section, we will describe the optimized eager q-digest algorithm. This slightly improves the q-
digest algorithm of [SBAS04]. The space complexity of optimized eager q-digest will be fairly similar
to that of q-digest (it achieves O(ε−1 log εn log εU ) bits instead of O(ε−1 (log U + log εn) log U ) bits).
Although it does not contain the main idea of this paper, we need it as a building block of our
algorithm. Also, we hope that this section can be a warm-up that familiarizes readers with our
notation and the basics about q-digest.
Though we have already talked briefly about the eager q-digest in the technical overview, we
will start anew in this section by building the algorithm up from the original q-digest, since we
make several more modifications than what we described in that section.
Tree structure of the original q-digest sketch. In the original q-digest sketch of [SBAS04],
there is an underlying complete binary tree T of depth log U . We say that those nodes at depth log U
are at the base-level of T . These nodes correspond (from left to right) to each element 1, 2, . . . , U
in the universe.
We label each node in T with a subinterval of [1, U ]. First, the base-level node corresponding to
i is labeled with [i, i]. For a node u above the base level, its interval is the union of all its base-level
descendants. For every node u ∈ T , it also has a weight W [u] associated to it. Intuitively, one can
think of the nodes u ∈ T with weight W [u] and interval label [au , bu ] as a representative of W [u]
many elements in the stream that are within [au , bu ].
In the original q-digest, all nodes u except the base-level nodes can have weight at most W [u] ≤ α. This is the capacity of the node and is usually set to α = εn/ log U . When there is an insertion of a stream element x, the algorithm finds the base-level node v whose interval equals [x, x] and increases W [v] by 1. This is always possible as there is no capacity constraint for base-level nodes.
Since this tree T has as many as 2U − 1 nodes, the q-digest algorithm does not store the tree T nor the labels. It only stores the set S of non-empty nodes, i.e., those nodes v with W [v] > 0. As there are more and more insertions, the set S grows. Whenever |S| > (log U )/ε, the q-digest algorithm performs a compression.

One way of performing such a compression is to find all nodes u such that W [u] > 0 and W [parent(u)] < α, and move one unit of weight from W [u] to W [parent(u)]. After there is no such node u, let F ⊆ S be the set of full nodes v with W [v] = α. We know that |F | ≤ n/α = (log U )/ε. Now for every nonempty node u ∈ S, its parent must be a full node. So compression gets the number of nonempty nodes down to |S| ≤ 3|F | = O((log U )/ε). For every u ∈ S, the actual information stored by the original q-digest is: (1) the position of u in the tree T (which takes log U bits); (2) the weight W [u] (which takes log α ≈ log(εn) bits).
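A Python sketch of this compression step (ours, not the paper’s pseudocode); nodes are keyed by their interval (lo, hi) inside a tree over [1, U], and the sketch moves as much weight as possible at once rather than one unit at a time.

```python
def parent(lo, hi, U):
    """Parent interval of a dyadic node (lo, hi); None for the root."""
    if (lo, hi) == (1, U):
        return None
    length = hi - lo + 1
    plo = lo - (lo - 1) % (2 * length)       # start of the parent's interval
    return (plo, plo + 2 * length - 1)

def compress(weight, U, alpha):
    """Repeatedly move weight from a node into its non-full parent."""
    changed = True
    while changed:
        changed = False
        for node in list(weight):
            p = parent(*node, U)
            if p is None or weight[node] == 0:
                continue
            room = alpha - weight.get(p, 0)
            if room > 0:
                moved = min(room, weight[node])
                weight[p] = weight.get(p, 0) + moved
                weight[node] -= moved
                changed = True
    return {u: w for u, w in weight.items() if w > 0}
```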
Finally, for all these to make sense, we have to be able to answer rank queries. In order to
estimate rank(x), we simply add up the weights W [u] of all nodes u whose intervals contain at least
one element less or equal to x. This might overcount the number of actual stream elements which
are at most x; any node whose interval contains both an element that is at most x and an element greater than x can contribute to the overcounting. These nodes are all (strict) ancestors of the node in the
base level corresponding to x, so there are at most log U of them, and their total weight is thus at
most α · log U . Thus the answer to the rank query is off by at most α · log U ≤ εn.
Now, having described the original q-digest algorithm, we will describe the modifications we
make to it to get the optimized eager q-digest.
Modification 1: Use a forest of 1/ε trees. To improve the log U factor to log(εU ), we have
to equally divide the universe into 1/ε intervals and maintain a tree for each one. This gives us a
forest of 1/ε trees, while allowing us to set α to εn/ log(εU ) (see footnote 5).
Roughly speaking, this change corresponds to removing the top log(1/ε) levels of the q-digest
tree while keeping the levels below it. Although only offering a small improvement here, this is
actually essential for our final algorithm. It is one of the ingredients that allow us to avoid the extra
O(ε−1 log(ε−1 )) term in the number of words used.
Modification 2: Move weights up eagerly. Next we describe how nodes are inserted into the
eager q-digest. The original q-digest algorithm moves weight up the tree lazily; that is, it does so
when the number of nodes stored exceeds its limit. By contrast, the eager q-digest will do so eagerly:
upon receiving an element of the stream, it will immediately move it up as much as possible.
More formally, when we receive an element x of the stream, we do not increase the weight of
the base-level node with interval [x, x] as we would in a normal q-digest. Instead, we immediately
move this weight up. That is, we pick the highest non-full node whose interval contains x, and we
increment its weight by 1.
Space Complexity: Full nodes and non-full nodes. We now look at the space complexity
of optimized eager q-digest. An ordinary q-digest has to store, for every non-empty node, both its
location in T and its weight. However, in an optimized eager q-digest, the non-empty nodes are
upward closed ; that is, every parent of a non-empty node is also non-empty. (In fact, every parent
of a non-empty node is actually full, since otherwise the weight would have been pushed up to
the parent.) Thus, the non-empty nodes form at most 1/ε trees which include the roots of their
components in T . Storing the topology of a binary tree of size k only requires space k (it is enough
to use 2 bits for each node to record whether it has left/right child). Thus the total space required
to describe the locations of the non-empty nodes is only O(|S| + 1/ε) bits, where |S| is the total
number of non-empty nodes.
At this point, for all the full nodes, we are already done. Since we know that their weight is exactly α, there is nothing more to store. Since |S| ≤ 3|F | ≤ 3n/α = O(log(εU )/ε), we are able to store all the full nodes with only O(1/ε) words. However, there are still the non-full nodes in S. Since we have to store the weight for each of them, this takes O(|S| log α) = O((log(εU )/ε) · log(εn)) space.
This completes the description of the optimized eager q-digest. We have saved an |S| log U term in the space
complexity by not having to store the location of each non-empty node, but the |S| log α term from
storing the weights of non-full nodes in S still remains. In the following section, the main idea of
our algorithm is to recursively maintain these non-full nodes in S with another recursive layer of
our algorithm. When carefully implemented, we are able to ensure that every node in our trees is either full or empty, except at the very last layer of recursion. This removes the extra |S| log α
term.
5 The main algorithm

In this section, we will implement the sketch in Table 2, proving Theorem 1.2. We assume throughout
this section that εU is at least a sufficiently large absolute constant, since otherwise we can increase
Footnote 5: This is because the depth of each tree becomes at most log(εU ) and the error for answering rank queries is at most the depth multiplied by α.
U without affecting our asymptotic space complexity.
In Section 5.8, we will show that each operation takes O(log(1/ε)) amortized
time under mild assumptions.
To start with, we will also assume that we know an upper bound on n (this upper bound will
become n0 ), and that it is sufficiently large (that is, n0 ≥ n∗ , where n∗ is a function of U, ε).
Furthermore, we will initially allow rank queries to have error up to εn0 . We will maintain these
assumptions until Section 5.5, where we will then describe how to dispense with these assumptions.
We now outline how this section will proceed. In Sections 5.1 and 5.2, we describe the data
structure, and how to handle insertions into the data structure, including how to merge layers of
the data structure. In Section 5.3, we bound the error introduced into the data structure with each
merge. Then, in Section 5.4, we describe how to perform rank queries and show a bound of εn0
on the error of a query. We next, in Section 5.5, describe how to make our data structure work
even when n < n∗ , and also improve our bound on error of a query to εt (where t is the size of the
stream so far). In Section 5.6, we pick the numerical parameters of our data structure such that
the claims of the previous section hold. Finally, in Sections 5.7 and 5.8, we analyze the space and
time complexity of our algorithm, respectively.
As mentioned before, our sketch will be formed from recursive applications of the eager q-digest.
We now define the structure of the recursive layers, which we number 0, 1, . . . , k.
The 0-th layer. We start with the top layer (layer 0) and introduce our notation. The top layer
has the same structure as an ordinary optimized eager q-digest forest. We call this underlying forest
T0 . It has universe size U0 = U and error parameter ε0 = ε/8. We would like to emphasize that
in optimized eager q-digest, T0 is a forest with 1/ε0 infinite trees where most nodes have weight 0.
We call these nodes empty.
Whether empty or not, each node in this infinite forest is labeled with an interval. The 1/ε0
roots of the trees are labeled with [1, ε0 U ], [ε0 U + 1, 2ε0 U ], . . . , [(1 − ε0 )U + 1, U ], respectively.
Then, if a node is labeled with interval [a, b], its two children are labeled with [a, (a + b − 1)/2]
and [(a + b + 1)/2, b] respectively. (Since we assumed that ε and U are powers of 2, these are all
integers.) As a special case, if a = b, the node is going to have only one child, labeled [a, a].
In T0 , each node u has a weight W0 [u] that cannot exceed capacity α0 = ε0 n0 /Vlog(ε0 U0 ) + 1W,
where n0 is an upper bound on n. We define the set of full nodes, F0 , as the set of nodes that have
a weight of exactly α0 . Recall from the optimized eager q-digest that F0 is an upward-closed
set of nodes and is therefore itself a forest of at most 1/ε0 trees. (See Section 4 for details). We will
enforce the invariant that every node in the tree T0 is either full or empty. So nodes in F0 are the
only nodes in T0 that we actually use and store. As mentioned before, this allows us to store each
node with only a constant number of bits.
Note that if we were to add new full nodes to this structure, the empty children of full nodes in
F0 , as well as the empty roots of trees, are potentially positions for new nodes. We call these empty
nodes the exposed nodes. Formally, the exposed nodes of T0 is the set of empty nodes that do not
have a full parent. For a concrete example, see the forest T0 in Figure 4.
Intuition: Batch processing of insertions. Let us first jump ahead and sketch the purpose of
having layer i (1 ≤ i ≤ k). Imagine if we insert a new element in the stream. Then, an execution of
the eager q-digest algorithm will increase the weight of one exposed node in V0 to 1 ≪ α0 . However,
our algorithm cannot do the same, because it would break our invariant of having only full nodes in
T0 . Instead, we maintain the exposed nodes V0 with our recursive structure (layers ≥ 1) and insert
the new element into layers ≥ 1. These recursive layers act like a “buffer”; once they accumulate n1
elements, we clear them and compress those elements into new full nodes in T0 .
In general, for layer i (1 ≤ i < k), we group ni+1 insertions in a batch and insert them to layer
≥ i + 1. After each batch, we compress the elements in layer ≥ i + 1 into full nodes in layer i and
clear layer ≥ i + 1. Full details of how we handle insertion will be discussed in Section 5.2.
The i-th layer (1 ≤ i ≤ k). Roughly speaking, the upper part of the layer i structure (which we
call Ti ) resembles an optimized eager q-digest forest whose “universe size” is Ui , which is an
upper bound on |Vi−1 | (when we pick the values of the parameters, we will prove this upper bound
in Claim 5.19). At depth hi := log(εi Ui ), it has exactly |Vi−1 | nodes6 . Each such node u will
correspond to an exposed node v ∈ Vi−1 , in order (from left to right). We call this depth the base
level of Ti . This is the upper part of Ti .
For the interval labeling of the upper part, as each base level node u corresponds to an empty
node v ∈ Vi−1 , naturally, u just inherits the interval label of v. Strictly above the base level, the
interval of each node is the union of the intervals of its base-level descendants.
Now we start to describe the lower part of Ti . Unlike the optimized eager q-digest, we will also
allow Ti to grow beyond the base level. (We give some intuition for this in Remark 5.2, which
readers may skip on the first read.) For each base-level node u that corresponds to v ∈ Vi−1 , we
copy the empty infinite subtree of v in Ti−1 , and put it as the subtree of u in Ti . This also copies the
interval labels on nodes in the subtree. For a concrete example, see the forests T1 , T2 in Figure 4.
Remark 5.1. Because we copied the subtrees from Ti−1 , for any node u below the base level
(including the base level itself), there exists a unique node u′ ∈ Ti−1 corresponding to it. (We will
soon see that u′ is in fact an empty node.)
A node u in Ti has weight Wi [u] and capacity αi = εi ni /Vlog(εi Ui ) + 1W. We again call a node
full when it reaches its capacity. Fi is defined to be the set of all full nodes in Ti . We maintain the
similar invariant as layer 0:
Footnote 6: Note that in an optimized eager q-digest, the base level contains Ui nodes; we just remove the remaining Ui − |Vi−1 | nodes and their descendants, and also any inner nodes with no descendants remaining.
For all 0 ≤ i < k, every node in the forest Ti is either full or empty.7
Note that this invariant means that, for layers i < k, instead of storing the weight map Wi , it
suffices to only store Fi , since the contents of Wi are determined by Fi .
Finally, Vi , the set of exposed nodes of Ti , is defined as the set of empty nodes which do not
have a full parent (for 1 ≤ i < k)8 . Note that there may be some exposed nodes above the base
level. (This results in a subtlety in the interval labels. See Remark 5.3 for details. Readers may
skip it on their first read.)
Remark 5.2. Suppose that we do not allow the tree Ti to grow beyond the base level. Then the
total weight of it can be at most 2αi |Vi−1 |. In other words, layer ≥ i will not be able to handle
more than that many intersections. But it turns out that we will later want to set ni ≫ 2αi |Vi−1 |,
so we have to allow Ti to grow beyond the base level. (More specifically, we want to set ni so that
εi ni ≥ αi−1 , which is essential for Lemma 5.13.)
Remark 5.3. First, for the upper part of Ti , a node labeled with [a, b] may not have children with
evenly split interval labels ([a, (a + b − 1)/2] and [(a + b + 1)/2, b]). This is clear since the labels
of nodes above base level are derived bottom-up by taking the union of intervals at their base-level
descendants. It is, though, tempting to think that for the lower part of Ti (below the base level), all
nodes labeled will have two children with equally split intervals. This, however, is also not always
the case. It is possible that a base-level node u corresponds to an exposed node v ∈ Vi−1 that is
in the upper part of Ti−1 . Then when we copy the subtree of v, those two children will not have
equally split interval labels. For example, this happens in Figure 4, at the node labeled [1, 6] in the
tree T2 . Its two children split into [1, 4] and [5, 6], while an even split would be [1, 3] and [4, 6].
Remark 5.4. In order to avoid interrupting the flow of the paper, we will defer the precise defini-
tions of the parameters k, ni , Ui , εi until Section 5.6. However, so that the reader can have a sense
of the scale of each of these parameters, we will give approximate values now that can be used as
guidelines. All the parameters except k will be powers of 2, to avoid divisibility issues. We pick the
following rough values:
• The number of layers will be k + 1 ≈ log∗ (εU ).
• The approximation parameters εi are all very close to ε, and can be thought of as essentially
equal to ε.
• The Ui will satisfy the approximate recursion εUi+1 ≈ log(εUi ), so by the last level we will
have Uk ≈ 1/ε.
• The batch sizes ni will shrink very slowly (only by polylogarithmic factors in εU ), so they can
all be thought of as roughly n, though decreasing.
◦ In particular, even the last batch size nk is almost n in this sense, so one can think of
the algorithm as spending most of its time at layer k, with a “universe” of size O(1/ε).
• Similarly, the capacities αi are also all approximately εn, though also decreasing in i.
Footnote 7: Note when i = k, since there are no further recursive layers, we do not require the invariant for it. Insertions to Tk are simply handled as in a normal optimized eager q-digest. (See Section 5.2 for more details.)
Footnote 8: For i = k, we define Vk instead to be the set of non-full nodes without a full parent.
[Figure 4: The structure of different layers. Here ε = 0.5, so there are 1/ε = 2 trees in each layer. The nodes below the base level of each layer are marked as gray. Note that when we construct Ti , we take all the exposed nodes in Ti−1 and use them as the base-level nodes to build 1/ε trees. Then we copy their subtrees in Ti−1 to be their subtrees in Ti .]
Insertions. Recall that in Section 5.1, we only require our invariant to hold for layers i ̸= k; layer k is maintained as a normal optimized eager q-digest. For any insertion x, we first insert it into layer k as we would in a normal optimized eager q-digest. In other words, we find the exposed node
in Tk whose interval contains x and increase its weight, Wk [v], by 1. This node always exists due
to the following observation.
Observation 5.5. For all layers 1 ≤ i ≤ k, the intervals of the exposed nodes Vi are always disjoint
and cover the entire universe [1, U ].
Then for i = k, k − 1, . . . , 1, we check if the total number of elements inserted so far, denoted by
t, is a multiple of ni . If so, we need to compress layers ≥ i into full nodes in layer i − 1. Specifically,
we will choose these ni ’s so that ni is always a multiple of ni+1 for all i (we prove this in Fact 5.20(c)). Therefore if t is a multiple of ni , layers ≥ i + 1 have already been compressed into full nodes of
layer i. We will only need to compress layer i into full nodes in layer i − 1 and merge them into
Ti−1 . We call this procedure Merge(i) and will describe it next. The pseudocode for the insertion
procedure as a whole is summarized below in Algorithm 1.
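A hedged Python sketch of this insertion procedure (the paper’s Algorithm 1 is not reproduced in this extract); insert_layer_k and merge are stand-ins for the layer-k eager q-digest insertion and for the three-step Merge(i) described next.

```python
class InsertionDriver:
    def __init__(self, k, batch_sizes, insert_layer_k, merge):
        self.k = k
        self.n = batch_sizes          # batch_sizes[i] = n_i for i = 1, ..., k
        self.insert_layer_k = insert_layer_k
        self.merge = merge            # merge(i) compresses layer i into layer i-1
        self.t = 0                    # total number of elements inserted so far

    def insert(self, x):
        self.t += 1
        self.insert_layer_k(x)
        # For i = k, k-1, ..., 1: if t is a multiple of n_i, compress layer i
        # into full nodes of layer i-1 (n_i is a multiple of n_{i+1}, so the
        # deeper layers have already been merged by the earlier iterations).
        for i in range(self.k, 0, -1):
            if self.t % self.n[i] == 0:
                self.merge(i)
```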
Next, we explain how Merge(i) compresses layer i into full nodes in layer i − 1. We follow a
delicate three-step strategy. On a high level, it is carefully designed so that we incur an error (which
is defined formally later in Section 5.3) of at most hi · αi + αi+1 from the compression. (Recall that
hi := log(εi Ui ) is the depth of the base level in Ti .) This is important to our analysis.
Merge Step 1: Move the weight into Ti−1 . In the first step, we move all the weight in Ti into
empty nodes in Ti−1 . There are two cases:
• For every node u with weight below the base level (including the base level itself) in Ti , there
is a unique empty node u′ in Ti−1 corresponding to it. (See Remark 5.1.) We move all the
weights for u into that of u′ . Formally, we just increase the weight Wi−1 [u′ ] by Wi [u].
• For every node u strictly above the base level of Ti , there is no node in Ti−1 that directly
corresponds to it. Instead, we will take an arbitrary descendant v ∈ Ti of it at the base level.
As v corresponds to an (exposed) empty node v ′ ∈ Ti−1 , we will move the weight of u there.
Formally, we increase the weight Wi−1 [v ′ ] by Wi [u].
This is summarized in Algorithm 2. We defer the error analysis of this step to later in this
section. Before we proceed, let us state a simple property about this step.
Observation 5.6. We will choose the parameters so that αi · hi ≤ αi−1 (this will be shown in
Fact 5.20(d)). (Recall that hi := log(εi Ui ) is the depth of the base level in Ti .) Thus, after this
step, all nodes in Ti−1 still have weight at most αi−1 .
Therefore, this step does not exceed the capacity of nodes in Ti−1 . But it does create a number
of non-full nodes: It merges Ti into Ti−1 while breaking our invariant of having only full or empty
nodes in Ti−1 . So the purpose of Steps 2 and 3 is exactly to restore this invariant.
Merge Step 2: Compressing into full nodes. Naturally, given the non-full nodes in Ti−1 , we
want to first perform a compression step similar to q-digest: Whenever a node v ∈ Ti−1 has a parent
that is not full, we move weight from v to parent(v).
Let Fi−1 be the set of full nodes after this step. We call the nodes that are neither full nor
empty partial nodes. All the partial nodes are now either non-full children of full nodes in Fi−1 or
a partially-full root. Importantly, we have the following observation.
Observation 5.7. After this step, the interval labels of the partial nodes are all disjoint.
This is because no partial node can be an ancestor of another. These partial nodes are the
leftovers that we will round up in Step 3.
Merge Step 3: Round up the leftovers. As the interval labels of these leftover partial nodes
are disjoint by Observation 5.7, we can sort these nodes by their interval. Then, roughly speaking,
we are going to take the (offline) quantile sketch of these nodes as the result for rounding.
More formally, suppose there are ℓ partial nodes. After sorting, these nodes are v1 , v2 , . . . , vℓ .
Suppose each partial node vj is labeled [aj , bj ]. We will have a1 ≤ b1 < a2 ≤ b2 < · · · < aℓ ≤ bℓ .
Footnote 9: When i = k, this will actually be all u such that Wk [u] is nonzero, rather than just all full nodes.
Footnote 11: We are keeping the algorithm description simple by moving weights one unit at a time. In an actual implementation, one should of course move the maximum amount possible at each time.
Let r = (1/αi−1 ) Σ_{j=1}^{ℓ} Wi−1 [vj ] be the number of full nodes that we are expected to round up to.12 For every m ∈ [r], we find the first qm ∈ [ℓ] such that Σ_{j=1}^{qm} Wi−1 [vj ] ≥ m · αi−1 . These vq1 , vq2 , . . . , vqr are the “quantiles” of these sorted partial nodes.
Then we set the weight of all vqm ’s (for all m ∈ [r]) to αi−1 and the weight of all other vj ’s to
zero. Note these vqm ’s must be distinct since by Observation 5.6, any node has weight at most αi−1 .
This rounds up the partial nodes into r many full nodes and finishes this step. An implementation
of this procedure is given below in Algorithm 4.
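A Python sketch of this rounding step (ours, in place of the paper’s Algorithm 4); `partial` lists the partial-node weights in left-to-right interval order and `alpha` is the capacity αi−1.

```python
def round_up_leftovers(partial, alpha):
    """Round the partial-node weights into r full nodes of weight alpha each."""
    total = sum(partial)
    r = total // alpha                        # number of full nodes to create
    new_weights = [0] * len(partial)
    prefix, j = 0, 0
    for m in range(1, r + 1):
        # q_m: first index whose prefix sum reaches m * alpha
        while prefix + partial[j] < m * alpha:
            prefix += partial[j]
            j += 1
        new_weights[j] = alpha
    return new_weights

# Example with alpha = 4: weights [3, 2, 1, 3, 2] (total 11) are rounded to
# two full nodes at the "quantile" positions q_1 and q_2.
print(round_up_leftovers([3, 2, 1, 3, 2], alpha=4))  # [0, 4, 0, 4, 0]
```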
Conclusion. Finally, our merging operation is implemented by performing these three steps se-
quentially.
Before we analyze each step of Merge(i), let us first define the error metric.
Consistency and Discrepancy. First, we define the notion of consistency between our layer i
sketch Ti and a stream of elements π. Intuitively, this describes what layer i should look like upon
receiving stream π if the merge had not introduced any error.
Definition 5.8 (Consistency). We say that a stream π is consistent with a subset of nodes S ⊆ Ti
if and only if there exists a map f that maps {1, 2, . . . , |π|} to S satisfying the following.
1. For every node v ∈ S, exactly Wi [v] of the indices j ∈ {1, 2, . . . , |π|} are mapped to v by f .

2. For every 1 ≤ j ≤ |π|, the interval label of node f (j) contains πj .
Then we define the discrepancy between Ti and the stream π. This quantifies the amount of
additional error we have.
Definition 5.9 (Discrepancy). We define the discrepancy between a stream π and a subset of nodes
S ⊆ Ti as
disc(π, S) := min_{π′ consistent with S} d(π, π ′ ).
Analysis of Step 1. Now, we show that Step 1 increases the discrepancy by at most εi ni .
Lemma 5.10 (Step 1). Let Ti be the layer-i sketch before Algorithm 2 (Step 1). Also, let S be the
set of originally empty nodes in Ti−1 whose weight increases during Algorithm 2.
For any stream π, we have
disc(π, S) ≤ disc(π, Ti ) + εi ni .
Proof. Let π ∗ := arg min_{π′ consistent with Ti} d(π, π ′ ) and let f ∗ be the corresponding consistency map from π ∗ to Ti . We will construct a stream π ′ and a mapping f ′ such that π ′ is consistent with S with mapping f ′ and d(π ∗ , π ′ ) ≤ εi ni . This finishes the proof because the distance we defined satisfies the triangle inequality, so d(π, π ′ ) ≤ d(π, π ∗ ) + d(π ∗ , π ′ ) ≤ d(π, π ∗ ) + εi ni .
For any element πj∗ (1 ≤ j ≤ |π ∗ |) there are two cases:
1. If f ∗ (j) = u for a node u below the base level of Ti , let u′ ∈ Ti−1 be the corresponding node
(as in Line 9, Algorithm 2). We let πj′ = πj∗ and set f ′ (j) = u′ .
2. If f ∗ (j) = u is a node u strictly above the base level of Ti , let v ∈ Ti be its descendant at the
base level and v ′ ∈ Ti−1 be the corresponding exposed node (as in Line 5, Algorithm 2). We
select an arbitrary element y in the interval of v (which is equal to that of v ′ ), and let πj′ = y.
Then we set f ′ (j) = v ′ .
From this construction, it is clear that π ′ is consistent with S under f ′ . To upper bound d(π ∗ , π ′ ), consider any query x ∈ [1, U ]: the difference between the rank of x in π ∗ and in π ′ is bounded by the number of j’s such that x lies strictly between πj∗ and πj′ .
As πj∗ ̸= πj′ , this can only happen in Case 2. Moreover, as πj∗ was initially in the interval of u, and πj′ is in the interval of v (which is contained in that of u), we know that x must also be in the interval of u. Since there are at most hi such nodes u strictly above the base level of Ti , and each is mapped to at most αi times, we have at most hi αi many such j’s. We will choose the parameters in Section 5.6 so that hi αi ≤ εi ni (this will follow from (3)). This proves d(π ∗ , π ′ ) ≤ εi ni .
Then we need to argue that when S is merged with the original nodes in Ti−1 , their discrep-
ancies at most add up. This follows from the following observation, which is a consequence of
Observation 2.2:
Observation 5.11. For two disjoint sets of nodes S, T and any two streams π1 and π2 , we have
disc(π1 ◦ π2 , S ∪ T ) ≤ disc(π1 , S) + disc(π2 , T ),
where ◦ means concatenating two streams.
Analysis of Step 2. It is not hard to see that Step 2 never increases discrepancy.
Lemma 5.12 (Step 2). For any stream π that is consistent with Ti−1 , after we perform Algorithm 3
on Ti−1 , π is still consistent with Ti−1 . This implies that for any stream π, disc(π, Ti−1 ) is always
nonincreasing after performing Algorithm 3 on Ti−1 .
Proof. We prove this for each operation we perform. Whenever we move one unit of weight from v
to parent(v), we pick an arbitrary 1 ≤ j ≤ |π| such that f (j) = v and let f (j) ← parent(v). Since
the interval of parent(v) contains that of v, the consistency map remains valid.
Analysis of Step 3. Finally, we show that the rounding in Step 3 only increases the discrepancy
by αi−1 = εi ni .
Lemma 5.13 (Step 3). For any stream π, whenever we perform Step 3 (Algorithm 4) to Ti−1 in
our algorithm, the discrepancy disc(π, Ti−1 ) increases by at most αi−1 (which is equal to εi ni ).
Proof. First, we only perform Algorithm 4 after Algorithm 3. So, by Observation 5.7, all the partial
nodes have disjoint intervals before Algorithm 4.
Before the algorithm starts, let v1 , v2 , . . . , vℓ be the partial nodes of Ti−1 in sorted order, and let r = (1/αi−1 ) Σ_{j=1}^{ℓ} Wi−1 [vj ]. Suppose [a1 , b1 ], [a2 , b2 ], . . . , [aℓ , bℓ ] are their disjoint interval labels. Let π ∗ = arg min_{π′ consistent with Ti−1} d(π, π ′ ) and let f ∗ be the corresponding consistency map. For every m ∈ [r], let vqm be the first node such that Σ_{j=1}^{qm} Wi−1 [vj ] ≥ m · αi−1 . As discussed in Section 5.2, these vq1 , vq2 , . . . , vqr are all distinct.
After the algorithm, all partial nodes become empty, except that vq1 , vq2 , . . . , vqr become full
nodes with weight αi−1 . We let q0 = 0. For all m ∈ [r], we do the following to construct stream π ′
and its consistency map f ′ (with Ti−1 after the algorithm):
• For all nodes vs with qm−1 < s < qm and all j ∈ {1, 2, . . . , |π ∗ |} such that f ∗ (j) = vs , we set
πj′ ← aqm and f ′ (j) ← vqm .
• For the node vq_{m−1} , we take Σ_{s=1}^{q_{m−1}} Wi−1 [vs ] − (m − 1) · αi−1 many j’s such that f ∗ (j) = vq_{m−1} and set πj′ ← aqm and f ′ (j) ← vqm .

• For the node vqm , we take m · αi−1 − Σ_{s=1}^{qm −1} Wi−1 [vs ] many j’s such that f ∗ (j) = vqm and set πj′ ← πj∗ and f ′ (j) ← f ∗ (j) = vqm .
Now we prove that d(π ∗ , π ′ ) ≤ αi−1 , which by our choice of parameters in Section 5.6, will be at most
εi ni . This will end the proof of this lemma by the triangle inequality d(π, π ′ ) ≤ d(π, π ∗ )+d(π ∗ , π ′ ) ≤
d(π, π ∗ ) + εi ni .
For any query x, its rank in π ∗ and π ′ differs by at most the number of j’s such that x is strictly
between πj∗ and πj′ . As πj∗ ̸= πj′ , this only happens in the first two cases. Suppose f ′ (j) = vqm . This
implies πj′ = aqm . Then f ∗ (j) must be a node vs with qm−1 ≤ s < qm , and πj∗ ≥ aqm−1 .
This implies x is in the interval [aqm−1 , aqm ). Thus there is a unique m for each query x, and by
our construction, there can be at most αi−1 many j’s that are mapped to vqm by f ′ . This proves
that d(π ∗ , π ′ ) ≤ αi−1 .
Lemma 5.14. For every 0 ≤ i ≤ k, let π be a batch of ni consecutive stream elements ending at a time that is a multiple of ni . Then, after the last element of the batch has been fully processed (including any merges it triggers), we have disc(π, Ti ) ≤ 2γi+1 · ni , where γi+1 := εi+1 + εi+2 + · · · + εk .

Proof. We proceed by induction. In the base case where i = k, the layer-k structure Tk is always
consistent with the partial stream π by construction. Suppose that this holds for i + 1. We split
the stream π into its batches π = π (1) ◦ π (2) ◦ · · · ◦ π (ni /ni+1 ) where each π (j) has length ni+1 . For
the ease of notation, we define π (1...j) = π (1) ◦ π (2) ◦ · · · ◦ π (j) .
By the induction hypothesis, we know that after receiving each π (j) but immediately before we
perform Merge(i + 1), we have disc(π (j) , Ti+1 ) ≤ 2γi+2 · ni+1 .
Then let us look at the process of Merge(i + 1) and do another layer of induction. The
induction hypothesis is that immediately after receiving π (j) and performing Merge(i + 1), we have
disc(π (1...j) , Ti ) ≤ 2(γi+2 + εi+1 ) · j · ni+1 . When j = ni /ni+1 , this is simply disc(π, Ti ) ≤ 2(γi+2 +
εi+1 ) · ni = 2γi+1 · ni and proves the outer induction.
In the base case, Ti is empty, and we have disc(∅, Ti ) = 0. Suppose for j − 1, our induction
hypothesis holds.
• It first performs Merge(i + 1) which, by Lemma 5.10, adds a set S of new non-empty nodes
to Ti with disc(π (j) , S) ≤ disc(π (j) , Ti+1 ) + εi+1 · ni+1 ≤ (2γi+2 + εi+1 ) · ni+1 . Then by
Observation 5.11, after this step, we have disc(π (1...j) , Ti ) ≤ (2γi+2 · j + εi+1 · (2j − 1)) · ni+1 .
• Then it performs Compress(i) which, by Lemma 5.12, does not increase the discrepancy.
• Finally, it performs Round(i) which, by Lemma 5.13, increases the discrepancy by at most
εi+1 · ni+1 and results in disc(π (1...j) , Ti ) ≤ 2(γi+2 + εi+1 ) · j · ni+1 .
This finishes the inner induction and the proof of this lemma.
The inner induction in the proof above actually proves the natural corollary below.
Corollary 5.15. Let π be the partial stream that arrives at time [s · ni + 1, t] for some integer s
and t such that ni+1 | t and t ≤ (s + 1) · ni . After the t-th insertion and immediately after Merge(i + 1) returns, we have
disc(π, Ti ) ≤ 2γi+1 · |π|
where
γi+1 = εi+1 + εi+2 + · · · + εk .
5.4 Answering queries
To answer a rank query, we simply add up the weights of all the nodes whose interval contains any
element that is at most x, as shown in Algorithm 6.
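A Python sketch of this rank query (ours, in place of the paper’s Algorithm 6); `layers` is a list, one entry per layer, of dictionaries mapping a node’s interval (lo, hi) to its weight.

```python
def rank_query(layers, x):
    """Sum, over every layer, the weights of all nodes whose interval
    contains at least one element <= x, i.e. all nodes with lo <= x."""
    return sum(w
               for weight in layers
               for (lo, hi), w in weight.items()
               if lo <= x)
```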
First, we bound the total weight of nodes v which could cause over-counting. To this end, we
say that a node is bad if its interval contains x and, furthermore, its interval is not the length-1
interval containing only x. Then, we show the following.
Proposition 5.16. The total weight of all bad nodes, across all layers, is at most γ0 n0 .
Proof. Let wi denote the total weight of all bad nodes in Ti (for i < k, this is just αi times the
number of full bad nodes in layer i). Moreover, let ci denote the total capacity of all bad nodes in
layer i, even the empty ones13 (this is αi times the total number of bad nodes in layer i).
We will prove the following statement for 0 ≤ i ≤ k by induction:
w0 + · · · + wi−1 + ci ≤ ε0 n0 + · · · + εi ni . (2)
For the base case i = 0, there are h0 bad nodes in layer 0 (namely, the strict ancestors of the node
in the base level which corresponds to x). Therefore we have c0 = h0 α0 ≤ ε0 n0 .
Now, assume that (2) holds for i − 1 (where 1 ≤ i ≤ k); we will show that it also holds for
i. Consider the quantity ci , the total capacity of bad nodes in layer i. Above the base level of Ti ,
at most one node in each level is bad (since the intervals in a level are disjoint). Thus, the total
contribution from these nodes to ci is at most hi αi ≤ εi ni .
On the other hand, each bad node in Ti which is at or below the base level corresponds to an
empty bad node in Ti−1 . Note that the total capacity of empty bad nodes in Ti−1 is just ci−1 − wi−1 .
Moreover, since αi ≤ αi−1 (by Fact 5.20(d)), the capacity of each node at or below the base level of
Ti is at most the capacity of the corresponding empty bad node in Ti−1 . Thus, the total capacity
of bad nodes in Ti which are at or below the base level is at most ci−1 − wi−1 . Therefore, the total
capacity ci of all bad nodes in Ti satisfies
ci ≤ εi ni + ci−1 − wi−1 ,
and combining this with the induction hypothesis (2) for i − 1 yields (2) for i.
Having proven (2), it remains to complete the proof of Proposition 5.16. Indeed, setting i = k
in (2) and using the fact that ck ≥ wk , the total weight of all bad nodes is at most
ε0 n0 + · · · + εk nk ≤ (ε0 + · · · + εk )n0 = γ0 n0 ,
as desired.
Proposition 5.17. At any time t, suppose that π is the stream received so far. Then there exists
a decomposition π = π0 ◦ π1 ◦ · · · ◦ πk such that
Σ_{i=0}^{k} disc(πi , Ti ) ≤ 2γ1 t.
Proof. Let π0 be the first ⌊t/n1 ⌋ · n1 elements of π, π1 be the next ⌊t/n2 ⌋ · n2 − |π0 | elements, π2 be
the next ⌊t/n3 ⌋ · n3 − |π0 ◦ π1 | elements, and so on. In general, πi is the next ⌊t/ni+1 ⌋ · ni+1 − |π0 ◦
π1 ◦ · · · ◦ πi−1 | elements in π after those in πi−1 . Specifically, we let nk+1 = 1.
By Corollary 5.15, we know that disc(πi , Ti ) ≤ 2γi+1 · |πi |. Thus,
Σ_{i=0}^{k} disc(πi , Ti ) ≤ 2γ1 |π0 | + 2γ2 |π1 | + · · · + 2γk |πk |
≤ 2γ1 (|π0 | + |π1 | + · · · + |πk |)
= 2γ1 t,
so we are done.
Proposition 5.18. Let π be the stream received so far at time t. Then, the answer to a rank query,
as performed by Algorithm 6, for any element x differs from rankπ (x) by at most γ0 n0 + 2γ1 t.
Proof. For each i, let πi′ be a stream consistent with Ti such that d(πi , πi′ ) = disc(πi , Ti ), and let
π ′ = π0′ ◦ π1′ ◦ · · · ◦ πk′ . By the triangle inequality (Observation 2.1), we know that
d(π, π ′ ) ≤ 2γ1 t. Since we answered the query by counting the total weight of nodes whose in-
tervals include any element which is at most x, the quantity obtained is at least rankπ′ (x), and
may overcount at nodes whose interval also contains an element larger than x. However, note
that any such node must be bad, so the total amount by which the algorithm overcounts is at
most γ0 n0 by Proposition 5.16. Thus, the output of the algorithm differs from rankπ′ (x) by at
most γ0 n0 . Furthermore, by Proposition 5.17 (and the definition of distance of streams), we have
|rankπ (x) − rankπ′ (x)| ≤ 2γ1 t, so the conclusion follows.
Now, since γ0 , γ1 ≤ ε/4 (by Fact 5.20(f)), this already means that the error of a rank query is
at most εn0 . However, so far we have still assumed that we know n in advance; moreover, we would
actually like the error to be at most εt, where t is the total number of elements received so far. In
Section 5.5, we will explain how to rectify this.
5.5 Removing assumptions about n
In this section, we will describe how to dispense with the assumption that we know n, as well as
the assumption that n ≥ n∗ . We will also prove that the error of any query is at most εt.
Unknown n. First, we describe how to maintain the data structure when we don’t know n in
advance, but still assuming that all queries happen after t ≥ n∗ . At the start of the algorithm, we
initialize the data structure with n0 = n∗ . Then, whenever t, the number of elements so far in the
stream, reaches n0 , we double n0 (which has the effect of doubling ni and αi for all i). Note that
when t = n0 , only layer 0 exists, so we only need to describe how to update layer 0. Every node
in layer 0 is now half-full instead of being full; that is, the weight of every node in F0 is now α0 /2.
Then, we just perform the push-up and rounding, as described in Algorithms 3 and 4, to layer 0.
The pseudocode of this procedure is given in Algorithm 7, and it is called in Line 13 of Algorithm 1.
By Lemma 5.13, this has the effect of changing the stream represented by layer 0 by a distance of
at most α0 = ε0 n0 /Vlog(ε0 U0 ) + 1W ≤ εt/16 (since we assumed that εU is sufficiently large). Then,
at any point in the stream, the total amount the represented stream has been changed by these
rounding operations is at most εn0 /16 + εn0 /32 + · · · ≤ εn0 /8. Therefore, the bound on distance
between π and π ′ in Proposition 5.17 is increased by at most εn0 /8 after adding the doubling step to the
algorithm.
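The following sketch illustrates the doubling step just described. The `state` object, its field names, and the `compress_fn`/`round_fn` callables (standing in for Algorithms 3 and 4) are assumptions made for this example, not the paper's actual code.

```python
def double_n0(state, compress_fn, round_fn):
    """Schematic of the doubling step (cf. Algorithm 7).  `state` is a
    hypothetical object holding n0, the capacities alpha[i], and the layer-0
    weights; `compress_fn` and `round_fn` stand in for Algorithms 3 and 4."""
    state.n0 *= 2                                # doubling n0 doubles every n_i ...
    state.alpha = [2 * a for a in state.alpha]   # ... and every capacity alpha_i
    # Every previously full layer-0 node now holds only alpha[0] / 2 weight,
    # i.e. it is half-full; pushing the weight up and rounding restore the
    # invariant that every layer-0 node is either full or empty.
    compress_fn(state, layer=0)
    round_fn(state, layer=0)
```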
Therefore, after this modification to the algorithm, the proof of Proposition 5.18 now gives a
bound of γ0 n0 + 2γ1 t + εn0 /8. Since n0 ≤ 2t (since we assumed that t ≥ n∗ ), and γ0 ≤ ε/4 and
γ1 ≤ ε/8 (by Fact 5.20(f)), we have
γ0 n0 + 2γ1 t + εn0 /8 ≤ εt.
In conclusion, for any t ≥ n∗ , the additive error of any rank query after t elements of the stream is
at most εt, as desired. It remains, then, to handle the cases where t < n∗ .
Dealing with 1/ε ≤ t < n∗ . Next, we describe how to modify the algorithm to still be able to
answer queries when 1/ε ≤ t < n∗ . Firstly, we still store the original data structure, since we will
need to use it after t exceeds n∗ . However, in addition, we create a new instantiation of the data
structure (with the same parameters), where upon receiving an element of the stream, instead of
inserting it once, we insert the same element εn∗ times (by Fact 5.20(i), this is an integer). Then, as
long as t ≥ 1/ε, we will have inserted at least n∗ elements into this alternate data structure, so by
the previous section, it will be able to answer rank queries on the repeated stream with additive error
at most ε times its length; dividing the reported ranks by εn∗ then gives ranks for the original stream
with additive error at most εt, as desired.
Of course, the effective value of t will have increased by a factor of εn∗ , which will have ramifications
for the space complexity. However, we will show in Section 5.7 that the space complexity is still
what we want it to be.
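A minimal sketch of this repetition trick, assuming a generic sketch interface with insert/rank methods (the factory and method names are hypothetical):

```python
class SmallStreamWrapper:
    """Illustrative wrapper for the regime 1/eps <= t < n_star: each arriving
    element is inserted eps * n_star times into a second copy of the sketch.
    `sketch_factory` produces an object with insert/rank methods; these are
    assumed interfaces, not the paper's actual code."""

    def __init__(self, eps, n_star, sketch_factory):
        self.repeat = int(eps * n_star)   # an integer by Fact 5.20(i)
        self.inner = sketch_factory()

    def insert(self, x):
        for _ in range(self.repeat):
            self.inner.insert(x)

    def rank(self, x):
        # Ranks in the repeated stream are scaled by the repetition factor;
        # dividing brings the additive error down to eps * t for the original stream.
        return self.inner.rank(x) / self.repeat
```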
Dealing with t < 1/ε. Finally, while t < 1/ε, we will just store all the elements of the stream so
far explicitly (in addition to keeping the data structures of the previous two sections). We will show
in Section 5.7 that this can actually be done using O(ε−1 log(εU )) space. Obviously, if we store all
the elements of the stream, rank queries can be answered exactly.
5.6 Choosing the parameters
We will now choose values for the parameters of the algorithm (k, ni , Ui , and εi ) and verify that
they satisfy some necessary properties.
First, note that we may assume that n, U , and ε are all powers of 2 (by rounding n and U up
and ε down to the nearest power of 2, costing at most a constant factor). Indeed, we will ensure
that ni , Ui , εi , and αi are always powers of 2, in order to stave off divisibility issues.
We then pick the following values. Let k = log∗ (εU ). As described in Section 5.1, let U0 = U .
Let n0 be an upper bound on t, the number of elements so far in the stream. As previously described,
we will imagine for now that we know n in advance and that n0 = n. Also, we assume, as we may,
that n0 is a power of 2. We then pick εi as follows:
εi = ε/8 for i = 0, and εi = ε/2^{k−i+4} for i ≥ 1.
Also, define
γi = εi + εi+1 + · · · + εk .
Also, recall from Section 5.1 that for all i, we define the capacities αi based on εi , ni , and Ui as
follows:
αi = εi ni /Vlog(εi Ui ) + 1W. (3)
Now, we define the parameters ni and Ui for layer i + 1 recursively (for i < k) as follows:
Ui+1 = 2Vlog(εi Ui ) + 1W/εi ≥ (1 + Vlog(εi Ui ) + 1W)/εi = 1/εi + ni /αi , (4)
ni+1 = αi /εi+1 = εi ni /(εi+1 Vlog(εi Ui ) + 1W). (5)
We let hi be the depth of the base layer in Ti :
hi = log(εi Ui ).
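For concreteness, here is a small Python sketch that evaluates these recurrences, assuming the bracket notation V·W denotes rounding up to the next power of two and that ε, U, and n are powers of two as arranged above; this is only an illustration of equations (3)–(5), not the paper's code.

```python
import math

def log_star(x):
    """Iterated logarithm: number of times log2 must be applied to reach <= 1."""
    count = 0
    while x > 1:
        x = math.log2(x)
        count += 1
    return count

def round_up_pow2(x):
    """Assumed reading of the paper's bracket V.W: smallest power of 2 >= x."""
    return 1 << max(0, math.ceil(math.log2(x)))

def choose_parameters(eps, U, n):
    """Evaluate equations (3)-(5): eps_i, n_i, U_i, alpha_i, h_i for i = 0..k.
    Assumes eps*U is large enough that every eps_i * U_i stays at least 2."""
    k = log_star(eps * U)
    eps_i = [eps / 8] + [eps / 2 ** (k - i + 4) for i in range(1, k + 1)]
    n_i, U_i, alpha_i, h_i = [n], [U], [], []
    for i in range(k + 1):
        h_i.append(math.log2(eps_i[i] * U_i[i]))
        alpha_i.append(eps_i[i] * n_i[i] / round_up_pow2(h_i[i] + 1))   # equation (3)
        if i < k:
            U_i.append(2 * round_up_pow2(h_i[i] + 1) / eps_i[i])        # equation (4)
            n_i.append(alpha_i[i] / eps_i[i + 1])                        # equation (5)
    return eps_i, n_i, U_i, alpha_i, h_i
```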
Now we will check some properties of these parameters which we will need. First, we will show
the important property of Ui : that it is an upper bound on the number of exposed nodes in the
previous layer.
Claim 5.19. For all 0 ≤ i < k, we have Ui+1 ≥ |Vi | (recall that Vi is the set of exposed nodes in
layer i).
Proof. The number of full nodes in layer i is at most ni /αi (since full nodes have weight αi ). If
there are no full nodes, then we would have |Vi | = 1/εi , since Vi would just be the set of all the
roots of trees in Ti . Now, imagine building up the set of full nodes by adding them one at a time
(from bottom to top). Each time we add a full node, we remove one exposed node, and add back
at most two exposed nodes. Thus, the total number of exposed nodes after this process is at most
1/εi + ni /αi , which is indeed at most Ui+1 by (4).
Now, we will prove various other properties of the parameters which we will need throughout.
We state all these properties now, but we will defer their proof to Appendix A, since they mostly
just involve manipulation of the definitions of the parameters.
5.7 Space complexity
Now we discuss the space complexity of the algorithm. All space complexities in this section will
be in bits, not words. There are two primary things to check: the space taken by the sketch itself,
and the space required during a merge step after an insertion.
Space of sketch. The information stored by the algorithm consists only of the full nodes Fi for
layers 0 ≤ i < k and the weights Wk for layer k. (Note that we don’t need to store Ti since it is
determined recursively by Ti−1 and Fi−1 .)
Each Fi is an upward-closed subset of Ti . In each of the 1/εi trees that comprise Ti , the portion
of Fi in that tree (if nonempty) is a connected subgraph including the root. Thus, that portion
of Fi is uniquely determined by the topology of the (rooted) tree that it forms. We can store the
topology of an ℓ-vertex tree using O(ℓ) bits (by storing the bracket
representation of the tree). The total number of full nodes in Fi is at most ni /αi at any time, so
this means that the total space to store Fi is O(1/εi + ni /αi ), which is just O(Ui+1 ) by (4). Thus,
the total space to store all the Fi is O(U1 + · · · + Uk ), which is O(ε−1 log(εU )) by Fact 5.20(h).
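As an illustration of the bracket (balanced-parentheses) encoding of a tree topology, here is a minimal sketch; the dictionary-based tree type and the ordering of children are assumptions made for the example.

```python
def encode_topology(tree, root):
    """Encode the topology of a rooted tree as a balanced-parentheses string,
    using 2 symbols (O(1) bits) per vertex.  `tree` is assumed to map each
    node to the list of its children, in left-to-right order."""
    out = []

    def visit(v):
        out.append("(")
        for child in tree.get(v, []):
            visit(child)
        out.append(")")

    visit(root)
    return "".join(out)

# Example: a root with children a and b, where a has one child, encodes to
# "((())())" -- 8 symbols for 4 vertices.
```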
Now, it remains to check the space required to store Wk . First, the keys of Wk also form an
upward-closed subset of Tk . This subset consists of full and partial nodes; by the same argument,
there are at most nk /αk = O(1/ε) full nodes. Every partial node is either a root (of which there
are O(1/εk ) = O(1/ε)) or a child of a full node, so there are also at most O(1/ε) partial nodes.
Therefore, as with the Fi , the space required to store the set of all nonempty nodes is at most
O(1/εk + 1/ε) = O(1/ε).
After the set of nonempty nodes has been stored, we just need to store their weights14 in some
order (say pre-order of the trees). The weights are all at most αk = n0 /n∗ , and there are O(1/ε) of
them, so the space required to store all the weights is at most O(ε−1 log(n0 /n∗ )). Since n∗ ≥ 1/ε
(by Fact 5.20(i)) and n0 ≤ max(2n, n∗ ) at all times, we have O(ε−1 log(n0 /n∗ )) ≤ O(ε−1 log(εn)).
Putting everything together, the total space complexity of the data structure is at most
O(ε−1 (log(εU ) + log(εn)))
bits, as desired.
Space of sketch while t is small. Recall that in Section 5.5, we made two modifications
to the data structure that lasted while t < n∗ and t < 1/ε. We will show now that (asymptotically)
they don’t require any extra space.
First, while t < n∗ , we maintained a second data structure identical to the first, except that we
repeated each element εn∗ times. For this data structure, the space analysis that we just performed
still holds, except that n0 may now be up to 2εn∗ t. The space to store the Fi is unchanged. The
space required to store the weights is now at most O(ε−1 log(n0 /n∗ )) ≤ O(ε−1 log(εt)), which is still
at most O(ε−1 log(εn)), as desired.
Finally, for t < 1/ε, we stored all the elements of the stream explicitly. Naively, storing these
as an ordered list would take O(ε−1 log U ) space, but actually, since the set is unordered, we can
improve this. Indeed, split the universe [1, U ] into 1/ε buckets of size εU (based on the log ε−1 most
significant bits). Then, for each bucket, store an ordered list of the log(εU ) least significant bits of
every stream element in that bucket. Storing such an ordered list of length ℓ takes O(1 + ℓ log(εU ))
space, so the total space taken is at most O(1/ε + t log(εU )) ≤ O(ε−1 log(εU )), which is at most a
constant multiple of the desired space.
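A minimal sketch of this bucketed encoding, assuming εU and 1/ε are powers of two (the bucket index is given by the high bits and the stored value by the low bits):

```python
def bucket_encode(elements, eps, U):
    """Encode at most 1/eps stream elements from [1, U]: bucket by the top
    log2(1/eps) bits and keep only the low log2(eps * U) bits per bucket.
    Assumes eps * U and 1/eps are powers of two (illustrative only)."""
    bucket_size = int(eps * U)
    buckets = [[] for _ in range(U // bucket_size)]
    for x in elements:
        b, low = divmod(x - 1, bucket_size)   # elements are 1-indexed
        buckets[b].append(low)
    for b in buckets:
        b.sort()
    return buckets

def bucket_rank(buckets, x, eps, U):
    """Exact rank of x: number of stored elements that are <= x."""
    bucket_size = int(eps * U)
    b, low = divmod(x - 1, bucket_size)
    return sum(len(buckets[i]) for i in range(b)) + \
           sum(1 for v in buckets[b] if v <= low)
```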
This completes the discussion of the space taken by the sketch itself. Now we will show that the
algorithm does not require any extra space (asymptotically) during the merge operation.
Space during merge. During the merge, the only extra memory we require is that of storing
the keys (i.e., vertices) of the map Wi−1 which weren’t already stored in Fi−1 . There are two parts
of this: we need to store the new keys of Wi−1 (that is, the vertices with newly added weight), and
we need to store the weights themselves.
Let S denote the set of new keys of Wi−1 . Note that every node in S corresponds to at least one
node from Fi which put its weight into that node. Thus, we have |S| ≤ |Fi |. Additionally, S ∪ Fi−1
form an upward-closed set in Ti−1 . Thus, just as we stored Fi−1 , we can also store S ∪ Fi−1 using
|S ∪ Fi−1 | ≤ |Fi | + |Fi−1 | space. Note that we already used |Fi | + |Fi−1 | space for the original sketch,
so storing S does not require any more space asymptotically.
14
Actually, we only need to store the weights of the leaves of the forest formed by the nonempty nodes. Since it
doesn't make a difference to the asymptotic space complexity, we store all the weights for simplicity.
Now, it remains to store the weights in Wi−1 . Here we must distinguish between the cases i = k
and i < k. If i = k, then we store the weights explicitly. The weights always remain at most
αk−1 = O(αk ) = O(n0 /n∗ ) (by Fact 5.20(j)), so the total space required to store the weights is
O(|S| log(n0 /n∗ )). Since |S| ≤ |Fk | = O(1/ε), this is then at most the space allocated to store Wk
originally, so again this does not require extra asymptotic space.
If i < k, then we first make one small optimization: as stated in a footnote, in Algorithm 3
(the compression algorithm), we do not need to move the weight up in increments of 1. Indeed, the
weights start out as multiples of αi , and the threshold αi−1 is also a multiple of αi . Thus, we can
move weight in increments of αi−1 , so that the weights in Wi−1 always remain multiples of αi . Now,
since the weights are all multiples of αi , we can store their ratios with αi ; we store the ratios in
unary, so that storing a weight of ℓαi requires O(ℓ + 1) bits of space. Then, the total space needed
to store the weights is O(ni /αi + |S|). Again, |S| ≤ |Fi |, so we can see that this is again at most the
space allocated to storing Fi originally.
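A small illustration of this unary encoding of weights as multiples of αi (the layout and names are assumptions made for the example):

```python
def encode_weights_unary(weights, alpha_i):
    """Store each weight, a multiple of alpha_i, in unary as its ratio with
    alpha_i: a weight of l * alpha_i costs l + 1 bits ('1' * l followed by '0').
    Illustrative only; the weights are assumed to be multiples of alpha_i."""
    bits = []
    for w in weights:
        ratio = w // alpha_i
        bits.append("1" * ratio + "0")
    return "".join(bits)

def decode_weights_unary(bits, alpha_i):
    """Inverse of the encoding above."""
    return [alpha_i * len(run) for run in bits.split("0")[:-1]]
```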
Thus, we have shown that in all cases, the merge step does not require any more space (asymp-
totically) than storing the sketch already does.
5.8 Runtime
In this section, we prove that, for reasonably-sized n, our algorithm processes updates and queries
in O(log(1/ε)) amortized time. We will need a few technical assumptions and simplifications to
make our algorithm run in O(log(1/ε)) time. The first is that we relax the space requirement a
bit to O(ε−1 (log(εn) + log U )) bits, which is still within O(ε−1 ) words. Secondly, we assume that
n > (log U )C /ε2 for some constant C that depends on the computational model. Also, we
assume that there are no queries during the first (log U )C /ε2 insertions.
Insertion into the last layer. Our procedure for insertion, Algorithm 1, contains two steps.
The first step is to insert the new element x into the last-layer sketch Tk . The second step is to
merge the layer i into i − 1 (Algorithm 5).
Now, let us focus on the time complexity of the first step (Lines 3 and 4 of Algorithm 1). The
reason we relax the space requirement a little is to allow us to store the tree Tk at the last layer
explicitly, not in the bracket representation. There are at most 3|Fk | ≤ 3nk /αk = O(1/ε) nodes in
the last layer. For each node u ∈ Tk , we store its weight Wk [u] (which takes O(log(εn)) bits) and
the interval [au , bu ] (which takes O(log U ) bits).
To efficiently find the highest non-full node containing x, we always maintain a sorted list of all
exposed nodes (non-full nodes whose parent is full and the non-full roots). By Observation 5.5, these
nodes have disjoint intervals whose union covers the entire [U ]. Thus these nodes are simply sorted
in the increasing order of these intervals. A binary search in O(log 1/ε) time finds the exposed node
(which is also the highest non-full node) u whose interval contains x. Then, we increase the weight
Wk [u] of that node by 1.
In the rare case where the node u becomes full after this, we need to remove it from the list and
add its two empty children. Although this takes O(1/ε) time as we have to modify the entire list
and the topology of the tree we store, it only happens once every αk = n0 /n∗ (Equation (6)) insertions.
Here n0 is the current estimate of the stream length, which keeps doubling as explained in Section 5.5.
Since we know that n > (log U )C /ε2 from our assumption, we can run the algorithm starting
with n0 = (log U )C /ε2 . As n∗ ≤ ε−1 (log(εU ))1+o(1) (Fact 5.20(i)), we have αk ≥ Ω((log U )C−1 /ε).
We can amortize the O(1/ε) running time to these αk insertions and get O(1) amortized running
time for updating the list.
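The following sketch illustrates this insertion path: a binary search over the sorted list of exposed nodes, followed by the rare split when a node becomes full. The tuple-based list and the halving of intervals are assumptions made for the example, not the paper's exact representation.

```python
import bisect

def insert_element(exposed, weights, x, alpha_k):
    """Sketch of the last-layer insertion.  `exposed` is a sorted list of
    (a_u, b_u, u) triples for the exposed nodes, whose disjoint intervals
    cover [U]; `weights` maps a node id u to its weight W_k[u].  The node
    ids and the halving of intervals are assumptions made for this example."""
    # Binary search (O(log 1/eps)) for the exposed node whose interval contains x.
    pos = bisect.bisect_right(exposed, (x, float("inf"), None)) - 1
    a_u, b_u, u = exposed[pos]
    weights[u] = weights.get(u, 0) + 1
    if weights[u] == alpha_k:            # rare case: u just became full
        exposed.pop(pos)                 # remove u from the exposed list ...
        mid = (a_u + b_u) // 2           # ... and expose its two (empty) children
        exposed.insert(pos, (a_u, mid, (u, "left")))
        exposed.insert(pos + 1, (mid + 1, b_u, (u, "right")))
```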
Merging layer i into layer i − 1. First of all, in each tree Ti , the number of all nodes is
|Fi | ≤ ni /αi = O(log(εi Ui )/εi ) (recall that εi = ε/2^{k−i+4} ). We want to amortize the time cost over ni insertions.
For Algorithm 5, there are three procedures which we will analyze one by one.
• Move(i) (Algorithm 2): At Line 5, we need to find the base-level descendant v ′ of v for every
node v ∈ Fi above the base level. This can be done by traversing the stored part of tree Ti
once, which takes |Fi | time.
In the rest of this algorithm, since we only maintain the full nodes Fi−1 in Ti−1 , in this step,
all the empty nodes in Ti−1 whose weights increase are not stored before by our algorithm. We
simply store them and their weights as a list using O((log U + log(εn)) · |Fi |) bits of memory
in the depth-first-search order. This takes O(|Fi |) time.
• Compress(i − 1) (Algorithm 3): the compression step pushes weight upward within the list of
nodes we just stored, visiting each stored node a constant number of times, so it also takes
O(|Fi |) time.
• Round(i − 1) (Algorithm 4): Finally, Algorithm 4 finds the partial nodes in our list while
visiting each node at most once. So this takes only |Fi | time as well.
After these three steps, we also have to update the topology of Fi−1 and add new full nodes to
its bracket representation. This takes |Fi−1 | time. In total, the time complexity is |Fi−1 | + |Fi |. So
the amortized time is (|Fi−1 | + |Fi |)/ni = O(1/αi ) ≤ O(1/αk ) per layer i. As there are k = log∗ (εU )
many layers and αk ≥ (log U )C−1 /ε, the total amortized time cost is just O(1).
Answering rank queries. For answering rank queries, running exactly Algorithm 6 requires
traversing T0 , T1 , . . . , Tk , which takes O(Σ_{i=0}^{k} |Fi |) = O((log U )/ε) time. For simplicity, we assume
that there are only queries after the first n0 elements are inserted. After every ε · n0 insertions, we run Al-
gorithm 6, compute each ε-approximate quantile and store them. This takes at most O((log U )/ε2 )
time. Then for every query x, we just binary search in O(log 1/ε) time, and count the number of
stored quantile elements less than that x, multiply that by εt (where t is the number of current
insertions), and output the answer. This has an error of at most 2εn. We can amortize the
O((log U )/ε2 ) time cost over ε · n0 ≥ (log U )C /ε2 elements, so this takes O(log 1/ε) amortized time per
query and O(1) amortized time per insertion.
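A minimal sketch of this query strategy, where `rank_estimate` stands in for the traversal of Algorithm 6; the binary-search recomputation shown is an assumed rendering of "compute each ε-approximate quantile", not the paper's exact procedure.

```python
import bisect

def precompute_quantiles(rank_estimate, eps, t, U):
    """Recompute the stored quantiles (run once every eps * n0 insertions):
    for each target rank j * eps * t, binary search over [1, U] for an element
    whose estimated rank reaches the target.  Illustrative rendering only."""
    quantiles = []
    for j in range(1, int(round(1 / eps)) + 1):
        target, lo, hi = j * eps * t, 1, U
        while lo < hi:
            mid = (lo + hi) // 2
            if rank_estimate(mid) >= target:
                hi = mid
            else:
                lo = mid + 1
        quantiles.append(lo)
    return quantiles

def query_rank(quantiles, x, eps, t):
    """Answer a rank query: count stored quantile elements less than x
    (binary search, O(log 1/eps) time) and scale by eps * t."""
    return bisect.bisect_left(quantiles, x) * eps * t
```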
6 Practical considerations
Mergeability. One popular feature of quantile sketches is being fully mergeable, meaning that
any two sketches with the error parameter ε can be merged into a single sketch without increasing the
error parameter ε. A weaker notion of mergeability is the one-way mergeability, which, informally
speaking, means that it is possible to maintain an accumulated sketch S and keep merging other
small sketches into S without increasing the error ε. As pointed out in [GK16, ACH+ 13], every
quantile sketch is one-way mergeable.
Among these sketches, the GK sketch and the optimal KLL sketch are not fully mergeable, while
q-digest is fully mergeable, and the KLL sketch has a mode in which it is fully mergeable but loses its
optimal space bound. Our sketch is based on the fully mergeable q-digest sketch, but we do not
know whether it is fully mergeable in its current form. We leave it as a future direction to come up
with a fully mergeable mode for our algorithm.
However, our algorithm is in a sense partially mergeable. That is, if we have two instances
of size at most n each with error parameter ε, we can merge them while incurring an additional
discrepancy of at most O(εn/ log(εU )) (as we will soon describe). Though this is not as strong
as a fully-mergeable data structure, which incurs additional error of 0, it is still better than the
O(εn) additional error incurred by merging quantile sketches in a black-box sense (by querying their
quantiles to obtain an O(εn)-approximation to their streams). In practice, this means that one can
merge up to poly(U ) of our sketches simultaneously (by performing merges in a binary tree with
depth O(log(εU ))), with only a constant-factor loss in ε.
We now sketch how to perform this partial merge. Suppose we wish to merge the data structures
D and D′ , with current sizes t > t′ . To begin with, let us first imagine that only layer 0 is occupied
(in both structures). Then, we simply add values of the weight map W0′ (of D′ ) into W0 (of D).
Then, the discrepancy of W0 is now εt + εt′ . Now, the only problem is that the invariant that all
nodes are either full or empty may not hold anymore, and the full nodes are no longer upward-
closed. To fix this, we perform the compression and rounding steps of Algorithms 3 and 4 — by
Lemmas 5.12 and 5.13, this increases the discrepancy by at most α0 = O(εt/ log(εU )). If there is
now a doubling step (Algorithm 7) to be performed (that is, if t0 + t′0 ≥ n0 ), then we now do it
as usual. Note that though the discrepancy has increased, the data structure is otherwise still a
valid data structure for the error parameter ε, and we can continue to perform the usual operations
(including more merges) on the new data structure, while keeping track of the increased discrepancy.
Now, suppose that there are occupied layers other than layer 0. Then, before merging the two
data structures, we simply perform the operation Merge(i) early for i = k, k − 1, . . . , 1, on both
data structures. This proceeds identically to an ordinary Merge operation, except that during
the rounding step, the total weight may not be a multiple of αi−1 ; we simply discard the excess
weight down to a multiple of αi−1 (and insert arbitrary elements to replace them at the end of the
merge). Overall, this has the effect of discarding elements down to the nearest multiple of α0 , so it
will introduce a discrepancy of at most α0 = O(εt/ log(εU )). Additionally, the proof of Lemma 5.14
still shows that the discrepancy introduced by this merge is at most γ1 n1 = O(εt/ log(εU )). Thus,
overall, this partial merge still adds an additional O(εt/ log(εU )) to the discrepancy, as desired.
Constant factors. The parameters that we selected in Section 5.6 were chosen to make the
analysis simple. There is, however, a lot of leeway in choosing the parameters to still satisfy the
necessary properties, and our exact choices likely do not attain the best constant factors on space
complexity. We use k + 1 = log∗ (εU ) + 1 layers, but in practice, we expect that around 4 layers is
probably enough, and the parameters can then be chosen appropriately.
Additionally, beyond just the setting of our parameters, our analysis has generally been wasteful
in terms of constants for ease of presentation and readability. There are several places this can be
improved. For example, we can improve the error ε by a factor of 2 by performing the moving and
rounding steps of the merge in different directions; that is, in the moving step, we can move nodes
only to their leftmost (least) descendant, and in the rounding step, we round nodes upward only
(which is what we already do).
Removing amortization. Currently, our runtime analysis is amortized, since a step containing
a merge can take a long time compared to a normal insertion step. If one is concerned about worst-
case update time, then we can improve performance by executing the time-consuming operations
over a longer time period while storing received elements in a buffer, similarly to Claim 3.13 of
[AJPS23].
Answering select queries with real elements. One feature of quantile queries is that they
can also answer select queries: that is, given a rank r, one can query select(r) to obtain an element
x that is between the rank-(r − εt) and rank-(r + εt) elements of the stream. This is equivalent to
being able to answer rank queries, since one can use a binary search of rank queries to answer a
select query (and vice versa). One might also desire, though, that the answers to the select queries
are actual elements of the stream, rather than arbitrary elements of [1, U ]. As stated, our algorithm
does not provide a way to do this. It turns out, however, that given any quantile sketch algorithm
that can answer approximate rank queries, it is possible to augment it (in a black-box manner)
so that it can answer select queries with real elements of the stream, with only a constant-factor
degradation in the error parameter ε. We will now sketch how to do so.
We initialize a quantile sketch with error parameter ε, and we maintain a list x1 < x2 < · · · < xℓ
which are actual elements of the stream (and by convention we write x0 = 0 and xℓ+1 = U + 1),
and rank estimates r1 , . . . , rℓ (where again by convention we say r0 = 0) satisfying the following
properties at all times t, where π denotes the stream received so far:
1. |rankπ (xi ) − ri | ≤ εt for every 0 ≤ i ≤ ℓ;
2. rankπ (xi+1 − 1) − ri ≤ 2εt for every 0 ≤ i ≤ ℓ.
(Note that the first item is trivially satisfied for i = 0.) Now, suppose that we receive an insertion
x into the stream. First, we increment ri for all i such that xi ≥ x, to maintain property 1 (note
that t increases by 1, but this only makes property 1 easier to satisfy).
Now, if x = xj for some j, then property 2 continues to be satisfied since the left-hand side of
the inequality remains the same for all i. Otherwise, suppose that x ∈ (xj , xj+1 ) for some j. Then,
2 might become violated for i = j, since the left-hand side will have increased by 1. To fix this,
we insert a new element xj+1 = x (and shift the indices of the existing xi , ri of all i ≥ j + 1 up
by 1). Then, we execute a rank query on x to get r such that | rankπ (x) − r| ≤ εt. Then, we set
rj+1 = max{r, rj + 1}. Note that property 1 continues to be satisfied by the accuracy of the rank
query and because rj + 1 ≤ rankπ (xj ) + εt + 1 ≤ rankπ (xj+1 ) + εt. It remains to check that property
2 is now satisfied. Indeed, for i = j + 1, this follows from the fact that rj+1 ≥ rj + 1 and that the
property was previously satisfied for i = j. For i = j, it follows from the fact that rankπ (x − 1)
is at most the former value of rankπ (xj+1 − 1), and that the property was previously satisfied for
i = j. Thus, we have established that the properties both continue to hold.
Finally, while there is any j such that rj+1 − rj−1 ≤ εt, we delete xj and rj (and shift the indices
i > j down by 1 to accommodate). This preserves the properties: we only need to check property
2 for i = j − 1, and indeed, rankπ (xj − 1) − rj−1 ≤ (rj + εt) − rj−1 ≤ 2εt by property 2 and by the
assumption that rj − rj−1 ≤ εt (note that the old rj+1 has become rj ). Thus, this preserves the
properties.
Now, we answer a select query as follows: on a query of rank r, we pick the minimal i such that
r ≤ ri + 2εt, and return xi . As a special case, if r < 2εt, we return x1 instead of x0 = 0. (Note
that by property 2 applied to i = ℓ, we never return xℓ+1 .) Then, assuming that r ≥ 2εt, we have
by property 1 that rank(xi ) ≥ ri − εt ≥ r − 3εt. Also, by property 2, rank(xi − 1) ≤ ri−1 + 2εt < r
(by minimality of i), so the rank-r element is at least xi . Thus the error in the select query is at
most O(εt) as long as r ≥ 2εt. Also, in the special case r < 2εt, we answer x1 , and by property 2,
rank(x1 − 1) ≤ 2εt, so again the error is at most O(εt). Thus, the answers to the select queries are
always approximately correct.
Finally, it remains to analyze the total space taken. Note that after the deletions we have rj+1 − rj−1 > εt for
all j, so the total number of indices ℓ is at most O(1/ε). Therefore, we only need to store the
O(1/ε) elements x1 , . . . , xℓ and r1 , . . . , rℓ , which takes O(1/ε) words. Indeed, since the xi are in
increasing order and the increments of the ri sum to at most O(n), we can actually store these in
O(ε−1 (log(εU ) + log(εn))) space, so this does not take any additional asymptotic space over our
algorithm.
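To make the augmentation concrete, here is a simplified Python sketch. `rank_query` is the underlying sketch's approximate rank oracle, the list operations are left at O(ℓ) cost, and the pruning pass is a single sweep rather than a full fixpoint, so this is only an illustration of the bookkeeping, not the paper's exact procedure.

```python
import bisect

class RealSelect:
    """Illustrative bookkeeping for select queries that return real stream
    elements.  `rank_query(x)` is assumed to be the underlying sketch's
    approximate rank oracle (additive error at most eps * t)."""

    def __init__(self, eps, rank_query):
        self.eps, self.rank_query = eps, rank_query
        self.t = 0
        self.xs = []   # x_1 < ... < x_l, actual stream elements
        self.rs = []   # rank estimates r_1, ..., r_l  (r_0 = 0 is implicit)

    def insert(self, x):
        self.t += 1
        for i, xi in enumerate(self.xs):   # maintain property 1
            if xi >= x:
                self.rs[i] += 1
        if x not in self.xs:               # maintain property 2
            j = bisect.bisect_left(self.xs, x)
            prev = self.rs[j - 1] if j > 0 else 0
            self.xs.insert(j, x)
            self.rs.insert(j, max(self.rank_query(x), prev + 1))
        # Prune: delete x_j while r_{j+1} - r_{j-1} <= eps * t
        # (a single forward pass is shown for brevity).
        j = 0
        while j + 1 < len(self.xs):
            prev = self.rs[j - 1] if j > 0 else 0
            if self.rs[j + 1] - prev <= self.eps * self.t:
                del self.xs[j], self.rs[j]
            else:
                j += 1

    def select(self, r):
        """Return x_i for the minimal i with r <= r_i + 2*eps*t (x_1 if r < 2*eps*t)."""
        if not self.xs:
            return None
        if r < 2 * self.eps * self.t:
            return self.xs[0]
        for xi, ri in zip(self.xs, self.rs):
            if r <= ri + 2 * self.eps * self.t:
                return xi
        return self.xs[-1]
```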
7 Lower bounds
The space complexity of our algorithm is O(ε−1 (log(εU ) + log(εn))). In this section, we'll discuss the
optimality of this result. The first term O(ε−1 log(εU )) must be incurred by any quantile sketch,
even a randomized one that succeeds with reasonable probability, as we will now show. This already
implies that when n ≤ poly(U ), our algorithm is tight15 . When this is not the case, we conjecture
that our algorithm is optimal among deterministic sketches anyway. In particular, Conjecture 1.3
implies a space lower bound of Ω(ε−1 log(εn)) for quantiles.
Theorem 7.1. Any randomized streaming algorithm for Problem 1.1 that succeeds with probability
at least 0.9 (that is, it can answer a rank query chosen by an oblivious adversary with that probability)
on a universe of size U > Cε−1 for some sufficiently large C uses at least Ω(ε−1 log(εU )) bits of
space.
Proof. It suffices to show that the final state of the algorithm requires Ω(ε−1 log(εU )) bits of space.
Let us restrict ourselves to streams that only contain k = 3ε−1 distinct elements, each of which
occurs n/k times. Under this model, let the stream be π1′ < . . . < πk′ (each with multiplicity n/k).
Under this model, the min-entropy of the stream (when the stream is chosen uniformly at random)
is log (U choose k). We will show that access to the sketch reduces the min-entropy considerably (by at
least a constant factor). To do this, we will describe an algorithm for a party to make ε−1 log U
queries to the sketch and, with probability at least 0.01, output at least a 0.01 fraction of the elements
π1′ , π2′ , . . . , πk′ correctly. The min-entropy of this distribution of outputs is much lower: the only
possibilities are those that overlap on at least a 0.01-fraction of π1′ . . . πk′ , of which there are at most
(k choose 0.01k) · (U choose 0.99k). The most likely such outcome therefore occurs with conditional probability at least
0.01/((k choose 0.01k) · (U choose 0.99k)), so the min-entropy conditioned on the memory state is at most
log(100 · (k choose 0.01k) · (U choose 0.99k)).
15
Technically speaking, this result alone only implies tightness when n ≤ poly(εU ). However, if U > 1/ε2 , then
poly(εU ) and poly(U ) are the same, and when U < 1/ε2 , then n ≤ poly(U ) implies that n ≪ poly(1/ε), and as we
discussed in Section 1.1, a result of [AHNY22] implies that our algorithm is tight when ε−1 > log(εn).
Therefore, the min-entropy has decreased by at least
log (U choose k) − log (100 · (k choose 0.01k) · (U choose 0.99k)) ≥ Ω(ε−1 log(εU ))
by Stirling’s approximation when U > Cε−1 for a sufficiently large C. Then, by the fact blow, the
sketch must have contained at least this many bits of information.
Fact 7.2. Let Hmin (·) denote the min-entropy of random variables. For any two random variables,
x and y supported on X and Y respectively, we have
Hmin (x) − Hmin (x | y) ≤ H(y).
In our case, x is the elements π1′ , π2′ , . . . , πk′ and y is the memory state of our algorithm.
Proof.
Hmin (x) − Hmin (x | y) = Hmin (x) − Σ_{y∈Y} Pr(y = y) · min_{x∈X} log (1 / Pr(x = x | y = y))
= Hmin (x) − Σ_{y∈Y} Pr(y = y) · min_{x∈X} log (Pr(y = y) / Pr(x = x, y = y))
≤ Hmin (x) − Σ_{y∈Y} Pr(y = y) · min_{x∈X} log (Pr(y = y) / Pr(x = x))
= Hmin (x) − min_{x∈X} log (1 / Pr(x = x)) + Σ_{y∈Y} Pr(y = y) · log (1 / Pr(y = y))
= H(y).
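As a quick sanity check of Fact 7.2 (purely illustrative, not part of the proof), one can verify the inequality numerically on small random joint distributions:

```python
import math
import random

def check_fact_72(num_x=6, num_y=4, trials=1000):
    """Numerically verify Hmin(x) - Hmin(x | y) <= H(y) on random joint
    distributions over small supports (an illustrative sanity check only)."""
    for _ in range(trials):
        joint = [[random.random() for _ in range(num_y)] for _ in range(num_x)]
        total = sum(map(sum, joint))
        joint = [[p / total for p in row] for row in joint]
        p_x = [sum(row) for row in joint]
        p_y = [sum(joint[i][j] for i in range(num_x)) for j in range(num_y)]
        h_min_x = -math.log2(max(p_x))
        # Hmin(x | y) as used in the proof: sum_y Pr(y) * min_x log(1 / Pr(x | y)).
        h_min_x_given_y = sum(
            p_y[j] * -math.log2(max(joint[i][j] / p_y[j] for i in range(num_x)))
            for j in range(num_y))
        h_y = -sum(p * math.log2(p) for p in p_y)
        assert h_min_x - h_min_x_given_y <= h_y + 1e-9
    return True
```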
Now we describe the list of queries to ask the sketch in order to output at least a 0.01 fraction of the elements
π1′ . . . πk′ correctly with probability 0.01. For each i ∈ [k], binary search for the i-th smallest
element in a noise-resilient way [Pel02] (resilient to a 0.2 fraction of adversarial errors). At the end,
this must find the i-th element exactly, since each element's multiplicity is more than the
permissible error. The noisy binary search must succeed whenever the fraction of erroneous answers is at most
0.2, which is true for at least a 0.01 fraction of the elements at least a 0.01 fraction of the time.
Theorem 7.3. Conjecture 1.3 implies that any deterministic streaming algorithm for Problem 1.1
uses at least Ω(ε−1 log(εn)) bits of space.
Proof. We will show the following. Any data structure that can compute a quantile sketch with error parameter
0.1ε on n elements in the range [ε−1 ] can also return counts of each element that are accurate
to within ±εn. Then, if there is a quantile sketch using o(ε−1 log n) bits of memory, there is also a
deterministic parallel approximate counter using that much space.
Let us estimate the count of i ∈ [ε−1 ]. The true count of i is the difference of the
true ranks ri − ri−1 , since the rank rj is the number of elements at most j. We query the rank of i
in the quantile sketch and get the answer rbi and the rank of i − 1 and get rbi−1 . Then,
|(ri − ri−1 ) − (r̂i − r̂i−1 )| ≤ 0.2εn,
since each of r̂i and r̂i−1 differs from the corresponding true rank by at most 0.1εn. Thus r̂i − r̂i−1
estimates the count of i to within ±0.2εn ≤ εn, as desired.
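A minimal sketch of this counting reduction, where `rank_query` is an assumed interface to a quantile sketch with additive rank error at most 0.1εn:

```python
def approximate_counts(rank_query, eps, n, universe_size):
    """Recover approximate counts from a rank oracle with additive error
    at most 0.1 * eps * n: count(i) = rank(i) - rank(i - 1), accurate to
    within +- 0.2 * eps * n.  `rank_query` is an assumed interface."""
    counts, prev = {}, 0
    for i in range(1, universe_size + 1):
        r = rank_query(i)
        counts[i] = r - prev
        prev = r
    return counts
```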
Acknowledgments
We would like to thank Jelani Nelson for his excellent mentorship, and specifically, for pointing
us to this problem, helpful discussions, and suggestions for the manuscript. We would also like to
thank the others who have provided feedback for drafts of the manuscript, including Lijie Chen,
Yang Liu, and Naren Manoj. Lastly, we would like to thank Angelos Pelecanos for the scorpion
lollipop [Hot22].
References
[ACH+ 13] Pankaj K Agarwal, Graham Cormode, Zengfeng Huang, Jeff M Phillips, Zhewei Wei,
and Ke Yi. Mergeable summaries. ACM Transactions on Database Systems (TODS),
38(4):1–28, 2013. 3, 29
[AHNY22] Ishaq Aden-Ali, Yanjun Han, Jelani Nelson, and Huacheng Yu. On the amortized com-
plexity of approximate counting. arXiv preprint arXiv:2211.03917, 2022. 3, 31
[AJPS23] Sepehr Assadi, Nirmit Joshi, Milind Prabhu, and Vihan Shah. Generalizing
Greenwald-Khanna streaming quantile summaries for weighted inputs. arXiv preprint
arXiv:2303.06288, 2023. 3, 30
[AM04] Arvind Arasu and Gurmeet Singh Manku. Approximate counts and quantiles over
sliding windows. In Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART
symposium on Principles of database systems, pages 286–296, 2004. 3
[ARS97] Khaled Alsabti, Sanjay Ranka, and Vineet Singh. A one-pass algorithm for accurately
estimating quantiles for disk-resident data. In Very Large Data Bases Conference, 1997.
3
[AXL+ 15] Michael Armbrust, Reynold S Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K Bradley,
Xiangrui Meng, Tomer Kaftan, Michael J Franklin, Ali Ghodsi, et al. Spark SQL: Relational
data processing in Spark. In Proceedings of the 2015 ACM SIGMOD international
conference on management of data, pages 1383–1394, 2015. 1
[CG16] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Pro-
ceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and
data mining, pages 785–794, 2016. 1
[CKL+ 21] Graham Cormode, Zohar Karnin, Edo Liberty, Justin Thaler, and Pavel Veselý. Relative
error streaming quantiles. In Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI
Symposium on Principles of Database Systems, pages 96–108, 2021. 3
[CKMS06] Graham Cormode, Flip Korn, Shanmugavelayutham Muthukrishnan, and Divesh Sri-
vastava. Space-and time-efficient deterministic algorithms for biased quantiles over data
streams. In Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART sympo-
sium on Principles of database systems, pages 263–272, 2006. 3
[CMRV21] Graham Cormode, Abhinav Mishra, Joseph Ross, and Pavel Veselý. Theory meets
practice at the median: A worst case comparison of relative error quantile algorithms.
In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data
Mining, pages 2722–2731, 2021. 3
[CV20] Graham Cormode and Pavel Veselý. A tight lower bound for comparison-based quantile
summaries. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on
Principles of Database Systems, pages 81–93, 2020. 1
[DE19] Ted Dunning and Otmar Ertl. Computing extremely accurate quantiles using t-digests.
arXiv preprint arXiv:1902.04023, 2019. 3
[FO17] David Felber and Rafail Ostrovsky. A randomized online quantile summary in
O((1/ε) log(1/ε)) words. Theory of Computing, 13(1):1–17, 2017. 3
[GDT+ 18] Edward Gan, Jialin Ding, Kai Sheng Tai, Vatsal Sharan, and Peter Bailis. Moment-
based quantile sketches for efficient high cardinality aggregation queries. Proceedings of
the VLDB Endowment, 11(11), 2018. 3
[GK01] Michael Greenwald and Sanjeev Khanna. Space-efficient online computation of quantile
summaries. ACM SIGMOD Record, 30(2):58–66, 2001. 1, 2
[GK16] Michael B Greenwald and Sanjeev Khanna. Quantiles and equi-depth histograms over
streams. In Data Stream Management: Processing High-Speed Data Streams, pages 45–
86. Springer, 2016. 2, 29
[GZ03] Anupam Gupta and Francis Zane. Counting inversions in lists. In SODA, volume 3,
pages 253–254, 2003. 3
[KLL16] Zohar Karnin, Kevin Lang, and Edo Liberty. Optimal quantile approximation in
streams. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science
(FOCS), pages 71–78. IEEE, 2016. 1, 2, 3
[MP80] J Ian Munro and Mike S Paterson. Selection and sorting with limited storage. Theoretical
computer science, 12(3):315–323, 1980. 1, 3
[MRL98] Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G Lindsay. Approximate me-
dians and other quantiles in one pass and with limited memory. ACM SIGMOD Record,
27(2):426–435, 1998. 3
[MRL19] Charles Masson, Jee E Rim, and Homin K Lee. DDSketch: A fast and fully-mergeable
quantile sketch with relative-error guarantees. arXiv preprint arXiv:1908.10693, 2019.
3
[Pel02] Andrzej Pelc. Searching games with errors—fifty years of coping with liars. Theoretical
Computer Science, 270(1-2):71–109, 2002. 32
[SBAS04] Nisheeth Shrivastava, Chiranjeeb Buragohain, Divyakant Agrawal, and Subhash Suri.
Medians and beyond: new aggregation techniques for sensor networks. In Proceedings of
the 2nd international conference on Embedded networked sensor systems, pages 239–249,
2004. 1, 2, 8
[ZW07] Qi Zhang and Wei Wang. An efficient algorithm for approximate biased quantile com-
putation in data streams. In Proceedings of the sixteenth ACM conference on Conference
on information and knowledge management, pages 1023–1026, 2007. 3
A Proof of Fact 5.20
Here, we will prove the various parts of Fact 5.20, by showing a series of claims. Note that
Fact 5.20(f) follows directly from the definitions of εi and γi .
Proof. For i = 0, this follows from the assumption (made at the start of Section 5) that εU is
sufficiently large. For i = 1, we have U1 = 2Vlog(εU/8) + 1W/ε and ε1 = ε/2^{k+3} , so
ε1 U1 = Vlog(εU/8) + 1W/2^{log∗(εU)+2} , which is at least 2 again by the assumption that εU is sufficiently
large. Finally, for i ≥ 2, this follows by induction using the recursive definition of Ui and the fact
that εi−1 < εi .
Proof. Since εi+1 ≤ 2εi , we have by the inductive definition of Ui , (4), that
(Here we have used the fact that log(εi Ui ) is a positive integer, which follows from Claim A.1 and
Claim A.2.)
Proof. We have k = log∗ (εU ) ≥ log∗ (Q0 ), so if we iteratively take the logarithm of Q0 , we get down
below 1 in at most k steps. Thus, by Claim A.4, we have Qk ≤ 8, so Uk = 16Qk /εk = O(1/ε).
Claim A.6 (Fact 5.20(h)). U1 + U2 + · · · + Uk = O(ε−1 log(εU )).
Proof. We have U1 = 2Vlog(ε0 U0 ) + 1W/ε0 = O(ε−1 log(εU )). Meanwhile, for i > 1, by Claim A.4,
we have Qi ≤ O(log log Q0 ) = O(log log(εU )). Also, εi ≥ ε/2^{k+3} = Ω(2^{−log∗(εU)} ε). Therefore, for
i > 1, we have Ui = O(Qi /εi ) ≤ O(ε−1 2^{log∗(εU)} log log(εU )). Thus, since k = log∗ (εU ),
U2 + · · · + Uk ≤ O(ε−1 log∗ (εU ) 2^{log∗(εU)} log log(εU )) < O(ε−1 log(εU )),
so we are done.
Claim A.7 (Fact 5.20(c)). For all i < k, ni+1 is a factor of ni .
Proof. Since the ni are powers of 2, it is enough to check that ni+1 ≤ ni . For i ≥ 1, this follows
directly from the definition of ni+1 since εi+1 > εi (and because of Claim A.2). For i = 0, we get
n0 = n and
n1 = ε0 n0 /(ε1 Vlog(ε0 U0 ) + 1W) = 2^{log∗(εU)} n0 /Vlog(εU/8) + 1W ,
which is at most n0 by the assumption that εU is sufficiently large.
Claim A.8 (Fact 5.20(d)). For all i < k, αi+1 = αi /Vhi+1 + 1W.
Proof. By successive applications of Claim A.8 and then using the definition of α0 , we have
αk = α0 /(Vh1 + 1W · . . . · Vhk + 1W) = ε0 n0 /(Vh0 + 1W · . . . · Vhk + 1W).
Thus, we have
n∗ = n0 /αk = (Vh0 + 1W · . . . · Vhk + 1W)/ε0 .
Since ε0 = ε/8, the first inequality of the claim follows immediately. Now, note that we have
Vhi + 1W = O(log(εi Ui )) = O(max{log Qi , 1})
Now, this means that Vh0 + 1W = O(log(εU )), and for i > 0, by Claim A.4, we have Vhi + 1W ≤
O(log log εU ). Thus, since k = log∗ (εU ), we have
n∗ ≤ O(log(εU )) · (O(log log εU ))^{log∗(εU)} /(ε/8) = ε−1 (log(εU ))^{1+o(1)} ,
as desired.
Claim A.11 (Fact 5.20(j)). αk−1 = O(n0 /n∗ ).
Proof. By Claim A.4, we have Qk = O(1), so by Fact 5.20(d), we have αk−1 = αk Vlog Qk + 1W =
O(αk ) = O(n0 /n∗ ).