T-Digest Algorithm
TED DUNNING AND OTMAR ERTL
Abstract. An on-line algorithm for computing approximations of rank-based statistics
is presented that allows controllable accuracy. Moreover, this new algorithm can be used
to compute hybrid statistics such as trimmed means in addition to computing arbitrary
quantiles. An unusual property of the method is that it allows a quantile q to be computed
with an accuracy relative to q(1 − q) rather than with an absolute accuracy as with most
methods. This new algorithm is robust with respect to highly skewed distributions or
highly ordered datasets and allows separately computed summaries to be combined with
no loss in accuracy.
An open-source implementation of this algorithm is available from the author.
1. Introduction
Given a sequence of numbers, it is often desirable to compute rank-based statistics such
as the median, 95-th percentile or trimmed means in an on-line fashion, keeping only a
small data structure in memory. Traditionally, such statistics were often computed by
sorting all of the data and then either finding the quantile of interest by interpolation or
by processing all samples within particular quantiles. This sorting approach is infeasible
for very large datasets, which has led to interest in on-line approximate algorithms.
One early algorithm for computing on-line quantiles is described in [CLP00]. In that
work specific quantiles were computed by incrementing or decrementing an estimate by
a value proportional to the simultaneously estimated probability density at the desired
quantile. This method is plagued by a circularity in that estimating density is only possible
by estimating yet more quantiles. Moreover, this work did not allow the computation of
hybrid quantities such as trimmed means.
Munro and Paterson provided an alternative algorithm [GK01] to get an accurate estimate of any particular quantile. This is done by keeping s samples from the N samples
seen so far, where s ≪ N by the time the entire data set has been seen. If the data are
presented in random order and if s = Θ(N^{1/2}), then Munro and Paterson's algorithm has
a high probability of being able to retain a set of samples that contains the median. This
algorithm can be adapted to find a number of pre-specified quantiles at the same time at
proportional cost in memory. The memory consumption of the Munro-Paterson algorithm is
excessive if precise results are desired. Approximate results can be had with less memory,
however. A more subtle problem is that the implementations in Sawzall [PDGQ05] and the
Datafu library [Lin] use a number of buckets computed from the GCD of the desired quantiles.
This means that if you want to compute the 99-th, 99.9-th and 99.99-th percentiles,
a thousand buckets are required.
An alternative approach is described in [SBAS04]. In this work, incoming values are
assumed to be integers of fixed size. Such integers can trivially be arranged in a perfectly
balanced binary tree where the leaves correspond to the integers and the interior nodes
correspond to bit-wise prefixes. This tree forms the basis of the data structure known as a
Q-digest. The idea behind a Q-digest is that in the uncompressed case, counts for various
values are assigned to leaves of the tree. To compress this tree, sub-trees are collapsed
and counts from the leaves are aggregated into a single node representing the sub-tree
such that the maximum count for any collapsed sub-tree is less than a threshold that is a
small fraction of the total number of integers seen so far. Any quantile can be computed
by traversing the tree in left prefix order, adding up counts until the desired fraction of
the total is reached. At that point, the count for the last sub-tree traversed can be used to
interpolate to the desired quantile within a small and controllable error. The error is
bounded because the count for each collapsed sub-tree is bounded.
The two problems with the Q-digest are that it depends on the tree structure being
known ahead of time and that the error bounds do not necessarily apply if the algorithm
is used in an on-line fashion. Adapting the Q-digest to use a balanced tree over arbitrary
elements is difficult. This difficulty arises because rebalancing the tree involves sub-tree
rotations and these rotations may require reapportionment of previously collapsed counts
in complex ways. This reapportionment could have substantial effects on the accuracy of
the algorithm and in any case make the implementation much more complex because the
concerns of counting cannot be separated from the concerns of maintaining a balanced
tree. Another problem with Q-digests is that if they are subjected to compression during
building, it isn't entirely clear how to handle compressed counts that move high above
the leaves, but which eventually have non-trivial counts at a lower level in the tree. The
proof of correctness for Q-digests ignores this complication by only considering the case
where counts are compressed after being collected on the leaves. It would be desirable to
have error bounds that apply to a completely on-line data structure. These limitations
do not apply in the original formulation of the Q-digest as a compression algorithm for
quantile information, but current trends towards the use of on-line algorithms make these
limitations awkward.
The work described here shows how the fundamental idea of a Q-digest can be easily
extended to values in R without the complexity of apportioning counts during tree rebalancing. Indeed, this new data structure eliminates the idea of a tree for storing the original
samples, maintaining only the concept of collapsing groups of observations in a way that
preserves accuracy. This new algorithm, known as t-digest, has well-defined and easily
proven error bounds and allows parallel on-line operation. A particularly novel aspect of
the variant of t-digest described here is that accuracy for estimating the q quantile is relative to q(1 − q). This is in contrast to earlier algorithms which had errors independent
of q. The relative error bound of the t-digest is convenient when computing quantiles for
very small q or for q near 1. As with the Q-digest algorithm, the accuracy/size trade-off
can be controlled by setting a single compression parameter.
The accuracy bounds are tight regardless of data ordering for the non-parallel on-line
case. In the case of parallel execution, the error bound is somewhere between the constant
error bound and the relative error bound except in the case of highly ordered input data.
For randomly ordered data, parallel execution has expected error bounded by the same
bound as for sequential execution.
2. The Algorithm
The basic outline of the algorithm for constructing a t-digest is quite simple. An initially
empty ordered list of centroids, C = [c1 . . . cm ] is kept. Each centroid consists of a mean
and a count. To add a new value xn with a weight wn , the set of centroids is found that
have minimum distance to xn. This set is reduced by retaining only centroids with a count
less than 4δq(1 − q)n, where δ controls the accuracy of quantile estimates and q is the
estimated quantile for the mean of the centroid. If more than one centroid remains, one is
selected at random. If a centroid is found, then (xn , wn ) is added to that centroid. It may
happen that the weight wn is larger than can be added to the selected centroid. If so, as
much weight as possible is allocated to the selected centroid and the selection is repeated
with the remaining weight. If no satisfactory centroid is found or if there is additional
weight to be added after all centroids with minimum distance are considered, then xn is
used to form a new centroid and the next point is considered. This procedure is shown
more formally as Algorithm 1.
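To make the insertion step concrete, the following is a minimal Python sketch of the procedure just described. It is not the authors' reference implementation: the Centroid and TDigest class names, the linear scan for the nearest centroids, the simple sorted-list storage, and the K parameter handling are illustrative simplifications.

```python
import random

class Centroid:
    """A (mean, count) pair; the mean is updated incrementally as weight is added."""
    def __init__(self, mean, count=1.0):
        self.mean = mean
        self.count = count

    def add(self, x, w):
        self.count += w
        self.mean += w * (x - self.mean) / self.count

class TDigest:
    """Simplified sketch of Algorithm 1; delta is the compression parameter."""
    def __init__(self, delta=0.01, K=25):
        self.delta = delta
        self.K = K
        self.C = []            # centroids, kept sorted by mean
        self.n = 0.0           # total weight seen so far
        self._compressing = False

    def _q(self, i):
        # estimated quantile of centroid i: weight before it plus half its own weight
        before = sum(c.count for c in self.C[:i])
        return (before + self.C[i].count / 2.0) / self.n

    def add(self, x, w=1.0):
        self.n += w
        if not self.C:
            self.C.append(Centroid(x, w))
            return
        # candidate centroids: those whose mean is closest to x
        dists = [abs(c.mean - x) for c in self.C]
        dmin = min(dists)
        candidates = [i for i, d in enumerate(dists) if d == dmin]
        random.shuffle(candidates)
        for i in candidates:
            if w <= 0:
                break
            q = self._q(i)
            limit = 4.0 * self.n * self.delta * q * (1.0 - q)
            room = limit - self.C[i].count
            if room > 0:
                dw = min(w, room)      # add as much weight as the size bound allows
                self.C[i].add(x, dw)
                w -= dw
        if w > 0:
            # remaining weight starts a new centroid
            self.C.append(Centroid(x, w))
            self.C.sort(key=lambda c: c.mean)
        if not self._compressing and len(self.C) > self.K / self.delta:
            self.compress()

    def compress(self):
        # recursively apply the same clustering to the centroids in random order
        old, self.C, self.n = self.C, [], 0.0
        random.shuffle(old)
        self._compressing = True
        for c in old:
            self.add(c.mean, c.count)
        self._compressing = False
```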
In this algorithm, a centroid object contains a mean and a count. Updating such an
object with a new data point (x, w) is done using Welford's method [Wik, Knu98, Wel62].
The number of points assigned to each centroid is limited to max(1, ⌊4Nδq(1 − q)⌋),
where q is the quantile for the approximate mean of the centroid and δ is a parameter
chosen to limit the number of points that can be assigned to a centroid. Typically, this
compression factor is described in terms of its inverse, 1/δ, in order to stay compatible with
the conventions used in the Q-digest. The algorithm approximates q for centroid ci by
summing the weights for all of the centroids ordered before ci:
   q(c_i) = (c_i.count/2 + Σ_{j<i} c_j.count) / Σ_j c_j.count
In order to compute this sum quickly, the centroids can be stored in a data structure such
as a balanced binary tree that keeps sums of each sub-tree. For centroids with identical
means, order of creation is used as a tie-breaker to allow an unambiguous ordering. Figure
2 shows actual centroid weights from multiple runs of this algorithm and for multiple
distributions plotted against the ideal bound.
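The paper suggests a balanced tree that keeps per-subtree sums; as one illustrative alternative (not what the reference implementation uses), a Fenwick tree over the current centroid ordering gives the same O(log m) prefix sums, although it does not by itself handle insertion of new centroids into the middle of the order.

```python
class PrefixCounts:
    """Fenwick (binary indexed) tree over centroid positions: prefix(i) returns the
    total count of the centroids at positions 0 .. i-1 in O(log m)."""
    def __init__(self, m):
        self.tree = [0.0] * (m + 1)

    def add(self, i, w):
        i += 1                        # internal indices are 1-based
        while i < len(self.tree):
            self.tree[i] += w
            i += i & -i

    def prefix(self, i):
        total = 0.0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

def centroid_quantile(prefix_counts, counts, i, n):
    """q(c_i) = (sum of counts before c_i + count(c_i)/2) / n, as in the formula above."""
    return (prefix_counts.prefix(i) + counts[i] / 2.0) / n

# example: three centroids with counts 5, 3 and 7
counts = [5.0, 3.0, 7.0]
pc = PrefixCounts(len(counts))
for i, c in enumerate(counts):
    pc.add(i, c)
print(centroid_quantile(pc, counts, 1, sum(counts)))   # (5 + 1.5) / 15
```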
2.1. Ordered Inputs. The use of a bound on centroid size that becomes small for extreme values of q is useful because it allows relative error to be bounded very tightly, but
this bound may be problematic for some inputs. If the values of X are in ascending or
descending order, then C will contain as many centroids as values that have been observed.
[Algorithm 1 (fragment): if wn > 0, add (xn, wn) to C as a new centroid; if |C| > K/δ, set C ← cluster(permute(C)); after the last input, apply C ← cluster(permute(C)) once more and return C.]
This will happen because each new value of X forms a new centroid, since at the extremes
q(1 − q) → 0. To avoid this pathology, if the number of centroids becomes excessive, the
set of centroids is collapsed by recursively applying the t-digest algorithm to the centroids
themselves after randomizing the order of the centroids.
In all cases, after passing through all of the data points, the centroids are recursively
clustered one additional time. This allows adjacent centroids to be merged if they do not
violate the size bound. This final pass typically reduces the number of centroids by 20-40%
with no apparent change in accuracy.
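As a usage illustration, building on the hypothetical TDigest sketch given earlier in this section (so the class name, parameters, and compress() method are assumptions, not the reference implementation), ordered input and the final clustering pass might look like this:

```python
# Ordered input is the worst case: nearly every new value starts a new centroid,
# so the periodic compress() call is what keeps |C| bounded by roughly K/delta.
digest = TDigest(delta=0.05, K=10)
for i in range(2000):
    digest.add(float(i))           # strictly ascending values

digest.compress()                  # the final recursive clustering pass
print(len(digest.C))               # far fewer centroids than the 2000 input values
```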
2.2. Accuracy Considerations. Initial versions of this algorithm tried to use the centroid
index i as a surrogate for q, applying a correction to account for the fact that extreme
centroids have less weight. Unfortunately, it was difficult to account for the fact that
the distribution of centroid weights changes during the course of running the algorithm.
Initially all weights are 1. Later, some weights become substantially larger. This means
that the relationship between i and q changes from completely linear to highly non-linear
in a stochastic way. This made it difficult to avoid cutoffs for centroids that were either too large or too small,
resulting in either too many centroids or poor accuracy.
The key property of this algorithm is that the final list of centroids C is very close to
what would have been obtained by simply sorting and then grouping adjacent values in X
into groups with the desired size bounds. Clearly, such groups could be used to compute
all of the rank statistics of interest here and if there are bounds on the sizes of the groups,
then we have comparable bounds on the accuracy of the rank statistics in question.

Figure 1. The t-digest algorithm respects the size limit for centroids. The
solid grey line indicates the size limit. These diagrams also show actual centroid weights for 5 test runs on 100,000 samples from a uniform, Γ(0.1, 0.1),
and sequential uniform distribution. In spite of the underlying distribution
being skewed by roughly 30 orders of magnitude of difference in probability
density for the Γ distribution, the centroid weight distribution is bounded
and symmetric as intended. For the sequential uniform case, values are
produced in sequential order with three passes through the [0, 1] interval
with no repeated values. In spite of the lack of repeats, successive passes
result in many centroids at the same quantiles with the same sizes. In spite of
this, sequential presentation of data results in only a small asymmetry in
the resulting size distribution and no violation of the intended size limit.
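For reference, here is a small Python sketch of the sorted-and-grouped baseline described before Figure 1: sort the data, then greedily cut a new group whenever the 4nδq(1 − q) bound would be exceeded. The function name and greedy cut rule are illustrative assumptions; this is only the comparison baseline, not part of the on-line algorithm.

```python
def ideal_groups(values, delta):
    """Greedily group sorted values so that each group obeys the 4*n*delta*q*(1-q) bound."""
    xs = sorted(values)
    n = len(xs)
    groups, current, seen = [], [], 0          # 'seen' counts values in completed groups
    for x in xs:
        q = (seen + len(current) / 2.0) / n    # rough quantile of the current group
        limit = max(1.0, 4.0 * n * delta * q * (1.0 - q))
        if current and len(current) + 1 > limit:
            groups.append(current)             # close the group before it exceeds the bound
            seen += len(current)
            current = []
        current.append(x)
    if current:
        groups.append(current)
    return groups
```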
That this algorithm does produce such an approximation is more difficult to prove rigorously, but an empirical examination is enlightening. Figure 3 shows the deviation of
samples assigned to centroids for uniform and highly skewed distributions. These deviations are normalized by half the distance between the two adjacent centroids. This
relatively uniform distribution of deviations among the samples assigned to a centroid is
found for uniformly distributed samples as well as for extremely skewed data. For instance,
the Γ(0.1, 0.1) distribution has a 0.01 quantile of 6.07 × 10^{−20}, a median of 0.006 and a mean
of 1. This means that the distribution is very skewed. In spite of this, samples assigned
to centroids near the first percentile are not noticeably skewed within the centroid. The
impact of this uniform distribution is that linear interpolation allows accuracy considerably
better than δq(1 − q).
2.3. Finding the cumulative distribution at a point. Algorithm 3 shows how to
compute the cumulative distribution P(x) = ∫_{−∞}^{x} p(α) dα for a given value of x by summing
the contribution of uniform distributions centered at each of the centroids.
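A simplified Python sketch of this computation follows, assuming the centroids are given as (mean, count) pairs sorted by mean; the handling of the two extreme ends is cruder than in Algorithm 3 and is only meant to illustrate the interpolation idea.

```python
def cdf(centroids, x):
    """Approximate P(X <= x). Each centroid's samples are treated as spread uniformly
    over the interval reaching halfway to each neighbouring centroid."""
    n = float(sum(count for _, count in centroids))
    t = 0.0                                   # weight strictly below the current interval
    for i, (m, k) in enumerate(centroids):
        left = centroids[i - 1][0] if i > 0 else m
        right = centroids[i + 1][0] if i + 1 < len(centroids) else m
        lo, hi = (left + m) / 2.0, (m + right) / 2.0
        if x < lo:
            return t / n
        if x < hi:
            frac = (x - lo) / (hi - lo) if hi > lo else 0.5
            return (t + k * frac) / n         # interpolate within this centroid
        t += k
    return 1.0

# e.g. cdf([(1.0, 2), (2.0, 4), (3.0, 2)], 2.0) == 0.5
```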
[Figure: histograms of the deviations of samples assigned to centroids, with the horizontal axis running from −1.0 to 1.0; panels include the Gamma distribution at q = 0.01.]
2.5. Computing the trimmed mean. The trimmed mean of X for the quantile range
Q = [q0, q1] can be computed as a weighted average of the means of the centroids
that have quantiles in Q. For centroids at the edge of Q, a pro rata weight is used that is
based on an interpolated estimate of the fraction of the centroid's samples that are in Q.
This method is shown as Algorithm 5.

[Algorithm 3 (fragment): z ← max(−1, (x − mi)/Δ); if z < 1, return t/N + (ki/N)(z + 1)/2; otherwise t ← t + ki; if x lies beyond the last centroid, return 1.]
3. Empirical Assessment
3.1. Accuracy of estimation for uniform and skewed distributions. Figure 4 shows
the error levels achieved with t-digest in estimating quantiles of 100,000 samples from a
uniform and from a skewed distribution.
Algorithm 4: Estimate trimmed mean. Note how centroids at the boundary are
included on a pro rata basis.
Input: Centroids derived from distribution p(x), C = [. . . [mi, si, ki] . . .], limit values q1, q2
Output: Estimate of mean of values x ∈ [q1, q2]
[Listing fragment: with s = k = t = 0 and the quantile limits rescaled to counts (q1 ← q1 Σ ki, q2 ← q2 Σ ki), the loop over the centroids accumulates s ← s + ki ci.mean and k ← k + ki once q1 < t + ki; at each boundary it computes a neighbour spacing Δ from the adjacent centroid means and a pro rata factor (η = (q1 − t)/ki − 1/2 at the lower limit, η = 1/2 − (q2 − t)/ki at the upper limit) used to adjust the boundary centroid's contribution to s and k; it then advances t ← t + ki and finally returns s/k.]
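Because the listing above is only partially recoverable, here is a minimal Python sketch of the pro-rata idea it describes. It weights boundary centroids by the fraction of their count that falls inside the quantile range, but omits the within-centroid interpolation (the Δ adjustment), so it is an approximation of the method rather than the authors' algorithm.

```python
def trimmed_mean(centroids, q1, q2):
    """Approximate mean of the values between quantiles q1 and q2 (0 <= q1 < q2 <= 1),
    given centroids as (mean, count) pairs sorted by mean."""
    n = float(sum(k for _, k in centroids))
    lo, hi = q1 * n, q2 * n          # express the quantile limits as counts
    s = 0.0                          # weighted sum of included means
    k_in = 0.0                       # total included weight
    t = 0.0                          # weight seen before the current centroid
    for m, k in centroids:
        if t + k > lo and t < hi:
            inside = (min(t + k, hi) - max(t, lo)) / k   # pro rata fraction of this centroid
            s += inside * k * m
            k_in += inside * k
        t += k
    return s / k_in if k_in > 0 else float("nan")

# e.g. trimmed_mean([(1.0, 2), (2.0, 4), (3.0, 2)], 0.25, 0.75) == 2.0
```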
In these experiments, δ = 0.01 was used since it
provides a good compromise between accuracy and space. There is no visible difference in
accuracy between the two underlying distributions in spite of the fact that the underlying
densities differ by roughly 30 orders of magnitude. The accuracy shown here is
computed by comparing the computed quantiles to the actual empirical quantiles for the
sample used for testing and is shown in terms of q rather than the underlying sample value.
At extreme values of q, the actual samples are preserved as centroids with weight 1, so the
observed error for these extreme values is zero relative to the original data. For the data shown
here, at q = 0.001, the maximum weight on a centroid is just above 4 and centroids in this
range have all possible weights from 1 to 4. Errors are limited to, not surprisingly, just
a few parts per million or less. For more extreme quantiles, the centroids will have fewer
samples and the results will typically be exact.
Figure 3. The absolute error of the estimate of the cumulative distribution
function q = ∫_{−∞}^{x} p(α) dα for the uniform and Γ(0.1, 0.1) distributions for 5 runs,
each with 100,000 data points. As can be seen, the error is dramatically
decreased for very high or very low quantiles (to a few parts per million).
The precision setting used here, 1/δ = 100, would result in uniform error of
10,000 ppm without adaptive bin sizing and interpolation.
Obviously, with the relatively small numbers of samples such as are used in these experiments, the accuracy of t-digests for estimating quantiles of the underlying distribution
cannot be better than the accuracy of these estimates computed using the sample data
points themselves. For the experiments here, the errors due to sampling completely dominate the errors introduced by t-digests, especially at extreme values of q. For much larger
sample sets of billions of samples or more, this would be less true and the errors shown
here would represent the accuracy of approximating the underlying distribution.
It should be noted that using a Q-Digest implemented with long integers is only able
to store data with no more than 20 significant decimal figures. The implementation in
stream-lib only retains 48 bits of significand, allowing only about 16 significant figures.
This means that such a Q-digest would be inherently unable to even estimate the quantiles
of the distribution tested here.
3.2. Persisting t-digests. For the accuracy setting and test data used in these experiments, the t-digest contained 820–860 centroids. The results of t-digest can thus be stored
by storing this many centroid means and weights. If centroids are kept as double precision
floating point numbers and counts kept as 4-byte integers, the t-digest resulting from
the accuracy tests described here would require about 10 kilobytes of storage for any of
the distributions tested.
This size can be substantially decreased, however. One simple option is to store differences between centroid means and to use a variable byte encoding such as zig-zag encoding
to store the cluster size. The differences between successive means are at least three orders
of magnitude smaller than the means themselves so using single precision floating point
to store these differences can allow the t-digest from the tests described here to be stored
in about 4.6 kilobytes while still retaining nearly 10 significant figures of accuracy in the
means. This is roughly equivalent to the precision possible with a Q-digest operating on
32 bit integers, but the dynamic range of t-digests will be considerably higher and the
accuracy is considerably better.
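As an illustration of this kind of encoding (a hypothetical layout, not the wire format of any particular t-digest library), the sketch below stores the first mean at full precision, subsequent means as single-precision deltas, and counts as unsigned LEB128-style varints; zig-zag encoding, as mentioned above, would only be needed if signed integer deltas were stored.

```python
import struct

def serialize(centroids):
    """Compact, order-dependent encoding of (mean, count) pairs sorted by mean."""
    out = bytearray()
    out += struct.pack(">I", len(centroids))
    prev = None
    for mean, count in centroids:
        if prev is None:
            out += struct.pack(">d", mean)          # first mean: full double precision
        else:
            out += struct.pack(">f", mean - prev)   # later means: small float32 deltas
        prev = mean
        c = int(count)                              # unsigned varint for the count
        while True:
            byte = c & 0x7F
            c >>= 7
            out.append(byte | (0x80 if c else 0x00))
            if c == 0:
                break
    return bytes(out)

# e.g. len(serialize([(1.0, 2), (2.0, 4), (3.0, 2)])) == 4 + 8 + 1 + 4 + 1 + 4 + 1
```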
3.3. Space/Accuracy Trade-off. Not surprisingly, there is a strong trade-off between
the size of the t-digest as controlled by the compression parameter 1/δ and the accuracy
with which quantiles are estimated. Quantiles at 0.999 and above or 0.001 or below were
estimated to within a small fraction of 1% regardless of digest size. Accurate estimates of
the median require substantially larger digests. Figure 5 shows this basic trade-off.
The size of the resulting digest depends strongly on the compression parameter 1/δ as
shown in the left panel of Figure 6. Size of the digest also grows roughly with the log of the
number of samples observed, at least in the range of 10,000 to 10,000,000 samples shown
in the right panel of Figure 6.
3.4. Computing t-digests in parallel. With large scale computations, it is important to
be able to compute aggregates like the t-digest on portions of the input and then combine
those aggregates.
For example, in a map-reduce framework such as Hadoop, a combiner function can
compute the t-digest for the output of each mapper and a single reducer can be used to
compute the t-digest for the entire data set.
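A sketch of how such a combine step could look, reusing the hypothetical TDigest sketch from Section 2 (so the class and method names are assumptions, not a particular library's API):

```python
import random

def merge_digests(digests, delta=0.01, K=25):
    """Combine independently built t-digests by reclustering the union of their centroids."""
    merged = TDigest(delta=delta, K=K)
    centroids = [c for d in digests for c in d.C]
    random.shuffle(centroids)               # randomize order before reclustering
    for c in centroids:
        merged.add(c.mean, c.count)         # weighted adds preserve the total count
    return merged
```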
Another example can be found in certain databases such as Couchbase or Druid which
maintain tree structured indices and allow the programmer to specify that particular aggregates of the data being stored can be kept at interior nodes of the index. The benefit of
this is that aggregation can be done almost instantaneously over any contiguous sub-range
[Figure: quantile error at q = 0.001 and q = 0.5 as a function of t-digest size.]
Figure 5. Size of the digest scales sub-linearly with the compression parameter
(exponent ≈ 0.7 . . . 0.9) for a fixed number of points. Size scales approximately
logarithmically with the number of points for a fixed compression parameter. The
panel on the right is for 1/δ = 100. The dashed lines show best-fit log-linear
models. In addition, the right panel shows the memory size required for the
GK01 algorithm if 6 specific quantiles are desired.
compared to the similar parameter 1/δ for the t-digest. For the same value of the compression
parameter, the sizes of the two digests are always within a factor of 2 for practical uses. The
middle and right panels show accuracies for the uniform and Γ distributions.
As expected, the t-digest has very good accuracy for extreme quantiles while the Q-digest has constant error across the range. Interestingly, even at its best the accuracy of the Q-digest is
roughly an order of magnitude worse than the accuracy of the t-digest. At worst,
with extreme values of q, accuracy is several orders of magnitude worse. This situation
is even worse with a highly skewed distribution such as the Γ(0.1, 0.1) shown in the
right panel. Here, the very high dynamic range introduces severe quantization errors into
the results. This quantization is inherent in the use of integers in the Q-digest.
For higher compression parameter values, the size of the Q-digest becomes up to two
times smaller than the t-digest, but no improvement in the error rates is observed.
3.6. Speed. The current implementation has been primarily optimized for ease of development, not execution speed. As it is, running on a single core of a 2.3 GHz Intel Core i5,
it typically takes 2-3 microseconds to process each point after JIT optimization has come
into effect. It is to be expected that a substantial improvement in speed could be had by
profiling and cleaning up the prototype code.
[Figure: quantile error for t-digests computed directly and by merging 5, 20, and 100 separately computed parts (1/δ = 50), plotted against quantile q.]
[Figure panels: serialized Q-digest size versus serialized t-digest size (bytes), and accuracy comparisons for the uniform and Γ(0.1, 0.1) distributions at 1/δ = 50.]
Figure 7. The left panel shows the size of a serialized Q-digest versus the
size of a serialized t-digest for various values of 1/δ from 2 to 100,000. The
sizes for the two kinds of digest are within a factor of 2 for all compression
levels. The middle and right panels show the accuracy for a particular
setting of 1/δ for Q-digest and t-digest. For each quantile, the Q-digest
accuracy is shown as the left bar in a pair and the t-digest accuracy is shown
as the right bar in a pair. Note that the vertical scale in these diagrams
is one or two orders of magnitude larger than in the previous accuracy
graphs and that in all cases, the accuracy of the t-digest is dramatically
better than that of the Q-digest even though the serialized size of each
is within 10% of the other. Note that the errors in the right hand panel
are systematic quantization errors introduced by the use of integers in the
Q-digest algorithm. Any distribution with very large dynamic range will
show the same problems.
References

[CLP00] Fei Chen, Diane Lambert, and José C. Pinheiro. Incremental quantile estimation for massive tracking. In Proceedings of KDD, pages 516–522, 2000.
[GK01] Michael Greenwald and Sanjeev Khanna. Space-efficient online computation of quantile summaries. In SIGMOD, pages 58–66, 2001.
[Knu98] Donald E. Knuth. The Art of Computer Programming, volume 2: Seminumerical Algorithms, page 232. Addison-Wesley, Boston, 3rd edition, 1998.
[Lin] LinkedIn. Datafu: Hadoop library for large-scale data processing. https://github.com/linkedin/datafu/. [Online; accessed 20-December-2013].
[PDGQ05] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the data: Parallel analysis with Sawzall. Sci. Program., 13(4):277–298, October 2005.
[SBAS04] Nisheeth Shrivastava, Chiranjeeb Buragohain, Divyakant Agrawal, and Subhash Suri. Medians and beyond: New aggregation techniques for sensor networks. Pages 239–249. ACM Press, 2004.
[Thi] AddThis. Algorithms for calculating variance, online algorithm. https://github.com/addthis/stream-lib. [Online; accessed 28-November-2013].
[Wel62] B. P. Welford. Note on a method for calculating corrected sums of squares and products. Technometrics, pages 419–420, 1962.
[Wik] Wikipedia. Algorithms for calculating variance, online algorithm. http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm. [Online; accessed 19-October-2013].
[ZW07] Qi Zhang and Wei Wang. An efficient algorithm for approximate biased quantile computation in data streams. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM '07, pages 1023–1026, New York, NY, USA, 2007. ACM.