Balancing Weight-Balanced Trees
Cambridge University Press 2011
doi:10.1017/S0956796811000104
YOICHI HIRAI
The University of Tokyo, JSPS Research Fellow
(e-mail: [email protected])
KAZUHIKO YAMAMOTO
IIJ Innovation Institute Inc.
(e-mail: [email protected])
Abstract
A weight-balanced tree (WBT) is a binary search tree whose balance is based on the sizes
of the subtrees in each node. Although purely functional implementations of a variant
WBT algorithm are widely used in functional programming languages, many existing
implementations do not maintain balance after deletion in some cases. The difficulty lies
in choosing a valid pair of rotation parameters: one for standard balance and the other for
choosing single or double rotation. This paper identifies the exact valid range of the rotation
parameters for insertion and deletion in the original WBT algorithm, where one and only
one integer solution exists. Soundness of the range is proved using the proof assistant Coq.
Completeness is proved using effective algorithms generating counterexample trees. For two
specific parameter pairs, we also proved in Coq that set operations maintain balance.
Since the difference between the original WBT and the variant WBT is small, it is easy to
change the existing buggy implementations based on the variant WBT to the certified original
WBT with a rational solution.
1 Introduction
Weight-balanced trees (WBTs) (Nievergelt & Reingold, 1972) are binary search
trees, which can be used to implement finite sets and finite maps (associative arrays).
Although other balanced binary search trees, such as AVL trees (Adelson-Velskii &
Landis, 1962) and red-black trees (Guibas & Sedgewick, 1978), use the height of
subtrees for balancing, the balance of WBTs is based on the sizes (number of
elements) of the subtrees below each node. Its purely functional implementations
are widely used in functional programming languages. In fact, fundamental modules
Data.Set and Data.Map in Haskell (Marlow, 2010) and the wttree.scm library in
MIT/GNU Scheme and slib are based on a variant of the WBT algorithm (Adams,
1993).
In order to ensure performance, the algorithm keeps the height of a tree
logarithmic in its size by balancing the sizes of the subtrees in each node. In
2010, a bug report [1] confirmed that the Data.Map library broke the tree balance after
deletion. We investigated the existing literature but failed to find a rigorous proof
that both insertion and deletion preserve the balance of WBTs. Instead, we found
that proving balance preservation requires checking several inequalities in 14 cases
of program behaviors for five different parameter zones. We used the proof assistant
Coq (Bertot & Casteran, 2004) in order to cope with this intensive case analysis.
[1] http://hackage.haskell.org/trac/ghc/ticket/4242
Fig. 1. A single left rotation and a double left rotation. a, b, and c are elements. x, y, y0, y1,
and z are the sizes of the trees. If y is too large, a double rotation is chosen. Otherwise, a
single rotation is used.
To keep the balance of WBTs, there are two important parameters $\Delta$ and $\Gamma$:
$\Delta$ decides whether any rotation is made at all and $\Gamma$ chooses a single rotation or a
double rotation (Figure 1). These parameters must ensure that a newly created tree
is balanced after any insertion or deletion in a given balanced WBT. The original
paper of WBT suggests $(\Delta, \Gamma) = (1 + \sqrt{2}, \sqrt{2})$. To implement the originally
suggested balance condition with integer
arithmetic, we have to compare the squares of the weights. Here is a straightforward
implementation:
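-- Balance test for Delta = 1 + sqrt 2, written with squared weights.
isBalanced :: Set a -> Set a -> Bool
isBalanced a b = 2 * y * y <= z * z
  where x = size a + 1
        y = size b + 1
        z = x + y

-- Single-rotation test for Gamma = sqrt 2, written with squared weights.
isSingle :: Set a -> Set a -> Bool
isSingle a b = z * z < 2 * w * w
  where z = size a + 1
        w = size b + 1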
Since integers in typical computer languages are fixed length, this calculation is prone
to overflow. In order to avoid this problem, we have to deploy a more complicated
implementation. Rational parameters are preferable.
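To make the overflow concern concrete, the following sketch evaluates the squared comparison once with arbitrary-precision Integer and once with the 32-bit Int32 type. The helper name isBalancedSq and the use of Int32 as a stand-in for a fixed-length integer are assumptions made for this illustration; the helper takes the weights (size + 1) directly instead of trees.

```haskell
import Data.Int (Int32)

-- Squared balance test for Delta = 1 + sqrt 2, applied directly to the
-- weights x = size a + 1 and y = size b + 1 (hypothetical helper).
isBalancedSq :: Integral n => n -> n -> Bool
isBalancedSq x y = 2 * y * y <= z * z
  where z = x + y

-- The same weights evaluated with two integer types.
exactAnswer :: Bool
exactAnswer = isBalancedSq (20000 :: Integer) 30000

wrappedAnswer :: Bool
wrappedAnswer = isBalancedSq (20000 :: Int32) 30000
```

With these weights the exact answer is True, whereas the 32-bit version yields False: z * z = 2,500,000,000 exceeds the largest Int32 value and wraps around to a negative number.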
6 Identifying valid range
Before delving into rigorous mathematical analysis, we used our test suite described
in Section 4 and the Omega solver (Pugh, 1991) to identify the range of valid
parameters. Since only (4, 2) and (3, 2) are possible integer solutions for the variant
WBT, we guessed the valid range for the original WBT is around them. We tested
the original WBT using our test suite, where $\Delta$ is an integer between 1 and 10 and
$\Gamma$ is an integer between 1 and 10. These tests showed that only (3, 2) is a possible
integer solution for the original WBT.
Fig. 3. Results of tests plotted along $(\Delta, \Gamma)$, with $\Delta$ from 2 to 5 on the horizontal
axis and $\Gamma$ from 0.8 to 2.2 on the vertical axis. The dotted square symbols indicate that neither
insertion nor deletion broke the balance. The plus symbols + indicate discovery of concrete
counterexamples.
6.1 Tests
To obtain a more precise parameter range around (3, 2), we tested not only with
integer parameters, but also with rational parameters. Figure 3 shows the results
for the range $\Delta = 2, 2.1, \ldots, 5$ and $\Gamma = 1, 1.05, \ldots, 2.2$. The shape of the valid
range seemed more complex than we had expected.
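The tested grid can be written down exactly with Rational values, which avoids floating-point drift in steps such as 0.05. In the sketch below, the names deltas, gammas, grid, isBalancedQ, and isSingleQ are hypothetical. The two predicates assume that the integer conditions used for (3, 2), namely 3 * (size a + 1) >= size b + 1 and size a + 1 < 2 * (size b + 1), generalize to rational $\Delta$ and $\Gamma$ in the obvious way; they take subtree sizes as plain integers.

```haskell
import Data.Ratio ((%))

-- Delta = 2, 2.1, ..., 5 (31 values).
deltas :: [Rational]
deltas = [2 + k % 10 | k <- [0 .. 30]]

-- Gamma = 1, 1.05, ..., 2.2 (25 values).
gammas :: [Rational]
gammas = [1 + k % 20 | k <- [0 .. 24]]

-- All parameter pairs of the grid.
grid :: [(Rational, Rational)]
grid = [(d, g) | d <- deltas, g <- gammas]

-- Balance test on subtree sizes sa and sb: Delta * (sa + 1) >= sb + 1.
isBalancedQ :: Rational -> Integer -> Integer -> Bool
isBalancedQ delta sa sb = delta * fromInteger (sa + 1) >= fromInteger (sb + 1)

-- Single-rotation test on subtree sizes: sa + 1 < Gamma * (sb + 1).
isSingleQ :: Rational -> Integer -> Integer -> Bool
isSingleQ gamma sa sb = fromInteger (sa + 1) < gamma * fromInteger (sb + 1)
```

The grid contains 31 * 25 = 775 pairs, and every comparison is carried out in exact rational arithmetic.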
6.2 Automated arithmetic solver
We also ran the automated arithmetic solver Omega (Pugh, 1991) with the same set
of parameter pairs. For each parameter pair, we gave a logical formula in Presburger
arithmetic to the solver. The logical formulas expressed the possibility of balance
breaking by insertion or deletion in a WBT tree.
We produced the inputs for Omega by defining the balance preservation condition
in Coq and then converting the condition into an assertion that only contains linear
inequalities. We then replaced the conjunction in Coq (/\) with the conjunction
in Omega (&&), added parentheses, and applied some other cosmetic changes. The
Omega solver showed three kinds of behavior. In some cases, it
gave a concrete counterexample within a second. In other cases,
within a second, the solver confirmed that any large enough balanced WBT remains
balanced after any insertion or deletion. Sometimes, the solver gave up or did not
respond for a couple of minutes, so we terminated it. Figure 4 illustrates the result.
Comparing Figures 3 and 4, we saw that Omega found new counterexamples for
parameters where QuickCheck was not able to find any. We guessed that the first experiment
could not find some existing counterexamples outside the lower boundaries.
6.3 Finding boundaries
We conjectured that the following four boundaries determine the valid parameter
range:
Right $\Delta < 4.5$
Left $\Gamma \le \Delta - 1$
Lower $\Gamma \ge (\Delta + 1)/\Delta$
Upper $\Gamma \le 3/2$ for $2.5 \le \Delta < 3$, $\Gamma \le 4/2$ for $3 \le \Delta < 3.5$, $\Gamma \le 4/3$ for $3.5 \le \Delta < 4$, and $\Gamma \le 5/3$ for $4 \le \Delta < 4.5$
Fig. 4. Results using the automated arithmetic solver Omega plotted along $(\Delta, \Gamma)$. The square
symbols indicate the parameter pairs are valid. The plus symbols + indicate discovery of
concrete counterexamples. Blanks mean time-out.
Fig. 5. The boundaries of the valid parameter range for the original WBT are shown with
inequalities: $\Gamma \ge (\Delta + 1)/\Delta$, $\Delta < 4.5$, $\Gamma \le \Delta - 1$, and the upper steps
$\Gamma = 3/2$, $4/2$, $4/3$, and $5/3$. The points are some rational valid parameters.
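As a compact summary, the conjectured region can be written as a predicate on rational parameters. This is only a sketch: the function names are hypothetical, and the piecewise upper boundary is taken from the labels of Figure 5 and the table in Section 8.4 (3/2, 4/2, 4/3, and 5/3 on the four half-unit intervals of $\Delta$ between 2.5 and 4.5). It assumes no separate upper constraint below $\Delta = 2.5$, where the left boundary $\Gamma \le \Delta - 1$ is already tighter.

```haskell
import Data.Ratio ((%))

-- Upper boundary on Gamma for a given Delta (assumed piecewise constant).
upperBound :: Rational -> Maybe Rational
upperBound d
  | d < 5 % 2 = Nothing          -- the left boundary already dominates here
  | d < 3     = Just (3 % 2)
  | d < 7 % 2 = Just (4 % 2)
  | d < 4     = Just (4 % 3)
  | d < 9 % 2 = Just (5 % 3)
  | otherwise = Nothing          -- excluded by the right boundary anyway

-- Conjectured valid range: right, left, lower, and upper boundaries.
inValidRange :: Rational -> Rational -> Bool
inValidRange d g = right && left && lower && upper
  where
    right = d < 9 % 2
    left  = g <= d - 1
    lower = d > 0 && g >= (d + 1) / d
    upper = maybe True (g <=) (upperBound d)

-- Integer pairs in 1..10 x 1..10 that fall inside the range.
integerSolutions :: [(Integer, Integer)]
integerSolutions =
  [ (d, g) | d <- [1 .. 10], g <- [1 .. 10]
  , inValidRange (fromInteger d) (fromInteger g) ]
```

Under these assumptions integerSolutions evaluates to [(3, 2)], matching the single integer solution found by the tests, and inValidRange (5 % 2) (3 % 2) holds as well.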
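[...]
\[ z = \left\lfloor \frac{y + 1}{\Gamma} \right\rfloor, \qquad x = \left\lceil \frac{y + z + 2}{\Delta} \right\rceil - 1. \]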
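This original tree is balanced because of the following. At node c, the right subtree
of size z is not too much larger than the left subtree of size y for sufficiently large y:
\[
\begin{aligned}
\Delta(y + 1) - (z + 1) &= \Delta(y + 1) - \left( \left\lfloor \frac{y + 1}{\Gamma} \right\rfloor + 1 \right) \\
&\ge \Delta(y + 1) - \left( \frac{y + 1}{\Gamma} + 1 \right) \\
&= \left( \Delta - \frac{1}{\Gamma} \right) y + C \\
&> 0
\end{aligned}
\]
where C does not contain x, y, or z. The last inequality holds for large y because
the coefficient for y is $\Delta - 1/\Gamma$, which is positive since $\Delta > 1 > 1/\Gamma$.
The left subtree of c is not too much larger than the right subtree, either:
\[
\begin{aligned}
\Delta(z + 1) - (y + 1) &= \Delta\left( \left\lfloor \frac{y + 1}{\Gamma} \right\rfloor + 1 \right) - (y + 1) \\
&> \Delta\,\frac{y + 1}{\Gamma} - (y + 1) \\
&= \left( \frac{\Delta}{\Gamma} - 1 \right)(y + 1) \\
&\ge 0 \quad \text{(by Equation (2))}.
\end{aligned}
\]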
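At node a, the right subtree is not smaller: $x \le y + z + 1$. At the same time, the
right subtree is not too large:
\[ \Delta(x + 1) - (y + z + 2) = \Delta\left\lceil \frac{y + z + 2}{\Delta} \right\rceil - (y + z + 2) \ge \Delta\,\frac{y + z + 2}{\Delta} - (y + z + 2) = 0. \]
If we delete one element from the left subtree of node a, the balance is broken as
follows:
\[ \Delta x - (y + z + 2) = \Delta\left( \left\lceil \frac{y + z + 2}{\Delta} \right\rceil - 1 \right) - (y + z + 2) < \Delta\,\frac{y + z + 2}{\Delta} - (y + z + 2) = 0. \]
At this time, a single rotation is chosen because
\[ \Gamma(z + 1) - (y + 1) = \Gamma\left( \left\lfloor \frac{y + 1}{\Gamma} \right\rfloor + 1 \right) - (y + 1) > \Gamma\,\frac{y + 1}{\Gamma} - (y + 1) = 0. \]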
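After the single rotation, the size of the left subtree of node c is x + y and that
of the right subtree is z. The balance is broken if the following expression has a
negative value:
\[
\begin{aligned}
\Delta(z + 1) - (x + y + 1) &= \Delta\left( \left\lfloor \frac{y + 1}{\Gamma} \right\rfloor + 1 \right) - \left( \left\lceil \frac{y + \left\lfloor \frac{y + 1}{\Gamma} \right\rfloor + 2}{\Delta} \right\rceil + y \right) \\
&\le \Delta\left( \frac{y + 1}{\Gamma} + 1 \right) - \left( \frac{y + \left\lfloor \frac{y + 1}{\Gamma} \right\rfloor + 2}{\Delta} + y \right) \\
&< \Delta\left( \frac{y + 1}{\Gamma} + 1 \right) - \left( \frac{y + \frac{y + 1}{\Gamma} + 1}{\Delta} + y \right) \\
&= \frac{(\Delta + 1)(\Delta - \Gamma - 1)}{\Delta\Gamma}\, y + C
\end{aligned}
\]
where C does not contain x, y, or z. The coefficient for y is negative by Equation (3).
This implies that, for sufficiently large y, the rotated tree becomes unbalanced after a
delete operation.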
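The counterexample for parameters outside the left boundary ($\Gamma > \Delta - 1$) can be replayed numerically with exact rational arithmetic. The sketch below is an illustration under stated assumptions: it uses $z = \lfloor (y+1)/\Gamma \rfloor$ and $x = \lceil (y+z+2)/\Delta \rceil - 1$ as the subtree sizes, works on sizes only (no actual tree is built), and all function names are hypothetical.

```haskell
-- Balance test on weights (size + 1): Delta * wa >= wb.
okWeights :: Rational -> Integer -> Integer -> Bool
okWeights delta wa wb = delta * fromInteger wa >= fromInteger wb

-- Subtree sizes (x, y, z) of the counterexample for a chosen y.
leftSizes :: Rational -> Rational -> Integer -> (Integer, Integer, Integer)
leftSizes delta gamma y = (x, y, z)
  where
    z = floor (fromInteger (y + 1) / gamma)
    x = ceiling (fromInteger (y + z + 2) / delta) - 1

-- True when the tree is balanced before the deletion, a single rotation is
-- chosen afterwards, and node c is unbalanced after that rotation.
leftCounterexample :: Rational -> Rational -> Integer -> Bool
leftCounterexample delta gamma y0 =
  balancedBefore && needsRotation && singleChosen && brokenAfter
  where
    (x, y, z) = leftSizes delta gamma y0
    both a b = okWeights delta a b && okWeights delta b a
    balancedBefore = both (y + 1) (z + 1) && both (x + 1) (y + z + 2)
    needsRotation = not (okWeights delta x (y + z + 2))
    singleChosen = fromInteger (y + 1) < gamma * fromInteger (z + 1)
    brokenAfter = not (okWeights delta (z + 1) (x + y + 1))
```

For instance, leftCounterexample 3 (5/2) 100 holds (sizes x = 47, y = 100, z = 40), since $\Gamma = 5/2$ exceeds $\Delta - 1 = 2$, whereas leftCounterexample 3 2 100 does not.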
8.3 Outside the lower boundary
Assume that the parameter pair is outside the lower boundary:
\[ \Gamma < \frac{\Delta + 1}{\Delta}. \tag{4} \]
Fig. 11. A counterexample for parameter pairs outside the lower boundary. x, y, z, and w
denote the size of each subtree. The original tree on the left side is balanced. However, after
deletion of one element in the left subtree, a double rotation breaks the balance at node c if
z is large enough.
For large values of x, the original tree on the left side of the figure is balanced.
To see that, let us look at each node. On node a, the right subtree is not too large:
\[ \Delta(x + 1) = \Delta\left\lceil \frac{r + 1}{\Delta} \right\rceil \ge \Delta\,\frac{r + 1}{\Delta} = r + 1. \]
The left subtree of node a is not too large if r is large enough:
\[ \Delta(r + 1) > r + 2 > \lceil r + 1 \rceil \ge \left\lceil \frac{r + 1}{\Delta} \right\rceil = x + 1. \]
On node c, the right subtree is not too large:
\[ \Delta(y + z + 1 + 1) = \Delta\left( \lfloor \Delta(z + 1) \rfloor + z + 1 \right) > \lfloor \Delta(z + 1) \rfloor + 1 = w + 1. \]
On the other hand, the left subtree of node c is not too large for large values of z:
\[ \Delta(w + 1) = \Delta\left( \lfloor \Delta(z + 1) \rfloor + 1 \right) > \lfloor \Delta(z + 1) \rfloor + z + 1 = y + z + 2 \]
where the inequality in the middle holds when z is large enough. This is because
both sides are almost linear in z, where the coefficient on the left side, $\Delta^2$, is larger
than the coefficient on the right side, $\Delta + 1$. The inequality between the coefficients
$\Delta^2 > \Delta + 1$ holds because $\Delta > 2$.
On node b, the right subtree is not too large:
\[ \Delta(y + 1) = \Delta w = \Delta\lfloor \Delta(z + 1) \rfloor > \lfloor \Delta(z + 1) \rfloor > z + 1 \]
where the inequalities come from $\Delta > 2$ and the fact that z is an integer. On the
other hand, the left subtree of node b is not too large:
\[ \Delta(z + 1) \ge \lfloor \Delta(z + 1) \rfloor = w = y + 1. \]
Although the original tree is balanced as we have seen, if we delete an element
from the left subtree of node a, the balance is broken:
\[ \Delta x - (r + 1) = \Delta\left( \left\lceil \frac{r + 1}{\Delta} \right\rceil - 1 \right) - (r + 1) < \Delta\,\frac{r + 1}{\Delta} - (r + 1) = 0. \]
This implies either a single rotation or a double rotation takes place. Actually, a
double rotation takes place if z is large enough because the following expression has
a nonpositive value:
\[
\begin{aligned}
\Gamma(w + 1) - (y + z + 2) &= \Gamma\left( \lfloor \Delta(z + 1) \rfloor + 1 \right) - \left( \lfloor \Delta(z + 1) \rfloor + z + 1 \right) \\
&< \Gamma\left( \Delta(z + 1) + 1 \right) - \left( \Delta(z + 1) + z \right) \\
&= (\Delta\Gamma - \Delta - 1)\, z + C
\end{aligned}
\]
where C does not contain x, y, or z. The coefficient of z is negative according to
inequation (4). So, we can choose a large enough z that ensures a double rotation.
If a double rotation is chosen, the balance is broken at node c:
\[ \Delta(z + 1) - (w + 1) = \Delta(z + 1) - \left( \lfloor \Delta(z + 1) \rfloor + 1 \right) < \Delta(z + 1) - \Delta(z + 1) = 0. \]
Fig. 12. A counterexample for parameter pairs outside the upper boundary. x and y denote
the sizes of the subtrees. The original tree on the left side is balanced. However, deletion of
the single element in the left subtree breaks the balance. A double rotation maintains the
balance but a single rotation breaks the balance at node a. When the parameter pair is
outside the upper boundary, a single rotation is chosen.
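The counterexample of Section 8.3 for parameters outside the lower boundary ($\Gamma < (\Delta + 1)/\Delta$) can be checked in the same numerical style. This sketch assumes the sizes read off from the derivation: $w = \lfloor \Delta(z+1) \rfloor$, $y = w - 1$, $r = y + z + w + 2$ for the size of the right subtree of node a, and $x = \lceil (r+1)/\Delta \rceil - 1$. All function names are hypothetical and only sizes are manipulated.

```haskell
-- Balance test on weights (size + 1): Delta * wa >= wb.
okWeights :: Rational -> Integer -> Integer -> Bool
okWeights delta wa wb = delta * fromInteger wa >= fromInteger wb

-- Subtree sizes (x, y, z, w) of the counterexample for a chosen z.
lowerSizes :: Rational -> Integer -> (Integer, Integer, Integer, Integer)
lowerSizes delta z = (x, y, z, w)
  where
    w = floor (delta * fromInteger (z + 1))
    y = w - 1
    r = y + z + w + 2            -- assumed size of the right subtree of a
    x = ceiling (fromInteger (r + 1) / delta) - 1

-- True when the tree is balanced before the deletion, a double rotation is
-- chosen afterwards, and node c is unbalanced after that rotation.
lowerCounterexample :: Rational -> Rational -> Integer -> Bool
lowerCounterexample delta gamma z0 =
  balancedBefore && needsRotation && doubleChosen && brokenAfter
  where
    (x, y, z, w) = lowerSizes delta z0
    r = y + z + w + 2
    both a b = okWeights delta a b && okWeights delta b a
    balancedBefore =
      both (x + 1) (r + 1) && both (y + z + 2) (w + 1) && both (y + 1) (z + 1)
    needsRotation = not (okWeights delta x (r + 1))
    doubleChosen = not (fromInteger (y + z + 2) < gamma * fromInteger (w + 1))
    brokenAfter = not (okWeights delta (z + 1) (w + 1))
```

For example, lowerCounterexample 3 (5/4) 100 holds (sizes x = 235, y = 302, z = 100, w = 303), because $\Gamma = 5/4$ is below $(\Delta + 1)/\Delta = 4/3$, whereas lowerCounterexample 3 2 100 does not, since a single rotation is chosen there.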
8.4 Outside the upper boundaries
Some specific small trees determine the upper boundaries. Consider the trees in
Figure 12. The sizes of the subtrees in the figure are defined as follows:
\[ x = \lfloor \Delta \rfloor - 1, \qquad y = \left\lfloor \Delta - \tfrac{1}{2} \right\rfloor - 1. \]
Since we already have the other boundaries, we only have to consider $2.5 \le \Delta < 4.5$.
Thus, in this last case, we only have to deal with four different small trees. It is
easy to check that these four trees are balanced and that their balance is broken if the
left subtree of node a is removed. If a double rotation is chosen, the resulting tree
is balanced. If a single rotation is chosen, the balance at node a is broken:
\[ \Delta - (x + 2) = \Delta - \lfloor \Delta \rfloor - 1 < 0. \]
In order to obtain a counterexample, it is enough to ensure a single rotation. For
this, satisfying the following inequality is enough:
\[ \Gamma > \frac{x + 2}{y + 1}. \]
In the table below, we summarize the above result for the four different trees.
Range of $\Delta$          x   y   (x + 2)/(y + 1)
$2.5 \le \Delta < 3$       1   1   3/2
$3 \le \Delta < 3.5$       2   1   4/2
$3.5 \le \Delta < 4$       2   2   4/3
$4 \le \Delta < 4.5$       3   2   5/3
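The four rows of the table follow mechanically from the floor formulas $x = \lfloor \Delta \rfloor - 1$ and $y = \lfloor \Delta - 1/2 \rfloor - 1$. The sketch below recomputes them with exact rationals; the names smallSizes, threshold, and tableRows are hypothetical, and the left end point of each interval is used as its representative.

```haskell
import Data.Ratio ((%))

-- Sizes x and y of the small counterexample tree for a given Delta.
smallSizes :: Rational -> (Integer, Integer)
smallSizes delta = (floor delta - 1, floor (delta - 1 % 2) - 1)

-- Threshold (x + 2) / (y + 1): a larger Gamma forces a single rotation.
threshold :: Rational -> Rational
threshold delta = (x + 2) % (y + 1)
  where (x, y) = smallSizes delta

-- One row per half-unit interval of Delta between 2.5 and 4.5.
tableRows :: [(Rational, (Integer, Integer), Rational)]
tableRows = [(d, smallSizes d, threshold d) | d <- [5 % 2, 3, 7 % 2, 4]]
```

Evaluating tableRows gives the sizes (1, 1), (2, 1), (2, 2), (3, 2) and the thresholds 3/2, 2, 4/3, 5/3, in agreement with the table.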
Fig. 13. Performance of the insert operation using different WBT algorithms (vertical axis: time).
8.5 Tests of the counterexamples
To check the correctness of the boundaries, we defined four tests for each boundary
that produced counterexample trees following the descriptions given above. The
results of the tests are exactly the same as illustrated in Figure 5.
9 Performance
The balance constraints are ultimately for performance. We benchmarked the
original WBT with (3, 2) to compare against the variant WBT with (3, 2) and
(4, 2), and Logarithmic BST described in Section 10. Their code is based on the
Haskell Data.Map implementation in the containers package version 0.3.0.0 [7]. We
used a Dell OptiPlex 960 with a 2.66 GHz Intel Core 2 Quad CPU and 2 GB of memory
running Linux 2.6.35. The Haskell compiler was the Glasgow Haskell Compiler
version 6.12.3 with the -O2 option. Benchmarking a language with lazy evaluation
is not straightforward. We used the criterion package version 0.5.0.5 and the
progression package version 0.4 as reliable benchmark tools. Data.Map is defined
as strict and we used a strict data type Int as key. So, we removed the toList
overhead used in criterion when reducing Data.Map to its normal form. We also
benchmarked the original WBT with several rational parameters.
Comparison between the original and variant WBTs. We evaluated the performance
of the insertion operation, the deletion operation, and the lookup operation. For all
operations, we prepared 1k, 10k, and 100k elements both in increasing order
and in random order. They are labeled as inc $10^3$, inc $10^4$, inc $10^5$, rnd $10^3$, rnd $10^4$,
and rnd $10^5$, respectively, in Figures 13–15. Some error bars are invisibly short.
For the insertion operation, we measured the entire time to construct a WBT
from all elements. The results are illustrated in Figure 13. For the delete operation,
we first constructed a WBT from all elements and then measured the entire time
to delete each element in the insertion order from the full tree. The results are
illustrated in Figure 14. For the lookup operation, we first constructed a WBT
from all elements and then measured the entire time to look up each element in the
tree. The results are illustrated in Figure 15. To show the results of three different
sizes in a graph, we divide each entire time by each size. We can say that the original
WBT with (3, 2) has at least the same performance as the variant WBT with (3, 2)
and (4, 2) and Logarithmic BST.
[7] As of this writing, performance tuning is going on. The containers package version 0.3.0.0 does not
include such performance tuning.
Fig. 14. Performance of the delete operation using different WBT algorithms (vertical axis: time).
Fig. 15. Performance of the lookup operation using different WBT algorithms (vertical axis: time).
Comparison among different parameter choices for the original WBT. Likewise, we
compared the performance of eight different parameter pairs within the valid range
for insertion, deletion, and lookup (Figures 16–18). We found that the smaller
$\Delta$, which enforces the stricter balance condition, performs better. For incremental
inputs, the largest time difference between the slowest and the fastest reached 43%
for insertion. For randomized inputs, the largest difference was 14% for lookup.
10 Related work
Coq verification of balanced tree algorithms. Filliâtre and Letouzey (2004) proved
correctness of AVL tree and red-black tree implementations in Coq and extracted
OCaml code from the Coq implementation. At some stages during the
implementation, they were not able to prove a balancing condition in Coq. This
led to discovery of an implementation bug relating to the balance of the AVL tree
implementation in the OCaml standard library at the time. In this paper, we pointed
out balancing bugs of the algorithm, not merely in an implementation.
Charguéraud (2010) verified many functional tree algorithms in Okasaki's book
(Okasaki, 1998) with a new method of transforming a program into a proposition
transformer. However, neither Charguéraud's verification nor the book contains
WBT algorithms. If we applied Charguéraud's method to verifying WBT algorithms,
it would be much easier to verify an existing WBT implementation. However, the
arithmetic argument in the first half of our Coq script would still be useful.
In contrast to both Filliâtre and Letouzey's verification and Charguéraud's
verification, we have not verified that our target algorithm correctly implements
finite set/map operations. We expect this to be straightforward.
Another difference is that our target algorithm is parameterized and there
are many restrictions on the parameters. Moreover, some of the restrictions are
combinatorially determined by small trees of size 10. Many conditions on many
cases yield a large amount of case analysis, which makes hand-written proofs more
error-prone and machine certified proofs more advantageous in our case.
Fig. 16. Performance of the insert operation with different verified parameter pairs (vertical axis: time).
Fig. 17. Performance of the delete operation with different verified parameter pairs (vertical axis: time).
Fig. 18. Performance of the lookup operation with different verified parameter pairs (vertical axis: time).
Other balanced tree algorithms. Logarithmic BST (Roura, 2001) is another variant of
WBT. To implement Logarithmic BST, isBalanced and isSingle use bit operations
and other code can be shared with the WBT family.
-- Requires: import Data.Bits ((.&.), shiftL, shiftR)
(.<.) :: Size -> Size -> Bool
a .<. b
  | a >= b = False
  | otherwise = ((a .&. b) `shiftL` 1) < b

isBalanced a b = not (size a .<. (size b `shiftR` 1))
isSingle a b = not (size b .<. size a)
The paper (Roura, 2001) says "(the original WBT with $(1 + \sqrt{2}, \sqrt{2})$) which
is anyway an expensive property to check. This seems to be the main reason not to
use weighted BSTs as default balancing method." We show the original WBT with
our choice of parameters (3, 2) here in order to compare it with the Logarithmic
BST version shown above.
isBalanced :: Set a -> Set a -> Bool
isBalanced a b = 3 * (size a + 1) >= size b + 1
isSingle :: Set a -> Set a -> Bool
isSingle a b = size a + 1 < 2 * (size b + 1)
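The two predicates for (3, 2) can be plugged into a rebalancing step such as the following sketch, which performs the left rotations of Figure 1. The data type, bin, balanceL, and toList are hypothetical names chosen for this illustration rather than the actual Data.Set code. Only the left-rotating direction is shown; the symmetric right-rotating case is omitted.

```haskell
-- Hypothetical minimal set type; constructor and helper names are
-- illustrative and are not taken from Data.Set.
data Set a = Tip | Bin Int a (Set a) (Set a)

size :: Set a -> Int
size Tip = 0
size (Bin n _ _ _) = n

-- Smart constructor that recomputes the cached size.
bin :: a -> Set a -> Set a -> Set a
bin v l r = Bin (size l + size r + 1) v l r

isBalanced :: Set a -> Set a -> Bool
isBalanced a b = 3 * (size a + 1) >= size b + 1

isSingle :: Set a -> Set a -> Bool
isSingle a b = size a + 1 < 2 * (size b + 1)

-- Rebuild a node whose right subtree may be too heavy (for example after a
-- deletion on the left), using a single or a double left rotation.
balanceL :: a -> Set a -> Set a -> Set a
balanceL a x r
  | isBalanced x r = bin a x r
  | otherwise = case r of
      Bin _ c y z
        | isSingle y z -> bin c (bin a x y) z          -- single rotation
        | Bin _ b y0 y1 <- y ->
            bin b (bin a x y0) (bin c y1 z)            -- double rotation
      _ -> bin a x r  -- not reached when r is non-empty and too heavy

-- In-order traversal, used to check that rotations preserve the ordering.
toList :: Set a -> [a]
toList Tip = []
toList (Bin _ v l r) = toList l ++ [v] ++ toList r
```

For example, balanceL 1 Tip (bin 5 (bin 3 (bin 2 Tip Tip) Tip) Tip) takes the double-rotation branch and returns a tree rooted at 3 whose in-order traversal is still [1, 2, 3, 5].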
For mathematical reliability, Logarithmic BST is simpler, but we have shown that
rigorous analysis of the original WBT is attainable using Coq. For performance, we
benchmarked Logarithmic BST against the original WBT with (3, 2) (Figures 13–15).
For large ($10^5$ elements) trees on randomized inputs, the original WBT performs
as well as or slightly better than Logarithmic BST.
11 Conclusion
We identified the exact range of the valid rotation parameters of the original
weight-balanced tree and proved in Coq that it can maintain balance after any insertion
and deletion operations. Within the range, the only integer solution is (3, 2), which
allows simpler implementation of the original weight-balanced tree. Benchmarks
showed that the original weight-balanced tree with (3, 2) performs almost the same
as the variant at (3, 2) and (4, 2). We benchmarked other valid rational
parameters and found that the smaller $\Delta$ is the better performer. We proved in Coq
that set operations, such as union, intersection, and difference, can maintain balance
under (3, 2) and (5/2, 3/2). We also showed how to produce counterexamples outside
the boundaries of the valid range.
Acknowledgments
The authors would like to thank Taylor Campbell for his bug report that initiated
our research and Eijiro Sumii for discussion and instructive comments on our early
draft. The authors are grateful to anonymous referees for a number of presentation
improvements and a concise title.
References
Adams, S. (1992) Implementing sets efficiently in a functional language, Technical report CSTR
92-10. University of Southampton.
Adams, S. (1993) Efficient sets: A balancing act. J. Funct. Program., 3(4), 553–562.
Adelson-Velskii, G. M. & Landis, E. M. (1962) An algorithm for the organization of
information. Dokl. Akad. Nauk SSSR, 146(2), 263–266.
Bertot, Y. & Casteran, P. (2004) Interactive Theorem Proving and Program Development.
Coq'Art: The Calculus of Inductive Constructions. Springer.
Borchers, B. (1999) CSDP, a C library for semidefinite programming. Optim. Methods Softw.,
11(1), 613–623.
Charguéraud, A. (2010) Program verification through characteristic formulae. In Proceedings
of the 15th International Conference on Functional Programming (ICFP). ACM.
Claessen, K. & Hughes, J. (2000) QuickCheck: A lightweight tool for random testing of Haskell
programs. In Proceedings of the Fifth International Conference on Functional Programming
(ICFP). ACM.
Filliâtre, J.-C. & Letouzey, P. (2004) Functors for proofs and programs. In Programming
Languages and Systems, Schmidt, D. (ed), Lecture Notes in Computer Science, vol. 2986.
Springer, pp. 370–384.
Guibas, L. J. & Sedgewick, R. (1978) A dichromatic framework for balanced trees. In
Proceedings of the 19th Annual Symposium on Foundations of Computer Science (SFCS '78).
IEEE, pp. 8–21.
Knuth, D. E. (1998) The Art of Computer Programming: Sorting and Searching. 2nd ed., vol. 3.
Addison-Wesley.
Marlow, S., et al. (2010) Haskell 2010 Language Report, Marlow, S. (ed), Available online
http://www.haskell.org/ (May 2011).
Nievergelt, J. & Reingold, E. M. (1972) Binary search trees of bounded balance. In Proceedings
of the Fourth Annual ACM Symposium on Theory of Computing. ACM, pp. 137–142.
Okasaki, C. (1998) Purely Functional Data Structures. Cambridge University Press.
Pugh, W. (1991) The Omega test: A fast and practical integer programming algorithm for
dependence analysis. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing.
ACM.
Roura, S. (2001) A new method for balancing binary search trees. In Automata, Languages and
Programming, Orejas, F., Spirakis, P. & van Leeuwen, J. (eds), Lecture Notes in Computer
Science, vol. 2076. Springer, pp. 469–480.