CMSC 420: Lecture 12 Balancing by Rebuilding - Scapegoat Trees
Insertion:
• The key is first inserted just as in a standard (unbalanced) binary search tree.
• We monitor the depth of the inserted node after each insertion, and if it is too high,
there must be at least one node on the search path that has poor weight balance
(that is, its left and right children have very different sizes).
• In such a case, we find such a node, called the scapegoat,1 and we completely rebuild
the subtree rooted at this node so that it is perfectly balanced.
Deletion:
• The key is first deleted just as in a standard (unbalanced) binary search tree.
• Once the number of deletions performed is sufficiently large relative to the entire
tree size, rebuild the entire tree so it is perfectly balanced.
You might wonder why there is a notable asymmetry between the rebuilding rules for insertion
and deletion. The existence of a single very deep node is proof that a tree is out of balance.
Thus, for insertion, we can use the fact that the inserted node is too deep to trigger rebuilding.
However, observe that the converse does not work for deletion. The natural counterpart
would be “if the depth of the leaf node containing the deleted key is too small, then trigger a
rebuilding operation.” However, the fact that a single node has a low depth does not imply
that the rest of the tree is out of balance. (It may just be that a single search path has low
depth, while the rest of the tree is perfectly balanced.) Could we instead apply the deletion
rebuilding trigger to work for insertion? Again, this will not work. The natural counterpart
would be, “given a newly rebuilt tree with n keys, we will rebuild it after inserting roughly n/2
new keys.” However, if we are very unlucky, all these keys may fall along a single search path,
and the tree’s height could grow to roughly (log n) + n/2 = Θ(n), which is unacceptably
high.

1 The colorful term “scapegoat” refers to an individual who is assigned the blame when something goes
wrong. In this case, the unbalanced node takes the blame for the tree’s height being too great.
How to Rebuild a Subtree: Before getting to the details of how the scapegoat tree works, let’s
consider the basic operation that is needed to maintain balance, namely rebuilding subtrees
into balanced form. We shall see that if the subtree contains k keys, this operation can be
performed in O(k) time (see Fig. 1). Letting p denote the root node of the subtree to rebuild,
call this function rebuild(p):
• Perform an inorder traversal of p’s subtree, copying the keys to an array A[0..k-1],
where k denotes the number of nodes in this subtree. Note that the elements of A are
sorted.
• Invoke the following recursive subtree-building function: buildSubtree(A)
– Let k = A.length.
– If k == 0, return an empty tree, that is, null.
– Otherwise, let x be the median key, that is, A[j], where j = ⌊k/2⌋. Recursively invoke
L = buildSubtree(A[0..j-1]) and R = buildSubtree(A[j+1..k-1]). Finally, create an
internal node containing x with left subtree L and right subtree R. Return a pointer
to this new node.
Note that if A is implemented as a Java ArrayList, there is a handy function called subList
for performing the above splits. A sketch of this function is given in the code block below.
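Here is one possible Java rendering. The Node class (with key, value, left, and right fields)
and the choice to reuse the existing nodes rather than copy the keys into a fresh array are
illustrative assumptions, not requirements from the description above.

    import java.util.ArrayList;
    import java.util.List;

    // A sketch of a scapegoat-tree node: just a key, a value, and two child links.
    class Node<K, V> {
        K key;
        V value;
        Node<K, V> left, right;
        Node(K key, V value) { this.key = key; this.value = value; }
    }

    class SubtreeRebuilder {
        // Rebuild the subtree rooted at p into perfectly balanced form; returns the new root.
        static <K, V> Node<K, V> rebuild(Node<K, V> p) {
            List<Node<K, V>> A = new ArrayList<>();
            flatten(p, A);              // inorder traversal, so A is in sorted key order
            return buildSubtree(A);
        }

        // Inorder traversal: append the nodes of p's subtree to A in sorted order.
        static <K, V> void flatten(Node<K, V> p, List<Node<K, V>> A) {
            if (p == null) return;
            flatten(p.left, A);
            A.add(p);
            flatten(p.right, A);
        }

        // Recursively build a perfectly balanced tree from the sorted list A.
        static <K, V> Node<K, V> buildSubtree(List<Node<K, V>> A) {
            int k = A.size();
            if (k == 0) return null;                       // empty tree
            int j = k / 2;                                 // median index, floor(k/2)
            Node<K, V> x = A.get(j);                       // node holding the median key
            x.left = buildSubtree(A.subList(0, j));        // keys A[0..j-1]
            x.right = buildSubtree(A.subList(j + 1, k));   // keys A[j+1..k-1]
            return x;
        }
    }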
Fig. 1: Rebuilding a subtree into balanced form. The k = 6 keys are copied in sorted order into
A[0..k-1], the median A[j] with j = ⌊k/2⌋ = 3 becomes the root, and the subarrays A[0..j-1]
and A[j+1..k-1] are built recursively into its left and right subtrees.
Ignoring the time spent within recursive calls, each invocation of buildSubtree takes O(1) time,
and there are O(k) invocations in total, so the overall running time is O(k).
Scapegoat Tree Operations: In addition to the nodes themselves, the scapegoat tree maintains
two integer values. The first, denoted by n, is just the actual number of keys in the tree. The
second, denoted by m, is a special parameter, which is used to trigger the event of rebuilding
the entire tree.
In particular, whenever we insert a key, we increment m, but whenever we delete a key we
do not decrement m. Thus, m ≥ n. The difference m − n intuitively represents the number
of deletions. When we reach a point where m > 2n (or equivalently m − n > n) we can infer
that the number of deletions exceeds the number of keys remaining in the tree. In this case,
we rebuild the entire tree in balanced form.
We are now in a position to describe how to perform the dictionary operations for a scapegoat
tree.
find(Key x): The find operation is performed exactly as in a standard (unbalanced) binary
search tree. We will show that the height of the tree never exceeds log3/2 m ≤ log3/2 (2n) =
O(log n) (note that log3/2 x ≈ 1.7 · lg x), so find is guaranteed to run in O(log n) time.
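Putting the bookkeeping together, a class skeleton might look like the following sketch. It
continues the Node and SubtreeRebuilder classes from the earlier code block; the class name,
field names, and the Comparable bound on the key type are illustrative assumptions.

    // A sketch of the scapegoat tree's bookkeeping and its find operation.
    public class ScapegoatTree<K extends Comparable<K>, V> {
        private Node<K, V> root = null;
        private int n = 0;   // number of keys currently in the tree
        private int m = 0;   // incremented on every insert, never decremented on delete,
                             // and reset to n whenever the entire tree is rebuilt (so m >= n)

        // find: the usual unbalanced BST search; the height bound makes this O(log n).
        public V find(K x) {
            Node<K, V> p = root;
            while (p != null) {
                int c = x.compareTo(p.key);
                if (c == 0) return p.value;
                p = (c < 0) ? p.left : p.right;
            }
            return null;   // key not found
        }

        // delete and insert are sketched below.
    }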
delete(Key x): This operates exactly the same as deletion in a standard binary search tree.
After deleting the node, decrement n (but do not change m). If m > 2n, rebuild the
entire tree by invoking rebuild(root), and set m ← n.
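In code, this deletion rule might be sketched as follows; deleteHelper stands in for ordinary
(unbalanced) BST deletion, which is not shown, and rebuild is the function sketched earlier.

    // Sketch of deletion: standard BST deletion followed by the full-rebuild check.
    public void delete(K x) {
        root = deleteHelper(root, x);                // ordinary BST deletion (not shown here)
        n -= 1;                                      // one fewer key; m is left unchanged
        if (m > 2 * n) {                             // too many deletions since the last rebuild
            root = SubtreeRebuilder.rebuild(root);   // rebuild the entire tree in balanced form
            m = n;                                   // reset: m == n again
        }
    }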
insert(Key x, Value v): First, increment both n and m. The operation begins exactly as
insertion does for a standard binary search tree. But, as we trace the search path to the
insertion point, we keep track of our depth in the tree. (Recall that depth is the number
of edges to the root.) If the depth of the inserted node exceeds log3/2 m, then we trigger a
rebuilding event. This involves the following:
• Walk back up along the insertion search path towards the root. Let p be the current
node that is visited, and let p.child be the child of p that lies on the search path.
• Let size(p) denote the size of the subtree rooted at p, that is, the number of nodes
in this subtree.
• If size(p.child)/size(p) > 2/3, then rebuild the subtree rooted at p by invoking
rebuild(p). The node p is the scapegoat.
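Continuing the class skeleton above, a sketch of this insertion procedure is given below. The
search path is recorded in a stack so that we can walk back toward the root; subtreeSize is
the brute-force size computation described later. All helper names are illustrative.

    // Sketch of insertion with the scapegoat check (fields n, m, root as above).
    // (Assumes java.util.Deque and java.util.ArrayDeque are imported.)
    public void insert(K x, V v) {
        n += 1;
        m += 1;
        Node<K, V> newNode = new Node<>(x, v);
        if (root == null) { root = newNode; return; }

        // Standard BST insertion, recording the search path and the depth of the new node.
        Deque<Node<K, V>> path = new ArrayDeque<>();   // ancestors of newNode, deepest on top
        Node<K, V> p = root;
        int depth = 0;
        while (true) {
            path.push(p);
            depth += 1;
            if (x.compareTo(p.key) < 0) {
                if (p.left == null) { p.left = newNode; break; }
                p = p.left;
            } else {
                if (p.right == null) { p.right = newNode; break; }
                p = p.right;
            }
        }

        // If the new node is too deep, walk back up and rebuild the first scapegoat candidate.
        if (depth > Math.log(m) / Math.log(1.5)) {      // depth > log_{3/2} m
            Node<K, V> child = newNode;
            int childSize = 1;                          // size of the subtree rooted at child
            while (!path.isEmpty()) {
                Node<K, V> u = path.pop();
                Node<K, V> other = (u.left == child) ? u.right : u.left;
                int size = 1 + childSize + subtreeSize(other);   // size(u)
                if (3 * childSize > 2 * size) {         // size(u.child)/size(u) > 2/3
                    Node<K, V> parent = path.peek();    // null if u is the root
                    Node<K, V> rebuilt = SubtreeRebuilder.rebuild(u);   // u is the scapegoat
                    if (parent == null) root = rebuilt;
                    else if (parent.left == u) parent.left = rebuilt;
                    else parent.right = rebuilt;
                    return;
                }
                child = u;
                childSize = size;
            }
        }
    }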
An example of insertion is shown in Fig. 2. After inserting 5, the tree has n = 11 nodes.
The newly inserted node is at depth 6, and since 6 > log3/2 11 (which is approximately
5.9), we trigger the rebuilding event. We walk back up the search path. We find node 9
whose size is 7, but the child on the search path has size 6, and 6/7 > 2/3, so we invoke
rebuild on the node containing 9.
Fig. 2: Inserting key 5 into a scapegoat tree, which triggers a rebuilding event (the new node's
depth satisfies 6 > log3/2 11 ≈ 5.9). The node containing 9 is the first scapegoat candidate
encountered while backtracking up the search path (6/7 > 2/3), and its subtree is rebuilt.

Must there be a scapegoat? The fact that a child has over 2/3 of the nodes of the entire subtree
intuitively means that this subtree has (roughly) more than twice as many nodes as its sibling.
We call such a node on the search path a scapegoat candidate. A short way of summarizing the
above process is “rebuild the scapegoat candidate that is closest to the insertion point.”
You might wonder whether we will necessarily encounter a scapegoat candidate when we
trace back along the search path. The following lemma shows that this is always the case.
Lemma: Given a binary search tree of n nodes, if there exists a node p such that depth(p) >
log3/2 n, then p has an ancestor (possibly p itself) that is a scapegoat candidate.
Proof: The proof is by contradiction. Suppose to the contrary that no node from p to the
root is a scapegoat candidate. This means that for every ancestor node u from p to the
root, we have size(u.child) ≤ (2/3) · size(u). We know that the root has a size of n. It
follows that if p is at depth k in the tree, then

size(p) ≤ (2/3)^k · n.

We know that size(p) ≥ 1 (since the subtree contains p itself, if nothing else), so it follows
that 1 ≤ (2/3)^k · n. With some simple manipulations, we have

(3/2)^k ≤ n,

which implies that k ≤ log3/2 n. However, this violates our hypothesis that p’s depth
exceeds log3/2 n, yielding the desired contradiction.
Recall that m ≥ n, and so if a rebuilding event is triggered, the insertion depth exceeds
log3/2 m ≥ log3/2 n. Therefore, by the above lemma, there must be a scapegoat candidate
along the search path.
How to Compute Subtree Sizes? We mentioned earlier that the scapegoat tree does not store
any information in the nodes other than the key, value, and left and right child pointers.
So how can we compute size(u) for a node u during the insertion process, without this
information? There is a clever trick for doing this on the fly.
Since we are doing this as we back up the search path, we may assume that we already know
the value of s′ = size(u.child), where this is the child that lies along the insertion search
path. So, to compute size(u), it suffices to compute the size of u’s other child. To do this,
we perform a traversal of this child’s subtree to determine its size s′′. Given this, we have
size(u) = 1 + s′ + s′′, where the +1 counts the node u itself.
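In code, this is just a counting traversal of the sibling's subtree. A minimal sketch of the
subtreeSize helper assumed in the insertion sketch above:

    // Brute-force size computation: count the nodes in p's subtree by traversal.
    private int subtreeSize(Node<K, V> p) {
        if (p == null) return 0;
        return 1 + subtreeSize(p.left) + subtreeSize(p.right);
    }
    // Walking back up the path: size(u) = 1 + s' + s'', where s' is the (known) size of
    // u's child on the search path and s'' = subtreeSize of the other child.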
You might wonder, how can we possibly expect to achieve O(log n) amortized time for inser-
tion if we are using brute force (which may take as much as O(n) time) to compute the sizes
of the subtrees? To see why this is not a problem, first recall that we do not need to compute
subtree sizes unless a rebuilding event has been triggered. Every node that we visit in the
counting process will be visited again in the rebuilding process. Thus, the cost of counting
can be “charged to” the cost of rebuilding, and hence it essentially comes for free!
Amortized Analysis: We will not present a formal amortized analysis of the scapegoat tree.
The following theorem (and the rather sketchy proof that follows) provides the main results,
however.
Theorem: Starting with an empty tree, any sequence of m dictionary operations (find, insert,
and delete) to a scapegoat tree can be performed in time O(m log m). (Don’t confuse m
here with the m used in the algorithm.)
Proof: The proof is quite detailed, so we will just sketch the key ideas. In order for each
rebuild operation to be triggered, you need to perform a lot of cheap (rebuild-free)
operations to get into an unbalanced state. We will explain the results in terms of n,
the number of items in the tree, but note that m is generally larger.
Find: Because the tree’s height is at most log3/2 m ≤ log3/2 (2n) = O(log n), the cost
of a find operation is O(log n) (unconditionally).
Delete: In order for a full rebuild to be triggered by deletions, at least half of the keys
counted since the last full rebuild must have been deleted. (The value m − n is the number
of deletions, and a rebuild is triggered when m > 2n, implying that m − n > n.) By
a token-based analysis, it follows that the O(n) cost of rebuilding the entire tree can
be amortized against the time spent processing the (inexpensive) deletions.
Insert: This is analyzed by a potential argument, which is complicated by the fact that
subtrees of various sizes can be rebuilt. Intuitively, after any subtree of size k is
rebuilt, it takes an additional O(k) (inexpensive) operations to force this subtree
to become unbalanced and hence to be rebuilt again. We charge the expense of
rebuilding against these “cheap” insertions.