
Advanced Data Structures

Conrado Martínez
Univ. Politècnica de Catalunya, Spain

36a Escuela de Ciencias Informáticas (ECI 2023)


July 24–28, 2023
Buenos Aires, Argentina
Outline of the course

1 Analysis of Algorithms:
Review of basic concepts. Probabilistic tools. The
continuous master theorem. Amortized analysis.
2 Probabilistic & Randomized Dictionaries:
Review of basic techniques (search trees, hash tables).
Randomized binary search trees. Skip lists. Cuckoo
hashing. Bloom filters.
3 Priority Queues:
Review of basic techniques. Binomial queues. Fibonacci
heaps. Applications: Dijkstra’s algorithm for shortest paths.

2/405
Outline of the course

4 Disjoint Sets:
Union by weight and by rank. Path compression heuristics.
Applications: Kruskal’s algorithm for minimum spanning
trees.
5 Data Structures for String Processing:
Tries. Patricia. Ternary search trees.
6 Multidimensional Data Structures:
Associative queries. K -dimensional search trees.
Quadtrees.

3/405
Part I

Analysis of Algorithms

1 Some Probabilistic Tools

2 The Continuous Master Theorem

3 Amortized Analysis

4/405
Linearity of Expectations

For any random variables X and Y , independent or not,

E[aX + bY] = a·E[X] + b·E[Y].


If X and Y are independent then V[X + Y ] = V[X ] + V[Y ].

5/405
Indicator variables

It is often useful to introduce indicator random variables
X_i = I_{A_i}, such that X_i = 1 if the event A_i is true and X_i = 0 if
the event A_i is false. Let p_i = P[event A_i happens]. Then the
X_i are Bernoulli random variables with
E[X_i] = P[X_i = 1] = p_i.
In many cases we can express or bound a random variable X
as a linear combination of indicator random variables and then
exploit linearity of expectations to derive E[X].
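For instance (anticipating the coin-flip example used below for Markov's inequality): if H_n is the number of heads in n throws of a fair coin and X_i = 1 when the i-th throw is heads (X_i = 0 otherwise), then H_n = X_1 + ··· + X_n and, by linearity, E[H_n] = Σ_{1≤i≤n} E[X_i] = n·(1/2) = n/2, with no independence argument needed.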

6/405
Union Bound

For any sequence (finite or denumerable) of events {A_i}_{i≥0},

P[ ⋃_{i≥0} A_i ] ≤ Σ_{i≥0} P[A_i].

7/405
Markov’s Inequality

Theorem
Let X be a positive random variable. For any a>0
P[X > a] ≤ E[X]/a.

8/405
Markov’s Inequality

Proof.
Let A be the event X > a, and let I_A denote the indicator
variable of that event. Then

P[X > a] = P[I_A = 1] = E[I_A],

but a·I_A ≤ X, therefore a·E[I_A] ≤ E[X] and

P[X > a] ≤ E[X]/a.

9/405
Markov’s Inequality

Example
Suppose we throw a fair coin n times. Let H_n
denote the number of heads in the n throws. We have
E[H_n] = n/2. Using Markov's inequality,

P[H_n > 3n/4] ≤ (n/2)/(3n/4) = 2/3.

In general,

P[X > c·E[X]] ≤ E[X]/(c·E[X]) = 1/c.

10/405
Chebyshev’s Inequality

Theorem
Let X be a positive random variable. For any a > 0,

P[|X − E[X]| ≥ a] ≤ V[X]/a².

Corollary

P[|X − E[X]| ≥ c·σ_X] ≤ 1/c²,

with σ_X = √(V[X]), the standard deviation of X.

11/405
Chebyshev’s Inequality

Proof.
We have

P[|X − E[X]| ≥ a] = P[(X − E[X])² ≥ a²]
                  ≤ E[(X − E[X])²]/a² = V[X]/a².

12/405
Chebyshev’s Inequality

Example
Again, let H_n be the number of heads in n throws of a fair coin.
Since H_n ~ Binomial(n, 1/2), E[H_n] = n/2 and
V[H_n] = n/4. Using Chebyshev's inequality,

P[H_n > 3n/4] ≤ P[|H_n − n/2| ≥ n/4] ≤ V[H_n]/(n/4)² = 4/n.

13/405
Chebyshev’s Inequality

Example
The expected number of comparisons E[q_n] in standard
quicksort is 2n ln n + o(n log n). It can be shown that
V[q_n] = (7 − 2π²/3)·n² + o(n²). Hence, the probability that
we deviate more than c times from the expected value
goes to 0 as 1/log² n:

P[|q_n − E[q_n]| ≥ c·E[q_n]] ≤ ((7 − 2π²/3)·n² + o(n²)) / (4c²·n² ln² n + o(n² log² n))

                             = (7 − 2π²/3)/(4c² ln² n) + o(1/log² n).

14/405
Jensen’s Inequality

Theorem
If f is a convex function then

E[f(X)] ≥ f(E[X]).

Example
For any random variable X, E[X²] ≥ (E[X])², since
f(x) = x² is convex.

15/405
Chernoff Bounds

Theorem
Let {X_i}_{i=1}^n be independent Bernoulli trials, with
P[X_i = 1] = p_i. Then, if X = Σ_{i=1}^n X_i and μ = E[X], we
have
1. P[X ≤ (1 − δ)μ] ≤ ( e^{−δ}/(1 − δ)^{(1−δ)} )^μ, for δ ∈ (0, 1).
2. P[X ≥ (1 + δ)μ] ≤ ( e^{δ}/(1 + δ)^{(1+δ)} )^μ, for any δ > 0.

16/405
Chernoff Bounds

Corollary (Corollary 1)
Let {X_i}_{i=1}^n be independent Bernoulli trials, with
P[X_i = 1] = p_i. Then, if X = Σ_{i=1}^n X_i and μ = E[X], we
have
1. P[X ≤ (1 − δ)μ] ≤ e^{−δ²μ/2}, for δ ∈ (0, 1).
2. P[X ≥ (1 + δ)μ] ≤ e^{−δ²μ/3}, for δ ∈ (0, 1].

Corollary (Corollary 2)
Let {X_i}_{i=1}^n be independent Bernoulli trials, with
P[X_i = 1] = p_i. Then, if X = Σ_{i=1}^n X_i, μ = E[X] and
δ ∈ (0, 1), we have
P[|X − μ| ≥ δμ] ≤ 2e^{−δ²μ/3}.

17/405
Chernoff Bounds
Back to an old example: we flip a fair coin n times and want an
upper bound on the probability of getting at least 3n/4 heads.
Recall that H_n ~ Binomial(n, 1/2), so
μ = E[H_n] = n/2 and V[H_n] = n/4.
We want to bound P[H_n ≥ 3n/4].
Markov: P[H_n ≥ 3n/4] ≤ μ/(3n/4) = 2/3.
Chebyshev: P[H_n ≥ 3n/4] ≤ P[|H_n − n/2| ≥ n/4] ≤ V[H_n]/(n/4)² = 4/n.
Chernoff: using Corollary 1.2,
P[H_n ≥ 3n/4] = P[H_n ≥ (1 + δ)·n/2], with 1 + δ = 3/2, i.e., δ = 1/2,
⟹ P[H_n ≥ 3n/4] ≤ e^{−δ²μ/3} = e^{−n/24}.

Example
If n = 100: Chebyshev = 0.04, Chernoff = 0.0155.
If n = 10^6: Chebyshev = 4·10^{−6}, Chernoff = 2.492·10^{−18096}.
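As a quick sanity check of these numbers, the following small C++ program (not part of the slides; it assumes the bounds exactly as stated above) evaluates the Chebyshev bound 4/n and prints the Chernoff bound e^{−n/24} as a power of 10 to avoid underflow for large n:

#include <cmath>
#include <iostream>

int main() {
    for (double n : {100.0, 1e6}) {
        double cheb = 4.0 / n;                         // Chebyshev bound
        double chern_log10 = -n / 24.0 / std::log(10.0); // log10 of e^{-n/24}
        std::cout << "n = " << n
                  << "  Chebyshev = " << cheb
                  << "  Chernoff = 10^(" << chern_log10 << ")\n";
    }
}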

18/405
Part I

Analysis of Algorithms

1 Some Probabilistic Tools

2 The Continuous Master Theorem

3 Amortized Analysis

19/405
The Continuous Master Theorem

The CMT considers divide-and-conquer recurrences of the following
type:
F_n = t_n + Σ_{0≤j<n} ω_{n,j} F_j,   n ≥ n_0,
for some positive integer n_0, a function t_n, called the toll
function, and a sequence of weights ω_{n,j} ≥ 0. The weights
must satisfy two conditions:
1. W_n = Σ_{0≤j<n} ω_{n,j} ≥ 1 (at least one recursive call).
2. Z_n = Σ_{0≤j<n} (j/n)·(ω_{n,j}/W_n) < 1 (the size of the subinstances is a
   fraction of the size of the original instance).
The next step is to find a shape function ω(z), a continuous
function approximating the discrete weights ω_{n,j}.

20/405
The Continuous Master Theorem

Definition
Given the sequence of weights ω_{n,j}, ω(z) is a shape
function for that set of weights if
1. ∫_0^1 ω(z) dz ≥ 1
2. there exists a constant δ > 0 such that
   Σ_{0≤j<n} | ω_{n,j} − ∫_{j/n}^{(j+1)/n} ω(z) dz | = O(n^{−δ})

A simple trick that works very often to obtain a convenient
shape function is to substitute j by z·n in ω_{n,j}, multiply by n
and take the limit for n → ∞:

ω(z) = lim_{n→∞} n·ω_{n,zn}
21/405
The Continuous Master Theorem

The extension of many discrete functions to functions in the
real domain is immediate, e.g., j² → z². For binomial coefficients
one might use the approximation

binom(zn, k) ≈ (z·n)^k / k!.

The continuation of factorials to the real numbers is given by
Euler's Gamma function Γ(z), and that of harmonic numbers by
the digamma function: ψ(z) = d ln Γ(z) / dz.
For instance, in quicksort's recurrence all weights are equal:
ω_{n,j} = 2/n. Hence a simple valid shape function is
ω(z) = lim_{n→∞} n·ω_{n,zn} = 2.

22/405
The Continuous Master Theorem

Theorem (Roura, 1997)


Let F_n satisfy the recurrence

F_n = t_n + Σ_{0≤j<n} ω_{n,j} F_j,

with t_n = Θ(n^a (log n)^b), for some constants a ≥ 0 and
b > −1, and let ω(z) be a shape function for the weights
ω_{n,j}. Let
H = 1 − ∫_0^1 ω(z)·z^a dz   and   H′ = −(b + 1) ∫_0^1 ω(z)·z^a ln z dz.

23/405
The Continuous Master Theorem

Theorem (Roura, 1997; cont’d)


Then

F_n = t_n/H + o(t_n)                 if H > 0,
F_n = (t_n ln n)/H′ + o(t_n log n)   if H = 0 and H′ ≠ 0,
F_n = Θ(n^α)                         if H < 0,

where x = α is the unique non-negative solution of the
equation
1 − ∫_0^1 ω(z)·z^x dz = 0

24/405
Example #1: QuickSort

C.A.R. Hoare (1934–)

QuickSort (Hoare, 1962) is a sorting algorithm using the
divide-and-conquer principle too but, contrary to the previous
examples, it does not guarantee that the size of each
subinstance will be a fraction of the size of the original given
instance.

25/405
QuickSort

The basis of QuickSort is the procedure PARTITION: given an
element p, called the pivot, the (sub)array is rearranged so that
the elements smaller than p end up to its left and the elements
larger than p end up to its right, with the pivot in between.

26/405
QuickSort

PARTITION puts the pivot in its final place. Hence it suffices to


recursively sort the subarrays to its left and to its right.

procedure QUICKSORT(A, ℓ, u)
Ensure: Sorts subarray A[ℓ..u]
  if u − ℓ + 1 ≤ M then
    use a simple sorting method, e.g., insertion sort
  else
    PARTITION(A, ℓ, u, k)
    ▷ A[ℓ..k − 1] ≤ A[k] ≤ A[k + 1..u]
    QUICKSORT(A, ℓ, k − 1)
    QUICKSORT(A, k + 1, u)

27/405
QuickSort

There are many ways to do the partition; not all of them are
equally good. Some issues, like repeated elements, have to be
dealt with carefully. Bentley & McIlroy (1993) discuss a very
efficient partition procedure, which works seamlessly even in
the presence of repeated elements. Here, we will examine a
basic algorithm, which is reasonably efficient.
We will keep two indices i and j such that A[ℓ + 1..i − 1]
contains elements less than or equal to the pivot p, and
A[j + 1..u] contains elements greater than or equal to the pivot
p. The two indices scan the subarray locations, i from left to
right and j from right to left, until A[i] > p and A[j] < p, or until they
cross (i = j + 1).

28/405
QuickSort

procedure PARTITION(A, ℓ, u, k)
Require: ℓ ≤ u
Ensure: A[ℓ..k − 1] ≤ A[k] ≤ A[k + 1..u]
  i := ℓ + 1; j := u; p := A[ℓ]
  while i < j + 1 do
    while i < j + 1 ∧ A[i] ≤ p do
      i := i + 1
    while i < j + 1 ∧ A[j] ≥ p do
      j := j − 1
    if i < j + 1 then
      A[i] :=: A[j]
      i := i + 1; j := j − 1
  A[ℓ] :=: A[j]; k := j
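A runnable C++ sketch of this partition scheme and of the QUICKSORT procedure of the previous slide (a direct transcription of the pseudocode, without the cutoff to insertion sort and with the plain "first element as pivot" rule) could look as follows:

#include <utility>
#include <vector>

// Partition A[l..u] (inclusive) around the pivot p = A[l];
// returns the final position k of the pivot.
int partition(std::vector<int>& A, int l, int u) {
    int i = l + 1, j = u;
    int p = A[l];
    while (i <= j) {
        while (i <= j && A[i] <= p) ++i;   // skip elements <= pivot
        while (i <= j && A[j] >= p) --j;   // skip elements >= pivot
        if (i <= j) {                      // A[i] > p and A[j] < p: exchange
            std::swap(A[i], A[j]);
            ++i; --j;
        }
    }
    std::swap(A[l], A[j]);                 // place the pivot in its final slot
    return j;
}

void quicksort(std::vector<int>& A, int l, int u) {
    if (l >= u) return;                    // subarrays of size <= 1 are sorted
    int k = partition(A, l, u);
    quicksort(A, l, k - 1);
    quicksort(A, k + 1, u);
}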

29/405
The Cost of QuickSort
The worst-case cost of QuickSort is Θ(n²), hence not very
attractive. But it only occurs if in all or most recursive calls one of
the subarrays contains very few elements and the other
contains almost all. That would happen if we systematically
chose the first element of the current subarray as the pivot
and the array were already sorted!
The cost of the partition is Θ(n) and we would then have

Q(n) = Θ(n) + Q(n − 1) + Q(0)
     = Θ(n) + Q(n − 1) = Θ(n) + Θ(n − 1) + Q(n − 2)
     = ··· = Σ_{0≤i≤n} Θ(i) = Θ( Σ_{0≤i≤n} i )
     = Θ(n²).

30/405
The Cost of QuickSort

However, on average, there will be a fraction of the elements


that are less than the pivot (and will be to its left) and a fraction
of elements that are greater than the pivot (and will be to its
right). It is for this reason that QuickSort belongs to the family
of divide-and-conquer algorithms, and indeed it has a good
average-case complexity.

31/405
The Cost of QuickSort

To analyze the performance of QuickSort, only the relative order
of the elements to be sorted matters, hence we can safely
assume that the input is a permutation of the elements 1 to n.
Furthermore, we can concentrate on the number of
comparisons between the elements, since the total cost will be
proportional to that number of comparisons.

32/405
The Cost of QuickSort

Let us assume that each of the n! possible input permutations is equally
likely, and let q_n be the expected number of comparisons to sort
the n elements. Then

q_n = Σ_{1≤j≤n} E[# compar. | pivot is the j-th]·Pr{pivot is the j-th}
    = Σ_{1≤j≤n} (n − 1 + q_{j−1} + q_{n−j})·(1/n)
    = n + O(1) + (1/n) Σ_{1≤j≤n} (q_{j−1} + q_{n−j})
    = n + O(1) + (2/n) Σ_{0≤j<n} q_j

33/405
Solving QuickSort’s Recurrence
We apply the CMT to quicksort's recurrence with the set of weights
ω_{n,j} = 2/n and toll function t_n = n − 1. As we have already
seen, we can take ω(z) = 2, and the CMT applies with a = 1
and b = 0. All necessary conditions to apply the CMT are met.
Then we compute

H = 1 − ∫_0^1 2z dz = 1 − [z²]_{z=0}^{z=1} = 0,

hence we have to apply the CMT's second case and compute

H′ = −∫_0^1 2z ln z dz = [z²/2 − z² ln z]_{z=0}^{z=1} = 1/2.

Finally,

q_n = (n ln n)/(1/2) + o(n log n) = 2n ln n + o(n log n)
    = 1.386... n log₂ n + o(n log n).
34/405
Example #2: QuickSelect

The selection problem is to find the j-th smallest element in a
given set of n elements. More specifically, given an array
A[1..n] of size n > 0 and a rank j, 1 ≤ j ≤ n, the selection
problem is to find the j-th element of A if it were in ascending
order.
For j = 1 we want to find the minimum, for j = n we want to find
the maximum, for j = ⌈n/2⌉ we are looking for the median, etc.

35/405
QuickSelect

The problem can be trivially but inefficiently (because it implies
doing much more work than needed!) solved with cost
Θ(n log n) by sorting the array. Another solution keeps an
unsorted table of the j smallest elements seen so far while
scanning the array from left to right; it has cost Θ(j·n), and
using clever data structures the cost can be improved to
Θ(n log j). This is not a real improvement with respect to the first
trivial solution if j = Θ(n).
QuickSelect (Hoare, 1962), also known as FIND and as
one-sided QuickSort, is a variant of QuickSort adapted to
select the j-th smallest element out of n.

36/405
QuickSelect

Assume we partition the subarray A[ℓ..u], which contains the
elements of ranks ℓ to u, ℓ ≤ j ≤ u, with respect to some pivot p.
Once the partition finishes, suppose that the pivot ends at
position k.
Then A[ℓ..k − 1] contains the elements of ranks ℓ to (k − 1) in A
and A[k + 1..u] contains the elements of ranks (k + 1) to u. If
j = k we are done, since we have found the sought element. If
j < k then we need to recursively continue in the left subarray
A[ℓ..k − 1], whereas if j > k then the sought element must be
located in the right subarray A[k + 1..u].

37/405
QuickSelect

Example
We are looking for the fourth smallest element (j = 4) out of n = 15
elements
9 5 10 12 3 1 11 15 7 2 8 13 6 4 14

38/405
QuickSelect

Example
We are looking for the fourth smallest element (j = 4) out of n = 15
elements
7 5 4 6 3 1 8 2 9 15 11 13 12 10 14
the pivot ends at position k = 9 > j

38/405
QuickSelect

Example
We are looking for the fourth smallest element (j = 4) out of n = 15
elements
1 5 4 2 3 6 8 7 9 15 11 13 12 10 14
the pivot ends at position k = 6 > j

38/405
QuickSelect

Example
We are looking for the fourth smallest element (j = 4) out of n = 15
elements
2 3 1 4 5 6 8 7 9 15 11 13 12 10 14
the pivot ends at position k = 4 = j ⟹ DONE!

38/405
QuickSelect

procedure QUICKSELECT(A, ℓ, j, u)
Ensure: Returns the (j − ℓ + 1)-th smallest element in A[ℓ..u],
        ℓ ≤ j ≤ u
  if ℓ = u then
    return A[ℓ]
  PARTITION(A, ℓ, u, k)
  if j = k then
    return A[k]
  if j < k then
    return QUICKSELECT(A, ℓ, j, k − 1)
  else
    return QUICKSELECT(A, k + 1, j, u)
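Reusing the partition() sketch given after the PARTITION pseudocode, a minimal C++ version of QUICKSELECT could be (here j is a 0-based rank, with l ≤ j ≤ u):

// Returns the element of rank j in sorted order within A[l..u].
int quickselect(std::vector<int>& A, int l, int j, int u) {
    if (l == u) return A[l];
    int k = partition(A, l, u);        // pivot lands at its final position k
    if (j == k) return A[k];
    if (j < k)  return quickselect(A, l, j, k - 1);
    return quickselect(A, k + 1, j, u);
}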

39/405
QuickSelect

In the worst case, the cost of QuickSelect is Θ(n²). However,
its average cost is Θ(n), with the proportionality constant
depending on the ratio j/n. Knuth (1971) proved that C_n^{(j)}, the
expected number of comparisons to find the j-th smallest
element among n, is:

C_n^{(j)} = 2·( (n + 1)·H_n − (n + 3 − j)·H_{n+1−j} − (j + 2)·H_j + n + 3 )

The maximum average cost corresponds to finding the median
(j = ⌊n/2⌋); then we have

C_n^{(⌊n/2⌋)} = 2(ln 2 + 1)·n + o(n).

40/405
QuickSelect

Let us now consider the analysis of the expected cost C_n when
j takes any value between 1 and n with identical probability.
Then

C_n = n + O(1)
    + (1/n) Σ_{1≤k≤n} E[remaining number of comp. | pivot is the k-th element],

as the pivot will be the k-th smallest element with probability
1/n for all k, 1 ≤ k ≤ n.

41/405
QuickSelect
The probability that j = k is 1/n; in that case no more comparisons
are needed, since we would be done. The probability that j < k is
(k − 1)/n; then we will have to make C_{k−1} comparisons.
Similarly, with probability (n − k)/n we have j > k and we will
then make C_{n−k} comparisons. Thus

C_n = n + O(1) + (1/n) Σ_{1≤k≤n} ( ((k − 1)/n)·C_{k−1} + ((n − k)/n)·C_{n−k} )
    = n + O(1) + (2/n) Σ_{0≤k<n} (k/n)·C_k.

Applying the CMT with the shape function

ω(z) = lim_{n→∞} n·(2zn/n²) = 2z

we obtain H = 1 − ∫_0^1 2z² dz = 1/3 > 0 and C_n = 3n + o(n).
42/405
Part I

Analysis of Algorithms

1 Some Probabilistic Tools

2 The Continuous Master Theorem

3 Amortized Analysis

43/405
Amortized Analysis

In amortized analysis we find the (worst/best/average) cost C_n
of a sequence of n operations; the amortized cost per operation
is
a_n = C_n / n
Sometimes we compute the cost C(n_1, ..., n_k) of a sequence
involving n_1 operations of type 1, n_2 operations of type 2,
... The amortized cost is then
A(n_1, ..., n_k) = C(n_1, ..., n_k) / (n_1 + ··· + n_k)

44/405
Amortized Analysis

Amortized cost is interesting when a sequence of operations
must be performed in which some operations are expensive and
some are cheap; bounding the total cost by n times the cost of
the most expensive operation is overly pessimistic.

45/405
A first example: Binary counter

Suppose we have a counter that we initialize to 0 and increment
(mod 2^k) n times. The counter has k bits. How many bit flips
are needed?
Theorem
Starting from 0, a sequence of n increments makes
O(nk) bit flips.

46/405
A first example: Binary counter

Counter B[5] B[4] B[3] B[2] B[1] B[0]


0 0 0 0 0 0 0
1 0 0 0 0 0 1
2 0 0 0 0 1 0
3 0 0 0 0 1 1
4 0 0 0 1 0 0
5 0 0 0 1 0 1
6 0 0 0 1 1 0
7 0 0 0 1 1 1
8 0 0 1 0 0 0
9 0 0 1 0 0 1
10 0 0 1 0 1 0
11 0 0 1 0 1 1
12 0 0 1 1 0 0
13 0 0 1 1 0 1
14 0 0 1 1 1 0
15 0 0 1 1 1 1
16 0 1 0 0 0 0

Proof
Any increment flips O(k) bits. 

47/405
Aggregate method

Determine (an upper bound on) the number N(c) of
operations of cost c in the sequence. Then the cost of the
sequence is ≤ Σ_{c>0} c·N(c).
An alternative is to count how many operations N′(c) have
cost ≥ c; then the cost of the sequence is ≤ Σ_{c>0} N′(c).
P

48/405
Aggregate method
In the binary counter problem, we observe that bit 0 flips n
times, bit 1 flips ⌊n/2⌋ times, bit 2 flips ⌊n/4⌋ times, ...
Theorem
Starting from 0, a sequence of n increments makes
Θ(n) bit flips.

Proof
Each increment flips at least 1 bit, thus we make at
least n flips. But the total cost is O(n). Indeed,

Σ_{j=0}^{k−1} ⌊n/2^j⌋ ≤ n Σ_{j=0}^{k−1} 1/2^j < n Σ_{j=0}^{∞} 1/2^j = n·(1/(1 − 1/2)) = 2n.
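A small experiment (not from the slides) that counts the exact number of bit flips produced by n increments and compares it with the 2n bound:

#include <iostream>
#include <vector>

int main() {
    const int k = 20;                 // number of bits (assumed large enough)
    const long long n = 1000000;      // number of increments
    std::vector<int> B(k, 0);
    long long flips = 0;
    for (long long inc = 0; inc < n; ++inc) {
        int i = 0;
        while (i < k && B[i] == 1) {  // trailing 1s flip to 0
            B[i] = 0; ++flips; ++i;
        }
        if (i < k) {                  // rightmost 0 flips to 1
            B[i] = 1; ++flips;
        }
    }
    // Prints a flip count strictly below 2n, as the aggregate argument predicts.
    std::cout << "flips = " << flips << ", 2n = " << 2 * n << '\n';
}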

49/405
Accounting method (banker’s viewpoint)
We associate “charges” with the different operations; these charges
may be smaller or larger than the actual cost.
When the charge or amortized cost ĉ_i of an operation is
larger than the actual cost c_i, the difference is seen as
credits that we store in the data structure to pay for future
operations.
When the amortized cost ĉ_i is smaller than c_i, the
difference must be covered from the credits stored in the
data structure.
The initial data structure D_0 has 0 credits.
Invariant: for all ℓ,

Σ_{i=1}^{ℓ} (ĉ_i − c_i) ≥ 0,

that is, at all moments there must be a non-negative number of
credits in the data structure.
50/405
Accounting method (banker’s viewpoint)

Theorem
The total cost of processing a sequence of n operations
starting with D0 is bounded by the sum of amortized
costs.

Proof
Invariant ⟹ Σ_{i=1}^n ĉ_i ≥ Σ_{i=1}^n c_i = total cost.
51/405
Accounting method (banker’s viewpoint)
In the binary counter problem, we charge 2 credits every time
we flip a bit from 0 to 1, and we pay 1 credit every time we flip a bit
from 1 to 0.
We consider that each time we flip a bit from 0 to 1 we store
one credit in the data structure and use the other credit to pay for
the flip. When the bit is flipped from 1 to 0, we use the stored credit
for that. Thus 1-bits all store 1 credit each, whereas 0-bits do
not store credits.

Source: Kevin Wayne


(https://www.cs.princeton.edu/~wayne/kleinberg-tardos/pdf/AmortizedAnalysis.pdf)
52/405
Accounting method (banker’s viewpoint)

Theorem
Starting from 0, a sequence of n increments makes
(n) bit flips.

Proof
Every increment flips at least one bit, thus Σ_i c_i ≥ n. Every increment flips a 0-bit to
a 1-bit exactly once (the rightmost 0 in the counter before the increment is the only 0-bit flipped).
Hence ĉ_i = 2, because all the other flips during the i-th increment are from 1-bits to 0-bits,
and their amortized cost is 0. Thus

Σ_i ĉ_i = 2n ≥ Σ_i c_i.

As the number of credits per bit is ≥ 0, the number of credits stored in the counter is
≥ 0 at all times, that is, the invariant is preserved.

53/405
Potential method (physicist’s viewpoint)

In the potential method we define a potential function Φ that
associates a non-negative real to every possible configuration
D of the data structure.
1. Φ(D) ≥ 0 for all possible configurations D of the data
   structure.
2. Φ(D_0) = 0: the potential of the initial configuration is 0.
D_i = configuration of the data structure after the i-th operation
c_i = actual cost of the i-th operation in the sequence
ĉ_i = c_i + Φ(D_i) − Φ(D_{i−1}) =
amortized cost of the i-th operation; it is the actual cost c_i
of the operation plus the change in potential
ΔΦ_i = Φ(D_i) − Φ(D_{i−1}).

54/405
Potential method (physicist’s viewpoint)

Theorem
The total cost of processing a sequence of n operations
starting with D0 is bounded by the sum of amortized
costs.

Proof

Σ_{i=1}^n ĉ_i = Σ_{i=1}^n c_i + Σ_{i=1}^n ΔΦ_i = Σ_{i=1}^n c_i + Σ_{i=1}^n (Φ(D_i) − Φ(D_{i−1}))
            = Σ_{i=1}^n c_i + Φ(D_n) − Φ(D_0) = Σ_{i=1}^n c_i + Φ(D_n) ≥ Σ_{i=1}^n c_i,

since Φ(D_n) ≥ 0 and Φ(D_0) = 0.

55/405
Potential method (physicist’s viewpoint)
For the binary counter problem, we take
Φ(D) = number of 1-bits in the binary counter D. Notice that
Φ(D_0) = 0 and Φ(D) ≥ 0 for all D.
The actual cost c_i of the i-th increment is ≤ 1 + p, where p,
0 ≤ p ≤ k, is the position of the rightmost 0-bit. We flip the p
1's to the right of the rightmost 0-bit, then the rightmost
0-bit itself (except when the counter is all 1's and we reset it;
then the cost is p).
The change in potential is ≤ 1 − p, because we add one
1-bit (flipping the rightmost 0-bit to a 1-bit, except if p = k)
and we flip p 1-bits to 0-bits, those to the right of the
rightmost 0-bit. Hence
ĉ_i = c_i + ΔΦ_i ≤ 1 + p + (1 − p) = 2,
⟹ 2n ≥ Σ_i ĉ_i ≥ Σ_i c_i.
56/405
Stacks with multi-pop

Example
Suppose we have a stack that supports:
PUSH(x)
POP(): pops the top of the stack and returns it;
  the stack must be non-empty
MPOP(k): pops k items; the stack must contain at
  least k items
The cost of PUSH and POP is O(1) and the cost of
MPOP(k) is Θ(k) = O(n) (n = size of the stack), but
saying that the worst-case cost of a sequence of N
stack operations is O(N²) is too pessimistic!
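A minimal C++ sketch of such a stack (names are illustrative, not from any particular library):

#include <vector>

class MultiPopStack {
    std::vector<int> items;
public:
    void push(int x) { items.push_back(x); }           // O(1)
    int pop() {                                         // O(1), stack non-empty
        int x = items.back(); items.pop_back(); return x;
    }
    void mpop(std::size_t k) {                          // Theta(k), size >= k
        items.resize(items.size() - k);
    }
    std::size_t size() const { return items.size(); }
};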

57/405
Stacks with multi-pop

Example
Accounting: assign 2 credits to each PUSH. One is
used to do the operation and the other credit to pop
(with pop or multi-pop) the element at a later time. The
total number of credits in the stack = size of the stack.
ĉ_PUSH = 2.
ĉ_POP = ĉ_MPOP = 0.

⟹ 2N ≥ Σ_{i=1}^N ĉ_i ≥ Σ_{i=1}^N c_i.

58/405
Stacks with multi-pop

Example
Potential: Φ(S) = size(S). Then Φ(S_0) = 0 and Φ(S) ≥ 0
for all stacks S.
ĉ_PUSH = 1 + ΔΦ_i = 2.
ĉ_POP = 1 + ΔΦ_i = 1 + (−1) = 0.
ĉ_MPOP = k + ΔΦ_i = k + (−k) = 0, since |S_{i−1}| ≥ k.

⟹ 2N ≥ Σ_{i=1}^N ĉ_i ≥ Σ_{i=1}^N c_i.

59/405
Dynamic arrays

Example
We often use dynamic arrays (a.k.a. vectors in C++):
the array dynamically grows as we add items (using
v.push_back(x), say).

A common way to implement dynamic arrays is to
allocate an array of some size from dynamic memory; at
any given moment we use only part of the array, so we
have:
size: number of elements in the array
capacity: number of memory cells in the array,
  size ≤ capacity

60/405
Dynamic arrays

Example
When a new element has to be added and n = size =
capacity, a new array with double the capacity is allocated
from dynamic memory, the contents of the old array are
copied into the new one and the old array is freed back to
dynamic memory, with total cost Θ(n). The program then
sets the array name (a pointer) to point to the new array
instead of pointing to the old one.

This procedure is called resizing, and it implies that a


single push_back can be very costly if it has to invoke
a resizing to accomplish its task.

61/405
Dynamic arrays

Example
Cost to fill a dynamic array using n push_back’s?
Aggregate:
The cost c_i of the i-th push_back is Θ(1) except if
i = 2^k + 1 for some k, 0 ≤ k ≤ log₂(n − 1).
When i = 2^k + 1, it triggers a resizing with cost Θ(i).

Total cost = Σ_{i=1}^n c_i = Θ( Σ_{i: i ≠ 2^k+1} 1 + Σ_{i: i = 2^k+1} i )
           = Θ( n − Θ(log n) + Σ_{k=0}^{⌊log₂(n−1)⌋} (2^k + 1) )
           = Θ( n − Θ(log n) + Θ(log n) + (2^{⌊log₂(n−1)⌋+1} − 1) ) = Θ(n).
62/405
Dynamic arrays

Example
Cost to fill a dynamic array using n push_back’s?
Accounting:
Charge 3 credits to the assignment v[j] := x in
which we add x to the first unused array slot j;
every push_back does this, and sometimes a resizing is
also needed. Use 1 credit for the assignment, and
store the remaining 2 credits in slot j.
When resizing an array v of size n into an array v′
with capacity 2n, each slot j ∈ [n/2..n − 1] stores 2
credits; use one credit for the copy v′[j] := v[j]
and use the other credit for the copy of
v[j − n/2] to v′[j − n/2].
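A small experiment (not from the slides) backing the "3 credits per push_back" argument: it counts every element write (the write of the new item plus each copy made by a resizing) over n push_backs into a doubling array, and compares the total with 3n:

#include <iostream>

int main() {
    const long long n = 1000000;
    long long capacity = 1, size = 0, writes = 0;
    for (long long i = 0; i < n; ++i) {
        if (size == capacity) {       // resize: copy all current elements
            writes += size;
            capacity *= 2;
        }
        ++size; ++writes;             // write the new element
    }
    // The total stays below 3n, matching the amortized bound.
    std::cout << "writes = " << writes << ", 3n = " << 3 * n << '\n';
}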

63/405
Dynamic arrays

Source: Kevin Wayne


(https://www.cs.princeton.edu/~wayne/kleinberg-tardos/pdf/AmortizedAnalysis.pdf)

64/405
Dynamic arrays

Example
Cost to fill a dynamic array using n push_back’s?
Accounting: the credits stored in the dynamic array v
(2 per slot written in its upper half) always add up to a
non-negative amount, so the invariant holds and

Σ_{i=1}^n ĉ_i = 3n ≥ Σ_{i=1}^n c_i.

65/405
Dynamic arrays

Example
Instead of 3 credits for the assignment v[j] := x we
might charge some other constant quantity c ≥ 3, so
that we use 1 credit for the assignment v[j] := x proper,
and we store c − 1 credits at every used slot j in the
upper half of v; these c − 1 credits will be used to pay
for the copying of v[j] and of v[j − n/2], but also for the
creation of an unused slot v′[j + n] in the new array and
the destruction of v[j − n/2] and v[j] in the old array,
if such construction/destruction costs need to be taken
into account.

66/405
Dynamic arrays

Example
Cost to fill a dynamic array using n push_back’s?
Potential:
When there is no resizing: c_i = 1.
When there is resizing: c_i = 1 + k·cap(v_i), for
some constant k; v_i is the dynamic array after the
i-th push_back.
Φ(v) = 2k·(2·size(v) − cap(v) + 1).
N.B. We will take k = 1/2 to simplify the calculations

67/405
Dynamic arrays

Example
Cost to fill a dynamic array using n push_back’s?
Potential:
When there is no resizing: cap(v_i) = cap(v_{i−1}),

Φ(v_i) − Φ(v_{i−1}) = ( 2(size(v_{i−1}) + 1) − cap(v_{i−1}) + 1 )
                    − ( 2·size(v_{i−1}) − cap(v_{i−1}) + 1 )
                    = 2,

and ĉ_i = c_i + ΔΦ_i = 3.

68/405
Dynamic arrays

Example
Cost to fill a dynamic array using n push_back’s?
Potential:
When there is resizing:
cap(v_i) = 2·cap(v_{i−1}) = 2·size(v_{i−1}),

Φ(v_i) − Φ(v_{i−1}) = ( 2(size(v_{i−1}) + 1) − 2·cap(v_{i−1}) + 1 )
                    − ( 2·size(v_{i−1}) − cap(v_{i−1}) + 1 )
                    = 2 − cap(v_{i−1}),

and

ĉ_i = c_i + ΔΦ_i = 1 + cap(v_{i−1}) + 2 − cap(v_{i−1}) = 3.

Hence Σ_{i=1}^n ĉ_i = 3n ≥ Σ_{i=1}^n c_i.
69/405
Part II

Probabilistic & Randomized


Dictionaries
4 Randomized Binary Search Trees

5 Skip Lists

6 Hash Tables
Separate Chaining
Open Addressing
Cuckoo Hashing

7 Bloom Filters

70/405
Random BSTs

In a random binary search tree (built by a random
permutation) any of its n elements is the root with
probability 1/n
Idea: to obtain a random BST –independently of any
assumption on the distribution of the input– insert a new
item into a tree of size n as follows:
insert it at the root with probability 1/(n + 1),
otherwise proceed recursively

71/405
Randomized binary search trees

C. Aragon R. Seidel S. Roura


Two incarnations
Randomized treaps (tree+heap) invented by Aragon and
Seidel (FOCS 1989, Algorithmica 1996) use random
priorities and bottom-up balancing
Randomized binary search trees (RBSTs) invented by
Martínez and Roura (ESA 1996, JACM 1998) use subtree
sizes and top-down balancing

72/405
Insertion in a RBST

Inserting an item x = 48

42 6

27 3 64 2

11 1 1
35 56 1

73/405
Insertion in a RBST

Inserting an item x = 48
48
insert new
42 6
item

27 3 64 2

11 1 1
35 56 1

73/405
Insertion in a RBST
Inserting an item x = 48

48

6 with prob 1/7 insert


42
at root

27 3 64 2

11 1 1
35 56 1

73/405
Insertion in a RBST
Inserting an item x = 48

6
48
42

27 3 64 2

11 1 35 56 1 with prob =1/3


1
insert at root

73/405
Insertion in a RBST

Inserting an item x = 48

42 7

27 3 48 3

11 1 1
35 64 2

56 1

73/405
Insertion in a RBST

procedure INSERT(T, k, v)
  n := T→size                      ▷ n = 0 if T = ∅
  if UNIFORM(0, n) = 0 then        ▷ this will always succeed if T = ∅
    return INSERT-AT-ROOT(T, k, v)
  if k < T→key then
    T→left := INSERT(T→left, k, v)
  else
    T→right := INSERT(T→right, k, v)
  Update T→size
  return T

74/405
Insertion in a RBST

To insert a new item x at the root of T, we use the
algorithm SPLIT that returns two RBSTs T− and T+ with
the elements smaller and larger than x, respectively.

⟨T−, T+⟩ = SPLIT(T, x)
T− = BST for {y ∈ T | y < x}
T+ = BST for {y ∈ T | x < y}
SPLIT is like the partition in Quicksort
Insertion at root was invented by Stephenson in 1976

75/405
Splitting a RBST
To split a RBST T around x, we just need to follow the path
from the root of T to the leaf where x falls.

[figure: if x < z, where z is the root of T with subtrees L and R,
split L recursively into ⟨L−, L+⟩; then T− = L− and T+ = ⟨L+, z, R⟩]

76/405
Splitting a RBST & Insertion at Root

▷ Pre: k is not present in T
procedure SPLIT(T, k, T−, T+)
  if T = null then
    T− := null; T+ := null; return
  if k < T→key then
    SPLIT(T→left, k, L−, L+)
    T→left := L+
    Update T→size
    T− := L−
    T+ := T
  else
    ▷ “Symmetric” code for k > T→key
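A compact C++ sketch combining the INSERT, INSERT-AT-ROOT and SPLIT pseudocode of the last slides (keys only, no values; it assumes the inserted key is not already present):

#include <random>

struct Node {
    int key, size;
    Node *left = nullptr, *right = nullptr;
    Node(int k) : key(k), size(1) {}
};

int size(Node* t) { return t ? t->size : 0; }
void update(Node* t) { t->size = 1 + size(t->left) + size(t->right); }

// split(t, k, l, r): l gets the keys < k, r gets the keys > k
void split(Node* t, int k, Node*& l, Node*& r) {
    if (!t) { l = r = nullptr; return; }
    if (k < t->key) { split(t->left, k, l, t->left);   update(t); r = t; }
    else            { split(t->right, k, t->right, r); update(t); l = t; }
}

Node* insert_at_root(Node* t, int k) {
    Node* n = new Node(k);
    split(t, k, n->left, n->right);
    update(n);
    return n;
}

std::mt19937 rng(12345);

Node* insert(Node* t, int k) {
    int n = size(t);
    std::uniform_int_distribution<int> U(0, n);    // each value with prob 1/(n+1)
    if (U(rng) == 0) return insert_at_root(t, k);  // always taken when t is empty
    if (k < t->key) t->left = insert(t->left, k);
    else            t->right = insert(t->right, k);
    update(t);
    return t;
}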

77/405
Splitting a RBST

Lemma
Let T− and T+ be the BSTs produced by SPLIT(T, x).
If T is a random BST containing the set of keys K,
then T− and T+ are independent random BSTs
containing the sets of keys K− = {y ∈ T | y < x}
and K+ = {y ∈ T | y > x}, respectively.

78/405
Insertion in RBSTs

Theorem
If T is a random BST that contains the set of keys K
and x is any key not in K, then INSERT(T, x) produces a
random BST containing the set of keys K ∪ {x}.

79/405
The Cost of Insertions

The cost of the insertion at the root (measured in # of visited
nodes) is exactly the same as the cost of the standard
insertion.
For a random(ized) BST the cost of insertion is the depth
of a random leaf in a random binary search tree:

E[I_n] = 2 ln n + O(1)

80/405
The Cost of Insertions

The recurrence for E[I_n]:

E[I_n] = 1 + (1/n) Σ_{1≤j≤n} ( (j/(n + 1))·E[I_{j−1}] + ((n − j + 1)/(n + 1))·E[I_{n−j}] )

To solve this recurrence the Continuous Master Theorem
(Roura, 2001) comes in handy
We need to produce O(log n) random numbers on average
to insert an item

81/405
RBST resulting from the insertion of 500 keys in ascending
order
Source: R. Sedgewick, Algorithms in C (3rd edition), 1997

82/405
Deletions in RBSTs

The fundamental problem is how to remove the root node


of a BST, in particular, when both subtrees are not empty
The original deletion algorithm by Hibbard was assumed to
preserve randomness
In 1975, G. Knott discovered that Hibbard’s deletion
preserves randomness of shape, but an insertion following
a deletion would destroy randomness (Knott’s paradox)

83/405
Deletions in RBSTs

J. Culberson J.L. Eppinger D.E. Knuth


Several theoretical and experimental works aimed at
understanding the effect of deletions, e.g.,
Jonassen & Knuth’s An Algorithm whose Analysis Isn’t
(JCSS, 1978)
Knuth’s Deletions that Preserve Randomness (IEEE Trans.
Soft. Eng., 1977)
Eppinger’s experiments (CACM, 1983)
Culberson’s paper on deletions of the left spine (STOC,
1985)
These studies showed that deletions degraded
performance in the long run
84/405
Deletions in RBSTs

procedure DELETE(T, k)
  if T = ∅ then
    return T
  if k = T→key then
    return DELETE-ROOT(T)
  if k < T→key then
    T→left := DELETE(T→left, k)
  else
    T→right := DELETE(T→right, k)
  Update T→size
  return T

85/405
Deletions in RBSTs

We delete the root using a procedure JOIN(T1, T2). Given two
BSTs such that for all x ∈ T1 and all y ∈ T2, x ≤ y, it returns a
new BST that contains all the keys in T1 and T2.

JOIN(∅, ∅) = ∅
JOIN(T, ∅) = JOIN(∅, T) = T
JOIN(T1, T2) = ?, if T1 ≠ ∅ and T2 ≠ ∅

86/405
Joining two BSTs

[figure: T1 with root x and subtrees L1, R1; T2 with root y and subtrees L2, R2]

87/405
Joining two BSTs

[figure: if x becomes the common root, it keeps L1 as its left subtree and JOIN(R1, T2) as its right subtree]

87/405
Joining two BSTs

If we systematically choose the root of T1 as the root of
JOIN(T1, T2), or the other way around, we will introduce an
undesirable bias
Suppose both T1 and T2 are random. Let m and n denote
their sizes. Then x is the root of T1 with probability 1/m
and y is the root of T2 with probability 1/n
Choose x as the common root with probability m/(m + n),
choose y with probability n/(m + n):

(1/m)·(m/(m + n)) = 1/(m + n)
(1/n)·(n/(m + n)) = 1/(m + n)
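Continuing the C++ sketch given earlier for insertion, JOIN with this randomized choice of the root, and deletion on top of it, could look as follows (it reuses Node, size, update and rng from that sketch, and assumes all keys in l are smaller than all keys in r):

Node* join(Node* l, Node* r) {
    if (!l) return r;
    if (!r) return l;
    int m = size(l), n = size(r);
    std::uniform_int_distribution<int> U(0, m + n - 1);
    if (U(rng) < m) {                 // keep l's root with probability m/(m+n)
        l->right = join(l->right, r);
        update(l);
        return l;
    } else {                          // keep r's root with probability n/(m+n)
        r->left = join(l, r->left);
        update(r);
        return r;
    }
}

Node* erase(Node* t, int k) {
    if (!t) return t;
    if (k == t->key) {                // delete the root of this subtree
        Node* j = join(t->left, t->right);
        delete t;
        return j;
    }
    if (k < t->key) t->left = erase(t->left, k);
    else            t->right = erase(t->right, k);
    update(t);
    return t;
}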

88/405
Joining two RBSTs

Lemma
Let L and R be two independent random BSTs, such
that the keys in L are strictly smaller than the keys in
R. Let K_L and K_R denote the sets of keys in L and R,
respectively. Then T = JOIN(L, R) is a random BST that
contains the set of keys K = K_L ∪ K_R.

89/405
Joining two RBSTs

The recursion for JOIN(T1, T2) traverses the rightmost
branch (right spine) of T1 and the leftmost branch (left
spine) of T2
The trees to be joined are the left and right subtrees L and
R of the i-th item in a RBST of size n; then
length of the left spine of L = path length to the i-th leaf
length of the right spine of R = path length to the (i + 1)-th leaf

The cost of the joining phase is the sum of the path lengths
to the leaves minus twice the depth of the i-th item; the
expected cost follows from well-known results:

2 − 1/i − 1/(n + 1 − i) = O(1)
90/405
Deletions in RBSTs

Theorem
If T is a random BST that contains the set of keys K,
then DELETE(T, x) produces a random BST containing
the set of keys K \ {x}.

Corollary
The result of any arbitrary sequence of insertions and
deletions, starting from an initially empty tree, is always a
random BST.

91/405
Additional remarks

Arbitrary insertions and deletions always yield random
BSTs
A deletion algorithm for BSTs that preserved randomness
was a long-standing open problem (10–15 years)
Properties of random BSTs have been investigated in
depth and for a long time
Treaps only need to generate a single random number per
node (with O(log n) bits)
RBSTs need O(log n) calls to the random generator per
insertion, and O(1) calls per deletion (on average)

92/405
Additional remarks

Storing subtree sizes for balancing is more useful: they


can be used to implement search and deletion by rank,
e.g., find the ith smallest element in the tree
Other operations, e.g., union and intersection are also
efficiently supported by RBSTs
Similar ideas have been used to randomize other search
trees, namely, K -dimensional binary search trees (Duch
and Martínez, 1998) and quadtrees (Duch, 1999) (stay
tuned!)

93/405
To learn more

[1] C. Martínez and S. Roura.


Randomized binary search trees.
J. Assoc. Comput. Mach., 45(2):288–323, 1998.
[2] R. Seidel and C. Aragon.
Randomized search trees.
Algorithmica, 16:464–497, 1996.

94/405
To learn more (2)

[3] J. L. Eppinger.
An empirical study of insertion and deletion in binary
search trees.
Comm. of the ACM, 26(9):663—669, 1983.
[4] W. Panny.
Deletions in random binary search trees: A story of errors.
J. Statistical Planning and Inference, 140(8):2335–2345,
2010.
[5] H. M. Mahmoud.
Evolution of Random Search Trees.
Wiley Interscience, 1992.

95/405
Part II

Probabilistic & Randomized


Dictionaries
4 Randomized Binary Search Trees

5 Skip Lists

6 Hash Tables
Separate Chaining
Open Addressing
Cuckoo Hashing

7 Bloom Filters

96/405
Skip lists

W. Pugh

Skip lists were invented by William Pugh (C. ACM, 1990)


as a simple alternative to balanced trees
The algorithms to search, insert, delete, etc. are very
simple to understand and to implement, and they have
very good expected performance—independent of any
assumption on the input

97/405
Skip lists

A skip list S for a set X consists of:

1. A sorted linked list L1, called level 1, that contains all the elements
   of X
2. A collection of non-empty sorted lists L2, L3, ..., called
   level 2, level 3, ..., such that for all i ≥ 1, if an element x
   belongs to Li then x belongs to Li+1 with probability q, for
   some 0 < q < 1, p := 1 − q

98/405
Skip lists

−OO 12 21 37 40 42 53 66 + OO
Header NIL

To implement this, we store the items of X in a collection of


nodes each holding an item and a variable-size array of
pointers to the item’s successor at each level; an additional
dummy node gives access to the first item of each level

99/405
Skip lists

The level or height of a node x, height(x), is the number of
lists it belongs to.
It is given by a geometric r.v. of parameter p:

Pr{height(x) = k} = p·q^{k−1},   q = 1 − p

The height of the skip list S is the number of non-empty
lists,
height(S) = max_{x∈S} height(x)

100/405
Searching in a skip list

Searching for an item x, 42 < x ≤ 53

−OO 12 21 37 40 42 53 66 + OO
Header NIL

101/405
Implementing skip lists

▷ Returns a pointer to the item with key k, or null
▷ if no such item exists in the skip list S
procedure SEARCH(k, S)
  p := S.header
  ℓ := S.height
  while ℓ > 0 do
    if p→next[ℓ] = null ∨ k ≤ p→next[ℓ]→key then
      ℓ := ℓ − 1
    else
      p := p→next[ℓ]
  if p→next[1] = null ∨ k ≠ p→next[1]→key then
    ▷ k is not present
    return null
  else ▷ k is present, return a pointer to the node
    return p→next[1]
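A minimal C++ sketch of the node layout and of this search procedure (keys only; next[1..height] match the levels used in the slides, and the scan advances while the next key is still smaller than k, which is equivalent to the descent rule above):

#include <vector>

struct SLNode {
    int key;
    std::vector<SLNode*> next;            // next[l] = successor at level l
    SLNode(int k, int h) : key(k), next(h + 1, nullptr) {}
};

struct SkipList {
    SLNode* header;                       // dummy node, one pointer per level
    int height;
    SkipList() : header(new SLNode(0, 1)), height(1) {}

    SLNode* search(int k) const {
        SLNode* p = header;
        for (int l = height; l > 0; --l) {
            while (p->next[l] != nullptr && p->next[l]->key < k)
                p = p->next[l];           // forward step at level l
        }
        SLNode* q = p->next[1];
        return (q != nullptr && q->key == k) ? q : nullptr;
    }
};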
102/405
Insertion in a skip list

Inserting an item x = 48

−OO 12 21 37 40 42 53 66 + OO
Header NIL

103/405
Insertion in a skip list

Inserting an item x = 48

−OO 12 21 37 40 42 53 66 + OO
Header NIL

Geom(p)

48

103/405
Insertion in a skip list

Inserting an item x = 48

−OO 12 21 37 40 42 53 66 + OO
Header NIL

48

103/405
Insertion in a skip list

Inserting an item x = 48

−OO 12 21 37 40 42 48 53 66 + OO
Header NIL

103/405
Implementing skip lists

To insert a new item we go through four phases:


1) Search the given key. The search loop is slightly
different from before, since we need to keep track
of the last node seen at each level before
descending from that level to the one immediately
below.
2) If the given key is already present we only update
the associated value and finish.

104/405
Implementing skip lists

▷ Inserts a new item ⟨k, v⟩, or
▷ updates the value if key k is already present in the skip list S
procedure INSERT(k, v, S)
  p := S.header; ℓ := S.height
  create an array pred of pointers of size S.height
  for i := 1 to S.height do pred[i] := S.header
  while ℓ > 0 do
    if p→next[ℓ] = null ∨ k ≤ p→next[ℓ]→key then
      ▷ p should be the predecessor of the new item
      ▷ at level ℓ
      pred[ℓ] := p; ℓ := ℓ − 1
    else
      p := p→next[ℓ]
  ...
105/405
Implementing skip lists

procedure INSERT(k, v, S)
  ...
  while ... do
    ▷ loop to locate whether k is present or not
    ▷ and to determine the predecessors at each level
  if p→next[1] = null ∨ k ≠ p→next[1]→key then
    ▷ k is not present
    ▷ Insert the new item, see next slide
  else
    ▷ k is present, update its value
    p→next[1]→value := v

106/405
Implementing skip lists

3) When k is not present, create a new node with key


k and value v, and assign a random level r to the
new node, using geometric distribution
4) Link the new node in the first r lists, adding empty
lists if r is larger than the maximum level of the
skip list

107/405
Implementing skip lists

▷ Insert the new item
▷ RNG() generates a random number U(0, 1)
  h := 1
  while RNG() > p do h := h + 1
  nn := new NODE(k, v, h)
  if h > S.height then
    Resize S.header and pred with h − S.height
    new pointers, all set to null and S.header, resp.
    S.height := h
  for i := 1 to h do
    nn→next[i] := pred[i]→next[i]
    pred[i]→next[i] := nn
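Continuing the C++ sketch given after the search pseudocode, insertion with geometric node heights could be sketched as follows (p is assumed to be the stopping probability, as in the slides):

#include <random>
#include <vector>

struct SkipListWithInsert : SkipList {
    double p = 0.632;                      // close to the optimal 1 - 1/e
    std::mt19937 gen{42};
    std::uniform_real_distribution<double> rng{0.0, 1.0};

    void insert(int k) {
        // record the predecessor of the new node at every current level
        std::vector<SLNode*> pred(height + 1, header);
        SLNode* cur = header;
        for (int l = height; l > 0; --l) {
            while (cur->next[l] != nullptr && cur->next[l]->key < k)
                cur = cur->next[l];
            pred[l] = cur;
        }
        if (pred[1]->next[1] && pred[1]->next[1]->key == k) return; // already there

        int h = 1;
        while (rng(gen) > p) ++h;          // geometric height
        SLNode* nn = new SLNode(k, h);
        if (h > height) {                  // grow the header and pred up to h
            header->next.resize(h + 1, nullptr);
            pred.resize(h + 1, header);
            height = h;
        }
        for (int l = 1; l <= h; ++l) {     // link the node into levels 1..h
            nn->next[l] = pred[l]->next[l];
            pred[l]->next[l] = nn;
        }
    }
};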

108/405
Other Operations

Deletions are also very easy to implement
Ordered traversal of the keys is trivially implemented
Skip lists can also support many other operations, e.g.,
merging, search and deletion by rank, finger search, ...
They can also support concurrency and massive
parallelism without too much effort

109/405
Performance of skip lists

A preliminary rough analysis considers the search path


backwards. Imagine we are at some node x and level i:
The height of x is > i and we come from level i + 1 since
the sought key k is smaller than the key of the successor of
x at level i + 1
The height of x is i and we come from x’s predecessor at
level i since k is larger or equal to the key at x

110/405
Performance of skip lists

Figure from W. Pugh’s Skip Lists: A Probabilistic Alternative to Balanced


Trees (C. ACM, 1990)—the meaning of p is the opposite of what we have
used!

111/405
Performance of skip lists

The expected number C(k) of steps to “climb” k levels in an
infinite list:

C(k) = p(1 + C(k)) + (1 − p)(1 + C(k − 1))
     = 1 + p·C(k) + q·C(k − 1)
⟹ C(k) = (1/q)·(1 + q·C(k − 1)) = 1/q + C(k − 1) = k/q,

since C(0) = 0.

112/405
Performance of skip lists

The analysis above is pessimistic since the list is not infinite
and we might “bump” into the header. Then all remaining
backward steps to climb up to a level k are vertical; there are no
more horizontal steps. Thus the expected number of steps to climb
up to level L_n is
≤ (L_n − 1)/q

113/405
Performance of skip lists

L_n = the largest level L for which
E[# of nodes with height ≥ L] ≥ 1/q
The probability that a node has height ≥ k is

Pr{height(x) ≥ k} = Σ_{i≥k} p·q^{i−1} = p·q^{k−1} Σ_{i≥0} q^i = q^{k−1}

The number of nodes with height ≥ k is a binomial r.v. with
parameters n and q^{k−1}, hence

E[# of nodes with height ≥ k] = n·q^{k−1}

Then
n·q^{L_n−1} = 1/q ⟹ L_n = log_q(1/n) = log_{1/q} n

114/405
Performance of skip lists
Then the steps remaining to reach H_n (= the height of a random
skip list of size n) can be analyzed this way:
we need no more horizontal steps than there are nodes with height
≥ L_n; their expected number is ≤ 1/q, by definition
the probability that H_n > k is
1 − (1 − q^k)^n ≤ n·q^k
the expected value of the height H_n can be bounded as

E[H_n] = Σ_{k≥0} P[H_n > k] = Σ_{0≤k<L_n} P[H_n > k] + Σ_{k≥L_n} P[H_n > k]
       ≤ L_n + Σ_{k≥0} P[H_n > L_n + k] ≤ L_n + n·q^{L_n} Σ_{k≥0} q^k
       = L_n + 1/p

thus the expected number of additional vertical steps needed to reach
H_n from L_n is ≤ 1/p
115/405
Performance of skip lists

Summing up, the expected path length of a search is

≤ (L_n − 1)/q + 1/q + 1/p = (1/q)·log_{1/q} n + 1/p

On the other hand, the average number of pointers per node is
1/p, so there is a trade-off between space and time:
p → 0, q → 1 ⟹ very tall “nodes”, short horizontal cost
p → 1, q → 0 ⟹ flat skip lists
Pugh suggested p = 3/4 as a good practical choice; the
optimal choice minimizes the factor (q ln(1/q))^{−1} ⟹
q = e^{−1} = 0.36..., p = 1 − e^{−1} ≈ 0.632...

116/405
Analysis of the height

W. Szpankowski V. Rego

Theorem (Szpankowski and Rego, 1990)

E[H_n] = log_{1/q} n + γ/ln(1/q) + 1/2 + δ(log_{1/q} n) + O(1/n),

where γ = 0.577... is Euler's constant and δ(t) is a
fluctuation of period 1, mean 0 and small amplitude.

117/405
Analysis of the forward cost

The number of forward steps Fn;k is the number of weak


left-to-right maxima in ak ; ak 1 ; : : : ; a1 , with ai = height(xi )

−OO 12 21 37 40 42 53 66 + OO
Header NIL

118/405
Analysis of the forward cost

Total unsuccessful search cost:

C_n = Σ_{0≤k≤n} C_{n,k} = n·H_n + F_n

Total forward cost:

F_n = Σ_{0≤k≤n} F_{n,k}
119/405
Analysis of the forward cost

P. Kirschenhofer H. Prodinger

Theorem (Kirschenhofer, Prodinger, 1994)

The expected total forward cost in a random skip list of
size n is

E[F_n] = (1/q − 1)·n·( log_{1/q} n + γ/ln(1/q) − 1/2 + (1/ln(1/q))·δ̄(log_{1/q} n) ) + O(log n),

where γ = 0.577... is Euler's constant and δ̄ a periodic
120/405
Skip Lists in Real Life

Source: Wikipedia

121/405
To learn more

[1] L. Devroye.
A limit theory for random skip lists.
The Annals of Applied Probability, 2(3):597–609, 1992.
[2] P. Kirschenhofer and H. Prodinger.
The path length of random skip lists.
Acta Informatica, 31(8):775–792, 1994.
[3] P. Kirschenhofer, C. Martínez and H. Prodinger.
Analysis of an Optimized Search Algorithm for Skip Lists.
Theoretical Computer Science, 144:199–220, 1995.

122/405
To learn more (2)

[4] T. Papadakis, J. I. Munro, and P. V. Poblete.


Average search and update costs in skip lists.
BIT, 32:316–332, 1992.
[5] H. Prodinger.
Combinatorics of geometrically distributed random
variables: Left-to-right maxima.
Discrete Mathematics, 153:253–270, 1996.
[6] W. Pugh.
Skip lists: a probabilistic alternative to balanced trees.
Comm. ACM, 33(6):668–676, 1990.
[7] W. Pugh.
A Skip List Cookbook.
Technical Report UMIACS–TR–89–72.1. U. Maryland,
College Park, 1989.
123/405
Part II

Probabilistic & Randomized


Dictionaries
4 Randomized Binary Search Trees

5 Skip Lists

6 Hash Tables
Separate Chaining
Open Addressing
Cuckoo Hashing

7 Bloom Filters

124/405
Hash Tables

A hash table (Spanish: tabla de dispersión) allows us to store a set
of elements (or pairs ⟨key, value⟩) using a hash function
h : K → I, where I is the set of indices or addresses into the
table, e.g., I = [0..M − 1].
Ideally, the hash function h would map every element (or rather,
its key) to a distinct address of the table, but this is hardly
possible in a general situation, and we should expect to find
collisions (different keys mapping to the same address) as soon
as the number of elements stored in the table is n = Θ(√M).

125/405
Hash Tables

If the hash function evenly “spreads” the keys, the hash table
will be useful as there will be a small number of keys mapping
to any given address of the table.
Given two distinct keys x and y , we say that they are
synonyms, also that they collide if h(x) = h(y ).
A fundamental problem in the implementation of a dictionary
using a hash table is to design a collision resolution strategy.

126/405
Hash Functions

A good hash function h must enjoy the following properties:

1. It is easy to compute.
2. It must evenly spread the set of keys K: for all i, 0 ≤ i < M,

#{k ∈ K | h(k) = i} / #{k ∈ K} ≈ 1/M

127/405
Collision Resolution

Collision resolution strategies can be grouped into two main
families. For historical reasons (not very logically) they are
called:
Open hashing: separate chaining, 2-way chaining,
coalesced hashing, ...
Open addressing: linear probing, double hashing,
quadratic hashing, cuckoo hashing, ...

128/405
Separate Chaining

In separate chaining, each slot in the hash table has a pointer


to a linked list of synonyms.
template <typename Key, typename Value,
template <typename> class HashFunct = Hash>
class Dictionary {
...
private:

struct node {
Key _k;
Value _v;
...
};
vector<list<node>> _Thash; // array of linked lists of synonyms
int _M; // capacity of the table
int _n; // number of elements
double _alpha_max; // max. load factor
};

129/405
Separate Chaining
M = 13, X = {0, 4, 6, 10, 12, 13, 17, 19, 23, 25, 30}, h(x) = x mod M

slot 0:  13 → 0
slot 4:  30 → 17 → 4
slot 6:  19 → 6
slot 10: 23 → 10
slot 12: 25 → 12
(all other slots hold empty lists)
130/405
Separate Chaining

For insertions, we access the appropriate linked list using the
hash function, and scan the list to find out whether the key is
already present or not. If present, we modify the associated
value; if not, a new node with the pair ⟨key, value⟩ is added to
the list.
Since the lists contain very few elements each, the simplest
and most efficient solution is to add elements at the front.
There is no need for double links, sentinels, etc. Sorting the
lists or using some other sophisticated data structure instead of
linked lists does not bring real practical benefits.

131/405
Separate Chaining

Searching is also simple: access the appropriate linked list using
the hash function and sequentially scan it to locate the key or to
report an unsuccessful search.

132/405
Separate Chaining

procedure INSERT(T, k, v)
  if n/M > α_max then
    RESIZE(T)
  i := HASH(k)
  p := __LOOKUP(T, i, k)
  if p = null then
    p := new NODE(k, v)
    p.next := T[i]
    T[i] := p
    n := n + 1
  else
    p.value := v

133/405
Separate Chaining

procedure LOOKUP(T, k, found, v)
  i := HASH(k)
  p := __LOOKUP(T, i, k)
  if p = null then
    found := false
  else
    found := true
    v := p.value

procedure __LOOKUP(T, i, k)
  p := T[i]
  while p ≠ null ∧ p.key ≠ k do
    p := p.next
  return p
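A minimal C++ sketch of separate chaining along the lines of the class skeleton shown earlier (no resizing, int keys and values, std::hash as the hash function):

#include <functional>
#include <list>
#include <optional>
#include <utility>
#include <vector>

class ChainedTable {
    std::vector<std::list<std::pair<int,int>>> table;   // one list of synonyms per slot
    std::size_t n = 0;                                   // number of stored elements
public:
    explicit ChainedTable(std::size_t M) : table(M) {}

    void insert(int k, int v) {
        auto& bucket = table[std::hash<int>{}(k) % table.size()];
        for (auto& kv : bucket)
            if (kv.first == k) { kv.second = v; return; } // key present: update value
        bucket.push_front({k, v});                        // otherwise add at the front
        ++n;
    }

    std::optional<int> lookup(int k) const {
        const auto& bucket = table[std::hash<int>{}(k) % table.size()];
        for (const auto& kv : bucket)
            if (kv.first == k) return kv.second;
        return std::nullopt;                              // unsuccessful search
    }
};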

134/405
The Cost of Separate Chaining

Let n be the number of elements stored in the hash table. On
average, each linked list contains α = n/M elements and the
cost of lookups (either successful or unsuccessful), of
insertions and of deletions will be proportional to α. If α is a
small constant value then the cost of all basic operations is, on
average, Θ(1). However, it can be shown that the expected
length of the largest synonym list is Θ(log n / log log n).
The value α is called the load factor, and the performance of the
hash table will depend on it.

135/405
Open Addressing

In open addressing, synonyms are stored in the hash table itself.
Searches and insertions probe a sequence of positions until the
given key or an empty slot is found. The sequence of probes
starts at position i_0 = h(k) and continues with i_1, i_2, ... The
different open addressing strategies use different rules to define
the sequence of probes. The simplest one is linear probing:

i_1 = i_0 + 1, i_2 = i_1 + 1, ...,

taking all positions modulo M.

136/405
Linear Probing
M = 13, X = {0, 4, 6, 10, 12, 13, 17, 19, 23, 25, 30}, h(x) = x mod M (increment 1)

After inserting {0, 4, 6, 10, 12}: each key lands in its home slot (0, 4, 6, 10, 12); all other slots are free.
After inserting {13, 17, 19, 23}: each key collides with its home slot and moves one position forward: 13 → slot 1, 17 → slot 5, 19 → slot 7, 23 → slot 11.
After inserting {25, 30}: 25 (home slot 12) wraps around past the occupied slots 0 and 1 into slot 2; 30 (home slot 4) skips the occupied slots 5, 6 and 7 into slot 8.
Final table: [0, 13, 25, –, 4, 17, 6, 19, 30, –, 10, 23, 12] (slots 3 and 9 free).
Linear Probing

procedure LOOKUP(T, k, found, v)
    i := __LOOKUP(T, k)
    if T[i].free then
        found := false
    else
        found := true
        v := T[i].value

procedure __LOOKUP(T, k)
    ▷ we assume at least one free slot
    i := HASH(k)
    while ¬T[i].free ∧ T[i].key ≠ k do
        i := (i + 1) mod M
    return i
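A minimal, self-contained C++ sketch of insertion with linear probing; the Slot layout, the use of std::hash and the assumption that the caller keeps the load factor below 1 are simplifications, not the course's Dictionary class.

// Linear probing insertion sketch.
#include <vector>
#include <functional>

template <typename Key, typename Value>
struct Slot { bool free = true; Key key; Value value; };

template <typename Key, typename Value>
void insert(std::vector<Slot<Key, Value>>& T, const Key& k, const Value& v) {
    // assumes at least one free slot (load factor < 1)
    std::size_t M = T.size();
    std::size_t i = std::hash<Key>{}(k) % M;
    while (!T[i].free && !(T[i].key == k))
        i = (i + 1) % M;          // probe the next slot, circularly
    T[i].free = false;            // place (or update) the key here
    T[i].key = k;
    T[i].value = v;
}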

138/405
Other Open Addressing Schemes

As we have already mentioned, different probe sequences give us
different open addressing strategies. In general, the sequence
of probes is given by

i_0 = h(x),    i_j = i_{j−1} ⊕ δ(j, x),

where x ⊕ y denotes x + y (mod M).

139/405
Other Open Addressing Schemes

1 Linear Probing: δ(j, x) = 1 (or a constant); i_j = h(x) ⊕ j
2 Quadratic Hashing: δ(j, x) = a·j + b;
  i_j = h(x) ⊕ (A·j² + B·j + C); the constants a and b must be
  carefully chosen to guarantee that the probe sequence
  will ultimately explore all the table if necessary
3 Double Hashing: δ(j, x) = h₂(x) for a second independent
  hash function h₂ such that h₂(x) ≠ 0; i_j = h(x) ⊕ j·h₂(x)
4 Uniform Hashing: i₀, i₁, ... is a random permutation of
  {0, ..., M − 1}
5 Random Probing: i₀, i₁, ... is a random sequence such
  that 0 ≤ i_k < M, for all k, and it contains every value in
  {0, ..., M − 1} at least once
140/405
Other Open Addressing Schemes
Uniform Hashing and Random Probing are completely
impractical algorithms; they are interesting as idealizations
—they do not suffer from clustering
Linear Probing suffers from primary clustering. There are only
M distinct probe sequences, the M circular permutations
of 0, 1, ..., M − 1
Quadratic Hashing and other methods with δ(j, x) = f(j)
(a non-constant function of j only) behave almost as the
schemes with no clustering; they exhibit secondary clustering: two keys such that
h(x) = h(y) will probe exactly the same sequence of slots,
but if a key x probes i_j in the j-th step and a key y probes i′_k in
the k-th step, then i_{j+1} and i′_{k+1} will probably be different
Double Hashing and its generalizations are even better; they
exhibit secondary (more generally, k-ary) clustering, as they
depend on (k − 1) evaluations of independent hash
functions

141/405
Other Open Addressing Schemes

In linear probing two keys will have the same probe
sequence with probability 1/M; in a scheme with
secondary clustering that probability drops to 1/(M(M − 1))
The average performance of schemes with k-ary
clustering, k ≥ 2, is close to that of uniform hashing (no
clustering)
Random probing also approximates well the performance
of uniform hashing

142/405
The Cost of Open Addressing

We will focus on the following parameters (we assume M is
fixed):
1 U_n: number of probes in an unsuccessful search that starts
at a random slot in a table with n items
2 S_{n,i}: number of probes in the successful search of the i-th
inserted item when the table contains n items, 1 ≤ i ≤ n
We will actually be more interested in S_n := S_{n,U}, where U is a
uniform random value in {1, ..., n}; that is, S_n is the cost of a
successful search for a random item in a table with n items

143/405
The Cost of Open Addressing

The cost of the (n + 1)-th insertion is given by U_n
With the FCFS insertion policy, an item is inserted
where the unsuccessful search terminated and it is never
moved from there, hence

S_{n,i} =ᴰ U_{i−1}

where =ᴰ denotes equality in distribution

144/405
The Cost of Open Addressing

Consider random probing. What is Ū_n = E[U_n]?
With one probe we land in an empty slot and we are done; the
probability of this is (1 − α). If the first slot probed is occupied (probability
α), we probe a second slot, which is empty with probability
1 − α. And so on. Thus

Ū_n = 1·(1 − α) + 2·α·(1 − α) + 3·α²·(1 − α) + ···
    = Σ_{k>0} k·α^{k−1}·(1 − α) = (1 − α)·Σ_{k>0} d(α^k)/dα
    = (1 − α)·(d/dα) Σ_{k>0} α^k = (1 − α)·(d/dα)(α/(1 − α)) = 1/(1 − α)

145/405
The Cost of Open Addressing

And for the expected successful search we have

S̄_n = E[S_n] = (1/n) Σ_{1≤i≤n} E[S_{n,i}] = (1/n) Σ_{1≤i≤n} Ū_{i−1}

Approximating the sum by an integral (Euler–Maclaurin),

S̄_n = (1/n) Σ_{1≤i≤n} Ū_{i−1} ≈ (1/α) ∫₀^α dx/(1 − x) = (1/α) ln 1/(1 − α)

146/405
The Cost of Open Addressing

The actual expected costs of open addressing with uniform hashing
(and thus of quadratic hashing and double hashing) are slightly
different from those of random probing; a few small correction
terms must be introduced. To leading order,

Ū_n ≈ 1/(1 − α)
S̄_n = (1/α) ∫₀^α Ū(x) dx ≈ (1/α) ln 1/(1 − α)

147/405
The Cost of Open Addressing

The analysis of linear probing turns out to be more challenging
than one might think at first.
The average cost of an unsuccessful search is

Ū_n ≈ (1/2) (1 + 1/(1 − α)²)

The average cost of a successful search is

S̄_n = (1/α) ∫₀^α Ū(x) dx ≈ (1/2) (1 + 1/(1 − α))

148/405
The Cost of Open Addressing

Comparison of experimental vs. theoretical expected cost of


successful search in linear probing and quadratic hashing
149/405
Cuckoo Hashing

Rasmus Pagh Flemming F. Rodler

In cuckoo hashing we have two tables T1 and T2 of size M
each, and two hash functions h₁, h₂ : U → {0, ..., M − 1}.
The worst-case complexity of searches and deletions in a
cuckoo hash table is Θ(1). We can insert in such a table n < M
items: the load factor α = n/(2M) must be strictly less than 1/2
in order to guarantee constant expected time for insertions.

150/405
Cuckoo Hashing

To insert a new item x, we probe slot T1[h1(x)]; if it is empty, we
put x there and stop. Otherwise, if some y already sits in that slot, then
x kicks out y: x is put in T1[h1(x)] and y moves to T2[h2(y)].
If that slot in T2 is empty, we are done, but if some z occupies
T2[h2(y)], then y is put in its second “nest” and z is kicked out to
T1[h1(z)], and so on.
These kick-outs give this strategy its name. If this
procedure succeeds in inserting n keys then each key x can only
appear in one of its two nests, T1[h1(x)] or T2[h2(x)],
and nowhere else!
151/405
Cuckoo Hashing

152/405
Cuckoo Hashing

procedure LOOKUP(T, k, found, v)
    n1 := T1[h1(k)]
    if ¬n1.free ∧ n1.key = k then
        found := true; v := n1.value
    else
        n2 := T2[h2(k)]
        if ¬n2.free ∧ n2.key = k then
            found := true; v := n2.value
        else
            found := false

Only two probes are necessary in the worst case! To delete, we
locate with ≤ 2 probes the key to remove and mark the slot as
free.
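A minimal C++ sketch of this lookup over two arrays of slots; the Slot layout and the two std::function hash objects are simplifying assumptions, not part of the course code.

// Cuckoo lookup sketch: at most two probes.
#include <vector>
#include <functional>

template <typename Key, typename Value>
struct Slot { bool free = true; Key key; Value value; };

template <typename Key, typename Value>
bool cuckoo_lookup(const std::vector<Slot<Key, Value>>& T1,
                   const std::vector<Slot<Key, Value>>& T2,
                   const std::function<std::size_t(const Key&)>& h1,
                   const std::function<std::size_t(const Key&)>& h2,
                   const Key& k, Value& v) {
    const Slot<Key, Value>& n1 = T1[h1(k) % T1.size()];
    if (!n1.free && n1.key == k) { v = n1.value; return true; }  // first nest
    const Slot<Key, Value>& n2 = T2[h2(k) % T2.size()];
    if (!n2.free && n2.key == k) { v = n2.value; return true; }  // second nest
    return false;
}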

153/405
Cuckoo Hashing

The insertion of an item x can fail because we enter an
infinite loop of items, each kicking out the next in the cycle
...
The solution to the problem: nuke the table! Draw two new
hash functions, and rehash everything again with the two
new functions.
This rehashing is clearly quite costly; moreover, we don't
have a guarantee that the new functions will succeed
where the old ones failed!
We will see, however, that insertion has expected amortized
constant cost, or equivalently, that the expected cost of n
insertions is Θ(n)

154/405
Cuckoo Hashing

procedure INSERT(T = ⟨T1, T2⟩, k, v)
    if k ∈ T then ▷ update v and return
    else
        if n = M − 1 then            ▷ M = |T1| = |T2|
            RESIZE(T)
            REHASH(T)                ▷ can't insert ≥ M elements
        x := NODE(k, v); x.free := false
        for i := 1 to MAXITER(n, M) do
                                     ▷ for example, MAXITER(n, M) = 2n
            x :=: T1[h1(x.key)]      ▷ ":=:" swaps x with the table slot
            if x.free then return
            x :=: T2[h2(x.key)]
            if x.free then return
        ▷ Insertion failed! pick new functions h1 and h2
        REHASH(T)
        INSERT(T, k, v)              ▷ retry with the new functions
155/405
Cuckoo Hashing

We say that an insertion is good if it does not run into an infinite
loop (our implementation protects against infinite loops by bounding
the number of iterations).
A “high-level analysis” of the cost of insertions follows from:
1 The expected number of steps/iterations in a good
insertion is Θ(1)
2 The probability that the insertion of an item is not good is
O(1/n²)

156/405
Cuckoo Hashing

3 By the union bound, the probability that we fail to make n
consecutive good insertions is O(1/n)
4 The expected total cost of making n good
insertions—conditioned on the event that we can make
them—is n · Θ(1) = Θ(n)

157/405
Cuckoo Hashing

1 The expected number of times we need to rehash a set of
n items until we can insert all of them with good insertions is given
by a geometric r.v. with probability of success 1 − O(1/n):

E[# rehashes] = 1/(1 − O(1/n)) = 1 + O(1/n)

2 Each rehash plus the attempt to insert the n items with good
insertions has expected cost Θ(n)
3 By Wald's lemma, the expected cost of the n insertions will be

E[# rehashes] · E[cost of a rehash] = (1 + O(1/n)) · O(n) = O(n)

158/405
Cuckoo Hashing
To prove facts #1 (a good insertion needs expected O(1) time)
and #2 (the probability that an insertion is good is 1 − O(1/n²)) we
formulate the problem in graph-theoretic terms.

159/405
Cuckoo Hashing

Cuckoo graph:
Vertices: V = {v_{1,i}, v_{2,i} | 0 ≤ i < M} =
the set of 2M slots in the tables
Edges: If T1[j] is occupied by x then there's an edge
(v_{1,j}, v_{2,h2(x)}), where v_{ℓ,j} is the vertex associated to T_ℓ[j]; x
is the label of the edge. If T2[k] is occupied by y then there
is an edge (v_{2,k}, v_{1,h1(y)}) with label y.
This is a labeled directed “bipartite” multigraph—all edges go
from v_{1,j} to v_{2,k} or from v_{2,k} to v_{1,j}.

160/405
Cuckoo Hashing

Consider the connected components of the cuckoo graph. A
component can be either a tree (no cycles), unicyclic (exactly
one cycle—with trees “hanging” from it) or complex (two or more
cycles). Trees with k nodes have exactly k − 1 edges, unicyclic
components have exactly k edges, and complex components have > k
edges.
Fact 1: An insertion that creates a complex component is
not good ⟹ if the cuckoo graph contains no complex
components then all insertions were good
Fact 2: the expected time of a good insertion is bounded
by the expected diameter of the component in which we
make the insertion (also by its size)

161/405
Cuckoo Hashing

Then we convert the analysis to that of the cuckoo graph as a
random bipartite graph with 2M vertices and n = (1 − ε)M
edges—each item gives us an edge.
This is a very “sparse” graph, but if the density n/(2M) grew to 1/2
there would be a complex component with very high probability
(a similar thing happens in random Erdős–Rényi graphs).

162/405
Cuckoo Hashing
The most detailed analysis of the cuckoo graph has been made
by Drmota and Kutzelnigg (2012). They prove, among many
other things:
1 The probability that the cuckoo graph contains no complex
component is

1 − (1/M)·h(ε) + O(1/M²)

We do not reproduce their explicit formula for h(ε) here
(h(ε) → ∞ as ε → 0)
2 The expected number of steps in n good insertions is

≈ n · min(4, ln(1/ε)/(1 − ε)) + O(1)

These two results prove the two Facts that we needed for our
analysis
163/405
Cuckoo Hashing

Several variants of cuckoo hashing have appeared in the
literature, for instance, using d > 2 tables and d hash
functions. With such d-cuckoo hashing higher load factors,
approaching 1, can be achieved
An interesting variant puts all items in one single table; all
the d ≥ 2 hash functions map keys into the range 0..M − 1,
and the load factor n/M must stay below some threshold α_d. We
need to know which function was used to put an item at an
occupied location—this is easily done using log₂ d bits.

164/405
Part II

Probabilistic & Randomized


Dictionaries
4 Randomized Binary Search Trees

5 Skip Lists

6 Hash Tables
Separate Chaining
Open Addressing
Cuckoo Hashing

7 Bloom Filters

165/405
Bloom filters

A Bloom Filter is a probabilistic data structure representing a
set of items; it supports:
Addition of items: F := F ∪ {x}
Fast lookup: x ∈ F?
Bloom filters require very little memory and are especially
well suited for unsuccessful searches (when x ∉ F)

166/405
Bloom filters

The price to pay for the reduced memory consumption and
very fast lookup is a non-null probability of false positives.
If x ∈ F then a lookup in the filter will always return true;
but if x ∉ F then there is some probability that we get a
positive answer from the filter.
In other words, if the filter says x ∉ F we are sure that is the
case, but if the filter says x ∈ F there is some probability
that this is an error.

167/405
Bloom filters

Bloom filters are the most basic example of the so-called


Approximate Membership Query Filters (AMQ filters) and
support the following operations:
1 F := CREATEBF(Nmax, fp): creates an empty Bloom filter
F that might store up to Nmax items, and sets an upper
bound fp on the false positive rate allowed
2 F.INSERT(x): adds item x to the filter F
3 F.LOOKUP(x): returns whether x belongs to the filter F or
not
  if the answer is true, it might be wrong with probability ≤ fp
  if the answer is false, then x ∉ F for sure

168/405
Implementing Bloom filters

To represent a Bloom filter for a subset of items drawn from the


domain U we will use:
1 A bitvector A of size M
2 A set of k pairwise independent hash functions
{h₁, ..., h_k}, each h_i : U → {0, ..., M − 1}
The values of M and k are carefully chosen as a function of
Nmax and fp

169/405
Implementing Bloom filters

procedure CREATEBF(Nmax, fp)
    M := ...; k := ...
    A := bitvector[0..M − 1]
    for i := 0 to M − 1 do A[i] := 0
    for j := 1 to k do h_j := a random hash function

The k independent hash functions can be chosen from a
universal class of hash functions (later in this course)

170/405
Insertion & lookup

procedure INSERT(x)
    for j := 1 to k do
        A[h_j(x)] := 1

procedure LOOKUP(x)
    for j := 1 to k do
        if A[h_j(x)] = 0 then
            return false
    return true
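A compact C++ sketch of the same two operations; deriving the k hash functions by salting a single std::hash with the index j is a simplifying assumption, not the universal-class construction mentioned above.

// Minimal Bloom filter sketch.
#include <vector>
#include <string>
#include <functional>

class BloomFilter {
public:
    BloomFilter(std::size_t M, unsigned k) : A(M, false), k(k) {}

    void insert(const std::string& x) {
        for (unsigned j = 0; j < k; ++j) A[h(x, j)] = true;
    }
    bool lookup(const std::string& x) const {        // may return a false positive
        for (unsigned j = 0; j < k; ++j)
            if (!A[h(x, j)]) return false;           // never a false negative
        return true;
    }
private:
    std::size_t h(const std::string& x, unsigned j) const {
        // salted hash: an assumption standing in for k independent functions
        return std::hash<std::string>{}(x + char('0' + j)) % A.size();
    }
    std::vector<bool> A;   // the bitvector
    unsigned k;            // number of hash functions
};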

171/405
Insertion & lookup

Source: D. Medjedovic & E. Tahirovic, Algorithms and Data


Structures for Massive Datasets, 2022
172/405
Insertion & Lookup

Source: D. Medjedovic & E. Tahirovic, Algorithms and Data


Structures for Massive Datasets, 2022
173/405
Analysis of Bloom filters

Probability that the j-th bit is not updated when inserting x:

∏_{i=1}^{k} P[h_i(x) ≠ j] = (1 − 1/M)^k

Probability that the j-th bit is not updated after n insertions:

∏_{ℓ=1}^{n} P[A[j] is not updated in the ℓ-th insertion]
    = ((1 − 1/M)^k)^n = (1 − 1/M)^{kn}

174/405
Analysis of Bloom filters

Probability that A[j] = 1 after n insertions:

1 − (1 − 1/M)^{kn}

Probability that the k checked bits are all set to 1 ≈ probability of
a false positive:

(1 − (1 − 1/M)^{kn})^k ≈ (1 − e^{−kn/M})^k

if n = αM, for some α > 0, using

(1 − a/x)^{bx} → e^{−ba},   as x → ∞

175/405
Analysis of Bloom filters

The derivation above is the so-called classic model for
Bloom filters—but it is not the formula that Bloom himself
derived in his paper!
The approximation fails for small filters; correct formulas
have been derived by Bose et al. (2008) and Christensen
et al. (2010)
For the rest of the presentation we will take

P[x is a false positive] = P[x ∉ F ∧ F.contains(x) = true]
    ≈ (1 − e^{−kn/M})^k,

where x is drawn at random. Be careful! The formula does
not give the probability that the filter reports x as a positive,
conditioned on x being negative!
176/405
Optimal parameters for Bloom filters

Fix n and M. The optimal value k* minimizes the
probability of a false positive, thus

d/dk (1 − e^{−kn/M})^k |_{k=k*} = 0,

which gives

k* ≈ (M/n)·ln 2 ≈ 0.69·(M/n)

Call p the probability of a false positive. This probability is a
function of k, p = p(k); for the optimal choice k* we have

p(k*) ≈ (1 − e^{−ln 2})^{(M/n)·ln 2} = (1/2)^{(M/n)·ln 2} ≈ 0.6185^{M/n}
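For example, with M/n = 8 bits per stored item the optimal number of hash functions is k* ≈ 0.69 · 8 ≈ 5.5 (so 5 or 6 in practice) and the false positive probability is about 0.6185⁸ ≈ 0.021, i.e., roughly 2%.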

177/405
Optimal parameters for Bloom filters

Suppose that you want the probability of a false positive
p* = p(k*) to remain below some bound P:

p* ≤ P ⟹ ln p* = −(M/n)·(ln 2)² ≤ ln P

(M/n)·(ln 2)² ≥ −ln P = ln(1/P)

M/n ≥ (1/(ln 2)²)·ln(1/P) = (1/ln 2)·log₂(1/P) ≈ 1.44·log₂(1/P)

M ≥ 1.44 · n · log₂(1/P)

178/405
Optimal parameters for Bloom filters

procedure CREATEBF(Nmax, fp)
    M := 1.44 · Nmax · log₂(1/fp)
    k := log₂(1/fp)
    ...
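A one-function C++ sketch of this sizing rule; rounding M up and k to the nearest integer are assumptions about how to discretize the formulas.

// Sizing a Bloom filter from Nmax and the target false-positive rate fp.
#include <cmath>
#include <cstddef>

struct BFParams { std::size_t M; unsigned k; };

BFParams bloom_params(std::size_t Nmax, double fp) {
    double bits_per_item = 1.44 * std::log2(1.0 / fp);
    std::size_t M = static_cast<std::size_t>(std::ceil(bits_per_item * Nmax));
    unsigned k = static_cast<unsigned>(std::lround(std::log2(1.0 / fp)));
    if (k == 0) k = 1;
    return {M, k};
}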

179/405
Optimal parameters for Bloom filters

If we want a Bloom filter for a database that will store about
n ≈ 10⁸ elements with a false positive rate ≤ 5%, we need
a bitvector of size M ≈ 624·10⁶ bits (that's around 74 MB
of memory).
Although this amount of memory is big, it is only a small
fraction of the size of the database itself: even if we store
only keys of 32 bytes each, the database occupies more
than 3 GB.
The optimal number k of hash functions for the example
above is 4.32 (⟹ use 4 or 5 hash functions for optimal
performance)

180/405
To learn more

[1] B.H. Bloom.


Space/Time Trade-offs in Hash Coding with Allowable
Errors.
Communications of the ACM 13 (7): 422–426, 1970.
[2] A. Broder and M. Mitzenmacher.
Network Applications of Bloom Filters: A Survey
Internet Mathematics 1 (4):485–509, 2003.

181/405
To learn more (2)

[3] P. Bose, H. Guo, E.Kranakis et al.


On the False-Positive Rate of Bloom Filters
Information Processing Letters 108 (4):210–213, 2008.
[4] K. Christensen, A. Roginsky and M. Jimeno.
A New Analysis of the False-Positive Rate of a Bloom
Filter
Information Processing Letters 110 (21):944–949, 2010.

182/405
Part III

Priority Queues

8 Priority Queues: Introduction

9 Heaps

10 Binomial Queues

11 Fibonacci Heaps

183/405
Priority Queues: Introduction

A priority queue (Spanish: cola de prioridad) stores a collection of


elements, each one endowed with a value called its priority.
Priority queues support the insertion of new elements and the
query and removal of an element of minimum (or maximum)
priority.

184/405
Introduction

template <typename Elem, typename Prio>


class PriorityQueue {
public:
...
// Adds an element x with priority p to the priority queue.
void insert(const Elem& x, const Prio& p);

// Returns an element of minimum priority; throws an


// exception if the queue is empty.
Elem min() const;

// Returns the priority of an element of minimum priority; throws an


// exception if the queue is empty.
Prio min_prio() const;

// Removes an element of minimum priority; throws an


// exception if the queue is empty.
void remove_min();

// Returns true iff the priority queue is empty


bool empty() const;
};

185/405
Priority Queues: Introduction

// We have two arrays Weight and Symb with the atomic
// weights and the symbols of n chemical elements, e.g.,
// Symb[i] = "C" and Weight[i] = 12.2, for some i.
// We use a priority queue to sort the information in ascending
// alphabetical order of the symbols

PriorityQueue<double, string> P;
for (int i = 0; i < n; ++i)
    P.insert(Weight[i], Symb[i]);
int i = 0;
while (not P.empty()) {
    Weight[i] = P.min();
    Symb[i] = P.min_prio();
    ++i;
    P.remove_min();
}

186/405
Priority Queues: Introduction

Several techniques that are used for the implementation of
dictionaries can also be used for priority queues (but not hash
tables).
For instance, balanced search trees such as AVL trees can be
used to implement a PQ with cost O(log n) for insertions
and deletions

187/405
Part III

Priority Queues

8 Priority Queues: Introduction

9 Heaps

10 Binomial Queues

11 Fibonacci Heaps

188/405
Heaps

Definition
A heap is a binary tree such that
1 All empty subtrees are located in the last two levels
of the tree.
2 If a node has an empty left subtree then its right
subtree is also empty.
3 The priority of any element is greater than or equal to
the priority of any element in its descendants.

189/405
Heaps

Because of properties 1–2 in the definition, a heap is a
quasi-complete binary tree. Property 3 is called heap order
(for max-heaps).

If the priority of each element is smaller than or equal to that of its
descendants then we talk about min-heaps.

190/405
Heaps

[Figure: a max-heap with n = 10 elements and height 4. The root 76 is at
level 0; 72 and 34 are at level 1; 59, 63, 29 and 17 at level 2; 37, 33 and 29
at level 3. All the leaves lie in the last two levels.]
191/405
Heaps

Proposition
1 The root of a max-heap stores an element of
maximum priority.
2 A heap of size n has height h = ⌈log₂(n + 1)⌉.

If heaps are used to implement a PQ, the query for a max/min
element and its priority is trivial: we only need to examine the
root of the heap.

192/405
Heaps: Removing the maximum

1 Replace the root of the heap with the last element (the
rightmost element in the last level)
2 Reestablish the invariant (heap order) sinking the root:
The function sink exchanges a given node with its largest
priority child, if its priority is smaller than the priority of its
child, and repeats the same until the heap order is
reestablished.

193/405
Heaps: Removing the maximum

194/405
Heaps: Adding a new element

1 Add the new element as rightmost node of the last level of


the heap (or as the first element of a new deeper level)
2 Reestablish the heap order sifting up (a.k.a. floating) the
new added element:
The function siftup compares the given node to its
father, and they are exchanged if its priority is larger than
that of its father; the process is repeated until the heap
order is reestablished.

195/405
The Cost of Heaps

Since the height of a heap is Θ(log n), the cost of removing the
maximum and the cost of insertions is O(log n).

We can implement heaps with dynamically allocated nodes,
and three pointers per node (left, right, father) . . . But it is much
easier and more efficient to implement heaps with vectors!
Since the heap is a quasi-complete binary tree, this allows us to
avoid wasting memory: the n elements are stored in the first n
components of the vector, which implicitly represents the tree.

196/405
Implementing Heaps

To make the rules easier we will use a vector A of size n + 1 and
discard A[0]. Resizing can be used to allow unlimited growth.
1 A[1] contains the root
2 If 2i ≤ n then A[2i] contains the left child of A[i], and if
2i + 1 ≤ n then A[2i + 1] contains the right child of A[i]
3 If i ≥ 2 then A[i/2] contains the father of A[i]
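These index rules translate directly into three one-line helpers; a small sketch (the PriorityQueue code below writes the same expressions inline instead).

// Index arithmetic for the implicit tree stored in A[1..n].
inline int parent(int i)      { return i / 2; }      // valid for i >= 2
inline int left_child(int i)  { return 2 * i; }      // valid if 2*i <= n
inline int right_child(int i) { return 2 * i + 1; }  // valid if 2*i+1 <= n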

197/405
Implementing Heaps

template <typename Elem, typename Prio>


class PriorityQueue {
public:
...
private:
// Component of index 0 is not used
vector<pair<Elem, Prio> > h;
int nelems;

void siftup(int j) throw();


void sink(int j) throw();
};

198/405
Implementing Heaps

template <typename Elem, typename Prio>


bool PriorityQueue<Elem,Prio>::empty() const {

return nelems == 0;
}

template <typename Elem, typename Prio>


Elem PriorityQueue<Elem,Prio>::min() const {

if (nelems == 0) throw EmptyPriorityQueue;


return h[1].first;
}

template <typename Elem, typename Prio>


Prio PriorityQueue<Elem,Prio>::min_prio() const {

if (nelems == 0) throw EmptyPriorityQueue;


return h[1].second;
}

199/405
Implementing Heaps

template <typename Elem, typename Prio>


void PriorityQueue<Elem,Prio>::insert(const Elem& x,
                                      const Prio& p) {
++nelems;
h.push_back(make_pair(x, p));
siftup(nelems);
}

template <typename Elem, typename Prio>


void PriorityQueue<Elem,Prio>::remove_min() {

if (nelems == 0) throw EmptyPriorityQueue;


swap(h[1], h[nelems]);
--nelems;
h.pop_back();
sink(1);
}

200/405
Implementing Heaps

// Cost: O(log(n/j))
template <typename Elem, typename Prio>
void PriorityQueue<Elem,Prio>::sink(int j) {

// if j has no left child we are at the last level


if (2 * j > nelems) return;

int minchild = 2 * j;
if (minchild < nelems and
h[minchild].second > h[minchild + 1].second)
++minchild;

// minchild is the index of the child with minimum priority


if (h[j].second > h[minchild].second) {
swap(h[j], h[minchild]);
sink(minchild);
}
}

201/405
Implementing Heaps

// Cost: O(log j)
template <typename Elem, typename Prio>
void PriorityQueue<Elem,Prio>::siftup(int j) {

// if j is the root we are done


if (j == 1) return;

int father = j / 2;
if (h[j].second < h[father].second) {
swap(h[j], h[father]);
siftup(father);
}
}

202/405
Part III

Priority Queues

8 Priority Queues: Introduction

9 Heaps

10 Binomial Queues

11 Fibonacci Heaps

203/405
Binomial Queues

J. Vuillemin

A binomial queue is a data structure that efficiently


supports the standard operations of a priority queue
(insert, min, extract_min) and additionally it supports
the melding (merging) of two queues in time O(log n).
Note that melding two ordinary heaps takes time O(n).
Binomial queues (aka binomial heaps) were invented by J.
Vuillemin in 1978.

204/405
template <typename Elem, typename Prio>
class PriorityQueue {
public:
PriorityQueue() throw(error);
~PriorityQueue() throw();
PriorityQueue(const PriorityQueue& Q) throw(error);
PriorityQueue& operator=(const PriorityQueue& Q) throw(error);

// Add element x with priority p to the priority queue


void insert(const Elem& x, const Prio& p) throw(error);

// Returns an element of minimum priority. Throws an exception if


// the priority queue is empty
Elem min() const throw(error);

// Returns the minimum priority in the queue. Throws an exception


// if the priority queue is empty
Prio min_prio() const throw(error);

// Removes an element of minimum priority from the queue. Throws


// an exception if the priority queue is empty
void remove_min() throw(error);

// Returns true if and only if the queue is empty


bool empty() const throw();

// Melds (merges) the priority queue with the priority queue Q;


// the priority queue Q becomes empty
void meld(PriorityQueue& Q) throw();

...
};

205/405
Binomial Queues

A binomial queue is a collection of binomial trees.


The binomial tree of order i (called Bi ) contains 2i nodes

B0

B1

B2
B3

206/405
Binomial Queues

A binomial tree of order i + 1 is (recursively) built by
planting a binomial tree B_i as a child of the root of another
binomial tree B_i.

[Figure: a copy of B_i planted under the root of another B_i gives B_{i+1}]

The size of B_i is 2^i; indeed |B_0| = 2^0 = 1 and
|B_{i+1}| = 2·|B_i| = 2·2^i = 2^{i+1}
A binomial tree of order i has exactly (i choose k) nodes at
level k (the root is at level 0); hence their name
A binomial tree of order i has height i = log₂ |B_i|
207/405
Binomial Queues

Let (b_{k−1}, b_{k−2}, ..., b_0)_2 be the binary representation of n.
Then a binomial queue for a set of n elements contains b_0
binomial trees of order 0, b_1 binomial trees of order 1, ...,
b_j binomial trees of order j, ...
n = 10 = (1,0,1,0) 2

5 3

7 4 5 6

9 4 8

10

A binomial queue for n elements contains at most
⌈log₂(n + 1)⌉ binomial trees
The n elements of the binomial queue are stored in the
binomial trees in such a way that each binomial tree
satisfies the heap property: the priority of the element at
any given node is ≤ the priority of any of its descendants

208/405
Binomial Queues

Each node in the binomial queue will store an Elem and its
priority (any type that admits a total order)
Each node will also store the order of the binomial subtree
of which the node is the root
We will use the usual first-child/next-sibling representation
for general trees, with a twist: the list of children of a node
will be double linked and circularly closed
We need thus three pointers per node: first_child,
next_sibling, prev_sibling
The binomial queue is simply a pointer to the root of the
first binomial tree
We will impose that all lists of children are in increasing
order

209/405
Binomial Queues

n = 10 = (1,0,1,0) 2

5 3

7 6 5 4

8 4 9

10

210/405
Binomial Queues

template <typename Elem, typename Prio>


class PriorityQueue {
...
private:
struct node_bq {
Elem _info;
Prio _prio;
int _order;
node_bq* _first_child;
node_bq* _next_sibling;
node_bq* _prev_sibling;
node_bq(const Elem& x, const Prio& p, int order = 0) : _info(x), _prio(p),
_order(order), _first_child(NULL) {
_next_sibling = _prev_sibling = this;
};
};
node_bq* _first;
int _nelems;

211/405
Binomial Queues

To locate an element of minimum priority it is enough to


visit the roots of the binomial trees; the minimum of each
binomial tree is at its root because of the heap property.
Since there are at most dlog2 (n + 1)e binomial trees, the
methods min() and min_prio() take O(log n) time and
both are very easy to implement.

212/405
Binomial Queues

We can also keep a pointer to the root of the element with


minimum priority, and update it after each insertion or
removal, when necessary. The complexity of updates does
not change and min() and min_prio() take O(1) time

213/405
Binomial Queues

static node_bq* min(node_bq* f) const throw(error) {


if (f == NULL) throw error(EmptyQueue);
Prio minprio = f -> _prio;
node_bq* minelem = f;
node_bq* p = f-> _next_sibling;
while (p != f) {
if (p -> _prio < minprio) {
minprio = p -> _prio;
minelem = p;
};
p = p -> _next_sibling;
}
return minelem;
}

Elem min() const throw(error) {


return min(_first) -> _info;
}

Prio min_prio() const throw(error) {


return min(_first) -> _prio;
}

214/405
Binomial Queues

To insert a new element x with priority p, a binomial queue


with just that element is trivially built and then the new
queue is melded with the original queue
If the cost of melding two queues with a total number of
items n is M (n), then the cost of insertions is O(M (n))

215/405
Binomial Queues

void insert(const Elem& x, const Prio& p) throw(error) {


node_bq* nn = new node_bq(x, p);
_first = meld(_first, nn);
++_nelems;
}

216/405
Binomial Queues

To delete an element of minimum priority from a queue Q,


we start locating such an element, say x; it must be at the
root of some Bi
The root of B_i is detached from Q and thus B_i is no longer
part of the original queue Q; the list of x's children is a
binomial queue Q′ with 2^i − 1 elements
The queue Q′ has i binomial trees of orders 0, 1, 2, ..., up
to i − 1:

1 + 2 + 4 + ··· + 2^{i−1} = 2^i − 1

The queue Q \ B_i is then melded with Q′

217/405
Binomial Queues

n = 10 = (1,0,1,0) 2

5 3

7 4 5 6

9 4 8

10

218/405
Binomial Queues

n = 10 = (1,0,1,0) 2

5
3

7
4 5 6

9 4 8

10

218/405
Binomial Queues

n = 10 = (1,0,1,0) 2

7
Q’ 4 5 6

9 4 8

10

218/405
Binomial Queues

n = 9 = (1,0,0,1) 2

6 4

5 9 4

5 8 10

218/405
Binomial Queues

void remove_min() throw(error) {
  node_bq* m = min(_first);
  node_bq* children = m -> _first_child;
  if (m != m -> _next_sibling) { // there is more than one
                                 // binomial tree
    m -> _prev_sibling -> _next_sibling = m -> _next_sibling;
    m -> _next_sibling -> _prev_sibling = m -> _prev_sibling;
    if (_first == m) _first = m -> _next_sibling; // keep _first valid
  } else {
    _first = NULL;
  }
  m -> _first_child = m -> _next_sibling = m -> _prev_sibling = NULL;
  delete m;
  _first = meld(_first, children);
  --_nelems;
}

219/405
Binomial Queues

The cost of extracting an element of minimum priority:
Locating the minimum has cost O(log n)
Melding Q \ B_i and Q′ has cost O(M(n)), since
|Q \ B_i| + |Q′| = (n − 2^i) + (2^i − 1) = n − 1
In total: O(log n + M(n))

220/405
Binomial Queues

Melding two binomial queues Q and Q′ is very similar to
the bitwise addition of two binary numbers
The procedure iterates along the two lists of binomial trees;
at any given step we consider two binomial trees B_i and
B′_j, and a carry C = B″_k or C = ∅
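For instance, melding a queue with 10 = (1010)₂ elements (trees B₁ and B₃) and a queue with 6 = (0110)₂ elements (trees B₁ and B₂) mimics the binary addition 1010 + 0110 = 10000: the two B₁'s are joined into a carry B₂, that carry is joined with the existing B₂ into a carry B₃, which in turn is joined with the existing B₃; the result is a single tree B₄ holding all 16 elements.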

221/405
Binomial Queues

Let r = min(i, j, k).
If there is only one binomial tree in {B_i, B′_j, C} of order r,
put that binomial tree in the result and advance to the next
binomial tree in the corresponding queue (or set C = ∅)
If exactly two binomial trees in {B_i, B′_j, C} are of order r,
set C := B_{r+1} by joining the two binomial trees (while
preserving the heap property), remove the binomial trees
from the respective queues, and advance to the next
binomial tree where appropriate
If the three binomial trees are of order r, put B″_k in the
result, remove B_i from Q and B′_j from Q′, set C := B_{r+1} by
joining B_i and B′_j, and advance in both Q and Q′ to the next
binomial trees
222/405
Binomial Queues

Q 5 3

7 4 5 6

9 4 8

10

Q’ 12 1 2

6 4 2

13

223/405
Binomial Queues

Q 3 5
B

4 5 6 7

9 4 8 12 B’

10

carry
Q’ 1 2

6 4 2

13

223/405
Binomial Queues

Q 3 5
B

4 5 6 7

9 4 8 B’

10

carry
Q’ 1 2

6 4 2

result 12

13

223/405
Binomial Queues

Q 3 5
B

4 5 6 7

1
9 4 8 B’

6
10

carry
Q’ 2

4 2

result 12

13

223/405
Binomial Queues

Q 3
B

4 5 6

9 4 8 B’

10
1
carry
Q’ 2
5 6

4 2
12 7
result
13

223/405
Binomial Queues

3 B

4 5 6 2

B’
9 4 8 4 2

1
10 13
carry
Q’
5 6

12 7
result

223/405
Binomial Queues

3 B

4 5 6

B’
9 4 8

1
10
carry
Q’
2 5 6

12 4 2 7
result

13

223/405
Binomial Queues

Q
B

B’

3 2 6
5 carry
Q’
4 5 6 4 2 7

result 12 9 4 8 13

10

223/405
Binomial Queues

Q
B

Q’ B’

carry

3 2 6
5

4 5 6 4 2 7

result 12 9 4 8 13

10

223/405
Binomial Queues
// removes the first binomial tree from the binomial queue q
// and returns it; if the queue q is empty, returns NULL: cost: Theta(1)
static node_bq* pop_front(node_bq*& q) throw();

// adds the binomial queue b (typically consisting of a single tree)


// at the end of the binomial queue q;
// does nothing if b == NULL; cost: Theta(1)
static void append(node_bq*& q, node_bq* b) throw();

// melds Q and Qp, destroying the two binomial queues


static node_bq* meld(node_bq*& Q, node_bq*& Qp) throw() {
node_bq* B = pop_front(Q);
node_bq* Bp = pop_front(Qp);
node_bq* carry = NULL;
node_bq* result = NULL;
while (non_empty(B, Bp, carry) >= 2) { // non_empty counts how many are != NULL
node_bq* s = add(B, Bp, carry);
append(result, s);
if (B == NULL) B = pop_front(Q);
if (Bp == NULL) Bp = pop_front(Qp);
}
// append the remainder to the result (B or Bp may still hold a popped tree)
append(result, B);
append(result, Q);
append(result, Bp);
append(result, Qp);
append(result, carry);
return result;
}

224/405
Binomial Queues
static node_bq* add(node_bq*& A, node_bq*& B, node_bq*& C) throw() {
int i = order(A); int j = order(B); int k = order(C);
int r = min(i, j, k);
node_bq* a, b, c;
a = b = c = NULL;
if (i == r) { a = A; A = NULL; }
if (j == r) { b = B; B = NULL; }
if (k == r) { c = C; C = NULL; }
if (a != NULL and b == NULL and c == NULL) {
return a;
}
if (a == NULL and b != NULL and c == NULL) {
return b;
}
if (a == NULL and b == NULL and c != NULL) {
return c;
}
if (a != NULL and b != NULL and c == NULL) {
C = join(a, b);
return NULL;
}
if (a != NULL and b == NULL and c != NULL) {
C = join(a,c);
return NULL;
}
if (a == NULL and b != NULL and c != NULL) {
C = join(b,c);
return NULL;
}
/// a != NULL and b != NULL and c != NULL
C = join(a,b);
return c;
}

225/405
Binomial Queues

static int order(node_bq* q) throw() {


// no binomial queue will ever be of order as high as 256 ...
// unless it had 2^256 elements, more than elementary particles in
// this Universe; to all practical purposes 256 = infinity
return q == NULL ? 256 : q -> _order;
}

// plants p as rightmost child of q or q as rightmost child of p


// to obtain a new binomial tree of order + 1 and preserving
// the heap property
static node_bq* join(node_bq* p, node_bq* q) {
if (p -> _prio <= q -> _prio) {
push_back(p -> _first_child, q);
++p -> _order;
return p;
} else {
push_back(q -> _first_child, p);
++q -> _order;
return q;
}
}

226/405
Binomial Queues

Melding two queues with ℓ and m binomial trees,
respectively, has cost O(ℓ + m) because the cost of the
body of the iteration is O(1) and each iteration removes at
least one binomial tree from one of the queues
Suppose that the queues to be melded contain n elements
in total; then the number of binomial trees in Q is ≤ log₂ n
and the same is true for Q′, and the cost of meld is
M(n) = O(log n)
The cost of inserting a new element is O(M(n)) and the
cost of removing an element of minimum priority is

O(log n + M(n)) = O(log n)

227/405
Binomial Queues

Note that the cost of inserting an item into a binomial queue
of size n is Θ(ℓ_n + 1), where ℓ_n is the position of the
rightmost zero in the binary representation of n.
The cost of n insertions is

Σ_{0≤i<n} (ℓ_i + 1) = Σ_{r=1}^{⌈log₂(n+1)⌉} ν(r)·r
                   ≤ n·( Σ_{r≥0} r/2^r ) = Θ(n),

as ν(r) ≈ n/2^r of the numbers between 0 and n − 1 have their
rightmost zero at position r, and the infinite series in the
last line above is bounded by a positive constant
This gives a Θ(1) amortized cost for insertions
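For instance, inserting into a queue with n = 7 = (111)₂ elements triggers three joins (ℓ₇ = 3) and leaves a single B₃, whereas inserting into a queue with n = 8 = (1000)₂ elements touches no existing tree (ℓ₈ = 0); expensive insertions are rare, which is exactly what the Θ(1) amortized bound captures.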
228/405
Binomial Queues

To learn more:
[1] J. Vuillemin
A Data Structure for Manipulating Priority Queues.
Comm. ACM 21(4):309–315, 1978.
[2] T. Cormen, C. Leiserson, R. Rivest and C. Stein.
Introduction to Algorithms, 2e.
MIT Press, 2001.

229/405
Part III

Priority Queues

8 Priority Queues: Introduction

9 Heaps

10 Binomial Queues

11 Fibonacci Heaps

230/405
Fibonacci Heaps

M.L. Fredman R.E. Tarjan


A Fibonacci heap is a data structure that efficiently
supports the standard operations of a priority queue
(insert, min, extract_min) and additionally it
supports: (1) the melding (merging) of two queues in
amortized time O(log n), (2) the deletion of an arbitrary
element in amortized time O(log n) and (3) the decrease of
the priority of an element in amortized constant time O(1).
Fibonacci heaps were invented by Michael L. Fredman and
Robert E. Tarjan in 1986.
231/405
template <typename Elem>
class FibonacciHeap {
public:
class FH_handle;

// creates an empty Fibonacci heap


FibonacciHeap();
...

// Add element x with integer priority p to the priority queue


// returns a "pointer" (FH_handle) to the inserted element
FH_handle insert(const Elem& x, int p);

// Returns a handle to the element of minimum priority. Throws an exception if


// the priority queue is empty
FH_handle min() const;

// Returns the minimum priority in the queue. Throws an exception


// if the priority queue is empty
int min_prio();

// Removes an element of minimum priority from the queue. Throws


// an exception if the priority queue is empty
void extract_min();

// Removes the element pointed to by the given handle h


void delete(FH_handle p);

// Decrease the priority of the element pointed to by the handle


// to make it p (< old priority of h)
void decrease_prio(FH_handle h, int p);

// Returns true if and only if the queue is empty


bool empty() const;

// Melds (merges) the priority queue with the priority queue F;
// the priority queue F becomes empty
void meld(FibonacciHeap& F);
...
};

232/405
Fibonacci Heaps

A Fibonacci heap is a collection of heap-ordered trees
(every node contains a priority smaller than or equal to that of
its descendants).

233/405
Fibonacci Heaps

Some nodes in a FH can be marked, others are unmarked.

234/405
Fibonacci Heaps

The roots of the collection of trees in a FH are maintained


in a double circular linked list
A pointer min points to a root with minimum priority

235/405
Fibonacci Heaps

The children of each node are also maintained in a circular


double linked list
Each node has a pointer to one of its children, and also a
pointer to its parent
The rank of a node is the number of children of the node
The rank of the FH is the maximum rank

236/405
Fibonacci Heaps

class FibonacciHeap {
...
private:
struct FH_node {
FH_node* parent;
FH_node* a_child;
FH_node* left_sibling, * right_sibling;
int rank;
bool mark;
Elem info;
int prio;
};
FH_node* min;
int rank;
...
}

237/405
Fibonacci Heaps

Operations in constant time


Find an element of minimum priority.
Merge two root lists (= concatenate the two linked lists)
Obtain the rank of a given node (via a FH_handle)
Add or remove a tree (via a handle to its root) to a root list
Remove a subtree of some node and add it to the root list
Add a subtree as a child of some node (add it to the linked
lists of children of that node)

238/405
Fibonacci Heaps
Notation Meaning
n number of elements
rank(x) rank of x = number of children of node x
rank(H ) max. rank of any node in H
trees(H ) number of trees in H
marks(H ) number of marked nodes in H

239/405
Fibonacci Heaps

Potential function:

Φ(H) = trees(H) + 2·marks(H)

240/405
Fibonacci Heaps

To insert a new element x with priority p


Create a new node with x and p (and no children, no
parent, . . . )
Add the “tree” with x to the root list, and update min if
needed

241/405
Fibonacci Heaps

Actual cost: c_i = O(1)
Change in potential: ΔΦ = Φ(H′) − Φ(H) = 1, since there is one
more tree after the insertion
Amortized cost: ĉ_i = c_i + ΔΦ = O(1)

242/405
Fibonacci Heaps

Linking:
Given two trees of rank k (rank of a tree = rank of its root),
linking T1 and T2 yields a tree of rank k + 1, adding the root with
larger priority as a child of the root with smaller priority.

243/405
Fibonacci Heaps

To extract an element of minimum priority:

1 Remove the min, merging the children with the root list,
and update min
2 Consolidate the root list (= keep linking trees until no two
trees of the same rank remain)

244/405
Fibonacci Heaps

To extract an element of minimum priority:


1 Remove the root pointed to by min, merging the children
with the root list, and update min
2 Consolidate the root list (= link trees so that no two trees of
the same rank remain)
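A simplified C++ sketch of the consolidation step: here the roots have been collected into a plain vector (the real structure keeps them in a circular doubly linked list), FH_node is the struct of the earlier slide, and link() is an assumed helper implementing the linking operation described before (root with smaller priority keeps the other as a child).

// Consolidation sketch: link trees until no two roots share a rank.
#include <vector>

void consolidate(std::vector<FH_node*>& roots, FH_node*& min_ptr) {
    std::vector<FH_node*> of_rank(64, nullptr);   // 64 > log_phi(n) for any practical n
    for (FH_node* t : roots) {
        while (of_rank[t->rank] != nullptr) {     // another tree of the same rank?
            FH_node* u = of_rank[t->rank];
            of_rank[t->rank] = nullptr;
            t = link(t, u);                       // assumed helper: combined tree,
        }                                         // rank increased by one
        of_rank[t->rank] = t;
    }
    roots.clear();
    min_ptr = nullptr;                            // rebuild root list and min pointer
    for (FH_node* t : of_rank)
        if (t != nullptr) {
            roots.push_back(t);
            if (min_ptr == nullptr || t->prio < min_ptr->prio) min_ptr = t;
        }
}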

246/405
Fibonacci Heaps

Actual cost: c_i = O(rank(H)) + trees(H)
  Removing min and merging its children into the root list: O(rank(H))
  Updating min: O(rank(H)) + trees(H) − 1
  Consolidating: O(rank(H)) + trees(H) − 1
Change in potential: ΔΦ = Φ(H′) − Φ(H) =
trees(H′) − trees(H) ≤ rank(H′) + 1 − trees(H), as no two
trees have the same rank after consolidation
Amortized cost:
ĉ_i = c_i + ΔΦ = O(rank(H′)) + O(rank(H)). We will show
later that rank(H) = O(log n) if H has n elements.

247/405
Fibonacci Heaps

If there are only insertions and extract_min operations, all trees in a
Fibonacci heap H are binomial trees (rank = order of the
binomial trees)
Hence rank(H) ≤ log₂ n
With decrease_priority the trees are no longer binomial, but still
rank(H) ≤ log_φ n, with φ = (1 + √5)/2 ≈ 1.618...

248/405
Fibonacci Heaps

To decrease the priority of node x from p to p′:

If heap-order is preserved, update the priority to p′
Otherwise, cut the subtree rooted at x and add it to the root
list
Update min if necessary

249/405
Fibonacci Heaps

To decrease the priority of node x from p to p′:

If heap-order is preserved, update the priority to p′
Otherwise, cut the subtree rooted at x and add it to the root
list
The number of nodes can then be less than exponential with respect to
the rank (e.g., trees with size ≈ rank)

250/405
Fibonacci Heaps

To solve the problem:


1 If a node x loses a child for the first time, cut the child but
also mark x
2 If we cut a child from a node x that is already marked, then cut the
subtree rooted at x too and merge it with the root list
3 The roots of cut subtrees that are added to the root list lose
their marks, if they had any
(see the sketch below)
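A simplified C++ sketch of these rules, written as if inside class FibonacciHeap: it takes an FH_node* instead of the FH_handle of the interface, uses a member pointer min_node standing for the min pointer of the slides, and cut(x) is an assumed helper that unlinks x from its parent, appends it to the root list and clears x->mark.

// decrease_prio with cascading cuts (sketch).
void decrease_prio(FH_node* x, int p) {
    x->prio = p;                                   // p < old priority of x
    FH_node* y = x->parent;
    if (y != nullptr && x->prio < y->prio) {       // heap order violated
        cut(x);                                    // rule 1: move x to the root list
        while (y->parent != nullptr && y->mark) {  // rule 2: a marked parent is cut too
            FH_node* z = y->parent;
            cut(y);
            y = z;
        }
        if (y->parent != nullptr) y->mark = true;  // rule 1: mark the ancestor that lost a child
    }
    if (x->prio < min_node->prio) min_node = x;    // keep the min pointer up to date
}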

251/405
Fibonacci Heaps

Case 1: heap-order is preserved

252/405
Fibonacci Heaps

Case 2: parent is not marked

253/405
Fibonacci Heaps

Case 3: parent is marked

254/405
Fibonacci Heaps

Actual cost: c_i = c + 1, where c = number of cuts; this includes
changing the key and merging each cut subtree into the root
list
Change in potential: ΔΦ = O(1) − c
  trees(H′) = trees(H) + c
  marks(H′) ≤ marks(H) + 2 − c; every cut, except the first
  and the last, removes a mark; the last might or might not
  remove one
  ΔΦ ≤ c + 2·(2 − c) = 4 − c
Amortized cost: ĉ_i = c_i + ΔΦ = O(1).

255/405
Fibonacci Heaps

Summary:
Insert: O(1)
Extract min: O(rank(H)) amortized
Decrease priority: O(1) amortized
Last step: the Fibonacci lemma.
Lemma
Let H be a Fibonacci heap with n elements. Then

rank(H) ≤ log_φ n

256/405
Fibonacci Heaps

Lemma
Fix some moment in time and consider a tree of rank k
with root x. Denote by y₁, ..., y_k the k children of x, in the
order in which they were attached as children of x.
Then rank(y_i) ≥ 0 if i = 1, and rank(y_i) ≥ i − 2 if i ≥ 2.

257/405
Fibonacci Heaps

1 When y_i got linked to x, both must have had rank at least i − 1
(at that moment x had y₁, ..., y_{i−1} as children, and possibly others that
x lost later)
2 Thus, at linking time, rank(y_i) = rank(x) ≥ i − 1
3 Since y_i got linked as a child of x it may have lost one child afterwards, but
not more; otherwise y_i itself would have been cut
4 Therefore rank(y_i) ≥ i − 2

258/405
Fibonacci Heaps

Lemma
Let s_k be the smallest possible number of elements in
a Fibonacci heap of rank k. Then s_k ≥ F_{k+2}, where F_k
denotes the k-th Fibonacci number

Proof
Consider a FH consisting of a single tree with root x.
Basis: s_0 = 1, s_1 = 2. Inductive hypothesis: s_i ≥ F_{i+2} for
all i, 0 ≤ i < k.
Let y₁, ..., y_k denote the children of x.

s_k ≥ 1 + 1 + (s_0 + ··· + s_{k−2})   because of Lemma 1
    ≥ 1 + F_1 + (F_2 + ··· + F_k)     because of the inductive hyp.
    = F_{k+2}.
259/405
Fibonacci Heaps

Two facts (both easily proved by induction):

1 For all k ≥ 0, F_{k+2} = 1 + F_0 + F_1 + ··· + F_k
2 For all k ≥ 0, F_{k+2} ≥ φ^k, where φ = (1 + √5)/2 ≈ 1.618...

Let H be a Fibonacci heap with n elements and rank k. Then
by Lemma 2, n ≥ s_k ≥ F_{k+2} ≥ φ^k. Hence,

k = rank(H) ≤ log_φ n ≈ 1.44 log₂ n

This implies that the amortized cost of extract min is O(log n).

260/405
Part IV

Disjoint Sets

12 Disjoint Sets: Introduction

13 Implementation of Union-Find

14 Analysis of Union-Find

261/405
Disjoint Sets

A set of disjoint sets or partition Π of a non-empty set A is a
collection of non-empty subsets Π = {A_1, ..., A_k} such that
1 i ≠ j ⟹ A_i ∩ A_j = ∅
2 A = ∪_{1≤i≤k} A_i
Each A_i is often called a block or class; we might see a partition
as an equivalence relation and each A_i as one of its
equivalence classes.

262/405
Disjoint Sets

Given a partition Π of A, it induces an equivalence relation ≡:

x ≡ y ⟺ there is A_i ∈ Π such that x, y ∈ A_i

Conversely, an equivalence relation ≡ on a non-empty set A
induces a partition Π = {A_x}_{x∈A}, with

A_x = {y ∈ A | y ≡ x}.

263/405
Disjoint Sets

Without loss of generality we will assume that the support A of
the disjoint sets is {1, ..., n} (or {0, 1, ..., n − 1}). If that were
not the case, we can use a dictionary to map the actual
elements of A into the range {1, ..., n}.
We shall also assume that A is static, that is, no elements are
added or removed. Efficient representations for partitions of a
dynamic set can be obtained with some extra but small effort.

264/405
Disjoint Sets

Two fundamental operations supported by a DISJOINT SETS
abstract data type are:
1 Given i and j, determine whether the items i and j belong to the
same block (class) or not. Alternatively, given an item i,
find the representative of the block (class) to which i
belongs; i and j belong to the same block
⟺ Find(i) = Find(j)
2 Given i and j, perform the union (a.k.a. merge) of the
blocks of i and j into a single block; the operation might
require i and j to be the representatives of their respective
blocks
It is because of these two operations that these data structures
are usually called union-find sets or merge-find sets (mfsets, for
short).
265/405
Union-Find

class UnionFind {
public:
// Creates the partition {{0}, {1}, ..., {n-1}}
UnionFind(int n);

// Returns the representative of the class to which


// i belongs (should be const, but it is not to
// allow path compression)
int Find(int i);

// Performs the union of the classes with representatives


// ri and rj, ri != rj
void Union(int ri, int rj);

// Returns the number of blocks in the union-find set


int nr_blocks() const;
...
};

266/405
Part IV

Disjoint Sets

12 Disjoint Sets: Introduction

13 Implementation of Union-Find

14 Analysis of Union-Find

267/405
Implementation #1: Quick-find

We represent the partition with a vector P:
P[i] = the representative of the block of i
Initially, P[i] = i for all i, 1 ≤ i ≤ n
FIND(i) is trivial: just return P[i]
For the union of two blocks with representatives ri and rj,
simply scan the vector P and transfer all elements in the block
of ri to the block of rj, that is, set P[k] := rj
whenever P[k] = ri (or vice versa, transfer the elements in the
block of rj to the block of ri)
FIND(i) is very cheap (Θ(1)), but UNION(ri, rj) is very
expensive (Θ(n)).
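A minimal self-contained C++ sketch of quick-find; it mirrors, but is not, the UnionFind class of these slides (the member function is named unite because union is a C++ keyword).

// Quick-find sketch: P[i] stores the representative of i's block.
#include <vector>

struct QuickFind {
    std::vector<int> P;
    QuickFind(int n) : P(n) { for (int i = 0; i < n; ++i) P[i] = i; }

    int find(int i) const { return P[i]; }     // Theta(1)

    void unite(int i, int j) {                 // Theta(n)
        int ri = find(i), rj = find(j);
        if (ri == rj) return;
        for (int k = 0; k < int(P.size()); ++k)   // move the whole block of ri
            if (P[k] == ri) P[k] = rj;
    }
};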

268/405
Implementation #1: Quick-find

We can avoid scanning the entire array to perform a union; but


we will still have to change the representative of all the
elements in one block to point to the representative of the other
block, and this has linear cost in the worst-case too.

Although it is not very natural in this case, it is very convenient to


think of the union-find set as a collection of trees, one tree per
block, and see P [i] as a pointer to the parent of i in its tree;
P [i] = i indicates that i is the root of the tree—i is the
representative of the block. With quick-find, all trees have
height 1 (blocks with a single item) or 2 (the representative is
the root and all other items in the block are its children).

269/405
Implementation #1: Quick-find

270/405
Implementation #2: Quick-union

In quick-union, to merge two blocks with representatives ri and
rj, it is enough to set P[ri] := rj or P[rj] := ri. That makes
UNION(ri, rj) trivial and cheap (its cost is Θ(1)).
If we allow UNION(i, j) for arbitrary i and j, we must find the
corresponding representatives ri and rj, check that they are
different and proceed as above. The operation can now be
costly, but that's because of the calls to FIND.

A call FIND(i) can be expensive in the worst case: its cost is
proportional to the height of the tree that contains i,
and that can be as much as Θ(n).

271/405
Implementation #2: Quick-union

class UnionFind {
...
private:
vector<int> P;
int nr_blocks;
};

UnionFind::UnionFind(int n) : P(vector<int>(n)) {
// constructor
for (int j = 0; j < n; ++j)
P[j] = j;
nr_blocks = n;
}
void UnionFind::Union(int i, int j) {
int ri = Find(i); int rj = Find(j);
if (ri != rj) {
P[ri] = rj; --nr_blocks;
}
}
int UnionFind::Find(int i) {
while (P[i] != i) i = P[i];
return i;
}

272/405
Implementation #2: Quick-union

273/405
Implementation #3: Union by weight or by rank

To overcome the problem of unbalanced trees (which lead to trees
that are too high) it is enough to make sure that in a union
1 The smaller tree becomes a child of the bigger tree
(union-by-weight), or
2 The tree with smaller rank becomes a child of the tree
with larger rank (union-by-rank)
Unless we use path compression (stay tuned!), rank = height.

274/405
Implementation #3: Union by weight or by rank

To implement one of these two strategies we need to
know, for each block, its size (number of elements) or its
rank (= height).
We could use an auxiliary array to store that information, but
we can avoid the extra space as follows: if i is the
representative of its block, instead of setting P[i] := i to
mark it as the root we store
1 P[i] = −(the size of the tree rooted at i), or
2 P[i] = −(the rank of the tree rooted at i)
We use the negative sign to indicate that i is the root of a
tree.

275/405
Implementation #3: Union by weight or by rank

class UnionFind {
...
private:
vector<int> P;
int nr_blocks;
};

UnionFind::UnionFind(int n) : P(vector<int>(n)) {
// constructor
for (int j = 0; j < n; ++j)
P[j] = -1; // all items are roots of trees of size 1 (or rank 1)
nr_blocks = n;
}
int UnionFind::Find(int i) {
  // P[i] < 0 when i is a root
  while (P[i] >= 0) i = P[i];
return i;
}
...

276/405
Implementation #3: Union by weight or by rank

void UnionFind::Union(int i, int j) {

  int ri = Find(i); int rj = Find(j);
  if (ri != rj) {
    if (P[ri] >= P[rj]) {
      // ri is the smallest/shortest (P values of roots are negative)
      P[rj] += P[ri]; // <= union-by-weight
      // P[rj] = min(P[rj], P[ri] - 1); // <= union-by-rank
      P[ri] = rj;     // link ri under rj after updating rj's info
    } else {
      // rj is the smallest/shortest
      P[ri] += P[rj]; // <= union-by-weight
      // P[ri] = min(P[ri], P[rj] - 1); // <= union-by-rank
      P[rj] = ri;
    }
    --nr_blocks;
  }
}

277/405
Implementation #3: Union by weight or by rank

Lemma
The height of a tree that represents a block of size k is
≤ 1 + log₂ k, using union-by-weight.
Proof
We prove it by induction. If k = 1 the lemma is
obviously true: the height of a tree of one element is
1. Let T be a tree of size k resulting from the union-
by-weight of two trees T₁ and T₂ of sizes r and s,
respectively; assume r ≤ s < k = r + s. Then T
has been obtained putting T₁ as a child of (the root of) T₂.

278/405
Implementation #3: Union by weight or by rank

Proof (cont'd)
By the inductive hypothesis, height(T₁) ≤ 1 + log₂ r
and height(T₂) ≤ 1 + log₂ s. The height of T is
that of T₂ unless height(T₁) + 1 > height(T₂), in which case it is
height(T₁) + 1. That is,

height(T) = max(height(T₂), height(T₁) + 1)
          ≤ 1 + max(log₂ s, 1 + log₂ r)
          = 1 + max(log₂ s, log₂(2r))
          ≤ 1 + log₂ k,

since s ≤ k and 2r ≤ r + s = k. □

279/405
Implementation #3: Union by weight or by rank

An analogous lemma can be proved if we perform union by
rank.

We might be satisfied with union-by-rank or union-by-weight,
but we can improve the cost of FIND even further by applying
some path compression heuristic.

280/405
Path Compression
While we look for the representative of i in a FIND(i), we follow
the pointers from i up to the root; we can make the
pointers along that path change so that the path becomes
shorter, and therefore we may speed up future calls to FIND.
There are several heuristics for path compression:
1 In full path compression, we traverse the path from i to its
representative twice: first to determine that ri is that
representative; second, to set P[k] := ri for all k along the
path, as all these items have ri as their representative; this
only doubles the cost of FIND(i).
2 In path splitting, we maintain two consecutive items of the
path, i1 and i2 = P[i1]; as we go up in the tree we
make P[i1] := P[i2]; at the end of this traversal, every k along
the path, except the root and its immediate child, will point
to the element that was previously its grandparent; we
reduce the length of the path roughly by half.

281/405
Path Compression

3 In path halving, we traverse the path from i to its
representative, making every other node point to its
grandparent (see the sketch after this slide).

282/405
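A minimal sketch of path halving follows; it is not part of the slides' original
code, and it assumes the simple convention P[i] == i for roots (with ranks kept
in a separate array, as in the amortized analysis later on).

int UnionFind::Find(int i) {
    while (P[i] != i) {
        P[i] = P[P[i]];   // make i point to its grandparent ...
        i = P[i];         // ... and jump to it: every other node of the path is updated
    }
    return i;
}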
Path compression: full path compression

Make every node point to its representative.

Source: Kevin Wayne


(https://www.cs.princeton.edu/~wayne/kleinberg-tardos/pdf/UnionFind.pdf)

283/405
Path compression: full path compression

// iterative full path compression
// with the convention that P[i] = -(the rank of i) when i is a root (P[i] < 0)
int UnionFind::Find(int i) {
    int ri = i;
    // P[ri] < 0 when ri is a root
    while (P[ri] >= 0) ri = P[ri];

    // traverse the path again making everyone point to ri
    while (P[i] >= 0) {
        int aux = i;
        i = P[i];
        P[aux] = ri;
    }
    return ri;
}

// recursive full path compression


// with the convention that P[i] = -the rank of i if P[i] < 0
int UnionFind::Find(int i) {
if (P[i] < 0) return i;
else {
P[i] = Find(P[i]);
return P[i];
}
}

284/405
Path compression: full path compression

(Figure: example of full path compression.)

285/405
Path compression: path splitting
Make every node point to its grandparent (except if it is the root
or a child of the root).

Source: Kevin Wayne


(https://www.cs.princeton.edu/~wayne/kleinberg-tardos/pdf/UnionFind.pdf)
286/405
Path compression: path halving
Make every other node in the path point to its grandparent
(except if it is the root or a child of the root).

Source: Kevin Wayne


(https://www.cs.princeton.edu/~wayne/kleinberg-tardos/pdf/UnionFind.pdf)
287/405
Part IV

Disjoint Sets

12 Disjoint Sets: Introduction

13 Implementation of Union-Find

14 Analysis of Union-Find

288/405
Amortized analysis of Union-Find
The analysis of Union-Find with union by weight (or by rank)
using some path compression heuristic must be amortized: the
union of two representatives (roots) is always cheap, and the
cost of any FIND is bounded by O(log n), but if we apply many
FIND's the trees become bushier, and we quickly approach the
situation of Quick-Find while avoiding costly UNION's.

In what follows we analyze the cost of a sequence of m
intermixed UNIONs and FINDs performed on a Union-Find data
structure with n ≤ m elements, using union-by-rank and full
path compression.

Similar results hold for the various combinations of
union-by-weight/union-by-rank and path compression
heuristics.
289/405
Amortized analysis of Union-Find
1 Observation #1. Path compression does not change the
rank of any node, hence rank(x) ≥ height(x) for any node
x.
2 In what follows we assume that we initialize the rank of all
nodes to 0 and keep the ranks in a separate array, so we
will have:
int UnionFind::Find(int i) {
if (P[i] == i) return i;
else {
P[i] = Find(P[i]);
return P[i];
}
}

void UnionFind::Union(int i, int j) {


int ri = Find(i); int rj = Find(j);
if (ri != rj) {
if (rank[ri] <= rank[rj]) {
P[ri] = rj;
rank[rj] = max(rank[rj], 1+rank[ri]);
} else {
P[rj] = ri;
rank[ri] = max(rank[ri], 1+rank[rj]);
}
    }
}

290/405
Amortized analysis of Union-Find

Proposition
The tree roots, node ranks and elements within a tree
are the same with or without path compression.

Proof
Path compression only changes some parent pointers,
nothing else. It does not create new roots, does not
change ranks or move elements from one tree to
another. 

291/405
Amortized analysis of Union-Find

Properties:
1 If x is not a root node then rank(x) < rank(parent(x)).
2 If x is not a root node then its rank will not change.
3 Let rt = rank(parent(x)) at time t. If at time t + 1 x
changes its parent then rt < rt+1.
4 A root node of rank k has ≥ 2^k descendants.
5 The rank of any node is ≤ ⌈log2 n⌉.
6 For any r ≥ 0, the Union-Find data structure contains at
most n/2^r elements of rank r.

292/405
Amortized analysis of Union-Find

All six properties hold for Union-Find with union-by-rank. By
the previous proposition, properties #2, #4, #5 and #6
immediately hold for any variant using path compression.

Only properties #1 and #3 might seem in doubt, as path compression
makes changes to parent pointers. However, they still hold when
path compression is used, as we argue next.

293/405
Amortized analysis of Union-Find

Proof of Property #1
A node of rank k can only be created by joining two
roots of rank k − 1. Path compression can't change
ranks. However, it might change the parent of x; in
that case, x will point to some ancestor of its previous
parent, hence rank(x) < rank(parent(x)) at all times. ∎

Proof of Property #2
In union-by-rank, the rank of a node can only change
while it is a root. Once a root
becomes a non-root it will never become a root node
again. Path compression never changes ranks and
never changes roots. ∎

294/405
Amortized analysis of Union-Find

Example of property #1

295/405
Amortized analysis of Union-Find

Proof of Property #3
When the parent of x changes it is because either
1 x becomes a non-root and union-by-rank
guarantees that rt = rank(parent(x)) = rank(x) and
rt+1 > rt as x becomes a child of a node whose
rank is larger than rt
2 x is a non-root at time t but path compression
changes its parent. Because x will be pointing to
some ancestor of its parent then rt+1 > rt (because
of Property #1)


296/405
Amortized analysis of Union-Find

Proof of Property #4
By induction on k.
Base: If k = 0 then x is the root of a tree with only one
node, so the number of descendants is ≥ 2^k = 1.
Inductive step: a node x of rank k can only get
that rank because of the union of two roots of rank
k − 1, hence x was the root of one of the trees involved
and its rank was k − 1 before the union. By the inductive
hypothesis, each tree contained ≥ 2^(k−1) nodes, and the
result must then contain ≥ 2^k nodes. ∎

Proof of Property #5
Immediate from Properties #1 and #4. ∎

297/405
Amortized analysis of Union-Find

An example of property #4

298/405
Amortized analysis of Union-Find

Proof of Property #6
Because of Property #4, any node x of rank r is the
root of a subtree with ≥ 2^r nodes. Indeed, if x is a root
that is the statement of the property; else, inductively,
x had the property just before becoming a non-
root, and since neither its rank nor its set of descendants
can change afterwards, the property is also true for non-
root nodes. Because of Property #1, two distinct nodes
of rank r cannot be an ancestor of one another, hence
their sets of descendants are disjoint.
Therefore, there can be at most n/2^r nodes of rank r.


299/405
Amortized analysis of Union-Find

An example of property #6

300/405
Amortized analysis of Union-Find

Definition
The iterated logarithm function is

  lg* x = 0                 if x ≤ 1
  lg* x = 1 + lg*(lg x)     otherwise

We consider only logarithms base 2, hence we write lg ≡ log2.

  n                    lg* n
  (0, 1]               0
  (1, 2]               1
  (2, 4]               2
  (4, 16]              3
  (16, 65536]          4
  (65536, 2^65536]     5

lg* n ≤ 5 in this Universe.

301/405
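A small helper illustrating the definition of lg* (not part of the slides' code;
it uses doubles only for the sake of the illustration):

#include <cmath>

int lg_star(double x) {
    int k = 0;
    while (x > 1.0) {      // lg*(x) = 0 for x <= 1
        x = std::log2(x);  // one application of lg ...
        ++k;               // ... is counted
    }
    return k;              // e.g. lg_star(16) == 3, lg_star(65536) == 4
}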
Amortized analysis of Union-Find
Given k, let 2↑↑k = 2^(2^(···^2)) (a tower of k exponentiations).
Inductively: 2↑↑0 = 1, and 2↑↑k = 2^(2↑↑(k−1)). Then lg*(2↑↑k) = k.
Define the groups

  G0 = {1}
  G1 = {2}
  G2 = {3, 4}
  G3 = {5, ..., 16}
  G4 = {17, ..., 65536}
  G5 = {65537, ..., 2^65536}
  ...  = ...
  Gk = {1 + 2↑↑(k−1), ..., 2↑↑k}

302/405
Amortized analysis of Union-Find

For any n > 0, n belongs to G_{lg* n}. The rank of any node in a
Union-Find data structure of n elements will belong to one of
the first lg* n groups (as all ranks are between 0 and lg n).

303/405
Amortized analysis of Union-Find
Accounting scheme: We assign credits during a UNION to the
node that ceases to be a root; if its rank belongs to group Gk
we assign 2↑↑k credits to the item.
Proposition
The number of credits assigned in total among all nodes
is ≤ n·lg* n.

Proof
By Property #6, the number of nodes with rank ≥ x + 1 is
at most

  n/2^(x+1) + n/2^(x+2) + ··· ≤ n/2^x

Consider the nodes whose ranks belong to group
Gk = {x + 1, ..., 2^x}, where x = 2↑↑(k−1).
304/405
Amortized analysis of Union-Find

Proof (cont'd)
Each node whose rank is in the group received 2↑↑k = 2^x
credits, and there are at most n/2^x such nodes, so the credits
assigned to nodes in the group total ≤ n. All ranks
belong to the first lg* n groups, hence the total number
of credits is ≤ n·lg* n. ∎

305/405
Amortized analysis of Union-Find

The cost of UNION is constant. We need to find the amortized
cost of FIND. The actual cost is the number of parent pointers
followed:
1 parent(x) is a root ⟹ this is true for at most one of the
nodes visited during the execution of a FIND,
2 rank(parent(x)) belongs to a group Gj higher than that of
rank(x) ⟹ this might happen for at most lg* n visited x's
during the execution of a FIND,
3 rank(parent(x)) and rank(x) belong to the same group
⟹ see the next slide.

306/405
Amortized analysis of Union-Find

We make any node x such that rank(parent(x)) and rank(x) are
in the same group pay 1 credit to follow and update the
parent pointer during the FIND.
Then rank(parent(x)) strictly increases (Property #1).
If the node's rank was in group Gk = {x + 1, ..., 2^x} then it had
2^x credits to spend, and the rank of its parent will belong to a
higher group before x has been updated 2^x times by
FIND operations.
Once the parent's rank belongs to a higher group than the
rank of x, the situation will remain so, as rank(x) (hence the
group) never changes and rank(parent(x)) never
decreases.
Therefore x has enough credits to pay for all FINDs in which it
gets involved before it becomes a node in Case #2.
307/405
Amortized analysis of Union-Find

Theorem
Starting from an initial Union-Find data structure for
n elements with n disjoint blocks, any sequence of
m ≥ n UNION and FIND operations using union-by-rank and full
path compression has total cost O(m·lg* n).

Proof
The amortized cost of FIND is O(lg* n), and that of any
UNION is constant, hence the sequence of m operations
has total cost O(m·lg* n). ∎

308/405
Amortized analysis of Union-Find
1972: Fischer: O(m·log log n)
1973: Hopcroft & Ullman: O(m·lg* n)
1975: Tarjan: O(m·α(m, n)). Ackermann's inverse
α(m, n) is an extremely slowly growing function,
α(m, n) ≪ lg* n.
1984: Tarjan & van Leeuwen: O(m·α(m, n)). For all
combinations of union-by-weight/rank and path
compression heuristics (full/splitting/halving).
1989: Fredman & Saks: Ω(m·α(m, n)). A non-trivial
lower bound for the amortized complexity of
Union-Find in the cell probe model.
309/405
Disjoint Sets

To learn more:
[1] Michael J. Fischer
Efficiency of Equivalence Algorithms
Symposium on Complexity of Computer Computations,
IBM Thomas J. Watson Research Center, 1972.
[2] J.E. Hopcroft and J.D. Ullman
Set Merging Algorithms
SIAM J. Computing 2(4):294–303, 1973.
[3] Robert E. Tarjan
Efficiency of a Good But Not Linear Set Union Algorithm
J. ACM 22(2):215–225, 1975.

310/405
Disjoint Sets

To learn more:
[1] Robert E. Tarjan and Jan van Leeuwen
Worst-Case Analysis of Set Union Algorithms
J. ACM 31(2):245–281, 1984.
[2] Michael L. Fredman and Michael E. Saks
The Cell Probe Complexity of Dynamic Data Structures
Proc. 21st Symp. Theory of Computing (STOC), p.
345–354, 1989.
[3] Z. Galil and G. Italiano
Data Structures and Algorithms for Disjoint Set Union
Problems
ACM Computing Surveys 23(3):319–344, 1991.

311/405
Part V

Data Structures for String


Processing

15 Tries

16 Suffix Trees

312/405
Tries

We often deal with keys which are sequences of symbols


(characters, decimal digits, bits, . . . ). Such a decomposition is
usually very natural and can be exploited to implement efficient
dictionaries.
Moreover, we usually want, besides lookups and updates,
operations in which the keys as symbol sequences matter: for
example, we might want an operation that, given a collection of
words C and some word w, returns all words in C that contain
w as a subsequence. Or as a prefix, or a suffix, etc.

313/405
Tries
Consider a finite alphabet Σ = {σ1, ..., σm} with m ≥ 2
symbols. We denote by Σ*, as usual in the literature, the set of all
strings that can be formed with symbols from Σ. Given two
strings u and v in Σ*, we write u·v for the string which results
from the concatenation of u and v. We will use ε to denote the
empty string, i.e., the string of length 0.
Definition
Given a finite set of strings X ⊆ Σ*, all of identical
length, the trie T of X is an m-ary tree recursively
defined as follows:
1 If X contains a single element x or none, then T is
a tree consisting of a single node that contains x, or
is empty.
2 If |X| ≥ 2, let Ti be the trie for the subset
Xi = {y | σi·y ∈ X}, 1 ≤ i ≤ m; then T consists of a
root whose i-th subtree is Ti.

314/405
Tries

315/405
Tries

Lemma
If the edges in the trie T for X are labelled in such a
way that the edge connecting the root of T with subtree
Ti has label σi, 1 ≤ i ≤ m, then the sequence of
labels in the path from the root to a non-empty leaf
that contains x forms the shortest prefix that uniquely
identifies x, that is, the shortest prefix of x which is not
shared by any other element of X.

Lemma
Let p be the sequence of labels in a path from the root
of the trie T to some node v (either internal or leaf); the
subtree rooted at v is a trie for the subset of all strings
in X starting with the prefix p (and no other strings).
316/405
Tries

Lemma
Given a finite set X ⊆ Σ* of strings of equal length,
the trie T for X is unique; in particular, T does not
depend on the order in which we "present" or "insert" the
elements of X.

Lemma
The height of a trie T is the minimum length of prefixes
needed to distinguish the elements of X; in other words,
the length of the longest prefix which is common to ≥ 2
elements of X; of course, if ℓ is the length of the strings
in X then
  height(T) ≤ ℓ

317/405
Tries

Our definition of tries requires all strings to be of the same
length; that's too restrictive. What we actually need is that no
string in X is a proper prefix of another string in X.
The standard solution to the problem is to extend Σ with a
special symbol (e.g. #) to mark the end of strings. If we append
# to the end of all strings in X, then no (marked) string is a
proper prefix of another string. The modest price to pay is to
work with an alphabet of m + 1 symbols.

318/405
Tries
(Figure: the trie for X = {ahora, alto, amigo, amo, asi, bar, barco, bota, ...},
with end-of-string marker #.)

319/405
Tries

In order to implement tries we can use standard techniques to
implement m-ary trees, namely, use an array of pointers for
each internal node, or use the first child-next sibling
representation.

If using arrays of pointers, the pointers give us access to the
root node of every non-empty subtree of the current node, and
the symbols of Σ can be used as indices into the array of
pointers (possibly via some easy bijective mapping
f : Σ → {0, ..., m − 1}). Leaves that are not empty can contain
the remaining suffix of the element, the prefix being given by the
path from the root to the leaf.

320/405
Tries

When using first child-next sibling, each node stores a symbol
c; the pointer to the first child points to the root of the trie of
words that have c at that level, while the next sibling points to
the node giving us access to some other trie, now with some
other symbol d. Since the symbols in Σ typically
admit a natural total order, it is usual to keep the list of children
of a node in the trie sorted in increasing order.

321/405
Tries

322/405
Tries
Although it is more costly in space to use nodes that store the full
strings, symbol by symbol, instead of storing the remaining suffix once
a leaf is reached, it is advantageous: we avoid different types of
nodes, different types of pointers, pointers of one type pointing
to nodes of some other type, wasteful unions, ...
// We assume that the class Key supports
// x.length() = the length (>= 0) of key x
int Key::length() const;

// x[i] = i-th symbol of key x;
// throws an exception if i < 0 or i >= x.length()
Symbol Key::operator[](int i) const;

template <typename Symbol, typename Key, typename Value>


class DigitalDictionary {
public:
...
private:
struct trie_node {
Symbol _c;
trie_node* _first_child;
trie_node* _next_sibl;
Value _v;
};
trie_node* root;
...

323/405
Tries

template <typename Symbol,


typename Key,
typename Value>
void DigitalDictionary<Symbol,Key,Value>::lookup(
const Key& k, bool& exists, Value& v) const {

trie_node* p = _lookup(root, k, 0);


if (p == nullptr)
exists = false;
else {
exists = true;
v = p -> _v;
}
}

324/405
Tries

// Pre: p points to the root of the subtree that contains
//      all elements such that their first i-1 symbols
//      coincide with the first i-1 symbols of the key k
// Post: returns a pointer to the node that stores the value
//       associated to k if the pair (k,v) belongs to the dictionary,
//       and a null pointer if no such pair exists
// Cost: O(|k| * m)
template <typename Symbol, typename Key,
typename Value>
typename DigitalDictionary<Symbol,Key,Value>::trie_node*
DigitalDictionary<Symbol,Key,Value>::_lookup(trie_node* p,
const Key& k, int i) const {

if (p == nullptr) return nullptr;


if (i == k.length()) return p;
if (p -> _c > k[i]) return nullptr;
if (p -> _c < k[i])
return _lookup(p -> _next_sibl, k, i);
// p -> _c == k[i]
return _lookup(p -> _first_child, k, i+1);
}

325/405
Tries

template <typename Symbol,


typename Key,
typename Value>
void DigitalDictionary<Symbol,Key,Value>::insert(
const Key& k, const Value& v) {

root = _insert(root, k, 0, v);


}

326/405
Tries
// Pre: p points to the root of the subtree that contains
//      all elements such that their first i-1 symbols
//      coincide with the first i-1 symbols of the key k
// Post: returns a pointer to the root of the tree resulting from
//       the insertion of the pair (k[i..], v) in the subtree
// Cost: O(|k| * m)
template <typename Symbol, typename Key,
          typename Value>
typename DigitalDictionary<Symbol,Key,Value>::trie_node*
DigitalDictionary<Symbol,Key,Value>::_insert(trie_node* p,
        const Key& k, int i, const Value& v) {

  if (i == k.length()) {
    // Symbol() is the end-of-string symbol,
    // e.g. Symbol() == '\0' or Symbol() == '#'
    if (p == nullptr or p -> _c != Symbol()) {
      // no end-of-string node yet: create it in front of the sibling list
      trie_node* q = new trie_node;
      q -> _c = Symbol();
      q -> _first_child = nullptr;
      q -> _next_sibl = p;
      q -> _v = v;
      return q;
    }
    p -> _v = v;   // the key was already present: update its value
    return p;
  }
  if (p == nullptr or p -> _c > k[i]) {
    // insert a new node for k[i] in front of the current sibling p
    trie_node* q = new trie_node;
    q -> _c = k[i];
    q -> _next_sibl = p;
    q -> _first_child = _insert(nullptr, k, i+1, v);
    return q;
  }
  if (p -> _c < k[i])
    p -> _next_sibl = _insert(p -> _next_sibl, k, i, v);
  else // p -> _c == k[i]
    p -> _first_child = _insert(p -> _first_child, k, i+1, v);
  return p;
}

327/405
Ternary Search Trees
One alternative to implement tries is to represent each trie node
as a binary search tree over the symbols, with pointers to the roots
of the subtrees, instead of an array of pointers to roots or a linked
list of pointers to roots (of non-empty subtrees).
The resulting data structure, invented by Bentley and Sedgewick
(1997), is called a ternary search tree (TST). It tries to
combine the efficiency in space of list-tries (we avoid the large
number of null pointers needed when using arrays) and the efficiency
in time (we avoid the linear "scans" needed when using lists to
navigate to the appropriate subtree).
Nodes in TSTs store a symbol c and 3 pointers each: pointers
to the left and right children of the node in the BST that represents
the trie "node", and a central pointer to the root of the subtree
with symbol c at that level.
328/405
Ternary Search Trees

template <typename Symbol,


typename Key,
typename Value>
class DigitalDictionary {
public:
...
void lookup(const Key& k, bool& exists, Value& v) const;
void insert(const Key& k, const Value& v);
...
private:
struct tst_node {
Symbol _c;
tst_node* _left;
tst_node* _cen;
tst_node* _right;
Value _v;
};
tst_node* root;
...
static tst_node* _lookup(tst_node* t,
int i, const Key& k, const Value& v);
static tst_node* _insert(tst_node* t,
int i, const Key& k, const Value& v);
...
};

329/405
Ternary Search Trees

330/405
Ternary Search Trees

// Pre: p points to the root of the subtree that contains
//      all elements such that their first i-1 symbols
//      coincide with the first i-1 symbols of the key k
// Post: returns a pointer to the node that stores the value
//       associated to k if the pair (k,v) belongs to the dictionary,
//       and a null pointer if no such pair exists
// Expected cost: O(|k| * log m)
template <typename Symbol, typename Key,
          typename Value>
typename DigitalDictionary<Symbol,Key,Value>::tst_node*
DigitalDictionary<Symbol,Key,Value>::_lookup(tst_node* p,
        const Key& k, int i) const {

  if (p == nullptr) return nullptr;
  if (i == k.length()) return p;
  if (k[i] < p -> _c) return _lookup(p -> _left, k, i);
  if (p -> _c < k[i]) return _lookup(p -> _right, k, i);
  // k[i] == p -> _c
  return _lookup(p -> _cen, k, i+1);
}

331/405
Ternary Search Trees

template <typename Symbol,
          typename Key,
          typename Value>
void DigitalDictionary<Symbol,Key,Value>::insert(
        const Key& k, const Value& v) {
    // Symbol() is the end-of-string symbol,
    // e.g. Symbol() == '\0' or Symbol() == '#'
    Key kk = k;               // work on a copy: k is const
    kk.push_back(Symbol());   // append end-of-string (assumes Key supports push_back)
    root = _insert(root, 0, kk, v);
}

332/405
Ternary Search Trees
template <typename Symbol,
          typename Key,
          typename Value>
typename DigitalDictionary<Symbol,Key,Value>::tst_node*
DigitalDictionary<Symbol,Key,Value>::_insert(
        tst_node* t, int i,
        const Key& k, const Value& v) {

  if (t == nullptr) {
    t = new tst_node;
    t -> _left = t -> _right = t -> _cen = nullptr;
    t -> _c = k[i];
    if (i < k.length() - 1) {
      t -> _cen = _insert(t -> _cen, i + 1, k, v);
    } else { // i == k.length() - 1; k[i] == Symbol()
      t -> _v = v;
    }
  } else {
    if (t -> _c == k[i]) {
      if (i < k.length() - 1)
        t -> _cen = _insert(t -> _cen, i + 1, k, v);
      else              // end-of-string reached: the key was already present,
        t -> _v = v;    // update its value
    } else if (k[i] < t -> _c) {
      t -> _left = _insert(t -> _left, i, k, v);
    } else { // t -> _c < k[i]
      t -> _right = _insert(t -> _right, i, k, v);
    }
  }
  return t;
}

333/405
Performance of Tries

There are several measures of the performance of tries in
terms of space and of the time of the different operations.

For example, in tries using arrays of pointers and considering
an extended alphabet with m + 1 symbols, a trie for a set of n
elements will contain ≥ n leaves; how many of them?

A common model to study the average behavior of tries (array,
list or TST) is to consider that the n strings are produced by
some memoryless source (so that the r-th symbol of an
element is σi with probability pi, irrespective of the previous
symbols and of r), or by some Markovian model (there is a
probability pi,j that some symbol is σi given that the preceding
symbol was σj).

334/405
Performance of Tries

Theorem (Clément, Flajolet, Vallée (1998))
The external path length (EPL) of a random trie of n
elements produced by a random source S is

  (C_S / H_S)·n·log n + o(n log n)

where both C_S and H_S are constants depending on
the random source S; moreover, C_S depends on the
specific implementation of the trie (array, list, TST).

The EPL is the sum of the lengths of all paths from the root of
the trie to all leaves in the trie; a random search (successful or
unsuccessful) will cost, on average, (C_S / H_S)·log n.

335/405
Performance of Tries
For example, if the source is memoryless with probabilities p1,
..., pm for the symbols of the alphabet (∑_i pi = 1), then
H_S = −∑_i pi·log pi is the entropy of the source and

  Type    C_S
  Array   1
  List    ∑_i (i − 1)·pi
  TST     2·∑_{i<j} pi·pj / (pi + ··· + pj)

When all pi = 1/m we have that H_S = log m and hence the
cost of random searches is

  Type    Cost of random search
  Array   log_m n
  List    ≈ (m/2)·log_m n
  TST     ≈ 2·ln m·log_m n = 2·ln n
336/405
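A small numeric illustration of the formulas above (not from the slides): it
computes H_S and the list-trie and TST constants C_S for a given vector of
symbol probabilities.

#include <cmath>
#include <vector>

double entropy(const std::vector<double>& p) {      // H_S = -sum_i p_i log2 p_i
    double H = 0.0;
    for (double pi : p) H -= pi * std::log2(pi);
    return H;
}

double C_list(const std::vector<double>& p) {        // sum_i (i-1) p_i, i = 1..m
    double C = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) C += i * p[i];
    return C;
}

double C_tst(const std::vector<double>& p) {          // 2 sum_{i<j} p_i p_j / (p_i + ... + p_j)
    double C = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        double s = p[i];
        for (std::size_t j = i + 1; j < p.size(); ++j) {
            s += p[j];
            C += 2.0 * p[i] * p[j] / s;
        }
    }
    return C;
}
// For p = (1/m, ..., 1/m): entropy(p) == log2 m, C_list(p) == (m-1)/2 ~ m/2,
// and C_tst(p) ~ 2 ln m, matching the table above.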
Performance of Tries

Let N denote the total number of symbols (including
end-of-string markers, if needed) required by our n strings. The
shared prefixes will provide a more compact representation of the
set of strings and thus we expect to need ≤ N nodes
(assuming we store one symbol per node, as in list-tries and
TSTs).
On the other hand, it has been shown that to store n strings in
a trie we need, on average, n/H_S internal nodes (Knuth
1968, Regnier 1988); hence the average number of pointers is
n·(m + 1)/H_S (array-tries), 2n/H_S (list-tries) and 3n/H_S
(TSTs).

337/405
Performance of Tries

For example, for a memoryless source with m symbols all
equally likely, we will need on average n/ln m nodes, e.g.,
n/ln 2 nodes in a binary trie.
In list-tries and TSTs the number of nodes coincides, in the
worst case, with N, as each node holds one symbol. But in
practice the common prefixes will be exploited, giving a
reduction by a factor of 1/H_S (≈ 1/ln m for strings made of
equally likely independent symbols).

338/405
Patricia (a.k.a. Compressed Tries)

When an internal node in a trie has only one non-empty
subtree we say that such a node is redundant. In a compressed
trie or Patricia (Practical Algorithm To Retrieve Information
Coded In Alphanumeric, D. Morrison, 1968), chains of redundant
nodes are substituted by a single node, and edges are labeled by
subsequences of one or more symbols.
In a Patricia the n strings are stored in the n leaves, and by
definition there are no empty leaves with one or more
non-empty siblings.

339/405
Patricia

To implement Patricia we can use the first child/next sibling or
the TST representation, but instead of storing one symbol per
node, we can store a substring of ≥ 1 symbols, or keep skip
attributes that indicate which position has to be examined next
if we move to a descendant node in the trie.

Thus during a search, if we reach a node x at level i, at which we
have matched the first i − 1 symbols of the given key, and the
node has the substring w = w0...w_{j−1} as a label, then we
check whether k[i−1..i+j−2] matches w; if so, we
"descend", otherwise we continue looking for the appropriate
subtree or declare the search unsuccessful (this depends on
whether we are compacting a list-trie or a TST).

340/405
Patricia

In a list-Patricia, the first child pointer will point to the
subtree that contains all strings with prefix k[0..i−2]·w,
and the next sibling will give us access to the subtrees that
store words with a prefix k[0..i−2]·w′′, where w′′ starts with a
symbol that follows w0 (the first symbol of w) in the alphabetic order.
In a bst-Patricia, the central child contains all strings with
prefix k[0..i−2]·w, the left child all strings with a smaller
prefix, and the right child all strings with a larger prefix.

A Patricia for the set X = {bear, bell, ..., stock, stop}.

Source: M. T. Goodrich & R. Tamassia

341/405


Patricia

Patricia saves on empty leaves, and it might be slightly
more efficient to store common infixes in the internal nodes
instead of fully expanded, one symbol per node; but
the major advantage of Patricia occurs when the strings are
stored externally and Patricia is an index into the external
storage.

For example, if we have an array of strings S[0..M−1], we can
designate the substring between positions i and j of S[k] by a
triplet ⟨k, i, j⟩; we can build our Patricia storing such triplets in
the nodes instead of symbols or subsequences of symbols.

342/405
Patricia

343/405
Patricia

// Mixed C++/pseudocode: slice comparisons such as k[i..i+len-1] are written informally.
template <typename Symbol, typename Key,
          typename Value>
typename DigitalDictionary<Symbol,Key,Value>::patricia_node*
DigitalDictionary<Symbol,Key,Value>::_lookup(patricia_node* p,
        const Key& k, int i) const {

  if (p == nullptr) return nullptr;
  if (i == k.length()) return p;
  triplet x = p -> _x;              // x = (idx, first, last)
  int len = x.last - x.first + 1;   // length of the subsequence
  if (k[i..i+len-1] < S[x.idx][x.first..x.last]) return nullptr;
  if (S[x.idx][x.first..x.last] < k[i..i+len-1])
      return _lookup(p -> _next_sibl, k, i);
  // S[x.idx][x.first..x.last] == k[i..i+len-1]
  return _lookup(p -> _first_child, k, i+len);
}

344/405
Patricia

The lookup algorithm in Patricia has cost O(ℓ·m) in the
worst case, where ℓ is the length of the longest string in the set
and m the size of the alphabet. If we combine the TST
representation with compression we get expected cost O(ℓ·log m)
for searches. Likewise, the cost of insertions and deletions is
similar to that of a search.

345/405
Inverted files

One interesting application of tries and Patricia is inverted files
(a.k.a. inverted indices).
Suppose we have a large collection of documents D1, ..., DT.
For each document we extract its set of unique words
(vocabulary, index terms); we may also record the positions of the
document at which each unique word occurs. It is also frequent to
remove common words such as pronouns, articles, connectives, ...,
known as stopwords.

346/405
Inverted files
We then proceed to insert/update, one by one, each index term
of each document in a trie (or TST or Patricia). When a word
appears in several different documents we keep track of that in
an occurrence list. Because of their sheer volume, occurrence lists
will typically be stored in secondary memory, and they won't be
kept in any particular order.
When we process a word w from document Di, we consider
three cases (see the sketch after this slide):
1 w is a stopword: discard it and proceed to the next
word/term in Di (or start processing a new document);
2 w was already in the inverted file: use the (compressed)
trie to locate the occurrence list and add a reference to
document Di to the list associated with the word w; or
3 w wasn't yet in the inverted file: create a new occurrence
list with (w, Di), append the new occurrence list to the set
of occurrence lists, and add a link from the (compressed)
trie to the new list.

347/405
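A minimal sketch of this construction (not the slides' code): a hash map plays
the role of the (compressed) trie, and the occurrence lists are kept in main
memory as vectors of document identifiers.

#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// term -> occurrence list (documents where the term appears)
std::unordered_map<std::string, std::vector<int>> inverted;

void add_document(int doc_id, const std::vector<std::string>& words,
                  const std::unordered_set<std::string>& stopwords) {
    std::unordered_set<std::string> seen;      // index each term once per document
    for (const std::string& w : words) {
        if (stopwords.count(w)) continue;      // case 1: stopword, skip it
        if (!seen.insert(w).second) continue;  // already indexed for this document
        inverted[w].push_back(doc_id);         // cases 2 and 3: locate or create the list
    }
}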
Inverted files

Inverted indices are used in search engines to retrieve relevant
documents as follows.
1 The user query Q is normalized and stopwords are removed.
2 For each term/word t in Q, use the compressed trie (in
main memory) to get the link to the corresponding
occurrence list for t and retrieve it from secondary memory.
3 Intersect (*) the occurrence lists (see the sketch after this
slide) and sort the resulting set of documents according to
some relevance parameter (e.g. the PageRank when the
documents are web pages).
(*) This is what we do when it is assumed that we want the
subset of documents that contain all the terms in Q; in some
cases, the user will use operators or will allow a certain degree
of mismatch, or we will have to produce results discarding
terms that do not appear in the index.
348/405
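A sketch of the intersection of two occurrence lists, assuming each list is kept
sorted by document identifier (the slides do not fix a representation, so sorted
vectors are an assumption made here):

#include <vector>

std::vector<int> intersect(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j]) ++i;
        else if (b[j] < a[i]) ++j;
        else { out.push_back(a[i]); ++i; ++j; }  // the document contains both terms
    }
    return out;
}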
Inverted files

If |Vi| is the size of the vocabulary Vi of Di, without stopwords,
then we will be building a (compressed) trie for

  N = |V1 ∪ V2 ∪ ··· ∪ VT|

words/terms, and the space that it will occupy will be roughly
Θ(N/H_S), H_S being the empirical entropy for Σ, based upon
the set of documents. On the other hand, we will have to store
the N words and their respective occurrence lists somewhere
else.
To construct the inverted file we will have to perform
D = ∑_{1≤i≤T} |Di| insertions/updates in the index, and each such
operation will have cost O(ℓ), where ℓ is the length of the
longest word in the collection of documents.
349/405
Inverted files

The cost of processing a query will be the cost of searching the
|Q| terms in the (compressed) trie (O(|Q|·ℓ)), plus the cost of
merging the |Q| occurrence lists and sorting the final result by
relevance.
In practical situations the occurrence lists can be long, but not
extremely long (words that are not stopwords do not appear in
a significant fraction of the documents of the collection!). This
is even more true for the final result (the "merging" of all
occurrence lists), and the cost of sorting will be small compared
to the others; hence the cost will be ≈ Θ(|Q|·T + T·log T);
indeed, it will be close to Θ(|Q|), as the length of the
occurrence lists can be thought of as O(1).

350/405
Tries

To learn more:
[1] D. E. Knuth
The Art of Computer Programming, Volume 3: Sorting and
Searching, 2nd ed
Addison-Wesley, 1998
[2] M. T. Goodrich and R. Tamassia
Algorithm Design and Applications
John Wiley & Sons, 2015

351/405
Tries

To learn more:
[1] J. L. Bentley and R. Sedgewick
Fast algorithms for sorting and searching strings
Proc. SODA, pp. 360–369, 1997
[2] J. Clément, Ph. Flajolet and B. Vallée
The Analysis of Hybrid Trie Structures
Proc. SODA, pp. 531–539, 1998

352/405
Part V

Data Structures for String


Processing

15 Tries

16 Suffix Trees

353/405
Suffix Trees

A suffix tree (or suffix trie) is simply a trie for all the suffixes of a
string, the text, T[0..n−1].

Thus we will form a trie with the suffixes of the text T. Since
the length of the text is n, the total number of (proper)
suffixes to store is n (n + 1 if we count the empty suffix), and the
total number of symbols involved is

  ∑_{i=0}^{n} (n − i) = n(n + 1)/2,

as the suffix T[i..n−1] has length n − i, 0 ≤ i ≤ n.

354/405
Suffix Tries
A compact representation of the trie, storing the pair (i, j) to
represent a substring T[i..j], will be most convenient; in any
case, using Patricia for the suffixes guarantees that the space
used is Θ(n).

(Figure (a): the suffix trie for the text T = minimize.)

355/405


Suffix Tries

A naïve approach to the construction of suffix tries would
need Θ(n²) time in the worst case, assuming that we use the
array-of-pointers representation with Θ(1) cost to find which
child to follow at the next level; otherwise an extra factor m or
log m appears.

However, there exist several linear-time algorithms for the
construction of suffix tries:
1 Weiner, 1973 (position trees)
2 McCreight, 1976 (more space efficient than Weiner's)
3 Ukkonen, 1995 (same bounds as McCreight's, simpler)
They won't be covered here.

356/405
Suffix Tries

Suffix tries have many applications. The most immediate one is
the substring matching problem. We are given a text T[0..n−1]
and a pattern P[0..k−1], typically k ≪ n, and we need to determine
whether P occurs as a substring of T, and if so, where.

Well-known algorithms like Knuth-Morris-Pratt (KMP) or
Boyer-Moore (BM) (there are many others) solve this important
problem by preprocessing the pattern in time Θ(k) and
then scanning the text in time Θ(n), giving a total cost Θ(n + k)
in the worst case.

357/405
Suffix Tries

The same bound is achieved with suffix tries: we invest time
Θ(n) in the preprocessing of the text (to build the suffix trie!)
and then search the pattern in the suffix trie with cost Θ(k).

The great news is that we can search many patterns very
efficiently: the cost of building the suffix trie is paid once, and
the search for each pattern is lightning fast. This is simply not
possible with KMP, BM and many other string matching algorithms,
since they preprocess each pattern and need to scan the text with
cost Θ(n) for every pattern. Unless we were given all the
patterns in advance, in which case we can preprocess the
whole set of p patterns at once (with cost Θ(k·p + n) instead of
cost Θ((k + n)·p)). But this is not the case in many situations
where the patterns to be searched are not known in advance.

358/405
Suffix tries

// Mixed C++/pseudocode: slice comparisons such as P[j..j+k-1] are written informally.
int k = P.size();              // number of pattern symbols left to match
int j = 0;                     // next position of P to be matched
suffix_trie_node* p = T.root;
do {
    bool fin = true;
    for (q : children of p) {
        int i = q.first;               // q's label is the substring T[q.first..q.last]
        if (P[j] == T[i]) {
            int len = q.last - i + 1;  // length of the node label
            if (k <= len) {
                // what remains of the pattern is not longer than the node label
                if (P[j..j+k-1] == T[i..i+k-1])
                    return "match at i-j";
                else
                    return "P is not a substring of T";
            } else { // k > len
                if (P[j..j+len-1] == T[i..i+len-1]) {
                    k -= len; j += len; p = q;   // descend into q
                    fin = false;
                } else {
                    return "P is not a substring of T";
                }
                break; // end the for (q : children of p) loop
            }
        }
    }
} while (not fin and p is not a leaf);
return "P is not a substring of T";
359/405
Tries

To learn more:
[1] D. E. Knuth
The Art of Computer Programming, Volume 3: Sorting and
Searching, 2nd ed
Addison-Wesley, 1998
[2] M. T. Goodrich and R. Tamassia
Algorithm Design and Applications
John Wiley & Sons, 2015
[3] Dan Gusfield
Algorithms on Strings, Trees & Sequences
Cambridge Univ. Press, 1997

360/405
Part VI

Multidimensional Data Structures

17 Multidimensional Data Structures: Introduction

18 K -Dimensional Search Trees

19 Quad Trees

20 Analysis of the Cost of Associative Queries

361/405
Why Multidimensional?
Multidimensional data everywhere:
Points, lines,
rivers, maps, cities, roads,
hyperplanes, cubes, hypercubes,
mp3, mp4 and mp5 files,
jpeg files, pixels,
...,
Used in applications such as:
database design, geographic information systems (GIS),
computer graphics, computer vision, computational
geometry, image processing,
pattern recognition,
very large scale integration (VLSI) design,
...
362/405
Why Multidimensional?

Data: File of K-dimensional points, K-tuples of the form:
  x = (x0, x1, ..., x_{K−1})
Retrieval: associative queries that involve more than one of
the K dimensions
Data structures:
K-Dimensional Search Trees (a.k.a. K-d trees)
Quad Trees
...

363/405
Associative Retrieval

Multidimensional data structures must support:


Insertions, deletions and (exact) search
Associative queries such as:
Partial Match Queries: Find the data points that match
some specified coordinates of a given query
point q .
Orthogonal Range Queries: Find the data points that fall
within a given hyper rectangle Q (specified by
K ranges).
Nearest Neighbor Queries: Find the closest data point to
some given query point q (under a predefined
distance).

364/405
Associative Queries

365/405
Partial Match Queries

Definition
Given a file F of n K-dimensional records and a query
q = (q0, q1, ..., q_{K−1}), where each qi is either a value
in Di (it is specified) or ∗ (it is unspecified), a partial
match query returns the subset of records x in F whose
attributes coincide with the specified attributes of q. This
is,

  {x ∈ F | qi = ∗ or qi = xi, ∀i ∈ {0, ..., K − 1}}.

366/405
Partial Match Queries

Example
Query: q = (∗, q2), with specification pattern 01
(only the second coordinate is specified).
367/405
Part VI

Multidimensional Data Structures

17 Multidimensional Data Structures: Introduction

18 K -Dimensional Search Trees

19 Quad Trees

20 Analysis of the Cost of Associative Queries

368/405
Standard K -d trees

369/405
Standard K -d Trees

Definition (Bentley, 1975)
A standard K-d tree T of size n ≥ 0 is a binary tree that
stores a set of n K-dimensional points, such that
it is empty when n = 0, or
its root stores a key x and a discriminant
j = (level of the root) mod K, 0 ≤ j < K, and the
remaining n − 1 records are stored in the left and
right subtrees of T, say L and R, in such a way that
both L and R are K-d trees; furthermore, for any
key u ∈ L, it holds that u_j < x_j, and for any key
v ∈ R, it holds that x_j < v_j.
370/405
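A minimal sketch of insertion in a standard K-d tree (not the slides' code): the
discriminant of a node is its depth mod K, keys are vectors of K doubles, and
equal coordinate values are sent to the right (one of several possible conventions).

#include <vector>

struct kd_node {
    std::vector<double> key;
    kd_node* left = nullptr;
    kd_node* right = nullptr;
};

kd_node* kd_insert(kd_node* t, const std::vector<double>& x, int depth, int K) {
    if (t == nullptr) {
        kd_node* n = new kd_node;
        n->key = x;
        return n;
    }
    int j = depth % K;                    // discriminant at this level
    if (x[j] < t->key[j])
        t->left = kd_insert(t->left, x, depth + 1, K);
    else
        t->right = kd_insert(t->right, x, depth + 1, K);
    return t;
}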
Relaxed K -d trees

A relaxed K-d tree (Duch, Estivill-Castro, Martínez, 1998) for a
set of K-dimensional keys is a binary tree in which:
1 Each node contains a K-dimensional record and has
associated an arbitrary discriminant j ∈ {0, 1, ..., K − 1}.
2 For every node with key x and discriminant j, the following
invariant is true: any record in the right subtree with key y
satisfies y_j < x_j, and any record in the left subtree with key
y satisfies y_j ≥ x_j.

371/405
Relaxed K -d trees

372/405
K -d trees

Several other variants of K-d trees have been proposed in the
literature; they differ by the way in which discriminants are
assigned to nodes. For example:
Squarish K-d trees: Use as discriminant the coordinate which
cuts the longest edge of the bounding box.
Median K-d trees: Use as discriminant the coordinate in which
the key is closest to the corresponding coordinate of the
bounding box's center.

373/405
K -d trees
The bounding box of a leaf in the K-d tree T is the region of the
domain associated with that leaf, namely, the set of points of the
K-dimensional space that would replace that leaf if any of them
were inserted into T.

The bounding box of a node x is that of the leaf which that node
replaced.

374/405
Partial Match in K -d Trees

Partial match search in K-d trees works as follows:
At each node of the tree we check whether its key satisfies the
query, and we examine its discriminant.
If the discriminant is specified in the query then the
algorithm recursively follows the appropriate subtree,
depending on the result of the comparison between the
key and the query.
Otherwise the algorithm recursively follows the two
subtrees of the node.

375/405
Partial Match Algorithm

procedure PARTIALMATCH(T, q)
  ▷ T: tree, q: query
  if T is not empty then        ▷ nothing to do if T were empty
    i := T→discr
    if MATCH(T→key, q) then
      REPORT(T→key)
    if q[i] ≠ ∗ then            ▷ coordinate i specified
      if q[i] < T→key[i] then
        PARTIALMATCH(T→left, q)
      else
        PARTIALMATCH(T→right, q)
    else                        ▷ coordinate i not specified, q[i] = ∗
      PARTIALMATCH(T→left, q)
      PARTIALMATCH(T→right, q)

376/405
Orthogonal Range and Region Queries in K -d
Trees

Orthogonal range and region queries (even for complex regions) in
K-d trees work as follows:
At each node of the tree we check whether its key satisfies the
query, e.g., the key at the node is inside the orthogonal range
(hyper)rectangle, or it is within the given distance from the
query "center".
For each subtree T′ of T, we check if the bounding box of
T′ intersects the region defined by the query; if so we
recursively visit T′.

377/405
Orthogonal Range Algorithm

procedure ORTHOGONALRANGE(T, Q)
  ▷ T: tree, Q = [ℓ0, u0] × ··· × [ℓ_{K−1}, u_{K−1}]: query
  if T is not empty then        ▷ nothing to do if T were empty
    i := T→discr
    if INSIDE(T→key, Q) then
      REPORT(T→key)
    if Q[i].ℓ ≤ T→key[i] then
      ORTHOGONALRANGE(T→left, Q)
    if T→key[i] ≤ Q[i].u then
      ORTHOGONALRANGE(T→right, Q)

378/405
Nearest Neighbors

In order to find the k nearest neighbors of a point q (the query) in
the K-d tree T, we carry out a region search, but instead of a
fixed distance we update the distance as the search proceeds;
in particular, the distance is at every moment the k-th smallest
distance observed so far.

The algorithm maintains a list S of the k points of the tree that
are closest to the query so far. It is convenient to maintain such a
list as a (min) priority queue, with distances to q as priorities.

If the current node x is at distance less than r = S.min_prio(),
then the element of minimum priority is extracted from S and x
is added to S.

379/405
Nearest Neighbors

Let i be the discriminant at the current node x, and
r = S.min_prio() the current radius of the search. If q[i] − r ≤ x[i]
then the left subtree's bounding box intersects the query region
and the left subtree must be explored. Likewise, if x[i] ≤ q[i] + r
then the right subtree's bounding box intersects the query region
and the right subtree must be explored.

If both conditions are true we have to make both recursive
calls; but it may turn out that after completion of the first
recursive call the value of r has been updated and the visit to the
other subtree is no longer needed. For that reason the
algorithm chooses to visit first the left or the right subtree,
depending on which side is more likely to help avoid the second
call.

380/405
Nearest Neighbors

381/405
Nearest Neighbors Algorithm

procedure NN(T, q, S, r)
  ▷ T: tree, q: query, S: result, initially empty
  ▷ r: search radius, initially +∞
  if T is not empty then        ▷ nothing to do if T were empty
    i := T→discr
    if d := DISTANCE(T→key, q) ≤ r then
      if |S| ≥ k then
        S.EXTRACT_MIN()
        S.INSERT(T→key, d)
        r := S.MIN_PRIO()
      else
        S.INSERT(T→key, d)
    ...                         ▷ see next slide

382/405
Nearest Neighbors Algorithm

procedure NN(T, q, S, r)
  ...
  if q[i] − r ≤ T→key[i] ∧ T→key[i] ≤ q[i] + r then
    if q[i] ≤ T→key[i] then
      NN(T→left, q, S, r)
      if T→key[i] ≤ q[i] + r then
        NN(T→right, q, S, r)
    else
      NN(T→right, q, S, r)
      if q[i] − r ≤ T→key[i] then
        NN(T→left, q, S, r)
  else
    ▷ the query region does not intersect both bounding boxes
    ▷ visit the subtree that corresponds
    ...
...

383/405
Part VI

Multidimensional Data Structures

17 Multidimensional Data Structures: Introduction

18 K -Dimensional Search Trees

19 Quad Trees

20 Analysis of the Cost of Associative Queries

384/405
2-d Quad Trees

Definition (Bentley & Finkel, 1974)
A quad tree for a file of 2-dimensional records is a
quaternary tree in which:
1 Each node contains a 2-dimensional key and has
associated four subtrees corresponding to the
quadrants NW, NE, SE and SW.
2 For every node with key x the following invariant is
true: any record in the NW subtree with key y
satisfies y1 < x1 and y2 ≥ x2; any record in the NE
subtree with key y satisfies y1 ≥ x1 and y2 ≥ x2;
any record in the SE subtree with key y satisfies
y1 ≥ x1 and y2 < x2; and any record in the SW
subtree with key y satisfies y1 < x1 and y2 < x2.

385/405
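A minimal sketch of insertion in a 2-d quad tree following the quadrant
convention of the definition above (the struct and field names are assumptions,
not the slides' code):

struct quad_node {
    double x1, x2;
    quad_node* child[4] = {nullptr, nullptr, nullptr, nullptr};  // NW, NE, SE, SW
};

int quadrant(const quad_node* t, double y1, double y2) {
    if (y2 >= t->x2) return (y1 < t->x1) ? 0 : 1;   // NW : NE
    else             return (y1 >= t->x1) ? 2 : 3;  // SE : SW
}

quad_node* quad_insert(quad_node* t, double y1, double y2) {
    if (t == nullptr) {
        quad_node* n = new quad_node;
        n->x1 = y1; n->x2 = y2;
        return n;
    }
    int q = quadrant(t, y1, y2);
    t->child[q] = quad_insert(t->child[q], y1, y2);
    return t;
}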
Quad Trees

386/405
Quad Trees

Definition (Bentley & Finkel, 1974)
A quad tree T of size n ≥ 0 stores a set of n K-
dimensional records. The quad tree T is a 2^K-ary tree
such that
either it is empty and n = 0, or
its root stores a record with key x and has 2^K
subtrees, each one associated to a K-bitstring
w = w0 w1 ... w_{K−1} ∈ {0, 1}^K, and the remaining
n − 1 records are stored in one of these subtrees,
let's say Tw, in such a way that ∀w ∈ {0, 1}^K: Tw is
a quad tree, and for any key y ∈ Tw, it holds that
yj < xj if wj = 0 and yj > xj otherwise, 0 ≤ j < K.

387/405
Part VI

Multidimensional Data Structures

17 Multidimensional Data Structures: Introduction

18 K -Dimensional Search Trees

19 Quad Trees

20 Analysis of the Cost of Associative Queries

388/405
The Cost of Partial Match Searches

With probability s/K the discriminant will be specified in the
query and the algorithm will follow one of the subtrees.
With probability (K − s)/K the algorithm will follow the two
subtrees.
In a random K-d tree the size of the left subtree of a tree of size
n is j with equal probability for all j, 0 ≤ j < n. Hence, the
expected cost Pn of a partial match in a random relaxed K-d tree
of size n satisfies

  Pn = 1 + (s/K)·(1/n)·∑_{j=0}^{n−1} [ ((j+1)/(n+1))·Pj + ((n−j)/(n+1))·P_{n−1−j} ]
         + ((K−s)/K)·(1/n)·∑_{j=0}^{n−1} ( Pj + P_{n−1−j} )

389/405
The Cost of Partial Match Searches
The shape function for the recurrence is, with ρ := s/K,

  ω(z) = 2ρz + 2(1 − ρ)

If we compute

  H = 1 − ∫₀¹ ω(z) dz = 1 − (2 − ρ) = ρ − 1 < 0

we need to find α ∈ [0, 1] such that

  ∫₀¹ z^α·ω(z) dz = 1,

that is,

  2ρ/(α + 2) + 2(1 − ρ)/(α + 1) = 1.

The solution of this quadratic equation is

  α = (√(9 − 8ρ) − 1)/2
390/405
The Cost of Partial Match

Theorem (Duch et al., 1998)
The expected cost Pn (measured as the number of
comparisons) of a PM query with s out of K coordinates
specified in a random relaxed K-d tree of size n is

  Pn = β·n^α + O(1),

where

  α = (√(9 − 8ρ) − 1)/2,     ρ = s/K,

  β = Γ(2α + 1) / ( (1 − ρ)·(α + 1)·Γ³(α + 1) ),

and Γ(z) is Euler's Gamma function.
391/405
The Cost of Partial Match

Theorem (Flajolet and Puech, 1986)
The expected cost Pn (measured as the number
of comparisons) of a PM query q with s out of K
coordinates specified in a standard K-d tree of size
n is

  Pn = β_u·n^α + O(1),

where α is the unique solution in [0, 1] of

  (α + 2)^ρ·(α + 1)^{1−ρ} = 2,     ρ = s/K,

and β_u is a constant depending on the query pattern u
(u[i] = specified/unspecified).
392/405
The Cost of Partial Match

The exponent α for standard K-d trees is smaller than the one for
relaxed K-d trees. If we consider the excess

  ϑ = α − (1 − ρ),

it is very close to 0 (and never greater than 0.07) for standard
K-d trees. The excess for relaxed K-d trees is not very big, but
can be as much as 0.12.
Squarish K-d trees achieve the ultimate optimal expected
performance, as they have excess ϑ = 0, thanks to their more
(heuristically) balanced space partition, induced by their choice
of discriminants.

393/405
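A small numeric check of the exponents discussed above (not from the slides):
alpha_relaxed uses the closed form for relaxed K-d trees, and alpha_standard
finds, by bisection, the root of the Flajolet-Puech equation for standard K-d trees.

#include <cmath>

double alpha_relaxed(double rho) {                 // (sqrt(9 - 8 rho) - 1) / 2
    return (std::sqrt(9.0 - 8.0 * rho) - 1.0) / 2.0;
}

double alpha_standard(double rho) {                // root in [0,1] of (a+2)^rho (a+1)^(1-rho) = 2
    double lo = 0.0, hi = 1.0;
    for (int it = 0; it < 60; ++it) {              // the left-hand side is increasing in a
        double a = 0.5 * (lo + hi);
        if (std::pow(a + 2.0, rho) * std::pow(a + 1.0, 1.0 - rho) < 2.0) lo = a;
        else hi = a;
    }
    return 0.5 * (lo + hi);
}
// For rho = s/K = 1/2: alpha_relaxed ~ 0.6180 and alpha_standard ~ 0.5616,
// giving excesses of about 0.118 and 0.062, in line with the plot that follows.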
The Cost of Partial Match Searches
Excess of the exponent with respect to 1 − s/K. Solid line:
standard K-d trees. Dotted line: relaxed K-d trees.
394/405
The Cost of Partial Match Searches
Plot of the coefficient β for relaxed K-d trees.
395/405
The Cost of Orthogonal Range Queries

The expected cost of an orthogonal range query Q with side
lengths Δ0 = u0 − ℓ0, Δ1 = u1 − ℓ1, ... is

  v_K(Δ)·n + v_{K−1}(Δ)·n^{α(1/K)} + v_{K−2}(Δ)·n^{α(2/K)} +
      ··· + v_1(Δ)·n^{α((K−1)/K)} + 2·v_0·ln n + O(1)

where v_j is, roughly, the "volume" of the j-dimensional
boundary of Q. For example, if K = 2 then v_2 = Δ0·Δ1,
v_1 = Δ0·(1 − Δ1) + Δ1·(1 − Δ0) ≈ Δ0 + Δ1, and
v_0 = (1 − Δ0)·(1 − Δ1) = 1 − Δ0 − Δ1 + Δ0·Δ1. Intuition: the
orthogonal range search behaves in a region of "volume" v_j
exactly as a partial match with K − j specified coordinates.

396/405
The Cost of Nearest Neighbors Queries

The expected cost of a nearest neighbor query is

  Sn = Θ(n^ϑ + log n)

where ϑ = max_j { α(j/K) − 1 + j/K } is the maximum excess.
Intuition: the nearest neighbor search behaves like an
orthogonal range search where the query has side lengths
Δ_i = Θ(n^{−1/K}).

397/405
To learn more

[1] J. L. Bentley.
Multidimensional binary search trees used for associative
retrieval.
Communications of the ACM, 18(9):509–517, 1975.
[2] J. L. Bentley and R. A. Finkel.
Quad trees: A data structure for retrieval on composite
keys.
Acta Informatica, 4:1–9, 1974.
[3] H. H. Chern and H. K. Hwang.
Partial match queries in random k-d trees.
SIAM J. on Computing, 35(6):1440–1466, 2006.
[4] H. H. Chern and H. K. Hwang.
Partial match queries in random quad trees.
SIAM Journal on Computing, 32(4):904–915, 2003.
398/405
To learn more (2)

[5] L. Devroye.
Branching processes in the analysis of the height of trees.
Acta Informatica, 24:277–298, 1987.
[6] L. Devroye and L. Laforest.
An analysis of random d-dimensional quadtrees.
SIAM Journal on Computing, 19(5):821–832, 1990.
[7] A. Duch.
Randomized insertion and deletion in point quad trees.
In Int. Symposium on Algorithms and Computation
(ISAAC), LNCS. Springer–Verlag, 2004.
[8] A. Duch, V. Estivill-Castro, and C. Martínez.
Randomized K -dimensional binary search trees.
In K.-Y. Chwa and O. H. Ibarra, editors, Int. Symposium on
Algorithms and Computation (ISAAC’98), volume 1533 of
399/405
LNCS, pages 199–208. Springer-Verlag, 1998.
To learn more (3)

[9] A. Duch and C. Martínez.


On the average performance of orthogonal range search
in multidimensional data structures.
Journal of Algorithms, 44(1):226–245, 2002.
[10] A. Duch and C. Martínez.
Updating relaxed k-d trees.
ACM Transactions on Algorithms (TALG), 6(1):1–24, 2009.
[11] Ph. Flajolet, G. Gonnet, C. Puech, and J. M. Robson.
Analytic variations on quad trees.
Algorithmica, 10:473–500, 1993.
[12] Ph. Flajolet and C. Puech.
Partial match retrieval of multidimensional data.
Journal of the ACM, 33(2):371–407, 1986.
400/405
To learn more (4)

[13] C. Martínez, A. Panholzer, and H. Prodinger.


Partial match queries in relaxed multidimensional search
trees.
Algorithmica, 29(1–2):181–204, 2001.
[14] R. Neininger.
Asymptotic distributions for partial match queries in K -d
trees.
Random Structures and Algorithms, 17(3–4):403–4027,
2000.
[15] R. L. Rivest.
Partial-match retrieval algorithms.
SIAM Journal on Computing, 5(1):19–50, 1976.
[16] H. Samet.
Deletion in two-dimensional quad-trees.
Communications of the ACM, 23(12):703–710, 1980.

401/405
General References

[1] R. Sedgewick & Ph. Flajolet


An Introduction to the Analysis of Algorithms.
Addison-Wesley, 2nd edition, 2013.
[2] D. E. Knuth.
The Art of Computer Programming: Sorting and
Searching, volume 3.
Addison-Wesley, 2nd edition, 1998.
[3] D.P. Mehta and S. Sahni, editors.
Handbook of Data Structures and Applications.
Chapman & Hall, CRC, 2005.

402/405
General References (2)

[4] T. Cormen, C. Leiserson, R. Rivest & C. Stein.


Introduction to Algorithms.
The MIT Press, 3rd edition, 2009.
[5] P. Raghavan and R. Motwani.
Randomized Algorithms.
Cambridge University Press, 1995.
[6] M. Mitzenmacher and E. Upfal.
Probability and computing: Randomized algorithms and
probabilistic analysis.
Cambridge University Press, 2005.

403/405
General References (3)

[7] R. Sedgewick.
Algorithms in C.
Addison-Wesley, 3rd edition, 1997.
[8] R. Sedgewick and K. Wayne.
Algorithms.
Addison-Wesley, 4th edition, 2011.
[9] D. Gusfield.
Algorithms on String, Trees, and Sequences.
Cambridge Univ. Press, 1997.

404/405
General References (4)

[10] H. Samet.
Foundations of Multidimensional and Metric Data
Structures.
Morgan Kaufmann, 2006.

405/405
