Algorithms Part 1 - Lecture Notes: 1 Union Find
1 Union Find
1.1 Dynamic Connectivity
General steps to developing a usable algorithm: model the problem, find an algorithm, calculate its speed, and improve until satisfied.
Defining the problem: Given a set of N objects, the union command connects two objects, and the find query asks: is there a path connecting two given objects?
Implementing the operations:
Find Query - Check if two objects are in the same component (connected component = set of maximal
connected nodes)
Union Command - Replace components containing two objects with their union
1.2 Quick Find
The data structure is an integer array id[] of size N, and p and q are connected iff they have the same id. So, if objects 1, 2, 3 are connected, then entries 1, 2, 3 all hold the same id in the array. Union: to merge the components containing p and q, change all entries whose id equals id[p] to id[q].
Algorithm:
1. Initialize an array of size N and set the id of each object to itself.
2. Union: to merge p and q, change all entries equal to id[p] to id[q].
3. Find: p and q are connected iff id[p] = id[q].
Efficiency: quick-find takes N array accesses to initialize, up to N per union, and a constant number per find. Hence it takes ~N² array accesses to process a sequence of N union commands on N objects.
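The quick-find scheme above can be sketched in Java as follows (a minimal sketch; the class and method names are my own, loosely following the course's conventions):

```java
public class QuickFindUF {
    private final int[] id;

    public QuickFindUF(int n) {
        id = new int[n];
        for (int i = 0; i < n; i++) id[i] = i; // each object starts in its own component
    }

    public boolean connected(int p, int q) {
        return id[p] == id[q]; // constant number of array accesses
    }

    public void union(int p, int q) {
        int pid = id[p], qid = id[q];
        if (pid == qid) return;
        // change every entry equal to id[p] to id[q]: up to N array accesses per union
        for (int i = 0; i < id.length; i++)
            if (id[i] == pid) id[i] = qid;
    }
}
```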
1.3 Quick Union
Data structure: integer array id[] of size N, where id[i] is the parent of i. To check whether two nodes are connected, we check whether they have the same root. For union, to merge the components containing p and q, set the id of p's root to the id of q's root.
Algorithm:
1. Set the id of each object to itself (N array accesses).
2. Union: given p and q, change the root of p to point to the root of q.
3. To check if p and q are connected: while p ≠ id[p], set p = id[p]; the final p is the root of p. Repeat for q, then check whether the two roots are the same.
The defect of this algorithm is that the trees can get too tall, so the find operation becomes too expensive.
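The quick-union steps above can be sketched as follows (again a sketch with names of my own choosing):

```java
public class QuickUnionUF {
    private final int[] id; // id[i] is the parent of i

    public QuickUnionUF(int n) {
        id = new int[n];
        for (int i = 0; i < n; i++) id[i] = i; // each node starts as its own root
    }

    private int root(int i) {
        while (i != id[i]) i = id[i]; // chase parent pointers up to the root
        return i;
    }

    public boolean connected(int p, int q) {
        return root(p) == root(q);
    }

    public void union(int p, int q) {
        id[root(p)] = root(q); // link p's root to q's root
    }
}
```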
1.4 Improvements
1.4.1 Improvement 1 - Weighted Quick Union
Take steps to avoid tall trees: keep track of the number of objects in each tree, and maintain balance by linking the root of the smaller tree to the root of the larger tree (smaller tree goes lower). This requires a new array size[], where size[i] counts the objects in the tree rooted at i. Modifying quick union: given nodes p and q, get their roots i and j using the root function. If the roots are equal, return. Otherwise, if size[i] < size[j], set id[i] = j and size[j] += size[i]; else set id[j] = i and size[i] += size[j].
Proposition 1: The depth of any node x is at most log₂(N).
Proof: The depth of x increases by 1 exactly when the tree T1 containing x is merged into another tree T2. By the weighted rule, |T2| ≥ |T1|, so the size of the tree containing x at least doubles. Starting from size 1, the tree containing x can double at most log₂(N) times, because its size never exceeds N.
Efficiency: initialization is O(N), and connected/union are O(log₂ N).
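The weighted modification above can be sketched as follows (a sketch; names are my own):

```java
public class WeightedQuickUnionUF {
    private final int[] id;   // id[i] = parent of i
    private final int[] size; // size[i] = number of objects in the tree rooted at i

    public WeightedQuickUnionUF(int n) {
        id = new int[n];
        size = new int[n];
        for (int i = 0; i < n; i++) { id[i] = i; size[i] = 1; }
    }

    private int root(int i) {
        while (i != id[i]) i = id[i];
        return i;
    }

    public boolean connected(int p, int q) { return root(p) == root(q); }

    public void union(int p, int q) {
        int i = root(p), j = root(q);
        if (i == j) return;
        // link the root of the smaller tree to the root of the larger tree
        if (size[i] < size[j]) { id[i] = j; size[j] += size[i]; }
        else                   { id[j] = i; size[i] += size[j]; }
    }
}
```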
1.4.2 Improvement 2 - Path Compression
Two-pass path compression: after computing the root of p, set the id of each examined node to point directly to that root. A simpler one-pass variant (path halving): make every other node in the path point to its grandparent, thereby halving the path length. This is a one-line change inside the while loop of the root function: id[i] = id[id[i]].
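The one-line path-halving change can be sketched inside a minimal quick-union class (a sketch; the class name is my own):

```java
public class QuickUnionPathHalvingUF {
    private final int[] id; // id[i] = parent of i

    public QuickUnionPathHalvingUF(int n) {
        id = new int[n];
        for (int i = 0; i < n; i++) id[i] = i;
    }

    private int root(int i) {
        while (i != id[i]) {
            id[i] = id[id[i]]; // path halving: point i at its grandparent
            i = id[i];
        }
        return i;
    }

    public boolean connected(int p, int q) { return root(p) == root(q); }

    public void union(int p, int q) { id[root(p)] = root(q); }
}
```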
Proposition 2: Starting from an empty data structure, any sequence of M union-find operations on N objects makes ≤ c(N + M lg* N) array accesses. Here lg* N is the iterated log function: the number of times we must take the log to bring N down to 1 (at most about 5 in practice). Hence the running time is effectively linear.
Worst-case analysis (M operations on N objects):
Quick-find: O(MN)
Quick-union: O(MN)
Weighted quick-union: O(N + M log N)
Quick-union with path compression: O(N + M log N)
Weighted quick-union with path compression: O(N + M lg* N)
1.5 Percolation
We have an N-by-N grid of sites, and each site is open with some probability p. The system percolates iff the top and bottom rows are connected by open sites.
When N is large, there is a sharp threshold p* such that if p > p* the system almost certainly percolates, and if p < p* it almost certainly does not. No mathematical formula for p* is known, so we use simulation to estimate it. We can run Monte Carlo simulations: initialize the whole grid as blocked, then open random sites until the top and bottom are connected. The vacancy percentage at that point estimates p*. Note that we have two types of open sites: open and connected to the top (full), or open and empty.
Checking whether an N-by-N system percolates: create an object for each site and name them 0 to N² − 1. Sites are in the same component if they are connected by open sites. The system percolates iff any site on the bottom row is connected to some site on the top row. Brute force checks all N² top-bottom pairs of sites, so it requires N² connected queries. To fix this, we introduce two more vertices: a virtual top (connected to every site in the top row) and a virtual bottom (connected to every site in the bottom row). Now the system percolates iff the virtual top site is connected to the virtual bottom site, a single query. Modeling the opening of a site: connect it to all adjacent open sites (4 at maximum). Running this simulation, we find that the threshold is approximately 0.59.
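The virtual-site scheme above can be sketched as follows. This is a sketch under stated assumptions: 0-based indices, names of my own choosing, and quick-find for the connectivity for brevity (the course uses weighted quick-union for speed):

```java
import java.util.Random;

public class Percolation {
    private final int n;
    private final boolean[] open;
    private final int[] id; // quick-find ids: sites 0..n*n-1, then virtual top and bottom
    private final int top, bottom;

    public Percolation(int n) {
        this.n = n;
        open = new boolean[n * n];
        id = new int[n * n + 2];
        top = n * n; bottom = n * n + 1;
        for (int i = 0; i < id.length; i++) id[i] = i;
    }

    private void unite(int p, int q) { // quick-find union
        int pid = id[p], qid = id[q];
        if (pid == qid) return;
        for (int i = 0; i < id.length; i++) if (id[i] == pid) id[i] = qid;
    }

    public void openSite(int row, int col) { // 0-based indices
        int s = row * n + col;
        if (open[s]) return;
        open[s] = true;
        if (row == 0) unite(s, top);        // connect to the virtual top
        if (row == n - 1) unite(s, bottom); // connect to the virtual bottom
        int[][] nbrs = {{row - 1, col}, {row + 1, col}, {row, col - 1}, {row, col + 1}};
        for (int[] nb : nbrs) {             // connect to adjacent open sites (4 at most)
            int r = nb[0], c = nb[1];
            if (r >= 0 && r < n && c >= 0 && c < n && open[r * n + c])
                unite(s, r * n + c);
        }
    }

    public boolean percolates() {
        return id[top] == id[bottom];
    }

    // One Monte Carlo trial: open random sites until the system percolates,
    // then return the vacancy fraction, an estimate of the threshold p*.
    public static double trial(int n, Random rnd) {
        Percolation p = new Percolation(n);
        int opened = 0;
        while (!p.percolates()) {
            int r = rnd.nextInt(n), c = rnd.nextInt(n);
            if (!p.open[r * n + c]) { p.openSite(r, c); opened++; }
        }
        return (double) opened / (n * n);
    }
}
```

Averaging trial() over many runs with large n approaches the ~0.59 threshold mentioned above.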
2 Analysis of Algorithms
2.1 Observations
Example 1 - 3-Sum: Given N distinct integers, how many triples sum to exactly 0?
Algorithm
Input: An array a of N numbers.
1. Initialize a count variable to 0: count = 0.
2. Three nested for loops: for i < N ; then for j = i + 1 < N ; and finally for k = j + 1 < N .
3. If a[i] + a[j] + a[k] = 0 then count += 1.
Output: The variable count, the total number of triples that sum to 0.
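The three nested loops above can be sketched as:

```java
public class ThreeSum {
    // Brute-force count of triples summing to exactly 0.
    public static int count(int[] a) {
        int n = a.length, count = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                for (int k = j + 1; k < n; k++)
                    if (a[i] + a[j] + a[k] == 0) count++;
        return count;
    }
}
```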
Calculating the running time: create a plot of N versus T(N), and also a log-log plot. Running a regression on the log-log plot gives a straight line, so log(T(N)) = b log(N) + c. To quickly estimate b, run the program while doubling the size of the input; the log of the ratio of consecutive times converges to the constant b.
2.2 Mathematical Models
We make simplifications: when N is large, the lower-order terms can be ignored. For example, for 2-Sum we choose 2 from N, which is C(N, 2) ≈ ½N², and each pair costs 2 array accesses, hence ~N² array accesses in total; therefore the running time is of order O(N²). For 3-Sum we choose 3 from N, which is C(N, 3) ≈ (1/6)N³, with 3 array accesses per triple, hence ~½N³ accesses, so the running time is O(N³).
To estimate a discrete sum, we can approximate using an integral (helps to get the high order approximation).
For example:
\[
1 + 2 + \cdots + N = \sum_{i=1}^{N} i \approx \int_{1}^{N} x \, dx \approx \tfrac{1}{2} N^2
\]
So, in general, if we are summing a function, we can approximate the sum by integrating the function between 1 and N. We write f(x) ~ g(x) for large x when
\[
\lim_{x \to \infty} \frac{f(x)}{g(x)} = 1
\]
2.3 Order of Growth
Proposition: Binary search in a sorted array of size N uses at most 1 + lg(N) compares.
Proof: Define T(N) = the number of compares binary search makes in a sorted subarray of size N. We know T(N) ≤ T(N/2) + 1 for N > 1, with T(1) = 1. Applying the recurrence to the right-hand side gives T(N) ≤ T(N/4) + 1 + 1. Repeating this lg(N) times, we have T(N) ≤ T(N/N) + lg(N) = 1 + lg(N).
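The binary search being analysed can be sketched as the standard iterative version (a sketch; the method name is my own):

```java
public class BinarySearch {
    // Each iteration halves the subarray, giving the ~1 + lg N compares above.
    public static int indexOf(int[] a, int key) {
        int lo = 0, hi = a.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if (key < a[mid]) hi = mid - 1;
            else if (key > a[mid]) lo = mid + 1;
            else return mid;
        }
        return -1; // not found
    }
}
```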
Now we can come up with a faster algorithm for the 3-Sum problem. First, sort the N distinct numbers; then for every pair a[i] and a[j], binary search for −(a[i] + a[j]). Count a hit only if a[i] < a[j] < a[k], to avoid double counting. The first step can be any sorting algorithm (insertion sort, for example, which is O(N²)); the second step does a binary search per pair, which yields O(N² log N) overall.
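The sort-then-search idea can be sketched as follows (a sketch assuming distinct integers, as the problem statement requires; it uses the standard library sort rather than insertion sort):

```java
import java.util.Arrays;

public class ThreeSumFast {
    // Sort, then for each pair binary search for -(a[i] + a[j]).
    public static int count(int[] input) {
        int[] a = input.clone();
        Arrays.sort(a);                // O(N log N)
        int n = a.length, count = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                int k = Arrays.binarySearch(a, -(a[i] + a[j])); // O(log N)
                if (k > j) count++;    // k > j guarantees a[i] < a[j] < a[k]: no double counting
            }
        return count;
    }
}
```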
2.3.1 Notation
O(N) means that an upper bound on the running time of the algorithm is aN for some constant a.
Note: the functions 12N³ and 11N are both O(N³), because both are bounded by some aN³.
Ω(N) means that a lower bound on the running time is aN.
Θ(N) means that the upper and lower bounds on the running time have the same order.
The lower bound for both 1-Sum and 3-Sum is Ω(N), because we must examine at least the entire array. For 3-Sum there is a gap between the lower and upper bounds. We have found an optimal algorithm only when the upper and lower bounds coincide.
2.4 Memory
Typical memory usage: boolean(1), byte(1), char(2), int(4), float(4), long(8), and double(8).
Total memory usage for a data type value:
Primitive type: 4 bytes for int, 8 bytes for double etc
Object Reference: 8 bytes
Array: 24 bytes + memory for each array entry
Object: 16 bytes + memory for each instance variable + 8 (if inner class)
Padding: Round up to multiple of 8 bytes.
Example: how much memory does the weighted quick-union data structure use, as a function of N? 16 bytes for the object overhead; two arrays, each contributing (24 + 4N) + 8 bytes, where the 8 bytes are for the reference to the array; 4 bytes for the integer count; and 4 bytes of padding. Hence the total memory usage is 8N + 88 ≈ 8N bytes. The padding is assigned as follows: first sum all memory contributions, then add the minimum amount needed to make the total a multiple of 8.
3 Stacks and Queues
3.1 Stacks
For stacks, we remove the item that was most recently added; for queues, we remove the item least recently added.
3.1.1 Linked-List Implementation
Refer to the code files. Every operation takes constant time in the worst case. As for space usage: a stack with N items uses ~40N bytes, because each node costs 16 bytes of object overhead + 8 bytes of inner-class overhead + 8 bytes for the string reference + 8 bytes for the node reference. This does not include the space for the strings themselves, which belongs to the client. Each item is stored in a node containing the item and a link to the next node (starting from the most recently added node).
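The linked-list stack described above can be sketched as (a sketch; names follow the course's style but are my own):

```java
public class LinkedStackOfStrings {
    private Node first; // top of the stack

    private class Node {   // inner class: the item plus a link to the node below
        String item;
        Node next;
    }

    public boolean isEmpty() { return first == null; }

    public void push(String item) {
        Node oldFirst = first;
        first = new Node();
        first.item = item;
        first.next = oldFirst; // the new node links to the old top
    }

    public String pop() {
        String item = first.item;
        first = first.next;    // the old top node becomes garbage-collectable
        return item;
    }
}
```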
3.1.2 Array Implementation
Use an array s[] to store the N items on the stack. To add, we simply push onto the next free index, and to remove we simply pop from the last index (n − 1). The defect is that the stack overflows when N exceeds the capacity, because we need to declare the size of the array beforehand. For overflow, we must resize the array (later in the notes). There is also a loitering problem in Java: after we pop, the array still holds a reference to the popped object, and we must remove it for more efficient memory allocation. The pop therefore becomes: String item = s[--n]; s[n] = null; return item;. The garbage collector can reclaim an object's memory only if there are no outstanding references to it.
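The fixed-capacity array stack, with the null-out in pop, can be sketched as (a sketch; the class name is my own):

```java
public class FixedCapacityStackOfStrings {
    private final String[] s;
    private int n = 0; // number of items on the stack

    public FixedCapacityStackOfStrings(int capacity) {
        s = new String[capacity]; // client must supply the capacity: the defect noted above
    }

    public boolean isEmpty() { return n == 0; }

    public void push(String item) { s[n++] = item; }

    public String pop() {
        String item = s[--n];
        s[n] = null; // avoid loitering: drop the outstanding reference
        return item;
    }
}
```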
3.2 Resizing Arrays
Our implementation above had a defect: we required clients to provide the maximum capacity of the stack. Our first attempt is as follows: on each push or pop, create a new array one element longer or shorter and copy the items over. However, this approach is computationally expensive, because each time we must copy the whole stack. Inserting N items can thus require time of order 1 + 2 + 3 + ... + N ≈ N²/2, which is quadratic; we want to avoid this. So we reduce the number of times we need to create a new array: each time we need to extend the stack, we create an array of twice the size and copy the items. Inserting N items then requires time N + (2 + 4 + 8 + ... + N) ≈ 3N.
Now, for pop we cannot mirror this by halving when the array is half full: in the worst case the client alternates pushes and pops right at that boundary, so we keep halving and doubling, and each operation copies the whole array, which is again quadratic in total. The solution is to halve the size of the array when it is one-quarter full. Starting from an empty stack, any sequence of N push and pop operations then takes time proportional to N in total, i.e. constant amortized time per operation.
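The doubling and quarter-full-halving policy can be sketched as (a sketch; the class name is my own):

```java
public class ResizingArrayStackOfStrings {
    private String[] s = new String[1];
    private int n = 0; // number of items on the stack

    public boolean isEmpty() { return n == 0; }

    public void push(String item) {
        if (n == s.length) resize(2 * s.length); // double when full
        s[n++] = item;
    }

    public String pop() {
        String item = s[--n];
        s[n] = null; // avoid loitering
        if (n > 0 && n == s.length / 4) resize(s.length / 2); // halve when one-quarter full
        return item;
    }

    private void resize(int capacity) {
        String[] copy = new String[capacity];
        for (int i = 0; i < n; i++) copy[i] = s[i];
        s = copy;
    }
}
```

Halving at one-quarter full (not one-half) is what prevents the push/pop thrashing described above: right after any resize, the array is half full, so the next resize in either direction is at least n/2 operations away.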
Resizing array versus linked list: the linked-list implementation gives constant worst-case time per operation, but uses extra time and space to deal with the links. The resizing-array implementation gives constant amortized time per operation (averaged over the whole sequence) and wastes less space. To be sure that every individual operation is fast, use the linked list; if we only care about the total, use the resizing array.