UNIT - 1: Disjoint SETS: Equivalence Relations
The disjoint-set ADT is an efficient data structure for solving the
equivalence problem. It has many applications: Kruskal's minimum spanning
tree algorithm, least common ancestor queries, compiling EQUIVALENCE
statements in Fortran, MATLAB's bwlabel() function in image processing,
and so on.
Equivalence relations
3. Two cities are related if they are in the same country. This is an
equivalence relation.
4. Suppose town a is related to b if it is possible to travel from a
to b by taking roads. This relation is an equivalence relation if
all the roads are two-way.
We can model the problem as follows: the input is initially a collection
of N sets, each with one element. This initial representation means that
all relations (except reflexive relations) are false. Each set has a
different element, so that S_i ∩ S_j = ∅ for i ≠ j; this makes the sets
disjoint. In addition, since we only care about where the elements are
located, not about their values, we can assume that all the elements have
been numbered sequentially from 1 to N. Thus, we have S_i = {i} for
i = 1 through N. Finally, we don't care what value the find operation
returns, as long as find(a) = find(b) exactly when a and b are in the
same set.
Quick-find
The first approach to solve the problem is called quick-find, which
ensures that the find instruction can be executed in constant worst-case
time. For the find operation to be fast, we could maintain, in an array, the
name of the equivalence class for each element. Then find is just a
simple O(1) lookup:
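A minimal Python sketch of quick-find is given here as an illustration, not as the original listing. The class name QuickFind, the array name ids, and the 0-based numbering of elements (0 to N − 1 instead of 1 to N) are assumptions made for the example.

```python
class QuickFind:
    """Quick-find: ids[i] holds the name of the set containing element i."""

    def __init__(self, n: int) -> None:
        # Initially every element is alone in a set named after itself.
        self.ids = list(range(n))

    def find(self, x: int) -> int:
        # A single array lookup, so O(1) in the worst case.
        return self.ids[x]

    def union(self, p: int, q: int) -> None:
        # Rename every member of p's set to the name of q's set.
        # This scans the whole array, so each union costs O(N).
        pid, qid = self.ids[p], self.ids[q]
        if pid == qid:
            return
        for i, name in enumerate(self.ids):
            if name == pid:
                self.ids[i] = qid
```

The price of the constant-time find is the linear-time union: a sequence of N − 1 unions can cost on the order of N² array updates.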
Quick-union
The second approach, called "quick-union", ensures instead that the union
instruction can be executed in constant worst-case time. One thing to note is
that find and union cannot both be done in constant worst-case time. Recall
that the problem doesn't require that a find operation return any specific
name, as long as find on the elements in the same connected component returns
the same value. Thus, we can use a tree to represent each component, because
every element in a tree has the same root. Thus, the root can be used to name
the set. The structure looks like below:
Since only the name of the parent is required, we can assume that this tree
is stored implicitly in an array: each entry id[i] in the array represents
the parent of element i. If i is the root, then id[i] = i. A find(X) on
element X is performed by returning the root of the tree containing X. The
time to perform this operation depends on the depth of the tree that
represents the set containing X, which is O(N) in the worst case because
of the possibility of creating a tree of depth N − 1. union(p, q) can be
done by changing the parent of the root of the tree containing p to the
root of the tree containing q:
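The parent-array scheme just described might be sketched in Python as follows. The array name id comes from the text; the class name QuickUnion and the 0-based element numbering are assumptions for the example. The link itself is a single assignment, but this sketch first calls find on both arguments to locate the roots.

```python
class QuickUnion:
    """Quick-union: id[i] is the parent of i; a root satisfies id[i] == i."""

    def __init__(self, n: int) -> None:
        # Every element starts as the root of its own one-node tree.
        self.id = list(range(n))

    def find(self, x: int) -> int:
        # Follow parent links until reaching a root; cost is the depth of x.
        while self.id[x] != x:
            x = self.id[x]
        return x

    def union(self, p: int, q: int) -> None:
        # Make the root of p's tree a child of the root of q's tree.
        root_p, root_q = self.find(p), self.find(q)
        if root_p != root_q:
            self.id[root_p] = root_q
```

Calling union(0, 1), union(1, 2), union(2, 3), and so on builds a single chain, which is the depth N − 1 worst case mentioned above.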
Improvements
There are two major improvements we can make to quick-union: smart union, which
works on the union operation, and path compression, which works on the find
operation. Their goal is to keep the tree of each set shallow, which reduces
the time we spend on find.
To bound the running time of find and union under union-by-size (always
attach the smaller tree to the root of the larger one), we need to bound
the depth of any node X, which is at most log N (base 2). The proof is
simple: the depth of X increases by one only when its tree is attached to
a tree that is at least as large, so the tree containing X at least
doubles in size (exactly doubling when two equal-size trees are joined).
Since a tree has at most N nodes, its size can double at most log N times.
Thus, the depth of any node is at most log N. With this claim, find runs
in O(log N), and union runs in O(log N) as well, because it performs two
finds followed by a constant-time link.
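To make the depth bound concrete, here is a hedged Python sketch of union-by-size. The names UnionBySize, parent, size, and the helper depth are hypothetical and exist only for this illustration; elements are numbered from 0.

```python
class UnionBySize:
    """Quick-union where the smaller tree is always linked under the larger."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.size = [1] * n  # size[r] is meaningful only when r is a root

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, p: int, q: int) -> None:
        root_p, root_q = self.find(p), self.find(q)
        if root_p == root_q:
            return
        # Ensure root_p names the larger (or equal) tree.
        if self.size[root_p] < self.size[root_q]:
            root_p, root_q = root_q, root_p
        self.parent[root_q] = root_p
        self.size[root_p] += self.size[root_q]

    def depth(self, x: int) -> int:
        # Helper for the illustration only: number of links from x to its root.
        d = 0
        while self.parent[x] != x:
            x = self.parent[x]
            d += 1
        return d
```

Merging 16 singletons pairwise, then the resulting pairs, and so on always joins equal-size trees, which is the worst case for depth; the deepest node then sits at depth 4 = log 16, matching the bound.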
Path compression
Path compression is performed during a find operation and is independent of the
strategy used to perform union. The effect of path compression is that every node
on the path from X to the root has its parent changed to the root. For example,
suppose we call find(9) for the following tree representation of our disjoint set:
Then the following picture shows the end state of our tree after calling
find(9). As you can see, on the path from 9 to 0 (root), we have 9, 6, 3, 1. All
of them have been directly connected to the root after the call is done:
This strategy may look familiar: we do the path compression in the hope
that faster future accesses to these nodes (i.e., 9, 6, 3, 1) will pay off
for the work we do now. This is the same idea as splaying in a splay
tree.
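The find(9) example can be sketched in Python as follows. Only the path 9, 6, 3, 1, 0 comes from the text; the parents chosen for the other nodes (2, 4, 5, 7, 8) are hypothetical, as are the function name find_compress and the two-pass implementation.

```python
def find_compress(parent: list[int], x: int) -> int:
    """Return the root of x and point every node on the path at the root."""
    # First pass: walk up to locate the root.
    root = x
    while parent[root] != root:
        root = parent[root]
    # Second pass: re-link each node on the path directly to the root.
    while parent[x] != root:
        next_node = parent[x]
        parent[x] = root
        x = next_node
    return root


# Hypothetical tree containing the path 9 -> 6 -> 3 -> 1 -> 0 (0 is the root).
parent = [0, 0, 0, 1, 1, 2, 3, 4, 5, 6]
root = find_compress(parent, 9)
```

After the call, nodes 9, 6, 3, and 1 all have 0 as their parent, while nodes that were not on the path keep their old parents.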
When unions are done arbitrarily, path compression is a good idea, because
there is an abundance of deep nodes and these are brought near the root
by path compression. Path compression is perfectly compatible with
union-by-size, and thus both routines can be implemented at the same time.
In fact, the combination of path compression and a smart union rule
guarantees a very efficient algorithm in all cases. Path compression is not
entirely compatible with union-by-height, because path compression can
change the heights of the trees. We don't want to recompute all the
heights, so the heights stored for each tree become estimated heights
(i.e., ranks). In theory, union-by-rank is as efficient as union-by-size.
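One way the two improvements might be combined is sketched below: union-by-rank together with path compression. The class name DisjointSet, the field names, and the boolean return value of union are assumptions for the example. Note that find never updates rank, which is exactly why the stored values are only estimates of the true heights.

```python
class DisjointSet:
    """Union-by-rank with path compression (illustrative sketch)."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.rank = [0] * n  # upper bound on the height of each root's tree

    def find(self, x: int) -> int:
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Compress: point every node on the path straight at the root.
        while self.parent[x] != root:
            next_node = self.parent[x]
            self.parent[x] = root
            x = next_node
        return root

    def union(self, p: int, q: int) -> bool:
        """Merge the sets of p and q; return False if already merged."""
        root_p, root_q = self.find(p), self.find(q)
        if root_p == root_q:
            return False
        # Link the root of lower rank under the root of higher rank.
        if self.rank[root_p] < self.rank[root_q]:
            root_p, root_q = root_q, root_p
        self.parent[root_q] = root_p
        # The rank grows only when two trees of equal rank are joined.
        if self.rank[root_p] == self.rank[root_q]:
            self.rank[root_p] += 1
        return True
```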
An analysis of smart union with path compression shows that any sequence
of M union-find operations on N objects makes O(N + M log* N) accesses.
The following table summarizes the running time for M union-find
operations on a set of N objects (don't forget we need to spend O(N) to
initialize the disjoint sets):
The running time of each operation for each algorithm is as follows:
Remarks
Essentially, the union-find structure addresses the "dynamic connectivity
problem":
For example, given two points p and q in a maze, we may ask "Is there a
path connecting p and q?" Objects can be:
Pixels in a digital photo.
Computers in a network.
Friends in a social network.
Transistors in a computer chip.
Elements in a mathematical set.
Variable names in a Fortran program.
Metallic sites in a composite system.
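As a small illustration of the connectivity question, the sketch below answers "is p connected to q?" for objects identified by name, such as computers in a network. The function names make_connectivity, connect, and connected and the host names are hypothetical. The sketch uses a plain parent map without the improvements described earlier, to keep it short.

```python
def make_connectivity(objects):
    """Return (connect, connected) closures over a fresh set of objects."""
    # Each object starts as the root of its own one-node tree.
    parent = {obj: obj for obj in objects}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    def connect(p, q):
        # Record a new two-way link between p and q.
        root_p, root_q = find(p), find(q)
        if root_p != root_q:
            parent[root_p] = root_q

    def connected(p, q):
        # p and q are connected exactly when they share a root.
        return find(p) == find(q)

    return connect, connected


connect, connected = make_connectivity(["alpha", "beta", "gamma", "delta"])
connect("alpha", "beta")
connect("beta", "gamma")
```

Here connected("alpha", "gamma") is true even though no direct link between the two was ever added, while connected("alpha", "delta") is false.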
1. log* N counts the number of times you have to take the log (base 2)
of N to get one. This is also called the iterated logarithm function.
For example, log* 65536 = 4 because log log log log 65536 = 1.
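The definition in the note can be checked with a few lines of Python. The function name log_star is an assumption, and base-2 logarithms are used to match the 65536 example.

```python
import math


def log_star(n: float) -> int:
    """Count how many times log2 must be applied to n to reach 1 or less."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count
```

For 65536 the successive values are 16, 4, 2, and 1, so the function returns 4. The function grows so slowly that it is at most 5 for any N that could arise in practice.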