An Overview of Distributed MST Algorithms
1 Introduction
In this paper, we discuss the problem of finding a Minimum Spanning Tree (MST) over a weighted
undirected graph. We also discuss and compare three classic distributed algorithms for the problem
in the sections that follow. Finding a distributed algorithm for a minimum-weight spanning tree is
a fundamental problem in the field of distributed network algorithms. Trees and MSTs are used in a
wide variety of algorithms in the distributed graph structure domain. Around 1983, when the
influential and classic approach was provided by Gallager et al. [7], MST algorithms were already
being used in broadcast algorithms for communication networks. With the help of a minimum-cost
tree, the cost of a broadcast can be reduced by a significant amount [2, 3]. Here, the edge weights
correspond to the cost of using a channel in a specific direction. In addition to the broadcast
application, there are many potential control problems for networks whose communication
complexities are reduced by having a known spanning tree. Spanning trees themselves are essential
components in classic distributed graph problems like Leader Election [10], network
synchronization [1], Breadth-First-Search [3] and Deadlock Resolution [4].
The weight of a spanning tree T is the sum of w(e) over all edges e in T, where w(e) is the weight
assigned to the edge e. The edges are assumed to have distinct weights, which makes the minimum
weight spanning tree unique. This property can easily be ensured by various workarounds, e.g.
appending node ids to the edge weights, which ensures a proper ordering in which ties are broken
by the node ids. In the late 1950s, long before the importance of the problem in the communication
networks domain was realized, the classic approaches to finding an MST were proposed by
Dijkstra [6], Prim [12] and Kruskal [9]. These approaches primarily relied on the following two
lemmas. Note that the proofs are informal and that they assume the edge costs to be distinct,
hence ensuring the presence of a unique MST.
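The tie-breaking workaround mentioned above can be sketched as follows. This is a minimal illustration, assuming node ids are comparable values; the helper name edge_key is hypothetical and not taken from any of the surveyed papers.

```python
def edge_key(u, v, weight):
    """Return a totally ordered key for the undirected edge {u, v}.

    Ties between equal weights are broken by the (sorted) endpoint ids,
    so two different edges never compare as equal.
    """
    lo, hi = (u, v) if u <= v else (v, u)
    return (weight, lo, hi)


# Three edges with the same raw weight still get distinct, ordered keys.
edges = [("a", "b", 4), ("b", "c", 4), ("a", "c", 4)]
ordered = sorted(edges, key=lambda e: edge_key(*e))
```

Because both endpoints compute the same key for a shared edge, they agree on the ordering without exchanging any extra messages.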
Lemma 1.1 (MST Cut Lemma) Let P and Q be two disjoint node sets whose union is the set of all
nodes of the graph. Then, among all the edges between P and Q, the edge with minimum cost must
be present in the MST.
Proof Let T be an MST that does not contain e, the minimum-cost edge between P and Q. Adding e
to T produces a cycle. Since e crosses between P and Q, this cycle must contain another edge e'
that also crosses between P and Q, and e' costs more than e. Replacing e' with e yields a
spanning tree that is cheaper than T, a contradiction.
Lemma 1.2 (MST Cycle Lemma) The costliest edge of any cycle in a connected undirected graph
cannot be in the MST.
Proof Let T be an MST that contains e, the costliest edge of some cycle C. Removing e from T
(T − e) gives two disjoint connected components. Traversing C, we find another edge e' that
connects these two components, and e' is cheaper than e. Replacing e with e' yields a spanning
tree that is cheaper than T, a contradiction.
Either of the two lemmas above can be used to construct the MST. Prim’s algorithm [12] is a nearly
direct implementation of the Cut Lemma (Lemma 1.1). These approaches have a time complexity
of O(E log V).
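To make the connection between the Cut Lemma and Prim’s algorithm concrete, the following is a minimal sequential sketch. It assumes the graph is given as a dictionary of adjacency dictionaries; the function name prim_mst and the example graph are illustrative only. At every step, P is the set of nodes already in the tree, Q is the rest, and the cheapest edge crossing from P to Q is added.

```python
import heapq


def prim_mst(graph, start):
    """graph: dict mapping node -> dict mapping neighbour -> weight (undirected)."""
    in_tree = {start}
    # Heap of candidate edges (weight, tree_node, outside_node) crossing the cut.
    frontier = [(w, start, v) for v, w in graph[start].items()]
    heapq.heapify(frontier)
    mst = []
    while frontier and len(in_tree) < len(graph):
        w, u, v = heapq.heappop(frontier)
        if v in in_tree:
            continue  # edge no longer crosses the cut
        in_tree.add(v)
        mst.append((u, v, w))
        for x, wx in graph[v].items():
            if x not in in_tree:
                heapq.heappush(frontier, (wx, v, x))
    return mst


graph = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 6},
    "C": {"A": 4, "B": 2, "D": 3},
    "D": {"B": 6, "C": 3},
}
tree = prim_mst(graph, "A")
```

With a binary heap, each edge is pushed and popped at most a constant number of times, which is where the O(E log V) bound comes from.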
However, from here on, we focus on the distributed approaches towards the problem in the static
asynchronous network domain [1]. In this construct, the graph is assumed to “be representative
of a point-to-point communication network, where the set of nodes V represents processors of the
network and the set of edges E represents bidirectional non-interfering communication channels
operating between neighbouring nodes” [3]. Communication takes place only through the channels,
and no memory is shared between the processes. However, a node does know the identity of its
neighbours. Also, the approaches are confined to event-driven algorithms; hence, there is no
global clock to help a process learn about events taking place in other processes.
For the distributed approaches, the complexity of an algorithm is usually measured in terms of
either communication or time. Communication complexity is the number of messages sent during
the algorithm. In some models, it is further assumed that a single message can contain at most
O(log V) bits [3]. The definition of time complexity varies with the usage. It is usually based on
either the number of rounds (relative to an algorithm) or time units (global clock).
propagating an Initiate message to each node in the fragment. Each node gathers the reports of all
its children and propagates the weight of the minimum outgoing edge, either reported by its
children or determined by itself. Upon termination, the algorithm has found the unique MST of the
distributed network and has propagated this information to all the nodes.
Even though all three algorithms follow the same general scheme, they differ in terms of problem
breakdown and the approaches adopted to solve the sub-problems. In a general sense, there
are two methods for computing the MST.
Both Gallager’s [7] and Awerbuch’s [1] algorithms use the first approach, whereas Garay’s [8]
algorithm combines the two approaches.
Awerbuch’s and Garay’s algorithms achieve optimal performance, both in terms of message passing
and time complexity. Both algorithms proceed by breaking down the problem into three smaller
sub-problems, which are solved in three different phases. In the different phases, the algorithms
establish a trade-off between message passing and time complexity.
Awerbuch’s algorithm starts with the Counting phase, in which it determines the number of nodes
in the network. The Counting phase is followed by a second phase in which the MST is developed by
following the GHS algorithm until the fragments become large, of size $O(V / \log V)$. At this
stage the algorithm switches to the third phase, which consists of two relatively complex
procedures, Root Update and Test Distance, for adding the remaining edges of the MST. We discuss
the implementation details and subtleties of the procedures later in the report.
Garay’s algorithm starts with the Controlled GHS phase, which is a modified and controlled
version of the GHS algorithm. The first phase produces a forest with a bounded number of
fragments, where each fragment’s diameter is also bounded by an upper limit. This is followed by
the second phase, which involves the elimination of short cycles in the fragment graph. The third
phase involves global edge elimination, in which the remaining fragment graph is reduced to the
MST.
1.3 Challenges
Before jumping to the individual algorithms, let us first review the typical issues faced in this
domain. In the distributed domain, no node has knowledge of the state of the whole graph,
which is required by both Prim’s and Kruskal’s algorithms. In the message-passing model, it is
also difficult to process a single node at a time. Hence, the algorithms need to take care of the
different orderings of events that can take place at different nodes. Some distributed algorithms
bear similarities to Borůvka’s algorithm [5] for the classical MST problem. For the distributed
approaches, there are issues to be tackled for each algorithm. One common issue stems from the
assumption of known distinct edge costs, which does not always hold for the channels. However,
as mentioned earlier, workarounds exist to ensure a proper ordering of the edges. Another common
issue is identifying each fragment, so as to distinguish internal edges from outgoing edges. This
issue is resolved by associating the identity of a fragment with its root node or edge.
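Since several distributed algorithms resemble Borůvka’s algorithm, a small sequential sketch of it may help fix ideas. It is only an illustration under simplifying assumptions: the whole graph is visible to one process, edge weights are distinct, and a fragment is identified by an arbitrary label rather than by a root node or core edge. The name boruvka_mst is hypothetical.

```python
def boruvka_mst(nodes, edges):
    """edges: list of (weight, u, v) tuples with distinct weights."""
    fragment = {n: n for n in nodes}  # node -> label of its fragment
    mst = set()
    while len(set(fragment.values())) > 1:
        best = {}  # fragment label -> cheapest outgoing edge of that fragment
        for w, u, v in edges:
            fu, fv = fragment[u], fragment[v]
            if fu == fv:
                continue  # internal edge, not outgoing
            for f in (fu, fv):
                if f not in best or (w, u, v) < best[f]:
                    best[f] = (w, u, v)
        if not best:
            break  # graph is disconnected
        for w, u, v in set(best.values()):
            fu, fv = fragment[u], fragment[v]
            if fu != fv:
                mst.add((w, u, v))
                # Merge: relabel every node of one fragment.
                for n in nodes:
                    if fragment[n] == fv:
                        fragment[n] = fu
    return mst


nodes = ["A", "B", "C", "D"]
edges = [(1, "A", "B"), (4, "A", "C"), (2, "B", "C"), (6, "B", "D"), (3, "C", "D")]
result = boruvka_mst(nodes, edges)
```

The two difficulties named above show up directly in the sketch: telling internal edges from outgoing ones requires a fragment identity, and the "all fragments pick their best edge, then merge" rounds rely on a global view that asynchronous nodes do not have.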
• Each node has the knowledge of all its neighbours, including the weights (or costs) of the
edges (or channels) connecting to them.
• Each node has its own copy of the full algorithm in the initial state.
• The channels are assumed to be asynchronous, bidirectional, FIFO, and without any errors.
2.1 Overview
The algorithm involves some complications due to the detailed logistics; we therefore first
give an overview of how it works and then build upon the specifics in the next section.
At any point in the algorithm, there is a set of sub-trees, which the authors call fragments. These
fragments merge with each other to ultimately form the MST. Each fragment is hence formed
of a set of nodes. Initially, a fragment is just a single node of the graph. Also, each node
has some identifying information (to be defined later) for the fragment it belongs to, and is
aware of the edge leading to the core (a special edge, to be defined later) of the fragment. Due
to the Cut Lemma (Lemma 1.1), we know that a fragment should merge only along the outgoing
edge that has the minimum edge cost. To reduce the complexity, the authors impose a further
constraint: a fragment merges only with a bigger fragment. Due to this asymmetry, this merging
is often called absorbing or hooking into the bigger fragment. Merging into bigger sets is a
classic optimization technique used in data structures like Union-Find. Hence, the
edge must also connect to a bigger fragment. Such an edge, if found, is called the best-edge
of the fragment. Therefore, each fragment gets merged along its best-edge. It follows
that finding the best-edge of a fragment forms a crucial module of the algorithm.
The core of a fragment, introduced in the previous paragraph, can be treated as the root of the
fragment, since each node maintains an inbound edge leading to the core. Also, the identification
(introduced in the previous paragraph) of a fragment is nothing but the weight of the core
edge. Since the weights are distinct, the identification is also unique. The definition of the core
edge relies on an attribute called the level, defined for each fragment. The level provides a lower
bound on the logarithm of the number of nodes in the fragment, as we shall see shortly. The
level of a fragment is initialized to 0 (when the fragment consists of a single node). Each
fragment gets absorbed into a fragment of higher or equal level. If the fragment gets absorbed
into a fragment of a higher level, it just becomes a part of the bigger fragment and assumes
the identity of the bigger fragment. However, if the levels of the two fragments are equal, a new
fragment is formed with the level incremented by 1. We can now define the core, whose weight is
the ID of the fragment. When a fragment has level 0 (single node), there is no core edge.
However, at each increment of the level (caused by the merging of fragments of equal levels),
the core is updated to the edge along which the merge took place (connecting the two fragments).
Note that if the levels of the two fragments are equal during the merging, the best-edge of both
fragments will change. It hence follows that a level L + 1 fragment always contains at least two
level L fragments, and hence each level L fragment contains at least $2^L$ nodes. Thus, the level
of a fragment provides a lower bound on the log of the number of nodes.
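The level bound can be written out as a short induction. The notation n(L), for the minimum number of nodes in any level-L fragment, is introduced here only for illustration and does not appear in the original paper.

```latex
\begin{align*}
n(0) &\ge 1 = 2^{0}
  && \text{(a level-0 fragment is a single node)} \\
n(L+1) &\ge 2\, n(L) \ge 2 \cdot 2^{L} = 2^{L+1}
  && \text{(a level-}(L+1)\text{ fragment contains two level-}L\text{ fragments)} \\
2^{L} \le n(L) \le V &\implies L \le \log_2 V
  && \text{(no fragment has more than } V \text{ nodes)}
\end{align*}
```

Absorbing a lower-level fragment only adds nodes without changing the level, so it cannot violate the bound.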
To give a one-line summary of the GHS algorithm: it maintains a set of mutually exclusive
fragments, each fragment has a core edge, at any point in time each fragment is waiting to get
merged along its best-edge (which may not be known at all times), and these fragments ultimately
merge to give a single fragment, the MST.
We now describe the search for the best-edge of a fragment. If the fragment
consists of a single node, the best-edge is simply the least-cost outgoing edge, which leads to a
fragment of higher or equal level. To find it, the node iterates over all of its adjacent edges,
chooses the edge with the minimum cost, and sends a Connect message to the adjacent node. It
subsequently goes into the state Found and waits for a response from the fragment at the other
end. Now, let us consider a fragment with more than one node (non-zero level). The type of
message used for this case is Initiate. The process is started whenever two fragments of level
L − 1 merge to form a bigger fragment with a new core. The two nodes forming the core edge
broadcast the Initiate message to the other nodes of the fragment. This broadcast is done along
the outward edges of the tree (the opposite of the inbound edges). The Initiate message also
carries the information < ID, Level >, where ID is the weight of the core. When a node receives
the Initiate message, it changes its state to Find. This essentially starts the process of finding
the minimum-weight outgoing edge for each node. Each node labels each of its adjacent edges with
one of the following three classes: Branch, Rejected and Basic. Branch means the edge is in the
fragment tree. Rejected means the edge has been discovered to lead to a node of the same
fragment. The Basic edges are the remaining edges, which are labelled neither Branch nor
Rejected. Now, to find its best-edge candidate, the node picks its minimum-weight Basic edge
and sends a Test message to the node on the other side of the edge. The Test message contains the
< ID, Level > information of the fragment. If the node on the other side has the same ID (same
fragment), it replies with a Reject message, and both nodes label the edge as Rejected. The
sender then tests its next minimum-weight Basic edge.
Note that if a node sends a Test message and receives a Test message with the same identity from
the other side as well, the node need not send a Reject message as a reply and can just label the
edge as Rejected. If the node on the other side has a different identity, it either responds
with an Accept message (if its level is greater than or equal to the level in the message), or
delays its response until its fragment reaches a level greater than or equal to the one received
in the message. While the response is delayed, the node that sent the message is blocked, and
hence the whole fragment is blocked. This essentially means that a fragment finishes finding its
best-edge if and only if none of the outgoing best-edge candidates of the fragment leads to a
fragment with a lower level.
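The three possible reactions to a Test message can be summarised in a few lines. This is a simplified sketch rather than the full protocol: it ignores message queues and edge labelling, and the function name respond_to_test is hypothetical.

```python
def respond_to_test(my_id, my_level, sender_id, sender_level):
    """Decide how a node reacts to Test(<sender_id, sender_level>)."""
    if my_id == sender_id:
        return "REJECT"  # same fragment: the edge is internal
    if my_level >= sender_level:
        return "ACCEPT"  # different fragment of sufficient level
    return "DELAY"       # wait until this node's fragment level catches up


# A level-2 node answering a level-1 tester from another fragment:
reply = respond_to_test(my_id=7.5, my_level=2, sender_id=3.1, sender_level=1)
```

The DELAY branch is what blocks the testing fragment: no answer is sent until the receiver’s level has grown to at least the sender’s level.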
When the individual nodes have found their respective minimum-weight edges (the candidates
for best-edge), the nodes need to cooperate to find the best-edge of the fragment. This is
achieved by propagating Report messages towards the core. A Report(W) message is sent on the
inbound edge, where W is the weight of the minimum-weight outgoing edge encountered so far. W is
∞ when no outgoing edge has been encountered yet. A node waits to receive Report messages
from all its branches except the inbound edge, and takes the minimum of the reported weights
and the weight of the minimum-weight outgoing edge found by the node itself. This
minimum is then propagated in a Report message on the node’s inbound
edge. When a node sends a Report message, it changes its state to Found. Furthermore, each node
saves the edge leading to the best-edge candidate so that the path can be traced back.
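The Report phase is essentially a convergecast of a minimum over the fragment tree. The sketch below models it with a recursive function instead of messages; the names report, children, local_best and best_edge are assumptions made for this illustration, and the tree is rooted at one endpoint of the core.

```python
import math


def report(node, children, local_best, best_edge):
    """Return the minimum outgoing-edge weight in the subtree rooted at node.

    children:   dict node -> list of child nodes (branches away from the core)
    local_best: dict node -> weight of the node's own accepted outgoing edge
    best_edge:  filled in with the branch each node should follow to reach the
                best-edge (None means the node's own outgoing edge, if any)
    """
    best_weight = local_best.get(node, math.inf)
    best_edge[node] = None
    for child in children.get(node, []):
        w = report(child, children, local_best, best_edge)
        if w < best_weight:
            best_weight = w
            best_edge[node] = child
    return best_weight


children = {"a": ["b", "c"], "b": ["d"]}
local_best = {"a": 9, "c": 7, "d": 5}  # node "b" has no outgoing edge
best_edge = {}
weight = report("a", children, local_best, best_edge)
```

Following best_edge from "a" leads through "b" to "d", which is the path a Change-core message would later take.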
Ultimately, when both nodes of the core edge have exchanged Report messages, these nodes act
to inform the node having the minimum-weight outgoing edge. Also, it is now certain that the
fragment must merge along that best-edge, hence the inbound edges need to be reordered towards
that node since the core edge will either be in the adjoining fragment or it will be the best-edge. To
do so, a Change-core message is propagated towards the node with the best-edge. For each edge
that is encountered along this path, the inbound edge is reversed to point towards the best-edge.
Finally, the node with the best-edge sends a Connect(L) message along the best-edge. L,
here, is the level of the fragment of the sending node. It may happen that the fragment on
the other side has the same level. In that case, the best-edge becomes the new core of a newly
formed fragment with level L + 1. To achieve this, Initiate messages are broadcast with the
new < ID, Level > information to both fragments. This achieves two purposes: sending the
information update to all the nodes, and initiating a new search, since the level has changed.
On the other hand, if the connecting fragment has level L' > L, the fragment needs to get absorbed
into the connecting fragment. To do so, the Initiate message is broadcast only to the joining
fragment (smaller level). This both updates the < ID, Level > information of
the joining fragment and initiates the search in the joining fragment, since its level is
now updated. Note that the nodes in the connecting fragment remain unchanged, since they were
already blocked in the search for the best-edge.
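The decision taken by a node that receives Connect(L) can be sketched as a pure function. This is an illustrative simplification that returns the action together with the < ID, Level > pair to broadcast; the function name handle_connect and the flag edge_is_branch (true when this node has already sent its own Connect on the same edge) are assumptions.

```python
def handle_connect(my_id, my_level, sender_level, edge_weight, edge_is_branch):
    """Return (action, new_id, new_level) for a received Connect(sender_level)."""
    if sender_level < my_level:
        # Absorb: the joining fragment adopts this fragment's identity.
        return ("ABSORB", my_id, my_level)
    if not edge_is_branch:
        # Equal levels, but this side has not chosen the same edge (yet).
        return ("DELAY", None, None)
    # Both fragments chose this edge: it becomes the core of a new fragment.
    return ("MERGE", edge_weight, my_level + 1)


merged = handle_connect(my_id=4.0, my_level=1, sender_level=1,
                        edge_weight=2.5, edge_is_branch=True)
```

In the MERGE case the new identity is the weight of the connecting edge, which matches the definition of the core given earlier.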
2.3 Example
In Figure 1, we have shown an example run of the algorithm on a graph with 5 nodes.
• n is used as the default notation for the node at which the procedure executes.
• NodeState(n): State of the node n, enumerator variable with possible values SLEEPING,
FIND and FOUND. Initialized with SLEEPING.
• EdgeState(e): State of the edge e, enumerator variable with possible values BRANCH,
REJECTED and BASIC. Initialized with BASIC.
• ID(n): Variable storing the identity of the fragment which contains node n.
• Level(n): Variable storing the level of the fragment which contains node n.
• BestEdge(n): Variable storing the edge leading to the best-edge of the fragment which
contains node n. Used for tracing back from the core.
• BestWeight(n): Variable storing the weight of the best-edge of the fragment which
contains node n.
• TestEdge(n): Variable storing the outgoing edge on which a Test message has been sent. Set
back to nil after the response is received.
• InboundEdge(n): Variable storing the inbound edge which leads to the core of the
fragment.
• FindCount(n): Variable storing the count of Initiate messages sent by the node n. The node
must receive Report messages from all of these before reporting on its inbound edge.
(a) Initial graph, all nodes are in sleeping state.
(b) Node B spontaneously wakes up, sends Connect message to A.
(c) Node A wakes up and connects with B. Initiate message with new < ID, Level > information is
sent to both. Node C also wakes up independently.
(d) A − B becomes the core. Nodes A and B send Test messages. Node E wakes up and merges with
Node C.
(e) C accepts the Test message since both have Level 1. D wakes up and sends the Connect message.
D rejects the Test message since level is lower (0 < 1).
(f) B responds with Initiate message to D. A and B report to each other and Change-Core message
is sent towards A.
(g) Change-Core reaches A (node with best-edge) and Initiate message with incremented ID is sent
to both fragments. D’s Test message will be rejected later.
(h) During the propagation of Initiate messages (not shown), inbound edge direction for E is
reversed. MST is formed since no outgoing edges.
Figure 1: Example run for GHS with a graph containing 5 nodes and 6 edges
The time complexity is shown to be O(V log V) time units. The proof rests on the
fact that it takes at most O(lV) time units until all nodes reach level l. This can be proved by
induction on the number of levels, since the propagation of cooperation signals within
a fragment requires O(V) time units. Since the level l is upper-bounded by log V, the total
number of time units is bounded by O(V log V).
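The arithmetic behind this bound can be spelled out as follows. The symbols T(l), for the time by which every node has reached level l, and c, for the constant hidden in the O(V) cost of one round of cooperation inside a fragment, are assumptions introduced for this sketch.

```latex
\begin{align*}
T(l) &\le T(l-1) + c\,V
  && \text{(one more round of intra-fragment cooperation)} \\
T(l) &\le c\,l\,V
  && \text{(by induction on } l\text{)} \\
l \le \log_2 V &\implies T(l) \le c\,V \log_2 V = O(V \log V)
\end{align*}
```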
3 Awerbuch (1987)
In this section, we describe the classical article by Awerbuch [4] published in 1987. We give a
brief overview of the algorithm followed by the algorithm specifics.
3.1 Overview
The main contribution of this work over past works is a linear-time algorithm for finding
the Minimum Spanning Tree in an asynchronous network, with the best previous one having
Θ(E + V log V) message complexity and taking Θ(V log V) time. The GHS algorithm explained
Algorithm 1 GHS Algorithm
procedure WAKE-UP    ▷ Called at spontaneous waking of a node
    Level(n) := 0
    FindCount(n) := 0
    NodeState(n) := FOUND
    Find the adjacent edge m such that w(m) is minimum
    EdgeState(m) := BRANCH
    Send Connect(0) on m
procedure RESPONSE-INITIATE(< ID, Level >, State) received through edge j
    Level(n) := Level
    ID(n) := ID
    NodeState(n) := State
    InboundEdge(n) := j    ▷ Initiate message comes from the path leading to core edge
    BestEdge(n) := nil    ▷ Initially, no best-edge, set only after a valid best-edge found
    BestWeight(n) := ∞
    Send Initiate(< ID, Level >, State) on all adjacent edges (except j) of n which have
        state set as BRANCH    ▷ Broadcast along the branch edges
    if State = FIND then
        FindCount(n) := number of adjacent edges (except j) of n which have state set as BRANCH
        Find the adjacent edge m such that w(m) is minimum and EdgeState(m) = BASIC
        if there is no such edge m then    ▷ No edges to send Test to, good to go
            TestEdge(n) := nil
            Execute REPORT
        else
            TestEdge(n) := m
            Send Test(< ID(n), Level(n) >) on m
procedure RESPONSE-TEST(< ID, Level >) received through edge j
    if NodeState(n) = SLEEPING then
        Execute WAKE-UP
    if Level(n) ≥ Level and ID(n) ≠ ID then
        Send Accept on j
    if ID(n) = ID then
        Send Reject on j
        EdgeState(j) := REJECTED
    if Level(n) < Level and ID(n) ≠ ID then
        Delay the response by placing the received message on end of queue
procedure REPORT    ▷ Checks if got response from all the branches as well as outgoing edges
    if FindCount(n) = 0 and TestEdge(n) = nil then
        NodeState(n) := FOUND    ▷ Now need to propagate the results back to core
        Send Report(BestWeight(n)) on InboundEdge(n)
procedure RESPONSE-REPORT(Weight) received through edge j
    if j ≠ InboundEdge(n) then
        FindCount(n) := FindCount(n) − 1
        if Weight < BestWeight(n) then
            BestWeight(n) := Weight
            BestEdge(n) := j
        Execute REPORT    ▷ Check if Report received from all branches
    else if NodeState(n) ≠ FIND and Weight > BestWeight(n) then
        Execute CHANGE-CORE    ▷ Received at core
    else if BestWeight(n) = ∞ then    ▷ No outgoing edges
        halt, MST found
procedure RESPONSE-ACCEPT received through edge j
    TestEdge(n) := nil    ▷ Good to go
    if Weight(j) < BestWeight(n) then    ▷ Check with branches who have already reported
        BestEdge(n) := j
        BestWeight(n) := Weight(j)
    Execute REPORT
procedure RESPONSE-REJECT received through edge j
    if EdgeState(j) = BASIC then
        EdgeState(j) := REJECTED
    Find the adjacent edge m such that w(m) is minimum and EdgeState(m) = BASIC
    if there is no such edge m then
        TestEdge(n) := nil
        Execute REPORT
    else
        TestEdge(n) := m
        Send Test(< ID(n), Level(n) >) on m
procedure CHANGE-CORE    ▷ Best-edge found, inform the node with best-edge
    if EdgeState(BestEdge(n)) = BRANCH then
        Send Change-Core on BestEdge(n)    ▷ Propagate
    else    ▷ Reached node with best-edge, send Connect
        Send Connect(Level(n)) on BestEdge(n)
        EdgeState(BestEdge(n)) := BRANCH
procedure RESPONSE-CHANGE-CORE received through edge j
    InboundEdge(n) := j
    Execute CHANGE-CORE    ▷ Recursive implementation
procedure RESPONSE-CONNECT(Level) received through edge j
    if NodeState(n) = SLEEPING then
        Execute WAKE-UP
    if Level(n) > Level then
        Send Initiate(< ID(n), Level(n) >, NodeState(n)) on j
        if NodeState(n) = FIND then
            FindCount(n) := FindCount(n) + 1
    else if EdgeState(j) = BASIC then
        Delay response by placing the message at the end of queue
    else    ▷ New core, note that Connect will be received by sending node as well
        Send Initiate(< Weight(j), Level(n) + 1 >, FIND) on j
in the previous section presented the fundamental ideas and concepts for doing so. The best
previous algorithms were given by Chin and Ting and by Gafni. Those algorithms are suboptimal in
time by a factor of Θ(log V). The reason is that small trees sometimes wait for bigger trees,
and these waiting relations between trees lead to a complex combinatorial structure.
The improvement in performance of the MST algorithm in this paper is primarily due to two new
innovations: Root Update and Test Distance. The algorithm consists of two stages: the Counting
stage, which computes the number of nodes in the network, and the Minimum Spanning Tree stage,
which uses this information to find the MST. Both are optimal in communication and time. The
Counting stage first finds some spanning tree and elects a leader in the network, which helps to
compute the number of nodes in the system.
In most distributed MST algorithms, a spanning forest of rooted trees is maintained, each tree
being a subtree of the MST. Initially, every tree consists of a single node. In the course of the
algorithm, each subtree tries to find its best edge (the minimum-weight edge among all edges
leading to other trees). The best edge is guaranteed to be in the MST, given that the weights are
unique. The tree then hooks itself onto the node on the other side of that edge, becoming a
sub-tree of the bigger tree. Hooking is a sequence of manipulations of father pointers. A core
edge (two trees hooking onto each other) creates a cycle of length two in the pointer graph; in
that case, the root with the bigger identity is unhooked and becomes the root of the combined
tree. A naive implementation of this algorithm requires $O(V^2)$ messages and time; in the worst
case, a tree of size V/2 is hooked onto other trees V/2 times, each hooking requiring linear work.
The classical idea to improve this is to use the Union-Find technique, under which the size of
the combined tree at least doubles each time the pointer of a node is changed. Since each node
then undergoes at most log_2 V pointer changes, we can achieve a communication complexity of
O(E + V log V) and a time complexity of O(V log V) if we ensure that the best edge of a tree
leads to a bigger or equal tree. To achieve this, the GHS paper introduces the technique of
levels. The reason that its time complexity, O(V log V), is not linear is that there might be a
chain of sub-trees of the same level (say l), each hooked onto the next one in the chain,
resulting in a tree of level l + 1 regardless of the length of the chain. A tree of level 1 with
V/2 nodes may thus be created, which may undergo log V − 1 changes in level, each needing Ω(V)
time. Chin, Ting and Gafni addressed this problem by updating the level to the logarithm of the
cardinality of the tree each time the computation of the best edge is performed. However, the
time complexity remained the same. The log V factor is due to the fact that the update of the
level of a long chain comes too late. Dropping the minimum-weight requirement can help to achieve
a linear time algorithm because then, instead of hooking itself onto its minimum-weight edge,
each tree hooks itself onto an edge leading to the neighbouring tree of maximum level. This is
the main idea behind the Counting stage of the algorithm.
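The doubling argument behind the log_2 V bound on pointer changes can be checked with a small simulation. This is a simplified model, not Awerbuch’s protocol: trees are merged in a random order, every node of the smaller tree counts one pointer change per merge, and the name simulate_hooking is hypothetical.

```python
import math
import random


def simulate_hooking(num_nodes, seed=0):
    """Hook the smaller tree onto the bigger one until one tree remains.

    Returns the number of pointer changes each node went through.
    """
    rng = random.Random(seed)
    members = {t: [t] for t in range(num_nodes)}  # tree label -> its nodes
    changes = [0] * num_nodes
    while len(members) > 1:
        a, b = rng.sample(sorted(members), 2)
        small, big = (a, b) if len(members[a]) <= len(members[b]) else (b, a)
        for n in members[small]:
            changes[n] += 1  # this node's tree at least doubles in size
        members[big].extend(members.pop(small))
    return changes


changes = simulate_hooking(64)
bound = math.log2(64)
```

Every time a node is charged, its tree at least doubles, so no entry of changes can exceed log_2 of the number of nodes, whatever merge order is drawn.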
3.2.1 MST stage
The MST stage is performed in two phases. The first phase runs an algorithm similar to the GHS
algorithm [7], the only difference being that it is terminated when all trees reach a size
of Ω(V / log V).
The second phase brings new algorithmic ideas: levels are updated aggressively and accurately, so
that small trees are prevented from waiting for big trees, which speeds up the
algorithm. The Counting stage is needed in order to know the number of nodes V. The details of
the algorithm are as follows:
1) Root initialisation: In the course of the algorithm (second phase, MST stage), trees coalesce
and hence new nodes become roots of the resulting trees. As soon as a node r becomes the root
of a tree T with level l, it broadcasts an initialisation message containing the parameters
(r, l) over T, which is further relayed to trees that hook themselves onto T.
Upon receiving the initialisation message, an internal node j stores those parameters in the
local variables Level_j and Root_j, and starts executing a local search procedure.
2) Local candidate selection: The local search procedure tries to find the minimum-weight edge
outgoing from node j to a node k in a separate tree such that k’s level is greater than l. To do
so, node j scans its incident edges one by one. For each edge, it passes a special test message
to the node k on the other side of the edge and waits for the reply from k. Three broad cases
arise:
a) Root_k = r: k is in T. The reply is negative.
b) Level_k < l: k delays its response to that message until Level_k reaches l. If this level
increase at k is due to the hooking of k’s tree onto T, then k will have Root_k = r. Hence, the
reply is negative.
c) Level_k > l: Edge (j, k) is declared to be a local candidate for the best edge of T.
3) Best edge selection: The names of the local candidates are collected at the root. The root
waits until all the nodes have received replies from all their neighbours and all the possible
candidates have reached it. If there is no local candidate, the algorithm terminates, since the
tree spans the network and hence is the MST.
Otherwise, the root selects the minimum edge (v, w), with v being the internal endpoint. The root
sends a special message (the pointer reversal message) to v, reversing all the father pointers
on the path from r to v, so that v becomes the new root. Two cases arise:
a) (v, w) is a core edge and v is its bigger endpoint: w hooks itself onto v, and v becomes the
root of the resulting tree. The level of v increases by 1, and step (1), Root initialisation, is
performed.
b) (v, w) is not a core edge, or v is not the bigger endpoint: v hooks itself onto w. T becomes a
subtree of the bigger tree.
If w has received an initialisation message, then v relays it over T, making T participate in the
best edge selection of the entire tree. Until best edge selection happens, the Test-Distance
procedure is iterated by v. This is where the aggressive update of levels takes place and where
the innovation of the paper lies. Upon each invocation of Test-Distance, node v sends an
exploration token to its father w. The token initially carries the counter value $2^{l(v)+1}$.
Upon arrival of the token at a node, the node subtracts its number of sons from the counter. If
the counter is positive and the receiving node is not a root node, then that node forwards the
token to its father. Thus, moving up, either the counter becomes ≤ 0 and the token dies, or a
positive counter reaches the root. If the token is alive when it reaches the root, an
acknowledgment is sent back from the root to v, upon which v sends a special message over T that
causes every node in T to increase its level by 1. The Test-Distance procedure is then invoked
again with the increased level, and this repeats as long as the token survives. Note that
Test-Distance keeps running until a new root of the tree is decided.
4) Root Update Procedure: This process is activated either when an initialisation message has
travelled a distance greater than $2^{m+1}$ or when some node has detected more than $2^{m+1}$
internal edges during local candidate selection, m being the level of the tree root. In either
case, the best edge selection and the Test-Distance procedure are interrupted, the level of the
root is increased by 1, and the Root initialisation process is invoked again.
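The two triggers of the Root Update procedure share the same threshold, which the following sketch makes explicit. The function name needs_root_update and its parameter names are hypothetical; the counters themselves would be maintained by the initialisation broadcast and by local candidate selection, respectively.

```python
def needs_root_update(root_level, hops_travelled=0, internal_edges_seen=0):
    """Return True if either Root Update condition holds for a root of level m."""
    threshold = 2 ** (root_level + 1)
    return hops_travelled > threshold or internal_edges_seen > threshold


# With a level-2 root the threshold is 2^(2+1) = 8.
far_broadcast = needs_root_update(root_level=2, hops_travelled=9)
small_tree = needs_root_update(root_level=2, hops_travelled=8, internal_edges_seen=3)
```

Both conditions are evidence that the tree is larger than its current level suggests, which is why the response in both cases is to raise the root’s level.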
• Procedure WAKE-UP can be triggered on a node at the start to initialise all the variables.
• n is used as the default notation for the node at which the procedure executes.
• BestWeight(n): Variable storing the weight of the best-edge of the fragment which
contains node n.
• local-cand-arr: Array storing all the local candidate edges with their weights.
(a) Broadcast: level l is reset or r is made root.
(b) Local candidate selection case 1: Root_j = Root_k, Level_j = Level_k
(d) Local candidate selection case 3: Level_k > Level_j, (j − k) stored as local candidate
(e) Best candidate selection step 1: All local candidates sent up to root. Minimum
edge v-w selected.
(g) Best candidate selection step 2 case 2: Level_v > Level_w, or Level_v = Level_w (core edge)
and ID_w > ID_v
(h) Test distance step 1: Sum of degrees of nodes on w-root path < $2^{Level(v)+1}$
(i) Test distance step 2: Broadcast to increase level of all nodes of subtree of v
(j) Root update
Algorithm 2 Phase 2, MST Stage (Awerbuch)

procedure WAKE-UP(id, V)    ▷ Called at each node at the start of the algorithm
    Level := 0
    Root := n
    count := dict
    local-cand-arr := []
    ID := id
    count(best-edge) := 0
    parent := id
    V := V

procedure ROOT-INITIALISATION    ▷ Called just after phase 1 of MST, or after the level of the
                                 ▷ root is reset or the root node is reset
    Broadcast Initiate(<ID, Level>, 0) over the tree

procedure RESPONSE-INITIALISATION(Initiate(<id, level>, val)) received through edge i
    if Initiate(<id, level>, val) received for the first time then
        Level := level
        Root := id
        val := val + 1
        if val > 2^(level+1) then
            Send Root-Update-Message to father up to the path of root
        Call procedure LOCAL-CANDIDATE-SELECTION
        Broadcast Initiate(<Root, Level>, val) over the tree

procedure LOCAL-CANDIDATE-SELECTION
    no-internal-edge := 0
    for each edge k incident to n do
        Send Test-Message(<Root, Level>) to k
        Receive Response-Test-Message(<Reply>) from k
        if Reply > 0 then
            add <<n, k>, weight(<n, k>)> to local-cand-arr
        else
            no-internal-edge := no-internal-edge + 1
            if no-internal-edge > 2^(Level+1) then
                Send Root-Update-Message to father up to the path of root
    end for
    Send Best-Edge<local-cand-arr> to parent

procedure BEST-EDGE-SELECTION(Best-Edge<local-cand-arr>) received from son i
    if Root != ID then
        Send Best-Edge<local-cand-arr> to parent, storing the path
    else
        if count of total number of best-edge arrays received = V − 1 then
            Select minimum edge v-w, v being the internal node
            Remove first node in path of v from its sons and set it as its parent
            Send Pointer-Reversal<v, w> to first node in path of v
        else
            count(best-edge) := count(best-edge) + 1

procedure RECEIVE-POINTER-REVERSAL(Pointer-Reversal<v, w>) received through edge i
    Add i to set of sons
    if ID != v then
        Set first node in path to v as parent
        Send Pointer-Reversal<v, w> to first node in path of v
    else
        Set father as empty
        Send Joining<Root, ID, Level> to w
        Receive <join> from w
        if join > 0 then
            Set w as father
            Call procedure TEST-DISTANCE(2^(Level+1), v)
        else
            Add w to its list of sons
            Level := Level + 1
            Call procedure ROOT-INITIALISATION

procedure RECEIVE-JOINING(Joining<Root, ID, Level>) received from v
    if (Level == this.Level and ID < this.ID) or (Level < this.Level) then
        Add father to list of sons
        Reverse nodes till the path of root so as to reset the father and son pointers
        Set v as father
        Send 1 to v
    else
        Add v to its list of sons
        Send −1 to v

procedure TEST-DISTANCE(val, v) called, or on receiving Test-Distance<val, v> through edge i
    if Root == ID and val > 0 then
        Send <Ack-Test> to v using the saved path
    else
        if val > 0 then
            temp := val − deg(n)    ▷ deg(n): number of sons of n
            Send Test-Distance<temp, v> to father

procedure TEST-DISTANCE-UPDATE(<Ack-Test>) received
    Level := Level + 1
    Broadcast an increase in level of 1 in the entire subtree
    Send Test-Distance<2^(Level+1), v> to father

procedure ROOT-UPDATE(Root-Update) received through son i
    if Root == ID then
        Level := Level + 1
        Call procedure ROOT-INITIALISATION
    else
        Send Root-Update to its parent
b) Level Update: Whenever the time spent by the Link-Search procedure is high, it is interrupted.
Whenever a node is detected such that the sum of its height and degree in the tree exceeds the value
$2^{k+1}$, where k is the level of the tree, the Link-Search procedure is interrupted.
The procedure succeeds only when the tree is not being absorbed by a bigger level tree and
aborts otherwise. The procedure operates similarly to “two-phase commit” protocols. It locks
the nodes which have not been captured by some other tree. The locking takes place in two
phases, each phase involving one broadcast and one convergecast.
In the first broadcast, nodes are informed that the locking mechanism has started. A node receiving
the first broadcast is locked if it has not been invaded by another tree. Once a node is
locked, all the incoming exploration messages are buffered and processed immediately after the node
is unlocked.
This is followed by a convergecast, where the leader finds out whether all locks have been
obtained. The locking succeeds if all the locks have been obtained. If successful, then the new level
is computed, which is the integer value of the logarithm of the number of nodes (cardinality)
of the tree.
The second broadcast informs all the nodes whether the locking was successful. If locking was
successful, then each node updates its level. In any case, the nodes become unlocked.
The second convergecast is needed for the purpose of synchronisation; that is to ensure all the
nodes have completed the procedure.
In case the Level-Update aborts, the leader becomes inactive with no additional procedure being
executed in its tree. This is because the leader can never again become the network leader as its tree
is absorbed by bigger level trees. Upon termination of Level-Update, either the level is increased
or the tree becomes inactive.
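The two broadcast/convergecast rounds of Level-Update can be illustrated with a sequential sketch. It is only a simplified model under stated assumptions: the class `TreeNode`, its fields, and the function `level_update` are hypothetical names, message passing is replaced by plain loops over a non-empty list of tree nodes, and the logarithm is assumed to be base 2, which the text does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TreeNode:
    level: int
    invaded: bool = False  # True if another tree has already captured this node
    locked: bool = False


def level_update(nodes: List[TreeNode]) -> bool:
    """Run one Level-Update over all nodes of a tree; return True on success."""
    # First broadcast: every node that has not been invaded becomes locked.
    for node in nodes:
        node.locked = not node.invaded
    # First convergecast: the leader learns whether all locks were obtained.
    success = all(node.locked for node in nodes)
    # New level: integer part of log2 of the tree size (base 2 is an assumption).
    new_level = len(nodes).bit_length() - 1
    # Second broadcast: announce the outcome; nodes unlock in either case.
    for node in nodes:
        if success:
            node.level = new_level
        node.locked = False
    # Second convergecast: synchronisation only, nothing to compute in this sketch.
    return success
```

For a tree of five nodes, none of them invaded, the call succeeds and every node ends at level 2; if any node is invaded, the call returns False and no level changes.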
Thus, two events may take place: either uninterrupted execution of Link-Search, or the tree is
invaded by another tree. In the latter case, the tree leader is killed.
If none of the tree nodes found a feasible link (in the former case), then the tree must cover the
entire network, and the algorithm terminates as the spanning tree has been found. The root is declared
as the leader. Its name is broadcast over the network and the total number of nodes is counted.
Otherwise, some feasible links have been found. Two cases arise:
1) If all feasible links lead to trees of the same level, then the feasible link with the minimum
weight is elected as the preferred link; the tree on the other side of the edge is called the
preferred tree.
2) If there exists a bigger level tree on the other side, the tree becomes inactive.
c) Marriage Procedure: If the tree is active at this point, that is, if all feasible links lead to
trees of the same level, then the Marriage Procedure merges the pairs of trees of the same level
having the same preferred link. In such a pair, the tree with the bigger identity conquers the tree
with the smaller identity.
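The choice of the preferred link and the marriage rule can be sketched as below. The names `Link`, `preferred_link` and `marriage` are hypothetical, the list of feasible links is assumed to be non-empty, and only the two cases described above are modelled (all links to same-level trees, or some link to a bigger level tree).

```python
from typing import NamedTuple, Optional, Sequence


class Link(NamedTuple):
    weight: int
    edge: frozenset     # the two endpoint node ids of the link
    other_level: int    # level of the tree on the far side of the link


def preferred_link(own_level: int, links: Sequence[Link]) -> Optional[Link]:
    """Return the preferred link, or None if the tree becomes inactive."""
    if any(link.other_level > own_level for link in links):
        return None                                 # a bigger level tree exists
    return min(links, key=lambda link: link.weight)  # minimum-weight feasible link


def marriage(id_a: int, pref_a: Optional[Link],
             id_b: int, pref_b: Optional[Link]) -> Optional[int]:
    """Return the identity of the conquering tree, or None if no merge happens."""
    if pref_a is None or pref_b is None:
        return None                 # an inactive tree does not marry
    if pref_a.edge != pref_b.edge:
        return None                 # the two trees prefer different links
    return max(id_a, id_b)          # the bigger identity conquers the smaller
```

Two trees with identities 10 and 20 that both prefer the edge between nodes 1 and 2 are merged under identity 20.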
Garay et al. [8] identify that the diameter of the graph is indeed one such inherent parameter in the
construction of an MST and present a distributed MST algorithm whose time complexity is sub-linear in
V and linear in Diam. The motivation of the work comes from the fact that there exist trivial
O(Diam) algorithms for various other important distributed network problems such as Leader
Election, Breadth First Search Tree Construction etc. Another motivation is that in most real
large area networks $Diam \ll V$, so any such improvement would hugely improve the
performance in real-world distributed systems.
For the algorithm to execute within the declared time and message complexity we need to
make a few assumptions. We follow all the assumptions made by the GHS algorithm that
we listed before. Besides them, we also make the following assumptions:
1. The size of the messages has an upper bound of O(log V).
2. A node may send at most one message on each edge at each time unit.
3. Edge weights are polynomial in V, so an edge weight can be sent in a single message.
1. In the first stage the basic GHS algorithm is executed until the stage where each fragment in
the network has found its minimum weight outgoing edge (an outgoing edge of a fragment F
being an edge with one endpoint in F and another at a node outside it). So at the end of this
stage, we get a forest structure of fragments, which is referred to as FF in the remainder of the
report.
2. Each of the fragments in the resulting forest is broken down into small O(1) trees and the merge
operation of the GHS algorithm is performed only on these small trees. The trees are broken
down by computing a dominating set M(T) on each tree T of the fragment forest FF, and
then the merge operation is carried out with each fragment F ∈ M picking one neighboring
fragment F' ∉ M and merging with it. The breaking down of the fragments in this stage
ensures that the diameter of each fragment remains small.
1. M dominates V(T)
2. $|M| \le \frac{|V(T)|}{2}$
The procedure is based on the following. For a vertex v in a tree T, let Child(v) denote the set
of v's children in T. We use a depth function L(v) on the nodes, defined as follows:
$$L(v) = \begin{cases} 0, & \text{if } v \text{ is a leaf}, \\ 1 + \min_{u \in Child(v)} L(u), & \text{otherwise.} \end{cases}$$
Also, we denote the set of tree nodes at depth i by $L_i$. Now we can proceed to give the algorithm
of the procedure:
Algorithm: Small-Dom-Set
1. Mark the nodes of T with depth numbers L(v) = 0, 1, 2.
3. Then, M := Q ∪ $L_1$;
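A sequential sketch of the procedure is given below. The step that defines Q is not stated in the listing above; the sketch assumes that Q is a maximal independent set among the nodes left unmarked by step 1 (those with L(v) > 2) and computes it greedily, whereas a distributed implementation would use a distributed MIS algorithm. The function names and the `children` dictionary representation of the rooted tree are hypothetical.

```python
from typing import Dict, Hashable, List, Set


def depth_numbers(children: Dict[Hashable, List[Hashable]], root: Hashable) -> Dict[Hashable, int]:
    """Compute L(v): 0 for leaves, otherwise 1 + min over the children."""
    depth: Dict[Hashable, int] = {}

    def visit(v: Hashable) -> None:
        kids = children.get(v, [])
        for u in kids:
            visit(u)
        depth[v] = 0 if not kids else 1 + min(depth[u] for u in kids)

    visit(root)
    return depth


def tree_neighbours(children: Dict[Hashable, List[Hashable]]) -> Dict[Hashable, Set[Hashable]]:
    """Undirected adjacency of the tree described by the children map."""
    nbrs: Dict[Hashable, Set[Hashable]] = {}
    for v, kids in children.items():
        for u in kids:
            nbrs.setdefault(v, set()).add(u)
            nbrs.setdefault(u, set()).add(v)
    return nbrs


def small_dom_set(children: Dict[Hashable, List[Hashable]], root: Hashable) -> Set[Hashable]:
    depth = depth_numbers(children, root)          # step 1: depth numbers
    nbrs = tree_neighbours(children)
    unmarked = [v for v in depth if depth[v] > 2]  # nodes not marked 0, 1 or 2
    q: Set[Hashable] = set()
    for v in sorted(unmarked):                     # assumed step: greedy MIS on unmarked nodes
        if not (nbrs.get(v, set()) & q):
            q.add(v)
    level_one = {v for v in depth if depth[v] == 1}
    return q | level_one                           # step 3: M := Q ∪ L_1


def dominates(children: Dict[Hashable, List[Hashable]], m: Set[Hashable]) -> bool:
    """Check that every tree node is in m or adjacent to a node of m."""
    nbrs = tree_neighbours(children)
    return all(v in m or bool(nbrs[v] & m) for v in nbrs)
```

On the path 0 → 1 → 2 → 3 → 4 rooted at 0, the depth numbers are 4, 3, 2, 1, 0, the unmarked nodes are 0 and 1, and the sketch returns M = {0, 3}, which dominates all five nodes.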
Output of Controlled-GHS: For the computation of the dominating sets we use a distributed
implementation of Procedure Small-Dom-Set. The algorithm employs the distributed Maximal
Independent Set (MIS) algorithm of Panconesi and Srinivasan [?] for calculating the MIS.
It is important to note that the first phase of the algorithm achieves the following.
Lemma 4.1 In each phase of Controlled-GHS
1. the number of fragments at least halves.
Lemma 4.2 Also, when algorithm Controlled-GHS is activated for I phases, it takes
$O(3^I \cdot 2^{\sqrt{\log V}})$ time, and yields up to $N = \frac{V}{2^I}$ fragments, of diameter at most $d = 3^I$.
The above results can be easily proved from the basic properties of the procedures of fragment
breakdown and small dominating set construction used in Controlled-GHS.
Let FG denote the fragment graph that is the outcome of Phase I. The vertices of this graph are
the fragments constructed in Phase I, and its edges are all the inter-fragment edges. On
observation, we find that cycles and multiple edges (from different nodes belonging to the same
fragment) connecting any two fragments might exist in this graph. The algorithm uses a complex
procedure to identify and eliminate these cycles.
Cycle Elimination procedure: For eliminating cycles the procedure depends on the following
lemma:
Lemma 4.3 Given a weighted graph G = (V, E), if e is a bottleneck edge of G then $e \notin MST(G)$.
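The lemma can be checked on a small example. The sketch below assumes that a bottleneck edge means the heaviest edge on some cycle of G (the term is not defined in this section) and that edge weights are distinct; `kruskal` is a standard sequential Kruskal routine used only for verification, and is not part of the distributed algorithm.

```python
from typing import List, Tuple

Edge = Tuple[int, int, int]  # (weight, u, v)


def kruskal(n: int, edges: List[Edge]) -> List[Edge]:
    """Sequential Kruskal MST on nodes 0..n-1, used here only as a reference."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    mst: List[Edge] = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:                        # edge joins two components
            parent[ru] = rv
            mst.append((w, u, v))
    return mst


# A triangle 0-1-2 plus a pendant edge 2-3.
edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]
cycle = [(1, 0, 1), (2, 1, 2), (3, 0, 2)]
heaviest = max(cycle)                       # assumed "bottleneck" edge of the cycle
pruned = [e for e in edges if e != heaviest]

mst_full = kruskal(4, edges)
mst_pruned = kruskal(4, pruned)
```

Here the heaviest cycle edge (3, 0, 2) is absent from the MST, and removing it beforehand leaves the MST unchanged, which is what allows such edges to be eliminated early.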
One of the nodes is distinguished as the fragment’s center r(F), which is also considered the root
of the fragment. The procedure eliminates all short cycles of length at most l and also concentrates
in r(F), via T(F), all the information pertaining to every other fragment up to distance l from F.
The procedure starts by eliminating all cycles of length 2 and then goes on to eliminate all
cycles of length at most l. We first consider the procedure of eliminating all cycles of length 2,
as extending the procedure to cycles of longer length is then much easier. The nodes of
the fragment collect information on the edges connecting F to the adjacent fragments and send it
upwards on the tree T(F) to the center r(F). In order to execute the procedure, each fragment
node v creates the record Path(F') containing edge information, for each F' ∈ FG adjacent to F.
It is important to note that out of all records of fragments adjacent to the nodes in its subtree, it
sends exactly one record concerning each such fragment. It is easy to verify the following basic
properties of the above pipelining policy.
Lemma 4.4 Each node v ∈ F sends to its parent exactly one record $Path_l(F')$ for each fragment
F' that is adjacent to nodes in v's subtree in T(F); these records are sent up in increasing order
of fragment id.
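The per-node behaviour stated in Lemma 4.4 can be sketched for the length-2 case as follows. The record layout `(fragment_id, weight, edge)` and the function name `records_to_send` are hypothetical, and the rule for which record survives (the lightest edge towards each adjacent fragment, in the spirit of Lemma 4.3) is an assumption, since the text only says that exactly one record per fragment is sent.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, int, Tuple[str, str]]  # (adjacent fragment id, weight, edge)


def records_to_send(own: List[Record], from_children: List[List[Record]]) -> List[Record]:
    """Merge a node's own records with those of its children.

    Keeps one record per adjacent fragment (assumed: the lightest edge)
    and returns them in increasing order of fragment id.
    """
    best: Dict[int, Tuple[int, Tuple[str, str]]] = {}
    incoming = own + [rec for child in from_children for rec in child]
    for frag, weight, edge in incoming:
        if frag not in best or weight < best[frag][0]:
            best[frag] = (weight, edge)
    return [(frag, w, e) for frag, (w, e) in sorted(best.items())]


own = [(7, 12, ("a", "x"))]
children = [[(3, 5, ("b", "y")), (7, 9, ("b", "z"))], [(3, 8, ("c", "y"))]]
upward = records_to_send(own, children)
```

In this example the node forwards two records, one for fragment 3 (weight 5) and one for fragment 7 (weight 9), in that order.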
For eliminating the remaining small cycles the algorithm basically repeats the procedure
described in the previous section for l phases.
4.2 Example
In Figure 4 we show a snapshot of the system running Controlled-GHS phase of the algorithm on
a distributed network. In Figure 5, we show the edge elimination procedure of the second stage of
the algorithm.
(a) Stage I: A particular fragment tree formed after merging of smaller fragments.
(b) Label each node on the fragment tree with their respective levels.
(c) Find dominating set by taking the union of the MIS and first level nodes in the fragment tree.
(d) Breaking down the fragment tree on the basis of its dominating set.
Figure 4: Example run for Stage I-Controlled GHS with a graph containing 15 nodes and 22 edges
Figure 5: Maximum weight edge elimination in the Fragment Graph. All small cycles are detected
and edges eliminated.
Part I: $3^I \cdot O(2^{\sqrt{\log V}})$
Part II: $3^I + \frac{V}{2^I} \cdot \log V^2$
Part III: $Diam(G) + \frac{V}{2^I}$
To optimize the running time we choose I such that $3^I = \frac{V}{2^I}$, i.e. $I = \frac{\ln V}{\ln 6}$.
For this value of I we get a bound of $O(Diam(G) + V^{0.614})$ on the time complexity.
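The balancing step can be checked numerically. Setting $3^I = V / 2^I$ gives $6^I = V$, so $I = \ln V / \ln 6$ and $3^I = V^{\ln 3 / \ln 6}$. The short sketch below evaluates these quantities; the function names are hypothetical and logarithmic factors are ignored.

```python
import math


def balanced_phase_count(v: int) -> float:
    """Number of phases I with 3^I = V / 2^I, i.e. I = ln V / ln 6."""
    return math.log(v) / math.log(6)


def balanced_exponent() -> float:
    """Exponent e such that 3^I = V^e at the balance point: ln 3 / ln 6."""
    return math.log(3) / math.log(6)


phases = balanced_phase_count(6 ** 4)   # V = 1296
diameter_term = 3 ** phases             # 3^I
fragment_term = 6 ** 4 / 2 ** phases    # V / 2^I
```

For V = 1296 the two terms meet at I = 4, and the exponent ln 3 / ln 6 lies just below the 0.614 quoted in the bound.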
5 Conclusion
In this paper, we discussed three distributed algorithms which solve the problem of finding the
Minimum Spanning Tree for a connected asynchronous network. In a classic work, Angluin [?]
showed that there exists no deterministic distributed algorithm to solve the MST problem with a
bounded number of messages if the distributed network graph has neither distinct edge weights nor
distinct node identifiers. Therefore, we assume that each edge is associated with a distinct weight
known to adjacent nodes. Even though having distinct edge weights is not an essential requirement, we
assume this as it guarantees a unique MST in the network. All the algorithms also operate under
the condition that the size of messages is upper bounded by O(log V). With these assumptions,
the algorithms attempt to optimally solve the problem of finding an MST on a distributed network. We
observe that all three algorithms use ideas from the GHS algorithm, and additionally involve
other complex procedures and subtleties, to achieve optimal performance in terms of message
and time complexity. The classical algorithm by Gallager et al. has an optimal message
complexity of O(E + V log V) but a suboptimal running time complexity of
O(V log V). The algorithm by Awerbuch [3] achieved the optimal running time and communication
complexity by breaking down the problem into three parts and solving the sub-problems in three
phases. The different phases represent a trade-off between the demands of the initial part of the
problem (involving large numbers of small fragments, where bounding the number of messages is
most important) and the last part (involving a small number of large fragments, where we need to
bound the running time). However, Garay et al. identified the diameter of the graph as an inherent
parameter in the construction of the MST and presented an algorithm whose time complexity is
sub-linear in V and linear in Diam. The motivation for the algorithm comes from the fact that
there are several O(Diam) algorithms for various other important network problems [11].
References
[1] Baruch Awerbuch. Complexity of network synchronization. Journal of the ACM (JACM),
32(4):804–823, 1985.
[2] Baruch Awerbuch. Reliable broadcast protocols in unreliable networks. Networks,
16(4):381–396, 1986.
[3] Baruch Awerbuch. Optimal distributed algorithms for minimum weight spanning tree, counting,
leader election, and related problems. In Proceedings of the Nineteenth Annual ACM
Symposium on Theory of Computing, pages 230–240. ACM, 1987.
[4] Baruch Awerbuch and Silvio Micali. Dynamic deadlock resolution protocols. In 27th Annual
Symposium on Foundations of Computer Science, pages 196–207. IEEE, 1986.
[5] Cüneyt F. Bazlamaçcı and Khalil S. Hindi. Minimum-weight spanning tree algorithms: a survey
and empirical study. Computers & Operations Research, 28(8):767–785, 2001.
[6] Edsger W. Dijkstra. A note on two problems in connexion with graphs. Numerische
Mathematik, 1(1):269–271, 1959.
[7] Robert G. Gallager, Pierre A. Humblet, and Philip M. Spira. A distributed algorithm for
minimum-weight spanning trees. ACM Transactions on Programming Languages and
Systems (TOPLAS), 5(1):66–77, 1983.
[8] Juan A. Garay, Shay Kutten, and David Peleg. A sublinear time distributed algorithm for
minimum-weight spanning trees. SIAM Journal on Computing, 27(1):302–316, 1998.
[9] Joseph B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman
problem. Proceedings of the American Mathematical Society, 7(1):48–50, 1956.
[10] Navneet Malpani, Jennifer L. Welch, and Nitin Vaidya. Leader election algorithms for mobile
ad hoc networks. In Proceedings of the 4th International Workshop on Discrete Algorithms
and Methods for Mobile Computing and Communications, pages 96–103. ACM, 2000.
[11] David Peleg. Time-optimal leader election in general networks. Journal of Parallel and
Distributed Computing, 8(1):96–99, 1990.
[12] Robert Clay Prim. Shortest connection networks and some generalizations. Bell System
Technical Journal, 36(6):1389–1401, 1957.