Implementation of Load Balancing Policies in Distributed Systems
by
Jean Ghanem
THESIS
Master of Science
Electrical Engineering
June, 2004
© 2004, Jean Ghanem
Acknowledgments
I would like to dedicate my thesis to my parents who always stood beside me in
everything I did. Thank you mother for always being so loving and caring. Thank
you father for making me always strive for better. Thank you Sabine and Samer for
being the best friends whom I can rely on.
I would also like to dedicate my thesis to the Hamadé family who were always
there in my good times and bad times. Thank you for always believing in me and
for your continuous support.
I would also like to dedicate my thesis¹ to my advisor and mentor Professor
Chaouki Abdallah who was the source of my motivation and inspiration, through
his continuous guidance, encouragement and patience. Thank you Professor for
everything, I will be forever grateful.
I would like to thank Professor Majeed Hayat for his expertise in the field of load
balancing, for all the valuable discussions that we had and for his continuous support.
I would also like to thank my committee member Professor Gregory Heileman for
his helpful comments.
I would like to express my sincere gratitude to Mr. Henry Jerez for his help in
this thesis work and for sharing his great knowledge in the field of networking and
distributed systems. I would also like to thank my colleague Mr. Sagar Dhakal for
his great help in this thesis work. It was a pleasure working with you.
Last but not least, I would like to dedicate this work and extend my warmest
gratitude to my beautiful fiancée Nayla. Thank you Nayla for correcting and enhanc-
ing my thesis. Thank you for being the person that gave me the strength to survive
throughout all my bad moments. Thank you for being the trusting and confiding
person I relied on during our stay in Albuquerque. Thank you my love for being the
sincere and wonderful person to whom I will devote my entire life.
¹ This work was supported by the National Science Foundation under Information Technology Research (ITR) grants No. ANI-0312611 and ANI-0312182.
Implementation of Load Balancing
Policies in Distributed Systems
by
Jean Ghanem
Abstract
Load balancing is the allocation of the workload among a set of co-operating computational elements (CEs). In large-scale distributed computing systems, in which the CEs are physically or virtually distant from each other, there are communication-related delays that can significantly alter the expected performance of load-balancing policies that do not account for such delays. This is a particularly prominent problem in systems whose individual units are connected by means of a shared communication medium such as the Internet, ad-hoc networks, or wireless LANs. Moreover, the system performance may vary greatly because the system incorporates heterogeneous nodes that are not necessarily dedicated to the application at hand. In such cases, an actual implementation becomes necessary to understand the load-balancing strategies and how they react when employed in different environments, since mathematical models may not always capture the unpredictable behavior of such systems.
We then experimentally investigate network delays, which are the main factor in degrading the performance of the load distribution strategies. Subsequently, we test the different policies on our test-bed and use the results to develop an improved policy that adapts to system parameters such as transfer delays, connectivity, and CE computational power.
Chapter 1
Introduction
The demand for high-performance computing continues to increase every day. The computational need in areas like cosmology, molecular biology, nanomaterials, etc., cannot be met even by the fastest computers available [6, 29]. With the availability of high-speed networks, however, a large number of geographically distributed computational elements (CEs) can be interconnected and effectively utilized in order to achieve performance not ordinarily attainable on a single CE. The distributed nature of this type of computing environment calls for consideration of heterogeneities in computational and communication resources. A common architecture is the cluster of otherwise independent CEs communicating through a shared network. An incoming workload has to be allocated efficiently to these CEs so that no single CE is overburdened while other CEs remain idle. Further, migrating tasks from high-traffic to low-traffic areas of a network may alleviate, to some extent, the network congestion problem.
Workstation clusters are being recognized as the most promising computing resource of the near future. A large cluster consisting of locally connected workstations has power comparable to a supercomputer, at a fraction of the cost. Furthermore, a wide-area coupling of workstation clusters is not only suitable for the exchange of mail and news or the establishment of distributed information systems, but can also be exploited as a large metacomputer [9]. In theory, a metacomputer is an easy-to-use assembly of distinct computers or processors working together to tackle a single task or a set of problems. Distributing the total computational load across the available processors is referred to in the literature as load-balancing.
Another issue related to load-balancing is that a computing job may not be arbitrarily divisible, leading to certain constraints in dividing tasks. Each job consists of several smaller tasks, and each of those tasks can have a different execution time. Also, the load on each processor, as well as on the network, can vary from time to time based on the workload brought about by the users. The processors may differ from each other in architecture, operating system, CPU speed, memory size, and available disk space. The load-balancing problem also needs to consider fault-tolerance and fault-recovery. With all these factors taken into account, load-balancing can be generalized into four basic steps: (1) monitoring processor load and state; (2) exchanging load and state information between processors; (3) calculating the new work distribution; and (4) actual data movement. Numerous load-balancing strategies are available, but they can all be implemented on the same test-bed since they share the same basic steps described above.
The main goal of this thesis is to experimentally investigate the behavior of distributed load-balancing policies in a real environment. Analytical and queueing models may not always take into account all the parameters and inputs of an actual system that has unpredictable behavior. Therefore, it is crucial to experiment with the policies under actual conditions in order to check the system's response and come up with heuristic improvements. Hence, we propose a software implementation of a load-balancing system where we examine how its three components, application, load distribution, and network communication, should interact to provide high throughput to the application at hand regardless of the policy adopted. Moreover, to better understand the reactions of the policies in large networks, we conduct an experimental
analysis of the network delays and categorize them according to their characteristics.
After investigating the basic policies and the effect of delays on the stability of the
systems they act upon, we propose adaptive load-balancing policies that account for
several system parameters including CE computational power and interconnection
delays.
This thesis may also prove useful in other fields such as networked control systems (NCS) and teleautonomy. In an NCS, the sensor and the controller are connected over a shared network and, therefore, there is a delay in closing the feedback loop. A special application of teleautonomy [37, 38] is that of robots distributed geographically and working autonomously while being monitored by a distant controller. Clearly, load distribution may be needed across the robots, where communication delays may degrade the performance of such systems.
Chapter 2 reviews the load-balancing schemes available in the literature and the queueing models on which the
policies adopted in this thesis are based. Chapter 3 presents the internal software
architecture of the proposed test-bed system followed by a description of the policies
that were integrated in it. Chapter 4 presents delay probing experiments performed on the Internet, classified according to their variability. We then conduct delay experiments on the wireless network, whose results are integrated into the Monte-Carlo load-balancing simulator of the stochastic queueing model presented in Section 2.3.2. In Chapter 5, experimental results obtained with the implemented policies over two different test-beds, the Internet and a wireless network, are presented. The effect of delays and of the variation in the CEs' performance is examined to see how they influence the system's ability to reach a load-balanced state. Finally, based on previous
observations regarding the behavior of network delays and the performance of the
policies in distributed systems, we propose in Chapter 6 a dynamic and adaptive
load-balancing policy that accounts for such parameters. Chapter 7 presents our
conclusions and suggestions for future research.
Chapter 2
Load-Balancing Taxonomy and Previous Work
In this section, the different categories of load-balancing policies, which can be found in [12, 20], are presented. A detailed overview of the different taxonomies can be found in [12]. Figure 2.1 shows the organization of the different load-balancing schemes.
Figure 2.1: Organization of the different load-balancing schemes: static vs. dynamic; one-time initiation vs. dynamic reassignment; centralized vs. distributed; local vs. global; sender- vs. receiver-initiated; adaptive vs. non-adaptive; cooperative vs. non-cooperative.
Static load distribution, also known as deterministic scheduling, assigns a given job to a fixed processor or node. Every time the system is restarted, the same task-processor binding (allocation of a task to the same processor) is used without considering changes that may occur during the system's lifetime. Moreover, static load
distribution may also characterize the strategy used at runtime, in the sense that it
may not result in the same task-processor assignment, but assigns the newly arrived
jobs in a sequential or fixed fashion. For example, using a simple static strategy, jobs
can be assigned to nodes in a round-robin fashion so that each processor executes
approximately the same number of tasks.
Dynamic load-balancing takes into account that the system parameters may not
be known beforehand and therefore using a fixed or static scheme will eventually
produce poor results. A dynamic strategy is usually executed several times and may
reassign a previously scheduled job to a new node based on the current dynamics of
the system environment.
This division usually falls under the dynamic load-balancing scheme where a natural
question arises about where the decision is made. Centralized policies store global
information at a central location and use this information to make scheduling de-
cisions using the computing and storage resources of one or more processors. This
scheme is best suited for systems where an individual processor’s state information
can be easily collected by a central station at little cost, and new jobs arriving at this
centralized location are then redirected to subsequent nodes. The main drawback of
this scheme is that it has a single point of failure.
Another scheme that fits between the two types above is the hierarchical one
where selected nodes are responsible for providing task scheduling to a group of
processors. The nodes are arranged in a tree and the selected nodes are roots of the
subtree domains. An example of this scheme is described in Section 2.5.
Local and global load-balancing fall under the distributed scheme, since a centralized scheme always acts globally. In local load-balancing scheduling, each processor polls the other processors in its neighborhood and uses this local information to decide upon a load transfer. This local neighborhood is usually denoted as the processor's domain.
Within the realm of distributed dynamic global scheduling, two mechanisms can be
distinguished involving the level of cooperation between the different parts of the
system. In the non-cooperative or autonomous scheme, each node has autonomy
over its own resource scheduling. That is, decisions are made independently of the
rest of the system and therefore the node may migrate or allocate tasks based on
local performance. On the other hand, in cooperative scheduling, processes work
together toward a common system-wide global balance. Scheduling decisions are
made after considering their effects on some global effective measures (for example,
global completion time).
Adaptive and non-adaptive schemes are part of the dynamic load-balancing policies. In an adaptive scheme, scheduling decisions take into consideration past and current system performance and are affected by previous decisions or changes in the environment. If one or more parameters do not correlate with the program performance, they are weighted less the next time. In a non-adaptive scheme, the parameters used in scheduling remain the same regardless of the system's past behavior. An example would be a policy that always weighs its inputs the same regardless of the history of the system behavior.
Techniques of scheduling tasks in distributed systems have been divided mainly into
sender-initiated, receiver-initiated, and symmetrically-initiated schemes. In sender-initiated algorithms, the overloaded nodes transfer one or more of their tasks to underloaded nodes. In receiver-initiated schemes, underloaded nodes request tasks from nodes with higher loads. In the symmetric approach, both the underloaded and the overloaded nodes may initiate load transfers.
In this section, several load-balancing policies introduced in earlier works are de-
scribed.
The shortest expected delay (SED) [7, 45] and adaptive separable (AS) [45] policies are based on the multiple queue multiple server model shown in Figure 2.2. Both policies can be either centralized, where new tasks arrive at a central server and are then assigned to subsequent nodes, or distributed, where each available node can insert new jobs into the system. In either case, the algorithm is triggered whenever a new job arrives at node p. A cost function is then evaluated for each node, and the job is sent to the node that produces the minimum cost. The cost SED(i) is the expected time to complete the new job at node i and is given by

\[
\mathrm{SED}(i) = \frac{n_i + 1}{\mu_i}, \qquad (2.1)
\]

where n_i and \mu_i are, respectively, the load and the service rate of node i. The information exchange and the balancing process can either be carried out globally or be restricted to local domains.
The adaptive separable policy is an improvement over the SED policy in the sense that it estimates the completion time of a new arrival at a node by adjusting the service rate according to the node's utilization, or idle-time fraction, u_i. The new cost becomes

\[
\mathrm{AS}(i) = \frac{n_i + 1}{\mu_i u_i}.
\]
Figure 2.2: The multiple queue multiple server model. λ is the job arrival rate and
µi is the service rate of node i
The never queue (NQ) policy [39] is motivated by the fact that, in heterogeneous systems, fast servers may take over from the slow servers in executing most of the jobs, which can leave idle nodes in the system. This case occurs mostly when applying the SED policy in a highly loaded environment and thus yields suboptimal results.
The NQ policy first assigns the newly arriving job to an idle node. If more than one idle node is available, the new job is sent to the fastest of them, i.e., the node with the largest service rate \mu_i. If all nodes are busy, the SED policy is used.
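As an illustration of how these cost-based rules translate into code, the sketch below selects a destination node using the NQ rule with an SED fallback. It is a minimal example assuming a simple array of per-node states (load, service rate, idle flag); it is not taken from the test-bed implementation described later in this thesis.

/* Minimal sketch of cost-based node selection (SED with the NQ rule).
 * The node_state layout is assumed for illustration only. */
#include <stddef.h>

struct node_state {
    int    load;   /* n_i: number of queued tasks */
    double mu;     /* mu_i: service rate of node i */
    int    idle;   /* nonzero if node i is currently idle */
};

/* SED(i) = (n_i + 1) / mu_i: expected time to complete a new job at node i. */
static double sed_cost(const struct node_state *s)
{
    return (s->load + 1) / s->mu;
}

/* NQ rule: prefer the fastest idle node; if none is idle, fall back to SED. */
int select_node(const struct node_state *nodes, size_t n)
{
    int best = -1;
    double best_mu = 0.0, best_cost = 0.0;

    for (size_t i = 0; i < n; i++)        /* pass 1: fastest idle node */
        if (nodes[i].idle && nodes[i].mu > best_mu) {
            best_mu = nodes[i].mu;
            best = (int)i;
        }
    if (best >= 0)
        return best;

    for (size_t i = 0; i < n; i++) {      /* pass 2: minimum SED cost */
        double c = sed_cost(&nodes[i]);
        if (best < 0 || c < best_cost) {
            best_cost = c;
            best = (int)i;
        }
    }
    return best;
}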
The aim of the maximum throughput policy developed by Chow and Kohler in 1979 [15] is to maximize the throughput of the system during the next job arrival. The throughput function TP is given by

\[
\mathrm{TP}(n_1, n_2, \cdots, n_m) = \lambda \sum_{i=1}^{m}\left[\,1 - \sum_{k=1}^{n_i - 1}\left(\frac{n_i}{k}\right)\left(\frac{\mu_i}{\lambda + \mu_i}\right)^{k} - n_i \ln\!\left(\frac{\lambda}{\lambda + \mu_i}\right)\right], \qquad (2.2)
\]
where λ is the arrival rate and m is the number of nodes in the system. TP is a reward function calculated for each possible assignment of the newly arriving job, and the node that maximizes this function is chosen. This function is complex to evaluate and renders the load-balancing algorithm inefficient.
Nelson and Towsley in 1985 derived another reward function, which is implemented in the greedy throughput (GT) policy. The GT reward function stated in [40] is
easier to evaluate than TP and is given as,
\[
\mathrm{GT}(i) = \left(\frac{\mu_i}{\mu_i + \lambda}\right)^{n_i + 1} \qquad (2.3)
\]
Both TP and GT policies depend on the inter-arrival rate λ which may not be
available in a real system implementation.
In the gradient model policy [30], the underloaded nodes notify the other nodes
about their state, and overloaded nodes respond by transmitting jobs to the nearest
lightly loaded node. Therefore, loads migrate in the system in the direction of the
underloaded nodes guided by the proximity gradient. A global balance state is
achieved computationally by successive localized balances.
At every step of the algorithm, each node compares its load to a Low-Water Mark (LWM) and a High-Water Mark (HWM) threshold. The node is set to the underloaded state if its load is less than LWM and to the overloaded state if its load is greater than HWM. Underloaded nodes set their proximity to zero, and all other nodes p set their proximity according to

\[
\mathrm{proximity}(p) = 1 + \min_{i}\,\mathrm{proximity}(n_i), \qquad (2.4)
\]

where the n_i denote the neighboring nodes of node p. A node's proximity is thus the shortest distance from itself to the nearest lightly loaded node in the system.
Subsequently, all overloaded nodes send a fraction δ of their loads in the direction of
the lowest proximity. The algorithm is illustrated in Figure 2.3.
Note that no measure of the degree of imbalance is found using this algorithm,
but only that one exists. When an imbalance occurs, the number of excess tasks can
only be known to be greater than HWM-LWM. Hence, the HWM, LWM, and the
fraction δ parameters have a critical impact on the stability and performance of the
algorithm and should therefore be wisely chosen.
The gradient model policy cannot be used directly in the distributed systems considered here, since the nodes are not connected in a fixed topology such as a mesh or hypercube, which renders the proximity concept useless. Moreover, the proximity computation is a cascading function and therefore requires a considerable amount of time to be evaluated in large-scale networks where delays are prominent.
However, a modification to the algorithm may be suitable for P2P networks such
as Freenet where nodes are only aware of their immediate neighbors. Consequently,
the proximity concept becomes valid and the algorithm may become useful.
Sender initiated diffusion (SID) [46, 36] and receiver initiated diffusion (RID) [46, 41] are local strategies based on the near-neighbor diffusion concept. Each node exchanges information within its own domain, composed of the node and its neighboring nodes. Global balancing is achieved by the fact that the domains overlap.

For the SID policy, the balancing process is triggered whenever a node p receives from a neighboring node i a load update l_i less than a preset threshold L_low (l_i < L_low). Node p then calculates the domain load average L_p,

\[
L_p = \frac{1}{K+1}\left(l_p + \sum_{k=1}^{K} l_k\right), \qquad (2.5)
\]

where K is the number of neighboring nodes. The load-balancing algorithm continues if the local excess load (l_p − L_p) is greater than a preset threshold L_threshold. A load δ_k is then transferred from node p to each neighbor k in proportion to that neighbor's deviation from the domain average, calculated using

\[
h_k =
\begin{cases}
L_p - l_k & \text{if } l_k < L_p,\\
0 & \text{otherwise},
\end{cases}
\qquad
\delta_k = (l_p - L_p)\,\frac{h_k}{\sum_{k=1}^{K} h_k}. \qquad (2.6)
\]
The RID strategy can be thought of as the converse of the SID strategy, in that it is a receiver-initiated approach as opposed to a sender-initiated one [46]. However, to avoid instability due to delays and aging of the load-exchange information, the overloaded nodes transmit at most half of their current load. SID and RID are illustrated in Figure 2.4.
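For illustration, the per-neighbor transfer computation of Equations (2.5) and (2.6) can be sketched as follows. The array layout and function name are assumptions made for the example; this is not code from the test-bed described in Chapter 3.

/* Minimal sketch of the SID transfer computation (Eqs. 2.5-2.6).
 * l[0] holds the local load l_p; l[1..K] hold the neighbors' loads.
 * delta[k] receives the load to send to neighbor k (k = 1..K). */
void sid_transfers(const double *l, int K, double L_threshold, double *delta)
{
    double Lp = l[0];
    for (int k = 1; k <= K; k++)
        Lp += l[k];
    Lp /= (K + 1);                        /* domain load average, Eq. (2.5) */

    double excess = l[0] - Lp;            /* local excess load l_p - L_p */
    double h_sum = 0.0;
    for (int k = 1; k <= K; k++) {
        double h = (l[k] < Lp) ? (Lp - l[k]) : 0.0;   /* deviation h_k */
        delta[k] = h;
        h_sum += h;
    }

    for (int k = 1; k <= K; k++)          /* proportional split, Eq. (2.6) */
        delta[k] = (excess > L_threshold && h_sum > 0.0)
                       ? excess * delta[k] / h_sum
                       : 0.0;
}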
Figure 2.4: Sender Initiated Diffusion (SID) and Receiver Initiated Diffusion (RID) examples.
The algorithms implemented in this thesis are based on the SID scheme without
restricting the balancing process to a local domain, but rather expanding it to the
global system.
The Hierarchical Balancing Method (HBM) strategy [46] arranges the nodes in a
hierarchy, thereby creating balancing domains at each level. For a binary tree orga-
nization, all nodes are included at the leaf level (level 0). Half the nodes at level 0
become subtree roots at level 1. Subsequently, half the nodes again become subtree
roots at the next level and so forth until one node becomes the root of the whole
tree.
Global balancing is achieved by ascending the tree and balancing the load between
adjacent domains at each level in the hierarchy. If at any level, the imbalance between
the left and right subtrees exceeds a certain threshold, each node in the overloaded
subtree sends a portion of its load to the corresponding node in the underloaded
subtree.
The advantage of the HBM scheme is that it minimizes the communication over-
head and therefore can be scaled to large systems. Moreover, the policy matches
hypercube topologies well. In fact, the dimensional exchange approach [17] designed
for hypercube systems is similar to the HBM method in the sense that it proceeds
by load-balancing per domain basis. Here, each domain is defined as one dimension
in the hypercube. The hierarchical organization of an eight-processor hypercube is
shown in Figure 2.5.
This scheme is clearly not suitable for systems with large network delays for
the following reasons. As the balancing process proceeds on to the next level in
the tree, critical changes occurring at lower levels may not propagate quickly due
to delays. Therefore, corrections may not reach higher domains in time and may
thereby result in an imbalance at the global level. Moreover, although the scheme is decentralized, a failure at a root node, especially at a high level of the tree, renders a global balance state unattainable.
Internet-scale distributed systems since nodes may become unreachable at any time,
and will therefore affect the balance state of the system if such nodes happen to be
roots for subtree domains.
The SED, NQ, TP, GT, and AS policies were compared by Banawan and Zeidat in [8].
Several simulations were performed on different types of systems. These systems vary
by their node service rates µi , system utilization, and network delays. The results
indicate that in most cases, the NQ policy performed best.
For the GT policy, an empirical method for calculating the λ rate was proposed
where it was assumed that the delay in transferring a task is less than the inter-
arrival time of new jobs. Simulations were conducted over eight heterogeneous nodes
positioned according to a fixed topology. The results show that the NQ policy
outperformed the other policies under most operating conditions.
On the other hand, Willebeek-LeMair and Reeves in [46] simulated the GM, RID, SID, HBM, and Dimension Exchange Method (DEM) policies on a 32-processor, 5-dimensional hypercube Intel iPSC/2 machine. Their results show that low task granularity gave poor results, due to a reduced ability to transfer loads optimally, and that high task granularity also gave poor results, due to the increased overhead of moving tasks. The DEM and HBM policies gave the best results, as expected. However, the authors concluded by recommending the RID scheme, which surprisingly gave good results for a broader range of (non-hypercube) systems.
In this section we describe two queueing models for local, sender-initiated, load-
balancing algorithms that were developed at the University of New Mexico and the
University of Tennessee. These models were initially tested in simulations, and the
system developed in this thesis has been used to validate both models in a real
environment under different policies.
Both models focus upon the effects of delays in the exchange of information
among the computational elements (CEs), and the constraints these effects impose
on the design of a load-balancing strategy.
The authors consider a computing network consisting of n nodes all of which can
communicate with each other. Initially, the nodes are assigned an equal number
of tasks. However, when a node executes a particular task it can generate more
tasks so that the overall load distribution becomes non-uniform. To balance the
loads, each computer in the network sends its queue size qj (t) at time t to all other
computers in the network. A node i receives this information from node j delayed by
a finite amount of time τij , that is, it receives qj (t − τij ). Each node i then uses this
information to compute its local estimate of the average number of tasks per node
in the network using the simple estimator \(\bigl(\sum_{j=1}^{n} q_j(t - \tau_{ij})\bigr)/n\) (with \(\tau_{ii} = 0\)), which is based on the most recent observations. Node i then compares its queue size \(q_i(t)\) with its estimate of the network average, \(q_i(t) - \bigl(\sum_{j=1}^{n} q_j(t - \tau_{ij})\bigr)/n\), and, if this
is greater than zero, the node sends some of its tasks to the other nodes while if it is
less than zero, no tasks are sent. Furthermore, the tasks sent by node i are received
by node j with a delay hij . The authors present a mathematical model of a given
computing node for load-balancing, which is given by:
\[
\frac{dx_i(t)}{dt} = \lambda_i - \mu_i + u_i(t) - \sum_{j=1}^{n} p_{ij}\,\frac{t_{p_i}}{t_{p_j}}\,u_j(t - h_{ij})
\]
\[
y_i(t) = x_i(t) - \frac{\sum_{j=1}^{n} x_j(t - \tau_{ij})}{n} \qquad (2.7)
\]
\[
u_i(t) = -K_i\,\mathrm{sat}\bigl(y_i(t)\bigr)
\]
\[
p_{ij} \ge 0, \qquad p_{jj} = 0, \qquad \sum_{i=1}^{n} p_{ij} = 1,
\]
where
\[
\mathrm{sat}(y) = \begin{cases} y & \text{if } y \ge 0,\\ 0 & \text{if } y < 0. \end{cases}
\]
In this model:
• xi (t) is the expected waiting time experienced by a task inserted into the queue
of the ith node and ui (t) is the rate of removal (transfer) of the tasks as deter-
mined by the balancing algorithm.
The local information of the waiting times x_i(t), i = 1, ..., n, is used to set the values of the p_{ij} such that node j sends tasks to node i in proportion to the amount by which node i is below the local average as seen by node j. Several methods can be used to choose the p_{ij}'s according to predefined policies. These policies will be discussed in the next chapter.
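For illustration, the feedback law of Equation (2.7) amounts to the following local computation at node i; the variable names are assumed for the sketch, and this is not the authors' simulation code.

/* Minimal sketch of the control law of Eq. (2.7) at node i.
 * x_delayed[j] holds x_j(t - tau_ij) as last received from node j
 * (x_delayed[i] is the local, undelayed value x_i(t)). */
double control_rate(const double *x_delayed, int n, int i, double K_i)
{
    double avg = 0.0;
    for (int j = 0; j < n; j++)
        avg += x_delayed[j];
    avg /= n;                                    /* delayed network average */

    double y_i = x_delayed[i] - avg;             /* local imbalance estimate */
    double u_i = (y_i > 0.0) ? -K_i * y_i : 0.0; /* u_i = -K_i * sat(y_i) */
    return u_i;       /* non-positive: |u_i| is the local task-removal rate */
}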
In this section, a stochastic time delay queueing model in differential form is de-
scribed [24, 20, 19]. The motivation behind this model is the stochastic nature of the distributed computing problem, which includes: 1) randomness and the possibly burst-like
nature of the arrival of new job requests at each node from external sources (i.e.,
from users); 2) Randomness of the load-transfer process itself, since the communica-
tion delays in large networks are random; and 3) Randomness in the task completion
process at each node. Based on these facts, the following dynamics of the ith queue
in differential form is given by
\[
Q_i(t+\Delta t) = Q_i(t) - C_i(t, t+\Delta t) - \sum_{j \ne i} L_{ji}(t) + \sum_{j \ne i} L_{ij}\bigl(t - \tau_{ij}(t)\bigr) + J_i(t, t+\Delta t), \qquad (2.8)
\]
where
• Ci (t, t + ∆t) is a Poisson process with rate µi describing the random number
of tasks completed in the interval [t, t + ∆t]
• Ji (t, t+∆t) is the random number of new (from external sources) tasks arriving
in the same interval, as discussed above
• τij (t) is the delay in transferring the load arriving to node i in the interval
[t, t + ∆t] from node j, and finally
• Lij (t) is the load transferred from node j to node i at the time t.
For any k ≠ l, the random load L_{kl} diverted from node l to node k is governed by the load-balancing policy at hand. In general,

\[
L_{kl}(t) = K_k\, p_{kl} \cdot \Bigl( Q_l(t) - n^{-1}\sum_{j=1}^{n} Q_j\bigl(t - \eta_{lj}(t)\bigr) \Bigr) \cdot u\!\Bigl( Q_l(t) - n^{-1}\sum_{j=1}^{n} Q_j\bigl(t - \eta_{lj}(t)\bigr) \Bigr),
\]
where u(·) is the unit step function, ηlj (t) is the state exchange delay between the jth
and lth nodes at time t and Kk is the gain parameter at the kth (load distributing)
node. The fractions pij will be discussed in the next chapter as part of the load-
balancing strategy.
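A single Monte-Carlo step of the queue dynamics in Equation (2.8) could be coded roughly as below. The Poisson modelling of the external arrivals and the way the policy-decided transfers are passed in are assumptions for the sketch; this is not the simulator used in the thesis.

#include <stdlib.h>
#include <math.h>

/* Knuth's method for drawing a Poisson-distributed count with mean lambda. */
static long poisson(double lambda)
{
    double L = exp(-lambda), p = 1.0;
    long k = 0;
    do { k++; p *= (double)rand() / RAND_MAX; } while (p > L);
    return k - 1;
}

/* One update of Eq. (2.8) for queue i over [t, t + dt].
 * sent[j]:    L_ji(t), load node i sends to node j in this step.
 * arrived[j]: L_ij(t - tau_ij), load from node j whose random transfer
 *             delay expires in this interval.
 * Both arrays are filled in by the policy and the delay model, which are
 * outside this sketch. */
long queue_step(long Q_i, double mu_i, double lambda_i, double dt,
                const long *sent, const long *arrived, int i, int n)
{
    long q = Q_i;

    q -= poisson(mu_i * dt);      /* C_i(t, t+dt): tasks completed */
    q += poisson(lambda_i * dt);  /* J_i(t, t+dt): external arrivals (Poisson here for simplicity) */
    for (int j = 0; j < n; j++) {
        if (j == i) continue;
        q -= sent[j];             /* outgoing transfers decided at t */
        q += arrived[j];          /* incoming transfers arriving now */
    }
    return q > 0 ? q : 0;         /* the queue length cannot go negative */
}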
2.4 Summary
In this chapter, a number of load-balancing policies and their taxonomies were de-
scribed. Moreover, the queueing models describing the behavior of the system were
presented. In the next chapter, the test-bed software that implements several load-
balancing policies based on these models is introduced, followed by a description of
the different strategies that were actually adopted and experimented.
Chapter 3
Implementation Architecture
A distributed system has been developed to validate the deterministic and stochastic models described in Sections 2.3.1 and 2.3.2 and to assess the performance of different load-balancing policies in a real environment. The system consists of duplicates of the same software running on each node. The load-balancing decision, consisting of when to balance and how many tasks to transmit, is made locally at each node. The decision is therefore distributed, as opposed to centralized, in which case a master node would be responsible for making the decision. The load-balancing process running on each node bases its decision on local information and on shared data that are exchanged between the nodes. The initial configuration of each node is set through three configuration files, which will be discussed in Section 3.7.
3.1 Platforms
The load-balancing software was built in ANSI C over UNIX-based systems, namely,
Sun Solaris and Linux. Sun machines were used to run experiments over the LAN
network in the ECE department whereas the Planet-Lab system was used to run
experiments over the Internet. The Planet-Lab [2] operating system is based on the Red Hat Linux operating system. On the other hand, in order to run experiments over the wireless test-bed, the code was ported to the “Cygwin” environment that runs over Microsoft Windows. Cygwin [1] is a Linux-like environment for Windows
that acts as a Linux emulation layer, providing substantial Linux API functionality.
3.2 Macro-Architecture
The general architecture of the system consists of three layers as shown in Figure 3.1.
Each layer is implemented as a module in order to facilitate its own modification or
replacement without affecting the other layers. More importantly, this architecture
allows the testing and implementation of different load-balancing policies by simply
changing a few lines of code, without interfering with the rest of the system layers. The modules communicate with one another through well-defined interfaces. In what follows, a detailed description of the system architecture, summarized in Figure 3.7, is provided.
Figure 3.1: The three-layer architecture of the system: (1) communication, (2) load-balancing process, and (3) application.
Two main data structures are used in the program. The first one is a simple linked list that contains state information about the rest of the nodes and is also used as a communication tool between the load-balancing module and the task transmission module. This list, illustrated in Figure 3.2, is created when the program is launched, and no subsequent alteration is made to it except for updating the information stored inside it. The state information is mostly kept up-to-date by the “state reception” module, and it contains information regarding each node's current queue size, computational power, etc. These parameters are stored in the info structure.
The way these parameters are calculated and used is policy-dependent and will therefore be discussed in subsequent sections.
Figure 3.2: List data structure containing the other nodes' state information.
The second data structure used in the program is the task queue, which has a
linked-queue structure as illustrated in Figure 3.3. Newly arriving tasks from either
an external source or from within the network (sent by other nodes) are added to the
rear of the queue, whereas the application at hand pops one task at a time from the
front of the queue and executes it. Moreover, the load-balancing layer may decide at
any time to transfer several tasks to other nodes and therefore extract from the front
of the queue the desired number of tasks that it wishes to transmit. In all cases,
for any operation applied on the task queue, the variable Current Queue Size that
reflects the number of tasks present in the queue, is atomically updated accordingly.
Figure 3.3: The linked-queue structure of the task queue: tasks are popped from the front (for execution or transfer) and inserted at the rear.
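As a minimal illustration of this linked-queue layout (type and function names are assumed, not the test-bed source), the queue and its atomically maintained size counter might look like the following; the application-side pop additionally skips tasks marked inactive.

#include <pthread.h>

struct task {
    int          id;
    int          active;    /* 0 while the task is staged for transfer */
    void        *payload;   /* e.g., one matrix row */
    struct task *next;
};

struct task_queue {
    struct task    *front, *rear;
    long            current_queue_size;
    pthread_mutex_t lock;
};

/* Insert a newly arrived task at the rear of the queue. */
void queue_insert(struct task_queue *q, struct task *t)
{
    pthread_mutex_lock(&q->lock);
    t->next = NULL;
    if (q->rear) q->rear->next = t; else q->front = t;
    q->rear = t;
    q->current_queue_size++;          /* keep the size counter consistent */
    pthread_mutex_unlock(&q->lock);
}

/* Pop one task from the front of the queue (NULL if the queue is empty). */
struct task *queue_pop(struct task_queue *q)
{
    pthread_mutex_lock(&q->lock);
    struct task *t = q->front;
    if (t) {
        q->front = t->next;
        if (!q->front) q->rear = NULL;
        q->current_queue_size--;
    }
    pthread_mutex_unlock(&q->lock);
    return t;
}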
The communication layer is divided into four separate threads: the “state trans-
mission,” the “state reception,” the “tasks reception,” and the “tasks transmission”
threads. The “state transmission” thread is responsible for transmitting the state
of the local node to all other nodes that are part of the system. The state informa-
tion includes the information listed in the previous section which incorporates the
current queue size, the node computational power, and other local information that
may be relevant to the load-balancing policy in use. The size of the state frame ranges between 20 and 34 bytes, depending on the policy at hand. The node state is transmitted at a fixed interval specified in the node initialization file. As for the transport protocol, one has the option of either TCP or UDP by setting the appropriate parameter in the initialization file. Using UDP is recommended since it involves less transmission overhead; if a state frame is dropped by the network, the information is simply refreshed at the next scheduled state exchange. The “state transmission” can also be triggered by the load-balancing thread if the policy in use so decides.
The “state reception” thread is the complement of the “state transmission” thread and has the architecture of a concurrent single-threaded server. The single-threaded architecture was chosen because our aim is to provide maximum performance to the application layer running in the same process, which would be slowed down if more threads were created. The “state reception” thread listens on a well-defined TCP or UDP port. Upon reception of any state information, the thread updates the corresponding node information, available in the local node-list, whenever the time-stamp of the received frame is greater than the time-stamp of the stored information. This ensures that two state frames do not overwrite each other if they are received out of order, a scenario that may happen frequently in packet-switched networks.
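The update-if-newer rule can be captured in a few lines. The sketch below builds on the info record sketched earlier and assumes a state frame carrying the sender's time-stamp; it is an illustration, not the test-bed code.

/* Apply a received state frame only if it is newer than what is stored;
 * frames arriving out of order (older time-stamp) are silently ignored. */
struct state_frame {
    long   timestamp;      /* sender's clock when the frame was built */
    long   queue_size;
    double comp_power;
};

void update_node_state(struct info *entry, const struct state_frame *f)
{
    if (f->timestamp <= entry->timestamp)
        return;                        /* stale or duplicate frame: drop it */
    entry->queue_size = f->queue_size;
    entry->comp_power = f->comp_power;
    entry->timestamp  = f->timestamp;
}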
The “tasks transmission” module, responsible for transmitting jobs to other nodes, runs in the same thread as the load-balancing module described later. The main reason behind this design is that another instance or cycle of the load-balancing policy cannot be initiated until all prior data transmissions have been completed. The “tasks transmission” module also has a concurrent single-threaded client architecture. Only TCP can be used as the transport protocol, since reliable transmission is needed so that no tasks are lost in the network. Upon completion of a transmission, the “task transmission” module informs the “load-balancing” module of its final status, i.e., whether all, part, or none of the transmissions have been successful.
The second layer of the system consists of a single thread called the “load-balancing” thread, which is the core of the program. This layer is easily modifiable to include different policies. Nevertheless, all policies follow the same general steps, organized as a cycle and described as follows. The load-balancing process is initiated after a predefined interval (read from a file) or after a computed interval, depending on the policy at hand. The process then determines the portion of the tasks to be sent to each node in the system, if applicable. This decision is policy-dependent and is based on the current state of the node and on the states of the other available nodes. Furthermore, some policies may also rely on the detected network
delays to calculate the number of tasks to transmit (see Chapter 5). Subsequently, the thread packs the tasks into a network frame; it has access to the task queue, from which it can extract the desired number of tasks without deleting them, instead setting their status to inactive. This prevents the application from executing those tasks during the transition period. Once the “task transmission” module has completed its job, the tasks are either reset to the active state or deleted from the queue, depending on whether the transmission was successful or not. As noted earlier, another cycle of the load-balancing procedure cannot be initiated until all prior transmissions have completed.
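This stage-then-commit handling of outgoing tasks could look roughly as follows, building on the task-queue sketch above (the helper names are assumptions):

/* Stage up to n_wanted tasks for transmission by marking them inactive.
 * They stay in the queue so they can be reactivated if the transfer fails. */
int stage_tasks(struct task_queue *q, struct task **staged, int n_wanted)
{
    int n = 0;
    pthread_mutex_lock(&q->lock);
    for (struct task *t = q->front; t && n < n_wanted; t = t->next)
        if (t->active) {
            t->active = 0;             /* hidden from the application thread */
            staged[n++] = t;
        }
    pthread_mutex_unlock(&q->lock);
    return n;                          /* number of tasks actually staged */
}

/* Assumed helper (not shown): unlink the given task from the queue,
 * decrement current_queue_size, and free it. */
void queue_remove(struct task_queue *q, struct task *t);

/* After the transfer finishes: delete the staged tasks on success,
 * reactivate them on failure. */
void commit_tasks(struct task_queue *q, struct task **staged, int n, int ok)
{
    for (int i = 0; i < n; i++) {
        if (ok)
            queue_remove(q, staged[i]);   /* transmitted: drop the local copy */
        else
            staged[i]->active = 1;        /* failed: make it runnable again */
    }
}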
The application layer is divided into two threads: the “application input” and the
“application execution” threads. The “application input” creates a number of tasks
defined in the initialization file upon program startup and inserts them in the task
queue. Moreover, this thread is responsible for adding new tasks to the queue either
through an external source or from other nodes in the system. In the latter case,
the “application input” gets the network frame from the “tasks reception” thread,
unpacks it, and then adds the resulting tasks to the queue. On the other hand, the
“application execution” thread is responsible for the tasks execution. It simply pops
an active task from the queue, executes it, and then updates the Current Queue Size
variable.
The above description applies to any generic application that can be divided into
independent tasks. In our case, we used matrix multiplication as the basis for our
experiments, where one task is defined as the multiplication of a row by a static
matrix duplicated on all nodes. Therefore, the task queue contains rows having the
same size, which can be set by a parameter in the initialization file. In order to
emulate a real life application where the execution time of a task may vary, the size
of each element (in bytes) of a single row is generated randomly from a specified
range also set in the initialization file. This way, the multiplication of two elements
or two numbers of different sizes may take different amounts of time, which leads, in
turn, to variation in the execution time of the tasks.
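As an illustration of this application layer, executing one task (one row multiplied by the static matrix) might be coded as below. Plain double arithmetic is used to keep the sketch short, whereas the test-bed stores each element with a randomly chosen byte size so that execution times vary.

/* Execute one task: multiply one row (length n) by the static n x n matrix
 * that is duplicated on every node. */
void execute_row_task(const double *row, const double *matrix,
                      double *result, int n)
{
    for (int col = 0; col < n; col++) {
        double acc = 0.0;
        for (int k = 0; k < n; k++)
            acc += row[k] * matrix[k * n + col];   /* row-major static matrix */
        result[col] = acc;
    }
}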
Figure 3.4: Example of a row of size n (elements E1, E2, E3, ..., En with sizes of 1 to 5 bytes) where the maximum precision was set to 5 bytes.
Finally, the program has three initialization files: the parameter initialization file “init.ini”, which contains policy- and application-related parameters; the balancing instance file “balance.ini”; and the node file “node.ini”, which contains the addresses of the nodes (either IP address or host name) that are part of the system. An example of an initialization file is shown below, where each parameter is explained by a commented description (% denotes a comment) that precedes it.
% interval between two consecutive state (sync) transmissions
SYNC 1s 50000000ns
% load-balancing gain parameter K
GAIN 0.7
% transport protocol used for the state exchange (TCP or UDP)
SYNCPROTOCOL UDP
% the following parameter defines the initial number of tasks in the task-queue
% used by the application layer
INITNBTASKS 250
% input interval for externally arriving tasks
INPINTERVAL 0s 0ns
% size (number of elements) of a row, i.e., of one task
ROWSIZE 100
% maximum size in bytes (precision) of a single row element
MAXBYTES 15
DELAYBALANCE NO
The “balance.ini” file contains a list of intervals that are used to initiate the load-balancing process. Each entry is defined in seconds and nanoseconds.
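A hypothetical “balance.ini” could therefore look as follows; the values and the exact entry format are illustrative only:

% intervals (in seconds and nanoseconds) between consecutive balancing instants
10s 0ns
5s 500000000ns
5s 0ns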
Each node logs the events it generates; the logs are later used for statistical analysis and to generate plots. Eight types of logs, corresponding to eight different events, are described as follows:
1. This type is generated by the application layer and corresponds to a task com-
pletion event. The log includes: Time when the event happened, the corre-
sponding task ID and the execution time (nano-second resolution).
2. This type is triggered by a change in the task queue size. The cause may be
either the execution of a task, the transmission of one or several tasks, the
reception of one or several tasks from the system, or an external source. The
log includes: The time when the event took place and the resulting queue size.
3. This type corresponds to the initiation of the load-balancing process. The time
when the event has occurred is recorded.
4. This type logs a task transmission attempt. The log includes: the time when the transmission began, the destination node IP address, the number of tasks to transmit, and the total task size (in bytes).
5. This type corresponds to the completion of the tasks transmission. It has the
same fields as the previous type with the addition of the end transmission time.
6. This type corresponds to the tasks reception event. The log includes: The time
when the tasks frame was received, the source node IP address, the number of
tasks received and the corresponding size (in bytes).
7. This type corresponds to a state transmission event. The state of the node is
recorded in addition to the corresponding IP address of the destination node
and the time when the transmission has occurred.
8. This type corresponds to a state reception event. The state of the source node
is recorded in addition to the time when the state was received.
The policies implemented in this system follow the same general guidelines but differ
mainly in the scheduling of the load-balancing process and the allocation of the
fractions pij . Recall from the previous chapter that pij is the fraction of the excess
tasks as decided by node j that will be transmitted to node i.
The first scheme is the one-shot load-balancing [20, 19] where the nodes at-
tempt to exchange tasks among themselves only once. The scheduling of this single-
balancing instance is usually done early after the launch of the system but not before
the state information of each node has widely propagated. This is done to ensure
that each node is aware of the state of the other nodes when deciding on its load
distribution strategy. This scheme is mostly suitable in systems where external ar-
riving tasks are not prominent and the servicing rate or the computational power
of each node is, more or less, stable. In any case, this scheme can be extended to
the latter cases where a new balancing instance can be scheduled according to the
occurrence of a special event such as the arrival of a new external task as proposed
in [20]. Experimental work has been done to find the optimal balancing instance for
the single-shot load-balancing strategies (Section 5.2) [23].
The second scheme schedules balancing instances at regular points in time, at which the load distribution process is triggered and task exchanges between nodes take place [22]. In our case, the balancing instants are read from the “balance.ini” initialization file introduced in the previous section. The time intervals between two consecutive balancing instances may be constant or varying. Since the load distribution policy is distributed, each node can choose its balancing instances differently from the others, as defined in each node's “balance.ini” file.
1. Node j calculates the total number of tasks available in the system from the information available in its node-list:
\[
\text{Queue\_total} = \sum_{k=1}^{n} \text{Queue}(k). \qquad (3.1)
\]

2. Node j computes the system-wide average load per node, Queue_average = Queue_total / n.

3. If Queue(j) > Queue_average, node j computes its excess load as Queue_excess = (Queue(j) − Queue_average) · K, where K is the gain parameter; otherwise, the cycle ends.
4. Node j calculates the fraction p_ij of the excess tasks that it will transmit to node i. Three different methods can be used:

a. constant p_ij;

b.
\[
p_{ij} =
\begin{cases}
\dfrac{\text{Queue\_average} - \text{Queue}(i)}{\sum_{k=1,\,k \ne j}^{n} \bigl(\text{Queue\_average} - \text{Queue}(k)\bigr)} & \text{if } \text{Queue}(i) < \text{Queue\_average}, \\[2ex]
0 & \text{otherwise};
\end{cases}
\qquad (3.2)
\]

c.
\[
p_{ij} = \frac{1}{n-2}\left(1 - \frac{\text{Queue}(i)}{\sum_{k=1,\,k \ne j}^{n} \text{Queue}(k)}\right). \qquad (3.3)
\]

5. Node j transmits T_i = p_ij · Queue_excess tasks to each node i for which p_ij > 0.
One may think that setting the gain parameter K = 1 will achieve the best performance. However, in systems with large delays, where nodes may rely on outdated information when calculating the load distribution, K = 1 actually gives poor results. This phenomenon was first observed by the load-balancing groups at the University of New Mexico and the University of Tennessee in their simulation and analytical work. The experiments described in Chapter 5 were performed in order to optimize over the gain value K.
Equations (3.1) and (3.2) used for setting the fractions pij were introduced in
[13] and [10]. These equations were primarily used in simulations and experiments
to validate the deterministic model of Section 2.3.1. On the other hand, Equation
(3.3) was used in simulations and experiments to validate the stochastic model of
section 2.3.2 [24, 18]. Note that both methods allocate tasks to nodes inversely
proportional to their queue sizes. The two methods are illustrated in Figure 3.5. A
block diagram of the policies is shown in Figure 3.6.
Figure 3.5: (a) Queue sizes of the nodes as stored in the node-list of node 1 at the time the load-balancing policy was initiated at node 1. (b) Fractions pij as calculated by node 1 using the two different methods of Equations (3.2) and (3.3).
Figure 3.6: Summary of the steps for the load-balancing policy performed at node j: wait Tb; calculate Q_total and Q_average; if Q(j) > Q_average, set Q_excess = (Q(j) − Q_average) · K and transmit Ti = pij · Q_excess tasks to each node i.
We can deduce from steps 1-5 that the algorithm scales linearly with the number of nodes added to the system. The runtime is therefore O(n), since a full traversal of the node-list is needed in steps 1 and 4. A more advanced algorithm is developed in Chapter 6, where network information is used to determine the portions pij.
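Steps 1-5 translate almost directly into code. The sketch below is a simplified version of one balancing cycle, assuming the node-list has already been copied into an array and using method (3.2) for the fractions; it illustrates the computation rather than reproducing the test-bed implementation.

/* One cycle of the sender-initiated policy at node j (steps 1-5).
 * queue[k] holds the last known queue size of node k (queue[j] is local).
 * send[i] receives the number of tasks to transmit to node i. */
void balance_cycle(const long *queue, int n, int j, double K, long *send)
{
    /* Steps 1-2: total load and per-node average. */
    double total = 0.0;
    for (int k = 0; k < n; k++)
        total += queue[k];
    double average = total / n;

    /* Step 3: excess load scaled by the gain K. */
    double excess = (queue[j] > average) ? K * (queue[j] - average) : 0.0;

    /* Step 4, method (3.2): denominator summed over all nodes k != j. */
    double denom = 0.0;
    for (int k = 0; k < n; k++)
        if (k != j)
            denom += average - queue[k];

    /* Step 5: number of tasks to transmit to each node i. */
    for (int i = 0; i < n; i++) {
        double p = 0.0;
        if (i != j && denom > 0.0 && queue[i] < average)
            p = (average - queue[i]) / denom;      /* fraction p_ij */
        send[i] = (long)(p * excess);
    }
}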
3.9 Summary
Figure 3.7: Detailed architecture of the system: the application, load-balancing, and communication threads, the shared data structures (task queue and node-list), and the signals exchanged among them (state frames sent at every state-exchange interval or when triggered, over TCP or UDP; task transfers over TCP, using concurrent single-threaded server and client architectures).
Chapter 4
Network Delays
The study of network delays has gained attention in recent years as several services have started using IP-based networks. These services include, but are not limited to, voice over IP (VoIP) [32] and teleoperation [31], which are significantly affected by delay variations and therefore impose strict delay constraints. In systems where load-balancing is involved, delays greatly affect stability in several respects. First, the load distribution policies base their decisions on system state information that is outdated to a certain extent, and as the delay increases, the system becomes less stable. Moreover, this fact greatly affects the scalability of the system, since the “error” in the global state information grows with the addition of new nodes. Prediction may not always help, since network delays are unstable and vary according to several network conditions, as will be shown in this chapter. Furthermore, fluctuations in the global system state also arise due to delays and variability in the transmission rates when load exchanges take place. In other words, the migration of tasks between nodes may take an unknown amount of time. In fact, a load distribution instance scheduled before the end of an ongoing task transmission causes the policy to base its assessment on old information, i.e., on the initial state of the system prior to that transmission. When this occurs frequently, it renders global stability difficult to attain.
Moreover, the transfer of tasks from one node to another may come at an unexpectedly high cost, such that not transmitting at all would have given better results. Therefore, a priori knowledge of the statistics of the transmission delays may help the policy at hand decide wisely on the load distribution.
Nowadays, most delay experiments are performed on the RIPE NCC network as part of the Test Traffic Measurement (TTM) project [3]. RIPE NCC (Réseaux IP Européens Network Coordination Centre) [4] is a non-profit organization providing services for the benefit of IP-based network operators (ISPs) in Europe and the surrounding areas. The aim of the TTM project is to perform active measurements in order to determine connectivity and to probe and monitor the one-way delays between the different ISP networks.
In this section, some of the TTM delay experiments and their respective classes are presented. Then, delay probing experiments that we conducted over the Internet (not limited to Europe) are presented and compared to the different TTM categories.
• Processing delay is the time needed to process a packet at a given node for
transmission or reception. It has both deterministic and stochastic components
due to variations in the node’s computational power that affect the packet
processing.
• Propagation delay is the time needed to propagate one bit over the channel,
and is primarily caused by the travel time of an electromagnetic wave. This
delay is mostly observed in satellite links.
In [25] several methods were proposed to model the stochastic delays. They
achieved an approximation of the processing delay distribution using a Gaussian
pdf. Three parametric models were proposed for the stochastic queueing delay: the
exponential model, the Weibull model, and the polynomial or Pareto model. All
models exhibited discrepancies when they were compared to the available data.
On the other hand, Bovy et al. [11] classified the end-to-end delays into four categories. They used the RIPE one-way measurements with fixed IP probe packets of 100 bytes; the configuration details are available in [21]. Based on 2160 measurements taken per day per path, they characterized most of the experimental delay distributions as gamma-like. The four classes are listed below:
• Class A is the dominant and typical one and is modeled as gamma-like with
a heavy tail that decays slower than an exponential. (Figure 4.2(a))
• Class D has many peaks (white noise-like). This is mostly observed in paths
that have high packet loss. (Figure 4.2(d))
In our experiments, delays were measured from the application layer's perspective. That is, the time taken for the packet to travel through the TCP/IP protocol stack (upward and downward) is also included in the delay. This case is more relevant to the load-balancing system, which is implemented in the application layer. The results for 8 different paths are summarized in Table 4.1. The resulting delay distributions, based on measurements accumulated over 24 hours, are shown in Figure 4.3.
At first sight, we can observe that our results are consistent with the RIPE
experiments in the sense that the PDFs obtained are very similar in shape. Indeed,
most of the distributions plotted in Figure 4.3 can be classified as class A. However,
several triangular shapes were obtained that may not be well-modeled by a gamma
distribution as is the case in Figures 4.3(d),(e),(g) and (h). Second, the distribution
in Figure 4.3(a) has two peaks with one lower than the other which suggests that it
belongs to class B. In fact, looking at the individual delay measurements plotted in
Figure 4.4(a), we can see that between 11am and 5pm, higher network delays were
present, which explains the shape of the corresponding distribution.
On the other hand, the path “France-Taiwan” exhibits a different behavior dis-
closed in its delay distribution in Figure 4.3(c). The pdf has an exponential rise
followed by a sudden drop whereas in general, the inverse is observed i.e., the pdf
Figure 4.3: Delay distribution (pdf) versus delay (s) for the different paths in the Internet, including (a) UNM-Frankfurt, (b) Frankfurt-Taiwan, (c) France-Taiwan2, (d) Taiwan-UNM, (g) Australia-London, and (h) UNM-Australia (Taiwan is Sinica-Taiwan and Taiwan2 is NTU-Taiwan).
suddenly rises and then decays exponentially. This suggests further investigation by looking at the delay measurements (Figure 4.4(b)) and comparing them to a typical case (Figure 4.4(c)). We can observe that the delay measurements are clustered in the higher part of the plot, whereas in Figure 4.4(c) the delays are clustered in the lower part. Thus, we can deduce that the link was mostly busy at that time, which explains why such a distribution is obtained. In general, however, the typical distribution is encountered because the links are lightly used or unsaturated.
Figure 4.4: Individual delay measurements during a 24-hour period (MST zone) for some paths on the Internet: (a) UNM-Frankfurt, (b) France-Taiwan2, (c) Italy-France; the horizontal axes show the time (hours, MST) and the vertical axes the delay (s).
Consequently, this can also be used as a method to evaluate how busy a link is based
on its delay distribution. Finally, although the two nodes available in Hong-Kong
and Canada can be accessed from the University of New Mexico where the delay
reports were collected, the two nodes were not able to reach each other in either
direction. In fact, a traceroute run from the Hong-Kong node toward the Canadian node shows that the packet is dropped at the sixth hop in Hong-Kong, and a traceroute executed on the Canadian node toward the Hong-Kong node shows that the packet is dropped at the eighth hop, also in Hong-Kong. This suggests
that the Internet is not as completely connected as one would have thought.
The same delay probing experiment was performed on the ECE local area network.
The two nodes picked were separated by at least five switches. The test ran for 48 hours, during which 5754 measurements were collected. The minimum round-trip delay encountered was 317 µs, the average delay 351 µs, the maximum delay 1.12 ms, and the standard deviation 28.7 µs. The delay distribution shown in Figure 4.5 is
clearly a typical Class A sample.
Figure 4.5: Delay distribution (pdf) for the ECE Local Area Network (LAN); the horizontal axis shows the delay (µs) and the vertical axis the pdf.
The wireless delay testing took a different form. Here, mid-size TCP segments of 376 KB were transmitted between the different nodes. Our main objective was to investigate the behavior of the task-exchange part of the load-balancing system, which runs over TCP. The delay distribution obtained was modeled and then integrated into the simulator used to validate the stochastic model. Load-balancing experiments performed on the wireless test-bed were compared later to the simulator results (Section 5.2).
From - To       Nb. of trans.   Failed trans.   Error %   Min. delay (s)   Avg. delay (s)   Max. delay (s)   Std dev. (s)
node1 - node2   449             18              4.0%      5.48             21.7             247.1            29.3
node2 - node3   1380            97              7.0%      3.703            8.3              29.3             3.1
node3 - node1   2446            19              0.8%      1.16             4.2              24               1.2

Table 4.2: Summary of the delay probing experiments in the ad-hoc wireless network.
The delay probing from node A to node B was conducted as follows. Node A opens a TCP socket and writes 376 KB of random data on it. When node B receives the entire segment, it sends back a 3-byte acknowledgment (ACK) on the same established connection. This scenario is identical to what happens when tasks are exchanged between two computational elements in the load-balancing system. Once again, the delay is calculated from the application layer's perspective by taking the difference between the time the connection was established by node A and the time the ACK packet was received. This scheme was implemented on two different test-beds, an Ad-Hoc wireless network and a wireless network with infrastructure.
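A minimal sketch of the sending side of this probing scheme is given below, assuming a POSIX sockets environment; the constant SEGMENT_SIZE, the function name probe_once() and the error handling are illustrative choices and are not taken from the actual implementation.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/time.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    #define SEGMENT_SIZE (376 * 1024)  /* size of the probe segment in bytes */

    /* Send one probe segment to (ip, port) and return the application-layer
       delay in seconds, measured from the start of the connection attempt
       until the 3-byte ACK is read back; -1.0 marks a failed transmission. */
    static double probe_once(const char *ip, int port)
    {
        struct sockaddr_in dst;
        struct timeval t0, t1;
        char ack[3];
        char *buf = malloc(SEGMENT_SIZE);
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        memset(buf, 'x', SEGMENT_SIZE);         /* payload content is irrelevant */
        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        dst.sin_port = htons(port);
        inet_pton(AF_INET, ip, &dst.sin_addr);

        gettimeofday(&t0, NULL);                /* start of the measurement */
        if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0)
            goto fail;

        for (ssize_t sent = 0; sent < SEGMENT_SIZE; ) {   /* push the whole segment */
            ssize_t n = write(fd, buf + sent, SEGMENT_SIZE - sent);
            if (n <= 0) goto fail;
            sent += n;
        }
        for (ssize_t got = 0; got < 3; ) {                /* wait for the 3-byte ACK */
            ssize_t n = read(fd, ack + got, 3 - got);
            if (n <= 0) goto fail;
            got += n;
        }
        gettimeofday(&t1, NULL);                /* end of the measurement */

        free(buf);
        close(fd);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;

    fail:
        free(buf);
        close(fd);
        return -1.0;
    }

The receiver side simply accepts the connection, reads SEGMENT_SIZE bytes, and writes the 3-byte ACK back on the same socket.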
The Ad-Hoc wireless test-bed consists of 3 nodes connected amongst each other without the use of an AP (Access Point). The three nodes, each equipped with an 802.11b wireless adapter, were positioned inside the ECE department, where no direct line of sight was available between any two nodes. The three nodes were exchanging TCP
segments at the same time for a period of 3 hours and 15 minutes. The results are
summarized in Table 4.2. The individual delay measurements as a function of time
and the corresponding pdf of each path are shown in Figures 4.6-4.8.
Figure 4.6: (a) Ad-hoc wireless network delay measurements between node 1 and node 2 for a 3-hour period as a function of time. (b) Distribution of the delays for the same period.
It is clear that the wireless ad-hoc network exhibits a higher standard deviation and packet loss rate than the Internet, which makes it less predictable. Moreover, as indicated by the path between node 1 and node 2, the wireless network is fragile in
Figure 4.7: (a) Ad-hoc wireless network delay measurements between node 2 and node 3 for a 3-hour period as a function of time. (b) Distribution of the delays for the same period.
Figure 4.8: (a) Ad-hoc wireless network delay measurements between node 3 and node 1 for a 3-hour period as a function of time. (b) Distribution of the delays for the same period.
the sense that it is affected by the slightest variation in the environment. This fact is indicated by the high standard deviation obtained (29.3 s) and is most apparent in the sudden variation in the plot of Figure 4.6(a), which in turn explains the heavy tail of its corresponding distribution. Note that the nodes are stationary; therefore, the main cause of such sudden delay variation could be any disturbance in the surrounding environment that affected the wireless path between node 1 and node 2.

The same delay probing experiments were performed on the ECE wireless network equipped with 802.11b access points (AP). Three nodes were also used, and the setup of the experiment is shown in Figure 4.9. Each node was connected to a different AP located on a different floor in the ECE building, where no signal interference could occur except from outside elements using the network (wireless or wired).
Figure 4.9: Setup of the experiment: nodes 1, 2, and 3, each connected to a different AP attached to the ECE wired network.
From - To       Nb. of trans.   Failed trans.   Error %   Min. delay (s)   Avg. delay (s)   Max. delay (s)   Std dev. (s)
node1 - node2   620             20              3.2%      1.4              22.6             160              19.9
node1 - node3   620             1               0.2%      3.029            7.6              37.4             2.7
node2 - node3   2172            18              0.8%      2.546            6.8              112.2            4.7
node3 - node1   1659            34              2.0%      2.357            9.2              35.6             4.5

Table 4.3: Summary of the delay probing experiments in the wireless with AP network.
The results are summarized in Table 4.3. The individual delay measurements as a function
of time and the corresponding pdf of each path are shown in Figures 4.10-4.13.
Our first observation is that although the setup of the experiment seems symmetrical, no similarities between any of the paths can be found. Moreover, path 1-3 did not exhibit the same characteristics in both directions. Nevertheless, the results shown in this section were better than the ones of the previous section in terms of packet loss and delay stability.
Plotting the different PDFs of the delays present in the wireless network on a log scale shows that most of them can be approximated by a straight line, as shown in Figure 4.14.
Figure 4.10: (a) Wireless with AP network delay measurements between node 1 and node 2 for a 4-hour period as a function of time. (b) Distribution of the delays for the same period.
Figure 4.11: (a) Wireless with AP network delay measurements between node 1 and node 3 for a 4-hour period as a function of time. (b) Distribution of the delays for the same period.
Figure 4.12: (a) Wireless with AP network delay measurements between node 2 and node 3 for a 4-hour period as a function of time. (b) Distribution of the delays for the same period.
4.4 Summary
Delays in the Internet, LAN and wireless networks were investigated and categorized
according to the shape of their probability density function.
Figure 4.13: (a) Wireless with AP network delay measurements between node 3 and node 1 for a 4-hour period as a function of time. (b) Distribution of the delays for the same period.
Figure 4.14: Example of a straight-line fit for a shifted wireless delay pdf plotted on a logarithmic scale.
The four classes introduced show that the delay is not predictable and varies greatly, which may affect systems where load balancing is used. Moreover, connectivity between the nodes is not guaranteed; a link may become unavailable with higher probability in the Internet than in a LAN. In the wireless network, we noticed that the delay varied more frequently and packet drops were more prominent. Such delays can be approximated by an exponential distribution, as indicated by the straight-line fit obtained when the pdf is plotted on a log scale (Figure 4.14).
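As a rough illustration of how such an exponential approximation can be extracted from the raw measurements, the sketch below estimates the rate of a shifted exponential by a least-squares fit of the logarithm of an empirical histogram; the binning choices and the function name are assumptions made for the sketch, not the procedure used to produce Figure 4.14.

    #include <math.h>
    #include <stdlib.h>

    /* Fit log p(d) = a - lambda * d over the non-empty histogram bins and
       return the estimated rate lambda of the (shifted) exponential.
       delays[] holds n delay samples in seconds; the histogram starts at the
       minimum delay (the shift) and uses 'bins' bins of width 'width'. */
    double fit_exponential_rate(const double *delays, int n, int bins, double width)
    {
        int i, m = 0;
        int *count = calloc(bins, sizeof(int));
        double dmin = delays[0];
        double sx = 0, sy = 0, sxx = 0, sxy = 0;

        for (i = 1; i < n; i++)
            if (delays[i] < dmin) dmin = delays[i];
        for (i = 0; i < n; i++) {
            int b = (int)((delays[i] - dmin) / width);
            if (b >= 0 && b < bins) count[b]++;
        }

        /* linear regression of y = log(empirical pdf) against the bin centre x */
        for (i = 0; i < bins; i++) {
            if (count[i] == 0) continue;
            double x = (i + 0.5) * width;                    /* shifted delay */
            double y = log((double)count[i] / (n * width));  /* pdf estimate  */
            sx += x; sy += y; sxx += x * x; sxy += x * y;
            m++;
        }
        free(count);
        if (m < 2) return 0.0;                               /* not enough bins */

        double slope = (m * sxy - sx * sy) / (m * sxx - sx * sx);
        return -slope;                                       /* exponential rate */
    }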
The delay probing experiments performed in this chapter will be helpful in un-
derstanding the behavior of the policies implemented in the load-balancing system
of Chapter 3 and tested in different environments as will be seen in Chapter 5.
Chapter 5
Experimental results
The results presented in this chapter were published in [22], [14], [18] and [23].
The initial settings of the experiment were as follows. The average time tpi to process a task is the same on all nodes (identical processors) and is equal to 10 µs, while the time it takes to ready a load for transfer is about 5 µs. The initial queue values inserted at each node are q1(0) = 6000, q2(0) = 4000, q3(0) = 2000. Node 1 was balancing every 75 µs, node 2 every 120 µs, and node 3 every 100 µs. All the experimental responses were carried out with constant pij = 1/2 for i ≠ j.
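As a concrete illustration (a sketch under assumed names, not the implementation of Chapter 3), one balancing step at node j can be written as follows: the node forms a local estimate of the network-wide average from the queue values it knows, computes its excess, and assigns the fraction K of that excess to the other nodes according to the proportions pij.

    /* Hypothetical sketch of one balancing step at node j.  q[] holds node j's
       current estimates of the n queue lengths, p[i] is the fraction of the
       excess assigned to node i (here 1/2 for each of the two other nodes),
       and to_send[i] receives the amount of load to migrate to node i. */
    void balance_step(int j, int n, const double *q, double K,
                      const double *p, double *to_send)
    {
        double avg = 0.0;
        for (int k = 0; k < n; k++)
            avg += q[k];
        avg /= n;                               /* local estimate of the average */

        double excess = q[j] - avg;             /* queue length - local queue average */
        for (int i = 0; i < n; i++)
            to_send[i] = 0.0;
        if (excess <= 0.0)
            return;                             /* node j is not above the average */

        for (int i = 0; i < n; i++)
            if (i != j)
                to_send[i] = K * excess * p[i]; /* send the portion K of the excess */
    }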
The plots of the system responses for different gain values K are shown in Figure 5.1. Figure 5.2 summarizes the data from several experimental runs of the type shown in Figure 5.1. For K = 0.1, 0.2, 0.3, 0.4, 0.5, ten runs were made and the settling times (time to load balance) were determined. These are marked as small horizontal ticks on Figure 5.2. (For all such runs, the initial queues were the same and equal to q1(0) = 600, q2(0) = 400, q3(0) = 200.) For each value of K, the average settling time for these ten runs was computed and is marked as a dot on Figure 5.2. For values of K = 0.6 and higher (with increments of 0.1 in K), consistent results could not be obtained.
Figure 5.1: Experimental response of the load-balancing algorithm for nodes node01-node03. The plots show the excess load (queue length minus local queue average) at each node versus time; (a) gain K = 0.5, (b) gain K = 0.3.
For example, Figure 5.3(a) shows the plots of the queue length less the local queue average for an experimental run with K = 0.6, where the settling time is approximately 7 milliseconds. In contrast, Figure 5.3(b) shows the experimental results under the same conditions, but where the ringing persists for the full 40 milliseconds. The response was so oscillatory that a settling time could not be determined accurately. However, Figure 5.2 shows that one should choose the gain close to 0.5.
Figure 5.2: Summary of the load-balance time as a function of the feedback gain K (horizontal axis: Kz, the portion to send).
Figure 5.3: (a) K = 0.6; the settling time is approximately 7 milliseconds. (b) K = 0.6, the same conditions as (a), but now the ringing persists.
To match the experimental settings of the previous section, three Planet-Lab nodes were used: node1 at the University of New Mexico, node2 in Taipei, Taiwan, and node3
From - To   Average roundtrip delay τij   Data transmission rate   Average transmission of one task
n1 - n2     215 ms                        1.34 KB/s                14 ms
n1 - n3     200 ms                        1.42 KB/s                16 ms
n2 - n3     307 ms                        1.03 KB/s                20 ms
varying delays. In order to observe the behavior of the system under various gains,
several experiments were conducted for different gain values K ranging from 0.1 to
1. Fig. 5.4 is a plot of the system responses corresponding to each node i where
the gain K was set to 0.3. Similarly, Fig. 5.5 shows the system response for gain
K equal to 0.5. Figure 5.6 summarizes several runs corresponding to different gain
values. For each K = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, ten runs were made and the
settling times (time to load balance) were determined. For gain values higher than
0.8, consistent results could not be obtained; for instance, in most of the runs no settling time could be determined. However, when the observed network delays were less variable, the system response was steady and converged quickly to a balanced state when K was equal to 0.8 (Figure 5.7). As previously indicated, this scenario was not frequently observed. The system's behavior in this set of experiments does not exactly match, for the same gain value, the results obtained in the previous sections, due to the difference in network topology and delays. For instance, the ratio between the average delay and the task processing time is 20 (200 µs/10 µs) for the LAN setting and 12 (120 ms/10 ms) for the distributed setting. This is one of the reasons why ringing is observed earlier (for K = 0.6) in the LAN experiment, whereas under Planet-Lab unstable responses were observed starting with K = 0.8.
Figure 5.4: Experimental response of the load-balancing algorithm under large delays; gain K = 0.3 and pij = 0.5.
On the other hand, when the gain was set to 0.8, the system did not reach a stable point, as shown by the nodes' oscillatory responses in Figure 5.9. The reason behind this fact is that the load-balancing policy may base its decision on outdated information, and consequently it becomes better not to migrate all the excess load.
At this point, only the effect of the delay on the stability of the system was tested.
Figure 5.5: Experimental response of the load-balancing algorithm under large delays; gain K = 0.5 and pij = 0.5.
In order to test the effect of the variability of the task processing time on the system behavior, the matrix multiplication application was adjusted as follows: the average task processing time was kept at 10.2 ms but the standard deviation became 7.15 ms instead of 2.5 ms. This was done by adjusting the two parameters MAXBYTES and ROWSIZE introduced in Section 3.7. Figures 5.10 and 5.11 show the respective system responses for gains K = 0.3 and
Figure 5.7: Experimental response of the load-balancing algorithm under large delays; gain K = 0.8 and pij = 0.5.
Figure 5.8: Experimental response of the load-balancing algorithm under large delays; gain K = 0.4 and pij = 0.5.
K = 0.8. Comparing Figures 5.4 and 5.10, we can see that in the latter case, some
ringing persists and the system did not completely stabilize. On the other hand,
setting the gain K to 0.8 allowed the system to accommodate the variance in the task processing time.
The results drawn from the two test-beds were consistent with each other. In
Figure 5.9: Experimental response of the load-balancing algorithm under large delays; gain K = 0.8 and pij = 0.5.
particular, high gains were shown to be inefficient and to introduce drawbacks in systems with large delays. Conversely, systems with low gain values could not cope with the variability introduced by the task processing time. Therefore, one should avoid the limiting cases and carefully choose an adequate gain value.
In this section, load-balancing experiments were conducted over the wireless network, where one load-balancing instant is chosen and the proportions pij are set according to (3.3), given as follows:

p_{ij} = \begin{cases} \dfrac{1}{n-2}\left(1 - \dfrac{\mathrm{Queue}(i)}{\sum_{k=1,\,k\neq j}^{n}\mathrm{Queue}(k)}\right) & \text{if all } Q(i) \text{ are known,} \\[1.5ex] \dfrac{1}{n-1} & \text{otherwise.} \end{cases} \qquad (5.1)
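For illustration, a direct transcription of (5.1) is sketched below in C; the convention that a negative entry of queue[] marks a queue size that is unknown at node j is an assumption made for the sketch.

    /* Compute the fractions p[i], i != j, according to (5.1).  queue[k] < 0 is
       used here to mark a queue size that is not (yet) known at node j. */
    void compute_fractions(int j, int n, const double *queue, double *p)
    {
        int known = 1;
        double total = 0.0;

        for (int k = 0; k < n; k++) {
            if (queue[k] < 0) { known = 0; break; }
            if (k != j) total += queue[k];      /* sum over k != j of Queue(k) */
        }
        for (int i = 0; i < n; i++) {
            if (i == j) { p[i] = 0.0; continue; }
            if (!known || total <= 0.0)
                p[i] = 1.0 / (n - 1);                       /* fallback branch of (5.1) */
            else
                p[i] = (1.0 - queue[i] / total) / (n - 2);  /* informed branch of (5.1) */
        }
    }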
The experiments were conducted over a wireless network using an 802.11b access
point. The testing was completed on three computers: a 1.6 GHz Pentium IV
processor machine (node 1) and two 1 GHz Transmeta processor machines (nodes 2
& 3). To increase communication delays between the nodes (so as to bring the test-
bed to a setting that resembles a realistic setting of a busy network), the access point
was kept busy by third party machines, which continuously downloaded files. We
consider the case where all nodes execute the load-balancing algorithm at a common
balancing time tb . On average, the completion time of a task was 525 ms on node 1,
and 650 ms on the other two nodes.
The aim of the first experiment is to optimize the overall completion time with
respect to the balancing instant tb by setting the gain value K to 1. Each node was
assigned a certain number of tasks according to the following distribution: Node
1 was assigned 60 tasks, node 2 was assigned 30 tasks, and node 3 was assigned
120 tasks. The information exchange delay (viz., the communication delay) was 850 ms on average. Several experiments were conducted for each case of the load-balancing instant, and the average was calculated using five independent realizations for each selected value. In the second set
of experiments, the load-balancing instant was fixed at 1.4 s in order to find the
optimal gain that minimizes the overall completion time. The initial distribution of
tasks was as follows: 60 tasks were assigned to node 1, 150 tasks were assigned to
node 2, and 10 tasks were assigned to node 3. The average information exchange
delay was 322 ms and the average data transfer delay per task was 485 ms.
The results of the first set of experiments show that if the load-balancing is performed blindly, as at the onset of receiving the initial load, the performance is poorest. This is demonstrated by the relatively large average completion time (namely 45 s to 50 s) when the balancing instant is prior to the time when all state communication between the CEs is completed (when tb is approximately below 1 s), as shown in Fig. 5.12. Note that the completion time drops significantly (down to 40 s) as tb begins to exceed the time when all inter-CE communications have arrived (e.g., when tb > 1.5 s). In this regime of tb, the load-balancing is done in an informed fashion, that is, the nodes have knowledge of the initial load of every CE. Thus, it is not surprising that load-balancing is more effective than when it is performed before the CEs have received the state of the other CEs.
The explanation for the sudden rise in the completion time for balancing instants between 0.5 s and 1 s is that the knowledge states in the system are hybrid, that is, some nodes are aware of the queue sizes of the others while others are not. When this hybrid knowledge state is used in the load-balancing policy (Eqn. (5.1)), the resulting load distribution turns out to be severely uneven across the nodes, which, in turn, has an adverse effect on the completion time. Finally, we observe that as tb increases farther beyond the time all the inter-CE communications arrive (e.g., tb > 5 s), the average completion time begins to increase. This occurs precisely because any delay in executing the load-balancing beyond the arrival time of the inter-CE communications would increase the probability that some CEs will run out of tasks in the period before the transferred loads arrive.
Figure 5.12: Average total task-completion time as a function of the load-balancing instant. The load-balancing gain parameter is set at K = 1. The dots represent the actual experimental values and the solid curve is a best polynomial fit. This convention is used through Fig. 5.15.
Figure 5.13: Average total excess load decided by the load-balancing policy to be transferred (at the load-balancing instant) as a function of the balancing instant. The load-balancing gain parameter is set at K = 1.
Next we examine the size of the loads transferred as a function of the instant at
which the load-balancing is executed, as shown in Fig. 5.13. The illustrated behavior shows the dependence of the size of the total load transferred on the “knowledge
state” of the CEs. It is clear from the figure that for load-balancing instants up to
approximately the time when all CEs have accurate knowledge of each other’s load
states, the average size of the load assigned for transfer is unduly large. Clearly,
this seemingly “uninformed” load-balancing leads to the waste of bandwidth on the
interconnected network.
The results of the second set of experiments indeed confirm our earlier prediction
that when communication and load-transfer delays are prevalent, the load-balancing
gain must be reduced to prevent “overreaction” (i.e., sending unnecessary excess
load). This behavior is shown in Figure 5.14, and demonstrates that the optimal
performance is achieved not at the maximal gain (K = 1) but when K is approximately 0.8. This is a significant result as it is unexpected: in situations where the delay is insignificant (as in a fast Ethernet case), K = 1 indeed yields optimal performance. Figure 5.15 shows the dependence of the total load to be transferred on the gain. A large gain (near unity) results in a large load to be transferred, which, in turn, leads to a large load-transfer delay. Thus, large gains increase the likelihood that a node (which may not have been overloaded initially) completes all its load and remains idle until the transferred load arrives. This would
clearly increase the total average task completion time, as confirmed earlier by Fig.
5.14.
Figure 5.14: Average total task-completion time as a function of the balancing gain. The load-balancing instant is fixed at 1.4 s.
A Monte-Carlo simulation tool that allows the simulation of the queues described in Section 2.3.2 was developed at the University of New Mexico [20]. The network parameters
Figure 5.15: Average total excess load decided by the load-balancing policy to be transferred (at the load-balancing instant) as a function of the balancing gain. The load-balancing instant is fixed at 1.4 s.
(i.e., the statistics of the communication delays ηkj and the load transfer delays
τij ) and the task execution time in the simulation were set according to the respec-
tive average values obtained from the experiments described in the previous section.
This simulation tool was used to validate the correspondence between the stochastic
queuing model and the experimental setup. In particular, the simulated versions of
Figures 5.12–5.14 were generated, and are shown in Figures 5.16–5.18.
It is observed that the general characteristics of the curves are very similar, but they are not exactly identical, due to the unpredictable behavior and complexity of the wireless environment. Nevertheless, the results of the first simulation, shown in Fig. 5.16, were consistent with the experimental results, as we can clearly identify the sudden rise in the completion time around the balancing instant corresponding to the communication delay (850 ms). The reason for this behavior was described in the experimental section. As for the excess transferred load plotted in Fig. 5.17, the simulation resulted in the same curve and transition shape obtained from the experiment. The curve characteristics of the second simulation, shown in Fig. 5.18, are also analogous to the ones obtained in the experiment. Indeed, the gain values found are almost the same: 0.8 from the experiment and 0.87 from the simulation. As indicated before, the small difference is due to the unstable delay values and other
factors present in the wireless environment, which have been approximated both by the stochastic model and by the simulation tool.
Figure 5.16: Simulation results for the average total task-completion time as a function of the load-balancing instant. The load-balancing gain parameter is set at K = 1. The dots represent the actual experimental values [20].
Figure 5.17: Simulation results for the average total excess load decided by the load-balancing policy to be transferred (at the load-balancing instant) as a function of the balancing instant. The load-balancing gain parameter is set at K = 1 [20].
5.3 Summary
Figure 5.18: Simulation results for the average total task-completion time as a function of the balancing gain. The load-balancing instant is fixed at 1.4 s [20].
In the LAN and the Internet (Planet-Lab) experiments, the results showed that a gain parameter is necessary to compensate for the
delay incurred in the transfer of information and the variability in the computational
power at each node. As predicted by the stochastic model and the Monte-Carlo simulation, and then shown in the experimental section, a gain value of K = 1 does not give optimal results. Therefore, one should set K in the mid range of (0, 1] in systems where network delays are prominent.
In the wireless network, our experimental results and simulations both indicate
that in systems where communication and load-transfer delays are tangible, it is
best to execute the load-balancing after each node receives information from other
nodes regarding their load states. In particular, our results indicate that the loss
of time in waiting for the inter-node communications to arrive is compensated for
by the informed nature of the load-balancing. For both systems, the optimal load-
balancing gain turns out to be less than unity, contrary to systems that do not
exhibit significant latency. In delay-infested systems, a moderate balancing gain
has the benefit of reduced load-transfer delays, as the fraction of the load to be
transferred is reduced.
Nevertheless, the policies implemented so far do not account for load transfer
delays and connectivity in the system in a direct way but only through the use of
a gain parameter. In Chapter 6, an adaptive load-balancing policy is introduced that probes the system and uses its performance history to decide on an adequate load distribution.
Chapter 6
Dynamic and Adaptive Load-Balancing Policy
In Section 6.1, the proposed load-balancing policy is introduced and, in Section 6.1.1, the computational methods of the adaptive parameters used by the policy are described. Section 6.1.2 presents an experimental evaluation of this new dynamic policy.
The Internet is not as completely connected as one might think. This has been observed in Section 4.1, where nodes in Hong-Kong and Canada were not able to reach each other although they were perfectly accessible from other sites such as UNM. Add to that the fact that in distributed systems, a node may become unavailable or unreachable at any time due to a failure in the node itself or in the network path leading to it. Therefore, the assumption made by the load-balancing policies that all nodes are accessible at any time is unrealistic, especially in Internet-scale distributed systems or in Ad-Hoc wireless networks. This greatly affects the load-balance state of the system, since loads assigned to unreachable nodes can never be delivered. The proposed algorithm can detect the connectivity in the system and accordingly decide which nodes may participate in the load sharing. At each load-balancing instance, the group of reachable nodes is referred to as the “current node space”.
This load-balancing policy also takes into account changes in the computational performance of the nodes and distributes the tasks accordingly. In general, the system is not dedicated to the application at hand; other users may be using one or more nodes at a given time and thereby alter their computational power. Moreover, tasks are not considered identical; they may greatly differ in their completion time. These factors cannot be known a priori, and assigning a fixed computational power to each node is not always suitable. Therefore, the load-balancing strategy should adapt to these changes and be able to make decisions accordingly.
Moreover, the transfer delays incurred when tasks are migrated from one node to another may be unexpectedly large and have a negative impact on the overall system performance. To avoid this situation, an a priori estimate of the transfer delays helps the policy decide whether a transfer is profitable and choose an adequate size of load to migrate. These estimates should also be dynamically updated, since delays may vary greatly during the system's lifetime, as shown for delays of class B and higher in Section 4.1.
• rij is the throughput or transfer rate in bytes/second between node j and node i. Note that rij ≠ rji.
• qj,av is the average queue size calculated by node j based on its locally available
information.
• pij is the fraction of the excess tasks of node j that should be transferred to
node i as decided by the load balancing policy.
The first five parameters are assumed known at the time the load distribution process is triggered. That is, the update of these variables is done before the balancing instance is reached, as will be described later. The general steps of the load-balancing policy invoked at node j are listed below, followed by a detailed description.
2. Determine how many (n'o) and which nodes are below the average. These nodes will participate in the load sharing as viewed by node j.

3. Calculate the optimal fraction p'ij only for the n'o nodes using the following formula:

   p'_{ij} = \frac{q_{i,av} - (C_i/C_j)\, q_i}{\sum_{k,\, k \neq j} \left( q_{k,av} - (C_k/C_j)\, q_k \right)}

4. Calculate p''ij for the n'o nodes. p''ij is the maximum portion of the excess load that is judged to be profitable when transmitted to node i:

   p''_{ij} = \frac{(q_j - q_{j,excess})\, C_j\, r_{ij}}{q_{j,excess}\, s_j}

   Set pij = Min(p'ij, p''ij).

5. If Σi pij > a (a is a threshold parameter between 0 and 1), keep the fractions as computed. Otherwise, assign the remaining fraction (1 − Σi pij) to the nodes that have p''ij > p'ij and call the newly assigned fractions p'''ij.
The first step determines the node space in which the load distribution will take place from node j's perspective. This is achieved by checking the last time a state or SYNC packet was received from each node. To test for connectivity to node i, the local timestamp (in the info structure, Section 3.3) of the last received state packet is compared to the current time decremented by three times the interval between two consecutive state broadcasts. That is, if the last state packet received from node i is one of the last three packets transmitted, node i is considered to be reachable from node j; otherwise it is not included as part of the load-balancing node space since, most likely, a load transmitted to this node will not be correctly delivered. Therefore, it is more suitable for the policy to base its calculations on nodes to which the migration of loads has a higher probability of success. Note that at every instance of the load distribution process, the node space may end up with different elements according to the nodes' connectivity at that time. After that, the local queue average is calculated, where each queue is scaled by the Ci/Cj factor that accounts for the difference in computational power of each node. In the queue excess calculations, a gain factor K is used, since it is becoming a requirement in any policy that operates on large-delay systems where outdated state information is prominent. This fact appears in the experimental and theoretical studies conducted in the literature and in this thesis.
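A minimal sketch of this reachability test is given below; the structure and field names are assumptions patterned on the description of the info structure in Section 3.3, not its actual definition.

    #include <time.h>

    /* Assumed excerpt of the per-node info structure (Section 3.3). */
    struct node_info {
        time_t last_state_time;  /* local timestamp of the last state/SYNC packet */
        /* ... queue size, rate, symm_rate, C, ... */
    };

    /* Node i is considered reachable if its last state packet is one of the
       last three expected broadcasts, i.e. it arrived no earlier than
       (current time - 3 * broadcast_interval). */
    int is_reachable(const struct node_info *info_i, double broadcast_interval)
    {
        time_t now = time(NULL);
        return difftime(now, info_i->last_state_time) <= 3.0 * broadcast_interval;
    }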
The second and third steps employ the method introduced in Section 3.8 and used earlier in the literature in the SID and RID policies (Section 2.4) to calculate the fractions p'ij. This method is attractive since it only considers nodes that have queue sizes less than the average, and therefore results in as few connections as possible when the excess load is moved out of node j, which makes the policy at hand more scalable (Figure 3.5). Moreover, this method leads to optimal results when the policy is triggered once on each node and no delay is present in the system. Note that the formula is adjusted by the Ci/Cj factor.
The fourth step judges whether the proportion p'ij is worth transmitting to node i when transmission delays are present. This is accomplished by setting an upper bound on the maximum proportion of the excess load that is profitable when the exchange takes place. The task migration is said to be profitable if the time needed to transmit the load to the other end is less than the time needed to start executing that load on the current node (node j in this case). This statement is interpreted as follows:

   \frac{p''_{ij}\, q_{j,excess}\, s_j}{r_{ij}} \;\leq\; (q_j - q_{j,excess})\, C_j

Solving for p''ij, we get the upper bound for pij as indicated in step 4. In case rij is not available, rji is used instead to provide an approximation of the bandwidth between nodes j and i. If neither parameter is available, step 4 is omitted for the node pair (i, j). The rate rij is detected and updated each time a load is transmitted from node i to node j, as will be explained in the subsequent section, whereas rji is received in the state information packet transmitted by node i to node j. Both parameters are saved in the variables rate and symm rate, respectively, in the info structure (Section 3.3).
The fifth step is included for completeness and can be omitted at any time. The rationale behind it is that, after executing the algorithm, node j may find itself transmitting only a small portion (i.e., less than the threshold a) of its excess load due to the delay restrictions. Therefore, it is suitable to reassign the remaining untransferred proportion to the nearest nodes (i.e., nodes reached through links of higher rate) in the hope that they may possibly have better connectivity to the system.
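Putting steps 2 to 5 together, the following hypothetical sketch shows one way the fractions could be computed at node j; the argument names follow the notation above (r[i] playing the role of rij in step 4), and the reassignment of step 5 is only indicated.

    #include <math.h>

    /* Sketch of steps 2-5 at node j.  q[k] and q_av[k] are the latest queue size
       and queue average known for node k, C[k] its computational-power estimate,
       r[i] the transfer rate rij of step 4, reachable[k] the result of the
       step-1 test, q_excess node j's excess load (after the gain K), s the
       average task size sj, and a the threshold of step 5. */
    void compute_pij(int j, int n, const int *reachable,
                     const double *q, const double *q_av, const double *C,
                     const double *r, double q_excess, double s, double a,
                     double *p)
    {
        double denom = 0.0;

        /* Step 2: nodes of the current node space that lie below the average. */
        for (int k = 0; k < n; k++) {
            p[k] = 0.0;
            if (k == j || !reachable[k]) continue;
            double deficit = q_av[k] - (C[k] / C[j]) * q[k];
            if (deficit > 0.0) denom += deficit;
        }
        if (denom <= 0.0 || q_excess <= 0.0) return;

        double assigned = 0.0;
        for (int i = 0; i < n; i++) {
            if (i == j || !reachable[i]) continue;
            double deficit = q_av[i] - (C[i] / C[j]) * q[i];
            if (deficit <= 0.0) continue;

            double p1 = deficit / denom;                                   /* step 3: p'ij  */
            double p2 = (q[j] - q_excess) * C[j] * r[i] / (q_excess * s);  /* step 4: p''ij */
            p[i] = fmin(p1, p2);
            assigned += p[i];
        }

        if (assigned < a) {
            /* Step 5 (optional): the remaining fraction (1 - sum of p[i]) would
               be reassigned to the nodes with p''ij > p'ij, i.e. those reached
               through the fastest links; omitted in this sketch. */
        }
    }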
In this section, the computation procedure for the dynamic parameters C and rij is
explained. Note that the si parameter can be easily determined by averaging the
tasks’ sizes upon their creation.
Every time a task is completed by the application layer at node i, the Ci parameter is updated as follows: if Ci = 0, then Ci = Ttask; otherwise, Ci is replaced by a weighted average of its previous value and Ttask, with the weight α given to the new measurement,
where Ttask is the execution time of the last task and α is a gain parameter that
affects the Ci term in its ability to reflect the current computational power of the
system. Therefore, the values of α are critical to the stability and efficiency of the
load-balancing policy. That is, assigning values in the high range of (0, 1] to α may result in fluctuations in the Ci parameter, which will, in turn, have an adverse impact on the load-distribution decision, leading to tasks bouncing back and forth between nodes. On the other hand, setting α to low values may not keep the load-balancing policy informed about the latest state of the node. Consequently, the value of α should be selected depending on the application used and the degree of interference from external users. The update procedure could be easily modified to suit other
methods.
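The exact expression is not reproduced here; the sketch below assumes the weighted average takes the usual exponential form, with α weighting the newest measurement.

    /* Assumed exponential-averaging update of the computational-power estimate
       Ci after each completed task; T_task is the execution time of that task. */
    double update_C(double C_i, double T_task, double alpha)
    {
        if (C_i == 0.0)                  /* first completed task: initialise */
            return T_task;
        return (1.0 - alpha) * C_i + alpha * T_task;
    }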
The other parameter that is dynamically updated is the transfer rate rij incurred between node j and node i. On each data (task) transmission, the transfer delay Tdelay is recorded; it is calculated by taking the difference between the instant the connection is initiated by node j and the instant node i's acknowledgment of task reception is received by node j. Consequently, the average transmission rate (rate = totalsize/Tdelay) is calculated, where totalsize is the total size of the tasks migrated to node i. After each successful exchange of loads, the rij parameter is
updated as a weighted average of its previous value and the newly measured rate, with the weight β given to the new measurement. This scheme is a simplified version of the method used to update the Round Trip Time (RTT) of the packets exchanged during a TCP connection, where the delay variance is additionally taken into consideration [26]. In the next section, β is set to 1/8 as suggested by [27] for the RTT update method.
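A sketch of the corresponding bookkeeping is shown below; the exponential-average form and the helper name update_rate() are assumptions modeled on the TCP RTT estimator rather than a transcription of the actual code.

    #include <sys/time.h>

    /* Hypothetical rate update around one load transfer from node j to node i.
       t_start is taken when the connection is initiated, t_end when node i's
       acknowledgment is received, and totalsize is the total size (in bytes)
       of the migrated tasks.  beta is the gain, e.g. 1/8 as in [27]. */
    double update_rate(double r_ij, struct timeval t_start, struct timeval t_end,
                       double totalsize, double beta)
    {
        double t_delay = (t_end.tv_sec - t_start.tv_sec)
                       + (t_end.tv_usec - t_start.tv_usec) / 1e6;
        double rate = totalsize / t_delay;           /* measured bytes per second      */

        if (r_ij <= 0.0)                             /* first measurement on this path */
            return rate;
        return (1.0 - beta) * r_ij + beta * rate;    /* weight beta on the new sample  */
    }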
Finally, both parameters Cj and rij are included in node j's state information when it is transmitted to node i, for all i = 1, ..., n, i ≠ j.
First, lb2 was evaluated for gain values K between 0.3 and 1 in increments of 0.1. The α parameter introduced in the previous section was set to 0.05 by running several experiments and observing the behavior of the C parameter. Note that the load-balancing process was first triggered 20 s after the start of the system, and the strategy was then executed regularly at 10 s intervals.
This was done to ensure that the C parameter had enough time to adapt and reflect the current computational power of each node before any task migration between the nodes occurred.
Second, lb1 was evaluated under the same conditions as lb2. Since there is a discrepancy between the computational power of the nodes (as shown in Table 6.1), the lb1 strategy was adjusted to account for these differences by scaling the queue sizes in the pij computation as follows:

p_{ij} = \begin{cases} \dfrac{1}{n-2}\left(1 - \dfrac{(C_i/C_j)\,\mathrm{Queue}(i)}{\sum_{k=1,\,k\neq j}^{n}(C_k/C_j)\,\mathrm{Queue}(k)}\right) & \text{if all } Q(i) \text{ are known,} \\[1.5ex] \dfrac{1}{n-1} & \text{otherwise.} \end{cases} \qquad (6.1)

Note that the ratios Ci/Cj are fixed over time. Their values were obtained from the lb2 experiments by averaging the task processing times at each node.
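Relative to the sketch given after (5.1), the only change required by (6.1) is that every queue term is scaled by the fixed ratio Ci/Cj; a hypothetical adaptation is shown below.

    /* Variant of compute_fractions() for (6.1): each queue estimate is scaled
       by C[i]/C[j] before the fractions are formed; queue[k] < 0 again marks
       an unknown queue size. */
    void compute_fractions_scaled(int j, int n, const double *queue,
                                  const double *C, double *p)
    {
        int known = 1;
        double total = 0.0;

        for (int k = 0; k < n; k++) {
            if (queue[k] < 0) { known = 0; break; }
            if (k != j) total += (C[k] / C[j]) * queue[k];
        }
        for (int i = 0; i < n; i++) {
            if (i == j) { p[i] = 0.0; continue; }
            if (!known || total <= 0.0)
                p[i] = 1.0 / (n - 1);
            else
                p[i] = (1.0 - (C[i] / C[j]) * queue[i] / total) / (n - 2);
        }
    }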
Both policies were evaluated by conducting five runs for each value of K between 0.3 and 1 in increments of 0.1. Figure 6.1 shows the overall average completion
time versus K and Figure 6.2 shows the total number of exchanged tasks between
all the nodes.
Figure 6.1: Completion time averaged over 5 runs vs. different gain values K. The graph shows the results for both policies (lb1 and lb2).
Figure 6.2: Total number of tasks exchanged averaged over 5 runs vs. different gain values K. The graph shows the results for both policies (lb1 and lb2).
We can clearly see that lb2 outperformed lb1, especially for K = 0.8, which corresponds to lb2's earliest completion time, whereas lb1 performed best at K = 0.5. The
reason may be that lb2 used greater predictive computations before distributing the loads, which makes it “more or less” independent of the gain value K (in the range [0.6, 0.9]). As for the network traffic generated during the lifetime of the system, lb2
had fewer tasks exchanged for most of the gain K values. It is expected that the
difference in total tasks migrated between the two policies will grow as the number of
nodes increases.
In this experiment, only a few aspects of the lb2 policy were examined. Further tests should be performed under different conditions: a higher number of nodes (to test scalability) and bigger task sizes (to investigate the effect of transfer delays). Moreover, theoretical and experimental studies should be carried out to optimize the α and β parameters.
6.2 Summary
Chapter 7
Conclusions and Future Work
In this thesis, we first presented a brief description of the different taxonomies of load-
balancing policies followed by an overview of previous work in the field. We then
presented the design and implementation of a general framework where distributed
load-balancing policies can be tested and compared. The multi-threaded architecture
of the system provides high performance for the application at hand and is not
halted when transfer of information or loads takes place. We also showed that the
system provides flexibility in integrating different types of distributed, dynamic, and
adaptive strategies.
Delay probing experiments performed on the Internet and the wireless test-beds (both ad-hoc and with infrastructure) showed that these environments are unstable, in the sense that high network-delay variability and packet drops were observed, especially in the wireless network. Furthermore, connectivity between nodes in the Internet is not always guaranteed. These facts had a considerable influence on the performance of the implemented strategies in these different environments. The gain parameter K was found to have a great impact on the stability of the system, where a value in the mid range of the interval (0, 1] provided the best results. Moreover,
load-balancing should always occur in an informed manner directly after the receipt
of the state information from all the nodes belonging to the current balancing space.
Based on the delay probing experiments and the performance of the dynamic load-
balancing strategies in different environments, we proposed a dynamic and adaptive
load-balancing policy that takes into account the connectivity in the network, the
variability in the transfer delays, and the computational power of each node. Pre-
liminary experimental results show that this policy provides improvements over the
previously implemented dynamic strategies.
Finally, further investigations of the newly proposed adaptive and dynamic load-
balancing policy (Chapter 6) are needed; more experiments should be conducted on
a larger number of nodes to test its performance and, more importantly, its scalability. Moreover, the update methods of the adaptive parameters C and rij should be
enhanced by observing the impact of the gain values α and β on the stability of the
system.
References
[6] D.A. Bader, B.M.E. Moret, and L. Vawter. Industrial applications of high-
performance computing for phylogeny reconstruction. SPIE ITCom2001, Au-
gust 2001.
[8] Sayed A. Banawan and Nidal M. Zeidat. A comparative study of load sharing
in heterogeneous multicomputer systems. In 25th Annual Proceedings of the
Simulation Symposium, pages 22–31, April 1992.
[10] J. Douglas Birdwell, John Chiasson, Zhong Tang, Chaouki Abdallah, Majeed M.
Hayat, and Tsewei Wang. Dynamic time delay models for load balancing part
[13] J. Chiasson, J. D. Birdwell, Z. Tang, and C.T. Abdallah. The effect of time
delays in the stability of load balancing algorithms for parallel computations.
IEEE CDC, Maui, Hawaii, 2003.
[15] Yuan-Chieh Chow and Walter H. Kohler. Models for dynamic load balancing in
a heterogeneous multiple processor system. IEEE Transactions on Computers,
volume C-28(5):pages 354–361, May 1979.
[16] Douglas E. Comer and David L. Stevens. Client-Server Programming and Ap-
plications, BSD Socket Version with ANSI C, volume 3 of Internetworking with
TCP/IP. Prentice Hall, second edition, 1996.
[17] George Cybenko. Dynamic load balancing for distributed memory multiproces-
sors. Journal of Parallel and Distributed Computing, volume 7(2):pages 279–301,
October 1989.
[21] Fotis Georgatos, Florian Gruber, Daniel Karrenberg, Mark Santcroos, Ana Su-
sanj, Henk Uijterwaal, and Rene Wilhelm. Providing active measurements as a
regular service for isps. In Proceedings of the Passive and Active Measurements
Workshop (PAM2001), Amsterdam, April 2001.
[25] Gerard Hooghiemstra and Piet Van Mieghem. Delay distributions on fixed internet paths. Technical Report 20011031, Delft University of Technology, 2001.
[29] S.N.V. Kalluri, J. JàJà, D.A. Bader, Z. Zhang, J.R.G. Townshend, and
H. Fallah-Adl. High performance computing algorithms for land cover dynamics
using remote sensing data. In International Journal of Remote Sensing, volume
21:6, pages 1513–1536, 2000.
[30] Frank C. H. Lin and Robert M. Keller. The gradient model load balancing
method. IEEE Transactions on Software Engineering, SE-13(1), January 1987.
[31] Peter Xiaoping Liu, Max Q.-H. Meng, Xiufen Ye, and Jason Gu. End-to-end
delay boundary prediction using maximum entropy principle (mep) for internet-
based teleoperation. In Proceedings of the IEEE International Conference on
Robotics and Automation (ICRA), pages 2701–2706, 2002.
[32] Lopa Roychoudhuri, Ehab Al-Shaer, Hazem Hamed, and Greg Brewster. On studying the impact of the internet delays on audio transmission. IEEE Workshop on IP Operations and Management (IPOM'02), 2002.
[33] William Osser. Automatic process selection for load balancing. Master’s thesis,
University of California at Santa Cruz, June 1992.
[35] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, second edition, 1992.
[36] Vikram A. Saletore. A distributed and adaptive dynamic load balancing scheme
for parallel processing of medium-grain tasks. In Proceedings of the Fifth Dis-
tributed Memory Computing Conference, pages 994–999, Charleston, SC, April
1990.
[40] Scott Shenker and Abel Weinrib. The optimal control of heterogeneous queue-
ing systems: A paradigm for load-sharing and routing. IEEE Transactions on
Computers, volume 38(12):pages 1724–35, December 1989.
[41] K. G. Shin and Y. C. Chang. Load sharing in distributed real-time systems with
state-change broadcasts. IEEE Transactions on Computers, volume 38(9):pages
1124–1142, September 1989.
[42] W. Richard Stevens. Networking APIs: Sockets and XTI, volume 1 of UNIX
Network Programming. Prentice Hall, second edition, 1998.
[43] Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems: Principles
and Paradigms. Prentice Hall, 2002.
[44] Andras Veres and Miklos Boda. The chaotic nature of TCP congestion control.
In Proceedings of the IEEE Infocom, pages 1715–1723, 2000.
[45] Abel Weinrib and Scott Shenker. Greed is not enough: Adaptive load sharing
in large heterogeneous systems. In Proceedings of the IEEE Infocom ’88, pages
986–994, March 1988.
[46] Marc H. Willebeek-LeMair and Anthony P. Reeves. Strategies for dynamic load
balancing on highly parallel computers. IEEE Transactions on Parallel and
Distributed Systems, volume 4:pages 979–993, September 1993.