Load Balancing

This document discusses various techniques for load balancing distributed systems. It describes static and dynamic load balancing, and covers algorithms like recursive bisection, diffusion-based approaches, and dimension exchange. The key goals of load balancing are to ensure all processors have work and to distribute load efficiently while minimizing communication overhead. Dynamic techniques periodically rebalance as loads change, while decentralized approaches avoid centralized bottlenecks.


CS 584

Load Balancing
Goal: all processors working all the time
  an efficiency of 1
Distribute the load (work) to meet that goal.
Two types of load balancing:
  static
  dynamic

Load Balancing
The load balancing problem can be reduced to the bin-packing problem, which is NP-complete.
For simple cases we can do well, but heterogeneity complicates matters:
  different types of resources
    processor
    network, etc.

Evaluation of Load Balancing
Efficiency
  Are the processors always working?
  How much processing overhead is associated with the load balancing algorithm?
Communication
  Does load balancing introduce or affect the communication pattern?
  How much communication overhead is associated with the load balancing algorithm?
  How many edges of the communication graph are cut?

Partitioning Techniques
Regular grids (-: easy :-)
  striping
  blocking
  use processing power to divide the load more fairly
Generalized graphs
  levelization
  scattered decomposition
  recursive bisection

Levelization
Begin with a boundary
  number these nodes level 1
  all nodes connected to a level 1 node are labeled level 2, etc.
Partitioning is performed by
  determining the number of nodes per processor
  counting off the nodes of a level until exhausted
  proceeding to the next level

Levelization
Want to ensure nearest-neighbor communication.
Let p be the number of processors and n the number of nodes.
Let ri be the total number of nodes in contiguous levels i and i + 1.
Let r = max{r1, r2, ..., rn}.
Nearest-neighbor communication is assured if n/p > r.
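The counting-off procedure above amounts to a breadth-first traversal from the boundary. A minimal sketch, assuming an adjacency-list graph and a roughly even n/p split (the function names and dict-based representation are illustrative, not from the slides):

```python
from collections import deque

def levelize(adj, boundary):
    """BFS from the boundary: boundary nodes are level 1, their
    unvisited neighbors level 2, and so on."""
    level = {v: 1 for v in boundary}
    queue = deque(boundary)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level

def partition_by_levels(adj, boundary, p):
    """Assign nodes to p processors by counting off nodes level by
    level until each processor holds about n/p of them."""
    level = levelize(adj, boundary)
    order = sorted(level, key=lambda v: level[v])   # level-major order
    n = len(order)
    chunk = (n + p - 1) // p                        # ~n/p nodes each
    return {v: i // chunk for i, v in enumerate(order)}
```

Because nodes are handed out in level order, each processor's piece spans only a few contiguous levels, which is what makes the n/p > r condition give nearest-neighbor communication.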

Scattered Decomposition
Used for highly irregular grids.
Partition the load into a large number r of rectangular clusters such that r >> p.
Each processor is given a disjoint set of r/p clusters.
Communication overhead can be a problem for highly irregular problems.

Recursive Bisection
Recursively divide the domain in two pieces at each step.
Three methods:
  Recursive Coordinate Bisection
  Recursive Graph Bisection
  Recursive Spectral Bisection

Recursive Coordinate Bisection
Divide the domain based on the physical coordinates of the nodes.
Pick a dimension and divide in half.
RCB uses no connectivity information:
  lots of edges crossing boundaries
  partitions may be disconnected
Some newer research based on graph separators overcomes some of these problems.
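The pick-a-dimension-and-halve step can be sketched as follows. This is a minimal recursive sketch, assuming points as coordinate tuples and a median split along the dimension with the largest spread (the function name and split rule are illustrative assumptions):

```python
def rcb(points, ids, depth):
    """Recursive Coordinate Bisection: at each step pick the
    dimension with the widest spread and split the nodes at its
    median. Returns a list of id-lists, one per partition."""
    if depth == 0:
        return [ids]
    dims = len(points[0])
    # choose the coordinate with the largest spread
    d = max(range(dims),
            key=lambda k: max(p[k] for p in points) - min(p[k] for p in points))
    order = sorted(range(len(ids)), key=lambda i: points[i][d])
    half = len(order) // 2
    lo, hi = order[:half], order[half:]
    return (rcb([points[i] for i in lo], [ids[i] for i in lo], depth - 1)
            + rcb([points[i] for i in hi], [ids[i] for i in hi], depth - 1))
```

Note the code never looks at edges: this is exactly why RCB can cut many edges or produce disconnected partitions.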

Inertial Bisection
Coordinate bisection is often susceptible to the orientation of the mesh.
Solution: find the principal axis of the communication graph and bisect across it.

Graph Theory Based Algorithms
Geometric algorithms generally produce low-quality partitions:
  they don't take connectivity into account
Graph theory algorithms apply what we know about generalized graphs to the partitioning problem.
Hopefully, they reduce the cut size.

Greedy Bisection
Start with a vertex of smallest degree
  (least number of edges)
Mark all its neighbors
Mark all its neighbors' neighbors, etc.
The first n/p marked vertices form one subdomain.
Apply the algorithm on the remaining vertices.
Recursive Graph Bisection
Based on graph distance rather than coordinate distance.
Determine the two furthest-separated nodes.
Organize and partition the nodes according to their distance from these extremities.
Computationally expensive
  can use approximation methods

Recursive Spectral Bisection
Uses the discrete Laplacian:
  let A be the adjacency matrix
  let D be the diagonal matrix where D[i,i] is the degree of node i
  LG = A - D

Recursive Spectral Bisection
LG is negative semidefinite.
Its largest eigenvalue is zero, and the corresponding eigenvector is all ones.
The magnitude of the second-largest eigenvalue gives a measure of the connectivity of the graph.
Its corresponding eigenvector gives a measure of the distances between nodes.

Recursive Spectral Bisection
The eigenvector corresponding to the second-largest eigenvalue is the Fiedler vector.
Calculating the Fiedler vector is computationally intensive.
RSB yields connected partitions that are very well balanced.
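The whole RSB step fits in a few lines with an eigensolver. A minimal sketch using the slides' sign convention LG = A - D (so the largest eigenvalue is 0 and the Fiedler vector belongs to the second-largest); the dense `numpy.linalg.eigh` call is an illustrative shortcut — real RSB uses iterative methods precisely because this computation is expensive:

```python
import numpy as np

def spectral_bisect(A):
    """Bisect a graph using the Fiedler vector of LG = A - D:
    nodes at or below the median Fiedler value form one part."""
    A = np.asarray(A, dtype=float)
    L = A - np.diag(A.sum(axis=1))      # LG = A - D
    vals, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    fiedler = vecs[:, -2]               # vector of the 2nd-largest eigenvalue
    median = np.median(fiedler)
    left = [i for i, x in enumerate(fiedler) if x <= median]
    right = [i for i, x in enumerate(fiedler) if x > median]
    return left, right
```

Splitting at the median is what gives the very even balance; connectivity of the two halves comes from the structure of the Fiedler vector.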

Example
  RSB: 299 edges cut
  RCB: 529 edges cut
  RGB: 618 edges cut

Global vs. Local Partitioning
Global methods produce a good partitioning.
Local methods can then be used to improve the partitioning.

The Kernighan-Lin Algorithm
Swap pairs of nodes to decrease the cut.
Allows intermediate increases in the cut size to avoid certain local minima.
Loop:
  choose the pair of nodes with the largest benefit of swapping
  logically exchange them (not for real)
  lock those nodes
until all nodes are locked.
Find the sequence of swaps that yields the largest accumulated benefit.
Perform those swaps for real.
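One pass of the loop above can be sketched as follows. This is a simplified O(n^2)-per-step sketch, assuming an adjacency-list graph; production Kernighan-Lin uses bucket data structures to find the best pair quickly (all names here are illustrative):

```python
def kl_pass(adj, part_a, part_b):
    """One Kernighan-Lin pass: repeatedly pick the unlocked pair
    whose swap most decreases the cut, swap it logically, lock
    both nodes, then keep only the prefix of swaps with the best
    accumulated gain and perform those for real."""
    a, b = set(part_a), set(part_b)

    def gain(u, v):
        # external-minus-internal edge counts; swapping u and v
        # changes the cut by -(d(u) + d(v) - 2*[u adjacent to v])
        def d(x, own):
            return (sum(1 for y in adj[x] if y not in own)
                    - sum(1 for y in adj[x] if y in own and y != x))
        return d(u, a) + d(v, b) - 2 * (v in adj[u])

    locked, swaps, gains = set(), [], []
    for _ in range(min(len(a), len(b))):
        candidates = [(u, v) for u in a - locked for v in b - locked]
        if not candidates:
            break
        u, v = max(candidates, key=lambda p: gain(*p))
        gains.append(gain(u, v))
        swaps.append((u, v))
        a.remove(u); b.remove(v); a.add(v); b.add(u)   # logical swap
        locked |= {u, v}
    # find the prefix of swaps with the largest accumulated benefit
    best_k, best_acc, acc = 0, 0, 0
    for k, g in enumerate(gains, 1):
        acc += g
        if acc > best_acc:
            best_acc, best_k = acc, k
    a, b = set(part_a), set(part_b)
    for u, v in swaps[:best_k]:                        # perform for real
        a.remove(u); b.remove(v); a.add(v); b.add(u)
    return a, b
```

Because individual gains in the chosen prefix may be negative, the pass can climb out of some local minima, exactly as the slide describes.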


Helpful-Sets
Two steps:
  find a set of nodes in one partition and move it to the other partition to decrease the cut size
  rebalance the load
The set of nodes moved must be helpful:
  the helpfulness of a node is the decrease in cut size if the node is moved
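Helpfulness is easy to compute directly: moving a node removes its cut edges and turns its internal edges into cut edges. A minimal sketch, assuming an adjacency-list graph and a node set that lies inside `part` (names are illustrative):

```python
def helpfulness(adj, node, part):
    """Helpfulness of a node in `part` = decrease in cut size if
    it is moved to the other side: cut edges removed (neighbors
    across the cut) minus new cut edges created (neighbors that
    stay on this side)."""
    external = sum(1 for v in adj[node] if v not in part)
    internal = sum(1 for v in adj[node] if v in part)
    return external - internal

def helpfulness_of_set(adj, nodes, part):
    """A set is h-helpful if moving the whole set decreases the
    cut by h; edges inside the moved set are unchanged."""
    nodes = set(nodes)
    h = 0
    for u in nodes:
        for v in adj[u]:
            if v in nodes:
                continue                 # internal to the moved set
            h += 1 if v not in part else -1
    return h
```

A set can be more helpful than its members individually, since edges inside the set stop counting against the move; this is why the algorithm searches for helpful sets rather than single nodes.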

Helpful-Sets
All these sets are 2-helpful.

The Helpful-Sets Algorithm: Theory
If there is a bisection and its cut size is not too small, then there exists a small 4-helpful set on one side or the other.
This 4-helpful set can be moved and will reduce the cut by 4.
If the imbalance is not too large and the cut of the unbalanced partition is not too small, then it is possible to rebalance without increasing the cut size by more than 2.
Apply the theory iteratively until the "too small" condition is met.

Multi-level Hybrid Methods
For very large graphs, the time to partition can be extremely costly.
Reduce the time by coarsening the graph:
  shrink a large graph to a smaller one that has similar characteristics
Coarsen by
  heavy-edge matching
  simple partitioning heuristics
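Heavy-edge matching, the usual coarsening step, can be sketched briefly. A minimal sketch, assuming a weighted adjacency map `u -> {v: weight}` (the representation and greedy visit order are illustrative assumptions):

```python
def heavy_edge_matching(adj_w):
    """Heavy-edge matching: visit vertices in order and match each
    unmatched vertex with its unmatched neighbor along the heaviest
    edge; each matched pair collapses into one coarse vertex."""
    matched = {}
    for u in adj_w:
        if u in matched:
            continue
        candidates = [(w, v) for v, w in adj_w[u].items() if v not in matched]
        if candidates:
            _, v = max(candidates)       # heaviest incident edge
            matched[u] = v
            matched[v] = u
        else:
            matched[u] = u               # no free neighbor: maps to itself
    return matched
```

Collapsing heavy edges first hides as much edge weight as possible inside coarse vertices, so the coarse graph's cuts stay close to the fine graph's cuts.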


Comparisons
(x.xx) = run time in seconds
ML = multilevel (spectral on the coarse graph, KL on intermediate graphs)
IN = inertial; Party = 5 or 6 different methods

Edge cuts (run times in seconds):

Graph    |v|      |e|      ML           Chaco IN     Chaco IN+KL  PMetis       Party all     Party all+HS
airfoil  4253     12289    85 (0.08)    94 (0.00)    83 (0.02)    85 (0.04)    94 (0.04)     83 (0.15)
crack    10240    30380    211 (0.16)   377 (0.01)   218 (0.05)   196 (0.14)   243 (0.10)    208 (0.44)
wave     156317   1059331  9542 (3.64)  9834 (0.19)  9660 (1.61)  9801 (3.50)  10361 (2.84)  9614 (11.93)
mat      73752    1761718  9359 (1.80)
DEBR     1048576  2097149  100286 (48.99)

[The remaining rows and cells of the original table are too garbled in this copy to reconstruct.]

Dynamic Load Balancing
Load is statically partitioned initially.
Adjust the load when an imbalance is detected.
Objectives:
  rebalance the load
  keep the edge cut minimized (communication)
  avoid too much overhead

Dynamic Load Balancing
Consider adaptive algorithms: after an interval of computation, the mesh is adjusted according to an estimate of the discretization error
  coarsened in some areas
  refined in others
Mesh adjustment causes load imbalance.

Dynamic Load Balancing

After refinement, node 1 ends up with more work

Centralized DLB
Control of the load is centralized.
Two approaches:
  Master-worker (task scheduling)
    tasks are kept in a central location; workers ask for tasks
    requires lots of tasks with weak locality requirements; no major communication between workers
  Load monitor
    periodically monitor the load on the processors
    adjust the load to keep an optimal balance

Repartitioning
Consider: the dynamic situation is simply a sequence of static situations.
Solution: repartition the load after each change
  some partitioning algorithms are very quick
Issues:
  scalability problems
  how different are the current load distribution and the new one?
  data dependencies

Decentralized DLB
Generally focused on a work pool.
Two approaches:
  hierarchical
  fully distributed

Fully Distributed DLB
Lower overhead than centralized schemes.
No global information:
  load is locally optimized
  propagation is slow
  load balance may not be as good as with a centralized scheme
Three steps:
  flow calculation (how much to move)
  mesh node selection (which work to move)
  actual mesh node migration

Flow Calculation
View it as a network flow problem on the processor communication graph:
  add source and sink nodes
  connect the source to all nodes; each edge value is that node's current load
  connect the sink to all nodes; each edge value is the mean load

Flow Calculation
Many network flow algorithms exist, but they are
  more intense than necessary
  not parallel
Use simpler, more scalable algorithms:
  random matchings
    pick random neighboring processes
    exchange some load
    eventually you may get there

Diffusion
Each processor balances its load with all its neighbors.
How much work should I have?

  w_p^(t+1) = w_p^t − Σ_{q : {p,q} ∈ F} α_pq (w_p^t − w_q^t)

How much to send on an edge?

  l_pq^t = α_pq (w_p^t − w_q^t)

Repeat until the load is balanced, which takes

  O( log(1/ε) / (1 − λ_2) )

steps, where λ_2 is the second-largest eigenvalue of the diffusion matrix.
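The diffusion update above is a few lines of code. A minimal sketch, assuming uniform coefficients α_pq = α on a processor graph given as adjacency lists; α must be small enough for stability, e.g. α ≤ 1/(max degree + 1) (names and the uniform-α choice are illustrative):

```python
def diffuse(loads, adj, alpha=0.25, steps=100):
    """First-order diffusion: each step, every processor p sends
    alpha * (w_p - w_q) along each edge {p, q} (a negative amount
    means it receives). Total load is conserved."""
    w = dict(loads)
    for _ in range(steps):
        delta = {p: 0.0 for p in w}
        for p in w:
            for q in adj[p]:
                delta[p] -= alpha * (w[p] - w[q])
        for p in w:
            w[p] += delta[p]
    return w
```

Each step is purely local (a processor only needs its neighbors' loads), which is what makes diffusion attractive for fully distributed balancing.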

Diffusion
Convergence to a balanced load can be slow.
It can be improved with over-relaxation:
  monitor what is sent in each step
  determine how much to send based on the current imbalance and how much was sent in previous steps
Over-relaxed diffusion balances the load in

  O( log(1/ε) / √(1 − λ_2) )

steps.

Dimension Exchange
Rather than communicating with all neighbors each round, communicate with only one.
  Comes from the dimensions of a hypercube.
  Use edge coloring for general graphs.
Exchange load with the neighbor along a dimension:

  l = (l_i + l_j) / 2

Will converge in d steps on a d-dimensional hypercube.
Some graphs may need a different exchange factor to converge faster:

  l = l_i · α + l_j · (1 − α)
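On a hypercube the rounds are easy to simulate, since the neighbor across dimension k is the processor id with bit k flipped. A minimal sketch for the averaging case α = 1/2, assuming a power-of-two processor count (names are illustrative):

```python
def dimension_exchange(loads):
    """Dimension exchange on a d-dimensional hypercube: in round k
    every processor pairs with its neighbor across dimension k and
    both take the average l = (l_i + l_j) / 2. Balanced after d
    rounds. `loads` is indexed by processor id 0 .. 2^d - 1."""
    w = list(loads)
    d = len(w).bit_length() - 1        # assumes len(w) is a power of two
    for k in range(d):
        for i in range(len(w)):
            j = i ^ (1 << k)           # neighbor across dimension k
            if i < j:                  # handle each pair once
                w[i] = w[j] = (w[i] + w[j]) / 2
    return w
```

For the generalized factor, the pair update becomes `w[i] = w[i] * alpha + w[j] * (1 - alpha)` with the complementary value on the neighbor.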

Diffusion & Dimension Exchange
Can view
  diffusion as a Jacobi method
  dimension exchange as Gauss-Seidel
Can use multi-level variants:
  divide the processor communication graph in half
  determine the load to shift across the cut
  recursively rebalance each half

Mesh Node Selection
Must identify which mesh nodes to migrate
  minimize edge cut and overhead
Very dependent on the problem:
  shape and size of the partition may play a role in accuracy
  aspect ratio maintenance
  move items that are farthest from the center of gravity

Load Balancing Schemes
(Whom do I request work from?)
Asynchronous round robin
  each processor maintains its own target
  ask the target for work, then increment the target
Global round robin
  the target is maintained by a master node
Random polling
  randomly select a donor
  each processor has equal probability of being selected
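The donor-selection rules above can be sketched in a few lines. A minimal sketch for p processors numbered 0 .. p-1, where `me` is the requester (function names are illustrative; the global round robin variant would simply keep `target` on the master node):

```python
import random

def make_async_round_robin(p, me):
    """Asynchronous round robin: each processor keeps its own
    target pointer; ask the target for work, then increment it
    (mod p), skipping ourselves."""
    target = (me + 1) % p
    def next_donor():
        nonlocal target
        donor = target
        target = (target + 1) % p
        if target == me:
            target = (target + 1) % p
        return donor
    return next_donor

def random_polling_donor(p, me):
    """Random polling: every processor other than the requester is
    an equally likely donor."""
    donor = random.randrange(p - 1)
    return donor if donor < me else donor + 1
```

Random polling needs no shared state at all, which is why it scales well; global round robin gives the most even spread of requests but funnels every request through the master.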
