Module 1 Chapter 2
Program and Network Properties
• Conditions of parallelism
• Program partitioning and scheduling
• Program flow mechanisms
• System interconnect architectures
Conditions of Parallelism
The exploitation of parallelism in computing requires
understanding the basic theory associated with it.
Progress is needed in several areas:
• computation models for parallel computing
• interprocessor communication in parallel architectures
• integration of parallel systems into general environments
Data dependences
• Flow dependence: an instruction reads a result produced by an earlier instruction (read after write).
• Antidependence: an instruction overwrites a variable read by an earlier instruction (write after read).
• Output dependence: two instructions write the same variable (write after write).
• I/O dependence: two I/O statements reference the same file.
• Unknown dependence:
o The subscript of a variable is itself subscripted.
o The subscript does not contain the loop index variable.
o A variable appears more than once with subscripts having different
coefficients of the loop variable (that is, different functions of the loop
variable).
o The subscript is nonlinear in the loop index variable.
• Parallel execution of program segments which do not have total data
independence can produce non-deterministic results.
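As an illustration of the first case above, a small Python sketch (mine, not from the slides): with a subscripted subscript, whether one iteration depends on another is only decidable at run time, so the compiler must assume an unknown dependence.

# Whether iteration i's write A[idx[i]] feeds a later iteration's read A[i]
# depends on the runtime contents of idx, so the loop cannot be safely
# parallelized at compile time.
def update(A, idx):
    for i in range(len(idx)):
        A[idx[i]] = A[i] + 1      # cf. A(IP(I)) = A(I) + 1 in Fortran terms
    return A

print(update([0, 0, 0, 0], [0, 1, 2, 3]))  # [1, 1, 1, 1]: iterations independent
print(update([0, 0, 0, 0], [1, 2, 3, 0]))  # [4, 1, 2, 3]: a dependence chain forms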
Data dependence example
Two Fortran loops. In the first, the iterations are independent of one another; in the second, iteration I reads A(I-1), which was written by iteration I-1 (a loop-carried flow dependence), so its iterations cannot execute in parallel.

      DO 20 I = 1, N
        A(I) = C(I)
        IF (A(I) .LT. 0) A(I) = 1
   20 CONTINUE

      DO 10 I = 1, N
        IF (A(I-1) .EQ. 0) A(I) = 0
   10 CONTINUE
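To see the loop-carried dependence concretely, a small Python sketch (not from the slides) compares serial execution of the second loop with a naive parallel-style execution in which every iteration reads the original values of A:

def serial(a):
    a = list(a)
    for i in range(1, len(a)):        # each iteration sees earlier updates
        if a[i - 1] == 0:
            a[i] = 0
    return a

def parallel_naive(a):
    old = list(a)                     # every iteration reads the ORIGINAL values
    a = list(a)
    for i in range(1, len(a)):
        if old[i - 1] == 0:
            a[i] = 0
    return a

A = [0, 5, 3, 7]
print(serial(A))          # [0, 0, 0, 0]: the zeroing propagates down the chain
print(parallel_naive(A))  # [0, 0, 3, 7]: breaking the dependence changes the result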
Resource dependence
• Resource dependence arises when two instructions compete for the same hardware resource (ALUs, registers, memory areas) rather than for data.
Utilizing Bernstein's conditions
• Bernstein's conditions: two processes Pi and Pj can execute in parallel (Pi || Pj) iff Ii ∩ Oj = ∅, Ij ∩ Oi = ∅, and Oi ∩ Oj = ∅, where Ik and Ok denote the input and output variable sets of Pk.
• Example statements:
P1 : C = D × E
P2 : M = G + C
P3 : A = B + C
P4 : C = L + M
P5 : F = G ÷ E
[Figure: dependence graph of P1–P5 obtained by applying Bernstein's conditions pairwise.]
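Bernstein's conditions reduce to simple set intersections, so they are easy to check mechanically. A minimal Python sketch (my construction, not from the slides) applies them to the five statements above and lists the parallelizable pairs:

# Bernstein's conditions: Pi and Pj can run in parallel iff
#   Ii ∩ Oj = ∅,  Ij ∩ Oi = ∅,  Oi ∩ Oj = ∅
from itertools import combinations

# (input set, output set) for each statement above
procs = {
    "P1": ({"D", "E"}, {"C"}),   # C = D × E
    "P2": ({"G", "C"}, {"M"}),   # M = G + C
    "P3": ({"B", "C"}, {"A"}),   # A = B + C
    "P4": ({"L", "M"}, {"C"}),   # C = L + M
    "P5": ({"G", "E"}, {"F"}),   # F = G ÷ E
}

def parallel(p, q):
    (i1, o1), (i2, o2) = procs[p], procs[q]
    return not (i1 & o2 or i2 & o1 or o1 & o2)

for p, q in combinations(procs, 2):
    if parallel(p, q):
        print(p, "||", q)
# Prints: P1 || P5, P2 || P3, P2 || P5, P3 || P5, P4 || P5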
Hardware parallelism
Mismatch between software and hardware parallelism - 1
[Figure: maximum software parallelism (L = load, X/+/- = arithmetic): cycle 1 issues loads L1-L4, cycle 2 the multiplies X1 and X2, and cycle 3 the + and - that produce A and B.]
Mismatch between software and hardware parallelism - 2
[Figure: the same computation under limited hardware parallelism: L1 issues in cycle 1, and the + and - complete in cycles 6 and 7, producing A and B.]
Mismatch between software and hardware parallelism - 3
[Figure: the same computation on two processors: L1/L3 issue in cycle 1 and L2/L4 in cycle 2; loads L5 and L6 are inserted for synchronization (cycle 5), and the + and - complete in cycle 6, producing A and B.]
Software parallelism
Levels of software parallelism, coarsest to finest. Moving toward finer grains raises the degree of available parallelism, but also the communication demand and scheduling overhead:
• Jobs or programs } coarse grain
• Related parts of a program
• Procedures, subroutines, tasks, or coroutines } medium grain
• Non-recursive loops or unfolded iterations
• Instructions or statements } fine grain
Instruction Level Parallelism
• Two questions:
o How can I partition a program into parallel “pieces” to yield the shortest
execution time?
o What is the optimal size of parallel grains?
• There is an obvious tradeoff between the time spent scheduling and
synchronizing parallel grains and the speedup obtained by parallel
execution.
• One approach to the problem is called “grain packing.”
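To make the tradeoff concrete, a toy cost model in Python (entirely my construction; W, p, h, and the cost formula are illustrative assumptions, not from the slides):

# Work W is split into g equal grains on p processors, and every grain
# pays a fixed scheduling/synchronization overhead h.
def exec_time(W, g, p, h):
    grain = W / g              # work per grain
    rounds = -(-g // p)        # ceil(g / p): grains execute p at a time
    return rounds * grain + g * h

W, p, h = 10_000.0, 8, 5.0
for g in (2, 4, 8, 64, 512):
    print(g, exec_time(W, g, p, h))
# Too few grains leave processors idle (g=2 -> 5010.0); very fine grains
# drown in overhead (g=512 -> 3810.0); the sweet spot here is g = p = 8 (1290.0).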
Program Graphs and Packing
[Figure: a program graph and its two-processor (P1, P2) schedules. Graph nodes are labeled (name, execution time), e.g. A,4; edges are labeled (variable, communication delay), e.g. a,8 or c,1. In the schedules, idle (I) slots appear while a processor waits for communicated results; completion times run to 23 and 27.]
Schedule with node duplication
[Figure: node duplication: copies A',4 and C',1 of nodes A and C are placed on both processors so that their results need not be communicated; the resulting two-processor schedule completes by time 14.]
Grain determination and scheduling optimization
Four major steps, applied to program graphs like those above:
• construct a fine-grain program graph
• schedule the fine-grain computation
• perform grain packing to produce coarse grains
• generate a parallel schedule based on the packed graph
Data flow features
• No need for
o shared memory
o program counter
o control sequencer
• Special mechanisms are required to
o detect data availability
o match data tokens with instructions needing them
o enable chain reaction of asynchronous instruction execution
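As a concrete illustration of these mechanisms, a minimal Python sketch (my construction, not an actual dataflow machine's implementation): each instruction fires as soon as all of its operand tokens are available, and each result token may enable further instructions, giving the chain reaction of asynchronous execution. The example program and token values are hypothetical.

from collections import deque

# instruction: destination -> (operation, operand names)
program = {
    "t1": ("add", ("b", "c")),
    "t2": ("mul", ("t1", "d")),
    "t3": ("sub", ("t1", "b")),
    "out": ("add", ("t2", "t3")),
}
ops = {"add": lambda x, y: x + y,
       "sub": lambda x, y: x - y,
       "mul": lambda x, y: x * y}

tokens = {"b": 2, "c": 3, "d": 4}          # initial data tokens
ready = deque(d for d, (_, srcs) in program.items()
              if all(s in tokens for s in srcs))
fired = set()
while ready:
    dest = ready.popleft()
    op, srcs = program[dest]
    tokens[dest] = ops[op](*(tokens[s] for s in srcs))   # fire the instruction
    fired.add(dest)
    # token matching: enqueue any instruction whose operands are now all present
    for d, (_, ss) in program.items():
        if d not in fired and d not in ready and all(s in tokens for s in ss):
            ready.append(d)
print(tokens["out"])   # ((b+c)*d) + ((b+c)-b) = 20 + 3 = 23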
A Dataflow Architecture - 1
Demand-driven mechanisms
• String-reduction model:
o each demander gets a separate copy of the expression string to
evaluate
o each reduction step has an operator and embedded reference to
demand the corresponding operands
o each operator is suspended while arguments are evaluated
• Graph-reduction model:
o expression graph reduced by evaluation of branches or subgraphs,
possibly in parallel, with demanders given pointers to results of
reductions.
o based on sharing of pointers to arguments; traversal and reversal of
pointers continues until constant arguments are encountered.
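A minimal Python sketch (my construction, not from the slides) contrasting the two models on the expression (a + b) * (a + b); the counter `calls` is a hypothetical device to count reductions:

calls = 0
def add(x, y):
    global calls
    calls += 1
    return x + y

def string_reduce(a, b):
    # string reduction: each demander evaluates its own copy of (a + b)
    return add(a, b) * add(a, b)

def graph_reduce(a, b):
    # graph reduction: demanders share a pointer to one (a + b) node,
    # which is overwritten by its value the first time it is reduced
    node = {"op": add, "args": (a, b), "value": None}
    def demand(n):
        if n["value"] is None:
            n["value"] = n["op"](*n["args"])
        return n["value"]
    return demand(node) * demand(node)

print(string_reduce(2, 3), calls)   # 25 2  (two evaluations of a + b)
calls = 0
print(graph_reduce(2, 3), calls)    # 25 1  (the shared node is reduced once)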
System Interconnect Architectures
• Bisection Width:
o Channel bisection width b: the minimum number of edges cut when the network is divided into two equal halves
o Each channel consists of w bit wires
o Wire bisection width B = b × w; B reflects the wiring density of the network and is a good indicator of the maximum communication bandwidth along the bisection of the network
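A worked example in Python (the per-topology channel bisection values are standard results for these networks, not stated on this slide; the 32-bit channel width is an assumption):

# Channel bisection width b and wire bisection width B = b * w
# for a few static topologies of N nodes.
def channel_bisection(topology, N):
    if topology == "linear array":
        return 1                    # one edge crosses the middle cut
    if topology == "ring":
        return 2                    # the cut severs two edges
    if topology == "2D mesh":       # r x r mesh, N = r * r
        return int(N ** 0.5)        # r edges cross the middle column
    if topology == "hypercube":
        return N // 2               # N/2 edges cross any dimension cut
    raise ValueError(topology)

w = 32                              # assume 32-bit-wide channels
for t in ("linear array", "ring", "2D mesh", "hypercube"):
    b = channel_bisection(t, 16)
    print(f"{t:12s} b={b:2d} B={b * w}")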
Terminology - 1
• Shifting
• Rotating
• Permutation (one to one)
• Broadcast (one to all)
• Multicast (many to many)
• Personalized broadcast (one to many)
• Shuffle
• Exchange
• Etc.
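As a concrete illustration of the last two routing functions (a sketch of the standard definitions; the 3-bit address size is an assumption): the perfect shuffle cyclically rotates a node's binary address left by one bit, and the exchange complements its least significant bit.

N_BITS = 3
N = 1 << N_BITS                    # 8 nodes with 3-bit addresses

def shuffle(x):
    # rotate the N_BITS-bit address left by one bit
    return ((x << 1) | (x >> (N_BITS - 1))) & (N - 1)

def exchange(x):
    # complement the least significant address bit
    return x ^ 1

print([shuffle(x) for x in range(N)])   # [0, 2, 4, 6, 1, 3, 5, 7]
print([exchange(x) for x in range(N)])  # [1, 0, 3, 2, 5, 4, 7, 6]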
Permutations
• A permutation is a one-to-one mapping of the n sources onto the n destinations; n nodes admit n! distinct permutations.
Factors Affecting Performance
• Functionality
• Network latency
• Bandwidth
• Hardware complexity
• Scalability
Static Networks
• Linear Array
• Ring and Chordal Ring
• Barrel Shifter
• Tree and Star
• Fat Tree
• Mesh and Torus
Static Networks – Linear Array
• N nodes connected by N − 1 links in a line; interior nodes have degree 2 and the two end nodes degree 1. The diameter is N − 1, so the structure does not scale well to large N.
Static Networks – Ring and Chordal Ring
• Like a linear array, but the two end nodes are connected by an nth link; the ring can be unidirectional or bidirectional. The diameter is ⌊n/2⌋ for a bidirectional ring, or n − 1 for a unidirectional ring.
• By adding extra links (e.g. "chords" in a circle), the node degree is increased and we obtain a chordal ring; this reduces the network diameter.
• In the limit, we obtain a fully connected network, with a node degree of n − 1 and a diameter of 1.
Static Networks – Barrel Shifter
• Like a ring, but with additional links between all pairs of nodes that
have a distance equal to a power of 2.
• With a network of size N = 2^n, each node has degree d = 2n − 1, and the network has diameter D = n/2.
• Barrel shifter connectivity is greater than that of any chordal ring of lower node degree.
• The barrel shifter is much less complex than the fully connected network.
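A small Python sketch (my construction, not from the slides) that builds these topologies as adjacency lists and measures their diameters by breadth-first search, confirming the figures above for 16 nodes:

from collections import deque

def ring(n, extra=()):
    # ring links at distance 1, plus optional chords at the given distances
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for d in (1,) + tuple(extra):
            adj[i].add((i + d) % n)
            adj[i].add((i - d) % n)
        adj[i].discard(i)
    return adj

def diameter(adj):
    best = 0
    for s in adj:                        # BFS from every node
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

n = 16
print(diameter(ring(n)))                   # 8 = n/2 for the bidirectional ring
print(diameter(ring(n, extra=(3,))))       # 4: chords shrink the diameter
print(diameter(ring(n, extra=(2, 4, 8))))  # 2 = (log2 n)/2 for the barrel shifter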
Static Networks – Tree and Star