
Program and Network Properties
ACA
Conditions of Parallelism
• The exploitation of parallelism in computing requires
understanding the basic theory associated with it.
Progress is needed in several areas:
• computation models for parallel computing
• interprocessor communication in parallel architectures
• integration of parallel systems into general environments
Data and Resource
Dependencies
• Program segments cannot be executed in parallel unless they are
independent.
• Dependence takes several forms:
• Data dependence: data modified by one segment must not be modified
by another parallel segment.
• Control dependence: if the control flow of segments cannot be identified
before run time, then the data dependence between the segments is
variable.
• Resource dependence: even if several segments are independent in other
ways, they cannot be executed in parallel if there aren’t sufficient
processing resources (e.g. functional units).
Data Dependence - 1
• Flow dependence: S1 precedes S2, and at least one output of S1 is
input to S2.
• Antidependence: S1 precedes S2, and the output of S2 overlaps the
input to S1.
• Output dependence: S1 and S2 write to the same output variable.
• I/O dependence: two I/O statements (read/write) reference the same
variable, and/or the same file.
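• A minimal C sketch (not from the text; variable names are illustrative)
showing the first three dependence types among simple statements:

S1: a = b + c;   /* S1 writes a */
S2: d = a * 2;   /* flow dependence: S2 reads the a written by S1 */
S3: a = e - 1;   /* antidependence with S2 (S2 reads a before S3 overwrites it);
                    output dependence with S1 (both write a) */

• Two statements that read and write the same file would additionally be
I/O dependent.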
Data Dependence - 2
• Unknown dependence:
• The subscript of a variable is itself subscripted.
• The subscript does not contain the loop index variable.
• A variable appears more than once with subscripts having different
coefficients of the loop variable (that is, different functions of the loop
variable).
• The subscript is nonlinear in the loop index variable.
• Parallel execution of program segments which do not have total data
independence can produce non-deterministic results.
Control Dependence
• Control-independent example:
for (i=0;i<n;i++) {
a[i] = c[i];
if (a[i] < 0) a[i] = 1;
}
• Control-dependent example:
for (i=1;i<n;i++) {
if (a[i-1] < 0) a[i] = 1;
}
• Compiler techniques are needed to get around control dependence
limitations.
Resource Dependence
• Data and control dependencies are based on the independence of the
work to be done.
• Resource independence is concerned with conflicts in using shared
resources, such as registers, integer and floating point ALUs, etc.
• ALU conflicts are called ALU dependence.
• Memory (storage) conflicts are called storage dependence.
Bernstein’s Conditions - 1
• Bernstein’s conditions are a set of conditions that must hold for two
processes to execute in parallel.
• Notation
• Ii is the set of all input variables for a process Pi .
• Oi is the set of all output variables for a process Pi .
• If P1 and P2 can execute in parallel (which is written as P1 || P2), then:

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
Bernstein’s Conditions - 2
• In terms of data dependencies, Bernstein’s conditions imply that two
processes can execute in parallel if they are flow-independent,
anti-independent, and output-independent.
• The parallelism relation || is commutative (Pi || Pj implies Pj || Pi ), but
not transitive (Pi || Pj and Pj || Pk does not imply Pi || Pk ) . Therefore,
|| is not an equivalence relation.
• Intersection of the input sets is allowed.
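• As an illustration (a sketch, not part of the original material), the
three conditions can be checked mechanically if each process's input and
output variable sets are encoded as bitmasks over a small numbered set of
variables:

#include <stdint.h>

typedef struct { uint32_t in, out; } proc_sets;   /* bit i set = variable i is used */

int bernstein_parallel(proc_sets p1, proc_sets p2)
{
    return (p1.in  & p2.out) == 0 &&   /* I1 ∩ O2 = ∅ (no flow dependence)   */
           (p2.in  & p1.out) == 0 &&   /* I2 ∩ O1 = ∅ (no antidependence)    */
           (p1.out & p2.out) == 0;     /* O1 ∩ O2 = ∅ (no output dependence) */
}

• Note that p1.in & p2.in is deliberately not tested, since intersection of
the input sets is allowed.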
Hardware Parallelism
• Hardware parallelism is defined by machine architecture and hardware
multiplicity.
• It can be characterized by the number of instructions that can be issued per
machine cycle. If a processor issues k instructions per machine cycle, it is called a
k-issue processor. Conventional processors are one-issue machines.
• Examples. Intel i960CA is a three-issue processor (arithmetic, memory access,
branch). IBM RS-6000 is a four-issue processor (arithmetic, floating-point,
memory access, branch).
• A machine with n k-issue processors should be able to handle a maximum of nk
threads simultaneously.
Software Parallelism
• Software parallelism is defined by the control and data dependence of
programs, and is revealed in the program’s flow graph.
• It is a function of algorithm, programming style, and compiler
optimization.
Mismatch between software and
hardware parallelism - 1

• Figure: maximum software parallelism (L = load, X/+/− = arithmetic).
Cycle 1 issues the four loads L1–L4, cycle 2 the two arithmetic operations
X1 and X2, and cycle 3 the add and subtract that produce A and B, so eight
instructions complete in three cycles.
Mismatch between software
and hardware parallelism - 2
• Same problem, but considering the parallelism on a two-issue superscalar
processor.
• Figure: cycle 1 L1; cycle 2 L2; cycle 3 X1 and L3; cycle 4 L4; cycle 5 X2;
cycle 6 +; cycle 7 −, producing A and B. The same eight instructions now
take seven cycles.
Mismatch between software
and hardware parallelism - 3
• Same problem, with two single-issue processors; the stores S1, S2 and
loads L5, L6 are inserted for interprocessor synchronization.
• Figure: cycle 1 L1 and L3; cycle 2 L2 and L4; cycle 3 X1 and X2; cycle 4
S1 and S2; cycle 5 L5 and L6; cycle 6 + and −, producing A and B, for a
total of six cycles.
Types of Software Parallelism
• Control Parallelism – two or more operations can be performed
simultaneously. This can be detected by a compiler, or a programmer
can explicitly indicate control parallelism by using special language
constructs or dividing a program into multiple processes.
• Data parallelism – multiple data elements have the same operations
applied to them at the same time. This offers the highest potential
for concurrency (in SIMD and MIMD modes). Synchronization in SIMD
machines is handled by hardware.
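• A small C sketch of data parallelism in the style of the earlier loop
examples (array names are illustrative): the same operation is applied to
every element, and because the iterations are independent they can be
spread across SIMD lanes or multiple processors.

for (i = 0; i < n; i++) {
    c[i] = a[i] + b[i];   /* same operation on every element; no cross-iteration dependence */
}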
Solving the Mismatch Problems
• Develop compilation support
• Redesign hardware for more efficient exploitation by compilers
• Use large register files and sustained instruction pipelining.
• Have the compiler fill the branch and load delay slots in code
generated for RISC processors.
The Role of Compilers
• Compilers are used to exploit hardware features to improve performance.
• Interaction between compiler and architecture design is a necessity in
modern computer development.
• It is not necessarily the case that more software parallelism will
improve performance in conventional scalar processors.
• The hardware and compiler should be designed at the same time.
Program Partitioning &
Scheduling
• The size of the parts or pieces of a program that can be considered for
parallel execution can vary.
• The sizes are roughly classified using the term “granule size,” or simply
“granularity.”
• The simplest measure, for example, is the number of instructions in a
program part.
• Grain sizes are usually described as fine, medium or coarse,
depending on the level of parallelism involved.
Latency
• Latency is the time required for communication between different
subsystems in a computer.
• Memory latency, for example, is the time required by a processor to
access memory.
• Synchronization latency is the time required for two processes to
synchronize their execution.
• Computational granularity and communication latency are closely
related.
Levels of Parallelism

• Coarse grain: jobs or programs; subprograms, job steps, or related parts
of a program.
• Medium grain: procedures, subroutines, tasks, or coroutines.
• Fine grain: non-recursive loops or unfolded iterations; instructions or
statements.
• Moving toward the coarser levels increases communication demand and
scheduling overhead; moving toward the finer levels gives a higher degree
of parallelism.
Instruction Level Parallelism
• This fine-grained, or smallest-granularity, level typically involves fewer
than 20 instructions per grain. The number of candidates for parallel
execution varies from 2 to thousands, with about five instructions or
statements being the average level of parallelism.
• Advantages:
• There are usually many candidates for parallel execution
• Compilers can usually do a reasonable job of finding this parallelism
Loop-level Parallelism
• A typical loop contains fewer than 500 instructions.
• If a loop's iterations are independent of each other, the loop can be
handled by a pipeline, or by a SIMD machine.
• This is the most optimized program construct for execution on a parallel
or vector machine.
• Some loops (e.g. recursive) are difficult to handle.
• Loop-level parallelism is still considered fine grain computation.
Procedure-level Parallelism
• Medium-sized grain; usually less than 2000 instructions.
• Detection of parallelism is more difficult than with smaller grains;
interprocedural dependence analysis is difficult and history-sensitive.
• Communication requirement less than instruction-level
• SPMD (single procedure multiple data) is a special case
• Multitasking belongs to this level.
Subprogram-level Parallelism
• Job step level; grain typically has thousands of instructions; medium-
or coarse-grain level.
• Job steps can overlap across different jobs.
• Multiprogramming is conducted at this level.
• No compilers available to exploit medium- or coarse-grain parallelism
at present.
Job or Program-Level Parallelism
• Corresponds to execution of essentially independent jobs or programs
on a parallel computer.
• This is practical for a machine with a small number of powerful
processors, but impractical for a machine with a large number of
simple processors (since each processor would take too long to
process a single job).
Program Flow Mechanisms
• Conventional machines use a control flow mechanism in which the
order of program execution is explicitly stated in the user program.
• Dataflow machines execute instructions as soon as their operands
become available.
• Reduction machines trigger an instruction’s execution
based on the demand for its results.
Control Flow vs. Data Flow
• Control flow machines use shared memory for instructions and
data. Since variables are updated by many instructions, there may
be side effects on other instructions. These side effects frequently
prevent parallel processing. Single processor systems are
inherently sequential.
• Instructions in dataflow machines are unordered and can be
executed as soon as their operands are available; data is held in
the instructions themselves. Data tokens are passed from an
instruction to its dependents to trigger execution.
Data Flow Features
• No need for
• shared memory
• program counter
• control sequencer
• Special mechanisms are required to
• detect data availability
• match data tokens with instructions needing them
• enable chain reaction of asynchronous instruction execution
A Dataflow Architecture - 1
• The Arvind machine (MIT) has N PEs and an N-by-N
interconnection network.
• Each PE has a token-matching mechanism that dispatches only
instructions with data tokens available.
• Each datum is tagged with
• address of instruction to which it belongs
• context in which the instruction is being executed
• Tagged tokens enter PE through local path (pipelined), and can
also be communicated to other PEs through the routing network.
A Dataflow Architecture - 2
• Instruction address(es) effectively replace the program counter in a
control flow machine.
• Context identifier effectively replaces the frame base register in a
control flow machine.
• Since the dataflow machine matches the data tags from one
instruction with successors, synchronized instruction execution is
implicit.
A Dataflow Architecture - 3
• An I-structure in each PE is provided to eliminate excessive copying of
data structures.
• Each word of the I-structure has a two-bit tag indicating whether the
value is empty, full, or has pending read requests.
• This is a retreat from the pure dataflow approach.
• Example 2.6 shows a control flow and dataflow comparison.
• Special compiler technology needed for dataflow machines.
Demand-Driven Mechanisms
• Data-driven machines select instructions for execution based on
the availability of their operands; this is essentially a bottom-up
approach.
• Demand-driven machines take a top-down approach, attempting
to execute the instruction (a demander) that yields the final
result. This triggers the execution of instructions that yield its
operands, and so forth.
• The demand-driven approach matches naturally with functional
programming languages (e.g. LISP and SCHEME).
Reduction Machine Models
• String-reduction model:
• each demander gets a separate copy of the expression string to evaluate
• each reduction step has an operator and embedded reference to demand
the corresponding operands
• each operator is suspended while arguments are evaluated
• Graph-reduction model:
• expression graph reduced by evaluation of branches or subgraphs, possibly
in parallel, with demanders given pointers to results of reductions.
• based on sharing of pointers to arguments; traversal and reversal of
pointers continues until constant arguments are encountered.
System Interconnect Architectures
• Direct networks for static connections
• Indirect networks for dynamic connections
• Networks are used for
• internal connections in a centralized system among
• processors
• memory modules
• I/O disk arrays
• distributed networking of multicomputer nodes
Goals and Analysis
• The goals of an interconnection network are to provide
• low-latency
• high data transfer rate
• wide communication bandwidth
• Analysis includes
• latency
• bisection bandwidth
• data-routing functions
• scalability of parallel architecture
Network Properties and Routing
• Static networks: point-to-point direct connections that will not change
during program execution
• Dynamic networks:
• switched channels dynamically configured to match user program
communication demands
• include buses, crossbar switches, and multistage networks
• Both network types also used for inter-PE data routing in SIMD
computers
Terminology - 1
• Network usually represented by a graph with a finite number of nodes linked by
directed or undirected edges.
• Number of nodes in graph = network size .
• Number of edges (links or channels) incident on a node = node degree d (also
note in and out degrees when edges are directed). Node degree reflects number
of I/O ports associated with a node, and should ideally be small and constant.
• Diameter D of a network is the maximum shortest path between any two nodes,
measured by the number of links traversed; this should be as small as possible
(from a communication point of view).
Terminology - 2
• Channel bisection width b = minimum number of edges cut to split
a network into two parts each having the same number of nodes.
Since each channel has w bit wires, the wire bisection width B =
bw. Bisection width provides good indication of maximum
communication bandwidth along the bisection of a network, and
all other cross sections should be bounded by the bisection width.
• Wire (or channel) length = length (e.g. weight) of edges between
nodes.
• Network is symmetric if the topology is the same looking from any
node; these are easier to implement or to program.
• Other useful characterizing properties: homogeneous nodes?
buffered channels? nodes are switches?
Data Routing Functions
• Shifting
• Rotating
• Permutation (one to one)
• Broadcast (one to all)
• Multicast (many to many)
• Personalized broadcast (one to many)
• Shuffle
• Exchange
• Etc.
Permutations
• Given n objects, there are n! ways in which they can be reordered
(one of which is no reordering).
• A permutation can be specified by giving the rule for reordering a
group of objects.
• Permutations can be implemented using crossbar switches,
multistage networks, shifting, and broadcast operations. The time
required to perform permutations of the connections between nodes
often dominates the network performance when n is large.
Perfect Shuffle and Exchange
• Stone suggested the special permutation that reorders entries according
to the mapping of the k-bit binary number a b … k to b c … k a (that is,
shifting 1 bit to the left and wrapping it around to the least significant
bit position).
• The inverse perfect shuffle reverses the effect of the perfect shuffle.
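• A C sketch (assumed helper functions, not from the text) of the perfect
shuffle and its inverse on a k-bit node address, implemented as 1-bit left
and right rotations:

unsigned shuffle(unsigned addr, unsigned k)      /* rotate the k-bit address left by one bit  */
{
    unsigned mask = (1u << k) - 1;
    return ((addr << 1) | (addr >> (k - 1))) & mask;
}

unsigned unshuffle(unsigned addr, unsigned k)    /* rotate the k-bit address right by one bit */
{
    unsigned mask = (1u << k) - 1;
    return ((addr >> 1) | ((addr & 1u) << (k - 1))) & mask;
}

• For example, with k = 3, shuffle maps node 110 to 101, and unshuffle maps
101 back to 110.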
Hypercube Routing Functions
• If the vertices of an n-dimensional cube are labeled with n-bit numbers
so that only one bit differs between each pair of adjacent vertices,
then n routing functions are defined by the bits in the node (vertex)
address.
• For example, with a 3-dimensional cube, we can easily identify routing
functions that exchange data between nodes with addresses that
differ in the least significant, most significant, or middle bit.
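• A C sketch of these routing functions (the function name is illustrative):
the i-th hypercube routing function simply complements bit i of the node
address, connecting each node to the neighbor that differs only in that bit.

unsigned cube_route(unsigned node, unsigned i)
{
    return node ^ (1u << i);   /* neighbor across dimension i */
}

• In a 3-dimensional cube, cube_route(node, 0) exchanges data across the
least significant bit and cube_route(node, 2) across the most significant
bit.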
Factors Affecting Performance
• Functionality – how the network supports data routing, interrupt handling,
synchronization, request/message combining, and coherence
• Network latency – worst-case time for a unit message to be transferred
• Bandwidth – maximum data rate
• Hardware complexity – implementation costs for wire, logic, switches,
connectors, etc.
• Scalability – how easily does the scheme adapt to an increasing number of
processors, memories, etc.?
Static Networks
• Linear Array
• Ring and Chordal Ring
• Barrel Shifter
• Tree and Star
• Fat Tree
• Mesh and Torus
Static Networks – Linear Array
• N nodes connected by N − 1 links (not a bus); segments between
different pairs of nodes can be used in parallel.
• Internal nodes have degree 2; end nodes have degree 1.
• Diameter = N − 1
• Bisection width = 1
• For small N, this is economical, but for large N, it is obviously
inappropriate.
Static Networks – Ring,
Chordal Ring
• Like a linear array, but the two end nodes are connected by an n-th
link; the ring can be uni- or bidirectional. Diameter is n/2 for a
bidirectional ring, or n for a unidirectional ring.
• By adding additional links (e.g. “chords” in a circle), the node degree
is increased, and we obtain a chordal ring. This reduces the network
diameter.
• In the limit, we obtain a fully-connected network, with a node degree
of n -1 and a diameter of 1.
Static Networks – Barrel Shifter
• Like a ring, but with additional links between all pairs of nodes that
have a distance equal to a power of 2.
• With a network of size N = 2^n, each node has degree d = 2n − 1, and
the network has diameter D = n/2.
• Barrel shifter connectivity is greater than any chordal ring of lower
node degree.
• Barrel shifter much less complex than fully-interconnected network.
Static Networks – Tree and Star
• A k-level completely balanced binary tree will have N = 2^k − 1 nodes,
with a maximum node degree of 3 and a network diameter of 2(k − 1).
• The balanced binary tree is scalable, since it has a constant maximum
node degree.
• A star is a two-level tree with a maximum node degree d = N − 1 and a
constant diameter of 2.
Static Networks – Fat Tree
• A fat tree is a tree in which the number of edges between nodes
increases closer to the root (similar to the way the thickness of limbs
increases in a real tree as we get closer to the root).
• The edges represent communication channels (“wires”), and since
communication traffic increases as the root is approached, it seems
logical to increase the number of channels there.
Static Networks – Mesh and Torus
• Pure mesh – N = n^k nodes with links between each adjacent pair of nodes
in a row or column (or higher dimension). This is not a symmetric network;
interior node degree d = 2k, diameter = k(n − 1).
• Illiac mesh (used in the Illiac IV computer) – wraparound is allowed, thus
reducing the network diameter to about half that of the equivalent pure
mesh.
• A torus has ring connections in each dimension, and is symmetric. An
n × n binary torus has node degree 4 and diameter 2⌊n/2⌋.
Static Networks – Systolic Array
• A systolic array is an arrangement of processing elements and
communication links designed specifically to match the computation
and communication requirements of a specific algorithm (or class of
algorithms).
• This specialized character may yield better performance than more
generalized structures, but also makes them more expensive, and
more difficult to program.
Static Networks – Hypercubes
• A binary n-cube architecture with N = 2^n nodes spanning along n
dimensions, with two nodes per dimension.
• The hypercube scalability is poor, and packaging is difficult for higher-
dimensional hypercubes.
Static Networks – Cube-connected
Cycles
• k-cube connected cycles (CCC) can be created from a k-cube by
replacing each vertex of the k-dimensional hypercube by a ring of k
nodes.
• A k-cube can be transformed into a k-CCC with k × 2^k nodes.
• The major advantage of a CCC is that each node has a constant degree
(at the cost of longer latency than in the corresponding k-cube). In that
respect, it is more scalable than the hypercube architecture.
Static Networks – k-ary n-Cubes
• Rings, meshes, tori, binary n-cubes, and Omega networks (to be seen)
are topologically isomorphic to a family of k-ary n-cube networks.
• n is the dimension of the cube, and k is the radix, or number of nodes
in each dimension.
• The number of nodes in the network is N = k^n.
• Folding (alternating nodes between connections) can be used to avoid
the long “end-around” delays in the traditional implementation.
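• A C sketch of the basic size arithmetic, assuming wraparound (ring)
connections in every dimension:

unsigned long kary_ncube_nodes(unsigned k, unsigned n)    /* N = k^n */
{
    unsigned long N = 1;
    while (n-- > 0)
        N *= k;
    return N;
}

unsigned kary_ncube_diameter(unsigned k, unsigned n)      /* n * floor(k/2) hops */
{
    return n * (k / 2);
}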
Static Networks – k-ary n-Cubes
• The cost of k-ary n-cubes is dominated by the amount of wire, not the
number of switches.
• With constant wire bisection, low-dimensional networks with wider
channels provide lower latency, less contention, and higher “hot-spot”
throughput than higher-dimensional networks with narrower channels.
Network Throughput
• Network throughput – number of messages a network can handle in a unit time
interval.
• One way to estimate is to calculate the maximum number of messages that can
be present in a network at any instant (its capacity); throughput usually is some
fraction of its capacity.
• A hot spot is a pair of nodes that accounts for a disproportionately large portion
of the total network traffic (possibly causing congestion).
• Hot spot throughput is maximum rate at which messages can be sent between
two specific nodes.
Minimizing Latency
• Latency is minimized when the network radix k and dimension n are
chosen so as to make the components of latency due to distance (number
of hops) and the message aspect ratio L/W (message length L divided
by the channel width W) approximately equal.
• This occurs at a very low dimension. For up to 1024 nodes, the best
dimension (in this respect) is 2.
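• A sketch of this balance, under a simple assumed model in which latency
is the sum of a distance term (average hop count) and a serialization term
L/W, both expressed in the same cycle units:

double est_latency(double avg_hops, double msg_length, double channel_width)
{
    return avg_hops + msg_length / channel_width;   /* minimized when the two terms are comparable */
}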
Dynamic Connection Networks
• Dynamic connection networks can implement all communication patterns based
on program demands.
• In increasing order of cost and performance, these include
• bus systems
• multistage interconnection networks
• crossbar switch networks
• Price can be attributed to the cost of wires, switches, arbiters, and connectors.
• Performance is indicated by network bandwidth, data transfer rate, network
latency, and communication patterns supported.
Dynamic Networks – Bus Systems
• A bus system (contention bus, time-sharing bus) has
• a collection of wires and connectors
• multiple modules (processors, memories, peripherals, etc.) which connect to the wires
• data transactions between pairs of modules
• Bus supports only one transaction at a time.
• Bus arbitration logic must deal with conflicting requests.
• Lowest cost and bandwidth of all dynamic schemes.
• Many bus standards are available.
Dynamic Networks – Switch
Modules
• An a × b switch module has a inputs and b outputs. A binary switch
has a = b = 2.
• It is not necessary that a = b, but usually a = b = 2^k for some integer k.
• In general, any input can be connected to one or more of the outputs.
However, multiple inputs may not be connected to the same output.
• When only one-to-one mappings are allowed, the switch is called a
crossbar switch.
Multistage Networks
• In general, any multistage network is comprised of a collection of a × b
switch modules and fixed network modules. The a × b switch modules are used
to provide variable permutation or other reordering of the inputs, which
are then further reordered by the fixed network modules.
• A generic multistage network consists of a sequence of alternating dynamic
switches (with relatively small values for a and b) and static networks
(with larger numbers of inputs and outputs). The static networks are used
to implement interstage connections (ISC).
Omega Network
• A 2 × 2 switch can be configured for
• Straight-through
• Crossover
• Upper broadcast (upper input to both outputs)
• Lower broadcast (lower input to both outputs)
• (No output is a somewhat vacuous possibility as well)
• With four stages of eight 2 × 2 switches, and a static perfect shuffle for
each of the four ISCs, a 16 × 16 Omega network can be constructed (but not
all permutations are possible).
• In general, an n-input Omega network requires log₂ n stages, each
containing n/2 of the 2 × 2 switch modules.
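• A C sketch of the usual destination-tag routing rule for such a network
(assumed here as an illustration, not stated above): at each stage the next
bit of the destination address, taken from most significant to least
significant, selects the upper (0) or lower (1) output of the 2 × 2 switch.

#include <stdio.h>

void omega_route(unsigned dest, unsigned stages)   /* stages = log2(network size) */
{
    for (unsigned s = 0; s < stages; s++) {
        unsigned bit = (dest >> (stages - 1 - s)) & 1u;
        printf("stage %u: %s output\n", s, bit ? "lower" : "upper");
    }
}

• For the 16-input network above, stages = 4.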
Baseline Network
• A baseline network can be shown to be topologically equivalent to
other networks (including Omega), and has a simple recursive
generation procedure.
• Stage k (k = 0, 1, …) is an m × m switch block (where m = N / 2^k)
composed entirely of 2 × 2 switch blocks, each having two configurations:
straight through and crossover.
4 × 4 Baseline Network
Crossbar Networks
• An m × n crossbar network can be used to provide a constant-latency
connection between devices; it can be thought of as a single-stage switch.
• Different types of devices can be connected, yielding different constraints
on which switches can be enabled.
• With m processors and n memories, one processor may be able to generate
requests for multiple memories in sequence; thus several switches might be
set in the same row.
• For m × m interprocessor communication, each PE is connected to both an
input and an output of the crossbar; only one switch in each row and column
can be turned on simultaneously. Additional control processors are used to
manage the crossbar itself.
