Program and Network Properties
• Introduction
• Conditions of parallelism
o Data dependence and resource dependence
o Hardware and software parallelism
o The role of compilers
• Program partitioning and scheduling
o Grain size and latency
o Grain packing and scheduling
• Program flow mechanisms
• System interconnect architecture
o Network properties and routing
o Static connection networks
o Dynamic connection networks
• Summary
• Keywords
2.0 Objective
In this lesson we will study the fundamental properties of programs and how parallelism
can be introduced into a program. We will study granularity, the partitioning of
programs, program flow mechanisms, and compilation support for parallelism.
Interconnection architectures, both static and dynamic, will also be discussed.
2.1 Introduction
The advantage of multiprocessors is realized only when the parallelism in a program is
fully exploited and implemented using multiple processors. Thus, in order to implement
parallelism we should understand the various conditions of parallelism and the
bottlenecks that arise in implementing it. For full exploitation of parallelism there are
three significant areas to be understood, namely computation models for parallel
computing, interprocessor communication in parallel architectures, and system
integration for incorporating parallel systems. A multiprocessor system thus poses a
number of problems that are not encountered in sequential processing, such as designing
a parallel algorithm for the application, partitioning the application into tasks,
coordinating communication and synchronization, and scheduling the tasks onto the
machine.
2.2 Conditions of parallelism
The ability to execute several program segments in parallel requires each segment to be
independent of the other segments. We use a dependence graph to describe the relations.
The nodes of a dependence graph correspond to program statements (instructions), and
directed edges with different labels are used to represent the ordered relations among the
statements. The analysis of dependence graphs shows where opportunities exist for
parallelization and vectorization.
2.2.1 Data and resource Dependence
Data dependence: The ordering relationship between statements is indicated by data
dependence. Five types of data dependence are defined below:
1. Flow dependence: A statement S2 is flow dependent on S1 if an execution path exists
from S1 to S2 and if at least one output (a variable assigned) of S1 feeds in as input to S2.
2. Antidependence: S2 is antidependent on S1 if S2 follows S1 in program order and if an
output of S2 overlaps an input to S1.
3. Output dependence: Two statements are output dependent if they produce (write) the
same output variable.
4. I/O dependence: Read and write are I/O statements; I/O dependence occurs when the
same file is referenced by both I/O statements.
5. Unknown dependence: The dependence relation cannot be determined, for example
when a subscript is itself subscripted or does not contain the loop index variable.
Resource dependence: This differs from data dependence in that it is concerned with
conflicts in using shared resources, such as functional units, registers, or memory areas,
among parallel events.
Bernstein's Conditions
Bernstein defined a set of conditions under which two processes can execute in parallel.
Let Ii be the input set (variables read) and Oi the output set (variables written) of process
Pi. Two processes P1 and P2 can execute in parallel (denoted P1 || P2) if and only if
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, and O1 ∩ O2 = ∅.
In terms of data dependences, Bernstein's conditions imply that two processes can
execute in parallel if they are flow-independent, anti-independent, and output-
independent. The parallelism relation || is commutative (Pi || Pj implies Pj || Pi), but not
transitive (Pi || Pj and Pj || Pk does not imply Pi || Pk). Therefore, || is not an equivalence
relation. Intersection of the input sets is allowed.
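Below is a minimal Python sketch of how Bernstein's conditions can be checked when each process is described by its input set I and output set O of variables; the statements and variable names in the examples are illustrative, not taken from the text.

# Minimal sketch: checking Bernstein's conditions for two processes,
# each described by its input set I and output set O of variables.

def bernstein_parallel(I1, O1, I2, O2):
    """P1 || P2 iff I1 ∩ O2 = ∅, I2 ∩ O1 = ∅ and O1 ∩ O2 = ∅.
    Note that I1 ∩ I2 may be non-empty: shared reads are allowed."""
    flow_independent = not (I2 & O1)     # no output of P1 feeds P2
    anti_independent = not (I1 & O2)     # no output of P2 feeds P1
    output_independent = not (O1 & O2)   # they do not write the same variable
    return flow_independent and anti_independent and output_independent

# P1: C = A + B      P2: D = A * E      -> can run in parallel
print(bernstein_parallel({"A", "B"}, {"C"}, {"A", "E"}, {"D"}))  # True
# P1: C = A + B      P3: E = C + 1      -> P3 is flow dependent on P1
print(bernstein_parallel({"A", "B"}, {"C"}, {"C"}, {"E"}))       # False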
2.2.2 Hardware and software parallelism
Hardware parallelism is defined by the machine architecture and hardware multiplicity,
i.e., functional parallelism times processor parallelism. It can be characterized by the
number of instructions that can be issued per machine cycle. If a processor issues k
instructions per machine cycle, it is called a k-issue processor. Conventional processors
are one-issue machines. This provides the user with information about the peak attainable
performance. Examples: the Intel i960CA is a three-issue processor (arithmetic, memory
access, branch); the IBM RS/6000 is a four-issue processor (arithmetic, floating-point,
memory access, branch). A machine with n k-issue processors should be able to handle a
maximum of nk threads simultaneously.
Software Parallelism
Software parallelism is defined by the control and data dependences of a program and is
revealed in the program's flow graph, i.e., it is defined by the dependencies within the
code and is a function of the algorithm, programming style, and compiler optimization.
2.2.3 The Role of Compilers
Compilers are used to exploit hardware features to improve performance. Interaction
between compiler and architecture design is a necessity in modern computer
development. It is not necessarily the case that more software parallelism will improve
performance in conventional scalar processors. The hardware and compiler should be
designed at the same time.
2.3 Program Partitioning and Scheduling
2.3.1 Grain size and latency
The size of the parts or pieces of a program that can be considered for parallel execution
can vary. The sizes are roughly classified using the term “granule size,” or simply
“granularity.” The simplest measure, for example, is the number of instructions in a
program part. Grain sizes are usually described as fine, medium or coarse, depending on
the level of parallelism involved.
Latency
Latency is the time required for communication between different subsystems in a
computer. Memory latency, for example, is the time required by a processor to access
memory. Synchronization latency is the time required for two processes to synchronize
their execution. Computational granularity and communication latency are closely
related. Latency and grain size are interrelated; some general observations (illustrated by
the sketch after this list) are:
• As grain size decreases, potential parallelism increases, and overhead also
increases.
• Overhead is the cost of parallelizing a task. The principal overhead is
communication latency.
• As grain size is reduced, there are fewer operations between communications, and
hence the impact of latency increases.
• Surface-to-volume effect: the ratio of inter-node communication ("surface") to
intra-node computation ("volume").
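As a rough illustration of these observations, the Python sketch below uses a simple, assumed cost model: a program of N operations is split into grains of g operations each, every grain pays a fixed communication latency L, and grains are scheduled in rounds across p processors. The model and all numbers are illustrative only.

# Illustrative model of the grain-size/latency trade-off: a work of N operations
# is split into N/g grains; each grain pays a fixed communication latency L.
# Smaller grains expose more parallelism but pay the latency more often.

def time_on_p_processors(N, g, L, p, t_op=1.0):
    grains = N // g                       # number of grains (assume g divides N)
    per_grain = g * t_op + L              # compute time + communication overhead
    rounds = -(-grains // p)              # ceil(grains / p) scheduling rounds
    return rounds * per_grain

N, L, p = 4096, 50.0, 8
for g in (16, 64, 256, 1024):
    print(f"grain size {g:5d}: time = {time_on_p_processors(N, g, L, p):8.1f}")

For these (made-up) parameters the best time is obtained at an intermediate grain size: very small grains are dominated by latency, while very large grains leave processors idle.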
Levels of Parallelism
Instruction Level Parallelism
This fine-grained, or smallest-granularity, level typically involves fewer than 20
instructions per grain. The number of candidates for parallel execution varies from two to
thousands, with the average level of parallelism being about five instructions or
statements.
Advantages:
• There are usually many candidates for parallel execution.
• Compilers can usually do a reasonable job of finding this parallelism.
Loop-level Parallelism
A typical loop has fewer than 500 instructions. If the loop iterations are independent of
one another, the loop can be handled by a pipeline or by a SIMD machine. Loops are the
most optimized program construct to execute on a parallel or vector machine. Some loops
(e.g. recursive ones) are difficult to handle. Loop-level parallelism is still considered fine-grain
computation.
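The contrast can be seen in the small Python sketch below: the first loop has independent iterations and could be executed by a pipeline or SIMD machine, while the second carries a dependence from one iteration to the next. The arrays and values are illustrative.

# Illustrative contrast between a loop whose iterations are independent
# (parallelizable at loop level) and one with a loop-carried dependence.

a = [1.0] * 8
b = [2.0] * 8
c = [0.0] * 8

# Independent iterations: c[i] depends only on a[i] and b[i],
# so all iterations could execute simultaneously (SIMD / pipeline).
for i in range(8):
    c[i] = a[i] + b[i]

# Loop-carried (recursive) dependence: iteration i needs the result of
# iteration i - 1, so the iterations must execute sequentially.
s = [0.0] * 8
s[0] = a[0]
for i in range(1, 8):
    s[i] = s[i - 1] + a[i]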
Procedure-level Parallelism
Medium-sized grain; usually less than 2000 instructions. Detection of parallelism is more
difficult than with smaller grains; interprocedural dependence analysis is difficult and
history-sensitive. The communication requirement is less than at the instruction level.
SPMD (single procedure multiple data) is a special case. Multitasking belongs to this level.
Subprogram-level Parallelism
Job step level; a grain typically has thousands of instructions; medium- or coarse-grain
level. Job steps can overlap across different jobs. Multiprogramming is conducted at this
level. No compilers are available to exploit medium- or coarse-grain parallelism at present.
Job or Program-Level Parallelism
Corresponds to execution of essentially independent jobs or programs on a parallel
computer. This is practical for a machine with a small number of powerful processors,
but impractical for a machine with a large number of simple processors (since each
processor would take too long to process a single job).
Communication Latency
Balancing granularity and latency can yield better performance. The various latencies are
attributable to the machine architecture, the technology, and the communication patterns
used. Latency imposes a limiting factor on machine scalability. For example, memory
latency increases as memory capacity increases, limiting the amount of memory that can
be used with a given tolerance for communication latency.
Interprocessor Communication Latency
• Needs to be minimized by the system designer.
• Affected by signal delays and communication patterns. For example, n communicating
tasks may require n(n - 1)/2 communication links, and this complexity grows
quadratically, effectively limiting the number of processors in the system.
Communication Patterns
• Determined by the algorithms used and the architectural support provided.
• Patterns include permutation, broadcast, multicast, and conference (many-to-many).
• Tradeoffs often exist between the granularity of parallelism and communication
demand.
2.3.2 Grain Packing and Scheduling
Two questions:
How can I partition a program into parallel “pieces” to yield the shortest execution time?
What is the optimal size of parallel grains?
There is an obvious tradeoff between the time spent scheduling and synchronizing
parallel grains and the speedup obtained by parallel execution.
One approach to the problem is called “grain packing.”
Program Graphs and Packing
A program graph is similar to a dependence graph. Nodes = { (n, s) }, where n = node
name and s = size (a larger s means a larger grain size).
Edges = { (v, d) }, where v = the variable being "communicated" and d = the
communication delay.
Packing two (or more) nodes produces a node with a larger grain size and possibly more
edges to other nodes. Packing is done to eliminate unnecessary communication delays or
reduce overall scheduling overhead.
Scheduling
A schedule is a mapping of nodes to processors and start times such that communication
delay requirements are observed, and no two nodes are executing on the same processor
at the same time. Some general scheduling goals (a small sketch evaluating such a
schedule follows this list) are:
• Schedule all fine-grain activities in a node to the same processor to minimize
communication delays.
• Select grain sizes for packing to achieve better schedules for a particular parallel
machine.
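The sketch below illustrates these ideas on a small, assumed program graph: given node sizes, edge communication delays, and a node-to-processor assignment, it computes the finish time (makespan) of the resulting schedule, charging a communication delay only when an edge crosses processors. All names, weights, and the assignment are illustrative.

# Minimal sketch: computing the finish time (makespan) of a grain schedule.
# Nodes carry a grain size (execution time); edges carry a communication delay
# that is paid only when the two endpoints run on different processors.

size = {"A": 4, "B": 3, "C": 3, "D": 5}                  # node -> grain size
edges = [("A", "B", 6), ("A", "C", 6), ("B", "D", 4), ("C", "D", 4)]
assign = {"A": 0, "B": 0, "C": 1, "D": 0}                # node -> processor

def makespan(order):
    finish, proc_free = {}, {}
    for v in order:                                      # topological order
        ready = 0
        for (u, w, d) in edges:
            if w == v:
                delay = d if assign[u] != assign[v] else 0
                ready = max(ready, finish[u] + delay)
        start = max(ready, proc_free.get(assign[v], 0))  # processor must be free
        finish[v] = start + size[v]
        proc_free[assign[v]] = finish[v]
    return max(finish.values())

print(makespan(["A", "B", "C", "D"]))   # 22 for this particular packing

Changing the assignment (for example, packing C onto processor 0 as well) changes which edges pay their communication delay, which is exactly the tradeoff that grain packing tries to exploit.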
Node Duplication
Grain packing may potentially eliminate interprocessor communication, but it may not
always produce a shorter schedule. By duplicating nodes (that is, executing some
instructions on multiple processors), we may eliminate some interprocessor
communication, and thus produce a shorter schedule.
Program partitioning and scheduling
Scheduling and allocation is a highly important issue, since an inappropriate scheduling
of tasks can fail to exploit the true potential of the system and can offset the gain from
parallelization. Here we focus on the scheduling aspect. The objective of scheduling is to
minimize the completion time of a parallel application by properly allocating the tasks to
the processors. In a broad sense, the scheduling problem exists in two forms: static and
dynamic. In static scheduling, which is usually done at compile time, the characteristics
of a parallel program (such as task processing times, communication, data dependencies,
and synchronization requirements) are known before program execution.
A parallel program, therefore, can be represented by a node- and edge-weighted directed
acyclic graph (DAG), in which the node weights represent task processing times and the
edge weights represent data dependencies as well as the communication times between
tasks. In dynamic scheduling, only a few assumptions about the parallel program can be
made before execution, and thus scheduling decisions have to be made on the fly. The
goal of a dynamic scheduling algorithm as such includes not only the minimization of the
program completion time but also the minimization of the scheduling overhead, which
constitutes a significant portion of the cost paid for running the scheduler. In general,
dynamic scheduling is an NP-hard problem.
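A minimal sketch of this DAG representation, with assumed node and edge weights, is given below. It computes the length of the longest (critical) path through the graph, counting both task processing times and edge communication times, a quantity commonly used to guide static scheduling heuristics. The task names and weights are illustrative.

# Minimal sketch of the DAG used in static scheduling: node weights are task
# processing times, edge weights are communication times between tasks.
from functools import lru_cache

proc_time = {"T1": 2, "T2": 3, "T3": 3, "T4": 4, "T5": 5}
succ = {                                  # task -> [(successor, comm time), ...]
    "T1": [("T2", 1), ("T3", 2)],
    "T2": [("T4", 1)],
    "T3": [("T4", 2), ("T5", 1)],
    "T4": [("T5", 2)],
    "T5": [],
}

@lru_cache(maxsize=None)
def critical_path(node):
    """Longest path starting at node, including communication on the edges."""
    tails = [c + critical_path(s) for (s, c) in succ[node]]
    return proc_time[node] + max(tails, default=0)

print(max(critical_path(n) for n in proc_time))   # 20 for these weights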
2.5 System interconnect architecture
Various types of interconnection networks have been suggested for SIMD computers.
These are basically classified on the basis of network topology into two categories,
namely:
Static Networks
Dynamic Networks
Static versus Dynamic Networks
The topological structure of an SIMD array processor is mainly characterized by the data
routing network used to interconnect the processing elements. To execute a
communication, a routing function f is applied and, via the interconnection network, PEi
copies the content of its Ri register into the Rf(i) register of PEf(i), where f(i) is the
processor identified by the mapping function f. The data routing operation occurs in all
active PEs simultaneously.
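A minimal sketch of such a routing step follows, assuming N processing elements whose R registers are modeled as a list and using a simple cyclic shift as the (illustrative) routing function f; masked-off PEs simply do not participate.

# Minimal sketch of an SIMD data-routing step: every active PE_i copies its
# R register into the R register of PE_f(i) through the interconnection
# network, simultaneously. Here f is an illustrative cyclic-shift routing
# function over N processing elements.

N = 8
R = list(range(N))                 # R[i] holds the data in PE_i's register
active = [True] * N                # masking: only active PEs participate

def route(R, f, active):
    new_R = list(R)                # all transfers happen "at the same time"
    for i in range(N):
        if active[i]:
            new_R[f(i)] = R[i]     # PE_i writes into the R register of PE_f(i)
    return new_R

shift = lambda i: (i + 1) % N      # routing function f: shift by one
print(route(R, shift, active))     # [7, 0, 1, 2, 3, 4, 5, 6]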
2.5.1 Network properties and routing
The goals of an interconnection network are to provide low latency, a high data transfer
rate, and wide communication bandwidth. Analysis includes latency, bisection bandwidth,
data-routing functions, and the scalability of the parallel architecture.
Such networks are usually represented by a graph with a finite number of nodes linked by
directed or undirected edges.
Number of nodes in the graph = network size.
Number of edges (links or channels) incident on a node = node degree d (also note in and
out degrees when edges are directed). The node degree reflects the number of I/O ports
associated with a node, and should ideally be small and constant.
A network is symmetric if the topology looks the same from any node; symmetric
networks are easier to implement and to program.
Diameter: The maximum distance between any two processors in the network, or, in
other words, the maximum number of (routing) processors through which a message
must pass on its way from source to destination. The diameter thus measures the
maximum delay in transmitting a message from one processor to another; since it
determines communication time, the smaller the diameter, the better the network
topology.
Connectivity: The number of paths possible between any two processors, i.e., the
multiplicity of paths between two processors. Higher connectivity is desirable as it
minimizes contention.
Arc connectivity of the network: the minimum number of arcs that must be removed to
break the network into two disconnected networks. The arc connectivity of various
networks is as follows:
• 1 for linear arrays and binary trees
• 2 for rings and 2-D meshes
• 4 for a 2-D torus
• d for d-dimensional hypercubes
The larger the arc connectivity, the lower the contention and the better the network
topology.
Channel width: The channel width is the number of bits that can be communicated
simultaneously over the interconnection bus connecting two processors.
Bisection width and bandwidth: In order to divide the network into two equal halves, we
must remove some communication links; the minimum number of communication links
that have to be removed is called the bisection width. The bisection width tells us the
largest number of messages that can be sent simultaneously (without needing to use the
same wire or routing processor at the same time and so delaying one another), no matter
which processors are sending to which other processors. Thus, the larger the bisection
width, the better the network topology is considered. Bisection bandwidth is the
minimum volume of communication allowed between any two halves of the network
with equal numbers of processors. This is important for networks with weighted arcs,
where the weights correspond to the link width (i.e., how much data it can transfer). The
larger the bisection bandwidth, the better the network topology is considered.
Cost: The cost of a network can be estimated by a variety of criteria; here we consider the
number of communication links or wires used to design the network as the basis of cost
estimation. The smaller the cost, the better.
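To make these properties concrete, the Python sketch below evaluates node degree, diameter, bisection width, and link count for a ring, a square 2-D mesh (without wraparound), and a hypercube of p nodes, using the standard formulas for these topologies; the value of p is illustrative.

# Illustrative comparison of static-network properties (node degree, diameter,
# bisection width, number of links) for a ring, a square 2-D mesh and a
# hypercube of p nodes, using the standard formulas for these topologies.
import math

def ring(p):
    return {"degree": 2, "diameter": p // 2, "bisection": 2, "links": p}

def mesh_2d(p):                      # assumes p is a perfect square, no wraparound
    k = int(math.isqrt(p))
    return {"degree": 4, "diameter": 2 * (k - 1),
            "bisection": k, "links": 2 * k * (k - 1)}

def hypercube(p):                    # assumes p is a power of two
    d = int(math.log2(p))
    return {"degree": d, "diameter": d,
            "bisection": p // 2, "links": p * d // 2}

p = 64
for name, f in (("ring", ring), ("2-D mesh", mesh_2d), ("hypercube", hypercube)):
    print(f"{name:10s} {f(p)}")

For p = 64, for example, the ring has a large diameter (32) but only 64 links, while the hypercube has a small diameter (6) and large bisection width (32) at the price of 192 links and a node degree of 6, illustrating the cost/performance tradeoff discussed above.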
Data Routing Functions: A data routing network is used for inter-PE data exchange. It
can be static, as in the case of a hypercube routing network, or dynamic, such as a
multistage network. The various types of data routing functions are shifting, rotation,
permutation (one to one), broadcast (one to all), multicast (many to many), personalized
broadcast (one to many), shuffle, exchange, etc.
Permutations
Given n objects, there are n! ways in which they can be reordered (one of which is no
reordering). A permutation can be specified by giving the rule for reordering a group of
objects. Permutations can be implemented using crossbar switches, multistage networks,
shifting, and broadcast operations. The time required to perform permutations of the
connections between nodes often dominates the network performance when n is large.
Perfect Shuffle and Exchange
Stone suggested the special permutation that reorders entries according to the mapping of
the k-bit binary number ab…k to bc…ka (that is, shifting 1 bit to the left and wrapping it
around to the least significant bit position). The inverse perfect shuffle reverses the effect
of the perfect shuffle.
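A small Python sketch of the perfect shuffle and its inverse on k-bit node addresses is given below; it simply rotates the address bits one position to the left (or right, for the inverse), as described above.

# Minimal sketch of the perfect-shuffle permutation on k-bit node addresses:
# the address bits are rotated one position to the left (the most significant
# bit wraps around to the least significant position).

def perfect_shuffle(addr, k):
    msb = (addr >> (k - 1)) & 1
    return ((addr << 1) & ((1 << k) - 1)) | msb

def inverse_shuffle(addr, k):
    lsb = addr & 1
    return (addr >> 1) | (lsb << (k - 1))

k = 3
print([perfect_shuffle(a, k) for a in range(8)])                      # [0, 2, 4, 6, 1, 3, 5, 7]
print([inverse_shuffle(perfect_shuffle(a, k), k) for a in range(8)])  # identity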
Hypercube Routing Functions
If the vertices of an n-dimensional cube are labeled with n-bit numbers so that only one bit
differs between each pair of adjacent vertices, then n routing functions are defined by the
bits in the node (vertex) address. For example, with a 3-dimensional cube, we can easily
identify routing functions that exchange data between nodes with addresses that differ in
the least significant, most significant, or middle bit.
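The sketch below illustrates one common way of using these routing functions, namely dimension-order (e-cube) routing: the bits in which the source and destination addresses differ are corrected one dimension at a time. The addresses used are illustrative.

# Minimal sketch of hypercube routing: in an n-cube, node addresses are n-bit
# numbers and neighbours differ in exactly one bit. A message is routed by
# fixing, one at a time, each bit in which source and destination differ.

def hypercube_route(src, dst, n):
    path, node = [src], src
    for bit in range(n):                     # resolve dimensions 0 .. n-1
        if (node ^ dst) & (1 << bit):        # addresses still differ in this bit
            node ^= (1 << bit)               # cross to the neighbour in that dimension
            path.append(node)
    return path

print(hypercube_route(0b000, 0b101, 3))      # [0, 1, 5]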
Factors Affecting Performance
• Functionality – how the network supports data routing, interrupt handling,
synchronization, request/message combining, and coherence.
• Network latency – worst-case time for a unit message to be transferred.
• Bandwidth – maximum data rate.
• Hardware complexity – implementation costs for wires, logic, switches, connectors, etc.
• Scalability – how easily the scheme adapts to an increasing number of processors,
memories, etc.
2.5.2 Static connection Networks
In a static network the interconnection network provides fixed, permanent
interconnection paths between processing elements, and data communication has to
follow a fixed route to reach the destination processing element. Thus it consists of a
number of point-to-point links. Topologies in static networks can be classified according
to the dimensions required for layout, i.e., 1-D, 2-D, 3-D, or hypercube.
One-dimensional topologies include the linear array, as shown in figure 2.2(a), used in
some pipeline architectures.
Various 2-D topologies are:
• The ring (figure 2.2(b))
• Star (figure 2.2(c))
• Tree (figure 2.2(d))
• Mesh (figure 2.2(e))
• Systolic array (figure 2.2(f))
3-D topologies include:
• Completely connected chordal ring (figure 2.2(g))
• Chordal ring (figure 2.2(h))
• 3-cube (figure 2.2(i))
Figure 2.2 Static interconnection network topologies.
The torus is also a popular network topology; it is an extension of the mesh obtained by
adding wraparound connections. The figure below shows a 2-D torus. The torus is a
symmetric topology, unlike the mesh, which is not. The wraparound connections reduce
the torus diameter and at the same time restore the symmetry. It can be a
o 1-D torus
o 2-D torus
o 3-D torus
The torus topology is used in the Cray T3E.
Figure 2.3 Torus technology
We can have still higher-dimensional networks, for example the 3-cube-connected cycle.
A D-dimensional, W-wide hypercube contains W nodes in each dimension, and there is a
connection to a node in each dimension. The mesh and the cube architectures are actually
2-D and 3-D hypercubes, respectively. In the figure below we have a hypercube of
dimension 4.
Figure 2.16
The diameter is m = log_2 p, since all messages must traverse m stages. The bisection
width is p. This network was used in the IBM RP3, BBN Butterfly, and NYU
Ultracomputer. If we compare the Omega network with the n-cube network, we find that
the Omega network can perform one-to-many connections while the n-cube cannot.
However, as far as bisection connections are concerned, the n-cube and Omega networks
perform more or less the same.