Module 1 Chapter 2
Program and Network Properties
• Conditions of parallelism
• Program partitioning and scheduling
• Program flow mechanisms
• System interconnect architectures
Conditions of Parallelism
The exploitation of parallelism in computing requires
understanding the basic theory associated with it.
Progress is needed in several areas:
• computation models for parallel computing
• interprocessor communication in parallel architectures
• integration of parallel systems into general environments
Data dependences
• Flow dependence: an instruction reads a result produced by an earlier instruction (read after write).
• Antidependence: an instruction overwrites a variable read by an earlier instruction (write after read).
• Output dependence: two instructions write the same variable (write after write).
• I/O dependence: two I/O statements reference the same file.
• Unknown dependence:
o The subscript of a variable is itself subscripted.
o The subscript does not contain the loop index variable.
o A variable appears more than once with subscripts having different
coefficients of the loop variable (that is, different functions of the loop
variable).
o The subscript is nonlinear in the loop index variable.
• Parallel execution of program segments which do not have total data
independence can produce non-deterministic results.
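As an illustration of the first case above, a small Python sketch (mine, not from the slides): with a subscripted subscript, whether one iteration depends on another is only decidable at run time, so the compiler must assume an unknown dependence.

# Whether iteration i's write A[idx[i]] feeds a later iteration's read A[i]
# depends on the runtime contents of idx, so the loop cannot be safely
# parallelized at compile time.
def update(A, idx):
    for i in range(len(idx)):
        A[idx[i]] = A[i] + 1      # cf. A(IP(I)) = A(I) + 1 in Fortran terms
    return A

print(update([0, 0, 0, 0], [0, 1, 2, 3]))  # [1, 1, 1, 1]: iterations independent
print(update([0, 0, 0, 0], [1, 2, 3, 0]))  # [4, 1, 2, 3]: a dependence chain forms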
Data dependence example
Two Fortran loops. In the first, the iterations are independent of one another; in the second, iteration I reads A(I-1), which was written by iteration I-1 (a loop-carried flow dependence), so its iterations cannot execute in parallel.

      DO 20 I = 1, N
        A(I) = C(I)
        IF (A(I) .LT. 0) A(I) = 1
   20 CONTINUE

      DO 10 I = 1, N
        IF (A(I-1) .EQ. 0) A(I) = 0
   10 CONTINUE
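To see the loop-carried dependence concretely, a small Python sketch (not from the slides) compares serial execution of the second loop with a naive parallel-style execution in which every iteration reads the original values of A:

def serial(a):
    a = list(a)
    for i in range(1, len(a)):        # each iteration sees earlier updates
        if a[i - 1] == 0:
            a[i] = 0
    return a

def parallel_naive(a):
    old = list(a)                     # every iteration reads the ORIGINAL values
    a = list(a)
    for i in range(1, len(a)):
        if old[i - 1] == 0:
            a[i] = 0
    return a

A = [0, 5, 3, 7]
print(serial(A))          # [0, 0, 0, 0]: the zeroing propagates down the chain
print(parallel_naive(A))  # [0, 0, 3, 7]: breaking the dependence changes the result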
Resource dependence
• Resource dependence arises when two instructions compete for the same hardware resource (ALUs, registers, memory areas) rather than for data.
Utilizing Bernstein's conditions
• Bernstein's conditions: two processes Pi and Pj can execute in parallel (Pi || Pj) iff Ii ∩ Oj = ∅, Ij ∩ Oi = ∅, and Oi ∩ Oj = ∅, where Ik and Ok denote the input and output variable sets of Pk.
• Example statements:
P1 : C = D × E
P2 : M = G + C
P3 : A = B + C
P4 : C = L + M
P5 : F = G ÷ E
[Figure: dependence graph of P1–P5 obtained by applying Bernstein's conditions pairwise.]
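Bernstein's conditions reduce to simple set intersections, so they are easy to check mechanically. A minimal Python sketch (my construction, not from the slides) applies them to the five statements above and lists the parallelizable pairs:

# Bernstein's conditions: Pi and Pj can run in parallel iff
#   Ii ∩ Oj = ∅,  Ij ∩ Oi = ∅,  Oi ∩ Oj = ∅
from itertools import combinations

# (input set, output set) for each statement above
procs = {
    "P1": ({"D", "E"}, {"C"}),   # C = D × E
    "P2": ({"G", "C"}, {"M"}),   # M = G + C
    "P3": ({"B", "C"}, {"A"}),   # A = B + C
    "P4": ({"L", "M"}, {"C"}),   # C = L + M
    "P5": ({"G", "E"}, {"F"}),   # F = G ÷ E
}

def parallel(p, q):
    (i1, o1), (i2, o2) = procs[p], procs[q]
    return not (i1 & o2 or i2 & o1 or o1 & o2)

for p, q in combinations(procs, 2):
    if parallel(p, q):
        print(p, "||", q)
# Prints: P1 || P5, P2 || P3, P2 || P5, P3 || P5, P4 || P5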
Hardware parallelism
Mismatch between software and hardware parallelism - 1
[Figure: maximum software parallelism (L = load, X/+/- = arithmetic): cycle 1 issues loads L1-L4, cycle 2 the multiplies X1 and X2, and cycle 3 the + and - that produce A and B.]
Mismatch between software and hardware parallelism - 2
[Figure: the same computation under limited hardware parallelism: L1 issues in cycle 1, and the + and - complete in cycles 6 and 7, producing A and B.]
Mismatch between software and hardware parallelism - 3
[Figure: the same computation on two processors: L1/L3 issue in cycle 1 and L2/L4 in cycle 2; loads L5 and L6 are inserted for synchronization (cycle 5), and the + and - complete in cycle 6, producing A and B.]
Software parallelism
Levels of software parallelism, coarsest to finest. Moving toward finer grains raises the degree of available parallelism, but also the communication demand and scheduling overhead:
• Jobs or programs } coarse grain
• Related parts of a program
• Procedures, subroutines, tasks, or coroutines } medium grain
• Non-recursive loops or unfolded iterations
• Instructions or statements } fine grain
Instruction Level Parallelism
• Two questions:
o How can I partition a program into parallel “pieces” to yield the shortest
execution time?
o What is the optimal size of parallel grains?
• There is an obvious tradeoff between the time spent scheduling and
synchronizing parallel grains and the speedup obtained by parallel
execution.
• One approach to the problem is called “grain packing.”
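To make the tradeoff concrete, a toy cost model in Python (entirely my construction; W, p, h, and the cost formula are illustrative assumptions, not from the slides):

# Work W is split into g equal grains on p processors, and every grain
# pays a fixed scheduling/synchronization overhead h.
def exec_time(W, g, p, h):
    grain = W / g              # work per grain
    rounds = -(-g // p)        # ceil(g / p): grains execute p at a time
    return rounds * grain + g * h

W, p, h = 10_000.0, 8, 5.0
for g in (2, 4, 8, 64, 512):
    print(g, exec_time(W, g, p, h))
# Too few grains leave processors idle (g=2 -> 5010.0); very fine grains
# drown in overhead (g=512 -> 3810.0); the sweet spot here is g = p = 8 (1290.0).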
Program Graphs and Packing
[Figure: a program graph and its two-processor (P1, P2) schedules. Graph nodes are labeled (name, execution time), e.g. A,4; edges are labeled (variable, communication delay), e.g. a,8 or c,1. In the schedules, idle (I) slots appear while a processor waits for communicated results; completion times run to 23 and 27.]
Schedule with node duplication
[Figure: node duplication: copies A',4 and C',1 of nodes A and C are placed on both processors so that their results need not be communicated; the resulting two-processor schedule completes by time 14.]
Grain determination and scheduling optimization
Four major steps, applied to program graphs like those above:
• construct a fine-grain program graph
• schedule the fine-grain computation
• perform grain packing to produce coarse grains
• generate a parallel schedule based on the packed graph
Data flow features
• No need for
o shared memory
o program counter
o control sequencer
• Special mechanisms are required to
o detect data availability
o match data tokens with instructions needing them
o enable chain reaction of asynchronous instruction execution
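As a concrete illustration of these mechanisms, a minimal Python sketch (my construction, not an actual dataflow machine's implementation): each instruction fires as soon as all of its operand tokens are available, and each result token may enable further instructions, giving the chain reaction of asynchronous execution. The example program and token values are hypothetical.

from collections import deque

# instruction: destination -> (operation, operand names)
program = {
    "t1": ("add", ("b", "c")),
    "t2": ("mul", ("t1", "d")),
    "t3": ("sub", ("t1", "b")),
    "out": ("add", ("t2", "t3")),
}
ops = {"add": lambda x, y: x + y,
       "sub": lambda x, y: x - y,
       "mul": lambda x, y: x * y}

tokens = {"b": 2, "c": 3, "d": 4}          # initial data tokens
ready = deque(d for d, (_, srcs) in program.items()
              if all(s in tokens for s in srcs))
fired = set()
while ready:
    dest = ready.popleft()
    op, srcs = program[dest]
    tokens[dest] = ops[op](*(tokens[s] for s in srcs))   # fire the instruction
    fired.add(dest)
    # token matching: enqueue any instruction whose operands are now all present
    for d, (_, ss) in program.items():
        if d not in fired and d not in ready and all(s in tokens for s in ss):
            ready.append(d)
print(tokens["out"])   # ((b+c)*d) + ((b+c)-b) = 20 + 3 = 23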
A Dataflow Architecture - 1
Demand-driven mechanisms
• String-reduction model:
o each demander gets a separate copy of the expression string to
evaluate
o each reduction step has an operator and embedded reference to
demand the corresponding operands
o each operator is suspended while arguments are evaluated
• Graph-reduction model:
o expression graph reduced by evaluation of branches or subgraphs,
possibly in parallel, with demanders given pointers to results of
reductions.
o based on sharing of pointers to arguments; traversal and reversal of
pointers continues until constant arguments are encountered.
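A minimal Python sketch (my construction, not from the slides) contrasting the two models on the expression (a + b) * (a + b); the counter `calls` is a hypothetical device to count reductions:

calls = 0
def add(x, y):
    global calls
    calls += 1
    return x + y

def string_reduce(a, b):
    # string reduction: each demander evaluates its own copy of (a + b)
    return add(a, b) * add(a, b)

def graph_reduce(a, b):
    # graph reduction: demanders share a pointer to one (a + b) node,
    # which is overwritten by its value the first time it is reduced
    node = {"op": add, "args": (a, b), "value": None}
    def demand(n):
        if n["value"] is None:
            n["value"] = n["op"](*n["args"])
        return n["value"]
    return demand(node) * demand(node)

print(string_reduce(2, 3), calls)   # 25 2  (two evaluations of a + b)
calls = 0
print(graph_reduce(2, 3), calls)    # 25 1  (the shared node is reduced once)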
System Interconnect Architectures
• Bisection Width:
o Channel bisection width b: the minimum number of edges cut when the network is divided into two equal halves
o Each channel consists of w bit wires
o Wire bisection width B = b × w; B reflects the wiring density of the network and is a good indicator of the maximum communication bandwidth along the bisection of the network
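A worked example in Python (the per-topology channel bisection values are standard results for these networks, not stated on this slide; the 32-bit channel width is an assumption):

# Channel bisection width b and wire bisection width B = b * w
# for a few static topologies of N nodes.
def channel_bisection(topology, N):
    if topology == "linear array":
        return 1                    # one edge crosses the middle cut
    if topology == "ring":
        return 2                    # the cut severs two edges
    if topology == "2D mesh":       # r x r mesh, N = r * r
        return int(N ** 0.5)        # r edges cross the middle column
    if topology == "hypercube":
        return N // 2               # N/2 edges cross any dimension cut
    raise ValueError(topology)

w = 32                              # assume 32-bit-wide channels
for t in ("linear array", "ring", "2D mesh", "hypercube"):
    b = channel_bisection(t, 16)
    print(f"{t:12s} b={b:2d} B={b * w}")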
Terminology - 1
• Shifting
• Rotating
• Permutation (one to one)
• Broadcast (one to all)
• Multicast (many to many)
• Personalized broadcast (one to many)
• Shuffle
• Exchange
• Etc.
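As a concrete illustration of the last two routing functions (a sketch of the standard definitions; the 3-bit address size is an assumption): the perfect shuffle cyclically rotates a node's binary address left by one bit, and the exchange complements its least significant bit.

N_BITS = 3
N = 1 << N_BITS                    # 8 nodes with 3-bit addresses

def shuffle(x):
    # rotate the N_BITS-bit address left by one bit
    return ((x << 1) | (x >> (N_BITS - 1))) & (N - 1)

def exchange(x):
    # complement the least significant address bit
    return x ^ 1

print([shuffle(x) for x in range(N)])   # [0, 2, 4, 6, 1, 3, 5, 7]
print([exchange(x) for x in range(N)])  # [1, 0, 3, 2, 5, 4, 7, 6]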
Permutations
• A permutation is a one-to-one mapping of the n sources onto the n destinations; n nodes admit n! distinct permutations.
Factors Affecting Performance
• Functionality
• Network latency
• Bandwidth
• Hardware complexity
• Scalability
Static Networks
• Linear Array
• Ring and Chordal Ring
• Barrel Shifter
• Tree and Star
• Fat Tree
• Mesh and Torus
Static Networks – Linear Array
• N nodes connected by N − 1 links in a line; interior nodes have degree 2 and the two end nodes degree 1. The diameter is N − 1, so the structure does not scale well to large N.
Static Networks – Ring and Chordal Ring
• Like a linear array, but the two end nodes are connected by an nth link; the ring can be unidirectional or bidirectional. The diameter is ⌊n/2⌋ for a bidirectional ring, or n − 1 for a unidirectional ring.
• By adding extra links (e.g. "chords" in a circle), the node degree is increased and we obtain a chordal ring; this reduces the network diameter.
• In the limit, we obtain a fully connected network, with a node degree of n − 1 and a diameter of 1.
Static Networks – Barrel Shifter
• Like a ring, but with additional links between all pairs of nodes that
have a distance equal to a power of 2.
• With a network of size N = 2^n, each node has degree d = 2n − 1, and the network has diameter D = n/2.
• Barrel shifter connectivity is greater than that of any chordal ring of lower node degree.
• The barrel shifter is much less complex than the fully connected network.
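A small Python sketch (my construction, not from the slides) that builds these topologies as adjacency lists and measures their diameters by breadth-first search, confirming the figures above for 16 nodes:

from collections import deque

def ring(n, extra=()):
    # ring links at distance 1, plus optional chords at the given distances
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for d in (1,) + tuple(extra):
            adj[i].add((i + d) % n)
            adj[i].add((i - d) % n)
        adj[i].discard(i)
    return adj

def diameter(adj):
    best = 0
    for s in adj:                        # BFS from every node
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

n = 16
print(diameter(ring(n)))                   # 8 = n/2 for the bidirectional ring
print(diameter(ring(n, extra=(3,))))       # 4: chords shrink the diameter
print(diameter(ring(n, extra=(2, 4, 8))))  # 2 = (log2 n)/2 for the barrel shifter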
Static Networks – Tree and Star