
Chapter 2 Parallel Architectures and Interconnection Networks

“The interconnection network is the heart of parallel architecture.” - Chuan-Lin Wu and Tse-Yun
Feng [1]

2.1 Introduction
You cannot really design parallel algorithms or programs without an understanding of some of the key
properties of various types of parallel architectures and the means by which components can be connected
to each other. Parallelism has existed in computers since their inception, at various levels of design. For
example, bit-parallel memory has been around since the early 1970s, and simultaneous I/O processing (using
channels) has been used since the 1960s. Other forms of parallelism include bit-parallel arithmetic in the
arithmetic-logic unit (ALU), instruction look-ahead in the control unit, direct memory access (DMA), data
pipe-lining, and instruction pipe-lining. However, the parallelism that we will discuss is at a higher level;
in particular, we will look at processor arrays, multiprocessors, and multicomputers. We begin, however, by
exploring the mathematical concept of a topology.

2.2 Network Topologies


We will use the term network topology to refer to the way in which a set of nodes are connected to each
other. In this context, a network topology is essentially a discrete graph – a set of nodes connected by edges.
Distance does not exist and there is no notion of the length of an edge.1 You can think of each edge as being
of unit length.
Network topologies arise in the context of parallel architectures as well as in parallel algorithms. In the
domain of parallel architectures, network topologies describe the interconnections among multiple processors
and memory modules. You will see in subsequent chapters that a network topology can also describe the
communication patterns among a set of parallel processes. Because they can be used in these two different
ways, we will first examine them purely as mathematical entities, divorced from any particular application.
Formally, a network topology ⟨S, E⟩ is a finite set S of nodes together with an adjacency relation
E ⊆ S × S on the set. If v and w are nodes such that (v, w) ∈ E, we say that there is a directed edge
from v to w.2 Sometimes all of the edges in a particular topology will be undirected, meaning that both
(v, w) ∈ E and (w, v) ∈ E. Unless we state otherwise, we will treat all edges as undirected. When two nodes
are connected by an edge, we say they are adjacent. An example of a network topology that should be
quite familiar to the reader is the binary tree shown in Figure 2.1. Notice that the nodes in that figure are
labeled. A label is an arbitrary symbol used to refer to the node, nothing more.
We will examine the following topologies in these notes:

• Binary tree

• Fully-connected (also called completely-connected)


• Mesh and torus

• Hypercube (also called a binary n-cube)

• Butterfly

Figure 2.1: Binary tree topology with 7 nodes, labeled 1 through 7 with node 1 as the root.

1 Network topologies are a kind of mathematical topological space, for those familiar with this concept, but this is of no
importance to us.
2 The reader familiar with graphs will notice that a network topology is essentially a discrete graph.

2.2.1 Network Topology Properties


Network topologies have properties that determine their usefulness for particular applications, which we now
define.
Definition 1. A path from node n1 to node nk is a sequence of nodes n1, n2, . . . , nk such that, for 1 ≤ i < k,
ni is adjacent to ni+1. The length of a path is the number of edges in the path, not the number of nodes.

Definition 2. The distance between a pair of nodes is the length of the shortest path between the nodes.

For example, in Figure 2.1, the distance between nodes 4 and 7 is 4, whereas the distance between nodes 6
and 7 is 2.
Definition 3. The diameter of a network topology is the largest distance between any pair of nodes in the
network.

The diameter of the network in Figure 2.1 is 4, since the distance between nodes 4 and 7 is 4, and there is no
pair of nodes whose distance is greater than 4. Diameter is important because, if nodes represent processors
that must communicate via the edges, which represent communication links, then the diameter determines
a lower bound on the communication time. (Note that it is a lower bound and not an upper bound; if a
particular algorithm requires, for example, that all pairs of nodes send each other data before the next step
of a computation, then the diameter determines how much time will elapse before that step can begin.)
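
To make Definitions 1 through 3 concrete, here is a short Python sketch of our own (the function names are not from any particular library) that computes distances and the diameter of the topology in Figure 2.1 by breadth-first search over an adjacency list:

from collections import deque

def distances_from(adj, src):
    """Breadth-first search: distance from src to every node reachable from it."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def diameter(adj):
    """Largest distance between any pair of nodes (assumes a connected topology)."""
    return max(max(distances_from(adj, v).values()) for v in adj)

# Binary tree of Figure 2.1: node 1 is the root, 2 and 3 are its children, and so on.
tree = {1: [2, 3], 2: [1, 4, 5], 3: [1, 6, 7],
        4: [2], 5: [2], 6: [3], 7: [3]}

print(distances_from(tree, 4)[7])   # 4, the distance from node 4 to node 7
print(diameter(tree))               # 4, the diameter of the tree

Running it prints 4 twice, matching the distance and diameter discussed above.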

Definition 4. The bisection width of a network topology is the smallest number of edges that must be
deleted to sever the set of nodes into two sets of equal size, or size differing by at most one node.

In Figure 2.1, edge (1,2) can be deleted to split the set of nodes into the two sets {2,4,5} and {1,3,6,7}. Therefore,
the bisection width of this network is 1. Bisection width is important because it can determine the total
communication time. A low bisection width is bad; a high bisection width is good. Consider the extreme case, in which a
network can be split by removing one edge. This means that all data that flows from one half to the other
must pass through this edge. This edge is a bottleneck through which all data must pass sequentially, like a
one-lane bridge in the middle of a four-lane highway. In contrast, if the bisection width is high, then many
edges must be removed to split the node set. This means that there are many paths from one side of the set
to the other, and data can flow in a high degree of parallelism from any one half of the nodes to the other.
Definition 5. The degree of the network topology is the maximum number of edges that are incident to a
node in the topology.

The maximum number of edges per node can affect how well the network scales as the number of processors
increases, because of physical limitations on how the network is constructed. A binary tree, for example,
has the property that the maximum number of edges per node is 3, regardless of how many nodes are in
the tree. This is good, because the physical design need not change to accommodate the increase in number
of processors. Not all topologies have a constant degree. If the degree increases with network size, this
generally means that more connections need to be made to each node. Nodes might represent switches, or
processors, and in either case they have a fixed pin-out, implying that the connections between processors
must be implemented by a complex fan-out of the wires, a very expensive and potentially slow mechanism.
Although the edges in a network topology do not have length, we assume that nodes cannot be infinitely
small. As a consequence, the definition of the topology itself can imply that, as the number of nodes
increases, the physical distance between them must increase. Maximum edge length is a measure of this
property. It is important because the communication time is a function of how long the signals must travel.
It is best if the network can be laid out in three-dimensional space so that the maximum edge length is a
constant, independent of network size. If not, and the edge length increases with the number of processors,
then communication time increases as the network grows. This implies that expanding the network to
accommodate more processors can slow down communication time. The binary tree in Figure 2.1 does not
have a constant maximum edge length, because as the size of the tree gets larger, the leaf nodes must be
placed further apart, which in turn implies that eventually the edges that leave the root of the tree must get
longer.

2.2.2 Binary Tree Network Topology


In a binary tree network, the 2^k − 1 nodes are arranged in a complete binary tree of depth k − 1, as in Figure
2.1. The depth of a binary tree is the length of a path from the root to a leaf node. Each interior node is
connected to two children, and each node other than the root is connected to its parent. Thus the degree is
3. The diameter of a binary tree network with 2^k − 1 nodes is 2(k − 1), because the longest path in the tree
is any path from a leaf node that must go up to the root of the tree and then down to a different leaf node.
If we let n = 2^k − 1, then 2(k − 1) is approximately 2 log_2 n; i.e., the diameter of a binary tree network with
n nodes is a logarithmic function of network size, which is very low.
The bisection width is 1, which is as low as possible and therefore poor: the tree can be split into two sets
whose sizes differ by at most one node by deleting either edge incident to the root. As discussed
above, the maximum edge length is an increasing function of the number of nodes.

2.2.3 Fully-Connected Network Topology


In a fully-connected network, every node is connected to every other node, as in Figure 2.2. If there are
n nodes, there will be n(n − 1)/2 edges. Suppose n is even. Then there are n/2 even-numbered nodes and
n/2 odd-numbered nodes. If we remove every edge that connects an even node to an odd node, then the
even nodes will form a fully-connected network and so will the odd nodes, but the two sets will be disjoint.
There are n/2 edges from each even node, one to every odd node, so there are (n/2)^2 edges that connect these
two sets, and if any one of them is left in place, the two sets remain connected; hence this is the minimum
number of edges whose removal splits the node set in half. Therefore, the bisection width is (n/2)^2 = n^2/4.
The diameter is 1, since there is a direct link from any node to
every other node. The degree is proportional to n (it is exactly n − 1), so this network does not scale well. Lastly, the maximum
edge length will increase as the network grows, because nodes are not arbitrarily small. (Think of the nodes
as lying on the surface of a sphere, and the edges as chords connecting them.)

Figure 2.2: Fully-connected network with 6 nodes.
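
As a quick check of the argument above, the following sketch (our own code, not from the notes) builds a fully-connected topology on n = 6 nodes and counts the edges that join the even-numbered half to the odd-numbered half:

from itertools import combinations

def fully_connected(n):
    """All n(n-1)/2 undirected edges among nodes 0 .. n-1."""
    return list(combinations(range(n), 2))

n = 6
edges = fully_connected(n)
crossing = [(v, w) for (v, w) in edges if v % 2 != w % 2]
print(len(edges))     # 15 = n(n-1)/2
print(len(crossing))  # 9  = (n/2)^2, the bisection width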


2.2.4 Mesh Network Topology


In a mesh network, nodes are arranged in a q-dimensional lattice. A 2-dimensional lattice with 36 nodes
is illustrated in Figure 2.3. The mesh in that figure is square; unless stated otherwise, we will assume that
meshes are square, i.e., that they have the same number of nodes in every dimension. In general, there are
k^2 nodes in a 2-dimensional mesh. A 3-dimensional mesh is the logical extension of a 2-dimensional one:
it consists of the lattice points in a 3-dimensional grid, with edges connecting adjacent points. A 3-dimensional
mesh, assuming the same number of nodes in all dimensions, has k^3 nodes. While we cannot visually depict
q-dimensional mesh networks when q > 3, we can describe their properties. A q-dimensional mesh network
has k^q nodes, where k is the number of nodes in a single dimension of the mesh. Henceforth we let q denote
the dimension of the mesh.

Figure 2.3: A two-dimensional square mesh with 36 nodes.

The diameter of a q-dimensional mesh network with k^q nodes is q(k − 1). To see this, note that the farthest
distance between nodes is from one corner to the diagonally opposite one. An inductive argument is as
follows. In a 2-dimensional lattice with k^2 nodes, you have to travel (k − 1) edges horizontally and (k − 1)
edges vertically to get to the opposite corner, in any order, so you must traverse 2(k − 1) edges. Now suppose
we have a mesh of dimension q − 1, where q ≥ 3. By the inductive hypothesis its diameter is (q − 1)(k − 1).
A mesh of one higher dimension consists of k copies of the (q − 1)-dimensional mesh, side by side. To get
from one corner to the diagonally opposite one, you first travel to the diagonally opposite corner within the
copy of the (q − 1)-dimensional mesh you start in, crossing (q − 1)(k − 1) edges by hypothesis. Then you
have to get to the corresponding corner of the k-th copy of the mesh in the new dimension, which requires
crossing (k − 1) more edges. Thus you travel a total of (q − 1)(k − 1) + (k − 1) = q(k − 1)
edges. This is not rigorous, but this is the idea of the proof.
If k is an even number, the bisection width of a q-dimensional mesh network with k^q nodes is k^(q−1). Consider
the 2D mesh of Figure 2.3. To split it into two halves, you can delete 6 = 6^1 edges. Imagine the 3D mesh
with 216 nodes. To split it into two halves, you can delete the 36 = 6^2 vertical edges connecting the 36
nodes in the third plane to the 36 nodes in the fourth plane. In general, one can delete the edges that connect adjacent copies of
the (q − 1)-dimensional lattices in the middle of the q-dimensional lattice. There are k^(q−1) such edges.
One can prove by an induction argument that the bisection width when k is
odd is (k^q − 1)/(k − 1). Thus, whether k is even or odd, the bisection width is Θ(k^(q−1)). Since the number of
nodes in the mesh is n = k^q, as a function of n the bisection width is Θ(n/n^(1/q)), or equivalently Θ(n^((q−1)/q)),
which is a very high bisection width.
The degree in a mesh is fixed for each given q: it is always 2q, because each node has at most two neighbors
in each dimension. The maximum edge length is also a constant, independent of the mesh size, for two- and
three-dimensional meshes. For higher-dimensional meshes, it is
not constant.
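
The diameter formula q(k − 1) is easy to confirm for small meshes. The sketch below is our own (not from the notes); it builds a q-dimensional mesh with k nodes per dimension as an adjacency list keyed by coordinate tuples and measures its diameter by breadth-first search:

from collections import deque
from itertools import product

def mesh(q, k):
    """q-dimensional mesh with k nodes per dimension; nodes are coordinate tuples."""
    adj = {p: [] for p in product(range(k), repeat=q)}
    for p in adj:
        for d in range(q):
            if p[d] + 1 < k:                          # neighbor one step along dimension d
                nbr = p[:d] + (p[d] + 1,) + p[d + 1:]
                adj[p].append(nbr)
                adj[nbr].append(p)
    return adj

def diameter(adj):
    """Largest breadth-first-search distance over all pairs of nodes."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        best = max(best, max(dist.values()))
    return best

print(diameter(mesh(2, 6)))   # 10 = 2(6 - 1), the 6 x 6 mesh of Figure 2.3
print(diameter(mesh(3, 4)))   # 9  = 3(4 - 1)
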
An extension of a mesh is a torus. A torus, the 2-dimensional version of which is illustrated in Figure 2.4,
is an extension of a mesh by the inclusion of edges between the exterior nodes in each row and those in each
column. In higher dimensions, it includes edges between the exterior nodes in each dimension. It is called a
torus because the surface that would be formed if it were wrapped around the nodes and edges with a thin
film would be a mathematical torus, i.e., a doughnut. A torus, or toroidal mesh, has a diameter roughly half
that of a non-toroidal mesh with the same number of nodes.

Figure 2.4: Two-dimensional mesh with toroidal connections.

2.2.5 Hypercube (Binary n-Cube)


A binary n-cube or hypercube network is a network with 2^n nodes arranged as the vertices of an n-
dimensional cube. A hypercube is simply a generalization of an ordinary cube, the three-dimensional shape
that you know. Although you probably think of a cube as a rectangular prism whose edges all have equal
length, that is not the only way to think about it.
To start, a single point can be thought of as a 0-cube. Suppose its label is 0. Now suppose that we replicate
this 0-cube, putting the copy at a distance of one unit away, and connecting the original and the copy by a
line segment of length 1, as shown in Figure 2.5. We give the duplicate node the label 1.
We extend this idea one step further. We replicate the 1-cube, putting the copy parallel to the original
at a distance of one unit away in an orthogonal direction, and connect corresponding nodes in the copy to
those in the original. We use binary numbers to label the nodes, instead of decimal. Each binary number
has n bits, where n is the dimension of the cube, so that the 1-cube's labels are 1 bit long, the 2-cube's labels
are 2 bits long, and so on. The nodes in the copy are labeled with the same labels as those of the original,
except for one change: the most significant bit is 0 in the original and 1 in the copy, as
shown in Figure 2.5. Now we repeat this procedure to create a 3-cube: we replicate the 2-cube, putting the
copy parallel to the original at a distance of 1 unit away in the orthogonal direction, connect nodes in the
copy to the corresponding nodes in the original, and relabel all nodes by adding a new most significant bit, 0 in
the original and 1 in the copy.

Figure 2.5: Hypercubes of dimensions 0, 1, 2, and 3.
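
The replicate-and-relabel construction translates directly into code. The sketch below is our own (not part of the notes); labels are kept as integers, and giving the copy a new most significant 1 bit amounts to adding a power of two to each of its labels:

def hypercube(n):
    """Build an n-cube by repeated replication; adjacency sets keyed by integer labels."""
    adj = {0: set()}                 # the 0-cube: a single node labeled 0
    for dim in range(n):
        offset = 1 << dim            # adding this sets the new most significant bit
        copy = {v + offset: {w + offset for w in nbrs} for v, nbrs in adj.items()}
        adj.update(copy)
        for v in range(offset):      # connect each original node to its counterpart in the copy
            adj[v].add(v + offset)
            adj[v + offset].add(v)
    return adj

cube3 = hypercube(3)
print(sorted(cube3[0b101]))          # [1, 4, 7]: labels 001, 100, 111, each one bit away from 101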


It is now not hard to see how we can create hypercubes of arbitrary dimension, though drawing them becomes
a bit cumbersome. A 4-cube is illustrated in Figure 2.6.

Figure 2.6: A 4-dimensional hypercube.

The node labels will play an important role in our understanding of the hypercube. Observe that

• The labels of two nodes differ by exactly one bit change if and only if they are connected by an edge.
• In an n-dimensional hypercube, each node label is represented by n bits. Each of these bits can be
inverted (0 → 1 or 1 → 0), implying that each node has exactly n incident edges. In the 4D hypercube,
for example, each node has 4 neighbors. Thus the degree of an n-cube is n.
• The diameter of an n-dimensional hypercube is n. To see this, observe that a given integer represented
with n bits can be transformed to any other n-bit integer by changing at most n bits, one bit at a
time. This corresponds to a walk across n edges in a hypercube from the first to the second label.
• The bisection width of an n-dimensional hypercube is 2^(n−1). One way to see this is to realize that all
nodes can be thought of as lying in one of two planes: pick any bit position and call it b. The nodes
whose bit b is 0 are in one plane, and those whose bit b is 1 are in the other. To split the network
into two sets of nodes, one in each plane, one has to delete the edges connecting the two planes. Every
node in the 0-plane is attached to exactly one node in the 1-plane by one edge. There are 2^(n−1) such
pairs of nodes, and hence 2^(n−1) edges. No smaller set of edges can be cut to split the node set.
• The number of edges in an n-dimensional hypercube is n · 2^(n−1). To see this, note that it is true when
n = 0, as there are 0 edges in the 0-cube. Assume it is true for all k < n. A hypercube of dimension n
consists of two hypercubes of dimension n − 1 with one edge between each pair of corresponding nodes
in the two smaller hypercubes. There are 2^(n−1) such edges. Thus, using the inductive hypothesis, the
hypercube of dimension n has 2 · (n − 1) · 2^(n−2) + 2^(n−1) = (n − 1) · 2^(n−1) + 2^(n−1) = (n − 1 + 1) · 2^(n−1) = n · 2^(n−1)
edges. By the principle of induction, the claim holds for every n. (The sketch below checks these properties for small n.)
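
For small n these properties can be confirmed mechanically. The sketch below is our own; it builds the edge list directly from the one-bit-difference rule in the first bullet and then checks the edge count, the degree, and the size of the cut on bit 0:

def ncube_edges(n):
    """Edges of an n-cube: labels 0 .. 2^n - 1 are adjacent iff they differ in exactly one bit."""
    return [(v, v ^ (1 << b)) for v in range(1 << n) for b in range(n) if v < (v ^ (1 << b))]

n = 4
edges = ncube_edges(n)
degree = max(sum(1 for e in edges if v in e) for v in range(1 << n))
crossing = [e for e in edges if (e[0] ^ e[1]) == 1]   # edges that flip bit 0, i.e., cross the bit-0 cut

print(len(edges))     # 32 = n * 2^(n-1)
print(degree)         # 4  = n
print(len(crossing))  # 8  = 2^(n-1), the bisection width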

The bisection width is very high (one half the number of nodes), and the diameter is low. This makes the
hypercube an attractive organization. Its primary drawbacks are that (1) the number of edges per node
is a (logarithmic) function of network size, making it difficult to scale up, and (2) the maximum edge length
increases as network size increases.

2.2.6 Butterfly Network Topology


A butterfly network topology consists of (k + 1)2^k nodes arranged in k + 1 ranks, each containing n = 2^k
nodes. k is called the order of the network. The ranks are labeled 0 through k. Figure 2.7 depicts a butterfly
network of order 3, meaning that it has 4 ranks with 2^3 = 8 nodes in each rank. The columns in the figure
are labeled 0 through 7.
We describe two different methods for constructing a butterfly network of order k.

Figure 2.7: Butterfly network of order 3. Ranks 0 through 3 run from top to bottom; columns are labeled 000 through 111.

Method 1:

• Create k + 1 ranks labeled 0 through k, each containing 2^k nodes, labeled 0 through 2^k − 1.

• Let [i, j] denote the node in rank i, column j.


• For each rank, from 0 through k − 1, connect all nodes [i, j] to nodes [i + 1, j]. In other words, draw
the straight lines down the columns as shown in Figure 2.7.
• Let a ⋄ b represent the bitwise exclusive-or of a and b. For each rank i, from 0 through k − 1, connect
each node [i, j] to node [i + 1, j ⋄ 2^(k−i−1)]. The expression j ⋄ 2^(k−i−1) inverts bit (k − i − 1) (counting
from the right, starting at 0) of j, and creates the diagonal edges that form the butterfly pattern. For
example, when k = 3, in rank 0 (i = 0), the node in column j is connected to the node in rank 1 in
column j ⋄ 2^(k−1), so the nodes 0, 1, 2, and 3 in rank 0 are connected to the nodes 4, 5, 6, and 7
respectively in rank 1, and nodes 4, 5, 6, and 7 in rank 0 are connected to nodes 0, 1, 2, and 3
respectively in rank 1. In rank 2, [i, j] is connected to [i + 1, j ⋄ 1], so nodes 0, 2, 4, and 6 connect to
1, 3, 5, and 7 respectively, and 1, 3, 5, and 7 connect to 0, 2, 4, and 6 respectively.3 (A short sketch
implementing this construction appears below.)

3 We can express the operation j ⋄ 2^m using arithmetic operators via the following equivalence, but it is harder to follow
and uses many more operations: j ⋄ 2^m = 2^(m+1) · (j/2^(m+1)) + 2^m · ((j/2^m) + 1) % 2 + j % 2^m, where / denotes integer division.
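
Method 1 can be written out almost verbatim. The following sketch is ours (not from the notes); it builds the edge list of a butterfly of order k and looks at the neighbors of the rank-0 node in column 2:

def butterfly(k):
    """Edges of a butterfly of order k; nodes are (rank, column) pairs, built per Method 1."""
    edges = []
    for i in range(k):                    # ranks 0 .. k-1
        for j in range(2 ** k):
            edges.append(((i, j), (i + 1, j)))                      # straight edge down the column
            edges.append(((i, j), (i + 1, j ^ 2 ** (k - i - 1))))   # cross edge: flip bit k-i-1 of j
    return edges

edges = butterfly(3)
print(len(edges))                              # 48 = 2k * 2^k edges when k = 3
print([w for (v, w) in edges if v == (0, 2)])  # [(1, 2), (1, 6)]: rank-0 column 2 reaches columns 2 and 6 of rank 1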

Method 2: This is a recursive definition of a butterfly network.

• A butterfly network of order 0 consists of a single node, labeled [0, 0].


• To form a butterfly network of order k + 1, replicate the butterfly network of order k, labeling the
corresponding nodes in the copy by adding 2^k to their column numbers. Place the copy to the right of
the original so that nodes remain in the same ranks. Add one to all rank numbers and create a new
rank 0 containing 2^(k+1) nodes. For each j = 0, 1, ..., 2^(k+1) − 1, connect the node in column j of the
new rank 0 to two nodes in rank 1: the node in column j, and the node in the corresponding column of
the other half, namely column j ⋄ 2^k.

This recursive sequence is illustrated in Figures 2.8 and 2.9.

Figure 2.8: Recursive construction of butterfly network of order 1 from order 0.

Figure 2.9: Recursive construction of butterfly network of order 2 from order 1.

It can be shown that there is a path from any node in the first rank to any node in the last rank. If this is the
case, then the diameter is 2k: for example, to get from the node in column 0 of rank 0 to the node in column 7
of rank 0 in the order-3 network requires descending to rank 3 and returning along a different path. If,
however, the last rank is really the same as the first rank, which is
sometimes the case, then the diameter is k. Figure 2.10 schematically represents this type of network with
dashed lines between the first and last ranks.

Figure 2.10: Butterfly network with first and last ranks representing the same node.
The bisection width is 2^k. To split the network requires deleting all edges that cross between columns
2^(k−1) − 1 and 2^(k−1). Only the nodes in rank 0 have connections that cross this divide. There are 2^k nodes in
rank 0, and one edge from each to a node on the other side of the imaginary dividing line. If n = (k + 1)2^k,
then 2^k ≈ n/ log n. This network topology has a fixed number of edges per node, 4 in total; however, the
maximum edge length will increase as the order of the network increases.
One last observation: in Figure 2.7, imagine taking each column and enclosing it in a box. Call each
box a node. There are 2^k such nodes. Any edge that was incident to any node within the box is now
considered incident to the new node (i.e., the box). The resulting network contains 2^k nodes connected in a
k-dimensional hypercube. This is the relationship between butterfly and hypercube networks.
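
This last observation can also be checked mechanically. In the short sketch below (our own code), each (rank, column) node is identified with its column; the distinct column-to-column edges that remain are exactly the one-bit-difference edges of a k-dimensional hypercube:

k = 3
# All butterfly edges ((i, j), (i+1, j ^ d)), where d = 0 gives the straight edges
# and d = 2^(k-i-1) gives the cross edges.
bfly = {((i, j), (i + 1, j ^ d)) for i in range(k) for j in range(2 ** k)
        for d in (0, 2 ** (k - i - 1))}

# Collapse each column to a single node: keep only edges between distinct columns.
collapsed = {(min(a[1], b[1]), max(a[1], b[1])) for a, b in bfly if a[1] != b[1]}

# Edges of a k-dimensional hypercube: column labels that differ in exactly one bit.
hcube = {(v, v ^ (1 << b)) for v in range(2 ** k) for b in range(k) if v < (v ^ (1 << b))}

print(collapsed == hcube)   # True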

2.3 Interconnection Networks


We now focus on actual, physical connections. An interconnection network is a system of links that
connects one or more devices to each other for the purpose of inter-device communication. In the context
of computer architecture, an interconnection network is used primarily to connect processors to processors,
or to allow multiple processors to access one or more shared memory modules. Sometimes they are used to
connect processors with locally attached memories to each other. The way that these entities are connected
to each other has a significant effect on the cost, applicability, scalability, reliability, and performance of a
parallel computer. In general, the entities that are connected to each other, whether they are processors or
memories, will be called nodes.
An interconnection network may be classified as shared or switched. A shared network can have at most one
message on it at any time. For example, a bus is a shared network, as is traditional Ethernet. In contrast, a
switched network allows point-to-point messages among pairs of nodes and therefore supports the transfer
of multiple concurrent messages. Switched Ethernet is, as the name implies, a switched network. Shared
networks are inferior to switched networks in terms of performance and scalability. Figure 2.11 illustrates a
shared network. In the figure, one node is sending a message to another, and no other message can be on
the network until that communication is completed.

Figure 2.11: A shared network connecting 4 nodes.

Figure 2.12 depicts a switched network in which two simultaneous connections are taking place, indicated
by the dashed lines.

Figure 2.12: A switched network connecting 4 nodes.

2.3.1 Interconnection Network Topologies


Notation. When we depict networks in these notes, we will follow the same convention as the Quinn book
and use squares to represent processors and/or memories, and circles to represent switches. For example,
in Figure 2.13, there are four processors on the bottom row and a binary tree of seven switches connecting
them.
We distinguish between direct and indirect topologies. In a direct topology, there is exactly one switch
for each processor node, whereas in an indirect topology, the number of switches is greater than the number
of processor nodes. In Figure 2.13, the topology is indirect because there are more switches than processor
nodes.

Figure 2.13: Binary tree interconnection network. The circles are switches and the squares are processors.

Certain topologies are usually used as direct topologies, others as indirect topologies. In particular,

• The 2D mesh is almost always used as a direct topology, with a processor attached to each switch, as
shown in Figure 2.14.
• Binary trees are always indirect topologies, acting as a switching network to connect a bank of proces-
sors to each other, as shown in Figure 2.13.


• Butterfly networks are always indirect topologies; the processors are connected to rank 0, and the last
rank is connected either to memory modules or to switches that route back to the processors.
• Hypercube networks are always direct topologies.

Figure 2.14: 2D mesh interconnection network, with processors (squares) attached to each switch.

2.4 Vector Processors and Processor Arrays


There are two essentially different models of parallel computers: vector computers and multiprocessors. A
vector computer is simply a computer that has an instruction that can operate on a vector. A pipelined
vector processor is a vector processor that can issue a vector instruction that operates on all of the elements
of the vector in parallel by sending those elements through a highly pipelined functional unit with a fast
clock. A processor array is a vector processor that achieves parallelism by having a collection of identical,
synchronized processing elements (PE), each of which executes the same instruction on different data,
and which are controlled by a single control unit. Because all PEs execute the same instruction at the same
time, this type of architecture is suited to problems with data parallelism. Processor arrays are often used
in scientific applications because these often contain a high amount of data parallelism.

2.4.1 Processor Array Architecture


In a processor array, each PE has a unique identifier, its processor id, which can be used during the compu-
tation. Each PE has a small, local memory in which its own private data can be stored. The data on which
each PE operates is distributed among the PEs’ local memories at the start of the computation, provided
it fits in that memory. The control unit, which might be a full-fledged CPU, broadcasts the instruction to
be executed to the PEs, which execute it on data from their local memories, and they can store the results
in their local memories or return global results back to the CPU. A global result line is usually a separate,
parallel bus that allows each PE to transmit values back to the CPU to be combined by a parallel, global
operation, such as a logical-and or a logical-or, depending upon the hardware support in the CPU. Figure
2.15 contains a schematic diagram of a typical processor array. The PEs are connected to each other through
an interconnection network that allows them to exchange data with each other as well.

Figure 2.15: Typical processor array architecture.
If the processor array has N PEs, then the time it takes to perform the same operation on N elements is
the same as to perform it on one element. If the number of data items to be manipulated is larger than N,
then it is usually the programmer's job to arrange for the additional elements to be stored in the PEs’ local
memories and operated on in the appropriate sequence. For example, if the machine has 1024 PEs and an
array of size 5000 must be processed, then since 5000 = 4 · 1024 + 904, the 5000 elements would be distributed
among the PEs by giving 5 elements to 904 of the PEs and 4 elements to the remaining 120 PEs.
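
A minimal sketch of such a distribution (our own helper, with made-up names): every PE receives either floor(m/N) or ceil(m/N) elements.

def distribute(m, n_pe):
    """Split m data elements over n_pe processing elements as evenly as possible."""
    base, extra = divmod(m, n_pe)            # 'extra' PEs receive one additional element
    return [base + 1] * extra + [base] * (n_pe - extra)

counts = distribute(5000, 1024)
print(counts.count(5), counts.count(4))      # 904 120: 904 PEs hold 5 elements, 120 hold 4
print(sum(counts))                           # 5000
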
The topology of the interconnection network determines how easy it is to perform different types of compu-
tations. If the interconnection network is a butterfly network, for example, then messages between PEs will
travel many links in each communication, slowing down the computation greatly. If the machine is designed
for fast manipulation of two-dimensional data sets, such as images or matrices, then the interconnection
network would be a 2D mesh arranged as a direct topology, as shown in Figure 2.14.

2.4.2 Processor Array Performance


Although the subject of parallel algorithm performance will be covered in depth in a later chapter, we
introduce a measure of computer performance here, namely the number of operations per second that a
computer can execute. There are many other metrics by which a computer’s performance can be evaluated;
this is just one common one.
If a processor array has 100 processing elements, and each is kept busy all of the time and executes H
operations per second, then the processor array executes 100H operations per second. If, at any moment
in time, on average, only a fraction f of the PEs are active, then the performance is reduced to 100f H
operations per second. The average fraction of PEs that are actively executing instructions at any time is
called the utilization of the processor array.
Several factors influence this utilization. One, as alluded to above, is whether or not the number of data
elements to be processed is a multiple of the number of PEs. If it is not, then at some point there will be
idle PEs while others are active.
Example 6. A processor array has 512 PEs. Two arrays A and B containing 512 floating point numbers each
are to be added and stored into a third array C. The PEs can execute a floating point addition (including
fetching and storing) in 100 nanoseconds. The performance of this processor array on this problem would be

    512 operations / (100 × 10^−9 seconds) = 5.12 × 10^9 flops.

The term flops is short for floating-point operations per second.
Suppose the size of the array is 700 instead. In this case the first 512 elements will be processed in parallel
by all PEs, taking 100 nanoseconds in total, but the remaining 188 elements will require only 188 PEs to be
active, and another 100 nanoseconds will elapse. Therefore the performance will be

    700 operations / (200 × 10^−9 seconds) = 3.5 × 10^9 flops.


This is less than 70% of peak performance.
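
The arithmetic in Example 6 generalizes: with N PEs and an operation time of t seconds, processing m elements takes ceil(m/N) passes. A hypothetical helper of our own that reproduces the two figures above:

import math

def flops(m, n_pe, op_time_s):
    """Operations per second when m elements are processed in ceil(m / n_pe) parallel passes."""
    passes = math.ceil(m / n_pe)
    return m / (passes * op_time_s)

print(flops(512, 512, 100e-9))   # 5.12e9, the peak rate
print(flops(700, 512, 100e-9))   # 3.5e9, about 68% of peak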

2.4.3 Processor Masking


A processor array also has to handle what PEs do when they do not need to participate in a computation,
as in the preceding example. This problem also arises when conditional instructions such as if-then-else
instructions are executed. In this case, it is possible that the true-branch will need to be taken in some PEs
and the false-branch in others. The problem is that the control unit issues a single instruction to all PEs –
there is no mode in which some PEs execute one instruction and others, a different one. Therefore, the way
a processor array has to work is that when the true branch is executed, the PEs that are not supposed to
execute that instruction must be de-activated, and when the false branch is executed, the ones that are not
supposed to execute that instruction must be de-activated.
This de-activation is achieved by giving the PEs a mask bit, which they can set and unset. When a
condition is tested by a PE, if it must be de-activated in the next instruction, it sets its mask bit to prevent
it from executing. After the instruction sequence for the true branch is complete, the control unit issues an
instruction to each PE to invert its mask bit. The ones that were masked out are unmasked, and the ones
that had been active become inactive because their mask bits are set. Then the false branch instructions are
executed and the control unit sends an instruction to all PEs to clear their mask bits. It should be apparent
that when a program contains a large fraction of conditionally executed code, the fraction of idle PEs increases
and the performance decreases.
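
To make the masking idea concrete, here is a toy simulation of our own (it is not how any particular processor array implements masking). Every "PE" holds one value of x, the same instruction sequence is applied to all of them, and the mask bits decide which PEs the true and false branches of y = |x| actually affect:

x = [3, -1, 4, -5, 9, -2, 6, -7]            # one data value per PE
y = [0] * len(x)
masked = [xi < 0 for xi in x]               # a PE sets its mask bit if its condition is false

# True branch (y = x): only PEs whose mask bit is clear execute it.
for pe in range(len(x)):
    if not masked[pe]:
        y[pe] = x[pe]

masked = [not m for m in masked]            # the control unit has every PE invert its mask bit

# False branch (y = -x): now the other PEs execute while the rest sit idle.
for pe in range(len(x)):
    if not masked[pe]:
        y[pe] = -x[pe]

print(y)                                    # [3, 1, 4, 5, 9, 2, 6, 7]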

2.4.4 Summary
• Processor arrays are suitable for highly data parallel problems, but not those that have little data
parallelism.
• When a program has a large fraction of conditionally-executed code, a processor array will not perform
well.
• Processor arrays are designed to solve a single problem at a time, running to completion, as in batch
style processing, because context switching is too costly on them.
• The cost of the front end of a processor array and the interconnection network is high and this cost
must be amortized over a large number of PEs to make it cost effective. For this reason, processor
arrays are most cost-effective when the number of PEs is very large.
• One of the primary reasons that processor arrays became popular was that the control unit of a
processor was costly to build. As control units have become less expensive, processor arrays have
become less competitively priced in comparison to multicomputers.

2.5 Multiprocessors
In keeping with Quinn’s usage[2], we will use the term multiprocessor to mean a computer with multiple
CPUs and a shared memory. (This is what many others call a shared memory multiprocessor.) In a
multiprocessor, the same address generated on two different CPUs refers to the same memory location.
Multiprocessors are divided into two types: those in which the shared memory is physically in one place,
and those in which it is distributed among the processors.

2.5.1 Centralized (Shared Memory) Multiprocessors


In a centralized memory multiprocessor, all processors have equal access to the physical memory. This type of
multiprocessor is also called a uniform memory access (UMA) multiprocessor. A UMA multiprocessor
may also be called a symmetric multiprocessor, or SMP. UMA multiprocessors are relatively easy to
build because additional processors can be added to the bus of a conventional uniprocessor machine. Because
modern caches greatly reduce the need for primary memory accesses, the increase in bus traffic does not
become a major performance issue until the number of CPUs is more than a few dozen or so.
Because there is a shared physical memory, independent processes running on separate CPUs can share
data; it becomes the programmer’s responsibility to ensure that the data is accessed without race conditions;
usually the hardware in this type of machine provides various instructions to make this easier for the pro-
grammer, such as barrier synchronization primitives, or semaphore operations. A barrier synchronization
instruction is an instruction that, when it is executed by a process, causes that process to wait until all other
cooperating processes have reached that same instruction in their code. This will be described in more detail
in a later chapter. Semaphores, and semaphore operations, are also a means to allow processes to cooperate
in how they access a region of their code that accesses a shared data item, but these too require that the
programmer use them correctly. The processes can also have their own private data, which is data used only
by a single process. Figure 2.16 depicts a UMA multiprocessor with four CPUs.

Figure 2.16: UMA multiprocessor.
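
Barriers are treated in a later chapter, but a minimal illustration with ordinary Python threads (standing in for the CPUs of a UMA machine; the two-phase structure is our own invention) shows the idea: no thread proceeds past barrier.wait() until all cooperating threads have reached it, so phase 2 can safely read what every thread wrote in phase 1.

import threading

n_threads = 4
barrier = threading.Barrier(n_threads)
results = [0] * n_threads                    # shared memory: each thread writes only its own slot

def worker(tid):
    results[tid] = tid * tid                 # phase 1: compute a private partial result
    barrier.wait()                           # wait here until every cooperating thread arrives
    if tid == 0:                             # phase 2: all slots are now filled in
        print(sum(results))                  # 0 + 1 + 4 + 9 = 14

threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()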

The hardware designer must address the cache coherence problem. In a shared memory multiprocessor
with separate caches in each processor, the problem is that the separate caches can have copies of the same
memory blocks, and unless measures are taken to prevent it, the copies will end up having different values
for the blocks. This will happen whenever two different processors modify their copies with different values
and nothing is done to propagate the changes to other caches. Usually, a snooping cache protocol is used
in these types of machines. We will not explain it here.

2.5.2 Distributed (Shared Memory) Multiprocessors


In a centralized memory multiprocessor, the shared access to a common memory through a bus limits
how many CPUs can be accommodated. The alternative is to attach separate memory modules to each
processor. When all processors can access the memory modules attached to all other processors, it is called a
distributed multiprocessor, or a non-uniform memory access (NUMA) multiprocessor. Because
it is much faster to access the local memory attached directly to the processor than the modules attached to
other processors, the access is not uniform, hence the name. Executing programs tend to obey the principles
of spatial locality and temporal locality. Spatial locality means that the memory locations that they
access tend to be near each other in small windows of time, and temporal locality means that memory
locations that are accessed once tend to be accessed frequently in a small window of time. This behavior
can be used to advantage so that most of the process’s references to memory are to its own local memory.
When shared memory is distributed in this way, the cache coherence problem cannot be solved with snooping
protocols because the “snooping” becomes inefficient for large numbers of processors. On these machines, a
directory-based protocol is used. We will not describe that protocol here. The interested reader is referred
to the Quinn book [2].

2.6 Multicomputers
A multicomputer is a distributed memory, multiple-CPU computer, but the memory is not shared. Each
CPU has its own address space and can access only its own local memory, which is called private memory.
Thus, the same address on two different CPUs refers to two different memory locations. These machines are
also called private-memory multiprocessors. Because there is no shared address space, the only way for
processes running on different CPUs to communicate is through some type of message-passing system, and
the architecture of this type of multicomputer typically supports efficient message-passing.
A commercial multicomputer is a multicomputer designed, manufactured, and intended to be sold as a
multicomputer. A commodity cluster is a multicomputer put together out of off-the-shelf components.
A commercial multicomputer’s interconnection network and processors are optimized to work with each
other, providing low-latency, high-bandwidth connections between the computers, at a higher price tag than
a commodity cluster. Commodity clusters, in contrast, generally have lower performance, with higher latency
and lower bandwidth in the interprocessor connections.

2.6.1 Asymmetrical Multicomputers


Some multicomputers are designed with a special front-end computer and back-end computers, the idea
being that the front-end serves as the master and gateway for the machine, and the back-end CPUs are
for computation. These types of machines are called asymmetrical multicomputers. Users login to the
front-end and their jobs are run on the back-end processors, and all I/O takes place through the front-end
machine. The software that runs on the front and back ends is also different. The front-end has a more
powerful and versatile operating system and compilers, whereas the back-end processors have scaled-down
operating systems that require less memory and resources. See Figure 2.17.
The problems with this design are that:

• the front-end is a bottleneck;

• it does not scale well because of the front-end, as the performance of the front-end host limits the
number of users and the number of jobs;
• the pared-down operating systems on the back-end do not allow for sophisticated debugging tools;
• the parallel programs must be written in two parts – the part that runs on the front-end and interacts
with the I/O devices and the user, and the part that runs on the back-end.

These last two problems were such an impediment that many asymmetrical multicomputers include advanced
debugging facilities and I/O support on the back-end hosts.

2.6.2 Symmetrical Multicomputers


A symmetrical multicomputer is one in which all of the hosts are identical and are connected to each
other through an interconnection network. Users can login to any host and the file system and I/O devices
are equally accessible from every host. This overcomes the bottleneck of a front-end host as well as the
scalability issue, but it has many other problems. It is not really a suitable environment for running large
scale parallel programs, because there is little control over how many jobs are running on any one host, and
all hosts are designed to allow program development and general interactive use, which degrades performance
of a highly compute-bound job. See Figure 2.18.

Figure 2.17: Asymmetrical multicomputer. Users interact with the front-end computer, which is connected to the back-end computers through an interconnection network.

Figure 2.18: Symmetrical multicomputer. All hosts are identical and are connected to each other and to shared disk storage through an interconnection network.

2.7 Flynn’s Taxonomy


In 1966, Michael Flynn [3] proposed a categorization of parallel hardware based upon a classification scheme
with two orthogonal parameters: the instruction stream and the data stream. In his taxonomy, a machine was
classified by whether it has a single or multiple instruction streams, and whether it has single or multiple
data streams. An instruction stream is a sequence of instructions that flow through a processor, and a
data stream is a sequence of data items that are operated on by a processor. For example, the ordinary
single-processor computer has a single instruction stream and a single data stream.
This scheme leads to four independent types of computer:

SISD single instruction, single data; i.e., a conventional uni-processor.

SIMD single instruction, multiple data; e.g., the MMX or SSE instructions in the x86 processor series,
processor arrays, and pipelined vector processors. A SIMD machine issues a single instruction
that operates on multiple data items simultaneously. Vector processors are SIMD machines,
which means that processor arrays and pipelined vector processors are also SIMD machines.


MISD multiple instruction, single data; very rare but one example is the U.S. Space Shuttle flight
controller. Systolic arrays fall into this category as well.
MIMD multiple instruction, multiple data; SMPs, clusters. This is the most common multiprocessor.
MIMD multiprocessors are more complex and expensive, and so the number of processors tends
to be smaller than in SIMD machines. Today’s multiprocessors are found on desktops, with
anywhere from 2 to 16 processors. Multiprocessors and multicomputers fall into this category.

Acknowledgments
Several students have found mistakes in the notes over the years and these have been corrected. I thank
Jaspal Singh for discovering a mistake in the formula used to calculate the target of the cross edges in the
butterfly network in Method 1.

References

[1] C. Wu and T. Feng. Tutorial, interconnection networks for parallel and distributed processing. Tutorial
Texts Series. IEEE Computer Society Press, 1984.

[2] M.J. Quinn. Parallel Programming in C with MPI and OpenMP. McGraw-Hill Higher Education, 2004.
[3] Michael J. Flynn. Some computer organizations and their effectiveness. IEEE Trans. Comput., 21(9):948–
960, September 1972.
