Chapter 2: Parallel Architectures and Interconnection Networks
“The interconnection network is the heart of parallel architecture.” - Chuan-Lin Wu and Tse-Yun Feng [1]
2.1 Introduction
You cannot really design parallel algorithms or programs without an understanding of some of the key
properties of various types of parallel architectures and the means by which components can be connected
to each other. Parallelism has existed in computers since their inception, at various levels of design. For
example, bit-parallel memory has been around since the early 1970s, and simultaneous I/O processing (using
channels) has been used since the 1960s. Other forms of parallelism include bit-parallel arithmetic in the
arithmetic-logic unit (ALU), instruction look-ahead in the control unit, direct memory access (DMA), data
pipelining, and instruction pipelining. The parallelism that we will discuss is at a higher level; in particular,
we will look at processor arrays, multiprocessors, and multicomputers. We begin, however, by exploring the
mathematical concept of a network topology.
• Binary tree

[Figure 2.1: A binary tree topology with seven nodes, labeled 1 through 7; node 1 is the root.]
Definition 2. The distance between a pair of nodes is the length of the shortest path between the nodes.
For example, in Figure 2.1, the distance between nodes 4 and 7 is 4, whereas the distance between nodes 6
and 7 is 2.
Definition 3. The diameter of a network topology is the largest distance between any pair of nodes in the
network.
The diameter of the network in Figure 2.1 is 4, since the distance between nodes 4 and 7 is 4, and there is no
pair of nodes whose distance is greater than 4. Diameter is important because, if nodes represent processors
that must communicate via the edges, which represent communication links, then the diameter determines
a lower bound on the communication time. (Note that it is a lower bound and not an upper bound; if a
particular algorithm requires, for example, that all pairs of nodes send each other data before the next step
of a computation, then the diameter determines how much time will elapse before that step can begin.)
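To make Definitions 2 and 3 concrete, here is a small sketch (an addition, not part of the original notes) that computes distances and the diameter of a topology given as an adjacency list, using breadth-first search. The seven-node tree used to exercise it is a hypothetical heap-ordered binary tree and may not match the exact labeling of Figure 2.1; saving the file as, say, topology_metrics.py (a name chosen here) lets the later sketches reuse diameter().

```python
from collections import deque

def distances_from(adj, src):
    """Shortest distance, in edges, from src to every reachable node (breadth-first search)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """Largest distance between any pair of nodes (assumes a connected topology)."""
    return max(max(distances_from(adj, u).values()) for u in adj)

if __name__ == "__main__":
    # A hypothetical seven-node binary tree in heap order: node 1 is the root,
    # and node i has children 2i and 2i+1.  Its diameter is 4.
    tree = {1: [2, 3], 2: [1, 4, 5], 3: [1, 6, 7], 4: [2], 5: [2], 6: [3], 7: [3]}
    assert diameter(tree) == 4
```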
Definition 4. The bisection width of a network topology is the smallest number of edges that must be
deleted to sever the set of nodes into two sets of equal size, or size differing by at most one node.
In Figure 2.1, the single edge between nodes 1 and 2 can be deleted, splitting the set of nodes into two sets, one with three nodes (node 2's subtree) and one with four (the rest of the tree). Therefore, the bisection width of this network is 1. Bisection width is important because it can determine the total communication time. A low bisection width is bad, and a high one is good. Consider the extreme case, in which a network can be split by removing one edge. All data that flows from one half to the other must pass through this edge; it is a bottleneck through which all data must pass sequentially, like a one-lane bridge in the middle of a four-lane highway. In contrast, if the bisection width is high, then many edges must be removed to split the node set. This means that there are many paths from one side to the other, and data can flow with a high degree of parallelism from one half of the nodes to the other.
Definition 5. The degree of the network topology is the maximum number of edges that are incident to a
node in the topology.
The maximum number of edges per node can affect how well the network scales as the number of processors
increases, because of physical limitations on how the network is constructed. A binary tree, for example,
has the property that the maximum number of edges per node is 3, regardless of how many nodes are in
the tree. This is good, because the physical design need not change to accommodate the increase in number
of processors. Not all topologies have a constant degree. If the degree increases with network size, this
generally means that more connections need to be made to each node. Nodes might represent switches, or
processors, and in either case they have a fixed pin-out, implying that the connections between processors
must be implemented by a complex fan-out of the wires, a very expensive and potentially slow mechanism.
Although the edges in a network topology do not have length, we assume that nodes cannot be infinitely
small. As a consequence, the definition of the topology itself can imply that, as the number of nodes
increases, the physical distance between them must increase. Maximum edge length is a measure of this
property. It is important because the communication time is a function of how long the signals must travel.
It is best if the network can be laid out in three-dimensional space so that the maximum edge length is a
constant, independent of network size. If not, and the edge length increases with the number of processors,
then communication time increases as the network grows. This implies that expanding the network to
accommodate more processors can slow down communication time. The binary tree in Figure 2.1 does not
have a constant maximum edge length, because as the size of the tree gets larger, the leaf nodes must be
placed further apart, which in turn implies that eventually the edges that leave the root of the tree must get
longer.
The diameter of a q-dimensional mesh network with k^q nodes is q(k − 1). To see this, note that the farthest distance between nodes is from one corner to the diagonally opposite one. An inductive argument is as follows. In a 2-dimensional lattice with k^2 nodes, you have to travel (k − 1) edges horizontally and (k − 1) edges vertically, in any order, to reach the opposite corner; thus you must traverse 2(k − 1) edges. Now suppose we have a mesh of dimension q − 1, q ≥ 3, whose diameter, by the inductive hypothesis, is (q − 1)(k − 1). A mesh of one higher dimension consists of k copies of the (q − 1)-dimensional mesh placed side by side. To get from one corner to the diagonally opposite one, you first travel to the opposite corner within your copy of the (q − 1)-dimensional mesh, crossing (q − 1)(k − 1) edges by the hypothesis, and then cross (k − 1) more edges to reach the k-th copy in the new dimension. Thus you travel a total of (q − 1)(k − 1) + (k − 1) = q(k − 1) edges. This is not rigorous, but it conveys the idea of the proof.
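The inductive argument can be spot-checked numerically. The sketch below (again an illustration, not from the notes) builds a q-dimensional mesh as an adjacency list and verifies that its diameter, measured by breadth-first search, equals q(k − 1); it imports diameter() from the earlier sketch under the assumed file name topology_metrics.py.

```python
from itertools import product
from topology_metrics import diameter   # the BFS sketch shown earlier (assumed file name)

def mesh(q, k):
    """q-dimensional mesh: nodes are coordinate tuples, edges join nodes differing by 1 in one dimension."""
    adj = {node: [] for node in product(range(k), repeat=q)}
    for node in adj:
        for d in range(q):
            if node[d] + 1 < k:                          # add the edge one step along dimension d
                nbr = node[:d] + (node[d] + 1,) + node[d + 1:]
                adj[node].append(nbr)
                adj[nbr].append(node)
    return adj

# Spot-check the formula q(k - 1) on a couple of small meshes.
for q, k in [(2, 4), (3, 3)]:
    assert diameter(mesh(q, k)) == q * (k - 1)
```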
If k is an even number, the bisection width of a q-dimensional mesh network with k^q nodes is k^(q−1). Consider the 2D mesh of Figure 2.3. To split it into two halves, you can delete 6 = 6^1 edges. Imagine the 3D mesh with 216 nodes. To split it into two halves, you can delete the 36 = 6^2 vertical edges connecting the 36 nodes in the third plane to the 36 nodes in the fourth plane. In general, one can delete the edges that connect the two adjacent copies of the (q − 1)-dimensional lattice in the middle of the q-dimensional lattice. There are k^(q−1) such edges. This is a very high bisection width. One can prove by an induction argument that the bisection width when k is odd is (k^q − 1)/(k − 1). Thus, whether k is even or odd, the bisection width is Θ(k^(q−1)). Since the number of nodes in the mesh is n = k^q, as a function of n the bisection width is Θ(n/n^(1/q)), or equivalently Θ(n^((q−1)/q)), which is very high.
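As a quick numeric illustration (added here, not in the notes): for the 6 × 6 × 6 mesh mentioned above, k = 6 and q = 3, so the bisection width k^(q−1) and its restatement in terms of n agree.

```python
k, q = 6, 3
n = k ** q                                  # 216 nodes
assert k ** (q - 1) == 36                   # edges removed by the middle bisection
assert round(n ** ((q - 1) / q)) == 36      # the same count, written as a function of n
```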
The degree in a mesh is fixed for each given q: it is always 2q . The maximum edge length is also a constant,
independent of the mesh size, for two- and three-dimensional meshes. For higher dimensional meshes, it is
not constant.
An extension of the mesh is the torus. A torus, the 2-dimensional version of which is illustrated in Figure 2.4, extends a mesh by adding edges between the exterior nodes in each row and in each column; in higher dimensions, it adds edges between the exterior nodes in each dimension. It is called a torus because the surface formed by wrapping the nodes and edges with a thin film would be a mathematical torus, i.e., a doughnut. A torus, or toroidal mesh, has a lower diameter than a non-toroidal mesh by roughly a factor of 2: because of the wrap-around edges, no dimension requires more than ⌊k/2⌋ steps, so the diameter is q⌊k/2⌋.
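A short sketch (illustrative only, reusing diameter() from the earlier sketch under its assumed file name) confirms the effect of the wrap-around edges on a small example.

```python
from itertools import product
from topology_metrics import diameter   # BFS diameter from the first sketch (assumed file name)

def torus(q, k):
    """q-dimensional torus: k nodes per dimension, with neighbors wrapping around modulo k."""
    adj = {}
    for node in product(range(k), repeat=q):
        nbrs = set()
        for d in range(q):
            for step in (-1, 1):
                nbrs.add(node[:d] + ((node[d] + step) % k,) + node[d + 1:])
        adj[node] = sorted(nbrs - {node})   # discard duplicates and self-loops for tiny k
    return adj

q, k = 2, 5
assert diameter(torus(q, k)) == q * (k // 2)   # 4 here, versus q*(k - 1) = 8 for the plain mesh
```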
[Figure 2.5: Hypercubes of dimensions 1, 2, and 3, with nodes labeled by binary strings: 0 and 1; 00 through 11; and 000 through 111.]
It is now not hard to see how we can create hypercubes of arbitrary dimension, though drawing them becomes a bit cumbersome; a 4-cube is illustrated in Figure 2.6. The node labels will play an important role in our understanding of the hypercube. Observe the following:
• The labels of two nodes differ by exactly one bit change if and only if they are connected by an edge.
• In an n-dimensional hypercube, each node label is represented by n bits. Each of these bits can be
inverted (0->1 or 1->0), implying that each node has exactly n incident edges. In the 4D hypercube,
for example, each node has 4 neighbors. Thus the degree of an n-cube is n.
• The diameter of an n-dimensional hypercube is n. To see this, observe that a given integer represented with n bits can be transformed into any other n-bit integer by changing at most n bits, one bit at a time; this corresponds to a walk across at most n edges from the first label to the second. Moreover, two complementary labels, such as all zeros and all ones, differ in all n bits, so some pairs of nodes are at distance exactly n.
• The bisection width of an n-dimensional hypercube is 2^(n−1). One way to see this is to realize that all nodes can be thought of as lying in one of two planes: pick any bit position and call it b. The nodes whose bit b is 0 are in one plane, and those whose bit b is 1 are in the other. To split the network into two sets of nodes, one in each plane, one has to delete the edges connecting the two planes. Every node in the 0-plane is attached to exactly one node in the 1-plane by one edge. There are 2^(n−1) such pairs of nodes, and hence 2^(n−1) edges. No smaller set of edges can be cut to split the node set.
• The number of edges in an n-dimensional hypercube is n · 2^(n−1). To see this, note that it is true when n = 0, as there are no edges in the 0-cube. Assume it is true for all k < n. A hypercube of dimension n consists of two hypercubes of dimension n − 1, with one edge between each pair of corresponding nodes in the two smaller hypercubes; there are 2^(n−1) such edges. Thus, using the inductive hypothesis, the hypercube of dimension n has 2 · (n − 1) · 2^(n−2) + 2^(n−1) = (n − 1) · 2^(n−1) + 2^(n−1) = n · 2^(n−1) edges. By the principle of induction, the claim holds for all n.
The bisection width is very high (one half the number of nodes), and the diameter is low. This makes the hypercube an attractive organization. Its primary drawbacks are that (1) the number of edges per node is a (logarithmic) function of network size, making it difficult to scale up, and (2) the maximum edge length increases as network size increases.
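The properties listed above can be verified computationally for a small n. The sketch below (an addition to the notes) builds an n-cube by connecting labels that differ in one bit and checks the degree, diameter, edge count, and bisection cut; it reuses diameter() from the first sketch under its assumed file name.

```python
from topology_metrics import diameter   # BFS diameter from the first sketch (assumed file name)

def hypercube(n):
    """n-cube: nodes are the integers 0..2^n - 1, edges join labels that differ in exactly one bit."""
    return {v: [v ^ (1 << b) for b in range(n)] for v in range(2 ** n)}

n = 4
cube = hypercube(n)
assert all(len(nbrs) == n for nbrs in cube.values())                        # degree = n
assert diameter(cube) == n                                                  # diameter = n
assert sum(len(nbrs) for nbrs in cube.values()) // 2 == n * 2 ** (n - 1)    # n * 2^(n-1) edges
# Count the edges that cross between the "bit b = 0" plane and the "bit b = 1" plane.
b = 0
crossing = sum(1 for v in cube for w in cube[v]
               if v < w and ((v >> b) & 1) != ((w >> b) & 1))
assert crossing == 2 ** (n - 1)                                             # bisection width
```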
[Figure 2.7: A butterfly network of order k = 3, with eight columns labeled 000 through 111 and four ranks, labeled 0 through 3.]
Method 1:
It can be shown that there is a path from any node in the first rank to any node in the last rank. If that is the case, then the diameter is 2k: to get from the node in column 0 to the node in column 7 of the first rank (rank 0) requires descending to rank 3 and returning along a different path. If, however, the last rank is really the same as the first rank, which is sometimes the case, then the diameter is k. Figure 2.10 schematically represents this type of network with dashed lines between the first and last ranks.

Figure 2.10: Butterfly network with first and last ranks representing the same node.
The bisection width is 2^k. To split the network requires deleting all edges that cross between columns 2^(k−1) − 1 and 2^(k−1). Only the nodes in rank 0 have connections that cross this divide. There are 2^k nodes in rank 0, and one edge from each to a node on the other side of the imaginary dividing line. If n = (k + 1)2^k, then 2^k ≈ n/log n. This network topology has a fixed number of edges per node (at most 4); however, the maximum edge length increases as the order of the network increases.
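These counts can be reproduced with a short sketch. The construction below is an illustration only: it assumes a labeling, not spelled out in the surviving text, in which the cross edge leaving rank i inverts bit k − 1 − i of the column number, which is consistent with the observation that only rank-0 edges cross the middle divide.

```python
from collections import Counter

def butterfly(k):
    """Butterfly of order k: nodes are (rank, column) pairs, ranks 0..k, columns 0..2^k - 1.
    Assumed labeling: rank i connects straight down and via inverting bit k-1-i of the column."""
    edges = set()
    for i in range(k):
        for j in range(2 ** k):
            edges.add(((i, j), (i + 1, j)))                         # straight edge
            edges.add(((i, j), (i + 1, j ^ (1 << (k - 1 - i)))))    # cross edge
    return edges

k = 3
edges = butterfly(k)
nodes = {v for e in edges for v in e}
degree = Counter(v for e in edges for v in e)
assert len(nodes) == (k + 1) * 2 ** k          # n = (k+1) * 2^k nodes
assert max(degree.values()) == 4               # at most 4 edges per node
half = 2 ** (k - 1)                            # columns split into [0, half) and [half, 2^k)
crossing = [e for e in edges if (e[0][1] < half) != (e[1][1] < half)]
assert len(crossing) == 2 ** k                 # the bisection cut has 2^k edges ...
assert all(e[0][0] == 0 for e in crossing)     # ... and every one of them leaves rank 0
```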
One last observation: in Figure 2.7, imagine taking each column and enclosing it in a box. Call each box a node. There are 2^k such nodes. Any edge that was incident to any node within the box is now considered incident to the new node (i.e., the box). The resulting network contains 2^k nodes connected in a k-dimensional hypercube. This is the relationship between butterfly and hypercube networks.
and uses many more operations: j ⋄ 2^m = 2^(m+1) · ⌊j/2^(m+1)⌋ + 2^m · ((⌊j/2^m⌋ + 1) mod 2) + (j mod 2^m).
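Although the surrounding comparison is partly lost here, the arithmetic expression itself simply computes the column label obtained by inverting bit m of j, so it agrees with a single XOR. A two-line check (illustrative only):

```python
def diamond(j, m):
    # Keep the bits above m, invert bit m, keep the bits below m.
    return 2 ** (m + 1) * (j // 2 ** (m + 1)) + 2 ** m * ((j // 2 ** m + 1) % 2) + j % 2 ** m

assert all(diamond(j, m) == j ^ (1 << m) for j in range(64) for m in range(6))
```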
shared network. In the figure, one node is sending a message to another, and no other message can be on
the network until that communication is completed.
Figure 2.12 depicts a switched network in which two simultaneous connections are taking place, indicated
by the dashed lines.
[Figure 2.12: Four nodes attached to a switched network, with dashed lines indicating two simultaneous connections.]
Figure 2.13: Binary tree interconnection network. The circles are switches and the squares are processors.
Certain topologies are usually used as direct topologies, others as indirect topologies. In particular,
• The 2D mesh is almost always used as a direct topology, with a processor attached to each switch, as
shown in Figure 2.14.
• Binary trees are always indirect topologies, acting as a switching network to connect a bank of processors to each other, as shown in Figure 2.13.
• Butterfly networks are always indirect topologies; the processors are connected to rank 0, and either
memory modules or switches back to the processors are connected to the last rank.
• Hypercube networks are always direct topologies.
Figure 2.14: 2D mesh interconnection network, with processors (squares) attached to each switch.
[Figure: A processor array. A front-end computer with CPU, memory, and I/O processors connects to I/O devices and drives a bank of processing elements (PEs), each with its own memory (M), joined by an interconnection network.]
travel many links in each communication, slowing down the computation greatly. If the machine is designed
for fast manipulation of two-dimensional data sets, such as images or matrices, then the interconnection
network would be a 2D mesh arranged as a direct topology, as shown in Figure 2.14.
2.4.4 Summary
• Processor arrays are suitable for highly data-parallel problems, but not for problems with little data parallelism.
• When a program has a large fraction of conditionally-executed code, a processor array will not perform
well.
• Processor arrays are designed to solve a single problem at a time, running to completion, as in batch
style processing, because context switching is too costly on them.
• The cost of the front end of a processor array and of the interconnection network is high, and this cost must be amortized over a large number of PEs to make the machine cost-effective. For this reason, processor arrays are most cost-effective when the number of PEs is very large.
• One of the primary reasons that processor arrays became popular was that the control unit of a
processor was costly to build. As control units have become less expensive, processor arrays have
become less competitively priced in comparison to multicomputers.
2.5 Multiprocessors
In keeping with Quinn’s usage[2], we will use the term multiprocessor to mean a computer with multiple
CPUs and a shared memory. (This is what many others call a shared memory multiprocessor.) In a
multiprocessor, the same address generated on two different CPUs refers to the same memory location.
Multiprocessors are divided into two types: those in which the shared memory is physically in one place, and those in which it is distributed among the processors.
build because additional processors can be added to the bus of a conventional uniprocessor machine. Because
modern caches greatly reduce the need for primary memory accesses, the increase in bus traffic does not
become a major performance issue until the number of CPUs is more than a few dozen or so.
Because there is a shared physical memory, independent processes running on separate CPUs can share data. It becomes the programmer's responsibility to ensure that the data is accessed without race conditions; usually the hardware in this type of machine provides instructions that make this easier for the programmer, such as barrier synchronization primitives or semaphore operations. A barrier synchronization instruction is an instruction that, when executed by a process, causes that process to wait until all other cooperating processes have reached the same instruction in their code. This will be described in more detail in a later chapter. Semaphores, and the operations on them, are another means of letting processes coordinate their access to critical sections of code that manipulate shared data, but these too require that the programmer use them correctly. The processes can also have their own private data, which is data used only by a single process. Figure 2.16 depicts a UMA multiprocessor with four CPUs.
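The notes describe barriers as hardware-level primitives. Purely as a software illustration of the same idea (an addition, using Python's threading module rather than any mechanism named in the notes), each thread below blocks at the barrier until all cooperating threads have arrived:

```python
import threading

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)    # all 4 workers must arrive before any may continue

def worker(wid):
    print(f"worker {wid}: phase 1 done")
    barrier.wait()                          # block until every worker reaches this line
    print(f"worker {wid}: starting phase 2")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```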
The hardware designer must address the cache coherence problem. In a shared memory multiprocessor
with separate caches in each processor, the problem is that the separate caches can have copies of the same
memory blocks, and unless measures are taken to prevent it, the copies will end up having different values
for the blocks. This will happen whenever two different processors modify their copies with different values
and nothing is done to propagate the changes to other caches. Usually, a snooping cache protocol is used
in these types of machines. We will not explain it here.
directory-based protocol is used. We will not describe that protocol here. The interested reader is referred
to the Quinn book [2].
2.6 Multicomputers
A multicomputer is a multiple-CPU computer with distributed memory that is not shared: each CPU has its own address space and can access only its own local memory, which is called private memory. Thus, the same address on two different CPUs refers to two different memory locations. These machines are also called private-memory multiprocessors. Because there is no shared address space, the only way for processes running on different CPUs to communicate is through some type of message-passing system, and the architecture of a multicomputer typically supports efficient message passing.
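As an illustration of message passing between separate address spaces (a sketch only, assuming the mpi4py binding to MPI, which these notes do not mention; Quinn's text [2] uses MPI from C), one process sends a message that another explicitly receives:

```python
# Run with, for example:  mpiexec -n 2 python send_recv.py   (hypothetical file name)
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = {"step": 1, "payload": [1, 2, 3]}
    comm.send(data, dest=1, tag=0)       # no shared memory: the data is copied into a message
elif rank == 1:
    data = comm.recv(source=0, tag=0)    # blocks until the message from rank 0 arrives
    print("rank 1 received:", data)
```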
A commercial multicomputer is a multicomputer designed, manufactured, and sold as a multicomputer. A commodity cluster is a multicomputer assembled from off-the-shelf components. A commercial multicomputer's interconnection network and processors are optimized to work with each other, providing low-latency, high-bandwidth connections between the computers, at a higher price than a commodity cluster. Commodity clusters generally have lower performance, with higher latency and lower bandwidth in the interprocessor connections.
• it does not scale well because of the front-end, as the performance of the front-end host limits the
number of users and the number of jobs;
• the pared-down operating systems on the back-end do not allow for sophisticated debugging tools;
• the parallel programs must be written in two parts – the part that runs on the front-end and interacts
with the I/O devices and the user, and the part that runs on the back-end.
These last two problems were such an impediment that many asymmetrical multicomputers include advanced
debugging facilities and I/O support on the back-end hosts.
[Figure: An asymmetrical multicomputer and a symmetrical multicomputer, showing users, back-end computers, an interconnection network, and a disk.]
SIMD single instruction, multiple data; examples include the MMX and SSE instructions in the x86 processor series, processor arrays, and pipelined vector processors. SIMD machines issue a single instruction that operates on multiple data items simultaneously; processor arrays and pipelined vector processors are therefore SIMD machines.
MISD multiple instruction, single data; very rare but one example is the U.S. Space Shuttle flight
controller. Systolic arrays fall into this category as well.
MIMD multiple instruction, multiple data; examples include SMPs and clusters. This is the most common category of parallel computer. MIMD machines are more complex and expensive, and so the number of processors tends to be smaller than in SIMD machines. Today's multiprocessors are found on desktops, with anywhere from 2 to 16 processors. Both multiprocessors and multicomputers fall into this category.
Acknowledgments
Several students have found mistakes in the notes over the years and these have been corrected. I thank
Jaspal Singh for discovering a mistake in the formula used to calculate the target of the cross edges in the
butterfly network in Method 1 there.
References
[1] C. Wu and T. Feng. Tutorial, interconnection networks for parallel and distributed processing. Tutorial
Texts Series. IEEE Computer Society Press, 1984.
[2] M.J. Quinn. Parallel Programming in C with MPI and OpenMP. McGraw-Hill Higher Education.
McGraw-Hill Higher Education, 2004.
[3] Michael J. Flynn. Some computer organizations and their effectiveness. IEEE Trans. Comput., 21(9):948–
960, September 1972.
Subject Index
hypercube, 5
indirect topology, 9
interconnection network, 8
network topology, 1
node, 1
non-uniform memory access multiprocessor, 13
order, 6
path, 2
pipelined vector processor, 10
private-memory multiprocessor, 14