
1981 IEEE A Survey On Interconnection Networks

The document discusses the importance of interconnection networks in concurrent processing systems, highlighting various network topologies and switching strategies. It categorizes communication modes, control strategies, and switching methodologies, emphasizing their roles in enhancing processing speed for real-time applications. Additionally, it reviews static and dynamic network topologies, routing techniques, and communication protocols essential for effective interprocessor communication.


Concurrent processing depends on interconnection networks for communication among processors and memory modules. Various network topologies and switching strategies are covered here.

A Survey of Interconnection Networks
Tse-yun Feng
The Ohio State University

Concurrent processing of data items is considered a proper approach for significantly increasing processing speed.1 In many real-time applications, such as image processing and weather computation, which need an instruction execution rate of more than one billion floating-point instructions per second, concurrent processing is unavoidable. And now, with the advent of LSI technology, it is economically feasible to construct a concurrent processing system by interconnecting hundreds, even thousands, of off-the-shelf processors and memory modules.

A basic concurrent processing system is shown in Figure 1. Processes, generated by compiling and partitioning a user's program, are assigned to individual processors, and an interconnection network implements interprocess communication. A general model of the hardware system is shown in Figure 2. The interconnection network facilitates communication not only among the n processors and the m memory modules but also between the processors and memory modules.

Many interconnection networks have been reviewed in other surveys.2-9 In this article we consider interconnection networks from a practical design viewpoint. We examine design decisions that are essential in choosing a cost-effective communication network, survey the various topologies and communication protocols, and discuss connection issues related to concurrent processing.

Design decisions

In selecting the architecture of an interconnection network, four design decisions can be identified.10 They concern operation mode, control strategy, switching method, and network topology.

Figure 1. An overview of concurrent processing systems.

0018-9162/81/1200-0012$00.75 © 1981 IEEE COMPUTER


Operation mode. Two types of communication can be identified: synchronous and asynchronous. Synchronous communication is needed for processing in which communication paths are established synchronously for either a data manipulating function11 or a data/instruction broadcast. Asynchronous communication is needed for multiprocessing in which connection requests are issued dynamically. A system may also be designed to facilitate both synchronous and asynchronous processing. Therefore, typical operation modes of interconnection networks can be classified into three categories: synchronous, asynchronous, and combined.

Control strategy. A typical interconnection network consists of a number of switching elements and interconnecting links. Interconnection functions are realized by properly setting control of the switching elements. The control-setting function can be managed by a centralized controller or by the individual switching element. The latter strategy is called distributed control; the first strategy is called centralized control.

Switching methodology. The two major switching methodologies are circuit switching and packet switching. In circuit switching, a physical path is actually established between a source and a destination. In packet switching, data is put in a packet and routed through the interconnection network without establishing a physical connection path. In general, circuit switching is much more suitable for bulk data transmission, and packet switching is more efficient for short data messages. Another option, integrated switching, includes capabilities of both circuit switching and packet switching. Therefore, three switching methodologies can be identified: circuit switching, packet switching, and integrated switching.

Network topology. A network can be depicted by a graph in which nodes represent switching points and edges represent communication links. The topologies tend to be regular and can be grouped into two categories: static and dynamic. In a static topology, links between two processors are passive and dedicated buses cannot be reconfigured for direct connections to other processors. On the other hand, links in the dynamic category can be reconfigured by setting the network's active switching elements.

The cross product of the set of categories in each design decision, {operation mode} x {control strategy} x {switching methodology} x {network topology}, represents a space of interconnection networks. Obviously, the cross product contains some uninteresting cases, but a network designer can obtain a meaningful subspace by exercising a practical view of engineering technology.

Figure 2. Hardware model of concurrent processing systems.

Topologies

Network topology is a key factor in determining a suitable architectural structure, and many topologies have been considered for telephone switching connections.12 Here, we review those proposed or used for connections in tightly coupled multiple-processor systems (see Figure 3).

Figure 3. Topologies of interconnection networks.
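As a small illustration of the design space spanned by the four design decisions discussed earlier, the cross product can be enumerated directly. The sketch below is an illustration only: the category labels come from the text, while the identifiers (OPERATION_MODES, design_space, and so on) are hypothetical names chosen for this example. It lists every combination and leaves the judgment about which ones are uninteresting to the designer.

```python
from itertools import product

# Category labels taken from the four design decisions in the text.
OPERATION_MODES = ("synchronous", "asynchronous", "combined")
CONTROL_STRATEGIES = ("centralized", "distributed")
SWITCHING_METHODS = ("circuit", "packet", "integrated")
TOPOLOGIES = ("static", "dynamic")


def design_space():
    """Return every (mode, control, switching, topology) combination."""
    return list(product(OPERATION_MODES, CONTROL_STRATEGIES,
                        SWITCHING_METHODS, TOPOLOGIES))


space = design_space()
print(len(space))  # 3 * 2 * 3 * 2 = 36 candidate network classes
print(space[0])    # ('synchronous', 'centralized', 'circuit', 'static')
```

A designer would then filter this list, for example by discarding combinations that the available technology cannot implement economically.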

Figure 4. Examples of static network topologies: (a) one dimensional; (b-f) two dimensional; and (g-j) three dimensional.

Static. Topologies in the static category can be classified according to dimensions required for layout, specifically, one-dimensional, two-dimensional, three-dimensional, and hypercube as shown in Figure 3. Examples of one-dimensional topologies include the linear array used for some pipeline architectures (Figure 4a).13 Two-dimensional topologies include the ring,14,15 star,16 tree,17 near-neighbor mesh,18 and systolic array.13 Examples are shown in Figure 4b-f. Three-dimensional topologies include the completely connected,19 chordal ring,20 3-cube,21 and 3-cube-connected-cycle22 networks depicted in Figure 4g-j. A D-dimensional, W-wide hypercube contains W nodes in each dimension, and there is a connection to a node in each dimension. The near-neighbor mesh and the 3-cube are actually two- and three-dimensional hypercubes, respectively. The cube-connected-cycle is a deviation of the hypercube. For example, the 3-cube-connected-cycle shown in Figure 4j is obtained by replacing each node of the 3-cube by a 3-node cycle. Each node in the cycle is connected to the corresponding node in another cycle.

Dynamic. There are three topological classes in the dynamic category: single-stage, multistage, and crossbar (see Figure 5).

Single-stage. A single-stage network is composed of a stage of switching elements cascaded to a link connection pattern. The shuffle-exchange network23 is a single-stage network based on a perfect-shuffle connection cascaded to a stage of switching elements as shown in Figure 5a. The single-stage network is also called a recirculating network because data items may have to recirculate through the single stage several times before reaching their final destination.

Multistage. A multistage network consists of more than one stage of switching elements and is usually capable of connecting an arbitrary input terminal to an arbitrary output terminal. Multistage networks can be one-sided or two-sided. The one-sided networks, sometimes called full switches, have input-output ports on the same side. The two-sided multistage networks, which usually have an input side and an output side, can be divided into three classes: blocking, rearrangeable, and nonblocking.

In blocking networks, simultaneous connections of more than one terminal pair may result in conflicts in the use of network communication links. Examples of this type of network, which has been extensively investigated, include data manipulator,24 baseline,25,26 SW banyan,27 omega,28 flip,29 indirect binary n-cube,30 and delta.31 A topological equivalence relationship has been established for this class of networks in terms of the baseline network.25,26 A data manipulator and a baseline network are shown in Figure 5b and 5c.

A network is called a rearrangeable nonblocking network if it can perform all possible connections between inputs and outputs by rearranging its existing connections so that a connection path for a new input-output pair can always be established. A well-defined network, the Benes network12 shown in Figure 5d, belongs to this class. The Benes rearrangeable network topology has been extensively studied for use in synchronous data permutation32-35 and asynchronous interprocessor communication.36,37

A network which can handle all possible connections without blocking is called a nonblocking network. Two cases have been considered in the literature. In the first case, the Clos network38 shown in Figure 5e, a one-to-one connection is made between an input and an output. The other case considers one-to-many connections.39 Here, a generalized-connection network topology is generated to pass any of the $N^N$ mappings of inputs onto outputs where N is the number of inputs or outputs (see Figure 5f). In a one-sided network (or full switch), one-to-one connection is possible between all pairs of terminals.40,41 A cellular implementation, a base-line topology construction, and a Clos construction are shown in Figure 5g-i.

Crossbar. In a crossbar switch every input port can be connected to a free output port without blocking. Figure 5j shows a schematic which is similar to one used in C.mmp.42 A crossbar switch called a versatile line manipulator has also been designed and implemented.43,44

Figure 5. Examples of dynamic network topologies: (a) single stage; (b-i) multistage; and (j) crossbar.

Communication protocols

The switching methodology and the control strategy are implemented in switching elements (or switching points) according to required communication protocols. The communication protocols can be viewed on two levels. The first level concerns switching control algorithms which generate necessary control settings on switching elements to ensure reliable data routings from source to destination. The first-level protocols are referred to as routing techniques here. The second level is concerned with the link control procedure that provides the handshaking process among switching points. The handshaking process is a basic function implemented by switching elements.

Routing techniques. The routing techniques depend on the network topology and the operation mode used. More or less, each multiple-processor system needs a routing algorithm. Here, we use several well-defined routing algorithms for examples.

Near-neighbor mesh. Bitonic sort has been adapted by several authors45-47 for the routing of an n x n mesh-connected, single instruction-multiple data stream system. The procedure developed by Nassimi47 is as follows:

    Procedure SORT (n, n)
    1) K <- S <- 1
    2) While K < n do
       a) consider the n x n processor array as composed of many adjacent K x 2K subarrays
       b) do in parallel for each K x 2K subarray
            HORIZONTAL_MERGE(K, 2K)
       c) S <- S + 1
       d) consider the n x n processor array as composed of many adjacent 2K x 2K subarrays
       e) do in parallel for each 2K x 2K subarray
            VERTICAL_MERGE(2K, 2K)
       f) S <- S + 1; K <- 2*K
       end
    end SORT

The HORIZONTAL_MERGE sorts a bitonic sequence arranged in two arrays with the increasing sequence on the left array and the decreasing sequence on the right array, or vice versa. Similarly, the VERTICAL_MERGE sorts a bitonic sequence arranged in two arrays with the increasing sequence on the upper array and the decreasing sequence on the lower array, or vice versa. A complete example of sorting a 4 x 4 array is shown in Figure 6. The order into which a subarray gets sorted is determined by the SIGN function, "+" and "-", used during a comparison-interchange where "+" is for nondecreasing order and "-" is for nonincreasing order. In Figure 6, the initial values given go through an HM sort on two 1 x 1 arrays, a VM sort on two 1 x 2 arrays, an HM sort on two 2 x 2 arrays, and finally a VM sort on two 2 x 4 arrays.
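The merge steps described above rest on the bitonic merge. The following sketch shows the one-dimensional form of that idea, with the "+" and "-" comparison-interchange directions passed as a flag. It is a simplification written for illustration, not Nassimi's mesh procedure: it ignores the row/column layout of the processor array, runs sequentially, and assumes the input length is a power of two. The function names are hypothetical.

```python
def compare_interchange(a, i, j, ascending):
    """"+" (ascending=True) leaves a[i] <= a[j]; "-" leaves a[i] >= a[j]."""
    if (a[i] > a[j]) == ascending:
        a[i], a[j] = a[j], a[i]


def bitonic_merge(a, lo, n, ascending):
    """Sort the bitonic run a[lo:lo+n] in the requested direction."""
    if n > 1:
        half = n // 2
        for i in range(lo, lo + half):
            compare_interchange(a, i, i + half, ascending)
        bitonic_merge(a, lo, half, ascending)
        bitonic_merge(a, lo + half, half, ascending)


def bitonic_sort(a, lo=0, n=None, ascending=True):
    """Sort a[lo:lo+n] in place; n must be a power of two."""
    if n is None:
        n = len(a)
    if n > 1:
        half = n // 2
        bitonic_sort(a, lo, half, True)           # increasing half
        bitonic_sort(a, lo + half, half, False)   # decreasing half
        bitonic_merge(a, lo, n, ascending)        # merge the bitonic run


values = [14, 12, 5, 7, 15, 8, 9, 13, 4, 3, 10, 6, 1, 0, 2, 11]
bitonic_sort(values)
print(values)  # the 16 values in nondecreasing order
```

On the mesh, the compare-interchange pairs of each merge level are executed in parallel by neighboring processors, which is what the HORIZONTAL_MERGE and VERTICAL_MERGE steps arrange.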
Shuffle-exchange network. Both centralized and distributed routings have been worked out for the shuffle-exchange network. It has been shown that the shuffle-exchange network can realize an arbitrary permutation in $3(\log_2 N) - 1$ passes where N is the network size.48 An example is shown in Figure 7 for the following permutation:

$$\begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 \\ 14 & 12 & 5 & 7 & 15 & 8 & 9 & 13 & 4 & 3 & 10 & 6 & 1 & 0 & 2 & 11 \end{pmatrix}$$

The control setting developed consists of three matrices, F, S, and T. Among these three control matrices, S is independent of the permutation, and F and T are modified matrices obtained by performing some prescribed operations on the control matrix for the Benes binary network. The detailed transformation is shown in Wu and Feng.48

The shuffle-exchange network can also be constructed to adapt to a distributed control scheme. The construction can be considered as a sorting network, and the binary codes of the destination names are used as the values to be sorted.23,49 Figure 8 illustrates an example for $2^n$ elements where n = 4. Each of the $n^2$ steps in this scheme consists of a perfect shuffle followed by simultaneous operations performed on $2^{n-1}$ pairs of adjacent elements. Each of the latter operations is either "0" (no operation, straight connection), "+" (comparator module which sends the larger value to the lower link), or "-" (a reverse comparator module). The sorting proceeds in n stages of n steps each: during stage s, for s < n, we do n - s steps in which all operations are "0", followed by s steps in which the operations consist alternately of $2^t$ "+" followed by $2^t$ "-" for t = 1, 2, ..., s. During the last stage, all operations are "+".
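To make the two building blocks of the shuffle-exchange network concrete, the sketch below models the perfect shuffle as a cyclic left rotation of an n-bit address and the exchange as a flip of the low-order bit, and evaluates the $3(\log_2 N) - 1$ pass bound quoted above. This is an illustrative sketch under those standard definitions; the function names are hypothetical and it does not compute the F, S, and T control matrices.

```python
def shuffle(i, n):
    """Perfect shuffle on an n-bit address: rotate the bits left by one."""
    mask = (1 << n) - 1
    return ((i << 1) | (i >> (n - 1))) & mask


def exchange(i):
    """Exchange link: pair each address with the one differing in bit 0."""
    return i ^ 1


def passes_for_arbitrary_permutation(size):
    """Pass bound 3*log2(N) - 1 quoted in the text; N must be a power of two."""
    n = size.bit_length() - 1
    if 1 << n != size:
        raise ValueError("network size must be a power of two")
    return 3 * n - 1


print([shuffle(i, 3) for i in range(8)])     # [0, 2, 4, 6, 1, 3, 5, 7]
print(passes_for_arbitrary_permutation(16))  # 11
```

Applying the shuffle n times returns every address to its starting position, which is why data may need to recirculate through the single stage several times.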
Figure 6. A complete example of sorting a 4 x 4 array.

Figure 7. An example for universal realization of permutations.48

Figure 8. Sorting with shuffle-exchange. (Adapted from The Art of Computer Programming, Vol. 3: Sorting and Searching by D. E. Knuth; Addison-Wesley, Reading, Mass., © 1973.)

Data manipulator. A centralized control scheme is designed for implementing data manipulating functions such as permuting, replicating, spacing, masking, and complementing.11 To implement a data manipulating function, the proper control lines of the six groups ($U_1^{2^i}$, $U_2^{2^i}$, $H_1^{2^i}$, $H_2^{2^i}$, $D_1^{2^i}$, $D_2^{2^i}$) in each column must be set through the use of the control register and the associated decoder. A "duplicate spaced substrings down" operation is illustrated in Figure 9. The two substrings to be duplicated are AB and EF. For this operation the control line groups $D_1^{2^i}$ and $H_2^{2^i}$, or $H_1^{2^i}$ and $H_2^{2^i}$, are activated, depending on whether the control bit is 1 or 0 as determined by substring length. In this example, the substring length is 2; thus, only the control bit for column $2^1$ has a value of 1, and all others are 0's. Thus, in columns $2^2$ and $2^0$, $H_1^{2^i}$ and $H_2^{2^i}$ are activated, and in column $2^1$, $D_1^{2^1}$ and $H_2^{2^1}$ are activated. With this control pattern, the substrings can be generated at the output register.

Figure 9. Duplicate spaced substring down on data manipulator.11

A distributed control scheme has also been developed by McMillen and Siegel.50 It uses a routing tag which contains 2n bits and is of the form $F = (f_{2n-1} \ldots f_{n+1} f_n f_{n-1} \ldots f_1 f_0)$. The n low-order bits represent the magnitudes of the route, and the n high-order bits represent the sign corresponding to the magnitudes. In stage i, a given switching element examines bits i and n + i of the routing tag. If $f_i = 0$, the straight link is used, regardless of the value of $f_{n+i}$. If $f_i = 1$, bit n + i is examined. If $f_{n+i} = 0$, the $+2^i$ link is used; if $f_{n+i} = 1$, the $-2^i$ link is used. The source processor generates its own routing tag. For example, in a data manipulator of $N = 2^4$, if the source is 13 and the destination is 6, one possible value for F is 00000111. The path traversed is straight, $+2^2$, $+2^1$, $+2^0$. Multiple paths exist between a source-destination pair. For example, an alternative routing tag from source 13 to destination 6 is (00011001). The example is shown in Figure 10. A general rule to calculate the routing tag is shown as

$$D = S + (-1)^{f_{2n-1}}(f_{n-1}2^{n-1}) + (-1)^{f_{2n-2}}(f_{n-2}2^{n-2}) + \cdots + (-1)^{f_n}(f_0 2^0)$$

where S and D are the addresses of the source and the destination, respectively.

Baseline network. Routing techniques for baseline networks described here are also useful for other topologically equivalent blocking multistage networks.25 Basically, two types of routing are available: recursive routing and destination tag routing.25,28,51 The recursive routing algorithm determines the control pattern according to permutation names. For some permutations useful in parallel processing, the control pattern can be calculated recursively on the fly as the data pass through the network. Six categories of such permutations have been identified. For our purpose, we describe one here and show the recursive routing algorithm. The flip permutation function29 is described as follows:

$$F_k^{(n)}\ (0 \le k < 2^n):\quad p(X^r \oplus k) = X \ \text{ and } \ p(X \oplus k) = X^r$$

where $X^r$ is the number whose binary representation is the reverse of X. Let $k = 2k' + k_0$ and let $[L; R]$ denote the cascaded matrix whose left part and right part are L and R, respectively. Also let $V^{(n-1)}(b)$ be the $2^{n-1}$-bit vector whose components are all equal to b. The control pattern $K^{(n)}$ of the flip function can then be expressed in terms of the following recursive formula:

$$K^{(n)}(F_k^{(n)}) = [\,V^{(n-1)}(k_0);\ K^{(n-1)}(F_{k'}^{(n-1)})\,]$$

where

$$K^{(1)}(F_k^{(1)}) = [\,V^{(n-1)}(k)\,].$$

For example, assuming

$$p = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 1 & 5 & 3 & 7 & 0 & 4 & 2 & 6 \end{pmatrix},$$

p can be described by

$$F_4^{(3)}:\quad p(X^r \oplus 4) = X.$$

Accordingly, we have

$$K^{(3)}(F_4^{(3)}) = [\,V^{(2)}(0);\ V^{(2)}(0);\ V^{(2)}(1)\,].$$

Hence

$$K^{(3)}(p) = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix}.$$

The destination tag routing uses the binary representation of the destination as a routing tag. Let the source terminal link and destination terminal link be A and Z, respectively. Also, let the binary representation of Z be $z_{n-1} z_{n-2} \ldots z_0$. Starting at A, the first node to which A is connected is set to switch A to the upper link if $z_{n-1} = 0$ or the lower link if $z_{n-1} = 1$. The second node in the path is again set to switch A to the upper link if $z_{n-2} = 0$ or the lower link if $z_{n-2} = 1$. This scheme is continued until we get the proper destination. For example, in Figure 11, A = 2 and Z = 11 (i.e., $z_3 z_2 z_1 z_0 = 1011$). Switching element 1 of the left-most stage switches A to the lower link because $z_3 = 1$. At the next stage, switching element 4 switches A to the upper link because $z_2 = 0$. Again, switching element 4 in the third stage and switching element 5 in the right-most stage both switch A to the lower links because $z_1 = z_0 = 1$. If we consider Z as the source and A as the destination, using the binary representation of A as the routing tag and repeating the same routing procedure will lead us to choose the same path. This routing tag algorithm will connect the only path available between a source and a destination and is extremely suitable for a distributed control scheme. A conflict resolution scheme25 has also been developed for implementing destination tag routing in terms of centralized control.
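Destination tag routing is simple enough to trace in a few lines. The sketch below is an illustration under a stated assumption: the figure is not reproduced here, so the wiring is modeled on the usual recursive baseline construction, in which the upper outputs of a block of switching elements feed the upper half-size sub-block and the lower outputs feed the lower one. The function name and return format are hypothetical. Under that assumption the trace reproduces the switching elements 1, 4, 4, 5 quoted in the text for A = 2 and Z = 11.

```python
def route_baseline(source, dest, n):
    """Trace destination-tag routing through an assumed 2**n baseline network.

    Returns (switches, output_line), where switches[s] is the index of the
    switching element used in stage s (numbered from 0 at the top).
    """
    line = source
    switches = []
    for stage in range(n):
        block = 1 << (n - stage)           # lines in the current sub-network
        base = line - (line % block)       # first line of that sub-network
        local_switch = (line % block) >> 1
        switches.append(line >> 1)         # element index within this stage
        bit = (dest >> (n - 1 - stage)) & 1    # z_{n-1} is examined first
        # bit 0 takes the upper link, bit 1 the lower link
        line = base + bit * (block >> 1) + local_switch
    return switches, line


switches, out = route_baseline(source=2, dest=11, n=4)
print(switches, out)  # [1, 4, 4, 5] 11
```

Each switching element needs only one bit of the tag, which is why the scheme suits distributed control.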
Figure 10. Distributed routing on the data manipulator.

Figure 11. Distributed routing on a baseline network.

Benes network. Sequential routing algorithms34,52 need O(N log N) steps where N is the network size. Many researchers have worked toward improving this time complexity in terms of parallel processing technique,53 heuristic method,37 or recursive formula.35 Here, we demonstrate the very basic routing algorithm, called the looping algorithm. The basic principle is illustrated in terms of the following permutation, to be realized by the Benes binary network shown in Figure 5d:

$$p = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 3 & 7 & 4 & 0 & 2 & 6 & 1 & 5 \end{pmatrix}$$

The looping algorithm starts by recording the permutation, p, in a chart as shown in Figure 12. The two output numbers of a switching element in the output stage are shown in the same column, and the two input numbers of a switching element in the input stage are shown in the same row. We then choose an arbitrary entry in the chart as a starting point. For example, electing to start at row 23 and column 01, we then look for a same-row or same-column entry to form a loop and, in Figure 12, choose row 23 and column 45. The process continues until we obtain a loop by re-entering row 23 and column 01. The loop's member entries are then assigned "a" and "b" alternately. The second loop can be formed in the same way. Then, we assign input and output lines named "a" to subnetwork a and those named "b" to subnetwork b. The control of the input and output switching elements must be set as depicted in Figure 13. This looping algorithm can be applied recursively to the two subnetworks.

Figure 12. An example of the looping algorithm.

Figure 13. Control setting result from the first iteration of the looping algorithm.

Construction of interconnection networks. Interconnection networks are usually designed so they can be constructed of a single type of modular building block called a switching element. The switching element realizes communication protocols which specify the control strategy and the switching methodology.

The logic design of switching elements has been explored in many projects,54-56 including recent LSI implementations.10,57 Here, we describe in more detail three designs that have been implemented and are operational.

Flip network 2 x 2 switching element. The flip network uses centralized control and circuit switching.29 The 2 x 2 switching element can be set by a control line into a direct-connection or crossed-connection state. Assume that $I_0$, $I_1$, $O_0$, $O_1$, and C represent the two inputs, the two outputs, and the switching element control. The switching element's output function can be expressed as follows:

$$O_0 = \overline{C} \cdot I_0 + C \cdot I_1, \quad \text{and}$$
$$O_1 = \overline{C} \cdot I_1 + C \cdot I_0$$

where C = 0 means straight connection and C = 1 crossed connection (see Figure 14).

Figure 14. A 2 x 2 switching element.
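The behavior of the 2 x 2 switching element can be checked exhaustively in a few lines. The sketch below is an illustration that models the two inputs and the control line as single bits and writes the complement of C explicitly; the function name is hypothetical, and a real element would switch whole data paths rather than single bits.

```python
def switch_2x2(i0, i1, c):
    """Return (o0, o1) for single-bit inputs i0, i1 and control bit c.

    c = 0 gives the straight connection, c = 1 the crossed connection.
    """
    not_c = 1 - c
    o0 = (not_c & i0) | (c & i1)
    o1 = (not_c & i1) | (c & i0)
    return o0, o1


for c in (0, 1):
    for i0 in (0, 1):
        for i1 in (0, 1):
            print(c, (i0, i1), "->", switch_2x2(i0, i1, c))
```

With c = 0 every input pair comes out unchanged, and with c = 1 every pair comes out swapped, which is the whole repertoire of the element under centralized control.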

Dimond 2 x 2 switching element. A switching element with two input and output ports, called Dimond for dual interconnection modular network device,58 allows modular construction of interconnection networks. A packet of messages (containing routing information) arriving at a Dimond is switched to a designated output port, where it is stored in a register. Figure 15 shows an implementation of Dimond which requires one control clock for all interconnected switching elements. The central clock has two phases. In the first clock phase, it is determined which inputs have to be copied into which registers. The copy allowances so determined are stored in four flip-flops: $C_{00}$, $C_{01}$, $C_{10}$, and $C_{11}$ ($C_{01}$ is the allowance for copying $in_0$ into $reg_1$). In addition, output signals of copy acknowledgments ($cack_0$ and $cack_1$) and internal control signals ($cross$, $fill_0$, and $fill_1$) are generated. More precisely, we have the following:

$$\begin{aligned}
C_{00} &= creq_0 \cdot \overline{des_0} \cdot \overline{stat_0} \cdot (\overline{creq_1} + des_1 + \overline{prio}); \\
C_{01} &= creq_0 \cdot des_0 \cdot \overline{stat_1} \cdot (\overline{creq_1} + \overline{des_1} + \overline{prio}); \\
C_{10} &= creq_1 \cdot \overline{des_1} \cdot \overline{stat_0} \cdot (\overline{creq_0} + des_0 + prio); \\
C_{11} &= creq_1 \cdot des_1 \cdot \overline{stat_1} \cdot (\overline{creq_0} + \overline{des_0} + prio); \\
cack_0 &= C_{00} + C_{01}; \\
cack_1 &= C_{10} + C_{11}; \\
cross &= C_{00} + C_{11}; \\
fill_0 &= C_{00} + C_{10}; \\
fill_1 &= C_{01} + C_{11}
\end{aligned}$$

where $prio$ is the priority line indicating the index (0, 1) of the input served first in the event of conflict, and $des_0$ and $des_1$ are destination lines for $in_0$ and $in_1$. In the second clock phase, two actions are performed concurrently. Inputs are copied into the output register, if required, and the status flip-flops, $stat_0$ and $stat_1$ (status of $reg_0$ and $reg_1$, respectively), are adapted. Precisely, we have the following:

$$\begin{aligned}
fill_0 &\rightarrow reg_0 = cross \cdot in_0 + \overline{cross} \cdot in_1; \\
fill_1 &\rightarrow reg_1 = \overline{cross} \cdot in_0 + cross \cdot in_1; \\
fill_0 \cdot \overline{rel_0} &\rightarrow stat_0 = 1 \\
rel_0 &\rightarrow stat_0 = 0 \\
fill_1 \cdot \overline{rel_1} &\rightarrow stat_1 = 1 \\
rel_1 &\rightarrow stat_1 = 0
\end{aligned}$$

The information-available lines are connected to the status flip-flops:

$$infa_0 = stat_0, \qquad infa_1 = stat_1$$

The interconnection of two Dimonds is shown in Figure 16, which depicts the relation of handshaking lines.
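The two clock phases of the Dimond can be sketched as two small functions. This is one plausible reading rather than the device's documented logic: it assumes that a copy is allowed only when the input requests it, its destination line selects that register, the register is empty (status 0), and the other input either does not compete for the same register or loses on priority. It also assumes that `prio` holds the index of the input served first. All function and variable names are hypothetical.

```python
def dimond_phase1(creq, des, stat, prio):
    """First clock phase: copy allowances, acknowledgments, fill and cross.

    creq, des, stat are pairs of 0/1 values; prio is 0 or 1.
    """
    def allowed(i, j):
        k = 1 - i                                # index of the other input
        wants_j = creq[i] == 1 and des[i] == j   # input i asks for register j
        register_free = stat[j] == 0
        rival = creq[k] == 1 and des[k] == j     # other input wants it too
        return wants_j and register_free and (not rival or prio == i)

    c = {(i, j): allowed(i, j) for i in (0, 1) for j in (0, 1)}
    cack = (c[0, 0] or c[0, 1], c[1, 0] or c[1, 1])
    fill = (c[0, 0] or c[1, 0], c[0, 1] or c[1, 1])
    cross = c[0, 0] or c[1, 1]
    return c, cack, fill, cross


def dimond_phase2(inputs, regs, stat, fill, cross, rel):
    """Second clock phase: copy inputs into registers and update status."""
    new_regs, new_stat = list(regs), list(stat)
    if fill[0]:
        new_regs[0] = inputs[0] if cross else inputs[1]
    if fill[1]:
        new_regs[1] = inputs[1] if cross else inputs[0]
    for j in (0, 1):
        if rel[j]:
            new_stat[j] = 0      # register released by the next element
        elif fill[j]:
            new_stat[j] = 1      # register now holds a packet
    return new_regs, new_stat


# Both inputs request register 1; input 0 has priority, so only it is copied.
c, cack, fill, cross = dimond_phase1((1, 1), (1, 1), (0, 0), prio=0)
print(cack, fill)  # (True, False) (False, True)
print(dimond_phase2(("A", "B"), [None, None], [0, 0], fill, cross, (0, 0)))
```

The losing input receives no acknowledgment and simply repeats its request in a later cycle, which is the handshaking behavior the status and acknowledgment lines exist to support.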
64 x 64 switching element. A centralized-control and circuit-switching 64 x 64 versatile data manipulator11 (see Figure 17) is operating in conjunction with the Staran computer at the Rome Air Development Center.44 The data manipulator operates under the control of the Staran computer's parallel input-output unit. The contents of the input and output masks, of the address control register, and of the input and output control registers, as well as the data to be manipulated, are entered via the 256-bit wide PIO buffer interface. The manipulated data leave the data manipulator via the same interface. The data manipulator's instruction repertoire allows one to load the various address registers and masks and to start and stop data manipulation. Self-test is performed by loading address and input-data registers, allowing verification of correct operation without assistance from the Staran computer. There are 64 x 64 cells in the basic crossbar circuit. The output gate of cell (i,j) is controlled by the ith address control register through a decoder. The decoder has 64 outputs to control the 64 output gates in a basic-crossbar-circuit row.

Figure 15. A 2 x 2 dual interconnecting modular network device, Dimond,58 for packet switching.

Connection issues for concurrent processing


Figure 16. Connecting two Dimonds.58

Figure 17. Block diagram of a versatile data manipulator.

Two approaches, array processing and multiprocessing, have been tried to provide processing concurrency.

Since array processors, which consist of multiple process-


ing elements and parallel memory modules under one
control unit, can handle single instructions and multiple
data streams, they are also known as SIMD computers.
Existing examples include Illiac IV and Staran. An overall
SIMD machine organization59 is shown in Figure 18. The
N processing elements, or PEs, are connected by two in-
terconnection networks to the M parallel memory
modules. The control unit in the center provides control
over PEs and memory modules.
Array processors allow explicit expression of parallel-
ism in user programs. The compiler detects the parallel-
ism and generates object code suitable for execution in the
multiple processing elements and the control unit. Pro-
gram segments which cannot be converted into parallel
executable forms are executed in the control unit; pro-
gram segments which can be converted into parallel ex-
ecutable forms are sent to the PEs and executed syn-
chronously on data fetched from parallel memory mod-
ules under the control of the control unit. To enable syn-
chronous manipulation in the PEs, the data are permuted
and arranged in vector form. Thus, to run a program
more efficiently on an array processor, one must develop
a technique for vectorizing the program (or algorithm).
The interconnection network plays a major role in vec-
torization.
Figure 18. SIMD model.59

The second approach for concurrent processing uses multiprocessing. The multiprocessor can handle multiple instructions and multiple data streams and hence is called an MIMD processor. Examples of the MIMD architecture include HEP,60 data flow processor,61 and flow model processor.62 A configuration of MIMD architecture62 is shown in Figure 19. The N processing elements are connected to the M memory modules by an interconnection network. The activities are coordinated by the coordinator. Unlike the control unit in an array processor, the coordinator does not execute object code; it only implements the synchronization of processes and smooths out the execution sequence. Again, the compiler must be designed to partition a computation task and assign each piece to individual processing elements. Effective partitioning and assignment are essential for efficient multiprocessing. The criterion is to match memory bandwidth with the processing load of the processors, and the interconnection network is a critical factor in this matching.

Below, we address some problems and results regarding the role of the interconnection network in concurrent processing.

Figure 19. MIMD model.

Combinatorial capability. In array processing, data are often stored in parallel memory modules in skewed forms that allow a vector of data to be fetched without conflict.63-65 However, the fetched data must be realigned in prescribed order before they can be sent to individual PEs for processing. This alignment is implemented by permutation functions of the interconnection network, which also realigns data generated by individual PEs into skewed form for storage in the memory modules.

In a computer architecture project, one should question whether the interconnection network chosen can efficiently perform the alignment. The rearrangeable network and the nonblocking network can realize every permutation function, but using these networks for alignment requires considerable effort to calculate control settings. A recursive routing mechanism has been provided for a few families of permutations needed for parallel processing35; however, the problem remains for the realization of general permutations. Many articles45,51,66,67 concentrating on the permutation capabilities of single-stage networks and blocking multistage networks have shown that these networks cannot realize arbitrary permutations in a single pass. Recent results show that the baseline network can realize arbitrary permutations in just two passes51 while other blocking multistage networks, such as the omega network, need at least three passes.66 As mentioned previously, the shuffle-exchange network can realize arbitrary permutations in 3(log2 N) - 1 passes, where N is the network size.48

Task assignments and reconfiguration. Consider a parallel program segment using M memory modules and N processing elements. During execution, data are usually transferred from memory modules to processing elements or vice versa. It is also necessary to transfer data among processing elements for data sharing and synchronization. Simultaneous data transfers through the interconnection network, which implements the transfers, may result in contention for communication links and switching elements. In case of conflict, some of the data transfers must be deferred; consequently, throughput decreases because the processing elements that need the deferred data cannot proceed as originally expected. To minimize delays caused by communication conflicts, program codes must be assigned to proper processing elements and data assigned to proper memory modules. The assignment of data to memory modules, called mapping,68 has recently been extended to include assignment of program modules to processing elements.69

A configuration concept has been proposed to better use the interconnection network.51 Under this concept, a network is just a configuration of another one in the same, topologically equivalent class.25 To configure a permutation function as an interconnection network, we can assign input/output link names in a way that realizes the permutation function in one conflict-free pass. The problem of assigning logical names that realize various permutation functions without conflicts is called a reconfiguration problem. It has been shown that, through the reconfiguration process, the baseline network can realize every permutation in one pass without conflicts.69 This implies that concurrent processing throughput could be enhanced by proper assignment of tasks to processing elements and data to memory modules.

Partitioning. In partitioning (that is, dividing the network into independent subnetworks of different sizes), each subnetwork must have all the interconnection capabilities of a complete network of the same type and size. Hence, with a partitionable network, a system can support multiple SIMD machines. By dynamically reconfiguring the system into independent SIMD machines and properly assigning tasks to each partition, we can use resources more efficiently.

Several authors have noted the importance of partitioning.30 One recent study70 shows that single-stage networks, such as the shuffle-exchange and Illiac networks, cannot be partitioned into independent subnetworks, but blocking multistage networks, such as the baseline and data manipulator, can be partitioned.

Bandwidth of interconnection networks. The bandwidth can be defined as the expected number of requests accepted per unit time. Since the bus system cannot provide sufficient bandwidth for a large-scale multiprocessor system and the crossbar switch is too expensive, it is particularly interesting to know what kind of bandwidth various interconnection networks can provide.

The analytic method has been used to estimate bandwidth.31,71,72 However, one cannot obtain a closed-form solution, and the analytic model is sometimes too simplified. For example, one result showed that for a blocking multistage interconnection network of size 256 x 256, the bandwidth is 77 requests (per memory cycle) and for a crossbar switch of the same size, the bandwidth is 162. However, the crossbar costs about 20 times as much as the multistage network, and with buffering (packet switching), the performance of the multistage network is quite comparable to the crossbar switch.71

Numerical simulation, also used to estimate the bandwidth,71 can simulate actual PE connection requests by analyzing the program to be executed. The access conflicts in the network and memory modules can be detected as shown by Wu and Feng.25 Using the simulation method, Barnes73 concluded that the baseline network is more than adequate to support connection needs of a proposed MIMD system which can execute one billion floating-point instructions per second.

Reliability. Reliable operation of interconnection networks is important to overall system performance. The reliability issue can be thought of as two problems: fault diagnosis and fault tolerance. The fault-diagnosis problem has been studied for a class of multistage interconnection networks constructed of switching elements with two valid states.74 The problem is approached by generating suitable fault-detection and fault-location test sets for every fault in the assumed fault model. The test sets are then trimmed to a minimal or nearly minimal set. Detecting a single fault (link fault or switching-element fault) requires only four tests, independent of network size. The numbers of tests needed for locating single faults and detecting multiple faults are also workable.

The second reliability problem mainly concerns the degree of fault tolerance.75 It is important to design a network that combines full connection capability with graceful degradation in spite of the existence of faults.

Acknowledgment

The author wishes to acknowledge the original contribution of Dr. C. Wu in preparing this article.

References

1. T. Feng, editor's introduction, special issue on parallel processors and processing, Computing Surveys, Vol. 9, No. 1, Mar. 1977, pp. 1-2.
2. C. V. Ramamoorthy, T. Krishnarao, and P. Jahanian, "Hardware Software Issues in Multi-Microprocessor Computer Architecture," Proc. First Annual Rocky Mountain Symp. Microcomputers, 1977, pp. 235-261.
3. K. J. Thurber, "Interconnection Networks-A Survey and Assessment," AFIPS Conf. Proc., Vol. 43, 1974 NCC, pp. 909-919.
4. K. J. Thurber, "Circuit Switching Technology: A State-of-the-Art Survey," Proc. Compcon Fall 1978, Sept. 1978, pp. 116-124.
5. K. J. Thurber and G. M. Masson, Distributed-Processor Communication Architecture, Lexington Books, Lexington, Mass., 1979, 252 pp.
6. H. J. Siegel, "Interconnection Networks for SIMD Machines," Computer, Vol. 12, No. 6, June 1979, pp. 57-66.
7. H. J. Siegel, R. J. McMillen, and P. T. Mueller, Jr., "A Survey of Interconnection Methods for Reconfigurable Parallel Processing Systems," AFIPS Conf. Proc., Vol. 48, 1979 NCC, pp. 387-400.
8. G. M. Masson, G. C. Gingher, and Shinji Nakamura, "A Sampler of Circuit Switching Networks," Computer, Vol. 12, No. 6, June 1979, pp. 32-48.
9. T. Feng and C. Wu, Interconnection Networks in Multiple-Processor Systems, Rome Air Development Center report, RADC-TR-79-304, Dec. 1979, 244 pp.
10. C. Wu and T. Feng, "A VLSI Interconnection Network for Multiprocessor Systems," Digest Compcon Spring 1981, pp. 294-298.
11. T. Feng, "Data Manipulating Functions in Parallel Processors and Their Implementations," IEEE Trans. Computers, Vol. C-23, No. 3, Mar. 1974, pp. 309-318.
12. V. Benes, Mathematical Theory of Connecting Networks, Academic Press, N.Y., 1965.
13. H. T. Kung, "The Structure of Parallel Algorithms," in Advances in Computers, Vol. 19, M. C. Yovits, ed., Academic Press, N.Y., 1980.
14. D. J. Farber and K. C. Larson, "The System Architecture of the Distributed Computer System-the Communications System," Proc. Symp. Computer Comm. Networks and Teletraffic, Brooklyn Polytechnic Press, Apr. 1972, pp. 21-27.
15. C. C. Reames and M. T. Liu, "A Loop Network for Simultaneous Transmission of Variable Length Messages," Proc. Second Symp. Computer Architecture, Jan. 1975, pp. 7-12.
16. S. I. Saffer et al., "NODAS-The Net Oriented Data Acquisition System for the Medical Environment," AFIPS Conf. Proc., Vol. 46, 1977 NCC, pp. 295-300.
17. J. A. Harris and D. R. Smith, "Hierarchical Multiprocessor Organization," Proc. Fourth Symp. Computer Architecture, Mar. 1977, pp. 41-48.
18. G. H. Barnes et al., "The Illiac IV Computer," IEEE Trans. Computers, Vol. C-17, No. 8, Aug. 1968, pp. 746-757.
19. E. M. Aupperle, "MERIT Computer Network: Hardware Considerations," in Computer Networks, R. Rustin, ed., Prentice-Hall, Englewood Cliffs, N.J., 1972, pp. 49-63.
20. B. W. Arden and H. Lee, "Analysis of Chordal Ring Network," IEEE Trans. Computers, Vol. C-30, No. 4, April 1981, pp. 291-295.
21. H. Sullivan, T. R. Bashkow, and K. Klappholz, "A Large Scale Homogeneous, Fully Distributed Parallel Machine," Proc. Fourth Symp. Computer Architecture, Nov. 1977, pp. 105-125.
22. F. P. Preparata and J. Vuillemin, "The Cube-Connected Cycles: A Versatile Network for Parallel Computation," Comm. ACM, Vol. 24, No. 5, May 1981, pp. 300-309.
23. H. S. Stone, "Parallel Processing with the Perfect Shuffle," IEEE Trans. Computers, Vol. C-20, No. 2, Feb. 1971, pp. 153-161.
24. T. Feng, Parallel Processing Characteristics and Implementation of Data Manipulating Functions, Rome Air Development Center report, RADC-TR-73-189, July 1973.
25. C. Wu and T. Feng, "On a Class of Multistage Interconnection Networks," IEEE Trans. Computers, Vol. C-29, No. 8, Aug. 1980, pp. 694-702.
26. C. Wu and T. Feng, "On a Distributed-Processor Communication Architecture," Proc. Compcon Fall 1980, pp. 599-605.
27. L. R. Goke and G. J. Lipovski, "Banyan Networks for Partitioning Multiprocessing Systems," Proc. First Annual Computer Architecture Conf., Dec. 1973, pp. 21-28.
28. D. H. Lawrie, "Access and Alignment of Data in an Array Processor," IEEE Trans. Computers, Vol. C-24, No. 12, Dec. 1975, pp. 1145-1155.
29. K. E. Batcher, "The Flip Network in STARAN," Proc. 1976 Int'l Conf. Parallel Processing, Aug. 1976, pp. 65-71.
30. M. C. Pease, "The Indirect Binary n-Cube Microprocessor Array," IEEE Trans. Computers, Vol. C-26, No. 5, May 1977, pp. 548-573.
31. J. H. Patel, "Processor-Memory Interconnections for Multiprocessors," Proc. Sixth Annual Symp. Computer Architecture, Apr. 1979, pp. 168-177.
32. A. Waksman, "A Permutation Network," J. ACM, Vol. 9, No. 1, Jan. 1968, pp. 159-163.
33. A. E. Joel, Jr., "On Permutation Switching Networks," B.S.T.J., Vol. 67, 1968, pp. 813-822.
34. D. C. Opferman and N. T. Tsao-Wu, "On a Class of Rearrangeable Switching Networks-Part I: Control Algorithm; Part II: Enumeration Studies of Fault Diagnosis," B.S.T.J., 1971, pp. 1579-1618.
35. J. Lenfant, "Parallel Permutations of Data: A Benes Network Control Algorithm for Frequently Used Permutations," IEEE Trans. Computers, Vol. C-27, No. 7, July 1978, pp. 637-647.
36. T. Feng, C. Wu, and D. P. Agrawal, "A Microprocessor-Controlled Asynchronous Circuit Switching Network," Proc. Sixth Annual Symp. Computer Architecture, 1979, pp. 202-215.
37. Y-C. Chow, R. D. Dixon, T. Feng, and C. Wu, "Routing Techniques for Rearrangeable Interconnection Networks," Proc. Workshop on Interconnection Networks, Apr. 1980, pp. 64-69.
38. C. Clos, "A Study of Nonblocking Switching Networks," Bell System Tech. J., Vol. 32, 1953, pp. 406-424.
39. C. D. Thompson, "Generalized Connection Networks for Parallel Processor Intercommunication," IEEE Trans. Computers, C-27, No. 12, Dec. 1978, pp. 1119-1125.
40. J. Gecsei, "Interconnection Networks from Three-State Cells," IEEE Trans. Computers, Vol. C-26, No. 8, Aug. 1977, pp. 705-711.
41. Y-C. Chow, R. D. Dixon, and T. Feng, "An Interconnection Network for Processor Communication with Optimized Local Connections," Proc. 1980 Int'l Conf. Parallel Processing, Aug. 1980, pp. 65-74.
42. W. A. Wulf and C. G. Bell, "C.mmp-A Multimicroprocessor," AFIPS Conf. Proc., Vol. 41, 1972 FJCC, pp. 765-777.
43. T. Feng, The Design of a Versatile Line Manipulator, Rome Air Development Center report, RADC-TR-73-292, Sept. 1973.
44. W. W. Gaertner, Design, Construction, and Installation of Data Manipulator, Rome Air Development Center report, RADC-TR-77-166, May 1977, 80 pp.
45. S. E. Orcutt, "Implementation of Permutations Functions in an Illiac IV-Type Computer," IEEE Trans. Computers, Vol. C-25, No. 9, Sept. 1976, pp. 929-936.
46. C. D. Thompson and H. T. Kung, "Sorting on a Mesh-Connected Parallel Computer," Comm. ACM, Vol. 20, No. 4, Apr. 1977, pp. 263-271.
47. D. Nassimi and S. Sahni, "Bitonic Sort on a Mesh-Connected Parallel Computer," IEEE Trans. Computers, Vol. C-28, No. 1, Jan. 1979, pp. 2-7.
48. C. Wu and T. Feng, "Universality of the Shuffle-Exchange Network," IEEE Trans. Computers, Vol. C-30, No. 5, May 1981.
49. D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, Addison-Wesley, Reading, Mass., 1973.
50. R. J. McMillen and H. J. Siegel, "MIMD Machine Communication Using the Augmented Data Manipulator Network," Proc. Seventh Symp. Computer Architecture, June 1980, pp. 51-58.
51. C. Wu and T. Feng, "The Reverse-Exchange Interconnection Network," IEEE Trans. Computers, Vol. C-29, No. 9, Sept. 1980, pp. 801-811; also Proc. 1979 Int'l Conf. Parallel Processing, pp. 160-174.
52. S. Anderson, "The Looping Algorithm Extended to Base 2^t Rearrangeable Switching Networks," IEEE Trans. Comm., Vol. COM-25, No. 10, Oct. 1977, pp. 1057-1063.
53. G. Lev, N. Pippenger, and L. G. Valiant, "A Fast Parallel Algorithm for Routing in Permutation Networks," IEEE Trans. Computers, Vol. C-30, No. 2, Feb. 1981, pp. 93-100.
54. D. H. Lawrie, Memory-Processor Connection Networks, UIUCDCS-R-73-557, University of Illinois, Urbana, Feb. 1973.
55. Numerical Aerodynamic Simulation Facility Feasibility Study, Burroughs Corporation, Mar. 1979.
56. U. V. Premkuma, R. Kapur, M. Malek, G. J. Lipovski, and P. Horne, "Design and Implementation of the Banyan Interconnection Network in TRAC," AFIPS Conf. Proc., Vol. 49, 1980 NCC, pp. 643-653.
57. M. A. Franklin, "VLSI Performance Comparison of Banyan and Crossbar Communication Networks," IEEE Trans. Computers, Vol. C-30, No. 4, Apr. 1981, pp. 283-290.
58. P. G. Jansen and J. L. W. Kessels, "The DIMOND: A Component for the Modular Construction of Switching Networks," IEEE Trans. Computers, Vol. C-29, No. 10, Oct. 1980, pp. 884-889.
59. D. J. Kuck, "A Survey of Parallel Machine Organization and Programming," Computing Surveys, Vol. 9, No. 1, Mar. 1977, pp. 29-59. Also in Proc. 1975 Sagamore Computer Conf. Parallel Processing, pp. 15-39.
60. B. J. Smith, "A Pipelined, Shared Resource MIMD Computer," Proc. 1978 Int'l Conf. Parallel Processing, pp. 6-8.
61. J. B. Dennis, "Data Flow Supercomputers," Computer, Vol. 13, No. 11, Nov. 1980, pp. 48-56.
62. S. F. Lundstrom and G. Barnes, "A Controllable MIMD Architecture," Proc. 1980 Int'l Conf. Parallel Processing, pp. 19-27.
63. D. J. Kuck, "ILLIAC IV Software and Application Programming," IEEE Trans. Computers, Vol. C-17, No. 8, Aug. 1968, pp. 758-770.
64. K. E. Batcher, "The Multi-Dimensional Access Memory in STARAN," IEEE Trans. Computers, Vol. C-26, No. 2, Feb. 1977, pp. 174-177.
65. D. H. Lawrie and C. Vora, "The Prime Memory System for Array Access," Proc. 1980 Int'l Conf. Parallel Processing, pp. 81-87.
66. A. Shimer and S. Ruhman, "Toward a Generalization of Two- and Three-Pass Multistage, Blocking Interconnection Networks," Proc. 1980 Int'l Conf. Parallel Processing, pp. 337-346.
67. T. Lang and H. S. Stone, "A Shuffle-Exchange Network with Simplified Control," IEEE Trans. Computers, Vol. C-25, No. 6, Jan. 1976, pp. 55-65.
68. H. T. Kung and D. Stevenson, "A Software Technique for Reducing the Routing Time on a Parallel Computer with a Fixed Interconnection Network," in High Speed Computer and Algorithm Organization, Academic Press, N.Y., 1977, pp. 423-433.
69. C. Wu and T. Feng, "A Software Technique for Enhancing Performance of a Distributed Computer System," Proc. Compsac 80, Oct. 1980, pp. 274-280.
70. H. J. Siegel, "The Theory Underlying the Partitioning of Permutation Networks," IEEE Trans. Computers, Vol. C-29, No. 9, Sept. 1980, pp. 791-801.
71. D. M. Dias and J. R. Jump, "Analysis and Simulation of Buffered Delta Networks," IEEE Trans. Computers, Vol. C-30, No. 4, Apr. 1981, pp. 273-282.
72. D. A. Padua, D. J. Kuck, and D. H. Lawrie, "High-Speed Multiprocessors and Compilation Techniques," IEEE Trans. Computers, Vol. C-29, No. 9, Sept. 1980, pp. 763-776.
73. G. H. Barnes, "Design and Validation of a Connection Network for Many-Processor Multiprocessing Systems," Proc. 1980 Int'l Conf. Parallel Processing, pp. 79-80.
74. C. Wu and T. Feng, "Fault Diagnosis for a Class of Multistage Interconnection Networks," Proc. 1979 Int'l Conf. Parallel Processing, pp. 269-278.
75. J. P. Shen and J. P. Hayes, "Fault Tolerance of a Class of Connecting Networks," Proc. Seventh Symp. Computer Architecture, 1980, pp. 61-71.

Tse-yun Feng is a professor in the Department of Computer and Information Science, Ohio State University, Columbus. Previously, he was on the faculty at Wayne State University, Detroit, and Syracuse University, New York. He has extensive technical publications in the areas of associative processing, parallel and concurrent processors, computer architecture, switching theory, and logic design, and has received a number of awards for his technical contributions and scholarship.

A past president of the IEEE Computer Society (1979-80), Feng was a distinguished visitor (1973-78), and has served as a reviewer, panelist, or session chairman for various technical magazines and conferences. He also initiated the Sagamore Computer Conference on Parallel Processing and the International Conference on Parallel Processing.

He received the BS degree from the National Taiwan University, Taipei, the MS degree from Oklahoma State University, Stillwater, and the PhD degree from the University of Michigan, Ann Arbor, all in electrical engineering.
