15CS72 IAT2 Solution
October 2019
Sub: Advanced Computer Architecture    Sub Code: 15CS72    Branch: CSE
Date: 14/10/2019    Duration: 90 mins    Max Marks: 50    Sem / Sec: VII A, B, C    (OBE)
Answer any FIVE FULL Questions                                        Marks   CO   RBT
1 (a) Explain the following terms for System Interconnect architecture. 05 CO1 L2
a) Node Degree b) Bisection Width c) Static Connection Networks d) Data Routing
Functions e) Crossbar Networks
a) Node Degree: The number of edges incident on a node is called the node degree. If the
network contains directed edges, the number of incoming edges gives the in-degree and the
number of outgoing edges gives the out-degree; the node degree is then the sum of the two.
The node degree should be kept small, and ideally constant, to reduce cost.
b) Bisection Width: When the network is cut into two equal halves, the minimum number of
edges along the cut is called the channel bisection width b. Each edge corresponds to a
channel with w wires (bits). Hence the wire bisection width is B = bw, which reflects the
wiring density of the network, and the channel width is w = B/b (see the sketch after these
definitions).
c) Static Connection Networks: Static networks are formed by point-to-point direct
connections that are fixed and do not change during program execution.
d) Data Routing Functions: Data routing functions are used for data exchange among
multiple processors or PEs in a network. Commonly used data routing functions include
shifting, rotation, permutation (one-to-one), multicast (one-to-many), shuffle, exchange,
etc. These routing functions can be implemented on ring, mesh, hypercube or multistage
networks.
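As an added illustration (not part of the original answer), the sketch below tabulates the parameters just defined for two common static topologies, using the standard formulas for a ring and a binary hypercube; the 16-bit channel width is an assumed example value.

```python
# Illustrative sketch (assumed formulas, not from the original answer):
# ring of N nodes            -> degree 2, diameter floor(N/2), channel bisection b = 2
# hypercube of N = 2^n nodes -> degree n, diameter n,          channel bisection b = N/2
import math

def ring_params(N):
    return {"degree": 2, "diameter": N // 2, "b": 2}

def hypercube_params(N):
    n = int(math.log2(N))
    return {"degree": n, "diameter": n, "b": N // 2}

w = 16  # assumed channel width in bits; wire bisection width B = b * w
for name, p in [("ring-16", ring_params(16)), ("hypercube-16", hypercube_params(16))]:
    print(name, p, "B =", p["b"] * w)
```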
There are different processing levels, and parallelism can be exploited at each level as
shown in the figure given below. The lower the level, the finer the grain size.
Instruction Level Parallelism: The typical grain size is less than 20 instructions.
Parallelism at this level is detected by optimizing compilers, which transform the source
code into parallel code. Grain size: fine.
Loop Level Parallelism: If loop iterations are independent, they can be executed in
parallel by a vector processor (see the sketch after this list). Recursive loops, however,
are rather difficult to parallelize. A typical loop contains less than 500 instructions.
Grain size: fine.
Procedure Level Parallelism: Here the grain size is medium, typically corresponding to a
procedure or subroutine with less than 2000 instructions. Detection of parallelism at this
level is much more difficult than at the finer-grain levels.
Subprogram Level Parallelism: The grain size may typically contain tens or hundreds of
thousands of instructions. Traditionally, parallelism at this level has been exploited by
algorithm designers or programmers, rather than by compilers.
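As a brief, hedged illustration (my example, not from the original answer), the sketch below contrasts a loop with independent iterations, which can run in parallel, with a recurrence whose iterations depend on earlier results.

```python
import numpy as np

a = np.arange(1000.0)
b = np.arange(1000.0)

# Independent iterations: c[i] = a[i] + b[i]; every iteration can execute in parallel
# (NumPy's vectorized addition stands in for a vector processor here).
c = a + b

# Recurrence: s[i] = s[i-1] + a[i]; each iteration needs the previous result,
# so this loop is much harder to parallelize.
s = np.empty_like(a)
s[0] = a[0]
for i in range(1, len(a)):
    s[i] = s[i - 1] + a[i]
```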
Omega Network
The figure given below shows the four possible connection states of the 2x2 switch modules
used for constructing the Omega network, together with an 8x8 Omega network. Three stages
of 2x2 switches are needed. There are 8 inputs on the left and 8 outputs on the right. The
interstage connection (ISC) pattern is the perfect shuffle over 8 objects. The outputs from
each stage are connected to the inputs of the next stage through a perfect shuffle
connection.
In general, an n-input Omega network requires log2 n stages of 2x2 switches. Each stage
requires n/2 switch modules, so the network uses (n/2) log2 n switches in total. Each
switch module is individually controlled.
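A minimal sketch (my illustration, assuming n is a power of two) of the perfect-shuffle interstage pattern and the switch count stated above:

```python
import math

def perfect_shuffle(i, n):
    # Perfect shuffle = 1-bit left rotation of the log2(n)-bit address of i.
    bits = int(math.log2(n))
    return ((i << 1) | (i >> (bits - 1))) & (n - 1)

n = 8
print([perfect_shuffle(i, n) for i in range(n)])  # [0, 2, 4, 6, 1, 3, 5, 7]
stages = int(math.log2(n))                        # 3 stages for an 8x8 Omega network
print(stages, (n // 2) * stages)                  # 3 stages, 12 switches = (n/2)*log2(n)
```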
Baseline Network
The first stage contains one NxN block, and the second stage contains two N/2 x N/2
sub-blocks, labeled C0 and C1. The construction process is applied recursively to the
sub-blocks until the sub-block size is reduced to 2x2. The ultimate building blocks are
thus 2x2 switches, each with two legitimate connection states: straight and crossover
between the two inputs and the two outputs. A 16x16 baseline network is shown below.
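A minimal sketch (my illustration, assuming N is a power of two) of the recursive construction just described; counting the 2x2 switches stage by stage gives (N/2) log2 N in total:

```python
def baseline_switch_count(N):
    # An N x N baseline network = one stage of N/2 switches + two N/2 x N/2 sub-blocks.
    if N == 2:
        return 1  # a single 2x2 switch
    return N // 2 + 2 * baseline_switch_count(N // 2)

print(baseline_switch_count(16))  # 32 = (16/2) * log2(16)
```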
3(a) Explain Crossbar network with a neat diagram 08 CO1 L2
Crossbar Network
The highest bandwidth and interconnection capability are provided by crossbar networks. A
crossbar network can be visualized as a single-stage switch network. Each crosspoint switch
can provide a dedicated connection path between a (source, destination) pair, and can be
set on or off dynamically upon program demand. Two types of crossbar networks are shown in
the figure given below.
Interprocessor-Memory Crossbar Network: The pioneering C.mmp implemented a 16x16 crossbar
network connecting 16 PDP-11 processors to 16 memory modules, each with a capacity of one
million words. The 16 memory modules could be accessed by the processors in parallel. Each
memory module can satisfy only one processor request at a time, so only one crosspoint
switch can be set on in each column. However, several crosspoint switches can be set on
simultaneously in order to support parallel memory accesses.
Interprocessor Crossbar Network
A large 224x224 crossbar of this kind was actually built in the VPP500 vector parallel
processor by Fujitsu Inc. (1992). The PEs are processors with attached memory, and the CPs
are control processors used to supervise the entire system operation, including the
crossbar network. In this crossbar, only one crosspoint switch can be set on in each row
and each column at a time. The interprocessor crossbar provides permutation connections
among the processors; only one-to-one connections are provided. Therefore, an nxn crossbar
connects at most n (source, destination) pairs at a time.
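A minimal sketch (my illustration) of the constraint just stated: a setting of the interprocessor crossbar is legal only if it forms a (partial) permutation, i.e. no row or column has more than one crosspoint switched on.

```python
def is_legal_setting(pairs, n):
    """pairs: list of (source, destination) connections in an n x n crossbar."""
    srcs = [s for s, _ in pairs]
    dsts = [d for _, d in pairs]
    return len(pairs) <= n and len(set(srcs)) == len(srcs) and len(set(dsts)) == len(dsts)

print(is_legal_setting([(0, 3), (1, 2), (2, 0)], 4))  # True: a partial permutation
print(is_legal_setting([(0, 3), (1, 3)], 4))          # False: two sources contend for one destination
```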
b) Average Parallelism
Consider a parallel computer with n homogeneous processors, where the maximum parallelism
in the profile is m. In the ideal case, n >> m. Let Δ be the computing capacity of a single
processor. The total amount of work W (computations performed) is given as follows.
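The expressions themselves do not appear in this copy of the solution; a hedged reconstruction of the standard forms (assuming t_i denotes the total time during which the degree of parallelism equals i) is:

```latex
% Total work, with \Delta the computing capacity of a single processor:
W = \Delta \sum_{i=1}^{m} i\, t_i
% Average parallelism over the execution profile:
A = \frac{\sum_{i=1}^{m} i\, t_i}{\sum_{i=1}^{m} t_i}
```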
a) Barrel Shifter
Let the number of nodes be N = 2^n. The barrel shifter has a node degree of d = 2n - 1 and
a diameter of D = n/2. Node i is connected to node j if |j - i| = 2^r (mod N) for some
r = 0, 1, ..., n - 1. For N = 16, the barrel shifter has a node degree of 7 and a diameter
of 2. Still, the barrel shifter's complexity is much lower than that of the completely
connected network.
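A minimal sketch (my illustration) that builds the barrel-shifter connectivity for N = 16 and confirms the node degree of 7 and diameter of 2 quoted above:

```python
from collections import deque

def neighbors(i, N, n):
    # Node i connects to (i +/- 2^r) mod N for r = 0, 1, ..., n-1.
    return {(i + s * 2**r) % N for r in range(n) for s in (+1, -1)}

def diameter(N, n):
    worst = 0
    for src in range(N):
        dist, q = {src: 0}, deque([src])
        while q:
            u = q.popleft()
            for v in neighbors(u, N, n):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        worst = max(worst, max(dist.values()))
    return worst

N, n = 16, 4
print(len(neighbors(0, N, n)))  # 7 = 2n - 1
print(diameter(N, n))           # 2
```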
b) Cube Connected Cycle
This architecture is a modification of the hypercube. A 3-cube is modified to form the
3-cube-connected cycles (3-CCC) with a network diameter of 6. The idea is to replace each
corner vertex of the 3-cube with a ring (cycle) of 3 nodes, as shown in the figure below.
In general, k-cube-connected cycles can be formed from a k-cube with 2^k cycles, each cycle
having k nodes. Thus a k-cube can be transformed into a k-CCC with k * 2^k nodes. The
network diameter of the k-CCC is 2k. The major improvement of the CCC lies in its constant
node degree of 3, which is independent of the dimension of the underlying hypercube. The
CCC is therefore a better architecture for building scalable systems if the longer latency
can be tolerated in some way.
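A small sketch (my illustration) of the k-CCC parameters stated above:

```python
def ccc_params(k):
    return {
        "nodes": k * 2**k,   # each of the 2^k cube corners becomes a k-node cycle
        "degree": 3,         # constant node degree, independent of k
        "diameter": 2 * k,   # as stated in the answer above
    }

print(ccc_params(3))  # {'nodes': 24, 'degree': 3, 'diameter': 6}
```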
Physical Limitations
1. Due to electrical, mechanical, and packaging limitations, only a limited number
of boards can be plugged into a single backplane bus.
2. The bus system is difficult to scale and mainly limited by contention.
3. Multiple backplane buses can be mounted on the same backplane chassis.
7(a) Explain all the application models and efficiency curve with neat diagram. 08 CO1 L2
Efficiency Curves
There are three speedup performance models, shown in the figure given below, which are
bounded respectively by limited memory, limited tolerance for IPC latency, and limited I/O
bandwidth.
7(b) What is Torus Topology? What are advantages of Folded Torus over Torus? 02 CO1 L3
The torus has ring connections along each row and along each column of the array. In
general, an nxn binary torus has a node degree of 4 and a diameter of 2⌊n/2⌋. The torus is
a symmetric topology.
The folded torus has uniform wire length, as shown in the figure given below.
8 Explain Amdahl’s law with neat diagram 10 CO4 L2
Now we define the fixed-load speedup factor Sn as the ratio of T(1) to T(n), where T(1) is
the response time on a uniprocessor system.
Consider a situation where the system can operate in only two modes, i.e. sequential mode
with DOP = 1 and perfectly parallel mode with DOP = n. Then Sn is given as follows.
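The formula itself is not reproduced in this copy; the standard fixed-load (Amdahl) expression, assuming α is the sequential fraction of the workload (W1 = α, Wn = 1 - α), is:

```latex
S_n = \frac{T(1)}{T(n)}
    = \frac{W_1 + W_n}{W_1 + W_n/n}
    = \frac{n}{1 + (n-1)\alpha},
\qquad
\lim_{n \to \infty} S_n = \frac{1}{\alpha}
```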
Hence the execution time of the parallel portion of the program is reduced when it is
executed by n processors, but the sequential portion does not change and requires a
constant amount of time, as shown in the figure below. Thus
W1 + Wn = α + (1 - α) = 1
Here α represents the fraction of the program that must be executed sequentially, and
1 - α corresponds to the portion of the code that can be executed in parallel. The total
amount of workload W1 + Wn is kept constant.
The speedup decreases very rapidly as α increases. This means that even with a small
percentage of sequential code, the overall speedup cannot exceed 1/α. This is the
sequential bottleneck in the program, and it cannot be overcome simply by increasing the
number of processors.
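A short sketch (my illustration) of how quickly the fixed-load speedup saturates at 1/α; the processor count of 256 is an assumed example value:

```python
def amdahl_speedup(n, alpha):
    return n / (1 + (n - 1) * alpha)

for alpha in (0.01, 0.05, 0.10, 0.25):
    print(f"alpha={alpha:.2f}  S(256)={amdahl_speedup(256, alpha):6.1f}  1/alpha={1/alpha:6.1f}")
```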