Computer Architecture

Pipeline Architecture 2
Vector Processor 43
Memory 71
RISC & CISC Architectures 93
Interprocess Communication 99

NOTE:
The WBUT course structure and syllabus of the 4th semester has been changed from 2012. Advanced Computer Architecture (4th Sem, CS) has been redesigned as Computer Architecture (4th Sem, CS). Taking special care of this matter, we have covered Computer Architecture questions from 2005 to 2011 along with the complete solutions of the new university papers, so that students can get an idea about the university question patterns.

POPULAR PUBLICATIONS          COMPUTER ARCHITECTURE
Multiple Choice Type Questions

1. The number of cycles required to complete n tasks in a k-stage pipeline is
[WBUT 2008, 2009, 2011, 2014, 2016]
a) k + n - 1        b) nk + 1        c) k        d) none of these
Answer: (a)

… a) 4.5        b) 7.1        c) 6.5        d) None of these
Answer: (d)

9. A dynamic pipeline allows [WBUT 2007, 2016]
a) evaluation of multiple functions        b) only streamline connections
c) performing a fixed function        d) none of these
Answer: (a)
2. A 4-ary 3-cube hypercube architecture has [WBUT 2007]
a) 3 dimensions with 4 nodes along each dimension
b) 4 dimensions with 3 nodes along each dimension
c) both (a) and (b)        d) none of these
Answer: (a)

3. Which of these are examples of 2-dimensional topologies in static networks?
[WBUT 2007, 2010]
a) Mesh        b) 3CCC networks        c) Linear array        d) None of these
Answer: (a)

10. The division of the stages of a pipeline into sub-stages is the basis for
[WBUT 2009, 2014]
a) pipelining        b) super-pipelining        c) superscalar        d) VLIW processor
Answer: (b)

11. A pipeline stage [WBUT 2012, 2014]
a) is a sequential circuit
b) is a combinational circuit
c) consists of both sequential and combinational circuits
d) none of these
Answer: (c)

12. The utilization pattern of successive stages of a synchronous pipeline can be specified by [WBUT 2012, 2015]
Answer: a reservation table.

4. The seek time of a disk is 30 ms. It rotates at the rate of 30 rotations/second. The capacity of each track is 300 words. The access time is approximately
[WBUT 2007, 2008, 2010]
a) 62 ms        b) 60 ms        c) 47 ms        d) none of these
Answer: (c)
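The arithmetic behind the answer to question 4 can be checked directly: the average access time is the seek time plus the average rotational latency (half a rotation). A minimal sketch in Python:

```python
# Average disk access time = seek time + average rotational latency.
# Values from the question: 30 ms seek, 30 rotations/second.
seek_ms = 30.0
rotations_per_second = 30.0

rotation_ms = 1000.0 / rotations_per_second   # one full rotation: about 33.33 ms
avg_latency_ms = rotation_ms / 2.0            # on average half a rotation: about 16.67 ms

access_ms = seek_ms + avg_latency_ms
print(round(access_ms, 2))                    # 46.67, i.e. approximately 47 ms
```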
5. For two …

7. A single bus structure is primarily found in [WBUT 2008]
a) Main frames        b) High-performance machines
c) Mini- and micro-computers        d) Supercomputers
Answer: (c)

… [WBUT 2012]
a) VLIW processor        b) Superscalar processor
c) Super-pipelined processor        d) none of these
Answer: (b)

15. Which of the following is not the cause of a possible data hazard? [WBUT 2012]
a) RAR        b) RAW        c) WAR        d) WAW
Answer: (a)
16. What will be the speed-up for a 4-segment linear pipeline when the number of instructions is 64? [WBUT 2013, 2014]
a) 4.5        b) 3.82        c) 8.16        d) 2.95
Answer: (b)

Time required for the non-pipelined process = nkτ
Time required for the pipelined process = (k + (n - 1))τ

Speed-up S = nkτ / ((k + (n - 1))τ) = nk / (k + (n - 1))

When n >> k, the maximum speed-up is S ≈ k. The maximum speed-up is never fully achievable because of data dependencies between instructions, interrupts, program branches, etc. Many pipeline cycles may be wasted in a waiting state caused by out-of-sequence instruction executions.

17. Which type of data hazard is not possible? [WBUT 2013]
a) WAR        b) RAW        c) RAR        d) WAW
Answer: (c)

18. MIPS means [WBUT 2013]
a) Multiple Instructions Per Second        b) Million Instructions Per Second
c) …        d) …
Answer: (b) Million Instructions Per Second
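The speed-up formula above is easy to check numerically. The sketch below uses n = 64 instructions, the value implied by the printed answer of 3.82 for a 4-segment pipeline:

```python
# Pipeline speed-up: S = n*k / (k + (n - 1)).
def speedup(n, k):
    return n * k / (k + (n - 1))

print(round(speedup(64, 4), 2))       # 3.82 -> option (b)
print(round(speedup(10**6, 4), 2))    # approaches k = 4 when n >> k
```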
19. Compare superscalar and super-pipelined processors. [WBUT 2014]
Answer:
A degree-of-super-pipelining metric, developed from benchmark simulations, suggests that this metric is already high for many machines: they already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining. In a superscalar processor, multiple parallel functional units and execution pipelines are kept busy by issuing multiple instructions per cycle, so superscalar processing has approached the limits of this available parallelism. Super-pipelined machines instead issue instructions at a higher clock rate by dividing each pipeline stage into substages.
20. Suppose the time delays of the four stages of a pipeline are t1 = 60 ns, t2 = 50 ns, t3 = 90 ns and t4 = 80 ns respectively, and the interface latch has a delay of 10 ns. Then the clock period at the maximum clock frequency of the pipeline is [WBUT 2016]
a) 100 ns        b) 90 ns        c) 190 ns        d) 30 ns
Answer: (a)
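The clock period in question 20 follows from the slowest stage plus the latch delay; a quick check:

```python
# The pipeline clock period must cover the slowest stage plus the
# interface latch delay; the maximum frequency is its reciprocal.
stage_delays_ns = [60, 50, 90, 80]
latch_delay_ns = 10

period_ns = max(stage_delays_ns) + latch_delay_ns
freq_mhz = 1000.0 / period_ns    # f(MHz) = 1000 / T(ns)

print(period_ns)   # 100 ns -> option (a)
print(freq_mhz)    # 10.0 MHz
```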
1. Define Speed-up. Deduce that the maximum speed-up in a k-stage pipelined processor is k. Is this maximum speed-up always achievable? Explain. [WBUT 2006]
Answer:
Speed-up is the ratio of the time needed to execute an instruction stream without pipelining to the time needed with pipelining. For n instructions on a k-stage pipeline with stage time τ, S = nkτ / ((k + (n - 1))τ) = nk / (k + n - 1). If there are no stalls (waits), then as n grows this ratio approaches k, so the maximum speed-up of a k-stage pipelined processor is k. This maximum is never fully achievable in practice because of data dependencies between instructions, interrupts and program branches.

Super-pipelining:
Given a pipeline stage time T, it may be possible to execute instructions at a higher rate by starting operations at intervals of T/n. This may be accomplished in two ways: each pipeline stage can be further divided into n substages, or n overlapped pipelines can be provided.

Superscalar / Super-pipelining:
We may also combine superscalar operation with super-pipelining; the total throughput is then potentially the product of the individual speed-up factors. However, it is even more difficult to interlock between parallel pipes that are divided into many stages. Also, the memory subsystem must be able to sustain a level of instruction throughput corresponding to the total throughput of the multiple pipelines, stretching the processor/memory performance gap even more. Of course, with so many pipes and so many stages, branch penalties become huge, and branch prediction becomes a serious bottleneck. But the real problem may be in finding the parallelism required to keep all of the pipes and stages busy between branches.

Concurrent execution also presents a problem. Consider, for example, i1: r1 ← r2 + r3 followed by an instruction i2 that writes register r3. There is a chance that i2 may be completed before i1, so we must ensure that we do not store i2's result into register r3 before i1 has had a chance to fetch its operands.
2. Consider that a machine with 12 pipelines of 20 stages each must always have access to a window of 240 instructions that are scheduled so as to avoid all hazards, and that the average of 40 branches present in a block of that size are all correctly predicted sufficiently in advance to avoid stalling the prefetch unit.

What are the different parameters used in measuring CPU performance? Briefly discuss. [WBUT 2008, 2015]
Answer:
To estimate CPU performance, the measure that is generally most important is the execution time T, because we can write

Performance = 1 / Execution time

If the execution time increases, the CPU performance decreases. There are three parameters that measure the performance of a pipelined system: speed-up, efficiency and throughput.

3. What are the different factors that can affect the performance of a pipelined system? Differentiate between WAR and RAW with a suitable example. [WBUT 2007]
Answer:
Pipelining achieves a reduction of the average execution time per instruction, in the sense that the pipeline can perform more instructions per clock cycle. When considering the impact of some performance improvement, the effect of the improvement is usually expressed in terms of the speed-up S, taken as the ratio of the execution time without the improvement to the execution time with it.
Superscalar operation is limited by the number of independent operations that can be extracted from an instruction stream. It has been shown in early studies on simpler processor models that this number is limited, mostly by branches, to a small value. The superscalar technique is traditionally associated with several identifying characteristics:
- Instructions are issued from a sequential instruction stream.
- The CPU hardware dynamically checks for data dependencies between instructions.

Super-pipeline:
Given a pipeline stage time T, it may be possible to execute operations at a higher rate, starting operations at intervals of T/n. This can be accomplished in two ways:
- Further divide each of the pipeline stages into n substages.
- Provide n pipelines that are overlapped.

6. What is meant by pipeline stall? [WBUT 2008]
Answer:
A pipeline operation is said to have stalled if one unit (stage) requires more time to perform its function, thus forcing other stages to become idle. Consider, for example, the case of an instruction fetch that incurs a cache miss, and assume that a cache miss requires three extra time units.

Inserting NOPs can prevent such hazards from occurring. As instructions are fetched, control logic determines whether or not a hazard could or will occur. If so, the control logic inserts NOPs (No Operations) into the pipeline. Thus, before the next instruction (which would cause the hazard) is executed, the previous one will be sufficiently complete to prevent the hazard. If the number of NOPs is equal to the number of stages in the pipeline, the processor has been cleared of all instructions and can proceed free from hazards.
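A toy model of the NOP-insertion logic described above. The three-field instruction format and the two-cycle gap are assumptions made for illustration, not part of the original answer:

```python
# Toy model of NOP insertion: if an instruction reads a register written
# by the instruction immediately before it (a RAW dependence), the control
# logic inserts NOPs until the earlier instruction has completed.
def schedule(instrs, nop_gap=2):
    """Return the issue stream with NOPs inserted for back-to-back RAW pairs."""
    stream = []
    for i, (dest, s1, s2) in enumerate(instrs):
        if i > 0:
            prev_dest = instrs[i - 1][0]
            if prev_dest in (s1, s2):          # next instruction reads the previous result
                stream.extend(["NOP"] * nop_gap)
        stream.append(f"op {dest}, {s1}, {s2}")
    return stream

prog = [("r1", "r2", "r3"),   # r1 <- r2 op r3
        ("r4", "r1", "r5")]   # reads r1, so NOPs are inserted first
print(schedule(prog))         # ['op r1, r2, r3', 'NOP', 'NOP', 'op r4, r1, r5']
```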
Result forwarding is a technique to minimize stalls in pipeline processing. After the execution of one instruction, the result is transferred to the next instruction directly, bypassing the write-back stage: instead of first writing the result into the register, the pipeline forwards that result to the next instruction.
(Result forwarding may not work in the case of a branch instruction in a pipelined execution. So, if there is a branch instruction in the above instruction set, then result forwarding will not work.)

8. Explain the DMA working principle. [WBUT 2009]
Answer:
The main idea of direct memory access (DMA) is to transfer data between peripheral devices and main memory while bypassing the CPU. It allows peripheral devices to transfer data directly from and to memory without the intervention of the CPU. Having peripheral devices access memory directly allows the CPU to do other work, which leads to improved performance, especially in the case of large transfers.

The DMA controller controls one or more peripheral devices. It allows devices to transfer data to or from the system's memory without the help of the processor. Both the DMA controller and the CPU use the memory bus, and only one or the other can use the memory at the same time. The DMA controller sends a request to the CPU asking permission to use the bus, and the CPU returns an acknowledgment granting it bus access. The DMA controller can then take control of the bus to independently conduct the memory transfer. When the transfer is complete, the DMA controller relinquishes its control of the bus to the CPU.

A DMA controller has an address register, a word count register, and a control register. The address register contains an address that specifies the memory location of the data to be transferred; it is typically possible to have the DMA controller automatically increment it after each word transferred. The word count is decremented by one after each word transfer. The control register specifies the transfer mode.

Fig: DMA controller sharing the CPU's memory bus

DMA data transfer can be performed in burst mode or single-cycle mode. In burst mode, the DMA controller keeps control of the bus until all the data has been transferred to (from) memory from (to) the peripheral device. This mode of transfer is needed for fast devices where the data transfer cannot be stopped until the entire transfer is done. In single-cycle mode (cycle stealing), the DMA controller relinquishes the bus after each transfer of one data word. This minimizes the amount of time that the DMA controller keeps the CPU from controlling the bus, but it requires that the bus request/acknowledge sequence be performed for every single transfer, and this overhead can result in a degradation of performance. The single-cycle mode is preferred if the system cannot tolerate more than a few cycles of added interrupt latency, or if the peripheral devices can buffer very large amounts of data, causing the DMA controller to tie up the bus for an excessive amount of time.

9. What are the different pipeline hazards and what are the remedies? [WBUT 2009]
OR,
Discuss data hazards briefly. [WBUT 2015]
Answer:
In computer architecture, a hazard is a potential problem that can happen in a pipelined processor. There are typically three types of hazards: data hazards, branching hazards, and structural hazards.

i. Data hazards: Instructions in a pipelined processor are performed in several stages, so that at any given time several instructions are being executed, and instructions may not be completed in the desired order; a later instruction may use incorrect data. The operands involved in data hazards can reside in memory or in a register.
- Write after Read (WAR): an operand is read and a later instruction writes soon after to that same operand. Because the write may have finished before the read, the read instruction may incorrectly get the new written value.
- Write after Write (WAW): two instructions that write to the same operand are performed. The first one issued may finish second, and therefore leave the operand with an incorrect data value.

ii. Structural hazards: a structural hazard occurs when a part of the processor's hardware is needed by two or more instructions at the same time. A structural hazard might occur, for instance, if a program were to execute a branch instruction followed by a computation instruction.
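The bus-cycle trade-off between the two DMA modes can be sketched with a cost model; the handshake and per-word costs below are assumed illustrative values, not figures from the text:

```python
# Rough cost model for the two DMA transfer modes described above.
# Assumed costs (illustrative only): one bus request/acknowledge handshake
# takes 2 bus cycles, and each word transfer takes 1 bus cycle.
def bus_cycles(words, mode, handshake=2, per_word=1):
    if mode == "burst":      # one handshake for the whole block
        return handshake + words * per_word
    if mode == "single":     # cycle stealing: one handshake per word
        return words * (handshake + per_word)
    raise ValueError(mode)

print(bus_cycles(1000, "burst"))    # 1002 cycles
print(bus_cycles(1000, "single"))   # 3000 cycles, but the CPU gets the bus back between words
```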
iii. Branch hazards:
Branching hazards (also known as control hazards) occur when the processor is told to branch: if a certain condition is true, it jumps from one part of the instruction stream to another, not necessarily to the next sequential instruction. In such a case, the processor cannot tell in advance whether it should process the next instruction, as it may instead have to move to a distant instruction. This can result in the processor doing unwanted actions. Branching is typically slow, requiring a comparison, program-counter-related computation, and writing to registers. Because instructions are executed in parallel, it is possible (depending on the architecture) that a computation instruction and a branch instruction will both require the ALU at the same time.

An instruction pipeline is a technique used in the design of computer systems to increase instruction throughput. The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with storage at the end of each step. The principles used in instruction pipelining can also be used to improve the performance of computers in performing arithmetic operations such as add, subtract, and multiply. There are five steps to execute an instruction:
i. Fetch
ii. Decode
iii. Operand fetch
iv. Execute
v. Write back

Fig: Instruction pipeline stages (instruction fetch, instruction decode, operand fetch, instruction execution, result write back)

The streams of instructions are executed in the pipeline in an overlapped manner.
10. Use 8-bit 2's complement integers to perform (-43) - (+13). [WBUT 2009]
Answer:
-43 = 11010101
-13 = 11110011
(-43) + (-13) = 1 11001000
Discarding the carry, the result is 11001000, which is the 8-bit 2's complement representation of -56.
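The 8-bit working above can be verified with Python's masking arithmetic:

```python
# Checking the 8-bit two's complement working for (-43) - (+13).
MASK = 0xFF                      # keep only 8 bits

def to_twos(value, bits=8):
    """Two's complement bit pattern of a signed value."""
    return value & ((1 << bits) - 1)

a = to_twos(-43)                 # 11010101
b = to_twos(-13)                 # 11110011
total = (a + b) & MASK           # the AND discards the carry out of bit 7

print(format(a, "08b"))          # 11010101
print(format(b, "08b"))          # 11110011
print(format(total, "08b"))      # 11001000
print(total - 256)               # -56, the signed interpretation
```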
11. What do you mean by pipeline processing? [WBUT 2009]
Answer:
Pipelining refers to the technique in which a given task is divided into a number of subtasks that need to be performed in sequence. Each subtask is performed by a given functional unit. The units are connected in a serial fashion and all of them operate simultaneously. The use of pipelining improves the performance compared to the traditional sequential execution of tasks. The figure below illustrates the basic difference between executing the four subtasks of a given instruction (in this case fetching F, decoding D, execution E, and writing the results W) using pipelining and sequential processing.

Fig: Pipeline vs. sequential processing (space-time diagram of stages IF, ID, OF, IE, WB against clock cycles)
Pipelining can also be applied to arithmetic operations. As an example, consider a floating-point addition pipeline. The floating-point add unit has several stages:
1. Unpack: This stage partitions the floating-point numbers into their three fields: the sign field, the exponent field, and the mantissa field. Any special cases, such as not-a-number (NaN), zero, and infinities, are detected during this stage.
2. Align: This stage aligns the binary points of the two mantissas by right-shifting the mantissa of the number that has the smaller exponent.
3. Add: This stage adds the two aligned mantissas.
12. What are instruction pipeline and arithmetic pipeline? [WBUT 2009]
Answer:
An instruction pipeline splits the processing of an instruction into a series of independent steps (fetch, decode, operand fetch, execute, write back), as described above, in order to increase instruction throughput. An arithmetic pipeline applies the same principle to arithmetic operations; the floating-point addition pipeline described above is an example.
The final stage of the floating-point add unit is:
4. Normalize: This stage packs the three fields of the result, after normalization and rounding, into the IEEE-754 floating-point format. Any output exceptions are detected during this stage.

13. Find the 2's complement of (1AB)₁₆ represented in 16-bit format. [WBUT 2009]
Answer:
(1AB)₁₆ = (0000 0001 1010 1011)₂
Its 2's complement is (1111 1110 0101 0101)₂.

15. "Instruction execution throughput increases in proportion with the number of pipeline stages." Is it true? Justify your statement. [WBUT 2012, 2015]
Answer:
Pipelining refers to the technique in which a given task is divided into a number of subtasks that need to be performed in sequence. Each subtask is performed by a given functional unit; the units are connected in a serial fashion and all of them operate simultaneously. Consider the execution of m tasks (instructions) on an n-stage pipeline in which a task requiring total time T is split into n stages of delay T/n each. The total time is (n + m - 1)·T/n, so the throughput is

U(n) = mn / ((n + m - 1)·T)

i.e. the number of tasks executed per unit time. For m >> n this approaches n/T, so for long instruction streams the throughput does grow roughly in proportion to the number of stages. The statement is only approximately true, however: the (n - 1)-cycle fill time and the latch overhead added to every stage keep the real increase below exact proportionality.
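The throughput formula above can be explored numerically; this sketch uses the same U(n) expression:

```python
# Throughput of an n-stage pipeline executing m tasks, where a task of
# total time T is split into n stages of T/n each:
#   U(n) = m*n / ((n + m - 1) * T)
def throughput(m, n, T=1.0):
    return m * n / ((n + m - 1) * T)

# Long instruction stream (m >> n): throughput grows almost linearly with n.
print(round(throughput(m=10_000, n=5), 3))    # 4.998, close to n = 5
print(round(throughput(m=10_000, n=10), 3))   # 9.991, close to n = 10
# Short stream: the (n - 1)-cycle fill overhead erodes the gain.
print(round(throughput(m=10, n=10), 3))       # 5.263, far below 10
```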
14. What are the different factors that can affect the performance in a pipelined system? Differentiate between WAR and RAW hazards. [WBUT 2010]
Answer:
Pipelining achieves a reduction of the average execution time per instruction, in the sense that the pipeline can perform more instructions per clock cycle. This can be viewed in two ways:
- Decreasing the CPI: the typical way in which people view the performance increase.
- Decreasing the cycle time (i.e., increasing the clock rate).
Pipelining increases the CPU instruction throughput. Pipelining does not decrease the execution time of an individual instruction; it actually increases it, due to overhead (clock skew and pipeline register delay) in the control of the pipeline.

WAR hazards:
1. Instruction j tries to write a destination before it is read by instruction i, so i incorrectly gets the new value.
2. A WAR hazard is eliminated by register renaming of all the destination registers, including those with a pending read or write for an earlier instruction.
3. Example: if there is a chance that i2 may be completed before i1 (i.e., with concurrent execution), we must ensure that we do not store the result into register 3 before i1 has had a chance to fetch its operands.

RAW hazards:
1. Instruction j tries to read a source before instruction i writes it, so j incorrectly gets the old value.
2. RAW hazards are avoided by executing an instruction only when its operands are available.
3. Example: the first instruction is calculating a value to be saved in register 2, and the second is going to use this value to compute a result for register 4. However, in a pipeline, when we fetch the operands for the second operation, the result from the first will not yet have been saved, and hence we have a data dependency.

16. For the code segment given below, explain how delayed branching can help:
LOAD R1, A
DEC R3, 1
BRZERO R3, 15
ADD R2, R4
SUB R5, R6
STORE R5, B
[WBUT 2013]
Answer:
Instruction I2 performs "DEC R3, 1" and I3 performs "BRZERO R3, 15", so I2 and I3 use register R3 back to back. With a delayed branch, the processor first lets "DEC R3, 1" modify the value of R3, and only then evaluates the branch "BRZERO R3, 15".

17. For the following code, show how loop unrolling can help improve instruction-level parallelism (ILP):
Loop1: I1: LOAD R0, A(R1)   ; A is the starting address of the array; R1 holds the address of the current element
       I2: ADD R0, R0, R2   ; R2 is a scalar
       I3: …                ; go to the next word in the array of doubles
[WBUT 2013]
Answer:
The overlap among instructions, possible when they are independent of one another, is called instruction-level parallelism (ILP), since the instructions can be evaluated in parallel. The simplest and most common way to increase the amount of parallelism available among instructions is to exploit parallelism among iterations of a loop. A loop is parallel unless there is a cycle in its dependencies, since the absence of a cycle means that the dependencies give a partial order on the statements.
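A minimal sketch of the unrolling idea from question 17, written in Python for illustration (the array A and scalar R2 follow the question; the concrete values and the unroll factor of 4 are assumptions):

```python
# Loop unrolling: the rolled loop handles one array element per iteration;
# the unrolled loop handles four, giving the scheduler four independent
# additions per iteration to overlap in the pipeline.
A = list(range(8))     # the array (assumed values)
R2 = 10                # the scalar added to each element

# Rolled: one element per iteration.
rolled = [a + R2 for a in A]

# Unrolled by 4: each iteration exposes four independent additions.
unrolled = [0] * len(A)
for i in range(0, len(A), 4):
    unrolled[i]     = A[i]     + R2
    unrolled[i + 1] = A[i + 1] + R2
    unrolled[i + 2] = A[i + 2] + R2
    unrolled[i + 3] = A[i + 3] + R2

print(rolled == unrolled)   # True: same result, more ILP per iteration
```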
Statement I1 uses the value assigned in the previous iteration by statement I2, so there is a loop-carried dependency between I1 and I2. Despite this dependency, the loop can be made parallel, because the dependency is not cyclic.

…status to the channel. If two or more programs simultaneously seek access to the channel, the logic apparatus establishes a priority order between them, thereby enabling access to only one program at a time. The priority ordering is based, in part, on the identity of the program last enabled, to thereby assure priority-order allocation among the programs.)
18. What is pipeline chaining? [WBUT 2013]
Answer:
Pipeline chaining is a linking process that occurs when results obtained from one pipeline unit are directly fed into the operand registers of another functional pipe. In other words, intermediate results do not have to be restored into memory and can be used even before the vector operation is completed. Chaining permits successive operations to be issued as soon as the first result becomes available as an operand. The desired functional pipes and operand registers must be properly reserved; otherwise, the chaining operations have to be suspended until the demanded resources become available.

19. Compare the Control-Flow, Data-Flow and Demand-Driven mechanisms. [WBUT 2013]
Answer:
The Control-Flow architecture is the Von Neumann, or control flow, computing model. Here a program is a series of addressable instructions, each of which either specifies an operation along with the memory locations of its operands, or specifies a conditional transfer of control to some other instruction. The next instruction to be executed depends on what happened during the execution of the current instruction; it is pointed to and triggered by the PC. The instruction is executed even if some of its operands are not yet available.
In the Dataflow model, by contrast, the execution is driven only by the availability of operands.

20. Draw the pipeline execution diagram for the execution of the following instructions:
MUL R1, R2, R3
ADD R2, R3, R4
INC R4
SUB R6, R3, R7
Find out the delay in pipeline execution due to data dependency among the above instructions. [WBUT 2016]
Answer:
We consider a five-stage pipeline. The stages are fetch, decode, read, execute and write back. In the figure we also show in which clock pulse each operation is performed. The operations of the instructions are:
R1 ← R2 × R3
R2 ← R3 + R4
R4 ← R4 + 1
R6 ← R3 - R7
Since no instruction reads a register written by an earlier instruction in the sequence, no data hazard will occur in the pipeline.
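The "no data hazard" claim in question 20 can be checked mechanically by looking for RAW dependences, i.e. a later instruction reading an earlier instruction's destination:

```python
# Each instruction is (mnemonic, destination, sources).
instrs = [
    ("MUL", "R1", ("R2", "R3")),   # R1 <- R2 * R3
    ("ADD", "R2", ("R3", "R4")),   # R2 <- R3 + R4
    ("INC", "R4", ("R4",)),        # R4 <- R4 + 1
    ("SUB", "R6", ("R3", "R7")),   # R6 <- R3 - R7
]

# A RAW dependence exists if a later instruction reads an earlier destination.
raw = [(i, j)
       for i, (_, dest, _) in enumerate(instrs)
       for j in range(i + 1, len(instrs))
       if dest in instrs[j][2]]

print(raw)   # [] -> no later instruction reads an earlier result
```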
Long Answer Type Questions

1. What is a pipeline reservation table? Consider the following reservation table:

        S1: …
        S2: …
        S3: …

Write down the forbidden latencies and the initial collision vector. Draw the state diagram for scheduling the pipeline. Find out the simple and greedy cycles and the MAL. If the pipeline clock rate is 25 MHz, then what is the throughput of the pipeline? What are the bounds on the MAL? [WBUT 2007, 2011]
Answer:
There are two types of pipelines: static and dynamic. A static pipeline can perform only one function at a time, whereas a dynamic pipeline can perform more than one function at a time. A pipeline reservation table shows when the stages of a pipeline are in use for a particular function. Each stage of the pipeline is represented by a row in the reservation table. Each row of the reservation table is in turn broken into columns, one per clock cycle. The number of columns indicates the total number of time units required for the pipeline to perform a particular function. To indicate that some stage S is in use at time t, an X is placed at the intersection of the row and column corresponding to that stage and time.

When scheduling a static pipeline, only collisions between different input data for a particular function have to be avoided. With a dynamic pipeline, it is possible for different input data requiring different functions to be present in the pipeline at the same time, so collisions between these data must be considered as well. As with the static pipeline, dynamic pipeline scheduling begins with the compilation of a set of forbidden lists from the reservation tables; next the collision vectors are obtained, and finally the state diagram is drawn.

Pipelining is a technique of splitting a sequential process into sub-operations, each of which is executed in a dedicated segment that operates concurrently with all other segments. An instruction pipeline is a technique used in the design of computers to increase their instruction throughput (the number of instructions that can be executed in a unit of time).
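Forbidden latencies and the initial collision vector can be computed mechanically from any reservation table. Since the table in the question is not legible here, the sketch below uses an assumed three-stage example:

```python
# Assumed example reservation table (NOT the one from the question):
# each stage maps to the clock cycles in which it is used.
table = {
    "S1": [1, 4],
    "S2": [2],
    "S3": [3],
}

# A latency d is forbidden if some stage is used at cycles t and t + d.
forbidden = sorted({t2 - t1
                    for cycles in table.values()
                    for t1 in cycles
                    for t2 in cycles
                    if t2 > t1})

n_cols = max(t for cycles in table.values() for t in cycles)
# Initial collision vector: bit i (leftmost = largest latency) is 1
# iff latency i is forbidden, for i = n_cols - 1 down to 1.
cv = "".join("1" if d in forbidden else "0"
             for d in range(n_cols - 1, 0, -1))

print(forbidden)   # [3]
print(cv)          # '100'
```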
22. Consider the execution of a program of 15,000 instructions by a linear pipeline processor. The clock rate of the pipeline is 25 MHz. The pipeline has five stages and one instruction is issued per clock cycle. Neglect penalties due to branch instructions and out-of-sequence executions.
(i) Calculate the speedup of program execution by the pipeline as compared with a non-pipelined processor.
(ii) What are the efficiency and throughput of the pipeline processor? [WBUT 2016]
Answer:
A non-pipelined architecture is inefficient because some CPU components (modules) are idle while another module is active during the instruction cycle. Pipelining does not completely cancel out idle time in a CPU, but making those modules work in parallel improves program execution significantly. Processors with pipelining are organized into internal stages which can semi-independently work on separate jobs. Each stage is organized and linked into a "chain", so each stage's output is fed to the next stage until the job is done. Pipelining assumes that successive instructions in a program sequence will overlap in execution; this organization allows the overall processing time to be significantly reduced.

Unfortunately, not all instructions are independent. In a simple pipeline, completing an instruction may require five stages. To operate at full performance, this pipeline needs to run four subsequent independent instructions while the first is completing. If four instructions that do not depend on the output of the first are not available, the pipeline control logic must insert a stall, or wasted clock cycle, into the pipeline until the dependency is resolved. Fortunately, techniques such as forwarding can significantly reduce the cases where stalling is required.

The information we are given is:
n = 15,000 instructions
f = 25 MHz
k = 5 stages
One instruction is issued per clock cycle.

(i) Speedup S_k = nkτ / ((k + (n - 1))τ) = nk / (k + n - 1) = (15,000 × 5) / (5 + 14,999) = 75,000 / 15,004 ≈ 4.999

(ii) Efficiency η = S_k / k = 4.999 / 5 ≈ 0.999
Throughput = nf / (k + n - 1) = η·f ≈ 0.999 × 25 MHz ≈ 24.99 MIPS
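The numbers in question 22 can be verified directly:

```python
# Verifying parts (i) and (ii) of question 22.
n = 15_000        # instructions
k = 5             # pipeline stages
f = 25e6          # clock rate, Hz

speedup = n * k / (k + n - 1)
efficiency = speedup / k
throughput_mips = efficiency * f / 1e6   # = n*f / (k + n - 1), in MIPS

print(round(speedup, 3))           # 4.999
print(round(efficiency, 4))        # 0.9997
print(round(throughput_mips, 2))   # 24.99
```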
iii) Z = A × X
iv) M = Z - Y
v) N = Z × X
vi) P = M / N

Fig: Representation of a floating-point number (sign, exponent e, mantissa m)

[WBUT 2008, 2014]
Answer:
a) A data flow computer is a large, very powerful computer that has a number of processors, all physically wired together, with a large amount of memory; such computers are highly parallel.

Floating-Point Arithmetic: Addition/Subtraction
The difficulty in adding two FP numbers stems from the fact that they may have different exponents. Before adding the two numbers, their exponents must be equalized: the mantissa of the number that has the smaller exponent must be right-shifted (assuming 4 bits are allowed after the radix point). The steps are alignment of the mantissas, addition, and normalization of the result.
Multiplication: the exponent of the product is obtained by adding the exponents together.

i) What are the forbidden latencies and the initial collision vector? Draw the state transition diagram.
ii) Determine all simple cycles and greedy cycles.

Fig: State transition diagram (states such as 101010, 101110, 111010, 110010 and 111110 are reached after 1, 2 or 3 cycles; state 100010 is reached after 5 or more cycles)
Input: a, b, c
for i = 1 to 8 do
    d(i) = a(i) / b(i)
    …
end
Output: d, e, …

With this pipelined structure the throughput is … = 0.025 instructions per unit time.
5. a) What do you mean by MMX? Differentiate a data flow computer from a control flow computer.
b) List some potential problems with data flow computer implementation.
c) With a simple diagram, explain the data flow architecture.
d) Draw data flow graphs to represent the given computations.
[WBUT 2011, 2013, 2014]
Answer:
a) MMX technology is an extension to the Intel Architecture (IA) designed to improve multimedia and communication algorithms. The Pentium processor with MMX technology was the first microprocessor to implement the new instruction set. MMX processors also had a larger internal L1 cache than their non-MMX counterparts. Central to the MMX implementation was the design of a new, dedicated, high-performance MMX pipeline, which was able to execute two MMX instructions in parallel with few changes to the existing units. Although adding a pipeline stage improves frequency, it decreases CPI performance; i.e., the longer the pipeline, the more work is done speculatively by the machine, and therefore the more work is thrown away in the case of branch mis-prediction.

d) In the above example, there are three instructions in the body of the "for loop", and the loop is executed 8 times, so a total of 24 instructions will be executed. Suppose the add, multiply and divide operations require 1, 2 and 3 clock cycles to complete, respectively. The data flow graph for the above instructions is shown in the figure below.

Fig: data flow graph

The above instructions can be executed within 14 clock cycles, as shown in the figure.
Performance analysis should help answer questions such as how fast a given program can be executed on a given computer. In order to answer such a question, we need to determine the time taken by the computer to execute a given job. We define the clock cycle time as the time between two consecutive rising (trailing) edges of a periodic clock signal. Clock cycles allow counting unit computations, because the storage of computation results is synchronized with the rising (trailing) clock edges. The time required…
d
Approaches to handling branch hazards include the following.

Loop buffer (look-ahead, look-behind buffer): Many conditional branch operations are used for loop control. Expand the prefetch buffer so as to buffer the last few instructions executed, in addition to the ones that are waiting to be executed. If the buffer is big enough, the entire loop can be held in it; this can reduce the branch penalty.

Branch prediction: Make a good guess as to which instruction will be executed next and start that one down the pipeline.
Static guesses: make the guess without considering the runtime history of the program (branch never taken; branch always taken; predict based on the opcode).
Dynamic guesses: track the history of conditional branches in the program.

Delayed branch: Minimize the branch penalty by finding valid instructions to execute in the pipeline while the branch address is being resolved. It is possible to improve performance by automatically rearranging instructions within a program, so that branch instructions occur later than actually desired. The compiler is tasked with reordering the instruction sequence to find enough independent instructions to feed into the pipeline after the branch that the branch penalty is reduced to zero.

Data hazards arise when the pipeline changes the order of reads and writes determined by the original source program. There are three types of data hazards that can occur:

Read After Write (RAW) hazards: A RAW data hazard is the most common type. It appears when the next instruction tries to read from a source before the previous instruction writes to it, so the next instruction gets the incorrect old value; for example, an operand is modified and read soon after. Because the first instruction may not have finished writing to the operand, the second instruction may use incorrect data.

Write After Read (WAR) hazards: A WAR hazard appears when the next instruction writes to a destination before the previous instruction reads it. In this case the previous instruction incorrectly gets a new value; for example, an instruction reads an operand and another instruction writes to that same operand soon after. Because the write may have finished before the read, the read instruction may incorrectly get the newly written value.

Write After Write (WAW) hazards: A WAW data hazard is a situation in which the next instruction tries to write to a destination before a previous instruction writes to it, resulting in changes made in the wrong order. Two instructions that write to the same operand are performed; the first one issued may finish second, and therefore leave the operand with an incorrect data value. So the result of a WAW hazard is an operand updated in the wrong order.
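The three hazard types can be detected mechanically from the register sets that each instruction reads and writes. A minimal sketch (the instruction encoding here is invented purely for illustration):

```python
def classify_hazards(first, second):
    """Classify data hazards between two instructions issued in program order.

    Each instruction is a dict with 'reads' and 'writes' register-name sets
    (a hypothetical encoding, just for illustration).
    """
    hazards = set()
    if first['writes'] & second['reads']:
        hazards.add('RAW')   # second reads what first has not yet written
    if first['reads'] & second['writes']:
        hazards.add('WAR')   # second overwrites what first still has to read
    if first['writes'] & second['writes']:
        hazards.add('WAW')   # both write the same destination
    return hazards

# add r3, r2, r1 followed by sub r5, r3, r4: sub reads r3 before add writes it.
i1 = {'reads': {'r1', 'r2'}, 'writes': {'r3'}}
i2 = {'reads': {'r3', 'r4'}, 'writes': {'r5'}}
print(classify_hazards(i1, i2))
```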
7. a) What are the major hurdles to achieving this ideal speed-up?
b) Discuss data hazards briefly.
c) Discuss briefly two approaches to handle branch hazards.
d) Consider a 4-stage pipeline that consists of Instruction Fetch (IF), Instruction Decode (D), Execute (Ex) and Write Back (WB) stages. The times taken by these stages are 50 ns, 50 ns, 110 ns and 80 ns respectively. Pipeline registers are required after every pipeline stage, and each of these pipeline registers consumes 10 ns delay. What is the speedup of the pipeline under ideal conditions compared to the corresponding non-pipelined implementation? [WBUT 2012]
Answer:
a) We define the speedup of a k-stage linear pipeline over n tasks as S = nk / (k + (n - 1)). It should be noted that the maximum speedup is S -> k for n >> k; in other words, the ideal speedup equals the number of stages. The major hurdles to achieving this ideal speedup are pipeline hazards:
Pipeline latency: instruction effects are not completed before the next operation begins.
Branch (control) hazards: these occur when the processor is told to branch, i.e., if a certain condition is true, it jumps from one part of the instruction stream to another, not necessarily to the next instruction in sequence. In such a case the processor cannot tell in advance whether it should process the next instruction (it may instead have to move to a distant instruction). To minimize the branch penalty, put in enough hardware so that we can test registers, calculate the branch target address, and update the PC during the second pipeline stage.
[Fig: Situation of control hazards]
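Part (d) can be worked out numerically. A small sketch (the formula S = nk/(k + (n - 1)) and the stage and register delays are taken from the question; the 64-task example is only illustrative):

```python
def pipeline_speedup(n, k):
    """Ideal speedup of a k-stage linear pipeline over n tasks: S = nk / (k + (n - 1))."""
    return (n * k) / (k + (n - 1))

# The pipeline clock must accommodate the slowest stage plus the register delay.
stage_times = [50, 50, 110, 80]            # ns, for IF, D, Ex, WB
register_delay = 10                        # ns per pipeline register
cycle = max(stage_times) + register_delay  # 120 ns pipelined clock period
non_pipelined = sum(stage_times)           # 290 ns per instruction without pipelining

# Under ideal conditions one result emerges every cycle, so the speedup
# approaches (non-pipelined time per instruction) / (pipelined clock period).
ideal_speedup = non_pipelined / cycle

print(round(pipeline_speedup(64, 4), 3))   # 64 tasks on a 4-stage pipeline
print(round(ideal_speedup, 3))
```

Note how pipeline_speedup approaches k = 4 as n grows, which is the "maximum speedup" remark above.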
Predication works when the conditionally executed blocks of code are short enough; typically, in order to claim predication, most or all of the instructions must have the ability to execute conditionally based on a predicate.

Delayed branch: A branch delay instruction is an instruction immediately following a conditional branch instruction which is executed whether or not the branch is taken. The branch delay slot is a side-effect of pipelined architectures, due to the branch hazard, i.e., the fact that the branch would not be resolved until the instruction had worked its way through the pipeline. A simple design would insert stalls into the pipeline after a branch instruction until the new branch target address is computed and loaded into the program counter. Each cycle where a stall is inserted is considered one branch delay slot. The delayed branch always executes the next sequential instruction, with the branch taking place after that one-instruction delay.

Very Long Instruction Word (VLIW) is one particular style of processor design that tries to achieve high levels of instruction-level parallelism by executing long instruction words composed of multiple operations. The long instruction word, called a MultiOp, consists of multiple arithmetic, logic and control operations, each of which would probably be an individual operation on a simple RISC processor. The VLIW processor concurrently executes the set of operations within a MultiOp, thereby achieving instruction-level parallelism. The remainder of this article discusses the technology, history, uses and future of such processors. We now describe Defoe, an example processor used in this section to give the reader a feel for VLIW architecture and programming. Though it does not exist in reality, its features are derived from those of several existing VLIW processors. The figure shows the architecture of the VLIW processor.
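The delay-slot behaviour described above can be traced with a toy instruction list (the instruction format is hypothetical): a taken branch transfers control only after the instruction in its delay slot has executed.

```python
def run_with_delay_slot(program):
    """Trace execution order for a toy ISA with one branch delay slot.

    program: list of ('op',) or ('branch', target_index) tuples.
    Every branch is assumed taken; the instruction after it (the delay
    slot) executes before control reaches the branch target.
    """
    trace, pc = [], 0
    while pc < len(program):
        instr = program[pc]
        trace.append(pc)
        if instr[0] == 'branch':
            if pc + 1 < len(program):   # execute the delay-slot instruction
                trace.append(pc + 1)
            pc = instr[1]               # then take the branch
        else:
            pc += 1
    return trace

# 0: op, 1: branch to 4, 2: delay slot (still executes), 3: skipped, 4: op
prog = [('op',), ('branch', 4), ('op',), ('op',), ('op',)]
print(run_with_delay_slot(prog))   # [0, 1, 2, 4]
```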
[Fig: Delayed branch: the instruction in the branch delay slot is executed before the branch takes effect]
If scheduling is done only after register allocation, then there will be false dependencies, introduced by the register allocation, that limit the amount of instruction motion possible for the scheduler. A second scheduling pass after register allocation will also improve the placement of spill/fill code.

Superscalar processor versus VLIW processor:
A superscalar architecture is one in which several instructions can be initiated simultaneously and executed independently. Superscalar architectures include all the features of pipelining but, in addition, there can be several instructions executing simultaneously in the same pipeline stage. Very complex hardware is needed for run-time detection of parallelism, power consumption can be very large, and the window of execution is limited.
VLIW architectures rely on compile-time detection of parallelism. The compiler analyses the program and detects operations to be executed in parallel; such operations are packed into one "large" instruction. After one such instruction has been fetched from memory, all the corresponding operations are issued in parallel. No hardware is needed for run-time detection of parallelism, and the window-of-execution problem is solved, since the compiler can potentially analyze the whole program in order to detect parallel operations.

a) Define the pipelining technique. Assume a 4-stage pipeline:
Fetch: read the instruction from the memory.
Decode: decode the instruction.
Execute: execute the instruction.
Write: store the result in the destination location.
Draw the space-time diagram for this pipeline. [WBUT 2013]
Answer:
Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. Each step in the pipeline (called a pipe stage) completes a part of an instruction.
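The space-time diagram asked for can also be generated programmatically; a minimal sketch for the Fetch/Decode/Execute/Write pipeline (stage names as in the question):

```python
def space_time(stages, n_instructions):
    """Return a space-time table: rows are stages, columns are clock cycles.

    Cell (s, t) holds the instruction number occupying stage s at cycle t
    (or None). Instruction i enters stage s at cycle i + s.
    """
    k = len(stages)
    cycles = k + n_instructions - 1     # total cycles to drain the pipeline
    table = [[None] * cycles for _ in range(k)]
    for i in range(n_instructions):
        for s in range(k):
            table[s][i + s] = i + 1
    return table

stages = ['Fetch', 'Decode', 'Execute', 'Write']
t = space_time(stages, 5)
for name, row in zip(stages, t):
    print(f"{name:8s}", ['I%d' % c if c else ' .' for c in row])
```

The diagonal bands in the printed table are exactly the overlap the definition above describes: 5 instructions finish in 4 + 5 - 1 = 8 cycles instead of 20.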
[Fig: Space-time diagram: instructions I1 to I5 flow through the IF, D, EX and WB stages in successive clock cycles]

d) [For 2500 instructions, of which 50% are data transfer instructions taking 2 cycles each, 30% are arithmetic instructions taking 5 cycles each, and 20% are branching-related instructions taking 10 cycles each:]
Number of data transfer instructions = 2500 × 50% = 1250, and the total number of cycles consumed = 1250 × 2 = 2500.
Number of arithmetic instructions = 2500 × 30% = 750, and the total number of cycles consumed = 750 × 5 = 3750.
Number of branching-related instructions = 2500 × 20% = 500, and the total number of cycles consumed = 500 × 10 = 5000.
Total clock cycles consumed for 2500 instructions = 2500 + 3750 + 5000 = 11250.
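The arithmetic above can be verified with a short script (instruction counts and per-class cycle counts as given):

```python
total_instructions = 2500
mix = [
    ('data transfer', 50, 2),   # 50% of instructions, 2 cycles each
    ('arithmetic',    30, 5),   # 30% of instructions, 5 cycles each
    ('branching',     20, 10),  # 20% of instructions, 10 cycles each
]

total_cycles = 0
for name, pct, cycles_each in mix:
    count = total_instructions * pct // 100
    cycles = count * cycles_each
    print(f"{name:13s}: {count} instructions x {cycles_each} cycles = {cycles}")
    total_cycles += cycles

print("total cycles:", total_cycles)
```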
[Execute stage:] The CPU sends the address, which is held by the register file. The ALU performs the operation specified by the opcode on the first value from the register file; if it is a Register-Immediate instruction, the second operand is the sign-extended immediate.
[Write-back stage:] For a Register-Register ALU instruction or a Load instruction, the result is written into the register file.

Pipeline instruction scheduling may be done either before or after register allocation, or both before and after it. The advantage of doing it before register allocation is that this results in maximum parallelism. The disadvantage of doing it before register allocation is that the register allocator may then need to use a number of registers exceeding those available. This will cause spill/fill code to be introduced, which will reduce the performance of the section of code in question. If the architecture being scheduled has instruction sequences with potentially illegal combinations (due to a lack of interlocks), the instructions must be scheduled to avoid such combinations.
a) What do you mean by multiple issue processors? What is arithmetic and instruction pipeline?
b) Briefly describe the VLIW processor architecture. What are the limitations of VLIW? [WBUT 2014]
Answer:
Refer to Question No. 8(a) & (b) of Long Answer Type Questions.

a) What is meant by pipeline hazard? Briefly discuss different pipeline hazards.
b) Consider the following reservation table. List the set of forbidden latencies and the collision vector. Draw the state transition diagram. List all simple cycles from the state diagram. Identify the greedy cycles among the simple cycles. Find out the minimum average latency (MAL). Find out the maximum throughput of this pipeline if the clock rate is 25 MHz.
c) What are the bounds on MAL? [WBUT 2014]
Answer:
a) Refer to Question No. 12 of Short Answer Type Questions.

a) What do you mean by job collision in a pipeline processor? Show how collisions occur in the following static pipeline.
[Reservation table: stages S0 and S1 over time steps 0-3]
c) Consider the execution of a program of 20,000 instructions by a linear pipeline processor with a clock rate of 40 MHz. Assume that the instruction pipeline has five stages and that one instruction is issued per clock cycle. The penalties due to branch instructions and out-of-order executions are ignored. Calculate the speedup of the pipeline over its equivalent non-pipeline processor, the efficiency and the throughput. [WBUT 2015]
Answer:
a) Refer to Question No. 17(a) of Short Answer Type Questions.
b) The number of cycles between initiations is called the latency. The first step is to identify the forbidden latencies revealed by the reservation table. A latency is forbidden if it will lead to a collision. The table shows that stage S0 is required during both the first and last cycles, so we cannot initiate a new task after only one cycle, or there would be an immediate collision. There is a systematic way to identify all forbidden latencies from a given reservation table: for each row containing more than one X, write down the distance between every pair of X's; each such distance is a forbidden latency. These latencies form the collision vector (some authors use the opposite bit order). For the given table:
Forbidden latencies: 0, 2.
Pipeline collision vector: (1010).
State diagram: State 1 (1010) reaches State 2 (1110) after 1 cycle, and returns to State 1 (1010) after 3, or after 4 or more, cycles; State 2 (1110) reaches State 1 (1010) after 3, or after 4 or more, cycles. There are 2 states in the state diagram; State 1 represents 1010.
Greedy cycle: (1, 3), so the minimum average latency MAL = (1 + 3)/2 = 2.
So, the maximum throughput = 1 / (MAL × clock period).
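The pair-distance procedure for finding forbidden latencies, and the collision vector built from them, can be sketched as follows (the reservation table used here is a made-up example, not the one printed in the question):

```python
def forbidden_latencies(table):
    """table: list of rows; each row is the set of time steps where that stage is busy."""
    forbidden = set()
    for row in table:
        times = sorted(row)
        for i in range(len(times)):
            for j in range(i + 1, len(times)):
                forbidden.add(times[j] - times[i])  # distance between a pair of X's
    return forbidden

def collision_vector(forbidden):
    """Bit string C_m ... C_1, where C_i = 1 iff latency i is forbidden."""
    m = max(forbidden)
    return ''.join('1' if i in forbidden else '0' for i in range(m, 0, -1))

# Hypothetical table: S0 busy at t0 and t3, S1 at t1, S2 at t2 and t4.
table = [{0, 3}, {1}, {2, 4}]
f = forbidden_latencies(table)
print(f, collision_vector(f))
```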
14. Consider the following pipeline reservation table. [WBUT 2016]
[Reservation table: stages S1, S2, S3 over time steps 1-6]
a) What are the forbidden latencies?
b) Draw the state transition diagram.
c) List all the simple cycles and greedy cycles.
d) Determine the optimal constant latency cycle and the minimal average latency.
e) Let the pipeline clock period be τ = 20 ns. Determine the throughput of the pipeline.
Answer:
a) The forbidden latencies for the given data are 0, 2, 4, 5.
b) The state diagram for the given data: State 1 (1010110) reaches State 2 (1111111) after 1 cycle, reaches State 3 (1101110) after 6 cycles, and returns to State 1 (1010110) after 8 or more cycles. There are 3 states in the state diagram; State 1 represents 1010110.
d) The minimum average latency (MAL) = 4.5.
e) Throughput w = (1/τ) × (1/MAL) = (1/20 ns) × (1/4.5), i.e., about 11.1 million operations per second.

15. [The question gives a 2 GHz processor whose CPI is the mix 0.4×1 + 0.2×4 + 0.3×2 + 0.1×3; the 4-cycle class is branches and the 3-cycle class is stores.] Possible improvements:
1. Branch CPI can be decreased from 4 to 3.
2. Increase the clock frequency from 2 to 2.3 GHz.
3. Store CPI can be decreased from 3 to 2.
With the use of Amdahl's law, conclude which among the proposed improvements is the best. [WBUT 2016]
Answer:
To improve the performance, we have to calculate the CPI for all cases. In the given problem, CPI = 0.4×1 + 0.2×4 + 0.3×2 + 0.1×3 = 2.1, and the clock is 2 GHz = 2000 MHz. Thus MIPS = clock rate / CPI = 2000/2.1 ≈ 952.
In case 1, the branch CPI is decreased from 4 to 3: CPI = 0.4×1 + 0.2×3 + 0.3×2 + 0.1×3 = 1.9. If the clock is 2 GHz, MIPS = 2000/1.9 ≈ 1053.
In case 2, the clock frequency is increased from 2 to 2.3 GHz. When the clock is 2.3 GHz and CPI stays 2.1, MIPS = 2300/2.1 ≈ 1095.

16. a) Consider the following block diagram of a circuit. Form the table.
b) What is the minimum asynchronous time for any single instruction to complete?
c) We want to set this up as a pipelined operation. How many stages should we have, and at what rate should we clock the pipeline? [WBUT 2016]
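The three proposed improvements can be compared directly, since performance is proportional to clock rate divided by CPI. A sketch using the given mix (the 4-cycle class is branches and the 3-cycle class is stores, as in the question; the other class labels do not matter here):

```python
def cpi(mix):
    """mix: list of (fraction of instructions, cycles per instruction) pairs."""
    return sum(f * c for f, c in mix)

def perf(clock_ghz, mix):
    """Relative performance: instructions per second is proportional to clock / CPI."""
    return clock_ghz / cpi(mix)

base  = [(0.4, 1), (0.2, 4), (0.3, 2), (0.1, 3)]
case1 = [(0.4, 1), (0.2, 3), (0.3, 2), (0.1, 3)]   # branch CPI 4 -> 3
case3 = [(0.4, 1), (0.2, 4), (0.3, 2), (0.1, 2)]   # store CPI 3 -> 2

baseline = perf(2.0, base)
speedups = {
    'branch 4->3':      perf(2.0, case1) / baseline,
    'clock 2->2.3 GHz': perf(2.3, base) / baseline,
    'store 3->2':       perf(2.0, case3) / baseline,
}
for name, s in sorted(speedups.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} speedup {s:.3f}")
```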
[Fig: Block diagram of the circuit: Units 0 to 4, with the time taken by each unit]
Answer:
a) [The table is formed from the unit delays in the block diagram; for example, t(store results) = 10 ns.]
b) The minimum asynchronous time for any single instruction to complete = (30 + 9 + 20 + 10) ns = 69 ns.
c) To set this up as a pipelined operation, we need to construct a minimum of five stages in the pipeline: instruction fetch, instruction decode, operand read, instruction execution and store results. We should determine which stage requires the longest execution time, and that amount of time has to be set as the clock period for all the stages of the pipeline.

17. Write short notes on the following:
a) Pipeline hazards [WBUT 2005, 2011, 2014]
b) Reservation Table [WBUT 2008]
c) Branch handling in instruction pipeline [WBUT 2011]
d) Amdahl's law and its significance [WBUT 2012, 2014]
Answer:
a) Pipeline hazards: Hazards arise from dependences, determined by the original source program, that must be preserved by the pipeline. There are three types of data hazards that can occur:
Read After Write (RAW) hazards: A RAW data hazard is the most common type. It appears when the next instruction tries to read from a source before the previous instruction writes to it, so the next instruction gets the incorrect old value; for example, an operand is modified and read soon after. Because the first instruction may not have finished writing to the operand, the second instruction may use incorrect data.
Write After Read (WAR) hazards: A WAR hazard appears when the next instruction writes to a destination before the previous instruction reads it. In this case the previous instruction incorrectly gets a new value; for example, an instruction reads an operand and another instruction writes to that same operand soon after. Because the write may have finished before the read, the read instruction may incorrectly get the newly written value.
Write After Write (WAW) hazards: A WAW data hazard is a situation in which the next instruction tries to write to a destination before a previous instruction writes to it, resulting in changes made in the wrong order. Two instructions that write to the same operand are performed; the first one issued may finish second, and therefore leave the operand with an incorrect data value.
Pipeline latency: instruction effects are not completed before the next operation begins.
b) Reservation Table: Structural hazards raise the possibility of a collision in the pipeline. A collision is an attempt to use the same stage of the pipeline for two or more operations at the same time. If two or more sets of inputs arrive at a stage simultaneously, at the very least the pipeline will compute erroneous results for at least one set of inputs. Depending on the details of physical construction, the different stages could even be short-circuited to a common input, with possible damage to the circuitry. Thus collisions are to be avoided at all costs when a pipeline is in use.
How can we determine when collisions might occur in a pipeline? One graphical tool we can use to analyze pipeline operation is called a reservation table. A reservation table is just a chart with rows representing pipeline stages and columns representing time steps (clock cycles). Marks are placed in the cells of the table to indicate which stages of the pipeline are in use at which time steps while a given computation is being performed. The simplest reservation table is one for a static, linear pipeline. The table below is an example of a reservation table for a five-stage static linear pipeline.
[Table: reservation table for a five-stage static linear pipeline, with one mark per column along the diagonal]
Notice that all the marks in this simple reservation table lie in a diagonal line. This is because each stage is used once and only once, in numerical order, in performing each computation. Even if we permuted the order of the stages, there would still be only one mark in each column.

c) Branch handling in instruction pipeline: In a pipelined processor the branch instructions are those that tell the processor to make a decision about what the next instruction to be executed should be, based on the results of another instruction. Branch instructions can be troublesome in a pipeline if a branch is conditional on the results of an instruction which has not yet finished its path through the pipeline. For example:

    Loop: add $r3, $r2, $r1
          sub $r6, $r5, $r4
          beq $r3, $r6, Loop

The example above instructs the processor to add r1 and r2 and put the result in r3, then subtract r4 from r5, storing the difference in r6. In the third instruction, beq stands for "branch if equal": if the contents of r3 and r6 are equal, the processor should execute the instruction labeled "Loop"; otherwise, it should continue to the next instruction. In this example, the processor cannot make a decision about which branch to take, because neither the value of r3 nor that of r6 has been written into the registers yet. The processor could stall, but a more sophisticated method of dealing with branch instructions is branch prediction: the processor makes a guess about which path to take -
d) Amdahl's law and its significance: If P is the proportion of a program that can be parallelized and (1 - P) the proportion that cannot (i.e., remains strictly serial), then Amdahl's law states that the maximum speedup of the parallelized version is

    Speedup = 1 / ((1 - P) + P/S)

where S is the speedup of the parallelized portion.
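The statement can be checked numerically; a minimal sketch:

```python
def amdahl(p, s):
    """Amdahl's law: overall speedup when a fraction p of the work is sped up by s."""
    return 1.0 / ((1.0 - p) + p / s)

# With 95% of the program parallelizable, the speedup saturates near 1/(1-p) = 20,
# no matter how large s becomes.
for s in (2, 4, 16, 256, 1_000_000):
    print(f"s = {s:>9}: speedup = {amdahl(0.95, s):.2f}")
```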
To see how this formula was derived, assume that the running time of the old computation was 1, for some unit of time. The running time of the new computation will be the length of time the unimproved fraction takes (which is 1 - P), plus the length of time the improved fraction takes. The length of time for the improved part of the computation is the length of the improved part's former running time divided by the speedup, making the length of time of the improved part P/S. The final speedup is computed by dividing the old running time by the new running time, which is what the above formula does.

VECTOR PROCESSOR

Multiple Choice Type Questions

1. Which of the following types of instructions are useful in handling sparse vectors or sparse matrices often encountered in practical vector processing applications? [WBUT 2007]
a) Vector-scalar instructions
b) Masking instructions
c) Vector-memory instructions
d) None of these
Answer: (b)
2. The vector stride value is required [WBUT 2009, 2011]
a) to deal with the length of vectors
b) to find the parallelism in vectors
c) to access the elements in multi-dimensional vectors
d) to execute vector instructions
Answer: (c)

3. The basic difference between Vector and Array processors is [WBUT 2010, 2014]
a) pipelining b) interconnection network c) registers d) none of these
Answer: (a)

5. Array processing is present in [WBUT 2013]
a) MIMD b) MISD c) SISD d) SIMD
Answer: (d)

6. The vector stride value is required [WBUT 2015]
a) to deal with the length of vectors
b) to find the parallelism in vectors
c) to access the elements in multi-dimensional vectors
d) none of these
Answer: (c)
7. The task of a vectorizing compiler is [WBUT 2015]
a) to deal with the length of vectors
b) to process multi-dimensional vectors
c) to execute vector instructions
d) to convert sequential/scalar instructions into vector instructions
Answer: (d)
8. Array processors perform computations to exploit [WBUT 2015]
a) temporal parallelism b) spatial parallelism
c) sequential behavior of programs d) modularity of programs
Answer: (b)

Short Answer Type Questions

1. How do you speed up memory access in case of vector processing? [WBUT 2005, 2007]
Answer:
Let r be the vector speed ratio and f be the vectorization ratio. For example, if the time it takes to add a vector of 64 integers using the scalar unit is 10 times the time it takes using the vector unit, then r = 10. Moreover, if the total number of operations in a program is 100 and only 10 of these are scalar (after vectorization), then f = 90% (i.e., 90% of the work is done by the vector unit). It follows that the achievable speedup is

    Speedup = 1 / ((1 - f) + f/r),  with f expressed as a fraction.

So even if the performance of the vector unit is extremely high (r -> infinity), we get a speedup less than 1/(1 - f), which suggests that the ratio f is crucial to performance, since it poses a limit on the attainable speedup. This ratio depends on the efficiency of the compilation.
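The speedup expression can be evaluated to see the 1/(1 - f) ceiling (f written as a fraction):

```python
def vector_speedup(f, r):
    """Overall speedup when a fraction f of the work runs r times faster.

    The execution time drops from 1 to (1 - f) + f/r,
    so the speedup is 1 / ((1 - f) + f/r).
    """
    return 1.0 / ((1.0 - f) + f / r)

# 90% of the work vectorized, vector unit 10x faster than the scalar unit:
print(vector_speedup(0.9, 10))    # well below r = 10
print(vector_speedup(0.9, 1e9))   # approaches the ceiling 1/(1 - f) = 10
```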
Vector instructions that access memory have a known access pattern. If the vector's elements are all adjacent, then fetching the vector from a set of heavily interleaved memory banks works very well. The high latency of initiating a main memory access (versus accessing a cache) is amortized, because a single access is initiated for the whole vector.

Chaining: results from one pipeline unit are directly fed into the operand registers of another functional pipe. In other words, intermediate results do not have to be restored into memory and can be used even before the vector operation is completed. Chaining permits successive operations to be issued to the desired functional pipes as soon as the first result becomes available as an operand.

2. Define the various types of vector instructions. [WBUT 2010, 2014]
Answer:
There are different types of vector instructions, classified on the basis of their mathematical mapping, as given below.
Vector-vector instructions: From different vector registers, one or more vector operands enter a functional pipeline unit and the result is sent to another vector register. This type of vector operation is called a vector-vector instruction, as shown in the figure below, where Va, Vb, Vc are different vector registers; it can be defined by the following two mapping functions f1 and f2:
    f1: Va -> Vb and f2: Vb × Vc -> Va
[Fig: Vector-vector instruction]
Vector-scalar instructions: In vector-scalar instructions the input operands of the functional unit come from a scalar register and a vector register, and a vector output is produced, as shown in the figure below, where Va, Vb are vector registers and Sa is a scalar register. It can be defined by the mapping function
    f3: Sa × Va -> Vb
[Fig: Vector-scalar instruction]
Vector-memory instructions: the source or destination of one operand is memory, as in a vector load (f: M -> V) or a vector store (f: V -> M).
[Fig: Vector-memory instructions]

A vector processor includes a set of vector registers for storing data to be used in the execution of instructions, and a vector functional unit, coupled to the vector registers, for executing instructions. The functional unit executes instructions using operation codes provided to it, which include a field referencing a special register. The special register contains information about the length and starting point for each vector instruction. A series of new instructions enabling rapid handling of image pixel data are provided.

4. Discuss strip mining and vector stride in vector instructions. [WBUT 2008, 2012]
Answer:
Vector stride: The stride is the distance separating elements in memory that will be adjacent in a vector register. The value of the stride can be different for different variables. When a vector is loaded into a vector register with stride 1, all the elements of the vector are adjacent. Non-unit strides can cause major problems for the memory system, which is based on unit stride (i.e., all the elements lie one after another in different interleaved memory banks). Caches deal with unit stride, and behave badly for non-unit stride. To account for non-unit stride, most systems have a stride register that the memory system can use for loading elements of a vector register. However, the memory interleaving may not support rapid loading. The vector stride technique is used when the elements of vectors are not adjacent.
Strip mining: When a vector has a length greater than that of the vector registers, segmentation of the long vector into fixed-length segments is necessary. This technique is called strip-mining. One vector segment is processed at a time (for example, the vector segment length is 64 elements in Cray computers). Until all the vector elements in each segment are processed, the vector register cannot be assigned to another vector operation. Strip-mining is restricted by the number of available vector registers and by vector chaining.

5. What is a vector processor? Give the block diagram to indicate the architecture of a typical vector processor with multiple function pipes. [WBUT 2008, 2010 - short note]
Answer:
Vector processors are specialized, heavily pipelined processors that perform efficient operations on entire vectors.
[Fig: The architecture of a vector processor with multiple function pipes: scalar processor, vector processor and processing pipes]

6. Explain the concept of strip mining used in vector processors. Why do vector processors use memory banks? [WBUT 2009]
Answer:
When a vector has a length greater than that of the vector registers, segmentation of the long vector into fixed-length segments is necessary. This technique is called strip-mining. One segment is processed at a time; as an example, the vector segment length is 64 elements in Cray computers. Until all the vector elements in each segment are processed, the vector register cannot be assigned to another vector operation. Strip-mining is restricted by the number of available vector registers and by vector chaining.
To allow fast access to vector elements stored in memory, the memory of a vector processor is divided into memory banks. Interleaved memory banks associate many addresses with each bank, allowing several banks to be accessed in parallel.
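Strip-mining as described above can be sketched directly: a long vector operation is split into pieces no larger than the maximum vector length (64 here, matching the Cray example):

```python
MVL = 64  # maximum vector length of the machine's vector registers

def strip_mined_add(a, b):
    """Add two long vectors one MVL-sized strip at a time."""
    n = len(a)
    result = []
    start = 0
    while start < n:
        length = min(MVL, n - start)   # the final strip may be shorter than MVL
        # this inner loop stands in for one hardware vector instruction
        result.extend(a[start + i] + b[start + i] for i in range(length))
        start += length
    return result

a = list(range(200))
b = list(range(200, 400))
out = strip_mined_add(a, b)   # three full strips of 64 plus one strip of 8
```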
7. What is a vector array processor? Explain with an example. [WBUT 2009]
Answer:
A vector processor is a processor that can operate on entire vectors with one instruction, i.e., the operands of some instructions specify complete vectors. For example, consider the add instruction C = A + B. In both scalar and vector machines this means "add the contents of A to the contents of B and put the sum in C." In a scalar machine the operands are numbers, but in vector processors the operands are vectors, and the instruction directs the machine to compute the pair-wise sum of each pair of vector elements. A vector processor register, usually called the vector length register, tells the processor how many individual additions to perform when it adds the vectors. A key division of vector processors arises from the way the instructions access their operands: in the memory-to-memory organization the operands come directly from memory, whereas in the vector-register organization the operations work on the vector registers. All major vector computers use a vector-register architecture.

There are two primary types of vector operations [WBUT 2005, 2013]:
Vector-register operations
Memory-memory vector operations
In vector-register operations, all vector operations except load and store are among the vector registers.

Reduction instructions: f7: Vi × Vj -> s (e.g., a reduction such as a dot product).
Gather and scatter instructions: f8: M × Va -> Vb (e.g., gather); f9: Va × Vb -> M (e.g., scatter).
Masking instructions: fa: Va × Vm -> Vb (e.g., MMOVE V1, V2, V3).
Gather and scatter are used to process sparse matrices/vectors. The gather operation uses a base address and a set of indices to access from memory a "few" of the elements of a large vector into one of the vector registers. The scatter operation does the opposite. The masking operations allow conditional execution of an instruction based on a "masking" vector.
When first loaded, the model contains in memory a program which reverses the order of the values in locations 0 and 2 of each of the Processing Elements' memories and leaves the results in locations 1 and 3.

The distance separating the elements that are to be gathered into a single register is called the stride: successive data references are i words (or double words) apart. Once a vector is gathered into a register, it has logically adjacent elements. Strides can be greater than one. When multiple accesses contend for a bank, a memory bank conflict occurs and one access must be stalled. A bank conflict, and hence a stall, will occur if

    LCM(Stride, Number of banks) / Stride < Bank busy time

When a vector has a length greater than that of the vector registers, segmentation of the long vector into fixed-length segments is necessary. This technique is called strip-mining. One vector segment is processed at a time; as an example, the vector segment length is 64 elements in Cray computers. Until all the vector elements in each segment are processed, the vector register cannot be assigned to another vector operation. Strip-mining is restricted by the number of available vector registers and by vector chaining.
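The bank-conflict condition above can be checked numerically. The interval between visits to the same bank is banks / gcd(stride, banks), which equals lcm(stride, banks) / stride; a sketch:

```python
from math import gcd

def bank_stall(stride, banks, busy_time):
    """Return True if this stride causes a bank conflict (stall).

    A bank is revisited every banks // gcd(stride, banks) element accesses
    (the same quantity as lcm(stride, banks) // stride); a stall occurs if
    that interval is shorter than the bank busy time.
    """
    revisit = banks // gcd(stride, banks)
    return revisit < busy_time

# 8 banks, each busy for 6 cycles per access:
print(bank_stall(1, 8, 6))    # unit stride: a bank is revisited only every 8 accesses
print(bank_stall(32, 8, 6))   # stride 32 hits the same bank on every access
```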
b) The SIMD-1 Array Processor consists of a Memory, an Array Control Unit (ACU) and a one-dimensional SIMD array of simple processing elements (PEs). The figures show a 4-processor array and the initial image seen when the model is loaded. The ACU is a simple load/store, register-register arithmetic processor. It has 16 general-purpose registers, a Program Counter (PC), a Condition Code Register (CC) and an Instruction Register (ACU-IR). The Program Counter has two fields, label and offset; the label field is initially set to "main" and the offset to zero. The ACU also uses two other registers, the Processing Element Instruction Register (PE-IR) and the Processing Element Control register (PEC), which are global registers used to communicate with the SIMD array. The Processing Elements operate in lock step, i.e., each active PE (determined by the state of its PEC bit) obeys the same instruction at the same time. Whenever a PE's ACC is updated by a PE instruction, the PE sends the new ACC value to each of its neighbors.

3. a) How do vector processors improve the speed of instruction execution over scalar processors? Illustrate with an example.
b) What is a vectorizing compiler? Why do we need it in a vector processor? [WBUT 2015]
Answer:
a) Many performance optimization schemes are used in vector processors. Memory banks are used to reduce load/store latency. Strip mining is used to generate code for vector operands whose size is less than or greater than the size of the vector registers. Vector chaining, the equivalent of forwarding in vector processors, is used in case of dependency among vector instructions. Special scatter and gather instructions are provided to operate efficiently on sparse matrices.
b) A vectorizing compiler must be developed to detect the concurrency among vector instructions, which can be realized with pipelining or with the chaining of pipelines. A vectorizing compiler would regenerate the parallelism lost in the use of sequential languages. The following four stages have been recognized in the development of parallelism in advanced programming; the parameter in parentheses indicates the degree of parallelism explorable at each stage:
Parallelism in the algorithm (A)
Parallel language / high-level code (L)
Efficient object code (O)
Target machine code (M)
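The strip mining technique mentioned in answer a) can be sketched in Python. The 64-element maximum vector length below is an assumption, mirroring the 64-element Cray-style vector registers mentioned later in this chapter, and the function name is illustrative only.

```python
# Strip mining: a loop over an arbitrary-length vector is split into
# strips no longer than the maximum vector length (MVL) of the machine,
# so that each strip fits in one vector register.
MVL = 64  # assumed maximum vector length

def strip_mined_add(a, b):
    """Element-wise add of two equal-length vectors, one strip at a time."""
    n = len(a)
    result = [0] * n
    start = 0
    while start < n:
        length = min(MVL, n - start)  # the odd-sized leftover strip
        # One loop iteration models a single vector instruction
        # operating on 'length' elements.
        for i in range(start, start + length):
            result[i] = a[i] + b[i]
        start += length
    return result
```

A 130-element vector, for example, is handled as two full 64-element strips plus one 2-element strip.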
The degree of parallelism refers to the number of independent operations that can be performed simultaneously. In the ideal situation with well-developed parallel user languages, we should expect A > L > O > M, as in the figure below.

Fig: The ideal case of using a parallel algorithm

At present, any parallelism in an algorithm is lost when it is expressed in a sequential high-level language. In order to promote parallel processing in machine hardware, an intelligent compiler is needed to regenerate the parallelism through vectorization, as shown in the figure below.

Fig: The case of using a vectorizing compiler and a sequential language

The process of replacing a block of sequential code by vector instructions is called vectorization, and the system software which does this regeneration of parallelism is called a vectorizing compiler.

4. Write short notes on the following:
a) Scalar and vector processors [WBUT 2006, 2007]
b) Memory to memory vector architecture [WBUT 2010]
c) Vectorizing compilers [WBUT 2010]
d) Vector registers architectures [WBUT 2010]
e) Vector stride [WBUT 2011]
OR,
Differentiate between an array processor and a vector processor. [WBUT 2012, 2014]
Answer:
a) Scalar and vector processors:
A vector processor is a CPU design that is able to run mathematical operations on multiple data elements simultaneously. This is in contrast to a scalar processor, which handles one element at a time. A vector processor is a computer with built-in instructions that perform multiple calculations on vectors (one-dimensional arrays) simultaneously. It is used to solve the same or similar problems as an array processor; however, a vector processor passes a whole vector to a functional unit, whereas an array processor passes each element of a vector to a different arithmetic unit.
Vector processors are based on a single-instruction, multiple-data architecture that is distinctly different from the SIMD extensions of scalar/superscalar processors. Each vector data path has some data independence from the others, allowing data-path-dependent operations. This allows easier control for wider machines. Single-chip vector processors can still be low power and easy to program even with eight parallel vector units. For many communications algorithms, characterized by high data parallelism, vector single-instruction machines end up being the ideal balance of instruction/programming simplicity and compactness, while still supporting complex processing requirements and high performance.
A vector processor for executing vector instructions comprises a plurality of vector registers and a plurality of pipelined arithmetic logic units. The vector registers are constructed with a circuit which operates at a speed equal to 2n times the processing speed of the pipelined arithmetic logic units. Either the read or the write operation proceeds in successive, adjacent time intervals: each unit provides an output data item in the time interval in which the unit performs the operation on a processed data item. The special function unit provides the output data item of a selected scalar unit in the time interval in which that scalar unit performs the operation, so as to avoid a conflict among the scalar units. A vector processing unit thus includes an input data buffer, the scalar units, and an output orthogonal converter.

b) Memory to memory vector architecture:
To maintain a rate of one word fetched or stored per clock, the memory system must be capable of producing or accepting this data. This is usually done by creating multiple memory banks; a significant number of banks is useful for dealing with vector loads or stores that access rows or columns of data.
In vector-register machines, vectors have a relatively short length — 64 in the case of the Cray machines. Thus these machines are much more efficient for operations involving short vectors; for long vector operations the vector registers must be loaded with each segment before the operation can continue. Vector load or store operations move data between the vector registers and memory, and can also be defined by two mapping functions f1 and f2.

d) Vector registers architectures:

Fig: The basic structure of a vector-register architecture, showing main memory, vector registers, scalar registers, and pipelined functional units (FP add/subtract, FP divide, logical)

Each vector register is a fixed-length bank holding a single vector. Each vector register must have at least two read ports and one write port. This allows a high degree of overlap among vector operations to different vector registers. We can compute the time for a single vector instruction depending on the vector length and the initiation rate, which is the rate at which a vector unit consumes new operands and produces new results. All modern supercomputers have vector functional units with multiple parallel pipelines that can produce two or more results per clock cycle. The read and write ports, which total at least 16 read ports and 8 write ports, are connected to the functional unit inputs or outputs by a pair of crossbars.

c) Vectorizing compilers: Refer to Question No. 4 of Short Answer Type Questions.
e) Vector stride: Refer to the earlier question on vector stride.
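Since the short note on vector stride points elsewhere, a small sketch may help: the stride is the distance in memory between successive elements of a vector operand, so fetching a column of a row-major matrix needs a stride equal to the row length. The function name below is illustrative, not taken from any particular machine.

```python
# Vector stride: the distance (in elements) between consecutive
# elements of a vector operand in linear memory.
def load_with_stride(memory, base, stride, length):
    """Gather 'length' elements starting at 'base', spaced 'stride' apart."""
    return [memory[base + i * stride] for i in range(length)]

# A 4x4 matrix stored row-major in linear memory: element (r, c) = 10r + c.
matrix = [10 * r + c for r in range(4) for c in range(4)]

row1 = load_with_stride(matrix, base=4, stride=1, length=4)  # unit stride
col1 = load_with_stride(matrix, base=1, stride=4, length=4)  # stride = row length
```

Unit stride reads a row; a stride of 4 (the row length) reads a column.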
Differences between a vector processor and an array processor:
A central processing unit (CPU) is a computer chip that handles most of the information and functions processed through a computer. A vector processor processes multiple pieces of data through pipelines. An array processor uses a number of processing elements operating in parallel and requires a host processor. An array processor is a SIMD-type synchronous parallel processor (with a control processor) containing multiple ALUs. Each ALU contains a local memory. The ALU together with the local memory is called a processing element (PE). The PEs are synchronized to perform the same operation simultaneously. The host processor is a scalar processor. The instructions are fetched and decoded by the control processor. The vector instructions are sent to the PEs for distributed execution over different elements of the vector operand. These vector elements are contained in the local memories. The PEs are passive devices without instruction decoding capabilities. Vector and array processing technology is not usually used in home or office computers. This technology is most often seen in high-traffic servers — racks of storage drives designed to house, and allow access to, information from several different users at different computers located on a computer network. Scalar processing technology operates on different principles than vector and array processing technology and is the most common type of processing hardware used in the average computer. A superscalar processor is a processor that operates like a scalar processor, but it has many different units within the CPU which each handle and process data simultaneously. The higher-performance superscalar processor type is also equipped with logic that efficiently assigns data processing to the available scalar units within the CPU. Most modern home computer processors are superscalar.

Multiple Choice Type Questions

1. Advantage of MMX technology lies in [WBUT 2010]
a) Multimedia application b) VGA c) CGA d) none of these
Answer: (a)

2. Array Processor is present in [WBUT 2010]
a) SIMD b) MISD c) MIMD d) none of these
Answer: (a)

3. Which one of the following has no practical usage? [WBUT 2010, 2014, 2016]
a) SISD b) SIMD c) MISD d) MIMD
Answer: (c)

4. The expression for Amdahl's law is [WBUT 2011, 2016]
a) S(n) = 1/f where n -> 0
b) S(n) = f where n -> infinity
c) S(n) = 1/f where n -> infinity
d) None of these
Answer: (c)

5. Which MIMD systems are best according to scalability with respect to the number of processors? [WBUT 2011]
a) Distributed memory computers b) ccNUMA systems
c) nccNUMA systems d) Symmetric multiprocessors
Answer: (a)

6. Superscalar processors have CPI of [WBUT 2011]
a) less than 1 b) greater than 1 c) more than 2 d) greater than 3
Answer: (a)

7. The main memory of a computer has 2cm blocks while the cache has 2c blocks. If the cache uses the set associative mapping scheme with 2 blocks per set, then block k of the main memory maps to the set [WBUT 2011, 2016]
a) (k mod 2c) of the cache b) (k mod c) of the cache
c) (k mod m) of the cache d) (k mod 2m) of the cache
Answer: (b)

8. The vector stride value is required to [WBUT 2011]
a) deal with the parallelism in vectors
b) execute the vector instruction
c) access the elements in multi-dimensional vectors
d) none of these
Answer: (c)
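The answer to MCQ 4 follows from Amdahl's law: if a fraction f of a program is inherently serial, the speedup on n processors is S(n) = 1/(f + (1 - f)/n), which tends to 1/f as n tends to infinity. A quick numerical check:

```python
# Amdahl's law: speedup with serial fraction f on n processors.
def amdahl_speedup(f, n):
    return 1.0 / (f + (1.0 - f) / n)

# With f = 0.1 the speedup approaches the 1/f = 10x bound as n grows,
# matching option (c): S(n) -> 1/f as n -> infinity.
nearly_limit = amdahl_speedup(0.1, 10**6)
```

Even a million processors cannot beat the 1/f bound set by the serial fraction.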
9. As the bus in a multiprocessor is a shared resource, there must be some mechanism to resolve the conflict. Which algorithm mentioned below is not a conflict resolution technique? [WBUT 2016]
a) static priority algorithm b) FIFO algorithm
c) LRU algorithm d) Daisy Chaining algorithm
Answer: (a)

Short Answer Type Questions

1. Describe Flynn's classification of computer architecture. [WBUT 2006, 2007, 2009, 2010, 2012]
OR,
Explain in brief with neat diagrams the Flynn's classifications of computers. [WBUT 2013]
OR,
Explain Flynn's classification. [WBUT 2016]
Answer:
The four classifications defined by Flynn are based upon the number of concurrent instruction (or control) and data streams available in the architecture.

Single Instruction, Single Data stream (SISD): A sequential computer which exploits no parallelism in either the instruction or the data stream. Examples of SISD architecture are the traditional uni-processor machines like a PC or old mainframes.

Fig: SISD Architecture (control unit, instruction memory, instruction stream, data memory, data stream)

Single Instruction, Multiple Data streams (SIMD): A single instruction stream operates on multiple data streams, as in an array processor.

Multiple Instructions, Single Data stream (MISD): Multiple instructions operate on a single data stream. This is an uncommon architecture, generally used for fault tolerance: heterogeneous systems operate on the same data stream and must agree on the result. Examples include the Space Shuttle flight control computer.

Fig: MISD Architecture

Multiple Instructions, Multiple Data streams (MIMD): Multiple autonomous processors simultaneously execute different instructions on different data. Distributed systems are generally recognized to be MIMD architectures, exploiting either a single shared memory space or a distributed memory space.

Fig: MIMD Architecture
2. Implement the data routing logic of SIMD architecture to compute
S(k) = A0 + A1 + ... + Ak for k = 0, 1, 2, ..., N-1. [WBUT 2008]
OR,
Why do we need a masking mechanism in SIMD array processors? In an SIMD array processor of 8 PEs, the sum S(k) of the first k components of the vector A = (A0, A1, ..., A7) is desired for each k from 0 to 7, i.e. S(k) = A0 + ... + Ak for k = 0 to 7. [WBUT 2015]
Answer:
A masking mechanism allows an SIMD machine to carry out, within a single instruction, a plurality of machine operations. To accomplish this, each different machine operation within the instruction includes a number of masking bits which address a specific location in a mask register. The mask register includes a mask bit bank. The mask location selected within the mask register is bit-wise ANDed with a mask context bit in order to establish whether the processing element will be enabled or disabled for a particular conditional sub-routine.
We show the execution details of the above vector instruction in an array of N processing elements (PEs) to illustrate the necessity of data routing in an array processor. The N vector summations can be computed recursively by going through the following steps. At first, each Ai is transferred to the Ri register in PEi for i = 0, 1, ..., 7. In step 1, the content of Ri is routed to R(i+1) and added to A(i+1), leaving the sum Ai + A(i+1) in R(i+1), for i = 0, 1, ..., 6. In step 2, the intermediate sums in Ri are routed to R(i+2) for i = 0 to 5. In step 3, the intermediate sums in Ri are routed to R(i+4) for i = 0 to 3. Finally, PEk has the final value of S(k) for k = 0, 1, 2, ..., 7, as in the last column of the figure below.
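The routing steps above amount to a recursive-doubling prefix sum: in step j, each PE receives the partial sum from the PE 2^(j-1) positions to its left and adds it in, while the leftmost PEs are masked off and keep their values — one reason the masking mechanism is needed. A sketch of the 8-PE case:

```python
# Recursive doubling on N PEs: after step j, register R[k] holds the sum
# of the last 2**j elements ending at position k; after log2(N) steps,
# R[k] = S(k) = A[0] + ... + A[k].
def simd_prefix_sum(a):
    n = len(a)                 # number of PEs, assumed a power of two
    r = list(a)                # the R registers, one per PE
    shift = 1
    while shift < n:
        # All PEs act in lock step; PEs with k < shift are masked off
        # and simply keep their current partial sum.
        r = [r[k] + (r[k - shift] if k >= shift else 0) for k in range(n)]
        shift *= 2
    return r

sums = simd_prefix_sum([1, 2, 3, 4, 5, 6, 7, 8])   # S(k) for k = 0..7
```

For 8 PEs only 3 routing steps are needed, versus 7 sequential additions.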
3. A 50 MHz processor was used to execute a program with a given instruction mix and clock cycle counts (the mix includes, for example, 50000 arithmetic instructions). Calculate the effective CPI, the MIPS rate and the CPU time for this program.
Answer:
We know,
CPU time = Instruction Count (IC) x Clock cycles Per Instruction (CPI) x Clock Cycle Time (CCT)
where Ni = the number of times the i-th instruction is executed in the program, and CPIi = the number of clock cycles for the i-th instruction. The average value of Clock cycles Per Instruction (CPI) is given by
CPI = sum over i of (fi x CPIi)
where fi = frequency of occurrence of the i-th instruction in the program.
CPU time = IC x CPI x CCT
MIPS rate = IC / (CPU time x 10^6) = Clock rate / (CPI x 10^6)
Given clock rate = 50 MHz and, from the instruction mix, CPI = 4.15:
MIPS rate = (50 x 10^6) / (4.15 x 10^6) = 12.05

4. What is instruction level parallelism? [WBUT 2014]
Answer:
Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously. The potential overlap among instructions is called instruction-level parallelism. A goal of compiler and processor designers is to identify and take advantage of as much ILP as possible. Ordinary programs are typically written under a sequential execution model.

Long Answer Type Questions

1. Differentiate between multiprocessors and multicomputers.
OR,
What are the differences and similarities between a multicomputer and a multiprocessor? [WBUT 2014]
Answer:
A parallel machine model called the multicomputer comprises a number of von Neumann computers, or nodes, linked by an interconnection network. Each computer executes its own program. This program may access local memory and may send and receive messages over the network. Messages are used to communicate with other computers or, equivalently, to read and write remote memories. In a multicomputer, accesses to local data are more frequent than accesses to remote data. This property, called locality, is the third fundamental requirement for parallel software, in addition to concurrency and scalability.
Another important class of parallel computer is the multiprocessor, or shared-memory MIMD computer. In multiprocessors, all processors share access to a common memory, typically via a bus or a hierarchy of buses. In the idealized Parallel Random Access Machine (PRAM) model, often used in theoretical studies of parallel algorithms, any processor can access any memory element in the same amount of time. In practice, scaling this architecture usually introduces some form of memory hierarchy; in particular, the frequency with which the shared memory is accessed may be reduced by storing copies of frequently used data items in a cache associated with each processor. Access to this cache is much faster than access to the shared memory.
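The CPI and MIPS formulas used in Question 3 above can be checked numerically; the instruction mix below is an invented example for illustration, not the one from the original question's table.

```python
# Weighted CPI, CPU time and MIPS rate from an instruction mix.
# The mix (count, cycles-per-instruction) is illustrative only.
mix = [
    (45000, 1),   # arithmetic
    (32000, 2),   # data transfer
    (15000, 2),   # floating point
    (8000, 2),    # control transfer
]
ic = sum(count for count, _ in mix)                # total instruction count
cpi = sum(count * cyc for count, cyc in mix) / ic  # weighted average CPI
clock_rate = 50e6                                  # 50 MHz
cpu_time = ic * cpi / clock_rate                   # seconds
mips = clock_rate / (cpi * 1e6)                    # MIPS = clock / (CPI * 10^6)
```

With this mix, CPI = 1.55 and the MIPS rate is about 32.26.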
2. a) Describe the distributed and shared memory models of SIMD architecture.
b) Draw the block diagram and explain the functionality of a processing element. [WBUT 2008]
Answer:
a) There are two types of SIMD computer models, described below, based on the memory distribution and addressing scheme used. One is the Distributed-Memory Model and the other is the Shared-Memory Model. Most SIMD computers use a single control unit and distributed memories, except for a few that use associative memories. The instruction set of an SIMD computer is decoded by the array control unit. The processing elements (PEs) in the SIMD array are passive ALUs executing instructions broadcast from the control unit.

Fig 1: Distributed-Memory Model SIMD architecture (array control unit, control memory holding program and data, host computer, and PEs with local memories connected by an inter-PE data-routing network)

An instruction is sent to the control unit for decoding. If it is a scalar or program control operation, it will be directly executed by a scalar processor attached to the control unit. If the decoded instruction is a vector operation, it will be broadcast to all the PEs for parallel execution. Partitioned data sets are distributed to all the local memories attached to the PEs through a vector data bus. The PEs are interconnected by a data-routing network which performs inter-PE data communications such as shifting, permutation, and other routing operations. The data-routing network is under program control through the control unit. The PEs are synchronized in hardware by the control unit. Almost all SIMD machines built today are based on the distributed-memory model. Illiac IV and CM-2 are examples of the Distributed-Memory SIMD architecture.

Fig 2: Shared-Memory Model SIMD architecture (array control unit with control memory and scalar processor, broadcast bus for vector instructions, and control and data buses to the PEs)

b) An array processor is a synchronous parallel computer with multiple arithmetic logic units, called processing elements (PEs), which are synchronized to perform the same function at the same time. The PEs can establish an appropriate data routing mechanism. Each PE consists of an ALU with registers and a local memory. The PEs are interconnected by a data-routing network. A set of local registers and flags A, B, C and S are present in a PE. The data routing register is R, the address register is D, and there is a local index register I. When a data transfer occurs between PEs, the contents of the data routing register are transferred.
Examples of this class of machine include the Silicon Graphics Challenge, the Sequent Symmetry, and the many multiprocessor workstations. MIMD (multiple instruction, multiple data) is a technique to achieve parallelism. Machines using MIMD have a number of processors that function asynchronously and independently. At any time, different processors may be executing different instructions on different pieces of data. MIMD machines can be of either the shared memory or the distributed memory category. These classifications are based on how MIMD processors access memory. Shared memory machines may be of the bus-based type. Distributed memory machines may have hypercube or mesh interconnection schemes. MIMD machines with shared memory have processors which share a common, central memory. In the simplest form, all processors are attached to a bus which connects them to memory. MIMD machines with hierarchical shared memory use a hierarchy of buses to give processors access to each other's memory. Processors on different boards may communicate through inter-nodal buses; buses support communication between boards. With this type of architecture, the machine may support over a thousand processors.

4. Why do we need parallel processing? What are the different levels of parallel processing? Explain. [WBUT 2015]
Answer:
Parallel processing is needed to run several jobs together. The computer would start an I/O operation, and while it was waiting for the I/O operation to complete, it would execute the processor-intensive program; the total execution time for the two jobs would then be a little over one hour.
Levels of parallel processing:
We can have parallel processing at four levels.
i) Instruction Level: Most processors have several execution units and can execute several instructions at the same time.

The SIMD Array Processor consists of a Memory, an Array Control Unit (ACU) and a one-dimensional SIMD array of simple processing elements (PEs). The figure shows a 4-processor array and the initial image seen when the model is loaded.

Fig: SIMD Array Processor (Memory, Array Control Unit with PC, CC and AC-IR, the global PE-IR and PEC registers, and the SIMD array)

The ACU is a simple load/store, register-register arithmetic processor. It has 16 general purpose registers, a Program Counter (PC), a Condition Code Register (CC) and an Instruction Register (AC-IR). The Program Counter has two fields: label and offset. The label field is initially set to "main" and the offset to zero. The ACU also uses two other registers, the Processing Element Instruction Register (PE-IR) and the Processing Element Control register (PEC), which are global registers used to communicate with the SIMD Array. The Processing Elements operate in lock step, i.e. each active PE obeys the same instruction at the same time.

Write short notes on the following:
a) MMX Technology [WBUT 2005, 2007, 2010]
b) Flynn's classification [WBUT 2005, 2006, 2007, 2008]
c) CM-2 machine [WBUT 2011]
Answer:
a) MMX Technology:
The MMX technology consists of several improvements over the non-MMX Pentium microprocessors:
- There are 57 new microprocessor instructions that have been designed to handle video, audio, and graphical data more efficiently. Programs can use MMX instructions without changing to a new mode or operating-system visible state.
- A new 64-bit integer data type is also added to the MMX technology.
- A new process, Single Instruction Multiple Data (SIMD), makes it possible for one instruction to perform the same operation on multiple data items.
- The memory cache on the microprocessor has increased to 32 KB, meaning fewer accesses to memory that is off the microprocessor.
All MMX chips have a larger internal L1 cache than their non-MMX counterparts. This improves the performance of any software running on the chip, regardless of whether it actually uses the MMX-specific instructions or not.
Part of the Pentium processor's MMX implementation was the design of a new, dedicated, high-performance MMX pipeline, which was able to execute two MMX instructions with minimal logic changes in the existing units. In addition, the design goal was to stay on the microprocessor's performance curve. With the addition of the new instructions, the instruction decode logic had to be modified to decode, schedule and issue the new instructions at a rate of up to two instructions per clock.

Frequency speedup:
To simplify the design and to meet the core frequency goal, the pipeline of the Pentium processor with MMX was extended with a new pipeline stage. The two major bottlenecks were the instruction decoder and the data-cache access, so the decoder bottleneck was fixed first. Previously an instruction used the old 5-stage pipe: Fetch, Decode1, Decode2, Execute, Write-Back. To speed things up, a 6th stage, Prefetch, was added to the pipe, and a queue was added between Prefetch and Fetch to decouple freezes. So now an instruction goes through: Prefetch, Fetch, Decode1, Decode2, Execute, Write-Back. After adding this new stage, the machine is rebalanced to take advantage of the extra clock cycle, as shown in the figure below.

Fig: Block diagram of the Pentium Processor with MMX technology

Although adding a pipeline stage improves frequency, it decreases CPI performance: the longer the pipeline, the more work is done speculatively by the machine, and therefore the more work is thrown away in the case of a branch miss-prediction. The additional pipeline stage decreased the CPI performance of the processor by 5-6%.

c) CM-2 machine:
The CM-2 was a SIMD architecture based machine. The PEs in the CM-2 were capable of performing bit-serial arithmetic. The control processor, or sequencer, could decompose an 8-bit operation, for example, into 8 PE nano-instructions. The CM-2 provides hypercube connections between the different processing elements (PEs). The PEs were organized into modules, each having 32 PEs. Within a given module, the PEs were organized into two 16-PE sets, with each set having its own router node. PEs within a given set use shared memory to communicate with one another; router nodes are connected so that each router node represents a vertex in the hypercube. One interesting feature of the routers was that they provided special circuitry for message combining for messages with the same destination. In addition to the communication via the local memories of PEs within a given module, the CM-2 also supports patterned communications directly across the wires of the hypercube.
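The SIMD idea behind MMX — one instruction operating on several packed data items at once — can be sketched as a packed 8-bit addition on a 64-bit quantity. This is a simplified model in the spirit of MMX's PADDB instruction, ignoring the saturating variants the real instruction set also defines.

```python
# Packed 8-bit wraparound addition of two 64-bit values: eight
# independent byte lanes added by "one instruction", PADDB-style.
def paddb(x, y):
    result = 0
    for lane in range(8):
        a = (x >> (8 * lane)) & 0xFF
        b = (y >> (8 * lane)) & 0xFF
        result |= ((a + b) & 0xFF) << (8 * lane)  # each lane wraps mod 256
    return result

packed = paddb(0x0102030405060708, 0x0101010101010101)  # add 1 to every byte
```

Each byte lane is computed independently; a carry out of one lane never spills into the next, which is exactly what distinguishes a packed add from ordinary 64-bit addition.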
Fig: CM-2 connections to the front-end computer (scalar memory bus, global result bus and instruction broadcast bus)

b) Flynn's classification:
A machine may operate on single or multiple independent pieces of data. A vector of multiple pieces of data may comprise multiple data streams, and independent memory values that are handled concurrently may also comprise separate data streams. Each of the two attributes, Instructions/Data, may be classified as Single/Multiple, yielding four system possibilities as illustrated in Figure 1.

Fig 1:
                          Data Streams
                          Single    Multiple
Instruction   Single      SISD      SIMD
Streams       Multiple    MISD      MIMD

MEMORY

Multiple Choice Type Questions

1. A computer with cache access time of 100 ns, a main memory access time of 1000 ns, and a hit ratio of 0.9 produces an average access time of [WBUT 2007, 2010, 2014]
a) 250 ns b) 200 ns c) 190 ns d) none of these
Answer: (c)

Associative memory is
a) pointer addressable memory b) very cheap memory
c) content addressable memory d) slow memory
Answer: (c)

5. The principle of locality justifies the use of [WBUT 2008, 2009]
a) Interrupts b) Polling c) DMA d) Cache memory
Answer: (d)
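MCQ 1's answer comes from the effective access time formula used throughout this chapter, Ta = h x Tc + (1 - h) x Tm, taking the miss path as a full main-memory access:

```python
# Effective (average) memory access time for a cache/main-memory pair:
# hits are served at cache speed, misses at main-memory speed.
def effective_access_time(hit_ratio, t_cache, t_main):
    return hit_ratio * t_cache + (1.0 - hit_ratio) * t_main

t_avg = effective_access_time(0.9, 100, 1000)   # 190 ns, option (c)
```

Plugging in 100 ns, 1000 ns and h = 0.9 gives 0.9 x 100 + 0.1 x 1000 = 190 ns.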
8. In the absence of a TLB, to access a physical memory location in a paged-memory system, how many memory accesses are required? [WBUT 2012]
a) 1 b) 2 c) 3 d) 4
Answer: (b)

9. A direct mapped cache memory with n blocks is nothing but which of the following set associative cache memory organizations? [WBUT 2012, 2015]
a) 0-way set associative b) 1-way set associative
c) 2-way set associative d) n-way set associative
Answer: (b)

10. In which type of memory mapping will there be conflict misses? [WBUT 2013]
a) direct mapping b) set associative mapping
c) associative mapping d) both (a) & (b)
Answer: (d)

11. Virtual address space can be divided into some fixed size [WBUT 2013]
a) segments b) blocks c) pages d) none of these
Answer: (c)

12. Which is not a property of a memory hierarchy? [WBUT 2013]
a) inclusion b) consistency c) capability d) locality
Answer: (c)

13. Effective access time (Ta) of memory is given by [WBUT 2014]
Answer: Ta = h x Tc + (1 - h) x Tm, where h is the hit ratio, Tc the cache access time and Tm the main memory access time.

16. A computer with cache access time of 100 ns, a main memory access time of 1000 ns and a hit ratio of 0.9 produces an average access time of [WBUT 2016]
a) 250 ns b) 200 ns c) 190 ns d) 80 ns
Answer: (c)

Short Answer Type Questions

1. Consider the performance of a main memory organization when a cache miss has occurred, as follows: [WBUT 2007]
i) 4 clock cycles to send the address
ii) 24 clock cycles for the access time per word
iii) 4 clock cycles to send a word of data
Estimate:
a) The miss penalty for a cache block of 4 words.
b) The miss penalty for a 4-way interleaved main memory with a cache block of 4 words.
Answer:
a) To fetch a single word from main memory, the processor spends 4 + 24 + 4 = 32 clock cycles. Since the size of the main memory block equals the size of the cache memory block, the 4 words of a block are fetched one after another, so
miss penalty = 4 x 32 = 128 clock cycles.
b) For a 4-way interleaved memory, the 4 words are present in 4 different banks. The address is sent once (4 cycles), all the banks are activated and read in parallel (24 cycles), and each bank then requires 4 clock cycles to send its word of data:
miss penalty = 4 + 24 + 4 x 4 = 44 clock cycles.
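The two miss penalties of Question 1 can be reproduced with a short script; the two functions model the sequential and the interleaved organizations under the cycle counts stated in the question.

```python
# Cycle counts from Question 1: send address / access a word / send a word.
ADDR, ACCESS, XFER = 4, 24, 4
WORDS = 4  # cache block size in words

def miss_penalty_one_word_wide():
    """Each of the 4 words needs its own address + access + transfer."""
    return WORDS * (ADDR + ACCESS + XFER)

def miss_penalty_interleaved():
    """One address, banks accessed in parallel, words sent one by one."""
    return ADDR + ACCESS + WORDS * XFER
```

Interleaving overlaps the four 24-cycle accesses, cutting the penalty from 128 to 44 cycles.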
i.e. both cache
Such a mapping is called a hashino same time. memory and main memory are
module. and to an address within that module. updatca t the a
scheme Clearly. the mapping must be one-to-one. In the write-back policy,
the update is performed
cache block is replaced only in the cache memory. When tne
access time of 100nanosecs, a main memory access time then it updates the
3. A computer has cache main memory contain.
0.9.
of 1000 nanosecs anda ht ratio of 5. Assume the performance
i) Find the average access time of the memory system. of 1-word wide primary memory organizatOn
AU
Suppose thet in the computer, there is no cache memory, and then find 4 clock cycles to
)average the 56 clock cycles for
send the address
access time, when the main memory access time is 1000 nanosecs. the access time per word
Compare the two access times. WBUT 2008] 4 clock cycles to send
a word
Answer: Given a cache block of 4 words, of data
) Suppose the average access time of memory system is penalty and the effective memory and that a word is 8 bytes, calculate the mk
x 9+ 1000.l = 190 ns bandwidth.
100 Re-compute the miss penalty and the memory bandwidth
assuming we have
Main memory width of 2 words
) if there is no cache memory, then the average access time is equal to the main Main memory width of 4 words
memoryy
access time, i.e., 1000ns.
Interleaved main memory with 4 bands with each bank 1-word wide.
AK
So. with cache memory. the average access time= 190ns and
without cache memory, the WBUT 2009
average access time = 1000ns Answer:
In the first case, total (4+56+4)= 64 clock cycle is required to access I word data from
4. Consider a computer where the clock per instruction (CPI)
memory accesses ht (no memory is 1.0 when all memory and I word data = 8 bytes.
stalls) in the cache. Assume each clock Memory bandwidth = (8/ 64) bytes per clock cycle
2 ns. The only data accesses are
loads and stores, and these total 50%cycle is
of the 0.125 bytes/clock cycle
instructios. Assume the following formula for calculating
CPU executiontime = (CPU clock execution time: In the second case, main memory width is 2 words. So, total(4+56+4) = 64 clock cycle is
For a program consisting of 100 cycles+Memory stall cycles) x Clock cycle time. required to access 2 word data from memory andl word data =8 bytes.
instructions:
Calculate the CPU execution time Memory bandwidth = (8*2/ 64) bytes per clock cycle
M
Calculate the CPU execution assuming there are no misses. = 0.25 bytes/clock cycle
time considering the miss penalty memory width is 4 words. So, total (4+56+4) 64 clock cycle is
cycles and the miss rate is 2%. is 25 clock In the third case, main
Discuss the difference between
required to access word data from memory and word data = 8 bytes.
I
write through and write back 4
cache policies. Memory bandwidth = (84/64) bytes per clock cycle = 0.5 bytes/clock
Answer: WBUT 2009] interleavea main memory has 4 banks and each bank is word wide.
1
handwidth
Answer:
Case 1: Assuming one clock cycle per instruction and no misses,
CPU execution time = CPU clock cycles × Clock cycle time = 100 × Clock cycle time, i.e., 100 clock cycles for the 100 instructions.
Case 2: Here the miss rate = 2% and the miss penalty = 25 clock cycles.
Memory stall cycles per instruction = Misses per instruction × Miss penalty = 0.02 × 25 = 0.5
Now, CPU execution time for 100 instructions
= (CPU clock cycles + Memory stall cycles) × Clock cycle time
= (100 + 0.5 × 100) × Clock cycle time = 150 clock cycles × Clock cycle time.
Cache write-through and write-back policies:
There are two cache write policies: the write-through policy and the write-back policy. In the write-through policy, if the word is in the cache, then there are two copies of the word, one in the cache and one in main memory, and both are updated simultaneously; this is referred to as write-through. In the write-back policy, only the copy in the cache is updated, and main memory is updated when the block is replaced.

6. What are logical address and physical address? If the segment no. is 8, page no. is 30 and word no. is 40, where segment table entry 8 holds 30 and page table entry 30 holds 019, what will be the corresponding physical address? For the answer a figure is essential. [WBUT 2009]
Answer:
In a system there are two types of addresses: logical address and physical address. A logical address is generated by the CPU; it is the address at which a memory cell or storage element appears to reside from the perspective of an executing application program. A physical address is the address of a main memory word, which is permanent. A logical address may be different from the physical address due to the operation of an address translator or mapping function.
In a paged virtual memory system, the address space is divided into a number of equal-sized pages. The basic unit of transfer of data between main memory and the disk is the page; that is, at any given time,
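The stall-cycle arithmetic above can be checked with a short sketch (a base CPI of 1 is an assumption, as in the answer):

```python
# CPU clock cycles including memory stalls (base CPI assumed to be 1).
def cpu_clock_cycles(instr, base_cpi=1.0, miss_rate=0.0, miss_penalty=0):
    stalls_per_instr = miss_rate * miss_penalty   # memory stall cycles per instruction
    return instr * (base_cpi + stalls_per_instr)

print(cpu_clock_cycles(100))                                   # 100.0 (no misses)
print(cpu_clock_cycles(100, miss_rate=0.02, miss_penalty=25))  # 150.0
```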
main memory may consist of pages from various segments. In this case, the virtual address is divided into a segment number, a page number, and a displacement within the page. Address translation is the same as explained above, except that the physical segment base address obtained from the segment table is now added to the virtual page number in order to obtain the appropriate entry in the page table. The output of the page table is the page physical address, which, when concatenated with the word field of the virtual address, results in the physical address.

Fig: paged segment address translation (segment number, virtual page number and displacement; segment table, page table, physical address)

The set-associative mapping technique combines aspects of both the direct mapping and the associative mapping techniques: it combines the simplicity of direct mapping with the flexibility of associative mapping. In the set-associative mapping technique, the cache is divided into a number of sets. Each set consists of a number of blocks. A given main memory block maps to a specific cache set based on the equation
s = i mod S
where S is the number of sets in the cache, i is the main memory block number, and s is the specific cache set to which block i maps. So, in this technique, the blocks of main memory are connected to the different sets of the cache memory by the direct mapping technique, but within a set there are a number of cache blocks, and a specific block of main memory may transfer to any block of its set by the fully associative cache mapping technique. So, cache block replacement is reduced and the hit ratio is increased.
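The placement equation s = i mod S can be sketched directly; the set count below is only an illustrative value:

```python
# Set-associative placement: main-memory block i maps to set s = i mod S.
def cache_set(i, num_sets):
    return i % num_sets

S = 8                     # illustrative number of sets
# These distinct main-memory blocks all land in the same set,
# but may occupy any way (block) within that set.
print([cache_set(i, S) for i in (5, 13, 21)])   # [5, 5, 5]
```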
[The figure above shows the paged segment address translation; the data given in the problem are incomplete, so the numeric physical address cannot be computed.]

8. Describe different techniques to reduce Miss Rate. [WBUT 2010]
Answer:
One of the techniques to reduce the cache miss rate is compiler-controlled prefetching. While this approach yields better prefetch "hit" rates than hardware prefetching, it does so at the expense of executing more instructions. Thus, the compiler tends to concentrate on prefetching data that are likely to be cache misses anyway. Loops are key targets, since they operate over large data spaces and their data accesses can be inferred from the loop index in advance.
Another method to reduce the cache miss rate is compiler optimization. This method does not require any hardware modifications, yet it can be the most efficient way to eliminate cache misses. The improvement results from better code and data organization. For example, code can be rearranged to avoid conflicts in a direct-mapped cache, and accesses to arrays can be reordered to operate on blocks of data rather than processing rows of the array.
7. What is the limitation of the direct mapping method? Explain with example how it can be improved in the set-associative mapping method. [WBUT 2009]
Main memory size is 32K × 12. Cache memory size is 512 × 12 and the block size is 16 words. Describe the following: a) Direct mapping technique b) Associative mapping technique. [WBUT 2010]
Answer:
The direct mapping technique is the simplest technique of cache mapping. It places an incoming main memory block into a specific fixed cache block location. The placement is done based on a fixed relation between the incoming block number i, the cache block number j, and the number of cache blocks N:
j = i mod N
The main disadvantage of the direct mapping technique is the inefficient use of the cache. This is because, according to this technique, a number of main memory blocks may compete for a given cache block even if there exist other empty cache blocks, i.e., the many-to-one property. The expected poor utilization of the cache is mainly due to this restriction on the placement of the incoming main memory blocks in the cache, and this disadvantage leads to a low cache hit ratio.
For the given system, the size of the main memory is 32K × 12 = 2^15 × 12 bits, i.e., there is a 15-bit address bus and a 12-bit data bus in the system. The size of the cache memory is 512 × 12 = 2^9 × 12 bits, and the block size is 16 words, i.e., 2^4 words in a block.
a) Direct mapping: the CPU address = 15 bits.
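The many-to-one behaviour of j = i mod N is easy to see with a small sketch (N = 32 cache blocks, assuming the 16-word blocks of the worked example, i.e., 2^9 cache words / 2^4 words per block):

```python
# Direct mapping: main-memory block i can only go to cache block j = i mod N.
N = 32                      # cache blocks: 2^9 words / 2^4 words per block
conflicting = [3, 35, 67]   # distinct main-memory blocks
# All three compete for cache block 3 even if other cache blocks are empty.
print([i % N for i in conflicting])   # [3, 3, 3]
```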
No. of blocks in the cache = 2^9 / 2^4 = 2^5.
No. of bits in the Tag field = 15 − 9 = 6 bits.
CPU address: | Tag Field (6 bits) | Block Field (5 bits) | Word Field (4 bits) |
b) In the associative mapping technique, the no. of words in a block = 2^4, so the no. of bits in the Tag field = 15 − 4 = 11 bits.
CPU address: | Tag Field (11 bits) | Word Field (4 bits) |

Different Types of Locality:
Temporal Locality (Locality in Time): the concept that a memory location referenced by a program at one point in time will be referenced again sometime in the near future. If a memory element is referenced, it will tend to be referenced again soon (e.g., loops, reuse). It refers to the phenomenon that once a particular memory item has been referenced, it is most likely to be referenced again next.
Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access). Spatial locality refers to the phenomenon that when a given address has been referenced, addresses near it are most likely to be referenced within a short period of time; the likelihood of referencing a memory location is higher if a memory location near it was just referenced.
Sequential Locality: in typical programs, the execution of instructions follows a sequential order (the program order) unless branch instructions create out-of-order executions.

10. What is the cache coherence problem? Suggest one method to solve this problem. [WBUT 2011]
OR,
What do you mean by cache coherence problem? Describe one method to remove this problem and indicate its limitations. [WBUT 2013, 2016]
Answer:
A cache coherence protocol manages the caches of a multiprocessor system so that no data is lost or overwritten before the data is transferred from a cache to the target memory. When two or more computer processors work together on a single program, known as multiprocessing, each processor may have its own memory cache that is separate from the larger RAM that the individual processors will access. Memory caching is effective because most programs access the same data or instructions over and over. When multiple processors with separate caches share a common memory, it is necessary to keep the caches in a state of coherence by ensuring that any shared operand that is changed in any cache is changed throughout the entire system.
This is done in either of two ways: through a directory-based or a snooping system. In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processors load entries from primary memory into their caches; when an entry is changed, the directory either updates or invalidates the other caches holding that entry.

12. How is a block chosen for replacement in a set-associative cache to resolve a cache miss? [WBUT 2012]
Answer:
The cache is divided into a number of sets containing an equal number of blocks. Each block in main memory maps into one set in cache memory, similar to direct mapping. Within the set, the cache acts as associative mapping, where a block can occupy any line within that set, and replacement algorithms may be used within the set. An incoming block maps to any block in the assigned cache set. Therefore, the address issued by the processor is divided into three distinct fields: the Tag, Set, and Word fields. The Set field is used to uniquely identify the specific cache set that ideally should hold the targeted block. The Tag field uniquely identifies the targeted block within the determined set. The Word field identifies the element (word) within the block that is requested by the processor.
Main Memory Address: | Tag Field | Set Field | Word Field |
If in step 2 no match is found, this indicates a cache miss. Therefore, the required block has to be brought from the main memory and deposited in the specified set first, and then the targeted element (word) is made available to the processor. The cache Tag memory and the cache block memory have to be updated accordingly.

13. A certain program generates the following sequence of word addresses: 4, 5, 12, 8, 10, 28, 6, 16. A page has four words; the number of page frames in main memory is 3. How many page faults are generated if the optimal page replacement policy is used? [WBUT 2013]
Answer:
With four words per page, word address w lies on page w div 4, so the word-address sequence maps to the page reference string 1, 1, 3, 2, 2, 7, 1, 4. Simulating the optimal policy with 3 frames:

Reference: 1  1  3  2  2  7  1  4
Frame 1:   1  1  1  1  1  1  1  1
Frame 2:         3  3  3  7  7  7
Frame 3:            2  2  2  2  4
Fault:     F  H  F  F  H  F  H  F

Total page faults = 5.
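The fault count can be verified with a small simulation of the optimal (Belady) policy; the page mapping w // 4 follows from the 4-word page size:

```python
# Optimal page replacement: on a fault, evict the resident page whose next
# use lies farthest in the future (or that is never used again).
def optimal_faults(refs, frames):
    resident, faults = [], 0
    for t, p in enumerate(refs):
        if p in resident:
            continue                               # hit
        faults += 1
        if len(resident) < frames:
            resident.append(p)
        else:
            future = refs[t + 1:]
            victim = max(resident,
                         key=lambda q: future.index(q) if q in future else len(future))
            resident[resident.index(victim)] = p
    return faults

words = [4, 5, 12, 8, 10, 28, 6, 16]
pages = [w // 4 for w in words]       # [1, 1, 3, 2, 2, 7, 1, 4]
print(optimal_faults(pages, 3))       # 5
```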
The page to be replaced under the optimal policy is the one that will not be used for the longest period of time; this algorithm requires future knowledge of the reference string.

14. An address space is specified by 28 bits and the corresponding memory space by 26 bits. If a page consists of 4K words:
i) How many pages and blocks are there in the system?
ii) The associative memory page table contains the given Page–Block entries. [WBUT 2013]
Answer:
i) A page of 4K words needs 12 address bits, so:
Number of pages = 2^28 / 2^12 = 2^16 pages.
Number of blocks = 2^26 / 2^12 = 2^14 blocks.
Fig: Write through policy (all the memories have the same copy)

Long Answer Type Questions

1. …the processing unit, the main memory system, and the cache memory system, so that data throughput is improved without negatively affecting the hit ratio. The access time refers to the time between a request to the i-th level of memory and the arrival of the word from that level of memory at the processor. From the definition of bandwidth it is obvious that if we increase the width of the data path, the bit transfer rate, and hence the bandwidth, will increase. The performance of cache-based multiprocessors for general-purpose computing and for multitasking is analyzed with simple throughput models; a private cache is associated with each processor.
2. a) What is Memory Management Unit (MMU)?
b) What are the advantages of using cache memory organization? Define hit ratio. Compare and contrast associative mapping and direct mapping. [WBUT 2008]
Answer:
a) The Memory Management Unit (MMU) is the hardware component that manages virtual memory systems. Among the functions of such devices are the translation of virtual addresses to physical addresses, memory protection, cache control, bus arbitration and, in simpler computer architectures, bank switching. Typically the MMU is part of the CPU, though in some designs it is a separate chip. The MMU includes a small amount of memory that holds a table matching virtual addresses to physical addresses. This table is called the Translation Look-aside Buffer (TLB). All requests for data are sent to the MMU, which determines whether the data is in RAM or needs to be fetched from the mass storage device. If the data is not in memory, the MMU issues a page fault interrupt.
The CPU uses cache memory, integrated on the CPU chip or placed on a separate chip, to store instructions that are repeatedly required to run programs, improving overall system speed. Most modern desktop and server CPUs have at least three independent caches: an instruction cache to speed up executable instruction fetch, a data cache to speed up data fetch and store, and a translation look-aside buffer used to speed up virtual-to-physical address translation for both executable instructions and data.
When the processor needs to read or write a location in main memory, it first checks whether that memory location is in the cache. This is accomplished by comparing the address of the memory location to all tags in the cache that might contain that address. If the processor finds that the memory location is in the cache, we say that a cache hit has occurred; otherwise, we speak of a cache miss. In the case of a cache hit, the processor immediately reads or writes the data in the cache line.
Local miss: the misses with respect to the memory accesses to this cache memory, i.e., the misses in this cache divided by the total number of memory accesses to this cache memory.
Global miss: the misses with respect to the memory accesses generated by the CPU, i.e., the misses in this cache divided by the total number of memory accesses generated by the CPU.
b) The advantage of using a cache in the memory hierarchy is to keep the information expected to be used more frequently by the CPU in the cache. The end result is that at any given time some active portion of the main memory is duplicated in the cache. Therefore, when the processor makes a request for a memory reference, the request is first searched in the cache; if the request corresponds to an element that is currently residing in the cache, it is serviced quickly without accessing main memory.
Hit ratio (h) = No. of hits / (No. of hits + No. of misses) = No. of hits / Total CPU accesses.
Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; the CPU continues execution while the rest of the words in the block are filled. Critical-word-first fetch is also called wrapped fetch.
Compare and contrast associative mapping and direct mapping:
i. In the direct mapping technique, each block of main memory is transferred to one fixed cache block, whereas in associative mapping an incoming block may be placed in any cache block.
because of a miss and, if found there, the block is supplied; otherwise the blocks are swapped with the lower-level memory, and in a write-back cache the replaced dirty block is written back.
c) The TLB can hold 256 page table entries, i.e., 2^8 entries, and is 8-way set associative, i.e., 2^3 ways, so it has 2^8 / 2^3 = 2^5 sets. With a 4 kB page the page offset is 12 bits, so the TLB tag size = 32 − (12 + 5) = 15 bits.
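The tag-size arithmetic generalises as below (note that with 256 entries and 8 ways the TLB has 32 = 2^5 sets, so the set-index field is 5 bits):

```python
import math

# TLB tag bits = virtual-address bits - page-offset bits - set-index bits.
def tlb_tag_bits(va_bits, page_bytes, entries, ways):
    offset_bits = int(math.log2(page_bytes))      # 12 for a 4 kB page
    index_bits = int(math.log2(entries // ways))  # 5 for 256 entries, 8-way
    return va_bits - offset_bits - index_bits

print(tlb_tag_bits(32, 4096, 256, 8))   # 15
```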
3. a) Calculate the tags for the cache.
b) What are the approaches to improve miss penalty?
c) A CPU generates 32-bit virtual addresses. The page size is 4 kB. The processor has a TLB which can hold a total of 256 page table entries. The TLB is 8-way set associative. Calculate the TLB tag size. [WBUT 2012]
Why is the page size usually a power of 2? Discuss paging and segmentation. [WBUT 2013]
Answer:
a) Total size of the cache memory = 8 KB = 2^13 bytes. 4-way set associative means each set contains 4 blocks. The size of each block is 32 bytes, i.e., 2^5 bytes, so the cache holds 2^13 / 2^5 = 2^8 blocks arranged as 2^8 / 2^2 = 2^6 sets. The address field therefore contains a 5-bit word field, a 6-bit set field, and the remaining address bits as the tag field.
b) Many techniques to reduce miss penalty affect the CPU; this technique ignores the CPU, concentrating on the interface between the cache and main memory. Adding another level of cache between the original cache and memory simplifies the decision. The first-level cache can be small enough to match the clock cycle time of the fast CPU, yet the second-level cache can be large enough to capture many accesses that would go to main memory, thereby lessening the effective miss penalty.
Multilevel caches require extra hardware to reduce miss penalty, but not this second technique. It is based on the observation that the CPU normally needs just one word of the block at a time. This strategy is impatience: don't wait for the full block to be loaded before sending the requested word and restarting the CPU. Here are two specific strategies: critical word first (wrapped fetch) and early restart.
Paging: Computer memory is divided into small partitions that are all the same size, referred to as page frames. Pages are fixed units of computer memory allocated by the computer for storing and processing information. Paging is a virtual memory scheme which is transparent to the program at the application level. When a process is loaded, it gets divided into pages which are the same size as the frames; the process pages are then loaded into the frames.
Segmentation: Computer memory is allocated in various sizes (segments) depending on the need for address space by the process. In segmentation, the address space is typically divided into a preset number of segments, such as a data segment (read/write), a code segment (read-only), a stack (read/write), etc. These segments may be individually protected or shared between processes. The "Segmentation Faults" commonly seen in programs occur because the data about to be read or written is outside the permitted address space of that process.
Recall that paging is implemented by breaking up an address into a page number and an offset. It is most efficient to break the address into X page bits and Y offset bits, rather than perform arithmetic on the address to calculate the page number and offset. Because each bit position represents a power of 2, splitting an address between bit positions results in a page size that is a power of 2.
6. A computer has a 1 KB, 4-way set-associative cache with 4-word blocks, and a 1 MB main memory. In which cache sets are the words (ABCC)16 and (EDC8)16 found?
Answer:
The cache holds 1K/4 = 256 blocks; with 4 blocks per set there are 256/4 = 64 sets.
The word (ABCC)16 = (43980)10 lies in block no. 43980/4 = (10995)10, and this block will be in set no. 10995 mod 64 = 51.
Similarly, the word (EDC8)16 = (60872)10 lies in block no. 60872/4 = (15218)10, and this block will be in set no. 15218 mod 64 = 50.
7. a) What is meant by the cache miss penalty? Briefly discuss the "early restart" technique to reduce miss penalty.
b) Let us consider a memory system consisting of main memory and cache memory. In case of a cache miss, assume the performance of the basic memory organization as:
4 clock cycles to send the address
24 clock cycles for the access time per word
4 clock cycles to send a word of data
i) What will be the miss penalty, given a cache block of four words?
ii) What will be the memory bandwidth? [WBUT 2015]
Answer:
a) Refer to Question No. 4 of Long Answer Type Questions.
b) Refer to Question No. 1 of Short Answer Type Questions.

8. What is the difference between broadcast and invalidate protocols? Explain the MESI protocol. [WBUT 2016]
Answer:
1st part:
The performance differences between write broadcast and write invalidate protocols arise from the following characteristics:
With multiword cache blocks, each word written in a cache block requires a write broadcast in an update protocol, although only the first write to any word in the block needs to generate an invalidate in an invalidation protocol. An invalidation protocol works on cache blocks, while an update protocol must work on individual words (or bytes, when bytes are written). It is possible to try to merge writes in a write broadcast scheme.
The delay between writing a word in one processor and reading the written value in another processor is usually less in a write update scheme, since the written data are immediately updated in the reader's cache.
2nd part:
MESI protocol: The MESI protocol is an invalidate-based cache coherence protocol and is one of the most common protocols which support write-back caches. By using write-back caches, we save a lot of the bandwidth which is generally wasted with a write-through cache. There is always a dirty state present in write-back caches, which indicates that the data in the cache is different from that in main memory. The protocol reduces the number of main memory transactions with respect to the MSI protocol, which marks a significant improvement in performance.
3rd part:
In the mentioned scenario (a local cache block is updated by P1 and then P2 wants to read it), the protocol can take one of four forms: write-through update, write-through invalidate, write-back update, and write-back invalidate. The state of a cache block copy is one of two states:
Valid state: all processors can read safely; the local processor can also write.
Invalid state: the block is being replaced or has been invalidated.
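A minimal sketch of common MESI transitions (a simplification: a full protocol also tracks bus signals and who supplies the data):

```python
# Simplified MESI transitions for a single cache block on a few events.
MESI = {
    ("I", "read_miss_shared"):    "S",   # another cache already holds the block
    ("I", "read_miss_exclusive"): "E",   # no other cache holds it
    ("I", "local_write"):         "M",
    ("E", "local_write"):         "M",   # silent upgrade, no bus transaction
    ("S", "local_write"):         "M",   # must first broadcast an invalidate
    ("M", "remote_read"):         "S",   # supply data / write back
    ("E", "remote_read"):         "S",
    ("S", "remote_write"):        "I",
    ("E", "remote_write"):        "I",
    ("M", "remote_write"):        "I",
}

state = "I"
for event in ("read_miss_exclusive", "local_write", "remote_read"):
    state = MESI[(state, event)]
print(state)   # S
```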
When a remote processor writes to its cache copy, all other cache copies become invalidated.

9. What do you mean by memory fragmentation? What is the advantage of using paging? Explain the virtual memory concept with an example where the logical address space is 8 kb, the physical address space is 4 kb, and the page size is 1 kb. Explain page fault with the FIFO and LRU algorithms. [WBUT 2016]
Answer:
1st part:
Memory fragmentation occurs when a system contains memory that is technically free but that the computer can't utilize. The memory allocator, which assigns needed memory to various tasks, divides and allocates memory blocks as they are required by programs. When data is deleted, memory blocks are freed up in the system and added back to the pool of available memory. When the allocator's actions or the restoration of previously occupied memory segments leads to blocks or even bytes of memory that are too small or too isolated to be used by the memory pool, fragmentation has occurred. It means that the memory is divided into parts of fixed size, and when some processes try to occupy the memory space, they sometimes are not able to occupy the whole memory, leading to some holes in the memory. This is memory fragmentation. It is of 2 types:
1. External fragmentation
2. Internal fragmentation

Fig: Mapping between virtual address and physical address (pages 0 to 7 of the virtual address space map through the page table, one per process with one entry per page, maintained by the OS, to the frames of the physical memory)

In the above figure, the size of the logical address space is 8 kb, the physical address space is 4 kb and the page size is 1 kb. So, there are 8 pages in virtual memory and 4 frames in physical memory.
2nd part:
Advantage of using paging: paging allows the physical address space of a process to be non-contiguous, so a process can be allocated wherever free frames are available and external fragmentation is avoided.
4th part:
FIFO page replacement: a simple and obvious page replacement strategy is FIFO, i.e., first-in-first-out. The resident pages are kept in a queue, and the page at the head of the queue (the oldest resident page) is the one replaced on a page fault.
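FIFO is straightforward to sketch; the reference string below is only an illustration:

```python
from collections import deque

# FIFO page replacement: evict the page that has been resident longest.
def fifo_faults(refs, frames):
    queue, faults = deque(), 0
    for p in refs:
        if p in queue:
            continue                     # hit: FIFO does not reorder on hits
        faults += 1
        if len(queue) == frames:
            queue.popleft()              # oldest resident page leaves
        queue.append(p)
    return faults

print(fifo_faults([1, 1, 3, 2, 2, 7, 1, 4], 3))   # 6
```

Note that FIFO can evict a heavily used page simply because it is old, which is why LRU or optimal usually fault less on the same string.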
Answer:
a) Write-through and write-back caches:
In the write-through technique, data is updated both in the cache and in the main memory. If there is a write buffer for main memory and it is empty, the information is written into the cache and the write buffer, and the CPU continues working while the write buffer writes the word to memory. If the write buffer is full, the cache and the CPU must wait until the buffer is empty.

Fig: Write through policy (all the memories have the same copy)

In the write-back policy, data is written to the cache and updated in the main memory only when the cache line is replaced: information is written only to the block in the cache, and the modified cache block is written to main memory only when it is replaced. This requires additional information (either hardware or software), called dirty bits. A dirty bit is attached to each tag of the cache; whenever the information in the cache differs from that in main memory, the block is written back to main memory on replacement.

Fig: Write back policy

b) Cache coherence: refer to Question No. 10 of Short Answer Type Questions.

RISC & CISC ARCHITECTURES

Multiple Choice Type Questions

1. Overlapped register windows are used to speed-up procedure call and return in [WBUT 2007]
a) RISC architectures b) CISC architectures c) both (a) and (b) d) none of these
Answer: (a)

2. What is a main advantage of classical vector systems (VS) compared with RISC-based systems (RS)? [WBUT 2008, 2009]
a) VS have significantly higher memory bandwidth than RS
b) VS have higher clock rate than RS
c) VS are more parallel than RS
d) None of these
Answer: (a)

3. Difference between RISC and CISC is [WBUT 2010]
a) RISC is more complex b) CISC is more effective c) RISC is better optimizable d) none of these
Answer: (c)

4. The advantage of RISC over CISC is that [WBUT 2011]
a) RISC can achieve pipeline segments requiring just one clock cycle
b) CISC uses many segments in its pipeline, with the longest segment requiring two or more clock cycles
c) both (a) & (b)
d) none of these
Answer: (c)

5. Which of the following is not a RISC architecture characteristic? [WBUT 2012]
a) simplified and unified format of instructions
b) storage/storage instructions
c) no register file
d) small code size
Answer: (c)

6. Which of the following architectures corresponds to the von Neumann architecture? [WBUT 2012]
a) SISD
1. a) What is SPEC rating? Explain.
b) A 50 MHz processor was used to execute a program with the following instruction mix and clock cycle counts:
Instruction type | Instruction count | Clock cycle count
Integer arithmetic | 50000 |
Data transfer | 35000 | 2
Floating point arithmetic | 20000 |
Branch | 6000 |
Calculate the effective CPI, MIPS rate and execution time for this program. [WBUT 2015]
Compare between RISC and CISC. [WBUT 2010, 2012, 2014, 2015]
Answer:
a) The Standard Performance Evaluation Corporation (SPEC) is an American non-profit organization that aims to "produce, establish, maintain and endorse a standardized set" of performance benchmarks for computers. SPEC was founded in 1988. SPEC benchmarks are widely used to evaluate the performance of computer systems; the test results are published on the SPEC website. Results are sometimes informally referred to as "SPECmarks" or just "SPEC". SPEC evolved into an umbrella organization encompassing four diverse groups: the Graphics and Workstation Performance Group (GWPG), the High Performance Group (HPG), the Open Systems Group (OSG) and the newest, the Research Group (RG).

Comparison between RISC and CISC:
Characteristics | CISC | RISC
Instruction set size and instruction formats | Instruction set is very large and the instruction format is variable (16 to 64 bits per instruction) | Instruction set is small and the instruction format is fixed
Addressing modes | 12 to 24 | 3 to 5
General purpose registers and cache design | 8 to 24 general purpose registers present; a unified cache is used for instructions and data | Though most instructions are register based, large numbers of registers (32 to 192) are used, and the cache is split into a data cache and an instruction cache
CPI | CPI is between 2 and 15 | In most cases CPI is 1, but the average CPI is less than 1.5
CPU control | CPU is controlled by a control memory (ROM) using microprograms | CPU is controlled by hardware without control memory

b) Total instruction count = 50000 + 35000 + 20000 + 6000 = 111000.
Effective CPI = Σ(instruction count × clock cycle count) / total instruction count.
MIPS rate = clock rate / (effective CPI × 10^6).
Execution time = (total instruction count × effective CPI) / clock rate.

2. What are multiprocessor, multi-computer and multi-core systems? [WBUT 2012, 2014]
Answer:
In a multiprocessor system there is more than one processor working simultaneously. In this system there is one master processor and the others are the slaves; if one processor fails, the system can continue with the remaining processors.
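The arithmetic can be sketched as below; since the per-type clock cycle counts are not fully legible in the question, the values marked as assumed are placeholders only:

```python
# Effective CPI / MIPS sketch.  Cycle counts flagged "assumed" are placeholder
# values, not from the question (only the data-transfer count of 2 is legible).
mix = {                       # type: (instruction count, cycles per instruction)
    "integer arithmetic": (50000, 1),   # assumed CPI
    "data transfer":      (35000, 2),
    "floating point":     (20000, 2),   # assumed CPI
    "branch":             (6000,  2),   # assumed CPI
}
clock_hz = 50e6               # 50 MHz processor

instructions = sum(count for count, _ in mix.values())    # 111000
cycles = sum(count * cpi for count, cpi in mix.values())
effective_cpi = cycles / instructions
mips = clock_hz / (effective_cpi * 1e6)
exec_time = cycles / clock_hz                             # seconds

print(instructions, effective_cpi, mips, exec_time)
```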
A multi-core processor contains multiple processing units (called "cores"), which are the units that read and execute program instructions. The instructions are ordinary CPU instructions such as add, move data, and branch, but the multiple cores can run multiple instructions at the same time, increasing overall speed for programs amenable to parallel computing. Manufacturers typically integrate the cores onto a single integrated circuit die or onto multiple dies in a single chip package.

Write short notes on:
a) PowerPC microprocessor [WBUT 2012]
b) Non-von Neumann architecture [WBUT 2012]
c) Cluster Computer [WBUT 2012]
Answer:
a) PowerPC: The PowerPC is a highly integrated single-chip processor that combines a RISC architecture, a superscalar machine organization, and a versatile, powerful high-performance bus interface. The processor contains a 32KB unified cache and is capable of dispatching and completing up to 3 instructions per cycle. It offers a wide range of system bus interfaces, including pipelined, non-pipelined, and split bus transactions. The result is a cost-effective microprocessor solution with very competitive general-purpose performance.
Fig: PowerPC 601 pipeline (fetch, dispatch, decode, execute and writeback stages; separate pipelines for branch, integer/fixed-point, floating-point and load/store instructions; instruction queue and dispatch logic with branch prediction; memory management, cache array and cache tags)

The 601 pipeline structure provides high performance and concurrent instruction processing in each of the execution units, as shown above.
The fixed-point pipeline performs all integer arithmetic logic unit (ALU) operations and all processor load and store instructions, including floating-point loads and stores.

b) Non-von Neumann architecture: in non-von Neumann machines, instead of mutable variables there are immutable bindings between names and constant values. Note that the term non-von Neumann is usually reserved for machines that represent a radical departure from the von Neumann model, and is therefore not normally applied to multiprocessor or multicomputer architectures, which effectively offer a set of cooperating von Neumann machines.
The branch instruction pipeline has only two stages. The first stage can dispatch, decode, evaluate, and, if necessary, predict the direction of a branch instruction.

c) Cluster: a computer cluster consists of a set of loosely connected computers that work together. Clusters emerged as a result of computing trends including the availability of low-cost microprocessors, high-speed networks, and software for high-performance distributed computing. Clusters are usually deployed to improve performance and availability over that of a single computer, while typically being much more cost-effective than single computers of comparable speed or availability. Computer clusters have a wide range of applicability and deployment, ranging from small business clusters with a handful of nodes to some of the fastest supercomputers in the world.
Computer clusters may be configured for different purposes, ranging from general-purpose business needs such as web-service support to computation-intensive scientific calculations. In either case, the cluster may use a high-availability approach; the attributes described are not exclusive, and a "compute cluster" may also use a high-availability approach. "Load-balancing" clusters are configurations in which cluster nodes share the computational workload to provide better overall performance. For example, a web server cluster may assign different queries to different nodes, so the overall response time will be optimized. However, approaches to load balancing may significantly differ among applications: a high-performance cluster used for scientific computations would balance load with different algorithms from a web-server cluster, which may just use a simple round-robin method, assigning each new request to a different node.

INTERPROCESS COMMUNICATION

Multiple Choice Type Questions

1. In general, an n-input Omega network requires .......... stages of 2×2 switches. [WBUT 2011, 2016]
a) 2 b) 4 c) 8 d) 16
Answer: (a)

2. Overlapped register windows are used to speed up procedure call and return in [WBUT 2011]
a) RISC architecture b) CISC architecture c) both (a) & (b) d) none of these
Answer: (a)

3. The time to access shared memory is the same in which of the following shared memory multiprocessor models? [WBUT 2012, 2015]
a) NUMA b) UMA c) COMA d) CC-NUMA
Answer: (b)

4. Example of a recirculating network is [WBUT 2013]
a) 3-cube network b) ring network c) tree network d) mesh-connected Illiac network
Answer: (b)

5. In general, a 64-input Omega network requires .......... stages of 2×2 switches. [WBUT 2013]
a) 6 b) 64 c) 8 d) 7
Answer: (a)
multiprocessor system is best suited
WBUT 2015]
6. The
UMA interaction among different modules in program is large
the degreeof
wnehe degree of interaction among different modules in program is less
a)when
interaction among different modules in program
there isno
c) when diferentprograms are to be executed concurrently
when
E
Apswer: (d)
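The stage counts asked about in Questions 1 and 5 follow from the general rule that an n-input Omega network built from 2×2 switches has log₂n stages. A minimal sketch (the function name is ours, not from the book):

```python
import math

def omega_stages(n: int) -> int:
    """Number of 2x2-switch stages in an n-input Omega network: log2(n)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two")
    return int(math.log2(n))

print(omega_stages(64))  # a 64-input Omega network needs 6 stages (Question 5)
```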
Short Answer Type Questions

1. With architecture and timing diagram, explain S-access memory organization.
Answer:
In the S-access organization, the memory modules are accessed simultaneously, and the low-order address bits select a single word from the modules' outputs through a multiplexer. It takes two memory cycles (a fetch cycle followed by an access cycle) to access m consecutive words. If the access phase of the last access is overlapped with the fetch phase of the current access, effectively only one memory cycle is needed per m-word access. The throughput decreases if the stride is greater than 1.
Fig: S-access memory organization for m-way interleaved memory (read/write control, high-order address bits to the modules, low-order address bits to the single-word multiplexer)
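The overlap claim above can be put into a small timing model: one m-word block alone costs a fetch cycle plus an access cycle, but when the access phase of each block overlaps the fetch phase of the next, k consecutive blocks cost k + 1 cycles instead of 2k. A minimal sketch under that assumption (function name is ours):

```python
def s_access_cycles(k: int, overlapped: bool = True) -> int:
    """Memory cycles to read k consecutive m-word blocks in S-access organization.

    Each block needs a fetch cycle and an access cycle (2 cycles on its own).
    With the access phase of one block overlapped with the fetch phase of the
    next, the pipeline delivers one block per cycle after the first.
    """
    if k <= 0:
        return 0
    return k + 1 if overlapped else 2 * k

print(s_access_cycles(1), s_access_cycles(4))  # 2 cycles for one block, 5 for four
```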
Fig: Successive memory access operations using overlapped fetch and access cycles

2. Develop a 3²×4² delta network. [WBUT 2008]
Answer:
In a delta network, there are aⁿ sources and bⁿ destinations, with n stages consisting of a×b crossbar modules. There is a unique interconnection path of constant length between the stages of the network, and no input or output terminal of any crossbar module is left unconnected. To construct an aⁿ×bⁿ delta network, an a-shuffle is used as the link pattern between every two consecutive stages of the network. Numbering the stages of the network as 1, 2, …, n, starting at the source side, the network requires aⁿ⁻¹ crossbar modules in the first stage. So there are aⁿ⁻¹·b output terminals in the first stage, and these are the input terminals of the second stage. In general, the i-th stage has aⁿ⁻ⁱ·bⁱ⁻¹ crossbar modules.
In the above example, a = 3, b = 4 and n = 2. So there are 2 stages in the 3²×4² delta network, and the inter-stage link pattern is an a-shuffle, i.e. a 3-shuffle.
Fig: 3²×4² delta network

3. What is Multistage Switching Network? [WBUT 2008]
Answer:
A multistage switching network is a switching network capable of supporting massive parallel processing, including point-to-point and multicast communications between the processor modules connected to its inputs and outputs.
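The stage-size formula above (stage i of an aⁿ×bⁿ delta network has aⁿ⁻ⁱ·bⁱ⁻¹ modules) can be checked mechanically; a minimal sketch (function name is ours):

```python
def delta_stage_modules(a: int, b: int, n: int) -> list[int]:
    """Number of a x b crossbar modules in each stage i = 1..n of an
    a^n x b^n delta network: stage i has a**(n-i) * b**(i-1) modules."""
    return [a ** (n - i) * b ** (i - 1) for i in range(1, n + 1)]

# The 3^2 x 4^2 network above: 9 sources, 16 destinations,
# 3 crossbar modules in stage 1 and 4 in stage 2.
print(delta_stage_modules(3, 4, 2))
```

Each stage's module count times a gives its input-terminal count, so stage 1 indeed offers aⁿ inputs, matching the number of sources.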
4. Draw the block diagram of C-access memory organization. Why is it necessary and how does it improve the memory access time? [WBUT 2008]
Answer:
Fig: C-access memory organization (read/write control and address decoding across the interleaved modules)
In the C-access (concurrent access) organization, the memory is divided into interleaved modules. Of the address bits, m bits are used to select the module, and the remaining (n − m) bits address the desired element within the module. The major cycle is the total time required to complete the access of a single word from a module; the minor cycle is the actual time a module needs to produce one word. The accesses of successive memory modules are overlapped, separated by a minor cycle equal to 1/m of the major cycle. Here we give an example: as shown in the figure, the access of a block of eight contiguous memory words is merged with the accesses before and after the present access, so even though the total time to access one block is unchanged, the effective access time per word is greatly reduced. The effectiveness of this pipelined configuration is revealed by its ability to access the elements of a vector: the pipelined access configuration is ideal for accessing a vector of data elements or for pre-fetching sequential instructions.
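The address split described above can be illustrated with low-order interleaving, where consecutive addresses fall in consecutive modules so that sequential accesses overlap; this is a sketch under that assumption (names are ours):

```python
def interleaved_map(addr: int, m_bits: int) -> tuple[int, int]:
    """Split an address for low-order interleaving across 2**m_bits modules.

    The low-order m_bits select the module, so consecutive addresses land
    in consecutive modules; the remaining high-order bits address the word
    within the selected module.
    """
    module = addr & ((1 << m_bits) - 1)   # low-order bits -> module number
    offset = addr >> m_bits               # remaining bits -> word in module
    return module, offset

# Eight consecutive words spread across 4 modules: 0,1,2,3,0,1,2,3
print([interleaved_map(a, 2)[0] for a in range(8)])
```

Because each of the four modules is touched once before any module is revisited, their accesses can be pipelined exactly as in the figure.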
OR,
What are the differences between loosely coupled and tightly coupled multiprocessors? Contrast between UMA & NUMA with examples. What is distributed shared memory? [WBUT 2013, 2014, 2016]
Answer:
MIMD computers with shared memory are known as tightly coupled machines; an example is the Encore Multimax.
Tightly-coupled multiprocessor systems contain multiple CPUs that are connected at the bus level. These CPUs may have access to a central shared memory (SMP or UMA), or may participate in a memory hierarchy with both local and shared memory (NUMA).
MIMD computers with an interconnection network are known as loosely coupled machines; examples are the Intel iPSC and the nCUBE. Loosely-coupled multiprocessor systems, often referred to as clusters, are based on multiple standalone single- or dual-processor commodity computers interconnected via a high-speed communication system. A Linux Beowulf cluster is an example of a loosely-coupled system. Tightly-coupled systems perform better and are physically smaller than loosely-coupled systems, but have historically required greater initial investment and may depreciate rapidly. Nodes in a loosely-coupled system are usually inexpensive commodity computers and can be recycled as independent machines upon retirement from the cluster.
Shared memory does not mean that there is a single, centralized memory. The symmetric shared-memory multiprocessors are known as UMA machines (uniform memory access). Uniform Memory Access (UMA) is a computer memory architecture used in parallel computers having multiple processors and possibly multiple memory chips. All the processors in the UMA model share the physical memory uniformly. Peripherals are also shared. Cache memory may be private to each processor. In the UMA architecture, the access time to a memory location is independent of which processor makes the request or which memory chip contains the target data. It is used in symmetric multiprocessing (SMP).
Uniform Memory Access computer architectures are often contrasted with Non-Uniform Memory Access (NUMA) architectures. UMA machines are, generally, harder for computer architects to design, but easier for programmers to program, than NUMA architectures.
Non-Uniform Memory Access or Non-Uniform Memory Architecture (NUMA) is a computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory. NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures. Modern CPUs operate considerably faster than the main memory they are attached to, so limiting the number of memory accesses provides the key to extracting high performance from a modern computer. However, the dramatic increase in the size of operating systems and of the applications run on them has generally overwhelmed these cache-based improvements. NUMA attempts to address this problem by providing separate memory for each processor, avoiding the performance hit that occurs when several processors attempt to address the same memory. NUMA can improve the performance over a single shared memory by a factor of roughly the number of processors (or separate memory banks).

7. What is the significance of interconnection network in multiprocessor architecture? [WBUT 2012, 2014]
Answer:
The purpose of a network in a distributed system is to allow the exchange of data between processors. Regarding network data exchange, two important terms need to be introduced: network switching and network routing. Network switching refers to the method of transportation of data between processors in the network. There are roughly two classes of network switching: circuit switching and packet switching.
In circuit switching, a connection is established between the source and destination processors and is kept intact during the entire data transmission; during the communication, no other processors can use the allocated communication channel(s). This is like how a traditional telephone network works. Some of the early parallel machines used this switching method, but nowadays they mainly use packet switching. In packet switching, data are divided into relatively small packets, and a communication channel is allocated only for the transmission of a single packet. Thereafter, the channel may be freely used for another data transmission or for the next packet of the same transmission.
The processors in a parallel architecture must be connected in some manner. Interconnection networks carry data between processors and to memory, and the interconnects are made of switches and links (wires, fiber). Interconnects are classified as static or dynamic. Static networks consist of point-to-point communication links among processing elements (PEs) and are also referred to as direct networks. Dynamic networks are built using switches and communication links and are also referred to as indirect networks. A variety of network topologies have been proposed and implemented; these topologies trade off performance for cost. Commercial machines often implement hybrid topologies for reasons of packaging, cost, and available components.

8. What do you mean by Program Flow Mechanism? [WBUT 2013]
Answer:
The program-flow architecture is the von Neumann, or control flow, computing model. Memory holds a series of addressable instructions, each of which either specifies an operation along with the memory locations of its operands, or specifies a conditional transfer of control to some other instruction.

9. Use Bernstein's conditions for determining the maximum parallelism in the following sequence of instructions: [WBUT 2015]
I1: A = B × C
I2: B = D + E
I3: C = A + B
I4: E = F − D
Answer:
Bernstein elaborated the theory of data dependency and derived some conditions based on which we can decide the parallelism of instructions or processes. Bernstein's conditions are based on the following two sets of variables:
i) The Read set or input set Rᵢ, which consists of the memory locations read by the statement of instruction Iᵢ.
ii) The Write set or output set Wᵢ, which consists of the memory locations written into by instruction Iᵢ.
The sets Rᵢ and Wᵢ are not necessarily disjoint, as the same locations can be used for both reading and writing by Sᵢ.
The following are Bernstein's parallelism conditions, which are used to determine whether two statements are parallel or not:
1) The locations in R₁ from which S₁ reads and the locations W₂ onto which S₂ writes must be mutually exclusive. That means S₁ does not read from any memory location onto which S₂ writes. It can be denoted as R₁ ∩ W₂ = φ.
2) Similarly, the locations in R₂ from which S₂ reads and the locations W₁ onto which S₁ writes must be mutually exclusive. That means S₂ does not read from any memory location onto which S₁ writes. It can be denoted as R₂ ∩ W₁ = φ.
3) The memory locations W₁ and W₂ onto which S₁ and S₂ write should not be read by S₁ and S₂. That means R₁ and R₂ should be independent of W₁ and W₂. It can be denoted as W₁ ∩ W₂ = φ.
To show the operation of Bernstein's conditions, consider the instructions of the given sequential program:
I1: A = B × C
I2: B = D + E
I3: C = A + B
I4: E = F − D
Now, the read sets and write sets of I1, I2, I3 and I4 are as follows:
R1 = {B, C}; W1 = {A}
R2 = {D, E}; W2 = {B}
R3 = {A, B}; W3 = {C}
R4 = {F, D}; W4 = {E}
For I1 || I4:
R1 ∩ W4 = φ
R4 ∩ W1 = φ
W1 ∩ W4 = φ
Hence I1 and I4 are independent of each other.
For I2 || I3:
R2 ∩ W3 = φ
R3 ∩ W2 ≠ φ
W2 ∩ W3 = φ
Hence I2 and I3 are not independent of each other.
For I2 || I4:
R2 ∩ W4 ≠ φ
R4 ∩ W2 = φ
W2 ∩ W4 = φ
Hence I2 and I4 are not independent of each other.
For I3 || I4:
R3 ∩ W4 = φ
R4 ∩ W3 = φ
W3 ∩ W4 = φ
Hence I3 and I4 are independent of each other.

10. Design a 2²×3² Delta network. [WBUT 2015]
Answer:
Fig: 2²×3² delta network (2 stages of 2×3 crossbar modules: 2 modules in the first stage and 3 in the second, connected by a 2-shuffle)
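The read/write-set checks in Question 9 above can be verified mechanically. A minimal sketch (function and variable names are ours):

```python
def bernstein_parallel(r1: set, w1: set, r2: set, w2: set) -> bool:
    """Two statements can execute in parallel iff Bernstein's conditions hold:
    R1 ∩ W2 = φ, R2 ∩ W1 = φ, and W1 ∩ W2 = φ."""
    return not (r1 & w2) and not (r2 & w1) and not (w1 & w2)

# Read/write sets of I1..I4 from the worked example above.
R = {1: {"B", "C"}, 2: {"D", "E"}, 3: {"A", "B"}, 4: {"F", "D"}}
W = {1: {"A"}, 2: {"B"}, 3: {"C"}, 4: {"E"}}

for i, j in [(1, 4), (2, 3), (2, 4), (3, 4)]:
    ok = bernstein_parallel(R[i], W[i], R[j], W[j])
    print(f"I{i} || I{j}: {'independent' if ok else 'not independent'}")
```

Running this reproduces the conclusions above: only the pairs (I1, I4) and (I3, I4) are independent.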
11. … shared memory? [WBUT 2016]
Answer:
Refer to Question No. 2(b) of Long Answer Type Questions.

Long Answer Type Questions

1. What is the basic purpose of data flow architecture? Compare it with control flow architecture. [WBUT 2005]
OR,
What is the basic objective of data flow architecture? Compare it with control flow architecture. [WBUT 2005, 2015]
Answer:
The data flow computers are based on a data-driven mechanism. The fundamental difference from the operation of a conventional von Neumann machine is that instruction execution in a conventional computer is under program-flow control, whereas in a data flow computer an instruction is executed as soon as all of its input operands are available.

2. … centralized and distributed shared memory architecture. Which is the best architecture among them and why? [WBUT 2007]
Answer:
a) Shared memory systems form a major category of multiprocessors. In this category, all processors share a global memory. Communication between tasks running on different processors is performed through writing to and reading from the global memory. All inter-processor coordination and synchronization is also accomplished via the global memory.
Multi-stage networks connecting n processors with n memory blocks (so-called modules) establish the connection over several stages. Each stage consists of a number of switches, where each input must be connected to an output. Normally, a 2×2 switch box is used to build up a multi-stage network. The figure shows the four different switching possibilities of such a box:
(a) Straight through (b) Criss-cross (c) Upper broadcast (d) Lower broadcast
Packets are queued as they flow through the switches; in the event of transient traffic and insufficient buffer space on a switch, packets are dropped.
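The four states of a 2×2 switch box listed above can be modelled as a small lookup. This is a sketch under the common convention that "upper broadcast" copies the upper input to both outputs and "lower broadcast" copies the lower input (names are ours):

```python
def switch_2x2(in0, in1, state: str):
    """Model the four switching possibilities of a 2x2 switch box.

    Assumed semantics: 'straight' passes both inputs through, 'cross'
    swaps them, 'upper_broadcast' copies the upper input (in0) to both
    outputs, and 'lower_broadcast' copies the lower input (in1) to both.
    """
    states = {
        "straight": (in0, in1),
        "cross": (in1, in0),
        "upper_broadcast": (in0, in0),
        "lower_broadcast": (in1, in1),
    }
    return states[state]

print(switch_2x2("a", "b", "cross"))  # ('b', 'a')
```

The broadcast states are what make multicast communication possible in a multistage switching network built from such boxes.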
b) A shared memory computer system consists of a set of independent processors, a set of memory modules, and an interconnection network. Two main problems need to be addressed when designing a shared memory system: performance degradation due to contention, and coherence problems. Performance degradation might happen when multiple processors are trying to access the shared memory simultaneously. A typical design might use caches to solve the contention problem. However, having multiple copies of data spread throughout the caches might lead to a coherence problem. The copies in the caches are coherent if they all equal the same value. But if one of the processors writes over the value of one of the copies, then that copy becomes inconsistent, because it no longer equals the value of the other copies. The aspects studied for shared memory systems and their solutions of the cache coherence problem include Uniform Memory Access (UMA), Non-Uniform Memory Access (NUMA), Cache-Only Memory Architecture (COMA), bus-based symmetric multiprocessors, basic cache coherency methods, snooping protocols, directory-based protocols and shared memory programming.
In the distributed shared memory architecture, scalable interconnects (such as multidimensional meshes) are used, so it can grow to larger processor counts. So, we can say that the distributed shared memory architecture is better than the centralized memory architecture.

3. … e.g. data routing from node 1011 to node 0101 and from 0111 to 1001.
Answer:
There is no blocking in the above network for these permutations: they can be implemented in a single pass.
i) An n-input Omega network can implement n^(n/2) permutations in a single pass, while there are n! permutations in total. Here, n = 16, so 16^(16/2) = 16⁸ permutations can occur in the first pass, out of a total of 16! permutations.
So, the percentage of one-pass permutations among all permutations = 16⁸/16! = 0.000205 = 0.0205%.
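The percentage above can be reproduced directly from the two counts (variable names are ours):

```python
import math

n = 16
one_pass = n ** (n // 2)      # n^(n/2) permutations realizable in one pass
total = math.factorial(n)     # n! permutations in total
pct = 100 * one_pass / total
print(f"{one_pass} of {total} permutations = {pct:.4f}%")
```

This prints 4294967296 of 20922789888000 permutations, i.e. about 0.0205%, matching the figure in the answer.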
ii) In general, the maximum number of passes needed to implement any permutation through the network is log₂n, where n is the number of inputs.

4. a) What do you mean by multiprocessor system? What are the similarities and dissimilarities between the multiprocessor system and multiple computer system?
b) What are the different architectural models for multiprocessors? Explain each of them with example. [WBUT 2010]
OR,
What is the fundamental difference in interprocessor coordination mechanism between multiprocessor & multicomputer systems? Explain with reference to their architectural differences. [WBUT 2013]
c) Distinguish between loosely coupled and tightly coupled multiprocessors.
Answer:
In a multiprocessor, processes working together on a problem, as shown in the figure above, can share a single virtual address space mapped onto the common memory. Any process can read or write a word of memory by just executing a LOAD or STORE instruction; nothing else is needed. Two processes can communicate by simply having one of them write data to memory and having the other one read them back. In the example, each of the 16 CPUs runs a single process, which has been assigned one of the 16 sections to analyze. Some examples of multiprocessors are the Sun Enterprise 10000, Sequent NUMA-Q, SGI Origin 2000, and HP/Convex Exemplar.
The second computer design for a parallel architecture is one in which each CPU has its own private memory, accessible only to itself and not to any other CPU. Such a design is called a multicomputer, or sometimes a distributed memory system, and is illustrated in the figure below. Multicomputers are frequently loosely coupled. The key aspect of a multicomputer that distinguishes it from a multiprocessor is that each CPU in a multicomputer has its own private, local memory that it can access by just executing LOAD and STORE instructions, but which no other CPU can access using LOAD and STORE instructions. Thus multiprocessors have a single physical address space shared by all the CPUs, whereas multicomputers have one physical address space per CPU.
Fig: A multicomputer, in which each CPU has its own private memory

The centralized shared-memory architecture is sometimes called UMA (uniform memory access) or a symmetric (shared-memory) multiprocessor. This type of architecture is currently by far the most popular organization. The figure shows what these multiprocessors look like.
Fig: A centralized shared-memory multiprocessor: several processors, each with its own cache, share memory and I/O over an interconnection network
So, communication with shared memory can be characterized as follows:
- It allows a familiar programming style. Sometimes it is straightforward to make an existing program run on a parallel machine (with a small number of processors).
- It requires synchronization with critical regions or semaphores for shared data.
To support larger processor counts, memory must be distributed among the processors rather than centralized; otherwise the memory system would not be able to support the bandwidth demands of a larger number of processors without incurring excessively long access latency. Distributing the memory makes communicating data between processors somewhat more complex and of higher latency, at least when there is no contention, because the processors no longer share a single centralized memory. As we will see shortly, the use of distributed memory leads to two different paradigms for interprocess communication, a trend reinforced by the rapid increase in processor performance and the associated increase in each processor's memory bandwidth requirements.
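The claim that two processes on a multiprocessor communicate by one storing a word to shared memory and the other loading it back can be illustrated, by analogy, with two threads sharing an address space. This is a sketch only (the shared dictionary and the Event used for synchronization are our stand-ins for a shared memory word and a semaphore):

```python
import threading

shared = {"word": None}        # plays the role of a shared memory word
ready = threading.Event()      # plays the role of a semaphore

def producer():
    shared["word"] = 42        # the "STORE": write the word to shared memory
    ready.set()

def consumer(out):
    ready.wait()               # synchronization before reading shared data
    out.append(shared["word"]) # the "LOAD": read the word back

result = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(result,))
t2.start(); t1.start()
t1.join(); t2.join()
print(result)  # [42]
```

Note the synchronization step: as stated above, shared-memory communication requires critical regions or semaphores to coordinate access to shared data.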
Multistage Networks
Crossbars have excellent performance scalability but poor cost scalability. Buses have excellent cost scalability, but poor performance scalability. Multistage interconnects strike a compromise between these extremes. A number of p×q switches are present in every stage of such a network, and there is a fixed inter-stage connection pattern between the switches in adjacent stages, as shown in the figure below.
Fig: The schematic of a typical multistage interconnection network (inputs, stages 1 to n, outputs)
One of the most commonly used multistage interconnects is the Omega network. Routing through a complete Omega network can be illustrated as follows. Let s be the binary representation of the source processor and d be that of the destination processor.
- The data traverses the link to the first switching node. If the most significant bits of s and d are the same, then the data is routed in pass-through mode by the switch; else, it switches to crossover.
- This process is repeated for each of the log p switching stages.
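The two routing steps above can be simulated: each stage performs a perfect shuffle (a left rotation of the address) and then the 2×2 switch either passes through or crosses over so that the low bit matches the next bit of the destination. A minimal sketch (function name is ours):

```python
def omega_route(src: int, dst: int, nbits: int) -> list[int]:
    """Trace a packet through an Omega network with 2**nbits inputs.

    At each of the nbits stages: left-rotate the current address (the
    perfect shuffle), then set its low-order bit to the next
    most-significant bit of the destination (pass-through or crossover).
    """
    mask = (1 << nbits) - 1
    pos, path = src, [src]
    for i in range(nbits - 1, -1, -1):
        pos = ((pos << 1) | (pos >> (nbits - 1))) & mask  # perfect shuffle
        pos = (pos & ~1) | ((dst >> i) & 1)               # switch setting
        path.append(pos)
    return path

print(omega_route(0b101, 0b010, 3))  # [5, 2, 5, 2]: ends at destination 2
```

After log₂p stages, the address has been fully rewritten into the destination, for any source/destination pair.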
A connection is made by closing the switch point at the junction of the appropriate row and column. The complexity of a crossbar has two cost components: one which grows in proportion to the number of inputs and outputs, and another that grows as their product. The product term is often called the cross-point count, because it is directly related to the number of simple 2×2 cross-points required to implement the crossbar. A crossbar requires N² cross-points for N pairs of terminals.

Describe different access methods of the memory system. What will be the maximum capacity of a memory which uses an address bus of size 8 bits? [WBUT 2013]
Answer:
There are two types of memory access organization, i.e. C-access and S-access. See Short Answer Type Questions, Question No. 1 and 5, for the C-access and S-access memory organizations.
The maximum capacity of a memory which uses an address bus of size 8 bits is 2⁸ = 256 bytes.

Fig: A complete omega network connecting eight inputs and eight outputs
Beyond cubes, we now consider the ring as a static interconnection network. A ring is obtained by connecting the two terminal nodes of a linear array with one extra link. In a linear array, each internal node has two neighbours, one to its left and one to its right. The ring is like the linear array, but the diameter of the network is cut in half if the links are bidirectional.
Fig 1: Ring interconnection network
Dynamic networks are implemented with switched channels, which are dynamically configured to match the communication demand in user programs. Examples of dynamic interconnection networks are the bus, multistage switches, the crossbar switch, etc.
1st part: Let us consider a crossbar-based multiport network where three processing elements want to connect with three memory modules:
Fig: A multiport network in which processing elements PE1, PE2 and PE3 connect through controlled crosspoints to three memory modules
2nd part: A network where 2³ inputs want to connect with 2³ outputs. The perfect shuffle maps each input to the output obtained by a left rotation of its binary address:
0: 000 = left_rotate(000)
1: 001 = left_rotate(100)
2: 010 = left_rotate(001)
3: 011 = left_rotate(101)
4: 100 = left_rotate(010)
5: 101 = left_rotate(110)
6: 110 = left_rotate(011)
7: 111 = left_rotate(111)
Fig 1: A perfect shuffle interconnection for eight inputs and outputs
A complete Omega network with the perfect shuffle interconnects and switches can now
be illustrated:
Let s be the binary representation of the source and d be that of the destination
processor.
- The data traverses the link to the first switching node. If the most significant bits of s and d are the same, then the data is routed in pass-through mode by the switch; else, it switches to crossover.
- This process is repeated for each of the log p switching stages.
3rd part: Difference between omega network and delta network:
An N×N Omega network consists of log₂N identical stages, and between two consecutive stages there is a perfect shuffle interconnection.
In an aⁿ×bⁿ delta network, there are aⁿ sources and bⁿ destinations. There is a unique interconnection path of constant length between the stages of the network. Numbering the stages of the network as 1, 2, …, n, starting at the source side, the network requires that there be aⁿ⁻¹ crossbar modules in the first stage.
An omega network for N = 8, where N represents the number of processors: this network consists of log₂p stages, where p is the number of inputs/outputs. At each stage, input i is connected to output j if:
j = 2i, for 0 ≤ i ≤ p/2 − 1
j = 2i + 1 − p, for p/2 ≤ i ≤ p − 1
Fig 2: A complete omega network connecting eight inputs and eight outputs
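The stage connection rule just given is exactly the perfect shuffle from the 2nd part: doubling the index (with wraparound) is a left rotation of its binary address. A minimal sketch checking this (function names are ours):

```python
def omega_connect(i: int, p: int) -> int:
    """Inter-stage connection of an Omega network with p inputs:
    input i goes to output 2i in the upper half, 2i + 1 - p in the lower."""
    return 2 * i if i <= p // 2 - 1 else 2 * i + 1 - p

def left_rotate(i: int, nbits: int) -> int:
    """Left rotation of an nbits-wide binary address."""
    mask = (1 << nbits) - 1
    return ((i << 1) | (i >> (nbits - 1))) & mask

# For p = 8 the connection rule reproduces the left-rotation table above.
print([omega_connect(i, 8) for i in range(8)])
```

This prints [0, 2, 4, 6, 1, 3, 5, 7], the same mapping as the perfect shuffle table in the 2nd part.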
10. Write short notes on the following:
a) Omega network [WBUT 2005, 2006, 2008]
b) Crossbar Switches [WBUT 2008]
c) Multiport Network [WBUT 2010]
d) Memory interleaving [WBUT 2010, 2014]
e) Memory inclusion

Answer:
a) Omega network:
Refer to Question No. 9 of Long Answer Type Questions.

b) Crossbar Switches:
Crossbar switches allow any processor in the system to connect to any other processor or memory unit, so that many processors can communicate simultaneously without contention. A new connection can be established at any time, as long as the requested input and output ports are free. Crossbar switches are used in the design of high-performance small-scale multiprocessors and in the design of routers for direct networks. A crossbar can be defined as a switching network with N inputs and M outputs, which allows up to min{N, M} one-to-one interconnections without contention. Figure 1.1 shows an N×M crossbar network. Usually M = N, except for crossbars connecting processors and memory modules.
The states of a switch point are shown in Figure 1.2. In Figure 1.2 (a), the input from the row containing the switch point has been granted access to the corresponding output, while inputs from upper rows requesting the same output are blocked. In Figure 1.2 (b), an input from an upper row has been granted access to the output; the input from the row containing the switch point does not request that output and can be propagated to other switches.
Fig 1.2: States of a switch point in a crossbar network

c) Multiport Network:
The bank-based multiport memory is an approach to realizing higher access bandwidth than a conventional N-port memory cell approach. However, this method is unsuitable for large numbers of ports and banks, because the hardware resources of the crossbar network which connects the ports and banks increase in proportion to the product of the numbers of ports and banks.
A parallel processor array with a two-dimensional crossbar switch architecture is one in which the individual processing elements are configured as clusters of processors; the processing elements within each cluster are interconnected by a two-dimensional cluster network of crossbar switch elements, and the clusters are interconnected via a two-dimensional array network of crossbar switch elements. Input data is supplied directly into the array of crossbar switch elements, which allows an optimal initial partitioning of the data set among the processing elements.
Such a parallel processor array includes an interconnection network for interconnecting the processor clusters: a two-dimensional mesh of multi-port crossbar switch elements arranged in rows and columns, wherein each processor cluster is connected to a port of a row crossbar switch element and to a port of a column crossbar switch element. An input data set to be processed is supplied via the crossbar switch element input ports for initial processing, and subsequently the mesh is configurable to perform a data-dimension transposition of the data set for processing in a second data dimension by the processing elements during a second processing function.
d) Memory interleaving:
In memory interleaving, a two-way or four-way interleaving technique is typically used. Each block of memory is accessed using different sets of control lines, which are merged together on the memory bus. When a read or write is begun to one block, a read or write to the other blocks can be overlapped with the first one. In an interleaved system, a main memory of size 2ˡ is divided into m modules, where m is a positive integer (usually m = 2ⁿ for some integer n such that 0 < n < l, l being the number of bits in a main memory address). Each main memory address is mapped to a module and to an address within that module.

e) Memory inclusion:
According to the inclusion property, information present at one level of the memory hierarchy must also be present at the lower levels. So we can state the inclusion property as M₁ ⊂ M₂ ⊂ M₃ ⊂ … ⊂ Mₙ. At the time of processing, the required portions of memory Mₙ are copied into Mₙ₋₁; similarly, subsets of Mₙ₋₁ are copied into Mₙ₋₂, and so on. So, if a word is found in memory Mᵢ, then copies of the same word can also be found in all levels Mᵢ₊₁, Mᵢ₊₂, …, Mₙ. But a word stored in Mᵢ₊₁ may not be found in Mᵢ.

Group-A
(Multiple Choice Type Questions)
1. Choose the correct alternatives for the following:
i) A pipeline stage
a) is sequential circuit b) is combinational circuit c) consists of both sequential and combinational circuits d) none of these
… The time to access shared memory is same in which of the following shared memory multiprocessor models?
a) NUMA b) UMA c) COMA d) ccNUMA
vi) Which of the following architectures corresponds to the von-Neumann architecture?
a) MISD b) MIMD c) SISD d) SIMD
… In the absence of a TLB, in a paged-memory system, how many memory accesses are required to access a physical memory location?
… a) …-way set associative … c) 2-way set associative d) n-way set associative