COMPUTER ARCHITECTURE

Pipeline Architecture 2
Vector Processor 43
Flynn's Taxonomy of Computer Architecture 57
Memory 71
RISC & CISC Architectures 93
Interprocess Communication 99

NOTE:
The WBUT course structure and syllabus of the 4th semester has been changed from 2012: Advanced Computer Architecture (4th Sem, CS) has been redesigned as Computer Architecture (4th Sem, CS). Taking special care of this matter, we are providing the relevant WBUT questions and solutions of Advanced Computer Architecture from 2005 to 2011, along with the complete solutions of the new university papers, so that students can get an idea about the university question patterns.
PIPELINE ARCHITECTURE

Multiple Choice Type Questions

1. The number of cycles required to complete n tasks in a k-stage pipeline is [WBUT 2007, 2016]
a) k + n - 1     b) nk + 1     c) k     d) none of these
Answer: (a)

2. A 4-ary 3-cube hypercube architecture has [WBUT 2007]
a) 3 dimensions with 4 nodes along each dimension
b) 4 dimensions with 3 nodes along each dimension
c) both (a) and (b)
d) none of these
Answer: (a)

3. Which of these are examples of 2-dimensional topologies in static networks? [WBUT 2007, 2010]
a) Mesh     b) CCC networks     c) Linear array     d) None of these
Answer: (a)

4. The seek time of a disk is 30 ms. It rotates at the rate of 30 rotations/second. The capacity of each track is 300 words. The access time is approximately [WBUT 2007, 2008, 2010]
a) 62 ms     b) 60 ms     c) 47 ms     d) none of these
Answer: (c)

5. For two instructions I and J, a WAR hazard occurs if [WBUT 2007, 2010, 2014]
a) R(I) ∩ D(J) ≠ Ø     b) R(I) ∩ R(J) ≠ Ø     c) D(I) ∩ R(J) ≠ Ø     d) none of these
Answer: (c)

6. The performance of a pipelined processor suffers if [WBUT 2008, 2009, 2011, 2013]
a) the pipeline stages have different delays
b) consecutive instructions are dependent on each other
c) the pipeline stages share hardware resources
d) all of these
Answer: (d)

7. A single bus structure is primarily found in [WBUT 2008]
a) Main frames
b) High performance machines
c) Mini- and micro-computers
d) Supercomputers
Answer: (c)

8. What will be the speed-up for a four-stage linear pipeline when the number of instructions n = 64? [WBUT 2008, 2009, 2011, 2014, 2016]
a) 4.5     b) 7.1     c) 6.5     d) None of these
Answer: (d)

9. Dynamic pipeline allows [WBUT 2007, 2016]
a) to evaluate multiple functions
b) only streamline connection
c) to perform fixed function
d) none of these
Answer: (a)

10. The division of stages of a pipeline into sub-stages is the basis for [WBUT 2009, 2014]
a) pipelining     b) super-pipelining     c) superscalar     d) VLIW processor
Answer: (b)

11. A pipeline stage [WBUT 2012, 2014]
a) is a sequential circuit
b) is a combinational circuit
c) consists of both sequential and combinational circuits
d) none of these
Answer: (c)

12. Utilization pattern of successive stages of a synchronous pipeline can be specified by [WBUT 2012, 2015]
a) Truth table     b) Excitation table     c) Reservation table     d) Periodic table
Answer: (c)

13. SPARC stands for [WBUT 2012]
a) Scalable Processor Architecture
b) Superscalar Processor A RISC Computer
c) Scalable Processor A RISC Computer
d) Scalable Pipeline Architecture
Answer: (a)

14. Portability is definitely an issue for which of the following architectures? [WBUT 2012]
a) VLIW processor     b) Super Scalar processor     c) Super pipelined     d) none of these
Answer: (a)

15. Which of the following is not the cause of a possible data hazard? [WBUT 2012]
a) RAR     b) RAW     c) WAR     d) WAW
Answer: (a)
16. What will be the speed-up for a 4-segment linear pipeline when the number of instructions n = 64? [WBUT 2013, 2014]
a) 4.5     b) 3.82     c) 8.16     d) 2.95
Answer: (b)

17. Which type of data hazard is not possible? [WBUT 2013]
a) WAR     b) RAW     c) RAR     d) WAW
Answer: (c)

18. MIPS means [WBUT 2013]
a) Multiple Instructions Per Second
b) Millions of Instructions Per Second
c) Unit Instruction Performed System
d) none of these
Answer: (b)

19. The prefetching technique is a solution for [WBUT 2014]
a) Data hazard     b) Structural hazard     c) Control hazard     d) none of these
Answer: (c)

20. Suppose the time delays of the four stages of a pipeline are t1 = 60 ns, t2 = 50 ns, t3 = 90 ns and t4 = 80 ns respectively, and each interface latch has a delay of 10 ns. The maximum clock frequency for the pipeline corresponds to a clock period of [WBUT 2016]
a) 100 ns     b) 90 ns     c) 190 ns     d) 30 ns
Answer: (a)

21. The prefetching technique is a solution for [WBUT 2016]
a) data hazard
b) structural hazard
c) control hazard
d) enhancing the speed of pipeline
Answer: (c)
Short Answer Type Questions

1. Define speed-up. Deduce that the maximum speed-up in a k-stage pipeline processor is k. Is that maximum speed-up always achievable? Explain. [WBUT 2006]
OR,
If there are no stalls (waits), prove that the speedup is equal to the pipe depth, i.e., the number of pipeline stages. [WBUT 2016]
OR,
Show that the maximum speedup of a pipeline is equal to its number of stages. [WBUT 2010]
Answer:
Suppose we consider a pipeline of k stages with a uniform clock period τ, executing n instructions (tasks). Then

    Speed-up S(k) = (time required for non-pipeline processing) / (time required for pipeline processing)
                  = nkτ / (k + (n - 1))τ
                  = nk / (k + n - 1)

When n >> k, the maximum speed-up approaches S(k) = k.
This maximum speed-up is never fully achievable because of data dependencies between instructions, interrupts, program branches, etc. Many pipeline cycles may be wasted in waiting states caused by out-of-sequence instruction executions.
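The derivation above is easy to check numerically. The short Python sketch below (an added illustration, not part of the printed solution) evaluates S = nk/(k + n - 1) for a 4-stage pipeline and shows the speed-up approaching k = 4 only as n grows:

def pipeline_speedup(n, k):
    """Speed-up of a k-stage linear pipeline over a non-pipelined
    processor for n instructions: S = nk / (k + n - 1)."""
    return (n * k) / (k + n - 1)

for n in (4, 64, 1000, 10**6):
    print(n, round(pipeline_speedup(n, k=4), 3))
# Prints 2.286, 3.821, 3.988, 4.0 -- the speed-up approaches k = 4
# only asymptotically, and hazards reduce it further in practice.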
2. Compare super-pipelined and superscalar architecture. [WBUT 2007]
Answer:
Superscalar and super-pipelined processors utilize instruction-level parallelism to achieve peak performance that can be several times higher than that of conventional scalar processors. Superscalar machines can issue several instructions per cycle. A metric, the average degree of super-pipelining, was developed and measured over a series of benchmarks; such simulations suggest that this metric is already high for many machines, which already exploit much of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.
In a superscalar processor, multiple parallel functional units and execution pipelines are kept busy at the same time. Super-pipelined machines, in contrast, can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. Superscalar machines still typically employ a single fetch-decode-dispatch pipe that drives all of the units, although it is becoming more common to have multiple fetch-decode-dispatch pipes feeding the functional units. Superscalar operation is ultimately limited by the number of independent operations that can be extracted from an instruction stream.
Super-pipelining: given a pipeline stage time T, it may be possible to execute operations at a higher rate by starting operations at intervals of T/n. This may be accomplished in two ways: by further dividing each of the pipeline stages into n substages, or by providing n pipelines that are overlapped. The first approach requires faster logic and the ability to subdivide the stages into segments with uniform latency; the second approach can be viewed, in a sense, as staggered superscalar operation, with the same requirements except that instructions and data are fetched with a slight offset in time. Super-pipelining is limited by the speed of logic and by the frequency of unpredictable branches: stage time cannot productively grow shorter than the inter-stage latch time, and this places a limit on the number of stages. The MIPS R4000 is sometimes called a super-pipelined machine; the benefit of such extensive pipelining is really only gained for very regular applications such as graphics.
Superscalar/super-pipelined: we may also combine superscalar operation with super-pipelining; the speedup is then potentially the product of the two speedup factors. However, it is even more difficult to interlock between parallel pipes that are divided into many stages. Also, the memory subsystem must be able to sustain a level of instruction throughput corresponding to the total throughput of the multiple pipelines, stretching the processor/memory performance gap even more. With so many pipes and so many stages, branch penalties become huge and branch prediction becomes a serious bottleneck. The real problem, however, may be in finding the parallelism required to keep all of the pipes and stages busy between branches: consider that a machine with 12 pipelines of 20 stages each must always have access to a window of 240 instructions that are scheduled so as to avoid all hazards, and that the 40 branches that would on average be present in a block of that size must all be correctly predicted sufficiently far in advance to avoid stalling the prefetch unit.

3. What are the different factors that can affect the performance of a pipelined system? Differentiate between WAR and RAW with a suitable example. [WBUT 2007]
Answer:
Pipelining achieves a reduction of the average execution time per instruction, in the sense that the pipeline can complete more instructions per clock cycle. This can be viewed in two ways:
- decreasing the CPI, which is the typical way in which people view the performance increase; or
- decreasing the cycle time (i.e., increasing the clock rate).
Pipelining increases the CPU instruction throughput, but it does not decrease the execution time of an individual instruction; it actually increases it slightly, because of the overhead (clock skew and pipeline register delay) in the control of the pipeline.
Data hazards occur when data is modified out of order; ignoring potential data hazards can result in race conditions.
Read After Write (RAW): an operand is modified and read soon after. Because the first instruction may not have finished writing to the operand, the second instruction may use incorrect data. A RAW hazard refers to a situation where an instruction uses a result that has not yet been calculated, for example:
    i1: R2 <- R1 + R3
    i2: R4 <- R2 + R3
The first instruction is calculating a value to be saved in register R2, and the second is going to use this value to compute a result for register R4. However, in a pipeline, when we fetch the operands for the second operation, the result from the first will not yet have been saved; we have a data dependency: i2 is dependent on the completion of i1.
Write After Read (WAR): an operand is read and written soon after. Because the write may have finished before the read, the read instruction may incorrectly get the new written value. A WAR hazard represents a problem with concurrent execution, for example:
    i1: R2 <- R1 + R3
    i2: R3 <- R1 + R2
If we are in a situation where there is a chance that i2 may be completed before i1 (i.e., with concurrent execution), we must ensure that we do not store the result in register R3 before i1 has had a chance to fetch its operands.

4. What are the different parameters used in measuring CPU performance? Briefly discuss. [WBUT 2008, 2015]
Answer:
To estimate CPU performance, the measure that is generally most important is the execution time T, because we can write
    Performance = 1 / Execution time
so if the execution time increases, the CPU performance decreases. Three parameters are used to measure the performance of a pipelined CPU: speedup, efficiency and throughput.
When considering the impact of some performance improvement, the effect of the improvement is usually expressed in terms of the speedup S, taken as the ratio of the execution time without the improvement (Two) to the execution time with the improvement (Tw):
    S = Two / Tw
Speed-up as a direct percentage can be represented as
    S = ((Two - Tw) / Tw) x 100
Efficiency E is the ratio of the speed-up to the number of processors (or pipeline stages) used:
    E = S / p
where S is the speed-up and p is the number of processors. Throughput is the measure of the number of computations completed per unit time.

5. Compare superscalar, super-pipeline and VLIW techniques. [WBUT 2008, 2011, 2014, 2016]
Answer:
Superscalar: A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor, thereby implementing a form of parallelism called instruction-level parallelism within a single processor. Superscalar operation is limited by the number of independent operations that can be extracted from an instruction stream; it has been shown in early studies on simpler processor models that this is limited, mostly by branches, to a small number. The superscalar technique is traditionally associated with several identifying characteristics:
- Instructions are issued from a sequential instruction stream.
- The CPU hardware dynamically checks for data dependencies between instructions at run time (versus software checking at compile time).
- The CPU accepts multiple instructions per clock cycle.
Super-pipeline: Given a pipeline stage time T, it may be possible to execute at a higher rate by starting operations at intervals of T/n. This can be accomplished in two ways:
- Further divide each of the pipeline stages into n substages.
- Provide n pipelines that are overlapped.
The first approach requires faster logic and the ability to subdivide the stages into segments with uniform latency. The second approach could be viewed, in a sense, as staggered superscalar operation, and it has all of the same requirements associated with it, except that instructions and data can be fetched with a slight offset in time. Super-pipelining is limited by the speed of logic and by the frequency of unpredictable branches: stage time cannot productively grow shorter than the inter-stage latch time, so this is a limit on the number of stages. The MIPS R4000 is sometimes called a super-pipelined machine; the benefit of such extensive pipelining is really only gained for very regular applications such as graphics. On more irregular applications, there is little performance advantage.
VLIW: Superscalar and VLIW architectures both exhibit instruction-level parallelism, thereby allowing faster CPU throughput than would otherwise be possible at the same clock rate, but they differ in their approach. Each functional unit is not a separate CPU core but an execution resource within a single CPU, such as an arithmetic logic unit, a bit shifter, or a multiplier. Superscalar CPU design emphasizes improving the instruction-dispatcher accuracy and allowing it to keep the multiple functional units in use at all times. Very Long Instruction Word (VLIW) refers to a CPU architecture designed to take advantage of instruction-level parallelism without that dispatch hardware. A processor that executes every instruction one after the other (i.e., a non-pipelined scalar architecture) may use processor resources inefficiently, potentially leading to poor performance; the performance can be improved by executing different sub-steps of sequential instructions simultaneously, or even executing multiple instructions entirely simultaneously as in superscalar architectures. The VLIW approach executes operations in parallel based on a fixed schedule determined when programs are compiled; since determining the order of execution of operations is handled by the compiler, the processor is freed from run-time scheduling.

6. What is meant by pipeline stall? [WBUT 2009]
Answer:
A pipeline operation is said to have been stalled if one unit (stage) requires more time to perform its function, thus forcing other stages to become idle. Consider, for example, the case of an instruction fetch that incurs a cache miss, and assume that a cache miss requires three extra time units. Deliberate stalling can also prevent branch and structural hazards from occurring: as instructions are fetched, control logic determines whether or not a hazard could/will occur. If so, the control logic inserts NOPs (No Operations) into the pipeline, so that before the next instruction (which would cause the hazard) is executed, the previous one is sufficiently complete to prevent the hazard. If the number of NOPs equals the number of stages in the pipeline, the processor has been cleared of all instructions and can proceed free from hazards.

7. Consider the pipelined execution of these instructions: [WBUT 2009]
    DADD R1, R2, R3
    DSUB R4, R1, R5
    AND  R6, R1, R7
    OR   R8, R1, R9
    XOR  R10, R1, R11
Explain how the above execution may generate a data hazard, and describe a way to minimize the data hazard stalls using forwarding. Modify the above example to show a case where forwarding may not work.
Answer:
We consider a five-stage pipeline; the stages are fetch, decode, read, execution and write back.
[Fig: Pipeline execution diagram of the five instructions, showing which operation each instruction performs in each clock pulse across the fetch, decode, read, execute and write-back stages.]
Now, the data hazard will occur in the 4th clock cycle, where the read and execution operations are performed on register R1 at the same time.
Result forwarding is a technique to minimize the stalls in pipeline processing. The idea is that after the execution of one instruction, the result is transferred to the next instruction directly, bypassing the write-back stage: without first writing the result into the register file, the pipeline forwards that result to the next instruction.
Result forwarding may not work in the case of a branch instruction in a pipelined execution. So, if there is a branch instruction in the above instruction set, then result forwarding will not work.
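A toy model makes the saving concrete. The sketch below (an added illustration; the stage numbering and bypass rule are simplifying assumptions, not taken from the printed text) counts the stall cycles a RAW dependence costs in the five-stage pipeline described above, with and without forwarding:

STAGES = {"F": 0, "D": 1, "R": 2, "X": 3, "W": 4}

def raw_stalls(distance, forwarding):
    """Stall cycles for an instruction that reads a register written
    by an instruction issued `distance` cycles earlier."""
    if forwarding:
        # result is bypassed from the end of X to the start of X
        produced, consumed = STAGES["X"] + 1, STAGES["X"]
    else:
        # result only visible after write back; operands read in R
        produced, consumed = STAGES["W"] + 1, STAGES["R"]
    return max(0, produced - (consumed + distance))

# DADD R1,R2,R3 followed immediately by DSUB R4,R1,R5 (distance 1):
print(raw_stalls(1, forwarding=False))  # 2 stalls without forwarding
print(raw_stalls(1, forwarding=True))   # 0 stalls with forwarding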
8. Explain the DMA working principle. [WBUT 2009]
Answer:
The main idea of direct memory access (DMA) is to transfer data between peripheral devices and main memory while bypassing the CPU. It allows peripheral devices to transfer data directly from and to memory without the intervention of the CPU. Having peripheral devices access memory directly allows the CPU to do other work, which leads to improved performance, especially in the case of large transfers.
The DMA controller controls one or more peripheral devices. It allows devices to transfer data to or from the system's memory without the help of the processor. Both the DMA controller and the CPU use the memory bus, and only one or the other can use the memory at the same time. The DMA controller sends a request to the CPU asking its permission to use the bus; the CPU returns an acknowledgment granting it bus access. The DMA controller can then take control of the bus to independently conduct the memory transfer; when the transfer is complete, it relinquishes its control of the bus to the CPU. Processors that support DMA provide one or more input signals that the bus requester can assert to gain control of the bus, and one or more output signals that the CPU asserts to indicate it has relinquished the bus.
[Fig: DMA controller sharing the CPU's memory bus - DMA request/acknowledge lines, clock, address bus, data bus and control signals connecting the CPU, the DMA controller and memory.]
A DMA controller has an address register, a word count register, and a control register. The address register contains an address that specifies the memory location of the data to be transferred; it is typically possible to have the DMA controller automatically increment the address register after each word transfer, so that the next transfer will be from the next memory location. The word count register holds the number of words to be transferred, and is decremented by one after each word transfer. The control register specifies the transfer mode.
Direct memory access data transfer can be performed in burst mode or in single-cycle mode. In burst mode, the DMA controller keeps control of the bus until all the data has been transferred to (from) memory from (to) the peripheral device; this mode is needed for fast devices, where the data transfer cannot be stopped until the entire transfer is done. In single-cycle mode (cycle stealing), the DMA controller relinquishes the bus after each transfer of one data word. This minimizes the amount of time that the DMA controller keeps the CPU from controlling the bus, but it requires that the bus request/acknowledge sequence be performed for every single transfer, and this overhead can result in a degradation of performance. The single-cycle mode is preferred if the system cannot tolerate more than a few cycles of added interrupt latency, or if the peripheral devices can buffer very large amounts of data, causing the DMA controller to tie up the bus for an excessive amount of time in burst mode.

9. What are the different pipeline hazards and what are the remedies? [WBUT 2009]
OR,
Discuss data hazards briefly. [WBUT 2015]
Answer:
In computer architecture, a hazard is a potential problem that can happen in a pipelined processor. There are typically three types of hazards: data hazards, branching (control) hazards, and structural hazards.
Instructions in a pipelined processor are performed in several stages, so that at any given time several instructions are being executed, and instructions may not be completed in the desired order. A hazard occurs when two or more of these simultaneous (possibly out-of-order) instructions conflict.
i. Data hazards
Data hazards occur when data is modified out of order; ignoring potential data hazards can result in race conditions. There are three situations in which a data hazard can occur:
Read After Write (RAW): an operand is modified and read soon after. Because the first instruction may not have finished writing to the operand, the second instruction may use incorrect data.
Write After Read (WAR): an operand is read and written soon after. Because the write may have finished before the read, the read instruction may incorrectly get the new written value.
Write After Write (WAW): two instructions that write to the same operand are performed. The first one issued may finish second, and therefore leave the operand with an incorrect data value.
The operands involved in data hazards can reside in memory or in a register.
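The three cases differ only in which access (read or write) each of the two instructions makes to the shared operand, so they can be told apart mechanically. A small Python helper, added here as a sketch (the read/write-set encoding is hypothetical, not from the text):

def classify_hazard(first, second):
    """Classify the data hazard between two instructions in program
    order. Each instruction is modelled as (reads, writes): two sets
    of register names."""
    r1, w1 = first
    r2, w2 = second
    if w1 & r2:
        return "RAW"   # second reads what first writes
    if r1 & w2:
        return "WAR"   # second overwrites what first still reads
    if w1 & w2:
        return "WAW"   # both write the same destination
    return None        # read-after-read needs no ordering: no hazard

# i1: R2 <- R1 + R3,  i2: R4 <- R2 + R3  ==> RAW on R2
print(classify_hazard(({"R1", "R3"}, {"R2"}), ({"R2", "R3"}, {"R4"})))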
ii. Structural hazards
A structural hazard occurs when a part of the processor's hardware is needed by two or more instructions at the same time. A structural hazard might occur, for instance, if a program were to execute a branch instruction followed by a computation instruction: because they are executed in parallel, and because branching is typically slow (requiring a comparison, program-counter-related computation, and writing to registers), it is quite possible (depending on the architecture) that the computation instruction and the branch instruction will both require the ALU at the same time.
iii. Branch hazards
Branching hazards (also known as control hazards) occur when the processor is told to branch, i.e., if a certain condition is true, then to jump from one part of the instruction stream to another, not necessarily to the next instruction sequentially. In such a case, the processor cannot tell in advance whether it should process the next instruction (it may instead have to move to a distant instruction). This can result in the processor doing unwanted actions.

10. Use 8-bit 2's complement integers to perform (-43) - 13. [WBUT 2009]
Answer:
-43 = 11010101
-13 = 11110011
-43 + (-13) = 1 11001000
So, discarding the carry, the result is 11001000, i.e., the 2's complement representation of -56.
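The same arithmetic can be checked mechanically. A short added Python sketch that builds the 8-bit two's complement encodings and discards the carry out of the most significant bit:

def to_twos_complement(value, bits=8):
    """8-bit two's complement encoding of a (possibly negative) int."""
    return value & ((1 << bits) - 1)

a, b = to_twos_complement(-43), to_twos_complement(-13)
total = (a + b) & 0xFF            # discard the carry out of bit 7
print(f"{a:08b} + {b:08b} = {total:08b}")   # 11010101 + 11110011 = 11001000
print(total - 256)                # -56, as derived above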

11. What do you mean by pipeline processing? [WBUT 2009]
Answer:
Pipelining refers to the technique in which a given task is divided into a number of subtasks that need to be performed in sequence. Each subtask is performed by a given functional unit. The units are connected in a serial fashion and all of them operate simultaneously. The use of pipelining improves the performance compared to the traditional sequential execution of tasks.
[Fig: Pipeline vs. sequential processing of the four subtasks of an instruction - fetching (F), decoding (D), execution (E) and writing the results (W).]

12. What are instruction pipeline and arithmetic pipeline? [WBUT 2009]
Answer:
An instruction pipeline is a technique used in the design of computer systems to increase their instruction throughput. The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with storage at the end of each step. There are five steps to execute an instruction:
i. Fetch
ii. Decode
iii. Operand fetch
iv. Execute
v. Write back
[Fig: Instruction pipeline stages - instruction fetch, instruction decode, operand fetch, instruction execution and result write back across the execution cycle.]
The streams of instructions are executed in the pipeline in an overlapped manner:

    Clock cycle:  1   2   3   4   5   6   7   8
    I1:          IF  ID  OF  IE  WB
    I2:              IF  ID  OF  IE  WB
    I3:                  IF  ID  OF  IE  WB
    I4:                      IF  ID  OF  IE  WB

The principles used in instruction pipelining can also be applied to arithmetic operations such as add, subtract and multiply; this gives an arithmetic pipeline. As an example, consider a floating-point add unit with several stages:
[Fig: Floating-point add pipeline stages - Unpack, Align, Add, Normalize.]
1. Unpack: This stage partitions each floating-point number into three fields: the sign field, the exponent field, and the mantissa field. Any special cases, such as not-a-number (NaN), zero, and infinities, are detected during this stage.
2. Align: This stage aligns the binary points of the two mantissas by right-shifting the mantissa of the number with the smaller exponent.
3. Add: This stage adds the two aligned mantissas.
4. Normalize: This stage packs the three fields of the result, after normalization and rounding, into the IEEE-754 floating-point format. Any output exceptions are detected during this stage.

13. Find the 2's complement of (1AB) in hexadecimal, represented in 16-bit format. [WBUT 2009]
Answer:
(1AB) in hexadecimal = (0000 0001 1010 1011) in binary.
Its 2's complement is (1111 1110 0101 0101).

14. What are the different factors that can affect the performance in a pipelined system? Differentiate between WAR and RAW hazards. [WBUT 2010]
Answer:
Pipelining achieves a reduction of the average execution time per instruction, in the sense that the pipeline can complete more instructions per clock cycle. This can be viewed in two ways:
- decreasing the CPI, which is the typical way in which people view the performance increase; or
- decreasing the cycle time (i.e., increasing the clock rate).
Pipelining increases the CPU instruction throughput. It does not decrease the execution time of an individual instruction; it increases it slightly, due to overhead (clock skew and pipeline register delay) in the control of the pipeline.

WAR hazards:
1. Instruction j tries to write a destination before it is read by the earlier instruction i, so i incorrectly gets the new value.
2. A WAR hazard is eliminated by renaming all the destination registers, including those with a pending read or write for an earlier instruction.
3. A WAR hazard exists between i and j if D(i) ∩ R(j) ≠ Ø. Example: i1: R2 <- R1 + R3; i2: R3 <- R1 + R2.

RAW hazards:
1. Instruction j tries to read a source before the earlier instruction i writes it, so j incorrectly gets the old value.
2. RAW hazards are avoided by executing an instruction only when its operands are available.
3. A RAW hazard exists between i and j if R(i) ∩ D(j) ≠ Ø. Example: i1: R2 <- R1 + R3; i2: R4 <- R2 + R3.

15. "Instruction execution throughput increases in proportion with the number of pipeline stages." Is it true? Justify your statement. [WBUT 2012, 2015]
Answer:
Pipelining refers to the technique in which a given task is divided into a number of subtasks that need to be performed in sequence, each subtask being performed by a given functional unit; the units are connected in a serial fashion and all of them operate simultaneously. Consider the execution of m tasks (instructions) using an n-stage pipeline whose stage time is τ. The throughput is

    U(n) = m / ((n + m - 1) τ)

i.e., the number of tasks executed per unit time. If the total work per instruction is fixed and splitting it into n stages makes the stage time shrink proportionally, then for m >> n the throughput grows almost in proportion to the number of stages. The claim is only approximately true, however: the pipeline fill time of n - 1 cycles, the fact that the stage time cannot shrink below the latch delay, and hazards all make the increase less than proportional.
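A numerical check (an added sketch; it assumes a fixed total instruction latency that is split evenly across the n stages, ignoring latch overhead):

def throughput(m, n, total_latency=1.0):
    """Tasks completed per unit time for m tasks on an n-stage
    pipeline whose stage time is total_latency / n."""
    stage_time = total_latency / n
    return m / ((n + m - 1) * stage_time)

for n in (1, 2, 4, 8, 16):
    print(n, round(throughput(m=1000, n=n), 2))
# 1.0, 2.0, 3.99, 7.94, 15.76 -- close to proportional for m >> n,
# but the fill time (n - 1 cycles) keeps it slightly below n.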

16. For the code segment given below, explain how delayed branching can help:
    I1: LOAD   R1, A
    I2: DEC    R3, 1
    I3: BRZERO R3, 15
    I4: ADD    R2, R4
    I5: SUB    R5, R6
    I6: STORE  R5, B
[WBUT 2013]
Answer:
Instruction I2 performs "DEC R3, 1" and I3 performs "BRZERO R3, 15", so both I2 and I3 use register R3, and the branch cannot be resolved until the decrement has completed. With delayed branching, the value of R3 is first modified by "DEC R3, 1", the branch "BRZERO R3, 15" is issued, and an independent instruction (for example I4 or I5, neither of which touches R3) is moved into the branch delay slot, so that the pipeline does useful work while the branch is being resolved instead of stalling.

17. For the following code, show how loop unrolling can help improve instruction-level parallelism (ILP):
    Loop1: I1: LOAD  R0, A(R1)   ; A is the starting address of the array;
                                 ; R1 holds the offset of the current element
           I2: ADD   R0, R0, R2  ; R2 is a scalar
           I3: STORE R0, A(R1)
           I4: ADD   R1, R1, #-8 ; go to the next word in the array of
                                 ; doubles, whose address is 8 bytes earlier
           I5: BNE   R1, Loop1
[WBUT 2013]
Answer:
The overlapped execution of instructions, when they are independent of one another, is called instruction-level parallelism; pipelining lets such independent instructions be evaluated in parallel. The simplest and most common way to increase the amount of parallelism available among instructions is to exploit parallelism among iterations of a loop (loop-level parallelism). A loop is parallel unless there is a cycle in its dependences, since the absence of a cycle means that the dependences give only a partial order on the statements. In the loop above, each iteration loads, updates and stores a different array element; the only value carried from one iteration to the next is the index register R1, so the loop-carried dependency runs only through I4 and I5. Despite this loop-carried dependency, the loop can be made parallel because the dependency is not circular: unrolling the loop body several times (adjusting the displacements of the loads and stores and performing a single index update per unrolled block) removes most of the branch and index-update overhead and exposes several independent load/add/store chains that can be scheduled to overlap in the pipeline.
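The effect is easy to see in a high-level sketch (an added illustration; Python stands in for the assembly, with unrolling by 4 and the list `a` and scalar `s` playing the roles of the array and R2):

a = [float(i) for i in range(1024)]
s = 2.5

# Rolled loop: one add per iteration, plus index update and branch.
def rolled(a, s):
    for i in range(len(a)):
        a[i] = a[i] + s

# Unrolled by 4: one branch/index update per four independent adds,
# which a compiler or scheduler can overlap in the pipeline.
def unrolled4(a, s):
    for i in range(0, len(a), 4):
        a[i]     = a[i]     + s
        a[i + 1] = a[i + 1] + s
        a[i + 2] = a[i + 2] + s
        a[i + 3] = a[i + 3] + s

rolled(a, s)
unrolled4(a, s)   # same result, three quarters of the loop overhead removed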
18. What is pipeline chaining? [WBUT 2013]
Answer:
Pipeline chaining is a linking process that occurs when results obtained from one pipelined functional unit are directly fed into the operand registers of another functional pipe. In other words, intermediate results do not have to be restored into memory and can be used even before the vector operation is completed. Chaining permits successive operations to be issued as soon as the first result becomes available as an operand. The desired functional pipes and operand registers must be properly reserved; otherwise, the chained operations have to be suspended until the demanded resources become available.

19. Compare between Control-Flow, Data-Flow and Demand-Driven mechanisms. [WBUT 2013]
Answer:
The Control-Flow architecture is the von Neumann, or control-flow, computing model. Here a program is a series of addressable instructions, each of which either specifies an operation along with the memory locations of the operands, or specifies a conditional transfer of control to some other instruction. The next instruction to be executed depends on what happened during the execution of the current instruction; it is pointed to and triggered by the program counter (PC). The instruction is executed even if some of its operands are not yet available.
In the Data-Flow model, the execution is driven only by the availability of operands. There is no program counter and no globally updateable store: the two features of the von Neumann model that become bottlenecks in exploiting parallelism are missing in data-flow computers.
[Fig: A static data-flow computer - a memory unit holding instructions, update and fetch units, an enabled-instruction queue and a processing unit connected in a ring.]
A Demand-Driven access mechanism comprises logic for seizing use of a shared apparatus, e.g., a communication channel for enabling access to a selected program. Each program seeking access to the channel signals its status, and the logic apparatus receives status signals from all programs; if two or more programs simultaneously seek access to the channel, it establishes a priority order between them, thereby enabling access to only one program at a time. The priority ordering is based, in part, on the identity of the program last enabled, to thereby assure priority-order allocation among the programs.

20. Draw the pipeline execution diagram during the execution of the following instructions:
    MUL R1, R2, R3
    ADD R2, R3, R4
    INC R4
    SUB R6, R3, R7
Find out the delay in pipeline execution due to data dependency of the above instructions. [WBUT 2016]
Answer:
We consider a five-stage pipeline; the stages are fetch, decode, read, execution and write back. The operations of the instructions are:
    R1 <- R2 * R3
    R2 <- R3 + R4
    R4 <- R4 + 1
    R6 <- R3 - R7
From the above instructions we can say that no instruction reads a register written by an earlier instruction in the sequence, so the instructions are not dependent on each other: no data hazard (and hence no delay) will occur, and normal overlapped pipeline execution proceeds.
[Fig: Pipeline execution diagram - fetch, decode, read, execute and write back of the four instructions in successive clock pulses.]

21. How does a "Reservation Table" help to study the performance of a pipeline? [WBUT 2016]
Answer:
There are two types of pipelines: static and dynamic. A static pipeline can perform only one function at a time, whereas a dynamic pipeline can perform more than one function at a time. A pipeline reservation table shows when the stages of a pipeline are in use for a particular function. Each stage of the pipeline is represented by a row in the reservation table, and each row is in turn broken into columns, one per clock cycle; the number of columns indicates the total number of time units required for the pipeline to perform the function. To indicate that some stage S is in use at time t, an X is placed at the intersection of the row and column corresponding to that stage and time.
When scheduling a static pipeline, only collisions between different input data for a particular function have to be avoided. With a dynamic pipeline, it is possible for different input data requiring different functions to be present in the pipeline at the same time, so collisions between these data must be considered as well. In either case, scheduling begins with the compilation of a set of forbidden lists from the function's reservation tables; next the collision vectors are obtained, and finally the state diagram is drawn. The reservation table thus lets us derive the forbidden latencies, the collision vector and the state diagram, and hence the minimum average latency (MAL) and the maximum throughput of the pipeline.

22. Consider the execution of a program of 15,000 instructions by a linear pipeline processor. The clock rate of the pipeline is 25 MHz. The pipeline has five stages and one instruction is issued per clock cycle. Neglect penalties due to branch instructions and out-of-sequence execution.
(i) Calculate the speedup of program execution by the pipeline as compared with a non-pipelined processor.
(ii) What are the efficiency and throughput of the pipeline processor?
[WBUT 2016]
Answer:
The information we get is: n = 15,000 instructions, k = 5 stages, clock rate f = 25 MHz.
(i) The speedup is
    S = nk / (k + (n - 1)) = (15,000 x 5) / (5 + 14,999) = 75,000 / 15,004 = 4.999
(ii) The efficiency is
    E = S / k = 4.999 / 5 = 0.999
and the throughput is
    T = n f / (k + (n - 1)) = (15,000 x 25) / 15,004 = 375,000 / 15,004 = 24.99 MIPS
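These numbers are straightforward to reproduce; a short added check in Python:

n, k, f_mhz = 15_000, 5, 25

cycles = k + (n - 1)                 # total clock cycles used by the pipeline
speedup = n * k / cycles
efficiency = speedup / k
throughput = n * f_mhz / cycles      # in MIPS
print(f"{speedup:.4f} {efficiency:.4f} {throughput:.2f}")
# 4.9987 0.9997 24.99 -- i.e. S = 4.999, E = 0.999, T = 24.99 MIPS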
Long Answer Type Questions

1. What is a pipeline? Consider the following reservation table:

          1   2   3   4
    S1    X           X
    S2        X
    S3            X

Write down the forbidden latencies and the initial collision vector. Draw the state diagram for scheduling the pipeline. Find out the simple cycles and greedy cycles, and the MAL. If the pipeline clock rate is 25 MHz, then what is the throughput of the pipeline? What are the bounds on the MAL? [WBUT 2007, 2011]
Answer:
1st Part: Pipelining is a technique of splitting a sequential process into sub-operations, with each sub-operation being executed in a dedicated segment that operates concurrently with all other segments. An instruction pipeline is a technique used in the design of computers to increase their instruction throughput (the number of instructions that can be executed in a unit of time); pipelining assumes that successive instructions in a program sequence will overlap in execution, with one instruction issued per clock cycle.
A non-pipelined architecture is inefficient because some CPU components (modules) are idle while another module is active during the instruction cycle. Pipelining does not completely cancel out idle time in a CPU, but making those modules work in parallel improves program execution significantly. Processors with pipelining are organized inside into stages which can semi-independently work on separate jobs; each stage is organized and linked into a "chain", so each stage's output is fed to the next stage until the job is done. This organization allows the overall processing time to be significantly reduced.
Unfortunately, not all instructions are independent. In a simple pipeline, completing an instruction may require 5 stages; to operate at full performance, this pipeline needs to run 4 subsequent independent instructions while the first is completing. If 4 instructions that do not depend on the output of the first instruction are not available, the pipeline control logic must insert stalls (wasted clock cycles) into the pipeline until the dependency is resolved. Fortunately, techniques such as forwarding can significantly reduce the cases where stalling is required. While pipelining can in theory increase performance over an unpipelined core by a factor of the number of stages (assuming the clock frequency also scales with the number of stages), in reality most code does not allow for ideal execution.
2nd Part: Forbidden latencies are the latencies that cause a collision. Here the only forbidden latency, read off the reservation table, is 3 (the distance between the two X's in row S1). The initial collision vector is therefore
    C = (C3 C2 C1) = (100)
The state diagram is given below.

[Fig: State diagram - from the initial state (100), latency 1 leads to state (110), latency 2 leads to state (101), and any latency >= 4 returns to (100); from (110), latency 1 leads to (111) and any latency >= 4 returns to (100); from (111) only latencies >= 4 (back to (100)) are permissible; state (101) loops to itself with latency 2 and returns to (100) with any latency >= 4.]
Simple cycles: (2), (4), (1,4), (1,1,4) and (2,4).
Greedy cycle: (2), with average latency 2, so the MAL (minimum average latency) = 2.
Throughput = clock rate / MAL = 25 MHz / 2 = 12.5 million operations per second.
Bounds on the MAL: the lower bound is the maximum number of X's in any single row of the reservation table (= 2), and the upper bound is the number of 1's in the initial collision vector plus one (= 2). Hence 2 <= MAL <= 2, which agrees with the greedy cycle found above.
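The whole analysis can be automated. Below is a small added Python sketch (illustrative; it encodes the reservation table assumed above) that derives the forbidden latencies and the initial collision vector, and then walks the latency graph greedily:

# Rows of the reservation table: clock cycles in which each stage is busy.
table = {"S1": [1, 4], "S2": [2], "S3": [3]}

# Forbidden latencies: distances between two marks in the same row.
forbidden = sorted({b - a for row in table.values()
                    for a in row for b in row if b > a})
print("forbidden latencies:", forbidden)            # [3]

max_lat = max(forbidden)
cv0 = sum(1 << (l - 1) for l in forbidden)          # C3 C2 C1 = 100
print("initial collision vector:", format(cv0, f"0{max_lat}b"))

state, cycle = cv0, []
for _ in range(10):
    # Greedy: take the smallest permissible latency from this state.
    lat = next(l for l in range(1, max_lat + 2) if not (state >> (l - 1)) & 1)
    cycle.append(lat)
    state = (state >> lat) | cv0
print("greedy latency sequence:", cycle)
# Settles into the cycle (1,1,4), whose average latency 2 equals the
# MAL here, since both MAL bounds are 2.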
2. a) What do you mean by "Dataflow Computer"?
b) With a simple diagram, explain data flow architectures and compare them with control flow architecture.
c) Draw data flow graphs to represent the following computations:
i) X = A * B
ii) Y = X / B
iii) Z = A * X
iv) M = Z - Y
v) N = Z * X
vi) P = M / N
[WBUT 2008, 2014]
Answer:
a) A data flow computer is a large, very powerful computer that has a number of processors, all physically wired together, with a large amount of memory and backing storage. Such computers are highly parallel, in that they can carry out a large number of tasks at the same time. Data flow computers are used to execute processor-intensive applications such as those associated with areas like molecular biology and simulation. Numerical calculations for the simulation of natural phenomena have been conducted using a data-flow computer; the computing time was approximately three-to-five times shorter than that of the usual medium-size computer (computing speed 3 MIPS, million instructions per second), and dynamic visualization of the computing process was realized using an image display directly connected to the memory of the data-flow computer.
b) Refer to Question No. 19 of Short Answer Type Questions.
c) [Fig: Data flow graphs for the computations above, built from multiply, divide and subtract operator nodes whose output arcs feed the operand arcs of the dependent nodes.]

3. What is a floating-point arithmetic operation? Explain all (addition, difference, multiplication, division) operations with examples. [WBUT 2009]
Answer:
A floating-point (FP) number can be represented in the following form:
    +/- m * b^e
where m is the mantissa, representing the fraction part of the number, normally represented as a signed binary fraction; e represents the exponent; and b represents the base (radix) of the exponent.
[Fig: Representation of a floating-point number - a sign bit, an exponent field and a mantissa field (bits 22 down to 0) in single-precision format.]

Floating-Point Arithmetic Addition/Subtraction:
The difficulty in adding two FP numbers stems from the fact that they may have different exponents, so before adding two FP numbers their exponents must be equalized: the mantissa of the number that has the smaller magnitude of exponent must be aligned. The steps required to add/subtract two floating-point numbers are:
1. Compare the two exponents and make suitable alignment to the number with the smaller magnitude of exponent.
2. Perform the addition/subtraction.
3. Perform normalization by shifting the resulting mantissa and adjusting the resulting exponent.
Example: Consider adding the two FP numbers 1.1100 * 2^2 and 1.1000 * 2^0.
1. Alignment: 1.1000 * 2^0 has to be aligned to 0.0110 * 2^2.
2. Addition: add the two numbers to get 10.0010 * 2^2.
3. Normalization: the final normalized result is 1.0001 * 2^3 (assuming 4 bits are allowed after the radix point).
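A compact added sketch of the same align-add-normalize procedure, operating on (mantissa, exponent) pairs with the mantissa scaled to an integer (the bit widths are illustrative assumptions):

def fp_add(m1, e1, m2, e2, frac_bits=4):
    """Add two binary FP numbers m * 2**e, where m is an integer count
    of 2**-frac_bits units (e.g. 1.1100 -> 0b11100 with frac_bits=4)."""
    # 1. Alignment: right-shift the mantissa of the smaller exponent.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 >>= (e1 - e2)
    # 2. Addition of the aligned mantissas.
    m, e = m1 + m2, e1
    # 3. Normalization: bring the mantissa back below 2.0.
    while m >> frac_bits > 1:
        m >>= 1
        e += 1
    return m, e

m, e = fp_add(0b11100, 2, 0b11000, 0)
print(bin(m), e)   # 0b10001, 3  ==  1.0001 * 2**3, as in the example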
Floating-Point Arithmetic Multiplication:
A general algorithm for multiplication of FP numbers consists of three basic steps:
1. Compute the exponent of the product by adding the exponents together.
2. Multiply the two mantissas.
3. Normalize and round the final product.
Example: Consider multiplying the two FP numbers
    X = 1.000 * 2^-2 and Y = -1.010 * 2^-1
1. Add exponents: (-2) + (-1) = -3.
2. Multiply mantissas: 1.000 * (-1.010) = -1.010000.
The product is -1.010 * 2^-3.

Floating-Point Arithmetic Division:
A general algorithm for division of FP numbers consists of three basic steps:
1. Compute the exponent of the result by subtracting the exponents.
2. Divide the mantissas and determine the sign of the result.
3. Normalize and round the resulting value, if necessary.
Example: Consider the division of the two FP numbers
    X = 1.0000 * 2^-2 and Y = -1.0100 * 2^-1
1. Subtract exponents: (-2) - (-1) = -1.
2. Divide the mantissas: (1.0000) / (-1.0100) = -0.1101 (approximately).
3. The result is -0.1101 * 2^-1.

4. Consider the four-stage pipelined processor specified by the following diagram:
[Fig: Four-stage pipelined processor - the input feeds stage S1, the stages are connected with feedback paths, and the final stage drives the output.]
The pipeline has a total evaluation time of six clock cycles. All successor stages must be used after each clock cycle.
i) Specify the reservation table for the above pipelined processor, with six columns and four rows.
ii) What are the forbidden latencies and the initial collision vector? Draw the state transition diagram.
iii) Determine all simple cycles, greedy cycles and the MAL.
iv) Determine the throughput of this pipelined processor, given a clock period of 20 ns. [WBUT 2010]
Answer:
i) [Fig: Reservation table with four rows (S1-S4) and six columns (cycles 1-6); one layout consistent with the latencies derived below marks S1 in cycles 1 and 6, S2 in cycles 2 and 3, S3 in cycle 4 and S4 in cycle 5.]
ii) From the reservation table, the forbidden latencies are 1 and 5, so the initial collision vector is
    C = (C1 C2 C3 C4 C5 C6) = (100010)
The state transition diagram is as follows:
- State 1 (100010): latency 2 leads to State 2 (101010); latency 3 leads to State 3 (110010); latencies 4 and >= 6 return to State 1.
- State 2 (101010): latency 2 loops back to State 2; latencies 4 and >= 6 return to State 1.
- State 3 (110010): latency 3 loops back to State 3; latencies 4 and >= 6 return to State 1.
iii) Simple cycles: (2), (3), (4), (2,4) and (3,4). The greedy cycle is (2), whose average latency is 2, so the MAL = 2.
iv) Throughput of this pipelined processor = 1 / (MAL x clock period) = 1 / (2 x 20 ns) = 0.025 instructions per ns = 25 MIPS.

5. a) What do you mean by MMX? Differentiate a data flow computer from a control flow computer.
b) List some potential problems with data flow computer implementation.
c) With a simple diagram, explain data flow architecture.
d) Draw the data flow graph for the following computation and show how a data flow machine executes it:
    Input: d, e, f
    c0 = 0
    for i = 1 to 8 do
        a(i) = d(i) / e(i)
        b(i) = a(i) * f(i)
        c(i) = b(i) + c(i-1)
    end
    Output: a, b, c
[WBUT 2011, 2013, 2014]
Answer:
a) MMX technology is an extension to the Intel Architecture (IA) designed to improve multimedia and communication algorithms. The Pentium processor with MMX technology was the first microprocessor to implement the new instruction set; it also has a larger internal L1 cache than its non-MMX counterparts. The most important performance upgrade with the MMX implementation was the design of a new, dedicated, high-performance MMX pipeline, which was able to execute two MMX instructions with only slight changes to the existing units. Although adding a pipeline stage improves frequency, it decreases CPI performance: the longer the pipeline, the more work is done speculatively by the machine, and therefore the more work is thrown away in the case of branch mis-prediction.
Control flow computers are either uniprocessor or parallel processor architectures. In a uniprocessor system the instructions are executed sequentially; this is called the control-driven mechanism. In parallel processor systems, control flow computers use shared memory, and instructions executed in parallel may cause side effects on other instructions through that shared memory; execution is controlled by the sequence of the program counter register. Data flow computers, by contrast, are based on a data-driven mechanism: the fundamental difference is that instruction execution in a conventional control-flow computer is under program-flow control, whereas that in a data flow computer is driven by the data (operand) availability.
b) The data-driven mechanism requires no program counter and no shared memory, but it must check data availability and enable the asynchronous execution of ready instructions. This matching of operands to instructions is more difficult to implement than the uniform, program-counter-driven operation of control-flow processors, and it introduces extra matching overhead and may result in longer execution chains.
c) Refer to Question No. 19 of Short Answer Type Questions (a static data flow computer consists of a memory unit holding instructions, update and fetch units, an enabled-instruction queue and processing units connected in a ring).
d) In the example above there are three instructions in the body of the for loop, and the loop is executed 8 times, so in total 24 instructions will be executed. Suppose the add, multiply and divide operations require 1, 2 and 3 clock cycles to complete, respectively.
[Fig: Data flow graph for the above computation, with a divide, a multiply and an add node per iteration; the c(i) chain links successive iterations.]
The above instructions can be executed within 14 clock cycles on a 4-processor data flow computer.
[Fig: Data-driven execution on a 4-processor dataflow computer in 14 cycles.]
If instructions a1, a2 and a3 are all ready for execution in the first cycle, their results, as they become available, trigger the execution of a4, b1, a5 and so on, starting from cycle 4; the output c8 is the last one produced, since c8 depends, through the data-driven chain, on all the previous c(i)'s. The minimum time is 13 cycles, along the critical path a1 b1 c1 c2 ... c8.
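The 13-cycle critical path can be verified by a longest-path computation over the dependence graph. An added Python sketch, using the operation latencies stated above (add 1, multiply 2, divide 3):

import functools

LAT = {"div": 3, "mul": 2, "add": 1}

# Dependence graph of the loop, unrolled for i = 1..8:
#   a_i = d_i / e_i ;  b_i = a_i * f_i ;  c_i = b_i + c_{i-1}
deps, op = {}, {}
for i in range(1, 9):
    op[f"a{i}"], deps[f"a{i}"] = "div", []
    op[f"b{i}"], deps[f"b{i}"] = "mul", [f"a{i}"]
    op[f"c{i}"], deps[f"c{i}"] = "add", [f"b{i}"] + ([f"c{i-1}"] if i > 1 else [])

@functools.lru_cache(maxsize=None)
def finish(node):
    """Earliest finish time with unlimited processors (critical path)."""
    start = max((finish(d) for d in deps[node]), default=0)
    return start + LAT[op[node]]

print(max(finish(n) for n in op))   # 13: the path a1 -> b1 -> c1 -> ... -> c8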

6. a) What is the difference between Computer Organization and Computer Architecture?
b) Why is the equation in terms of average CPI often used to calculate the CPU time of a program?
c) A 30x enhancement is proposed for a new architecture M. If the enhancement is usable only for 50% of the time, for what fraction of the time must the enhancement be used to achieve an overall speedup of 10? [WBUT 2012]
d) What are the different approaches taken by a pipeline processor to handle branch instructions? Briefly illustrate any two approaches. [WBUT 2012]
OR,
What is branch hazard? Briefly discuss two methods to handle branch hazard. [WBUT 2014]
Answer:
a) Difference between Computer Organization and Computer Architecture:
- Computer organization deals with all the physical components of a computer system that interact with each other to perform various functionalities. The lower level of computer organization is known as microarchitecture, which is more detailed and concrete. Examples of organizational attributes include hardware details transparent to the programmer, such as control signals and the interfaces between the computer and its peripherals.
- Computer architecture refers to the set of attributes of a system as seen by the programmer. Examples of architectural attributes include the instruction set, the number of bits used to represent the data types, the input/output mechanisms and the addressing techniques for memories.
b) Performance analysis should help answer questions such as how fast a program can be executed using a given computer. In order to answer such a question, we need to determine the time taken by the computer to execute a given job. We define the clock cycle time as the time between two consecutive rising (trailing) edges of a periodic clock signal. Clock cycles allow counting unit computations, because the storage of computation results is synchronized with the rising (trailing) clock edges. The time required to execute a job is therefore often expressed in terms of clock cycles. Denoting the number of CPU clock cycles for executing a job as the cycle count (CC), the cycle time by CT, and the clock frequency by f = 1/CT, the time taken by the CPU to execute the job is
    CPU time = CC x CT = CC / f
It may be easier to count the number of instructions executed in a given program than to count the number of CPU clock cycles needed for executing that program. Therefore, the average number of clock cycles per instruction (CPI) is used as an alternate performance measure:
    CPI = CPU clock cycles for the program / Instruction count
    CPU time = Instruction count x CPI x Clock cycle time
c) Let x be the fraction of the time for which the enhancement must be used. The printed solution sets up the proportion
    x * (30 / 50) = 10  =>  x = 16.6
d) One of the major problems in designing an instruction pipeline is assuring a steady flow of instructions to the initial stages of the pipeline. However, 15-20% of the instructions in an assembly-level instruction stream are (conditional) branches, and of these, 60-70% take the branch to a target address; until the instruction is actually executed, it is impossible to determine whether the branch will be taken or not. A number of techniques can be used to minimize the impact of the branch instruction (the branch penalty):

Multiple streams:
Replicate the initial portions of the pipeline and fetch both possible next instructions. This increases the chance of memory contention, and the pipeline must support multiple streams for each branch instruction it holds.

Prefetch branch target:
When a branch instruction is decoded, begin to fetch the branch target instruction and place it in a second prefetch buffer. If the branch is not taken, the sequential instructions are already in the pipeline, so there is no loss of performance. If the branch is taken, the next instruction has already been prefetched and results in minimal branch penalty (we don't have to incur a memory read operation at the end of the branch to fetch the instruction).

Loop buffer (look-ahead, look-behind buffer):
Many conditional branch operations are used for loop control. The idea is to expand the prefetch buffer so as to buffer the last few instructions executed, in addition to the ones that are waiting to be executed. If the buffer is big enough, the entire loop can be held in it, which can reduce the branch penalty.

Branch prediction:
Make a good guess as to which instruction will be executed next, and start that one down the pipeline. Static guesses are made without considering the runtime history of the program: branch never taken, branch always taken, or predict based on the opcode. Dynamic guesses track the history of conditional branches in the program.

Delayed branch:
Minimize the branch penalty by finding valid instructions to execute in the pipeline while the branch address is being resolved. It is possible to improve performance by automatically rearranging instructions within a program, so that branch instructions occur later than actually desired; the compiler is tasked with reordering the instruction sequence to find enough independent instructions to feed into the pipeline after the branch, so that the branch penalty is reduced to zero.
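A common concrete form of the "dynamic guess" is a 2-bit saturating counter per branch. The sketch below is an added illustration of that scheme (the 2-bit counter detail is an assumption; the text above only says that dynamic schemes track branch history):

class TwoBitPredictor:
    """Per-branch 2-bit saturating counters: predict taken when the
    counter is 2 or 3; move one step toward each actual outcome."""
    def __init__(self):
        self.counters = {}          # branch address -> 0..3

    def predict(self, pc):
        return self.counters.get(pc, 0) >= 2

    def update(self, pc, taken):
        c = self.counters.get(pc, 0)
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)

bp = TwoBitPredictor()
outcomes = [True] * 8 + [False] + [True] * 8   # a typical loop branch
hits = 0
for t in outcomes:
    hits += bp.predict(0x400) == t
    bp.update(0x400, t)
print(f"{hits}/{len(outcomes)} correct")   # 14/17: a one-off exit barely hurts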
d
detcrmi by the original source program. There
are, three types of data harards can
Loop bufer: Look ahead. look behind buffer
OCcur:
Many conditional branch operations are used for loop control.

Loop buffer: Expand the prefetch buffer so as to buffer the last few instructions executed. If the buffer is big enough, an entire loop can be held in it; this can reduce the branch penalty.

Branch prediction: Make a good guess as to which instruction will be executed next and start that one down the pipeline.
Static guesses: make the guess without considering the runtime history of the program (branch never taken, branch always taken, predict based on the opcode).
Dynamic guesses: track the history of conditional branches in the program.

Delayed branch: Minimize the branch penalty by finding valid instructions to execute in the pipeline while the branch address is being resolved. It is possible to improve performance by automatically rearranging instructions within a program, so that branch instructions occur later than actually desired. The compiler is tasked with reordering the instruction sequence to find enough independent instructions to feed into the pipeline after the branch that the branch penalty is reduced to zero.

7. a) What are the major hurdles to achieve this ideal speed-up?
b) Discuss data hazard briefly.
c) Discuss briefly two approaches to handle branch hazards.
d) Consider a 4-stage pipeline that consists of Instruction Fetch (IF), Instruction Decode (ID), Execute (EX) and Write Back (WB) stages. The times taken by these stages are 50 ns, 50 ns, 110 ns and 80 ns respectively. Pipeline registers are required after every pipeline stage, and each of these pipeline registers consumes 10 ns delay. What is the speedup of the pipeline under ideal conditions compared to the corresponding non-pipelined implementation? [WBUT 2012]

Answer:
a) We define the speedup of a k-stage linear pipeline processor over an equivalent non-pipelined processor, for n tasks, as

    S_k = nk / (k + (n - 1))

It should be noted that the maximum speedup is S_k -> k, for n >> k. In other words, the maximum speedup that a linear pipeline can provide us is k, where k is the number of stages in the pipe. The maximum speedup is never fully achievable because of data dependencies between instructions, interrupts, and other factors.
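As a quick sanity check of this formula, the short Python sketch below evaluates S_k for a 4-stage pipe; the task counts are illustrative values only, not taken from any WBUT paper:

    # Sketch: k-stage linear pipeline speedup S_k = n*k / (k + (n - 1)).

    def pipeline_speedup(n_tasks: int, k_stages: int) -> float:
        """Speedup of a k-stage linear pipeline over a non-pipelined
        processor for n tasks, assuming one result per cycle after fill."""
        return (n_tasks * k_stages) / (k_stages + (n_tasks - 1))

    if __name__ == "__main__":
        k = 4
        for n in (1, 10, 100, 10000):
            print(f"n={n:6d}  S_k={pipeline_speedup(n, k):.3f}")
        # As n grows, S_k approaches k (= 4 here), the ideal maximum.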
b) A data hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap caused by pipelining, or other reordering of instructions, would change the order of access to the operand involved in the dependence. Because of the dependence, we must preserve what is called program order, that is, the order in which the instructions would execute if executed sequentially one at a time.

Data hazards may be classified as one of three types, depending on the order of read and write accesses in the instructions.

Read After Write (RAW) hazards: The RAW data hazard is the most common type. It appears when the next instruction tries to read from a source before the previous instruction writes to it. So the next instruction gets the incorrect old value, as when an operand is modified and read soon after. Because the first instruction may not have finished writing to the operand, the second instruction may use incorrect data.

Write After Read (WAR) hazards: A WAR hazard appears when the next instruction writes to a destination before the previous instruction reads it. In this case the previous instruction incorrectly gets a new value, as when one instruction reads an operand and another writes soon after to that same operand. Because the write may have finished before the read, the reading instruction may incorrectly get the newly written value.

Write After Write (WAW) hazards: A WAW data hazard is the situation when the next instruction tries to write to a destination before a previous instruction writes to it, and it results in changes being made in the wrong order. Two instructions that write to the same operand are performed; the first one issued may finish second, and therefore leave the operand with an incorrect data value. A further consequence of such hazards is pipeline latency: instruction effects are not completed before the next operation begins.
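The following minimal Python sketch classifies these three hazard types for a pair of instructions; the toy three-address format (destination, source1, source2) and the sample register names are invented for illustration:

    # Sketch: classifying RAW / WAR / WAW hazards between two instructions.
    # Each instruction is modelled as a (dest, src1, src2) tuple.

    def hazards(first, second):
        """Return the data hazards created if `second` follows `first`."""
        d1, s1a, s1b = first
        d2, s2a, s2b = second
        found = []
        if d1 in (s2a, s2b):            # second reads what first writes
            found.append("RAW")
        if d2 in (s1a, s1b):            # second writes what first reads
            found.append("WAR")
        if d2 == d1:                    # both write the same destination
            found.append("WAW")
        return found

    # Hypothetical sequence: ADD r3, r1, r2 followed by three alternatives
    print(hazards(("r3", "r1", "r2"), ("r5", "r3", "r4")))   # ['RAW']
    print(hazards(("r3", "r1", "r2"), ("r1", "r5", "r6")))   # ['WAR']
    print(hazards(("r3", "r1", "r2"), ("r3", "r5", "r6")))   # ['WAW']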
c) Branch hazards occur when the processor is told to branch, i.e., if a certain condition is true, then jump from one part of the instruction stream to another, not necessarily to the next instruction sequentially. In such a case, the processor cannot tell in advance whether it should process the next instruction (it may instead have to move to a distant instruction). To minimize the branch penalty, put in enough hardware so that we can test registers, calculate the branch target address, and update the PC during the second stage.

Fig: Situation of control hazards

There are two methods to prevent branch hazards, as described below.

Branch predication: Branch predication is a strategy in computer architecture design for mitigating the costs usually associated with conditional branches, particularly branches to short sections of code. It does this by allowing each instruction to conditionally either perform an operation or do nothing. Because computer programs branch conditionally, there is no way around the fact that portions of a program need to be executed conditionally. Note that besides eliminating branches, less code is needed in total, provided the "do this" and "do that" blocks of code are short enough; predication does not guarantee faster execution in general. Typically, in order to claim predication, most or all of the instructions of an architecture must have the ability to execute conditionally based on a predicate.

Delayed branch: A branch delay instruction is an instruction immediately following a conditional branch instruction which is executed whether or not the branch is taken. The branch delay slot is a side-effect of pipelined architectures due to the branch hazard, i.e., the fact that the branch would not be resolved until the instruction has worked its way through the pipeline. A simple design would insert stalls into the pipeline after a branch instruction until the new branch target address is computed and loaded into the program counter. Each cycle where a stall is inserted is considered one branch delay slot. The delayed branch always executes the next sequential instruction, with the branch taking place after that one-instruction delay.

Fig: Delayed branch: the branch, the delay-slot instruction and the branch target each flow through the IF, ID, EX, Mem and WB stages

d) Total time required for each instruction on the non-pipelined machine = 50 + 50 + 110 + 80 = 290 ns. In the pipelined version, the clock period must accommodate the slowest stage plus one pipeline register delay, i.e. 110 + 10 = 120 ns. Under ideal conditions the pipeline completes one instruction every cycle, so the speedup over the corresponding non-pipelined implementation is

    Speedup = 290 ns / 120 ns ≈ 2.42
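The arithmetic of part (d) can be cross-checked with the short sketch below; the stage and latch delays are the values given in the question:

    # Cross-check of Question 7(d): speedup = non-pipelined instruction time
    # divided by the pipelined clock period (slowest stage + latch delay).

    stage_ns = [50, 50, 110, 80]   # IF, ID, EX, WB delays from the question
    latch_ns = 10                  # pipeline register delay per stage

    non_pipelined = sum(stage_ns)               # 290 ns per instruction
    pipelined_clock = max(stage_ns) + latch_ns  # 120 ns per cycle

    print(non_pipelined / pipelined_clock)      # ~2.42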
8. a) What do you mean by multiple issue processor?
b) Briefly describe the VLIW processor architecture.
c) What are the differences between superscalar processor and V.L.I.W. processor?
d) Suppose your program consists of 2500 instructions. The proportion of different kinds of instructions in the program is as follows: data transfer instructions 50%, arithmetic instructions 30% and branching related instructions 20%. The cycles consumed by these types of instructions are 2, 5 and 10 respectively. What will be the execution time for a 4 GHz processor to execute your program? [WBUT 2012]

Answer:
a) Scoreboarding and Tomasulo are the two single-issue techniques that allow out-of-order execution. The multiple-issue processors are superscalar and VLIW (very long instruction word) processors. Most of today's general-purpose microprocessors are four- or six-issue superscalar, often with an enhanced Tomasulo scheme; VLIW is the choice for most signal processors. The issue logic examines the waiting instructions in the instruction window and simultaneously assigns and issues a number of instructions to the functional units, so that several instructions can be issued simultaneously, up to a maximum called the issue bandwidth.

b) Recent high performance processors have depended on Instruction Level Parallelism (ILP) to achieve high execution speed. ILP processors achieve their high performance by causing multiple operations to execute in parallel, using a combination of compiler and hardware techniques. Very Long Instruction Word (VLIW) is one particular style of processor design that tries to achieve high levels of instruction level parallelism by executing long instruction words composed of multiple operations. The long instruction word, called a MultiOp, consists of multiple arithmetic, logic and control operations, each of which would probably be an individual operation on a simple RISC processor. The VLIW processor concurrently executes the set of operations within a MultiOp, thereby achieving instruction level parallelism. We now describe Defoe, an example processor used here to give the reader a feel for VLIW architecture and programming. Though it does not exist in reality, its features are derived from those of several existing VLIW processors. The figure shows the architecture of the VLIW processor.

Fig: VLIW architecture (simple and complex integer units, load/store units and a branch unit; predicate register file; fetch and dispersal unit fed by the I-cache)

Functional units:
Two simple ALUs that perform add, subtract, shift and logical operations on 64-bit numbers. In addition, these units also support operations on packed 32-, 16- and 8-bit numbers.
One complex ALU that can perform multiply and divide on 64-bit integers and on packed 32-, 16- and 8-bit integers.
Load/store units that move operands between memory and the registers.
One branch unit that performs branch, call and comparison operations.
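The toy Python model below illustrates the MultiOp idea: every cycle, all operations packed into one long instruction word issue together. The three-slot bundle layout, register names and operations are invented for illustration and are not Defoe's actual encoding:

    # Toy model of VLIW issue: each "MultiOp" is a tuple of independent
    # operations, all executed in the same cycle by separate units.

    regs = {f"r{i}": i for i in range(8)}

    def alu(op, dst, a, b):
        regs[dst] = regs[a] + regs[b] if op == "add" else regs[a] - regs[b]

    program = [
        (("add", "r3", "r1", "r2"), ("sub", "r6", "r5", "r4")),  # cycle 0
        (("add", "r7", "r3", "r6"),),                            # cycle 1
    ]

    for cycle, multiop in enumerate(program):
        for op in multiop:            # all slots issue in the same cycle
            alu(*op)
        print(f"cycle {cycle}: issued {len(multiop)} op(s)")
    print(regs["r7"])                 # 4, since r3 = 3 and r6 = 1

Note that the compiler, not the hardware, guaranteed that the two operations in cycle 0 were independent; that is exactly the division of labour described in part (c) below.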
c) Differences between a superscalar processor and a VLIW processor:

Superscalar processor:
- A superscalar architecture is one in which several instructions can be initiated simultaneously and executed independently.
- Superscalar architectures include all features of pipelining but, in addition, several instructions can be executing simultaneously in the same pipeline stage.
- Very complex hardware is needed for run-time detection of parallelism, and power consumption can be very large. The window of execution is limited.

VLIW processor:
- VLIW architectures rely on compile-time detection of parallelism. The compiler analyses the program and detects operations to be executed in parallel; such operations are packed into one "large" instruction.
- After one instruction has been fetched from memory, all the corresponding operations are issued in parallel.
- No hardware is needed for run-time detection of parallelism. The window-of-execution problem is solved, since the compiler can potentially analyse the whole program in order to detect parallel operations.

d) Number of data transfer instructions = 2500 × 50% = 1250; cycles consumed = 1250 × 2 = 2500.
Number of arithmetic instructions = 2500 × 30% = 750; cycles consumed = 750 × 5 = 3750.
Number of branching related instructions = 2500 × 20% = 500; cycles consumed = 500 × 10 = 5000.
Total clock cycles consumed for the 2500 instructions = 2500 + 3750 + 5000 = 11250.
Frequency of the processor f = 4 GHz, so the clock period T = 1/f = 0.25 ns.
The execution time for the 4 GHz processor to execute this program = 0.25 ns × 11250 = 2812.5 ns ≈ 2.81 µs.

9. What do you understand by instruction pipelining and arithmetic pipelining? Why is pipeline scheduling necessary and how is it done? [WBUT 2013]
Answer:
1st Part: Refer to Question No. 12 of Short Answer Type Questions.
2nd Part: Pipeline instruction scheduling may be done either before or after register allocation, or both before and after it. The advantage of doing it before register allocation is that this results in maximum parallelism. The disadvantage of doing it before register allocation is that the register allocator may then need to use a number of registers exceeding those available, which will cause spill/fill code to be introduced into the section of code in question and so reduce the performance of the code. If the architecture being scheduled has instruction sequences with potentially illegal combinations (due to a lack of interlocks), the instructions must also be scheduled after register allocation. If scheduling is only done after register allocation, there will be false dependencies introduced by the register allocation that limit the amount of instruction motion possible for the scheduler. A second scheduling pass after register allocation will also improve the placement of spill/fill code.

10. Define pipelining technique. Assume a 4-stage pipeline:
Fetch: read the instruction.
Decode: decode the instruction.
Execute: execute the instruction.
Write: store the result in the destination location.
Draw the space-time diagram for it. [WBUT 2013]
Answer:
Pipelining is an implementation technique whereby multiple instructions are overlapped in execution. Each step in the pipeline (called a pipe stage) completes a part of an instruction.

Fig: Space-time diagram of the 4-stage pipeline (instruction i occupies IF, ID, EX and WB in successive cycles, with one new instruction entering per cycle)

Instruction Fetch cycle (IF): In this cycle the CPU sends the address held by the program counter (PC) to memory, fetches the current instruction from memory, and updates the PC for the next instruction.
Instruction Decode/Register Fetch cycle (ID): In this cycle the CPU decodes the instruction and reads the registers from the register file, performing the equality test on the registers in case the instruction is a branch. It sign-extends the offset field of the instruction if needed and computes the possible branch target address by adding the sign-extended offset to the incremented PC.
Execution/Effective Address Calculation cycle (EX): In this cycle the ALU operates on the operands prepared in the prior cycle. For a memory reference instruction, the ALU adds the base register and the offset to form the effective address. For a register-register ALU instruction, the ALU performs the operation specified by the ALU opcode on the values read from the register file. For a register-immediate instruction, the ALU performs the operation specified by the opcode on the first value from the register file and the sign-extended immediate.
Write-back cycle (WB): For a register-register ALU instruction or a load instruction, the result is written into the register file.
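A small Python sketch can print the space-time diagram asked for in Question 10; the four instructions shown are arbitrary, and F/D/E/W abbreviate the four stages:

    # Sketch: space-time diagram of a 4-stage pipeline (F, D, E, W).
    # Instruction i occupies stage s during cycle i + s.

    STAGES = ["F", "D", "E", "W"]

    def space_time(n_instr: int) -> None:
        total_cycles = len(STAGES) + n_instr - 1
        print("      " + " ".join(f"t{c:<2}" for c in range(total_cycles)))
        for i in range(n_instr):
            row = ["   "] * total_cycles
            for s, name in enumerate(STAGES):
                row[i + s] = f"{name}  "
            print(f"I{i+1}:   " + " ".join(row))

    space_time(4)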
11. What do you mean by multiple issue processors? Briefly describe the VLIW processor architecture. What are the limitations of VLIW? [WBUT 2014]
Answer:
Refer to Question No. 8(a) & (b) of Long Answer Type Questions.

12. a) What is meant by pipeline hazard? Briefly discuss different pipeline hazards.
b) Consider the following reservation table. (Reservation table as printed in the original question.)
List the set of forbidden latencies and the collision vector. Draw the state transition diagram. List all simple cycles from the state diagram. Identify the greedy cycles among the simple cycles. Find out the minimum average latency (MAL). Find out the maximum throughput of this pipeline if the clock rate is 25 MHz.
c) What are the bounds on MAL? [WBUT 2014]
Answer:
a) Refer to Question No. 12 of Short Answer Type Questions.
b) The forbidden latencies are 0 and 2.
The collision vector is (1010), where bit i, counted from the left starting at latency 0, is 1 if latency i is forbidden.
State diagram:
State 1 (1010), the initial state:
    reaches State 2 (1110) after 1 cycle;
    reaches State 1 (1010) after 3 cycles;
    reaches State 1 (1010) after 4 or more cycles.
State 2 (1110):
    reaches State 1 (1010) after 3 cycles;
    reaches State 1 (1010) after 4 or more cycles.
There are 2 states in the state diagram: State 1 represents 1010 and State 2 represents 1110.
The simple cycles are (3), (4), (1,3) and (1,4). The greedy cycle is (1,3), so the minimal average latency is MAL = (1 + 3)/2 = 2.
With the 25 MHz clock the period is τ = 40 ns, so the maximum throughput = 1/(MAL × τ) = 1/(2 × 40 ns) = 12.5 MIPS.
c) Bounds on MAL:
Lower bound of MAL = 2 (the maximum number of check marks in any row of the reservation table).
Upper bound of MAL = 2 + 1 = 3 (the number of 1's in the initial collision vector plus 1).

13. a) What do you mean by job collision in a pipeline processor? Show how collisions occur in the following static pipeline. (Reservation table as printed in the original question, with X marks in rows S0 and S1.)
c) Consider the execution of a program of 20,000 instructions by a linear pipeline processor with a clock rate of 40 MHz. Assume that the instruction pipeline has five stages and that one instruction is issued per clock cycle. The penalties due to branch instructions and out-of-order executions are ignored. Calculate the speedup of the pipeline over its equivalent non-pipeline processor, the efficiency and the throughput. [WBUT 2015]
Answer:
a) Refer to Question No. 17(a) of Short Answer Type Questions.
For the given table: the number of cycles between initiations is called the latency, and the first step is to identify the forbidden latencies revealed by the reservation table. A latency is forbidden if it will lead to a collision. The table shows that stage S0 is required during both the first and last cycles, so we cannot initiate a new task after only one cycle, or there would be an immediate collision. There is a systematic way to identify all forbidden latencies from a given reservation table: for every row containing more than one X, we write down the distance between every pair of X's, and each such distance is a forbidden latency. In our example, row S0 contains two X's which form one distinct pair, and the latency blocked by row S0 is 4; row S1 yields 2; a row that contains only one X contributes no forbidden latencies. The set of forbidden latencies is then summarised in a bit string called a collision vector, which contains one bit for each possible latency. The bits are numbered from left to right (but some authors use the opposite order). The collision vector for our example table is 11010.
c) Speedup S = nk/(k + (n − 1)) = (20000 × 5)/(5 + 19999) ≈ 5.
Efficiency η = S/k ≈ 0.9998, i.e. about 99.98%.
With the 40 MHz clock (τ = 25 ns), throughput w = n/[(k + n − 1)τ] = 20000/(20004 × 25 ns) ≈ 40 MIPS.
(See also Question No. 4 of Long Answer Type Questions.)
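The systematic method described in Question 13 is easy to automate. The sketch below derives the forbidden latencies and collision vector from a reservation table, using the convention of these answers (bit i of the vector, counted from the left starting at latency 0, is 1 if latency i is forbidden); the table used here is a hypothetical one chosen to match the forbidden latencies {0, 2} of Question 12(b), since the actual table appears only in the question paper:

    # Sketch: forbidden latencies and collision vector from a reservation
    # table, modelled as {stage: [clock cycles in which it is used]}.

    from itertools import combinations

    table = {"S0": [0, 2], "S1": [1], "S2": [3]}   # hypothetical example

    forbidden = {0}                       # latency 0 always collides
    for times in table.values():
        for a, b in combinations(times, 2):
            forbidden.add(abs(b - a))     # distance between marks in a row

    max_lat = max(max(t) for t in table.values())
    collision_vector = "".join("1" if i in forbidden else "0"
                               for i in range(max_lat + 1))
    print(sorted(forbidden), collision_vector)   # [0, 2] 1010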
14. Consider the following pipeline reservation table. (Reservation table with stages S1, S2 and S3, as printed in the original question.)
a) What are the forbidden latencies?
b) Draw the state transition diagram.
c) List all the simple cycles and greedy cycles.
d) Determine the optimal constant latency cycle and the minimal average latency.
e) Let the pipeline clock period be τ = 20 ns. Determine the throughput of the pipeline. [WBUT 2016]
Answer:
a) The forbidden latencies for the given table are 0, 2, 4, 5 and 7.
b) The initial collision vector is therefore C = (10101101), where bit i, counted from the left starting at latency 0, is 1 if latency i is forbidden. The state transition diagram is:
State 1 (10101101), the initial state:
    reaches State 2 (11111111) after 1 cycle;
    reaches State 3 (11101101) after 3 cycles;
    reaches State 3 (11101101) after 6 cycles;
    reaches State 1 (10101101) after 8 or more cycles.
State 2 (11111111):
    reaches State 1 (10101101) after 8 or more cycles.
State 3 (11101101):
    reaches State 3 (11101101) after 3 cycles;
    reaches State 3 (11101101) after 6 cycles;
    reaches State 1 (10101101) after 8 or more cycles.
There are 3 states in the state diagram: State 1 represents 10101101, State 2 represents 11111111 and State 3 represents 11101101.
c) The simple cycles are (3), (6), (8), (1,8), (3,8) and (6,8). The greedy cycles are (1,8), obtained from the initial state, with average latency (1 + 8)/2 = 4.5, and (3), obtained from state 11101101.
d) The optimal constant latency cycle is (3), so the minimal average latency is MAL = 3.
e) Throughput W = (1/τ)(1/MAL) = 1/(3 × 20 ns) ≈ 16.7 MIPS. (If the greedy cycle (1,8) were used instead, the average latency would be 4.5 and the throughput 1/(4.5 × 20 ns) ≈ 11.1 MIPS.)

15. With the use of Amdahl's law, conclude which among the given options of possible improvements is the best one.

    Instruction type   Frequency   CPI
    ALU                40%         1
    Branch             20%         4
    Load               30%         2
    Store              10%         3

Possible improvements:
1. Branch CPI can be decreased from 4 to 3.
2. Increase clock frequency from 2 to 2.3 GHz.
3. Store CPI can be decreased from 3 to 2. [WBUT 2016]
Answer:
To compare the improvements, we calculate the frequency-weighted CPI and the resulting MIPS rating, MIPS = clock rate in MHz / CPI, for each case.
In the given problem, CPI = 0.4×1 + 0.2×4 + 0.3×2 + 0.1×3 = 2.1. With the 2 GHz (2000 MHz) clock, MIPS = 2000/2.1 ≈ 952.
Case 1: branch CPI decreased from 4 to 3: CPI = 0.4×1 + 0.2×3 + 0.3×2 + 0.1×3 = 1.9, so MIPS = 2000/1.9 ≈ 1053.
Case 2: clock frequency increased from 2 to 2.3 GHz: CPI remains 2.1, so MIPS = 2300/2.1 ≈ 1095.
Case 3: store CPI decreased from 3 to 2: CPI = 0.4×1 + 0.2×4 + 0.3×2 + 0.1×2 = 2.0, so MIPS = 2000/2.0 = 1000.
So, among the given options, case 2 (raising the clock to 2.3 GHz) gives the highest MIPS rating and is the best one for improvement.
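The comparison in Question 15 can be reproduced with the short sketch below; the instruction mix and CPI values are the ones given in the question:

    # Cross-check of Question 15: MIPS = clock (MHz) / weighted CPI.

    mix = {"ALU": (0.40, 1), "Branch": (0.20, 4),
           "Load": (0.30, 2), "Store": (0.10, 3)}

    def mips(freq_mhz, cpi_override=None):
        cpi = sum(f * (cpi_override or {}).get(k, c)
                  for k, (f, c) in mix.items())
        return freq_mhz / cpi

    print(f"base   : {mips(2000):7.1f} MIPS")                 # ~952
    print(f"case 1 : {mips(2000, {'Branch': 3}):7.1f} MIPS")  # ~1053
    print(f"case 2 : {mips(2300):7.1f} MIPS")                 # ~1095, best
    print(f"case 3 : {mips(2000, {'Store': 2}):7.1f} MIPS")   # 1000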
COMPLTER ARCHILECTURE
POPULARPUBLICATIONS Hazard
Data
Da7ards may be classified as one of three types, depending
Data
eaCCesSes on the order of read and
Answer: e program in the instructions. By convention,the hazards are named by the ordering

TI
a) that must be preserved
Time inthe by the pipeline.
Therearetthree types of data hazards can occur:
X
,Read After Write (RAW) hazards:
nit 0
OAW data hazard is the most common type. It appears when the next instruction
Uniti read
rea
tries to
from a source betore the previous instruction writes to it. So, the next instruction
nit

AU ts the
gets incorrect old value such as an operand is modified and read soon after. Because
nit The
Unit 4 the first instruction may not have finished writing to the operand, the second instruction
may use incorrect data.
b) The minimum asynchronous time for any single instruction to complete = (30 + .Write After Read(WAR) hazards:
9-20-10) ns = 69ms WAR hazard appears when the next instruction writes to a destination before the previous
instruction reads it. In this case, the previous instruction gets a new value incorrectly such
a
c)To set up pipeline opeTa:ion. we need to construct minimum five stages in the as read an operand and writes soon after to that same operand. Because the write may
pipeline The stages are instruction tetch. instruction decode. operant read, instruction have finished before the read, the read instruction may incorrectly get the new
written
AK
cAecuton and store results. We should cakculate that which stage requires the highest value.
time for execution and this amount of ume we have to set for all the stages Write Afier Write (wAW) hazards:
pipeline
of the write to a destination
WAW data hazard is situation when the next instruction tries to
it results in changes done in the wrong order.
before a previous instruction writes to and
it
17. Write short notes of the following: same operand are performed. The first one issued may
Two instructions that write to the
a) Pipeline hazards WBUT 2005, 2011, 2014] leave the operand with an incortect data value. So the results
b) Reservation Table finish second, and therefore
c) Branch handling in instruction pipeline
WBUT 2008] of WAW hazards are:
d) Amdahl's law and its significance WBUT 2011] Pipeline Latency
WBUT 2012, 2014] completed before next operation begins
M

Answer: Instruction effects not


a) With pipelining. each instruction is supposed
to start executing at a given clock cycle.
Unfortunatety there are cases in which Structural Hazardsoccurs when a ppart of the processor's hardware is necded by two or
an instruction cannot execute at its allotted clock
cycle. These situations are called: A structural hazard time. A strucural hazard might occur, for instance. if a
pipeline ha/ards. Hazards further reduce the instructions at the same
perfomance gain from the speedup. more instruc vecute a branch instruction followed by a computation instruction.
So. program were executed ecuted in parallel, and because branching is ty pically slow (rgquiring
The hazard is a situatuon which Becauset theyare program
counter-
ier-related computation, and writing to registers). it is quite
prevents to fetch the next instructions in the
E

instruction stream from executing a compariso computation inst


instruction and the branch instruction will both require the
during its designated clock cycle. that the
Hazards reduce the performance possible
possible thatsameuime.time.
Hardware cannot
cannot support the combination of instructions that we
from the ideal speedup gained
Data Harards by pipelining ALU at the Same clock cycle.
execute in the
Structural Hazards want to
AR

Control Hazards hazards


Control hazardsoccur -ur when the proccssor is told to branch i.e.. ifa certain condition is
Hazards can naket necessary to stall the pipeline. one part of the instruction stream to another not necessarly to
When an instruction is
Controls
COn mp frome -

stalled. all instructions then Juscquentially. In such a case. the processor


cannot tcll in advance
instruction are also stalled issued later than the staiic true.instruerocess the neNt instruction (when it may instead have to move to a
the it sho bra
No new instructions are fetched
during the stall whethsructialculate he
To minimize the branch penalty, put in enough hardware so that we
calculate the branch
registers, target address, and upiate the PC during the
can eststage
second CA-39

CA-38
e
W
S
AN
b) Scheduling and control are important factors in the design of nonlinear pipelines. Any time the hardware is reconfigured, or any time the same stage is used more than once in a computation, there is the possibility of a collision in the pipeline. A collision is an attempt to use the same stage for two or more operations at the same time. If two or more sets of inputs arrive at a stage simultaneously, at the very least the pipeline will compute erroneous results for at least one set of inputs; depending on the details of physical construction, the different stages could even be short-circuited to a common input, causing damage to the circuitry. Thus collisions are to be avoided at all costs when a pipeline is in operation.
How can we determine when collisions might occur in a pipeline? One graphical tool we can use to analyse pipeline operation is called a reservation table. A reservation table is just a chart with rows representing pipeline stages and columns representing time steps (clock cycles). Marks are placed in cells of the table to indicate which stages of the pipeline are in use at which time steps while a given computation is being performed. The simplest reservation table is one for a static, linear pipeline; the table below is an example of a reservation table for a five-stage static linear pipeline.

Fig: Reservation table for a static linear pipeline (stages 1 to 5, one mark per row along the diagonal)

Notice that all the marks in this simple reservation table lie in a diagonal line. This is because each stage is used once and only once, in numerical order, in performing each computation. Even if we permuted the order of the stages, there would still be only one mark in each row, because no stage would be used more than once per operation. As long as this is the case, we will be able to initiate a new operation in the pipeline on each clock cycle. Suppose instead that we had a nonlinear pipeline, in which some of the stages are used more than once per computation. Such a pipeline would have a reservation table as depicted in the second table, where rows 2 and 3 contain more than one mark, depicting the repeated use of the corresponding stages.

Fig: Reservation table for a nonlinear pipeline

Latches are used between pipeline stages to obtain the minimum average latency (MAL). An optimisation technique based on introducing non-compute delay latches between the stages of a nonlinear pipeline has been proposed: the idea is to search for the most proper locations at which to apply delay latches to the nonlinear pipeline's reservation table, yielding a new collision vector which is adaptable to the pipeline topology with a modified minimal execution time. This approach not only reduces execution time but also minimises the collision-vector search time.

c) In a pipelined processor, branch instructions are those that tell the processor to make a decision about what the next instruction to be executed should be, based on the results of another instruction. Branch instructions can be troublesome in a pipeline if a branch is conditional on the results of an instruction which has not yet finished its path through the pipeline.
For example:

    Loop: add $r3, $r2, $r1
          sub $r6, $r5, $r4
          beq $r3, $r6, Loop

The example above instructs the processor to add r1 and r2 and put the result in r3, then subtract r4 from r5, storing the difference in r6. In the third instruction, beq stands for "branch if equal": if the contents of r3 and r6 are equal, the processor should execute the instruction labelled Loop; otherwise, it should continue to the next instruction. In this example, the processor cannot make a decision about which branch to take, because neither the value of r3 nor that of r6 has been written into the registers yet.
The processor could stall, but a more sophisticated method of dealing with branch instructions is branch prediction. The processor makes a guess about which path to take; if the guess is wrong, anything written into the registers must be cleared, and the pipeline must be started again with the correct instruction. Some methods of branch prediction depend on stereotypical behaviour. Branches pointing backward are taken about 90% of the time, since backward-pointing branches are often found at the bottom of loops; on the other hand, branches pointing forward are only taken approximately 50% of the time. Thus it would be logical for processors to always follow the branch when it points backward, but not when it points forward. Other methods of branch prediction are less static: processors that use dynamic prediction keep a history for each branch and use it to predict future branches. These processors are correct in their predictions about 90% of the time. A small sketch of this idea follows.
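The sketch below models one common form of dynamic prediction, a 2-bit saturating counter kept per branch (counter values 0-1 predict "not taken", 2-3 predict "taken"); the branch address and the outcome stream are invented for illustration:

    # Sketch: per-branch 2-bit saturating counters for dynamic prediction.

    counters = {}

    def predict(addr: int) -> bool:
        return counters.get(addr, 1) >= 2       # start weakly "not taken"

    def update(addr: int, taken: bool) -> None:
        c = counters.get(addr, 1)
        counters[addr] = min(3, c + 1) if taken else max(0, c - 1)

    hits = 0
    outcomes = [True] * 9 + [False]   # a loop branch, taken 90% of the time
    for taken in outcomes:
        hits += predict(0x400) == taken
        update(0x400, taken)
    print(f"correct predictions: {hits}/{len(outcomes)}")   # 8/10

After a brief warm-up the counter saturates in the "taken" state, so it mispredicts only the loop exit, which is why such predictors approach the 90% accuracy quoted above on loop-dominated code.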
d) Amdahl's law is a model for the relationship between the expected speedup of parallelised implementations of an algorithm relative to the serial algorithm, under the assumption that the problem size remains the same when parallelised. For example, if for a given problem size a parallelised implementation of an algorithm can run 12% of the algorithm's operations arbitrarily quickly (while the remaining 88% of the operations are not parallelisable), Amdahl's law states that the maximum speedup of the parallelised version is 1/(1 − 0.12) ≈ 1.136 times as fast as the non-parallelised implementation.
More technically, the law is concerned with the speedup achievable from an improvement to a computation that affects a proportion P of that computation, where the improvement has a speedup of S. (For example, if 30% of the computation may be the subject of a speedup, P will be 0.3; if the improvement makes the portion affected twice as fast, S will be 2.) Amdahl's law states that the overall speedup of applying the improvement will be:

    Speedup_overall = 1 / ((1 − P) + P/S)

To see how this formula was derived, assume that the running time of the old computation was 1, for some unit of time. The running time of the new computation will be the length of time the unimproved fraction takes (which is 1 − P), plus the length of time the improved fraction takes. The length of time for the improved part of the computation is the length of the improved part's former running time divided by the speedup, making the length of time of the improved part P/S. The final speedup is computed by dividing the old running time by the new running time, which is what the above formula does.
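The formula translates directly into a couple of lines of Python; both worked examples from the note above are reproduced:

    # Sketch: Amdahl's law, overall speedup = 1 / ((1 - P) + P / S).

    def amdahl(p: float, s: float) -> float:
        """p = fraction of the computation improved, s = its speedup."""
        return 1.0 / ((1.0 - p) + p / s)

    print(amdahl(0.30, 2))        # 30% of the work made 2x faster -> ~1.18
    print(amdahl(0.12, 1e12))     # 12% made arbitrarily fast -> ~1.136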
VECTOR PROCESSOR

Multiple Choice Type Questions

1. Which of the following types of instructions are useful in handling sparse vectors or sparse matrices often encountered in practical vector processing applications? [WBUT 2007]
a) Vector-scalar instructions b) Masking instructions
c) Vector-memory instructions d) None of these
Answer: (b)

2. The vector stride value is required [WBUT 2009, 2011]
a) to deal with the length of vectors b) to find the parallelism in vectors
c) to access the elements in multi-dimensional vectors
d) to execute vector instructions
Answer: (c)

3. The basic difference between vector and array processors is [WBUT 2010, 2014]
a) pipelining b) interconnection network c) registers d) none of these
Answer: (a)

4. Stride in a vector processor is used to [WBUT 2010, 2014]
a) differentiate different data types b) registers
c) differentiate different data d) none of these
Answer: (c)

5. Array processing is present in [WBUT 2013]
a) MIMD b) MISD c) SISD d) SIMD
Answer: (d)

6. The vector stride value is required [WBUT 2015]
a) to deal with the length of vectors
b) to find the parallelism in vectors
c) to access the elements in multi-dimensional vectors
d) none of these
Answer: (c)

7. The task of a vectorizing compiler is [WBUT 2015]
a) to convert sequential scalar instructions into vector instructions
b) to process multi-dimensional vectors
c) to execute vector instructions
d) to deal with the length of vectors
Answer: (a)
8. Array processors perform computations to exploit [WBUT 2015]
a) temporal parallelism b) spatial parallelism
c) sequential behaviour of programs d) modularity of programs
Answer: (b)

Short Answer Type Questions

1. How do you speed up memory access in case of vector processing? [WBUT 2005, 2007]
Answer:
Let r be the vector speed ratio and f be the vectorization ratio. For example, if the time it takes to add a vector of 64 integers using the scalar unit is 10 times the time it takes to do it using the vector unit, then r = 10. Moreover, if the total number of operations in a program is 100 and only 10 of these are scalar (after vectorization), then f = 0.9, i.e. 90% of the work is done by the vector unit. The achievable speedup is

    Speedup = (time without the vector unit) / (time with the vector unit)

and in general the speedup is

    Speedup = 1 / ((1 − f) + f/r)

So even if the performance of the vector unit is extremely high (r → ∞), we get a speedup less than 1/(1 − f), which suggests that the vectorization ratio f is crucial to performance, since it poses a limit on the attainable speedup. This ratio depends on the efficiency of the compilation.
Vector instructions that access memory have a known access pattern. If the vector's elements are all adjacent, then fetching the vector from a set of heavily interleaved memory banks works very well. The high latency of initiating a main memory access, versus accessing a cache, is amortised, because a single access is initiated for the entire vector rather than for a single word. Thus the cost of the latency to main memory is seen only once for the entire vector, rather than once for each word of the vector. In this way we can speed up memory access in the case of vector processing.
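The speedup expression above is easy to evaluate; in the sketch below, f is a fraction (0.9 for 90%) and r is the vector/scalar speed ratio, using the example values from this answer:

    # Sketch: speedup = 1 / ((1 - f) + f / r) for vectorized code.

    def vector_speedup(f: float, r: float) -> float:
        return 1.0 / ((1.0 - f) + f / r)

    print(vector_speedup(0.9, 10))      # ~5.26 with r = 10
    print(vector_speedup(0.9, 1e9))     # -> 10, the 1/(1 - f) ceiling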

2. What do you mean by pipelined chaining? [WBUT 2005, 2010, 2013]
Answer:
Pipeline chaining is a linking process that occurs when results obtained from one pipeline unit are directly fed into the operand registers of another functional pipe. In other words, intermediate results do not have to be restored into memory and can be used even before the vector operation is completed. Chaining permits successive operations to be issued as soon as the first result becomes available as an operand. The desired functional pipes and operand registers must be properly reserved; otherwise, the chained operations have to be suspended until the demanded resources become available.

3. Discuss vector instruction format or vector instructions. [WBUT 2006]
OR,
Define the various types of vector instructions. [WBUT 2010, 2014]
Answer:
There are basically three different types of vector instructions, classified by their mathematical mapping, as given below.
Vector-vector instructions: From different vector registers, one or more vector operands enter a functional pipeline unit and the result is sent to another vector register. This type of vector operation is called a vector-vector instruction, as shown in the figure below, where Va, Vb and Vc are different vector registers, and it can be defined by the following two mapping functions:

    f1: Va -> Vb    and    f2: Vb × Vc -> Va

Fig: Vector-vector instruction

Vector-scalar instructions: In vector-scalar instructions the input operands of the functional unit enter from a scalar register and a vector register, and a vector output is produced, as shown in the figure below, where Va and Vb are vector registers and S is a scalar register. It can be defined by the mapping function

    f1: S × Va -> Vb

Fig: Vector-scalar instruction
Vector-memory instructions: A vector-memory instruction is a vector load or vector store operation between a vector register and memory. It can be defined by the following two mapping functions:

    f1: M -> V (vector load)    and    f2: V -> M (vector store)

Fig: Vector-memory instructions

4. Discuss strip mining and vector stride in vector instructions. [WBUT 2008, 2012]
Answer:
Vector lengths do not often correspond to the length of the vector registers. For shorter vectors we can use a vector length register, applied to each vector operation. If a vector to be processed has a length greater than that of the vector registers, then strip-mining is used, whereby the original vector is divided into equal-size segments (equal to the size of the vector registers) and these segments are processed in sequence. The process of strip-mining is usually performed by the compiler, but in some architectures it can be done by the hardware. The strip-mined loop consists of a sequence of convoys.
The vector elements are ordered to have a fixed addressing increment between successive elements, called the stride or skew distance; it is the distance separating elements in memory that will be adjacent in a vector register. The value of the stride can be different for different variables. When a vector is loaded into a vector register with stride 1, all the elements of the vector are adjacent. Non-unit strides can cause major problems for the memory system, which is based on unit stride (i.e. the elements lying one after another in successive interleaved memory banks). Caches deal with unit stride and behave badly for non-unit stride. To account for non-unit stride, most systems have a stride register that the memory system can use for loading the elements of a vector register; however, the memory interleaving may not support rapid loading. The vector stride technique is used when the elements of vectors are not adjacent.

5. What is a vector processor? Give the block diagram to indicate the architecture of a typical vector processor with multiple function pipes. [WBUT 2008, 2010 - short note]
Answer:
Vector processors are specialized, heavily pipelined processors that perform efficient operations on entire vectors and matrices at once. This class of processor is suited for applications that can benefit from a high degree of parallelism. Register-register vector processors require all operations to use registers as source and destination operands; memory-memory vector processors allow operands from memory to be routed directly to the arithmetic unit. A vector is a fixed-length, one-dimensional array of values, or an ordered series of scalar quantities, and vector arithmetic operations, including addition, subtraction and multiplication, are defined over vectors.
A vector processor includes a set of vector registers for storing data to be used in the execution of instructions and a vector functional unit coupled to the vector registers for executing instructions. The functional unit executes instructions using operation codes provided to it, and these operation codes include a field referencing a special register. The special register contains information about the length and starting point for each vector instruction. A series of instructions enabling rapid handling of image pixel data may also be provided.

Fig: The architecture of a vector processor with multiple function pipes (scalar processor with scalar processing pipes, high-speed main memory, vector instruction controller, vector access controller, and vector registers feeding the vector function pipes)

6. Explain the concept of strip mining used in vector processors. Why do vector processors use memory banks? [WBUT 2009]
Answer:
When a vector has a length greater than that of the vector registers, segmentation of the long vector into fixed-length segments is necessary. This technique is called strip-mining: one vector segment is processed at a time. As an example, the vector segment length is 64 elements in Cray computers. Until all the vector elements in a segment are processed, the vector register cannot be assigned to another vector operation. Strip-mining is restricted by the number of available vector registers and by vector chaining.
To allow pipelined access to vector elements stored in memory, the memory of a vector processor is divided into memory banks. Interleaved memory banks associate successive memory addresses with successive banks cyclically. One memory access (a load or store of a data value in a memory bank) takes several clock cycles to complete, and each memory bank allows only one data value to be read or stored in a single memory access; but more than one memory bank may be accessed at the same time. When the elements stored in an interleaved memory are read into a vector register, the reads are staggered across the memory banks so that one vector element is read from a bank per clock cycle. If one memory access takes n clock cycles, then n elements of a vector may be fetched at the cost of one memory access; this is n times faster than the same number of memory accesses to a single bank.
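Tying together the strip-mining idea from Questions 4 and 6, the sketch below processes an N-element vector add in segments no longer than the vector register length (64, as in the Cray example); the vector length and data are illustrative:

    # Sketch of strip-mining: a long vector add processed in segments
    # bounded by the vector register length.

    MVL = 64                       # vector register length

    def strip_mined_add(a, b):
        c = [0] * len(a)
        for start in range(0, len(a), MVL):        # one segment per pass
            end = min(start + MVL, len(a))
            # this inner slice is what a single vector instruction would do
            c[start:end] = [x + y for x, y in zip(a[start:end], b[start:end])]
        return c

    n = 200                        # not a multiple of 64: last strip is short
    print(strip_mined_add(list(range(n)), list(range(n)))[:5], "...")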
7. What is a vector array processor? Explain with example. [WBUT 2009]
Answer:
A vector processor is a processor that can operate on entire vectors with one instruction, i.e. the operands of some instructions specify complete vectors. For example, consider the following add instruction:

    C = A + B

In both scalar and vector machines this means "add the contents of A to the contents of B and put the sum in C". In a scalar machine the operands are numbers, but in vector processors the operands are vectors, and the instruction directs the machine to compute the pair-wise sum of each pair of vector elements. A vector processor register, usually called the vector length register, tells the processor how many individual additions to perform when it adds the vectors.
A key division of vector processors arises from the way the instructions access their operands. In the memory-to-memory organization the operands are fetched from memory and routed directly to the functional unit, and results are streamed back out to memory as the operation proceeds. In the register-to-register organization operands are first loaded into a set of vector registers, each of which can hold a segment of a vector, for example 64 elements; the vector operation then proceeds by fetching the operands from the vector registers and returning the results to a vector register. The first vector computers were of the memory-to-memory type, as were CDC's vector computers; the Cray Research processors (Cray-1, Cray-2) use vector-register architecture.

Long Answer Type Questions

1. What are the different types of vector operations? Give the different fields in a vector instruction. Explain with example. [WBUT 2005, 2013]
Answer:
There are two primary types of vector operations:
Vector-register operations
Memory-memory vector operations
In vector-register operations, all vector operations except load and store are among the vector registers. All major vector computers use vector-register architecture, including the Cray Research processors. In memory-memory vector operations, all vector operations are memory to memory; the first vector computers were of this type, as were CDC's vector computers.
The vector instructions are of the following types:
Vector-vector instructions:
    f1: Vi -> Vj (e.g. MOVE Va, Vb)
    f2: Vj × Vk -> Vi (e.g. ADD Va, Vb, Vc)
Vector-scalar instructions:
    f3: s × Vi -> Vj (e.g. ADD R1, Va, Vb)
Vector-memory instructions:
    f4: M -> V (e.g. vector load)
    f5: V -> M (e.g. vector store)
Vector reduction instructions:
    f6: V -> s (e.g. ADD V, s)
    f7: Vi × Vj -> s (e.g. DOT Va, Vb, s)
Gather and scatter instructions:
    f8: M × Va -> Vb (e.g. gather)
    f9: Va × Vb -> M (e.g. scatter)
Masking instructions:
    f10: Va × Vm -> Vb (e.g. MMOVE V1, V2, V3)
Gather and scatter are used to process sparse matrices and vectors. The gather operation uses a base address and a set of indices to access, from memory, a few of the elements of a large vector into one of the vector registers; the scatter operation does the opposite. The masking operations allow conditional execution of an instruction based on a "masking" register. A small sketch of gather and scatter follows.
COMPUTER
ARCHIIECTURE
POPULAR PUBLIÇATIONS
When first loaded.the
th model
contains
A in memory a program which reverses
locations 0 and 2 the order of the
or double words apart. This distance separatino"
i of the Processing Elements values
successive data references is i words
gathered into a single i
register is called the stride. Once a vector cach
2of of their memories) and leaves the (initially in locations 0 and

TI
elements that are to be results in location and 3
it had logically adjacent elements.
1

as if memories. of each of their


loaded into a vector register
it acts
The vector stridc. like the vector starting
address, can be put in a general-purpose ector processor is also a CPU design that
sOm is able to run mathematical
be used to fetch the vector into a vector register. In tiple data elements simultaneously.
multip operations on
register Then instniction can This is in contrast to a scalar
always havc a stride value stored in a register, so handles one element at a processor which
vector processors the laads and stores time. A computer with
that only a singlc load and a singlc store instruction
are required. Complications in the ultiple calculations on vectors (one-dimensional built-in instructions that perform

AU
greater than one. When multiple arrays) simultaneously. It is used to
memory sstem can occur from supporting strides colve the same or similar
problems as an array processor;
accesses contend for a bank. a memory bank conflict occurs and one access
must be however, a vector processor
passes a vector to a runctional unit, whereas
an array processor passes each element of a
stalled.A bank conflict, and hence a stal. will occur if vector to a different arithmetic unit.
Nunmber of hank_
Bank buv time Vector processors are based on a single-instruction, multiple data architecturc
Leaxt oommoa multuple (Strnde. Numhr ot hanksi that is
distinctly different than SIMD extension to scalar/superscalar
greater than the vector registers then segmentation of the long
When a vector has a length processors. Each vector
data path has some data independence from the others allowing data path dependent
vector into fixed length segments is necessary. This technique is called strip-mining. One
operations. This allows easier control for wider machines. Single chip vector processors
vector segment is processed at a time. As an example, the vector segment length is 64
can still be low power and easy to program even with eight paralle! vector units. For
AK
clements in Cray computers. Until the entire vector elements in each segment
are many communications algorithms, characterized by high data parallelism, vector single
processed. the vector register cannot be assigned to another vector operation.
Strip- instruction machines end up being the ideal balance of instruction/programming
mining is restricted by the number of available vector, registers and
vector chaining.
simplicity and compactness, while still supporting complex processing requirements and
b) The SIMD-1 Aray Processor consists of a Memory, an Array high performance.
Control Unit (ACU) and
the one-dimensional SIMD aray of simple processing elements
(PEs). The figures show
a 4-processor array. The figures shows the
initial image seen when the model is loaded. 3. a)How do vector processors improve the speed of instruction execution over
Scalar processors? Hlustrate with an-example.
we need it in a vector processor?
VOY b) What is vectorizing compler?Why do
M

WBUT 2015]
Answer: are Memor
optimizatnon Seeincs uscd in vector processors. banks
PC mun, o a) Many Derformance latency. Strip mining is used to generate code
Aray Coal UtcC reduce load/store so that ector
RNOP
ed to reau
are used
r
vector operands whose size is ess than
or greater than the size o
PE-IR
PEC
N0F operation is pO r
chaining the equivalent ol torwarding in vector processors
vector registe dependency among vector instructions. Special scatter and
is
gather
SMDAT uscd in case O provided to efficientty operate on sparse matrices.
E

instructic are
The ACU is a simple load/store. register-register com iler nust be devclop to detect the concurreney
arithmetic processor. It has 16 intelligent valized with pipelining or with among vector
purpose registers. a Program Counter general b) An be realiz the chaining of pipelines. A
(PC), a Condition code which
Instruction Register ( AÇ-IR). The
Program Counter has
Register (CC) and an cld
instructions compiler woule regeneratc parallelisim lost in the use of
s ral lan.
of sequential
AR

ages languages.
oh with rich par
label ficld is initially set two fields: label and offset. The level programmimg languages
to "main" and the offset vectorizinEto Use.llowing four ehparallel constructs on
registers. the Processing Element Instruction to zero. The ACU also uses two other following
progranmminses have been rce
four stages ha
Element Control register (PEC) Register (PE-IR) and processors. 1he DrOgrammung. The paraten recognicd in the development of
t is desirable
which are global registers the Processing vector advancod sau * parameter in parentheses indicates the
SIMD Aray. The Processing
Elements operate
used to cominunicate with
the parallelism in explorabIeat ench age degree
(deteined by the state of its PEC biu) in lock step.
ie. each active PE parallelism
algorithn(A)
Whenever a Pt. ACC is updated by a PE. obeys the same instruction at the same
instruction, the PE. time. of
Parallel language(1.)
High-level codet()
cach of its neighbors sends the new ACC value
t0 ojoctcodefnM)
Efficient
machine
Target CA-5
CA-30
e
W
S
3. a) How do vector processors improve the speed of instruction execution over scalar processors? Illustrate with an example.
b) What is a vectorizing compiler? Why do we need it in a vector processor? [WBUT 2015]
Answer:
a) Many performance optimisation schemes are used in vector processors. Memory banks are used to reduce load/store latency. Strip mining is used to generate code so that vector operation is possible for vector operands whose size is less than or greater than the size of the vector registers. Vector chaining, the equivalent of forwarding in vector processors, is used in case of dependency among vector instructions. Special scatter and gather instructions are provided to operate efficiently on sparse matrices.
b) An intelligent compiler must be developed to detect the concurrency among vector instructions, which can be realised with pipelining or with the chaining of pipelines. A vectorizing compiler regenerates the parallelism lost in the use of sequential languages. The following four stages have been recognised in the development of parallelism in advanced programming; the parameter in parentheses indicates the degree of parallelism explorable at each stage:
Parallelism in the algorithm (A)
Parallelism in the high-level programming language (L)
Parallelism in the object code (O)
Parallelism in the target machine code (M)
The degree of parallelism refers to the number of independent operations that can be performed simultaneously. In the ideal situation, with well-developed parallel languages, we should expect A ≥ L ≥ O ≥ M, as in the figure below.

Fig: The ideal case of using a parallel algorithm with a parallel language (degree of parallelism preserved from algorithm through object code to machine code)

At present, any parallelism in an algorithm is lost when it is expressed in a sequential high-level language. In order to promote parallel processing in machine hardware, an intelligent compiler is needed to regenerate the parallelism through vectorization, as shown in the figure below.

Fig: The case of using a vectorizing compiler with a sequential language (parallelism regenerated at the object-code stage)

The process of replacing a block of sequential code by vector instructions is called vectorization, and the system software which does this regeneration of parallelism is called a vectorizing compiler.

4. Write short notes on the following:
a) Scalar and vector processors [WBUT 2006, 2007]
b) Memory-to-memory vector architecture [WBUT 2010]
c) Vectorizing compilers [WBUT 2010]
d) Vector register architectures [WBUT 2011]
e) Vector stride [WBUT 2012, 2014]
Answer:
a) Scalar and vector processors:
A vector processor is a CPU design that is able to run mathematical operations on multiple data elements simultaneously, in contrast to a scalar processor, which handles one element at a time. A computer with built-in instructions that perform multiple calculations on vectors (one-dimensional arrays) simultaneously is used to solve the same or similar problems as an array processor; however, a vector processor passes a vector to a functional unit, whereas an array processor passes each element of a vector to a different arithmetic unit.
A vector processor for executing vector instructions comprises a plurality of vector registers and a plurality of pipeline arithmetic logic units. The vector registers are constructed with a circuit which operates at a speed equal to 2n times the processing speed of the pipeline arithmetic logic units; either the read or the write operation from or to the vector registers is carried out in the time obtained by multiplying the processing cycle of each of the pipeline arithmetic logic units by n/2.
The simplest processors are scalar processors. Each instruction executed by a scalar processor typically manipulates one or two data items at a time; RISC processors are in this category. A scalar processor includes a plurality of scalar arithmetic logic units and a special function unit. Each scalar unit performs, in a different time interval, the same operation on a different data item, where each different time interval is one of a plurality of successive, adjacent time intervals, and each unit provides an output data item in the last of its successive time intervals. The special function unit provides the output data item of a selected one of the scalar units in the time interval in which the selected scalar unit completes the operation, so as to avoid conflict among the scalar units. A vector processing unit thus includes an input orthogonal converter, a data buffer, the scalar processor and an output orthogonal converter.

b) Memory-to-memory vector architecture:
To maintain a rate of one word fetched or stored per clock, the memory system must be capable of producing or accepting this much data. This is usually done by creating multiple memory banks; a significant number of banks is useful for dealing with vector loads or stores that access rows or columns of data.
TER ARCHE UR

POPULAR PUBLICATIONS
Main memory
In the register-to-register machines the vectors have a relatively short length, 64 in the case of the Cray family, but the startup time is far less than on the memory-to-memory machines. Thus these machines are much more efficient for operations involving short vectors; but for long vector operations the vector registers must be loaded with each segment before the operation can continue.

Vector-Memory instruction: A vector-memory instruction can be defined by vector load or vector store operations between a vector register and memory. It can also be defined by the following two mapping functions f1 and f2:
f1: M -> V (vector load) and f2: V -> M (vector store)

Fig: Vector memory instructions

Fig: Vector register architecture (eight 64-element vector registers connected through crossbars to the vector functional units: FP add/subtract, FP multiply, FP divide, integer and logical units, together with a set of scalar registers and a vector load-store unit)

In the above figure, there are eight 64-element vector registers, and all the functional units are vector functional units.

Vector functional units: Each unit is fully pipelined and can start a new operation on every clock cycle. A control unit is needed to detect hazards. In the above figure, there are five functional units.

Vector load-store unit: This is a vector memory unit that loads or stores a vector to or from memory. Here, vector loads and stores are fully pipelined, so that words can be moved between the vector registers and memory with a bandwidth of one word per clock cycle, after an initial latency.

Set of scalar registers: Scalar registers can also provide data as input to the vector functional units, as well as compute addresses to pass to the vector load-store unit.

Register-to-register machines now dominate the vector computer market, with a number of offerings from Cray Research Inc., including the Y-MP and the C-90. The basic processor architecture of the Cray supercomputers has changed little since the Cray-1 was introduced in 1976. There are 8 vector registers, named V0 through V7, which each hold 64 64-bit words. There are also 8 scalar registers, which hold single 64-bit words, and 8 address registers (for pointers) that have 20-bit words. Instead of a cache, these machines have a set of backup registers for the scalar and address registers; transfer to and from the backup registers is done under program control, rather than by lower-level hardware using dynamic memory referencing patterns. The original Cray-1 had 12 pipelined data processing units; newer Cray systems have 14. There are separate pipelines for addition, multiplication, computing the reciprocal (to divide x by y, a Cray computes x(1/y)) and logical operations. The cycle time of the data processing pipelines is carefully matched to the memory cycle times. The memory system delivers one value per clock cycle through the use of 4-way interleaved memory.

Execution Time: The execution time of a sequence of vector operations primarily depends on three factors: the length of the operand vectors, structural hazards among the operations, and data dependences. We can compute the time for a single vector instruction from the vector length and the initiation rate, which is the rate at which a vector unit consumes new operands and produces new results. All modern supercomputers have vector functional units with multiple parallel pipelines that can produce two or more results per clock cycle.

c) Vectorizing compilers: Refer to Question No. 4(b) of Long Answer Type Questions.

d) Vector register architecture: Each vector register is a fixed-length bank holding a single vector. Each vector register must have at least two read ports and one write port. This allows a high degree of overlap among vector operations to different vector registers. The read and write ports, which total at least 16 read ports and 8 write ports, are connected to the functional unit inputs or outputs by a pair of crossbars.

e) Vector Stride: Refer to Question No. 4 of Short Answer Type Questions.
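The segment-by-segment handling of long vectors described at the start of this answer is commonly called strip-mining. The following minimal Python sketch illustrates the idea, assuming a Cray-style maximum vector length of 64 elements; the function name and structure are illustrative only, not taken from any particular machine's instruction set.

    # Strip-mining sketch: a long vector operation is processed in
    # MVL-sized segments, mirroring how a register-to-register machine
    # must reload its vector registers for each segment of a long vector.
    MVL = 64  # assumed maximum vector length (Cray-style registers)

    def vector_add(a, b):
        result = []
        for start in range(0, len(a), MVL):       # one iteration per segment
            seg_a = a[start:start + MVL]           # "load" vector register V1
            seg_b = b[start:start + MVL]           # "load" vector register V2
            result.extend(x + y for x, y in zip(seg_a, seg_b))  # V3 = V1 + V2
        return result

    print(vector_add(list(range(200)), list(range(200)))[:5])   # [0, 2, 4, 6, 8]

Each iteration corresponds to one load of the vector registers plus one pipelined vector operation, which is why the startup overhead discussed above is paid once per 64-element segment.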

Vector and array processors are essentially the same type of processor, with slight differences. A vector processor, or central processing unit (CPU), is a computer chip that handles most of the information and functions processed through a computer. A vector processor employs multiple vector pipelines. An array processor uses a number of processing elements operating in parallel and requires a host processor. An array processor is a SIMD processor, i.e. a synchronous parallel processor containing multiple ALUs. Each ALU contains a local memory; the ALU together with the local memory is called a processing element (PE). The PEs are synchronized to perform the same operation simultaneously. The host processor is a scalar processor. The instructions are fetched and decoded by the control processor; the vector instructions are sent to the PEs for distributed execution over different elements of the vector operand. These vector elements are contained in the local memories. The PEs are passive devices without instruction decoding capabilities. Vector and array processing technology is not usually used in home or office computers; it is most often seen in high-traffic servers. Servers are racks of storage drives designed to house information and allow access to several different users at different computers located on a computer network. Scalar processing technology operates on different principles than vector and array processing technology and is the most common type of processing hardware used in the average computer. A superscalar processor operates like a scalar processor, but it has many different units within the CPU which each handle and process data simultaneously. The higher-performance superscalar processor type is also equipped with programming that makes it efficiently assign data processing to the available scalar units within the CPU. Most modern home computer processors are superscalar.

FLYNN'S TAXONOMY OF COMPUTER ARCHITECTURE

Multiple Choice Type Questions

1. Advantage of MMX technology lies in [WBUT 2010]
a) Multimedia application b) VGA c) CGA d) none of these
Answer: (a)

2. Array Processor is present in [WBUT 2010]
a) SIMD b) MISD c) MIMD d) none of these
Answer: (a)

3. Which one of the following has no practical usage? [WBUT 2010, 2014, 2016]
a) SISD b) SIMD c) MISD d) MIMD
Answer: (c)

4. The expression for Amdahl's law is [WBUT 2011, 2016]
a) S(n) = 1/f, where n → 0
b) S(n) = f, where n → ∞
c) S(n) = 1/f, where n → ∞
d) None of these
Answer: (c)
[Here f is the inherently sequential fraction of the workload, so the speedup S(n) = 1/(f + (1 - f)/n) approaches 1/f as the number of processors n → ∞.]

5. Which MIMD systems are best according to scalability with respect to the number of processors? [WBUT 2011]
a) Distributed memory computers b) ccNUMA systems
c) nccNUMA systems d) Symmetric multiprocessors
Answer: (a)

6. Superscalar processors have CPI of [WBUT 2011]
a) less than 1 b) greater than 1 c) more than 2 d) greater than 3
Answer: (a)

7. The main memory of a computer has 2cm blocks while the cache has 2c blocks. If the cache uses the set associative mapping scheme with 2 blocks per set, then block k of the main memory maps to the set [WBUT 2011, 2016]
a) (k mod 2c) of the cache b) (k mod c) of the cache
c) (k mod m) of the cache d) (k mod 2m) of the cache
Answer: (b)

8. The vector stride value is required to [WBUT 2011]
a) deal with the parallelism in vectors
b) execute the vector instruction
c) access the elements in multi-dimensional vectors
d) find the length of the vector
Answer: (c)
9. As the bus in a multiprocessor is a shared resource, there must be some mechanism to resolve conflicts. Which of the following is not a conflict resolution technique? [WBUT 2016]
a) static priority algorithm b) FIFO algorithm
c) LRU algorithm d) Daisy Chaining algorithm
Answer: (a)

Short Answer Type Questions

1. Discuss Flynn's classification of parallel computers. [WBUT 2006, 2007, 2009, 2010]
OR,
Describe Flynn's classification of computer architecture. [WBUT 2012]
OR,
Explain in brief with neat diagrams the Flynn's classifications of computers. [WBUT 2013]
OR,
Explain Flynn's classification. [WBUT 2016]
Answer:
The four classifications defined by Flynn are based upon the number of concurrent instruction (or control) streams and data streams available in the architecture:

Single Instruction, Single Data stream (SISD): A sequential computer which exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are the traditional uni-processor machines like a PC or old mainframes.
Fig: SISD Architecture (one control unit driving one processing unit over a single instruction stream and a single data stream)

Single Instruction, Multiple Data streams (SIMD): A computer which exploits multiple data streams against a single instruction stream to perform operations which may be naturally parallelized, for example an array processor.
Fig: SIMD Architecture (one control unit broadcasting a single instruction stream to several processing units, each with its own data stream and data memory)

Multiple Instructions, Single Data stream (MISD): Multiple instructions operate on a single data stream. This is an uncommon architecture which is generally used for fault tolerance: heterogeneous systems operate on the same data stream and must agree on the result. Examples include the Space Shuttle flight control computer.
Fig: MISD Architecture (several control unit/processing unit pairs, each with its own instruction stream, sharing one data stream)

Multiple Instructions, Multiple Data streams (MIMD): Multiple autonomous processors simultaneously execute different instructions on different data. Distributed systems are generally recognized to be MIMD architectures, either exploiting a single shared memory space or a distributed memory space.
Fig: MIMD Architecture (several control unit/processing unit pairs, each with its own instruction stream and its own data stream)
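As a toy illustration of the taxonomy just described, the following Python fragment classifies a machine from its two attributes; the table and the example labels are simply a restatement of the four classes above.

    # Flynn's taxonomy as a lookup on (instruction streams, data streams).
    flynn = {
        ("single", "single"):     "SISD - e.g. a traditional uni-processor",
        ("single", "multiple"):   "SIMD - e.g. an array processor",
        ("multiple", "single"):   "MISD - rare, used for fault tolerance",
        ("multiple", "multiple"): "MIMD - e.g. a multiprocessor or cluster",
    }

    def classify(instruction_streams, data_streams):
        return flynn[(instruction_streams, data_streams)]

    print(classify("single", "multiple"))   # SIMD - e.g. an array processor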
2. Implement the data routing logic of SIMD architecture to compute
S(k) = A0 + A1 + ... + Ak for k = 0, 1, 2, ..., N-1. [WBUT 2008]
OR,
Why do we need a masking mechanism in SIMD array processors?
In an SIMD array processor of 8 PEs, the sum S(k) of the first k components in vector A is desired for each k from 0 to 7. Let A = (A0, A1, ..., A7). We need to compute the following:
S(k) = A0 + A1 + ... + Ak for k = 0 to 7.
Discuss how data-routing and masking are performed in the array processor. [WBUT 2015]
Answer:
The masking technique for an SIMD processor is capable of masking a plurality of individual machine operations within a single instruction incorporating a plurality of operations. To accomplish this, each different machine operation within the instruction includes a number of masking bits which address a specific location in a mask register. The mask register includes a mask bit bank. The mask location selected within the mask register is bit-wise ANDed with a mask context bit in order to establish whether the processing element will be enabled or disabled for a particular conditional sub-routine.
We show the execution details of the above vector instruction in an array of N processing elements (PEs) to illustrate the necessity of data routing in an array processor. Here the sum S(k) of the first k components in a vector A is desired for each k from 0 to n-1. Now, A = (A0, A1, ..., An-1). So, the following n summations are
S(k) = A0 + A1 + ... + Ak for k = 0, 1, 2, ..., n-1.
These n vector summations can be computed recursively by going through the following n-1 iterations:
S(0) = A0
S(k) = S(k-1) + Ak for k = 1, 2, 3, ..., n-1
The above recursive summations for the case of n = 8 are implemented in an array processor with N = 8 PEs in log2 n = 3 steps, as shown in the figure below. At first, each Ai is transferred to the Ri register in PEi for i = 0, 1, ..., n-1. In step 1, Ai is routed from Ri to Ri+1 and added to Ai+1, with the resulting sum Ai + Ai+1 left in Ri+1 for i = 0, 1, ..., 6. In step 2, the intermediate sums in Ri are routed to Ri+2 for i = 0 to 5. In step 3, the intermediate sums in Ri are routed to Ri+4 for i = 0 to 3. In each step the PEs that would have to receive data from outside the array are masked off. Finally, PEk holds the final value of S(k) for k = 0, 1, 2, ..., 7, as shown in the last column of the figure.
Fig: The calculation of the summation S(k) = A0 + ... + Ak for k = 0, 1, 2, ..., n-1 in an SIMD machine (Step 1, Step 2, Step 3)
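A minimal Python simulation of this routing-and-masking scheme is sketched below; the register array R and the masking rule follow the description above, while everything else (function name, sample data) is illustrative only.

    import math

    def simd_prefix_sum(A):
        # R[i] models the routing register of PE_i; initially R_i = A_i.
        R = list(A)
        n = len(R)                       # n = 8 PEs in the question
        steps = int(math.log2(n))        # log2(8) = 3 routing steps
        for s in range(steps):
            d = 2 ** s                   # routing distance: 1, 2, 4
            prev = R[:]                  # values before this step's routing
            for i in range(n):
                if i >= d:               # PEs with i < d are masked off
                    R[i] = prev[i] + prev[i - d]
        return R                         # R[k] now holds S(k) = A_0 + ... + A_k

    print(simd_prefix_sum([1, 2, 3, 4, 5, 6, 7, 8]))
    # [1, 3, 6, 10, 15, 21, 28, 36]

Each outer iteration corresponds to one route-and-add step of the figure, and the test i >= d plays the role of the PE mask.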
3. A 50 MHz processor was used to execute a program with the following instruction mix and clock cycle counts:

Instruction Type              Instruction Count    Clock Cycle Count
Integer arithmetic                50000                  2
Data transfer                     70000                  3
Floating point arithmetic         25000                  1
Branch                             4000                  2

Calculate the effective CPI and MIPS rate for this program. [WBUT 2011]
Answer:
We know, CPU time = Instruction Count (IC) x Clocks Per Instruction (CPI) x Clock Cycle Time (CCT),
where IC_i = number of times the i-th instruction type is executed in the program, and CPI_i = number of clock cycles for the i-th instruction type.
The average value of clocks per instruction is given by
CPI = Σ_i (f_i x CPI_i),
where f_i = frequency of occurrence of the i-th instruction type in the program.
Also, MIPS rate = IC / (CPU time x 10^6) = Clock rate / (CPI x 10^6).
Given: Clock rate = 50 MHz.
Effective CPI = (50000 x 2 + 70000 x 3 + 25000 x 1 + 4000 x 2) / (50000 + 70000 + 25000 + 4000)
= 343000 / 149000 ≈ 2.30
MIPS rate = (50 x 10^6) / (2.30 x 10^6) ≈ 21.7
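The same computation can be written as a short Python routine; the instruction mix below is the one read from the (partly illegible) table above, so treat the specific numbers as assumptions rather than authoritative data.

    # Effective CPI and MIPS rate from an instruction mix.
    # Each entry: (instruction count, clock cycles per instruction).
    mix = {
        "integer arithmetic": (50000, 2),
        "data transfer":      (70000, 3),
        "floating point":     (25000, 1),
        "branch":             (4000, 2),
    }
    clock_rate_hz = 50e6  # 50 MHz processor

    total_instructions = sum(count for count, _ in mix.values())
    total_cycles = sum(count * cpi for count, cpi in mix.values())

    effective_cpi = total_cycles / total_instructions
    mips = clock_rate_hz / (effective_cpi * 1e6)

    print(f"Effective CPI = {effective_cpi:.2f}")   # ~2.30
    print(f"MIPS rate     = {mips:.1f}")            # ~21.7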
based on the
Answer: Model and
Instruction-evel parallelism (1LP) is measure another is Shared-Memory Model. Most SIMD computers use a single control
a of how many of the operations in unit and
computer program can be performed simultaneously. a distributed memories, except for a few that use associative memories. The instruction
The potential overlap among set
instructions is called instruction of an SIMD computer is decoded by the array control unit. The processing elemenis
level parallelism. A goal
designers is to identify and of compiler and processor (PEs) in the SIMD aray are passive ALU executing instructions broadcast from thee
take advantage of as much ILP
programs are typically written as possible. Ordinary control unit.
under a sequential execution model
M

execute one after the other and


in the order specified by the
where instructions Distributed-Memory Modele A distributed-memory SIMD computer consists of an
compiler and the processor programmer. ILP allows the are
array of PEs which controlled by the same array control unit, as shown in Fig: 1.
to overlap the execution of multiple
change the order in which instructions instructions or even to Scalar
are executed. Processor
Mass Storag

Scalar Instruction
Long Answer Type guestions Network

1. Differentiate between
multiprocessors and
Control Control Memory
(Program and Data)
H Host
Computer
E

structures, resource multicomputer based on (Uscr)


sharing and inter processor their Vector
Answer: communication. WBUT 2007] Broadcast Bus
A multicomputer comprises (Instructions and
interconnection network. a number of von Neumann computers, Constants))
Each computer executes or nodes, linked an
access local memory and its own program. This programby Data
AR

may send and receive may Bus


used to communicate messages over
the network. Messages aree
with other computers
memories. In the idealized or, equivalently.
network, the cost to read and write remote
independent of both of sending a message PE: Processing
node location and between two nodes is
length. other network traffic. Element
but does depend on message
A defining attribute Data Routing Networ LM: Laca
of the muticomputer
memory are less expensive model is that Mcmory
than accesses accesses to local (same-node)
read and write are less costly to remote Distributed -Memory
than send and receive. (diflerent-node) memory. Model SIMD architecture
Hence, it is That is,
desirable that accesses
CA$2 to CA-3
e
W
S
AN
OMPUTERARCH
POPULAR PUBLICATIONS ECIU
To CU

Program and data are oaded into the


control mennory through the host computer.

TI
control unit for decoding. It it IS scalar
a or program control
An instruction is sent to the ic
by a scalar processor attached to the control unit.
operation, it will be directhy executed
the decoded instruction is a vector
operation, it will be broadcast to all the PEs f
paralel execution. PE
all the local memories attached to the PEs throuph
To other PEs a the
Partitioned data sets are distnibuted to ee oneuor.
interconnected by a data-routing network which performs Detwork

AU
a vector data bus. The PEs are
inter-PE data communications such as shifting. permutation, and
other routing operations
control unit. The PEs are
The data-routing netword is under program control through the
SVnchronized in hardware by the control unit. Almost all SIMD machines built today are
based on the distributed-memory model. Iliac IV, CM-2 are examples of Distributed
Memon SIMD architecture. L ALU

Shared-Memory Model: In Fig: 2 we show a variation of the SIMD computer using


For0, 1. .N-1
shared memory among the PEs. An alignment network is used as the inter-PE memory
AK
communication network. Again this network is controlled by the control unit. The
alignment nerwork must be properly set to avoid access conflicts. Some SIMD computers
use bit-slice PEs ie. Shared-Memory Model. Example, DAP610 and CM/ 200. PEM

Fig:3 Components of a Processing Element (PE)

Control Memory Scalar PEs can establish an appropriate data routing mechanism. Each PE consists of an ALU
Scalar
ATay Control Unit Instr. OceSsor with registers and loca memory. The PEs are interconnected by a data-routing network.
There are set of local registers and flags. A B, C and S, are prescnt in a PE. The data .
M

Broadcast Bus routing register is R, address register is D and a local index register is When data
Nawori (Vector Instructions) transfer process occurs in PEs, then contains of the data routing register is transferred.
Cantrol

What is the main difference and similarities between multi-computer and


architecture for a typical MIMD processor? Explain the
Multiorocessor? Give the MIMD.
Alignment Network memory modes of WBUT 2011]
shared OR,
MIMD architecture. WBUT 2012, 2014]
Briefly discuss
E

OR,
DaaBus difference
difference and similarities between multi-computer and
What is the WBUT 2014]
multiprocessor?
Fig 2 Shared -Memory Model SIMD architecture
model is callod
Answer: machine model called the multicomputer system. A
multiconputer
AR

b) An aray processor is a synchronous A_parallel ofol yonNeumann computers, or nodes. linked by an interconnection
number
compisesa
parallel computer with multiple arithmetic uter execules own.
proyram. This program may access
US
units. callcd processing clements (PE). The logic Each Co local
PEs are synchronized and receive messagcs over the network.
function al same time. to perform the same netw Messages are used to
nd mher Coimputers or. cquivalently. to
rcadread and write remote memories.
memaote wiuMIMD CompulerceCallea TIS
called the multiprocessor
comnLmemorocessors share access lo a common computer. In
hared- memory. ty pically vía a bus
SO and an
and any processor can access or
buses any memory element in
hierarchy of the same
a CA-65
CA-64
e
W
S
AN
COMPUTER ARCHITEC
ECTURE
POPULARPUBLIGATIONS
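To make the communication contrast concrete, here is a small Python sketch contrasting the two styles; it uses threads with a shared variable for the multiprocessor case and an explicit queue standing in for the interconnection network in the multicomputer case. The scenario and all names are invented for illustration.

    import threading, queue

    # Multiprocessor style: workers read/write one shared memory location.
    shared = {"total": 0}
    lock = threading.Lock()

    def shared_memory_worker(values):
        for v in values:
            with lock:                  # synchronise access to shared memory
                shared["total"] += v

    # Multicomputer style: a node owns its data and sends a message instead.
    network = queue.Queue()             # stands in for the interconnection network

    def message_passing_worker(values):
        network.put(sum(values))        # send the local result as a message

    t1 = threading.Thread(target=shared_memory_worker, args=([1, 2, 3],))
    t2 = threading.Thread(target=message_passing_worker, args=([4, 5, 6],))
    t1.start(); t2.start(); t1.join(); t2.join()

    print(shared["total"])              # 6, communicated via shared memory
    print(network.get())                # 15, received over the "network"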
4. Why do we need parallel processing? What are the different levels of parallel processing? Explain. [WBUT 2015]
Answer:
In computers, parallel processing is the processing of program instructions by dividing them among multiple processors, with the objective of running a program in less time. In the earliest computers, only one program ran at a time: a computation-intensive program that took one hour to run and a tape-copying program that took one hour to run would take a total of two hours to run. An early form of parallel processing allowed the interleaved execution of both programs together: the computer would start an I/O operation, and while it was waiting for the operation to complete, it would execute the processor-intensive program. The total execution time for the two jobs would then be a little over one hour.
Levels of parallel processing: We can have parallel processing at four levels.
i) Instruction Level: Most processors have several execution units and can execute several (usually machine-level) instructions at the same time. Good compilers can reorder instructions to maximize instruction throughput. Modern processors even parallelize the execution of micro-steps of instructions within the same pipe.
ii) Loop Level: Here, consecutive loop iterations are candidates for parallel execution. However, data dependences between subsequent iterations may restrict parallel execution of instructions at loop level. There is a lot of scope for parallel execution at loop level.
iii) Procedure Level: Here, parallelism is available in the form of parallel executable procedures. The design of the algorithm plays a major role: for example, each thread in Java can be spawned to run a function or method.
iv) Program Level: This is usually the responsibility of the operating system, which runs processes concurrently. Different programs are obviously independent of each other, so parallelism can be extracted by the operating system at this level.

Write short notes on the following:
MMX Technology [WBUT 2005, 2007, 2010]
CM-2 machine [WBUT 2005, 2006, 2007]
Flynn's classification [WBUT 2008]
Array processor [WBUT 2011]
Answer:
a) Array processor:
The SIMD Array Processor consists of a Memory, an Array Control Unit (ACU) and a one-dimensional SIMD array of simple processing elements (PEs). The figure shows the initial image of a 4-processor array seen when the model is loaded.
Fig: SIMD Array (Memory, Array Control Unit with PC, CC and AC-IR registers, and the PE array with PE-IR and PEC registers)
The system operates on a two-phase clock. In clock cycles in which they are active, each unit executes its internal actions in the first phase of the clock and sends out a result packet in the second phase. The Memory, for example, reads an instruction or operand in the first phase and sends its output to the ACU in the second phase.
The ACU is a simple load/store, register-register arithmetic processor. It has 16 general purpose registers, a Program Counter (PC), a Condition Code register (CC) and an ACU Instruction Register (AC-IR). The Program Counter has two fields, label and offset; the label field is initially set to "main" and the offset to zero. The ACU also uses two other registers, the Processing Element Instruction Register (PE-IR) and the Processing Element Control register (PEC), which are global registers used to communicate with the SIMD array.
The Processing Elements operate in lock step, i.e. each active PE (determined by the state of its PEC bit) obeys the same instruction at the same time. Whenever a PE's ACC is updated by a PE instruction, the PE sends the new ACC value to each of its neighbours. When first loaded, the model contains a program which reverses the order of the values held in locations 0 and 2 of each Processing Element's memory and leaves the results in locations 1 and 3 of each of their memories.

b) MMX Technology:
MMX technology is an extension to the Intel Architecture (IA) designed to improve the performance of multimedia and communication algorithms. The first microprocessor to implement it was the Pentium processor with MMX technology. MMX consists of two main architectural improvements: a new instruction set and microarchitectural performance improvements.
The MMX technology provides several improvements over the non-MMX Pentium microprocessors:
- There are 57 new microprocessor instructions that have been designed to handle video, audio and graphical data more efficiently. Programs can use MMX instructions without changing to a new mode or operating-system visible state.
- A new process, Single Instruction Multiple Data (SIMD), makes it possible for one instruction to perform the same operation on multiple data items. A 64-bit integer data type is also added with MMX technology.
- The memory cache on the microprocessor has increased to 32 KB, meaning fewer accesses to memory that is off the microprocessor.
All MMX chips have a larger internal L1 cache than their non-MMX counterparts. This improves the performance of any software running on the chip, regardless of whether it actually uses the MMX-specific instructions or not.
A key feature of the Pentium processor with MMX implementation was the design of a new, dedicated, high-performance MMX pipeline, which was able to execute two MMX instructions with minimal logic changes in the existing units. In addition, the design goal was to stay on the microprocessor's performance curve. With the addition of new instructions, the instruction decode logic had to be modified to decode, schedule and issue the new instructions at a rate of up to two instructions per clock.
Frequency Speedup: To simplify the design and to meet the core frequency goal, the pipeline of the Pentium processor with MMX was extended with a new pipeline stage (length decode), in order to maintain and improve the CPI (clocks per instruction). Part of the gain of MMX technology is due to modifications that increase the clock rate. As we know,
Execution Time = (No. of instructions) x (CPI) x (Clock Cycle Time),
i.e. increasing the clock rate decreases the clock cycle time, which in turn decreases execution time. So, in order to increase the clock rate, the MMX Pentium designers needed to find and eliminate some bottlenecks. The two major bottlenecks were the instruction decoder and the data-cache access. The decoder bottleneck was fixed first. An instruction in the old 5-stage pipe goes through Fetch, Decode1, Decode2, Execute, Write-Back. To speed things up, a 6th stage, Prefetch, was added before Fetch, and a queue was added between Prefetch and Fetch to decouple freezes. So now an instruction goes through: Prefetch, Fetch, Decode1, Decode2, Execute, Write-Back. After adding this new stage, the machine was rebalanced to take advantage of the extra clock cycle, as shown in the figure below.
Fig: Block diagram of the Pentium Processor with MMX technology (prefetch buffers, BTB, 16K code cache, instruction decode, integer execution units, MMX unit, TLB, page unit, 16K data cache and bus unit)
Although adding a pipeline stage improves frequency, it decreases CPI performance: the longer the pipeline, the more work is done speculatively by the machine, and therefore the more work is thrown away in the case of a branch miss-prediction. The additional pipeline stage decreased the CPI performance of the processor by 5-6%.

c) CM-2 machine:
The CM-2 was an SIMD-architecture-based machine. The PEs in the CM-2 were capable of performing bit-serial arithmetic. The control processor, or sequencer, could decompose an 8-bit operation, for example, into 8 PE nano-instructions. The CM-2 provides a mechanism for the programmer to assign PEs to groups that will execute at different times; this functionality is achieved through the use of PE instruction masking. Although the PEs and the PE-module floating point accelerator provide extensive processing capability, the programming paradigm was still limited to SIMD. The CM-2 differs from its predecessors through the systematic inclusion of error-detecting and error-correcting circuits within the memories and communication networks. The CM-2 is capable of achieving a peak processing speed of around 10 GFlops.
The CM-2 provides hypercube connections between different processing elements (PEs). The PEs were organized into modules, each having 32 PEs. Within a given module, the PEs were organized into two 16-PE sets, each set having its own router node. All PEs within a given set use shared memory to communicate with one another. Each router node is represented as a vertex in the hypercube. One interesting feature of the routers was that they contained special circuitry for message combining, for messages with the same destination.
In addition to the patterned communication provided via the local memories of PEs within a given module, the CM-2 also supports patterned communications directly across the wires of the hypercube.
Fig: Block diagram of CM-2 machine (front-end computers connected through the scalar memory bus, instruction broadcast bus and global result bus to the sequencers; the processor/memory array with router, NEWS and scanning networks; I/O controllers and frame buffer on the I/O bus)

d) Flynn's classification:
Flynn's Taxonomy is a categorization of computing systems based on the number of instruction streams and data streams they have. Instruction streams suggest separate and independent paths of control and state: each instruction stream has its own program counter and its own registers. Data streams, on the other hand, imply independent pieces of data. A vector of multiple pieces of data may comprise multiple data streams, and independent memory values that are handled concurrently may also comprise separate data streams. Each of the two attributes, Instructions/Data, may be classified as Single/Multiple, yielding four possible systems, as illustrated below:

                          Data Streams
Instruction Streams       Single      Multiple
Single                    SISD        SIMD
Multiple                  MISD        MIMD
MEMORY

Multiple Choice Type Questions

1. A computer with cache access time of 100 ns, a main memory access time of 1000 ns, and a hit ratio of 0.9 produces an average access time of [WBUT 2007, 2010, 2014]
a) 250 ns b) 200 ns c) 190 ns d) none of these
Answer: (c)

2. Consider a high speed memory cache with a successful hit ratio of 80% and an access time of 40 ns. The regular memory has an access time of 100 ns. What is the effective access time for the CPU to access memory? [WBUT 2007, 2008, 2009, 2011]
a) 52 ns b) 60 ns c) 70 ns d) 80 ns
Answer: (a)

3. Assuming a main memory of size 32K x 12, a cache memory of size 512 x 12 and a block size of 1 word, the addressing relationships using direct mapping would be [WBUT 2007]
a) tag field - 6 bits, index field - 9 bits b) tag field - 9 bits, index field - 6 bits
c) tag field - 7 bits, index field - 8 bits d) none of these
Answer: (a)

4. Associative memory is a [WBUT 2008, 2009]
a) pointer addressable memory b) very cheap memory
c) content addressable memory d) slow memory
Answer: (c)

5. The principle of locality justifies the use of [WBUT 2008, 2009]
a) interrupts b) polling c) DMA d) cache memory
Answer: (d)

6. How many address bits are required for a 512 x 4 memory? [WBUT 2008, 2009]
a) 512 b) 4 c) 9 d) 44
Answer: (c)

7. Assume for a system the main memory is of size 16K x 12 and the cache memory is of size 1K x 12. For a direct mapping system which statement is correct? [WBUT 2011]
a) Tag field is 9 bits and index field is 6 bits
b) Tag field is 4 bits and index field is 10 bits
c) Tag field is 7 bits and index field is 8 bits
d) none of these
Answer: (b)
8. In absence of a TLB, to access a physical memory location in a paged-memory system how many memory accesses are required? [WBUT 2012]
a) 1 b) 2 c) 3 d) 4
Answer: (b)

9. A direct mapped cache memory with n blocks is nothing but which of the following set associative cache memory organizations? [WBUT 2012, 2015]
a) 0-way set associative b) 1-way set associative
c) 2-way set associative d) n-way set associative
Answer: (b)
[A direct-mapped cache is a set-associative cache with exactly one block per set.]

10. In which type of memory mapping will there be conflict misses? [WBUT 2013]
a) direct mapping b) set associative mapping
c) associative mapping d) both (a) & (b)
Answer: (d)

11. Virtual address space can be divided into some fixed size [WBUT 2013]
a) segments b) blocks c) pages d) none of these
Answer: (c)

12. Which is not the property of a memory module? [WBUT 2013]
a) inclusion b) consistency c) capability d) locality
Answer: (c)

13. Effective access time (Ta) of memory is given by [WBUT 2014]
a) Ta = h x Tm + (1 - h) x Tc
b) Ta = h x Tc + (1 - h) x Tm
c) Ta = Tc + Tm
d) none of these
(where h is the hit ratio, Tc the cache access time and Tm the main memory access time)
Answer: (b)

14. The compiler optimization technique is used to reduce [WBUT 2015]
a) cache miss penalty b) cache miss rate c) cache hit time d) none of these
Answer: (b)

15. Cache coherence is a potential problem especially [WBUT 2015]
a) in asynchronous parallel algorithm execution in multiprocessor
b) in synchronous parallel algorithm execution in multiprocessor
c) in asynchronous parallel algorithm execution in data flow m/c
d) in synchronous parallel algorithm execution in data flow m/c
Answer: (c)

16. A computer with cache access time of 100 ns, a main memory access time of 1000 ns and a hit ratio of 0.9 produces an average access time of [WBUT 2016]
a) 250 ns b) 200 ns c) 190 ns d) 80 ns
Answer: (c)

Short Answer Type Questions

1. Consider the performance of a main memory organization, when a cache miss has occurred, as: [WBUT 2007]
i) 4 clock cycles to send the address
ii) 24 clock cycles for the access time per word
iii) 4 clock cycles to send a word of data.
Estimate:
a) The miss penalty for a cache block of 4 words.
b) The miss penalty for a 4-way interleaved main memory with a cache block of 4 words.
Answer:
a) The address is sent once (4 cycles); each of the 4 words then requires 24 cycles of access time and 4 cycles to be transferred. Since the size of the main memory block equals the size of the cache block, the whole block is brought into the cache.
So, miss penalty = 4 + 4 x (24 + 4) = 116 clock cycles.
b) For a 4-way interleaved memory, the four words reside in 4 different banks. The address is sent to all banks in the first 4 cycles, the banks access their words in parallel in 24 cycles, and the four words are then sent one by one (4 cycles each).
So, miss penalty = 4 + 24 + 4 x 4 = 44 clock cycles.

2. What do you mean by m-way memory interleaving? In a system with pipeline processing, is memory interleaving useful? If yes, explain why. [WBUT 2008]
Answer:
Memory interleaving is a technique used to improve memory performance. It increases bandwidth by allowing simultaneous access to more than one chunk of memory. This improves the performance of the processor, because it can transfer more information to/from memory in the same amount of time, and it helps alleviate the processor-memory bottleneck that is a major limiting factor in overall performance.
Interleaving works by dividing the system memory into multiple blocks (modules). If there are m blocks, this is called m-way memory interleaving. Each block of memory is accessed using different sets of control lines, which are merged together on the memory bus. When a read or write is begun to one block, a read or write to other blocks can be overlapped with the first one.
In general, a main memory of size 2^l is divided into m modules, where usually m = 2^n for some integer n such that 0 < n < l (l being the number of bits in a memory address). Each main memory address is mapped to a module and to an address within that module. Such a mapping is called a hashing scheme; clearly, the mapping must be one-to-one.
In a pipelined processor, memory interleaving is useful: the instruction fetch and operand access stages issue memory requests in consecutive cycles, and interleaving allows these accesses to proceed in different modules concurrently, so the pipeline is not stalled waiting on a single memory module.
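A small Python sketch of the low-order interleaving scheme just described; the module count m = 4 and the addresses are made-up values for illustration.

    # Low-order m-way interleaving: consecutive addresses fall in
    # consecutive modules, so a burst of sequential accesses is spread
    # across all modules and their access times can be overlapped.
    m = 4  # number of memory modules (4-way interleaving), assumed

    def module_map(address):
        module = address % m        # which module holds the word
        offset = address // m       # address within that module
        return module, offset

    for addr in range(8):           # a sequential access pattern
        print(addr, "->", module_map(addr))
    # 0 -> (0, 0), 1 -> (1, 0), 2 -> (2, 0), 3 -> (3, 0),
    # 4 -> (0, 1), 5 -> (1, 1), ...

Because addresses 0 to 3 land in four different modules, their accesses can overlap; this is exactly why the interleaved miss penalty in Question 1(b) above drops from 116 to 44 cycles.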
3. A computer has cache access time of 100 nanosecs, a main memory access time of 1000 nanosecs and a hit ratio of 0.9.
i) Find the average access time of the memory system.
ii) Suppose that in the computer there is no cache memory; then find the average access time, when the main memory access time is 1000 nanosecs. Compare the two access times. [WBUT 2008]
Answer:
i) The average access time of the memory system is
100 x 0.9 + 1000 x 0.1 = 190 ns.
ii) If there is no cache memory, then the average access time is equal to the main memory access time, i.e. 1000 ns.
So, with cache memory the average access time is 190 ns, and without cache memory it is 1000 ns, more than five times larger.

4. Consider a computer where the clocks per instruction (CPI) is 1.0 when all memory accesses hit (no memory stalls) in the cache. Assume each clock cycle is 2 ns. The only data accesses are loads and stores, and these total 50% of the instructions. Assume the following formula for calculating execution time:
CPU execution time = (CPU clock cycles + Memory stall cycles) x Clock cycle time.
For a program consisting of 100 instructions:
i) Calculate the CPU execution time assuming there are no misses.
ii) Calculate the CPU execution time considering a miss penalty of 25 clock cycles and a miss rate of 2%.
iii) Discuss the difference between write-through and write-back cache policies. [WBUT 2009]
Answer:
Case 1: If there is no miss, then the CPU execution time for 100 instructions is
100 x 1.0 x 2 ns = 200 ns.
Case 2: Each instruction makes one memory access for the instruction itself and, for the 50% of instructions that are loads or stores, one more access for data; so the memory accesses per instruction are 1 + 0.5 = 1.5. With a miss rate of 2% and a miss penalty of 25 clock cycles, the memory stall cycles per instruction are
1.5 x 0.02 x 25 = 0.75.
So the CPU execution time for 100 instructions is
(1.0 + 0.75) x 2 ns x 100 = 350 ns.
Cache write-through and write-back policies: There are two cache write policies, write-through and write-back. For write operations, if the word is in the cache, then there may be two copies of the word, one in the cache and one in main memory. If both are updated simultaneously, this is referred to as the write-through policy, i.e. both cache memory and main memory are updated at the same time. In the write-back policy, the update is performed only in the cache memory; when the cache block is replaced, the main memory is then updated.

5. Assume the performance of a 1-word-wide primary memory organization is:
i) 4 clock cycles to send the address
ii) 56 clock cycles for the access time per word
iii) 4 clock cycles to send a word of data.
Given a cache block of 4 words, and that a word is 8 bytes, calculate the miss penalty and the effective memory bandwidth. Re-compute the miss penalty and the memory bandwidth assuming we have:
i) a main memory width of 2 words
ii) a main memory width of 4 words
iii) an interleaved main memory with 4 banks, each bank 1 word wide. [WBUT 2009]
Answer:
With a 1-word-wide memory, a total of (4 + 56 + 4) = 64 clock cycles is required to access 1 word, and 1 word = 8 bytes; a 4-word block therefore costs 4 x 64 = 256 cycles.
Memory bandwidth = 8/64 = 0.125 bytes per clock cycle.
i) With a main memory width of 2 words, 64 clock cycles deliver 2 words, so the block costs 2 x 64 = 128 cycles.
Memory bandwidth = (8 x 2)/64 = 0.25 bytes per clock cycle.
ii) With a main memory width of 4 words, 64 clock cycles deliver the whole 4-word block, so the miss penalty is 64 cycles.
Memory bandwidth = (8 x 4)/64 = 0.5 bytes per clock cycle.
iii) With an interleaved main memory of 4 one-word-wide banks, the address is sent once, the four banks access their words in parallel, and the words are then transferred one by one:
miss penalty = 4 + 56 + 4 x 4 = 76 cycles.
Memory bandwidth = (8 x 4)/76 ≈ 0.42 bytes per clock cycle.

6. What are logical address and physical address? If the segment no. is 8, the page no. is 30 and the word no. is 40, what will be the corresponding physical address? (For the answer a figure is essential.) [WBUT 2009]
Answer:
In a system there are two types of addresses: logical addresses and physical addresses. A logical address is generated by the CPU; it is the address at which a memory cell appears to reside from the perspective of an executing application program. A physical address is the address of a main memory word, which is permanent. A logical address may be different from the physical address due to the operation of an address translator or mapping function.
Main memory and the disk are divided into a number of equal-sized pages. The basic unit of transfer of data between the two is the page; that is, at any given time, the main memory may consist of pages from various segments. In this case, the virtual address is divided into a segment number, a page number, and a displacement within the page. The segment number selects an entry of the segment table, which holds the physical base address of the page table of that segment; the virtual page number is added to this base address in order to obtain the appropriate entry in the page table. The output of the page table is the physical page address, which, when concatenated with the word field of the virtual address, results in the physical address. For the given values, entry 30 of the page table of segment 8 is consulted, and the block number found there is concatenated with word number 40 to form the physical address.
Fig: paged segment address translation (virtual address = segment number | virtual page number | displacement; the segment table supplies the page table base address, the page table supplies the physical page, and the displacement completes the physical address)

7. What is the limitation of the direct mapping method? Explain with example how it can be improved in set-associative mapping. [WBUT 2009]
Answer:
Direct mapping is the simplest technique of cache mapping. It places an incoming main memory block into a specific fixed cache block location. The placement is done based on a fixed relation between the incoming block number i, the cache block number j, and the number of cache blocks N:
j = i mod N
The main disadvantage of the direct mapping technique is the inefficient use of the cache. This is because, according to this technique, a number of main memory blocks may compete for a given cache block even if there exist other empty cache blocks. The expected poor utilization of the cache by the direct mapping technique is mainly due to the restriction on the placement of the incoming main memory blocks in the cache, i.e. the many-to-one property. This disadvantage leads to a low cache hit ratio.
The set-associative mapping combines aspects of both direct mapping and associative mapping: it combines the simplicity of direct mapping with the flexibility of associative mapping. In the set-associative mapping technique, the cache is divided into a number of sets, each set consisting of a number of blocks. A given main memory block maps to a specific cache set based on the equation
s = i mod S
where S is the number of sets in the cache, i is the main memory block number, and s is the specific cache set to which block i maps. So, in this technique, the blocks of main memory are attached to the different sets of the cache memory as in direct mapping; but within a set there are a number of blocks, and a specific block of main memory may be placed in any block of its set, as in fully associative mapping. Replacement conflicts are thereby reduced and the hit ratio is increased.

8. Describe different techniques to reduce the miss rate. [WBUT 2010]
Answer:
One of the techniques to reduce the cache miss rate is compiler-controlled prefetch. While this approach yields better prefetch "hit" rates than hardware prefetch, it does so at the expense of executing more instructions. Thus, the compiler tends to concentrate on prefetching data that are likely to be cache misses anyway. Loops are key targets, since they operate over large data spaces and their data accesses can be inferred from the loop index in advance.
Another method to reduce the cache miss rate is compiler optimization. This method does not require any hardware modification, yet it can be the most efficient way to eliminate cache misses. The improvement results from better code and data organization. For example, code can be rearranged to avoid conflicts in a direct-mapped cache, and accesses to arrays can be reordered to operate on blocks of data rather than processing rows of the array.

9. Assume that the main memory size is 32K x 12, the cache memory size is 512 x 12 and the block size is 1 word. Describe the following: [WBUT 2010]
a) Direct mapping technique
b) Associative mapping technique
Answer:
The size of the main memory is 32K x 12 = 2^15 x 12, i.e. there is a 15-bit address bus and a 12-bit data bus in the system. The size of the cache memory is 512 x 12 = 2^9 x 12, i.e. the cache holds 512 words of 12 bits, and the block size is 1 word.
a) In the direct mapping technique, the 15-bit CPU address is divided into an index field and a tag field. The number of cache words is 512 = 2^9, so the index field is 9 bits and the tag field is 15 - 9 = 6 bits.
CPU address: | Tag field: 6 bits | Index field: 9 bits |
Each cache word stores the 12-bit data together with its 6-bit tag. An incoming main memory word is placed at the cache location selected by its 9-bit index, and on access the stored tag is compared with the tag bits of the CPU address.
b) In the associative mapping technique, the fastest and most flexible organization uses an associative memory which stores both the complete 15-bit address and the 12-bit data of every cached word. A CPU address is searched in parallel against all stored addresses; if a match is found, the corresponding 12-bit data is returned, otherwise the word is fetched from main memory and the address-data pair is loaded into the associative cache, replacing one of the existing pairs.

10. What is the cache coherence problem? Suggest one method to solve this problem. [WBUT 2011]
OR,
What do you mean by the cache coherence problem? Describe one method to remove this problem and indicate its limitations. [WBUT 2013, 2016]
Answer:
Cache coherence demands a protocol for managing the caches of a multiprocessor system so that no data is lost or overwritten before the data is transferred from a cache to the target memory. When two or more computer processors work together on a single program, known as multiprocessing, each processor may have its own memory cache that is separate from the larger RAM that the individual processors will access. Memory caching is effective because most programs access the same data or instructions over and over. When multiple processors with separate caches share a common memory, it is necessary to keep the caches in a state of coherence by ensuring that any shared operand that is changed in any cache is changed throughout the entire system.
This is done in either of two ways: through a directory-based or a snooping system. In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed, the directory either updates or invalidates the other caches with that entry. Its limitations are the extra directory storage and the latency of consulting the directory on every shared access. In a snooping system, all caches monitor (snoop) the bus to determine whether they have a copy of the block of data that is requested on the bus; this approach is limited to bus-based systems of modest size.

11. How does the principle of locality help in memory hierarchy design? [WBUT 2012]
Answer:
The effectiveness of a memory hierarchy depends on the principle of moving information into the fast memory infrequently and accessing it many times before replacing it with new information. This principle is possible due to a phenomenon called locality of reference: within a given period of time, programs tend to reference a relatively confined area of memory repeatedly. Locality occurs because of the way in which computer programs are created, and because related data is stored in nearby storage locations.
Different types of locality:
Temporal locality (locality in time): a memory location that is referenced by a program at one point in time will tend to be referenced again in the near future; once a particular memory item has been referenced, it is most likely that it will be referenced again soon (e.g. loops, reuse).
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g. straight-line code, array access). When a given address has been referenced, it is most likely that addresses near it will be referenced within a short period of time.
Sequential locality: in typical programs, the execution of instructions follows a sequential order (the program order) unless branch instructions create out-of-order executions.

12. How is a block chosen for replacement in a set-associative cache to resolve a cache miss? [WBUT 2012]
Answer:
The cache is divided into a number of sets containing an equal number of blocks. Each block in main memory maps into one set in cache memory, similar to direct mapping; within the set, the cache acts as in associative mapping, where a block can occupy any line within that set, and replacement algorithms may be used within the set. So an incoming block maps to any block in its assigned cache set. The address issued by the processor is therefore divided into three distinct fields: Tag, Set and Word. The Set field uniquely identifies the specific cache set that ideally should hold the targeted block. The Tag field uniquely identifies the targeted block within that set. The Word field identifies the element (word) within the block that is requested by the processor.
Main memory address: | Tag field | Set field | Word field |
Fig: memory address fields of the set-associative mapped cache organization
Having shown the division of the main memory address, we can now explain the steps used by the MMU (Memory Management Unit) to satisfy a request made by the processor:
1. Use the Set field to determine the specified set.
2. Use the Tag field to find a match with any of the blocks in the determined set. A match with a tag indicates that the set determined in step 1 currently holds the targeted block, i.e. a cache hit.
3. Among the words of the hit cache block, the requested word is selected using a selector with the help of the Word field.
If in step 2 no match is found, this indicates a cache miss. The required block then has to be brought from the main memory and deposited in the specified set first, and the targeted word is made available to the processor. The cache tag memory and the cache block memory have to be updated accordingly.
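The Tag/Set/Word decomposition in Question 12 can be sketched as follows in Python; the bit widths below (4-word blocks, 64 sets, 16-bit addresses) are made-up example parameters, not values from any particular question.

    WORD_BITS = 2    # 4 words per block (assumed)
    SET_BITS  = 6    # 64 sets in the cache (assumed)
    ADDR_BITS = 16   # width of a main memory address (assumed)

    def split_address(addr):
        word = addr & ((1 << WORD_BITS) - 1)                 # low bits: word in block
        set_ = (addr >> WORD_BITS) & ((1 << SET_BITS) - 1)   # next bits: set number
        tag  = addr >> (WORD_BITS + SET_BITS)                # remaining bits: tag
        return tag, set_, word

    # A lookup hits if the tag is present in the addressed set.
    cache = {s: set() for s in range(1 << SET_BITS)}         # set number -> resident tags

    def access(addr):
        tag, s, _ = split_address(addr)
        if tag in cache[s]:
            return "hit"
        cache[s].add(tag)      # bring the block in (replacement policy omitted)
        return "miss"

    print(access(0x1234), access(0x1234))   # miss hit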
Reference string
POPULAR PUBLICATIONS

this indicates a
cache mis.
Therefore, the required 1 120 30 0 3 2 1
20 1 7 0
is found, then first, and
If in step 2, no match memory. deposited in the specified set 3
0 0 0|

TI
from the main processor. The cache Tag memory
block has to be brought available to the
the targeted element
(word) is made
be upJated
accordingly. 1| 1
block memory have to
and the cache
sequence of word addresses: 2
Total page fauli 8.
generates the following
13. A certain program
page frames in main memory is 3. How
10
4,5, 12,8, 10, 28,6, 16. An address space is specified by ot
words; the number of 28 bits and corresponding memory space
A page has four
many page faults

Answer:
are

4,5. 12.8. 10. 28.6.


generated

Sequence ofword address:


10

4
if

48
optimum page replacement policy is used?

|8|8
5551012 12 10 28
13 28
610
WBUT 2013]

AU
26 bits. If a page consists of 4K words
i) How many pages and blocks are there in the system?
li) The asSociative memory page-table contains the following entries.
Page Block
0

in binary), that will cause a page


AK
Total page fault is 7 for optimum page replacement policy Make a list of all virtual addresses (in decimal and WBUT 2013]
fault.
14. A system has 48 bit virtual address, 36 bit physical address and 128 MB main Answer:
system has 4096 bytes pages, how many virtual and [page size = 4 K = 212]
memory address. the No. of pages = 228/212 =216
physical pages can have address support? How many page frames main of No. of blocks = 226/212 =214
7.
memory are there? WBUT 2013] address that caused page fault are 2, 3, 4 and
Answer:
The list of virtual
drawback of direct mapped cache? How is it resolved in set
128 MB
2 Bytes
Physical Address space =2 Bytes 17. What is the WBUT 2014]
associative cache?
M

Page size = 4096 Bytes =? Bytes


No. of Physical Pages =2*/2 = 2 Answer: technique is the simplest technique of cache
mapping. It places an
No. of Virtual Pages = 2/ 2 =2*
Direct mappingmemory block lock into a specitc fixed cache block location. The placement
incoming main xed relation betwcen the incoming block number.
fixedr
i. the cache block
No. of Page Frames in main memory 2" /2 =25 based on a
is done number of cache blocks, N
the
number, j, and ji modN
15. What s
the objective of OPT page replacement algorithm policy of virtual
is the inefficient use of the cache.
memory? Using LRu, show the page-faut rate for the
reference string disadvantage of Direct mapping technique
E

701 20304230321201701 acCording to this technique, a number of main memory blocks may
WBUT 2013] This is becgiven cache block even if there exist other empty cache blocks. The
Answer:
The page to be replaced is the one that
compete or
expected poor on
utilization of the cach.
utilization cache by the direct mapping technique is mainly due to
will not be used for the longest period
of time. the placement of the incoming main memory blocks in the cache i.e.
disar
This algorithm requires future knowledee restrictioropety. property. This disadvantage should lead to achieving a low cache hit
AR

of the reference string which is not usually many-10-one


available. the
Size of the page frame is not given] ratio. ct-associative mapping combines aspects of both direct mapping and associative
We assume that it is 4. he Sel-a5snique. The set-ass0Ciative mapping scheme
1nng tecnith the flexibility of associative mapping. Incombines the simplicity of
the set-associative mapping
direct maphe cache iS diviacd into a number of sets. Each
set consists
hniqueen main emory block maps to a specific cache set based on of
a number of
the equation
block.
CA-81
e
W
S
AN
POPULAR PUBLIGATIONS PUTER ARCHITEC
1emory. If the write buffer
is ful, the
empty. cache and the CPU
S
S imod must wait until the butter
i is the main memory block number, and s is
Where S is the number of sets in the cache.
Sa, in this technique. the blocks of main'

TI
block i
maps. Cache MM
the specific cache set to which
different sets of the cache memory by direct mapping
memon are connected to the
cache blocks and a specific block of main
technique. But in a set there are number of
a specitic set of the cache memory by fully ufter
memory may transfer to any block of
replacement is reduced and hit rátio
associative cache mapping technique. Sa. the cache

AU
is increased.
Fig: Write through
Long Answer Type guestions policy (All the memories have
the same copy)

does the Cache memory effect the throughput of a computer system?


In write-back policy data is written
How to the cache, and updated in the main memory only
WBUT 2007] when the cache line is replaced. Information is written
only to the block in the cache. The
Distinguish between Write back and Write through Cache. WBUT 2007] modified cache block is written to main memory only when it is replaced.
This requires
OR, an additional information (either hardware or software). called dirty bits. A dirty bit is
and write back with advantages
Briefty explain the two write policies: write through attached to each tag of the cache. Whenever the information in cache is different from the
AK
and disadvantages. WBUT 2013]
c) What effect does memory bandwidth have on the effective memory access time?
one in main memory, then write back to main memory.
WBUT 2007] 2
Cache
What is Cache coherence? How can this problem be overcome? WBUT 2007]
OR
Briefly describe cache coherence problem with an example. Suggest one
software
protocol for this. WBUT 2015]
Answer:
a) In general, throughput is as amount of data transferred from sender to receiver in a
M

specified period of time. Throughput strongly depends on the latency.


However, in many
cases it can provide beter performance than expected by simply
dividing cache line size Fig: Write back Policy (All the memories have the same copy)
by latency. because many cache lines
can be transferred in parallel. A
system immediately supplies data codes to a central processing unit, and cache memory une number of bits per second that can be
new data codes bandwidth provides a measure or
are wnten from the central processing unit into the cacne
memory system so as to c) The D, rerers o tne rare at which information is transferred from
ith
enhance the hit ratio. The new data codes are transferred accessed. The bandwidth
to a main memory system while
there are no predicted bus requests or communication level to its ad to how quickly the memory can respond to a read or write
request.
between the central processing
E

unit. the main memory system. and the cache Access time
memory system, so that data throughput is
inproved without negatively affecting the hit ratio. The access efers
t
time refers that the time between reqiest to i-th level of memory and word
memory to the processor.
The performance of cache-based that level ofdefinitio it is obvious that if we
multiprocessors for general-purpose computing
and for multitasking is analyzed with arrives from definition bandwidth
above increase bandwidth the access
simple throughput models. A private according the increasing width bit transfer rate will increase.
d) Memory caching is effective because most programs repeatedly access the same data or instructions. When multiple processors with separate caches share a common memory, it is necessary to keep the caches in a state of coherence by ensuring that any shared operand that is changed in any cache is changed throughout the entire system. This is done in one of two ways: by a directory-based system or by a snooping (broadcast) system. In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between the caches. The directory acts as a filter through which a processor must ask permission to load an entry from the primary memory into its cache. When an entry is changed, the directory either updates or invalidates the other caches holding that entry.
2. a) What is Memory Management Unit (MMU)?
b) What are the advantages of using cache memory organization? Define hit ratio.
Compare and contrast associative mapping and direct mapping. [WBUT 2008]
Answer:
a) The Memory Management Unit (MMU) is the hardware component that manages virtual memory systems. Among the functions of such a device are the translation of virtual addresses to physical addresses, memory protection, cache control, bus arbitration and, in simpler computer architectures, bank switching. Typically the MMU is part of the CPU, though in some designs it is a separate chip. The MMU includes a small amount of memory that holds a table matching virtual addresses to physical addresses; this table is called the Translation Look-aside Buffer (TLB). All requests for data are sent to the MMU, which determines whether the data is in RAM or needs to be fetched from the mass storage device. If the data is not in memory, the MMU issues a page fault interrupt.
b) The advantage of using a cache in the memory hierarchy is to keep the information expected to be used most frequently by the CPU in the cache. The end result is that at any given time some active portion of the main memory is duplicated in the cache. Therefore, when the processor makes a request for a memory reference, the request is first searched in the cache. If the request corresponds to an element that is currently residing in the cache, we call that a cache hit. On the other hand, if the request corresponds to an element that is not currently in the cache, we call that a cache miss. A cache hit ratio, h, is defined as the probability of finding the requested element in the cache; a cache miss ratio, (1 - h), is defined as the probability of not finding the requested element in the cache.

Hit ratio (h) = No. of hits / (No. of hits + No. of misses) = No. of hits / Total CPU accesses

Compare and contrast associative mapping and direct mapping:
i. In the direct mapping technique, block j of main memory is transferred to a fixed block i of cache memory by the rule i = j mod m, where m is the number of cache blocks. In the associative mapping technique, a block of main memory may be transferred to any block of cache memory.
ii. Conflict misses may occur in the direct mapping technique, whereas capacity misses may occur in the associative mapping technique.
3. a) What is cache memory? Define global miss & local miss with a suitable example.
b) Describe different techniques to reduce Miss Penalty. [WBUT 2010]
Answer:
a) Cache memory is extremely fast memory that is built into a computer's central processing unit (CPU), or located next to it on a separate chip. The CPU uses cache memory to store instructions that are repeatedly required to run programs, improving overall system speed. Most modern desktop and server CPUs have at least three independent caches: an instruction cache to speed up executable instruction fetch, a data cache to speed up data fetch and store, and a translation look-aside buffer used to speed up virtual-to-physical address translation for both executable instructions and data.
When the processor needs to read or write a location in main memory, it first checks whether that memory location is in the cache. This is accomplished by comparing the address of the memory location to all tags in the cache that might contain that address. If the processor finds that the memory location is in the cache, we say that a cache hit has occurred; otherwise, we speak of a cache miss. In the case of a cache hit, the processor immediately reads or writes the data in the cache line.
Local miss: the number of misses in this cache divided by the total number of memory accesses made to this cache memory.
Global miss: the number of misses in this cache divided by the total number of memory accesses generated by the CPU.

b) Many techniques to reduce miss penalty affect the CPU; this first technique ignores the CPU, concentrating on the interface between the cache and main memory. Adding another level of cache between the original cache and memory simplifies the design decision: the first-level cache can be small enough to match the clock cycle time of the fast CPU, while the second-level cache can be large enough to capture many accesses that would otherwise go to main memory, thereby lessening the effective miss penalty.
Multilevel caches require extra hardware to reduce miss penalty, but the second technique does not. It is based on the observation that the CPU normally needs just one word of the block at a time. The strategy is impatience: don't wait for the full block to be loaded before sending the requested word and restarting the CPU. There are two specific strategies:
Critical word first - Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Critical-word-first fetch is also called wrapped fetch or requested word first.
Early restart - Fetch the words in normal order, but as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
Generally these techniques only benefit designs with large cache blocks, since the benefit is low unless blocks are large. The problem is that, given spatial locality, there is a better-than-random chance that the next miss is to the remainder of the same block.
Another approach to lower miss penalty is to remember what was discarded, in case it is needed again. Since the discarded data has already been fetched, it can be used again at small cost. Such "recycling" requires a small, fully associative victim cache between a cache and its refill path. The victim cache contains only blocks that were discarded from the cache because of a miss; they are checked on a miss, to see if they hold the desired data, before going to the next lower-level memory. If the data is found there, the victim block and the cache block are swapped.
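As a small numeric illustration of the local/global miss-rate definitions in part (a) (the counts below are invented for the example, not taken from the text): an L2 cache only sees the accesses that missed in L1, so its local miss rate divides by L2 accesses while its global miss rate divides by all CPU accesses.

```python
# Hypothetical two-level cache counts, invented for illustration only.
cpu_accesses = 10000       # total memory accesses generated by the CPU
l1_misses    = 400         # these become the accesses seen by L2
l2_misses    = 100

l2_local_miss_rate  = l2_misses / l1_misses     # misses / accesses to this cache
l2_global_miss_rate = l2_misses / cpu_accesses  # misses / all CPU accesses

print(f"L2 local miss rate:  {l2_local_miss_rate:.2%}")   # 25.00%
print(f"L2 global miss rate: {l2_global_miss_rate:.2%}")  # 1.00%
```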
4. a) An 8 KB 4-way set associative cache is organized as multiple blocks, each of 32-byte size. Assume that the processor generates 36-bit addresses. Calculate the total size of the memory required by the cache controller to store the tags for the cache.
b) What are the approaches to improve miss penalty?
c) A CPU generates 32-bit virtual addresses. The page size is 4 kB. The processor has a TLB which can hold a total of 256 page table entries. The TLB is 8-way set associative. Calculate the TLB tag size. [WBUT 2012]
Answer:
a) Total size of the cache memory = 8 KB = 2^13 bytes. The size of each block is 32 bytes, i.e. 2^5, so the cache holds 2^13 / 2^5 = 2^8 = 256 blocks. 4-way set associative means each set contains 4 blocks, so there are 256 / 4 = 64 = 2^6 sets. The 36-bit address therefore splits into a 5-bit word (offset) field, a 6-bit set field and a 36 - 6 - 5 = 25-bit tag field. The memory required to store the tags is 256 x 25 = 6400 bits = 800 bytes.

b) Refer to Question No. 3(b) of Short Answer Type Questions.

c) The CPU generates 32-bit virtual addresses and the page size is 4 KB = 2^12, so the virtual page number has 32 - 12 = 20 bits. The TLB holds 256 = 2^8 entries and is 8-way set associative, so it has 256 / 8 = 32 = 2^5 sets. The TLB tag size is therefore 20 - 5 = 15 bits.

5. What are the major differences between segmentation and Paging? Why is the page size usually a power of 2? [WBUT 2013]
Answer:
Paging - Computer memory is divided into small partitions that are all the same size and referred to as page frames. Pages are fixed units of computer memory allocated by the computer for storing and processing information. Paging is a virtual memory scheme which is transparent to the program at the application level. When a process is loaded, it gets divided into pages which are the same size as the frames; the process pages are then loaded into the frames.
Segmentation - Computer memory is allocated in various sizes (segments) depending on the need for address space by the process. In segmentation, the address space is typically divided into a preset number of segments, such as a data segment (read/write), a code segment (read-only) and a stack (read/write). These segments may be individually protected or shared between processes. Commonly you will see what are called "Segmentation Faults" in programs; this is because the data that is about to be read or written is outside the permitted address space of that process.
Recall that paging is implemented by breaking up an address into a page number and an offset. It is most efficient to break the address into X page bits and Y offset bits, rather than perform arithmetic on the address to calculate the page number and offset. Because each bit position represents a power of 2, splitting an address between bits results in a page size that is a power of 2.
6. A computer has 1 KB 4-way set associative cache and 8 MB main memory. If the block size is 64 B, then in which cache sets are the words (ABCDE)16 and (EDCBA)16 mapped? [WBUT 2014]
Answer:
Cache size = 1 KB = 2^10 bytes. Block size = 64 B = 2^6 bytes.
So, no. of blocks in cache memory = 2^10 / 2^6 = 2^4 = 16.
This is 4-way set associative mapping, so no. of sets in cache memory = 2^4 / 2^2 = 4.
Size of main memory = 8 MB = 2^23 bytes, so no. of blocks in main memory = 2^23 / 2^6 = 2^17.
The binary value of (ABCDE)16 is 1010 1011 1100 1101 1110. Taking the first 14 bits as the block number (the remaining 6 bits are the offset within a 64-byte block), the word (ABCDE)16 will be in block no. (10995)10.
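The block/set arithmetic of this example (completed just below) can be checked with a few lines of code, a sketch under the same assumptions (64-byte blocks, 4 sets):

```python
# Verify the set mapping for the worked example above.
BLOCK_SIZE = 64          # 2^6 bytes per block
NUM_SETS   = 4           # 1 KB cache / 64 B blocks / 4 ways

for word in (0xABCDE, 0xEDCBA):
    block = word // BLOCK_SIZE       # drop the 6 offset bits
    s     = block % NUM_SETS         # set index = block number mod sets
    print(f"address {word:#x}: block {block}, set {s}")
# address 0xabcde: block 10995, set 3
# address 0xedcba: block 15218, set 2
```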
POPULARPvBLICATIONS Multiple writes to the same word with no intervening reads require multiple write
hroadcasts in an update protocol, but only one initial invalidation is in a write
- invalidate protocol.

TI
3
Now. this bloch will be in the æt no. 10995 mod
4

Similarly. the word (EXBA)), will be in the block no. (15218)1o and this will be in the With multiword cache blocks, each word written in a cache block requires a write
setno. I$218 mod 4 broadcast in an update protocol, although only the first write to any word in the block
needs to generate an invalidate in an invalidation protocol. An invalidation protocol
7. a)What s meant by the cache miss penalty? Briefly discuss "early restart"
technique to reduce miss penalty. works on cache blocks, while an update protocol must work on individual words (or

AU
merge writes in a write
b) Let us consider a memory system consisting of main memory and cache
memory In case of a cache miss, assume bytes. when bytes are written). It is possible to try to
the perfomance of the basic memory broadcast scheme.
organzation as: and reading the written value in
4 clock cycles to send the The delay between writing a word in one procesor written data are
address processor is usually less in a write update scheme, since the
24 clock cycles for the access
time per word another
4 clock cycles to send a word updated in the reader's cache.
of data. immediately
What will be the miss penalty, given a cache block of four
i) What will be the memory bandwidth? words?
Auswer. WBUT 2015] 2d part: coherence protocol and
protocol is an Invalidate based cache
a)Refer o Question No. 4 of Long MESI protocol: The MESI support write-back caches. By using write
Answer Type Questions.
AK
is one of the most
common protocol which wasted on a write through
which is generally
b) Refer touestion No. on bandwidth indicates that the
I of Short Answer Type Questions. back caches, we save a lot write back caches which
dirty state present in protocol reduces the
cache. There is always a main memory. This
8. What is te difference different from that in MSI protocol. This marks
a
MESI protoco between broadcast and invalidate data in the cache is transactions with respect to the
protocols? Explain number of Main memory
the performance.
Pl P2
significant improvement in
P3
data block is broadcast to
3d part: is updated, the new
local cache block write is observable
M

.Write-update: When a for updating them. i.e. Every


containing a copy of the block
can take place at a time in any
all caches only one write P3 performs one write
write goes on the bus. 1.e. first. Then
and every operations
performs 3 write
processor. So, Pl blockis
cache when a local cache
operation. all remote copies of protocol based on write
.Write-invalidate: Invalidate protocol 1S the write invalidate start of the
simplest snoopy memor are interconnected by a bus. At the
updated. The it receives a
The caches and its first request to its cache and, when
E

The value of X (shared through caches. processor sends sent to the


cache memories. At memory) is 50. P1 and succeeding clock period. Any packet
M P3 want to simulation. each in the:
next request and the memory. The memory only responds to packets
wants to read time
response, issues the the caches and
for two timee.P1 wants to write on X for read X and store in their as the packet
read. After that first P3 three times. After forwarded to all the same source address
writes on X and that Ps bus is the memor contains
Explain the above packet sent from
AR

then P2 wants to a
update, WNrte through
mentioned scenario sent to it; ocal processor can take
one of the two states:
invalidate. Write using Write copy of loca
Answer:
back invalidatethrough update, Write back it receive cache block
1 part protocols. The state ofa
WBUT 20161 safely.
Difference between Valid State:
processors can read write
The performance broadcast and invalidate All processor can also
diflerences
from threc characteristucs between write protocols: Local (not in cache)
broadcast
and write invalidate Invalid State: invalidated.
protocols arise Block being replaced
Block being
CA-88 CA-89
e
W
S
AN
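The first-part contrast can be made concrete by counting bus operations for the write sequence in the scenario (P1 writes three times, then P3 writes once). This is a sketch under the usual textbook assumptions, not a full protocol simulator:

```python
# Bus traffic for a run of writes to one word: update vs invalidate.
# Sequence from the scenario: P1 writes 3 times, then P3 writes once.
writes = ["P1", "P1", "P1", "P3"]

update_msgs = len(writes)        # every write is broadcast on the bus
invalidate_msgs = 0
owner = None                     # cache that currently owns the block
for p in writes:
    if p != owner:               # first write by a new owner invalidates
        invalidate_msgs += 1
        owner = p
    # further writes by the same owner hit its already-exclusive copy

print(update_msgs)       # 4 broadcasts in a write-update protocol
print(invalidate_msgs)   # 2 invalidations in a write-invalidate protocol
```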
9. What do you mean by memory fragmentation? What is the advantage of using Paging? Explain the Virtual memory concept with an example where logical address space is 8 kb, physical address space is 4 kb, page size is 1 kb. Explain page fault with FIFO and LRU Algorithm. [WBUT 2016]
Answer:
1st part:
Memory fragmentation occurs when a system contains memory that is technically free but that the computer can't utilize. The memory allocator, which assigns needed memory to various tasks, divides and allocates memory blocks as they are required by programs. When data is deleted, more memory blocks are freed up in the system and added back to the pool of available memory. When the allocator's actions, or the restoration of previously occupied memory segments, leads to blocks or even bytes of memory that are too small or too isolated to be used by the memory pool, fragmentation has occurred. It means that the memory is divided into parts of fixed size, and when some processes try to occupy the memory space they sometimes are not able to occupy the whole memory, leading to some holes in the memory. This is memory fragmentation. It is of 2 types:
1. External fragmentation
2. Internal fragmentation

2nd part:
Advantage of using Paging:
Allocating memory is easy and cheap
Eliminates external fragmentation
Data (page frames) can be scattered all over physical memory; pages are mapped appropriately anyway
Allows demand paging
More efficient swapping
No need for considerations about fragmentation
Just swap out the page least likely to be used

3rd part:
The virtual memory concept with logical address space 8 kb, physical address space 4 kb and page size 1 kb is given below. The size of the page and the page frame are the same. The system loads pages into frames and translates addresses. We can represent:
logical address: va = (p, w)
physical address: pa = (f, w)
where p determines the number of pages in VM, f determines the number of frames in PM, and w determines the page/frame size. The page table (one per process, one entry per page, maintained by the OS) maps virtual addresses to physical addresses.
Fig: Mapping between virtual address and physical address (8 pages of virtual memory mapped through the page table onto 4 frames of physical memory)
In the above figure, since the logical address space is 8 kb and the physical address space is 4 kb with a 1 kb page size, there are 8 pages in virtual memory and 4 frames in physical memory.

4th part:
FIFO Page Replacement: A simple and obvious page replacement strategy is FIFO, i.e. first-in-first-out. As new pages are brought in, they are added to the tail of a queue, and the page at the head of the queue is the next victim. In the following example, 20 page requests result in 15 page faults:
reference string: 7 0 1 2 0 3 0 4 2 3 0 3 2 1 2 0 1 7 0 1 (3 page frames)
Fig: FIFO page-replacement algorithm
LRU: Refer to Question No. 15 of Short Answer Type Questions.

10. Write short notes on the following:
a) Write through and write back caches [WBUT 2011]
b) Cache coherence problem and its solutions [WBUT 2012, 2014]
Answer:
a) Write through and write back caches:
In the write-through technique, data is updated both in the cache and in the main memory. If there is a write buffer for main memory and it is empty, information is written into the cache and the write buffer, and the CPU continues working while the write buffer writes the word to memory. If the write buffer is full, the cache and the CPU must wait until the buffer is empty.
Fig: Write through policy (All the memories have the same copy)
In the write-back policy, data is written to the cache and updated in the main memory only when the cache line is replaced. Information is written only to the block in the cache, and the modified cache block is written to main memory only when it is replaced. This requires additional information (either in hardware or software) called dirty bits. A dirty bit is attached to each tag of the cache; whenever the information in the cache differs from that in main memory, the block is written back to main memory on replacement.
Fig: Write back Policy
b) Cache coherence problem and its solutions:
Refer to Question No. 10 of Short Answer Type Questions.
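Returning to the FIFO example in question 9, the behaviour is easy to simulate. The following sketch replays the 20-request reference string with 3 frames and reproduces the 15 page faults:

```python
from collections import deque

def fifo_page_faults(refs, num_frames):
    """Count page faults under FIFO replacement."""
    frames, queue, faults = set(), deque(), 0
    for page in refs:
        if page not in frames:
            faults += 1
            if len(frames) == num_frames:        # evict the oldest page
                frames.remove(queue.popleft())
            frames.add(page)
            queue.append(page)
    return faults

refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
print(fifo_page_faults(refs, 3))   # 15
```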
RISC & CISC ARCHITECTURES

Multiple Choice Type Questions

1. Overlapped register windows are used to speed-up procedure call and return in [WBUT 2007]
a) RISC architectures b) CISC architectures
c) both (a) and (b) d) none of these
Answer: (a)

2. What is a main advantage of classical vector systems (VS) compared with RISC based systems (RS)? [WBUT 2008, 2009]
a) VS have significantly higher memory bandwidth than RS
b) VS have higher clock rate than RS
c) VS are more parallel than RS
d) None of these
Answer: (a)

3. Difference between RISC and CISC is [WBUT 2010]
a) RISC is more complex b) CISC is more effective
c) RISC is better optimizable d) none of these
Answer: (c)

4. The advantage of RISC over CISC is that [WBUT 2011]
a) RISC can achieve pipeline segments, with the longest segment requiring just one clock cycle
b) CISC uses many segments in its pipeline, with the longest segment requiring two or more clock cycles
c) both (a) & (b)
d) none of these
Answer: (c)

5. Which of the following is not a RISC architecture characteristic? [WBUT 2012]
a) simplified and unified format of instructions
b) no storage/storage instructions
c) no register file
d) small number of specialized registers
Answer: (c)

6. Which of the following architectures corresponds to the von Neumann architecture? [WBUT 2012]
a) MISD b) MIMD c) SISD d) SIMD
Answer: (c)

7. The CPI value for RISC processors is [WBUT 2015]
a) 1 b) 2 c) 3 d) more
Answer: (a)
Short Answer Type Questions

1. Compare between RISC and CISC. [WBUT 2010, 2012, 2014, 2015]
Answer:
Instruction set size and instruction formats:
CISC - Instruction set is very large and the instruction format is variable (16-64 bits per instruction).
RISC - Instruction set is small and the instruction format is fixed.
Addressing modes:
CISC - 12-24. RISC - 3-5.
General purpose registers and cache design:
CISC - 8-24 general purpose registers are present; a unified cache is used for instructions and data.
RISC - Most instructions are register based, so a large number of registers (32-192) is used, and the cache is split into a data cache and an instruction cache.
CPI:
CISC - CPI is between 2 and 15. RISC - In most cases CPI is 1, and the average CPI is less than 1.5.
CPU control:
CISC - The CPU is controlled by control memory (ROM) using microprograms.
RISC - The CPU is controlled by hardware, without control memory.

2. What are multiprocessor, multi-computer and multi-core systems? [WBUT 2012, 2014]
Answer:
In a multiprocessor system there is more than one processor, and they work simultaneously. In this system there is one master processor and the others are slaves. If one slave processor fails, the master can assign its task to another slave processor; but if the master fails, the entire system fails. The central part of a multiprocessor is the master, and all processors share the hard disk, memory and other devices.
A multicomputer system consists of more than one computer, usually under the supervision of a master computer, in which smaller computers handle input/output and routine jobs while the large computer carries out the more complex computations.
A multi-core processor is a single computing component with two or more independent actual central processing units (called "cores"), which are the units that read and execute program instructions. The instructions are ordinary CPU instructions such as add, move data, and branch, but the multiple cores can run multiple instructions at the same time, increasing overall speed for programs amenable to parallel computing. Manufacturers typically integrate the cores onto a single integrated circuit die or onto multiple dies in a single chip package.

Long Answer Type Questions

1. a) What is SPEC rating? Explain.
b) A 50 MHz processor was used to execute a program with the following instruction mix and clock cycle counts: [WBUT 2015]
Instruction type / Instruction count / Clock cycle count
Integer arithmetic / 50000 / 1
Data transfer / 35000 / 2
Floating point arithmetic / 20000 / 2
Branch / 6000 / 3
Calculate the effective CPI, MIPS rate and execution time for this program.
Answer:
a) The Standard Performance Evaluation Corporation (SPEC) is an American non-profit organization that aims to "produce, establish, maintain and endorse a standardized set" of performance benchmarks for computers. SPEC was founded in 1988. SPEC benchmarks are widely used to evaluate the performance of computer systems; the test results are published on the SPEC website. Results are sometimes informally referred to as "SPECmarks" or just "SPEC". SPEC evolved into an umbrella organization encompassing four diverse groups: the Graphics and Workstation Performance Group (GWPG), the High Performance Group (HPG), the Open Systems Group (OSG) and the newest, the Research Group (RG).

b) Total instruction count = 50000 + 35000 + 20000 + 6000 = 111000
CPI = (50000 x 1 + 35000 x 2 + 20000 x 2 + 6000 x 3) / 111000 = 178000 / 111000 = 1.6 (approximately)
MIPS rate = clock frequency / (CPI x 1000000) = (50 x 1000000) / (1.6 x 1000000) = 31.25
Execution time = CPI x Instruction count x Clock cycle time = 1.6 x 111000 x (1 / (50 x 1000000)) s = 3.55 ms (approximately)
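The arithmetic in answer 1(b) can be reproduced directly; note that the book rounds CPI to 1.6 before computing MIPS, while the unrounded value gives 31.18:

```python
# Instruction mix from question 1(b): (count, cycles per instruction)
mix = {"integer": (50000, 1), "data transfer": (35000, 2),
       "floating point": (20000, 2), "branch": (6000, 3)}
clock_hz = 50_000_000                     # 50 MHz processor

instructions = sum(c for c, _ in mix.values())
cycles = sum(c * cpi for c, cpi in mix.values())
cpi = cycles / instructions
mips = clock_hz / (cpi * 1_000_000)
exec_time_ms = cycles / clock_hz * 1000

print(f"CPI  = {cpi:.3f}")          # 1.604 (about 1.6)
print(f"MIPS = {mips:.2f}")         # 31.18 (31.25 if CPI is rounded to 1.6)
print(f"time = {exec_time_ms:.2f} ms")   # 3.56 ms
```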
2. Write short notes on the following:
a) Power PC [WBUT 2007, 2010]
b) Non von Neumann architecture [WBUT 2012]
c) Cluster Computer [WBUT 2012]
Answer:
a) Power PC: The PowerPC architecture is a RISC architecture. The PowerPC 601 microprocessor is a highly integrated single-chip processor that combines a powerful superscalar machine organization with a versatile high-performance bus interface. The 601 contains a 32KB unified cache and is capable of dispatching and completing up to 3 instructions per cycle. It offers a wide range of system bus interfaces, including pipelined, non-pipelined and split-transaction modes. The result is a cost-effective, general purpose microprocessor solution with very competitive performance.
Fig: PowerPC Architecture (instruction queue and branch unit, fixed-point unit, floating-point unit, memory management unit, 32KB cache array with tags, memory queue and bus interface unit)
As shown in the above figure, it is a superscalar design with three pipelined execution units. The processor can dispatch up to three 32-bit instructions each cycle - one each to the Fixed-Point Unit (FXU), the Floating-Point Unit (FPU) and the Branch Processing Unit (BPU). The 32KB unified cache provides a 32-bit interface to the FXU, a 64-bit interface to the FPU, and a 256-bit interface to both the instruction queue and the memory queue. The chip I/Os include a 32-bit address bus and a 64-bit data bus. The designers optimized the 601 pipeline structure for high performance and concurrent instruction processing in each of the execution units, as shown below.
The fixed-point pipeline performs all integer arithmetic logic unit (ALU) operations and all processor load and store instructions, including floating-point loads and stores.
The branch instruction pipeline has only two stages. The first stage can dispatch, decode, evaluate and, if necessary, predict the direction of a branch instruction in one cycle. On the next cycle, the resulting fetch can be accessing new instructions from the cache.
The floating-point instruction pipeline contains six stages and has been optimized for fully pipelined execution of single-precision operations.
Fig: PowerPC 601 pipeline Architecture (fetch, dispatch/decode, execute and writeback stages for the integer, branch, load/store and floating-point instruction streams)

b) Non von Neumann architecture:
Any computer architecture in which the underlying model of computation is different from what has come to be called the standard von Neumann model. A non von Neumann machine may thus be without the concept of sequential flow of control (i.e. without any register corresponding to a "program counter" that indicates the current point that has been reached in execution of a program) and/or without the concept of a variable (i.e. without "named" storage locations in which a value may be stored and subsequently referenced or changed).
Examples of non von Neumann machines are the dataflow machines and the reduction machines. In both of these cases there is a high degree of parallelism, and instead of variables there are immutable bindings between names and constant values.
Note that the term non von Neumann is usually reserved for machines that represent a radical departure from the von Neumann model, and is therefore not normally applied to multiprocessor or multicomputer architectures, which effectively offer a set of cooperating von Neumann machines.

c) Cluster Computer:
A cluster computer consists of a set of loosely connected computers that work together so that in many respects they can be viewed as a single system. The components of a cluster are usually connected to each other through fast local area networks, each node (computer used as a server) running its own instance of an operating system. Computer clusters emerged as a result of the convergence of a number of computing trends, including the availability of low-cost microprocessors, high speed networks, and software for high performance distributed computing. Clusters are usually deployed to improve performance and availability over that of a single computer, while typically being much more cost-effective than single computers of comparable speed or availability. Computer clusters have a wide range of applicability and deployment, ranging from small business clusters with a handful of nodes to some of the fastest supercomputers in the world.
Computer clusters may be configured for different purposes, ranging from general-purpose business needs such as web-service support to computation-intensive scientific calculations. In either case, the cluster may use a high-availability approach. Note that the attributes described below are not exclusive, and a "compute cluster" may also use a high-availability approach.
"Load-balancing" clusters are configurations in which cluster nodes share the computational workload to provide better overall performance. For example, a web server cluster may assign different queries to different nodes, so the overall response time will be optimized. However, approaches to load-balancing may significantly differ among applications: a high-performance cluster used for scientific computations would balance load with different algorithms from a web-server cluster, which may just use a simple round-robin method, assigning each new request to a different node.
INTERPROCESS COMMUNICATION

Multiple Choice Type Questions

1. In general an n input Omega network requires ______ stages of 2x2 switches. [WBUT 2011, 2016]
a) log2 n b) 4 c) 8 d) 16
Answer: (a)

2. Overlapped register windows are used to speed up procedure call and return in [WBUT 2011]
a) RISC architecture b) CISC architecture
c) both (a) & (b) d) none of these
Answer: (a)

3. The time to access shared memory is same in which of the following shared memory multiprocessor models? [WBUT 2012, 2015]
a) NUMA b) UMA c) COMA d) ccNUMA
Answer: (b)

4. Example of a recirculating network is [WBUT 2013]
a) 3 cube network b) ring network
c) tree network d) mesh connected Illiac network
Answer: (b)

5. In general a 64 input Omega network requires ______ stages of 2x2 switches. [WBUT 2013]
a) 6 b) 64 c) 8 d) 7
Answer: (a)

6. The UMA multiprocessor system is best suited [WBUT 2015]
a) when the degree of interaction among different modules in a program is large
b) when the degree of interaction among different modules in a program is less
c) when there is no interaction among different modules in a program
d) when different programs are to be executed concurrently
Answer: (d)

Short Answer Type Questions

1. With architecture and timing diagram explain S-access memory organization. [WBUT 2005, 2007]
Answer:
There is a way to arrange low-order interleaved memory which is called S-access. As illustrated in the figure below, in this case all memory modules are accessed simultaneously in a synchronized manner. Again the high-order (n - a) bits select the same offset word from each module. At the end of each memory cycle, m = 2^a consecutive words are latched in the data buffers.
AN
OPULARPUBuCATIONS
COMPUTER ARCHITECTURE
m words out, one per Answer
are then used to mutiplex the
simultaneoush The ow-order a bits m of the memory cycle, then it
cyce. if the minor cycle is chosen to e 1

TI
cach mimor access phase of the last
takes two memon
cyces to access m consecutive words. If the m words take
overlapped with the fetch phase of the current access, effectively
access is
throughput is decreases stride is greater than
1.
only one memorn cyek to acces. The if
Fetchcvce Accs cwe

(
-orde

Reed/we
Multiplever

A Low-order
Single word
0Ccss

Single word
access
AU
AK
address bits
Fig S-Access memory organzation for m-way interleaved memory
Ss1
Fig: 3x4 delta network
Mcor Madules
a
In the delta network, there are x b" switches with n stages consisting of a x b crossbar
teich 1 Fetch3
AccesFch2TAccess 2Acss
e 3 modules. There is a unique interconnection path of constant length between the stages of
the network. In delta network no input or output terminal of any crossbar module is left
unconnected. To construct an a" x b" delta network there are a shuffle as the link pattern
keach Fach2 Fetch 3 hehween every two consecutive stages of the network. In an a'xb° dela network. there
M

Access Acoess 2 Aces sources and b" destinations. Numbering the stages of the network as 1,2.n,
are a nt the source side of the network requires that there be a crossbar modules in
starulnge So there is ab output teminal in the first stage and this is the input
Fai Acoes
ish2sh+ Acoess 2
terminals in
second stage. So, the i-th stage has
a
crossbar modules
above exan
ample, a = 3, b = 4 and n =2. So, there are 2 stages in 3 x4 delta
words words words Now, in the inter. link pattern is a- shuffle i.e.3 -shuffe.
the
Cyce network and
E

Tume
hg Successve memory access operation using overlapped stage Switching Network?
Multista WBUT 2008
fech and access cycle What is
3.
Aiswer Switching network capable of supporting massive
parallel processing,
2.Develop 34 delta network WBUT 2008
multistage
oint-to-point and multicast communications between
A luding point-to- processor modules
connected to the input and output
AR

which are ports of the network. The network


(PMs) itching model of communication and
(PMD qIeoutput ports. It uses an address-based can provide parallel pathsS
suPPinput network suitable tor designing routing algorithm for path setup
high-speed switching systems.
amonkesmonly used in telephony and multiprocessor These
whieh
conected switch nodes arranged in 2 log NI systems. The network is
intede input/output ports, N is stages,
the number of network
wherein b is the
ilt
bu wtes
indieates aceiling function providing
the smallest integer
inputoutput ports
W not less than log
andlog CA-101
CA-100
e
W
S
AN
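The a-shuffle link pattern mentioned in the delta-network answer can be expressed as a simple index permutation. The helper below is an illustrative sketch (the function name is ours); it uses the usual closed form in which terminal i is sent to (a*i) mod (n-1), with the last terminal fixed:

```python
def a_shuffle(n_terminals, a):
    """Destination terminal of each source terminal under an a-shuffle."""
    n = n_terminals
    return [(a * i) % (n - 1) if i < n - 1 else i for i in range(n)]

# 2-shuffle of 8 terminals (the classic perfect shuffle):
print(a_shuffle(8, 2))    # [0, 2, 4, 6, 1, 3, 5, 7]
# 3-shuffle of the 12 links between the stages of the 3^2 x 4^2 delta network:
print(a_shuffle(12, 3))   # [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]
```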
4. Draw the block diagram of C-access memory organization. Why is it necessary and how does it improve the memory access time? [WBUT 2008]
Answer:
Fig: Pipelined access of eight consecutive words in a C-access memory configuration
In the C-access memory configuration, the access of the memory modules is overlapped in a pipelined fashion. The major cycle is subdivided into m minor cycles. If t is the major cycle and r is the minor cycle, then we can write r = t/m, where m is the degree of interleaving. The major cycle t is the total time required to complete the access of a single word from a module; the minor cycle r is the actual time needed to produce one word. The overlapped accesses of successive memory modules are separated by one minor cycle.
Here we give an example of the timing of the pipelined access of eight contiguous memory words in a C-access memory organization. The pipelined access of the block of eight contiguous words is merged with other pipelined block accesses before and after the present access. Even though the total block access time is two major cycles, the effective access time of each word is reduced to r, as the memory is accessed in a pipelined fashion.

5. Differentiate between C-access and S-access memory organizations. [WBUT 2010]
Answer:
C-access:
- More than one module can share a memory bank. This increases the bank utilization and reduces the bank cost.
- The low-order m bits of the memory address are used to select the module, and the remaining (n-m) bits address the desired element within the module.
- The effectiveness of this configuration is revealed by its ability to pipeline the access of the elements of a vector.
S-access:
- More than one module can share a memory bank. This increases the bank utilization and reduces the bank cost.
- The low-order m bits of the memory address are used to select the module, and the remaining (n-m) bits address the desired element within the module.
- A single access returns m consecutive words of information from the m memory modules. The S-access configuration is ideal for accessing a vector of data elements or for pre-fetching sequential instructions for a pipelined processor.

6. What are the differences between loosely coupled and tightly coupled architecture? What do you mean by non-uniform memory access, uniform memory access and memory bandwidth? [WBUT 2011]
OR,
What are the differences between loosely coupled and tightly coupled architectures? Compare and contrast UMA & NUMA with examples. [WBUT 2013, 2014, 2016]
What is Dumb memory? [WBUT 2013]
Answer:
MIMD computers with shared memory are known as tightly coupled machines; examples are ENCORE and MULTIMAX. Tightly-coupled multiprocessor systems contain
multiple CPUs that are connected at the bus level. These CPUs may have access to a central shared memory (SMP or UMA), or may participate in a memory hierarchy with both local and shared memory (NUMA).
MIMD computers with an interconnection network are known as loosely coupled machines; examples are the INTEL iPSC and nCUBE. Loosely-coupled multiprocessor systems, often referred to as clusters, are based on multiple standalone single or dual processor commodity computers interconnected via a high speed communication system. A Linux Beowulf cluster is an example of a loosely-coupled system. Tightly-coupled systems perform better and are physically smaller than loosely-coupled systems, but have historically required greater initial investments and may depreciate rapidly; nodes in a loosely-coupled system are usually inexpensive commodity computers and can be recycled as independent machines upon retirement from the cluster.
Shared memory does not mean that there is a single, centralized memory. The symmetric shared-memory multiprocessors are known as UMAs (uniform memory access). Uniform Memory Access (UMA) is a computer memory architecture used in parallel computers having multiple processors and possibly multiple memory chips. All the processors in the UMA model share the physical memory uniformly; peripherals are also shared, while cache memory may be private to each processor. In a UMA architecture, the access time to a memory location is independent of which processor makes the request or which memory chip contains the target data. It is used in symmetric multiprocessing (SMP).
Uniform Memory Access computer architectures are often contrasted with Non-Uniform Memory Access (NUMA) architectures. UMA machines are, generally, harder for computer architects to design, but easier for programmers to program, than NUMA architectures.
Non-Uniform Memory Access or Non-Uniform Memory Architecture (NUMA) is a computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory. NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures. Modern CPUs operate considerably faster than the main memory they are attached to, and limiting the number of memory accesses is the key to extracting high performance from a modern computer. The dramatic increase in size of the operating systems and of the applications run on them has generally overwhelmed the cache memory of each processor. NUMA attempts to address this problem by providing separate memory for each processor, avoiding the performance hit when several processors attempt to address the same memory. NUMA can improve the performance over a single shared memory by a factor of roughly the number of processors (or separate memory banks).

7. What is the significance of interconnection network in multiprocessor architecture? [WBUT 2012, 2014]
Answer:
The purpose of a network in a distributed system is to allow exchanging data between processors. Regarding this network data exchange, two important terms need to be introduced: network switching and network routing. Network switching refers to the method of transportation of data between processors in the network. There are roughly two classes of network switching: circuit switching and packet switching.
In circuit switching, a connection between the source and destination processors is established and kept intact during the entire data transmission; during communication, no other processors can use the allocated communication channel(s). This is like how a traditional telephone network works. Some of the early parallel machines used this switching method, but nowadays they mainly use packet switching. In packet switching, data are divided into relatively small packets and a communication channel is allocated only for the transmission of a single packet. Thereafter, the channel may be freely used for another data transmission or for the next packet of the same transmission.
The processors in a parallel architecture must be connected in some manner. Interconnection networks carry data between processors and to memory, and the interconnects are made of switches and links (wires, fiber). Interconnects are classified as static or dynamic. Static networks consist of point-to-point communication links among processing elements (PEs) and are also referred to as direct networks. Dynamic networks are built using switches and communication links and are also referred to as indirect networks. A variety of network topologies have been proposed and implemented; these topologies trade off performance for cost, and commercial machines often implement hybrid topologies for reasons of packaging, cost and available components.

8. What do you mean by Program Flow Mechanism? [WBUT 2013]
Answer:
The Program-Flow Architecture is the von Neumann, or control flow, computing model. Here a program is a series of addressable instructions, each of which either specifies an operation along with the memory locations of its operands, or specifies a conditional transfer of control to some other instruction.

9. Use Bernstein's conditions for determining the maximum parallelism in the following sequence of instructions: [WBUT 2015]
I1: A = B x C
I2: B = D + E
I3: C = A + B
I4: E = F - D
W
S
AN
POPLLAR PUBLICATIONSS Similarly for 1
RIOW4-p

TI
based
Answer: derived some conditions R4OWI=o
data depondency and
Bernsten has elahorated the wod ot or praesses. Bemstein conditions WIOW4=
the paralelsm ot mstnatns
Read set or input set R, that
on whch ur can deride
are hased on the folku
ng tuo eth w variables 1) Thc
read h the statemeni of instniction .
) The Write set or
lHence Il and 14 are
independent ofeach other.
consists of memon kcathns written into by instnuction I1. The sets R, For 12 || 13,
kvatnns
output set W that cnsists of memn and writing by S. The
samc kratns arc used for cading

AU
R20W3=
and W. arc nox drsaint as the which are uscd to determine whether
folkow ng are Aemstein Paraikeirsm omditons R3nW2#o
statements are paraliei or nv w20W3=o
locations W, onto which S; writes must
t Locatcns in R frmm whrh S reads and the Hence I2 and I are not independent of each other.
be mutualh ecksve That means S dvs
not rcad from any memory location onto
whch S. urntes t can he denoted as R.W.*¢ For 12 || 14,
)Smilarly. kacations im R: frmm uhich S: reads and the lacations Wi onto which S R20W4#p
writes must e mutualh cclusrnr That means S: docs no read from
any memory
R4NW2-
kcaton cnto wtach S writes lt can he denoted as: R,°W,=¢ W20W4=p
AK
W and W: onto which S, and S; write. should not be read by
The memcorn kcm Hence 12 and 14 are not independent of each other.
S and S That means R and R: shoukd be independernt of W, and W,. It can be denoted
For 13 |14,
To show the operation of Bermstein's conditions. consider the following instructions of R3nW4p
scqucntal program. R40W3-p
W3nW4=p
D-E independent of each other.
2B Hence 13 and are
14

IC AB WBUT 2015]
Delta network.
14 E
F-D 10. Design 2x3'
M

Answer:
Aow, the read set and wrte set of II. 12. 13 and 14 are as follows:
RI B. C; WI A
R2 (D. F w2-B
R3 A.B; w3 C
R4 F.D; W4 E

ow let us find out whether 11


E

and 12 are paralel or not


RIW2z0
R2Wl-
WInw2=
That means ll and 12 are not indepcndem of cach other
AR

Similarly for 11 13.


RIW3s Fig 2a3 Delta network
R3 WI#0
WIW3
HenceI and 13 are not independent of each
uther

CA-107
CA-106
e
W
S
AN
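Returning to answer 9, the set intersections computed by hand can be checked mechanically. This sketch applies Bernstein's three conditions to every instruction pair, using the read/write sets listed above:

```python
from itertools import combinations

# Read and write sets from answer 9.
R = {"I1": {"B", "C"}, "I2": {"D", "E"}, "I3": {"A", "B"}, "I4": {"F", "D"}}
W = {"I1": {"A"},      "I2": {"B"},      "I3": {"C"},      "I4": {"E"}}

def parallel(i, j):
    """Bernstein: Ri∩Wj, Rj∩Wi and Wi∩Wj must all be empty."""
    return not (R[i] & W[j] or R[j] & W[i] or W[i] & W[j])

for i, j in combinations(R, 2):
    verdict = "independent" if parallel(i, j) else "not independent"
    print(f"{i} || {j}: {verdict}")
# Only I1 || I4 and I3 || I4 come out independent, as derived above.
```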
11. What is the difference between centralized shared memory and distributed shared memory? [WBUT 2016]
Answer:
Refer to Question No. 2(b) of Long Answer Type Questions.

Long Answer Type Questions

1. What is the basic purpose of data flow architecture? Compare it with control flow architecture. [WBUT 2005]
OR,
What is the basic objective of data flow architecture? Compare it with control flow architecture. [WBUT 2005, 2015]
Answer:
Data flow computers are based on a data driven mechanism. In a conventional von Neumann machine, instruction execution is under program-flow control, whereas in a data flow computer execution is driven by the availability of data (operands). There are three basic issues in the development of an ideal architecture for future computers: the first is to achieve a high performance/cost ratio; the second is to match that ratio with technological progress; and the third is to offer better programmability in application areas. The data flow model offers an approach to meet these demands.
Control flow computers are either uniprocessor or parallel processor architectures. In a uniprocessor system the instructions are executed sequentially; this is called a control driven mechanism. Parallel control flow computers use shared memory, so instructions executed in parallel may cause side effects on other instructions and data through the shared memory. In a control flow computer the sequence of execution of instructions is controlled by the program counter register. Shared memory cells are the means by which data is passed between instructions; data (operands) are referenced by their memory addresses (variables). In the traditional sequential control flow model there is a single thread of control, which is passed from instruction to instruction; in the parallel model there can be more than one thread of control active at an instant, which requires means for synchronizing these threads.

2. a) Compare dynamic connection networks such as multistage interconnection networks and crossbar switch networks in terms of the following characteristics: bandwidth, hardware complexity (such as switching, arbitration, wires etc.) and scalability.
b) Compare between centralized and distributed shared memory architecture. Which is the best architecture among them and why? [WBUT 2007]
Answer:
a) Multi-stage networks connect n processors with n memory blocks (so-called modules). The connection is established over several stages; each stage consists of a number of switches, where each input must be connected to an output. Normally, a 2x2 switch box is used to build up a multi-stage network. The figure shows the four different switching possibilities:
(a) Straight through (b) Criss-cross (c) Upper broadcast (d) Lower broadcast
A multistage interconnection network may be capable of performing highly reliable communications with less hardware. In a multistage interconnection network for interconnecting a plurality of nodes, the first and final stages each have switches two times as large as the number of switches at an intermediate stage. Two output ports of each node are connected to the input ports of different first stage switches, and two input ports are connected to the output ports of different final stage switches. The input ports of the intermediate stage switches are connected to the output ports of different first stage switches, and their output ports are connected to the input ports of different final stage switches. At least one output port of each switch at the first stage is directly connected to at least one input port of a corresponding switch at the final stage.
A crossbar switch is a switching/routing device optimized to switch and reroute very high speed synchronous data communication signals without interruptions and/or excessive transition time shifts. In a network, a cross-bar switch is a device that is capable of channeling data between any two devices that are attached to it, up to its maximum number of ports. The paths set up between devices can be fixed for some duration or changed when desired, and each device-to-device path (going through the switch) is usually fixed for some period. Cross-bar topology can be contrasted with bus topology, an arrangement in which there is only one path that all devices share. A major advantage of cross-bar switching is that, as the traffic between any two devices increases, it does not affect traffic between other devices. In addition to offering more flexibility, a cross-bar switch environment offers greater scalability than a bus environment. However, the crossbar architecture has a small problem: when a crossbar switch serves multiple networks and two frames enter the switch at the same time destined for different ports, one of the frames is blocked while the first frame is switched. This results in frames being queued as they flow through the switch; if there is transient traffic and insufficient buffer space on the switch, packets are dropped.

b) Shared memory systems form a major category of multiprocessors. In this category, all processors share a global memory. Communication between tasks running on different processors is performed through writing to and reading from the global memory. All coordination and synchronization between the inter-processor tasks is also accomplished via the global memory.
eOPULAR PUBLICATIONS Answer:
processors.
a sct of independent
computer s\ stem consists of main problems need to
memon. A shared memor interconnection network. Two

TI
a set of memor
modules. and an degradation due to
sy stem: performance
designing a shared memor degradation might happen when
be addressed when Performancc
problems.
contention, and coherence A typical
shared memory simultaneously.
mutipke processors are tr ing to access the However, having multiple
the contention problem.
design might uses caches to sohe a cohcrence problem. The
the caches. might lead to
copies of data. spread throughout value. But, if one of the

AU
they all cqual the same
copies in the caches are coherent if inconsistent
the copies, then the copy becomes
processors writes over the value of one of In this chapter we study a
the other copies.
because it no longer equals the value of
their sol ons of the cache coherence problem. 10
vaniety of shared memor svstems and
Memon Access (UMA), Non-unifom memory
The aspects studied include l'nifom
Architecture (COMA). Bus Based Symmetric
access (NUMA) Cache-onl memor
Protocols, Directory Based
Mutiprocessors. Basic Cache Coherency Methods. Snooping
Protocols and Shared Memor Programming. 13

To support larger processor counts. memory must be distributed among the


processors
AK
rather than centralized: otherwise the memory system would not be able to support the
bandwidth demands of a larger number of processors without incuring excessively long Fig: 16-input Omega network using 2 x 2 switches
access latency. With the rapid increase in processor performance and the associated
increase in aa processor's memory bandwidth requirements. the scale of multiprocessor for for routing a message from node 1011= 11
i) Now, we have shown the switching setting
which distributed memor is preferred over a single. centralized memory continues to 5 and from node 01ll= 7 to node 1001 = 9 simultaneously
in the figure
to node 0101=
decrease in number (which is anothet reason not to use smal and large scale). Of course,
below.
the larger number of processors raises the need for a high bandwidth interconnects. Both
direct interconnection networks ti.e. switches) and indirect networks (typically
M

mutidimensional meshes) are used. So. we can say that distributed share memory
architecture is beter than centralize memory architecture.

3. Drewa 164input Omega network using 2x2 switches as building blocks:


9Show the switching setting for routing a message from node 1011 to node 0101
and from node 0111 to node 1001 simultaneously. Does blocking exist in this
case?
Determine how many permutations can be implemented in one-pass through
E

Ohls Omega networt What is the percentage of one-pass


permutations among
al permutatons?
n what is the maximum number of passes needed to implement any permutation
through the network? WBUT 2008]
AR

sie: dala routing from node 1011 to node 0101 and 01 I| to 1001

CA-111

CA-110
e
W
S
AN
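Omega networks are self-routing by destination tag: after each perfect shuffle, a switch examines one destination bit (0 selects the upper output, 1 the lower). The sketch below (function name ours, not from the text) traces the two paths from part (i) and confirms they never meet at the same switch:

```python
def omega_route(src, dst, n_bits=4):
    """Trace src -> dst through a 2^n_bits-input Omega network.

    At each of the n_bits stages the label is perfect-shuffled
    (rotate left), then the corresponding destination bit replaces
    the low-order bit: 0 = upper switch output, 1 = lower output.
    """
    n = 1 << n_bits
    label, path = src, [src]
    for stage in range(n_bits):
        label = ((label << 1) | (label >> (n_bits - 1))) & (n - 1)  # shuffle
        bit = (dst >> (n_bits - 1 - stage)) & 1                     # dest bit
        label = (label & ~1) | bit                                  # exchange
        path.append(label)
    return path

print([f"{x:04b}" for x in omega_route(0b1011, 0b0101)])
print([f"{x:04b}" for x in omega_route(0b0111, 0b1001)])
# The two paths pass through different switches at every stage,
# so both connections can be set up at once: no blocking, as stated above.
```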
POPULAR PUBLICATIONS
POCesses working together
the above picture. on a multiprocessor
problem as is shown in there PpEd onto the common memory. Any process can snare a single virtual adress spa
There is no blocking in the above permutations in a single pass and can read or write a word of memory oy

TI
network can implement n* AUng a LOAD
i) An n-input Omega or STORE instruction.
are total n! Permutation is
occurred. occumred in first pass. Onunicate by simply having one of them Nothing else is needed. Two processes can
Here, n= l6. so, there
are 16 = 16 pemutation is OC Tead them back. Each write data to memory and having the on
of the 16 CPUs runs a single process, which has been assigneu
There are total 16! Permutations happen.
16/16! = .000205 T ne 16 sections to analyze. Some examples are the Sun Enterprise
onepass permutations among all pemutations= NUMA-Q, SGI 10000, Sequent
So, the percentage of Origin 2000, and HP/Convex Exemplar.
= 0.0205 %

AU
Omputer design for a parallel architecture is one in which each CPU has its own
passes needed to implement any permutation through P emory, accessible only to itself and not any other CPU. Such a
ii) In general, maximum number of to design is
the network is log 2n. where n is the
number of inputs. ed d multicomputer
or sometimes a distributed memory system and is illustrated in
igure below. Multicomputers are frequently loosely coupled. The key aspect of a
4. a) What do you mean by multiprocessor system? What are the similarities and dissimilarities between the multiprocessor system and multiple computer system?
b) What are the different architectural models for multiprocessors? Explain each of them with example. [WBUT 2010]
OR
What is the fundamental difference in interprocessor coordination mechanism between multiprocessor and multicomputer systems? Explain with reference to their architectural differences. [WBUT 2013]
c) Distinguish between loosely coupled and tightly coupled multiprocessor architectures. Which architecture is better and why? [WBUT 2010]
Answer:
a) 1st part:
A multiple processor system consists of two or more processors that are connected in a manner that allows them to share the simultaneous (parallel) execution of a given computational task. Parallel processing has been advocated as a promising approach for building high-performance computer systems. Two basic requirements are inevitable for the efficient use of the employed processors: (1) low communication overhead among processors while executing a given task, and (2) a degree of inherent parallelism in the task. A number of communication styles exist for multiple-processor networks. These can be broadly classified according to (1) the communication model (CM) or (2) the physical connection (PC). According to the CM, networks can be further classified as (1) multiple processors (single address space, or shared-memory computation) or (2) multiple computers (multiple address spaces, or message-passing computation). According to PC, networks can be further classified as (1) bus-based or (2) network-based multiple processors.

2nd part:
In a multiprocessor system, all CPUs share a common physical memory, as illustrated in the figure below. A system based on shared memory, like this one, is called a multiprocessor, or sometimes just a shared memory system.

Fig: A multiprocessor with 16 CPUs sharing a common memory

The multiprocessor model extends into software. All processes working together on a multiprocessor can share a single virtual address space mapped onto the common memory. Any process can read or write a word of memory by just executing a LOAD or STORE instruction; nothing else is needed. Two processes can communicate by simply having one of them write data to memory and having the other one read it back. For example, each of the 16 CPUs can run a single process that has been assigned one of 16 sections of a shared data set to analyze. Some examples of multiprocessors are the Sun Enterprise 10000, Sequent NUMA-Q, SGI Origin 2000, and HP/Convex Exemplar.

The other computer design for a parallel architecture is one in which each CPU has its own private memory, accessible only to itself and not to any other CPU. Such a design is called a multicomputer, or sometimes a distributed memory system, and is illustrated in the figure below. Multicomputers are frequently loosely coupled. The key aspect of a multicomputer that distinguishes it from a multiprocessor is that each CPU in a multicomputer has its own private, local memory that it can access by just executing LOAD and STORE instructions, but which no other CPU can access using LOAD and STORE instructions. Thus multiprocessors have a single physical address space shared by all the CPUs, whereas multicomputers have one physical address space per CPU.

Fig: A multicomputer with 16 CPUs, each with its own private memory, connected by a message-passing interconnection network

Since the CPUs on a multicomputer cannot communicate by just reading and writing a common memory, they need a different communication mechanism: message passing. Examples of multicomputers include the IBM SP2, Intel/Sandia Option Red, and the Wisconsin COW (Cluster of Workstations).

b) There are different architectural models of multiprocessor systems. First we discuss centralized shared-memory architectures. For multiprocessors with small processor counts, it is possible for the processors to share a single centralized memory and to interconnect the processors and memory by a bus. With large caches, the bus and the single memory, possibly with multiple banks, can satisfy the memory demands of a small number of processors. By replacing the single bus with multiple buses, or even a switch, a centralized shared-memory design can be scaled to a few dozen processors, although sharing a centralized memory, even one with multiple banks, becomes less attractive as the number of processors sharing it increases. Since there is a single main memory that has a symmetric relationship
to all processors and a uniform access time from any processor, these multiprocessors are often called symmetric (shared-memory) multiprocessors (SMPs), and this style of architecture is sometimes called UMA (uniform memory access). This type of centralized shared-memory architecture is currently by far the most popular organization. The figure below shows what these multiprocessors look like: each processor has one or more levels of cache and connects over a shared bus to the main memory and the I/O system.

Fig: Basic structure of a centralized shared-memory multiprocessor

So, communication with shared memory can be characterized as follows:
• It allows a familiar programming style; sometimes it is straightforward to make an existing program run on a parallel machine (with a small number of processors).
• It requires synchronization, with critical regions or semaphores, for shared data.
• The cache can help reduce the amount of communication.
• Complex hardware is needed to keep the caches correct (coherent).
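The two coordination mechanisms can be made concrete in a few lines. The following is a minimal Python sketch, not from the text: a multiprocessor-style process writes a shared word under a lock (synchronization for shared data, as in the bullets above), while a multicomputer-style process has no shared memory and must send an explicit message, modeled here with a pipe.

```python
from multiprocessing import Process, Value, Pipe, Lock

def shared_writer(counter, lock):
    # Multiprocessor style: update a shared word (needs synchronization).
    with lock:
        counter.value += 1

def message_writer(conn):
    # Multicomputer style: no shared memory, send an explicit message.
    conn.send("hello")
    conn.close()

if __name__ == "__main__":
    counter, lock = Value("i", 0), Lock()
    p = Process(target=shared_writer, args=(counter, lock))
    p.start(); p.join()
    print(counter.value)            # 1 -- read back from shared memory

    parent, child = Pipe()
    q = Process(target=message_writer, args=(child,))
    q.start()
    print(parent.recv())            # "hello" -- received as a message
    q.join()
```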
The second group consists of multiprocessors with physically distributed memory. To support larger processor counts, memory must be distributed among the processors rather than centralized; otherwise the memory system would not be able to support the bandwidth demands of a larger number of processors without incurring excessively long access latency. With the rapid increase in processor performance and the associated increase in a processor's memory bandwidth requirements, the scale of multiprocessor for which distributed memory is preferred over a single, centralized memory continues to decrease. Of course, the larger number of processors also raises the need for a high-bandwidth interconnect; both direct interconnection networks (i.e., switches) and indirect networks (typically multidimensional meshes) are used. The figure below shows what these multiprocessors look like.

Fig: The basic architecture of a distributed-memory multiprocessor

The basic architecture of a distributed-memory multiprocessor consists of individual nodes containing a processor, some memory, typically some I/O, and an interface to an interconnection network that connects all the nodes. Individual nodes may contain a small number of processors, which may be interconnected by a small bus or a different interconnection technology that is less scalable than the global interconnection network. Distributing the memory among the nodes has two major benefits. First, it is a cost-effective way to scale the memory bandwidth, if most of the accesses are to the local memory in the node. Second, it reduces the latency for accesses to the local memory. These two advantages make distributed memory attractive at smaller processor counts as processors get ever faster and require more memory bandwidth and lower memory latency. The key disadvantage of a distributed-memory architecture is that communicating data between processors becomes somewhat more complex and has higher latency, at least when there is no contention, because the processors no longer share a single memory. As we will see shortly, the use of distributed memory leads to two different paradigms for interprocess communication.
c) In many cases, each processor executes a different process. A process is a segment of code that runs independently; the state of the process contains all the information necessary to execute that program on a processor. In a multiprogrammed environment, where the processors may be running independent tasks, each process is independent of the processes on other processors.

MIMD computers with shared memory are typically known as tightly coupled machines; examples are ENCORE and MULTIMAX. Tightly-coupled multiprocessor systems contain multiple CPUs that are connected at the bus level. These CPUs may have access to a central shared memory
interconnection, though such systems could also be implemented on top of a packet-switching network. Though the network is typically used for routing purposes, it could also be used as a coprocessor to the actual processors for such uses as sorting.

Multistage Networks
Switches have excellent performance scalability but poor cost scalability. Buses have excellent cost scalability, but poor performance scalability. Multistage interconnects strike a compromise between these extremes. A number of p × q switches are present in every stage of such a network, and there is a fixed inter-stage connection pattern between the switches in adjacent stages, as shown in the figure below.

Fig: The schematic of a typical multistage interconnection network (inputs on the left, outputs on the right, stages 1 through n)

Multistage Omega Network
One of the most commonly used multistage interconnects is the Omega network. An Omega network with p inputs/outputs consists of log2 p stages. At each stage, input i is connected to output j if:
j = 2i, for 0 ≤ i ≤ p/2 - 1
j = 2i + 1 - p, for p/2 ≤ i ≤ p - 1
That is, each stage of the Omega network implements a perfect shuffle of the inputs. A complete Omega network with the perfect shuffle interconnects and 2x2 switches can now be illustrated. Let s be the binary representation of the source and d be that of the destination processor.
• The data traverses the link to the first switching node. If the most significant bits of s and d are the same, then the data is routed in pass-through mode by the switch; else, it switches to crossover.
• This process is repeated for each of the log2 p switching stages.
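The two-case formula above and the "left rotation" view of the perfect shuffle are equivalent; the following minimal sketch (an illustration, not from the text) checks this for p = 8:

```python
def shuffle(i: int, p: int) -> int:
    """One Omega stage: perfect-shuffle wiring from input i to output j."""
    return 2 * i if i < p // 2 else 2 * i + 1 - p

p, bits = 8, 3
for i in range(p):
    rot = ((i << 1) | (i >> (bits - 1))) & (p - 1)   # 1-bit left rotation
    assert shuffle(i, p) == rot
    print(f"{i:0{bits}b} -> {shuffle(i, p):0{bits}b}")
```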

Fig: A complete omega network connecting eight inputs and eight outputs

6. Describe the different types of interconnection networks in computer systems. What is a multistage switching network? [WBUT 2013]
Answer:
In static networks, there are point-to-point connections between neighboring nodes. These connections are fixed once the network is built, which is why such networks are called static. Static networks use direct links and are suitable for building computers whose communication patterns are predictable. Well-known examples of static networks are the linear array, ring, mesh, torus and cube.
Now we consider the ring as a static interconnection network. A ring is obtained by connecting the two terminal nodes of a linear array with one extra link. In a linear array, each internal node has two neighbors, one to its left and one to its right. The ring is like the linear array, but the diameter of the network is cut in half if the links are bidirectional.

Fig 1: Ring interconnection network

Dynamic networks are implemented with switched channels, which are dynamically configured to match the communication demand in user programs. Examples of dynamic interconnection networks are the bus, multistage switches, the crossbar switch, etc.; multistage switching networks are described under Multistage Networks above.
Let us consider the crossbar switch as a dynamic interconnection network. The crossbar switch corresponds to an N × M array in which semiconductor switches are located at each of the cross points where input and output wires cross. We connect an input to an output by closing the switch at the cross-point at the intersection of the appropriate row and column. The complexity of a crossbar has two cost components: one grows in proportion to the number of inputs and outputs, and the other grows as their product. The product term is often called the cross-point count, because it is directly related to the number of simple cross-points required to implement the crossbar: a crossbar requires N^2 cross-points for N pairs of terminals.

Fig 2: Crossbar
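The N^2 growth of the cross-point count is exactly what multistage networks avoid. A small sketch (an illustration under the usual assumptions, not from the text) comparing the cross-point count of a crossbar with the 2x2-switch count of an Omega-style multistage network:

```python
import math

def crossbar_crosspoints(n: int) -> int:
    return n * n                              # N^2 cross-points

def omega_switches(n: int) -> int:
    return (n // 2) * int(math.log2(n))       # (N/2) * log2(N) 2x2 switches

for n in (8, 64, 1024):
    print(n, crossbar_crosspoints(n), omega_switches(n))
# n=8: 64 vs 12;  n=64: 4096 vs 192;  n=1024: 1048576 vs 5120
```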

7. Describe different access methods of the memory system. What will be the maximum capacity of a memory which uses an address bus of size 8 bits? [WBUT 2013]
Answer:
There are two types of memory access organization, i.e. C-access and S-access. Refer to Short Answer Type Questions, Question Nos. 1 and 5, for C-access and S-access memory organization.
The maximum capacity of a memory which uses an address bus of size 8 bits is 2^8 = 256 bytes.

8. What is multi-processor system? Classify it with examples. [WBUT 2015]
Answer:
Refer to Question No. 5 of Long Answer Type Questions.

9. Construct a multiport network where three processing elements want to connect with three memory modules. Design a network where 9 inputs want to connect with 25 outputs. What is the difference between omega network and delta network? Construct an omega network for N = 8, where N represents the number of processors. [WBUT 2016]
Answer:
1st part:
A multiport network where three processing elements want to connect with three memory modules is shown below: the processing elements (PE1, PE2, PE3) drive the rows, the memory modules (MU) terminate the columns, and a control unit closes the appropriate crosspoint switches.

Fig: A crossbar-style multiport network connecting three processing elements to three memory modules
2nd part:
A network where 9 inputs want to connect with 25 outputs can be built as a delta network of 3 × 5 crossbar modules, since 9 = 3^2 and 25 = 5^2.

Fig: A delta network with 9 inputs and 25 outputs

3rd part:
Difference between omega network and delta network: An N × N Omega network consists of log2 N identical stages, and between two consecutive stages there is a perfect shuffle interconnection. This network maintains a uniform connection pattern between stages, and every input terminal has a unique path to every output terminal.
In an a^n × b^n delta network there are a^n sources and b^n destinations, and there is a unique interconnection path of constant length between each source and destination. Numbering the stages of the network as 1, 2, ..., n, starting at the source side of the network, requires that there be a^(n-1) crossbar modules (each of size a × b) in the first stage.

4th part:
An omega network for N = 8, where N represents the number of processors: the network consists of log2 8 = 3 stages, and at each stage input i is connected to output j if:
j = 2i, for 0 ≤ i ≤ N/2 - 1
j = 2i + 1 - N, for N/2 ≤ i ≤ N - 1
Each stage of the Omega network thus implements a perfect shuffle, i.e. a one-bit left rotation of the input index:
0 = 000 → 000    4 = 100 → 001
1 = 001 → 010    5 = 101 → 011
2 = 010 → 100    6 = 110 → 101
3 = 011 → 110    7 = 111 → 111

Fig 1: A perfect shuffle interconnection for eight inputs and outputs

Fig 2: A complete omega network connecting eight inputs and eight outputs
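For the 9-input/25-output design in the 2nd part, the stage-by-stage module counts follow directly from the delta construction. A small sketch, assuming the standard rule that stage i of an a^n × b^n delta network uses a^(n-i) * b^(i-1) crossbar modules of size a × b:

```python
def delta_stage_modules(a: int, b: int, n: int):
    """Crossbar-module count per stage of an a^n x b^n delta network."""
    return [a ** (n - i) * b ** (i - 1) for i in range(1, n + 1)]

stages = delta_stage_modules(3, 5, 2)   # 9 = 3**2 inputs, 25 = 5**2 outputs
print(stages)                           # [3, 5]
assert stages[0] * 3 == 9               # first stage accepts 3 * 3 = 9 inputs
assert stages[-1] * 5 == 25             # last stage drives 5 * 5 = 25 outputs
```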
10. Write short notes on the following:
a) Omega network [WBUT 2005, 2006, 2008]
b) Crossbar Switches [WBUT 2008]
c) Multiport Network [WBUT 2010]
d) Memory inclusion [WBUT 2010, 2014]
e) Memory interleaving

Answer:
a) Omega network:
Refer to Question No. 9 of Long Answer Type Questions.

b) Crossbar Switches:
Crossbar switches allow any processor in the system to connect to any other processor or memory unit, so that many processors can communicate simultaneously without contention. A new connection can be established at any time, as long as the requested input and output ports are free. Crossbar switches are used in the design of high-performance small-scale multiprocessors and in the design of routers for direct networks. A crossbar can be defined as a switching network with N inputs and M outputs, which allows up to min{N, M} one-to-one interconnections without contention. Figure 1.1 shows an N × M crossbar network. Usually M = N, except for crossbars connecting processors and memory modules.

Fig 1.1: An N × M crossbar

The cost of such a network is O(NM), which is prohibitively high for large N and M. For a crossbar network with distributed control, each switch point may have four states, as shown in Figure 1.2. In Figure 1.2 (a), the input from the row containing the switch point has been granted access to the corresponding output, while inputs from upper rows requesting the same output are blocked. In Figure 1.2 (b), an input from an upper row has been granted access to the output; the input from the row containing the switch point does not request that output and can be propagated to other switches. In Figure 1.2 (c), an input from an upper row has also been granted access to the output, but the input from the row containing the switch point also requests that output and is blocked. The configuration in Figure 1.2 (d) is only required if the crossbar has to support multicasting (one-to-many communication).

Fig 1.2: States of a switch point in a crossbar network

c) Multiport Network:
The bank-based multiport memory is an approach to realizing higher access bandwidth than a conventional N-port memory cell approach. However, this method is unsuitable for large numbers of ports and banks, because the hardware resources of the crossbar network which connects the ports and banks grow in proportion to the product of the number of ports and the number of banks.
One realization is a parallel processor array with a two-dimensional crossbar-switch architecture, wherein the individual processing elements within each cluster are interconnected by a two-dimensional cluster network of crossbar switch elements, and the clusters themselves are interconnected via a two-dimensional array network of crossbar switch elements. Input data is supplied directly into the array, which allows an optimal initial partitioning of the data set among the processing elements.
Such a parallel processor array includes an interconnection network for interconnecting the processor clusters (a two-dimensional mesh of multi-port crossbar switch elements arranged in rows and columns), wherein each cluster is connected to a port of a row crossbar switch element and to a port of a column crossbar switch element, and wherein an input data set to be processed is supplied directly into the network via crossbar switch element input ports for initial partitioning of the data set among the processing elements. The input data set may be characterized as a three-dimensional data cube of sensor data, in which the first data dimension represents a sensor channel dimension, the second a Doppler dimension, and the third a Range cell dimension. The interconnection network is configurable in an initial state such that the data set is initially distributed among the processing elements for processing in the first data dimension during a first processing function, and subsequently is configurable to perform a data dimension transposition of the data set for processing in the second data dimension by the processing elements during a second processing function.

d) Memory inclusion:
The memory hierarchy satisfies three important properties for information storing: inclusion, coherence, and locality. We consider that the cache memory is at the innermost level M1, while the outermost level Mn represents the tape drive where all the information is stored. The processor can access all addressable words in Mn using the virtual address space of the computer.
Inclusion Property:
According to the inclusion property, information present in an upper (inner) level of the memory hierarchy must also be present in the lower (outer) levels, so we can state the inclusion property as M1 ⊂ M2 ⊂ M3 ⊂ ... ⊂ Mn. At the time of processing, the required portions of memory Mn are copied into Mn-1; similarly, subsets of Mn-1 are copied into Mn-2, and so on. So, if a word is found in memory Mi, then copies of the same word can also be found in all outer levels Mi+1, Mi+2, ..., Mn. But a word stored in Mi+1 may not be found in Mi.
Many multiprocessors use multilevel cache hierarchies to reduce the latency of cache misses. If the cache also provides multilevel inclusion, i.e. every level of the cache hierarchy is a subset of the level further away from the processor, then we can use the multilevel structure to reduce the contention between coherence traffic and processor traffic, as explained earlier. Thus most multiprocessors with multilevel caches enforce the inclusion property. This restriction is also called the subset property, because each cache is a subset of the cache below it in the hierarchy.

e) Memory interleaving:
Interleaving is a technique used to improve memory performance. Memory interleaving increases bandwidth by allowing simultaneous access to more than one chunk of memory. This improves the performance of the processor, because it can transfer more information to and from memory in the same amount of time, and it helps to alleviate the processor-memory bottleneck, which is a major limiting factor in overall performance. Interleaving works by dividing the system memory into multiple blocks; if there are m blocks, this is called m-way memory interleaving. In practice, two-way or four-way interleaving is generally used. Each block of memory is accessed using different sets of control lines, which are merged together on the memory bus. When a read or write to one block is begun, a read or write to another block can be overlapped with the first one.
In an interleaved system, a main memory of size 2^l is divided into m modules, where m is a positive integer (usually m = 2^n for some integer n such that 0 < n < l, l being the number of bits in a main memory address). Each main memory address is mapped to a module, and to an address within that module. Such a mapping is called a hashing scheme; clearly, the mapping must be one-to-one.
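The module mapping can be made concrete with a short sketch of low-order (modulo) interleaving, one common one-to-one hashing scheme under the m = 2^n assumption: the low-order bits of the address select the module, so consecutive addresses fall in different modules and their accesses can be overlapped.

```python
def interleave(addr: int, m: int):
    """Map a main-memory address to (module, address within module)."""
    return addr % m, addr // m

m = 4                                   # four-way interleaving
for addr in range(8):
    module, offset = interleave(addr, m)
    print(f"address {addr} -> module {module}, offset {offset}")
# addresses 0..3 hit modules 0..3, so their accesses can proceed in parallel
```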

QUESTION 2012

Group-A
(Multiple Choice Type Questions)

1. Choose the correct alternatives for the following:
i) A pipeline stage
a) is sequential circuit b) is combinational circuit
c) consists of both sequential and combinational circuits d) none of these
ii) Utilization pattern of successive stages of a synchronous pipeline can be specified by
a) Truth table b) Excitation table c) Reservation table d) Periodic table
iii) SPARC stands for
a) Scalable Processor Architecture b) Superscalar Processor A RISC Computer
c) Scalable Processor A RISC Computer d) Scalable Pipeline Architecture
iv) Which of the following is not a RISC architecture characteristic?
a) simplified and unified format of code of instructions b) no specialized register
c) no storage/storage instruction d) small register file
v) The time to access shared memory is same in which of the following shared memory multiprocessor models?
a) NUMA b) UMA c) COMA d) ccNUMA
vi) Which of the following architectures correspond to von-Neumann architecture?
a) MISD b) MIMD c) SISD d) SIMD
vii) In absence of TLB, to access a physical memory location in a paged-memory system, how many memory accesses are required?
a) 1 b) 2 c) 3 d) 4
viii) A direct mapped cache memory with n blocks is nothing but which of the following set associative cache memory organizations?
a) 0-way set associative b) 1-way set associative
c) 2-way set associative d) n-way set associative
ix) Portability is definitely an issue for which of the following architectures?
a) VLIW processor b) Super Scalar processor c) Super pipelined d) none of these