COA Notes
Syllabus: Machine instructions and addressing modes, ALU and data path, CPU control design, memory interface,
I/O interface (interrupt and DMA mode), instruction pipelining, cache and main memory, secondary storage
Computer organization deals with the operation and interconnection of the various hardware components.

Table 2.1 | Continued

Register symbol   Register name          Function
AR                Address register       Holds address for memory
IR                Instruction register   Holds an instruction that is to be executed
PC                Program counter        Holds address of instruction to be executed next
TR                Temporary register     Holds temporary data if required
INPR              Input register         Holds input character
OUTR              Output register        Holds output character

2.2.2 Quantitative Principles to Design High-Performance Processor

Amdahl's law focuses on the performance gain obtained by enhancing a system. The performance gain is denoted by S_overall, and ET stands for execution time.

$$S_{overall} = \frac{\text{Performance of the system with enhancement}}{\text{Performance of the system without enhancement}}$$

$$S_{overall} = \frac{1/ET_{new}}{1/ET_{old}} \qquad (2.1)$$

$$S_{overall} = \frac{ET_{old}}{ET_{new}}$$

After enhancement, the system consists of two portions: an unenhanced portion and an enhanced portion.

When the system consists of multiple frequent cases, where i indexes the frequent cases, F_i is the fraction of execution affected by case i and S_i is the speed-up of that case:

$$S_{overall} = \left(1 - \sum_i F_i + \sum_i \frac{F_i}{S_i}\right)^{-1}$$

Problem 2.1: Consider a hypothetical processor used in mathematical model simulation. It consists of two functional units, floating point and integer. The floating point unit is enhanced so that it runs two times faster, but only 10% of the instructions are floating point. What is the speed-up?

Solution: Here S = 2, F = 0.1

$$S_{overall} = \left((1 - 0.1) + \frac{0.1}{2}\right)^{-1} = \frac{1}{0.95} \approx 1.053$$

2.3 MACHINE INSTRUCTIONS AND ADDRESSING MODES

A machine instruction is an individual machine code. The complete set of all machine codes recognized by a particular processor makes up its instruction set. Instructions can be grouped according to the function they perform. The ways in which arguments for these machine instructions can be specified constitute the addressing modes of a processor.

2.3.1 Machine Instructions
2. Two-address instruction: Each address field can specify either a processor register or a memory word.
   Example: MOV R1, A (R1 ← M[A]);
            MUL R1, R2 (R1 ← R1*R2)
3. One-address instruction: It uses an implied accumulator (AC) register for all data manipulation. The other operand is in a register or in memory.

Addressing modes are used to:
1. provide programming flexibility to users through the use of pointers to memory, counters for loop control, data indexing and program relocation.
2. reduce the size of the addressing field of the instruction.

Let us suppose [x] means the contents at location x for all the addressing modes.
5. Auto-increment or auto-decrement mode: This is similar to register indirect mode, except that the register containing the effective address is incremented or decremented after (or before) its value is used to access memory.
6. Direct address mode: In this mode, the effective address of an operand is equal to the address part of the instruction. Example: the ADD A instruction adds the content of memory cell A to the accumulator, that is, ACC = [ACC] + M[A].
   [Figure: instruction with direct address mode. The instruction (Opcode | Memory address) points directly to the operand in memory.]
7. Indirect address mode: In this mode, the memory location specified by the address field contains the address of (pointer to) the operand. Example: ADD @A adds the content of the memory cell whose address is stored in memory cell A, that is, ACC = [ACC] + M[M[A]].
   [Figure: instruction with indirect address mode. The instruction (Opcode | Memory address) points to a memory cell holding the pointer to the operand.]
8. Relative address mode: In this mode, the effective address of an operand is obtained by adding the content of the program counter to the address part of the instruction. The address part of the instruction can be either positive or negative, represented in 2's complement. The result obtained after adding the content of the program counter to the address field is an effective address whose position in memory is relative to the address of the next instruction.
9. Index address mode: In this mode, the effective address of an operand is obtained by adding the content of an index register to the address part of the instruction. The index register is a special CPU register that stores an index value, and the address field of the instruction stores the base address of a data array in memory. The distance between the base address and the address of the operand is the index value stored in the index register. The index register can be incremented to facilitate access to consecutive operands stored in arrays using the same instruction.
10. Base register addressing mode: In this mode, the effective address of an operand is obtained by adding the content of a base register to the address part of the instruction. This is similar to the indexed addressing mode, except that the register holds a base (beginning) address instead of an index. It is used for program relocation.

Problem 2.3: A two-word instruction LOAD is stored at location 300 with its address field in the next location. The address field has value 600, the value stored at 600 is 500 and the value stored at 500 is 650. The words stored at 900, 901 and 902 are 400, 401 and 402, respectively. A processor register R contains the number 800 and the index register has value 100. Evaluate the effective address and operand if the addressing mode of the instruction is as follows:
1. Direct
2. Indirect
3. Relative
4. Immediate
5. Register indirect
6. Index

Solution: The memory layout is as follows:

Address   Content
300       LOAD
301       600
500       650
600       500
700       900
800       700
900       400
901       401
902       402

Addressing mode     Effective address   Operand
Direct              600                 500
Indirect            500                 650
Relative            902                 402
Immediate           301                 600
Register indirect   800                 700
Index               700                 900

Problem 2.4: A relative mode branch type instruction is stored in memory at an address equivalent to decimal 600 and the branch is made to an address equivalent to decimal 400. What is the value of the relative address field of the instruction (in decimal)?

Solution: After the instruction is fetched, the program counter holds 601, so relative address = 400 − 601 = −201
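The effective-address rules used in Problem 2.3 can be checked with a short Python sketch. It is only an illustration: the function name `effective_address` and its parameters are hypothetical, the memory contents are copied from the layout given in the solution, and the program counter is assumed to hold 302 once both words of the instruction have been fetched.

```python
# Memory contents taken from the layout in the solution of Problem 2.3.
memory = {300: "LOAD", 301: 600, 500: 650, 600: 500, 700: 900,
          800: 700, 900: 400, 901: 401, 902: 402}


def effective_address(mode, instr_addr=300, register=800, index=100):
    """Return (effective address, operand) for one addressing mode."""
    addr_field_loc = instr_addr + 1       # second word of the instruction
    addr_field = memory[addr_field_loc]   # value 600 in this problem
    pc = instr_addr + 2                   # assumed PC after fetching both words
    if mode == "direct":
        ea = addr_field
    elif mode == "indirect":
        ea = memory[addr_field]
    elif mode == "relative":
        ea = pc + addr_field
    elif mode == "immediate":
        ea = addr_field_loc               # the operand is the address field itself
    elif mode == "register indirect":
        ea = register
    elif mode == "index":
        ea = index + addr_field
    else:
        raise ValueError(f"unknown addressing mode: {mode}")
    return ea, memory[ea]


for mode in ("direct", "indirect", "relative", "immediate",
             "register indirect", "index"):
    print(mode, effective_address(mode))
```

The printed pairs can be compared against the effective address and operand columns of the solution table.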
2.4 ARITHMETIC LOGIC UNIT

Arithmetic logic unit (ALU) is a combinational circuit that performs all arithmetic and logic operations, so that the entire register transfer operation from the source registers through the ALU and into the destination register can be performed during one clock pulse period.

2.4.1 Arithmetic Micro-Operations

The basic arithmetic micro-operations such as addition, subtraction, increment, decrement and shift are performed on numeric data stored in registers. The basic component of arithmetic is the parallel binary adder, and by controlling the input to the adder, different micro-operations can be realized. Figure 2.1 depicts a 2-bit arithmetic circuit, which includes two full-adder circuits and two multiplexers for choosing different arithmetic micro-operations. There are two 2-bit input numbers A and B and a 2-bit output D. The two inputs from A go directly to the X inputs of the full adders. The output of each multiplexer goes to the Y input of its full adder.

Figure 2.1 | A 2-bit arithmetic circuit.

By controlling the output Y of the multiplexers with the two selection inputs S1 and S0, and with Cin set to either 0 or 1, we can generate the eight arithmetic micro-operations (Table 2.2).

2.4.2 Logic Micro-Operations

Logic micro-operations such as AND, OR, Exclusive OR, etc., consider each bit of a register separately and specify binary operations for strings of bits (Table 2.3).

Table 2.3 | Types of micro-operations (A' denotes the complement of A)

Micro-operation    Name
F ← 0              Clear
F ← A ∧ B          AND
F ← A ∧ B'
F ← A              Transfer A
F ← A' ∧ B
F ← B              Transfer B
F ← A ⊕ B          Exclusive OR
F ← A ∨ B          OR
F ← (A ∨ B)'       NOR
F ← (A ⊕ B)'       Exclusive NOR
F ← B'             Complement B
F ← A ∨ B'
F ← A'             Complement A
F ← A' ∨ B
F ← (A ∧ B)'       NAND
F ← all 1's        Set to all 1's
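As an illustration of how the named logic micro-operations act bit by bit, the following Python sketch applies several of them to two 4-bit register values. The register width, the function name `logic_micro_ops` and the sample values are assumptions made for this example only.

```python
WIDTH = 4                    # assumed register width for the example
MASK = (1 << WIDTH) - 1      # keeps complemented results within the register


def logic_micro_ops(a, b):
    """Return name -> result for several logic micro-operations on a and b."""
    return {
        "AND": a & b,
        "OR": a | b,
        "XOR": a ^ b,
        "NAND": ~(a & b) & MASK,
        "NOR": ~(a | b) & MASK,
        "XNOR": ~(a ^ b) & MASK,
        "Complement A": ~a & MASK,
        "Complement B": ~b & MASK,
    }


results = logic_micro_ops(0b1100, 0b1010)
for name, value in results.items():
    print(f"{name:13s} {value:0{WIDTH}b}")
```

Masking with `MASK` is needed because Python integers are unbounded, so a plain `~` would otherwise yield a negative number instead of a 4-bit pattern.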
   1 1 0 0   A(t)
   1 0 1 0   B
   0 1 1 0   A(t+1)   (A ← A ⊕ B)

6. Insert operation: It is used to insert a specific bit pattern into the A register, leaving the other bit positions unchanged. This is accomplished by two sub-operations: a masking operation to clear the desired bit positions, followed by an OR operation to introduce the new bits into the desired positions. Suppose you want to introduce 10 into the low-order two bits of A, changing 1101 (original) into 1110 (desired):

   1 1 0 1   A (Original)
   1 1 0 0   Mask
   1 1 0 0   A (Intermediate)
   0 0 1 0   Added bits
   1 1 1 0   A (Desired)

Figure 2.2 | General register organization.

The select lines SELA and SELB of multiplexers A and B each select one of the inputs and feed it to the ALU. OPR specifies one of the possible operation codes that the ALU will perform on the data inputs, and the output is transferred either to one of the registers, selected through a 2 × 4 decoder, or to the external output, say memory. The control word (Fig. 2.3) for a two-operand instruction is as follows:

   SELA     SELB     SELD     OPR
   2 bits   2 bits   2 bits   4 bits

Figure 2.3 | A control word.
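A control word such as the one in Fig. 2.3 is simply a concatenation of bit fields. The sketch below packs and unpacks such a word in Python; the helper names and the sample field values (including the OPR code 0101) are hypothetical, and only the field widths come from the figure.

```python
# Field widths from the control word of Fig. 2.3, most significant field first.
FIELDS = (("SELA", 2), ("SELB", 2), ("SELD", 2), ("OPR", 4))


def pack_control_word(values, fields=FIELDS):
    """Concatenate the field values into a single integer control word."""
    word = 0
    for name, width in fields:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_control_word(word, fields=FIELDS):
    """Split a control word back into its named fields."""
    values = {}
    for name, width in reversed(fields):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


example = {"SELA": 1, "SELB": 2, "SELD": 3, "OPR": 0b0101}
word = pack_control_word(example)
print(f"{word:010b}")              # 10-bit control word
print(unpack_control_word(word))
```

Passing a different `fields` tuple lets the same helpers describe a wider control word, for example one with 4-bit select fields for a 16-register organization.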
3. Stack organization: A stack may consist of a number of registers or a part of main memory in which data items are stored in consecutive locations that are accessed by a LIFO (last in, first out) mechanism. As there is a limited number of registers, a part of memory is implemented as a stack for storage and retrieval of intermediate data. The stack pointer (SP) keeps track of the top item of the stack. The process of inserting a new item onto a stack is known as push, accomplished by first incrementing the stack pointer and then inserting an item from the data register (DR).
   SP ← SP + 1
   M[SP] ← DR
   The process of removing an item from the top of a stack is known as pop, performed by first transferring data into DR and then decrementing SP.
   DR ← M[SP]
   SP ← SP − 1

Problem 2.5: A system has a CPU organized in the form of general register organization consisting of 16 registers, each storing 32-bit data. Assume the ALU has 35 operations.
(a) How many multiplexers are there in A bus and B bus, and what is the size of each multiplexer?
(b) How many selection inputs are needed for MUX A and MUX B?
(c) How many inputs and outputs are there in a decoder?
(d) How many inputs and outputs are there in the ALU for data, including input and output carries?
(e) Formulate a control word for the system.

Solution:
(a) 32 multiplexers for each bus, each of size 16 × 1.
(b) 4 selection inputs each, to select one of 16 registers.
(c) A 4-to-16 line decoder: 4 inputs and 16 outputs.
(d) 32 + 32 + 1 = 65 data input lines and 32 + 1 = 33 data output lines.
(e) The control word is:

   SELA     SELB     SELD     OPR
   4 bits   4 bits   4 bits   6 bits

[Figure 2.4: flowchart of the fetch-execute cycle. Start → load PC contents to MAR → increment PC to point to next instruction → load the instruction stored at MAR to IR (IR ← M[MAR]) → decode the instruction → load any data required into MDR → check for jump (if yes, set PC to value from jump instruction) → execute the instruction → check for interrupts (if yes, service the interrupt; if no, continue with the next instruction).]

A CPU generally executes one instruction at a time sequentially, and a sequence of such instructions is known as a program. The CPU executes the instructions that reside in the main memory. In order to execute an instruction, the CPU first has to fetch the instruction from the main memory into one of its registers. It then decodes the instruction, that is, it decides what the instruction is intended to do, fetches the required operands and finally executes the instruction. This process is repeated continuously for a complete program and is known as the fetch-execute cycle (Fig. 2.4). The following steps are performed for executing an instruction:

1. Fetching the instruction: The next instruction is fetched from the memory address that is held in the program counter, and the memory content fetched is stored in the instruction register (IR). The program counter then points to the next instruction that will be read in the next cycle.
2. Decode the instruction: During this cycle, the instruction inside the IR is interpreted by the decoder.
3. Operand fetch: In case of a direct or indirect memory instruction, the execution begins in the next clock cycle. If the instruction has an indirect address, the effective address of the operand is read from the main memory, and the required data is fetched from memory into the memory data register. If the instruction has a direct address, nothing is done in this clock cycle.
4. Execute the instruction: The control unit of the CPU passes the instruction decoded by the decoder as a sequence of control signals to the different functional units of the CPU to execute the tasks required by the instruction, such as reading values from registers or input devices, performing mathematical or logic micro-operations in the ALU, and writing the result back to a register or main memory.

2.5.2 CPU Data Path

The CPU contains data paths that are responsible for routing data between the functional units of a computer. The following are the different data path structures available for routing:
1. Single bus structure: In this architecture, all CPU registers are connected to the same bus. Data can be transferred either between CPU registers or between a CPU register and the ALU at a given clock pulse. The speed of operation is slow, as only one operand can be transferred in one clock cycle, and the addition operation (R1 ← R2 + R3) takes three clock cycles.
2. Two bus structure: All general purpose CPU registers are connected to both buses, say bus A and bus B; but the special purpose registers are divided into two groups, say group 1 connecting bus A to the program counter and one input of the ALU, and group 2 connecting bus B to the MDR (Memory Data Register) and the other input of the ALU. The two operands are transferred to the ALU in one clock cycle and the addition operation (R1 ← R2 + R3) takes two clock cycles.
3. Three bus structure: The performance can be further improved by using three buses, such that the addition operation (R1 ← R2 + R3) can occur in one clock cycle.

[…]

1. Hardwired control: […] The step decoder generates a separate control line for each step in the control sequence. The encoder gets its input signals from the instruction decoder, step decoder, external inputs and condition codes, and generates the individual control signals. Hardwired control is faster and more efficient but less flexible, and it is difficult to add new features or correct mistakes in the original design.

Figure 2.5 | Block diagram of hardwired control unit.

2. Micro-programmed control: Control signals are generated by programs known as micro-programs, which consist of micro-instructions (control words) (Fig. 2.6). Memory that is part of the CPU is known as control memory and stores micro-instructions. The micro-program sequencer generates the address of a micro-instruction according to the instruction stored in the instruction register. The address of the micro-instruction to be executed is available in the control address register. The micro-program sequencer issues a read command to read the micro-instruction from control memory into the micro-instruction register, which on execution generates control signals for various parts of the processor. This control unit design is more flexible in accommodating new features and less error prone, but slower than the hardwired unit.

[Figure 2.6: block diagram of the micro-programmed control unit. The IR, external inputs and condition codes feed the sequencer (starting and branch address generator), which loads the control address register; the address and a read command go to the control memory, which outputs the control word.]

The format of the control word is

   Branch condition | Flag | Control signal | Control memory address

Table 2.4 | RISC versus CISC

RISC (Reduced Instruction Set Computers)              CISC (Complex Instruction Set Computers)
Rich register set                                     Fewer registers
Supports fewer addressing modes                       Supports more addressing modes
Supports fixed-length instructions                    Supports variable-length instructions
Successful pipeline with one instruction per cycle    Unsuccessful pipeline
Example: ARM, Motorola                                Example: Pentium processors
On the basis of the type of control word supported, the micro-programmed control unit is divided into two types:
1. Horizontal micro-programmed control unit: In this design, each control signal is represented by its own bit in the control word, which results in a longer control word.
2. Vertical micro-programmed control unit: In this design, the control signals are represented in an encoded format.

Problem 2.6: Consider a control unit which has a control memory of 1024 control words; it supports 48 control signals and 8 flag conditions. What is the size of the control word in bits and of the control memory in bytes?

Solution:
(a) Using a horizontal micro-programmed control unit:

   Branch condition   Flag     Control signal   Control memory address
   0 bits             3 bits   48 bits          10 bits

   Size of control word = 0 + 3 + 48 + 10 = 61 bits
   Control memory = (1024 × 61)/8 = 128 × 61 bytes = 7808 bytes

(b) Using a vertical micro-programmed control unit, the control signal field is encoded in ⌈log2 48⌉ = 6 bits:

   Branch condition   Flag     Control signal   Control memory address
   0 bits             3 bits   6 bits           10 bits

   Size of control word = 0 + 3 + 6 + 10 = 19 bits
   Control memory = (1024 × 19)/8 = 128 × 19 bytes = 2432 bytes

2.5.4 RISC versus CISC Processors

The differences between reduced and complex instruction set computers are given in Table 2.4.

2.6 I/O INTERFACE (INTERRUPT AND DMA MODE)

The I/O interface bridges the differences between the CPU and peripheral devices and provides a method for transferring information between internal storage and external I/O devices. There are the following three modes of I/O transfer:

1. Programmed I/O: The I/O device does not have direct access to memory. A transfer requires the execution of several instructions by the CPU, and the CPU has to wait for the I/O device to be ready for either reception or transmission of data.
2. Interrupt initiated I/O: In this mode, instead of waiting, control is transferred from the currently running program to a service program as a result of an externally or internally generated request.
   • Hardware interrupts: These interrupts are present in the hardware pins.
   • Software interrupts: These are instructions used in the program whenever the required functionality is needed.
   • Maskable interrupts: These interrupts may be enabled or disabled explicitly.
   • Non-maskable interrupts: These interrupts are always in the enabled state. We cannot disable them by explicit conditions (flags).
   • Vectored interrupts: These interrupts are associated with a static vector address.
   • Non-vectored interrupts: These interrupts are associated with a dynamic vector address.
   • External interrupts: These interrupts are generated by external devices such as I/O.
   • Internal interrupts: These interrupts are generated by the internal components of the processor, such as a temperature sensor, power failure, error instruction, etc.
   • Synchronous interrupts: These interrupts are controlled by a fixed time interval. All the interval interrupts are called synchronous interrupts.
   • Asynchronous interrupts: These interrupts are initiated based on the feedback of previous instructions. All the external interrupts are called asynchronous interrupts.
3. Direct memory access (DMA): It is one of several methods for coordinating data transfers between an I/O device and the core processing unit or memory in a computer. It refers to the transfer of data directly between a fast storage device and memory, bypassing the CPU because of its limited speed. DMA provides a significant improvement in latency and throughput, as it allows the I/O device to access the memory directly, without using the processor. There are certain advantages of using DMA for data transfer:
   • DMA saves the processor's MIPS, as the core can operate in parallel.
   • DMA saves power because it requires less circuitry than the processor to transfer data.
   • DMA has no modulo block size restrictions.
   The direct memory access (DMA) controller takes over control of the buses to manage the transfer directly between the I/O device and memory. The bus request (BR) and bus grant (BG) signals are used by the DMA controller to request the CPU to relinquish control of the buses and to get control of the system buses (Fig. 2.7). The DMA controller consists of three registers: an address register, a control register and a word count register. To transfer a block of data between an I/O device and memory, the controller stores initial values in the address register. The DMA channel then transfers the block of information from or to memory according to the control register. The starting address of the block in memory is given by the address register, and the number of bytes to transfer is given by the word count register. The controller decrements the word count each time it moves a data byte.

   [Figure 2.7: block diagram of the DMA controller, showing the data bus buffer, address bus buffer and address register on an internal bus, with the DMA select (DS), register select (RS), read (RD), bus request (BR), bus grant (BG) and interrupt lines.]

   There are several modes of operation of DMA:
   • Burst or block transfer mode: In this mode, the entire block of data is transferred once the DMA controller is granted access to the system bus by the CPU. All the bytes of data in the block are transferred before control of the system buses is released back to the CPU. The only disadvantage of this mode is that it renders the CPU inactive for long periods of time.
   • Cycle stealing mode: In this mode, the DMA controller obtains access to the system buses as in burst mode; but after one byte of data is transferred, control of the system bus is released back to the CPU via BG. It is then continually requested again via BR, transferring one byte of data per request, until the entire block of data has been transferred. This mode is suitable for systems in which the CPU cannot be disabled for a considerable length of time, as it would be in burst transfer mode, such as controllers monitoring data in real time. The advantage is that the CPU is not idle for as long as in burst mode, but the data block is not transferred as quickly.
   • Transparent mode: It is the slowest yet the most efficient data transfer mode in terms of overall system performance. The DMA controller transfers data only when the CPU is busy performing operations that do not use the system buses. So, the CPU never stops executing its programs, but the biggest disadvantage is the complex hardware circuitry needed to determine when the CPU is not using the system buses.

   A DMA read transfers data from memory to the I/O device, while a DMA write transfers data from an I/O device to memory. The functional behaviour of a DMA transfer is outlined in Fig. 2.8:
   • The CPU transmits the following information to the DMA controller:
     (a) Beginning address in memory, which is stored in the address register of the DMA controller.
     (b) Number of words to transfer, which is stored in the word count register of the DMA controller.
     (c) Direction (memory-to-I/O device or I/O device-to-memory), port ID and DMA mode of transfer.
   • When the DMA controller accesses memory, it synchronizes this memory request with an idle period of the processor, thus disabling the processor, or requests a halt of the processor and awaits an acknowledgement.
   • After the completion of the block transfer, the DMA controller either raises an interrupt request, if interrupts are enabled, or indicates the completion in its status register. The processor recognizes I/O completion (either by the interrupt signal or by reading the status register), gets its system buses back, and normal processing resumes. The device has to initiate a new data transfer through the DMA request signal, which is again acknowledged by the CPU through the DMA acknowledge signal via the DMA controller.

Figure 2.8 | DMA controller interconnection with memory, CPU and I/O devices.

2.7 INSTRUCTION PIPELINING

[…] execution time of a set of instructions, and there is no need for most of the processor circuits to wait for the other parts of the processor to complete their part of execution. Pipeline speed is limited by the slowest pipeline stage.

Throughput of a processor is the rate at which operations get executed. Latency is the amount of time that a single operation takes to execute. In an unpipelined computer, throughput = 1/latency, as each operation executes by itself; for a pipelined computer, throughput > 1/latency, since the execution of instructions is overlapped.

Consider a k-segment pipeline with a clock cycle time Tp used to execute n tasks (Fig. 2.9). An equivalent non-pipelined system takes Tn time to complete each task. The speed-up of a pipelined system over a non-pipelined system is given by the following relation:

$$S = \frac{n \times T_n}{(k + n - 1) \times T_p}$$
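The speed-up relation for a k-segment pipeline can be evaluated directly. The following Python sketch is a small illustration under the same symbols (k, n, Tp, Tn); the function names are hypothetical, and the second example assumes Tn = k × Tp, which the text does not state.

```python
def pipeline_time(k, n, tp):
    """Total time for n tasks on a k-segment pipeline with clock cycle tp."""
    return (k + n - 1) * tp


def speedup(k, n, tp, tn):
    """Speed-up over a non-pipelined system that needs tn per task."""
    return (n * tn) / pipeline_time(k, n, tp)


# 1000 operations on a 6-stage pipeline with 3 ns per stage.
print(pipeline_time(6, 1000, 3), "ns")

# 4 instructions on a 5-stage pipeline, assuming Tn = k * Tp = 5 cycles.
print(speedup(k=5, n=4, tp=1, tn=5))
```

Under the Tn = k × Tp assumption the speed-up approaches k as n grows, since the k − 1 cycles needed to fill the pipeline become negligible.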
[Figure 2.9 (a), (b): block diagrams of the unpipelined processor and of the pipelined five-stage processor, in which each pipeline stage takes one clock cycle.]

Step →          1   2   3   4   5   6   7   8
Fetch           I1  I2  I3  I4
Decode              I1  I2  I3  I4
Fetch operand           I1  I2  I3  I4
Execute                     I1  I2  I3  I4
Write back                      I1  I2  I3  I4
(c)

Figure 2.9 | (a) Unpipelined processor. (b) Pipelined five-stage processor. (c) Timing diagram of a five-stage instruction pipeline.
1. Structural hazards: These result from resource conflicts, when the hardware cannot support instructions that need simultaneous execution in pipelining.
2. Data hazards: They arise when an instruction depends on the result of a previous instruction and that result is not yet calculated.
   There are three situations in which data hazards can occur:
   • Read after write (RAW), a true dependency
   • Write after read (WAR), an anti dependency
   • Write after write (WAW), an output dependency

   Consider two instructions inst1 and inst2, with inst1 occurring before inst2 in the program order.

   • Read after write (RAW): A read after write (RAW) data hazard is a situation in which an instruction refers to a result which has not yet been calculated, that is, inst2 tries to read a source before inst1 writes to it. This situation arises if the read operation by one instruction takes place before the write done by the other instruction. For example,
        inst1: R3 <- R1 + R2
        inst2: R4 <- R3 + R2
     The first instruction calculates a value by adding the values in registers R1 and R2 and saves the result in register R3, and the second instruction uses this saved value to calculate a result for register R4. However, in a pipeline, when the operands for the second operation are fetched, the result of the first instruction will not have been saved yet, and so there arises a data dependency. It can be said that there is a data dependency with instruction inst2, as it is dependent on the completion of instruction inst1.
   • Write after read (WAR): A write after read (WAR) data hazard refers to a situation in which there is a problem with concurrent execution, that is, inst2 tries to write a destination before it is read by inst1. This situation arises if the write operation by one instruction completes before the read operation by the other instruction takes place. For example,
        inst1: R4 <- R1 + R3
        inst2: R3 <- R1 + R2
     If there is a chance that inst2 may be completed before inst1 (i.e., with concurrent execution), we must ensure that the result is not stored in register R3 before inst1 has had a chance to fetch its operands.
   • Write after write (WAW): A write after write (WAW) data hazard refers to a situation in a concurrent execution environment in which inst2 tries to write an operand before it is written by inst1. This situation arises if the write operations by the instructions occur in the reverse order of the intended sequence. For example,
        inst1: R2 <- R1 + R3
        inst2: R2 <- R4 + R5
     The WB (write back) of inst2 must be delayed until inst1 has finished executing.
3. Control hazards: They arise from the pipelining of branches and other instructions that change the value of the PC.

$$\text{Speed up from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}}$$

$$\text{Speed up from pipelining} = \frac{\text{CPI unpipelined} \times \text{Clock cycle unpipelined}}{\text{CPI pipelined} \times \text{Clock cycle pipelined}}$$

$$\text{Ideal CPI} = \frac{\text{CPI unpipelined}}{\text{Pipeline depth}}$$

$$\text{Speed up from pipelining} = \frac{\text{Ideal CPI} \times \text{Pipeline depth} \times \text{Clock cycle unpipelined}}{\text{CPI pipelined} \times \text{Clock cycle pipelined}}$$

$$\text{Speed up from pipelining} = \frac{\text{Ideal CPI} \times \text{Pipeline depth} \times \text{Clock cycle unpipelined}}{(\text{Ideal CPI} + \text{Pipeline stall}) \times \text{Clock cycle pipelined}}$$

Assuming ideal CPI as 1, speed up is:

$$\text{Speed up from pipelining} = \frac{\text{Pipeline depth} \times \text{Clock cycle unpipelined}}{(1 + \text{Pipeline stall}) \times \text{Clock cycle pipelined}}$$

where CPI is cycles per instruction.
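The RAW, WAR and WAW cases can be recognized mechanically by comparing the destination and source registers of two instructions. The Python sketch below is one minimal way to do so; the tuple representation `(destination, [sources])` and the function name `data_hazards` are assumptions made for this illustration, and the checks only report potential hazards, not whether a given pipeline actually stalls.

```python
def data_hazards(inst1, inst2):
    """Classify potential data hazards between two instructions.

    Each instruction is (destination, [sources]); inst1 precedes inst2
    in program order.
    """
    dest1, srcs1 = inst1
    dest2, srcs2 = inst2
    hazards = []
    if dest1 in srcs2:      # inst2 reads what inst1 writes
        hazards.append("RAW")
    if dest2 in srcs1:      # inst2 writes what inst1 reads
        hazards.append("WAR")
    if dest1 == dest2:      # both write the same register
        hazards.append("WAW")
    return hazards


# The three register examples used in the text.
print(data_hazards(("R3", ["R1", "R2"]), ("R4", ["R3", "R2"])))
print(data_hazards(("R4", ["R1", "R3"]), ("R3", ["R1", "R2"])))
print(data_hazards(("R2", ["R1", "R3"]), ("R2", ["R4", "R5"])))
```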
Problem 2.7: Consider a four-stage pipeline processor. The number of cycles needed by the four instructions I1, I2, I3 and I4 in the stages instruction fetch, decode, operand fetch and execute is shown below. Assume I2 is the branch instruction. Draw the space-time diagram.

      S1  S2  S3  S4
I1    2   1   1   1
I2    1   2   3   1
I3    1   1   1   2
I4    2   1   3   1

Solution:

STEP →          1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17
Fetch           I1  I1  I2  I3  -   -   -   -   -   I3  I4  I4
Decode                  I1  I2  I2  -   -   -   -   -   I3  -   I4
Operand fetch               I1  -   I2  I2  I2  -   -   -   I3  -   I4  I4  I4
Execute                         I1  -   -   -   I2  -   -   -   I3  I3  -   -   I4
Problem 2.8: Assume a simple 5-stage pipeline (IF, ID, E, DF, W) in which each stage takes a single cycle and there are no cache misses. How many cycles would the following code take to execute if there is no special hardware to improve performance in the presence of hazards?

MOV edx,[ecx+100]
MOV ebx,[ecx+104]
ADD edx,ebx
MOV [ecx+108],ebx
MOV eax,[ecx+100]
ADD ebx,eax

Solution: As the timing diagram shows, the code takes 14 cycles.

Cycle →             1     2     3     4     5     6     7     8     9     10    11    12    13    14
MOV edx,[ecx+100]   IF    ID    DF    E     W
MOV ebx,[ecx+104]         IF    ID    DF    E     W
ADD edx,ebx                     IF    ID    DF    stall E     W
MOV [ecx+108],ebx                     IF    ID    stall DF    stall W
MOV eax,[ecx+100]                           IF    ID    stall DF    stall stall E     W
ADD ebx,eax                                       IF    ID    stall DF    stall stall stall E     W
Problem 2.9: For the pipeline shown below, calculate the total execution time after which the result of the fourth task entering the pipe is ready.

IF     ID     EX      MEM     WB
5 ns   5 ns   10 ns   10 ns   5 ns

Solution:

Time (ns) 5    10   15   20   25   30   35   40   45   50   55   60   65
Inst1     IF   ID   EX   EX   MEM  MEM  WB
Inst2               IF   ID   EX   EX   MEM  MEM  WB
Inst3                         IF   ID   EX   EX   MEM  MEM  WB
Inst4                                   IF   ID   EX   EX   MEM  MEM  WB

The result of the fourth task is ready after 65 ns.
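The completion times in Problem 2.9 can be cross-checked with a small Python sketch. It models a linear pipeline with unequal stage delays under one stated assumption that the problem does not spell out: a task enters a stage as soon as it has left the previous stage and the stage itself is free. The function name `completion_times` is hypothetical.

```python
def completion_times(stage_times, n_tasks):
    """Finish time of each task in a linear pipeline with unequal stage delays."""
    stage_free = [0] * len(stage_times)   # time at which each stage becomes free
    finish = []
    for _ in range(n_tasks):
        t = 0                             # time at which the task left the previous stage
        for s, delay in enumerate(stage_times):
            start = max(t, stage_free[s])  # wait for the stage if it is still busy
            t = start + delay
            stage_free[s] = t
        finish.append(t)
    return finish


# Stage delays in ns for IF, ID, EX, MEM and WB.
print(completion_times([5, 5, 10, 10, 5], 4))
```

After the first task, each further task completes one bottleneck-stage delay (10 ns) later, which is why the slowest stage limits the pipeline rate.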
Problem 2.10: What is the mean overhead of a pipeline with 8 stages and an execution time per stage of 2 ns?

Solution: The mean overhead = (Stages − 1) × Execution time per stage = (8 − 1) × 2 = 7 × 2 = 14 ns

Problem 2.11: How many stages has a pipeline that achieves a speed-up of 9.9 for 100 operations?

Problem 2.12: Calculate the time required to perform 1000 operations in a 6-stage pipeline with an execution time of 3 ns per stage.

Solution: Tp = (k − 1 + n) × T = (6 − 1 + 1000) × 3 = 3015 ns = 3.015 µs

Problem 2.13: Calculate the mean overhead of a pipeline with 7 stages and an execution time per stage of 2 ns.
Problem 2.14: Consider a pipeline with 5 stages: IF, ID, EX, M and W. Assume that each stage requires one clock
cycle. Show how the following program segment for adding 2 arrays is processed and compare the clock cycles
needed in non-pipelined system with pipelined system when result of the branch instruction i.e. content of is avail-
able after WB stage.
LOAD R4 #400
L1: LOAD R1, 0 (R4);
LOAD R2, 400 (R4);
ADD R3, R1, R2;
STORE R3, 0 (R4);
SUB R4, R4, #4;
BNEZ R4, L1;
Solution: In the non-pipelined system, number of cycles = [Initial instruction + (Number of instructions in the loop L1) × Number of loop iterations] × Number of clock cycles per instruction (CPI)
= [1 + 6 × (400/4)] × 5 = 3005
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
LOAD R4 #400 IF ID EX M W
LOAD R1, 0 (R4) IF ID EX M W
LOAD R2, 400 (R4) IF ID stall stall EX M W
ADD R3, R1, R2 IF ID stall stall EX stall M W
STORE R3, 0 (R4) IF ID stall DF stall stall E W
SUB R4,R4, #4 IF ID stall Ex M W
BNEZ R4, L1 IF stall ID stall stall EX M W
Problem 2.15: Consider a 5-stage pipeline. For all the following questions we assume that: (a) The pipeline contains the stages IF (Instruction Fetch), IS (Issue), FO (Fetch Operand), E (Execute) and W (Write). (b) Each stage except E requires one clock cycle, and the system has 4 functional units (FUs) for floating point operations: FP load/store, FP addition/subtraction, FP multiplication and FP division. (c) The execution stage requires 1 clock cycle for load/store operations, 1 clock cycle for ADD or SUB operations, 3 clock cycles for a MUL operation and 4 clock cycles for a DIV operation. All memory references hit in the cache. The pipeline has forwarding circuitry for all FUs except FP load/store, where the operand is ready after the W stage.

Cycle →             1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16
LOAD F6, 20(R5)     IF    IS    FO    E     W
LOAD F2, 28(R5)           IF    IS    FO    E     W
MUL F0, F2, F4                  IF    IS    stall stall FO    E     E     E     W
SUB F8, F6, F3                        IF    IS    FO    E     W
DIV F10, F0, F6                             IF    IS    stall stall stall stall FO    E     E     E     E     W
ADD F6, F8, F2                                    IF    IS    FO    E     W
STORE F8, 50(R5)                                        IF    IS    FO    E     W
Identify the hazards in the following instructions from the following list (Structural, Data, Control, RAW, WAR,
WAW, None)
1. MULT F0, F2, F4 and STORE F8, 50(R5)
2. DIV F10, F0, F6 and ADD F6, F8, F2
3. MULT F0, F2, F4 and DIV F10, F0, F6
4. DIV F10, F0, F6 and ADD F6, F8, F2
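The data-hazard part of this classification can be sketched by comparing the registers each instruction reads and writes. The sketch below is an illustration only: it looks at register dependences between a pair of instructions and ignores structural and control hazards, and the tuple encoding of instructions is an assumption made for the example.

```python
def data_hazards(first, second):
    """Classify register data hazards when `second` follows `first`.

    Each instruction is a (destination, sources) pair; destination is None
    for instructions such as STORE that write no register.
    """
    d1, s1 = first
    d2, s2 = second
    found = []
    if d1 is not None and d1 in s2:
        found.append("RAW")   # second reads what first writes
    if d2 is not None and d2 in s1:
        found.append("WAR")   # second overwrites what first reads
    if d1 is not None and d1 == d2:
        found.append("WAW")   # both write the same register
    return found or ["None"]


MUL = ("F0", {"F2", "F4"})      # MUL F0, F2, F4
DIV = ("F10", {"F0", "F6"})     # DIV F10, F0, F6
ADD = ("F6", {"F8", "F2"})      # ADD F6, F8, F2
STORE = (None, {"F8", "R5"})    # STORE F8, 50(R5)

print(data_hazards(MUL, STORE))  # ['None']
print(data_hazards(DIV, ADD))    # ['WAR']
print(data_hazards(MUL, DIV))    # ['RAW']
```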
2.8 MEMORY HIERARCHY
The storage media can be categorized in a hierarchy according to their speed and cost (Fig. 2.10). As we move down the hierarchy, access time increases and cost per bit decreases.
[Figure 2.10: memory hierarchy, from top to bottom: CPU registers, cache memory, main memory, magnetic disks, magnetic tapes. Moving down the hierarchy, cost and speed decrease and size increases; moving up, cost and speed increase and size decreases.]
and ROM (read only memory). Integrated RAM chips are available in two modes:
1. Static RAM: It stores the binary information in flip-flops, and the information remains valid as long as power is supplied. It has faster access time and is used in implementing cache memory.
2. Dynamic RAM: It stores the binary information as a charge on a capacitor. It requires refreshing circuitry to maintain the charge on the capacitors every few milliseconds. It contains more memory cells per unit area as compared to SRAM.
2.8.1.1 Memory Interfacing
If the required memory for the computer is larger than the capacity of one chip, it is necessary to connect multiple RAM and ROM chips to a CPU
Figure 2.11 | (a) RAM chip. (b) ROM chip.
Problem 2.16: A computer employs RAM chips of 256 × 8 and ROM chips of 1024 × 16. The computer system needs
2K bytes of RAM and 4K bytes of ROM and four interface units each with four registers. Draw a memory address
map for the system and give the address range in hexadecimal for RAM and ROM chips.
Component   Address (hex)   Address lines 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
RAM         0000-07FF       0 0 0 0 0 [3 lines to 3×8 decoder] x x x x x x x x
ROM         4000-4FFF       0 1 0 0 [2 lines to 2×4 decoder] x x x x x x x x x x
Interface   8000-800F       1 0 0 0 0 0 0 0 0 0 0 0 x x x x
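One way to read the address map above is as a decoding function: the high-order bits select the component, the decoder lines select a chip, and the remaining bits address a location inside it. The sketch below follows the ranges and decoder widths shown in the map; the function name and return format are illustrative.

```python
def decode(address):
    """Return (component, chip or register index) for a 16-bit address."""
    if 0x0000 <= address <= 0x07FF:
        return "RAM", (address >> 8) & 0x7      # 3 lines feed the 3×8 decoder
    if 0x4000 <= address <= 0x4FFF:
        return "ROM", (address >> 10) & 0x3     # 2 lines feed the 2×4 decoder
    if 0x8000 <= address <= 0x800F:
        return "Interface", address & 0xF       # 4 units × 4 registers
    return "Unmapped", None


print(decode(0x07FF))   # ('RAM', 7)
print(decode(0x4C00))   # ('ROM', 3)
print(decode(0x8005))   # ('Interface', 5)
```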
item requested by the CPU is found in cache, it is called a hit; otherwise it is a miss. Hit ratio is defined as the number of hits divided by the total number of CPU references to memory.
Hit ratio (h) = Number of hits / (Number of hits + Number of misses)
Average access time = Hit ratio × Tc + (1 - Hit ratio)(Tc + Tm)
where Tc is cache access time and Tm is the main memory access time.
2.8.3.1 Elements of Cache Design
The various elements of cache design are as follows:
1. Cache size: It should be optimum, small enough to keep average cost per bit close to that of the main memory and large enough to keep overall average access time close to that of the cache memory.
2. Mapping function: It describes the mapping of main memory blocks to cache blocks. There are three different mapping techniques: fully associative, direct mapped and set associative cache organization.
3. Replacement algorithm: When a new memory block is required in cache, one of the existing blocks must be replaced by the new block. Example: FIFO (first in, first out), LRU (least recently used).
4. Write policy: Cache memory follows write-through and write-back updating policies. In write-through policy, the cache controller copies data immediately to main memory as data is written in cache. The data in main memory is always valid, but this approach reduces system performance. In write-back policy, the update to the memory block is delayed until the updated cache block is replaced by a new block.
2.8.4 Cache Mapping Techniques
The cache memory can store a reasonable number of blocks, but this number is always small as compared to the number of blocks in the main memory, to keep average cost per bit low. The correspondence between memory blocks and cache blocks is specified by the following mapping techniques. Consider a cache memory consisting of 2K words with 128 blocks of 16 words each. The number of bits required to address a word in the cache is 11. Main memory has 64K words, and the number of bits required to address it is 16 (Fig. 2.12).
2.8.4.1 Direct Mapping
In this technique, each block from the main memory has only one possible location in the cache memory. In this example, block i from main memory maps onto block (i mod 128) of the cache. If there are 2^n words in the cache memory and 2^m words in the main memory, then the m-bit main memory address is divided into two fields: n bits for the index field to access the cache and (m − n) bits for the tag field. Each word in cache consists of the data and the associated tag. Whenever a new block is brought into cache, the tag is stored along with the data bits. The index field is further divided into block and word fields if there are multiple words (say 2^k) in a block. The lower k bits, known as the word field, select one of the 2^k words in a block. The block field is used to distinguish a block from other blocks.
Tag (m − n) bits | Index (n bits)
Tag (m − n) bits | Block (n − k) bits | Word (k bits)
When the CPU generates a memory request, the block field points to a particular block location in the cache. The high-order tag field is compared with the tag bits associated with that cache location. If they match, then the desired word is in that block of cache. If there is no match, then the block containing the required word must be loaded into the cache first (Fig. 2.13).
[Figure 2.13 layout: the 16-bit main memory address is split into Tag (5 bits), Block (7 bits) and Word (4 bits). Cache blocks 0 to 127 each store a 5-bit tag with their data; main memory contains blocks 0 to 4095.]
Figure 2.13 | Direct mapped cache organization.
The demerit of direct mapping is that the hit ratio drops considerably if two or more words having the same index and different tags are accessed consecutively one after the other.
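For the running example (64K-word main memory, 128 cache blocks of 16 words), the direct-mapped address split can be sketched as follows. The function name and the sample addresses are illustrative only.

```python
WORD_BITS = 4     # 16 words per block
BLOCK_BITS = 7    # 128 cache blocks
TAG_BITS = 5      # 16 - 7 - 4


def split_direct_mapped(address):
    """Split a 16-bit word address into (tag, block, word) fields."""
    word = address & ((1 << WORD_BITS) - 1)
    block = (address >> WORD_BITS) & ((1 << BLOCK_BITS) - 1)
    tag = address >> (WORD_BITS + BLOCK_BITS)
    return tag, block, word


print(split_direct_mapped(0xABCD))   # (21, 60, 13)
# Same block field, different tags: these two addresses compete for cache block 0.
print(split_direct_mapped(0x0000))   # (0, 0, 0)
print(split_direct_mapped(0x0800))   # (1, 0, 0)
```

The last two addresses show the demerit described above: alternating between them evicts the other block on every access, even though the rest of the cache is free.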
two fields: word and tag. The associative memory stores both the address (tag) and data of the main memory. Figure 2.14 shows the mapping of different blocks into cache. The high-order 12 bits of the CPU address are placed in the argument register of the associative memory and compared with the tag bits of each block of the cache to see if the desired block is present. If the desired block is present, the 4-bit word field is used to extract the necessary word from the cache.
[Figure 2.14 layout: the main memory address is split into Tag (12 bits) and Word (4 bits). Each cache block stores a 12-bit tag with its data; main memory contains blocks 0 to 4095.]
contain the desired block. The high-order tag field is then compared associatively to the tags corresponding to the matched set. If a match occurs, the corresponding word is read from the cache; otherwise main memory is referenced and the block containing that word is brought into the cache for future reference (Fig. 2.15).
[Figure 2.15 layout: the main memory address is split into Tag (6 bits), Set (6 bits) and Word (4 bits). Cache sets 0 to 63 each hold two blocks, each stored with its 6-bit tag; main memory contains blocks 0 to 4095.]
Figure 2.15 | Set-associative mapped cache organization.
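A set-associative lookup with the field widths of Fig. 2.15 (6-bit tag, 6-bit set, 4-bit word, two blocks per set) can be sketched as below. The class name is hypothetical, and LRU replacement is an assumption; the text does not fix a replacement policy for this example.

```python
from collections import OrderedDict

SET_BITS, WORD_BITS, WAYS = 6, 4, 2


class SetAssociativeCache:
    def __init__(self):
        # One ordered dict of tags per set; insertion order tracks recency.
        self.sets = [OrderedDict() for _ in range(1 << SET_BITS)]

    def access(self, address):
        """Return True on a hit; on a miss, load the block (LRU eviction assumed)."""
        set_index = (address >> WORD_BITS) & ((1 << SET_BITS) - 1)
        tag = address >> (WORD_BITS + SET_BITS)
        ways = self.sets[set_index]
        if tag in ways:
            ways.move_to_end(tag)          # mark as most recently used
            return True
        if len(ways) == WAYS:
            ways.popitem(last=False)       # evict the least recently used tag
        ways[tag] = None
        return False


cache = SetAssociativeCache()
trace = [0x0000, 0x0400, 0x0000, 0x0800, 0x0400]    # all map to set 0
results = [cache.access(a) for a in trace]
print(results)   # [False, False, True, False, False]
```

Two blocks with different tags can share set 0 without evicting each other, which a direct-mapped cache cannot do; the third distinct tag forces a replacement.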
Problem 2.19: Consider a 2-way set associative cache consisting of 256 blocks of 8 words each, and assume that the main memory is addressable by a 16-bit address and consists of 4K blocks. Calculate the number of bits in each of the TAG, BLOCK/SET and WORD fields for the different mapping techniques.
Solution: For direct mapping, the word field is of 3 bits to identify 8 different words in a block (2^3 = 8). As the cache memory consists of 256 blocks (2^8 = 256), 8 bits are required to address a block, because there is one-to-one correspondence of block k in main memory to block (k mod 256) in cache memory. The remaining 5 (16 − 8 − 3) high-order address bits are tag bits. Thus, the main memory address for direct mapping is divided as follows:
Tag      Block    Word
5 bits   8 bits   3 bits
For fully associative mapping, the number of word bits is the same, that is, 3 bits. Cache memory stores both tag and data. The high-order tag bits of an address generated by the CPU are compared with the tag bits of each block, so the number of block bits is zero. All remaining bits (except word bits) are identified as tag bits. Thus, the main memory address for fully associative mapping is divided as follows:
Tag       Word
13 bits   3 bits
Problem 2.20: The access time of a cache memory is 200 ns and that of main memory is 2000 ns. It is estimated that 70% of the memory requests are for read and the remaining 30% for write. The hit ratio for read accesses only is 0.9. A write-through procedure is used.
(a) What is the average access time of the system considering only memory read cycles?
(b) What is the average access time of the system for both read and write requests?
(c) What is the hit ratio taking into consideration the write cycles also?
Solution:
(a) Average access time = 0.9 × 200 + 0.1 × 2200 = 180 + 220 = 400 ns
(b) Average access time = 0.3 × 2000 + 0.7 × 400 = 600 + 280 = 880 ns
(c) Hit ratio = 0.7 × 0.9 = 0.63
Problem 2.21: A 4-way set-associative cache memory uses blocks of four words. The cache can accommodate a total of 1024 words from the main memory. The main memory size is 128K × 32.
(a) Formulate all pertinent information required to construct the cache memory.
(b) What is the size of the cache memory?
(b) Since cache is 4-way set associative, 4 blocks per set are stored in cache memory.
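For Problem 2.20, the three results follow directly from the average access time formula, with every write going to main memory under write-through. A minimal sketch to reproduce the arithmetic (function names are illustrative):

```python
def read_access_time(hit_ratio, t_cache, t_main):
    """Average read time: a miss costs the cache probe plus a main-memory access."""
    return hit_ratio * t_cache + (1 - hit_ratio) * (t_cache + t_main)


def overall_access_time(read_fraction, read_time, t_main):
    """Write-through: every write request is served at main-memory speed."""
    return read_fraction * read_time + (1 - read_fraction) * t_main


t_read = read_access_time(0.9, 200, 2000)          # part (a): about 400 ns
t_all = overall_access_time(0.7, t_read, 2000)     # part (b): about 880 ns
overall_hit_ratio = 0.7 * 0.9                      # part (c): about 0.63
print(round(t_read), round(t_all), round(overall_hit_ratio, 2))
```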
Problem 2.22: Suppose physical memory is of 2GB and each word is of 16 bits. There is a cache containing 2K words of data, and each cache block contains 16 words. For each of the direct mapped and 2-way set associative cache configurations, specify how the address would be partitioned.
Solution:
(a) For direct mapping, the word field is of 4 bits to identify 16 different words in a block (2^4 = 16). As the cache memory consists of 2K words, which is equivalent to 2K/16 = 128 blocks (2^7 = 128), 7 bits are required to address a block. The remaining 20 (31 − 7 − 4) high-order address bits are tag bits. Thus, the main memory address for direct mapping is divided as follows:
Tag       Block    Word
20 bits   7 bits   4 bits
(b) Since the cache is 2-way set associative, 2 blocks per set are stored in cache memory. So, the number of sets is 128/2 = 64.
Tag       Set      Word
21 bits   6 bits   4 bits
Problem 2.23: Consider a direct mapped cache of size 32 KB with block size 32 bytes. The CPU generates 32-bit addresses. What are the number of bits needed for addressing a block in cache and the number of tag bits?
Solution:
Tag                 Block     Word
32 − 10 − 5 bits    10 bits   5 bits
The number of bits needed for addressing a block in cache and the number of tag bits are 10 and 17, respectively.
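The field widths in Problems 2.19, 2.22 and 2.23 all come from the same rule: word (offset) bits from the block size, block or set bits from the number of blocks or sets, and the rest of the address as tag. A small sketch, assuming all sizes are powers of two (the function name and parameters are illustrative):

```python
def address_fields(address_bits, cache_blocks, block_size, ways=1):
    """Return (tag, index, offset) widths in bits.

    ways=1 is direct mapped; ways=cache_blocks is fully associative.
    All sizes are assumed to be powers of two.
    """
    offset = block_size.bit_length() - 1
    index = (cache_blocks // ways).bit_length() - 1
    return address_bits - index - offset, index, offset


print(address_fields(16, 256, 8))             # Problem 2.19, direct: (5, 8, 3)
print(address_fields(16, 256, 8, ways=256))   # Problem 2.19, fully associative: (13, 0, 3)
print(address_fields(31, 128, 16))            # Problem 2.22 (a): (20, 7, 4)
print(address_fields(31, 128, 16, ways=2))    # Problem 2.22 (b): (21, 6, 4)
print(address_fields(32, 1024, 32))           # Problem 2.23: (17, 10, 5)
```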
IMPORTANT FORMULAS
SOLVED EXAMPLES
1. The principle of locality is used in
(a) Interrupt (b) Registers
(c) DMA (d) Cache memory
Solution: It is used in cache memory to help the program access small amounts of address space at any instant.
Ans. (d)
2. Which memory unit has lowest access time?
(a) Cache (b) Registers
(c) Optical disk (d) Main memory
Solution: Registers are used for processing and manipulating data and for holding memory addresses that are available to the machine-code programmer. So, they have lowest access time.
Ans. (b)
3. During DMA transfer, the DMA controller takes over the buses to manage the transfer
(a) Directly from CPU to memory
(b) Directly from memory to CPU
(c) Directly between the memory and registers
(d) Directly between the I/O device and memory
Solution: DMA controller manages transfer between I/O device and memory.
Ans. (d)
4. Booth’s algorithm is used for the arithmetic operation of
(a) addition. (b) subtraction.
(c) multiplication. (d) division.
Solution: It is a multiplication algorithm that multiplies two signed binary numbers in 2’s complement notation.
Ans. (c)
5. The reason for improvement in CPU performance during pipelining is
(a) reduced memory access time.
(b) increased clock speed.
(c) introduction of parallelism.
(d) increase in cache memory.
Solution: Instruction-level parallelism is implemented within a single processor to allow faster CPU throughput.
Ans. (c)
6. Use of cache memory enhances
(a) I/O access time
(b) memory access time.
(c) effective memory access time.
(d) secondary storage access time.
Solution: Cache memory compensates the speed mismatch between processor and main memory access time.
Ans. (c)
7. An instruction cycle refers to
(a) fetching an instruction.
(b) executing an instruction.
(c) fetching, decoding and executing an instruction.
(d) reading and executing an instruction.
Solution: It involves fetching, decoding and executing the instruction.
Ans. (c)
8. A hardware interrupt is also called
(a) an internal interrupt.
(b) an external interrupt.
(c) a processor interrupt.
(d) a clock interrupt.
Solution: Hardware interrupt is present in hardware pins.
Ans. (b)
9. Priority is provided by ______ for access to memory by various I/O channels and processors.
(a) a register
(b) a counter
(c) the processor scheduler
(d) a controller
Solution: Controller sets priority for memory access by various I/O devices and processes.
Ans. (d)
10. By applying the principle of temporal locality, processes are likely to reference pages that
(a) have been referenced recently.
(b) are located at address near recently referenced pages in memory.
(c) have been preloaded into memory.
(d) have to be reloaded into memory.
Solution: Temporal locality refers to reuse of resources referenced within a short time frame.
Ans. (a)
11. Which of the following is a correct statement related to L2 cache memory?
(a) The level 1 cache is always faster than the level 2 cache.
(b) The level 2 cache is used to mitigate the dynamic slowdown every time a level 1 cache miss occurs.
(c) Level 2 cache comes as on board only.
(d) In modern day computer, the level 2 cache is considered an internal cache.
Solution: L2 level of cache is placed between the L1 and RAM. The L1 cache is always the fastest.
Ans. (a)
12. What is the control unit’s function in the CPU?
(a) To decode program instructions
(b) To transfer data to primary storage
(c) To perform logical operations
(d) To store arithmetic operations
Solution: Control unit controls several units of CPU and helps decode program instructions.
Ans. (a)
13. CPU fetches the data and instructions from
(a) ROM (b) control unit
(c) RAM (d) coprocessors chip
Solution: Disk is the I/O device attached externally to the processor. Therefore, disk requires a device driver.
Ans. (d)
21. More than one word are put in one cache block to
(a) exploit the temporal locality of reference in a program.
(b) exploit the spatial locality of reference in a program.
(c) reduce the miss penalty.
(d) none of the above.
Solution: There are two types of locality of reference: temporal and spatial locality. Under spatial locality, instead of fetching just one item from the main memory to the cache, it is useful to fetch several items that reside at adjacent addresses as well. So, option (b) is correct.
Ans. (b)
22. Which of the following statements is false?
(a) Virtual memory implements the translation of a program’s address space into physical memory address space.
(b) Virtual memory allows each program to exceed the size of the primary memory.
(c) Virtual memory increases the degree of multiprogramming.
(d) Virtual memory reduces the context-switching overhead.
Solution: Virtual memory increases the context-switching overhead.
Ans. (d)
23. Advantage of synchronous sequential circuits over asynchronous ones is
(a) faster operation.
(b) ease of avoiding problems due to hazards.
(c) lower hardware requirement.
(d) better noise immunity.
Solution: Because of less delay, synchronous sequential circuits have faster operation than asynchronous ones.
Ans. (a)
24. The total size of address space in a virtual memory system is limited by
(a) the length of MAR.
(b) the available secondary storage.
(c) the available main memory.
(d) all of the above.
Solution: Virtual memory depends only on the available size of the secondary memory.
Ans. (b)
25. Comparing the time T1 taken for a single instruction on a pipelined CPU with time T2 taken on a non-pipelined but identical CPU, we can say that
(a) T1 ≤ T2
(b) T1 ≥ T2
(c) T1 < T2
(d) T1 plus T2 is the time taken for one instruction fetch cycle
Solution: In case of one instruction, non-pipelined CPU takes less time as compared to pipelined CPU. This is due to buffer delays for pipelining.
Ans. (b)
1. For a pipelined CPU with a single ALU, consider the following situations:
I. The (j + 1)-st instruction uses the result of the j-th instruction as an operand.
II. The execution of a conditional jump instruction.
III. The j-th and (j + 1)-st instructions require the ALU at the same time.
Which of the above can cause a hazard?
(a) I and II only (b) II and III only
(c) III only (d) All the three
(GATE 2003: 1 Mark)
Solution: All the three statements cause hazards.
Ans. (d)
Common Data Questions 2 and 3: Consider the following assembly language program for a hypothetical processor. A, B and C are 8-bit registers. The meanings of various instructions are shown as comments:
MOV B, #0;    B ← 0
MOV C, #8;    C ← 8
Z: CMP C, #0; Compare C with 0
JZ X;         Jump to X if zero flag is set
Clocks 1 2 3 4 5 6 7 8 9 10 11 12
I1 IF RD EX MA WB
I2 IF - - - RD EX MA WB
I3 IF - - - RD - - EX MA WB
Ans. (c)
18. A device with data transfer rate 10 KB/s is connected to a CPU. Data is transferred byte-wise. Let the interrupt overhead be 4 µs. The byte transfer time between the device interface register and CPU or memory is negligible. What is the minimum performance gain of operating the device under interrupt mode over operating it under program-controlled mode?
(a) 15 (b) 25
(c) 35 (d) 45
(GATE 2005: 2 Marks)
Solution: Data transfer rate = 10 KB/s
Interrupt overhead = 4 × 10^−6 s
10 KB is sent in 1 s
1 B is sent in 1/10K s = 100 × 10^−6 s
Minimum performance gain = (100 × 10^−6)/(4 × 10^−6) = 25
Ans. (b)
Common Data Questions 19 and 20: Consider the following data path of a CPU. The ALU, the bus and all the registers in the data path are of identical size. All operations including incrementation of the PC and the GPRs are to be carried out in the ALU. Two clock cycles are needed for memory read operation - the first one for loading address in the MAR and the next one for loading data from the memory bus into the MDR.
[Figure: CPU data path containing MAR, MDR, S, T, IR, PC, GPRs and the ALU.]
19. The instruction “add R0, R1” has the register transfer interpretation R0 ⇐ R0 + R1. The minimum number of clock cycles needed for execution cycle of this instruction is
(a) 2 (b) 3
(c) 4 (d) 5
(GATE 2005: 2 Marks)
Solution: There will be three cycles: (1) R1out, Sin, (2) R2out, Tin and (3) Sout, Tout, ALUadd, Rin.
Ans. (b)
20. The instruction “call Rn, sub” is a two-word instruction. Assuming that PC is incremented during the fetch cycle of the first word of the instruction, its register transfer interpretation is
Rn ⇐ PC + 1;
PC ⇐ M[PC];
The minimum number of CPU clock cycles needed during the execution cycle of this instruction is
(a) 2 (b) 3 (c) 4 (d) 5
(GATE 2005: 2 Marks)
Solution: The minimum number of CPU clock cycles needed during the execution cycle = 4. This is because
1 cycle is required to transfer the already incremented value of PC
2 cycles are required for getting data in MDR
1 cycle is required to load the value of MDR in PC
Ans. (c)
21. Consider a disk drive with the following specifications:
16 surfaces, 512 tracks/surface, 512 sectors/track, 1 KB/sector, rotation speed 3000 rpm. The disk is operated in cycle stealing mode whereby whenever one 4 byte word is ready it is sent to memory; similarly, for writing, the disk interface reads a 4 byte word from the memory in each DMA cycle. Memory cycle time is 40 nsec. The maximum percentage of time that the CPU gets blocked during DMA operation is:
(a) 10 (b) 25
(c) 40 (d) 50
(GATE 2005: 2 Marks)
Solution:
Data transfer in one rotation = 512 × 1024 bytes
1 rotation takes 60/3000 s
512 KB is transferred in 60/3000 s
1 byte is transferred in 60/(3000 × 512 × 1024) s
4 bytes are transferred in (4 × 60)/(3000 × 512 × 1024) s = 152.58 ns
Block % = 40/152.58 ≈ 26%
Ans. (b)
22. A CPU has 24-bit instructions. A program starts at address 300 (in decimal). Which one of the following is a legal program counter (all values in decimal)?
(a) 400 (b) 500
(c) 600 (d) 700
(GATE 2006: 1 Mark)
Solution: Size of instruction = 24 bits = 3 bytes; start address = 300. A legal address is 300 plus a multiple of three, that is, 600.
Ans. (c)
23. A CPU has a cache with block size 64 bytes. The main memory has k banks, each bank being c bytes wide. Consecutive c byte chunks are mapped on consecutive banks with wrap around. All the k banks can be accessed in parallel, but two accesses to the same bank must be serialized. A cache block access may involve multiple iterations of parallel bank accesses depending on the amount of data obtained by accessing all the k banks in parallel. Each iteration requires decoding the bank numbers to be accessed in parallel and this takes k/2 ns. The latency of one bank access is 80 ns. If c = 2 and k = 24, the latency of retrieving a cache block starting at address zero from the main memory is
(a) 92 ns (b) 104 ns
(c) 172 ns (d) 184 ns
(GATE 2006: 2 Marks)
Solution:
Time for one parallel iteration = k/2 + bank latency = 24/2 + 80 = 92 ns
One iteration fetches k × c = 24 × 2 = 48 bytes, so a 64-byte block needs 2 iterations
Total time = 2 × 92 = 184 ns
Ans. (d)
24. A CPU has a five-stage pipeline and runs at 1 GHz frequency. Instruction fetch happens in the first stage of the pipeline. A conditional branch instruction computes the target address and evaluates the condition in the third stage of the pipeline. The processor stops fetching new instructions following a conditional branch until the branch outcome is known. A program executes 10^9 instructions out of which 20% are conditional branches. If each instruction takes one cycle to complete on average, the total execution time of the program is
(a) 1.0 s (b) 1.2 s
(c) 1.4 s (d) 1.6 s
(GATE 2006: 2 Marks)
Solution:
Total number of cycles = 10^9 + 0.20 × 2 × 10^9 = 1.4 × 10^9
At 1 GHz each cycle takes 1 ns, so the total execution time of the program = 1.4 s
Ans. (c)
25. Consider a new instruction named branch-on-bit-set (mnemonic bbs). The instruction “bbs reg, pos, label” jumps to label if bit in position pos of register operand reg is one. A register is 32 bits wide and the bits are numbered 0 to 31, bit in position 0 being the least significant. Consider the following emulation of this instruction on a processor that does not have bbs implemented.
temp ← reg & mask
Branch to label if temp is non-zero.
The variable temp is a temporary register. For correct emulation, the variable mask must be generated by
(a) mask ← 0x1 << pos
(b) mask ← 0xffffffff pos
(c) mask ← pos
(d) mask ← 0xf
(GATE 2006: 2 Marks)
Solution: Only the bit at position pos matters, so all other bits must be 0 in temp. The mask must therefore have a 1 only in position pos, which is obtained by left-shifting 1 by pos positions.
Ans. (a)
Common Data Questions 26 and 27: Consider two cache organizations. The first one is 32 KB, 2-way set associative with 32-byte block size. The second one is of the same size but direct mapped. The size of an address is 32 bits in both cases. A 2-to-1 multiplexer has a latency of 0.6 ns while a k-bit comparator has a latency of k/10 ns. The hit latency of the set-associative organization is h1 while that of the direct mapped one is h2.
26. The value of h1 is
(a) 2.4 ns (b) 2.3 ns
(c) 1.8 ns (d) 1.7 ns
(GATE 2006: 2 Marks)
Solution:
Address bits = 32
Block size = 32 B, so 5 offset bits
Number of blocks in cache = 32 KB/32 B = 1K, so 10 bits
For the 2-way set-associative cache: index bits = 9, tag bits = 32 − 9 − 5 = 18
Comparator latency = k/10 = 18/10 = 1.8 ns; h1 = 1.8 + 0.6 = 2.4 ns
Ans. (a)
27. The value of h2 is
(a) 2.4 ns (b) 2.3 ns
(c) 1.8 ns (d) 1.7 ns
(GATE 2006: 2 Marks)
Solution:
In the direct mapped cache: tag bits = 17; index = 10 bits; word = 5 bits
Comparator latency = k/10 = 17/10 = 1.7 ns; h2 = 1.7 + 0.6 = 2.3 ns
Ans. (b)
Common Data Questions 28 and 29: A CPU has a 32-KB direct-mapped cache with 128-byte block size. Suppose A is a two-dimensional array of size 512 × 512 with elements that occupy 8 bytes each. Consider the following two C code segments, P1 and P2.
P1: for (i=0; i<512; i++) {
        for (j=0; j<512; j++) {
            x += A[i][j];
        }
    }
P2: for (i=0; i<512; i++) {
        for (j=0; j<512; j++) {
            x += A[j][i];
        }
    }
P1 and P2 are executed independently with the same initial state, namely, the array A is not in the cache and i, j, x are in registers. Let the number of cache misses experienced by P1 be M1 and that for P2 be M2.
28. The value of M1 is
(a) 0 (b) 2048
(c) 16384 (d) 262144
(GATE 2006: 2 Marks)
Solution:
Cache size = 32 KB
Block size = 128 B
Number of blocks = 256
Number of elements in a block = 128/8 = 16
P1 cache misses = M1 = 512 × 512/16 = 16384
Ans. (c)
29. The value of the ratio M1/M2 is
(a) 0 (b) 1/16
(c) 1/8 (d) 16
(GATE 2006: 2 Marks)
Solution:
P2 number of cache misses = M2 = 512 × 512
Ratio of M1:M2 = 1:16
Ans. (b)
30. Consider a 4-way set-associative cache consisting of 128 lines with a line size of 64 words. The CPU generates a 20-bit address of a word in main memory. The number of bits in the TAG, LINE and WORD fields are, respectively:
(a) 9, 6, 5 (b) 7, 7, 6
(c) 7, 5, 8 (d) 9, 5, 6
(GATE 2007: 1 Mark)
Solution: For 64 words, log2(64) = 6 bits are required. The number of sets = 128/4 = 32, so 5 bits are required. The remaining 20 − 5 − 6 = 9 bits are tag bits.
Tag bits   Line   Word
9          5      6
Ans. (d)
31. Consider a pipelined processor with the following four stages:
IF: Instruction fetch
ID: Instruction decode and operand fetch
EX: Execute
WB: Write back
The IF, ID and WB stages take one clock cycle each to complete the operation. The number of clock cycles for the EX stage depends on the instruction. The ADD and SUB instructions need 1 clock cycle and the MUL instruction needs 3 clock cycles in the EX stage. Operand forwarding is used in the pipelined processor. What is the number of clock cycles taken to complete the following sequence of instructions?
ADD R2, R1, R0    R2 ← R1 + R0
MUL R4, R3, R2    R4 ← R3 × R2
SUB R6, R5, R4    R6 ← R5 − R4
(a) 7 (b) 8 (c) 10 (d) 14
(GATE 2007: 1 Mark)
Solution:
Clock  1   2   3   4   5   6   7   8
I1     IF  ID  EX  WB
I2         IF  ID  EX  EX  EX  WB
I3             IF  ID  -   -   EX  WB
Using operand forwarding, 8 clock cycles are required.
Ans. (b)
32. In a simplified computer, the instructions are
OP Rj, Ri - Performs Rj OP Ri and stores the result in register Ri.
OP m, Ri - Performs val OP Ri and stores the result in Ri. val denotes the content of memory location m.
MOV m, Ri - Moves the content of memory location m to register Ri.
MOV Ri, m - Moves the content of register Ri to memory location m.
The computer has only two registers, and OP is either ADD or SUB. Consider the following basic block:
t1 = a + b
t2 = c + d
t3 = e − t2
t4 = t1 − t3
Assume that all operands are initially in memory. The final value of the computation should be in memory. What is the minimum number of MOV instructions in the code generated for this basic block?
(a) 2 (b) 3
(c) 5 (d) 6
(GATE 2007: 1 Mark)
Solution: The instructions generated in the code for this basic block are as follows:
MOV a, Ri
ADD b, Ri
MOV c, Rj
ADD d, Rj
SUB e, Rj
SUB Ri, Rj
MOV Rj, m
Ans. (b)
Common Data Questions 33-35: Consider the following program segment. Here R1, R2 and R3 are general purpose registers.
Instruction           Operation            Instruction size (no. of words)
MOV R1, (3000)        R1 ← M[3000]         2
LOOP: MOV R2, (R3)    R2 ← M[R3]           1
ADD R2, R1            R2 ← R1 + R2         1
MOV (R3), R2          M[R3] ← R2           1
INC R3                R3 ← R3 + 1          1
DEC R1                R1 ← R1 − 1          1
BNZ LOOP              Branch on not zero   2
HALT                  Stop                 1
Assume that the content of memory location 3000 is 10 and the content of the register R3 is 2000. The content of each of the memory locations from 2000 to 2010 is 100. The program is loaded from the memory location 1000. All the numbers are in decimal.
33. Assume that the memory is word addressable. The number of memory references for accessing the data in executing the program completely is
(a) 10 (b) 11
(c) 20 (d) 21
(GATE 2007: 1 Mark)
Solution: Given that R1 = 10, the loop will run 10 times, with two data references per iteration plus one for the initial load:
10 × 2 + 1 = 21
Ans. (d)
34. Assume that the memory is word addressable. After the execution of this program, the content of memory location 2010 is
(a) 100 (b) 101
(c) 102 (d) 110
(GATE 2007: 1 Mark)
Solution: It will remain 100, because the loop will exit as the value in R1 becomes 0 when the address in R3 becomes 2010.
Ans. (a)
35. Assume that the memory is byte addressable and the word size is 32 bits. If an interrupt occurs during the execution of the instruction “INC R3”, what return address will be pushed on to the stack?
(a) 1005 (b) 1020
(c) 1024 (d) 1040
(GATE 2007: 1 Mark)
Solution: Memory is byte addressable, so each word takes 4 bytes. INC R3 starts at 1020, so the return address pushed on to the stack is 1024.
Ans. (c)
36. Consider a disk pack with 16 surfaces, 128 tracks per surface and 256 sectors per track. 512 bytes of data are stored in a bit serial manner in a sector. The capacity of the disk pack and the number of bits required to specify a particular sector in the disk are respectively:
(a) 256 Mbyte, 19 bits (b) 256 Mbyte, 28 bits
(c) 512 Mbyte, 20 bits (d) 64 Gbyte, 28 bits
(GATE 2007: 1 Mark)
Solution:
Disk capacity = 16 surfaces × 128 tracks × 256 sectors × 512 bytes = 256 MB
Total number of sectors = 16 × 128 × 256 = 2^19, so 19 bits are required
Ans. (a)
Linked Answer Questions 37 and 38: Consider a machine with a byte addressable main memory of 2^16 bytes. Assume that a direct mapped data cache consisting of 32 lines of 64 bytes each is used in the system. A 50 × 50 two-dimensional array of bytes is stored in the main memory starting from memory location 1100H. Assume that the data cache is initially empty. The complete array is accessed twice. Assume that the contents of the data cache do not change in between the two accesses.
37. How many data cache misses will occur in total?
(a) 48 (b) 50 (c) 56 (d) 59
(GATE 2007: 2 Marks)
Solution:
Main memory = 2^16 B
Block size = 64 B
Number of cache lines = 32
Number of elements = 50 × 50 = 2500
Starting from location 1100H means starting from block 68
Number of blocks occupied = 2500/64, rounded up = 40 blocks
The cache is initially empty, so the first 32 blocks miss, and the remaining 8 of the 40 blocks also miss and replace 8 earlier blocks in the first access
The array is traversed twice, so data cache misses = 40 + 8 + 8 = 56
Ans. (c)
38. Which of the following lines of the data cache will be replaced by new blocks in accessing the array for the second time?
(a) line 4 to line 11 (b) line 4 to line 12
(c) line 0 to line 7 (d) line 0 to line 8
Solution:
Applying k mod c to find the location: 68 mod 32 = 4, so lines 4 to 11
Ans. (a)
39. Which of the following is/are true for the auto-increment addressing mode?
I. It is useful in creating self-relocating code.
II. If it is included in an Instruction Set Architecture, then an additional ALU is required for effective address calculation.
III. The amount of increment depends on the size of the data item accessed.
(a) I only (b) II only
(c) III only (d) II and III only
(GATE 2008: 2 Marks)
Solution: Only statement (III) is true.
Ans. (c)
40. Which of the following must be true for the RFE (Return from Exception) instruction on a general purpose processor?
I. It must be a trap instruction.
II. It must be a privileged instruction.
III. An exception cannot be allowed to occur during execution of an RFE instruction.
(a) I only (b) II only
(c) I and II only (d) I, II and III only
(GATE 2008: 2 Marks)
Solution: RFE (Return from Exception) is a privileged trap instruction that is executed when an exception occurs, so another exception is not allowed to occur during its execution. Option (d) is the correct option.
Ans. (d)
41. For a magnetic disk with concentric circular tracks, the seek latency is not linearly proportional to the seek distance due to
(a) non-uniform distribution of requests
(b) arm starting and stopping inertia
(c) higher capacity of tracks on the periphery of the platter
(d) use of unfair arm scheduling policies
(GATE 2008: 2 Marks)
Solution: Tracks on magnetic disks are concentric, and a seek moves from one sector to another, which may or may not be in a different track. The seek latency is not proportional to the seek distance since the tracks at the periphery have a larger diameter, and hence higher capacity to store data.
Ans. (b)
42. Which of the following are NOT true in a pipelined processor?
I. Bypassing can handle all RAW hazards.
II. Register renaming can eliminate all register carried WAR hazards.
III. Control hazard penalties can be eliminated by dynamic branch prediction.
(a) I and II only (b) I and III only
(c) II and III only (d) I, II and III
(GATE 2008: 2 Marks)
Solution: All the statements are true.
Ans. (d)
43. For inclusion to hold between two cache levels L1 and L2 in a multi-level cache hierarchy, which of the following are necessary?
I. L1 must be a write-through cache
II. L2 must be a write-through cache
III. The associativity of L2 must be greater than that of L1
IV. The L2 cache must be at least as large as the L1 cache
(a) IV only (b) I and IV only
(c) I, II and IV only (d) I, II, III and IV
(GATE 2008: 2 Marks)
Solution: L1 and L2 caches are placed between the CPU and main memory, and they can both be write-through caches, but not necessarily.
Associativity does not matter.
L2 cache must be at least as large as L1 cache, since all the words in L1 are also in L2.
Ans. (a)
44. The use of multiple register windows with overlap causes a reduction in the number of memory accesses for
I. Function locals and parameters
II. Register saves and restores
III. Instruction fetches
(a) I only (b) II only (c) III only (d) I, II and III
(GATE 2008: 2 Marks)
Solution: Multiple register windows with overlap cause a reduction in the number of memory accesses for register saves and restores.
Ans. (b)

45. Consider a machine with a 2-way set-associative data cache of size 64 KB and block size 16 bytes. The cache is managed using 32 bit virtual addresses and the page size is 4 KB. A program to be run on this machine begins as follows:
    double ARR[1024][1024];
    int i, j;
    /* Initialize array ARR to 0.0 */
    for (i = 0; i < 1024; i++)
        for (j = 0; j < 1024; j++)
            ARR[i][j] = 0.0;
The size of double is 8 bytes. Array ARR is located in memory starting at the beginning of virtual page 0xFF000 and stored in row major order. The cache is initially empty and no pre-fetching is done. The only data memory references made by the program are those to array ARR.
The total size of the tags in the cache directory is
(a) 32 Kbits (b) 34 Kbits
(c) 64 Kbits (d) 68 Kbits
(GATE 2008: 2 Marks)
Solution:
Virtual address = 32 bits
Cache size = 64 KB, 2-way set associative
Number of lines = 64 KB/16 B = 4096, so number of sets = 4096/2 = 2048 = 11 bits
Block size = 16 bytes = 4 bits
Tag bits = 32 − 11 − 4 = 17
    Tag bits   Set bits   Word bits
    17         11         4
Every line stores one tag, so tag size = 17 × 2 × 2048 = 17 × 4096 = 68 Kbits
Ans. (d)

46. Consider the data given in the above question. Which of the following array elements has the same cache index as ARR[0][0]?
(a) ARR[0][4] (b) ARR[4][0]
(c) ARR[0][5] (d) ARR[5][0]
(GATE 2008: 2 Marks)
Solution: The 2048 sets of 16 bytes each cover 32 KB, which is 4096 elements of 8 bytes, that is, 4 rows of 1024 elements. The cache index therefore repeats every 4 rows, and ARR[4][0] has the same index as ARR[0][0].
Ans. (b)

47. The cache hit ratio for this initialization loop is
(a) 0% (b) 25% (c) 50% (d) 75%
(GATE 2008: 2 Marks)
Solution: Each 16-byte block holds two 8-byte elements. The first reference to a block is a miss and the second is a hit, so
Cache hit ratio = 1/2 = 50%
Ans. (c)

Linked Answer Questions 48 and 49: Delayed branching can help in the handling of control hazards.

48. For all delayed conditional branch instructions, irrespective of whether the condition evaluates to true or false:
(a) The instruction following the conditional branch instruction in memory is executed.
(b) The first instruction in the fall through path is executed.
(c) The first instruction in the taken path is executed.
(d) The branch takes longer to execute than any other instruction.
(GATE 2008: 2 Marks)
Solution: The instruction that follows the branch instruction in memory (the one in the delay slot) is always executed, irrespective of whether the branch is taken or not.
Ans. (a)

49. The following code is to run on a pipelined processor with one branch delay slot:
    I1: ADD R2 ← R7 + R8
    I2: SUB R4 ← R5 − R6
    I3: ADD R1 ← R2 + R3
    I4: STORE Memory [R4] ← R1
    BRANCH to Label if R1 == 0
Which of the instructions I1, I2, I3 or I4 can legitimately occupy the delay slot without any other program modification?
(a) I1 (b) I2 (c) I3 (d) I4
(GATE 2008: 2 Marks)
Solution: The instruction in the delay slot executes after the branch has been evaluated. I1 cannot move because I3 needs R2. I2 cannot move because I4 needs R4. I3 cannot move because the branch condition needs R1. I4 only reads R1 and R4 and does not affect the branch condition, so it can occupy the delay slot.
Ans. (d)
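The address-field arithmetic of Questions 45 and 46 can be checked with a short script. This is only a sketch: the helper names (`cache_fields`, `set_index`) are made up for illustration, and it assumes that one tag is stored per cache line and that virtual page 0xFF000 with 4 KB pages starts at byte address 0xFF000000.

```python
def cache_fields(cache_bytes, ways, block_bytes, addr_bits):
    """Return (lines, sets, offset_bits, set_bits, tag_bits) for a set-associative cache."""
    lines = cache_bytes // block_bytes
    sets = lines // ways
    offset_bits = block_bytes.bit_length() - 1   # log2, valid for powers of two
    set_bits = sets.bit_length() - 1
    tag_bits = addr_bits - set_bits - offset_bits
    return lines, sets, offset_bits, set_bits, tag_bits


# Question 45: 64 KB, 2-way, 16-byte blocks, 32-bit addresses.
LINES, SETS, OFFSET_BITS, SET_BITS, TAG_BITS = cache_fields(64 * 1024, 2, 16, 32)
TAG_DIRECTORY_BITS = TAG_BITS * LINES            # assumption: one tag per line

BASE = 0xFF000 << 12                             # assumed start address of page 0xFF000


def set_index(i, j):
    """Cache set index of ARR[i][j] (row-major, 8-byte elements, 16-byte blocks)."""
    addr = BASE + (i * 1024 + j) * 8
    return (addr // 16) % SETS


candidates = {"ARR[0][4]": (0, 4), "ARR[4][0]": (4, 0),
              "ARR[0][5]": (0, 5), "ARR[5][0]": (5, 0)}
same_index = [name for name, (i, j) in candidates.items()
              if set_index(i, j) == set_index(0, 0)]

print(TAG_BITS, TAG_DIRECTORY_BITS // 1024, same_index)
```

Under these assumptions the script reports 17 tag bits, a 68 Kbit tag directory, and ARR[4][0] as the only candidate sharing the index of ARR[0][0].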
50. How many 32K × 1 RAM chips are needed to provide a memory capacity of 256 K bytes?
(a) 8 (b) 32
(c) 64 (d) 128
Solution:
Number of chips required = (256K × 8)/(32K × 1) = 64
Ans. (c)

51. A CPU generally handles an interrupt by executing an interrupt service routine
(a) as soon as an interrupt is raised.
(b) by checking the interrupt register at the end of the fetch cycle.
(c) by checking the interrupt register after finishing the execution of the current instruction.
(d) by checking the interrupt register at fixed time intervals.
Solution: Interrupts are handled by checking the interrupt register after finishing the execution of the current instruction.
Ans. (c)

52. Consider a 4-stage pipeline processor. The number of cycles needed by the four instructions I1, I2, I3, I4 in stages S1, S2, S3, S4 is shown below:
         S1   S2   S3   S4
    I1   2    1    1    1
    I2   1    3    2    2
    I3   2    1    1    3
    I4   1    2    2    2
What is the number of cycles needed to execute the following loop?
    for (i=1 to 2) {I1; I2; I3; I4;}
(a) 16 (b) 23 (c) 28 (d) 30
(GATE 2009: 2 Marks)

53. Consider a 4-way set-associative cache (initially empty) with total 16 cache blocks. The main memory consists of 256 blocks and the request for memory blocks is in the following order:
0, 255, 1, 4, 3, 8, 133, 159, 216, 129, 63, 8, 48, 32, 73, 92, 155
Which one of the following memory blocks will NOT be in cache if LRU replacement policy is used?
(a) 3 (b) 8 (c) 129 (d) 216
(GATE 2009: 2 Marks)
Solution:
There are 16/4 = 4 sets, so mod 4 is applied to the block number to decide the set.
    Set 0   0, 4, 8, 216 → 48, 32, 8, 92
    Set 1   1, 133, 129, 73
    Set 2   (empty)
    Set 3   255, 3, 159, 63 → 155, 3, 159, 63
216 will not be there in cache.
Ans. (d)

Common Data Questions 54 and 55: A hard disk has 63 sectors per track, 10 platters each with 2 recording surfaces and 1000 cylinders. The address of a sector is given as a triple <c, h, s>, where c is the cylinder number, h is the surface number and s is the sector number. Thus, the 0th sector is addressed as <0, 0, 0>, the 1st sector as <0, 0, 1>, and so on.
(GATE 2009: 2 Marks)

54. The address <400, 16, 29> corresponds to sector number:
(a) 505035 (b) 505036
(c) 505037 (d) 505038
Solution:
Total surfaces = 10 × 2 = 20
Address is: 400 × 20 × 63 + 16 × 63 + 29 = 505037
Ans. (c)

55. The address of 1039th sector is
(a) <0, 15, 31> (b) <0, 16, 30>
(c) <0, 16, 31> (d) <0, 17, 31>
Solution:
One cylinder holds 20 × 63 = 1260 sectors, so sector 1039 lies in cylinder 0.
1039 = 16 × 63 + 31, so the address of the 1039th sector is <0, 16, 31>
Ans. (c)
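The LRU bookkeeping of Question 53 is easy to get wrong by hand, so a small simulation helps. The sketch below assumes the usual mapping (set number = block number mod number of sets); the function name `simulate_lru` is illustrative only.

```python
from collections import OrderedDict


def simulate_lru(requests, num_sets, ways):
    """Simulate a set-associative cache with LRU replacement.

    Returns one list per set, ordered from least to most recently used.
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    for block in requests:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)        # hit: becomes most recently used
        else:
            if len(lines) == ways:
                lines.popitem(last=False)   # miss in a full set: evict the LRU block
            lines[block] = True
    return [list(lines) for lines in sets]


REQUESTS = [0, 255, 1, 4, 3, 8, 133, 159, 216, 129, 63, 8, 48, 32, 73, 92, 155]
FINAL_SETS = simulate_lru(REQUESTS, num_sets=4, ways=4)
CACHED = {block for lines in FINAL_SETS for block in lines}

for number, lines in enumerate(FINAL_SETS):
    print("Set", number, lines)
print("216 in cache:", 216 in CACHED)
```

Of the four options, blocks 3, 8 and 129 are still cached at the end, while 216 has been evicted from set 0.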
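Solution (Question 52):
    Clock  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
    I1     S1  S1  S2  S3  S4
    I2     -   -   S1  S2  S2  S2  S3  S3  S4  S4
    I3     -   -   -   S1  S1  -   S2  -   S3  -   S4  S4  S4
    I4     -   -   -   -   -   S1  -   S2  S2  S3  S3  -   -   S4  S4
One iteration takes 15 clock cycles. The second iteration overlaps with the first in the pipeline: its I1 enters S1 at clock 7, directly after I4 of the first iteration, and its I4 leaves S4 at clock 23.
For two iterations = 23 clock cycles
Ans. (b)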
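The cycle count for the loop in Question 52 can be reproduced with a small scheduler. This is a sketch under stated assumptions: instructions issue in order, each stage handles one instruction at a time, and buffers between stages let a stage start its next instruction as soon as it has finished the previous one. The stage times use 2 cycles for I3 in S1, as drawn in the timing chart. The function name `pipeline_cycles` is hypothetical.

```python
def pipeline_cycles(stage_times, iterations):
    """Total cycles to run the instruction list `iterations` times through the pipeline."""
    num_stages = len(stage_times[0])
    stage_free = [0] * num_stages      # time at which each stage becomes free
    finish = 0
    for _ in range(iterations):
        for times in stage_times:
            t = 0                      # time at which this instruction is ready for the next stage
            for stage, duration in enumerate(times):
                start = max(t, stage_free[stage])
                t = start + duration
                stage_free[stage] = t
            finish = t
    return finish


# Rows are I1..I4, columns are S1..S4.
STAGE_TIMES = [(2, 1, 1, 1), (1, 3, 2, 2), (2, 1, 1, 3), (1, 2, 2, 2)]

ONE_ITERATION = pipeline_cycles(STAGE_TIMES, 1)
TWO_ITERATIONS = pipeline_cycles(STAGE_TIMES, 2)
print(ONE_ITERATION, TWO_ITERATIONS)
```

With these assumptions one iteration finishes in 15 cycles and two iterations in 23, because the second iteration starts filling S1 while the first is still draining.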
56. A main memory unit with a capacity of 4 megabytes is built using 1M × 1-bit DRAM chips. Each DRAM chip has 1K rows of cells with 1K cells in each row. The time taken for a single refresh operation is 100 ns. The time required to perform one refresh operation on all the cells in the memory unit is
(a) 100 nanoseconds
(b) 100 × 2^10 nanoseconds
(c) 100 × 2^20 nanoseconds
(d) 3200 × 2^20 nanoseconds
(GATE 2010: 1 Mark)
Solution:
Main memory = 4 MB
Number of DRAM chips = 4 MB/(1M × 1 bit) = 32
Total cells = 32 × 1K × 1K
Time taken to refresh all the cells = 32 × 1K × 1K × 100 ns
Ans. (d)

57. The weight of a sequence a0, a1, …, an−1 of real numbers is defined as a0 + a1/2 + … + an−1/2^(n−1). A subsequence of a sequence is obtained by deleting some elements from the sequence, keeping the order of the remaining elements the same. Let X denote the maximum possible weight of a subsequence of a0, a1, …, an−1 and Y the maximum possible weight of a subsequence of a1, a2, …, an−1. Then X is equal to
(a) max(Y, a0 + Y) (b) max(Y, a0 + Y/2)
(c) max(Y, a0 + 2Y) (d) a0 + Y/2
(GATE 2010: 2 Marks)
Solution: The concept involved is Dynamic Programming in Algorithms.
Given that
X = maximum weight of a subsequence of (a0, a1, a2, …, an−1)
Y = maximum weight of a subsequence of (a1, a2, …, an−1)
Here we have to find X in terms of Y. The weight of the full sequence can be written as
$a_0 + \frac{a_1}{2} + \frac{a_2}{4} + \dots + \frac{a_{n-1}}{2^{n-1}} = a_0 + \frac{1}{2}\left(a_1 + \frac{a_2}{2} + \dots + \frac{a_{n-1}}{2^{n-2}}\right)$
so placing a0 in front of a subsequence halves the weight of everything that follows it.
If a0 is not included, the best subsequence is the best subsequence of (a1, …, an−1), and X = Y.
OR
If a0 is included, the best that can follow it is the best subsequence of (a1, …, an−1) with its weight halved, and X = a0 + Y/2.
Hence, we sum up as
X = MAX(Y, a0 + Y/2)
Ans. (b)

58. A 5-stage pipelined processor has instruction fetch (IF), instruction decode (ID), operand fetch (OF), perform operation (PO) and write operand (WO) stages. The IF, ID, OF and WO stages take 1 clock cycle each for any instruction. The PO stage takes 1 clock cycle for ADD and SUB instructions, 3 clock cycles for MUL instruction and 6 clock cycles for DIV instruction, respectively. Operand forwarding is used in the pipeline. What is the number of clock cycles needed to execute the following sequence of instructions?
    Instruction            Meaning of Instruction
    I0: MUL R2, R0, R1     R2 ← R0 × R1
    I1: DIV R5, R3, R4     R5 ← R3/R4
    I2: ADD R2, R5, R2     R2 ← R5 + R2
    I3: SUB R5, R2, R6     R5 ← R2 − R6
(a) 13 (b) 15 (c) 17 (d) 19
(GATE 2010: 2 Marks)
Solution:
    Clock  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
    I0     IF  ID  OF  PO  PO  PO  WO
    I1     -   IF  ID  OF  -   -   PO  PO  PO  PO  PO  PO  WO
    I2     -   -   IF  ID  OF  -   -   -   -   -   -   -   PO  WO
    I3     -   -   -   IF  ID  OF  -   -   -   -   -   -   -   PO  WO
With operand forwarding, 15 clock cycles are needed.
Ans. (b)
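The timing chart above can be cross-checked with a few lines of code. The sketch assumes in-order issue, a single PO unit, one instruction fetched per cycle, and forwarding that makes a result usable by the next PO stage in the cycle right after it is produced. Under those assumptions only the PO latencies matter. The function name `total_cycles` is illustrative.

```python
def total_cycles(po_latencies):
    """Clock cycle in which the last instruction completes its WO stage."""
    po_end = 0
    for index, latency in enumerate(po_latencies):
        of_done = index + 3                    # IF, ID, OF take one cycle each; I0 is fetched in cycle 1
        po_start = max(of_done, po_end) + 1    # wait for operands to be fetched and for the PO unit
        po_end = po_start + latency - 1
    return po_end + 1                          # one more cycle for WO


# MUL = 3, DIV = 6, ADD = 1, SUB = 1 cycles in the PO stage.
CYCLES = total_cycles([3, 6, 1, 1])
print(CYCLES)
```

For comparison, four instructions with a 1-cycle PO stage would finish in 5 + 4 − 1 = 8 cycles, so the long MUL and DIV operations account for the 7 extra cycles.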
59. The program below uses six temporary variables a, b, c, d, e, f.
    a = 1
    b = 10
    c = 20
    d = a + b
    e = c + d
    f = c + e
    b = c + e
    e = b + f
    d = 5 + e
    return d + f
Assuming that all operations take their operands from registers, what is the minimum number of registers needed to execute this program without spilling?
(a) 2 (b) 3
(c) 4 (d) 6
(GATE 2010: 2 Marks)
Solution: a, b and c are all live before d = a + b, so at least three registers are needed. Let us take three registers R1, R2 and R3:
    Statement     R1        R2        R3
    a, b, c       a = 1     b = 10    c = 20
    d = a + b     d = 11    b = 10    c = 20
    e = c + d     d = 11    e = 31    c = 20
    f = c + e     f = 51    e = 31    c = 20
    b = c + e     f = 51    b = 51    c = 20
    e = b + f     f = 51    b = 51    e = 102
    d = 5 + e     f = 51    d = 107   e = 102
    return d + f = 158
All the operations will be completed using three registers only.
Ans. (b)

Common Data Questions 60 and 61: A computer system has an L1 cache, an L2 cache, and a main memory unit connected as shown below. The block size in L1 cache is 4 words. The block size in L2 cache is 16 words. The memory access times are 2 nanoseconds, 20 nanoseconds and 200 nanoseconds for L1 cache, L2 cache and main memory unit, respectively.
[Figure: L1 cache, L2 cache and main memory connected in series; the data bus between L1 and L2 and the data bus between L2 and main memory are each 4 words wide.]

60. When there is a miss in L1 cache and a hit in L2 cache, a block is transferred from L2 cache to L1 cache. What is the time taken for this transfer?
(a) 2 nanoseconds (b) 20 nanoseconds
(c) 22 nanoseconds (d) 88 nanoseconds
(GATE 2010: 2 Marks)
Solution: Access to L1 cache = 2 ns
Access to L2 cache = 20 ns
Block size of L1 is 4 words and the data bus size is 4 words, so the block moves in a single transfer.
So, time taken for this transfer = 20 + 2 = 22 ns.
Ans. (c)

61. When there is a miss in both L1 cache and L2 cache, first a block is transferred from main memory to L2 cache, and then a block is transferred from L2 cache to L1 cache. What is the total time taken for these transfers?
(a) 222 nanoseconds (b) 888 nanoseconds
(c) 902 nanoseconds (d) 968 nanoseconds
(GATE 2010: 2 Marks)
Solution: Block size of L2 is 16 words and the data bus size is 4 words, so 4 transfers are needed.
Block transfer from main memory to L2 cache = 4 × (200 + 20) = 880 ns
L2 to L1 = 20 + 2 = 22 ns
Total = 880 + 22 = 902 ns
Ans. (c)

62. A computer handles several interrupt sources of which the following are relevant for this question.
• Interrupt from CPU temperature sensor (raises interrupt if CPU temperature is too high)
• Interrupt from Mouse (raises interrupt if the mouse is moved or a button is pressed)
• Interrupt from Keyboard (raises interrupt when a key is pressed or released)
• Interrupt from Hard Disk (raises interrupt when a disk read is completed)
Which one of these will be handled at the HIGHEST priority?
(a) Interrupt from Hard Disk
(b) Interrupt from Mouse
(c) Interrupt from Keyboard
(d) Interrupt from CPU temperature sensor
(GATE 2011: 1 Mark)
Solution: Interrupt from CPU temperature sensor will be handled at the highest priority.
Ans. (d)

63. Consider a hypothetical processor with an instruction of type LW R1, 20 (R2), which during execution reads a 32-bit word from memory and stores it in a 32-bit register R1. The effective address of the memory location is obtained by the addition of a constant 20 and the contents of register R2. Which of the following best reflects the addressing mode implemented by this instruction for the operand in memory?
(a) Immediate addressing
(b) Register addressing
(c) Register indirect scaled addressing
(d) Base indexed addressing
(GATE 2011: 1 Mark)
Solution: Effective address = contents of register R2 + 20.
Ans. (d)
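The register count in Question 59 can be checked with a backward liveness scan. This is a sketch, not a full register allocator: it assumes straight-line code in which a result may reuse the register of an operand that dies at the same statement, so the largest set of simultaneously live variables gives the number of registers needed. The names `PROGRAM` and `max_live` are illustrative.

```python
# Each statement is (variable written, variables read); None marks the final return.
PROGRAM = [
    ("a", []), ("b", []), ("c", []),
    ("d", ["a", "b"]),
    ("e", ["c", "d"]),
    ("f", ["c", "e"]),
    ("b", ["c", "e"]),
    ("e", ["b", "f"]),
    ("d", ["e"]),
    (None, ["d", "f"]),
]


def max_live(program):
    """Largest number of variables that are live at the same time."""
    live = set()
    peak = 0
    for target, reads in reversed(program):
        live.discard(target)     # the written variable is not live before its definition
        live.update(reads)       # every operand is live just before the statement
        peak = max(peak, len(live))
    return peak


MIN_REGISTERS = max_live(PROGRAM)
print(MIN_REGISTERS)
```

The peak is reached just before d = a + b (a, b, c live) and again just before b = c + e (c, e, f live), which is why two registers cannot be enough.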
64. On a non-pipelined sequential processor, a program segment, which is a part of the interrupt service routine, is given to transfer 500 bytes from an I/O device to memory.
    Initialize the address register
    Initialize the count to 500
    LOOP: Load a byte from device

    Tag bits   Block bits   Word bits
    19         8            5
Memory size for tag bits = 19 + 2 = 21
Total size of memory for tags = 21 × 256 = 5376 bits
Ans. (d)

66. Consider an instruction pipeline with four stages (S1, S2, S3 and S4) each with combinational circuit only. The pipeline registers are given in the figure.

Solution: Register renaming is done to handle WAR/WAW hazards.
Ans. (c)
69. The amount of ROM needed to implement a 4-bit multiplier is
(a) 64 bits (b) 128 bits (c) 1 Kbits (d) 2 Kbits
Solution: Amount of ROM required
= 2^(2k) × 2k (where k = number of bits)
= 2^(2×4) × (2 × 4)
= 2^8 × 8
= 2 Kbits
Ans. (d)

70. A computer has a 256 KB, 4-way set-associative, write-back data cache with block size of 32 bytes. The processor sends 32-bit addresses to the cache controller. Each cache tag directory entry contains, in addition to address tag, 2 valid bits, 1 modified bit and 1 replacement bit.
The number of bits in the tag field of an address is
(a) 11 (b) 14 (c) 16 (d) 27
(GATE 2012: 2 Marks)
Solution:
Number of blocks = 256 KB/32 B = 2^13 blocks
Due to 4-way set associative, number of sets = 2^13/2^2 = 2^11
Block size = 32 bytes, so the block offset is 5 bits.
    Address (32 bits):  Tag = 16 bits | Set = 11 bits | Block offset = 5 bits
Tag bits = 32 − 11 − 5 = 16
Ans. (c)

71. The size of the cache tag directory is
(a) 160 Kbits (b) 136 bits
(c) 40 Kbits (d) 32 bits
(GATE 2012: 2 Marks)
Solution: Tag directory entry contains: 16 tag bits + 4 additional bits = 20 bits
Size of cache tag directory = 20 × 2^13 = 160 Kbits
Ans. (a)

72. In a k-way set associative cache, the cache is divided into v sets, each of which consists of k lines. The lines of a set are placed in sequence one after another. The lines in set s are sequenced before the lines in set (s+1). The main memory blocks are numbered 0 onwards. The main memory block numbered j must be mapped to any one of the cache lines from
(a) (j mod v) * k to (j mod v) * k + (k-1)
(b) (j mod v) to (j mod v) + (k-1)
(c) (j mod k) to (j mod k) + (v-1)
(d) (j mod k) * v to (j mod k) * v + (v-1)
(GATE 2013: 1 Mark)
Solution: Number of sets = v
Main memory block number = j, so the block maps to set (j mod v)
Number of lines in each set = k (from 0 to k-1), so set s starts at line s * k
Position will be (j mod v)*k to (j mod v)*k + k-1
Ans. (a)

73. Consider the following sequence of micro-operations.
    MBR ← PC
    MAR ← X
    PC ← Y
    Memory ← MBR
Which one of the following is a possible operation performed by this sequence?
(GATE 2013: 2 Marks)
(a) Instruction fetch
(b) Operand fetch
(c) Conditional branch
(d) Initiation of interrupt service
Solution: The program counter value is saved in memory (at address X) through MBR, and PC gets a new address Y. This indicates initiation of the interrupt service routine.
Ans. (d)

74. Consider a hard disk with 16 recording surfaces (0-15) having 16384 cylinders (0-16383) and each cylinder contains 64 sectors (0-63). Data storage capacity in each sector is 512 bytes. Data are organized cylinder-wise and the addressing format is <cylinder no., surface no., sector no.>. A file of size 42797 KB is stored in the disk and the starting disk location of the file is <1200, 9, 40>. What is the cylinder number of the last sector of the file, if it is stored in a contiguous manner?
(GATE 2013: 2 Marks)
(a) 1281 (b) 1282 (c) 1283 (d) 1284
Solution: Number of sectors required to store the file = (42797 × 1024)/512 = 85594 sectors
Number of sectors in a cylinder = 16 × 64 = 1024
Sectors available in cylinder 1200 from <1200, 9, 40> onwards = 1024 − (9 × 64 + 40) = 408
Remaining sectors = 85594 − 408 = 85186
Number of further cylinders required = ⌈85186/1024⌉ = 84
Last sector will be stored on cylinder 1200 + 84 = 1284.
Ans. (d)

75. Consider an instruction pipeline with five stages without any branch prediction: Fetch Instruction (FI), Decode Instruction (DI), Fetch Operand (FO), Execute Instruction (EI) and Write Operand (WO). The stage delays for FI, DI, FO, EI and WO are 5 ns, 7 ns, 10 ns, 8 ns and 6 ns, respectively. There are intermediate storage buffers after each stage and the delay of each buffer is 1 ns. A program consisting of 12 instructions I1, I2, I3, … I12 is executed in this pipelined processor. Instruction I4 is the only branch instruction and its branch target is I9. If the branch is taken during the execution of this program, the time (in ns) needed to complete the program is
(a) 132 (b) 165 (c) 176 (d) 328
(GATE 2013: 2 Marks)
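The disk-address arithmetic of Question 74 (and of Questions 54 and 55, which use a different geometry) reduces to converting between a <cylinder, surface, sector> triple and a linear sector number. The sketch below assumes cylinder-wise numbering with sectors counted from 0; the helper names `chs_to_linear` and `linear_to_chs` are made up for illustration.

```python
def chs_to_linear(cylinder, surface, sector, surfaces, sectors_per_track):
    """Linear sector number of <cylinder, surface, sector>."""
    return (cylinder * surfaces + surface) * sectors_per_track + sector


def linear_to_chs(number, surfaces, sectors_per_track):
    """Inverse of chs_to_linear."""
    cylinder, rest = divmod(number, surfaces * sectors_per_track)
    surface, sector = divmod(rest, sectors_per_track)
    return cylinder, surface, sector


# Question 74: 16 surfaces, 64 sectors per track, 512-byte sectors.
FILE_SECTORS = -(-42797 * 1024 // 512)             # ceiling division
FIRST = chs_to_linear(1200, 9, 40, 16, 64)
LAST_CHS = linear_to_chs(FIRST + FILE_SECTORS - 1, 16, 64)

# Questions 54 and 55: 20 surfaces, 63 sectors per track.
Q54 = chs_to_linear(400, 16, 29, 20, 63)
Q55 = linear_to_chs(1039, 20, 63)

print(FILE_SECTORS, LAST_CHS, Q54, Q55)
```

The last sector of the file falls in cylinder 1284, and the same two helpers give 505037 and <0, 16, 31> for the earlier disk.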
PRACTICE EXERCISES
(c) In the CISC instruction set, all arithmetic/logic instructions must be register based for fast processing.
(d) CISC architectures may perform better in network centric applications than RISC.

4. The register that holds the address of the location to or from which data are to be transferred is called
(a) Index register
(b) Accumulator
(c) Memory address register
(d) Memory data register

5. Which one of the following is not a type of I/O channel?
(a) Multiplexer (b) Selector
(c) Block multiplexer (d) None of the above

6. The performance of a pipelined processor is degraded if
(a) the pipeline stages have different delays
(b) consecutive instructions are to be executed serially
(c) the pipeline stages share hardware resources
(d) all of the above

7. The minimum time delay between the initiation of two independent memory operations is called
(a) Access time (b) Cycle time
(c) Rotational time (d) Latency time

8. The register which keeps track of the execution of a program and which contains the memory address of the instruction currently being executed is known as
(a) index register
(b) memory address register
(c) program counter
(d) instruction register

9. For interval arithmetic, the best rounding technique used is
(a) rounding to plus and minus infinity
(b) rounding to zero
(c) rounding to nearest zero
(d) rounding to the next number

10. Hardwired control units are faster than micro-programmed control units because
(a) they do not consist of slower memory elements.
(b) they do not have slower elements such as gates, flip flops and registers.
(c) they consist of elements based on VVLSI design technology.
(d) they contain high-speed digital components.

11. The most relevant addressing mode to write position-independent code is
(a) direct mode (b) auto mode
(c) relative mode (d) indexed mode

12. A CPU uses 24-bit instruction. A program starts at address 300 (in decimal). Which one of the following is a legal program counter content (all values in decimal)?
(a) 324 (b) 512
(c) 600 (d) 700

13. An attempt to access a location not owned by a program is called
(a) data fault (b) address fault
(c) instruction fault (d) page fault

14. Which of the following statements about relative addressing mode is FALSE?
(a) It enables reduced program code
(b) It allows indexing of array element with same instruction
(c) It enables easy relocation of data
(d) It enables faster address calculation than absolute addressing

15. Compared to CISC processors, RISC processors contain
(a) more registers and smaller instruction set
(b) larger instruction set and fewer registers
(c) fewer registers and smaller instruction set
(d) more registers and larger instruction set

16. Micro-programmed control cannot be implemented in RISC architecture because
(a) it tends to slow down the processor.
(b) it consumes more chip area and a large instruction set.
(c) handling a large number of registers is impossible in micro-programmed system.
(d) the 1 instruction/cycle timing requirement for RISC is difficult to achieve in micro-programmed based architecture.

17. Relocation of the code is easier in ________ irrespective of the program code
(a) indirect addressing
(b) indexed addressing
(c) base register addressing
(d) absolute addressing

18. In inverted page table organization, the size of the page table depends on
(a) the number of processes
(b) the size of page
(c) the size of main memory
(d) the number of frames in the main memory
19. When using the concept of locality of reference, the page reference being made by a process
(a) will always be to the page used in the previous page reference
(b) is likely to be one of the pages used in the past few page references
(c) will always be to one of the pages existing in the main memory
(d) will always lead to page fault

20. If the new version of processor is not made compatible to programs written for its older version, it could be able to process at a faster speed
(a) the statement is true.
(b) the statement is false.
(c) the speed cannot be predicted.
(d) speed has nothing to do with the compatibility.

21. A certain snooping cache can snoop only on address line. Which of the following is true?
(a) This would adversely affect the system if the write-through protocol is used.
(b) It would run well if the write-through protocol is used.
(c) Data snooping is mandatory to be implemented on data line.
(d) Data snooping may not be required.

22. When the frequency of the input signal to a CMOS gate is increased, the average power dissipation
(a) decreases exponentially
(b) increases
(c) decreases
(d) increases exponentially

23. The disadvantage of hardwired control units with flip flop is
(a) design becomes complex
(b) it requires more number of flip flops
(c) control circuit speed does not match with flip flops
(d) flip-flops can handle the data unit not the control unit

24. In a vectored interrupt,
(a) the branch address is assigned to a fixed location in memory.
(b) the interrupting source supplies the branch information to the processor through an interrupt vector.
(c) the branch address is obtained from a register in the processor.
(d) the branch address is obtained from program counter.

25. A device employing INTA line for device interrupt puts the CALL instruction on the data bus while
(a) INTA is active. (b) HOLD is active.
(c) READY is active. (d) READY is active.

26. On receiving an interrupt from an I/O device, the CPU
(a) branches off to halt (or wait) for a predetermined time
(b) branches off to the interrupt service after completion of the current instruction
(c) branches off to the interrupt service routine immediately
(d) hands over control of address bus and data bus to the interrupting device

27. Using large block size in a fixed block size file system leads to
(a) better disk throughput but poorer disk space utilization
(b) better disk throughput and better disk space utilization
(c) it does not matter as the total memory size is same
(d) poorer disk throughput but better disk space utilization

28. Which of the following statements are true about paging?
(a) It divides memory into units of equal size.
(b) It permits implementation of virtual memory.
(c) It suffers from internal fragmentation.
(d) It suffers from external fragmentation.

29. The number of entries in an inverted page table
(a) is equal to the number of processes.
(b) is equal to the number of page frames in the main memory.
(c) is equal to the size of the page frame.
(d) is equal to the number of page frames in cache memory.

30. In a virtual memory system, the addresses used by the programmer belong to
(a) memory space (b) physical space
(c) address space (d) main memory space

31. Power consumption of processors can be vastly reduced by making use of ________ based transistors to implement the ICs.
(a) NMOS only
(b) TTL Schottky and PMOS
(c) PMOS only
(d) NMOS and PMOS
32. Address symbol table is generated by the
(a) memory management software
(b) assembler
(c) table match of associative memory
(d) generated by CPU

33. How many 128 × 8 RAM chips are needed to have a total RAM of 2048 bytes?
(a) 8 (b) 16 (c) 24 (d) 32

34. In 8085 microprocessor, how many I/O devices can be interfaced in I/O mapped I/O technique?
(a) Either 256 input devices or 256 output devices
(b) 8 I/O devices
(c) 256 input devices and 256 output devices
(d) 512 input-output devices

35. After reset, the CPU starts the execution of instruction from memory address
(a) 1111H (b) 8000H
(c) 0000H (d) FFFFH

36. In a microprocessor system, suppose TRAP, HOLD and RESET pins got activated at the same time, while the processor was executing some instructions, the system will
(a) execute the TRAP instruction
(b) execute the HOLD instruction
(c) execute the RESET instruction
(d) none of these instructions will be executed

37. In 8085 microprocessor, the programmer cannot access which flag directly?
(a) Sign flag (b) Carry flag
(c) Auxiliary carry flag (d) Parity flag

38. Which of the following is a pseudo-instruction for 8085?
(a) SPHL (b) CMP
(c) NOP (d) END

39. The term "cycle stealing" refers to:
(a) Interrupt-based data transfer
(b) DMA-based data transfer
(c) Polling mode data transfer
(d) Clock cycle overriding

40. Which of the following architectures is not suitable for SIMD?
(a) Vector processor (b) PLA-based processor
(c) Von Neumann (d) PAL-based processor

41. In a multiprogramming system, which of the following concepts is used?
(a) Data parallelism (b) Paging
(c) L1 cache (d) DMA

42. PAL circuit can be defined as
(a) fixed OR and programmable AND logic.
(b) programmable OR and programmable AND logic.
(c) fixed AND and programmable OR logic.
(d) fixed OR and fixed AND logic.

43. If the clock input applied to a cascaded Mod-6 and Mod-4 counter is 48 kHz, then the output of the cascaded arrangement shall be of:
(a) 4.8 kHz (b) 12 kHz
(c) 8 kHz (d) 48 kHz

44. If there are four ROM ICs of 8K and two RAM ICs of 4K words, then the address range of 1st RAM is (assume initial addresses correspond to ROMs)
(a) (8000)H to (9FFF)H (b) (5000)H to (7FFF)H
(c) (8000)H to (8FFF)H (d) (5000)H to (9FFF)H

45. The method for updating the main memory as soon as a word is removed from the cache is called
(a) Write-through (b) Write-back
(c) Write-save (d) Cache-save

Set 2

1. The most appropriate matching for the following pairs
    X. Indirect addressing          1. Loops
    Y. Immediate addressing         2. Pointers
    Z. Auto-decrement addressing    3. Constants
(a) X − 3, Y − 2, Z − 1 (b) X − 1, Y − 3, Z − 2
(c) X − 2, Y − 3, Z − 1 (d) X − 3, Y − 1, Z − 2

2. Which of the following is not a form of memory?
(a) Instruction cache
(b) Instruction register
(c) Instruction opcode
(d) Translation look aside buffer

3. In serial data transmission, every byte of data is padded with a '0' in the beginning and one or two '1's at the end of byte because
(a) Receiver is to be synchronized for byte reception.
(b) Receiver recovers lost '0' and '1' from these padded bits.
(c) Padded bits are useful in parity computation.
(d) None of these.
(c) the register containing the address of the operand is specified inside the instruction.
(d) the location of the operand is implicit.

14. What are the states of the auxiliary carry (AC) and carry flag (CY) after executing the following 8085 program?
    MVI H, 5DH
    MVI L, 6BH
    MOV A, H
    ADD L
(a) AC = 0, CY = 0 (b) AC = 1, CY = 1
(c) AC = 1, CY = 0 (d) AC = 0, CY = 1

17. A microprogram control unit is required to generate a total of 25 control signals. Assume that during any microinstruction at most two control signals are active. Minimum number of bits required in the control word to generate the required control signal is ________.

18. The correct matching for the following pairs is
    A. DMA I/O                    1. High-speed RAM
    B. Cache                      2. Disk
    C. Interrupt I/O              3. Printer
    D. Condition code register    4. ALU
(a) A - 4, B - 3, C - 1, D - 2
(b) A - 2, B - 1, C - 3, D - 4
(c) A - 4, B - 3, C - 2, D - 1
(d) A - 2, B - 3, C - 4, D - 1

19. I/O redirection
(a) implies changing the name of a file
(b) can be employed to use an existing file as input file for a program
(c) implies connecting two programs through a pipe
(d) none of the above

20. The main difference(s) between a CISC and a RISC processor is/are that a RISC processor typically

address instructions are required to evaluate it?
(a) 4 (b) 6 (c) 8 (d) 10

24. A decimal number has 64 digits. The number of bits needed for its equivalent binary representation is
(a) 200 (b) 213 (c) 246 (d) 277

25. Determine the speed up obtained from pipelining if latencies for each stage in single cycle processor is given as:
    IF       ID       ALU      MEM      WB
    45 ns    20 ns    52 ns    44 ns    18 ns
Set 1
11. (c) 18. (d) 25. (a) 32. (b) 39. (b)
12. (c) 19. (b) 26. (b) 33. (b) 40. (c)
13. (b) 20. (a) 27. (a) 34. (c) 41. (b)
14. (d) 21. (a) 28. (c) 35. (c) 42. (a)
15. (a) 22. (b) 29. (b) 36. (d) 43. (a)
16. (a) 23. (b) 30. (c) 37. (c) 44. (c)
17. (c) 24. (a) 31. (d) 38. (d) 45. (b)
Both operands are positive and the result is negative.
Both operands are negative and the result is positive.
So option (b) is true, overflow does not occur when positive and negative numbers are added.

5. (d) Each of the listed conditions degrades pipeline performance:
Different delays in the pipeline stages force every stage to run at the pace of the slowest stage.
Data dependency arises when consecutive instructions are dependent on each other.
Structural dependency arises when pipeline stages share hardware resources.

6. (c) (A) IEEE 488 - (Q)
(B) IEEE 796 - (S)
(C) IEEE 696 - (R)
(D) RS232-C - (P)

7. (c) When an interrupt is caused, the execution of the current instruction is stopped. After handling the interrupt, the program resumes its execution.

8. (a) RAID is a redundant array of independent disks that combines multiple disk drive components into a logical unit. RAID configuration provides fault-tolerance and high speed.

9. (b) Horizontal microprogramming has higher parallelism than vertical. So, the speed order is:

12. (a) Push 'r' memory operation needs 2 clocks.

13. (b) In absolute addressing mode, the address of the operand is inside the instruction.

14. (c)
    Carry out of each bit     0 1 1 1 1 1 1 1
    (5D)H                     0 1 0 1 1 1 0 1
    + (6B)H                 + 0 1 1 0 1 0 1 1
    Sum = (C8)H               1 1 0 0 1 0 0 0
The carry out of bit 3 is 1 and the carry out of bit 7 is 0, so
AC = 1 and CY = 0

15. (d) Features of horizontal microprogramming are
(i) It does not require use of signal decoders
(ii) It results in larger-sized microinstructions than vertical microprogramming
(iii) It uses 1 bit for each control signal

16. (a) The daisy chaining method of establishing priority consists of a serial connection of all devices that request an interrupt. The device with the highest priority is placed in the first position, followed by lower-priority devices up to the device with the lowest priority, which is placed last in the chain. The farther the device is from the first position, the lower is its priority. Therefore, daisy chain gives non-uniform priority to various devices.
17. (10) To identify one of 25 control signals, 5 bits are required. To generate two control signals, the following scheme will be used.
5 bits to identify the first and 5 bits to identify the second, including the case when one of them is not present.
So total bits required = 10.

18. (b) DMA I/O - Disk
Cache - High-speed RAM
Interrupt I/O - Printer
Condition code register - ALU

19. (b) I/O redirection lets a program take its input from an existing file (or send its output to a file) instead of the standard device. Connecting two programs is done through a pipe, which is a separate mechanism.

20. (d) The major characteristics of a RISC processor are
(i) Relatively few instructions
(ii) Relatively few addressing modes
(iii) More registers
(iv) Hardwired rather than microprogrammed control

21. (a) Number of bytes per line = 16 bits × 4 = 64 bits = 8 bytes
Cache size = 8 × 4 × 1024 = 32 KB

22. (b) Three-address instructions are as follows:
    MUL R1, N, O
    MUL R2, P, Q
    ADD R3, M, R1
    DIV X, R3, R2

23. (c)
    LOAD P     (AC ← P)
    MPY Q      (AC ← AC × Q)
    STORE X    (X ← AC)
    LOAD N     (AC ← N)
    MPY O      (AC ← AC × O)
    ADD M      (AC ← AC + M)
    DIV X      (AC ← AC/X)
    STORE X    (X ← AC)

24. (b) The largest 64-digit decimal number is 10^64 − 1. If x bits are needed, then
10^64 − 1 = 2^x − 1
⇒ 10^64 = 2^x ⇒ x = log2 10^64 = 64 × log2 10 ≈ 212.6, so 213 bits are needed.

25. (3.44) Speed up obtained by pipelining
= (45 + 20 + 52 + 44 + 18)/52
= 179/52
= 3.44
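The last two answers are plain arithmetic and can be confirmed directly. The sketch assumes, as the solutions do, that the pipelined clock period equals the slowest stage latency and that pipeline register overhead is ignored; the variable names are illustrative.

```python
# Answer 24: bits needed for the largest 64-digit decimal number.
LARGEST_64_DIGIT = 10 ** 64 - 1
BITS_NEEDED = LARGEST_64_DIGIT.bit_length()

# Answer 25: single-cycle time is the sum of stage latencies,
# pipelined cycle time is the slowest stage (assumption: no register overhead).
STAGE_LATENCIES_NS = {"IF": 45, "ID": 20, "ALU": 52, "MEM": 44, "WB": 18}
SINGLE_CYCLE_NS = sum(STAGE_LATENCIES_NS.values())
PIPELINED_CYCLE_NS = max(STAGE_LATENCIES_NS.values())
SPEEDUP = SINGLE_CYCLE_NS / PIPELINED_CYCLE_NS

print(BITS_NEEDED, SINGLE_CYCLE_NS, PIPELINED_CYCLE_NS, round(SPEEDUP, 2))
```

Using `int.bit_length()` avoids floating-point rounding in the logarithm, and the result agrees with the 213 bits obtained from 64 × log2 10.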