Problem Set 4 Sol

Uploaded by

bsudheertec
Copyright
© © All Rights Reserved

Problem-Set #4

COE 608: Computer Organization and Architecture


Multicycle MIPS-Lite CPU and Pipelining

(a) MIPS-Lite CPU Multi-Cycle Control and Pipelining

Chapter 4:
Exercises: 4.12, 4.13, 4.14, 4.16, 4.17, 4.18, 4.19,
4.20, 4.21 and 4.22.1 & 4.22.2.

Additional Questions
Q.1. How could we modify the following code to make use of a delayed branch slot?

Loop: lw $2, 100($3)
      addi $3, $3, 4
      beq $3, $4, Loop

Q.2. Identify all the data dependencies in the following code.
Which dependencies are data hazards that can be resolved by forwarding?

add $2, $5, $4
add $4, $2, $5
sw $5, 100($2)
add $3, $2, $4

© G. Khan Problem-Set-4, COE608: Multi-cycle CPU and Pipelining Page: 1


Solutions:
4.12
Done in the class

4.13
4.13.1
     Instruction sequence     Dependences
a.   I1: lw $1,40($6)         RAW on $1 from I1 to I3
     I2: add $6,$2,$2         RAW on $6 from I2 to I3
     I3: sw $6,50($1)         WAR on $6 from I1 to I2

b.   I1: lw $5,-16($5)        RAW on $5 from I1 to I2 and I3
     I2: sw $5,-16($5)        WAR on $5 from I1 and I2 to I3
     I3: add $5,$5,$5         WAW on $5 from I1 to I3
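
The pairwise classification used above can be sketched in a few lines of Python. This is only an illustration: the tuple encoding of each instruction (label, register written, registers read) and the function name are assumptions, and the sketch lists every pair without checking whether an intervening instruction overwrites the register.

```python
# Each instruction is modelled as (label, register written or None, registers read).
# This encoding is a hypothetical one chosen for illustration.
SEQ_B = [
    ("I1", "$5", ["$5"]),        # lw  $5,-16($5)
    ("I2", None, ["$5", "$5"]),  # sw  $5,-16($5)
    ("I3", "$5", ["$5", "$5"]),  # add $5,$5,$5
]

def find_dependences(seq):
    """Return a sorted list of (kind, register, earlier, later) tuples."""
    found = set()
    for i, (early, w1, r1) in enumerate(seq):
        for later, w2, r2 in seq[i + 1:]:
            if w1 is not None and w1 in r2:
                found.add(("RAW", w1, early, later))  # later reads what earlier wrote
            if w2 is not None and w2 in r1:
                found.add(("WAR", w2, early, later))  # later writes what earlier read
            if w1 is not None and w1 == w2:
                found.add(("WAW", w1, early, later))  # both write the same register
    return sorted(found)

for dep in find_dependences(SEQ_B):
    print(dep)
```

For sequence b this lists the same five dependences as the table: two RAW, two WAR, and one WAW, all on $5.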
4.13.2
In the basic five-stage pipeline, WAR and WAW dependences do not cause any hazards. Without
forwarding, any RAW dependence between an instruction and the next two instructions causes a hazard
(assuming the register read happens in the second half of the clock cycle and the register write happens
in the first half). The code that eliminates these hazards by inserting nop instructions is:
Instruction sequence
a. lw $1,40($6) Delay I3 to avoid RAW hazard on $1 from I1
add $6,$2,$2
nop
sw $6,50($1)
b. lw $5,-16($5) Delay I2 to avoid RAW hazard on $5 from I1
nop
nop
sw $5,-16($5)
add $5,$5,$5 Note: no RAW hazard on $5 from I1 now

4.13.3
With full forwarding, an ALU instruction can forward a value to the EX stage of the next instruction
without a hazard. However, a load cannot forward to the EX stage of the next instruction (but can to the
instruction after that). The code that eliminates these hazards by inserting nop instructions is:

Instruction sequence
a. lw $1,40($6) No RAW hazard on $1 from I1 (forwarded)
add $6,$2,$2
sw $6,50($1)
b. lw $5,-16($5)
nop Delay I2 to avoid RAW hazard on $5 from I1
sw $5,-16($5) Value for $5 is forwarded from I1 now
add $5,$5,$5 Note: no RAW hazard on $5 from I1 now
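
The load-use rule described in 4.13.3 can be expressed as a small counting function. This is a minimal sketch under stated assumptions: each instruction is encoded as a hypothetical (is_load, destination, sources) tuple, and any use of the loaded register by the very next instruction is treated as needing one nop.

```python
def nops_with_forwarding(seq):
    """Count nops needed with full forwarding.

    seq is a list of (is_load, dest, sources) tuples (an illustrative encoding).
    Only a load followed immediately by a dependent instruction needs a nop.
    """
    nops = 0
    for prev, cur in zip(seq, seq[1:]):
        is_load, dest, _ = prev
        if is_load and dest is not None and dest in cur[2]:
            nops += 1
    return nops

seq_a = [
    (True, "$1", ["$6"]),          # lw  $1,40($6)
    (False, "$6", ["$2", "$2"]),   # add $6,$2,$2
    (False, None, ["$6", "$1"]),   # sw  $6,50($1)
]
seq_b = [
    (True, "$5", ["$5"]),          # lw  $5,-16($5)
    (False, None, ["$5", "$5"]),   # sw  $5,-16($5)
    (False, "$5", ["$5", "$5"]),   # add $5,$5,$5
]
print(nops_with_forwarding(seq_a), nops_with_forwarding(seq_b))
```

Sequence a needs no nop and sequence b needs one, which agrees with the listings above.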

4.13.4
The total execution time is the clock cycle time times the number of cycles. Without any stalls, a three-
instruction sequence executes in 7 cycles (5 to complete the first instruction, then one per instruction).
The execution without forwarding must add a stall cycle for every nop we had in 4.13.2, and execution
with forwarding must add a stall cycle for every nop we had in 4.13.3. Overall, we get:

     No forwarding               With forwarding             Speed-up due to forwarding
a.   (7 + 1) × 300ps = 2400ps    7 × 400ps = 2800ps          0.86 (This is really a slowdown)
b.   (7 + 2) × 200ps = 1800ps    (7 + 1) × 250ps = 2000ps    0.90 (This is really a slowdown)
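
The arithmetic in this table can be checked with a short Python sketch. The function name and parameters are illustrative assumptions; the cycle counts and clock periods are the ones given in the table.

```python
def execution_time_ps(instructions, stalls, clock_ps, stages=5):
    """Cycles = (stages - 1) to fill the pipeline + one per instruction + stalls."""
    cycles = (stages - 1) + instructions + stalls
    return cycles * clock_ps

# Case a: one nop without forwarding (300ps clock), none with forwarding (400ps clock).
no_fwd_a = execution_time_ps(3, 1, 300)
fwd_a = execution_time_ps(3, 0, 400)
# Case b: two nops without forwarding (200ps clock), one with forwarding (250ps clock).
no_fwd_b = execution_time_ps(3, 2, 200)
fwd_b = execution_time_ps(3, 1, 250)

print(no_fwd_a, fwd_a, round(no_fwd_a / fwd_a, 2))
print(no_fwd_b, fwd_b, round(no_fwd_b / fwd_b, 2))
```

A speed-up below 1 means the longer clock cycle required by forwarding outweighs the stall cycles it removes for these three-instruction sequences.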

4.13.5
With ALU-ALU-only forwarding, an ALU instruction can forward to the next instruction, but not to
the second-next instruction (because that would be forwarding from MEM to EX). A load cannot
forward at all, because it determines the data value in MEM stage, when it is too late for ALU-ALU
forwarding. We have:

Instruction sequence
a. lw $1,40($6) Can’t use ALU-ALU forwarding, ($1 loaded in MEM)
add $6,$2,$2
nop
sw $6,50($1)
b. lw $5,-16($5) Can’t use ALU-ALU forwarding ($5 loaded in MEM)
nop
nop
sw $5,-16($5)
add $5,$5,$5

4.13.6
     No forwarding               With ALU-ALU forwarding only    Speed-up with ALU-ALU forwarding
a.   (7+1)×300ps = 2400ps        (7+1)×360ps = 2880ps            0.83 (This is really a slowdown)
b.   (7+2)×200ps = 1800ps        (7+2)×220ps = 1980ps            0.91 (This is really a slowdown)

4.14
4.14.1
In the pipelined execution shown below, *** represents a stall when an instruction cannot be fetched
because a load or store instruction is using the memory in that cycle. Cycles are represented from left
to right, and for each instruction we show the pipeline stage it is in during that cycle:

We cannot add nops to the code to eliminate this hazard—nops need to be fetched just like any other
instructions, so this hazard must be addressed with a hardware hazard detection unit in the processor.

4.14.2
This change only saves one cycle in an entire execution without data hazards (such as the one given).
This cycle is saved because the last instruction finishes one cycle earlier (one less stage to go through).
If there were data hazards from loads to other instructions, the change would help eliminate some stall
cycles.
     Instructions executed    Cycles with 5 stages    Cycles with 4 stages    Speed-up
a.   4                        4 + 4 = 8               3 + 4 = 7               8/7 = 1.14
b.   5                        4 + 5 = 9               3 + 5 = 8               9/8 = 1.13
4.14.3
Stall-on-branch delays the fetch of the next instruction until the branch is executed. When branches
execute in the EXE stage, each branch causes two stall cycles. When branches execute in the ID stage,
each branch only causes one stall cycle. Without branch stalls (e.g., with perfect branch prediction)
there are no stalls, and the execution time is 4 plus the number of executed instructions. We have:

     Instructions    Branches    Cycles with           Cycles with
     executed        executed    branch in EXE         branch in ID          Speed-up
a.   4               1           4 + 4 + 1 × 2 = 10    4 + 4 + 1 × 1 = 9     10/9 = 1.11
b.   5               1           4 + 5 + 1 × 2 = 11    4 + 5 + 1 × 1 = 10    11/10 = 1.10
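
The cycle counts above follow one pattern: the stall per branch equals the number of stages before the one that resolves the branch. A minimal Python sketch of that pattern, with hypothetical names and the stage order assumed to be the standard five-stage one:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def cycles_with_branch_stalls(instructions, branches, resolve_stage):
    """Stall-on-branch: fetch waits until the branch leaves resolve_stage."""
    stall_per_branch = STAGES.index(resolve_stage)  # ID -> 1, EX -> 2, MEM -> 3
    return (len(STAGES) - 1) + instructions + branches * stall_per_branch

in_ex = cycles_with_branch_stalls(4, 1, "EX")
in_id = cycles_with_branch_stalls(4, 1, "ID")
print(in_ex, in_id, round(in_ex / in_id, 2))
```

The same function with "MEM" as the resolving stage gives three stall cycles per branch.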

4.14.4
The number of cycles for the (normal) 5-stage and the (combined EX/MEM) 4-stage pipeline is
already computed in 4.14.2. The clock cycle time is equal to the latency of the longest-latency stage.
Combining EX and MEM stages affects clock time only if the combined EX/MEM stage becomes the
longest-latency stage:

     Cycle time       Cycle time
     with 5 stages    with 4 stages         Speed-up
a.   130ps (MEM)      150ps (MEM + 20ps)    (8 × 130)/(7 × 150) = 0.99
b.   220ps (MEM)      240ps (MEM + 20ps)    (9 × 220)/(8 × 240) = 1.03

4.14.5
     New ID     New EX     New cycle      Old cycle
     latency    latency    time           time           Speed-up
a.   180ps      80ps       180ps (ID)     130ps (MEM)    (10 × 130)/(9 × 180) = 0.80
b.   150ps      160ps      220ps (MEM)    220ps (MEM)    (11 × 220)/(10 × 220) = 1.10
4.14.6
The cycle time remains unchanged: a 20ps reduction in EX latency has no effect on clock cycle time
because EX is not the longest-latency stage. The change does affect execution time because it adds one
additional stall cycle to each branch. Because the clock cycle time does not improve but the number of
cycles increases, the speed-up from this change will be below 1 (a slowdown). In 4.14.3 we already
computed the number of cycles when branch is in EX stage. We have:

     Cycles with           Execution time        Cycles with           Execution time
     branch in EX          (branch in EX)        branch in MEM         (branch in MEM)       Speed-up
a.   4 + 4 + 1 × 2 = 10    10 × 130ps = 1300ps   4 + 4 + 1 × 3 = 11    11 × 130ps = 1430ps   0.91
b.   4 + 5 + 1 × 2 = 11    11 × 220ps = 2420ps   4 + 5 + 1 × 3 = 12    12 × 220ps = 2640ps   0.92

4.15
4.15.1
a. This instruction behaves like a load with a zero offset until it fetches the value from memory. The
pre-ALU Mux must have another input now (zero) to allow this. After the value is read from memory
in the MEM stage, it must be compared against zero. This must either be done quickly in the WB
stage, or we must add another stage between MEM and WB. The result of this zero comparison must
then be used to control the branch Mux, delaying the selection signal for the branch Mux until the WB
stage.
b. We need to compute the memory address using two register values, so the address computation for
SWI is the same as the value computation for the ADD instruction. However now we need to read a
third register value, so Registers must be extended to support another read register input and another
read data output and a Mux must be added in EX to select the Data Memory’s write data input between
this value and the value for the normal SW instruction.

4.15.2
a. We need to add one more bit to the control signal for the pre-ALU Mux. We also need a control
signal similar to the existing “Branch” signal to control whether or not the new zero-compare result is
allowed to change the PC.
b. We need a control signal to control the new Mux in the EX stage.

4.15.3
a. This instruction introduces a new control hazard. The new PC for this branch is computed only after
the Mem stage. If a new stage is added after MEM, this either adds new forwarding paths (from the
new stage to EX) or (if there is no forwarding) makes a stall due to a data hazard one cycle longer.
b. This instruction does not affect hazards. It modifies no registers, so it causes no data hazards. It is
not a branch instruction, so it produces no control hazards. With the added third register read port, it
creates no new resource hazards, either.

4.15.4
a. lw Rtmp,0(Rs)
   beq Rtmp,$0,Label
   e.g., BEZI can be used when trying to find the length of a zero-terminated array.

b. add Rtmp,Rs,Rt
   sw Rd,0(Rtmp)
   e.g., SWI can be used to store to an array element, where the array begins at address Rt and Rs
   is used as an index into the array.

4.15.5
The instruction can be translated into simple MIPS-like micro-operations (see 4.15.4 for a possible
translation). These micro-operations can then be executed by the processor with a “normal” pipeline.

4.15.6
We will compute the execution time for every replacement interval. The old execution time is simply
the number of instructions in the replacement interval (CPI of 1). The new execution time is the number
of instructions after we made the replacement, plus the number of added stall cycles. The new number
of instructions is the number of instructions in the original replacement interval, plus the new
instruction, minus the number of instructions it replaces:
     New execution time        Old execution time    Speed-up
a.   20 − (2 − 1) + 1 = 20     20                    1.00
b.   60 − (3 − 1) + 0 = 58     60                    1.03

4.16
4.16.1
For every instruction, the IF/ID register keeps the PC + 4 and the instruction word itself. The ID/EX
register keeps all control signals for the EX, MEM, and WB stages, PC + 4, the two values read from
Registers, the sign-extended lowermost 16 bits of the instruction word, and Rd and Rt fields of the
instruction word (even for instructions whose format does not use these fields). The EX/MEM register
keeps control signals for MEM and WB stages, the PC + 4 + Offset (where Offset is the sign-extended
lowermost 16 bits of the instruction, even for instructions that have no offset field), the ALU result
and the value of its Zero output, the value that was read from the second register in the ID stage (even
for instructions that never need this value), and the number of the destination register (even for
instructions that need no register writes; for these instructions the number of the destination register is
simply a “random” choice between Rd or Rt). The MEM/WB register keeps the WB control signals,
the value read from memory (or a “random” value if there was no memory read), the ALU result, and
the number of the destination register.
4.16.2
     Need to be read    Actually read
a.   $6                 $6, $1
b.   $5                 $5 (twice)
4.16.3
     EX         MEM
a.   40 + $6    Load value from memory
b.   $5 + $5    Nothing

4.16.4

4.16.5
In a particular clock cycle, a pipeline stage is not doing useful work if it is stalled or if the instruction
going through that stage is not doing any useful work there. In the pipeline execution diagram from
4.16.4, a stage is stalled if its name is not shown for a particular cycle, and stages in which the
particular instruction is not doing useful work are marked in red. Note that a BEQ instruction is doing
useful work in the MEM stage, because it is determining the correct value of the next instruction’s PC
in that stage. We have:

     Cycles per loop    Cycles in which all stages    % of cycles in which all
     iteration          do useful work                stages do useful work
a.   5                  1                             20%
b.   5                  2                             40%
4.16.6
The address of the first instruction of the third iteration (PC + 4 for the beq from the previous
iteration) and the instruction word of the beq from the previous iteration.
4.17
4.17.1
Of all these instructions, the value produced by this adder is actually used only by a beq instruction
when the branch is taken. We have:
a. 15% (60% of 25%)
b. 9% (60% of 15%)
4.17.2
Of these instructions, only add needs all three register ports (it reads two registers and writes one). beq
and sw do not write any register, and lw only uses one register value. We have:
a. 50%
b. 30%
4.17.3
Of these instructions, only lw and sw use the data memory. We have:
a. 25% (15% + 10%)
b. 55% (35% + 20%)
4.17.4
The clock cycle time of a single-cycle datapath is the sum of the logic latencies of all five stages. The
clock cycle time of a pipelined datapath is the maximum of the five stage logic latencies, plus
the latency of a pipeline register that keeps the results of each stage for the next stage. We have:

     Single-cycle    Pipelined    Speed-up
a.   500ps           140ps        3.57
b.   730ps           230ps        3.17
4.17.5
The latency of the pipelined datapath is unchanged (the maximum stage latency does not change). The
clock cycle time of the single-cycle datapath is the sum of logic latencies for the four stages (IF, ID,
WB, and the combined EX + MEM stage). We have:
Single-cycle Pipelined
a. 410ps 140ps
b. 560ps 230ps
4.17.6
The clock cycle time is the same for the two pipelines (5-stage and 4-stage), as explained for 4.17.5. The
number of instructions increases for the 4-stage pipeline, so the speed-up is below 1 (there is a slowdown):

     Instructions with 5-stage    Instructions with 4-stage                         Speed-up
a.   1.00 × I                     1.00 × I + 0.5 × (0.15 + 0.10) × I = 1.125 × I    0.89
b.   1.00 × I                     1.00 × I + 0.5 × (0.35 + 0.20) × I = 1.275 × I    0.78
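
As a check on these figures, here is a minimal Python sketch. It assumes, as the table does, that half of all loads and stores need one extra instruction in the 4-stage design and that the clock cycle time is unchanged; the function names are illustrative.

```python
def relative_instruction_count(lw_frac, sw_frac, extra_frac=0.5):
    """Instruction count of the 4-stage design relative to the 5-stage one."""
    return 1.0 + extra_frac * (lw_frac + sw_frac)

def speedup_same_clock(lw_frac, sw_frac):
    """With equal clock cycle times, speed-up is the inverse of the count ratio."""
    return 1.0 / relative_instruction_count(lw_frac, sw_frac)

print(round(relative_instruction_count(0.15, 0.10), 3), round(speedup_same_clock(0.15, 0.10), 2))
print(round(relative_instruction_count(0.35, 0.20), 3), round(speedup_same_clock(0.35, 0.20), 2))
```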

4.18
4.18.1
No signals are asserted in IF and ID stages. For the remaining three stages we have:
     EX                         MEM                          WB
a.   ALUSrc = 0, ALUOp = 10,    Branch = 0, MemWrite = 0,    MemtoReg = 1,
     RegDst = 1                 MemRead = 0                  RegWrite = 1

b.   ALUSrc = 0, ALUOp = 10,    Branch = 0, MemWrite = 0,    MemtoReg = 1,
     RegDst = 1                 MemRead = 0                  RegWrite = 1

4.18.2
One clock cycle.

4.18.3
The PCSrc signal is 0 for this instruction. The reason against generating the PCSrc signal in the EX
stage is that the AND must be done after the ALU computes its Zero output. If the EX stage is the
longest-latency stage and the ALU output is on its critical path, the additional latency of an AND gate
would increase the clock cycle time of the processor. The reason in favor of generating this signal in
the EX stage is that the correct next-PC for a conditional branch can be computed one cycle earlier, so
we can avoid one stall cycle when we have a control hazard.

4.18.4
Control signal 1 Control signal 2
a. Generated in ID, used in EX Generated in ID, used in WB
b. Generated in ID, used in MEM Generated in ID, used in WB

4.18.5
a. R-type instructions
b. Loads.

4.18.6
Signal 2 goes back through the pipeline. It affects execution of instructions that execute after the one for
which the signal is generated, so it is not a time-travel paradox.

4.19
4.19.1
Dependences to the 1st next instruction result in 2 stall cycles, and the stall is also 2 cycles if the
dependence is to both 1st and 2nd next instruction. Dependences to only the 2nd next instruction result
in one stall cycle. We have:

     CPI                                 Stall Cycles
a.   1 + 0.45 × 2 + 0.05 × 1 = 1.95      49% (0.95/1.95)
b.   1 + 0.40 × 2 + 0.10 × 1 = 1.9       47% (0.9/1.9)
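
The CPI formula used in this table can be written out as a short Python sketch. The function and parameter names are assumptions made for illustration; the fractions are those shown in the table.

```python
def cpi_without_forwarding(frac_first, frac_second_only):
    """frac_first: instructions with a dependence to the 1st next instruction
    (including those that also depend on the 2nd next), costing 2 stall cycles.
    frac_second_only: dependence to the 2nd next instruction only, costing 1."""
    return 1 + 2 * frac_first + 1 * frac_second_only

def stall_fraction(cpi):
    """Share of all cycles that are stall cycles, given a base CPI of 1."""
    return (cpi - 1) / cpi

cpi_a = cpi_without_forwarding(0.45, 0.05)
cpi_b = cpi_without_forwarding(0.40, 0.10)
print(round(cpi_a, 2), round(100 * stall_fraction(cpi_a)))
print(round(cpi_b, 2), round(100 * stall_fraction(cpi_b)))
```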

4.19.2
With full forwarding, the only RAW data dependences that cause stalls are those from the MEM stage
of one instruction to the 1st next instruction. Even this dependence causes only one stall cycle, so we
have:
CPI Stall Cycles
a. 1 + 0.25 = 1.25 20% (0.25/1.25)
b. 1 + 0.20 = 1.20 17% (0.20/1.20)
4.19.3
With forwarding only from the EX/MEM register, EX to 1st dependences can be satisfied without
stalls but EX to 2nd and MEM to 1st dependences incur a one-cycle stall. With forwarding only from
the MEM/WB register, EX to 2nd dependences incur no stalls. MEM to 1st dependences still incur a
one-cycle stall (no time travel), and EX to 1st dependences now incur one stall cycle because we must
wait for the instruction to complete the MEM stage to be able to forward to the next instruction. We
compute stall cycles per instruction for each case as follows:
     EX/MEM                        MEM/WB                        Fewer stall cycles with
a.   0.10 + 0.05 + 0.25 = 0.40     0.10 + 0.10 + 0.25 = 0.45     EX/MEM
b.   0.05 + 0.10 + 0.20 = 0.35     0.15 + 0.05 + 0.20 = 0.40     EX/MEM
4.19.4
In 4.19.1 and 4.19.2 we have already computed the CPI without forwarding and with full forwarding.
Now we compute time per instruction by taking into account the clock cycle time:
Without forwarding With forwarding Speed-up
a. 1.95 × 100ps = 195ps 1.25 × 110ps = 137.5ps 1.42
b. 1.90 × 300ps = 570ps 1.20 × 350ps = 420ps 1.36

4.19.5
We already computed the time per instruction for full forwarding in 4.19.4. Now we compute time per
instruction with time-travel forwarding and the speed-up over full forwarding:

     With full forwarding       Time-travel forwarding    Speed-up
a.   1.25 × 110ps = 137.5ps     1 × 210ps = 210ps         0.65
b.   1.20 × 350ps = 420ps       1 × 450ps = 450ps         0.93

4.19.6
EX/MEM MEM/WB Shorter time per instruction with
a. 1.40 × 100ps = 140ps 1.45 × 100ps = 145ps EX/MEM
b. 1.35 × 320ps = 432ps 1.40 × 310ps = 434ps EX/MEM

4.20
4.20.1

4.20.2
Only RAW dependences can become data hazards. With forwarding, only RAW dependences from a
load to the very next instruction become hazards. Without forwarding, any RAW dependence from an
instruction to one of the following three instructions becomes a hazard:

4.20.3
With forwarding, only RAW dependences from a load to the next two instructions become hazards
because the load produces its data at the end of the second MEM stage. Without forwarding, any RAW
dependence from an instruction to one of the following 4 instructions becomes a hazard:

4.20.4

4.20.5
A register modification becomes “visible” to the EX stage of the following instructions only two
cycles after the instruction that produces the register value leaves the EX stage. Our forwarding-
assuming hazard detection unit only adds a one-cycle stall if the instruction that immediately follows a
load is dependent on the load. We have:

4.20.6

4.21
4.21.2
We can move up an instruction by swapping its place with another instruction that has no dependences
with it, so we can try to fill some nop slots with such instructions. We can also use R7 to eliminate
WAW or WAR dependences so we can have more instructions to move up.

4.21.3
With forwarding, the hazard detection unit is still needed because it must insert a one-cycle stall
whenever the load supplies a value to the instruction that immediately follows that load. Without the
hazard detection unit, the instruction that depends on the immediately preceding load gets the stale
value the register had before the load instruction.
a. I2 gets the value of $1 from before I1, not from I1 as it should.
b. I4 gets the value of $1 from I1, not from I3 as it should.

4.21.4
The outputs of the hazard detection unit are PCWrite, IF/IDWrite, and ID/EXZero (which controls the
Mux after the output of the Control unit). Note that IF/IDWrite is always equal to PCWrite, and
ID/EXZero is always the opposite of PCWrite. As a result, we will only show the value of PCWrite for
each cycle. The outputs of the forwarding unit are ALUin1 and ALUin2, which control the Muxes that
select the first and second inputs of the ALU. The three possible values for ALUin1 or ALUin2 are 0
(no forwarding), 1 (forward ALU output from previous instruction), or 2 (forward data value for
second-previous instruction). We have:

4.21.5
The instruction that is currently in the ID stage needs to be stalled if it depends on a value produced by
the instruction in the EX or the instruction in the MEM stage. So we need to check the destination
register of these two instructions. For the instruction in the EX stage, we need to check Rd for R-type
instructions and Rt for loads. For the instruction in the MEM stage, the destination register is already
selected (by the Mux in the EX stage) so we need to check that register number (this is the bottommost
output of the EX/MEM pipeline register). The additional inputs to the hazard detection unit are register
Rd from the ID/EX pipeline register and the destination register number from the EX/MEM
pipeline register. The Rt field from the ID/EX register is already an input of the hazard detection unit
in Figure 4.60. No additional outputs are needed. We can stall the pipeline using the three output
signals that we already have.
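
The stall condition described above can be sketched as a Python predicate. This is a simplified illustration, not the hardware design: the function names and the string instruction kinds are hypothetical, the sketch assumes the usual MIPS convention that R-type instructions write Rd and loads write Rt, and it ignores register $0 because writes to it have no effect.

```python
def ex_destination(kind, rd, rt):
    """Destination register number of the instruction in EX, or None if it writes none."""
    if kind == "rtype":
        return rd
    if kind == "load":
        return rt
    return None  # stores and branches write no register

def must_stall(id_sources, ex_kind, ex_rd, ex_rt, mem_dest):
    """True if the instruction in ID reads a register still being produced in EX or MEM.

    id_sources: register numbers read by the instruction in ID.
    mem_dest: destination register number held in EX/MEM, or None.
    """
    pending = {ex_destination(ex_kind, ex_rd, ex_rt), mem_dest}
    pending.discard(None)
    pending.discard(0)  # $0 is hardwired to zero, so it never causes a hazard
    return any(src in pending for src in id_sources)

# add in EX writes $2; the instruction in ID reads $2 and $5, so it must stall.
print(must_stall([2, 5], "rtype", 2, 4, None))
```

When the predicate is true, the unit would hold PCWrite and IF/IDWrite at 0 and assert ID/EXZero, using the three outputs that already exist.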

4.21.6
As explained for 4.21.5, we only need to specify the value of the PCWrite signal, because IF/IDWrite
is equal to PCWrite and the ID/EXZero signal is its opposite. We have:

4.22
4.22.1

4.22.2
