Problem Set 4 Sol
Chapter 4:
Exercises: 4.12, 4.13, 4.14, 4.16, 4.17, 4.18, 4.19,
4.20, 4.21 and 4.22.1 & 4.22.2.
Additional Questions
Q.1. How could we modify the following code to make use of a delayed branch slot?
4.13
4.13.1
    Instruction sequence    Dependences
a.  I1: lw $1,40($6)        RAW on $1 from I1 to I3
    I2: add $6,$2,$2        RAW on $6 from I2 to I3
    I3: sw $6,50($1)        WAR on $6 from I1 to I2 and I3
4.13.3
With full forwarding, an ALU instruction can forward a value to the EX stage of the next instruction
without a hazard. However, a load cannot forward to the EX stage of the next instruction (but it can
forward to the instruction after that). The code that eliminates these hazards by inserting nop instructions is:
    Instruction sequence
a.  lw $1,40($6)            No RAW hazard on $1 from I1 (forwarded)
    add $6,$2,$2
    sw $6,50($1)
b.  lw $5,-16($5)
    nop                     Delay I2 to avoid RAW hazard on $5 from I1
    sw $5,-16($5)           Value for $5 is forwarded from I2 now
    add $5,$5,$5            Note: no RAW hazard on $5 from I1 now
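The nop-insertion rule described above can be sketched in Python. This is a minimal sketch, assuming full forwarding and the rule stated in the text (a load cannot forward to the instruction that immediately follows it). The `Instr` record, its `is_load` flag, and the function name are hypothetical and are not part of the exercise.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record, used only for this illustration."""
    text: str               # assembly text, for display
    dest: Optional[str]     # register written, or None
    srcs: Tuple[str, ...]   # registers read
    is_load: bool = False


NOP = Instr("nop", None, ())


def insert_nops_full_forwarding(code: List[Instr]) -> List[Instr]:
    """Insert one nop between a load and an immediately following consumer."""
    out: List[Instr] = []
    for ins in code:
        prev = out[-1] if out else None
        # Assumed rule: only a load followed directly by a reader of its
        # destination register needs a one-cycle delay.
        if prev is not None and prev.is_load and prev.dest in ins.srcs:
            out.append(NOP)
        out.append(ins)
    return out


seq_a = [
    Instr("lw $1,40($6)", "$1", ("$6",), is_load=True),
    Instr("add $6,$2,$2", "$6", ("$2", "$2")),
    Instr("sw $6,50($1)", None, ("$6", "$1")),
]
seq_b = [
    Instr("lw $5,-16($5)", "$5", ("$5",), is_load=True),
    Instr("sw $5,-16($5)", None, ("$5",)),
    Instr("add $5,$5,$5", "$5", ("$5", "$5")),
]

fixed_a = [i.text for i in insert_nops_full_forwarding(seq_a)]
fixed_b = [i.text for i in insert_nops_full_forwarding(seq_b)]
```

Under these assumptions, sequence a is left unchanged and sequence b gains a single nop after the load, which agrees with the listing above.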
4.13.5
With ALU-ALU-only forwarding, an ALU instruction can forward to the next instruction, but not to
the second-next instruction (because that would be forwarding from MEM to EX). A load cannot
forward at all, because it determines the data value in MEM stage, when it is too late for ALU-ALU
forwarding. We have:
    Instruction sequence
a.  lw $1,40($6)            Can’t use ALU-ALU forwarding ($1 loaded in MEM)
    add $6,$2,$2
    nop
    sw $6,50($1)
b.  lw $5,-16($5)           Can’t use ALU-ALU forwarding ($5 loaded in MEM)
    nop
    nop
    sw $5,-16($5)
    add $5,$5,$5
4.13.6
    No forwarding           With ALU-ALU forwarding only    Speed-up with ALU-ALU forwarding
a.  (7+1)×300ps = 2400ps    (7+1)×360ps = 2880ps            0.83 (This is really a slowdown)
b.  (7+2)×200ps = 1800ps    (7+2)×220ps = 1980ps            0.91 (This is really a slowdown)
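The arithmetic in this table can be checked with a short Python sketch. The split into a base cycle count and extra cycles is copied directly from the table entries; the function and variable names are hypothetical.

```python
def execution_time_ps(base_cycles: int, extra_cycles: int, cycle_time_ps: float) -> float:
    """Total time = (base cycles + extra cycles) x clock cycle time."""
    return (base_cycles + extra_cycles) * cycle_time_ps


def speedup(old_time: float, new_time: float) -> float:
    """A value below 1.0 means the change is a slowdown."""
    return old_time / new_time


# Row a of the table above.
no_fwd_a = execution_time_ps(7, 1, 300)
alu_fwd_a = execution_time_ps(7, 1, 360)
speedup_a = speedup(no_fwd_a, alu_fwd_a)

# Row b of the table above.
no_fwd_b = execution_time_ps(7, 2, 200)
alu_fwd_b = execution_time_ps(7, 2, 220)
speedup_b = speedup(no_fwd_b, alu_fwd_b)
```

Both ratios come out below 1.0 because the cycle count is the same in each row while the clock cycle is longer with forwarding.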
4.14
4.14.1
In the pipelined execution shown below, *** represents a stall when an instruction cannot be fetched
because a load or store instruction is using the memory in that cycle. Cycles are represented from left
to right, and for each instruction we show the pipeline stage it is in during that cycle:
4.14.2
This change only saves one cycle in an entire execution without data hazards (such as the one given).
This cycle is saved because the last instruction finishes one cycle earlier (one less stage to go through).
If there were data hazards from loads to other instructions, the change would help eliminate some stall
cycles.
    Instructions executed    Cycles with 5 stages    Cycles with 4 stages    Speed-up
a.  4                        4+4=8                   3+4=7                   8/7 = 1.14
b.  5                        4+5=9                   3+5=8                   9/8 = 1.13
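The cycle counts in this table follow one pattern: a hazard-free pipeline needs one cycle per instruction plus the number of stages minus one to fill the pipeline. A minimal Python sketch of that pattern, with a hypothetical function name:

```python
def pipeline_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles for a stall-free run: fill the pipeline, then one per instruction."""
    return (n_stages - 1) + n_instructions


# Rows a and b of the table above.
speedup_a = pipeline_cycles(4, 5) / pipeline_cycles(4, 4)
speedup_b = pipeline_cycles(5, 5) / pipeline_cycles(5, 4)
```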
4.14.3
Stall-on-branch delays the fetch of the next instruction until the branch is executed. When branches
execute in the EX stage, each branch causes two stall cycles. When branches execute in the ID stage,
each branch only causes one stall cycle. Without branch stalls (e.g., with perfect branch prediction)
there are no stalls, and the execution time is 4 plus the number of executed instructions. We have:
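The rule in the paragraph above can be written as a small Python sketch. The stall counts per branch (two for EX, one for ID) and the base of 4 plus the instruction count come from the text; the function name and the example instruction and branch counts are hypothetical and are not the exercise's data.

```python
def cycles_with_branch_stalls(n_instructions: int, n_branches: int, branch_stage: str) -> int:
    """Stall-on-branch cycle count for a 5-stage pipeline."""
    # Stall cycles per branch, as stated in the text above.
    stalls_per_branch = {"EX": 2, "ID": 1}[branch_stage]
    return 4 + n_instructions + stalls_per_branch * n_branches


# Hypothetical example: 10 executed instructions, 2 of them branches.
cycles_ex = cycles_with_branch_stalls(10, 2, "EX")
cycles_id = cycles_with_branch_stalls(10, 2, "ID")
cycles_ideal = 4 + 10  # no branch stalls, e.g. perfect prediction
```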
4.14.4
The number of cycles for the (normal) 5-stage and the (combined EX/MEM) 4-stage pipeline is
already computed in 4.14.2. The clock cycle time is equal to the latency of the longest-latency stage.
Combining EX and MEM stages affects clock time only if the combined EX/MEM stage becomes the
longest-latency stage:
4.14.5
New ID New EX New cycle Old cycle
4.15
4.15.1
a. This instruction behaves like a load with a zero offset until it fetches the value from memory. The
pre-ALU Mux must have another input now (zero) to allow this. After the value is read from memory
in the MEM stage, it must be compared against zero. This must either be done quickly in the WB
stage, or we must add another stage between MEM and WB. The result of this zero comparison must
then be used to control the branch Mux, delaying the selection signal for the branch Mux until the WB
stage.
b. We need to compute the memory address using two register values, so the address computation for
SWI is the same as the value computation for the ADD instruction. However, we now need to read a
third register value, so Registers must be extended to support another read register input and another
read data output, and a Mux must be added in EX to select the Data Memory’s write data input between
this value and the value for the normal SW instruction.
4.15.2
a. We need to add one more bit to the control signal for the pre-ALU Mux. We also need a control
signal similar to the existing “Branch” signal to control whether or not the new zero-compare result is
allowed to change the PC.
b. We need a control signal to control the new Mux in the EX stage.
4.15.3
a. This instruction introduces a new control hazard. The new PC for this branch is computed only after
the MEM stage. If a new stage is added after MEM, this either adds new forwarding paths (from the
new stage to EX) or (if there is no forwarding) makes a stall due to a data hazard one cycle longer.
b. This instruction does not affect hazards. It modifies no registers, so it causes no data hazards. It is
not a branch instruction, so it produces no control hazards. With the added third register read port, it
creates no new resource hazards, either.
4.15.4
a.  lw Rtmp,0(Rs)
    beq Rt,$0,Label
    e.g., BEZI can be used when trying to find the length of a zero-terminated array.
b.  add Rtmp,Rs,Rt
    sw Rd,0(Rtmp)
    e.g., SWI can be used to store to an array element, where the array begins at address Rt and Rs
    is used as an index into the array.
4.15.6
We will compute the execution time for every replacement interval. The old execution time is simply
the number of instructions in the replacement interval (CPI of 1). The new execution time is the number
of instructions after we made the replacement, plus the number of added stall cycles. The new number
of instructions is the number of instructions in the original replacement interval, plus the new
instruction, minus the number of instructions it replaces:
    New execution time        Old execution time    Speed-up
a.  20 − (2 − 1) + 1 = 20     20                    1.00
b.  60 − (3 − 1) + 0 = 58     60                    1.03
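The calculation described above can be sketched in Python. The formula is the one stated in the text (original interval, plus the one new instruction, minus the instructions it replaces, plus added stall cycles); the function and parameter names are hypothetical.

```python
def new_execution_time(interval: int, replaced: int, added_stalls: int) -> int:
    """Cycles per replacement interval after substituting the new instruction."""
    # One new instruction replaces `replaced` old ones, so the count drops
    # by (replaced - 1); stall cycles caused by the new instruction are added.
    return interval - (replaced - 1) + added_stalls


def speedup(old_time: float, new_time: float) -> float:
    return old_time / new_time


# Rows a and b of the table above (old time equals the interval, CPI of 1).
new_a = new_execution_time(20, 2, 1)
new_b = new_execution_time(60, 3, 0)
speedup_a = speedup(20, new_a)
speedup_b = speedup(60, new_b)
```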
4.16
4.16.1
For every instruction, the IF/ID register keeps the PC + 4 and the instruction word itself. The ID/EX
register keeps all control signals for the EX, MEM, and WB stages, PC + 4, the two values read from
Registers, the sign-extended lowermost 16 bits of the instruction word, and Rd and Rt fields of the
instruction word (even for instructions whose format does not use these fields). The EX/MEM register
keeps control signals for MEM and WB stages, the PC + 4 + Offset (where Offset is the sign-extended
lowermost 16 bits of the instruction, even for instructions that have no offset field), the ALU result
and the value of its Zero output, the value that was read from the second register in the ID stage (even
for instructions that never need this value), and the number of the destination register (even for
instructions that need no register writes; for these instructions the number of the destination register is
simply a “random” choice between Rd or Rt). The MEM/WB register keeps the WB control signals,
the value read from memory (or a “random” value if there was no memory read), the ALU result, and
the number of the destination register.
4.16.2
    Need to be read    Actually read
a.  $6                 $6, $1
b.  $5                 $5 (twice)
4.16.3
    EX           MEM
a.  40 + $6      Load value from memory
b.  $5 + $5      Nothing
4.16.4
4.18
4.18.1
No signals are asserted in IF and ID stages. For the remaining three stages we have:
    EX                                     MEM                                     WB
a.  ALUSrc = 0, ALUOp = 10, RegDst = 1     Branch = 0, MemWrite = 0, MemRead = 0   MemtoReg = 1, RegWrite = 1
4.18.2
One clock cycle.
4.18.3
The PCSrc signal is 0 for this instruction. The reason against generating the PCSrc signal in the EX
stage is that the AND must be done after the ALU computes its Zero output. If the EX stage is the
longest-latency stage and the ALU output is on its critical path, the additional latency of an AND gate
would increase the clock cycle time of the processor. The reason in favor of generating this signal in
the EX stage is that the correct next-PC for a conditional branch can be computed one cycle earlier, so
we can avoid one stall cycle when we have a control hazard.
4.18.4
    Control signal 1                 Control signal 2
a.  Generated in ID, used in EX      Generated in ID, used in WB
b.  Generated in ID, used in MEM     Generated in ID, used in WB
4.18.6
Signal 2 goes back through the pipeline. It affects execution of instructions that execute after the one for
which the signal is generated, so it is not a time-travel paradox.
4.19
4.19.1
Dependences to the 1st next instruction result in 2 stall cycles, and the stall is also 2 cycles if the
dependence is to both 1st and 2nd next instruction. Dependences to only the 2nd next instruction result
in one stall cycle. We have:
4.19.2
With full forwarding, the only RAW data dependences that cause stalls are those from the MEM stage
of one instruction to the 1st next instruction. Even this dependence causes only one stall cycle, so we
have:
    CPI                  Stall cycles
a.  1 + 0.25 = 1.25      20% (0.25/1.25)
b.  1 + 0.20 = 1.20      17% (0.20/1.20)
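The two columns of this table can be reproduced with a short Python sketch, assuming a base CPI of 1 and one stall cycle per stalling dependence as stated above. The function names are hypothetical.

```python
def cpi_with_stalls(stalls_per_instruction: float) -> float:
    """CPI = ideal CPI of 1 plus average stall cycles per instruction."""
    return 1.0 + stalls_per_instruction


def stall_fraction(stalls_per_instruction: float) -> float:
    """Share of all cycles that are stall cycles."""
    return stalls_per_instruction / cpi_with_stalls(stalls_per_instruction)


# Rows a and b of the table above.
cpi_a, frac_a = cpi_with_stalls(0.25), stall_fraction(0.25)
cpi_b, frac_b = cpi_with_stalls(0.20), stall_fraction(0.20)
```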
4.19.3
With forwarding only from the EX/MEM register, EX to 1st dependences can be satisfied without
stalls but EX to 2nd and MEM to 1st dependences incur a one-cycle stall. With forwarding only from
the MEM/WB register, EX to 2nd dependences incur no stalls. MEM to 1st dependences still incur a
one-cycle stall (no time travel), and EX to 1st dependences now incur one stall cycle because we must
wait for the instruction to complete the MEM stage to be able to forward to the next instruction. We
compute stall cycles per instruction for each case as follows:
    EX/MEM                        MEM/WB                        Fewer stall cycles with
a.  0.10 + 0.05 + 0.25 = 0.40     0.10 + 0.10 + 0.25 = 0.45     EX/MEM
b.  0.05 + 0.10 + 0.20 = 0.35     0.15 + 0.05 + 0.20 = 0.40     EX/MEM
4.19.4
In 4.19.1 and 4.19.2 we have already computed the CPI without forwarding and with full forwarding.
Now we compute time per instruction by taking into account the clock cycle time:
    Without forwarding        With forwarding            Speed-up
a.  1.95 × 100ps = 195ps      1.25 × 110ps = 137.5ps     1.42
b.  1.90 × 300ps = 570ps      1.20 × 350ps = 420ps       1.36
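As a check on this table, time per instruction is CPI multiplied by clock cycle time, and the speed-up is the ratio of the two times. A minimal Python sketch using the table's values, with a hypothetical function name:

```python
def time_per_instruction_ps(cpi: float, cycle_time_ps: float) -> float:
    """Average time per instruction in picoseconds."""
    return cpi * cycle_time_ps


# Row a: without forwarding vs. with full forwarding.
speedup_a = time_per_instruction_ps(1.95, 100) / time_per_instruction_ps(1.25, 110)
# Row b.
speedup_b = time_per_instruction_ps(1.90, 300) / time_per_instruction_ps(1.20, 350)
```

Forwarding wins here even though it lengthens the clock cycle, because the drop in CPI outweighs the longer cycle.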
4.19.5
We already computed the time per instruction for full forwarding in 4.19.4. Now we compute time per
instruction with time-travel forwarding and the speed-up over full forwarding:
4.20
4.20.1
4.20.2
Only RAW dependences can become data hazards. With forwarding, only RAW dependences from a
load to the very next instruction become hazards. Without forwarding, any RAW dependence from an
instruction to one of the following three instructions becomes a hazard:
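The two rules in the paragraph above can be sketched as a small Python checker. This is an illustration under stated assumptions, not the exercise's answer: the `Instr` record and function name are hypothetical, each source register is matched to its nearest earlier writer, and the special behavior of register $0 is ignored. The example input is the three-instruction sequence from 4.13.1.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record, used only for this illustration."""
    text: str
    dest: Optional[str]     # register written, or None
    srcs: Tuple[str, ...]   # registers read
    is_load: bool = False


def raw_hazards(code: List[Instr], forwarding: bool) -> List[Tuple[int, int, str]]:
    """Return (producer index, consumer index, register) for each RAW hazard."""
    hazards = []
    for j, consumer in enumerate(code):
        for reg in dict.fromkeys(consumer.srcs):      # de-duplicate, keep order
            for i in range(j - 1, -1, -1):            # nearest earlier writer
                if code[i].dest == reg:
                    distance = j - i
                    if forwarding:
                        # Only a load feeding the very next instruction.
                        is_hazard = code[i].is_load and distance == 1
                    else:
                        # Any producer within the following three instructions.
                        is_hazard = distance <= 3
                    if is_hazard:
                        hazards.append((i, j, reg))
                    break
    return hazards


seq = [
    Instr("lw $1,40($6)", "$1", ("$6",), is_load=True),
    Instr("add $6,$2,$2", "$6", ("$2", "$2")),
    Instr("sw $6,50($1)", None, ("$6", "$1")),
]
without_fwd = raw_hazards(seq, forwarding=False)
with_fwd = raw_hazards(seq, forwarding=True)
```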
4.20.3
With forwarding, only RAW dependences from a load to the next two instructions become hazards
because the load produces its data at the end of the second MEM stage. Without forwarding, any RAW
dependence from an instruction to one of the following 4 instructions becomes a hazard:
4.20.5
A register modification becomes “visible” to the EX stage of the following instructions only two
cycles after the instruction that produces the register value leaves the EX stage. Our forwarding-
assuming hazard detection unit only adds a one-cycle stall if the instruction that immediately follows a
load is dependent on the load. We have:
4.20.6
4.21.3
With forwarding, the hazard detection unit is still needed because it must insert a one-cycle stall
whenever the load supplies a value to the instruction that immediately follows that load. Without the
hazard detection unit, the instruction that depends on the immediately preceding load gets the stale
value the register had before the load instruction.
a. I2 gets the value of $1 from before I1, not from I1 as it should.
b. I4 gets the value of $1 from I1, not from I3 as it should.
4.21.5
The instruction that is currently in the ID stage needs to be stalled if it depends on a value produced by
the instruction in the EX stage or the instruction in the MEM stage. So we need to check the destination
register of these two instructions. For the instruction in the EX stage, we need to check Rd for R-type
instructions and Rt for loads. For the instruction in the MEM stage, the destination register is already
selected (by the Mux in the EX stage) so we need to check that register number (this is the bottommost
output of the EX/MEM pipeline register). The additional inputs to the hazard detection unit are register
Rd from the ID/EX pipeline register and the destination register number from the EX/MEM
pipeline register. The Rt field from the ID/EX register is already an input of the hazard detection unit
in Figure 4.60. No additional outputs are needed. We can stall the pipeline using the three output
signals that we already have.
4.21.6
As explained for 4.21.5, we only need to specify the value of the PCWrite signal, because IF/IDWrite
is equal to PCWrite and the ID/EXzero signal is its opposite. We have:
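The relationship between the three output signals can be sketched in Python. This is a rough model under several assumptions that are not in the original: the caller passes already-selected destination register numbers for the EX and MEM stages together with their RegWrite flags, both Rs and Rt of the ID-stage instruction are checked conservatively, and register 0 is treated as never causing a hazard. All names are hypothetical.

```python
from typing import Dict


def hazard_detection(id_rs: int, id_rt: int,
                     ex_reg_write: bool, ex_dest: int,
                     mem_reg_write: bool, mem_dest: int) -> Dict[str, bool]:
    """Model of a no-forwarding hazard detection unit's three outputs."""

    def matches(dest: int) -> bool:
        # Assumption: writes to register 0 never create a dependence.
        return dest != 0 and dest in (id_rs, id_rt)

    stall = bool((ex_reg_write and matches(ex_dest)) or
                 (mem_reg_write and matches(mem_dest)))
    pc_write = not stall
    return {
        "PCWrite": pc_write,
        "IF/IDWrite": pc_write,     # equal to PCWrite, as stated above
        "ID/EXzero": not pc_write,  # the opposite of PCWrite
    }


# ID-stage instruction reads registers 6 and 1; the EX-stage one writes 6.
stalled = hazard_detection(6, 1, True, 6, False, 0)
# No instruction ahead writes a register that the ID stage reads.
running = hazard_detection(6, 1, True, 2, True, 3)
```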
4.22.2