Solution 2
Solution 2
loop: LD R1,0(R2)
DADDI R1,R1,1
SD 0(R2),R1
DADDI R2,R2,4
DSUB R4,R3,R2
BNEZ R4,loop
Assume that the initial value of R3 is R2+396. Throughout this exercise use the classic
RISC five-stage integer pipeline in H&P. Specifically, assume that: 1) branches are resolved
in the second stage of the pipeline; 2) there are separate instruction and data memories; 3)
all memory accesses take 1 clock cycle.
a. Show the timing of this instruction sequence for the RISC pipeline without any forwarding
or bypassing hardware but assuming a register read and a write in the same clock cycle
“forwards” through the register file. Assume that the branch is handled by flushing the
pipeline. If all memory references take 1 cycle, how many cycles does this loop take to
execute?
b. Show the timing of this instruction sequence for the RISC pipeline with normal forwarding
and bypassing hardware. Assume that the branch is handled by predicting it as not taken.
If all memory references take 1 cycle, how many cycles does this loop take to execute?
c. Assume the RISC pipeline with a single-cycle delayed branch and normal forwarding and
bypassing hardware. Schedule the instructions in the loop including the branch delay slot.
You may reorder instructions and modify the individual instructions operands, but do not
undertake other loop transformations that change the number or opcode of the instructions
in the loops. Show a pipeline timing diagram and compute the number of cycles needed to
execute the entire loop.
Solution
1
Inf3 Computer Architecture tutorial 2 - week 4
Inst. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
LD F D X M W
DADDI F D s s X M W
SD F s s D s s X M W
DADDI F s s D X M W
DSUB F D s s X M W
BNEZ F s s D s s X M W
LD (2) F s s F D X M W
• The first DADDI must wait until LD gets to the WB stage to obtain the value of R1.
• SD must wait until the first DADDI computes the value of R1 and reaches the WB
stage (no forwarding).
• DSUB must wait until the second DADDI computes the value of R2 and reaches the
WB stage.
• BNEZ must wait until DSUB computes the value of R4 and reaches the WB stage.
• LD from the next iteration is re-fetched after the branch is resolved as taken (in the
branch ID stage).
The loop executes 99 iterations (396/4), iterations 0 to 98. Iteration i begins in cycle
1 + (i ∗ 15) (see figure). The last iteration takes 18 cycles to complete. So, the total number
of cycles for the whole loop is (98 ∗ 15) + 18 = 1488 cycles.
Inst. 1 2 3 4 5 6 7 8 9 10 11 12 13 14
LD F D X M W
DADDI F D s X M W
SD F s D X M W
DADDI F D X M W
DSUB F D X M W
BNEZ F D s X M W
LD (2) F s F D X M W
• The first ADDI still must wait until LD gets to the WB stage to obtain the value of
R1, but now stall is implemented in the EX stage through the interlocking mechanism
and value is forwarded directly to EX stage.
2
Inf3 Computer Architecture tutorial 2 - week 4
• SD now waits until the first ADDI computes the value of R1 and forwards the value
from EX to MEM.
• DSUB now gets the value of R2 forwarded by the second ADDI and does not need to
wait.
• BNEZ now gets the value of R4 forwarded by sub but needs to wait till the sub reaches
EX stage.
• LD from the next iteration is re-fetched after the branch is resolved as mispredicted
(in the branch ID stage).
Iteration i now begins in cycle 1 + (i ∗ 9) and the last iteration takes only 12 cycles to
complete. So, the total number of cycles for the whole loop is (98 ∗ 9) + 12 = 894 cycles.
• Optimization 3: move SD to the branch delay slot, adjusting the offset (SD -4(R2),R1).
Inst. 1 2 3 4 5 6 7 8 9 10 11
LD R1,0(R2) F D X M W
DADDI R2,R2,4 F D X M W
DSUB R4,R3,R2 F D X M W
DADDI R1,R1,1 F D X M W
BNEZ R4,loop F D X M W
SD -4(R2),R1 F D X M W
LD (2) F D X M W
• The first DADDI does not wait as it is now separated by two extra cycles from LD.
• The BNEZ now does not wait as it is separated by one extra cycle from DSUB.
• LD from the next iteration is properly fetched as the branch is resolved in time.
Iteration i now begins in cycle 1 + (i ∗ 6) and the last iteration takes only 10 cycles to
complete. So, the total number of cycles for the whole loop is (98 ∗ 6) + 10 = 598 cycles.
Vijay Nagarajan. Thanks to Boris Grot, Nigel Topham and Marcelo Cintra.