Question 1. An instruction requires four stages to execute: stage 1 (instruction fetch) requires 30 ns, stage 2 (instruction decode) = 9 ns, stage 3 (instruction execute) = 20 ns and stage 4 (store results) = 10 ns. An instruction must proceed through the stages in sequence. What is the minimum asynchronous time for any single instruction to complete? 30 + 9 + 20 + 10 = 69 ns.
Question 2. We want to set this up as a pipelined operation. How many stages should we have and at what rate should we clock the pipeline? We have 4 natural stages given and no information on how we might further subdivide them, so we use 4 stages in our pipeline. We have a choice of what clock rate to use. The simplest choice would be to use a clock cycle that accommodates the longest stage in our pipe: 30 ns. This would allow us to initiate a new instruction every 30 ns with a latency through the pipe of 30 ns x 4 stages = 120 ns. We could also pick a finer clock cycle that more closely matches the shortest stage (9 ns) but divides integrally into the other stages. A clock of 10 ns would be a good match and would require 3 clocks for the first stage, 1 clock for the second, 2 clocks for the third, and 1 clock for the fourth. This would allow us to initiate a new instruction every 30 ns but provide a latency of 70 ns rather than 120 ns. Either 30 ns or 10 ns is acceptable.
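The two clocking options can be checked with a short sketch. The stage times and the 10 ns clock are from the problem; mapping each stage onto a whole number of clocks via ceiling division is the assumption:

```python
stage_ns = [30, 9, 20, 10]  # fetch, decode, execute, store

# Option 1: one clock per stage, period set by the slowest stage.
clock1 = max(stage_ns)                        # 30 ns
latency1 = clock1 * len(stage_ns)             # 4 stages x 30 ns = 120 ns

# Option 2: a 10 ns clock; each stage occupies ceil(stage/10) clocks.
clock2 = 10
clocks = [-(-s // clock2) for s in stage_ns]  # ceiling division: [3, 1, 2, 1]
latency2 = clock2 * sum(clocks)               # 7 clocks x 10 ns = 70 ns
initiation = clock2 * max(clocks)             # limited by the 3-clock fetch

print(latency1, latency2, initiation)  # 120 70 30
```

Both options initiate a new instruction every 30 ns; the finer clock only shortens the latency.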
Question 3. For the pipeline in question 2, how frequently can we initiate the execution of a new instruction, and what is the latency? See answer to question 2.
Question 4. What is the speedup of the pipeline in question 2? Speedup per Stone's preferred definition is (30 + 9 + 20 + 10)/30 = 2.3. Speedup per the best clocked definition is (30 + 10 + 20 + 10)/30 = 2.33.
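The two figures can be reproduced directly from the stage times:

```python
stage_ns = [30, 9, 20, 10]
# Stone's preferred definition: total asynchronous time over the 30 ns
# initiation interval.
speedup_stone = sum(stage_ns) / max(stage_ns)        # 69/30 = 2.3

# Best clocked definition: stage times rounded up to whole 10 ns clocks.
clocked_ns = [30, 10, 20, 10]
speedup_clocked = sum(clocked_ns) / max(clocked_ns)  # 70/30 = 2.33...

print(round(speedup_stone, 2), round(speedup_clocked, 2))  # 2.3 2.33
```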
Question 5. Draw the reduced state-diagram and show the maximum-rate cycle using the following collision vector: 1 0 0 0 1 1
States are collision vectors; each arc is labeled with a permissible latency, and a latency of 7 or more returns any state to the initial state:

   100011 (initial) --2--> 101111
   100011 --3--> 111011
   100011 --4--> 110011
   101111 --2--> 111111
   111011 --4--> 110011
   110011 --3--> 111011
   110011 --4--> 110011 (self-loop)
   (from every state, latency >= 7 returns to 100011)
The maximum-rate cycle is the sequence 3, 4, 3, 4, 3, ... giving two operations initiated every seven cycles, or 0.29 ops/cycle. The greedy cycle is 2, 2, 7, 2, 2, 7, 2, 2, ... giving three operations initiated every 11 cycles, or 0.27 ops/cycle. This is a case where the greedy cycle is not the optimum.
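The state transitions and cycle rates can be verified with a small simulation. The shift-and-OR rule is the standard collision-vector update; the tuple representation (index = latency, 1 = forbidden) is my own encoding:

```python
# Initial collision vector 1 0 0 0 1 1: latencies 1, 5, 6 are forbidden.
INIT = (1, 0, 0, 0, 1, 1)
N = len(INIT)

def step(state, k):
    """Shift the state by latency k and OR in the initial collision vector."""
    shifted = (state[k:] + (0,) * k)[:N]
    return tuple(a | b for a, b in zip(shifted, INIT))

def cycle_rate(latencies):
    """Initiations per clock for a repeating latency cycle, checking that
    each latency is actually permitted in the state where it is used."""
    state = INIT
    for k in latencies:
        if k <= N and state[k - 1]:
            raise ValueError(f"latency {k} collides")
        state = step(state, k) if k <= N else INIT  # k > N resets the pipe
    return len(latencies) / sum(latencies)

print(round(cycle_rate([3, 4]), 2))     # max-rate cycle: 2/7  -> 0.29
print(round(cycle_rate([2, 2, 7]), 2))  # greedy cycle:  3/11 -> 0.27
```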
Question 6. We have a RISC processor with register-register arithmetic instructions that have the format R1 R2 op R3. The pipeline for these instructions runs with a 100 MHz clock with the following stages: instruction fetch = 2 clocks, instruction decode = 1 clock, fetch operands = 1 clock, execute = 2 clocks, and store result = 1 clock. a) At what rate (in MIPS) can we execute register-register instructions that have no data dependencies with other instructions? b) At what rate can we execute the instructions when every instruction depends on the results of the previous instruction? c) We implement internal forwarding. At what rate can we now execute the instructions when every instruction depends on the results of the previous instruction?
Reservation table for (a), with a new instruction initiated every 2 clocks (instruction numbers shown in each stage):

  Clock  IF  ID  OF  EX  OS
    1     1
    2     1
    3     2   1
    4     2       1
    5     3   2       1
    6     3       2   1
    7     4   3       2   1
    8     4       3   2
    9     5   4       3   2
   10     5       4   3
   11     6   5       4   3
   12     6       5   4

a) No dependencies: rate = 1 inst/2 cycles = 50 MIPS. The 2-clock Instruction Fetch and Execute stages limit us to initiating one new instruction every 2 clocks; at a 100 MHz clock, that is 50 MIPS.
Reservation table for (b); Operand Fetch for each instruction must wait until the previous instruction's Operand Store completes, and the instructions behind it hold in earlier stages as the pipeline backs up:

         IF    ID   OF              EX     OS
  #1:    1-2   3    4               5-6    7
  #2:    3-4   5    wait 6-7, 8     9-10   11
  #3:    5-6   7    wait, 12        13-14  15
  #4:    7-8        wait, 16        17-18  19
b) Dependencies rate = 1 inst/4 cycles = 25 MIPS. The reservation table shows that, although we begin fetching instructions every two cycles, the Operand Fetch unit must wait until the prior instruction stores its result before it can retrieve one of its operands (e.g. Op Fetch for #2 must wait until Op Store for #1 completes). As a result, things begin backing up in the pipeline, and we produce one instruction output only every 4 cycles.
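The 4-cycle recurrence can be made explicit in a minimal sketch (stage timings as in the reservation table):

```python
def store_clock(n):
    """Clock at which instruction n (1-based) completes Operand Store,
    when every instruction needs the previous instruction's result."""
    if n == 1:
        return 7  # IF 1-2, ID 3, OF 4, EX 5-6, OS 7
    # Operand Fetch must wait for the previous store; then OF (1 clock),
    # EX (2 clocks), and OS (1 clock) take four more clocks.
    return store_clock(n - 1) + 4

print([store_clock(n) for n in range(1, 5)])  # [7, 11, 15, 19]
```

One result every 4 cycles at 100 MHz gives the 25 MIPS figure above.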
Reservation table for (c); "fwd" marks an operand obtained by internal forwarding from the previous instruction's Execute result:

         IF    ID   OF      EX     OS
  #1:    1-2   3    4       5-6    7
  #2:    3-4   5    6 fwd   7-8    9
  #3:    5-6   7    8 fwd   9-10   11
c) Dependencies with internal forwarding rate = 1 inst/2 cycles = 50 MIPS. If we implement internal forwarding, the operand fetch unit can bypass fetching the dependent operand and just rename the dependent operand input register to be the result of instruction 1. The result is available in time for the next calculation; we just have to point one of the inputs for instruction 2 execution to the internal register that receives the output of instruction 1 in order to get it. We can then proceed without waiting.
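With forwarding, the recurrence shortens to the 2-clock Execute stage (a sketch under the same stage timings):

```python
def exec_done(n):
    """Clock at which instruction n finishes Execute when its dependent
    operand is forwarded from the previous instruction's Execute output."""
    if n == 1:
        return 6  # IF 1-2, ID 3, OF 4, EX 5-6
    # EX of instruction n can start as soon as EX of n-1 ends.
    return exec_done(n - 1) + 2

print([exec_done(n) for n in range(1, 5)])  # [6, 8, 10, 12]
```

One instruction every 2 cycles at 100 MHz gives the 50 MIPS figure above.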
Question 7. Conditional branches are a problem with instruction pipelines. For the RISC processor described in question 6, we decide to implement branches by always assuming the branch will not be taken rather than implementing some form of branch prediction or speculative execution, and we do not implement internal forwarding. We don't know that the instruction is a branch until stage 2 (decode), we don't know the condition code setting (for instructions that set the condition code) until stage 5 (operand store) is complete, and we can't provide the target address (of a branch taken) to stage 1 until the end of stage 5. Assume a sequence of instructions where the condition code setting instruction immediately precedes the conditional branch. a) What penalty in lost cycles do we incur for the branch not taken? b) What penalty in lost cycles do we incur for the branch taken? c) We implement delayed branching and the conditional branch is a delayed conditional branch. What penalty in lost cycles do we incur for the delayed branch taken? d) We implement internal forwarding along with the delayed branch. What penalty in lost cycles do we incur for the delayed branch taken with internal forwarding?
Reservation table for (a); the branch (BR) depends on the condition-code-setting instruction (CC), and NSI is the next sequential instruction:

         IF     ID   OF              EX     OS
  CC:    1-2    3    4               5-6    7
  BR:    3-4    5    wait 6-7, 8     9-10   11
  NSI:   5-6    7    (holds behind BR)
  2SI:   7-8
  3SI:   9-10
a) We have a data dependency between the CC instruction and the branch instruction. The operand fetch unit must wait 2 cycles until the CC is stored by the operand store unit before fetching it for use by the branch instruction. The penalty depends on how we implement the pipeline. If we propagate the operand fetch unit's 2-cycle delay back up the pipeline, we introduce a two-cycle delay even when the branch is not taken. Penalty of 2 cycles for a branch not taken.
However, we have two cycles of buffering in the pipeline: one cycle in the instruction decode unit (it waits every other cycle) and one cycle in the operand fetch unit (it also waits every other cycle). If these units can each hold onto their results for a cycle until the next stage is available, as shown in the reservation table, we take no penalty for this particular instruction pair. However, we will end up taking the 2-cycle penalty when the next two (and every succeeding) instruction pairs with data dependencies come along. Penalty of 2 cycles for a branch not taken.
Reservation table for (b), branch taken; the branch outcome is known at the end of clock 10, NSI, 2SI, and 3SI are squashed, and the branch-target instruction (BT) cannot be fetched until clock 12:

         IF      ID   OF              EX     OS
  CC:    1-2     3    4               5-6    7
  BR:    3-4     5    wait 6-7, 8     9-10   11 (outcome known at 10)
  NSI:   5-6     7    squashed
  2SI:   7-8          squashed
  3SI:   9-10         squashed
  BT:    12-13   ...
b) The operand fetch unit must still wait 2 cycles until the CC is available from the operand store unit. By the time we know the outcome of the branch instruction (clock 10), we have fetched the next three sequential instructions. We must stop the execution of NSI, dump the 2SI and 3SI instructions, and stop the instruction fetch unit from fetching the fourth sequential instruction after the branch, for 6 wasted clocks. Since the instruction fetch unit can't get the new program counter address for the branch target (BT) instruction until clock 11, it can't begin fetching the instruction at the target address until clock 12, so we make it wait one more cycle. The total penalty for the branch taken is 7 cycles. If you assumed that the instruction fetch unit could not be stopped after fetching 3SI and proceeded to fetch 4SI, the total penalty is 8 cycles.
Reservation table for (c), delayed branch taken; NSI (the delay-slot instruction) runs to completion, while 2SI and 3SI are squashed:

         IF      ID   OF              EX     OS
  CC:    1-2     3    4               5-6    7
  BR:    3-4     5    wait 6-7, 8     9-10   11
  NSI:   5-6     7    (completes behind BR)
  2SI:   7-8          squashed
  3SI:   9-10         squashed
  BT:    12-13   ...
c) The difference here is that we do not need to stop the execution of NSI on a delayed branch. It can continue to completion, but we still need to dump the 2SI and 3SI instructions, and stop the instruction fetch unit from fetching the fourth sequential instruction after the branch. The instruction fetch unit still can't get the new program counter address until clock 11, and it can't begin fetching the instruction at the target address until clock 12, so the total penalty for the branch taken is only 5 cycles: the four we lost by fetching the 2SI and 3SI instructions, and the one it had to wait before proceeding with the new PC. Again, if you assumed that 4SI was fetched, it would be 6 cycles.
Reservation table for (d); the condition code is forwarded from CC Execute (done at clock 6) to BR Execute (clock 7), and the branch-target address is forwarded from the BR Execute output (clock 8) to Instruction Fetch:

         IF      ID   OF    EX     OS
  CC:    1-2     3    4     5-6    7
  BR:    3-4     5    6     7-8    9
  NSI:   5-6     7    8     9-10   11 (delay slot, completes)
  2SI:   7-8          squashed
  BT:    9-10    11   12    13-14  15
  2T:    11-12   13   ...
  3T:    13-14
d) Internal forwarding allows us to forward the condition code result directly from the CC Execute stage (clock 6) to the branch Execute stage (clock 7), so we don't delay the branch. We can also forward the Branch Target address directly from the output of the Branch Execute stage (clock 8) to the Instruction Fetch unit so we don't lose the branch Operand Store cycle in clock 9. We still need to dump the 2SI instruction that we pre-fetched, so the total penalty for the delayed branch taken with internal forwarding is only 2 cycles.
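Collecting the four penalties derived above (assuming, as in each part, that the fetch of 4SI is suppressed):

```python
# Branch penalties in lost cycles for the question 7 pipeline.
branch_penalty_cycles = {
    "not taken": 2,                   # a) data-dependency stall on the CC
    "taken": 7,                       # b) squash NSI/2SI/3SI + 1 wait cycle
    "delayed, taken": 5,              # c) NSI completes in the delay slot
    "delayed, taken, forwarding": 2,  # d) only the 2SI fetch is lost
}
print(branch_penalty_cycles)
```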
Question 8. What is a greedy cycle? The greedy cycle arises from initiating a new instruction into the pipeline at the first opportunity in each state. The greedy cycle is also the maximum-rate cycle in many cases, but not necessarily.
Question 9. Why would you implement a branch history table in a pipelined computer? A branch history table gives you a better guess than random on whether or not a conditional branch will be taken. The assumption is that recent history is a good predictor of the near future, the same idea that the LRU cache replacement algorithm is based on. If we have a long instruction pipeline, a good guess will reduce the number of times we have to discard instructions that we prefetch and start into the pipeline following a conditional branch.
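One common realization of a branch history table (the text does not specify a scheme, so this is an assumption) is a table of 2-bit saturating counters indexed by branch address; a counter value of 2 or 3 predicts taken:

```python
class BranchHistoryTable:
    """Branch history table of 2-bit saturating counters."""

    def __init__(self, size=1024):
        self.size = size
        self.counters = [1] * size  # start in "weakly not taken"

    def _index(self, pc):
        # Hypothetical indexing: low-order bits of the branch address.
        return pc % self.size

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2  # True = predict taken

    def update(self, pc, taken):
        # Saturate at 0 and 3 so one anomalous outcome flips the
        # prediction only after two misses in the same direction.
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bht = BranchHistoryTable()
for outcome in [True, True, True, False, True]:  # a mostly-taken branch
    bht.update(0x40, outcome)
print(bht.predict(0x40))  # True
```

This captures the "recent history predicts the near future" idea: a loop-closing branch that is taken hundreds of times in a row is mispredicted only on loop exit.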
Question 10. What do we mean when we say a computer is superscalar? A superscalar computer executes more than one instruction per clock tick. This is achieved by having more than one pipeline and allowing instructions without dependencies on one another to proceed in parallel through the separate pipelines.
Question 11. What problem is speculative execution trying to solve? Speculative execution is another strategy used to reduce the effects of conditional branches. Rather than guessing which way a branch will go and fetching instructions only along one path, we proceed to fetch, decode, and begin execution of instructions along both paths. Results from both instruction streams are tentative until we know which way the branch goes. When the outcome of the branch is known, the tentative results from the path not taken are discarded and the results from the path taken are made permanent.