Chapter 6 Pipelining
Basic concepts
Speed of execution of programs can be improved in two ways:
- Use faster circuit technology to build the processor and the memory.
- Arrange the hardware so that a number of operations can be performed simultaneously. The number of operations performed per second is increased, although the elapsed time needed to perform any one operation is not changed.
Pipelining is an effective way of organizing concurrent activity in a computer system to improve the speed of execution of programs.
What if the execution of one instruction is overlapped with the fetching of the next one?
- In clock cycle 3, the fetch unit fetches instruction I3 (step F3) while the execution unit executes instruction I2 (step E2).
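Stages of the four-stage pipeline (interstage buffers B1, B2, and B3 pass information from one stage to the next):
F: Fetch instruction
D: Decode instruction and fetch operands
E: Execute operation
W: Write results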
Clock cycle 1: F1
Clock cycle 2: D1, F2
Clock cycle 3: E1, D2, F3
Clock cycle 4: W1, E2, D3, F4
Clock cycle 5: W2, E3, D4
Clock cycle 6: W3, E4
Clock cycle 7: W4
Clock cycle   1    2    3    4    5    6    7
I1            F1   D1   E1   W1
I2                 F2   D2   E2   W2
I3                      F3   D3   E3   W3
I4                           F4   D4   E4   W4
During clock cycle 4:
- Buffer B1 holds instruction I3, which is being decoded by the instruction-decoding unit. Instruction I3 was fetched in cycle 3.
- Buffer B2 holds the source and destination operands for instruction I2. It also holds the information needed for the Write step (W2) of instruction I2. This information will be passed to stage W in the following clock cycle.
- Buffer B3 holds the results produced by the execution unit and the destination information for instruction I1.
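The overlap of the four stages can be reproduced with a short sketch. It assumes an ideal pipeline in which every stage takes exactly one clock cycle and nothing stalls; the function name `pipeline_schedule` is illustrative and does not come from the slides.

```python
STAGES = ["F", "D", "E", "W"]  # fetch, decode, execute, write


def pipeline_schedule(num_instructions, stages=STAGES):
    """Map each clock cycle to the steps active in it for an ideal pipeline.

    Instruction i (1-based) occupies stage s (0-based) in cycle i + s,
    because one new instruction is fetched per cycle and nothing stalls.
    """
    schedule = {}
    for i in range(1, num_instructions + 1):
        for s, stage in enumerate(stages):
            schedule.setdefault(i + s, []).append(f"{stage}{i}")
    return schedule


if __name__ == "__main__":
    for cycle, steps in sorted(pipeline_schedule(4).items()):
        print(f"Clock cycle {cycle}: {', '.join(steps)}")
```

For four instructions the sketch needs seven cycles in total, and cycle 4 is the only one in which all four stages are busy.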
Figure 8.3. Effect of an execution operation taking more than one clock cycle.
Many processors have separate data and instruction caches to avoid this delay. In general, structural hazards can be avoided by providing sufficient resources on the processor chip.
The memory address X+[R1] is computed in step E2 in cycle 4, the memory access takes place in cycle 5, and the operand read from the memory is written into register R2 in cycle 6. Execution of instruction I2 therefore takes two clock cycles, 4 and 5. In cycle 6, both instructions I2 and I3 require access to the register file. The pipeline is stalled because the register file cannot handle two operations at once.
Data hazards
A data hazard is a situation in which the pipeline is stalled because the data to be operated on are delayed. Consider two instructions:
I1: A = 3 + A
I2: B = 4 x A
If A = 5, and I1 and I2 are executed sequentially, B = 32. In a pipelined processor, the execution of I2 can begin before the execution of I1 is completed. The value of A used in the execution of I2 will then be the original value of 5, leading to an incorrect result (B = 20). Thus, instructions I1 and I2 are dependent, because the data used by I2 depend on the result generated by I1. Results obtained from pipelined execution should be the same as the results obtained using sequential execution of instructions. When two instructions depend on each other, they must be performed in the correct order.
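The arithmetic of this example can be checked with a small sketch. The two function names are hypothetical; the second one simply models I2 reading A before I1 has written its result, which is the situation a pipeline must prevent.

```python
def run_sequential(a):
    """Execute I1 and then I2 strictly in order."""
    a = 3 + a          # I1: A = 3 + A
    b = 4 * a          # I2: B = 4 x A, reads the updated A
    return a, b


def run_with_stale_operand(a):
    """Model I2 reading A before I1 has written its result."""
    stale_a = a        # I2 fetches its operand too early
    a = 3 + a          # I1 completes afterwards
    b = 4 * stale_a    # I2 computes with the old value of A
    return a, b


if __name__ == "__main__":
    print(run_sequential(5))          # correct result
    print(run_with_stale_operand(5))  # incorrect result caused by the hazard
```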
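Figure. Pipeline stalled by a data dependency: the instruction Add R5,R4,R6 cannot complete its Decode step until the preceding instruction has written register R4.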
Mul instruction places the results of the multiply operation in register R4 at the end of clock cycle 4. Register R4 is used as a source operand in the Add instruction. Hence the Decode Unit decoding the Add instruction cannot proceed until the Write step of the first instruction is complete. Data dependency arises because the destination of one instruction is used as a source in the next instruction.
Operand forwarding
Data hazard occurs because the destination of one instruction is used as the source in the next instruction. Hence, instruction I2 has to wait for the data to be written in the register file by the Write stage at the end of step W1. However, these data are available at the output of the ALU once the Execute stage completes step E1. Delay can be reduced or even eliminated if the result of instruction I1 can be forwarded directly for use in step E2. This is called operand forwarding.
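The effect of forwarding on timing can be sketched as follows. The model is an assumption of this sketch rather than something stated in the slides: stages F, D, E and W each take one cycle, instructions stay in program order, source registers are read in the D step, and the function name `write_cycles` is illustrative.

```python
def write_cycles(program, forwarding):
    """Return the clock cycle of the Write step of each instruction.

    program: list of (name, sources, destination) tuples in program order.
    Without forwarding, Decode of a dependent instruction completes only
    after the producer's Write step. With forwarding, its Execute step may
    follow the producer's Execute step directly.
    """
    decode, execute, write = [], [], []
    for i, (_, sources, _) in enumerate(program):
        d = (i + 1) + 1                        # fetch in cycle i + 1, then decode
        if i > 0:
            d = max(d, decode[i - 1] + 1)      # decode stays in program order
        producers = [j for j in range(i) if program[j][2] in sources]
        if not forwarding:
            for j in producers:
                d = max(d, write[j] + 1)       # wait for the register write
        e = d + 1
        if i > 0:
            e = max(e, execute[i - 1] + 1)     # execute stays in program order
        if forwarding:
            for j in producers:
                e = max(e, execute[j] + 1)     # operand taken from the ALU output
        decode.append(d)
        execute.append(e)
        write.append(e + 1)
    return write


if __name__ == "__main__":
    prog = [("Mul", ("R2", "R3"), "R4"), ("Add", ("R5", "R4"), "R6")]
    print(write_cycles(prog, forwarding=False))
    print(write_cycles(prog, forwarding=True))
```

Under these assumptions the Add instruction finishes two cycles later without forwarding than with it.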
Detecting data dependencies and handling them can also be accomplished in software.
The compiler can introduce the necessary delay by inserting an appropriate number of NOP instructions. For example, if a two-cycle delay is needed between two instructions, then two NOP instructions can be inserted between them:
I1: Mul R2, R3, R4
    NOP
    NOP
I2: Add R5, R4, R6
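A compiler pass of this kind might look like the sketch below. It is a minimal illustration, not an actual compiler algorithm: the tuple representation of instructions and the name `insert_nops` are assumptions, and the required delay is taken to be the two cycles used in the example.

```python
NOP = ("NOP", (), None)  # no sources, no destination


def insert_nops(program, delay=2):
    """Insert NOPs so that a consumer follows its producer by > delay slots.

    program: list of (name, sources, destination) tuples in program order.
    """
    out = []
    for name, sources, dest in program:
        needed = 0
        # Inspect the last `delay` instructions already emitted.
        for back in range(1, delay + 1):
            if back > len(out):
                break
            prev_dest = out[-back][2]
            if prev_dest is not None and prev_dest in sources:
                needed = max(needed, delay - back + 1)
        out.extend([NOP] * needed)
        out.append((name, sources, dest))
    return out


if __name__ == "__main__":
    prog = [("Mul", ("R2", "R3"), "R4"), ("Add", ("R5", "R4"), "R6")]
    for name, sources, dest in insert_nops(prog):
        print(name, *sources, dest or "")
```

If an independent instruction already separates the producer from the consumer, the sketch inserts only one NOP.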
Side effects
Data dependencies are explicit and easy to detect if a register specified as the destination in one instruction is used as a source in the subsequent instruction. However, some instructions also modify registers that are not specified as the destination. For example, in the autoincrement and autodecrement addressing modes, the source register is modified as well. When a location other than the one explicitly specified in the instruction as a destination location is affected, the instruction is said to have a side effect. Another example of a side effect is the condition code flags, which implicitly record the results of the previous instruction; these results may be used in the subsequent instruction.
Instruction hazards
Instruction fetch units fetch instructions and supply the execution units with a steady stream of instructions. If the stream is interrupted then the pipeline stalls. Stream of instructions may be interrupted because of a cache miss or a branch instruction.
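Figure. An idle cycle caused by a branch instruction: I3 is fetched (F3) and then discarded (X); the branch target Ik is fetched (Fk) and executed (Ek), followed by Ik+1 (Fk+1, Ek+1).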
The pipeline stalls for one clock cycle. The time lost as a result of a branch instruction is called the branch penalty. Here the branch penalty is one clock cycle.
Figure. Branch timing in a four-stage pipeline: (a) two fetched instructions are discarded (X X) before the branch target Ik is fetched; (b) only one fetched instruction is discarded (X) before Ik is fetched.
The fetch unit fetches instructions before they are needed and stores them in a queue.
Dispatch unit takes instructions from the front of the queue and dispatches them to the Execution unit. Dispatch unit also decodes the instruction.
Figure. Branch timing in the presence of an instruction queue.
I5 is a branch instruction with target instruction Ik. Ik is fetched in cycle 7, and I6 is discarded. However, this does not stall the pipeline, since I4 is dispatched.
I2, I3, I4 and Ik are executed in successive clock cycles. The fetch unit computes the branch address concurrently with the execution of other instructions. This is called branch folding.
If we cannot place useful instructions in the branch delay slots, then we can fill these slots with NOP instructions.
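Figure. Instruction sequence for a loop (labels LOOP and NEXT; registers R1, R2, R3), shown in its original order and reordered to fill the branch delay slot.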
Branch prediction
To reduce the branch penalty associated with conditional branches, we can predict whether the branch will be taken. Simplest form of branch prediction:
Assume that the branch will not take place. Continue to fetch instructions in sequential execution order. Until the branch condition is evaluated, instruction execution along the predicted path must be done on a speculative basis.
Determine a priori whether a branch will be taken or not depending on the expected program behavior.
For example, a branch instruction at the end of the loop causes a branch to the start of the loop for every pass through the loop except the last one. Better performance can be achieved if this branch is always predicted as taken.
Let the initial state of the machine be LNT. When the branch instruction is executed, and if the branch is taken, the machine moves to state LT. If the branch is not taken, it remains in state LNT. When the same branch instruction is executed the next time, the branch is predicted as taken if the state of the machine is LT; otherwise it is predicted as not taken.
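The two-state scheme can be sketched as a small state machine. The text only describes the transitions out of state LNT; the sketch assumes the symmetric behaviour for state LT (stay in LT when the branch is taken, return to LNT when it is not). The class and function names are illustrative.

```python
class TwoStatePredictor:
    """Two-state dynamic branch predictor with states LNT and LT."""

    def __init__(self):
        self.state = "LNT"               # initial state, as in the text

    def predict(self):
        """Return True if the branch is predicted as taken."""
        return self.state == "LT"

    def update(self, taken):
        """Move to LT after a taken branch, to LNT otherwise (assumed for LT)."""
        self.state = "LT" if taken else "LNT"


def count_mispredictions(outcomes):
    """Count wrong predictions over a sequence of actual branch outcomes."""
    predictor = TwoStatePredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


if __name__ == "__main__":
    # A loop-closing branch: taken three times, then not taken, run twice.
    loop = [True, True, True, False] * 2
    print(count_mispredictions(loop))
```

With this scheme a loop-closing branch is mispredicted twice per execution of the loop: once on the first pass and once on the last.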
Figure. State-machine representation of the branch-prediction algorithms. Transitions are labeled BT (branch taken) and BNT (branch not taken).
ST: Strongly likely to be taken
LT: Likely to be taken
LNT: Likely not to be taken
SNT: Strongly likely not to be taken
Overview
Some instructions are much better suited to pipelined execution than others.
- Addressing modes
- Condition code flags
Addressing Modes
Addressing modes include simple ones and complex ones. In choosing the addressing modes to be implemented in a pipelined processor, we must consider the effect of each addressing mode on instruction flow in the pipeline:
- Side effects
- The extent to which complex addressing modes cause the pipeline to stall
- Whether a given mode is likely to be used by compilers
I2 (Load)   F2  D2  E2  M2  W2
I3          F3  D3  E3  W3
I4          F4  D4  E4
I5          F5  D5
Recall
Load X(R1), R2
Load (R1), R2
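Figure. Equivalent operations using complex and simple addressing modes.
(a) Complex addressing mode: a single Load computes X + [R1], then reads [X + [R1]], then reads [[X + [R1]]], before the next instruction can proceed.
(b) Simple addressing modes: Add computes X + [R1]; Load reads [X + [R1]]; Load reads [[X + [R1]]]; then the next instruction follows.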
In a pipelined processor, complex addressing modes do not necessarily lead to faster execution.
- Advantage: they reduce the number of instructions and the program space.
- Disadvantage: they cause the pipeline to stall, require more hardware to decode, and are not convenient for the compiler to work with.
- Conclusion: complex addressing modes are not suitable for pipelined execution.
Good addressing modes should have the following features:
- Access to an operand does not require more than one access to the memory.
- Only load and store instructions access memory operands.
- The addressing modes used do not have side effects.
Three basic modes have these features: register, register indirect, and index.
Conditional Codes
If an optimizing compiler attempts to reorder instructions to avoid stalling the pipeline when branches or data dependencies between successive instructions occur, it must ensure that reordering does not cause a change in the outcome of a computation. The dependency introduced by the condition-code flags reduces the flexibility available for the compiler to reorder instructions.
Figure 8.17. Instruction reordering.
Add       R1,R2
Compare   R3,R4
Branch=0  ...
Two conclusions:
- To provide flexibility in reordering instructions, the condition-code flags should be affected by as few instructions as possible.
- The compiler should be able to specify in which instructions of a program the condition codes are affected and in which they are not.
Original Design
Pipelined Design
- Separate instruction and data caches
- PC is connected to IMAR
- DMAR
- Separate MDR
- Buffers for ALU
- Instruction queue
- Instruction decoder output

- Reading an instruction from the instruction cache
- Incrementing the PC
- Decoding an instruction
- Reading from or writing into the data cache
- Reading the contents of up to two registers
- Writing into one register in the register file
- Performing an ALU operation
Superscalar operation
Pipelining enables multiple instructions to be executed concurrently by dividing the execution of an instruction into several stages. An alternative approach is to equip the processor with multiple processing units to handle several instructions in parallel in each stage. If a processor has multiple processing units, then several instructions can start execution in the same clock cycle.
Processor is said to use multiple issue.
Out-of-order execution
Instructions are dispatched in the same order as they appear in the program; however, they may complete execution out of order.
Dependencies among instructions need to be handled correctly, so that this does not lead to any problems.
What if during the execution of an instruction an exception occurs and one or more of the succeeding instructions have been executed to completion?
For example, the execution of instruction I1 may cause an exception after instruction I2 has completed execution and written its results to the destination location.
If a processor permits succeeding instructions to complete execution and write to the destination locations, before knowing whether the prior instructions cause exceptions, it is said to allow imprecise exceptions.
Clock cycle   1    2    3    4    5    6    7
I1 (Fadd)     F1   D1   E1A  E1B  E1C  W1
I2 (Add)      F2   D2   E2             W2
I3 (Fsub)          F3   D3   E3A  E3B  E3C  W3
I4 (Sub)           F4   D4             E4   W4
To guarantee a consistent state when exceptions occur, the results of execution must be written to the destination locations strictly in program order. Step W2 must be delayed until cycle 6, when I1 enters the write stage. The integer unit must retain the results of I2 until cycle 6, and cannot accept another instruction until then. If an exception occurs during an instruction, then all subsequent instructions that may have been partially executed are discarded. This is known as a precise exception.
Execution completion
It is beneficial to allow out-of-order execution, so that the execution unit is freed up to execute other instructions. However, instructions must be completed in program order to allow precise exceptions. These requirements are conflicting. It is possible to resolve the conflict by allowing the execution to proceed and writing the results into temporary registers. The contents of the temporary registers are transferred to permanent registers in correct program order.
When an instruction reaches the head of the queue and its execution has been completed:
- Results are transferred from temporary registers to permanent registers.
- All resources assigned to this instruction are released.
- The instruction is said to have retired.
Instructions are retired strictly in program order, though they may be completed out-of-order.
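The retirement rule can be illustrated with a minimal sketch. The class name `CommitQueue` and its methods are hypothetical; a real processor implements this in hardware with register renaming, which the sketch does not attempt to model.

```python
from collections import deque


class CommitQueue:
    """Instructions complete in any order but retire in program order."""

    def __init__(self):
        self.queue = deque()      # entries kept in dispatch (program) order
        self.registers = {}       # permanent registers

    def dispatch(self, name, dest):
        """Reserve a queue entry for an instruction and return it."""
        entry = {"name": name, "dest": dest, "value": None, "done": False}
        self.queue.append(entry)
        return entry

    def complete(self, entry, value):
        """Record a result in the entry, which acts as a temporary register."""
        entry["value"] = value
        entry["done"] = True

    def retire(self):
        """Retire completed instructions from the head of the queue only."""
        retired = []
        while self.queue and self.queue[0]["done"]:
            entry = self.queue.popleft()
            self.registers[entry["dest"]] = entry["value"]
            retired.append(entry["name"])
        return retired


if __name__ == "__main__":
    q = CommitQueue()
    i1 = q.dispatch("I1", "R1")
    i2 = q.dispatch("I2", "R2")
    q.complete(i2, 7)             # I2 finishes first
    print(q.retire())             # nothing retires: I1 is still at the head
    q.complete(i1, 3)
    print(q.retire())             # both retire, in program order
```

Because the permanent registers change only at retirement, an exception raised by I1 would leave no trace of I2's result in them.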
Dispatch Operation
When dispatch decisions are made, the dispatch unit must ensure that all the resources needed for the execution of an instruction are available, and it reserves those resources. What if instructions are dispatched out of order? Deadlock can occur.
Performance Considerations
Overview
The execution time T of a program that has a dynamic instruction count N is given by:
$$T = \frac{N \times S}{R}$$
where S is the average number of clock cycles it takes to fetch and execute one instruction, and R is the clock rate. Instruction throughput is defined as the number of instructions executed per second.
$$P_s = \frac{R}{S}$$
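As a worked example of these two relations, T = N x S / R and Ps = R / S, the sketch below evaluates them for one set of values. The numbers (one million instructions, two cycles per instruction, a 500 MHz clock) are purely illustrative and do not come from the slides; the function names are also assumptions.

```python
def execution_time(n, s, r):
    """T = N * S / R, in seconds.

    n: dynamic instruction count
    s: average clock cycles per instruction
    r: clock rate in cycles per second
    """
    return n * s / r


def throughput(r, s):
    """Ps = R / S, in instructions per second."""
    return r / s


if __name__ == "__main__":
    n, s, r = 1_000_000, 2, 500e6    # illustrative values only
    print(f"T  = {execution_time(n, s, r):.3f} s")
    print(f"Ps = {throughput(r, s):.0f} instructions/s")
```

Halving S, for instance through better pipelining, doubles the throughput and halves T for the same N and R.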
An n-stage pipeline has the potential to increase the throughput by n times. However, the only real measure of performance is the total execution time of a program. Higher instruction throughput will not necessarily lead to higher performance.
Two questions regarding pipelining:
- How much of this potential increase in instruction throughput can be realized in practice?
- What is a good value of n?