Chapter 6 Pipelining Summary Computer Organization
Chapter 6 Pipelining
• Pipelining as a means for improving performance by overlapping the execution of machine instructions
• Hazards that limit performance gains in pipelined processors and means for mitigating their effect
• Hardware and software implications of pipelining
• Influence of pipelining on instruction set design
• Superscalar processors
Although any one instruction takes five cycles to complete its execution, instructions are completed at a rate of one
per cycle because their execution is overlapped. The following hazards limit this performance gain:
• The value of a source operand of an instruction is not available when needed (Data hazard)
• Memory delays
• Branch instructions
• Resource limitations.
1. The control circuit must recognize the data dependency when it decodes the Subtract instruction in cycle 3, by
comparing its source register identifiers from interstage buffer B1 with the destination register identifier of the Add
instruction held in interstage buffer B2.
2. The Subtract instruction must be held in interstage buffer B1 during cycles 3 to 5.
3. As the Add instruction moves ahead during cycles 3 to 5, control signals can be set in interstage buffer B2 for an
implicit NOP (no-operation) instruction that does not modify the memory or the register file.
- Each NOP creates one clock cycle of idle time (a bubble) as it passes through the Compute,
Memory, and Write stages to the end of the pipeline.
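The dependency check in step 1 can be illustrated with a minimal Python sketch; the field names src_regs and dest_reg are illustrative assumptions, not part of the text.

def must_stall(b1, b2):
    # Stall when the instruction in B1 reads a register that the
    # instruction in B2 will write but has not yet written back.
    return b2["dest_reg"] is not None and b2["dest_reg"] in b1["src_regs"]

# Example: Subtract in B1 reads R2 and R5; Add in B2 writes R2 -> bubble.
print(must_stall({"src_regs": {2, 5}}, {"dest_reg": 2}))  # True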
Operand forwarding
Pipeline stalls due to data dependencies can be alleviated through the use of
operand forwarding. Rather than stall the Subtract instruction, the hardware
can forward the result of the Add instruction from register RZ to the point where it is needed.
• A new multiplexer, MuxA, is inserted before the input InA of the ALU,
and the existing multiplexer MuxB is expanded with another input. The
multiplexers select either a value read from the register file in the normal
manner, or the value available in the register RZ.
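A minimal sketch of the MuxA selection, assuming a single forwarding control signal (the names here are illustrative, not from the text):

def mux_a(reg_file_value, rz_value, forward):
    # Select the value read from the register file in the normal manner,
    # or the value available in RZ when forwarding is needed.
    return rz_value if forward else reg_file_value

# Forward is asserted when a source register of the current instruction
# matches the destination register of the preceding instruction.
print(mux_a(reg_file_value=10, rz_value=42, forward=True))  # 42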
There is an additional type of memory-related stall that occurs when there is a data dependency involving a Load
instruction:
• Assume that the data for the Load instruction is found in the cache, requiring only one cycle to access the operand.
• The destination register R2 of the Load instruction is a source register for the Subtract instruction that follows it.
• Operand forwarding cannot be done in the same manner as in 6.4, because the data read from the cache are not
available until they are loaded into register RY at the beginning of cycle 5.
• The Subtract instruction must be stalled for one cycle, to delay its use of the ALU.
The compiler can eliminate the one-cycle stall for this type of data dependency by reordering instructions to insert a
useful instruction between the Load instruction and the instruction that depends on the data read from the memory.
If a useful instruction cannot be found by the compiler, then the hardware introduces the one-cycle stall automatically.
If the processor hardware does not deal with dependencies, then the compiler must insert an explicit NOP.
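A small sketch of the fallback just described, assuming for illustration that instructions are encoded as (op, dest, sources) tuples:

def insert_load_nops(instrs):
    # Insert a NOP after each Load whose result the very next
    # instruction reads, mimicking the automatic one-cycle stall.
    out = []
    for i, (op, dest, srcs) in enumerate(instrs):
        out.append((op, dest, srcs))
        if op == "Load" and i + 1 < len(instrs) and dest in instrs[i + 1][2]:
            out.append(("NOP", None, ()))
    return out

prog = [("Load", 2, (3,)), ("Subtract", 9, (2, 4))]
print(insert_load_nops(prog))  # a NOP now separates Load and Subtract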
Unconditional Branches
Branch instructions occur frequently. With a two-cycle branch
penalty, the relatively high frequency of branch instructions
could increase the execution time for a program by as much
as 40 percent. Therefore, it is important to find ways to
mitigate this impact on performance.
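As a rough illustration (the 20 percent branch frequency is an assumed, typical figure): if branch instructions make up 20 percent of the dynamic instruction count, a two-cycle penalty adds 0.20 × 2 = 0.4 cycles per instruction, which is where a figure as high as 40 percent comes from.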
Conditional Branches
Example: Branch_if_[R5]=[R6] LOOP, which branches to LOOP if the contents of registers R5 and R6 are equal.
Moving the branch decision to the Decode stage ensures a common branch penalty of only one cycle for all branch
instructions.
The compiler attempts to find a suitable instruction to occupy the delay slot,
one that needs to be executed, even when the branch is taken. It can do so
by moving one of the instructions preceding the branch instruction to the
delay slot. This can only be done if any data dependencies involving the
instruction being moved are preserved.
If no useful instruction can be placed in the delay slot, a NOP must be placed
instead. Therefore, there will be a penalty of one cycle whether or not the
branch is taken.
Branching takes place one instruction later than where the branch instruction
appears in the instruction sequence → delayed branching.
The effectiveness of delayed branching depends on how often the compiler can reorder instructions to usefully fill the
delay slot.
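A small Python sketch of delayed-branch semantics; the program encoding is an illustrative assumption. The point is that the instruction in the delay slot executes whether or not the branch is taken.

def execute_trace(program, taken):
    # program: list of (label, is_branch, target) tuples.
    trace, pc = [], 0
    while pc < len(program):
        label, is_branch, target = program[pc]
        trace.append(label)
        if is_branch:
            trace.append(program[pc + 1][0])  # delay slot always executes
            pc = target if taken else pc + 2
        else:
            pc += 1
    return trace

prog = [("decrement", False, None),
        ("branch_if>0", True, 0),
        ("shift", False, None),   # moved into the delay slot by the compiler
        ("done", False, None)]
print(execute_trace(prog, taken=False))  # the shift still executes once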
Branch prediction
To reduce the penalty further, the processor needs to anticipate that an instruction being fetched is a branch
instruction and predict its outcome to determine which instruction should be fetched in cycle 2.
The simple approach is a form of static branch prediction. The same choice (assume not-taken) is used every time
a conditional branch is encountered. If branch outcomes were random, then half of all conditional branches would be
taken.
A backward branch at the end of a loop is taken most of the time. For such a branch, better accuracy can be achieved
by predicting that the branch is likely to be taken. Thus, instructions are fetched using the branch target address as
soon as it is known.
The processor can determine the static prediction of taken or not taken by checking the sign of the branch offset.
Alternatively, the machine encoding of a branch instruction may include one bit that indicates whether the branch
should be predicted as taken or not taken. The setting of this bit can be specified by the compiler.
The processor assumes that the next time the instruction is executed, the branch
decision is likely to be the same as the last time → two states: LT = branch is likely to
be taken, LNT = branch is likely not to be taken.
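This two-state scheme can be sketched directly in Python (a minimal illustration; a predictor with two state bits would add strongly-taken/strongly-not-taken states):

class OneBitPredictor:
    def __init__(self):
        self.state = "LNT"           # start out predicting not taken

    def predict(self):
        return self.state == "LT"    # True: fetch from the branch target

    def update(self, taken):
        # The next decision is assumed to be the same as the last one.
        self.state = "LT" if taken else "LNT"

p = OneBitPredictor()
print(p.predict())   # False (LNT)
p.update(True)       # the branch was actually taken
print(p.predict())   # True (LT): next time, predict taken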
The key to improving performance is to increase the likelihood that the instruction fetched in cycle 2 is the correct one.
This can be achieved only if branch prediction takes place in cycle 1, at the same time that the branch instruction is
being fetched. The processor needs to keep more information about the history of execution; this is stored in a
small, fast memory called the branch target buffer.
The branch target buffer identifies branch instructions by their addresses. As each branch instruction is executed,
the processor records the address of the instruction and the outcome of the branch decision in the buffer.
The information is organized in the form of a lookup table, in which each entry includes:
- The address of the branch instruction
- One or two state bits for the branch prediction algorithm
- The branch target address
Every time the processor fetches a new instruction, it checks the branch target buffer for an entry containing the same
address. If that address is found, it means that the instruction being fetched is a branch instruction. The processor is
then able to use the state bits to predict whether that branch is likely to be taken. At the same time, the target address
is also obtained.
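A dictionary-based sketch of this lookup, assuming 4-byte instructions (an illustrative assumption) and the three entry fields listed above:

btb = {}  # keyed by the address of the branch instruction

def fetch_lookup(pc):
    # Probe the buffer with the fetch address in cycle 1.
    entry = btb.get(pc)
    if entry is None:
        return pc + 4                 # not a known branch: fetch sequentially
    return entry["target"] if entry["state"] == "LT" else pc + 4

def record_outcome(pc, taken, target):
    # After the branch executes, record its address, state, and target.
    btb[pc] = {"state": "LT" if taken else "LNT", "target": target}

record_outcome(0x1000, True, 0x0F00)
print(hex(fetch_lookup(0x1000)))  # 0xf00: predicted taken, target supplied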
Then in cycle 2, the processor uses the predicted outcome of the branch to fetch the next instruction. It must also
determine the actual branch decision and target address to determine whether the predicted values were correct. If
they are, execution continues without penalty. Otherwise, the instruction that has just been fetched is discarded and
the correct one is fetched in cycle 3.
The main value of the branch target buffer is that the state information needed for branch prediction and the target
address of a branch instruction are both obtained at the same time the branch instruction is being fetched
If two instructions need to access the same resource in the same clock cycle, one instruction must be stalled to allow
the other instruction to use the resource. This can be prevented by providing additional hardware. Such stalls can
occur in a computer that has a single cache that supports only one access per cycle. Using separate caches for
instructions and data allows the Fetch and Memory stages to proceed simultaneously without stalling.
• T = execution time
• N = number of instructions executed
• S = average number of clock cycles it takes to fetch and execute one instruction
• R = clock rate in cycles per second
These quantities are related by T = (N × S) / R, often referred to as the basic performance equation.
A useful performance indicator is the instruction throughput, which is the number of instructions executed per second.
For non-pipelined execution this is given by:
Pnp = R / S
• If there are no cache misses, S is equal to 5, since the processor uses 5 cycles per instruction, giving Pnp = R / 5.
Pipelining improves performance by overlapping the execution of successive instructions, which increases instruction
throughput even though an individual instruction is still executed in the same number of cycles. Thus, in the absence
of stalls, S is equal to 1, and the ideal throughput with pipelining is:
Pp = R
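For example, assuming a clock rate of R = 2 GHz (an illustrative figure, not from the text): the non-pipelined throughput is Pnp = 2 GHz / 5 = 0.4 billion instructions per second, while the ideal pipelined throughput is Pp = R = 2 billion instructions per second, five times higher.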
A five-stage pipeline can potentially increase the throughput by a factor of five. In general, an n-stage pipeline has the
potential to increase throughput n times. Thus, it would appear that the higher the value of n, the larger the
performance gain.
→ How much of this potential increase in instruction throughput can actually be realized in practice?
→ What is a good value for n?
Any time a pipeline is stalled, or instructions are discarded, the instruction throughput is reduced below its ideal value.
While ideal pipelined execution has S = 1, stalls due to such Load instructions have the effect of increasing S by an
amount δstall. For example, assume that Load instructions constitute 25% of the dynamic instruction count, and that
20% of these Load instructions are followed by a dependent instruction. A one-cycle stall is needed in such
cases. Hence, the increase over the ideal case of S = 1 is δstall = 0.25 × 0.20 × 1 = 0.05.
The compiler can improve performance by reducing the number of times that a Load instruction is immediately
followed by a dependent instruction. A stall is eliminated each time the compiler can safely move a nearby instruction
to a position between the Load instruction and the dependent instruction.
The sum of δstall and δbranch_penalty determines the increase in the number of cycles, the increase in time, and the
reduction in throughput.
When all factors are combined, S is increased from the ideal value of 1 to S = 1 + δstall + δbranch_penalty + δmiss.
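A short sketch of how these contributions combine; apart from δstall = 0.05 from the example above, the values are assumed purely for illustration.

def throughput(R, d_stall, d_branch_penalty, d_miss):
    # S grows from its ideal value of 1 by the three contributions.
    S = 1 + d_stall + d_branch_penalty + d_miss
    return R / S

# With the Load-use figure from above and assumed branch/miss penalties:
print(throughput(2e9, 0.05, 0.10, 0.25))  # ~1.43e9 instructions per second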
Another important factor is the inherent delay in the basic operations performed by the processor → the ALU delay.
Further reductions in the clock cycle time are possible if a pipelined ALU is used. Some recent processor
implementations have used twenty or more pipeline stages to aggressively reduce the cycle time.
A superscalar processor has a more elaborate fetch unit that fetches two
or more instructions per cycle before they are needed and places them in an
instruction queue.
The dispatch unit takes two or more instructions from the front of the
queue, decodes them, and sends them to the appropriate execution units.
At the end of the pipeline, another unit is responsible for writing results into
the register file. The register file must now have two input ports instead of
the single input port for the simple pipeline.
There is also the potential complication of two instructions completing at the same time with the same destination
register for their results.
If this cannot be avoided by dispatching the instructions, one of the instructions is stalled to ensure that results are
written into the destination register in the same order as in the original instruction sequence of the program.
As instructions complete execution in each unit, the register file allows two results to be written in the same cycle,
because the destination registers are different.
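A sketch of this completion check; the tuple encoding (program-order index, destination register) is an illustrative assumption.

def writeback(first, second):
    # first/second: (program_order_index, dest_reg) for two instructions
    # completing in the same cycle.
    if first[1] != second[1]:
        return [first[0], second[0]], None   # both results written this cycle
    older, newer = sorted((first, second))
    return [older[0]], newer[0]              # the newer instruction stalls

print(writeback((7, 3), (8, 5)))  # ([7, 8], None): different destinations
print(writeback((7, 3), (8, 3)))  # ([7], 8): same register, 8 waits a cycle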