

Chapter 6 Pipelining Summary Computer Organization

Systems Architecture (Vrije Universiteit Amsterdam)




Chapter 6 Pipelining
• Pipelining as a means for improving performance by overlapping the execution of machine instructions
• Hazards that limit performance gains in pipelined processors and means for mitigating their effect
• Hardware and software implications of pipelining
• Influence of pipelining on instruction set design
• Superscalar processors

6.1 Basic Concept


Pipelining is a particularly effective way of organizing concurrent activity in a computer system.
Rather than wait until each instruction is completed, instructions can be fetched and executed in a pipelined manner.
Ideally, this overlapping pattern of execution would be possible for all instructions.

Although any one instruction takes five cycles to complete its execution, instructions are completed at a rate of one
per cycle.
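
To see where the one-per-cycle rate comes from, count cycles: the first instruction needs all five stages, and each later instruction finishes one cycle after its predecessor. A minimal Python sketch (the function names are illustrative, not from the text):

# Cycle counts for N instructions on a k-stage pipeline vs. purely
# sequential execution.
def pipelined_cycles(n, stages=5):
    # The first instruction takes 'stages' cycles; each subsequent
    # instruction completes one cycle later.
    return stages + (n - 1)

def sequential_cycles(n, stages=5):
    # Without overlap, every instruction occupies all stages by itself.
    return n * stages

print(pipelined_cycles(1000))   # 1004
print(sequential_cycles(1000))  # 5000 -> speedup approaches 5 for large N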

6.2 Pipeline Organization

• First, a new instruction is fetched from the address in the program counter (PC)


• As other instructions are fetched, execution proceeds through successive stages → at any given time, each stage of the pipeline is processing a different instruction.
• Information such as register addresses, immediate data, and the operations to be performed must be carried through the pipeline as each instruction proceeds from one stage to the next → this information is held in the interstage buffers (RA, RB, RM, RZ).

The interstage buffers are used as follows:


• B1 feeds the Decode stage with a newly fetched instruction.
• B2 feeds the Compute stage with:
- the two operands read from the register file
- the source/destination register identifiers
- the immediate value derived from the instruction
- the incremented PC value used as the return address
- the settings of control signals determined by the instruction decoder → these move through the pipeline to determine the ALU and memory operations, and a possible write into the register file.
• B3 holds the ALU result, which may be data to be written into the register file or an address that feeds the Memory stage → in the case of a write access to memory, buffer B3 also holds the data to be written. These data were read from the register file in the Decode stage. The buffer also holds the incremented PC value.
• B4 feeds the Write stage with a value to be written into the register file. This value may be the ALU result from the Compute stage, the result of the Memory access stage, or the incremented PC value.
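
The buffer contents listed above can be summarized as simple records. The following Python sketch is only illustrative; the field names are assumptions and do not correspond to exact signal names in the hardware:

from dataclasses import dataclass

@dataclass
class B1:                      # Fetch -> Decode
    instruction: int
    pc_incremented: int

@dataclass
class B2:                      # Decode -> Compute
    operand_a: int
    operand_b: int
    dest_register: int
    immediate: int
    pc_incremented: int
    control_signals: dict

@dataclass
class B3:                      # Compute -> Memory
    alu_result: int            # data to write back, or a memory address
    store_data: int            # value read in Decode, used by memory writes
    dest_register: int
    pc_incremented: int
    control_signals: dict

@dataclass
class B4:                      # Memory -> Write
    write_value: int           # ALU result, loaded data, or incremented PC
    dest_register: int
    control_signals: dict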


6.3 Pipelining Issues


There are times when it is not possible to have a new instruction enter the pipeline every cycle. In that case an
instruction must be stalled for a certain number of cycles. Any condition that causes the pipeline to stall is called a
hazard.

• The value of a source operand of an instruction is not available when needed (Data hazard)
• Memory delays
• Branch instructions
• Resource limitations.

6.4 Data Dependencies

• There is a data dependency between two instructions such as Add R2, R3, #100 followed by Subtract R9, R2, #30: register R2 carries data from the first instruction to the second.
• The Subtract instruction is stalled for three cycles to delay reading register R2 until cycle 6, when the new value becomes available.

1. The control circuit must recognize the data dependency when it decodes the Subtract instruction in cycle 3, by comparing its source register identifiers from interstage buffer B1 with the destination register identifier of the Add instruction held in interstage buffer B2.
2. The Subtract instruction must be held in interstage buffer B1 during cycles 3 to 5.
3. As the Add instruction moves ahead in cycles 3 to 5, control signals can be set in interstage buffer B2 for an implicit NOP (no-operation) instruction that does not modify the memory or the register file.
- Each NOP creates one clock cycle of idle time (a bubble) as it passes through the Compute, Memory, and Write stages to the end of the pipeline.
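
A rough Python sketch of the dependency check in step 1. Register identifiers are modeled as plain integers; a real decoder compares bit fields and also consults the write-enable control signal:

def must_stall(b1_src_registers, b2_dest_register, b2_writes_register):
    # Stall if the instruction ahead writes a register that the newly
    # decoded instruction needs as a source operand.
    return b2_writes_register and b2_dest_register in b1_src_registers

# Add R2, R3, #100 is in B2 (writes R2); Subtract R9, R2, #30 is in B1.
print(must_stall({2}, 2, True))   # True -> hold Subtract, insert a bubble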

Operand forwarding
Pipeline stalls due to data dependencies can be alleviated through the use of operand forwarding. Rather than stall the Subtract instruction, the hardware can forward the value from register RZ to where it is needed.

• A new multiplexer, MuxA, is inserted before input InA of the ALU, and the existing multiplexer MuxB is expanded with another input. The multiplexers select either a value read from the register file in the normal manner, or the value available in register RZ.

• Forwarding can also be extended to a result in register RY (Figure 5.8). This would handle a data dependency such as one involving R2 in the following sequence of instructions:

Add R2, R3, #100
Or R4, R5, R6
Subtract R9, R2, #30

• When the Subtract instruction is in the Compute stage of the pipeline, the Or instruction is in the Memory stage, and the Add instruction is in the Write stage. The new value of R2 generated by the Add instruction is now in RY. Forwarding this value from register RY to ALU input InA makes it possible to avoid stalling the pipeline. MuxA requires another input for the value of RY. Similarly, MuxB is extended with another input.
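
The selection logic for MuxA can be sketched as follows. This is a hedged illustration with made-up names: the RZ value is preferred over RY because it is the more recent result, and a real design would also check that the earlier instructions actually write a register:

def mux_a_select(src_reg, rz_dest, ry_dest, reg_value, rz_value, ry_value):
    if src_reg == rz_dest:      # newest result, one stage ahead (in RZ)
        return rz_value
    if src_reg == ry_dest:      # older result, two stages ahead (in RY)
        return ry_value
    return reg_value            # no dependency: normal register-file read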


Handling data dependencies in software

An alternative approach is to leave the task of detecting data dependencies and dealing with them to the compiler.

• When the compiler identifies a data dependency between two successive instructions Ij and Ij+1, it can insert three explicit NOP instructions between them.
• The NOPs introduce the necessary delay to enable Ij+1 to read the new value from the register file after it is written.
• Requiring the compiler to identify dependencies and insert NOP instructions simplifies the hardware implementation of the pipeline.
• The compiler can attempt to optimize the code to improve performance and reduce the code size by reordering instructions to move useful instructions into the NOP slots (see the sketch below). In doing so, the compiler must consider data dependencies between instructions, which constrain the extent to which the NOP slots can be usefully filled.
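
A minimal sketch of the compiler-side approach, assuming instructions are modeled as (destination, sources) pairs rather than a real compiler IR:

NOP = (None, set())

def insert_nops(instructions, gap=3):
    out = []
    for i, (dest, srcs) in enumerate(instructions):
        out.append((dest, srcs))
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if nxt is not None and dest is not None and dest in nxt[1]:
            out.extend([NOP] * gap)   # delay until the new value is written
    return out

# Add R2, R3, #100 then Subtract R9, R2, #30: Subtract reads R2 -> 3 NOPs.
print(insert_nops([("R2", {"R3"}), ("R9", {"R2"})]))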

6.5 Memory Delays


Delays arising from memory accesses are another cause of pipeline stalls. A cache miss occurs when a requested instruction or data operand is not found in the cache; the stalled instruction delays all subsequent instructions. A miss when fetching an instruction causes a similar delay.

There is an additional type of memory-related stall that occurs when there is a data dependency involving a Load instruction:
• Assume that the data for the Load instruction are found in the cache, requiring only one cycle to access the operand.
• The destination register R2 of the Load instruction is a source register for the Subtract instruction.
• Operand forwarding cannot be done in the same manner as in Section 6.4, because the data read from the cache are not available until they are loaded into register RY at the beginning of cycle 5.
• The Subtract instruction must be stalled for one cycle to delay the ALU operation.

The compiler can eliminate the one-cycle stall for this type of data dependency by reordering instructions to insert a
useful instruction between the Load instruction and the instruction that depends on the data read from the memory.
If a useful instruction cannot be found by the compiler, then the hardware introduces the one-cycle stall automatically.
If the processor hardware does not deal with dependencies, then the compiler must insert an explicit NOP.
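
The condition for this one-cycle stall can be sketched as a simple check (illustrative names; a real pipeline compares register identifiers held in the interstage buffers):

def load_use_stall(prev_is_load, prev_dest, curr_sources):
    # One bubble is needed when an instruction reads the register that
    # the immediately preceding Load is still fetching from memory.
    return prev_is_load and prev_dest in curr_sources

print(load_use_stall(True, "R2", {"R2"}))   # Load R2, then use R2 -> stall
print(load_use_stall(True, "R2", {"R4"}))   # next instruction independent -> no stall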

6.6 Branch Delays


In ideal pipelined execution a new instruction is fetched every cycle, while the preceding instruction is still being decoded. Branch instructions can alter the sequence of execution, but they must first be executed to determine whether and where to branch.


Unconditional Branches
Branch instructions occur frequently. With a two-cycle branch
penalty, the relatively high frequency of branch instructions
could increase the execution time for a program by as much
as 40 percent. Therefore, it is important to find ways to
mitigate this impact on performance.

Reducing the branch penalty requires the branch target address to be computed earlier in the pipeline. Rather than wait until the Compute stage, it is possible to determine the target address and update the PC in the Decode stage.

However, one instruction is still fetched incorrectly. The hardware in Figure 5.10 must be modified to implement this change. One adder is needed to increment the PC every cycle, and a second adder is needed in the Decode stage to compute a branch target address for every instruction.

When the instruction decoder determines that the instruction is indeed a branch instruction, the computed target address will be available before the end of the cycle. It can then be used to fetch the target instruction in the next cycle.

Conditional Branches
Branch_if_[R5] = [R6] LOOP

For pipelining, the branch condition must be tested as early as possible to limit the branch penalty. The comparator that tests the branch condition can also be moved to the Decode stage, enabling the conditional branch decision to be made at the same time that the target address is determined. In this case, the comparator uses the values from outputs A and B of the register file directly.

Moving the branch decision to the Decode stage ensures a common branch penalty of only one cycle for all branch
instructions.

The branch delay slot


The location that follows a branch instruction is called the branch delay slot.
Rather than conditionally discard the instruction in the delay slot, we can
arrange to have the pipeline always execute this instruction, whether or not
the branch is taken.

The compiler attempts to find a suitable instruction to occupy the delay slot: one that must be executed whether or not the branch is taken. It can do so by moving one of the instructions preceding the branch instruction to the delay slot. This can only be done if any data dependencies involving the instruction being moved are preserved.

If no useful instruction can be placed in the delay slot, a NOP must be placed
instead. Therefore, there will be a penalty of one cycle whether or not the
branch is taken.

Branching takes place one instruction later than where the branch instruction appears in the instruction sequence → delayed branching.

The effectiveness of delayed branching depends on how often the compiler can reorder instructions to usefully fill the
delay slot.

Branch prediction
To reduce the penalty further, the processor needs to anticipate that an instruction being fetched is a branch
instruction and predict its outcome to determine which instruction should be fetched in cycle 2.


Static branch prediction

The simplest approach is a form of static branch prediction: the same choice (e.g., assume not taken) is used every time a conditional branch is encountered. If branch outcomes were random, then half of all conditional branches would be taken, and every misprediction incurs the full branch penalty.

A backward branch at the end of a loop is taken most of the time. For such a branch, better accuracy can be achieved
by predicting that the branch is likely to be taken. Thus, instructions are fetched using the branch target address as
soon as it is known.

The processor can determine the static prediction of taken or not taken by checking the sign of the branch offset. Alternatively, the machine encoding of a branch instruction may include one bit that indicates whether the branch should be predicted as taken or not taken. The setting of this bit can be specified by the compiler.
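
A sketch of sign-based static prediction, under the stated rule that backward branches (negative offsets, typically loop-closing) are predicted taken:

def static_predict_taken(branch_offset):
    # Backward branch: likely the bottom of a loop, so predict taken.
    return branch_offset < 0

print(static_predict_taken(-16))  # True: backward branch
print(static_predict_taken(24))   # False: forward branch, predict not taken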

Dynamic branch prediction


To improve prediction accuracy further, we can use actual branch behavior to influence the prediction, resulting in dynamic branch prediction. The processor hardware assesses the likelihood of a given branch being taken by keeping track of branch decisions every time that branch instruction is executed.

The processor assumes that the next time the instruction is executed, the branch decision is likely to be the same as the last time → two states: LT = branch is likely to be taken, LNT = branch is likely not to be taken.

Better prediction accuracy can be achieved by keeping more information about execution history, using four states:
ST = strongly likely to be taken
LT = likely to be taken
LNT = likely not to be taken
SNT = strongly likely not to be taken
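
The four states map naturally onto a 2-bit saturating counter. A sketch, in which the numeric encoding and the initial state are assumptions:

# States: 0 = SNT, 1 = LNT, 2 = LT, 3 = ST.
class TwoBitPredictor:
    def __init__(self, state=1):             # start in LNT (assumed)
        self.state = state

    def predict_taken(self):
        return self.state >= 2                # predict taken in LT and ST

    def update(self, taken):
        # Saturating counter: move one step toward ST or SNT.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# A loop branch taken 9 times, then not taken once:
p = TwoBitPredictor()
correct = 0
for taken in [True] * 9 + [False]:
    correct += (p.predict_taken() == taken)
    p.update(taken)
print(correct)   # 8 of 10 on the first pass; 9 of 10 on later passes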

Branch target buffer for dynamic prediction

The branch target address and the branch decision can both be determined in the Decode stage of the pipeline, in cycle 2 of instruction execution. The instruction being fetched in the same cycle may or may not be the one that has to be executed after the branch instruction. It may have to be discarded, in which case the correct instruction will be fetched in cycle 3.

The key to improving performance is to increase the likelihood that the instruction fetched in cycle 2 is the correct one. This can be achieved only if branch prediction takes place in cycle 1, at the same time that the branch instruction is being fetched. The processor needs to keep more information about the history of execution; this information is stored in a small, fast memory called the branch target buffer.

The branch target buffer identifies branch instructions by their addresses. As each branch instruction is executed,
the processor records the address of the instruction and the outcome of the branch decision in the buffer.
The information is organized in the form of a lookup table, in which each entry includes:
- The address of the branch instruction
- One or two state bits for the branch prediction algorithm
- The branch target address

Every time the processor fetches a new instruction, it checks the branch target buffer for an entry containing the same
address. If that address is found, it means that the instruction being fetched is a branch instruction. The processor is
then able to use the state bits to predict whether that branch is likely to be taken. At the same time, the target address
is also obtained.

Then in cycle 2, the processor uses the predicted outcome of the branch to fetch the next instruction. It must also
determine the actual branch decision and target address to determine whether the predicted values were correct. If
they are, execution continues without penalty. Otherwise, the instruction that has just been fetched is discarded and
the correct one is fetched in cycle 3.

The main value of the branch target buffer is that the state information needed for branch prediction and the target address of a branch instruction are both obtained at the same time the branch instruction is being fetched.
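
A sketch of the lookup table in Python, combining the entry fields listed above with a 2-bit prediction state. The 4-byte instruction size and the dictionary representation are assumptions:

class BranchTargetBuffer:
    WORD = 4    # assumed instruction size, for the fall-through address

    def __init__(self):
        self.entries = {}   # branch address -> [state_bits, target_address]

    def predict(self, fetch_address):
        # Cycle 1: consult the table in parallel with the instruction fetch.
        entry = self.entries.get(fetch_address)
        if entry is None:
            return fetch_address + self.WORD     # not a known branch
        state, target = entry
        return target if state >= 2 else fetch_address + self.WORD

    def update(self, address, target, taken):
        # When the branch resolves: record the outcome and target address.
        state, _ = self.entries.get(address, [1, target])
        state = min(3, state + 1) if taken else max(0, state - 1)
        self.entries[address] = [state, target]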


6.7 Resource limitations


Pipelining enables overlapping execution of instructions, but the pipeline stalls when there are insufficient hardware
resources to permit all actions to proceed concurrently.

If two instructions need to access the same resource in the same clock cycle, one instruction must be stalled to allow the other instruction to use the resource. This can be prevented by providing additional hardware. Such stalls can occur in a computer that has a single cache that supports only one access per cycle. Using separate caches for instructions and data allows the Fetch and Memory stages to proceed simultaneously without stalling.

6.8 Performance Evaluation


For a non-pipelined processor, the execution time of a program that has a dynamic instruction count of N is given by:

T = (N × S) / R

• T = execution time
• S = average number of clock cycles it takes to fetch and execute one instruction
• R = clock rate in cycles per second
• This is often referred to as the basic performance equation.

A useful performance indicator is the instruction throughput, which is the number of instructions executed per second. For non-pipelined execution this is given by:

Pnp = R / S

• If there are no cache misses, S is equal to 5, since the processor needs five cycles to fetch and execute each instruction.

Pipelining improves performance by overlapping the execution of successive instructions, which increases instruction throughput even though an individual instruction is still executed in the same number of cycles. Thus, in the absence of stalls, S is equal to 1, and the ideal throughput with pipelining is:

Pp = R

A five-stage pipeline can potentially increase the throughput by a factor of five. In general, an n-stage pipeline has the potential to increase throughput n times. Thus, it would appear that the higher the value of n, the larger the performance gain. This raises two questions:
→ How much of this potential increase in instruction throughput can actually be realized in practice?
→ What is a good value for n?

Any time a pipeline is stalled, or instructions are discarded, the instruction throughput is reduced below its ideal value.

Effects of Stalls and Penalties


The five-stage pipeline involves memory-access operations in the Fetch and Memory stages, and ALU operations in
the Compute stage. The operations with the longest delay dictate the cycle time, and hence the clock rate R.

While ideal pipelined execution has S = 1, stalls due to such Load instructions have the effect of increasing S by an amount d_stall. For example, assume that Load instructions constitute 25% of the dynamic instruction count, and that 40% of these Load instructions are followed by a dependent instruction. A one-cycle stall is needed in such cases. Hence, the increase over the ideal case of S = 1 is:

d_stall = 0.25 × 0.40 × 1 = 0.10

The execution time T is increased by 10%, and throughput is reduced to:

Pp = R / 1.10 = 0.91 × R

The compiler can improve performance by reducing the number of times that a Load instruction is immediately
followed by a dependent instruction. A stall is eliminated each time the compiler can safely move a nearby instruction
to a position between the Load instruction and the dependent instruction.

Branch penalty example:


- Branches constitute 20% of the dynamic instruction count of a program.
- The average prediction accuracy is 90% → 10% of branches incur a one-cycle penalty, so d_branch_penalty = 0.20 × 0.10 × 1 = 0.02.

The sum of d_stall and d_branch_penalty determines the increase in the number of cycles per instruction, the increase in execution time, and the reduction in throughput.

When all factors are combined, S is increased from the ideal value of 1 to 1 + d_stall + d_branch_penalty + d_miss.
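
Putting the numbers from the two examples together (d_miss is set to zero here, since this summary gives no cache-miss statistics):

d_stall  = 0.25 * 0.40 * 1   # Load fraction x dependent fraction x 1 cycle = 0.10
d_branch = 0.20 * 0.10 * 1   # branch fraction x misprediction rate x 1 cycle = 0.02
d_miss   = 0.0               # assumed zero; no miss statistics are given here
S = 1 + d_stall + d_branch + d_miss
print(S)        # about 1.12 cycles per instruction
print(1 / S)    # throughput falls to about 0.89 of the ideal rate R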


Number of pipeline stages


The fact that an n-stage pipeline may increase instruction throughput by a factor of n suggests that we should use a
large number of stages.
However:
- As the number of pipeline stages increases, there are more instructions being executed concurrently
- There are more potential dependencies between instructions that may lead to pipeline stalls
- The branch penalty may be larger than one cycle if a longer pipeline moves the branch decision to a later stage

Another important factor is the inherent delay in the basic operations performed by the processor → the ALU delay. Further reductions in the clock cycle time are possible if a pipelined ALU is used. Some recent processor implementations have used twenty or more pipeline stages to aggressively reduce the cycle time.

6.9 Superscalar Operation


The maximum throughput of a pipelined processor is one instruction per clock cycle. A more aggressive approach is to equip the processor with multiple execution units, each of which may be pipelined, to increase the processor's ability to handle several instructions in parallel → several instructions start execution in the same clock cycle, but in different execution units. The processor is said to use multiple issue, and such processors are known as superscalar processors. Many modern high-performance processors use this approach.

A superscalar processor has a more elaborate fetch unit that fetches two or more instructions per cycle before they are needed and places them in an instruction queue.

The dispatch unit takes two or more instructions from the front of the
queue, decodes them, and sends them to the appropriate execution units.

At the end of the pipeline, another unit is responsible for writing results into
the register file. The register file must now have two input ports instead of
the single input port for the simple pipeline.

There is also the potential complication of two instructions completing at the same time with the same destination register for their results. If this cannot be avoided when the instructions are dispatched, one of the instructions is stalled to ensure that results are written into the destination register in the same order as in the original instruction sequence of the program.

• The fetch unit fetches two instructions every cycle.
• Instructions are decoded and their source registers are read in the next cycle.
• They are dispatched to the arithmetic and Load/Store units.
• Arithmetic operations can be initiated every cycle.
• Load and Store instructions can also be initiated every cycle.
- The two-stage pipeline overlaps the address calculation for one Load or Store instruction with the memory access for the preceding Load or Store instruction.

As instructions complete execution in each unit, the register file allows two results to be written in the same cycle because the destination registers are different.
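
A simplified sketch of a two-way dispatch step. It stalls the second instruction on a structural conflict (both need the same unit) or a same-destination conflict; real units complete at different times, so this same-cycle check is a simplification:

def dispatch(queue):
    issued, used_units, written = [], set(), set()
    while queue and len(issued) < 2:
        unit, dest = queue[0]          # instruction = (execution unit, dest reg)
        if unit in used_units or dest in written:
            break                      # conflict: stall the remaining instruction
        issued.append(queue.pop(0))
        used_units.add(unit)
        written.add(dest)
    return issued

# An arithmetic and a Load/Store instruction can issue together;
# two arithmetic instructions cannot (only one arithmetic unit).
print(dispatch([("alu", "R2"), ("mem", "R4")]))   # both issue
print(dispatch([("alu", "R2"), ("alu", "R5")]))   # second one stalls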

Branch and Data Dependencies

6.10 Pipelining in CISC Processors

