Pipe Lining
Intermediate Concepts
ENGR9861 Winter
2007 RV
Characteristics Of Pipelining
• If the stages of a pipeline are not balanced and one
stage is slower than another, the entire throughput of
the pipeline is affected.
• In terms of a pipeline within a CPU, each instruction
is broken up into different stages. Ideally, if the stages
are balanced (all stages are ready to start at the same
time and take an equal amount of time to execute), the
time taken per instruction (pipelined) is defined as:
Time per instruction (unpipelined) / Number of pipeline stages
Characteristics Of Pipelining
• The previous expression is ideal. We will see later that
there are many ways in which a pipeline cannot
function in a perfectly balanced fashion.
• In terms of a CPU, the implementation of pipelining
has the effect of reducing the average instruction time,
therefore reducing the average CPI.
• EX: If each instruction in a microprocessor takes 5
clock cycles (unpipelined) and we have a 4-stage
pipeline, the ideal average CPI with the pipeline will
be 5/4 = 1.25.
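The arithmetic in the example can be checked with a few lines of Python. This is only a sketch of the ideal case; the function name `ideal_pipelined_cpi` is made up for illustration and assumes perfectly balanced stages with no stalls.

```python
def ideal_pipelined_cpi(unpipelined_cpi: float, stages: int) -> float:
    """Ideal average CPI when the work is split evenly over `stages` stages."""
    if stages <= 0:
        raise ValueError("stages must be positive")
    return unpipelined_cpi / stages


# The example from the slide: 5 cycles per instruction, 4-stage pipeline.
print(ideal_pipelined_cpi(5, 4))  # 1.25
```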
RISC Instruction Set Basics
(from Hennessy and Patterson)
• Properties of RISC architectures:
– All ops on data apply to data in registers and typically
change the entire register (32-bits or 64-bits).
– The only ops that affect memory are load/store
operations. Memory to register, and register to memory.
– Load and store ops on data less than a full size of a
register (32, 16, 8 bits) are often available.
– Usually instructions are few in number (this can be
relative) and are typically one size.
RISC Instruction Set Basics
Types Of Instructions
• ALU Instructions:
Arithmetic operations, either take two registers as
operands or take one register and a sign extended
immediate value as an operand. The result is stored in a
third register.
Logical operations (AND, OR, XOR) do not usually
differentiate between 32-bit and 64-bit.
• Load/Store Instructions:
Usually take a register (base register) as an operand and a
16-bit immediate value. The sum of the two will create
the effective address. A second register acts as the
destination in the case of a load operation.
RISC Instruction Set Basics
Types Of Instructions (continued)
In the case of a store operation the second register
contains the data to be stored.
• Branches and Jumps
Conditional branches and jumps are transfers of control.
A conditional branch adds a sign-extended immediate
offset to the current program counter to form the
branch target.
• Appendix A has a more detailed description of the
RISC instruction set. Also the inside back cover has a
listing of a subset of the MIPS64 instruction set.
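The load/store effective address described earlier (base register plus sign-extended 16-bit immediate) can be sketched as follows. The helper names are hypothetical, and the 64-bit wrap-around is an assumption made to mimic a 64-bit address register.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # assumed 64-bit address width


def sign_extend_16(imm: int) -> int:
    """Interpret the low 16 bits of `imm` as a signed two's-complement value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def effective_address(base: int, imm16: int) -> int:
    """Effective address = base register + sign-extended 16-bit immediate."""
    return (base + sign_extend_16(imm16)) & MASK64


# An offset field of 0xFFFC encodes -4, so the address is 4 bytes below the base.
print(hex(effective_address(0x1000, 0xFFFC)))  # 0xffc
print(hex(effective_address(0x1000, 8)))       # 0x1008
```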
RISC Instruction Set Implementation
• We first need to look at how instructions in the MIPS64
instruction set are implemented without pipelining. We’ll
assume that any instruction of the subset of MIPS64 can be
executed in at most 5 clock cycles.
• The five clock cycles will be broken up into the following steps:
Instruction Fetch Cycle
Instruction Decode/Register Fetch Cycle
Execution/Effective Address Cycle
Memory Access Cycle
Write-Back Cycle
Instruction Fetch (IF) Cycle
• The value in the PC represents an address in memory.
The MIPS64 instructions are all 32-bits in length.
Figure 2.27 shows how the 32-bits (4 bytes) are
arranged depending on the instruction.
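• First we load the 4 bytes of the instruction from memory
into the CPU.
• Second we increment the PC by 4 because memory is
byte-addressed and each instruction is 4 bytes long. The
PC now points to the next sequential instruction. (Is this
always the next instruction executed? Not if a branch or
jump changes the PC.)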
Instruction Decode (ID)/Register Fetch
Cycle
• Decode the instruction and at the same time read in
the values of the registers involved. As the registers are
being read, do an equality test in case the instruction
decodes as a branch or jump.
• The offset field of the instruction is sign-extended
in case it is needed. The possible branch effective
address is computed by adding the sign-extended
offset to the incremented PC. The branch can be
completed at this stage if the equality test is true and
the instruction decoded as a branch.
Instruction Decode (ID)/Register Fetch
Cycle (continued)
• The instruction can be decoded in parallel with reading the
registers because the register specifiers are at fixed
locations in the instruction word.
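The fixed field positions are what make this parallel decode possible: the register numbers can be extracted before the opcode is understood. Below is a minimal sketch for the 32-bit MIPS I-type layout (opcode in bits 31..26, rs in 25..21, rt in 20..16, immediate in 15..0). The function names are hypothetical, and the opcode value 0x37 is used purely as an illustrative load opcode.

```python
def decode_fields(word: int) -> dict:
    """Split a 32-bit instruction word into its fixed-position I-type fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31..26
        "rs": (word >> 21) & 0x1F,      # bits 25..21, first register specifier
        "rt": (word >> 16) & 0x1F,      # bits 20..16, second register specifier
        "imm": word & 0xFFFF,           # bits 15..0, immediate / offset
    }


def encode_i_type(opcode: int, rs: int, rt: int, imm: int) -> int:
    """Pack the fields back into a word (helper used only to build test input)."""
    return ((opcode & 0x3F) << 26) | ((rs & 0x1F) << 21) | ((rt & 0x1F) << 16) | (imm & 0xFFFF)


# A load-style instruction with base register R2, destination R1, offset 0.
word = encode_i_type(0x37, rs=2, rt=1, imm=0)
fields = decode_fields(word)
print(hex(word), fields)
```

Because `rs` and `rt` never move, the register file read can start from those bit positions while the opcode is still being decoded.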
Execution (EX)/Effective Address Cycle
• If a branch or jump did not occur in the previous
cycle, the arithmetic logic unit (ALU) can execute the
instruction.
• At this point the instruction falls into one of three
types:
Memory Reference: ALU adds the base register and the
offset to form the effective address.
Register-Register: ALU performs the arithmetic or logical
operation specified by the opcode on two registers.
Register-Immediate: ALU performs the operation on
the register and the immediate value (sign-extended).
Memory Access (MEM) Cycle
• If a load, the effective address computed in the
previous cycle is used to read memory.
The actual data transfer to the register does not occur
until the next cycle.
• If a store, the data from the register is written to the
effective address in memory.
Write-Back (WB) Cycle
• Occurs with Register-Register ALU instructions or
load instructions.
• This is a simple operation: whether the instruction is a
register-register operation or a memory load, the
resulting data is written to the appropriate register.
Looking At The Big Picture
• Overall, the most time that a non-pipelined
instruction can take is 5 clock cycles. Below is a
summary:
Branch - 2 clock cycles
Store - 4 clock cycles
Other - 5 clock cycles
• EX: Assuming branch instructions account for 12% of
all instructions and stores account for 10%, what is the
average CPI of a non-pipelined CPU?
ANS: 0.12*2+0.10*4+0.78*5 = 4.54
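The same weighted average can be written as a short Python sketch. The function name `average_cpi` and the list-of-pairs layout are illustrative choices, not part of the course material.

```python
def average_cpi(mix):
    """Weighted average CPI for a list of (fraction_of_instructions, cycles) pairs."""
    total_fraction = sum(fraction for fraction, _ in mix)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    return sum(fraction * cycles for fraction, cycles in mix)


# Branches: 12% at 2 cycles, stores: 10% at 4 cycles, everything else: 78% at 5 cycles.
mix = [(0.12, 2), (0.10, 4), (0.78, 5)]
print(round(average_cpi(mix), 2))  # 4.54
```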
The Classical RISC 5 Stage Pipeline
Problems With The Previous Figure
Problems With The Previous Figure
(continued)
Pipeline Hazards
• The performance gain from using pipelining occurs
because we can start the execution of a new
instruction each clock cycle. In a real implementation
this is not always possible.
• Another important note is that in a pipelined
processor, a particular instruction still takes at least as
long to execute as it would without pipelining.
Pipelining improves throughput, not the latency of a
single instruction.
• Pipeline hazards prevent the next instruction from
executing during its designated clock cycle.
Types Of Hazards
• There are three types of hazards in a pipeline:
Structural Hazards: Created when the data path
hardware in the pipeline cannot support all of the
overlapped instructions in the pipeline.
Data Hazards: Occur when an instruction depends on the
result of an earlier instruction that is still in the
pipeline.
Control Hazards: Arise from the pipelining of branches
and other instructions that change the PC.
A Hazard Will Cause A Pipeline Stall
A Hazard Will Cause A Pipeline Stall
(continued)
$$\text{Speedup} = \frac{1}{1 + \text{Pipeline stalls per instruction}} \times \text{Pipeline depth}$$
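A small sketch of the speedup expression above, assuming an ideal CPI of 1 and balanced stages. The function name and the sample stall rates are illustrative only.

```python
def pipeline_speedup(depth: int, stalls_per_instruction: float) -> float:
    """Speedup over the unpipelined machine: depth / (1 + stalls per instruction)."""
    if stalls_per_instruction < 0:
        raise ValueError("stalls per instruction cannot be negative")
    return depth / (1.0 + stalls_per_instruction)


print(pipeline_speedup(5, 0.0))   # 5.0  (no stalls: speedup equals the depth)
print(pipeline_speedup(5, 0.25))  # 4.0  (one stall every four instructions)
```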
Dealing With Structural Hazards
Dealing With Structural Hazards (continued)
$$\text{Speedup} = \frac{1}{1 + 0.4 \times 1} \times \frac{1}{1/1.05} = 0.75$$
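The numbers in this expression can be reproduced with the sketch below. The interpretation of each number is an assumption, since the accompanying slide text did not survive: 0.4 is read as the fraction of instructions that hit the structural hazard, 1 as the stall cycles per hazard, and 1.05 as the clock-rate advantage of the design that has the hazard.

```python
def relative_speedup(hazard_fraction: float, stall_cycles: float, clock_ratio: float) -> float:
    """Speed of the machine with the hazard relative to the hazard-free machine.

    Assumes both machines have an ideal CPI of 1 apart from the hazard stalls.
    """
    cpi_with_hazard = 1.0 + hazard_fraction * stall_cycles
    return (1.0 / cpi_with_hazard) * clock_ratio


result = relative_speedup(hazard_fraction=0.4, stall_cycles=1, clock_ratio=1.05)
print(round(result, 2))  # 0.75
```

A value below 1 means the machine with the structural hazard is slower overall, even with its faster clock.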
Pipeline Registers
Data Hazard Avoidance
• In this trivial example, we cannot expect the programmer to
reorder his/her operations, assuming this is the only code we
want to execute.
• Data forwarding can be used to solve this problem.
• To implement data forwarding we need to bypass the pipeline
register flow:
– Output from the EX/MEM and MEM/WB pipeline registers must
be fed back into the ALU inputs.
– We need routing hardware that detects when the next instruction
depends on the write of a previous instruction.
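The detection logic in the second point can be sketched as a comparison of register names. This is a simplified model, not the actual hardware: the `Instr` tuple and `forward_source` function are hypothetical, the producing instructions are assumed to be ALU operations (so their results are valid in EX/MEM), and R0 is assumed to be hard-wired to zero.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def forward_source(consumer: Instr, ex_mem: Optional[Instr], mem_wb: Optional[Instr]) -> dict:
    """For each source of the instruction in EX, choose where the ALU input comes from."""
    choice = {}
    for reg in consumer.srcs:
        if ex_mem is not None and ex_mem.dest == reg and reg != "R0":
            choice[reg] = "EX/MEM"         # newest value, produced one cycle ago
        elif mem_wb is not None and mem_wb.dest == reg and reg != "R0":
            choice[reg] = "MEM/WB"         # value produced two cycles ago
        else:
            choice[reg] = "register file"  # no hazard on this operand
    return choice


dadd = Instr("DADD", "R1", ("R2", "R3"))
dsub = Instr("DSUB", "R4", ("R1", "R5"))
and_ = Instr("AND", "R6", ("R1", "R7"))

print(forward_source(dsub, ex_mem=dadd, mem_wb=None))
print(forward_source(and_, ex_mem=dsub, mem_wb=dadd))
```

The EX/MEM check comes first so that, when both pipeline registers hold a write to the same register, the more recent value wins.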
General Data Forwarding
• It is easy to see how data forwarding can be used by
drawing out the pipelined execution of each
instruction.
• Now consider the following instructions:
Problems
• Can data forwarding prevent all data hazards?
• NO!
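• The following operations will still cause a data hazard.
The loaded value is not available until the end of the
load's MEM cycle, but the DSUB needs it at the start of
its EX cycle, which is the same clock cycle. Forwarding
cannot send data backwards in time.
LD R1, 0(R2)
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9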
Problems
• We can avoid the hazard by using a pipeline interlock.
• The pipeline interlock will detect when data
forwarding will not be able to get the data to the next
instruction in time.
• A stall is introduced until the instruction can get the
appropriate data from the previous instruction.
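A minimal sketch of the interlock check: stall when the instruction in EX is a load and the instruction in ID reads the load's destination. The `Instr` tuple, the `LOADS` set, and the use of `None` to represent a bubble are all illustrative assumptions.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


LOADS = {"LD", "LW"}  # assumed set of load opcodes for this sketch


def must_stall(instr_in_ex: Instr, instr_in_id: Instr) -> bool:
    """Load-use hazard: the loaded value arrives too late to be forwarded."""
    return instr_in_ex.op in LOADS and instr_in_ex.dest in instr_in_id.srcs


def insert_stalls(program):
    """Return the issue sequence with None marking each one-cycle bubble."""
    issued = []
    for instr in program:
        if issued and issued[-1] is not None and must_stall(issued[-1], instr):
            issued.append(None)
        issued.append(instr)
    return issued


program = [
    Instr("LD", "R1", ("R2",)),
    Instr("DSUB", "R4", ("R1", "R5")),
    Instr("AND", "R6", ("R1", "R7")),
    Instr("OR", "R8", ("R1", "R9")),
]
schedule = insert_stalls(program)
print([i.op if i else "stall" for i in schedule])
# ['LD', 'stall', 'DSUB', 'AND', 'OR']
```

Only the DSUB needs the bubble; after the one-cycle stall, forwarding can supply R1 to it and to the later instructions.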
Control Hazards
• Control hazards are caused by branches in the code.
• During the IF stage remember that the PC is
incremented by 4 in preparation for the next IF cycle
of the next instruction.
• What happens if a branch is taken and we
aren’t simply incrementing the PC by 4?
• The easiest way to deal with the occurrence of a
branch is to perform the IF stage again once the
branch occurs.
Performing IF Twice
• We take a big performance hit by performing the
instruction fetch twice whenever a branch occurs. Note
that this happens whether or not the branch is taken. This
guarantees that the PC will get the correct value.
          IF  ID  EX  MEM WB
branch        IF  ID  EX  MEM WB
                  IF  IF  ID  EX  MEM WB
Performing IF Twice
• This method will work but as always in computer
architecture we should try to make the most common
operation fast and efficient.
• With MIPS64, branch instructions are quite common.
• By performing IF twice we will encounter a
performance hit of between 10% and 30%, depending on
how frequent branches are.
• Next class we will look at some methods for dealing
with Control Hazards.
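The 10% to 30% figure follows from charging one extra cycle per branch. Here is a sketch of that calculation, assuming an ideal CPI of 1 and a one-cycle penalty for the repeated IF; the function name and the sample branch frequencies are illustrative.

```python
def cpi_with_branch_stalls(branch_fraction: float, stall_cycles: int = 1, ideal_cpi: float = 1.0) -> float:
    """Average CPI when every branch costs `stall_cycles` extra cycles."""
    if not 0.0 <= branch_fraction <= 1.0:
        raise ValueError("branch_fraction must be between 0 and 1")
    return ideal_cpi + branch_fraction * stall_cycles


for fraction in (0.10, 0.20, 0.30):
    cpi = cpi_with_branch_stalls(fraction)
    slowdown_percent = (cpi - 1.0) * 100
    print(f"{fraction:.0%} branches -> CPI {cpi:.2f}, about {slowdown_percent:.0f}% slower")
```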
Control Hazards (other solutions)
How To Implement a Pipeline
Multi-clock Operations
• Sometimes operations require more than one clock
cycle to complete. Examples are:
Floating Point Multiply
Floating Point Divide
Floating Point Add
• We can assume that there is hardware available on the
processor for performing the operations.
• Assume that the FP Mul and Add are fully pipelined,
and the divide is un-pipelined.
Avoiding Structural Hazards
• The multiplier and the adder are fully pipelined. The
divider is not pipelined at all.
• Take a look at figure A.34 for a good example of how
pipelining will function in the case of longer
instruction execution. The author assumes a single
floating point register port.
• Structural hazards on the register write port are avoided
in the ID stage by reserving a bit in a shift register for
the cycle in which an issued instruction will write back.
Incoming instructions can then check the register to see
if they should stall.
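The shift-register idea can be sketched as a list of reservation bits, one per future cycle, that shifts by one position every clock. This is a simplified model with a single write port; the class name and the latencies (7 cycles for a multiply, 4 for an add) are assumptions for illustration.

```python
class WritePortReservation:
    """bits[k] is True when the write port is already claimed k cycles from now."""

    def __init__(self, max_latency: int):
        self.bits = [False] * (max_latency + 1)

    def try_issue(self, latency: int) -> bool:
        """Called in ID: reserve the port `latency` cycles ahead, or report a stall."""
        if not 0 <= latency < len(self.bits):
            raise ValueError("latency out of range")
        if self.bits[latency]:
            return False  # another instruction writes back in that cycle: stall
        self.bits[latency] = True
        return True

    def tick(self) -> None:
        """Advance one clock: every reservation moves one cycle closer."""
        self.bits = self.bits[1:] + [False]


port = WritePortReservation(max_latency=8)
print(port.try_issue(7))  # True: multiply reserves the port 7 cycles ahead
for _ in range(3):
    port.tick()
print(port.try_issue(4))  # False: an add issued now would collide with the multiply
port.tick()
print(port.try_issue(4))  # True: one cycle later the add no longer conflicts
```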
Instruction Level Parallelism (ILP)
Chapter 3
• The reason why we can implement pipelining in a
microprocessor is due to instruction level parallelism.
• Since operations can be overlapped in execution, they
exhibit ILP.
• Because basic blocks are short, ILP must mostly be
exploited across branches. A “basic block” is a block of
code with no branches in except at the start and no
branches out except at the end.
• In MIPS, an average basic block is 4-7 separate
instructions.
Dependences and Hazards
A high proportion of the instructions executed in a loop are loop management
instructions that update and test the induction variable (the next example should give
a clearer picture).
KEY IDEA: Eliminating this overhead could significantly increase the
performance of the loop:
Dependences and Hazards
for (i = 1000; i > 0; i--) {
    x[i] = x[i] + constant;
}
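One common way to reduce the loop management overhead is loop unrolling, which is named here as an illustration rather than taken from the slide. The Python sketch below mirrors the loop above and an unrolled-by-4 version, and counts how many times the decrement and test are executed. Index 0 of the list is left unused so the indices match the 1..1000 loop; the function names are hypothetical.

```python
def add_constant_rolled(x, constant):
    """Direct translation of the loop: one decrement and test per element."""
    loop_steps = 0
    i = len(x) - 1
    while i > 0:
        x[i] = x[i] + constant
        i -= 1
        loop_steps += 1
    return loop_steps


def add_constant_unrolled4(x, constant):
    """Unrolled by 4: one decrement and test per four elements."""
    n = len(x) - 1
    if n % 4 != 0:
        raise ValueError("this sketch assumes the trip count is a multiple of 4")
    loop_steps = 0
    i = n
    while i > 0:
        x[i] = x[i] + constant
        x[i - 1] = x[i - 1] + constant
        x[i - 2] = x[i - 2] + constant
        x[i - 3] = x[i - 3] + constant
        i -= 4
        loop_steps += 1
    return loop_steps


a = list(range(1001))
b = list(range(1001))
steps_rolled = add_constant_rolled(a, 5)
steps_unrolled = add_constant_unrolled4(b, 5)
print(steps_rolled, steps_unrolled, a == b)  # 1000 250 True
```

The results are identical, but the unrolled version executes the induction-variable update and branch test a quarter as often.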
Dependences and Hazards
• Data Dependence:
– Instruction j is data dependent on instruction i if i
produces a result that j may use, or if j is data dependent
on an instruction k that is in turn data dependent on i.
• Name Dependence:
– Occurs when two instructions use the same register or
memory location, but there is no flow of data between
the instructions. Instruction order must be preserved.
Antidependence: j writes to a location that i reads.
Output Dependence: i and j write to the same
location.
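These definitions can be turned into a small classifier that compares the read and write sets of two instructions, with i preceding j in program order. It is only a sketch: the `Instr` tuple is hypothetical, only direct dependences are checked (not the transitive case through a third instruction k), and the sample instructions are illustrative.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    text: str
    writes: frozenset
    reads: frozenset


def dependences(i: Instr, j: Instr) -> list:
    """Direct dependences of j on i, where i comes first in program order."""
    found = []
    if i.writes & j.reads:
        found.append("data dependence")    # j reads what i writes
    if i.reads & j.writes:
        found.append("antidependence")     # j writes what i reads
    if i.writes & j.writes:
        found.append("output dependence")  # both write the same location
    return found


div = Instr("DIV.D F0,F2,F4", frozenset({"F0"}), frozenset({"F2", "F4"}))
add = Instr("ADD.D F6,F0,F8", frozenset({"F6"}), frozenset({"F0", "F8"}))
sub = Instr("SUB.D F8,F10,F14", frozenset({"F8"}), frozenset({"F10", "F14"}))
mul = Instr("MUL.D F6,F10,F8", frozenset({"F6"}), frozenset({"F10", "F8"}))

print(dependences(div, add))  # ['data dependence']
print(dependences(add, sub))  # ['antidependence']
print(dependences(add, mul))  # ['output dependence']
```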
Dependences and Hazards
if p2 {
    S2;
}
Control Dependence
• Control Dependences have the following properties:
– An instruction that is control dependent on a branch
cannot be moved in front of the branch, so that the
branch no longer controls it.
– An instruction that is not control dependent on a branch
cannot be moved after the branch so that the branch
controls it.
Dynamic Scheduling
• The previous example that we looked at was an
example of a statically scheduled pipeline.
• Instructions are fetched and then issued. If the user's
code has a data dependence, it is hidden by forwarding
where possible.
• If the dependence cannot be hidden, a stall occurs.
• Dynamic Scheduling is an important technique in
which the hardware rearranges instruction execution to
reduce stalls, while both the dataflow and exception
behavior of the program are maintained.
Dynamic Scheduling (continued)
DIV.D F0,F2,F4
ADD.D F10,F0,F8
SUB.D F12,F8,F14
Still More Dynamic Scheduling
• Tomasulo’s Algorithm was invented by Robert
Tomasulo and was used in the IBM 360/91.
• The algorithm will avoid RAW hazards by executing
an instruction only when its operands are available.
WAR and WAW hazards are avoided by register
renaming.
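The effect of register renaming can be shown with a simplified sketch. This is not Tomasulo's algorithm itself (there are no reservation stations or common data bus here); it only gives every written value a fresh name, which is enough to show the WAR and WAW hazards disappearing. The function name, the `T0, T1, ...` names, and the instruction sequence are illustrative assumptions.

```python
def rename_registers(program):
    """Give each destination a fresh name and point later reads at the newest name.

    `program` is a list of (op, dest, src1, src2) tuples in program order.
    """
    latest = {}   # architectural register -> most recent fresh name
    renamed = []
    for count, (op, dest, src1, src2) in enumerate(program):
        new_src1 = latest.get(src1, src1)  # read the newest version, if any
        new_src2 = latest.get(src2, src2)
        new_dest = f"T{count}"             # fresh name for every write
        latest[dest] = new_dest
        renamed.append((op, new_dest, new_src1, new_src2))
    return renamed


program = [
    ("DIV.D", "F0", "F2", "F4"),
    ("ADD.D", "F6", "F0", "F8"),    # RAW on F0 with DIV.D
    ("SUB.D", "F8", "F10", "F14"),  # WAR on F8 with ADD.D
    ("MUL.D", "F6", "F10", "F8"),   # WAW on F6 with ADD.D, RAW on F8 with SUB.D
]
for instr in rename_registers(program):
    print(instr)
```

After renaming, SUB.D writes T2 instead of F8 and MUL.D writes T3 instead of F6, so only the true (RAW) dependences remain: ADD.D on T0 and MUL.D on T2.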
Branch Prediction In Hardware
2-bit Prediction Scheme
• This method is more reliable than using a single bit to
represent whether the branch was recently taken or
not.
• The use of a 2-bit predictor will allow branches that
favor taken (or not taken) to be mispredicted less often
than the one-bit case.
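The difference can be seen by running both schemes on a loop branch that is taken nine times and then falls through, repeated three times. This is a minimal sketch: the class names, the initial states, and the branch pattern are illustrative assumptions.

```python
class OneBitPredictor:
    """Predicts whatever the branch did last time."""

    def __init__(self, taken: bool = True):
        self.last = taken

    def predict(self) -> bool:
        return self.last

    def update(self, taken: bool) -> None:
        self.last = taken


class TwoBitPredictor:
    """Saturating counter: states 0-1 predict not taken, states 2-3 predict taken."""

    def __init__(self, state: int = 3):
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(predictor, outcomes) -> int:
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


outcomes = ([True] * 9 + [False]) * 3  # three runs of a 10-iteration loop
print(count_mispredictions(OneBitPredictor(), outcomes))  # 5
print(count_mispredictions(TwoBitPredictor(), outcomes))  # 3
```

The one-bit scheme mispredicts twice per loop run after the first (once on exit, once on re-entry), while the two-bit counter needs two wrong outcomes in a row before it changes its prediction, so it only misses the loop exit.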
Branch Predictors
• Increasing the size of a branch predictor's memory will
only increase its effectiveness so much.
• We also need to address the effectiveness of the
scheme used. Just increasing the number of bits in the
predictor doesn’t do very much either.
• Some other predictors include:
– Correlating Predictors
– Tournament Predictors
Branch Predictors
• Correlating predictors will use the history of a local
branch AND some overall information on how other
recent branches behaved to predict whether the branch
is taken or not.
• Tournament Predictors are even more sophisticated in
that they will use multiple predictors, local and global,
and choose between them with a selector to improve accuracy.