
Pipelining

Basic and Intermediate Concepts


What Is Pipelining?


Pipelining -- an implementation technique in which multiple instructions are overlapped in execution.

It takes advantage of the parallelism that exists among the actions needed to execute an instruction.

The throughput of an instruction pipeline is determined by how often an instruction exits the pipeline (instructions completed per clock cycle).

The time required between moving an instruction one step down the pipeline is a processor cycle.

In a computer, this processor cycle is usually 1 clock cycle (sometimes it is 2, rarely more).
The Basics of a RISC Instruction Set
Key Properties:

All operations on data apply to data in registers.

The only operations that affect memory are load and store operations, which move data from memory to a register or from a register to memory, respectively.

Load and store variants that move less than a full register (e.g., a byte, 16 bits, or 32 bits) are often available.

The instruction formats are few in number, with all instructions typically being one size.
RISC architectures have 3 classes of instructions:

ALU instructions
– take either two registers, or a register and a sign-extended immediate, as operands; operate on them; and store the result into a third register.
– operations include add (DADD), subtract (DSUB), and logical operations (such as AND or OR)
– immediate and unsigned forms exist


Load and store instructions
– take a base register and an immediate field (the offset) as operands.
– the effective address (the memory address) is the sum of the contents of the base register and the sign-extended offset.
– load instruction: a second register operand acts as the destination for the data loaded from memory.
– store instruction: the second register operand is the source of the data that is stored into memory.

Branches and jumps
– Branches are conditional transfers of control.
– The branch condition is specified either by a set of condition bits (a condition code) or by a limited set of comparisons between a pair of registers or between a register and zero.
– In all RISC architectures, the branch destination is obtained by adding a sign-extended offset to the current PC.
– Unconditional jumps are provided in many RISC architectures.
A Simple Implementation of a RISC Instruction Set
1. Instruction fetch cycle (IF):
– Send the program counter (PC) to memory and fetch the
current instruction from memory.
– Update the PC to the next sequential PC by adding 4
(since each instruction is 4 bytes) to the PC.

2. Instruction decode/register fetch cycle (ID):
– Decode the instruction and read the registers from the register file.
– Do the equality test on the registers as they are read, for a possible branch.
– Compute the possible branch target address by adding the sign-extended offset to the incremented PC.

3. Execution/effective address cycle (EX):
– Memory reference—the ALU adds the base register and the offset to form the effective address.
– Register-register ALU instruction—the ALU performs the operation specified by the opcode on the values read from the register file.
– Register-immediate ALU instruction—the ALU performs the operation specified by the opcode on the first value read from the register file and the sign-extended immediate.
– In a load-store architecture the effective address and execution cycles can be combined into a single clock cycle, since no instruction needs to simultaneously calculate a data address and perform an operation on the data.
4. Memory access (MEM)
– Load --> memory does a read using the effective address computed in the previous cycle.
– Store --> memory writes the data using the effective address.

5. Write-back cycle (WB)
– Write the result into the register file,
– whether it comes from the memory system (for a load) or
– from the ALU (for an ALU instruction).
In this implementation:

branch instructions require 2 cycles,
store instructions require 4 cycles,
and all other instructions require 5 cycles.

Assuming a branch frequency of 12% and a store frequency of 10%, this typical instruction distribution leads to an overall CPI of 4.54.
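The 4.54 is simply the frequency-weighted average of the cycle counts:

CPI = 0.12 × 2 + 0.10 × 4 + 0.78 × 5 = 0.24 + 0.40 + 3.90 = 4.54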
The Classic Five-Stage Pipeline for a RISC Processor
Observations

First, we use separate instruction and data memories, which we would typically implement with separate instruction and data caches.

The use of separate caches eliminates a conflict for a single memory that would otherwise arise between instruction fetch and data memory access.
●Second, the register file is used in two stages: for reading in ID and for writing in WB (which is why the pipeline figures show the register file in two places).
●Hence, we need to perform two reads and one write every clock cycle.
●To handle reads and a write to the same register, we perform the register write in the first half of the clock cycle and the read in the second half.
Third, Figure C.2 does not deal with the PC.

To start a new instruction every clock cycle,
● we must increment and store the PC every clock cycle, during the IF stage, in preparation for the next instruction, and
● we must have an adder to compute the potential branch target during ID.

One further problem is that a branch does not change the PC until the ID stage.

We must also ensure that instructions in the pipeline do not attempt to use the same hardware resources at the same time, and that instructions in different stages of the pipeline do not interfere with one another.
– This is accomplished with pipeline registers between successive stages.
Basic Performance Issues in Pipelining

Pipelining increases the CPU instruction throughput, but it does not reduce the execution time of an individual instruction.

In fact, it slightly increases the execution time of each instruction, due to overhead in the control of the pipeline.

The increase in instruction throughput means a program runs faster and has lower total execution time, even though no single instruction runs faster!

Practical limitations arise from pipeline latency, from imbalance among the pipe stages, and from pipelining overhead.

Imbalance among the pipe stages reduces performance, since the clock can run no faster than the time needed for the slowest pipeline stage.

Pipeline overhead arises from the combination of pipeline register delay and clock skew.

Pipeline registers add setup time (the time that a register input must be stable before the clock signal that triggers a write occurs) plus propagation delay to the clock cycle.

Clock skew is the maximum delay between the arrival of the clock at any two registers.

This simple RISC pipeline would function correctly only if every instruction were independent of every other instruction in the pipeline.

In reality, instructions in the pipeline can depend on one another.



The Major Hurdle of Pipelining—Pipeline Hazards
Structural hazards

resource conflicts: the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution.

Data hazards

an instruction depends on the result of a previous instruction.

Control hazards

arise from the pipelining of branches and other instructions that change the PC.
Hazards in pipelines can make it necessary to stall the pipeline.

When an instruction is stalled, all instructions issued later than the stalled instruction are also stalled.

Instructions issued earlier than the stalled instruction must continue, since otherwise the hazard will never clear.

No new instructions are fetched during the stall.

Avoiding a hazard often requires that some instructions in the pipeline be allowed to proceed while others are delayed.
Performance of Pipelines with Stalls
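In the usual formulation (assuming balanced stages and ignoring pipelining overhead), the standard result is:

Speedup from pipelining = Pipeline depth / (1 + Pipeline stall cycles per instruction)

so every stall cycle a hazard adds per instruction eats directly into the ideal depth-fold speedup.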
INSTRUCTION LEVEL PARALLELISM
and Its Exploitation
1. Concepts & Challenges
• ILP – overlap among instructions; parallel execution of instructions.
• Exploiting ILP:
  • H/W – dynamically, at execution time
  • S/W – statically, at compile time
• Pipelining hazards:
  – Structural hazards
  – Data hazards
  – Control hazards
• Hazards make it necessary to stall the pipeline – some instructions proceed, others are delayed.

• Stall – a wasted clock cycle
  – causes pipeline performance to degrade from ideal performance.
• Structural hazards
  • resource conflicts
• Data hazards
  • the current instruction depends on a previous one
• Control hazards
  • branch instructions
Structural Hazards
• Usually happen when a unit is not fully pipelined
  • that unit cannot accept a new instruction every cycle
• Or, when a resource has not been duplicated enough
  – Example: a single memory (or cache) shared for both instructions and data
  – Example: a single write port for the register file
• Usual solution: stall
  • also called a pipeline bubble, or simply a bubble
Why Allow Structural Hazards?
• Lower cost:
  – less hardware ==> lower cost
• Shorter latency of the unpipelined unit
  – may have other performance benefits
  – data hazards may introduce stalls anyway!

• Example: suppose the FP unit is unpipelined, and the other instructions have a 5-stage pipeline. What percentage of instructions can be FP, so that the CPI does not increase?
  – 20% can be FP, assuming no clustering of FP instructions
  – even if clustered, data hazards may introduce stalls anyway
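One way to see the 20% (assuming, as the 5-stage context suggests, that the unpipelined FP unit stays busy for the full 5 cycles of each FP instruction): the unit can accept a new FP instruction only once every 5 cycles, so evenly spaced FP instructions cause no structural stall as long as they make up at most 1 in 5, i.e., 20% of the instruction mix.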
Data Hazards
• Example:
  – ADD R1, R2, R3
  – SUB R4, R1, R5
  – AND R6, R1, R7
  – OR  R8, R1, R9
  – XOR R10, R1, R11
• All instructions after the ADD depend on R1
• Stalling is a possibility
  – Can we do better? (See the note below.)
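Yes, in this case: with forwarding (as in the classic five-stage pipeline), the ADD result is passed from the EX/MEM and MEM/WB pipeline registers directly to the ALU inputs of SUB and AND, and because the register file writes in the first half of a cycle and reads in the second half, OR reads the updated R1 during ADD's WB cycle; XOR reads it normally. No stalls are required.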
Data Hazard Classification
(consider two instructions i and j, with i occurring before j in program order)

• Read after Write (RAW):
  – j tries to read a source before i writes it.
  – Preserve instruction order; use data forwarding to overcome.
• Write after Write (WAW):
  – j tries to write an operand before it is written by i.
  – Arises only when writes can happen in different pipeline stages.
  – Often comes with other problems as well, such as structural hazards.
• Write after Read (WAR):
  – j writes to a destination before it is read by i.
  – Rare.
Stages in instruction execution
• IF
• ID
• EX
• MEM
• WB
Control Hazard

• The result of a branch instruction is not known until the end of the MEM stage
• Naive solution: stall until the result of the branch instruction is known
  – whether an instruction is a branch at all is only known at the end of its ID cycle
  – the IF of the following instruction may have to be repeated
Handling Control Hazards
• Naive solution: stall
• Predict untaken (not-taken):
  – treat every branch as not taken
  – only slightly more complex than stalling
  – do not update the machine state until the branch outcome is known
  – if the branch turns out to be taken, squash the fetched instruction by clearing the IF/ID register
More Ways to Reduce Control Hazard Delays
• Predict taken:
  – treat every branch as taken
  – of no use in DLX, since the branch target is not known before the branch condition anyway
  – may be of use in other architectures
• Delayed branch:
  – instruction(s) after the branch are executed anyway!
  – the sequential successors are called branch-delay slots
• Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
What is ILP?
• The amount of parallelism within a basic block is quite small
• Basic block – a straight-line code sequence with no branches in except at the entry, and no branches out except at the exit

• So we must exploit ILP across multiple basic blocks.
• The simplest and most common way to increase parallelism is to exploit parallelism among the iterations of a loop – loop-level parallelism.
• Related forms of parallelism:
  • loop-level parallelism
  • instruction-level parallelism
  • data-level parallelism
Example
● Adding two 1000-element arrays:

for (i = 1; i <= 1000; i = i + 1)
    x[i] = x[i] + y[i];

● Every iteration of the loop can overlap with any other iteration.
● Within each loop iteration there is little or no opportunity for overlap.
● To convert LLP to ILP, unroll the loop – statically by the compiler, or dynamically by the hardware.
Data dependences & Hazards
● To exploit ILP we must know which instructions can execute in parallel.
  – If 2 instructions are parallel, they can execute simultaneously with no stalls, given sufficient resources.
  – If 2 instructions are dependent, they are not parallel – they must execute in order, though they may partially overlap.
● How do we determine whether an instruction is dependent?
Dependences & Hazards
• Data dependences

• Name Dependences

• Control Dependences
Data dependences
• Instruction j is dependent on i if either of
the following holds:
a. Inst i produces a result that may be used by
j
b. Inst j is data dependent on k, and k is data
dependent on i (chain of dependences)

• Dependency with in a single inst. is not a


dependency
– ADD R1, R1
Example
• Loop: L.D    F0, 0(R1)    ; F0 = array element
        ADD.D  F4, F0, F2   ; add scalar in F2
        S.D    F4, 0(R1)    ; store result
        DADDUI R1, R1, #-8  ; decrement pointer by 8 bytes
        BNE    R1, R2, Loop ; branch if R1 != R2
Risk
● Program order must be preserved for correct execution.
● If 2 instructions are data dependent, they cannot be executed simultaneously or completely overlapped.
● If they were executed simultaneously, a hazard would occur and the pipeline would stall.
Pipeline organization properties
i. Whether a dependence results in an actual hazard being detected, and
ii. whether that hazard actually causes a stall,
are properties of the pipeline organization.
• Understanding this is essential to exploiting ILP.
Solution
• A data dependence conveys 3 things:
  1. the possibility of a hazard
  2. the order in which results must be calculated
  3. an upper bound on how much parallelism can be exploited

Since data dependences limit the ILP we can exploit, we can:
  – maintain the dependence but avoid the hazard, by scheduling the code (compiler or hardware)
  – eliminate the dependence by transforming the code

• Data values may flow between instructions through
  – registers
  – memory locations

• Detecting dependences:
  – through registers – easy, since register names are fixed
  – through memory locations – difficult, since two effective addresses may refer to the same location while looking different
Name Dependences
• Occur when 2 instructions use the same register or memory location, called a 'name',
• but there is no flow of data between the instructions associated with that name.
Types
• Antidependence
  – between i and j: occurs when j writes a register/memory location that i reads
  – the original ordering must be preserved, so that i reads the correct value
• Output dependence
  – occurs when i and j write the same register/memory location
  – the original ordering must be preserved, so that j writes the final value
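Illustrative examples (the register choices are hypothetical, with i before j):
  – Antidependence on F0:    i: ADD.D F4, F0, F2    j: L.D F0, 8(R1)     (j overwrites F0, which i must read first)
  – Output dependence on F0: i: L.D F0, 0(R1)       j: ADD.D F0, F2, F4  (both write F0; j's value must be the final one)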
Solution
• Register renaming
  – there is no data flow between the instructions, only a clash in the use of a 'name'
  – so use a different name (register) in one of the instructions
  – easy for registers
  – done by the compiler or by hardware
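A minimal sketch, applied to the antidependence above (F6 is an arbitrary unused register):
  Before: ADD.D F4, F0, F2   then   L.D F0, 8(R1)   – antidependence on F0
  After:  ADD.D F4, F0, F2   then   L.D F6, 8(R1)   – later uses read F6; the dependence disappears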
Control dependences
• A control dependence determines the ordering of an instruction i with respect to a branch instruction, so that i is executed in correct program order and only when it should be.
• Example:  T1;
            if (P1) { S1; }
• Statement S1 is control dependent on P1, but T1 is not.
• What does this mean for execution?
  – S1 cannot be moved before P1, and T1 cannot be moved after P1.
Constraints imposed by CD
1. An instruction that is control dependent on a branch cannot be moved before the branch
   – its execution would no longer be controlled by the branch.
2. An instruction that is not control dependent on a branch cannot be moved after the branch
   – its execution would become controlled by the branch.
Solution
• Preserving strict program order ensures that control dependences are also preserved.
• But we can violate a control dependence without affecting the correctness of the program.
• What does correctness of the program mean here?
Correctness
● Two properties are critical to program correctness:
● Exception behavior
  – reordering must not cause new exceptions in the program
● Data flow
  – the actual flow of data values among instructions that produce results and consume them must be preserved
Basic compiler techniques for exposing ILP
• Pipeline scheduling
• Loop unrolling
Static vs. Dynamic Scheduling
● Static scheduling: limitations
  – dependences may not be known at compile time
  – even if known, the compiler becomes complex
  – the compiler must have knowledge of the pipeline
● Dynamic scheduling
  – handles dependences that are only known dynamically
  – simpler compiler
  – efficient even if the code was compiled for a different pipeline
Pipeline scheduling
● To exploit ILP, keep the pipeline full.
● How do we keep the pipeline full?
  – find sequences of unrelated instructions that can be overlapped in the pipeline.
● To avoid a stall, a dependent instruction must be separated from its source instruction "by a distance in clock cycles equal to the pipeline latency of that source instruction".
Bandwidth vs. Latency
• Bandwidth / throughput
  – total amount of work done in a given time
  • e.g., megabytes per second for a disk transfer
• Latency / response time
  – time between the start and completion of an event
  • e.g., milliseconds for a disk access
Pipeline scheduling (contd.)
• Pipeline scheduling depends on:
1. the amount of ILP available in the program
2. the latencies of the functional units in the pipeline

Instruction producing result | Instruction using result | Latency in clock cycles
-----------------------------+--------------------------+------------------------
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1
Load double                  | Store double             | 0
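Here "latency" is the number of clock cycles that must intervene between the producing and the consuming instruction to avoid a stall: e.g., an FP ALU result needs 3 intervening cycles before another FP ALU op can use it, while a load feeding a store needs none.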
Loop unrolling

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;   // add scalar to vector

• Each iteration is independent of the others,
• so the loop is parallel.
• Evaluate the performance of this loop using the latencies above.
• Translated to assembly language:

Loop: L.D    F0, 0(R1)    ; F0 = array element
      ADD.D  F4, F0, F2   ; add scalar in F2
      S.D    F4, 0(R1)    ; store result
      DADDUI R1, R1, #-8  ; decrement pointer by 8 bytes (a double word)
      BNE    R1, R2, Loop ; branch if R1 != R2
Without scheduling, each iteration takes 9 clock cycles:

Loop: L.D    F0, 0(R1)    ; cycle 1
      stall               ; cycle 2
      ADD.D  F4, F0, F2   ; cycle 3
      stall               ; cycle 4
      stall               ; cycle 5
      S.D    F4, 0(R1)    ; cycle 6
      DADDUI R1, R1, #-8  ; cycle 7
      stall               ; cycle 8
      BNE    R1, R2, Loop ; cycle 9

After scheduling, each iteration takes 7 clock cycles:

Loop: L.D    F0, 0(R1)    ; cycle 1
      DADDUI R1, R1, #-8  ; cycle 2
      ADD.D  F4, F0, F2   ; cycle 3
      stall               ; cycle 4
      stall               ; cycle 5
      S.D    F4, 8(R1)    ; cycle 6 (offset adjusted, since DADDUI now executes before the store)
      BNE    R1, R2, Loop ; cycle 7

• Of the 7 cycles, actual work on the array takes only 3; the remaining 4 are loop overhead – DADDUI, BNE, and 2 stalls.
• To eliminate those 4 cycles, we need more operations relative to the number of overhead instructions.
• A simple scheme for increasing the number of instructions relative to branch and overhead instructions is LOOP UNROLLING.
• Unrolling replicates the loop body multiple times, adjusting the loop termination code; a sketch in C follows.
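For illustration, here is the running example unrolled four times at the source level (a minimal sketch, with double x[1001], s as in the example; it assumes the trip count of 1000 is a multiple of the unroll factor, otherwise a cleanup loop is needed):

/* original: one branch and one index update per element */
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

/* unrolled by 4: one branch and one index update per FOUR elements,
   and the four independent adds can be scheduled together */
for (i = 1000; i > 0; i = i - 4) {
    x[i]     = x[i]     + s;
    x[i - 1] = x[i - 1] + s;
    x[i - 2] = x[i - 2] + s;
    x[i - 3] = x[i - 3] + s;
}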
Pros and Cons
● Loop unrolling is used to improve scheduling:
  – it eliminates branches
  – it allows instructions from different iterations to be scheduled together
  – but we need more registers when a loop is unrolled
● Knowing when and how the ordering among instructions may be changed is the key to all of these techniques.
Limitations
• Diminishing returns: less overhead is amortized with each additional unroll
• Code size limitations
• Compiler limitations
• Register pressure
Summary of loop unrolling &
scheduling
● Determine that unrolling would be useful (the iterations are independent)
● Use different registers to avoid unnecessary constraints
● Eliminate the extra branch and test conditions
● Determine that the loads and stores can be interchanged
  – analyse the memory addresses to make sure they don't refer to the same location
● Schedule the code, preserving the dependences
● All of this requires understanding the dependences properly
3. Reducing branch costs with prediction
● How do we reduce branch hazards?
  – loop unrolling reduces the number of branches
  – beyond that, we reduce the performance losses of the remaining branches by predicting their behavior
● How do we predict their behavior?
  – static branch prediction
  – dynamic branch prediction
  – tournament predictors
Static branch prediction
• Predicts statically, at compile time.
• Useful when branch behavior is highly predictable at compile time.
• Also assists dynamic predictors.
• But how? By several methods (below).
• Major limitation of SBP: the misprediction rate for integer programs is higher than for FP programs.
Methods
● Predict every branch as taken
  – the average misprediction rate is then the untaken-branch frequency
  – it ranges from 9% (highly accurate) to 59% (not accurate)
  – for SPEC, the average is 34%
● Predict each branch based on profile information
  – collected from earlier runs
  – works because branch behavior is bimodally distributed:
  – an individual branch is usually biased toward taken or untaken
Dynamic branch prediction
• Branch prediction buffer / branch history table
• 2-bit prediction scheme
• Correlating branch predictors
Branch prediction buffer (BPB)

● a small memory
● indexed by the lower portion of the address of the branch instruction
● a bit in the entry indicates whether the branch was recently taken or not
● Limitations:
  – we cannot be sure the prediction is correct
  – the entry may belong to another branch with the same low-order address bits
● Remedy: the 2-bit prediction scheme
2-bit prediction scheme
● a small cache accessed with the instruction address during the IF stage
● if the instruction is decoded as a branch and the branch is predicted as taken, fetching begins from the target as soon as it is known
● accuracy 82% to 99% (misprediction 1% to 18%)
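A minimal sketch in C of the 2-bit saturating counter such a table typically stores (the 0..3 encoding is one common convention, not something these slides mandate):

/* 2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken */
typedef unsigned char Counter2;        /* holds 0..3 */

int predict_taken(Counter2 c) {
    return c >= 2;                     /* weakly or strongly taken */
}

Counter2 update(Counter2 c, int taken) {
    if (taken)
        return (c < 3) ? c + 1 : 3;    /* saturate at strongly taken */
    else
        return (c > 0) ? c - 1 : 0;    /* saturate at strongly not taken */
}

The point of the two bits is that a prediction must be wrong twice before it flips, so a single anomalous outcome (such as a loop exit) does not destroy a well-established prediction.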
Correlating branch predictors (2-level predictors)
● a 2-bit predictor uses only the recent behaviour of a single branch
● we can improve accuracy by also looking at the recent behaviour of other branches
● example:

if (aa == 2) aa = 0;   /* branch b1 */
if (bb == 2) bb = 0;   /* branch b2 */
if (aa != bb) { ... }  /* branch b3 */

● the behaviour of branch b3 depends on the outcomes of b1 and b2
• Correlating predictors add information about the behavior of the most recent branches when deciding how to predict a given branch.
• (1,2) predictor:
  – uses the behaviour of the last 1 branch to choose from among a pair of 2-bit predictors
• (m,n) predictor:
  – uses the behaviour of the last m branches to choose from among 2^m predictors, each of which is an n-bit predictor
• Better accuracy than a plain 2-bit predictor, for only a trivial amount of extra hardware.

The number of bits in an (m,n) predictor is
  2^m × n × (number of prediction entries selected by the branch address)
A 2-bit predictor with no global history is a (0,2) predictor.

Example: how many bits are there in a (0,2) BP with 4K entries? How many entries are in a (2,2) BP with the same number of bits?
  – 2^0 × 2 × 4K = 8K bits
  – 2^2 × 2 × a = 8K bits, so a = 1K entries
Tournament Predictors
● Combine local and global predictors:
● multiple predictors – one based on global information and one on local information – combined with a selector
● the selector picks the right predictor for a particular branch
● misprediction rate < 0.5% for FP programs and about 14% for integer programs
Dynamic scheduling
• The hardware rearranges the instruction execution order to reduce stalls, while maintaining data flow and exception behaviour.
