Pipelining 2019
● Pipelining: multiple instructions are overlapped in execution.
● It takes advantage of parallelism that exists among the actions needed to execute an instruction.
The throughput of an instruction pipeline
● The time required to move an instruction one step down the pipeline is a processor cycle.
● In a computer, this processor cycle is usually 1 clock cycle (sometimes it is 2, rarely more).
The Basics of a RISC Instruction Set
Key Properties:
● All operations on data apply to data in registers.
● The only operations that affect memory are load and store operations, which move data from memory to a register or from a register to memory, respectively.
● Load and store operations that load or store less than a full register (e.g., a byte, 16 bits, or 32 bits) are often available.
● The instruction formats are few in number, with all instructions typically being one size.
RISC architectures have 3 classes of instructions:
● ALU instructions
– take either two registers, or a register and a sign-extended immediate, operate on them, and store the result into a third register.
– operations include add (DADD), subtract (DSUB), and logical operations (such as AND or OR).
– immediate and unsigned forms of these operations exist.
● Load and store instructions
– take a base register and an immediate field (offset) as operands.
– the effective address (the memory address) is the sum of the contents of the base register and the sign-extended offset.
– for a load instruction, a second register operand acts as the destination for the data loaded from memory.
– for a store instruction, the second register operand is the source of the data that is stored into memory.
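The effective-address step above can be sketched in Python; this is a minimal sketch assuming a 16-bit offset field (the exact field width varies by ISA):

```python
def sign_extend(value, bits):
    """Sign-extend a `bits`-wide two's-complement field to a Python int."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def effective_address(base_reg_value, offset_field):
    """Effective address = contents of base register + sign-extended offset."""
    return base_reg_value + sign_extend(offset_field, 16)

# LD R2, -8(R1) with R1 = 0x1000 (-8 encodes as 0xFFF8 in 16 bits):
print(hex(effective_address(0x1000, 0xFFF8)))  # 0xff8
```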
● Branches and jumps
– Branches are conditional transfers of control.
– The branch condition is specified either by a set of condition bits (a condition code) or by a limited set of comparisons between a pair of registers or between a register and zero.
– In all RISC architectures, the branch destination is obtained by adding an offset to the current PC.
– Unconditional jumps are provided in many RISC architectures.
A Simple Implementation of a RISC Instruction Set
1. Instruction fetch cycle (IF):
– Send the program counter (PC) to memory and fetch the
current instruction from memory.
– Update the PC to the next sequential PC by adding 4
(since each instruction is 4 bytes) to the PC.
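The IF cycle can be sketched in a few lines of Python; modeling instruction memory as a dict keyed by byte address is an assumption for illustration:

```python
def instruction_fetch(imem, pc):
    """IF stage: fetch the instruction at PC and compute the next sequential PC."""
    instruction = imem[pc]   # send the PC to memory, fetch the current instruction
    next_pc = pc + 4         # each instruction is 4 bytes
    return instruction, next_pc

imem = {0: "DADD R1,R2,R3", 4: "LD R4,0(R1)"}
inst, pc = instruction_fetch(imem, 0)
print(inst, pc)  # DADD R1,R2,R3 4
```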
● Assuming a branch frequency of 12% and a store frequency of 10%, a typical instruction distribution leads to an overall CPI of 4.54.
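The 4.54 figure can be reproduced if, as in the classic textbook example, branches complete in 2 clock cycles, stores in 4, and all other instructions in 5 (those per-class cycle counts are the assumption here):

```python
branch_freq, store_freq = 0.12, 0.10
other_freq = 1.0 - branch_freq - store_freq   # 0.78

# Weighted-average CPI of the simple (unpipelined) implementation:
cpi = branch_freq * 2 + store_freq * 4 + other_freq * 5
print(round(cpi, 2))  # 4.54
```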
The Classic Five-Stage Pipeline for a RISC Processor
Observations
● First, we use separate instruction and data memories, typically implemented with separate instruction and data caches.
● The use of separate caches eliminates a conflict for a single memory that would otherwise arise between instruction fetch and data memory access.
● Second, the register file is used in two stages: for reading in ID and for writing in WB, so pipeline diagrams show the register file in two places.
● Hence, we need to perform two register reads and one register write every clock cycle.
● Instructions in different stages of the pipeline do not interfere with one another
– thanks to the pipeline registers between stages.
Basic Performance Issues in Pipelining
● Pipelining increases the CPU instruction throughput,
● but it does not reduce the execution time of an individual instruction.
● In fact, it slightly increases the execution time of each instruction due to overhead in the control of the pipeline.
● The increase in instruction throughput means the program runs faster and has lower total execution time, even though no single instruction runs faster!
● Limitations arise from pipeline latency, from imbalance among the pipe stages, and from pipelining overhead.
● Imbalance among the pipe stages reduces performance, since the clock can run no faster than the time needed for the slowest pipeline stage.
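This can be illustrated numerically; the stage delays and overhead below are made-up numbers:

```python
# The pipeline clock is set by the slowest stage plus per-stage overhead.
stage_delays_ns = {"IF": 1.0, "ID": 0.8, "EX": 1.2, "MEM": 1.0, "WB": 0.6}
overhead_ns = 0.1   # assumed pipeline-register setup time + clock skew

clock_ns = max(stage_delays_ns.values()) + overhead_ns   # 1.3 ns
unpipelined_ns = sum(stage_delays_ns.values())           # 4.6 ns
print(round(unpipelined_ns / clock_ns, 2))  # 3.54, not the ideal 5x
```

The imbalance (EX at 1.2 ns vs. WB at 0.6 ns) plus the latch overhead is exactly what keeps the speedup below the pipeline depth.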
● Pipeline overhead arises from the combination of pipeline register delay and clock skew.
● The pipeline registers add setup time, which is the time that a register input must be stable before the clock edge that triggers a write, plus propagation delay to the clock cycle.
● Clock skew, the maximum delay between when the clock arrives at any two registers, also adds to the clock cycle.
This simple RISC pipeline would function just fine if every instruction were independent of the others. In reality, instructions interact, producing hazards:
● Structural hazards (resource conflicts)
– arise when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution.
● Data hazards
– arise when an instruction depends on the result of a previous instruction.
● Control hazards
– arise from the pipelining of branches and other instructions that change the PC.
Hazards in pipelines can make it necessary to stall the pipeline.
● Instructions issued earlier than the stalled instruction must continue;
● otherwise the hazard will never clear.
● Avoiding a hazard requires that some instructions in the pipeline be allowed to proceed while others are delayed.
Performance of Pipelines with Stalls
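Under the usual simplifying assumptions (ideal CPI of 1, perfectly balanced stages), speedup from pipelining is the pipeline depth divided by 1 plus the stall cycles per instruction; a small sketch:

```python
def pipeline_speedup(depth, stall_cycles_per_instruction):
    """Speedup = pipeline depth / (1 + stall cycles per instruction),
    assuming ideal CPI = 1 and balanced stages."""
    return depth / (1 + stall_cycles_per_instruction)

print(pipeline_speedup(5, 0.0))            # 5.0 with no stalls
print(round(pipeline_speedup(5, 0.5), 2))  # 3.33 with half a stall per instruction
```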
INSTRUCTION LEVEL PARALLELISM
and Its Exploitation
1. Concepts & Challenges
• ILP: overlap among instructions
- parallel execution of instructions.
• Exploiting ILP:
• H/W: dynamically, at execution time
• S/W: statically, at compile time
• Pipelining hazards:
– Structural hazards
– Data hazards
– Control hazards
• Hazards make it necessary to stall the pipeline
- some instructions proceed, others are delayed.
• Data hazards
• the current instruction depends on a previous one.
• Control hazards
• branch instructions.
Structural Hazards
• Usually happen when a unit is not fully pipelined:
that unit cannot churn out one instruction per cycle.
• Or when a resource has not been duplicated enough:
– Example: a single memory serving as both I-cache and D-cache
– Example: a single write port for the register file
• Usual solution: stall.
• A stall is also called a pipeline bubble, or simply a bubble.
Why Allow Structural Hazards?
• Lower cost:
– Less hardware ==> lower cost
• Delayed branch:
– Instruction(s) after the branch are executed anyway!
– Sequential successors are called branch-delay slots
• Pipeline CPI (pipelined processor) =
Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
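The CPI decomposition above can be sketched directly; the stall rates below are made-up numbers for illustration:

```python
def pipeline_cpi(ideal_cpi, structural, data_hazard, control):
    """Pipeline CPI = ideal CPI + structural + data-hazard + control stalls
    (all measured in stall cycles per instruction)."""
    return ideal_cpi + structural + data_hazard + control

# Illustrative (made-up) stall components:
print(round(pipeline_cpi(1.0, 0.05, 0.20, 0.15), 2))  # 1.4
```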
What is ILP?
• The amount of parallelism within a basic block is quite small.
● Convert loop-level parallelism (LLP) to ILP: unroll the loop statically by the compiler, or dynamically by hardware.
Data dependences & Hazards
● Exploiting ILP: which instructions can execute in parallel?
– If 2 instructions are parallel, they can be executed simultaneously with no stalls, assuming sufficient resources.
– If 2 instructions are dependent, they are not parallel: they must be executed in order, though they may partially overlap.
• Besides data dependences, there are:
• Name dependences
• Control dependences
Data dependences
• Instruction j is data dependent on instruction i if either of the following holds:
a. Instruction i produces a result that may be used by j.
b. Instruction j is data dependent on instruction k, and k is data dependent on i (a chain of dependences).
• Detecting dependences:
– Registers: easy, since register names are fixed.
– Memory locations: difficult, since two addresses may refer to the same location while looking different as effective addresses.
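The register case can be sketched with a toy instruction encoding (representing an instruction as a (dest, src1, src2) tuple of register names is an assumption for illustration):

```python
def is_data_dependent(producer, consumer):
    """True if `consumer` reads the register that `producer` writes."""
    dest = producer[0]
    return dest in consumer[1:]

i = ("R1", "R2", "R3")   # e.g. DADD R1,R2,R3
j = ("R4", "R1", "R5")   # e.g. DSUB R4,R1,R5: reads R1 produced by i
k = ("R6", "R7", "R8")   # unrelated instruction
print(is_data_dependent(i, j), is_data_dependent(i, k))  # True False
```

Because register names are fixed at this point, the check is a simple name comparison; no such simple check works for memory operands.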
Name Dependences
• Occur when 2 instructions use the same register or memory location (called a 'name') without any data flowing between them.
• Output dependence
– when i and j write the same register or memory location.
• The original ordering must be preserved so that j writes the final value.
Solution
• Register renaming:
– there is no data flow between the instructions;
– they only clash in using a 'name';
– use different names in one of the instructions.
– Easy for registers.
– Done by the compiler or by hardware.
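A toy renamer illustrating the idea (the tuple instruction encoding and the free physical-register pool are assumptions for illustration):

```python
def rename_output_dependences(instructions, free_regs):
    """Toy renamer: give every write a fresh physical register and patch
    later reads, removing output (write-write) name dependences."""
    mapping = {}                      # architectural name -> latest physical name
    renamed = []
    for dest, src1, src2 in instructions:
        s1 = mapping.get(src1, src1)  # reads use the most recent physical name
        s2 = mapping.get(src2, src2)
        new_dest = free_regs.pop(0)   # fresh name for every write
        mapping[dest] = new_dest
        renamed.append((new_dest, s1, s2))
    return renamed

code = [("R1", "R2", "R3"),   # both instructions write R1:
        ("R1", "R4", "R5")]   # an output dependence on the name R1
print(rename_output_dependences(code, ["P1", "P2"]))
# [('P1', 'R2', 'R3'), ('P2', 'R4', 'R5')]
```

After renaming, the two writes target different physical registers, so they could complete in either order.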
Control dependences
• A control dependence determines the ordering of an instruction i with respect to a branch instruction, so that i is executed in the correct order and only when it should be.
• Example: T1;
if (P1) { S1; }
• Statement S1 is control-dependent on P1, but T1 is not.
• What does this mean for execution?
• What about the correctness of the program?
Correctness
● Two properties are critical to program correctness:
● Exception behavior
– reordering must not cause new exceptions in the program.
● Data flow
– the actual flow of data values among instructions that produce results and those that consume them must be preserved.
Basic compiler techniques for exposing ILP
• Pipeline scheduling
• Loop unrolling
Static vs. Dynamic Scheduling
● Static scheduling: limitations
– Dependences may not be known at compile time.
– Even if known, the compiler becomes complex.
– The compiler has to have knowledge of the pipeline.
● Dynamic scheduling
– Handles dynamic dependences.
– Simpler compiler.
– Efficient even if the code was compiled for a different pipeline.
Pipeline scheduling
● To exploit ILP, keep the pipeline full.
● How to keep the pipeline full?
– Find sequences of unrelated instructions that can be overlapped in the pipeline.
• Latency/response time
– the time between the start and completion of an event,
• e.g., milliseconds for a disk access.
Pipeline scheduling (contd.)
• Pipeline scheduling depends on:
1. The amount of ILP available in the program.
2. The latencies of the functional units in the pipeline.
• Fragment of the loop example, with clock cycle numbers:
S.D    F4,0(R1)    ; cycle 6
DADDUI R1,R1,#-8   ; cycle 7
stall              ; cycle 8
• Compiler limitation:
• register pressure.
Summary of loop unrolling & scheduling
● Determine that unrolling would be useful (the loop iterations are independent).
● Use different registers to avoid unnecessary constraints.
● Eliminate the extra branch and test conditions.
● Determine that loads and stores can be interchanged:
– analyse memory addresses to make sure they don't refer to the same location.
● Schedule the code, preserving the dependences.
● For all this, the dependences must be understood properly.
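The effect of unrolling can be sketched in Python (a compiler would do this on machine code; the unroll factor of 4 and the multiple-of-4 length assumption are illustrative):

```python
# Rolled loop: one add and one loop-bound test per element.
def rolled(a, b):
    for i in range(len(a)):
        a[i] += b
    return a

# Unrolled by 4: one bound test per 4 elements (length assumed to be a
# multiple of 4), mirroring how a compiler removes branch/test overhead.
def unrolled4(a, b):
    i = 0
    while i < len(a):
        a[i] += b
        a[i + 1] += b
        a[i + 2] += b
        a[i + 3] += b
        i += 4
    return a

print(unrolled4([1, 2, 3, 4], 10))  # [11, 12, 13, 14]
```

Note how the unrolled body also exposes four independent adds that a scheduler could interleave with other work.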
3. Reducing branch costs with prediction
● How to reduce branch hazards?
– Loop unrolling removes some branches.
– How to reduce the remaining performance losses?
● By predicting branch behavior, using
● a small memory,
● indexed by the lower portion of the address of the branch instruction.
● One memory bit indicates whether the branch was recently taken or not.
● Limitations:
– not sure if the prediction is correct;
– the address may be that of another branch with the same lower address bits.
● Remedy: a 2-bit prediction scheme.
2-bit prediction scheme
● A small cache accessed with the instruction address during the IF stage; each entry holds a 2-bit counter.
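A sketch of a 2-bit saturating-counter predictor (the table size, the weakly-not-taken starting state, and the PC indexing are assumptions):

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per entry, indexed by low-order PC bits.
    Counter values 0,1 predict not taken; 2,3 predict taken."""
    def __init__(self, entries=16):
        self.counters = [1] * entries    # start weakly not-taken (an assumption)
        self.entries = entries

    def index(self, pc):
        return (pc >> 2) % self.entries  # drop the byte offset, keep low bits

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bp = TwoBitPredictor()
for outcome in [True, True, False, True]:   # a mostly-taken branch at PC 0x40
    bp.update(0x40, outcome)
print(bp.predict(0x40))  # True
```

The saturation means a single anomalous outcome (the one False above) does not flip the prediction, which is the whole point over a 1-bit scheme.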
• (1,2) predictor
– uses the behaviour of the last 1 branch to choose from among a pair of 2-bit predictors.
• (m,n) predictor
– uses the behaviour of the last m branches to choose from among 2^m predictors, each an n-bit predictor.
● E.g.: how many bits are there in a (0,2) BP with 4K entries?
How many entries are in a (2,2) BP with the same number of bits?
– 2^0 × 2 × 4K = 8K bits
– 2^2 × 2 × a = 8K
– a = 1K entries
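The sizing arithmetic above, checked in Python:

```python
def predictor_bits(m, n, entries):
    """Total storage of an (m,n) predictor: 2**m counters of n bits per entry."""
    return (2 ** m) * n * entries

print(predictor_bits(0, 2, 4096))   # 8192 bits: a (0,2) BP with 4K entries
print(8192 // ((2 ** 2) * 2))       # 1024 entries for a (2,2) BP in 8K bits
```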
Tournament Predictors
● Combine local and global predictors