Hazards: CSE378 W, 2001


Hazards

Introduction

• Pipelining up until now has been “ideal”


• In real life, though, we might not be able to fill the pipeline
because of hazards:
  • Data hazards. For example, the result of an operation is needed
    before it is computed:
      add $7, $12, $15   # put result in $7
      sub $8, $7, $12    # use $7
      and $9, $13, $7    # use $7 again
  • Note that there is no dependency for $12, b/c it is used only as a
    source register.
  • Control hazards. If we take the branch, then the instructions
    that were fetched after the branch (which are now in the pipe) are
    the wrong ones.

CSE378 WINTER, 2001

Data Hazards

Clock cycle:        1     2     3     4     5     6     7
Value of reg $7:    5     5     5     5     5     23    23

add $7, ...         IM    REG   ALU   DM    REG
sub $8, $7, $12           IM    REG   ALU   DM    REG
and $9, $13, $7                 IM    REG   ALU   DM    REG

• The arrow represents a dependency. Arrows that go backwards are
  trouble.

Detecting Data Dependencies

• Dependencies: Given two instructions, i and j (i occurs before j).
• We say a dependence exists between i and j if j reads the result
  produced by i, and there is no instruction k which occurs between
  i and j and produces the same result as i (that is, writes the same
  register).
• We call a data dependence a hazard when an instruction tries to
  read a register in stage 2 (ID) and this register will be written by a
  previous instruction that has not yet completed stage 5 (WB).
• This is sometimes called a read-after-write hazard.
• What kind of instructions can create data dependences?
• Modern microprocessors have several ALUs, floating point units
  that take longer than integer units, etc., which give rise to other
  kinds of data hazards.

Resolving Data Hazards

• There are several options:
  • Build a hazard detection unit, which stalls the pipeline until the
    hazard has passed. It does this by inserting “bubbles”
    (essentially nops) in the pipeline. This isn’t a great idea. We’d
    like to avoid it, if possible.
  • Forwarding. Forward the result as an ALU source.
  • Software (static) scheduling. Leave it up to the compiler. It must
    schedule instructions to avoid hazards. Often it won’t be able
    to, so it will issue no-ops (an instruction that does nothing)
    instead. This is the cheapest (in terms of hardware) solution.
  • Hardware (dynamic) scheduling. Build special hardware that
    schedules instructions dynamically.

Hazard Detection and Stalling

add $7,...          IM     REG    ALU    DM     REG
sub $8, $7, $12            IM     bubble bubble bubble REG    ALU    DM     REG
and $9, $13, $7                                        IM     REG    ALU    DM     REG

• Note that the hazard costs us 3 cycles...


Detecting Hazards

• Between instruction i+1 and instruction i (3 bubbles):
  • ID/EX.WriteReg == IF/ID read-register 1 or 2 (in fact, it is
    slightly more complex b/c write-register can be rd or rt depending
    on the instruction)
• Between instruction i+2 and i (2 bubbles):
  • EX/MEM.WriteReg == IF/ID read-register 1 or 2
• Between instruction i+3 and i (1 bubble):
  • MEM/WB.WriteReg == IF/ID read-register 1 or 2
• Note that stalls stop instructions in the ID stage. Therefore, we
  must stop fetching new instructions, or else we would clobber the
  PC and the IF/ID register. So we need control lines to:
  • Create bubbles. This can be done by setting all control lines that
    are passed from ID to 0, hence creating a nop.
  • Prevent new instruction fetches. This should be done for as
    many cycles as there are bubbles.

Improvements

• Our stalling scheme is very conservative, and there are a few
  improvements we can make.
• Is the RegWrite control bit asserted (this determines whether
  we’re really dealing with an R-type or load instruction)?
• Build a better register file. Currently, we assume that the register
  file will not produce the correct result if a given register is both
  read and written in the same cycle. Doing this would eliminate
  hazards in the WB stage.
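
The comparisons on the Detecting Hazards slide, combined with the RegWrite check, can be sketched in software. This is a minimal Python model, not the course's hardware design: the function name `bubbles_needed` is hypothetical, and `None` is an assumed stand-in for a pipeline register whose instruction does not assert RegWrite.

```python
from typing import Iterable, Optional


def bubbles_needed(
    read_regs: Iterable[int],
    id_ex_write: Optional[int],
    ex_mem_write: Optional[int],
    mem_wb_write: Optional[int],
) -> int:
    """Return how many bubbles the instruction currently in ID needs.

    read_regs holds the IF/ID read-register numbers (1 or 2 of them).
    Each *_write argument is the WriteReg field of that pipeline
    register, or None when RegWrite is not asserted there (assumption
    of this sketch).
    """
    reads = set(read_regs)
    # Check the closest producer first: it forces the longest stall.
    checks = (
        (id_ex_write, 3),   # producer is instruction i, consumer is i+1
        (ex_mem_write, 2),  # consumer is i+2
        (mem_wb_write, 1),  # consumer is i+3
    )
    for write_reg, bubbles in checks:
        if write_reg is not None and write_reg in reads:
            return bubbles
    return 0


# sub $8, $7, $12 sits in ID while add $7, ... is one stage ahead.
print(bubbles_needed({7, 12}, id_ex_write=7, ex_mem_write=None, mem_wb_write=None))
```

With the "better register file" improvement, the MEM/WB case would no longer need a bubble; in this sketch that would amount to dropping the last entry of `checks`.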

Forwarding

• Inserting bubbles is a pessimistic solution, since data that is
  written during the writeback stage is often computed much earlier:
  • At the end of the EX stage for arithmetic instructions
  • At the end of the MEM stage for a load.
• So why not forward the result of the computation (or load) directly
  to the input of the ALU if it is required there?
• Forwarding is sometimes called bypassing.
• Note that for reasons related to interrupts or exceptions, we do not
  want the state of the process (i.e. the registers) to be modified
  until the last stage.

Forwarding Example

$7 is computed here: at the end of the ALU stage of add

add $7,...          IM    REG   ALU   DM    REG
sub $8, $7, $12           IM    REG   ALU   DM    REG
and $9, $13, $7                 IM    REG   ALU   DM    REG

• There is no need to wait until WB, because we’ve already
  computed the value required.
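
As a rough sketch of the forwarding decision for a single ALU operand: the value can come from the EX/MEM register (the instruction one ahead), from MEM/WB (two ahead), or from the register file value already latched in ID/EX. The function name `forward_select` and the string labels are hypothetical, and `None` is assumed to mean that the instruction in that stage does not write a register.

```python
from typing import Optional


def forward_select(
    src_reg: int,
    ex_mem_write: Optional[int],
    mem_wb_write: Optional[int],
) -> str:
    """Choose the source for one ALU input of the instruction in EX."""
    # The most recent producer wins, so EX/MEM is checked first.
    if ex_mem_write is not None and ex_mem_write == src_reg:
        return "EX/MEM"
    if mem_wb_write is not None and mem_wb_write == src_reg:
        return "MEM/WB"
    # No in-flight producer: use the value read from the register file.
    return "ID/EX"


# sub $8, $7, $12 is in EX while add $7, ... is in MEM.
print(forward_select(7, ex_mem_write=7, mem_wb_write=None))
# and $9, $13, $7 is in EX: sub ($8) is in MEM, add ($7) is in WB.
print(forward_select(7, ex_mem_write=8, mem_wb_write=7))
```

Checking EX/MEM before MEM/WB matters when two in-flight instructions write the same register: the younger result is the one the program order requires.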


Implementing Forwarding

• Change the data path so that data can be read from either the
  EX/MEM or MEM/WB registers and be forwarded to one of the ALU
  inputs.
• This requires logic to detect forwarding:
  • We can do this at stage 3 (EX) of instruction i to forward to stage
    2 (ID) of instruction i+1
  • We can do this at stage 4 (MEM) of instruction i to forward to
    stage 2 (ID) of instruction i+2.
• It also requires additional inputs to the muxes over the ALU inputs
  (inputs can now come from ID/EX, EX/MEM, or MEM/WB pipe
  registers).

The Trouble with Loads

• What if we have a load followed by an arithmetic operation which
  needs the result of the load:
      lw $7, 16($8)
      add $9, $9, $7

$7 is ready here: at the end of the DM stage of lw
$7 is needed here: at the start of the ALU stage of add

lw $7, 16($8)       IM    REG   ALU   DM    REG
add $9, $9, $7            IM    REG   ALU   DM    REG

• We’re busy fetching the data while it is needed in the EX stage.
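
The timing argument can be checked with a little arithmetic. The sketch below assumes full forwarding, numbers the stages IF=1 through WB=5, and assumes a consumer needs its operand at the start of its own EX stage; `stalls_with_forwarding` is a hypothetical helper, not part of the course material.

```python
EX, MEM = 3, 4  # stage numbers in the 5-stage pipeline (IF=1 ... WB=5)


def stalls_with_forwarding(producer_is_load: bool, distance: int) -> int:
    """Bubbles needed when the consumer is `distance` instructions after
    the producer and all results are forwarded as early as possible."""
    # Stage at whose end the producer's value exists.
    ready_stage = MEM if producer_is_load else EX
    # If the producer is fetched in cycle 1 it finishes stage s in cycle s.
    # The consumer is fetched in cycle 1 + distance, so without stalls it
    # is in EX during cycle distance + EX, and the value must exist by the
    # end of the cycle before that.
    last_usable_cycle = distance + EX - 1
    return max(0, ready_stage - last_usable_cycle)


print(stalls_with_forwarding(producer_is_load=False, distance=1))  # add then sub
print(stalls_with_forwarding(producer_is_load=True, distance=1))   # lw then add
```

Under these assumptions an arithmetic result never costs a bubble, while a load followed immediately by a dependent instruction costs one.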

Loads

• Forwarding cannot save the day in the face of a dependent
  instruction which immediately follows a load.
• The only solution is to insert a bubble after loads if the next
  operation is dependent, so we still need a hazard detection unit.
• Good compilers will attempt to schedule instructions in the “load
  delay slot” so as to avoid these kinds of stalls.

Scheduling

• Other important approaches include scheduling the instructions to
  avoid hazards, in hardware or software.
• This is particularly important for processors which have multiple or
  very deep pipelines (most modern processors).
• Dependences force a partial ordering on the instruction stream.
      lw   $t2, 0($t0)      # 1
      add  $t5, $t2, $t3    # 2
      sub  $t3, $t1, $t8    # 3
      mult $t7, $t8, $t8    # 4
      addi $t5, $t7, 16     # 5
• Three kinds of dependence: data (read-after-write), anti-dependence
  (write-after-read), output (write-after-write).
• Above: data dependences (1->2, 4->5); anti-dependences (2->3);
  output (2->5).
• How can we reorder these instructions to do better?
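
The three kinds of dependence can be found mechanically. Below is a minimal sketch that encodes the five-instruction sequence by hand as (destination, sources) pairs; the `program` list and the `classify` function are illustrative names, and the check deliberately ignores the refinement that an intervening write to the same register hides an earlier dependence.

```python
from itertools import combinations

# (destination register, source registers), numbered 1..5 as in the slide.
program = [
    ("$t2", ["$t0"]),         # 1: lw   $t2, 0($t0)
    ("$t5", ["$t2", "$t3"]),  # 2: add  $t5, $t2, $t3
    ("$t3", ["$t1", "$t8"]),  # 3: sub  $t3, $t1, $t8
    ("$t7", ["$t8", "$t8"]),  # 4: mult $t7, $t8, $t8
    ("$t5", ["$t7"]),         # 5: addi $t5, $t7, 16
]


def classify(instructions):
    """Return the data, anti and output dependences as (i, j) pairs, i < j."""
    deps = {"data": [], "anti": [], "output": []}
    numbered = enumerate(instructions, start=1)
    for (i, (dst_i, src_i)), (j, (dst_j, src_j)) in combinations(numbered, 2):
        if dst_i in src_j:      # j reads what i wrote: read-after-write
            deps["data"].append((i, j))
        if dst_j in src_i:      # j overwrites what i read: write-after-read
            deps["anti"].append((i, j))
        if dst_i == dst_j:      # both write the same register: write-after-write
            deps["output"].append((i, j))
    return deps


print(classify(program))
```

On this sequence the sketch reports 1->2 and 4->5 as read-after-write, 2->3 as write-after-read, and 2->5 as write-after-write. Any reordering has to keep each of those pairs in its original order.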


Control Hazards

• Pipelining and branching just don’t get along...
• The transfer of control, via jumps, returns, and taken branches,
  causes control hazards.
• The branch instruction decides whether to branch in the MEM
  stage. In other words, if the branch is taken, the PC isn’t updated
  to the proper address until the end of the MEM stage.
• By this time, however, we’ve already entered 3 instructions into
  the pipeline that were the wrong ones!

      beq $t0, $t1, foo    # assume $t0==$t1
      and $1, $2, $3
      add $4, $5, $6
      sub $7, $8, $9
 foo: add $4, $9, $10

Example

beq $t0, $t1, foo   IM    REG   ALU   DM    REG
and $1, $2, $3            IM    REG   ALU   DM    REG
add $4, $5, $6                  IM    REG   ALU   DM    REG
sub $7, $8, $9                        IM    REG   ALU   DM    REG
and $4, $9, $10                             IM    REG   ALU   DM    REG

We potentially fetch and start working on 3 incorrect instructions!!

Resolving Control Hazards

• Detecting one is easy: just look at the opcode!
• At least 4 possibilities:
  • Always stall. Stall as soon as we see a branch. This costs a bit
    of control hardware and 3 cycles for every branch.
  • Assume branch not taken. Just go ahead and start executing
    the next instructions, but find a way to flush those instructions if
    the branch was taken. This costs more control hardware and 3
    cycles only if the branch is taken.
  • Delayed branches. Change the semantics of your branch
    instruction to force the compiler/assembler to deal with the
    problem.
  • Branch prediction. Try to guess whether the branch will be taken
    or not and do the right thing. Be ready to flush the pipeline in
    case you were wrong...

Assume Branch Not Taken

• We need to be able to flush the pipeline in case the branch
  actually was taken.
• If the branch is taken:
  • For the IF stage, we zero out the instruction field in IF/ID register.
  • For the ID stage, since this is where we determine control, we
    just set all control lines to zero, creating the effect of a nop.
  • For the EX stage, we use an extra mux to zero out the result of
    the ALU.
• This approach costs additional control hardware and costs cycles
  only when the branch is taken.
• A rule of thumb says that forward branches are taken 60% of the
  time, and backward branches (as in loops) are taken 85% of the
  time.
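
To compare "always stall" with "assume branch not taken", it helps to turn the 3-cycle penalty into an average CPI. The sketch below assumes a base CPI of 1 with no other stalls; the function name `branch_cpi` and the 20% branch frequency in the usage lines are hypothetical values chosen only for illustration, while the 60% taken rate is the forward-branch rule of thumb quoted in the slides.

```python
def branch_cpi(branch_fraction: float, taken_fraction: float,
               scheme: str, penalty: int = 3) -> float:
    """Average cycles per instruction, assuming a base CPI of 1."""
    if scheme == "always_stall":
        # Every branch pays the full penalty.
        stall = branch_fraction * penalty
    elif scheme == "assume_not_taken":
        # Only taken branches pay: their 3 followers are flushed.
        stall = branch_fraction * taken_fraction * penalty
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return 1.0 + stall


# Hypothetical workload: 20% branches, 60% of them taken.
print(branch_cpi(0.2, 0.6, "always_stall"))
print(branch_cpi(0.2, 0.6, "assume_not_taken"))
```

The gap between the two schemes shrinks as the taken rate rises, which is why loop-heavy code with mostly taken backward branches gains little from assuming not taken.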


Delayed Branches

• Change the semantics (meaning) of your branch instruction so
  that it won’t take effect until N (where N is the branch delay)
  cycles later.
• This means that the N instructions after the branch will be
  executed regardless of the branch outcome. These are called
  delay slots.
• This forces the programmer/compiler/assembler to deal with the
  problem, by requiring them to fill the N delay slots.
• This costs the hardware nothing, since it is the compiler’s job to
  ensure that correct instructions (or nops) are scheduled in the
  delay slots.
• Good compilers can usually fill 1 or 2 slots.
• MIPS branches are delayed (1 slot) and compilers can fill around
  70% of the slots.

Branch Prediction

• Develop some hardware to tell you the chances that you will
  actually take the branch or not (a history table, for example).
• Given this information, make a prediction (taken or untaken) and
  start executing instructions speculatively.
• If you’re wrong, you still have to flush the pipeline.
• Note that assuming branch-not-taken is just a special case of
  branch prediction (where you always predict not taken).
• Branch prediction should do better than assuming not taken, but
  you pay the price in additional hardware.
• Branch prediction should do better than delayed branches,
  assuming you predict right more often than the compiler can fill
  the delay slot with interesting work (not a nop).
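
The slides mention a history table without naming a scheme. One common choice, used here purely as an illustration and not necessarily what the course has in mind, is a table of 2-bit saturating counters indexed by the branch address. The class name, table size, initial counter value, and the example branch address are all assumptions of this sketch.

```python
class TwoBitPredictor:
    """History table of 2-bit saturating counters (illustrative sketch).

    Counter values 0 and 1 predict not taken; 2 and 3 predict taken.
    """

    def __init__(self, entries: int = 16):
        self.entries = entries
        self.counters = [1] * entries  # start at "weakly not taken"

    def _index(self, pc: int) -> int:
        # Instructions are word aligned, so drop the low two bits.
        return (pc >> 2) % self.entries

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor: TwoBitPredictor, pc: int, outcomes) -> float:
    """Fraction of outcomes predicted correctly, updating as we go."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(outcomes)


# A loop branch: taken 9 times, then falls through, repeated 10 times.
loop_outcomes = ([True] * 9 + [False]) * 10
print(accuracy(TwoBitPredictor(), 0x400, loop_outcomes))
```

On this pattern always predicting not taken would be right only on the ten loop exits, whereas the counter mispredicts once while warming up and then once per loop exit.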

Exceptions

• Historical definitions:
  • An exception is an unexpected event from within the processor
    (such as arithmetic overflow).
  • An interrupt is an unexpected event from outside of the
    processor (such as IO requests).
• MIPS doesn’t distinguish the source of the event, and calls both of
  the above exceptions.
• Kinds:
  • IO device request (external)
  • System call (internal)
  • Arithmetic overflow (internal)
  • Undefined instruction (internal)
  • Hardware malfunctions (either)
• Note that we can view system calls as exceptions!

How to Handle Exceptions

• We must save the program counter of the offending instruction in
  the EPC (Exception PC), and then transfer control to the
  operating system.
• The OS can then take appropriate action (provide an IO service
  for the program, kill the program, etc). If it chooses to restart the
  program, it can jump back to the EPC.
• How does the OS know what kind of exception? MIPS includes a
  cause register.
• In hardware, the cause is saved into the cause register, the PC is
  saved in EPC, and control transfers to a predefined address in the
  kernel (0x4000 0040).
• Exceptions are hard to deal with because we have several
  instructions in the pipeline.
• Suppose we get an arithmetic overflow (in the EX stage). We
  need to be sure to let the downstream instructions finish, while
  flushing the upstream instructions.
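
The hardware steps for an overflow detected in EX can be sketched as a toy model. Everything here is illustrative: the dictionaries standing in for processor state and pipeline contents, the name `take_exception`, and the numeric cause code are assumptions, and the sketch also assumes the offending instruction itself is squashed along with the younger ones. The handler address is the one quoted in the slides.

```python
HANDLER_ADDRESS = 0x40000040  # predefined kernel address from the slides
CAUSE_OVERFLOW = 12           # illustrative cause code (assumption)


def take_exception(state: dict, pipeline: dict, cause: int) -> None:
    """Model an exception raised by the instruction currently in EX.

    `pipeline` maps each stage name to the PC of the instruction in it,
    or None for a bubble. Both dictionaries are updated in place.
    """
    state["cause"] = cause          # record why we trapped
    state["epc"] = pipeline["EX"]   # PC of the offending instruction
    # Flush the offending instruction and everything fetched after it
    # (upstream). Instructions in MEM and WB (downstream) are older and
    # are left alone so they can finish.
    for stage in ("IF", "ID", "EX"):
        pipeline[stage] = None
    state["pc"] = HANDLER_ADDRESS   # next fetch comes from the kernel


state = {"pc": 0x10C, "epc": 0, "cause": 0}
pipeline = {"IF": 0x10C, "ID": 0x108, "EX": 0x104, "MEM": 0x100, "WB": 0xFC}
take_exception(state, pipeline, CAUSE_OVERFLOW)
print(hex(state["epc"]), hex(state["pc"]), pipeline)
```

If the OS later decides to resume the program, it would jump back to the address saved in `state["epc"]`.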


The Truth

• The MIPS R2000/3000 pipelined implementation is pretty close to
  the one we’ve discussed in class, but modern machines use much
  more complex implementations:
• Multiple pipelines: superscalar:
  • Trend: exploit instruction level parallelism (ILP) by working on
    multiple instructions simultaneously. This reduces CPI.
  • Many modern machines issue up to 4 instructions at once.
  • Challenge: statically or dynamically scheduling instructions to
    extract maximal ILP while keeping cycle time low
• Deep pipelines: superpipelined:
  • Trend: Reduce cycle time
  • Modern pipelines often have 8 or more stages.
  • Challenge: longer branch and load delays (often leading to
    higher CPI), more forwarding required, scheduling is also
    important

Summary

• Pipelining improves performance by increasing throughput
  (instructions/time), not by reducing latency (time/instruction).
• We examined the classic 5 stage pipeline (IF, ID, EX, MEM, WB)
• Data and control hazards place limits on the speedups we can
  achieve through pipelining.
  • Data hazards can be avoided by stalling or forwarding (unless it
    is a load!). Stalling can be achieved through software or
    hardware. Forwarding is more efficient.
  • Branch hazards can only be avoided by hardware stalling or
    “defining away the problem” via delayed branches.
  • The performance of branches can be improved through delayed
    branches or branch prediction.
• Compilers must understand the pipeline to extract maximum
  performance through scheduling. In MIPS, the ISA is no longer a
  perfect abstraction.
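
The throughput-versus-latency point can be made concrete with a cycle count. This sketch assumes the unpipelined machine spends one cycle per stage on each instruction, with the same cycle time as the pipelined one; the function names are hypothetical.

```python
def unpipelined_cycles(n: int, stages: int = 5) -> int:
    """Each instruction runs through every stage before the next starts."""
    return n * stages


def pipelined_cycles(n: int, stages: int = 5, stall_cycles: int = 0) -> int:
    """The first instruction takes `stages` cycles; each later one
    finishes one cycle after its predecessor, plus any stall cycles."""
    return stages + (n - 1) + stall_cycles


def speedup(n: int, stages: int = 5, stall_cycles: int = 0) -> float:
    return unpipelined_cycles(n, stages) / pipelined_cycles(n, stages, stall_cycles)


# Three instructions, as in the add/sub/and example: ideal, then with 3 bubbles.
print(pipelined_cycles(3), pipelined_cycles(3, stall_cycles=3))
# A long instruction stream approaches, but never reaches, a 5x speedup.
print(speedup(1000))
```

Each instruction still takes five cycles from fetch to writeback, so latency is unchanged; the gain comes entirely from overlapping instructions, and every stall cycle eats into it.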
