Pipeline: A Simple Implementation of A RISC Instruction Set
[PC] + 4;
Perform an equality test on the registers as they are read, for a possible branch.
For a load instruction, the memory is read using the effective address. For a store
instruction, the memory writes the data from the second register read, using the effective address.
5. Write the result into the register file, whether it comes from the memory system (for a
LOAD instruction) or from the ALU.
Five stage Pipeline for a RISC processor
Each instruction takes at most 5 clock cycles to execute.
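                    Clock number
Instruction         1     2     3     4     5     6     7     8     9
Instruction i       IF    ID    EXE   MEM   WB
Instruction i+1           IF    ID    EXE   MEM   WB
Instruction i+2                 IF    ID    EXE   MEM   WB
Instruction i+3                       IF    ID    EXE   MEM   WB
Instruction i+4                             IF    ID    EXE   MEM   WB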
Figure 2.1 Simple RISC pipeline. On each clock cycle, another instruction is fetched.
Each stage of the pipeline must be independent of the other stages. Also, two different
operations cannot be performed with the same data path resource in the same clock cycle. For
example, a single ALU cannot be used to compute the effective address and perform a
subtract operation during the same clock cycle. An adder is therefore provided in stage 1
to compute the new PC value, and an ALU in stage 3 to perform the arithmetic indicated
in the instruction (see Figure 2.2). Conflicts must not arise from the overlap of instructions
in the pipeline. In other words, the functional unit of each stage needs to be independent of
the other functional units. Three observations reduce the risk of conflict:
1. Separate instruction and data memories at the level of the L1 cache eliminate the
conflict for a single memory that would otherwise arise between instruction fetch and data
access.
2. The register file is accessed during two stages, namely ID and WB. The hardware must
allow up to two reads and one write every clock cycle.
3. To start a new instruction every cycle, it is necessary to increment and store the
PC every cycle.
Figure 2.2 Diagram indicating the cycle and functional unit of each stage.
Figure 2.3 Functional units of 5 stage Pipeline. IF/ID is a buffer between IF and ID stage.
Pipeline Hazards
Hazards may cause the pipeline to stall. When an instruction is stalled, all the instructions
issued later than the stalled instruction are also stalled. Instructions issued earlier than
the stalled instruction continue in the normal way. No new instructions are fetched
during the stall.
A hazard is a situation that prevents the next instruction in the instruction stream from
executing during its designated clock cycle. Hazards reduce the pipeline
performance.
Speedup = (CPI unpipelined × Clock cycle unpipelined) / (CPI pipelined × Clock cycle pipelined)
CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
CPI pipelined = 1 + Pipeline stall clock cycles per instruction
Assume that
i) the cycle time overhead of pipelining is ignored, and
ii) the stages are balanced.
With these assumptions,
Clock cycle unpipelined = Clock cycle pipelined
Therefore,
Speedup = CPI unpipelined / CPI pipelined
Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)
If all the instructions take the same number of cycles, and that number is equal to the number of
pipeline stages (the depth of the pipeline), then
CPI unpipelined = Pipeline depth
Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
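As a quick numerical check of the last formula, the following is a minimal Python sketch. The function name and the sample stall rates are illustrative assumptions, not values from these notes; the sketch relies on the same two assumptions stated above (no cycle time overhead, balanced stages).

```python
def pipeline_speedup(pipeline_depth: int, stall_cycles_per_instruction: float) -> float:
    """Speedup = pipeline depth / (1 + pipeline stall cycles per instruction).

    Assumes an ideal pipelined CPI of 1, balanced stages, and no
    cycle time overhead from pipelining.
    """
    if pipeline_depth < 1:
        raise ValueError("pipeline depth must be at least 1")
    if stall_cycles_per_instruction < 0:
        raise ValueError("stall cycles per instruction cannot be negative")
    cpi_pipelined = 1.0 + stall_cycles_per_instruction
    return pipeline_depth / cpi_pipelined


# Hypothetical stall rates for a 5-stage pipeline.
for stalls in (0.0, 0.25, 1.0):
    print(f"stalls/instr = {stalls:4.2f} -> speedup = {pipeline_speedup(5, stalls):.2f}")
```

With no stalls the speedup equals the pipeline depth; every stall cycle per instruction adds directly to the denominator.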
Structural hazard
A structural hazard arises from resource conflicts, when the hardware cannot support all
possible combinations of instructions simultaneously in overlapped execution. If some
combination of instructions cannot be accommodated because of resource conflicts, the
processor is said to have a structural hazard.
A structural hazard arises when some functional unit is not fully pipelined, or when
some resource has not been duplicated enough to allow all combinations of instructions in
the pipeline to execute.
For example, if a single memory is shared for data and instructions, then when an instruction
contains a data memory reference, it conflicts with the instruction fetch of a later
instruction (as shown in Figure 2.5a). This causes a hazard, and the pipeline stalls for 1
clock cycle.
Figure 2.5a The load instruction and instruction I+3 both access memory in clock
cycle 4.
Instruction #       Clock number
                    1     2     3     4     5     6     7     8     9
Load Instruction    IF    ID    EXE   MEM   WB
Instruction I+1           IF    ID    EXE   MEM   WB
Instruction I+2                 IF    ID    EXE   MEM   WB
Instruction I+3                       Stall IF    ID    EXE   MEM   WB
Instruction I+4                                   IF    ID    EXE   MEM
DADD  R1, R2, R3
DSUB  R4, R1, R5
AND   R6, R1, R5
OR    R8, R1, R9
XOR   R10, R1, R11
The DADD instruction writes the value of R1 in its WB stage (clock cycle 5), but the DSUB
instruction reads the value during its ID stage (clock cycle 3). This problem is called a data
hazard.
DSUB may read the wrong value if precautions are not taken. The AND instruction reads
the register during clock cycle 4 and also receives the wrong value.
The XOR instruction operates properly, because its register read occurs in clock cycle 6,
after DADD writes in clock cycle 5. The OR instruction also operates without incurring a
hazard, because the register file reads are performed in the second half of the cycle
whereas the writes are performed in the first half of the cycle.
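The cycle arithmetic in this discussion can be reproduced with a small Python sketch. It assumes no forwarding and no stalls, one instruction fetched per cycle starting at clock cycle 1, register reads in ID, and register writes in WB with the split-cycle register file described above. The `Instr` type and the function name are hypothetical, and the instruction sequence is the DADD/DSUB/AND/OR/XOR example as understood from the text.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def raw_hazards(program: List[Instr]) -> List[Tuple[str, str, str]]:
    """Return (producer, consumer, register) triples where the consumer
    reads a register in ID before the producer has written it in WB."""
    hazards = []
    for i, producer in enumerate(program):
        # Instruction i (0-based) is fetched in cycle i+1; WB is 4 cycles later.
        write_cycle = (i + 1) + 4
        for j in range(i + 1, len(program)):
            consumer = program[j]
            read_cycle = (j + 1) + 1  # ID is one cycle after IF
            # Write in the first half, read in the second half: a read in
            # the same cycle as the write already sees the new value.
            if producer.dest in consumer.srcs and read_cycle < write_cycle:
                hazards.append((producer.op, consumer.op, producer.dest))
            if consumer.dest == producer.dest:
                break  # a later instruction redefines the register
    return hazards


program = [
    Instr("DADD", "R1", ("R2", "R3")),
    Instr("DSUB", "R4", ("R1", "R5")),
    Instr("AND", "R6", ("R1", "R5")),
    Instr("OR", "R8", ("R1", "R9")),
    Instr("XOR", "R10", ("R1", "R11")),
]
print(raw_hazards(program))
```

Under these assumptions only DSUB and AND are flagged; OR reads in the same cycle in which DADD writes, and XOR reads one cycle later.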
Minimizing data hazard by Forwarding
The DADD instruction produces the value of R1 at the end of clock cycle 3. The DSUB
instruction requires this value only during clock cycle 4. If the result can be moved
from the pipeline register where DADD stores it to the point (the input of the ALU) where
DSUB needs it, then the need for a stall is avoided. Using a simple hardware
technique called data forwarding (also called bypassing or short-circuiting), data can be made
available from the output of the ALU to the point where it is required (the input of the ALU) at
the beginning of the immediately following clock cycle.
Forwarding works as follows:
i) The ALU output held in the EX/MEM and MEM/WB pipeline registers is always
fed back to the ALU inputs.
ii) If the forwarding hardware detects that a previous ALU output serves as the
source for the current ALU operation, the control logic selects the forwarded
result as the input rather than the value read from the register file.
Forwarded results are required not only from the immediately preceding instruction, but also
from an instruction that started 2 cycles earlier. The result of the ith instruction
must also be forwarded to the (i+2)th instruction.
Forwarding can be generalized to include passing a result directly to the functional unit
that requires it.
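The selection step performed by the forwarding control logic can be sketched in Python as follows. This is a simplified model under stated assumptions: the function and parameter names are hypothetical, only ALU results are forwarded, and special cases such as a hard-wired zero register are ignored.

```python
from typing import Optional


def select_alu_operand(
    src_reg: str,
    regfile_value: int,
    ex_mem_dest: Optional[str],
    ex_mem_value: int,
    mem_wb_dest: Optional[str],
    mem_wb_value: int,
) -> int:
    """Choose the value fed to one ALU input.

    ex_mem_* describes the result of the immediately preceding instruction,
    mem_wb_* the result of the instruction that started 2 cycles earlier.
    A destination of None means that instruction writes no register.
    """
    # The newest result wins, so EX/MEM is checked before MEM/WB.
    if ex_mem_dest is not None and ex_mem_dest == src_reg:
        return ex_mem_value
    if mem_wb_dest is not None and mem_wb_dest == src_reg:
        return mem_wb_value
    # No pending result for this register: use the register file value.
    return regfile_value


# DSUB reading R1 right after DADD produced 42 for R1 (illustrative values).
print(select_alu_operand("R1", 0, "R1", 42, None, 0))
```

Checking EX/MEM before MEM/WB matters when two earlier instructions write the same register: the more recent result is the one the current instruction must see.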
Data Hazard requiring stalls
LD    R1, 0(R2)
DADD  R3, R1, R4
AND   R5, R1, R6
OR    R7, R1, R8
The pipelined data path for these instructions is shown in the timing diagram (Figure 2.7).
Instruction         Clock number
                    1     2     3     4     5     6     7     8     9
Without the stall:
LD   R1, 0(R2)      IF    ID    EXE   MEM   WB
DADD R3, R1, R4           IF    ID    EXE   MEM   WB
AND  R5, R1, R6                 IF    ID    EXE   MEM   WB
OR   R7, R1, R8                       IF    ID    EXE   MEM   WB
With the stall:
LD   R1, 0(R2)      IF    ID    EXE   MEM   WB
DADD R3, R1, R4           IF    ID    Stall EXE   MEM   WB
AND  R5, R1, R6                 IF    Stall ID    EXE   MEM   WB
OR   R7, R1, R8                       Stall IF    ID    EXE   MEM   WB
Figure 2.7 The top half shows why a stall is needed. The bottom half shows the stall
inserted to solve the problem.
The LD instruction gets the data from memory at the end of clock cycle 4. Even with
forwarding, the data from the LD instruction can be made available at the earliest during
clock cycle 5. The DADD instruction requires the result of the LD instruction at the beginning of
clock cycle 4. This demands data
forwarding in negative time, which is not possible. Hence, the situation calls for a pipeline
stall.
The result from the LD instruction can be forwarded from the pipeline register to the AND
instruction, which begins 2 clock cycles after the LD instruction.
The load instruction has a delay or latency that cannot be eliminated by forwarding alone.
It is necessary to stall the pipeline by 1 clock cycle. Hardware called a pipeline interlock
detects the hazard and stalls the pipeline until the hazard is cleared. The pipeline interlock
preserves the correct execution pattern by introducing a stall or bubble. The CPI
for the stalled instruction increases by the length of the stall. Figure 2.7 shows the
pipeline before and after the stall.
The stall causes the DADD to move 1 clock cycle later in time. Forwarding to the AND
instruction now goes through the register file, and forwarding is not required at all for the OR
instruction. No instruction is started during clock cycle 4.
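The check made by a pipeline interlock for this load-use case can be sketched in Python. The sketch assumes a pipeline with full forwarding, so the only data hazard that still needs a stall is a load followed immediately by an instruction that reads the loaded register. The tuple layout, the `LOAD_OPS` set, and the function name are illustrative assumptions.

```python
from typing import List, Tuple

# (opcode, destination register, source registers)
Instruction = Tuple[str, str, Tuple[str, ...]]

LOAD_OPS = {"LD"}  # assumed set of load opcodes for this sketch


def count_load_use_stalls(program: List[Instruction]) -> int:
    """Count the 1-cycle bubbles a load interlock would insert."""
    stalls = 0
    for earlier, later in zip(program, program[1:]):
        op, dest, _ = earlier
        _, _, later_srcs = later
        # The loaded value is available only after MEM, one cycle too late
        # for the EXE stage of the instruction that follows the load.
        if op in LOAD_OPS and dest in later_srcs:
            stalls += 1
    return stalls


program = [
    ("LD", "R1", ("R2",)),
    ("DADD", "R3", ("R1", "R4")),
    ("AND", "R5", ("R1", "R6")),
    ("OR", "R7", ("R1", "R8")),
]
print(count_load_use_stalls(program))
```

Only the DADD directly after the LD triggers a bubble; the AND and OR are far enough behind the load for forwarding or the register file to supply R1.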
Control Hazard
When a branch is executed, it may or may not change the content of the PC. If a branch is
taken, the content of the PC is changed to the target address. If a branch is not taken,
the PC is not redirected and execution falls through to the next sequential instruction.
The simplest way of dealing with branches is to redo the fetch of the instruction
following a branch. The first IF cycle is essentially a stall, because it never performs
useful work.
One stall cycle for every branch yields a performance loss of 10% to 30%, depending on
the branch frequency.
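The link between branch frequency and the quoted loss can be illustrated with a short Python sketch. It assumes an ideal CPI of 1 and reads "performance loss" as the relative increase in execution time; that reading, the function names, and the sample branch frequencies are assumptions made for illustration.

```python
def cpi_with_branch_stalls(branch_frequency: float, stall_cycles_per_branch: float = 1.0) -> float:
    """CPI = ideal CPI (1) + branch frequency x stall cycles per branch."""
    if not 0.0 <= branch_frequency <= 1.0:
        raise ValueError("branch frequency must be between 0 and 1")
    return 1.0 + branch_frequency * stall_cycles_per_branch


def execution_time_increase(branch_frequency: float, stall_cycles_per_branch: float = 1.0) -> float:
    """Relative increase in execution time compared with an ideal CPI of 1."""
    return cpi_with_branch_stalls(branch_frequency, stall_cycles_per_branch) / 1.0 - 1.0


# Hypothetical branch frequencies of 10%, 20% and 30%.
for freq in (0.10, 0.20, 0.30):
    print(f"branch frequency {freq:.0%}: +{execution_time_increase(freq):.0%} execution time")
```

With one stall cycle per branch, the increase in execution time equals the branch frequency itself, which is consistent with the 10% to 30% range quoted above.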
Reducing the Branch Penalties
There are many methods for dealing with the pipeline stalls caused by branch
delay:
1. Freeze or flush the pipeline, holding or deleting any instructions after the branch
until the branch destination is known. It is a simple scheme; the branch penalty is
fixed and cannot be reduced by software.
2. Treat every branch as not taken, simply allowing the hardware to continue as if
the branch were not executed. Care must be taken not to change the processor
state until the branch outcome is known.
Instructions are fetched as if the branch were a normal instruction. If the branch
is taken, it is necessary to turn the fetched instruction into a no-op instruction and
restart the fetch at the target address. Figure 2.8 shows the timing diagram of both
situations.
Instruction         Clock number
                    1     2     3     4     5     6     7     8     9
Untaken Branch      IF    ID    EXE   MEM   WB
Instruction I+1           IF    ID    EXE   MEM   WB
Instruction I+2                 IF    ID    EXE   MEM   WB
Instruction I+3                       IF    ID    EXE   MEM   WB
Instruction I+4                             IF    ID    EXE   MEM   WB

Taken Branch        IF    ID    EXE   MEM   WB
Instruction I+1           IF    Idle  Idle  Idle  Idle
Branch Target                   IF    ID    EXE   MEM   WB
Branch Target+1                       IF    ID    EXE   MEM   WB
Branch Target+2                             IF    ID    EXE   MEM   WB
Figure 2.8 The predicted-not-taken scheme and the pipeline sequence when the
branch is untaken (top) and taken (bottom).
3. Treat every branch as taken: as soon as the branch is decoded and the target address
is computed, begin fetching and executing at the target. This scheme has an advantage
only if the branch target is known before the branch outcome.
For both the predicted-taken and predicted-not-taken schemes, the compiler can
improve performance by organizing the code so that the most frequent path
matches the hardware choice.
4. Delayed branch: this technique was commonly used in early RISC processors.
In a delayed branch, the execution order with a branch delay of one is
Branch instruction
Sequential successor 1
Branch target if taken
The sequential successor is in the branch delay slot, and it is executed irrespective of
whether or not the branch is taken. The pipeline behavior with a branch delay is shown in
Figure 2.9. Processors with delayed branches normally have a single-instruction delay.
The compiler has to make the successor instruction valid and useful. There are three ways in
which the delay slot can be filled by the compiler.
Instruction                      Clock number
                                 1     2     3     4     5     6     7     8     9
Untaken Branch                   IF    ID    EXE   MEM   WB
Branch delay instruction (i+1)         IF    ID    EXE   MEM   WB
Instruction (i+2)                            IF    ID    EXE   MEM   WB
Instruction (i+3)                                  IF    ID    EXE   MEM   WB
Instruction (i+4)                                        IF    ID    EXE   MEM   WB

Taken Branch                     IF    ID    EXE   MEM   WB
Branch delay instruction (i+1)         IF    ID    EXE   MEM   WB
Branch Target                                IF    ID    EXE   MEM   WB
Branch Target+1                                    IF    ID    EXE   MEM   WB
Branch Target+2                                          IF    ID    EXE   MEM   WB
Figure 2.9 Timing diagram of the pipeline, showing that the behavior of a delayed branch
is the same whether or not the branch is taken.
Types of exceptions:
The term exception is used to cover the terms interrupt, fault and exception.
An I/O device request, a page fault, invoking an OS service from a user program, integer
arithmetic overflow, a memory protection violation, hardware malfunctions, power failure,
etc. are the different classes of exception.
Individual events have important characteristics that determine what action is needed
corresponding to that exception.
i)
If the event occurs at the same place every time the program is executed with the
same data and memory allocation, the event is synchronous.
Asynchronous events are caused by devices external to the CPU and memory; such
events can be handled after the completion of the current instruction.
ii)
User-requested exceptions are predictable and can always be handled after the current
instruction has completed. Coerced
exceptions are caused by some hardware event that is not under the control of the user
program. Coerced exceptions are harder to implement because they are not predictable.
iii)
If an event can be masked by a user task, it is user maskable. Otherwise, it is user
nonmaskable.
iv)
v)
If the program's execution continues after the interrupt, it is a resuming event; otherwise
it is a terminating event. It is easier to implement exceptions that terminate execution.
restart. Pipeline control can take the following steps to save the pipeline state safely.
i)
ii)
Until the trap is taken, turn off all writes for the faulting instruction and for all
instructions that follow it in the pipeline. This prevents any state changes for instructions that
will not be completed before the exception is handled.
iii) After the exception handling routine receives control, it immediately saves the PC
of the faulting instruction. This value will be used to return from the exception later.
NOTE:
1. With pipelining, multiple exceptions may occur in the same clock cycle because
there are multiple instructions in execution.
2. Handling exceptions becomes still more complicated when instructions are
allowed to execute out of order.
Pipeline implementation
Every MIPS instruction can be implemented in at most 5 clock cycles.
1. Instruction fetch cycle (IF)
IR ← Mem[PC]
NPC ← PC + 4
Operation: send out the PC and fetch the instruction from memory into the Instruction
Register (IR). Increment the PC by 4 to address the next sequential instruction.
2. Instruction decode/register fetch cycle (ID)
A ← Regs[rs]
B ← Regs[rt]
Imm ← sign-extended immediate field of IR
Operation: decode the instruction and access the register file to read the registers
(rs and rt). A and B are temporary registers.
The operands are kept ready for use in the next cycle.
Decoding is done concurrently with reading the registers. The MIPS ISA has fixed-length
instructions. Hence, these fields are at fixed locations.
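The register transfers of the IF and ID cycles can be mimicked in a few lines of Python. This is a behavioral sketch only: it models memory as a dictionary keyed by byte address and the register file as a list, both of which are assumptions made for illustration, and it uses the standard 32-bit MIPS field positions (rs in bits 25..21, rt in bits 20..16, immediate in bits 15..0).

```python
from typing import Dict, List, Tuple


def fetch(mem: Dict[int, int], pc: int) -> Tuple[int, int]:
    """IF: IR <- Mem[PC]; NPC <- PC + 4."""
    ir = mem[pc]
    npc = pc + 4
    return ir, npc


def sign_extend_16(value: int) -> int:
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def decode(ir: int, regs: List[int]) -> Tuple[int, int, int]:
    """ID: A <- Regs[rs]; B <- Regs[rt]; Imm <- sign-extended immediate."""
    rs = (ir >> 21) & 0x1F  # fixed field position, readable before decoding finishes
    rt = (ir >> 16) & 0x1F
    a = regs[rs]
    b = regs[rt]
    imm = sign_extend_16(ir)
    return a, b, imm


# Illustrative instruction word: opcode 0x23, rs = 2, rt = 1, immediate = 8.
regs = [100 + n for n in range(32)]
mem = {0: (0x23 << 26) | (2 << 21) | (1 << 16) | 8}
ir, npc = fetch(mem, 0)
print(npc, decode(ir, regs))
```

Because the fields sit at fixed positions, `decode` can read both registers without first knowing which instruction it is, which is the point made in the text about decoding and register reads proceeding together.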
3. Execution/effective address cycle (EX)
* Memory reference:
ALU output ← A + Imm;
Operation: the ALU adds the operands to compute the effective address and places
the result into the register ALU output.
* Register-register ALU instruction:
ALU output ← A func B;
Operation: the ALU performs the operation specified by the function code on the values
taken from register A and register B.
* Register-immediate ALU instruction:
ALU output ← A op Imm;
Operation: the contents of register A and register Imm are operated on (function op) and