Pipeline: A Simple Implementation of A RISC Instruction Set

Pipeline allows for the overlapping execution of multiple instructions by breaking down the execution process into discrete stages. At each stage, a different instruction is being worked on. This improves performance by allowing new instructions to begin execution before previous instructions have finished. However, pipeline hazards can occur when instructions interact in ways that prevent proper parallel execution, such as structural hazards from shared resources or data hazards when an instruction needs to read a value before it is written. Various techniques are used to address hazards and maximize the performance benefits of pipelining.


Pipeline

Pipelining is an implementation technique that exploits parallelism among the instructions in a sequential instruction stream. It allows the execution of multiple instructions to overlap.
A pipeline is like an assembly line: each step, or pipeline stage, completes a part of an instruction, and each stage operates on a separate instruction. Instructions enter at one end, progress through the stages, and exit at the other end. If the stages are perfectly balanced (assuming ideal conditions), then the time per instruction on the pipelined processor is given by the ratio:

Time per instruction on unpipelined machine / Number of pipeline stages

Under these conditions, the speedup from pipelining is equal to the number of pipeline stages. In practice, the pipeline stages are not perfectly balanced, and pipelining involves some overhead. Therefore, the speedup is in practice always less than the number of stages of the pipeline.
Pipelining yields a reduction in the average execution time per instruction. If the processor is assumed to take one (long) clock cycle per instruction, then pipelining decreases the clock cycle time. If the processor is assumed to take multiple cycles per instruction (CPI), then pipelining reduces the CPI.
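The ideal-case ratio above can be sketched as a one-line calculation (the timing numbers below are hypothetical, chosen only to illustrate the formula):

```python
def ideal_pipelined_time(unpipelined_time_ns, stages):
    """Ideal time per instruction = unpipelined time / number of pipeline stages."""
    return unpipelined_time_ns / stages

# Example: a 10 ns unpipelined instruction on a 5-stage pipeline.
print(ideal_pipelined_time(10.0, 5))  # 2.0 ns per instruction, ideally
```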
A Simple implementation of a RISC instruction set
Every instruction in this RISC instruction set can be implemented in at most 5 clock cycles without pipelining. The 5 clock cycles are:
1. Instruction fetch (IF) cycle:
Send the contents of the program counter (PC) to memory, fetch the current instruction from memory, and update the PC:

New PC ← [PC] + 4    (since each instruction is 4 bytes)

2. Instruction decode / register fetch (ID) cycle:
Decode the instruction and access the register file. Decoding is done in parallel with reading the registers, which is possible because the register specifiers are at fixed locations in a RISC architecture. This is known as fixed-field decoding. In addition, this cycle involves:
- performing an equality test on the registers as they are read, for a possible branch;
- sign-extending the offset field of the instruction in case it is needed;
- computing the possible branch target address.

3. Execution / effective address (EX) cycle:
The ALU operates on the operands prepared in the previous cycle and performs one of the following functions depending on the instruction type.
* Memory reference: Effective address ← [Base Register] + offset
* Register-Register ALU instruction: the ALU performs the operation specified in the instruction using the values read from the register file.
* Register-Immediate ALU instruction: the ALU performs the operation specified in the instruction using the first value read from the register file and the sign-extended immediate.
4. Memory access (MEM) cycle:
For a load instruction, the memory is read using the effective address. For a store instruction, the data from the second register read is written to memory at the effective address.
5. Write-back (WB) cycle:
Write the result into the register file, whether it comes from the memory system (for a load instruction) or from the ALU.
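The five cycles above can be sketched for a single register-register instruction as a toy model (this is an illustrative sketch, not a real ISA simulator; the instruction encoding as a tuple and the register names are assumptions for brevity):

```python
def execute(memory, regs, pc):
    """Run one register-register instruction through IF, ID, EX, MEM, WB."""
    # IF: fetch the instruction at PC; New PC <- PC + 4
    op, rd, rs, rt = memory[pc]
    pc += 4
    # ID: decode and read the register file into temporaries
    a, b = regs[rs], regs[rt]
    # EX: the ALU operates on the operands prepared in ID
    alu_out = a + b if op == "DADD" else a - b
    # MEM: no memory access for a register-register ALU instruction
    # WB: write the ALU result into the register file
    regs[rd] = alu_out
    return pc

regs = {"R1": 0, "R2": 7, "R3": 5}
memory = {0: ("DADD", "R1", "R2", "R3")}
pc = execute(memory, regs, 0)
print(regs["R1"], pc)  # 12 4
```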
Five stage Pipeline for a RISC processor
Each instruction takes at most 5 clock cycles for its execution:
* Instruction fetch cycle (IF)
* Instruction decode / register fetch cycle (ID)
* Execution / effective address cycle (EX)
* Memory access (MEM)
* Write back cycle (WB)

The execution of an instruction comprising the above subtasks can be pipelined. Each of the clock cycles from the previous section becomes a pipe stage, a cycle in the pipeline. A new instruction can be started on each clock cycle, which results in the execution pattern shown in Figure 2.1. Though each instruction takes 5 clock cycles to complete, during each clock cycle the hardware initiates a new instruction and is executing some part of five different instructions, as illustrated in Figure 2.1.
Instruction         Clock number
                    1    2    3    4    5    6    7    8    9
Instruction i       IF   ID   EXE  MEM  WB
Instruction i+1          IF   ID   EXE  MEM  WB
Instruction i+2               IF   ID   EXE  MEM  WB
Instruction i+3                    IF   ID   EXE  MEM  WB
Instruction i+4                         IF   ID   EXE  MEM  WB

Figure 2.1 Simple RISC pipeline. On each clock cycle another instruction is fetched.
Each stage of the pipeline must be independent of the other stages. Also, two different operations cannot be performed with the same datapath resource on the same clock cycle. For example, a single ALU cannot be used to compute the effective address and perform a subtract operation during the same clock cycle. An adder is provided in stage 1 to compute the new PC value, and an ALU in stage 3 to perform the arithmetic indicated in the instruction (see Figure 2.2). Conflicts should not arise out of the overlap of instructions in the pipeline. In other words, the functional units of each stage need to be independent of those of the other stages. There are three observations due to which the risk of conflict is reduced:
- Separate instruction and data memories at the level of the L1 cache eliminate the conflict for a single memory that would otherwise arise between instruction fetch and data access.
- The register file is accessed during two stages, namely the ID stage and the WB stage. The hardware should allow a maximum of two reads and one write every clock cycle.
- To start a new instruction every cycle, it is necessary to increment and store the PC every cycle.

Figure 2.2 Diagram indicating the cycle and functional unit of each stage.

Buffers or registers are introduced between successive stages of the pipeline so that at the end of a clock cycle the results from one stage are stored into a register (see Figure 2.3). During the next clock cycle, the next stage uses the contents of these buffers as input. Figure 2.4 visualizes the pipeline activity.

Figure 2.3 Functional units of 5 stage Pipeline. IF/ID is a buffer between IF and ID stage.
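The role of the inter-stage buffers can be sketched as a shift of values between pipeline registers on each clock edge (a toy sketch: the "payload" carried between stages is just an instruction label, which is an assumption for brevity):

```python
def tick(latches, fetched):
    """Latch each stage's result into the next pipeline register, and
    capture the newly fetched instruction in IF/ID."""
    return {"IF/ID": fetched,
            "ID/EX": latches["IF/ID"],
            "EX/MEM": latches["ID/EX"],
            "MEM/WB": latches["EX/MEM"]}

latches = {"IF/ID": None, "ID/EX": None, "EX/MEM": None, "MEM/WB": None}
for name in ["i", "i+1", "i+2"]:
    latches = tick(latches, name)
print(latches)  # {'IF/ID': 'i+2', 'ID/EX': 'i+1', 'EX/MEM': 'i', 'MEM/WB': None}
```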


Basic Performance issues in Pipelining

Pipelining increases the CPU instruction throughput, but it does not reduce the execution time of an individual instruction. In fact, pipelining increases the execution time of each instruction due to overhead in the control of the pipeline. Pipeline overhead arises from the combination of register delays and clock skew. Imbalance among the pipe stages reduces performance, since the clock can run no faster than the time needed for the slowest pipeline stage.
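The clock-rate constraint above can be sketched numerically (the stage times and overhead below are made-up values, used only to show that the slowest stage plus overhead sets the cycle time):

```python
def pipeline_clock(stage_times_ns, overhead_ns):
    """Pipeline clock cycle = slowest stage time + per-stage overhead
    (register delay and clock skew)."""
    return max(stage_times_ns) + overhead_ns

# Five imbalanced stages: the 2.5 ns stage dictates the clock.
print(pipeline_clock([2.0, 1.5, 2.5, 2.0, 1.0], 0.3))  # 2.8
```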

Figure 2.4 Pipeline activity

Pipeline Hazards
Hazards may cause the pipeline to stall. When an instruction is stalled, all the instructions issued later than the stalled instruction are also stalled. Instructions issued earlier than the stalled instruction continue in the normal way. No new instructions are fetched during the stall.
A hazard is a situation that prevents the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce pipeline performance.


Performance with Pipeline stall

A stall causes the pipeline performance to degrade from the ideal performance. The performance improvement from pipelining is obtained from:

Speedup = Average instruction time unpipelined / Average instruction time pipelined

Speedup = (CPI unpipelined × Clock cycle unpipelined) / (CPI pipelined × Clock cycle pipelined)

CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
              = 1 + Pipeline stall clock cycles per instruction

Assume that:
i) the cycle time overhead of the pipeline is ignored, and
ii) the stages are balanced.
With these assumptions,

Clock cycle unpipelined = Clock cycle pipelined

Therefore,

Speedup = CPI unpipelined / CPI pipelined
        = CPI unpipelined / (1 + Pipeline stall cycles per instruction)

If all instructions take the same number of cycles, equal to the number of pipeline stages (the depth of the pipeline), then:

CPI unpipelined = Pipeline depth
Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)

If there are no pipeline stalls, the pipeline stall cycles per instruction are zero, and therefore:

Speedup = Depth of the pipeline.
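The final formula above can be sketched directly (assuming balanced stages and no cycle-time overhead, as in the derivation; the stall count of 0.25 in the second example is a hypothetical workload figure):

```python
def speedup(pipeline_depth, stall_cycles_per_instr):
    """Speedup = pipeline depth / (1 + pipeline stall cycles per instruction)."""
    return pipeline_depth / (1 + stall_cycles_per_instr)

print(speedup(5, 0))     # 5.0: no stalls, so speedup equals the pipeline depth
print(speedup(5, 0.25))  # 4.0: stalls erode the ideal speedup
```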
Types of hazard
The three types of hazards are:
1. Structural hazards
2. Data hazards
3. Control hazards

Structural hazard
Structural hazards arise from resource conflicts, when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution. If some combination of instructions cannot be accommodated because of resource conflicts, the processor is said to have a structural hazard.
A structural hazard arises when some functional unit is not fully pipelined, or when some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute.
For example, if memory is shared for data and instructions, then when an instruction contains a data-memory reference, it conflicts with the instruction fetch of a later instruction (as shown in Figure 2.5a). This causes a hazard, and the pipeline stalls for 1 clock cycle.
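The shared-memory conflict can be sketched with a toy schedule check (an illustrative sketch, not real hardware; the fixed stage offsets — IF at the issue cycle, MEM three cycles later — follow the 5-stage pattern above):

```python
def memory_conflicts(program):
    """program: list of (issue_cycle, is_memory_op).
    With a single shared memory, a load/store issued at cycle c uses data
    memory in cycle c + 3 (its MEM stage), colliding with the IF of the
    instruction issued in that cycle. Returns the conflicting cycles."""
    mem_cycles = {c + 3 for c, is_mem in program if is_mem}
    if_cycles = {c for c, _ in program}
    return sorted(mem_cycles & if_cycles)

# A load issued at cycle 1 reaches MEM in cycle 4, exactly when the
# instruction issued at cycle 4 tries to fetch.
print(memory_conflicts([(1, True), (2, False), (3, False), (4, False)]))  # [4]
```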

Figure 2.5a The load instruction and instruction 3 are accessing memory in clock cycle 4.

Instruction         Clock number
                    1    2    3    4      5    6    7    8    9    10
Load instruction    IF   ID   EXE  MEM    WB
Instruction i+1          IF   ID   EXE    MEM  WB
Instruction i+2               IF   ID     EXE  MEM  WB
Instruction i+3                    stall  IF   ID   EXE  MEM  WB
Instruction i+4                                IF   ID   EXE  MEM  WB

Figure 2.5b A bubble is inserted in clock cycle 4.


A pipeline stall is commonly called a pipeline bubble, or simply a bubble.
Data Hazard
Consider the pipelined execution of the following instruction sequence (timing diagram shown in Figure 2.6):

DADD R1, R2, R3
DSUB R4, R1, R5
AND  R6, R1, R5
OR   R8, R1, R9
XOR  R10, R1, R11


The DADD instruction produces the value of R1 in its WB stage (clock cycle 5), but the DSUB instruction reads the value during its ID stage (clock cycle 3). This problem is called a data hazard.
DSUB may read the wrong value if precautions are not taken. The AND instruction reads the register during clock cycle 4 and will also receive the wrong result.
The XOR instruction operates properly, because its register read occurs in clock cycle 6, after DADD writes in clock cycle 5. The OR instruction also operates without incurring a hazard, because the register file reads are performed in the second half of the cycle whereas the writes are performed in the first half of the cycle.
Minimizing data hazard by Forwarding
The DADD instruction produces the value of R1 at the end of clock cycle 3. The DSUB instruction requires this value only during clock cycle 4. If the result can be moved from the pipeline register where DADD stores it to the point where DSUB needs it (the input of the ALU), then the need for a stall can be avoided. Using a simple hardware technique called data forwarding (also known as bypassing or short-circuiting), data can be made available from the output of the ALU to the point where it is required (the input of the ALU) at the beginning of the immediately following clock cycle.
Forwarding works as follows:
i) The ALU output from the EX/MEM and MEM/WB pipeline registers is always fed back to the ALU inputs.
ii) If the forwarding hardware detects that a previous ALU output serves as a source for the current ALU operation, the control logic selects the forwarded result as the input rather than the value read from the register file.
Forwarded results are required not only from the immediately previous instruction, but also from an instruction that started 2 cycles earlier: the result of the i-th instruction must be forwarded to the (i+2)-th instruction as well.
Forwarding can be generalized to include passing a result directly to the functional unit that requires it.
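The forwarding check in step ii) can be sketched as an operand multiplexer (a sketch of the control logic, not a hardware description; the tuple representation of the pipeline registers is an assumption):

```python
def select_operand(src, regfile_value, ex_mem, mem_wb):
    """Pick the ALU operand for source register `src`.
    ex_mem / mem_wb: (dest_register, value) held in that pipeline
    register, or None if it holds no ALU result."""
    if ex_mem and ex_mem[0] == src:   # result of the immediately previous instruction
        return ex_mem[1]
    if mem_wb and mem_wb[0] == src:   # result of the instruction 2 cycles earlier
        return mem_wb[1]
    return regfile_value              # no hazard: use the register-file read

# DSUB reads R1; the register file still holds the stale value 0, but the
# DADD result (12) sits in EX/MEM and is forwarded instead.
print(select_operand("R1", 0, ("R1", 12), None))  # 12
```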
Data Hazard requiring stalls

LD   R1, 0(R2)
DADD R3, R1, R4
AND  R5, R1, R6
OR   R7, R1, R8

The pipelined datapath for these instructions is shown in the timing diagram (Figure 2.7).


Without a stall (why a stall is needed):

Instruction         1    2    3    4      5    6    7
LD R1, 0(R2)        IF   ID   EXE  MEM    WB
DADD R3, R1, R4          IF   ID   EXE    MEM  WB
AND R5, R1, R6                IF   ID     EXE  MEM  WB
OR R7, R1, R8                      IF     ID   EXE  MEM

With a stall inserted:

Instruction         1    2    3    4      5    6    7    8    9
LD R1, 0(R2)        IF   ID   EXE  MEM    WB
DADD R3, R1, R4          IF   ID   stall  EXE  MEM  WB
AND R5, R1, R6                IF   stall  ID   EXE  MEM  WB
OR R7, R1, R8                      stall  IF   ID   EXE  MEM  WB

Figure 2.7 In the top half, we can see why a stall is needed. In the bottom half, a stall is created to solve the problem.

The LD instruction gets the data from memory at the end of clock cycle 4. Even with forwarding, the data from the LD instruction can be made available at the earliest during clock cycle 5. However, the DADD instruction requires the result of the LD instruction at the beginning of clock cycle 4 (its EXE stage). This would demand data forwarding in negative time, which is not possible. Hence, the situation calls for a pipeline stall.
The result of the LD instruction can be forwarded from the pipeline register to the AND instruction, which begins 2 clock cycles after the LD instruction.
The load instruction has a delay, or latency, that cannot be eliminated by forwarding alone. It is necessary to stall the pipeline by 1 clock cycle. A hardware mechanism called a pipeline interlock detects the hazard and stalls the pipeline until the hazard is cleared. The pipeline interlock preserves the correct execution pattern by introducing a stall, or bubble. The CPI for the stalled instruction increases by the length of the stall. Figure 2.7 shows the pipeline before and after the stall.
The stall causes the DADD to move 1 clock cycle later in time. Forwarding to the AND instruction now goes through the register file, and no forwarding is required for the OR instruction. No instruction is started during clock cycle 4.
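The interlock's detection condition can be sketched as follows (a sketch of the hazard check only, not the full stall machinery; the tuple encoding of instructions is an assumption):

```python
def needs_load_use_stall(prev, curr):
    """prev / curr: (opcode, dest_register, source_registers).
    A load followed immediately by an instruction that reads the loaded
    register forces a one-cycle stall (bubble), since forwarding alone
    cannot hide the load latency."""
    op, dest, _ = prev
    _, _, sources = curr
    return op == "LD" and dest in sources

ld   = ("LD",   "R1", ("R2",))
dadd = ("DADD", "R3", ("R1", "R4"))
xor  = ("XOR",  "R5", ("R6", "R7"))
print(needs_load_use_stall(ld, dadd))  # True: the interlock inserts a bubble
print(needs_load_use_stall(ld, xor))   # False: no dependence on R1
```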
Control Hazard
When a branch is executed, it may or may not change the contents of the PC. If a branch is taken, the content of the PC is changed to the target address. If a branch is not taken, the content of the PC is not changed.
The simplest way of dealing with branches is to redo the fetch of the instruction following a branch. The first IF cycle is essentially a stall, because it never performs useful work.
One stall cycle for every branch yields a performance loss of 10% to 30%, depending on the branch frequency.
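The quoted 10% to 30% range can be reproduced with a small CPI calculation (the branch frequencies are hypothetical workload figures; the ideal CPI of 1 and the one-cycle penalty follow the discussion above):

```python
def cpi_with_branch_stall(branch_freq, penalty=1, ideal_cpi=1.0):
    """Effective CPI = ideal CPI + branch frequency * stall penalty."""
    return ideal_cpi + branch_freq * penalty

print(cpi_with_branch_stall(0.10))  # 1.1 -> 10% performance loss
print(cpi_with_branch_stall(0.30))  # 1.3 -> 30% performance loss
```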
Reducing the Branch Penalties
There are many methods for dealing with the pipeline stalls caused by branch delay:
1. Freeze or flush the pipeline, holding or deleting any instructions after the branch until the branch destination is known. It is a simple scheme, and the branch penalty is fixed and cannot be reduced by software.
2. Treat every branch as not taken, simply allowing the hardware to continue as if the branch were not executed. Care must be taken not to change the processor state until the branch outcome is known. Instructions are fetched as if the branch were a normal instruction. If the branch is taken, it is necessary to turn the fetched instruction into a no-op instruction and restart the fetch at the target address. Figure 2.8 shows the timing diagram of both situations.

Instruction         1    2    3     4     5     6    7    8    9
Untaken branch      IF   ID   EXE   MEM   WB
Instruction i+1          IF   ID    EXE   MEM   WB
Instruction i+2               IF    ID    EXE   MEM  WB
Instruction i+3                     IF    ID    EXE  MEM  WB
Instruction i+4                           IF    ID   EXE  MEM  WB

Instruction         1    2    3     4     5     6    7    8    9
Taken branch        IF   ID   EXE   MEM   WB
Instruction i+1          IF   idle  idle  idle  idle
Branch target                 IF    ID    EXE   MEM  WB
Branch target+1                     IF    ID    EXE  MEM  WB
Branch target+2                           IF    ID   EXE  MEM  WB

Figure 2.8 The predicted-not-taken scheme and the pipeline sequence when the branch is untaken (top) and taken (bottom).


3. Treat every branch as taken: as soon as the branch is decoded and the target address is computed, begin fetching and executing at the target. If the branch target is known before the branch outcome, this scheme has an advantage.
For both the predicted-taken and predicted-not-taken schemes, the compiler can improve performance by organizing the code so that the most frequent path matches the hardware choice.
4. Delayed branch: this technique was commonly used in early RISC processors. In a delayed branch, the execution cycle with a branch delay of one is:
   branch instruction
   sequential successor
   branch target if taken
The sequential successor is in the branch delay slot, and it is executed whether or not the branch is taken. The pipeline behavior with a branch delay is shown in Figure 2.9. Processors with delayed branches normally have a single-instruction delay. The compiler has to make the successor instruction valid and useful; there are three ways in which the delay slot can be filled by the compiler.

Instruction           1    2    3    4    5    6    7    8    9
Untaken branch        IF   ID   EXE  MEM  WB
Branch delay (i+1)         IF   ID   EXE  MEM  WB
Instruction i+2                 IF   ID   EXE  MEM  WB
Instruction i+3                      IF   ID   EXE  MEM  WB
Instruction i+4                           IF   ID   EXE  MEM  WB

Instruction           1    2    3    4    5    6    7    8    9
Taken branch          IF   ID   EXE  MEM  WB
Branch delay (i+1)         IF   ID   EXE  MEM  WB
Branch target                   IF   ID   EXE  MEM  WB
Branch target+1                      IF   ID   EXE  MEM  WB
Branch target+2                           IF   ID   EXE  MEM  WB

Figure 2.9 Timing diagram of the pipeline, showing that the behavior of a delayed branch is the same whether or not the branch is taken.

The limitations on delayed branches arise from:
i) restrictions on the instructions that can be scheduled into delay slots;
ii) the ability to predict at compile time whether a branch is likely to be taken or not taken.
The delay slot can be filled by choosing an instruction:
a) from before the branch instruction;
b) from the target address;
c) from the fall-through path.
The principle of scheduling the branch delay is shown in Figure 2.10.

Figure 2.10 Scheduling the Branch delay

What makes pipelining hard to implement?

Dealing with exceptions: the overlapping of instructions makes it more difficult to know whether an instruction can safely change the state of the CPU. In a pipelined CPU, an instruction's execution extends over several clock cycles. While one instruction is in execution, another instruction may raise an exception that forces the CPU to abort the instructions in the pipeline before they complete.

Types of exceptions:
The term exception is used here to cover the terms interrupt, fault, and exception.
I/O device requests, page faults, invoking an OS service from a user program, integer arithmetic overflow, memory protection violations, hardware malfunctions, power failures, etc. are the different classes of exception.
Individual events have important characteristics that determine what action is needed in response to the exception.
i) Synchronous versus asynchronous: if the event occurs at the same place every time the program is executed with the same data and memory allocation, the event is synchronous. Asynchronous events are caused by devices external to the CPU and memory; such events are handled after the completion of the current instruction.
ii) User requested versus coerced: user-requested exceptions are predictable and can always be handled after the current instruction has completed. Coerced exceptions are caused by some hardware event that is not under the control of the user program. Coerced exceptions are harder to implement because they are not predictable.
iii) User maskable versus user non-maskable: if an event can be masked by a user task, it is user maskable; otherwise it is user non-maskable.
iv) Within versus between instructions: exceptions that occur within instructions are usually synchronous, since the instruction triggers the exception. It is harder to implement exceptions that occur within instructions than those between instructions, since the instruction must be stopped and restarted. Asynchronous exceptions that occur within instructions arise from catastrophic situations and always cause program termination.
v) Resume versus terminate: if the program's execution continues after the interrupt, it is a resuming event; otherwise it is a terminating event. It is easier to implement exceptions that terminate execution.

Stopping and restarting execution:

The most difficult exceptions have two properties:
1. they occur within instructions, and
2. they must be restartable.
For example, a page fault must be restartable and requires the intervention of the OS. Thus the pipeline must be safely shut down, so that the instruction can be restarted in the correct state. If the restarted instruction is not a branch, then we continue to fetch its sequential successors and begin their execution in the normal fashion.
Restarting is usually implemented by saving the PC of the instruction at which to restart. Pipeline control can take the following steps to save the pipeline state safely:
i) Force a trap instruction into the pipeline on the next IF.
ii) Until the trap is taken, turn off all writes for the faulting instruction and for all instructions that follow it in the pipeline. This prevents any state changes for instructions that will not be completed before the exception is handled.
iii) After the exception-handling routine receives control, it immediately saves the PC of the faulting instruction. This value will be used to return from the exception later.
NOTE:
1. With pipelining, multiple exceptions may occur in the same clock cycle because there are multiple instructions in execution.
2. Handling exceptions becomes still more complicated when instructions are allowed to execute out of order.
Pipeline implementation
Every MIPS instruction can be implemented in 5 clock cycles.
1. Instruction fetch cycle (IF)

IR  ← Mem[PC]
NPC ← PC + 4

Operation: send out the [PC] and fetch the instruction from memory into the instruction register (IR). Increment the PC by 4 to address the next sequential instruction.

2. Instruction decode / register fetch cycle (ID)

A   ← Regs[rs]
B   ← Regs[rt]
Imm ← sign-extended immediate field of IR

Operation: decode the instruction and access the register file to read the registers rs and rt. A and B are temporary registers. The operands are kept ready for use in the next cycle.
Decoding is done concurrently with reading the registers. The MIPS ISA has fixed-length instructions, so these fields are at fixed locations.

3. Execution / effective address cycle (EX)

One of the following operations is performed depending on the instruction type.
* Memory reference:

ALUOutput ← A + Imm

Operation: the ALU adds the operands to compute the effective address and places the result into the register ALUOutput.
* Register-Register ALU instruction:

ALUOutput ← A func B

Operation: the ALU performs the operation specified by the function code on the values taken from registers A and B.
* Register-Immediate ALU instruction:

ALUOutput ← A op Imm

Operation: the contents of register A and the Imm register are operated on (function op) and the result is placed in the temporary register ALUOutput.
* Branch:

ALUOutput ← NPC + (Imm << 2)
Cond      ← (A == 0)
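The EX-cycle register transfers above can be sketched with a dictionary standing in for the temporary registers A, B, Imm, and NPC (the values below are hypothetical; the `func` argument stands in for the operation selected by the function code):

```python
def ex_cycle(kind, t, func=None):
    """Sketch of the EX-cycle register transfers.
    kind: 'mem', 'reg-reg', 'reg-imm', or 'branch'.
    t: temporaries from the ID cycle, e.g. {'A': ..., 'B': ..., 'Imm': ..., 'NPC': ...}."""
    if kind == "mem":        # effective address: ALUOutput <- A + Imm
        return {"ALUOutput": t["A"] + t["Imm"]}
    if kind == "reg-reg":    # ALUOutput <- A func B
        return {"ALUOutput": func(t["A"], t["B"])}
    if kind == "reg-imm":    # ALUOutput <- A op Imm
        return {"ALUOutput": func(t["A"], t["Imm"])}
    if kind == "branch":     # ALUOutput <- NPC + (Imm << 2); Cond <- (A == 0)
        return {"ALUOutput": t["NPC"] + (t["Imm"] << 2), "Cond": t["A"] == 0}

t = {"A": 8, "B": 3, "Imm": 4, "NPC": 100}
print(ex_cycle("mem", t))     # {'ALUOutput': 12}
print(ex_cycle("branch", t))  # {'ALUOutput': 116, 'Cond': False}
```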

