
Chapter 6

SIMD Machines:
Pipeline System
Pipelining is a hardware (H/W) implementation technique in which multiple instructions are overlapped in execution. It is a particularly effective way of organizing concurrent activities in a computer system.

Aside:
Sequential execution of instructions is done by using a single H/W circuit:
Fetch I → Decode I → Execute I → Write the result of instruction I
Pipelining Properties
- The number of operations performed per second is increased, even though the elapsed time needed to perform any one operation is not changed.
- The computer pipelining system is divided into n stages.
- Each stage completes a part of an instruction in parallel.
- The stages are connected one to the next to form a pipe.
Example: Consider a 4-stage pipeline system as shown in Fig. (6.1)

Fig. (6.1) Hardware organization of a 4-stage pipeline system


The four stages are:
F: Fetch- To read (fetch) an instruction from the memory.
D: Decode- Decode the instruction and fetch the source operands.
E: Execute- Perform the operation specified by the instruction.
W: Write- Store the result in destination location.
In this configuration, interstage buffer registers (B1, B2, and B3) are placed between each pair of stages so that the result computed by one stage can serve as an input to the next stage during the next period.
Example: Consider a 4-stage pipeline system with four instructions in progress at any given time. This means four distinct hardware units (stages) are needed, as shown in Fig. (6.2). These units must be capable of performing their tasks simultaneously and without interfering with one another. Information is passed from one unit to the next through a storage buffer.

Fig. (6.2) Instruction execution diagram of a 4-stage pipeline system (ideal case)
Notes
1) Each stage in a pipeline is expected to complete its operation in one clock cycle. Hence,
the clock period should be sufficiently long to complete the task being performed in any
stage.

• If different units require different amounts of time, the clock period must allow the longest task to be completed.
• A stage that completes its task early is idle for the remainder of the clock period.
2) The results obtained when instructions are executed in a pipelined processor are identical
to those obtained when the same instructions are executed sequentially.
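The ideal timing of Fig. (6.2) can be sketched in a few lines of Python (the names and structure here are illustrative, not from the text): in a hazard-free pipeline, instruction i (counting from 1) occupies stage s (counting from 1) during clock cycle i + s − 1.

```python
# Ideal n-stage pipeline: instruction i occupies stage s in cycle i + s - 1
# (both counted from 1). A sketch of the timing in Fig. (6.2).
STAGES = ["F", "D", "E", "W"]

def ideal_schedule(n_instructions, stages=STAGES):
    """Map each instruction to the clock cycle in which it uses each stage."""
    return {
        i: {stage: i + s for s, stage in enumerate(stages)}
        for i in range(1, n_instructions + 1)
    }

sched = ideal_schedule(4)
print(sched[1])  # {'F': 1, 'D': 2, 'E': 3, 'W': 4}
print(sched[4])  # {'F': 4, 'D': 5, 'E': 6, 'W': 7}
```

Note how I1 completes in cycle 4 (the pipeline depth) and each later instruction completes one cycle after the previous one; this one-completion-per-cycle behavior is the source of the speedup discussed next.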
PIPELINE PERFORMANCE
∙ In the ideal case, a pipelined processor completes the processing of one instruction in each clock cycle, which means that the rate of instruction processing is n times that of sequential operation, where n is the number of pipeline stages.

∙ The potential increase in performance resulting from pipelining is proportional to the number of
pipeline stages.

Fig. Sequential processor vs. pipelined processor: in the sequential processor, Fetch + Decode + Execute + Write complete for one instruction before the next begins; in the pipelined processor these phases overlap.
Pipeline Hazards
For a variety of reasons, one of the pipeline stages may not be able to complete its processing task for a given instruction in the time allotted. This is called a "hazard".
- Hazards prevent the next instruction in the instruction stream from executing during its designated clock cycle.
- Hazards reduce performance from the ideal speedup gained by pipelining because they cause pipeline stalls to be inserted.

There are three classes of pipeline hazards: Data hazards, Control hazards, and Structural hazards.
1) Data Hazards
A data hazard is a situation in which the pipeline is stalled because the data to be operated on are delayed for some reason, such as a data dependency.
There are two types of data dependency:
a) Explicit
b) Implicit
Example (Explicit data dependency): Consider a program that contains the following instructions. The data dependency just described arises when the destination of one instruction is used as a source in the next instruction. Draw the instruction execution diagram.
I1: Mul R2,R3,R4    ; R4 ← R2 * R3
I2: Add R5,R4,R6    ; R6 ← R4 + R5
I3: Mov R5, M[addr] ; M[addr] ← R5
:
The data dependency is on R4: I1 (Mul) writes R4 and I2 (Add) reads it.

Fig. (6.3) Pipeline stalled by data dependency between D2 and W1
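The explicit dependency above can be detected mechanically. A minimal Python sketch (the tuple format and function name are my own, not from the text), using this chapter's destination-last convention:

```python
# Instruction as (op, src1, src2, dest) -- destination operand given last,
# matching the convention used in this chapter.
def has_data_hazard(producer, consumer):
    """True if `consumer` reads the register that `producer` writes."""
    return producer[3] in (consumer[1], consumer[2])

i1 = ("Mul", "R2", "R3", "R4")
i2 = ("Add", "R5", "R4", "R6")
i3 = ("Mov", "R5", None, "M[addr]")

print(has_data_hazard(i1, i2))  # True  -- I2 reads R4, which I1 writes
print(has_data_hazard(i1, i3))  # False -- I3 does not read R4
```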


Example: (Implicit data dependency)
Consider a program that contains the following instructions. Here the dependency does not appear in the register operands: instruction I2 implicitly depends on I1 through the carry flag (CY) that I1 sets. Draw the instruction execution diagram.
I1: Add R1,R2,R3 ; R3 ← R1 + R2
I2: Adc R4,R5,R6 ; R6 ← R4 + R5 + CY   (data dependence through the carry flag CY)
I3: Mov R5, R1   ; R1 ← R5
:
Eliminating Data Hazards: There are two approaches to eliminating data hazards.
A) H/W Handling (OPERAND FORWARDING)
- If the hardware of the pipeline system is arranged so that the result of the source instruction I1 (produced in E1) is forwarded directly for use in the execution stage of the destination instruction (e.g., E2), the data hazard can be eliminated.

Fig. (6.4a) shows a part of the processor datapath involving the ALU and the register file. The registers SRC1 and SRC2 constitute the interstage buffers needed for pipelined operation, as illustrated in Fig. (6.4b).

I1: Mul R2,R3,R4   Source instruction
I2: Add R5,R4,R6   Destination instruction

Fig. (6.4b) Operand forwarding technique in a pipelined processor: stages F (fetch instruction), D (decode instruction and fetch operands), E (execute, ALU), and W (write), with interstage buffers B1, B2, and B3; registers SRC1 and SRC2 feed the ALU, RSLT holds its result, and a forwarding path routes RSLT back to the ALU inputs.


- As shown in Fig. (6.4b), registers SRC1 and SRC2 are part of buffer B2 and RSLT is part of B3.
- The two multiplexers connected at the inputs to the ALU allow the data on the destination bus to be
selected instead of the contents of either the SRC1 or SRC2 register.
- After decoding instruction I2 and detecting the data dependency, a decision is made to use data forwarding. The operand not involved in the dependency, register R5, is read and loaded into register SRC1 in clock cycle 3.
- In the next clock cycle, the product produced by instruction I1 is available in register RSLT, and because of the forwarding connection, it can be used in step E2. Hence, execution of I2 proceeds without interruption.
B) HANDLING DATA HAZARDS IN SOFTWARE
In this case, the compiler can introduce the two-cycle delay needed between the instructions I1
and I2 by inserting NOP (No-operation) instructions, as follows:

I1: Mul R2,R3,R4
    NOP
    NOP            ← two-cycle delay covering the data dependency on R4
I2: Add R5,R4,R6

If the responsibility for detecting such dependencies is left entirely to the software, the
compiler must insert the NOP instructions to obtain a correct result.
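Such a compiler pass can be sketched as follows (a simplified illustration in the same hypothetical destination-last tuple format; it only checks adjacent instruction pairs):

```python
NOP = ("NOP", None, None, None)

def insert_nops(program, delay=2):
    """Insert `delay` NOPs after any instruction whose destination register
    is read by the immediately following instruction.
    Instructions are (op, src1, src2, dest) tuples, destination last."""
    out = []
    for cur, nxt in zip(program, program[1:] + [None]):
        out.append(cur)
        if nxt is not None and cur[3] in (nxt[1], nxt[2]):
            out.extend([NOP] * delay)
    return out

prog = [("Mul", "R2", "R3", "R4"), ("Add", "R5", "R4", "R6")]
print([ins[0] for ins in insert_nops(prog)])
# ['Mul', 'NOP', 'NOP', 'Add']
```

A real compiler would prefer to fill the delay slots with useful independent instructions rather than NOPs whenever it can find them.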

2- Instruction (or Control) Hazards


The pipelined processor may also be stalled because of a delay in the availability of an instruction. An instruction hazard occurs when there is a cache miss, which requires the instruction to be fetched from main memory. A control hazard may occur because of branch instructions.
a) Effect of a cache miss: The effect of a cache miss on pipelined operation is illustrated in Fig. (6.5). Assume here that there is a cache miss in fetching instruction I2. Instruction I1 is fetched from the cache in cycle 1, and its execution proceeds normally. However, the fetch operation for instruction I2, which starts in cycle 2, results in a cache miss.

Fig. (6.5) Instruction execution diagram for a pipelined processor stalled by a cache miss in F2

The instruction fetch stage must now suspend any further fetch requests and wait for instruction I2 to arrive. We assume that instruction I2 is received and loaded into buffer B1 at the end of cycle 5. The pipeline resumes its normal operation at that point.

b) Effect of Branch instructions (consider Unconditional branch only)


A branch instruction may also cause the pipeline system to stall. The time lost as a result of a branch
instruction is often referred to as the branch penalty.

For example, Fig. (6.6a) shows the effect of a branch instruction (I2: branch to Ik) on a four-stage pipeline system.
- Assume that the branch address is computed in step E2. Instructions I3 and I4 must be discarded, and the target instruction, Ik, is fetched in clock cycle 5. Thus, the branch penalty is two clock cycles.

Fig. (6.6a) Pipelined processor stalled by branch timing


Reducing the branch penalty requires the branch address to be computed earlier in the pipeline.
- Typically, the instruction fetch unit has dedicated hardware to identify a branch instruction and
compute the branch target address as quickly as possible after an instruction is fetched.

- With this additional hardware, both of these tasks can be performed in step D2, leading to the sequence of events shown in Fig. (6.6b). In this case, the branch penalty is only one clock cycle.

Fig. (6.6b) Instruction execution diagram after reducing the branch penalty
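The two cases in Figs. (6.6a) and (6.6b) follow one rule: the penalty equals the number of pipeline stages entered after fetch but before the branch target is known. A tiny sketch (stage names as in this chapter; the function itself is illustrative):

```python
STAGES = ("F", "D", "E", "W")

def branch_penalty(resolve_stage, stages=STAGES):
    """Cycles lost to an unconditional branch = number of instructions
    fetched (and later discarded) before the target address is computed
    in `resolve_stage`."""
    return stages.index(resolve_stage)

print(branch_penalty("E"))  # 2 -> target computed in Execute (Fig. 6.6a)
print(branch_penalty("D"))  # 1 -> target computed in Decode  (Fig. 6.6b)
```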
Reducing Instruction Hazards
- Using an Instruction Queue and Prefetching
In general, either a cache miss or a branch instruction stalls the pipeline for one or more clock cycles.
- To reduce the effect of these stalls, many processors employ sophisticated fetch units that can fetch instructions before they are needed and put them in a queue.
* Typically, this queue, called the instruction queue, can store several instructions.
* The fetch unit attempts to keep the instruction queue filled at all times, to reduce the impact of occasional delays when fetching instructions.
* Further, the fetch unit must have sufficient decoding and processing capability to recognize and execute branch instructions.
- A separate unit, which we call the dispatch unit, takes instructions from the front of the queue and sends them to the execution unit. This leads to the organization shown in Fig. (6.7). The dispatch unit also performs the decoding function.
Fig. (6.7) Use of an instruction queue in the hardware organization

What is the effect of the instruction queue and prefetching on pipeline hazards?
∙ When the pipeline stalls because of a data hazard (for example), the dispatch unit is not able to
issue instructions from the instruction queue. However, the fetch unit continues to fetch
instructions and add them to the queue.

∙ Conversely, if there is a delay in fetching instructions because of a cache miss, the dispatch unit
continues to issue instructions from the instruction queue.

3- Structural hazard
This is a situation in which two instructions require the use of a given hardware resource at the same time. The most common case in which a structural hazard may arise is in access to memory. One instruction may need to access memory as part of the Execute or Write stage while another instruction is being fetched. If instructions and data reside in the same cache unit, only one instruction can proceed and the other is delayed. Many processors use separate instruction and data caches to avoid this delay.
(In the conflict shown in Fig. 6.8, both stages require the use of the same data bus at the same time.)

Fig. 6.8. Structural hazards are avoided by providing sufficient hardware resources on the processor chip
Example: Consider the following sequence of instructions

I1 Add 0A,R0,R1
I2 Mul 3,R2,R3
I3 And 3A,R2,R4
I4 Add R0,R2,R5
I5 Sub R5, R4,R4
I6 Mov R5, [3000]
I7 Mov R2, [2500]

In all instructions, the destination operand is given last. Initially, registers R0 and R2 contain 14 and 5B (hex), respectively. These instructions are executed in a computer that has a four-stage pipeline. Assume that the first instruction is fetched in clock cycle 1, that an instruction fetch normally requires only one clock cycle, and that there is a cache miss in fetching instruction I2. Note that the time needed to fetch an instruction in the case of a cache miss is 5 clock cycles.
(a) Draw the instruction execution diagram.
(b) Give the contents of the interstage buffers, B1, B2, and B3, during clock cycles 2 and 10.
(a) Instruction execution diagram. F2 extends over cycles 2-6 because of the cache miss; I5 stalls one cycle in D (shown as D5 D5) because it needs R5, which I4 writes in W4.

Clock   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
I1      F1  D1  E1  W1
I2          F2  F2  F2  F2  F2  D2  E2  W2
I3                              F3  D3  E3  W3
I4                                  F4  D4  E4  W4
I5                                      F5  D5  D5  E5  W5
I6                                          F6      D6  E6  W6
I7                                              F7      D7  E7  W7

Fig. Instruction execution diagram

(b) Contents of the interstage buffers:

Clock cycle 2:  B1: I1 (fetched)   B2: I1 (after decode step)   B3: nothing
Clock cycle 10: B1: I6 (fetched)   B2: I5 (after decode step)   B3: R5 = 6F (the result of executing I4: R5 ← R0 + R2 = 14 + 5B)
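The timing of part (a) can be checked with a small simulator (an illustrative sketch of my own: it models only the variable fetch time, not the data-hazard stall of I5):

```python
def schedule(fetch_cycles, n_stages=4):
    """done[i][s] = clock cycle in which instruction i completes stage s.
    A stage starts once the instruction's previous stage and the previous
    instruction's use of the same stage have both finished."""
    done = []
    for i, f in enumerate(fetch_cycles):
        row = []
        for s in range(n_stages):
            duration = f if s == 0 else 1   # only fetch can be stretched
            after_prev_stage = row[s - 1] if s > 0 else 0
            after_prev_inst = done[i - 1][s] if i > 0 else 0
            start = max(after_prev_stage, after_prev_inst) + 1
            row.append(start + duration - 1)
        done.append(row)
    return done

# I1 fetches normally; I2 suffers the cache miss (fetch takes 5 cycles).
d = schedule([1, 5, 1, 1])
print(d[0])  # [1, 2, 3, 4]   -> I1: F1 D1 E1 W1
print(d[1])  # [6, 7, 8, 9]   -> I2: fetch spans cycles 2-6, then D2 E2 W2
print(d[3])  # [8, 9, 10, 11] -> I4 executes (E4) in cycle 10, as in part (b)
```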
Homework: Consider the following sequence of instructions

I1 Mov 09,R0
Loop: I2 INC R0
I3 Add R0,R1,R1
I4 Adc R2,R0,R2
I5 OR R2, 89, R4
I6 Jmp Loop
I7 Mov R1, [2500]
I8 Mov [2000], R3

In all instructions, the destination operand is given last. Initially, registers R1 and R2 contain 18 and 20, respectively. These instructions are executed in a computer that has a four-stage pipeline. Assume that the first instruction is fetched in clock cycle 1, that an instruction fetch normally requires only one clock cycle, and that there is a cache miss in fetching instruction I5. Note that the time needed to fetch an instruction in the case of a cache miss is 5 clock cycles, and that the first-level cache is of the split type.

(a) Draw the instruction execution diagram.


(b) Give the contents of the interstage buffers, B1, B2, and B3, during clock cycles 7 and 13.
Performance Measurements
The execution time, T, of a program is given by

T = (N × S) / R

where
N: dynamic instruction count,
S: the average number of clock cycles the processor takes to fetch and execute one instruction, and
R: the clock rate.
Note: This simple model assumes that instructions are executed one after the other, with no overlap.

Instruction Throughput: the number of instructions executed per second (used to measure the speed of a pipelined system).
A- Sequential Processor Throughput
For sequential execution, the throughput Ps is given by

Ps = R / S
B- Pipeline Throughput
1) Effect of a Unified Cache
Let TI be the time between two successive instruction completions.
Note: For sequential execution, TI = S.
However, in the absence of hazards, a pipelined processor completes the execution of one instruction each clock cycle; thus,

TI = 1 cycle

A cache miss stalls the pipeline by an amount equal to the cache miss penalty. This means that the value of TI increases by an amount equal to the cache miss penalty for the instruction in which the miss occurs.
Note: a cache miss can occur for either an instruction fetch or a data access.
Consider a computer that has a unified cache for both instructions and data, and let d be the percentage of instructions that refer to data operands in memory. The average increase in the value of TI as a result of cache misses is given by

δmiss = ((1 − hi) + d(1 − hd)) × Mp

where hi and hd are the hit ratios for instructions and data, respectively, and Mp is the cache miss penalty in clock cycles. The pipelined throughput is then Pp = R / (1 + δmiss).
2) Effect of Two-Level Caches
Reducing the cache miss penalty is particularly worthwhile in a pipelined processor. This can be achieved by introducing a secondary cache between the primary, on-chip cache and main memory. A miss in the primary cache for which the required block is found in the secondary cache introduces a penalty Ms. In the case of a miss in the secondary cache, the full penalty Mp is incurred. Assuming a hit rate hs in the secondary cache, the average increase in TI is

δmiss = ((1 − hi) + d(1 − hd)) × (hs·Ms + (1 − hs)·Mp)

Example 1: A typical computer system has a clock period of 1.25 ns and a unified cache for instructions and data. Assume that 33% of the instructions access data in memory, the instruction hit rate is 95%, the data hit rate is 92%, and the miss penalty is 16 clock cycles. Determine the following:
- Pipelined processor throughput.
- Non-pipelined processor throughput.
Solution: clock rate R = 1/clock period = 1/1.25 ns = 800 MHz
With 33% of the instructions accessing data in memory, δmiss is

δmiss = ((1 − 0.95) + 0.33(1 − 0.92)) × 16
      = (0.05 + 0.0264) × 16 = 1.2224 cycles

Taking this delay into account, the pipelined processor's throughput would be

Pp = 800 / (1 + 1.2224) = 360 MIPS

- Non-pipelined processor throughput:

Ps = 800 / (4 + 1.2224) = 153.2 MIPS

Pipeline performance = Pp / Ps = 360 / 153.2 ≈ 2.4
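The arithmetic of Example 1 can be reproduced directly (the variable names below are mine):

```python
# Example 1: unified cache.
R = 800              # clock rate in MHz (1 / 1.25 ns)
hi, hd = 0.95, 0.92  # instruction and data hit rates
d, M = 0.33, 16      # data-access fraction, miss penalty (cycles)
S = 4                # cycles per instruction on the sequential machine

delta = ((1 - hi) + d * (1 - hd)) * M  # average stall cycles per instruction
Pp = R / (1 + delta)                   # pipelined throughput (MIPS)
Ps = R / (S + delta)                   # sequential throughput (MIPS)

print(round(delta, 4))    # 1.2224
print(round(Pp))          # 360
print(round(Ps, 1))       # 153.2
print(round(Pp / Ps, 2))  # 2.35 (i.e. approximately 2.4)
```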

Example 2: Consider a processor that uses a 4-stage pipeline and two levels of cache (L1 and L2), with a clock period of 1.25 ns.
The L1 cache is a unified cache for instructions and data, with 33% of the instructions accessing data in memory. Assume that the instruction hit rate is 95%, the data hit rate is 92%, and the miss penalty from main memory is 16 clock cycles. The time needed to transfer an 8-word block from the L2 cache is 9 ns, the penalty for a miss served by the L2 cache is 5 clock cycles, and the L2 cache hit rate is 91%. Determine the following:

- Pipelined processor throughput.


- Non-pipelined processor throughput.
- Pipeline Performance
Solution: clock rate R = 1/clock period = 1/1.25 ns = 800 MHz
With 33% of the instructions accessing data in memory, δmiss is

δmiss = ((1 − 0.95) + 0.33(1 − 0.92)) × (0.91 × 5 + (1 − 0.91) × 16)
      = (0.05 + 0.0264) × (4.55 + 1.44)
      = 0.0764 × 5.99 = 0.458 cycles

Taking this delay into account:

Pipelined processor throughput:

Pp = 800 / (1 + 0.458) = 549 MIPS

Non-pipelined processor throughput:

Ps = 800 / (4 + 0.458) = 179.5 MIPS

Pipeline performance = Pp / Ps = 549 / 179.5 = 3.06
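Similarly for Example 2 (again with my own variable names; this confirms that δmiss ≈ 0.458 cycles is the value consistent with the 549 MIPS figure):

```python
# Example 2: two-level cache.
R = 800                   # clock rate in MHz
hi, hd, d = 0.95, 0.92, 0.33
hs, Ms, Mp = 0.91, 5, 16  # L2 hit rate, L2-served penalty, main-memory penalty
S = 4

delta = ((1 - hi) + d * (1 - hd)) * (hs * Ms + (1 - hs) * Mp)
Pp = R / (1 + delta)
Ps = R / (S + delta)

print(round(delta, 3))    # 0.458
print(round(Pp))          # 549
print(round(Ps, 1))       # 179.5
print(round(Pp / Ps, 2))  # 3.06
```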


Superscalar Operation
A more aggressive approach is to equip the processor with multiple processing units to handle several instructions in parallel in each processing stage.
- With this arrangement, several instructions start execution in the same clock cycle, and the processor is said to use multiple issue.
- Such processors are capable of achieving an instruction execution throughput of more than one instruction per cycle. They are known as superscalar processors. Many modern high-performance processors use this approach.
Example: Consider a processor with two execution units, one for integer and one for floating-point operations. The instruction fetch unit is capable of reading two instructions at a time and storing them in the instruction queue, as shown in Fig. 9.

In each clock cycle, the dispatch unit retrieves and decodes up to two instructions from the front of the queue. If there is one integer instruction, one floating-point instruction, and no hazards, both instructions are dispatched in the same clock cycle.

Fig. 9 A superscalar processor with two execution units.
The pipelined instruction execution timing diagram is shown in Fig. 10. The blue shading indicates operations in the floating-point unit. The floating-point unit takes three clock cycles to complete the floating-point operation specified in I1. The integer unit completes execution of I2 in one clock cycle.

Fig. 10 An example of instruction execution flow in the processor of Fig. 9, assuming no hazards are encountered.
