SIMD Machines: Pipeline System
Pipelining is a H/W implementation technique where multiple instructions are overlapped
in execution. It is a particularly effective way of organizing concurrent activities in a
computer system.
Aside:
Sequential execution of instructions uses a single H/W circuit:
Fetch I → Decode I → Execute I → Write the result of instruction I
Pipelining Properties
- The number of operations performed per second is increased, even though the elapsed time needed to perform any one operation is unchanged.
- The pipeline is divided into n stages.
- Each stage completes a part of an instruction in parallel.
- The stages are connected one to the next to form a pipe.
Example: Consider a 4-stage pipeline system as shown in Fig. (6.1).
Fig. (6.2) Instruction execution diagram of a 4-stage pipeline system (ideal case)
Notes
1) Each stage in a pipeline is expected to complete its operation in one clock cycle. Hence,
the clock period should be sufficiently long to complete the task being performed in any
stage.
• If different units require different amounts of time, the clock period must allow the longest task to be completed.
• A stage that completes its task early is idle for the remainder of the clock period.
2) The results obtained when instructions are executed in a pipelined processor are identical
to those obtained when the same instructions are executed sequentially.
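The clock-period rule in Note 1 can be sketched numerically; the stage delays below are illustrative assumptions, not values from the text.

```python
# Sketch: the pipeline clock period is set by the slowest stage.
# Stage delays (in ns) are illustrative assumptions.
stage_delays = {"Fetch": 0.8, "Decode": 0.6, "Execute": 1.2, "Write": 0.7}

# The clock period must allow the longest task to be completed.
clock_period = max(stage_delays.values())

# A stage that finishes early is idle for the rest of the cycle.
idle_time = {s: clock_period - d for s, d in stage_delays.items()}

print(clock_period)         # 1.2 (ns)
print(idle_time["Decode"])  # Decode sits idle for the remaining 0.6 ns
```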
PIPELINE PERFORMANCE
∙ In the ideal case, a pipelined processor completes the processing of one instruction in each clock cycle, which means that the rate of instruction processing is n times that of sequential operation, where n is the number of pipeline stages.
∙ The potential increase in performance resulting from pipelining is proportional to the number of
pipeline stages.
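A quick sketch of the n-times claim: for m instructions on an n-stage machine, sequential execution takes m·n cycles while an ideal pipeline takes n + (m − 1), so the speedup approaches n as m grows (the instruction counts below are illustrative).

```python
def sequential_cycles(m, n):
    # Each of the m instructions completes all n steps before the next starts.
    return m * n

def pipelined_cycles(m, n):
    # Ideal pipeline: the first instruction takes n cycles,
    # then one instruction completes per cycle.
    return n + (m - 1)

n = 4  # number of pipeline stages
for m in (4, 100, 10_000):
    speedup = sequential_cycles(m, n) / pipelined_cycles(m, n)
    print(m, round(speedup, 2))   # speedup tends toward n = 4
```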
(Figure: instruction execution timing for a sequential processor vs. a pipelined processor)
Pipeline Hazards
For a variety of reasons, one of the pipeline stages may not be able to complete its processing task for a given instruction in the time allotted. This is called a "hazard".
- Hazards prevent the next instruction in the instruction stream from executing during its designated clock cycle.
- Hazards reduce performance from the ideal speedup gained by pipelining, because they cause pipeline stalls to be inserted.
There are three classes of pipeline hazards: Data, Control, and Structural.
1) Data Hazards
A data hazard is a situation in which the pipeline is stalled because the data to be operated on are delayed for some reason, such as a data dependency.
There are two types of data dependency:
a) Explicit
b) Implicit
Example (Explicit data dependency): Consider a program that contains the following two instructions, I1 and I2. The data dependency just described arises when the destination of one instruction is used as a source in the next instruction. Draw the instruction execution diagram.
I1: Mul R2,R3,R4    ; R4 ← R2 * R3
I2: Add R5,R4,R6    ; R6 ← R4 + R5
I3: Mov R5, M[addr] ; M[addr] ← R5
:
:
The result R4 of I1 (Mul) is a source operand of I2 (Add); this is the data dependency.
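One way to see the resulting stall is a small scheduling sketch (a hypothetical 4-stage model with no forwarding; allowing a register to be read in the same cycle its producer writes it is an assumption consistent with the diagrams in this section):

```python
# Hypothetical 4-stage (F, D, E, W) model with no forwarding.
# Assumption: an instruction may read a register in the same cycle
# in which its producer writes it back (write-then-read).
def schedule(instrs):
    """instrs: list of (name, dest, srcs). Returns {name: {stage: cycle}}."""
    timing = {}
    write_cycle = {}              # register -> cycle its value is written
    fetch_free = decode_free = 1
    for name, dest, srcs in instrs:
        f = fetch_free
        d = max(f + 1, decode_free)
        # stall Decode until every source register has been written
        d = max([d] + [write_cycle.get(r, 0) for r in srcs])
        e, w = d + 1, d + 2
        timing[name] = {"F": f, "D": d, "E": e, "W": w}
        write_cycle[dest] = w
        fetch_free, decode_free = f + 1, d + 1
    return timing

t = schedule([
    ("I1", "R4", ["R2", "R3"]),   # Mul R2,R3,R4
    ("I2", "R6", ["R4", "R5"]),   # Add R5,R4,R6  (uses R4 -> hazard)
])
print(t["I2"])   # Decode happens in cycle 4, not 3: one stall cycle
```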
Fig. (6.4a) shows a part of the processor datapath involving the ALU and the register file.
The registers SRC1 and SRC2 constitute the interstage buffers needed for pipelined operation, as illustrated in
Fig. (6.4b)
Forwarding Path
If the responsibility for detecting such dependencies is left entirely to the software, the compiler must insert NOP instructions to obtain a correct result.
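Such a compiler pass might be sketched as follows; the required separation of two instruction slots (GAP = 2) is an assumption tied to a 4-stage pipeline that reads operands in Decode and writes results in Write, with write-then-read allowed in the same cycle.

```python
# Sketch of a compiler pass that inserts NOPs to separate dependent
# instructions. GAP is the number of instruction slots that must
# separate a producer from its consumer; the value 2 is an assumption
# matching the 4-stage model described in the text.
GAP = 2

def insert_nops(instrs):
    """instrs: list of (name, dest, srcs); returns a list with NOPs added."""
    out = []
    last_slot = {}                      # register -> slot of its producer
    for name, dest, srcs in instrs:
        slot = len(out)
        for r in srcs:
            if r in last_slot:          # dependency on an earlier result
                needed = last_slot[r] + GAP - slot
                for _ in range(max(0, needed)):
                    out.append(("NOP", None, []))
                slot = len(out)
        out.append((name, dest, srcs))
        last_slot[dest] = slot
    return out

prog = [("Mul", "R4", ["R2", "R3"]),
        ("Add", "R6", ["R4", "R5"])]
print([i[0] for i in insert_nops(prog)])   # ['Mul', 'NOP', 'Add']
```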
2) Control Hazards
For example, Fig. (6.6a) shows the effect of a branch instruction on a four-stage pipeline system.
- Assume that the branch address is computed in step E2. Instructions I3 and I4 must be discarded, and
the target instruction, Ik, is fetched in clock cycle 5. Thus, the branch penalty is two clock cycles.
- With additional hardware, both of these tasks (computing the branch address and deciding the branch outcome) can be performed in step D2, leading to the sequence of events shown in Fig. (6.6b). In this case, the branch penalty is only one clock cycle.
To be effective, the fetch unit must have sufficient decoding and processing capability to recognize and execute branch instructions.
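The effect of the branch penalty on average throughput can be sketched as a cycles-per-instruction estimate (the 20% branch frequency below is an illustrative assumption, not a value from the text):

```python
def avg_cpi(branch_fraction, branch_penalty):
    # Ideal pipeline: 1 cycle per instruction; each branch adds its penalty.
    return 1 + branch_fraction * branch_penalty

# Assume 20% of instructions are branches (illustrative).
print(avg_cpi(0.20, 2))   # branch address computed in E2: penalty 2 cycles
print(avg_cpi(0.20, 1))   # computed in D2 with extra hardware: penalty 1
```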
What is the effect of an instruction queue and prefetching on pipeline hazards?
∙ When the pipeline stalls because of a data hazard (for example), the dispatch unit is not able to
issue instructions from the instruction queue. However, the fetch unit continues to fetch
instructions and add them to the queue.
∙ Conversely, if there is a delay in fetching instructions because of a cache miss, the dispatch unit
continues to issue instructions from the instruction queue.
3) Structural Hazards
This is a situation in which two instructions require the use of a given hardware resource at the same time. The most common case in which a structural hazard may arise is in access to memory. One instruction may need to access memory as part of the Execute or Write stage while another instruction is being fetched. If instructions and data reside in the same cache unit, only one instruction can proceed and the other is delayed. Many processors use separate instruction and data caches to avoid this delay.
Fig. 6.8 Structural hazards are avoided by providing sufficient hardware resources on the processor chip (in the figure, both stages require the use of the same data bus at the same time).
Example: Consider the following sequence of instructions
I1 Add 0A,R0,R1
I2 Mul 3,R2,R3
I3 And 3A,R2,R4
I4 Add R0,R2,R5
I5 Sub R5, R4,R4
I6 Mov R5, [3000]
I7 Mov R2, [2500]
In all instructions, the destination operand is given last. Initially, registers R0 and R2 contain
14 and 5B, respectively. These instructions are executed in a computer that has a four-stage
pipeline. Assume that the first instruction is fetched in clock cycle 1, that instruction fetch requires only one clock cycle, and that there is a cache miss in fetching instruction I2.
Note that the time needed to fetch an instruction in the case of a cache miss is 5 clock cycles.
(a) Draw the instruction execution diagram.
(b) Give the contents of the interstage buffers, B1, B2, and B3, during clock cycles 2 and 10.
Clock  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
I1     F1 D1 E1 W1
I2        F2 F2 F2 F2 F2 D2 E2 W2
I3                       F3 D3 E3 W3
I4                          F4 D4 E4 W4
I5                             F5 D5 D5 E5 W5
I6                                F6    D6 E6 W6
I7                                   F7    D7 E7 W7
(F2 occupies clock cycles 2 to 6 because of the cache miss in fetching I2; D5 appears twice because I5 must wait for R5, the result of I4.)
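The diagram can be cross-checked with a small simulator (a hypothetical model: one instruction fetch at a time, the I2 fetch takes 5 cycles because of the miss, and Decode stalls until every source register has been written, with a write and a read allowed in the same cycle):

```python
def schedule(instrs, fetch_cost):
    """instrs: (name, dest, srcs); fetch_cost: name -> cycles for Fetch."""
    timing, write_cycle = {}, {}
    fetch_free = decode_free = 1
    for name, dest, srcs in instrs:
        f_start = fetch_free
        f_end = f_start + fetch_cost.get(name, 1) - 1
        d = max(f_end + 1, decode_free)
        # stall Decode until every source register has been written
        d = max([d] + [write_cycle.get(r, 0) for r in srcs])
        e, w = d + 1, d + 2
        timing[name] = (f_start, f_end, d, e, w)
        if dest:
            write_cycle[dest] = w
        fetch_free, decode_free = f_end + 1, d + 1
    return timing

# The seven instructions of the example, as (name, dest, srcs).
prog = [("I1", "R1", ["R0"]), ("I2", "R3", ["R2"]), ("I3", "R4", ["R2"]),
        ("I4", "R5", ["R0", "R2"]), ("I5", "R4", ["R5", "R4"]),
        ("I6", None, ["R5"]), ("I7", None, ["R2"])]
t = schedule(prog, {"I2": 5})    # cache miss: I2's fetch takes 5 cycles
print(t["I2"])   # (2, 6, 7, 8, 9): fetch occupies cycles 2-6
print(t["I5"])   # (9, 9, 11, 12, 13): decode delayed to cycle 11 by R5
```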
Fig. Instruction execution diagram
Clock         B1            B2                       B3
Clk cycle 2   I1 (fetched)  I1 (after decode step)   Nothing
Clk cycle 10  I6 (fetched)  I5 (after decode step)   R5 = 6F (i.e., the result of executing I4)
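The B3 entry for clock cycle 10 can be verified directly: with R0 = 14 and R2 = 5B (hexadecimal, as the 6F result implies), I4 (Add R0,R2,R5) yields R5 = 6F. A one-line check:

```python
# Initial register contents (hexadecimal, as given in the example).
regs = {"R0": 0x14, "R2": 0x5B}

# I4: Add R0,R2,R5  ->  R5 = R0 + R2 (the destination is given last)
regs["R5"] = regs["R0"] + regs["R2"]

print(format(regs["R5"], "02X"))   # 6F
```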
Homework: Consider the following sequence of instructions
I1 Mov 09,R0
Loop: I2 INC R0
I3 Add R0,R1,R1
I4 Adc R2,R0,R2
I5 OR R2, 89, R4
I6 Jmp Loop
I7 Mov R1, [2500]
I8 Mov [2000], R3
In all instructions, the destination operand is given last. Initially, registers R1 and R2 contain 18 and 20,
respectively. These instructions are executed in a computer that has a four-stage pipeline. Assume that the first instruction is fetched in clock cycle 1, that instruction fetch requires only one clock cycle, and that there is a cache miss in fetching instruction I5. Note that the time needed to fetch an instruction in the case of a cache miss is 5 clock cycles, and that the first-level cache is of the split type.
Instruction Throughput: the number of instructions executed per second. (It is used to measure the speed of a pipelined system.)
A- Sequential Processor Throughput
For sequential execution, the throughput Ps is given by
    Ps = R / S
where R is the clock rate and S is the number of clock cycles (steps) needed to execute one instruction.
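As a numeric illustration of Ps = R/S (using the 800 MHz clock rate and S = 4 steps that appear in Example 1 later in this section):

```python
R = 800e6   # clock rate in Hz (1 / 1.25 ns)
S = 4       # clock cycles needed per instruction in sequential execution

Ps = R / S  # sequential throughput, instructions per second
print(Ps / 1e6)   # 200.0 (MIPS)
```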
B- Pipeline Throughput
1) Effect of a Unified Cache
Let TI be the average time between two successive instruction completions.
Note: For sequential execution, TI = S cycles.
However, in the absence of hazards, a pipelined processor completes the execution of one instruction each clock cycle; thus,
    TI = 1 cycle
A cache miss stalls the pipeline by an amount equal to the cache miss penalty. This means that the value of TI increases by an amount equal to the cache miss penalty for the instruction in which the miss occurs.
Note: a cache miss can occur for either an instruction fetch or a data access.
Consider a computer that has a unified cache for both instructions and data, and let d be the percentage of instructions that refer to data operands in the memory. The average increase in the value of TI as a result of cache misses is given by
    δmiss = ((1 − hi) + d(1 − hd)) × Mp
where hi and hd are the hit ratios for instructions and data, respectively, and Mp is the cache miss penalty in clock cycles. The pipeline throughput is then Pp = R / (1 + δmiss).
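The average stall per instruction can be written as a small function; the two terms add the instruction-fetch misses (every instruction is fetched) and the data-access misses (only a fraction d of instructions access data), each multiplied by the miss penalty Mp:

```python
def delta_miss(hi, hd, d, Mp):
    """Average pipeline stall (in cycles) per instruction, unified cache.

    hi, hd : instruction / data hit ratios
    d      : fraction of instructions that access data in memory
    Mp     : cache miss penalty in clock cycles
    """
    instruction_misses = (1 - hi)      # every instruction is fetched
    data_misses = d * (1 - hd)         # only a fraction d access data
    return (instruction_misses + data_misses) * Mp

# Sanity check: with perfect caches there are no stalls.
print(delta_miss(1.0, 1.0, 0.33, 16))   # 0.0
```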
2) Effect of Two-Level Caches
Reducing the cache miss penalty is particularly worthwhile in a pipelined processor. This can be achieved by introducing a secondary cache between the primary, on-chip cache and the main memory. A miss in the primary cache for which the required block is found in the secondary cache introduces a penalty Ms. In the case of a miss in the secondary cache, the full penalty Mp is still incurred. Assuming a hit rate hs in the secondary cache, the average increase in TI is
    δmiss = ((1 − hi) + d(1 − hd)) × (hs Ms + (1 − hs) Mp)
Example 1: A typical computer system has a clock period of 1.25 ns and a unified cache for instructions and data. Assume that 33% of the instructions access data in memory, the instruction hit rate is 95%, the data hit rate is 92%, and the miss penalty is 16 clock cycles. Determine the following:
- Pipelined processor throughput.
- Non-pipelined processor throughput.
Solution: clock rate R = 1/(clock period) = 1/1.25 ns = 800 MHz
With 33% of the instructions accessing data in memory, δmiss is
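The computation can be completed as a sketch; treating the non-pipelined machine as taking S = 4 cycles per instruction plus the same miss stalls is an assumption about the intended solution method.

```python
R = 800e6            # clock rate (Hz)
hi, hd = 0.95, 0.92  # instruction / data hit ratios
d, Mp = 0.33, 16     # data-access fraction, miss penalty (cycles)

# Average stall per instruction due to cache misses (unified cache).
delta = ((1 - hi) + d * (1 - hd)) * Mp      # about 1.2224 cycles

# Pipelined: one instruction per cycle plus the average stall.
Pp = R / (1 + delta)
print(round(Pp / 1e6, 1))    # about 360.0 MIPS

# Non-pipelined: S = 4 cycles per instruction plus the same stalls
# (assumption: misses cost the same in sequential execution).
S = 4
Ps = R / (S + delta)
print(round(Ps / 1e6, 1))    # about 153.2 MIPS
```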
Example 2: Consider a processor that uses a 4-stage pipeline and two levels of cache (L1 and L2), with a clock period of 1.25 ns.
The L1 cache is a unified cache for instructions and data, with 33% of the instructions accessing data in memory. Assume that the instruction hit rate is 95%, the data hit rate is 92%, and the miss penalty is 16 clock cycles. If the time needed to transfer an (8-word) block from the L2 cache is 9 ns, the miss penalty of the L2 cache is 5 clock cycles, and the L2 cache hit rate is 91%, determine the following:
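A sketch of the two-level computation. How the given numbers map onto Ms and Mp is an interpretation, flagged here as an assumption: Ms is taken from the L2 block-transfer time (9 ns, about 8 cycles at 1.25 ns per cycle) and Mp is kept at 16 cycles.

```python
import math

def delta_two_level(hi, hd, d, hs, Ms, Mp):
    # Average increase in TI with a two-level cache: L1 misses are
    # split between L2 hits (penalty Ms) and L2 misses (penalty Mp).
    return ((1 - hi) + d * (1 - hd)) * (hs * Ms + (1 - hs) * Mp)

clock = 1.25e-9                  # 1.25 ns clock period
Ms = math.ceil(9e-9 / clock)     # assumption: Ms from the 9 ns transfer -> 8 cycles
delta = delta_two_level(hi=0.95, hd=0.92, d=0.33, hs=0.91, Ms=Ms, Mp=16)
print(round(delta, 4))           # average extra cycles per instruction
```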
A more aggressive approach is to equip the processor with multiple processing units to handle
several instructions in parallel in each processing stage.
- With this arrangement, several instructions start execution in the same clock cycle, and the
processor is said to use multiple-issue.
- Such processors are capable of achieving an instruction execution throughput of more than one instruction per cycle. They are known as superscalar processors. Many modern high-performance processors use this approach.
Example: Consider a processor with two execution units, one for integer and one for floating-point operations.
The Instruction fetch unit is capable of reading two instructions at a time and storing them in the instruction
queue as shown in Fig. 9
In each clock cycle, the dispatch unit retrieves and decodes up to two instructions from the front of the queue. If there is one integer instruction, one floating-point instruction, and no hazards, both instructions are dispatched in the same clock cycle.
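The dispatch rule can be sketched as a toy model (hypothetical: instructions are tagged with their execution unit, up to one integer and one floating-point instruction issue per cycle, and hazard checks are omitted):

```python
from collections import deque

def dispatch(queue):
    """Issue up to 2 instructions per cycle: at most one 'int', one 'fp'.

    queue: deque of (name, unit) taken strictly from the front, as in
    the in-order dispatch unit of Fig. 9. Returns each instruction's
    issue cycle.
    """
    issue_cycle, cycle = {}, 1
    while queue:
        used = set()
        # look at (up to) the two instructions at the front of the queue
        while queue and len(used) < 2 and queue[0][1] not in used:
            name, unit = queue.popleft()
            issue_cycle[name] = cycle
            used.add(unit)
        cycle += 1
    return issue_cycle

q = deque([("I1", "int"), ("I2", "int"), ("I3", "fp")])
print(dispatch(q))   # I1 alone (two integer ops conflict); then I2 + I3
```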
Fig. 10 An example of instruction execution flow in the processor of Fig. 9, assuming no hazards are encountered.