
Chapter 6

Pipelining and
Superscalar Techniques.
Dr. Manjunath Kotari
Professor & Head-CSE
Linear Pipeline

• Processing stages are linearly connected
• Perform a fixed function
• Synchronous pipeline
  • Clocked latches between stage i and stage i+1
  • Equal delays in all stages
• Asynchronous pipeline (handshaking)
Latches

[Diagram: S1 → L1 → S2 → L2 → S3; latches L1 and L2 separate adjacent stages]

• The slowest stage determines the pipeline delay.
• Equal delays in all stages → the common stage delay sets the clock period.


Linear Pipeline Processors

• A linear pipeline processor is a cascade of processing stages which are linearly connected to perform a fixed function over a stream of data flowing from one end to the other.
• Applied for
  • Instruction execution
  • Arithmetic computation
  • Memory-access operations
Asynchronous and Synchronous
Models
• Depending on the control of data along the pipeline,
we model linear pipelines in two categories
• Asynchronous
• Synchronous
Asynchronous Model
• Data flow between adjacent stages in an
asynchronous pipeline is controlled by a handshaking
protocol.
• When stage Si is ready to transmit, it sends a ready
signal to stage Si+1
• After stage Si+1 receives the incoming data, it returns
the acknowledge signal to Si
• Advantages
  • Useful in designing communication channels
  • Variable throughput rate (different stages may have different delays)
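
As a minimal sketch of the handshaking protocol just described (the flags and Python lists stand in for control wires and stage buffers; this is an illustration, not hardware detail):

```python
# One-item-at-a-time transfer from stage Si to stage Si+1 via ready/ack.
def transfer(si_output, si1_input):
    while si_output:
        data = si_output.pop(0)
        ready = True                  # Si asserts 'ready' along with the data
        ack = False
        if ready:                     # Si+1 sees 'ready', ...
            si1_input.append(data)    # ... latches the incoming data, ...
            ack = True                # ... and returns 'ack' to Si
        assert ack                    # Si sends the next item only after 'ack'

stage_i, stage_i1 = [10, 20, 30], []
transfer(stage_i, stage_i1)
print(stage_i1)    # [10, 20, 30], delivered at whatever rate Si+1 accepts
```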
Synchronous Model
• Clocked latches are used to interface between
stages.
• The latches are made with master-slave flip-flops
• Upon the arrival of a clock pulse, all latches
transfer data to the next stage simultaneously.
• The pipeline stages are combinational logic
• Equal delays in all stages.
• Specified by reservation table
Reservation Table

One task flowing through a 4-stage linear pipeline:

Time →    1   2   3   4
S1        X
S2            X
S3                X
S4                    X

5 tasks on 4 stages (one new task initiated per cycle):

Time →    1   2   3   4   5   6   7   8
S1        X   X   X   X   X
S2            X   X   X   X   X
S3                X   X   X   X   X
S4                    X   X   X   X   X
Clocking and Timing Control
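
The timing relations behind this slide can be stated as follows (standard definitions; τi is the delay of stage i and d is the latch delay):

```latex
\tau \;=\; \max_{1 \le i \le k}\{\tau_i\} + d,
\qquad
f \;=\; \frac{1}{\tau} \quad\text{(pipeline clock frequency)}
```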
Speedup, Efficiency & Throughput
Efficiency(Ek) & Throughput(Hk)
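
The quantities named in these two slide titles have the following standard forms for a k-stage linear pipeline executing n tasks with clock period τ (the usual textbook formulas, supplied here because the slide showed them only graphically):

```latex
T_k = [\,k + (n-1)\,]\,\tau \quad\text{(total evaluation time)} \\[4pt]
S_k = \frac{n k}{k + (n-1)} \quad\text{(speedup over the non-pipelined time } nk\tau\text{)} \\[4pt]
E_k = \frac{S_k}{k} = \frac{n}{k + (n-1)} \quad\text{(efficiency)} \\[4pt]
H_k = \frac{n}{[\,k + (n-1)\,]\,\tau} = \frac{E_k}{\tau} \quad\text{(throughput)}
```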
Non Linear Pipelines

• Variable functions
• Feed-Forward
• Feedback
3 stages & 2 functions

[Diagram: pipeline stages S1 → S2 → S3, with outputs X and Y tapped from the stages and feed-forward/feedback connections]

• It is a multifunction pipeline.
• Three types of connections:
  • Streamline connections: S1→S2 and S2→S3
  • Feed-forward connection: S1→S3
  • Feedback connections: S3→S2 and S3→S1
Reservation Tables for X & Y

Function X (evaluation time: 8 cycles):

Time →    1   2   3   4   5   6   7   8
S1        X                   X       X
S2            X       X
S3                X       X       X

Function Y (evaluation time: 6 cycles):

Time →    1   2   3   4   5   6
S1        Y               Y
S2                Y
S3            Y       Y       Y
Reservation Tables

• For a static linear pipeline, the reservation table is trivial in the sense that dataflow follows a linear streamline.
• For a dynamic pipeline, a nonlinear pattern is used.
• A static pipeline is specified by a single reservation table.
• A dynamic pipeline may be specified by more than one reservation table.
Reservation Tables

• Each reservation table displays the time-space flow


of data through the pipeline for one function
evaluation.
• Different functions may follow different paths on the
reservation table.
• The number of columns in a table is called the evaluation time of the function.
• The check marks in each row of the table
correspond to the time instants that the particular
stage will be used.
• Multiple check marks in a row mean repeated usage of the same stage in different cycles.
Latency Analysis

• Latency
  • The number of time units between two initiations of a pipeline is called the latency.
  • Latencies must be non-negative integers.
  • A latency of k means two initiations are separated by k clock cycles.
• Collision
• Any attempt by two or more initiations to use the
same pipeline stage at the same time.
• Collision implies resource conflicts between
two initiations in the pipeline.
• Therefore all collisions must be avoided in
scheduling a sequence of pipeline initiations.
• Some latencies will cause collisions, and some
will not.
• Latencies that cause collisions are called
forbidden latencies.
• Latency sequence
  • A sequence of permissible (non-forbidden) latencies between successive task initiations.
• Latency cycle
  • A latency sequence which repeats the same subsequence indefinitely.
• Average latency
  • Obtained by dividing the sum of all latencies by the number of latencies along the cycle.
• Constant cycle
  • A latency cycle which contains only one latency value.
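
A short worked example of these definitions, using cycles that appear later in this chapter:

```latex
\text{Cycle }(1,8):\ \text{average latency} = \tfrac{1+8}{2} = 4.5;
\qquad
\text{Cycle }(3):\ \text{a constant cycle with average latency } 3.
```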
Collision-free scheduling

• Collision vectors
  • The combined set of permissible and forbidden latencies can easily be displayed by a collision vector
  • C = (Cm Cm-1 … C2 C1)
  • Ci = 1 if latency i causes a collision
  • Ci = 0 if latency i is permissible
State diagrams
• Simple cycle
• It is a latency cycle in which each state appears
only once.
• Ex: (3),(6),(1,8),(3,8) and (6,8)
• Greedy cycles
• Greedy cycle is one whose edges are all made
with minimum latencies from their respective
starting states.
• Ex: (1,8),(3)
• MAL (minimal average latency): the minimum average latency obtainable over all latency cycles (here, 3, from the greedy cycle (3))
Nonlinear Pipeline Design

• Latency
The number of clock cycles between two initiations of a
pipeline
• Collision
Resource Conflict
• Forbidden Latencies
Latencies that cause collisions
Nonlinear Pipeline Design cont
• Latency Sequence
A sequence of permissible latencies between successive
task initiations
• Latency Cycle
A sequence that repeats the same subsequence
• Collision vector
C = (Cm, Cm-1, …, C2, C1), m ≤ n - 1
  n = number of columns in the reservation table
  Ci = 1 if latency i causes a collision, 0 otherwise
Collision Vector for Multiply
after Multiply
Forbidden latencies: 1, 2

Collision vector:
(0 0 0 0 1 1) → 11

Maximum forbidden latency = 2 → m = 2

Example
[Diagram: the three-stage pipeline with outputs X and Y (S1 → S2 → S3), as above]
Reservation Tables for X & Y (repeated for the example)

Function X (8 cycles):

Time →    1   2   3   4   5   6   7   8
S1        X                   X       X
S2            X       X
S3                X       X       X

Function Y (6 cycles):

Time →    1   2   3   4   5   6
S1        Y               Y
S2                Y
S3            Y       Y       Y
Collision Vector (X after X)

• Forbidden latencies: 2, 4, 5, 7
• Collision vector = 1011010
Y after Y

Overlapping a second Y initiation at latency 2 (collisions, marked YY, occur in S3 at times 4 and 6):

Time →    1   2   3   4   5   6   7   8
S1        Y       Y       Y       Y
S2                Y       Y
S3            Y       YY      YY      Y

Overlapping a second Y initiation at latency 4 (collisions occur in S1 at time 5 and in S3 at time 6):

Time →    1   2   3   4   5   6   7   8   9   10
S1        Y               YY              Y
S2                Y               Y
S3            Y       Y       YY      Y       Y

Hence latencies 2 and 4 cause collisions for Y after Y.
Collision Vector (Y after Y)

• Forbidden latencies: 2, 4
• Collision vector = 1010
Exercise – Find the collision
vector
[Reservation table over 7 clock cycles, stages A–D: A is used in three cycles, B in two, C in two, and D in one; mark positions as in the original slide]
State Diagram for X

[State diagram for the initial collision vector 1011010; 8+ denotes any latency of 8 or more]

• From the initial state 1011010: latency 1 leads to 1111111; latencies 3 and 6 lead to 1011011; latencies 8+ return to 1011010.
• From 1011011: latencies 3 and 6 loop back to 1011011; latencies 8+ return to 1011010.
• From 1111111: only latencies 8+ are permitted, returning to 1011010.
Cycles

• Simple cycles (each state appears only once):
  (3), (6), (8), (1, 8), (3, 8), and (6, 8)
• Greedy cycles (simple cycles whose edges are all made with minimum latencies from their respective starting states):
  (1, 8) and (3); one of these yields the MAL (here, cycle (3) gives MAL = 3)
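
The collision-vector machinery of the preceding slides can be reproduced in a few lines of Python. This is a sketch using the X reservation table reconstructed above; the function names are ours, and the transition rule is the standard shift-right-and-OR rule:

```python
# Collision-vector analysis for the 3-stage pipeline evaluating function X.
# A reservation table maps each stage to the time steps (1-based) it uses.
X_TABLE = {
    "S1": {1, 6, 8},
    "S2": {2, 4},
    "S3": {3, 5, 7},
}

def forbidden_latencies(table):
    """A latency is forbidden if two initiations would collide in some stage."""
    forbidden = set()
    for times in table.values():
        forbidden |= {b - a for a in times for b in times if b > a}
    return forbidden

def collision_vector(table):
    """Bit Ci (right to left) is 1 iff latency i causes a collision."""
    forb = forbidden_latencies(table)
    m = max(forb)
    return "".join("1" if i in forb else "0" for i in range(m, 0, -1))

def next_state(state, latency, initial):
    """Shift the state right by 'latency' bits, then OR in the initial vector."""
    shifted = ("0" * latency + state)[:len(initial)]
    return "".join("1" if "1" in pair else "0" for pair in zip(shifted, initial))

cv = collision_vector(X_TABLE)
print(sorted(forbidden_latencies(X_TABLE)))   # [2, 4, 5, 7]
print(cv)                                     # 1011010
print(next_state(cv, 3, cv))                  # 1011011 (state reached by latency 3)
```

A greedy scheduler takes, from each state, the smallest latency whose bit is 0; starting from 1011010 that produces the cycle (3), whose average latency 3 equals the MAL here.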
Delay Insertion

• The purpose of delay insertion is to modify the reservation table, yielding a new collision vector.
• This leads to a new, modified state diagram,
• which may produce greedy cycles meeting the lower bound on the MAL.
Instruction Pipeline Design
• Pipelined Instruction Processing
• The fetch stage (F) fetches instructions from a cache
memory, presumably one per cycle.
• The decode stage (D) reveals the instruction function to be performed and identifies the resources needed.
• The issue stage (I) reserves resources and operands
are also read from registers during the issue stage.
• The instructions are executed in one or several
execute stages (E)
• Issue of instructions follows the original program order.
• The shaded boxes in the timing diagram correspond to idle cycles, when instruction issue is blocked due to
  • resource latencies,
  • resource conflicts, or
  • data dependencies.
• The first two load instructions issue on consecutive cycles.
• The add is dependent on both loads, so it waits.
• Timing improves after the instruction issuing order is changed to eliminate unnecessary delays (see the sketch below):
  • The four load operations move to the beginning.
  • The add and multiply instructions are blocked for fewer cycles due to this data prefetching.
  • The reordering should not change the end results.
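
The effect of such reordering can be shown with a toy in-order issue model; the three-field instruction tuples and the latency numbers are assumptions for illustration, not the book's exact example:

```python
# Toy in-order issue: an instruction stalls until its source operands are ready.
LATENCY = {"LOAD": 2, "ADD": 1, "MUL": 3}

def schedule(program):
    """program: list of (op, dest, sources); returns each instruction's issue cycle."""
    ready = {}                    # register -> first cycle its value is usable
    t, issue = 1, []
    for op, dest, srcs in program:
        t = max([t] + [ready.get(r, 1) for r in srcs])   # stall on operands
        issue.append(t)
        ready[dest] = t + LATENCY[op]    # result usable after the op's latency
        t += 1                           # at most one issue per cycle
    return issue

naive   = [("LOAD","R1",[]), ("LOAD","R2",[]), ("ADD","R3",["R1","R2"]),
           ("LOAD","R4",[]), ("LOAD","R5",[]), ("MUL","R6",["R4","R5"])]
hoisted = [("LOAD","R1",[]), ("LOAD","R2",[]), ("LOAD","R4",[]), ("LOAD","R5",[]),
           ("ADD","R3",["R1","R2"]), ("MUL","R6",["R4","R5"])]
print(schedule(naive)[-1], schedule(hoisted)[-1])   # 8 vs 6: hoisting the loads wins
```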
• The MIPS R4000 is a superpipelined 64-bit processor with separate instruction and data caches.
• Instruction execution consists of 8 major steps.
• The single-cycle ALU stage takes slightly more time than each of the cache-access stages.
• Successive instructions execute in an overlapped fashion; the pipeline stages operate simultaneously on a noninterfering basis.
• The internal pipeline clock rate is 100 MHz.
• Load and branch instructions introduce extra delays.
Mechanisms for Instruction Pipelining

• Prefetch Buffers
• Multiple Functional Units
• Internal Data Forwarding
• Hazard Avoidance
Prefetch Buffers

• Three type of buffers can be used to match the


instruction fetch rate to the pipeline consumption
rate.
• Sequential buffer
• Target buffer
• Loop buffer
• In one memory-access time, a block of consecutive instructions is fetched into a prefetch buffer.
1. Sequential instructions are loaded into a pair of sequential buffers.
2. Instructions from a branch target are loaded into a pair of target buffers.
3. Both pairs operate in FIFO fashion.
4. After the branch condition is checked, appropriate instructions are taken from one of the two buffers, and the instructions in the other buffer are discarded (see the sketch below).
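
A sketch of the two buffer paths around a conditional branch (collapsed to one FIFO per path; 'memory' is a hypothetical list of decoded instructions, and the buffer depth is arbitrary):

```python
from collections import deque

def prefetch(memory, pc, target, depth=4):
    """While the branch is unresolved, fill both FIFO paths."""
    sequential = deque(memory[pc:pc + depth])            # fall-through path
    taken_path = deque(memory[target:target + depth])    # branch-target path
    return sequential, taken_path

def resolve(sequential, taken_path, branch_taken):
    """Keep the buffer on the chosen path and discard the other."""
    keep, discard = (taken_path, sequential) if branch_taken else (sequential, taken_path)
    discard.clear()                                      # discarded instructions
    return keep                                          # pipeline consumes from here
```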
• Loop buffer
• Loop buffer operates in two steps.
• It contains instructions sequentially ahead of the current
instruction.
• It recognizes when the target of a branch falls within the
loop boundary.
Multiple functional units
Internal Data Forwarding

• The throughput of a pipelined processor can be


further improved with internal data forwarding
among multiple functional units.
• Store-load forwarding
• Load-load forwarding
• Store-store forwarding
Store-load forwarding
• Given a store (ST M,R1) immediately followed by a load from the same location M, the load operation (LD R2,M) from memory to register R2 can be replaced by the move operation MOVE R2,R1.
Load-load forwarding
• The second of two loads from the same location M is eliminated: the load operation (LD R2,M) is replaced with the move operation (MOVE R2,R1).
Store-store forwarding
• The two stores are executed immediately one after the other.
• The second store overwrites the first
• The first store becomes redundant and thus can be
eliminated without affecting the outcome.
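
All three cases can be expressed as peephole rewrites over adjacent instructions; the (op, operand, operand) tuple encoding below is an assumption for illustration:

```python
# Peephole forwarding: ST stores a register to a memory cell, LD loads one.
def forward(seq):
    out = []
    for ins in seq:
        if out:
            prev = out[-1]
            # store-load:  ST M,R1 ; LD R2,M  ->  ST M,R1 ; MOVE R2,R1
            if prev[0] == "ST" and ins[0] == "LD" and ins[2] == prev[1]:
                ins = ("MOVE", ins[1], prev[2])
            # load-load:   LD R1,M ; LD R2,M  ->  LD R1,M ; MOVE R2,R1
            elif prev[0] == "LD" and ins[0] == "LD" and ins[2] == prev[2]:
                ins = ("MOVE", ins[1], prev[1])
            # store-store: ST M,R1 ; ST M,R2  ->  ST M,R2 (first store is dead)
            elif prev[0] == "ST" and ins[0] == "ST" and ins[1] == prev[1]:
                out.pop()
        out.append(ins)
    return out

print(forward([("ST", "M", "R1"), ("LD", "R2", "M")]))
# -> [('ST', 'M', 'R1'), ('MOVE', 'R2', 'R1')]
```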
Implementing the dot-product operation with internal data
forwarding
Hazard Avoidance

• The read and write of shared variables by different instructions in a pipeline may lead to different results if these instructions are executed out of order.
• Three types of hazards
• Read-after-Write hazard
• Write-after-Read hazard
• Write-after-Write hazard
• Consider two instructions I and J. Instruction J is assumed to logically follow instruction I according to program order.
• If the actual execution order of these two instructions violates the program order, incorrect results may be read or written, thereby producing hazards.
• We use the notation D(I) and R(I) for the domain and range of instruction I.
  • The domain contains the input set (operands) to be used by instruction I.
  • The range corresponds to the output set of instruction I.
Possible hazard conditions:
• R(I) ∩ D(J) ≠ ∅ for a RAW hazard
• R(I) ∩ R(J) ≠ ∅ for a WAW hazard
• D(I) ∩ R(J) ≠ ∅ for a WAR hazard
• The RAW hazard corresponds to the flow dependence
• WAR to the antidependence
• WAW to the output dependence
• Hazards can be detected by special hardware while instructions are being loaded into the buffer (a sketch follows below).
• A special tag bit can be used with each operand register to indicate that it is safe or hazard-prone.
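
The set conditions above translate directly into a hazard check; a sketch with Python sets, where each instruction is described by its (domain, range) pair as defined on the previous slides:

```python
# Detect RAW/WAW/WAR hazards between instructions I and J (I precedes J).
def hazards(I, J):
    D_I, R_I = I          # domain (inputs) and range (outputs) of I
    D_J, R_J = J
    return {
        "RAW": bool(R_I & D_J),   # flow dependence: J reads what I writes
        "WAW": bool(R_I & R_J),   # output dependence: both write the same location
        "WAR": bool(D_I & R_J),   # antidependence: J writes what I reads
    }

# I: R3 <- R1 + R2 ;  J: R4 <- R3 * R5   (RAW hazard on R3)
print(hazards(({"R1", "R2"}, {"R3"}), ({"R3", "R5"}, {"R4"})))
# {'RAW': True, 'WAW': False, 'WAR': False}
```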
Dynamic Instruction Scheduling

• Static scheduling
  • Data dependencies in a sequence of instructions create interlocked relationships among them.
  • These interlocks can be resolved by a compiler that reorders instructions at compile time; dynamic scheduling resolves them in hardware at run time.
Branch handling techniques
• Three basic terms for the analysis of branching effects:
• Branch taken
  • The action of fetching a nonsequential or remote instruction after a branch instruction.
• Branch target
  • The instruction to be executed after a branch is taken.
• Delay slot
  • The number of pipeline cycles wasted between a branch taken and the fetching of its branch target.
  • Denoted by d, with 0 ≤ d ≤ k - 1, where k is the number of pipeline stages (a throughput estimate follows below).
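
A standard way to quantify the branching cost (a textbook-style estimate, not copied from the slides): let p be the probability that an instruction is a branch, q the probability that such a branch is taken, and b the branch penalty in cycles. Each instruction then incurs an average penalty of p·q·b cycles, so for a pipeline with clock rate f the effective throughput approaches

```latex
H_{\mathrm{eff}} \;=\; \frac{f}{1 + p\,q\,b} \qquad (n \to \infty).
```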
Branch Prediction
• Branch can be predicted either based on branch code
types statically or based on branch history during
program execution.
• The static prediction direction is usually wired into the
processor.
• According to past experience, the best performance is
given by predicting taken.
• A dynamic branch strategy uses recent branch history to
predict whether or not the branch will be taken next time
when it occurs.
• To be accurate, one may need to use the entire history of the branch to predict the future choice.
Classification of dynamic branch
strategies
• One class predicts the branch direction based
upon information found at the decode stage.
• The second class uses a cache to store target
addresses at the stage the effective address of
the branch target is computed.
• The third scheme uses a cache to store target
instructions at the fetch stage.
BTB (branch target buffer)

• Used to implement dynamic branch prediction.
• The BTB holds recent branch information, including the address of the branch target used.
• The address of the branch instruction locates its entry in the BTB (a sketch follows below).
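
A minimal sketch of a BTB combined with 2-bit saturating prediction counters (the table layout, counter policy, and method names are common textbook choices assumed here, not taken from the slides):

```python
# Branch-target buffer: maps a branch instruction's address to a 2-bit
# prediction counter and the most recently used target address.
class BTB:
    def __init__(self):
        self.table = {}                        # branch PC -> [counter, target]

    def predict(self, pc):
        entry = self.table.get(pc)
        if entry and entry[0] >= 2:            # counter 2 or 3: predict taken
            return entry[1]                    # next fetch comes from the target
        return None                            # predict not taken (fall through)

    def update(self, pc, taken, target):
        counter, _ = self.table.get(pc, [1, target])
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
        self.table[pc] = [counter, target]     # record recent branch history
```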
Delayed Branches

• The branch penalty is reduced significantly if the delay slot can be shortened or minimized to a zero penalty.
• A delayed branch of d cycles allows at most d - 1 useful instructions to be executed following the branch taken.
Arithmetic Pipeline Design
• Fixed point operations
• Fixed point numbers are represented internally in machines in
• Sign-magnitude
• One’s complement
• Two’s complement notation
• Add, subtract, multiply, and divide are the four primitive arithmetic operations.
• The add or subtract of two n-bit integers produces an n-bit
result
• The multiplication of two n-bit numbers produces a 2n-bit
result
• The division of an n-bit number by another may create an
arbitrarily long quotient and a remainder.
Floating point numbers
• The IEEE 754 floating-point standard
• A floating-point number X is represented by a pair (m, e), where m is the mantissa and e is the exponent.
• The algebraic value is X = m × r^e, where r is the radix (base).
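
For the IEEE 754 single-precision format in particular, the pair (m, e) is packed as a sign bit s, an 8-bit biased exponent e, and a 23-bit fraction f; a normalized value (0 < e < 255) decodes as:

```latex
X \;=\; (-1)^{s} \times (1.f)_2 \times 2^{\,e - 127}
```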
Arithmetic Pipeline Stages

• Arithmetic or logical shifts can be easily implemented


with shift registers.
• High-speed addition requires either the use of a
  • CPA (carry-propagate adder), which adds two numbers and produces an arithmetic sum, or a
  • CSA (carry-save adder), which adds three input numbers and produces one sum output and one carry output.
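
A one-step carry-save addition can be sketched bitwise in Python; the integers stand in for hardware bit-vectors, and a CPA would perform the final two-operand addition:

```python
# Carry-save addition: reduce three operands to a sum word and a carry word
# with no carry propagation across bit positions.
def csa(a: int, b: int, c: int):
    sum_bits = a ^ b ^ c                             # per-bit sum, carries ignored
    carry_bits = ((a & b) | (b & c) | (a & c)) << 1  # per-bit carries, shifted left
    return sum_bits, carry_bits

s, c = csa(5, 6, 7)
assert s + c == 5 + 6 + 7    # the final s + c is the CPA's job
```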
Multiply Pipeline Design

• Consider the multiplication of two 8-bit integers A × B = P, where P is the 16-bit product in double precision.
• This fixed-point multiplication can be written as the summation of eight partial products:
  P = A × B = P0 + P1 + P2 + … + P7 (checked in the sketch below)
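
The eight-term summation can be checked directly in Python; in the pipelined hardware the partial products would be reduced through levels of CSAs ending in a CPA, but the arithmetic identity is simply:

```python
# P = A x B as the sum of eight partial products Pi = (A << i) if bit i of B is set.
def partial_products(a: int, b: int):
    return [(a << i) if (b >> i) & 1 else 0 for i in range(8)]

a, b = 0xB7, 0x5C                             # arbitrary 8-bit operands
assert sum(partial_products(a, b)) == a * b   # 16-bit double-precision product
```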
Pipeline unit for fixed-point multiplication of 8-bit integers

• The arithmetic pipeline has 3 stages.
• The mantissa section can perform floating-point add or multiply operations.
• Stage 1 receives the input operands and returns the computed results.
• Stage 2 contains an array multiplier used to carry out long multiplications.
• Stage 3 contains registers for holding the results.
Convergence Division
• Division can be carried out by repeated multiplications.
• Mantissa division is carried out by a convergence
method.
• This convergence division obtains the quotient Q = M/D of two normalized fractions, 0.5 ≤ M < D < 1, as sketched below.
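
A minimal numeric sketch of the method (a Goldschmidt-style iteration; the choice R = 2 - D and the step count are illustrative):

```python
# Convergence division: multiply numerator and denominator by the same
# factors R so the denominator converges to 1; the numerator tends to Q = M/D.
def convergence_divide(M: float, D: float, steps: int = 5) -> float:
    assert 0.5 <= M < D < 1.0            # normalized fractions, as above
    for _ in range(steps):
        R = 2.0 - D                      # D*R approaches 1 quadratically
        M, D = M * R, D * R
    return M                             # approximately M/D, since D ~ 1

print(convergence_divide(0.6, 0.8))      # ~0.75
```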
