Chapter 6
techniques
-prajwala T R
Dept. of CSE
PESIT
Linear pipelining processors
• Cascade of processing stages which are
linearly connected to perform fixed function
over stream of data
• External inputs are fed into the first stage S1
• k processing stages S1, S2, ..., Sk
Asynchronous model
• Uses handshaking protocol
• Ready and acknowledgement signal
• Variable throughput and variable delays at
each stage
Synchronous model
• Latches act as interface between stages
• On arrival of a clock pulse, each latch transfers its data to the next stage
• Pipeline stages are combinational logic circuits
• Reservation table: space-time diagram showing which stage is used in each clock cycle
Reservation table
Clock and timing protocol
• τ: clock cycle time
• τi: time delay of stage Si
• d: time delay of a latch
• τ = τm + d, where τm = max{τi} is the maximum stage delay and τm >> d
• Pipeline frequency f = 1/τ
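As a rough illustration of the timing relations above, the following sketch computes the clock period and pipeline frequency from a list of stage delays. The function name and the delay values (in nanoseconds) are made up for the example.

```python
def clock_period(stage_delays, latch_delay):
    """Return (tau, f): clock period tau = tau_m + d and frequency f = 1/tau."""
    tau_m = max(stage_delays)      # the slowest stage sets the pace
    tau = tau_m + latch_delay
    return tau, 1.0 / tau

# Hypothetical delays in nanoseconds: three stages and a 1 ns latch.
tau, f = clock_period([10, 8, 9], 1)
print(f"tau = {tau} ns, f = {f * 1000:.1f} MHz")
```

Because the period is set by the slowest stage, balancing the stage delays is what actually raises the frequency.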
Clock skewing
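• The same clock pulse may arrive at different stages with a time offset of s
• tmax and tmin: delays of the longest and shortest logic paths within a stage
• To avoid a race condition between successive stages
– τm ≥ tmax + s
– d ≤ tmin - s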
Speedup
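• In a linear pipeline of k stages, the first task takes k clock cycles to complete; the remaining n-1 tasks then finish one per cycle, taking n-1 clock cycles
• Total time for n tasks: Tk = [k + (n-1)]τ
• The same n tasks on an equivalent nonpipelined processor take T1 = nkτ
• Speedup is the ratio of nonpipelined time to pipelined time: Sk = T1/Tk = nk/[k + (n-1)]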
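A small sketch of the speedup calculation, assuming the usual timing model in which a k-stage pipeline needs k + (n-1) cycles for n tasks and the nonpipelined processor needs nk cycles. The function names and the sample values of k and n are illustrative only.

```python
def pipelined_time(k, n, tau):
    """Time for n tasks on a k-stage pipeline: [k + (n - 1)] * tau."""
    return (k + (n - 1)) * tau

def nonpipelined_time(k, n, tau):
    """Time for n tasks without pipelining: n * k * tau."""
    return n * k * tau

def speedup(k, n):
    """Ratio of nonpipelined to pipelined time (tau cancels out)."""
    return nonpipelined_time(k, n, 1) / pipelined_time(k, n, 1)

# The speedup approaches k as the number of tasks grows.
for n in (1, 4, 64, 1024):
    print(n, round(speedup(4, n), 3))
```

The loop shows why the maximum speedup of k is only approached for long task streams: with a single task there is no overlap at all.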
Optimal pipeline stages
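• t: total time required for the nonpipelined program
• Same program using pipelining
– k pipeline stages
– Clock period p = t/k + d, where d is the latch delay
– Cost = c + kh, where c is the cost of all logic stages and h is the cost of each latch
– Frequency f = 1/p
– Performance/cost ratio PCR = f/(c + kh)
• Efficiency: Ek = Sk/k = n/[k + (n-1)]
• Throughput: Hk = Ek·f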
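The performance/cost ratio can be explored numerically. The sketch below evaluates PCR = f/(c + kh) with p = t/k + d, and also uses the closed form k0 = sqrt(tc/(dh)), which follows from setting the derivative of PCR with respect to k to zero; that closed form is not stated on the slide and is included here as a derived result. The parameter values are hypothetical.

```python
import math

def pcr(k, t, d, c, h):
    """Performance/cost ratio of a k-stage pipeline."""
    p = t / k + d          # clock period
    f = 1.0 / p            # frequency
    return f / (c + k * h)

def optimal_stages(t, d, c, h):
    """Stage count that maximizes PCR (derived, not from the slide)."""
    return math.sqrt((t * c) / (d * h))

# Hypothetical values: total time 64, latch delay 1, logic cost 4, latch cost 1.
t, d, c, h = 64.0, 1.0, 4.0, 1.0
k0 = optimal_stages(t, d, c, h)
print(k0, pcr(k0, t, d, c, h))
```

Evaluating `pcr` for a few neighboring values of k is a quick way to check that the ratio really peaks near k0.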
Reservation and latency analysis
• Feed forward connections of pipeline
• Feedback connections of pipeline.
• Example-3 stage pipeline
• The output of the pipeline is not necessarily taken from the last stage.
Reservation table
• Example of a reservation table for a function (a row or a column may contain multiple checkmarks)
• Shows the time-space flow of data through the pipeline
• Number of columns: evaluation time of the function
• All initiations in a static pipeline use the same reservation table.
• Dynamic pipeline: different initiations may use different reservation tables
Latency analysis
• Latency: number of clock cycles between two initiations of a pipeline. Ex: a latency of k means the two initiations are k clock cycles apart
• Collision: two or more initiations try to use the same pipeline stage at the same time.
• Latencies that cause collisions are called forbidden latencies.
• A forbidden latency in a reservation table is the distance between any two checkmarks in the same row.
• Latency cycle: latency sequence that repeats itself indefinitely.
• Average latency: sum of the latencies in a latency cycle divided by the number of latencies in the cycle.
• Collision free scheduling
• Collision vector
• State transition diagram
• Greedy cycles
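The ideas in this list can be sketched in a few lines of code. The example below derives the forbidden latencies and the collision vector from a reservation table, then follows the smallest permissible latency from each state to find the greedy cycle that starts at the initial state. The 3-stage reservation table and all function names are hypothetical; the state transition rule used (shift right by the latency, then OR with the initial collision vector) is the standard one.

```python
from itertools import combinations

def forbidden_latencies(table):
    """table maps a stage name to the set of clock columns in which it is used."""
    forbidden = set()
    for columns in table.values():
        for a, b in combinations(sorted(columns), 2):
            forbidden.add(b - a)   # distance between two checkmarks in a row
    return forbidden

def collision_vector(forbidden):
    """Bit string C_m ... C_1, where C_i = 1 if latency i is forbidden."""
    m = max(forbidden)
    return "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))

def greedy_cycle(cv):
    """Follow the smallest permissible latency from each state until a state repeats."""
    m = len(cv)
    initial = int(cv, 2)
    state, seen, latencies = initial, {}, []
    while state not in seen:
        seen[state] = len(latencies)
        # Latency i is permissible when bit C_i is 0; any latency > m always is.
        latency = next((i for i in range(1, m + 1)
                        if not (state >> (i - 1)) & 1), m + 1)
        latencies.append(latency)
        state = (state >> latency) | initial
    return latencies[seen[state]:]

# Hypothetical reservation table (columns are numbered from 0).
table = {"S1": {0, 5, 7}, "S2": {1, 3}, "S3": {2, 4, 6}}
forbidden = forbidden_latencies(table)
cv = collision_vector(forbidden)
cycle = greedy_cycle(cv)
print(sorted(forbidden), cv, cycle, sum(cycle) / len(cycle))
```

Note that the greedy cycle found from the initial state is not necessarily the cycle with the smallest average latency; other greedy cycles may start from other states.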
Pipeline schedule optimization
• Insert noncompute delay stages into the original pipeline.
• This gives a new reservation table and therefore a new collision vector
• Lower bound on the minimal average latency (MAL): maximum number of checkmarks in any row of the reservation table
• MAL <= average latency of any greedy cycle
• Upper bound: number of 1’s in the initial collision vector, plus 1
• Pipeline throughput: N tasks completed in n pipeline cycles gives N/n tasks per cycle
• Pipeline efficiency: percentage of time that each pipeline stage is used over a long series of tasks
• Higher pipeline throughput corresponds to a shorter average latency
• Higher efficiency means less idle time for the pipeline stages
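To make the bounds on the MAL concrete, the sketch below computes both bounds for a small example and compares them with the MAL found by exhaustively searching the simple cycles of the state diagram. The reservation table, its collision vector, and the function names are hypothetical; the brute-force search is only practical for short collision vectors.

```python
def mal_bounds(table, cv):
    """Lower bound: max checkmarks in a row. Upper bound: number of 1's in cv, plus 1."""
    lower = max(len(columns) for columns in table.values())
    upper = cv.count("1") + 1
    return lower, upper

def minimal_average_latency(cv):
    """Search every simple cycle reachable from the initial state."""
    m = len(cv)
    initial = int(cv, 2)
    best = float(m + 1)          # the cycle (m + 1) always exists

    def dfs(path_states, path_latencies):
        nonlocal best
        state = path_states[-1]
        for latency in range(1, m + 2):
            if latency <= m and (state >> (latency - 1)) & 1:
                continue         # forbidden in this state
            nxt = (state >> latency) | initial
            if nxt in path_states:
                i = path_states.index(nxt)
                cycle = path_latencies[i:] + [latency]
                best = min(best, sum(cycle) / len(cycle))
            else:
                dfs(path_states + [nxt], path_latencies + [latency])

    dfs([initial], [])
    return best

# Hypothetical reservation table and its collision vector (C7 ... C1).
table = {"S1": {0, 5, 7}, "S2": {1, 3}, "S3": {2, 4, 6}}
cv = "1011010"
lower, upper = mal_bounds(table, cv)
mal = minimal_average_latency(cv)
print(lower, mal, upper)
```

In this example the MAL equals the lower bound, so inserting delay stages could not improve the schedule any further.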
Instruction pipeline design
• Instruction execution phase
– Fetch
– Decode
– Issue
– Execute
– Write results
Mechanisms for instruction pipelining
• Prefetch buffers
– Sequential buffers
– Target buffers
• Pairs of buffers
• FIFO
– Loop buffer
• Saves fetch time
• Unnecessary memory access is avoided
Mechanisms for instruction pipelining
contd…
• Multiple functional units
– The stage with the most checkmarks in its reservation-table row becomes the bottleneck.
– Use multiple copies of that stage
– Reservation station
• Operations wait until their data dependencies are resolved
• Acts as a buffer
– Tag unit
– Multiple functional unit
Mechanisms for instruction pipelining
contd…
• Internal data forwarding
• Store-load forwarding
– Replace the load with a register-to-register move
• Load-load forwarding
– Replace the second load with a register-to-register move
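The two forwarding cases can be sketched as a peephole pass over an instruction list. The tuple encoding `(op, register, address)` and the function name are invented for this illustration, and the pass only looks at adjacent instruction pairs, which is a simplification.

```python
def forward_adjacent(instrs):
    """Replace a load with a move when the previous instruction
    stored to, or loaded from, the same memory address."""
    out = []
    for idx, (op, reg, addr) in enumerate(instrs):
        if op == "load" and idx > 0:
            prev_op, prev_reg, prev_addr = instrs[idx - 1]
            if prev_op in ("store", "load") and prev_addr == addr:
                # The value is already in prev_reg, so skip the memory access.
                out.append(("move", reg, prev_reg))
                continue
        out.append((op, reg, addr))
    return out

# Store-load: Mem[M] <- R1, then R2 <- Mem[M]
print(forward_adjacent([("store", "R1", "M"), ("load", "R2", "M")]))
# Load-load: R1 <- Mem[M], then R2 <- Mem[M]
print(forward_adjacent([("load", "R1", "M"), ("load", "R2", "M")]))
```

In both cases the second memory access disappears, and the resulting `("move", dest, src)` tuple copies the value between registers instead.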
Dot product example
Mechanisms for instruction pipelining
contd…
• Hazard avoidance
– D(I): domain, the set of inputs of instruction I
– R(I): range, the set of outputs of instruction I
– RAW hazard: flow dependence
– WAR hazard: antidependence
– WAW hazard: output dependence
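Using the domain and range sets, hazards between an instruction I and a later instruction J can be detected with set intersections: RAW if R(I) ∩ D(J) is nonempty, WAR if D(I) ∩ R(J) is nonempty, and WAW if R(I) ∩ R(J) is nonempty. The sketch below applies these standard conditions; the function name and the register names are illustrative.

```python
def hazards(domain_i, range_i, domain_j, range_j):
    """Return the hazards between instruction I and a later instruction J."""
    found = set()
    if range_i & domain_j:
        found.add("RAW")   # J reads what I writes (flow dependence)
    if domain_i & range_j:
        found.add("WAR")   # J writes what I reads (antidependence)
    if range_i & range_j:
        found.add("WAW")   # both write the same location (output dependence)
    return found

# I: R1 = R2 + R3    J: R4 = R1 * R5
print(hazards({"R2", "R3"}, {"R1"}, {"R1", "R5"}, {"R4"}))
# I: R1 = R2 + R3    J: R2 = R4 - R5
print(hazards({"R2", "R3"}, {"R1"}, {"R4", "R5"}, {"R2"}))
```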
Mechanisms for instruction pipelining
contd…
• Static scheduling
• Handled by the compiler
• Example loop:
  for (i = 1000; i > 0; i = i - 1)
      x[i] = x[i] + s;
• Loop unrolling is also a static scheduling method
• Replicate the loop body multiple times and adjust the loop termination code accordingly
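The effect of unrolling can be shown on the same kind of loop. The sketch below is written in Python purely to illustrate the transformation a compiler would apply; the function names and the unroll factor of 4 are arbitrary choices, and no speedup is implied for Python itself.

```python
def add_scalar_rolled(x, s):
    """Original loop: one element per iteration, counting down."""
    for i in range(len(x) - 1, -1, -1):
        x[i] = x[i] + s

def add_scalar_unrolled4(x, s):
    """Same computation with the body replicated four times."""
    i = len(x) - 1
    while i >= 3:
        # Four independent updates per iteration: fewer branches and
        # more room for the scheduler to overlap loads, adds, and stores.
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4
    while i >= 0:
        # Cleanup loop for the leftover elements.
        x[i] = x[i] + s
        i -= 1

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
add_scalar_unrolled4(data, 10.0)
print(data)
```

The cleanup loop handles array lengths that are not a multiple of the unroll factor, which is the part of the termination logic that must be adjusted.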
Static scheduling
• Determine whether loop unrolling would be useful
• Use a different set of registers in each copy of the loop body to avoid conflicts
• Determine whether loads and stores can be interchanged safely.
• Schedule the code while preserving the dependencies.
Dynamic scheduling
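• Hardware rearranges instruction execution at run time to reduce stalls
• The two techniques are
– Scoreboarding (CDC 6600)
– Tomasulo's algorithm
Tomasulo's algorithm
• Register renaming (tagging) removes WAR and WAW hazards; RAW hazards are handled by making an instruction wait in its reservation station until its operands arrive.
• Reservation stations fetch and buffer operands as soon as they are available, instead of reading them from the registers at execution time.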
• Common data bus
Tomasulo's algorithm
CDC scoreboarding
• Multiple functional units
• The scoreboard is a unit that keeps track of the registers needed by instructions waiting for the various functional units.
• Centralized system
• When all registers needed by an instruction hold valid data, the scoreboard enables its execution.
• The scoreboard releases the resources once the instruction has been executed.
Branch handling techniques
• Branch taken
• Branch target
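• Delay slot (b): number of pipeline cycles lost between a taken branch and the fetch of its target
• p: probability that an instruction is a conditional branch
• q: probability that a conditional branch is taken
• Total branch penalty for n instructions: pqnbτ
• Effective pipeline throughput: Heff = nf/[k + (n-1) + pqnb]
• Upper bound (n → ∞, b = k-1): Heff = f/[pq(k-1) + 1]
• Performance degradation factor
(f - Heff)/f = pq(k-1)/[pq(k-1) + 1]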
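The degradation factor can be checked numerically. The sketch below assumes the usual model: n instructions on a k-stage pipeline take k + (n-1) + pqnb cycles, where p is the probability of a conditional branch, q the probability that it is taken, and b the number of delay-slot cycles lost per taken branch. The function names and sample values are illustrative.

```python
def effective_throughput(f, k, n, p, q, b):
    """Instructions per unit time when taken branches cost b cycles each."""
    return n * f / (k + (n - 1) + p * q * n * b)

def degradation(k, p, q):
    """Limit of (f - Heff)/f as n grows, with b = k - 1."""
    x = p * q * (k - 1)
    return x / (x + 1)

# Hypothetical 8-stage pipeline: 20% branches, 60% of them taken.
k, p, q = 8, 0.2, 0.6
print(round(degradation(k, p, q), 3))

# The finite-n value approaches the limit for a long instruction stream.
f = 1.0
h = effective_throughput(f, k, 10**9, p, q, k - 1)
print(round((f - h) / f, 3))
```

With these sample values nearly half of the peak throughput is lost to branches, which motivates the prediction schemes that follow.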
Branch prediction-static scheme
• Solution 1: freeze or flush the pipeline until the branch destination is known
• Solution 2: predict not taken
– Continue fetching instructions as if the branch were not taken
– If the branch is taken, restart execution at the branch target
• Solution 3: predict taken
• As soon as the branch is encountered, compute the target address and load it into the PC
• Best performance
• Solution 4: delay slot
• Branch instruction
• Delay slot
• Branch target if taken
Delay slot
Branch hazards-dynamic solution
• Predictions are based on the recorded history of branches
• Branch history table
• 3 classes of strategy
• Hardware-branch target buffer
• 2 bit prediction scheme
– Branch prediction buffer
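A 2-bit prediction scheme is commonly built from saturating counters, one per entry of the branch prediction buffer. The sketch below is one such arrangement under stated assumptions: the buffer size, the indexing by low-order address bits, and the initial counter value of 1 (weakly not taken) are all choices made for the example, and the class name is hypothetical.

```python
class TwoBitPredictor:
    """Branch prediction buffer of 2-bit saturating counters (0..3)."""

    def __init__(self, entries=16):
        self.entries = entries
        self.counters = [1] * entries   # assumed start: weakly not taken

    def predict(self, pc):
        """Predict taken when the counter is 2 or 3."""
        return self.counters[pc % self.entries] >= 2

    def update(self, pc, taken):
        """Move the counter toward the actual outcome, saturating at 0 and 3."""
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

def count_mispredictions(outcomes, pc=0x40):
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses

# A loop branch taken 9 times and then not taken, executed twice.
outcomes = ([True] * 9 + [False]) * 2
print(count_mispredictions(outcomes))
```

The point of using two bits is that a single loop exit does not flip the prediction, so the branch is still predicted taken when the loop is entered again.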
Delayed branches
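• A delayed branch of d cycles allows at most d-1 useful instructions to be executed after the branch
• NOPs act as fillers when no useful instruction is available
• Whether the branch is taken or not, the result of the program must remain the same.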
Arithmetic pipeline design
• Applied to speed up numerical calculations
• Arithmetic is performed with finite precision.
– Numbers exceeding the limit must be rounded or truncated
– Overflow condition
– Underflow condition
• IEEE 754 format
• Numbers can be fixed point or floating point
• Fixed-point numbers
– Sign-magnitude, 1’s complement, 2’s complement
– 1’s complement has a second representation of zero (the dirty zero)
– Multiplication and division of n-bit operands require up to 2n bits of register space
• Floating point numbers
• Single precision
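– X = (m, e) = m × r^e, where m is the mantissa, e the exponent, and r the radix
– Sign: 1 bit
– Exponent: 8 bits, stored with a bias of 127 (normalized numbers have true exponents from -126 to +127)
– Mantissa: 23 bits
• X = (-1)^s × 2^(e-127) × 1.m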
• Floating point addition
• Floating point multiplication
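The single-precision layout can be inspected directly. The sketch below uses Python's `struct` module to split a value into its sign, biased exponent, and mantissa fields, and then rebuilds the value from those fields; the function names are illustrative, and the reconstruction covers normalized numbers only.

```python
import struct

def decode_single(x):
    """Return (sign, biased exponent, mantissa) of x as an IEEE 754 single."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, exponent, mantissa

def value_of(sign, exponent, mantissa):
    """Rebuild a normalized number: (-1)^s * 2^(e - 127) * 1.m"""
    return (-1) ** sign * 2.0 ** (exponent - 127) * (1 + mantissa / 2 ** 23)

fields = decode_single(-6.5)     # -6.5 = -1.625 * 2^2
print(fields, value_of(*fields))
```

The leading 1 of the significand is not stored, which is why the reconstruction adds 1 to the scaled mantissa field.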
Static arithmetic pipelines
• Scalar pipelines: one function at a time
• Vector or dynamic pipelines: more than one function at a time
• Multiple functional units
• The three stages of a floating-point adder are
– Exponent comparison and equalization
– Fraction addition
– Normalization and rounding off the values
• Carry propagation adder
• Carry save adder
• Multiplication pipeline design
• Convergence division
– Q = N/D
• Generate the reciprocal of the divisor by an iterative process and then use:
• A/B = A × (1/B)
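The slide does not say which iteration generates the reciprocal. As one possible choice, the sketch below uses the Newton-Raphson iteration x ← x(2 - Bx), which needs only multiplications and subtractions. The scaling of the divisor into [0.5, 1), the linear initial guess, and the iteration count are assumptions of this example, not details from the source.

```python
import math

def reciprocal(b, iterations=5):
    """Approximate 1/b with the Newton-Raphson iteration x <- x * (2 - m * x)."""
    if b == 0:
        raise ZeroDivisionError("reciprocal of zero")
    m, e = math.frexp(abs(b))          # abs(b) = m * 2**e with 0.5 <= m < 1
    x = 48 / 17 - 32 / 17 * m          # linear initial guess for 1/m
    for _ in range(iterations):
        x = x * (2 - m * x)            # error is squared on every step
    return math.copysign(math.ldexp(x, -e), b)

def divide(a, b):
    """A/B computed as A * (1/B)."""
    return a * reciprocal(b)

print(divide(7.0, 3.0), divide(1.0, -8.0))
```

Because the error is squared on each step, a handful of iterations is enough for double precision, and each step maps naturally onto a multiply pipeline.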