Chapter 6

Pipelining and superscalar techniques can improve processor performance. Pipelining divides instruction execution into stages to allow multiple instructions to be processed simultaneously. Superscalar techniques use multiple functional units and dynamic scheduling to allow multiple instructions to execute concurrently. Key aspects of pipelining include pipeline hazards, branch prediction, reservation tables, and latency analysis. Arithmetic pipelines apply similar techniques to speed up numerical calculations by dividing operations like addition and multiplication into stages.

Pipelining and Superscalar Techniques
– Prajwala T R, Dept. of CSE, PESIT
Linear pipeline processors
• A cascade of k processing stages connected linearly, performing a fixed function over a stream of data
• External inputs are fed to stage S1
Asynchronous model
• Uses a handshaking protocol with ready and acknowledge signals between stages
• Variable throughput and variable delays at each stage
Synchronous model
• Latches act as interfaces between stages
• On arrival of the clock signal, each latch transfers its data to the next stage
• Pipeline stages are combinational circuits
• Reservation table: a space-time diagram of the pipeline stages
Reservation table
Clock and timing protocol
• Clock cycle time: τ
• τi: time delay of stage Si; τm = max{τi}
• d: time delay of a latch
• τ = τm + d, where τm >> d
• Pipeline frequency f = 1/τ
Clock skewing
• The same clock pulse arrives at different stages with a time offset s (skew)
• tmax and tmin: the maximum and minimum logic delays within a stage
• To avoid race conditions:
– τm > tmax + s
– d < tmin − s
Speedup
• A linear pipeline of k stages completes the first task in k cycles; the remaining n − 1 tasks take one cycle each, i.e. (n − 1) clock cycles
• Tk = [k + (n − 1)]τ
• The same n tasks on a non-pipelined processor take T1 = nkτ
• Speedup is the ratio of non-pipelined to pipelined time: Sk = T1/Tk = nk/[k + (n − 1)]
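The speedup formula above can be checked with a short sketch (the stage count k and task count n are free parameters; τ cancels out of the ratio):

```python
def pipeline_speedup(k, n):
    """Speedup S_k = T1 / Tk = n*k / (k + n - 1) for a k-stage linear pipeline."""
    t1 = n * k            # non-pipelined: n tasks, k cycles each
    tk = k + (n - 1)      # pipelined: k cycles for the first task, 1 per task after
    return t1 / tk

# As n grows, the speedup approaches the stage count k (here k = 4):
print(pipeline_speedup(4, 4))      # 16/7 ≈ 2.29
print(pipeline_speedup(4, 1000))   # close to 4
```

Note the asymptotic limit: for n >> k the speedup tends to k, which is why deep pipelines pay off only on long task streams.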
Optimal number of pipeline stages
• t: total time required by the non-pipelined program
• The same program using a pipeline of k stages:
– Clock period p = t/k + d (d = latch delay)
– Cost = c + kh (c = cost of all stage logic, h = cost of each latch)
– Frequency f = 1/p
– PCR (performance/cost ratio) = f/(c + kh) = 1/[(t/k + d)(c + kh)]
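A sketch of finding the k that maximizes PCR; the parameter values (t, d, c, h) are made up for illustration, and the closed-form optimum k0 = sqrt(tc/dh) follows from setting the derivative of the continuous PCR to zero:

```python
import math

def pcr(k, t, d, c, h):
    """Performance/cost ratio 1 / ((t/k + d) * (c + k*h))."""
    return 1.0 / ((t / k + d) * (c + k * h))

# Illustrative (made-up) parameters: t = 100 time units, d = 1, c = 10, h = 2.
t, d, c, h = 100.0, 1.0, 10.0, 2.0
best_k = max(range(1, 101), key=lambda k: pcr(k, t, d, c, h))
k0 = math.sqrt(t * c / (d * h))   # analytic optimum of the continuous PCR
print(best_k, round(k0, 2))       # integer optimum lands next to sqrt(t*c/(d*h))
```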
• Efficiency: Ek = Sk/k = n/[k + (n − 1)]
• Pipeline throughput: Hk = number of tasks / total time = n/([k + (n − 1)]τ)
• Hk = Ek · f
Reservation and latency analysis
• A pipeline may have feedforward and feedback connections between stages
• Example: a 3-stage pipeline
• The output of the pipeline is not always taken from the last stage
Reservation table
• Example of a reservation table for a function (a row or column may contain multiple checkmarks)
• A time-space diagram of data flow: columns give the evaluation time
• All initiations in a static pipeline use the same reservation table
• In a dynamic pipeline, different initiations may use different reservation tables
Latency analysis
• Latency: the number of clock cycles between two initiations of the pipeline (e.g., a latency of k cycles)
• Collision: two or more initiations use the same pipeline stage at the same time
• Latencies that cause collisions are called forbidden latencies
• The forbidden latencies of a reservation table are the distances between any two checkmarks in the same row
• Latency cycle: a latency sequence that repeats itself indefinitely
• Average latency: the sum of the latencies in a cycle divided by the number of latencies in it
• Collision-free scheduling uses:
– the collision vector
– the state transition diagram
– greedy cycles
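As a concrete sketch, the forbidden latencies and the initial collision vector can be read off a reservation table mechanically; the 3-stage table below is a made-up example, not one from the slides:

```python
# Reservation table: stage -> clock cycles in which that stage is used.
# Hypothetical 3-stage pipeline evaluated over 6 cycles.
table = {
    "S1": [0, 5],
    "S2": [1, 4],
    "S3": [2, 3],
}

# Forbidden latencies: distances between any two checkmarks in the same row.
forbidden = {abs(a - b) for row in table.values()
             for a in row for b in row if a != b}

# Collision vector C = (c_m ... c_1): bit i is 1 iff latency i is forbidden.
m = max(forbidden)
collision_vector = "".join("1" if i in forbidden else "0"
                           for i in range(m, 0, -1))
print(sorted(forbidden), collision_vector)   # [1, 3, 5] 10101
```

The zero bits of the collision vector (latencies 2 and 4 here) are the permissible latencies from which greedy cycles are built in the state transition diagram.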
Pipeline schedule optimization
• Insert non-compute delay stages into the original pipeline
• This yields a new reservation table, and hence a new collision vector
• Bounds on the minimal average latency (MAL):
– Lower bound: the maximum number of checkmarks in any row of the reservation table
– MAL ≤ average latency of any greedy cycle
– Upper bound: the number of 1's in the collision vector + 1
• Pipeline throughput: N tasks completed in n pipeline cycles gives N/n tasks per cycle
• Pipeline efficiency: the percentage of time each pipeline stage is used over a long series of tasks
• High pipeline throughput requires low latency
• High efficiency requires little idle time
Instruction pipeline design
• Instruction execution phases:
– Fetch
– Decode
– Issue
– Execute
– Write results
Mechanisms for instruction pipelining
• Prefetch buffers
– Sequential buffers and target buffers, used in pairs, managed FIFO
– Loop buffer
• Saves fetch time
• Unnecessary memory accesses are avoided
Mechanisms for instruction pipelining contd…
• Multiple functional units
– The stage with the maximum number of checkmarks in a row of the reservation table becomes the bottleneck
– Relieve it with multiple copies of that stage (multiple functional units)
– Reservation stations
• Operations wait in them until they have no outstanding data dependences
• Act as buffers between the issue stage and the functional units
– Tag unit to identify buffered operands and results
Mechanisms for instruction pipelining contd…
• Internal data forwarding
– Store–load: a load from a location just stored to is converted into a register move
– Load–load: the second load from the same location is converted into a register move
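A minimal sketch of internal data forwarding as a peephole rewrite over a symbolic instruction list; the tuple encoding of instructions is invented for illustration:

```python
# Instructions as tuples: ("store", mem, reg), ("load", reg, mem), ("move", dst, src).
def forward(instrs):
    """Replace load-after-store and load-after-load on the same address
    with register moves (internal data forwarding)."""
    out = []
    for ins in instrs:
        if ins[0] == "load" and out:
            prev = out[-1]
            if prev[0] == "store" and prev[1] == ins[2]:
                out.append(("move", ins[1], prev[2]))   # store-load forwarding
                continue
            if prev[0] == "load" and prev[2] == ins[2]:
                out.append(("move", ins[1], prev[1]))   # load-load forwarding
                continue
        out.append(ins)
    return out

prog = [("store", "M", "R1"), ("load", "R2", "M")]
print(forward(prog))   # [('store', 'M', 'R1'), ('move', 'R2', 'R1')]
```

The memory access is avoided entirely: the value travels register-to-register instead of making a round trip through memory.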
Dot product example
Mechanisms for instruction pipelining contd…
• Hazard avoidance
– D(I): domain, the set of inputs read by instruction I
– R(I): range, the set of outputs written by instruction I
– For instruction I followed by instruction J:
• RAW hazard (flow dependence): R(I) ∩ D(J) ≠ ∅
• WAR hazard (antidependence): D(I) ∩ R(J) ≠ ∅
• WAW hazard (output dependence): R(I) ∩ R(J) ≠ ∅
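The three set-intersection conditions above translate directly into code; domains and ranges are given here as register-name sets:

```python
def hazards(d_i, r_i, d_j, r_j):
    """Hazards between instruction I and a later instruction J,
    given the domain (read set) and range (write set) of each."""
    found = set()
    if r_i & d_j:
        found.add("RAW")   # flow dependence: J reads what I writes
    if d_i & r_j:
        found.add("WAR")   # antidependence: J overwrites what I reads
    if r_i & r_j:
        found.add("WAW")   # output dependence: J overwrites what I writes
    return found

# I: R1 = R2 + R3;  J: R4 = R1 * R1  -> J reads the result of I
print(hazards({"R2", "R3"}, {"R1"}, {"R1"}, {"R4"}))   # {'RAW'}
```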
Mechanisms for instruction pipelining contd…
• Static scheduling
• Handled by the compiler
• Example loop:
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;
• Loop unrolling is also a static scheduling method
• Replicate the body of the loop multiple times and adjust the termination condition
Static scheduling
• Determine whether loop unrolling would be useful
• Use different sets of registers to avoid name conflicts
• Determine whether loads and stores can be interchanged safely
• Schedule the code while preserving the dependences
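The loop from the previous slide, unrolled by a factor of 4, can be sketched as follows (translated to Python lists; the function names are invented):

```python
def add_scalar(x, s):
    """Original loop: x[i] = x[i] + s for i = n-1 .. 0."""
    for i in range(len(x) - 1, -1, -1):
        x[i] += s

def add_scalar_unrolled(x, s):
    """Same loop unrolled by 4: one branch test per four iterations,
    and four independent updates the scheduler can reorder."""
    i = len(x) - 1
    while i >= 3:                 # four iterations per pass
        x[i] += s
        x[i - 1] += s
        x[i - 2] += s
        x[i - 3] += s
        i -= 4
    while i >= 0:                 # leftover iterations
        x[i] += s
        i -= 1

a = [1, 2, 3, 4, 5, 6]
add_scalar_unrolled(a, 10)
print(a)   # [11, 12, 13, 14, 15, 16]
```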
Dynamic scheduling
• Hardware rearranges instruction execution at run time to overcome stalls
• The 2 classic techniques are:
– Scoreboarding on the CDC 6600
– Tomasulo's algorithm
Tomasulo's algorithm
• Register renaming avoids WAR and WAW hazards; RAW hazards are resolved by executing an instruction only when its operands become available
• Reservation stations fetch and buffer operands instead of reading them from registers
• Results are broadcast on a common data bus
Tomasulo's algorithm
CDC scoreboarding
• Multiple functional units
• The scoreboard is a centralized unit that keeps track of the registers needed by instructions waiting for the various functional units
• When all source registers hold valid data, the scoreboard enables instruction execution
• The scoreboard releases resources once an instruction has been executed
Branch handling techniques
• Branch taken: the branch condition is satisfied; branch target: the instruction executed next in that case
• b: delay slot, the number of cycles wasted when a branch is taken; b = k − 1 for a k-stage pipeline
• p: probability that an instruction is a conditional branch
• q: probability that such a branch is taken
• Extra delay over n instructions due to taken branches: pqnbτ
• Effective pipeline throughput: Heff = nf/[k + (n − 1) + pqnb]
• Upper bound (n → ∞): Heff → f/[pq(k − 1) + 1]
• Performance degradation factor:
D = (f − Heff)/f = pq(k − 1)/[pq(k − 1) + 1]
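The throughput and degradation formulas above can be evaluated numerically; the parameter values below (f = 1 result/cycle, k = 8, p = 0.2, q = 0.6) are illustrative only:

```python
def branch_penalty(f, k, p, q, n):
    """Effective throughput of a k-stage pipeline over n instructions, with
    branch probability p and branch-taken probability q, plus its n->inf
    limit and the performance degradation factor D."""
    b = k - 1                                    # delay slots per taken branch
    h_eff = n * f / (k + (n - 1) + p * q * n * b)
    h_limit = f / (p * q * b + 1)                # n -> infinity
    degradation = p * q * b / (p * q * b + 1)    # D = (f - h_limit) / f
    return h_eff, h_limit, degradation

h_eff, h_limit, d = branch_penalty(1.0, 8, 0.2, 0.6, 10_000)
print(round(h_limit, 3), round(d, 3))   # 0.543 0.457
```

Even with only 12% of cycles following a taken branch (pq = 0.12), the 7-cycle penalty costs this pipeline almost half its throughput, which motivates the prediction schemes on the next slides.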
Branch prediction – static schemes
• Solution 1: freeze or flush the pipeline until the branch destination is known
• Solution 2: predict not taken
– Continue fetching instructions as if the branch were not taken
– If it is taken, restart execution from the target
• Solution 3: predict taken
– As soon as a branch is encountered, compute the target address and store it in the PC
– Best performance when most branches are taken
• Solution 4: delayed branch
– Branch instruction
– Delay slot
– Branch target, if taken
Delay slot
Branch hazards – dynamic solutions
• The history of recent branches is recorded and used for prediction
• Branch history table
• 3 classes of strategy
• Hardware support: branch target buffer
• 2-bit prediction scheme
– Branch prediction buffer
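The 2-bit scheme is a saturating counter per branch: a minimal sketch, with the common convention that states 0–1 predict not taken and 2–3 predict taken:

```python
class TwoBitPredictor:
    """2-bit saturating counter: 0 = strongly not taken .. 3 = strongly taken.
    The prediction flips only after two consecutive mispredictions."""
    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2          # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True, True, True, False, True]   # e.g. a loop branch
hits = sum(p.predict() == taken or p.update(taken) or False
           for taken in outcomes if (p.update(taken) is None) or True)
```

(Read sequentially: predict, compare with the outcome, then update.) The point of the second bit is that a single anomalous outcome, such as a loop exit, does not flip a well-established prediction.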
Delayed branches
• A delayed branch of d cycles allows at most d − 1 useful instructions in the delay slots
• NOPs act as fillers when no useful instruction can be moved in
• Whether the branch is taken or not, the program's result must remain the same
Arithmetic pipeline design
• Applied to speed up numerical calculations
• Only finite precision is available:
– Numbers exceeding the representable range must be rounded off or truncated
– Overflow condition
– Underflow condition
• IEEE 754 format
• Numbers can be fixed point or floating point
• Fixed-point numbers
– Sign-magnitude, 1's complement, 2's complement representations
– 1's complement has two representations of zero ("dirty zero")
– Multiplication and division require 2n-bit register space for n-bit operands
• Floating-point numbers
• Single precision
– X = (m, e) = m × r^e
– Sign: 1 bit
– Exponent: 8 bits, biased by 127
– Mantissa: 23 bits
• X = (−1)^s × 2^(e−127) × 1.m
• Floating-point addition
• Floating-point multiplication
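Floating-point addition decomposes into the pipeline stages discussed on the next slide: exponent comparison and equalization, fraction addition, and normalization. A sketch on (mantissa, exponent) pairs, assuming values m × 2^e with 1 ≤ |m| < 2 held exactly in binary:

```python
def fp_add(m1, e1, m2, e2):
    """3-stage floating-point addition sketch on (mantissa, exponent) pairs."""
    # Stage 1: exponent comparison and equalization (shift the smaller operand).
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 = m2 / (2 ** (e1 - e2))
    # Stage 2: fraction addition.
    m, e = m1 + m2, e1
    # Stage 3: normalization (restore 1 <= |m| < 2).
    while abs(m) >= 2:
        m /= 2
        e += 1
    while 0 < abs(m) < 1:
        m *= 2
        e -= 1
    return m, e

print(fp_add(1.5, 2, 1.25, 1))   # 1.5*4 + 1.25*2 = 8.5 = (1.0625, 3)
```

Each stage uses only the previous stage's output, which is what lets the three steps run as a pipeline over a stream of operand pairs. (Rounding is omitted here for brevity.)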
Static arithmetic pipelines
• Scalar (static) pipelines perform one function at a time
• Vector or dynamic pipelines can perform more than one function at a time
• Multiple functional units
• The 3 stages of a floating-point adder are:
– Exponent comparison and equalization
– Fraction addition
– Normalization and rounding off the values
• Carry-propagate adder (CPA)
• Carry-save adder (CSA)
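A carry-save adder reduces three operands to a sum word and a carry word without propagating carries between bit positions; a bit-level sketch on Python integers:

```python
def carry_save_add(a, b, c):
    """Carry-save addition of three integers: returns (sum_word, carry_word)
    with a + b + c == sum_word + carry_word. No carry ripples between bits;
    a final carry-propagate adder combines the two words."""
    sum_word = a ^ b ^ c                                  # per-bit sum (XOR)
    carry_word = ((a & b) | (b & c) | (a & c)) << 1       # per-bit majority, shifted
    return sum_word, carry_word

s, cw = carry_save_add(13, 7, 5)
print(s, cw, s + cw)   # 15 10 25  (13 + 7 + 5 = 25)
```

Because every bit position is computed independently, CSA stages have constant delay regardless of word length, which is why multiplication pipelines reduce partial products through a tree of CSAs before one final CPA.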
• Multiplication pipeline design
• Convergence division: Q = N/D
– Generate the reciprocal of the divisor by an iterative process, then use A/B = A × (1/B)
– Use the Newton–Raphson method on f(x) = 1/x − B, whose root is x = 1/B:
x_{i+1} = x_i − f(x_i)/f′(x_i) = x_i(2 − B·x_i)
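The iteration x_{i+1} = x_i(2 − B·x_i) uses only multiplications and a subtraction, which is exactly why it suits a multiply pipeline; a sketch (the fixed initial guess here is illustrative — hardware seeds it from a lookup table):

```python
def reciprocal(b, iterations=6, x0=0.1):
    """Newton-Raphson iteration for 1/b: x <- x * (2 - b*x).
    Converges quadratically when the initial guess x0 lies in (0, 2/b)."""
    x = x0
    for _ in range(iterations):
        x = x * (2 - b * x)
    return x

def divide(a, b):
    """Convergence division: A/B = A * (1/B)."""
    return a * reciprocal(b)

print(round(divide(7.0, 4.0), 6))   # 1.75
```

Quadratic convergence roughly doubles the number of correct bits per iteration, so a handful of pipelined multiply passes suffices for full precision.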
