
Stud CSA Mod 5p2 Arithmetic SuperPipeline

The total branch penalty is 0.36 clock cycles.

Uploaded by

sheenanees

COMPUTER SYSTEM ARCHITECTURE - CS 405

Module - 5 Part - 2
Arithmetic Pipeline Design
Super Scalar Pipeline Design
Arithmetic Pipeline for Floating Point Addition & Subtraction

• The exponent difference determines how many times the mantissa associated with the smaller exponent must be shifted to the right.
• This aligns the two mantissas.
• The mantissas are then added or subtracted in segment 3.
• Finally, the result is normalised in segment 4.
• When an overflow occurs, the mantissa of the sum or difference is shifted right and the exponent is incremented by one.
• When an underflow occurs, the number of leading zeroes in the mantissa determines the number of left shifts of the mantissa and the number that must be subtracted from the exponent.
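
The align, add and normalise steps above can be sketched in a few lines of Python. This is only an illustrative model, not the hardware design from the slides: the function name `fp_add_sub`, the unsigned 8-bit integer mantissas, the unbiased exponents and the requirement that a subtraction gives a non-negative result are all assumptions made to keep the sketch short.

```python
def fp_add_sub(m_a, e_a, m_b, e_b, subtract=False, bits=8):
    """Add or subtract two values of the form m * 2**e.

    Assumptions for this sketch: mantissas are unsigned `bits`-bit integers,
    exponents are plain (unbiased) integers, and a subtraction never
    produces a negative result.
    """
    # Compare the exponents: their difference is the alignment shift count.
    diff = e_a - e_b

    # Align: shift the mantissa with the smaller exponent to the right.
    if diff >= 0:
        m_b >>= diff
        exp = e_a
    else:
        m_a >>= -diff
        exp = e_b

    # Segment 3: add or subtract the aligned mantissas.
    m = m_a - m_b if subtract else m_a + m_b

    # Segment 4: normalise the result.
    top = 1 << (bits - 1)
    if m >> bits:
        # Overflow: shift the mantissa right and increment the exponent.
        m >>= 1
        exp += 1
    else:
        # Underflow: one left shift (and exponent decrement) per leading zero.
        while m and m < top:
            m <<= 1
            exp -= 1
    return m, exp


print(fp_add_sub(0b11000000, 0, 0b10000000, 0))                 # overflow case
print(fp_add_sub(0b10100000, 0, 0b10000000, 0, subtract=True))  # underflow case
```

In the first call the sum needs nine bits, so the mantissa is shifted right once; in the second the difference has two leading zeroes, so it is shifted left twice and the exponent drops by two.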
Fixed Point Multiplication Pipeline:
• A pipelined multiplier based on the digit products can be designed
using digit product generation logic and the digit adders.
Example:
25 * 35 = 875

Now for binary multiplication:


A = a1a0
B = b1b0
Multiply Pipeline Design
• Multiply A and B, two 8-bit numbers. The product is a 16-bit number, P = A x B = P0 + P1 + P2 + P3 + P4 + P5 + P6 + P7.
• The partial product P0 is obtained by multiplying all the bits of the multiplicand (A) by the last (least significant) bit of the multiplier (B).
• The partial products P0 to P7, one for each of the 8 bits of the multiplier multiplied by all bits of the multiplicand, can be generated simultaneously.
• Partial product Pj is obtained by multiplying the multiplicand A by the jth bit of B and then shifting the result j bits to the left, for j = 0 to 7.
• Thus partial product Pj is (8 + j) bits long with j trailing zeroes.
• The partial products P0 to P7 therefore vary in length from 8 to 15 bits.
• These partial products are then added together.
• The summation of the eight partial products is done with a Wallace tree of CSAs (carry-save adders) plus a CPA (carry-propagate adder) at the final stage.
• (P0, P1, P2), (P3, P4, P5) and (P6, P7) are given to 3 CSAs in the first stage; their outputs are fed to CSAs in later stages and finally to a CPA to produce the 16-bit output (P).
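
As a rough software model of this scheme, the sketch below generates the eight shifted partial products, reduces them level by level with carry-save adders, and finishes with an ordinary addition standing in for the CPA. The function names are made up for illustration, and one detail is an assumption rather than the slide's wiring: operands left over at a level (such as P6 and P7 in the first level) are simply forwarded to the next level instead of being fed to a third CSA.

```python
def partial_products(a, b, n=8):
    """Pj = A times bit j of B, shifted left by j, for j = 0 .. n-1."""
    return [(a << j) if (b >> j) & 1 else 0 for j in range(n)]


def csa(x, y, z):
    """Carry-save adder: reduce three words to a sum word and a carry word.

    No carry is propagated between bit positions; x + y + z == s + c.
    """
    s = x ^ y ^ z
    c = ((x & y) | (y & z) | (x & z)) << 1
    return s, c


def wallace_multiply(a, b, n=8):
    operands = partial_products(a, b, n)
    # Each level groups the operands in threes and replaces every group
    # by two words, until only two words remain.
    while len(operands) > 2:
        full = len(operands) - len(operands) % 3
        next_level = []
        for i in range(0, full, 3):
            next_level.extend(csa(*operands[i:i + 3]))
        next_level.extend(operands[full:])  # leftovers are forwarded as-is
        operands = next_level
    # Final stage: a carry-propagate addition of the last two words.
    return operands[0] + operands[1]


print(wallace_multiply(25, 35))  # same operands as the decimal example
```

With eight partial products the loop runs four CSA levels (8, 6, 4, 3, then 2 operands) before the final carry-propagate addition.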
Wallace tree for multiplying two 8-bit numbers
Floating Point Multiplication Pipeline
• FP multiplication involves the following three major steps:
1. Multiplication of fractions.
2. Addition of exponents.
3. Normalization of the result.
• Since fractions and exponents are fixed-point numbers, steps 1 and 2 can be implemented using the fixed-point principles discussed before.
• The normalization step can be implemented as described for floating point addition.
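
The three steps can be illustrated with a small sketch. It assumes a toy format that is not taken from the slides: each mantissa is an unsigned 8-bit integer m read as the fraction m / 256 with its top bit set (so 0.5 <= fraction < 1), exponents are unbiased integers, signs are ignored, and extra product bits are truncated rather than rounded. The name `fp_multiply` is hypothetical.

```python
def fp_multiply(m_a, e_a, m_b, e_b, bits=8):
    """Multiply two values (m / 2**bits) * 2**e with normalised mantissas."""
    # Step 1: multiply the fractions (fixed-point, 2 * bits wide result).
    product = m_a * m_b
    # Step 2: add the exponents.
    exp = e_a + e_b
    # Step 3: normalise. The product fraction lies in [0.25, 1), so at
    # most one left shift is needed to restore a leading 1.
    if product >> (2 * bits - 1):
        mantissa = product >> bits          # already normalised
    else:
        mantissa = product >> (bits - 1)    # one leading zero: shift left once
        exp -= 1
    return mantissa, exp


# 0.75 * 2**2 = 3.0 and 0.625 * 2**3 = 5.0, so the product should be 15.0
m, e = fp_multiply(0b11000000, 2, 0b10100000, 3)
print(m, e, (m / 256) * 2 ** e)
```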
Floating Point Division Pipeline:

Division appears less frequently in computer programs than addition, subtraction and multiplication, so a separate pipeline unit for division is seldom implemented.
It is common to schedule division using the adder and multiplier pipelines.
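
The slides do not say which division algorithm is scheduled on the adder and multiplier pipelines. One common choice, shown here purely as an assumption, is Newton-Raphson reciprocal iteration, whose loop body uses only multiplications and a subtraction. The function names, the iteration count and the restriction to a positive divisor are all illustrative.

```python
import math


def reciprocal(d, iterations=4):
    """Approximate 1/d for d > 0 using only multiply and subtract steps."""
    # Scale d to m * 2**e with 0.5 <= m < 1 so one starting guess fits all.
    m, e = math.frexp(d)
    # Standard linear first guess for 1/m; 48/17 and 32/17 would be
    # precomputed constants in hardware.
    x = 48.0 / 17.0 - (32.0 / 17.0) * m
    for _ in range(iterations):
        # Newton-Raphson step: x <- x * (2 - m * x).
        # Two multiplications and one subtraction, no division.
        x = x * (2.0 - m * x)
    # Undo the scaling: 1/d = (1/m) * 2**(-e).
    return math.ldexp(x, -e)


def divide(n, d):
    """n / d as one more multiplication by the reciprocal."""
    return n * reciprocal(d)


print(divide(7.0, 3.0))
```

Each iteration roughly squares the relative error, so a handful of passes through the multiplier is enough for double precision in this model.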
Static vs Dynamic Pipeline

Static Pipeline
• It assumes only one functional configuration at a time.
• It can be either unifunctional or multifunctional.
• It is preferred when instructions of the same type are to be executed continuously.
• A unifunctional pipeline must be static.

Dynamic Pipeline
• It permits several functional configurations to exist simultaneously.
• A dynamic pipeline must be multifunctional.
• It requires more elaborate control and sequencing mechanisms.
• A multifunctional pipeline need not be dynamic; it may also be static.
Multifunctional Pipeline (*no need to study)

Multifunctional Pipeline - 4X-TI-ASC
• A multifunction pipe may perform different functions, either at different times or at the same time, by interconnecting different subsets of stages in the pipeline.
• Eg: 4X-TI-ASC (Supercomputer - 1973)
• It has four multifunction pipeline processors, each one reconfigurable for a variety of arithmetic or logic operations at different times.
• Its central processor comprises nine units:
– one instruction processing unit (IPU)
– four memory buffer units and
– four arithmetic units.
• Thus it provides four parallel execution pipelines below the IPU.
• Any mixture of scalar and vector instructions can be executed simultaneously in the four pipes.
2-issue Superscalar processor
In-order Instruction Issue & In-order Completion
In-order Instruction Issue & Out-of-order Completion
Out-of-order Instruction Issue & Out-of-order Completion
Instruction Issue & Completion
• In-order issue and completion is the simplest to implement, but it introduces unnecessary stalls or delays in order to keep program order. It is still attractive in a multiprocessor environment.
• Out-of-order completion is found in both scalar and superscalar processors.
• The latency of long operations (loads and FP operations) can be hidden with out-of-order completion.
• Output dependency and anti-dependency can prevent out-of-order completion.
• Out-of-order issue gives more freedom to exploit parallelism and enhances pipeline efficiency.
• Multiple pipeline scheduling is an NP-complete problem.
• Optimal scheduling is expensive.
• Simple data dependence checking, a small look-ahead window, and scoreboarding mechanisms, along with an optimizing compiler, are needed to exploit instruction parallelism in a superscalar processor.
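
A minimal sketch of the kind of simple data dependence check mentioned above is given below. It compares the register names of two instructions in program order and reports flow (RAW), anti (WAR) and output (WAW) dependences. The `Instr` record, the `dependences` function and the register names are hypothetical; a real issue unit or scoreboard tracks far more state than this.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction


def dependences(first: Instr, second: Instr) -> set:
    """Dependences of `second` on `first`, with `first` earlier in program order."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")    # flow dependence: second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")    # anti-dependence: second overwrites a source of first
    if first.dest == second.dest:
        found.add("WAW")    # output dependence: both write the same register
    return found


i1 = Instr("R1", ("R2", "R3"))   # R1 <- R2 op R3
i2 = Instr("R4", ("R1", "R5"))   # R4 <- R1 op R5
i3 = Instr("R2", ("R6",))        # R2 <- op R6
i4 = Instr("R1", ("R7",))        # R1 <- op R7

print(dependences(i1, i2))  # flow dependence on R1
print(dependences(i1, i3))  # anti-dependence on R2
print(dependences(i1, i4))  # output dependence on R1
```

A pair with an empty result has no register dependence, so in this simplified model it could be issued or completed out of order.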
Pipelining Example
• Assume the 5 stages take 10 ns, 8 ns, 10 ns, 10 ns and 7 ns respectively.
• Unpipelined
– Avg instruction execution time = 10 + 8 + 10 + 10 + 7 = 45 ns
• Pipelined
– Each stage introduces some overhead, say 1 ns per stage
– We can only go as fast as the slowest stage!
– Each stage then takes 11 ns; in steady state we complete one instruction every 11 ns
– Speedup = Unpipelined Time / Pipelined Time = 45 ns / 11 ns = 4.1 times, or about a 4X speedup
Note: the latency of an individual instruction is actually higher when pipelined (5 x 11 ns = 55 ns instead of 45 ns).
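
The arithmetic of this example can be checked with a short script. The function name `pipeline_timing` and its return layout are only illustrative; the stage times and the 1 ns overhead are the ones assumed in the example.

```python
def pipeline_timing(stage_ns, overhead_ns):
    """Return (unpipelined time, pipelined cycle, speedup, pipelined latency)."""
    unpipelined = sum(stage_ns)            # one instruction, no pipelining
    cycle = max(stage_ns) + overhead_ns    # slowest stage plus per-stage overhead
    speedup = unpipelined / cycle          # steady-state throughput gain
    latency = cycle * len(stage_ns)        # time for one instruction to pass all stages
    return unpipelined, cycle, speedup, latency


unpipelined, cycle, speedup, latency = pipeline_timing([10, 8, 10, 10, 7], 1)
print(unpipelined, cycle, round(speedup, 1), latency)
```

The last value is the per-instruction latency in the pipelined design, which is what the note about higher latency refers to.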
Measuring Performance with Stalls
$$\text{Speedup from pipelining} = \frac{\text{Avg instr time unpipelined}}{\text{Avg instr time pipelined}} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}$$

We also know that the ideal CPI is 1:

$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall cycles per instr} = 1 + \text{Pipeline stall cycles per instr}$$

Assuming an identical clock cycle, substitution yields:

$$\text{Speedup from pipelining} = \frac{\text{CPI}_{\text{unpipelined}}}{1 + \text{Pipeline stall cycles per instr}}$$
Measuring Stall Performance
Given:

$$\text{Speedup from pipelining} = \frac{\text{CPI}_{\text{unpipelined}}}{1 + \text{Pipeline stall cycles per instr}}$$

In our simple case each instruction takes the same number of cycles, which is equal to the number of pipeline stages or the pipeline depth:

$$\text{Speedup from pipelining} = \frac{\text{Pipeline depth}}{1 + \text{Stall cycles per instruction}}$$

If there are no pipeline stalls, the intuitive result is that pipelining can improve performance by the depth of the pipeline.
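
A small helper makes the effect of stalls on this formula easy to try out. The function name and the sample stall values are illustrative assumptions; the formula itself is the pipeline depth divided by one plus the stall cycles per instruction, under the same identical-clock assumption as above.

```python
def pipeline_speedup(depth, stall_cycles_per_instr):
    """Speedup = pipeline depth / (1 + stall cycles per instruction)."""
    return depth / (1.0 + stall_cycles_per_instr)


for stalls in (0.0, 0.25, 1.0):
    print(stalls, pipeline_speedup(5, stalls))
```

With no stalls the speedup equals the depth; an average of one stall cycle per instruction halves it.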
Example Dynamic Hardware Prediction

Basic Branch Prediction: Branch Target Buffers


Instruction in Buffer   Prediction   Actual Branch   Penalty Cycles
Yes                     Taken        Taken           0
Yes                     Taken        Not taken       2
No                      -            Taken           3

Determine the total branch penalty for a BTB using the above penalties.
Assume also the following:
• Prediction accuracy of 90%
• Hit rate in the buffer of 90%
• 60% taken branch frequency.

Solution:
Branch Penalty = Misprediction penalty + Buffer miss penalty
= (buffer hit rate x fraction of incorrect predictions x penalty cycles)
+ ((1 - buffer hit rate) x fraction of taken branches x penalty cycles)
= (90% x 10% x 2) + (10% x 60% x 3)
= 0.18 + 0.18 = 0.36 clock cycles
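
The same calculation can be written as a small function, which makes it easy to see how the result moves with the inputs. The function and parameter names are illustrative, and the call below uses the figures that appear in the worked solution: a 90% buffer hit rate, a 10% misprediction rate, 60% taken branches, and penalties of 2 and 3 cycles.

```python
def branch_penalty(hit_rate, mispredict_rate, taken_rate,
                   mispredict_cycles, miss_cycles):
    """Average branch penalty in clock cycles per branch."""
    # Branch found in the buffer but predicted wrongly.
    misprediction = hit_rate * mispredict_rate * mispredict_cycles
    # Branch not in the buffer and actually taken.
    buffer_miss = (1.0 - hit_rate) * taken_rate * miss_cycles
    return misprediction + buffer_miss


penalty = branch_penalty(0.90, 0.10, 0.60, 2, 3)
print(round(penalty, 2))
```

Raising the misprediction rate or the miss penalty in the call shows directly which term dominates the total.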
