
Systems and Techniques for

Digital Signal Processing


9 – Computer arithmetics
• Definitions and data path types
• Integer and fixed-point data paths
• Integer number coding types and the law of conservation of bits
• Adders and ALUs
• Multipliers
• Overflow management and numerical issues (quantization) in fixed point CPUs

• Floating-point data paths


• Floating point properties: the standard ANSI/IEEE 754
• Floating point ALU and multipliers
• Floating point software emulation
• Numerical issues in floating point
Definitions
Datapath: the processing path within a processing device, typically
devoted to performing arithmetic operations
A datapath consists of:
• registers and accumulators
• arithmetic-logical units (ALU)
• buses
• multipliers
• one or more shifters
• special function units (e.g. SQRT unit, HW dividers, …)
[Figure: CPU block diagram — the datapath (bus, ALU, registers, shifter) alongside the control unit]

Native datapath width B: the maximum width of a binary datum (expressed in
bits) that can be processed by the datapath in a single instruction cycle
• Usually, the same word size is used for instructions and data (e.g. 8-bit,
16-bit, 32-bit)
• Different word sizes may have an impact on system cost in terms of chip
area, number of package pins, and size and number of external memory devices
Extended Precision: data representation that provides higher precision
than the native datapath width (e.g. 64-bit operations in a 32-bit CPU)
Data path types
From the hardware point of view, two main types of datapath exist:
• Integer or fixed-point (Fx)
• Floating point (Fp)

Notes
• Data path types mirror the internal number representation over the native width
(usually 8, 16, 24 or 32 bits for int/fx data paths and 32 or 64 bits for fp data paths)
• The Fp data path is more expensive due to its more complex hardware ⇒ Fp is
usually slower than its fx counterpart for the same width and the same technology
• Int/Fx operations return exact results only if such results can be represented
exactly over B bits (see later)
• In floating-point operations basic mathematical laws are valid only approximately
• Example: Due to rounding, A+(B+C) ≠ (A+B)+C

Integer and fixed-point data paths

Integer number binary coding
• Assume you can represent a number N using B bits. The most common
binary representations are:
• Signed magnitude: if $b_{B-1}=0$ the number is positive, else negative:

  $N = (-1)^{b_{B-1}} \sum_{i=0}^{B-2} b_i\, 2^i$

• 1's complement: to negate, complement all bits:

  $N = -b_{B-1}\,(2^{B-1}-1) + \sum_{i=0}^{B-2} b_i\, 2^i$

• 2's complement: to negate, complement all bits and add 1 to the LSB:

  $N = -b_{B-1}\,2^{B-1} + \sum_{i=0}^{B-2} b_i\, 2^i$
In general, 2’s complement is commonly used
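As an illustration (a minimal C sketch; the helper names are ours, not from
the slides), the three codings can be decoded directly from the formulas
above — e.g. the 4-bit pattern 1101 reads as -5 in signed magnitude, -2 in
1's complement and -3 in 2's complement:

  #include <stdint.h>

  /* Decode a B-bit pattern held in the low B bits of 'bits' (B <= 31). */
  int decode_sign_magnitude(uint32_t bits, int B) {
      int mag = bits & ((1u << (B - 1)) - 1);       /* magnitude bits            */
      return ((bits >> (B - 1)) & 1) ? -mag : mag;  /* sign bit b_{B-1}          */
  }
  int decode_ones_complement(uint32_t bits, int B) {
      int sign = (bits >> (B - 1)) & 1;
      int rest = bits & ((1u << (B - 1)) - 1);
      return rest - sign * ((1 << (B - 1)) - 1);    /* -b_{B-1}(2^{B-1}-1) + sum */
  }
  int decode_twos_complement(uint32_t bits, int B) {
      int sign = (bits >> (B - 1)) & 1;
      int rest = bits & ((1u << (B - 1)) - 1);
      return rest - sign * (1 << (B - 1));          /* -b_{B-1}*2^{B-1} + sum    */
  }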

Overflow
• An overflow occurs in a mathematical operation when the result is too
large (in absolute value) to be represented properly over B bits

• Overflow occurs when adding:


– 2 positive numbers and the sum is negative
– 2 negative numbers and the sum is positive
– When adding operands with different signs, overflow cannot occur.

    0 1 1 1 = 7        1 1 0 0 = –4
  + 0 0 1 1 = 3      + 1 0 1 1 = –5
  -------------      -------------
    1 0 1 0 = –6 ✗     0 1 1 1 = +7 ✗

• In general you have an overflow when the carry into the MSB differs from
the carry out of the MSB, i.e. Overflow_flag = Cin_MSB xor Cout_MSB
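The same rule can be checked in C without access to the carry signals (a
minimal sketch for B = 8): overflow occurs iff the operands have the same
sign and the wrapped sum has the opposite sign, which is equivalent to
Cin_MSB xor Cout_MSB:

  #include <stdbool.h>
  #include <stdint.h>

  /* True iff a + b overflows an 8-bit 2's complement datapath. */
  bool add_overflows_s8(int8_t a, int8_t b) {
      uint8_t s = (uint8_t)a + (uint8_t)b;     /* wrap-around sum            */
      /* operands agree in sign, result disagrees -> overflow */
      return ((a ^ s) & (b ^ s) & 0x80) != 0;
  }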

2’s complement advantages
1. Additions and subtractions are managed in the same way: –B = NOT B + 1
2. Zero has a single, unambiguous representation: 0…0
3. Any sequence of arithmetic operations whose final result is within
   [-2^{B-1}, 2^{B-1}-1] can be calculated correctly even if some
   intermediate overflows occur (see the sketch below)
[Figure: 4-bit 2's complement "number wheel" mapping the codes 0000…1111 onto the values 0…7 and -8…-1]
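A minimal C demonstration of property 3 (the values are our own example):
on an 8-bit data path, 100 + 100 overflows and wraps to -56, yet
subtracting 150 afterwards still yields the correct final result 50,
because 50 lies within [-128, 127]:

  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
      uint8_t t = (uint8_t)(100 + 100);  /* intermediate overflow: reads as -56 */
      uint8_t r = (uint8_t)(t - 150);    /* wrap-around again                   */
      printf("%d %d\n", (int8_t)t, (int8_t)r);   /* prints: -56 50 */
      return 0;
  }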

Laws of conservation of bits
• Addition: if we have 2 operands B-bit long, we need at least B+1 bits to
represent the result without overflow. Example (B = 4):

    1 0 0 0 = –8
  + 1 1 1 0 = –2
  ---------------
  1 0 1 1 0 = –10   (B+1 bits: sum + carry bit; truncated to 4 bits it
                     would read 0 1 1 0 = +6)

• In general, when adding K terms, ⌈log2(K)⌉ extra bits are required to
guarantee no overflow

• Multiplication: if we have 2 operands B-bit long, we need at least 2B bits
to represent the result without overflow. Example (B = 4):

    0 0 1 1 = 3
  × 0 1 1 1 = 7
  -----------------
  0 0 0 1 0 1 0 1 = 21   (2B bits; truncated to 4 bits it would read
                          0 1 0 1 = +5)
1-bit integer addition
Half adder (2 inputs):

  A B | Cout S
  0 0 |  0   0
  0 1 |  0   1
  1 0 |  0   1
  1 1 |  1   0

Full adder (2 inputs + carry in):

  A B Cin | Cout S
  0 0  0  |  0   0
  0 0  1  |  0   1
  0 1  0  |  0   1
  0 1  1  |  1   0
  1 0  0  |  0   1
  1 0  1  |  1   0
  1 1  0  |  1   0
  1 1  1  |  1   1
Carry Ripple Adder (CRA)
• The Carry Ripple Adder is the simplest type of adder: it simply consists
of a cascade of 1-bit full adders (FAs), with the carry out of each stage
feeding the carry in of the next
• In practice, more efficient CMOS FA implementations are used, rather than
plain logic gates (e.g. designs based on a MINORITY gate)
• Problem: there is a critical path from Cin to Cout ⇒ the carry propagation
latency L increases linearly with the datapath width B
[Figure: 4-bit CRA — a chain of 1-bit FAs linked through carries C1…C3, with inputs a0…a3, b0…b3 and outputs s0…s3, Cout]
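A bit-level C model of the CRA (a sketch, not tied to any specific
implementation): each loop iteration is one FA, producing
s_i = a_i xor b_i xor c_i and the next carry as the majority of the three
inputs; the O(B) loop mirrors the linear carry-propagation latency:

  #include <stdint.h>

  uint32_t ripple_add(uint32_t a, uint32_t b, int B, int *cout) {
      uint32_t sum = 0;
      int c = 0;                                    /* C0 = 0                   */
      for (int i = 0; i < B; i++) {                 /* carry ripples bit by bit */
          int ai = (a >> i) & 1, bi = (b >> i) & 1;
          sum |= (uint32_t)(ai ^ bi ^ c) << i;      /* 1-bit FA: sum bit        */
          c = (ai & bi) | (ai & c) | (bi & c);      /* 1-bit FA: carry out      */
      }
      *cout = c;
      return sum;
  }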

Arithmetic and Logic Unit (ALU) - 1
• An adder is the core of the ALU. Many different implementations are
possible.
• An ALU has 2 input operands, some control inputs, and some output flags
specifying special conditions. Control inputs are driven by the CPU control
unit, whereas the output flags are usually written into the CPU's status register

Example of a 1-bit slice of an ALU
[Figure: 1-bit ALU slice — a full adder on inputs Xi, Yi plus gating logic driven by control bits C0…C3 and the mode bit M; the ALU outputs the flags Zero, Sign, Overflow, Even/odd and Cout]

The mode bit M enables/disables the carry propagation:
• M=1 ⇒ arithmetic mode
• M=0 ⇒ logic mode
Note: more advanced ALUs also allow shift-left and shift-right operations
Arithmetic and Logic Unit (ALU) - 2
Logic mode: M=0, Cin = don't care

  C0C1C2 \ C3 |       0       |       1
  000         | 0             | X and Y
  001         | not Y         | X or (not Y)
  010         | Y             | Y
  011         | 11…1          | 11…1
  100         | X             | X
  101         | X xor Y xor 1 | X xor (not Y)
  110         | X xor Y       | X xor (not Y)
  111         | not X         | (not X) xor Y

Arithmetic mode: M=1, C3=0

  C0C1C2 \ Cin |   0   |   1
  000          | 0     | 1
  001          | -Y-1  | -Y
  010          | Y     | Y+1
  011          | -1    | 0
  100          | X     | X+1
  101          | X-Y-1 | X-Y
  110          | X+Y   | X+Y+1
  111          | X-1   | X
Arithmetic and Logic Unit (ALU) - 3
The ALU's output flag signals are easy to generate:
• Zero flag: NOR of all output bits
• Overflow flag: Cin_MSB xor Cout_MSB (see above)
• Sign flag: most significant bit of the output
• Odd/even flag: least significant bit of the output
• Carry out: results naturally from the adder's output

Binary multiplication
• The 1-bit product is the same as a logic AND
• The "pencil and paper" algorithm to perform a binary multiplication is
very simple. For instance, in the 4-bit case:

                      a3a2a1a0  ×
                      b3b2b1b0  =
  ----------------------------------
                 a3b0 a2b0 a1b0 a0b0
            a3b1 a2b1 a1b1 a0b1   –
       a3b2 a2b2 a1b2 a0b2   –    –
  a3b3 a2b3 a1b3 a0b3   –    –    –
  ----------------------------------
  p7   p6   p5   p4   p3   p2   p1   p0

where c_{pi} denotes the carry out of column i:
  p0 = a0·b0
  p1 = a1·b0 + a0·b1
  p2 = a2·b0 + a1·b1 + a0·b2 + c_{p1}
  p3 = a3·b0 + a2·b1 + a1·b2 + a0·b3 + c_{p2}
  p4 = a3·b1 + a2·b2 + a1·b3 + c_{p3}
  p5 = a3·b2 + a2·b3 + c_{p4}
  p6 = a3·b3 + c_{p5}
  p7 = c_{p6}

Serial implementation
Hardware: a B-bit register X holds the multiplicand; a 2B-bit accumulator
A = (AH, AL) initially holds 0 in AH and the multiplier in AL; a B-bit ALU
and a control unit complete the datapath.

Algorithm, repeated for i = 0 … B-1:
1. If the LSB of AL is 1, write AH ← X + AH; otherwise leave AH unchanged
2. Right-shift the whole accumulator A

[Figure: serial multiplier datapath and the corresponding control flowchart]

• At the end of the algorithm the result is in the whole accumulator
• A MUL operation needs at least B clock cycles ⇒ too slow in some
applications (e.g. DSP)
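A C model of the serial algorithm (a sketch for the unsigned case, B = 16;
the 64-bit accumulator stands in for the carry bit that the hardware keeps
beside the 2B-bit accumulator during the AH ← X + AH step):

  #include <stdint.h>

  uint32_t serial_mul_u16(uint16_t x, uint16_t y) {
      uint64_t acc = y;                        /* AL <- multiplier, AH <- 0     */
      for (int i = 0; i < 16; i++) {           /* at least B clock cycles       */
          if (acc & 1)                         /* LSB of AL selects add / no-op */
              acc += (uint64_t)x << 16;        /* AH <- X + AH                  */
          acc >>= 1;                           /* right shift whole accumulator */
      }
      return (uint32_t)acc;                    /* product over 2B bits          */
  }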
Parallel array-based implementation
• To improve performance, the arithmetic operations returning the product
bits can be implemented in parallel using:
  – A matrix made up of B² ANDs
  – An array of FAs and HAs
  – A final Vector Merging Adder (e.g. a CRA or a CSA) summing up the
    final carries
• The critical path lies in the rightmost column (i.e. about (2B-1) adder
  stages)
[Figure: array multiplier — AND matrix feeding the FA/HA array and the vector merging adder]

Problems
• The worst-case carry propagation latency can be very large
• The proposed circuit does not work for signed (2's complement) numbers
  – Ex. 3 × (-5) = 0011 × 1011 = 00100001 = 33 ⇒ wrong!
Booth’s algorithm - 1
• Both previous problems are solved by Booth's algorithms. Different types
of Booth's algorithms exist (radix 2, radix 4, radix 8, …):
  – The radix-2 algorithm solves the problem of signed multiplications,
    while keeping the HW complexity approximately the same as in the
    integer case
  – Radix-4 and radix-8 algorithms have the further advantage of reducing
    the number of rows of the FA matrix (critical path reduction) at the
    expense of a larger complexity

• The radix-2 Booth algorithm is based on a different way to "look" at the
steps of the multiplication algorithm. In particular, if A = a_{B-1}…a_0 is
the multiplicand and B = b_{B-1}…b_0 is the multiplier, it can easily be
proved that:

  $P = \sum_{i=0}^{B-1} (b_{i-1} - b_i)\, A\, 2^i$   (conventionally $b_{-1}=0$)

Example: A×B = 3×7 = 0011×0111 =
  (0-1)·3·2^0 + (1-1)·3·2^1 + (1-1)·3·2^2 + (1-0)·3·2^3 = -3 + 24 = 21

Booth’s algorithm - 2
[Figure: the bit string 0 1 1 1 1 0, annotated with "end of run", "middle of run" and "beginning of run"]

  Cur. bit | Bit to the right | Explanation         | Example    | Op
     0     |        0         | Middle of run of 0s | 0001111000 | none
     0     |        1         | End of run of 1s    | 0001111000 | add
     1     |        0         | Begins run of 1s    | 0001111000 | sub
     1     |        1         | Middle of run of 1s | 0001111000 | none

• Originally conceived for speed (when shift was faster than add)

Example: 3x-5
  Step            Bits (b_i, b_{i-1})  Operation   Addend     Product
  0 (init)        A=0011, B=1011(0)    -           -          00000000
  1               "10"                 -A·2^0      11111101   11111101
  2               "11"                 no op.      00000000   11111101
  3               "01"                 +A·2^2      00001100   00001001
  4               "10"                 -A·2^3      11101000   11110001

  11110001 = -15 ⇒ correct!

The structure of serial and parallel Booth multipliers is very similar to
those shown previously. The only differences are:
1. The need for sign (bit) extension of the partial results
2. The need to perform not only additions, but also subtractions at each stage
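A direct C transcription of the recoding formula above (a sketch for B = 8;
a real multiplier evaluates these terms in hardware): each step adds,
subtracts or skips the shifted multiplicand according to (b_{i-1} - b_i),
and signed 2's complement operands are handled correctly:

  #include <stdint.h>

  int32_t booth_mul_s8(int8_t a, int8_t b) {
      int32_t p = 0;
      int prev = 0;                                 /* b_{-1} = 0               */
      for (int i = 0; i < 8; i++) {
          int cur = ((uint8_t)b >> i) & 1;          /* b_i                      */
          p += (prev - cur) * ((int32_t)a << i);    /* -A, +A or no-op, by 2^i  */
          prev = cur;
      }
      return p;                                     /* booth_mul_s8(3,-5) == -15 */
  }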

Fixed Point: fractional binary coding
• In fixed-point CPUs, there is no hardware radix point. Thus, it is the
programmer's or hardware designer's responsibility to keep the radix point
in the correct position
• A fractional number is still represented in 2's complement in fixed
point, but the radix position is set by convention
• The Q format is adopted to specify the position of the binary point
(radix) starting from the LSB ⇒ a binary number in Qn format is defined as
having n bits to the right of the radix point
• A fractional number |N| < 1 can be represented using a Q_{B-1} format:

  $N = -b_{B-1} + \sum_{i=1}^{B-1} b_{B-1-i}\, 2^{-i}$

Example: Consider the Q7 number 10111101

  (1.0111101)_2 = -1 + 0·1/2 + 1·1/4 + 1·1/8 + 1·1/16 + 1·1/32 + 0·1/64 + 1·1/128
                = -0.5234375
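A small C illustration of the Q7 convention (the helper names are ours): a
Q7 number is just an int8_t whose value is the raw integer divided by 2^7,
so the pattern 10111101 (0xBD) decodes to -67/128 = -0.5234375:

  #include <stdint.h>
  #include <stdio.h>

  static int8_t float_to_q7(double x) { return (int8_t)(x * 128.0); }
  static double q7_to_float(int8_t q) { return q / 128.0; }

  int main(void) {
      int8_t q = (int8_t)0xBD;                 /* 1.0111101 in binary */
      printf("%f\n", q7_to_float(q));          /* prints -0.523438    */
      printf("%d\n", float_to_q7(-0.5234375)); /* prints -67 (= 0xBD) */
      return 0;
  }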

Resolution, range and dynamic range
of a fx number (QB-1 case)
• Resolution: minimum fractional quantity which can be represented:

  $\Delta = 2^{-B+1}$

• Range: R = [-1, 1-2^{-B+1}]
• Dynamic range: ratio of the largest and the smallest numbers which can be
represented with B bits:

  $DR = 20 \log_{10}\frac{1-2^{-B+1}}{2^{-B+1}} = 20 \log_{10}\left(2^{B-1}-1\right) \approx 6\,(B-1)\ \mathrm{dB}$

Example: In most digital signal processing and digital control
applications, a 16-bit data path width is sufficient to represent signals
without significant loss of accuracy (DR ≈ 90 dB)

Managing fixed-point additions - 1
• Usually both integer and fractional operations are performed using the
same ALU:
  – Switching from one representation to the other may imply just writing
    a fractional bit in the mode control register
  – In both the fractional and the integer case the main problem is
    overflow: since only B bits become the input to the next operations,
    overflow has to be managed
• Alternatively, the fractional representation can be emulated by shifting
the radix point of the operands, i.e. scaling the values by 2^x, where x is
the number of bits in the fraction:
  -0.375_10 = 1.101_2  → ×2^3 →  1101_2 = -3_10
   0.75_10  = 0.110_2  → ×2^3 →  0110_2 =  6_10
• NOTE: when adding numbers the binary representations must be the same,
i.e. the location of the radix point must be the same

Managing fixed-point additions - 2
• An overflow bit is found in the status register ⇒ it may require a
software procedure to handle the exception

3 techniques are used to cope with overflow in fx CPUs:

• Guard bits: extra bits added to the most significant positions of the
accumulator register to temporarily hold intermediate results without
overflow

• Scaling: operands and/or intermediate results are shifted right (with
sign extension), discarding the low-order bits ⇒ loss of precision; the
dynamic range is the limitation ⇒ requires a HW shifter

• Saturation arithmetic (limiter): overflowed results are replaced with the
largest negative (or positive) representable value, depending on the sign
of the result ⇒ requires saturation circuitry between the accumulator and
the data bus ⇒ introduces a nonlinearity (a minimal sketch follows)
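A minimal C sketch of saturation arithmetic for a 16-bit data path
(computing the exact sum in a wider register plays the role of the guard
bits described above):

  #include <stdint.h>

  int16_t sat_add_s16(int16_t a, int16_t b) {
      int32_t s = (int32_t)a + (int32_t)b;     /* exact sum (guard bits)      */
      if (s >  32767) return  32767;           /* clamp positive overflow     */
      if (s < -32768) return -32768;           /* clamp negative overflow     */
      return (int16_t)s;                       /* nonlinear only on overflow  */
  }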

Shifters
• Hardware shifters allow scaling of the intermediate signal values ⇒
division by a positive/negative power of 2
  – Shift right ⇒ loss of precision, but headroom against overflow
  – Shift left can be used when one datum is much smaller than the other
• Shifters and guard bits are used together to prevent overflow ⇒ the
greater the number of guard bits, the less shifting is needed
  – Shifters are needed when storing results into memory
  – Shifters can be found in different parts of the data path, i.e. after
    the multiplier, after the ALU, before the ALU
  – Some shifters have limited capabilities (shift 1 bit left or right, …)
• Barrel shifters are key components to manipulate binary data flexibly
and efficiently ⇒ a barrel shifter can shift a datum by any number of bits

A practical example: a FIR filter
[Figure: FIR filter example — structure, coefficients and input signal]

A practical example: overflow
[Figure: result of the FIR computation with plain wrap-around arithmetic]

A practical example: guard bits
[Figure: result of the FIR computation using accumulator guard bits]

A practical example: scaling
[Figure: result of the FIR computation with scaling of the operands]

A practical example: saturation
[Figure: result of the FIR computation with saturation arithmetic]
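A hedged C sketch of such a FIR computation (the tap count and formats are
our assumptions, not the slides' actual numbers): Q15 samples and
coefficients, a wide accumulator whose extra MSBs act as the guard bits, a
final scaling shift, and saturation back to 16 bits:

  #include <stdint.h>

  #define NTAPS 8                              /* assumed filter length   */

  int16_t fir_q15(const int16_t x[NTAPS], const int16_t h[NTAPS]) {
      int64_t acc = 0;                         /* extra MSBs = guard bits */
      for (int i = 0; i < NTAPS; i++)
          acc += (int32_t)x[i] * h[i];         /* Q15*Q15 -> Q30 products */
      acc >>= 15;                              /* scaling back to Q15     */
      if (acc >  32767) acc =  32767;          /* saturation              */
      if (acc < -32768) acc = -32768;
      return (int16_t)acc;
  }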
Managing fixed-point multiplications
• In integer multiplication the main problem is overflow, because the
result requires 2B bits ⇒ accumulators need at least 2B bits
• Overflow has to be managed as in the addition case
• In a Q_{B-1} fixed-point multiplication, both factors are smaller than 1
(in absolute value) ⇒ the result is smaller than either operand ⇒ the
result grows downwards, so we still need 2B bits, but no overflow occurs!

Example (4-bit Q3 operands; partial products are sign-extended):
  a = 0110 = 0.5 + 0.25 = 0.75
  b = 1110 = -1 + 0.5 + 0.25 = -0.25

      0 0 0 0 0 0 0 0       (b0 = 0)
      0 0 0 0 1 1 0 .       (b1 = 1: +a, shifted)
      0 0 0 1 1 0 . .       (b2 = 1: +a, shifted)
    + 1 1 0 1 0 . . .       (b3 = 1, sign bit: -a, shifted)
    -----------------
      1 1 1 1 0 1 0 0 = -0.1875   (read as 11.110100, i.e. with a
                                   redundant sign bit)

Requantizing the result over 4 bits yields 1110 = -1 + 0.5 + 0.25 = -0.25
⇒ problem: loss of precision due to quantization
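The same steps in C for the common Q15 case (a sketch; note that the single
combination (-1)·(-1) still overflows and would need saturation):

  #include <stdint.h>

  int16_t mul_q15(int16_t a, int16_t b) {
      int32_t p = (int32_t)a * b;     /* 2B-bit product: Q30, two sign bits */
      p <<= 1;                        /* drop the redundant sign bit -> Q31 */
      return (int16_t)(p >> 16);      /* requantize: keep high 16 bits, Q15 */
  }                                   /* (truncation -> loss of precision)  */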

Quantization techniques
[Figure: bit fields — a 2B-bit value (sign bit S plus 2B-1 bits) is reduced to B bits; the B discarded bits sit just below the new LSB]

• At some point during the calculation it is often required to reduce the
number of bits (quantization) for further processing. Consider for instance
quantizing a 2B-bit product result over B bits:

  X_{2B} = X_B + e        (e = quantization error)

3 quantization techniques are generally used (Δ being the new LSB weight):
1. Truncation: always round down
2. Round-to-nearest: round down when e < Δ/2 and round up when e ≥ Δ/2
3. Convergent rounding: round down when e < Δ/2 and round up when e > Δ/2.
   If e = Δ/2:
   – round down if LSB = 0
   – round up if LSB = 1
Quantization error analysis in a fx CPU
• Quantization is a deterministic operation, but when Δ << X and X varies
randomly, the quantization error sequence can be modeled as an additive
white noise e, uniformly distributed over an interval that depends on the
rounding technique used
• The pmf of e is discrete, as the starting values are already quantized
over 2B bits

• Truncation: 0 ≤ e ≤ Δ(1 - 2^{-B})
  E[e] = (Δ/2)(1 - 2^{-B}) ⇒ biased
  E[e²] ≈ Δ²/3
• Round-to-nearest: -Δ/2 < e ≤ Δ/2
  E[e] = Δ·2^{-B-1} ⇒ small residual bias
  E[e²] = (Δ²/12)(1 + 2^{-2B+1})
• Convergent rounding: -Δ/2 ≤ e ≤ Δ/2
  E[e] = 0 ⇒ no bias!
  E[e²] = (Δ²/12)(1 + 2^{-2B+1})

[Figure: discrete pmf of e for the three quantization techniques]
Rounding logic implementation
[Figure: rounding logic — the 2B-bit input is split into the B-bit kept part (ending in LSB) and the discarded part, headed by the rounding bit RB; SB is the OR of the remaining discarded bits ("sticky bit"); the rounding logic drives the Cin of an incrementer acting on the kept part]

Example
• Truncation (round towards -∞): Cin = 0
• Round-to-nearest: Cin = RB
• Convergent rounding: Cin = RB·(LSB + SB)
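The three Cin choices in C, reducing a 2B-bit value to B bits (here
32 → 16; a sketch — a real datapath would also saturate the increment):

  #include <stdint.h>

  enum rnd { TRUNCATE, NEAREST, CONVERGENT };

  int16_t quantize_32_to_16(int32_t x, enum rnd mode) {
      int32_t kept = x >> 16;                /* truncated result      */
      int lsb = (x >> 16) & 1;               /* LSB of the kept part  */
      int rb  = (x >> 15) & 1;               /* rounding bit RB       */
      int sb  = (x & 0x7FFF) != 0;           /* sticky bit SB         */
      int cin = 0;                           /* truncation: Cin = 0   */
      if (mode == NEAREST)    cin = rb;              /* Cin = RB         */
      if (mode == CONVERGENT) cin = rb & (lsb | sb); /* Cin = RB(LSB+SB) */
      return (int16_t)(kept + cin);          /* incrementer           */
  }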

Putting all together: a Fx DSP data path
[Figure: a typical Fx DSP data path — two register files (A0…A3, B0…B3) and two data buses feed a MUL unit and, through MUXes and temporary registers Temp(0)/Temp(1), an adder/ALU; two accumulators (Acc A, Acc B), a mode control register and a status register close the loop; rounding logic, saturation logic, sign extension and a barrel shifter complete the data path]
Some examples of Int/Fx datapath - 1

Motorola DSP56xxx
• 4 input registers (X0, X1, Y0, Y1)
• Registers and data paths are 24 bits wide
• 2 accumulators, 56 bits wide
• Data chaining is possible (48 bits)
• One 8-bit status register (condition code register, CCR)
• One 8-bit mode register (MR)

Some examples of Int/Fx datapath - 2

TI TMS320C54x
• Heterogeneous datapath
• Dual-access memory through 2 16-bit buses (CB, DB)
• 40-bit ALU
• 16×16 multiplier
• 2 40-bit accumulators
• Most operand combinations are possible at the ALU input through
appropriate MUXes
• 3 16-bit status and control registers (ST0, ST1, PMST)

Floating-point data paths

Floating point binary coding
• Fp CPUs contain two data paths: one for integer operands and one for fp
operands (usually with a separate register file made up of wider registers)
• Floating-point binary coding is related to the usual (decimal) scientific
notation:

  31 | 30 … 23 | 22 … 0
   s | e  …  e | m … m
  1 bit  C bits  B bits

s = sign bit
e = exponent (C bits)
m = mantissa (B bits, in Q_{B-1} notation; in normalized notation,
    0.5 ≤ m < 1 and the MSB is implicit)

  N = (-1)^s · 1.m · 2^e

• (Variable) resolution: 2^{-B} · 2^e
• Positive range: R = [2^{-B} · 2^{-2^{C-1}}, (1-2^{-B}) · 2^{2^{C-1}-1}]
• Dynamic range:

  $DR = 20 \log_{10}\left(2^{2^C + B - 1}\right) \approx 6\,(2^C + B - 1)\ \mathrm{dB}$
The standard ANSI/IEEE 754
• Standardization improves:
  – the portability of programs utilizing floating-point operations
  – the quality of programs utilizing floating-point operations
• Its most notable feature is that it requires the computation to continue
in case of exceptional conditions (divide by zero, square root of a
negative number, etc.)
• Most FP processors today are based on the ANSI/IEEE 754 standard:

  N = (-1)^s · (1.m) · 2^{e-k}      1 ≤ 1.m < 2,  k = 127 or k = 1023

Single precision (32-bit):         Double precision (64-bit):
• s = 1 bit                        • s = 1 bit
• m = 23 bits                      • m = 52 bits
• e = 8 bits                       • e = 11 bits

The standard ANSI/IEEE 754
• The developers of the standard wanted a representation in which the
nonnegative numbers are ordered in the same way as integers, i.e. the
magnitudes of fp numbers can be compared using an integer comparator
• Exponent field before the mantissa field ⇒ a number with a greater
exponent is larger
• Problem with negative exponents: a negative exponent has 1 in its MSB,
i.e. it would look greater than a number with a positive exponent
• Solution: biased notation, e = actual exponent + bias; e.g. in single
precision, bias = 127

  s |    e    |  m   | Number
  0 |    0    |  0   | +0
  1 |    0    |  0   | -0
  s |    0    | m≠0  | (-1)^s · 0.m · 2^{-126}   (denormal)
  s | 0<e<255 |  m   | (-1)^s · 1.m · 2^{e-127}
  0 |   255   |  0   | +∞
  1 |   255   |  0   | -∞
  s |   255   | m≠0  | NaN (not a number)

• Denormal notation: useful for gradual underflow
• ±∞ and NaN encode exceptional results of floating-point arithmetic
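The field layout can be verified in C (a minimal sketch):
-6.5 = (-1)^1 · 1.625 · 2^2, so s = 1, e = 2 + 127 = 129 and
m = 0.625 · 2^23 = 0x500000:

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  int main(void) {
      float f = -6.5f;
      uint32_t bits;
      memcpy(&bits, &f, sizeof bits);          /* reinterpret the 32 bits */
      uint32_t s =  bits >> 31;                /* sign                    */
      uint32_t e = (bits >> 23) & 0xFF;        /* biased exponent         */
      uint32_t m =  bits & 0x7FFFFF;           /* mantissa (no hidden 1)  */
      printf("s=%u e=%u m=0x%06X\n", s, e, m); /* s=1 e=129 m=0x500000    */
      return 0;
  }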

Floating-point addition
[Figure: floating-point addition algorithm — align the exponents, add the significands, normalize, round]
Floating-point ALU
[Figure: floating-point adder datapath — a small ALU computes the exponent difference, the significand of the smaller operand is shifted right, a big ALU adds the significands, and the result is normalized (shift left/right plus exponent increment/decrement) and rounded; see the sketch after this slide]

• Overflow is automatically saturated
• Underflow is automatically truncated to zero
• IEEE 754 defines 4 types of rounding:
  – round towards +∞
  – round towards -∞
  – round towards zero
  – round-to-nearest (which in the IEEE standard is the same as convergent
    rounding)
• Most DSPs use round-to-nearest, but some implement all the methods
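A toy software model of this flow (our own sketch: positive normalized
operands only; sign handling and IEEE rounding omitted for brevity), with
significands kept explicitly as 1.m over 24 bits:

  #include <stdint.h>

  typedef struct { int e; uint32_t sig; } fp_t;    /* sig in [2^23, 2^24) */

  fp_t fp_add_pos(fp_t a, fp_t b) {
      if (a.e < b.e) { fp_t t = a; a = b; b = t; } /* compare exponents   */
      int d = a.e - b.e;                           /* exponent difference */
      uint32_t bs = (d < 32) ? b.sig >> d : 0;     /* shift smaller right */
      fp_t r = { a.e, a.sig + bs };                /* add significands    */
      if (r.sig >= (1u << 24)) {                   /* normalize: one      */
          r.sig >>= 1;                             /* right shift plus    */
          r.e++;                                   /* exponent increment  */
      }                                            /* (rounding omitted)  */
      return r;
  }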

Floating-point multiplication
[Figure: floating-point multiplier — the exponents are added, the significands are multiplied, and the result is normalized and rounded]
Floating point emulation
• Floating-point arithmetic in fixed-point processors requires a SW
implementation ⇒ useful when high accuracy is required, but expensive
• Some manufacturers provide floating-point libraries for Fx CPUs
• Example: TMS320C25 code for a floating-point multiplication (operands in
MA, EA and MB, EB; result in MC, EC)

  LAC  EA
  ADD  EB
  SACL EC      ; EXPONENT OF RESULT BEFORE NORMALIZATION
  LT   MA
  MPY  MB
  PAC          ; (ACC) = MA * MB
  SFL          ; TAKES CARE OF REDUNDANT SIGN BIT
  LARP AR5
  LAR  AR5,EC  ; AR5 IS INITIALIZED WITH EC
  NORM *-      ; FINDS MSB AND MODIFIES AR5
  SACH MC      ; MC = MA * MB (NORMALIZED)
  SAR  AR5,EC

Quantization error analysis in a Fp CPU
• If the signal is varying, the relative error

  ε(n) = (Q[x(n)] - x(n)) / x(n)

can be modeled as a random noise process which is independent of the signal
NOTE: the amplitude of the relative error is not uniformly distributed
• Relative quantization-error noise power:

  E[ε²(n)] = 2^{-2B} / 6        (convergent rounding)

• The total fp quantization error can be modeled as a noise source
modulated by the signal:

  e(n) = Q[x(n)] - x(n) = x(n) · ε(n)
Fixed-point vs. floating-point - 1
• The quantization noise of fixed-point arithmetic is a constant-power
noise, independent of the signal ⇒ the SNR decreases as the signal level
decreases
• The level of the floating-point quantization noise follows the level of
the signal, i.e. the SNR is roughly constant during the computation
  – The word width of the exponent affects only the dynamic range
  – The word width of the mantissa affects only the quantization
    (processing) noise

[Figure: quantization noise level vs. signal level, fixed-point vs. floating-point]

Fixed point vs. floating-point - 2
Main problems in mathematical operations:

  Operation | Integer  | Q_{B-1} fixed-point | Floating-point
  +         | Overflow | Overflow            | Quantization (mantissa)
  ×         | Overflow | Quantization        | Quantization (mantissa)

• Floating-point makes the probability of overflow negligible and keeps
the SNR constant ⇒ programming is simple
• In fixed-point, suitable techniques have to be used while programming in
order to avoid overflow and/or quantization errors

