9 Computer Arithmetics
Notes
• The data path type mirrors the internal number representation over the native width
(usually 8, 16, 24 or 32 bits for int/fx data paths and 32 or 64 bits for fp data paths)
• An fp data path is more expensive due to its more complex hardware → fp is usually
slower than its fx counterpart for the same width and the same technology
• Int/fx operations return exact results only if such results can be represented
exactly over B bits (see later)
• In floating-point operations the basic mathematical laws hold only approximately
• Example: due to rounding, A+(B+C) ≠ (A+B)+C
Integer and fixed-point data paths

Integer number binary coding
• Assume you can represent a number N using B bits. The most common binary
representations are:
• Signed magnitude: if b_{B-1} = 0 the number is positive, else negative:

  N = (-1)^{b_{B-1}} · Σ_{i=0}^{B-2} b_i · 2^i
Overflow
• An overflow occurs in a mathematical operation when the result is too
large (in absolute value) to be represented properly over B bits

    0 1 1 1 = 7        1 1 0 0 = –4
  + 0 0 1 1 = 3      + 1 0 1 1 = –5
    1 0 1 0 = –6       0 1 1 1 = 7

• In general you have an overflow when the carry into the MSB ≠ the carry out of
the MSB, i.e. Overflow_flag = CinMSB xor CoutMSB
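The overflow rule above can be checked bit by bit. A minimal sketch (function name and 4-bit width are illustrative choices, not from the slides):

```python
def add4(a, b):
    """Add two 4-bit 2'C patterns bit-serially; return (result, overflow).

    Overflow is detected as carry-into-MSB XOR carry-out-of-MSB."""
    B = 4
    carry = c_in_msb = 0
    result = 0
    for i in range(B):
        s = ((a >> i) & 1) + ((b >> i) & 1) + carry
        result |= (s & 1) << i
        if i == B - 1:
            c_in_msb = carry          # carry entering the MSB position
        carry = s >> 1                # carry leaving this position
    return result, c_in_msb ^ carry   # carry here is the carry out of the MSB

# the two slide examples: 7 + 3 and -4 + -5 both overflow in 4 bits
assert add4(0b0111, 0b0011) == (0b1010, 1)   # 1010 = -6: wrong sign
assert add4(0b1100, 0b1011) == (0b0111, 1)   # 0111 = +7: wrong sign
assert add4(0b0011, 0b0010) == (0b0101, 0)   # 3 + 2 = 5: no overflow
```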
2’s complement advantages
1. Additions and subtractions are managed in the same way: –B = NOT B + 1
2. The zero representation is unique (0…0), i.e. unambiguous
3. Any sequence of arithmetic operations whose final result is within
[-2^{B-1}, 2^{B-1}-1] can be calculated correctly even if some intermediate
result overflows

[Figure: 2’s complement "number wheel" for B = 4, mapping the patterns
0000…0111 onto 0…7 and 1000…1111 onto -8…-1]
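The three advantages can be checked directly with B-bit modular arithmetic. A sketch (the helper names and the 7 + 7 − 9 example are illustrative, not from the slides):

```python
B = 4
MASK = (1 << B) - 1            # 0b1111: keep only B bits

def neg(x):
    """Advantage 1: two's-complement negation is -x = NOT x + 1 (mod 2^B)."""
    return (~x + 1) & MASK

def to_signed(x):
    """Interpret a B-bit pattern as a value in [-2^(B-1), 2^(B-1)-1]."""
    return x - (1 << B) if x & (1 << (B - 1)) else x

assert neg(0b0011) == 0b1101           # -3
assert to_signed(0b1101) == -3

# Advantage 3: 7 + 7 - 9 = 5 is computed correctly even though the
# intermediate sum 7 + 7 overflows (1110 = -2 in 4-bit 2'C)
t = (7 + 7) & MASK                     # wraps to 1110
t = (t + neg(9 & MASK)) & MASK         # subtract 9 via NOT+1
assert to_signed(t) == 5
```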
Laws of conservation of bits
• Addition: if we have 2 operands B bits long we need at least B+1 bits to
represent the result without overflow

    1 0 0 0 = –8
  + 1 1 1 0 = –2
  1 0 1 1 0 = –10 (B+1 bits, incl. carry bit)

• In general, when adding K terms, ⌈log2(K)⌉ extra bits are required to
guarantee no overflow
• Multiplication: if we have 2 operands B bits long we need at least 2B bits
to represent the result without overflow

        0 0 1 1 = 3
      × 0 1 1 1 = 7
  0 0 0 1 0 1 0 1 = 21 (2B bits)
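A quick numeric check of the bit-growth rules, taking B = 4 as an example width (the values are illustrative):

```python
import math

B = 4
lo, hi = -(1 << (B - 1)), (1 << (B - 1)) - 1   # 4-bit 2'C range: [-8, 7]

# Addition of two B-bit values needs B+1 bits: -8 + -8 = -16 fits in 5 bits
assert -(1 << B) <= lo + lo <= (1 << B) - 1

# Adding K terms needs ceil(log2(K)) extra bits
K = 10
extra = math.ceil(math.log2(K))                # 4 extra bits for 10 terms
assert -(1 << (B + extra - 1)) <= K * lo       # worst case -80 fits in 8 bits

# Multiplication needs 2B bits: (-8) * (-8) = 64 needs the full 8 bits
assert -(1 << (2 * B - 1)) <= lo * lo <= (1 << (2 * B - 1)) - 1
```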
1-bit integer addition

Half adder:
  A B | Cout S
  0 0 |  0   0
  0 1 |  0   1
  1 0 |  0   1
  1 1 |  1   0

Full adder:
  A B Cin | Cout S
  0 0  0  |  0   0
  0 0  1  |  0   1
  0 1  0  |  0   1
  0 1  1  |  1   0
  1 0  0  |  0   1
  1 0  1  |  1   0
  1 1  0  |  1   0
  1 1  1  |  1   1
Carry Ripple Adder (CRA)
• In practice, more efficient CMOS FA implementations are used, rather than
plain logic gates (example: a FA built around a MINORITY gate)

[Figure: 4-bit CRA — a cascade of four 1-bit FAs; stage i takes ai, bi and
the previous carry Ci and produces si and the carry Ci+1]

• The Carry Ripple Adder is the simplest type of adder. It simply consists of
a cascade of FAs
• Problem: there is a critical path from Cin to Cout → the carry propagation
latency L increases linearly with the datapath width B
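The FA cascade can be sketched at gate level; the loop makes the linear carry chain explicit (function names are illustrative):

```python
def full_adder(a, b, cin):
    """Gate-level full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def ripple_carry_add(a_bits, b_bits, cin=0):
    """Cascade of FAs, LSB first: each stage waits for the previous carry,
    so the latency grows linearly with the width."""
    out = []
    carry = cin
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry

# 0111 (7) + 0011 (3), bit lists LSB first
s, cout = ripple_carry_add([1, 1, 1, 0], [1, 1, 0, 0])
assert s == [0, 1, 0, 1] and cout == 0   # 1010 = -6: the 4-bit overflow case
```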
Arithmetic and Logic Unit (ALU) - 1
• An adder is the core of the ALU. Many different implementations are
possible.
• An ALU has 2 input operands, some control inputs, and some output flags
specifying special conditions. Control inputs are driven by the CPU control
unit, whereas the output flags are usually written into the CPU’s status
register

[Figure: 1-bit ALU slice — a FA producing the output Yi and the carry Ci+1
from ai, bi and Ci]

The mode bit M enables/disables the carry propagation:
• M=1 → arithmetic mode
• M=0 → logic mode
Note: more advanced ALUs also allow shift-left and shift-right operations
Arithmetic and Logic Unit (ALU) - 2
Logic mode: M=0, Cin = don’t care (table indexed by C0C1C2 \ C3)
Arithmetic mode: M=1, C3=0 (table indexed by C0C1C2 \ Cin)
[Tables: the function selected by the control bits — entries lost in extraction]
Arithmetic and Logic Unit (ALU) - 3
The ALU’s output flag signals are easy to generate:
• Zero flag (e.g. NOR of all result bits)
• Overflow flag (CinMSB xor CoutMSB)
Binary multiplication
• The 1-bit product is the same as
a logic AND
Serial implementation

[Figure: serial multiplier — a B-bit ALU, a 2B-bit accumulator (AH:AL) and
control logic; at each step, if the multiplier LSB (in AL) is 1 the
multiplicand X is added into AH, then AH:AL is shifted right]

Problems
• The worst-case carry propagation latency can be very large
• The proposed circuit does not work for signed (2’C) numbers
  – Ex. 3 × -5: 0011 × 1011 = 00100001 = 33 → wrong! (should be -15)
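The serial scheme is plain shift-and-add on unsigned bit patterns, which is exactly why it fails on 2'C operands. A sketch (function name is illustrative):

```python
def shift_add_mul(a, b, B=4):
    """Serial shift-add multiplication, treating both operands as
    unsigned B-bit patterns; returns the 2B-bit product."""
    a &= (1 << B) - 1
    b &= (1 << B) - 1
    product = 0
    for i in range(B):
        if (b >> i) & 1:          # the 1-bit product is a logic AND
            product += a << i     # add the shifted multiplicand
    return product

# 3 * -5: the patterns are 0011 and 1011, but the unsigned interpretation
# computes 3 * 11 = 33 instead of -15 -> wrong for signed operands
assert shift_add_mul(3, -5) == 33
assert shift_add_mul(3, 5) == 15     # fine as long as both operands are >= 0
```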
Booth’s algorithm - 1
• Both of the previous problems are solved by Booth’s algorithm. Different
types of Booth’s algorithm exist (radix 2, radix 4, radix 8, …).
  – The radix-2 version solves the problem of signed multiplications, while
keeping the HW complexity approximately the same as in the integer case.
  – Radix-4 and radix-8 algorithms have the further advantage of reducing the
number of rows of the FA matrix (critical path reduction) at the expense of
a larger complexity
Booth’s algorithm - 2

In a bit string such as 0 1 1 1 1 0, each pair of adjacent bits marks the
middle of a run, the end of a run or the beginning of a run:

  Cur. bit  Bit to the right  Explanation             Example      Op
  0         0                 Middle of a run of 0s   0001111000   none
  0         1                 End of a run of 1s      0001111000   add
  1         0                 Beginning of a run of 1s 0001111000  sub
  1         1                 Middle of a run of 1s   0001111000   none

• Originally conceived for speed (when a shift was faster than an add)
Example: 3 × -5 (multiplicand A = 0011, multiplier = 1011, appended bit 0)

  Operation        Product (AH AL, appended bit)   Bit pair → next op
  init             0000 1011 0                     10 → AH -= A
  AH -= A          1101 1011 0
  shift right      1110 1101 1                     11 → no op.
  shift right      1111 0110 1                     01 → AH += A
  AH += A          0010 0110 1
  shift right      0001 0011 0                     10 → AH -= A
  AH -= A          1110 0011 0
  shift right      1111 0001 1

  Final product: 11110001 = -15 → correct!
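The radix-2 recoding can be sketched directly from the table: a `10` pair subtracts the (shifted) multiplicand, a `01` pair adds it. A minimal sketch, not the hardware register formulation (function name is illustrative):

```python
def booth_mul(a, b, B=4):
    """Radix-2 Booth multiplication of two B-bit 2's-complement numbers.

    Scans the multiplier LSB-to-MSB with an appended 0 on the right."""
    b &= (1 << B) - 1              # B-bit pattern of the multiplier
    product = 0
    prev = 0                       # the appended bit
    for i in range(B):
        cur = (b >> i) & 1
        if (cur, prev) == (1, 0):      # beginning of a run of 1s -> subtract
            product -= a << i
        elif (cur, prev) == (0, 1):    # end of a run of 1s -> add
            product += a << i
        prev = cur
    return product

assert booth_mul(3, -5) == -15     # the signed case is now correct
assert booth_mul(-4, -5) == 20
assert booth_mul(7, 7) == 49
```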
Fixed Point: fractional binary coding
• In fixed-point CPUs there is no hardware radix point. Thus, it is the
programmer’s or hardware designer’s responsibility to keep the radix point
in the correct position
• A fractional number is still represented in 2’s complement in fixed point,
but the radix position is set by convention
• The Q format is adopted for specifying the position of the binary point
(radix) starting from the LSB → a binary number in Qn format is defined as
having n bits to the right of the radix
• If a fractional number |N| < 1, it can be represented using a Q(B-1) format:

  N = -b_{B-1} + Σ_{i=1}^{B-1} b_{B-1-i} · 2^{-i}
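Since the radix point is only a convention, a Qn value is just an integer scaled by 2^n. A sketch of the encode/decode step (function names are illustrative):

```python
def to_q(x, n):
    """Encode a real number in Qn fixed point (round to nearest)."""
    return round(x * (1 << n))

def from_q(q, n):
    """Decode a Qn integer back to a real number."""
    return q / (1 << n)

# Q3 examples: the stored bit pattern is a plain 2'C integer
assert to_q(-0.375, 3) == -3       # 1.101 in 2'C, stored as the integer -3
assert to_q(0.75, 3) == 6          # 0.110, stored as 6
assert from_q(-3, 3) == -0.375
```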
Resolution, range and dynamic range
of a fx number (Q(B-1) case)
• Resolution: minimum fractional quantity which can be represented: 2^{-(B-1)}
• Range: R = [–1; 1-2^{-(B-1)}]
• Dynamic range: ratio of the largest and the smallest numbers which can be
represented with B bits:

  DR = 20·log10( (1-2^{-(B-1)}) / 2^{-(B-1)} ) = 20·log10( 2^{B-1} - 1 ) ≈ 6·(B-1) dB

Example: in most digital signal processing and digital control applications,
a 16-bit data path width is sufficient to represent signals without
significant loss of accuracy (DR = 90 dB)
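The three quantities are easy to evaluate numerically; for B = 16 (Q15) the dynamic range comes out at about 90 dB, as in the example above:

```python
import math

B = 16                                   # Q15: 16-bit fractional format
resolution = 2 ** -(B - 1)               # smallest representable step
r_min, r_max = -1.0, 1.0 - resolution    # representable range

# dynamic range: largest over smallest representable magnitude, in dB
DR = 20 * math.log10((1 - resolution) / resolution)

assert abs(DR - 6 * (B - 1)) < 1         # close to the 6(B-1) = 90 dB rule
```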
Managing fixed-point additions - 1
• Usually both integer and fractional operations are performed using
the same ALU:
  – Switching from one representation to the other may imply just writing
a fractional bit in the mode control register
  – In both the fractional and the integer case the main problem is overflow:
as only B bits become the input to the next operations, overflow has
to be managed
• Alternatively the fractional representation can be emulated by shifting
the radix point of the operands, i.e. scaling the value by 2^x, where x is
the number of bits in the fraction:

  -0.375 (dec) = 1.101 (bin) → × 2^3 → 1101 (bin) = -3 (dec)
   0.75 (dec)  = 0.110 (bin) → × 2^3 → 0110 (bin) =  6 (dec)

• NOTE: when adding numbers the binary representations must be the same,
i.e. the location of the radix point must be the same
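The alignment requirement can be shown with scaled integers: before adding a Q2 value to a Q3 value, the Q2 value must be shifted so both share the same radix point (the mixed Q2/Q3 pair is an illustrative choice):

```python
a_q3 = -3            # -0.375 encoded in Q3 (scale factor 2^3)
b_q2 = 3             #  0.75  encoded in Q2 (scale factor 2^2)

b_q3 = b_q2 << 1     # align the radix points: rescale Q2 -> Q3
s = a_q3 + b_q3      # now a plain integer addition

assert s / 2**3 == 0.375   # -0.375 + 0.75 = 0.375, read back as Q3
```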
Managing fixed-point additions - 2
• Overflow bit found in the status register → it may require a software
procedure to handle the exception
• Scaling: operands and/or intermediate results are shifted right (with sign
extension), discarding the low-order bits → loss of precision; dynamic
range is the limitation → requires a HW shifter
Shifters
• Hardware shifters allow scaling of the intermediate signal values →
division by a positive/negative power of 2
  – Shift right → loss of precision, but headroom against overflow
  – Shift left can be used when one datum is much smaller than the other
• Shifters and guard bits are used together to prevent overflow → the
greater the number of guard bits, the fewer shifters are needed
  – Shifters are needed when storing results into memory
  – Shifters can be found in different parts of the data path, e.g. after the
multiplier, after the ALU, before the ALU
  – Some shifters have limited capabilities (shift 1 bit left or right, …)
A practical example: a FIR filter
[Figures: a worked FIR filter computed in fixed point, showing in turn the
result with overflow, with guard bits, with scaling and with saturation]
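The interplay of guard bits and saturation in a FIR dot product can be sketched in Q15. The coefficients and samples below are made up to force the sum past full scale; the function names are illustrative:

```python
def saturate(x, B=16):
    """Clamp to the signed B-bit range instead of wrapping around."""
    lo, hi = -(1 << (B - 1)), (1 << (B - 1)) - 1
    return max(lo, min(hi, x))

def fir_q15(coeffs, samples):
    """Q15 FIR dot product: Q30 products accumulated in a wide accumulator
    (the extra width plays the role of guard bits), then scaled back to
    Q15 and saturated on store."""
    acc = 0
    for c, x in zip(coeffs, samples):
        acc += c * x                     # Q15 * Q15 -> Q30 product
    return saturate(acc >> 15)           # rescale to Q15, saturate

h = [0x4000, 0x4000, 0x4000]             # three taps of 0.5 (hypothetical)
x = [0x7FFF, 0x7FFF, 0x7FFF]             # three samples near full scale

y = fir_q15(h, x)
assert y == 0x7FFF                       # true sum ~1.5 saturates to ~1.0
assert fir_q15([0x4000], [0x4000]) == 0x2000   # 0.5 * 0.5 = 0.25: exact
```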
Managing fixed-point multiplications
• In integer multiplication the main problem is overflow, because the result
requires 2B bits → accumulators need at least 2B bits
• Overflow has to be managed as in the addition case
• In a Q(B-1) fixed-point multiplication both factors are smaller than 1 (in
absolute value) → the result is smaller than either operand → the result
grows downwards, so we still need 2B bits, but no overflow occurs!

Example
  a = 0.110 = 0.5 + 0.25 = 0.75
  b = 1.110 = -1 + 0.5 + 0.25 = -0.25

    0 0 0 0 0 0 0 0
    0 0 0 0 1 1 0 0      (sign-extended partial products)
    0 0 0 1 1 0 0 0
  + 1 1 0 1 0 0 0 0      (negated row for the sign bit)
    1 1 1 1 0 1 0 0 = -1 + 0.5 + 0.25 + 0.0625 = -0.1875
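In Q arithmetic the slide example reduces to one integer multiplication: a Qn × Qn product is a raw Q2n value over 2B bits, with no overflow possible for fractional operands:

```python
def q_mul(a, b):
    """Multiply two Qn-encoded integers; the raw product is Q2n."""
    return a * b

a = 6    # 0.110 = 0.75 in Q3
b = -2   # 1.110 = -0.25 in Q3

p = q_mul(a, b)                 # Q6 result over 2B = 8 bits
assert p == -12
assert p / 2**6 == -0.1875      # matches the worked example above
```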
Quantization techniques
• Quantizing a 2B-bit value X down to B bits introduces a quantization error e

[Figure: quantizer — the B upper bits of X feed an incrementer whose
carry-in Cin is driven by the rounding logic]

Example
• Truncate (round towards -∞) → Cin = 0
Putting it all together: a Fx DSP data path

[Figure: a typical Fx DSP data path — two register files (A0–A3 and B0–B3),
two accumulators (Acc A, Acc B), temporary registers Temp(0)/Temp(1), a mode
control register, a status register, two data buses and an adder/ALU]
Some examples of Int/Fx datapath - 1
Motorola DSP56xxx
• 4 input registers (X0, X1, Y0, Y1)
• Registers and data paths are 24 bits wide
• 2 accumulators, 56 bits wide
• Data chaining is possible (48 bits)
• 1 8-bit status register (condition code register, CCR)
• 1 8-bit mode register (MR)
Some examples of Int/Fx datapath - 2
TI TMS320C54x
• Heterogeneous datapath
• Dual-access memory through 2 16-bit buses (CB, DB)
• 40-bit ALU
• 16×16 multiplier
• 2 40-bit accumulators
• Most operand combinations are possible at the ALU input through
appropriate MUXes
• 3 16-bit status and control registers (ST0, ST1, PMST)
Floating-point data paths
Floating point binary coding
• Fp CPUs contain two data paths: one for integer operands and one for fp
operands (the latter usually has a separate register file made up of wider
registers)
• Floating-point binary coding is related to the usual (decimal) scientific
notation. Single-precision layout:

  bit 31: s | bits 30–23: exponent e | bits 22–0: mantissa m

  DR = 20·log10( 2^(2^C + B - 1) ) ≈ 6·(2^C + B - 1) dB
  (C exponent bits, B mantissa bits)
The standard ANSI/IEEE 754
• Standardization improves:
  – the portability of programs utilizing floating-point operations
  – the quality of programs utilizing floating-point operations
• The most notable feature is that it requires the computation to continue in
case of exceptional conditions (divide by zero, square root of a negative
number, etc.)
• Most FP processors today are based on the ANSI/IEEE 754 standard:

  N = (-1)^s · (1.m) · 2^(e-k)    with 1 ≤ 1.m < 2
  and k = 127 (single precision) or k = 1023 (double precision)

Single precision (32-bit)          Double precision (64-bit)
• s = 1 bit                        • s = 1 bit
• m = 23 bits                      • m = 52 bits
• e = 8 bits                       • e = 11 bits
The standard ANSI/IEEE 754
• The developers of the standard wanted a representation where the
nonnegative numbers are ordered in the same way as integers, i.e. the
magnitudes of fp numbers can be compared using an integer comparator
• Exponent field before mantissa field → a number with a greater exponent is
larger
• Problem with negative exponents: a negative exponent has 1 in the MSB, i.e.
it seems greater than a number with a positive exponent
• Solution: biased notation, e = actual exp. + bias; e.g. in single precision
bias = 127

  s   e         m    Number
  0   0         0    0
  1   0         0    -0
  s   0         ≠0   (-1)^s · 0.m · 2^-126
  s   0<e<255   m    (-1)^s · 1.m · 2^(e-127)
  0   255       0    +∞
  1   255       0    -∞
  s   255       ≠0   NaN (not a number)

• Denormal notation: useful for gradual underflow
• ∞ and NaN encode exceptional results during floating-point arithmetic
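The bit fields and the integer-comparator property can be checked by reinterpreting a float32 as its raw bits (the helper names are illustrative):

```python
import struct

def f32_bits(x):
    """Raw 32-bit pattern of x rounded to IEEE 754 single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def decode_f32(x):
    """Split a float32 into its (s, e, m) fields."""
    bits = f32_bits(x)
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

# -1.5 = (-1)^1 * 1.1b * 2^(127-127): sign 1, biased exp 127, m = 100...0b
assert decode_f32(-1.5) == (1, 127, 1 << 22)
assert decode_f32(1.0) == (0, 127, 0)

# biased notation: nonnegative floats order like their integer encodings
assert f32_bits(1.0) < f32_bits(2.0) < f32_bits(100.0)
```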
Floating-point addition
Floating-point ALU

[Figure: fp adder data path — a small ALU subtracts the exponents; the
significand of the operand with the smaller exponent is shifted right by the
exponent difference; the big ALU adds the aligned significands; the result
is then normalized (shift left or right, increment or decrement the
exponent) and rounded]

• Overflow is automatically saturated
• Underflow is automatically truncated to zero
• IEEE 754 defines 4 types of rounding:
  – round towards +∞
  – round towards -∞
  – round towards zero
  – round-to-the-nearest (which in the IEEE standard is the same as
convergent rounding)
• Most DSPs use round-to-the-nearest
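The align/add/normalize steps of the fp adder can be sketched on (significand, exponent) pairs; this ignores rounding and the special values, and the encoding helper is an assumption for the demo:

```python
import math

def fp_add(a, b, B=23):
    """Sketch of the fp-adder steps on (signed significand, exponent)
    pairs, with the significand scaled to B fractional bits."""
    (ma, ea), (mb, eb) = a, b
    if ea < eb:                        # let a be the larger-exponent operand
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    mb >>= (ea - eb)                   # shift the smaller significand right
    m, e = ma + mb, ea                 # "big ALU" addition
    while abs(m) >= (2 << B):          # renormalize: shift right, exp += 1
        m >>= 1
        e += 1
    while m and abs(m) < (1 << B):     # or shift left, exp -= 1
        m <<= 1
        e -= 1
    return m, e

def enc(x, B=23):
    """Encode x as a (significand, exponent) pair (illustrative helper)."""
    m, e = math.frexp(x)               # x = m * 2^e with 0.5 <= |m| < 1
    return int(m * (2 << B)), e - 1    # significand in [2^B, 2^(B+1))

m, e = fp_add(enc(1.5), enc(0.25))
assert m * 2.0 ** (e - 23) == 1.75
```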
Floating-point multiplication
Floating point emulation
• Floating-point arithmetic in fixed-point processors requires a SW
implementation → useful when high accuracy is required, but expensive
• Some manufacturers provide floating-point libraries for Fx CPUs
• Example: TMS320C25 code for floating-point multiplication
(operands in MA, EA and MB, EB, result in MC, EC)

  LAC  EA
  ADD  EB
  SACL EC       ; EXPONENT OF RESULT BEFORE NORMALIZATION
  LT   MA
  MPY  MB
  PAC           ; (ACC) = MA * MB
  SFL           ; TAKES CARE OF REDUNDANT SIGN BIT
  LARP AR5
  LAR  AR5,EC   ; AR5 IS INITIALIZED WITH EC
  NORM *-       ; FINDS MSB AND MODIFIES AR5
  SACH MC       ; MC = MA * MB (NORMALIZED)
  SAR  AR5,EC
Quantization error analysis in a Fp CPU
• If the signal is varying, the relative error is:

  ε(n) = ( Q[x(n)] - x(n) ) / x(n)

  E[ε²(n)] = 2^{-2B} / 6    (convergent rounding)

• The total fp quantization error can be modeled as a noise source modulated
by the signal:

  e(n) = Q[x(n)] - x(n) = x(n) · ε(n)
Fixed-point vs. floating-point - 1
• The quantization noise of fixed-point arithmetic has constant power,
independent of the signal → the SNR decreases as the signal level decreases
• The level of the floating-point quantization noise is related to the level
of the signal, i.e. the SNR is constant during computation
  – The word width of the exponent affects only the dynamic range
  – The word width of the mantissa affects only the quantization (processing)
noise

[Figure: quantization noise vs. signal level, fixed-point and floating-point]
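The SNR contrast can be demonstrated numerically: quantize a large and a small signal value with a fixed-point step and with float32 rounding, and compare the resulting SNRs (the test values 0.9 and 0.0009 are illustrative):

```python
import math
import struct

def q_fx(x, B=16):
    """Fixed-point quantization: constant step 2^-(B-1) over [-1, 1)."""
    step = 2.0 ** -(B - 1)
    return round(x / step) * step

def f32(x):
    """Round x to IEEE 754 single precision (relative-error quantizer)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def snr_db(x, q):
    return 10 * math.log10(x * x / (x - q) ** 2)

big, small = 0.9, 0.9e-3

# fixed point: absolute error is constant -> SNR drops with the signal level
assert snr_db(big, q_fx(big)) > snr_db(small, q_fx(small)) + 40

# floating point: relative error is bounded -> SNR stays high at any level
assert snr_db(big, f32(big)) > 120 and snr_db(small, f32(small)) > 120
```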
Fixed-point vs. floating-point - 2
Main problems in mathematical operations:
[Table lost in extraction]