Pooja Vashisth

This document discusses floating point number representation and arithmetic. It begins by introducing floating point representation using sign-magnitude notation, with a sign bit, exponent field, and significand or fraction. It describes normalized and denormalized representations, as well as special values like infinity and NaN. The document then discusses floating point arithmetic operations like addition, subtraction, multiplication and division at both the algorithmic level and hardware implementation level. It also covers floating point data formats and number ranges as defined by the IEEE 754 standard.


CSC 258

Pooja Vashisth
1. Representation of numerical data as floating-point numbers
2. Describe underflow, overflow, round off, and truncation errors
3. Describe floating point arithmetic operations
4. Hardware implementation of floating-point operations

 Floating-point data formats
 Underflow and overflow
 How representations affect accuracy and precision
 Investigate hardware implementation of various floating-point
arithmetic operations


We’ve presented a fixed-point representation of (some)
real numbers.

One alternative is floating point.

 In floating point, the number of digits used to
represent the integer and fractional parts may vary.
 Floating-point operations tend to require more complex
hardware but allow a much wider range of values.

 There’s a shift in topics starting next week.
 We’re moving away from assembly and toward
computer organization.

 Representation for non-integral numbers
 Including very small and very large numbers

 Like scientific notation
 –2.34 × 10^56 (normalized)
 +0.002 × 10^–4 (not normalized)
 +987.02 × 10^9 (not normalized)

 In binary
 ±1.xxxxxxx₂ × 2^yyyy

 Types float and double in C

 Defined by IEEE Std 754-1985
 Developed in response to divergence of representations
 Portability issues for scientific code

 Now almost universally adopted


 Two representations
 Single precision (32-bit)
 Double precision (64-bit)

S | Exponent | Fraction
single: 1 bit | 8 bits | 23 bits
double: 1 bit | 11 bits | 52 bits

x = (−1)^S × (1 + Fraction) × 2^(Exponent − Bias)

 S: sign bit (0 ⇒ non-negative, 1 ⇒ negative)


 Normalize significand: 1.0 ≤ |significand| < 2.0
 Always has a leading pre-binary-point 1 bit, so no need to
represent it explicitly (hidden bit)
 Significand is Fraction with the “1.” restored

 Exponent: excess representation: actual exponent + Bias


 Ensures exponent is unsigned
 Single: Bias = 127; Double: Bias = 1023
 Exponents 00000000 and 11111111 reserved
 Smallest value
 Exponent: 00000001
⇒ actual exponent = 1 – 127 = –126
 Fraction: 000…00 ⇒ significand = 1.0
 ±1.0 × 2^–126 ≈ ±1.2 × 10^–38

 Largest value
 Exponent: 11111110
⇒ actual exponent = 254 – 127 = +127
 Fraction: 111…11 ⇒ significand ≈ 2.0
 ±2.0 × 2^+127 ≈ ±3.4 × 10^+38
 Exponents 0000…00 and 1111…11 reserved
 Smallest value
 Exponent: 00000000001
⇒ actual exponent = 1 – 1023 = –1022
 Fraction: 000…00 ⇒ significand = 1.0
 ±1.0 × 2^–1022 ≈ ±2.2 × 10^–308

 Largest value
 Exponent: 11111111110
⇒ actual exponent = 2046 – 1023 = +1023
 Fraction: 111…11 ⇒ significand ≈ 2.0
 ±2.0 × 2^+1023 ≈ ±1.8 × 10^+308
 Relative precision
 all fraction bits are significant
 Single: approx 2^–23
 Equivalent to 23 × log₁₀2 ≈ 23 × 0.3 ≈ 6 decimal digits of precision
 Double: approx 2^–52
 Equivalent to 52 × log₁₀2 ≈ 52 × 0.3 ≈ 16 decimal digits of precision

 Represent –0.75
 –0.75 = (–1)¹ × 1.1₂ × 2^–1
 S = 1
 Fraction = 1000…00₂
 Exponent = –1 + Bias
 Single: –1 + 127 = 126 = 01111110₂
 Double: –1 + 1023 = 1022 = 01111111110₂

 Single: 1 01111110 1000…00
 Double: 1 01111111110 1000…00

 What number is represented by the single-
precision float
1 10000001 01000…00
 S = 1
 Fraction = 01000…00₂
 Exponent = 10000001₂ = 129

 x = (–1)¹ × (1 + 0.01₂) × 2^(129 – 127)
= (–1) × 1.25 × 2²
= –5.0
 Exponent = 000...0 ⇒ hidden bit is 0

x = (−1)^S × (0 + Fraction) × 2^(1 − Bias)

 Smaller than normal numbers
 allow for gradual underflow, with
diminishing precision

 Denormal with Fraction = 000...0

x = (−1)^S × (0 + 0) × 2^(1 − Bias) = ±0.0
Two representations
of 0.0!
 Exponent = 111...1, Fraction = 000...0
 ±Infinity
 Can be used in subsequent calculations, avoiding need for overflow
check
 Exponent = 111...1, Fraction ≠ 000...0
 Not-a-Number (NaN)
 Indicates illegal or undefined result
 e.g., 0.0 / 0.0
 Can be used in subsequent calculations

 Consider a 4-digit decimal example
 9.999 × 10^1 + 1.610 × 10^–1

 1. Align decimal points
 Shift number with smaller exponent
 9.999 × 10^1 + 0.016 × 10^1

 2. Add significands
 9.999 × 10^1 + 0.016 × 10^1 = 10.015 × 10^1

 3. Normalize result & check for
over/underflow
 1.0015 × 10^2

 4. Round and renormalize if necessary
 1.002 × 10^2
 Now consider a 4-digit binary example
 1.000₂ × 2^–1 + –1.110₂ × 2^–2 (0.5 + –0.4375)

 1. Align binary points
 Shift number with smaller exponent
 1.000₂ × 2^–1 + –0.111₂ × 2^–1

 2. Add significands
 1.000₂ × 2^–1 + –0.111₂ × 2^–1 = 0.001₂ × 2^–1

 3. Normalize result & check for
over/underflow
 1.000₂ × 2^–4, with no over/underflow

 4. Round and renormalize if necessary
 1.000₂ × 2^–4 (no change) = 0.0625
 Much more complex than integer adder
 Doing it in one clock cycle would take too long
 Much longer than integer operations
 Slower clock would penalize all instructions

 FP adder usually takes several cycles


 Can be pipelined

[Figure: floating-point adder hardware, organized as the four steps above: align, add, normalize, round]
 Consider a 4-digit decimal example
 1.110 × 10^10 × 9.200 × 10^–5

 1. Add exponents
 For biased exponents, subtract bias from sum
 New exponent = 10 + –5 = 5

 2. Multiply significands
 1.110 × 9.200 = 10.212 ⇒ 10.212 × 10^5

 3. Normalize result & check for over/underflow
 1.0212 × 10^6

 4. Round and renormalize if necessary
 1.021 × 10^6

 5. Determine sign of result from signs of operands
 +1.021 × 10^6
 Now consider a 4-digit binary example
 1.000₂ × 2^–1 × –1.110₂ × 2^–2 (0.5 × –0.4375)

 1. Add exponents
 Unbiased: –1 + –2 = –3
 Biased: (–1 + 127) + (–2 + 127) = –3 + 254 – 127 = –3 + 127

 2. Multiply significands
 1.000₂ × 1.110₂ = 1.110₂ ⇒ 1.110₂ × 2^–3

 3. Normalize result & check for over/underflow
 1.110₂ × 2^–3 (no change) with no over/underflow

 4. Round and renormalize if necessary
 1.110₂ × 2^–3 (no change)

 5. Determine sign: +ve × –ve ⇒ –ve
 –1.110₂ × 2^–3 = –0.21875
 FP multiplier is of similar complexity to FP adder
 But uses a multiplier for significands instead of an adder

 FP arithmetic hardware usually does


 Addition, subtraction, multiplication, division, reciprocal, square-
root
 FP ↔ integer conversion

 Operations usually take several cycles


 Can be pipelined

 Separate FP registers: f0, …, f31
 double-precision
 single-precision values stored in the lower 32 bits
 FP instructions operate only on FP registers
 FP load and store instructions
 flw, fld
 fsw, fsd

 Single-precision arithmetic
 fadd.s, fsub.s, fmul.s, fdiv.s,
fsqrt.s
 e.g., fadd.s f2, f4, f6

 Double-precision arithmetic
 fadd.d, fsub.d, fmul.d, fdiv.d,
fsqrt.d
 e.g., fadd.d f2, f4, f6

 Single and double-precision comparison


 feq.s, flt.s, fle.s
 feq.d, flt.d, fle.d
 Result is 0 or 1 in integer destination register
 Use beq, bne to branch on comparison result

 No dedicated FP branch instructions or FP condition
codes in RISC-V; branch on the integer result of an
FP comparison
 C code:
float f2c (float fahr) {
return ((5.0/9.0)*(fahr - 32.0));
}
 fahr in f10, result in f10, literals in global memory space

 Compiled RISC-V code:


f2c:
flw f0,const5(x3) // f0 = 5.0f
flw f1,const9(x3) // f1 = 9.0f
fdiv.s f0, f0, f1 // f0 = 5.0f / 9.0f
flw f1,const32(x3) // f1 = 32.0f
fsub.s f10,f10,f1 // f10 = fahr - 32.0f
fmul.s f10,f0,f10 // f10 = (5.0f/9.0f) * (fahr–32.0f)
jalr x0,0(x1) // return

C = C + A × B
 All 32 × 32 matrices, 64-bit double-precision
elements
 C code:
void mm (double c[][32],
double a[][32], double b[][32]) {
size_t i, j, k;
for (i = 0; i < 32; i = i + 1)
for (j = 0; j < 32; j = j + 1)
for (k = 0; k < 32; k = k + 1)
c[i][j] = c[i][j]
+ a[i][k] * b[k][j];
}
 Addresses of c, a, b in x10, x11, x12, and
i, j, k in x5, x6, x7
 RISC-V code:
mm:...
li x28,32 // x28 = 32 (row size/loop end)
li x5,0 // i = 0; initialize 1st for loop
L1: li x6,0 // j = 0; initialize 2nd for loop
L2: li x7,0 // k = 0; initialize 3rd for loop
slli x30,x5,5 // x30 = i * 2**5 (size of row of c)
add x30,x30,x6 // x30 = i * size(row) + j
slli x30,x30,3 // x30 = byte offset of [i][j]
add x30,x10,x30 // x30 = byte address of c[i][j]
fld f0,0(x30) // f0 = c[i][j]
L3: slli x29,x7,5 // x29 = k * 2**5 (size of row of b)
add x29,x29,x6 // x29 = k * size(row) + j
slli x29,x29,3 // x29 = byte offset of [k][j]
add x29,x12,x29 // x29 = byte address of b[k][j]
fld f1,0(x29) // f1 = b[k][j]


slli x29,x5,5 // x29 = i * 2**5 (size of row of a)
add x29,x29,x7 // x29 = i * size(row) + k
slli x29,x29,3 // x29 = byte offset of [i][k]
add x29,x11,x29 // x29 = byte address of a[i][k]
fld f2,0(x29) // f2 = a[i][k]
fmul.d f1, f2, f1 // f1 = a[i][k] * b[k][j]
fadd.d f0, f0, f1 // f0 = c[i][j] + a[i][k] * b[k][j]
addi x7,x7,1 // k = k + 1
bltu x7,x28,L3 // if (k < 32) go to L3
fsd f0,0(x30) // c[i][j] = f0
addi x6,x6,1 // j = j + 1
bltu x6,x28,L2 // if (j < 32) go to L2
addi x5,x5,1 // i = i + 1
bltu x5,x28,L1 // if (i < 32) go to L1

 IEEE Std 754 specifies additional rounding
control
 Extra bits of precision (guard, round, sticky)
 Choice of rounding modes
 Allows programmer to fine-tune numerical behavior
of a computation
 Not all FP units implement all options
 Most programming languages and FP libraries just
use defaults
 Trade-off between hardware complexity,
performance, and market requirements
• Submit READY? Quizzes before next week’s classes
• Participate in Peer discussion and Q/A every week
• Check your labs schedule… (Lab D)
• Attempt your Quiz3 (closes Fri, Feb. 17)

• Practice questions (these are part of Homework2):
(also mentioned at eClass course webpage)
• #?: 3.1, 3.6, 3.11*, 3.13*
• #?: 3.18*, 3.20*, 3.23, 3.24

 Basics of logic design
 Hardware Description Language

