Pooja Vashisth

This document discusses floating point number representation and arithmetic. It begins by introducing floating point representation using sign-magnitude notation, with a sign bit, exponent field, and significand or fraction. It describes normalized and denormalized representations, as well as special values like infinity and NaN. The document then discusses floating point arithmetic operations like addition, subtraction, multiplication and division at both the algorithmic level and hardware implementation level. It also covers floating point data formats and number ranges as defined by the IEEE 754 standard.


CSC 258

Pooja Vashisth
1. Representation of numerical data as floating-point numbers
2. Describe underflow, overflow, round off, and truncation errors
3. Describe floating point arithmetic operations
4. Hardware implementation of floating-point operations

 Floating-point data formats
 Underflow and overflow
 How representations affect accuracy and precision
 Investigate hardware implementation of various floating-point
arithmetic operations


We’ve presented a fixed-point representation of (some)
real numbers.

One alternative is floating point.

 In floating point, the number of digits used to
represent the integer and fractional parts may vary.
 Floating-point operations tend to require more complex
hardware but allow a much wider range of values.

 There’s a shift in topics starting next week.
 We’re moving away from assembly and toward
computer organization.

 Representation for non-integral numbers
 Including very small and very large numbers

 Like scientific notation
 –2.34 × 10^56 (normalized)
 +0.002 × 10^–4 (not normalized)
 +987.02 × 10^9 (not normalized)

 In binary
 ±1.xxxxxxx₂ × 2^yyyy

 Types float and double in C

 Defined by IEEE Std 754-1985
 Developed in response to divergence of representations
 Portability issues for scientific code

 Now almost universally adopted


 Two representations
 Single precision (32-bit)
 Double precision (64-bit)

S | Exponent | Fraction
single: 1 bit | 8 bits | 23 bits
double: 1 bit | 11 bits | 52 bits

x = (−1)^S × (1 + Fraction) × 2^(Exponent − Bias)

 S: sign bit (0 ⇒ non-negative, 1 ⇒ negative)


 Normalize significand: 1.0 ≤ |significand| < 2.0
 Always has a leading pre-binary-point 1 bit, so no need to
represent it explicitly (hidden bit)
 Significand is Fraction with the “1.” restored

 Exponent: excess representation: actual exponent + Bias


 Ensures exponent is unsigned
 Single: Bias = 127; Double: Bias = 1023
 Exponents 00000000 and 11111111 reserved
 Smallest value
 Exponent: 00000001
⇒ actual exponent = 1 – 127 = –126
 Fraction: 000…00 ⇒ significand = 1.0
 ±1.0 × 2^–126 ≈ ±1.2 × 10^–38

 Largest value
 Exponent: 11111110
⇒ actual exponent = 254 – 127 = +127
 Fraction: 111…11 ⇒ significand ≈ 2.0
 ±2.0 × 2^+127 ≈ ±3.4 × 10^+38
 Exponents 0000…00 and 1111…11 reserved
 Smallest value
 Exponent: 00000000001
⇒ actual exponent = 1 – 1023 = –1022
 Fraction: 000…00 ⇒ significand = 1.0
 ±1.0 × 2^–1022 ≈ ±2.2 × 10^–308

 Largest value
 Exponent: 11111111110
⇒ actual exponent = 2046 – 1023 = +1023
 Fraction: 111…11 ⇒ significand ≈ 2.0
 ±2.0 × 2^+1023 ≈ ±1.8 × 10^+308
 Relative precision
 all fraction bits are significant
 Single: approx 2^–23
 Equivalent to 23 × log₁₀2 ≈ 23 × 0.3 ≈ 6 decimal digits of precision
 Double: approx 2^–52
 Equivalent to 52 × log₁₀2 ≈ 52 × 0.3 ≈ 16 decimal digits of precision

 Represent –0.75
 –0.75 = (–1)¹ × 1.1₂ × 2^–1
 S = 1
 Fraction = 1000…00₂
 Exponent = –1 + Bias
 Single: –1 + 127 = 126 = 01111110₂
 Double: –1 + 1023 = 1022 = 01111111110₂

 Single: 1 01111110 1000…00
 Double: 1 01111111110 1000…00

 What number is represented by the single-
precision float
1 10000001 01000…00
 S = 1
 Fraction = 01000…00₂
 Exponent = 10000001₂ = 129

 x = (–1)¹ × (1 + 0.01₂) × 2^(129 – 127)
= (–1) × 1.25 × 2²
= –5.0
 Exponent = 000...0 ⇒ hidden bit is 0

x = (−1)^S × (0 + Fraction) × 2^(1 − Bias)

 Smaller than normal numbers
 allow for gradual underflow, with
diminishing precision

 Denormal with Fraction = 000...0

x = (−1)^S × (0 + 0) × 2^(1 − Bias) = ±0.0
Two representations
of 0.0!
 Exponent = 111...1, Fraction = 000...0
 ±Infinity
 Can be used in subsequent calculations, avoiding need for overflow
check
 Exponent = 111...1, Fraction ≠ 000...0
 Not-a-Number (NaN)
 Indicates illegal or undefined result
 e.g., 0.0 / 0.0
 Can be used in subsequent calculations

 Consider a 4-digit decimal example
 9.999 × 10^1 + 1.610 × 10^–1

 1. Align decimal points
 Shift number with smaller exponent
 9.999 × 10^1 + 0.016 × 10^1

 2. Add significands
 9.999 × 10^1 + 0.016 × 10^1 = 10.015 × 10^1

 3. Normalize result & check for
over/underflow
 1.0015 × 10^2

 4. Round and renormalize if necessary
 1.002 × 10^2
 Now consider a 4-digit binary example
 1.000₂ × 2^–1 + –1.110₂ × 2^–2 (0.5 + –0.4375)

 1. Align binary points
 Shift number with smaller exponent
 1.000₂ × 2^–1 + –0.111₂ × 2^–1

 2. Add significands
 1.000₂ × 2^–1 + –0.111₂ × 2^–1 = 0.001₂ × 2^–1

 3. Normalize result & check for
over/underflow
 1.000₂ × 2^–4, with no over/underflow

 4. Round and renormalize if necessary
 1.000₂ × 2^–4 (no change) = 0.0625
 Much more complex than integer adder
 Doing it in one clock cycle would take too long
 Much longer than integer operations
 Slower clock would penalize all instructions

 FP adder usually takes several cycles


 Can be pipelined

[Figure: floating-point adder hardware, organized as the four steps above: align, add, normalize, round]
 Consider a 4-digit decimal example
 1.110 × 10^10 × 9.200 × 10^–5

 1. Add exponents
 For biased exponents, subtract bias from sum
 New exponent = 10 + –5 = 5

 2. Multiply significands
 1.110 × 9.200 = 10.212 ⇒ 10.212 × 10^5

 3. Normalize result & check for over/underflow
 1.0212 × 10^6

 4. Round and renormalize if necessary
 1.021 × 10^6

 5. Determine sign of result from signs of operands
 +1.021 × 10^6
 Now consider a 4-digit binary example
 1.000₂ × 2^–1 × –1.110₂ × 2^–2 (0.5 × –0.4375)

 1. Add exponents
 Unbiased: –1 + –2 = –3
 Biased: (–1 + 127) + (–2 + 127) = –3 + 254 – 127 = –3 + 127

 2. Multiply significands
 1.000₂ × 1.110₂ = 1.110₂ ⇒ 1.110₂ × 2^–3

 3. Normalize result & check for over/underflow
 1.110₂ × 2^–3 (no change) with no over/underflow

 4. Round and renormalize if necessary
 1.110₂ × 2^–3 (no change)

 5. Determine sign: +ve × –ve ⇒ –ve
 –1.110₂ × 2^–3 = –0.21875
 FP multiplier is of similar complexity to FP adder
 But uses a multiplier for significands instead of an adder

 FP arithmetic hardware usually does


 Addition, subtraction, multiplication, division, reciprocal, square-
root
 FP ↔ integer conversion

 Operations usually take several cycles


 Can be pipelined

 Separate FP registers: f0, …, f31
 double-precision
 single-precision values stored in the lower 32 bits
 FP instructions operate only on FP registers
 FP load and store instructions
 flw, fld
 fsw, fsd

 Single-precision arithmetic
 fadd.s, fsub.s, fmul.s, fdiv.s,
fsqrt.s
 e.g., fadd.s f2, f4, f6

 Double-precision arithmetic
 fadd.d, fsub.d, fmul.d, fdiv.d,
fsqrt.d
 e.g., fadd.d f2, f4, f6

 Single and double-precision comparison


 feq.s, flt.s, fle.s
 feq.d, flt.d, fle.d
 Result is 0 or 1 in integer destination register
 Use beq, bne to branch on comparison result

 No dedicated FP branch instructions or FP condition
codes in RISC-V; branch on the integer result of an
FP comparison
 C code:
float f2c (float fahr) {
return ((5.0/9.0)*(fahr - 32.0));
}
 fahr in f10, result in f10, literals in global memory space

 Compiled RISC-V code:


f2c:
flw f0,const5(x3) // f0 = 5.0f
flw f1,const9(x3) // f1 = 9.0f
fdiv.s f0, f0, f1 // f0 = 5.0f / 9.0f
flw f1,const32(x3) // f1 = 32.0f
fsub.s f10,f10,f1 // f10 = fahr - 32.0f
fmul.s f10,f0,f10 // f10 = (5.0f/9.0f) * (fahr–32.0f)
jalr x0,0(x1) // return

C = C + A × B
 All 32 × 32 matrices, 64-bit double-precision
elements
 C code:
void mm (double c[][32],
double a[][32], double b[][32]) {
size_t i, j, k;
for (i = 0; i < 32; i = i + 1)
for (j = 0; j < 32; j = j + 1)
for (k = 0; k < 32; k = k + 1)
c[i][j] = c[i][j]
+ a[i][k] * b[k][j];
}
 Addresses of c, a, b in x10, x11, x12, and
i, j, k in x5, x6, x7
 RISC-V code:
mm:...
li x28,32 // x28 = 32 (row size/loop end)
li x5,0 // i = 0; initialize 1st for loop
L1: li x6,0 // j = 0; initialize 2nd for loop
L2: li x7,0 // k = 0; initialize 3rd for loop
slli x30,x5,5 // x30 = i * 2**5 (size of row of c)
add x30,x30,x6 // x30 = i * size(row) + j
slli x30,x30,3 // x30 = byte offset of [i][j]
add x30,x10,x30 // x30 = byte address of c[i][j]
fld f0,0(x30) // f0 = c[i][j]
L3: slli x29,x7,5 // x29 = k * 2**5 (size of row of b)
add x29,x29,x6 // x29 = k * size(row) + j
slli x29,x29,3 // x29 = byte offset of [k][j]
add x29,x12,x29 // x29 = byte address of b[k][j]
fld f1,0(x29) // f1 = b[k][j]


slli x29,x5,5 // x29 = i * 2**5 (size of row of a)
add x29,x29,x7 // x29 = i * size(row) + k
slli x29,x29,3 // x29 = byte offset of [i][k]
add x29,x11,x29 // x29 = byte address of a[i][k]
fld f2,0(x29) // f2 = a[i][k]
fmul.d f1, f2, f1 // f1 = a[i][k] * b[k][j]
fadd.d f0, f0, f1 // f0 = c[i][j] + a[i][k] * b[k][j]
addi x7,x7,1 // k = k + 1
bltu x7,x28,L3 // if (k < 32) go to L3
fsd f0,0(x30) // c[i][j] = f0
addi x6,x6,1 // j = j + 1
bltu x6,x28,L2 // if (j < 32) go to L2
addi x5,x5,1 // i = i + 1
bltu x5,x28,L1 // if (i < 32) go to L1

 IEEE Std 754 specifies additional rounding
control
 Extra bits of precision (guard, round, sticky)
 Choice of rounding modes
 Allows programmer to fine-tune numerical behavior
of a computation
 Not all FP units implement all options
 Most programming languages and FP libraries just
use defaults
 Trade-off between hardware complexity,
performance, and market requirements
• Submit READY? Quizzes before next week’s classes
• Participate in Peer discussion and Q/A every week
• Check your labs schedule… (Lab D)
• Attempt your Quiz3 (closes Fri, Feb. 17)

• Practice questions (these are part of Homework2):
(also mentioned at eClass course webpage)
• #?: 3.1, 3.6, 3.11*, 3.13*
• #?: 3.18*, 3.20*, 3.23, 3.24

 Basics of logic design
 Hardware Description Language

