Chapter 03 Arith 3 Float

This document provides an overview of floating-point arithmetic, focusing on the IEEE 754 standard for single and double precision, special numbers, and floating-point operations such as addition and multiplication. It details RISC-V floating-point instructions, fixed-point versus floating-point representations, and the structure of floating-point numbers. Additionally, it discusses the importance of rounding modes and the internal format used in arithmetic operations.

Uploaded by

s6i893i7744
Copyright
© © All Rights Reserved
Computer Architecture
CH3 Computer Arithmetic (III)
Floating Point

Prof. Ren-Shuo Liu


NTHU EE
Outline
• Overview
• IEEE 754 standard
• Single-precision
• Double-precision
• Special numbers
• Floating-point operations
• Addition
• Multiplication
• Rounding

2
Outline
• Overview
• RISC-V floating-point instructions
• Fixed-point and floating-point representations
• IEEE 754 standard
• Floating-point operations

3
RISC-V Floating-Point Instructions
• Arithmetic
• fadd.s, fsub.s, fmul.s, fdiv.s # s means single-precision
• fadd.d, fsub.d, fmul.d, fdiv.d # d means double-precision

• Comparisons
• feq.s, feq.d # equal
• flt.s, flt.d # less than
• fle.s, fle.d # less than or equal

• Load / store
• flw, fsw
• fld, fsd

4
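Floating Point Unit and Register Files
• Separate 32 registers for floating-point
• Register pairs (e.g., $F0 and $F1) for double precision when each register is 32 bits wide (RISC-V cores with the D extension instead hold a double in one 64-bit register)
• $F0 is not always zero
• Floating-point instructions can be optional
• Many embedded systems do not utilize them
[Figure: the processor has an integer register file ($0 to $31) feeding the Add and Mult/Div units, and a floating-point register file ($F0 to $F31) feeding the fadd/fmult/fdiv units]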

5
Fixed-Point
• Integers scaled by an implicit factor
• The scaling factor for each variable does not change
(i.e., fixed) during the entire computation
• Examples
• 3.14 is represented as
• 314 (scaling factor = 1/100)
• 3140 (scaling factor = 1/1000)
• 5,000,000 is represented as
• 5 (scaling factor = 1,000,000)
• 50 (scaling factor = 100,000)
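
The fixed-point idea can be sketched in Python as follows. This is a minimal illustration, assuming a scaling factor of 1/100 for every variable; the names `SCALE`, `to_fixed`, and `fixed_mul` are hypothetical and not part of the slides.

```python
from decimal import Decimal

SCALE = 100  # assumed implicit scaling factor of 1/100; it is not stored with the data


def to_fixed(text: str) -> int:
    """Convert a decimal string such as '3.14' to its scaled integer."""
    return int(Decimal(text) * SCALE)


def fixed_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values; the raw product carries SCALE twice."""
    return (a * b) // SCALE


pi = to_fixed("3.14")          # stored as the integer 314
two = to_fixed("2.00")         # stored as the integer 200
total = pi + two               # addition needs no rescaling
product = fixed_mul(pi, two)   # multiplication must divide one SCALE back out
```

Because the factor never changes, only the integers are stored and the programmer keeps track of what they mean.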

6
Floating-Point ~= Scientific Notation
• Speed of light: +2.99792458 × 10^8 (m/s)
• Electron charge: -1.60217733 × 10^-19 (C)
• 0.5 mol of carbon atoms: +6.00000000 × 10^-3 (kg)
• Parts of each number, from left to right
• Sign
• Significand (also called fraction or mantissa)
• Radix or base (10 here)
• Exponent
• Normalized form
• Exactly one non-zero significant digit to the left of the
point
• 29.9 × 10^7 and 0.299 × 10^9 are not normalized forms

7
Floating-Point ~= Scientific Notation (Cont'd)
• Scaling factor (the exponent) is explicit


• "Floating"
• Scaling factor can change during computation

8
Floating Point Number
• IEEE 754 standard
• Single-precision (32 bits): S | Exp. | Significand
• Double-precision (64 bits): S | Exp. | Significand
• Represented value: (-1)^S × 1.Significand_two × 2^(Exponent - Bias)

                                      Sign bit   Exp. bits   Significand bits   Bias
Single precision (float in C/C++)        1           8             23            127
Double precision (double in C/C++)       1          11             52           1023
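
The field widths in the table can be checked with a short Python sketch that splits an encoded number into its three fields. The helper names `float32_fields` and `float64_fields` are hypothetical; only the standard `struct` module is used.

```python
import struct


def float32_fields(x: float) -> tuple:
    """Return (sign, exponent, significand) of x encoded in single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, still biased by 127
    significand = bits & 0x7FFFFF     # 23 bits, without the hidden leading 1
    return sign, exponent, significand


def float64_fields(x: float) -> tuple:
    """Return (sign, exponent, significand) of x encoded in double precision."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                       # 1 bit
    exponent = (bits >> 52) & 0x7FF         # 11 bits, still biased by 1023
    significand = bits & ((1 << 52) - 1)    # 52 bits
    return sign, exponent, significand


one_single = float32_fields(1.0)     # exponent field equals the bias
neg_single = float32_fields(-2.5)    # -1.01_two x 2^1
one_double = float64_fields(1.0)
```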

9
Special Floating Point Numbers (32 bits)

                       S     Exp.    Significand
Zero                   S     0…00    0…00
Denormalized value     S     0…00    non-zero
+/- ∞                  S     1…11    0…00
Not a number (NaN)     any   1…11    non-zero

i.e., the maximal and minimal exponent values are reserved for special floating-point numbers
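
The table translates directly into a classifier over 32-bit patterns. The function name `classify_float32` is an illustrative assumption, not an API from the slides.

```python
def classify_float32(bits: int) -> str:
    """Classify a 32-bit single-precision pattern using only its fields."""
    exponent = (bits >> 23) & 0xFF
    significand = bits & 0x7FFFFF
    if exponent == 0x00:      # minimal exponent: zero or denormalized
        return "zero" if significand == 0 else "denormalized"
    if exponent == 0xFF:      # maximal exponent: infinity or NaN
        return "infinity" if significand == 0 else "NaN"
    return "normalized"


kinds = [classify_float32(b) for b in (0x00000000, 0x80000000, 0x00000001,
                                       0x7F800000, 0x7FC00000, 0x3F800000)]
```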

10
Denormalized Value
• Patterns of the form S 0...00 Significand, with a non-zero significand, are denormalized values
• Value: (-1)^S × 0.Significand_two × 2^(1 - Bias)
• No leading one to the left of the point

• Objective
• Represent very small values
• Gradual underflow
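
A small sketch of the denormalized formula in Python, using exact fractions. The helper `denormal_value` is hypothetical; the cross-check relies on `struct` interpreting the pattern 0x00000001 as a single-precision number.

```python
import struct
from fractions import Fraction

BIAS = 127  # single precision


def denormal_value(significand: int, sign: int = 0) -> Fraction:
    """Exact value of a single-precision pattern whose exponent field is all zeros."""
    # 0.Significand x 2^(1 - BIAS): no hidden leading one
    magnitude = Fraction(significand, 2 ** 23) * Fraction(1, 2 ** (BIAS - 1))
    return -magnitude if sign else magnitude


smallest = denormal_value(1)             # only the last significand bit set
largest = denormal_value(0x7FFFFF)       # all significand bits set
smallest_normal = Fraction(1, 2 ** 126)  # 1.0 x 2^(1 - BIAS)

# Cross-check against the platform's interpretation of the same bit pattern.
(hardware,) = struct.unpack(">f", struct.pack(">I", 0x00000001))
```

The largest denormalized value sits just below the smallest normalized one, which is what makes the underflow gradual.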

11
Floating Point Examples
• 32 bits: S | Exp. | Significand
0 01111000 10100…000

= 1.1010…000_two × 2^(120 - 127)
= 1.1010_two × 2^-7
= 1.625_ten × 2^-7
= 0.0126953125_ten
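
The same decoding can be reproduced with exact arithmetic. This is a sketch; `decode_normalized` is a hypothetical helper that applies the represented-value formula to a normalized pattern.

```python
from fractions import Fraction


def decode_normalized(sign: int, exponent: int, significand_bits: str,
                      bias: int = 127) -> Fraction:
    """Apply (-1)^S x 1.Significand x 2^(Exponent - Bias) to the given fields."""
    fraction = Fraction(int(significand_bits, 2), 2 ** len(significand_bits))
    magnitude = (1 + fraction) * Fraction(2) ** (exponent - bias)
    return -magnitude if sign else magnitude


# Fields of the example above: S = 0, Exp. = 01111000, Significand = 101 followed by zeros
value = decode_normalized(0, 0b01111000, "101" + "0" * 20)
```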

12
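Floating Point Examples
• Convert -3.14 to 32-bit float
• 3 = 11_two
• 0.14 is converted by repeatedly multiplying the fraction part by 2; the integer part of each product is the next bit
0.14 → 0.28, 0.56, 1.12, 0.24, 0.48, 0.96, 1.92, 1.84, 1.68, 1.36, 0.72, 1.44,
0.88, 1.76, 1.52, 1.04, 0.08, 0.16, 0.32, 0.64, 1.28, 0.56, 1.12, 0.24
• 3.14
= 11.0010_0011_1101_0111_0000_1010…_two
= 1.1001_0001_1110_1011_1000_010_two × 2^1 (23-bit significand, assume not rounded)
• -3.14 as 32 bits (S, Exp., Significand)
= 1 10000000 10010001111010111000010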
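
The repeated-doubling procedure can be sketched in Python as follows. The helper `fraction_bits` is hypothetical, and the result is truncated rather than rounded, matching the slide's "assume not rounded" assumption.

```python
import struct
from fractions import Fraction


def fraction_bits(frac: Fraction, count: int) -> str:
    """Double the fraction repeatedly; each integer part is the next binary digit."""
    bits = []
    for _ in range(count):
        frac *= 2
        bit = int(frac)       # 1 if the doubled value reached 1, else 0
        bits.append(str(bit))
        frac -= bit           # keep only the fraction part for the next step
    return "".join(bits)


frac24 = fraction_bits(Fraction(14, 100), 24)

# 3 = 11_two, so 3.14 = 11.<frac24> = 1.1<frac24> x 2^1; keep 23 bits after the point.
significand = ("1" + frac24)[:23]
encoded = "1" + format(1 + 127, "08b") + significand   # sign, biased exponent, significand
truncated = int(encoded, 2)

# For comparison: the pattern produced when the conversion rounds to nearest.
(rounded,) = struct.unpack(">I", struct.pack(">f", -3.14))
```

The truncated and rounded patterns differ only in the last significand bit, because the first discarded bit of 3.14 is a 1.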

13
IEEE 754 Online Converter

https://www.h-schmidt.net/FloatConverter/IEEE754.html

http://babbage.cs.qc.cuny.edu/IEEE-754.old/Decimal.html
14
Floating Point Operations
• Comparisons
• Addition
• Multiplication

15
Comparisons
• Similar to sign-magnitude integer comparison
• For each operand, {S, Exp., Significand} is viewed as {S, Magnitude}, where Magnitude = {Exp., Significand}

• Rationales
• Positive > negative
• Between two positive floating point numbers
• One with larger {exponent, significand} is greater
• Between two negative floating point numbers
• One with smaller {exponent, significand} is greater
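
These rationales can be sketched as a comparison that looks only at the sign bit and the 31-bit magnitude. The names `float32_bits` and `sign_magnitude_less` are hypothetical, and NaN patterns are assumed to be excluded.

```python
import struct


def float32_bits(x: float) -> int:
    """Return the 32-bit single-precision pattern of x."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits


def sign_magnitude_less(a_bits: int, b_bits: int) -> bool:
    """Return a < b for two non-NaN patterns using sign and magnitude only."""
    a_sign, a_mag = a_bits >> 31, a_bits & 0x7FFFFFFF
    b_sign, b_mag = b_bits >> 31, b_bits & 0x7FFFFFFF
    if a_mag == 0 and b_mag == 0:
        return False              # +0 and -0 compare equal
    if a_sign != b_sign:
        return a_sign == 1        # negative < positive
    if a_sign == 0:
        return a_mag < b_mag      # both positive: larger magnitude is greater
    return a_mag > b_mag          # both negative: larger magnitude is smaller


is_less = sign_magnitude_less(float32_bits(-2.5), float32_bits(1.5))
```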
16
Comparisons (Cont'd)
• Cases directly supported by sign-magnitude
comparisons
• +∞ == +∞
• -∞ == -∞
• -∞ < all finite numbers < +∞
• 0 == -0

• Special cases that sign-magnitude comparisons do not directly support
• != involving any NaN yields true
• All other comparisons involving NaN yield false
• NaN < 10? → false
• NaN > NaN? → false
• NaN == NaN? → false
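
Python floats follow the same NaN comparison rules, so the cases above can be observed directly. This is only a demonstration; the dictionary name `comparisons` is illustrative.

```python
import math

nan = float("nan")

comparisons = {
    "nan < 10": nan < 10,
    "nan > nan": nan > nan,
    "nan == nan": nan == nan,
    "nan != nan": nan != nan,
}

# Because nan == nan is false, a dedicated predicate is needed to detect NaN.
is_nan = math.isnan(nan)
```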

17
Addition
• Steps
1. Align (adjust the smaller number)
2. Perform addition
3. Normalize
4. Round
5. Re-normalize

• Examples
• 9.999_ten × 10^1 + 1.610_ten × 10^-1
• 1.101_two × 2^9 + 1.110_two × 2^12
• Assume four-digit significands

18
Decimal Example
9.999_ten × 10^1 + 1.610_ten × 10^-1
Align
= 9.999_ten × 10^1 + 0.01610_ten × 10^1
Add
= 10.01510_ten × 10^1
Normalize
= 1.001510_ten × 10^2
Round
= 1.002_ten × 10^2
Renormalize
= 1.002_ten × 10^2 (no change)
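
The decimal result can be cross-checked with Python's `decimal` module, assuming a context limited to four significant digits. The module performs the align, add, and round steps internally, so this verifies the outcome rather than the individual steps.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

four_digits = Context(prec=4, rounding=ROUND_HALF_EVEN)  # assumed four-digit significand

a = Decimal("9.999E1")
b = Decimal("1.610E-1")

exact = a + b                      # default context keeps every digit of the sum
rounded = four_digits.add(a, b)    # sum rounded to four significant digits
```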

19
Binary Example
1.101_two × 2^9 + 1.110_two × 2^12
Align
= 0.001101_two × 2^12 + 1.110_two × 2^12
Add
= 1.111101_two × 2^12
Normalize
= 1.111101_two × 2^12 (no change)
Round
= 10.000_two × 2^12
Renormalize
= 1.000_two × 2^13
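
The five steps can be sketched with integer significands. This is a minimal illustration under stated assumptions: both operands are positive and normalized, significands have `P` = 4 bits including the leading one, and rounding is to nearest with ties to even. The names `fp_add` and `round_nearest_even` are hypothetical. Alignment is done by scaling both operands to the smaller exponent, which is equivalent to shifting the smaller operand right while keeping every shifted-out bit.

```python
P = 4  # significand bits including the leading 1 (assumed, as in the example)


def round_nearest_even(value: int, drop: int) -> int:
    """Drop the low `drop` bits (drop >= 1), rounding to nearest, ties to even."""
    kept, rest = value >> drop, value & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    if rest > half or (rest == half and kept & 1):
        kept += 1
    return kept


def fp_add(a: tuple, b: tuple) -> tuple:
    """Add two positive numbers given as (significand, exponent) pairs."""
    (sig_a, exp_a), (sig_b, exp_b) = a, b
    # 1. Align: put both operands on the smaller exponent without losing bits
    exp = min(exp_a, exp_b)
    sig_a <<= exp_a - exp
    sig_b <<= exp_b - exp
    # 2. Add
    total = sig_a + sig_b
    # 3. Normalize: the exponent follows the position of the leading 1
    exponent = exp - (P - 1) + (total.bit_length() - 1)
    # 4. Round back to P bits
    kept = round_nearest_even(total, total.bit_length() - P)
    # 5. Renormalize if rounding carried out of the top bit
    if kept.bit_length() > P:
        kept >>= 1
        exponent += 1
    return kept, exponent


result = fp_add((0b1101, 9), (0b1110, 12))   # 1.101 x 2^9 + 1.110 x 2^12
```

For the example above, `result` is the pair (0b1000, 13), that is, 1.000_two × 2^13.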

20
[Flowchart: Compare exponents → Shift smaller number right → Add → Normalize → Round]

21
Multiplication
• Steps
1. Add exponents (considering the bias)
2. Multiply the significands (with sign determined)
3. Normalize (and check over/underflow)
4. Round
5. Re-normalize (and re-check over/underflow)

• Examples
• 1.110_ten × 10^10 × 9.200_ten × 10^-5
• 1.000_two × 2^-1 × (-1.110_two) × 2^-2
• Inputs and outputs have four-digit significands

22
Decimal Example
1.110_ten × 10^10 × 9.200_ten × 10^-5
Exponent 10 + (-5) = 5
Multiply 1.110_ten × 9.200_ten = 10.212_ten
Normalize = 1.0212_ten × 10^6
Round = 1.021_ten × 10^6
Renormalize = 1.021_ten × 10^6 (no change)

23
Binary Example
1.000_two × 2^-1 × (-1.110_two) × 2^-2
Exponent (-1) + (-2) = (-3)
Multiply 1.000_two × (-1.110_two) = (-1.110000_two)
Normalize = (-1.110000_two) × 2^-3 (no change)
Round = (-1.110_two) × 2^-3 (no change)
Renormalize = (-1.110_two) × 2^-3 (no change)
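
The multiplication steps can be sketched in the same integer style. Assumptions: operands are normalized (sign, significand, exponent) triples with `P` = 4 significand bits, exponents are unbiased, rounding is to nearest with ties to even, and the overflow/underflow checks are omitted. The name `fp_mul` is hypothetical.

```python
P = 4  # significand bits including the leading 1 (assumed, as in the example)


def fp_mul(a: tuple, b: tuple) -> tuple:
    """Multiply two numbers given as (sign, significand, exponent) triples."""
    (sign_a, sig_a, exp_a), (sign_b, sig_b, exp_b) = a, b
    # 1. Add exponents (unbiased here, so no bias correction is needed)
    exponent = exp_a + exp_b
    # 2. Multiply the significands; the sign is the XOR of the input signs
    sign = sign_a ^ sign_b
    product = sig_a * sig_b
    # 3. Normalize: a product in [2, 4) has one extra integer bit
    if product.bit_length() == 2 * P:
        exponent += 1
    # 4. Round back to P bits (nearest, ties to even)
    drop = product.bit_length() - P
    kept, rest = product >> drop, product & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    if rest > half or (rest == half and kept & 1):
        kept += 1
    # 5. Renormalize if rounding carried out of the top bit
    if kept.bit_length() > P:
        kept >>= 1
        exponent += 1
    return sign, kept, exponent


result = fp_mul((0, 0b1000, -1), (1, 0b1110, -2))   # 1.000 x 2^-1 times -1.110 x 2^-2
```

Here `result` is (1, 0b1110, -3), that is, -1.110_two × 2^-3.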

24
Internal Format with Extra Bits
• Extra bits are needed during arithmetic operations to increase the arithmetic accuracy
• e.g., 1.101_two × 2^9 + 1.110_two × 2^12
• Without extra bits
0.001_two × 2^12 + 1.110_two × 2^12
= 1.111_two × 2^12
• With extra bits
0.001101_two × 2^12 + 1.110000_two × 2^12
= 1.111101_two × 2^12
= 10.00_two × 2^12
= 1.000_two × 2^13
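
The accuracy gain can be quantified for this example with a short sketch. Assumptions: `P` = 4 significand bits, the helper `value` is hypothetical, and the rounding here only handles the round-up case needed by the example (tie handling is omitted).

```python
from fractions import Fraction

P = 4  # significand bits including the leading 1 (assumed, as in the example)


def value(sig: int, exp: int) -> Fraction:
    """Exact value of a P-bit significand with the given exponent."""
    return sig * Fraction(2) ** (exp - (P - 1))


small, large = (0b1101, 9), (0b1110, 12)
shift = large[1] - small[1]
exact = value(*small) + value(*large)

# Without extra bits: the three bits shifted out of the smaller operand are lost.
no_extra = ((small[0] >> shift) + large[0], large[1])

# With extra bits: keep the shifted-out bits, add, then round to the nearest value.
wide = small[0] + (large[0] << shift)
kept, rest = wide >> shift, wide & ((1 << shift) - 1)
if rest > (1 << (shift - 1)):        # remainder above one half: round up
    kept += 1
with_extra = (kept >> 1, large[1] + 1) if kept.bit_length() > P else (kept, large[1])

error_no_extra = abs(value(*no_extra) - exact)
error_with_extra = abs(value(*with_extra) - exact)
```

The exact sum is 8000; the result without extra bits is 7680 and the result with extra bits is 8192, so the extra bits reduce the error from 320 to 192.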

25
IEEE 754 Internal Format
• Three extra bits
• The 3rd one represents any remaining nonzero bits to the right
• Example: 1.0001_two × 2^7 is shifted right by 5 bits during arithmetic
• Exact shifted value: 0.000010001_two × 2^12
• Internal format (S | Exp. | Significand | First, Second, Third extra bits): 0.0000101_two × 2^12
26
IEEE 754 Internal Format
• Roles/names of the three extra bits
• First: Guard
• Second: Round
• Third: Sticky

• Example: the exact value 0.000010001_two × 2^12 is held internally as 0.0000101_two × 2^12
• Internal format layout: S | Exp. | Significand | First | Second | Third
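
How the three extra bits are produced during a right shift can be sketched as follows. The function name `shift_right_grs` is hypothetical; the significand is passed as an integer, so 1.0001_two is written as 0b10001.

```python
def shift_right_grs(significand: int, shift: int) -> tuple:
    """Shift right by `shift` bits and return (kept, guard, round, sticky)."""
    extended = significand << 2            # make room for the guard and round bits
    shifted = extended >> shift
    lost = extended & ((1 << shift) - 1)   # everything that fell off to the right
    kept = shifted >> 2
    guard = (shifted >> 1) & 1             # first bit to the right of the kept part
    round_bit = shifted & 1                # second bit to the right
    sticky = 1 if lost else 0              # OR of all remaining bits to the right
    return kept, guard, round_bit, sticky


grs = shift_right_grs(0b10001, 5)   # 1.0001 x 2^7 aligned to exponent 12
```

For this input the kept part is all zeros and the extra bits are guard = 1, round = 0, sticky = 1, matching 0.0000101_two.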

27
IEEE 754 Rounding Mode
• Four modes can be chosen by programmers
• Toward 0 (also called truncation)
• Toward +∞
• Toward -∞
• To nearest, ties to even (default mode)
• If two representable values are equally near, choose the one whose last digit is even
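
The four modes have decimal counterparts in Python's `decimal` module, which gives a quick way to see how they differ. This is an analogy on decimal values, not the binary hardware behavior; the mapping of mode names to `decimal` constants below is the assumption being illustrated.

```python
from decimal import Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR, ROUND_HALF_EVEN

MODES = {
    "toward 0": ROUND_DOWN,
    "toward +inf": ROUND_CEILING,
    "toward -inf": ROUND_FLOOR,
    "nearest even": ROUND_HALF_EVEN,
}


def round_all(text: str) -> dict:
    """Round a decimal string to an integer under each of the four modes."""
    return {name: int(Decimal(text).quantize(Decimal("1"), rounding=mode))
            for name, mode in MODES.items()}


positive_tie = round_all("2.5")
negative_tie = round_all("-2.5")
odd_tie = round_all("3.5")
```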

28
Round Toward Nearest Even
• Binary examples (rounding to two fractional bits)

Input     10.00011   10.00101   10.10100   10.11100
Result    10.00      10.01      10.10      11.00

• Reduces the statistical bias of rounding errors
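
The four binary examples can be reproduced with a small sketch operating on bit strings. The function name `round_nearest_even_bits` is hypothetical, and it assumes at least one bit is dropped and at least one fractional bit is kept.

```python
def round_nearest_even_bits(bits: str, frac_digits: int) -> str:
    """Round a binary string such as '10.00101' to `frac_digits` fractional bits."""
    int_part, frac_part = bits.split(".")
    value = int(int_part + frac_part, 2)
    drop = len(frac_part) - frac_digits
    kept, rest = value >> drop, value & ((1 << drop) - 1)
    half = 1 << (drop - 1)
    # Round up when above the halfway point, or at a tie when the kept part is odd
    if rest > half or (rest == half and kept & 1):
        kept += 1
    digits = format(kept, "b").zfill(frac_digits + 1)
    return digits[:-frac_digits] + "." + digits[-frac_digits:]


results = [round_nearest_even_bits(b, 2)
           for b in ("10.00011", "10.00101", "10.10100", "10.11100")]
```

The last two inputs are exact ties: 10.10100 stays at 10.10 because its last kept bit is already even, while 10.11100 rounds up to 11.00.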

29