L1 FloatingPointNumbers Intro


Floating-Point Numbers

Slides adapted by: Sparsh Mittal


Floating Point
• Representation for non-integral numbers
  - Including very small and very large numbers
• Like scientific notation
  - $-2.34 \times 10^{56}$ (normalized)
  - $+0.002 \times 10^{-4}$ (not normalized)
  - $+987.02 \times 10^{9}$ (not normalized)
• In binary
  - $\pm 1.xxxxxxx_2 \times 2^{yyyy}$
• Types float and double in C
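
The normalized binary form can be illustrated with a short Python sketch. It is only an illustration, not part of the original slides: the helper name `normalized_binary` is hypothetical, and it relies on the standard-library function `math.frexp`, which returns a mantissa in [0.5, 1) that is rescaled here to [1, 2).

```python
import math

def normalized_binary(x: float):
    """Split a nonzero finite x into (sign, significand, exponent)
    such that x == sign * significand * 2**exponent and 1 <= significand < 2."""
    if x == 0 or math.isinf(x) or math.isnan(x):
        raise ValueError("expects a nonzero finite number")
    sign = -1 if x < 0 else 1
    m, e = math.frexp(abs(x))   # abs(x) == m * 2**e with 0.5 <= m < 1
    return sign, m * 2, e - 1   # rescale so the significand lies in [1, 2)

# -6.5 is -110.1 in binary, i.e. -1.101 (binary) x 2^2
print(normalized_binary(-6.5))  # (-1, 1.625, 2)
```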

Floating Point Standard
• Defined by IEEE Standard 754-1985
• Developed in response to divergence of representations
  - Portability issues for scientific code
• Now almost universally adopted
• Most commonly used representations
  - Half precision (16-bit)
  - Single precision (32-bit): float in C
  - Double precision (64-bit): double in C

IEEE Floating-Point Format
Field layout (most significant bit first):
  S (1 bit) | Exponent (single: 8 bits, double: 11 bits) | Fraction (single: 23 bits, double: 52 bits)

$x = (-1)^{S} \times (1 + \text{Fraction}) \times 2^{(\text{Exponent} - \text{Bias})}$

• S: sign bit (0 ⇒ non-negative, 1 ⇒ negative)
• Normalize significand: 1.0 ≤ |significand| < 2.0
  - Always has a leading pre-binary-point 1 bit, so no need to
    represent it explicitly (hidden bit)
  - Significand is the Fraction with the "1." restored
• Exponent: excess representation, stored as actual exponent + Bias
  - Ensures the stored exponent is unsigned
  - Single: Bias = 127; Double: Bias = 1023
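
As a rough illustration of the format, the following Python sketch extracts the three fields from a single-precision value and rebuilds the number from them. The helper names `decode_float32` and `value_of_normal` are hypothetical. The sketch assumes the input is a normal number (not zero, denormal, infinity, or NaN).

```python
import struct

def decode_float32(x: float):
    """Return the (S, Exponent, Fraction) bit fields of x stored as an IEEE 754 single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, stored with the bias added
    fraction = bits & 0x7FFFFF         # 23 bits, hidden leading 1 not stored
    return sign, exponent, fraction

def value_of_normal(sign: int, exponent: int, fraction: int, bias: int = 127) -> float:
    """Apply x = (-1)^S * (1 + Fraction) * 2^(Exponent - Bias) for a normal number."""
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - bias)

s, e, frac = decode_float32(-0.75)     # -0.75 = -1.1 (binary) x 2^-1
print(s, e, f"{frac:023b}")            # 1 126 10000000000000000000000
print(value_of_normal(s, e, frac))     # -0.75
```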
Formula for Bias

$\text{Bias} = 2^{(\text{NumberOfExpBits} - 1)} - 1$
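
A quick way to check the bias formula against the values quoted for single and double precision is the small sketch below. The function name `bias` is chosen for illustration only.

```python
def bias(exp_bits: int) -> int:
    """Bias = 2^(NumberOfExpBits - 1) - 1."""
    return 2 ** (exp_bits - 1) - 1

for name, exp_bits in [("half", 5), ("single", 8), ("double", 11)]:
    print(name, bias(exp_bits))   # half 15, single 127, double 1023

# Stored exponent = actual exponent + Bias, e.g. actual exponent -1 in single precision:
print(-1 + bias(8))               # 126
```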
Why Is It Called Single/Double Precision
• The precision indicates the number of decimal digits that
are correct, that is, without any kind of representation
error or approximation. In other words, it indicates how
many decimal digits one can safely use.
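• The number of decimal digits that can be safely used:
  - Single precision: $\log_{10}(2^{24})$, which is about 7 to 8 decimal
    digits (24 = 23 fraction bits + the hidden bit)
  - Double precision: $\log_{10}(2^{53})$, which is about 15 to 16
    decimal digits (53 = 52 fraction bits + the hidden bit)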

Source: https://stackoverflow.com/a/42444685/984260
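
One way to see the 24-bit and 53-bit limits in practice is to look at the first integer that cannot be stored exactly. The sketch below is an illustration only. The helper `to_float32` is hypothetical and uses the standard `struct` module to round a Python float (a double) to single precision.

```python
import struct

def to_float32(x: float) -> float:
    """Round x to the nearest IEEE 754 single and return it as a Python float."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Single precision: integers are exact up to 2^24 = 16777216
print(to_float32(16777216.0))   # 16777216.0
print(to_float32(16777217.0))   # 16777216.0 (2^24 + 1 is not representable)

# Double precision: integers are exact up to 2^53
print(float(2**53 + 1) == float(2**53))   # True
```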
Various formats and their correct digits
Precision Type   Total Bits   Sign   Exponent   Significand   Decimal Digits
Half             16           1      5          10            ~3.31
Single           32           1      8          23            ~7.22
Double           64           1      11         52            ~15.95
Quadruple        128          1      15         112           ~34.02
Octuple          256          1      19         236           ~71.34
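
The "Decimal Digits" column can be reproduced with a few lines of Python. This sketch assumes the column is computed as log10(2^(p+1)), where p is the number of stored significand bits and the extra bit is the hidden bit. That assumption matches the values in the table.

```python
import math

# (format name, stored significand bits)
FORMATS = [("Half", 10), ("Single", 23), ("Double", 52),
           ("Quadruple", 112), ("Octuple", 236)]

def decimal_digits(stored_bits: int) -> float:
    """log10(2^(stored_bits + 1)): the hidden bit adds one bit of precision."""
    return (stored_bits + 1) * math.log10(2)

for name, stored_bits in FORMATS:
    print(f"{name:10s} ~{decimal_digits(stored_bits):.2f}")
```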


Infinities and NaNs
• Exponent = 111...1, Fraction = 000...0
  - ±Infinity
• Exponent = 111...1, Fraction ≠ 000...0
  - Not-a-Number (NaN)
  - Indicates illegal or undefined result
  - For example, 0.0 / 0.0
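
These encodings can be checked by building the single-precision bit patterns by hand. The sketch below is illustrative. `float32_from_bits` is a hypothetical helper, and 0x7FC00000 is just one of many possible NaN patterns (any nonzero Fraction would do).

```python
import math
import struct

def float32_from_bits(bits: int) -> float:
    """Interpret a 32-bit integer as an IEEE 754 single."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

pos_inf = float32_from_bits(0x7F800000)  # S=0, Exponent=11111111, Fraction=0
neg_inf = float32_from_bits(0xFF800000)  # S=1, Exponent=11111111, Fraction=0
nan = float32_from_bits(0x7FC00000)      # Exponent=11111111, Fraction != 0

print(pos_inf, neg_inf)      # inf -inf
print(math.isnan(nan))       # True
```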

Special FP Numbers
This table is for FP32 numbers (E: Exponent, M: Mantissa, S: sign bit).

E      M      Value
255    0      +∞ if S = 0
255    0      -∞ if S = 1
255    ≠ 0    NaN (Not a Number)
0      0      0
0      ≠ 0    Denormal number

Examples:
  - NaN + x = NaN
  - 0/0 = NaN
  - 1/0 = ∞
  - -1/0 = -∞
  - $\sin^{-1}(5)$ = NaN
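
Some of these rules can be tried out in Python, with one caveat: Python follows IEEE 754 for arithmetic on existing infinities and NaNs, but it raises exceptions for `1.0 / 0.0` and `math.asin(5)` instead of returning ∞ or NaN. The sketch below therefore starts from `math.inf` and `math.nan`. It is an illustration under that assumption, not a statement about how C or hardware behaves.

```python
import math

inf, nan = math.inf, math.nan

print(math.isnan(nan + 1.0))   # True: NaN + x = NaN
print(math.isnan(inf - inf))   # True: undefined result gives NaN
print(math.isnan(0.0 * inf))   # True
print(nan == nan)              # False: NaN compares unequal even to itself
print(1.0 / inf)               # 0.0

# Python reports these cases as exceptions rather than IEEE special values.
try:
    1.0 / 0.0
except ZeroDivisionError:
    print("1.0 / 0.0 raises ZeroDivisionError in Python")

try:
    math.asin(5)
except ValueError:
    print("math.asin(5) raises ValueError in Python")
```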
Denormal Numbers on Number Line

[Figure: number line with regions labeled "Denormal numbers" and "Normal numbers"]

• In decimal, $7 \times 10^{5}$ is considered a normalized representation, but
  $0.7 \times 10^{6}$ is not normalized.
• Similarly, in binary, $1.1_2 \times 2^{5}$ is considered normalized, but
  $0.11_2 \times 2^{6}$ is not normalized; it is said to be "denormal".
Denormal Numbers
• Exponent = 000...0 ⇒ hidden bit is 0

$x = (-1)^{S} \times (0 + \text{Fraction}) \times 2^{-126}$ (for FP32)

• Smaller than normal numbers
  - Allow for gradual underflow, with diminishing precision

NOTE: For denormal numbers, the exponent is NOT 0 - Bias, but 1 - Bias.
For FP32 the Bias is 127, so we get 1 - 127 = -126.
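
The denormal formula can be compared against what the hardware format actually stores. In this sketch the helper names `denormal_value` and `float32_from_bits` are hypothetical, and the Fraction field is passed as a 23-bit integer.

```python
import struct

def denormal_value(sign: int, fraction: int) -> float:
    """x = (-1)^S * (0 + Fraction) * 2^-126, with Fraction = fraction / 2^23."""
    return (-1) ** sign * (fraction / 2**23) * 2.0 ** -126

def float32_from_bits(bits: int) -> float:
    """Interpret a 32-bit integer as an IEEE 754 single."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

bits = 0x00000001   # Exponent field = 0, Fraction = 000...01
print(denormal_value(0, bits & 0x7FFFFF) == float32_from_bits(bits))  # True
print(denormal_value(0, 1) == 2.0 ** -149)                            # True
```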

Denormal Numbers

• Smallest positive normal number: $2^{-126}$
• Largest denormal number:
  $0.11\ldots11_2 \times 2^{-126} = (1 - 2^{-23}) \times 2^{-126} = 2^{-126} - 2^{-149}$

Summary representation

Note: We must first check whether a number belongs to one of the special cases (0, infinity, NaN, denormal).
If it does not, it is treated as a normal number.
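
The order of checks in the note can be written as a small classifier over the FP32 bit fields. This is a minimal sketch. The function name `classify_float32` and the returned labels are chosen for illustration.

```python
def classify_float32(bits: int) -> str:
    """Classify a 32-bit pattern by checking the special cases first."""
    exponent = (bits >> 23) & 0xFF   # E field
    fraction = bits & 0x7FFFFF       # M field
    if exponent == 0xFF:             # E = 255: infinity or NaN
        return "infinity" if fraction == 0 else "NaN"
    if exponent == 0:                # E = 0: zero or denormal
        return "zero" if fraction == 0 else "denormal"
    return "normal"                  # everything else

for pattern in (0x7F800000, 0x7FC00000, 0x80000000, 0x00000001, 0x3F800000):
    print(f"{pattern:#010x}", classify_float32(pattern))
```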
Fixed-point format
• Fixed-point (FxP) reserves a specific number of bits (or digits) for the
  integer and fractional parts, regardless of how large or small the
  number is. For example:
  - With the IIIII.FFFFF format, we can represent numbers in the range
    [00000.00000, 11111.11111] (binary system)
• Floating-point (FP) does not reserve bits for the integer and fractional
  parts. Instead, it reserves certain bits for the significand
  and the exponent
• An integer (Int) is similar to FxP, except that Int has no fractional part
• Sometimes, Int and FxP are used synonymously
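
A fixed-point value in the IIIII.FFFFF format can be modeled as a plain integer that counts steps of 2^-5. The sketch below assumes unsigned values and round-to-nearest on conversion. The names `FRAC_BITS`, `to_fixed`, and `from_fixed` are illustrative.

```python
FRAC_BITS = 5   # the FFFFF part of IIIII.FFFFF

def to_fixed(x: float) -> int:
    """Store x as an integer number of 2**-FRAC_BITS steps."""
    return round(x * (1 << FRAC_BITS))

def from_fixed(raw: int) -> float:
    """Recover the real value from the stored integer."""
    return raw / (1 << FRAC_BITS)

raw = to_fixed(5.40625)             # 5.40625 = 00101.01101 in binary
print(f"{raw:010b}")                # 0010101101
print(from_fixed(raw))              # 5.40625
print(from_fixed(0b1111111111))     # 31.96875, the largest value 11111.11111
```

Because the binary point never moves, the step between neighboring values is always 2^-5 = 0.03125, whatever the magnitude of the number.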
Further Study
• https://blog.demofox.org/2017/11/21/floating-point-precision/
• https://www.h-schmidt.net/FloatConverter/IEEE754.html
• https://stackoverflow.com/questions/4220417/print-binary-representation-of-a-float-number-in-c
• https://moocaholic.medium.com/fp64-fp32-fp16-bfloat16-tf32-and-other-members-of-the-zoo-a1ca7897d407
• https://www.ibm.com/support/pages/single-precision-floating-point-accuracy
• http://www.mimirgames.com/articles/programming/digits-of-pi-needed-for-floating-point-numbers/
