L1 FloatingPointNumbers Intro
Floating Point Standard
• Defined by IEEE Standard 754-1985
• Developed in response to diverging representations,
which caused portability issues for scientific code
• Now almost universally adopted
• Most common representations:
Half precision (16-bit)
Single precision (32-bit): float in C
Double precision (64-bit): double in C
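IEEE Floating-Point Format
Field layout: S | Exponent | Fraction
S: 1 bit
Exponent: 8 bits (single), 11 bits (double)
Fraction: 23 bits (single), 52 bits (double)
$x = (-1)^{S} \times (1 + \text{Fraction}) \times 2^{\text{Exponent} - \text{Bias}}$
$\text{Bias} = 2^{\text{NumberOfExpBits} - 1} - 1$ (127 for single, 1023 for double)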
Why Is It Called Single/Double Precision?
• The precision indicates the number of decimal digits that
are correct, that is, free of representation error or
approximation. In other words, it indicates how many
decimal digits one can safely use.
• The number of decimal digits that can be safely used
follows from the significand width (stored fraction bits
plus one implicit leading bit):
• Single precision: log10(2^24) ≈ 7.22, which is about 7~8
decimal digits
• Double precision: log10(2^53) ≈ 15.95, which is about 15~16
decimal digits
https://stackoverflow.com/a/42444685/984260
Various formats and their correct digits
Precision Type   Total Bits   Sign   Exponent   Significand   Decimal Digits
Half             16           1      5          10            ~3.31
Single           32           1      8          23            ~7.22
Double           64           1      11         52            ~15.95
Special FP Numbers (this table is for FP32 numbers)
E = Exponent, M = Mantissa
E     M      Value
255   0      +∞ if S = 0
255   0      -∞ if S = 1
255   ≠ 0    NaN (Not a Number)
0     0      0
0     ≠ 0    Denormal number
Denormal Numbers on Number Line
[Figure: number line marking the denormal numbers and the normal numbers]
Denormal Numbers
Summary representation
Note: First check whether a number belongs to one of the special cases (0, infinity, NaN, denormal).
If it does not, it is treated as a normal number.
Fixed-point format
• Fixed-point (FxP) reserves a specific number of bits (or digits)
for the integer and fractional parts, regardless of how large or
small the number is. For example:
With the IIIII.FFFFF format, we can represent numbers in the range
[00000.00000, 11111.11111] (binary), i.e. 0 to 31.96875 in decimal
• Floating-point (FP) does not reserve bits for the integer and
fractional parts. Instead, it reserves bits for the significand
and exponent
• Int is similar to FxP, except that Int has no fractional part.
• Sometimes, Int and FxP are used synonymously
Further Study
• https://blog.demofox.org/2017/11/21/floating-point-precision/
• https://www.h-schmidt.net/FloatConverter/IEEE754.html
• https://stackoverflow.com/questions/4220417/print-binary-representation-of-a-float-number-in-c
• https://moocaholic.medium.com/fp64-fp32-fp16-bfloat16-tf32-and-other-members-of-the-zoo-a1ca7897d407
• https://www.ibm.com/support/pages/single-precision-floating-point-accuracy
• http://www.mimirgames.com/articles/programming/digits-of-pi-needed-for-floating-point-numbers/