Floating Point Numbers: CS101 Introduction To Computing
Outline
• Need for floating point numbers
• Number representation: IEEE 754
• Floating point range
• Floating point density
– Accuracy
• Arithmetic and logical operations on FP
• Conversions and type casting in C
Need to go beyond integers
• integer: 7
• rational: 5/8
• real: √3
• complex: 2 ‐ 3i
(integers ⊂ rationals ⊂ reals ⊂ complex numbers)
Extremely large and small values:
distance Pluto ‐ Sun = $5.9 \times 10^{12}$ m
mass of electron = $9.1 \times 10^{-28}$ gm
Representing fractions
• Integer pairs (for rational numbers): 5 8 = 5/8
• Strings with explicit decimal point: ‐247.09
• Implicit point at a fixed position: 010011010110001011
• Floating point (implicit point): fraction x base^power
Numbers with binary point
$101.11_2 = 1 \times 2^{2} + 0 \times 2^{1} + 1 \times 2^{0} + 1 \times 2^{-1} + 1 \times 2^{-2}$
$= 4 + 0 + 1 + 0.5 + 0.25 = 5.75_{10}$
0.6 = 0.10011001100110011001.....
.6 x 2 = 1 + .2
.2 x 2 = 0 + .4
.4 x 2 = 0 + .8
.8 x 2 = 1 + .6
Numeric Data Type
• char, short, int, long int
– char : 8 bit number (1 byte = 1B)
– short : 16 bit number (2B)
– int : 32 bit number (4B)
– long int : 64 bit number (8B)
• float, double, long double
– float : 32 bit number (4B)
– double : 64 bit number (8B)
– long double : 128 bit number (16B)
• These sizes are typical of a 64‐bit Linux system; the C standard guarantees only minimum ranges, so they can differ on other platforms
Numeric Data Type
[Figure: bit layouts of unsigned char, char, unsigned short, short, unsigned int and int]
Numeric Data Type
• char, short, int, long int
– We have signed and unsigned versions
– char (8 bit)
• char : ‐128 to 127 (two's complement has a single zero, so there is one more negative value than positive; separate +0 and ‐0 appear only in floating point) ☺
• unsigned char : 0 to 255
– int : $-2^{31}$ to $2^{31}-1$
– unsigned int : 0 to $2^{32}-1$
• float, double, long double
– For fractional, real number data
– All these numbers are signed and are stored in a different format
Numeric Data Type
[Figure: float is laid out as sign bit, exponent, mantissa; double as sign bit, exponent, mantissa‐1, mantissa‐2]
FP numbers with base = 10
$(-1)^{S} \times F \times 10^{E}$
S = Sign
F = Fraction (fixed point number), usually called Mantissa or Significand
E = Exponent (positive or negative integer)
Example: $5.9 \times 10^{12}$, $-2.6 \times 10^{3}$, $9.1 \times 10^{-28}$
Only one non‐zero digit to the left of the point
FP numbers with base = 2
$(-1)^{S} \times F \times 2^{E}$
S = Sign
F = Fraction (fixed point number), usually called Mantissa or Significand
E = Exponent (positive or negative integer)
• How to divide a word into S, F and E?
• How to represent S, F and E?
Example: $1.0101 \times 2^{12}$, ‐1.11012x103, $1.101 \times 2^{-18}$
Only one non‐zero digit to the left of the point: in binary it is always 1, so there is no need to store it
IEEE 754 standard
Single precision numbers
Field:  S   E           F
Bits:   1   8           23
Value:  0   1011 0101   1101 0110 1011 0001 0110 110
Double precision numbers
Field:  S   E               F
Bits:   1   11              20+32
Value:  0   1011 0101 111   1101 0110 1011 0001 0110 1011 0001 0110 1100 1011 0101 1101 0110
Representing F in IEEE 754
Single precision numbers: the significand is 1.F, where F is the 23 stored bits
Double precision numbers: the significand is 1.F, where F is the 20+32 = 52 stored bits
Only one non‐zero digit to the left of the point: in binary it is always 1, so there is no need to store this bit
Value Range for F
Single precision numbers
$1 \le F \le 2 - 2^{-23}$, or $1 \le F < 2$
Double precision numbers
$1 \le F \le 2 - 2^{-52}$, or $1 \le F < 2$
Floating point values
• $E = E' - 127$, $V = (-1)^{S} \times 1.M \times 2^{E'-127}$
• $V = -1.1_2 \times 2^{126-127} = -1.1_2 \times 2^{-1} = -0.11_2 \times 2^{0} = -0.11_2 = -11_2/2^{2} = -3/4 = -0.75_{10}$
Single precision numbers
Field:  S   E'          F
Bits:   1   8           23
Value:  1   0111 1110   1000 0000 0000 0000 0000 000
Value Range for E
Single precision numbers
$-126 \le E \le 127$
(all 0’s and all 1’s have special meanings)
Double precision numbers
$-1022 \le E \le 1023$
(all 0’s and all 1’s have special meanings)
Floating point demo applet on the web
• https://www.h-schmidt.net/FloatConverter/IEEE754.html
• Google “Float applet” to get the above link
Overflow and underflow
Largest positive/negative number (SP) =
$\pm(2 - 2^{-23}) \times 2^{127} \approx \pm 3.4 \times 10^{38}$
Smallest positive/negative normalized number (SP) =
$\pm 1 \times 2^{-126} \approx \pm 1.2 \times 10^{-38}$
Largest positive/negative number (DP) =
$\pm(2 - 2^{-52}) \times 2^{1023} \approx \pm 1.8 \times 10^{308}$
Smallest positive/negative normalized number (DP) =
$\pm 1 \times 2^{-1022} \approx \pm 2.2 \times 10^{-308}$
Density of int vs float
Int : 32 bit
Float : 32 bit (Exponent, Mantissa)
• Number of values that can be represented
– In both cases (float, int): $2^{32}$
• Range
– int: $-2^{31}$ to $2^{31}-1$
– float: largest $\pm(2 - 2^{-23}) \times 2^{127}$, smallest $\pm 1 \times 2^{-126}$
• About 50% of float numbers are small (magnitude less than 1)
Density of Floating Points
• 256 persons in a room of capacity 256 (range)
– 8 bit integer: 256/256 = 1
• 256 persons in a room of capacity 200000 (range)
– 1st row should be filled with 128 persons
– 50% of the numbers (those with a negative power) lie in ‐1 < N < +1
• Density of floating point numbers is
– Dense towards 0
– Sparse towards ∞
Number line: ‐∞ ... ‐2 ‐1 0 +1 +2 ... +∞
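Expressible Numbers (int and float)
Expressible integers: from $-2^{31}$ to $2^{31}-1$, with ‐ overflow below this range and + overflow above it.
Expressible floats, from left to right on the number line: ‐ overflow, expressible negative numbers, ‐ underflow, 0, + underflow, expressible positive numbers, + overflow.
The overflow boundaries are at $\pm(1 - 2^{-24}) \times 2^{128}$ and the underflow boundaries at $\pm 1 \times 2^{-126}$.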
Distribution of Values
• 6‐bit IEEE‐like format
– e = 3 exponent bits
– f = 2 fraction bits
– Bias is 3
• Notice how the distribution gets denser toward zero.
[Figure: number line from ‐15 to 15 marking the denormalized, normalized and infinity values]
Distribution of Values (close‐up view)
• 6‐bit IEEE‐like format
– e = 3 exponent bits
– f = 2 fraction bits
– Bias is 3
[Figure: close‐up of the number line from ‐1 to 1 marking the denormalized, normalized and infinity values]
Density of 32 bit float SP
• Fraction/mantissa is 23 bits
• Number of different numbers that can be stored for a particular value of the exponent
– Assume for exp=1, $2^{23} = 8 \times 1024 \times 1024 \approx 8 \times 10^{6}$
– Between 1 and 2 we can store $8 \times 10^{6}$ numbers
• Similarly
– for exp=2, between 2 and 4, $8 \times 10^{6}$ numbers can be stored
– for exp=3, between 4 and 8, $8 \times 10^{6}$ numbers can be stored
– for exp=4, between 8 and 16, $8 \times 10^{6}$ numbers can be stored
Density of 32 bit float SP
• Similarly
– for exp=23, between $2^{22}$ and $2^{23}$, $8 \times 10^{6}$ numbers can be stored
– for exp=24, between $2^{23}$ and $2^{24}$, $8 \times 10^{6}$ numbers can be stored (OK)
– …
– for exp=127, between $2^{126}$ and $2^{127}$, $8 \times 10^{6}$ numbers can be stored (WORST)
Density of 32 bit float SP
• $2^{23} = 8 \times 1024 \times 1024 \approx 8 \times 10^{6}$
[Figure: number line marked at 0, 1, 2, 4, 8 and 16]
Numbers in float format
• Largest positive/negative number (SP) =
$\pm(2 - 2^{-23}) \times 2^{127} \approx \pm 3.4 \times 10^{38}$
Second largest number:
$\pm(2 - 2^{-22}) \times 2^{127}$
Smallest positive/negative normalized number (SP) =
$\pm 1 \times 2^{-126} \approx \pm 1.2 \times 10^{-38}$
Addition/Sub of Floating Point
$3.2 \times 10^{8} \pm 2.8 \times 10^{6}$
Step 1: Align exponents
$320 \times 10^{6} \pm 2.8 \times 10^{6}$
Step 2: Add mantissas
$322.8 \times 10^{6}$
Step 3: Normalize
$3.228 \times 10^{8}$
Floating point operations: ADD
• Add/subtract A = A1 ± A2
$[(-1)^{S_1} \times F_1 \times 2^{E_1}] \pm [(-1)^{S_2} \times F_2 \times 2^{E_2}]$
suppose $E_1 > E_2$, then we can write it as
$[(-1)^{S_1} \times F_1 \times 2^{E_1}] \pm [(-1)^{S_2} \times F_2' \times 2^{E_1}]$
where $F_2' = F_2 / 2^{E_1 - E_2}$
The result is
$(-1)^{S_1} \times (F_1 \pm F_2') \times 2^{E_1}$
It may need to be normalized
Example: $3.2 \times 10^{8} \pm 2.8 \times 10^{6} = 320 \times 10^{6} \pm 2.8 \times 10^{6} = 322.8 \times 10^{6} = 3.228 \times 10^{8}$
Testing Associativity with FP
• $X = -1.5 \times 10^{38}$, $Y = 1.5 \times 10^{38}$, $Z = 1000.0$
• $X + (Y + Z) = -1.5 \times 10^{38} + (1.5 \times 10^{38} + 1000.0)$
$= -1.5 \times 10^{38} + 1.5 \times 10^{38}$
$= 0$
• $(X + Y) + Z = (-1.5 \times 10^{38} + 1.5 \times 10^{38}) + 1000.0$
$= 0.0 + 1000.0$
$= 1000$
Multiply Floating Point
$3.2 \times 10^{8} \times 5.8 \times 10^{6}$
Step 1: Multiply mantissas
$3.2 \times 5.8 \times 10^{8} \times 10^{6}$
Step 2: Add exponents
$18.56 \times 10^{14}$
Step 3: Normalize
$1.856 \times 10^{15}$
Floating point operations
• Multiply
$[(-1)^{S_1} \times F_1 \times 2^{E_1}] \times [(-1)^{S_2} \times F_2 \times 2^{E_2}]$
$= (-1)^{S_1 \oplus S_2} \times (F_1 \times F_2) \times 2^{E_1 + E_2}$
Since $1 \le (F_1 \times F_2) < 4$,
the result may need to be normalized
Example: $3.2 \times 10^{8} \times 5.8 \times 10^{6} = 3.2 \times 5.8 \times 10^{8} \times 10^{6} = 18.56 \times 10^{14} = 1.856 \times 10^{15}$
Floating point operations
• Divide
$[(-1)^{S_1} \times F_1 \times 2^{E_1}] \div [(-1)^{S_2} \times F_2 \times 2^{E_2}]$
$= (-1)^{S_1 \oplus S_2} \times (F_1 \div F_2) \times 2^{E_1 - E_2}$
Since $0.5 < (F_1 \div F_2) < 2$,
the result may need to be normalized
(assume $F_2 \ne 0$)
Float and double
• Float : single precision floating point
• Double : double precision floating point
• Floating point operations are slower
– But not on newer PCs ☺
• Double operations are even slower
Trade‐off: precision/accuracy in calculation vs. speed of calculation
Floating point Comparison
• Three phases
• Phase I: Compare signs (give result)
• Phase II: If the signs of both numbers are the same
– Compare exponents and give result
– 90% of cases fall in this category
– Faster compared to integer comparison: requires only an 8 bit comparison for float and an 11 bit comparison for double (example: sorting of float numbers)
• Phase III: If both signs and exponents are the same
– compare fraction/mantissa
Storing and Printing Floating Point
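#include <math.h>
#include <stdio.h>
int main(void) {
    float x=145.0,y;
    y=sqrt(sqrt((x)));       // each assignment rounds the result to float
    x=(y*y)*(y*y);           // each multiplication rounds again
    printf("\nx=%f",x);      // slide output: x=145.000015
}
Many round‐offs cause loss of accuracy.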
A big number with a small fraction cannot be combined in a float:
a=34359244.000000 (floats near 3.4 x 10^7 are spaced 4 apart, so the fraction is lost)
b=3.5366233
c=0.000002123630
Storing and Printing Floating Point
// 15 significant digits to store
float a=34359243.5366233;
// 8 significant digits to store
float b=3.5366233;
// 6 significant digits to store
float c=0.00000212363;
Thumb rule: about 7 significant decimal digits of a number can be stored in a 32 bit float
Thanks