
Floating Point Numbers: CS101 Introduction To Computing

Floating point numbers use a binary representation scheme called IEEE 754 that represents numbers as a sign bit, exponent field, and mantissa to support a wide range of values much larger and smaller than can be represented with integers. Floating point numbers have limited precision due to their fixed field sizes, resulting in a non-uniform density of representable values across their range. Arithmetic and logical operations on floating point numbers in programming languages like C may require special handling of conversions and type casting between integer and floating point types.

Uploaded by

Mihir Kumar Mech

CS101 Introduction to computing 

Floating Point Numbers

A. Sahu and S. V. Rao

Dept of Comp. Sc. & Engg.
Indian Institute of Technology Guwahati

Outline
• Need for floating point numbers
• Number representation: IEEE 754
• Floating point range
• Floating point density
– Accuracy
• Arithmetic and logical operations on FP
• Conversions and type casting in C

Need to go beyond integers
• integer    7
• rational   5/8
• real       √3
• complex    2 - 3i
[Figure: nested sets, integer inside rational inside real inside complex]

Extremely large and small values:
distance Pluto - Sun = $5.9 \times 10^{12}$ m
mass of electron = $9.1 \times 10^{-28}$ gm

Representing fractions
• Integer pairs (for rational numbers)
  5 8 = 5/8
• Strings with explicit decimal point
  - 2 4 7 . 0 9
• Implicit point at a fixed position
  0 1 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1
• Floating point (implicit point)
  $\text{fraction} \times \text{base}^{\text{power}}$

Numbers with binary point
$101.11_2 = 1 \times 2^{2} + 0 \times 2^{1} + 1 \times 2^{0} + 1 \times 2^{-1} + 1 \times 2^{-2}$
$= 4 + 0 + 1 + 0.5 + 0.25 = 5.75_{10}$
$0.6_{10} = 0.10011001100110011001\ldots_2$
.6 x 2 = 1 + .2
.2 x 2 = 0 + .4
.4 x 2 = 0 + .8
.8 x 2 = 1 + .6
The integer parts, read top to bottom, give the bits 1001; the fraction is back at .6, so the pattern repeats.

Numeric Data Type
• char, short, int, long int
– char : 8 bit number (1 byte = 1B)
– short : 16 bit number (2B)
– int : 32 bit number (4B)
– long int : 64 bit number (8B)
• float, double, long double
– float : 32 bit number (4B)
– double : 64 bit number (8B)
– long double : 128 bit number (16B)
(These sizes are typical of a 64-bit Linux system; they can differ on other platforms.)

Numeric Data Type
[Figure: relative storage sizes of unsigned char, char, unsigned short, short, unsigned int and int]

Numeric Data Type
• char, short, int, long int
– We have signed and unsigned versions
– char (8 bit)
• char : -128 to 127 (two's complement has a single 0, so there is no separate +0 and -0)
• unsigned char : 0 to 255
– int : $-2^{31}$ to $2^{31}-1$
– unsigned int : 0 to $2^{32}-1$
• float, double, long double
– For fractional, real number data
– All these types are signed and are stored in a different format

Numeric Data Type
float  : Sign bit | Exponent | Mantissa
double : Sign bit | Exponent | Mantissa-1
                               Mantissa-2

FP numbers with base = 10
$(-1)^{S} \times F \times 10^{E}$
S = Sign
F = Fraction (fixed point number), usually called Mantissa or Significand
E = Exponent (positive or negative integer)
Example: $5.9 \times 10^{12}$, $-2.6 \times 10^{3}$, $9.1 \times 10^{-28}$
Only one non-zero digit to the left of the point

FP numbers with base = 2
$(-1)^{S} \times F \times 2^{E}$
S = Sign
F = Fraction (fixed point number), usually called Mantissa or Significand
E = Exponent (positive or negative integer)
• How to divide a word into S, F and E?
• How to represent S, F and E?
Example: $1.0101 \times 2^{12}$, $-1.1101 \times 2^{3}$, $1.101 \times 2^{-18}$
Only one non-zero digit to the left of the point: in binary it is always 1, so there is no need to store it.

IEEE 754 standard
Single precision numbers (1 + 8 + 23 bits)
S (1 bit)   E (8 bits)      F (23 bits)
0           1011 0101       1101 0110 1011 0001 0110 110
Double precision numbers (1 + 11 + 52 bits, with F split as 20 + 32 bits over two words)
S (1 bit)   E (11 bits)     F (20 + 32 bits)
0           1011 0101 111   1101 0110 1011 0001 0110
                            1011 0001 0110 1100 1011 0101 1101 0110

Representing F in IEEE 754
Single precision numbers (F is 23 bits)
1.1101 0110 1011 0001 0110 110
Double precision numbers (F is 20 + 32 bits)
1.1101 0110 1011 0001 0110 1011 0001 0110 1100 1011 0101 1101 0110
Only one non-zero digit to the left of the point: in binary it is always 1, so there is no need to store this bit.

Value Range for F
Single precision numbers
$1 \le F \le 2 - 2^{-23}$, or $1 \le F < 2$
Double precision numbers
$1 \le F \le 2 - 2^{-52}$, or $1 \le F < 2$
These are "normalized".

Representing E in IEEE 754
Single precision numbers
E is stored in 8 bits with bias 127, e.g. 10110101
Double precision numbers
E is stored in 11 bits with bias 1023, e.g. 10110101110

Floating point values
• $E = E' - 127$, $V = (-1)^{S} \times 1.M \times 2^{E'-127}$
• $V = 1.1101\ldots \times 2^{(40-127)} = 1.1101\ldots \times 2^{-87}$
Single precision numbers
S (1 bit)   E' (8 bits)   F (23 bits)
0           0010 1000     1101 0110 1011 0001 0110 110

Floating point values
• $E = E' - 127$, $V = (-1)^{S} \times 1.M \times 2^{E'-127}$
• $V = -1.1_2 \times 2^{(126-127)} = -1.1_2 \times 2^{-1} = -0.11_2 \times 2^{0}$
$= -0.11_2 = -(11_2 / 2^{2}) = -3/4 = -0.75_{10}$
Single precision numbers
S (1 bit)   E' (8 bits)   F (23 bits)
1           0111 1110     1000 0000 0000 0000 0000 000

Value Range for E
Single precision numbers
$-126 \le E \le 127$
(all 0's and all 1's have special meanings)
Double precision numbers
$-1022 \le E \le 1023$
(all 0's and all 1's have special meanings)

Floating point demo applet on the web
• https://www.h-schmidt.net/FloatConverter/IEEE754.html
• Google "Float applet" to get the above link

Overflow and underflow
Largest positive/negative number (SP) =
$\pm(2 - 2^{-23}) \times 2^{127} \approx \pm 3.4 \times 10^{38}$
Smallest positive/negative normalized number (SP) =
$\pm 1 \times 2^{-126} \approx \pm 1.2 \times 10^{-38}$

Largest positive/negative number (DP) =
$\pm(2 - 2^{-52}) \times 2^{1023} \approx \pm 1.8 \times 10^{308}$
Smallest positive/negative normalized number (DP) =
$\pm 1 \times 2^{-1022} \approx \pm 2.2 \times 10^{-308}$

Density of int vs float
Int : 32 bit
Float : 32 bit (sign bit, exponent, mantissa)
• Number of values that can be represented
– In both cases (float, int) : $2^{32}$
• Range
– int : $-2^{31}$ to $2^{31}-1$
– float : largest $\pm(2 - 2^{-23}) \times 2^{127}$, smallest $\pm 1 \times 2^{-126}$
• 50% of float numbers are small (magnitude less than 1)

Density of Floating Points
• 256 persons in a room of capacity 256 (range)
– 8 bit integer : 256/256 = 1
• 256 persons in a room of capacity 200000 (range)
– 1st row should be filled with 128 persons
– 50% of the numbers (those with negative power) lie in -1 < N < +1
• Density of floating point numbers is
– Dense towards 0
– Sparse towards ∞
-∞ ... -2  -1  0  +1  +2 ... +∞

Expressible Numbers (int and float)
Expressible integers
- overflow | $-2^{31}$ ... 0 ... $2^{31}-1$ | + overflow
Expressible float
- overflow | $-(1-2^{-24}) \times 2^{128}$ ... $-0.5 \times 2^{-125}$ | - underflow | 0 | + underflow | $0.5 \times 2^{-125}$ ... $(1-2^{-24}) \times 2^{128}$ | + overflow

Distribution of Values
• 6-bit IEEE-like format
– e = 3 exponent bits
– f = 2 fraction bits
– Bias is 3
• Notice how the distribution gets denser toward zero.
[Figure: representable values on a number line from -15 to 15, marked as Denormalized, Normalized and Infinity]

Distribution of Values (close-up view)
• 6-bit IEEE-like format
– e = 3 exponent bits
– f = 2 fraction bits
– Bias is 3
[Figure: representable values on a number line from -1 to 1, marked as Denormalized, Normalized and Infinity]

Density of 32 bit float SP
• Fraction/mantissa is 23 bits
• Number of different numbers that can be stored for a particular value of the exponent
– Assume exp=1: $2^{23} = 8 \times 1024 \times 1024 \approx 8 \times 10^{6}$
– Between 1 and 2 we can store $8 \times 10^{6}$ numbers
• Similarly
– for exp=2, between 2 and 4, $8 \times 10^{6}$ numbers can be stored
– for exp=3, between 4 and 8, $8 \times 10^{6}$ numbers can be stored
– for exp=4, between 8 and 16, $8 \times 10^{6}$ numbers can be stored

Density of 32 bit float SP
• Similarly
– for exp=23, between $2^{22}$ and $2^{23}$, $8 \times 10^{6}$ numbers can be stored
– for exp=24, between $2^{23}$ and $2^{24}$, $8 \times 10^{6}$ numbers can be stored : OK
– for exp=25, between $2^{24}$ and $2^{25}$, $8 \times 10^{6}$ numbers can be stored
• $2^{25} - 2^{24} > 8 \times 10^{6}$ : BAD (the interval holds more integers than representable values, so some integers are skipped)
– …
– for exp=127, between $2^{126}$ and $2^{127}$, $8 \times 10^{6}$ numbers can be stored : WORST
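
Density of 32 bit float SP
• $2^{23} = 8 \times 1024 \times 1024 \approx 8 \times 10^{6}$
[Figure: number line marked 0, 1, 2, 4, 8, 16; each interval between consecutive powers of two holds the same number of values]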

Numbers in float format
• Largest positive/negative number (SP) =
$\pm(2 - 2^{-23}) \times 2^{127} \approx \pm 3.4 \times 10^{38}$
Second largest number:
$\pm(2 - 2^{-22}) \times 2^{127}$
Difference: largest FP - 2nd largest FP
$= (2^{-22} - 2^{-23}) \times 2^{127} = 2^{-23} \times 2^{127} = 2^{104} \approx 2 \times 10^{31}$
Smallest positive/negative normalized number (SP) =
$\pm 1 \times 2^{-126} \approx \pm 1.2 \times 10^{-38}$

Addition/Sub of Floating Point
$3.2 \times 10^{8} \pm 2.8 \times 10^{6}$
Step 1: Align exponents
$320 \times 10^{6} \pm 2.8 \times 10^{6}$
Step 2: Add mantissas
$322.8 \times 10^{6}$
Step 3: Normalize
$3.228 \times 10^{8}$

Floating point operations: ADD
• Add/subtract  $A = A_1 \pm A_2$
$[(-1)^{S1} \times F1 \times 2^{E1}] \pm [(-1)^{S2} \times F2 \times 2^{E2}]$
Suppose E1 > E2; then we can write it as
$[(-1)^{S1} \times F1 \times 2^{E1}] \pm [(-1)^{S2} \times F2' \times 2^{E1}]$
where $F2' = F2 / 2^{E1-E2}$
The result is
$(-1)^{S1} \times (F1 \pm F2') \times 2^{E1}$
It may need to be normalized.
Decimal example: $3.2 \times 10^{8} \pm 2.8 \times 10^{6}$ → $320 \times 10^{6} \pm 2.8 \times 10^{6}$ → $322.8 \times 10^{6}$ → $3.228 \times 10^{8}$

Testing Associativity with FP
• $X = -1.5 \times 10^{38}$, $Y = 1.5 \times 10^{38}$, $Z = 1000.0$
• $X + (Y + Z) = -1.5 \times 10^{38} + (1.5 \times 10^{38} + 1000.0)$
$= -1.5 \times 10^{38} + 1.5 \times 10^{38}$
$= 0$
• $(X + Y) + Z = (-1.5 \times 10^{38} + 1.5 \times 10^{38}) + 1000.0$
$= 0.0 + 1000.0$
$= 1000$

Multiply Floating Point
$3.2 \times 10^{8} \times 5.8 \times 10^{6}$
Step 1: Multiply mantissas
$(3.2 \times 5.8) \times 10^{8} \times 10^{6}$
Step 2: Add exponents
$18.56 \times 10^{14}$
Step 3: Normalize
$1.856 \times 10^{15}$

Floating point operations
• Multiply
$[(-1)^{S1} \times F1 \times 2^{E1}] \times [(-1)^{S2} \times F2 \times 2^{E2}]$
$= (-1)^{S1 \oplus S2} \times (F1 \times F2) \times 2^{E1+E2}$
Since $1 \le (F1 \times F2) < 4$, the result may need to be normalized.
Decimal example: $3.2 \times 10^{8} \times 5.8 \times 10^{6}$ → $(3.2 \times 5.8) \times 10^{8} \times 10^{6}$ → $18.56 \times 10^{14}$ → $1.856 \times 10^{15}$

Floating point operations
• Divide
$[(-1)^{S1} \times F1 \times 2^{E1}] \div [(-1)^{S2} \times F2 \times 2^{E2}]$
$= (-1)^{S1 \oplus S2} \times (F1 \div F2) \times 2^{E1-E2}$
Since $0.5 < (F1 \div F2) < 2$, the result may need to be normalized.
(assume F2 ≠ 0)

Float and double
• Float : single precision floating point
• Double : double precision floating point
• Floating point operations are slower
– But not on newer PCs ☺
• Double operations are even slower
[Figure: Integer, Float and Double placed along two axes, precision/accuracy in calculation and speed of calculation]

Floating point Comparison
• Three phases
• Phase I: Compare signs (give result if they differ)
• Phase II: If the signs of both numbers are the same
– Compare exponents and give result
– 90% of cases fall in this category
– Faster than integer comparison: requires only an 8 bit comparison for float and an 11 bit comparison for double (Example: sorting of float numbers)
• Phase III: If both signs and exponents are the same
– Compare fraction/mantissa

Storing and Printing Floating Point
float x=145.0,y;
y=sqrt(sqrt(x));
x=(y*y)*(y*y);
printf("\nx=%f",x);      /* prints x=145.000015 */
Many round-offs cause loss of accuracy.

float x=1.0/3.0;
if (x==1.0/3.0)
    printf("YES");
else
    printf("NO");        /* prints NO */
The value stored in x is not exactly the same as 1.0/3.0.

Storing and Printing Floating Point
float a=34359243.5366233;
float b=3.5366233;
float c=0.00000212363;
printf("\na=%8.6f, b=%8.6f c=%8.12f\n", a, b, c);

Output:
a=34359244.000000, b=3.536623 c=0.000002123630
A big number and a small fraction cannot be stored together: the fractional part of a is lost.

Storing and Printing Floating Point
//15 significant digits to store
float a=34359243.5366233;
//8 significant digits to store
float b=3.5366233;
//6 significant digits to store
float c=0.00000212363;

Thumb rule: about 7 significant decimal digits of a number can be stored in a 32 bit float (24 bits of precision ≈ 7.2 decimal digits)
Thanks
