4.16. Floating Point


4.16.1. The Basics

Floating point gets around the limitations of fixed point by using a format similar to scientific
notation.

3.52 x 10^3 = 3520

A scientific notation number, as you probably know, consists of a mantissa (3.52 in the example
above), a radix (always 10), and an exponent (3 in the example above). Hence, the general format of
a scientific notation value is:

mantissa x radix^exponent

The normalized form always has a mantissa greater than or equal to 1.0, and less than 10.0. We can
denormalize the value and express it in many other ways, such as 35.2 x 10^2, or 0.00352 x 10^6. For
each position we shift the digits of the mantissa relative to the decimal point, we increase or
decrease the value of the mantissa by a factor of 10. To compensate for this, we simply increase or
decrease the exponent by 1. Denormalizing is necessary when adding scientific notation values:

  3.52 x 10^3
+ 1.97 x 10^5
-----------------------------------------

  0.0352 x 10^5
+ 1.97   x 10^5
-----------------------------------------
  2.0052 x 10^5

Adjusting the mantissa and exponent is also sometimes necessary to normalize results. For
example, 9.9 x 10^2 + 9.9 x 10^2 is 19.8 x 10^2, which must be normalized to 1.98 x 10^3.
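
To make the denormalize-add-normalize sequence concrete, here is a minimal Python sketch, assuming values are held as (mantissa, exponent) pairs; the function name is illustrative, not part of the text:

    def add_sci(a, b, radix=10):
        """Add two (mantissa, exponent) pairs, e.g. (3.52, 3) for 3.52 x 10^3."""
        (ma, ea), (mb, eb) = a, b
        # Denormalize the value with the smaller exponent so the exponents match.
        if ea < eb:
            ma, ea = ma / radix ** (eb - ea), eb
        elif eb < ea:
            mb, eb = mb / radix ** (ea - eb), ea
        m, e = ma + mb, ea
        # Normalize the result so that 1.0 <= |mantissa| < radix.
        while abs(m) >= radix:
            m, e = m / radix, e + 1
        while 0 < abs(m) < 1.0:
            m, e = m * radix, e - 1
        return m, e

    print(add_sci((3.52, 3), (1.97, 5)))   # approximately (2.0052, 5)
    print(add_sci((9.9, 2), (9.9, 2)))     # approximately (1.98, 3)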

A binary floating system stores a signed binary mantissa and a signed binary exponent, and usually
uses a radix of 2. Using a radix of 2 (or any power of 2) allows us to normalize and denormalize by
shifting the binary digits in the mantissa and adjusting the integer exponent on the radix of 2.
(Shifting binary digits in the mantissa n bits to the left or right multiplies or divides the
mantissa by 2^n.)

00010₂ x 2^3 = 01000₂ x 2^1
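
A quick sanity check of that identity using Python's integer shifts:

    mantissa, exponent = 0b00010, 3          # 00010 (binary) x 2^3 = 16
    shifted = mantissa << 2                  # shift 2 bits left: 01000 (binary)
    assert mantissa * 2 ** exponent == shifted * 2 ** (exponent - 2) == 16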


The standard floating point formats are defined by the IEEE. The IEEE formats are slightly
more complex than necessary for understanding floating point in general, so we will start with a simpler
example here.

4.16.2. A Simple Floating Point Format

Suppose a 32-bit floating point format has a 24-bit two's complement mantissa, an 8-bit two's
complement exponent, and a radix of 2. The general structure is:

mantissa x 2^exponent

Where mantissa is a 24-bit two's complement integer, and exponent is an 8-bit two's complement
integer.

The binary format is as follows:

Table 4.5. Floating Point Format

Mantissa                      Exponent
24 bits, two's complement     8 bits, two's complement

1. What is the value of the following number?


000000000000000000010010 11111100

The mantissa is 000000000000000000010010, or +(2 + 16) = +18.

The exponent is 11111100 = -(00000011 + 1) = -(00000100) = -4.

The value is therefore +18 x 2^-4 = 1.125. (A short Python sketch after these examples checks this
conversion and the one in example 6.)

2. What is the largest positive value we can represent in this system?

The largest positive value will consist of the largest positive mantissa and the largest positive
exponent.

The largest mantissa is 011111111111111111111111, which in two's complement is +2^23 - 1
(+8388607). The largest exponent is 01111111, which in two's complement is +2^7 - 1 (+127).

Hence, the largest positive value is +8388607 x 2^+127, or about 1.42 x 10^45.

3. What is the second largest positive value? What is the difference between the largest and
second largest?

4. What is the smallest positive value?


To find the smallest positive value in the form mantissa x radix^exponent, we choose the
smallest positive mantissa, and the smallest negative exponent (the negative exponent with
the greatest magnitude).

Since the mantissa is an integer, the smallest positive value possible is 1.

Since the exponent is an 8-bit two's complement value, the smallest negative exponent is
10000000₂, or -2^7 = -128.

Hence the smallest positive value is 1 x 2^-128, or 2.93873587706 x 10^-39.

5. What is the second smallest positive value? What is the difference between the smallest and
second smallest?
6. Represent -2.75 in this floating point system.

a. Convert the number to fixed point binary using the methods described in previous
sections:

-2.75₁₀ = -(10.11₂)

b. Multiply by radix^exponent equal to 1:

-2.75₁₀ = -(10.11₂) x 2^0

c. Shift the binary point to make the mantissa a whole number: -(1011₂)

By moving the binary point two places to the right, we multiply the mantissa by 2^2.
We therefore must divide radix^exponent by the same factor:

-2.75₁₀ = -(1011₂) x 2^-2

d. Convert the mantissa and exponent into the specified formats (two's complement in
this case):

Mantissa: -(000000000000000000001011) = 111111111111111111110101

Exponent: -2₁₀ = 11111110

Binary representation = 11111111111111111111010111111110

7. How many different values can this system represent?
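
The conversions in examples 1 and 6, and the largest and smallest positive values above, can be checked with a short Python sketch of this format; the helper names are illustrative, not part of the text:

    MANT_BITS, EXP_BITS = 24, 8

    def to_twos_complement(value, bits):
        """Encode a signed integer into a bits-wide two's complement field."""
        return value & ((1 << bits) - 1)

    def from_twos_complement(field, bits):
        """Decode a bits-wide two's complement field into a signed integer."""
        if field & (1 << (bits - 1)):        # sign bit set, so the value is negative
            field -= 1 << bits
        return field

    def decode(pattern):
        """Value of a 32-bit pattern: 24-bit mantissa followed by 8-bit exponent."""
        mantissa = from_twos_complement(pattern >> EXP_BITS, MANT_BITS)
        exponent = from_twos_complement(pattern & ((1 << EXP_BITS) - 1), EXP_BITS)
        return mantissa * 2.0 ** exponent

    def encode(mantissa, exponent):
        """Pack an integer mantissa and exponent into a 32-bit pattern."""
        return (to_twos_complement(mantissa, MANT_BITS) << EXP_BITS) | to_twos_complement(exponent, EXP_BITS)

    # Example 1: +18 x 2^-4
    print(decode(0b000000000000000000010010_11111100))    # 1.125

    # Example 6: -2.75 = -(1011 binary) x 2^-2
    pattern = encode(-0b1011, -2)
    print(f"{pattern:032b}")                               # 11111111111111111111010111111110
    print(decode(pattern))                                 # -2.75

    # Largest and smallest positive values
    print(decode(encode(2**23 - 1, 127)))                  # about 1.42e+45
    print(decode(encode(1, -128)))                         # about 2.94e-39

Because both fields are ordinary two's complement integers, encoding and decoding amount to nothing more than bit slicing and shifting.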

4.16.3. Overflow and Underflow

Overflow occurs when the result of a floating point operation is larger than the largest positive
value, or smaller than the smallest negative value. In other words, the magnitude is too large to
represent.

Underflow occurs when the result of a floating point operation is smaller than the smallest positive
value, or larger than the largest negative value. In other words, the magnitude is too small to
represent.

The example 32-bit format above cannot represent values larger than about 10^45 or smaller than
about 10^-39.

One technique to avoid overflow and underflow is to alternate operations that increase and
decrease intermediate results. Rather than do all the multiplications first, which could cause
overflow, or all the divisions first, which could cause underflow, we could alternate multiplications
and divisions to moderate the results along the way. Techniques like these must often be used in
complex scientific calculations.
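
A minimal illustration of this interleaving, using Python's 64-bit floats and made-up numbers chosen to stress their range:

    nums   = [1e200, 1e200]    # numerator factors
    denoms = [1e150, 1e100]    # denominator factors

    # All multiplications first: the intermediate product overflows to infinity.
    print((nums[0] * nums[1]) / (denoms[0] * denoms[1]))   # inf

    # Alternate multiplications and divisions: intermediates stay in range.
    result = 1.0
    for n, d in zip(nums, denoms):
        result = result * n / d
    print(result)                                          # about 1e150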

4.16.4. Cost of Floating Point

Everything has a cost. The increased range and ability to represent non-whole numbers is no
exception.

Precision

There are only 2^32 patterns of 32 0's and 1's. Hence, there are only 2^32 unique numbers that we can
represent in 32 bits, regardless of the format.

So how is it we can represent numbers up to 10^45?

Obviously, we must be sacrificing something in between. What floating point does for us is spread
out the limited number of binary patterns we have available to cover a larger range of numbers.
The larger the exponent, the larger the gap between consecutive numbers that we can accurately
represent.

Close to 0, we can represent many numbers in a small range. Far from zero, entire ranges of
whole numbers cannot be represented.

The precision of a 32-bit floating point value is less than the precision of a 32-bit integer. By using
8 bits for the exponent, we sacrifice those 8 bits of precision. Hence, our example format has the
same precision as a 24-bit signed integer system.
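
As a quick illustration with the example format's integer mantissa, consecutive representable values at a given exponent differ by 2^exponent, so the gaps grow as the magnitude grows:

    # Gap between the two largest mantissas (2^23 - 2 and 2^23 - 1) at various exponents.
    for exponent in (0, 10, 127):
        a = 8388606 * 2.0 ** exponent
        b = 8388607 * 2.0 ** exponent
        print(exponent, b - a)        # gaps of 1, 1024, then about 1.7e38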

Performance

Arithmetic on floating point is several times slower than on integers. This is an inherent property of
the format.

Consider the process of adding two scientific notation values.


1. Equalize the exponents
2. Add the mantissas
3. Normalize the result

Each of these operations takes roughly the same amount of time in a computer as a single integer
addition. Since floating point is stored like scientific notation, we can expect floating point addition
to take about three times as long as integer addition. In reality, a typical PC takes about 2.5 times as
long to execute a floating point arithmetic instruction as it does to do the same integer instruction.
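
As a rough sketch of why this costs several integer-scale steps, here is an integer-only addition routine for the example 24/8-bit format; the name is illustrative and rounding is ignored:

    def float_add(ma, ea, mb, eb):
        """Add two values given as (24-bit mantissa, 8-bit exponent) integer pairs."""
        # 1. Equalize the exponents: shift the mantissa with the smaller exponent right.
        if ea < eb:
            ma >>= eb - ea
            ea = eb
        elif eb < ea:
            mb >>= ea - eb
            eb = ea
        # 2. Add the mantissas.
        m, e = ma + mb, ea
        # 3. Normalize: keep the mantissa within the signed 24-bit range.
        while not -(1 << 23) <= m < (1 << 23):
            m >>= 1
            e += 1
        return m, e

    print(float_add(18, -4, 18, -4))   # (36, -4), i.e. 2.25

Each step is itself one or more integer operations, which is where the extra time goes.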

Note that this applies only to operations that can be carried out using either a single integer
instruction or a single floating point instruction. For example, suppose a program is running on a
32-bit computer, and there is no way to represent the data within the range of a 32-bit integer. In
this case, multiple integer instructions will be necessary to process integer values of more than 32
bits, and the speed advantage of integers does not apply.

It is also possible in some systems that floating point and integer operations could occur at the
same time, and hence utilizing the floating point hardware could result in better performance than
performing additional integer operations while the floating point unit sits idle. This is the case with
graphics rendering that occurs using floating point on the graphics processing unit (GPU) rather
than the CPU. It would not make sense to move the rendering calculations to the CPU in order to
use integers, as this would only increase the workload for the CPU and allow the power of the GPU
to go to waste.

If hardware has floating point support built-in, then common operations like floating point
addition, subtraction, etc. can each be handled by a single instruction. If hardware doesn't have a
floating point unit (common in embedded processors), floating point operations must be handled
by software routines. Hence, adding two floating point values will require dozens of instructions to
complete instead of just one. These will be hundreds of times slower than integers, and will
consume a big chunk of available program memory.

Most algorithms can be implemented using integers with a little thought. Use of floating point is
often the result of sheer laziness. Don't use floating point just because it's intuitive.

Power Consumption

CPUs achieve their maximum power consumption when doing intensive
floating point calculations. This is not usually noticeable on a desktop PC, but can become a
problem on large grids consisting of hundreds of PCs, since the power grid they are attached to
may not be designed to provide for their maximum draw. It can also be a problem when running a
laptop on battery while doing intensive computations. Battery life while doing intensive floating
point computations could be a small fraction of what it is while reading email, browsing the web,
or editing a document in OpenOffice.
