
Numerical Methods

Representing Numbers
Accuracy and Precision
• Accuracy – how closely a measured value agrees with the truth
• Precision – how closely measured values agree with each other

[Figure: target diagrams illustrating increasing accuracy and increasing precision]

• Inaccurate measurements are due to some bias
• Imprecise values are caused by some uncertainty
• Inaccuracy and imprecision are measured using an error term
Errors – Measurement
• Absolute error: E_A = |true − calculated|
• Relative error: E_R = |(true − calculated) / true|

• true = 1.5 cm, calculated = 1 cm
  • absolute = 0.5 cm
  • relative = 0.333
• true = 1,000,000.5 cm, calculated = 1,000,000 cm
  • absolute = 0.5 cm
  • relative ≈ 0.5 × 10⁻⁶
Absolute vs. Relative Error
• Generally, scientific and engineering applications are less sensitive to small errors in large values
• Relative error can be used to compare values of widely varying sizes
• Most computing systems are designed to minimize relative error

Relative Error
E_R = |(true − calculated) / true|

1) Undefined when the true value is zero, e.g. f(2.3) = 0

[Figure: plot of f(x) and f_c(x) for x = 1 to 5, with f crossing zero near x = 2.3]

• Workaround – normalize by the larger magnitude:
  E_R = |f(x) − f_c(x)| / max{|f_c(x)|, |f(x)|}
2) Relative measurements

              Celsius    Kelvin
true          10 °C      283 K
calculated    11 °C      284 K

Celsius: E_R = |(10 − 11)/10| = 0.1 = 10%
Kelvin:  E_R = |(283 − 284)/283| = 0.0035 = 0.35%
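A minimal C++ sketch of both error measures; the function names are illustrative (not from the slides), and the guarded variant divides by the larger magnitude as in the workaround above, so it stays defined when the true value is zero:

#include <cmath>
#include <algorithm>
#include <cstdio>

// Absolute error: E_A = |true - calculated|
double absolute_error(double truth, double calc) {
    return std::fabs(truth - calc);
}

// Guarded relative error: normalize by the larger magnitude so the
// measure stays defined even when the true value is zero.
double relative_error(double truth, double calc) {
    return std::fabs(truth - calc) / std::max(std::fabs(truth), std::fabs(calc));
}

int main() {
    std::printf("absolute = %g\n", absolute_error(1.5, 1.0)); // 0.5
    std::printf("relative = %g\n", relative_error(1.5, 1.0)); // 0.333...
}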
Sources of error
• Inaccuracy often results from a bias in the algorithm
  • Euler's method for explicit integration [discussed later] returns biased results based on the curvature of the solution to the differential equation
  • Creating more accurate algorithms is a cornerstone of numerical methods
• Imprecision is more fundamental – it depends on the number of bits supported by the data type being used
  • Imagine that we only have 3 digits to represent a value
  • How do we represent values that need more digits than that?
  • Our choices for improving precision are limited: increase the number of digits available
Number Bases
• Decimal – base 10
  • measure precision in "digits"
• Binary – base 2
  • measure precision in "bits"
• Hexadecimal – base 16
  • useful for memory dumps
  • matches neatly with data produced by 8-, 32-, 64-bit systems
  • will explore later
• Octal – base 8
  • historically used to look at memory dumps from machines with smaller word sizes
  • similar to how hexadecimal is used now
Binary
• When discussing numerical algorithms, it is convenient to discuss numbers using both binary and decimal notation
  • For example, discussing significant bits and significant digits can be useful for different applications
  • Understand what both of these terms mean
• The numerical base is represented by a subscript when the value is ambiguous: e.g., 1101₂ = 13₁₀
Interpreting binary numbers
• Double-dabble algorithm (a C++ sketch follows this slide)
  • initialize a register X = 0
  • for each digit in the binary number
    • if 0 – double X
    • if 1 – dabble X (double it and add one)
• Do this with a few numbers:

0000 1101 → 13     0110 1010 → 106     1111 1111 → 255

• How would you handle signed integers?
  • Introduce a sign bit as the most significant digit
• What are the following values (assuming a preceding sign bit)?

1101 1010 → −90     1011 1001 → −57     1111 1111 → −127
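A short C++ sketch of double-dabble as described above (the function name is an assumption for illustration); it also handles the sign-magnitude convention from the last bullet:

#include <iostream>
#include <string>

// Double-dabble: X starts at 0; each 0 doubles X, each 1 doubles X and adds one.
// If sign_bit is true, the first bit is read as a sign (sign-magnitude).
int double_dabble(const std::string& bits, bool sign_bit = false) {
    int x = 0;
    bool negative = false, first = true;
    for (char b : bits) {
        if (b != '0' && b != '1') continue;   // skip spaces between nibbles
        if (first && sign_bit) { negative = (b == '1'); first = false; continue; }
        first = false;
        x = (b == '1') ? 2 * x + 1 : 2 * x;   // dabble or double
    }
    return negative ? -x : x;
}

int main() {
    std::cout << double_dabble("0110 1010") << std::endl;        // 106
    std::cout << double_dabble("1101 1010", true) << std::endl;  // -90
}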
Overflow
• What is 1111 1111 + 1?
  • or 0000 0000 − 1?
• The value "wraps around" to the opposite end of the range
• One of the most common errors in programming
• What happens when we convert an 8-bit integer into a 4-bit integer by truncating the most significant bits?

0001 0011 → 0011     1100 1010 → 1010

• Keeping the n least significant bits of a number performs the operation: value mod 2ⁿ
• This is the defined behavior for C/C++ casting:

unsigned long m = ULONG_MAX; // all bits set - the largest unsigned long
unsigned int n = m;          // keeps only the least significant bits
                             // (32 of them where unsigned int is 32 bits)
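A small C++ demonstration of wraparound and truncation using an 8-bit unsigned type (the values are illustrative):

#include <iostream>

int main() {
    unsigned char x = 255;   // 1111 1111
    unsigned char y = 0;     // 0000 0000
    x = x + 1;               // wraps around to 0000 0000
    y = y - 1;               // wraps around to 1111 1111
    std::cout << (int)x << " " << (int)y << std::endl;  // prints: 0 255

    // Masking to 4 bits keeps the least significant bits (value mod 2^4):
    unsigned char t = 0xCA;              // 1100 1010
    std::cout << (t & 0x0F) << std::endl; // 1010 -> prints: 10
}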
Fixed Point and Arithmetic
• In fixed-point binary, places to the right of the radix point are fractional powers of two: 0.1₂ = 1/2, 0.01₂ = 1/4, 0.001₂ = 1/8, ...
• Evaluate: e.g., 101.011₂ = 4 + 1 + 1/4 + 1/8 = 5.375
• Demonstrate addition and multiplication on fixed-point values (a sketch follows)
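A minimal Q4.4 fixed-point sketch in C++ (the format and helper names are assumptions for illustration): the stored byte equals the real value scaled by 2⁴, so same-scale values add directly while products need a rescale:

#include <cstdio>

typedef unsigned char q44;   // 4 integer bits, 4 fractional bits

q44 to_q44(double v)      { return (q44)(v * 16.0); }   // scale by 2^4
double from_q44(q44 v)    { return v / 16.0; }

q44 add_q44(q44 a, q44 b) { return a + b; }             // same scale: add directly
q44 mul_q44(q44 a, q44 b) { return (a * b) >> 4; }      // product is scaled by 2^8, shift back

int main() {
    q44 a = to_q44(1.5);    // 0001.1000
    q44 b = to_q44(2.25);   // 0010.0100
    std::printf("%g\n", from_q44(add_q44(a, b)));  // 3.75
    std::printf("%g\n", from_q44(mul_q44(a, b)));  // 3.375
}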
Floating Point, base 10

.602 × 10²³   (mantissa = .602, base = 10, exponent = 23)

• spend a few bits on the mantissa and a few on the exponent
• effectively represent very large and very small numbers
• Any base β can be used
  • normalization constraint: 1/β ≤ mantissa < 1
    ensures that each number has a single representation
    (basically, the first digit of the mantissa should be non-zero)
Floating Point, base 2
• Standard format: x = (−1)^s × m × β^e
  • β is the base
  • s is a sign bit
  • m is the mantissa, where 1/β ≤ m < 1
  • e is the exponent

• What is the decimal spacing between values when the exponent is fixed (e.g., e = 0) and the mantissa is 3 bits?
  smallest normalized value: 0.100₂ × 2⁰ = 0.5
  next value: 0.101₂ × 2⁰ = 0.625 (spacing 0.125)
• How about when e = 1?
  smallest normalized value: 0.100₂ × 2¹ = 1.0
  next value: 0.101₂ × 2¹ = 1.25 (spacing 0.25)
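The same effect can be checked in C++ on real IEEE floats with std::nextafter, which returns the adjacent representable value; the gap doubles each time the exponent increases by one:

#include <cmath>
#include <cstdio>

int main() {
    float a = 1.0f;            // exponent ~ 0
    float b = 1048576.0f;      // 2^20
    // std::nextafter returns the adjacent representable value.
    std::printf("gap near 1.0:  %g\n", std::nextafterf(a, 2.0f) - a);      // 2^-23 ~ 1.19e-07
    std::printf("gap near 2^20: %g\n", std::nextafterf(b, 2.0f * b) - b);  // 2^-3 = 0.125
}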
Floating Point Multiplication

1) Add the sign bits (mod 2): s = s₁ ⊕ s₂
2) Add the exponents: e = e₁ + e₂
3) Multiply the mantissas: m = m₁ × m₂
4) Adjust m to normalize it (shift until 1/β ≤ m < 1, adjusting e)
5) Chop excess digits from m

• Find the result of a binary multiplication given a 3-bit mantissa
Floating Point Addition

1) Convert both operands to the same exponent (and chop)
2) Add the mantissas
3) Normalize the result

• What is the actual solution and the corresponding relative error? (A sketch of the chopping step follows.)
• Note that the actual solution could be represented exactly given a longer mantissa
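A C++ sketch of step 1's chopping with a 3-digit decimal mantissa; the operands here are illustrative, and chop() truncates to a given number of significant digits:

#include <cmath>
#include <cstdio>

// Truncate v to 'digits' significant decimal digits (chopping, not rounding).
double chop(double v, int digits) {
    if (v == 0.0) return 0.0;
    int e = (int)std::floor(std::log10(std::fabs(v))) + 1;  // position of leading digit
    double scale = std::pow(10.0, digits - e);
    return std::trunc(v * scale) / scale;
}

int main() {
    double a = 1.00;       // 0.100 x 10^1
    double b = 0.00999;    // 0.999 x 10^-2
    double sum = chop(a + b, 3);   // 1.00999 chops to 1.00
    std::printf("chopped = %g, exact = %g, relative error = %g\n",
                sum, a + b, std::fabs((a + b) - sum) / (a + b));
}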


Loss of Precision
• This subtraction effectively reduced our numerical precision by one digit
• Assuming a 3-digit mantissa, perform a sequence of such operations and compare each result against the correct value to find the relative error
Loss of Precision
• Each operation maintains the appropriate precision
• However, we end up with far less precision than we expect
  • The result has one digit of precision
  • The correct value requires two digits of precision
  • We expect three digits, we only need two, but we get one
• This is known colloquially as catastrophic cancellation
• How could we have avoided loss of precision in this case?
• We will discuss ways to avoid loss of precision in more complex cases in the next lecture
Loss of Precision Theorem
• Let x and y be normalized floating-point machine numbers such that x > y > 0, and

2⁻ᵖ ≤ 1 − y/x ≤ 2⁻q

for some positive integers p and q; then at most p and at least q significant binary digits are lost in the subtraction x − y.

• How many bits are lost when calculating x − y for given machine numbers x and y?
Lost Bits
• Reformulate the Loss of Precision Theorem by taking −log₂ of each side:

q ≤ −log₂(1 − y/x) ≤ p

• Lose between ⌊−log₂(1 − y/x)⌋ and ⌈−log₂(1 − y/x)⌉ bits
Lost Bits
• How many bits are lost in x − y for a given pair of nearly equal machine numbers x and y?

Compute 1 − y/x and bound it between powers of two: between ⌊−log₂(1 − y/x)⌋ and ⌈−log₂(1 − y/x)⌉ bits of precision have been lost.
Lost Bits – Example 2
• How many bits are lost in the subtraction x − y for a second, closer pair of values?

The same bound applies: between ⌊−log₂(1 − y/x)⌋ and ⌈−log₂(1 − y/x)⌉ bits of precision have been lost (a sketch follows).

We will discuss what to do about this in the next lecture.
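A small C++ helper that applies the theorem's bound (the function name and test values are illustrative); it brackets −log₂(1 − y/x) with integers:

#include <cmath>
#include <cstdio>

// Bound the bits lost in x - y: if 2^-p <= 1 - y/x <= 2^-q,
// between q and p significant bits are lost.
void lost_bits(double x, double y) {
    double r = 1.0 - y / x;
    int q = (int)std::floor(-std::log2(r));   // at least q bits lost
    int p = (int)std::ceil(-std::log2(r));    // at most p bits lost
    std::printf("1 - y/x = %g: between %d and %d bits lost\n", r, q, p);
}

int main() {
    lost_bits(0.5, 0.499);   // nearly equal: between 8 and 9 bits lost
    lost_bits(0.5, 0.25);    // well separated: 1 bit
}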


Representation Quirks
• Precision scales between decades – spacing between values increases with increasing exponent
• Machine epsilon (unit epsilon) ε (a sketch for computing it follows this slide)
  • smallest number such that 1 + ε > 1
  • spacing between a value x and an adjacent number is given by: approximately ε·|x|
• Functions with steep slope skip values
  • Imagine f has a slope much greater than 1 at all points. Several values of y will have no corresponding x with f(x) = y
  • How is this a problem for finding roots or optimization?
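A classic C++ loop for estimating machine epsilon under the "1 + ε > 1" definition (results can differ on hardware that evaluates in extended-precision registers):

#include <cstdio>

int main() {
    // Halve eps until adding it to 1 no longer changes the result.
    double eps = 1.0;
    while (1.0 + eps / 2.0 > 1.0)
        eps /= 2.0;
    std::printf("double epsilon ~ %g\n", eps);   // ~2.22e-16 = 2^-52

    float feps = 1.0f;
    while (1.0f + feps / 2.0f > 1.0f)
        feps /= 2.0f;
    std::printf("float epsilon  ~ %g\n", feps);  // ~1.19e-07 = 2^-23
}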
Chopping, Rounding, and Bias
• Assume a 3-digit mantissa
• Chopping values: 0.1557 × 10¹ → 0.155 × 10¹
• Rounding values: 0.1557 × 10¹ → 0.156 × 10¹
• Both chopping and rounding introduce a bias
  • rounding – 4 digits (1-4) round down while 5 digits (5-9) round up
  • eliminating the bias can be done by randomly rounding up or down in the case of a '5' – not worth it if running the same program twice returns two different results
IEEE 32-bit floating point
• Slightly more complicated than the idealized format
  • no wasted 1 to force normalization – the leading 1 of the normalized mantissa is implicit
  • an exponent bias removes the need for a sign bit in the exponent
• 32-bit: 1 sign bit, 8-bit exponent, 23-bit mantissa
  • exponent bias: 127
  • machine epsilon: 2⁻²³ ≈ 1.19 × 10⁻⁷
• 64-bit: 1 sign bit, 11-bit exponent, 52-bit mantissa
  • exponent bias: 1023
  • machine epsilon: 2⁻⁵² ≈ 2.22 × 10⁻¹⁶
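A C++ sketch that pulls apart the three fields of a 32-bit float (the test value is illustrative; memcpy avoids strict-aliasing problems):

#include <cstdio>
#include <cstring>
#include <cstdint>

int main() {
    float f = -6.25f;               // = -1.5625 x 2^2
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);

    uint32_t sign     = bits >> 31;           // 1 sign bit
    uint32_t exponent = (bits >> 23) & 0xFF;  // 8-bit biased exponent
    uint32_t mantissa = bits & 0x7FFFFF;      // 23 stored mantissa bits (leading 1 implicit)

    std::printf("sign=%u biased exponent=%u (unbiased %d) mantissa=0x%06X\n",
                sign, exponent, (int)exponent - 127, mantissa);
    // prints: sign=1 biased exponent=129 (unbiased 2) mantissa=0x480000
}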
Hexadecimal
• Useful when looking at binary data
  • memory dumps, binary files, etc.

0000 1010   0001 1011   0010 1100   0011 1101
   0    A      1    B      2    C      3    D

• Easier to view the data in 8-bit chunks: 0A 1B 2C 3D
• Commonly used for
  • memory dumps
  • doing your own math, bitwise operations
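Printing bytes as hex in C++ is one printf per byte, with each hex digit covering one 4-bit nibble (the byte values are taken from the example above):

#include <cstdio>

int main() {
    unsigned char bytes[] = { 0x0A, 0x1B, 0x2C, 0x3D };
    for (unsigned int i = 0; i < sizeof bytes; i++)
        std::printf("%02X ", bytes[i]);   // prints: 0A 1B 2C 3D
    std::printf("\n");
}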
Endianness
• Hardware-dependent byte ordering of the system
• Conversion is called the "nUxi problem"
• Intel (and by extension most home computers) use little-endian
• Important if you want to do your own bitwise operations
  • Create binary masks
  • Implement your own floating point
  • Integer arithmetic
Testing Endianness
#include <iostream>

int main() {
    int a = 0x0A1B2C3D;
    unsigned char *c = (unsigned char*)(&a);  // view the int byte-by-byte
    if (*c == 0x3D)  // lowest address holds the least significant byte
        std::cout << "little-endian" << std::endl;
    else
        std::cout << "big-endian" << std::endl;
    return 0;
}
Numerical Methods – a PSA
• In 1991, failure of a MIM-104 Patriot Missile defense system was
caused by a software error. Time stamps from multiple radar
pulses were converted to floating point differently.
• http://en.wikipedia.org/wiki/MIM-104_Patriot#Failure_at_Dhahran
• In 1994, Professor Thomas R. Nicely discovered that the Intel
Pentium floating‐point processor returned erroneous results for
certain division operations.
• http://engineeringfailures.org/?p=466
• In 1996, the Ariane5 rocket launched by the European Space
Agency exploded 40 seconds after lift‐off from Kourou, French
Guiana, because of an overflow error.
• http://www.around.com/ariane.html
• IBM recently announced that they’re going little-endian
• https://www.business-cloud.com/articles/news/ibm-drops-linux-bombshell

h/t Prof. Lennart Johnsson (COSC)


Numerical Methods – a PSA
• Boeing 787 operators have been ordered to periodically reset their electrical systems to avoid an overflow error that happens every 2³¹ centiseconds (around every 8 months)
  • http://www.nytimes.com/2015/05/01/business/faa-orders-fix-for-possible-power-loss-in-boeing-787.html?_r=0
  • A software patch is being released Fall 2015
• Donkey Kong breaks at level 22. You're given 260 seconds to complete the level – however, the game only uses an 8-bit unsigned integer to store the time, and 260 mod 256 leaves just 4 seconds...
Memory dump demo
• Try using https://hexed.it/ to view memory dumps and
binary files
