13.3 Real Numbers - Normalized Floating Point
Boost your performance and confidence with these topic-based exam questions - practice questions created by Exam Papers Practice. Suitable for all boards; designed to test your ability and thoroughly prepare you.

Real Numbers & Normalised Floating-Point - CIE AS Level Computer Science (9618) Revision Notes
For more help, please visit our website: www.exampaperspractice.co.uk

Syllabus content: 13.3 Real numbers and normalised floating-point representation
- describe the format of binary floating-point real numbers
- convert binary floating-point real numbers into denary and vice versa
- normalise floating-point numbers
- show understanding of the consequences of a binary representation only being an approximation to the real number it represents (in certain cases)
- show understanding that binary representations can give rise to rounding errors

Notes and guidance:
- use two's complement form
- effects of changing the allocation of bits to mantissa and exponent in a floating-point representation
- understand the reasons for normalisation
- how underflow and overflow can occur

Real numbers:
A real number is one with a fractional part. When we write down a value for a real number in the denary system we have a choice: we can use a simple representation or an exponential notation (sometimes referred to as scientific notation). In the latter case we have options. For example, the number 25.3 might alternatively be written as:

    0.253 x 10²   or   2.53 x 10¹   or   25.3 x 10⁰   or   253 x 10⁻¹

For this number the simple expression is best, but if a number is very large or very small the exponential notation is the only sensible choice.

Fixed-point representation:
A binary code must be used for storing a real number in a computer system. One possibility is a fixed-point representation. In fixed-point representation, an overall number of bits is chosen, with a defined number of bits for the whole-number part and the remainder for the fractional part.

One way of handling numbers with fractional parts is to add bits after the binary point: the first bit after the point is the halves place, the next bit the quarters place, the next bit the eighths place, and so on.

    Place values:  4   2   1  .  0.5   0.25   0.125
                   2²  2¹  2⁰    2⁻¹   2⁻²    2⁻³

Suppose we want to represent 1.625 (denary). We want 1 in the ones place, leaving 0.625. Then we want 1 in the halves place, leaving 0.625 - 0.5 = 0.125. No quarters will fit, so we put 0 there. We want 1 in the eighths place, and we subtract 0.125 from 0.125 to get 0. So the binary representation of 1.625 is 1.101 (binary).

So how does fixed-point representation store a fractional number in binary? See the explanation below.

The number 40.125 is to be converted into binary. First 40 is converted with a normal denary-to-binary conversion, which gives:

    40 = 101000, or written in a complete byte, 40 = 00101000

Now 0.125 has to be converted to binary. We repeatedly multiply by 2 and record the whole-number part of each result:

    0.125 x 2 = 0.25   -> 0
    0.25  x 2 = 0.5    -> 0
    0.5   x 2 = 1.0    -> 1

So 0.125 = 0.001 and the fractional number 40.125 becomes 00101000.001 (the leading 0 is the sign bit, indicating a positive value).
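The repeated multiply-by-2 method used above is easy to automate. The following Python sketch is only an illustration (the function name fraction_to_binary and the choice of three fractional bits are my own, not part of the notes); it converts a fractional part to a fixed number of binary places by exactly this process.

    def fraction_to_binary(fraction, places):
        # Convert a fractional part (0 <= fraction < 1) to 'places' binary digits
        # by repeatedly multiplying by 2 and recording the whole-number part.
        bits = ""
        for _ in range(places):
            fraction *= 2
            if fraction >= 1:
                bits += "1"
                fraction -= 1
            else:
                bits += "0"
        return bits

    # 40.125: whole part 40 = 101000, fractional part 0.125 = 001
    print(bin(40)[2:], fraction_to_binary(0.125, 3))   # prints: 101000 001

For values such as 0.1, whose binary expansion does not terminate, the loop simply truncates after the chosen number of places - the source of the rounding errors discussed later in these notes.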
For the fixed-point option, a possible choice would be to use the most significant bit as a sign bit and the next five bits for the whole-number part. This would leave two bits for the fractional part. Using two's complement place values:

    Place values:  -32  16  8  4  2  1  .  0.5  0.25

    011111 11 = 16 + 8 + 4 + 2 + 1 + 0.5 + 0.25 = 31.75
    000000 01 = 0.25

Negative fixed-point representation:
Suppose -31.75 has to be converted into binary. Using the same place values:

    100000 01 = -32 + 0.25 = -31.75

Suppose -0.25 has to be converted into binary:

    111111 11 = -32 + 16 + 8 + 4 + 2 + 1 + 0.5 + 0.25 = -0.25

Some important non-zero values in this representation are shown in the table below (the bits are shown with a gap to indicate the implied position of the binary point).

    Description                         Binary code   Denary equivalent
    Largest positive value              011111 11     31.75
    Smallest positive value             000000 01     0.25
    Smallest magnitude negative value   111111 11     -0.25
    Largest magnitude negative value    100000 00     -32

Now consider a fractional number stored in 12 bits, with 8 bits for the whole-number part and 4 bits for the fraction. In two's complement form, -52.625 becomes 11001011.0110 (the leading 1 is the sign bit, indicating a negative value), so the stored fixed-point pattern would be 11001011 0110. (The working for this two's complement conversion is shown later in these notes.)

Floating-point number representation:
The alternative is a floating-point representation. The format for a floating-point number can be generalised as:

    ±M x R^E

In this option a defined number of bits are used for what is called the significand or mantissa, M. The remaining bits are used for the exponent or exrad, E. The radix, R, is not stored in the representation; it has an implied value of 2.

Floating-point representation: a representation of real numbers that stores a value for the mantissa and a value for the exponent.

Conversion from a +ve real number to a binary floating-point number:
The number 40.125 is to be converted into binary. First 40 is converted with a normal denary-to-binary conversion, which gives:

    40 = 101000, or inclusive of the sign bit, 40 = 0101000

Now 0.125 has to be converted to binary by repeatedly multiplying by 2:

    0.125 x 2 = 0.25  -> 0
    0.25  x 2 = 0.5   -> 0
    0.5   x 2 = 1.0   -> 1

So the number 40.125 becomes 0101000.001 (sign bit 0, positive). But a binary point cannot be stored directly: the point has to be moved so that it lies immediately after the sign bit. In 0101000.001 the point is moved 6 places to the left, so the exponent is 6, which in binary is 110. The number therefore becomes

    0 101000001 110
    sign bit | mantissa | exponent

so the stored number is 0101000001 110 (normalised form).
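The whole encoding process - scale the value so that the binary point sits just after the sign bit, count the shifts as the exponent, then write out the mantissa bits - can be expressed as a short Python sketch. This is only an illustration under my own naming (encode_positive is not a function from the notes or the syllabus); it assumes a positive, non-zero value and a two's complement exponent field.

    def encode_positive(x, mant_bits, exp_bits):
        # Encode a positive real number (x > 0) as a normalised floating-point
        # value: a mantissa of 'mant_bits' bits (sign bit + fraction bits) and
        # a two's complement exponent of 'exp_bits' bits.
        exponent = 0
        # Scale x into the range [0.5, 1) so the mantissa starts 0.1...
        while x >= 1:
            x /= 2
            exponent += 1
        while x < 0.5:
            x *= 2
            exponent -= 1
        mantissa = "0"                      # sign bit for a positive number
        for _ in range(mant_bits - 1):      # then the fraction bits
            x *= 2
            if x >= 1:
                mantissa += "1"
                x -= 1
            else:
                mantissa += "0"
        # Two's complement bit pattern of the exponent
        exp_field = format(exponent & (2 ** exp_bits - 1), "0" + str(exp_bits) + "b")
        return mantissa, exp_field

    print(encode_positive(40.125, 10, 3))   # ('0101000001', '110')  - as in the worked example
    print(encode_positive(40.125, 8, 8))    # ('01010000', '00000110') - with only 8 mantissa bits
                                            # the trailing 1 is lost, so the stored value is 40, not 40.125

The second call shows why the number of mantissa bits matters for precision, which is the subject of the "Precision and normalisation" section later in these notes.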
Past paper questions:
Question 1 (9608/32/O/N/16):
In a particular computer system, real numbers are stored using floating-point representation with:
- 8 bits for the mantissa
- 8 bits for the exponent
- two's complement form for both mantissa and exponent
Calculate the floating-point representation of +3.5 in this system. Show your working.

Solution:
3.5 has to be converted into binary. First 3 is converted with a normal denary-to-binary conversion: 3 = 11, or inclusive of the sign bit, 3 = 011. Now 0.5 has to be converted to binary: 0.5 x 2 = 1.0 -> 1. So 3.5 = 11.1 and the number becomes 011.1 (sign bit 0). Written in a whole byte this is 011.10000. Moving the binary point 2 places to the left so that it lies immediately after the sign bit gives 0.1110000, so the exponent is 2. The mantissa and exponent are therefore:

    01110000 00000010
    sign bit | mantissa | exponent

Conversion from a -ve real number to a binary floating-point number:
Question 2:
In a particular computer system, real numbers are stored using floating-point representation with:
- 8 bits for the mantissa
- 8 bits for the exponent
- two's complement form for both mantissa and exponent
Calculate the floating-point representation of -3.5 in this system. Show your working.

Solution:
First find the representation of +3.5, exactly as above: 3 with the sign bit is 011, and 0.5 x 2 = 1.0 -> 1, so 3.5 = 011.1, or 011.10000 written in a whole byte. The binary point is moved 2 places so that it lies immediately after the sign bit, so the exponent is 2, which in one byte is 00000010. The mantissa 01110000 represents +3.5 and has to be converted into -3.5 by taking the two's complement:

    01110000   = +3.5
    10001111   one's complement (invert every bit)
    +       1  add 1
    10010000   two's complement = -3.5

The mantissa expressed in 8 bits and the exponent expressed in 8 bits are therefore:

    10010000 00000010
    sign bit | mantissa | exponent

Another example - conversion from a -ve real number to a binary floating-point number:
Suppose -52.625 has to be converted into binary. First 52 = 110100, or written in a complete byte, 52 = 00110100. Now convert 0.625 by repeatedly multiplying by 2:

    0.625 x 2 = 1.25  -> 1
    0.25  x 2 = 0.5   -> 0
    0.5   x 2 = 1.0   -> 1

So 52.625 = 00110100.101. The one's complement (inverting 0s to 1s and 1s to 0s) is 11001011.010, and adding 1 gives the two's complement: -52.625 = 11001011.011.

So the fractional number -52.625 becomes 11001011.011 (sign bit 1, negative). But the binary point cannot be stored directly: it has to be moved so that it lies immediately after the sign bit. In 11001011.011 the point is moved 7 places to the left, so the exponent is 7, which in binary is 111. The number therefore becomes

    1 1001011011 111
    sign bit | mantissa | exponent

so the stored number is 11001011011 111.

Conversion from a binary floating-point number to a +ve real number:
Example 1 (9608/34/O/N/15, Q1):
In a computer system, real numbers are stored using floating-point representation with:
- 8 bits for the mantissa, followed by 8 bits for the exponent
- two's complement form for both mantissa and exponent
A real number is stored as the following two bytes:

    Mantissa   Exponent
    00101000   00000011

Calculate the denary value of this number. Show your working.

Solution:
The exponent is 00000011 = 3 denary. The mantissa is 00101000; the binary point lies immediately after the sign bit, so the mantissa is 0.0101000. As the exponent is 3, the point moves three places to the right:

    0010.1000

    Place values:  8  4  2  1  .  0.5  0.25  0.125  0.0625

giving 2 + 0.5 = 2.5 (calculated from the place values).

Example 2:
A floating-point binary representation uses 4 bits for the mantissa and 4 bits for the exponent. Convert 0110 0010.

Solution:
The exponent is 0010 = 2 denary. The mantissa is 0110; the binary point lies after the sign bit, so the mantissa is 0.110. As the exponent is 2, the point moves two places to the right, giving 011.0 = +3.0 (calculated from the place values).
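The decoding steps used in these examples - read the exponent, place the binary point after the sign bit of the mantissa, then move the point by the exponent - amount to: value = (two's complement mantissa value / 2^(number of bits after the sign bit)) x 2^exponent. A small Python sketch of this, under my own naming (decode is not a function from the notes), is shown below; it also handles two's complement negative mantissas and exponents, which are used in the examples that follow.

    def decode(mantissa_bits, exponent_bits):
        # Interpret both fields as two's complement integers.
        m_int = int(mantissa_bits, 2)
        if mantissa_bits[0] == "1":
            m_int -= 2 ** len(mantissa_bits)
        e_int = int(exponent_bits, 2)
        if exponent_bits[0] == "1":
            e_int -= 2 ** len(exponent_bits)
        # The implied binary point sits after the sign bit, so the mantissa
        # value is m_int divided by 2^(number of bits after the sign bit).
        mantissa_value = m_int / 2 ** (len(mantissa_bits) - 1)
        return mantissa_value * 2 ** e_int

    print(decode("00101000", "00000011"))   # 2.5   (past-paper example above)
    print(decode("0110", "0010"))           # 3.0
    print(decode("10010000", "00000010"))   # -3.5  (checks the -3.5 encoding above)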
Conversion from a binary floating-point number to a -ve real number:
Example 1:
A floating-point binary representation uses 4 bits for the mantissa and 4 bits for the exponent. Convert 1001 0001.

Solution:
The exponent is 0001 = 1 denary. The mantissa is 1001; the binary point lies after the sign bit, so the mantissa is 1.001. As the exponent is 1, the point moves one place to the right:

    10.01

    Place values:  -2  1  .  0.5  0.25

giving -2 + 0.25 (calculated from the place values). -2 is negative and +0.25 is positive, so adding 0.25 to -2 gives -1.75 (answer).

Binary number to -ve real number with a -ve exponent:
Find the denary value of the following binary floating-point number (past paper question):

    Mantissa    Exponent
    10110000    1110

First solve the exponent. With place values -8, 4, 2, 1:

    1110 = -8 + 4 + 2 = -2

Now solve the mantissa, 10110000. The binary point lies after the sign bit, so the number becomes 1.0110000. Whenever we have a negative exponent with a negative mantissa, we first solve the exponent and then take the two's complement of the mantissa (keeping the sign of the number in mind):

    1.0110000
    0.1001111   one's complement
    +       1   add 1
    0.1010000   two's complement (the magnitude of the mantissa)

Because the exponent was -2, the binary point is shifted two places to the LEFT:

    0.00101000

    Place values after the point:  0.5  0.25  0.125  0.0625  0.03125 ...

Keeping the sign in mind (our number was negative), the denary value is -(0.125 + 0.03125) = -0.15625.

Precision and normalisation:
In floating-point representation, a decision has to be made about the total number of bits to be used and how they are split between the mantissa and the exponent. In practice, a choice for the total number of bits to be used will be available as an option when the program is written. However, the split between the two parts of the representation will have been determined by the floating-point processor.

Effects of changing the allocation of bits to mantissa and exponent in a floating-point representation:
If you have a choice, you would base a decision on the fact that increasing the number of bits for the mantissa gives better precision for a stored value, but leaves fewer bits for the exponent, so reducing the range of possible values. In order to achieve maximum precision, it is necessary to normalise a floating-point number.

Since precision increases with an increasing number of bits for the mantissa, it follows that optimum precision will only be achieved if full use is made of these bits. In practice, that means using the largest possible magnitude for the value represented by the mantissa. To illustrate this, consider the eight-bit representation used in the tables below. The first table shows possible representations of denary 2, using four bits for the mantissa and four bits for the exponent.

    Denary representation   Floating-point binary representation
    0.125 x 2⁴              0001 0100
    0.25  x 2³              0010 0011
    0.5   x 2²              0100 0010   <- normalised

For a negative number we can consider representations of -4, as shown in the table below (again using four bits for the mantissa and four bits for the exponent).

    Denary representation   Floating-point binary representation
    -0.25 x 2⁴              1110 0100
    -0.5  x 2³              1100 0011
    -1.0  x 2²              1000 0010   <- normalised

It can be seen that when the number is represented with the highest magnitude for the mantissa, the two most significant bits are different. This fact can be used to recognise that a number is in a normalised representation. The values in these tables also show how a number could be normalised.
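The rule "shift the mantissa until its two most significant bits differ, reducing the exponent by one for each shift" can be written directly as code. The sketch below is my own illustration (the function name normalise and the decision to pass the exponent as a plain integer are assumptions, not part of the notes); it reproduces the two tables above.

    def normalise(mantissa_bits, exponent):
        # Shift the mantissa left until its two most significant bits differ
        # (01... for a positive number, 10... for a negative number),
        # reducing the exponent by 1 for each shift. Zeros are shifted in on
        # the right; a zero mantissa cannot be normalised.
        if "1" not in mantissa_bits:
            return mantissa_bits, exponent
        bits = list(mantissa_bits)
        while bits[0] == bits[1]:
            bits = bits[1:] + ["0"]
            exponent -= 1
        return "".join(bits), exponent

    print(normalise("0001", 4))   # ('0100', 2) :  0.125 x 2^4  ->  0.5 x 2^2  (denary 2)
    print(normalise("1110", 4))   # ('1000', 2) : -0.25  x 2^4  -> -1.0 x 2^2  (denary -4)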
Normalising the mantissa:
Before a floating-point binary number can be stored correctly, its mantissa must be normalised. The process is basically the same as when normalising a floating-point decimal number. For example, decimal 1234.567 is normalised as 1.234567 x 10³ by moving the decimal point so that only one digit appears before the point. The exponent expresses the number of positions the decimal point was moved left (positive exponent) or moved right (negative exponent). Similarly, the floating-point binary value 1101.101 is normalised as 1.101101 x 2³ by moving the point 3 positions to the left and multiplying by 2³.

Normalisation of a +ve binary number:
For a positive number, the bits in the mantissa are shifted left until the two most significant bits are 0 followed by 1. For each shift left, the value of the exponent is reduced by 1.

    Floating-point binary representation
    0001 0100
    0010 0011
    0100 0010   <- normalised

Normalisation of a -ve binary number:
The same process of shifting is used for a negative number, until the two most significant bits are 1 followed by 0. In this case, no attention is paid to the fact that bits fall off the most significant end of the mantissa.

    Floating-point binary representation
    1110 0100
    1100 0011
    1000 0010   <- normalised

What are overflow and underflow?
Overflow occurs when a calculation produces a result that exceeds the capacity of the representation. Example: 16-bit integers can hold numbers in the range -32768 to 32767. So what happens when you add 20000 to 20000?

      0100111000100000
    + 0100111000100000
      1001110001000000

The sixteenth bit contains a 1 as a result of adding the two numbers. Yet numbers with a 1 in the leading position are interpreted as negative numbers, so instead of 40000 the result is interpreted as -25536.

Overflow can also occur in the exponent of a floating-point number, when the exponent becomes too large to be represented in the bits allocated to it (for example, with 7 bits available for the exponent, any exponent larger than 63).

Underflow:
A calculation resulting in a number so small that the negative number needed for the exponent is beyond the range of the bits used for exponents is called underflow (for example, with 7 bits for the exponent, any exponent smaller than -64). The term arithmetic underflow (or "floating-point underflow", or just "underflow") describes a condition in a computer program where the result of a calculation is a number of smaller absolute value than the computer can actually store in memory.

Overflow example:
A CPU with a capacity of 8 bits can hold values of up to 11111111 in binary. If one more bit were needed, there would be an overflow error. An example of an 8-bit overflow occurs in the binary sum 11111111 + 1 (denary: 255 + 1):

      11111111
    + 00000001
     100000000

The total needs more than 8 digits; when this happens the CPU drops the overflow digit because the computer cannot store it anywhere, and the computer thinks 255 + 1 = 0.
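Both overflow examples can be reproduced by masking a sum down to a fixed word size and then reading the result as a two's complement value. The Python sketch below is my own illustration (add_with_wrap is not a standard function); it is not how a CPU works internally, but it models the wrap-around behaviour described above.

    def add_with_wrap(a, b, bits):
        # Keep only the low 'bits' bits of the sum (the carry out is lost),
        # then interpret the result as a two's complement value.
        result = (a + b) & (2 ** bits - 1)
        if result >= 2 ** (bits - 1):       # top bit set -> negative value
            result -= 2 ** bits
        return result

    print(add_with_wrap(20000, 20000, 16))  # -25536 instead of 40000
    print(add_with_wrap(255, 1, 8))         # 0 : the 8-bit sum 255 + 1 wraps around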
Rounding errors:
Because floating-point numbers have a limited number of digits, they cannot represent all real numbers accurately: when there are more digits than the format allows, the leftover ones are omitted - the number is rounded. There are three reasons why this can be necessary.

Large denominators: In any base, the larger the denominator of an (irreducible) fraction, the more digits it needs in positional notation. A sufficiently large denominator will require rounding, no matter what the base or the number of available digits is. For example, 1/1000 cannot be accurately represented in fewer than 3 decimal digits, nor can any multiple of it (that does not allow simplifying the fraction).

Periodical digits: Any (irreducible) fraction where the denominator has a prime factor that does not occur in the base requires an infinite number of digits that repeat periodically after a certain point. For example, in decimal 1/4, 3/5 and 8/20 are finite, because 2 and 5 are the prime factors of 10. But 1/3 is not finite, nor is 2/3 or 1/7 or 5/6, because 3 and 7 are not factors of 10. Fractions with a prime factor of 5 in the denominator can be finite in base 10, but not in base 2 - the biggest source of confusion for most novice users of floating-point numbers.

Non-rational numbers: Non-rational numbers cannot be represented as a regular fraction at all, and in positional notation (no matter what base) they require an infinite number of non-recurring digits.

Many new programmers become aware of binary floating-point after seeing their programs give odd results: "Why does my program print 0.10000000000000001 when I enter 0.1?"; "Why does 0.3 + 0.6 = 0.89999999999999991?"; "Why does 6 * 0.1 not equal 0.6?". Questions like these are asked every day on online forums like stackoverflow.com. The answer is that most decimals have infinite representations in binary. Take 0.1, for example. It is one of the simplest decimals you can think of, and yet it looks complicated in binary:

    0.000110011001100110011001100110011...

[Figure: decimal 0.1 in binary, to 1369 places - the block 0011 repeats without end.]

The bits go on forever; no matter how many of those bits you store in a computer, you will never end up with the binary equivalent of decimal 0.1.

0.1 in binary:
0.1 is one-tenth, or 1/10. To show it in binary - that is, as a "bicimal" - divide binary 1 by binary 1010, using binary long division. [Figure: computing one-tenth in binary by long division.] The division process would repeat forever - and so too would the digits in the quotient - because 100 ("one-zero-zero") reappears as the working portion of the dividend. Recognising this, we can abort the division and write the answer in repeating bicimal notation as 0.00011, with the block 0011 repeating forever.
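The long division can be imitated in a few lines: at each step the remainder is doubled, and a quotient bit of 1 is produced whenever the doubled remainder reaches the divisor. This Python sketch (my own illustration, using exact integer arithmetic rather than floats; the function name is made up) prints the start of the expansion and makes the repeating 0011 block visible.

    def fraction_to_bicimal(numerator, denominator, places):
        # Expand numerator/denominator as a binary fraction to 'places' bits,
        # doubling the remainder at each step of the long division.
        bits = ""
        remainder = numerator
        for _ in range(places):
            remainder *= 2
            if remainder >= denominator:
                bits += "1"
                remainder -= denominator
            else:
                bits += "0"
        return "0." + bits

    print(fraction_to_bicimal(1, 10, 24))   # 0.000110011001100110011001 - the 0011 block repeats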
Summary: In pure maths, every decimal has an equivalent bicimal. In floating-point maths, this is just not true.

Even with 10, 20 or 100 digits, you would need to do some rounding to represent an infinite number of digits in a finite space. If you have a lot of digits, your rounding error might seem insignificant. But consider what happens if you add up these rounded numbers repeatedly over a long period of time. If you round 1/7 to 1.42 x 10⁻¹ (0.142) and add up this representation 700 times, you would expect to get 100 (1/7 x 700 = 100), but instead you get 99.4 (0.142 x 700). Relatively small rounding errors like this can have huge impacts. Knowing how these rounding errors can occur, and being conscious of them, will help you become a better and more precise programmer.

Errors due to rounding have long been the bane of analysts trying to solve equations and systems. Such errors may be introduced in many ways, for instance:
- inexact representation of a constant
- integer overflow resulting from a calculation with a result too large for the word size
- overflow resulting from a calculation with a result too large for the number of bits used to represent the mantissa of a floating-point number
- accumulated error resulting from repeated use of numbers stored inexactly

Summary: Rounding error is a natural consequence of the representation scheme used for integers and floating-point numbers in digital computers. Rounding can produce highly inaccurate results as errors are propagated through repeated operations on inaccurate numbers. Proper handling of rounding error may involve a combination of approaches, such as use of high-precision data types and revised calculations and algorithms. Mathematical analysis can be used to estimate the actual error in calculations.

Past paper question:
(c) A student writes a program to output numbers using the following code:

    X ← 0.0
    FOR i ← 0 TO 1000
        X ← X + 0.1
        OUTPUT X
    ENDFOR

The student is surprised to see that the program outputs the following sequence:

    0.0  0.1  0.2  0.2999999  0.3999999 ......

Explain why this output has occurred.

Solution (any one of the following points):
- 0.1 cannot be represented exactly in binary
- 0.1 is represented here by a value just less than 0.1
- the loop keeps adding this approximate value to the counter
- until the accumulated small differences become significant enough to be seen
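The same effect can be reproduced in Python (shown here purely as an illustration; Python uses double-precision floats, so the exact digits differ from the output in the question, but the drift is the same in nature).

    # Repeatedly add 0.1, which is stored as a binary approximation,
    # so the accumulated total slowly drifts away from the exact value.
    x = 0.0
    for _ in range(10):
        x = x + 0.1
        print(x)
    # After a few additions the printed values drift, e.g.
    # 0.30000000000000004 instead of 0.3 and 0.7999999999999999 instead of 0.8.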
