10/14/2020 The Stata Blog » The Penultimate Guide to Precision

The Penultimate Guide to Precision


2 April 2012 William Gould, President 17 Comments

There have recently been occasional questions on precision and storage types on Statalist despite all that I have written on the subject, much of it
posted in this blog. I take that as evidence that I have yet to produce a useful, readable piece that addresses all the questions researchers have.

So I want to try again. This time I’ll try to write the ultimate piece on the subject, making it as short and snappy as possible, and addressing
every popular question of which I am aware—including some I haven’t addressed before—and doing all that without making you wade with me
into all the messy details, which I know I have a tendency to do.

I am hopeful that from now on, every question that appears on Statalist that even remotely touches on the subject will be answered with a link
back to this page. If I succeed, I will place this in the Stata manuals and get it indexed online in Stata so that users can find it the instant they
have questions.

What follows is intended to provide everything scientific researchers need to know to judge the effect of storage precision on their work, to
know what can go wrong, and to prevent that. I don’t want to raise expectations too much, however, so I will entitle it …

THE PENULTIMATE GUIDE TO PRECISION

0. Contents

1. Numeric types
2. Floating-point types
3. Integer types
4. Integer precision
5. Floating-point precision
6. Advice concerning 0.1, 0.2, …
7. Advice concerning exact data, such as currency data
8. Advice for programmers
9. How to interpret %21x format (if you care)
10. Also see

1. Numeric types

1.1 Stata provides five numeric types for storing variables, three of them integer types and two of them floating point.

1.2 The floating-point types are float and double.

1.3 The integer types are byte, int, and long.

1.4 Stata uses these five types for the storage of data.

1.5 Stata makes all calculations in double precision (and sometimes quad precision) regardless of the type used to store the data.

2. Floating-point types

2.1 Stata provides two IEEE 754-2008 floating-point types: float and double.

2.2 float variables are stored in 4 bytes.

2.3 double variables are stored in 8 bytes.

2.4 The ranges of float and double variables are


Storage
type      minimum                maximum
-----------------------------------------------------
float     -3.40282346639e+38     1.70141173319e+38
double    -1.79769313486e+308    8.98846567431e+307
-----------------------------------------------------
In addition, float and double can record missing values
., .a, .b, ..., .z.

https://blog.stata.com/2012/04/02/the-penultimate-guide-to-precision/#:~:text=2.1 Stata provides two IEEE,are stored in 8 bytes.&text=They are 1 … 1/8


The above values are approximations. For those familiar with %21x floating-point hexadecimal format, the exact values
are

Storage
type      minimum                  maximum
-------------------------------------------------------
float     -1.fffffe0000000X+07f    +1.fffffe0000000X+07e
double    -1.fffffffffffffX+3ff    +1.fffffffffffffX+3fe
-------------------------------------------------------

Said differently, and less precisely, float values are in the open interval (-2^128, 2^127), and double values are in the open
interval (-2^1024, 2^1023). This is less precise because the intervals shown in the tables are closed intervals.
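
The limits above can be inspected directly. The following is a minimal sketch, assuming the system values c(maxfloat), c(minfloat), c(maxdouble), and c(mindouble) reported by creturn list; it displays each limit in %21x so it can be compared with the table of exact values.

```stata
* Largest and smallest nonmissing values of each floating-point type,
* shown in hexadecimal (%21x) format
display %21x c(minfloat)
display %21x c(maxfloat)
display %21x c(mindouble)
display %21x c(maxdouble)

* The same limits in base 10
display %20.0g c(maxfloat)
display %20.0g c(maxdouble)
```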

3. Integer types

3.1 Stata provides three integer storage formats: byte, int, and long. They are 1 byte, 2 bytes, and 4 bytes, respectively.

3.2 Integers may also be stored in Stata’s IEEE 754-2008 floating-point storage formats float and double.

3.3 Integer values may be stored precisely over the ranges


storage
type      minimum                       maximum
------------------------------------------------------
byte                        -127                        100
int                      -32,767                     32,740
long              -2,147,483,647              2,147,483,620
------------------------------------------------------
float                -16,777,216                 16,777,216
double    -9,007,199,254,740,992      9,007,199,254,740,992
------------------------------------------------------
In addition, all storage types can record missing values
., .a, .b, ..., .z.

The overall ranges of float and double were shown in (2.4) and are wider than the ranges for them shown here. The ranges
shown here are the subsets of the overall ranges over which no rounding of integer values occurs.
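
As a quick check of the table in (3.3), here is a sketch that displays the integer limits and tests where float and double stop holding every integer. It assumes the system values c(minbyte), c(maxbyte), and so on, as reported by creturn list.

```stata
* Limits of the integer storage types
display c(minbyte) "  " c(maxbyte)
display c(minint)  "  " c(maxint)
display %14.0fc c(minlong) "  " %14.0fc c(maxlong)

* 16,777,216 survives rounding to float; 16,777,217 does not
assert float(16777216) == 16777216
assert float(16777217) != 16777217

* Just past the double limit, adding 1 no longer changes the value
assert 9007199254740992 + 1 == 9007199254740992
```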

4. Integer precision

4.1 (Automatic promotion.) For the integer storage types—for byte, int, and long—numbers outside the ranges listed in (3.3)
would be stored as missing (.) except that storage types are promoted automatically. As necessary, Stata promotes bytes to
ints, ints to longs, and longs to doubles. Even if a variable is a byte, the effective range is still [-9,007,199,254,740,992,
9,007,199,254,740,992] in the sense that you could change a value of a byte variable to a large value and that value would
be stored correctly; the variable that was a byte would, as if by magic, change its type to int, long, or double if that were
necessary.
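
Automatic promotion is easy to watch happen. This is a small sketch using a hypothetical variable named n; after each replace, describe shows the storage type Stata has moved to.

```stata
clear
set obs 1

* Start with the smallest integer type
generate byte n = 100
describe n

* 101 is outside the byte range, so n is promoted to int
replace n = 101
describe n

* 40,000 is outside the int range, so n is promoted to long
replace n = 40000
describe n

* 3,000,000,000 is outside the long range, so n is promoted to double
replace n = 3000000000
describe n
assert n == 3000000000
```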

4.2 (Data input.) Automatic promotion (4.1) applies after the data are input/read/imported/copied into Stata. When first reading,
importing, copying, or creating data, it is your responsibility to choose appropriate storage types. Be aware that Stata’s
default storage type is float, so if you have large integers, it is usually necessary to specify explicitly the types you wish to
use.

If you are unsure of the type to specify for your integer variables, specify double. After reading the data, you can use
compress to demote storage types. compress never results in a loss of precision.
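
A sketch of that advice, using a hypothetical identifier variable id: create it as a double so that nothing is rounded, then let compress pick the smallest type that holds the values exactly.

```stata
clear
set obs 3

* Specify double explicitly; the default type would be float
generate double id = 20000000 + _n
describe id

* compress demotes id without changing any value
compress
describe id
assert id == 20000000 + _n
```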

4.3 Note that you can use the floating-point types float and double to store integer data.

4.3.1 Integers outside the range [-2,147,483,647, 2,147,483,620] must be stored as doubles if they are to be precisely
recorded.

4.3.2 Integers can be stored as float, but avoid doing that unless you are certain they will be inside the range [-16,777,216,
16,777,216] not just when you initially read, import, or copy them into Stata, but subsequently as you make
transformations.

4.3.3 If you read your integer data as floats, and assuming they are within the allowed range, we recommend that you
change them to an integer type. You can do that simply by typing compress. We make that recommendation so that
your integer variables will benefit from the automatic promotion described in (4.1).

4.4 Let us show what can go wrong if you do not follow our advice in (4.3). For the floating-point types—for float and double—
integer values outside the ranges listed in (3.3) are rounded.

Consider a float variable, and remember that the integer range for floats is [-16,777,216, 16,777,216]. If you tried to store a
value outside the range in the variable, say 16,777,221, and checked afterward, you would discover that the value actually
stored was 16,777,220! Here are some other examples of rounding:

desired value                       stored (rounded)
to store          true value        float value
------------------------------------------------------
maximum           16,777,216        16,777,216
maximum+1         16,777,217        16,777,216
------------------------------------------------------
maximum+2         16,777,218        16,777,218
------------------------------------------------------
maximum+3         16,777,219        16,777,220
maximum+4         16,777,220        16,777,220
maximum+5         16,777,221        16,777,220
------------------------------------------------------
maximum+6         16,777,222        16,777,222
------------------------------------------------------
maximum+7         16,777,223        16,777,224
maximum+8         16,777,224        16,777,224
maximum+9         16,777,225        16,777,224
------------------------------------------------------
maximum+10        16,777,226        16,777,226
------------------------------------------------------

When you store large integers in float variables, values will be rounded and no mention will be made of that fact.

And that is why we say that if you have integer data that must be recorded precisely and if the values might be large—
outside the range ±16,777,216—do not use float. Use long or use double; or just use the compress command and let
automatic promotion handle the problem for you.
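
The silent rounding can be reproduced in a few lines. This sketch stores the same value, 16,777,221, in hypothetical variables f, l, and d of three different types and compares what was actually kept.

```stata
clear
set obs 1

generate float  f = 16777221
generate long   l = 16777221
generate double d = 16777221

format f l d %12.0fc
list

* The float was rounded to a neighboring even integer; long and double were not
assert f == 16777220
assert l == 16777221
assert d == 16777221
```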

4.5 Unlike byte, int, and long, float and double variables are not promoted to preserve integer precision.

Float values are not promoted because, well, they are not. Actually, there is a deep reason, but it has to do with the use of
float variables for their real purpose, which is to store non-integer values.

Double values are not promoted because there is nothing to promote them to. Double is Stata’s most precise storage type.
The largest integer value Stata can store precisely is 9,007,199,254,740,992 and the smallest is -9,007,199,254,740,992.

Integer values outside the range for doubles round in the same way that float values round, except at absolutely larger
values.

5. Floating-point precision

5.1 The smallest, nonzero value that can be stored in float and double is

Storage
type      value       value in %21x             value in base 10
-----------------------------------------------------------------
float     ±2^-127     ±1.0000000000000X-07f     ±5.877471754111e-039
double    ±2^-1022    ±1.0000000000000X-3fe     ±2.225073858507e-308
-----------------------------------------------------------------

We include the value shown in the third column, the value in %21x, for those who know how to read it. It is described in
(9), but it is unimportant. We are merely emphasizing that these are the smallest values for properly normalized numbers.

5.2 The smallest value of epsilon such that 1+epsilon ≠ 1 is


Storage
type      epsilon    epsilon in %21x           epsilon in base 10
-----------------------------------------------------------------
float     ±2^-23     ±1.0000000000000X-017     ±1.19209289551e-07
double    ±2^-52     ±1.0000000000000X-034     ±2.22044604925e-16
-----------------------------------------------------------------

Epsilon is the distance from 1 to the next number on the floating-point number line. The corresponding unit roundoff error
is u = ±epsilon/2. The unit roundoff error is the maximum relative roundoff error that is introduced by the floating-point
number storage scheme.

The smallest value of epsilon such that x+epsilon ≠ x is approximately |x|*epsilon, and the corresponding unit roundoff
error is ±|x|*epsilon/2.
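
The definition of epsilon can be checked with a few assertions. This sketch writes the powers of 2 as %21x literals (for example, 1.0x-34 is 2^-52, because 34 in base 16 is 52 in base 10), and it assumes the usual IEEE round-to-nearest behavior. The system values c(epsfloat) and c(epsdouble) are those listed by creturn list.

```stata
* Double: adding 2^-52 to 1 changes it, adding 2^-53 does not
assert 1 + 1.0x-34 != 1
assert 1 + 1.0x-35 == 1

* Float: the same experiment with 2^-23 and 2^-24
assert float(1 + 1.0x-17) != 1
assert float(1 + 1.0x-18) == 1

* Stata's own record of the two epsilons
display %21x c(epsfloat)
display %21x c(epsdouble)
```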

5.3 The precision of the floating-point types is, depending on how you want to measure it,

Measurement                                float         double
----------------------------------------------------------------
# of binary digits                            23             52
# of base 10 digits (approximate)              7             16

Relative precision                        ±2^-24         ±2^-53
... in base 10 (approximate)           ±5.96e-08      ±1.11e-16
----------------------------------------------------------------

Relative precision is defined as


             |x - x_as_stored|
   ± max    -------------------
      x              x

performed using infinite precision arithmetic, x chosen from the subset of reals between the minimum and maximum
values that can be stored. It is worth appreciating that relative precision is a worst-case relative error over all possible
numbers that can be stored. Relative precision is identical to roundoff error, but perhaps this definition is easier to
appreciate.
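
For one particular x, the relative error is easy to compute. This sketch uses x = 1.1 stored as a float and confirms that the error stays within the worst-case bound of 2^-24, written here as the %21x literal 1.0x-18. The scalar name relerr is only for illustration.

```stata
* Relative error introduced by rounding 1.1 to float precision
scalar relerr = abs(1.1 - float(1.1)) / 1.1
display relerr

* It cannot exceed the relative precision of float, 2^-24
assert relerr <= 1.0x-18
```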

5.4 Stata never makes calculations in float precision, even if the data are stored as float.

Stata makes double-precision calculations regardless of how the numeric data are stored. In some cases, Stata internally
uses quad precision, which provides approximately 32 decimal digits of precision. If the result of the calculation is being
stored back into a variable in the dataset, then the double (or quad) result is rounded as necessary to be stored.

5.5 (False precision.) Double precision is 536,870,912 times more accurate than float precision. You may worry that float
precision is inadequate to accurately record your data.

Little in this world is measured to a relative accuracy of ±2^-24, the accuracy provided by float precision.

Ms. Smith, it is reported, made $112,293 this year. Do you believe that is recorded to an accuracy of ±2^-24*112,293, or
approximately ±0.7 cents?

David was born on 21jan1952, so on 27mar2012 he was 21,981 days old, or 60.18 years old. Recorded in float precision,
the precision is ±60.18*2^-24, or roughly ±1.89 minutes.

Joe reported that he drives 12,234 miles per year. Do you believe that Joe’s report is accurate to ±12,234*2^-24, equivalent
to ±3.85 feet?

A sample of 102,400 people reported that they drove, in total, 1,252,761,600 miles last year. Is that accurate to ±74.7 miles
(float precision)? If it is, each of them is reporting with an accuracy of roughly ±3.85 feet.

The distance from the Earth to the moon is often reported as 384,401 kilometers. Recorded as a float, the precision is
±384,401*2^-24, or ±23 meters, or ±0.023 kilometers. Because the number was not reported as 384,401.000, one would
assume float precision would be adequate to record that result. In fact, float precision is more than sufficiently accurate to
record the distance because the distance from the Earth to the moon varies from 356,400 to 406,700 kilometers, some
50,300 kilometers. The distance would have been better reported as 384,401 ±25,150 kilometers. At best, the measurement
384,401 has relative accuracy of ±0.033 (it is accurate to roughly two digits).

Nonetheless, a few things have been measured with more than float accuracy, and they stand out as crowning
accomplishments of mankind. Use double as required.
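
The arithmetic behind these examples can be redone with display. This is only a sketch of the calculations in (5.5); 2^-24 is written as the %21x literal 1.0x-18, and the unit conversions (365.25 days per year, 5,280 feet per mile) are assumptions made here.

```stata
* Ratio of float to double relative precision: 2^-24 / 2^-53 = 2^29
display %12.0fc 2^29

* Ms. Smith's income: float error in cents (about 0.7)
display 112293 * 1.0x-18 * 100

* David's age: float error in minutes (about 1.89)
display 60.18 * 1.0x-18 * 365.25 * 24 * 60

* Joe's mileage: float error in feet (about 3.85)
display 12234 * 1.0x-18 * 5280

* Total mileage of the sample: float error in miles (about 74.7)
display 1252761600 * 1.0x-18

* Earth to moon: float error in meters (about 23)
display 384401 * 1.0x-18 * 1000
```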

6. Advice concerning 0.1, 0.2, …

6.1 Stata uses base 2, binary. Popular numbers such as 0.1, 0.2, 100.21, and so on, have no exact binary representation in a finite
number of binary digits. There are a few exceptions, such as 0.5 and 0.25, but not many.

6.2 If you create a float variable containing 1.1 and list it, it will list as 1.1 but that is only because Stata’s default display format
is %9.0g. If you changed that format to %16.0g, the result would appear as 1.1000000238419.

This scares some users. If this scares you, go back and read (5.5) False Precision. The relative error is still a modest ±2^-24.
The number 1.1000000238419 is likely a perfectly acceptable approximation to 1.1 because the 1.1 was never measured to
an accuracy of less than ±2^-24 anyway.

6.3 One reason perfectly acceptable approximations to 1.1 such as 1.1000000238419 may bother you is that you cannot select
observations containing 1.1 by typing if x==1.1 if x is a float variable. You cannot because the 1.1 on the right is
interpreted as double precision 1.1. To select the observations, you have to type if x==float(1.1).
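
Here is a minimal sketch of the comparison problem, using a one-observation dataset with a hypothetical float variable x.

```stata
clear
set obs 1
generate float x = 1.1

* The literal 1.1 is a double, so it does not match the float value
count if x == 1.1
assert r(N) == 0

* Rounding the literal to float precision makes the comparison work
count if x == float(1.1)
assert r(N) == 1
```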

6.4 If this bothers you, record the data as doubles. It is best to do this at the point when you read the original data or when you
make the original calculation. The number will then appear to be 1.1. It will not really be 1.1, but it will have less relative
error, namely, ±2^-53.

6.5 If you originally read the data and stored them as floats, it is still sometimes possible to recover the double-precision
accuracy just as if you had originally read the data into doubles. You can do this if you know how many decimal digits
were recorded after the decimal point and if the values are within a certain range.

If there was one digit after the decimal point and if the data are in the range [-1,048,576, 1,048,576], which means the
values could be -1,048,576, -1,048,575.9, …, -1, 0, 1, …, 1,048,575.9, 1,048,576, then typing

. gen double y = round(x*10)/10

will recover the full double-precision result. Stored in y will be the number in double precision just as if you had originally
read it that way.

It is not possible, however, to recover the original result if x is outside the range ±1,048,576 because the float variable
contains too little information.

You can do something similar when there are two, three, or more decimal digits:

# digits to
right of
decimal pt.       range     command
-----------------------------------------------------------------
     1       ±1,048,576     gen double y = round(x*10)/10
     2       ±  131,072     gen double y = round(x*100)/100
     3       ±   16,384     gen double y = round(x*1000)/1000
     4       ±    1,024     gen double y = round(x*10000)/10000
     5       ±      128     gen double y = round(x*100000)/100000
     6       ±       16     gen double y = round(x*1000000)/1000000
     7       ±        1     gen double y = round(x*10000000)/10000000
-----------------------------------------------------------------

Range is the range of x over which command will produce correct results. For instance, range = ±16 in the next-to-the-last
line means that the values recorded in x must be -16 ≤ x ≤ 16.
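
A sketch of the one-digit case: the value 1234.1 is first stored in a float variable x, where it is only approximately 1.1-like, and then recovered into a double y that equals the double-precision 1234.1.

```stata
clear
set obs 1

* Read (mistakenly) as float: not equal to double-precision 1234.1
generate float x = 1234.1
assert x != 1234.1

* Recover the value, knowing there is one digit after the decimal point
generate double y = round(x*10)/10
assert y == 1234.1

* Compare the stored bits
display %21x x[1]
display %21x y[1]
```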

7. Advice concerning exact data, such as currency data

7.1 Yes, there are exact data in this world. Such data are usually counts of something or are currency data, which you can think
of as counts of pennies ($0.01) or the smallest unit in whatever currency you are using.

7.2 Just because the data are exact does not mean you need exact answers. It may still be that calculated answers are adequate if
the data are recorded to a relative accuracy of ±2^-24 (float). For most analyses, even of currency data, this is often
adequate. The U.S. deficit in 2011 was $1.5 trillion. Stored as a float, this amount has a (maximum) error of ±2^-24*1.5e+12
= ±$89,406.97. It would be difficult to imagine that ±$89,406.97 would affect any government decision maker dealing with
the full $1.5 trillion.

7.3 That said, you sometimes do need to make exact calculations. Banks tracking their accounts need exact amounts. It is not
enough to say to account holders that we have your money within a few pennies, dollars, or hundreds of dollars.

In that case, the currency data should be converted to integers (pennies) and stored as integers, and then processed as
described in (4). Assuming the dollar-and-cent amounts were read into doubles, you can convert them into pennies by
typing

. replace x = round(x*100)

7.4 If you mistakenly read the currency data as a float, you do not have to re-read the data if the dollar amounts are between
±$131,072. You can type

. gen double x_in_pennies = round(x*100)

This works only if x is between ±131,072.
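
For illustration, suppose the amount $1,234.56 was read into a float variable x. The sketch below converts it to an exact count of pennies and then lets compress choose an integer type for it.

```stata
clear
set obs 1

* Dollar amount mistakenly stored as float
generate float x = 1234.56

* Convert to pennies; round() removes the float representation error
generate double x_in_pennies = round(x*100)
assert x_in_pennies == 123456

* Demote to an integer type so that automatic promotion (4.1) applies
compress x_in_pennies
describe x_in_pennies
```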

8. Advice for programmers

8.1 Stata does all calculations in double (and sometimes quad) precision.

Float precision may be adequate for recording most data, but float precision is inadequate for performing calculations. That
is why Stata does all calculations in double precision. Float precision is also inadequate for storing the results of
intermediate calculations.

There is only one situation in which you need to exercise caution—if you create variables in the data containing
intermediate results. Be sure to create all such variables as doubles.
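
The difference is visible even for a single intermediate value. In this sketch the hypothetical variables t_float and t_double both receive the double-precision result of 1/3, but only the double keeps it.

```stata
clear
set obs 1

* The calculation 1/3 is made in double precision in both cases;
* what differs is how much of the result each variable can hold
generate float  t_float  = 1/3
generate double t_double = 1/3

display %21x t_float[1]
display %21x t_double[1]

assert t_double == 1/3
assert t_float  != 1/3
```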

8.2 The same quad-precision routines StataCorp uses are available to you in Mata; see the manual entries [M-5] mean, [M-5]
sum, [M-5] runningsum, and [M-5] quadcross. Use them as you judge necessary.
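
As a rough illustration of why those routines exist, the sketch below sums a column vector whose large elements cancel, once with sum() and once with quadsum(), both documented in [M-5] sum. The vector is made up for this example; compare the two displayed results on your own machine.

```stata
mata:
// Two huge values that cancel, plus a small value that is easily lost
x = (1e16 \ 1 \ -1e16)

// Double-precision accumulation
sum(x)

// Quad-precision accumulation
quadsum(x)
end
```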

9. How to interpret %21x format (if you care)

9.1 Stata has a display format that will display IEEE 754-2008 floating-point numbers in their full binary glory but in a readable
way. You probably do not care; if so, skip this section.

9.2 IEEE 754-2008 floating-point numbers are stored as a pair of numbers (a, b) that are given the interpretation

z = a * 2^b

where -2 < a < 2. In double precision, a is recorded with 52 binary digits. In float precision, a is recorded with 23 binary
digits. For example, the number 2 is recorded in double precision as

a = +1.0000000000000000000000000000000000000000000000000000
b = +1

The value of pi is recorded as

a = +1.1001001000011111101101010100010001000010110100011000
b = +1

9.3 %21x presents a and b in base 16. The double-precision value of 2 is shown in %21x format as

+1.0000000000000X+001

and the value of pi is shown as

+1.921fb54442d18X+001

In the case of pi, the interpretation is

a = +1.921fb54442d18 (base 16)


b = +001 (base 16)

Reading this requires practice. It helps to remember that one-half corresponds to 0.8 (base 16). Thus, we can see that a is
slightly larger than 1.5 (base 10) and b = 1 (base 10), so _pi is something over 1.5*2^1 = 3.

The number 100,000 in %21x is

+1.86a0000000000X+010

which is to say

a = +1.86a0000000000 (base 16)


b = +010 (base 16)

We see that a is slightly over 1.5 (base 10), and b is 16 (base 10), so 100,000 is something over 1.5*2^16 = 98,304.
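
The three values discussed in (9.3) can be displayed by typing the following; the results should match the %21x strings shown above.

```stata
* The number 2, the value of pi, and 100,000 in %21x format
display %21x 2
display %21x _pi
display %21x 100000
```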

9.4 %21x faithfully presents how the computer thinks of the number. For instance, we can easily see that the nice number 1.1
(base 10) is, in binary, a number with many digits to the right of the binary point:

. display %21x 1.1


+1.199999999999aX+000

We can also see why 1.1 stored as a float is different from 1.1 stored as a double:

. display %21x float(1.1)


+1.19999a0000000X+000

Float precision assigns fewer digits to the mantissa than does double precision, and 1.1 (base 10) in base 16 is a repeating
hexadecimal.

9.5 %21x can be used as an input format as well as an output format. For instance, Stata understands

. gen x = 1.86ax+10

Stored in x will be 100,000 (base 10).
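
A quick way to convince yourself of this is to use the %21x literal in an expression, as in this small sketch.

```stata
* The %21x literal evaluates to 100,000 (base 10)
display 1.86ax+10
assert 1.86ax+10 == 100000

* Likewise, 1.0x+1 is the number 2
assert 1.0x+1 == 2
```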

9.6 StataCorp has seen too many competent scientific programmers who, needing a perturbance for later use in their program,
code something like

epsilon = 1e-8

It is worth examining that number:

. display %21x 1e-8


+1.5798ee2308c3aX-01b

That is an ugly number that can only lead to the introduction of roundoff error in their program. A far better number would
be

epsilon = 1.0x-1b

Stata and Mata understand the above statement because %21x may be used as input as well as output. Naturally, 1.0x-1b
looks just like what it is,

. display %21x 1.0x-1b


+1.0000000000000X-01b

and all those pretty zeros will reduce numerical roundoff error.

In base 10, the pretty 1.0x-1b looks like

. display %20.0g 1.0x-1b


7.4505805969238e-09

and that number may not look pretty to you, but you are not a base-2 digital computer.

Perhaps the programmer feels that epsilon really needs to be closer to 1e-8. In %21x, we see that 1e-8 is
+1.5798ee2308c3aX-01b, so if we want to get closer, perhaps we use

epsilon = 1.6x-1b
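
To see how close that choice comes to 1e-8, display both values in both formats. This is only a sketch for comparison; no particular output is claimed here.

```stata
* The candidate epsilon in hexadecimal and in base 10
display %21x   1.6x-1b
display %20.0g 1.6x-1b

* The original 1e-8 for comparison
display %21x   1e-8
display %20.0g 1e-8
```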

9.7 %21x was invented by StataCorp.

10. Also see

If you wish to learn more, see

How to read the %21x format

How to read the %21x format, part 2

Precision (yet again), Part I

Precision (yet again), Part II
