(Data & Variable Management) Storage Type - The Penultimate Guide To Precision The Stata Blog
There have recently been occasional questions on precision and storage types on Statalist despite all that I have written on the subject, much of it
posted in this blog. I take that as evidence that I have yet to produce a useful, readable piece that addresses all the questions researchers have.
So I want to try again. This time I’ll try to write the ultimate piece on the subject, making it as short and snappy as possible, and addressing
every popular question of which I am aware—including some I haven’t addressed before—and doing all that without making you wade with me
into all the messy details, which I know I have a tendency to do.
I am hopeful that from now on, every question that appears on Statalist that even remotely touches on the subject will be answered with a link
back to this page. If I succeed, I will place this in the Stata manuals and get it indexed online in Stata so that users can find it the instant they
have questions.
What follows is intended to provide everything scientific researchers need to know to judge the effect of storage precision on their work, to
know what can go wrong, and to prevent that. I don’t want to raise expectations too much, however, so I will entitle it …
The Penultimate Guide to Precision
0. Contents
1. Numeric types
2. Floating-point types
3. Integer types
4. Integer precision
5. Floating-point precision
6. Advice concerning 0.1, 0.2, …
7. Advice concerning exact data, such as currency data
8. Advice for programmers
9. How to interpret %21x format (if you care)
10. Also see
1. Numeric types
1.1 Stata provides five numeric types for storing variables, three of them integer types and two of them floating point.
1.4 Stata uses these five types for the storage of data.
1.5 Stata makes all calculations in double precision (and sometimes quad precision) regardless of the type used to store the data.
2. Floating-point types
2.1 Stata provides two IEEE 754-2008 floating-point types: float and double.
Storage type   minimum                  maximum
-------------------------------------------------------
float          -1.fffffe0000000X+07f    +1.fffffe0000000X+07e
double         -1.fffffffffffffX+3ff    +1.fffffffffffffX+3fe
-------------------------------------------------------
Said differently, and less precisely, float values are in the open interval (-2^128, 2^127), and double values are in the open
interval (-2^1024, 2^1023). This is less precise because the intervals shown in the tables are closed intervals.
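The limits in the table can be checked from within Stata. The following is a minimal sketch, assuming the built-in functions maxfloat(), minfloat(), maxdouble(), and mindouble() are available in your version; the %21x output should agree with the table above.

```stata
* Largest and smallest nonmissing values of each floating-point type, in %21x
display %21x minfloat()
display %21x maxfloat()
display %21x mindouble()
display %21x maxdouble()

* The same limits shown in base 10
display minfloat()
display maxfloat()
display mindouble()
display maxdouble()
```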
3. Integer types
3.1 Stata provides three integer storage formats: byte, int, and long. They are 1 byte, 2 bytes, and 4 bytes, respectively.
3.2 Integers may also be stored in Stata’s IEEE 754-2008 floating-point storage formats float and double.
The overall ranges of float and double were shown in (2.4) and are wider than the ranges for them shown here. The ranges
shown here are the subsets of the overall ranges over which no rounding of integer values occurs.
4. Integer precision
4.1 (Automatic promotion.) For the integer storage types—for byte, int, and long—numbers outside the ranges listed in (3.3)
would be stored as missing (.) except that storage types are promoted automatically. As necessary, Stata promotes bytes to
ints, ints to longs, and longs to doubles. Even if a variable is a byte, the effective range is still [-9,007,199,254,740,992,
9,007,199,254,740,992] in the sense that you could change a value of a byte variable to a large value and that value would
be stored correctly; the variable that was a byte would, as if by magic, change its type to int, long, or double if that were
necessary.
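Automatic promotion is easy to watch happen. This is an illustrative sketch on a made-up three-observation dataset; the variable name b is arbitrary.

```stata
clear
set obs 3

* b starts out as a byte holding 1, 2, 3
generate byte b = _n
describe b

* 1000 does not fit in a byte, so replace promotes b as needed
replace b = 1000 in 1
describe b

* 3,000,000,000 is beyond the long range, so b is promoted again
replace b = 3000000000 in 2
describe b

* The large value was stored exactly
assert b[2] == 3000000000
```

Stata reports each change of type in the output of replace, and describe confirms the new storage type.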
4.2 (Data input.) Automatic promotion (4.1) applies after the data are input/read/imported/copied into Stata. When first reading,
importing, copying, or creating data, it is your responsibility to choose appropriate storage types. Be aware that Stata’s
default storage type is float, so if you have large integers, it is usually necessary to specify explicitly the types you wish to
use.
If you are unsure of the type to specify for your integer variables, specify double. After reading the data, you can use
compress to demote storage types. compress never results in a loss of precision.
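A small sketch of the "specify double, then compress" advice, using an invented variable id in place of real input data:

```stata
clear
set obs 5

* When unsure, create the integer variable as a double
generate double id = 100000 + _n
describe id

* compress demotes id to the smallest type that loses no precision
compress
describe id

* The values are unchanged
assert id[5] == 100005
```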
4.3 Note that you can use the floating-point types float and double to store integer data.
4.3.1 Integers outside the range [-2,147,483,647, 2,147,483,620] must be stored as doubles if they are to be precisely
recorded.
4.3.2 Integers can be stored as float, but avoid doing that unless you are certain they will be inside the range [-16,777,216,
16,777,216] not just when you initially read, import, or copy them into Stata, but subsequently as you make
transformations.
4.3.3 If you read your integer data as floats, and assuming they are within the allowed range, we recommend that you
change them to an integer type. You can do that simply by typing compress. We make that recommendation so that
your integer variables will benefit from the automatic promotion described in (4.1).
4.4 Let us show what can go wrong if you do not follow our advice in (4.3). For the floating-point types—for float and double—
integer values outside the ranges listed in (3.3) are rounded.
Consider a float variable, and remember that the integer range for floats is [-16,777,216, 16,777,216]. If you tried to store a
value outside the range in the variable—say, 16,777,221—and if you checked afterward, you would discover that actually
stored was 16,777,220! Here are some other examples of rounding:
When you store large integers in float variables, values will be rounded and no mention will be made of that fact.
And that is why we say that if you have integer data that must be recorded precisely and if the values might be large—
outside the range ±16,777,216—do not use float. Use long or use double; or just use the compress command and let
automatic promotion handle the problem for you.
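The rounding described above can be reproduced in a few lines. This sketch stores the same value, 16,777,221, in a float, a long, and a double; the variable names are arbitrary.

```stata
clear
set obs 1

* Store the same integer in three storage types
generate float  xf = 16777221
generate long   xl = 16777221
generate double xd = 16777221

format xf xl xd %12.0fc
list

* The float copy was silently rounded; the long and double copies were not
assert xf == 16777220
assert xl == 16777221
assert xd == 16777221
```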
4.5 Unlike byte, int, and long, float and double variables are not promoted to preserve integer precision.
Float values are not promoted because, well, they are not. Actually, there is a deep reason, but it has to do with the use of
float variables for their real purpose, which is to store non-integer values.
Double values are not promoted because there is nothing to promote them to. Double is Stata’s most precise storage type.
The largest integer value Stata can store precisely is 9,007,199,254,740,992 and the smallest is -9,007,199,254,740,992.
Integer values outside the range for doubles round in the same way that float values round, except at absolutely larger
values.
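The same effect appears in double precision at the edge of its integer range, 2^53 = 9,007,199,254,740,992. A minimal sketch, assuming calculations are carried out in IEEE double precision as described in (1.5):

```stata
* 2^53 is the last point at which consecutive integers are representable
display %26.0fc 2^53

* Adding 1 cannot be represented, so the result rounds back to 2^53
display %26.0fc 2^53 + 1
assert 2^53 + 1 == 2^53

* Adding 2 lands on the next representable integer
assert 2^53 + 2 != 2^53
```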
5. Floating-point precision
5.1 The smallest, nonzero value that can be stored in float and double is
Storage type   value       value in %21x            value in base 10
-----------------------------------------------------------------
float          ±2^-127     ±1.0000000000000X-07f    ±5.877471754111e-039
double         ±2^-1022    ±1.0000000000000X-3fe    ±2.225073858507e-308
-----------------------------------------------------------------
We include the value shown in the third column, the value in %21x, for those who know how to read it. It is described in
(9), but it is unimportant. We are merely emphasizing that these are the smallest values for properly normalized numbers.
Epsilon is the distance from 1 to the next number on the floating-point number line. The corresponding unit roundoff error
is u = ±epsilon/2. The unit roundoff error is the maximum relative roundoff error that is introduced by the floating-point
number storage scheme.
The smallest value of epsilon such that x+epsilon ≠ x is approximately |x|*epsilon, and the corresponding unit roundoff
error is ±|x|*epsilon/2.
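Epsilon for each type can be inspected directly. The sketch below assumes the built-in functions epsfloat() and epsdouble(); the value 1,000,000 is just an example x.

```stata
* Epsilon: the distance from 1 to the next representable number
display %21x epsfloat()
display %21x epsdouble()

* Unit roundoff is half of epsilon
display epsfloat()/2
display epsdouble()/2

* 1 + epsilon is distinguishable from 1 in double precision
assert 1 + epsdouble() != 1

* Near x, the spacing of doubles is roughly |x|*epsilon
display 1000000 * epsdouble()
```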
5.3 The precision of the floating-point types is, depending on how you want to measure it,
Measurement float double
----------------------------------------------------------------
# of binary digits 23 52
# of base 10 digits (approximate) 7 16
performed using infinite precision arithmetic, x chosen from the subset of reals between the minimum and maximum
values that can be stored. It is worth appreciating that relative precision is a worst-case relative error over all possible
numbers that can be stored. Relative precision is identical to roundoff error, but perhaps this definition is easier to
appreciate.
5.4 Stata never makes calculations in float precision, even if the data are stored as float.
5.5 (False precision.) Double precision is 536,870,912 times more accurate than float precision. You may worry that float
precision is inadequate to accurately record your data.
Little in this world is measured to a relative accuracy of ±2^-24, the accuracy provided by float precision.
Ms. Smith, it is reported, made $112,293 this year. Do you believe that is recorded to an accuracy of ±2^-24*112,293, or
approximately ±0.7 cents?
David was born on 21jan1952, so on 27mar2012 he was 21,981 days old, or 60.18 years old. Recorded in float precision,
the precision is ±60.18*2^-24, or roughly ±1.89 minutes.
Joe reported that he drives 12,234 miles per year. Do you believe that Joe’s report is accurate to ±12,234*2^-24, equivalent
to ±3.85 feet?
A sample of 102,400 people reported that they drove, in total, 1,252,761,600 miles last year. Is that accurate to ±74.7 miles
(float precision)? If it is, each of them is reporting with an accuracy of roughly ±3.85 feet.
The distance from the Earth to the moon is often reported as 384,401 kilometers. Recorded as a float, the precision is
±384,401*2^-24, or ±23 meters, or ±0.023 kilometers. Because the number was not reported as 384,401.000, one would
assume float precision would be accurate enough to record that result. In fact, float precision is more than sufficiently accurate to
record the distance because the distance from the Earth to the moon varies from 356,400 to 406,700 kilometers, some
50,300 kilometers. The distance would have been better reported as 384,401 ±25,150 kilometers. At best, the measurement
384,401 has relative accuracy of ±0.033 (it is accurate to roughly two digits).
Nonetheless, a few things have been measured with more than float accuracy, and they stand out as crowning
accomplishments of mankind. Use double as required.
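The arithmetic behind the examples in (5.5) is just the recorded value times 2^-24, converted to a convenient unit. A short sketch that should reproduce the figures quoted above to within rounding; the unit conversions (100 cents per dollar, 365.25 days per year, 5,280 feet per mile) are the assumptions used here.

```stata
* Ms. Smith's income: roundoff in cents
display 112293 * 2^(-24) * 100

* David's age: roundoff in minutes
display 60.18 * 2^(-24) * 365.25 * 24 * 60

* Joe's annual mileage: roundoff in feet
display 12234 * 2^(-24) * 5280

* Total mileage of the sample: roundoff in miles
display 1252761600 * 2^(-24)

* Earth-moon distance: roundoff in meters
display 384401 * 2^(-24) * 1000
```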
6. Advice concerning 0.1, 0.2, …
6.1 Stata uses base 2, binary. Popular numbers such as 0.1, 0.2, 100.21, and so on, have no exact binary representation in a finite
number of binary digits. There are a few exceptions, such as 0.5 and 0.25, but not many.
6.2 If you create a float variable containing 1.1 and list it, it will list as 1.1 but that is only because Stata’s default display format
is %9.0g. If you changed that format to %16.0g, the result would appear as 1.1000000238419.
This scares some users. If this scares you, go back and read (5.5) False Precision. The relative error is still a modest ±2^-24.
The number 1.1000000238419 is likely a perfectly acceptable approximation to 1.1 because the 1.1 was never measured to
an accuracy of less than ±2^-24 anyway.
6.3 One reason perfectly acceptable approximations to 1.1 such as 1.1000000238419 may bother you is that you cannot select
observations containing 1.1 by typing if x==1.1 if x is a float variable. You cannot because the 1.1 on the right is
interpreted as double precision 1.1. To select the observations, you have to type if x==float(1.1).
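A minimal sketch of the comparison problem, using a made-up float variable x that contains 1.1 in every observation:

```stata
clear
set obs 3
generate float x = 1.1

* The literal 1.1 is double precision, so nothing matches
count if x == 1.1
assert r(N) == 0

* Rounding the literal to float precision makes the comparison work
count if x == float(1.1)
assert r(N) == 3
```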
6.4 If this bothers you, record the data as doubles. It is best to do this at the point when you read the original data or when you
make the original calculation. The number will then appear to be 1.1. It will not really be 1.1, but it will have less relative
error, namely, ±2^-53.
6.5 If you originally read the data and stored them as floats, it is still sometimes possible to recover the double-precision
accuracy just as if you had originally read the data into doubles. You can do this if you know how many decimal digits
were recorded after the decimal point and if the values are within a certain range.
If there was one digit after the decimal point and if the data are in the range [-1,048,576, 1,048,576], which means the
values could be -1,048,576, -1,048,575.9, …, -1, 0, 1, …, 1,048,575.9, 1,048,576, then typing
. gen double y = round(x*10)/10
will recover the full double-precision result. Stored in y will be the number in double precision just as if you had originally
read it that way.
It is not possible, however, to recover the original result if x is outside the range ±1,048,576 because the float variable
contains too little information.
You can do something similar when there are two, three, or more decimal digits:
# digits to right
of decimal pt.      range         command
-----------------------------------------------------------------
1                   ±1,048,576    gen double y = round(x*10)/10
2                   ±  131,072    gen double y = round(x*100)/100
3                   ±   16,384    gen double y = round(x*1000)/1000
4                   ±    1,024    gen double y = round(x*10000)/10000
5                   ±      128    gen double y = round(x*100000)/100000
6                   ±       16    gen double y = round(x*1000000)/1000000
-----------------------------------------------------------------
Range is the range of x over which command will produce correct results. For instance, range = ±16 in the line for six
digits means that the values recorded in x must satisfy -16 ≤ x ≤ 16.
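Here is the one-decimal-digit recovery worked through on a single made-up value. The sketch stores 1.1 as a float, recovers it into a double, and compares both against the double-precision literal.

```stata
clear
set obs 1

* x holds 1.1 rounded to float precision
generate float x = 1.1

* Recover the double-precision value, given one digit after the decimal point
generate double y = round(x*10)/10

* Compare the bit patterns of the two variables
display %21x x[1]
display %21x y[1]

* y now equals double-precision 1.1; x does not
assert y == 1.1
assert x != 1.1
```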
7. Advice concerning exact data, such as currency data
7.1 Yes, there are exact data in this world. Such data are usually counts of something or are currency data, which you can think
of as counts of pennies ($0.01) or the smallest unit in whatever currency you are using.
7.2 Just because the data are exact does not mean you need exact answers. It may still be that calculated answers are adequate if
the data are recorded to a relative accuracy of ±2^-24 (float). For most analyses, even of currency data, this is often
adequate. The U.S. deficit in 2011 was $1.5 trillion. Stored as a float, this amount has a (maximum) error of ±2^-24*1.5e+12
= ±$89,406.97. It would be difficult to imagine that ±$89,406.97 would affect any government decision maker dealing with
the full $1.5 trillion.
7.3 That said, you sometimes do need to make exact calculations. Banks tracking their accounts need exact amounts. It is not
enough to say to account holders that we have your money within a few pennies, dollars, or hundreds of dollars.
In that case, the currency data should be converted to integers (pennies) and stored as integers, and then processed as
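described in (4). Assuming the dollar-and-cent amounts were read into doubles, you can convert them into pennies by
typing
. replace x = round(x*100)
The round() matters because x*100 is itself calculated in binary and can land a hair away from a whole number.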
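A short sketch of the conversion to pennies on one made-up amount, $0.29. It wraps the product in round(), which guards against x*100 landing slightly off a whole number of pennies.

```stata
clear
set obs 1

* A dollar-and-cent amount read into a double
generate double x = 0.29

* The raw product is not exactly 29 pennies
display %21x x[1]*100
assert x*100 != 29

* Rounding yields an exact integer count of pennies
replace x = round(x*100)
assert x == 29

* compress then moves x to an integer storage type
compress
describe x
```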
7.4 If you mistakenly read the currency data as a float, you do not have to re-read the data if the dollar amounts are between
±$131,072. You can type
. gen double y = round(x*100)
to obtain the amounts in pennies, stored as doubles.
8. Advice for programmers
8.1 Stata does all calculations in double (and sometimes quad) precision.
Float precision may be adequate for recording most data, but float precision is inadequate for performing calculations. That
is why Stata does all calculations in double precision. Float precision is also inadequate for storing the results of
intermediate calculations.
There is only one situation in which you need to exercise caution—if you create variables in the data containing
intermediate results. Be sure to create all such variables as doubles.
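To see what is lost when an intermediate result is stored as a float, consider this sketch, which stores 1/3 both ways. The variable names are arbitrary.

```stata
clear
set obs 1

* The same calculation stored in float and in double
generate float  third_f = 1/3
generate double third_d = 1/3

* The float copy has lost the trailing digits of the double-precision result
display %21x third_f[1]
display %21x third_d[1]

* Only the double copy still equals the double-precision calculation
assert third_d == 1/3
assert third_f != 1/3
```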
8.2 The same quad-precision routines StataCorp uses are available to you in Mata; see the manual entries [M-5] mean, [M-5]
sum, [M-5] runningsum, and [M-5] quadcross. Use them as you judge necessary.
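As a rough illustration of when the quad-precision routines help, the sketch below sums a made-up vector in which a small value sits between two large values that cancel. One would expect quadsum() to retain the 1 that sum() may lose, but the exact output is not asserted here.

```stata
mata:
// A small value sandwiched between two large values that cancel
x = (1e16 \ 1 \ -1e16)

// Double-precision accumulation
sum(x)

// Quad-precision accumulation
quadsum(x)
end
```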
9. How to interpret %21x format (if you care)
9.1 Stata has a display format that will display IEEE 754-2008 floating-point numbers in their full binary glory but in a readable
way. You probably do not care; if so, skip this section.
9.2 IEEE 754-2008 floating-point numbers are stored as a pair of numbers (a, b) that are given the interpretation
z = a * 2^b
where -2 < a < 2. In double precision, a is recorded with 52 binary digits. In float precision, a is recorded with 23 binary
digits. For example, the number 2 is recorded in double precision as
a = +1.0000000000000000000000000000000000000000000000000000
b = +1
and the value of pi is recorded as
a = +1.1001001000011111101101010100010001000010110100011000
b = +1
9.3 %21x presents a and b in base 16. The double-precision value of 2 is shown in %21x format as
+1.0000000000000X+001
the value of pi as
+1.921fb54442d18X+001
and the value of 100,000 as
+1.86a0000000000X+010
which is to say
a = +1.86a0000000000 (base 16)
b = +010 (base 16)
We see that a is slightly over 1.5 (base 10), and b is 16 (base 10), so 100,000 is something over 1.5*2^16 = 98,304.
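The three values discussed in (9.3) can be displayed directly. A minimal sketch; _pi is Stata's built-in value of pi.

```stata
* The values 2, pi, and 100,000 in %21x
display %21x 2
display %21x _pi
display %21x 100000

* Decoding the last one by hand: a = 1.86a (base 16), b = 10 (base 16) = 16
display (1 + 8/16 + 6/16^2 + 10/16^3) * 2^16
```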
9.4 %21x faithfully presents how the computer thinks of the number. For instance, we can easily see that the nice number 1.1
(base 10) is, in binary, a number with many digits to the right of the binary point:
+1.199999999999aX+000
We can also see why 1.1 stored as a float is different from 1.1 stored as a double:
+1.19999a0000000X+000 (float)
+1.199999999999aX+000 (double)
Float precision assigns fewer digits to the mantissa than does double precision, and 1.1 (base 10) in base 16 is a repeating
hexadecimal.
9.5 %21x can be used as an input format as well as an output format. For instance, Stata understands
. gen x = 1.86ax+10
9.6 StataCorp has seen too many competent scientific programmers who, needing a perturbance for later use in their program,
code something like
epsilon = 1e-8
That is an ugly number that can only lead to the introduction of roundoff error in their program. A far better number would
be
epsilon = 1.0x-1b
Stata and Mata understand the above statement because %21x may be used as input as well as output. Naturally, 1.0x-1b
looks just like what it is,
+1.0000000000000X-01b
and all those pretty zeros will reduce numerical roundoff error. In base 10, 1.0x-1b is 7.4505805969238e-09,
and that number may not look pretty to you, but you are not a base-2 digital computer.
Perhaps the programmer feels that epsilon really needs to be closer to 1e-8. In %21x, we see that 1e-8 is
+1.5798ee2308c3aX-01b, so if we want to get closer, perhaps we use
epsilon = 1.6x-1b
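The three candidate values of epsilon can be compared side by side. This sketch relies on %21x literals being accepted as input, as described in (9.5).

```stata
* The three candidates in %21x: note how many nonzero digits 1e-8 carries
display %21x 1e-8
display %21x 1.0x-1b
display %21x 1.6x-1b

* The same three values in base 10
display %18.0g 1e-8
display %18.0g 1.0x-1b
display %18.0g 1.6x-1b

* 1.0x-1b is exactly 2^-27 (1b in base 16 is 27 in base 10)
assert 1.0x-1b == 2^(-27)
```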
Source: The Stata Blog » The Penultimate Guide to Precision, https://blog.stata.com/2012/04/02/the-penultimate-guide-to-precision/