September 1999
Features
So, here's a challenge. Go and look up some numbers. A whole variety of naturally occurring numbers will do.
Try the lengths of some of the world's rivers, or the cost of gas bills in Moldova; try the population sizes in
Peruvian provinces, or even the figures in Bill Clinton's tax return. Then, when you have a sample of
numbers, look at their first digits (ignoring any leading zeroes). Count how many numbers begin with 1, how
many begin with 2, how many begin with 3, and so on. What do you find?
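The counting exercise above can be sketched in a few lines of Python. This is only an illustration: the function names are our own, and the sample values are made up to show the mechanics rather than taken from any real data set.

```python
from collections import Counter
from decimal import Decimal

def leading_digit(value):
    """Return the first significant digit (1 to 9) of a non-zero number."""
    # In scientific notation the first character is the first significant
    # digit, so leading zeroes such as those in 0.0042 are skipped.
    text = f"{abs(Decimal(str(value))):e}"
    return int(text[0])

def first_digit_counts(values):
    """Tally how many of the non-zero values begin with each digit."""
    return Counter(leading_digit(v) for v in values if v != 0)

# Made-up sample values, purely for illustration.
sample = [6650, 6400, 6300, 0.0042, 2850, 1233, 19.5, 776]
counts = first_digit_counts(sample)
for digit in range(1, 10):
    print(digit, counts[digit])
```

A sample this small says nothing about the pattern itself; the point is only how the tally is made.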
You might expect that there would be roughly the same number of numbers beginning with each different
digit: that the proportion of numbers beginning with any given digit would be roughly 1/9. However, in very
many cases, you'd be wrong!
Surprisingly, for many kinds of data, the distribution of first digits is highly skewed, with 1 being the most
common digit and 9 the least common. In fact, a precise mathematical relationship seems to hold: the
expected proportion of numbers beginning with the leading digit n is log10((n+1)/n).
This relationship, shown in the graph of Figure 1 and known as Benford's Law, is becoming more and more
useful as we understand it better. But how was it discovered, and why on earth should it be true?
Figure 1: The proportional frequency of each leading digit predicted by Benford's Law.
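The proportions plotted in Figure 1 can be reproduced directly from the formula. The sketch below simply evaluates log10((n+1)/n) for each digit; the function name is our own choice.

```python
import math

def benford_proportion(n):
    """Expected share of numbers whose leading digit is n (1 to 9)."""
    return math.log10((n + 1) / n)

for n in range(1, 10):
    print(f"{n}: {benford_proportion(n):.1%}")
```

The values fall steadily from about 30.1% for the digit 1 to about 4.6% for the digit 9.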
Newcomb's Discovery
The first person to notice this phenomenon was Simon Newcomb, a mathematician and astronomer. One day,
Newcomb was using a book of logarithms for some calculations. He noticed that the pages of the book
became more tatty the closer one was to the front. Why should this be? Apparently, people did more
calculations using numbers that began with lower digits than with higher ones. Newcomb found a formula that
matched his observations pretty well. He claimed that the proportion of numbers that start with the digit D
should be log10((D+1)/D).
Newcomb didn't provide any sort of explanation for his finding. He noted it as a curiosity, and in the face of a
general lack of interest it was quickly forgotten. That was until 1938, when Frank Benford, a physicist at the
General Electric Company, noticed the same pattern. Fascinated by this discovery, Benford set out to see
exactly how well numbers from the real world corresponded to the law. He collected an enormous set of data
including baseball statistics, areas of river catchments, and the addresses of the first 342 people listed in the
book American Men of Science.
Benford observed that even using such a menagerie of data, the numbers were a good approximation to the
law that Newcomb had discovered half a century before. About 30% began with 1, 18% with 2 and so on. His
analysis was evidence for the existence of the law, but Benford, also, was unable to explain quite why this
should be so.
The first step towards explaining this curious relationship was taken in 1961 by Roger Pinkham, a
mathematician from New Jersey. Pinkham's argument was this. Suppose that there really is a law of digit
frequencies. If so, then that law should be universal: whether you measure prices in dollars, dinars or drachmas,
whether you measure lengths in cubits, inches or metres, the proportions of digit frequencies should be the
same. In other words, Pinkham was saying that the distribution of digit frequencies should be "scale
invariant".
Using this reasoning, Pinkham went on to be the first to show that Benford's Law is scale invariant. Then he
showed that if a law of digit frequencies is scale invariant then it has to be Benford's Law (see the proof
below). The evidence was mounting that Benford's Law really does exist.
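Scale invariance can also be checked numerically. The sketch below is our own illustration, not Pinkham's argument: it builds Benford-like data from an evenly spaced grid of exponents (an assumption standing in for real measurements), multiplies every value by a few arbitrary conversion factors, and compares the first-digit shares with the Benford predictions.

```python
import math
from collections import Counter

def first_digit(x):
    """First significant digit of a positive float."""
    # Scientific notation always starts with the first significant digit.
    return int(f"{x:.15e}"[0])

def digit_shares(values):
    """Share of values starting with each digit 1..9."""
    counts = Counter(first_digit(v) for v in values)
    return [counts[d] / len(values) for d in range(1, 10)]

BENFORD = [math.log10((d + 1) / d) for d in range(1, 10)]

# Assumption: an evenly spaced grid of exponents stands in for
# Benford-distributed data between 1 and 10.
N = 100_000
data = [10 ** ((i + 0.5) / N) for i in range(N)]

# Hypothetical conversion factors, e.g. a change of currency or length unit.
FACTORS = [1.0, 2.0, 3.7, 0.3048]

worst_gap = {}
for factor in FACTORS:
    shares = digit_shares([factor * x for x in data])
    worst_gap[factor] = max(abs(s - b) for s, b in zip(shares, BENFORD))
    print(factor, [round(s, 3) for s in shares], f"max gap {worst_gap[factor]:.5f}")
```

Whatever factor is used, the shares stay very close to the Benford values, which is what scale invariance means in practice.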
Figure 2 shows the results above expressed as relative frequencies and plotted against the expected
frequencies predicted by Benford's Law:
Figure 2
As you can see, there is a reasonable (but not perfect) correspondence with the digit frequency predictions
made by Benford's Law. However, as with any sampled statistics, we'd expect a better correspondence with the
predicted values if we used a larger number of samples. In fact, if we calculate the relative frequencies of
leading digits over all the sample data in Table 1, we see that the frequencies approach the Benford predictions
much more closely:
Figure 3
At this point, you might be tempted to revise the way you choose your lottery numbers: out go birthdays and
in comes Benford. Will that make a difference?
Sadly, the answer is no. The outcome of the lottery is truly random, meaning that every possible lottery
number has an equal chance of occurring. The leading-digit frequencies should therefore, in the long run, be
in exact proportion to the number of lottery numbers starting with that digit.
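To see what "in exact proportion" means, the sketch below counts how many balls start with each digit. It assumes, purely for illustration, a lottery whose balls are numbered 1 to 49; the article does not specify a particular lottery.

```python
from collections import Counter

def lottery_digit_shares(highest_ball):
    """Long-run share of each leading digit when balls 1..highest_ball are equally likely."""
    counts = Counter(int(str(ball)[0]) for ball in range(1, highest_ball + 1))
    return {d: counts[d] / highest_ball for d in range(1, 10)}

# Assumption: balls numbered 1 to 49.
shares = lottery_digit_shares(49)
for digit, share in shares.items():
    print(f"{digit}: {share:.1%}")
```

With 49 balls, the digits 1 to 4 each lead 11 of the numbers and the digits 5 to 9 each lead only one, so the long-run frequencies follow the numbering of the balls rather than Benford's Law.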
On the other hand, consider Olympic 400m times in seconds. Not very many of these begin with 1! Similarly,
think about the ages in years of politicians around the world: not many of these will begin with 1 either!
Unlike the lottery, these data are not random: instead, they are highly constrained. The range of possibilities is
too narrow to allow a law of digit frequencies to hold.
In other words, Benford's Law needs data that are neither totally random nor overly constrained, but rather lie
somewhere in between. These data can be wide ranging, and are typically the result of several processes, with
many influences. For example, the populations in towns and cities can range from tens or hundreds to
thousands or millions, and are affected by a huge range of factors.
Dr Mark Nigrini, an accountancy professor from Dallas, has made use of this to great effect. If somebody tries
to falsify, say, their tax return then invariably they will have to invent some data. When trying to do this, the
tendency is for people to use too many numbers starting with digits in the mid range (5, 6 and 7) and not enough
numbers starting with 1. This violation of Benford's Law sets the alarm bells ringing.
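One simple way to turn "alarm bells" into a number is a chi-square goodness-of-fit test against the Benford proportions. This is a generic statistical sketch, not a description of Dr Nigrini's actual procedure, and both sets of digit counts below are invented for illustration.

```python
import math
from collections import Counter

BENFORD = [math.log10((d + 1) / d) for d in range(1, 10)]

# Chi-square critical value at the 5% level with 8 degrees of freedom
# (nine digit classes minus one).
CRITICAL_5_PERCENT_8_DOF = 15.507

def chi_square_vs_benford(leading_digits):
    """Chi-square statistic comparing observed leading digits with Benford's Law."""
    counts = Counter(leading_digits)
    total = len(leading_digits)
    statistic = 0.0
    for digit, proportion in zip(range(1, 10), BENFORD):
        expected = total * proportion
        statistic += (counts[digit] - expected) ** 2 / expected
    return statistic

def looks_suspicious(leading_digits):
    """True when the deviation from Benford's Law is significant at the 5% level."""
    return chi_square_vs_benford(leading_digits) > CRITICAL_5_PERCENT_8_DOF

# Invented example: too many 5s, 6s and 7s and too few 1s.
invented = [1] * 10 + [2] * 10 + [3] * 10 + [4] * 10 + [5] * 25 + [6] * 25 + [7] * 20 + [8] * 5 + [9] * 5
# Invented example: counts close to the Benford proportions for 1000 figures.
plausible = [1] * 301 + [2] * 176 + [3] * 125 + [4] * 97 + [5] * 79 + [6] * 67 + [7] * 58 + [8] * 51 + [9] * 46

print(looks_suspicious(invented), looks_suspicious(plausible))
```

A flagged data set is only a candidate for closer inspection; the test by itself says nothing about why the digits deviate.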
This demonstrates a limitation of the Benford fraud-detection method. Often data can diverge from Benford's
Law for perfectly innocent reasons. Sometimes figures cannot be given precisely, and so rounding off occurs,
which can change the first digit of a number. Also, especially when dealing with prices, the figures 95 and 99
turn up anomalously often because of marketing strategies. In these cases use of Benford's Law could indicate
fraud where no such thing has occurred. In short, the method is not infallible: a deviation is a reason to look
more closely, not proof of fraud.
However, the use of this remarkable rule is not restricted to hunting down fraud. There is already a system in
use that can help to check computer systems for Y2K compliance. Using Benford's Law, it is possible to
detect a significant change in a firm's figures between 1999 and 2000. Too much of a change could indicate
that something is wrong.
Time, money and resources can be saved if computer systems are managed more efficiently. A team in
Freiburg is working on the idea of allocating computer disk space according to Benford's Law.
Scientists in Belgium are working on whether or not Benford's Law can be used to detect irregularities in
clinical trials. Meanwhile, the good correlation of population statistics with Benford's Law means that it can
be used to verify demographic models.
Who knows where else this might prove useful? Dr Nigrini says "I foresee lots of uses for this stuff, but for me
it's just fascinating in itself. For me, Benford is a great hero. His law is not magic but sometimes it seems like
it".
So if there is a distribution law of first significant digits, it should hold no matter what units happen to have
been used. The distribution of first significant digits should not change when every number is multiplied by a
constant factor. In other words, any such law must be scale invariant.
It's fairly easy to work out what will happen by looking at each digit in turn. If the first significant digit is 1,
then multiplying by 2 will yield a new first digit of 2 or 3 with equal probability. But if the first significant
digit is 5 or 6 or 7 or 8 or 9 the new first digit must be 1. It turns out that in the new set of accounts, a first
digit of 1 is ten times as likely as any other first digit.
In the diagram below, the notation [a,b) means the range of numbers greater than or equal to a, but strictly
less than b.
Our intuition has failed us: the original uniform distribution is now heavily skewed towards the digit 1. So if
scale invariance is correct, the uniform distribution is the wrong answer.
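The doubling argument can be worked through exactly. The sketch below assumes, as the argument above does, that each first digit starts with probability 1/9 and that values are spread evenly within each range [d, d+1); the function name is our own.

```python
from fractions import Fraction

def first_digits_after_doubling():
    """First-digit distribution after doubling uniformly distributed first digits."""
    new = {d: Fraction(0) for d in range(1, 10)}
    for d in range(1, 10):
        # [d, d+1) carries weight 1/9. Doubling maps its two halves onto
        # [2d, 2d+1) and [2d+1, 2d+2), each with weight 1/18.
        for lower in (2 * d, 2 * d + 1):
            # Values in [10, 20) have first significant digit 1.
            new_digit = lower if lower < 10 else 1
            new[new_digit] += Fraction(1, 18)
    return new

result = first_digits_after_doubling()
for digit, probability in result.items():
    print(digit, probability)
```

The digit 1 ends up with probability 5/9 and every other digit with 1/18, so the uniform distribution does not survive a simple change of units.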
Since we are interested in the distribution of first significant digits it makes sense to express numbers in
scientific notation $x \times 10^n$, where $1 \le x < 10$. This is possible for all numbers except zero. The first
significant digit $d$ is then simply the first digit of $x$. We can easily derive a scale-invariant distribution
for $d$ once we have found a scale-invariant distribution for $x$.
If a distribution for $x$ is scale invariant, then the distribution of $y = \log_{10} x$ should remain unchanged
when we add a constant value to $y$. Why? Because multiplying $x$ by some constant $a$ gives
$\log_{10}(ax) = \log_{10} a + \log_{10} x = \log_{10} a + y$.
Now, the only probability distribution on y in [0,1) that will remain unchanged after the addition of an
arbitrary constant to y (wrapping around from 1 back to 0, since pushing x past 10 simply moves it into the next
power of ten) is the uniform distribution. To convince yourself of this, think about the shape of the
probability density function for the uniform distribution.
Figure 5
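$$
\begin{aligned}
\Pr(d=1) &= \Pr(1 \le x < 2) \\
         &= \Pr(0 \le y < \log_{10} 2) \\
         &= \int_{0}^{\log_{10} 2} 1 \, dy = \log_{10} 2,
\end{aligned}
$$
and in general, for a first digit $n$,
$$
\begin{aligned}
\Pr(d=n) &= \Pr(n \le x < n+1) \\
         &= \Pr(\log_{10} n \le y < \log_{10}(n+1)) \\
         &= \log_{10}(n+1) - \log_{10} n \\
         &= \log_{10}((n+1)/n).
\end{aligned}
$$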
The expression log10((n+1)/n) was exactly the formula given by Newcomb and later Benford for the
proportion of numbers whose first digit is n. So, we can show that scale invariance for a distribution of first
digit frequencies of x implies that this distribution must be Benford's Law!
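As a quick consistency check (our own addition, not part of the proof), the nine proportions should add up to 1, since every non-zero number has exactly one first significant digit. The sum telescopes:

```latex
\sum_{n=1}^{9} \log_{10}\frac{n+1}{n}
  = \log_{10}\left(\frac{2}{1}\cdot\frac{3}{2}\cdots\frac{10}{9}\right)
  = \log_{10} 10 = 1 .
```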
Jon Walthoe is presently a graduate student at the University of Sussex. As well as his graduate research, he is
involved in the EPSRC-funded Pupil Researcher Initiative.
After finishing his first degree at the University of Exeter, he opted out of Maths for a while. This involved
working in local government among other things, and travelling in Latin America. When not doing his
research, his favourite escape is sailing on the local waters in Brighton.
Plus is part of the family of activities in the Millennium Mathematics Project, which also includes the NRICH
and MOTIVATE sites.