Probability, Statistics, and Reality: Essay Concerning The Social Dimension of Science
P(X < x) = \frac{1}{\sigma (2\pi)^{1/2}} \int_{-\infty}^{x} \exp\left[ -\frac{1}{2} \left( \frac{u - m}{\sigma} \right)^{2} \right] du \qquad (1)
The right-hand side of this equation should be familiar: it describes a function of x
that depends on two parameters m and σ, which have definite (but at this point
unknown) values. The left-hand side is where the novelty lies, both in notation and
in concept. The expression means, "The probability that the random variable X is
less than a value x."
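As a concrete aside (not part of the original text), the probability on the left-hand side of equation (1) can be evaluated numerically through the error function; the sketch below is a minimal illustration, with made-up values for m and σ.

    import math

    def normal_cdf(x, m, sigma):
        # P(X < x) for the model in equation (1): a normal (Gaussian)
        # distribution with mean m and standard deviation sigma.  The
        # integral in (1) is equivalent to this error-function expression.
        return 0.5 * (1.0 + math.erf((x - m) / (sigma * math.sqrt(2.0))))

    # Placeholder parameter values, chosen only for illustration:
    print(normal_cdf(0.2, m=0.0, sigma=0.18))   # probability that X is below 0.2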
But what do we mean by probability? We will describe its mathematical prop-
erties in the next chapter. What it really means, in the sense of what it corre-
sponds to in the real world, is a deep, and deeply controversial, question. A common
working definition is that it is the fraction of times that we get a particular result
if we repeat something: the frequency of occurrence, as in the canonical example
that the probability of getting heads on tossing a fair coin is 0.5. While there are
good objections to this interpretation, there are good objections to all the other ones
as well. Certainly the frequency interpretation seems like the most straightforward
way to connect equation (1) to the data summarized in Figure 1.2.
So, we have data, and an assumed probability model for them. What comes
next is statistics: using the model to draw conclusions from the data. Probability
theory is mathematics, with its own rules, albeit rules inspired by real-world exam-
ples. Statistics, while it makes use of probability theory, is the activity of applying it
to actual data, and the drawing of actual conclusions. Statistics can perhaps best be
thought of as a branch of applied mathematics, though for historical reasons nobody
uses this term for it.
Given our model, as expressed by equation (1), and given these data, an immediate
statistical question is, what are the best values for m and σ? Assuming that
(1) in fact correctly models these data, the formulas for finding the best m and σ
are
\hat{m} = \frac{1}{N} \sum_{i=1}^{N} d_i \qquad (2)

\hat{\sigma}^{2} = \frac{1}{N - 1} \sum_{i=1}^{N} \left( d_i - \hat{m} \right)^{2}
where there are N data d_1, d_2, ..., d_N. Applying these expressions to the data from
which Figure 1.3 was drawn, we get m̂ = 50326.795 and σ̂ = 0.18. The parameters m
and σ in equation (1) are called the mean and standard deviation; the values m̂
and σ̂ that we get from the data using the formulae in (2) are called estimates of
these. (As is standard in statistics, we use a superposed hat symbol to denote an
estimate of a variable.) Why these estimates might be termed the "best" ones is
something we will get to; this is part of the area of statistics known, unsurprisingly,
as estimation theory. Finding the best values of parameters such as m and σ is
called point estimation. As we describe in the next subsection, this is the answer
to only one kind of statistical question; perhaps the commonest, but not always the
most relevant.
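For readers who like to see formulas as code, here is a minimal sketch (not from the original text) of the point estimates in equation (2), applied to a small set of made-up distance measurements.

    def point_estimates(d):
        # Equation (2): the sample mean and the sample standard deviation
        # (note the division by N - 1 in the variance).
        N = len(d)
        m_hat = sum(d) / N
        var_hat = sum((di - m_hat) ** 2 for di in d) / (N - 1)
        return m_hat, var_hat ** 0.5

    # Made-up measurements (mm), for illustration only:
    data = [50326.7, 50326.9, 50326.6, 50327.0, 50326.8]
    m_hat, sigma_hat = point_estimates(data)
    print(m_hat, sigma_hat)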
The interpretation of these data as measurements of a true value contaminated by errors
leads to a formally similar but philosophically quite different statement: that the
data can be modeled as
d_i = t + e_i \qquad (3)
where t is the true value of the distance, and e_i are the errors; we model the errors
as being a random variable E, such that
P(E < e) = \frac{1}{\sigma (2\pi)^{1/2}} \int_{-\infty}^{e} \exp\left[ -\frac{1}{2} \left( \frac{u}{\sigma} \right)^{2} \right] du \qquad (4)
Mathematically, (3) and (4) are equivalent to (1) if we have t = m, which means that
if we assume (3) and (4) our best estimate of t is m̂. This is of course the bit of
statistics scientists do most often: given a collection of repeated measurements, we
form the average, and say the answer is the true result we are trying to find. There
is nothing really wrong in this, until we start to ask questions such as "How uncer-
tain is t?" Such a question, while tempting, is nonsense: if t is the true value, it
cannot be uncertain; mathematically, it is some number, not (like the e's) a random kind
of thing. We will therefore avoid this approach, and instead will try to pose prob-
lems in terms of equations such as (1), in which all the non-random variables are, as
it were, on the same footing: they are all parameters in a formula describing a
probability.
We can use our estimates m̂ and σ̂ to find how the data would be distributed if
they actually followed the model (1); the right side of Figure 1.3 shows the result in
the same histogram form as the left side. You might wish to argue, looking at these
plots, that the model is in fact not right, because the data show a sharper peak, just
to the left of zero, than the model; and the model does not show the few points at
large distances that are evident in the data. As it turns out, these are quite valid
objections, which lead, among other things, to something called robust estimation:
how to make point estimates that are not affected by small amounts of outlying
data. But the underlying question, of deciding if the model we are using is a valid
one or not, is quite a different issue from estimating model parameters; so we dis-
cuss it in a new section.
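As an aside not in the original text: one standard robust alternative to the estimates in (2) uses the median for location and the median absolute deviation (MAD) for spread, both of which are barely moved by a few outlying data. A minimal sketch:

    import statistics

    def robust_estimates(d):
        # Median as a robust location estimate; the median absolute
        # deviation, scaled by 1.4826, as a robust spread estimate (the
        # scale factor makes it comparable to the standard deviation when
        # the data really are normally distributed).
        m_rob = statistics.median(d)
        mad = statistics.median(abs(di - m_rob) for di in d)
        return m_rob, 1.4826 * mad

    # A single wild value hardly changes the robust estimates:
    print(robust_estimates([50326.7, 50326.9, 50326.6, 50327.0, 50386.8]))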
1.3. Another Example: Hypothesis Testing
Figure 1.4 plots another dataset: the top two traces show the reversals of the
Earth's magnetic field, as reconstructed mostly from marine magnetic anomalies,
over the last 159 Myr. We plot this, as is conventional, as a kind of square wave,
with positive values meaning that the dipole has the same direction as now, and neg-
ative ones that it was reversed. This is an example of a geophysical time series of a
particular kind, namely a point process, so called because such a time series is
defined by points in time at which something happens: a field reversal, an earth-
quake, or a disk crash. The simplest probability model for this is called a Poisson
process: in any given unit of time (we suppose for simplicity that time is divided
into units) there is a probability p that the event happens. By this specication, p is
always the same at any date, and it does not matter if it has been a long time or a
short time since the last event. Such a model, which "doesn't know what time it is,"
does not describe the actual times of the reversals. What it does describe is the
intervals between successive events, which we plot on the lower left of Figure 1.4,
again as a histogram. We use the log of the time because the intervals have to be
positive, and also because they are spread out over a large range, from 2 × 10^4 to
4 × 10^7 years. The longest interval (the Cretaceous Normal Superchron) ran from
124 Myr ago to 83 Myr ago. Looking at the time series, or the distribution, naturally
raises a question: is this interval somehow unusual compared to the others? If it
is, we might wish to argue that the core dynamo (the source of the field) changed its
behavior during this time.
Figure 1.4
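To make the "doesn't know what time it is" property concrete, here is a small simulation sketch (not part of the original text): each unit of time gets an event with the same probability p, independently of the past, and we then collect the intervals between successive events. The value of p and the number of time units are arbitrary choices.

    import random

    def simulate_intervals(p, n_units, seed=0):
        # A Poisson (memoryless) point process in discrete form: in each
        # unit of time an event occurs with probability p, regardless of
        # how long it has been since the last event.
        rng = random.Random(seed)
        event_times = [t for t in range(n_units) if rng.random() < p]
        # Intervals between successive events:
        return [b - a for a, b in zip(event_times, event_times[1:])]

    intervals = simulate_intervals(p=0.1, n_units=4000)
    print(len(intervals), min(intervals), max(intervals))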
In statistics, questions of this type are called hypothesis testing, because we
can only say that a behavior is unusual in comparison to some particular hypothesis.
In statistics, the meaning of hypothesis is more limited than in general usage: it
refers to a particular probability model. Because we are dealing with probability, we
can never say that something absolutely could not have occurred, only that it is very
improbable. The mode of reasoning used for such tests is somewhat difficult to get
used to because it seems backwards from the way we usually reason: rather than
draw a positive conclusion, we look for a negative one, by first creating a model that
is the opposite of what we want to show. This is called the null hypothesis; we
then show that the data in fact make it very unlikely that this hypothesis is true.
For this example, a null hypothesis might be the probability model that there is
an equal chance of a reversal in any 40,000-year interval; we choose this time
because shorter reversals do not seem to be common, either because they do not hap-
pen or because they are not well recorded in marine magnetic anomalies. We call
this chance (or probability) p, and we assume it is the same over all times: these
assumptions combine to form our hypothesis, or (again) what we could call a stochas-
tic model. If we estimate this p using the distribution of intervals, we nd that it is
(roughly) 0.1. This gives the histogram of inter-reversal times shown at the lower
right of Figure 1.4. This is actually not a very good fit to the data, but we can take it
as a first approximation. Now, 40 Myr has 1000 intervals of 40 kyr; the probability
of there not being a reversal over this many intervals consecutively is (1 - p)^1000, or
2 × 10^-46. So, over the 160 Myr of data we have (four 40-Myr intervals), we would
expect to get such a long reversal-less span about one time out of 10^45, which we
may regard as so unlikely that we can reject the idea that p does not change over
time. Of course, we only have the one example, so it is (again) arguable whether or
not saying "one time out of N" is the right way to phrase things; but given how small
the probability is in this case, we may feel justified in any case in rejecting the
hypothesis we started out with. But we should always remember that our decision
to reject depends on a judgment about how small a probability we are willing to tol-
erate, and this judgment is, in the end, arbitrary.³
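The arithmetic behind this argument is easy to check directly; here is a short sketch (not part of the original text), using the rough value p = 0.1 quoted above.

    p = 0.1        # rough chance of a reversal in any one 40-kyr interval
    n = 1000       # number of 40-kyr intervals in 40 Myr

    # Probability of no reversal in 1000 consecutive intervals:
    prob_no_reversal = (1.0 - p) ** n
    print(prob_no_reversal)        # roughly 2e-46

    # Expected number of such reversal-less spans in 160 Myr of data
    # (four non-overlapping 40-Myr windows, as in the text):
    print(4 * prob_no_reversal)    # roughly one chance in 1e45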
1.4. Distance Measurement (II): Error Bounds
Our discussion in Section 1.2 of the stochastic model for the distance measure-
ment was about the best estimate of the parameter m. But this is not the only
question we could ask relating to m; we might reasonably also ask how well we
think we know it; that is, in the conventional phrasing, how large the error of m is.
This question actually is itself a hypothesis test, or rather a whole series of such
tests, for each of which the hypothesis is "m is really equal to the value ____"; given
our model, is this compatible (that is, likely) given the data observed? If the
assumed value were 50320 mm, or 50330, the answer would be "not very likely"; if the
assumed value were 50326.8, the answer would be "quite possible." We can in fact
work out what this series of tests would give us for any assumed value of m; then we
choose a probability value corresponding to "not very likely," and say that any value
of m that gives a higher value from the hypothesis test is acceptable. This gives us,
not just a value for m, but what is usually more valuable, a range for it, this range
being between the confidence limits.
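A minimal sketch (not from the original text) of this "series of hypothesis tests" view of a confidence range: keep every hypothesized value of m for which the observed mean would not be too improbable. The threshold used here, 1.96 standard errors for roughly 95% confidence under a normal model, is a conventional choice rather than anything specified in the text, and the data are made up.

    import math

    def confidence_limits(d, z=1.96):
        # Accept any hypothesized m lying within z standard errors of the
        # sample mean; under the normal model this is (approximately) the
        # usual 95% confidence interval when z = 1.96.
        N = len(d)
        m_hat = sum(d) / N
        sigma_hat = math.sqrt(sum((di - m_hat) ** 2 for di in d) / (N - 1))
        half_width = z * sigma_hat / math.sqrt(N)
        return m_hat - half_width, m_hat + half_width

    # Made-up measurements (mm), for illustration only:
    print(confidence_limits([50326.7, 50326.9, 50326.6, 50327.0, 50326.8]))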
1.5. Predicting Earthquakes: A Model Misapplied
We close with an example of misapplied statistics leading to a false conclusion,
and an expensively false one at that. Our example is, fittingly, one of earthquake
prediction, a field that has more examples of inept statistical reasoning (as well as
blissful unawareness of the need for such reasoning) than any other branch of
geophysics.

³ A more thorough analysis (C. Constable, "On rates of occurrence of geomagnetic reversals,"
Phys. Earth Planet. Inter., 118, 181-193 (2000)) shows that the Poisson model can be used, pro-
vided we make the probability time-dependent. A model consistent with the data has this proba-
bility diminishing to a very low value during the Cretaceous Superchron, and then increasing to
the present.
The particular case was that of the repeating earthquakes at Parkfield, a very
small settlement on the San Andreas fault in Central California. Earthquakes hap-
pened there in 1901, 1922, 1934 and 1966; seismometer records showed the last
three shocks to have been very similar. Nineteenth-century reports of felt shaking
suggested earthquakes at Parkfield in 1857 and 1881. This sequence of dates could
be taken to imply a somewhat regular repetition of events.
You might think, from what we discussed in the previous section, that the rele-
vant data would be the times between earthquakesand you would be right. How-
ever, the actual analysis took a different approach, shown in the left panel of Figure
1.5: the event numbers were plotted against the date of the earthquake, and a
straight line fit to these points. The figure shows two fits, one including the 1934
event and the other omitting it as anomalous. If the 1934 event is included, the
straight line reaches event number 7 in 1983 (the left-hand vertical line); this was
known not to have happened when the analysis was done in 1984. Excluding the
1934 event yielded a predicted time for event 7 of 1988.1 ± 4.5 years.
Partly because of this prediction, which seemed to promise a payoff in the near
future, a massive monitoring effort was set up around Parkfield, which was contin-
ued long after the end of the prediction. The earthquake eventually happened in
September 2004, 19 years late.
Figure 1.5
What went wrong? The biggest mistake, not uncommon, was to adopt a set of
standard methods without checking to see if the assumptions behind them were
appropriate: an approach often, and justly, derided as the "cookbook" method. In this
case, the line fitting, and the range for the predicted date, assumed that both the x-
and y-coordinates of the plotted points were random variables with a probability dis-
tribution somewhat similar to equation (1). But the event numbers, 1 through 6, are
as nonrandom as any sequence of numbers can be; the dates are not random either,
for they have to increase with event number. A more careful analysis of the series
shows that while the time predicted for the next earthquake in 1984 would still be
close to 1990, the range of not-improbable times would be much larger: the earth-
quake was nowhere near as imminent as the incorrect analysis suggested. No doubt
some level of monitoring would have been undertaken anyway, but a more thoughtful
approach might have been taken if the statistical analysis had been done properly.⁴
A simple way of seeing what a better analysis would give is suggested by the
right-hand plot in Figure 1.5, which shows the inter-event times ordered by size, and
labeled by the date when each ended. The longest interval has an arrow extending
from 1984 (when the prediction was made) to the actual time of the earthquake. It
seems clear that the range of prior inter-event times is such that the most reason-
able prediction for the next earthquake in 1984 would have been "probably soon, but
a 10-20 year wait might well be expected."
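The inter-event times behind this statement can be listed directly from the dates given earlier (1857, 1881, 1901, 1922, 1934, 1966); the sketch below is simple arithmetic on those dates, not a reproduction of the published reanalysis.

    dates = [1857, 1881, 1901, 1922, 1934, 1966]   # Parkfield earthquake years from the text

    # Intervals between successive earthquakes:
    intervals = [b - a for a, b in zip(dates, dates[1:])]
    print(intervals)                                # [24, 20, 21, 12, 32]

    # By 1984, 18 years had already passed since the 1966 event, while the
    # earlier intervals ranged from 12 to 32 years:
    print(1984 - dates[-1], min(intervals), max(intervals))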
The lesson is that you should learn not just a set of techniques to use, but also
when not to use them. In this case, one approach, and one kind of plot, were seri-
ously misleading; a different presentation of the data would immediately have led to
different conclusions.
⁴ The original prediction was by W. Bakun and T. V. McEvilly, "Recurrence models and Park-
field, California, earthquakes," J. Geophys. Res., 89, 3051-3058 (1984). The correct analysis is Y.
Y. Kagan, "Statistical aspects of Parkfield earthquake sequence and Parkfield prediction experi-
ment," Tectonophys., 270, 207-219 (1997). This particular error is not unique to the Parkfield
analysis; see Stephen M. Stigler, "Terrestrial mass extinctions and galactic plane crossings,"
Nature, 313, 159 (1985).