Overview of Principles of Statistics
F. James
CERN, CH-1211 Geneva 23, Switzerland
Abstract
A summary of the basic principles of statistics. Both the Bayesian and Frequentist points of view are presented.
2 Probability
All statistical methods are based on calculations of probability.
In Mathematics, probability is an abstract (undefined) concept which obeys certain rules. We will
need a specific operational definition. There are basically two such definitions we could use:
Frequentist probability is defined as the limiting frequency of a particular outcome in a large num-
ber of identical experiments.
Bayesian probability is defined as the degree of belief in a particular outcome of a single experi-
ment.
Just like the definition of electric charge [1], the definition of frequentist probability is a conceptual
definition which communicates clearly its meaning and can in principle be used to evaluate it, but in
practice one seldom has to resort to such a primitive procedure and go experimentally to a limit (in the
case of the electric field, it is even physically impossible to go to the limit because charge is quantised,
but this only illustrates that the definition is more conceptual than practical).
However, even though one does not usually have to repeat experiments in order to evaluate prob-
abilities, the definition does imply a serious limitation: It can only be applied to phenomena that are in
principle exactly repeatable. This implies also that the phenomena must be random, that is: identical
situations can give rise to different results, something we are accustomed to in Quantum Mechanics.
There is great debate about whether macroscopic phenomena like coin-tossing are random or not; in
principle coin-tossing is classical mechanics and the initial conditions determine the outcome, so it is
not random. But such phenomena are usually treated as random; it is sufficient that the phenomenon
behaves as though it were random: initial conditions which are experimentally indistinguishable yield
results which are unpredictably different.
A Random Variable is data which can take on different values, unpredictable except in probability:
P(data | hypothesis) is assumed known, provided any unknowns in the hypothesis are given some assumed values.
Example: for a Poisson process, N is a random variable taking on non-negative integer values, and P(N | μ) is the probability of observing N events when the expected rate is μ:

P(N | μ) = e^(-μ) μ^N / N!

A Nuisance parameter is an unknown whose value does not interest us, but is unfortunately necessary for the calculation of P(data | hypothesis).
The Likelihood Function L(hypothesis) is P(data | hypothesis) evaluated at the observed data, and considered as a function of the (unknowns in the) hypothesis.
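As a minimal illustration in Python (a sketch, with an arbitrary assumed observation n_obs = 3 and arbitrary trial rates), the Poisson probability and the likelihood function it becomes once the data are fixed can be evaluated as follows:

import math

def poisson_prob(n, mu):
    # P(N = n | mu) = exp(-mu) * mu**n / n!  for a Poisson process
    return math.exp(-mu) * mu**n / math.factorial(n)

# probability of observing n = 3 events when the expected rate is mu = 4.2
print(poisson_prob(3, 4.2))

# the likelihood function: the same expression evaluated at the observed data
# (here n_obs = 3) and regarded as a function of the unknown rate mu
def likelihood(mu, n_obs=3):
    return poisson_prob(n_obs, mu)

print([round(likelihood(mu), 4) for mu in (1.0, 3.0, 5.0)])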
3.1 Bayes’ Theorem
We first need to define conditional probability: P(A|B) means the probability that A is true, given that B is true. For example P(symptom | illness), such as P(headache | influenza), is the probability of the patient having a headache if she has influenza.
Bayes’ Theorem says that the probability of both A and B being true simultaneously can be written:

P(A and B) = P(A|B) P(B) = P(B|A) P(A)

which implies:

P(A|B) = P(B|A) P(A) / P(B)

which can be written:

P(A|B) = P(B|A) P(A) / [ P(B|A) P(A) + P(B|not A) P(not A) ]

This theorem therefore tells us how to invert conditional probability: to obtain P(A|B) when we know P(B|A).
If a diagnostic test for flu gives a positive result (+), Bayes’ Theorem gives the probability that the patient actually has flu:

P(flu | +) = P(+ | flu) P(flu) / [ P(+ | flu) P(flu) + P(+ | not flu) P(not flu) ]
So the answer depends on the Prior Probability of the person having flu, that is:
for Frequentists, the frequency of occurrence of flu in the general population.
for Bayesians, the prior belief that the person has the flu, before we know the outcome of any tests.
If we are in the winter in Durham, perhaps P(flu) is 1%. On the other hand, we may be in another country where it is a very rare disease and P(flu) is very much smaller.
If we apply the same diagnostic test in each of these two places, we would get very different posterior probabilities: with the Durham prior, P(flu | +) comes out sizeable, whereas with the rare-disease prior it remains very small.
So this test would be useful for diagnosing the flu in Durham, but in another place where it was a rare
disease it would always lead to the conclusion that the person probably does not have the flu even if the
test is positive.
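A minimal Python sketch of this calculation, with assumed, purely illustrative test properties (90% efficiency for flu, 5% false-positive rate) and two assumed priors:

def p_flu_given_positive(p_flu, p_pos_given_flu, p_pos_given_not_flu):
    # Bayes' Theorem: P(flu | +) = P(+ | flu) P(flu) / P(+)
    numerator = p_pos_given_flu * p_flu
    denominator = numerator + p_pos_given_not_flu * (1.0 - p_flu)
    return numerator / denominator

# assumed illustrative test: 90% efficiency for flu, 5% false-positive rate
eff, fake = 0.90, 0.05

print(p_flu_given_positive(0.10, eff, fake))    # common disease: large posterior
print(p_flu_given_positive(0.0001, eff, fake))  # very rare disease: small posterior even if the test is positive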
Note that, as long as all the probabilities are meaningful in the context of a given methodology, Bayes’ Theorem can be used by Frequentists as well as by Bayesians. The use of Bayes’ Theorem does not imply that a method is Bayesian; the converse, however, is true: all Bayesian methods make use (at least implicitly) of Bayes’ Theorem.
4 Point Estimation - Frequentist
Common notation: for all estimation (sections 4 – 6), we are estimating a parameter θ using some data, and it is assumed that we know P(data | θ), which can be thought of as the Monte Carlo for the experiment, for any assumed value of θ.
An Estimator is a function of the data which will be used to estimate (measure) the unknown parameter θ. The problem is to find that function which gives estimates of θ closest to the true value assumed for θ. This can be done because we know P(data | true value of θ) and because the estimate is a function of the data. The general procedure would therefore be to take a lot of trial estimator functions, and for each one calculate the expected distribution of estimates about the assumed true value of θ. [All this can be done without any experimental data.] Then the best (most efficient) estimator is the one which gives estimates grouped closest to the true value (having a distribution centred on the true value and as narrow as possible).
Fortunately, we don’t have to do all that work, because it turns out that under very general conditions, it can be shown that the best estimator will be the one which maximizes the Likelihood L(θ). This is the justification for the well-known method of Maximum Likelihood.
Note that the definition of the “narrowest distribution” of estimates requires specifying a norm
for the width; the usual criterion, whereby the width is defined as the variance, leads to the Maximum
Likelihood solution, since this is (asymptotically) the minimum-variance estimator.
An important and well-known property of the Maximum-Likelihood estimate is that it is metric-independent: if θ̂ denotes the Maximum-Likelihood estimate of θ, then the Maximum-Likelihood estimate of any function f(θ) is simply f(θ̂).
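A minimal Python sketch of the method for a single assumed Poisson observation (n_obs = 7, an arbitrary choice); a real analysis would use a proper minimizer such as MINUIT rather than a crude scan:

import math

def neg_log_likelihood(mu, n_obs):
    # -ln L(mu) for a single Poisson observation n_obs
    return mu - n_obs * math.log(mu) + math.log(math.factorial(n_obs))

n_obs = 7

# crude scan over mu (a real analysis would use a minimizer such as MINUIT)
mus = [0.01 * i for i in range(1, 2001)]
mu_hat = min(mus, key=lambda mu: neg_log_likelihood(mu, n_obs))
print(mu_hat)  # close to 7, the analytic maximum-likelihood estimate

# metric independence: maximizing in tau = ln(mu) gives tau_hat = ln(mu_hat)
taus = [0.005 * i for i in range(-500, 600)]
tau_hat = min(taus, key=lambda t: neg_log_likelihood(math.exp(t), n_obs))
print(tau_hat, math.log(mu_hat))  # agree up to the granularity of the scan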
5 Point Estimation - Bayesian
For parameter estimation, we can rewrite Bayes’ Theorem:
Posterior pdf(θ | data) ∝ L(θ) × Prior pdf(θ)
The Bayesian point estimate is usually taken as the value of θ at which the Posterior pdf is maximum.
If the Prior pdf is taken to be the uniform distribution in θ, then the maximum of the Posterior will occur at the maximum of L(θ), which means that in practice the Bayesian point estimate is often the same as the Frequentist point estimate, although following a very different reasoning!
Note that the choice of a uniform Prior is not well justified in Bayesian theory (for example, it seldom corresponds to anyone’s actual prior belief about θ), so the best Bayesian solution is not necessarily the Maximum Likelihood.
Note also that the choice of the maximum of the posterior density has the unfortunate property of being dependent on the metric chosen for θ. In particular, consider the “natural” metric, that function of θ in which the posterior pdf is uniform between zero and one: in this metric the posterior has no unique maximum. This problem is easily solved by choosing the point estimate corresponding to the median (50th percentile) instead of the mode (maximum), but then it will not in general coincide with the Maximum Likelihood.
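A minimal Python sketch for the same assumed single Poisson observation (n_obs = 7) with a uniform prior, comparing the mode and the median of the posterior:

import math

# posterior for a Poisson mean mu with a uniform prior, given n_obs = 7 events:
# posterior(mu) proportional to the likelihood exp(-mu) * mu**n_obs
n_obs, step = 7, 0.01
mus = [step * i for i in range(1, 3001)]
post = [math.exp(-mu) * mu**n_obs for mu in mus]
norm = sum(post) * step
post = [p / norm for p in post]

# mode of the posterior (= maximum-likelihood estimate, since the prior is uniform)
mode = mus[max(range(len(post)), key=lambda i: post[i])]

# median (50th percentile), a metric-independent alternative point estimate
cum, median = 0.0, None
for mu, p in zip(mus, post):
    cum += p * step
    if cum >= 0.5:
        median = mu
        break

print(mode, median)  # mode is 7.0; the median is somewhat larger, so the two do not coincide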
6 Interval Estimation - Bayesian
Here the goal is to find an interval which will contain the true value with a given probability, say 90%.
Since the Posterior Probability distribution is known from Bayes’ Theorem (see above), we have only
to find an interval such that the integral under the Posterior pdf is equal to 0.9 . As this interval is not
unique, the usual convention is to choose the interval with the largest values of the posterior pdf.
There are three arbitrary choices to be made in Bayesian estimation, and the most common choices
are:
1. The uniform prior.
2. The point estimate as the maximum of the posterior pdf.
3. The interval estimate as the interval containing the largest values of the posterior pdf.
Note that all these choices produce metric-dependent results (they give a different answer under
change of variables), but the first two happen to cancel to yield the metric-independent frequentist result.
A metric-independent solution is easily found for the third case, the most obvious possibility being
the central intervals, defined such that there is equal probability above and below the confidence interval.
However, this would have the unfortunate consequence that a Bayesian result could never be given as
an upper limit: Even if no events are observed, the central Bayesian interval would always be two-sided
with an upper and a lower limit.
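A minimal Python sketch for the assumed case of a Poisson mean with a uniform prior and no observed events, comparing the two conventions (largest posterior values versus central interval):

import math

# posterior for a Poisson mean with uniform prior when no events are observed:
# posterior(mu) proportional to exp(-mu), a monotonically falling exponential
step = 0.001
mus = [step * i for i in range(0, 20001)]
post = [math.exp(-mu) for mu in mus]
norm = sum(post) * step
post = [p / norm for p in post]

def quantile(q):
    cum = 0.0
    for mu, p in zip(mus, post):
        cum += p * step
        if cum >= q:
            return mu

# 90% interval containing the largest posterior values: here simply [0, upper limit]
print(0.0, quantile(0.90))             # upper limit near 2.3

# 90% central interval (5% of the probability on each side): always two-sided
print(quantile(0.05), quantile(0.95))  # roughly (0.05, 3.0)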
1. All the above have exact frequentist coverage when the data is continuous. For discrete data there is an additional problem that exact coverage is not always possible, so we have to accept some over-coverage.
2. The Neyman procedure in general, and in particular all of the three examples above are fully
metric-independent in both the data and the parameter spaces.
3. The probability statement that defines the coverage of frequentist intervals appears to be a state-
ment about the probability of the true value falling inside the confidence interval, but it is in fact
the probability of the (random) confidence interval covering the (fixed but unknown) true value.
That means that coverage is not a property of one confidence interval; it is a property of the ensemble of confidence intervals you could have obtained as results of your experiment. This somewhat unintuitive property causes considerable misunderstanding.
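A minimal Python sketch of this ensemble property, for an assumed Gaussian measurement with known unit sigma and a 90% central interval:

import random

# coverage is a property of the ensemble of intervals: repeat the same experiment
# many times and count how often the interval covers the fixed true value
random.seed(1)
mu_true, sigma, z90 = 10.0, 1.0, 1.645   # 1.645 sigma gives a 90% central interval

n_cover, n_trials = 0, 100000
for _ in range(n_trials):
    x = random.gauss(mu_true, sigma)            # one "experiment"
    lo, hi = x - z90 * sigma, x + z90 * sigma   # its confidence interval
    if lo <= mu_true <= hi:
        n_cover += 1

print(n_cover / n_trials)  # close to 0.90; the ensemble covers, not any single interval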
β is the probability of accepting H0 when H1 is true. This is the error of the second kind, or contamination. 1 − β is the power of the test.
When the two hypotheses are simple hypotheses, then it can be shown that the most powerful test
is the Neyman-Pearson Test [8], which consists in taking as the critical region that region with the largest
values of the likelihood ratio λ = L(H1) / L(H0), where L(H) is the likelihood under hypothesis H.
When a hypothesis contains unknown parameters, it is said to be not completely specified and
is called a composite hypothesis. This important case is much more complicated than that of simple
hypotheses, and the theory is less satisfactory, general results holding only asymptotically and under
certain conditions. In practice, Monte Carlo calculations are required in order to calculate α and β
exactly for composite hypotheses.
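For the simple-hypothesis case, a minimal Python sketch with two assumed hypotheses (one Gaussian measurement of unit sigma, mean 0 under H0 and mean 2 under H1), evaluating α and β by Monte Carlo for an assumed cut:

import random

# two simple hypotheses for one Gaussian measurement x with unit sigma:
# H0: mean = 0 and H1: mean = 2; the Neyman-Pearson critical region is where
# L(x|H1)/L(x|H0) is largest, which in this case reduces to a cut x > x_cut
x_cut = 1.0

random.seed(2)
n = 200000
alpha = sum(random.gauss(0.0, 1.0) > x_cut for _ in range(n)) / n  # P(reject H0 | H0 true)
beta  = sum(random.gauss(2.0, 1.0) < x_cut for _ in range(n)) / n  # P(accept H0 | H1 true)
print(alpha, beta, 1.0 - beta)  # significance, contamination, power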
The normalization factor P(data) can be determined in the case of parameter estimation, where all the possible values of the parameter are known, but in hypothesis testing this does not work, since we cannot enumerate all possible hypotheses. However Bayes’ Theorem can still be used to find the ratio of probabilities for two hypotheses, since the normalizations cancel:

P(H1 | data) / P(H2 | data) = [ P(data | H1) P(H1) ] / [ P(data | H2) P(H2) ]
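A minimal Python sketch with assumed, purely illustrative likelihoods and priors:

# ratio of posterior probabilities for two hypotheses: the unknown P(data) cancels
def posterior_ratio(lik_h1, lik_h2, prior_h1, prior_h2):
    return (lik_h1 * prior_h1) / (lik_h2 * prior_h2)

# assumed illustrative numbers: the data are five times more likely under H1,
# but H1 was considered half as probable as H2 before the measurement
print(posterior_ratio(0.05, 0.01, 1.0, 2.0))  # 2.5 in favour of H1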
11 Decision Theory
For decision-making we need to introduce a new concept, the loss incurred in making the wrong decision,
or more generally the losses incurred in taking different decisions as a function of which hypothesis is
true. Sometimes the negative loss (utility) is used.
Simplest possible example: Decide whether to bring an umbrella to work.
In order to make a decision, we need, in addition to the loss function, a decision rule. The most obvious and most common rule is to minimize the expected loss. Let P(rain) be the (Bayesian) probability that it will rain. Then for each possible decision we can write the expected loss as

⟨loss⟩ = loss(decision, rain) P(rain) + loss(decision, no rain) [1 − P(rain)]

and take the decision for which this expectation is smallest.
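A minimal Python sketch with an assumed loss table and an assumed P(rain) = 0.3:

# assumed loss table for the umbrella example: loss[(decision, weather)]
loss = {
    ("umbrella", "rain"): 0.0,    ("umbrella", "dry"): 1.0,   # nuisance of carrying it
    ("no umbrella", "rain"): 5.0, ("no umbrella", "dry"): 0.0,
}

def expected_loss(decision, p_rain):
    return loss[(decision, "rain")] * p_rain + loss[(decision, "dry")] * (1.0 - p_rain)

p_rain = 0.3  # assumed Bayesian probability of rain
for d in ("umbrella", "no umbrella"):
    print(d, expected_loss(d, p_rain))
# minimizing the expected loss here means taking the umbrella (0.7 < 1.5)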
Since the loss function is in general subjective, and in view of the result that no decision rule can
be better than a Bayesian decision rule, it is natural and reasonable to treat the whole decision process
within the domain of Bayesian statistics.
References
[1] Wolfgang Panofsky and Melba Phillips, Classical Electricity and Magnetism, Addison-Wesley
1955, Section 1-2.
[2] Bruno de Finetti, Annales de l’Institut Henri Poincaré 7 (1937) 1-68. English Translation reprinted
in Breakthroughs in Statistics, Vol. 1, Kotz and Johnson eds., Springer 1992.
[3] Anthony O’Hagan, Kendall’s Advanced Theory Of Statistics, Vol. 2B (1994), Chapter 4.
[4] J. Neyman, Phil. Trans. R. Soc. Ser. A 236 (1937) 333, reprinted in A Selection of Early Statistical
Papers of J. Neyman, Univ. of Cal. Press, Berkeley, 1967.
[5] Kendall’s Advanced Theory of Statistics: In the Fifth Edition (1991) the authors are Stuart and
Ord, and this material is at the beginning of chapter 23 in Volume 2. In the Sixth Edition (1999)
the authors are Stuart, Ord and Arnold, and this material appears at the beginning of chapter 22 in
Volume 2A.
[8] J. Neyman and E. S. Pearson, Phil. Trans. R. Soc. Ser. A 231 (1933) 289-337, reprinted in Break-
throughs in Statistics, Vol. 1, Kotz and Johnson eds., Springer 1992.
[9] Karl Pearson, Phil. Mag. Ser. 5, 50 (1900) 157-175, reprinted in Breakthroughs in Statistics, Vol. 2,
Kotz and Johnson eds., Springer 1992.