Statistics
Lent Term 2015
Prof. Mark Thomson
Lecture 4 : The Dark Arts
Prof. M.A. Thomson Lent 2015 107
Course Synopsis
Lecture 1: The basics
Introduction, Probability distribution functions, Binomial
distributions, Poisson distribution
Lecture 2: Treatment of Gaussian Errors
The central limit theorem, Gaussian errors, Error
propagation, Combination of measurements, Multi-
dimensional Gaussian errors, Error Matrix
Lecture 3: Fitting and Hypothesis Testing
The χ² test, Likelihood functions, Fitting, Binned maximum
likelihood, Unbinned maximum likelihood
Lecture 4: The Dark Arts
Bayesian Inference, Credible Intervals
The Frequentist approach, Confidence Intervals
Systematic Uncertainties
Parameter Estimation Revisited
! Let's consider more carefully the maximum likelihood method
for simplicity consider a single parameter x
! Construct the likelihood that our data are consistent with the model, i.e.
the probability that the model would give the observed data, $P(\mathrm{data}|x)$
! We have then (very reasonably) taken the value of x which maximises
the likelihood as our best estimate of the parameter
! With less justification we then took our error estimate from the shape of the likelihood function around its maximum
! Does this really make sense?
! What we really want to calculate is the posterior PDF for the parameter
given the data, i.e. $P(x|\mathrm{data})$
In taking the error from the likelihood we assumed $P(x|\mathrm{data}) \propto P(\mathrm{data}|x)$
Cannot justify this: in general it is not the case
Conditional Probabilities and Bayes Theory
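! A nice example of conditional probability (from L. Lyons)
" In the general population, the probability of a randomly selected woman
being pregnant is 2%: $P(\mathrm{pregnant}|\mathrm{woman}) = 2\%$
" But $P(\mathrm{woman}|\mathrm{pregnant}) \gg 2\%$
! Correct treatment of conditional probabilities requires Bayes theorem
" Probability of A and B can be expressed in terms of conditional probabilities
$P(A \,\mathrm{and}\, B) = P(A|B)\,P(B) = P(B|A)\,P(A)$, so that $P(A|B) = P(B|A)\,P(A)/P(B)$
! Here the prior probability of selecting a woman is $P(\mathrm{woman}) = 0.5$,
i.e. half the population are women,
and the prior probability of selecting a pregnant person is $P(\mathrm{pregnant}) = 0.01$,
i.e. 1 % of the population are pregnant, giving
$P(\mathrm{pregnant}|\mathrm{woman}) = P(\mathrm{woman}|\mathrm{pregnant})\,P(\mathrm{pregnant})/P(\mathrm{woman}) = 1 \times 0.01/0.5 = 2\%$
Sanity restored…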
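The arithmetic of this example can be checked in a few lines of Python. This is only a sketch: the function and variable names are illustrative, and P(woman | pregnant) = 1 is an assumption chosen to be consistent with the 2% and 1% figures quoted in the text.

```python
# Bayes theorem: P(A|B) = P(B|A) * P(A) / P(B)
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) given P(B|A), P(A) and P(B)."""
    return p_b_given_a * p_a / p_b

p_woman = 0.5                  # prior: half the population are women
p_pregnant = 0.01              # prior: 1 % of the population are pregnant
p_woman_given_pregnant = 1.0   # ASSUMPTION: every pregnant person is a woman

# A = pregnant, B = woman
p_pregnant_given_woman = bayes(p_woman_given_pregnant, p_pregnant, p_woman)
print(f"P(pregnant | woman) = {p_pregnant_given_woman:.2%}")
```

The two conditional probabilities differ by a factor of 50, which is exactly the ratio of the two priors.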
! Apply Bayes theory to the measurement of a parameter x
" We determine $P(\mathrm{data}|x)$, i.e. the likelihood function
" We want $P(x|\mathrm{data})$, i.e. the PDF for x in the light of the data
" Bayes theory gives:
$$P(x|\mathrm{data}) = \frac{P(\mathrm{data}|x)\,P(x)}{P(\mathrm{data})}$$
$P(\mathrm{data}|x)$: the likelihood function, i.e. what we measure
$P(x|\mathrm{data})$: the posterior PDF for x, i.e. in the light of the data
$P(\mathrm{data})$: prior probability of the data. Since this doesn't depend on
x it is essentially a normalisation constant
$P(x)$: prior probability of x, i.e. encompassing our knowledge of
x before the measurement
! Bayes theory tells us how to modify our knowledge of x in the light of new data
Bayes theory is the formal basis of Statistical Inference
Applying Bayes Theorem
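! Bayes theory provides an unambiguous prescription for going from $P(\mathrm{data}|x)$ to $P(x|\mathrm{data})$
! But you need to provide the PRIOR PROBABILITY
! This is fine if you have an objective prior, e.g. a previous measurement
" If we now make a new measurement, i.e. determine the likelihood function $P(\mathrm{data}|x)$
" Bayes theory then gives, for a Gaussian prior and a Gaussian likelihood, a Gaussian posterior
$P(x|\mathrm{data}) \propto \exp[-(x-\bar{x})^2/2\sigma^2]$
where $\bar{x}$ and $\sigma^2$ are the usual
mean and variance for combining
two measurements
" For this to be a (normalised) PDF can infer $P(\mathrm{data})$ (although it isn't of any interest)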
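For a Gaussian prior from a previous measurement and a Gaussian likelihood from a new one, the statement that Bayes theory reproduces the usual combination of two measurements can be checked numerically. The sketch below multiplies prior and likelihood on a grid and compares the result with the weighted-mean formula; the measurement values (10 ± 2 and 12 ± 1) and all function names are hypothetical.

```python
import math

def gaussian(x, mean, sigma):
    """Normalised Gaussian PDF."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def combine(x1, s1, x2, s2):
    """Usual weighted mean and error for combining two Gaussian measurements."""
    w1, w2 = 1 / s1**2, 1 / s2**2
    return (w1 * x1 + w2 * x2) / (w1 + w2), math.sqrt(1 / (w1 + w2))

def posterior_moments(x1, s1, x2, s2, n=20001):
    """Mean and rms of prior(x) * likelihood(x), evaluated on a uniform grid."""
    lo = min(x1 - 8 * s1, x2 - 8 * s2)
    hi = max(x1 + 8 * s1, x2 + 8 * s2)
    step = (hi - lo) / (n - 1)
    xs = [lo + i * step for i in range(n)]
    post = [gaussian(x, x1, s1) * gaussian(x, x2, s2) for x in xs]
    norm = sum(post)  # plays the role of P(data), up to the grid spacing
    mean = sum(x * p for x, p in zip(xs, post)) / norm
    var = sum((x - mean) ** 2 * p for x, p in zip(xs, post)) / norm
    return mean, math.sqrt(var)

# Hypothetical inputs: prior 10 +- 2 (previous measurement), new measurement 12 +- 1
mean_formula, sigma_formula = combine(10.0, 2.0, 12.0, 1.0)
mean_grid, sigma_grid = posterior_moments(10.0, 2.0, 12.0, 1.0)
print(mean_formula, sigma_formula)
print(mean_grid, sigma_grid)
```

The normalisation sum is the only place where P(data) enters, which is why its value is of no interest for the result.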
The Problem with Applying Bayes Theorem
! The problem arises when there is no objective prior
! For example, in a hypothetical background-free search for a Z′, observe
no events
" No problem in calculating the likelihood function (a conditional probability)
$P(0|x) = e^{-x}$: Poisson prob. for observing 0, where
x is the true number of expected events
" What is the best estimate of x and the 90 % confidence level upper limit?
" Depends on the choice of prior probability: $P(x|0) \propto e^{-x}\,P(x)$
" What to do about the prior ?
" i.e. how do we express our knowledge (none) of x prior to the measurement
! In general there is no objective answer; one is always putting in some extra information
" i.e. a subjective bias
" could argue that a flat prior, i.e. P(x) = constant, is objective
" but why not choose a prior that is flat in ln x ?
" for some limits/measurements (e.g. a mass) a flat prior in ln x is more natural
" the arbitrariness in the choice of prior is a problem for the Bayesian approach
" it can make a big difference…
Choice of Prior, example I
! See no events…
$P(0|x) = e^{-x}$: Poisson prob. for observing 0
[Figure: two cases compared, a prior flat in x and a prior flat in ln x]
! The conclusions are very different. Compare regions containing 90 % of probability
" In this case, the choice of prior is important
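The sensitivity to the prior in the zero-event example can be illustrated numerically. With a flat prior the posterior is proportional to e^{-x} and the 90 % upper limit is analytic. With a prior flat in ln x (i.e. P(x) ∝ 1/x) the posterior is not normalisable at x → 0, so the sketch below introduces a lower cut-off `x_min`; that cut-off, its value, and the function names are assumptions made purely for illustration, and the resulting limit depends on them.

```python
import math

def upper_limit_flat_prior(cl=0.9):
    """Flat prior in x: posterior is e^{-x} for x >= 0, so the limit is -ln(1 - cl)."""
    return -math.log(1.0 - cl)

def upper_limit_log_flat_prior(cl=0.9, x_min=1e-3, x_max=50.0, n=20000):
    """Prior flat in ln x, i.e. P(x) ~ 1/x, truncated at x_min (ASSUMPTION).

    In the variable u = ln x the posterior density is exp(-exp(u)),
    which is integrated with the trapezoidal rule.
    """
    u_lo, u_hi = math.log(x_min), math.log(x_max)
    du = (u_hi - u_lo) / n
    us = [u_lo + i * du for i in range(n + 1)]
    f = [math.exp(-math.exp(u)) for u in us]
    cum = [0.0]
    for i in range(n):
        cum.append(cum[-1] + 0.5 * (f[i] + f[i + 1]) * du)
    target = cl * cum[-1]
    for u, c in zip(us, cum):
        if c >= target:
            return math.exp(u)
    return x_max

limit_flat = upper_limit_flat_prior()
limit_log_flat = upper_limit_log_flat_prior()
print(f"flat prior in x:    x < {limit_flat:.2f}")
print(f"flat prior in ln x: x < {limit_log_flat:.2f} (depends on the assumed x_min)")
```

The flat-prior limit is ln 10 ≈ 2.30; the log-flat limit comes out considerably smaller and moves with the choice of cut-off, which is the arbitrariness the slide is pointing at.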
Choice of Prior, example II
! Suppose we measure the W-boson mass:
! We want $P(m|\mathrm{data})$
" Again consider two priors
" Here the choice of prior is NOT important
" The data are strong enough to overcome our prior assumptions (subjective bias)
" Here, can interpret the measurement as a Gaussian PDF for m
Choice of Prior, example III
# An example (apparently due to Newton), e.g. see CERN Yellow Report 2000-005
! Suppose you are in the Tower of London facing execution.
! The Queen arrives carrying a small bag and says
“This bag contains 5 balls; the balls are either white or black. If you correctly
guess the number of black balls, I will spare your life and set you free.”
! The Queen is in a good mood and continues
“To give you a better chance, you can take one of the balls from the bag.”
It's BLACK
! The Queen points her pistol at you
“Time to choose, sucker…”
! What do you guess to maximise your chance of survival ?
! Use statistical inference to analyse the problem.
" Let n be the number of black balls in the bag.
" The data are that you picked out a black ball
" Can calculate $P(\mathrm{black}|n) = n/5$,
e.g. if there were two black balls the chance of picking out a black ball from the
five in the bag was 2/5.
[Figure: $P(\mathrm{black}|n)$ versus n, for n = 0 to 5]
! But we want $P(n|\mathrm{black})$
! Answer depends on choice of Prior
! Could assume flat Prior
[Figure: flat prior $P(n)$ (left) and resulting $P(n|\mathrm{black})$ (right), versus n] GUESS: 5
! Could assume balls drawn randomly from a large bag containing equal nos. B & W
[Figure: prior $P(n)$ for random draws (left) and resulting $P(n|\mathrm{black})$ (right), versus n] GUESS: 3
! Oh dear… answer depends on Prior (unknown) assumptions
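The two guesses can be reproduced with exact fractions. The sketch below takes the likelihood P(black | n) = n/5 and applies Bayes theorem with the two priors discussed: a flat prior over n = 0..5, and a binomial prior corresponding to balls drawn at random from a large bag with equal numbers of black and white. Function and variable names are illustrative.

```python
from fractions import Fraction
from math import comb

N_BALLS = 5

def posterior(prior):
    """Return P(n | black) for n = 0..5, given the prior P(n) as a list."""
    likelihood = [Fraction(n, N_BALLS) for n in range(N_BALLS + 1)]  # P(black | n) = n/5
    unnorm = [l * p for l, p in zip(likelihood, prior)]
    norm = sum(unnorm)  # P(black), the normalisation constant
    return [u / norm for u in unnorm]

# Prior 1: every n from 0 to 5 equally likely
flat_prior = [Fraction(1, N_BALLS + 1)] * (N_BALLS + 1)
# Prior 2: each ball independently black with probability 1/2
binomial_prior = [Fraction(comb(N_BALLS, n), 2**N_BALLS) for n in range(N_BALLS + 1)]

post_flat = posterior(flat_prior)
post_binom = posterior(binomial_prior)

best_flat = max(range(N_BALLS + 1), key=lambda n: post_flat[n])
best_binom = max(range(N_BALLS + 1), key=lambda n: post_binom[n])
print("flat prior:     guess", best_flat, "with probability", post_flat[best_flat])
print("binomial prior: guess", best_binom, "with probability", post_binom[best_binom])
```

The same data give a best guess of 5 (probability 1/3) under the flat prior and 3 (probability 3/8) under the binomial prior.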
! So what do we learn from this ?
(apart from something about the role of the Monarchy in a modern democracy)
" Whilst we know how to apply Bayesian statistical inference, we have
insufficient data, i.e. we don't know the prior
" Unless the data are “strong”, i.e. override the information in the reasonable
range of prior probabilities, we cannot expect to know the posterior PDF
" Applies equally to our experiment where we saw zero events and wanted to
arrive at a PDF for the expected mean number of events…
Don't have enough information to answer this question
Bayesian Credible Intervals
! Ideally, (I) would like to work with probabilities, i.e. a PDF which encompasses all
our knowledge of a particular parameter, e.g.
! Could then integrate PDF to contain 95 % of probability. Can then define the
95 % Credible Interval*: $m_H$ < 186 GeV
! To do this need to go from the likelihood, i.e. from $P(\mathrm{data}|m_H)$, to $P(m_H|\mathrm{data})$
" requires subjective choice of prior probability
! Hence Bayesian Credible Intervals necessarily include some additional input
beyond the data alone…
*This is not what is done.
Bayesian Credible Intervals - example
! Trying to estimate a selection efficiency using MC events. All N events pass cuts.
" what statement can we make about the efficiency?
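! Binomial distribution… if all N events pass, $P(N|\varepsilon) = \varepsilon^N$ for efficiency $\varepsilon$
! Apply Bayes theorem:
$P(\varepsilon|N) = P(N|\varepsilon)\,P(\varepsilon)/P(N)$, where $P(\varepsilon)$ is the Prior and $P(N)$ is a Constant
! Choose prior, e.g. flat, $P(\varepsilon) = 1$ for $0 \le \varepsilon \le 1$
! Normalise: $P(\varepsilon|N) = (N+1)\,\varepsilon^N$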
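! Integrate to find region containing 90% of probability: for the normalised flat-prior posterior $P(\varepsilon|N) = (N+1)\,\varepsilon^N$,
$\int_{\varepsilon_0}^{1} (N+1)\,\varepsilon^N \,d\varepsilon = 1 - \varepsilon_0^{N+1} = 0.9$
90 % Credible Interval: $\varepsilon > 0.1^{1/(N+1)}$
(with a flat prior probability)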
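As a quick numerical illustration of this interval, the sketch below evaluates the lower edge of the credible interval for the case where all N Monte Carlo events pass the cuts, assuming a flat prior in the efficiency so that the normalised posterior is (N+1)ε^N. The function name and the example values of N are hypothetical.

```python
def efficiency_lower_limit(n_pass, cl=0.9):
    """Lower edge eps0 of the credible interval when all n_pass events pass.

    Flat prior (ASSUMPTION): posterior is (n+1) * eps**n on [0, 1], so
    the probability above eps0 is 1 - eps0**(n+1); setting this to cl gives eps0.
    """
    return (1.0 - cl) ** (1.0 / (n_pass + 1))

for n in (10, 100, 1000):
    print(f"N = {n:4d}: efficiency > {efficiency_lower_limit(n):.4f} at 90 % credibility")
```

The limit approaches 1 as N grows, as expected when more events are seen to pass with no failures.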
Likelihood Ordering
! Note, 90 % credible interval is not uniquely defined
" more than one interval contains 90 % probability, e.g.
90 % Credible Interval:
! Natural to choose the interval such that all points in the excluded region are
lower in likelihood than those in the credible interval: likelihood ordering
! Credible intervals provide an intuitive way of interpreting data, but:
" Rarely used in Particle Physics as a way of presenting data
" Because they represent the data and prior combined
" NOTE: all information from the experiment is in the likelihood
C.I. vs C.L.
! From data obtain $P(\mathrm{data}|x)$
! Bayes theorem provides the mathematical framework for statistical inference
! To go from $P(\mathrm{data}|x)$ to $P(x|\mathrm{data})$ requires a (usually) subjective choice
of Prior probability
! For weak data, the choice of Prior can drive the interpretation of the data
! Credible intervals are a useful way of interpreting data, but are generally not
used in Particle Physics as a way of presenting the conclusions of an experiment.
! Particle Physics tends to use Frequentist Confidence limits, which are not probability statements about the parameter
[and do not form a mathematically consistent basis for
statistical inference]
! Finally, never forget that credible intervals (or confidence limits) are an
interpretation of the data
The experimental result is the likelihood function
A Few words on Systematic Uncertainties
! Systematic Uncertainties are often associated with an internal unknown bias, e.g.
" How well do you know your calibration
" How well does MC model the data, e.g. jet fragmentation parameters
! Parametric Uncertainties associated with uncertain parameters
" How does the uncertainty on the Higgs mass impact the interpretation of
a measurement
! No over-riding principle – just some general guidelines
" Once a result is published, systematic errors will be treated as if they are
Gaussian
x = a ± b (stat.) ± c (syst.)
" Some systematic errors are Gaussian: e.g. energy scale determined
from data, e.g. Z → e⁺e⁻ to determine electron energy scale
" Others are not: e.g. impact of different jet hadronisation models, where one
might compare PYTHIA with HERWIG; here one obtains a single estimate
of the scale of the uncertainties
" Theoretical uncertainties: e.g. missing higher-order (HO) corrections. Again these are
estimates and should not be treated as Gaussian (although they are)
! Systematic dominated measurements
" Beware – if there is a single dominating systematic error and it is inherently
non-Gaussian, this is a problem
Estimating Systematic Uncertainties
! No rules – just guidelines
" Remember syst. errors will be treated as Gaussian, so try to evaluate them on
this basis, e.g. suppose use 3 alternative MC jet fragmentation models and
result changes by +Δ1, +Δ2 and –Δ3 (where Δ2 is the largest):
i) take largest shift as systematic error estimate: Δ2 ?
ii) assume error distributed uniformly in “box” of width 2Δ2 giving an rms
of 2Δ2 /√12 ?
" Cut variation is evil (i.e. vary cuts and see how results change)
• at best, introduces statistical noise
• at worst, hides away lack of understanding of some data-MC discrepancy; better to
understand the origin of the discrepancy
" Wherever possible use data driven estimates, energy scales, control samples,
etc.
" Remember that you are estimating the scale of a possible systematic bias
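The two options for turning a set of model-to-model shifts into a single systematic error can be compared directly. The sketch below uses three hypothetical shifts standing in for +Δ1, +Δ2 and –Δ3; option (ii) uses the fact that a uniform distribution of full width w has an rms of w/√12, so a box of width 2Δ2 gives Δ2/√3. The numbers and function names are illustrative only.

```python
import math

def largest_shift(shifts):
    """Option (i): take the largest absolute shift as the systematic error."""
    return max(abs(s) for s in shifts)

def box_rms(shifts):
    """Option (ii): uniform 'box' of full width 2*Delta, rms = 2*Delta/sqrt(12)."""
    return 2.0 * largest_shift(shifts) / math.sqrt(12.0)

# Hypothetical shifts from three alternative fragmentation models (arbitrary units)
shifts = [+0.8, +1.5, -0.6]

print("largest shift:", largest_shift(shifts))
print("box rms:      ", round(box_rms(shifts), 3))
```

The box estimate is smaller than the largest shift by a factor of √3, so the choice between the two is itself a judgement about what the quoted (Gaussian-interpreted) error should represent.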
Incorporating Systematics into Fits
! Two commonly used approaches
" Error matrix – with (correlated) systematic uncertainties
" Nuisance parameters
! Nuisance parameter example:
" Suppose we are looking at WW decays and count numbers of events in three
different decay channels qqqq, qqlν and lνlν
" Want to measure cross section and hadronic branching fractions accounting
for common luminosity uncertainty
i) build physics model
$$N^{\mathrm{exp}}_{qqqq}(\sigma_{WW}, B_{qq}, L) = \sigma_{WW}\, B_{qq}^2\, L$$
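ii) build likelihood function
$$\chi^2(\sigma_{WW}, B_{qq}, L) = -2\ln\mathcal{L} = \frac{(N^{\mathrm{exp}}_{qqqq} - N^{\mathrm{obs}}_{qqqq})^2}{N^{\mathrm{exp}}_{qqqq}} + \frac{(N^{\mathrm{exp}}_{qq\ell\nu} - N^{\mathrm{obs}}_{qq\ell\nu})^2}{N^{\mathrm{exp}}_{qq\ell\nu}} + \frac{(N^{\mathrm{exp}}_{\ell\nu\ell\nu} - N^{\mathrm{obs}}_{\ell\nu\ell\nu})^2}{N^{\mathrm{exp}}_{\ell\nu\ell\nu}}$$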
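iii) add penalty term for nuisance parameters, here integrated lumi., known
to be $L_0$ with uncertainty $\sigma_L$
$$\chi^2(\sigma_{WW}, B_{qq}, L) = -2\ln\mathcal{L} = \frac{(N^{\mathrm{exp}}_{qqqq} - N^{\mathrm{obs}}_{qqqq})^2}{N^{\mathrm{exp}}_{qqqq}} + \frac{(N^{\mathrm{exp}}_{qq\ell\nu} - N^{\mathrm{obs}}_{qq\ell\nu})^2}{N^{\mathrm{exp}}_{qq\ell\nu}} + \frac{(N^{\mathrm{exp}}_{\ell\nu\ell\nu} - N^{\mathrm{obs}}_{\ell\nu\ell\nu})^2}{N^{\mathrm{exp}}_{\ell\nu\ell\nu}} + \frac{(L - L_0)^2}{\sigma_L^2}$$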
Incorporating Systematics into Fits
! Let’s consider this more closely
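$$\chi^2(\sigma_{WW}, B_{qq}, L) = -2\ln\mathcal{L} = \frac{(N^{\mathrm{exp}}_{qqqq} - N^{\mathrm{obs}}_{qqqq})^2}{N^{\mathrm{exp}}_{qqqq}} + \frac{(N^{\mathrm{exp}}_{qq\ell\nu} - N^{\mathrm{obs}}_{qq\ell\nu})^2}{N^{\mathrm{exp}}_{qq\ell\nu}} + \ldots + \frac{(L - L_0)^2}{\sigma_L^2}$$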
" We are now fitting 3 parameters
• the number of degrees of freedom has not changed, since we have added one
parameter, but also one additional “data point”
" Of the 3 parameters, we are “not interested” in the fitted value of the lumi.
" The penalty term constrains the luminosity to be consistent with the
externally measured value
" The presence of the nuisance parameters will flatten the fitted likelihood
surface – increasing the uncertainties on the fitted parameters
" Also have some measure of the tension in the fit
• if the data pull the nuisance parameter away from the expected value, could
indicate a problem
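A minimal sketch of such a χ² with a luminosity penalty term is given below. Only the qqqq expectation (cross section × B_qq² × luminosity) is taken from the text; the forms used for the other two channels, 2B_qq(1−B_qq) and (1−B_qq)², are an assumption that each W decays either hadronically or leptonically, and all numerical inputs and names are hypothetical. The example shows that, without the penalty, a shift in luminosity can be fully absorbed by rescaling the cross section, and that the penalty term is what charges for it.

```python
def expected_counts(sigma_ww, b_qq, lumi):
    """Expected events per channel.

    qqqq follows the text; the qqlv and lvlv forms are ASSUMED
    (each W decays to qq with probability b_qq, otherwise to lv).
    """
    n_ww = sigma_ww * lumi
    return {
        "qqqq": n_ww * b_qq**2,
        "qqlv": n_ww * 2.0 * b_qq * (1.0 - b_qq),
        "lvlv": n_ww * (1.0 - b_qq) ** 2,
    }

def chi2(sigma_ww, b_qq, lumi, n_obs, lumi0, sigma_lumi):
    """Chi-squared over the three channels plus the luminosity penalty term."""
    n_exp = expected_counts(sigma_ww, b_qq, lumi)
    stat = sum((n_exp[ch] - n_obs[ch]) ** 2 / n_exp[ch] for ch in n_exp)
    penalty = ((lumi - lumi0) / sigma_lumi) ** 2
    return stat + penalty

# Hypothetical inputs (arbitrary units); observed counts set equal to the expectation
sigma_true, b_true, lumi0, sigma_lumi = 16.0, 0.675, 100.0, 2.0
n_obs = expected_counts(sigma_true, b_true, lumi0)

chi2_best = chi2(sigma_true, b_true, lumi0, n_obs, lumi0, sigma_lumi)

# Move the luminosity up by one sigma and rescale the cross section so that
# sigma_ww * lumi, and hence every expected count, is unchanged
lumi_up = lumi0 + sigma_lumi
chi2_shift = chi2(sigma_true * lumi0 / lumi_up, b_true, lumi_up, n_obs, lumi0, sigma_lumi)

print("chi2 at the input values:       ", chi2_best)
print("chi2 with lumi shifted by 1 sig:", round(chi2_shift, 6))
```

The shifted point has the same expected counts but a χ² larger by one unit, coming entirely from the penalty term; this is how the luminosity uncertainty propagates into the uncertainty on the fitted cross section.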
That’s All Folks