Abstract
Physical theories that depend on many parameters or are tested against data from many different experiments pose unique challenges to statistical inference. Many models in particle physics, astrophysics and cosmology fall into one or both of these categories. These issues are often sidestepped with statistically unsound ad hoc methods, involving intersection of parameter intervals estimated by multiple experiments, and random or grid sampling of model parameters. Whilst these methods are easy to apply, they exhibit pathologies even in low-dimensional parameter spaces, and quickly become problematic to use and interpret in higher dimensions. In this article we give clear guidance for going beyond these procedures, suggesting where possible simple methods for performing statistically sound inference, and recommendations of readily-available software tools and standards that can assist in doing so. Our aim is to provide any physicists lacking comprehensive statistical training with recommendations for reaching correct scientific conclusions, with only a modest increase in analysis burden. Our examples can be reproduced with the code publicly available at Zenodo.
1. Introduction
The search for new particles is underway in a wide range of high-energy, astrophysical and precision experiments. These searches are made harder by the fact that theories for physics beyond the standard model almost always contain unknown parameters that cannot be uniquely derived from the theory itself. For example, in particle physics models of dark matter, these would be the dark matter mass and its couplings. Models usually make a range of different experimental predictions depending on the assumed values of their unknown parameters. Despite an ever-increasing wealth of experimental data, evidence for specific physics beyond the standard model has not yet emerged, leading to the proposal of increasingly complicated models. This increases the number of unknown parameters in the models, leading to high-dimensional parameter spaces. This problem is compounded by additional calibration and nuisance parameters that are required as experiments become more complicated. Unfortunately, high-dimensional parameter spaces, and the availability of relevant constraints from an increasing number of experiments, expose flaws in the simplistic methods sometimes employed in phenomenology to assess models. In this article, we recommend alternatives suitable for today's models and data, consistent with established statistical principles.
When assessing a model in light of data, physicists typically want answers to two questions: (a) Is the model favoured or allowed by the data? (b) What values of the unknown parameters are favoured or allowed by the data? In statistical language, these questions concern model testing and parameter estimation, respectively. Parameter estimation allows us to understand what a model could predict, and design future experiments to test it. On the theory side, it allows us to construct theories that contain the model and naturally accommodate the observations. Model testing, on the other hand, allows us to test whether data indicate the presence of a new particle or new phenomena.
Many analyses of particle physics models suffer from two key deficiencies. First, they overlay exclusion curves from experiments and, second, they perform a random or grid scan of a high-dimensional parameter space. These techniques are often combined to perform a crude hypothesis test. In this article, we recapitulate relevant statistical principles, point out why both of these methods give unreliable results, and give concrete recommendations for what should be done instead. Despite the prevalence of these problems, we stress that there is diversity in the depth of statistical training in the physics community. Physicists contributed to major developments in statistical theory [1, 2] and there are many statistically rigorous works in particle physics and related fields, including the famous Higgs discovery [3, 4], and global fits of electroweak data [5]. Our goal is to make clear recommendations that would help lift all analyses closer to those standards, though we urge particular caution when testing hypotheses as unfortunately there are no simple recipes. The examples that we use to illustrate our recommendations can be reproduced with the code publicly available through Zenodo [6].
Our discussion covers both Bayesian methods [7–12], in which one directly considers the plausibility of a model and regions of its parameter space, and frequentist methods [13–16], in which one compares the observed data to data that could have been observed in identical repeated experiments. Our recommendations are agnostic about the relative merits of the two sets of methods, and apply whether one is an adherent of either form, or neither. Both approaches usually involve the so-called likelihood function [17], which tells us the probability of the observed data, assuming a particular model and a particular combination of numerical values for its unknown parameters.
In the following discussions, we assume that a likelihood is available and consider inferences based on it. In general, though, the likelihood alone is not enough in frequentist inference (as well as for reference priors and some methods in Bayesian statistics that use simulation). One requires the so-called sampling distribution; this is similar to the likelihood function, except that the data is not fixed to the observed data (see the likelihood principle [18] for further discussion). There are, furthermore, situations in which the likelihood is intractable. In such cases, likelihood-free techniques may be possible [19]. In fact, in realistic applications in physics, the complete likelihood is almost always intractable. Typically, however, we create summaries of the data by e.g. binning collider events into histograms.
2. Problems of overlaying exclusion limits
Experimental searches for new phenomena are usually summarised by confidence regions, either for a particular model's parameters or for model-independent quantities more closely related to the experiment that can be interpreted in any model. For example, experiments performing direct searches for dark matter [20] publish confidence regions for the mass and scattering cross section of the dark matter particle, rather than for any parameters included in the Lagrangian of a specific dark matter model. To apply those results to a given dark matter model, the confidence regions must be transformed to the parameter space of the specific model of interest. This can sometimes modify the statistical properties of the confidence regions, so care must be taken in performing the transformation [21–23].
In the frequentist approach, if an experiment that measured a parameter were repeated over and over again, each repeat would lead to a different confidence region for the measured parameter. The coverage is the fraction of repeated experiments in which the resulting confidence region would contain the true parameter values [24]. The confidence level of a confidence region is the desired coverage. For example: a 95% confidence region should contain the true values in 95% of repeated experiments, and the rate at which we would wrongly exclude the true parameter values is controlled to be 5%. Approximate confidence regions can often be found from the likelihood function alone using asymptotic assumptions about the sampling distribution, e.g., Wilks' theorem [29]. However, it is important to check carefully that the required assumptions hold [30].
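As a minimal illustration of coverage (a sketch in Python assuming a single Gaussian measurement of a known true value; all numbers are arbitrary), one can simulate repeated experiments and count how often the 95% interval contains the truth:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu_true, sigma, n_rep = 0.0, 1.0, 100_000

# One Gaussian measurement per pseudo-experiment
x = rng.normal(mu_true, sigma, size=n_rep)

# Central 95% confidence interval: x +/- 1.96 sigma
z = stats.norm.ppf(0.975)
covered = (x - z * sigma <= mu_true) & (mu_true <= x + z * sigma)

print(f"Empirical coverage: {covered.mean():.3f}")  # close to 0.95
```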
Confidence intervals may be constructed to be one- or two-tailed. By construction, in the absence of a new effect, a 95% upper limit would exclude all effect sizes, including zero, at a rate of 5%. The fact that confidence intervals may exclude effect sizes that the experiment had no power to discover was considered a problem in particle physics and led to the creation of CLs intervals [31]. By construction, these intervals cannot exclude negligible effect sizes, and thus over-cover.
The analogous construct in Bayesian statistics is the credible region. First, prior information about the parameters and information from the observed data contained in the likelihood function are combined into the posterior using Bayes' theorem. Second, parameters that are not of interest are integrated over, resulting in a marginal posterior distribution. A 95% credible region for the remaining parameters of interest is found from the marginal posterior by defining a region containing 95% of the posterior probability. In general, credible regions only guarantee average coverage: suppose we re-sampled model parameters and pseudo-data from the model and constructed 95% credible regions. In 95% of such trials, the credible region would contain the sampled model parameters [15, 32]. Whilst credible regions and confidence intervals are identical in some cases (e.g. in normal linear models), the fact that they in general lead to different inferences remains a point of contention [33]. For both credible regions and confidence intervals, the level only stipulates the size of the region. One requires an ordering rule to decide which region of that size is selected. For example, the Feldman–Cousins construction [34] for confidence regions and the highest-posterior density ordering rule for credible regions naturally switch from a one- to a two-tailed result.
When several experiments report confidence regions, requiring that the true value must lie within all of those regions amounts to approximating the combined confidence region by the intersection of regions from the individual experiments. This quickly loses accuracy as more experiments are applied in sequence, and leads to much greater than nominal error rates. This is because by taking an intersection of n independent 95% confidence regions, a parameter point has n chances to be excluded at a 5% error rate, giving an error rate of 1 − 0.95^n [35].
This issue is illustrated in figure 1 using the B-physics observable ϕs, which is a well-measured phase characterising charge conjugation and parity symmetry (CP) violation in Bs meson decays [36]. We perform 10 000 pseudo-experiments (footnote 64). Each pseudo-experiment consists of a set of five independent Gaussian measurements of an assumed true standard model value of ϕs = −0.037 with statistical errors 0.078, 0.097, 0.037, 0.285, and 0.17, which are taken from real ATLAS, CMS and LHCb measurements (footnote 65). We can then obtain the 95% confidence interval from the combination of the five measurements in each pseudo-experiment (footnote 66), and compare it to the interval resulting from taking the intersection of the five 95% confidence intervals from the individual measurements. We show the first 100 pseudo-experiments in figure 1. As expected, the 95% confidence interval from the combination contains the true value in 95% of simulated experiments. The intersection of five individual 95% confidence intervals, on the other hand, contains the true value in only 78% of simulations. Thus, overlaying regions leads to inflated error rates and can create a misleading impression about the viable parameter space. Whilst this is a one-dimensional illustration, an identical issue would arise for the intersection of higher-dimensional confidence regions. Clearly, rather than taking the intersection of reported results, one should combine likelihood functions from multiple experiments. Good examples can be found in the literature [38–44].
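This coverage comparison can be reproduced in a few lines of Python. The sketch below follows the same logic as the public code [6] but is not that code; it uses the quoted statistical errors and an arbitrary random seed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
phi_true = -0.037
sigmas = np.array([0.078, 0.097, 0.037, 0.285, 0.17])
z = stats.norm.ppf(0.975)          # 1.96 for a 95% interval
n_rep = 10_000

# Pseudo-measurements: one row of five Gaussian measurements per pseudo-experiment
x = rng.normal(phi_true, sigmas, size=(n_rep, len(sigmas)))

# Inverse-variance weighted combination of the five measurements
w = 1.0 / sigmas**2
mean_comb = (x * w).sum(axis=1) / w.sum()
sigma_comb = 1.0 / np.sqrt(w.sum())
cov_comb = np.abs(mean_comb - phi_true) <= z * sigma_comb

# Intersection of the five individual 95% intervals
inside_each = np.abs(x - phi_true) <= z * sigmas
cov_inter = inside_each.all(axis=1)

print(f"combination: {cov_comb.mean():.3f}, intersection: {cov_inter.mean():.3f}")
# roughly 0.95 vs 0.95**5 = 0.77
```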
In figure 2 we again show the dangers of simply overlaying confidence regions. We construct several toy two-dimensional likelihood functions (top), and find their 95% confidence regions (bottom left). In the bottom right panel, we show the contours of the combined likelihood function (blue) and a combined 95% confidence region (red contour). We see that the intersection of confidence regions (dashed black curve) can both exclude points that are allowed by the combined confidence region, and allow points that should be excluded. It is often useful to plot both the contours of the combined likelihood (bottom right panel) and the contours from the individual likelihoods (bottom left panel), in order to better understand how each measurement or constraint contributes to the final combined confidence region.
Recommendation. Rather than overlaying confidence regions, combine likelihood functions. Derive a likelihood function for all the experimental data (this may be as simple as multiplying likelihood functions from independent experiments), and use it to compute approximate joint confidence or credible regions in the native parameter space of the model.
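As a sketch of what this recommendation means in practice, the example below combines two hypothetical, independent Gaussian 'experiments' (all means and covariances are made up for illustration) by summing their log-likelihoods, and defines an approximate combined two-dimensional 95% region via the usual asymptotic Δχ² threshold:

```python
import numpy as np
from scipy import stats, optimize

# Two toy, independent "experiments", each constraining (theta1, theta2);
# the central values and covariances are purely illustrative.
def loglike_exp1(theta):
    return stats.multivariate_normal.logpdf(theta, mean=[1.0, 0.5],
                                            cov=[[0.2, 0.05], [0.05, 0.1]])

def loglike_exp2(theta):
    return stats.multivariate_normal.logpdf(theta, mean=[0.8, 0.7],
                                            cov=[[0.05, 0.0], [0.0, 0.3]])

def loglike_comb(theta):
    # Independent experiments: log-likelihoods add
    return loglike_exp1(theta) + loglike_exp2(theta)

# Best fit of the combined likelihood
res = optimize.minimize(lambda t: -loglike_comb(t), x0=[1.0, 0.5])
lnL_max = -res.fun

# Approximate combined 95% confidence region (Wilks, 2 d.o.f.):
# points with 2 * (lnL_max - lnL(theta)) <= 5.99
def in_combined_region(theta):
    return 2.0 * (lnL_max - loglike_comb(theta)) <= stats.chi2.ppf(0.95, df=2)

print(res.x, in_combined_region(res.x))  # best fit is inside by construction
```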
3. Problems of uniform random sampling and grid scanning
Parameter estimation generally involves integration of a posterior or maximisation of a likelihood function. This is required to go from the full high-dimensional model to the one or two dimensions of interest or to compare different models. In most cases this cannot be done analytically. The likelihood function, furthermore, may be problematic in realistic settings. In particle physics [45], it is usually moderately high-dimensional, and often contains distinct modes corresponding to different physical solutions, degeneracies in which several parameters can be adjusted simultaneously without impacting the fit, and plateaus in which the model is unphysical and the likelihood is zero. On top of that, only noisy estimates of the likelihood may be available, such as from Monte Carlo simulations of collider searches for new particles, and derivatives of the likelihood function are usually unavailable [46]. As even single evaluations of the likelihood function can be computationally expensive, the challenge is then to perform integration or maximisation in a high-dimensional parameter space using a tractable number of evaluations of the likelihood function.
Random and grid scans are common strategies in the high-energy phenomenology literature. In random scans, one evaluates the likelihood function at a number of randomly-chosen parameter points. Typically the parameters are drawn from a uniform distribution in each parameter in a particular parametrisation of the model, which introduces a dependency on the choice of parametrisation. In grid scans, one evaluates the likelihood function on a uniformly spaced grid with a fixed number of points per dimension. It is then tempting to attribute statistical meaning to the number or density of samples found by random or grid scans. However, such an interpretation is very problematic, in particular when the scan is combined with the crude method described in section 2, i.e. keeping only points that make predictions that lie within the confidence regions reported by every single experiment. It is worth noting that random scans often outperform grid scans: consider 100 likelihood evaluations in a two-parameter model where the likelihood function depends much more strongly on the first parameter than on the second. A random scan would try 100 different values of the important parameter, whereas the grid scan would try just 10. In a similar vein, quasi-random samples that cover the space more evenly than truly random samples can outperform truly random sampling [47]. This is illustrated in figure 3 with 256 samples in two dimensions.
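A minimal sketch of this comparison, using scipy's quasi-Monte Carlo module and 256 points in the unit square (the discrepancy printed at the end is just one convenient summary of how evenly the points cover the space):

```python
import numpy as np
from scipy.stats import qmc

n, d = 256, 2
rng = np.random.default_rng(0)

random_pts = rng.random((n, d))                      # uniform random
grid_1d = (np.arange(16) + 0.5) / 16                 # 16 x 16 grid
grid_pts = np.array([[x, y] for x in grid_1d for y in grid_1d])
sobol_pts = qmc.Sobol(d, seed=0).random(n)           # quasi-random (Sobol)

# Lower discrepancy means more even coverage of the unit square
for name, pts in [("random", random_pts), ("grid", grid_pts), ("sobol", sobol_pts)]:
    print(name, qmc.discrepancy(pts))
```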
However, random, quasi-random and grid scans are all extremely inefficient in cases with even a few parameters. The 'curse of dimensionality' [48] is one of the well-known problems: the number of samples required for a fixed resolution per dimension scales exponentially with dimension D: just ten samples per dimension requires 10^D samples. This quickly becomes an impossible task in high-dimensional problems. Similarly, consider a D-dimensional model in which the interesting or best-fitting region occupies a fraction f of each dimension. A random scan would find points in that region with an efficiency of f^D, i.e. random scans are exponentially inefficient. See [49] for further discussion and examples.
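For a concrete sense of the scaling, a tiny sketch (assuming, purely for illustration, that the interesting region spans a fraction f = 0.1 of each dimension):

```python
# Expected number of hits in the interesting region from N uniform random samples
f, N = 0.1, 10**6
for D in (2, 4, 6, 8):
    print(D, N * f**D)   # 10^4, 10^2, 1, 0.01 expected hits
```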
These issues can be addressed by using more sophisticated algorithms that, for example, preferentially explore areas of the parameter space where the likelihood is larger. Which algorithm is best suited for a given study depends on the goal of the analysis. For Bayesian inference, it is common to draw samples from the posterior distribution or compute an integral over the model's parameter space, relevant for Bayesian model selection. See [50] for a review of Bayesian computation. For frequentist inference, one might want to determine the global optimum and obtain samples from any regions in which the likelihood function is moderate. This can be more challenging than Bayesian computation. In particular, algorithms for Bayesian computation might not be appropriate optimizers. For example, Markov chain Monte Carlo methods draw from the posterior. In high dimensions, the bulk of the posterior probability (the typical set) often lies well away from the maximum likelihood. This is another manifestation of the curse of dimensionality.
In figure 4 we illustrate one such algorithm that overcomes the deficiencies of random and grid sampling and is suitable for frequentist inference. Here we assume that the logarithm of the likelihood function is given by a four-dimensional Rosenbrock function [51],

ln L(x) = −∑_{i=1}^{3} [ 100 (x_{i+1} − x_i^2)^2 + (1 − x_i)^2 ].

This is a challenging likelihood function with a global maximum at x_i = 1 (i = 1, 2, 3, 4). We show samples found with 2(ln L_max − ln L) ≤ 5.99, where ln L_max = 0 is the global maximum. This constraint corresponds to the two-dimensional 95% confidence region, which in the (x1, x2) plane has a banana-like shape (red contour). We find the points using uniform random sampling from −5 to 5 for each parameter (orange dots), using a grid scan (yellow dots), and using an implementation of the differential evolution algorithm [52, 53] operating inside the same limits (blue dots). With only 2 × 10^5 likelihood calls, the differential evolution scan finds more than 11 500 points in the high-likelihood region (footnote 67), whereas in 10^7 tries the random scan finds only seven high-likelihood samples, and the grid scan just 10. The random and grid scans would need over 10^10 likelihood calls to obtain a similar number of high-likelihood points as obtained by differential evolution in just 2 × 10^5 evaluations. If likelihood calls are expensive and dominate the run-time, this could make differential evolution about 10^5 times faster.
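A minimal sketch of such a scan with scipy's differential evolution implementation is shown below; the settings are illustrative rather than those used for figure 4 (which are documented in the public code [6]), and scipy's stopping rule differs in detail from the coefficient-of-variation criterion described in footnote 67:

```python
import numpy as np
from scipy import optimize, stats

def neg_loglike(x):
    # Negative log-likelihood = 4D Rosenbrock function, minimised (likelihood maximised) at x = (1, 1, 1, 1)
    return sum(100.0 * (x[i + 1] - x[i]**2)**2 + (1.0 - x[i])**2 for i in range(3))

# Record every evaluated point so we can keep the high-likelihood ones
evaluated = []
def recorded(x):
    evaluated.append((np.array(x), neg_loglike(x)))
    return evaluated[-1][1]

bounds = [(-5.0, 5.0)] * 4
result = optimize.differential_evolution(recorded, bounds, tol=0.01,
                                         seed=1, polish=False)

# Keep points inside the approximate 2D 95% region: 2 * (lnL_max - lnL) <= 5.99
threshold = stats.chi2.ppf(0.95, df=2) / 2.0
good = [x for x, nll in evaluated if nll <= result.fun + threshold]
print(f"best fit {result.x}, {len(good)} high-likelihood samples "
      f"out of {len(evaluated)} evaluations")
```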
Recommendation. Use efficient algorithms to analyse parameter spaces, rather than grid or random scans. The choice of algorithm should depend on the goal. Good examples for Bayesian analyses are Markov chain Monte Carlo [54, 55] and nested sampling [56]. Good examples for maximizing and exploring the likelihood are simulated annealing [57], differential evolution [52], genetic algorithms [58] and local optimizers such as Nelder–Mead [59]. These are widely available in various public software packages [53, 60–66].
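For the Bayesian options, the simplest of these algorithms is a random-walk Metropolis sampler. The sketch below uses a toy two-dimensional Gaussian log-posterior as a stand-in for a real one, with an arbitrary proposal width; it draws samples whose density follows the posterior:

```python
import numpy as np

def log_posterior(theta):
    # Toy stand-in: a correlated 2D Gaussian (replace with log-prior + log-likelihood)
    diff = theta - np.array([1.0, 0.5])
    icov = np.linalg.inv(np.array([[0.2, 0.05], [0.05, 0.1]]))
    return -0.5 * diff @ icov @ diff

rng = np.random.default_rng(0)
theta = np.zeros(2)
logp = log_posterior(theta)
step = 0.3                      # proposal width; needs tuning in practice
chain = []

for _ in range(50_000):
    proposal = theta + step * rng.normal(size=2)
    logp_new = log_posterior(proposal)
    # Accept with probability min(1, posterior ratio)
    if np.log(rng.random()) < logp_new - logp:
        theta, logp = proposal, logp_new
    chain.append(theta)

chain = np.array(chain)
print("posterior mean ~", chain.mean(axis=0))  # close to (1.0, 0.5) once burn-in is discarded
```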
4. Problems with model testing
Overlaying confidence regions and performing random scans are straightforward methods for 'hypothesis tests' of physical theories with many parameters or testable predictions. For example, it is tempting to say that a model is excluded if a uniform random or grid scan finds no samples for which the experimental predictions lie inside every 95% confidence region. This procedure is, however, prone to misinterpretation: just as in section 2, it severely under-estimates error rates, and, just as in section 3, it easily misses solutions.
Testing and comparing individual models in a statistically defensible manner is challenging and contentious. On the frequentist side, one can calculate a global p-value: the probability of obtaining data as extreme or more extreme than observed, if the model in question is true. The p-value features in two distinct statistical approaches [67]: first, the p-value may be interpreted as a measure of evidence against a model [68]; see [69–73] for discussion of this approach. Second, we may use the p-value to control the rate at which we would wrongly reject the model when it was true [74]. If we reject when p < α, we would wrongly reject at a rate α. In particle physics, we adopt the 5σ threshold, corresponding to α ≃ 10^−7 [75]. When we compute p-values, we should take into account all the tests that we might have performed. In the context of searches for new particles, this is known as the look-elsewhere effect. Whilst calculations can be greatly simplified by using asymptotic formulae [76, 77], bear in mind that they may not apply [30]. Also, care must be taken to avoid common misinterpretations of the p-value [78, 79]. For example, the p-value is not the probability of the null hypothesis, the probability that the observed data were produced by chance alone, the probability of the observed data given the null hypothesis, or the rate at which we would wrongly reject the null hypothesis when it was true.
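The look-elsewhere effect can be made concrete with a toy example. The sketch below (a hypothetical search across 20 independent bins, with pseudo-experiments generated under the background-only hypothesis) compares the local p-value of the most significant bin to the corresponding global p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_bins, n_pseudo = 20, 100_000

# "Observed" significances: background-only fluctuations in 20 bins
observed = rng.normal(size=n_bins)
z_max = observed.max()
p_local = stats.norm.sf(z_max)               # p-value ignoring the other bins

# Global p-value: how often would the *largest* of 20 null fluctuations be this big?
null_max = rng.normal(size=(n_pseudo, n_bins)).max(axis=1)
p_global = (null_max >= z_max).mean()

print(f"local p = {p_local:.4f}, global p = {p_global:.4f}")
# For independent bins the global p-value approaches 1 - (1 - p_local)**20
```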
On the Bayesian side, one can perform Bayesian model comparison [1, 80] to find any change brought about by data to the relative plausibility of two different models. The factor that updates the relative plausibility of two models is called a Bayes factor. The Bayes factor is a ratio of integrals that may be challenging to compute in high-dimensional models. Just as in Bayesian parameter inference, this requires constructing priors for the parameters of the two models, permitting one to coherently incorporate prior information. In this setting, however, inferences may be strongly prior dependent, even in cases with large data sets and where seemingly uninformative priors are used [81, 82]. This sensitivity can be particularly problematic in high-dimensional models. Unfortunately, there is no unique notion of an uninformative prior representing a state of indifference about a parameter [83], though in special cases symmetry considerations may help [84].
Neither of these approaches is simple, either philosophically or computationally, and the task of model testing and comparison is in general full of subtleties. For example, they depend differently on the amount of data collected which leads to somewhat paradoxical differences between them [1, 85, 86]. See [87–91] for recent discussions in other scientific settings. It is worth noting that there are connections between model testing and parameter inference in the case of nested models, i.e. when a model can be viewed as a subset of the parameter space of some larger, 'full' model. A hypothesis test of a nested model can be equivalent to whether it lies inside a confidence region in the full model [92, 93]. Similarly, the Bayes factor between nested models can be found from parameter inference in the full model alone through the Savage–Dickey ratio [94]. There are, furthermore, approaches beyond Bayesian model comparison and frequentist model testing that we do not discuss here.
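As an example of the Savage–Dickey route for nested models, the sketch below computes the Bayes factor between a Gaussian model with a free mean and its nested point hypothesis of zero mean, using a conjugate Gaussian prior (all numbers are toy choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Toy data: n Gaussian measurements of a parameter mu with known sigma
n, sigma, mu_true = 20, 1.0, 0.2
x = rng.normal(mu_true, sigma, size=n)

# Full model: prior mu ~ N(0, tau^2); conjugate Gaussian posterior
tau = 1.0
post_var = 1.0 / (1.0 / tau**2 + n / sigma**2)
post_mean = post_var * x.sum() / sigma**2

# Savage-Dickey: B_01 (nested mu = 0 vs full model) = posterior / prior density at mu = 0
prior_at_0 = stats.norm.pdf(0.0, loc=0.0, scale=tau)
post_at_0 = stats.norm.pdf(0.0, loc=post_mean, scale=np.sqrt(post_var))
print(f"Bayes factor B_01 = {post_at_0 / prior_at_0:.2f}")
```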
Recommendation. In Bayesian analyses, carefully consider the choice of priors and their potential impact, particularly in high dimensions, and check the prior sensitivity. In frequentist analyses, consider the look-elsewhere effect, check the validity of any asymptotic formulae and take care to avoid common misinterpretations of the p-value. If investigation of such subtleties falls outside the scope of the analysis, refrain from making strong statements on the overall validity of the theory under study.
5. Summary
As first steps towards addressing the challenges posed by physical theories with many parameters and many testable predictions, we make three recommendations: (i) construct a composite likelihood that combines constraints from individual experiments, (ii) use adaptive sampling algorithms (ones that target the interesting regions) to efficiently sample the parameter spaces, and (iii) avoid strong statements on the viability of a theory unless a proper model test has been performed. The second recommendation can be easily achieved through the use of any one of a multitude of publicly-available implementations of efficient sampling algorithms (for examples see section 3). For the first recommendation, composite likelihoods are often relatively simple to construct, and can be as straightforward as a product of Gaussians for multiple independent measurements. Even for cases where constructing the composite likelihood is more complicated, software implementations are often publicly available already [95–103].
Given the central role of the likelihood function in analysing experimental data, it is in the interest of experimental collaborations to make their likelihood functions (or a reasonable approximation) publicly available to truly harness the full potential of their results when confronted with new theories. Even for large and complex datasets, e.g. those from the Large Hadron Collider, there exist various recommended methods for achieving this goal [104–106].
Our recommendations can be taken separately when only one of the challenges exists, or where addressing them all is impractical. However, when confronted with both high-dimensional models and a multitude of relevant experimental constraints, we recommend that they are used together to maximise the validity and efficiency of analyses.
Author contributions
The project was led by AF and in preliminary stages by BF and FK. ABe, AF, SHoof, AK, PSc and WS contributed to creating the figures. PA, CB, TB, ABe, ABuc, AF, TEG, SHoof, AK, JECM, MTP, AR, PSc, ACV and YZ contributed to writing. WH and FK performed official internal reviews of the article. All authors read, endorsed and discussed the content and recommendations.
Acknowledgments
BCA has been partially supported by the UK Science and Technology Facilities Council (STFC) Consolidated HEP theory Grants ST/P000681/1 and ST/T000694/1. PA is supported by Australian Research Council (ARC) Future Fellowship FT160100274, and PS by FT190100814. PA, CB, TEG and MW are supported by ARC Discovery Project DP180102209. CB and YZ are supported by ARC Centre of Excellence CE110001104 (Particle Physics at the Tera-scale) and WS and MW by CE200100008 (Dark Matter Particle Physics). ABe is supported by F.N.R.S. through the F.6001.19 convention. ABuc is supported by the Royal Society Grant UF160548. JECM is supported by the Carl Trygger Foundation Grant No. CTS 17:139. JdB acknowledges support by STFC under Grant ST/P001246/1. JE was supported in part by the STFC (UK) and by the Estonian Research Council. BF was supported by EU MSCA-IF project 752162—DarkGAMBIT. MF and FK are supported by the Deutsche Forschungsgemeinschaft (DFG) through the Collaborative Research Center TRR 257 'Particle Physics Phenomenology after the Higgs Discovery' under Grant 396021762—TRR 257 and FK also under the Emmy Noether Grant No. KA 4662/1-1. AF is supported by an NSFC Research Fund for International Young Scientists Grant 11950410509. SHe was supported in part by the MEINCOP (Spain) under contract PID2019-110058GB-C21 and in part by the Spanish Agencia Estatal de Investigación (AEI) through the Grant IFT Centro de Excelencia Severo Ochoa SEV-2016-0597. SHoof is supported by the Alexander von Humboldt Foundation. SHoof and MTP are supported by the Federal Ministry of Education and Research of Germany (BMBF). KK is supported in part by the National Science Centre (Poland) under research Grant No. 2017/26/E/ST2/00470, LR under No. 2015/18/A/ST2/00748, and EMS under No. 2017/26/D/ST2/00490. LR and ST are supported by Grant AstroCeNT: Particle Astrophysics Science and Technology Centre, carried out within the International Research Agendas programme of the Foundation for Polish Science financed by the European Union under the European Regional Development Fund. MLM acknowledges support from NWO (Netherlands). SM is supported by JSPS KAKENHI Grant No. 17K05429. The work of KAO was supported in part by DOE Grant DE-SC0011842 at the University of Minnesota. JJR is supported by the Swedish Research Council, contract 638-2013-8993. KS was partially supported by the National Science Centre, Poland, under research Grants 2017/26/E/ST2/00135 and the Beethoven Grants DEC-2016/23/G/ST2/04301. AS is supported by MIUR research Grant No. 2017X7X85K and INFN. ST is partially supported by the Polish Ministry of Science and Higher Education through its scholarship for young and outstanding scientists (decision No. 1190/E-78/STYP/14/2019). RT was partially supported by STFC under Grant No. ST/T000791/1. The work of MV is supported by the NSF Grant No. PHY-1915005. ACV is supported by the Arthur B McDonald Canadian Astroparticle Physics Research Institute. Research at Perimeter Institute is supported by the Government of Canada through the Department of Innovation, Science, and Economic Development, and by the Province of Ontario through MEDJCT. LW is supported by the National Natural Science Foundation of China (NNSFC) under Grant No. 117050934, by Jiangsu Specially Appointed Professor Program. WS is supported by KIAS Individual Grant (PG084201) at Korea Institute for Advanced Study.
Data availability statement
Footnotes
- 64 In a pseudo-experiment, we simulate the random nature of a real experimental measurement using a pseudo-random number generator on a computer. Pseudo-experiments may be used to learn about the expected distributions of repeated measurements.
- 65 See equation (91) and table 22 in [36].
- 66 We used the standard weighted-mean approach to combine the results [37].
- 67 We used a population size of 50 and stopped once the coefficient of variation of the fitness of the population dropped below 1%. See the associated code for the complete settings [6].