Harmonic mean p-value

The harmonic mean p-value^[1]^[2]^[3] (HMP) is a statistical technique for addressing the multiple comparisons problem that controls the strong-sense family-wise error rate^[2] (this claim has been disputed^[4]). It improves on the power of the Bonferroni correction by performing combined tests, i.e. by testing whether groups of p-values are statistically significant, as Fisher's method does.^[5] Unlike Fisher's method, however, it avoids the restrictive assumption that the p-values are independent.^[2]^[3] Consequently, it controls the false positive rate when tests are dependent, at the expense of less power (i.e. a higher false negative rate) when tests are independent.^[2] Besides providing an alternative to approaches such as the Bonferroni correction that control the stringent family-wise error rate, it also provides an alternative to the widely used Benjamini-Hochberg procedure (BH) for controlling the less stringent false discovery rate.^[6] This is because the power of the HMP to detect significant groups of hypotheses is greater than the power of BH to detect significant individual hypotheses.^[2]

There are two versions of the technique: (i) direct interpretation of the HMP as an approximate p-value and (ii) a procedure for transforming the HMP into an asymptotically exact p-value. The approach provides a multilevel test procedure in which the smallest groups of p-values that are statistically significant may be sought.

Direct interpretation of the harmonic mean p-value

The weighted harmonic mean of p-values ${\textstyle p_{1},\dots ,p_{L}}$ is defined as ${\overset {\circ }{p}}={\frac {\sum _{i=1}^{L}w_{i}}{\sum _{i=1}^{L}w_{i}/p_{i}}},$ where ${\textstyle w_{1},\dots ,w_{L}}$ are weights that must sum to one, i.e. ${\textstyle \sum _{i=1}^{L}w_{i}=1}$. Equal weights may be chosen, in which case ${\textstyle w_{i}=1/L}$.
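
The definition translates directly into a few lines of code. The following is a minimal Python sketch, not the harmonicmeanp implementation; the function name `harmonic_mean_p` and its input checks are assumptions made for illustration.

```python
from typing import Optional, Sequence


def harmonic_mean_p(pvalues: Sequence[float],
                    weights: Optional[Sequence[float]] = None) -> float:
    """Weighted harmonic mean of p-values: sum(w_i) / sum(w_i / p_i)."""
    L = len(pvalues)
    if L == 0:
        raise ValueError("need at least one p-value")
    if weights is None:
        weights = [1.0 / L] * L  # equal weights, w_i = 1/L
    if len(weights) != L:
        raise ValueError("one weight per p-value is required")
    if any(not 0.0 < p <= 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    return sum(weights) / sum(w / p for w, p in zip(weights, pvalues))
```

With equal weights, `harmonic_mean_p([0.01, 0.5])` evaluates to 1/51, roughly 0.0196, which illustrates how strongly the smallest p-value dominates the result.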

In general, interpreting the HMP directly as a p-value is anti-conservative, meaning that the false positive rate is higher than expected. However, as the HMP becomes smaller, under certain assumptions, the discrepancy decreases, so that for sufficiently small values (e.g. ${\overset {\circ }{p}}<0.05$) direct interpretation achieves a false positive rate close to the nominal one.^[2]

The HMP is never anti-conservative by more than a factor of ${\textstyle e\,\log L}$ for small ${\textstyle L}$, or ${\textstyle \log L}$ for large ${\textstyle L}$.^[3] However, these bounds represent worst-case scenarios under arbitrary dependence that are likely to be conservative in practice. Rather than applying these bounds, asymptotically exact p-values can be produced by transforming the HMP.
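
One way to read the bounds is that multiplying the HMP by the stated factor gives a value that can be used as a p-value even under arbitrary dependence. The sketch below encodes that reading. It assumes natural logarithms and at least three tests (so that the factor exceeds one), and leaves the choice between the "small L" and "large L" factor to the caller, because the text does not say where one regime ends; the function name is hypothetical.

```python
import math


def worst_case_adjusted_hmp(hmp: float, L: int, large_L: bool = False) -> float:
    """Inflate the HMP by the worst-case factor from the bounds above.

    Assumptions of this sketch: natural logarithm, L >= 3, and the caller
    decides whether L counts as "large".
    """
    if L < 3:
        raise ValueError("this sketch assumes at least three tests")
    factor = math.log(L) if large_L else math.e * math.log(L)
    return min(1.0, hmp * factor)  # a p-value cannot exceed one
```

For example, with 100 tests the small-L factor is e times log 100, about 12.5, so an HMP of 0.001 would be reported as roughly 0.0125 under this reading.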

Asymptotically exact harmonic mean p-value procedure

The generalized central limit theorem shows that an asymptotically exact p-value, ${\textstyle p_{\overset {\circ }{p}}}$, can be computed from the HMP, ${\overset {\circ }{p}}$, using the formula^[2] $p_{\overset {\circ }{p}}=\int _{1/{\overset {\circ }{p}}}^{\infty }f_{\textrm {Landau}}\left(x\,|\,\log L+0.874,{\frac {\pi }{2}}\right)\mathrm {d} x.$ Subject to the assumptions of the generalized central limit theorem, this transformed p-value becomes exact as the number of tests, ${\textstyle L}$, becomes large. The computation uses the Landau distribution, whose density function can be written $f_{\textrm {Landau}}(x\,|\,\mu ,\sigma )={\frac {1}{\pi \sigma }}\int _{0}^{\infty }{\textrm {e}}^{-t{\frac {(x-\mu )}{\sigma }}-{\frac {2}{\pi }}t\log t}\,\sin(2t)\,{\textrm {d}}t.$ The test is implemented by the p.hmp command of the harmonicmeanp R package; a tutorial is available online.
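
For readers without access to the R package, the transformation can be approximated with the standard library alone. The sketch below is not the p.hmp implementation. It relies on a step derived here rather than taken from the source: integrating the density formula over x from the threshold to infinity turns the tail probability into the single integral $(1/\pi )\int _{0}^{\infty }{\textrm {e}}^{-tz-(2/\pi )t\log t}\,\sin(2t)/t\,{\textrm {d}}t$ with $z=(x-\mu )/\sigma$. It also assumes $z>0$, which holds for small HMP values; the function names and the Simpson's-rule settings are illustrative choices.

```python
import math


def landau_tail(x: float, mu: float, sigma: float, n: int = 20000) -> float:
    """Upper tail P(X > x) for the Landau density f_Landau(x | mu, sigma).

    Uses Simpson's rule on a truncated range. Assumption of this sketch:
    z = (x - mu) / sigma > 0, where the integrand decays quickly enough
    for plain quadrature to be stable.
    """
    z = (x - mu) / sigma
    if z <= 0.0:
        raise ValueError("this sketch only handles thresholds with x > mu")
    if n % 2:
        n += 1  # Simpson's rule needs an even number of intervals
    upper = min(40.0 / z, 40.0)  # integrand is below about exp(-39) beyond this

    def g(t: float) -> float:
        if t == 0.0:
            return 2.0  # limit of sin(2t)/t as t -> 0
        damp = math.exp(-t * z - (2.0 / math.pi) * t * math.log(t))
        return damp * math.sin(2.0 * t) / t

    h = upper / n
    total = g(0.0) + g(upper)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * g(k * h)
    return total * h / (3.0 * math.pi)


def asymptotic_hmp_pvalue(hmp: float, L: int) -> float:
    """Transform the HMP of L p-values using location log L + 0.874, scale pi/2."""
    return landau_tail(1.0 / hmp, math.log(L) + 0.874, math.pi / 2.0)
```

As a sanity check, `asymptotic_hmp_pvalue(0.040, 10)` should come out close to 0.05, in line with the first row of Table 1. For HMP values near one the condition z > 0 fails and a more careful numerical method would be needed.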

Equivalently, one can compare the HMP to a table of critical values (Table 1). The table illustrates that the smaller the false positive rate, and the smaller the number of tests, the closer the critical value is to the false positive rate.

Table 1. Critical values for the HMP ${\textstyle {\overset {\circ }{p}}}$ for varying numbers of tests ${\textstyle L}$ and false positive rates ${\textstyle \alpha }$ .^[2]
${\textstyle L}$	${\textstyle \alpha =0.05}$	${\textstyle \alpha =0.01}$	${\textstyle \alpha =0.001}$
10	0.040	0.0094	0.00099
100	0.036	0.0092	0.00099
1,000	0.034	0.0090	0.00099
10,000	0.031	0.0088	0.00098
100,000	0.029	0.0086	0.00098
1,000,000	0.027	0.0084	0.00098
10,000,000	0.026	0.0083	0.00098
100,000,000	0.024	0.0081	0.00098
1,000,000,000	0.023	0.0080	0.00097

Multiple testing via the multilevel test procedure

If the HMP is significant at some level ${\textstyle \alpha }$ for a group of ${\textstyle L}$ p-values, one may search all subsets of the ${\textstyle L}$ p-values for the smallest significant group, while maintaining the strong-sense family-wise error rate.^[2] Formally, this constitutes a closed-testing procedure.^[7]

When ${\textstyle \alpha }$ is small (e.g. ${\textstyle \alpha <0.05}$ ), the following multilevel test based on direct interpretation of the HMP controls the strong-sense family-wise error rate at level approximately ${\textstyle \alpha :}$

1. Define the HMP of any subset ${\textstyle {\mathcal {R}}}$ of the ${\textstyle L}$ p-values to be ${\overset {\circ }{p}}_{\mathcal {R}}={\frac {\sum _{i\in {\mathcal {R}}}w_{i}}{\sum _{i\in {\mathcal {R}}}w_{i}/p_{i}}}.$
2. Reject the null hypothesis that none of the p-values in subset ${\textstyle {\mathcal {R}}}$ are significant if ${\textstyle {\overset {\circ }{p}}_{\mathcal {R}}\leq \alpha \,w_{\mathcal {R}}}$, where ${\textstyle w_{\mathcal {R}}=\sum _{i\in {\mathcal {R}}}w_{i}}$. (Recall that, by definition, ${\textstyle \sum _{i=1}^{L}w_{i}=1}$.)
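
The two steps above can be sketched as follows. This covers only the direct-interpretation version; the function names, the index-based representation of a subset and the example p-values are assumptions made for illustration.

```python
from typing import Iterable, Sequence, Tuple


def subset_hmp(pvalues: Sequence[float], weights: Sequence[float],
               subset: Iterable[int]) -> Tuple[float, float]:
    """Return (HMP of the subset, total weight w_R of the subset)."""
    idx = list(subset)
    w_R = sum(weights[i] for i in idx)
    hmp_R = w_R / sum(weights[i] / pvalues[i] for i in idx)
    return hmp_R, w_R


def reject_subset(pvalues: Sequence[float], weights: Sequence[float],
                  subset: Iterable[int], alpha: float = 0.05) -> bool:
    """Direct-interpretation multilevel test: reject if HMP_R <= alpha * w_R."""
    hmp_R, w_R = subset_hmp(pvalues, weights, subset)
    return hmp_R <= alpha * w_R


# Made-up example: four tests with equal weights (the weights sum to one).
pvalues = [0.001, 0.2, 0.6, 0.9]
weights = [0.25, 0.25, 0.25, 0.25]
all_tests = reject_subset(pvalues, weights, range(4))      # whole set
first_only = reject_subset(pvalues, weights, [0])          # single p-value
remainder = reject_subset(pvalues, weights, [1, 2, 3])     # the other three
```

In this example the whole set and the singleton containing 0.001 are rejected at the 0.05 level, while the remaining three p-values are not. The threshold for the singleton is 0.05 times 0.25, i.e. 0.0125.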

An asymptotically exact version of the above replaces ${\textstyle {\overset {\circ }{p}}_{\mathcal {R}}}$ in step 2 with $p_{{\overset {\circ }{p}}_{\mathcal {R}}}=\max \left\{{\overset {\circ }{p}}_{\mathcal {R}},w_{\mathcal {R}}\int _{w_{\mathcal {R}}/{\overset {\circ }{p}}_{\mathcal {R}}}^{\infty }f_{\textrm {Landau}}\left(x\,|\,\log L+0.874,{\frac {\pi }{2}}\right)\mathrm {d} x\right\},$ where ${\textstyle L}$ is the total number of p-values, not just the number in subset ${\textstyle {\mathcal {R}}}$.^[8]

Since direct interpretation of the HMP is faster, a two-pass procedure may be used to identify subsets of p-values that are likely to be significant using direct interpretation, subject to confirmation using the asymptotically exact formula.

Properties of the HMP

The HMP has a range of properties that arise from the generalized central limit theorem.^[2] It is:

Robust to positive dependency between the p-values.
Insensitive to the exact number of tests, L.
Robust to the distribution of weights, w.
Most influenced by the smallest p-values.

When the HMP is not significant, neither is any subset of the constituent tests. Conversely, when the multilevel test deems a subset of p-values to be significant, the HMP for all the p-values combined is likely to be significant; this is certain when the HMP is interpreted directly. When the goal is to assess the significance of individual p-values, so that combined tests concerning groups of p-values are of no interest, the HMP is equivalent to the Bonferroni procedure but subject to the more stringent significance threshold ${\textstyle \alpha _{L}<\alpha }$ (Table 1).

The HMP assumes the individual p-values have (not necessarily independent) standard uniform distributions when their null hypotheses are true. Large numbers of underpowered tests can therefore harm the power of the HMP.

While the choice of weights is unimportant for the validity of the HMP under the null hypothesis, the weights influence the power of the procedure. Supplementary Methods §5C of Wilson (2019)^[2] and an online tutorial consider the issue in more detail.

Bayesian interpretations of the HMP

The HMP was conceived by analogy to Bayesian model averaging and can be interpreted as inversely proportional to a model-averaged Bayes factor when combining p-values from likelihood ratio tests.^[1]^[2]

The harmonic mean rule-of-thumb

I. J. Good reported an empirical relationship between the Bayes factor and the p-value from a likelihood ratio test.^[1] For a null hypothesis ${\textstyle H_{0}}$ nested in a more general alternative hypothesis ${\textstyle H_{A},}$ he observed that often, ${\textrm {BF}}_{i}\approx {\frac {1}{\gamma \,p_{i}}},\quad 3{\frac {1}{3}}<\gamma <30,$ where ${\textstyle {\textrm {BF}}_{i}}$ denotes the Bayes factor in favour of ${\textstyle H_{A}}$ versus $H_{0}.$ Extrapolating, he proposed a rule of thumb in which the HMP is taken to be inversely proportional to the model-averaged Bayes factor for a collection of ${\textstyle L}$ tests with common null hypothesis: ${\overline {\textrm {BF}}}=\sum _{i=1}^{L}w_{i}\,{\textrm {BF}}_{i}\approx \sum _{i=1}^{L}{\frac {w_{i}}{\gamma \,p_{i}}}={\frac {1}{\gamma \,{\overset {\circ }{p}}}}.$ For Good, his rule-of-thumb supported an interchangeability between Bayesian and classical approaches to hypothesis testing.^[9]^[10]^[11]^[12]^[13]
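
As a rough numerical illustration of the rule of thumb, the sketch below converts an HMP into the range of model-averaged Bayes factors implied by Good's interval for γ, and evaluates the weighted sum on the right-hand side of the approximation. The function names are hypothetical, and the output is only as reliable as the empirical relationship itself.

```python
from typing import Sequence, Tuple


def good_bayes_factor_range(hmp: float,
                            gamma_low: float = 10.0 / 3.0,
                            gamma_high: float = 30.0) -> Tuple[float, float]:
    """Range of 1 / (gamma * HMP) for gamma between 3 1/3 and 30."""
    return 1.0 / (gamma_high * hmp), 1.0 / (gamma_low * hmp)


def rule_of_thumb_bf(pvalues: Sequence[float], weights: Sequence[float],
                     gamma: float) -> float:
    """Weighted sum of w_i / (gamma * p_i); equals 1 / (gamma * HMP)
    when the weights sum to one."""
    return sum(w / (gamma * p) for w, p in zip(weights, pvalues))
```

For an HMP of 0.001 the implied range runs from about 33 to 300, a tenfold spread that reflects how loosely γ is pinned down.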

Bayesian calibration of p-values

If the distributions of the p-values under the alternative hypotheses follow Beta distributions with parameters $\left(0<\xi _{i}<1,1\right)$ , a form considered by Sellke, Bayarri and Berger,^[14] then the inverse proportionality between the model-averaged Bayes factor and the HMP can be formalized as^[2]^[15] ${\overline {\textrm {BF}}}=\sum _{i=1}^{L}\mu _{i}\,{\textrm {BF}}_{i}=\sum _{i=1}^{L}\mu _{i}\,\xi _{i}\,p_{i}^{\xi _{i}-1}\approx {\bar {\xi }}\sum _{i=1}^{L}w_{i}\,p_{i}^{-1}={\frac {\bar {\xi }}{\overset {\circ }{p}}},$ where

${\textstyle \mu _{i}}$ is the prior probability of alternative hypothesis ${\textstyle i,}$ such that ${\textstyle \sum _{i=1}^{L}\mu _{i}=1,}$
${\textstyle \xi _{i}/(1+\xi _{i})}$ is the expected value of ${\textstyle p_{i}}$ under alternative hypothesis ${\textstyle i,}$
${\textstyle w_{i}=u_{i}/{\bar {\xi }}}$ is the weight attributed to p-value ${\textstyle i,}$
${\textstyle u_{i}=\left(\mu _{i}\,\xi _{i}\right)^{1/(1-\xi _{i})}}$ incorporates the prior model probabilities and powers into the weights, and
${\textstyle {\bar {\xi }}=\sum _{i=1}^{L}u_{i}}$ normalizes the weights.

The approximation works best for well-powered tests ($\xi _{i}\ll 1$).
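
The quality of the approximation can be checked numerically. The sketch below builds the weights from prior probabilities and Beta parameters as defined above, then compares the exact model-averaged Bayes factor with its HMP approximation. All input values are made up for illustration and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def calibrated_weights(mu: Sequence[float],
                       xi: Sequence[float]) -> Tuple[List[float], float]:
    """Return (weights w_i, normalizer xi_bar) with u_i = (mu_i xi_i)^(1/(1-xi_i))."""
    u = [(m * x) ** (1.0 / (1.0 - x)) for m, x in zip(mu, xi)]
    xi_bar = sum(u)
    return [ui / xi_bar for ui in u], xi_bar


def exact_model_averaged_bf(mu: Sequence[float], xi: Sequence[float],
                            p: Sequence[float]) -> float:
    """Sum of mu_i * xi_i * p_i^(xi_i - 1), using Beta(xi_i, 1) alternatives."""
    return sum(m * x * pi ** (x - 1.0) for m, x, pi in zip(mu, xi, p))


def hmp_approximation(mu: Sequence[float], xi: Sequence[float],
                      p: Sequence[float]) -> float:
    """xi_bar divided by the HMP computed with the calibrated weights."""
    w, xi_bar = calibrated_weights(mu, xi)
    hmp = 1.0 / sum(wi / pi for wi, pi in zip(w, p))  # weights sum to one
    return xi_bar / hmp


# Illustrative inputs: three well-powered tests (all xi_i much less than 1).
mu = [0.5, 0.3, 0.2]
xi = [0.01, 0.02, 0.01]
p = [1e-4, 0.03, 0.5]
exact = exact_model_averaged_bf(mu, xi, p)
approx = hmp_approximation(mu, xi, p)
```

With these inputs the two quantities agree to within a few percent. Raising the values of `xi` towards one widens the gap, consistent with the remark that the approximation works best for well-powered tests.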

The harmonic mean p-value as a bound on the Bayes factor

For likelihood ratio tests with exactly two degrees of freedom, Wilks' theorem implies that ${\textstyle p_{i}=1/R_{i}}$ , where ${\textstyle R_{i}}$ is the maximized likelihood ratio in favour of alternative hypothesis ${\textstyle i,}$ and therefore ${\textstyle {\overset {\circ }{p}}=1/{\bar {R}}}$ , where ${\textstyle {\bar {R}}}$ is the weighted mean maximized likelihood ratio, using weights ${\textstyle w_{1},\dots ,w_{L}.}$ Since ${\textstyle R_{i}}$ is an upper bound on the Bayes factor, ${\textstyle {\textrm {BF}}_{i}}$ , then ${\textstyle 1/{\overset {\circ }{p}}}$ is an upper bound on the model-averaged Bayes factor: ${\overline {\textrm {BF}}}\leq {\frac {1}{\overset {\circ }{p}}}.$ While the equivalence holds only for two degrees of freedom, the relationship between ${\textstyle {\overset {\circ }{p}}}$ and ${\textstyle {\bar {R}},}$ and therefore ${\textstyle {\overline {\textrm {BF}}},}$ behaves similarly for other degrees of freedom.^[2]
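
The two-degrees-of-freedom identity can be spelled out in one line, assuming the usual large-sample form of Wilks' theorem in which ${\textstyle 2\log R_{i}}$ follows a chi-squared distribution with two degrees of freedom under the null hypothesis. Because the survival function of that distribution is ${\textstyle {\textrm {e}}^{-x/2}}$, $p_{i}=\Pr \left(\chi _{2}^{2}\geq 2\log R_{i}\right)={\textrm {e}}^{-{\frac {1}{2}}\cdot 2\log R_{i}}={\frac {1}{R_{i}}}.$ Substituting into the definition of the HMP, and using ${\textstyle \sum _{i=1}^{L}w_{i}=1}$, gives ${\overset {\circ }{p}}={\frac {1}{\sum _{i=1}^{L}w_{i}/p_{i}}}={\frac {1}{\sum _{i=1}^{L}w_{i}\,R_{i}}}={\frac {1}{\bar {R}}}.$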

Under the assumption that the distributions of the p-values under the alternative hypotheses follow Beta distributions with parameters $\left(1,\kappa _{i}>1\right),$ and that the weights $w_{i}=\mu _{i},$ the HMP provides a tighter upper bound on the model-averaged Bayes factor: ${\overline {\textrm {BF}}}\leq {\frac {1}{e\,{\overset {\circ }{p}}}},$ a result that again reproduces the inverse proportionality of Good's empirical relationship.^[16]

References

[1] Good, I J (1958). "Significance tests in parallel and in series". Journal of the American Statistical Association. 53 (284): 799–813. doi:10.1080/01621459.1958.10501480. JSTOR 2281953.
[2] Wilson, D J (2019). "The harmonic mean p-value for combining dependent tests". Proceedings of the National Academy of Sciences USA. 116 (4): 1195–1200. doi:10.1073/pnas.1814092116. PMC 6347718. PMID 30610179.
[3] Vovk, Vladimir; Wang, Ruodu (April 25, 2019). "Combining p-values via averaging" (PDF). Algorithmic Learning in a Random World.
[4] Goeman, Jelle J.; Rosenblatt, Jonathan D.; Nichols, Thomas E. (2019-11-19). "The harmonic mean p-value: Strong versus weak control, and the assumption of independence". Proceedings of the National Academy of Sciences. 116 (47): 23382–23383. doi:10.1073/pnas.1909339116. ISSN 0027-8424. PMC 6876242. PMID 31662466.
[5] Fisher, R A (1934). Statistical Methods for Research Workers (5th ed.). Edinburgh, UK: Oliver and Boyd.
[6] Benjamini Y, Hochberg Y (1995). "Controlling the false discovery rate: A practical and powerful approach to multiple testing". Journal of the Royal Statistical Society. Series B (Methodological). 57 (1): 289–300. doi:10.1111/j.2517-6161.1995.tb02031.x. JSTOR 2346101.
[7] Marcus R, Peritz E, Gabriel KR (1976). "On closed testing procedures with special reference to ordered analysis of variance". Biometrika. 63 (3): 655–660. doi:10.1093/biomet/63.3.655. JSTOR 2335748.
[8] Wilson, Daniel J (August 17, 2019). "Updated correction to "The harmonic mean p-value for combining dependent tests"" (PDF).
[9] Good, I J (1984). "C192. One tail versus two-tails, and the harmonic-mean rule of thumb". Journal of Statistical Computation and Simulation. 19 (2): 174–176. doi:10.1080/00949658408810727.
[10] Good, I J (1984). "C193. Paired versus unpaired comparisons and the harmonic-mean rule of thumb". Journal of Statistical Computation and Simulation. 19 (2): 176–177. doi:10.1080/00949658408810728.
[11] Good, I J (1984). "C213. A sharpening of the harmonic-mean rule of thumb for combining tests "in parallel"". Journal of Statistical Computation and Simulation. 20 (2): 173–176. doi:10.1080/00949658408810770.
[12] Good, I J (1984). "C214. The harmonic-mean rule of thumb: Some classes of applications". Journal of Statistical Computation and Simulation. 20 (2): 176–179. doi:10.1080/00949658408810771.
[13] Good, Irving John (2009). Good thinking: the foundations of probability and its applications. Dover Publications. ISBN 9780486474380. OCLC 319491702.
[14] Sellke, Thomas; Bayarri, M. J; Berger, James O (2001). "Calibration of p Values for Testing Precise Null Hypotheses". The American Statistician. 55 (1): 62–71. doi:10.1198/000313001300339950. ISSN 0003-1305. S2CID 396772.
[15] Wilson, D J (2019). "Reply to Held: When is a harmonic mean p-value a Bayes factor?" (PDF). Proceedings of the National Academy of Sciences USA. 116 (13): 5857–5858. doi:10.1073/pnas.1902157116. PMC 6442550. PMID 30890643.
[16] Held, L (2019). "On the Bayesian interpretation of the harmonic mean p-value". Proceedings of the National Academy of Sciences USA. 116 (13): 5855–5856. doi:10.1073/pnas.1900671116. PMC 6442579. PMID 30890644.
