Replicability analysis of high dimensional data accounting for dependence
Abstract
Replicability is the cornerstone of scientific research. We study the replicability of data from high-throughput experiments, where tens of thousands of features are examined simultaneously. Existing replicability analysis methods either ignore the dependence among features or impose strong modeling assumptions, producing overly conservative or overly liberal results. Based on $p$-values from two studies, we use a four-state hidden Markov model to capture the structure of local dependence. Our method effectively borrows information from different features and studies while accounting for dependence among features and heterogeneity across studies. We show that the proposed method has better power than competing methods while controlling the false discovery rate, both empirically and theoretically. Analyzing datasets from genome-wide association studies reveals new biological insights that otherwise cannot be obtained by using existing methods.
Keywords: False discovery rate; hidden Markov model; high dimensional replicability analysis; non-parametric maximum likelihood estimation.
1 Introduction
Replicability is the cornerstone of modern scientific research. Consistent findings at different times, places, and populations provide stronger scientific evidence. We study conceptual replicability, where consistent results are obtained using different procedures and populations that target the same scientific question. In high-throughput experiments, non-biological factors, such as batch effects, may confound signals in a single study. If a signal is replicable across multiple studies, the result is unlikely to be due to chance or confounders, which strengthens the evidence from any single study. To focus on the main ideas, we study replicability analysis of high dimensional data from two studies.
In high-throughput experiments, where tens of thousands of features are examined simultaneously, an acute problem is multiple comparisons. Compared with multiple testing in a single study, the null hypothesis for replicability analysis is composite. We regard a feature as replicable if it is non-null in both studies. The composite null hypothesis for replicability analysis therefore consists of three states: the feature is null in both studies, null in the first study and non-null in the second, or non-null in the first study and null in the second.
An ad hoc approach for replicability analysis of high dimensional data is to implement a multiple testing procedure, for instance the Benjamini and Hochberg procedure (Benjamini and Hochberg, 1995), for each study and take the intersection of discoveries from all studies. This commonly used and intuitive procedure does not control the false discovery rate and has low power because it does not borrow information from different studies (Bogomolov and Heller, 2023). As a conservative alternative, Benjamini et al. (2009) propose to use the maximum of $p$-values from different studies as a test statistic and implement the Benjamini and Hochberg procedure afterward. To improve power, Lyu et al. (2023) estimate the proportions of the composite null states and obtain a better approximation of the cumulative distribution function of the maximum of $p$-values, yet no theoretical results are provided. Li et al. (2011) propose a reproducible discovery rate and graphical tools to assess replicability. Philtron et al. (2018) use the maximum rank of each feature to assess replicability non-parametrically. The procedures proposed by Li et al. (2011) and Philtron et al. (2018) both assume that the two studies share the same states in the sense that they are both null or both non-null. This assumption is strong and does not incorporate the heterogeneity of different studies. Bogomolov and Heller (2018) first pre-screen the $p$-values and use a cross-screening strategy to borrow information from two studies. An empirical Bayes approach was proposed by Heller and Yekutieli (2014), where unknown functions are estimated parametrically. Hung and Fithian (2020) discuss different criteria for replicability analysis. We refer the readers to Bogomolov and Heller (2023) for a recent survey on this topic.
To the best of our knowledge, high dimensional replicability analysis accounting for dependence has not been studied before. In many high dimensional datasets, dependence among features is the norm rather than the exception. For instance, genome-wide association studies (GWAS) data exhibit linkage disequilibrium among single nucleotide polymorphisms (SNPs), where alleles at nearby sites can co-occur on the same haplotype more often than by chance alone. As a result, it is common to observe that phenotype-associated SNPs form clusters (Pritchard and Przeworski, 2001). An effective strategy for dependence modeling is through the hidden Markov model (Li and Stephens, 2003; Sesia et al., 2021). Under the hidden Markov model, consistency of the maximum likelihood estimator is shown by Leroux (1992) under parametric assumptions on the density functions of the mixture model. Bickel et al. (1998) further show the asymptotic normality of the estimated parameters. Recently, Alexandrovich et al. (2016) give nonparametric identification and maximum likelihood estimation for finite-state hidden Markov models, and Abraham et al. (2022) show the optimal minimax rate of the supremum-norm convergence of preliminary estimators of the emission densities of the hidden Markov model. With one dataset, the existing false discovery rate control procedures accounting for dependence through the hidden Markov model use Gaussian mixtures to estimate the non-null density function (Sun and Cai, 2009; Abraham et al., 2022). These procedures cannot be extended to the replicability analysis of two studies.
In this paper, we develop a robust, efficient, and computationally scalable high dimensional replicability analysis method without any tuning parameters. Our method only requires paired $p$-values from two studies as the input and does not require the availability of individual-level data, access to which may be prohibitive due to privacy concerns or logistics. We use a four-state hidden Markov model to account for the heterogeneity of different states. Conditional on the hidden states, we assume that the paired $p$-values follow a four-group mixture model (Efron, 2012). We do not assume signals from different studies have the same effect size and account for such heterogeneity by modeling the non-null density functions of $p$-values from the two studies separately. In addition, to have robust inference, we do not impose parametric assumptions on the non-null density functions of $p$-values and develop a non-parametric maximum likelihood estimation procedure. Computationally, we combine the forward-backward algorithm (Baum et al., 1970), the EM algorithm (Dempster et al., 1977), and the pool-adjacent-violator-algorithm (Robertson et al., 1988) to estimate unknown parameters and unknown functions. Theoretically, we show consistency of the estimated parameters and functions under minimal assumptions, and asymptotic false discovery rate control using the proposed estimation method.
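As a point of reference for the baselines discussed in this section, the following minimal sketch implements the Benjamini and Hochberg step-up procedure, the ad hoc intersection rule, and the MaxP rule, which applies the Benjamini and Hochberg procedure to the feature-wise maximum of the two $p$-values. The function names and the toy $p$-values are illustrative assumptions and are not taken from any package.

```python
def bh_reject(pvals, alpha):
    """Benjamini-Hochberg step-up procedure; returns one boolean per hypothesis."""
    m = len(pvals)
    order = sorted(range(m), key=lambda j: pvals[j])
    # Find the largest rank r (1-based) with p_(r) <= r * alpha / m.
    cutoff = 0
    for rank, j in enumerate(order, start=1):
        if pvals[j] <= rank * alpha / m:
            cutoff = rank
    reject = [False] * m
    for j in order[:cutoff]:
        reject[j] = True
    return reject


def adhoc_reject(p1, p2, alpha):
    """Ad hoc rule: run BH in each study and intersect the discoveries."""
    r1, r2 = bh_reject(p1, alpha), bh_reject(p2, alpha)
    return [a and b for a, b in zip(r1, r2)]


def maxp_reject(p1, p2, alpha):
    """MaxP rule: run BH on the feature-wise maximum of the two p-values."""
    return bh_reject([max(a, b) for a, b in zip(p1, p2)], alpha)


# Toy example with made-up p-values for four features.
p1 = [0.001, 0.20, 0.004, 0.70]
p2 = [0.002, 0.001, 0.030, 0.60]
adhoc = adhoc_reject(p1, p2, alpha=0.05)
maxp = maxp_reject(p1, p2, alpha=0.05)
```

On this toy input the intersection rule flags two features while MaxP flags one, which illustrates why MaxP is described as the conservative alternative.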
2 Methodology
2.1 Notations and model set-up
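Suppose that we have $p$-values of $m$ hypotheses from two studies, $(p_{1j}, p_{2j})$, $j = 1, \ldots, m$. These $p$-values can be obtained from the marginal association of each SNP with a phenotype in different populations. We are interested in identifying replicable SNPs associated with the phenotype in both studies. We use GWAS as a motivating and illustrative example and remark that our method is general and can be applied in other settings. Let $\theta_{ij}$ denote the hidden state of the $j$th SNP in study $i$, where $\theta_{ij} = 1$ indicates association of the $j$th SNP in study $i$ and $\theta_{ij} = 0$ otherwise, $i = 1, 2$; $j = 1, \ldots, m$. We use $s_j \in \{0, 1, 2, 3\}$ to indicate the four possible values of the joint association status $(\theta_{1j}, \theta_{2j}) \in \{(0,0), (0,1), (1,0), (1,1)\}$. The replicability null hypothesis is composite with
\[ H_{0j}: s_j \in \{0, 1, 2\}. \tag{2.1} \]
To capture the local dependence structure, we assume that $(s_j)_{j=1}^{m}$ follows a four-state stationary, irreducible, and aperiodic Markov chain. The transition probabilities are $a_{kl} = P(s_{j+1} = l \mid s_j = k)$, where $k, l \in \{0, 1, 2, 3\}$, with constraint $\sum_{l=0}^{3} a_{kl} = 1$. The stationary distribution of state $k$ is $\pi_k = P(s_j = k)$, where $\sum_{k=0}^{3} \pi_k = 1$. Denote $A = (a_{kl})$ as the transition probability matrix and $\pi = (\pi_0, \pi_1, \pi_2, \pi_3)$ as the vector of stationary probabilities. Since the Markov chain is stationary, we have $\pi A = \pi$. The convergence theorem of a Markov chain (Theorem 5.5.1 in Durrett (2019)) implies that $m^{-1} \sum_{j=1}^{m} I(s_j = k) \to \pi_k$ almost surely for $k = 0, 1, 2, 3$ as $m \to \infty$. Conditional on the hidden states, we model the probability density function of $p$-values by a mixture model. Specifically,
\[ p_{1j} \mid \theta_{1j} \sim (1 - \theta_{1j}) f_0 + \theta_{1j} f_1, \qquad p_{2j} \mid \theta_{2j} \sim (1 - \theta_{2j}) f_0 + \theta_{2j} f_2, \tag{2.2} \]
where $f_0$ is the probability density function of $p$-values when $\theta_{ij} = 0$, and $f_1$ and $f_2$ are the probability density functions of $p$-values when $\theta_{1j} = 1$ and $\theta_{2j} = 1$, respectively. We assume that $f_0$ is the standard uniform density and impose the following monotone likelihood ratio condition (Sun and Cai, 2007; Cao et al., 2013):
\[ f_1(x)/f_0(x) \ \text{and} \ f_2(x)/f_0(x) \ \text{are non-increasing in } x. \tag{2.3} \]
This condition is natural as small $p$-values indicate evidence against the null. The paired $p$-values are assumed to be conditionally independent given the hidden states. Based on (2.2), we have $f^{(1)}_0 = f^{(1)}_1 = f_0$, $f^{(1)}_2 = f^{(1)}_3 = f_1$ and $f^{(2)}_0 = f^{(2)}_2 = f_0$, $f^{(2)}_1 = f^{(2)}_3 = f_2$ for $p_{1j}$ and $p_{2j}$, respectively, where $f^{(i)}_k$ is the conditional probability density function of $p_{ij}$ given state $s_j = k$, $i = 1, 2$; $k = 0, 1, 2, 3$.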
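To make the stationarity condition concrete, the following sketch computes the stationary probabilities of a four-state transition matrix by power iteration and decodes each joint state into the pair of study-level states. The matrix entries are made-up numbers, and the encoding 0 = (0,0), 1 = (0,1), 2 = (1,0), 3 = (1,1) is an assumption adopted for illustration.

```python
def stationary_distribution(A, n_iter=10000, tol=1e-13):
    """Power iteration for the row vector pi that satisfies pi = pi A."""
    K = len(A)
    pi = [1.0 / K] * K
    for _ in range(n_iter):
        new = [sum(pi[k] * A[k][l] for k in range(K)) for l in range(K)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


def decode_state(s):
    """Map a joint state s in {0, 1, 2, 3} to (theta1, theta2) under the assumed encoding."""
    return (s // 2, s % 2)


# Illustrative transition matrix (rows sum to one); not estimated from any data.
A = [
    [0.94, 0.02, 0.02, 0.02],
    [0.60, 0.30, 0.05, 0.05],
    [0.60, 0.05, 0.30, 0.05],
    [0.40, 0.10, 0.10, 0.40],
]
pi = stationary_distribution(A)
```

Power iteration is used here only because it needs no linear algebra library; solving the eigenvector problem directly would serve equally well.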
2.2 Estimation
We use to denote the collection of unknown parameters and unknown functions, with the true value denoted as . The likelihood function for is
The maximum-likelihood estimate is defined as
(2.4) |
where is the parameter space of Given , define the forward probability as and the backward probability as , respectively. The forward-backward procedure (Baum et al., 1970) is used in the calculation. Specifically, we first initialize and We can obtain and for recursively by and similarly, Define two posterior probabilities as and . By definition, we have . They can be obtained from the forward and backward probabilities through
and
The likelihood function of the complete data is given by
We use the EM algorithm (Dempster et al., 1977) in combination with the pool-adjacent-violator-algorithm (Robertson et al., 1988) to estimate the unknown parameters and functions. With an appropriate initialization, the EM algorithm proceeds by iterating the following two steps. E-step: Given the current estimate and the corresponding forward and backward probabilities, we can obtain the posterior probabilities. The conditional expectation of the log-likelihood function is
M-step: Update by
By using the Lagrange multiplier, we can calculate and each element of with as
and
The two functions can be updated by
(2.5) |
and
(2.6) |
where is the space of non-increasing density functions with support . We iterate between the E-step and M-step until the algorithm converges. Next, we provide details to solve (2.5). We first order the $p$-values as . Denote and let correspond to . Denote and . Let be the space of . We aim to find
Using the Lagrangian multiplier, the objective function we want to maximize becomes
Taking derivatives with respect to and we have
To incorporate the non-increasing constraint on we have
Let and
We have
which can be obtained by the pool-adjacent-violator-algorithm. According to Theorem 3.1 of Barlow and Brunk (1972), we have , . The calculation of in (2.6) follows the same line and we omit the details.
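The monotone M-step can be illustrated with a weighted version of the classical estimator of a non-increasing density. The sketch below assumes that the update maximizes a weighted log-likelihood over step functions that are constant between consecutive ordered $p$-values; under that assumption the solution is the non-increasing weighted least-squares fit of the raw slopes (weight divided by gap), with the gaps as weights, which pooling of adjacent violators computes. The function names, the posterior weights `post`, and the toy inputs are hypothetical, and distinct, strictly positive $p$-values are assumed.

```python
def pava_nonincreasing(values, weights):
    """Weighted least-squares fit of a non-increasing sequence by pooling violators."""
    blocks = []  # each block holds [weighted mean, total weight, number of points]
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # Merge the last two blocks while the later mean exceeds the earlier one.
        while len(blocks) > 1 and blocks[-2][0] < blocks[-1][0]:
            v2, w2, n2 = blocks.pop()
            v1, w1, n1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    fit = []
    for v, _, n in blocks:
        fit.extend([v] * n)
    return fit


def decreasing_density(pvals, post):
    """Step density on (0, max(pvals)] that is non-increasing and integrates to one."""
    pairs = sorted(zip(pvals, post))
    total = sum(w for _, w in pairs)
    gaps, raw = [], []
    prev = 0.0
    for p, w in pairs:
        gap = p - prev  # assumes distinct, strictly positive p-values
        gaps.append(gap)
        raw.append(w / total / gap)  # unconstrained height on (prev, p]
        prev = p
    heights = pava_nonincreasing(raw, gaps)
    return [p for p, _ in pairs], heights, gaps


# Toy example: the unconstrained heights (0.625, 2.5, 1.0) violate monotonicity.
knots, heights, gaps = decreasing_density([0.4, 0.5, 1.0], [1.0, 1.0, 2.0])
```

Pooling preserves the weighted sum of heights, so the fitted step function remains a density without a separate normalization step.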
2.3 Testing
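The test statistic we use is the replicability local index of significance (rLIS), which is defined as the posterior probability of being null given the observed paired $p$-values. Specifically, with $s_j$ the joint hidden state of the $j$th feature and $s_j = 3$ the replicable state,
\[ \mathrm{rLIS}_j = P\{ s_j \in \{0, 1, 2\} \mid (p_{1j'}, p_{2j'})_{j'=1}^{m} \}. \]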
Denote as the indicator function of event , i.e., if is true and otherwise. With a rejection threshold , is rejected if . We can write the total number of rejections as
The number of false rejections is
The law of total expectation gives
(2.7) |
We aim to control the false discovery rate at a pre-specified level where
The false discovery proportion is defined as
By (2.7),
In the oracle case, we assume is known. With nominal FDR level , we can apply the following testing procedure.
(2.8) |
Let be the ordered and be the corresponding hypotheses. Assume there are rejections, which means . Thus, the rejection criterion (2.8) is equivalent to the following step-up procedure
(2.9) | ||||
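A minimal sketch of a step-up rule of this kind is given below. It assumes the form commonly used with local index of significance statistics (Sun and Cai, 2009): sort the rLIS values in increasing order and reject the hypotheses with the $R$ smallest values, where $R$ is the largest $r$ such that the average of the $r$ smallest values is at most the nominal level. The function name and the toy values are illustrative.

```python
def rlis_step_up(rlis, alpha):
    """Reject the R smallest rLIS, R = max{r : mean of the r smallest rLIS <= alpha}."""
    order = sorted(range(len(rlis)), key=lambda j: rlis[j])
    running_sum, n_reject = 0.0, 0
    for r, j in enumerate(order, start=1):
        running_sum += rlis[j]
        # The running mean estimates the false discovery proportion among r rejections.
        if running_sum / r <= alpha:
            n_reject = r
    reject = [False] * len(rlis)
    for j in order[:n_reject]:
        reject[j] = True
    return reject


# Toy example with made-up rLIS values.
decisions = rlis_step_up([0.01, 0.9, 0.03, 0.2, 0.08], alpha=0.05)
```

Here the running means of the sorted values are 0.01, 0.02, 0.04, 0.08, and 0.244, so three hypotheses are rejected at level 0.05.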
We can write rLIS in terms of forward and backward probabilities. Specifically, we have
With the maximum likelihood estimator (2.4), we can estimate and plug the estimators in the step-up procedure (2.9). Its validity is shown in Theorem 3.2 in Section 3.
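The computation of rLIS from forward and backward probabilities can be sketched as follows. The code uses the standard scaled forward-backward recursions for a finite-state hidden Markov model and takes rLIS as one minus the posterior probability of the replicable state. The state encoding 0 = (0,0), 1 = (0,1), 2 = (1,0), 3 = (1,1), the uniform null density, the toy non-null density, and all numerical values are assumptions made for illustration.

```python
def emission(p1, p2, f1, f2):
    """Joint density of a p-value pair under states 0..3 with a uniform null density."""
    return [1.0, f2(p2), f1(p1), f1(p1) * f2(p2)]


def forward_backward(emis, A, pi):
    """Posterior probabilities gamma[j][k] = P(s_j = k | all observations).

    emis[j][k] is the density of observation j under state k. Each forward
    step is rescaled to sum to one, which avoids underflow on long chains.
    """
    m, K = len(emis), len(pi)
    alpha = [[0.0] * K for _ in range(m)]
    beta = [[1.0] * K for _ in range(m)]
    scale = [0.0] * m
    for j in range(m):
        for k in range(K):
            if j == 0:
                prior = pi[k]
            else:
                prior = sum(alpha[j - 1][l] * A[l][k] for l in range(K))
            alpha[j][k] = prior * emis[j][k]
        scale[j] = sum(alpha[j])
        alpha[j] = [a / scale[j] for a in alpha[j]]
    for j in range(m - 2, -1, -1):
        for k in range(K):
            total = sum(A[k][l] * emis[j + 1][l] * beta[j + 1][l] for l in range(K))
            beta[j][k] = total / scale[j + 1]
    gamma = []
    for j in range(m):
        g = [alpha[j][k] * beta[j][k] for k in range(K)]
        tot = sum(g)
        gamma.append([x / tot for x in g])
    return gamma


def rlis_from_posteriors(gamma):
    """rLIS_j: posterior probability that feature j is not in the replicable state 3."""
    return [1.0 - g[3] for g in gamma]


def toy_density(p):
    """Toy non-increasing non-null density on [0, 1]; an assumption, not an estimate."""
    return 2.0 * (1.0 - p)


pairs = [(0.01, 0.02), (0.50, 0.90), (0.03, 0.60)]
emis = [emission(p1, p2, toy_density, toy_density) for p1, p2 in pairs]
pi = [0.7, 0.1, 0.1, 0.1]
A = [[0.85, 0.05, 0.05, 0.05], [0.5, 0.3, 0.1, 0.1], [0.5, 0.1, 0.3, 0.1], [0.3, 0.1, 0.1, 0.5]]
gamma = forward_backward(emis, A, pi)
rlis = rlis_from_posteriors(gamma)
```

In practice the estimated parameters and density functions would replace the toy inputs, and the resulting rLIS values would be passed to the step-up procedure.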
3 Theory
3.1 Notations and identifiability
Recall that corresponds to for , denotes the stationary probabilities of , and is its transition probability matrix. denotes the probability density function of given and denotes the probability density function of given , . Since the hidden Markov model is stationary, we have , i.e., is the eigenvector of with the corresponding eigenvalue . When , is uniquely determined by , where is a -dimensional identity matrix. Let be the parameter space of , where
where is a space of non-increasing probability density functions with support . For , let be the set of discontinuous points of . For any , let be the left limit of at and be the right limit of at . Since is non-increasing and is a point of discontinuity, we know and exist and . Then there exists a rational number in the open interval . Thus, we can find an injection from rational numbers to , which means that , the set of discontinuous points of , has at most countable points. Consequently, is continuous almost everywhere. By the same token, is also continuous almost everywhere. For any and in , denote the distance between them as
(3.1) |
where denotes the norm of a vector, denotes the Frobenius norm of a matrix, and denotes the Hellinger distance of two density functions, where Under the distance in (3.1), we can obtain identifiability of elements in . Identifiability means that implies almost everywhere, where and . When , it follows that , , and . Moreover, implies that almost everywhere. By the same token, implies almost everywhere.
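For intuition about the Hellinger component of the distance, the following sketch approximates the Hellinger distance between two densities on the unit interval with a midpoint rule. It assumes the convention $H^2(f, g) = \frac{1}{2} \int (\sqrt{f} - \sqrt{g})^2$; other normalizations differ only by a constant factor. The function name and the example densities are illustrative.

```python
import math


def hellinger(f, g, n=20000):
    """Midpoint-rule approximation of the Hellinger distance between densities on [0, 1]."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        total += (math.sqrt(f(x)) - math.sqrt(g(x))) ** 2
    return math.sqrt(0.5 * total * h)


def uniform(x):
    """Standard uniform density."""
    return 1.0


def triangular(x):
    """A non-increasing density on [0, 1], used only as an example."""
    return 2.0 * (1.0 - x)


d_same = hellinger(uniform, uniform)
d_diff = hellinger(uniform, triangular)
```

For these two densities the squared distance has the closed form $1 - 2\sqrt{2}/3$, which gives a quick check of the numerical approximation.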
Proposition 1.
is compact with respect to distance defined in (3.1).
The proof of Proposition 1 is in the Supplementary Materials. This result is needed in the consistency proof of unknown parameters and unknown functions in the next subsection.
3.2 Consistency of the maximum likelihood estimation
We need the following conditions to show the consistency of the maximum likelihood estimator and asymptotic false discovery rate control. (C1) The transition probability matrix is irreducible and , the true parameter, is an interior point of . (C2) There exist and such that for any satisfying , we have , and for , where is a pre-specified false discovery rate level. Moreover, for
we have
(C3) . (C4) for some and all , where . (C5) There exists such that for all , where
(C1) ensures that the hidden states are irreducible, which guarantees that the distribution of the observed data is ergodic. Since is an interior point of , the compactness of can be used to show that in probability as . (C2) bounds the stationary probabilities and elements in the transition probabilities away from 0 when is close to . In addition, it requires a lower bound on the non-null probability density function of $p$-values near which is mild for a non-increasing probability density function in (C3) is a technical assumption (Leroux, 1992). (C4) guarantees the existence of generalized Kullback–Leibler divergence of two distributions indexed by and where is in a small neighborhood of . (C5) requires that the ratio of density functions for any two hidden states is finite with positive probability.
Theorem 3.1.
Assume (C1)-(C5) hold. Let be the true parameter value and be the maximum likelihood estimator defined in (2.4). Then converges to in probability.
The proof of Theorem 3.1 is relegated to the Supplementary Materials. Denote the shift operator of a hidden Markov model as . Let be any shift-invariant set, which means if and only if . The distribution of is defined to be ergodic if or for any shift-invariant set . Lemma 1 in Leroux (1992) shows that under (C1), are ergodic. Therefore, we can apply Birkhoff’s ergodic theorem (Birkhoff, 1931): For any , where exists, we have
Theorem 2 in Leroux (1992) gives the definitions of entropy and generalized Kullback–Leibler divergence when the probability density functions under different hidden states are parametric under the hidden Markov model. We extend their results to the setting where the probability density functions under different hidden states are estimated nonparametrically. Specifically, assume (C1) and (C4) hold. For , there is a constant such that
1. as ;
2. almost surely as .
Moreover, assume (C1)-(C4) hold. For every , . If , . With the compactness of the parameter space and non-negativeness of the generalized Kullback–Leibler divergence , we obtain the consistency of the maximum likelihood estimator in Theorem 3.1.
3.3 Asymptotic false discovery rate control
With the maximum likelihood estimator , we can estimate the forward probabilities by and the backward probabilities by for and The estimates can be obtained as follows.
We calculate the test statistic
(3.2) |
Order the test statistics as with corresponding replicability null hypotheses . For a pre-specified false discovery rate level , we have the following procedure.
(3.3) |
Theorem 3.2 shows asymptotic false discovery rate control at level with the maximum likelihood estimator specified in (2.4) and the testing procedure (3.3).
Theorem 3.2.
Assume the paired $p$-values and the joint hidden states follow a four-state hidden Markov model defined with true parameter . Conditional on the hidden states, the paired $p$-values follow a mixture model specified in (2.2). Assume the probability density function of null $p$-values is the standard uniform density and (2.3) holds. If (C1)-(C5) hold, procedure (3.3) controls the false discovery rate asymptotically at level .
4 Simulations
We conduct simulation studies to evaluate the finite sample performance of the proposed method in terms of false discovery rate control and power comparison. The data generating process is as follows. We set the total number of tests We first simulate a Markov chain with the stationary probability . For and , we have the corresponding transition probability matrices
where and the stationary property holds. We generate from the stationary probability as for . For each , we generate based as for . Based on the definition of , the hidden states of two studies, , can be obtained from the value of , . Let denote the normal distribution with mean and variance . For simplicity, we directly simulate the observed test statistics conditional on for and . Specifically, , where denotes the signal strength in study . One-sided $p$-values are calculated by . We compare the proposed method with replicability analysis methods that do not account for the dependence structure among hypotheses, including the ad hoc BH, MaxP, radjust, JUMP, and STAREG. Detailed descriptions of these methods are provided in the Supplementary Materials. For each setting, the simulations are repeated times to calculate the empirical false discovery rate and statistical power of different methods with nominal false discovery rate level . We first set and let . We vary and the signal strengths and to evaluate the false discovery rate and power of different methods in different settings. The results are summarized in Figure 1. We observe that ad hoc BH cannot control the false discovery rate, and we exclude it from the power comparison. All other methods have valid false discovery rate control. MaxP, radjust, and JUMP are conservative and have low power. Our procedure has higher power than the other methods across all settings. The power gain is especially pronounced in the challenging weak signal scenario. The power increases with increased signal strengths for all methods.
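A data generating process of this type can be sketched as follows. The code assumes that the test statistic in study $i$ is normal with unit variance and mean equal to the signal strength when the feature is non-null and zero otherwise, that one-sided $p$-values are upper-tail normal probabilities, and that joint states are encoded as 0 = (0,0), 1 = (0,1), 2 = (1,0), 3 = (1,1). The helper `sticky_transition`, the persistence parameter `rho`, and all numerical values are hypothetical and are not the settings used in the paper.

```python
import math
import random


def sticky_transition(pi, rho):
    """Transition matrix rho * I + (1 - rho) * 1 pi^T, which has pi as stationary distribution."""
    K = len(pi)
    return [[(1.0 - rho) * pi[l] + (rho if k == l else 0.0) for l in range(K)] for k in range(K)]


def simulate_pairs(m, A, pi, mu1, mu2, seed=0):
    """Simulate joint hidden states and paired one-sided p-values."""
    rng = random.Random(seed)
    states = [rng.choices(range(4), weights=pi)[0]]
    for _ in range(m - 1):
        states.append(rng.choices(range(4), weights=A[states[-1]])[0])
    p1, p2 = [], []
    for s in states:
        theta1, theta2 = s // 2, s % 2  # assumed encoding of the joint state
        z1 = rng.gauss(mu1 * theta1, 1.0)
        z2 = rng.gauss(mu2 * theta2, 1.0)
        # Upper-tail probability 1 - Phi(z), computed with erfc for accuracy.
        p1.append(0.5 * math.erfc(z1 / math.sqrt(2.0)))
        p2.append(0.5 * math.erfc(z2 / math.sqrt(2.0)))
    return states, p1, p2


pi = [0.85, 0.05, 0.05, 0.05]
A = sticky_transition(pi, rho=0.8)
states, p1, p2 = simulate_pairs(200, A, pi, mu1=2.0, mu2=2.5, seed=1)
```

The sticky construction is one simple way to obtain clustered signals with a known stationary distribution; any irreducible and aperiodic transition matrix could be substituted.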
Figure 2 presents the FDR control and power comparison of different methods with varied FDR levels where and
We examine the performance of different methods with nominal FDR ranging from to . We use the diagonal lines with slope as references. Our procedure and STAREG can control FDR while ad hoc BH, MaxP, radjust and JUMP are too conservative. Our procedure has the highest power.
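The empirical false discovery rate and power reported in simulations of this kind are averages, over replications, of the false discovery proportion and the fraction of replicable features that are detected. A minimal sketch of the per-replication computation follows, assuming that joint state 3 denotes a feature that is non-null in both studies; the function name and toy inputs are illustrative.

```python
def fdp_and_power(reject, states):
    """False discovery proportion and power for one simulated dataset."""
    n_reject = sum(1 for r in reject if r)
    n_false = sum(1 for r, s in zip(reject, states) if r and s != 3)
    n_signal = sum(1 for s in states if s == 3)
    fdp = n_false / max(n_reject, 1)  # defined as zero when nothing is rejected
    power = (n_reject - n_false) / max(n_signal, 1)
    return fdp, power


# Toy example: three rejections, one of which is a composite null feature.
fdp, power = fdp_and_power([True, True, False, True], [3, 0, 3, 3])
```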
5 Data analysis
We illustrate our method by analyzing two GWAS datasets (Morris et al., 2012). A sex-differentiated meta-analysis was performed to test for the association of SNPs with type 2 diabetes. Type 2 diabetes occurs when blood glucose is too high. It affected approximately 329 million individuals in 2015 (Lipton et al., 2016). Identifying replicable SNPs that contribute to disease risk can be instrumental in a full understanding of disease biology and the development of therapeutics. In the first data set, there are type 2 diabetes cases and controls from the male population. In the second data set, there are type 2 diabetes cases and controls from the female population. We aim to find replicable SNPs that are associated with type 2 diabetes in both males and females. The datasets are downloaded from the DIAbetes Genetics Replication and Meta-analysis Consortium at https://www.diagram-consortium.org/downloads.html. The male group contains summary statistics of SNPs, and the female group contains summary statistics of SNPs. We analyze the paired $p$-values of SNPs that are common to both studies, where denote $p$-values for the male population and denote $p$-values for the female population. The estimated transition matrix is
and the stationary probability of different states is
which is the eigenvector of the estimated transition matrix corresponding to eigenvalue 1. The estimated probability density functions of non-null $p$-values for the two studies are plotted in Figure 3.
Next, we compare our method to the competing methods. The results of different methods at the FDR level are summarized in Figure 4. The MaxP procedure is the most conservative with findings, and all of them are identified by other methods. Our method has findings, among which are uniquely identified by our method.
Among these unique findings, are recorded in the NHGRI-EBI GWAS Catalog (Sollis et al., 2023), which reports associations with type 2 diabetes in published GWAS at SNP level.
Figure 5 presents the Manhattan plots of MaxP, STAREG and our method. In Figure 5, the vertical axes are transformations of test statistics for replicability analysis. They are for MaxP, for STAREG and for our method. Then we map the remaining SNPs to genes by R package snpGeneSets (Mei et al., 2016). The other SNPs are mapped to genes, many of which have been reported to be related to type 2 diabetes in previous literature. For example, JAZF1, CDC123, THADA, ADAMTS9-AS2 and NOTCH2 were reported to be associated with type 2 diabetes (Zeggini et al., 2008). JAZF1 is a key transcriptional regulator of ribosome biogenesis, global protein, and insulin translation and has a significant association with type 2 diabetes (Kobiita et al., 2020). 33 SNPs are mapped to JAZF1, such as rs10245867 (rLIS: 2.64e-06; Male $p$-value: 1.03e-08; Female $p$-value: 6.64e-05). ADAMTS9 can increase the risk of type 2 diabetes through impairment of insulin sensitivity (Graae et al., 2019). 25 SNPs are mapped to ADAMTS9, such as rs11914351 (rLIS: 3.71e-05; Male $p$-value: 8.53e-4; Female $p$-value: 5.70e-2). Increased expression of NOTCH2 may play a role in the pathogenesis of type 2 diabetes and may contribute to poor control of the glycemic state (Ghanem et al., 2020). 9 SNPs are mapped to NOTCH2, such as rs10127888 (rLIS: 4.44e-05; Male $p$-value: 2.82e-2; Female $p$-value: 1.52e-2).
6 Concluding remarks
In this paper, we propose a robust and powerful inference procedure for high-dimensional replicability analysis accounting for dependence. We work with summary statistics such as $p$-values from each study instead of the raw data since summary statistics are easier to access and store. We capture the local dependence of $p$-values by a hidden Markov model. We account for the heterogeneity of different studies by joint hidden states, allowing non-null density functions to have different distributions and estimating them non-parametrically. Furthermore, we obtain the identifiability condition of the unknown parameters and functions, consistency of estimated parameters and functions, and the asymptotic false discovery rate control. Simulation studies demonstrate valid false discovery rate control and higher power of our method. GWAS data analysis provides new biological insights that otherwise cannot be obtained using existing methods. For the maximum likelihood estimation with a hidden Markov model, theoretical results such as the rate of convergence are desirable, and we leave them as future work. We use two studies to illustrate the main ideas. In theory, our approach can be extended to more than two studies. In practice, for $n$ studies, the total number of possible states is $2^n$, which is computationally prohibitive for large $n$, and a new approach is warranted.
Acknowledgement
We thank Yan Li for her help with the simulation studies. This research is partially supported by NSF 2311249 and NIH 2UL1TR001427-5.
References
- Abraham et al. (2022) K. Abraham, I. Castillo, and E. Gassiat. Multiple testing in nonparametric hidden Markov models: An empirical Bayes approach. Journal of Machine Learning Research, 23(94):1–57, 2022.
- Alexandrovich et al. (2016) G. Alexandrovich, H. Holzmann, and A. Leister. Nonparametric identification and maximum likelihood estimation for hidden Markov models. Biometrika, 103(2):423–434, 2016.
- Barlow and Brunk (1972) R. Barlow and H. Brunk. The isotonic regression problem and its dual. Journal of the American Statistical Association, 67(337):140–147, 1972.
- Baum et al. (1970) L. E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41(1):164–171, 1970.
- Benjamini and Hochberg (1995) Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.
- Benjamini et al. (2009) Y. Benjamini, R. Heller, and D. Yekutieli. Selective inference in complex research. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906):4255–4271, 2009.
- Bickel et al. (1998) P. J. Bickel, Y. Ritov, and T. Ryden. Asymptotic normality of the maximum-likelihood estimator for general hidden Markov models. Annals of Statistics, 26(4):1614–1635, 1998.
- Birkhoff (1931) G. D. Birkhoff. Proof of the ergodic theorem. Proceedings of the National Academy of Sciences, 17(12):656–660, 1931.
- Bogomolov and Heller (2018) M. Bogomolov and R. Heller. Assessing replicability of findings across two studies of multiple features. Biometrika, 105(3):505–516, 2018.
- Bogomolov and Heller (2023) M. Bogomolov and R. Heller. Replicability across multiple studies. Statistical Science, 38(4):602–620, 2023.
- Cao et al. (2013) H. Cao, W. Sun, and M. R. Kosorok. The optimal power puzzle: scrutiny of the monotone likelihood ratio assumption in multiple testing. Biometrika, 100(2):495–502, 2013.
- Dempster et al. (1977) A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
- Durrett (2019) R. Durrett. Probability: theory and examples. Cambridge University Press, 2019.
- Efron (2012) B. Efron. Large-scale inference: empirical Bayes methods for estimation, testing, and prediction, volume 1. Cambridge University Press, 2012.
- Fekete (1923) M. Fekete. Über die verteilung der wurzeln bei gewissen algebraischen gleichungen mit ganzzahligen koeffizienten. Mathematische Zeitschrift, 17(1):228–249, 1923.
- Ghanem et al. (2020) Y. Ghanem, A. Ismail, R. Elsharkawy, R. Fathalla, and A. El Feky. Expression of Notch 2 and ABCC8 genes in patients with type 2 diabetes mellitus and their association with diabetic kidney disease. Clinical Diabetology, 9(5):306–312, 2020.
- Graae et al. (2019) A.-S. Graae, N. Grarup, R. Ribel-Madsen, S. H. Lystbaek, T. Boesgaard, H. Staiger, A. Fritsche, N. Wellner, K. Sulek, M. Kjolby, et al. ADAMTS9 regulates skeletal muscle insulin sensitivity through extracellular matrix alterations. Diabetes, 68(3):502–514, 2019.
- Heller and Yekutieli (2014) R. Heller and D. Yekutieli. Replicability analysis for genome-wide association studies. Annals of Applied Statistics, 8(1):481–498, 2014.
- Hung and Fithian (2020) K. Hung and W. Fithian. Statistical methods for replicability assessment. Annals of Applied Statistics, 14(3):1063–1087, 2020.
- Kobiita et al. (2020) A. Kobiita, S. Godbersen, E. Araldi, U. Ghoshdastider, M. W. Schmid, G. Spinas, H. Moch, and M. Stoffel. The diabetes gene JAZF1 is essential for the homeostatic control of ribosome biogenesis and function in metabolic stress. Cell Reports, 32(1), 2020.
- Leroux (1992) B. G. Leroux. Maximum-likelihood estimation for hidden Markov models. Stochastic Processes and Their Applications, 40(1):127–143, 1992.
- Li and Stephens (2003) N. Li and M. Stephens. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics, 165(4):2213–2233, 2003.
- Li et al. (2011) Q. Li, J. B. Brown, H. Huang, and P. J. Bickel. Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics, 5(3):1752–1779, 2011.
- Li et al. (2023) Y. Li, X. Zhou, R. Chen, X. Zhang, and H. Cao. STAREG: an empirical Bayesian approach to detect replicable spatially variable genes in spatial transcriptomic studies. bioRxiv 10.1101/2023.05.30.542607, 2023.
- Lipton et al. (2016) R. Lipton, T. Schwedt, B. Friedman, et al. Global, regional, and national incidence, prevalence, and years lived with disability for 310 diseases and injuries, 1990-2015: a systematic analysis for the Global Burden of Disease Study 2015. Lancet, 388(10053):1545–1602, 2016.
- Lyu et al. (2023) P. Lyu, Y. Li, X. Wen, and H. Cao. JUMP: replicability analysis of high-throughput experiments with applications to spatial transcriptomic studies. Bioinformatics, 39(6):btad366, 2023.
- Mei et al. (2016) H. Mei, L. Li, F. Jiang, J. Simino, M. Griswold, T. Mosley, and S. Liu. snpGeneSets: an R package for genome-wide study annotation. G3: Genes, Genomes, Genetics, 6(12):4087–4095, 2016.
- Morris et al. (2012) A. Morris, B. Voight, T. Teslovich, T. Ferreira, A. Segré, et al. Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nature Genetics, 44(9):981–990, 2012.
- Philtron et al. (2018) D. Philtron, Y. Lyu, Q. Li, and D. Ghosh. Maximum rank reproducibility: a nonparametric approach to assessing reproducibility in replicate experiments. Journal of the American Statistical Association, 113(523):1028–1039, 2018.
- Pritchard and Przeworski (2001) J. K. Pritchard and M. Przeworski. Linkage disequilibrium in humans: models and data. The American Journal of Human Genetics, 69(1):1–14, 2001.
- Riesz (1928) F. Riesz. Sur la convergence en moyenne. Acta Sci. Math, 4(1):58–64, 1928.
- Robertson et al. (1988) T. Robertson, R. L. Dykstra, and F. T. Wright. Order restricted statistical inference. In Wiley Series in Probability and Mathematical Statistics. John Wiley and Sons, 1988.
- Sesia et al. (2021) M. Sesia, S. Bates, E. Candès, J. Marchini, and C. Sabatti. False discovery rate control in genome-wide association studies with population structure. Proceedings of the National Academy of Sciences, 118(40):e2105841118, 2021.
- Sollis et al. (2023) E. Sollis, A. Mosaku, A. Abid, A. Buniello, M. Cerezo, L. Gil, T. Groza, O. Güneş, P. Hall, J. Hayhurst, et al. The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Research, 51(D1):D977–D985, 2023.
- Storey (2002) J. D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B: Statistical Methodology, 64(3):479–498, 2002.
- Storey and Tibshirani (2003) J. D. Storey and R. Tibshirani. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 100(16):9440–9445, 2003.
- Storey et al. (2004) J. D. Storey, J. E. Taylor, and D. Siegmund. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society: Series B: Statistical Methodology, 66(1):187–205, 2004.
- Sun and Cai (2009) W. Sun and T. Cai. Large-scale multiple testing under dependence. Journal of the Royal Statistical Society: Series B: Statistical Methodology, 71(2):393–424, 2009.
- Sun and Cai (2007) W. Sun and T. T. Cai. Oracle and adaptive compound decision rules for false discovery rate control. Journal of the American Statistical Association, 102(479):901–912, 2007.
- Williams (1991) D. Williams. Probability with martingales. Cambridge University Press, 1991.
- Zeggini et al. (2008) E. Zeggini, L. J. Scott, R. Saxena, B. F. Voight, J. L. Marchini, T. Hu, P. I. de Bakker, G. R. Abecasis, P. Almgren, G. Andersen, et al. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nature Genetics, 40(5):638–645, 2008.
Supplementary Materials
Appendix A Proof of main results
A.1 Proof of Proposition 1
Proof.
Since the spaces of the transition matrix and stationary probability are bounded and closed with finite dimensions, they are compact. We just need to show the decreasing density function space is compact under the Hellinger distance . Consider any Cauchy sequence . Our goal is to show there exists some such that as . Denote for all . Then
which means that . Since is compact, there exists some satisfying
Denote . Our goal is to show that : is a non-increasing density function. We next show that by contradiction. Note that . Assume and denote . Thus
which contradicts the fact that . Then we can conclude that . Next, we show that is also non-increasing. For any , denote
(A.1) |
Denote as the Lebesgue measure. Thus
which implies that converges to in measure, or equivalently, for any ,
By the theorem of Riesz [Riesz, 1928], there exists a subsequence of , such that almost everywhere. Since are non-increasing, we conclude that is also non-increasing. In conclusion, and consequently, is compact with respect to the Hellinger distance . ∎
A.2 Proof of Theorem 3.1
For any with , define the conditional distribution of given as
where denotes the hidden state of the th gene for . Denote the largest for
Then the likelihood function satisfies
(A.2) |
where for ; and satisfies . In addition, assume for some . Then
(A.3) |
where (A.3) holds due to (C2): for all . Therefore, combining (A.2) and (A.3) and taking the logarithm, we have
(A.4) |
Dividing (A.4) by , we have
(A.5) |
Letting , the lower bound of inequality (A.5) tends to . Hence and converge to the same limit in probability. Taking expectations of all terms in inequality (A.5), we see that has the same limit as . By Theorem 2 in Leroux [1992], there exists some satisfying
We also have
Replacing by , we get the limit . Lemma 6 in Leroux [1992] gives that for . Letting , there exists such that,
(A.6) |
Denote as a ball centered at with radius , where is the distance between and defined in (3.1). by (C4). Therefore, is a bounded monotone increasing function of . Since are continuous functions of , and are also continuous. By the Monotone Convergence Theorem and the continuity of , we have
Then there exists , such that
(A.7) |
where the second inequality holds due to (A.6). By the continuity of and , it follows from (A.5) that
Thus, and converge to the same limit in probability. Define
In addition, we have
(A.8) |
By the construction of , Lemma 3 of Leroux [1992] shows that is subadditive, which means for any sequence ,
By the property of subadditive processes [Fekete, 1923],
which implies that
(A.9) |
Consequently, by (A.8), (A.9) and (A.7), we have as ,
(A.10) |
Next, we use (A.10) to show the consistency of . Let be any closed subset of , not containing . Since is compact, is also compact and is covered by the union of finite open sets , where is a finite set in . Therefore,
where the limit in the last line holds due to (A.10) and that almost surely as by Birkhoff’s ergodic theorem [Birkhoff, 1931]. Since is a maximum likelihood estimator, . Therefore, cannot be in . In other words, for any open set containing , must be in for large . Letting , we conclude that in probability.
A.3 FDR control under oracle case
We consider the case that is known. The following theorem shows that FDR can be controlled under the oracle case.
Theorem A.1.
Under the oracle case where is known, denote for . Order the test statistics with the corresponding null hypotheses . For a pre-specified false discovery rate level , we have the following procedure
Then the testing procedure can control the FDR at level .
Proof.
Denote as the number of total rejections and as the number of false rejections. If we reject the replicability null hypothesis when for some threshold , then satisfies
Let for simplicity. Therefore,
The last equality holds because is a function of . Since , we have
Consequently,
∎
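The oracle procedure of Theorem A.1 can be illustrated with a minimal sketch. It assumes the procedure takes the usual step-up form used for local-index-of-significance statistics in Sun and Cai [2009]: sort the statistics in increasing order and reject the hypotheses with the k smallest values, where k is the largest index at which the running average of the sorted statistics is at most the target level. The function name `stepup_reject` and the toy values are hypothetical and serve only as an illustration.

```python
def stepup_reject(stats, alpha):
    """Step-up rule (assumed form): reject the k smallest statistics, where k is
    the largest rank whose running mean of the sorted statistics is <= alpha."""
    # Indices of the statistics in increasing order.
    order = sorted(range(len(stats)), key=lambda i: stats[i])
    running_sum = 0.0
    k = 0
    for rank, i in enumerate(order, start=1):
        running_sum += stats[i]
        if running_sum / rank <= alpha:
            k = rank  # keep the largest rank that satisfies the constraint
    rejected = set(order[:k])
    return [i in rejected for i in range(len(stats))]


# Hypothetical posterior probabilities of the replicability null for six features.
toy_stats = [0.01, 0.90, 0.03, 0.20, 0.60, 0.05]
toy_reject = stepup_reject(toy_stats, alpha=0.1)
print(toy_reject)
```

Because the statistics are sorted in increasing order, the running average is non-decreasing, so the rejected set is always an initial segment of the ordered list.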
A.4 Proof of Theorem 3.2
First, we introduce some notation used in the proof. Consider an infinite hidden Markov model with hidden states and p-values . Denote the following test statistics
For any test statistics corresponding to the null hypothesis , consider the testing procedure based on ordered with corresponding null hypotheses . We have the number of rejections given by
(A.11) |
We reject for . An equivalent algorithm is
(A.12) |
The rejection threshold can be written as . The total number of false rejections is . Replacing by and , the number of rejections and number of false rejections are denoted by , , and . Moreover, we define the corresponding rejection thresholds as . Next, we consider the distribution of . Since is stationary, irreducible, and aperiodic, the two-sided generalization of Theorem 6.1.3 in Durrett [2019] implies that is ergodic. Therefore, are identically distributed. Denote the cumulative distribution function of as
Denote the conditional cumulative distribution function of given as
Thus
Let
(A.13) |
By the forward-backward algorithm [Baum et al., 1970],
where and . and can be derived recursively by and . Since the joint distribution of is continuous, and is a continuous map from to (0, 1), the probability density function of is positive and continuous. It suffices to show that is strictly increasing in . For some threshold , define the number of rejections and false rejections as
Thus we have the expectations
Therefore, the marginal false discovery rate is
Theorem 1 of Sun and Cai [2009] implies that is increasing in . Define the threshold based on the marginal false discovery rate as
Since is equivalent to the statement that for , we have
under (C2) with . Without loss of generality, we assume . With the notation above, we prove Theorem 3.2 in three steps. In Step 1, we show that the total numbers of rejections and converge to infinity almost surely. In Step 2, we show that and as . In Step 3, we establish asymptotic false discovery rate control.
Step 1. Asymptotic behavior of rejection numbers. Recall that and are the rejection thresholds corresponding to and . First, we show and in probability by Lemma 1.
Lemma 1.
Assume (C1)-(C4) hold. and in probability.
We next show almost surely. For simplicity, denote as for any . By (C2), for all and (C5), for any , . Then for any states and ,
Let . Since , we conclude that for all ,
Define
Then
(A.14) |
Then we apply Lemma 2 below to show and almost surely as .
Lemma 2.
If (A.14) and (C1)-(C3) hold, then and almost surely.
Step 2. Convergence of and in expectation. In Step 2, we show and under the general case that .
Lemma 3.
If , then and as .
Appendix B Proof of lemmas
B.1 Proof of Lemma 1
Proof.
Recall that
where
(B.1) | ||||
Since , it is a function of . Thus
Birkhoff’s ergodic theorem [Birkhoff, 1931] gives
Consequently,
(B.2) |
In addition, . Therefore,
which implies that , or equivalently,
(B.3) |
By construction, is an increasing step function with jump at . For , construct the lower bound of as
Then is strictly increasing in . We also have
where denotes the number of rejections yielded by threshold , satisfying if . By Birkhoff’s ergodic theorem [Birkhoff, 1931], almost surely as . Then for any ,
By (B.2), almost surely for any . Denote
As with probability , we have
(B.4) |
By (B.3), we also have almost surely. We claim that in probability. If not, there exist and such that for any , there exists satisfying
where denotes the event that . Let
Since in probability for any , there exists , such that for any ,
where denotes the event that . Without loss of generality, assume . Letting , we have
Thus has positive probability. Additionally, is strictly increasing over with probability . On , we have
which is a contradiction. Thus we must have in probability. Furthermore, by (B.3) and (B.4),
in probability can be shown in the same way. ∎
B.2 Proof of Lemma 2
Proof.
Define . Similarly, define . We need to show that
(B.5) |
Since and with probability , we have with probability . Thus with probability for any . We have
Since , we have
and similarly,
Therefore,
Since , we know (B.5) is true. We define and similarly. We also have
(B.6) |
We move to the second step. Let . For any , let and . We claim that when and ,
(B.7) |
We have
We just need to show
and
We have for ,
Then
Similarly, we also have
Therefore, (B.7) is true. Then we consider the expectations.
By (C5) and the construction of , we have for all . Let
(B.8) |
then we have . Therefore, for some ,
(B.9) |
By Lévy’s upward theorem [Williams, 1991], almost surely as . Next, we show that . Note that
By (C2), for all . Then we have
Consequently,
By (C2), and . Moreover, two roots of the quadratic equation
are
By (C2), for each , . Thus
where is defined in (C2). Thus and
By (C2), and . Since are continuous, there exist such that and whenever and . Consequently, for and
Therefore,
Therefore, we have
and thus we can conclude that . Finally, we show that and almost surely as . We consider the case that not all hypotheses are rejected. Recall (A.12). The threshold satisfies with probability . It suffices to show that almost surely as . Take , with . satisfies when is large enough. For any satisfying , by (B.2), we have
with probability . Define . Then is ergodic. Thus Birkhoff’s ergodic theorem [Birkhoff, 1931] gives that
Moreover, by the construction of in (B.8). Then Markov’s inequality gives
Thus
as . We use the property that . Then
By Birkhoff’s ergodic theorem, we have
Then we have almost surely. We have shown that for any . Therefore, almost surely, which means almost surely. We can use a similar argument to show that almost surely. The details are omitted. ∎
B.3 Proof of Lemma 3
Proof.
Note that almost surely as as shown in Lemma 2. The rejection criterion in (2.9) implies that
Note that as ,
Since
we have
(B.10) |
We can use the same approach to show
(B.11) |
Therefore, combining (B.10) and (B.11), we have
(B.12) |
We finish the proof by contradiction. Assume that does not hold, where is the total number of rejections by (2.9) when the total number of hypotheses is and is the total number of rejections by (3.3) when the total number of hypotheses is . Then there is such that, for any , there exists some satisfying . Since
implies that either (i) , or (ii) . We first consider the case that (i) is true. Then and therefore the event has positive probability. On the event , we have
(B.13) | ||||
(B.14) |
where (B.13) holds due to triangle inequality and (B.14) holds because
We next show that in (B.14) is positive with probability . Since the event has probability , it suffices to show that
(B.15) | ||||
(B.16) |
Since with probability , for and defined in (B.1), we have
(B.17) |
Lemma 1 shows that in probability, and the construction of and gives that in probability. Therefore,
(B.18) |
Similarly, we also have
(B.19) |
Combining (B.18) and (B.19), we have
(B.20) |
By the rejection criteria (2.8) and (2.9), with probability . Thus (B.15) holds. Moreover, for any , as . Then on the event ,
Birkhoff’s ergodic theorem [Birkhoff, 1931] gives that as ,
When tends to , the continuity of gives that as ,
(B.21) |
Similarly,
(B.22) |
Moreover, by Birkhoff’s ergodic theorem [Birkhoff, 1931], we have
(B.23) |
Combining (B.17), (B.21) and (B.23), we have
with probability approaching . Recall that . Thus for any ,
Equivalently, we have
Since is strictly increasing in and , we have and . Therefore, (B.16) holds. By (B.10), (B.15) and (B.16), we have
It implies that
(B.24) |
Next, we show
(B.25) |
Note that
(B.26) |
Denote and . Since and , we have
Therefore,
(B.27) |
For with , denote
By (B.9), for if . Whenever or , . Note that Lemma 2 shows that almost surely as . Therefore, by (B.27),
(B.28) |
Similarly, for and , we have
(B.29) |
Furthermore, denote and . By (B.22), we have almost surely as . By Birkhoff’s ergodic theorem [Birkhoff, 1931], we have almost surely as . Note that
Birkhoff’s ergodic theorem [Birkhoff, 1931] gives that
Similarly, we have almost surely as by Birkhoff’s ergodic theorem [Birkhoff, 1931]. Therefore,
Therefore, we have
(B.30) |
Combining (B.26), (B.28), (B.29) and (B.30), we have (B.25). Then by (B.14), (B.24) and (B.25), for any , there exists some satisfying
This contradicts (B.12). Therefore, (i) does not hold. Now consider the case when (ii) is true. In this case, and therefore the event has positive probability. By (B.11) and (B.25), we have
(B.31) |
Thus with probability . Then we can use a similar method as (B.14) and obtain that on the event ,
Let and . We know , where denotes the cardinality of a set. Since with probability , we have
We apply the ergodic theorem [Birkhoff, 1931] and continuity of to obtain
as . Since with probability and with probability , we have
as . Since , the continuous mapping theorem gives that
(B.32) |
as . Denote . Note that , the cumulative distribution function of , is strictly increasing in over the interval . This implies that . Hence, by (B.31) and (B.32), we have
(B.33) |
By (B.25), we take expectations on both sides of (B.33) to get
This contradicts (B.12). Therefore, (ii) does not hold either. We have shown that neither (i) nor (ii) holds, which implies that . Similarly, we can obtain that . The details are omitted. ∎
Appendix C Competing methods
We compare the FDR and power of our method with several replicability analysis methods, including the ad hoc BH method, MaxP [Benjamini et al., 2009], the radjust procedure [Bogomolov and Heller, 2018], JUMP [Lyu et al., 2023], and STAREG [Li et al., 2023].
C.1 The ad hoc BH method
BH [Benjamini and Hochberg, 1995] is the most popular multiple testing procedure that conservatively controls the false discovery rate for independent or positively dependent tests. In study , the BH procedure proceeds as follows.
- Step 1. Let be the ordered p-values, and denote by the corresponding hypotheses;
- Step 2. Find the largest such that , i.e., , and if the set is empty;
- Step 3. Reject .
The ad hoc BH method for replicability analysis identifies SNPs rejected by both studies as replicable SNPs.
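The three steps and the intersection rule can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in the paper; the function names `bh_reject` and `ad_hoc_bh` and the toy p-values are hypothetical.

```python
def bh_reject(pvalues, alpha):
    """Benjamini-Hochberg step-up: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= k * alpha / m (k = 0 if no rank qualifies)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k = rank
    rejected = set(order[:k])
    return [i in rejected for i in range(m)]


def ad_hoc_bh(p1, p2, alpha):
    """Ad hoc replicability rule: a feature is declared replicable only if BH
    rejects it in both studies at the same nominal level."""
    r1 = bh_reject(p1, alpha)
    r2 = bh_reject(p2, alpha)
    return [a and b for a, b in zip(r1, r2)]


# Hypothetical p-values for five features in two studies.
toy_p1 = [0.001, 0.010, 0.015, 0.500, 0.900]
toy_p2 = [0.002, 0.600, 0.004, 0.700, 0.010]
toy_replicable = ad_hoc_bh(toy_p1, toy_p2, alpha=0.05)
print(toy_replicable)
```

In this toy example, the second and fifth features are each rejected in only one study, so neither is reported as replicable.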
C.2 The MaxP method
Define the maximum of p-values as
Note that follows a super-uniform distribution under the replicability null. The MaxP method directly applies BH [Benjamini and Hochberg, 1995] to for false discovery rate control.
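As a minimal sketch of this idea, the code below takes the feature-wise maximum of the two p-values and applies the BH step-up rule to the result. The function name `maxp_reject` and the toy inputs are hypothetical, and the BH step is written inline so that the example is self-contained.

```python
def maxp_reject(p1, p2, alpha):
    """MaxP sketch: apply the BH step-up rule to the feature-wise maximum
    of the p-values from the two studies."""
    pmax = [max(a, b) for a, b in zip(p1, p2)]
    m = len(pmax)
    order = sorted(range(m), key=lambda i: pmax[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # BH condition on the ordered maximum p-values.
        if pmax[i] <= rank * alpha / m:
            k = rank
    rejected = set(order[:k])
    return [i in rejected for i in range(m)]


# Hypothetical p-values for five features in two studies.
toy_p1 = [0.001, 0.010, 0.015, 0.500, 0.900]
toy_p2 = [0.002, 0.600, 0.004, 0.700, 0.010]
toy_maxp = maxp_reject(toy_p1, toy_p2, alpha=0.05)
print(toy_maxp)
```

A feature can only be rejected when both of its p-values are small, which is why the maximum serves as a valid, if conservative, statistic for the composite replicability null.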
C.3 The radjust procedure
The radjust procedure [Bogomolov and Heller, 2018] works as follows.
- Step 1. For a pre-specified false discovery rate level , compute
where is the set of features pre-selected in study for . By default, it selects features with p-values less than or equal to .
- Step 2. Reject features with indices in the set
In this paper, we implement an adaptive version of the radjust procedure [Bogomolov and Heller, 2018] in the simulations, which first estimates the fractions of true null hypotheses among the pre-selected features. The fractions in the two studies are estimated as follows.
(C.1) |
where . The adaptive procedure with a nominal false discovery rate level works as follows.
- Step 1. Compute and using (C.1). Let
- Step 2. Reject features with indices in the set
C.4 The JUMP method
The JUMP method [Lyu et al., 2023] works on the maximum of p-values across two studies. Define
Let denote the inferred association status of single nucleotide polymorphisms across two studies. Then with for and . It can be shown that
which means that follows a super-uniform distribution under the replicability null. Denote
For a given threshold , a conservative estimate of the false discovery rate is obtained by
Following Storey [2002], Storey et al. [2004], the proportion of null hypotheses in study can be estimated by
Similarly, is estimated by
where and are tuning parameters that can be selected by using the smoothing method provided in Storey and Tibshirani [2003]. Then we have
With these estimates, we have a plug-in estimate of false discovery rate,
The JUMP method works as follows.
- Step 1. Let be the ordered maximum of p-values and denote by the corresponding hypotheses;
- Step 2. Find the largest such that the estimated false discovery rate is controlled, where
- Step 3. Reject .
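The single-study null-proportion estimate that enters the plug-in false discovery rate can be illustrated with a short sketch. It assumes the standard estimator of Storey [2002], which counts the p-values above a tuning parameter and rescales by the expected fraction of null p-values in that range. The function name `storey_pi0`, the fixed tuning value, and the truncation at one are choices made for this illustration only; the text above selects the tuning parameter by the smoothing method of Storey and Tibshirani [2003] instead.

```python
def storey_pi0(pvalues, lam):
    """Storey-type estimate of the null proportion:
    #{p_i > lam} / (m * (1 - lam)), truncated at 1 (truncation is an
    assumption of this sketch)."""
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    m = len(pvalues)
    tail_count = sum(1 for p in pvalues if p > lam)
    return min(1.0, tail_count / (m * (1.0 - lam)))


# Hypothetical p-values from one study: four small ones and six spread out.
toy_p = [0.001, 0.004, 0.02, 0.03, 0.2, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_pi0 = storey_pi0(toy_p, lam=0.5)
print(toy_pi0)
```

The estimate is conservative because some non-null p-values may also exceed the tuning parameter, which inflates the count in the numerator.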
C.5 The STAREG method
Let denote the inferred association status of single nucleotide polymorphisms across two studies. Then with for and . Assume a mixture model for -values in the two studies. Specifically,
where is the density function of p-values under the null, and denote the non-null density functions for study 1 and study 2, respectively. Then the local false discovery rate (Lfdr) is defined as the posterior probability of being replicability null given data. We have
Assume the monotone likelihood ratio condition [Sun and Cai, 2007, Cao et al., 2013]:
(C.2) |
We have that is monotonically non-decreasing in . The rejection rule based on to test the replicability null is , where is a threshold to be determined. We write the total number of discoveries as , and the number of false discoveries as . In the oracle case that we know , define
Reject if . Then the FDR is asymptotically controlled at level .
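The Lfdr of the four-state mixture can be computed directly from Bayes' rule, as the sketch below illustrates. It assumes a standard uniform null density and uses hypothetical names throughout: `xi` holds the prior probabilities of the four joint states, and `f1`, `f2` are the non-null densities of the two studies. The Beta-type densities are an arbitrary decreasing choice for illustration, not estimates from the paper.

```python
def replicability_lfdr(p1, p2, xi, f1, f2):
    """Posterior probability that a feature is replicability null, i.e. in
    state (0,0), (0,1) or (1,0), given its two p-values. The null p-value
    density is taken to be 1 on (0, 1)."""
    d1 = f1(p1)  # non-null density in study 1
    d2 = f2(p2)  # non-null density in study 2
    null_mass = xi[(0, 0)] + xi[(0, 1)] * d2 + xi[(1, 0)] * d1
    total_mass = null_mass + xi[(1, 1)] * d1 * d2
    return null_mass / total_mass


def beta_density(a):
    """Hypothetical decreasing non-null density a * p**(a - 1) with 0 < a < 1."""
    return lambda p: a * p ** (a - 1.0)


# Hypothetical prior probabilities of the four joint states.
toy_xi = {(0, 0): 0.7, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.1}
f = beta_density(0.5)
lfdr_small = replicability_lfdr(0.01, 0.01, toy_xi, f, f)
lfdr_large = replicability_lfdr(0.50, 0.50, toy_xi, f, f)
print(lfdr_small, lfdr_large)
```

With decreasing non-null densities, smaller p-values in both studies yield a smaller Lfdr, in line with the monotonicity implied by condition (C.2).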
Assume follows a standard uniform distribution. Let and denote p-values from study 1 and study 2, respectively. Denote and . The unknown parameters and functions are estimated by maximizing the following log-likelihood function
where and are latent variables. For scalable computation, we use the EM algorithm [Dempster et al., 1977] in combination with the pool-adjacent-violators algorithm [Robertson et al., 1988] to efficiently estimate the unknowns under the monotonicity constraint (C.2) for and . With the estimates , we obtain the estimated Lfdr as follows.
An estimate of is
The replicability null hypothesis is rejected if . This is equivalent to the step-up procedure of Sun and Cai [2007]. Let be the order statistics of , and denote by the corresponding ordered hypotheses; the procedure works as follows.
Find | |||