2024. Accessed on arXiv on 27 March 2024.
Semi-Confirmatory Factor Analysis for High-Dimensional Data with Interconnected Community Structures
Abstract
Confirmatory factor analysis (CFA) is a statistical method for identifying and confirming the presence of latent factors among observed variables through the analysis of their covariance structure. Compared to alternative factor models, CFA offers interpretable common factors with enhanced specificity and a more adaptable approach to modeling covariance structures. However, the application of CFA has been limited by the requirement for prior knowledge about “non-zero loadings” and by the lack of computational scalability (e.g., it can be computationally intractable for hundreds of observed variables). We propose a data-driven semi-confirmatory factor analysis (SCFA) model that attempts to alleviate these limitations. SCFA automatically specifies “non-zero loadings” by learning the network structure of the large covariance matrix of observed variables, and then offers closed-form estimators for factor loadings, factor scores, covariances between common factors, and variances between errors using the likelihood method. Therefore, SCFA is applicable to high-throughput datasets (e.g., hundreds of thousands of observed variables) without requiring prior knowledge about “non-zero loadings”. Through an extensive simulation analysis benchmarking against standard packages, SCFA exhibits superior performance in estimating model parameters with a much reduced computational time. We illustrate its practical application through factor analysis on a high-dimensional RNA-seq gene expression dataset.
Keywords: Closed-Form Solution; Factor Score; Interconnected Community Structure; Statistical Inference.
1 INTRODUCTION
Factor analysis is a commonly used statistical technique to elucidate the relationship between multivariate observations. Factor models aim to identify the underlying factors that collectively describe the interdependencies present in multivariate observed data (Anderson, 2003). As correlated high-dimensional observed variables can be effectively decomposed into a smaller number of common factors, i.e., achieving dimension reduction, factor analysis models have garnered popularity in various fields, including social science, psychology, molecular biology, and others (Schreiber et al., 2006; Fan et al., 2020, 2021).
Factor analysis is broadly classified into two categories: exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). EFA is frequently employed to explore the interactive relationships among observed variables and to identify latent common factors, without prerequisite knowledge of grouping these observed variables. In contrast, CFA is commonly utilized to validate whether the empirical evidence supports a predetermined latent structure of the shared variance in the model specification. In a CFA model, prerequisite knowledge or empirical evidence regarding the grouping of observed variables is represented by predefined “non-zero loadings” in the factor loading matrix, establishing a rule or a factor membership that exclusively and exhaustively assigns each observed variable to a certain common factor (Browne, 2001). In practical applications, a combined approach is often adopted, starting with EFA to investigate the underlying dependence pattern, followed by CFA for model verification and justification (Basilevsky, 2009; Brown, 2015; Gana & Broc, 2019).
These distinct model specifications confer unique strengths and limitations on EFA and CFA. EFA is more flexible because it does not require specified “non-zero loadings”, and it can be adapted to accommodate high-dimensional observations, reaching thousands of observed variables or more (Friguet et al., 2009; Bai & Li, 2012; Fan et al., 2013; Fan & Han, 2017). However, the factor loadings in classical EFA models are typically all non-zero, resulting in less interpretable relationships between common factors and observed variables. Moreover, EFA models are limited in their ability to explicitly estimate the covariance matrix of common factors for high-dimensional data. In contrast, CFA naturally specifies a sparse factor loading matrix guided by prior knowledge, enhancing interpretability, as exemplified in Carlson & Mulaik (1993), and allows arbitrary covariances between common factors, as demonstrated in Lawley (1958) and Jackson et al. (2009). As previously mentioned, classical CFA models encounter two primary limitations: (1) prior knowledge of the “non-zero loadings” in the factor loading matrix, i.e., the factor membership, is typically lacking, so no rule is available that exclusively and exhaustively assigns each observed variable to a certain common factor; and (2) the computational burden of estimating a CFA model in high-dimensional scenarios becomes intractable because the existing standard computational packages struggle to handle datasets containing hundreds or thousands of observed variables (Fox, 2006; Rosseel, 2012; Oberski, 2014).
In the current research, we concentrate on the CFA approach while attempting to address the two aforementioned limitations. We propose a semi-confirmatory factor analysis (SCFA) model that addresses the specification of “non-zero loadings” through the covariance structure learned from high-dimensional data, and significantly alleviates the computational burden with theoretically guaranteed solutions in closed form. Specifically, to overcome the first limitation, we incorporate a prevalent covariance structure, namely, the interconnected community structure, into the conventional CFA model. We focus on the interconnected community structure, as it is widely prevalent in the covariance matrices of various high-dimensional datasets, as illustrated in Figure \arabicfigure. Notably, interconnected community structures, which enable features between communities to exhibit correlations, encompass various well-known patterns, including all independent community structures (Newman & Girvan, 2004; Fortunato, 2010) and most hierarchical community structures (Li et al., 2022; Schaub et al., 2023). Therefore, they provide a versatile covariance structure to model various practical applications, including brain imaging, gene expression, multi-omics, metabolomics, and more (Girvan & Newman, 2002; Colizza et al., 2006; Simpson et al., 2013; Levine et al., 2015; Huttlin et al., 2017; Zitnik et al., 2018; Perrot-Dockès et al., 2022). However, we acknowledge that the proposed structures can be limited in representing certain covariance patterns (e.g., Toeplitz). In such cases, traditional factor models remain suitable. Interconnected community structures are latent in many studies, which can be accurately and robustly estimated and extracted by recently developed network structure detection approaches (Wang et al., 2020; Li et al., 2022; Yang et al., 2024). 
Consequently, the detected community membership can serve as a guide to specifying the previously unknown “non-zero loadings” for CFA, effectively addressing the first limitation. The SCFA model also alleviates the computational burden by deriving closed-form solutions for all CFA model parameters, including the factor loadings, factors, covariance matrix between common factors, and covariance matrix for error terms. The closed-form estimators not only improve estimation accuracy and stability but also drastically reduce computational load and ensure scalability (e.g., handling thousands of observed variables).
SCFA presents several methodological contributions. Firstly, SCFA alleviates the requirement of “non-zero loadings” in CFA models. The acquired interconnected community structure assigns observed variables to common factors, specifying “non-zero loadings” in an adaptive manner. Secondly, SCFA provides a computationally efficient approach for conducting high-dimensional CFA (e.g., with thousands or more observed variables). All estimators are obtained by the likelihood approach and in closed form, substantially mitigating the computational burden. Thirdly, SCFA yields more accurate and reliable estimates, since all matrix estimators are uniformly minimum-variance unbiased estimators (UMVUEs). The factor scores can also be conveniently estimated using the feasible generalized least-square (FGLS) method. We further show that FGLS estimators have an identical solution to those obtained through ordinary least-square (OLS) and generalized least-square (GLS) methods. Lastly, we derive explicit variance estimators that facilitate statistical inference concerning model parameters and factor scores in a SCFA model.
The remainder of the paper is structured as follows. In Section 2, we introduce the SCFA model, detailing its specifications for the factor loading matrix and the covariance matrix of observations. We subsequently present the estimation and inference procedures for all unknown matrices and factor scores. Sections 3 and 4 are dedicated to evaluating the proposed model and approach: Section 3 assesses the performance of our model through simulated data, while Section 4 demonstrates the application to a genomics dataset without prior knowledge of “non-zero loadings”. All proofs and additional tables are included in the Supplementary Material.
2 METHOD
2.1 Background
Confirmatory factor analysis model. Let $X$ represent a $p$-dimensional vector of observations, $f$ denote the $K$-dimensional vector of common factors, with $K < p$, and $u$ denote the $p$-dimensional vector of error terms. A factor model can be expressed as
\[ X = \mu + L f + u, \qquad E(f) = 0_{K \times 1}, \quad E(u) = 0_{p \times 1}, \quad \operatorname{cov}(f, u) = 0_{K \times p}, \tag{1} \]
where $0_{r \times s}$ represents the $r$ by $s$ zero matrix; without loss of generality, the $p$-dimensional mean vector is denoted by $\mu$; and $L$ represents the $p$ by $K$ factor loading matrix. Furthermore, let $\Sigma = \operatorname{cov}(X)$, $\Sigma_f = \operatorname{cov}(f)$, and $\Sigma_u = \operatorname{cov}(u)$ denote the $p$ by $p$, $K$ by $K$, and $p$ by $p$ covariance matrices, respectively, where $\Sigma_u$ is assumed to be diagonal (Fan et al., 2021).
When performing CFA, we may introduce zero entries at specified positions in the factor loading matrix $L$ (Jöreskog, 1969) and assume the common factors to be oblique, e.g., the covariance matrix of common factors $\Sigma_f$ can be arbitrarily positive definite. Following the model in (1) and these two assumptions on $L$ and $\Sigma_f$, we can derive the following relationship between covariance matrices in the CFA model:
\[ \Sigma = L \Sigma_f L^\top + \Sigma_u, \qquad L = \operatorname{Bdiag}(\ell_1, \ldots, \ell_K), \tag{2} \]
where denotes the transpose, becomes a by block-diagonal matrix with -dimensional “non-zero loadings” for ; is a by symmetric positive-definite matrix; and is a by diagonal positive-definite matrix satisfying the submatrix is a by diagonal matrix if or zero matrix if . In particular, represents the number of observed variables within the th factor, for , satisfying that each and .
We remark that the non-overlapping factor loading pattern in (\arabicequation) permits only a single non-zero entry within each row. This configuration aligns with the preferences of the majority of existing complexity criteria (Browne, 2001).
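The covariance relationship of the CFA model can be illustrated numerically. The following is a minimal sketch, assuming hypothetical values for the loadings, the factor covariance, and the error variances (none of them taken from the paper); the helper name `block_loading_matrix` is likewise illustrative. It builds a block-diagonal loading matrix with one non-zero entry per row and forms the implied covariance of the observed variables.

```python
import numpy as np

def block_loading_matrix(loadings):
    """Stack per-factor loading vectors into a p-by-K block-diagonal matrix."""
    p = sum(len(l) for l in loadings)
    L = np.zeros((p, len(loadings)))
    row = 0
    for k, l in enumerate(loadings):
        L[row:row + len(l), k] = l      # factor k loads only on its own variables
        row += len(l)
    return L

# Hypothetical example: K = 2 factors with 3 and 2 observed variables.
loadings = [np.array([1.0, 0.8, 1.2]), np.array([1.0, 0.5])]
L = block_loading_matrix(loadings)
Sigma_f = np.array([[1.0, 0.3], [0.3, 2.0]])    # oblique (correlated) factors
Sigma_u = np.diag([0.5, 0.4, 0.6, 0.3, 0.7])    # diagonal error covariance
Sigma = L @ Sigma_f @ L.T + Sigma_u             # implied covariance of observations

# Non-overlapping pattern: exactly one non-zero loading in each row.
nonzeros_per_row = (L != 0).sum(axis=1)
```

In this sketch the off-diagonal block of `Sigma` linking the two groups of variables equals the between-factor covariance times the outer product of the two loading vectors, which is the block pattern discussed next.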
Covariance matrix in a block form. As defined in (\arabicequation), the covariance matrix of observed variables is structured as a block matrix. Specifically, the block structure of is determined by the “non-zero loadings” in for :
for all and , where for and is a diagonal matrix.
In practical applications, it is often the case that information about both the components and sizes of each “non-zero loadings” is unavailable, particularly in domains such as omics and imaging data. Nevertheless, covariance matrices in these applications often exhibit block patterns, albeit these block patterns may be latent and require pattern extraction algorithms. Recent advancements in statistics have provided convenient tools to accurately identify latent block structures in covariance matrices and precisely estimate covariance parameters (Lei & Rinaldo, 2015; Wu et al., 2021; Li et al., 2022). Inspired by the above block structure of in the CFA model, we are motivated to extract knowledge regarding all “non-zero loadings” from structured covariance matrix estimation for high-dimensional data, thereby facilitating the estimation of the CFA model, as elaborated in the following sections.
Interconnected community structure. The block-structured covariance matrix in the CFA model is naturally linked with the community-based network structure (Wu et al., 2021). In the current research, we focus on the interconnected community structure, which is more general and prevalent in real-world applications. As demonstrated in Figure \arabicfigure, the interconnected community covariance structure presents in the high-throughput datasets such as genetics, imaging, gene-expression, DNA-methylation, and metabolomics, among many (Yang et al., 2024). Therefore, we build the proposed model based on the interconnected community covariance structure for factor analysis on these datasets.
We characterize the interconnected community covariance structure as follows. In this structure, all features can be categorized into multiple mutually exclusive and exhaustive communities. In other words, there implicitly exists a community-membership function that operates as a bijection. In contrast to the classical community network structure, the interconnected community structure exhibits correlations among features within and between communities (i.e., interactive communities), as demonstrated in Figure \arabicfigure. Thus, the (population) covariance matrix with an interconnected community structure has a block structure: diagonal blocks represent the intra-community correlations while the off-diagonal blocks characterize the inter-community relationships. Due to the high resemblance of entries within each block, a parametric covariance model is often used by assigning two (or one) parameters for each diagonal (or off-diagonal) block (Yang et al., 2024).
A two-step procedure is commonly employed for estimating large covariance matrices in these datasets: firstly, extracting the latent structure from the sample covariance matrix, and subsequently estimating the covariance parameters under the learned covariance structure (Chen et al., 2018; Wu et al., 2021; Chen et al., 2023). Given the reliable and replicable performance of existing algorithms in extracting latent network structures (Chen et al., 2018; Wang et al., 2020; Li et al., 2022) (see almost identical results of detected interconnected structures by different algorithms in the Supplementary Material), it is plausible to consider the estimated interconnected community structure of the covariance matrix as “known” prior knowledge for SCFA models.
2.2 Semi-parametric confirmatory analysis model
We propose a semi-confirmatory factor analysis (SCFA) model for multivariate with a covariance matrix having an interconnected community structure. In this section, we will show that , , and in the SCFA model can be represented by the parameters in the parametric having an interconnected community structure in closed forms, as demonstrated in Figure \arabicfigure. This reparameterization facilitates accurate and computationally efficient estimation of , , and . Specifically, following (\arabicequation) and (\arabicequation), we have:
(\arabicequation) | |||
(\arabicequation) |
where and , the number of common factors, is set to be the number of interconnected communities. are set to be the interconnected community sizes satisfying that for every and , and is the sum (we define ). Without loss of generality, the first entry in is assumed to be a known constant , i.e., is known, while the other entries in are unknown for every . denotes the by identity matrix, and denote by and by all-one matrices, respectively. and are (unknown) entries of the parametric population covariance matrix.
The parametric covariance matrix defined in (\arabicequation) is derived from covariance patterns inherent in interconnected community structures (Yang et al., 2024). When viewed through the lens of network analysis, takes a block form, where each diagonal block represents a uniform correlation relationship between features within a single community, and each off-diagonal block represents a uniform correlation relationship between features from two distinct communities.
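To make the block form concrete, the sketch below constructs a covariance matrix in which each diagonal block is a scaled identity plus a constant matrix and each off-diagonal block is constant. The parameter names `a` and `B`, the community sizes, and all numerical values are assumptions chosen for illustration only.

```python
import numpy as np

def community_covariance(a, B, sizes):
    """Diagonal blocks a[k]*I + B[k, k]*J; off-diagonal blocks B[k, l]*J."""
    membership = np.repeat(np.arange(len(sizes)), sizes)      # community label per variable
    Sigma = B[np.ix_(membership, membership)].astype(float)   # expand B into uniform blocks
    Sigma[np.diag_indices_from(Sigma)] += a[membership]       # add a[k] along the diagonal
    return Sigma

sizes = np.array([3, 2])                   # hypothetical community sizes
a = np.array([0.5, 0.8])                   # one diagonal parameter per community
B = np.array([[1.0, 0.3], [0.3, 2.0]])     # one parameter per pair of communities
Sigma = community_covariance(a, B, sizes)
```

With positive entries in `a` and a positive-definite `B`, the resulting matrix is positive definite, so it is a valid covariance matrix regardless of how large the communities are.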
The SCFA model is distinct from the conventional CFA model because it can leverage the covariance matrix with an interconnected community structure to derive closed-form solutions for all parameters in the factor loading matrix and in the covariance matrices and .
Specifying all “non-zero loadings” in . In the SCFA model, we specify the “non-zero loadings” in based on the interconnected community structure. For an interconnected community structure of communities with corresponding community sizes , the factor loading matrix is
We utilize the community-membership function based on the interconnected community structure to partition observed variables into mutually exclusive and exhaustive common factors, i.e., . Without loss of generality, the th observed variable is mapped to the th common factor, which includes distinct observed variables and satisfies that . Next, we reorder observed variables by listing them from the same common factor as neighbors. For the input vector of observed variables in (\arabicequation), we reorder the elements and obtain using the function . Consequently, dictates that the first elements of are categorized into the first common factor, and so forth, with the last elements of being categorized into the th common factor. For simplicity, we denote throughout the remainder of this paper. Moreover, the adoption of a non-overlapping factor loading pattern in (\arabicequation) and (\arabicequation) stems from the fact that is a bijective function.
With the exception of components for and , which will be determined in the subsequent corollary), all information about the factor loading matrix has been completely determined by the community membership.
Reparameterizing and by . With a covariance matrix characterized by an interconnected community structure and corresponding “non-zero loadings” specified in , the SCFA model in (\arabicequation) can represent model parameters and using the covariance parameters in .
Corollary \arabicsection.\arabictheorem.
Consider the blocks of a classical CFA model as shown in (\arabicequation). All parameters in can be determined by the parametric covariance matrix in (\arabicequation) in terms of and through:
with for every . Then, we have the following equations:
(1) for all , so ;
(2) for all , so with ;
(3) for all , so , where we assume that the symmetric matrix is positive definite and for all .
The conditions for all and the positive definiteness of are required to ensure that is positive definite (see Supplementary Material). Corollary \arabicsection.\arabictheorem demonstrates that the SCFA model can (1) directly specify the “non-zero loadings” in based on the function and derive that all loadings belonging to factor follow ; and (2) parameters for all entries in the matrices and can be expressed by the parameters in the parametric covariance matrix (i.e., and ). This reparametrization facilitates much improved parameter estimation accuracy and computational efficiency of the SCFA model.
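The reparameterization can be checked numerically. The sketch below assumes, purely for illustration, that all loadings within community k equal a known constant `c[k]`, that the error variance is constant within each community, and that the factor covariance is obtained by dividing the block-level parameters by the products of these constants; the variable names and values are hypothetical. Under these assumptions the CFA covariance coincides with the parametric block covariance.

```python
import numpy as np

sizes = np.array([3, 2])
a = np.array([0.5, 0.8])                   # within-community diagonal parameters
B = np.array([[1.0, 0.3], [0.3, 2.0]])     # block-level covariance parameters
c = np.array([1.0, 2.0])                   # assumed known common loading per factor

membership = np.repeat(np.arange(len(sizes)), sizes)
Z = np.eye(len(sizes))[membership]         # p-by-K 0/1 membership indicator
L = Z * c                                  # column k holds c[k] on community k
Sigma_u = np.diag(a[membership])           # error variances from a
Sigma_f = B / np.outer(c, c)               # factor covariance from B and c

Sigma_cfa = L @ Sigma_f @ L.T + Sigma_u
Sigma_param = B[np.ix_(membership, membership)] + np.diag(a[membership])
```

The two matrices agree entry by entry, which is why estimating the block parameters is enough to recover the factor-model matrices in this setting.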
2.3 Estimation and inference
In this section, we introduce a likelihood-based parameter estimation procedure for SCFA applied to sample data with an unknown covariance matrix having an interconnected community structure. We derive closed-form estimators for SCFA parameters , and , and calculate closed-form factor scores using the least-square approach. Lastly, we establish the theoretical properties of the proposed estimators and delineate the inference procedure.
Suppose that the rows of data matrix are independently and identically distributed as , satisfying the proposed SCFA model in (\arabicequation). Let and be the th factor score and error term, respectively. We define as the unbiased sample covariance matrix. The typical estimation objectives encompass the non-zero elements of , the elements of symmetric , and the diagonal elements of based on . We assume that the interconnected community structure and are known for , while parameters and need to be estimated for each and . Without loss of generality, we let the high-dimensional framework in this paper follow that and .
We maximize the likelihood function with respect to , , and :
where denotes the log-likelihood function of normal data X, denotes the determinant, denotes the trace, is defined in (\arabicequation). By maximizing and letting throughout the rest of the paper, we obtain the following (unique) maximum likelihood estimators:
with , where
(\arabicequation) |
for all and , denotes the by submatrix of , denotes the sum of all elements of a matrix.
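As an illustration of why such estimators scale well, the sketch below assumes that the closed-form estimates reduce to block-wise averages of the sample covariance matrix, computed from the trace and the element sum of each submatrix. This is an assumption in keeping with the trace and sum notation above, not a transcription of the paper's formulas; the function name and inputs are hypothetical, and every community is assumed to contain at least two variables.

```python
import numpy as np

def block_average_estimates(S, sizes):
    """Estimate (a, B) from a covariance matrix S by averaging within blocks."""
    K = len(sizes)
    edges = np.concatenate(([0], np.cumsum(sizes)))
    a_hat = np.zeros(K)
    B_hat = np.zeros((K, K))
    for k in range(K):
        for l in range(K):
            block = S[edges[k]:edges[k + 1], edges[l]:edges[l + 1]]
            if k == l:
                pk = sizes[k]
                # mean of the off-diagonal entries of the diagonal block
                B_hat[k, k] = (block.sum() - np.trace(block)) / (pk * (pk - 1))
                # mean diagonal entry minus the off-diagonal mean
                a_hat[k] = np.trace(block) / pk - B_hat[k, k]
            else:
                B_hat[k, l] = block.mean()   # mean of an off-diagonal block
    return a_hat, B_hat

# Sanity check on an exact block-structured matrix (hypothetical values).
sizes = np.array([3, 2])
a = np.array([0.5, 0.8])
B = np.array([[1.0, 0.3], [0.3, 2.0]])
membership = np.repeat(np.arange(len(sizes)), sizes)
Sigma = B[np.ix_(membership, membership)] + np.diag(a[membership])
a_hat, B_hat = block_average_estimates(Sigma, sizes)
```

Each estimate needs only sums over a submatrix, so the cost grows with the number of entries of the covariance matrix and no matrix inversion or iterative optimization is involved.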
The estimation procedure for SCFA parameters is scalable for high-dimensional data (e.g., ) as long as . In contrast, the classic CFA model faces challenges in computation when (1) and (2) but is several hundred, due to the bottleneck of computing the large covariance (and its inverse) using the maximum likelihood approach. Therefore, SCFA addresses a longstanding limitation in CFA regarding dimensionality constraints.
Furthermore, the proposed factor score estimator of is
(\arabicequation) |
for , where the OLS estimator is identical to the generalized least-square (GLS) estimator and the feasible generalized least-square (FGLS) estimator. The derivation is provided in the Supplementary Material. The following theorems exhibit the theoretical properties of the above estimators.
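The equivalence between the OLS and GLS factor scores can be verified numerically in a simplified setting. The sketch below assumes a common loading `c[k]` and a common error variance `a[k]` within each community (illustrative assumptions with hypothetical values) and uses placeholder random data; it also shows that both estimators reduce to a community-wise average of the centered observations.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = np.array([3, 2])
membership = np.repeat(np.arange(len(sizes)), sizes)
c = np.array([1.0, 2.0])                       # assumed common loading per factor
a = np.array([0.5, 0.8])                       # assumed error variance per community
L = np.eye(len(sizes))[membership] * c
Sigma_u_inv = np.diag(1.0 / a[membership])

X = rng.normal(size=(10, sizes.sum()))         # placeholder data with n = 10
Xc = X - X.mean(axis=0)                        # center each variable

# OLS scores: (L'L)^{-1} L' x_i for every observation at once.
F_ols = np.linalg.solve(L.T @ L, L.T @ Xc.T).T
# GLS scores: (L' Su^{-1} L)^{-1} L' Su^{-1} x_i.
F_gls = np.linalg.solve(L.T @ Sigma_u_inv @ L, L.T @ Sigma_u_inv @ Xc.T).T
# Shortcut: community-wise mean of the centered data divided by c[k].
F_mean = np.stack([Xc[:, membership == k].mean(axis=1) / c[k]
                   for k in range(len(sizes))], axis=1)
```

Because the error variance is constant within a community, the GLS weights cancel factor by factor, which is why weighting does not change the scores in this sketch.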
Theorem \arabicsection.\arabictheorem.
If , then the proposed matrix estimators , , and are uniformly minimum-variance unbiased estimators (UMVUEs).
Please refer to the proof in the Supplementary Material. As and are (unique) maximum likelihood estimators, they also exhibit large-sample properties such as consistency, asymptotic efficiency, and asymptotic normality, under the conditions that and for fixed and .
Theorem \arabicsection.\arabictheorem.
If and is positive definite (i.e., for every ), then defined as (\arabicequation) is UMVUE, following a multivariate normal distribution with mean and covariance matrix presented in Theorem \arabicsection.\arabictheorem.
The proofs of the equivalence (\arabicequation) and Theorem \arabicsection.\arabictheorem are elaborated in the Supplementary Material. also exhibits large-sample properties such as consistency, asymptotic efficiency, and asymptotic normality, as . The normality result in Theorem \arabicsection.\arabictheorem extends the confidence intervals for factor scores . In addition to the maximum likelihood estimators and , we provide their exact closed-form variance estimators in the following theorem. Moreover, the exact covariance matrix of the proposed factor score estimator is obtainable from (\arabicequation).
Theorem \arabicsection.\arabictheorem.
(1) The exact variance estimators of and are
for every and .
(2) The exact covariance matrix estimator of is
In particular, as for all , then and for each .
Utilizing Theorem \arabicsection.\arabictheorem and the estimates provided in (\arabicequation), we can perform Wald-type hypothesis tests and compute interval estimates for all model parameters , , and .
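A Wald-type test and interval for a single parameter can be sketched as follows. The function name, the null value, and the numerical inputs are hypothetical; the estimate and its standard error would come from the closed-form estimators and variance estimators described above.

```python
from statistics import NormalDist

def wald_summary(estimate, std_error, null_value=0.0, level=0.95):
    """Two-sided Wald z-test and normal-approximation confidence interval."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    z_stat = (estimate - null_value) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    ci = (estimate - z_crit * std_error, estimate + z_crit * std_error)
    return z_stat, p_value, ci

# Hypothetical estimate of a between-factor covariance and its standard error.
z_stat, p_value, ci = wald_summary(estimate=0.30, std_error=0.10)
```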
3 SIMULATION STUDIES
3.1 Data generation
We perform Monte Carlo simulation to generate the observation vector for , where the factor loading matrix , the common factor , and the error term for . Specifically, we explore various values for (as indicated in Table \arabictable) under different sample sizes, i.e., , with and
We repeat the above data generation procedure times.
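A data-generating step of this kind can be sketched as below. The sample size, community sizes, loadings, and covariance parameters are hypothetical placeholders rather than the settings used in the study, and normal factors and errors are assumed.

```python
import numpy as np

def simulate_scfa(n, sizes, c, Sigma_f, a, rng):
    """Draw n observations from x = L f + u with block-structured loadings."""
    membership = np.repeat(np.arange(len(sizes)), sizes)
    L = np.eye(len(sizes))[membership] * c                      # p-by-K loading matrix
    F = rng.multivariate_normal(np.zeros(len(sizes)), Sigma_f, size=n)
    U = rng.normal(size=(n, len(membership))) * np.sqrt(a[membership])
    return F @ L.T + U, F

rng = np.random.default_rng(2024)
sizes = np.array([30, 20, 10])                                  # hypothetical community sizes
Sigma_f = np.array([[1.0, 0.3, 0.2],
                    [0.3, 1.5, 0.4],
                    [0.2, 0.4, 2.0]])
X, F = simulate_scfa(n=200, sizes=sizes, c=np.ones(3),
                     Sigma_f=Sigma_f, a=np.array([0.5, 0.8, 1.0]), rng=rng)
```

Returning the true factor scores `F` alongside the data makes it straightforward to compute a loss between estimated and true scores in each replicate.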
3.2 Assessing the estimated factor scores and model parameters
We apply the SCFA model to each simulated dataset to estimate , , , and , respectively. Since a primary objective of CFA is dimension reduction, yielding reliable common factors, we first focus on evaluating the performance of estimation. Specifically, we employ the Euclidean loss as the evaluation criterion. We calculate by (\arabicequation) and benchmark it against competing methods including the various CFA computational methods implemented by the R packages of “sem” (Fox, 2006; Fox et al., 2022), “OpenMx” (Neale et al., 2016; Boker et al., 2023), and “lavaan” (Rosseel, 2012; Rosseel et al., 2023), respectively. We constrain the scales of loadings by all methods to be identical for fair comparison.
In Table \arabictable, we summarize the mean and standard deviation of the Euclidean losses for each approach across the replicates under different settings. The results indicate that SCFA outperforms conventional numerical approaches, exhibiting the lowest average loss, while all other methods demonstrate, on average, twice the SCFA loss. Additionally, SCFA shows a much-reduced variation in loss (e.g., around one-fifth) compared to the competing methods. Lastly, SCFA drastically improves computational efficiency by at least times. As mentioned earlier, the numerical implementations of the conventional CFA method may be limited for input datasets with and may thus yield no results (i.e., “NA” entries in Table \arabictable). In contrast, SCFA provides a viable solution for datasets with , which are very common in practice (e.g., omics, imaging, and financial data).
SCFA mean | SCFA SD | SCFA time | sem mean | sem SD | sem time | OpenMx mean | OpenMx SD | OpenMx time | lavaan mean | lavaan SD | lavaan time
---|---|---|---|---|---|---|---|---|---|---|---
12.16 | 0.78 | 0.04 | 20.7 | 5.93 | 23.03 | 21.31 | 5.84 | 139.89 | 21.24 | 5.84 | 16.21
10.03 | 0.67 | 0.14 | 20.26 | 6.95 | 102.29 | 20.7 | 6.85 | 331.37 | 20.66 | 6.86 | 22.39
8.72 | 0.58 | 0.04 | NA | NA | NA | 19.53 | 8.07 | 866.16 | NA | NA | NA
17.26 | 0.87 | 0.04 | 30.06 | 10.06 | 423.17 | 30.98 | 9.75 | 904.6 | 30.93 | 9.79 | 47.75
12.27 | 0.65 | 0.05 | NA | NA | NA | 31.05 | 14.22 | 262.32 | NA | NA | NA
9.83 | 0.49 | 0.22 | NA | NA | NA | NA | NA | NA | NA | NA | NA
25.88 | 1.01 | 0.04 | 39.75 | 10.98 | 409.12 | 40.58 | 10.82 | 866.05 | 40.54 | 10.82 | 47.29
14.95 | 0.60 | 0.07 | NA | NA | NA | NA | NA | NA | NA | NA | NA
11.49 | 0.51 | 0.13 | NA | NA | NA | NA | NA | NA | NA | NA | NA
In addition to the factor scores, we evaluate the performance of estimating model parameters and . As parametric matrices and can be represented by parameters and , we assess the accuracy of the estimators and in (\arabicequation).
The estimation results are summarized in Table \arabictable. We consider the following metrics in Table \arabictable: the average bias, Monte Carlo standard deviation, average standard error, and Wald-type empirical coverage probability using the proposed estimates for each and . The results in Table \arabictable demonstrate that our estimation is generally accurate with small biases and approximate coverage probabilities. Specifically, for each parameter, the bias is relatively small when compared to the Monte Carlo standard deviation, while the average standard error is close to the Monte Carlo standard deviation. Furthermore, both the bias and average standard error decrease as the sample size increases, and the coverage probability approaches the nominal level as the sample size grows.
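The four summary metrics can be computed from the replicate-level output as sketched below. The function name and the toy inputs are hypothetical; the coverage probability is based on a normal-approximation Wald interval for each replicate.

```python
import numpy as np
from statistics import NormalDist

def summarize_replicates(estimates, std_errors, truth, level=0.95):
    """Bias, Monte Carlo SD, average SE, and Wald-type coverage over replicates."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    covered = np.abs(estimates - truth) <= z * std_errors   # truth inside the interval
    return {
        "bias": estimates.mean() - truth,       # average estimate minus true value
        "mcsd": estimates.std(ddof=1),          # spread of estimates across replicates
        "ase": std_errors.mean(),               # average estimated standard error
        "cp": covered.mean(),                   # empirical coverage probability
    }

# Toy example with four replicates of one parameter whose true value is 1.0.
summary = summarize_replicates([0.9, 1.1, 1.0, 1.4], [0.1, 0.1, 0.1, 0.1], truth=1.0)
```

An average standard error close to the Monte Carlo standard deviation indicates that the variance estimator is well calibrated, which is the comparison made in the table.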
In comparison, we also utilize the aforementioned three R packages to estimate all in the factor loading matrix , all diagonal elements of , and all elements of . We also calculate the average bias and asymptotic standard error using the results produced by “sem”, “OpenMx”, and “lavaan”, respectively, for all elements of , , and . Due to the page limit, we present the results in the Supplementary Material. In cases where , the proposed estimators demonstrate superior performance compared to the estimates produced by the R packages “sem”, “OpenMx”, and “lavaan” with lower average standard errors and much shorter computational time. When , the computational times of conventional methods become long and even intractable, resulting in “NA” values for model parameters. This is primarily due to the presence of singular sample covariance matrices. In general, the performance of all methods is comparable, with computational cost being the bottleneck for the competing methods.
bias | MCSD | ASE | CP | bias | MCSD | ASE | CP | bias | MCSD | ASE | CP | |||
, | , | , | ||||||||||||
0.0 | 1.0 | 1.0 | 92 | 0.0 | 0.7 | 0.8 | 93 | 0.0 | 0.6 | 0.7 | 97 | |||
-0.1 | 2.1 | 2.0 | 92 | 0.2 | 1.6 | 1.6 | 97 | -0.3 | 1.3 | 1.3 | 97 | |||
-0.4 | 3.7 | 4.2 | 96 | 0.1 | 3.5 | 3.4 | 94 | -0.1 | 2.8 | 2.9 | 96 | |||
-2.5 | 45.9 | 45.6 | 89 | -0.3 | 52.1 | 46.0 | 91 | 0.7 | 46.7 | 46.1 | 96 | |||
-3.0 | 41.4 | 40.9 | 92 | -4.3 | 43.3 | 41.6 | 92 | -5.4 | 47.1 | 41.8 | 94 | |||
2.8 | 45.5 | 48.0 | 94 | 2.6 | 50.4 | 47.5 | 92 | -3.6 | 51.6 | 47.2 | 89 | |||
-14.2 | 76.1 | 68.4 | 90 | -2.5 | 78.3 | 70.8 | 88 | -2.5 | 66.9 | 70.7 | 92 | |||
-7.9 | 62.9 | 59.5 | 90 | -2.8 | 58.9 | 59.7 | 96 | -3.7 | 64.1 | 59.7 | 91 | |||
3.3 | 74.8 | 85.8 | 97 | -7.4 | 80.0 | 82.9 | 94 | -6.2 | 84.7 | 82.9 | 89 | |||
, | , | , | ||||||||||||
0.0 | 0.5 | 0.5 | 95 | 0.1 | 0.4 | 0.3 | 93 | 0.0 | 0.2 | 0.3 | 98 | |||
0.0 | 0.9 | 1.0 | 97 | 0.0 | 0.7 | 0.7 | 93 | 0.0 | 0.5 | 0.5 | 96 | |||
-0.4 | 1.9 | 2.0 | 95 | 0.0 | 1.5 | 1.4 | 96 | -0.1 | 1.3 | 1.2 | 93 | |||
3.1 | 33.3 | 32.8 | 92 | 1.0 | 28.8 | 32.4 | 97 | 0.4 | 31.6 | 32.3 | 95 | |||
-0.1 | 31.3 | 29.6 | 95 | 0.3 | 28.4 | 29.7 | 97 | 1.9 | 26.6 | 29.6 | 96 | |||
-0.4 | 35.5 | 33.2 | 96 | -0.2 | 33.2 | 33.2 | 96 | 2.0 | 33.4 | 33.8 | 95 | |||
-5.3 | 48.3 | 49.2 | 94 | 3.3 | 50.1 | 50.5 | 95 | 3.0 | 54.0 | 50.4 | 92 | |||
-3.9 | 41.0 | 41.5 | 93 | -3.6 | 44.1 | 42.2 | 93 | 6.4 | 42.8 | 43.2 | 97 | |||
-13.2 | 55.1 | 57.2 | 93 | -7.4 | 56.1 | 57.8 | 92 | 6.3 | 60.4 | 59.9 | 96 | |||
, | , | , | ||||||||||||
-0.1 | 0.3 | 0.4 | 100 | 0.0 | 0.2 | 0.2 | 95 | 0.0 | 0.2 | 0.2 | 96 | |||
-0.1 | 0.8 | 0.8 | 94 | -0.1 | 0.4 | 0.4 | 92 | 0.0 | 0.3 | 0.3 | 94 | |||
0.1 | 1.6 | 1.7 | 97 | -0.1 | 0.9 | 0.9 | 97 | 0.0 | 0.6 | 0.7 | 98 | |||
-0.9 | 23.3 | 26.2 | 97 | -1.4 | 25.3 | 26.1 | 95 | -0.5 | 26.1 | 26.2 | 95 | |||
0.1 | 23.2 | 23.9 | 96 | -4.3 | 21.0 | 23.5 | 97 | 0.7 | 23.7 | 23.9 | 95 | |||
-2.2 | 26.9 | 26.8 | 94 | -1.1 | 27.9 | 26.9 | 93 | -0.3 | 28.5 | 27.1 | 92 | |||
-3.1 | 40.2 | 40.4 | 90 | -9.1 | 36.9 | 39.5 | 93 | -4.8 | 37.7 | 40.0 | 97 | |||
-3.5 | 32.0 | 34.0 | 96 | -6.4 | 32.9 | 33.7 | 92 | -0.9 | 33.6 | 34.3 | 94 | |||
-12.8 | 47.1 | 46.6 | 92 | -5.6 | 46.2 | 47.3 | 94 | -1.1 | 51.8 | 47.8 | 90 |
3.3 Simulation results with model misspecification
We evaluate the performance of SCFA in scenarios of model misspecification, particularly when the covariance matrix deviates from an interconnected community structure. Under the misspecified structure, we generate with , where , , and are model parameters defined in Section \arabicsection.\arabicsubsection. We set , , , and . is an additional noise term that results in the covariance structure deviating from the interconnected community structure. Specifically, , where the noise level and the noise scale of each entry in averages approximately , which overwhelms . We repeat this data generation procedure times.
We apply the SCFA model to each simulated dataset and calculate the average bias, Monte Carlo standard deviation, average standard error, and coverage probability for each model parameter. The summarized results are presented in Table \arabictable. Additional results concerning higher values of are available in the Supplementary Material.
The results show that both the mean and the standard deviation of the Euclidean losses for the estimated factor scores are larger than under a correctly specified model, and both increase with the noise level. When the proposed interconnected community structure is violated by moderate random noise (e.g., the diagonal elements of the mean matrix of are larger than those of ), the SCFA model remains robust, yielding reliable estimates of the common factors and of the covariance parameters among them (see Table \arabictable).
noise scale | | | | noise scale | | | | noise scale | | |
---|---|---|---|---|---|---|---|---|---|---|---
bias | MCSD | ASE | CP | bias | MCSD | ASE | CP | bias | MCSD | ASE | CP
203.2 | 3.8 | 3.6 | 0 | 602.3 | 13.4 | 10.3 | 0 | 989.9 | 21.4 | 16.9 | 0
203.5 | 4.6 | 3.8 | 0 | 596.1 | 13.2 | 10.4 | 0 | 986.7 | 19.5 | 17.0 | 0
201.9 | 4.5 | 3.7 | 0 | 614.1 | 11.8 | 9.7 | 0 | 1000.4 | 18.9 | 15.3 | 0
-2.8 | 25.7 | 26.3 | 96 | -1.3 | 27.0 | 27.3 | 96 | -2.5 | 26.9 | 28.0 | 94
0.5 | 26.3 | 24.2 | 91 | 2.7 | 28.5 | 25.1 | 90 | 2.3 | 28.9 | 25.6 | 94
-2.9 | 26.2 | 27.2 | 98 | -1.6 | 27.2 | 27.9 | 96 | 0.7 | 26.9 | 28.5 | 98
1.2 | 39.5 | 41.2 | 94 | 3.3 | 40.8 | 42.4 | 94 | 4.8 | 40.4 | 43.4 | 94
-0.9 | 33.4 | 34.9 | 96 | 0.8 | 33.8 | 35.5 | 95 | 0.6 | 34.0 | 36.1 | 96
1.1 | 51.3 | 48.4 | 97 | 0.3 | 51.8 | 49.0 | 95 | 2.6 | 51.9 | 49.9 | 96
loss mean | loss SD | | | loss mean | loss SD | | | loss mean | loss SD | |
34.70 | 1.23 | | | 58.13 | 2.26 | | | 75.94 | 2.68 | |
APPLICATION TO GENE EXPRESSION DATA
We applied the SCFA model to a study of microRNA regulation of gene expression, using the Pan-kidney dataset from The Cancer Genome Atlas (TCGA) project (Tomczak et al., 2015). The gene expression dataset comprised expression levels, in Reads Per Million mapped reads (RPM), for genes, collected from patients with age (mean and standard deviation years) and sex ( female and male).
In this analysis, we extracted an interconnected community structure from the gene co-expression matrix using the algorithm of Wu et al. (2021); other community detection algorithms gave similar results. The detected interconnected community structure comprised communities of genes with varying sizes: , , , , , and , as illustrated in Figure \arabicfigure. Genes were highly correlated within each community (module), while the interconnections between modules could be either positive or negative. Our goal was to identify a latent factor for each module (i.e., dimension reduction of correlated genes) and to estimate module-to-module interactions.
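To make the within-module and between-module pattern concrete, the sketch below averages the entries of a correlation matrix over community blocks, given community labels. It does not reproduce the detection algorithm of Wu et al. (2021): the labels are taken as given, and the function name `block_mean_correlations` and the toy matrix are hypothetical.

```python
import numpy as np


def block_mean_correlations(R, labels, n_blocks):
    """Average correlation within each community and between each pair of communities.

    R: p x p correlation matrix; labels: length-p integer array with values 0..n_blocks-1.
    Within-community averages exclude the unit diagonal.
    """
    R = np.asarray(R, dtype=float)
    labels = np.asarray(labels)
    B = np.full((n_blocks, n_blocks), np.nan)
    for a in range(n_blocks):
        ia = np.flatnonzero(labels == a)
        for b in range(n_blocks):
            ib = np.flatnonzero(labels == b)
            block = R[np.ix_(ia, ib)]
            if a == b:
                m = ia.size
                if m > 1:  # a singleton community has no off-diagonal entries
                    B[a, a] = (block.sum() - np.trace(block)) / (m * (m - 1))
            else:
                B[a, b] = block.mean()
    return B


# Toy example: two communities, positively correlated within, negatively between.
labels = np.array([0, 0, 0, 1, 1])
target = np.array([[0.6, -0.3], [-0.3, 0.5]])
R = target[np.ix_(labels, labels)].copy()
np.fill_diagonal(R, 1.0)
B = block_mean_correlations(R, labels, 2)
print(B)
```

Large positive diagonal entries of `B` together with off-diagonal entries of either sign correspond to the pattern described above: strong correlation within modules and positive or negative interconnections between them.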
Based on the learned interconnected community structure, we applied the SCFA model for parameter estimation. The results are presented in Figure \arabicfigure. The first row of Figure \arabicfigure shows, from left to right, the original data, the extracted interconnected community structure, and the large covariance estimate obtained by the method of Yang et al. (2024). The second row shows the decomposition of this estimated covariance matrix of into SCFA parameters . The estimated , , and are displayed as heatmaps; the middle heatmap in row two illustrates the interactive relationships between common factors. The bottom subfigure of Figure \arabicfigure is a classic path diagram of the factor analysis, showing factor memberships and intra-factor and inter-factor correlations. As demonstrated, SCFA reduced gene expression variables to correlated common factors (with the estimated covariance matrix ), while accurately representing the complex relationships among all variables. The computational time was less than one minute. Owing to the page limit, detailed numerical estimates, gene names, the corresponding community memberships, and estimated factor scores are provided in the Supplementary Material.
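The reduction from observed variables to one common factor per community can be illustrated with generic least-squares factor scores. The sketch below is not the SCFA implementation: it assumes a membership-based loading matrix with unit loadings (a hypothetical simplification), centered data with subjects in rows, and simulated toy data; the names `membership_loading` and `least_squares_scores` are illustrative only.

```python
import numpy as np


def membership_loading(labels, n_factors):
    """p x K loading matrix with a single nonzero entry per row (unit loadings assumed)."""
    labels = np.asarray(labels)
    L = np.zeros((labels.size, n_factors))
    L[np.arange(labels.size), labels] = 1.0
    return L


def least_squares_scores(X, L):
    """Least-squares factor scores, F_hat = X L (L^T L)^{-1}, for centered n x p data X."""
    # Solve L F^T = X^T in the least-squares sense, one column per subject.
    F_t, *_ = np.linalg.lstsq(L, X.T, rcond=None)
    return F_t.T


# Toy data: 9 variables in 3 communities, 200 subjects, correlated common factors.
rng = np.random.default_rng(1)
labels = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
L = membership_loading(labels, 3)
Sigma_F = np.array([[1.0, 0.5, -0.3],
                    [0.5, 1.0, 0.2],
                    [-0.3, 0.2, 1.0]])
F = rng.multivariate_normal(np.zeros(3), Sigma_F, size=200)
X = F @ L.T + rng.normal(0.0, 0.1, size=(200, 9))

F_hat = least_squares_scores(X, L)
loss = np.linalg.norm(F_hat - F)  # Euclidean (Frobenius) loss of the scores
print(F_hat.shape, loss)
```

Under the unit-loading assumption, each estimated score is simply the average of the variables in its community, which is why the computation scales easily to a large number of variables.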
We conducted pathway analysis to investigate the relationship between each common factor and kidney cancer. The results revealed that each common factor characterized distinct cellular and molecular functions related to kidney cancer. Specifically, the first common factor was enriched in pathways related to G protein-coupled receptors (GPCRs), which play pivotal roles in various aspects of cancer progression, including tumor growth, invasion, migration, survival, and metastasis (Arakaki et al., 2018). The second common factor was enriched in pathways related to cellular respiration in mitochondria, which are essential to cancer cells, including renal adenocarcinoma (Wallace, 2012). The third common factor was enriched in pathways associated with the body’s immune response: renal cell carcinoma (RCC) is considered to be an immunogenic tumor but is known to mediate immune dysfunction (Díaz-Montero et al., 2020). For the remaining common factors, the associations may be less distinctive; details are provided in the Supplementary Material.
For comparison, we also fitted the conventional CFA model, with the same detected community membership, using the R packages “sem”, “OpenMx”, and “lavaan”. None of these standard packages could handle the gene expression dataset due to . Among the tools considered, SCFA was therefore the only one able to fit a CFA model in this high-dimensional application.
DISCUSSION
We have developed a novel confirmatory factor analysis model, the SCFA model, designed for high-dimensional data. The traditional CFA model, as implemented in standard computational packages, relies on predefined model specifications and faces computational challenges with high-dimensional datasets; the SCFA model addresses both limitations by leveraging a data-driven covariance structure. Because the interconnected community structure is prevalent in covariance matrices across diverse high-dimensional data types and has a block form, we integrate it into the SCFA model. This integration yields likelihood-based uniformly minimum-variance unbiased estimators (UMVUEs) for the factor loading matrix , the covariance matrix of common factors , and the covariance matrix of error terms in closed forms, as well as explicit, consistent least-squares estimators for the factor scores. Additionally, explicit variance estimators facilitate statistical inference for these parameters.
Extensive simulation studies have demonstrated the superior performance of the SCFA model in both parameter estimation and factor score accuracy. Notably, SCFA significantly reduces the computational burden compared to existing CFA computational packages. The SCFA model also exhibits robustness to moderate violations of the interconnected community structure. In an application to TCGA gene expression data, we employed the SCFA model to explore the relationships between common factors and genes, conducting statistical inference on covariances between these common factors. These findings underscore the limitations of classical CFA models and standard computational packages in handling such high-dimensional data analyses.
In summary, SCFA models combine the benefits of conventional CFA models, including dimension reduction and flexibility in covariance matrix modeling for common factors, with the ability to adapt to data-driven covariance structures. As many high-dimensional datasets implicitly exhibit interconnected community structures, SCFA holds broad applicability. The software package is available at https://github.com/yiorfun/SCFA.
References
- Anderson (2003) Anderson, T. (2003). An Introduction to Multivariate Statistical Analysis. John Wiley and Sons.
- Arakaki et al. (2018) Arakaki, A. K. S., Pan, W.-A. & Trejo, J. (2018). GPCRs in cancer: Protease-activated receptors, endocytic adaptors and signaling. International Journal of Molecular Sciences 19.
- Bai & Li (2012) Bai, J. & Li, K. (2012). Statistical analysis of factor models of high dimension. The Annals of Statistics 40, 436–465.
- Basilevsky (2009) Basilevsky, A. T. (2009). Statistical factor analysis and related methods: theory and applications. John Wiley & Sons.
- Boker et al. (2023) Boker, S. M., Neale, M. C., Maes, H. H., Wilde, M. J., Spiegel, M., Brick, T. R., Estabrook, R., Bates, T. C., Mehta, P., von Oertzen, T., Gore, R. J., Hunter, M. D., Hackett, D. C., Karch, J., Brandmaier, A. M., Pritikin, J. N., Zahery, M., Kirkpatrick, R. M., Wang, Y., Goodrich, B., Driver, C., Massachusetts Institute of Technology, Johnson, S. G., Association for Computing Machinery, Kraft, D., Wilhelm, S., Medland, S., Falk, C. F., Keller, M., Manjunath, B. G., The Regents of the University of California, Ingber, L., Voon, W. S., Palacios, J., Yang, J., Guennebaud, G. & Niesen, J. (2023). OpenMx: Extended Structural Equation Modelling. R package version 2.21.8.
- Brown (2015) Brown, T. A. (2015). Confirmatory factor analysis for applied research. Guilford publications.
- Browne (2001) Browne, M. W. (2001). An overview of analytic rotation in exploratory factor analysis. Multivariate Behavioral Research 36, 111–150.
- Carlson & Mulaik (1993) Carlson, M. & Mulaik, S. A. (1993). Trait ratings from descriptions of behavior as mediated by components of meaning. Multivariate Behavioral Research 28, 111–159.
- Chen et al. (2018) Chen, S., Kang, J., Xing, Y., Zhao, Y. & Milton, D. K. (2018). Estimating large covariance matrix with network topology for high-dimensional biomedical data. Computational Statistics and Data Analysis 127, 82–95.
- Chen et al. (2023) Chen, S., Zhang, Y., Wu, Q., Bi, C., Kochunov, P. & Hong, L. E. (2023). Identifying covariate-related subnetworks for whole-brain connectome analysis. Biostatistics, kxad007.
- Chiappelli et al. (2019) Chiappelli, J., Rowland, L. M., Wijtenburg, S. A., Chen, H., Maudsley, A. A., Sheriff, S., Chen, S., Savransky, A., Marshall, W., Ryan, M. C., Bruce, H. A., Shuldiner, A. R., Mitchell, B. D., Kochunov, P. & Hong, L. E. (2019). Cardiovascular risks impact human brain n-acetylaspartate in regionally specific patterns. Proceedings of the National Academy of Sciences 116, 25243–25249.
- Colizza et al. (2006) Colizza, V., Flammini, A., Serrano, M. A. & Vespignani, A. (2006). Detecting rich-club ordering in complex networks. Nature Physics 2, 110–115.
- Díaz-Montero et al. (2020) Díaz-Montero, C. M., Rini, B. I. & Finke, J. H. (2020). The immunology of renal cell carcinoma. Nature Reviews Nephrology 16, 721–735.
- Fan & Han (2017) Fan, J. & Han, X. (2017). Estimation of the false discovery proportion with unknown dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79, 1143–1164.
- Fan et al. (2020) Fan, J., Li, R., Zhang, C.-H. & Zou, H. (2020). Statistical foundations of data science. Chapman and Hall/CRC.
- Fan et al. (2013) Fan, J., Liao, Y. & Mincheva, M. (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75, 603–680.
- Fan et al. (2021) Fan, J., Wang, K., Zhong, Y. & Zhu, Z. (2021). Robust high dimensional factor models with applications to statistical machine learning. Statistical Science 36, 303.
- Fortunato (2010) Fortunato, S. (2010). Community detection in graphs. Physics Reports 486, 75–174.
- Fox (2006) Fox, J. (2006). Teacher’s corner: Structural equation modeling with the sem package in R. Structural Equation Modeling 13, 465–486.
- Fox et al. (2022) Fox, J., Nie, Z., Byrnes, J., Culbertson, M., DebRoy, S., Friendly, M., Goodrich, B., Jones, R. H., Kramer, A., Monette, G. & Novomestky, F. (2022). sem: Structural Equation Models. R package version 3.1-15.
- Friguet et al. (2009) Friguet, C., Kloareg, M. & Causeur, D. (2009). A factor model approach to multiple testing under dependence. Journal of the American Statistical Association 104, 1406–1415.
- Gana & Broc (2019) Gana, K. & Broc, G. (2019). Structural equation modeling with lavaan. John Wiley & Sons.
- Girvan & Newman (2002) Girvan, M. & Newman, M. E. J. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99, 7821–7826.
- Huttlin et al. (2017) Huttlin, E. L., Bruckner, R. J., Paulo, J. A., Cannon, J. R., Ting, L., Baltier, K., Colby, G., Gebreab, F., Gygi, M. P., Parzen, H. et al. (2017). Architecture of the human interactome defines protein communities and disease networks. Nature 545, 505–509.
- ISGlobal (2021) ISGlobal (2021). Barcelona Institute for Global Health. https://www.isglobal.org/en. Accessed: 2022-07-30.
- Jackson et al. (2009) Jackson, D. L., Gillaspy Jr, J. A. & Purc-Stephenson, R. (2009). Reporting practices in confirmatory factor analysis: an overview and some recommendations. Psychological Methods 14, 6.
- Jöreskog (1969) Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika 34, 183–202.
- Lawley (1958) Lawley, D. (1958). Estimation in factor analysis under various initial assumptions. British Journal of Statistical Psychology 11, 1–12.
- Lei & Rinaldo (2015) Lei, J. & Rinaldo, A. (2015). Consistency of spectral clustering in stochastic block models. The Annals of Statistics 43, 215–237.
- Levine et al. (2015) Levine, J. H., Simonds, E. F., Bendall, S. C., Davis, K. L., Amir, E. D., Tadmor, M. D., Litvin, O., Fienberg, H. G., Jager, A., Zunder, E. R. et al. (2015). Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162, 184–197.
- Li et al. (2022) Li, T., Lei, L., Bhattacharyya, S., Van den Berge, K., Sarkar, P., Bickel, P. J. & Levina, E. (2022). Hierarchical community detection by recursive partitioning. Journal of the American Statistical Association 117, 951–968.
- Neale et al. (2016) Neale, M. C., Hunter, M. D., Pritikin, J. N., Zahery, M., Brick, T. R., Kirkpatrick, R. M., Estabrook, R., Bates, T. C., Maes, H. H. & Boker, S. M. (2016). OpenMx 2.0: Extended structural equation and statistical modeling. Psychometrika 81, 535–549.
- Newman & Girvan (2004) Newman, M. E. J. & Girvan, M. (2004). Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113.
- Oberski (2014) Oberski, D. (2014). lavaan.survey: An R package for complex survey analysis of structural equation models. Journal of Statistical Software 57, 1–27.
- Perrot-Dockès et al. (2022) Perrot-Dockès, M., Lévy-Leduc, C. & Rajjou, L. (2022). Estimation of large block structured covariance matrices: Application to ‘multi-omic’ approaches to study seed quality. Journal of the Royal Statistical Society: Series C (Applied Statistics) 71, 119–147.
- Ritchie et al. (2023) Ritchie, S. C., Surendran, P., Karthikeyan, S., Lambert, S. A., Bolton, T., Pennells, L., Danesh, J., Di Angelantonio, E., Butterworth, A. S. & Inouye, M. (2023). Quality control and removal of technical variation of NMR metabolic biomarker data in ~120,000 UK Biobank participants. Scientific Data 10, 64.
- Rosseel (2012) Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software 48, 1–36.
- Rosseel et al. (2023) Rosseel, Y., Jorgensen, T. D., Rockwood, N., Oberski, D., Byrnes, J., Vanbrabant, L., Savalei, V., Merkle, E., Hallquist, M., Rhemtulla, M., Katsikatsou, M., Barendse, M., Scharf, F. & Du, H. (2023). lavaan: Latent Variable Analysis. R package version 0.6-15.
- Schaub et al. (2023) Schaub, M. T., Li, J. & Peel, L. (2023). Hierarchical community structure in networks. Phys. Rev. E 107, 054305.
- Schreiber et al. (2006) Schreiber, J. B., Nora, A., Stage, F. K., Barlow, E. A. & King, J. (2006). Reporting structural equation modeling and confirmatory factor analysis results: A review. The Journal of Educational Research 99, 323–338.
- Simpson et al. (2013) Simpson, S. L., Bowman, F. D. & Laurienti, P. J. (2013). Analyzing complex functional brain networks: Fusing statistics and network science to understand the brain. Statistics Surveys 7, 1–36.
- Tomczak et al. (2015) Tomczak, K., Czerwińska, P. & Wiznerowicz, M. (2015). The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemporary Oncology/Współczesna Onkologia 2015, 68–77.
- Wallace (2012) Wallace, D. C. (2012). Mitochondria and cancer. Nature Reviews Cancer 12, 685–698.
- Wang et al. (2020) Wang, Z., Liang, Y. & Ji, P. (2020). Spectral algorithms for community detection in directed networks. The Journal of Machine Learning Research 21, 6101–6145.
- Wu et al. (2021) Wu, Q., Ma, T., Liu, Q., Milton, D. K., Zhang, Y. & Chen, S. (2021). ICN: extracting interconnected communities in gene co-expression networks. Bioinformatics 37, 1997–2003.
- Yang et al. (2024) Yang, Y., Chen, C. & Chen, S. (2024). Covariance matrix estimation for high-throughput biomedical data with interconnected communities. The American Statistician In press.
- Zitnik et al. (2018) Zitnik, M., Sosič, R. & Leskovec, J. (2018). Prioritizing network communities. Nature Communications 9, 2544.