2024, Volume 1, Number 1. Accessed on arXiv on 27 March 2024.

Semi-Confirmatory Factor Analysis for High-Dimensional Data with Interconnected Community Structures

Yifan Yang, Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, Ohio, 44106 U.S.A.
Tianzhou Ma, Department of Epidemiology and Biostatistics, University of Maryland, College Park, 4200 Valley Drive, College Park, Maryland, 20742 U.S.A.
Chuan Bi, University of Maryland School of Medicine, 655 W. Baltimore Street, Baltimore, Maryland 21201, U.S.A.
Shuo Chen, University of Maryland School of Medicine, 655 W. Baltimore Street, Baltimore, Maryland 21201, U.S.A.
(1 January 2024; 1 January 2024)
Abstract

Confirmatory factor analysis (CFA) is a statistical method for identifying and confirming the presence of latent factors among observed variables through the analysis of their covariance structure. Compared to alternative factor models, CFA offers interpretable common factors with enhanced specificity and a more adaptable approach to modeling covariance structures. However, the application of CFA has been limited by the requirement for prior knowledge about “non-zero loadings” and by the lack of computational scalability (e.g., it can be computationally intractable for hundreds of observed variables). We propose a data-driven semi-confirmatory factor analysis (SCFA) model that attempts to alleviate these limitations. SCFA automatically specifies “non-zero loadings” by learning the network structure of the large covariance matrix of observed variables, and then offers closed-form estimators for factor loadings, factor scores, covariances between common factors, and variances between errors using the likelihood method. Therefore, SCFA is applicable to high-throughput datasets (e.g., hundreds of thousands of observed variables) without requiring prior knowledge about “non-zero loadings”. Through an extensive simulation analysis benchmarking against standard packages, SCFA exhibits superior performance in estimating model parameters with a much reduced computational time. We illustrate its practical application through factor analysis on a high-dimensional RNA-seq gene expression dataset.

keywords:
Closed-Form Solution; Factor Score; Interconnected Community Structure; Statistical Inference.
journal: Biometrika

1 INTRODUCTION

Factor analysis is a commonly used statistical technique to elucidate the relationship between multivariate observations. Factor models aim to identify the underlying factors that collectively describe the interdependencies present in multivariate observed data (Anderson, 2003). As correlated high-dimensional observed variables can be effectively decomposed into a smaller number of common factors, i.e., achieving dimension reduction, factor analysis models have garnered popularity in various fields, including social science, psychology, molecular biology, and others (Schreiber et al., 2006; Fan et al., 2020, 2021).

Factor analysis is broadly classified into two categories: exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). EFA is frequently employed to explore the interactive relationships among observed variables and to identify latent common factors, without prerequisite knowledge of grouping these observed variables. In contrast, CFA is commonly utilized to validate whether the empirical evidence supports a predetermined latent structure of the shared variance in the model specification. In a CFA model, prerequisite knowledge or empirical evidence regarding the grouping of observed variables is represented by predefined “non-zero loadings” in the factor loading matrix, establishing a rule or a factor membership that exclusively and exhaustively assigns each observed variable to a certain common factor (Browne, 2001). In practical applications, a combined approach is often adopted, starting with EFA to investigate the underlying dependence pattern, followed by CFA for model verification and justification (Basilevsky, 2009; Brown, 2015; Gana & Broc, 2019).

The distinct model specifications of EFA and CFA confer unique strengths and limitations on each approach. EFA exhibits greater flexibility by not necessitating specified “non-zero loadings” and can be adapted to accommodate high-dimensional observations, reaching thousands of observed variables or more (Friguet et al., 2009; Bai & Li, 2012; Fan et al., 2013; Fan & Han, 2017). However, the factor loadings in classical EFA models are typically non-zero, resulting in less interpretable relationships between common factors and observed variables. Moreover, EFA models are limited in their ability to explicitly estimate the covariance matrix of common factors for high-dimensional data. In contrast, CFA naturally specifies a sparse factor loading matrix guided by prior knowledge, enhancing interpretability, as exemplified in Carlson & Mulaik (1993), and allows arbitrary covariances between common factors, as demonstrated in Lawley (1958) and Jackson et al. (2009). As previously mentioned, classical CFA models encounter two primary limitations: (1) prior knowledge of the “non-zero loadings” in the factor loading matrix, or of the factor membership, is typically lacking, so no rule is available that exclusively and exhaustively assigns each observed variable to a certain common factor; and (2) the computational burden of estimating a CFA model in high-dimensional scenarios becomes intractable because the existing standard computational packages struggle to handle datasets containing hundreds or thousands of observed variables (Fox, 2006; Rosseel, 2012; Oberski, 2014).

In the current research, we concentrate on the CFA approach while attempting to address the two aforementioned limitations. We propose a semi-confirmatory factor analysis (SCFA) model that addresses the specification of “non-zero loadings” through the covariance structure learned from high-dimensional data, and significantly alleviates the computational burden with theoretically guaranteed solutions in closed form. Specifically, to overcome the first limitation, we incorporate a prevalent covariance structure, namely, the interconnected community structure, into the conventional CFA model. We focus on the interconnected community structure because it is widely prevalent in the covariance matrices of various high-dimensional datasets, as illustrated in Figure 1. Notably, interconnected community structures, which allow features in different communities to be correlated, encompass various well-known patterns, including all independent community structures (Newman & Girvan, 2004; Fortunato, 2010) and most hierarchical community structures (Li et al., 2022; Schaub et al., 2023). Therefore, they provide a versatile covariance structure to model various practical applications, including brain imaging, gene expression, multi-omics, metabolomics, and more (Girvan & Newman, 2002; Colizza et al., 2006; Simpson et al., 2013; Levine et al., 2015; Huttlin et al., 2017; Zitnik et al., 2018; Perrot-Dockès et al., 2022). However, we acknowledge that the proposed structures can be limited in representing certain covariance patterns (e.g., Toeplitz). In such cases, traditional factor models remain suitable. Interconnected community structures are latent in many studies, but they can be accurately and robustly estimated and extracted by recently developed network structure detection approaches (Wang et al., 2020; Li et al., 2022; Yang et al., 2024).
Consequently, the detected community membership can serve as a guide to specifying the previously unknown “non-zero loadings” for CFA, effectively addressing the first limitation. The SCFA model also alleviates the computational burden by deriving closed-form solutions for all CFA model parameters, including the factor loadings, factors, covariance matrix between common factors, and covariance matrix for error terms. The closed-form estimators not only improve estimation accuracy and stability but also drastically reduce computational load and ensure scalability (e.g., handling thousands of observed variables).

Figure 1: Examples demonstrating that interconnected community structures are frequently observed across various real applications: A: a brain imaging study (Chiappelli et al., 2019); B: a gene expression study (Tomczak et al., 2015); C: a multi-omics study (Perrot-Dockès et al., 2022); D: a plasma metabolomics study (Ritchie et al., 2023); E and F: an environmental study involving exposome and plasma metabolomics (ISGlobal, 2021).

SCFA presents several methodological contributions. Firstly, SCFA alleviates the requirement of pre-specified “non-zero loadings” in CFA models. The acquired interconnected community structure assigns observed variables to common factors, specifying “non-zero loadings” in an adaptive manner. Secondly, SCFA provides a computationally efficient approach for conducting high-dimensional CFA (e.g., with thousands or more observed variables). All estimators are obtained by the likelihood approach and in closed form, substantially mitigating the computational burden. Thirdly, SCFA yields more accurate and reliable estimates, since all matrix estimators are uniformly minimum-variance unbiased estimators (UMVUEs). The factor scores can also be conveniently estimated using the feasible generalized least-squares (FGLS) method. We further show that the FGLS estimators coincide with those obtained through the ordinary least-squares (OLS) and generalized least-squares (GLS) methods. Lastly, we derive explicit variance estimators that facilitate statistical inference concerning model parameters and factor scores in an SCFA model.

The remainder of the paper is structured as follows. In Section 2, we introduce the SCFA model, detailing its specifications for the factor loading matrix and the covariance matrix of observations. We subsequently present the estimation and inference procedures for all unknown matrices and factor scores. Section 3 and Section 4 are dedicated to evaluating the proposed model and approach. Section 3 assesses the performance of our model through simulated data, while Section 4 demonstrates the application to a genomics dataset without prior knowledge of “non-zero loadings”. All proofs and additional tables are included in the Supplementary Material.

2 METHOD

2.1 Background

Confirmatory factor analysis model. Let $X_{p\times 1}$ represent a $p$-dimensional vector of observations, $f_{K\times 1}$ denote the $K$-dimensional vector of common factors, with $1<K<p$, and $u_{p\times 1}$ denote the $p$-dimensional vector of error terms. A factor model can be expressed as

\[
X_{p\times 1}=\mu_{p\times 1}+L_{p\times K}f_{K\times 1}+u_{p\times 1},\quad \operatorname{E}(f)=0_{K\times 1},\quad \operatorname{E}(u)=0_{p\times 1},\quad \operatorname{cov}\left(f,u\right)=0_{K\times p}, \tag{1}
\]

where $0_{p\times q}$ represents the $p$ by $q$ zero matrix; without loss of generality, the $p$-dimensional mean vector is set to $\mu=0_{p\times 1}$; and $L$ represents the $p$ by $K$ factor loading matrix. Furthermore, let $\Sigma=\operatorname{cov}(X)$, $\Sigma_{f}=\operatorname{cov}(f)$, and $\Sigma_{u}=\operatorname{cov}(u)$ denote the $p$ by $p$, $K$ by $K$, and $p$ by $p$ covariance matrices, respectively, where $\Sigma_{u}$ is assumed to be diagonal (Fan et al., 2021).

When performing CFA, we may introduce zero entries at specified positions in the factor loading matrix $L$ (Jöreskog, 1969) and assume the common factors to be oblique, i.e., the covariance matrix of common factors $\Sigma_{f}$ can be an arbitrary positive-definite matrix. Following the model in (1) and these two assumptions on $L$ and $\Sigma_{f}$, we can derive the following relationship between covariance matrices in the CFA model:

\[
\Sigma=L\Sigma_{f}L^{\mathrm{\scriptscriptstyle T}}+\Sigma_{u}, \tag{2}
\]

where ${\mathrm{\scriptscriptstyle T}}$ denotes the transpose; $L=\operatorname{Bdiag}(\ell_{1},\ldots,\ell_{K})$ is a $p$ by $K$ block-diagonal matrix with $p_{k}$-dimensional “non-zero loadings” $\ell_{k}$ for $k=1,\ldots,K$; $\Sigma_{f}=\left(\sigma_{f,kk^{\prime}}\right)$ is a $K$ by $K$ symmetric positive-definite matrix; and $\Sigma_{u}=\left(\Sigma_{u,kk^{\prime}}\right)$ is a $p$ by $p$ diagonal positive-definite matrix whose submatrix $\Sigma_{u,kk^{\prime}}$ is a $p_{k}$ by $p_{k^{\prime}}$ diagonal matrix if $k^{\prime}=k$ or a zero matrix if $k^{\prime}\neq k$. In particular, $p_{k}$ represents the number of observed variables within the $k$th factor, for $k=1,\ldots,K$, with each $p_{k}>1$ and $p=p_{1}+\cdots+p_{K}$.

We remark that the non-overlapping factor loading pattern in (2) permits only a single non-zero entry within each row. This configuration aligns with the preferences of the majority of existing complexity criteria (Browne, 2001).
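As a small illustration of this non-overlapping pattern, consider a hypothetical case with $K=2$ factors and community sizes $p_{1}=2$ and $p_{2}=3$ (these sizes are chosen here only for concreteness and are not taken from any dataset in the paper). Writing $\ell_{j,k}$ for the loading of the $j$th observed variable on the $k$th factor, the loading matrix would take the form

\[
L_{5\times 2}=\operatorname{Bdiag}(\ell_{1},\ell_{2})=
\begin{pmatrix}
\ell_{1,1} & 0\\
\ell_{2,1} & 0\\
0 & \ell_{3,2}\\
0 & \ell_{4,2}\\
0 & \ell_{5,2}
\end{pmatrix},
\qquad
\ell_{1}=\begin{pmatrix}\ell_{1,1}\\ \ell_{2,1}\end{pmatrix},\quad
\ell_{2}=\begin{pmatrix}\ell_{3,2}\\ \ell_{4,2}\\ \ell_{5,2}\end{pmatrix},
\]

so that each row contains exactly one non-zero entry and each observed variable is assigned to exactly one common factor.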

Covariance matrix $\Sigma$ in a block form. As defined in (2), the covariance matrix of observed variables $\Sigma$ is structured as a block matrix. Specifically, the block structure of $\Sigma=\left(\Sigma_{kk^{\prime}}\right)$ is determined by the “non-zero loadings” $\ell_{k}$ in $L$ for $k=1,\ldots,K$:

\[
\Sigma=\left(\Sigma_{kk^{\prime}}\right)=\begin{pmatrix}
\Sigma_{11} & \Sigma_{12} & \dots & \Sigma_{1K}\\
\Sigma_{21} & \Sigma_{22} & \dots & \Sigma_{2K}\\
\vdots & \vdots & \ddots & \vdots\\
\Sigma_{K1} & \Sigma_{K2} & \dots & \Sigma_{KK}
\end{pmatrix},\quad
\Sigma_{kk^{\prime}}=\sigma_{f,kk^{\prime}}\ell_{k}\ell_{k^{\prime}}^{\mathrm{\scriptscriptstyle T}}+\Sigma_{u,kk^{\prime}}\in\mathbb{R}^{p_{k}\times p_{k^{\prime}}},
\]

for all $k$ and $k^{\prime}$, where $\Sigma_{k^{\prime}k}=\Sigma_{kk^{\prime}}^{\mathrm{\scriptscriptstyle T}}$ for $k^{\prime}\neq k$ and $\Sigma_{u,kk^{\prime}}$ is a diagonal matrix.
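The block identity above can be checked numerically. The following is a minimal sketch in plain Python, assuming $K=2$ factors with $p_{1}=2$ and $p_{2}=3$; all numeric values and helper names (`matmul`, `transpose`, `block_diag_loadings`) are hypothetical choices made for illustration and are not part of the proposed method.

```python
# Illustrative sketch only: K = 2 factors, p_1 = 2 and p_2 = 3 observed variables.
# All numeric values are made-up assumptions, not estimates from any dataset.

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def block_diag_loadings(loadings):
    """Stack loading vectors l_1, ..., l_K into the p x K block-diagonal matrix L."""
    K = len(loadings)
    L = []
    for k, ell in enumerate(loadings):
        for value in ell:
            row = [0.0] * K
            row[k] = value  # single non-zero entry per row
            L.append(row)
    return L


loadings = [[1.0, 0.8], [1.0, 0.5, 1.2]]   # l_1 and l_2 (hypothetical values)
sigma_f = [[1.0, 0.3], [0.3, 2.0]]         # cov(f): oblique (correlated) factors
sigma_u_diag = [0.5, 0.4, 0.3, 0.6, 0.2]   # diagonal of cov(u)

L = block_diag_loadings(loadings)
p = len(L)

# Sigma = L Sigma_f L^T + Sigma_u
low_rank = matmul(matmul(L, sigma_f), transpose(L))
sigma = [
    [low_rank[i][j] + (sigma_u_diag[i] if i == j else 0.0) for j in range(p)]
    for i in range(p)
]

# The off-diagonal block Sigma_12 should equal sigma_f[0][1] * l_1 l_2^T,
# because Sigma_u contributes nothing outside the diagonal blocks.
block_12 = [row[2:] for row in sigma[:2]]
expected_12 = [[sigma_f[0][1] * a * b for b in loadings[1]] for a in loadings[0]]
```

In this sketch the between-community block depends only on the factor covariance and the two loading vectors, while the error variances enter the diagonal blocks alone.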

In practical applications, it is often the case that information about both the components and sizes of each vector of “non-zero loadings” $\ell_{k}$ is unavailable, particularly in domains such as omics and imaging data. Nevertheless, covariance matrices in these applications often exhibit block patterns, although these block patterns may be latent and require pattern extraction algorithms. Recent advancements in statistics have provided convenient tools to accurately identify latent block structures in covariance matrices and precisely estimate covariance parameters (Lei & Rinaldo, 2015; Wu et al., 2021; Li et al., 2022). Inspired by the above block structure of $\Sigma=\left(\Sigma_{kk^{\prime}}\right)$ in the CFA model, we are motivated to extract knowledge regarding all “non-zero loadings” from structured covariance matrix estimation for high-dimensional data, thereby facilitating the estimation of the CFA model, as elaborated in the following sections.

Interconnected community structure. The block-structured covariance matrix in the CFA model is naturally linked with the community-based network structure (Wu et al., 2021). In the current research, we focus on the interconnected community structure, which is more general and prevalent in real-world applications. As demonstrated in Figure 1, the interconnected community covariance structure is present in high-throughput datasets such as genetics, imaging, gene-expression, DNA-methylation, and metabolomics data, among many others (Yang et al., 2024). Therefore, we build the proposed model based on the interconnected community covariance structure for factor analysis on these datasets.

We characterize the interconnected community covariance structure as follows. In this structure, all features can be categorized into multiple mutually exclusive and exhaustive communities. In other words, there implicitly exists a community-membership function $\varphi$ that operates as a bijection. In contrast to the classical community network structure, the interconnected community structure exhibits correlations among features within and between communities (i.e., interactive communities), as demonstrated in Figure 1. Thus, the (population) covariance matrix with an interconnected community structure has a block structure: diagonal blocks represent the intra-community correlations while the off-diagonal blocks characterize the inter-community relationships. Due to the high resemblance of entries within each block, a parametric covariance model is often used by assigning two (or one) parameters for each diagonal (or off-diagonal) block (Yang et al., 2024).

A two-step procedure is commonly employed for estimating large covariance matrices in these datasets: firstly, extracting the latent structure from the sample covariance matrix, and subsequently estimating the covariance parameters under the learned covariance structure (Chen et al., 2018; Wu et al., 2021; Chen et al., 2023). Given the reliable and replicable performance of existing algorithms in extracting latent network structures (Chen et al., 2018; Wang et al., 2020; Li et al., 2022) (see almost identical results of detected interconnected structures by different algorithms in the Supplementary Material), it is plausible to consider the estimated interconnected community structure of the covariance matrix as “known” prior knowledge for SCFA models.
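The second step of such a procedure can be sketched as follows, assuming the community labels from the first step are already available. This is a simple block-averaging illustration written for this exposition (the function name `block_average_parameters`, the toy matrix `S`, and the labels are all hypothetical); it is not necessarily the estimator used in the cited works.

```python
# Illustrative sketch: given community labels (assumed already detected in step one),
# summarize each block of a covariance matrix by averaging its entries.

def block_average_parameters(S, labels, K):
    """Average the entries of S within each community block.

    level[k][m] is the mean of the non-diagonal entries of block (k, m);
    diag_extra[k] is the mean diagonal entry of block (k, k) minus level[k][k].
    Every community is assumed to contain more than one variable.
    """
    p = len(S)
    sums = [[0.0] * K for _ in range(K)]
    counts = [[0] * K for _ in range(K)]
    diag_sums = [0.0] * K
    diag_counts = [0] * K
    for i in range(p):
        for j in range(p):
            k, m = labels[i], labels[j]
            if i == j:
                diag_sums[k] += S[i][i]
                diag_counts[k] += 1
            else:
                sums[k][m] += S[i][j]
                counts[k][m] += 1
    level = [[sums[k][m] / counts[k][m] for m in range(K)] for k in range(K)]
    diag_extra = [diag_sums[k] / diag_counts[k] - level[k][k] for k in range(K)]
    return diag_extra, level


# Toy symmetric "sample covariance" with two communities of sizes 2 and 3.
S = [
    [1.6, 1.0, 0.4, 0.6, 0.5],
    [1.0, 1.4, 0.5, 0.5, 0.5],
    [0.4, 0.5, 2.8, 2.1, 1.9],
    [0.6, 0.5, 2.1, 3.2, 2.0],
    [0.5, 0.5, 1.9, 2.0, 3.0],
]
labels = [0, 0, 1, 1, 1]  # community membership of each variable

diag_extra, level = block_average_parameters(S, labels, K=2)
```

Each diagonal block is thus summarized by two numbers (`diag_extra[k]` and `level[k][k]`) and each off-diagonal block by one (`level[k][m]`), matching the two-or-one parameter count described above.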

Figure 2: The overview of SCFA from the perspective of a covariance matrix. The left subfigure illustrates the population covariance matrix $\Sigma$ with an explicit interconnected community structure. The parameters of the SCFA model, $L$, $\Sigma_{f}$, and $\Sigma_{u}$, are linked with $\Sigma$ as demonstrated by the right-hand side of the equation. The community membership in the interconnected community structure can specify the “non-zero loadings” in $L$, while $\Sigma_{f}$ and $\Sigma_{u}$ are determined by the parametric covariance structure in $\Sigma$.

2.2 Semi-parametric confirmatory analysis model

We propose a semi-confirmatory factor analysis (SCFA) model for multivariate $X$ with a covariance matrix $\Sigma$ having an interconnected community structure. In this section, we will show that $L$, $\Sigma_{f}$, and $\Sigma_{u}$ in the SCFA model can be represented in closed form by the parameters of the parametric $\Sigma$ having an interconnected community structure, as demonstrated in Figure 2. This reparameterization facilitates accurate and computationally efficient estimation of $L$, $\Sigma_{f}$, and $\Sigma_{u}$. Specifically, following (1) and (2), we have:

\[
X=Lf+u,\quad \Sigma=L\Sigma_{f}L^{\mathrm{\scriptscriptstyle T}}+\Sigma_{u},\text{ where } \tag{3}
\]
\[
\begin{aligned}
L_{p\times K}&=\operatorname{Bdiag}\left(\ell_{1},\ldots,\ell_{K}\right)\text{ with }\ell_{k}=\begin{pmatrix}\ell_{\bar{p}_{k-1}+1,k}\\ \vdots\\ \ell_{\bar{p}_{k},k}\end{pmatrix}_{p_{k}\times 1},\quad \ell_{\bar{p}_{k-1}+1,k}=\tau_{k}\neq 0,\\
\Sigma_{p\times p}&=\left(\Sigma_{kk^{\prime}}\right)=\begin{pmatrix}
a_{11}I_{p_{1}}+b_{11}J_{p_{1}} & b_{12}1_{p_{1}\times p_{2}} & \cdots & b_{1K}1_{p_{1}\times p_{K}}\\
b_{21}1_{p_{2}\times p_{1}} & a_{22}I_{p_{2}}+b_{22}J_{p_{2}} & \cdots & b_{2K}1_{p_{2}\times p_{K}}\\
\vdots & \vdots & \ddots & \vdots\\
b_{K1}1_{p_{K}\times p_{1}} & b_{K2}1_{p_{K}\times p_{2}} & \cdots & a_{KK}I_{p_{K}}+b_{KK}J_{p_{K}}
\end{pmatrix},\quad b_{kk^{\prime}}=b_{k^{\prime}k},
\end{aligned}
\tag{4}
\]

where $k,k^{\prime}=1,\ldots,K$ and $K$, the number of common factors, is set to be the number of interconnected communities. $p_{1},\ldots,p_{K}$ are set to be the interconnected community sizes satisfying that $p_{k}>1$ for every $k$ and $p=p_{1}+\cdots+p_{K}$, and $\bar{p}_{k}=\sum_{k^{\prime}=1}^{k}p_{k^{\prime}}$ is the cumulative sum of the first $k$ community sizes (we define $\bar{p}_{0}=0$).
Without loss of generality, the first entry in $\ell_{k}$ is assumed to be a known constant $\tau_{k}\neq 0$, i.e., $\ell_{\bar{p}_{k-1}+1,k}=\tau_{k}$ is known, while the other entries in $\ell_{k}$ are unknown for every $k$. $I_{p}$ denotes the $p$ by $p$ identity matrix, and $J_{p}$ and $1_{p\times q}$ denote the $p$ by $p$ and $p$ by $q$ all-one matrices, respectively. $a_{kk}$ and $b_{kk^{\prime}}$ are (unknown) entries of the parametric population covariance matrix.

The parametric covariance matrix $\Sigma$ defined in (\arabicequation) is derived from covariance patterns inherent in interconnected community structures (Yang et al., 2024). When viewed through the lens of network analysis, $\Sigma$ takes a block form, where each diagonal block represents a uniform correlation relationship between features within a single community, and each off-diagonal block represents a uniform correlation relationship between features from two distinct communities.
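To make the block pattern concrete, the following minimal Python sketch assembles $\Sigma$ from the community sizes and the parameters $a_{kk}$ and $b_{kk^{\prime}}$. The function name `build_sigma` and the numerical values are illustrative assumptions and are not taken from the SCFA software.

```python
def build_sigma(sizes, a, b):
    """Assemble the p x p block covariance matrix as a list of rows.

    Diagonal block k is a[k] * I + b[k][k] * J; off-diagonal block (k, k')
    is b[k][k'] times an all-one matrix. Variables are assumed to be already
    ordered so that members of the same community are adjacent.
    """
    # Community label (0-based) of each variable.
    labels = [k for k, p_k in enumerate(sizes) for _ in range(p_k)]
    p = len(labels)
    sigma = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            k, k_prime = labels[i], labels[j]
            # Every entry carries b; only the main diagonal adds a.
            sigma[i][j] = b[k][k_prime] + (a[k] if i == j else 0.0)
    return sigma


# Hypothetical example: K = 2 communities of sizes 3 and 2.
sizes = [3, 2]
a = [1.0, 0.5]
b = [[0.5, 0.25],
     [0.25, 0.75]]
sigma = build_sigma(sizes, a, b)
for row in sigma:
    print(row)
```

Within a community every pair of distinct features shares the same covariance $b_{kk}$, and every pair of features from communities $k$ and $k^{\prime}$ shares $b_{kk^{\prime}}$, which is the uniform pattern described above.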

The SCFA model is distinct from the conventional CFA model because it can leverage the covariance matrix with an interconnected community structure to derive closed-form solutions for all parameters in the factor loading matrix $L$ and in the covariance matrices $\Sigma_{f}$ and $\Sigma_{u}$.

Specifying all “non-zero loadings” in $L$. In the SCFA model, we specify the “non-zero loadings” in $L$ based on the interconnected community structure. For an interconnected community structure of $K$ communities with corresponding community sizes $p_{1},\ldots,p_{K}$ and loading vectors $\ell_{1},\ldots,\ell_{K}$, the factor loading matrix is

\[
L=\begin{pmatrix}
\ell_{\bar{p}_{0}+1,1}&&&\\
\ell_{\bar{p}_{0}+2,1}&&&\\
\vdots&&&\\
\ell_{\bar{p}_{1},1}&&&\\
&\ell_{\bar{p}_{1}+1,2}&&\\
&\ell_{\bar{p}_{1}+2,2}&&\\
&\vdots&&\\
&\ell_{\bar{p}_{2},2}&&\\
&&\ddots&\\
&&&\ell_{\bar{p}_{K-1}+1,K}\\
&&&\ell_{\bar{p}_{K-1}+2,K}\\
&&&\vdots\\
&&&\ell_{\bar{p}_{K},K}\\
\end{pmatrix}_{p\times K}.
\]

We utilize the community-membership function $\varphi$ based on the interconnected community structure to partition $p$ observed variables into $K$ mutually exclusive and exhaustive common factors, i.e., $\varphi:\{1,2,\ldots,p\}\to\{1,2,\ldots,K\},\ j\mapsto\varphi(j)$. Without loss of generality, the $j$th observed variable is mapped to the $k$th common factor, which includes $p_{k}$ distinct observed variables $\{j:\varphi(j)=k\}=\{\varphi^{-1}(k)[1],\ldots,\varphi^{-1}(k)[p_{k}]\}$ and satisfies that $p=p_{1}+\cdots+p_{K}$. Next, we reorder $p$ observed variables by listing them from the same common factor as neighbors.
For the input vector of observed variables $X=(X^{(1)},\ldots,X^{(p)})^{\mathrm{\scriptscriptstyle T}}$ in (\arabicequation), we reorder the elements and obtain $X^{(\varphi)}=(X^{\left(\varphi^{-1}(1)[1]\right)},\ldots,X^{\left(\varphi^{-1}(1)[p_{1}]\right)},\ldots,X^{\left(\varphi^{-1}(K)[1]\right)},\ldots,X^{\left(\varphi^{-1}(K)[p_{K}]\right)})^{\mathrm{\scriptscriptstyle T}}$ using the function $\varphi$.
Consequently, $\varphi$ dictates that the first $p_{1}$ elements of $X^{(\varphi)}$ are categorized into the first common factor, and so forth, with the last $p_{K}$ elements of $X^{(\varphi)}$ being categorized into the $K$th common factor. For simplicity, we denote $X^{(\varphi)}=X$ throughout the remainder of this paper. Moreover, the adoption of a non-overlapping factor loading pattern in (\arabicequation) and (\arabicequation) stems from the fact that $\varphi$ is a single-valued surjective map: each observed variable is assigned to exactly one common factor, and every common factor receives at least one observed variable.
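The reordering step can be sketched in a few lines of Python. The helper name `reorder_by_membership` and the membership vector below are hypothetical; the code uses 0-based variable indices, whereas the text uses 1-based indices.

```python
from itertools import accumulate


def reorder_by_membership(phi):
    """Group variables by community label.

    phi[j] is the community label (1, ..., K) of variable j (0-based j).
    Returns the permutation that lists members of the same community as
    neighbors, together with the community sizes p_1, ..., p_K.
    """
    K = max(phi)
    # sorted() is stable, so the original order is kept within a community.
    order = sorted(range(len(phi)), key=lambda j: phi[j])
    sizes = [phi.count(k) for k in range(1, K + 1)]
    return order, sizes


# Hypothetical membership for p = 7 variables and K = 3 communities.
phi = [2, 1, 2, 3, 1, 3, 3]
order, sizes = reorder_by_membership(phi)

# Cumulative sizes bar{p}_0, ..., bar{p}_K mark the block boundaries.
cum_sizes = [0] + list(accumulate(sizes))

print(order)      # variables of community 1 first, then 2, then 3
print(sizes)
print(cum_sizes)
```

Applying `order` to the columns of the data matrix (and to the rows and columns of the sample covariance matrix) yields the reordered vector $X^{(\varphi)}$ used in the remainder of the section.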

With the exception of the components $\ell_{\bar{p}_{k}+j,k+1}$ for $j=2,\ldots,p_{k+1}$ and $k=0,\ldots,K-1$ (which will be determined in the subsequent corollary), all information about the factor loading matrix has been completely determined by the community membership.

Reparameterizing $\Sigma_{f}$ and $\Sigma_{u}$ by $\Sigma$. With a covariance matrix $\Sigma$ characterized by an interconnected community structure and corresponding “non-zero loadings” specified in $L$, the SCFA model in (\arabicequation) can represent model parameters $\Sigma_{f}$ and $\Sigma_{u}$ using the covariance parameters in $\Sigma$.

Corollary \arabicsection.\arabictheorem.

Consider the blocks $\Sigma_{kk^{\prime}}=\sigma_{f,kk^{\prime}}\ell_{k}\ell_{k^{\prime}}^{\mathrm{\scriptscriptstyle T}}+\Sigma_{u,kk^{\prime}}$ of a classical CFA model as shown in (\arabicequation). All parameters in $\Sigma_{kk^{\prime}}$ can be determined by the parametric covariance matrix in (\arabicequation) in terms of $a_{kk}$ and $b_{kk^{\prime}}$ through:

\[
\Sigma_{kk^{\prime}}=\sigma_{f,kk^{\prime}}\ell_{k}\ell_{k^{\prime}}^{\mathrm{\scriptscriptstyle T}}+\Sigma_{u,kk^{\prime}}=\begin{cases}a_{kk}I_{p_{k}}+b_{kk}J_{p_{k}},&k^{\prime}=k\\ b_{kk^{\prime}}1_{p_{k}\times p_{k^{\prime}}},&k^{\prime}\neq k\end{cases},\quad b_{kk^{\prime}}=b_{k^{\prime}k},
\]

with $p_{k}>2$ for every $k$. Then, we have the following equations:

(1) $\ell_{k}=\left(\tau_{k},\ldots,\tau_{k}\right)^{\mathrm{\scriptscriptstyle T}}=\tau_{k}1_{p_{k}\times 1}$ for all $k$, so $L=\operatorname{Bdiag}\left(\tau_{1}1_{p_{1}\times 1},\ldots,\tau_{K}1_{p_{K}\times 1}\right)$;

(2) $\sigma_{f,kk^{\prime}}=b_{kk^{\prime}}/(\tau_{k}\tau_{k^{\prime}})$ for all $k,k^{\prime}$, so $\Sigma_{f}=\left\{b_{kk^{\prime}}/\left(\tau_{k}\tau_{k^{\prime}}\right)\right\}$ with $b_{kk^{\prime}}=b_{k^{\prime}k}$;

(3) $\Sigma_{u,kk}=a_{kk}I_{p_{k}}$ for all $k$, so $\Sigma_{u}=\operatorname{Bdiag}\left(a_{11}I_{p_{1}},\ldots,a_{KK}I_{p_{K}}\right)$, where we assume that the symmetric matrix $\left(b_{kk^{\prime}}\right)$ is positive definite and $a_{kk}>0$ for all $k$.
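One way to see why $p_{k}>2$ forces equal loadings within a community is the following sketch. It assumes that $\Sigma_{u}$ is diagonal (as is standard in CFA) and uses $b_{kk}>0$, which follows from the positive definiteness of $\left(b_{kk^{\prime}}\right)$. It is offered as an illustration and is not necessarily the argument given in the Supplementary Material. For brevity, write $\ell_{k}=(\ell_{1},\ldots,\ell_{p_{k}})^{\mathrm{\scriptscriptstyle T}}$ with local indices inside community $k$, so that $\ell_{1}=\tau_{k}$.

\begin{align*}
&\sigma_{f,kk}\,\ell_{i}\ell_{j}=b_{kk},\quad i\neq j &&\text{(off-diagonal entries of }\Sigma_{kk}\text{)}\\
&\ell_{j}=\frac{b_{kk}}{\sigma_{f,kk}\tau_{k}}=:c\neq 0,\quad j\geq 2 &&\text{(take }i=1\text{)}\\
&\sigma_{f,kk}\,c^{2}=b_{kk}=\sigma_{f,kk}\,\tau_{k}\,c\ \Longrightarrow\ c=\tau_{k} &&\text{(take }i,j\geq 2,\ i\neq j\text{, possible since }p_{k}>2\text{)}\\
&\sigma_{f,kk}=\frac{b_{kk}}{\tau_{k}^{2}},\qquad \left(\Sigma_{u,kk}\right)_{ii}=(a_{kk}+b_{kk})-\sigma_{f,kk}\tau_{k}^{2}=a_{kk} &&\text{(diagonal entries of }\Sigma_{kk}\text{)}\\
&\sigma_{f,kk^{\prime}}\,\tau_{k}\tau_{k^{\prime}}=b_{kk^{\prime}},\quad k\neq k^{\prime} &&\text{(entries of the off-diagonal blocks)}
\end{align*}

With only $p_{k}=2$ variables there is a single off-diagonal entry in $\Sigma_{kk}$, so the third line is unavailable and the second loading is not pinned down by this argument.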

The conditions $a_{kk}>0$ for all $k$ and the positive definiteness of $\left(b_{kk^{\prime}}\right)$ are required to ensure that $\Sigma$ is positive definite (see Supplementary Material). Corollary \arabicsection.\arabictheorem demonstrates that the SCFA model can (1) directly specify the “non-zero loadings” in $L$ based on the function $\varphi$ and derive that all loadings belonging to factor $k$ satisfy $\ell_{\bar{p}_{k-1}+1,k}=\cdots=\ell_{\bar{p}_{k},k}=\tau_{k}$; and (2) express all entries of the matrices $\Sigma_{f}$ and $\Sigma_{u}$ in terms of the parameters of the parametric covariance matrix $\Sigma$ (i.e., $a_{kk}$ and $b_{kk^{\prime}}$). This reparametrization facilitates improved parameter estimation accuracy and computational efficiency of the SCFA model.
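The reparametrization can be checked numerically on a toy example. The sketch below builds $L$, $\Sigma_{f}$, and $\Sigma_{u}$ from assumed values of $a_{kk}$, $b_{kk^{\prime}}$, and $\tau_{k}$, and forms the implied covariance $L\Sigma_{f}L^{\mathrm{\scriptscriptstyle T}}+\Sigma_{u}$. All function names and numbers are hypothetical and chosen only for illustration.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def scfa_pieces(sizes, a, b, tau):
    """Return (L, Sigma_f, Sigma_u) following items (1)-(3) of the corollary."""
    K = len(sizes)
    labels = [k for k, p_k in enumerate(sizes) for _ in range(p_k)]
    p = len(labels)
    # (1) L = Bdiag(tau_1 * 1, ..., tau_K * 1)
    L = [[tau[k] if labels[i] == k else 0.0 for k in range(K)] for i in range(p)]
    # (2) Sigma_f has entries b_kk' / (tau_k * tau_k')
    sigma_f = [[b[k][kp] / (tau[k] * tau[kp]) for kp in range(K)] for k in range(K)]
    # (3) Sigma_u = Bdiag(a_11 * I, ..., a_KK * I)
    sigma_u = [[a[labels[i]] if i == j else 0.0 for j in range(p)] for i in range(p)]
    return L, sigma_f, sigma_u


# Hypothetical parameters: K = 2 communities of sizes 3 and 2.
sizes = [3, 2]
a = [1.0, 0.5]
b = [[0.5, 0.25], [0.25, 0.75]]
tau = [2.0, 0.5]

L, sigma_f, sigma_u = scfa_pieces(sizes, a, b, tau)
low_rank = matmul(matmul(L, sigma_f), transpose(L))
p = sum(sizes)
implied = [[low_rank[i][j] + sigma_u[i][j] for j in range(p)] for i in range(p)]
for row in implied:
    print(row)
```

The implied matrix has diagonal blocks $a_{kk}I_{p_{k}}+b_{kk}J_{p_{k}}$ and off-diagonal blocks $b_{kk^{\prime}}1_{p_{k}\times p_{k^{\prime}}}$ regardless of the chosen non-zero $\tau_{k}$, since the scaling by $\tau_{k}$ in $L$ is absorbed by $\Sigma_{f}$.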

\arabicsection.\arabicsubsection Estimation and inference

In this section, we introduce a likelihood-based parameter estimation procedure for SCFA applied to sample data $\textrm{X}_{n\times p}$ with an unknown covariance matrix having an interconnected community structure. We derive closed-form estimators for SCFA parameters $L$, $\Sigma_{f}$ and $\Sigma_{u}$, and calculate closed-form factor scores using the least-squares approach. Lastly, we establish the theoretical properties of the proposed estimators and delineate the inference procedure.

Suppose that the rows $X_{1},\ldots,X_{n}$ of data matrix $\textrm{X}_{n\times p}$ are independently and identically distributed as $N\left(0_{p\times 1},\Sigma\right)$, satisfying the proposed SCFA model in (\arabicequation). Let $f_{i}$ and $u_{i}$ be the $i$th factor score and error term, respectively. We define $S_{p\times p}=n^{-1}\textrm{X}^{\mathrm{\scriptscriptstyle T}}\textrm{X}$ as the unbiased sample covariance matrix. The typical estimation objectives encompass the $(p-K)$ non-zero elements of $L$, the $K(K+1)/2$ elements of symmetric $\Sigma_{f}$, and the $p$ diagonal elements of $\Sigma_{u}$ based on $S$.
We assume that the interconnected community structure and $\varphi$ are known for $\Sigma$, while parameters $a_{kk}$ and $b_{kk^{\prime}}$ need to be estimated for each $k$ and $k^{\prime}$. Without loss of generality, we let the high-dimensional framework in this paper follow that $n<p$ and $K+K(K+1)/2<n$.
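As a small illustration of this setup, the Python sketch below computes $S=n^{-1}\textrm{X}^{\mathrm{\scriptscriptstyle T}}\textrm{X}$ for zero-mean data and checks the two dimension conditions. The quantity $K+K(K+1)/2$ is read here as the number of distinct covariance parameters ($K$ values $a_{kk}$ plus $K(K+1)/2$ values $b_{kk^{\prime}}$). The function names and the toy numbers are assumptions made for this example.

```python
def sample_covariance(X):
    """Return S = X^T X / n for a zero-mean n x p data matrix (list of rows)."""
    n, p = len(X), len(X[0])
    return [[sum(row[r] * row[c] for row in X) / n for c in range(p)]
            for r in range(p)]


def num_covariance_parameters(K):
    """Count the distinct a_kk (K of them) and b_kk' (K(K+1)/2 of them)."""
    return K + K * (K + 1) // 2


def in_high_dimensional_setting(n, p, K):
    """Check n < p and K + K(K+1)/2 < n."""
    return n < p and num_covariance_parameters(K) < n


# Toy data used only to show the formula for S (n = 2, p = 2).
X = [[1.0, 2.0],
     [3.0, 4.0]]
S = sample_covariance(X)
print(S)

# Hypothetical study dimensions.
print(num_covariance_parameters(5))
print(in_high_dimensional_setting(n=50, p=200, K=5))
```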

We maximize the likelihood function with respect to $L=\operatorname{Bdiag}\left(\ell_{1},\ldots,\ell_{K}\right)$, $\Sigma_{f}=\left\{b_{kk^{\prime}}/(\tau_{k}\tau_{k^{\prime}})\right\}$, and $\Sigma_{u}=\operatorname{Bdiag}\left(a_{11}I_{p_{1}},\ldots,a_{KK}I_{p_{K}}\right)$:

\[
\begin{split}
\left(\widehat{L},\widehat{\Sigma}_{f},\widehat{\Sigma}_{u}\right)
&=\operatorname*{argmax}\limits_{L,\Sigma_{f},\Sigma_{u}}\mathcal{L}_{n}\left(L,\Sigma_{f},\Sigma_{u};S,\tau_{1},\ldots,\tau_{K}\right)\\
&=\operatorname*{argmax}\limits_{a_{11},\ldots,a_{KK},b_{11},b_{12},\ldots,b_{KK}}\mathcal{L}_{n}\left(a_{11},\ldots,a_{KK},b_{11},b_{12},\ldots,b_{KK},\widehat{L};S,\tau_{1},\ldots,\tau_{K}\right)\\
&=\operatorname*{argmax}\limits_{a_{11},\ldots,a_{KK},b_{11},b_{12},\ldots,b_{KK}}\bigg[-\frac{n}{2}\log\det\left(\Sigma\left(a_{11},\ldots,a_{KK},b_{11},b_{12},\ldots,b_{KK}\right)\right)\\
&\quad-\frac{n}{2}\operatorname{tr}\left\{S\Sigma^{-1}\left(a_{11},\ldots,a_{KK},b_{11},b_{12},\ldots,b_{KK}\right)\right\}\bigg],
\end{split}
\]

where $\mathcal{L}_{n}$ denotes the log-likelihood function of normal data $\textrm{X}$, $\det(\cdot)$ denotes the determinant, $\operatorname{tr}(\cdot)$ denotes the trace, and $\Sigma=\Sigma\left(a_{11},\ldots,a_{KK},b_{11},b_{12},\ldots,b_{KK}\right)$ is defined in (\arabicequation). By maximizing $\mathcal{L}_{n}$ and letting $\tau_{1}=\cdots=\tau_{K}=1$ throughout the rest of the paper, we obtain the following (unique) maximum likelihood estimators:

\[
\widehat{L}=\operatorname{Bdiag}\left(1_{p_{1}\times 1},\ldots,1_{p_{K}\times 1}\right),\quad
\widehat{\Sigma}_{u}=\operatorname{Bdiag}\left(\widehat{a}_{11}I_{p_{1}},\ldots,\widehat{a}_{KK}I_{p_{K}}\right),\quad
\widehat{\Sigma}_{f}=\left(\widehat{b}_{kk^{\prime}}\right),
\]

with $\widehat{b}_{kk^{\prime}}=\widehat{b}_{k^{\prime}k}$, where

\begin{equation}
\widehat{a}_{kk}=\frac{p_{k}\operatorname{tr}\left(S_{kk}\right)-\operatorname{sum}\left(S_{kk}\right)}{p_{k}(p_{k}-1)},\quad
\widehat{b}_{kk^{\prime}}=
\begin{cases}
\dfrac{\operatorname{sum}\left(S_{kk^{\prime}}\right)}{p_{k}p_{k^{\prime}}}, & k\neq k^{\prime}\\[1ex]
\dfrac{\operatorname{sum}\left(S_{kk}\right)-\operatorname{tr}\left(S_{kk}\right)}{p_{k}(p_{k}-1)}, & k=k^{\prime}
\end{cases},
\end{equation}

for all $k$ and $k^{\prime}$, where $S_{kk^{\prime}}$ denotes the $p_{k}$ by $p_{k^{\prime}}$ submatrix of $S=\left(S_{kk^{\prime}}\right)$ and $\operatorname{sum}(\cdot)$ denotes the sum of all elements of a matrix.
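The closed-form estimators above only need block sums and block traces of $S$. The following is a minimal sketch in plain Python, not the authors' implementation; the function names `block_bounds` and `scfa_estimates`, the list-of-lists representation of $S$, and the toy parameter values are assumptions made for illustration. The toy check assumes the model covariance $\Sigma=L\Sigma_{f}L^{\mathrm{\scriptscriptstyle T}}+\Sigma_{u}$ and feeds it in place of $S$, in which case the formulas should return the true parameters.

```python
def block_bounds(sizes):
    """Return (start, stop) index pairs for consecutive blocks of sizes p_1, ..., p_K."""
    bounds, start = [], 0
    for p_k in sizes:
        bounds.append((start, start + p_k))
        start += p_k
    return bounds


def scfa_estimates(S, sizes):
    """Closed-form a_hat[k] and b_hat[k][k'] from a p x p covariance matrix S.

    S is a list of lists; sizes = (p_1, ..., p_K) with every p_k >= 2.
    """
    bounds = block_bounds(sizes)
    K = len(sizes)
    a_hat = [0.0] * K
    b_hat = [[0.0] * K for _ in range(K)]
    for k, (r0, r1) in enumerate(bounds):
        for kp, (c0, c1) in enumerate(bounds):
            block_sum = sum(S[i][j] for i in range(r0, r1) for j in range(c0, c1))
            if k != kp:
                # Off-diagonal block: average of all entries of S_{kk'}.
                b_hat[k][kp] = block_sum / (sizes[k] * sizes[kp])
            else:
                p_k = sizes[k]
                trace = sum(S[i][i] for i in range(r0, r1))
                a_hat[k] = (p_k * trace - block_sum) / (p_k * (p_k - 1))
                b_hat[k][k] = (block_sum - trace) / (p_k * (p_k - 1))
    return a_hat, b_hat


# Toy check (illustrative values): build Sigma = L Sigma_f L^T + Sigma_u exactly.
sizes = (2, 3)
a_true = [0.1, 0.2]
b_true = [[2.0, 0.7], [0.7, 3.0]]
labels = [k for k, p_k in enumerate(sizes) for _ in range(p_k)]
p = sum(sizes)
Sigma = [[b_true[labels[i]][labels[j]] + (a_true[labels[i]] if i == j else 0.0)
          for j in range(p)] for i in range(p)]
a_hat, b_hat = scfa_estimates(Sigma, sizes)
```

Because each estimator is a ratio of simple sums, the cost is a single pass over the entries of $S$, with no matrix inversion or iterative optimization.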

The estimation procedure for the SCFA parameters scales to high-dimensional data (e.g., $p>10^{4}$) as long as $n>K+K(K+1)/2$. In contrast, the classic CFA model faces computational challenges when (1) $p>n$, or (2) $n>p$ but $p$ is several hundred, because the maximum likelihood approach requires computing a large covariance matrix and its inverse. SCFA therefore addresses a longstanding dimensionality limitation of CFA.

Furthermore, the proposed factor score estimator of $f_{i}$ is

\begin{equation}
\widehat{f}_{i}
=\left(\widehat{L}^{\mathrm{\scriptscriptstyle T}}\widehat{L}\right)^{-1}\widehat{L}^{\mathrm{\scriptscriptstyle T}}X_{i}
=\left(\widehat{L}^{\mathrm{\scriptscriptstyle T}}\Sigma_{u}^{-1}\widehat{L}\right)^{-1}\widehat{L}^{\mathrm{\scriptscriptstyle T}}\Sigma_{u}^{-1}X_{i}
=\left(\widehat{L}^{\mathrm{\scriptscriptstyle T}}\widehat{\Sigma}_{u}^{-1}\widehat{L}\right)^{-1}\widehat{L}^{\mathrm{\scriptscriptstyle T}}\widehat{\Sigma}_{u}^{-1}X_{i},
\end{equation}

for $i=1,\ldots,n$, where the ordinary least-squares (OLS) estimator is identical to the generalized least-squares (GLS) estimator and the feasible generalized least-squares (FGLS) estimator. The derivation is provided in the Supplementary Material. The following theorems establish the theoretical properties of the above estimators.
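With $\widehat{L}=\operatorname{Bdiag}\left(1_{p_{1}\times 1},\ldots,1_{p_{K}\times 1}\right)$, the matrix $\widehat{L}^{\mathrm{\scriptscriptstyle T}}\widehat{L}$ is diagonal with entries $p_{k}$, so the OLS score for community $k$ reduces to the mean of the observed variables in that community. The sketch below illustrates this and the agreement with the GLS form when $\Sigma_{u}=\operatorname{Bdiag}(a_{11}I_{p_{1}},\ldots,a_{KK}I_{p_{K}})$; the function names and the toy numbers are hypothetical and are not taken from the paper.

```python
def factor_scores_ols(x, sizes):
    """OLS scores (L^T L)^{-1} L^T x for L = Bdiag(1_{p_1}, ..., 1_{p_K}): block means."""
    scores, start = [], 0
    for p_k in sizes:
        scores.append(sum(x[start:start + p_k]) / p_k)
        start += p_k
    return scores


def factor_scores_gls(x, sizes, a):
    """GLS scores with Sigma_u = Bdiag(a_11 I, ..., a_KK I), written out block by block."""
    scores, start = [], 0
    for p_k, a_kk in zip(sizes, a):
        weighted_sum = sum(x_j / a_kk for x_j in x[start:start + p_k])  # k-th entry of L^T Sigma_u^{-1} x
        weight_total = p_k / a_kk                                        # k-th diagonal of L^T Sigma_u^{-1} L
        scores.append(weighted_sum / weight_total)
        start += p_k
    return scores


# Toy observation with two communities of sizes 2 and 3 (illustrative values).
x_i = [1.0, 3.0, 2.0, 4.0, 6.0]
sizes = (2, 3)
ols = factor_scores_ols(x_i, sizes)
gls = factor_scores_gls(x_i, sizes, a=[0.1, 0.5])
```

The error variance $a_{kk}$ is constant within a community, so it cancels between the numerator and denominator of the GLS form, which is why the two computations agree.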

Theorem \arabicsection.\arabictheorem.

If $K+K(K+1)/2<n$, then the proposed matrix estimators $\widehat{L}$, $\widehat{\Sigma}_{f}$, and $\widehat{\Sigma}_{u}$ are uniformly minimum-variance unbiased estimators (UMVUEs).

Please refer to the proof in the Supplementary Material. As $\widehat{a}_{kk}$ and $\widehat{b}_{kk^{\prime}}$ are (unique) maximum likelihood estimators, they also exhibit large-sample properties such as consistency, asymptotic efficiency, and asymptotic normality, under the conditions that $K+K(K+1)/2<n$ and $n\to\infty$ for fixed $K$ and $p$.

Theorem \arabicsection.\arabictheorem.

If $K+K(K+1)/2<n$ and $\Sigma_{u}$ is positive definite (i.e., $a_{kk}>0$ for every $k$), then $\widehat{f}_{i}$ defined in (\arabicequation) is the UMVUE and follows a multivariate normal distribution with mean $f_{i}$ and the covariance matrix presented in Theorem \arabicsection.\arabictheorem.

The proofs of the equivalence in (\arabicequation) and of Theorem \arabicsection.\arabictheorem are given in the Supplementary Material. The estimator $\widehat{f}_{i}$ also exhibits large-sample properties such as consistency, asymptotic efficiency, and asymptotic normality as $p\to\infty$. The normality result in Theorem \arabicsection.\arabictheorem yields confidence intervals for the factor scores $f_{i}$. In addition to the maximum likelihood estimators $\widehat{a}_{kk}$ and $\widehat{b}_{kk^{\prime}}$, we provide their exact closed-form variance estimators in the following theorem. Moreover, the exact covariance matrix of the proposed factor score estimator $\widehat{f}_{i}$ is obtainable from (\arabicequation).

Theorem \arabicsection.\arabictheorem.

(1) The exact variance estimators of $\widehat{a}_{kk}$ and $\widehat{b}_{kk^{\prime}}$ are

\[
\begin{split}
\operatorname{var}(\widehat{a}_{kk})&=\frac{2a_{kk}^{2}}{(n-1)(p_{k}-1)},\\
\operatorname{var}(\widehat{b}_{kk^{\prime}})&=
\begin{cases}
\frac{2}{(n-1)p_{k}(p_{k}-1)}\left\{(a_{kk}+p_{k}b_{kk})^{2}-(2a_{kk}+p_{k}b_{kk})b_{kk}\right\}, & k=k^{\prime}\\
\frac{1}{2(n-1)p_{k}p_{k^{\prime}}}\left\{p_{k}p_{k^{\prime}}(b_{kk^{\prime}}^{2}+b_{k^{\prime}k}^{2})+2(a_{kk}+p_{k}b_{kk})(a_{k^{\prime}k^{\prime}}+p_{k^{\prime}}b_{k^{\prime}k^{\prime}})\right\}, & k\neq k^{\prime}
\end{cases}
\end{split}
\]

for every $k$ and $k^{\prime}$.

(2) The exact covariance matrix estimator of $\widehat{f}_{i}$ is

\[
\operatorname{cov}\left(\widehat{f}_{i}\right)
=\left(\widehat{L}^{\mathrm{\scriptscriptstyle T}}\widehat{L}\right)^{-1}\widehat{L}^{\mathrm{\scriptscriptstyle T}}\Sigma\widehat{L}\left(\widehat{L}^{\mathrm{\scriptscriptstyle T}}\widehat{L}\right)^{-1}
=\operatorname{diag}\left(\frac{a_{11}}{p_{1}},\ldots,\frac{a_{KK}}{p_{K}}\right)+\left(b_{kk^{\prime}}\right).
\]

In particular, if $p_{k}\to\infty$ for all $k$, then $p\to\infty$ and $\operatorname{cov}\left(\widehat{f}_{i}\right)\to\left(b_{kk^{\prime}}\right)=\Sigma_{f}$ for each $i$.
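As a brief check of the second equality in part (2), the algebra can be sketched as follows, assuming that the model covariance has the form $\Sigma=L\Sigma_{f}L^{\mathrm{\scriptscriptstyle T}}+\Sigma_{u}$ with $\widehat{L}=L=\operatorname{Bdiag}\left(1_{p_{1}\times 1},\ldots,1_{p_{K}\times 1}\right)$ and $\Sigma_{u}=\operatorname{Bdiag}\left(a_{11}I_{p_{1}},\ldots,a_{KK}I_{p_{K}}\right)$:

\[
\begin{aligned}
\widehat{L}^{\mathrm{\scriptscriptstyle T}}\widehat{L}&=\operatorname{diag}\left(p_{1},\ldots,p_{K}\right),\\
\widehat{L}^{\mathrm{\scriptscriptstyle T}}\Sigma\widehat{L}
&=\left(\widehat{L}^{\mathrm{\scriptscriptstyle T}}\widehat{L}\right)\Sigma_{f}\left(\widehat{L}^{\mathrm{\scriptscriptstyle T}}\widehat{L}\right)+\widehat{L}^{\mathrm{\scriptscriptstyle T}}\Sigma_{u}\widehat{L}
=\operatorname{diag}\left(p_{k}\right)\left(b_{kk^{\prime}}\right)\operatorname{diag}\left(p_{k}\right)+\operatorname{diag}\left(a_{kk}p_{k}\right),\\
\operatorname{cov}\left(\widehat{f}_{i}\right)
&=\operatorname{diag}\left(p_{k}^{-1}\right)\left\{\operatorname{diag}\left(p_{k}\right)\left(b_{kk^{\prime}}\right)\operatorname{diag}\left(p_{k}\right)+\operatorname{diag}\left(a_{kk}p_{k}\right)\right\}\operatorname{diag}\left(p_{k}^{-1}\right)
=\left(b_{kk^{\prime}}\right)+\operatorname{diag}\left(\frac{a_{kk}}{p_{k}}\right).
\end{aligned}
\]

The diagonal term $a_{kk}/p_{k}$ is the variance of an average of $p_{k}$ independent errors, which is why it vanishes as $p_{k}\to\infty$.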

Utilizing Theorem \arabicsection.\arabictheorem and the estimates provided in (\arabicequation), we can perform Wald-type hypothesis tests and compute interval estimates for all model parameters $\Sigma_{f}$, $\Sigma_{u}$, and $f_{i}$.
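A Wald-type interval can be sketched directly from the variance formulas in part (1) of the theorem. The code below is an illustration under stated assumptions, not the authors' software: the function names are hypothetical, the variances are evaluated by plugging parameter values (in practice, the estimates) into the formulas, and the normal quantile is taken from Python's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def var_a_hat(a_kk, p_k, n):
    """var(a_hat_kk) = 2 a_kk^2 / ((n - 1)(p_k - 1))."""
    return 2.0 * a_kk ** 2 / ((n - 1) * (p_k - 1))


def var_b_hat_diag(a_kk, b_kk, p_k, n):
    """var(b_hat_kk) for the within-community case k = k'."""
    inner = (a_kk + p_k * b_kk) ** 2 - (2.0 * a_kk + p_k * b_kk) * b_kk
    return 2.0 * inner / ((n - 1) * p_k * (p_k - 1))


def var_b_hat_offdiag(a_kk, a_ll, b_kk, b_ll, b_kl, p_k, p_l, n):
    """var(b_hat_kk') for k != k'; by symmetry b_{k'k} = b_{kk'} = b_kl."""
    inner = p_k * p_l * (b_kl ** 2 + b_kl ** 2) \
        + 2.0 * (a_kk + p_k * b_kk) * (a_ll + p_l * b_ll)
    return inner / (2.0 * (n - 1) * p_k * p_l)


def wald_interval(estimate, variance, level=0.95):
    """Two-sided Wald interval: estimate +/- z * sqrt(variance)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width


# Illustrative numbers: a_11 = 0.1 with p_1 = 6 variables and n = 40 samples.
ci_a11 = wald_interval(0.1, var_a_hat(0.1, p_k=6, n=40))
```

A Wald test of a hypothesized value follows the same pattern: the estimate minus the hypothesized value, divided by the square root of the variance, is compared with a standard normal quantile.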

\arabicsection SIMULATION STUDIES

\arabicsection.\arabicsubsection Data generation

We perform Monte Carlo simulation to generate the observation vector $X_{i}=Lf_{i}+u_{i}$ for $i=1,\ldots,n$, where the factor loading matrix is $L=\operatorname{Bdiag}\left(1_{p_{1}\times 1},\ldots,1_{p_{K}\times 1}\right)$, the common factor is $f_{i}\sim N\left(0_{K\times 1},\left(b_{kk^{\prime}}\right)\right)$, and the error term is $u_{i}\sim N\left(0_{p\times 1},\operatorname{Bdiag}(a_{11}I_{p_{1}},\ldots,a_{KK}I_{p_{K}})\right)$.
Specifically, we explore various values for $(p_{1},\ldots,p_{K})^{\mathrm{\scriptscriptstyle T}}$ (as indicated in Table \arabictable) under different sample sizes, i.e., $n\in\{40,80,120\}$, with $K=3$ and

\[
a_{11}=0.1,\quad a_{22}=0.2,\quad a_{33}=0.5,\quad
\left(b_{kk^{\prime}}\right)=\begin{pmatrix}2.02&0.73&1.15\\ &3.13&1.63\\ &&3.69\end{pmatrix}\text{ is symmetric}.
\]

We repeat the above data generation procedure $100$ times.
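The data-generating model above can be sketched with the Python standard library alone. This is an illustrative reimplementation rather than the authors' simulation code: the helper names `cholesky` and `simulate`, the random seed, and the choice of a Cholesky factor to draw the correlated factors are assumptions. The community sizes $(6,6,8)$ correspond to the $2\times(3,3,4)^{\mathrm{\scriptscriptstyle T}}$, $p=20$ setting.

```python
import math
import random


def cholesky(A):
    """Lower-triangular C with C C^T = A, for a symmetric positive definite A."""
    m = len(A)
    C = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = sum(C[i][t] * C[j][t] for t in range(j))
            if i == j:
                C[i][j] = math.sqrt(A[i][i] - s)
            else:
                C[i][j] = (A[i][j] - s) / C[j][j]
    return C


def simulate(n, sizes, a, B, rng):
    """Draw n observations X_i = L f_i + u_i; returns (factors, data)."""
    C = cholesky(B)
    K = len(sizes)
    factors, data = [], []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(K)]
        # f_i = C z has covariance C C^T = B = (b_kk').
        f = [sum(C[k][t] * z[t] for t in range(k + 1)) for k in range(K)]
        x = []
        for k, p_k in enumerate(sizes):
            sd = math.sqrt(a[k])
            # Every variable in community k loads on factor k with loading 1.
            x.extend(f[k] + rng.gauss(0.0, sd) for _ in range(p_k))
        factors.append(f)
        data.append(x)
    return factors, data


B = [[2.02, 0.73, 1.15], [0.73, 3.13, 1.63], [1.15, 1.63, 3.69]]
a = [0.1, 0.2, 0.5]
sizes = (6, 6, 8)            # 2 x (3, 3, 4), so p = 20
rng = random.Random(2024)    # arbitrary seed, chosen only for reproducibility
factors, data = simulate(40, sizes, a, B, rng)
```

Repeating the call to `simulate` with fresh draws gives the Monte Carlo replicates; keeping the true `factors` alongside `data` is what makes the loss against $f_{i}$ computable later.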

\arabicsection.\arabicsubsection Assessing the estimated factor scores and model parameters

We apply the SCFA model to each simulated dataset to estimate $L$, $f_{i}$, $\Sigma_{f}$, and $\Sigma_{u}$. Since a primary objective of CFA is dimension reduction that yields reliable common factors, we first focus on evaluating the estimation of $f_{i}$. Specifically, we employ the Euclidean loss $\sum_{i=1}^{n}\left\|\widehat{f}_{i}-f_{i}\right\|$ as the evaluation criterion. We calculate $\widehat{f}_{i}$ by (\arabicequation) and benchmark it against the CFA computational methods implemented in the R packages “sem” (Fox, 2006; Fox et al., 2022), “OpenMx” (Neale et al., 2016; Boker et al., 2023), and “lavaan” (Rosseel, 2012; Rosseel et al., 2023). For a fair comparison, we constrain the scales of the loadings to be identical across all methods.
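The evaluation criterion is a sum of per-sample Euclidean norms, not a squared error. A minimal sketch follows; the function name and the toy vectors are hypothetical.

```python
import math


def euclidean_loss(f_hat, f_true):
    """Sum over i of the Euclidean norm ||f_hat_i - f_i||."""
    return sum(math.dist(est, tru) for est, tru in zip(f_hat, f_true))


# Two samples with K = 3 factors; only the first estimate is off, by (3, 4, 0).
f_true = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
f_hat = [[3.0, 4.0, 0.0], [1.0, 1.0, 1.0]]
loss = euclidean_loss(f_hat, f_true)
```

Because the loss is summed over the $n$ samples, its magnitude grows with $n$, so values are comparable across methods within a row of the results table but not directly across sample sizes.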

In Table \arabictable, we summarize the mean and standard deviation of the Euclidean losses for each approach across the $100$ replicates under different settings. The results indicate that SCFA outperforms the conventional numerical approaches, exhibiting the lowest average loss in every setting, while the other methods incur, on average, roughly $1.5$ to $2.5$ times the SCFA loss whenever they return a result. Additionally, SCFA shows a much-reduced variation in loss, with standard deviations of roughly one-seventh or less of those of the competing methods. Lastly, SCFA reduces the computation time by a factor of at least $100$. As mentioned earlier, the numerical implementations of the conventional CFA method may be limited for input datasets with $p\geq n$ and may thus yield no results (i.e., “NA” entries in Table \arabictable). In contrast, SCFA provides a viable solution for datasets with $p\geq n$, which are very common in practice (e.g., omics, imaging, and financial data).

\begin{table}
\centering
\begin{tabular}{l rrr rrr rrr rrr}
\hline
 & \multicolumn{3}{c}{SCFA} & \multicolumn{3}{c}{sem} & \multicolumn{3}{c}{OpenMx} & \multicolumn{3}{c}{lavaan} \\
$(n,p)$ & mean & std.\ dev. & time (secs) & mean & std.\ dev. & time (secs) & mean & std.\ dev. & time (secs) & mean & std.\ dev. & time (secs) \\
\hline
$(40,20)$ & 12.16 & 0.78 & 0.04 & 20.7 & 5.93 & 23.03 & 21.31 & 5.84 & 139.89 & 21.24 & 5.84 & 16.21 \\
$(40,30)$ & 10.03 & 0.67 & 0.14 & 20.26 & 6.95 & 102.29 & 20.7 & 6.85 & 331.37 & 20.66 & 6.86 & 22.39 \\
$(40,40)$ & 8.72 & 0.58 & 0.04 & NA & NA & NA & 19.53 & 8.07 & 866.16 & NA & NA & NA \\
$(80,40)$ & 17.26 & 0.87 & 0.04 & 30.06 & 10.06 & 423.17 & 30.98 & 9.75 & 904.6 & 30.93 & 9.79 & 47.75 \\
$(80,80)$ & 12.27 & 0.65 & 0.05 & NA & NA & NA & 31.05 & 14.22 & 262.32 & NA & NA & NA \\
$(80,120)$ & 9.83 & 0.49 & 0.22 & NA & NA & NA & NA & NA & NA & NA & NA & NA \\
$(120,40)$ & 25.88 & 1.01 & 0.04 & 39.75 & 10.98 & 409.12 & 40.58 & 10.82 & 866.05 & 40.54 & 10.82 & 47.29 \\
$(120,120)$ & 14.95 & 0.60 & 0.07 & NA & NA & NA & NA & NA & NA & NA & NA & NA \\
$(120,200)$ & 11.49 & 0.51 & 0.13 & NA & NA & NA & NA & NA & NA & NA & NA & NA \\
\hline
\end{tabular}
\caption{We present the means, standard deviations, and computation times (in seconds) for the Euclidean losses $\sum_{i=1}^{n}\|\widehat{f}_{i}-f_{i}\|$ across $100$ replicates, comparing the results obtained from our proposed method and computational packages for different values of $p$ and $n$, with “NA” indicating “not available”.}
\end{table}

In addition to the factor scores, we evaluate the performance of estimating the model parameters $\Sigma_{f}$ and $\Sigma_{u}$. As the parametric matrices $\Sigma_{f}$ and $\Sigma_{u}$ can be represented by the parameters $a_{kk}$ and $b_{kk^{\prime}}$, we assess the accuracy of the estimators $\widehat{a}_{kk}$ and $\widehat{b}_{kk^{\prime}}$ in (\arabicequation).

The estimation results are summarized in Table \arabictable. We consider the following metrics in Table \arabictable: the average bias, Monte Carlo standard deviation, average standard error, and $95\%$ Wald-type empirical coverage probability using the proposed estimates for each $a_{kk}$ and $b_{kk^{\prime}}$. The results in Table \arabictable demonstrate that our estimation is generally accurate with small biases and approximate $95\%$ coverage probabilities. Specifically, for each parameter, the bias is relatively small when compared to the Monte Carlo standard deviation, while the average standard error is close to the Monte Carlo standard deviation. Furthermore, both the bias and average standard error decrease as the sample size increases, and the $95\%$ coverage probability approaches the nominal level as the sample size grows.
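The four summary metrics can be computed from the per-replicate estimates and standard errors of a single parameter. The sketch below is one plausible way to do so and is not taken from the paper: the function name `summarize` and the toy inputs are hypothetical, the Monte Carlo standard deviation uses the $n-1$ denominator (an assumption), and coverage is judged with a normal-quantile Wald interval.

```python
from statistics import NormalDist, mean, stdev


def summarize(estimates, std_errors, truth, level=0.95):
    """Average bias, Monte Carlo SD, average SE, and empirical Wald coverage."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    # A replicate covers the truth when |estimate - truth| <= z * SE.
    covered = [abs(est - truth) <= z * se for est, se in zip(estimates, std_errors)]
    return {
        "bias": mean(estimates) - truth,
        "MCSD": stdev(estimates),          # sample SD across replicates (n - 1 denominator)
        "ASE": mean(std_errors),
        "CP": sum(covered) / len(covered),
    }


# Four toy replicates of one parameter whose true value is 1.0.
result = summarize([0.9, 1.1, 1.0, 1.2], [0.1, 0.1, 0.1, 0.1], truth=1.0)
```

Comparing `MCSD` with `ASE` is the check described above: when the closed-form standard errors are accurate, the two should be close, and `CP` should be near the nominal level.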

For comparison, we also use the aforementioned three R packages to estimate all $\ell_{k}$ in the factor loading matrix $L$, all diagonal elements of $\Sigma_{u}=\operatorname{Bdiag}\left(a_{11}I_{p_{1}},\ldots,a_{KK}I_{p_{K}}\right)$, and all elements of $\Sigma_{f}=\left(b_{kk^{\prime}}\right)$.
We also calculate the average bias and asymptotic standard error from the results produced by “sem”, “OpenMx”, and “lavaan” for all elements of $\ell_{k}$, $\Sigma_{u}$, and $\Sigma_{f}$. Due to the page limit, we present these results in the Supplementary Material. In cases where $p<n$, the proposed estimators demonstrate superior performance compared to the estimates produced by the R packages “sem”, “OpenMx”, and “lavaan”, with lower average standard errors and much shorter computational time. When $p>120$, the computational times of the conventional methods become long and even intractable, resulting in “NA” values for the model parameters. This is primarily due to the presence of singular sample covariance matrices. In general, where the competing methods return estimates, the performance of all methods is comparable, with computational cost being the bottleneck for the competing methods.

\begin{tabular}{l rrrr rrrr rrrr}
\hline
 & bias & MCSD & ASE & 95\% CP & bias & MCSD & ASE & 95\% CP & bias & MCSD & ASE & 95\% CP \\
\hline
$n=40$ & \multicolumn{4}{c}{$2\times(3,3,4)^{\mathrm{\scriptscriptstyle T}}$, $p=20$} & \multicolumn{4}{c}{$3\times(3,3,4)^{\mathrm{\scriptscriptstyle T}}$, $p=30$} & \multicolumn{4}{c}{$4\times(3,3,4)^{\mathrm{\scriptscriptstyle T}}$, $p=40$} \\
$a_{11}$ & 0.0 & 1.0 & 1.0 & 92 & 0.0 & 0.7 & 0.8 & 93 & 0.0 & 0.6 & 0.7 & 97 \\
$a_{22}$ & $-0.1$ & 2.1 & 2.0 & 92 & 0.2 & 1.6 & 1.6 & 97 & $-0.3$ & 1.3 & 1.3 & 97 \\
$a_{33}$ & $-0.4$ & 3.7 & 4.2 & 96 & 0.1 & 3.5 & 3.4 & 94 & $-0.1$ & 2.8 & 2.9 & 96 \\
$b_{11}$ & $-2.5$ & 45.9 & 45.6 & 89 & $-0.3$ & 52.1 & 46.0 & 91 & 0.7 & 46.7 & 46.1 & 96 \\
$b_{12}$ & $-3.0$ & 41.4 & 40.9 & 92 & $-4.3$ & 43.3 & 41.6 & 92 & $-5.4$ & 47.1 & 41.8 & 94 \\
$b_{13}$ & 2.8 & 45.5 & 48.0 & 94 & 2.6 & 50.4 & 47.5 & 92 & $-3.6$ & 51.6 & 47.2 & 89 \\
$b_{22}$ & $-14.2$ & 76.1 & 68.4 & 90 & $-2.5$ & 78.3 & 70.8 & 88 & $-2.5$ & 66.9 & 70.7 & 92 \\
$b_{23}$ & $-7.9$ & 62.9 & 59.5 & 90 & $-2.8$ & 58.9 & 59.7 & 96 & $-3.7$ & 64.1 & 59.7 & 91 \\
$b_{33}$ & 3.3 & 74.8 & 85.8 & 97 & $-7.4$ & 80.0 & 82.9 & 94 & $-6.2$ & 84.7 & 82.9 & 89 \\
\hline
$n=80$ & \multicolumn{4}{c}{$4\times(3,3,4)^{\mathrm{\scriptscriptstyle T}}$, $p=40$} & \multicolumn{4}{c}{$8\times(3,3,4)^{\mathrm{\scriptscriptstyle T}}$, $p=80$} & \multicolumn{4}{c}{$12\times(3,3,4)^{\mathrm{\scriptscriptstyle T}}$, $p=120$} \\
$a_{11}$ & 0.0 & 0.5 & 0.5 & 95 & 0.1 & 0.4 & 0.3 & 93 & 0.0 & 0.2 & 0.3 & 98 \\
$a_{22}$ & 0.0 & 0.9 & 1.0 & 97 & 0.0 & 0.7 & 0.7 & 93 & 0.0 & 0.5 & 0.5 & 96 \\
$a_{33}$ & $-0.4$ & 1.9 & 2.0 & 95 & 0.0 & 1.5 & 1.4 & 96 & $-0.1$ & 1.3 & 1.2 & 93 \\
$b_{11}$ & 3.1 & 33.3 & 32.8 & 92 & 1.0 & 28.8 & 32.4 & 97 & 0.4 & 31.6 & 32.3 & 95 \\
$b_{12}$ & $-0.1$ & 31.3 & 29.6 & 95 & 0.3 & 28.4 & 29.7 & 97 & 1.9 & 26.6 & 29.6 & 96 \\
$b_{13}$ & $-0.4$ & 35.5 & 33.2 & 96 & $-0.2$ & 33.2 & 33.2 & 96 & 2.0 & 33.4 & 33.8 & 95 \\
$b_{22}$ & $-5.3$ & 48.3 & 49.2 & 94 & 3.3 & 50.1 & 50.5 & 95 & 3.0 & 54.0 & 50.4 & 92 \\
$b_{23}$ & $-3.9$ & 41.0 & 41.5 & 93 & $-3.6$ & 44.1 & 42.2 & 93 & 6.4 & 42.8 & 43.2 & 97 \\
$b_{33}$ & $-13.2$ & 55.1 & 57.2 & 93 & $-7.4$ & 56.1 & 57.8 & 92 & 6.3 & 60.4 & 59.9 & 96 \\
\hline
$n=120$ & \multicolumn{4}{c}{$4\times(3,3,4)^{\mathrm{\scriptscriptstyle T}}$, $p=40$} & \multicolumn{4}{c}{$12\times(3,3,4)^{\mathrm{\scriptscriptstyle T}}$, $p=120$} & \multicolumn{4}{c}{$20\times(3,3,4)^{\mathrm{\scriptscriptstyle T}}$, $p=200$} \\
$a_{11}$ & $-0.1$ & 0.3 & 0.4 & 100 & 0.0 & 0.2 & 0.2 & 95 & 0.0 & 0.2 & 0.2 & 96 \\
$a_{22}$ & $-0.1$ & 0.8 & 0.8 & 94 & $-0.1$ & 0.4 & 0.4 & 92 & 0.0 & 0.3 & 0.3 & 94 \\
$a_{33}$ & 0.1 & 1.6 & 1.7 & 97 & $-0.1$ & 0.9 & 0.9 & 97 & 0.0 & 0.6 & 0.7 & 98 \\
$b_{11}$ & $-0.9$ & 23.3 & 26.2 & 97 & $-1.4$ & 25.3 & 26.1 & 95 & $-0.5$ & 26.1 & 26.2 & 95 \\
$b_{12}$ & 0.1 & 23.2 & 23.9 & 96 & $-4.3$ & 21.0 & 23.5 & 97 & 0.7 & 23.7 & 23.9 & 95 \\
$b_{13}$ & $-2.2$ & 26.9 & 26.8 & 94 & $-1.1$ & 27.9 & 26.9 & 93 & $-0.3$ & 28.5 & 27.1 & 92 \\
$b_{22}$ & $-3.1$ & 40.2 & 40.4 & 90 & $-9.1$ & 36.9 & 39.5 & 93 & $-4.8$ & 37.7 & 40.0 & 97 \\
$b_{23}$ & $-3.5$ & 32.0 & 34.0 & 96 & $-6.4$ & 32.9 & 33.7 & 92 & $-0.9$ & 33.6 & 34.3 & 94 \\
$b_{33}$ & $-12.8$ & 47.1 & 46.6 & 92 & $-5.6$ & 46.2 & 47.3 & 94 & $-1.1$ & 51.8 & 47.8 & 90 \\
\hline
\end{tabular}
Table \arabictable: We present the estimation results ($\times 100$) using the proposed method for $a_{11}$, $a_{22}$, $a_{33}$, and $b_{kk^{\prime}}$ across various $p$ (i.e., $(p_{1},\ldots,p_{K})^{\mathrm{\scriptscriptstyle T}}$) and $n$, based on $100$ replicates, where “bias” denotes the average of estimation bias, “MCSD” denotes the Monte Carlo standard deviation, “ASE” denotes the average standard error, and “$95\%$ CP” denotes the coverage probability based on a $95\%$ Wald-type confidence interval.
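The four summary columns in the table can be computed from the replicate-level estimates and standard errors. The following is a minimal sketch, assuming one scalar parameter, a known true value, and a normal-quantile Wald interval; the function name `summarize_replicates` and the toy data are hypothetical and are not taken from the SCFA software.

```python
import numpy as np

def summarize_replicates(estimates, std_errors, truth, z=1.959964):
    """Summarize Monte Carlo replicates of one scalar parameter."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    bias = estimates.mean() - truth        # average estimation bias
    mcsd = estimates.std(ddof=1)           # Monte Carlo standard deviation
    ase = std_errors.mean()                # average estimated standard error
    lower = estimates - z * std_errors     # Wald-type interval endpoints
    upper = estimates + z * std_errors
    cp = np.mean((lower <= truth) & (truth <= upper))  # empirical coverage
    return {"bias": bias, "MCSD": mcsd, "ASE": ase, "CP": cp}

# Toy data (hypothetical): 100 replicates of an unbiased estimator with sd 0.5.
rng = np.random.default_rng(0)
truth = 1.0
est = truth + 0.5 * rng.standard_normal(100)
se = np.full(100, 0.5)
summary = summarize_replicates(est, se, truth)
```

When the standard error formula is accurate, ASE should be close to MCSD and CP should be close to the nominal level, which is the pattern to look for when reading the table.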

\arabicsection.\arabicsubsection Simulation results with model misspecification

We evaluate the performance of SCFA under model misspecification, specifically when the covariance matrix deviates from an interconnected community structure. Under the misspecified structure, we generate $X_{i}\sim N\left(0_{p\times 1},\Sigma_{\kappa}\right)$ with $\Sigma_{\kappa}=L\Sigma_{f}L^{\mathrm{\scriptscriptstyle T}}+\Sigma_{u}+E_{\kappa}$, where $L$, $\Sigma_{f}$, and $\Sigma_{u}$ are model parameters defined in Section \arabicsection.\arabicsubsection. We set $n=120$, $(p_{1},\ldots,p_{K})^{\mathrm{\scriptscriptstyle T}}=\left(60,60,80\right)^{\mathrm{\scriptscriptstyle T}}$, $K=3$, and $p=200$. The term $E_{\kappa}$ is an additional noise term that makes the covariance structure deviate from the interconnected community structure.
Specifically, $E_{\kappa}\sim\text{Wishart}\left(p,\kappa I_{p}\right)$, where the noise level $\kappa\in 10^{-2}\times\{1,3,5\}$ and the noise scale of each entry in $E_{\kappa}$ averages approximately $\{2,6,10\}$, which overwhelms $\Sigma_{u}$. We repeat this data generation procedure $100$ times.
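The data-generating procedure above can be sketched as follows. This is an illustration only: the values of `a` and `Sigma_f` and the unit loadings in `L` are hypothetical placeholders (the actual parameter values are set in an earlier section), and the first argument of the Wishart distribution is read here as the degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(2024)

sizes = [60, 60, 80]              # (p_1, p_2, p_3) from the text
p, K, n = sum(sizes), len(sizes), 120
kappa = 0.01                      # one of the noise levels 10^{-2} x {1, 3, 5}

# Hypothetical parameter values, chosen only to make the sketch runnable.
a = np.array([0.1, 0.2, 0.5])                 # a_kk: diagonal of Sigma_u per community
Sigma_f = np.array([[1.0, 0.3, 0.2],
                    [0.3, 1.5, 0.4],
                    [0.2, 0.4, 2.0]])         # covariance of the common factors

# Loading matrix L: column k is nonzero only on community k (unit loadings assumed).
L = np.zeros((p, K))
start = 0
for k, pk in enumerate(sizes):
    L[start:start + pk, k] = 1.0
    start += pk

Sigma_u = np.diag(np.repeat(a, sizes))

# E_kappa ~ Wishart(df = p, scale = kappa * I_p), built as a sum of p outer
# products of N(0, kappa * I_p) vectors.
Z = rng.standard_normal((p, p))
E_kappa = kappa * (Z @ Z.T)

Sigma_kappa = L @ Sigma_f @ L.T + Sigma_u + E_kappa
X = rng.multivariate_normal(np.zeros(p), Sigma_kappa, size=n)  # n x p data matrix

noise_scale = np.trace(E_kappa) / p   # average diagonal entry, close to p * kappa
```

Under this reading, the mean of $E_{\kappa}$ is $p\kappa I_{p}$, so $\kappa=0.01$ with $p=200$ gives an average diagonal entry near $2$, matching the first reported noise scale.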

We apply the SCFA model to each simulated dataset and calculate the average bias, Monte Carlo standard deviation, average standard error, and $95\%$ coverage probability for each model parameter. The summarized results are presented in Table \arabictable. Additional results concerning higher values of $\kappa$ are available in the Supplementary Material.

The results demonstrate that both the means and the standard deviations of the Euclidean losses for the estimated factor scores are larger than those calculated under a correctly specified model, and both increase with the noise level. When the proposed interconnected community structure is violated by moderate random noise (e.g., the diagonal elements of the mean matrix of $E_{\kappa}$ are larger than those of $\Sigma_{u}$), the performance of the SCFA model remains robust, yielding reliable common factor estimates $\widehat{f}_{i}$ and covariance parameter estimates among common factors $\widehat{\Sigma}_{f}=\left(\widehat{b}_{kk^{\prime}}\right)$ (see Table \arabictable). In contrast, the estimates of $a_{11}$, $a_{22}$, and $a_{33}$ absorb the added noise: their biases are close to the noise scale, and the corresponding coverage probabilities drop to $0$.

\begin{tabular}{l rrrr rrrr rrrr}
\hline
 & \multicolumn{4}{c}{noise scale $=2$} & \multicolumn{4}{c}{noise scale $=6$} & \multicolumn{4}{c}{noise scale $=10$} \\
 & bias & MCSD & ASE & 95\% CP & bias & MCSD & ASE & 95\% CP & bias & MCSD & ASE & 95\% CP \\
\hline
$a_{11}$ & 203.2 & 3.8 & 3.6 & 0 & 602.3 & 13.4 & 10.3 & 0 & 989.9 & 21.4 & 16.9 & 0 \\
$a_{22}$ & 203.5 & 4.6 & 3.8 & 0 & 596.1 & 13.2 & 10.4 & 0 & 986.7 & 19.5 & 17 & 0 \\
$a_{33}$ & 201.9 & 4.5 & 3.7 & 0 & 614.1 & 11.8 & 9.7 & 0 & 1000.4 & 18.9 & 15.3 & 0 \\
$b_{11}$ & $-2.8$ & 25.7 & 26.3 & 96 & $-1.3$ & 27 & 27.3 & 96 & $-2.5$ & 26.9 & 28 & 94 \\
$b_{12}$ & 0.5 & 26.3 & 24.2 & 91 & 2.7 & 28.5 & 25.1 & 90 & 2.3 & 28.9 & 25.6 & 94 \\
$b_{13}$ & $-2.9$ & 26.2 & 27.2 & 98 & $-1.6$ & 27.2 & 27.9 & 96 & 0.7 & 26.9 & 28.5 & 98 \\
$b_{22}$ & 1.2 & 39.5 & 41.2 & 94 & 3.3 & 40.8 & 42.4 & 94 & 4.8 & 40.4 & 43.4 & 94 \\
$b_{23}$ & $-0.9$ & 33.4 & 34.9 & 96 & 0.8 & 33.8 & 35.5 & 95 & 0.6 & 34 & 36.1 & 96 \\
$b_{33}$ & 1.1 & 51.3 & 48.4 & 97 & 0.3 & 51.8 & 49 & 95 & 2.6 & 51.9 & 49.9 & 96 \\
\hline
$\sum_{i=1}^{n}\|\widehat{f}_{i}-f_{i}\|$ & \multicolumn{4}{c}{mean $34.70$, SD $1.23$} & \multicolumn{4}{c}{mean $58.13$, SD $2.26$} & \multicolumn{4}{c}{mean $75.94$, SD $2.68$} \\
\hline
\end{tabular}
Table \arabictable: We present the estimation results ($\times 100$) using the proposed method for $a_{11}$, $a_{22}$, $a_{33}$, and $b_{kk^{\prime}}$ under $n=120$, $(p_{1},\ldots,p_{K})^{\mathrm{\scriptscriptstyle T}}=(60,60,80)^{\mathrm{\scriptscriptstyle T}}$, $p=200$, and various noise terms $E_{\kappa}$, where “bias” denotes the average of estimation bias, “MCSD” denotes the Monte Carlo standard deviation, “ASE” denotes the average standard error, and “$95\%$ CP” denotes the coverage probability based on a $95\%$ Wald-type confidence interval. We also present the means and standard deviations (not $\times 100$) of the Euclidean losses $\sum_{i=1}^{n}\|\widehat{f}_{i}-f_{i}\|$ for the estimated factor scores across $100$ replicates under various noise levels.
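The Euclidean loss in the last row of the table can be illustrated with a small sketch. It assumes the factor model $x_{i}=Lf_{i}+u_{i}$ and an ordinary least-squares factor score $\widehat{f}_{i}=(L^{\mathrm{\scriptscriptstyle T}}L)^{-1}L^{\mathrm{\scriptscriptstyle T}}x_{i}$ with $L$ treated as known; this form, the unit loadings, and the toy dimensions are assumptions made for illustration and may differ from the exact estimator used by SCFA.

```python
import numpy as np

rng = np.random.default_rng(1)

sizes = [6, 6, 8]                        # small hypothetical community sizes
p, K, n = sum(sizes), len(sizes), 50

# Block-structured loading matrix with unit loadings (illustrative assumption).
L = np.zeros((p, K))
start = 0
for k, pk in enumerate(sizes):
    L[start:start + pk, k] = 1.0
    start += pk

F = rng.standard_normal((n, K))          # true factor scores f_i, one per row
U = 0.3 * rng.standard_normal((n, p))    # error terms u_i
X = F @ L.T + U                          # x_i = L f_i + u_i

# Ordinary least squares for all observations at once:
# f_hat_i = (L^T L)^{-1} L^T x_i.
F_hat = np.linalg.solve(L.T @ L, L.T @ X.T).T

# Euclidean loss summed over observations, as in the last row of the table.
loss = np.linalg.norm(F_hat - F, axis=1).sum()
```

With unit loadings, each estimated score is simply the within-community average of the observed variables, which is why larger communities tend to give smaller score errors.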

\arabicsection APPLICATION TO GENE EXPRESSION DATA

We applied the SCFA model to a dataset on microRNA regulation of gene expression, the Pan-kidney dataset from The Cancer Genome Atlas (TCGA) project, as outlined by Tomczak et al. (2015). The gene expression dataset comprised expression levels, in Reads Per Million mapped reads (RPM), for $13408$ genes collected from $712$ patients (age: mean $59.66$ years, standard deviation $12.59$ years; sex: $34.27\%$ female and $65.73\%$ male).

In this analysis, we extracted an interconnected community structure from the gene co-expression matrix using the algorithm by Wu et al. (2021), while other community detection algorithms provided similar findings. The detected interconnected community structure comprised $K=6$ communities from $p=1777$ genes with varying sizes: $p_{1}=384$, $p_{2}=503$, $p_{3}=314$, $p_{4}=268$, $p_{5}=158$, and $p_{6}=150$, as illustrated in Figure \arabicfigure. Genes were highly correlated within each module, while the interconnections between modules could be either positive or negative. Our goal was to identify latent factors for each module (i.e., dimension reduction of correlated genes) and estimate module-to-module interactions.

Based on the learned interconnected community structure, we applied the SCFA model for parameter estimation. The results were presented in Figure \arabicfigure. The first row of Figure \arabicfigure showed, from left to right, the original data, the extracted interconnected community structure, and the large covariance estimate obtained by the method of Yang et al. (2024). In the second row of Figure \arabicfigure, we showed the results of decomposing the above covariance estimate of $\Sigma$ into the SCFA parameters $L\Sigma_{f}L^{\mathrm{\scriptscriptstyle T}}+\Sigma_{u}$, displaying the estimated $\widehat{L}$, $\widehat{\Sigma}_{f}$, and $\widehat{\Sigma}_{u}$ as heatmaps. The middle heatmap in the second row illustrated the relationships among the $6$ common factors. The bottom subfigure of Figure \arabicfigure was a classic path diagram of the factor analysis, showing factor memberships together with intra-factor and inter-factor correlations. As demonstrated, SCFA reduced $p=1777$ gene expression variables to $K=6$ correlated common factors (with the estimated covariance matrix $\widehat{\Sigma}_{f}$), while accurately representing the complex relationships among all $1777$ variables. The computational time was less than one minute.
Due to the page limit, we provided detailed numerical estimates, gene names, the corresponding community membership, and estimated factor scores in the Supplementary Material.
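To make the decomposition $\Sigma=L\Sigma_{f}L^{\mathrm{\scriptscriptstyle T}}+\Sigma_{u}$ concrete, the sketch below recovers $a_{kk}$ and $b_{kk^{\prime}}$ from a block-structured covariance matrix by averaging entries within blocks. It assumes unit loadings within each community, so that the diagonal block $k$ equals $a_{kk}I_{p_{k}}+b_{kk}J_{p_{k}}$ and each off-diagonal block is constant. This is a simple moment-matching illustration with a hypothetical function name, not the likelihood-based closed-form estimators derived in the paper.

```python
import numpy as np

def decompose_block_covariance(Sigma, sizes):
    """Recover (a_kk) and (b_kk') from Sigma = L Sigma_f L^T + Sigma_u,
    assuming unit loadings within each community (illustrative assumption)."""
    K = len(sizes)
    edges = np.concatenate(([0], np.cumsum(sizes)))
    a = np.zeros(K)
    B = np.zeros((K, K))
    for k in range(K):
        rk = slice(edges[k], edges[k + 1])
        for j in range(K):
            rj = slice(edges[j], edges[j + 1])
            block = Sigma[rk, rj]
            if k == j:
                pk = sizes[k]
                diag_mean = np.trace(block) / pk
                off_mean = (block.sum() - np.trace(block)) / (pk * (pk - 1))
                B[k, k] = off_mean            # b_kk: within-community covariance
                a[k] = diag_mean - off_mean   # a_kk: error variance
            else:
                B[k, j] = block.mean()        # b_kk': between-community covariance
    return a, B

# Toy check with hypothetical values (not the TCGA estimates).
sizes = [3, 3, 4]
a_true = np.array([0.1, 0.2, 0.5])
Sigma_f = np.array([[1.0, 0.3, -0.2],
                    [0.3, 1.5, 0.4],
                    [-0.2, 0.4, 2.0]])
L = np.zeros((sum(sizes), len(sizes)))
edges = np.concatenate(([0], np.cumsum(sizes)))
for k in range(len(sizes)):
    L[edges[k]:edges[k + 1], k] = 1.0
Sigma = L @ Sigma_f @ L.T + np.diag(np.repeat(a_true, sizes))
a_hat, Sigma_f_hat = decompose_block_covariance(Sigma, sizes)
```

Because the work reduces to block averages, the cost grows with the number of entries in the covariance matrix rather than with an iterative optimization, which is in line with the short running time reported above.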

Figure \arabicfigure: We present the workflow for analyzing the TCGA gene expression data. In the first row, the subfigures illustrate: the input correlation matrix of $1777$ genes (left), the reordered correlation matrix highlighting the detected interconnected community structure (middle), and the (estimated) population correlation matrix (right), respectively. Utilizing the community membership from the $6$ estimated interconnected communities, we applied the SCFA model to the dataset. In the second row, the subfigures depict the estimation results: $\widehat{L}$ (left), $\widehat{\Sigma}_{f}$ (middle), and $\widehat{\Sigma}_{u}$ (right). The bottom path diagram illustrates gene subsets (in rectangles) and their respective common factors (in ovals). The lines (with double-ended arrows) among the $6$ common factors denote their estimated correlation coefficients (i.e., the off-diagonal entries of $\widehat{\Sigma}_{f}$). Specifically, the red solid lines represent significant positive correlations, the blue solid lines represent significant negative correlations, and the dashed lines represent non-significant correlations.

We conducted pathway analysis to investigate the relationships between each common factor and kidney cancer. The results revealed that each common factor characterized unique cellular and molecular functions related to kidney cancer. Specifically, the first common factor exhibited enrichment in pathways related to G protein-coupled receptors (GPCRs), which play pivotal roles in various aspects of cancer progression, including tumor growth, invasion, migration, survival, and metastasis (Arakaki et al., 2018). The second common factor was enriched with pathways related to cellular respiration in mitochondria, which are essential to cancer cells, including those of renal adenocarcinoma (Wallace, 2012). The third common factor was enriched with pathways associated with the body’s immune response: renal cell carcinoma (RCC) is considered to be an immunogenic tumor but is known to mediate immune dysfunction (Díaz-Montero et al., 2020). For the remaining common factors, the uniqueness of associations might be less pronounced, and detailed information was provided in the Supplementary Material.

For comparison, we also applied the conventional CFA model with the same detected community membership to the dataset using R packages “sem”, “OpenMx”, and “lavaan”, respectively. However, none of these standard computational packages could handle the gene expression dataset because $n=712<p=1777$. Thus, the SCFA model stands out as an unprecedented toolkit for CFA in high-dimensional applications.
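One reason the regime $n<p$ is difficult for covariance-based fitting is that the sample covariance matrix is then singular, as also noted for the simulation study. The following sketch demonstrates this with small stand-in dimensions (hypothetical values replacing $n=712$ and $p=1777$ to keep the example fast); it does not reproduce the behavior of any specific R package.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 50                       # small stand-ins for n = 712, p = 1777
X = rng.standard_normal((n, p))     # n observations of p variables

S = np.cov(X, rowvar=False)         # p x p sample covariance matrix
rank = np.linalg.matrix_rank(S)     # at most n - 1 after centering the data
is_singular = rank < p              # True whenever n <= p
```

Since the rank of the sample covariance matrix cannot exceed $n-1$, its inverse and log-determinant, which appear in the usual normal-theory likelihood, are not available without additional structure or regularization.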

\arabicsection DISCUSSION

We have developed a novel confirmatory factor analysis model, the SCFA model, designed for high-dimensional data. The traditional CFA model, as implemented in standard computational packages, relies on predefined model specifications and faces computational challenges with high-dimensional datasets. The SCFA model addresses both limitations simultaneously by leveraging a data-driven covariance structure. Because the interconnected community structure is prevalent in the covariance matrices of diverse high-dimensional data types and has a block form, we integrate it into the SCFA model. This integration yields likelihood-based uniformly minimum-variance unbiased estimators (UMVUEs) for the factor loading matrix $L$, the covariance matrix of common factors $\Sigma_{f}$, and the covariance matrix of error terms $\Sigma_{u}$ in closed forms, as well as explicit and consistent least-squares estimators for the factor scores. Additionally, explicit variance estimators further facilitate statistical inference for these parameters.

Extensive simulation studies have demonstrated the superior performance of the SCFA model in both parameter estimation and factor score accuracy. Notably, SCFA significantly reduces the computational burden compared to existing CFA computational packages. The SCFA model also exhibits robustness to moderate violations of the interconnected community structure. In an application to TCGA gene expression data, we employed the SCFA model to explore the relationships between $6$ common factors and $1777$ genes, conducting statistical inference on covariances between these common factors. These findings underscore the limitations of classical CFA models and standard computational packages in handling such high-dimensional data analyses.

In summary, SCFA models combine the benefits of conventional CFA models, including dimension reduction and flexibility in covariance matrix modeling for common factors, with the ability to adapt to data-driven covariance structures. As many high-dimensional datasets implicitly exhibit interconnected community structures, SCFA holds broad applicability. The software package is available at https://github.com/yiorfun/SCFA.

References

  • Anderson (2003) Anderson, T. (2003). An Introduction to Multivariate Statistical Analysis. John Wiley and Sons.
  • Arakaki et al. (2018) Arakaki, A. K. S., Pan, W.-A. & Trejo, J. (2018). GPCRs in cancer: Protease-activated receptors, endocytic adaptors and signaling. International Journal of Molecular Sciences 19.
  • Bai & Li (2012) Bai, J. & Li, K. (2012). Statistical analysis of factor models of high dimension. The Annals of Statistics 40, 436 – 465.
  • Basilevsky (2009) Basilevsky, A. T. (2009). Statistical factor analysis and related methods: theory and applications. John Wiley & Sons.
  • Boker et al. (2023) Boker, S. M., Neale, M. C., Maes, H. H., Wilde, M. J., Spiegel, M., Brick, T. R., Estabrook, R., Bates, T. C., Mehta, P., von Oertzen, T., Gore, R. J., Hunter, M. D., Hackett, D. C., Karch, J., Brandmaier, A. M., Pritikin, J. N., Zahery, M., Kirkpatrick, R. M., Wang, Y., Goodrich, B., Driver, C., Massachusetts Institute of Technology, Johnson, S. G., Association for Computing Machinery, Kraft, D., Wilhelm, S., Medland, S., Falk, C. F., Keller, M., G, M. B., The Regents of the University of California, Ingber, L., Voon, W. S., Palacios, J., Yang, J., Guennebaud, G. & Niesen, J. (2023). Extended Structural Equation Modelling. R package version 2.21.8.
  • Brown (2015) Brown, T. A. (2015). Confirmatory factor analysis for applied research. Guilford publications.
  • Browne (2001) Browne, M. W. (2001). An overview of analytic rotation in exploratory factor analysis. Multivariate Behavioral Research 36, 111–150.
  • Carlson & Mulaik (1993) Carlson, M. & Mulaik, S. A. (1993). Trait ratings from descriptions of behavior as mediated by components of meaning. Multivariate Behavioral Research 28, 111–159.
  • Chen et al. (2018) Chen, S., Kang, J., Xing, Y., Zhao, Y. & Milton, D. K. (2018). Estimating large covariance matrix with network topology for high-dimensional biomedical data. Computational Statistics and Data Analysis 127, 82–95.
  • Chen et al. (2023) Chen, S., Zhang, Y., Wu, Q., Bi, C., Kochunov, P. & Hong, L. E. (2023). Identifying covariate-related subnetworks for whole-brain connectome analysis. Biostatistics , kxad007.
  • Chiappelli et al. (2019) Chiappelli, J., Rowland, L. M., Wijtenburg, S. A., Chen, H., Maudsley, A. A., Sheriff, S., Chen, S., Savransky, A., Marshall, W., Ryan, M. C., Bruce, H. A., Shuldiner, A. R., Mitchell, B. D., Kochunov, P. & Hong, L. E. (2019). Cardiovascular risks impact human brain N-acetylaspartate in regionally specific patterns. Proceedings of the National Academy of Sciences 116, 25243–25249.
  • Colizza et al. (2006) Colizza, V., Flammini, A., Serrano, M. A. & Vespignani, A. (2006). Detecting rich-club ordering in complex networks. Nature physics 2, 110–115.
  • Díaz-Montero et al. (2020) Díaz-Montero, C. M., Rini, B. I. & Finke, J. H. (2020). The immunology of renal cell carcinoma. Nature Reviews Nephrology 16, 721–735.
  • Fan & Han (2017) Fan, J. & Han, X. (2017). Estimation of the false discovery proportion with unknown dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79, 1143–1164.
  • Fan et al. (2020) Fan, J., Li, R., Zhang, C.-H. & Zou, H. (2020). Statistical foundations of data science. Chapman and Hall/CRC.
  • Fan et al. (2013) Fan, J., Liao, Y. & Mincheva, M. (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 75, 603–680.
  • Fan et al. (2021) Fan, J., Wang, K., Zhong, Y. & Zhu, Z. (2021). Robust high dimensional factor models with applications to statistical machine learning. Statistical science: a review journal of the Institute of Mathematical Statistics 36, 303.
  • Fortunato (2010) Fortunato, S. (2010). Community detection in graphs. Physics Reports 486, 75–174.
  • Fox (2006) Fox, J. (2006). Teacher’s corner: structural equation modeling with the sem package in R. Structural Equation Modeling 13, 465–486.
  • Fox et al. (2022) Fox, J., Nie, Z., Byrnes, J., Culbertson, M., DebRoy, S., Friendly, M., Goodrich, B., Jones, R. H., Kramer, A., Monette, G. & Novomestky, F. (2022). Structural Equation Models. R package version 3.1-15.
  • Friguet et al. (2009) Friguet, C., Kloareg, M. & Causeur, D. (2009). A factor model approach to multiple testing under dependence. Journal of the American Statistical Association 104, 1406–1415.
  • Gana & Broc (2019) Gana, K. & Broc, G. (2019). Structural equation modeling with lavaan. John Wiley & Sons.
  • Girvan & Newman (2002) Girvan, M. & Newman, M. E. J. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99, 7821–7826.
  • Huttlin et al. (2017) Huttlin, E. L., Bruckner, R. J., Paulo, J. A., Cannon, J. R., Ting, L., Baltier, K., Colby, G., Gebreab, F., Gygi, M. P., Parzen, H. et al. (2017). Architecture of the human interactome defines protein communities and disease networks. Nature 545, 505–509.
  • ISGlobal (2021) ISGlobal (2021). Barcelona Institute for Global Health. https://www.isglobal.org/en. Accessed: 2022-07-30.
  • Jackson et al. (2009) Jackson, D. L., Gillaspy Jr, J. A. & Purc-Stephenson, R. (2009). Reporting practices in confirmatory factor analysis: an overview and some recommendations. Psychological methods 14, 6.
  • Jöreskog (1969) Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika 34, 183–202.
  • Lawley (1958) Lawley, D. (1958). Estimation in factor analysis under various initial assumptions. British journal of statistical Psychology 11, 1–12.
  • Lei & Rinaldo (2015) Lei, J. & Rinaldo, A. (2015). Consistency of spectral clustering in stochastic block models. The Annals of Statistics 43, 215 – 237.
  • Levine et al. (2015) Levine, J. H., Simonds, E. F., Bendall, S. C., Davis, K. L., El-ad, D. A., Tadmor, M. D., Litvin, O., Fienberg, H. G., Jager, A., Zunder, E. R. et al. (2015). Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162, 184–197.
  • Li et al. (2022) Li, T., Lei, L., Bhattacharyya, S., den Berge, K. V., Sarkar, P., Bickel, P. J. & Levina, E. (2022). Hierarchical community detection by recursive partitioning. Journal of the American Statistical Association 117, 951–968.
  • Neale et al. (2016) Neale, M. C., Hunter, M. D., Pritikin, J. N., Zahery, M., Brick, T. R., Kirkpatrick, R. M., Estabrook, R., Bates, T. C., Maes, H. H. & Boker, S. M. (2016). OpenMx 2.0: Extended structural equation and statistical modeling. Psychometrika 81, 535–549.
  • Newman & Girvan (2004) Newman, M. E. J. & Girvan, M. (2004). Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113.
  • Oberski (2014) Oberski, D. (2014). lavaan.survey: An R package for complex survey analysis of structural equation models. Journal of Statistical Software 57, 1–27.
  • Perrot-Dockès et al. (2022) Perrot-Dockès, M., Lévy-Leduc, C. & Rajjou, L. (2022). Estimation of large block structured covariance matrices: Application to ‘multi-omic’ approaches to study seed quality. Journal of the Royal Statistical Society: Series C (Applied Statistics) 71, 119–147.
  • Ritchie et al. (2023) Ritchie, S. C., Surendran, P., Karthikeyan, S., Lambert, S. A., Bolton, T., Pennells, L., Danesh, J., Di Angelantonio, E., Butterworth, A. S. & Inouye, M. (2023). Quality control and removal of technical variation of NMR metabolic biomarker data in ~120,000 UK Biobank participants. Scientific Data 10, 64.
  • Rosseel (2012) Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software 48, 1–36.
  • Rosseel et al. (2023) Rosseel, Y., Jorgensen, T. D., Rockwood, N., Oberski, D., Byrnes, J., Vanbrabant, L., Savalei, V., Merkle, E., Hallquist, M., Rhemtulla, M., Katsikatsou, M., Barendse, M., Scharf, F. & Du, H. (2023). Latent Variable Analysis. R package version 0.6-15.
  • Schaub et al. (2023) Schaub, M. T., Li, J. & Peel, L. (2023). Hierarchical community structure in networks. Phys. Rev. E 107, 054305.
  • Schreiber et al. (2006) Schreiber, J. B., Nora, A., Stage, F. K., Barlow, E. A. & King, J. (2006). Reporting structural equation modeling and confirmatory factor analysis results: A review. The Journal of Educational Research 99, 323–338.
  • Simpson et al. (2013) Simpson, S. L., Bowman, F. D. & Laurienti, P. J. (2013). Analyzing complex functional brain networks: Fusing statistics and network science to understand the brain. Statistics Surveys 7, 1 – 36.
  • Tomczak et al. (2015) Tomczak, K., Czerwińska, P. & Wiznerowicz, M. (2015). Review The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemporary Oncology/Współczesna Onkologia 2015, 68–77.
  • Wallace (2012) Wallace, D. C. (2012). Mitochondria and cancer. Nature Reviews Cancer 12, 685–698.
  • Wang et al. (2020) Wang, Z., Liang, Y. & Ji, P. (2020). Spectral algorithms for community detection in directed networks. The Journal of Machine Learning Research 21, 6101–6145.
  • Wu et al. (2021) Wu, Q., Ma, T., Liu, Q., Milton, D. K., Zhang, Y. & Chen, S. (2021). ICN: extracting interconnected communities in gene co-expression networks. Bioinformatics 37, 1997–2003.
  • Yang et al. (2024) Yang, Y., Chen, C. & Chen, S. (2024). Covariance matrix estimation for high-throughput biomedical data with interconnected communities. The American Statistician In press.
  • Zitnik et al. (2018) Zitnik, M., Sosič, R. & Leskovec, J. (2018). Prioritizing network communities. Nature Communications 9, 2544.