Python Packages For Exploratory Factor Analysis
To cite this article: Isaiah Persson & Jam Khojasteh (2021) Python Packages for Exploratory
Factor Analysis, Structural Equation Modeling: A Multidisciplinary Journal, 28:6, 983-988, DOI:
10.1080/10705511.2021.1910037
SOFTWARE REVIEW
ABSTRACT
Exploratory Factor Analysis (EFA) is a widely used statistical technique for reducing data dimensionality and representing latent constructs via observed variables. Different software packages offer toolsets for performing this analysis. While Python’s statistical computing ecosystem is less developed than that of R, it is growing in popularity as a platform for data analysis and now offers several packages that perform EFA. This article reviews EFA modules in the statsmodels, FactorAnalyzer, and scikit-learn Python packages. These packages are discussed with regard to official documentation, features, and performance on an applied example.
KEYWORDS
Exploratory factor analysis (EFA); Python; statsmodels; FactorAnalyzer; scikit-learn
Introduction
Factor analysis is commonly used for data reduction in academic fields of educational measurement and psychology to describe constructs that cannot be directly observed (e.g., intelligence and happiness), and it is also used in fields such as marketing research to measure customer attitudes and other industry-relevant latent variables (B2B International, 2021; Costello & Osborne, 2005; Pohlmann, 2004; Watkins, 2018). Specifically, exploratory factor analysis (EFA) is a common way to model observed items (i.e., variables) in terms of a smaller number of unobserved factors (i.e., latent constructs) (Fabrigar et al., 1999; Watkins, 2018). This procedure essentially expresses each observed value as a linear combination of different factors plus error (Fabrigar et al., 1999; Preacher et al., 2013; Watkins, 2018). Each item’s observed variance is partitioned into communality, variance that is shared with other items and explained by underlying factors, and uniqueness, also known as error, noise, or unexplained variance.
Python has become an important language within the data analytic community (Ayer et al., 2014; Bajuk, 2019). While the R programming language has dominated statistical computing in academic research due to its well-developed ecosystem of specialized packages, Python has emerged as an alternative language that offers versatility and integrates well with other applications (Bajuk, 2019; Ozgur et al., 2017). According to the major software development and version control site GitHub, Python is the second most popular language for software development and the preferred language for developing machine learning applications among its user-base (Elliot, 2019; GitHub, 2020). In addition, many users regard Python as one of the simplest programming languages to learn (Ayer et al., 2014). Given its pervasive use and ease of integration, researchers may benefit from familiarizing themselves with Python. Using this language for analyses will make academic research more accessible to application developers in industry, thus enhancing the probability for collaboration and exchange across domains. For those who wish to use Python for EFA, there are a few packages available. While multiple publications have discussed conducting EFA with R packages, it seems that there has been no such endeavor regarding Python packages (Kabacoff, 2011, pp. 342–351; Luo et al., 2019; Mair, 2018, pp. 23–34). This article attempts to close this gap by reviewing the statsmodels (version 0.12.2), FactorAnalyzer (version 0.3.2), and scikit-learn (version 0.24.2) packages, with respect to their documentation, features, and performance using a sample dataset (Biggs & Madnani, 2019; Pedregosa et al., 2011; Perktold et al., 2010).
Python packages and review framework
Currently, at least three Python packages (i.e., statsmodels, FactorAnalyzer, and scikit-learn) offer modules for conducting exploratory factor analyses (Biggs & Madnani, 2019; Pedregosa et al., 2011; Perktold et al., 2010). To start, this article reviews each package’s official documentation for overall clarity and comprehensiveness and discusses the availability of informal resources on platforms such as blogs and forums. Then, the packages are compared based on their documented features, after which they are tested by conducting an EFA with each package on a sample dataset. Finally, this article concludes by providing recommendations to users and package developers.
All analyses with Python (version 3.8.8) are run in Jupyter Notebook (version 6.3.0) within the Anaconda open-source toolkit (Anaconda Inc., 2020; Project Jupyter, 2020). Jupyter Notebook is an integrated development environment (IDE), similar to RStudio, which operates as a web application and allows users to seamlessly edit, run, and present code (Project Jupyter, 2020; RStudio Team, 2020). Necessary software packages for this paper’s analyses are retrieved and managed via Anaconda. For those who are unfamiliar with this toolkit, Anaconda “makes it easy to manage multiple data environments that can be maintained and run separately without interference from each other” (Anaconda Inc., 2020). Along with the three primary packages, users may also need to load
supporting packages, such as pandas, NumPy, SciPy and Matplotlib to manipulate and visualize data (Harris et al., 2020; Hunter, 2007; Krekel & Pytest-Dev Team, 2020; Smith, 2015; The Pandas Development Team, 2020; Virtanen et al., 2020). Each of the reviewed packages provides further information in its official documentation concerning software dependencies that are necessary to run the EFA modules. Figure 1 displays the code that loads the necessary primary and supporting packages for the analyses discussed in this article.1
Overview of Python packages with EFA capabilities
statsmodels is an expansive package in Python “that provides classes and functions for the estimation of many different statistical models” (Perktold et al., 2010). The package’s authors attempt to accommodate individuals who are familiar with programming in R, by allowing users to define model variables for many statistical functions and classes with R-style formulas. The “Getting Started” and “User Guide” sections of the statsmodels website provide an introduction to this and general guidance on how to use the package. The statsmodels documentation details the input parameters that one may specify for the class that estimates an EFA model along with ways to report and modify the results. The documentation provides a thorough outline of intended functionality and limitations. Unfortunately, there are no examples of the code being implemented on a dataset. This may hinder someone who is new to this package or to Python, resulting in a trial and error process.
FactorAnalyzer, as the name suggests, is a package developed by ETS solely for performing exploratory factor analysis and confirmatory factor analysis (CFA) (Biggs & Madnani, 2019). The official documentation provides a clear and concise explanation of factor analysis and its application to modeling and measuring latent variables via observed variables. This is followed by instructions on how to use each of the package’s modules for EFA and CFA. The package’s documentation explains its EFA and CFA toolset in terms of psychometric application and provides a conceptual overview that avoids mathematical terminology and equations.
scikit-learn is one of the most comprehensive and influential machine learning packages in the Python programming ecosystem. Along with the other two packages, it provides a purpose-built class that performs EFA. The package’s “User Guide” provides a conceptual overview of EFA that focuses on mathematical descriptions, presenting it as an alternative to principal components analysis (PCA) for matrix factorization. The code documentation outlines how to implement the EFA class; however, many of the parameters and attributes are described with machine learning terminology that may not be familiar to users from behavioral and social sciences. The examples that the package uses, such as image processing, focus on predictive accuracy over interpretable model building. While informative, scikit-learn’s approach may seem less relatable and even a bit inaccessible to users from backgrounds other than machine learning.
Informal documentation and help from user-base
There are a number of blogs and web tutorials that demonstrate how to perform EFA with Python. Most of these utilize the FactorAnalyzer package and, to a lesser extent, scikit-learn. A cursory web search did not find any user examples of EFA performed with statsmodels. This may reflect the limited popularity of Python for statistical computing compared to R. Due to this relative lack of popularity, it is difficult to find user-generated solutions when dealing with implementation challenges.
Package features
Next, each package’s documented functionality is reviewed, by comparing input data requirements (e.g., raw datasets or correlation matrices), tests of assumptions, estimation methods, tools for choosing factors (e.g., scree plots and eigenvalue tables), rotation options, and reporting formats.
Specifying an EFA model
Each of the three packages provides a purpose-built class for specifying parameters and estimating an EFA model (see
Figure 2). FactorAnalyzer and scikit-learn allow users to retrieve results directly from the fitted class, by calling attributes and methods that are associated with it. On the other hand, statsmodels requires users to then specify a separate class that uses the fitted model as its only parameter, to retrieve results. Figure 2 displays examples of code from each package for fitting an EFA model to data.
Figure 1. This screenshot shows all the Python packages and modules for performing EFA in this article.
1 The Python and R code that support the findings of this study are openly available on the Open Science Framework website (DOI: 10.17605/OSF.IO/XPMUZ).
Figure 2. This screenshot shows the code used to specify and fit an EFA model using maximum likelihood estimation in statsmodels, FactorAnalyzer, and scikit-learn.
Input data
All three packages allow users to conduct an EFA on a raw dataset with observations organized by row and items (i.e., variables) by column. Alternatively, FactorAnalyzer and statsmodels also give users the option of using a correlation matrix as input data. As shown in Figure 2, FactorAnalyzer and scikit-learn require users to enter the dataset as a parameter to the “.fit()” method after specifying the other class parameters, whereas statsmodels requires the dataset to be specified as a parameter within the class.
Testing assumptions
Before starting an EFA, it is necessary to test basic assumptions about the data. One should evaluate measures of normality (i.e., skewness and kurtosis), the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy, and Bartlett’s test of sphericity. Unfortunately, none of the packages provides a comprehensive set of functions or classes to test assumptions for EFA. While statsmodels provides a class for users to calculate descriptive statistics, such as skewness and kurtosis, an error message is generated when executing the code and little guidance is found from the official documentation or from searching user platforms. Neither FactorAnalyzer nor scikit-learn offers the option to generate descriptive statistics. Rather, users must look to other packages, such as SciPy or pandas, to obtain these figures. SciPy’s kurtosis and skewness functions are clear and easy to implement.
FactorAnalyzer uniquely provides functions to compute Bartlett’s test of sphericity and the KMO test for sampling adequacy. After much searching, it appears that this is the only Python package that provides a built-in approach to calculate both test statistics. FactorAnalyzer also provides a built-in attribute to the “FactorAnalyzer()” class that computes a correlation matrix for the original data, which can be accessed by appending “.corr_” to the fitted class instance. Alternatively, one can simply call the pandas “corr()” method on the data frame being analyzed, to generate the data correlation matrix. Figure 3 demonstrates how to test the assumptions using SciPy, FactorAnalyzer and pandas.
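The normality screens and the correlation matrix described in this section can be sketched with SciPy and pandas alone. The block below is an illustration under stated assumptions rather than the article's own code: the data frame `items` is simulated (it is not the dataset used in the review), and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated stand-in for an item-response data frame (hypothetical data):
# 200 observations (rows) on 4 items (columns) that share one common factor.
rng = np.random.default_rng(0)
common = rng.normal(size=(200, 1))
items = pd.DataFrame(
    0.7 * common + 0.5 * rng.normal(size=(200, 4)),
    columns=["item1", "item2", "item3", "item4"],
)

# Univariate normality screens, one value per item.
skewness = stats.skew(items.to_numpy(), axis=0)
# SciPy reports excess kurtosis by default (a normal distribution scores 0).
excess_kurtosis = stats.kurtosis(items.to_numpy(), axis=0)

# Item correlation matrix computed by pandas.
corr = items.corr()

summary = pd.DataFrame(
    {"skew": skewness, "excess_kurtosis": excess_kurtosis},
    index=items.columns,
)
print(summary.round(3))
print(corr.round(3))
```

Values of skewness and excess kurtosis near zero are consistent with approximate normality; which cutoffs to apply is a judgment the analyst has to make for the data at hand.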
Figure 3. This screenshot shows the Python functions used to calculate a correlation matrix, skewness, kurtosis, Bartlett’s test of sphericity, and the KMO measure of
sampling adequacy.
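FactorAnalyzer exposes Bartlett's test and the KMO measure through its own functions (`calculate_bartlett_sphericity` and `calculate_kmo`). As a package-independent cross-check, both statistics can also be computed from their standard definitions with NumPy and SciPy. The following is a minimal sketch on simulated data, not the code shown in Figure 3; the helper names `bartlett_sphericity` and `kmo_overall` are hypothetical.

```python
import numpy as np
from scipy import stats


def bartlett_sphericity(data):
    """Bartlett's test of the hypothesis that the correlation matrix is an identity matrix."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    # log-determinant of the correlation matrix (0 for an identity matrix)
    _, logdet = np.linalg.slogdet(corr)
    chi_square = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) / 2
    p_value = stats.chi2.sf(chi_square, dof)
    return chi_square, p_value


def kmo_overall(data):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlations between item pairs, controlling for all other items.
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diag] ** 2)
    q2 = np.sum(partial[off_diag] ** 2)
    return r2 / (r2 + q2)


# Simulated data (hypothetical): six items driven by one common factor,
# and six mutually independent items for comparison.
rng = np.random.default_rng(1)
common = rng.normal(size=(300, 1))
correlated = 0.7 * common + 0.5 * rng.normal(size=(300, 6))
independent = rng.normal(size=(300, 6))

chi_corr, p_corr = bartlett_sphericity(correlated)
chi_ind, p_ind = bartlett_sphericity(independent)
kmo_corr = kmo_overall(correlated)

print(f"correlated items:  chi2={chi_corr:.1f}, p={p_corr:.3g}, KMO={kmo_corr:.3f}")
print(f"independent items: chi2={chi_ind:.1f}, p={p_ind:.3g}")
```

A significant Bartlett result together with a KMO value well above 0.5 suggests that the items share enough variance for factoring to be worthwhile.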
Figure 4. This screenshot shows the code from statsmodels that loads the “bfi” dataset, on which the three Python packages are tested in relation to R’s psych package.
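Because the figures in this review are screenshots, the fitting pattern may be easier to follow as runnable code. The sketch below uses only scikit-learn and follows the convention described under "Input data": the class is configured first, and the dataset is passed to `.fit()` afterwards (statsmodels, by contrast, takes the data when the class is created). It is an illustration under stated assumptions, not the article's analysis: the data are simulated with a known two-factor structure instead of the “bfi” dataset, and the `rotation` argument assumes scikit-learn 0.24 or later.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Simulated data (hypothetical): 500 observations on 6 items, where
# items 1-3 load on the first factor and items 4-6 on the second.
rng = np.random.default_rng(42)
n_obs = 500
true_loadings = np.array([
    [0.8, 0.0],
    [0.7, 0.0],
    [0.6, 0.0],
    [0.0, 0.8],
    [0.0, 0.7],
    [0.0, 0.6],
])
factors = rng.normal(size=(n_obs, 2))
unique_sd = np.sqrt(1 - (true_loadings ** 2).sum(axis=1))
X = factors @ true_loadings.T + rng.normal(size=(n_obs, 6)) * unique_sd

# Standardize so that loadings are on the correlation metric.
X_std = StandardScaler().fit_transform(X)

# Configure the class first, then pass the data to .fit().
fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0)
fa.fit(X_std)

# scikit-learn stores loadings as (factors x items); transpose for the usual layout.
loadings = fa.components_.T
uniqueness = fa.noise_variance_
communality = (loadings ** 2).sum(axis=1)

print(np.round(loadings, 2))
print(np.round(communality, 2))
print(np.round(uniqueness, 2))
```

The communality computed here is the sum of squared loadings per item, which matches the partition of item variance into communality and uniqueness described in the Introduction.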
For example, both statsmodels and scikit-learn return loading matrices that differ significantly from each other and from the psych package’s output even though the same estimation and rotation methods are used. Likewise, FactorAnalyzer’s “.psi_” attribute for reporting the factor correlation matrix returns an error message and statsmodels’ scree plot function visualizes the wrong set of eigenvalues. Such methods should be tested to make sure they return values that align with results from established programs such as R’s psych package. Hopefully, development and quality control will accelerate as more users integrate these packages into their data analytic projects and contribute insights from their experiences.
Disclosure statement
We have no conflicts of interest to report.
References
Anaconda Inc. (2020). Individual edition. Anaconda. Retrieved January 1, 2021, from https://www.anaconda.com/products/individual
Arel-Bundock, V. (2020). A collection of datasets originally distributed in various R packages. Rdatasets. Vincent Arel-Bundock. Retrieved January 1, 2021, from https://vincentarelbundock.github.io/Rdatasets/
Ayer, V., Miguez, S., & Toby, B. (2014). Why scientists should learn to program in Python. Powder Diffraction, 29, S48–S64. https://doi.org/10.1017/S0885715614000931
B2B International. (2021). Factor analysis in marketing research. B2B International. Retrieved February 1, 2021, from https://www.b2binternational.com/research/methods/statistical-techniques/factor-analysis/
Bajuk, L. (2019, December 17). R vs. Python: What’s the best language for data science? R Studio Blog. Retrieved February 1, 2021, from https://blog.rstudio.com/2019/12/17/r-vs-python-what-s-the-best-for-language-for-data-science/
Biggs, J., & Madnani, N. (2019). Factor_analyzer documentation. Release 0.3.1. Jeremy Biggs. Retrieved February 2, 2021, from http://factor-analyzer.readthedocs.io/en/latest/index.html
Costello, A. B., & Osborne, J. (2005). Best practices in exploratory factor analysis: Four recommendations for getting the most from your analysis. Practical Assessment, Research, and Evaluation, 10, Article 7. https://doi.org/10.7275/jyj1-4868
Elliot, T. (2019, January 24). The state of the octoverse: Machine learning. The GitHub Blog. Retrieved February 1, 2020, from https://github.blog/2019-01-24-the-state-of-the-octoverse-machine-learning/
Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the use of exploratory factor analysis in psychological research. Psychological Methods, 4, 272–299. https://doi.org/10.1037/1082-989X.4.3.272
GitHub. (2020). The 2020 state of the octoverse. GitHub. Retrieved February 1, 2021, from https://octoverse.github.com
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., Del Río, J. F., Wiebe, M., Peterson, P., . . . Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585, 357–362. https://doi.org/10.1038/s41586-020-2649-2
Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9, 90–95. https://doi.org/10.1109/MCSE.2007.55
Kabacoff, R. (2011). R in action. Shelter Island, NY: Manning Publications.
Krekel, H., & Pytest-Dev Team. (2020). Full pytest documentation. pytest. Retrieved February 2, 2021, from https://docs.pytest.org/en/stable/contents.html
Luo, L., Arizmendi, C., & Gates, K. M. (2019). Exploratory factor analysis (EFA) programs in R. Structural Equation Modeling, 26, 819–826. https://doi.org/10.1080/10705511.2019.1615835
Mair, P. (2018). Modern psychometrics with R. Springer. https://doi.org/10.1007/978-3-319-93177-7
Navlani, A. (2019, April). Introduction to factor analysis in Python. datacamp. DataCamp, Inc. Retrieved February 2, 2021, from https://www.datacamp.com/community/tutorials/introduction-factor-analysis
Ozgur, C. C., Rogers, G., Hughes, Z., & Myer-Tyson, E. (2017). MatLab vs. Python vs. R. Journal of Data Science, 15, 355–371. https://doi.org/10.6339/JDS.201707_15(3).0001
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830. http://jmlr.org/papers/v12/pedregosa11a.html
Pohlmann, J. T. (2004). Use and interpretation of factor analysis in the Journal of Educational Research: 1992–2002. The Journal of Educational Research, 98, 14–23. https://doi.org/10.3200/JOER.98.1.14-23
Preacher, K. J., Zhang, G., Kim, C., & Mels, G. (2013). Choosing the optimal number of factors in exploratory factor analysis: A model selection perspective. Multivariate Behavioral Research, 48, 28–56. https://doi.org/10.1080/00273171.2012.710386
Project Jupyter. (2020, November 18). Home. Jupyter. Project Jupyter. Retrieved January 1, 2021, from https://jupyter.org
Revelle, W. (2020). psych: Procedures for Psychological, Psychometric, and Personality Research. R package version 2.0.12. Northwestern University, Evanston, IL. http://cran.r-project.org/package=psych
RStudio Team. (2020). RStudio: Integrated development for R. RStudio, PBC. Retrieved February 19, 2021, from http://www.rstudio.com/
Seabold, S., & Perktold, J. (2010). Statsmodels: Econometric and statistical modeling with python. Proceedings of the 9th Python in Science Conference. https://doi.org/10.25080/Majora-92bf1922-011
Smith, N. (2015). patsy – Describing statistical models in Python. Authored/published by Nathaniel J. Smith. Retrieved February 2, 2021, from https://patsy.readthedocs.io/en/latest/index.html
St-Amant, F. (2020, May 13). Factor analysis tutorial. Towards data science. Retrieved February 2, 2021, from https://towardsdatascience.com/factor-analysis-a-complete-tutorial-1b7621890e42
The Pandas Development Team. (2020, February). pandas-dev/pandas: Pandas. Zenodo. https://doi.org/10.5281/zenodo.3509134
Toth, G. (2020). Factor analysis. DataSklr. Retrieved February 2, 2021, from https://www.datasklr.com/principal-component-analysis-and-factor-analysis/factor-analysis
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., . . . van Mulbregt, P. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17, 261–272. https://doi.org/10.1038/s41592-019-0686-2
Watkins, M. W. (2018). Exploratory factor analysis: A guide to best practice. Journal of Black Psychology, 44, 219–246. https://doi.org/10.1177/0095798418771807