
Tutorial

Advances in Methods and Practices in Psychological Science
2020, Vol. 3(4), 484–501
© The Author(s) 2020
Article reuse guidelines: sagepub.com/journals-permissions
DOI: 10.1177/2515245920951747
www.psychologicalscience.org/AMPPS

Your Coefficient Alpha Is Probably Wrong, but Which Coefficient Omega Is Right? A Tutorial on Using R to Obtain Better Reliability Estimates

David B. Flora
Department of Psychology, York University

Abstract
Measurement quality has recently been highlighted as an important concern for advancing a cumulative psychological
science. An implication is that researchers should move beyond mechanistically reporting coefficient alpha toward more
carefully assessing the internal structure and reliability of multi-item scales. Yet a researcher may be discouraged upon
discovering that a prominent alternative to alpha, namely, coefficient omega, can be calculated in a variety of ways.
In this Tutorial, I alleviate this potential confusion by describing alternative forms of omega and providing guidelines
for choosing an appropriate omega estimate pertaining to the measurement of a target construct represented with a
confirmatory factor analysis model. Several applied examples demonstrate how to compute different forms of omega in R.

Keywords
alpha, psychometrics, reliability, R, confirmatory factor analysis, assessment, omega, measurement, open data, open
materials

Received 8/8/19; Revision accepted 6/11/20

Measurement is an important aspect of the replication crisis facing psychology and related fields (Fried & Flake, 2018; Loken & Gelman, 2017), and it is well known that measurement error produces biased estimates of the associations among constructs that observed variables represent (e.g., Cole & Preacher, 2014). Yet researchers often present very little reliability and validity evidence for their variables, frequently reporting only coefficient alpha to convey the psychometric quality of tests (Flake, Pek, & Hehman, 2017). Furthermore, psychometricians have established that alpha is based on a highly restricted (and thus unrealistic) psychometric model and consequently can provide misleading reliability estimates (e.g., Sijtsma, 2009). The persistent popularity of alpha suggests that applied researchers are not aware of its limitations or alternative reliability estimates.

Although many reliability estimates have been presented in the literature, distinguishing among them and their software implementations can be confusing. In this Tutorial, I describe the calculation of different forms of coefficient omega (McDonald, 1999), which are reliability estimates calculated from parameter estimates of factor-analytic models specified to represent associations between a test's items and the test's target construct. Thus, being informed of a test's internal factor structure is inherent in choosing an appropriate omega estimate. The main purposes of this Tutorial are to clarify distinctions among different omega estimates and to demonstrate how they can be calculated using routines readily available in R (R Core Team, 2018). Throughout, I use example data to illustrate these reliability estimates.

Corresponding Author:
David B. Flora, Department of Psychology, 101 Behavioural Sciences Building, York University, 4700 Keele St., Toronto, ON M3J 1P3, Canada
E-mail: [email protected]

Disclosures

The complete R code and output (as a single .rmd file and resulting .pdf file) for the examples presented, data

files, and a supplementary document describing additional analyses are available on OSF, at https://osf.io/m94rp/.

What Is Reliability?

Observed scores on any given psychological test or scale are determined by a combination of systematic (signal) and random (noise) influences. Reliability is defined as a population-based quantification of measurement precision (e.g., Mellenbergh, 1996) as a function of the signal-to-noise ratio. Measurement error, or unreliability, produces biased estimates of effects meant to represent true associations among constructs (Bollen, 1989; Cole & Preacher, 2014), and measurement error is a culprit in the replication crisis (Loken & Gelman, 2017). Thus, using tests with maximally reliable scores and using statistical methods to account for measurement error (e.g., Savalei, 2019) can help psychology progress as a replicable science; calculating and reporting accurate reliability estimates is integral to this goal.

Although the reliability concept pertains to any empirical measurement, this Tutorial focuses on composite reliability, that is, the reliability of observed scores calculated as composites (i.e., the sum or mean) of individual test components. These individual components are most commonly items within a test or scale. A formal definition of composite reliability based on classical test theory (CTT; e.g., Lord & Novick, 1968) first posits that an observed score x for individual test taker i on item j equals the individual's true score t for that item plus an error score e:

xij = tij + eij.

Next, if Xi denotes an individual's observed total score, calculated by summing¹ the observed item scores (i.e., Xi = ∑j xij, with the sum taken over all J items), and if Ti denotes the individual's total true score, which is the sum of the unobserved true scores (Ti = ∑j tij), then the reliability ρX of total-score variable X is the proportion of total true-score variance relative to total observed variance:

ρX = σT² / σX².

Because true scores are unobserved, reliability cannot be calculated directly from this formula, which has led to the development of various approaches to estimating reliability, most prominently coefficient alpha (Cronbach, 1951). (See Revelle & Condon, 2019, for a review relating composite reliability, test-retest reliability, and interrater reliability to this formal variance-ratio definition of reliability.)

It is important to recognize that the CTT true score does not necessarily equate to a construct score (Borsboom, 2005). Thus, a true score may be determined by a construct that a test is designed to measure (the target construct) as well as by other systematic influences. Most often, researchers want to know how reliably a test measures the target construct itself, and for this reason it is important to establish the dimensionality, or internal structure, of a test before estimating reliability (Savalei & Reise, 2019). Factor analysis is commonly used to investigate and confirm the internal structure of a multi-item test, and as shown throughout this Tutorial, parameter estimates of factor-analytic models lead to reliability estimates representing how precisely test scores measure target constructs represented by the models' factors. In this framework, a one-factor model is adequate to explain the item-response data of a unidimensional test measuring a single target construct; conversely, poor fit of a one-factor model is evidence of multidimensionality. The formal definition of reliability from CTT can be adapted to this context so that reliability is the proportion of a scale score's variance explained by a target construct (Savalei & Reise, 2019). Therefore, it is crucial to determine how a test represents that construct with respect to its internal factor structure. If reliability is estimated using the parameters of an incorrect (i.e., misspecified) factor model, then the reliability estimate is likely to be biased with respect to the measurement of the target construct. The key idea to this Tutorial is that a reliability coefficient should estimate how well an observed test score measures a target construct, which does not necessarily correspond to how well the score captures replicable variation because some replicable variation may be irrelevant to the target construct; thus, it is critical for the target construct to be accurately represented in a factor model for the test.

Because the reliability estimates presented herein are calculated from factor-analytic models, familiarity with factor analysis, especially confirmatory factor analysis (CFA; a type of structural equation modeling, or SEM), is beneficial. I emphasize the use of CFA over exploratory factor analysis (EFA) for reliability assessment because using CFA implies having strong, a priori hypotheses about the underlying causal associations between one or more target constructs (represented by the model's factors) and observed item scores. In the preliminary stages of scale development, EFA is valuable for uncovering systematic influences on item responses that might map onto hypothesized constructs, whereas specification of a CFA model implies that the item-construct relations underpinning certain reliability estimates have been more well established (Flora & Flake, 2017; Zinbarg, Yovel, Revelle, & McDonald,

[Figure 1: a path diagram in which a single factor f (the Target Construct, drawn as an oval) points to 10 continuous observed item responses x1 through x10 (drawn as rectangles) via factor loadings λ1 through λ10; each item also receives its own error term, e1 through e10. Model equation: xj = λj f + ej.]

Fig. 1. One-factor model for a unidimensional test consisting of 10 continuously scored items. See the text for further explanation.
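The model equation in Figure 1 also pins down the covariance structure the model implies: with the factor variance fixed to 1, the covariance matrix of the items is λλ′ + diag(θ). As a quick check (a sketch in Python rather than R, with made-up loadings and error variances for a five-item version of the model), simulating item responses from the model equation reproduces that matrix up to sampling error:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical parameters: loadings and error variances for 5 items,
# factor variance fixed to 1
lam = np.array([0.7, 0.6, 0.8, 0.5, 0.7])
theta = np.array([0.51, 0.64, 0.36, 0.75, 0.51])

f = rng.normal(size=(n, 1))                       # factor scores f_i
e = rng.normal(size=(n, 5)) * np.sqrt(theta)      # error terms e_ij
x = f * lam + e                                   # x_j = lambda_j * f + e_j

sample_cov = np.cov(x, rowvar=False)
implied_cov = np.outer(lam, lam) + np.diag(theta)  # model-implied covariance

# The two matrices agree up to sampling error
print(np.max(np.abs(sample_cov - implied_cov)))
```

The off-diagonal entries of the implied matrix are products of loadings (e.g., λ1λ2 = .42 here), which is why the pattern of item covariances carries the information needed to estimate the loadings.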

2006). Establishing these item-construct associations is critical for reliability to be meaningfully estimated because choice of an appropriate reliability estimate depends on the interpretation of the final measurement model chosen for a test (Savalei & Reise, 2019). For readers unfamiliar with CFA, I present basic principles that should enable use of the procedures presented here, but I strongly encourage such readers to acquire further background knowledge (e.g., see Brown, 2015).

This Tutorial focuses on forms of coefficient omega that estimate how reliably a total score for a test measures a single construct that is common to all items in the test, even if the test is multidimensional (e.g., a test designed to produce a total score as well as subscale scores). First, I demonstrate reliability estimation for unidimensional tests represented by one-factor models. Often, a test is designed to measure a single construct, but a one-factor model does not adequately represent the test's internal structure. In this situation, reliability estimates based on a one-factor model are likely to be inaccurate, and instead reliability estimates should be based on a multidimensional (i.e., multifactor) model. In other situations, a test may be explicitly designed to measure multiple constructs (i.e., a test with subscales), but a meaningful total score is still of interest. Thus, after addressing the unidimensional case, I present omega estimates of total-score reliability for multidimensional tests; the online supplement addresses reliability assessment for subscale scores.

Reliability of Unidimensional Tests: Omega to Alpha

Figure 1 shows a path diagram of a one-factor model for a hypothetical 10-item test. Path diagrams represent latent variables, or factors, with ovals and represent observed variables (i.e., item-response variables in the present context) with rectangles. Linear associations are represented by straight, unidirectional arrows. For example, because the factor f and an error term e1 are the two entities with arrows pointing at item x1 in Figure 1, f and e1 are the two linear influences on x1. The arrow from f to x1 is labeled as λ1, which is a factor-loading parameter giving the strength of the association between f and x1. The effects implied by these two arrows in the path diagram combine to form the linear regression equation

x1 = λ1 f + e1,

which indicates that item x1 is a dependent variable regressed on the factor f as a single independent variable with slope coefficient λ1 and error e1. This one-factor model consists of an analogous equation for each observed item such that

xj = λj f + ej,

where xj is the jth item regressed on factor f with factor loading λj (the intercept term is omitted from this and other equations without loss of generality). This equation, in which item scores are influenced by a single factor but to varying degrees (i.e., λj varies across items), is known as the congeneric model in the psychometric literature. As shown later, when a model consists of more than one factor, this equation expands to a multiple regression equation with each xj simultaneously regressed on multiple factors.

Because scores on the factor f are unobserved, the λj factor-loading parameters cannot be estimated with the usual ordinary least squares method for linear regression. Instead, the factor loadings (and other model parameters) are estimated as a function of a covariance (or correlation) matrix among the observed item scores, typically using a maximum likelihood (ML) function. In addition to factor loadings, other model parameters include the factor variance and the variances of the individual error terms. Because the factor is unobserved, its scale is arbitrary, and the model cannot be estimated (i.e., the model is not identified) unless a parameter is constrained to define the factor's scale. In all the examples in this Tutorial, I have set the scale of each factor by constraining its variance to be equal to 1 (which also serves to simplify equations for omega reliability estimates).² A variety of statistics is available to evaluate how well a CFA model fits the item-level data (and thereby evaluate the tenability of unidimensionality). In the examples presented, I report the root mean square error of approximation (RMSEA), comparative-fit index (CFI), and Tucker-Lewis index (TLI); smaller values (e.g., .08 or lower) of RMSEA are indicative of better model fit, whereas larger values (e.g., .90 or greater) of CFI and TLI indicate better model fit.

Coefficient omega

Numerous authors have shown that if the equation for the CTT true score is reexpressed as the one-factor model such that an individual's true score tij for item j is presumed to equal the product of the item's factor loading λj and the individual's factor score fi (i.e., tij = λj fi) and the factor variance is fixed to 1, then reliability is a function of the factor-loading parameters (e.g., Jöreskog, 1971; McDonald, 1999). Conceptually, because the factor loading quantifies the strength of the association between an item and a factor, the extent to which a set of items (as represented by their total score) reliably measures the factor is a function of the items' factor loadings. Therefore, the reliability of the total score on a unidimensional test can be estimated from parameter estimates of a one-factor model fitted to the item scores. I refer to this reliability estimate as ωu, and its formula is presented in Table 1: The numerator of ωu represents the amount of total-score variance explained by the common factor f as a function of the estimated factor loadings (i.e., the numerator estimates the σT² term in the population reliability formula given earlier), and the denominator, σ̂X², is the estimated variance of the observed total score. This formula gives a form of coefficient omega that is appropriate under the strong assumption that the one-factor model is correct, that is, the set of items is unidimensional, so that ωu represents the proportion of total-score variance that is due to the single factor.³ The subscript u indicates that this variant of omega is based on a unidimensional model; different forms of coefficient omega introduced later are distinguished by different subscripts.

The σ̂X² term in the denominator of ωu can be calculated either as the sample variance of the total score X or as the model-implied variance of X. Generally, this choice is unlikely to have a large effect on the magnitude of omega estimates if the estimated model is not badly misspecified (as evidenced by good model fit).⁴ Specifically, σ̂X² can be calculated as the sum of all elements in the variance-covariance matrix of item scores (which equals the sample variance of X) or as the sum of all elements in the model-implied covariance matrix among the observed items (i.e., the model-implied variance of X). Representing the total-score variance as a function of the entire model-implied covariance matrix incorporates any free covariances among the individual item-level error terms, ej. Thus, when free error covariances are explicitly specified, the model-implied total-score variance used to calculate ωu is a function of both error variances and the free error covariances. Although error covariances may represent replicable variation in observed test scores, these parameters are separate from variance due to the target construct represented by the common factor f, and thus error covariances contribute to the total observed variance (i.e., the denominator of ωu) but not to variance due to the factor (i.e., the numerator of ωu). The demonstration later in this discussion of ωu shows how to obtain an ωu estimate in R that accounts for the contribution of error

Table 1. Formulas for Coefficient Omega Estimates for Three Underlying Factor Models

(Each formula estimates the reliability of total score X as a measure of a single construct common to all items.)

1. One-factor model (unidimensional, congeneric model), continuous items
   Model equation: xj = λj f + ej
   Reliability estimate: ωu = (∑j λ̂j)² / σ̂X²

2. One-factor model (unidimensional, congeneric model), categorical items
   Model equations: xj = c if τjc < x*j ≤ τj,c+1, with x*j = λj f + ej
   Reliability estimate:
   ωu-cat = { ∑j ∑j′ ∑c ∑c′ [ Φ2(τ̂jc, τ̂j′c′, λ̂j λ̂j′) − Φ1(τ̂jc) Φ1(τ̂j′c′) ] } / σ̂X²
   (sums taken over items j, j′ = 1, …, J and thresholds c, c′ = 1, …, C − 1)

3. Bifactor model (hierarchical factor model), continuous items
   Model equation: xj = λjg g + ∑k λjk sk + ej
   Reliability estimate: ωh = (∑j λ̂jg)² / σ̂X²

Note: For the one-factor model, λj is the factor loading for generic item xj, f is an unobserved factor (or latent variable), and ej is the error term for item j. For categorical items, τjc refers to a threshold parameter used to link continuous latent-response variable x*j to observed ordered, categorical item-response variable xj; for all items, the minimum threshold = −∞, and the maximum threshold = ∞ (see Wirth & Edwards, 2007). For the bifactor model, λjg is the factor loading for item xj on general factor g; λjk is the factor loading for item xj on the kth specific factor sk (k = 1, …, K). Total score X = ∑j xj, and total-score variance σ̂X² may be estimated from either the model-implied variance of X or the observed sample variance of X. Φ1 is the univariate normal cumulative distribution function; Φ2 is the bivariate normal cumulative distribution function. Reliability equations assume that all factor variances are fixed to 1 for model identification.
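The continuous-item formulas in Table 1 are simple enough to evaluate directly. The sketch below (in Python rather than R, with entirely hypothetical loadings and error variances) computes ωu for a congeneric one-factor model and ωh for a bifactor model, using the model-implied total-score variance in each denominator:

```python
import numpy as np

# --- omega_u for a one-factor (congeneric) model, hypothetical estimates ---
lam = np.array([0.7, 0.6, 0.8, 0.5, 0.7])            # factor loadings
theta = np.array([0.51, 0.64, 0.36, 0.75, 0.51])     # error variances
var_X = np.outer(lam, lam).sum() + theta.sum()       # model-implied var of total X
omega_u = lam.sum() ** 2 / var_X

# --- omega_h for a bifactor model, hypothetical estimates ---
lam_g = np.array([0.6, 0.5, 0.7, 0.6, 0.5, 0.6])     # loadings on general factor g
lam_s = np.array([[0.4, 0.3, 0.4, 0.0, 0.0, 0.0],    # specific factor s1
                  [0.0, 0.0, 0.0, 0.5, 0.4, 0.3]])   # specific factor s2
theta_b = np.array([0.48, 0.66, 0.35, 0.39, 0.59, 0.55])
# Model-implied covariance (all factors uncorrelated, variances fixed to 1)
Sigma = np.outer(lam_g, lam_g) + lam_s.T @ lam_s + np.diag(theta_b)
omega_h = lam_g.sum() ** 2 / Sigma.sum()   # only the general factor in the numerator

print(round(omega_u, 3), round(omega_h, 3))
```

With these made-up values, ωu ≈ .797 and ωh ≈ .684; the specific factors and error terms inflate only the denominator of ωh, which is why it is lower even though every item loads on the general factor.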

covariances (when they are specified as free parameters rather than fixed to 0 by software defaults) to total variance by calculating the denominator of ωu as a function of the entire model-implied covariance matrix among items.

After the one-factor model is estimated, one can obtain a residual covariance (or correlation) matrix to diagnose the presence of large error covariance between any pair of items; a residual covariance between two items is the difference between their observed covariance and the corresponding model-implied covariance. Green and Yang (2009a) described scenarios in which a unidimensional test might produce correlated errors, although large residual correlations may also be evidence of multidimensionality. Large residual correlations diminish the fit of the one-factor model to data, which can prompt researchers to modify their hypothesized CFA model to explicitly specify free error-covariance parameters.⁵

Coefficient alpha

If the population factor loadings are equal across all items, then the j subscript can be dropped from λ in the equation for the one-factor model, which leads to what is known as the essential tau-equivalence model. If this model is correct, it can be shown that ωu is equivalent to alpha as long as the errors ej remain uncorrelated (see McDonald, 1999, or Green & Yang, 2009a, for details). Taken together, alpha is an estimate of total-score reliability for the measurement of a single construct common to all items in a test under the conditions that (a) a one-factor model is correct (i.e., the test is unidimensional), (b) the factor loadings are equal across all items (i.e., essential tau equivalence), and (c) the errors ej are independent across items. Because it is unlikely for all of these conditions to hold in any real situation, some researchers have called for abandoning alpha in favor of alternative reliability estimates (e.g., McNeish, 2018). However, Savalei and Reise (2019) contended that only severe violations of essential tau equivalence cause alpha to produce a notably biased reliability estimate, whereas multidimensionality and error correlation are more likely to be problematic for the interpretation of alpha as a reliability estimate for the measurement of a single target construct (Green & Yang, 2009a; Zinbarg et al., 2006), largely because alpha does not disentangle replicable variation due to a target construct from other sources of replicable variation. Overall, estimates of ωu are unbiased with varying factor loadings (i.e., violation of tau equivalence), a condition under which alpha underestimates population reliability (e.g., Zinbarg, Revelle, Yovel, & Li, 2005). Furthermore, Yang and Green (2010) showed that ωu estimates are largely robust when the estimated model contains more factors than the true model, even with samples as small as 50.

Trizano-Hermosilla and Alvarado (2016) found that increasing levels of skewness in the univariate item distributions produced increasingly negatively biased ωu estimates, especially for short tests (i.e., six items in their study), but that item skewness caused greater bias for alpha than for ωu.

Example calculation of ωu in R

To demonstrate calculation of ωu using R, I use data for five items measuring the personality trait openness as completed by 2,800 participants in the Synthetic Aperture Personality Assessment project (Revelle, Wilt, & Rosenthal, 2010). These data are available on this Tutorial's OSF site as well as in the psych package for R (Revelle, 2020). These items have a 6-point, ordered categorical Likert-type response scale and thus are treated as continuous variables for CFA and reliability estimation; as I describe later, when items have fewer than five response categories, it is important to explicitly account for their categorical nature. With these data, alpha equals .60 for the five-item openness test, but one should not justify using alpha by merely assuming that the test is unidimensional and that its scores conform to the essential tau-equivalence model; rather, the test's factor structure should be tested to determine an appropriate reliability estimate. Furthermore, it is important to note that I reverse-scored two negatively worded items (Items O2 and O5) before fitting one-factor CFA models to the data because not doing so would produce a mix of positive and negative factor loadings. Such a mix will reduce the numerator of ωu and lead to a misleadingly low reliability estimate; the absolute strength of the factor loadings is what should matter in representing how well the items measure the factor.

For all examples in this Tutorial, I used the R package lavaan (Version 0.6-5; Rosseel, 2012) to estimate CFA models. Once a CFA model was estimated, functions from the semTools package (Version 0.5-3; Jorgensen, Pornprasertmanit, Schoemann, & Rosseel, 2020) were used to obtain reliability estimates based on the CFA model object created by lavaan. I have assumed that readers have some elementary familiarity with R, including how to install packages and manage data sets (files on the OSF site provide complete code corresponding to all examples in this Tutorial). The lavaan package is loaded as follows:

> library(lavaan)

In this example, I fitted the one-factor model depicted in Figure 1 to the data for the five openness items using direct ML estimation to incorporate cases with incomplete data (see Brown, 2015). The one-factor model (here named mod1f) is specified using a plain text string:

> mod1f <- 'openness =~ O1 + O2 + O3 + O4 + O5'

This string indicates that a factor named openness is measured by the five observed variables listed on the right-hand side of the =~ operator; O1 through O5 are the names assigned to the openness items in the data frame (called open). Next, the model mod1f is estimated using the cfa function:

> fit1f <- cfa(mod1f, data=open, std.lv=T, missing='direct', estimator='MLR')

The std.lv option (referring to standardized latent variables) is set to TRUE so that the scale of the openness factor is set by fixing its variance equal to 1, leaving all factor loadings as free parameters. The missing='direct' option indicates that missing data are to be handled with direct ML estimation (see Brown, 2015), and the MLR estimator requests ML with robust model-fit statistics (see Savalei, 2018), as recommended by Rhemtulla, Brosseau-Liard, and Savalei (2012) to account for nonnormality inherent in the analysis of item-level data.

The results can be viewed using the summary function:

> summary(fit1f, fit.measures=T)

The fit.measures option is set to TRUE to request a set of commonly reported model-fit statistics, including the RMSEA, CFI, and TLI indices. The fit statistics reported under the Robust column in the summary indicate that this one-factor model has a marginal fit to the data; although the CFI of .94 suggests adequate fit, the TLI of .88 is lower than desired, and the RMSEA is somewhat high at 0.08; in combination, the indices cast doubt on the suitability of this one-factor model for subsequently obtaining a reliability estimate for the openness scale (see Brown, 2015, for discussion of assessing model fit using these and other statistics). Nonetheless, for illustrative reasons, I examine the factor-loading estimates and calculate ωu before revising the CFA model to obtain a more appropriate ωu estimate.

The factor-loading estimates of the fit1f model are listed under the Latent Variables heading in the results summary as follows:

Latent Variables: omega. For the current example, a percentile bootstrap


Estimate Std.Err z-value P(>|z|) 95% CI for ωu is obtained with
openness =~
O1 0.622 0.029 21.536 0.000 > library(MBESS)
O2 0.684 0.042 16.466 0.000 > ci.reliability(data=open,
O3 0.794 0.032 24.572 0.000 type="omega", interval.type="perc")
O4 0.361 0.031 11.779 0.000
O5 0.685 0.036 19.069 0.000 The output gives $est as .61, the point estimate of
ωu, along with a 95% CI from .58 ($ci.lower) to .63
These estimates of λ vary substantially (from .36 ($ci.upper), which represents a range of plausible
to.79), suggesting that tau equivalence is not tenable for values for the population ωu.
this openness test and thus that ωu is a more appropriate Although the factor-loading estimates varied across
reliability estimate than alpha.6 The ωu estimate can be items, the difference between ωu and alpha (.61 vs. .60)
obtained by executing the reliability function from is quite small. But, as mentioned earlier, when a one-
the semTools package (i.e., semTools::reliability) factor model includes error correlations among items,
on the fit1f one-factor model object: the difference between ωu and alpha can be substantial.
In this case, the fit1f model has a small but notable
> library(semTools) residual correlation (r = .10) between the O2 and O5
> reliability(fit1f) items,8 which is not surprising because these were the
two reverse-coded items. Thus, it may be important to
This call to reliability produces the following account for the corresponding error covariance in the
output: calculation of ωu. To include this term as a free param-
openness
alpha   0.5999111
omega   0.6079033
omega2  0.6079033
omega3  0.6078732
avevar  0.2461983

The results listed in the omega and omega2 rows give the ωu estimate in which the denominator equals the model-implied variance of the total score X, and the results listed in the omega3 row are based on a variation in which the denominator equals the observed, sample variance of X, as described earlier.7 The resulting ωu of .61 represents the proportion of total-score variance that is due to the single factor, that is, how reliably a total score for these five items measures an openness common factor.

As is any statistic calculated with sample data, ωu is a point estimate of a population parameter and is subject to sampling error; thus, its precision can be conveyed with a confidence interval (CI). Kelley and Pornprasertmanit (2016) reviewed different approaches to constructing CIs around omega estimates, ultimately recommending bootstrapping. The ci.reliability function from the MBESS package (Kelley, 2019; Version 4.6.0) can be used to obtain CIs for some forms of

eter, the one-factor model can be respecified as a new openness model, called mod1fR, as follows:

> mod1fR <- 'openness =~ O1 + O2 + O3 + O4 + O5
             O2 ~~ O5'

The second line uses the ~~ operator to specify the free error-covariance parameter. When this revised model is fitted to the data (see the OSF project page for complete code and output), the overall model fit is considerably improved (RMSEA = .04, CFI = .99, TLI = .97), and the estimate of the standardized error covariance (i.e., error correlation) equals .19. Next, applying the reliability function to the fitted model obtains a ωu estimate of .56, which is notably lower than the .61 obtained from the first one-factor model, in which all error covariances were assumed to equal zero.9 Therefore, this example demonstrates the potential importance of explicitly modeling interitem error covariances when fitting a factor model from which ωu is to be estimated. Because the overall model fit was improved and the error covariance is conceptually meaningful (because reverse-scored items often share excess covariance), the revised ωu of .56 is more trustworthy than the first ωu as an estimate of the composite reliability of the openness scale with respect to the measurement of a single openness factor.
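The arithmetic behind this drop can be sketched in a few lines of base R. The loadings and error covariance below are hypothetical illustrative values, not the fitted openness estimates, and the loadings are held fixed for simplicity (in a real refit the loadings change too); the sketch only shows that ωu is the ratio of factor-explained variance to model-implied total-score variance, so a freed, positive error covariance enlarges the denominator and pulls the estimate down:

```r
# Sketch of the omega-u formula (hypothetical standardized values,
# not the fitted openness estimates).
lambda <- c(0.55, 0.45, 0.60, 0.50, 0.40)  # one-factor loadings
theta  <- 1 - lambda^2                     # error variances (standardized)

# Model-implied total-score variance with all error covariances fixed at 0:
var_x0   <- sum(lambda)^2 + sum(theta)
omega_u0 <- sum(lambda)^2 / var_x0

# Freeing an error covariance (e.g., between two reverse-scored items)
# enlarges the implied total-score variance, so omega-u drops:
cov_e25  <- 0.15
var_x1   <- sum(lambda)^2 + sum(theta) + 2 * cov_e25
omega_u1 <- sum(lambda)^2 / var_x1

omega_u1 < omega_u0  # TRUE: modeling the error covariance lowers omega-u
```

This mirrors the .61 versus .56 contrast reported above: the factor-explained variance in the numerator is unchanged, but the implied total-score variance in the denominator grows by twice the error covariance.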
Which Omega Is Right? 491

Factor Analysis of Ordered, Categorical Items

Most often, item responses are scored with an ordered, categorical scale (e.g., Likert-type items scored with discrete integers) or a binary response scale (e.g., 1 = yes, 0 = no). These scales produce categorical data for which the traditional, linear factor-analytic model is technically incorrect; thus, fitting CFA models to the observed Pearson product-moment covariance (or correlation) matrix among item scores can produce inaccurate results (see Flora, LaBrish, & Chalmers, 2012). Just as binary or ordinal logistic regression is recommended over linear multiple regression for a categorical outcome, a categorical-variable method is recommended for factor analysis of categorical item scores: One can fit a CFA model to polychoric correlations, rather than product-moment covariances or correlations, to account for items' categorical nature (see Finney & DiStefano, 2013). A polychoric correlation measures the bivariate association between two binary or ordinally scaled variables, explicitly accounting for its nonlinear nature, and is thus appropriate for representing the associations among items eliciting ordered, categorical responses. As the number of item-response categories increases (e.g., to five or more response options), the item scores may behave more like continuous variables, and so CFA estimates obtained with product-moment covariances become closer to results obtained with polychoric correlations (Rhemtulla et al., 2012).

CFA models can be readily fitted to polychoric correlations with most SEM software, including the lavaan package in R. Doing so subtly revises the one-factor model given earlier to

x*j = λj f + ej,

such that now the factor loading λj represents the strength of the linear association between the factor f and a latent-response variable x*j, rather than the observed item-response variable xj. This latent-response variable is an unobserved, continuous variable representing a respondent's judgment about item content that determines the observed value of an ordinal item response as a function of threshold parameters (for further explanation, see Wirth & Edwards, 2007). Thus, the factor influences the observed, categorical variables indirectly via the latent-response variables. Figure 2 gives a path diagram of this one-factor model: The latent-response variables (depicted with small ovals) are linearly regressed on the single factor (as shown with straight arrows), whereas the jagged arrows represent thresholds linking the latent-response variables to the observed item-response variables (again depicted with rectangles); each jagged arrow represents one or more threshold parameters.10

Categorical omega

Because the factor loadings estimated from polychoric correlations represent the associations between the factor and latent-response variables rather than the observed item responses themselves, applying the ωu formula presented earlier in this context would give the proportion of variance explained by the factor relative to the variance of the sum of the latent-response variables (i.e., X* = Σ(j=1 to J) x*j) instead of the total variance of the sum of the observed item responses (i.e., X = Σ(j=1 to J) xj); that is, ωu would represent the reliability of a hypothetical, latent total score instead of the reliability of the actual observed total score. To remedy this issue, Green and Yang (2009b) developed an alternative reliability estimate that is rescaled into the observed total-score metric. This reliability estimate for unidimensional categorical items is referred to as ωu-cat; its formula is in Table 1. Although this formula is complex, its numerator, like that of ωu, expresses the amount of observed total-score variance explained by the single factor. This term is obtained from applying the univariate and bivariate normal cumulative distribution functions (denoted as Φ1 and Φ2, respectively) based on the factor loadings and thresholds obtained when a CFA model is estimated using polychoric correlations; in short, the Φ1 and Φ2 functions transform the explained variance in the latent-response-variable metric back into the observed total-score metric. As for ωu, the denominator of ωu-cat is the estimated variance of the observed total score (i.e., σ̂²X), which may be calculated as the sample variance of X or the model-implied variance of X according to the formula given in Green and Yang.

Yang and Green (2015) asserted that applied researchers should be more interested in the reliability of an observed total score X than in the reliability of a latent total score X* because observed scores, rather than latent scores, are most frequently used to differentiate among individuals in research and practice. Yang and Green established that, compared with ωu, ωu-cat produces more accurate reliability estimates for total score X with ordinal item-level data, especially when the univariate item-response distributions differ across items.

Example calculation of ωu-cat in R

To demonstrate estimation of ωu-cat in R, I use data from a subsample of 500 participants who completed the
492 Flora

[Figure 2 (path diagram): the single factor f (target construct) is linked by loadings λ1–λ10 to continuous latent-response variables x*1–x*10 (small ovals), each with an error term e1–e10; jagged arrows labeled "threshold parameters" link each latent-response variable x*j to its observed, categorical item response xj (rectangles). Model equations: x*j = λj f + ej; xj = c if τjc < x*j ≤ τjc+1.]

Fig. 2. One-factor model for a unidimensional test consisting of 10 ordinally scaled items. See the text for further explanation.
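The threshold mechanism in the model equations of Figure 2 can be illustrated numerically. Assuming a standard-normal latent response and a hypothetical set of thresholds (not estimates from any fitted model), the probability of each observed category is a difference of adjacent normal-CDF values:

```r
# Sketch of the threshold mechanism linking a latent response x* to an
# observed ordinal response x (hypothetical thresholds, standard-normal x*).
tau <- c(-Inf, -1.0, 0.0, 1.2, Inf)  # 3 finite thresholds -> 4 categories

# P(x = c) = Phi(tau_{c+1}) - Phi(tau_c) for a standard-normal x*:
probs <- diff(pnorm(tau))
round(probs, 3)        # probabilities of categories 1-4
sum(probs)             # 1: the categories partition the latent continuum

# Equivalently, simulate latent responses and categorize them:
set.seed(1)
xstar <- rnorm(1e5)
x <- cut(xstar, breaks = tau, labels = FALSE)
table(x) / length(x)   # closely matches probs
```

Three finite thresholds yield four response categories; in general, the number of finite thresholds equals the number of response categories minus 1.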

four-item psychoticism scale by Jonason and Webster (2010) online via an open-access personality-testing website (https://openpsychometrics.org/about). With these data, alpha for the psychoticism scale was .77, but it should not be assumed that the scale is unidimensional and conforms to the essential tau-equivalence model; instead, its factor structure should be tested to determine an appropriate reliability estimate. Because the psychoticism items have a 4-point response scale, I fitted the one-factor model to the interitem polychoric correlations using WLSMV, a robust weighted least squares estimator that is recommended over the ML estimator for CFA with polychoric correlations (Finney & DiStefano, 2013).

Specifically, the one-factor model is specified in the same way as in the previous example with items treated as continuous variables:

> mod1f <- 'psyctcsm =~ DDP1 + DDP2 + DDP3 + DDP4'

This code indicates that the factor psyctcsm is measured by observed variables DDP1 through DDP4, which are the names of the psychoticism items in the data frame. This mod1f model is also estimated using the cfa function:

> fit1f <- cfa(mod1f, data=potic, std.lv=T, ordered=T, estimator='WLSMV')

But now, the ordered option is set to TRUE so that all items are treated as ordered, categorical variables; consequently, lavaan is told to fit the model to polychoric correlations using WLSMV. As before, the std.lv option is set to TRUE so that the variance of the psyctcsm factor is fixed equal to 1.

Again, the results can be viewed using the summary function. The model-fit statistics under the Robust column indicate that this one-factor model fits the data adequately, CFI = .99, TLI = .97; although the RMSEA has a high value of .11, the residual correlations are all small (< .07). Thus, it seems reasonable to estimate reliability of the psychoticism scale from the parameter estimates of this model (i.e., the scale can be considered a unidimensional test). As for ωu, the ωu-cat estimate can be obtained by executing the semTools::reliability function on the fit1f one-factor model object; because this model was fitted to polychoric correlations, semTools::reliability automatically calculates ωu-cat instead of ωu:

> reliability(fit1f)

This call to reliability produces the following output:

         psyctcsm
alpha   0.8007496
omega   0.7902953
omega2  0.7902953
omega3  0.7932682
avevar  0.5289638

Results listed in the omega and omega2 rows give the ωu-cat estimate based on Green and Yang's (2009b) formula, in which the denominator equals the model-implied variance of the total score X, and results listed in the omega3 row are based on a variation of Green and Yang's formula in which the denominator equals the observed, sample variance of X. Thus, the ωu-cat estimate indicates that .79 of the scale's total-score variance is due to the single psychoticism factor.

These results also give an estimate of alpha (.80) that differs from the estimate reported earlier for the psychoticism scale (i.e., .77); this alpha is an example of ordinal alpha (Zumbo, Gadermann, & Zeisser, 2007) because it is based on the model estimated using polychoric correlations, whereas the first alpha estimate used the traditional calculation from interitem product-moment covariances. Note that ordinal alpha is a reliability estimate for the sum of the continuous, latent-response variables (i.e., x* variables described earlier) rather than for X, the sum of the observed, categorical item-response variables (Yang & Green, 2015). Additionally, ordinal alpha still carries the assumption of equal factor loadings (i.e., tau equivalence). For these reasons, I advocate ignoring the alpha results reported by semTools::reliability when the factor model has been fitted using polychoric correlations (see Chalmers, 2018, and Zumbo & Kroc, 2019, for further discussion).

As with ωu, the ci.reliability function can also be used to obtain a confidence interval for ωu-cat. Specifically, the command

ci.reliability(data=potic, type="categorical", interval.type="perc")

includes the option type="categorical" to invoke estimation of ωu-cat based on fitting a one-factor model to the polychoric correlations among items in the potic data frame. The results return a point estimate of .79 with percentile bootstrap 95% CI of [.75, .83].

In sum, because the psychoticism items have only four response categories, any reliability estimate based on a factor-analytic model should account for the items' categorical nature. Because the one-factor model adequately explains the polychoric correlations among the items, it is reasonable to consider the psychoticism scale a unidimensional test. Therefore, ωu-cat = .79 (95% CI = [.75, .83]) is an appropriate estimate of the proportion of the psychoticism scale's total-score variance that is due to a single psychoticism factor.

Reliability Estimates for Multidimensional Scales

Often, tests are designed to measure a single construct but end up having a multidimensional structure, especially as the content of the test broadens. Occasionally, multidimensionality is intentional, as when a test is designed to produce subscale scores in addition to a total score. In other situations, the breadth of the construct's definition or the format of items produces unintended multidimensionality, even if a general target construct that influences all items is still present. In either case, the one-factor model presented earlier is incorrect, and thus it is generally inappropriate to use alpha or ωu to represent the reliability of a total score from a multidimensional test. Instead, reliability estimates for observed scores derived from multidimensional tests should be interpretable with respect to the target constructs.

For example, Flake, Barron, Hulleman, McCoach, and Welsh (2015) developed a 19-item test to measure a broad construct, termed psychological cost, from Eccles's (2005) expectancy-value theory of motivation. Although this psychological-cost scale (PCS) was designed to produce a meaningful total score representing a general cost construct, the item content was derived from several more specific content domains (termed task-effort cost, outside-effort cost, loss of valued alternatives, and emotional cost). Consequently, although a general cost factor is expected to influence responses to all 19 items, it may be best to consider the PCS multidimensional because of excess covariance among items from the same content domain beyond the covariance explained by a general construct.

Bifactor models

One way to represent such a multidimensional structure is with a bifactor model, in which a general factor influences all items and specific factors (also known as group factors) capture covariation among subsets of items that remains over and above the covariance due
to the general factor. Specific factors do not represent subscales per se but instead represent the shared aspects of a subset of items that are independent from the general factor (in fact, in some situations, specific factors may be used to capture method artifacts, such as item-wording effects). A bifactor model for the PCS includes a general cost factor influencing all items along with four specific factors capturing excess covariance among items from the same content domain. A path diagram of this model is in Figure 3, which shows that each item has a nonzero general-factor loading (i.e., the general factor, g, influences all items) along with a nonzero loading on a specific factor pertaining to the item's content domain (i.e., each specific factor, sk, influences only a subset of items). Because each item is directly influenced by two factors, the equation for this model is a multiple regression equation with each item simultaneously regressed on the general factor and one of the specific factors. The general factor must be uncorrelated with the specific factors to guarantee model identification (Yung, Thissen, & McLeod, 1999), whereas in other CFA models, all factors freely correlate with each other (allowing the general factor in a bifactor model to correlate with one of the specific factors causes errors such as nonconvergence or improper solutions).

Omega-hierarchical

When item-response data are well represented by a bifactor model, a reliability measure known as omega-hierarchical (or ωh) represents the proportion of total-score variance due to a single, general construct that influences all items, despite the multidimensional nature of the item set (Rodriguez, Reise, & Haviland, 2016; Zinbarg et al., 2005). Just as the parameter estimates from a one-factor model are used to estimate reliability with ωu, parameter estimates from a bifactor model are used to calculate ωh, the formula of which is in Table 1. This formula for ωh is like that for ωu, except that the numerator is a function of the general-factor loadings only; the denominator again represents the estimated variance of the total score X. Therefore, ωh represents the extent to which the total score provides a reliable measure of a construct represented by a general factor that influences all items in a multidimensional scale over and above extraneous influences captured by the specific factors.

Although several prominent resources have presented ωh and discussed its conceptual advantages for estimating reliability of total scores from multidimensional tests (e.g., McDonald, 1999; Revelle & Zinbarg, 2009; Rodriguez et al., 2016; Zinbarg et al., 2005), little research has studied the finite-sample properties of ωh estimates. Notably, Zinbarg et al. (2006) showed that ωh estimates calculated using the CFA method described here are largely unbiased and are more accurate reliability estimates than alpha, showing trivial effects of design factors including the magnitude and heterogeneity of factor loadings, variation in sample size ranging from 50 to 200, the number of items, the number of specific factors, and the presence of cross-loadings. Yet no research to date has directly addressed whether ωh estimates are robust to model misspecification (e.g., incorrectly using a bifactor model for a test with a different multidimensional structure).

As with ωu, when the bifactor model is fitted to polychoric correlations among ordered, categorical items, applying the formula for ωh leads to a reliability estimate in the x* latent-response-variable metric rather than the metric of the observed total score X. Instead, the approach of Green and Yang (2009b) can be applied to produce a version of ωh that gives a reliability estimate of the proportion of total observed score variance due to the general factor; this estimate is referred to as ωh-cat. Although ωh-cat is not presented in Table 1, its equation is simply an adaptation of the equation for ωu-cat in which the loadings from the one-factor model are replaced with general-factor loadings from a bifactor model.

Example calculation of ωh in R

To demonstrate the estimation of ωh using R, I use data my colleagues and I collected administering the PCS to 154 students in an introductory statistics course (Flake, Ferland, & Flora, 2017). With these data, alpha for the PCS total score is .96, but this may be a misleading reliability estimate because of multidimensionality; that is, alpha is a function of total variance due to all systematic influences on the items (i.e., both general and specific factors) rather than a measure of how reliably the total score measures a single target construct represented by a general factor. I fitted the bifactor model depicted in Figure 3 to the item-response data, treating the item responses as continuous variables because they were given on a 6-point scale.

The syntax to specify the bifactor model for lavaan is

> modBf <- 'gen =~ TE1+TE2+TE3+TE4+TE5+OE1+OE2+OE3+OE4+LVA1+LVA2+LVA3+LVA4+EM1+EM2+EM3+EM4+EM5+EM6
            s1 =~ TE1 + TE2 + TE3 + TE4 + TE5
            s2 =~ OE1 + OE2 + OE3 + OE4
            s3 =~ LVA1 + LVA2 + LVA3 + LVA4
            s4 =~ EM1 + EM2 + EM3 + EM4 + EM5 + EM6'
λ1s1 x1 e1

λ2s1 x2
e2 λ
s1 λ3s1 1g

(Specific Task Effort) x3 e3 λ


λ4s1 2g

x4 e4 λ3g
λ5s1
x5 e5 λ4g
λ5g
λ6s2 x6 e6
λ7s2 λ6g
s2 x7 e7
(Specific Outside Effort) λ8s2 λ7g
x8 e8
λ9s2 λ8g
e9 λ
x9 9g
g
λ10g (Psychological Cost)
λ10,s3 x10
e10 λ11g
λ11,s3 x11 λ12g
s3 e11
(Specific Loss of λ12,s3 λ13g
Valued Alternatives) x12 e12
λ13,s3
λ14g Generic Equation for All Items:
e13 K=4
x13
xj = λjg g + ( ∑ λjk sk ) + ej
k=1
λ15g
λ14,s4 x14 e14
λ16g Example Equation (for Item x 1):
e15
λ15,s4 x15 λ17g x1 = λ1g g + λ1s1 s1 + 0s2 + 0s3 + 0s4 + e1
λ16,s4 e16 x1 = λ1g g + λ1s1 s1 + e1
x16 λ18g
s4
(Specific Emotional Cost) λ17,s4 e17
x17 λ19g
λ18,s4
e18
λ19,s4 x18

x19 e19

Fig. 3. Bifactor model for the psychological-cost scale. See the text for further explanation.

495
where gen is the general cost factor measured by all 19 items (i.e., TE1 through EM6), s1 is the specific factor for items pertaining to task-effort content (TE1 through TE5), s2 is the specific factor for items pertaining to outside-effort content (OE1 through OE4), s3 is the specific factor for loss-of-valued-alternatives items (LVA1 through LVA4), and s4 is the specific factor for emotional-cost items (EM1 through EM6). The model is estimated with

> fitBf <- cfa(modBf, data=pcs, std.lv=T, estimator='MLR', orthogonal=T)

where the orthogonal=TRUE option forces all interfactor correlations to equal 0, which, as discussed earlier, is important for identification of bifactor models.

The results from the summary function indicate that the bifactor model fits the PCS data well, with robust model-fit statistics, CFI = .98, TLI = .97, RMSEA = .05. Thus, it is reasonable to calculate ωh to estimate how reliably the PCS total score measures the general psychological-cost factor. Applying semTools::reliability to this fitted model,

> reliability(fitBf)

produces the following output:

              gen         s1        s2        s3        s4
alpha   0.9638781 0.92504205 0.8992820 0.9052459 0.9405882
omega   0.9741033 0.56377307 0.7884791 0.6766430 0.7816839
omega2  0.9094893 0.09237594 0.3666293 0.1880759 0.2054075
omega3  0.9077636 0.09240479 0.3666634 0.1878380 0.2053012
avevar         NA         NA        NA        NA        NA

Estimates listed under the gen column pertain to the general cost factor. The omega estimate, .97, ignores the contribution of the specific factors to calculation of the implied variance of the total score in its denominator (Jorgensen et al., 2020); thus, this value is not a reliability estimate for the PCS total score. Instead, the omega2 and omega3 values under gen are ωh estimates; omega2 is calculated using the model-implied variance of the total score in its denominator, and omega3 is calculated using the observed sample variance of X (this distinction is analogous to the one between omega2 and omega3 described earlier for ωu). Thus, the proportion of PCS total-score variance that is due to a general psychological cost factor over and above the influence of effects that are specific to the different content domains is .91.11 Interpretation of the estimates listed under the s1 to s4 columns is described in the online supplement.

Higher-order models

In this bifactor-model example, I considered multidimensionality among the PCS items a nuisance for the measurement of a broad, general psychological-cost target construct. In other situations, researchers may hypothesize an alternative multidimensional structure such that there is a broad, overarching construct indirectly influencing all items in a test through more conceptually narrow constructs that directly influence different subsets of items. Such hypotheses imply that the item-level data arise from a higher-order model, in which a higher-order factor (also known as a second-order factor) causes individual differences in several more conceptually narrow lower-order factors (or first-order factors), which in turn directly influence the observed item responses. In this context, researchers may evaluate the extent to which the test produces reliable total scores (as measures of the construct represented by the higher-order factor) as well as subscale scores (as measures of the constructs represented by the lower-order factors).

When item scores arise from a higher-order model, a reliability measure termed omega-higher-order, or ωho, represents the proportion of total-score variance that is due to the higher-order factor; parameter estimates from a higher-order model are used to calculate ωho. As does ωh, ωho represents the reliability of a total score for measuring a single construct that influences all items, despite the multidimensional nature of the test. Thus, the conceptual distinction between ωh and ωho owes to the subtle difference between the interpretation of the general factor in the bifactor model and the higher-order factor in the higher-order model: In short, whereas the bifactor model's general factor influences all items directly (while the specific factors are held constant), a higher-order factor influences all items indirectly via the lower-order factors (see Yung et al., 1999, for further detail). The online supplement to this article presents a formula for ωho and demonstrates the estimation of a higher-order model for the PCS and subsequent calculation of ωho as a function of the model's parameter estimates. When the semTools::reliability function is applied to a fitted higher-order model, it does not return ωho; instead, the reliabilityL2 function of the semTools package calculates ωho, as shown in the online supplement, which also describes reliability estimation for subscales.12
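The difference between the omega and omega2 rows can be sketched with the bifactor algebra. The standardized loadings below are hypothetical (a small six-item, two-specific-factor example, not the fitted PCS solution); the point is that ωh keeps only the general-factor loadings in its numerator, whereas its denominator also counts the variance contributed by the orthogonal specific factors:

```r
# Sketch of omega-h for a small bifactor model (hypothetical standardized
# loadings; two specific factors with three items each).
lambda_g  <- c(0.7, 0.6, 0.7, 0.6, 0.7, 0.6)        # general-factor loadings
lambda_s1 <- c(0.4, 0.3, 0.4, 0.0, 0.0, 0.0)        # specific factor 1
lambda_s2 <- c(0.0, 0.0, 0.0, 0.3, 0.4, 0.3)        # specific factor 2
theta <- 1 - lambda_g^2 - lambda_s1^2 - lambda_s2^2 # error variances

# Model-implied total-score variance sums over all orthogonal factors:
var_x <- sum(lambda_g)^2 + sum(lambda_s1)^2 + sum(lambda_s2)^2 + sum(theta)

# omega-h: only the general factor appears in the numerator.
omega_h <- sum(lambda_g)^2 / var_x

# An "omega" whose denominator drops the specific-factor terms (as described
# for the omega row above) must exceed omega-h whenever specific factors
# contribute variance:
omega_total <- sum(lambda_g)^2 / (sum(lambda_g)^2 + sum(theta))
omega_h < omega_total  # TRUE
```

This is the same pattern as the PCS output: .97 from the denominator that ignores the specific factors versus the trustworthy ωh of .91 from the full model-implied total-score variance.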
Exploratory Omega Estimates

The examples presented thus far used semTools::reliability (or reliabilityL2) to estimate forms of omega from the results of CFA models. However, these estimates depend on correct specification of the model underlying a given test (e.g., ωu is not an appropriate reliability estimate if the population model is multidimensional, as evidenced by poor fit of a one-factor model). When no hypothesized CFA model fits the data (as may be particularly likely in the early stages of test development), EFA can be used to help uncover a test's dimensional structure. Once the optimal number of factors underlying a test is determined, the omega function from the psych package (Revelle, 2020) can be used to obtain an omega estimate based on EFA model parameters; this omega estimate will represent the proportion of total-score variance due to a general factor common to all items (see the online supplement for further discussion and a demonstration of this omega function).

Conclusion

Researchers should not mechanistically report alpha and instead should investigate the internal dimensional structure of a test to choose an appropriate reliability estimate for the measurement of a construct of interest, that is, an appropriate form of coefficient omega. This Tutorial has described alternative forms of omega that depend on a test's underlying factor structure. Examples have been presented to demonstrate how to compute different omega estimates in R, mainly using semTools::reliability, which works on a CFA model fitted to the item-level data using the lavaan package.

The flowchart in Figure 4 summarizes recommendations for choosing an appropriate omega estimate; this chart applies to a situation in which a test is intended to measure a construct that is common to all items. The dimensional structure of the test determines the appropriate form of omega: ωu (or ωu-cat if the item responses are categorical) is appropriate for a unidimensional test (i.e., the item-response data conform to a one-factor model), and ωh or ωho (or the categorical-variable analogue) is appropriate for the reliability of a total score calculated from a multidimensional item set conforming to a bifactor or higher-order model, respectively. If the test is multidimensional but a suitable CFA model cannot be hypothesized or does not adequately fit the data, then an EFA approach can be used both to discover potential reasons for multidimensionality and to obtain a preliminary, exploratory omega estimate.

An important issue often overlooked is that item responses typically have a categorical scale; therefore, deciding whether to treat the data as categorical (i.e., by fitting the factor model to polychoric correlations) affects both model-fit statistics and parameter estimates used to calculate omega estimates. Furthermore, Green and Yang (2009b) showed that when the measurement model is fitted to polychoric correlations, it is necessary to rescale the model's parameter estimates to obtain reliability estimates in the metric of the observed total-score scale instead of a latent-response scale; ωu-cat is an appropriate substitute for ωu in this situation.

Although the essential tau-equivalence model underlying alpha is unlikely to be correct for a given test, it is also important for an omega estimate to be based on a properly specified measurement model, which highlights the importance of model comparisons and replication for evaluating a test's internal structure (using factor analysis) and ultimately estimating the reliability of its scores. Further research is needed to examine the finite-sample properties of different omega estimates under correct model specification as well as their robustness to model misspecification, especially for multidimensional cases. Nevertheless, the extant studies clearly support a general preference for omega estimates over alpha (e.g., Trizano-Hermosilla & Alvarado, 2016; Yang & Green, 2010; Zinbarg et al., 2006). Yet, just as researchers should not blindly report alpha for the reliability of a test, they should not thoughtlessly report an omega coefficient without first investigating the test's internal structure. The main conceptual benefit of using an omega coefficient to estimate reliability is realized when omega is based on a thoughtful modeling process that focuses on how well a test measures a target construct. Moving beyond mindless reliance on coefficient alpha and giving more careful attention to measurement quality is an important aspect of overcoming the replication crisis.
[Figure 4 (flowchart). Are item responses approximately continuous (e.g., Likert type with more than 5 response options)?
- Yes: Does a one-factor model fit the data?
  - Yes: Use ωu to represent reliability of the total score with respect to the hypothetical construct (omega2 or omega3 from semTools::reliability).
  - No: Does a hypothesized multidimensional CFA model fit the data?
    - Yes, a bifactor model: Use ωh to represent reliability of the total score with respect to the general construct (omega2 or omega3 from semTools::reliability).
    - Yes, a higher-order model: Use ωho to represent reliability of the total score with respect to the higher-order construct (see the online supplement).
    - No: Use EFA to investigate multidimensionality; Omega Hierarchical from psych::omega estimates total-score reliability with respect to the higher-order factor.
- No: Does a one-factor model fit the polychoric correlations?
  - Yes: Use ωu-cat to represent reliability of the total score with respect to the hypothetical construct (omega2 or omega3 from semTools::reliability).
  - No: Does a hypothesized multidimensional CFA model fit the polychoric correlations?
    - Yes, a bifactor model: Use ωh-cat to represent reliability of the total score with respect to the general construct (omega2 or omega3 from semTools::reliability).
    - Yes, a higher-order model: Respecify as the bifactor model to obtain ωh-cat from semTools::reliability (see the online supplement).
    - No: Use EFA of polychoric correlations to investigate multidimensionality; Omega Hierarchical from psych::omega is biased with respect to the observed total-score scale.]

Fig. 4. Flowchart for determining the appropriate omega estimate for measurement of a hypothetical construct influencing all items in a scale. CFA = confirmatory factor analysis; EFA = exploratory factor analysis.
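The branching in Figure 4 can be condensed into a small illustrative helper. The function below is hypothetical (it is not part of semTools or psych) and returns only the label of the recommended estimate, not a computed reliability:

```r
# Illustrative helper encoding the flowchart logic (hypothetical function;
# it returns a recommendation label, not an actual reliability estimate).
choose_omega <- function(continuous_items, one_factor_fits,
                         multidim_model = c("none", "bifactor", "higher-order")) {
  multidim_model <- match.arg(multidim_model)
  if (continuous_items) {
    if (one_factor_fits) return("omega-u")
    switch(multidim_model,
           "bifactor"     = "omega-h",
           "higher-order" = "omega-ho",
           "none"         = "EFA; psych::omega")
  } else {
    if (one_factor_fits) return("omega-u-cat")
    switch(multidim_model,
           "bifactor"     = "omega-h-cat",
           "higher-order" = "respecify as bifactor; omega-h-cat",
           "none"         = "EFA of polychorics; psych::omega (biased)")
  }
}

choose_omega(continuous_items = TRUE,  one_factor_fits = TRUE)              # "omega-u"
choose_omega(continuous_items = FALSE, one_factor_fits = FALSE, "bifactor") # "omega-h-cat"
```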
Which Omega Is Right? 499

Transparency
Action Editor: Mijke Rhemtulla
Editor: Daniel J. Simons
Author Contributions
D. B. Flora is the sole author of this manuscript and is responsible for its content.
Declaration of Conflicting Interests
The author(s) declared that there were no conflicts of interest with respect to the authorship or the publication of this article.
Open Practices
Open Data: https://osf.io/y3e4k
Open Materials: https://osf.io/y3e4k
Preregistration: not applicable
All data and materials have been made publicly available via OSF and can be accessed at https://osf.io/y3e4k. This article has received badges for Open Data and Open Materials. More information about the Open Practices badges can be found at http://www.psychologicalscience.org/publications/badges.

ORCID iD
David B. Flora https://orcid.org/0000-0001-7472-0914

Acknowledgments
I would like to thank Reviewer 3 for detailed and constructive guidance across several drafts of this manuscript.

Notes
1. Equivalent results are obtained if the test score is calculated as the mean of all items.
2. By default, most software will set the scale of a factor by fixing the factor loading of the first item equal to 1 while leaving the factor variance as a freely estimated parameter; this approach leads to an equivalent model such that overall model fit and standardized parameter estimates are identical to those of a model in which the factor variance is fixed at 1. Thus, the same reliability estimates are obtained regardless of how the factor scale is established.
3. In the factor-analytic model, the error ej consists of both random error and systematic influences on the jth item but no other items. Although these influences cannot be disentangled (without repeated measures data), reliability is traditionally defined as the proportion of systematic variance relative to total variance, where systematic variance is due to both the common factor f and these item-specific influences. However, applied researchers usually want to understand how reliably a set of items measures a given construct relevant to all items, which is represented here as the factor f (see Bollen, 1989, pp. 218–221). Thus, in the unidimensional context, ωu provides that information.
4. Kelley and Pornprasertmanit (2016) suggested that omega estimates are more robust to model misspecification when σ̂²X is calculated as the sample variance, S²X, of the total score instead of the model-implied variance, showing that confidence-interval coverage is better with σ̂²X = S²X. Bentler (2009) suggested that the model-implied variance is a more efficient estimator of σ²X than S²X is, but this efficiency may depend on correct model specification.
5. Such post hoc model modification is known to reduce the replicability of CFA models and effectively leads one from a confirmatory phase to an exploratory phase of scale development and validation (Flora & Flake, 2017).
6. To test tau equivalence formally, one can compare the fit of the current model with that of a one-factor model with the factor loadings constrained to equal each other (see the OSF project page for a demonstration).
7. The omega3 estimate is referred to as "hierarchical omega" in the help documentation for semTools::reliability (Jorgensen et al., 2020) and in Kelley and Pornprasertmanit (2016), whereas this Tutorial follows most of the psychometric literature (e.g., Rodriguez et al., 2016; Zinbarg et al., 2006) by reserving the term omega-hierarchical for estimates based on a bifactor model, which is presented later in this Tutorial.
8. The residual correlation matrix can be obtained with the command residuals(fit1f, type='cor').
9. Unfortunately, the ci.reliability function cannot explicitly account for the error covariance to obtain a CI around this ωu estimate.
10. The number of finite threshold parameters for an item equals the number of item-response categories minus 1.
11. Although the current version of MBESS::ci.reliability calculates CIs for an estimate its authors call "hierarchical omega," it is calculated from a one-factor model instead of a bifactor model. Unfortunately, ci.reliability cannot obtain a CI for the ωh estimate presented here.
12. If a higher-order model is fitted to polychoric correlations among items with five or fewer response options, the current version of reliabilityL2 does not return a version of ωho in the observed-score metric (i.e., ωho-cat). In this situation, the researcher may respecify the higher-order model as a bifactor model and use ωh-cat.

References
Bentler, P. M. (2009). Alpha, dimension-free, and model-based internal consistency reliability. Psychometrika, 74, 137–143.
Bollen, K. A. (1989). Structural equations with latent variables. New York, NY: Wiley.
Borsboom, D. (2005). Measuring the mind: Conceptual issues in contemporary psychometrics. New York, NY: Cambridge University Press.
Brown, T. A. (2015). Confirmatory factor analysis for applied research (2nd ed.). New York, NY: Guilford Press.
Chalmers, R. P. (2018). On misconceptions and the limited usefulness of ordinal alpha. Educational and Psychological Measurement, 78, 1056–1071.
Cole, D., & Preacher, K. (2014). Manifest variable path analysis: Potentially serious and misleading consequences due to uncorrected measurement error. Psychological Methods, 19, 300–315.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
Eccles, J. S. (2005). Subjective task value and the Eccles et al. model of achievement-related choices. In A. J. Elliot & C. S. Dweck (Eds.), Handbook of competence and motivation (pp. 105–121). New York, NY: Guilford Press.
Finney, S. J., & DiStefano, C. (2013). Nonnormal and categorical data in structural equation modeling. In G. R. Hancock & R. O. Mueller (Eds.), A second course in structural equation modeling (2nd ed., pp. 439–492). Charlotte, NC: Information Age.
Flake, J. K., Barron, K. E., Hulleman, C., McCoach, B. D., & Welsh, M. E. (2015). Measuring cost: The forgotten component of expectancy-value theory. Contemporary Educational Psychology, 41, 232–244.
Flake, J. K., Ferland, M., & Flora, D. B. (2017, April). Trajectories of psychological cost in gatekeeper classes: Relationships with expectancy, value, and performance. Paper presented at the annual meeting of the American Educational Research Association, San Antonio, TX.
Flake, J. K., Pek, J., & Hehman, E. (2017). Construct validation in social and personality research: Current practice and recommendations. Social Psychological and Personality Science, 8, 370–378.
Flora, D. B., & Flake, J. K. (2017). The purpose and practice of exploratory and confirmatory factor analysis in psychological research: Decisions for scale development and validation. Canadian Journal of Behavioural Science, 49, 78–88.
Flora, D. B., LaBrish, C., & Chalmers, R. P. (2012). Old and new ideas for data screening and assumption testing for exploratory and confirmatory factor analysis. Frontiers in Psychology, 3, Article 55. doi:10.3389/fpsyg.2012.00055
Fried, E. I., & Flake, J. K. (2018). Measurement matters. Observer, 31(3), 29–30.
Green, S. B., & Yang, Y. (2009a). Commentary on coefficient alpha: A cautionary tale. Psychometrika, 74, 121–135.
Green, S. B., & Yang, Y. (2009b). Reliability of summed item scores using structural equation modeling: An alternative to coefficient alpha. Psychometrika, 74, 155–167.
Jonason, P., & Webster, G. (2010). The dirty dozen: A concise measure of the dark triad. Psychological Assessment, 22, 420–432.
Jöreskog, K. G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36, 109–133.
Jorgensen, T. D., Pornprasertmanit, S., Schoemann, A. M., & Rosseel, Y. (2020). semTools: Useful tools for structural equation modeling (R package Version 0.5-3) [Computer software]. Retrieved from https://CRAN.R-project.org/package=semTools
Kelley, K. (2019). MBESS: The MBESS R package (R package Version 4.6.0) [Computer software]. Retrieved from https://CRAN.R-project.org/package=MBESS
Kelley, K., & Pornprasertmanit, S. (2016). Confidence intervals for population reliability coefficients: Evaluation of methods, recommendations, and software for composite measures. Psychological Methods, 21, 69–92.
Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis: The assumption that measurement error always reduces effect sizes is false. Science, 355, 584–585.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
McDonald, R. P. (1999). Test theory: A unified approach. Mahwah, NJ: Erlbaum.
McNeish, D. (2018). Thanks coefficient alpha, we'll take it from here. Psychological Methods, 23, 412–433.
Mellenbergh, G. J. (1996). Measurement precision in test score and item response models. Psychological Methods, 1, 293–299.
R Core Team. (2018). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
Revelle, W. (2020). psych: Procedures for psychological, psychometric, and personality research (R package Version 2.0.9) [Computer software]. Retrieved from https://cran.r-project.org/web/packages/psych/index.html
Revelle, W., & Condon, D. M. (2019). Reliability from α to ω: A tutorial. Psychological Assessment, 31, 1395–1411.
Revelle, W., Wilt, J., & Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In A. Gruszka, G. Matthews, & B. Szymura (Eds.), Handbook of individual differences in cognition: Attention, memory and executive control (pp. 27–49). New York, NY: Springer.
Revelle, W., & Zinbarg, R. E. (2009). Coefficients alpha, beta, omega, and the glb: Comments on Sijtsma. Psychometrika, 74, 145–154.
Rhemtulla, M., Brosseau-Liard, P. E., & Savalei, V. (2012). When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological Methods, 17, 354–373.
Rodriguez, A., Reise, S. P., & Haviland, M. G. (2016). Evaluating bifactor models: Calculating and interpreting statistical indices. Psychological Methods, 21, 137–150.
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2).
Savalei, V. (2018). On the computation of the RMSEA and CFI from the mean-and-variance corrected test statistic with nonnormal data in SEM. Multivariate Behavioral Research, 53, 419–429.
Savalei, V. (2019). A comparison of several approaches for controlling measurement error in small samples. Psychological Methods, 24, 352–370.
Savalei, V., & Reise, S. P. (2019). Don't forget the model in your model-based reliability coefficients: A reply to McNeish (2018). Collabra: Psychology, 5, Article 36. doi:10.1525/collabra.247
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach's alpha. Psychometrika, 74, 107–120.
Trizano-Hermosilla, I., & Alvarado, J. M. (2016). Best alternatives to Cronbach's alpha reliability in realistic conditions: Congeneric and asymmetrical measurements. Frontiers in Psychology, 7, Article 769. doi:10.3389/fpsyg.2016.00769
Wirth, R. J., & Edwards, M. C. (2007). Item factor analysis: Current approaches and future directions. Psychological Methods, 12, 58–79.
Yang, Y., & Green, S. B. (2010). A note on structural equation modeling estimates of reliability. Structural Equation Modeling, 17, 66–81.
Yang, Y., & Green, S. B. (2015). Evaluation of structural equation modeling estimates of reliability for scales with ordered categorical items. Methodology, 11, 23–34.
Yung, Y.-F., Thissen, D., & McLeod, L. D. (1999). On the relationship between the higher-order factor model and the hierarchical factor model. Psychometrika, 64, 113–128.
Zinbarg, R. E., Revelle, W., Yovel, I., & Li, W. (2005). Cronbach's α, Revelle's β, and McDonald's ωh: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70, 123–133.
Zinbarg, R. E., Yovel, I., Revelle, W., & McDonald, R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ωh. Applied Psychological Measurement, 30, 121–144.
Zumbo, B. D., Gadermann, A. M., & Zeisser, C. (2007). Ordinal versions of coefficients alpha and theta for Likert rating scales. Journal of Modern Applied Statistical Methods, 6, 21–29.
Zumbo, B. D., & Kroc, E. (2019). A measurement is a choice and Stevens' scales of measurement do not help make it: A response to Chalmers. Educational and Psychological Measurement, 79, 1184–1197.
