Reliabil
This document describes covariance structure analysis methods for estimation of scale
reliability and dealing with related issues, which can be used with LISREL8.50.
The following discussion focuses on the reliability coefficient, a principal
psychometric index reflecting the "precision" (consistency) of measurement in the behavioral,
educational and social sciences. It is defined as the overall (unconditional) proportion of true
variance in observed variance on a given measure. The procedures outlined below accomplish
goals of earlier methods based on Cronbach's coefficient alpha (α; Cronbach, 1951) in the setting
of pre-specified scale components (tests, test parts) and sampling of subjects. As is well known,
with uncorrelated errors (and regardless of factorial structure of the components) coefficient α
equals scale reliability—the quantity of actual interest to scale constructors and developers—only
in the very restrictive case when all elements of a multiple-component instrument are tau-
equivalent, and otherwise underestimates it even at the population level (Novick & Lewis, 1967;
Lord & Novick, 1968; Zimmerman, 1972). Tau-equivalence is a rather restrictive condition in
these sciences where units of measurement are frequently arbitrary. (This testable condition
requires that the measurement units are in addition equal to one another.) Alternatively, with
correlated errors, α can overestimate or conversely underestimate scale reliability already in the
population (e.g., Zimmerman, 1972; Raykov, 2001a, in press/2001b).
The major feature of the estimation and testing approaches in the sequel is that they allow
one to answer a number of important questions in the same terms in which the questions are
asked. This feature is not shared with many applications of α in relation to reliability. The present
approaches permit one to (i) estimate scale reliability (not α); (ii) construct a confidence interval
for scale reliability (not α); (iii) test if scale reliability is the same in independent or dependent
groups (not whether α is so); and (iv) examine if after revision, such as adding/deleting
components, the new scale has the same reliability as the old one (not whether α for the new scale
is the same as α for the old one) in the population for which the instrument is being developed
(rather than only in the sample at hand, as is the case with currently typical applications of α for
these purposes). Thus, the methods outlined in the remainder answer relevant and often raised
queries in the behavioral and social sciences about reliability of multi-component measuring
instruments and thereby provide their answers in terms of the ultimate index of concern, the scale
reliability coefficient.
In this way, the methods described below fulfill a major logical prerequisite:
if a question is asked in terms of a concept A, say, then its answer ought to be given
in terms of A, not in terms of some other concept(s) that equal A only under potentially rather
restrictive conditions. That is, with the following procedures one answers questions about scale
reliability and the answers are provided in terms of scale reliability, in which terms the questions
were asked in the first instance. Traditional and currently still popular methods addressing these
questions answer them in terms of α rather than scale reliability. Those methods obviously do not
fulfill this fundamental logical requirement, due to the well-documented fact that α is in general a
misestimator of scale reliability even if one had access to the entire population.
The research on which this document is based was supported by grants from Scientific
Software International and The College Entrance and Examination Board (##2000-225, 2001-
173, 2001-267), which are herewith gratefully acknowledged. I am also thankful to Drs. Patrick
E. Shrout and David A. Grayson for stimulating discussions on some of the issues dealt with in
the remainder (specifically, on multi-dimensional scale reliability estimation and testing for
reliability change in a studied population as a result of scale revision). The detailed developments,
discussions and rationale of the methods outlined below can be found in the papers and
manuscripts in press that are cited in the reference section at the end of this document.
Scale reliability evaluation 2
(1)   Yi = ai + bi η + Ei
holds true, where ai and bi are appropriate constants, η the common true score (e.g., η = T1 can be
taken, with T1 being the true score of Y1), and Ei are the corresponding error scores (i = 1, 2, ..., k;
for a definition of true and error scores, see Zimmerman, 1975). For identifiability, let Var(η) = 1,
where Var(.) denotes variance in the population.
We will be concerned with various issues related directly to the reliability coefficient ρY
of the total scale score Y = Y1 + Y2 + … + Yk , which is also referred to as "scale reliability" or
"composite reliability". With uncorrelated errors this coefficient, defined as the ratio of true
variance in Y to its observed variance (e.g., Lord & Novick, 1968), equals:
(2)   $\rho_Y = \dfrac{\left(\sum_{i=1}^{k} b_i\right)^2}{\left(\sum_{i=1}^{k} b_i\right)^2 + \sum_{i=1}^{k} \theta_{ii}}$ ,
where θii = Var(Ei) are the error variances (i = 1, 2, ..., k, e.g., Bollen, 1989; numerators and
denominators of reliability coefficients are assumed throughout distinct from zero, a typically
fulfilled assumption in empirical research). With correlated errors (e.g., Williams & Zimmerman,
1996; Zimmerman, 1972),
(3)   $\rho_Y = \dfrac{\left(\sum_{i=1}^{k} b_i\right)^2}{\left(\sum_{i=1}^{k} b_i\right)^2 + \sum_{i=1}^{k} \theta_{ii} + 2\sum_{1 \le i < j \le k} \theta_{ij}}$ ,
where θij (1≤i<j≤k) are the nonzero error covariances and i and j vary across all pairs of
correlated errors. We presume that all models dealt with in the rest are identified. For a weighted
scale Y = w1Y1 + w2Y2 + ... + wkYk , it follows that
(4)   $\rho_Y = \dfrac{\left(\sum_{i=1}^{k} w_i b_i\right)^2}{\left(\sum_{i=1}^{k} w_i b_i\right)^2 + \sum_{i=1}^{k} w_i^2 \theta_{ii}}$
in the uncorrelated error case, and in that with correlated errors
(5)   $\rho_Y = \dfrac{\left(\sum_{i=1}^{k} w_i b_i\right)^2}{\left(\sum_{i=1}^{k} w_i b_i\right)^2 + \sum_{i=1}^{k} w_i^2 \theta_{ii} + 2\sum_{1 \le i < j \le k} w_i w_j \theta_{ij}}$ ,
where the same notation for the error covariances is used as in Equation (3). For the purposes of
the remaining discussion, the weighted scale case is reducible to the unweighted case via
appropriate substitutions (e.g., Raykov, in press/2001b).
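Equations (2) through (5) differ only in whether weights and error covariances enter, so a single routine can evaluate all four. The following is a minimal Python sketch and not part of the LISREL procedure described here; the function name `composite_reliability`, its argument names, and the three-component input values are hypothetical, and Var(η) = 1 is assumed as in the text.

```python
def composite_reliability(loadings, error_vars, error_covs=None, weights=None):
    """Reliability of a (weighted) sum of congeneric components, assuming Var(eta) = 1.

    loadings   : b_1, ..., b_k
    error_vars : theta_11, ..., theta_kk
    error_covs : optional dict {(i, j): theta_ij} with i < j (0-based indices)
    weights    : optional w_1, ..., w_k (unit weights if omitted)
    """
    k = len(loadings)
    w = list(weights) if weights is not None else [1.0] * k
    # Numerator of (2)-(5): squared (weighted) sum of loadings = true variance
    true_var = sum(wi * bi for wi, bi in zip(w, loadings)) ** 2
    # Error part of the denominator: weighted error variances ...
    err_var = sum(wi ** 2 * th for wi, th in zip(w, error_vars))
    # ... plus twice the weighted nonzero error covariances, if any
    for (i, j), th_ij in (error_covs or {}).items():
        err_var += 2.0 * w[i] * w[j] * th_ij
    return true_var / (true_var + err_var)


# Made-up illustrative values (not from the document)
b = [1.0, 1.0, 2.0]
theta = [0.5, 0.5, 1.0]
rho_uncorrelated = composite_reliability(b, theta)                        # Equation (2)
rho_correlated = composite_reliability(b, theta, error_covs={(0, 1): 0.25})  # Equation (3)
rho_rescaled = composite_reliability(b, theta, weights=[2.0, 2.0, 2.0])   # Equation (4)
```

With equal weights the weighted coefficient coincides with the unweighted one, which is one quick way to check such a routine.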
Reliability1.ls8
[Note. With correlated errors, one needs to extend the right-hand side of the constraint for PS(3,3)
by adding twice the nonzero error covariances.]
For illustration, multinormal zero-mean data were generated on N = 300 cases for k = 5
variables using LISREL, according to the following congeneric test model (see Equation (1)):
(6) Y1 = η1 + ε1,
Y2 = η1 + ε2,
Y3 = η1 + ε3,
Y4 = η1 + ε4,
Y5 = 3η1 + ε5,
where η1 had unitary variance while that of each error term was set at .4; the covariance matrix of
the resulting data is presented in Table 1.
Table 1
Covariance matrix of 5 components for 300 cases (see Equations (6))
Y1 1.322
Y2 0.878 1.241
Y3 0.912 0.886 1.313
Y4 0.858 0.807 0.881 1.240
Y5 2.670 2.567 2.668 2.560 8.243
Since all parameters of the data generating model (6) are known, use of (2) furnishes the
(true) reliability of the scale Y = Y1 + Y2 + ... + Y5 as ρY = 49/(49+2) = .961. Applying the above
INPUT 1 with LISREL8.50, one obtains .955 as an estimate of ρY (look at the estimate of
PS(4,4)) that is fairly close to the true scale reliability value. At the same time, coefficient alpha
results for these data as .877, which notably underestimates the true scale reliability coefficient of
.961. Detailed discussions on the misestimation features of coefficient alpha already at the
population level can be found in Novick & Lewis (1967), Zimmerman (1972), Raykov (1997b,
1998a, 2001a, in press/2001b), as well as other sources.
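The two reference values in this comparison, coefficient alpha for Table 1 and the population reliability implied by model (6), can be checked without LISREL. The sketch below is an illustration in Python, with the helper name `cronbach_alpha` chosen here for convenience; it uses the usual covariance-based formula for alpha and Equation (2) with the generating parameters.

```python
def cronbach_alpha(lower_rows):
    """Coefficient alpha from the lower triangle (diagonal included) of a covariance matrix."""
    k = len(lower_rows)
    sum_var = sum(row[-1] for row in lower_rows)        # diagonal: component variances
    sum_cov = sum(sum(row[:-1]) for row in lower_rows)  # below-diagonal covariances
    total_var = sum_var + 2.0 * sum_cov                 # variance of the sum score Y
    return k / (k - 1) * (1.0 - sum_var / total_var)


# Table 1 (N = 300), lower triangle row by row
table1 = [
    [1.322],
    [0.878, 1.241],
    [0.912, 0.886, 1.313],
    [0.858, 0.807, 0.881, 1.240],
    [2.670, 2.567, 2.668, 2.560, 8.243],
]
alpha = cronbach_alpha(table1)

# Population reliability from Equation (2) with the parameters of model (6)
loadings = [1, 1, 1, 1, 3]
error_vars = [0.4] * 5
true_rho = sum(loadings) ** 2 / (sum(loadings) ** 2 + sum(error_vars))

print(round(alpha, 3), round(true_rho, 3))  # 0.877 0.961
```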
and
respectively, where a caret denotes sample estimate, zγ/2 is the γ/2th standard normal quantile (0 <
γ < 100), u = b1+b2+…+bk is the sum of construct loadings, v = Var(E1)+Var(E2)+… +Var(Ek) is
that of measurement error variances, and D1 and D2 are the partial derivatives of the scale
reliability coefficient ρY with respect to u and v, which are obtained with the following formulas
(see (2); Stewart, 1991):
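Differentiating (2), written as a function of u and v, gives the expressions below. They are a reconstruction from (2) under that assumption rather than a quotation of (9) and (10), and the last line is the standard first-order (delta-method) variance approximation, assumed here to underlie the standard error:

```latex
\begin{align*}
\rho_Y &= \frac{u^2}{u^2 + v}, \\
D_1 &= \frac{\partial \rho_Y}{\partial u} = \frac{2uv}{(u^2 + v)^2}, \qquad
D_2 = \frac{\partial \rho_Y}{\partial v} = -\frac{u^2}{(u^2 + v)^2}, \\
\mathrm{Var}(\hat{\rho}_Y) &\approx D_1^2\,\mathrm{Var}(\hat{u}) + D_2^2\,\mathrm{Var}(\hat{v})
  + 2 D_1 D_2\,\mathrm{Cov}(\hat{u}, \hat{v}).
\end{align*}
```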
The estimates of D1 and D2 are furnished by substitution into (9) and (10) of the estimates
of u and v yielded by the following LISREL8.50 input file (STEP 1). The variances and
covariance of the estimates of u and v are found as pertinent entries in the matrix of
COVARIANCES OF PARAMETER ESTIMATES in the LISREL output obtained thereby.
Then, to get an approximate SE and CI of scale reliability, a few simple computations are
performed with a major statistical software package, e.g., SPSS (STEP 2 input file; the LISREL8.50 and
following SPSS input files as well as data in Table 2 are reprinted with permission from Raykov,
T. 2001c, “Analytic estimation of standard error and confidence interval for scale reliability”.
Multivariate Behavioral Research, in press).
Reliability2.ls8
[Note. The variances and covariance of the estimates of the parametric expressions u (PSI(2,2))
and v (PSI(3,3)) are located as pertinent entries of the output section “Covariance matrix of
parameter estimates”. Alternatively, the variances of the estimates of u and v are their squared
standard errors reported by LISREL in the “LISREL estimates” section.]
For illustration, this two-step interval estimation procedure is applied on simulated data
(cf. Raykov, 2001c). Using LISREL, multinormal zero-mean data were generated on k = 5
variables for N = 500 cases, following the model
(11) Y1 = T1 + E1 ,
Y2 = 1.5T1 + E2 ,
Y3 = 2T1 + E3 ,
Y4 = 2.5T1 + E4 ,
Y5 = 3T1 + E5 ,
whereby the latent score T1 was simulated with variance 1, and the error variances generated at .4,
.6, .8, 1, and 1.2 for E1 to E5, respectively. The covariance matrix of the so-obtained data on Y1
through Y5 is presented in Table 2.
Table 2
Covariance matrix of 5 variables for 500 cases (see Equations (11))
Y1 1.384
Y2 1.484 2.756
Y3 1.988 2.874 4.845
Y4 2.429 3.588 4.894 6.951
Y5 3.031 4.390 6.080 7.476 10.313
Applying the above INPUT 2 with LISREL8.50 (STEP 1 input file) furnishes .962 as an
estimate of the reliability coefficient of the scale Y = Y1 + Y2 + … + Yk. Since all population
parameters are known, the (true) reliability of this scale is determined by substituting their values
into (2): ρY = (1+1.5+2+2.5+3)²/[(1+1.5+2+2.5+3)² + .4+.6+.8+1+1.2] = .962. By comparison, α
= .931 results for this data set, which notably underestimates the true scale reliability of .962 and
is similarly lower than its estimate rendered by the described approach (see also Raykov, 1997b).
To obtain an approximate SE and 90%-CI for scale reliability, use the above SPSS file (STEP 2
input file). To this end, û and v̂ are first found as the corresponding variances of η2 and η3 in the
LISREL output: 9.941 and 3.892. With them, D1 and D2 are obtained from (9) and (10) as .007
and -.009, respectively. Then the variances and covariance of û and v̂ are located in the LISREL
output section titled “Covariances of parameter estimates” correspondingly as .107, .019, and
-.001. With all these quantities, that SPSS file yields a standard error of .003 and a 90%-
confidence interval for the scale reliability coefficient as (.958; .967), which covers the true scale
reliability of .962. (Note that coefficient alpha in this data is outside the last CI.)
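The STEP 2 arithmetic can be retraced outside SPSS. The following Python sketch is an illustration, not the SPSS file referred to above: it assumes the usual delta-method formula for the standard error and a normal-theory interval, with the partial derivatives obtained by differentiating (2) written as u²/(u² + v). The function name `reliability_se_ci` is hypothetical; the input values are those quoted in the preceding paragraph.

```python
from math import sqrt
from statistics import NormalDist


def reliability_se_ci(u, v, var_u, var_v, cov_uv, level=0.90):
    """Point estimate, approximate SE and normal-theory CI for rho = u^2 / (u^2 + v)."""
    denom = (u ** 2 + v) ** 2
    rho = u ** 2 / (u ** 2 + v)
    d1 = 2.0 * u * v / denom   # partial derivative with respect to u
    d2 = -(u ** 2) / denom     # partial derivative with respect to v
    se = sqrt(d1 ** 2 * var_u + d2 ** 2 * var_v + 2.0 * d1 * d2 * cov_uv)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # 1.645 for a 90% interval
    return rho, se, (rho - z * se, rho + z * se)


# Estimates and (co)variances quoted from the LISREL output in the text
rho, se, (lo, hi) = reliability_se_ci(9.941, 3.892, 0.107, 0.019, -0.001)
print(round(rho, 3), round(se, 3), round(lo, 3), round(hi, 3))  # 0.962 0.003 0.958 0.967
```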
than one trait. Suppose that subsets of Y1, Y2, ..., Yk, if considered separately from the remaining
items, assess common traits with unrestricted relationships. Thus, the initial set Y1, Y2, ..., Yk can
be split into q subsets of items (1 ≤ q ≤ k), which subsets correspond to q constructs η1, η2, ..., ηq.
We do not limit the number of items measuring any of the traits, and do not restrict the number of
traits evaluated by any of the items, as long as the overall model dealt with next is identified.
Without loss of generality, assume that the measures are ordered so that the q traits correspond to
consecutive (not necessarily non-overlapping) subsets of Y1, Y2, ..., Yk: the first m1 of these k
measures assess η1, the next m2 evaluate η2, ..., and the last mq components assess ηq (m1 + m2 +
... + mq = k; 1 ≤ mi ≤ k, i = 1, ..., q). The components within the jth of these subsets measure the
same pertinent construct, ηj , with possibly different units of measurement and/or precision (i.e.,
are congeneric; Joreskog, 1971), and some components may also assess another trait(s) besides ηj
(j = 1, …, q). Finally, for identifiability reasons, assume that all q latent constructs have variances
of one, i.e., Var(ηi) = 1 (i = 1, …, q). This general setup contains as a special case that of
congeneric measures, for q = 1. An example is given in the following Figure 1 (see also Equation
(12) below) that represents the path diagram of the model fitted next (Raykov & Shrout, 2002).
Following the rationale and idea of the "omega" coefficient by McDonald (1985, 1999),
the reliability coefficient of the sum score Y for the model in Figure 1 can be obtained using
covariance structure modeling (and application of the bootstrap subsequently yields an
approximate SE and CI of reliability of Y; e.g., Raykov & Shrout, 2002, and below). To
demonstrate, we employ a simulated data set of multinormal zero-mean data generated on N =
300 cases for k = 6 components Y1 to Y6 using LISREL according to the model
(12) Y1 = .5η1 + ε1 ,
Y2 = .8η1 + ε2 ,
Y3 = .6η1 + .3η2 + ε3 ,
Y4 = .4η1 + .4η2 + ε4 ,
Y5 = .5η2 + ε5 ,
Y6 = .8η2 + ε6 ,
where η1 and η2 evaluated by the battery of 6 components were simulated to have unitary
variances and correlation Corr(η1,η2) = .3, and the error standard deviations were set at .7, .6, .6,
.6, .7, and .6 for ε1 through ε6, respectively. The covariance matrix of Y1 through Y6 is presented
in Table 3 (Raykov & Shrout, 2002b).
Table 3
Covariance matrix of the six simulated variables (N = 300; see Equations (12))
Y1 0.58
Y2 0.32 0.95
Y3 0.31 0.48 0.83
Y4 0.23 0.37 0.41 0.72
Y5 0.01 0.05 0.19 0.24 0.78
Y6 0.09 0.21 0.36 0.36 0.43 0.93
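Because all parameters of the generating model (12) are known, the population reliability of the sum score can be computed directly from them. The Python sketch below assumes unit factor variances and uncorrelated errors, as stated for (12), so that the true variance of Y is the variance of (sum of η1 loadings)·η1 + (sum of η2 loadings)·η2; the variable names are illustrative.

```python
# Loadings of Y1..Y6 on eta1 and eta2, from Equations (12)
loadings_eta1 = [0.5, 0.8, 0.6, 0.4, 0.0, 0.0]
loadings_eta2 = [0.0, 0.0, 0.3, 0.4, 0.5, 0.8]
factor_corr = 0.3                              # Corr(eta1, eta2)
error_sds = [0.7, 0.6, 0.6, 0.6, 0.7, 0.6]     # error standard deviations

s1, s2 = sum(loadings_eta1), sum(loadings_eta2)
# Variance of s1*eta1 + s2*eta2 with unit factor variances
true_var = s1 ** 2 + s2 ** 2 + 2.0 * s1 * s2 * factor_corr
# Uncorrelated errors: error variance of Y is the sum of the error variances
error_var = sum(sd ** 2 for sd in error_sds)
rho = true_var / (true_var + error_var)

print(round(true_var, 2), round(error_var, 2), round(rho, 2))  # 12.05 2.42 0.83
```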
The reliability coefficient of the scale score Y = Y1 + Y2 + ... + Yk is obtained with the
LISREL8.50 input file provided next, which uses only linear parameter constraints and
corresponds to the model in Figure 2. That model is the same as the model in Figure 1, with added
dummy variables for the composite score Y, η3, and its true score, η4; the ratio of their variances is
the reliability coefficient of Y (see Raykov & Shrout, 2002, and Equation (13) below). The
LISREL 8.50 file, data in Table 3, and Figures 1 and 2 are reprinted with permission from
Raykov, T., Shrout, P., 2002, “Reliability of scales with general structure: Point and interval
estimation using covariance structure modeling”, Structural Equation Modeling, in press.
Reliability3.ls8
[Note. Start values are valid for the data set used and may not be appropriate for others. To obtain a
bootstrap standard error, first use PRELIS on the raw data to produce B (≥200) resample covariance
matrices; then analyze them with this LISREL input, adding as the last keywords BE=BEB
PS=PSB; and finally study the distribution of the ratio of the variances of η4 to η3. This ratio is
computed from the pertinent elements of the last two output files (viz. the 91st and 92nd columns of BEB
and the 2nd, 6th, 10th, 15th, 21st, 28th, and 36th columns of PSB); see below.]
Applying INPUT 3 with LISREL8.50 we fit to the covariance matrix in Table 3 the two-
factor model in Figure 2 with free factor loadings following the pattern in Equations (12), as well
as free error variances and latent correlation; further, all paths leading from the observed
variables into η3 are fixed at 1 to allow its interpretation as the scale score Y (see Raykov, 1997a;
Raykov & Shrout, 2002). This yields acceptable goodness of fit indices: χ² = 9.02, df = 6, p = .17
and RMSEA = .04 (0, .09). The estimate of the true composite variance (i.e., of η4) is found to be
10.83, while that of Y (i.e., of η3) turns out to be 13.09. Then the estimate of the reliability
coefficient of the composite Y = Y1 + Y2 + ... + Y6 results as:
(13)   ρ̂Y = 10.83/13.09 = .83 .
Since in this example all parameters are known, the (true) reliability of Y is determined
as:
(14)   ρY = .83 ,
which is the same as its estimate in (13) found with the discussed method but markedly
underestimated by coefficient α that in this data set turns out to be .76 (see Raykov & Shrout,
2002). Next, resampling B = 1000 times from the original raw data set and fitting the model to
each so-obtained sample furnishes 1000 bootstrap estimates of composite reliability. (Out of the
1000 model runs, 10 were associated with lack of convergence and were therefore disregarded in
the following analyses.) The mean of these estimates is .84 and their standard deviation is .02.
Hence, using the bootstrap approach (Efron & Tibshirani, 1993), an approximate standard error
of reliability of Y = Y1 + Y2 + … + Y6 results as SE( ρ̂ Y ) = .02. The 5th and 95th percentiles of the
distribution of so-obtained resample composite reliability coefficients are thereby found to be .81
and .87, respectively. Using a simple approach to confidence interval construction, the last two
numbers can be taken as a lower and upper limit of a bootstrap 90%-confidence interval for ρY ,
viz. (.81, .87). (We note that other, more involved methods of bootstrap-based CI construction
can be used for this purpose as well; Efron & Tibshirani, 1993.) This interval covers the true
composite reliability of .83 (see Equation (14)) and is completely above .80 that may be
considered a recommendable benchmark for reliability of scales. We stress that the estimated α of
.76 in this data is markedly below both the point and interval estimates of composite reliability
obtained with the method of this section, and also markedly below the true scale reliability
coefficient (cf. Novick & Lewis, 1967).
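The last step of the bootstrap procedure, turning a set of resample reliability estimates into a standard error and a simple percentile interval, can be sketched as follows. This is an illustrative Python fragment using a nearest-rank percentile rule, which is one of several common conventions; the function names are hypothetical, and the input list is a made-up stand-in, since the actual resample estimates come from the repeated LISREL runs.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    n = len(sorted_values)
    rank = -(-pct * n // 100)  # integer ceiling of pct * n / 100
    return sorted_values[max(rank, 1) - 1]


def bootstrap_se_ci(estimates, lower_pct=5, upper_pct=95):
    """Bootstrap SE (standard deviation of the estimates) and percentile interval."""
    n = len(estimates)
    mean = sum(estimates) / n
    sd = (sum((x - mean) ** 2 for x in estimates) / (n - 1)) ** 0.5
    ordered = sorted(estimates)
    return sd, (percentile(ordered, lower_pct), percentile(ordered, upper_pct))


# Made-up stand-ins for resample reliability estimates (NOT the values from the text)
fake_estimates = [0.80 + 0.001 * i for i in range(100)]
se, (lo, hi) = bootstrap_se_ci(fake_estimates)
```

Runs that fail to converge would be dropped from the list before this step, as was done with 10 of the 1000 resamples above.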
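Figure 1
(Raykov & Shrout, 2002)
[Path diagram: the components Y1 through Y4 load on η1 and Y3 through Y6 load on η2 (see Equations (12)); each component Yi has an error term εi, i = 1, ..., 6.]
Figure 2
(Raykov & Shrout, 2002)
[Path diagram: the model in Figure 1 extended with the dummy variables η3 and η4. Each of Y1 through Y6 has a path fixed at 1 into η3; the paths from η1 and η2 into η4 equal λ11+λ21+λ31+λ41 and λ32+λ42+λ52+λ62, respectively.]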
where the 1st subindex denotes group. To this end, we reparameterize the congeneric test model in
the above Equation (1) by fixing the paths from the error terms into the observed variables to
equal the sum of all construct loadings, i.e., b1 + b2 + ... + bk, within each group (k > 2; note that
this does not change the identification status of the model). To see how the null hypothesis (15) is
equivalently transformed thereby, first denote the error term variances in each group by θr,ii (r =
1, 2, i = 1, ..., k), where the 1st subindex pertains to group. Symbolize then the squared sums of
construct loadings in the groups by B1 and B2, i.e., $B_1 = (\sum_{i=1}^{k} b_{1i})^2$ and $B_2 = (\sum_{i=1}^{k} b_{2i})^2$, where the
1st subindex of the b's stands for group. Now, for the sake of (15), equate the reciprocals of the
right-hand side of (2) for each group to one another, and observe after simple algebra that then
(15) becomes equivalent to the cross-group constraint
where θr,ij denote nonzero error covariances in the pertinent group. With the above mentioned
reparameterization (viz. within each group set all paths from error terms into observed variables
to equal the sum of their loadings on the latent variable), (16) and (17) are correspondingly
equivalent to
(18)   $\theta^*_{2,11} = \sum_{i=1}^{k} \theta^*_{1,ii} - \theta^*_{2,22} - \dots - \theta^*_{2,kk}$
and
(19)   $\theta^*_{2,11} = \sum_{i=1}^{k} \theta^*_{1,ii} + 2\sum_{1 \le i < j \le k} \theta^*_{1,ij} - \theta^*_{2,22} - \dots - \theta^*_{2,kk} - 2\sum_{1 \le i < j \le k} \theta^*_{2,ij}$ ,
where each starred parameter stands for its corresponding ratio in (16) and (17), respectively.
(Note that the starred parameters are the error variances and covariances, if any, of the
reparameterized model.) Equations (18) and (19) represent each a cross-group linear constraint in
terms of the rescaled error variances and covariances.
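The "simple algebra" leading to (18) can be sketched for the uncorrelated error case as follows. This is a reconstruction from (2) and the stated reparameterization, not a quotation of the omitted constraints; it takes each starred parameter to be the corresponding error variance divided by the squared sum of loadings in its group.

```latex
\begin{align*}
\frac{1}{\rho_{r,Y}} &= 1 + \frac{\sum_{i=1}^{k} \theta_{r,ii}}{B_r} \quad (r = 1, 2),
\qquad\text{hence}\qquad
\rho_{1,Y} = \rho_{2,Y} \iff \sum_{i=1}^{k} \frac{\theta_{1,ii}}{B_1} = \sum_{i=1}^{k} \frac{\theta_{2,ii}}{B_2}. \\
\text{With } \theta^*_{r,ii} &= \theta_{r,ii} / B_r: \qquad
\sum_{i=1}^{k} \theta^*_{1,ii} = \sum_{i=1}^{k} \theta^*_{2,ii}
\iff
\theta^*_{2,11} = \sum_{i=1}^{k} \theta^*_{1,ii} - \theta^*_{2,22} - \dots - \theta^*_{2,kk}.
\end{align*}
```

The correlated error case (19) follows in the same way from (3), with twice the rescaled error covariances added on each side.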
This procedure is next demonstrated with LISREL8.50 on the following numerical
example. Multinormal, zero-mean data are first generated on k = 5 variables for N = 300 subjects
in each of two groups. The two unrelated data sets are simulated using LISREL according to the
congeneric component models given next. (This example and its data are reprinted with permission from
Raykov, T., in press/2002, “Examining group differences in reliability of multiple-component
instruments”, British Journal for Mathematical and Statistical Psychology. This paper is available
for downloading on a pay-per-view basis at a cost of approx. $15 at
http://www.bps.org.uk/publications/jMS_1.cfm , copyright The British Journal of Mathematical
and Statistical Psychology & The British Psychological Society.) In Group 1, the underlying
model is (cf. Equation (1)):
(20) Y1 = η1 + E1 ,
Y2 = 1.1η1 + E2 ,
Y3 = 1.2η1 + E3 ,
Y4 = 1.3η1 + E4 ,
Y5 = 1.4η1 + E5 ,
where the common true score η1 is generated as a standard normal variate while the variances of
E1 to E5 are set at .8 and the covariance of E1 and E2 fixed at .6. In Group 2, the same model is
employed with the only difference that the covariance of E1 and E2 is set at -.6; that is, the data
generation model is defined here as:
(21) Y1 = η1 + E1 ,
Y2 = 1.1η1 + E2 ,
Y3 = 1.2η1 + E3 ,
Y4 = 1.3η1 + E4 ,
Y5 = 1.4η1 + E5 ,
where the common true score η1 is a standard normal variate while the variances of E1 to E5 are
set at .8 and the covariance of E1 and E2 fixed at -.6. The covariance matrices of the resulting data
sets are presented in Table 4.
Table 4
Covariance matrices of the simulated two group data sets (see Equations (20) and (21))
________________________________________________________________________
Component Y1 Y2 Y3 Y4 Y5
________________________________________________________________________
Group 1 (N = 300)
________________________________________________________________________
Y1 1.602
Y2 1.580 1.982
Y3 1.071 1.299 2.260
Y4 1.168 1.353 1.474 2.287
Y5 1.308 1.489 1.750 1.661 2.783
________________________________________________________________________
Group 2 (N = 300)
________________________________________________________________________
Y1 1.602
Y2 0.597 2.165
Y3 1.123 1.525 2.372
Y4 1.230 1.518 1.569 2.393
Y5 1.373 1.633 1.824 1.744 2.826
________________________________________________________________________
To examine the relationship between the group reliability coefficients of the scale Y = Y1
+ Y2 + ... + Y5 , first the reliability of Y is estimated in each group using the approach in Section 1
of this document. In Group 1, ρ̂1,Y = .858 is obtained; since all parameters of the model used to
generate the data are known, the (true) composite reliability coefficient is determined using
Equation (3) as ρ1,Y = 36/(36+4+1.2) = .874. In Group 2, ρ̂2,Y = .923 is found, while the (true)
scale reliability coefficient is similarly found via (3) as ρ2,Y = 36/(36+4-1.2) = .928. Thus, by data
generation the true group difference in scale reliability is notable, ∆ρY = ρ1,Y - ρ2,Y = .874 - .928 =
-.054. To apply the outlined procedure with LISREL8.50, the following input file is used
(reprinted with permission from Raykov, T., in press/2002, “Examining group differences in
reliability of multiple-component instruments”, British Journal for Mathematical and Statistical
Psychology; information on accessibility given above and in the References section):
Reliability4.ls8
To test the null hypothesis (15), we fit two nested models--one without the restriction
equivalent to (15) and the other with it (e.g., Joreskog & Sorbom, 1996); the difference in
chi-square values of both models represents a test statistic of the hypothesis. The two-
group congeneric test model (with related errors E1 and E2) having no group constraints is
associated with a χ² = 13.190, df = 8, p = .105, and RMSEA = .045 (0; .089), all
indicating acceptable fit. The restricted model is obtained from the full by introducing
(19) for k = 5 and a single pair of correlated errors (see the last constraint line of the
LISREL input file). This model is associated with χ² = 33.150, df = 9, p = .0, and
RMSEA = .093 (.060; .130), indicating lack of fit. The difference in chi-square values
between the two models is ∆χ² = 33.150 – 13.190 = 19.960, with a difference in degrees of
freedom of 1 and associated p < .001. This suggests rejection of H0 of equal reliability
of Y = Y1 + Y2 + ... + Y5 in the two groups, in favor of the alternative of their difference.
This conclusion is in agreement with the notable difference in the true scale reliability
coefficients determined above, ∆ρY = -.054. In contrast, an application of the traditional
procedure for studying scale reliability differences via comparison of alpha coefficients
across groups would yield an incorrect conclusion. Indeed, the corresponding two-sample
test (Feldt, 1969; see also Charter & Feldt 1996; Feldt, 1965; Feldt et al., 1987; Feldt &
Ankenmann, 1998) is based on the statistic W = (1-αmin)/(1-αmax), where αmin is the
smaller and αmax the larger alpha coefficient in the groups. In the Group 1 data set, α̂(1) =
.902 is found, while in Group 2 α̂(2) = .892 results. Hence W = .108/.098 = 1.102. [Note
that coefficient alpha noticeably overestimates composite reliability in Group 1 and
underestimates that reliability in Group 2 (e.g., Raykov, 2001a, in press/2001b).] Since
W follows an F-distribution with degrees of freedom df1 = df2 = 300 – 1 = 299 (e.g.,
Feldt, 1969), the associated probability value is p > .10 (obtained, e.g., via use of the SAS
function PROBF(.,.,.); SAS Institute, 1988). Therefore, with this conventional method it
would be incorrectly concluded that the null hypothesis (15) could not be rejected. This
conclusion would contradict the notable difference in the true scale reliability
coefficients, ∆ρY = -.054, determined earlier and also sensed with the method outlined
in this section. The incorrect end result with the conventional two-sample test would be
explained with the misestimation of scale reliability by α in opposite directions in the two
groups (e.g., Williams & Zimmerman, 1996; Zimmerman, 1972; Zimmerman et al.,
1993; Raykov, 1997b, 2001a).
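The alpha coefficients and the statistic W used in this comparison can be recomputed from Table 4. The Python sketch below is illustrative: the helper name `cronbach_alpha` is chosen here, the alphas are rounded to three decimals before forming W to mirror the computation in the text, and the p-value of the F test is not computed because that would need an F-distribution routine outside the standard library.

```python
def cronbach_alpha(lower_rows):
    """Coefficient alpha from the lower triangle (diagonal included) of a covariance matrix."""
    k = len(lower_rows)
    sum_var = sum(row[-1] for row in lower_rows)
    sum_cov = sum(sum(row[:-1]) for row in lower_rows)
    return k / (k - 1) * (1.0 - sum_var / (sum_var + 2.0 * sum_cov))


group1 = [
    [1.602],
    [1.580, 1.982],
    [1.071, 1.299, 2.260],
    [1.168, 1.353, 1.474, 2.287],
    [1.308, 1.489, 1.750, 1.661, 2.783],
]
group2 = [
    [1.602],
    [0.597, 2.165],
    [1.123, 1.525, 2.372],
    [1.230, 1.518, 1.569, 2.393],
    [1.373, 1.633, 1.824, 1.744, 2.826],
]
alpha1 = round(cronbach_alpha(group1), 3)
alpha2 = round(cronbach_alpha(group2), 3)
a_min, a_max = sorted([alpha1, alpha2])
W = (1 - a_min) / (1 - a_max)

# Population reliabilities from the generating models (20) and (21)
true_rho1 = 36 / (36 + 4 + 1.2)
true_rho2 = 36 / (36 + 4 - 1.2)

print(alpha1, alpha2, round(W, 3))               # 0.902 0.892 1.102
print(round(true_rho1, 3), round(true_rho2, 3))  # 0.874 0.928
```

The alphas differ by .010 in one direction while the population reliabilities differ by .054 in the other, which is the opposite-direction misestimation the paragraph describes.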
In conclusion of this section, we note that in exactly the same manner one can test
differences across groups in reliability of different scales (or different modes of
administration/presentation of the same scale—e.g., paper-and-pencil vs. computer
administration), or of the same scale across repeated assessments (Raykov, 2001d; see also
Raykov, 2000). Similarly, the presented method is readily applied in settings with more than 2
groups (more than 2 repeated assessments), by imposing the critical constraint (18), or (19) if
pertinent, on all subsequent pairs of groups or pairs of repeated assessments.
The last hypothesis (24) captures the essence of the typical effort involved in a scale
revision, to yield a composite having higher reliability than an initial version of it. (No change is
required in the logic of the following method when instead of (24) the alternative hypothesis of
interest is HA: ρY,k < ρY,m .) Using (2) and setting scale reliability before revision being equal to
scale reliability after the revision, simple algebra leads to the equivalent parameter restriction
(Raykov & Grayson, 2002):
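(25)   $\theta_m = -\theta_{k+1} - \theta_{k+2} - \dots - \theta_{m-1} + (b_{k+1} + b_{k+2} + \dots + b_m)^2 (\theta_1 + \theta_2 + \dots + \theta_k)(b_1 + b_2 + \dots + b_k)^{-2} + 2(b_{k+1} + b_{k+2} + \dots + b_m)(\theta_1 + \theta_2 + \dots + \theta_k)(b_1 + b_2 + \dots + b_k)^{-1}$ .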
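If the revision consists of deletion of just one component, i.e., m = k+1, which seems to be
the case most frequently encountered in scale development applications, (25) simplifies to
(26)   $\theta_m = b_m^2 (\theta_1 + \theta_2 + \dots + \theta_k)(b_1 + b_2 + \dots + b_k)^{-2} + 2 b_m (\theta_1 + \theta_2 + \dots + \theta_k)(b_1 + b_2 + \dots + b_k)^{-1}$ .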
Equation (25), or (26) if applicable, represents a nonlinear constraint imposed upon the
parameters of the model for the pre-revised (longer) scale, i.e., in the congeneric model (1) for the
m components Y1, Y2, ..., Ym. Therefore, testing the null hypothesis (22) is equivalent to testing the
nonlinear constraint (25) or (26), whichever is applicable, via two nested models (see, e.g.,
preceding section). This test is accomplished with LISREL8.50 as demonstrated next. To this
end, simulated data for N = 300 subjects on m = 5 variables are used, which were generated using
LISREL according to the following model:
(27) Y1 = η1 + E1 ,
Y2 = 1.3η1 + E2 ,
Y3 = 1.6η1 + E3 ,
Y4 = 1.9η1 + E4 ,
Y5 = .1η1 + E5 ,
where η1 is generated as a standard normal construct score, and the variances of E1 to E5 are set at
.4, .6, .8, 1 and 1.5, respectively. The covariance matrix of the resulting simulated data set is
presented in Table 5 (Raykov & Grayson, 2002).
Table 5
Y1 1.40
Y2 1.31 2.23
Y3 1.73 2.27 3.62
Y4 1.93 2.53 3.26 4.81
Y5 0.17 0.09 0.17 0.25 1.49
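To see what constraint (25) demands in this example, one can evaluate it at the population values of model (27). The Python sketch below is an illustration with hypothetical function names; it computes the error variance of Y5 that would leave the five-component scale exactly as reliable as the four-component one, and compares it with the value of 1.5 used to generate the data.

```python
def theta_m_under_equal_reliability(b_short, theta_short, b_added, theta_added_but_last=()):
    """Right-hand side of constraint (25): the last error variance that equates reliabilities."""
    B = sum(b_short)       # b_1 + ... + b_k
    T = sum(theta_short)   # theta_1 + ... + theta_k
    C = sum(b_added)       # b_{k+1} + ... + b_m
    return -sum(theta_added_but_last) + C ** 2 * T / B ** 2 + 2.0 * C * T / B


def reliability(b, theta):
    """Equation (2): uncorrelated errors, unit weights, Var(eta) = 1."""
    return sum(b) ** 2 / (sum(b) ** 2 + sum(theta))


# Population parameters of model (27)
b = [1.0, 1.3, 1.6, 1.9, 0.1]
theta = [0.4, 0.6, 0.8, 1.0, 1.5]

theta5_equal = theta_m_under_equal_reliability(b[:4], theta[:4], b[4:])
rho4 = reliability(b[:4], theta[:4])                     # scale without Y5
rho5 = reliability(b, theta)                             # scale with Y5 as generated
rho5_equal = reliability(b, theta[:4] + [theta5_equal])  # scale with Y5 under the constraint

print(round(theta5_equal, 3), round(rho4, 2), round(rho5, 2))  # 0.097 0.92 0.89
```

The generating error variance of 1.5 is far above the value of about .097 required by the constraint, which is why the constrained model is expected to fit poorly.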
The LISREL8.50 input file implementing the test of the null hypothesis (22) on this data
is as follows.
Reliability5.ls8
For the data in Table 5, the full model--correspondingly set out as a congeneric test
model--yields a chi-square value (χ²) = 4.05 for 5 degrees of freedom (df) with associated p-value
(p) = .54 and a root mean square error of approximation (RMSEA) of .0 with a 90%-confidence
interval (.0; .073) (Joreskog & Sorbom, 1996; Raykov & Grayson, 2002). With the method in
section 1 of this document, one obtains ρ̂Y,5 = .90; using Equation (2) with the construct loadings
and error variances from the model in (27) that generated the data, yields the true scale reliability
ρY,5 = .89. In this model, the estimate of b5 on the underlying construct (.11) is at least 9 times
smaller than any of the remaining component loadings while its error variance estimate (1.48) is
nearly 14 times larger than that loading. This indicates that Y5 is possibly not contributing to
precise measurement of the trait in common to the remaining 4 components, Y1 through Y4; it is
therefore worthwhile testing if deletion of Y5 leads to an improved scale, i.e., if the scale
consisting only of Y1 through Y4 has significantly higher (different) population reliability than the
starting scale version with all five components. (Note that this is a one-tailed alternative
hypothesis. The method outlined before does not necessitate, however, use of one-tailed
hypotheses only, and as mentioned earlier is equally well applicable also with two-tailed
alternative hypotheses.)
To accomplish this, (26) is introduced in the full model. This nested model is associated
with χ² = 153.65, df = 6, p = 0, RMSEA = .34 (.30; .38), whereby the difference in chi-square
values is significant: ∆χ² = 149.60, ∆df = 6 – 5 = 1, p < .001. To estimate the reliability of the
so-revised scale, first the congeneric model with these 4 indicators yields χ² = .68, df = 2, p = .71,
and RMSEA = 0 (0; .08). The method in Section 1 furnishes ρ̂Y,4 = .93 as estimate of reliability
of the revised scale consisting of Y1 to Y4. That is, in the sample the revised scale is associated
with a reliability coefficient that is .03 (= .93 - .90) higher than that of the initial scale. The question
now is whether this increase of .03 is significant, specifically if it is positive in the population.
(Applying (2) to the first four construct loadings and error variances of the data-generating
model in (27) yields the true scale reliability of Y1 + Y2 + Y3 + Y4 as ρY,4 = .92.) To answer this
question, note that halving the p-value associated with the above difference of 149.60 in the
chi-square values of the full and restricted models still yields a value below .01. It is therefore
concluded that the null hypothesis (22) of no change in reliability can be rejected in favor of the
alternative hypothesis (23) of an increase in scale reliability due to the removal of the last measure
from the initial scale. Thus, revising the composite Y = Y1 + Y2 + ... + Y5 by dropping the
last component Y5 has indeed led to an increase in reliability beyond what could be explained by
chance factors only, i.e., the revised scale Y = Y1 + Y2 + Y3 + Y4 has higher (population)
reliability.
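The arithmetic behind this decision can be retraced in a few lines. The following Python sketch is only an illustration and is not part of the LISREL analysis. It takes the chi-square values reported above, recovers the implied chi-square of the full model, and halves the p-value of the difference test for the one-tailed alternative. The function name chi2_sf_df1 is introduced here for illustration. Halving is assumed to be appropriate because the sample difference (.93 versus .90) lies in the hypothesized direction.

```python
import math


def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability P(X > x) for a chi-square variate with df = 1."""
    if x < 0:
        raise ValueError("chi-square statistic must be non-negative")
    # With df = 1, X is the square of a standard normal variate, so
    # P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


# Values reported in the text for the model with restriction (26).
chi2_restricted, df_restricted = 153.65, 6
delta_chi2 = 149.60   # reported chi-square difference
df_full = 5           # df of the full model, from "6 - 5 = 1"

# Implied by the two reported values; not quoted directly in this passage.
chi2_full = chi2_restricted - delta_chi2
delta_df = df_restricted - df_full

p_two_tailed = chi2_sf_df1(delta_chi2)
p_one_tailed = p_two_tailed / 2.0   # one-tailed alternative hypothesis (23)

print(f"full-model chi-square (implied): {chi2_full:.2f}")
print(f"difference in df: {delta_df}")
print(f"one-tailed p below .01: {p_one_tailed < 0.01}")
```

The closed form holds only for one degree of freedom, which is the case here because a single restriction is tested.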
Last but not least, we stress that the testing procedure outlined in this section is equally
applicable regardless of whether the components are added or deleted, irrespective of their
number, and regardless of their location within the longer scale.
6. Concluding remarks
The methods outlined in this document address frequently asked questions in scale
construction, development and evaluation in the behavioral, social and educational sciences.
While they have definite advantages over traditional methods, as indicated before, these
procedures also have certain limitations, which are discussed in more detail in the earlier cited
papers presenting them (Raykov, 2001c, in press/2001b, in press/2002; Raykov & Shrout, 2002;
Raykov & Grayson, 2002). Since the procedures represent applications of covariance structure
modeling, which is based on an asymptotic theory of model and parameter testing (e.g., Bollen,
1989), it is recommended that these approaches be used with large samples. With discrete data
(items/components) having only a limited number of response options, an initial exploratory
factor analysis of the tetrachoric correlation matrix (Joreskog & Sorbom, 1996), carried out on an
independent sample (or on a randomly halved initial sample, if large), could give an indication of
clustering of the components. Adding the components within the so-found clusters/parcels leads
to sum scores that better approximate continuous distributions, and to these scores the methods
outlined in this document can be applied (e.g., Raykov & Grayson, 2002; Raykov, in press/2002).
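The parceling step can be illustrated with a minimal Python sketch. The item names, responses, and cluster assignment below are hypothetical and serve only to show the bookkeeping. In practice the assignment of items to parcels would come from the exploratory factor analysis described above.

```python
from typing import Dict, List


def parcel_scores(items: Dict[str, List[int]],
                  parcels: Dict[str, List[str]]) -> Dict[str, List[int]]:
    """Sum the item scores of each subject within every parcel."""
    n_subjects = len(next(iter(items.values())))
    if any(len(scores) != n_subjects for scores in items.values()):
        raise ValueError("all items must be observed on the same subjects")
    return {
        name: [sum(items[item][s] for item in members)
               for s in range(n_subjects)]
        for name, members in parcels.items()
    }


# Hypothetical binary responses of four subjects to six items.
items = {
    "X1": [0, 1, 1, 0],
    "X2": [1, 1, 0, 0],
    "X3": [1, 1, 1, 0],
    "X4": [0, 1, 0, 0],
    "X5": [1, 0, 1, 1],
    "X6": [1, 1, 1, 0],
}
# Hypothetical clusters, standing in for the result of the factor analysis.
parcels = {"P1": ["X1", "X2", "X3"], "P2": ["X4", "X5", "X6"]}

scores = parcel_scores(items, parcels)
print(scores)
```

Each parcel score here can take the values 0 through 3 rather than only 0 and 1, and the parcels would then serve as the scale components in the subsequent analyses.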
References
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Charter, R. A., & Feldt, L. S. (1996). Testing the equality of two alpha coefficients. Perceptual and
Motor Skills, 82, 763-768.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of a test. Psychometrika, 16,
297-334.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Feldt, L. S. (1965). The approximate sampling distribution of Kuder-Richardson reliability
coefficient twenty. Psychometrika, 30, 357-370.
Feldt, L. S. (1969). A test of the hypothesis that Cronbach’s alpha or Kuder-Richardson
coefficient is the same for two tests. Psychometrika, 34, 363-373.
Feldt, L. S., & Ankenmann, R. D. (1998). Appropriate sample size for comparing alpha reliabilities.
Applied Psychological Measurement, 22, 170-178.
Feldt, L. S., Woodruff, D. J., & Salih, F. A. (1987). Statistical inference for coefficient alpha.
Applied Psychological Measurement, 11, 93-103.
Gilmer, J. S., & Feldt, L. S. (1983). Reliability estimation for a test with parts of unknown lengths.
Psychometrika, 48, 99-111.
Joreskog, K. G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36,
109-133.
Joreskog, K. G., & Sorbom, D. (1996). PRELIS User’s reference guide. Chicago, IL: Scientific
Software International.
Komaroff, E. (1997). Effect of simultaneous violations of essential tau-equivalence and
correlated errors on coefficient alpha. Applied Psychological Measurement, 21, 337-348.
Lord, F. M. (1955). Sampling fluctuations resulting from sampling of test items.
Psychometrika, 20, 1-22.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA:
Addison-Wesley.
McDonald, R. P. (1985). Factor analysis and related methods. Hillsdale, NJ: Lawrence
Erlbaum.
McDonald, R. P. (1999). Test theory. A unified treatment. Mahwah, NJ: Lawrence Erlbaum.
Miller, M. B. (1995). Coefficient alpha: A basic introduction from the perspective of classical test
theory and structural equation modeling. Structural Equation Modeling, 2, 255-273.
Novick, M. R., & Lewis, C. (1967). Coefficient alpha and the reliability of composite measurement.
Psychometrika, 32, 1-13.
Raykov, T. (1997a). Estimation of composite reliability for congeneric measures. Applied
Psychological Measurement, 21, 173-184.
Raykov, T. (1997b). Scale reliability, Cronbach’s coefficient alpha, and violations of essential
tau-equivalence with fixed congeneric components. Multivariate Behavioral Research,
32, 329-353.
Raykov, T. (1998a). Coefficient alpha and composite reliability with interrelated
nonhomogeneous items. Applied Psychological Measurement, 22, 375-385.
Raykov, T. (1998b). A method for obtaining standard errors and confidence intervals of
composite reliability for congeneric items. Applied Psychological Measurement, 22, 369-374.
Raykov, T. (2000). A method for examining stability in reliability. Multivariate Behavioral
Research, 35, 289-305.
Raykov, T. (2001a). Bias of Cronbach’s coefficient alpha for fixed congeneric measures with
correlated errors. Applied Psychological Measurement, 25, 69-76.
Raykov, T. (in press/2001b). Estimation of congeneric scale reliability using covariance structure
models with nonlinear constraints. British Journal of Mathematical and Statistical
Psychology (in press). This paper is available for downloading on a pay-per-view basis at
a cost of approx. $15 at http://www.bps.org.uk/publications/jMS_1.cfm , copyright The
British Journal of Mathematical and Statistical Psychology & The British Psychological
Society.
Raykov, T. (2001c). Analytic estimation of standard error and confidence interval for scale
reliability. Multivariate Behavioral Research (in press).
Raykov, T. (2001d). Studying change in scale reliability for repeated multiple measurements
via covariance structure modeling. In R. Cudeck, S. H. C. du Toit, & D. Sorbom (Eds.),
Structural Equation Modeling: Present and Future. Festschrift in Honor of Karl Joreskog
(pp. 217-230). Chicago, IL: Scientific Software International.
Raykov, T. (in press/2002). Examining group differences in reliability of multiple-component
instruments. British Journal of Mathematical and Statistical Psychology. This paper is
available for downloading on a pay-per-view basis at a cost of approx. $15 at
http://www.bps.org.uk/publications/jMS_1.cfm , copyright The British Journal of
Mathematical and Statistical Psychology & The British Psychological Society.
Raykov, T., & Shrout, P. E. (2002). Reliability of scales with general structure: Point and
interval estimation using covariance structure modeling. Structural Equation Modeling
(in press).
Raykov, T., & Grayson, D. A. (2002). A test for change of composite reliability in scale
development. Multivariate Behavioral Research (in press).
SAS Institute (1988). SAS language guide. Cary, NC: SAS Institute.
Stewart, J. (1991). Calculus. Pacific Grove, CA: Brooks/Cole.
Woodruff, D. J., & Feldt, L. S. (1986). Tests for equality of several alpha coefficients when their
sample estimates are dependent. Psychometrika, 51, 393-413.
Williams, R. H., & Zimmerman, D. W. (1996). Are simple gain scores obsolete? Applied
Psychological Measurement, 20, 59-69.
Zimmerman, D. W. (1972). Test reliability and the Kuder-Richardson formulas: Derivation from
probability theory. Educational and Psychological Measurement, 32, 939-954.
Zimmerman, D. W. (1975). Probability spaces, Hilbert spaces, and the axioms of test theory.
Psychometrika, 40, 395-412.
Zimmerman, D. W., Zumbo, B. D., & Lalonde, C. (1993). Coefficient alpha as an estimate of
reliability under violation of two assumptions. Educational and Psychological
Measurement, 53, 33-49.
The concern of this section is to present a readily applicable procedure for estimation of
maximal reliability for a linear combination of congeneric measures, denoted Y1, Y2, …,
Yk (k > 1; in the case k = 2, add identifying restrictions), and of their optimal weights
w1, w2, …, wk. (For these measures, the well-known classical test theory decomposition
Yi = bi η + Ei holds, where bi are the indicator loadings on the common true score η and
Ei are the error scores with variances θii, i = 1, …, k.)
The following method is described in detail in Raykov (in press; see above in this
document for information on how to download relevant material related to that paper).
As has been well documented in the psychometric literature (e.g., Raykov, in press, and
references therein), the maximal reliability for a scale of congeneric components is
$$\rho_{\max} = \frac{\sum_{i=1}^{k} \dfrac{\rho_i}{1 - \rho_i}}{1 + \sum_{i=1}^{k} \dfrac{\rho_i}{1 - \rho_i}},$$

$$w_i = \frac{\rho_i}{b_i (1 - \rho_i)}$$

… and related activities, one can use LISREL 8.54 for Windows with the following
nonlinear constraints for the pertinent factor loadings,

$$w_i = b_i \theta_{ii}^{-1}$$

as the optimal component weights (i = 1, …, k; see Raykov, in press, for details on how
to obtain this expression).
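As a numerical illustration of the expressions above, the following Python sketch computes the maximal reliability and the optimal weights from a set of loadings and error variances. It assumes that ρi denotes the reliability of the component Yi, that the errors are uncorrelated, and that the variance of η is fixed at 1, so that ρi = bi²/(bi² + θii). Under these assumptions ρi/(1 − ρi) = bi²/θii, and the two expressions for wi coincide. The function names and the numerical values are hypothetical and are not taken from Raykov (in press).

```python
from typing import List, Sequence, Tuple


def maximal_reliability(loadings: Sequence[float],
                        error_variances: Sequence[float]
                        ) -> Tuple[float, List[float]]:
    """Return (rho_max, optimal weights) for congeneric components.

    Assumes Var(eta) = 1 and uncorrelated errors, so the reliability of
    component i is rho_i = b_i**2 / (b_i**2 + theta_ii).
    """
    if len(loadings) != len(error_variances) or len(loadings) < 2:
        raise ValueError("need k > 1 loadings with matching error variances")
    if any(t <= 0 for t in error_variances) or any(b == 0 for b in loadings):
        raise ValueError("error variances must be positive, loadings nonzero")
    rho = [b * b / (b * b + t) for b, t in zip(loadings, error_variances)]
    ratio_sum = sum(r / (1.0 - r) for r in rho)
    rho_max = ratio_sum / (1.0 + ratio_sum)
    # Weight expression in terms of the component reliabilities.
    weights = [r / (b * (1.0 - r)) for r, b in zip(rho, loadings)]
    return rho_max, weights


def composite_reliability(weights: Sequence[float],
                          loadings: Sequence[float],
                          error_variances: Sequence[float]) -> float:
    """Reliability of sum(w_i * Y_i) under the same assumptions."""
    true_var = sum(w * b for w, b in zip(weights, loadings)) ** 2
    error_var = sum(w * w * t for w, t in zip(weights, error_variances))
    return true_var / (true_var + error_var)


# Hypothetical loadings and error variances for k = 4 components.
b = [0.9, 0.8, 0.7, 0.6]
theta = [0.19, 0.36, 0.51, 0.64]

rho_max, w = maximal_reliability(b, theta)
unit_weighted = composite_reliability([1.0] * len(b), b, theta)

print(f"maximal reliability: {rho_max:.3f}")
print(f"unit-weighted reliability: {unit_weighted:.3f}")
print("optimal weights:", [round(wi, 3) for wi in w])
```

In this sketch the optimally weighted composite attains the maximal reliability, and the unit-weighted sum Y1 + … + Yk cannot exceed it.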
References
Conger, A. (1980). Maximally reliable composites for unidimensional measures.
Educational and Psychological Measurement, 40, 367-375.
Li, H. (1997). A unifying expression for the maximal reliability of a linear composite.
Psychometrika, 62, 245-249.
Raykov, T. (in press). Estimation of maximal reliability: A note on a covariance structure
modeling approach. British Journal of Mathematical and Statistical Psychology.