Testing Testing One Two Three by Les Hayduk
Received 11 January 2006; received in revised form 1 September 2006; accepted 27 September 2006
Available online 21 December 2006
Abstract
Barrett (2007) presents minor revisions to statements previously posted on Barrett’s website, and discussed on SEMNET (a web discussion group about structural equation modeling). Unfortunately, Barrett’s ‘‘recommendations’’ remain seriously statistically and methodologically flawed. Unlike Barrett, we see scientific value in reporting models that challenge or discredit theories. We critique Barrett’s way of proceeding in the context of both especially small and large samples, and we urge greater attention to the χ² significance test.
© 2006 Elsevier Ltd. All rights reserved.
Keywords: SEM; Structural equation model; Chi-square (or χ²); Fit; Testing; Sample size
1. Introduction
In ‘‘Structural equation modeling: adjudging model fit’’ Barrett (2007) provides minor revisions
to advice previously presented on Barrett’s web site, and critiqued by one of us (Hayduk). Despite
some revisions, Barrett’s recommendations remain seriously problematic. We begin with a brief
context for Barrett’s article, before considering specific problematic statements. We argue that journal reviewers should pay greater attention to χ² significance tests, and to the theory being tested.
Structural equation modeling (SEM) grew out of the confluence of a path analytic tradition and
a factor analytic tradition. The path analytic tradition had a history of seeking to test theories
(Duncan, 1975). The exploratory factor analytic tradition had almost no theory (it is overly charitable to describe as theory the assertion that ‘‘an initially unspecified number of unknown latent variables having unspecified interconnections’’ underlie some set of items), and non-testing
was the factor analytic norm. The differences between investigating theory-laden models and
reducing data factorially became widely apparent when Les Hayduk (a supporter of the theoret-
ical-model tradition) joined SEMNET, an archived web discussion group, in 1997 and confronted
supporters of the factor analytic tradition. The extensive differences were summarised in a target
article, several commentaries, and a rejoinder published in Structural Equation Modeling (Hayduk
& Glaser, 2000a, 2000b). The Hayduk and Glaser target article demanded careful and attentive
model testing (2000a: p. 20–31), and none of the published commentaries challenged the need
for careful testing. However, testing and the deficiencies of disregarding significant ill fit became
points of discussion and debate on SEMNET and in the published research literature (Cummings,
Hayduk, & Estabrooks, 2006).
Paul Barrett participated in the SEMNET discussions and put forward recommendations on
how journal reviewers should deal with structural equation model testing (Sept 26, 2005). Les
Hayduk (Oct 1, 2005) challenged Barrett’s reviewer-statement, which led to progressively revised
statements (Oct 26, and Dec 2, 2005). In March 2006, Barrett switched venues and provided his
target article to Personality and Individual Differences (Barrett, 2007). Unfortunately, Barrett’s lat-
est attempt remains seriously deficient with respect to model testing. Some of Barrett’s statements
may strike the uninitiated as ‘‘surprisingly strong’’, indicating some responsiveness to Hayduk’s
earlier critiques, but numerous important deficiencies and contradictions remain.
Serious problems begin right in Barrett’s title. Do we want to ‘‘adjudge’’ fit, or do we want to
test our theorizing? Barrett’s title says judge fit. Our view is that researchers ought to do their best
to test their theorizing. Our scientific bias is to carefully test our structural equation models/the-
ories. Barrett’s bias displaces or avoids model/theory testing by replacing testing with adjudica-
tion, and by focusing on fit rather than on the model providing the fit. A statistician who
focuses on fit is merely ‘‘sticking to what they are supposed to know about’’ (statistics and data)
and ‘‘avoiding what they are not required to know’’ (substantive theoretical matters). Researchers
ought to be extremely interested in the relevant substantive theory. Abandoning substantive the-
ory is abandoning the life-blood of science. As researchers, we gain access to currently unknown
features of the world by proofing, checking, and testing our current theoretical understandings.
We place our current understanding of ‘‘how the world works’’ in our models as best we can,
and use diagnostic evidence accompanying any model’s ‘‘failure to fit’’ to improve our under-
standing. Barrett’s focus on ‘‘judging fit’’ rather than ‘‘testing models’’ hampers research by cir-
cumnavigating theory.
Structural equation models represent specific theory-based causal connections between latent
variables and between those latents and relevant indicator variables. Estimates of the model’s
parameters are those values which, when placed in the model’s equations, imply an indicator var-
iance/covariance matrix that is as similar as possible to the data variance/covariance matrix. The
similarity, or dissimilarity, of these matrices is usually expressed as the likelihood of observing the
data covariance matrix had the model, with its causal estimates, constituted the population from
which the data were obtained. The model-implied covariance matrix would be the population covariance matrix if the model were the proper model.
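To make the estimation and testing machinery explicit, here is a minimal sketch in standard SEM notation (our summary of textbook results, not anything quoted from Barrett): S is the data covariance matrix, Σ(θ) the model-implied covariance matrix, p the number of observed variables, t the number of free parameters, and N the sample size.

```latex
% Maximum likelihood discrepancy between the data and model-implied covariances:
F_{ML}(\theta) \;=\; \ln\lvert\Sigma(\theta)\rvert
  \;+\; \operatorname{tr}\!\bigl(S\,\Sigma(\theta)^{-1}\bigr)
  \;-\; \ln\lvert S\rvert \;-\; p

% Minimizing F_{ML} yields the estimates \hat{\theta}; the model test statistic is
T \;=\; (N-1)\,F_{ML}(\hat{\theta}) \;\approx\; \chi^{2}_{df},
\qquad df \;=\; \tfrac{p(p+1)}{2} - t
\quad \text{(when the model is properly specified).}
```

Note that the parameter estimates and the χ² test statistic both come from the same discrepancy function F_ML; we return to this point when discussing non-normality below.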
Even if the model is properly causally specified and the estimates provide proper population
parameter values, random sampling fluctuations can keep the data matrix from corresponding ex-
actly to the model-implied (population) covariance matrix. That is, some degree of ill fit between
the model’s (theory’s) implications and the observed covariance matrix is expected to appear
merely by chance alone. But the differences between a model’s implications and the data might
not result from mere chance sampling fluctuations. Sometimes the model is simply causally
mis-specified, so the differences between the model-implied and data covariance matrices originate
in real model/theory deficiencies, not mere sampling fluctuations. Scientific tradition has set an
alpha probability of .05 as a limit on the degree of covariance divergence that can be tentatively
attributed to sampling fluctuations. A properly specified model should lead to non-significant dif-
ferences between the model-implied and data covariance matrices (p > .05) 19 times out of 20.
There are reasons to increase alpha in the context of SEM testing, but for our current purposes
the consequences of equivalent and nearly-equivalent models are more important.
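The test decision just described amounts to a few lines of arithmetic. The Python sketch below shows the mechanics; the χ² and degrees-of-freedom values are hypothetical, chosen only for illustration.

```python
from scipy.stats import chi2

# Hypothetical model-test results, as reported by any SEM program:
T = 31.4   # model chi-square: (N - 1) times the minimized fit-function value
df = 19    # degrees of freedom: p(p+1)/2 minus the number of free parameters

# Probability of covariance ill fit at least this large arising from sampling
# fluctuations alone, if the model were properly specified.
p_value = chi2.sf(T, df)

print(f"chi-square = {T}, df = {df}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Significant ill fit: the data challenge the model/theory.")
else:
    print("No significant ill fit detected by this test.")
```

A p-value below .05 says the observed discrepancy is larger than sampling fluctuations comfortably explain, which is evidence against the model, not merely a comment on ‘‘fit’’.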
Properly specified models should lead to non-significant differences between the model-implied
and data covariance matrices, but two or more different causal models may imply covariance
matrices that match any one data matrix (i.e., there are equivalent models (Hayduk, 1996)). Prop-
er model/theoretical specifications should imply covariance matrices that are within sampling fluc-
tuations of the data, but some (and possibly many) causally mis-specified models can also imply
covariance matrices corresponding to the observed data. Finding a model that fits the covariance
data does not say the model is the correct model, but merely that the model is one of the several
potentially very causally different models that are consistent with the data. The possibility of mul-
tiple models that fit but are seriously causally mis-specified (be they covariance-equivalent or
nearly-covariance-equivalent models) makes it unreasonable to assume that a small degree of
covariance ill fit reports the existence of only minimal causal mis-specification. A saturated model
is guaranteed to fit, but is not guaranteed to be even close to properly causally specified. Hence,
even a slight sign of ill fit should be attended to, because it might be the first detectable sign of
serious model causal mis-specification.
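The equivalent-models point can be made concrete with a deliberately simple example of our own (not a model discussed by Barrett): a standardized causal chain x → y → z and the causally reversed chain z → y → x imply exactly the same covariance matrix, so covariance fit cannot distinguish between them.

```python
import numpy as np

def implied_covariance(B, Psi):
    """Reduced-form covariance of a recursive path model: the variables v
    satisfy v = B v + e with cov(e) = Psi, so cov(v) = (I-B)^-1 Psi (I-B)^-T."""
    I = np.eye(B.shape[0])
    A = np.linalg.inv(I - B)
    return A @ Psi @ A.T

a, b = 0.6, 0.5  # arbitrary illustrative standardized path coefficients

# Model 1: x -> y -> z   (variables ordered x, y, z; all standardized)
B1   = np.array([[0, 0, 0],
                 [a, 0, 0],
                 [0, b, 0]], dtype=float)
Psi1 = np.diag([1.0, 1 - a**2, 1 - b**2])

# Model 2: z -> y -> x   (the causal flow completely reversed)
B2   = np.array([[0, a, 0],
                 [0, 0, b],
                 [0, 0, 0]], dtype=float)
Psi2 = np.diag([1 - a**2, 1 - b**2, 1.0])

print(np.allclose(implied_covariance(B1, Psi1),
                  implied_covariance(B2, Psi2)))  # True: identical implied covariances
```

Both models fit any data generated by either of them equally well, so a fitting model is not thereby shown to be the correct causal model.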
With these comments in mind, we turn to addressing specific statements made by Barrett,
roughly in the order they appear in his article.
5. Barrett on testing
Barrett (2007, p. 816) claims the model χ² test is a ‘‘conventional null hypothesis significance test (NHST)’’. The χ² test has a null hypothesis and is a significance test, but it is rendered importantly unconventional by the possibility of mis-specified covariance-equivalent (and nearly covariance-equivalent) models. Both the proper causal model and seriously causally-misspecified
covariance-equivalent models can provide zero, or near zero, residual covariances. The existence
of covariance-equivalent models means that unlike tests applied to effects or correlations, the ‘‘en-
tity of interest’’ does not disappear or vanish when a ‘‘zero’’ hypothesized value is reached. Attain-
ing zero residual difference between the model-implied and data covariances does not mean the
entity of interest (causal mis-specification) has disappeared. This makes the consequences of
observing a hypothesis of zero effect (or zero correlation) and a hypothesis of zero-covariance-
residuals radically different. It renders some cogent critiques of ordinary null hypothesis signifi-
cance testing inapplicable to testing structural equation models, but a detailed discussion of this
would push us beyond the length limitation on this commentary.
Barrett says that ‘‘In general, the larger the sample size, the more likely a model will fail to fit
via using the chi-square goodness-of-fit test’’ (2007, p. 816). Barrett’s statement is true of only
some misspecified models, not models in general. For properly specified models, Barrett’s state-
ment is simply false because as N increases, the fit function that connects N to χ² decreases correspondingly (Bollen, 1990), and hence χ² does not increase, and does not lead to model rejection.
Barrett’s statement is also false for covariance-equivalent yet causally mis-specified models (where
again, as N increases, the fit function decreases correspondingly). We cannot prevent Barrett from
claiming that all his models are detectably wrong in general, but we can encourage everyone to
strive for models that are properly specified, not wrong. Observing that at least some wrong mod-
els are more assuredly detected by larger samples (because of decreased sampling variability) is
good methodological news to those seeking proper models!
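Bollen's (1990) point, that for properly specified models the fit function shrinks as N grows and χ² therefore does not drift toward rejection, can be checked with a small Monte Carlo sketch of our own (a deliberately trivial two-variable model, not any model of Barrett's):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

def ml_discrepancy(S, Sigma):
    """F_ML = ln|Sigma| + tr(S Sigma^-1) - ln|S| - p."""
    p = S.shape[0]
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_S = np.linalg.slogdet(S)
    return logdet_Sigma + np.trace(S @ np.linalg.inv(Sigma)) - logdet_S - p

# Properly specified toy model: two variables that really are uncorrelated in the
# population; the model fixes their covariance to zero and frees both variances (df = 1).
reps = 1000
for N in (100, 1000, 10000):
    rejections = 0
    for _ in range(reps):
        data = rng.standard_normal((N, 2))           # true population correlation is zero
        S = np.cov(data, rowvar=False)
        Sigma_hat = np.diag(np.diag(S))              # ML estimates under the (correct) model
        T = (N - 1) * ml_discrepancy(S, Sigma_hat)   # model chi-square with df = 1
        rejections += chi2.sf(T, df=1) < 0.05
    print(f"N = {N:5d}: rejection rate = {rejections / reps:.3f}  (stays near .05)")
```

Rerunning the sketch with a truly nonzero population correlation (a misspecified model) shows the opposite pattern: the rejection rate climbs toward 1.0 as N grows, which is exactly the increased power that makes larger samples good news for those seeking proper models.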
Barrett (2007, p. 817) connects covariance fit to the worth of a model. In ‘‘other areas of sci-
ence, model fit is adjudged according to how well a model predicts or explains that which it is de-
signed to predict or explain’’. This good-idea from ‘‘other areas of science’’ must be nuanced when
brought into the SEM context. Is a structural equation model designed to ‘‘predict or explain’’ a
theory? Is a model designed to encapsulate and test a theory? Structural equation models imply,
result in, and provide for variances and covariances as consequences of causal actions, but models
can fit with low or high R², and models can fail with low or high R². The R² type of predictive information is available in structural equation models, but this information does not statistically test the model. Barrett (2007, p. 818) seems to acknowledge this when he says ‘‘the [χ²] test is blind to whether the model actually predicts or explains anything to some substantive degree’’. But researchers ought to recognize that the model χ² test outcome is not foretold by the magnitude of the ‘‘proportion of explained variance’’. An incorrect causal model can lead to biasedly-large effect estimates and consequently to a biasedly-large R², where the incorrectness of the model nullifies and invalidates the high R². Explaining variance is secondary to having a proper model that
provides the variance explanation.
We do not object to multi-faceted model assessments – examining predictive accuracy in terms
of R², parsimony in terms of degrees of freedom, and theoretical substance in terms of the direct
and indirect effect routings permitted or contradicted by the estimates – but none of these replace
or displace model testing. If the model χ² test detects a causally mis-specified model, the biased
estimates and variance improperly-explained by those biased estimates, become impotent and
unconvincing. Detecting model causal mis-specification is not the only facet of model evaluation,
but it is the most fundamental facet because model mis-specification attacks the estimates that
underpin the other modes of model assessment.
Barrett does not see SEM as a tool for investigating theory, or differentiating potential mech-
anisms of action. He says ‘‘SEM is a modeling tool, and not a tool for ‘descriptive’ analysis. It fits
models to data.’’ (2007, p. 823). This misses the opportunity to see that theoretical intent can be
encapsulated in, and solid description provided by, structural equation models that investigate the
mechanisms by which a set of variables become interconnected. Barrett (2007, p. 823) says ‘‘These models require testing in order to determine the fit of a model to the data’’, but he does not
see the opportunity, potential, and demand for testing the underlying theory. He sees the fit as
being tested, not the substantive theory providing the fit!
Barrett (2007, p. 818) says that ‘‘When the model is theory-based, cross-validated predictive accu-
racy is even more compelling as the sole arbiter of model ‘acceptability’.’’ What he does not recog-
nize is that failing the model fit test can question the theory and render the predictions biased due to
biased estimates; and that cross-validation becomes double-crossing invalidation if the model is
repeatedly significantly inconsistent with the data, and no one takes notice of this. He ‘‘presumes’’
the theory is OK by repeatedly overlooking evidence potentially pointing to model specification
problems! By repeatedly overlooking or disregarding model fit test evidence, Barrett prevents the
data from speaking against the model’s theory, and somehow transforms repeated model failures
into cross-validation! This sounds like magic, but it is just methodological deficiency in disguise.
6. Barrett’s ‘‘recommendations’’
We agree that structural equation model testing via χ², degrees of freedom, and the associated
probability must be reported for all manuscripts reporting SEM results. We disagree that N < 200
or N > 10,000 provide reasonable boundaries for altering this requirement. The researcher simply
MUST report this information.
Barrett says that ‘‘if the model fits’’, the researcher might ‘‘proceed to report and discuss fea-
tures of the model’’ (2007, p. 820). Should researchers report, discuss, and publish failing models?
We argue that attentively constructed and theoretically meaningful models that fail ought to be
carefully discussed and published. Contrary to Barrett, we see discussion of well-conducted failing
models as contributing to scientific progress. Any area that is unable to openly acknowledge and
examine the deficiencies in its current theories is hampered from proceeding toward better theo-
ries. So we unequivocally disagree with Barrett’s recommendation to limit discussion and publi-
cation to only fitting models. If a model fails, the authors should not proceed to discuss the model
as if it were ‘‘OK anyway’’. They should publish a discussion of ‘‘how the world looks from this
theory/model perspective’’, and their diagnostic investigations of ‘‘how and why this theory/mod-
el perspective on the world fails’’. We need to understand what is problematic if we are to do bet-
ter next time around.
Next, consider the quotation Barrett selected from Burnham and Anderson (2002) regarding
large sample size. This quote is devious because it is not describing structural equation models,
but survival models. Burnham and Anderson were speaking in a context where ‘‘truth’’ could
be equated with not having to confront the complications of sampling fluctuations (in their sur-
vival data). Here is how Burnham and Anderson lead up to the statement quoted by Barrett:
‘‘The results in Table 5.7 provide a motivation for us to mention some philosophical issues about
model selection when truth is essentially known, or equivalently in statistical terms, when we have
a huge sample size.’’ (Burnham & Anderson, 2002, p. 219). In the context in which Burnham and
Anderson were writing, ‘‘truth’’ was merely a statement that the data are stable and not subject to
sampling fluctuations. Barrett is mistaken if he thinks the truth of structural equation models be-
comes known merely because the data covariance matrix has become stable due to a large sample.
It is simply wrong SEM statistics to think that having a stable data covariance matrix would in-
form us about the truth or falsity of the several DIFFERENT models that could be fit to that
stable data! Barrett’s Burnham and Anderson quote is so far out of context that it can not be rea-
sonably connected to structural equation modeling!
This segment is titled ‘‘Sample size’’ even though problematic discussions of N appear in Sections 1, 2, 4, and 5. In this section Barrett begins with the unreasonable claim that ‘‘SEM analyses based upon samples of less than 200 should simply be rejected outright. . .’’ unless the population is
small. This claim is based on two statistical mistakes. First, the population relevant to model test-
ing is not the list of cases from which the sample was drawn, but is the model-implied covariance
matrix (often called sigma, or Σ). In model testing, the issue is not generalizability, but sampling
fluctuations – namely whether one can reasonably, or whether one ought not, attribute the differ-
ences between the model-implied (putative population) and data covariance matrices to sampling
fluctuations. Barrett implicitly talks generalizability, when the v2 model test concerns whether the
discrepancies with the data are stable enough to challenge the model. Generalizability has an
important place in modeling, but that place is not the same statistical place as structural equation
model testing.
Barrett’s second reason for ‘‘rejecting outright’’ models based on N’s less than 200 is that tests
based on small N do not have sufficient power. Barrett’s reasoning fails because high model-test-
ing power can be attained in other ways, even if N is less than 200. For example, Browne, Mac-
Callum, Kim, Andersen, and Glaser (2002) had an N of only 72 with sufficient power to detect
severe model specification problems (Hayduk, Pazderka-Robinson, Cummings, Levers, & Beres,
2005). The statistical basis of this high power (small measurement errors or unique variances) is
well discussed in Browne et al. (2002), but Barrett fails to recognize that there are ‘‘other ways to
attain high power’’ even when N is small.
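The statistical reason power need not hinge on N alone can be stated compactly (a summary of standard asymptotic results, in our words rather than Barrett's): under misspecification the test statistic is approximately noncentral χ², and its noncentrality reflects the population discrepancy as well as the sample size.

```latex
% Approximate behaviour of the model test statistic under misspecification:
T \;\approx\; \chi^{2}_{df}(\lambda), \qquad
\lambda \;\approx\; (N-1)\, F_{ML}\!\bigl(\Sigma_{0},\, \Sigma(\theta^{*})\bigr)
% where \Sigma_0 is the population covariance matrix and \theta^* gives the best
% (pseudo-true) parameter values the misspecified model can attain.
```

Power increases with the noncentrality λ, so a modest N combined with a large population discrepancy F_ML (for example, the small unique variances in Browne et al., 2002) can still yield high power to detect misspecification.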
Barrett hesitates ‘‘to advise authors to test the power of their model’’ despite the fact that highly
significant failure of a small-N model actually is a prima facie demonstration of sufficient power.
Non-borderline failure to fit is direct evidence of sufficient power, and the multitudes of failing
small-N models constitute direct demonstrations of sufficient testing power with small N. Barrett
should have noticed that N’s considerably smaller than 200 often display sufficient power to
clearly reject and ‘‘advise revision of’’ models. For example, Hayduk (1985) discusses instances
where samples of 22, 38, and 40 in the context of experimental studies provided sufficient power to
consistently reject some models while ‘‘accepting’’ others. Barrett’s blanket rejection of structural
equation models with N’s less than 200 is seriously inconsistent with both the statistical literature
and practical experience.
Section 3 is Barrett’s most reasonable section, but there remains room for improvement. When
considering multivariate non-normality in Section 3.1, Barrett should have reported that a re-
searcher would be entirely unjustified in disregarding severe ill-fit on the basis of a trivial degree
of non-normality. It would be faulty statistics to point to some significant yet minimal degree of
non-normality as explaining or excusing severe ill-fit. Researchers pointing to non-normality as
‘‘the culprit producing ill fit’’ should be required to document that their observed degree of ill
fit is comparable in magnitude to what could reasonably result from the degree and style of
non-normality in their data. Only some kinds of non-normality lead to excessive model failures
– other kinds of non-normality can actually lead to an increased likelihood of the model fitting
(Yuan, Bentler, & Zhang, 2005). There are several investigations of the impact of the degree of
non-normality on χ² (see the references in Yuan et al., 2005), but the degree or extent of the im-
pact is much less than urban myth suggests. And researchers claiming non-normality as their ex-
cuse for ill-fit should also be required to demonstrate that the greatest ill fit actually connects to
the most non-normal variables. If the model ill-fit is driven by a huge covariance residual between
two nearly normal variables, pointing to non-normality among other variables in the model is
untenable as an excuse for the model’s ill-fit.
Barrett also fails to recognize that if the non-normality is sufficiently severe to render testing
questionable, it is also likely to be sufficiently severe to render the estimates themselves question-
able. The χ² test originates from the same fit function that provides the estimates, and hence it is
‘‘awkward’’ for a researcher to claim that the fit test is questionable while the estimates remain
sound. The dependence of both the χ² test and estimates on the same fit function urges caution
against discounting the model fit test while claiming trustworthy estimates.
We will not belabor the fact that Barrett’s Section 3 provides no instructions or recommenda-
tions to those who use the χ² test and find a fitting model.
This section returns to the statistical mistake of using increased sample size as a rationale for
ignoring the χ² test. Barrett should have been sensitive to this because it has been discussed repeat-
edly on SEMNET. He says the rationale ‘‘is always likely to revolve around the argument that as
a sample increases in size, so will increasingly trivial ‘magnitude’ discrepancies between the actual
and model implied covariance matrix assume statistical significance’’ (Barrett, 2007, p. 821). It is
reasonable to claim that increasing N increases the power to detect any given size of covariance
difference. Barrett’s mistake is to presume that the size of the covariance differences themselves
(the residuals) can be trusted to correspond to the size of the problem in the model’s specification.
Small residuals do not mean that the model mis-specifications are correspondingly small. SEM-
NET discussions between Roger Millsap and Les Hayduk (between Nov 2003 and Apr 2004)
clearly demonstrated that the size of the covariance residuals (the degree of ill-fit) and the degree
of model causal mis-specification may or may not correspond. That is, in some instances the de-
gree of causal mis-specification does correspond to the size of the covariance residuals, but in
other instances, it does not. And it is even possible for the degree of causal mis-specification in
a structural equation model to be entirely uncorrelated with the degree of covariance ill-fit dis-
played by that model.
The preceding statement is an extreme version of the well-known fact that adding a modifi-
cation-index-recommended coefficient can improve model fit even if the coefficient itself consti-
tutes a mis-specification, and even if the model the coefficient is being added to is seriously
causally-misspecified. Barrett assumes incorrectly that the size of model mis-specification can
be trusted to shrink as the size of covariance ill-fit shrinks. The statistical mistake is to think
that the seriousness of causal mis-specifications can be trusted to decline as residual covariances
decline. That is simply an unreliable statistical claim. For an example of where small covariance
residuals arose from substantial model mis-specifications, see Hayduk et al. (2005). Another way
to see this point is to consider that a seriously wrong model can be bludgeoned into fitting by
blindly following modification indices. This has received SEMNET discussion (June 1, 2006) in
the context of why ‘‘replication’’ is deficient as a way to verify the properness of structural equa-
tion models.
Barrett’s next major misstep appears in Section 5a, where a model is supposed to attain ‘‘empir-
ical adequacy’’ or to be ‘‘‘good enough’ for practical purposes’’ (Barrett, 2007, p. 822) despite
being significantly inconsistent with the data – presumably highly statistically significantly incon-
sistent with the data ‘‘given the huge sample size’’. The contradiction between ‘‘empirical ade-
quacy’’ and ‘‘conflicting with the empirical data’’ is obvious – adequacy despite the data!
Barrett’s resorting to predictive accuracy (be it R² or covariance fit) directly conflicts with
SEM as a means of investigating the mechanisms of causal action. SEM is typically not oriented
toward predicting some variance or covariance outcome – it seeks to represent and investigate the
theorized mechanisms via which various outcomes (plural) arise. Focusing on predictive accuracy
pretends that the goal of SEM is to predict something, as opposed to understanding the causal
mechanisms connecting some things.
Treating close as ‘‘‘good enough’ for practical purposes’’ (Barrett, 2007, p. 822) is statistical
risk-taking behavior, particularly when models have real-life consequences. Since even fitting
models can be seriously wrongly specified and hence potentially harmful if implemented in prac-
tice, the warning should be clear: the first indication of significant ill-fit might constitute the first
indication of huge problems. Overlooking indications of potentially huge problems is the kind of thing lawyers will gladly describe as malfeasance, dereliction of responsibility, or absence of
due diligence, if harm results. Since a small signal (minor covariance ill fit) can originate from ma-
jor model mis-specification, discounting the signal without paying careful diagnostic attention and issuing a warning to all concerned is needless risk-taking. If harm results from implementing
a wrong model in ‘‘practice’’ this may result in the SEM researcher becoming the defendant. Close
enough for practical implementation purposes is not a phrase to be employed lightly in the con-
text of SEM. Close but significant ill-fit in SEM-speak translates as ‘‘close to being sued’’ in legal-
speak. By ‘‘choosing to ignore its [χ²’s] result’’ (Barrett, 2007, p. 822), the researcher becomes
culpable.
Barrett suggests (2007, p. 822) that there is some way to examine the residual matrix and to be
able to claim that examination ‘‘of the residual matrix . . . might lead them to conclude that the
[χ²] test result is misleading’’. We notice that Barrett failed to indicate what a researcher can look
at in the covariance residuals as indicating the test is misleading. We know of no reference that
supports this, and we hear this as inviting blatant author-bias toward disregarding the χ² test.
And here is another invitation to bias. Barrett (2007, p. 822) says in the context of a researcher
‘‘choosing to ignore its [χ²’s] result’’ that the test is misleading because ‘‘it appears to have little
consequence in terms of the distribution, location, and/or size of residual discrepancy’’. Even
SEM-novices will appreciate that a single statistical value like a model χ² cannot report on var-
ious distributions or locations of residuals! How could a distribution of residuals, the placement
of multiple potential residuals, or patterns in the magnitude of residuals be reported by a single
numerical value? They cannot. Rather than biasedly pretending this constitutes a rationale for dis-
regarding χ², Barrett should have required that a significant model χ² demands diagnostic exam-
ination of the distribution, location, and size of the residual discrepancies, and possibly even
specific model features as the sources of the distributed ill-fit.
Barrett claims (2007, p. 823) to have ‘‘provided the logic of how to approach assessing model
acceptability if a sample size is huge, or where the assumptions of the chi-square test do not hold
in the sample.’’ He simply has not done this. He did not mention the normality-adjusted versions of
χ² (Satorra & Bentler, 1994) as a way to address concern for non-normality. What ‘‘logic’’ did he
provide beyond blatantly, and wrongly, asserting that you might claim the model is good anyway –
despite the evidence? What SEM logic is involved in turning away from SEM to seek external cri-
terion variables (‘‘An alternative strategy might be to include some criterion variables external to
the SEM analysis’’ (Barrett, 2007, p. 822)) if your question is about mechanisms of causal action?
This does not provide any SEM logic. It implicitly dismisses SEM and displaces what SEM testing
is capable of, by distractingly pointing the SEM researcher in non-SEM directions.
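For readers unfamiliar with the adjustment mentioned above, the Satorra-Bentler (1994) approach retains the χ² testing logic under non-normality by rescaling the ordinary ML statistic; schematically (our paraphrase of that correction, not anything proposed by Barrett):

```latex
% Satorra-Bentler scaled (mean-adjusted) test statistic:
T_{SB} \;=\; T_{ML} \,/\, \hat{c}
% where \hat{c} is a scaling correction estimated from the data's multivariate
% kurtosis, and T_{SB} is referred to the usual \chi^2_{df} reference distribution.
```

The point is that non-normality can be addressed while keeping, rather than abandoning, the model test.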
Barrett mentions parsimony at several points – often in ways that imply parsimony is an excuse
for permitting or disregarding significant model ill-fit. Parsimony is attained by adding model/the-
oretical constraints. These constraints are appropriately acknowledged in χ²’s degrees of freedom,
but if the constraints are problematic the model will (we hope) tend to fail. Once we have evidence
of failure to fit, we have evidence that the parsimony may have been ill-gotten. Model parsimony
does not provide an excuse for overlooking ill fit. The indicated degree of parsimony is confronted
and questioned by failure of the model to fit.
We will end on a point of near-agreement with Barrett. It seems obvious to us that when
researchers pay careful attention to model testing, there is less need for fit indices. Barrett says:
‘‘In fact, I would now recommend banning ALL such indices from ever appearing in any paper
as indicative of model ‘acceptability’ or ‘degree of misfit’. Model fit can no longer be claimed
via recourse to some published ‘threshold-level recommendation’. There are no recommendations
anymore which stand serious scrutiny.’’ (Barrett, 2007, p. 821, emphasis in original). The problem
is not merely one of locating an arbitrary cut-point for degree of close fit; the fundamental prob-
lem is that even tiny covariance residuals, and miniscule ill fit, can be the ONLY detectable sign of
severe structural equation model specification problems. Hence the emphasis turns from ‘‘index-
ing’’ to ‘‘testing,’’ and diagnostically investigating whatever significant ill fit the test locates.
7. Conclusion
We conclude that Barrett’s article is too statistically ill-founded to constitute reasonable advice
on structural equation model testing. Barrett seems to think he has adopted a ‘‘hard line’’. We
think he is still circumnavigating, and not addressing, the hard-point of structural equation model
testing. Barrett has provided sporadically strong assertions, but these do not nullify his culpability
for making other deficient statistical recommendations. The occasional strong assertion in favor
of testing ought not deflect the attentive reader from the remaining serious flaws in Barrett’s ap-
proach. Researchers unwilling to acknowledge the failings of their models are unlikely to do the
detailed diagnostic investigations, and novel thinking, capable of providing advancement via
structural equation modeling.
We recommend that all journal reviewers insist that authors of research articles involving struc-
tural equation models report χ², its degrees of freedom, and p-value, and that the authors also
report the implications of the diagnostics undertaken to investigate any significant model ill-fit.
Acknowledgement
The authors thank Dionne Pohler for participating in discussions of this paper.
References
Barrett, P. (2007). Structural equation modelling: adjudging model fit. Personality and Individual Differences, 42(5),
815–824. doi:10.1016/j.paid.2006.09.018.
Bollen, K. A. (1990). Overall fit in covariance structure models: two types of sample size effects. Psychological Bulletin,
107, 256–259.
Browne, M. W., MacCallum, R. C., Kim, C. T., Andersen, B. L., & Glaser, R. (2002). When fit indices and residuals are
incompatible. Psychological Methods, 7(4), 403–421.
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic
approach (2nd ed.). New York: Springer.
Cummings, G. G., Hayduk, L., & Estabrooks, C. (2006). Is the nursing work index measuring up? Moving beyond
estimating reliability to testing validity. Nursing Research, 55, 82–93.
Duncan, O. D. (1975). Introduction to structural equation models. New York: Academic Press.
Hayduk, L. A. (1985). Personal space: the conceptual and measurement implications of structural equation models.
Canadian Journal of Behavioural Science, 17(2), 140–149.
Hayduk, L. A. (1996). LISREL issues, debates and strategies. Baltimore: Johns Hopkins University Press.
Hayduk, L. A., & Glaser, D. N. (2000a). Jiving the four-step, waltzing around factor analysis, and other serious fun.
Structural Equation Modeling, 7(1), 1–35.
Hayduk, L. A., & Glaser, D. N. (2000b). Doing the four-step, right-2–3, wrong-2–3: a brief reply to Mulaik and
Millsap; Bollen; Bentler; and Herting and Costner. Structural Equation Modeling, 7(1), 111–123.
Hayduk, L. A., Pazderka-Robinson, H., Cummings, G. G., Levers, M. J., & Beres, M. A. (2005). Structural equation
model testing and the quality of natural killer cell activity measurements. BMC Medical Research Methodology, 5(1),
1–9.
Satorra, A., & Bentler, P. M. (1994). Corrections to test statistics and standard errors in covariance structure analysis.
In A. Von Eye & C. C. Clogg (Eds.), Latent variables analysis: Applications for developmental research (pp. 399–419).
Newbury Park, CA: Sage.
Yuan, K.-H., Bentler, P. M., & Zhang, W. (2005). The effect of skewness and kurtosis on mean and covariance structure
analysis: the univariate case and its multivariate implication. Sociological Methods and Research, 34(2), 240–258.