Regression Primer
Knowing about regression analysis will help you to learn about SEM. Although the techniques considered next analyze observed variables only, their basic principles make up a core part of SEM. This includes the dependence of the results not only on what is measured (the data), but also on what is not measured, or omitted relevant variables, a kind of specification error. Some advice: Even if you think that you already know a lot about regression, you should nevertheless read this primer carefully. This is because many readers tell me that they learned something new after hearing about the issues outlined here. Next I assume that standard deviations (SD) for continuous variables are calculated as the square root of the sample variance s² = SS/df, where SS refers to the sum of squared deviations from the mean and the overall degrees of freedom are df = N – 1. Standardized scores, or normal deviates, are calculated as z = (X – M)/SD for a continuous variable X.
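To make these conventions concrete, here is a minimal Python sketch (my own illustration, not part of the primer) that computes SD as the square root of SS/df with df = N – 1 and then converts raw scores to normal deviates; the short data vector is hypothetical.

from math import sqrt

def sample_sd(scores):
    """SD as the square root of s^2 = SS/df, where df = N - 1."""
    n = len(scores)
    mean = sum(scores) / n
    ss = sum((x - mean) ** 2 for x in scores)   # sum of squared deviations from the mean
    return sqrt(ss / (n - 1))

def z_scores(scores):
    """Normal deviates, z = (X - M)/SD, for a continuous variable X."""
    n = len(scores)
    mean = sum(scores) / n
    sd = sample_sd(scores)
    return [(x - mean) / sd for x in scores]

X = [12, 14, 16, 17, 19, 21, 22]   # hypothetical raw scores
print(sample_sd(X))                # sample SD with df = N - 1
print(z_scores(X))                 # standardized scores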
see in Equation R.2 that BX is just a rearrangement of the expression for the covariance between X and Y, or covXY = rXY SDX SDY. Thus, BX corresponds to the covariance structure of Equation R.1. Because BX reflects the original metrics of X and Y, its value will change if the scale of either variable is altered (e.g., X is measured in centimeters instead of inches). For the same reason, values of BX are not limited to a particular range. For example, it may be possible to derive values of BX such as –7.50 or 1,225.80, depending on the raw score metrics of X and Y. Consequently, a numerical value of BX that appears “large” does not necessarily mean that X is an important or strong predictor of Y.

The intercept AX of Equation R.1 is related to both BX and the means of both variables:

AX = MY – BX MX   (R.3)

The term AX represents the mean structure of Equation R.1 because it conveys information about the means of both variables (and the regression coefficient), albeit with a single number. As stated, Ŷ = AX when X = 0, but sometimes scores of zero are impossible on certain predictors (e.g., there is no IQ score of zero in conventional standardized metrics for such scores). If so, scores on X may be centered, or converted to mean deviations x = X – MX, before analyzing the data. (Scores on Y are not centered.) Once centered, x = 0 corresponds to a score that equals the mean in the original (uncentered) scores, or X = MX. When regressing Y on x, the value of the intercept Ax equals Ŷ when x = 0; that is, the intercept is the predicted score on Y when X takes its average value in the raw data. Although centering generally changes the value of the intercept (AX ≠ Ax), centering does not affect the value of the unstandardized regression coefficient (BX = Bx). Exercise 2 asks you to prove this point for the data in Table R.1.

Regression residuals, or (Y – Ŷ), sum to zero and are uncorrelated with the predictor, or

rX(Y – Ŷ) = 0   (R.4)

The equality represented in Equation R.4 is required in order for the computer to calculate unique values of the regression coefficient and intercept in a particular sample. Conceptually, assuming independence of residuals and predictors, or the regression rule (Kenny & Milan, 2012), permits estimation of the explanatory power of the latter (e.g., BX for X in Equation R.1) controlling for omitted (unmeasured) predictors. Bollen (1989) referred to this assumption as pseudo-isolation of the measured predictor X from all other unmeasured predictors of Y. This term describes the essence of statistical control where BX is
FIGURE R.1. Unstandardized prediction lines for regressing Y on X and for regressing X on Y for the data in
Table R.1.
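Before moving on, here is a minimal Python sketch (my own illustration, using the summary statistics for Table R.1 that appear in the Answers section: MX = 16.900, SDX = 3.007, MY = 102.950, SDY = 10.870, rXY = .686) of how Equations R.2 and R.3 produce BX and AX, and of why centering X changes only the intercept.

# Summary statistics for the Table R.1 data (taken from the Answers section)
M_X, SD_X = 16.900, 3.007
M_Y, SD_Y = 102.950, 10.870
r_XY = .686

# Unstandardized coefficient and intercept (Equations R.2-R.3):
# B_X equals the covariance over the variance of X, which is r * SD_Y / SD_X
B_X = r_XY * SD_Y / SD_X
A_X = M_Y - B_X * M_X
print(round(B_X, 3), round(A_X, 3))   # about 2.48 and 61.04; the Answers report 2.479 and 61.054 after rounding

# Centering X (x = X - M_X) leaves the slope unchanged but moves the intercept
M_x = 0.0                      # mean of the centered predictor
B_x = r_XY * SD_Y / SD_X       # same slope: centering does not rescale X
A_x = M_Y - B_x * M_x          # intercept for centered X equals M_Y
print(round(B_x, 3), round(A_x, 3))   # same slope; intercept now 102.950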
that is, a score one standard deviation above the mean on X predicts a score almost seven-tenths of a standard deviation above the mean on Y. A standardized regression coefficient thus equals the expected difference on Y in standard deviation units, given an increase on X of one full standard deviation. Unlike the unstandardized regression coefficient BX (see Equation R.2), the value of the standardized regression coefficient (rXY) is unaffected by the scale on either X or Y. It is true that (1) rXY = .686 is also the standardized coefficient when regressing zX on zY, and (2) the standardized prediction equation in this case is ẑX = rXY zY.

There is a special relation between rXY and the unstandardized predicted scores. If Y is regressed on X, for example, then

1. rXY = rYŶ; that is, the bivariate correlation between X and Y equals the bivariate correlation between Y and Ŷ;
2. the observed variance in Y can be represented as the exact sum of the variances of the predicted scores and the residuals, or s²Y = s²Ŷ + s²(Y – Ŷ); and
3. r²XY = s²Ŷ / s²Y, which says that the squared correlation between X and Y equals the ratio of the variance of the predicted scores over the variance of the observed scores on Y.

The equality just stated is the basis for interpreting squared correlations as proportions of explained variance, and a squared correlation is the coefficient of determination. For the data in Table R.1, r²XY = .686² = .470, so we can say that X explains about 47.0% of the variance in Y, and vice versa. Exercise 3 asks you to verify the second and third equalities just described for the data in Table R.1.

When replication data are available, it is actually better to compare unstandardized regression coefficients, such as BX, across different samples than to compare standardized regression coefficients, such as rXY. This is especially true if those samples have different variances on X or Y. This is because the correlation rXY is standardized based on the variability in a particular sample. If variances in a second sample are not the same, then the basis of standardization is not constant over the first and second samples. In contrast, the metric of BX is that of the raw scores for variables X and Y, and these metrics are presumably constant over samples.

Unstandardized regression coefficients are also better when the scales of all variables are meaningful rather than arbitrary. Suppose that Y is the time to complete an athletic event and X is the number of hours spent in training. Assuming a negative covariance, the value of BX would indicate the predicted decrease in performance time for every additional hour of training. In contrast, standardized coefficients describe the effect of training on performance in standard deviation units, which discard the original—and meaningful—scales of X and Y. The assumptions of bivariate regression are essentially the same as those of multiple regression. They are considered in the next section.

MULTIPLE REGRESSION

The logic of multiple regression is considered next for the case of two continuous predictors, X and W, and a continuous criterion Y, but the same ideas apply if there are three or more predictors. The form of the unstandardized equation for regressing Y on both X and W is

Ŷ = BX X + BW W + AX,W   (R.8)

where BX and BW are the unstandardized partial regression coefficients and AX,W is the intercept. The coefficient BX estimates the change in Y, given a 1-point change in X while controlling for W. The coefficient BW has the analogous meaning for the other predictor. The intercept AX,W equals the predicted score on Y when the scores on both predictors are zero, or X = W = 0. If zero is not a valid score on either predictor, then Y can be regressed on centered scores (x = X – MX, w = W – MW) instead of the original scores. If so, then Ŷ = Ax,w, given X = MX and W = MW. As in bivariate regression, centering does not affect the values of the regression coefficients for each predictor in Equation R.8 (i.e., BX = Bx, BW = Bw).

The overall multiple correlation is actually just the Pearson correlation between the observed and predicted scores on the criterion, or RY·X,W = rYŶ. Unlike bivariate correlations, though, the range of R is 0–1.0. The statistic R² equals the proportion of variance explained in Y by both predictors X and W, controlling for their intercorrelation. For the data in Table R.1, the unstandardized regression equation is

Ŷ = 2.147 X + 1.302 W + 2.340
and the multiple correlation equals .759. Given these results, we can say that

1. a 1-point change in X predicts an increase in Y of 2.147 points, controlling for W;
2. a 1-point change in W predicts an increase in Y of 1.302 points, controlling for X;
3. Ŷ = 2.340, given X = W = 0; and
4. the predictors explain .759² = .576, or about 57.6%, of the total variance in Y, after taking account of their intercorrelation (rXW = .272; Table R.1).

The regression equation just described defines a plane in three dimensions where the slope along the X-axis is 2.147, the slope along the W-axis is 1.302, and the Y-intercept for X = W = 0 is 2.340. This regression surface is plotted in Figure R.2 over the range of scores in Table R.1.

Equations for the unstandardized partial regression coefficients for each of two continuous predictors are

BX = bX (SDY / SDX)  and  BW = bW (SDY / SDW)   (R.9)

where bX and bW for X and W are, respectively, their standardized partial regression coefficients, also known as beta weights. Their formulas are listed next:

bX = (rXY – rWY rXW) / (1 – r²XW)  and  bW = (rWY – rXY rXW) / (1 – r²XW)   (R.10)

In the numerators of Equation R.10, the bivariate correlation of each predictor with the criterion is adjusted for the correlation of the other predictor with the criterion and for the correlation between the two predictors. The denominators in Equation R.10 adjust the total standardized variance by removing the proportion shared by the two predictors. If the values of rXY, rWY, and rXW vary over samples, then values of coefficients in Equations R.8–R.10 will also change. Given three or more predictors, the formulas for the regression coefficients are more complicated but follow the same principles (see Cohen et al., 2003, pp. 636–642). If there is just a single predictor X, then bX = rXY.

The intercept in Equation R.8 can be expressed as a function of the unstandardized partial regression coefficients and the means of all three variables as follows:

AX,W = MY – BX MX – BW MW   (R.11)

The regression equation for standardized variables is

ẑY = bX zX + bW zW   (R.12)

For the data in Table R.1, bX = .594, which says that the difference on Y is expected to be about .60 standard deviations, given a difference of one full standard deviation on X while controlling for W.
FIGURE R.2. Unstandardized regression surface for predicting Y from X and W for the data in Table R.1.
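The sketch below (my own illustration) reproduces the beta weights and the multiple correlation for Table R.1 from the bivariate correlations quoted in the text (rXY = .686, rWY = .499, rXW = .272), using Equations R.9 and R.10; small discrepancies with the reported values reflect rounding.

# Bivariate correlations for the Table R.1 data (values quoted in the text)
r_XY, r_WY, r_XW = .686, .499, .272

# Beta weights (Equation R.10)
b_X = (r_XY - r_WY * r_XW) / (1 - r_XW ** 2)
b_W = (r_WY - r_XY * r_XW) / (1 - r_XW ** 2)
print(round(b_X, 3), round(b_W, 3))      # about .594 and .337

# Squared multiple correlation and R (cf. R^2 = .576 and R = .759 in the text)
R2 = b_X * r_XY + b_W * r_WY
print(round(R2, 3), round(R2 ** .5, 3))

# Converting a beta weight to an unstandardized coefficient (Equation R.9)
# requires the standard deviations; SD_W is not reported in this excerpt, so
# only B_X is computed here, from SD_Y = 10.870 and SD_X = 3.007.
SD_Y, SD_X = 10.870, 3.007
B_X = b_X * SD_Y / SD_X
print(round(B_X, 3))                      # close to the 2.147 reported for X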
other measured predictor, W, and (c) all unmeasured predictors. But if the relation between X and Y is appreciably curvilinear or conditional, the value of BX could misrepresent predictive power. A conditional relation implies interaction, where the covariance between X and Y changes over the levels of at least one other predictor, measured or unmeasured. A curvilinear relation of X to Y is also conditional in the sense that the shape of the regression surface changes over the levels of X (e.g., Figure 7.7). How to represent curvilinear or interactive effects in regression analysis and SEM is considered in Chapter 7.

2. All predictors are perfectly reliable (no measurement error). This very strong assumption is necessary because there is no direct way in standard regression analysis to represent or control for less-than-perfect score reliability for the predictors. Consequences of minor violations of this requirement may not be critical, but more serious ones can result in substantial bias. This bias can affect not only the regression weights of predictors measured with error but also those of other predictors. It is difficult to anticipate the direction of this propagation of measurement error. Depending on sample intercorrelations, some absolute regression weights may be biased upward (too large), but others may be biased in the other direction (too small), or attenuation bias. There is no requirement that the criterion be measured without error, but the use of a psychometrically deficient measure of it can reduce the value of R². Note that measurement error in the criterion affects only the standardized regression coefficients, not the unstandardized ones. If the predictors are also measured with error, then these effects for the criterion could be amplified, diminished, or canceled out, but it is best not to hope for the absence of bias; see Williams et al. (2013) for more information about measurement error in regression analysis.

3. Significance tests in regression assume that the residuals are normally distributed and homoscedastic. The homoscedasticity assumption means that the residuals have constant variance across all levels of the predictors. Distributions of residuals can be heteroscedastic (the opposite of homoscedastic) or non-normal due to outliers, severe non-normality in the observed scores, more measurement error at some levels of the criterion or predictors, or a specification error. The residuals should always be inspected in regression analyses (see Cohen, Cohen, West, & Aiken, 2003, chap. 4). Reports of regression analyses without comment on the residuals are inadequate. Exercise 6 asks you to inspect the residuals for the multiple regression analysis of the data in Table R.1. Although there is no requirement in regression for normal distributions of the original scores, values of multiple correlations and absolute partial regression coefficients are reduced if the distributions for a predictor and the criterion have very different shapes, such as very positively skewed on one versus very negatively skewed on the other.

4. There are no causal effects among the predictors (i.e., there is a single equation). Because predictors and criteria are theoretically interchangeable in regression, such analyses can be viewed as strictly predictive. But sometimes the analysis is explicitly or implicitly motivated by causal hypotheses, where a researcher views the regression equation as a prototypical causal model with the predictors as causes and the criterion as their outcome (Cohen et al., 2003). If predictors in standard regression analyses are viewed as causal, then we must assume there are no causal effects among them. Specifically, standard regression analyses do not allow for indirect causal effects where one predictor, such as X, affects another, such as W, which in turn affects the criterion, Y. The indirect effect just described would be represented in SEM by the presumed causal order

X → W → Y

From a regression perspective, (1) variable W is both a predictor (of Y) and an outcome (of X), and (2) there are actually two equations, one for W and another for Y. But standard regression techniques analyze a single equation at a time, in this case for just Y, and thus yield estimates of direct effects only. If there are appreciable indirect effects but such effects are not explicitly represented in the analysis, then estimates of direct effects in standard regression analyses can be very wrong (Achen, 2005). The idea behind this type of bias is elaborated in Chapter 6, which concerns a graph-theoretic approach to causal inference.

5. There is no specification error. A few different kinds of potential mistakes involve specification error. These include the failure to estimate the correct functional form of relations between predictors and the criterion, such as assuming unconditional linear effects only when there are sizable curvilinear or interactive effects. Use of the incorrect estimation method is another kind of error. For example, OLS estimation is for continuous criteria, but dichotomous outcomes (e.g.,
pass–fail) generally require different methods, such as those used in logistic regression. Including predictors that are irrelevant in the population is a specification error. The concern is that an irrelevant predictor could in a particular sample relate to the criterion by sampling error alone, and this chance covariance may distort values of regression coefficients for other predictors. Omitting from the regression equation predictors that (1) account for some unique proportion of criterion variance and (2) covary with measured predictors is left-out variables error, described next.

or not W is included in the regression equation, given rXW = 0. Now suppose that

rXY = .40, rWY = .60, and rXW = .60

Now we assume that the correlation between the included predictor X and the omitted predictor W is .60, not zero. In the bivariate analysis with X as the sole predictor, rXY = .40 (the same as before), but now the results of the multiple regression analysis are

Perhaps the most general definition is that suppression occurs when either (1) the absolute value of a predictor’s beta weight is greater than that of its bivariate correlation with the criterion or (2) the two have different signs (see also Shieh, 2006). So defined, suppression implies that the estimated relation between a predictor and a criterion while controlling for other predictors is a “surprise,” given the bivariate correlations. Suppose that X is the amount of psychotherapy, W is the degree of depression, and Y is the number of prior suicide attempts. The bivariate correlations in a hypothetical sample are

rXY = .19, rWY = .49, and rXW = .70

bX = –.40, bW = .80, and R²Y·X,W = .48

This example of classical suppression (i.e., rXY = 0, bX = –.40) demonstrates that bivariate correlations of zero can mask true predictive relations once other variables are controlled. There is also reciprocal suppression, which can occur when two variables correlate positively with the criterion but negatively with each other. Some cases of suppression can be modeled in SEM as the result of inconsistent direct versus indirect effects of causally prior variables on outcome variables. These possibilities are explored later in the book.
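A brief Python sketch of how suppression shows up in the beta weights, applying Equation R.10 to the hypothetical psychotherapy (X), depression (W), and prior suicide attempt (Y) correlations quoted above; the printed values are my own calculation rather than results reported in the text.

# Bivariate correlations from the hypothetical suppression example in the text:
# X = amount of psychotherapy, W = degree of depression, Y = prior suicide attempts
r_XY, r_WY, r_XW = .19, .49, .70

# Beta weights from Equation R.10
denom = 1 - r_XW ** 2
b_X = (r_XY - r_WY * r_XW) / denom
b_W = (r_WY - r_XY * r_XW) / denom
print(round(b_X, 3), round(b_W, 3))   # roughly -.300 and .700

# Suppression by the definition in the text: the beta weight for X has a
# different sign than its bivariate correlation (or is larger in absolute value)
print(b_X * r_XY < 0 or abs(b_X) > abs(r_XY))   # True, so X is involved in suppression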
evaluation of the predictive power of the psychological variable, over and beyond that of the simple demographic variables. The latter can be estimated as the increase in the squared multiple correlation, or ΔR², from that of step 1 with demographic predictors only to that of step 2 with all predictors in the equation.

An example of the statistical standard is stepwise regression, where the computer selects predictors for entry based solely on statistical significance; that is, which predictor, if entered into the equation, would have the smallest p value for the test of its partial regression coefficient? After selection, predictors at a later step can be removed from the equation according to p values (e.g., if p ≥ .05 for a predictor in the equation at a particular step). The stepwise process stops when there could be no statistically significant ΔR² by adding more predictors. Variations on stepwise regression include forward inclusion, where selected predictors are not later removed from the equation, and backward elimination, which begins with all predictors in the equation and then automatically removes them, but such methods are directed by the computer, not you.

Problems of stepwise and related methods are so severe that they are actually banned in some journals (Thompson, 1995), and for good reasons, too. One problem is extreme capitalization on chance. Because every result in these methods is determined by p values in a particular sample, the findings are unlikely to replicate. Another problem is that not all stepwise regression procedures report p values that are corrected for the total number of variables that were considered for inclusion. Consequently, p values in stepwise computer output are generally too low, and absolute values of test statistics are too high; that is, the computer’s choices could actually be wrong. Even worse, such methods give the false impression that the researcher does not have to think about predictor selection. Stepwise and related methods are anachronisms in modern data analysis. Said more plainly, death to stepwise regression, think for yourself (e.g., hierarchical entry)—see Whittingham, Stephens, Bradbury, and Freckleton (2006) for more information.

Once a final set of rationally selected predictors has been entered into the equation, they should not be subsequently removed if their regression coefficients are not statistically significant. To paraphrase Loehlin (2004), the researcher should not feel compelled to drop every predictor that is not significant. In smaller samples, the power of significance tests may be low, and removing a nonsignificant predictor can substantially alter the solution. If you had good reason for including a predictor, then it is better to leave it in the equation until replication indicates that the predictor does not appreciably relate to the criterion.

PARTIAL AND PART CORRELATION

The concept of partial correlation concerns the idea of spuriousness: If the observed relation between two variables is wholly due to one or more common cause(s), their association is spurious. Consider these bivariate correlations between vocabulary breadth (Y), foot length (X), and age (W) in a hypothetical sample of elementary school children:

rXY = .50, rWY = .60, and rXW = .80

Although the correlation between foot length X and vocabulary breadth Y is fairly substantial (.50), it is hardly surprising because both are caused by a third variable, age W (i.e., maturation).

The first-order partial correlation rXY·W removes the influence of a third variable W from both X and Y. The formula is

rXY·W = (rXY – rXW rWY) / √[(1 – r²XW)(1 – r²WY)]   (R.15)

Applied to the hypothetical correlations just listed, the partial correlation between foot length and vocabulary breadth controlling for age is rXY·W = .043. (You should verify this result.) Because the association between X and Y disappears when W is controlled, their bivariate relation may be spurious. Presumed spurious associations due to common causes are readily represented in SEM.

Equation R.15 for partial correlation can be extended to control for two or more external variables. For example, the second-order partial correlation rXY·WZ estimates the association between X and Y controlling for both W and Z. There is a related coefficient called part correlation or semipartial correlation that controls for external variables out of either of two other variables, but not both. The formula for the first-order part correlation rY(X·W), for which the association between X and W is controlled but not the association between Y and W, is
rY(X·W) = (rXY – rWY rXW) / √(1 – r²XW)   (R.16)

Given the same bivariate correlations among these three variables reported earlier, the part correlation between vocabulary breadth (Y) and foot length (X), controlling only foot length for age (W), is rY(X·W) = .033. This result (.033) is somewhat smaller than the partial correlation for these data, or rXY·W = .043. In general, rXY·W ≥ rY(X·W); if rXW = 0, then rXY·W = rY(X·W).

Relations among the squares of the various correlations just described can be illustrated with a Venn-type diagram like the one in Figure R.3. The circles represent total standardized variances of the criterion Y and predictors X and W. The regions in the figure labeled a–d make up the total standardized variance of Y, so

a + b + c + d = 1.0

The squared bivariate correlations of the predictors with the criterion and the overall squared multiple correlation can be expressed as sums of the areas a, b, c, or d in Figure R.3, as follows:

r²XY = a + c  and  r²WY = b + c
R²Y·X,W = a + b + c = 1.0 – d

The squared part correlations match up directly with the unique areas a and b in Figure R.3. Each of these areas also equals the increase in the total proportion of explained variance that occurs by adding a second predictor to the equation (i.e., ΔR²); that is,

r²Y(X·W) = a = R²Y·X,W – r²WY   (R.17)
r²Y(W·X) = b = R²Y·X,W – r²XY
FIGURE R.3. Venn diagram for the standardized variances of predictors X and W and criterion Y.
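The following sketch (my own illustration) evaluates Equations R.15 and R.16 for the foot length, vocabulary breadth, and age correlations given above; to rounding, it reproduces the partial correlation of about .04 and the part correlation of about .03 quoted in the text.

from math import sqrt

# Hypothetical correlations from the text: Y = vocabulary breadth,
# X = foot length, W = age (elementary school children)
r_XY, r_WY, r_XW = .50, .60, .80

# First-order partial correlation (Equation R.15): W removed from both X and Y
r_XY_W = (r_XY - r_XW * r_WY) / sqrt((1 - r_XW ** 2) * (1 - r_WY ** 2))

# First-order part (semipartial) correlation (Equation R.16): W removed from X only
r_Y_XW = (r_XY - r_WY * r_XW) / sqrt(1 - r_XW ** 2)

print(round(r_XY_W, 3), round(r_Y_XW, 3))   # about .042 and .033 (the text reports .043 and .033)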
r²WY·X = b / (b + d) = (R²Y·X,W – r²XY) / (1 – r²XY)

For the data in Table R.1, r²Y(X·W) = .327 and r²XY·W = .435. In words, predictor X uniquely explains .327, or 32.7%, of the total variance of Y (squared part correlation). Of the variance in Y not already explained by W, predictor X accounts for .435, or 43.5%, of the remaining variance (squared partial correlation). Exercise 7 asks you to calculate and interpret the corresponding results for the other predictor, W, and the same data.

When predictors are correlated—which is just about always—beta weights, partial correlations, and part correlations are alternative ways to describe in standardized terms the relative explanatory power of each predictor controlling for the rest. None is more “correct” than the others because each gives a different perspective on the same data. Note that unstandardized regression coefficients (B) are preferred when comparing results for the same predictors and criterion across different samples.

OBSERVED VERSUS ESTIMATED CORRELATIONS

The Pearson correlation estimates the degree of linear association between two continuous variables. Its equation is

rXY = covXY / (SDX SDY) = (Σ zXi zYi) / df   (R.19)

where df = N – 1. Rodgers and Nicewander (1988) described a total of 11 other formulas, each of which represents a different conceptual or computational definition of r, but all of which yield the same result for the same data.

A continuous variable is one for which, theoretically, any value is possible within the limits of its score range. This includes values with decimals, such as 3.75 seconds or 13.60 kilograms. In practice, variables with a range of at least 15 points or so are usually considered as continuous even if their scores are discrete, or integers only (e.g., scores of 10, 11, 12, etc.). For example, the PRELIS program of LISREL—used for data preparation—automatically classifies a variable with less than 16 levels as ordinal.

The statistic r has a theoretical maximum absolute value of 1.0. But the practical upper limit for |r| is < 1.0 if the relation between X and Y is not unconditionally linear, there is measurement error in either X or Y, or distributions for X versus Y have different shapes. The amount of variation in samples (i.e., SDX and SDY in Equation R.19) also affects the value of r. In general, restriction of range on either X or Y through sampling or case selection (e.g., only cases with higher scores on X are studied) tends to reduce values of |r|, but not always (see Huck, 1992). The presence of outliers, or extreme scores, can also distort the value of r; see Goodwin and Leech (2006) for more information.

There are other forms of the Pearson correlation for observed variables that are either natural dichotomies, such as male versus female for chromosomal sex, or ordinal (ranks). For example:

1. The point-biserial correlation (rpb) estimates the association between a dichotomy and a continuous variable (e.g., treatment vs. control, weight).
2. The phi coefficient (φ̂) is for two dichotomies (e.g., treatment vs. control, survived vs. died).
3. Spearman’s rank order correlation or Spearman’s rho (ρ̂) is for two ranked variables (e.g., finish order in a race, rank by amount of training time).

Computational formulas for all these special forms are just rearrangements of Equation R.19 for r (e.g., Kline, 2013a, pp. 138, 166).

All forms of the Pearson correlation estimate associations between observed (measured) variables. Other, non-Pearson correlations assume that the underlying, or latent, variables are continuous and normally distributed. For example:

1. The biserial correlation (rbis) is for a naturally continuous variable, such as weight, and a dichotomy, such as recovered–not recovered, that theoretically represents a dichotomized continuous latent variable. For example, presumably degrees of recovery were collapsed when the observed dichotomy was created. The value of rbis estimates what the Pearson r would be if the dichotomized variable were continuous and normally distributed.
2. The polyserial correlation is the generalization of
rbis that does basically the same thing for a naturally continuous variable and a theoretically continuous-but-polytomized variable (i.e., categorized into three or more levels). Likert-type response scales for survey or questionnaire items, such as agree, undecided, or disagree, are examples of a polytomized response continuum about the degree of agreement.
3. The tetrachoric correlation (rtet) for two dichotomized variables estimates what r would be if both measured variables were continuous and normally distributed.
4. The polychoric coefficient is the generalization of the tetrachoric correlation that estimates r but for ordinal observed variables with two or more levels.

Computing polyserial or polychoric correlations is relatively complicated and requires special software, such as PRELIS in LISREL. These programs generally use a special form of maximum likelihood estimation that assumes normality of the latent continuous variables, and error variance tends to increase rapidly as the number of categories on the observed variables decreases from about five to two; that is, dichotomized continuous variables generate the greatest imprecision.

The PRELIS program can also analyze censored variables, for which values occur outside of the range of measurement. Suppose that a scale registers values of weight between 1 and 300 pounds only. For objects that weigh either less than 1 pound or more than 300 pounds, the scale tells us only that the measured weight is, respectively, at most 1 pound or at least 300 pounds. In this example, the hypothetical scale is both left censored and right censored because values less than 1 or more than 300 are not registered on the scale. There are other possibilities for censoring, but scores on censored variables are either exactly known (e.g., weight = 250) or partially known in that they fall within an interval (e.g., weight ≥ 300). The technique of censored regression, better known in economics than in the behavioral sciences, analyzes censored outcomes.

In SEM, Pearson correlations are normally analyzed as part of analyzing covariances when outcome variables are continuous. But noncontinuous outcome variables can be analyzed in SEM, too. One option is to calculate polyserial or polychoric correlations from the raw data and then fit the model to these predicted Pearson correlations. Special methods for analyzing noncontinuous variables in SEM are considered later in Chapters 17 and 18.

In both regression and SEM, it is generally a bad idea to categorize predictors or outcomes that are continuous in order to form pseudo-groups (e.g., “low” vs. “high” based on a mean split). Categorization not only discards numerical information about individual differences in the original distribution but it also tends to reduce absolute values of sample correlations when population distributions are normal. The degree of this reduction is greater as the cutting point moves further away from the mean. But if population correlations are low and the sample size is small, then categorization can actually increase absolute sample correlations. Categorization can also create artifactual main or interactive effects, especially when cutting points are arbitrary. In general, it is better to analyze continuous variables as they are and without categorizing them—see Royston, Altman, and Sauerbrei (2006) for more information.

LOGISTIC REGRESSION AND PROBIT REGRESSION

Some options to analyze dichotomous outcomes in SEM are based on logistic regression. Just as in standard multiple regression, the predictors in logistic regression can be either continuous or categorical. But the prediction equation in logistic regression is a logistic function, or a sigmoid function with an “S” shape. It is a type of link function, or a transformation that relates the observed outcomes to the predicted outcomes in a regression analysis. Each method of regression has its own special kind of link function. In standard multiple regression with continuous variables, the link function is the identity link, which says that observed scores on the criterion Y are in the same units as Ŷ, the predicted scores (e.g., Figure R.1). For noncontinuous outcomes, though, original and predicted scores are in different metrics. This is also true in logistic regression, where the link function is the logit link as explained next.

Suppose that a total of 32 patients with the same disorder are administered a daily treatment for a varying number of days (5–60). After treatment, the patients are rated as recovered (1) or not recovered (0). Presented in Table R.2 are the hypothetical raw data for this example. I used Statgraphics Centurion (Statgraphics Technologies, 1982–2022)² to plot the logistic function with

² https://www.statgraphics.com/centurion-overview
TABLE R.2. Example Data Set for Logistic Regression and Probit Regression
Status n Number of days in treatment (X)
Not recovered (Y = 0) 16 6, 7, 9, 10, 11, 13, 15, 16, 18, 19, 23, 25, 26, 28, 30, 32
Recovered (Y = 1) 16 27, 30, 33, 35, 36, 39, 41, 42, 44, 46, 47, 49, 51, 53, 55, 56
95% confidence limits for these data that is presented in Figure R.4. This function generates π̂, the predicted probability of recovery, given the number of days treated, X. The confidence limits for these predictions are so wide because the sample size is small (see the figure). Because predicted probabilities are estimated from the data, they correspond to a latent continuous variable, and in this sense logistic regression (and probit regression, too) can be seen as a latent variable technique.

The estimation method in logistic regression is not OLS. Instead, it is usually a form of maximum likelihood estimation that is applied after transforming the dichotomous outcome variable into a logit, which is the natural logarithm (i.e., natural base e, or about 2.7183) of the odds of the target outcome, ω̂. The quantity ω̂ is the ratio of the probability for the target event, such as recovered, over the probability for the other event, such as not recovered. Suppose that 60% of patients recover after treatment, but the rest, or 40%, do not recover, or

π̂ = .60 and 1 – π̂ = .40

The odds of recovery are thus ω̂ = .60/.40, or 1.50; that is, the odds are 3:2 in favor of recovery. Odds are converted back to probabilities by dividing the odds by 1.0 plus the odds. For example, ω̂ = 1.50, so π̂ = 1.50/2.50 = .60, which is the probability of recovery.

Coefficients for predictors in logistic regression are calculated by the computer in a log metric, but each coefficient can be converted to an odds ratio, which
FIGURE R.4. Predicted probability of recovery with 95% confidence limits for the data in Table R.2.
estimates the difference in the odds of the target outcome, given a 1-point increase in the predictor, controlling for all other predictors. I submitted the data in Table R.2 to the Logistic Regression procedure in Statgraphics Centurion. The prediction equation in a log metric is

logit(π̂) = ln[π̂ / (1 – π̂)] = ln(ω̂) = .455 X – 13.701

where .455 is the coefficient for the predictor X, number of treatment days, and –13.701 is the intercept. Taking the antilogarithm of the coefficient for days in treatment, or

ln⁻¹(.455) = e^.455 = 1.576

gives us the odds ratio, or 1.576. This result says that for each additional day of treatment, the odds for recovery increase by 57.6%. But this rate of increase is not linear; instead, the rate at which a logistic curve ascends or descends changes according to values of the predictor. For these data, the greatest rate of change in predicted recovery occurs between 30 and 40 days of treatment. But at the extremes (X < 30 or X > 40), the rate of change in the probability of recovery is much less—see Figure R.4. The inverse logit function presented next generates the logistic curve plotted in the figure:

π̂ = logit⁻¹(.455 X – 13.701) = e^(.455 X – 13.701) / [1 + e^(.455 X – 13.701)]

An alternative method is probit regression, which analyzes binary outcomes in terms of a probit function, where probit stands for “probability unit.” Likewise, the link function in probit regression is the probit link. A probit model assumes that the observed dichotomy Y = 1 for the target outcome versus Y = 0 for other events is determined by a normal continuous latent variable Y* with a mean of zero and variance of 1.0 such that

Y = 1 if Y* ≥ 0
Y = 0 if Y* < 0   (R.20)

The equation in probit regression generates Ŷ* in the metric of normal deviates (z scores). Next, the computer uses the equation for the cumulative distribution function of the normal curve (Φ) to calculate predicted probabilities of the target outcome π̂ from values of Ŷ* for each case:

π̂ = Φ(Ŷ*)   (R.21)

Equation R.21 is known as the normal ogive model.³ I analyzed the data in Table R.2 using the Probit Analysis procedure in Statgraphics Centurion. The prediction equation is

Ŷ* = .268 X – 8.072

The coefficient for X, .268, estimates in standard deviation units the amount of change in recovery, given a one-day increase in treatment. That is, the z score for recovery increases by .268 for each additional day of treatment. Again, this rate of change is not constant because the overall relation is nonlinear (Figure R.4). Predicted probabilities of recovery for this example are generated by the probit function

π̂ = Φ(.268 X – 8.072)

The 95% confidence limits for the probit function are somewhat different than those for the logistic function for the data in Table R.2—see Figure R.4.

Logistic regression and probit regression applied in the same large samples tend to give similar results but in different metrics for the coefficients. The scaling factor that converts results from the logistic model to the same metric as the normal ogive (probit) model is approximately 1.7. For example, the ratio of the coefficients for the predictor in, respectively, the logistic and probit analyses of the data in Table R.2 is .455/.268 = 1.698, or 1.7 at single-decimal accuracy. The two procedures may generate appreciably different results if there are many cases at the extremes (predicted probabilities are close to either 0 or 1.0) or if the sample is small. Probit regression is more computationally intensive than logistic regression, but this difference is relatively unimportant for modern microcomputers with fast processors and ample memory. It can happen that computer procedures for probit regression may fail to generate a solution in smaller samples. Agresti (2019) describes additional techniques for categorical data.

³ You can see the equation for Φ at https://en.wikipedia.org/wiki/Normal_distribution
You should know about regression analysis before learning the basics of SEM. For both sets of techniques, the results are affected not only by what is measured (i.e., the data) but also by what is not measured, especially if omitted predictors covary with included predictors, which is a specification error. Accordingly, you should carefully select predictors after review of theory and results of prior studies in the area. In regression, those predictors should have adequate psychometric characteristics because there is no allowance for measurement error. The same restriction does not apply in SEM, but use of grossly inadequate measures in SEM can seriously bias the results, too. When selecting predictors, the role of judgment should be greater than that of significance testing, which can greatly capitalize on sample-specific variation.

LEARN MORE

The book by Cohen, Cohen, West, and Aiken (2003) is considered by many as a kind of “bible” for multiple regression. Royston, Altman, and Sauerbrei (2006) explain why categorizing predictor or outcome variables is a bad idea. Shieh (2006) describes suppression in more detail.

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Routledge.

Royston, P., Altman, D. G., & Sauerbrei, W. (2006). Dichotomizing continuous predictors in multiple regression: A bad idea. Statistics in Medicine, 25, 127–141.

Shieh, G. (2006). Suppression situations in multiple linear regression. Educational and Psychological Measurement, 66, 435–447.
EXERCISES
All questions concern the data in Table R.1.

1. Calculate the unstandardized regression equation for predicting Y from X based on the descriptive statistics.

2. Show that centering scores on X does not change the value of the unstandardized regression coefficient for predicting Y but does affect the value of the intercept.

3. Show that s²Y = s²Ŷ + s²(Y – Ŷ) and r²XY = s²Ŷ / s²Y when X is the only predictor of Y.

4. Calculate the unstandardized regression equation and the standardized regression equation for predicting Y from both X and W. Also calculate R²Y·X,W.

5. Calculate R̂²Y·X,W.

6. Construct a histogram of the residuals for the regression of Y on both X and W.

7. Compute and interpret r²WY·X and r²Y(X·W).
ANSWERS
1. Given the descriptive statistics and with slight rounding error:

BX = .686 (10.870/3.007) = 2.479
AX = 102.950 – 2.479 (16.900) = 61.054

2. Given MX = 16.900, the mean-centered scores (x) are –.90, –2.90, –.90, –4.90, 1.10, 1.10, –3.90, –.90, 1.10, 5.10, 1.10, 2.10, –.90, –.90, 5.10, –4.90, 3.10, –2.90, 4.10, and .10, so Mx = 0, SDx = 3.007, and rxY = .686. With slight rounding error,

BX = .686 (10.870/3.007) = 2.479

3. With slight rounding error, r²XY = s²Ŷ / s²Y = 55.570/118.155 = .470, so rXY = .686.

4. R²Y·X,W = .595 (.686) + .337 (.499) = .576

6. (Histogram of the standardized residuals, with the horizontal axis running from –2.0 to 2.0.)
REFERENCES
Achen, C. H. (2005). Let’s put garbage-can regressions and garbage-can probits where they belong. Conflict Management and Peace Science, 22(4), 327–339.

Agresti, A. (2019). An introduction to categorical data analysis (3rd ed.). Wiley.

Bollen, K. A. (1989). Structural equations with latent variables. Wiley.

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Routledge.

Goodwin, L. D., & Leech, N. L. (2006). Understanding correlation: Factors that affect the size of r. Journal of Experimental Education, 74(3), 251–266.

Huck, S. W. (1992). Group heterogeneity and Pearson’s r. Educational and Psychological Measurement, 52(2), 253–260.

Kenny, D. A., & Milan, S. (2012). Identification: A nontechnical discussion of a technical issue. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (pp. 145–163). Guilford Press.

Kline, R. B. (2013). Beyond significance testing: Statistics reform in the behavioral sciences (2nd ed.). American Psychological Association.

Loehlin, J. C. (2004). Latent variable models: An introduction to factor, path, and structural equation analysis (4th ed.). Erlbaum.

Mauro, R. (1990). Understanding L.O.V.E. (left out variables error): A method for estimating the effects of omitted variables. Psychological Bulletin, 108(2), 314–329.

Rodgers, J. L., & Nicewander, W. A. (1988). Thirteen ways to look at the correlation coefficient. American Statistician, 42(1), 59–66.

Royston, P., Altman, D. G., & Sauerbrei, W. (2006). Dichotomizing continuous predictors in multiple regression: A bad idea. Statistics in Medicine, 25(1), 127–141.

Shieh, G. (2006). Suppression situations in multiple linear regression. Educational and Psychological Measurement, 66(3), 435–447.

Statgraphics Technologies, Inc. (1982–2022). Statgraphics Centurion (Version 19.4.01) [Computer software]. https://www.statgraphics.com/

Thompson, B. (1995). Stepwise regression and stepwise discriminant analysis need not apply here: A guidelines editorial. Educational and Psychological Measurement, 55(4), 525–534.

Wherry, R. J. (1931). A new formula for predicting the shrinkage of the coefficient of multiple correlation. Annals of Mathematical Statistics, 2(4), 440–451.

Whittingham, M. J., Stephens, P. A., Bradbury, R. B., & Freckleton, R. P. (2006). Why do we still use stepwise modelling in ecology and behaviour? Journal of Animal Ecology, 75(5), 1182–1189.

Williams, M. N., Grajales, C. A. G., & Kurkiewicz, D. (2013). Assumptions of multiple regression: Correcting two misconceptions. Practical Assessment, Research, and Evaluation, 18, Article 11.