Correlation and Regression
…that neither is considered to be a predictor or an outcome.

Pearson Product-Moment Coefficient of Correlation

The most commonly used version is the Pearson product-moment coefficient of correlation, r. Suppose one wants to estimate the correlation between X = BMI, denoted for the ith subject as Xi, and Y = hs-CRP, denoted for the ith subject as Yi. This is estimated for a sample of size n (i = 1, . . . , n) using the following formula1:

r = SSxy / √(SSxx SSyy)

where SSxy = Σi (Xi − X̄)(Yi − Ȳ), SSxx = Σi (Xi − X̄)², and SSyy = Σi (Yi − Ȳ)², with X̄ and Ȳ denoting the sample means of X and Y.

…cient. Thus, an alternative to nonparametric correlations is to transform X or Y (or both) to better meet these assumptions. See Erickson and Nosanchuk3 for a discussion of transformations.

As an example, consider hs-CRP and BMI in Figures 1 and 2. Figure 1A suggests that there is a positive but nonlinear association between hs-CRP and BMI, and Figures 2A and 2B indicate that neither hs-CRP nor BMI is normally distributed; thus, the assumptions for the Pearson correlation coefficient are not met. Consequently, the Spearman rank correlation provides a more appropriate estimate of association. When a natural log transformation is applied to both hs-CRP and BMI to pull in the long right tails, Figure 1B shows a linear association between the log-transformed variables, and the Pearson correlation for the log-transformed variables is larger than the estimate for hs-CRP and BMI, which reflects the greater linearity seen in the scatterplot. Note, however, that the Spearman correlation is identical for the original and transformed variables, because the log transformation does not change the variables' ranks.
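To make the contrast concrete, here is a minimal Python sketch on simulated right-skewed data (the distributions and coefficients are invented for illustration and are not the article's cohort). It shows the Pearson estimate changing under a log transformation while the Spearman estimate stays fixed.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)

# Simulated data with long right tails (hypothetical values):
# BMI is roughly lognormal, and hs-CRP rises nonlinearly with BMI.
bmi = np.exp(rng.normal(np.log(27), 0.2, size=500))
hscrp = np.exp(0.1 * bmi + rng.normal(0, 0.8, size=500)) / 10

# Pearson r on the raw scale vs the log scale.
r_raw, _ = pearsonr(bmi, hscrp)
r_log, _ = pearsonr(np.log(bmi), np.log(hscrp))

# Spearman rho is identical on either scale, because the log
# transformation is monotone and does not change the ranks.
rho_raw, _ = spearmanr(bmi, hscrp)
rho_log, _ = spearmanr(np.log(bmi), np.log(hscrp))

print(f"Pearson r:    raw={r_raw:.2f}, log={r_log:.2f}")
print(f"Spearman rho: raw={rho_raw:.2f}, log={rho_log:.2f}")  # identical
```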
Regression

Regression also indicates whether 2 variables are associated. In contrast to correlation, however, regression considers one variable to be an outcome (dependent variable) and the other to be a predictor variable. As an example, suppose one wants to predict hs-CRP on the basis of BMI. hs-CRP can be modeled as a linear function of BMI, as in Figure 1A:

Yi = β0 + β1Xi + ei

where β0 is the intercept, β1 is the slope coefficient for X = BMI, and ei = Yi − (β0 + β1Xi) denotes the residual or error, the part of Yi that is not explained by the linear function of Xi, β0 + β1Xi. The slope coefficient β1 indicates the difference in Y that corresponds to a 1-unit difference in X. When X is defined in terms of clinically meaningful units, such as age in years, it facilitates the interpretation of β1. The above approach assumes a linear association between X and Y. Consequently, it is important to check this assumption, eg, with a scatterplot of Y versus X, before one estimates the regression line.

Least-Squares Estimation

As with correlation, there are different approaches to estimation of a regression line. The most commonly used technique is the method of least squares (sometimes referred to as ordinary least squares to distinguish it from weighted least squares, which is used when observations have different weights from complex sampling designs), which minimizes the sum of the squared residuals or errors (SSE). That is, estimates β̂0 and β̂1 of β0 and β1, respectively, are chosen to minimize

SSE = Σi [Yi − (β̂0 + β̂1Xi)]².

The resulting estimates are β̂1 = SSxy/SSxx and β̂0 = Ȳ − β̂1X̄; note that the slope estimate can be rewritten as

β̂1 = r × √(SSyy/SSxx).

Thus, both r and β̂1 estimate the linear association between X and Y. Unlike r, however, β̂1 is not unitless but reflects the scales of X and Y.
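The least-squares formulas can be checked numerically. The following Python sketch uses simulated data (the variable names and coefficients are illustrative assumptions) to compute β̂0 and β̂1 from the sums of squares and to verify the identity β̂1 = r × √(SSyy/SSxx) against a library fit.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical predictor/outcome pair; any (x, y) sample works here.
x = rng.normal(27, 4, size=200)              # eg, BMI
y = -7.0 + 0.4 * x + rng.normal(0, 2, 200)   # eg, an hs-CRP-like outcome

xbar, ybar = x.mean(), y.mean()
ss_xy = np.sum((x - xbar) * (y - ybar))
ss_xx = np.sum((x - xbar) ** 2)
ss_yy = np.sum((y - ybar) ** 2)

# Least-squares estimates from the sums of squares.
b1 = ss_xy / ss_xx       # slope
b0 = ybar - b1 * xbar    # intercept

# Equivalent form of the slope: r * sqrt(SSyy / SSxx).
r = ss_xy / np.sqrt(ss_xx * ss_yy)
assert np.isclose(b1, r * np.sqrt(ss_yy / ss_xx))

# Cross-check against a library implementation.
fit = stats.linregress(x, y)
assert np.isclose(fit.slope, b1) and np.isclose(fit.intercept, b0)
print(f"intercept={b0:.2f}, slope={b1:.2f}, r={r:.2f}")
```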
Figure 1. A, Scatterplot of hs-CRP vs BMI, with least-squares linear regression line. B, Scatterplot of natural log-transformed hs-CRP vs natural log-transformed BMI, with least-squares linear regression line.
Coefficient of Determination

A unitless estimate of the strength of the linear association between Y and X is given by the coefficient of determination, also known as R². R² is the proportion of variance in the outcome Y accounted for by the linear function of the predictor X, ie, the fitted value β̂0 + β̂1X, and is estimated as (SSyy − SSE)/SSyy = 1 − (SSE/SSyy). SSE is the amount of variability in the outcome Y that is "left over," ie, not explained by the linear function of the predictor X. Note that the square of the estimated Pearson correlation coefficient equals R²; R² ranges from 0 (no linear association) to 1 (perfect linear association, whether positive or negative). A related quantity is the residual mean square σ̂², the variance of the residuals, or, equivalently, the variability of Y about the estimated regression line. For a regression with a single predictor variable, this is computed as SSE/(n−2).1 For a given data set, the smaller σ̂² is, the larger R² is; σ̂² is not unitless, however, but varies with the scale of the observed data.
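A short Python sketch ties these quantities together on simulated data (all values are illustrative): it computes SSE, R², and the residual mean square, and confirms that R² equals the squared Pearson correlation for a single predictor.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sample from a simple linear model.
x = rng.normal(27, 4, size=200)
y = -7.0 + 0.4 * x + rng.normal(0, 2, 200)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x
resid = y - fitted

ss_yy = np.sum((y - y.mean()) ** 2)   # total variability in Y
sse = np.sum(resid ** 2)              # "left over" variability

r_squared = 1 - sse / ss_yy           # coefficient of determination
sigma2_hat = sse / (len(y) - 2)       # residual mean square

# For a single predictor, R^2 equals the squared Pearson correlation.
r = np.corrcoef(x, y)[0, 1]
assert np.isclose(r_squared, r ** 2)
print(f"R^2={r_squared:.2f}, residual mean square={sigma2_hat:.2f}")
```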
Checking Assumptions: Regression Diagnostics

The above formulas for β̂0 and β̂1 can be used to estimate a regression line regardless of the distributions of X and Y.
Assumptions required for inferences with regard to the coefficients and estimation or prediction from the regression line, however, include the following: (1) normally distributed residuals with a mean of zero; (2) constant variance of the residuals; and (3) independence of residuals from different observations.

In the example, the estimated least-squares line relating hs-CRP to BMI is hs-CRP = −7.44 + 0.40 × BMI, with an R² value of 0.20. Residuals from the regression of hs-CRP on BMI, seen in Figure 3A, are not normally distributed and exhibit a large, positive outlier. The scatterplot of residuals versus fitted values (Figure 4A) demonstrates increasing variability in the residuals with larger fitted values. The Kolmogorov-Smirnov2 statistic is statistically significant (P<0.01), which confirms the departure of the residuals from normality.
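These checks are easy to script. The sketch below, on simulated skewed data, uses a Kolmogorov-Smirnov test of the standardized residuals as a stand-in for the article's exact procedure, plus a crude split-half comparison for non-constant variance (all data and coefficients are illustrative assumptions).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Simulated data with a skewed outcome, loosely mimicking the hs-CRP
# example: residuals are non-normal, with spread that grows with the fit.
bmi = rng.normal(27, 4, size=300)
hscrp = np.exp(-3.0 + 0.12 * bmi + rng.normal(0, 0.8, 300))

fit = stats.linregress(bmi, hscrp)
fitted = fit.intercept + fit.slope * bmi
resid = hscrp - fitted

# Kolmogorov-Smirnov test of standardized residuals against a standard
# normal distribution; a small P value flags departure from normality.
z = (resid - resid.mean()) / resid.std(ddof=1)
ks = stats.kstest(z, "norm")
print(f"KS statistic={ks.statistic:.3f}, P={ks.pvalue:.4f}")

# Crude check for non-constant variance: compare the residual spread in
# the lower and upper halves of the fitted values.
order = np.argsort(fitted)
lo, hi = resid[order[:150]], resid[order[150:]]
print(f"residual SD: lower half={lo.std():.2f}, upper half={hi.std():.2f}")
```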
Least-squares estimates are also sensitive to outliers. Some techniques reduce the influence of outliers by replacing squared residuals with other functions of the residuals or by minimizing the median of the squared residuals rather than the sum (see Rousseeuw and Leroy8). Other approaches are nonparametric, such as Tukey's resistant lines3 or Theil's method.2 It is difficult to generalize some of these approaches to the setting with multiple predictor variables, however.
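Of the approaches named above, Theil's method (the median of all pairwise slopes) has a convenient scipy implementation. A small sketch on simulated data with one gross outlier (all values invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical data plus one large, positive outlier placed at the
# largest x, where it exerts maximal leverage on the fitted slope.
x = rng.normal(27, 4, size=100)
y = -7.0 + 0.4 * x + rng.normal(0, 1, 100)
y[np.argmax(x)] += 50

# Ordinary least squares is pulled toward the outlier...
ols = stats.linregress(x, y)

# ...whereas Theil's median-of-pairwise-slopes estimate resists it.
theil_slope, theil_intercept, _, _ = stats.theilslopes(y, x)

print(f"OLS slope={ols.slope:.2f}, Theil slope={theil_slope:.2f}")
```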
Additional Considerations and Cautions
Extrapolation

Even when an estimated regression line provides a good fit to the observed data, it is important not to extrapolate beyond the range of the sample, because the estimated line may not be appropriate. For example, as seen in Figure 1A, estimates of Y from the regression line may be invalid for extreme X values. Alternatively, the relation between X and Y may become nonlinear outside the range of the sample.
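A toy demonstration of the hazard, with a true curve that is logarithmic (all numbers invented for illustration): a line fitted over X between 20 and 35 predicts well there but overshoots at X = 60.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# The true mean of Y is 10*ln(X): nearly linear over the sampled
# range (20 to 35) but visibly curved beyond it.
x = rng.uniform(20, 35, size=200)
y = 10 * np.log(x) + rng.normal(0, 0.5, size=200)

fit = stats.linregress(x, y)

# Inside the sampled range (x = 25, 30) predictions track the true
# mean; at x = 60 the linear extrapolation overshoots it.
for x_new in (25, 30, 60):
    pred = fit.intercept + fit.slope * x_new
    print(f"x={x_new}: predicted={pred:.1f}, true mean={10 * np.log(x_new):.1f}")
```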
Study Design and Interpretation of Estimates

Estimates of correlation and R² depend not only on the magnitude of the underlying true association but also on the variability of the data included in the sample (see Weisberg4). In the preceding hs-CRP and BMI example, the estimated Pearson correlation of log hs-CRP and log BMI in the full sample is 0.62. If we restrict the sample to the middle 2 quartiles of log BMI, thereby artificially decreasing the SD of log BMI from 0.23 to 0.08, the corresponding estimated correlation is 0.31, an underestimate. Conversely, if we include only women in the top and bottom log BMI quartiles (which yields an SD of log BMI of 0.31), the estimated correlation is an overestimate.
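This restriction-of-range effect is easy to reproduce on simulated data; the sketch below draws a bivariate normal sample with an assumed true correlation of 0.6 (none of these numbers come from the article) and re-estimates r under each sampling scheme.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Bivariate normal sample with true correlation 0.6 (illustrative).
xy = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=2000)
x, y = xy[:, 0], xy[:, 1]

q1, q3 = np.percentile(x, [25, 75])
full = np.ones(len(x), dtype=bool)
mid = (x >= q1) & (x <= q3)  # middle 2 quartiles: SD(x) shrinks, r underestimated
ext = ~mid                   # extreme quartiles: SD(x) inflates, r overestimated

for label, mask in [("full", full), ("middle", mid), ("extremes", ext)]:
    r, _ = stats.pearsonr(x[mask], y[mask])
    print(f"{label:8s}: SD(x)={x[mask].std():.2f}, r={r:.2f}")
```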
…alcohol consumption as an ordinal variable, eg, zero consumption and quartiles of nonzero consumption, and to use ANOVA rather than linear regression. As another example, consider years of education. A difference of 1 year often has a different impact depending on whether the reference point is, say, 11 years compared with 13 years. In this case, a categorized ordinal variable may provide a better fit to the data. Moreover, categorized variables may be more interpretable in clinical settings.10

Confounding

The above discussion assumes there is only a single predictor variable of interest. The association between X and Y, however, may be due in part to the contribution of additional variables that are related to both X and Y, ie, confounding variables. For example, the estimated association between BMI and hs-CRP may be due in part to age, because both BMI and hs-CRP are themselves positively related to age. The methods summarized above can be expanded to include multiple predictors, and associations between X and Y that adjust for these confounding factors can be estimated. Returning to the hs-CRP and BMI example, a partial (age-adjusted) correlation between hs-CRP and BMI can be computed; for the Pearson correlation, this is done by regressing hs-CRP on age, regressing BMI on age, and computing the Pearson correlation of the 2 sets of residuals, ie, the component of hs-CRP that is unrelated to age and the component of BMI that is unrelated to age. Similarly, an age-adjusted slope for BMI can be estimated by adding age as a predictor to the linear regression model. A regression model …
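The partial-correlation recipe just described translates directly into code. A minimal Python sketch on simulated data (the age, BMI, and hs-CRP coefficients are invented so that age confounds the association):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 1000

# Hypothetical data: age drives both the predictor and the outcome,
# on top of a direct BMI -> hs-CRP effect.
age = rng.uniform(30, 70, size=n)
bmi = 20 + 0.10 * age + rng.normal(0, 2, n)
hscrp = -2 + 0.03 * age + 0.10 * bmi + rng.normal(0, 1, n)

# Unadjusted correlation mixes the direct effect with the age pathway.
r_raw, _ = stats.pearsonr(bmi, hscrp)

def residuals(outcome, predictor):
    """Part of `outcome` that is (linearly) unrelated to `predictor`."""
    fit = stats.linregress(predictor, outcome)
    return outcome - (fit.intercept + fit.slope * predictor)

# Partial (age-adjusted) correlation: regress each variable on age,
# then correlate the two sets of residuals.
r_partial, _ = stats.pearsonr(residuals(bmi, age), residuals(hscrp, age))
print(f"unadjusted r={r_raw:.2f}, age-adjusted r={r_partial:.2f}")
```

The age-adjusted slope mentioned above corresponds instead to fitting BMI and age together, eg, with numpy.linalg.lstsq on a 2-predictor design matrix.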