
Statistical Primer for Cardiovascular Research

Correlation and Regression


Sybil L. Crawford, PhD

In many health-related studies, investigators wish to assess the strength of an association between 2 measured (continuous) variables. For example, the relation between high-sensitivity C-reactive protein (hs-CRP) and body mass index (BMI) may be of interest. Although BMI is often treated as a categorical variable, eg, underweight, normal, overweight, and obese, a noncategorized version is more detailed and thus may be more informative in terms of detecting associations. Correlation and regression are 2 relevant (and related) widely used approaches for determining the strength of an association between 2 variables. Correlation provides a unitless measure of association (usually linear), whereas regression provides a means of predicting one variable (dependent variable) from the other (predictor variable). This report summarizes correlation coefficients and least-squares regression, including intercept and slope coefficients.

Correlation

Correlation provides a “unitless” measure of association between 2 variables, ranging from −1 (indicating perfect negative association) to 0 (no association) to +1 (perfect positive association). Both variables are treated equally in that neither is considered to be a predictor or an outcome.

Pearson Product-Moment Coefficient of Correlation

The most commonly used version is the Pearson product-moment coefficient of correlation, r. Suppose one wants to estimate the correlation between X=BMI, denoted for the ith subject as Xi, and Y=hs-CRP, denoted for the ith subject as Yi. This is estimated for a sample of size n (i=1, . . . , n) using the following formula1:

$$ r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} $$

where

$$ SS_{xy} = \sum_i (X_i - \bar{X})(Y_i - \bar{Y}), \qquad SS_{xx} = \sum_i (X_i - \bar{X})^2, \qquad SS_{yy} = \sum_i (Y_i - \bar{Y})^2 . $$

Here, X̄ indicates the sample mean of X (=BMI), and Ȳ the sample mean of Y (=hs-CRP). The numerator of r reflects how BMI and hs-CRP co-vary, and the denominator reflects the variability of both BMI and hs-CRP about their respective sample means.
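As a concrete illustration of the formula above (not part of the original article), the following minimal Python sketch computes r directly from SSxy, SSxx, and SSyy and checks the result against a library routine. It assumes NumPy is available; the BMI and hs-CRP values are made up purely for illustration.

import numpy as np

# Illustrative data only: BMI (kg/m^2) and hs-CRP (mg/L) for a small sample.
bmi = np.array([22.0, 27.5, 31.2, 24.8, 35.1, 29.3, 21.4, 26.7])
crp = np.array([0.8, 2.1, 4.5, 1.2, 6.3, 2.9, 0.6, 1.8])

x_bar, y_bar = bmi.mean(), crp.mean()
ss_xy = np.sum((bmi - x_bar) * (crp - y_bar))   # numerator: how X and Y co-vary
ss_xx = np.sum((bmi - x_bar) ** 2)              # spread of X about its mean
ss_yy = np.sum((crp - y_bar) ** 2)              # spread of Y about its mean

r = ss_xy / np.sqrt(ss_xx * ss_yy)
print(r)
print(np.corrcoef(bmi, crp)[0, 1])              # agrees with the hand-computed r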
Alternative Correlation Coefficients

The Pearson correlation coefficient assumes that X and Y are jointly distributed as bivariate normal, ie, X and Y each are normally distributed, and that they are linearly related.2 When these assumptions are not satisfied, nonparametric versions can be used to estimate correlation. These include the Spearman rank correlation coefficient,2 which is based on a comparison of the ranks of X and Y rather than on the original variables themselves. By using ranks, nonparametric approaches are robust to departures from the assumptions of the Pearson correlation coefficient, as well as to outlying (atypical) observations that may distort the estimated Pearson correlation coefficient. On the other hand, if the assumptions for the Pearson correlation coefficient are met, the nonparametric versions are less efficient. That is, they are less likely to detect an association than the Pearson correlation coefficient. Thus, an alternative to nonparametric correlations is to transform X or Y (or both) to better meet these assumptions. See Erickson and Nosanchuk3 for a discussion of transformations.

As an example, consider hs-CRP and BMI in Figures 1 and 2. Figure 1A suggests that there is a positive but nonlinear association between hs-CRP and BMI, and Figures 2A and 2B indicate that neither hs-CRP nor BMI is normally distributed; thus, the assumptions for the Pearson correlation coefficient are not met. Consequently, the Spearman rank correlation provides a more appropriate estimate of association. When a natural log transformation is applied to both hs-CRP and BMI to pull in the long right tails, Figure 1B shows a linear association between the log-transformed variables, and Figures 2C and 2D suggest that the log transformation has made each variable’s distribution closer to normal. The estimated Pearson correlation of the log-transformed variables is more than one third higher than the corresponding estimate for hs-CRP and BMI, which reflects the greater linearity seen in the scatterplot. Note, however, that the Spearman correlation is identical for the original and transformed variables, because the log transformation does not change the variables’ ranks.
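The behavior just described can be reproduced in a few lines. This sketch uses SciPy (an assumption; the article does not name any software) on simulated right-skewed data: the Pearson correlation rises after a log transformation, whereas the Spearman correlation is unchanged because ranks are preserved.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated right-skewed data with a roughly log-linear relation (illustrative only).
log_bmi = rng.normal(np.log(27.0), 0.2, size=500)
log_crp = -11.0 + 3.5 * log_bmi + rng.normal(0.0, 0.8, size=500)
bmi, crp = np.exp(log_bmi), np.exp(log_crp)

print(stats.pearsonr(bmi, crp)[0])                   # attenuated by skewness and nonlinearity
print(stats.pearsonr(np.log(bmi), np.log(crp))[0])   # higher after the log transformation
print(stats.spearmanr(bmi, crp).correlation)         # rank-based estimate
print(stats.spearmanr(np.log(bmi), np.log(crp)).correlation)   # identical: log preserves ranks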

From the University of Massachusetts Medical School, Worcester, Mass.


Correspondence to Sybil L. Crawford, PhD, Preventive and Behavioral Medicine, University of Massachusetts Medical School, 55 Lake Ave N, Shaw
Bldg, Room 228, Worcester, MA 01655. E-mail [email protected]
(Circulation. 2006;114:2083-2088.)
© 2006 American Heart Association, Inc.
Circulation is available at http://www.circulationaha.org
DOI: 10.1161/CIRCULATIONAHA.105.586495


Regression

Regression also indicates whether 2 variables are associated. In contrast to correlation, however, regression considers one variable to be an outcome (dependent variable) and the other to be a predictor variable. As an example, suppose one wants to predict hs-CRP on the basis of BMI. hs-CRP can be modeled as a linear function of BMI, as in Figure 1A:

$$ Y_i = \beta_0 + \beta_1 X_i + e_i $$

where β0 is the intercept, β1 is the slope coefficient for X=BMI, and ei = Yi − (β0 + β1Xi) denotes the residual or error, the part of Yi that is not explained by the linear function of Xi, β0 + β1Xi. The slope coefficient β1 indicates the difference in Y that corresponds to a 1-unit difference in X. Defining X in clinically meaningful units, such as age in years, facilitates the interpretation of β1. The above approach assumes a linear association between X and Y. Consequently, it is important to check this assumption, eg, with a scatterplot of Y versus X, before one estimates the intercept and slope; a transformation of X or Y (or both) may be needed, as in the preceding hs-CRP and BMI example.

Least-Squares Estimation

As with correlation, there are different approaches to estimation of a regression line. The most commonly used technique is the method of least squares (sometimes referred to as ordinary least squares to distinguish it from weighted least squares, which is used when observations have different weights from complex sampling designs), which minimizes the sum of the squared residuals or errors (SSE). That is, estimates β̂0 and β̂1 of β0 and β1, respectively, are chosen to minimize

$$ SSE = \sum_i \left[ Y_i - (\hat\beta_0 + \hat\beta_1 X_i) \right]^2 = \sum_i \hat{e}_i^2 . $$

The resulting formulas are

$$ \hat\beta_1 = \frac{SS_{xy}}{SS_{xx}} \qquad \text{and} \qquad \hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X} . $$

The intercept β0 generally is not of intrinsic interest but is included to estimate β1 accurately. Note that if X has been centered so that X̄=0, then β̂0=Ȳ. The numerator for the estimated slope coefficient is identical to the numerator of the estimated Pearson correlation coefficient r; in particular, when r equals 0, β̂1 also equals 0. β̂1 can be reexpressed as r×√(SSyy/SSxx). Thus, both r and β̂1 estimate the linear association between X and Y. Unlike r, however, β̂1 is not unitless but reflects the scales of X and Y.
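Because the least-squares estimates have the closed forms above, they can be computed directly. The sketch below (illustrative only, not from the article; NumPy assumed, with simulated data standing in for BMI and hs-CRP) obtains β̂1 and β̂0 from the sums of squares and confirms that the slope equals r×√(SSyy/SSxx).

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(27.0, 4.0, size=200)                   # predictor, eg, BMI
y = -7.0 + 0.4 * x + rng.normal(0.0, 2.0, size=200)   # outcome, eg, hs-CRP (made-up relation)

x_bar, y_bar = x.mean(), y.mean()
ss_xy = np.sum((x - x_bar) * (y - y_bar))
ss_xx = np.sum((x - x_bar) ** 2)
ss_yy = np.sum((y - y_bar) ** 2)

beta1_hat = ss_xy / ss_xx                # least-squares slope
beta0_hat = y_bar - beta1_hat * x_bar    # least-squares intercept

r = ss_xy / np.sqrt(ss_xx * ss_yy)
print(beta0_hat, beta1_hat)
print(r * np.sqrt(ss_yy / ss_xx))        # same as beta1_hat: the slope is r rescaled to Y/X units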
Figure 1. A, Scatterplot of hs-CRP vs BMI, with least-squares linear regression line. B, Scatterplot of natural log-transformed hs-CRP vs natural log-transformed BMI, with least-squares linear regression line.

Coefficient of Determination

A unitless estimate of the strength of the linear association between Y and X is given by the coefficient of determination, also known as R2. R2 is the proportion of variance in the outcome Y accounted for by the linear function of the predictor X, ie, the fitted value = β̂0+β̂1X, and is estimated as (SSyy−SSE)/SSyy = 1−(SSE/SSyy). SSE is the amount of variability in the outcome Y that is “left over,” ie, not explained by the linear function of the predictor X. Note that the square of the estimated Pearson correlation coefficient equals R2; R2 ranges from 0 (no linear association) to 1 (perfect linear association, whether positive or negative). A related quantity is the residual mean square σ̂2, the variance of the residuals or, equivalently, the variability of Y about the estimated regression line. For a regression with a single predictor variable, this is computed as SSE/(n−2).1 For a given data set, the smaller σ̂2 is, the larger R2 is; σ̂2 is not unitless, however, but varies with the scale of the observed data.
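These quantities follow directly from the residuals. A short illustrative sketch (not from the article; NumPy assumed, simulated data) computes SSE, R2, and the residual mean square, and verifies that R2 equals the squared Pearson correlation when there is a single predictor.

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(27.0, 4.0, size=200)                   # illustrative predictor
y = -7.0 + 0.4 * x + rng.normal(0.0, 2.0, size=200)   # illustrative outcome

ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
ss_xx = np.sum((x - x.mean()) ** 2)
ss_yy = np.sum((y - y.mean()) ** 2)
beta1_hat = ss_xy / ss_xx
beta0_hat = y.mean() - beta1_hat * x.mean()

resid = y - (beta0_hat + beta1_hat * x)
sse = np.sum(resid ** 2)                  # "left over" variability in Y
r_squared = 1.0 - sse / ss_yy             # proportion of variance explained
sigma2_hat = sse / (len(y) - 2)           # residual mean square for a single predictor

print(r_squared, sigma2_hat)
print((ss_xy / np.sqrt(ss_xx * ss_yy)) ** 2)   # squared Pearson r equals R2 here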
Checking Assumptions: Regression Diagnostics

The above formulas for β̂0 and β̂1 can be used to estimate a regression line regardless of the distributions of X and Y.

Figure 2. A, Histogram of hs-CRP. B, Histogram of BMI. C, Histogram of natural log-transformed hs-CRP. D, Histogram of natural log-transformed BMI.

Assumptions required for inferences with regard to the coefficients and estimation or prediction from the regression line, however, include the following: (1) normally distributed residuals with a mean of zero; (2) constant variance of the residuals; and (3) independence of residuals from different observations.

These assumptions should be checked before any inferences are made from the estimated regression line. For example, to assess whether residuals are normally distributed, a statistical test (eg, the Kolmogorov-Smirnov χ2 test2) can be done to compare the estimated distribution to a normal distribution. Related graphical checks include a histogram of the estimated residuals and a normal probability plot, also known as a quantile-quantile plot, of the observed residual quantiles versus quantiles that would be expected under a normal distribution4; the latter plot will approximate a straight line if the assumption of normality is met. Also, a scatterplot of the estimated residuals versus the fitted values should have a “cloud” pattern, which indicates no increase or decrease in the variability of the residuals as X increases (ie, constant variance), and no curvilinear pattern that suggests a nonlinear association of X and Y.5 In addition, influential observations can be detected with diagnostic tools available in most statistical software packages, such as Cook’s distance,4,6 which indicates for each observation how much the estimated regression coefficients would change if that observation were omitted and the regression coefficients reestimated; a value of at least 1 indicates a highly influential observation. Although an influential observation often will have a large, outlying residual, this is not guaranteed to occur, because an extremely influential observation may “pull” the regression line toward itself and hence have a relatively small residual. A more detailed discussion of leverage and influence is beyond the scope of this report.
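Most statistical packages expose these diagnostics directly. The sketch below (illustrative, not from the article) uses Python’s statsmodels and SciPy, one possible choice, to obtain residuals, fitted values, a Kolmogorov-Smirnov check of the standardized residuals, and Cook’s distances on simulated data.

import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(3)
bmi = rng.normal(27.0, 4.0, size=300)
crp = np.exp(-11.0 + 3.5 * np.log(bmi) + rng.normal(0.0, 0.8, size=300))  # skewed outcome

X = sm.add_constant(bmi)            # design matrix with an intercept column
fit = sm.OLS(crp, X).fit()

resid = fit.resid
fitted = fit.fittedvalues

# Normality check on standardized residuals (the article describes a Kolmogorov-Smirnov test).
z = (resid - resid.mean()) / resid.std()
print(stats.kstest(z, "norm"))

# Crude nonconstant-variance check: do absolute residuals grow with the fitted values?
print(np.corrcoef(fitted, np.abs(resid))[0, 1])

# Influence: Cook's distance per observation; values of at least 1 flag high influence.
cooks_d = fit.get_influence().cooks_distance[0]
print(cooks_d.max(), np.sum(cooks_d >= 1))

# Graphical checks (residual histogram, quantile-quantile plot, residuals vs fitted values)
# would be added with a plotting library, eg, sm.qqplot(resid, line="s").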
Continuing the previous hs-CRP and BMI example, the estimated regression line for hs-CRP as a linear function of BMI is hs-CRP = −7.44 + 0.40×BMI, with an R2 value of 0.20. Residuals from the regression of hs-CRP on BMI, seen in Figure 3A, are not normally distributed and exhibit a large, positive outlier. The scatterplot of residuals versus fitted values (Figure 4A) demonstrates increasing variability in the residuals with larger fitted values. The Kolmogorov-Smirnov χ2 statistic is statistically significant (P<0.01), which indicates a departure of the estimated residual distribution from normality. Moreover, one observation has a Cook’s distance >1, which indicates high influence on the estimated regression line.

The corresponding estimated line from regressing log hs-CRP on log BMI is log hs-CRP = −11.40 + 3.58×log BMI, with an R2 value of 0.37. The proportion of variance explained almost doubles when the variables are transformed, which reflects the improvement in linearity. The histogram of the residuals from the regression of log hs-CRP on log BMI, seen in Figure 3B, is closer to bell-shaped and has no outliers, and there is no significant departure from normality (the probability value for the corresponding Kolmogorov-Smirnov χ2 statistic = 0.12). The scatterplot of residuals versus fitted values (Figure 4B) indicates constant variance of the residuals across the range of fitted values. In addition, none of the observations have a Cook’s distance of at least 1. Note that the scales on the y axis, which indicate the scales of the 2 sets of residuals, are not comparable because the original data are on different scales.

As seen in this example, transforming either the outcome or the predictor (or both) often solves one or more problems, including nonlinear associations, outlying values, and nonconstant variance of residuals. Nonlinear associations also may be modeled with polynomial regression, expanding the right-hand side of the equation to include terms for X2, X3, and so on.7
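A minimal sketch of that polynomial idea (illustrative only; NumPy assumed, made-up data) fits a quadratic by least squares using a design matrix containing 1, X, and X2.

import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(18.0, 40.0, size=200)                                # eg, BMI
y = 5.0 - 0.6 * x + 0.02 * x ** 2 + rng.normal(0.0, 0.5, size=200)   # curved relation (made up)

# Columns: intercept, linear term, quadratic term.
X = np.column_stack([np.ones_like(x), x, x ** 2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)    # estimated intercept, linear, and quadratic coefficients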
Figure 3. A, Histogram of residuals from least-squares linear regression of hs-CRP on BMI. B, Histogram of residuals from least-squares linear regression of natural log-transformed hs-CRP on natural log-transformed BMI.

Figure 4. A, Scatterplot of residuals from least-squares linear regression of hs-CRP on BMI vs corresponding fitted values. B, Scatterplot of residuals from least-squares linear regression of natural log-transformed hs-CRP on natural log-transformed BMI vs corresponding fitted values.

Estimation and Prediction

In addition to determining magnitudes of association, the estimated regression line can be used to estimate the average Y at a specified value of X. In the preceding example, we can estimate the average (mean) hs-CRP concentration at, say, BMI = 25 kg/m2. Using the estimated regression line on the untransformed variables, this would be estimated as −7.44 + 0.40×25 = 2.56 mg/L. In addition, we can predict the hs-CRP concentration for an individual patient with a BMI of 25 kg/m2, also given by 2.56 mg/L. The corresponding estimate on the log hs-CRP scale is −11.40 + 3.58×log(25) = 0.12.

The difference between estimation of an average and prediction for an individual subject lies in the associated variability. The estimated variance of an estimate of a mean at X=x* is given by

$$ \hat\sigma^2 \left[ \frac{1}{n} + \frac{(x^* - \bar{X})^2}{SS_{xx}} \right] , $$

which increases with σ̂ and with the distance between x* and the observed sample mean for X.1 That is, the estimate of the mean is less precise for larger values of the residual mean square (variability of Y about the regression line) and as the value of x* is farther from the center of the observed data. The variance for a prediction at X=x* is equal to

$$ \hat\sigma^2 \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{X})^2}{SS_{xx}} \right] , $$

which equals the variance for an estimated mean plus σ̂2.1 Thus, predicting Y for an individual at a given X value is less precise than estimating the mean at the same X value. This can be seen graphically in Figure 5. Estimates of both the mean log hs-CRP and predicted log hs-CRP across the range of log BMI values are given by the estimated regression line (solid line). The 95% CIs for mean log hs-CRP and for predicted log hs-CRP also are presented; the CIs for predictions for an individual are much wider than those for the mean. Both CIs are wider for extreme values of log BMI than for log BMI values nearer the sample mean.

Figure 1A indicates that for values of BMI <18.6 kg/m2, linear regression on the untransformed data produces negative estimates of hs-CRP (for the mean or for an individual patient), which are invalid for this outcome. In contrast, negative estimates of log hs-CRP can be backtransformed with exponentiation, ie, the antilog, to produce estimates on the original scale, which are guaranteed to be above zero because of the nature of the antilog transformation. This suggests another possible advantage of working with transformed variables.
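These two variance formulas can be turned into interval estimates with a t multiplier. The sketch below (illustrative only; NumPy and SciPy assumed, simulated data) computes a 95% CI for the mean and a 95% prediction interval at x* = 25 and shows that the latter is wider.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(27.0, 4.0, size=150)                   # eg, BMI
y = -7.0 + 0.4 * x + rng.normal(0.0, 2.0, size=150)   # eg, hs-CRP (made-up relation)

n = len(x)
ss_xx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / ss_xx
beta0_hat = y.mean() - beta1_hat * x.mean()
sigma2_hat = np.sum((y - (beta0_hat + beta1_hat * x)) ** 2) / (n - 2)   # residual mean square

x_star = 25.0
y_hat = beta0_hat + beta1_hat * x_star

var_mean = sigma2_hat * (1.0 / n + (x_star - x.mean()) ** 2 / ss_xx)         # mean at x*
var_pred = sigma2_hat * (1.0 + 1.0 / n + (x_star - x.mean()) ** 2 / ss_xx)   # new individual at x*

t = stats.t.ppf(0.975, df=n - 2)
print(y_hat - t * np.sqrt(var_mean), y_hat + t * np.sqrt(var_mean))   # 95% CI for the mean
print(y_hat - t * np.sqrt(var_pred), y_hat + t * np.sqrt(var_pred))   # 95% prediction interval (wider)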

Figure 5. Scatterplot of natural log-transformed hs-CRP vs natural log-transformed BMI, with least-squares linear regression line and 95% CIs for prediction and for mean estimation.

Alternatives to Least-Squares Estimation

Ordinary least-squares regression is widely used, in part because of its ease of computation and also because it has desirable properties when the assumptions are met.7 Because the regression line is estimated by minimizing the squared residuals, however, outlying values can exert a relatively large impact on the estimated line. With the advent of computers, alternative methods have been developed that are computationally more demanding but are more robust to outliers. Some techniques reduce the influence of outliers by replacing squared residuals with other functions of the residuals or minimizing the median of the squared residuals rather than the sum (see Rousseeuw and Leroy8). Other approaches are nonparametric, such as Tukey’s resistant lines3 or Theil’s method.2 It is difficult to generalize some of these approaches to the setting with multiple predictor variables, however.
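As one illustration of such a robust alternative (not from the article; SciPy assumed, simulated data with deliberate outliers), Theil’s method, which takes the median of all pairwise slopes, is far less affected by a few extreme points than ordinary least squares.

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(27.0, 4.0, size=100)
y = -7.0 + 0.4 * x + rng.normal(0.0, 1.0, size=100)
y[:3] += 40.0                                   # a few gross outliers (illustrative)

ols = stats.linregress(x, y)                    # least squares: pulled toward the outliers
slope_theil, intercept_theil, slope_low, slope_high = stats.theilslopes(y, x)   # Theil's method

print(ols.slope, slope_theil)    # the robust slope stays near the underlying 0.4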

Additional Considerations and Cautions

Extrapolation

Even when an estimated regression line provides a good fit to the observed data, it is important not to extrapolate beyond the range of the sample, because the estimated line may not be appropriate. For example, as seen in Figure 1A, estimates of Y from the regression line may be invalid for extreme X values. Alternatively, the relation between X and Y may become nonlinear outside the range of the sample.

Study Design and Interpretation of Estimates

Estimates of correlation and R2 depend not only on the magnitude of the underlying true association but also on the variability of the data included in the sample (see Weisberg4). In the preceding hs-CRP and BMI example, the estimated Pearson correlation of log hs-CRP and log BMI in the full sample is 0.62. If we restrict the sample to the middle 2 quartiles of log BMI, thereby artificially decreasing the SD of log BMI from 0.23 to 0.08, the corresponding estimated correlation is 0.31, an underestimate. Conversely, if we include only women in the top and bottom log BMI quartiles (which yields an SD of log BMI of 0.31), the estimated correlation is 0.70, an overestimate. In the first instance, because the variation in X is constrained to be too small, the variation in Y ignoring X (ie, the horizontal spread in Figure 2 for the middle half of the data) is close to the variation in Y accounting for X, ie, the variation about the regression line. Consequently, the estimated proportion of explained variance in Y is deflated. The reverse occurs in the second instance. Thus, estimates that are not computed from a random sample from the entire range of the variables may not reflect the true correlation.

The range of the predictor variable also affects the standard error of the estimated regression slope, computed as σ̂/√SSxx, which decreases as the variability in X increases; consequently, the slope is estimated with the greatest precision if one samples X entirely at the minimum and maximum possible values.7 Clearly, such a design is not optimal, however, for detecting departures from assumptions, eg, nonlinearity.
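The effect of sampling only part of the range of X is easy to reproduce. In this illustrative sketch (not from the article; NumPy and SciPy assumed), restricting a simulated sample to the middle two quartiles of X shrinks the estimated correlation, and keeping only the extreme quartiles inflates it.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, size=5000)
y = 0.6 * x + rng.normal(0.0, 0.8, size=5000)       # true correlation of about 0.6 (simulated)

q1, q3 = np.quantile(x, [0.25, 0.75])
middle = (x > q1) & (x < q3)                        # middle 2 quartiles of X only
extremes = ~middle                                  # top and bottom quartiles only

print(stats.pearsonr(x, y)[0])                      # full range: near the true value
print(stats.pearsonr(x[middle], y[middle])[0])      # restricted range: underestimate
print(stats.pearsonr(x[extremes], y[extremes])[0])  # extremes only: overestimate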
Categorical Versus Continuous Variables

When a variable is continuous, treating it as a continuous variable typically retains more information than collapsing it to an ordinal categorical variable.9 In some cases, however, the latter version may be preferable. Consider the example of alcohol consumption. In some populations, there may be a large percentage with no consumption, which leads to a large “spike” at the value 0; hence, there may be no straightforward transformation that satisfies the assumptions of correlation or linear regression. Here, it may be more useful to categorize alcohol consumption as an ordinal variable, eg, zero consumption and quartiles of nonzero consumption, and to use ANOVA rather than linear regression. As another example, consider years of education. A difference of 1 year often has a different impact depending on whether the reference point is, say, 11 years compared with 13 years. In this case, a categorized ordinal variable may provide a better fit to the data. Moreover, categorized variables may be more interpretable in clinical settings.10
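A rough sketch of that alcohol example (entirely made up; NumPy and SciPy assumed) builds the ordinal groups, zero consumption plus quartiles of nonzero consumption, and compares group means with a one-way ANOVA.

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n = 400
alcohol = np.where(rng.random(n) < 0.4, 0.0, rng.gamma(2.0, 4.0, size=n))   # spike at zero
outcome = 1.0 + 0.05 * alcohol + rng.normal(0.0, 1.0, size=n)               # illustrative outcome

nonzero = alcohol > 0
cuts = np.quantile(alcohol[nonzero], [0.25, 0.5, 0.75])
group = np.where(nonzero, 1 + np.searchsorted(cuts, alcohol), 0)   # 0 = none, 1-4 = quartiles

samples = [outcome[group == g] for g in range(5)]
print(stats.f_oneway(*samples))    # one-way ANOVA across the 5 ordinal categories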
Confounding

The above discussion assumes there is only a single predictor variable of interest. The association between X and Y, however, may be due in part to the contribution of additional variables that are related to both X and Y, ie, confounding variables. For example, the estimated association between BMI and hs-CRP may be due in part to age, because both BMI and hs-CRP are themselves positively related to age. The methods summarized above can be expanded to include multiple predictors, and associations between X and Y that adjust for these confounding factors can be estimated. Returning to the hs-CRP and BMI example, a partial (age-adjusted) correlation between hs-CRP and BMI can be computed; for the Pearson correlation, this is done by regressing hs-CRP on age, regressing BMI on age, and computing the Pearson correlation of the 2 sets of residuals, ie, the component of hs-CRP that is unrelated to age and the component of BMI that is unrelated to age. Similarly, an age-adjusted slope for BMI can be estimated by adding age as a predictor to the linear regression model. A regression model with multiple predictors is referred to as multiple regression. A later article in this series will address both partial correlation and multiple regression.
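The residual-on-residual recipe just described translates directly into code. This sketch (illustrative only; NumPy and SciPy assumed, simulated age, BMI, and hs-CRP values) computes the age-adjusted Pearson correlation by correlating the two sets of residuals.

import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n = 500
age = rng.uniform(40.0, 70.0, size=n)
bmi = 20.0 + 0.12 * age + rng.normal(0.0, 3.0, size=n)                 # related to age (made up)
crp = -2.0 + 0.05 * age + 0.08 * bmi + rng.normal(0.0, 1.0, size=n)    # related to age and BMI

def residuals(y, x):
    # Residuals from a simple least-squares regression of y on x.
    slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    intercept = y.mean() - slope * x.mean()
    return y - (intercept + slope * x)

# Partial (age-adjusted) Pearson correlation of hs-CRP and BMI.
r_unadjusted = stats.pearsonr(crp, bmi)[0]
r_partial = stats.pearsonr(residuals(crp, age), residuals(bmi, age))[0]
print(r_unadjusted, r_partial)    # adjusting for age reduces the apparent association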

Discussion

Correlation and regression are 2 widely used approaches for determining the strength of association between 2 variables. Regression also is used for predicting an outcome from a predictor variable. Estimates are easily obtained in a variety of statistical software packages. For both methods, it is important to assess whether the assumptions are valid before one draws conclusions from the estimates. If assumptions are not satisfied, options include applying transformations to better meet the assumptions or using nonparametric versions. Both correlation and regression are easily generalized to the situation with multiple predictor variables.

Disclosures

None.

References

1. McClave JT, Dietrich FH II. Statistics. San Francisco, Calif: Dellen; 1985.
2. Daniel WW. Applied Nonparametric Statistics. 2nd ed. Boston, Mass: PWS-KENT; 1990.
3. Erickson BH, Nosanchuk TA. Understanding Data. 2nd ed. Toronto, Canada: University of Toronto Press; 1992.
4. Weisberg S. Applied Linear Regression. New York, NY: Wiley; 1980.
5. Tabachnick BG, Fidell LS. Using Multivariate Statistics. 3rd ed. New York, NY: Harper Collins; 1996.
6. Belsley DA, Kuh E, Welsch RE. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York, NY: Wiley; 1980.
7. Draper N, Smith H. Applied Regression Analysis. 2nd ed. New York, NY: Wiley; 1981.
8. Rousseeuw PJ, Leroy AM. Robust Regression and Outlier Detection. New York, NY: Wiley; 1987.
9. Ragland DR. Dichotomizing continuous outcome variables: dependence of the magnitude of association and statistical power on the cutpoint. Epidemiology. 1992;3:434–440.
10. Mazumdar M, Glassman JR. Categorizing a prognostic variable: review of methods, code for easy implementation and applications to decision-making about cancer treatments. Stat Med. 2000;19:113–132.

KEY WORDS: statistics • epidemiology • computers