Chapter 17 Correlation and Regression
The product moment correlation, r, is the most widely used statistic summarizing the strength of association between two metric (interval or ratio scaled) variables, say X and Y. It is also known as the Pearson correlation coefficient, simple correlation, bivariate correlation, or merely the correlation coefficient.
From a sample of n paired observations on X and Y, the product moment correlation, r, can be calculated as:
r = Σ(Xi - X̄)(Yi - Ȳ) / √[Σ(Xi - X̄)² Σ(Yi - Ȳ)²] = COVxy / (Sx Sy)
Covariance: A systematic relationship between two variables in which a change in one implies a
corresponding change in the other (COVxy). The covariance may be either positive or negative.
Division by Sx Sy achieves standardization, so that r varies between -1.0 and +1.0. (A positive sign of r implies a positive relationship; the larger the absolute value of r, the stronger the association.) Thus, correlation is a special case of covariance, obtained when the data are standardized. Note that the correlation
coefficient is an absolute number and is not expressed in any unit of measurement. The correlation
coefficient between two variables will be the same regardless of their underlying units of
measurement.
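As an illustration of this calculation, the following sketch (using numpy and scipy, with hypothetical data and illustrative variable names) computes r from the covariance and the standard deviations and verifies it against scipy's built-in function:

import numpy as np
from scipy import stats

# hypothetical sample data for two metric variables
x = np.array([10., 12., 12., 4., 12., 6., 8., 2., 18., 9., 17., 2.])
y = np.array([6., 9., 8., 3., 10., 4., 5., 2., 11., 9., 10., 2.])

cov_xy = np.cov(x, y, ddof=1)[0, 1]            # sample covariance COVxy
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))   # r = COVxy / (Sx Sy)

r_check, p_value = stats.pearsonr(x, y)        # same value from scipy
print(r, r_check, r**2)                        # r, verification, and r-squared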
Because r indicates the degree to which variation in one variable is related to variation in another, it
can also be expressed in terms of the decomposition of the total variation.
r² measures the proportion of variation in one variable that is explained by the other. Both r and r²
are symmetric measures of association. In other words, the correlation of X with Y is the same as the
correlation of Y with X. It does not matter which variable is considered to be the dependent variable
and which the independent. The product moment coefficient measures the strength of the linear
relationship and is not designed to measure nonlinear relationships. Thus, r=0 merely indicates that
there is no linear relationship between X and Y. It does not mean that X and Y are unrelated. There
could well be a nonlinear relationship between them, which would not be captured by r (see Figure
17.1).
When it is computed for a population rather than a sample, the product moment correlation is
denoted by ρ, the Greek letter rho. The coefficient r is an estimator of ρ. Note that the calculation of
r assumes that X and Y are metric variables whose distributions have the same shape. If these
assumptions are not met, r is deflated and underestimates ρ. In marketing research, data obtained
by using rating scales with a small number of categories may not be strictly interval. This tends to
deflate r, resulting in an underestimation of ρ.
The statistical significance of the relationship between two variables measured by using r can be conveniently tested. The hypotheses are:
H₀: ρ = 0
H₁: ρ ≠ 0
The test statistic is:
t = r √[(n - 2) / (1 - r²)]
which has a t distribution with n - 2 degrees of freedom.
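A minimal sketch of this test, assuming r has already been computed from a sample of size n (the values below are purely illustrative):

import numpy as np
from scipy import stats

r, n = 0.62, 30                               # illustrative sample correlation and sample size
t0 = r * np.sqrt((n - 2) / (1 - r**2))        # test statistic with n - 2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t0), df=n - 2)   # two-tailed p-value
print(t0, p_value)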
In conducting multivariate data analysis, it is often useful to examine the simple correlation between
each pair of variables. These results are presented in the form of a correlation matrix, which
indicates the coefficient of correlation between each pair of variables. Usually, only the lower
triangular portion of the matrix is considered. The diagonal elements all equal 1.00, because a
variable correlates perfectly with itself. The upper triangular portion of the matrix is a mirror image
of the lower triangular portion, because r is a symmetric measure of association. The form of a
correlation matrix for five variables, V1 through V5, is as follows (only the lower triangular portion is shown; rij denotes the correlation between Vi and Vj):

      V1     V2     V3     V4     V5
V1    1.00
V2    r21    1.00
V3    r31    r32    1.00
V4    r41    r42    r43    1.00
V5    r51    r52    r53    r54    1.00
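In practice a correlation matrix is rarely computed by hand; the sketch below (hypothetical data for five variables) obtains it with numpy:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))         # hypothetical data: 100 cases on variables V1-V5
corr = np.corrcoef(data, rowvar=False)   # 5 x 5 correlation matrix
print(np.round(np.tril(corr), 2))        # lower triangular portion; diagonal elements equal 1.00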
Partial Correlation
A partial correlation coefficient measures the association between two variables after controlling for
or adjusting for the effects of one or more additional variables.
Suppose one wanted to calculate the association between X and Y after controlling for a third variable, Z. Conceptually, one would proceed as follows:
• One would first remove the effect of Z from X. To do this, one would predict the values of X
based on a knowledge of Z by using the product moment correlation between X and Z, rxz.
The predicted value of X is then subtracted from the actual value of X to construct an
adjusted value of X.
• In a similar manner, the values of Y are adjusted to remove the effects of Z.
• The product moment correlation between the adjusted values of X and the adjusted values
of Y is the partial correlation coefficient between X and Y, after controlling for the effect of Z,
and is denoted by rxy.z.
Statistically, because the simple correlation between two variables completely describes the linear relationship between them, the partial correlation coefficient can be calculated by a knowledge of the simple correlations alone, without using individual observations:
rxy.z = (rxy - rxz ryz) / √[(1 - rxz²)(1 - ryz²)]
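A sketch of this calculation from the three simple correlations (the correlation values are illustrative):

import numpy as np

r_xy, r_xz, r_yz = 0.50, 0.60, 0.70    # illustrative simple correlations
r_xy_z = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))
print(r_xy_z)                           # first-order partial correlation r(xy.z)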
Partial correlations have an order associated with them. The order indicates how many variables are
being adjusted or controlled.
• Zero-order: The simple correlation coefficient, r, has a zero-order, as it does not control for
any additional variables when measuring the association between two variables.
• First-order: The coefficient rxy.z is a first-order partial correlation coefficient, as it controls
for the effect of one additional variable, Z.
• Second-order partial correlation coefficient controls for the effects of two variables, and so
on.
The higher-order partial correlations are calculated similarly. The (n + 1)th-order partial coefficient
may be calculated by replacing the simple correlation coefficients on the right side of the preceding
equation with the nth-order partial coefficients.
Spurious relationships: Partial correlations can be helpful for detecting spurious relationships. The
relationship between X and Y is spurious if it is solely due to the fact that X is associated with Z,
which is indeed the true predictor of Y. In this case, the correlation between X and Y disappears
when the effect of Z is controlled.
Part correlation coefficient: This coefficient represents the correlation between Y and X when the linear effects of the other independent variables have been removed from X but not from Y. The part correlation coefficient, ry(x.z), is calculated as follows:
ry(x.z) = (rxy - ryz rxz) / √(1 - rxz²)
The partial correlation coefficient is generally viewed as more important than the part correlation
coefficient because it can be used to determine spurious and suppressor effects. The product
moment correlation, partial correlation, and the part correlation coefficients all assume that the
data are interval or ratio scaled.
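For comparison with the partial correlation sketch above, the part correlation differs only in the denominator (same illustrative values):

import numpy as np

r_xy, r_xz, r_yz = 0.50, 0.60, 0.70                     # same illustrative simple correlations
r_y_xz = (r_xy - r_yz * r_xz) / np.sqrt(1 - r_xz**2)    # Z removed from X only, not from Y
print(r_y_xz)    # never exceeds the partial correlation r(xy.z) in absolute value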
Nonmetric Correlation
A correlation measure for two nonmetric (ordinal) variables that relies on rankings to compute the correlation.
Spearman’s rho, ρs, and Kendall’s tau, τ, are two measures of nonmetric correlation that can be used to examine the correlation between two ordinal variables. Both these measures use rankings rather than the absolute values of the variables, and the basic concepts underlying them are quite similar. Both vary from -1.0 to +1.0.
In the absence of ties, Spearman’s ρs yields a closer approximation to the Pearson product moment
correlation coefficient, ρ, than Kendall’s τ. In these cases, the absolute magnitude of τ tends to be
smaller than Pearson’s ρ. On the other hand, when the data contain a large number of tied ranks,
Kendall’s τ seems more appropriate. As a rule of thumb, Kendall’s τ is to be preferred when a large
number of cases fall into a relatively small number of categories (thereby leading to a large number
of ties). Conversely, the use of Spearman’s ρs is preferable when we have a relatively larger number
of categories (thereby having fewer ties).
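Both measures are available in scipy; a sketch with hypothetical rankings of eight objects:

import numpy as np
from scipy import stats

# hypothetical preference rankings of eight brands on two attributes
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2, 1, 4, 3, 6, 5, 8, 7])

rho, p_rho = stats.spearmanr(x, y)    # Spearman's rho
tau, p_tau = stats.kendalltau(x, y)   # Kendall's tau
print(rho, tau)                       # tau's magnitude is typically smaller than rho's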
The product moment as well as the partial and part correlation coefficients provide a conceptual
foundation for bivariate as well as multiple regression analysis.
Regression Analysis
A statistical procedure for analysing associative relationships between a metric dependent variable
and one or more independent variables.
Although the independent variables may explain the variation in the dependent variable, this does
not necessarily imply causation. The use of the terms dependent or criterion variables, and
independent or predictor variables, in regression analysis arises from the mathematical relationship
between the variables. These terms do not imply that the criterion variable is dependent on the
independent variables in a causal sense. Regression analysis is concerned with the nature and
degree of association between variables and does not imply or assume any causality.
Bivariate Regression
Bivariate regression is a procedure for deriving a mathematical relationship, in the form of an
equation, between a single metric dependent or criterion variable and a single metric independent
or predictor variable. The analysis is similar in many ways to determining the simple correlation
between two variables. However, because an equation has to be derived, one variable must be
identified as the dependent and the other as the independent variable.
Plot the Scatter Diagram: A scatter diagram, or scattergram, is a plot of the values of two variables
for all the cases or observations. It is customary to plot the dependent variable on the vertical axis
and the independent variable on the horizontal axis. A scatter diagram is useful for determining the
form of the relationship between the variables. A plot can alert the researcher to patterns in the
data, or to possible problems. Any unusual combinations of the two variables can be easily
identified.
The most commonly used technique for fitting a straight line to a scattergram is the least-squares procedure. This technique determines the best-fitting line by minimizing the squared vertical distances of all the points from the line; the procedure is called ordinary least squares (OLS) regression. The best-fitting line is called the regression line. Any point that does not fall on the
regression line is not fully accounted for. The vertical distance from the point to the line is the error,
ej (see Figure 17.5). The distances of all the points from the line are squared and added together to
arrive at the sum of squared errors, which is a measure of total error, ∑e²j. In fitting the line, the
least-squares procedure minimizes the sum of squared errors. If Y is plotted on the vertical axis and
X on the horizontal axis, as in Figure 17.5, the best-fitting line is called the regression of Y on X,
because the vertical distances are minimized. The scatter diagram indicates whether the relationship
between Y and X can be modeled as a straight line and, consequently, whether the bivariate
regression model is appropriate.
Formulate the Bivariate Regression Model In the bivariate regression model, the general form of a straight line is: Y = β₀ + β₁X, where Y = dependent or criterion variable, X = independent or predictor variable, β₀ = intercept of the line, and β₁ = slope of the line.
This model implies a deterministic relationship, in that Y is completely determined by X. The value of
Y can be perfectly predicted if βo and β1 are known. In marketing research, however, very few
relationships are deterministic. So, the regression procedure adds an error term to account for the
probabilistic or stochastic nature of the relationship. The basic regression equation becomes:
Yi = β₀ + β₁Xi + ei, where ei is the error term associated with the ith observation.
Estimate the Parameters In most cases, β₀ and β₁ are unknown and are estimated from the sample observations using the equation Ŷi = a + bXi, where Ŷi is the predicted value of Yi, and a and b are estimators of β₀ and β₁, respectively.
The constant b is usually referred to as the non-standardized regression coefficient. It is the slope of
the regression line and it indicates the expected change in Y when X is changed by one unit. The
slope, b, may be computed in terms of the covariance between X and Y, COVxy, and the variance of X, Sx², as:
b = COVxy / Sx²
The intercept, a, may then be computed as a = Ȳ - bX̄.
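A sketch of these estimates with numpy (the data and variable names are hypothetical):

import numpy as np

x = np.array([2., 4., 6., 8., 10., 12., 14., 16.])   # hypothetical predictor
y = np.array([3., 5., 4., 7., 8., 9., 11., 12.])     # hypothetical criterion

b = np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)   # slope: b = COVxy / Sx^2
a = y.mean() - b * x.mean()                      # intercept: a = Ybar - b * Xbar
y_hat = a + b * x                                # predicted values
print(a, b)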
Estimate Standardized Regression Coefficient Standardization is the process by which the raw data
are transformed into new variables that have a mean of 0 and a variance of 1. When the data are
standardized, the intercept assumes a value of 0. The term beta coefficient or beta weight is used to
denote the standardized regression coefficient. In this case, the slope obtained by the regression of
Y on X, Byx, is the same as the slope obtained by the regression of X on Y, Bxy. Moreover, each of
these regression coefficients is equal to the simple correlation between X and Y. There is a simple relationship between the standardized and non-standardized regression coefficients:
Byx = byx (Sx / Sy)
Test for Significance The statistical significance of the linear relationship between X and Y may be tested by examining the hypotheses:
H₀: β₁ = 0
H₁: β₁ ≠ 0
The null hypothesis implies that there is no linear relationship between X and Y. The alternative hypothesis is that there is a relationship, positive or negative, between X and Y. Typically, a two-tailed test is done. A t statistic with n - 2 degrees of freedom can be used, where
t = b / SEb
SEb denotes the standard deviation of b and is called the standard error. Critical value of t can be
calculated from t distribution table. If the calculated value of t is larger than the critical value, the
null hypothesis is rejected and there is a significant linear relationship.
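scipy's linregress returns the slope, its standard error, and the two-tailed p-value directly; a sketch with the same hypothetical data:

import numpy as np
from scipy import stats

x = np.array([2., 4., 6., 8., 10., 12., 14., 16.])
y = np.array([3., 5., 4., 7., 8., 9., 11., 12.])

res = stats.linregress(x, y)
t0 = res.slope / res.stderr                    # t = b / SEb with n - 2 degrees of freedom
print(res.slope, res.stderr, t0, res.pvalue)   # res.pvalue is the two-tailed p-value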
Determine the Strength and Significance of Association The strength of association between Y and
X is measured by the coefficient of determination, r². In bivariate regression, r² is the square of the
simple correlation coefficient obtained by correlating the two variables. The coefficient r² varies
between 0 and 1. It signifies the proportion of the total variation in Y that is accounted for by the
variation in X. The decomposition of the total variation in Y is similar to that for analysis of variance.
The total variation, SSy, may be decomposed into the variation accounted for by the regression line, SSreg, and the error or residual variation, SSerror or SSres, as follows:
SSy = SSreg + SSres
The strength of association is then given by r² = SSreg / SSy. The significance of this association may be tested with the statistic
F = SSreg / [SSres / (n - 2)]
which has an F distribution with 1 and n - 2 degrees of freedom. The F test is a generalized form of the t test. If a random variable t is distributed with n degrees of freedom, then t² is F distributed with 1 and n degrees of freedom. Hence, the F test for testing the significance of the coefficient of determination is equivalent to testing the following hypotheses:
H₀: β₁ = 0 (equivalently, ρ = 0)
H₁: β₁ ≠ 0 (equivalently, ρ ≠ 0)
If the calculated F statistic exceeds the critical value of F (F₀ > Fc), the null hypothesis is rejected and the relationship is significant.
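A sketch of this decomposition and the F test, continuing the hypothetical bivariate fit:

import numpy as np
from scipy import stats

x = np.array([2., 4., 6., 8., 10., 12., 14., 16.])
y = np.array([3., 5., 4., 7., 8., 9., 11., 12.])
n = len(y)

b = np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)
a = y.mean() - b * x.mean()
y_hat = a + b * x

ss_y = ((y - y.mean())**2).sum()         # total variation SSy
ss_reg = ((y_hat - y.mean())**2).sum()   # variation explained by the regression, SSreg
ss_res = ((y - y_hat)**2).sum()          # residual variation SSres
r2 = ss_reg / ss_y                       # coefficient of determination
F = ss_reg / (ss_res / (n - 2))          # F statistic with 1 and n - 2 degrees of freedom
p_value = stats.f.sf(F, 1, n - 2)
print(r2, F, p_value)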
Check Prediction Accuracy To estimate the accuracy of predicted values, Ŷ, it is useful to calculate
the standard error of estimate, SEE. This statistic is the standard deviation of the actual Y values from the predicted Ŷ values:
SEE = √[Σ(Yi - Ŷi)² / (n - 2)]
SEE may be interpreted as a kind of average residual or average error in predicting Y from the
regression equation. Two cases of prediction may arise. The researcher may want to predict the
mean value of Y for all the cases with a given value of X, say Xo, or predict the value of Y for a single
case. In both situations, the predicted value is the same and is given by Ŷ = a + bXo.
However, the standard error is different in the two situations, although in both situations it is a
function of SEE. For large samples, the standard error for predicting the mean value of Y is SEE/√n, and
for predicting individual Y values it is SEE. Hence, the construction of confidence intervals for the
predicted value varies, depending upon whether the mean value or the value for a single
observation is being predicted.
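A sketch of SEE and the two large-sample standard errors described above (hypothetical data; the SEE formula with n - 2 degrees of freedom is assumed for the bivariate case):

import numpy as np

x = np.array([2., 4., 6., 8., 10., 12., 14., 16.])
y = np.array([3., 5., 4., 7., 8., 9., 11., 12.])
n = len(y)

b = np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)

see = np.sqrt((residuals**2).sum() / (n - 2))   # standard error of estimate
se_mean = see / np.sqrt(n)                      # large-sample SE for predicting the mean of Y at Xo
se_single = see                                 # large-sample SE for predicting an individual Y value
print(see, se_mean, se_single)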
Assumptions The regression model makes a number of assumptions in estimating the parameters
and in significance testing:
• The error term is normally distributed. For each fixed value of X, the distribution of Y is
normal.
• The means of all these normal distributions of Y, given X, lie on a straight line with slope β₁.
• The mean of the error term is 0.
• The variance of the error term is constant. This variance does not depend on the values
assumed by X.
• The error terms are uncorrelated. In other words, the observations have been drawn
independently.
Insights into the extent to which these assumptions have been met can be gained by an examination
of residuals, which is covered in the next section on multiple regression.
Multiple Regression
A statistical technique that simultaneously develops a mathematical relationship between two or
more independent variables and a single interval-scaled dependent variable.
Multiple regression model: An equation used to explain the results of multiple regression analysis.
The general form of the multiple regression model is as follows:
Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + . . . + βkXk + e
which is estimated by the following equation:
Ŷ = a + b₁X₁ + b₂X₂ + b₃X₃ + . . . + bkXk
As before, the coefficient a represents the intercept, but the b's are now the partial regression coefficients. The
least-squares criterion estimates the parameters in such a way as to minimize the total error, SSres.
This process also maximizes the correlation between the actual values of Y and the predicted values,
Ŷ. All the assumptions made in bivariate regression also apply in multiple regression.
F test: The F test is used to test the null hypothesis that the coefficient of multiple determination in
the population, R² pop, is zero. This is equivalent to testing the null hypothesis:
H₀: β₁ = β₂ = . . . = βk = 0
The overall test can be conducted by using an F statistic:
F = (SSreg / k) / [SSres / (n - k - 1)]
which has an F distribution with k and (n - k - 1) degrees of freedom.
Partial F test: The significance of a partial regression coefficient, βi, of Xi may be tested using an
incremental F statistic. The incremental F statistic is based on the increment in the explained sum of
squares resulting from the addition of the independent variable Xi to the regression equation after
all the other independent variables have been included.
Partial regression coefficient: The partial regression coefficient, b₁, denotes the change in the
predicted value, Ŷ, per unit change in X₁ when the other independent variables, X₂ to Xk, are held
constant.
The interpretation of the partial regression coefficient, b₁, is that it represents the expected change
in Y when X₁ is changed by one unit but X₂ is held constant or otherwise controlled. Likewise, b₂
represents the expected change in Y for a unit change in X₂, when X₁ is held constant. Thus, calling b₁
and b₂ partial regression coefficients is appropriate. It can also be seen that the combined effects of
X₁ and X₂ on Y are additive. In other words, if X₁ and X₂ are each changed by one unit, the expected
change in Y would be (b₁ + b₂).
Conceptually, the relationship between the bivariate regression coefficient and the partial regression
coefficient can be illustrated as follows. Suppose one were to remove the effect of X₂ from X₁. This
could be done by running a regression of X₁ on X₂. In other words, one would estimate the equation X̂₁ = a + bX₂ and calculate the residual Xr = X₁ - X̂₁. The partial regression coefficient, b₁, is then equal to the bivariate regression coefficient, br, obtained from the equation Ŷ = a + brXr. In other words, the partial regression coefficient, b₁, is equal to the regression coefficient, br, between Y and the residuals of X₁ from which the effect of X₂ has been removed. The partial coefficient, b₂, can also
be interpreted along similar lines.
The beta coefficients are the partial regression coefficients obtained when all the variables (Y, X₁, X₂,
. . . Xk) have been standardized to a mean of 0 and a variance of 1 before estimating the regression
equation. The relationship of the standardized to the non-standardized coefficients remains the same as before:
B₁ = b₁ (Sx₁ / Sy), B₂ = b₂ (Sx₂ / Sy), . . . , Bk = bk (Sxk / Sy)
The intercept and the partial regression coefficients are estimated by solving a system of
simultaneous equations derived by differentiating and equating the partial derivatives to 0. Yet it is
worth noting that the equations cannot be solved if (1) the sample size, n, is smaller than or equal to
the number of independent variables, k; or (2) one independent variable is perfectly correlated with
another.
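A sketch of estimating the intercept and the partial regression coefficients by least squares with numpy (two hypothetical predictors):

import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)                       # hypothetical predictor X1
x2 = rng.normal(size=n)                       # hypothetical predictor X2
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(scale=0.5, size=n)   # hypothetical criterion

X = np.column_stack([np.ones(n), x1, x2])     # design matrix with a column of 1s for the intercept
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution
a, b1, b2 = coef                              # intercept and partial regression coefficients
y_hat = X @ coef                              # predicted values
print(a, b1, b2)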
Strength of Association: The strength of the relationship stipulated by the regression equation can
be determined by using appropriate measures of association. The total variation is decomposed as in
the bivariate case:
SSy = SSreg + SSres
The strength of association is measured by the square of the multiple correlation coefficient, R², which is also called the coefficient of multiple determination:
R² = SSreg / SSy
The multiple correlation coefficient, R, can also be viewed as the simple correlation coefficient, r,
between Y and Ŷ. Several points about the characteristics of R² are worth noting. The coefficient of
multiple determination, R², cannot be less than the highest bivariate, r², of any individual
independent variable with the dependent variable. R² will be larger when the correlations between
the independent variables are low. If the independent variables are statistically independent
(uncorrelated), then R² will be the sum of bivariate r² of each independent variable with the
dependent variable. R² cannot decrease as more independent variables are added to the regression
equation. Yet diminishing returns set in, so that after the first few variables, the additional
independent variables do not make much of a contribution. For this reason, R² is adjusted for the
number of independent variables and the sample size by using the following formula:
Adjusted R² = R² - [k(1 - R²) / (n - k - 1)]
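A sketch of R² and adjusted R² computed from observed and predicted values (the values are illustrative, and k denotes the number of independent variables):

import numpy as np

def adjusted_r2(y, y_hat, k):
    """R-squared and adjusted R-squared for k independent variables."""
    ss_y = ((y - y.mean())**2).sum()     # total variation SSy
    ss_res = ((y - y_hat)**2).sum()      # residual variation SSres
    r2 = 1 - ss_res / ss_y
    n = len(y)
    return r2, r2 - k * (1 - r2) / (n - k - 1)

# illustrative usage with hypothetical observed and predicted values
y = np.array([4., 7., 6., 9., 11., 10., 14., 13.])
y_hat = np.array([4.5, 6.5, 6.8, 9.2, 10.4, 10.9, 13.1, 13.6])
print(adjusted_r2(y, y_hat, k=2))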
Significance Testing: Significance testing involves testing the significance of the overall regression
equation as well as specific partial regression coefficients. The null hypothesis for the overall test is
that the coefficient of multiple determination in the population, R² pop, is zero.
If the overall null hypothesis is rejected, one or more population partial regression coefficients have
a value different from 0. To determine which specific coefficients (β’is) are nonzero, additional tests
are necessary. Testing for the significance of the (β’is) can be done in a manner similar to that in the
bivariate case by using t tests.
Some computer programs provide an equivalent F test, often called the partial F test. This involves a
decomposition of the total regression sum of squares, SSreg, into components related to each
independent variable. In the standard approach, this is done by assuming that each independent
variable has been added to the regression equation after all the other independent variables have
been included. The increment in the explained sum of squares, resulting from the addition of an
independent variable, Xi , is the component of the variation attributed to that variable and is
denoted by SSxi. The significance of the partial regression coefficient for this variable, bi , is tested
using an incremental F statistic:
F = (SSxi / 1) / [SSres / (n - k - 1)]
which has an F distribution with 1 and (n - k - 1) degrees of freedom.
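A sketch of this incremental F test, computed manually with numpy by comparing the residual sums of squares of the full model and of the model that omits the predictor in question (hypothetical data):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 60
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.9 * x1 + 0.4 * x2 + rng.normal(scale=0.7, size=n)

def ss_res(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return ((y - X @ coef)**2).sum()

full = np.column_stack([np.ones(n), x1, x2])   # full model with X1 and X2
reduced = np.column_stack([np.ones(n), x1])    # model without X2
k = 2                                          # number of predictors in the full model

ss_x2 = ss_res(reduced, y) - ss_res(full, y)   # increment in explained sum of squares due to X2
F = ss_x2 / (ss_res(full, y) / (n - k - 1))    # incremental F with 1 and n - k - 1 df
p_value = stats.f.sf(F, 1, n - k - 1)
print(F, p_value)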
Examination of Residuals: A residual is the difference between the observed value of Yi and the
value predicted by the regression equation, Ŷi. Residuals are used in the calculation of several
statistics associated with regression. In addition, scattergrams of the residuals, in which the residuals
are plotted against the predicted values, Ŷi, time, or predictor variables, provide useful insights in
examining the appropriateness of the underlying assumptions and regression model fitted.
The assumption of a normally distributed error term can be examined by constructing a histogram of
the standardized residuals. A visual check reveals whether the distribution is normal. It is also useful
to examine the normal probability plot of standardized residuals. The normal probability plot shows
the observed standardized residuals compared to expected standardized residuals from a normal
distribution. If the observed residuals are normally distributed, they will fall on the 45-degree line.
Also, look at the table of residual statistics and identify any standardized predicted values or
standardized residuals that are more than plus or minus one and two standard deviations. These
percentages can be compared with what would be expected under the normal distribution (68
percent and 95 percent, respectively). A more formal assessment can be made by running the Kolmogorov-Smirnov (K-S) one-sample test.
The assumption of constant variance of the error term can be examined by plotting the standardized
residuals against the standardized predicted values of the dependent variable, Ŷi. If the pattern is
not random, the variance of the error term is not constant. Figure 17.7 shows a pattern whose
variance is dependent upon the Ŷi values.
A plot of residuals against time, or the sequence of observations, will throw some light on the
assumption that the error terms are uncorrelated. A random pattern should be seen if this
assumption is true. A plot like the one in Figure 17.8 indicates a linear relationship between residuals
and time. A more formal procedure for examining the correlations between the error terms is the Durbin-Watson test.
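A sketch of two of these checks, a K-S one-sample test on the standardized residuals and a manually computed Durbin-Watson statistic, using hypothetical residuals:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
residuals = rng.normal(scale=2.0, size=80)    # hypothetical regression residuals

std_resid = (residuals - residuals.mean()) / residuals.std(ddof=1)
ks_stat, ks_p = stats.kstest(std_resid, 'norm')    # K-S one-sample test for normality
dw = np.sum(np.diff(residuals)**2) / np.sum(residuals**2)   # Durbin-Watson; values near 2 suggest uncorrelated errors
print(ks_stat, ks_p, dw)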
Plotting the residuals against the independent variables provides evidence of the appropriateness or
inappropriateness of using a linear model. Again, the plot should result in a random pattern. The
residuals should fall randomly, with a relatively equal dispersion about 0. They should not display any tendency to be either positive or negative.
To examine whether any additional variables should be included in the regression equation, one
could run a regression of the residuals on the proposed variables. If any variable explains a
significant proportion of the residual variation, it should be considered for inclusion. Inclusion of
variables in the regression equation should be strongly guided by the researcher’s theory. Thus, an
examination of the residuals provides valuable insights into the appropriateness of the underlying
assumptions and the model that is fitted. Figure 17.9 shows a plot that indicates that the underlying
assumptions are met and that the linear model is appropriate. If an examination of the residuals
indicates that the assumptions underlying linear regression are not met, the researcher can
transform the variables in an attempt to satisfy the assumptions. Transformations, such as taking
logs, square roots, or reciprocals, can stabilize the variance, make the distribution normal, or make
the relationship linear.
Stepwise Regression
The purpose of stepwise regression is to select, from a large number of predictor variables, a small
subset of variables that account for most of the variation in the dependent or criterion variable. In
this procedure, the predictor variables enter or are removed from the regression equation one at a
time. There are several approaches to stepwise regression.
• Forward inclusion. Initially, there are no predictor variables in the regression equation.
Predictor variables are entered one at a time, only if they meet certain criteria specified in
terms of the F ratio. The order in which the variables are included is based on the
contribution to the explained variance.
• Backward elimination. Initially, all the predictor variables are included in the regression
equation. Predictors are then removed one at a time based on the F ratio.
• Stepwise solution. Forward inclusion is combined with the removal of predictors that no
longer meet the specified criterion at each step.
Stepwise procedures do not result in regression equations that are optimal, in the sense of
producing the largest R2, for a given number of predictors. Because of the correlations between
predictors, an important variable may never be included, or less important variables may enter the
equation. To identify an optimal regression equation, one would have to compute combinatorial
solutions in which all possible combinations are examined. Nevertheless, stepwise regression can be
useful when the sample size is large in relation to the number of predictors.
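A minimal sketch of forward inclusion, using the p-value of the incremental F statistic as the entry criterion (the data, the function name, and the 0.05 entry threshold are illustrative assumptions; commercial packages use more elaborate entry and removal rules):

import numpy as np
from scipy import stats

def ss_res(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return ((y - X @ coef)**2).sum()

def forward_inclusion(predictors, y, p_enter=0.05):
    """Greedy forward selection: add the predictor with the best incremental F at each step."""
    n = len(y)
    selected, remaining = [], list(range(predictors.shape[1]))
    while remaining:
        best = None
        for j in remaining:
            cols = selected + [j]
            X_full = np.column_stack([np.ones(n), predictors[:, cols]])
            X_red = np.column_stack([np.ones(n), predictors[:, selected]]) if selected else np.ones((n, 1))
            k = len(cols)
            F = (ss_res(X_red, y) - ss_res(X_full, y)) / (ss_res(X_full, y) / (n - k - 1))
            p = stats.f.sf(F, 1, n - k - 1)
            if best is None or p < best[1]:
                best = (j, p)
        if best[1] < p_enter:               # enter the best candidate only if it meets the criterion
            selected.append(best[0])
            remaining.remove(best[0])
        else:
            break
    return selected

# illustrative usage with four hypothetical predictors, two of which matter
rng = np.random.default_rng(4)
Xp = rng.normal(size=(100, 4))
y = 0.5 + 1.2 * Xp[:, 0] - 0.7 * Xp[:, 2] + rng.normal(size=100)
print(forward_inclusion(Xp, y))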
Multicollinearity
Multicollinearity arises when intercorrelations among the predictors are very high. Multicollinearity
can result in several problems, including:
• The partial regression coefficients may not be estimated precisely. The standard errors are
likely to be high.
• The magnitudes as well as the signs of the partial regression coefficients may change from
sample to sample.
• It becomes difficult to assess the relative importance of the independent variables in
explaining the variation in the dependent variable.
• Predictor variables may be incorrectly included or removed in stepwise regression.
What constitutes serious multicollinearity is not always clear, although several rules of thumb and
procedures have been suggested in the literature. Procedures of varying complexity have also been
suggested to cope with multicollinearity. A simple procedure consists of using only one of the
variables in a highly correlated set of variables. Alternatively, the set of independent variables can be
transformed into a new set of predictors that are mutually independent by using techniques such as
principal components analysis. More specialized techniques, such as ridge regression and latent root
regression, can also be used.
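One widely quoted rule of thumb, not named in the text, is the variance inflation factor, VIF = 1/(1 - Ri²), where Ri² is obtained by regressing the ith predictor on the remaining predictors; values well above 1 (10 is a commonly cited cutoff) signal problematic multicollinearity. A sketch with numpy:

import numpy as np

def vif(predictors):
    """Variance inflation factor for each column of the predictor matrix."""
    n, k = predictors.shape
    out = []
    for i in range(k):
        y_i = predictors[:, i]
        X_i = np.column_stack([np.ones(n), np.delete(predictors, i, axis=1)])
        coef, *_ = np.linalg.lstsq(X_i, y_i, rcond=None)
        resid = y_i - X_i @ coef
        r2_i = 1 - (resid**2).sum() / ((y_i - y_i.mean())**2).sum()
        out.append(1.0 / (1.0 - r2_i))
    return np.array(out)

# illustrative usage: X2 is nearly a copy of X1, so both show inflated VIFs
rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)
x3 = rng.normal(size=100)
print(vif(np.column_stack([x1, x2, x3])))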
Several measures have been suggested for assessing the relative importance of the predictors. Given that the predictors are correlated, at least to some extent, in virtually all regression situations, none of these measures is fully satisfactory. It is also possible that the different measures may indicate a different order of importance of the predictors. Yet, if all the measures are examined collectively, useful insights may be obtained into the relative importance of the predictors.
Cross-Validation
Cross-validation examines whether the regression model continues to hold on comparable data not
used in the estimation. The typical cross-validation procedure used in marketing research is as
follows:
1. The regression model is estimated using the entire data set.
2. The available data are split into two parts, the estimation sample and the validation sample. The estimation sample generally contains 50 to 90 percent of the total sample.
3. The regression model is estimated using the data from the estimation sample only. This model is compared to the model estimated on the entire sample to determine the agreement in terms of the signs and magnitudes of the partial regression coefficients.
4. The estimated model is applied to the data in the validation sample to predict the values of the dependent variable, Ŷi, for the observations in the validation sample.
5. The observed values, Yi, and the predicted values, Ŷi, in the validation sample are correlated to determine the simple r². This measure, r², is compared to R² for the total sample and to R² for the estimation sample to assess the degree of shrinkage.
In double cross-validation, the sample is split into halves. One half serves as the estimation sample,
and the other is used as a validation sample in conducting cross-validation. The roles of the
estimation and validation halves are then reversed, and the cross-validation is repeated.
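A sketch of the cross-validation procedure with numpy (hypothetical data; a 70/30 split is assumed for illustration):

import numpy as np

rng = np.random.default_rng(6)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])    # intercept column plus three hypothetical predictors
y = X @ np.array([1.0, 0.8, -0.5, 0.3]) + rng.normal(size=n)

idx = rng.permutation(n)
est, val = idx[:140], idx[140:]                 # 70 percent estimation sample, 30 percent validation sample

coef, *_ = np.linalg.lstsq(X[est], y[est], rcond=None)   # model estimated on the estimation sample
y_hat_val = X[val] @ coef                                # predictions for the validation sample

r = np.corrcoef(y[val], y_hat_val)[0, 1]
print(r**2)    # cross-validated r2, compared with R2 for the estimation sample to assess shrinkage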