Chapter 17 Correlation and Regression
The product moment correlation, r, is the most widely used statistic summarizing the strength of association between two metric (interval or ratio scaled) variables, say X and Y. It is also known as the Pearson correlation coefficient, simple correlation, bivariate correlation, or merely the correlation coefficient.
From a sample of n paired observations on X and Y, the product moment correlation, r, can be calculated as:
r = Σ(Xi - X̄)(Yi - Ȳ) / √[Σ(Xi - X̄)² Σ(Yi - Ȳ)²] = COVxy / (Sx Sy)
Covariance: A systematic relationship between two variables in which a change in one implies a
corresponding change in the other (COVxy). The covariance may be either positive or negative.
Division by Sx Sy achieves standardization, so that r varies between -1.0 and +1.0. (A positive sign of r implies a positive relationship; the larger the absolute value of r, the stronger the association.) Thus, correlation is a special case of covariance, obtained when the data are standardized. Note that the correlation
coefficient is an absolute number and is not expressed in any unit of measurement. The correlation
coefficient between two variables will be the same regardless of their underlying units of
measurement.
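As an illustration of this calculation, the following sketch (using numpy and scipy, with hypothetical data and illustrative variable names) computes r from the covariance and the standard deviations and verifies it against scipy's built-in function:

import numpy as np
from scipy import stats

# hypothetical sample data for two metric variables
x = np.array([10., 12., 12., 4., 12., 6., 8., 2., 18., 9., 17., 2.])
y = np.array([6., 9., 8., 3., 10., 4., 5., 2., 11., 9., 10., 2.])

cov_xy = np.cov(x, y, ddof=1)[0, 1]            # sample covariance COVxy
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))   # r = COVxy / (Sx Sy)

r_check, p_value = stats.pearsonr(x, y)        # same value from scipy
print(r, r_check, r**2)                        # r, verification, and r-squared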
Because r indicates the degree to which variation in one variable is related to variation in another, it
can also be expressed in terms of the decomposition of the total variation.
r² measures the proportion of variation in one variable that is explained by the other. Both r and r²
are symmetric measures of association. In other words, the correlation of X with Y is the same as the
correlation of Y with X. It does not matter which variable is considered to be the dependent variable
and which the independent. The product moment coefficient measures the strength of the linear
relationship and is not designed to measure nonlinear relationships. Thus, r=0 merely indicates that
there is no linear relationship between X and Y. It does not mean that X and Y are unrelated. There
could well be a nonlinear relationship between them, which would not be captured by r (see Figure
17.1).
When it is computed for a population rather than a sample, the product moment correlation is
denoted by ρ, the Greek letter rho. The coefficient r is an estimator of ρ. Note that the calculation of
r assumes that X and Y are metric variables whose distributions have the same shape. If these
assumptions are not met, r is deflated and underestimates ρ. In marketing research, data obtained
by using rating scales with a small number of categories may not be strictly interval. This tends to
deflate r, resulting in an underestimation of ρ.
The statistical significance of the relationship between two variables measured by using r can be conveniently tested. The hypotheses are:
H₀: ρ = 0
H₁: ρ ≠ 0
The test statistic is:
t = r √[(n - 2) / (1 - r²)]
which has a t distribution with n - 2 degrees of freedom.
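A minimal sketch of this test, assuming r has already been computed from a sample of size n (the values below are purely illustrative):

import numpy as np
from scipy import stats

r, n = 0.62, 30                               # illustrative sample correlation and sample size
t0 = r * np.sqrt((n - 2) / (1 - r**2))        # test statistic with n - 2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t0), df=n - 2)   # two-tailed p-value
print(t0, p_value)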
In conducting multivariate data analysis, it is often useful to examine the simple correlation between
each pair of variables. These results are presented in the form of a correlation matrix, which
indicates the coefficient of correlation between each pair of variables. Usually, only the lower
triangular portion of the matrix is considered. The diagonal elements all equal 1.00, because a
variable correlates perfectly with itself. The upper triangular portion of the matrix is a mirror image
of the lower triangular portion, because r is a symmetric measure of association. The form of a
correlation matrix for five variables, V1 through V5, is as follows (only the lower triangular portion is shown; rij denotes the correlation between Vi and Vj):

      V1     V2     V3     V4     V5
V1    1.00
V2    r21    1.00
V3    r31    r32    1.00
V4    r41    r42    r43    1.00
V5    r51    r52    r53    r54    1.00
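In practice a correlation matrix is rarely computed by hand; the sketch below (hypothetical data for five variables) obtains it with numpy:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))         # hypothetical data: 100 cases on variables V1-V5
corr = np.corrcoef(data, rowvar=False)   # 5 x 5 correlation matrix
print(np.round(np.tril(corr), 2))        # lower triangular portion; diagonal elements equal 1.00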
Partial Correlation
A partial correlation coefficient measures the association between two variables after controlling for
or adjusting for the effects of one or more additional variables.
Suppose one wanted to calculate the association between X and Y after controlling for a third variable, Z. Conceptually, one would proceed as follows:
• One would first remove the effect of Z from X. To do this, one would predict the values of X
based on a knowledge of Z by using the product moment correlation between X and Z, rxz.
The predicted value of X is then subtracted from the actual value of X to construct an
adjusted value of X.
• In a similar manner, the values of Y are adjusted to remove the effects of Z.
• The product moment correlation between the adjusted values of X and the adjusted values
of Y is the partial correlation coefficient between X and Y, after controlling for the effect of Z,
and is denoted by rxy.z.
Statistically, because the simple correlation between two variables completely describes the linear relationship between them, the partial correlation coefficient can be calculated by a knowledge of the simple correlations alone, without using individual observations:
rxy.z = (rxy - rxz ryz) / √[(1 - rxz²)(1 - ryz²)]
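A sketch of this calculation from the three simple correlations (the correlation values are illustrative):

import numpy as np

r_xy, r_xz, r_yz = 0.50, 0.60, 0.70    # illustrative simple correlations
r_xy_z = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))
print(r_xy_z)                           # first-order partial correlation r(xy.z)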
Partial correlations have an order associated with them. The order indicates how many variables are
being adjusted or controlled.
• Zero-order: The simple correlation coefficient, r, has a zero-order, as it does not control for
any additional variables when measuring the association between two variables.
• First-order: The coefficient rxy.z is a first-order partial correlation coefficient, as it controls
for the effect of one additional variable, Z.
• Second-order partial correlation coefficient controls for the effects of two variables, and so
on.
The higher-order partial correlations are calculated similarly. The (n + 1)th-order partial coefficient
may be calculated by replacing the simple correlation coefficients on the right side of the preceding
equation with the nth-order partial coefficients.
Spurious relationships: Partial correlations can be helpful for detecting spurious relationships. The
relationship between X and Y is spurious if it is solely due to the fact that X is associated with Z,
which is indeed the true predictor of Y. In this case, the correlation between X and Y disappears
when the effect of Z is controlled.
Part correlation coefficient: This coefficient represents the correlation between Y and X when the linear effects of the other independent variables have been removed from X but not from Y. The part correlation coefficient, ry(x.z), is calculated as follows:
ry(x.z) = (rxy - ryz rxz) / √(1 - rxz²)
The partial correlation coefficient is generally viewed as more important than the part correlation
coefficient because it can be used to determine spurious and suppressor effects. The product
moment correlation, partial correlation, and the part correlation coefficients all assume that the
data are interval or ratio scaled.
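For comparison with the partial correlation sketch above, the part correlation differs only in the denominator (same illustrative values):

import numpy as np

r_xy, r_xz, r_yz = 0.50, 0.60, 0.70                     # same illustrative simple correlations
r_y_xz = (r_xy - r_yz * r_xz) / np.sqrt(1 - r_xz**2)    # Z removed from X only, not from Y
print(r_y_xz)    # never exceeds the partial correlation r(xy.z) in absolute value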
Nonmetric Correlation
A correlation measure for two nonmetric (ordinal) variables that relies on rankings to compute the correlation.
Spearman’s rho, ρs, and Kendall’s tau, τ, are two measures of nonmetric correlation that can be used to examine the correlation between two ordinal variables. Both these measures use rankings rather than the absolute values of the variables, and the basic concepts underlying them are quite similar. Both vary from -1.0 to +1.0.
In the absence of ties, Spearman’s ρs yields a closer approximation to the Pearson product moment
correlation coefficient, ρ, than Kendall’s τ. In these cases, the absolute magnitude of τ tends to be
smaller than Pearson’s ρ. On the other hand, when the data contain a large number of tied ranks,
Kendall’s τ seems more appropriate. As a rule of thumb, Kendall’s τ is to be preferred when a large
number of cases fall into a relatively small number of categories (thereby leading to a large number
of ties). Conversely, the use of Spearman’s ρs is preferable when we have a relatively larger number
of categories (thereby having fewer ties).
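Both measures are available in scipy; a sketch with hypothetical rankings of eight objects:

import numpy as np
from scipy import stats

# hypothetical preference rankings of eight brands on two attributes
x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2, 1, 4, 3, 6, 5, 8, 7])

rho, p_rho = stats.spearmanr(x, y)    # Spearman's rho
tau, p_tau = stats.kendalltau(x, y)   # Kendall's tau
print(rho, tau)                       # tau's magnitude is typically smaller than rho's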
The product moment as well as the partial and part correlation coefficients provide a conceptual
foundation for bivariate as well as multiple regression analysis.
Regression Analysis
A statistical procedure for analysing associative relationships between a metric dependent variable
and one or more independent variables.
Although the independent variables may explain the variation in the dependent variable, this does
not necessarily imply causation. The use of the terms dependent or criterion variables, and
independent or predictor variables, in regression analysis arises from the mathematical relationship
between the variables. These terms do not imply that the criterion variable is dependent on the
independent variables in a causal sense. Regression analysis is concerned with the nature and
degree of association between variables and does not imply or assume any causality.
Bivariate Regression
Bivariate regression is a procedure for deriving a mathematical relationship, in the form of an
equation, between a single metric dependent or criterion variable and a single metric independent
or predictor variable. The analysis is similar in many ways to determining the simple correlation
between two variables. However, because an equation has to be derived, one variable must be
identified as the dependent and the other as the independent variable.
Plot the Scatter Diagram: A scatter diagram, or scattergram, is a plot of the values of two variables
for all the cases or observations. It is customary to plot the dependent variable on the vertical axis
and the independent variable on the horizontal axis. A scatter diagram is useful for determining the
form of the relationship between the variables. A plot can alert the researcher to patterns in the
data, or to possible problems. Any unusual combinations of the two variables can be easily
identified.
The most commonly used technique for fitting a straight line to a scattergram is the least-squares procedure. This technique determines the best-fitting line by minimizing the squared vertical distances of all the points from the line; the procedure is called ordinary least squares (OLS) regression. The best-fitting line is called the regression line. Any point that does not fall on the
regression line is not fully accounted for. The vertical distance from the point to the line is the error,
ej (see Figure 17.5). The distances of all the points from the line are squared and added together to
arrive at the sum of squared errors, which is a measure of total error, ∑e²j. In fitting the line, the
least-squares procedure minimizes the sum of squared errors. If Y is plotted on the vertical axis and
X on the horizontal axis, as in Figure 17.5, the best-fitting line is called the regression of Y on X,
because the vertical distances are minimized. The scatter diagram indicates whether the relationship
between Y and X can be modeled as a straight line and, consequently, whether the bivariate
regression model is appropriate.
Formulate the Bivariate Regression Model In the bivariate regression model, the general form of a straight line is: Y = β₀ + β₁X, where Y = dependent or criterion variable, X = independent or predictor variable, β₀ = intercept of the line, and β₁ = slope of the line.
This model implies a deterministic relationship, in that Y is completely determined by X. The value of
Y can be perfectly predicted if βo and β1 are known. In marketing research, however, very few
relationships are deterministic. So, the regression procedure adds an error term to account for the
probabilistic or stochastic nature of the relationship. The basic regression equation becomes:
Yi = β₀ + β₁Xi + ei, where ei is the error term associated with the ith observation.
Estimate the Parameters In most cases, β₀ and β₁ are unknown and are estimated from the sample observations using the equation Ŷi = a + bXi, where Ŷi is the predicted value of Yi, and a and b are estimators of β₀ and β₁, respectively.
The constant b is usually referred to as the non-standardized regression coefficient. It is the slope of
the regression line and it indicates the expected change in Y when X is changed by one unit. The
slope, b, may be computed in terms of the covariance between X and Y, COVxy, and the variance of X, Sx², as:
b = COVxy / Sx²
The intercept, a, may then be computed as a = Ȳ - bX̄.
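A sketch of these estimates with numpy (the data and variable names are hypothetical):

import numpy as np

x = np.array([2., 4., 6., 8., 10., 12., 14., 16.])   # hypothetical predictor
y = np.array([3., 5., 4., 7., 8., 9., 11., 12.])     # hypothetical criterion

b = np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)   # slope: b = COVxy / Sx^2
a = y.mean() - b * x.mean()                      # intercept: a = Ybar - b * Xbar
y_hat = a + b * x                                # predicted values
print(a, b)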
Estimate Standardized Regression Coefficient Standardization is the process by which the raw data
are transformed into new variables that have a mean of 0 and a variance of 1. When the data are
standardized, the intercept assumes a value of 0. The term beta coefficient or beta weight is used to
denote the standardized regression coefficient. In this case, the slope obtained by the regression of
Y on X, Byx, is the same as the slope obtained by the regression of X on Y, Bxy. Moreover, each of
these regression coefficients is equal to the simple correlation between X and Y. There is a simple relationship between the standardized and non-standardized regression coefficients:
Byx = byx (Sx / Sy)
Test for Significance The statistical significance of the linear relationship between X and Y may be tested by examining the hypotheses:
H₀: β₁ = 0
H₁: β₁ ≠ 0
The null hypothesis implies that there is no linear relationship between X and Y. The alternative hypothesis is that there is a relationship, positive or negative, between X and Y. Typically, a two-tailed test is done. A t statistic with n - 2 degrees of freedom can be used, where
t = b / SEb
SEb denotes the standard deviation of b and is called the standard error. Critical value of t can be
calculated from t distribution table. If the calculated value of t is larger than the critical value, the
null hypothesis is rejected and there is a significant linear relationship.
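scipy's linregress returns the slope, its standard error, and the two-tailed p-value directly; a sketch with the same hypothetical data:

import numpy as np
from scipy import stats

x = np.array([2., 4., 6., 8., 10., 12., 14., 16.])
y = np.array([3., 5., 4., 7., 8., 9., 11., 12.])

res = stats.linregress(x, y)
t0 = res.slope / res.stderr                    # t = b / SEb with n - 2 degrees of freedom
print(res.slope, res.stderr, t0, res.pvalue)   # res.pvalue is the two-tailed p-value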
Determine the Strength and Significance of Association The strength of association between Y and
X is measured by the coefficient of determination, r². In bivariate regression, r² is the square of the
simple correlation coefficient obtained by correlating the two variables. The coefficient r² varies
between 0 and 1. It signifies the proportion of the total variation in Y that is accounted for by the
variation in X. The decomposition of the total variation in Y is similar to that for analysis of variance.
The total variation, SSy, may be decomposed into the variation accounted for by the regression line, SSreg, and the error or residual variation, SSerror or SSres, as follows:
SSy = SSreg + SSres
The strength of association is then given by r² = SSreg / SSy. The significance of this association may be tested with the statistic
F = SSreg / [SSres / (n - 2)]
which has an F distribution with 1 and n - 2 degrees of freedom. The F test is a generalized form of the t test. If a random variable t is distributed with n degrees of freedom, then t² is F distributed with 1 and n degrees of freedom. Hence, the F test for testing the significance of the coefficient of determination is equivalent to testing the following hypotheses:
H₀: β₁ = 0 (equivalently, ρ = 0)
H₁: β₁ ≠ 0 (equivalently, ρ ≠ 0)
If the calculated F statistic exceeds the critical value of F (F₀ > Fc), the null hypothesis is rejected and the relationship is significant.
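A sketch of this decomposition and the F test, continuing the hypothetical bivariate fit:

import numpy as np
from scipy import stats

x = np.array([2., 4., 6., 8., 10., 12., 14., 16.])
y = np.array([3., 5., 4., 7., 8., 9., 11., 12.])
n = len(y)

b = np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)
a = y.mean() - b * x.mean()
y_hat = a + b * x

ss_y = ((y - y.mean())**2).sum()         # total variation SSy
ss_reg = ((y_hat - y.mean())**2).sum()   # variation explained by the regression, SSreg
ss_res = ((y - y_hat)**2).sum()          # residual variation SSres
r2 = ss_reg / ss_y                       # coefficient of determination
F = ss_reg / (ss_res / (n - 2))          # F statistic with 1 and n - 2 degrees of freedom
p_value = stats.f.sf(F, 1, n - 2)
print(r2, F, p_value)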
Check Prediction Accuracy To estimate the accuracy of predicted values, Ŷ, it is useful to calculate
the standard error of estimate, SEE. This statistic is the standard deviation of the actual Y values from the predicted Ŷ values:
SEE = √[Σ(Yi - Ŷi)² / (n - 2)]
SEE may be interpreted as a kind of average residual or average error in predicting Y from the
regression equation. Two cases of prediction may arise. The researcher may want to predict the
mean value of Y for all the cases with a given value of X, say Xo, or predict the value of Y for a single
case. In both situations, the predicted value is the same and is given by Ŷ = a + bXo.
However, the standard error is different in the two situations, although in both situations it is a
function of SEE. For large samples, the standard error for predicting the mean value of Y is SEE/√n, and
for predicting individual Y values it is SEE. Hence, the construction of confidence intervals for the
predicted value varies, depending upon whether the mean value or the value for a single
observation is being predicted.
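A sketch of SEE and the two large-sample standard errors described above (hypothetical data; the SEE formula with n - 2 degrees of freedom is assumed for the bivariate case):

import numpy as np

x = np.array([2., 4., 6., 8., 10., 12., 14., 16.])
y = np.array([3., 5., 4., 7., 8., 9., 11., 12.])
n = len(y)

b = np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)

see = np.sqrt((residuals**2).sum() / (n - 2))   # standard error of estimate
se_mean = see / np.sqrt(n)                      # large-sample SE for predicting the mean of Y at Xo
se_single = see                                 # large-sample SE for predicting an individual Y value
print(see, se_mean, se_single)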
Assumptions The regression model makes a number of assumptions in estimating the parameters
and in significance testing:
• The error term is normally distributed. For each fixed value of X, the distribution of Y is
normal.
• The means of all these normal distributions of Y, given X, lie on a straight line with slope β₁.
• The mean of the error term is 0.
• The variance of the error term is constant. This variance does not depend on the values
assumed by X.
• The error terms are uncorrelated. In other words, the observations have been drawn
independently.
Insights into the extent to which these assumptions have been met can be gained by an examination
of residuals, which is covered in the next section on multiple regression.
Multiple Regression
A statistical technique that simultaneously develops a mathematical relationship between two or
more independent variables and a single interval-scaled dependent variable.
Multiple regression model: An equation used to explain the results of multiple regression analysis.
The general form of the multiple regression model is as follows:
Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃ + . . . + βkXk + e
which is estimated by the following equation:
Ŷ = a + b₁X₁ + b₂X₂ + b₃X₃ + . . . + bkXk
As before, the coefficient a represents the intercept, but the b's are now the partial regression coefficients. The
least-squares criterion estimates the parameters in such a way as to minimize the total error, SSres.
This process also maximizes the correlation between the actual values of Y and the predicted values,
Ŷ. All the assumptions made in bivariate regression also apply in multiple regression.
F test: The F test is used to test the null hypothesis that the coefficient of multiple determination in
the population, R² pop, is zero. This is equivalent to testing the null hypothesis:
H₀: β₁ = β₂ = . . . = βk = 0
The overall test can be conducted by using an F statistic:
F = (SSreg / k) / [SSres / (n - k - 1)]
which has an F distribution with k and (n - k - 1) degrees of freedom.
Partial F test: The significance of a partial regression coefficient, βi, of Xi may be tested using an
incremental F statistic. The incremental F statistic is based on the increment in the explained sum of
squares resulting from the addition of the independent variable Xi to the regression equation after
all the other independent variables have been included.
Partial regression coefficient: The partial regression coefficient, b₁, denotes the change in the
predicted value, Ŷ, per unit change in X₁ when the other independent variables, X₂ to Xk, are held
constant.
The interpretation of the partial regression coefficient, b₁, is that it represents the expected change
in Y when X₁ is changed by one unit but X₂ is held constant or otherwise controlled. Likewise, b₂
represents the expected change in Y for a unit change in X₂, when X₁ is held constant. Thus, calling b₁
and b₂ partial regression coefficients is appropriate. It can also be seen that the combined effects of
X₁ and X₂ on Y are additive. In other words, if X₁ and X₂ are each changed by one unit, the expected
change in Y would be (b₁ + b₂).
Conceptually, the relationship between the bivariate regression coefficient and the partial regression
coefficient can be illustrated as follows. Suppose one were to remove the effect of X₂ from X₁. This
could be done by running a regression of X₁ on X₂. In other words, one would estimate the equation X̂₁ = a + bX₂ and calculate the residual Xr = X₁ - X̂₁. The partial regression coefficient, b₁, is then equal to the bivariate regression coefficient, br, obtained from the equation Ŷ = a + brXr. In other words, the partial regression coefficient, b₁, is equal to the regression coefficient, br, between Y and the residuals of X₁ from which the effect of X₂ has been removed. The partial coefficient, b₂, can also
be interpreted along similar lines.
The beta coefficients are the partial regression coefficients obtained when all the variables (Y, X₁, X₂,
. . . Xk) have been standardized to a mean of 0 and a variance of 1 before estimating the regression
equation. The relationship of the standardized to the non-standardized coefficients remains the same as before:
B₁ = b₁ (Sx₁ / Sy), B₂ = b₂ (Sx₂ / Sy), . . . , Bk = bk (Sxk / Sy)
The intercept and the partial regression coefficients are estimated by solving a system of
simultaneous equations derived by differentiating and equating the partial derivatives to 0. Yet it is
worth noting that the equations cannot be solved if (1) the sample size, n, is smaller than or equal to
the number of independent variables, k; or (2) one independent variable is perfectly correlated with
another.
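A sketch of estimating the intercept and the partial regression coefficients by least squares with numpy (two hypothetical predictors):

import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)                       # hypothetical predictor X1
x2 = rng.normal(size=n)                       # hypothetical predictor X2
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(scale=0.5, size=n)   # hypothetical criterion

X = np.column_stack([np.ones(n), x1, x2])     # design matrix with a column of 1s for the intercept
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution
a, b1, b2 = coef                              # intercept and partial regression coefficients
y_hat = X @ coef                              # predicted values
print(a, b1, b2)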
Strength of Association: The strength of the relationship stipulated by the regression equation can
be determined by using appropriate measures of association. The total variation is decomposed as in
the bivariate case:
SSy = SSreg + SSres
The strength of association is measured by the square of the multiple correlation coefficient, R², which is also called the coefficient of multiple determination:
R² = SSreg / SSy
The multiple correlation coefficient, R, can also be viewed as the simple correlation coefficient, r,
between Y and Ŷ. Several points about the characteristics of R² are worth noting. The coefficient of
multiple determination, R², cannot be less than the highest bivariate, r², of any individual
independent variable with the dependent variable. R² will be larger when the correlations between
the independent variables are low. If the independent variables are statistically independent
(uncorrelated), then R² will be the sum of bivariate r² of each independent variable with the
dependent variable. R² cannot decrease as more independent variables are added to the regression
equation. Yet diminishing returns set in, so that after the first few variables, the additional
independent variables do not make much of a contribution. For this reason, R² is adjusted for the
number of independent variables and the sample size by using the following formula:
Adjusted R² = R² - [k(1 - R²) / (n - k - 1)]
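A sketch of R² and adjusted R² computed from observed and predicted values (the values are illustrative, and k denotes the number of independent variables):

import numpy as np

def adjusted_r2(y, y_hat, k):
    """R-squared and adjusted R-squared for k independent variables."""
    ss_y = ((y - y.mean())**2).sum()     # total variation SSy
    ss_res = ((y - y_hat)**2).sum()      # residual variation SSres
    r2 = 1 - ss_res / ss_y
    n = len(y)
    return r2, r2 - k * (1 - r2) / (n - k - 1)

# illustrative usage with hypothetical observed and predicted values
y = np.array([4., 7., 6., 9., 11., 10., 14., 13.])
y_hat = np.array([4.5, 6.5, 6.8, 9.2, 10.4, 10.9, 13.1, 13.6])
print(adjusted_r2(y, y_hat, k=2))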
Significance Testing: Significance testing involves testing the significance of the overall regression
equation as well as specific partial regression coefficients. The null hypothesis for the overall test is
that the coefficient of multiple determination in the population, R² pop, is zero.
If the overall null hypothesis is rejected, one or more population partial regression coefficients have
a value different from 0. To determine which specific coefficients (β’is) are nonzero, additional tests
are necessary. Testing for the significance of the (β’is) can be done in a manner similar to that in the
bivariate case by using t tests.
Some computer programs provide an equivalent F test, often called the partial F test. This involves a
decomposition of the total regression sum of squares, SSreg, into components related to each
independent variable. In the standard approach, this is done by assuming that each independent
variable has been added to the regression equation after all the other independent variables have
been included. The increment in the explained sum of squares, resulting from the addition of an
independent variable, Xi , is the component of the variation attributed to that variable and is
denoted by SSxi. The significance of the partial regression coefficient for this variable, bi , is tested
using an incremental F statistic:
F = (SSxi / 1) / [SSres / (n - k - 1)]
which has an F distribution with 1 and (n - k - 1) degrees of freedom.
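A sketch of this incremental F test, computed manually with numpy by comparing the residual sums of squares of the full model and of the model that omits the predictor in question (hypothetical data):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 60
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.9 * x1 + 0.4 * x2 + rng.normal(scale=0.7, size=n)

def ss_res(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return ((y - X @ coef)**2).sum()

full = np.column_stack([np.ones(n), x1, x2])   # full model with X1 and X2
reduced = np.column_stack([np.ones(n), x1])    # model without X2
k = 2                                          # number of predictors in the full model

ss_x2 = ss_res(reduced, y) - ss_res(full, y)   # increment in explained sum of squares due to X2
F = ss_x2 / (ss_res(full, y) / (n - k - 1))    # incremental F with 1 and n - k - 1 df
p_value = stats.f.sf(F, 1, n - k - 1)
print(F, p_value)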
Examination of Residuals: A residual is the difference between the observed value of Yi and the
value predicted by the regression equation, Ŷi. Residuals are used in the calculation of several
statistics associated with regression. In addition, scattergrams of the residuals, in which the residuals
are plotted against the predicted values, Ŷi, time, or predictor variables, provide useful insights in
examining the appropriateness of the underlying assumptions and regression model fitted.
The assumption of a normally distributed error term can be examined by constructing a histogram of
the standardized residuals. A visual check reveals whether the distribution is normal. It is also useful
to examine the normal probability plot of standardized residuals. The normal probability plot shows
the observed standardized residuals compared to expected standardized residuals from a normal
distribution. If the observed residuals are normally distributed, they will fall on the 45-degree line.
Also, look at the table of residual statistics and identify any standardized predicted values or
standardized residuals that are more than plus or minus one and two standard deviations. These
percentages can be compared with what would be expected under the normal distribution (68
percent and 95 percent, respectively). A more formal assessment can be made by running the Kolmogorov-Smirnov (K-S) one-sample test.
The assumption of constant variance of the error term can be examined by plotting the standardized
residuals against the standardized predicted values of the dependent variable, Ŷi. If the pattern is
not random, the variance of the error term is not constant. Figure 17.7 shows a pattern whose
variance is dependent upon the Ŷi values.
A plot of residuals against time, or the sequence of observations, will throw some light on the
assumption that the error terms are uncorrelated. A random pattern should be seen if this
assumption is true. A plot like the one in Figure 17.8 indicates a linear relationship between residuals
and time. A more formal procedure for examining the correlations between the error terms is the Durbin-Watson test.
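A sketch of two of these checks, a K-S one-sample test on the standardized residuals and a manually computed Durbin-Watson statistic, using hypothetical residuals:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
residuals = rng.normal(scale=2.0, size=80)    # hypothetical regression residuals

std_resid = (residuals - residuals.mean()) / residuals.std(ddof=1)
ks_stat, ks_p = stats.kstest(std_resid, 'norm')    # K-S one-sample test for normality
dw = np.sum(np.diff(residuals)**2) / np.sum(residuals**2)   # Durbin-Watson; values near 2 suggest uncorrelated errors
print(ks_stat, ks_p, dw)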
Plotting the residuals against the independent variables provides evidence of the appropriateness or
inappropriateness of using a linear model. Again, the plot should result in a random pattern. The
residuals should fall randomly, with a relatively equal dispersion about 0. They should not display any tendency to be either positive or negative.
To examine whether any additional variables should be included in the regression equation, one
could run a regression of the residuals on the proposed variables. If any variable explains a
significant proportion of the residual variation, it should be considered for inclusion. Inclusion of
variables in the regression equation should be strongly guided by the researcher’s theory. Thus, an
examination of the residuals provides valuable insights into the appropriateness of the underlying
assumptions and the model that is fitted. Figure 17.9 shows a plot that indicates that the underlying
assumptions are met and that the linear model is appropriate. If an examination of the residuals
indicates that the assumptions underlying linear regression are not met, the researcher can
transform the variables in an attempt to satisfy the assumptions. Transformations, such as taking
logs, square roots, or reciprocals, can stabilize the variance, make the distribution normal, or make
the relationship linear.
Stepwise Regression
The purpose of stepwise regression is to select, from a large number of predictor variables, a small
subset of variables that account for most of the variation in the dependent or criterion variable. In
this procedure, the predictor variables enter or are removed from the regression equation one at a
time. There are several approaches to stepwise regression.
• Forward inclusion. Initially, there are no predictor variables in the regression equation.
Predictor variables are entered one at a time, only if they meet certain criteria specified in
terms of the F ratio. The order in which the variables are included is based on the
contribution to the explained variance.
• Backward elimination. Initially, all the predictor variables are included in the regression
equation. Predictors are then removed one at a time based on the F ratio.
• Stepwise solution. Forward inclusion is combined with the removal of predictors that no
longer meet the specified criterion at each step.
Stepwise procedures do not result in regression equations that are optimal, in the sense of
producing the largest R2, for a given number of predictors. Because of the correlations between
predictors, an important variable may never be included, or less important variables may enter the
equation. To identify an optimal regression equation, one would have to compute combinatorial
solutions in which all possible combinations are examined. Nevertheless, stepwise regression can be
useful when the sample size is large in relation to the number of predictors.
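A minimal sketch of forward inclusion, using the p-value of the incremental F statistic as the entry criterion (the data, the function name, and the 0.05 entry threshold are illustrative assumptions; commercial packages use more elaborate entry and removal rules):

import numpy as np
from scipy import stats

def ss_res(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return ((y - X @ coef)**2).sum()

def forward_inclusion(predictors, y, p_enter=0.05):
    """Greedy forward selection: add the predictor with the best incremental F at each step."""
    n = len(y)
    selected, remaining = [], list(range(predictors.shape[1]))
    while remaining:
        best = None
        for j in remaining:
            cols = selected + [j]
            X_full = np.column_stack([np.ones(n), predictors[:, cols]])
            X_red = np.column_stack([np.ones(n), predictors[:, selected]]) if selected else np.ones((n, 1))
            k = len(cols)
            F = (ss_res(X_red, y) - ss_res(X_full, y)) / (ss_res(X_full, y) / (n - k - 1))
            p = stats.f.sf(F, 1, n - k - 1)
            if best is None or p < best[1]:
                best = (j, p)
        if best[1] < p_enter:               # enter the best candidate only if it meets the criterion
            selected.append(best[0])
            remaining.remove(best[0])
        else:
            break
    return selected

# illustrative usage with four hypothetical predictors, two of which matter
rng = np.random.default_rng(4)
Xp = rng.normal(size=(100, 4))
y = 0.5 + 1.2 * Xp[:, 0] - 0.7 * Xp[:, 2] + rng.normal(size=100)
print(forward_inclusion(Xp, y))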
Multicollinearity
Multicollinearity arises when intercorrelations among the predictors are very high. Multicollinearity
can result in several problems, including:
• The partial regression coefficients may not be estimated precisely. The standard errors are
likely to be high.
• The magnitudes as well as the signs of the partial regression coefficients may change from
sample to sample.
• It becomes difficult to assess the relative importance of the independent variables in
explaining the variation in the dependent variable.
• Predictor variables may be incorrectly included or removed in stepwise regression.
What constitutes serious multicollinearity is not always clear, although several rules of thumb and
procedures have been suggested in the literature. Procedures of varying complexity have also been
suggested to cope with multicollinearity. A simple procedure consists of using only one of the
variables in a highly correlated set of variables. Alternatively, the set of independent variables can be
transformed into a new set of predictors that are mutually independent by using techniques such as
principal components analysis. More specialized techniques, such as ridge regression and latent root
regression, can also be used.
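One widely quoted rule of thumb, not named in the text, is the variance inflation factor, VIF = 1/(1 - Ri²), where Ri² is obtained by regressing the ith predictor on the remaining predictors; values well above 1 (10 is a commonly cited cutoff) signal problematic multicollinearity. A sketch with numpy:

import numpy as np

def vif(predictors):
    """Variance inflation factor for each column of the predictor matrix."""
    n, k = predictors.shape
    out = []
    for i in range(k):
        y_i = predictors[:, i]
        X_i = np.column_stack([np.ones(n), np.delete(predictors, i, axis=1)])
        coef, *_ = np.linalg.lstsq(X_i, y_i, rcond=None)
        resid = y_i - X_i @ coef
        r2_i = 1 - (resid**2).sum() / ((y_i - y_i.mean())**2).sum()
        out.append(1.0 / (1.0 - r2_i))
    return np.array(out)

# illustrative usage: X2 is nearly a copy of X1, so both show inflated VIFs
rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)
x3 = rng.normal(size=100)
print(vif(np.column_stack([x1, x2, x3])))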
Several measures have been suggested for assessing the relative importance of the predictors. Given that the predictors are correlated, at least to some extent, in virtually all regression situations, none of these measures is fully satisfactory. It is also possible that the different measures may indicate a different order of importance of the predictors. Yet, if all the measures are examined collectively, useful insights may be obtained into the relative importance of the predictors.
Cross-Validation
Cross-validation examines whether the regression model continues to hold on comparable data not
used in the estimation. The typical cross-validation procedure used in marketing research is as
follows:
1. The regression model is estimated using the entire data set.
2. The available data are split into two parts, the estimation sample and the validation sample. The estimation sample generally contains 50 to 90 percent of the total sample.
3. The regression model is estimated using the data from the estimation sample only. This model is compared to the model estimated on the entire sample to determine the agreement in terms of the signs and magnitudes of the partial regression coefficients.
4. The estimated model is applied to the data in the validation sample to predict the values of the dependent variable, Ŷi, for the observations in the validation sample.
5. The observed values, Yi, and the predicted values, Ŷi, in the validation sample are correlated to determine the simple r². This measure, r², is compared to R² for the total sample and to R² for the estimation sample to assess the degree of shrinkage.
In double cross-validation, the sample is split into halves. One half serves as the estimation sample,
and the other is used as a validation sample in conducting cross-validation. The roles of the
estimation and validation halves are then reversed, and the cross-validation is repeated.
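A sketch of the cross-validation procedure with numpy (hypothetical data; a 70/30 split is assumed for illustration):

import numpy as np

rng = np.random.default_rng(6)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])    # intercept column plus three hypothetical predictors
y = X @ np.array([1.0, 0.8, -0.5, 0.3]) + rng.normal(size=n)

idx = rng.permutation(n)
est, val = idx[:140], idx[140:]                 # 70 percent estimation sample, 30 percent validation sample

coef, *_ = np.linalg.lstsq(X[est], y[est], rcond=None)   # model estimated on the estimation sample
y_hat_val = X[val] @ coef                                # predictions for the validation sample

r = np.corrcoef(y[val], y_hat_val)[0, 1]
print(r**2)    # cross-validated r2, compared with R2 for the estimation sample to assess shrinkage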