Chap 13 - Correlation and Linear Regression
Linear Regression
Chapter 13
Contents
◼ Dependent and independent variables.
◼ The relationship between two variables using the
correlation coefficient.
◼ Regression analysis to estimate the linear
relationship between two variables.
◼ Interpretation of regression analysis.
◼ Significance of slope of the regression equation.
◼ Regression equation to predict the dependent
variable.
◼ Coefficient of determination.
◼ Transforming data.
13-2
Regression Analysis Introduction
◼ Recall that in Chapter 4 we used Applewood Auto Group
data to show the relationship between two variables
using a scatter diagram. The profit for each vehicle sold
and the age of the buyer were plotted on an XY graph.
◼ The graph showed that as the age of the buyer
increased, the profit for each vehicle also increased.
◼ This idea is explored further here. Numerical measures
to express the strength of relationship between two
variables are developed.
◼ In addition, an equation is used to express the
relationship between variables, allowing us to estimate
one variable on the basis of another.
13-4
Regression Analysis Introduction
EXAMPLES
◼ Does the amount Square Group spends per month
on training its sales force affect its monthly sales?
◼ Is the number of square feet in a home related to
the cost to heat the home in January?
◼ In a study of fuel efficiency, is there a relationship
between miles per gallon and the weight of a car?
◼ Does the number of hours that students studied for
an exam influence the exam score?
13-5
Dependent vs. Independent Variable
The Dependent Variable is the variable being predicted
or estimated.
The Independent Variable provides the basis for
estimation. It is the predictor variable.
In each of the questions below, which is the dependent
variable and which is the independent variable?
i. Does the amount Square Group spends per month on
training its sales force affect its monthly sales?
ii. Is the number of square feet in a home related to the
cost to heat the home in January?
iii. Does the number of hours that students studied for an
exam influence the exam score?
13-6
Regression: Terminology
Scatter Diagram Example
A sales manager of Copier Sales of America, which has a
large sales force throughout the United States and
Canada, wants to determine whether there is a
relationship between the number of sales calls made
in a month and the number of copiers sold that month.
The manager selects a random sample of 10
representatives and determines the number of sales
calls each representative made last month and the
number of copiers sold.
13-8
Scatter Diagram Example
13-9
The Coefficient of Correlation (r)
The Coefficient of Correlation (r) is a measure of the
strength of the linear relationship between two variables.
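For reference, the sample correlation coefficient is commonly computed as below. The symbols are notation assumed here: X̄ and Ȳ are the sample means, s_x and s_y the sample standard deviations, and n the number of paired observations.

```latex
r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1)\, s_x\, s_y}
```

The value of r always lies between -1 and +1; the sign gives the direction of the relationship and the magnitude its strength.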
13-10
The Coefficient of Correlation (r)
13-11
Correlation Coefficient - Interpretation
13-12
Correlation Coefficient - Example
Using the Copier Sales of
America data, for which a
scatter plot is shown below,
compute the correlation
coefficient.
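A minimal sketch of the calculation in Python. The ten (sales calls, copiers sold) pairs are an assumption: the sample table is not reproduced in the text, so the values below are illustrative figures chosen to be consistent with the correlation of 0.759 discussed for this example. The function name `correlation` is also made up for illustration.

```python
import math

# ASSUMED sample: the data table is not reproduced in the text, so these
# ten pairs are illustrative values, not confirmed source data.
sales_calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
copiers_sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]


def correlation(x, y):
    """Pearson r = sum((x - x_bar)(y - y_bar)) / ((n - 1) * s_x * s_y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return cross / ((n - 1) * s_x * s_y)


r = correlation(sales_calls, copiers_sold)
print(f"r = {r:.3f}")
```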
13-13
Correlation Coefficient - Example
13-14
Correlation Coefficient - Example
What does a correlation of 0.759 mean?
First, it is positive, so we see there is a direct
relationship between the number of sales calls and
the number of copiers sold.
The value of 0.759 is fairly close to 1.00, so we
conclude that the association is strong.
However, does this mean that more sales calls cause
more sales?
No, we have not demonstrated cause and effect here,
only that the two variables—sales calls and copiers
sold—are related.
13-15
Correlation Coefficient - Testing
Significance
Let
H0: ρ = 0 (the correlation in the population is 0)
H1: ρ ≠ 0 (the correlation in the population is not 0)
Reject H0 if:
t > t_{α/2, n-2} or t < -t_{α/2, n-2}
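The test statistic t in the rejection rule above is computed from the sample correlation. In its usual form, with n - 2 degrees of freedom, it is:

```latex
t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}
```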
13-16
Testing Significance of Correlation Coefficient
– Copier Sales Example
Reject H0 if: t > t_{α/2, n-2} or t < -t_{α/2, n-2}
t > t_{0.025, 8} or t < -t_{0.025, 8}
t > 2.306 or t < -2.306
13-17
Testing Significance of Correlation
Coefficient – Copier Sales Example
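As a sketch of the computation for this example, the snippet below uses r = 0.759 and n = 10 from the slides together with the usual statistic t = r·sqrt(n - 2) / sqrt(1 - r²). The formula is the standard one and is assumed here; the variable names are illustrative.

```python
import math

r = 0.759           # sample correlation from the example
n = 10              # number of sales representatives sampled
t_critical = 2.306  # t_{0.025, 8}, the critical value quoted above

# Test statistic for H0: rho = 0, with n - 2 degrees of freedom.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Two-tailed decision rule: reject H0 when |t| exceeds the critical value.
reject_h0 = abs(t) > t_critical
print(f"t = {t:.3f}, reject H0: {reject_h0}")
```

Since the computed t falls in the rejection region, the data suggest that the correlation in the population is not zero.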
13-19
Scatter Plots
[Figure: scatter plots illustrating no correlation, strong positive correlation, and weak negative correlation]
13-20
Regression Analysis
In regression analysis we use the independent variable (X)
to estimate the dependent variable (Y).
◼ The relationship between the variables is linear.
◼ Both variables must be at least interval scale.
◼ The least squares criterion is used to determine the
equation.
Regression Equation: An equation that expresses the
linear relationship between two variables.
Least Squares Principle: Determining a regression
equation by minimizing the sum of the squares of the
vertical distances between the actual Y values and the
predicted values of Y.
13-21
Linear Regression Model
13-22
Regression Analysis – Least Squares
Principle
13-23
Regression Analysis – Least Squares
Principle
◼ LEAST SQUARES PRINCIPLE
A mathematical procedure that uses the data
to position a line with the objective of
minimizing the sum of the squares of the
vertical distances between the actual Y
values and the predicted values of Y.
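To illustrate the principle, the sketch below compares the sum of squared vertical distances for the least squares line with that of two other candidate lines. The data are small hypothetical values made up for illustration, and the function names are likewise assumptions.

```python
def sum_squared_errors(x, y, a, b):
    """Sum of squared vertical distances between y and the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


def least_squares_line(x, y):
    """Return (a, b) for the line that minimizes the sum of squared errors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares_line(x, y)
best = sum_squared_errors(x, y, a, b)

# Changing the slope or the intercept increases the sum of squared errors.
steeper = sum_squared_errors(x, y, a, b + 0.2)
shifted = sum_squared_errors(x, y, a + 0.5, b)
print(best, steeper, shifted)
```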
13-26
Computing Slope and Y-intercept
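The slope and intercept of the least squares line are commonly computed from the correlation coefficient and the sample statistics as below. The symbols s_x, s_y, X̄ and Ȳ (sample standard deviations and means) are notation assumed here.

```latex
b = r \left( \frac{s_y}{s_x} \right), \qquad a = \bar{Y} - b\,\bar{X}
```

An equivalent form of the slope is b = Σ(X - X̄)(Y - Ȳ) / Σ(X - X̄)², which avoids computing r first.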
13-28
Regression Equation - Example
Recall the example involving Copier
Sales of America. The sales
manager gathered information on
the number of sales calls made and
the number of copiers sold for a
random sample of 10 sales
representatives. Use the least
squares method to determine a
linear equation to express the
relationship between the two
variables.
What is the expected number of
copiers sold by a representative
who made 20 calls?
13-29
Finding and Fitting the Regression
Equation - Example
Step 1 – Find the slope (b) of the line
Regression equation:
Ŷ = a + bX
Ŷ = 18.9476 + 1.1842 X
Ŷ = 18.9476 + 1.1842 (20)
Ŷ = 42.6316
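The fit and the prediction can be sketched in a few lines of Python. The ten data pairs are an assumption: the sample table is not reproduced in the text, so the values below are illustrative figures chosen to be consistent with the slope of 1.1842 reported here.

```python
# ASSUMED sample: illustrative values, not confirmed source data.
sales_calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
copiers_sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

n = len(sales_calls)
x_bar = sum(sales_calls) / n
y_bar = sum(copiers_sold) / n

# Slope: sum of cross-deviations over sum of squared X deviations.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(sales_calls, copiers_sold))
     / sum((x - x_bar) ** 2 for x in sales_calls))
# Intercept: the fitted line passes through (x_bar, y_bar).
a = y_bar - b * x_bar

predicted = a + b * 20  # expected copiers sold after 20 calls
print(f"b = {b:.4f}, a = {a:.4f}, prediction = {predicted:.4f}")
```

At full precision the intercept can differ from the slide's 18.9476 in the fourth decimal place; the slide value appears to be computed from the rounded slope.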
13-30
Testing Significance of Slope –
Copier Sales Example
H0: β = 0 (the slope of the linear model is 0)
H1: β ≠ 0 (the slope of the linear model is not 0)
Reject H0 if: t > t_{α/2, n-2} or t < -t_{α/2, n-2}
t > t_{0.025, 8} or t < -t_{0.025, 8}
t > 2.306 or t < -2.306
13-31
Testing Significance of Slope –
Copier Sales Example
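Compute the t statistic and make a conclusion:
\[ t = \frac{b - 0}{s_b} = \frac{1.1842 - 0}{0.3591} = 3.297 \]
Because 3.297 > 2.306, we reject H0 and conclude that the
slope is significantly different from 0.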
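The decision rule can be checked numerically with the values given on the slides (slope 1.1842, standard error of the slope 0.3591, critical value 2.306). This is only a sketch; the variable names are illustrative.

```python
b = 1.1842          # estimated slope from the regression equation
s_b = 0.3591        # standard error of the slope, as given on the slide
t_critical = 2.306  # t_{0.025, 8}

# Test statistic for H0: beta = 0.
t = (b - 0) / s_b

# Two-tailed rejection rule.
reject_h0 = t > t_critical or t < -t_critical
print(f"t = {t:.3f}, reject H0: {reject_h0}")
```

With these rounded inputs the last digit of t may differ slightly from the slide's 3.297, which presumably reflects unrounded intermediate values. The conclusion is unaffected.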
13-33
Standard Error of Estimate
◼ The standard error of estimate measures the scatter,
or dispersion, of the observed values around the line
of regression.
◼ Formulas used to compute the standard error:
\[ s_{y \cdot x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} \]
or,
\[ s_{y \cdot x} = \sqrt{\frac{\sum Y^2 - a \sum Y - b \sum XY}{n - 2}} \]
13-34
Standard Error of Estimate Example
Recall the example involving Copier Sales of America.
The sales manager determined the least squares
regression equation given below.
Ŷ = 18.9476 + 1.1842 X
Determine the standard error of estimate as a measure
of how well the values fit the regression line.
13-35
Standard Error of Estimate Example
\[ s_{y \cdot x} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} = \sqrt{\frac{784.211}{10 - 2}} = 9.901 \]
13-36
Standard Error of Estimate
◼ If the standard error of estimate is small,
this indicates that the data are relatively close to the
regression line and the regression equation can be
used to predict Y with little error.
◼ If the standard error of estimate is large,
this indicates that the data are widely scattered
around the regression line, and the regression
equation will not provide a precise estimate of Y.
13-37
Computing the Estimates of Y
Step 1 – Using the regression equation, substitute the
value of each X to solve for the estimated sales.
For Tom Keller (X = 20):
Ŷ = 18.9476 + 1.1842 X
Ŷ = 18.9476 + 1.1842 (20)
Ŷ = 42.6316
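Repeating this substitution for every representative gives the estimated values, the residuals, and from them the standard error of estimate. The sketch below uses the regression coefficients from the slides. The ten data pairs are an assumption: the sample table is not reproduced in the text, so these are illustrative values chosen to be consistent with the reported results.

```python
import math

# ASSUMED sample: illustrative values, not confirmed source data.
sales_calls = [20, 40, 20, 30, 10, 10, 20, 20, 20, 30]
copiers_sold = [30, 60, 40, 60, 30, 40, 40, 50, 30, 70]

a, b = 18.9476, 1.1842  # regression coefficients from the slides

# Estimated Y for each X, and the residual (actual minus estimated).
estimates = [a + b * x for x in sales_calls]
residuals = [y - y_hat for y, y_hat in zip(copiers_sold, estimates)]

# Sum of squared residuals and the standard error of estimate.
sse = sum(e ** 2 for e in residuals)
s_yx = math.sqrt(sse / (len(sales_calls) - 2))

for x, y, y_hat in zip(sales_calls, copiers_sold, estimates):
    print(f"X = {x:2d}  Y = {y:2d}  estimated Y = {y_hat:.4f}")
print(f"SSE = {sse:.3f}, s_y.x = {s_yx:.3f}")
```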
13-38
Plotting Estimated and Actual Y’s
13-39
Coefficient of Determination (r²)
The coefficient of determination (r²) is the
proportion of the total variation in the
dependent variable (Y) that is explained by the
variation in the independent variable (X).
◼ It is the square of the coefficient of correlation.
◼ It ranges from 0 to 1.
◼ It does not give any information on the direction of
the relationship between the variables.
13-40
Coefficient of Determination (r²) –
Copier Sales Example
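For the copier sales data, squaring the correlation coefficient r = 0.759 found earlier gives:

```latex
r^2 = (0.759)^2 \approx 0.576
```

That is, about 57.6% of the variation in the number of copiers sold is explained by the variation in the number of sales calls.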
13-41
Relationships among r, r² and the
Standard Error of Estimate
◼ In Chapter 12, the total variation was divided into two
components:
▪ variation due to the treatments and
▪ variation due to random error.
◼ The concept is similar in regression analysis.
◼ The total variation is divided into two components:
▪ variation explained by the regression (explained by the
independent variable) and
▪ the error, or residual. This is the unexplained variation.
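In symbols, the decomposition and its link to r² and the standard error of estimate can be written as below. The abbreviations SS total, SSR and SSE are common ones and are assumed here, since the notation is not defined in the text.

```latex
\begin{gather*}
\text{SS total} = \sum (Y - \bar{Y})^2, \quad
\text{SSR} = \sum (\hat{Y} - \bar{Y})^2, \quad
\text{SSE} = \sum (Y - \hat{Y})^2 \\
\text{SS total} = \text{SSR} + \text{SSE} \\
r^2 = \frac{\text{SSR}}{\text{SS total}} = 1 - \frac{\text{SSE}}{\text{SS total}},
\qquad
s_{y \cdot x} = \sqrt{\frac{\text{SSE}}{n - 2}}
\end{gather*}
```

A small SSE therefore means both a small standard error of estimate and an r² close to 1.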
13-43
Relationships among r, r² and the
Standard Error of Estimate
• Two statistics are used to evaluate the predictive ability of a
regression equation:
• the standard error of estimate and
• the coefficient of determination.
• When reporting the results of a regression analysis, the
findings must be clearly explained, especially when
using the results to make predictions of the
dependent variable.
• The report must always include a statement
regarding the coefficient of determination so that the
relative precision of the prediction is known to the
reader of the report.
13-45
Assumptions Underlying Linear
Regression
For each value of X, there is a group of Y values:
◼ These Y values are normally distributed. The means of these
normal distributions of Y values all lie on the straight line
of regression.
◼ The standard deviations of these normal distributions
are equal.
◼ The Y values are statistically independent. This means
that in the selection of a sample, the Y values chosen for
a particular X value do not depend on the Y values for
any other X values.
13-46
Assumptions Underlying Linear
Regression
13-47
Transforming Data
13-48
Transforming Data
◼ Two variables may be closely related even though their
relationship is not linear.
◼ The correlation between the variables Winnings
and Score is -0.782.
◼ This is a fairly strong inverse relationship.
◼ However, the relationship in the scatter diagram does not
appear to be linear.
13-49
Transforming Data
13-50
Transforming Data
Changing scale:
• Log
13-51
Transforming Data
◼ Implication of antilog
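A minimal sketch of the idea, under stated assumptions: the data are hypothetical values that follow an exact exponential curve, the dependent variable is the one transformed, base-10 logs are used, and the helper name `fit_line` is made up for illustration. Because the line is fitted to log10(Y), a prediction from the fitted equation is on the log scale and must be converted back with the antilog (10 raised to the predicted value).

```python
import math


def fit_line(x, y):
    """Least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b * x_bar, b


# Hypothetical data following y = 1000 * 10**(-0.5 * x): nonlinear in y.
x = [1, 2, 3, 4, 5]
y = [1000 * 10 ** (-0.5 * xi) for xi in x]

# Transform the dependent variable, then fit a straight line.
log_y = [math.log10(yi) for yi in y]
a, b = fit_line(x, log_y)

# The fitted equation predicts log10(y); take the antilog to recover y.
log_prediction = a + b * 2.5
prediction = 10 ** log_prediction
print(f"log10 scale: {log_prediction:.3f}, original scale: {prediction:.3f}")
```

Reporting the log-scale value as if it were in the original units is the usual mistake this step guards against.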
13-54
Learning Objectives (LO)
LO1 Define dependent and independent variables.
LO2 Calculate, test, and interpret the relationship
between two variables using the correlation coefficient.
LO3 Apply regression analysis to estimate the linear
relationship between two variables.
LO4 Interpret the regression analysis.
LO5 Evaluate the significance of the slope of the
regression equation.
LO6 Evaluate a regression equation to predict the
dependent variable.
LO7 Calculate and interpret the coefficient of determination.
LO8 Understand the concept of transforming data.
13-55