Chapter 14
Correlation and Simple Linear Regression
CHAPTER 14 MAP
14.1 Dependent and Independent Variables

14.2 Correlation Analysis

14.3 Simple Linear Regression Analysis

14.4 Using a Regression to Make a Prediction

14.5 Testing the Significance of the Slope of the Regression Equation

14.6 Assumptions for Regression Analysis

14.7 A Simple Regression Example with a Negative Correlation

14.8 Some Final (but Very Important) Thoughts

STATS IN PRACTICE
Correlation in the Financial Market
• According to the Wall Street Journal article “Stocks Are Moving in Tandem. That Can Be Scary” by Akane Otani, the correlations among the S&P 500’s eleven sectors spiked in 2018 to their highest level since the 2016 U.S. presidential election.

• Correlation is a key part of investors’ financial strategy; it measures the relationship between two variables. Perfect positive correlation is measured with a value of +1.0. Perfect negative correlation is measured with a value of –1.0.

• Correlation measures the degree to which investment values move together.
STATS IN PRACTICE
Correlation in the Financial Market
• Financial investments that move independently of one another are considered uncorrelated, which is represented by a correlation value close to zero. A correlation close to zero would be observed in financial portfolios that are properly diversified, which is a goal of many investors.

• However, the article reports that S&P 500 sectors have been moving together for almost two years, which is unusual, and for some investors that raises red flags. The article states that increasing correlations can create strong declines when stocks fall.

• According to Art Hogan, chief market strategist at investment bank B. Riley FBR, investors end up selling the good with the bad without proper picking and choosing, especially if they are mainly invested in investment funds traded on stock exchanges.
STATS IN PRACTICE
Correlation in the Financial Market
• Figure 14.1 shows how the prices of S&P 500 stocks move in the same direction as the S&P 500 sectors. The positive correlation in the movements has made it hard for investors to decide which stocks to invest in, said Andrew Thrasher, portfolio manager for Financial Enhancement Group.

• The bottom line of the Wall Street Journal article is that correlation tells us a lot about the trend in the financial market, and many investors use correlation analysis to make investment decisions. Financial investments that move independently of one another are considered to be uncorrelated, which helps ensure that an investment portfolio is properly diversified. What a great reason to master correlation.

Based on: “Stocks Are Moving in Tandem. That Can Be Scary,” Wall Street Journal, February 15, 2018.
STATS IN PRACTICE
Figure 14.1 Three-Month Rolling Average Correlations of S&P 500 Sectors and S&P 500 Stocks*

*https://www.wsj.com/articles/stocks-are-moving-in-tandem-that-can-be-scary-1518720399
14.1 Dependent and Independent
Variables
An independent variable, x, explains the
variation in another variable, which is called
the dependent variable, y.
• Variation in x explains variation in y, but not
the reverse (direction is only one way).
Independent variable (x) → Dependent variable (y)

14.2 Correlation Analysis

Correlation analysis is used to measure both the strength and direction of a linear relationship between two variables.
• A relationship is linear if the scatter plot of the independent and dependent variables has a straight-line pattern.
• Examples of linear relationships: [two example scatter plots in which the (x, y) points follow straight-line patterns]
Correlation Analysis
Example: A new car dealer wants to examine the
relationship between the number of TV ads run per
week and the number of cars sold that week.
• The number of ads per week is expected to affect
sales, not the reverse, so the number of ads is the
independent variable (x) and the number of cars
sold is the dependent variable (y).
Suppose a sample of 6 weeks is selected.
• Two values are recorded for each week: number of
TV ads and number of cars sold.

Correlation Analysis

Sample data and scatter plot of the ordered pairs (x, y):

Week   Number of TV Ads (x)   Number of Cars Sold (y)
1      3                      13
2      6                      31
3      4                      19
4      5                      27
5      6                      23
6      3                      19
Correlation Analysis
Construct a table to provide the values needed
for future calculations.
Week   TV Ads (x)   Cars Sold (y)   xy    x²    y²
1      3            13              39    9     169
2      6            31              186   36    961
3      4            19              76    16    361
4      5            27              135   25    729
5      6            23              138   36    529
6      3            19              57    9     361

Σx = 27   Σy = 132   Σxy = 631   Σx² = 131   Σy² = 3110
Correlation Analysis

Every calculation performed in this chapter can be completed using the five summation values shown in the bottom row of the last slide, along with the value for n, which represents the number of ordered pairs in the table.
• For this example, n = 6
• Call these values the Six-Number Summary for Correlation and Simple Regression, or SNSCSR:

Σx   Σy   Σxy   Σx²   Σy²   n
The Correlation Coefficient

The correlation coefficient, r, indicates both the strength and direction of the linear relationship between the independent and dependent variables.
• The values of r range from –1.0, a strong negative relationship, to +1.0, a strong positive relationship.
• When r = 0, there is no linear relationship between variables x and y.
The Correlation Coefficient
Examples of Approximate r Values

Graph A (r = 1.0): perfect positive correlation between x and y.
Graph B (r = –1.0): perfect negative correlation between x and y.
Graph C (r = 0.6): a moderately positive relationship; y tends to increase as x increases, but not necessarily at the steady rate observed in Graph A.
Graph D (r = –0.4): a relatively weak negative relationship; the correlation coefficient is closer to zero, and because r is negative, y tends to decrease as x increases.
Graph E (r = 0): no relationship between x and y.
The Correlation Coefficient
Formula for the Correlation Coefficient

r = [nΣxy – (Σx)(Σy)] / { √[nΣx² – (Σx)²] · √[nΣy² – (Σy)²] }

Substituting the values from this example:

r = [6(631) – (27)(132)] / { √[6(131) – (27)²] · √[6(3110) – (132)²] } = 222 / 265.4 = 0.836

Because r = 0.836 is positive and close to +1, there is a fairly strong positive relationship between the number of TV ads and cars sold.
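For readers who prefer to script the arithmetic, here is a minimal Python sketch (not part of the original slides; the variable names are my own) that computes r from the six-number summary:

```python
from math import sqrt

# Six-number summary for the TV ads / cars sold sample
n, sum_x, sum_y = 6, 27, 132
sum_xy, sum_x2, sum_y2 = 631, 131, 3110

# Pearson correlation coefficient from the summation values
numerator = n * sum_xy - sum_x * sum_y                                   # 222
denominator = sqrt(n * sum_x2 - sum_x**2) * sqrt(n * sum_y2 - sum_y**2)  # ~265.4
r = numerator / denominator
print(round(r, 3))   # 0.836
```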
Using Excel to Calculate the Correlation
Coefficient
Use the CORREL Function in Excel to calculate the correlation
coefficient
=CORREL(array1, array2)

where: array1 = The range of data for the first variable
       array2 = The range of data for the second variable
Conducting a Hypothesis Test to Determine the
Significance of the Correlation Coefficient

The value r represents the correlation coefficient for a random sample.
The population correlation coefficient (ρ) refers to the correlation between all values of two variables of interest in a population.
• A hypothesis test can be used to determine if the population correlation coefficient, ρ, is significantly different from zero.
• One-tail example:
H0: ρ ≤ 0 (there is no positive relationship between x and y)
H1: ρ > 0 (there is a positive linear relationship)
Conducting a Hypothesis Test to Determine the
Significance of the Correlation Coefficient
Formula for the Test Statistic for the Correlation Coefficient

t = r / √[ (1 – r²) / (n – 2) ]

where:
r = The sample correlation coefficient
n = The number of ordered pairs

Using values from the prior example: r = 0.836, n = 6

t = 0.836 / √[ (1 – 0.836²) / (6 – 2) ] = 0.836 / 0.2744 = 3.047
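A quick Python check of this test (a sketch, assuming SciPy is installed; variable names are mine):

```python
from math import sqrt
from scipy import stats

r, n = 0.836, 6
t = r / sqrt((1 - r**2) / (n - 2))
t_crit = stats.t.ppf(0.95, df=n - 2)       # upper-tail critical value for alpha = 0.05
print(round(t, 3), round(t_crit, 3))       # 3.047 2.132 -> reject H0 since t > t_crit
```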
Conducting a Hypothesis Test to Determine the
Significance of the Correlation Coefficient
The critical t-score is from the t-distribution with n – 2 degrees of freedom.
This one-tail test requires area α in the upper tail.
• For n = 6 (df = 4) and α = 0.05, we get tα = 2.132 from Appendix A, Table 5 (or by using =T.INV(0.95, 4) in Excel).

Because t = 3.047 is greater than tα = 2.132, we reject the null hypothesis and conclude that the population correlation coefficient is greater than zero.

[Figure: t-distribution with the "Do not reject H0" region (area 0.95) to the left of tα = 2.132 and the "Reject H0" region (area α = 0.05) in the upper tail.]
14.3 Simple Regression Analysis

Simple regression analysis is used to determine a straight line that best fits a series of ordered pairs (x, y).
• This technique is known as simple regression because we are using only one independent variable.
• Multiple regression, which includes more than one independent variable, is discussed in Chapter 15.
Simple Regression Analysis

Formula for the equation describing a straight line through ordered pairs:

ŷ = b0 + b1x

where:
ŷ = The predicted value of y given a value of x
x = The independent variable
b0 = The y-intercept of the straight line
b1 = The slope of the straight line

[Figure: a graph of the line described by the equation ŷ = 2.0 + 0.5x.]
Simple Regression Analysis

The simple linear regression model for a population:

yi = β0 + β1xi + ei

where:
yi = The i th observation for the dependent variable from the population
β0 = The population y-intercept
β1 = The population slope
xi = The i th observation for the independent variable from the population
ei = The residual for the i th observation from the population

• The goal in this chapter is to estimate β0 and β1 based on a sample of ordered pairs (x, y).
Simple Regression Analysis

The difference between the actual data value and the predicted value is known as the residual, ei:

ei = yi – ŷi

where:
ei = The residual of the ith observation in the sample
yi = The actual value of the dependent variable for the ith data point
ŷi = The predicted value of the dependent variable for the ith data point
Simple Regression Analysis

[Figure: a scatter plot with the fitted regression line. For a given xi, the vertical distance between the observed value of y (yi) and the predicted value of y on the line (ŷi) is the residual for that x value. The line has intercept b0 and slope b1.]
The Least Squares Method
The least squares method identifies the linear equation that best fits a set of ordered pairs.
• It is used to find the values for b0 (the y-intercept) and b1 (the slope of the line).
• The resulting best-fit line is called the regression line.

Goal: minimize the total squared error between the values of y and ŷ.
• The least squares method will minimize the sum of squares error (SSE):

SSE = Σ(yi – ŷi)²

where the sum runs over all n ordered pairs around the line that best fits the data.
Calculating the Slope and
y-intercept Manually
Formulas for the Regression Slope and y-intercept:

b1 = [nΣxy – (Σx)(Σy)] / [nΣx² – (Σx)²]

b0 = ȳ – b1x̄ = (Σy)/n – b1(Σx)/n
Calculating the Slope and
y-intercept Manually
Using the car sales vs. TV ads data:
Σx = 27   Σy = 132   Σxy = 631   Σx² = 131   Σy² = 3110   n = 6

b1 = [6(631) – (27)(132)] / [6(131) – (27)²] = 222 / 57 = 3.8947

b0 = 132/6 – (3.8947)(27/6) = 22 – 17.5263 = 4.4737

So the regression equation is: ŷ = 4.4737 + 3.8947x
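The same arithmetic in a small Python sketch (my own helper code, not from the text), reusing the six-number summary:

```python
n, sum_x, sum_y, sum_xy, sum_x2 = 6, 27, 132, 631, 131

b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)   # slope
b0 = sum_y / n - b1 * sum_x / n                               # y-intercept
print(round(b1, 4), round(b0, 4))    # 3.8947 4.4737
```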
Calculating the Slope and
y-intercept Manually
The regression equation is: ŷ = 4.4737 + 3.8947x

• Slope = 3.8947: On average, each additional TV ad increases the number of cars sold by 3.8947 per week.
• Intercept = 4.4737.

[Figure: scatter plot of the sample data with the fitted regression line.]
Calculating the Slope and
y-intercept Manually
The regression equation is: ŷ = 4.4737 + 3.8947x

What is the predicted number of cars sold for a week with 5 TV ads?

If we set x = 5 in the regression equation, we get:

ŷ = 4.4737 + 3.8947(5) = 23.95 cars

This is a point estimate from the regression equation, given x = 5.
Calculating the Slope and y-intercept
Using Excel
1. Enter the data for the two variables in the worksheet.
2. Go to the Data tab and select Data Analysis, which opens the Data
Analysis dialog box.
3. Scroll down to Regression and click OK to open the Regression dialog
box.

Calculating the Slope and y-intercept
Using Excel
4. Click on the first text box, which is labeled Input Y Range.
5. Highlight the cells containing the dependent variable values, including the column label.
6. Click on the second text box, which is labeled Input X Range.
7. Highlight the cells containing the independent variable values, including the column label.
8. Check the Labels box.
9. Click on Output Range, tell Excel where to put the report, then click OK.
Calculating the Slope and y-intercept
Using Excel

[Figure: Excel regression output with callouts marking the correlation coefficient r, the y-intercept, and the slope value.]
Partitioning the Sum of Squares
The total sum of squares (SST) measures the total variation in the dependent variable.
• Total variation is made up of two parts:

SST = SSR + SSE
(Total Sum of Squares = Sum of Squares Regression + Sum of Squares Error)

SST = Σ(y – ȳ)²   SSR = Σ(ŷ – ȳ)²   SSE = Σ(y – ŷ)²

where:
y = A value of the dependent variable from the sample
ȳ = The average value of the dependent variable from the sample
ŷ = The estimated value of y for a given x value
Partitioning the Sum of Squares

Values can be calculated by hand if necessary.

Formula for the Total Sum of Squares (SST), Calculator-Friendly Version:

SST = Σy² – (Σy)²/n

Formula for the Sum of Squares Error (SSE), Calculator-Friendly Version:

SSE = Σy² – b0Σy – b1Σxy
Partitioning the Sum of Squares

[Figure: Excel ANOVA output showing the sums of squares: SSR = 144.10526, SSE = 61.89474, SST = 206.]

Relationship between the SSR and SSE, for this example:
SSR + SSE = SST: 144.10526 + 61.89474 = 206
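A small Python cross-check of these sums of squares, using the calculator-friendly formulas above (a sketch; variable names are mine):

```python
n, sum_x, sum_y, sum_xy, sum_x2, sum_y2 = 6, 27, 132, 631, 131, 3110

b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
b0 = sum_y / n - b1 * sum_x / n

sst = sum_y2 - sum_y**2 / n                 # 206.0
sse = sum_y2 - b0 * sum_y - b1 * sum_xy     # ~61.895
ssr = sst - sse                             # ~144.105
r_squared = ssr / sst                       # ~0.6995, previewing the next slides
print(round(sst, 3), round(sse, 3), round(ssr, 3), round(r_squared, 4))
```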
Calculating the Sample
Coefficient of Determination
The sample coefficient of determination, R², measures the percentage of the total variation of the dependent variable that is explained by the independent variable from a sample:

R² = SSR / SST

• R² varies from 0% to 100%.
• Higher values of R² are more desirable than lower ones because we would like to explain as much of the variation in the dependent variable as possible.
Calculating the Sample
Coefficient of Determination
Using data from the car sales example:

R² = SSR / SST = 144.105 / 206 = 0.6995

• 69.95% of the variation in car sales per week is explained by the number of TV ads per week.
• The value of R² is equal to the square of the correlation coefficient, r: (0.836)² ≈ 0.699.
Conducting a Hypothesis Test to Determine the
Significance of the Coefficient of Determination

The population coefficient of determination, ρ², is unknown.
The calculated value of R² represents the coefficient of determination for a random sample from the population.
• Use this hypothesis test to determine if the population coefficient of determination is significantly different from zero (based on the sample coefficient of determination):
H0: ρ² ≤ 0
H1: ρ² > 0
Conducting a Hypothesis Test to Determine the
Significance of the Coefficient of Determination
H0: ρ² ≤ 0 (none of the variation in y is explained by x)
H1: ρ² > 0 (x does explain a significant portion of the variation in y)

The F-test statistic is the appropriate test statistic for this hypothesis test:

F = SSR / [SSE / (n – 2)]

with degrees of freedom D1 = 1 and D2 = n – 2.

This value can be calculated manually or found in Excel.
Conducting a Hypothesis Test to Determine the
Significance of the Coefficient of Determination
The F-statistic in Excel output: [Figure: Excel ANOVA output with callouts marking the calculated F-test statistic, the p-value for the F-test, n, SSR, SSE, and the degrees of freedom.]
Conducting a Hypothesis Test to Determine the
Significance of the Coefficient of Determination
Using the car sales vs. TV ads data:
H0: ρ² ≤ 0
H1: ρ² > 0

For this example: D1 = 1, D2 = n – 2 = 6 – 2 = 4, α = 0.05

F = SSR / [SSE / (n – 2)] = 144.105 / (61.895 / 4) = 9.313

The critical F-score for α = 0.05 and degrees of freedom equal to 1 and 4 is Fα = 7.709 (=F.INV.RT(0.05, 1, 4) in Excel).

Since F = 9.313 > Fα = 7.709, we reject H0 and conclude that the coefficient of determination is greater than zero.

[Figure: F-distribution with the "Do not reject H0" region (area 1 – α = 0.95) to the left of Fα = 7.709 and the "Reject H0" region (α = 0.05) in the upper tail.]
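The same F-test in Python (a sketch assuming SciPy; names are mine):

```python
from scipy import stats

n, ssr, sse = 6, 144.105, 61.895
F = ssr / (sse / (n - 2))
F_crit = stats.f.ppf(0.95, dfn=1, dfd=n - 2)     # right-tail critical value for alpha = 0.05
p_value = stats.f.sf(F, 1, n - 2)                # upper-tail p-value
print(round(F, 3), round(F_crit, 3), round(p_value, 4))   # 9.313 7.709 ~0.038
```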
Conducting a Hypothesis Test to Determine the
Significance of the Coefficient of Determination

Using the car sales vs. TV ads data:

The p-value for F = 9.313 can be found in Excel:
=F.DIST.RT(9.313, 1, 4) = 0.03795

• The p-value is 0.03795, which is less than α = 0.05, so we reject the null hypothesis that there is no relationship between TV ads and the number of cars sold per week.

[Figure: F-distribution showing the p-value as the area of 0.03795 to the right of F = 9.313, beyond the critical value Fα = 7.709.]
14.4 Using a Regression to Make a
Prediction
A point estimate for ŷ for a given x value is found by inserting the desired xi value into the regression equation.
• We can also construct a confidence interval around the point estimate.
Constructing such a confidence interval requires the standard error of the estimate, se, which measures the amount of dispersion of the observed data around the regression line.
Using a Regression to Make a Prediction
se measures the variation of observed y values from the regression line.

[Figure: two scatter plots with fitted lines, one where the points lie close to the line (small se) and one where the points are widely dispersed around the line (large se).]
Using a Regression to Make a Prediction

se can be calculated manually or found in Excel:

se = √[ SSE / (n – 2) ]

For this example: se = √(61.895 / 4) = 3.934

[Figure: Excel regression output with callouts marking the standard error of the estimate (se), n, and SSE.]
The Confidence Interval for an Average
Value of y Based on a Value of x
Formula for the Confidence Interval (CI) for an Average Value of y:

CI = ŷ ± tα/2 · se · √[ 1/n + (x – x̄)² / (Σx² – (Σx)²/n) ]

where:
CI = The confidence interval for an average value of y
ŷ = The predicted y value for the desired value of x
tα/2 = The critical t-statistic from the Student's t-distribution with n – 2 degrees of freedom
se = The standard error of the estimate
n = The number of ordered pairs
x̄ = The average value of x from the sample
The Confidence Interval for an Average
Value of y Based on a Value of x
Example: Using the car sales vs. TV ads data with 5 ads per week (x = 5):

ŷ = 4.4737 + 3.8947(5) = 23.95

We also need the average value of x: x̄ = 27/6 = 4.5

The needed t-statistic, tα/2, has n – 2 = 4 degrees of freedom (use α = 0.05):

tα/2 = 2.776
The Confidence Interval for an Average
Value of y Based on a Value of x
Example: (continued)
Completing the computation for the confidence interval:

CI = 23.95 ± (2.776)(3.934) √[ 1/6 + (5 – 4.5)² / (131 – (27)²/6) ] = 23.95 ± 4.798

UCL = 23.95 + 4.798 = 28.748
LCL = 23.95 – 4.798 = 19.152
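The same interval in a Python sketch (SciPy assumed for the t critical value; variable names are mine):

```python
from math import sqrt
from scipy import stats

n, x_bar, sum_x, sum_x2 = 6, 4.5, 27, 131
sse, x_new, y_hat = 61.895, 5, 23.95

se = sqrt(sse / (n - 2))                              # standard error of the estimate
t_crit = stats.t.ppf(0.975, df=n - 2)                 # 2.776 for a 95% interval
margin = t_crit * se * sqrt(1/n + (x_new - x_bar)**2 / (sum_x2 - sum_x**2 / n))
print(round(y_hat - margin, 2), round(y_hat + margin, 2))   # ~19.15 to ~28.75
```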
The Confidence Interval for an Average
Value of y Based on a Value of x
Example: (continued)
Interpreting the confidence
interval:

UCL = 23.95 + 4.798 = 28.748

LCL = 23.95 – 4.798 = 19.152


We are 95% confident that the
average number of cars sold for
all weeks in which 5 TV ads are
used will be between 19.152 and
28.748.

The Prediction Interval for a Specific
Value of y Based on a Value of x
In the previous example, the confidence interval is for the
average number of cars sold for all weeks in which 5 TV ads
occur.
A prediction interval is an interval for an individual week in
which x = 5.
Formula for the Prediction Interval (PI) for a Specific Value of y:

PI = ŷ ± tα/2 · se · √[ 1 + 1/n + (x – x̄)² / (Σx² – (Σx)²/n) ]
The Prediction Interval for a Specific
Value of y Based on a Value of x
Example: Compute the prediction interval using the car sales vs. TV ads data, for x = 5:

PI = 23.95 ± (2.776)(3.934) √[ 1 + 1/6 + (5 – 4.5)² / (131 – (27)²/6) ] = 23.95 ± 11.93

UCL = 23.95 + 11.93 = 35.88
LCL = 23.95 – 11.93 = 12.02
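And the prediction interval, continuing the previous sketch; only the extra 1 under the square root changes:

```python
from math import sqrt
from scipy import stats

n, x_bar, sum_x, sum_x2 = 6, 4.5, 27, 131
sse, x_new, y_hat = 61.895, 5, 23.95

se = sqrt(sse / (n - 2))
t_crit = stats.t.ppf(0.975, df=n - 2)
margin = t_crit * se * sqrt(1 + 1/n + (x_new - x_bar)**2 / (sum_x2 - sum_x**2 / n))
print(round(y_hat - margin, 2), round(y_hat + margin, 2))   # ~12.02 to ~35.88
```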
The Prediction Interval for a Specific
Value of y Based on a Value of x
Example: (continued)
Interpreting the prediction interval:

UCL = 23.95 + 11.93 = 35.88
LCL = 23.95 – 11.93 = 12.02

We are 95% confident that the number of cars sold in a particular week in which 5 TV ads are used will be between 12.02 and 35.88.
• The prediction interval estimates a single value, so the variation is greater than when estimating an average value.
14.5 Testing the Significance of the
Slope of the Regression Equation
The calculated value of the slope, b1, is from a
random sample from the full population.
• The population regression slope, β1, is
unknown.
If the population slope is zero, then x has no
effect on y, and we would conclude that there
is no relationship between the dependent and
independent variables.

Testing the Significance of the
Slope of the Regression Equation
We can perform a hypothesis test to
determine if the population regression slope,
β1, is significantly different from zero, based
on the sample regression slope, b1.
H0 : β1 = 0 (There is no linear relationship between the independent
and dependent variables.)

H1 : β1 ≠ 0 (There is a linear relationship between x and y.)

• Use a t-test statistic for this hypothesis test, with n – 2 degrees of freedom.
Testing the Significance of the
Slope of the Regression Equation
Formula for the t-test Statistic for the Regression Slope:

t = (b1 – β1) / sb

where:
b1 = The sample regression slope
β1 = The population regression slope from the null hypothesis
sb = The standard error of the slope
Testing the Significance of the
Slope of the Regression Equation
The standard error of the slope, sb , measures
the variation in the estimate of the slope of the
regression equation, b1.
• The regression slope would vary if
separate regressions were performed
with several sets of samples from the
population.
• A smaller standard error of the slope
increases the likelihood that we can
establish a significant relationship
between x and y.

Testing the Significance of the
Slope of the Regression Equation
Formula for the Standard Error of a Slope:

sb = se / √[ Σx² – (Σx)²/n ]

Example: Using the car sales vs. TV ads data:
• The calculated value of sb is:

sb = 3.934 / √[ 131 – (27)²/6 ] = 3.934 / √9.5 = 1.276
Testing the Significance of the
Slope of the Regression Equation
Example: (continued) Computing the t-test statistic for the regression slope
• Recall that the regression equation is ŷ = 4.4737 + 3.8947x.

H0: β1 = 0
H1: β1 ≠ 0

t = (b1 – β1) / sb = (3.8947 – 0) / 1.276 = 3.05

• For α = 0.05, the critical t-value (with n – 2 = 6 – 2 = 4 df) is tα/2 = 2.776.

Since t = 3.05 > tα/2 = 2.776, we reject H0 and conclude that the population regression slope is not equal to zero and that there is a relationship between TV ads and car sales.
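A Python version of this slope test (a sketch, SciPy assumed; names are mine):

```python
from math import sqrt
from scipy import stats

n, sum_x, sum_x2 = 6, 27, 131
b1, sse = 3.8947, 61.895

se = sqrt(sse / (n - 2))                         # standard error of the estimate
sb = se / sqrt(sum_x2 - sum_x**2 / n)            # standard error of the slope
t = (b1 - 0) / sb                                # test statistic under H0: beta1 = 0
t_crit = stats.t.ppf(0.975, df=n - 2)
print(round(t, 2), round(t_crit, 3))             # 3.05 2.776 -> reject H0
```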
Testing the Significance of the
Slope of the Regression Equation
Formula for the Confidence Interval for the Slope of a Regression:

CI = b1 ± tα/2 · sb

Example: Using the car sales vs. TV ads data:
• The 95% confidence interval for b1 is:

CI = 3.8947 ± (2.776)(1.276) = 3.8947 ± 3.542
Testing the Significance of the
Slope of the Regression Equation
Example: (continued) Finding the confidence interval for the
regression slope
UCL = 3.895 + 3.542 = 7.437
LCL = 3.895 – 3.542 = 0.353
• Based on our sample of six weeks, we are 95% confident that the true
population slope is between 0.353 and 7.437.
• We are 95% confident that every additional TV ad will increase the
number of cars sold by between 0.353 and 7.437 cars per week.
Since this confidence interval does not include zero, we have
evidence to conclude that there is a relationship between TV ads
and car sales.

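In code, continuing the previous sketch (my own variable names):

```python
from math import sqrt
from scipy import stats

n, sum_x, sum_x2, b1, sse = 6, 27, 131, 3.8947, 61.895

sb = sqrt(sse / (n - 2)) / sqrt(sum_x2 - sum_x**2 / n)
t_crit = stats.t.ppf(0.975, df=n - 2)
lcl, ucl = b1 - t_crit * sb, b1 + t_crit * sb
print(round(lcl, 3), round(ucl, 3))   # roughly 0.35 to 7.44, matching the slide's 0.353 and 7.437 up to rounding
```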
Testing the Significance of the
Slope of the Regression Equation

The standard error of the slope (sb), the calculated t-test statistic for the slope, and the confidence interval for the slope are all reported in the Excel regression output.

[Figure: Excel regression output with callouts marking sb, the t-test statistic, and the confidence interval for the slope.]
14.6 Assumptions for
Regression Analysis
For the results from regression analysis to be
reliable, certain key assumptions need to be
satisfied.
It is important when performing a regression
analysis to examine a scatter plot and a residual
plot for violations of regression assumptions.
• Regression estimates and predictions will be
less accurate or misleading if assumptions are
violated.

Assumptions for Regression Analysis
Assumption 1: The relationship between the independent and
dependent variables is linear.

[Figure: two scatter plots.]
Linear: data in this scatter plot appear to follow a linear pattern.
Not linear: for low and high values of x, the estimated value will be too high; estimated values for x's in the middle of the x range will be too low.
Assumptions for Regression Analysis
Assumption 2: The residuals exhibit no patterns across values of
the independent variable.
• The residual for each ordered pair is the difference between the actual
and the predicted values of the dependent variable.

• Excel will generate a residual plot for each ordered pair in the data set (a Python sketch of the same plot follows this list):
1. Enter the x and y data in separate columns in a worksheet.
2. Go to Data > Data Analysis.
3. Select Regression from Data Analysis and click OK.
4. In the Regression dialog box, check the Residuals and Residual Plots options and click OK.
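Here is the Python sketch mentioned above (assuming NumPy and Matplotlib are installed; not part of the original slides), using the TV-ads example data:

```python
import numpy as np
import matplotlib.pyplot as plt

# TV ads (x) and cars sold (y) from the chapter example
x = np.array([3, 6, 4, 5, 6, 3])
y = np.array([13, 31, 19, 27, 23, 19])

b1, b0 = np.polyfit(x, y, 1)          # least squares slope and intercept
residuals = y - (b0 + b1 * x)         # observed minus predicted values

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Number of TV ads (x)")
plt.ylabel("Residual")
plt.title("Residual plot: look for no pattern")
plt.show()
```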
Assumptions for Regression Analysis
Assumption 2: (continued) The residuals exhibit no patterns across
values for the independent variable.
[Figure: two scatter plots with their corresponding residual plots.]
No pattern in this residual plot (random residuals).
Pattern present: residuals for low and high values of x are mostly negative, while residuals for x's in the middle of the x range are mostly positive.
Assumptions for Regression Analysis

Assumption 3 (homoscedasticity): The variation of the dependent variable is the same across all values of the independent variable.
• The residual plot can be used to evaluate this assumption:

[Figure: two residual plots.]
Constant variance: roughly the same variation as x varies from low to high.
Non-constant variance: variation increases as x varies from low to high.
Assumptions for Regression Analysis

Assumption 4: The residuals from the ordered pairs follow the normal probability distribution. Inspect the residual plot to evaluate this assumption.

A normal probability plot can be used to verify whether data follow the normal probability distribution by graphing the data on the y-axis and the z-scores for the data on the x-axis.

• It is important to examine both the scatter and residual plots for violations of the assumptions on which a regression analysis depends.
14.7 A Simple Regression Example
with a Negative Correlation
This section will help review the concepts
presented so far in the chapter.
This time, consider a negative relationship
between the dependent and independent
variables.
• The slope of the regression equation will be
negative.
• The analysis and interpretation of results will be
very similar to what was done in prior sections of
this chapter.

A Simple Regression Example
with a Negative Correlation
Example for this section:
• A consumer products testing company wants to examine the impact of driving speed on gasoline use for a particular model of car for an upcoming article in its magazine.
• A fixed-length course was traveled at different speeds (in miles per hour, MPH) and gas use was recorded (in miles per gallon, MPG).
• MPG can vary due to other factors (such as temperature, wind speed, and use of cruise control), but these other factors were not measured.
• Data were collected for 8 different speeds (see next slide).
A Simple Regression Example
with a Negative Correlation
Sample data:

Speed (MPH)   MPG
40            32
45            28
50            30
55            25
60            27
65            24
70            22
75            23

Speed is expected to affect MPG, so speed is the independent (x) variable and MPG is the dependent (y) variable.
A Simple Regression Example
with a Negative Correlation
Construct a table to provide the values needed for future calculations.

     Speed (MPH)   Gas Use (MPG)
     x             y        xy       x²       y²
1    40            32       1280     1600     1024
2    45            28       1260     2025     784
3    50            30       1500     2500     900
4    55            25       1375     3025     625
5    60            27       1620     3600     729
6    65            24       1560     4225     576
7    70            22       1540     4900     484
8    75            23       1725     5625     529

n = 8   Σx = 460   Σy = 211   Σxy = 11,860   Σx² = 27,500   Σy² = 5,651
The Correlation Coefficient
Calculating the correlation coefficient:

r = [8(11,860) – (460)(211)] / { √[8(27,500) – (460)²] · √[8(5,651) – (211)²] } = –2,180 / 2,402.3 = –0.907

Because r = –0.907 is negative and close to –1, there is a strong negative relationship between speed and MPG.
• As speed increases, MPG tends to fall.
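The same check in Python (a sketch with my own variable names), this time built from the raw data rather than the summations:

```python
from math import sqrt

speed = [40, 45, 50, 55, 60, 65, 70, 75]
mpg = [32, 28, 30, 25, 27, 24, 22, 23]
n = len(speed)

sum_x, sum_y = sum(speed), sum(mpg)
sum_xy = sum(x * y for x, y in zip(speed, mpg))
sum_x2 = sum(x * x for x in speed)
sum_y2 = sum(y * y for y in mpg)

r = (n * sum_xy - sum_x * sum_y) / (
    sqrt(n * sum_x2 - sum_x**2) * sqrt(n * sum_y2 - sum_y**2))
print(round(r, 3))   # -0.907
```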
The Correlation Coefficient
Use the following hypothesis statement to test
the significance of the correlation coefficient.
H0: ρ = 0 (there is no relationship between speed and MPG)
H1: ρ ≠ 0 (there is a linear relationship)

Suppose α = 0.02 is selected for this test.
• The test statistic is:

t = r / √[ (1 – r²) / (n – 2) ] = –0.907 / √[ (1 – (–0.907)²) / 6 ] = –5.276
The Correlation Coefficient
The critical t-score is from the t-distribution with n – 2 degrees of freedom.
A two-tail test requires area α/2 in each tail.
• For n = 8 (df = 6) and α = 0.02, we get tα/2 = 3.143.

t = –5.276 is in the rejection region, so we reject the null hypothesis and conclude that the population correlation coefficient is not equal to zero.

[Figure: t-distribution with "Reject H0" regions of area α/2 = 0.01 in each tail and a "Do not reject H0" region of area 0.98 in the middle.]
Calculating the Slope
and y-intercept Manually
Σx = 460   Σy = 211   Σxy = 11,860   Σx² = 27,500   Σy² = 5,651   n = 8

b1 = [8(11,860) – (460)(211)] / [8(27,500) – (460)²] = –2,180 / 8,400 = –0.2595

b0 = 211/8 – (–0.2595)(460/8) = 26.375 + 14.92 = 41.30

So the regression equation (rounding to two decimal places) is: ŷ = 41.30 – 0.26x
Calculating the Slope
and y-intercept Manually

• b1 = –0.26: On average, each additional 1 MPH of speed on the test course decreases gas mileage by 0.26 MPG.

What is the predicted MPG for a speed of 60 MPH?

ŷ = 41.30 – 0.26(60) = 41.30 – 15.6 = 25.7

• The predicted gas mileage is 25.7 MPG at 60 MPH.
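A quick Python sketch of the fitted line and the 60-MPH prediction (again my own code, not from the text):

```python
n, sum_x, sum_y, sum_xy, sum_x2 = 8, 460, 211, 11860, 27500

b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)   # about -0.2595
b0 = sum_y / n - b1 * sum_x / n                               # about 41.30
print(round(b0, 2), round(b1, 2))       # 41.3 -0.26
print(round(b0 + b1 * 60, 1))           # 25.7 MPG predicted at 60 MPH
```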
Partitioning the Sum of Squares
for a Negative Correlation
Hand calculations:

SST = Σy² – (Σy)²/n = 5,651 – (211)²/8 = 85.875

SSE = Σy² – b0Σy – b1Σxy
(Note: this product is very sensitive to the way b1 is rounded, since Σxy is so large.)

Excel output with unrounded values gives the following:
SSE = 15.155
SST = 85.875
SSR = 70.720
Calculating the
Coefficient of Determination

R² = SSR / SST ≈ 0.83

83% of the variation in MPG is explained by variation in speed. (Using the unrounded Excel values for SSR and SST would give a slightly different value for R².)
Calculating the
Coefficient of Determination
Use the F-test statistic to test for the significance of the coefficient of determination:
H0: ρ² ≤ 0 (None of the variation in MPG is explained by speed.)
H1: ρ² > 0 (Speed does explain a significant portion of the variation in MPG.)

Using the hand-calculated values:

F = SSR / [SSE / (n – 2)] = 29.27

with degrees of freedom D1 = 1 and D2 = n – 2 = 8 – 2 = 6.
Calculating the
Coefficient of Determination
• Suppose we use α = 0.01.
• The critical F-score for α = 0.01 and degrees of freedom equal to 1 and 6 is Fα = 13.745.

H0: ρ² ≤ 0
H1: ρ² > 0

Since F = 29.27 > Fα = 13.745, we reject H0 and conclude that the coefficient of determination is greater than zero. There is sufficient evidence to support that a relationship exists between speed and MPG.

[Figure: F-distribution with the "Do not reject H0" region (area 1 – α = 0.99) to the left of Fα = 13.745 and the "Reject H0" region (α = 0.01) in the upper tail.]
Calculating Confidence and Prediction
Intervals for a Negative Correlation
Suppose we are interested in calculating a 98% confidence interval for MPG when the speed is 60 MPH.

ŷ = 41.30 – 0.26(60) = 25.7

• The predicted gas mileage is 25.7 MPG at 60 MPH.
• Confidence and prediction intervals can be found manually.
Calculating Confidence and Prediction
Intervals for a Negative Correlation
• Using manual calculations with α = 0.02, tα/2 = 3.143, and se ≈ 1.59:

CI = 25.7 ± (3.143)(1.59) √[ 1/8 + (60 – 57.5)² / (27,500 – (460)²/8) ] = 25.7 ± 1.81

Confidence interval:
UCL = 25.7 + 1.81 = 27.51
LCL = 25.7 – 1.81 = 23.89
Calculating Confidence and Prediction
Intervals for a Negative Correlation
• Prediction interval calculations:

PI = 25.7 ± (3.143)(1.59) √[ 1 + 1/8 + (60 – 57.5)² / (27,500 – (460)²/8) ] = 25.7 ± 5.31

Prediction interval:
UCL = 25.7 + 5.31 = 31.01
LCL = 25.7 – 5.31 = 20.39
Calculating Confidence and Prediction
Intervals for a Negative Correlation
Interpretations:
• We are 98% confident that the average mileage
for a speed of 60 MPH on the test course is
between 23.89 and 27.51 miles per gallon.
• We are 98% confident that the mileage for a
particular trip on the test course at 60 MPH is
between 20.39 and 31.01 miles per gallon.

Testing the Significance of the Regression of a Slope
with a Negative Correlation

• Now test to determine if the population regression slope is significantly different from zero.
H0: β1 = 0 (There is no relationship between speed and MPG.)
H1: β1 ≠ 0 (There is a relationship between speed and MPG.)
• Use a t-test statistic for this hypothesis test, with n – 2 degrees of freedom.
• The value of sb is needed:

sb = se / √[ Σx² – (Σx)²/n ] = 1.59 / √[ 27,500 – (460)²/8 ] = 1.59 / √1,050 ≈ 0.049
Testing the Significance of the Regression of a Slope
with a Negative Correlation

• Recall that the regression equation is ŷ = 41.30 – 0.26x.

H0: β1 = 0
H1: β1 ≠ 0

t = (b1 – β1) / sb = (–0.26 – 0) / 0.049 = –5.31

• Suppose α = 0.02 is chosen.
• The critical t-value (with n – 2 = 8 – 2 = 6 df) is tα/2 = 3.143.

Since | t | = 5.31 > tα/2 = 3.143, we reject H0 and conclude that the population regression slope is not equal to zero and that there is a relationship between car speed and MPG.
Testing the Significance of the Regression of a Slope
with a Negative Correlation

Finally, find the 98% confidence interval for the slope.

CI = b1 ± tα/2 · sb = –0.26 ± (3.143)(0.049) = –0.26 ± 0.154

• The confidence interval limits are:
UCL = –0.26 + 0.154 = –0.106
LCL = –0.26 – 0.154 = –0.414

• We are 98% confident that each 1 MPH increase in speed will decrease mileage by between 0.106 and 0.414 MPG.
Since this confidence interval does not include zero, we have evidence that there is a relationship between speed and MPG.
FOCUS ON ANALYTICS
The Law of Supply and Demand
• In Section 14.7, we discussed simple regression with a negative correlation when comparing price and demand. The correlation coefficient was negative because the price of and the demand for an item tend to move in opposite directions when looking at sample data.
• As one goes up, the other tends to go down. This negative relationship is the essence of one of the most basic economic laws, the law of “supply and demand,” as shown in Figure 14.31.
• According to investopedia.com, the law of supply and demand “is a theory that explains the interaction between the supply of a resource and the demand for that resource.” This law explains how the availability of a product or service and the desire for that product or service relate to its price. Generally, if the supply is low and the demand is high, the price increases.
FOCUS ON ANALYTICS
The Law of Supply and Demand
• Supply and demand do not affect only price. They can be used to describe other economic activity, such as unemployment. When unemployment is high (the supply of available workers is high), businesses tend to offer lower wages.
• On the other hand, when unemployment is low (the number of available workers is low), businesses tend to offer higher salaries.
14.8 Some Final (But Very Important)
Thoughts
Pitfalls in regression analysis
• Do not predict values of the dependent variable
beyond the range of the x values.
• There is no guarantee that the estimated relationship is
appropriate beyond the observed range of x.
• Results will be of questionable value.
• Do not confuse correlation with causality.
• Just because the relationship between the variables is
statistically significant doesn’t prove that the
independent variable actually caused the change in the
dependent variable.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted, in any form or by any means, electronic, mechanical, photocopying,
recording, or otherwise, without the prior written permission of the publisher.
Printed in the United States of America.
