BUSN 2429 Chapter 14 Correlation and Single Regression Model


Business Statistics

Course: BUSN 2429


Instructor: Bassem Hamid
“Correlation and Single Regression Model”
(Chapter 14)
Business Statistics Map

Introduction + Descriptive Statistics
1. Introduction to Business Statistics
2. Displaying Descriptive Statistics
3. Calculating Descriptive Statistics

Probability + Probability Distributions
4. Introduction to Probabilities
5. Discrete Probability Distribution
6. Continuous Probability Distribution
7. Sampling & Sampling Distribution

Inferential Statistics
8. Confidence Intervals
9. Hypothesis Test-One Sample Test
10. Hypothesis Test-Two Samples Test
14. Correlation and Single Regression Model
Chapter Diagram-Road Map

Population (X1, X2, ..., Xn; Y1, Y2, ..., Yn) → Sample (X1, X2, ..., Xn; Y1, Y2, ..., Yn)

Scatter Plot: describes the relationship (example: Grade vs. Hours, with X: Hours (Independent), Y: Grades (Dependent), r = 0.835)

Correlation (population and sample)
is a measure that indicates the mutual relationship between two variables X & Y

Population Correlation Coefficient, ρ
A hypothesis test can be used to determine if the population correlation coefficient, ρ, is significantly different from zero (it is based on r)
Ho: ρ ≤ 0 (there is no linear relationship between x & y)
H1: ρ > 0 (there is a linear relationship)

Sample Correlation Coefficient, r
measures the strength and direction of the linear relationship
r = 1 perfect positive correlation between x and y
r = -1 perfect negative correlation between x and y
r = 0 there is no linear relationship between x and y
r = 0.65 moderate positive relationship between x and y
r = -0.65 moderate negative relationship between x and y

Simple Regression Analysis Technique (Population)
Confidence interval is used to predict the value of ŷ

Simple Regression Analysis Technique (Sample)
is used to describe a straight line (Linear Equation) that best fits a series of ordered pairs of independent & dependent variables (x & y): ŷ = b0 + b1 x

Population Coefficient of Determination, ρ²
A hypothesis test can be used to determine if the population coefficient of determination is significantly different from zero (it is based on R²)
Ho : ρ² ≤ 0 there is no linear relationship between x and y
H1 : ρ² > 0 there is a linear relationship between x and y, or x does explain a significant portion of the variation in y

Sample Coefficient of Determination, R²
Measures the percentage of the total variation in the dependent variable Y that is explained by the independent variable X. e.g., R² is 70%. This means that 70% of the variation in Y can be explained by X and the rest is explained by other independent variables (not included in the model): ŷ = b0 + b1 x1 + b2 x2 + ... + bk xk

Slope of a Simple Regression Model (Population)
β1 is the slope of the single regression model (3 Methods)

Slope of a Simple Regression Model (Sample)
b1 is the slope of the single regression model
Chapter Diagram-Road Map (tools)

Correlation: mutual relationship between X and Y
Sample: Sample Correlation Coefficient, r (Excel)
Population: Population Correlation Coefficient, ρ. Testing the Significance of ρ (ρ ≠ 0): T-Test (Excel)

Simple Regression Analysis Technique
Sample: Describe the Linear Equation ŷ = b0 + b1 x (Excel)
Population: Confidence interval is used to predict the value of ŷ (Excel/PHStat)

Coefficient of Determination
Sample: Sample Coefficient of Determination, R² = SSR / SST (Excel)
Population: Population Coefficient of Determination, ρ². Testing the Significance of ρ² (ρ² ≠ 0): F-Test (Excel)

Slope of a Simple Regression Model
Sample: b1 (Excel)
Population: Testing the Significance of β1 (β1 ≠ 0): T-Test (Excel)
Outlines

This chapter covers the following points:


• #1 Correlation Analysis
• #2 Simple Linear Regressions Analysis
• #3 Developing Single/Simple Regression Model (Excel)
• #4 Developing Multiple Regressions Model (Excel)-OPTIONAL

5
Objectives
After completing this chapter, you will be able to:
• #1 Distinguish between a dependent and an independent variable
• #2 Use correlation analysis to measure the strength and direction in a
relation between two variables
• #3 Use the least squares method to determine the slope and intercept of a
linear formula that best fits a set of ordered pairs
• #4 Partition the sum of squares for dependent and independent variables
• #5 Understand the assumptions for the single regression analysis
• #6 Develop a single/multiple regression model using software
• #7 Interpret the meaning of regression coefficients

6
#1 Correlation Analysis
• #1.1 Dependent and Independent Variables
• #1.2 Scatterplot and Sample Correlation Coefficient
• #1.3 The Sample Correlation Coefficient Formula, r
• #1.4 The Significance for The Population Correlation Coefficient, ρ

7
Sample

#1.1 Dependent and Independent Variables


• An independent variable, x, explains the variation in another variable, which is
called the dependent variable, y
• Independent Variable (x) → Dependent Variable (y)
• This direct linear relationship between the independent and dependent
variables does not work in reverse
Examples of Independent (Explanatory or Predictor) and Dependent (Response) Variables
Independent Variable (x) → Dependent Variable (y)
The size of a television screen → The selling price of the television
The number of visitors per day on a web site → The amount of sales per day from the web site

Is there a relationship between TV size and TV price ($)?
Sample

#1.2 Scatterplot and the Sample Correlation Coefficient


A scatterplot is used to describe the relationship between the dependent and
independent variables through the following terms:
o Strength: How much scatter?
o Direction: What is the sign: positive, negative or neither?
o Form: Is it straight, curved, something exotic or no pattern?
o Unusual Features: Are there unusual observations or subgroups?

The Sample Correlation Coefficient, r, indicates both the strength and direction
of the linear relationship between the independent and dependent variables

[Figure: scatterplot of TV Price ($) versus TV Size (in); r = ?]
Examples of r Values (-1.0 to +1.0)

Graph A (r = 1.0): perfect positive correlation between x and y


Graph B (r = -1.0): perfect negative correlation between x and y
Graph C (r = 0.6): a moderately positive relationship: y tends to increase as x increases,
but not necessarily at the steady rate we observed in Graph A
Graph D (r = -0.4): a relatively weak negative relationship: the correlation coefficient is
closer to zero, negative r value so y tends to decrease as x increases
Graph E (r = 0): no linear relationship between x and y
Sample

#1.3 The Sample Correlation Coefficient, r, Formula


• The Sample Correlation Coefficient, r, Formula

$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} $$

Where
n is the number of ordered pairs (x, y)
x represents the value of the independent variable
y represents the value of the dependent variable
Example: The following table shows sample data on TV ads and car sales for six
randomly selected weeks.

Number of TV Ads Number of Cars Sold


Week x y
1 3 13
2 6 31
3 4 19
4 5 27
5 6 23
6 3 19

Calculate the Sample Correlation Coefficient to Determine the Strength and Direction of
the Relationship between the Independent and Dependent Variables

13
Week   TV Ads (x)   Cars Sold (y)   xy    x²    y²
1      3            13              39    9     169
2      6            31              186   36    961
3      4            19              76    16    361
4      5            27              135   25    729
5      6            23              138   36    529
6      3            19              57    9     361

Σx = 27, Σy = 132, Σxy = 631, Σx² = 131, Σy² = 3110

$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} = \frac{(6)(631) - (27)(132)}{\sqrt{\left[(6)(131) - (27)^2\right]\left[(6)(3110) - (132)^2\right]}} $$

$$ r = \frac{222}{\sqrt{(57)(1236)}} = \frac{222}{265.43} = 0.836 $$

Because r = 0.836 is positive and close to +1, there is a fairly strong positive
relationship between the number of TV ads and cars sold
Copyright ©2013 Pearson Education, Inc. publishing as Prentice Hall
Calculating the Sample Correlation Coefficient
Use the CORREL Function in Excel to calculate the correlation coefficient
= CORREL(array1, array2)
where: array1 = The range of data for the first variable
array2 = The range of data for the second variable
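The same calculation can be checked outside Excel. The following is a minimal Python sketch, assuming the six weeks of TV ads data from the example; the function name `sample_correlation` is illustrative and not part of the course material.

```python
import math

# Data from the TV ads example (weeks 1-6).
ads = [3, 6, 4, 5, 6, 3]          # x: number of TV ads
cars = [13, 31, 19, 27, 23, 19]   # y: number of cars sold


def sample_correlation(x, y):
    """Sample correlation coefficient r via the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = sample_correlation(ads, cars)
print(round(r, 3))  # 0.836
```

This mirrors what =CORREL(array1, array2) returns for the same two columns.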
Population

#1.4 The Significance of The Population Correlation Coefficient, ρ


• The value r represents the correlation coefficient for a random sample
• The population correlation coefficient (ρ) refers to the correlation between
all values of two variables of interest in a population
• A hypothesis test can be used to determine if the population correlation
coefficient, ρ, is significantly different from zero (when ρ = 0 → there is no
relationship between the variables)
• One tail example:
H0: ρ ≤ 0 (there is no relationship between x and y)
H1: ρ > 0 (there is a linear relationship)

16
Example: Consider our previous example of cars sold, test the claim that the
population correlation coefficient is greater than zero (α = 0.05)
• The t-test statistic is the appropriate test statistic for this hypothesis test

$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} $$

where:
r = The sample correlation coefficient
n = The number of ordered pairs

• Using values from the prior example: r = 0.836, n = 6

$$ t = \frac{0.836}{\sqrt{\dfrac{1 - (0.836)^2}{6 - 2}}} = \frac{0.836}{\sqrt{0.0753}} = 3.047 $$

• The critical t-score is from the t-distribution with n – 2 degrees of freedom
• This one tail test requires area α in the upper tail
• For n = 6 (df = n – 2 = 4) and α = 0.05 we get tα = 2.132 from Appendix A, Table 5
(or by using =ABS(T.INV(0.05, 4)) in Excel)
H0: ρ ≤ 0 (there is no relationship between x and y)
H1: ρ > 0 (there is a linear relationship)
P-Value
= T.DIST.RT(x, df) (textbook page 648)
x = test statistic and df = n – 2
= T.DIST.RT(3.047, 4) = 0.019

Because t = 3.047 is greater than tα = 2.132 (equivalently, the p-value of 0.019 is
less than α = 0.05), we reject the null hypothesis and conclude that the population
correlation coefficient is greater than zero

[Figure: t-distribution with the "Do not reject H0" region (area 0.95) to the left of tα = 2.132 and the "Reject H0" region (α = 0.05) to the right; t = 3.047 falls in the rejection region]
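As a rough check of the test above, this Python sketch recomputes the t statistic and applies the decision rule. It assumes the critical value 2.132 read from the t-table in the example (it is not computed here), and the function name is illustrative.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for testing the population correlation, with n - 2 df."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


r, n = 0.836, 6
t = correlation_t_statistic(r, n)

# Assumption: critical value taken from the table (alpha = 0.05, df = 4).
t_critical = 2.132
reject_h0 = t > t_critical  # upper-tail test, H1: rho > 0

print(round(t, 3), reject_h0)  # 3.047 True
```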
Can we use a left-tail test?

Yes. It depends on the sign of r. If r is negative, we use a left-tail test (H1: ρ < 0).

Please refer to the textbook (Page)

19
#2 Simple Linear Regression Analysis
• #2.1 Simple Linear Regression Definition/Formula
• #2.2 The Least Squares Method
• #2.3 The Sample Coefficient of Determination, R2
• #2.4 The Significance of the Population Coefficient of Determination, ρ2
• #2.5 The Significance of the Slope of the Regression Model
• #2.6 Using Regression to Make Prediction (OPTIONAL)

20
Sample

#2.1 Simple Linear Regression Definition/Formula


• The Simple Regression Analysis Technique (least squares method) is used to
describe a straight line (linear equation) that best fits a series of ordered pairs
of independent & dependent variables (x & y)

$$ \hat{y} = b_0 + b_1 x $$

$$ b_1 = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \qquad b_0 = \frac{\sum y}{n} - b_1\frac{\sum x}{n} $$

ŷ = The predicted value of y given a value of x
x = The independent variable
b0 = The y-intercept of the straight line
b1 = The slope of the straight line
Example: From the previous example (# of TV ads & # of cars sold)
Σx = 27, Σy = 132, Σxy = 631, Σx² = 131, Σy² = 3110, n = 6

$$ b_1 = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{(6)(631) - (27)(132)}{(6)(131) - (27)^2} = \frac{222}{57} = 3.89474 $$

$$ b_0 = \frac{\sum y}{n} - b_1\frac{\sum x}{n} = \frac{132}{6} - (3.89474)\frac{27}{6} = 22 - 17.5263 = 4.4737 $$

$$ \hat{y} = b_0 + b_1 x = 4.4737 + 3.8947x $$

If x = 8 → the predicted y = 4.4737 + 3.8947(8) = 35.63
Intercept: b0 = 4.4737
Slope: b1 = Rise / Run = 3.8947. On average, each additional TV ad increases the
number of cars sold by about 3.89

[Figure: regression line of # of cars sold (y) versus TV ads (x)]
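The slope and intercept formulas translate directly into code. This is a small Python sketch under the assumption that the data are the six weeks from the example; `least_squares_line` is an illustrative helper name.

```python
# Data from the TV ads example (weeks 1-6).
ads = [3, 6, 4, 5, 6, 3]          # x: number of TV ads
cars = [13, 31, 19, 27, 23, 19]   # y: number of cars sold


def least_squares_line(x, y):
    """Return (b0, b1) for the line y_hat = b0 + b1 * x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


b0, b1 = least_squares_line(ads, cars)
predicted_at_8 = b0 + b1 * 8  # predicted cars sold for 8 TV ads

print(round(b0, 4), round(b1, 4), round(predicted_at_8, 2))
# 4.4737 3.8947 35.63
```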
Calculating the Slope and y-intercept Using Excel
1. Enter the data for the two variables in the worksheet
2. Go to the Data tab and select Data Analysis, which opens the Data Analysis dialog box
3. Scroll down to Regression and click OK to open the Regression dialog box
4. Click on the first text box, which is labeled Input Y Range
5. Highlight the cells containing the dependent variable values, including
the column label
6. Click on the second text box, which is labeled Input X Range
7. Highlight the cells containing the independent variable values,
including the column label
8. Check the Labels box
9. Click on Output Range, then tell Excel where to put the report, then
click OK

[Figure: Excel regression output, with the correlation coefficient (r), the y-intercept, and the slope value highlighted]
Sample

#2.2 The Least Square Method


• The least squares method is a mathematical procedure used to identify the linear
equation that best fits a set of ordered pairs
o It is used to find the values for b0 (the y-intercept) and b1 (the slope of the line)
o The resulting best fit line is called the regression line
o The regression line does not reflect reality 100%, BUT it is the best fit between
the observed points (series of ordered pairs)
o The goal is to minimize the total squared error between the values of y and ŷ
o The least squares method will minimize the sum of squares error (SSE):

$$ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

Where: n = The number of ordered pairs around the line that best fits the data

The residual for the i-th observation is the deviation between the observed value
of y for xi (yi) and the predicted value of y for xi (ŷi):

$$ e_i = y_i - \hat{y}_i $$

[Figure: regression line ŷ = b0 + b1 x with intercept b0 and slope b1, showing the deviation between an observed value yi and the predicted value ŷi at xi]
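To illustrate what "minimize the SSE" means, the sketch below compares the SSE of the least squares line for the TV ads data with the SSE of a few other lines. The alternative lines are arbitrary choices made only for comparison, and `sse` is an illustrative helper name.

```python
# Data from the TV ads example (weeks 1-6).
ads = [3, 6, 4, 5, 6, 3]
cars = [13, 31, 19, 27, 23, 19]


def sse(b0, b1, x, y):
    """Sum of squared residuals for the line y_hat = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Least squares coefficients for this data set (from the slope/intercept formulas).
b1_fit = 222 / 57
b0_fit = 22 - 4.5 * b1_fit

sse_fit = sse(b0_fit, b1_fit, ads, cars)

# Hypothetical alternative lines, chosen only to show that their SSE is larger.
alternatives = [(b0_fit + 1.0, b1_fit), (b0_fit, b1_fit + 0.5), (0.0, 5.0)]
sse_alternatives = [sse(b0, b1, ads, cars) for b0, b1 in alternatives]

print(round(sse_fit, 4))
print([round(value, 4) for value in sse_alternatives])
```

Any line other than the fitted one gives a larger SSE, which is the defining property of the regression line.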

28
Clarifications!
An independent variable, x, explains the variation in another
variable, which is called the dependent variable, y

How?

We need to calculate
The Sample Coefficient of Determination, R² = SSR / SST
The goal is to calculate R² (using technology or the math: SST, SSR, SSE)
and interpret the result
The Relationship between SST, SSR & SSE
• The total sum of squares (SST) measures the total variation in the dependent variable
• Total variation is made up of two parts:
SST = SSR + SSE
(Total Sum of Squares = Sum of Squares Regression + Sum of Squares Error)

$$ SST = \sum (y - \bar{y})^2 \qquad SSR = \sum (\hat{y} - \bar{y})^2 \qquad SSE = \sum (y - \hat{y})^2 $$

y = A value of the dependent variable from the sample
ȳ = The average value of the dependent variable from the sample
ŷ = The estimated value of y for a given x value

Alternative formulas:

$$ SST = \sum y^2 - \frac{(\sum y)^2}{n} \qquad SSE = \sum y^2 - b_0 \sum y - b_1 \sum xy $$
Example: Consider our previous example of cars sold, calculate the SSR, SSE and
SST and comment on the result.

ŷ = b0 + b1 x = 4.4737 + 3.8947x
Σx = 27, Σy = 132, Σxy = 631, Σx² = 131, Σy² = 3110, n = 6

$$ SSE = \sum y^2 - b_0 \sum y - b_1 \sum xy \qquad SST = \sum y^2 - \frac{(\sum y)^2}{n} $$

SSE = (3110) – (4.4737)(132) – (3.8947)(631) = 61.89474 (using unrounded values of b0 and b1)
SST = (3110) – (132)(132)/6 = 206
SSR = SST – SSE = 206 – 61.89474 = 144.10526
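The partition can also be verified from the definitional formulas rather than the shortcut formulas. This Python sketch assumes the TV ads data and the fitted coefficients from the earlier example; variable names are illustrative.

```python
# Data from the TV ads example (weeks 1-6).
ads = [3, 6, 4, 5, 6, 3]
cars = [13, 31, 19, 27, 23, 19]

# Least squares coefficients for this data set.
b1 = 222 / 57
b0 = 22 - 4.5 * b1

y_bar = sum(cars) / len(cars)          # average of y
y_hat = [b0 + b1 * x for x in ads]     # predicted y for each observed x

sst = sum((y - y_bar) ** 2 for y in cars)               # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained by the line
sse = sum((y - yh) ** 2 for y, yh in zip(cars, y_hat))  # unexplained

print(round(sst, 4), round(ssr, 4), round(sse, 4))
# 206.0 144.1053 61.8947
```

The three values satisfy SST = SSR + SSE up to floating-point rounding.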
Sample

#2.3 The Sample Coefficient of Determination, R2


• The Sample Coefficient of Determination, R², measures the percentage of the total
variation of the dependent variable that is explained by the independent variable
in a sample
• R² = SSR / SST (ranges from 0% to 100%) = 144.10526/206 = 0.69954 or 70%
• SSR is the Sum of Squares Regression and SST is the Total Sum of Squares
• We can conclude that 70% of the total variation in the # of cars sold (y) can be
explained by the # of TV ads (x). The value of R² ranges from 0 to 1. The higher the
value, the stronger the linear relationship between the dependent and independent
variables
• How can the remaining variation (30%) be explained? By other independent variables
not included in the model. Adding them to the regression model may improve R².
Example: ŷ = b0 + b1 x1 + b2 x2 (x1: # of TV ads, x2: size of the car)
• The Sample Coefficient of Determination, R², can also be determined by squaring
the Sample Correlation Coefficient, r (R² = r²)
Partitioning the Sum of Squares Using Excel

[Figure: Excel regression output with the following values highlighted]
Correlation Coefficient, r
Sample Coefficient of Determination, R² = SSR / SST = 144.10526/206 = 0.69954 (R² = r²)
Sum of Squares Regression, SSR
Sum of Squares Error, SSE
Total Sum of Squares, SST (SST = SSR + SSE)
Population

#2.4 The Significance of the Population Coefficient of Determination, ρ2


• The population coefficient of determination, ρ2, is unknown
• The calculated value of R2 represents the coefficient of
determination for a random sample from the population
• Use this hypothesis test to determine if the population coefficient of
determination is significantly different from zero (based on the sample
coefficient of determination):
H0 : ρ² ≤ 0 (there is no relationship between x and y)
H1 : ρ² > 0 (there is a relationship between x and y), or x does explain a significant
portion of the variation in y
Example: Consider our previous example of cars sold, test the claim
The F-test statistic is the appropriate test statistic for this hypothesis test

H0 : ρ² ≤ 0
H1 : ρ² > 0

SSE = 61.89474, SSR = 144.10526, n = 6

$$ F = \frac{SSR}{\left(\dfrac{SSE}{n - 2}\right)} = \frac{144.10526}{\left(\dfrac{61.89474}{6 - 2}\right)} = 9.313 $$

For this example:
D1 = 1 (We have only one independent variable)
D2 = n – 2 = 6 – 2 = 4

The critical F-score for α = 0.05 and degrees of freedom equal to 1 and 4
is Fα = 7.709 (Appendix A Table 6)
= F.INV.RT(α, D1, D2) = F.INV.RT(0.05, 1, 4) = 7.709

Since F = 9.313 > Fα = 7.709 we reject H0 and conclude that the coefficient of
determination is greater than zero

The p-value can be found in Excel:
=F.DIST.RT(9.313, 1, 4) = 0.03795
The p-value is 0.03795, which is less than α = 0.05, so we reject the null
hypothesis that there is no relationship between TV ads and number of cars
sold per week

[Figure: F-distribution with the "Do not reject H0" region (1 – α = 0.95) to the left of Fα = 7.709 and the "Reject H0" region (α = 0.05) to the right]
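A short Python sketch of the F-test decision follows. It assumes the critical value 7.709 taken from the F-table in the example rather than computing it, and `f_statistic` is an illustrative name.

```python
def f_statistic(ssr, sse, n):
    """F statistic for simple regression: D1 = 1 and D2 = n - 2."""
    return ssr / (sse / (n - 2))


ssr, sse, n = 144.10526, 61.89474, 6
f = f_statistic(ssr, sse, n)

# Assumption: critical value from the table (alpha = 0.05, D1 = 1, D2 = 4).
f_critical = 7.709
reject_h0 = f > f_critical

print(round(f, 3), reject_h0)  # 9.313 True
```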
Test The Significance of ρ² Using Excel

H0 : ρ² ≤ 0 (there is no relationship between x and y)
H1 : ρ² > 0 (there is a relationship between x and y), or x does explain a
significant portion of the variation in y

$$ F = \frac{SSR}{\left(\dfrac{SSE}{n - 2}\right)} = \frac{144.10526}{\left(\dfrac{61.89474}{6 - 2}\right)} = 9.313 $$

[Figure: Excel regression output with n, SSR, SSE, the test statistic, and the p-value highlighted]

For this example:
D1 = 1 (We have only one independent variable)
D2 = n – 2 = 6 – 2 = 4

The critical F-score for α = 0.05 and degrees of freedom equal to 1 and 4 is
Fα = 7.709 (Appendix A Table 6)
= F.INV.RT(0.05, D1, D2) = F.INV.RT(0.05, 1, 4) = 7.709

The P-Value = 0.03796 < α = 0.05 → Reject Ho
The Test Statistic = 9.31293 > The Critical Value = 7.709 → Reject Ho
We have enough evidence to support H1
Population

#2.5 The Significance of The Slope of The Regression Model


• The calculated value of the slope, b1, in ŷ = b0 + b1 x is from a random sample
from the full population
• The population regression slope, β1, is unknown
• If the population slope is zero, then x has no effect on y, and we would conclude that
there is no relationship between the dependent and independent variables
• We can perform a hypothesis test to determine if the population regression slope, β1,
is significantly different from zero, based on the sample regression slope, b1
Ho : β1 = 0 (There is no relationship between the independent and dependent
variables)
H1 : β1 ≠ 0 (There is a relationship between x and y)
There will be 3 METHODS

37
Test The Significance of the Slope Using Excel
Ho : β1 = 0 (There is no relationship between the independent and
dependent variables)
H1 : β1 ≠ 0 (There is a relationship between x and y)

The critical t-score, tα/2, can be found in Table 5 in Appendix A, or from
Excel using the TINV function
=TINV(α, df) with df = n – 2
=TINV(0.05, 6 – 2) → tα/2 = 2.776

[Figure: Excel regression output with the standard error of the estimate (se), the slope, the test statistic (3.05171), the p-value (0.03796), and the 95% confidence limits for the slope highlighted]

Method 1 (Traditional) The Test Statistic = 3.052 > The Critical Value = 2.776 → Reject Ho
Method 2 (Confidence Interval) The confidence interval for the population slope (β1), 0.351 to 7.438, does not include zero → Reject Ho
Method 3 (P-Value) The P-Value = 0.03796 < α = 0.05 → Reject Ho

Based on our sample of six weeks, we are 95% confident that the true population
slope is between 0.351 and 7.438 (0.353 and 7.437 when computed by hand from
rounded values)
Method 1 (Traditional)
Car Example: ŷ = b0 + b1 x = 4.4737 + 3.8947x
Σx = 27, Σy = 132, Σxy = 631, Σx² = 131, Σy² = 3110, n = 6, x̄ = 4.5, SSE = 61.8947

The t-test Statistic

$$ t = \frac{b - \beta}{s_b} $$

b = The sample regression slope
β = The population regression slope from the null hypothesis
sb = The standard error of the slope
se = The standard error of the estimate

$$ s_e = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{61.8947}{6 - 2}} = 3.934 $$

$$ s_b = \frac{s_e}{\sqrt{\sum x^2 - n\bar{x}^2}} = \frac{3.934}{\sqrt{131 - (6)(4.5)^2}} = \frac{3.934}{3.082} = 1.276 $$

$$ t = \frac{b - \beta}{s_b} = \frac{3.8947 - 0}{1.276} = 3.05 $$

The Critical Value
Use Table 5 (df = n – 2 and α). For α = 0.05, the critical t-value (with df = n – 2 = 6 – 2 = 4) is tα/2 = ± 2.776

Conclusion
Since t = 3.05 > tα/2 = 2.776, we reject H0 and conclude that the population regression slope is not equal
to zero and that there is a relationship between TV ads and car sales
Method 2 (Confidence Intervals)
Car Example: ŷ = b0 + b1 x = 4.4737 + 3.8947x
Σx = 27, Σy = 132, Σxy = 631, Σx² = 131, Σy² = 3110, n = 6
b1 = 3.8947, sb = 1.276, tα/2 = 2.776

Formula for the Confidence Interval for the Slope of a Regression

$$ CI = b_1 \pm t_{\alpha/2}\, s_b $$

The confidence interval for the regression slope is:
CI = b1 ± tα/2 sb = 3.8947 ± (2.776)(1.276) = 3.8947 ± 3.542
UCL = 3.895 + 3.542 = 7.437
LCL = 3.895 – 3.542 = 0.353

• Based on our sample of six weeks, we are 95% confident that the true population slope (β1) is between
0.353 and 7.437
• We are 95% confident that every additional TV ad will increase the number of cars sold by between 0.353
and 7.437 cars per week
• Since this confidence interval does not include zero, we have evidence to conclude that there is a
relationship between TV ads and car sales
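Methods 1 and 2 can be reproduced with a few lines of Python. This is a sketch under the assumption that the critical value 2.776 comes from the t-table (α = 0.05, df = 4); because the intermediate values are not rounded, the limits may differ from the hand calculation in the third decimal place.

```python
import math

# Summary values from the car example.
n = 6
sum_x = 27
sum_x2 = 131
sse = 61.8947
b1 = 3.8947

# Assumption: two-tail critical value from the t-table (alpha = 0.05, df = 4).
t_critical = 2.776

x_bar = sum_x / n
s_e = math.sqrt(sse / (n - 2))                    # standard error of the estimate
s_b = s_e / math.sqrt(sum_x2 - n * x_bar ** 2)    # standard error of the slope

# Method 1: t statistic for H0: beta1 = 0.
t = (b1 - 0) / s_b
reject_by_t = abs(t) > t_critical

# Method 2: confidence interval for the slope.
lower = b1 - t_critical * s_b
upper = b1 + t_critical * s_b
reject_by_ci = not (lower <= 0 <= upper)

print(round(s_e, 3), round(s_b, 3), round(t, 2))
print(round(lower, 2), round(upper, 2), reject_by_t, reject_by_ci)
```

Both methods lead to the same decision, as expected, since the interval excludes zero exactly when |t| exceeds the critical value.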

40
Method 3 (P-Value)
Car Example: ŷ = b0 + b1 x = 4.4737 + 3.8947x
Σx = 27, Σy = 132, Σxy = 631, Σx² = 131, Σy² = 3110, n = 6
Use technology to calculate the P-Value

The P-Value = 0.03796 < α = 0.05 → Reject Ho
Population

#2.6 Using Regression to Make Prediction (OPTIONAL)


• A point estimate for a given x value is found by inserting the desired xi value in the
regression equation
• We can also construct a confidence interval around the point estimate (Population)
• To construct such a confidence interval requires the standard error of the estimate,
se , which measures the amount of dispersion of the observed data around a
regression line

$$ s_e = \sqrt{\frac{SSE}{n - 2}} $$

• se measures the variation of observed y values from the regression line


42
The Confidence Interval for an Average Value of Y Based on a Value of X

• Formula for the Confidence Interval (CI) for an Average Value of y based on
a value of x

$$ CI = \hat{y} \pm t_{\alpha/2}\, s_e \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum x^2 - \dfrac{(\sum x)^2}{n}}} $$

where:
CI = The confidence interval for an average value of y
ŷ = The predicted y value for the desired value of x
tα/2 = The critical t-statistic from the Student's t-distribution with n – 2 df
se = The standard error of the estimate
n = The number of ordered pairs
x̄ = The average value of x from the sample
• Example: Using the car sales vs. TV ads data with 5 ads per week (x = 5),
ŷ = 23.95, se = 3.934, n = 6, Σx = 27, Σx² = 131

• We also need the average value of x: x̄ = Σx / n = 27 / 6 = 4.5

• tα/2 = 2.776 (df = n – 2 degrees of freedom, α = 0.05) (Appendix A, Table 5)

$$ CI = \hat{y} \pm t_{\alpha/2}\, s_e \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum x^2 - \dfrac{(\sum x)^2}{n}}} = 23.95 \pm (2.776)(3.934)\sqrt{\frac{1}{6} + \frac{(5 - 4.5)^2}{131 - \dfrac{(27)^2}{6}}} $$

$$ = 23.95 \pm (10.921)\sqrt{\frac{1}{6} + \frac{0.25}{9.5}} = 23.95 \pm (10.921)(0.4393) = 23.95 \pm 4.798 $$

UCL = 23.95 + 4.798 = 28.748
LCL = 23.95 – 4.798 = 19.152

We are 95% confident that the average number of cars sold for all weeks in which
5 TV ads are used will be between 19.152 and 28.748
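The interval above can be recomputed with a small helper. This Python sketch uses the rounded inputs from the example (ŷ = 23.95, se = 3.934, tα/2 = 2.776 from the table); the function name `mean_response_interval` is illustrative.

```python
import math


def mean_response_interval(x0, y_hat, s_e, t_critical, n, sum_x, sum_x2):
    """Confidence interval for the average value of y at x = x0."""
    x_bar = sum_x / n
    sxx = sum_x2 - sum_x ** 2 / n
    margin = t_critical * s_e * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    return y_hat - margin, y_hat + margin


# Rounded inputs from the car example; t_critical is taken from the t-table.
ci_lower, ci_upper = mean_response_interval(
    x0=5, y_hat=23.95, s_e=3.934, t_critical=2.776, n=6, sum_x=27, sum_x2=131
)

print(round(ci_lower, 2), round(ci_upper, 2))  # 19.15 28.75
```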
The Prediction Interval for a Specific Value of Y Based on a Value of X
• In the previous example, the confidence interval (CI) is for the average number of cars
sold for all weeks in which 5 TV ads occur
• Formula for the Confidence Interval (CI) for an Average Value of y based on a value of x

$$ CI = \hat{y} \pm t_{\alpha/2}\, s_e \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum x^2 - \dfrac{(\sum x)^2}{n}}} $$

• A prediction interval (PI) is an interval for the specific number of cars sold in a particular week in
which x = 5
• Formula for the Prediction Interval (PI) for a Specific Value of y based on a value of x

$$ PI = \hat{y} \pm t_{\alpha/2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum x^2 - \dfrac{(\sum x)^2}{n}}} $$
• Example: Compute the prediction interval using the car sales vs. TV ads data, for x = 5:

$$ PI = \hat{y} \pm t_{\alpha/2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum x^2 - \dfrac{(\sum x)^2}{n}}} = 23.95 \pm (2.776)(3.934)\sqrt{1 + \frac{1}{6} + \frac{(5 - 4.5)^2}{131 - \dfrac{(27)^2}{6}}} $$

$$ = 23.95 \pm (10.921)\sqrt{1 + \frac{1}{6} + \frac{0.25}{9.5}} = 23.95 \pm (10.921)(1.092) = 23.95 \pm 11.93 $$

UCL = 23.95 + 11.93 = 35.88
LCL = 23.95 – 11.93 = 12.02

We are 95% confident that the number of cars sold in a particular week in which 5
TV ads are used will be between 12.02 and 35.88

The prediction interval estimates a single value, so the variation is greater than
when estimating an average value
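The only change from the confidence interval is the extra "1 +" under the square root. The Python sketch below makes that explicit, again using the rounded inputs from the example and a table value for tα/2; `prediction_interval` is an illustrative name.

```python
import math


def prediction_interval(x0, y_hat, s_e, t_critical, n, sum_x, sum_x2):
    """Prediction interval for a single value of y at x = x0."""
    x_bar = sum_x / n
    sxx = sum_x2 - sum_x ** 2 / n
    # The leading 1 accounts for the variation of an individual observation.
    margin = t_critical * s_e * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
    return y_hat - margin, y_hat + margin


# Rounded inputs from the car example; t_critical is taken from the t-table.
pi_lower, pi_upper = prediction_interval(
    x0=5, y_hat=23.95, s_e=3.934, t_critical=2.776, n=6, sum_x=27, sum_x2=131
)

print(round(pi_lower, 2), round(pi_upper, 2))  # 12.02 35.88
```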

46
PI versus CI (x = 5 TV ads)

Prediction interval (PI): We are 95% confident that the number of cars sold in a
particular week in which 5 TV ads are used will be between 12.02 and 35.88
UCL = 23.95 + 11.93 = 35.88
LCL = 23.95 – 11.93 = 12.02

Confidence interval (CI): We are 95% confident that the average number of cars sold
for all weeks in which 5 TV ads are used will be between 19.152 and 28.748
UCL = 23.95 + 4.798 = 28.748
LCL = 23.95 – 4.798 = 19.152

The prediction interval estimates a single value, so the variation is greater than
when estimating an average value
Using PHStat to Construct Confidence and Prediction Intervals

1. Enter the data for x and y in separate columns in the worksheet
2. Go to Add-Ins > PHStat > Regression > Simple Linear Regression
3. Fill in the Data section of the Simple Linear Regression dialog box,
and fill in the desired Output Options, then click OK

• PHStat output
• These values can be found in the CIEandPI worksheet created by PHStat
• These values match the results calculated on previous slides
(calculated values may differ slightly due to rounding)
Your Turn #1

The Swiss Hiking Federation is an organization that is responsible for promoting the safe use of the
hiking trail system throughout Switzerland. Trails are clearly marked with signposts and
approximate hiking time to assist individuals with their hiking agenda. Suppose the SHF would like
to investigate the linear relationship between the amount of time to hike the Wengen-Kleine
Scheidegg trail in the Jungfrau region and the age of the hiker. A random sample of seven hikers on
this trail was selected and their age and hiking time, in hours, are shown here.

Age   Time (hours)
24    2.7
32    4.2
47    5.0
40    3.8
26    2.2
53    2.9
38    3.0

o Calculate the sample correlation coefficient for this sample
o Determine the regression equation for the Swiss hiking data. Use your result to predict the hiking
time for a 36-year-old hiker on the Wengen-Kleine Scheidegg trail.
o Partition the SST for the Swiss hiking data into the SSR and SSE.
o Calculate the sample coefficient of determination and interpret it
o Test to determine if the population correlation coefficient for the Swiss hiking data is greater than
zero using α = 0.1
o Calculate the coefficient of determination for the Swiss hiking data and test its significance using
α = 0.1.
o Calculate the sample correlation coefficient for this sample

#   Age (x)   Hours (y)   xy      x²      y²
1   24        2.7         64.8    576     7.29
2   32        4.2         134.4   1,024   17.64
3   47        5.0         235.0   2,209   25.00
4   40        3.8         152.0   1,600   14.44
5   26        2.2         57.2    676     4.84
6   53        2.9         153.7   2,809   8.41
7   38        3.0         114.0   1,444   9.00

Σx = 260, Σy = 23.8, Σxy = 911.1, Σx² = 10,338, Σy² = 86.62

$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} = \frac{(7)(911.1) - (260)(23.8)}{\sqrt{\left[(7)(10{,}338) - (260)^2\right]\left[(7)(86.62) - (23.8)^2\right]}} = 0.435 $$

Because r = 0.435 is greater than 0, it appears that as the hiker's age increases, the hiking time on this trail
tends to increase
o Determine the regression equation for the Swiss hiking data. Use your result to
predict the hiking time for a 36-year-old hiker on the Wengen-Kleine Scheidegg trail.

$$ b_1 = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{(7)(911.1) - (260)(23.8)}{(7)(10{,}338) - (260)^2} = 0.0398 $$

$$ b_0 = \frac{\sum y}{n} - b_1\frac{\sum x}{n} = \frac{23.8}{7} - (0.0398)\frac{260}{7} = 1.9217 $$

$$ \hat{y} = b_0 + b_1 x = 1.9217 + 0.0398x = 1.9217 + 0.0398(36) = 3.35 \text{ hours} $$

As the hiker's age increases by one year, his or her hiking time tends to increase by an
average of 0.0398 hours (or 2.4 minutes) on this trail.
o Partition the SST for the Swiss hiking data into the SSR and SSE.

$$ SST = \sum y^2 - \frac{(\sum y)^2}{n} = 86.62 - \frac{(23.8)^2}{7} = 5.70 $$

$$ SSE = \sum y^2 - b_0 \sum y - b_1 \sum xy = (86.62) - (1.9217)(23.8) - (0.0398)(911.1) = 4.622 $$

SSR = SST – SSE = 5.70 – 4.622 = 1.078

o Calculate the sample coefficient of determination and interpret it

R² = SSR / SST = 1.078/5.70 = 0.1891
We can conclude that about 19% of the total variation in hiking time (the dependent
variable) can be explained by the hiker's age (the independent variable).
o Calculate the coefficient of determination for the Swiss hiking data and test its significance using α = 0.1.

H0 : ρ² ≤ 0 (there is no relationship between x and y)
H1 : ρ² > 0 (x does explain a significant portion of the variation in y; there is a relationship between x & y)

F = SSR / (SSE / (n – 2)) = 1.078 / (4.622 / (7 – 2)) = 1.17
D1 = 1, D2 = n – 2 = 7 – 2 = 5, Fα = 4.060

F < Fα → 1.17 < 4.060 → Do not reject Ho.
A hiker's age explains only about 18.9% of the variation in the hiking time on this trail, and with
a sample of seven hikers this is not enough to be statistically significant at α = 0.1.
o Test to determine if the population correlation coefficient for the Swiss hiking data is greater than zero using α = 0.1.
H0: ρ ≤ 0 and H1: ρ > 0

\[
t = \frac{r}{\sqrt{\dfrac{1-r^2}{n-2}}} = \frac{0.435}{0.4027} = 1.08
\]

For a one-tail test with α = 0.1 and 7 – 2 = 5 degrees of freedom, tα = 1.476. Because t = 1.08 is less than tα = 1.476,
we fail to reject the null hypothesis. Based on our sample, we cannot conclude that the population correlation
coefficient between the age of a hiker and the hiking time for this trail is greater than zero. In other words, the
sample does not provide enough evidence that a linear relationship exists between the two.
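
The test statistic for ρ can be checked the same way. This sketch assumes r = 0.435 (the square root of R² = 0.1891) and uses the critical value 1.476 quoted on the slide; the function name is hypothetical.

```python
import math


def correlation_t_statistic(r, n):
    """t = r / sqrt((1 - r^2) / (n - 2)), with n - 2 degrees of freedom."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))


T_CRITICAL = 1.476  # from the slide: one-tail test, alpha = 0.1, df = 5

t_stat = correlation_t_statistic(r=0.435, n=7)
reject_h0 = t_stat > T_CRITICAL

print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```

In simple regression the square of this t statistic equals the F statistic from the R² test (1.08² ≈ 1.17), which is why the two tests measure the same evidence.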

#3 Developing Single/Simple Regression Model (Excel)

• Example: Construct this table to provide the values needed for future calculations

Week   Number of TV Ads (x)   Number of Cars Sold (y)   xy    x²   y²
1      3                      13                        39    9    169
2      6                      31                        186   36   961
3      4                      19                        76    16   361
4      5                      27                        135   25   729
5      6                      23                        138   36   529
6      3                      19                        57    9    361

Σx = 27   Σy = 132   Σxy = 631   Σx² = 131   Σy² = 3110

Copyright ©2013 Pearson Education, Inc. publishing as Prentice Hall

Σx = 27   Σy = 132   Σxy = 631   Σx² = 131   Σy² = 3110   n = 6

\[
r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}
  = \frac{(6)(631) - (27)(132)}{\sqrt{\left[(6)(131) - (27)^2\right]\left[(6)(3110) - (132)^2\right]}}
  = \frac{222}{\sqrt{(57)(1236)}} = \frac{222}{265.43} = 0.836
\]

Because r = 0.836 is positive and close to +1, there is a fairly strong positive relationship
between the number of TV ads and cars sold.

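\[
b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{(6)(631) - (27)(132)}{(6)(131) - (27)^2} = \frac{222}{57} = 3.89474
\]

\[
a = \frac{\sum y}{n} - b\frac{\sum x}{n} = \frac{132}{6} - (3.89474)\frac{27}{6} = 22 - 17.5263 = 4.4737
\]

So, the regression equation is: ŷ = 4.4737 + 3.8947x

\[
SST = \sum y^2 - \frac{\left(\sum y\right)^2}{n} = 3110 - \frac{(132)^2}{6} = 206
\]

\[
SSE = \sum y^2 - a\sum y - b\sum xy = 3110 - 4.4737(132) - 3.8947(631) = 61.894
\]

(The value 61.894 is obtained with unrounded a and b; rounded coefficients give a slightly different result.)

SSR = SST – SSE = 206 – 61.894 = 144.106

R² = SSR / SST = 144.106/206 = 0.6995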
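
As a cross-check on the shortcut formulas, SSE can be computed from its definition as the sum of squared residuals. The sketch below assumes the six weekly observations from the table; all variable names are illustrative.

```python
# Weekly data from the table: number of TV ads (x) and cars sold (y).
ads = [3, 6, 4, 5, 6, 3]
cars = [13, 31, 19, 27, 23, 19]

n = len(ads)
x_bar = sum(ads) / n
y_bar = sum(cars) / n

# Slope and intercept from deviations about the means (equivalent to the sum formulas).
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ads, cars))
sxx = sum((x - x_bar) ** 2 for x in ads)
b = sxy / sxx
a = y_bar - b * x_bar

# Sums of squares from their definitions.
fitted = [a + b * x for x in ads]
sse = sum((y - f) ** 2 for y, f in zip(cars, fitted))
sst = sum((y - y_bar) ** 2 for y in cars)
ssr = sst - sse
r_squared = ssr / sst

print(f"b = {b:.5f}, a = {a:.4f}, SSE = {sse:.4f}, R^2 = {r_squared:.4f}")
```

Working from residuals avoids the rounding sensitivity of the shortcut formula, since no large intermediate products are subtracted.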


Σx = 27   Σy = 132   Σxy = 631   Σx² = 131   Σy² = 3110   n = 6

\[
s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{61.8947}{6-2}} = 3.934
\qquad
\bar{x} = \frac{\sum x}{n} = \frac{27}{6} = 4.5
\]

tα/2 = 2.776 (df = n – 2 degrees of freedom, α = 0.05) (Appendix 5 Table 5)

ŷ = 4.4737 + 3.8947x; x = 5 → ŷ = 23.95

\[
CI = \hat{y} \pm t_{\alpha/2}\, s_e \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}}
\]

\[
= 23.95 \pm (2.776)(3.934)\sqrt{\frac{1}{6} + \frac{(5 - 4.5)^2}{131 - \dfrac{(27)^2}{6}}}
= 23.95 \pm (10.921)\sqrt{\frac{1}{6} + \frac{0.25}{9.5}}
\]

\[
= 23.95 \pm (10.921)(0.4393) = 23.95 \pm 4.798
\]
Σx = 27   Σy = 132   Σxy = 631   Σx² = 131   Σy² = 3110   n = 6

\[
s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{61.8947}{6-2}} = 3.934
\qquad
\bar{x} = \frac{\sum x}{n} = \frac{27}{6} = 4.5
\]

tα/2 = 2.776 (df = n – 2 degrees of freedom, α = 0.05) (Appendix 5 Table 5)

ŷ = 4.4737 + 3.8947x; x = 5 → ŷ = 23.95

\[
PI = \hat{y} \pm t_{\alpha/2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum x^2 - \dfrac{\left(\sum x\right)^2}{n}}}
\]

\[
= 23.95 \pm (2.776)(3.934)\sqrt{1 + \frac{1}{6} + \frac{(5 - 4.5)^2}{131 - \dfrac{(27)^2}{6}}}
= 23.95 \pm (10.921)\sqrt{1 + \frac{1}{6} + \frac{0.25}{9.5}}
\]

\[
= 23.95 \pm (10.921)(1.092) = 23.95 \pm 11.93
\]
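
The confidence interval (for the average y) and the prediction interval (for a specific y) differ only by the extra "1 +" under the square root. The sketch below computes both half-widths at x = 5, assuming the sums, SSE, and critical value tα/2 = 2.776 quoted above; the function name `interval_half_widths` is hypothetical.

```python
import math


def interval_half_widths(x0, n, x_bar, sxx, s_e, t_crit):
    """Return (CI half-width, PI half-width) for a simple regression at x = x0."""
    base = 1 / n + (x0 - x_bar) ** 2 / sxx
    ci_half = t_crit * s_e * math.sqrt(base)
    pi_half = t_crit * s_e * math.sqrt(1 + base)
    return ci_half, pi_half


# Values quoted on the slides for the TV ads example.
n, sum_x, sum_x2, sse = 6, 27, 131, 61.8947
s_e = math.sqrt(sse / (n - 2))   # standard error of the estimate
sxx = sum_x2 - sum_x ** 2 / n    # sum of squared deviations of x about its mean
y_hat = 4.4737 + 3.8947 * 5      # point prediction at x = 5

ci_half, pi_half = interval_half_widths(
    x0=5, n=n, x_bar=sum_x / n, sxx=sxx, s_e=s_e, t_crit=2.776
)

print(f"CI: {y_hat:.2f} +/- {ci_half:.3f}")
print(f"PI: {y_hat:.2f} +/- {pi_half:.2f}")
```

The prediction interval is always wider because it adds the variability of a single observation to the uncertainty in the estimated mean.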
Calculating the Slope and y-intercept Using Excel
1. Enter the data for the two variables in the worksheet
2. Go to the Data tab and select Data Analysis, which opens the Data Analysis
dialog box
3. Scroll down to Regression and click OK to open the Regression dialog box
4. Click on the first text box, which is labeled Input Y Range
5. Highlight the cells containing the dependent variable values, including the column label
6. Click on the second text box, which is labeled Input X Range
7. Highlight the cells containing the independent variable values, including the column label
8. Check the Labels box
9. Click on Output Range, then tell Excel where to put the report, then click OK
[Figure: Excel regression output, with callouts marking the correlation coefficient r, the y-intercept, and the slope value]
Partitioning the Sum of Squares Using Excel

[Figure: Excel regression output, with callouts marking the correlation coefficient r, the sample coefficient of determination R², the sum of squares regression (SSR), the sum of squares error (SSE), and the total sum of squares (SST)]

R² = r²
R² = SSR / SST
SST = SSR + SSE
Test The Significance of the Slope Using Excel
H0: β1 = 0 (There is no relationship between the independent and dependent variables)
H1: β1 ≠ 0 (There is a relationship between x and y)

The critical t-score, tα/2, can be found in Table 5 in Appendix A, or from Excel using the TINV function:
=TINV(α, df) → tα/2 = 2.776

[Figure: Excel regression output; a callout marks the standard error of the estimate, se]

The Test Statistic = 3.05171

The P-Value = 0.03796

Based on our sample of six observations, we are 95% confident that the true population slope is
between 0.351 and 7.438 (the interval does not include 0).

Method 1: The test statistic = 3.051 > the critical value 2.776 → Reject H0
Method 2: The p-value = 0.03796 < α = 0.05 → Reject H0
Method 3: The confidence interval for the population slope (β1), 0.351 to 7.438, does not include zero → Reject H0
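
The Excel test statistic and slope interval can be reproduced by hand. This is a sketch under stated assumptions: it uses the standard simple-regression formula for the standard error of the slope, s_b = s_e / sqrt(Σx² – (Σx)²/n), which does not appear on this slide, together with the values b = 3.89474, s_e = 3.934, and tα/2 = 2.776 from the TV ads example. The function name `slope_test` is hypothetical.

```python
import math


def slope_test(b, s_e, sxx, t_crit):
    """Return the t statistic for H0: beta1 = 0 and the confidence interval for the slope."""
    s_b = s_e / math.sqrt(sxx)   # standard error of the slope (assumed standard formula)
    t_stat = b / s_b
    margin = t_crit * s_b
    return t_stat, (b - margin, b + margin)


# sxx = sum(x^2) - (sum x)^2 / n = 131 - 27^2 / 6 = 9.5 for the TV ads data.
t_stat, (lower, upper) = slope_test(b=3.89474, s_e=3.934, sxx=9.5, t_crit=2.776)

# Methods 1 and 3 from the slide; Method 2 needs the p-value reported by Excel.
reject_by_statistic = abs(t_stat) > 2.776
reject_by_interval = not (lower <= 0 <= upper)

print(f"t = {t_stat:.3f}, 95% CI for slope = ({lower:.3f}, {upper:.3f})")
```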
Using PHStat to Construct Confidence and Prediction Intervals

1. Enter the data for x and y in separate columns in the worksheet
2. Go to Add-Ins > PHStat > Regression > Simple Linear Regression
3. Fill in the Data section of the Simple Linear Regression dialog box, and fill in the desired Output Options, then click OK

• PHStat output: these values can be found in the CIEandPI worksheet created by PHStat
• These values match the results calculated on previous slides (calculated values may differ slightly due to rounding)
Calculating r Using Excel
Use the CORREL Function in Excel to calculate the correlation coefficient
= CORREL(array1, array2)
where: array1 = The range of data for the first variable
array2 = The range of data for the second variable
#4 Developing Multiple Regressions Model (OPTIONAL)
• #4.1 Multiple Regression Model Introduction
• #4.2 Identifying Regression Coefficients
• #4.3 Using the Regression to Make Prediction


#4.1 Multiple Regression Model Introduction


• This session extends the simple regression model discussed in the previous session
• Now consider more than one independent variable as a means to explain the
variation in a dependent variable of interest
• Previous session: ŷ = a + bx
• This session: ŷ = a + b1x1 + b2x2 + ⋯ + bkxk
ŷ = The predicted value of y (dependent variable) given values of x1, x2, … , xk
x1, x2, … , xk = The independent variables of interest
k = The number of independent variables in our regression model
a = The y-intercept of the regression line
b1, b2, … , bk = The regression coefficients
b1 = The average change in y due to a one-unit change in x1 with x2, … , xk held constant
b2 = The average change in y due to a one-unit change in x2 with x1, x3, … , xk held constant
bk = The average change in y due to a one-unit change in xk with x1, x2, … , xk-1 held constant


#4.2 Identifying Regression Coefficients


• A regression coefficient predicts the change in a dependent variable due to a
one-unit increase in an independent variable while other variables are held
constant
• 𝑦ො = 𝑎 + 𝑏1 𝑥1 + 𝑏2 𝑥2 + ⋯ + 𝑏𝑘 𝑥𝑘
• The slope coefficients in a multiple regression model with k independent
variables are
b1, b2, … , bk
• PHStat2 will be used in this chapter to avoid manual calculations

Identifying Regression Coefficients Using Excel
• Example: A dairy cooperative wants to examine household milk consumption (y, in quarts per week) based on annual household income (x1, measured in $1000s per year) and the size of the family (x2)
• Data for 15 randomly selected families is obtained (Sample)
• Identify/Interpret the regression coefficients
• What is the predicted milk consumption for a family with annual income of $50,000 and a size of 4 people?

Family   Milk Consumption (Quarts)   Income ($1000s)   Family Size
1        21                          46                5
2        10                          55                2
3        16                          37                3
4        38                          60                5
5        9                           35                1
6        26                          55                3
7        22                          41                3
8        28                          50                4
9        25                          52                3
10       18                          49                2
11       12                          34                3
12       20                          39                3
13       28                          44                4
14       30                          56                4
15       35                          49                6
• Example: (continued)
1. Enter the data in Excel
2. Go to Add-Ins > PHStat > Regression > Multiple Regression. The Multiple Regression dialog box will appear
3. Click in the Y Variable Cell Range box
4. Highlight the cell range of the y (dependent) variable, including the column label
5. Click in the X Variables Cell Range box
6. Highlight the data for both of the independent variables, including the column labels
7. Check the First cells in both ranges contain label box
8. Check the boxes in the Regression Tool Output Options and Output Options sections as shown
9. In the Confidence level for interval estimates text box, enter 95, then click OK
• The regression equation is: ŷ = -11.1378 + 0.3913 x1 + 4.5165 x2
x1 = Annual family income (in $1000s) & x2 = Family size (# of people)
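
For readers without PHStat, the same coefficients can be recovered by solving the normal equations (X'X)b = X'y. The sketch below is not the PHStat procedure; it is a pure-Python illustration that assumes the 15 families from the data table, and the helper names `solve_linear_system` and `fit_multiple_regression` are hypothetical.

```python
# (milk consumption in quarts, income in $1000s, family size) for the 15 families.
families = [
    (21, 46, 5), (10, 55, 2), (16, 37, 3), (38, 60, 5), (9, 35, 1),
    (26, 55, 3), (22, 41, 3), (28, 50, 4), (25, 52, 3), (18, 49, 2),
    (12, 34, 3), (20, 39, 3), (28, 44, 4), (30, 56, 4), (35, 49, 6),
]


def solve_linear_system(matrix, rhs):
    """Solve matrix * beta = rhs by Gauss-Jordan elimination with partial pivoting."""
    size = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(size):
        # Swap in the row with the largest pivot to keep the elimination stable.
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(size):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


def fit_multiple_regression(rows):
    """Return [a, b1, b2] for y-hat = a + b1*x1 + b2*x2 via the normal equations."""
    design = [[1.0, x1, x2] for _, x1, x2 in rows]  # leading 1 is the intercept column
    y = [row[0] for row in rows]
    k = 3
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


a, b1, b2 = fit_multiple_regression(families)
print(f"y-hat = {a:.4f} + {b1:.4f} x1 + {b2:.4f} x2")
```

Solving the normal equations directly is fine for a small, well-behaved data set like this one; statistical software typically uses more numerically robust decompositions.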

Interpreting the regression coefficients:


• The income coefficient (b1 = 0.3913) tells us that an additional $1,000 of annual income,
holding family size constant, will increase a family’s milk consumption by an average of
0.3913 quarts per week
• The family size coefficient (b2 = 4.5165) tells us that for each additional family member,
holding family income constant, a family’s milk consumption will increase by an average of
4.5165 quarts per week
• It is tempting to interpret the y-intercept (𝑎 = -11.1378) as the milk consumption for a family
with zero income and zero family members, but a negative number does not make sense
• We do not have any observations in our sample with these values of x1 and x2, so this would
not be a reliable estimate
• Using the multiple regression equation to predict the dependent variable using values of
independent variables outside of the range of the original sample can lead to unreliable
results
• The predicted milk consumption for a family with annual income of $50,000
and a size of 4 people (x1 = 50, x2 = 4) for the sample:
𝑦ො = -11.1378 + 0.3913(50) + 4.5165(4)
= -11.1378 + 19.565 + 18.066

= 26.493 (quarts per week)
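
The prediction, together with the caution about extrapolating outside the sample, can be sketched as follows. The coefficient defaults are the rounded values from the regression equation; the range limits (income 34 to 60, family size 1 to 6) are read from the 15-family data table, and both function names are illustrative.

```python
def predict_milk(income_thousands, family_size, a=-11.1378, b1=0.3913, b2=4.5165):
    """Predicted weekly milk consumption (quarts) from the fitted equation."""
    return a + b1 * income_thousands + b2 * family_size


def within_sample_range(income_thousands, family_size):
    """True when both inputs fall inside the ranges observed in the sample."""
    return 34 <= income_thousands <= 60 and 1 <= family_size <= 6


prediction = predict_milk(50, 4)
print(f"Predicted consumption: {prediction:.3f} quarts per week")

# A family with zero income and zero members is far outside the sample,
# which is why the intercept of -11.1378 is not a reliable estimate.
print(within_sample_range(50, 4), within_sample_range(0, 0))
```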


#4.3 Using The Regression to Make Prediction (OPTIONAL)


• The predicted milk consumption for a family with annual income of $50,000
and a size of 4 people (x1 = 50, x2 = 4) for the sample:
𝑦ො = -11.1378 + 0.3913(50) + 4.5165(4)
= -11.1378 + 19.565 + 18.066

= 26.493 (quarts per week)

• Can we find a confidence interval (CI) and a prediction interval (PI) around ŷ?

Estimate the value of ŷ in the population
Identifying Regression Coefficients Using Excel

• We can construct a confidence interval around this point estimate to get a sense of the range of milk consumption for families with similar income and size
• Find this interval using PHStat on the “CIEandPI” worksheet tab (created for you when the “Confidence Interval Estimate and Prediction Interval” option box is selected)

[Figure: PHStat CIEandPI worksheet, with callouts marking where to enter the desired values for x1 and x2 and where the predicted value of ŷ appears]

Confidence Interval (CI) for an Average Value of y: the 95% confidence interval for the average milk consumption for families with $50,000 income and 4 people
Prediction Interval (PI) for a Specific Value of y: the 95% prediction interval for milk consumption for a specific family with $50,000 income and 4 people
Your Turn #2

The Excel file MLB 2010 Wins 1.xlsx lists the number of games each Major League
Baseball team won during the 2010 season. The file also provides the average
number of runs scored per game (RPG) and the average number of runs given up per
game (ERA) for each team.
o A. Develop a regression equation to predict the number of games won by a Major
League Baseball team based on RPG and ERA.
o B. Interpret the meaning of the regression coefficients.
o C. Predict the number of games won for a team that scores an average of 4 runs per
game and gives up an average of 3.5 runs per game.
o D. Construct a 95% confidence interval to estimate the average number of games
won by teams described in part C.
o E. Construct a 95% prediction interval to estimate the number of games won by a
specific team described in part C.
A. Regression Analysis

Regression Statistics
Multiple R           0.9648
R Square             0.9309
Adjusted R Square    0.9258
Standard Error       2.9979
Observations         30

ANOVA
             df   SS          MS          F          Significance F
Regression   2    3269.3426   1634.6713   181.8866   0.0000
Residual     27   242.6574    8.9873
Total        29   3512.0000

            Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept   70.7623        8.2166           8.6122     0.0000    53.9033     87.6212
RPG         16.2015        1.2006           13.4945    0.0000    13.7381     18.6649
ERA         -14.9250       1.3461           -11.0875   0.0000    -17.6870    -12.1630

Y (hat) = 70.7623 + 16.2015 X1 - 14.9250 X2 (X1 = RPG and X2 = ERA)

B. Holding ERA constant, increasing the average runs scored per game by 1 is associated with a team
winning an average of 16.2 more games per season.
Holding RPG constant, increasing the average runs given up per game by 1 is associated with a team
losing an average of 14.9 additional games per season.
C. Y (hat) = 70.7623 + 16.2015 (4) - 14.9250 (3.5) = 83.3 wins

Confidence Interval Estimate and Prediction Interval

Data
Confidence Level     95%
                     1
RPG given value      4
ERA given value      3.5

X'X                  30         131.52     122.19
                     131.52     582.9904   534.7455
                     122.19     534.7455   502.7763

Inverse of X'X       7.511891   -0.82303   -0.95026
                     -0.82303   0.160385   0.029439
                     -0.95026   0.029439   0.201619

X'G times Inverse of X'X                   0.893874   -0.07846   -0.12683
[X'G times Inverse of X'X] times XG        0.136135
t Statistic                                2.051831
Predicted Y (YHat)                         83.33066

For Average Predicted Y (YHat)
Interval Half Width                        2.269562
Confidence Interval Lower Limit            81.06109
Confidence Interval Upper Limit            85.60022

For Individual Response Y
Interval Half Width                        6.556491
Prediction Interval Lower Limit            76.77417
Prediction Interval Upper Limit            89.88715

C. Predicted number of wins: 83.3
D. 95% confidence interval: 81.1 – 85.6 wins
E. 95% prediction interval: 76.8 – 89.9 wins
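
The half-widths in the PHStat output follow from the quantity labelled "[X'G times Inverse of X'X] times XG". The sketch below recomputes it from the printed inverse matrix, the t statistic, and the regression standard error (2.9979). Because the printed matrix entries are rounded, results match the PHStat values only approximately; the function name `quadratic_form` is hypothetical.

```python
import math

# Inverse of X'X as printed in the PHStat output (rounded values).
XTX_INVERSE = [
    [7.511891, -0.82303, -0.95026],
    [-0.82303, 0.160385, 0.029439],
    [-0.95026, 0.029439, 0.201619],
]


def quadratic_form(x_given, xtx_inverse):
    """Compute x' (X'X)^-1 x for a vector that starts with 1 for the intercept."""
    size = len(x_given)
    partial = [
        sum(x_given[i] * xtx_inverse[i][j] for i in range(size)) for j in range(size)
    ]
    return sum(p * x for p, x in zip(partial, x_given))


x_given = [1, 4, 3.5]          # intercept term, RPG = 4, ERA = 3.5
h = quadratic_form(x_given, XTX_INVERSE)

t_crit = 2.051831              # t statistic from the output (27 degrees of freedom)
std_error = 2.9979             # standard error from the regression statistics
y_hat = 83.33066               # predicted wins from the output

ci_half = t_crit * std_error * math.sqrt(h)        # for the average number of wins
pi_half = t_crit * std_error * math.sqrt(1 + h)    # for one specific team

print(f"CI: {y_hat - ci_half:.1f} to {y_hat + ci_half:.1f} wins")
print(f"PI: {y_hat - pi_half:.1f} to {y_hat + pi_half:.1f} wins")
```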
Summary
• #1 Correlation Analysis
o Scatterplot
o Sample correlation coefficient, r
o Sample Coefficient of Determination, R²
• #2 Regression Analysis
o Single regression model
o Multiple regression model
o Prediction
• #3 Testing the significance of ρ
• #4 Testing the significance of ρ²
• #5 Testing the significance of the slope
References

1. Donnelly, R. (2019). Business Statistics (3rd ed.). Pearson Education.
2. Sharpe, N., De Veaux, R., Velleman, P., and Wright, D. (2017). Business Statistics (3rd Cdn ed.). Pearson.
