Statistics For Business Analysis: Learning Objectives
Learning Objectives
How to use regression analysis to predict the value of a dependent variable based on an independent variable
The meaning of the regression coefficients b0 and b1
How to evaluate the assumptions of regression analysis and know what to do if the assumptions are violated
To make inferences about the slope and correlation coefficient
To estimate mean values and predict individual values
Dependent variable: the variable we wish to predict or explain
Independent variable: the variable used to explain the dependent variable
Types of Relationships
(figures: linear relationships; curvilinear relationships)

Types of Relationships (continued)
(figures: strong relationships; weak relationships)

Types of Relationships (continued)
(figure: no relationship)
Simple Linear Regression Model

Population regression model: Yi = β0 + β1Xi + εi
where β0 = population intercept and β1 = population slope (the linear component), and εi = the random error component for observation i

(figure: the population regression line with slope β1, showing the random error εi for the value Xi as the vertical distance from the line)
Estimated regression equation (prediction line): Ŷi = b0 + b1Xi, where b0 = estimated intercept and b1 = estimated slope
Graphical Presentation
House price model: scatter plot
(scatter plot: house price, $1000s, versus square feet)
Excel Output
Regression Statistics
  Multiple R          0.76211
  R Square            0.58082
  Adjusted R Square   0.52842
  Standard Error     41.33032
  Observations       10
ANOVA
               df   SS          MS          F        Significance F
  Regression    1   18934.9348  18934.9348  11.0848  0.01039
  Residual      8   13665.5652   1708.1957
  Total         9   32600.5000
Graphical Presentation
House price model: scatter plot and regression line
(scatter plot of house price, $1000s, versus square feet, with the fitted regression line)
Slope = 0.10977
Intercept = 98.248
b1 measures the estimated change in the average value of Y as a result of a one-unit change in X.
Here, b1 = .10977 tells us that the average value of a house increases by .10977($1000) = $109.77 for each additional square foot of size.
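As a sketch, the intercept and slope above can be computed directly with the least-squares formulas. The 10 observations below are an assumed reconstruction of the textbook's house-price sample (they reproduce the slide's Excel output), not data given in these slides:

```python
# Least-squares estimates of b0 (intercept) and b1 (slope), computed by hand.
# Assumed 10-house sample: size in square feet, price in $1000s.
sqft  = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(sqft)
mean_x = sum(sqft) / n
mean_y = sum(price) / n

# b1 = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, price))
sxx = sum((x - mean_x) ** 2 for x in sqft)
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x  # the line passes through (Xbar, Ybar)

print(round(b1, 5), round(b0, 3))  # 0.10977 98.248
```

The printed values match the slide's fitted line, house price = 98.248 + 0.10977 (square feet).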
Measures of Variation
Total variation is made up of two parts:
SST = SSR + SSE

  Total Sum of Squares:       SST = Σ(Yi − Ȳ)²
  Regression Sum of Squares:  SSR = Σ(Ŷi − Ȳ)²
  Error Sum of Squares:       SSE = Σ(Yi − Ŷi)²

where:
  Ȳ  = mean value of the dependent variable
  Yi = observed value of the dependent variable
  Ŷi = predicted value of Y for the given Xi value
Measures of Variation
(continued)
SST = total sum of squares: measures the variation of the Yi values around their mean Ȳ
SSR = regression sum of squares: explained variation attributable to the relationship between X and Y
SSE = error sum of squares: variation attributable to factors other than the relationship between X and Y
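The decomposition SST = SSR + SSE can be verified numerically. The 10-house data below are an assumed reconstruction of the slides' example (they reproduce the ANOVA table values):

```python
# Verify SST = SSR + SSE for the house-price regression.
sqft  = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(sqft)
mx, my = sum(sqft) / n, sum(price) / n
b1 = (sum((x - mx) * (y - my) for x, y in zip(sqft, price))
      / sum((x - mx) ** 2 for x in sqft))
b0 = my - b1 * mx
yhat = [b0 + b1 * x for x in sqft]  # predicted values

sst = sum((y - my) ** 2 for y in price)                 # total variation
ssr = sum((yh - my) ** 2 for yh in yhat)                # explained variation
sse = sum((y - yh) ** 2 for y, yh in zip(price, yhat))  # unexplained variation
print(round(sst, 4), round(ssr, 4), round(sse, 4))  # 32600.5 18934.9348 13665.5652
```

These are exactly the SS column of the slide's ANOVA table (Total, Regression, Residual).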
Measures of Variation
(continued)
(figure: for a point (Xi, Yi), the total deviation Yi − Ȳ splits into the explained part Ŷi − Ȳ and the unexplained part Yi − Ŷi)
Coefficient of Determination, r2
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable. It is also called r-squared and is denoted r².

r² = SSR / SST = regression sum of squares / total sum of squares

note: 0 ≤ r² ≤ 1
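A quick sketch of the r² calculation, using the SSR and SST values from the slides' ANOVA table:

```python
# r^2 as the explained share of total variation.
ssr, sst = 18934.9348, 32600.5000  # from the slides' ANOVA table
r2 = ssr / sst
print(round(r2, 5))  # 0.58082

# For simple regression, r^2 is also the square of the correlation
# coefficient r (Multiple R = 0.76211 in the output).
```

So about 58% of the variation in house prices is explained by variation in square footage.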
(figures: r² = 1 when there is a perfect linear relationship and all variation in Y is explained by X; r² = 0 when there is no linear relationship between X and Y)
Excel Output
From the Excel output: r² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082
58.08% of the variation in house prices is explained by variation in square feet
Standard Error of Estimate

S_YX = √( SSE / (n − 2) ) = √( Σᵢ₌₁ⁿ (Yi − Ŷi)² / (n − 2) )

where SSE = error sum of squares and n = sample size
Excel Output
From the Excel output: S_YX = 41.33032 (the Standard Error in the Regression Statistics table)
(figures: data tightly clustered around the regression line give a small S_YX; widely scattered data give a large S_YX)
The magnitude of S_YX should always be judged relative to the size of the Y values in the sample data, i.e., S_YX = $41.33K is moderately small relative to house prices in the $200K to $300K range
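A sketch of the S_YX calculation from the formula above, using the SSE and n values reported in the slides:

```python
import math

# Standard error of the estimate: typical distance of Y values from the line.
sse, n = 13665.5652, 10  # from the slides' ANOVA table
s_yx = math.sqrt(sse / (n - 2))
print(round(s_yx, 5))  # 41.33032
```

This matches the Standard Error shown in the Excel Regression Statistics table.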
Assumptions of Regression
Use the acronym LINE:
Linearity: the underlying relationship between X and Y is linear
Independence of Errors: error values are statistically independent
Normality of Error: error values (ε) are normally distributed for any given value of X
Equal Variance: the probability distribution of the errors has constant variance for all values of X
Residual Analysis
The residual for observation i, ei = Yi − Ŷi, is the difference between its observed and predicted value. Check the assumptions of regression by examining the residuals:
  Examine for linearity assumption
  Evaluate independence assumption
  Evaluate normal distribution assumption
  Examine for constant variance for all levels of X (homoscedasticity)
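As a sketch, the residuals can be computed and inspected directly. The data below are an assumed reconstruction of the slides' 10-house example:

```python
# Compute residuals e_i = Y_i - Yhat_i and inspect them for patterns.
sqft  = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(sqft)
mx, my = sum(sqft) / n, sum(price) / n
b1 = (sum((x - mx) * (y - my) for x, y in zip(sqft, price))
      / sum((x - mx) ** 2 for x in sqft))
b0 = my - b1 * mx

residuals = [y - (b0 + b1 * x) for x, y in zip(sqft, price)]
for x, e in zip(sqft, residuals):
    print(x, round(e, 2))

# Residuals should scatter randomly around 0 with roughly constant spread;
# curvature, trends over time, or fanning suggest an assumption is violated.
# Least-squares residuals always sum to ~0 (up to floating-point error).
```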
Residual Analysis for Linearity
(residual plots versus X: a curved pattern in the residuals indicates the relationship is not linear; a random scatter supports linearity)
Residual Analysis for Normality
(histogram of residuals: a roughly bell-shaped histogram centered at 0 supports the normality assumption)
Residual Analysis for Equal Variance
(residual plots versus X: a fanning spread indicates non-constant variance; an even spread indicates constant variance)
Autocorrelation
Autocorrelation is correlation of the errors (residuals) over time
(Time (t) residual plot: here, the residuals show a cyclical pattern over time, not a random one; cyclical patterns are a sign of positive autocorrelation)
Violates the regression assumption that residuals are random and independent
The Durbin-Watson Statistic

D = Σ_{i=2..n} (e_i − e_{i−1})² / Σ_{i=1..n} e_i²

The possible range is 0 ≤ D ≤ 4
D should be close to 2 if H0 (no autocorrelation) is true
D less than 2 may signal positive autocorrelation; D greater than 2 may signal negative autocorrelation
Compare D with the lower and upper critical values dL and dU from the Durbin-Watson table
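The Durbin-Watson formula is straightforward to compute. The residual series below is purely illustrative (made up to show a cyclical pattern), not the slides' sales data:

```python
# Durbin-Watson statistic for a (made-up) cyclically patterned residual series.
resid = [3.0, 2.1, 0.4, -1.8, -2.9, -1.2, 0.8, 2.5, 1.9, -0.6, -2.2, -1.0, 1.4]

# D = sum of squared successive differences / sum of squared residuals
num = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, len(resid)))
den = sum(e ** 2 for e in resid)
D = num / den
print(round(D, 3))  # always between 0 and 4; near 2 means no autocorrelation
```

Because these residuals drift smoothly (each one close to its predecessor), the successive differences are small relative to the residuals themselves and D comes out well below 2, signalling positive autocorrelation.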
(continued)
Example: Is there autocorrelation?
(time-series plot of residuals over time periods 0 to 30)
Calculating the Durbin-Watson statistic for these residuals: D = Σ_{i=2..n} (e_i − e_{i−1})² / Σ_{i=1..n} e_i² = 1.00494
(continued)
Here, n = 25 and there is k = 1 independent variable
Using the Durbin-Watson table, dL = 1.29 and dU = 1.45
D = 1.00494 < dL = 1.29, so reject H0 and conclude that significant positive autocorrelation exists
Therefore the linear model is not the appropriate model to forecast sales
(decision scale: reject H0 if D < dL = 1.29; inconclusive if 1.29 ≤ D ≤ 1.45; do not reject H0 if D > dU = 1.45)
Inferences About the Slope

The standard error of the regression slope coefficient (b1) is estimated by

S_b1 = S_YX / √SSX = S_YX / √( Σ(Xi − X̄)² )

where S_b1 = estimate of the standard error of the slope, and S_YX = √( SSE / (n − 2) ) = standard error of the estimate
Excel Output
From the Excel output: S_b1 = 0.03297

                Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
  Intercept       98.24833       58.03348       1.69296   0.12892   -35.57720   232.07386
  Square Feet      0.10977        0.03297       3.32938   0.01039     0.03374     0.18580
(figures: illustrations of a small S_b1 versus a large S_b1)
t Test for the Slope
H0: β1 = 0 (no linear relationship)   H1: β1 ≠ 0 (linear relationship does exist)

Test statistic: t = (b1 − β1) / S_b1, with d.f. = n − 2

The slope of this model is 0.1098. Does square footage of the house affect its sales price?
From the Excel output: b1 = 0.10977 and S_b1 = 0.03297, so t = (b1 − 0) / S_b1 = 0.10977 / 0.03297 = 3.32938
(two-tail test at α = .05 with d.f. = 8: reject H0 if t < −2.3060 or t > 2.3060; here t = 3.329 falls in the rejection region)
Decision: Reject H0
Conclusion: There is sufficient evidence that square footage affects house price
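The test statistic above can be sketched in a few lines, using the slide's Excel values:

```python
# t statistic for H0: beta1 = 0, from the slide's coefficient table.
b1, s_b1 = 0.10977, 0.03297
t = (b1 - 0) / s_b1  # hypothesized slope under H0 is 0
print(round(t, 4))   # compare with the two-tail critical value 2.3060 (d.f. = 8)
```

Since t exceeds 2.3060, H0 is rejected at the .05 level, matching the slide's conclusion.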
P-Value Approach
H0: β1 = 0   H1: β1 ≠ 0
From the Excel output, the p-value for the Square Feet slope is 0.01039
This is a two-tail test, so the p-value is P(t > 3.329) + P(t < −3.329) = 0.01039 (for 8 d.f.)
Decision: p-value < α, so reject H0
Conclusion: There is sufficient evidence that square footage affects house price
F Test for Significance

F = MSR / MSE

where MSR = SSR / k and MSE = SSE / (n − k − 1)

F follows an F distribution with k numerator and (n − k − 1) denominator degrees of freedom (k = the number of independent variables in the regression model)
Excel Output
From the ANOVA table: MSR = 18934.9348 and MSE = 1708.1957, so F = 18934.9348 / 1708.1957 = 11.0848, with Significance F = 0.01039
H0: β1 = 0   H1: β1 ≠ 0   (α = .05, df1 = 1, df2 = 8)
Test Statistic: F = MSR / MSE = 11.08
Critical value: F.05 = 5.32
Decision: Reject H0 at α = .05, since F = 11.08 > 5.32
Conclusion: There is sufficient evidence that square footage affects house price
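A sketch of the F calculation from the ANOVA values reported in the slides:

```python
# F statistic for overall significance (k = 1 independent variable, n = 10).
ssr, sse, n, k = 18934.9348, 13665.5652, 10, 1
msr = ssr / k            # mean square regression
mse = sse / (n - k - 1)  # mean square error
F = msr / mse
print(round(F, 4))  # 11.0848; compare with the critical value F.05 = 5.32

# In simple regression the F and t tests agree: F equals t squared
# (11.0848 is about 3.32938 squared).
```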
Confidence Interval Estimate for the Slope

b1 ± t_{n−2} S_b1   (d.f. = n − 2)

Excel printout for house prices: b1 = 0.10977 with standard error S_b1 = 0.03297, giving Lower 95% = 0.03374 and Upper 95% = 0.18580
At the 95% level of confidence, the confidence interval for the slope is (0.0337, 0.1858)
(continued)
Since the units of the house price variable are $1000s, we are 95% confident that the average impact on sales price is between $33.70 and $185.80 per square foot of house size
This 95% confidence interval does not include 0
Conclusion: There is a significant relationship between house price and square feet at the .05 level of significance
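The interval above can be sketched directly from the formula b1 ± t_{n−2} S_b1, using the slide's values:

```python
# 95% confidence interval for the slope.
b1, s_b1 = 0.10977, 0.03297
t_crit = 2.3060  # two-tail t critical value with 8 d.f.

lo, hi = b1 - t_crit * s_b1, b1 + t_crit * s_b1
print(round(lo, 4), round(hi, 4))  # 0.0337 0.1858
```

Because the interval excludes 0, the slope is significant at the .05 level.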
t Test for a Correlation Coefficient
H0: ρ = 0 (no correlation)   H1: ρ ≠ 0 (correlation exists)

t = (r − ρ) / √( (1 − r²) / (n − 2) )   (with n − 2 degrees of freedom)

Example: α = .05, d.f. = 10 − 2 = 8
t = (.762 − 0) / √( (1 − .762²) / (10 − 2) ) = 3.329
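A sketch of the correlation t test, using the unrounded Multiple R from the Excel output (r = 0.76211):

```python
import math

# t test for the correlation coefficient (H0: rho = 0).
r, n = 0.76211, 10
t = (r - 0) / math.sqrt((1 - r ** 2) / (n - 2))
print(round(t, 3))  # 3.329

# Note: in simple regression this equals the t statistic for the slope.
```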
(two-tail test: reject H0 if t < −2.3060 or t > 2.3060; here t = 3.329 > 2.3060, so reject H0: there is evidence of a linear association at the .05 level of significance)
Estimation of Mean Values and Prediction of Individual Values
(figure: the point estimate of Y at a given Xi is Ŷ = b0 + b1Xi, on the regression line)
(continued)
(figure: at a given Xi, the confidence interval estimate for the mean of Y, Y|X=Xi, is narrower than the prediction interval estimate for an individual Y, Y_X=Xi)
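As a sketch, both intervals can be computed with the standard simple-regression formulas (the interval formulas and the choice of Xi = 2000 sq. ft. are assumptions here, since those slides did not survive extraction; the data are an assumed reconstruction of the 10-house example):

```python
import math

# Confidence interval for the mean of Y, and prediction interval for an
# individual Y, at X = 2000 square feet.
sqft  = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(sqft)
mx, my = sum(sqft) / n, sum(price) / n
sxx = sum((x - mx) ** 2 for x in sqft)
b1 = sum((x - mx) * (y - my) for x, y in zip(sqft, price)) / sxx
b0 = my - b1 * mx
s_yx = math.sqrt(sum((y - (b0 + b1 * x)) ** 2
                     for x, y in zip(sqft, price)) / (n - 2))

xi = 2000
yhat = b0 + b1 * xi               # point estimate, in $1000s
t_crit = 2.3060                   # two-tail t critical value, 8 d.f.
h = 1 / n + (xi - mx) ** 2 / sxx  # grows as xi moves away from the mean

ci = t_crit * s_yx * math.sqrt(h)      # half-width of CI for the mean of Y
pi = t_crit * s_yx * math.sqrt(1 + h)  # half-width of PI for an individual Y
print(round(yhat, 2), round(ci, 2), round(pi, 2))
```

The prediction interval is always wider than the confidence interval: an individual house varies around its mean in addition to the uncertainty in estimating that mean, and both intervals widen as Xi moves away from X̄.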
(continued)
If there is violation of any assumption, use alternative methods or models
If there is no evidence of assumption violation, then test for the significance of the regression coefficients and construct confidence intervals and prediction intervals
Avoid making predictions or forecasts outside the relevant range