Topic 2 - Simple Regression
Sutikno
Department of Statistics
Faculty of Mathematics, Computing and Data Science
Institut Teknologi Sepuluh Nopember Surabaya
085230203017
Source: Basic Business Statistics 12th Edition: Pearson Education, Inc. publishing as Prentice Hall
Correlation vs. Regression
[Figure: scatter plots of Y against X]
Department of Statistics, ITS Surabaya Slide-5
Types of Relationships
Strong relationships and weak relationships
[Figure: paired scatter plots of Y against X contrasting strong relationships with weak relationships]
No relationship
[Figure: scatter plot against X showing no relationship]
Example: Anscombe data
Correlation Coefficient
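$$r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\left(n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right)^{1/2}\left(n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right)^{1/2}}$$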
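The correlation coefficient formula can be checked numerically. The sketch below is a minimal pure-Python version of the computational formula; the function name `pearson_r` and the five data points are hypothetical and are not taken from the slides.

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient using the computational formula."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical toy data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(f"r = {r:.4f}")
```

A perfectly linear increasing series, such as y = 2x + 1, should give r equal to 1 up to rounding.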
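The population regression model:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
where
$Y_i$ = dependent variable
$\beta_0$ = population Y intercept
$\beta_1$ = population slope coefficient
$X_i$ = independent variable
$\varepsilon_i$ = random error term
$\beta_0 + \beta_1 X_i$ is the linear component and $\varepsilon_i$ is the random error component.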
(continued)
[Figure: the line $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ plotted as Y against X, with intercept $\beta_0$ and slope $\beta_1$; at a given $X_i$, the random error $\varepsilon_i$ for this $X_i$ value is the gap between the observed value of Y and the predicted value of Y]
Simple Linear Regression Equation (Prediction Line)
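$$\hat{Y}_i = b_0 + b_1 X_i$$
where
$\hat{Y}_i$ = estimated (or predicted) Y value for observation i
$b_0$ = estimate of the regression intercept
$b_1$ = estimate of the regression slope
$X_i$ = value of X for observation i

$$b_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$$
$$b_0 = \bar{y} - b_1 \bar{x}$$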
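The two least-squares formulas translate directly into code. This is a minimal sketch, assuming plain Python lists as input; `least_squares` is a hypothetical helper name and the data are made up for illustration.

```python
def least_squares(x, y):
    """Return (b0, b1) for the prediction line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula
    sxy = sum(a * b for a, b in zip(x, y)) - n * x_bar * y_bar
    sxx = sum(a * a for a in x) - n * x_bar ** 2
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical toy data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares(x, y)
print(f"y_hat = {b0:.2f} + {b1:.2f} x")
```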
[Figure: scatter plot of house price against square feet]
Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10

ANOVA
             df    SS            MS            F         Significance F
Regression   1     18934.9348    18934.9348    11.0848   0.01039
Residual     8     13665.5652    1708.1957
Total        9     32600.5000

The regression equation is:
house price = 98.24833 + 0.10977 (square feet)
[Figure: scatter plot of house price against square feet with the fitted line; slope = 0.10977, intercept = 98.248]
house price = 98.25 + 0.1098(2000) = 317.85
The predicted price for a house with 2000 square feet is 317.85 ($1,000s) = $317,850
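The prediction step can be sketched as a one-line function. The coefficients below are the ones reported in the regression output; the function name `predict_price` is hypothetical. Using the full-precision coefficients gives a value slightly different from the slide, which rounds them to 98.25 and 0.1098 first.

```python
B0 = 98.24833  # intercept from the regression output
B1 = 0.10977   # slope from the regression output

def predict_price(square_feet, b0=B0, b1=B1):
    """Predicted house price in $1000s for a given size in square feet."""
    return b0 + b1 * square_feet

full_precision = predict_price(2000)
rounded_as_on_slide = predict_price(2000, b0=98.25, b1=0.1098)
print(f"{full_precision:.2f} vs {rounded_as_on_slide:.2f} ($1000s)")
```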
Interpolation vs. Extrapolation
[Figure: house price ($1000s) plotted against square feet with the fitted line]
Do not try to extrapolate beyond the range of observed X's
Measures of Variation
$$SST = \sum (Y_i - \bar{Y})^2$$
$$SSR = \sum (\hat{Y}_i - \bar{Y})^2$$
$$SSE = \sum (Y_i - \hat{Y}_i)^2$$
[Figure: at a given $X_i$, the distances between $Y_i$, $\hat{Y}_i$ and $\bar{Y}$ that make up SST, SSR and SSE]
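The three measures of variation can be computed side by side to see how they relate. This is a sketch on hypothetical toy data; the coefficients 2.2 and 0.6 are the least-squares fit for that toy data, and `sums_of_squares` is an illustrative name. For a least-squares line, SST should equal SSR + SSE up to rounding.

```python
def sums_of_squares(x, y, b0, b1):
    """Return (SST, SSR, SSE) for the fitted line y_hat = b0 + b1 * x."""
    n = len(y)
    y_bar = sum(y) / n
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # explained variation
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    return sst, ssr, sse

# Hypothetical toy data; b0 and b1 are the least-squares fit for these points
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
sst, ssr, sse = sums_of_squares(x, y, b0=2.2, b1=0.6)
print(f"SST={sst:.2f}  SSR={ssr:.2f}  SSE={sse:.2f}")
```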
Coefficient of Determination, $r^2$
note: $0 \le r^2 \le 1$
Examples of Approximate $r^2$ Values
$r^2 = 1$
[Figure: scatter plots of Y against X]
$0 < r^2 < 1$
[Figure: scatter plot of Y against X]
$r^2 = 0$
No linear relationship between X and Y:
$$r^2 = \frac{SSR}{SST} = \frac{18934.9348}{32600.5000} = 0.58082$$
This matches R Square = 0.58082 in the Excel regression output.
58.08% of the variation in house prices is explained by variation in square feet
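The R Square entry can be reproduced from the ANOVA sums of squares. The short sketch below uses only the values printed in the output; the variable names are illustrative. Taking the square root gives the Multiple R entry, since the fitted slope is positive.

```python
import math

SSR = 18934.9348  # regression sum of squares from the ANOVA table
SSE = 13665.5652  # residual (error) sum of squares
SST = 32600.5000  # total sum of squares

r_squared = SSR / SST
multiple_r = math.sqrt(r_squared)  # positive root, because the slope is positive
print(f"r^2 = {r_squared:.5f}, Multiple R = {multiple_r:.5f}")
```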
$$S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}$$
where
SSE = error sum of squares
n = sample size
From the Excel regression output (Standard Error):
$S_{YX} = 41.33032$
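The standard error of the estimate follows from the residual row of the ANOVA table. A minimal check, using only numbers that appear in the output:

```python
import math

SSE = 13665.5652  # error sum of squares from the ANOVA table
n = 10            # number of observations

s_yx = math.sqrt(SSE / (n - 2))
print(f"S_YX = {s_yx:.5f}")
```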
[Figure: two plots against X comparing a small $S_{YX}$ with a large $S_{YX}$]
• Independence of Errors
– Error values are statistically independent
• Normality of Error
– Error values (ε) are normally distributed for any given value of X
$$e_i = Y_i - \hat{Y}_i$$
• The residual for observation i, ei, is the difference between
its observed and predicted value
• Check the assumptions of regression by examining the
residuals
– Examine for linearity assumption
– Evaluate independence assumption
– Evaluate normal distribution assumption
– Examine for constant variance for all levels of X (homoscedasticity)
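Residuals are simple to compute once the line has been fitted. The sketch below uses hypothetical toy data with its least-squares coefficients (2.2 and 0.6); `residuals` is an illustrative helper name. For a least-squares line with an intercept the residuals sum to zero, so the checks listed above look at their pattern rather than their total.

```python
def residuals(x, y, b0, b1):
    """e_i = Y_i - Y_hat_i for the fitted line y_hat = b0 + b1 * x."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Hypothetical toy data; 2.2 and 0.6 are the least-squares fit for these points
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
e = residuals(x, y, b0=2.2, b1=0.6)

for xi, ei in zip(x, e):
    print(f"x = {xi}: residual = {ei:+.2f}")
```

Plotting `e` against `x` (or against time order) would give the residual plots used for the linearity, independence and equal-variance checks.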
[Figure: Y and residuals plotted against x, with panels labelled Not Linear and Linear]
[Figure: residuals plotted against X, with panels labelled Not Independent and Independent]
[Figure: normal probability plot of the residuals, Percent against Residual]
Residual Analysis for Equal Variance
[Figure: Y and residuals plotted against x, with panels labelled Non-constant variance and Constant variance]
Excel Residual Output
Here, residuals show a cyclic pattern, not random. Cyclical patterns are a sign of positive autocorrelation
[Figure: residuals plotted against Time (t)]
Durbin-Watson statistic:
$$D = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$$
D less than 2 may signal positive autocorrelation, D greater than 2 may signal negative autocorrelation
[Figure: decision scale for D marked at 0, $d_L$, $d_U$ and 2]
Testing for Positive Autocorrelation
[Figure: Sales plotted against Time with fitted line y = 30.65 + 4.7038x, $R^2$ = 0.8976]
• Is there autocorrelation?
• Example with n = 25:
Excel/PHStat output:
Durbin-Watson Calculations
Sum of Squared Difference of Residuals    3296.18
Sum of Squared Residuals                  3279.98
Durbin-Watson Statistic                   1.00494
$$D = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2} = \frac{3296.18}{3279.98} = 1.00494$$
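The Durbin-Watson statistic is a ratio of two sums over the residuals in time order. The sketch below defines a hypothetical helper `durbin_watson`, tries it on two made-up residual sequences, and then reproduces the value from the PHStat output using the two sums reported there.

```python
def durbin_watson(e):
    """D = sum of squared successive differences / sum of squared residuals."""
    numerator = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
    denominator = sum(ei ** 2 for ei in e)
    return numerator / denominator

# Made-up residual sequences, for illustration only
d_alternating = durbin_watson([1, -1, 1, -1])  # sign flips every period
d_persistent = durbin_watson([1, 1, -1, -1])   # neighbours tend to agree

# Value from the PHStat output, using the two reported sums
d_sales = 3296.18 / 3279.98
print(d_alternating, d_persistent, round(d_sales, 5))
```

The alternating sequence gives D above 2 and the persistent one gives D below 2, in line with the rule of thumb stated earlier.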
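The standard error of the regression slope coefficient ($b_1$) is estimated by
$$S_{b_1} = \frac{S_{YX}}{\sqrt{SSX}} = \frac{S_{YX}}{\sqrt{\sum (X_i - \bar{X})^2}}$$
where:
$S_{b_1}$ = estimate of the standard error of the least squares slope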
From the Excel regression output:
$S_{b_1} = 0.03297$
$$t = \frac{b_1 - \beta_1}{S_{b_1}}, \qquad \text{d.f.} = n - 2$$
where:
$b_1$ = regression slope coefficient
$\beta_1$ = hypothesized slope
$S_{b_1}$ = standard error of the slope
Inference about the Slope: t Test
H0: β1 = 0
H1: β1 ≠ 0
From Excel output:
               Coefficients    Standard Error    t Stat     P-value
Intercept      98.24833        58.03348          1.69296    0.12892
Square Feet    0.10977         0.03297           3.32938    0.01039
Here $b_1$ = 0.10977 and $S_{b_1}$ = 0.03297, so
$$t = \frac{b_1 - \beta_1}{S_{b_1}} = \frac{0.10977 - 0}{0.03297} = 3.32938$$
d.f. = 10 - 2 = 8
α/2 = .025 in each tail; critical values $\pm t_{\alpha/2} = \pm 2.3060$
Reject H0 if t < -2.3060 or t > 2.3060; otherwise do not reject H0
Decision: Reject H0, since t = 3.329 > 2.3060
Conclusion: There is sufficient evidence that square footage affects house price
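The t test for the slope can be sketched as a small function. The inputs are the slope, its standard error and the critical value 2.3060, all taken from the example; the function name `slope_t_test` is hypothetical. The critical value is passed in rather than computed, because the Python standard library has no t-distribution quantile function.

```python
def slope_t_test(b1, s_b1, t_crit, beta1_null=0.0):
    """Two-tailed t test of H0: beta1 = beta1_null; returns (t, reject_H0)."""
    t_stat = (b1 - beta1_null) / s_b1
    reject = abs(t_stat) > t_crit
    return t_stat, reject

# Values from the house price example (d.f. = 8, alpha = 0.05)
t_stat, reject = slope_t_test(b1=0.10977, s_b1=0.03297, t_crit=2.3060)
print(f"t = {t_stat:.3f}, reject H0: {reject}")
```

Because the inputs are the rounded values shown in the output, the last digits of t can differ slightly from Excel's 3.32938.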
Inferences about the Slope: t Test Example
(continued)
H0: β1 = 0
H1: β1 ≠ 0
From Excel output, the P-value for Square Feet is 0.01039.
Since the P-value 0.01039 is less than α = .05, reject H0.
$$F = \frac{MSR}{MSE}$$
where
$$MSR = \frac{SSR}{k}, \qquad MSE = \frac{SSE}{n - k - 1}$$
and F follows an F distribution with k numerator and (n - k - 1) denominator degrees of freedom
From the ANOVA table in the Excel regression output:
$$F = \frac{MSR}{MSE} = \frac{18934.9348}{1708.1957} = 11.0848$$
With 1 and 8 degrees of freedom
P-value for the F Test (Significance F) = 0.01039
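The F statistic can be rebuilt from the sums of squares in the ANOVA table. This sketch assumes k = 1, since simple regression has one independent variable; the variable names are illustrative.

```python
SSR = 18934.9348  # regression sum of squares
SSE = 13665.5652  # residual sum of squares
n = 10            # observations
k = 1             # one independent variable in simple regression

msr = SSR / k
mse = SSE / (n - k - 1)
f_stat = msr / mse
print(f"MSR = {msr:.4f}, MSE = {mse:.4f}, F = {f_stat:.4f}")
print(f"degrees of freedom: {k} and {n - k - 1}")
```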
• Hypotheses
H0: ρ = 0 (no correlation between X and Y)
HA: ρ ≠ 0 (correlation exists)
• Test statistic
$$t = \frac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
(with n - 2 degrees of freedom)
where
$r = +\sqrt{r^2}$ if $b_1 > 0$
$r = -\sqrt{r^2}$ if $b_1 < 0$
Example: House Prices
H0: ρ = 0 (No correlation)
H1: ρ ≠ 0 (correlation exists)
α = .05, df = 10 - 2 = 8
$$t = \frac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{.762 - 0}{\sqrt{\dfrac{1 - .762^2}{10 - 2}}} = 3.329$$
Example: Test Solution
t = 3.329, d.f. = 10 - 2 = 8
α/2 = .025 in each tail; critical values $\pm t_{\alpha/2} = \pm 2.3060$
Reject H0 if t < -2.3060 or t > 2.3060; otherwise do not reject H0
Decision: Reject H0
Conclusion: There is evidence of a linear association at the 5% level of significance
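The correlation test statistic can be computed directly from r and n. The sketch below uses r = 0.76211 (the Multiple R value from the regression output) and n = 10; `correlation_t_stat` is a hypothetical helper name, and the critical value 2.3060 is the one quoted in the example.

```python
import math

def correlation_t_stat(r, n, rho_null=0.0):
    """t statistic for H0: rho = rho_null, with n - 2 degrees of freedom."""
    return (r - rho_null) / math.sqrt((1 - r ** 2) / (n - 2))

t_stat = correlation_t_stat(r=0.76211, n=10)
reject = abs(t_stat) > 2.3060  # critical value for d.f. = 8, alpha = 0.05
print(f"t = {t_stat:.3f}, reject H0: {reject}")
```

The result agrees with the t statistic for the slope, which is expected in simple linear regression.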
Estimating Mean Values and Predicting Individual Values
[Figure: fitted line of Y against X, showing the prediction interval for an individual Y, given $X_i$]
Confidence Interval for the Average Y, Given X
$$\hat{Y} \pm t_{n-2}\, S_{YX} \sqrt{h_i}$$
Size of interval varies according to distance away from mean, $\bar{X}$
$$h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SSX} = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum (X_i - \bar{X})^2}$$
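Prediction Interval for an Individual Y, Given X
$$\hat{Y} \pm t_{n-2}\, S_{YX} \sqrt{1 + h_i}$$
Confidence interval for the mean price of houses with 2000 square feet ($1,000s):
$$\hat{Y} \pm t_{n-2}\, S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum (X_i - \bar{X})^2}} = 317.85 \pm 37.12$$
Prediction interval for the price of an individual house with 2000 square feet ($1,000s):
$$\hat{Y} \pm t_{n-2}\, S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum (X_i - \bar{X})^2}} = 317.85 \pm 102.28$$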
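The two interval formulas differ only by the extra 1 under the square root, which a short sketch makes visible. The house data needed for $\bar{X}$ and SSX are not listed here, so the example uses hypothetical toy summaries (n = 5, mean of X = 3, SSX = 10, S_YX = sqrt(0.8)) and the table value t = 3.1824 for 3 degrees of freedom at 95% confidence. The function name `interval_half_widths` is also an assumption.

```python
import math

def interval_half_widths(t_crit, s_yx, n, x_new, x_bar, ssx):
    """Return (h, CI half-width for mean Y, PI half-width for individual Y)."""
    h = 1.0 / n + (x_new - x_bar) ** 2 / ssx
    ci_half = t_crit * s_yx * math.sqrt(h)        # average Y, given X
    pi_half = t_crit * s_yx * math.sqrt(1.0 + h)  # individual Y, given X
    return h, ci_half, pi_half

# Hypothetical toy summaries, for illustration only
n, x_bar, ssx = 5, 3.0, 10.0
s_yx = math.sqrt(2.4 / (n - 2))  # SSE = 2.4 for the toy fit y_hat = 2.2 + 0.6 x
t_crit = 3.1824                  # t table value, d.f. = 3, 95% confidence
x_new = 4
y_hat = 2.2 + 0.6 * x_new

h, ci_half, pi_half = interval_half_widths(t_crit, s_yx, n, x_new, x_bar, ssx)
print(f"h = {h:.2f}")
print(f"mean Y:       {y_hat:.2f} +/- {ci_half:.2f}")
print(f"individual Y: {y_hat:.2f} +/- {pi_half:.2f}")
```

The prediction interval is always wider than the confidence interval at the same X, and both widen as X moves away from the mean of X.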
• In Excel, use PHStat | regression | simple linear regression …
– Check the “confidence and prediction interval for X=” box and enter the X-value and confidence level desired