Week 2
Estimation of Regression Function (Ch 1.6)
Estimated Regression Function
Regression Model: Y = β0 + β1 X + ε.
Regression Function: E (Y ) = β0 + β1 X .
Use least squares estimation to estimate β0 and β1 .
Estimated regression function:
Ŷ = b0 + b1 X ,
where
b1 = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^n (Xi − X̄)² = Sxy/Sxx,   b0 = Ȳ − b1 X̄,
and we call Ŷ the value of the estimated regression function at the level X
of the predictor variable.
Residuals
The i-th residual is the difference between the observed value Yi and
the corresponding fitted value Ŷi , denoted as ei :
ei = Yi − Ŷi .
ei = Yi − (b0 + b1 Xi ).
Residuals
Do not confuse
εi = Yi − E (Yi ) "Model error"
ei = Yi − Ŷi "Residual"
Residuals
[Figure: scatter plot of Y with a fitted line illustrating the residuals as vertical deviations]
Toluca Company Example
Load Data
mydata <- read.table("toluca.txt", header = TRUE)  # call assumed, file name hypothetical
dim(mydata)
> [1] 25 2
head(mydata, 2)
> x y
> 1 80 399
> 2 30 121
X <- mydata[,1] # or X <- mydata$x
Y <- mydata[,2] # or Y <- mydata$y
Load Data
# if the file is read without a header row, the columns get default
# names V1, V2 (head call assumed from the output):
head(mydata, 2)
>   V1  V2
> 1 80 399
> 2 30 121
names(mydata) <- c("x","y")
head(mydata, 2)
> x y
> 1 80 399
> 2 30 121
Summary Statistics
summary(mydata)
> x y
> Min. : 20 Min. :113.0
> 1st Qu.: 50 1st Qu.:224.0
> Median : 70 Median :342.0
> Mean : 70 Mean :312.3
> 3rd Qu.: 90 3rd Qu.:389.0
> Max. :120 Max. :546.0
Summary Statistics
boxplot(mydata) # or boxplot(X); boxplot(Y)
[Figure: side-by-side boxplots of x and y]
Scatter Plot
plot(X, Y, pch = 16, xlab = "Lot size", ylab = "Work hours",
main = "Toluca Company")
[Figure: scatter plot of Work hours against Lot size, titled "Toluca Company"]
Fitting Model Manually
Recall we have
b1 = Σ_{i=1}^n (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^n (Xi − X̄)²
b0 = Ȳ − b1 X̄
Xbar <- mean(X)
Ybar <- mean(Y)
Xcenter <- X - Xbar
Ycenter <- Y - Ybar
Sxy <- sum(Xcenter*Ycenter) # or sum(X*Y)-length(X)*mean(X)*mean(Y)
Sxx <- sum(Xcenter^2)        # or sum(X^2)-length(X)*mean(X)*mean(X)
Sxy
b1 <- Sxy/Sxx
b1
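The intercept then follows from b0 = Ȳ − b1 X̄, completing the manual fit (a short sketch using Xbar, Ybar, and b1 from above):
b0 <- Ybar - b1*Xbar
b0   # 62.365859, matching the lm coefficients shown later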
Fitting Model with “lm” Function
mymodel <- lm(Y ~ X)   # fit the simple linear regression
View(mymodel)
Fitting Model with “lm” Function
coef(mymodel)   # extract the estimated coefficients (call assumed from the output)
> (Intercept)           X
>   62.365859    3.570202
Fitting Model with “lm” Function
summary(mymodel)
>
> Call:
> lm(formula = Y ~ X)
>
> Residuals:
> Min 1Q Median 3Q Max
> -83.876 -34.088 -5.982 38.826 103.528
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 62.366 26.177 2.382 0.0259 *
> X 3.570 0.347 10.290 4.45e-10 ***
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 48.82 on 23 degrees of freedom
> Multiple R-squared: 0.8215, Adjusted R-squared: 0.8138
> F-statistic: 105.9 on 1 and 23 DF, p-value: 4.449e-10
Estimated Regression Line
Ŷ = 62.366 + 3.570X
plot(X, Y, pch = 16)
abline(mymodel, col = "purple", lty = 2, lwd = 2)
[Figure: scatter plot of Y against X with the dashed fitted line added by abline]
Fitted Values
Yhat <- b0 + b1*X   # manual computation (call assumed from the unnamed output)
Yhat
> [1] 347.9820 169.4719 240.8760 383.6840 312.2800 276.5780 490.7901 347.9820
> [9] 419.3861 240.8760 205.1739 312.2800 383.6840 133.7699 455.0881 419.3861
> [17] 169.4719 240.8760 383.6840 455.0881 169.4719 383.6840 205.1739 347.9820
> [25] 312.2800
Yfit <- fitted(mymodel)   # fitted values from the lm object (assignment assumed)
Yfit
> 1 2 3 4 5 6 7 8
> 347.9820 169.4719 240.8760 383.6840 312.2800 276.5780 490.7901 347.9820
> 9 10 11 12 13 14 15 16
> 419.3861 240.8760 205.1739 312.2800 383.6840 133.7699 455.0881 419.3861
> 17 18 19 20 21 22 23 24
> 169.4719 240.8760 383.6840 455.0881 169.4719 383.6840 205.1739 347.9820
> 25
> 312.2800
Predict Y for a New Observation
# call assumed; the new lot size 85 is recovered from the output,
# since 62.366 + 3.570*85 = 365.833
matrix(c(1, 85), nrow = 1) %*% coef(mymodel)
>          [,1]
> [1,] 365.833
Residuals
Res <- Y - Yhat # manually compute
Res
> 1 2 3 4 5 6
> 51.0179798 -48.4719192 -19.8759596 -7.6840404 48.7200000 -52.5779798
> 7 8 9 10 11 12
> 55.2098990 4.0179798 -66.3860606 -83.8759596 -45.1739394 -60.2800000
> 13 14 15 16 17 18
> 5.3159596 -20.7698990 -20.0880808 0.6139394 42.5280808 27.1240404
> 19 20 21 22 23 24
> -6.6840404 -34.0880808 103.5280808 84.3159596 38.8260606 -5.9820202
> 25
> 10.7200000
Properties of Fitted Regression Line (Ch 1.6)
Properties of Fitted Regression Line
1. The sum of residuals is zero: Σ_{i=1}^n ei = 0.
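A quick numerical check in R (using Res from the residuals slide):
sum(Res)   # essentially zero, up to floating-point rounding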
Properties of Fitted Regression Line
2. The sum of squared residuals, Σ_{i=1}^n ei², is a minimum: by construction, the least squares estimates b0 and b1 make Σ_{i=1}^n ei² as small as possible.
Properties of Fitted Regression Line
3. The sum of the observed values Yi equals the sum of the fitted values Ŷi:
Σ_{i=1}^n Yi = Σ_{i=1}^n Ŷi.
It follows that the mean of the fitted values Ŷi is the same as the mean of the observed values Yi, namely, Ȳ.
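Check in R (using Yhat from the fitted-values slide):
sum(Y) - sum(Yhat)   # essentially zero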
Properties of Fitted Regression Line
4. The sum of the weighted residuals is zero when the residual in the ith trial is weighted by the level of the predictor variable in the ith trial:
Σ_{i=1}^n Xi ei = 0.
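Check in R:
sum(X*Res)   # essentially zero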
Properties of Fitted Regression Line
5. The sum of the weighted residuals is zero when the residual in the ith trial is weighted by the fitted value of the response variable for the ith trial:
Σ_{i=1}^n Ŷi ei = 0.
sum(Yhat*Res)
Properties of Fitted Regression Line
6. The fitted regression line always passes through the point (X̄, Ȳ).
Estimation of Error Terms Variance σ² (Ch 1.7)
Estimation of Error Terms Variance σ²
To estimate σ² (i.e., Var(Y)) for the regression model, we again need a sum of squared deviations, but here the deviation of an observation Yi must be calculated around its own estimated mean Ŷi, i.e., ei = Yi − Ŷi. The appropriate sum of squares, denoted by SSE, is
SSE = Σ_{i=1}^n ei² = Σ_{i=1}^n (Yi − Ŷi)².
Estimation of Error Terms Variance σ²
The error mean square, MSE = SSE/(n − 2), is an unbiased estimator of σ²:
E(MSE) = σ².
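A short check in R (a sketch, using Res from the residuals slide):
n   <- length(Y)
SSE <- sum(Res^2)
MSE <- SSE/(n - 2)
sqrt(MSE)   # 48.82, the "Residual standard error" reported by summary(mymodel)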
Estimation of Error Terms Variance σ²
Note the line "Residual standard error: 48.82 on 23 degrees of freedom" below: this is √MSE with its n − 2 = 23 degrees of freedom.
summary(mymodel)
>
> Call:
> lm(formula = Y ~ X)
>
> Residuals:
> Min 1Q Median 3Q Max
> -83.876 -34.088 -5.982 38.826 103.528
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 62.366 26.177 2.382 0.0259 *
> X 3.570 0.347 10.290 4.45e-10 ***
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 48.82 on 23 degrees of freedom
> Multiple R-squared: 0.8215, Adjusted R-squared: 0.8138
> F-statistic: 105.9 on 1 and 23 DF, p-value: 4.449e-10
Normal Error Regression Model (Ch 1.8)
Normal Error Regression Model
Yi = β0 + β1 Xi + εi ,   i = 1, . . . , n,
where the error terms εi are independent N(0, σ²).
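To see what this model means, here is a small simulation sketch (the parameter values are hypothetical, chosen in the spirit of the Toluca fit):
set.seed(1)
beta0 <- 62; beta1 <- 3.6; sigma <- 49   # assumed "true" parameters
Xsim <- seq(20, 120, by = 4)
Ysim <- beta0 + beta1*Xsim + rnorm(length(Xsim), mean = 0, sd = sigma)
plot(Xsim, Ysim)   # one realization from the normal error model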
Analysis of Variance (ANOVA) Approach to Regression Analysis (Ch 2.7)
Partitioning of Total Sum of Squares
[Figure: plot of Y against the predictor illustrating how each deviation Yi − Ȳ is split into (Ŷi − Ȳ) + (Yi − Ŷi)]
Formal Development of Partitioning
SSTO = Σ_{i=1}^n (Yi − Ȳ)²
     = Σ_{i=1}^n (Yi − Ŷi + Ŷi − Ȳ)²
     = Σ_{i=1}^n (Ŷi − Ȳ)² + Σ_{i=1}^n (Yi − Ŷi)² + 2 Σ_{i=1}^n (Ŷi − Ȳ)(Yi − Ŷi)
     = SSR + SSE,
where the cross-product term drops out: Σ_{i=1}^n (Ŷi − Ȳ)(Yi − Ŷi) = Σ_{i=1}^n Ŷi ei − Ȳ Σ_{i=1}^n ei = 0 by properties 5 and 1 of the fitted regression line.
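A numerical check of the decomposition (using Y and Yhat from earlier):
SSTO <- sum((Y - mean(Y))^2)
SSR  <- sum((Yhat - mean(Y))^2)
SSE  <- sum((Y - Yhat)^2)
SSTO - (SSR + SSE)   # essentially zero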
Breakdown of Degrees of Freedom
SSTO has n − 1 degrees of freedom: one degree of freedom is lost because the deviations Yi − Ȳ must sum to zero.
SSE has n − 2 degrees of freedom: two degrees of freedom are lost due to estimating two parameters (β0 and β1) to obtain Ŷi.
SSR has 1 degree of freedom.

Source       SS                              df      MS
Regression   SSR  = Σ_{i=1}^n (Ŷi − Ȳ)²      1       MSR = SSR/1
Error        SSE  = Σ_{i=1}^n (Yi − Ŷi)²     n − 2   MSE = SSE/(n − 2)
Total        SSTO = Σ_{i=1}^n (Yi − Ȳ)²      n − 1
Expected Mean Squares
E(MSE) = σ²
E(MSR) = σ² + β1² Σ_{i=1}^n (Xi − X̄)²
When β1 = 0, MSR and MSE have the same expected value; otherwise E(MSR) > E(MSE). Comparing MSR with MSE therefore provides a test of whether β1 = 0.
F Test
To test
H0: β1 = 0   vs.   Ha: β1 ≠ 0,
or equivalently
H0: σ²_model / σ²_error = 1   vs.   Ha: σ²_model / σ²_error > 1,
where σ²_error = E(MSE) and σ²_model = E(MSR), the test statistic is
F* = MSR / MSE.
F Test
Recall: if Z1 and Z2 are two independent χ²-distributed random variables with degrees of freedom df1 and df2, then the ratio
(Z1/df1) / (Z2/df2)
follows the F(df1, df2) distribution.
F Test
So
F* ∼ F(1, n − 2) under H0.
Large values of F* favor Ha, so we reject H0 at level α when F* > F(1 − α; 1, n − 2).
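For the Toluca data, the p-value can be checked directly with the F distribution function, using the F statistic from the summary output:
1 - pf(105.9, 1, 23)   # approximately 4.449e-10, as reported by summary(mymodel)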
ANOVA (manually)
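A minimal sketch of the manual computation (reusing SSTO, SSR, and SSE from the decomposition check above):
n     <- length(Y)
MSR   <- SSR/1            # regression mean square
MSE   <- SSE/(n - 2)      # error mean square
Fstar <- MSR/MSE          # F statistic for H0: beta1 = 0
Fstar
1 - pf(Fstar, 1, n - 2)   # p-value; compare with anova(mymodel) below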
ANOVA (by R function)
anova(mymodel)
Coefficient of Determination (Ch 2.9)
Coefficient of Determination (R²)
R² = SSR/SSTO = 1 − SSE/SSTO,   0 ≤ R² ≤ 1.
R² is interpreted as the proportion of the total variation in Y that is explained by the regression on X.
R²
Common misunderstandings (none of these statements is necessarily true):
- A high coefficient of determination indicates that useful predictions can be made.
- A high coefficient of determination indicates that the estimated regression line is a good fit.
- A coefficient of determination near zero indicates that X and Y are not related.
To assess the goodness of fit and adequacy of a regression model, conclusions cannot be drawn from the value of R² alone. You also need to look at residual plots, test the significance of the model, etc.
R2
Week 2 64 / 66
R² in R
summary(mymodel)
>
> Call:
> lm(formula = Y ~ X)
>
> Residuals:
> Min 1Q Median 3Q Max
> -83.876 -34.088 -5.982 38.826 103.528
>
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) 62.366 26.177 2.382 0.0259 *
> X 3.570 0.347 10.290 4.45e-10 ***
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Residual standard error: 48.82 on 23 degrees of freedom
> Multiple R-squared: 0.8215, Adjusted R-squared: 0.8138
> F-statistic: 105.9 on 1 and 23 DF, p-value: 4.449e-10
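Both quantities can also be extracted directly from the summary object:
summary(mymodel)$r.squared       # 0.8215
summary(mymodel)$adj.r.squared   # 0.8138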
Read Ch 1.6-1.8, 2.7, 2.9 of the textbook.