STAT511 HW4 Solution Draft


Homework 4 Solutions

STAT511 Homework 4

Donghwi Nam

2024-11-11

Question 1
load("rent.RData")
head(rent, 5)

## rent rentsqm area yearc location bath kitchen c_heating district

## 1 120.9744 3.456410 35 1939 1 0 0 0 1112
## 2 436.9743 4.201676 104 1939 1 1 0 1 1112
## 3 355.7436 12.267021 29 1971 2 0 0 1 2114
## 4 282.9231 7.254436 39 1972 2 0 0 1 2148
## 5 807.2308 8.321964 97 1985 1 0 0 1 2222
rent.data = rent[, c("rentsqm", "area", "yearc", "location")]

(a)
# Convert the "location" variable to a factor (categorical)
rent.data$location = as.factor(rent.data$location)

library(GGally)
library(ggplot2)

ggplot(rent.data, aes(x = rentsqm)) +
geom_histogram(aes(y = after_stat(density)), color = "black", fill = "white", bins = 30) +
geom_density(color = "red") +
labs(title = "Histogram of rentsqm variable") +
theme_minimal()

[Figure: Histogram of rentsqm variable; x-axis: rentsqm, y-axis: density]
ggplot(rent.data, aes(x = area)) +
geom_histogram(aes(y = after_stat(density)), color = "black", fill = "white", bins = 30) +
geom_density(color = "red") +
labs(title = "Histogram of area variable") +
theme_minimal()

[Figure: Histogram of area variable; x-axis: area, y-axis: density]
ggplot(rent.data, aes(x = yearc)) +
geom_histogram(aes(y = after_stat(density)), color = "black", fill = "white", bins = 15) +
geom_density(color = "red") +
labs(title = "Histogram of yearc variable") +
theme_minimal()

[Figure: Histogram of yearc variable; x-axis: yearc, y-axis: density]
ggplot(rent.data) +
geom_bar(aes(x = location)) +
labs(title = "Bar chart of location variable") +
theme_minimal()

[Figure: Bar chart of location variable; x-axis: location (1, 2, 3), y-axis: count]
First, we analyze the distribution of each variable. Rentsqm has a roughly bell-shaped distribution with slight right skewness. Area is clearly right-skewed, which is natural since there are only a few houses with a large area. The variable seems to need a transformation, such as the log transformation, to mitigate the right skewness. Yearc has a multimodal distribution, that is, its distribution has more than one peak. We can assume that yearc = 1920 may stand for houses built before 1920, and we observe peaks in the 1960s and 1970s. Since houses are not built at a constant rate over time, this pattern seems natural. The bar chart of location shows that the vast majority of the houses in the dataset are in either an ‘average’ or a ‘good’ location.
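The effect of a log transformation on right skewness can be checked numerically. The following is a minimal sketch on simulated data, not on the rent data: `area_sim` is a hypothetical log-normal stand-in for a right-skewed variable, and `skewness()` is a small helper defined here for illustration.

```r
# Hypothetical stand-in for a right-skewed variable such as area;
# the real rent data are not used here.
set.seed(1)
area_sim <- rlnorm(1000, meanlog = 4, sdlog = 0.4)

# Moment-based sample skewness (helper defined for this sketch only).
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(area_sim)       # clearly positive for log-normal data
skew_log <- skewness(log(area_sim))  # close to zero after the transformation
c(raw = skew_raw, log = skew_log)
```

A skewness near zero after the transformation is what motivates using log(area) as a predictor; whether the same holds for the actual area variable has to be checked on the data itself.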
ggpairs(rent.data,
columns = 1:3,
aes(color = location,
alpha = 0.1))

[Figure: ggpairs scatterplot matrix of rentsqm, area and yearc, colored by location]

Correlations reported in the plot:

pair             overall     location 1   location 2   location 3
rentsqm, area    -0.341***   -0.384***    -0.334***    -0.238*
rentsqm, yearc    0.389***    0.420***     0.397***     0.370***
area, yearc      -0.231***   -0.145***    -0.308***    -0.186

Now, we analyze pairwise relationships between the variables, using the categorical variable location as an additional layer. For rentsqm and area, the ‘average’ and ‘good’ locations have similar distributions, while the ‘top’ location differs: houses in the ‘top’ location seem to have higher rentsqm and area than those in the other locations. Area and rentsqm are negatively correlated, meaning that the bigger the house is, the lower the rent per m² tends to be. It is interesting that most of the houses built during the 1970s are in the ‘average’ location, while after the construction peaks more houses were built in the ‘top’ location. Yearc and rentsqm are positively correlated, indicating that newer houses tend to have a higher rent per m². Moreover, area and yearc are negatively correlated, meaning that newer houses tend to be smaller.
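The overall and per-location correlations that ggpairs prints can also be computed directly. This is a minimal sketch on simulated data; the data frame `sim` and its coefficients are hypothetical and only mimic the structure of rent.data.

```r
# Hypothetical data with the same column names as rent.data.
set.seed(2)
n <- 300
sim <- data.frame(
  location = factor(sample(1:3, n, replace = TRUE)),
  area     = runif(n, 20, 160)
)
sim$rentsqm <- 10 - 0.03 * sim$area + rnorm(n)

# Overall Pearson correlation.
overall <- cor(sim$rentsqm, sim$area)

# One correlation per level of location.
by_location <- sapply(split(sim, sim$location),
                      function(d) cor(d$rentsqm, d$area))
overall
by_location
```

Applied to rent.data, the same two calls should reproduce the "Corr" values and the per-location values shown in the plot.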

(b)
We first fit a basic linear regression model.
par(mfrow = c(2, 2), cex.lab = 0.5, mar = 2 * c(1, 1, 1, 1), cex = 0.3)
m1 = lm(rentsqm ~ area + yearc + location, data = rent.data)
summary(m1)

##
## Call:
## lm(formula = rentsqm ~ area + yearc + location, data = rent.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7673 -1.4799 -0.1198 1.3881 8.8992
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -66.014367 3.508742 -18.814 < 2e-16 ***
## area -0.028803 0.001659 -17.367 < 2e-16 ***
## yearc 0.038201 0.001777 21.492 < 2e-16 ***
## location2 0.735812 0.079936 9.205 < 2e-16 ***
## location3 1.786440 0.245721 7.270 4.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.114 on 3077 degrees of freedom

## Multiple R-squared: 0.2475, Adjusted R-squared: 0.2466
## F-statistic: 253 on 4 and 3077 DF, p-value: < 2.2e-16
plot(m1)
[Figure: diagnostic plots for m1; panels: Residuals vs Fitted, Q-Q Residuals, Scale-Location, Residuals vs Leverage]

From Question (a), we have seen that the area variable is right-skewed. We apply a log transformation to area to mitigate its right skewness.
hist(log(rent.data$area))

[Figure: Histogram of log(rent.data$area); x-axis: log(rent.data$area), y-axis: Frequency]
The second model uses the log-transformed area variable.
par(mfrow = c(2, 2), cex.lab = 0.5, mar = 2 * c(1, 1, 1, 1), cex = 0.3)
m2 = lm(rentsqm ~ log(area) + yearc + location, data = rent.data)
summary(m2)

##
## Call:
## lm(formula = rentsqm ~ log(area) + yearc + location, data = rent.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6956 -1.4447 -0.1025 1.4025 8.4686
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -58.72809 3.52113 -16.679 < 2e-16 ***
## log(area) -2.13213 0.10485 -20.335 < 2e-16 ***
## yearc 0.03801 0.00174 21.842 < 2e-16 ***
## location2 0.72335 0.07860 9.203 < 2e-16 ***
## location3 1.76197 0.24140 7.299 3.67e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.08 on 3077 degrees of freedom
## Multiple R-squared: 0.2717, Adjusted R-squared: 0.2707
## F-statistic: 286.9 on 4 and 3077 DF, p-value: < 2.2e-16

plot(m2)
[Figure: diagnostic plots for m2; panels: Residuals vs Fitted, Q-Q Residuals, Scale-Location, Residuals vs Leverage]

The log transformation of area increased the R² of the model from 0.2475 to 0.2717. We keep the log-transformed area in further analysis.
Next, we try a model with all interaction terms.
par(mfrow = c(2, 2), cex.lab = 0.5, mar = 2 * c(1, 1, 1, 1), cex = 0.3)
m3 = lm(rentsqm ~ log(area)*location*yearc, data = rent.data)
summary(m3)

##
## Call:
## lm(formula = rentsqm ~ log(area) * location * yearc, data = rent.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8536 -1.4158 -0.0939 1.3747 8.1744
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.347e+02 5.696e+01 -4.120 3.89e-05 ***
## log(area) 3.844e+01 1.353e+01 2.841 0.00453 **
## location2 -1.769e+02 8.054e+01 -2.197 0.02810 *
## location3 7.324e+02 2.827e+02 2.591 0.00961 **
## yearc 1.283e-01 2.906e-02 4.414 1.05e-05 ***
## log(area):location2 4.411e+01 1.902e+01 2.320 0.02043 *
## log(area):location3 -1.684e+02 6.488e+01 -2.596 0.00948 **

## log(area):yearc -2.082e-02 6.906e-03 -3.015 0.00259 **
## location2:yearc 9.055e-02 4.118e-02 2.199 0.02795 *
## location3:yearc -3.756e-01 1.442e-01 -2.605 0.00924 **
## log(area):location2:yearc -2.249e-02 9.725e-03 -2.313 0.02082 *
## log(area):location3:yearc 8.659e-02 3.310e-02 2.616 0.00894 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.061 on 3070 degrees of freedom
## Multiple R-squared: 0.2867, Adjusted R-squared: 0.2842
## F-statistic: 112.2 on 11 and 3070 DF, p-value: < 2.2e-16
plot(m3)
[Figure: diagnostic plots for m3; panels: Residuals vs Fitted, Q-Q Residuals, Scale-Location, Residuals vs Leverage]

The adjusted R² has increased from 0.2707 for the model without interaction terms to 0.2842. Adding the interaction terms appears to improve the explanatory power of the model.
Since this is the full model, we perform a model search to look for a better model.
library(leaps)
reg_all = regsubsets(rentsqm ~ log(area)*location*yearc, nvmax = 20, data = rent.data)
cbind(Cp = summary(reg_all)$cp,
adjr2 = summary(reg_all)$adjr2,
bic = summary(reg_all)$bic)

## Cp adjr2 bic
## [1,] 575.35403 0.1509341 -489.2060
## [2,] 180.22032 0.2429851 -835.8447
## [3,] 110.51955 0.2594137 -896.4342

## [4,] 54.80182 0.2726001 -944.7729
## [5,] 18.07400 0.2813760 -975.1508
## [6,] 17.81677 0.2816677 -969.3711
## [7,] 17.55301 0.2819612 -963.5996
## [8,] 12.85155 0.2832886 -962.2716
## [9,] 12.86424 0.2835183 -956.2296
## [10,] 14.82650 0.2832938 -948.2341
## [11,] 12.00000 0.2841857 -945.0423
In terms of Mallows' Cp and adjusted R², the full model (model 11) is the best, while in terms of BIC, model 5 (rentsqm ∼ log(area) + yearc + log(area) ∗ location2 + log(area) ∗ yearc + log(area) ∗ location3 ∗ yearc) is the best.
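The choice of model size under each criterion can be read off programmatically. The sketch below simply copies the Cp, adjusted R² and BIC columns printed above into vectors (no refitting) and picks the optimal row for each.

```r
# Values copied from the regsubsets summary table above (rows 1 to 11).
cp <- c(575.35403, 180.22032, 110.51955, 54.80182, 18.07400, 17.81677,
        17.55301, 12.85155, 12.86424, 14.82650, 12.00000)
adjr2 <- c(0.1509341, 0.2429851, 0.2594137, 0.2726001, 0.2813760, 0.2816677,
           0.2819612, 0.2832886, 0.2835183, 0.2832938, 0.2841857)
bic <- c(-489.2060, -835.8447, -896.4342, -944.7729, -975.1508, -969.3711,
         -963.5996, -962.2716, -956.2296, -948.2341, -945.0423)

best_cp    <- which.min(cp)     # smallest Cp
best_adjr2 <- which.max(adjr2)  # largest adjusted R^2
best_bic   <- which.min(bic)    # smallest BIC
c(Cp = best_cp, adjr2 = best_adjr2, bic = best_bic)
```

With a fitted object the same lookup would be `which.min(summary(reg_all)$cp)` and so on; BIC penalizes model size more heavily than Cp or adjusted R², which is why it settles on the smaller model.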
The final model we choose is

$$\hat{Y} = -234.7 + 38.4X_1 - 176.9X_2 + 732.4X_3 + 0.13X_4 + 44.1X_1X_2 - 168.4X_1X_3 - 0.02X_1X_4 + 0.09X_2X_4 - 0.38X_3X_4 - 0.02X_1X_2X_4 + 0.09X_1X_3X_4$$

where
$X_1$: log(area), $X_2$: good location, $X_3$: top location, $X_4$: year.
The Residuals vs. Fitted plot suggests that the homoscedasticity assumption is met. The Q-Q plot shows that the normality assumption is generally met; there are deviations at the tails, but they are not large. The Residuals vs Leverage plot shows no serious outliers in either the predictors or the response. I believe the model assumptions are generally well met.

(c)
ggplot(aes(x = rentsqm), data = rent.data) +
geom_density(aes(color = location)) +
labs(title = "Location vs. Rentsqm") +
theme_minimal()

[Figure: Location vs. Rentsqm; density of rentsqm by location (1, 2, 3)]

Although there are not many observations from the “top” location, this plot clearly indicates that rentsqm for the “top” location tends to be higher than for the other locations. Rentsqm of houses in “good” locations tends to be more spread out than that of houses in the “average” location.
ggplot(aes(x = yearc, y = rentsqm), data = rent.data) +
geom_point() +
geom_smooth(col = "red", method = "loess") +
labs(title = "Year vs. Rentsqm") +
theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

[Figure: Year vs. Rentsqm; scatter plot of rentsqm against yearc with loess smooth]
Although rentsqm is widely spread in every year, the general trend is that rentsqm increases with a later construction year.
ggplot(aes(x = area, y = rentsqm), data = rent.data) +
geom_point() +
geom_smooth(col = "red", method = "loess") +
labs(title = "Area vs. Rentsqm") +
theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

[Figure: Area vs. Rentsqm; scatter plot of rentsqm against area with loess smooth]
There is a clear trend indicating that as the area of the house increases, rentsqm decreases. However, the slope flattens markedly as area increases, meaning that rentsqm does not differ much among big houses.

(d)
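The full model (FM) we consider is

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_1 X_2 + \beta_5 X_1 X_3 + \varepsilon$$

where

$Y$ := rentsqm
$X_1$ := yearc
$X_2$ := ind(location = “good”)
$X_3$ := ind(location = “top”)

The hypotheses we are testing are

$$H_0 : RM \quad \text{vs.} \quad H_1 : FM$$

(equivalently, $H_0 : \beta_4 = \beta_5 = 0$), where

$$RM : Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$$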

reduced.model = lm(rentsqm ~ location + yearc, data = rent.data)

full.model = lm(rentsqm ~ location * yearc, data = rent.data)
anova(reduced.model, full.model)

## Analysis of Variance Table

## Model 1: rentsqm ~ location + yearc
## Model 2: rentsqm ~ location * yearc
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3078 15106
## 2 3076 15104 2 2.1611 0.2201 0.8025
The calculated F statistic is 0.2201 (the extra sum of squares is 2.1611 on 2 degrees of freedom) and the p-value is 0.8025. At the significance level α = 0.05, we do not reject H0. There is not enough evidence to say that the relationship between the year of construction and the average rent per square meter differs by location.
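The F statistic and p-value can be recomputed by hand from the ANOVA table. This sketch uses only the rounded numbers printed above, so the results agree with the table up to rounding.

```r
# Numbers copied from the anova() output above.
rss_full <- 15104    # residual sum of squares of the full model
df_full  <- 3076     # residual degrees of freedom of the full model
ss_extra <- 2.1611   # extra sum of squares explained by the interaction terms
df_extra <- 2        # number of interaction coefficients tested

# Partial F statistic: (extra SS / extra df) / (RSS_full / df_full)
f_stat  <- (ss_extra / df_extra) / (rss_full / df_full)
p_value <- pf(f_stat, df_extra, df_full, lower.tail = FALSE)
c(F = f_stat, p = p_value)
```

The value 2.1611 in the table is the numerator sum of squares, not the test statistic; dividing by its degrees of freedom and by the full-model mean squared error gives F.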

(e)
# Using predict()
fit = lm(rentsqm ~ location + yearc, data = rent.data)
new.data = data.frame(location = "3", yearc = 1980)
predict(fit, new.data, interval = "prediction")

## fit lwr upr

## 1 9.268873 4.896563 13.64118
# Manual calculation
x0 = c(1, 0, 1, 1980)  # (Intercept), location2, location3, yearc
design.matrix = model.matrix(fit)
point.est = coef(fit) %*% x0
# Residual standard error; 3078 = n - 4 residual degrees of freedom
se = sqrt(sum((rent.data$rentsqm - fit$fitted.values)^2) / 3078)
moe = qt(0.975, nrow(rent.data) - 4) * se * sqrt(1 + t(x0) %*% solve(crossprod(design.matrix)) %*% x0)
c(point.est - moe, point.est + moe)

## [1] 4.896563 13.641182

The 95% prediction interval for a new house in the top location built in 1980 is

[4.8966, 13.6412]
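For reference, the manual calculation above corresponds to the standard prediction interval for a new response at $x_0$ in a linear model. The symbols $\hat\sigma$, $p$ and $t_{0.975,\,n-p}$ are notation introduced here; in the code, $p = 4$ and $n - p = 3078$.

```latex
\[
  x_0^\top \hat\beta \;\pm\; t_{0.975,\,n-p}\;\hat\sigma\,
  \sqrt{1 + x_0^\top (X^\top X)^{-1} x_0},
  \qquad
  \hat\sigma^2 = \frac{\mathrm{RSS}}{n-p}
\]
```

The extra 1 under the square root accounts for the noise in the new observation itself; dropping it would give the narrower confidence interval for the mean response.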

Question 2
(a)
Linear regression
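If we consider linear regression, we assume

$$f(x_i) = x_i^\top \beta$$

where $x_i, \beta \in \mathbb{R}^d$. The least squares estimate (LSE) of $\beta$ is obtained from

$$\frac{\partial}{\partial \beta} \sum_{i=1}^n \varepsilon_i^2 = \frac{\partial}{\partial \beta} \sum_{i=1}^n \left( y_i^2 - 2 y_i x_i^\top \beta + \beta^\top x_i x_i^\top \beta \right) = 0$$

Setting the derivative to zero and solving for $\beta$, we get $\hat\beta_{LSE}$ as

$$\hat\beta_{LSE} = \Big( \sum_{j=1}^n x_j x_j^\top \Big)^{-1} \sum_{i=1}^n x_i Y_i = (X^\top X)^{-1} X^\top Y$$

Then $\hat f_{LM}(x_0)$ is

$$\hat f_{LM}(x_0) = x_0^\top \hat\beta_{LSE} = x_0^\top (X^\top X)^{-1} \sum_{i=1}^n x_i Y_i = \sum_{i=1}^n x_0^\top (X^\top X)^{-1} x_i Y_i$$

so

$$\ell_i(x_0; (x_i)_{i=1}^n) = x_0^\top (X^\top X)^{-1} x_i$$

given that $X^\top X = \sum_{j=1}^n x_j x_j^\top$ is invertible.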
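The weight representation can be verified numerically: the prediction at $x_0$ is a weighted sum of the responses with weights $x_0^\top (X^\top X)^{-1} x_i$. The sketch below uses a simulated design; `X`, `y` and `x0` are hypothetical and not taken from the homework data.

```r
set.seed(3)
n <- 50
X  <- cbind(1, rnorm(n), rnorm(n))          # hypothetical design with intercept
y  <- drop(X %*% c(1, 2, -1) + rnorm(n))    # hypothetical responses
x0 <- c(1, 0.5, -0.2)                       # hypothetical query point

# l_i(x0) = x0' (X'X)^{-1} x_i, one weight per training observation
w <- drop(t(x0) %*% solve(crossprod(X)) %*% t(X))
f_hat_weights <- sum(w * y)

# Same prediction through lm(): x0' beta_hat
fit <- lm(y ~ X - 1)
f_hat_lm <- sum(coef(fit) * x0)
c(weights = f_hat_weights, lm = f_hat_lm)
```

Note that the weights depend on the whole design matrix through $(X^\top X)^{-1}$, not on $x_i$ alone.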

KNN regression
Since $\hat f_{KNN}(x_0) = \frac{1}{k} \sum_{x_i \in N_k(x_0)} Y_i$,

$$\hat f_{KNN}(x_0) = \frac{1}{k} \sum_{i=1}^n \mathrm{ind}(x_i \in N_k(x_0)) \, Y_i$$

which leaves

$$\ell_i(x_0; (x_i)_{i=1}^n) = \frac{1}{k} \, \mathrm{ind}(x_i \in N_k(x_0))$$
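The KNN weights can be built explicitly in the same way. This is a minimal one-dimensional sketch on simulated data; `x`, `y`, `x0` and `k` are hypothetical choices for illustration.

```r
set.seed(4)
n <- 40
k <- 5
x  <- runif(n)                              # hypothetical predictor
y  <- sin(2 * pi * x) + rnorm(n, sd = 0.2)  # hypothetical responses
x0 <- 0.3                                   # hypothetical query point

# Indices of the k nearest neighbours of x0
nn <- order(abs(x - x0))[1:k]

# l_i(x0) = ind(x_i in N_k(x0)) / k
w <- numeric(n)
w[nn] <- 1 / k

f_hat_weights <- sum(w * y)    # weighted-sum form
f_hat_mean    <- mean(y[nn])   # usual "average of the neighbours" form
c(weights = f_hat_weights, mean = f_hat_mean)
```

Exactly k weights are nonzero and they sum to one, so the estimator is a local average.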

(b)
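The EPE at $x_0$ is defined as $\mathrm{EPE}(\hat f(x_0)) = E[(Y_0 - \hat f(x_0))^2]$. Adding and subtracting $f(x_0)$ and $E[\hat f(x_0)]$,

$$E[(Y_0 - \hat f(x_0))^2] = E\big[(Y_0 - f(x_0) + f(x_0) - E[\hat f(x_0)] + E[\hat f(x_0)] - \hat f(x_0))^2\big] = \sigma^2 + (E[\hat f(x_0)] - f(x_0))^2 + \mathrm{Var}(\hat f(x_0))$$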
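The step from the expanded square to the three-term decomposition relies on the cross terms vanishing. A sketch of that step follows, under the usual assumptions (stated here, not in the text above) that $Y_0 = f(x_0) + \varepsilon_0$ with $E[\varepsilon_0] = 0$, $\mathrm{Var}(\varepsilon_0) = \sigma^2$, and $\varepsilon_0$ independent of $\hat f(x_0)$.

```latex
\begin{align*}
E\big[(Y_0 - \hat f(x_0))^2\big]
 &= E[\varepsilon_0^2]
  + \big(f(x_0) - E[\hat f(x_0)]\big)^2
  + E\big[(E[\hat f(x_0)] - \hat f(x_0))^2\big] \\
 &\quad + 2\,E[\varepsilon_0]\,\big(f(x_0) - E[\hat f(x_0)]\big)
  + 2\,E\big[\varepsilon_0\,(E[\hat f(x_0)] - \hat f(x_0))\big] \\
 &\quad + 2\,\big(f(x_0) - E[\hat f(x_0)]\big)\,
        E\big[E[\hat f(x_0)] - \hat f(x_0)\big] \\
 &= \sigma^2 + \big(E[\hat f(x_0)] - f(x_0)\big)^2
  + \mathrm{Var}\big(\hat f(x_0)\big)
\end{align*}
```

Each cross term is zero: the first because $E[\varepsilon_0] = 0$, the second by independence together with $E[\varepsilon_0] = 0$, and the third because $E\big[E[\hat f(x_0)] - \hat f(x_0)\big] = 0$.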

Linear regression
Since

$$E(\hat f(x_0)) = x_0^\top (X^\top X)^{-1} X^\top E(Y)$$

and $E(Y) = f = (f(x_1), \ldots, f(x_n))^\top$, the second term is equal to $(x_0^\top (X^\top X)^{-1} X^\top f - f(x_0))^2$.
The third term is expressed as

$$\mathrm{Var}(\hat f(x_0)) = \mathrm{Var}(x_0^\top \hat\beta_{LSE}) = \sigma^2 \, x_0^\top (X^\top X)^{-1} x_0$$

By the assumption in the question, we approximate

$$x_0^\top (X^\top X)^{-1} x_0 \approx \frac{d}{n}$$

Hence the EPE of linear regression at $x_0$ is

$$\sigma^2 \Big(1 + \frac{d}{n}\Big) + \big(x_0^\top (X^\top X)^{-1} X^\top f - f(x_0)\big)^2$$
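One way to see why $d/n$ is a natural approximation: averaged over the training points, the quadratic form $x_i^\top (X^\top X)^{-1} x_i$ (the leverage) equals $d/n$ exactly, because the trace of the hat matrix is $d$. The sketch below checks this on a simulated design; `X`, `n` and `d` are hypothetical.

```r
set.seed(5)
n <- 200
d <- 4
X <- cbind(1, matrix(rnorm(n * (d - 1)), n))   # hypothetical n x d design

# Diagonal of the hat matrix: x_i' (X'X)^{-1} x_i for each training point
leverage <- rowSums((X %*% solve(crossprod(X))) * X)

avg_leverage <- mean(leverage)
c(average = avg_leverage, d_over_n = d / n)
```

For a single test point $x_0$ the quadratic form can be larger or smaller than $d/n$; the approximation treats $x_0$ as a typical point of the design.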

KNN-regression
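For KNN regression,

$$E(\hat f(x_0)) = E\Big(\frac{1}{k} \sum_{x_i \in N_k(x_0)} Y_i\Big) = \frac{1}{k} \sum_{x_i \in N_k(x_0)} E(Y_i) = \frac{1}{k} \sum_{x_i \in N_k(x_0)} f(x_i)$$

Hence, the second term in the EPE is $\big[\frac{1}{k} \sum_{x_i \in N_k(x_0)} f(x_i) - f(x_0)\big]^2$.

$\mathrm{Var}(\hat f(x_0))$ is

$$\mathrm{Var}(\hat f(x_0)) = \mathrm{Var}\Big[\frac{1}{k} \sum_{j=1}^n \mathrm{ind}\{x_j \in N_k(x_0)\} Y_j\Big] = \frac{1}{k^2} \sum_{j=1}^n \mathrm{ind}\{x_j \in N_k(x_0)\} \mathrm{Var}(Y_j)$$

Since there are exactly $k$ of the $Y_j$ in the neighborhood $N_k(x_0)$, and the $Y_j$ are independent with variance $\sigma^2$,

$$\frac{1}{k^2} \sum_{j=1}^n \mathrm{ind}\{x_j \in N_k(x_0)\} \mathrm{Var}(Y_j) = \frac{k \sigma^2}{k^2} = \frac{\sigma^2}{k}$$

Hence, the EPE at $x_0$ of KNN regression is

$$\sigma^2 \Big(1 + \frac{1}{k}\Big) + \Big[\frac{1}{k} \sum_{x_i \in N_k(x_0)} f(x_i) - f(x_0)\Big]^2$$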

(c)
As $k$ increases, the variance $\sigma^2 / k$ decreases, while the bias $\frac{1}{k} \sum_{x_i \in N_k(x_0)} f(x_i) - f(x_0)$ tends to increase in magnitude: the average is taken over more $Y_i$, which leads to the inclusion of observations whose $x_i$ are further from $x_0$.
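The trade-off can be illustrated with a small deterministic calculation. The sketch assumes a hypothetical true function $f(x) = x^2$, an evenly spaced design on $[-1, 1]$, $\sigma^2 = 1$, and the query point $x_0 = 0$; none of these come from the homework itself.

```r
sigma2 <- 1
x  <- seq(-1, 1, length.out = 201)   # hypothetical fixed design
f  <- function(x) x^2                # hypothetical true regression function
x0 <- 0

ks <- c(1, 5, 25, 101)
res <- t(sapply(ks, function(k) {
  nn <- order(abs(x - x0))[1:k]      # k nearest design points to x0
  c(k        = k,
    variance = sigma2 / k,                 # sigma^2 / k
    bias     = mean(f(x[nn])) - f(x0))     # (1/k) sum f(x_i) - f(x0)
}))
res
```

In this setup the variance column shrinks as k grows while the bias column grows, since larger neighbourhoods pull in points where $f$ differs more from $f(x_0)$. For other functions the bias need not be monotone in k.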

Question 3
(a)
load("Q3.Rdata")
fit.q3 = lm(Y ~ ., data = q3_dat)

library(leaps)
press = function(fit)
{
# Mean squared leave-one-out (PRESS) residual via the leverage shortcut e_i / (1 - h_ii)
h = hatvalues(fit)
mean((residuals(fit) / (1 - h))^2)
}
reg_q3 = regsubsets(Y ~ ., nvmax = 20, data = q3_dat)
summary(reg_q3)

## Subset selection object

## Call: regsubsets.formula(Y ~ ., nvmax = 20, data = q3_dat)
## 20 Variables (and intercept)
## Forced in Forced out
## X1Y FALSE FALSE
## X2Y FALSE FALSE
## X3 FALSE FALSE
## X4 FALSE FALSE
## X5 FALSE FALSE
## X6Y FALSE FALSE
## X7 FALSE FALSE
## X8Y FALSE FALSE
## X9 FALSE FALSE
## X10 FALSE FALSE
## X11Y FALSE FALSE
## X12 FALSE FALSE
## X13 FALSE FALSE
## X14 FALSE FALSE
## X15 FALSE FALSE
## X16 FALSE FALSE
## X17 FALSE FALSE
## X18 FALSE FALSE
## X19 FALSE FALSE
## X20 FALSE FALSE
## 1 subsets of each size up to 20
## Selection Algorithm: exhaustive
## X1Y X2Y X3 X4 X5 X6Y X7 X8Y X9 X10 X11Y X12 X13 X14 X15 X16 X17
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " "*" " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " "*" " " " " "*" " " " "
## 5 ( 1 ) " " " " " " " " " " "*" "*" " " " " " " " " "*" "*" " " " " " " " "
## 6 ( 1 ) " " " " " " " " "*" "*" " " " " " " " " " " "*" " " " " "*" " " "*"
## 7 ( 1 ) " " " " " " " " "*" "*" " " " " " " " " " " "*" "*" " " "*" " " "*"
## 8 ( 1 ) " " " " " " " " "*" "*" "*" " " " " " " " " "*" "*" " " "*" " " "*"
## 9 ( 1 ) " " " " " " " " "*" "*" "*" " " " " "*" " " "*" "*" " " "*" " " "*"
## 10 ( 1 ) " " " " " " " " "*" "*" "*" " " " " "*" " " "*" "*" "*" "*" " " "*"
## 11 ( 1 ) " " " " " " " " "*" "*" "*" " " " " "*" " " "*" "*" "*" "*" " " "*"

## 12 ( 1 ) " " " " " " "*" "*" "*" "*" " " " " "*" " " "*" "*" "*" "*" " " "*"
## 13 ( 1 ) " " " " "*" " " "*" "*" "*" " " " " "*" " " "*" "*" "*" "*" " " "*"
## 14 ( 1 ) " " " " "*" "*" "*" "*" "*" " " " " "*" " " "*" "*" "*" "*" " " "*"
## 15 ( 1 ) "*" " " "*" "*" "*" "*" "*" " " " " "*" " " "*" "*" "*" "*" " " "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " " " " "*" " " "*" "*" "*" "*" " " "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " " " " "*" " " "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " " " " "*" "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*"
## 20 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## X18 X19 X20
## 1 ( 1 ) " " "*" " "
## 2 ( 1 ) " " "*" " "
## 3 ( 1 ) " " "*" " "
## 4 ( 1 ) " " "*" " "
## 5 ( 1 ) " " "*" " "
## 6 ( 1 ) " " "*" " "
## 7 ( 1 ) " " "*" " "
## 8 ( 1 ) " " "*" " "
## 9 ( 1 ) " " "*" " "
## 10 ( 1 ) " " "*" " "
## 11 ( 1 ) " " "*" "*"
## 12 ( 1 ) " " "*" "*"
## 13 ( 1 ) "*" "*" "*"
## 14 ( 1 ) "*" "*" "*"
## 15 ( 1 ) "*" "*" "*"
## 16 ( 1 ) "*" "*" "*"
## 17 ( 1 ) "*" "*" "*"
## 18 ( 1 ) "*" "*" "*"
## 19 ( 1 ) "*" "*" "*"
## 20 ( 1 ) "*" "*" "*"
FM_criteria =
cbind(r2 = summary(reg_q3)$rsq[20],
adjr2 = summary(reg_q3)$adjr2[20],
PRESS = press(fit.q3),
Cp = summary(reg_q3)$cp[20],
bic = BIC(fit.q3),
aic = AIC(fit.q3))
FM_criteria

## r2 adjr2 PRESS Cp bic aic

## [1,] 0.465404 0.4430827 1.178398 21 1594.226 1501.505
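The press() helper relies on the identity that the leave-one-out residual of a linear model equals $e_i / (1 - h_{ii})$, so no refitting is needed. The sketch below checks the shortcut against explicit refits on simulated data; `dat` and its coefficients are hypothetical. As in the helper above, the squared residuals are averaged rather than summed.

```r
set.seed(6)
n <- 60
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))   # hypothetical predictors
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = dat)

# Shortcut: leave-one-out residual = e_i / (1 - h_ii)
loo_shortcut <- residuals(fit) / (1 - hatvalues(fit))

# Brute force: refit n times, each time predicting the held-out row
loo_refit <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x1 + x2, data = dat[-i, ])
  dat$y[i] - predict(fit_i, newdata = dat[i, ])
})

press_mean_shortcut <- mean(loo_shortcut^2)
press_mean_refit    <- mean(loo_refit^2)
c(shortcut = press_mean_shortcut, refit = press_mean_refit)
```

The two numbers agree up to floating-point error, which is why the shortcut is preferred for linear models.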

(b)
step_reg = step(fit.q3, direction = "both", k = log(nrow(q3_dat)))

## Start: AIC=169.07
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 + X11 +
## X12 + X13 + X14 + X15 + X16 + X17 + X18 + X19 + X20
##
## Df Sum of Sq RSS AIC
## - X9 1 0.001 540.09 162.86
## - X8 1 0.025 540.12 162.88
## - X11 1 0.028 540.12 162.88

## - X16 1 0.157 540.25 163.00
## - X2 1 0.212 540.31 163.06
## - X1 1 0.250 540.34 163.09
## - X4 1 0.345 540.44 163.18
## - X10 1 0.366 540.46 163.20
## - X14 1 0.573 540.67 163.39
## - X3 1 0.642 540.73 163.45
## - X18 1 0.762 540.85 163.56
## - X17 1 0.818 540.91 163.62
## - X20 1 0.832 540.92 163.63
## - X7 1 0.872 540.97 163.67
## - X5 1 1.849 541.94 164.57
## - X15 1 2.887 542.98 165.52
## - X13 1 3.175 543.27 165.79
## <none> 540.09 169.07
## - X12 1 27.824 567.92 187.98
## - X19 1 32.563 572.66 192.13
## - X6 1 120.643 660.74 263.67
##
## Step: AIC=162.86
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X10 + X11 + X12 +
## X13 + X14 + X15 + X16 + X17 + X18 + X19 + X20
##
## Df Sum of Sq RSS AIC
## - X8 1 0.025 540.12 156.67
## - X11 1 0.027 540.12 156.67
## - X16 1 0.166 540.26 156.80
## - X2 1 0.212 540.31 156.84
## - X1 1 0.255 540.35 156.88
## - X4 1 0.352 540.45 156.97
## - X10 1 0.368 540.46 156.99
## - X14 1 0.616 540.71 157.22
## - X3 1 0.650 540.74 157.25
## - X18 1 0.790 540.88 157.38
## - X17 1 0.818 540.91 157.40
## - X20 1 0.832 540.93 157.41
## - X7 1 0.873 540.97 157.45
## - X5 1 2.351 542.44 158.82
## - X15 1 2.931 543.02 159.35
## - X13 1 3.376 543.47 159.76
## <none> 540.09 162.86
## + X9 1 0.001 540.09 169.07
## - X19 1 32.597 572.69 185.95
## - X12 1 33.079 573.17 186.37
## - X6 1 120.642 660.74 257.45
##
## Step: AIC=156.67
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X10 + X11 + X12 + X13 +
## X14 + X15 + X16 + X17 + X18 + X19 + X20
##
## Df Sum of Sq RSS AIC
## - X11 1 0.026 540.14 150.48
## - X16 1 0.166 540.28 150.61
## - X2 1 0.211 540.33 150.65

## - X1 1 0.250 540.37 150.68
## - X4 1 0.347 540.47 150.77
## - X10 1 0.370 540.49 150.79
## - X14 1 0.610 540.73 151.02
## - X3 1 0.632 540.75 151.04
## - X18 1 0.786 540.90 151.18
## - X17 1 0.825 540.94 151.22
## - X20 1 0.831 540.95 151.22
## - X7 1 0.880 541.00 151.27
## - X5 1 2.343 542.46 152.62
## - X15 1 2.908 543.03 153.14
## - X13 1 3.386 543.50 153.58
## <none> 540.12 156.67
## + X8 1 0.025 540.09 162.86
## + X9 1 0.000 540.12 162.88
## - X19 1 32.665 572.78 179.81
## - X12 1 33.096 573.21 180.19
## - X6 1 120.642 660.76 251.25
##
## Step: AIC=150.48
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X10 + X12 + X13 + X14 +
## X15 + X16 + X17 + X18 + X19 + X20
##
## Df Sum of Sq RSS AIC
## - X16 1 0.163 540.31 144.41
## - X2 1 0.212 540.36 144.46
## - X1 1 0.266 540.41 144.51
## - X4 1 0.362 540.51 144.60
## - X10 1 0.366 540.51 144.60
## - X14 1 0.610 540.75 144.83
## - X3 1 0.641 540.79 144.86
## - X18 1 0.790 540.93 144.99
## - X17 1 0.815 540.96 145.02
## - X20 1 0.829 540.97 145.03
## - X7 1 0.877 541.02 145.07
## - X5 1 2.331 542.48 146.42
## - X15 1 2.897 543.04 146.94
## - X13 1 3.388 543.53 147.39
## <none> 540.14 150.48
## + X11 1 0.026 540.12 156.67
## + X8 1 0.024 540.12 156.67
## + X9 1 0.000 540.14 156.69
## - X19 1 32.692 572.84 173.64
## - X12 1 33.072 573.22 173.98
## - X6 1 120.897 661.04 245.25
##
## Step: AIC=144.41
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X10 + X12 + X13 + X14 +
## X15 + X17 + X18 + X19 + X20
##
## Df Sum of Sq RSS AIC
## - X2 1 0.198 540.51 138.38
## - X1 1 0.293 540.60 138.47
## - X4 1 0.345 540.65 138.52

## - X3 1 0.549 540.86 138.71
## - X18 1 0.679 540.99 138.83
## - X10 1 0.724 541.03 138.87
## - X20 1 0.766 541.07 138.91
## - X14 1 0.863 541.17 139.00
## - X17 1 0.983 541.29 139.11
## - X7 1 1.260 541.57 139.36
## - X5 1 2.173 542.48 140.21
## - X15 1 2.774 543.08 140.76
## - X13 1 3.856 544.16 141.75
## <none> 540.31 144.41
## + X16 1 0.163 540.14 150.48
## + X8 1 0.024 540.28 150.61
## + X11 1 0.023 540.28 150.61
## + X9 1 0.013 540.30 150.62
## - X19 1 32.585 572.89 167.48
## - X12 1 33.331 573.64 168.13
## - X6 1 121.319 661.63 239.48
##
## Step: AIC=138.38
## Y ~ X1 + X3 + X4 + X5 + X6 + X7 + X10 + X12 + X13 + X14 + X15 +
## X17 + X18 + X19 + X20
##
## Df Sum of Sq RSS AIC
## - X1 1 0.301 540.81 132.44
## - X4 1 0.366 540.87 132.51
## - X3 1 0.529 541.03 132.66
## - X18 1 0.690 541.20 132.81
## - X10 1 0.693 541.20 132.81
## - X20 1 0.817 541.32 132.92
## - X14 1 0.899 541.40 133.00
## - X17 1 1.033 541.54 133.12
## - X7 1 1.277 541.78 133.35
## - X5 1 2.131 542.64 134.13
## - X15 1 2.782 543.29 134.74
## - X13 1 3.787 544.29 135.66
## <none> 540.51 138.38
## + X2 1 0.198 540.31 144.41
## + X16 1 0.149 540.36 144.46
## + X11 1 0.024 540.48 144.57
## + X8 1 0.023 540.48 144.57
## + X9 1 0.016 540.49 144.58
## - X19 1 32.595 573.10 161.45
## - X12 1 33.721 574.23 162.43
## - X6 1 121.890 662.40 233.85
##
## Step: AIC=132.45
## Y ~ X3 + X4 + X5 + X6 + X7 + X10 + X12 + X13 + X14 + X15 + X17 +
## X18 + X19 + X20
##
## Df Sum of Sq RSS AIC
## - X4 1 0.373 541.18 126.58
## - X3 1 0.515 541.32 126.71
## - X18 1 0.664 541.47 126.84

## - X10 1 0.704 541.51 126.88
## - X20 1 0.729 541.54 126.90
## - X14 1 0.937 541.74 127.10
## - X17 1 0.991 541.80 127.15
## - X7 1 1.315 542.12 127.44
## - X5 1 2.049 542.86 128.12
## - X15 1 2.776 543.58 128.79
## - X13 1 3.662 544.47 129.60
## <none> 540.81 132.44
## + X1 1 0.301 540.51 138.38
## + X2 1 0.205 540.60 138.47
## + X16 1 0.174 540.63 138.50
## + X11 1 0.041 540.76 138.62
## + X8 1 0.018 540.79 138.64
## + X9 1 0.008 540.80 138.65
## - X19 1 32.458 573.26 155.37
## - X12 1 33.830 574.64 156.57
## - X6 1 121.597 662.40 227.64
##
## Step: AIC=126.58
## Y ~ X3 + X5 + X6 + X7 + X10 + X12 + X13 + X14 + X15 + X17 + X18 +
## X19 + X20
##
## Df Sum of Sq RSS AIC
## - X3 1 0.545 541.72 120.86
## - X20 1 0.592 541.77 120.91
## - X18 1 0.617 541.80 120.93
## - X14 1 0.760 541.94 121.06
## - X10 1 1.095 542.27 121.37
## - X17 1 1.180 542.36 121.45
## - X7 1 1.490 542.67 121.73
## - X5 1 2.053 543.23 122.25
## - X15 1 2.404 543.58 122.58
## - X13 1 4.184 545.36 124.21
## <none> 541.18 126.58
## + X4 1 0.373 540.81 132.44
## + X1 1 0.307 540.87 132.51
## + X2 1 0.228 540.95 132.58
## + X16 1 0.155 541.02 132.65
## + X11 1 0.060 541.12 132.74
## + X8 1 0.013 541.17 132.78
## + X9 1 0.001 541.18 132.79
## - X19 1 33.477 574.66 150.37
## - X12 1 36.356 577.54 152.87
## - X6 1 121.516 662.69 221.64
##
## Step: AIC=120.86
## Y ~ X5 + X6 + X7 + X10 + X12 + X13 + X14 + X15 + X17 + X18 +
## X19 + X20
##
## Df Sum of Sq RSS AIC
## - X18 1 0.218 541.94 114.85
## - X20 1 0.683 542.41 115.28
## - X14 1 0.735 542.46 115.33

## - X10 1 0.748 542.47 115.34
## - X7 1 1.466 543.19 116.00
## - X17 1 2.106 543.83 116.59
## - X5 1 2.172 543.90 116.65
## - X15 1 2.485 544.21 116.94
## - X13 1 3.693 545.42 118.05
## <none> 541.72 120.86
## + X3 1 0.545 541.18 126.58
## + X4 1 0.402 541.32 126.71
## + X1 1 0.292 541.43 126.81
## + X2 1 0.207 541.52 126.89
## + X11 1 0.075 541.65 127.01
## + X16 1 0.066 541.66 127.02
## + X9 1 0.005 541.72 127.07
## + X8 1 0.002 541.72 127.08
## - X19 1 33.296 575.02 144.47
## - X12 1 39.434 581.16 149.78
## - X6 1 121.115 662.84 215.54
##
## Step: AIC=114.85
## Y ~ X5 + X6 + X7 + X10 + X12 + X13 + X14 + X15 + X17 + X19 +
## X20
##
## Df Sum of Sq RSS AIC
## - X20 1 0.542 542.48 109.14
## - X10 1 0.933 542.87 109.50
## - X14 1 0.943 542.88 109.50
## - X7 1 1.670 543.61 110.17
## - X17 1 1.963 543.90 110.44
## - X5 1 2.043 543.98 110.52
## - X15 1 2.267 544.21 110.72
## - X13 1 3.824 545.77 112.15
## <none> 541.94 114.85
## + X4 1 0.354 541.59 120.74
## + X1 1 0.280 541.66 120.81
## + X2 1 0.221 541.72 120.86
## + X18 1 0.218 541.72 120.86
## + X3 1 0.146 541.80 120.93
## + X11 1 0.073 541.87 121.00
## + X16 1 0.038 541.90 121.03
## + X8 1 0.004 541.94 121.06
## + X9 1 0.003 541.94 121.06
## - X12 1 39.331 581.27 143.67
## - X19 1 48.697 590.64 151.66
## - X6 1 121.500 663.44 209.78
##
## Step: AIC=109.14
## Y ~ X5 + X6 + X7 + X10 + X12 + X13 + X14 + X15 + X17 + X19
##
## Df Sum of Sq RSS AIC
## - X14 1 0.547 543.03 103.42
## - X10 1 0.719 543.20 103.58
## - X7 1 1.259 543.74 104.08
## - X17 1 1.521 544.00 104.32

## - X5 1 1.762 544.25 104.54
## - X15 1 2.798 545.28 105.49
## - X13 1 3.376 545.86 106.02
## <none> 542.48 109.14
## + X20 1 0.542 541.94 114.85
## + X3 1 0.266 542.22 115.11
## + X2 1 0.252 542.23 115.12
## + X4 1 0.240 542.24 115.13
## + X1 1 0.205 542.28 115.16
## + X18 1 0.077 542.41 115.28
## + X11 1 0.065 542.42 115.29
## + X16 1 0.017 542.47 115.33
## + X8 1 0.003 542.48 115.35
## + X9 1 0.001 542.48 115.35
## - X12 1 40.894 583.38 139.26
## - X19 1 52.496 594.98 149.10
## - X6 1 121.647 664.13 204.08
##
## Step: AIC=103.42
## Y ~ X5 + X6 + X7 + X10 + X12 + X13 + X15 + X17 + X19
##
## Df Sum of Sq RSS AIC
## - X10 1 0.572 543.60 97.736
## - X7 1 0.852 543.88 97.994
## - X17 1 1.242 544.27 98.353
## - X5 1 2.262 545.29 99.288
## - X15 1 2.823 545.85 99.803
## - X13 1 3.246 546.28 100.190
## <none> 543.03 103.425
## + X14 1 0.547 542.48 109.135
## + X2 1 0.273 542.76 109.388
## + X1 1 0.258 542.77 109.402
## + X18 1 0.239 542.79 109.419
## + X20 1 0.146 542.88 109.505
## + X4 1 0.133 542.90 109.517
## + X16 1 0.130 542.90 109.519
## + X3 1 0.121 542.91 109.528
## + X11 1 0.060 542.97 109.584
## + X8 1 0.002 543.03 109.637
## + X9 1 0.001 543.03 109.639
## - X12 1 40.546 583.58 133.215
## - X19 1 52.177 595.21 143.082
## - X6 1 121.686 664.72 198.308
##
## Step: AIC=97.74
## Y ~ X5 + X6 + X7 + X12 + X13 + X15 + X17 + X19
##
## Df Sum of Sq RSS AIC
## - X7 1 0.661 544.26 92.129
## - X17 1 1.398 545.00 92.806
## - X13 1 2.796 546.40 94.086
## - X5 1 2.915 546.52 94.196
## - X15 1 3.236 546.84 94.490
## <none> 543.60 97.736

## + X10 1 0.572 543.03 103.425
## + X16 1 0.423 543.18 103.562
## + X14 1 0.400 543.20 103.583
## + X18 1 0.376 543.23 103.605
## + X4 1 0.337 543.27 103.641
## + X1 1 0.277 543.33 103.696
## + X2 1 0.248 543.35 103.722
## + X20 1 0.086 543.52 103.872
## + X11 1 0.050 543.55 103.905
## + X9 1 0.011 543.59 103.941
## + X8 1 0.007 543.59 103.944
## + X3 1 0.004 543.60 103.947
## - X12 1 40.067 583.67 127.080
## - X19 1 51.937 595.54 137.147
## - X6 1 122.611 666.21 193.218
##
## Step: AIC=92.13
## Y ~ X5 + X6 + X12 + X13 + X15 + X17 + X19
##
## Df Sum of Sq RSS AIC
## - X13 1 2.430 546.69 88.142
## - X17 1 3.443 547.71 89.068
## - X5 1 4.687 548.95 90.202
## - X15 1 6.468 550.73 91.822
## <none> 544.26 92.129
## + X7 1 0.661 543.60 97.736
## + X16 1 0.582 543.68 97.809
## + X4 1 0.487 543.78 97.896
## + X18 1 0.465 543.80 97.916
## + X10 1 0.380 543.88 97.994
## + X1 1 0.305 543.96 98.063
## + X2 1 0.257 544.01 98.108
## + X14 1 0.108 544.15 98.244
## + X9 1 0.077 544.19 98.273
## + X11 1 0.051 544.21 98.297
## + X20 1 0.030 544.23 98.316
## + X8 1 0.013 544.25 98.332
## + X3 1 0.002 544.26 98.342
## - X12 1 43.040 587.30 123.969
## - X19 1 51.299 595.56 130.951
## - X6 1 122.029 666.29 187.062
##
## Step: AIC=88.14
## Y ~ X5 + X6 + X12 + X15 + X17 + X19
##
## Df Sum of Sq RSS AIC
## - X17 1 4.586 551.28 86.104
## - X5 1 4.636 551.33 86.150
## <none> 546.69 88.142
## + X13 1 2.430 544.26 92.129
## - X15 1 12.732 559.43 93.439
## + X16 1 0.588 546.11 93.819
## + X4 1 0.487 546.21 93.911
## + X18 1 0.423 546.27 93.970

## + X7 1 0.296 546.40 94.086
## + X2 1 0.217 546.48 94.159
## + X14 1 0.212 546.48 94.163
## + X3 1 0.200 546.49 94.174
## + X1 1 0.175 546.52 94.196
## + X10 1 0.135 546.56 94.233
## + X11 1 0.054 546.64 94.307
## + X8 1 0.037 546.66 94.323
## + X9 1 0.026 546.67 94.333
## + X20 1 0.001 546.69 94.356
## - X19 1 48.928 595.62 124.786
## - X12 1 48.956 595.65 124.810
## - X6 1 122.123 668.82 182.738
##
## Step: AIC=86.1
## Y ~ X5 + X6 + X12 + X15 + X19
##
## Df Sum of Sq RSS AIC
## - X5 1 2.819 554.10 82.440
## <none> 551.28 86.104
## + X17 1 4.586 546.69 88.142
## - X15 1 9.274 560.55 88.231
## + X13 1 3.573 547.71 89.068
## + X7 1 2.329 548.95 90.202
## + X4 1 2.089 549.19 90.421
## + X16 1 1.538 549.74 90.922
## + X20 1 0.624 550.66 91.753
## + X10 1 0.316 550.96 92.033
## + X2 1 0.287 550.99 92.059
## + X18 1 0.252 551.03 92.091
## + X14 1 0.143 551.14 92.189
## + X1 1 0.131 551.15 92.200
## + X8 1 0.049 551.23 92.275
## + X3 1 0.048 551.23 92.275
## + X11 1 0.040 551.24 92.283
## + X9 1 0.000 551.28 92.319
## - X19 1 55.167 606.45 127.577
## - X12 1 67.853 619.13 137.929
## - X6 1 122.587 673.87 180.285
##
## Step: AIC=82.44
## Y ~ X6 + X12 + X15 + X19
##
## Df Sum of Sq RSS AIC
## <none> 554.10 82.440
## - X15 1 7.925 562.02 83.327
## + X7 1 3.481 550.62 85.504
## + X13 1 3.259 550.84 85.706
## + X5 1 2.819 551.28 86.104
## + X17 1 2.769 551.33 86.150
## + X4 1 2.450 551.65 86.439
## + X16 1 1.501 552.60 87.299
## + X20 1 1.322 552.78 87.460
## + X9 1 0.980 553.12 87.770

## + X2 1 0.154 553.94 88.516
## + X3 1 0.154 553.94 88.516
## + X1 1 0.135 553.96 88.533
## + X10 1 0.072 554.03 88.590
## + X8 1 0.025 554.07 88.632
## + X18 1 0.023 554.08 88.635
## + X11 1 0.022 554.08 88.635
## + X14 1 0.021 554.08 88.636
## - X12 1 66.232 620.33 132.681
## - X19 1 67.582 621.68 133.768
## - X6 1 121.105 675.20 175.061
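# Criteria for the stepwise model. Row 4 of summary(reg_q3) is the best
# 4-predictor subset (X6, X12, X15, X19), which is the stepwise model.
RM_criteria =
  cbind(r2 = summary(reg_q3)$rsq[4],
        adjr2 = summary(reg_q3)$adjr2[4],
        PRESS = press(step_reg),
        Cp = summary(reg_q3)$cp[4],
        bic = BIC(step_reg),
        aic = AIC(step_reg))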
RM_criteria

## r2 adjr2 PRESS Cp bic aic

## [1,] 0.4515408 0.4471088 1.131644 1.421495 1507.593 1482.306
The predictors selected in the final model by the stepwise regression function are X6, X12, X15, and X19.
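The `press()` helper used above is defined elsewhere in the document. As a hedged sketch of what such a helper might compute, the leave-one-out residuals of a linear model can be obtained from the ordinary residuals and the leverages without refitting. The name `press_stat`, the simulated data, and the choice to average over n are assumptions for illustration; averaging is only suggested by the magnitude of the reported value, which is close to RSS/n.

```r
# Hypothetical helper (not the document's press()): mean squared
# leave-one-out prediction error of a fitted lm object.
press_stat <- function(fit) {
  e <- residuals(fit)        # ordinary residuals
  h <- hatvalues(fit)        # leverages (diagonal of the hat matrix)
  mean((e / (1 - h))^2)      # e_i / (1 - h_ii) is the i-th deleted residual
}

# Simulated data, for illustration only
set.seed(1)
n <- 60
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = dat)

# Brute-force check: refit without each observation in turn
loo_err <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x1 + x2, data = dat[-i, ])
  dat$y[i] - predict(fit_i, newdata = dat[i, ])
})

press_shortcut <- press_stat(fit)
press_brute <- mean(loo_err^2)
all.equal(press_shortcut, press_brute)
```

The shortcut avoids n refits, which matters little here but is convenient when the criterion is evaluated for many candidate models.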

## r2 adjr2 PRESS Cp bic aic

## [1,] 0.4654040 0.4430827 1.178398 21.000000 1594.226 1501.505
## [2,] 0.4515408 0.4471088 1.131644 1.421495 1507.593 1482.306
The reduced model is better on every criterion except $R^2$. The full model necessarily has the higher $R^2$, because $R^2$ never decreases when predictors are added to a model.
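The behaviour of $R^2$ under added predictors can be checked with a small simulation. This is a sketch on made-up data, not the homework data: `y` depends only on `x1`, and `z1` to `z5` are pure noise columns introduced here for illustration.

```r
# Simulated data: y depends on x1 only; z1..z5 are pure noise
set.seed(2)
n <- 100
x1 <- rnorm(n)
noise <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("z", 1:5)))
dat <- data.frame(y = 1 + x1 + rnorm(n), x1 = x1, noise)

# Fit nested models, adding one noise predictor at a time
terms_seq <- c("x1", paste0("z", 1:5))
fits <- lapply(seq_along(terms_seq), function(k) {
  lm(reformulate(terms_seq[1:k], response = "y"), data = dat)
})

r2 <- sapply(fits, function(f) summary(f)$r.squared)
adjr2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
cbind(n_predictors = seq_along(fits), r2 = r2, adjr2 = adjr2)
```

The `r2` column cannot decrease down the rows, whereas `adjr2` is penalized for model size and may go either way, which is why the penalized criteria are the relevant ones for comparing the full and reduced models.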

(d)
n = nrow(q3_dat)
p = apply(summary(reg_q3)$which, 1, sum)

exhaustive.result =
cbind(adjr2 = summary(reg_q3)$adjr2,
Cp = summary(reg_q3)$cp,
bic = n * log(summary(reg_q3)$rss) + log(n) * p,
aic = n * log(summary(reg_q3)$rss) + 2 * p)
exhaustive.result

## adjr2 Cp bic aic

## 1 0.2626446 163.3492354 3318.076 3309.647
## 2 0.3716955 66.7068505 3243.263 3230.619
## 3 0.4403315 6.4502889 3190.631 3173.772
## 4 0.4471088 1.4214950 3189.744 3168.671
## 5 0.4519080 -1.8282901 3190.589 3165.301
## 6 0.4522848 -1.1459745 3195.446 3165.944

## 7 0.4536115 -1.3014755 3199.433 3165.716
## 8 0.4531635 0.1123886 3205.040 3167.109
## 9 0.4526238 1.6053424 3210.729 3168.583
## 10 0.4520570 3.1200927 3216.439 3170.079
## 11 0.4514828 4.6393864 3222.154 3171.579
## 12 0.4507159 6.3251359 3228.042 3173.252
## 13 0.4500006 7.9630280 3233.879 3174.875
## 14 0.4492461 9.6325204 3239.749 3176.530
## 15 0.4484149 11.3659820 3245.686 3178.252
## 16 0.4474751 13.1905842 3251.718 3180.069
## 17 0.4464960 15.0458315 3257.781 3181.918
## 18 0.4453724 17.0224535 3263.972 3183.894
## 19 0.4442423 19.0005546 3270.163 3185.871
## 20 0.4430827 21.0000000 3276.377 3187.871
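The `bic` and `aic` columns in the table above use the simplified forms `n * log(RSS) + log(n) * p` and `n * log(RSS) + 2 * p`, so their values are on a different scale from `BIC()` and `AIC()` applied to an `lm` fit. The sketch below, on simulated data with hypothetical variable names, suggests that the two versions differ only by a constant that is the same for every model on the same data, so the ranking of models is unaffected.

```r
# Simulated data, for illustration only
set.seed(3)
n <- 80
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 2 + dat$x1 + 0.5 * dat$x2 + rnorm(n)

fits <- list(
  lm(y ~ x1, data = dat),
  lm(y ~ x1 + x2, data = dat),
  lm(y ~ x1 + x2 + x3, data = dat)
)

rss <- sapply(fits, function(f) sum(residuals(f)^2))
p <- sapply(fits, function(f) length(coef(f)))   # coefficients incl. intercept

aic_simple <- n * log(rss) + 2 * p        # simplified form, as in the table
bic_simple <- n * log(rss) + log(n) * p
aic_r <- sapply(fits, AIC)                # likelihood-based versions from stats
bic_r <- sapply(fits, BIC)

# The offsets are identical across models, so the rankings agree
aic_offset <- aic_r - aic_simple
bic_offset <- bic_r - bic_simple
rbind(aic_offset, bic_offset)
```

Because only differences between models matter for selection, either form can be passed to `which.min()`.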

$R^2_{\text{adj}}$

adjr2.index = which.max(summary(reg_q3)$adjr2)
coef(reg_q3, id = adjr2.index)

## (Intercept) X5 X6Y X12 X13 X15

## -1.3401697 0.3575110 0.9909706 1.0444982 -0.2688508 0.4057510
## X17 X19
## -0.3864674 1.5527461
The best model using $R^2_{\text{adj}}$ is

Y = −1.34 + 0.36X5 + 0.99X6 + 1.04X12 − 0.27X13 + 0.41X15 − 0.39X17 + 1.55X19

Mallows' Cp

Cp.index = which.min(summary(reg_q3)$cp)
coef(reg_q3, id = Cp.index)

## (Intercept) X6Y X7 X12 X13 X19

## -1.3406494 0.9931779 0.4732449 1.1978008 -0.4127354 1.8129446
The best model using Cp is

Y = −1.34 + 0.99X6 + 0.47X7 + 1.20X12 − 0.41X13 + 1.81X19

BIC

bic.index = which.min(n * log(summary(reg_q3)$rss) + log(n) * p)

coef(reg_q3, id = bic.index)

## (Intercept) X6Y X12 X15 X19

## -1.3294318 0.9860560 1.1902994 0.3791212 1.6361045
The best model using BIC is

Y = −1.33 + 0.99X6 + 1.19X12 + 0.38X15 + 1.64X19

AIC

aic.index = which.min(n * log(summary(reg_q3)$rss) + 2 * p)
coef(reg_q3, id = aic.index)

## (Intercept) X6Y X7 X12 X13 X19

## -1.3406494 0.9931779 0.4732449 1.1978008 -0.4127354 1.8129446
The best model using AIC is

Y = −1.34 + 0.99X6 + 0.47X7 + 1.20X12 − 0.41X13 + 1.81X19

The model selected using BIC is the same as the one in (b).

(e)
library(glmnet)
Y = q3_dat$Y
X.1 = q3_dat[, -1]
character_index = unlist(lapply(X.1, is.character), use.names = F)
X.1[, character_index] = ifelse(X.1[, character_index] == "Y", 1, 0)
X = as.matrix(X.1)

Lasso

lasso.fit = cv.glmnet(X, Y, nfolds = 10, alpha = 1)

plot(lasso.fit)

[Figure: 10-fold cross-validated mean-squared error versus Log(λ) for the lasso fit; the top axis gives the number of nonzero coefficients, from 20 down to 1.]
coef(lasso.fit)

## 21 x 1 sparse Matrix of class "dgCMatrix"

## s1
## (Intercept) -1.1586507

## X1 .
## X2 .
## X3 .
## X4 .
## X5 .
## X6 0.6803152
## X7 .
## X8 .
## X9 .
## X10 .
## X11 .
## X12 0.8913236
## X13 .
## X14 .
## X15 0.1241202
## X16 .
## X17 .
## X18 .
## X19 1.4509408
## X20 .
The model selected by 10-fold CV lasso, using the coefficients that coef() reports at its default lambda.1se, is

Y = −1.16 + 0.68X6 + 0.89X12 + 0.12X15 + 1.45X19

Ridge

ridge.fit = cv.glmnet(X, Y, nfolds = 10, alpha = 0)

plot(ridge.fit)

[Figure: 10-fold cross-validated mean-squared error versus Log(λ) for the ridge fit; the top axis shows 20 nonzero coefficients at every value of λ.]

coef(ridge.fit)

## 21 x 1 sparse Matrix of class "dgCMatrix"

## s1
## (Intercept) -1.181790876
## X1 0.006419232
## X2 0.046568073
## X3 -0.076493775
## X4 0.171374660
## X5 0.209494316
## X6 0.653612457
## X7 0.019578076
## X8 0.001661201
## X9 0.183949608
## X10 -0.099354639
## X11 0.016408077
## X12 0.676154837
## X13 -0.200264683
## X14 0.139491605
## X15 0.440441429
## X16 0.000393006
## X17 -0.214931459
## X18 0.145618189
## X19 1.029836235
## X20 -0.287168928
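Because ridge regression shrinks coefficients without setting any of them exactly to zero, it performs no variable selection and all 20 predictors remain in the fitted model.
The 10-fold CV lasso fit retains X6, X12, X15, and X19, so it selects the same predictors as the model in (b), although with shrunken coefficients.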
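Why the lasso produces exact zeros while ridge does not is easiest to see in the special case of an orthonormal design, where both estimators have closed forms in terms of the OLS coefficients. The sketch below assumes the objectives (1/2)·RSS + λ·Σ|β_j| for the lasso and (1/2)·RSS + (λ/2)·Σβ_j² for ridge; the coefficient values and λ are made up for illustration and are unrelated to the homework data.

```r
# OLS coefficients under an assumed orthonormal design (made-up values)
beta_ols <- c(1.50, 0.90, 0.30, -0.20, 0.05)
lambda <- 0.25

# Lasso: soft-thresholding sets small coefficients exactly to zero
beta_lasso <- sign(beta_ols) * pmax(abs(beta_ols) - lambda, 0)

# Ridge: proportional shrinkage, never exactly zero for nonzero input
beta_ridge <- beta_ols / (1 + lambda)

rbind(ols = beta_ols, lasso = beta_lasso, ridge = beta_ridge)
```

With a general, correlated design the closed forms no longer hold, but the qualitative difference (thresholding versus rescaling) is the same one visible in the two `coef()` outputs above.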
min(lasso.fit$cvm)

## [1] 1.145805
min(ridge.fit$cvm)

## [1] 1.170251
I would choose the 10-fold CV lasso fit over the ridge fit: its minimum cross-validated error is lower (1.146 versus 1.170), and it is the more parsimonious model, retaining 4 predictors rather than all 20.
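Two details of this comparison are worth making explicit. First, `cv.glmnet()` assigns folds at random, so results vary between runs unless the folds are fixed. Second, `min(fit$cvm)` is the error at `lambda.min`, whereas `coef()` on a `cv.glmnet` object defaults to `s = "lambda.1se"`. The following is a sketch on simulated data, not the homework data; the helper `cv_at` and all object names are hypothetical.

```r
library(glmnet)

# Simulated data, for illustration only: 3 of 10 predictors matter
set.seed(4)
n <- 200
X <- matrix(rnorm(n * 10), n, 10, dimnames = list(NULL, paste0("X", 1:10)))
Y <- drop(X[, 1:3] %*% c(1.5, 1, 0.5)) + rnorm(n)

# Fix the fold assignment so both fits are reproducible and comparable
foldid <- sample(rep(1:10, length.out = n))
lasso_cv <- cv.glmnet(X, Y, foldid = foldid, alpha = 1)
ridge_cv <- cv.glmnet(X, Y, foldid = foldid, alpha = 0)

# CV error at a given lambda (cvm is aligned with the lambda sequence)
cv_at <- function(fit, lam) unname(fit$cvm[fit$lambda == lam])

cv_table <- rbind(
  lasso = c(min = cv_at(lasso_cv, lasso_cv$lambda.min),
            one_se = cv_at(lasso_cv, lasso_cv$lambda.1se)),
  ridge = c(min = cv_at(ridge_cv, ridge_cv$lambda.min),
            one_se = cv_at(ridge_cv, ridge_cv$lambda.1se))
)
cv_table

coef(lasso_cv, s = "lambda.min")   # coefficients at the CV-error minimizer
coef(lasso_cv, s = "lambda.1se")   # sparser default used by coef()
```

Reporting the error at the same λ that produced the displayed coefficients keeps the comparison between the two fits consistent.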
