STAT511 HW4 Solution Draft
STAT511 Homework 4
Donghwi Nam
2024-11-11
Question 1
load("rent.RData")
head(rent.data, 5)
(a)
# Converting "location" variable as categorical
rent.data$location = as.factor(rent.data$location)
library(GGally)
library(ggplot2)
[Figure: Histogram of rentsqm variable; x-axis: rentsqm, y-axis: density]
ggplot(rent.data, aes(x = area)) +
  geom_histogram(aes(y = after_stat(density)), color = "black", fill = "white", bins = 30) +
  geom_density(color = "red") +
  labs(title = "Histogram of area variable") +
  theme_minimal()
[Figure: Histogram of area variable; x-axis: area, y-axis: density]
ggplot(rent.data, aes(x = yearc)) +
  geom_histogram(aes(y = after_stat(density)), color = "black", fill = "white", bins = 15) +
  geom_density(color = "red") +
  labs(title = "Histogram of yearc variable") +
  theme_minimal()
[Figure: Histogram of yearc variable; x-axis: yearc, y-axis: density]
ggplot(rent.data) +
geom_bar(aes(x = location)) +
labs(title = "Bar chart of location variable") +
theme_minimal()
[Figure: Bar chart of location variable; x-axis: location (1, 2, 3), y-axis: count]
First, we analyze the distribution of each variable. Rentsqm has a roughly bell-shaped distribution with slight right skewness. Area is clearly right-skewed, which is natural since only a few houses have a large area. A transformation such as the log may be needed to mitigate this skewness. Yearc has a multimodal distribution, that is, one with more than one peak. The spike at yearc = 1920 may represent houses built before 1920, and there are further peaks in the 1960s and 1970s. Since houses are not built at a constant rate over time, this pattern seems natural. The bar chart of location shows that the vast majority of the houses in the dataset are in either an ‘average’ or a ‘good’ location.
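The effect of a log transformation on right skewness can be checked numerically. The following is a minimal sketch on simulated data, not on the rent dataset: the vector `area_sim` and the helper `skewness` are hypothetical names introduced only for illustration.

```r
# Simulated right-skewed "area" values (hypothetical data, not the rent dataset)
set.seed(1)
area_sim <- rlnorm(1000, meanlog = 4, sdlog = 0.4)

# Sample skewness: third central moment divided by the 1.5 power of the second
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(area_sim)       # clearly positive for right-skewed data
skew_log <- skewness(log(area_sim))  # close to zero after the log transform
c(raw = skew_raw, log = skew_log)
```

A skewness near zero after the transformation is what one would hope to see for `log(area)` as well, although the real data need not behave like this lognormal example.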
ggpairs(rent.data,
columns = 1:3,
aes(color = location,
alpha = 0.1))
[Figure: ggpairs scatterplot matrix of rentsqm, area, and yearc, colored by location. Reported correlations (overall; location 1; location 2; location 3):
rentsqm vs. area: -0.341***; -0.384***; -0.334***; -0.238*
rentsqm vs. yearc: 0.389***; 0.420***; 0.397***; 0.370***
area vs. yearc: -0.231***; -0.145***; -0.308***; -0.186]
Now we analyze the pairwise relationships between the variables, using the categorical variable location as an additional layer. For rentsqm and area, the ‘average’ and ‘good’ locations have similar distributions, while the ‘top’ location differs: houses in the ‘top’ location seem to have higher rentsqm and larger area than those in the other locations. Area and rentsqm are negatively correlated, meaning that the bigger the house, the lower the rent per m² tends to be. Interestingly, most of the houses built during the 1970s are in the ‘average’ location, whereas after that construction peak more houses were built in the ‘top’ location. Yearc and rentsqm are positively correlated, indicating that newer houses tend to have higher rent per m². Moreover, area and yearc are negatively correlated, meaning that newer houses tend to be smaller.
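The overall and per-location correlations reported by `ggpairs` can also be computed directly. This is a minimal sketch in base R on simulated data; the data frame `sim` and its coefficients are hypothetical and only mimic the structure of the rent data.

```r
# Simulated stand-in for the rent data (hypothetical values)
set.seed(2)
n <- 300
sim <- data.frame(
  location = factor(sample(1:3, n, replace = TRUE)),
  area     = runif(n, 30, 160)
)
# rentsqm falls with area by construction
sim$rentsqm <- 10 - 0.03 * sim$area + rnorm(n)

# Correlation over all observations
overall_cor <- cor(sim$rentsqm, sim$area)

# Correlation within each level of location
by_location <- sapply(split(sim, sim$location),
                      function(d) cor(d$rentsqm, d$area))
overall_cor
by_location
```

Comparing the group-wise values with the overall one is a quick way to see whether a relationship is driven by one location only.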
(b)
We first try with the basic linear regression model.
par(mfrow = c(2, 2), cex.lab = 0.5, mar = 2 * c(1, 1, 1, 1), cex = 0.3)
m1 = lm(rentsqm ~ area + yearc + location, data = rent.data)
summary(m1)
##
## Call:
## lm(formula = rentsqm ~ area + yearc + location, data = rent.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7673 -1.4799 -0.1198 1.3881 8.8992
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -66.014367 3.508742 -18.814 < 2e-16 ***
## area -0.028803 0.001659 -17.367 < 2e-16 ***
## yearc 0.038201 0.001777 21.492 < 2e-16 ***
## location2 0.735812 0.079936 9.205 < 2e-16 ***
## location3 1.786440 0.245721 7.270 4.53e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.114 on 3077 degrees of freedom
## Multiple R-squared: 0.2475, Adjusted R-squared: 0.2466
## F-statistic: 253 on 4 and 3077 DF, p-value: < 2.2e-16
plot(m1)
[Figure: Diagnostic plots for m1 (Residuals vs Fitted, Q-Q Residuals, Scale-Location, Residuals vs Leverage with Cook's distance). Labeled observations include 1888, 457, 471, 3035, 2109, and 2943.]
From Question (a), we have seen that area variable is right-skewed. We perform log transformation on area
to mitigate its right skewness.
hist(log(rent.data$area))
[Figure: Histogram of log(rent.data$area); x-axis: log(rent.data$area), y-axis: Frequency]
The second model uses the log-transformed area variable.
par(mfrow = c(2, 2), cex.lab = 0.5, mar = 2 * c(1, 1, 1, 1), cex = 0.3)
m2 = lm(rentsqm ~ log(area) + yearc + location, data = rent.data)
summary(m2)
##
## Call:
## lm(formula = rentsqm ~ log(area) + yearc + location, data = rent.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6956 -1.4447 -0.1025 1.4025 8.4686
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -58.72809 3.52113 -16.679 < 2e-16 ***
## log(area) -2.13213 0.10485 -20.335 < 2e-16 ***
## yearc 0.03801 0.00174 21.842 < 2e-16 ***
## location2 0.72335 0.07860 9.203 < 2e-16 ***
## location3 1.76197 0.24140 7.299 3.67e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.08 on 3077 degrees of freedom
## Multiple R-squared: 0.2717, Adjusted R-squared: 0.2707
## F-statistic: 286.9 on 4 and 3077 DF, p-value: < 2.2e-16
plot(m2)
[Figure: Diagnostic plots for m2 (Residuals vs Fitted, Q-Q Residuals, Scale-Location, Residuals vs Leverage with Cook's distance). Labeled observations include 866, 1888, 457, 1845, 2109, and 2943.]
The log transformation of area has increased the $R^2$ of the model (from 0.2475 to 0.2717). We keep the log-transformed area in further analysis.
Next, we try a model with all interaction terms.
par(mfrow = c(2, 2), cex.lab = 0.5, mar = 2 * c(1, 1, 1, 1), cex = 0.3)
m3 = lm(rentsqm ~ log(area)*location*yearc, data = rent.data)
summary(m3)
##
## Call:
## lm(formula = rentsqm ~ log(area) * location * yearc, data = rent.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8536 -1.4158 -0.0939 1.3747 8.1744
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.347e+02 5.696e+01 -4.120 3.89e-05 ***
## log(area) 3.844e+01 1.353e+01 2.841 0.00453 **
## location2 -1.769e+02 8.054e+01 -2.197 0.02810 *
## location3 7.324e+02 2.827e+02 2.591 0.00961 **
## yearc 1.283e-01 2.906e-02 4.414 1.05e-05 ***
## log(area):location2 4.411e+01 1.902e+01 2.320 0.02043 *
## log(area):location3 -1.684e+02 6.488e+01 -2.596 0.00948 **
## log(area):yearc -2.082e-02 6.906e-03 -3.015 0.00259 **
## location2:yearc 9.055e-02 4.118e-02 2.199 0.02795 *
## location3:yearc -3.756e-01 1.442e-01 -2.605 0.00924 **
## log(area):location2:yearc -2.249e-02 9.725e-03 -2.313 0.02082 *
## log(area):location3:yearc 8.659e-02 3.310e-02 2.616 0.00894 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.061 on 3070 degrees of freedom
## Multiple R-squared: 0.2867, Adjusted R-squared: 0.2842
## F-statistic: 112.2 on 11 and 3070 DF, p-value: < 2.2e-16
plot(m3)
[Figure: Diagnostic plots for m3 (Residuals vs Fitted, Q-Q Residuals, Scale-Location, Residuals vs Leverage). Labeled observations include 457, 866, 471, 1845, 2943, and 2109.]
We can see that the adjusted $R^2$ has increased relative to the model without interaction terms (from 0.2707 to 0.2842). Adding interaction terms appears to improve the explanatory power of the model.
Since we considered the full model, we perform a model search to look for a better model.
library(leaps)
reg_all = regsubsets(rentsqm ~ log(area)*location*yearc, nvmax = 20, data = rent.data)
cbind(Cp = summary(reg_all)$cp,
adjr2 = summary(reg_all)$adjr2,
bic = summary(reg_all)$bic)
## Cp adjr2 bic
## [1,] 575.35403 0.1509341 -489.2060
## [2,] 180.22032 0.2429851 -835.8447
## [3,] 110.51955 0.2594137 -896.4342
## [4,] 54.80182 0.2726001 -944.7729
## [5,] 18.07400 0.2813760 -975.1508
## [6,] 17.81677 0.2816677 -969.3711
## [7,] 17.55301 0.2819612 -963.5996
## [8,] 12.85155 0.2832886 -962.2716
## [9,] 12.86424 0.2835183 -956.2296
## [10,] 14.82650 0.2832938 -948.2341
## [11,] 12.00000 0.2841857 -945.0423
In terms of Mallows' $C_p$ and $R^2_{adj}$, the full model (model 11) is the best, while in terms of BIC, model 5 (rentsqm ∼ log(area) + yearc + log(area) ∗ location2 + log(area) ∗ yearc + log(area) ∗ location3 ∗ yearc) is the best.
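The $C_p$ column can be reproduced by hand, which also explains why the full model always has $C_p$ equal to its number of parameters (12.00 for 11 terms plus the intercept in the table above). The following is a minimal sketch on simulated data, assuming the usual definition $C_p = RSS_p / \hat\sigma^2_{full} - n + 2p$ with $p$ counting the intercept; the data frame `d` and the helper `mallows_cp` are hypothetical.

```r
# Simulated data in which only x1 matters (hypothetical example)
set.seed(3)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3, data = d)
sub  <- lm(y ~ x1, data = d)

# Mallows' Cp of `fit`, with the error variance estimated from `full`
mallows_cp <- function(fit, full) {
  n <- nobs(full)
  p <- length(coef(fit))                                 # parameters incl. intercept
  sigma2 <- sum(residuals(full)^2) / df.residual(full)   # full-model variance estimate
  sum(residuals(fit)^2) / sigma2 - n + 2 * p
}

cp_full <- mallows_cp(full, full)   # equals the number of parameters of the full model
cp_sub  <- mallows_cp(sub, full)
c(cp_full = cp_full, cp_sub = cp_sub, bic_full = BIC(full), bic_sub = BIC(sub))
```

Because the full model's $C_p$ is fixed at $p$ by construction, a smaller model is preferred under $C_p$ only when its value falls below that of every other candidate, while BIC penalizes each extra term by $\log(n)$ and so tends to choose smaller models.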
The final model we choose is
where
X1 : log(area), X2 : good location, X3 : top location, X4 : year.
The Residuals vs. Fitted plot suggests that the homoscedasticity assumption is met. The Q-Q plot shows that the normality assumption is generally met, with only mild deviations in the tails. The Residuals vs. Leverage plot shows no serious outliers in either the predictors or the response. Overall, I believe the model assumptions are reasonably well met.
(c)
ggplot(aes(x = rentsqm), data = rent.data) +
geom_density(aes(color = location)) +
labs(title = "Location vs. Rentsqm") +
theme_minimal()
[Figure: Location vs. Rentsqm; density curves of rentsqm by location (1, 2, 3)]
Although there are not many observations from the “top” location, this plot clearly indicates that rentsqm for the “top” location is higher than for the other locations. Rentsqm for houses in “good” locations tends to be more spread out than for houses in the “average” location.
ggplot(aes(x = yearc, y = rentsqm), data = rent.data) +
geom_point() +
geom_smooth(col = "red", method = "loess") +
labs(title = "Year vs. Rentsqm") +
theme_minimal()
[Figure: Year vs. Rentsqm; scatterplot of rentsqm against yearc with a loess smooth]
Although rentsqm is widely spread in every year, the general trend is that rentsqm increases with later construction year.
ggplot(aes(x = area, y = rentsqm), data = rent.data) +
geom_point() +
geom_smooth(col = "red", method = "loess") +
labs(title = "Area vs. Rentsqm") +
theme_minimal()
[Figure: Area vs. Rentsqm; scatterplot of rentsqm against area with a loess smooth]
There is a clear trend: as the area of the house increases, rentsqm decreases. However, the slope flattens considerably as area increases, meaning that rentsqm does not differ much among large houses.
(d)
The full model (FM) we consider is
Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X1 X2 + β5 X1 X3 + ε
where
Y := rentsqm
X1 := yearc
X2 := ind(location = “good”)
X3 := ind(location = “top”)
## Model 1: rentsqm ~ location + yearc
## Model 2: rentsqm ~ location * yearc
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3078 15106
## 2 3076 15104 2 2.1611 0.2201 0.8025
The calculated F statistic is 0.2201 (based on an extra sum of squares of 2.1611), and the p-value is 0.8025. At significance level α = 0.05, we do not reject H0 : β4 = β5 = 0. There is not enough evidence to say that the relationship between the year of construction and the average rent per square meter differs by location.
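The table above is a partial F-test between the additive model and the interaction model. The call that produced it is not shown; a minimal sketch, assuming it was an `anova()` comparison of two nested `lm` fits, is given below on simulated data (the data frame `sim` and its coefficients are hypothetical). The F statistic is the extra sum of squares per added parameter divided by the residual mean square of the larger model; with the reported values, (2.1611 / 2) / (15104 / 3076) ≈ 0.220.

```r
# Simulated stand-in for the rent data, with no true interaction (hypothetical)
set.seed(4)
n <- 300
sim <- data.frame(
  location = factor(sample(1:3, n, replace = TRUE)),
  yearc    = runif(n, 1920, 2000)
)
sim$rentsqm <- -60 + 0.035 * sim$yearc + 0.7 * (sim$location == "2") +
  1.8 * (sim$location == "3") + rnorm(n, sd = 2)

reduced <- lm(rentsqm ~ location + yearc, data = sim)   # H0: no interaction
full    <- lm(rentsqm ~ location * yearc, data = sim)   # adds 2 interaction terms

tab <- anova(reduced, full)

# Same statistic by hand: (extra SS / extra df) / (RSS_full / residual df)
rss_r <- sum(residuals(reduced)^2)
rss_f <- sum(residuals(full)^2)
f_manual <- ((rss_r - rss_f) / 2) / (rss_f / df.residual(full))
c(anova_F = tab$F[2], manual_F = f_manual)
```

The value to report as the test statistic is the `F` column, not the `Sum of Sq` column.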
(e)
# Using package
fit = lm(rentsqm ~ location + yearc, data = rent.data)
new.data = data.frame(location = "3", yearc = 1980)
predict(fit, new.data, interval = "prediction")
The 95% prediction interval is [4.8966, 13.6412].
Question 2
(a)
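Linear regression
If we consider linear regression, we assume $f(x_i) = x_i^\top \beta$. With $X$ the design matrix whose rows are $x_i^\top$,
$$
\hat f_{LM}(x_0) = x_0^\top \hat\beta_{LSE}
= x_0^\top \Big(\sum_{j=1}^n x_j x_j^\top\Big)^{-1} \sum_{i=1}^n x_i Y_i
= \sum_{i=1}^n x_0^\top (X^\top X)^{-1} x_i Y_i ,
$$
so
$$
\ell_i(x_0; (x_i)_{i=1}^n) = x_0^\top (X^\top X)^{-1} x_i ,
$$
given that $X^\top X = \sum_{j=1}^n x_j x_j^\top$ is invertible. Note that the inverse is of the full sum: a single $x_i x_i^\top$ has rank one and is not invertible when $d > 1$.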
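KNN regression
Since $\hat f_{KNN}(x_0) = \frac{1}{k} \sum_{x_i \in N_k(x_0)} Y_i$,
$$
\hat f_{KNN}(x_0) = \frac{1}{k} \sum_{i=1}^n \mathrm{ind}(x_i \in N_k(x_0))\, Y_i ,
$$
which leaves
$$
\ell_i(x_0; (x_i)_{i=1}^n) = \frac{1}{k}\, \mathrm{ind}(x_i \in N_k(x_0)) .
$$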
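Both estimators are linear smoothers, so each prediction is a weighted sum of the responses. The following is a minimal sketch that builds the weights numerically and checks them against the direct predictions. The design matrix, coefficients, and query point are simulated and hypothetical, and the linear-regression weights are taken to be $x_0^\top (X^\top X)^{-1} x_i$.

```r
# Simulated design with an intercept column (hypothetical data)
set.seed(5)
n <- 50
x <- cbind(1, runif(n), runif(n))
y <- drop(x %*% c(1, 2, -1)) + rnorm(n)
x0 <- c(1, 0.5, 0.5)                      # query point, including the intercept entry

# Linear regression weights: l_i(x0) = x0' (X'X)^{-1} x_i
l_lm <- drop(x0 %*% solve(crossprod(x), t(x)))
pred_lm <- sum(l_lm * y)

# Direct prediction x0' beta_hat for comparison
beta_hat <- solve(crossprod(x), crossprod(x, y))
pred_direct <- sum(x0 * beta_hat)

# KNN weights: 1/k for the k nearest neighbours of x0, 0 otherwise
k <- 5
dist0 <- sqrt(rowSums((x[, -1] - matrix(x0[-1], n, 2, byrow = TRUE))^2))
l_knn <- as.numeric(rank(dist0, ties.method = "first") <= k) / k
pred_knn <- sum(l_knn * y)

c(pred_lm = pred_lm, pred_direct = pred_direct, pred_knn = pred_knn)
```

The linear-regression weights depend on all design points through $(X^\top X)^{-1}$, whereas the KNN weights are nonzero only for the $k$ nearest neighbours.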
(b)
EPE at $x_0$ is defined as $\mathrm{EPE}(\hat f(x_0)) = E[(Y_0 - \hat f(x_0))^2]$. By some calculations,
$$
E[(Y_0 - \hat f(x_0))^2] = E\big[(Y_0 - f(x_0) + f(x_0) - E[\hat f(x_0)] + E[\hat f(x_0)] - \hat f(x_0))^2\big]
= \sigma^2 + (E[\hat f(x_0)] - f(x_0))^2 + \mathrm{Var}(\hat f(x_0)) .
$$
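Linear regression
Since
$$
E(\hat f(x_0)) = x_0^\top (X^\top X)^{-1} X^\top E(Y)
$$
and $E(Y) = f = (f(x_1), \dots, f(x_n))^\top$, the second term is equal to $(x_0^\top (X^\top X)^{-1} X^\top f - f(x_0))^2$.
The third term is expressed as
$$
\mathrm{Var}(\hat f(x_0)) = \mathrm{Var}(x_0^\top (X^\top X)^{-1} X^\top Y) = \sigma^2\, x_0^\top (X^\top X)^{-1} x_0 .
$$
By the assumption in the question, we approximate $x_0^\top (X^\top X)^{-1} x_0 \approx d/n$, so that
$$
\mathrm{Var}(\hat f(x_0)) \approx \frac{\sigma^2 d}{n} .
$$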
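KNN regression
For KNN regression,
$$
E(\hat f(x_0)) = E\Big(\frac{1}{k} \sum_{x_i \in N_k(x_0)} Y_i\Big) = \frac{1}{k} \sum_{x_i \in N_k(x_0)} E(Y_i) = \frac{1}{k} \sum_{x_i \in N_k(x_0)} f(x_i) ,
$$
so the second term is $\big(\frac{1}{k} \sum_{x_i \in N_k(x_0)} f(x_i) - f(x_0)\big)^2$.
$\mathrm{Var}(\hat f(x_0))$ is
$$
\mathrm{Var}(\hat f(x_0)) = \mathrm{Var}\Big(\frac{1}{k} \sum_{j=1}^n \mathrm{ind}\{x_j \in N_k(x_0)\}\, Y_j\Big)
= \frac{1}{k^2} \sum_{j=1}^n \mathrm{ind}\{x_j \in N_k(x_0)\}\, \mathrm{Var}(Y_j) = \frac{\sigma^2}{k} ,
$$
using that the $Y_j$ are independent with $\mathrm{Var}(Y_j) = \sigma^2$ and that exactly $k$ of the indicators equal one.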
(c)
As $k$ increases, the variance $\sigma^2 / k$ decreases, while the bias $\frac{1}{k} \sum_{x_i \in N_k(x_0)} f(x_i) - f(x_0)$ tends to increase in magnitude, because averaging over a larger number of $Y_i$ leads to the inclusion of observations whose $x_i$ lie further from $x_0$.
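The trade-off can be made concrete with a small deterministic calculation. This is a minimal sketch under stated assumptions: a fixed one-dimensional grid `x`, a known regression function `f`, a query point `x0`, and `sigma2 = 1` are all hypothetical choices, not part of the question.

```r
# Fixed design and a known regression function (both hypothetical)
x  <- seq(0, 1, length.out = 101)
f  <- function(u) sin(2 * pi * u)
x0 <- 0.25        # f peaks here, so averaging over neighbours pulls the fit down
sigma2 <- 1

# Bias and variance of the KNN estimate at x0 for a given k
knn_bias_var <- function(k) {
  nb <- order(abs(x - x0))[1:k]            # indices of the k nearest x_i
  c(k = k,
    bias = mean(f(x[nb])) - f(x0),         # (1/k) * sum f(x_i) - f(x0)
    variance = sigma2 / k)                 # sigma^2 * sum of squared weights
}

res <- t(sapply(c(1, 5, 21, 51), knn_bias_var))
res
```

In this example the absolute bias grows with `k` while the variance shrinks. For a regression function that is locally linear around `x0` with a symmetric neighbourhood, the bias would stay near zero, so the monotone increase is typical rather than guaranteed.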
Question 3
(a)
load("Q3.Rdata")
fit.q3 = lm(Y ~ ., data = q3_dat)
library(leaps)
press = function(fit)
{
h = hatvalues(fit)
mean((residuals(fit) / (1 - h))^2)
}
reg_q3 = regsubsets(Y ~ ., nvmax = 20, data = q3_dat)
summary(reg_q3)
## 12 ( 1 ) " " " " " " "*" "*" "*" "*" " " " " "*" " " "*" "*" "*" "*" " " "*"
## 13 ( 1 ) " " " " "*" " " "*" "*" "*" " " " " "*" " " "*" "*" "*" "*" " " "*"
## 14 ( 1 ) " " " " "*" "*" "*" "*" "*" " " " " "*" " " "*" "*" "*" "*" " " "*"
## 15 ( 1 ) "*" " " "*" "*" "*" "*" "*" " " " " "*" " " "*" "*" "*" "*" " " "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " " " " "*" " " "*" "*" "*" "*" " " "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " " " " "*" " " "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" " " " " "*" "*" "*" "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*"
## 20 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## X18 X19 X20
## 1 ( 1 ) " " "*" " "
## 2 ( 1 ) " " "*" " "
## 3 ( 1 ) " " "*" " "
## 4 ( 1 ) " " "*" " "
## 5 ( 1 ) " " "*" " "
## 6 ( 1 ) " " "*" " "
## 7 ( 1 ) " " "*" " "
## 8 ( 1 ) " " "*" " "
## 9 ( 1 ) " " "*" " "
## 10 ( 1 ) " " "*" " "
## 11 ( 1 ) " " "*" "*"
## 12 ( 1 ) " " "*" "*"
## 13 ( 1 ) "*" "*" "*"
## 14 ( 1 ) "*" "*" "*"
## 15 ( 1 ) "*" "*" "*"
## 16 ( 1 ) "*" "*" "*"
## 17 ( 1 ) "*" "*" "*"
## 18 ( 1 ) "*" "*" "*"
## 19 ( 1 ) "*" "*" "*"
## 20 ( 1 ) "*" "*" "*"
FM_criteria =
cbind(r2 = summary(reg_q3)$rsq[20],
adjr2 = summary(reg_q3)$adjr2[20],
PRESS = press(fit.q3),
Cp = summary(reg_q3)$cp[20],
bic = BIC(fit.q3),
aic = AIC(fit.q3))
FM_criteria
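The `press` function above relies on the identity that, for a linear model, the leave-one-out residual equals $e_i / (1 - h_{ii})$, so no refitting is needed. The following is a minimal sketch that checks this against brute-force refitting on simulated data (the data frame `d` is hypothetical). Note that the classical PRESS statistic is the sum of the squared leave-one-out residuals, while the function above returns their mean.

```r
# Simulated data (hypothetical)
set.seed(6)
n <- 40
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + d$x1 - 0.5 * d$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = d)

# Shortcut: leave-one-out residual = e_i / (1 - h_ii)
loo_shortcut <- residuals(fit) / (1 - hatvalues(fit))

# Brute force: refit without observation i, then predict observation i
loo_refit <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x1 + x2, data = d[-i, ])
  d$y[i] - predict(fit_i, newdata = d[i, ])
})

press_sum  <- sum(loo_shortcut^2)    # classical PRESS
press_mean <- mean(loo_shortcut^2)   # per-observation version, PRESS / n
c(press_sum = press_sum, press_mean = press_mean)
```

Since the two versions differ only by the factor `n`, they rank models fitted to the same data identically.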
(b)
step_reg = step(fit.q3, direction = "both", k = log(nrow(q3_dat)))
## Start: AIC=169.07
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 + X11 +
## X12 + X13 + X14 + X15 + X16 + X17 + X18 + X19 + X20
##
## Df Sum of Sq RSS AIC
## - X9 1 0.001 540.09 162.86
## - X8 1 0.025 540.12 162.88
## - X11 1 0.028 540.12 162.88
## - X16 1 0.157 540.25 163.00
## - X2 1 0.212 540.31 163.06
## - X1 1 0.250 540.34 163.09
## - X4 1 0.345 540.44 163.18
## - X10 1 0.366 540.46 163.20
## - X14 1 0.573 540.67 163.39
## - X3 1 0.642 540.73 163.45
## - X18 1 0.762 540.85 163.56
## - X17 1 0.818 540.91 163.62
## - X20 1 0.832 540.92 163.63
## - X7 1 0.872 540.97 163.67
## - X5 1 1.849 541.94 164.57
## - X15 1 2.887 542.98 165.52
## - X13 1 3.175 543.27 165.79
## <none> 540.09 169.07
## - X12 1 27.824 567.92 187.98
## - X19 1 32.563 572.66 192.13
## - X6 1 120.643 660.74 263.67
##
## Step: AIC=162.86
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X10 + X11 + X12 +
## X13 + X14 + X15 + X16 + X17 + X18 + X19 + X20
##
## Df Sum of Sq RSS AIC
## - X8 1 0.025 540.12 156.67
## - X11 1 0.027 540.12 156.67
## - X16 1 0.166 540.26 156.80
## - X2 1 0.212 540.31 156.84
## - X1 1 0.255 540.35 156.88
## - X4 1 0.352 540.45 156.97
## - X10 1 0.368 540.46 156.99
## - X14 1 0.616 540.71 157.22
## - X3 1 0.650 540.74 157.25
## - X18 1 0.790 540.88 157.38
## - X17 1 0.818 540.91 157.40
## - X20 1 0.832 540.93 157.41
## - X7 1 0.873 540.97 157.45
## - X5 1 2.351 542.44 158.82
## - X15 1 2.931 543.02 159.35
## - X13 1 3.376 543.47 159.76
## <none> 540.09 162.86
## + X9 1 0.001 540.09 169.07
## - X19 1 32.597 572.69 185.95
## - X12 1 33.079 573.17 186.37
## - X6 1 120.642 660.74 257.45
##
## Step: AIC=156.67
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X10 + X11 + X12 + X13 +
## X14 + X15 + X16 + X17 + X18 + X19 + X20
##
## Df Sum of Sq RSS AIC
## - X11 1 0.026 540.14 150.48
## - X16 1 0.166 540.28 150.61
## - X2 1 0.211 540.33 150.65
## - X1 1 0.250 540.37 150.68
## - X4 1 0.347 540.47 150.77
## - X10 1 0.370 540.49 150.79
## - X14 1 0.610 540.73 151.02
## - X3 1 0.632 540.75 151.04
## - X18 1 0.786 540.90 151.18
## - X17 1 0.825 540.94 151.22
## - X20 1 0.831 540.95 151.22
## - X7 1 0.880 541.00 151.27
## - X5 1 2.343 542.46 152.62
## - X15 1 2.908 543.03 153.14
## - X13 1 3.386 543.50 153.58
## <none> 540.12 156.67
## + X8 1 0.025 540.09 162.86
## + X9 1 0.000 540.12 162.88
## - X19 1 32.665 572.78 179.81
## - X12 1 33.096 573.21 180.19
## - X6 1 120.642 660.76 251.25
##
## Step: AIC=150.48
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X10 + X12 + X13 + X14 +
## X15 + X16 + X17 + X18 + X19 + X20
##
## Df Sum of Sq RSS AIC
## - X16 1 0.163 540.31 144.41
## - X2 1 0.212 540.36 144.46
## - X1 1 0.266 540.41 144.51
## - X4 1 0.362 540.51 144.60
## - X10 1 0.366 540.51 144.60
## - X14 1 0.610 540.75 144.83
## - X3 1 0.641 540.79 144.86
## - X18 1 0.790 540.93 144.99
## - X17 1 0.815 540.96 145.02
## - X20 1 0.829 540.97 145.03
## - X7 1 0.877 541.02 145.07
## - X5 1 2.331 542.48 146.42
## - X15 1 2.897 543.04 146.94
## - X13 1 3.388 543.53 147.39
## <none> 540.14 150.48
## + X11 1 0.026 540.12 156.67
## + X8 1 0.024 540.12 156.67
## + X9 1 0.000 540.14 156.69
## - X19 1 32.692 572.84 173.64
## - X12 1 33.072 573.22 173.98
## - X6 1 120.897 661.04 245.25
##
## Step: AIC=144.41
## Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X10 + X12 + X13 + X14 +
## X15 + X17 + X18 + X19 + X20
##
## Df Sum of Sq RSS AIC
## - X2 1 0.198 540.51 138.38
## - X1 1 0.293 540.60 138.47
## - X4 1 0.345 540.65 138.52
## - X3 1 0.549 540.86 138.71
## - X18 1 0.679 540.99 138.83
## - X10 1 0.724 541.03 138.87
## - X20 1 0.766 541.07 138.91
## - X14 1 0.863 541.17 139.00
## - X17 1 0.983 541.29 139.11
## - X7 1 1.260 541.57 139.36
## - X5 1 2.173 542.48 140.21
## - X15 1 2.774 543.08 140.76
## - X13 1 3.856 544.16 141.75
## <none> 540.31 144.41
## + X16 1 0.163 540.14 150.48
## + X8 1 0.024 540.28 150.61
## + X11 1 0.023 540.28 150.61
## + X9 1 0.013 540.30 150.62
## - X19 1 32.585 572.89 167.48
## - X12 1 33.331 573.64 168.13
## - X6 1 121.319 661.63 239.48
##
## Step: AIC=138.38
## Y ~ X1 + X3 + X4 + X5 + X6 + X7 + X10 + X12 + X13 + X14 + X15 +
## X17 + X18 + X19 + X20
##
## Df Sum of Sq RSS AIC
## - X1 1 0.301 540.81 132.44
## - X4 1 0.366 540.87 132.51
## - X3 1 0.529 541.03 132.66
## - X18 1 0.690 541.20 132.81
## - X10 1 0.693 541.20 132.81
## - X20 1 0.817 541.32 132.92
## - X14 1 0.899 541.40 133.00
## - X17 1 1.033 541.54 133.12
## - X7 1 1.277 541.78 133.35
## - X5 1 2.131 542.64 134.13
## - X15 1 2.782 543.29 134.74
## - X13 1 3.787 544.29 135.66
## <none> 540.51 138.38
## + X2 1 0.198 540.31 144.41
## + X16 1 0.149 540.36 144.46
## + X11 1 0.024 540.48 144.57
## + X8 1 0.023 540.48 144.57
## + X9 1 0.016 540.49 144.58
## - X19 1 32.595 573.10 161.45
## - X12 1 33.721 574.23 162.43
## - X6 1 121.890 662.40 233.85
##
## Step: AIC=132.45
## Y ~ X3 + X4 + X5 + X6 + X7 + X10 + X12 + X13 + X14 + X15 + X17 +
## X18 + X19 + X20
##
## Df Sum of Sq RSS AIC
## - X4 1 0.373 541.18 126.58
## - X3 1 0.515 541.32 126.71
## - X18 1 0.664 541.47 126.84
## - X10 1 0.704 541.51 126.88
## - X20 1 0.729 541.54 126.90
## - X14 1 0.937 541.74 127.10
## - X17 1 0.991 541.80 127.15
## - X7 1 1.315 542.12 127.44
## - X5 1 2.049 542.86 128.12
## - X15 1 2.776 543.58 128.79
## - X13 1 3.662 544.47 129.60
## <none> 540.81 132.44
## + X1 1 0.301 540.51 138.38
## + X2 1 0.205 540.60 138.47
## + X16 1 0.174 540.63 138.50
## + X11 1 0.041 540.76 138.62
## + X8 1 0.018 540.79 138.64
## + X9 1 0.008 540.80 138.65
## - X19 1 32.458 573.26 155.37
## - X12 1 33.830 574.64 156.57
## - X6 1 121.597 662.40 227.64
##
## Step: AIC=126.58
## Y ~ X3 + X5 + X6 + X7 + X10 + X12 + X13 + X14 + X15 + X17 + X18 +
## X19 + X20
##
## Df Sum of Sq RSS AIC
## - X3 1 0.545 541.72 120.86
## - X20 1 0.592 541.77 120.91
## - X18 1 0.617 541.80 120.93
## - X14 1 0.760 541.94 121.06
## - X10 1 1.095 542.27 121.37
## - X17 1 1.180 542.36 121.45
## - X7 1 1.490 542.67 121.73
## - X5 1 2.053 543.23 122.25
## - X15 1 2.404 543.58 122.58
## - X13 1 4.184 545.36 124.21
## <none> 541.18 126.58
## + X4 1 0.373 540.81 132.44
## + X1 1 0.307 540.87 132.51
## + X2 1 0.228 540.95 132.58
## + X16 1 0.155 541.02 132.65
## + X11 1 0.060 541.12 132.74
## + X8 1 0.013 541.17 132.78
## + X9 1 0.001 541.18 132.79
## - X19 1 33.477 574.66 150.37
## - X12 1 36.356 577.54 152.87
## - X6 1 121.516 662.69 221.64
##
## Step: AIC=120.86
## Y ~ X5 + X6 + X7 + X10 + X12 + X13 + X14 + X15 + X17 + X18 +
## X19 + X20
##
## Df Sum of Sq RSS AIC
## - X18 1 0.218 541.94 114.85
## - X20 1 0.683 542.41 115.28
## - X14 1 0.735 542.46 115.33
## - X10 1 0.748 542.47 115.34
## - X7 1 1.466 543.19 116.00
## - X17 1 2.106 543.83 116.59
## - X5 1 2.172 543.90 116.65
## - X15 1 2.485 544.21 116.94
## - X13 1 3.693 545.42 118.05
## <none> 541.72 120.86
## + X3 1 0.545 541.18 126.58
## + X4 1 0.402 541.32 126.71
## + X1 1 0.292 541.43 126.81
## + X2 1 0.207 541.52 126.89
## + X11 1 0.075 541.65 127.01
## + X16 1 0.066 541.66 127.02
## + X9 1 0.005 541.72 127.07
## + X8 1 0.002 541.72 127.08
## - X19 1 33.296 575.02 144.47
## - X12 1 39.434 581.16 149.78
## - X6 1 121.115 662.84 215.54
##
## Step: AIC=114.85
## Y ~ X5 + X6 + X7 + X10 + X12 + X13 + X14 + X15 + X17 + X19 +
## X20
##
## Df Sum of Sq RSS AIC
## - X20 1 0.542 542.48 109.14
## - X10 1 0.933 542.87 109.50
## - X14 1 0.943 542.88 109.50
## - X7 1 1.670 543.61 110.17
## - X17 1 1.963 543.90 110.44
## - X5 1 2.043 543.98 110.52
## - X15 1 2.267 544.21 110.72
## - X13 1 3.824 545.77 112.15
## <none> 541.94 114.85
## + X4 1 0.354 541.59 120.74
## + X1 1 0.280 541.66 120.81
## + X2 1 0.221 541.72 120.86
## + X18 1 0.218 541.72 120.86
## + X3 1 0.146 541.80 120.93
## + X11 1 0.073 541.87 121.00
## + X16 1 0.038 541.90 121.03
## + X8 1 0.004 541.94 121.06
## + X9 1 0.003 541.94 121.06
## - X12 1 39.331 581.27 143.67
## - X19 1 48.697 590.64 151.66
## - X6 1 121.500 663.44 209.78
##
## Step: AIC=109.14
## Y ~ X5 + X6 + X7 + X10 + X12 + X13 + X14 + X15 + X17 + X19
##
## Df Sum of Sq RSS AIC
## - X14 1 0.547 543.03 103.42
## - X10 1 0.719 543.20 103.58
## - X7 1 1.259 543.74 104.08
## - X17 1 1.521 544.00 104.32
## - X5 1 1.762 544.25 104.54
## - X15 1 2.798 545.28 105.49
## - X13 1 3.376 545.86 106.02
## <none> 542.48 109.14
## + X20 1 0.542 541.94 114.85
## + X3 1 0.266 542.22 115.11
## + X2 1 0.252 542.23 115.12
## + X4 1 0.240 542.24 115.13
## + X1 1 0.205 542.28 115.16
## + X18 1 0.077 542.41 115.28
## + X11 1 0.065 542.42 115.29
## + X16 1 0.017 542.47 115.33
## + X8 1 0.003 542.48 115.35
## + X9 1 0.001 542.48 115.35
## - X12 1 40.894 583.38 139.26
## - X19 1 52.496 594.98 149.10
## - X6 1 121.647 664.13 204.08
##
## Step: AIC=103.42
## Y ~ X5 + X6 + X7 + X10 + X12 + X13 + X15 + X17 + X19
##
## Df Sum of Sq RSS AIC
## - X10 1 0.572 543.60 97.736
## - X7 1 0.852 543.88 97.994
## - X17 1 1.242 544.27 98.353
## - X5 1 2.262 545.29 99.288
## - X15 1 2.823 545.85 99.803
## - X13 1 3.246 546.28 100.190
## <none> 543.03 103.425
## + X14 1 0.547 542.48 109.135
## + X2 1 0.273 542.76 109.388
## + X1 1 0.258 542.77 109.402
## + X18 1 0.239 542.79 109.419
## + X20 1 0.146 542.88 109.505
## + X4 1 0.133 542.90 109.517
## + X16 1 0.130 542.90 109.519
## + X3 1 0.121 542.91 109.528
## + X11 1 0.060 542.97 109.584
## + X8 1 0.002 543.03 109.637
## + X9 1 0.001 543.03 109.639
## - X12 1 40.546 583.58 133.215
## - X19 1 52.177 595.21 143.082
## - X6 1 121.686 664.72 198.308
##
## Step: AIC=97.74
## Y ~ X5 + X6 + X7 + X12 + X13 + X15 + X17 + X19
##
## Df Sum of Sq RSS AIC
## - X7 1 0.661 544.26 92.129
## - X17 1 1.398 545.00 92.806
## - X13 1 2.796 546.40 94.086
## - X5 1 2.915 546.52 94.196
## - X15 1 3.236 546.84 94.490
## <none> 543.60 97.736
## + X10 1 0.572 543.03 103.425
## + X16 1 0.423 543.18 103.562
## + X14 1 0.400 543.20 103.583
## + X18 1 0.376 543.23 103.605
## + X4 1 0.337 543.27 103.641
## + X1 1 0.277 543.33 103.696
## + X2 1 0.248 543.35 103.722
## + X20 1 0.086 543.52 103.872
## + X11 1 0.050 543.55 103.905
## + X9 1 0.011 543.59 103.941
## + X8 1 0.007 543.59 103.944
## + X3 1 0.004 543.60 103.947
## - X12 1 40.067 583.67 127.080
## - X19 1 51.937 595.54 137.147
## - X6 1 122.611 666.21 193.218
##
## Step: AIC=92.13
## Y ~ X5 + X6 + X12 + X13 + X15 + X17 + X19
##
## Df Sum of Sq RSS AIC
## - X13 1 2.430 546.69 88.142
## - X17 1 3.443 547.71 89.068
## - X5 1 4.687 548.95 90.202
## - X15 1 6.468 550.73 91.822
## <none> 544.26 92.129
## + X7 1 0.661 543.60 97.736
## + X16 1 0.582 543.68 97.809
## + X4 1 0.487 543.78 97.896
## + X18 1 0.465 543.80 97.916
## + X10 1 0.380 543.88 97.994
## + X1 1 0.305 543.96 98.063
## + X2 1 0.257 544.01 98.108
## + X14 1 0.108 544.15 98.244
## + X9 1 0.077 544.19 98.273
## + X11 1 0.051 544.21 98.297
## + X20 1 0.030 544.23 98.316
## + X8 1 0.013 544.25 98.332
## + X3 1 0.002 544.26 98.342
## - X12 1 43.040 587.30 123.969
## - X19 1 51.299 595.56 130.951
## - X6 1 122.029 666.29 187.062
##
## Step: AIC=88.14
## Y ~ X5 + X6 + X12 + X15 + X17 + X19
##
## Df Sum of Sq RSS AIC
## - X17 1 4.586 551.28 86.104
## - X5 1 4.636 551.33 86.150
## <none> 546.69 88.142
## + X13 1 2.430 544.26 92.129
## - X15 1 12.732 559.43 93.439
## + X16 1 0.588 546.11 93.819
## + X4 1 0.487 546.21 93.911
## + X18 1 0.423 546.27 93.970
## + X7 1 0.296 546.40 94.086
## + X2 1 0.217 546.48 94.159
## + X14 1 0.212 546.48 94.163
## + X3 1 0.200 546.49 94.174
## + X1 1 0.175 546.52 94.196
## + X10 1 0.135 546.56 94.233
## + X11 1 0.054 546.64 94.307
## + X8 1 0.037 546.66 94.323
## + X9 1 0.026 546.67 94.333
## + X20 1 0.001 546.69 94.356
## - X19 1 48.928 595.62 124.786
## - X12 1 48.956 595.65 124.810
## - X6 1 122.123 668.82 182.738
##
## Step: AIC=86.1
## Y ~ X5 + X6 + X12 + X15 + X19
##
## Df Sum of Sq RSS AIC
## - X5 1 2.819 554.10 82.440
## <none> 551.28 86.104
## + X17 1 4.586 546.69 88.142
## - X15 1 9.274 560.55 88.231
## + X13 1 3.573 547.71 89.068
## + X7 1 2.329 548.95 90.202
## + X4 1 2.089 549.19 90.421
## + X16 1 1.538 549.74 90.922
## + X20 1 0.624 550.66 91.753
## + X10 1 0.316 550.96 92.033
## + X2 1 0.287 550.99 92.059
## + X18 1 0.252 551.03 92.091
## + X14 1 0.143 551.14 92.189
## + X1 1 0.131 551.15 92.200
## + X8 1 0.049 551.23 92.275
## + X3 1 0.048 551.23 92.275
## + X11 1 0.040 551.24 92.283
## + X9 1 0.000 551.28 92.319
## - X19 1 55.167 606.45 127.577
## - X12 1 67.853 619.13 137.929
## - X6 1 122.587 673.87 180.285
##
## Step: AIC=82.44
## Y ~ X6 + X12 + X15 + X19
##
## Df Sum of Sq RSS AIC
## <none> 554.10 82.440
## - X15 1 7.925 562.02 83.327
## + X7 1 3.481 550.62 85.504
## + X13 1 3.259 550.84 85.706
## + X5 1 2.819 551.28 86.104
## + X17 1 2.769 551.33 86.150
## + X4 1 2.450 551.65 86.439
## + X16 1 1.501 552.60 87.299
## + X20 1 1.322 552.78 87.460
## + X9 1 0.980 553.12 87.770
## + X2 1 0.154 553.94 88.516
## + X3 1 0.154 553.94 88.516
## + X1 1 0.135 553.96 88.533
## + X10 1 0.072 554.03 88.590
## + X8 1 0.025 554.07 88.632
## + X18 1 0.023 554.08 88.635
## + X11 1 0.022 554.08 88.635
## + X14 1 0.021 554.08 88.636
## - X12 1 66.232 620.33 132.681
## - X19 1 67.582 621.68 133.768
## - X6 1 121.105 675.20 175.061
RM_criteria =
cbind(r2 = summary(reg_q3)$rsq[4],
adjr2 = summary(reg_q3)$adjr2[4],
PRESS = press(step_reg),
Cp = summary(reg_q3)$cp[4],
bic = BIC(step_reg),
aic = AIC(step_reg))
RM_criteria
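Because `step()` was called with `k = log(nrow(q3_dat))`, the column labeled `AIC` in its trace is a BIC-type score, not an AIC. The following is a minimal sketch on simulated data (the data frame `d` is hypothetical) showing that this score and `BIC()` differ only by a constant, so they rank models identically.

```r
# Simulated data in which only x1 matters (hypothetical)
set.seed(7)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 2 * d$x1 + rnorm(n)
fit_big   <- lm(y ~ x1 + x2 + x3, data = d)
fit_small <- lm(y ~ x1, data = d)

# extractAIC() is the score step() reports; k = log(n) gives the BIC penalty
s_big   <- extractAIC(fit_big,   k = log(n))[2]
s_small <- extractAIC(fit_small, k = log(n))[2]

# BIC() uses the full Gaussian log-likelihood, so its values differ from the
# step() scores by a constant that is the same for every model on these data
diff_step <- s_big - s_small
diff_bic  <- BIC(fit_big) - BIC(fit_small)
c(diff_step = diff_step, diff_bic = diff_bic)
```

This is why the values in the trace do not match `BIC(step_reg)` numerically even though both select by the same criterion.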
(c)
rbind(FM_criteria, RM_criteria)
(d)
n = nrow(q3_dat)
p = apply(summary(reg_q3)$which, 1, sum)
exhaustive.result =
cbind(adjr2 = summary(reg_q3)$adjr2,
Cp = summary(reg_q3)$cp,
bic = n * log(summary(reg_q3)$rss) + log(n) * p,
aic = n * log(summary(reg_q3)$rss) + 2 * p)
exhaustive.result
## 7 0.4536115 -1.3014755 3199.433 3165.716
## 8 0.4531635 0.1123886 3205.040 3167.109
## 9 0.4526238 1.6053424 3210.729 3168.583
## 10 0.4520570 3.1200927 3216.439 3170.079
## 11 0.4514828 4.6393864 3222.154 3171.579
## 12 0.4507159 6.3251359 3228.042 3173.252
## 13 0.4500006 7.9630280 3233.879 3174.875
## 14 0.4492461 9.6325204 3239.749 3176.530
## 15 0.4484149 11.3659820 3245.686 3178.252
## 16 0.4474751 13.1905842 3251.718 3180.069
## 17 0.4464960 15.0458315 3257.781 3181.918
## 18 0.4453724 17.0224535 3263.972 3183.894
## 19 0.4442423 19.0005546 3270.163 3185.871
## 20 0.4430827 21.0000000 3276.377 3187.871
$R^2_{adj}$
adjr2.index = which.max(summary(reg_q3)$adjr2)
coef(reg_q3, id = adjr2.index)
Mallows' $C_p$
Cp.index = which.min(summary(reg_q3)$cp)
coef(reg_q3, id = Cp.index)
BIC
bic.index = which.min(n * log(summary(reg_q3)$rss) + log(n) * p)
coef(reg_q3, id = bic.index)
AIC
aic.index = which.min(n * log(summary(reg_q3)$rss) + 2 * p)
coef(reg_q3, id = aic.index)
The model selected using BIC is the same as the one in (b).
(e)
library(glmnet)
Y = q3_dat$Y
X.1 = q3_dat[, -1]
character_index = unlist(lapply(X.1, is.character), use.names = F)
X.1[, character_index] = ifelse(X.1[, character_index] == "Y", 1, 0)
X = as.matrix(X.1)
Lasso
[Figure: Cross-validation curve for the lasso; x-axis: Log(λ), y-axis: Mean-Squared Error; the top axis shows the number of nonzero coefficients, from 20 down to 1]
coef(lasso.fit)
## X1 .
## X2 .
## X3 .
## X4 .
## X5 .
## X6 0.6803152
## X7 .
## X8 .
## X9 .
## X10 .
## X11 .
## X12 0.8913236
## X13 .
## X14 .
## X15 0.1241202
## X16 .
## X17 .
## X18 .
## X19 1.4509408
## X20 .
The model selected by 10-fold CV Lasso retains X6, X12, X15, and X19; all other coefficients shown above are shrunk to zero.
Ridge
[Figure: Cross-validation curve for ridge regression; x-axis: Log(λ), y-axis: Mean-Squared Error; all 20 coefficients remain nonzero along the path]
coef(ridge.fit)
## [1] 1.145805
min(ridge.fit$cvm)
## [1] 1.170251
I would choose the model fit by 10-fold CV Lasso over Ridge as the best model, since it has the lower 10-fold CV error. Moreover, it is a more parsimonious model than the ridge regression fit.
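The calls that created `lasso.fit` and `ridge.fit` are not shown. The following is a minimal sketch, assuming they were 10-fold `cv.glmnet` fits with `alpha = 1` (lasso) and `alpha = 0` (ridge); it uses simulated data, and the names `X_sim`, `Y_sim`, `lasso_cv`, and `ridge_cv` are hypothetical. The block only runs if the glmnet package is installed.

```r
if (requireNamespace("glmnet", quietly = TRUE)) {
  # Simulated data with three active predictors (hypothetical)
  set.seed(8)
  n <- 200; p <- 20
  X_sim <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("X", 1:p)))
  Y_sim <- drop(X_sim[, c(6, 12, 19)] %*% c(0.7, 0.9, 1.5)) + rnorm(n)

  # 10-fold cross-validation over the lambda path for each penalty
  lasso_cv <- glmnet::cv.glmnet(X_sim, Y_sim, alpha = 1, nfolds = 10)
  ridge_cv <- glmnet::cv.glmnet(X_sim, Y_sim, alpha = 0, nfolds = 10)

  # Minimum cross-validated mean squared error for each penalty
  cv_err <- c(lasso = min(lasso_cv$cvm), ridge = min(ridge_cv$cvm))
  print(cv_err)

  # Lasso coefficients at the lambda minimising the CV error; "." marks exact zeros
  print(coef(lasso_cv, s = "lambda.min"))
}
```

Note that `coef()` on a `cv.glmnet` object defaults to `s = "lambda.1se"`, so the coefficients it prints correspond to a more heavily penalized model than the one attaining the minimum CV error unless `s = "lambda.min"` is requested.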