
C1M6_peer_reviewed

June 13, 2021

1 Module 6: Peer Reviewed Assignment

1.0.1 Outline:

The objectives for this assignment:


1. Apply the processes of model selection with real datasets.
2. Understand why and how some problems are simpler to solve with some forms of model
selection, and others are more difficult.
3. Be able to explain the balance between model power and simplicity.
4. Observe the differences between different model selection criteria.
General tips:
1. Read the questions carefully to understand what is being asked.
2. This work will be reviewed by another human, so make sure that you are clear and concise
in your explanations and answers.
[1]: # This cell loads in the necessary packages
library(tidyverse)
library(leaps)
library(ggplot2)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.0     ✔ purrr   0.3.4
✔ tibble  3.0.1     ✔ dplyr   0.8.5
✔ tidyr   1.0.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

1.1 Problem 1: We Need Concrete Evidence!

Ralphie is studying to become a civil engineer. That means she has to know everything about
concrete, including what ingredients go in it and how they affect the concrete’s properties. She’s
currently writing up a project about concrete flow, and has asked you to help her figure out which
ingredients are the most important. Let’s use our new model selection techniques to help Ralphie
out!
Data Source: Yeh, I-Cheng, “Modeling slump flow of concrete using second-order regressions and
artificial neural networks,” Cement and Concrete Composites, Vol.29, No. 6, 474-480, 2007.
[2]: concrete.data = read.csv("Concrete.data")

concrete.data = concrete.data[, c(-1, -9, -11)]

names(concrete.data) = c("cement", "slag", "ash", "water", "sp", "course.agg",
                         "fine.agg", "flow")

head(concrete.data)

A data.frame: 6 × 8

  cement slag  ash   water sp    course.agg fine.agg flow
  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>    <dbl>
1 273    82    105   210   9     904        680      62.0
2 163    149   191   180   12    843        746      20.0
3 162    148   191   179   16    840        743      20.0
4 162    148   190   179   19    838        741      21.5
5 154    112   144   220   10    923        658      64.0
6 147    89    115   202   9     860        829      55.0

1.1.1 1. (a) Initial Inspections

Sometimes, the best way to start is to just jump in and mess around with the model. So let’s do
that. Create a linear model with flow as the response and all other columns as predictors.
Just by looking at the summary for your model, is there reason to believe that our model could be
simpler?
[3]: # Your Code Here
#library(corrplot)
#col4 = colorRampPalette(c("black", "darkgrey", "grey", "#CFB87C"))
#corrplot(cor(concrete.data[c(1:7,8)]), method = "ellipse", col = col4(100),
#         addCoef.col = "black", tl.col = "black")

## Unable to load the corrplot library, so relying on the raw correlation values instead

cor(concrete.data[c(1:7,8)])

#lmod = lm(flow ~ ., data=concrete.data)
lmod = lm(flow ~ cement + slag + ash + water + sp + course.agg + fine.agg, data=concrete.data)
summary(lmod)

extractAIC(lmod)

A matrix: 8 × 8 of type dbl  [columns beyond "sp" truncated in this export]

            cement      slag        ash         water       sp
cement       1.00000000 -0.24355253 -0.48653529  0.22109124 -0.10638679
slag        -0.24355253  1.00000000 -0.32261907 -0.02677464  0.30650431
ash         -0.48653529 -0.32261907  1.00000000 -0.24132061 -0.14350798
water        0.22109124 -0.02677464 -0.24132061  1.00000000 -0.15545589
sp          -0.10638679  0.30650431 -0.14350798 -0.15545589  1.00000000
course.agg  -0.30985683 -0.22379245  0.17261996 -0.60220129 -0.10415943
fine.agg     0.05695887 -0.18352199 -0.28285429  0.11459095  0.05829047
flow         0.18646060 -0.32723069 -0.05542346  0.63202567 -0.17631449
Call:
lm(formula = flow ~ cement + slag + ash + water + sp + course.agg +
fine.agg, data = concrete.data)

Residuals:
Min 1Q Median 3Q Max
-30.880 -10.428 1.815 9.601 22.953

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -252.87467 350.06649 -0.722 0.4718
cement 0.05364 0.11236 0.477 0.6342
slag -0.00569 0.15638 -0.036 0.9710
ash 0.06115 0.11402 0.536 0.5930
water 0.73180 0.35282 2.074 0.0408 *
sp 0.29833 0.66263 0.450 0.6536
course.agg 0.07366 0.13510 0.545 0.5869
fine.agg 0.09402 0.14191 0.663 0.5092
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.84 on 95 degrees of freedom


Multiple R-squared: 0.5022,  Adjusted R-squared: 0.4656
F-statistic: 13.69 on 7 and 95 DF, p-value: 3.915e-12

extractAIC(lmod): edf = 8, AIC = 533.5601

Explanation: First we look at the correlations between our response, "flow", and the other
predictor variables. We notice a moderate negative correlation between "course.agg" and
"fine.agg", a similar negative correlation between "cement" and "ash", and a fairly strong
negative correlation between "course.agg" and "water". Next we fit a full model with "flow" as
the response and all of the other variables in the dataset as predictors. When we do that,
several of the t-tests have high p-values, which means they fail to find statistical significance
for the corresponding coefficients. This gives us reason to believe that our model could be
simpler and that we can eliminate some of the predictor variables.
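
As a complementary check on the redundancy suggested by these pairwise correlations, here is a minimal sketch (not part of the original solution) that computes variance inflation factors by hand for the predictors used above; car::vif(lmod) would report the same quantities if the car package were available.

# Hand-rolled variance inflation factors: regress each predictor on the others
# and use VIF = 1 / (1 - R^2); large values indicate redundant predictors.
predictors = c("cement", "slag", "ash", "water", "sp", "course.agg", "fine.agg")
vifs = sapply(predictors, function(p) {
    fit = lm(reformulate(setdiff(predictors, p), response = p), data = concrete.data)
    1 / (1 - summary(fit)$r.squared)
})
round(vifs, 2)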

1.1.2 1. (b) Backwards Selection

Our model has 7 predictors. That is not too many, so we can use backwards selection to narrow
them down to the most impactful.
Perform backwards selection on your model. You don’t have to automate the backwards selection
process.
[4]: # Your Code Here
# Step 1: remove "slag", which has the highest p-value in the full model
lmod1=update(lmod, .~. -slag)
summary(lmod1)

# Step 2: remove "sp", which has the highest remaining p-value; after this step
# all remaining predictors are significant at the 0.05 level, so we stop
lmod2=update(lmod1, .~. -sp)
summary(lmod2)

Call:
lm(formula = flow ~ cement + ash + water + sp + course.agg +
fine.agg, data = concrete.data)

Residuals:
Min 1Q Median 3Q Max
-30.843 -10.451 1.771 9.589 22.939

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -265.45032 55.46193 -4.786 6.16e-06 ***
cement 0.05766 0.02088 2.761 0.006899 **
ash 0.06524 0.01987 3.283 0.001434 **
water 0.74420 0.09117 8.163 1.28e-12 ***
sp 0.31366 0.50874 0.617 0.538997
course.agg 0.07849 0.02447 3.207 0.001820 **
fine.agg 0.09909 0.02644 3.747 0.000305 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.78 on 96 degrees of freedom


Multiple R-squared: 0.5022,  Adjusted R-squared: 0.4711
F-statistic: 16.14 on 6 and 96 DF, p-value: 9.229e-13

Call:
lm(formula = flow ~ cement + ash + water + course.agg + fine.agg,
    data = concrete.data)

Residuals:
Min 1Q Median 3Q Max
-31.893 -10.125 1.773 9.559 23.914

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -249.50866 48.90884 -5.102 1.67e-06 ***
cement 0.05366 0.01979 2.712 0.007909 **
ash 0.06101 0.01859 3.281 0.001436 **
water 0.72313 0.08426 8.582 1.53e-13 ***
course.agg 0.07291 0.02266 3.217 0.001760 **
fine.agg 0.09554 0.02573 3.714 0.000341 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.74 on 97 degrees of freedom


Multiple R-squared: 0.5003,  Adjusted R-squared: 0.4745
F-statistic: 19.42 on 5 and 97 DF, p-value: 2.36e-13
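
The prompt notes that the backward pass does not have to be automated, but as a sanity check here is a minimal sketch, using the full model lmod fitted above, that lets R's built-in step() perform backward elimination by AIC. Note that AIC-based step() and the manual p-value-based elimination will not necessarily drop the same terms.

# Automated backward elimination by AIC with stats::step()
lmod.back = step(lmod, direction = "backward", trace = 0)
summary(lmod.back)
extractAIC(lmod.back)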

1.1.3 1. (c) Objection!

Stop right there! Think about what you just did. You just removed the "worst" features from your
model. But we know that a model becomes less powerful when we remove features, so we should
check that it's still just as powerful as the original model. Use a test to check whether the model
at the end of backward selection is significantly different from the model with all the features.
Describe why we want to balance explanatory power with simplicity.
[5]: # Your Code Here
cor(concrete.data[c(1:7,8)])

A matrix: 8 × 8 of type dbl  [columns beyond "sp" truncated in this export]

            cement      slag        ash         water       sp
cement       1.00000000 -0.24355253 -0.48653529  0.22109124 -0.10638679
slag        -0.24355253  1.00000000 -0.32261907 -0.02677464  0.30650431
ash         -0.48653529 -0.32261907  1.00000000 -0.24132061 -0.14350798
water        0.22109124 -0.02677464 -0.24132061  1.00000000 -0.15545589
sp          -0.10638679  0.30650431 -0.14350798 -0.15545589  1.00000000
course.agg  -0.30985683 -0.22379245  0.17261996 -0.60220129 -0.10415943
fine.agg     0.05695887 -0.18352199 -0.28285429  0.11459095  0.05829047
flow         0.18646060 -0.32723069 -0.05542346  0.63202567 -0.17631449

Explanation: The predictors that we removed, slag and sp, do show some correlation with our
response variable "flow". Although the adjusted R-squared increases slightly from 0.4656 to
0.4745 when they are dropped, there is still reason to suspect that the reduced model may not be
better than the "full" model, because we removed two potentially important predictors; a
nested-model F-test (sketched below) is the formal way to check that the simplification is not
significantly worse.

We want to balance explanatory power with simplicity because: (1) more complex models are more
likely to overfit the data, so they may not generalize well to new data; (2) simpler models are
easier to interpret, making the relationships between variables clearer; (3) including
unnecessary variables increases the risk of finding spurious relationships due to multiple
testing; and (4) parsimonious models align with Occam's razor, the principle that simpler
solutions are generally preferable to more complex ones.
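
Here is a minimal sketch of the formal check the prompt asks for: a partial F-test comparing the reduced model against the full model with anova(), using the lmod and lmod2 objects fitted above.

# Partial F-test on the nested models: is the reduced model significantly worse?
anova(lmod2, lmod)
# A large p-value means we fail to reject the reduced model, i.e. dropping
# slag and sp did not significantly reduce the model's explanatory power.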

1.1.4 1. (d) Checking our Model

Ralphie is nervous about her project and wants to make sure our model is correct. She’s found a
function called regsubsets() in the leaps package which allows us to see which subsets of arguments
produce the best combinations. Ralphie wrote up the code for you and the documentation for the
function can be found here. For each of the subsets of features, calculate the AIC, BIC and adjusted
R². Plot the results of each criterion, with the score on the y-axis and the number of features on
the x-axis.
Do all of the criteria agree on how many features make the best model? Explain why the criteria
will or will not always agree on the best model.
Hint: It may help to look at the attributes stored within the regsubsets summary using names(rs).
[6]: reg = regsubsets(flow ~ cement+slag+ash+water+sp+course.agg+fine.agg,
                      data=concrete.data, nvmax=6)

rs = summary(reg)
rs$which

# Your Code Here


n=dim(concrete.data)[1]
#names(rs)
#rs$bic
#rs$cp
#rs$outmat
#rs$obj

AIC=2*(1:6) + n*log(rs$rss/n)
plot(AIC ~ I(1:6), xlab="number of predictors", ylab="AIC")

BIC=log(n)*(1:6) + n*log(rs$rss/n)
plot(BIC ~ I(1:6), xlab="number of predictors", ylab="BIC")

plot(1:6, rs$adjr2, xlab="number of predictors", ylab="adjusted r squared")

lmodfinal=lm(flow ~ slag+water, data=concrete.data)


summary(lmodfinal)

A matrix: 6 × 8 of type lgl

  (Intercept) cement slag  ash   water sp    course.agg fine.agg
1 TRUE        FALSE  FALSE FALSE TRUE  FALSE FALSE      FALSE
2 TRUE        FALSE  TRUE  FALSE TRUE  FALSE FALSE      FALSE
3 TRUE        FALSE  TRUE  FALSE TRUE  FALSE FALSE      TRUE
4 TRUE        TRUE   TRUE  FALSE TRUE  FALSE FALSE      TRUE
5 TRUE        FALSE  TRUE  TRUE  TRUE  FALSE TRUE       TRUE
6 TRUE        TRUE   FALSE TRUE  TRUE  TRUE  TRUE       TRUE

[Plots: AIC, BIC and adjusted R² versus the number of predictors]
Call:
lm(formula = flow ~ slag + water, data = concrete.data)

Residuals:
Min 1Q Median 3Q Max
-32.687 -10.746 2.010 9.224 23.927

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -50.26656 12.38669 -4.058 9.83e-05 ***
slag -0.09023 0.02064 -4.372 3.02e-05 ***
water 0.54224 0.06175 8.781 4.62e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.6 on 100 degrees of freedom


Multiple R-squared: 0.4958,  Adjusted R-squared: 0.4857
F-statistic: 49.17 on 2 and 100 DF, p-value: 1.347e-15

As we can see from the above plots, the lowest AIC and BIC both occur at the 2-predictor model,
and the adjusted R-squared is also highest for the 2-predictor model, so in this case all of the
criteria agree that 2 features make the best model. These 2 features are "slag" and "water", so
our model is $\widehat{\mathrm{flow}} = \hat{\beta}_0 + \hat{\beta}_1\,\mathrm{slag} + \hat{\beta}_2\,\mathrm{water}$.

It will not always be the case that all 3 criteria (AIC, BIC and adjusted R-squared) agree. The
reason is that they measure different things. Adjusted R-squared is a measure of training error,
whereas AIC is an estimate of test error and takes bias into account. AIC estimates, up to a
constant, the relative distance between the unknown true likelihood of the data and the fitted
likelihood of the model, so a lower AIC means a model is considered closer to the truth. BIC
estimates a function of the posterior probability of a model being true under a certain Bayesian
setup, so a lower BIC means a model is considered more likely to be the true model. BIC also
penalizes model complexity more heavily than AIC.
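
For reference, a sketch (in LaTeX notation) of the three criteria as they are computed in the cell above, where $n$ is the number of observations, $p$ the number of predictors in a candidate model, $\mathrm{RSS}$ its residual sum of squares and $\mathrm{TSS}$ the total sum of squares; the additive constants dropped from AIC and BIC in this form do not change the ranking of models:

$$\mathrm{AIC} \approx n\,\log\!\left(\frac{\mathrm{RSS}}{n}\right) + 2p, \qquad
\mathrm{BIC} \approx n\,\log\!\left(\frac{\mathrm{RSS}}{n}\right) + p\,\log(n), \qquad
R^2_{\mathrm{adj}} = 1 - \frac{\mathrm{RSS}/(n-p-1)}{\mathrm{TSS}/(n-1)}$$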
[ ]:
