C1M6 Peer Reviewed Solver2
1.0.1 Outline:
── Conflicts ────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
1.1 Problem 1: We Need Concrete Evidence!
Ralphie is studying to become a civil engineer. That means she has to know everything about
concrete, including what ingredients go in it and how they affect the concrete’s properties. She’s
currently writing up a project about concrete flow, and has asked you to help her figure out which
ingredients are the most important. Let’s use our new model selection techniques to help Ralphie
out!
Data Source: Yeh, I-Cheng, “Modeling slump flow of concrete using second-order regressions and
artificial neural networks,” Cement and Concrete Composites, Vol.29, No. 6, 474-480, 2007.
[2]: concrete.data = read.csv("Concrete.data")
head(concrete.data)
Sometimes, the best way to start is to just jump in and mess around with the model. So let’s do
that. Create a linear model with flow as the response and all other columns as predictors.
Just by looking at the summary for your model, is there reason to believe that our model could be
simpler?
[3]: # Your Code Here
# Pairwise correlations (uncomment for a corrplot visualization)
#library(corrplot)
#col4=colorRampPalette(c("black", "darkgrey", "grey", "#CFB87C"))
#corrplot(cor(concrete.data[c(1:7,8)]), method = "ellipse", col=col4(100),
#         addCoef.col = "black", tl.col="black")
cor(concrete.data[c(1:7,8)])
# Full model: flow as the response, all other columns as predictors
#lmod=lm(flow~., data=concrete.data)
lmod=lm(flow~cement+slag+ash+water+sp+course.agg+fine.agg, data=concrete.data)
summary(lmod)
extractAIC(lmod)
Residuals:
Min 1Q Median 3Q Max
-30.880 -10.428 1.815 9.601 22.953
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -252.87467 350.06649 -0.722 0.4718
cement 0.05364 0.11236 0.477 0.6342
slag -0.00569 0.15638 -0.036 0.9710
ash 0.06115 0.11402 0.536 0.5930
water 0.73180 0.35282 2.074 0.0408 *
sp 0.29833 0.66263 0.450 0.6536
course.agg 0.07366 0.13510 0.545 0.5869
fine.agg 0.09402 0.14191 0.663 0.5092
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[1]   8.0000 533.5601
Explanation: First we look at the correlations between our response, flow, and the other predictor variables. We notice a reasonable negative correlation between course.agg and fine.agg; similarly, cement is negatively correlated with ash, and course.agg is strongly negatively correlated with water. Next we fit a full model with flow as the response and all of the other variables in the dataset as predictors. When we do that, several of the t-tests have high p-values, meaning they fail to find statistical significance for the corresponding coefficients. So there is reason to believe that our model could be simpler and that we can eliminate some of the predictor variables.
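Those strong inter-predictor correlations also hint at collinearity, which inflates coefficient standard errors and can produce exactly this pattern of insignificant t-tests. A quick check is to compute variance inflation factors; the sketch below assumes the car package, which is not part of the original code.

# Variance inflation factors for the full model: values well above 5-10
# suggest a predictor is close to a linear combination of the others
library(car)
vif(lmod)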
Our model has 7 predictors. That is not too many, so we can use backwards selection to narrow
them down to the most impactful.
Perform backwards selection on your model. You don’t have to automate the backwards selection
process.
[4]: # Your Code Here
# Step 1: remove "slag", which has the highest p-value in the full model
lmod1=update(lmod, .~. -slag)
summary(lmod1)
# Step 2: remove "sp", now the predictor with the highest p-value
lmod2=update(lmod1, .~. -sp)
summary(lmod2)
Call:
lm(formula = flow ~ cement + ash + water + sp + course.agg +
fine.agg, data = concrete.data)
Residuals:
Min 1Q Median 3Q Max
-30.843 -10.451 1.771 9.589 22.939
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -265.45032 55.46193 -4.786 6.16e-06 ***
cement 0.05766 0.02088 2.761 0.006899 **
ash 0.06524 0.01987 3.283 0.001434 **
water 0.74420 0.09117 8.163 1.28e-12 ***
sp 0.31366 0.50874 0.617 0.538997
course.agg 0.07849 0.02447 3.207 0.001820 **
fine.agg 0.09909 0.02644 3.747 0.000305 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Call:
lm(formula = flow ~ cement + ash + water + course.agg + fine.agg,
data = concrete.data)
Residuals:
Min 1Q Median 3Q Max
-31.893 -10.125 1.773 9.559 23.914
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -249.50866 48.90884 -5.102 1.67e-06 ***
cement 0.05366 0.01979 2.712 0.007909 **
ash 0.06101 0.01859 3.281 0.001436 **
water 0.72313 0.08426 8.582 1.53e-13 ***
course.agg 0.07291 0.02266 3.217 0.001760 **
fine.agg 0.09554 0.02573 3.714 0.000341 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
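The same pruning can be automated. As a point of comparison, here is a minimal sketch using base R's step(), which deletes predictors to minimize AIC rather than by p-value, so it will not necessarily stop at the same model as the manual steps above (it assumes the lmod object fitted earlier):

# AIC-driven backward selection; trace=TRUE prints each deletion step
lmod.step = step(lmod, direction = "backward", trace = TRUE)
summary(lmod.step)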
Stop right there! Think about what you just did. You just removed the “worst” features from your
model. But we know that a model becomes less powerful when we remove features, so we should check that it is still about as powerful as the original model. Use a test to check whether the model at the end of backward selection is significantly different from the model with all the features.
Describe why we want to balance explanatory power with simplicity.
[5]: # Your Code Here
# Re-examine the correlation matrix: how do the dropped predictors relate to flow?
cor(concrete.data[c(1:7,8)])
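The comparison the prompt asks for is a nested-model (partial) F-test, which anova() performs when given the reduced and the full fit. A minimal sketch, assuming the lmod (full) and lmod2 (reduced) objects from the cells above:

# H0: the coefficients dropped from the full model (slag and sp) are zero;
# a large p-value means the reduced model is not significantly worse
anova(lmod2, lmod)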
Explanation: As we can see, the predictors that we removed, slag and sp, have reasonable correlation with our response variable, flow. Although the adjusted R² increases slightly from 0.4656 to 0.4745 when they are removed, there is still reason to believe that our reduced model is no better than the "full" model, because we removed two potentially important predictors. More generally, we want to balance explanatory power with simplicity because a simpler model is easier to interpret and less prone to overfitting, while every feature we remove gives up some ability to explain the response.
Ralphie is nervous about her project and wants to make sure our model is correct. She's found a function called regsubsets() in the leaps package which allows us to see which subsets of predictors produce the best combinations. Ralphie wrote up the code for you, and the documentation for the function can be found here. For each of the subsets of features, calculate the AIC, BIC and adjusted R². Plot the results of each criterion, with the score on the y-axis and the number of features on the x-axis.
Do all of the criteria agree on how many features make the best model? Explain why the criteria will or will not always agree on the best model.
Hint: It may help to look at the attributes stored within the regsubsets summary using names(rs).
[6]: library(leaps)
n = nrow(concrete.data)
reg = regsubsets(flow ~ cement+slag+ash+water+sp+course.agg+fine.agg,
                 data=concrete.data, nvmax=6)
rs = summary(reg)
rs$which
# p fitted parameters = intercept plus 1..6 predictors, hence 2:7
AIC = 2*(2:7) + n*log(rs$rss/n)
plot(AIC ~ I(1:6), xlab="number of predictors", ylab="AIC")
BIC = log(n)*(2:7) + n*log(rs$rss/n)
plot(BIC ~ I(1:6), xlab="number of predictors", ylab="BIC")
plot(rs$adjr2 ~ I(1:6), xlab="number of predictors", ylab="adjusted R-squared")
[Plots: AIC, BIC, and adjusted R² against the number of predictors; all three are optimized at 2 predictors.]
[7]: # Fit the best two-predictor model identified by regsubsets
summary(lm(flow ~ slag + water, data=concrete.data))
Call:
lm(formula = flow ~ slag + water, data = concrete.data)
Residuals:
Min 1Q Median 3Q Max
-32.687 -10.746 2.010 9.224 23.927
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -50.26656 12.38669 -4.058 9.83e-05 ***
slag -0.09023 0.02064 -4.372 3.02e-05 ***
water 0.54224 0.06175 8.781 4.62e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
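As a quick programmatic check of what the plots show, the best model size under each criterion can be read off directly (assuming the AIC, BIC, and rs objects from the regsubsets cell above):

which.min(AIC)              # number of predictors minimizing AIC
which.min(BIC)              # number of predictors minimizing BIC
which.max(rs$adjr2)         # number of predictors maximizing adjusted R^2
rs$which[which.min(BIC), ]  # which predictors the winning model uses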
As we can see from the plots above, the two-predictor model has the lowest AIC, the lowest BIC, and the highest adjusted R², so in this case all three criteria agree that 2 features make the best model. Those two features are slag and water, so our model is

$$\widehat{flow} = \hat{\beta}_0 + \hat{\beta}_1 \, slag + \hat{\beta}_2 \, water$$

It will not always be the case that the three criteria (AIC, BIC and adjusted R²) agree, because each targets something different. Adjusted R² is a measure of training error, whereas AIC is an estimate of test error and takes bias into account. AIC estimates a constant plus the relative distance between the unknown true likelihood function of the data and the fitted likelihood function of the model, so a lower AIC means a model is considered closer to the truth. BIC estimates a function of the posterior probability of a model being true under a certain Bayesian setup, so a lower BIC means a model is considered more likely to be the true model. BIC penalizes model complexity more heavily than AIC.
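To make the penalty difference concrete, these are the Gaussian-error forms of the two criteria used in the code above (up to an additive constant), with $p$ counting the fitted parameters:

$$\mathrm{AIC} = n \log\!\left(\frac{RSS}{n}\right) + 2p, \qquad \mathrm{BIC} = n \log\!\left(\frac{RSS}{n}\right) + p \log n$$

Since $\log n > 2$ whenever $n > e^2 \approx 7.4$, BIC charges more per parameter than AIC on any dataset of realistic size, which is why it tends to select smaller models.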