1 Module 3: Peer Reviewed Assignment
1.0.1 Outline:
The dataset below measures the chewiness (mJ) of different berries along with their sugar
equivalent and salt (NaCl) concentration. Let's use these data to create a model to finally
understand chewiness.
Here are the variables:
1. nacl: salt concentration (NaCl)
2. sugar: sugar equivalent
3. chewiness: chewiness (mJ)
Dataset Source: I. Zouid, R. Siret, F. Jourjion, E. Mehinagic, L. Rolle (2013). “Impact of Grapes
Heterogeneity According to Sugar Level on Both Physical and Mechanical Berries Properties and
their Anthocyanins Extractability at Harvest,” Journal of Texture Studies, Vol. 44, pp. 95-103.
1. (a) Simple linear regression (SLR) parameters In the code below, we load in the data
and fit an SLR model to it, using chewiness as the response and sugar as the predictor. The
summary of the model is printed. Let α = 0.05.
Look at the results and answer the following questions:
* What is the hypothesis test related to the p-value 2.95e-09? Clearly state the null and
alternative hypotheses and the decision made based on the p-value.
* Does this mean the coefficient is statistically significant?
* What does it mean for a coefficient to be statistically significant?
[5]: # Load the data
chew.data = read.csv("berry_sugar_chewy.csv")
# Fit an SLR model with chewiness as the response and sugar as the predictor
chew.lmod = lm(chewiness ~ sugar, data=chew.data)
summary(chew.lmod)
Call:
lm(formula = chewiness ~ sugar, data = chew.data)
Residuals:
Min 1Q Median 3Q Max
-2.4557 -0.5604 0.1045 0.5249 1.9559
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.662878 0.756610 10.128 < 2e-16 ***
sugar -0.022797 0.003453 -6.603 2.95e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
1). The hypothesis test related to the p-value of 2.95e-09 is assessing the probability of observing
a test statistic at least as unusual as the one we obtained if the null hypothesis were true. In other
words, this hypothesis test evaluates the likelihood of seeing a t-value as extreme as, or more
extreme than, the one we got (i.e., -6.603). The null hypothesis is H0 : β1 = 0, meaning that sugar
has no linear effect on chewiness. The alternative hypothesis is H1 : β1 ≠ 0, meaning that sugar
does have a linear effect on chewiness. With α = 0.05 and a p-value of 2.95e-09, we reject the null
hypothesis in favor of the alternative.
2). With a p-value of 2.95e-09 and α = 0.05, we reject the null hypothesis in favor of the
alternative. This means that sugar has a statistically significant effect on chewiness. In other
words, the coefficient is statistically significant.
3). Statistical significance tells us that the observed relationship in the data is unlikely to be
explained by chance alone. When a coefficient is statistically significant, we have evidence that its
predictor has a nonzero effect on the response variable being modeled.
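For illustration, the reported p-value can be recovered by hand from the t statistic and the model's residual degrees of freedom; a minimal sketch using the chew.lmod object fitted above:

# Two-sided p-value for the sugar slope, computed from its t statistic
t.stat = coef(summary(chew.lmod))["sugar", "t value"]
2 * pt(-abs(t.stat), df = df.residual(chew.lmod))  # should match 2.95e-09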
1. (b) MLR parameters Now let’s see if the second predictor/feature nacl is worth adding to
the model. In the code below, we create a second linear model fitting chewiness as the response
with sugar and nacl as predictors.
Look at the results and answer the following questions:
* Which, if any, of the slope parameters are statistically significant?
* Did the statistical significance of the parameter for sugar stay the same, when compared to
1 (a)? If the statistical significance changed, explain why it changed. If it didn't change, explain
why it didn't change.
[6]: chew.lmod.2 = lm(chewiness ~ ., data=chew.data)
summary(chew.lmod.2)
Call:
lm(formula = chewiness ~ ., data = chew.data)
Residuals:
Min 1Q Median 3Q Max
-2.3820 -0.6333 0.1234 0.5231 1.9731
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.1107 13.6459 -0.521 0.604
nacl 0.6555 0.6045 1.084 0.281
sugar -0.4223 0.3685 -1.146 0.255
1). In this model, none of the slope parameters are statistically significant. We know this because
the p-values for both coefficients are greater than our α of 0.05. This means that in this model,
neither sugar nor salt concentration has a statistically significant effect on chewiness.
2). Interestingly, the statistical significance of the sugar parameter has changed with the addition
of salt concentration as a predictor to the model. There are a number of reasons that this may
have occurred:
• Loss of degrees of freedom – When trying to estimate more parameters in a model, you
sacrifice precision in the estimates, yielding lower t-statistics and hence higher p-values.
• Correlation of regressors – We may have a situation where these two regressors are related
to one another, measuring something similar. Individually, these variables may be significant
predictors of chewiness, but together the variables essentially compete to explain the outcome
variable. Especially in smaller samples, neither effect may be estimated precisely enough, once
we control for the other variable, to reach significance.
• Misspecified models – The underlying theory for t-statistics/p-values requires that you
estimate a correctly specified model. If you only regress on one predictor, chances are quite
high that that univariate model suffers from omitted variable bias.
My best guess for this instance is that sugar and salt concentration are correlated, which we can
check directly (see the sketch below).
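A minimal sketch for checking this; cor() is base R, while vif() assumes the car package is installed:

# Pairwise correlation between the two predictors (base R)
cor(chew.data$nacl, chew.data$sugar)
# Variance inflation factors for the MLR model; values far above ~5-10
# indicate problematic collinearity (requires the car package)
car::vif(chew.lmod.2)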
1. (c) Model Selection Determine which of the two models we should use. Explain how you
arrived at your conclusion and write out the actual equation for your selected model.
[7]: # Your Code Here
chew.lmod3 = lm(chewiness~nacl, data=chew.data)
summary(chew.lmod3)
print(cor.test(chew.data$nacl, chew.data$sugar, method = "pearson"))
Call:
lm(formula = chewiness ~ nacl, data = chew.data)
Residuals:
Min 1Q Median 3Q Max
-2.4584 -0.5637 0.1009 0.5231 1.9679
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.496443 0.884040 9.611 2.27e-15 ***
nacl -0.037343 0.005669 -6.587 3.17e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The best model to use in this scenario is the first, chew.lmod. I came to this conclusion by digging
deeper into the problem. I suspected that the second model yielded two non-significant predictors
because of a hidden relationship between the sugar and salt concentration predictors. I confirmed
this suspicion with a Pearson correlation test, and you can see above that these variables have an
almost perfect correlation (r = 0.99)! Additionally, salt by itself as a predictor of chewiness is
significant. Based on this information, we know to use only one of these highly correlated variables
as a predictor of chewiness. The actual equation for our selected model, using the coefficients from
the 1 (a) summary, is:
predicted chewiness = 7.6629 − 0.0228 · sugar
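For reference, the unrounded coefficients behind this equation can be pulled straight from the selected model object:

# Fitted intercept and slope for the selected model
coef(chew.lmod)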
1. (d) Parameter Confidence Intervals Compute 95% confidence intervals for each parameter
in your selected model. Then, in words, state what these confidence intervals mean.
[8]: # Your Code Here
confint(chew.lmod, level=0.95)
            2.5 %       97.5 %
(Intercept)  6.15927388  9.16648152
sugar       -0.02965862 -0.01593536
We can interpret the confidence interval above as follows: in 95% of all samples that could be
drawn, the confidence interval will cover the true value of β1. A 95% confidence interval is a range
of values produced by a procedure that captures the true population parameter, in this case the
slope in the sugar model, 95% of the time. In other words, if we were to take 100 different samples
and compute a 95% confidence interval for each sample, then approximately 95 of the 100
confidence intervals would contain the true slope value, β1. This interval does not contain the
value 0, which also leads us to reject the null hypothesis H0 : β1 = 0 in favor of the alternative,
H1 : β1 ≠ 0.
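As a sanity check, confint() can be reproduced by hand from the estimate, its standard error, and a t critical value; a minimal sketch using the chew.lmod object:

# Reproduce the sugar row of confint(): estimate ± t_crit * SE
est = coef(summary(chew.lmod))["sugar", "Estimate"]
se = coef(summary(chew.lmod))["sugar", "Std. Error"]
t.crit = qt(0.975, df = df.residual(chew.lmod))
c(est - t.crit*se, est + t.crit*se)  # matches the confint() row above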
In this exercise we'll look at the variability of slopes of simple linear regression models fitted to
realizations of simulated data.
Write a function, called sim_data(), that returns a simulated sample of size n = 20 from the model
Y = 1 + 2.5X + ϵ, where the errors ϵ are iid N(0, 1). We will then use this generative function to
understand how fitted slopes can vary, even for the same underlying population.
[9]: sim_data <- function(n=20, var=1, beta.0=1, beta.1=2.5){
    # BEGIN SOLUTION HERE
    # Evenly spaced predictor values on [-1, 1]
    x = seq(-1, 1, length.out = n)
    # Gaussian noise with the requested variance
    e = rnorm(n, 0, sqrt(var))
    # Response generated from the supplied intercept and slope
    y = beta.0 + beta.1*x + e
    # END SOLUTION HERE
    data = data.frame(x=x, y=y)
    return(data)
}
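A quick usage sketch (the seed value is arbitrary, chosen only so the draws are reproducible):

set.seed(42)          # arbitrary seed for reproducible draws
head(sim_data(n = 5)) # peek at a small simulated sample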
2. (a) Fit a slope Execute the following code to generate 20 data points, fit a simple linear
regression model and plot the results.
Just based on this plot, how well does our linear model fit the data?
[10]: library(ggplot2)  # for ggplot(); presumably loaded earlier in the full notebook
data = sim_data()
lmod = lm(y~x, data=data)
ggplot(aes(x=x, y=y), data=data) +
    geom_point() +
    geom_smooth(method="lm", formula=y~x, se=FALSE, color="#CFB87C")
Based solely on this plot, we can tell our linear model fits the data well.
2. (b) Do the slopes change? Now we want to see how the slope of our line varies with different
random samples of data. Call our data generation function 50 times to gather 50 independent
samples. Then we can fit an SLR model to each of those samples and plot the resulting slope. The
function below performs this for us.
Experiment with different variances and report on what effect that has to the spread of the slopes.
[11]: gen_slopes <- function(num.slopes=50, var, num.samples=20){
g = ggplot()
# Repeat the sample for the number of slopes
for(ii in 1:num.slopes){
# Generate a random sampling of data
data = sim_data(n=num.samples, var=var)
# Add the slope of the best fit linear model to the plot
g = g + stat_smooth(aes(x=x, y=y), data=data, method="lm", geom="line",
se=FALSE, alpha=0.4, color="#CFB87C", size=1)
}
return(g)
}
[12]: gen_slopes(var=1)
`geom_smooth()` using formula 'y ~ x'
[Plot: 50 fitted regression lines for samples with var = 1]
[13]: gen_slopes(var=5)
`geom_smooth()` using formula 'y ~ x'
[Plot: 50 fitted regression lines for samples with var = 5]
[14]: gen_slopes(var=100)
`geom_smooth()` using formula 'y ~ x'
[Plot: 50 fitted regression lines for samples with var = 100]
The higher the variance input, the more widely the resulting slopes vary. The slopes resulting
from sampling data with low variance are closely bundled: the randomly sampled points sit closer
to the true line and vary less from the other points in the sample. Conversely, sampling data with
a higher variance creates samples that vary widely from one another, and this extra variability
carries over to the fitted regression lines. Higher data variance means higher slope variance.
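A minimal sketch to quantify this spread, taking the standard deviation of many simulated slopes as the summary (slope.sd is a hypothetical helper, not part of the assignment):

# Empirical spread of fitted slopes at several noise variances
slope.sd <- function(var, reps=500) {
  sd(replicate(reps, coef(lm(y ~ x, data = sim_data(var = var)))[2]))
}
sapply(c(1, 5, 100), slope.sd)  # spread grows with the error variance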
2. (c) Distributions of Slopes As we see above, the slopes are somewhat random. That means
that they follow some sort of distribution, which we can try to discern. The code below computes
num_samples independent realizations of the model data, computes the SLR model, and generates
a histogram of the resulting slopes.
Again, experiment with different variances for the simulated data and record what you notice.
What do you notice about the shapes of the resulting histograms?
[15]: hist_slopes <- function(num.slopes=500, var, num.samples=20){
slopes = rep(0, num.slopes)
# For num.slopes, compute a SLR model slope
for(i in 1:num.slopes){
# Simulate the desired data
data = sim_data(var=var, n=num.samples)
# Fit an SLR model to the data
lmod = lm(y~x, data=data)
# Add the slopes to the vector of slopes
slopes[i] = lmod$coef[2]
}
# Plot a histogram of the resulting slopes
g = ggplot() + aes(slopes) + geom_histogram(color="black", fill="#CFB87C")
return(g)
}
[16]: hist_slopes(var=1)
[Histogram of 500 fitted slopes, var = 1]
[17]: hist_slopes(var=5)
[Histogram of 500 fitted slopes, var = 5]
[18]: hist_slopes(var=100)
[Histogram of 500 fitted slopes, var = 100]
As we experiment with different variances for the simulated data, we notice corresponding
differences in the histogram of slope values. As the simulated data variance increases, so does the
variance of the plotted slopes. In the first example, with a variance of 1, the entirety of the slope
values in the histogram falls between 0 and 4. What stays consistent is the Gaussian shape:
despite the larger spread of slopes, the simulated data with higher variances still yield an
approximately normal distribution of slope estimates.
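This matches the theory: under the model assumptions, the slope estimator is normal with standard error σ/√Sxx, where Sxx = Σ(xi − x̄)². A rough sketch comparing that formula to the empirical spread (var = 5 is an arbitrary choice for illustration):

# Theoretical SE of the slope vs. the empirical SD of simulated slopes
x = seq(-1, 1, length.out = 20)           # same design as sim_data()
sqrt(5) / sqrt(sum((x - mean(x))^2))      # theoretical: sigma / sqrt(Sxx)
sd(replicate(500, coef(lm(y ~ x, data = sim_data(var = 5)))[2]))  # empirical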
2. (d) Confidence Intervals of Slopes What does that all mean? It means that when we fit a
linear regression model, our parameter estimates will not be equal to the true parameters. Instead,
the estimates will vary from sample to sample, and form a distribution. This is true for any linear
regression model with any data - not just simulated data - as long as we assume that there is a
large population that we can resample the response from (at fixed predictor values). Also note
that we only demonstrated this fact with the slope estimate, but the same principle is true for the
intercept, or if we had several slope parameters.
This simulation shows that there is a chance for a linear regression model to have a slope that is
very different from the true slope. But with a large sample size, n, or small error variance, σ²,
the distribution will become narrower. Confidence intervals can help us understand this variability.
The procedure that generates confidence intervals for our model parameters has a high probability
of covering the true parameter. And, the higher n is, for a fixed σ², or the smaller σ² is, for a fixed
n, the narrower the confidence interval will be!
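A minimal sketch that checks this coverage claim by simulation, assuming the sim_data() defaults (true slope 2.5, n = 20):

# Empirical coverage: fraction of 95% CIs that contain the true slope 2.5
covered = replicate(1000, {
  ci = confint(lm(y ~ x, data = sim_data(var = 1)))["x", ]
  ci[1] <= 2.5 && 2.5 <= ci[2]
})
mean(covered)  # should be close to 0.95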
Draw a single sample of size n = 20 from sim_data() with variance σ² = 1. Use your sample to
compute a 95% confidence interval for the slope. Does the known slope for the model (which we
can recall is 2.5) fall inside your confidence interval? How does the value of σ² affect the CI width?
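The input cell for this part appears to have been lost in conversion. A plausible reconstruction, consistent with the outputs below (note the printed intervals and summary need not come from the same draw if cells were re-run), might look like:

# Draw one sample (var = 1), fit an SLR model, and compute 95% CIs
data = sim_data(n = 20, var = 1)
lmod.sim = lm(y ~ x, data = data)
confint(lmod.sim, level = 0.95)
summary(lmod.sim)
# Repeat with a higher variance (10) to see the effect on CI width
confint(lm(y ~ x, data = sim_data(n = 20, var = 10)), level = 0.95)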
            2.5 %     97.5 %
(Intercept) 0.8704551 1.525949
x           1.7368119 2.816745
Call:
lm(formula = y ~ x, data = data)
Residuals:
Min 1Q Median 3Q Max
-1.3127 -0.6090 -0.2439 0.3830 1.6442
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8354 0.1969 4.241 0.000491 ***
x 2.2012 0.3245 6.784 2.36e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
             2.5 %     97.5 %
(Intercept) -1.5381054 1.350010
x           -0.5822167 4.175979
The confidence interval for the newly simulated data tells us that intervals constructed this way
will cover the true parameter, in this case the slope, in 95% of repeated samples. The model fitted
on the simulated data has an estimated slope of 2.2. Using simulated data lets us put this
interpretation of a confidence interval to the test, because in the simulation we actually know the
true slope. The 95% confidence interval created from the newly simulated data runs from 1.73 to
2.81. The known slope of the generating model is 2.5, which does in fact fall in this range! This
problem is a great way to use simulated sample data to put our confidence interval to the test.
We can fit another model with simulated data utilizing a higher variance to examine its effect on
the confidence interval. As expected, the confidence interval for a model fitted to data with a
higher variance is wider. For this model, fitted to data with a variance of 10, the 95% confidence
interval for the slope runs from −0.58 to 4.17, a considerably larger interval than the previous one.