Week13 Exercise Solutions
load("Regression.RData")
Exercise 1
In a field experiment, the effect of soil nutrient type on plant species diversity is studied. Researchers want to
know whether increasing the nutrient type changes the number of plant species and, if so, how the number of
plant species can be predicted from nutrient type.
Use the “plantnutrition” data.frame. Calculate the regression line. Visualize the data and add the regression
line to the plot. Check if the residuals are normally distributed. Check if the slope of the regression line is
significant. Calculate the prediction interval for the number of plant species when nutrient type is 2.
head(plantnutrition)
##   NutrientNo PlantSp
## 1          0      36
## 2          0      36
## 3          0      32
## 4          1      34
## 5          2      33
## 6          3      30
# regression line calculation
reg_result = lm(PlantSp ~ NutrientNo, data = plantnutrition)
# graphical representation
plot(PlantSp ~ NutrientNo, data = plantnutrition, col = "red", pch = 19)
abline(reg_result)
[Figure: scatter plot of PlantSp (y-axis) against NutrientNo (x-axis) with the fitted regression line]
# checking residuals for normality
residual_values = reg_result$residuals
shapiro.test(residual_values)
##
## Shapiro-Wilk normality test
##
## data: residual_values
## W = 0.92819, p-value = 0.4303
The Shapiro-Wilk test does not reject normality (p-value = 0.4303 > 0.05), so we can assume that the residuals are normally distributed. Hence the normality assumption of regression is met.
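The decision rule applied to the Shapiro-Wilk output can be made explicit by pulling the test statistic and p-value out of the object that shapiro.test() returns. The sketch below uses simulated residuals (the vector sim_resid and the 0.05 threshold are illustrative assumptions, not the exercise data), so its numbers will not match the output shown above.

```r
# Simulated stand-in for reg_result$residuals (hypothetical values, not the exercise data)
set.seed(13)
sim_resid = rnorm(10, mean = 0, sd = 5)

# shapiro.test() returns an object of class "htest"
sw = shapiro.test(sim_resid)
w_stat = unname(sw$statistic)  # the W statistic
p_val = sw$p.value             # the p-value

# Decision at an assumed significance level of 0.05:
# a p-value above alpha means normality is not rejected
alpha = 0.05
normality_not_rejected = p_val > alpha
print(c(W = w_stat, p = p_val))
print(normality_not_rejected)
```

With the real data, the same comparison would be made with reg_result$residuals in place of sim_resid.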
# significance of regression line slope
summary(reg_result)
##
## Call:
## lm(formula = PlantSp ~ NutrientNo, data = plantnutrition)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -10.771  -1.856   1.068   2.894   5.907
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   34.110      2.599   13.12 1.08e-06 ***
## NutrientNo    -3.339      1.098   -3.04   0.0161 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.336 on 8 degrees of freedom
## Multiple R-squared: 0.536, Adjusted R-squared: 0.478
## F-statistic: 9.241 on 1 and 8 DF, p-value: 0.01607
We tested the null hypothesis that slope = 0 and rejected it with p-value = 0.0161. So, there is a statistically
significant linear association between nutrient type and plant species diversity. The relationship is negative
(slope = -3.339) and is described by the fitted line PlantSp = 34.110 - 3.339 × NutrientNo.
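The t value and p-value in the coefficient table can be reproduced by hand from the printed estimate, standard error, and residual degrees of freedom. The sketch below uses the rounded numbers from the summary output above, so small rounding differences are expected; the variable names are illustrative.

```r
# Values copied (rounded) from the summary(reg_result) output
slope = -3.339
se_slope = 1.098
df_resid = 8

# t statistic for H0: slope = 0
t_value = slope / se_slope

# two-sided p-value from the t distribution with 8 degrees of freedom
p_value = 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval for the slope
t_crit = qt(0.975, df = df_resid)
ci_slope = slope + c(-1, 1) * t_crit * se_slope

print(round(t_value, 2))  # about -3.04, as in the table
print(round(p_value, 4))  # close to the 0.0161 in the table
print(round(ci_slope, 2))
```

The confidence interval lies entirely below 0, which is consistent with rejecting the null hypothesis at the 5% level.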
# prediction interval for nutrient type 2
predict(reg_result, data.frame(NutrientNo = 2), interval = "prediction")
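The interval argument of predict() distinguishes a confidence interval for the mean response from a prediction interval for a single new observation; the latter is wider because it also accounts for the residual scatter. The sketch below first reproduces the point prediction at NutrientNo = 2 from the rounded coefficients printed above, then contrasts the two interval types on simulated data (sim_data and sim_fit are hypothetical stand-ins, not the plantnutrition data).

```r
# Point prediction at NutrientNo = 2 from the rounded coefficients in the summary
point_pred = 34.110 + (-3.339) * 2
print(point_pred)  # 27.432

# Simulated stand-in data (hypothetical, for illustration only)
set.seed(13)
sim_data = data.frame(NutrientNo = rep(0:4, each = 2))
sim_data$PlantSp = 34 - 3 * sim_data$NutrientNo + rnorm(nrow(sim_data), sd = 5)
sim_fit = lm(PlantSp ~ NutrientNo, data = sim_data)

new_x = data.frame(NutrientNo = 2)
conf_int = predict(sim_fit, new_x, interval = "confidence")  # for the mean response
pred_int = predict(sim_fit, new_x, interval = "prediction")  # for one new observation

# Both results are 1 x 3 matrices with columns fit, lwr, upr
width_conf = unname(conf_int[1, "upr"] - conf_int[1, "lwr"])
width_pred = unname(pred_int[1, "upr"] - pred_int[1, "lwr"])
print(c(confidence = width_conf, prediction = width_pred))
```

Both intervals are centred on the same fitted value; only their widths differ.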
Exercise 2
The height_weight data.frame contains made-up height and weight measurements of people from
2 different job types. We want to predict weight values from height. The Y values for each X are normally
distributed, so you do not need to check the assumption of normality.
• Calculate the regression line and check its significance for “O1” only and then for all observations. (We
will fit 2 distinct models.)
• For both models, predict the weight values that correspond to height values 165 and 175. Compare
the prediction intervals. Do the prediction intervals for height 165 and height 175 overlap?
Do they differ between the 2 models? What could be a reason for a potential difference?
head(height_weight)
# regression for job type "O1" only (job1 holds the "O1" rows of height_weight)
reg_result_job1 = lm(weight ~ height, data = job1)
summary(reg_result_job1)
##
## Call:
## lm(formula = weight ~ height, data = job1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -20.116  -8.281   3.169   8.063  22.741
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -119.9106    63.1410  -1.899  0.07370 .
## height         1.0712     0.3589   2.984  0.00795 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.88 on 18 degrees of freedom
## Multiple R-squared: 0.331, Adjusted R-squared: 0.2938
## F-statistic: 8.906 on 1 and 18 DF, p-value: 0.007953
The p-value for the slope is 0.00795, so the null hypothesis (slope of the regression line = 0) is rejected.
Height values can be used to predict weight values for job type “O1”.
# same analysis with all individuals in the data set
reg_result_all = lm(weight ~ height, data = height_weight)
summary(reg_result_all)
##
## Call:
## lm(formula = weight ~ height, data = height_weight)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -18.2602  -6.3568  -0.5929   8.4346  25.9064
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -37.9695    34.4587  -1.102  0.27744
## height        0.6112     0.2010   3.040  0.00427 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.31 on 38 degrees of freedom
## Multiple R-squared: 0.1956, Adjusted R-squared: 0.1745
## F-statistic: 9.243 on 1 and 38 DF, p-value: 0.004266
The p-value for the slope is 0.00427, so we reach the same conclusion as before: the slope differs significantly from 0.
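The point predictions of both models can be reproduced by hand from the coefficients printed in the two summaries above. This is a small sketch using the rounded estimates; the helper predict_weight is a hypothetical name introduced here for illustration.

```r
# Rounded coefficients copied from the two summary outputs above
coef_job1 = c(intercept = -119.9106, slope = 1.0712)  # "O1" only
coef_all = c(intercept = -37.9695, slope = 0.6112)    # all observations

# Hypothetical helper: fitted line evaluated at the given heights
predict_weight = function(coefs, height) {
  unname(coefs["intercept"] + coefs["slope"] * height)
}

heights = c(165, 175)
pred_job1 = predict_weight(coef_job1, heights)
pred_all = predict_weight(coef_all, heights)

print(data.frame(height = heights, job1 = pred_job1, all = pred_all))

# The predicted weight difference between the two heights equals 10 * slope
print(diff(pred_job1))  # 10.712
print(diff(pred_all))   # 6.112
```

Because the “O1” model has the steeper slope, its two predictions lie further apart than those of the model fitted to all observations.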
Let us compare predictions of the weight values that correspond to height values 165 and 175.
predict(reg_result_job1, data.frame(height = c(165, 175)), interval = "prediction")