CP2403 - Assignment – Part 2 – Task 4: Multiple Regression
First Name:景娇
Last Name:李
1: Data Selection
- Data selected: bottle.csv
- Response variable: T_degC
- Explanatory variable 1: Depthm
- Explanatory variable 2: Salnty
- Explanatory variable 3: O2ml_L
2: Scatter plots between each explanatory variable and response variable
Scatter plot 1 & r-value
Scatter plot 2 & r-value
Scatter plot 3 & r-value
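The three r-values above can also be computed programmatically. The sketch below uses synthetic stand-in arrays (hypothetical values, not the real bottle.csv measurements) to show the Pearson r calculation with `numpy.corrcoef`:

```python
import numpy as np

# Hypothetical stand-ins for the bottle.csv columns: temperature falls with
# depth, salinity rises with depth, oxygen falls with depth.
rng = np.random.default_rng(0)
n = 500
depth = rng.uniform(0, 500, n)                      # Depthm
temp = 18 - 0.025 * depth + rng.normal(0, 1, n)     # T_degC (response)
sal = 33 + 0.004 * depth + rng.normal(0, 0.2, n)    # Salnty
o2 = 6 - 0.008 * depth + rng.normal(0, 0.5, n)      # O2ml_L

def pearson_r(x, y):
    """Pearson correlation coefficient between two arrays."""
    return float(np.corrcoef(x, y)[0, 1])

for name, x in [("Depthm", depth), ("Salnty", sal), ("O2ml_L", o2)]:
    print(f"r(T_degC, {name}) = {pearson_r(x, temp):.3f}")
```

With the real data, the same `pearson_r` call would be applied to the cleaned bottle.csv columns after dropping missing values.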
3: Summary of your pre-testing plan - List possible candidate combinations of individual
regression models to compose a multiple regression model and your justification (e.g. why
did you decide to apply such combination strategy?)
Candidate combinations: Model 1: T_degC ~ Depthm
Model 2: T_degC ~ Salnty
Model 3: T_degC ~ Depthm + Salnty
Justification: These combinations were chosen because the scatter plots show a moderate-to-strong negative correlation between T_degC and Depthm and a moderate-to-strong positive correlation between T_degC and Salnty; Model 3 combines the two to test whether they jointly explain more variance than either predictor alone.
4: Pre-testing Regression analysis results (for each candidate (multiple) regression model)
5: Pre-testing Regression equation/line (for each candidate (multiple) regression model)
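The fits and equations for sections 4 and 5 can be sketched as ordinary least squares via `numpy.linalg.lstsq`. The data here are synthetic stand-ins (hypothetical values, not the real bottle.csv measurements); with the real data only the array construction would change:

```python
import numpy as np

# Hypothetical stand-in data for the bottle.csv columns
rng = np.random.default_rng(0)
n = 500
depth = rng.uniform(0, 500, n)                      # Depthm
sal = 33 + 0.004 * depth + rng.normal(0, 0.2, n)    # Salnty
temp = 18 - 0.025 * depth + rng.normal(0, 1, n)     # T_degC

def fit_ols(X, y):
    """Least-squares fit with intercept; returns (coefficients, R^2)."""
    A = np.column_stack([np.ones(len(y))] + list(X))
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, float(r2)

models = {
    "Model 1: T_degC ~ Depthm": [depth],
    "Model 2: T_degC ~ Salnty": [sal],
    "Model 3: T_degC ~ Depthm + Salnty": [depth, sal],
}
for name, X in models.items():
    beta, r2 = fit_ols(X, temp)
    terms = " + ".join(f"{b:.4f}*x{i}" for i, b in enumerate(beta[1:], 1))
    print(f"{name}: T_degC = {beta[0]:.3f} + {terms}   (R^2 = {r2:.3f})")
```

The printed coefficients give the regression equation for each candidate model, and R^2 is the fraction of variance in T_degC each model explains.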
6: Q-Q plot for each candidate (multiple) regression model
7: Conclusion from Q-Q plots
Model 3 appears to have residuals that are closer to a normal distribution than Models 1
and 2, suggesting it might be a better fit.
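The visual Q-Q judgment above can be backed by a number: the correlation between sorted standardized residuals and theoretical normal quantiles, which is close to 1 when residuals are approximately normal. A minimal sketch using stdlib `statistics.NormalDist` and hypothetical stand-in residuals:

```python
import numpy as np
from statistics import NormalDist

# Hypothetical stand-in residuals from a fitted model
rng = np.random.default_rng(1)
resid = rng.normal(0, 1, 400)

def qq_correlation(residuals):
    """Correlation between sorted standardized residuals and normal
    quantiles; values near 1 indicate approximately normal residuals."""
    z = np.sort((residuals - residuals.mean()) / residuals.std(ddof=1))
    probs = (np.arange(1, len(z) + 1) - 0.5) / len(z)
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    return float(np.corrcoef(z, theo)[0, 1])

print(f"Q-Q correlation: {qq_correlation(resid):.4f}")
```

Comparing this statistic across the three candidate models gives a numerical companion to the Q-Q plots: the model with the value closest to 1 has the most normal residuals.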
8: Residual Plot for each candidate model
For each candidate model:
- Standardised Residual plot
- Percentage of observations beyond 2 standard deviations
- Percentage of observations beyond 2.5 standard deviations
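The percentages requested above can be computed directly from the standardized residuals. A sketch with hypothetical stand-in residuals (for roughly normal residuals, about 5% fall beyond 2 standard deviations and about 1% beyond 2.5):

```python
import numpy as np

# Hypothetical stand-in residuals from a fitted model
rng = np.random.default_rng(2)
resid = rng.normal(0, 1, 1000)

def pct_beyond(residuals, threshold):
    """Percent of standardized residuals whose magnitude exceeds threshold."""
    z = (residuals - residuals.mean()) / residuals.std(ddof=1)
    return float(100 * np.mean(np.abs(z) > threshold))

for t in (2.0, 2.5):
    print(f"|z| > {t}: {pct_beyond(resid, t):.1f}%")
```

Percentages much larger than these benchmarks would suggest heavy-tailed residuals and a possible assumption violation for that candidate model.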
9: Conclusion from Standardised Residual plots
Model 3 seems to have the best fit among the three, with residuals more randomly distributed around
the zero line.
10: Conclusion Overall
- Can you select one best model among your candidate models?
Based on the R-squared values, Q-Q plots, and standardized residual plots,
Model 3 (T_degC ~ Depthm + Salnty) is selected as the best model.
- Justify your selection.
This model not only has a higher R-squared value than Models 1 and 2, but its residuals are also closer to a normal distribution and more randomly scattered around the zero line, indicating a better fit with fewer violations of the linear regression assumptions.
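One caveat when comparing Model 3 against the single-predictor models: plain R-squared never decreases when a predictor is added, so adjusted R-squared is the fairer tiebreaker. A sketch with hypothetical R-squared values (not the actual fitted ones):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2: penalizes the number of predictors k for sample size n,
    so a two-predictor model must earn its extra term."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical illustration: a modest R^2 gain from the second predictor
print(adjusted_r2(0.82, 500, 1))  # one-predictor model (Model 1-style)
print(adjusted_r2(0.84, 500, 2))  # two-predictor model (Model 3-style)
```

If Model 3's adjusted R-squared still exceeds those of Models 1 and 2, the selection above holds even after accounting for its extra predictor.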