Assignment 1 Questions
REGRESSION MODELLING
(STAT7038)
Assignment 1 for Semester 1, 2025
INSTRUCTIONS:
• This assignment is worth 15% of your overall marks for this course.
• The data files are on Wattle.
• Submit your assignment by Turnitin on Wattle.
• Assignments must be typed. For each part, include relevant computer output, discussion,
calculations, and the R code used. Make your plots easy to understand; for example,
label the X and Y axes with variable names. Be selective about what you include in each
part: it should relate closely to that particular question part. Clearly label each part.
• You can use any R functions in your assignment, even if they are not covered in the
lectures.
• Unless otherwise advised, use a significance level of 5%.
• Marks may be deducted if these instructions are not strictly adhered to, and marks
will certainly be deducted if the total report is of an unreasonable length, i.e. more
than 12 pages including graphs and tables. You may include an appendix that is in
addition to the above page limits; however the appendix will not be assessed. It will
only be used if there is some question about what you have actually done.
• Late submissions will receive a mark of zero.
• Extensions are usually only granted on medical or compassionate grounds on
production of appropriate evidence. Requests must be made at least 24 hours before
the deadline. If you are granted an extension and submit your assignment after the
extended deadline then you will receive a mark of zero.
• If you have any questions about this assignment, you are welcome to discuss it with
the lecturer.
• ht: height, cm
• wt: weight, kg
• sport: a factor with levels B_Ball Field Gym Netball Row Swim T_400m T_Sprnt
Tennis W_Polo
(a) [3 marks] You aim to investigate the relationship between body weight and body
fat percentage. For predictive purposes, identify which variable should be used as
the predictor (independent variable) and which as the response (dependent
variable). Justify your selection and formulate a simple linear regression (SLR) model,
specifying the model using variable names. Subsequent questions will be based on
this model.
(b) [15 marks] Fit the specified SLR model and perform diagnostic checks to assess
its validity. Present appropriate diagnostic plots and evaluate the model
assumptions, noting any unusual observations. Use externally studentised residuals in your
analysis.
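As a rough illustration of the diagnostic workflow (not a model answer), the sketch below fits an SLR to simulated data and extracts externally studentised residuals with `rstudent()`. The data frame `dat`, the column names `x` and `y`, and the cut-offs used to flag points are assumptions made for this example only; they are common rules of thumb, not requirements of the assignment.

```r
# Simulated stand-in data; `x` and `y` are hypothetical names, not the assignment variables.
set.seed(1)
n <- 50
x <- runif(n, 40, 120)
y <- 10 + 0.05 * x + rnorm(n, sd = 2)
dat <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = dat)

# Externally studentised residuals: each residual is scaled by an error
# estimate computed with that observation left out.
r_ext <- rstudent(fit)
h <- hatvalues(fit)        # leverage of each observation
d <- cooks.distance(fit)   # influence of each observation

# Residuals against fitted values, and a normal Q-Q plot of the residuals.
op <- par(mfrow = c(1, 2))
plot(fitted(fit), r_ext,
     xlab = "Fitted values", ylab = "Externally studentised residuals")
abline(h = c(-2, 0, 2), lty = c(2, 1, 2))
qqnorm(r_ext)
qqline(r_ext)
par(op)

# Flag candidates using rule-of-thumb cut-offs (assumed here):
# |residual| > 2 or leverage > 2p/n, with p the number of coefficients.
flagged <- which(abs(r_ext) > 2 | h > 2 * length(coef(fit)) / n)
```

Any flagged index is only a candidate for closer inspection; whether it is genuinely unusual is a judgement to make from the plots.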
(c) [9 marks] Construct 90% confidence intervals for both the intercept and slope
parameters, and interpret these intervals in the context of the variables.
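For reference, a minimal sketch of how 90% intervals can be obtained in R, using simulated data with hypothetical variables `x` and `y`. It uses `confint()` and then reproduces the same intervals by hand from the estimates, standard errors, and a t quantile.

```r
# Simulated stand-in data; `x` and `y` are hypothetical names.
set.seed(2)
x <- runif(40, 0, 10)
y <- 3 + 1.5 * x + rnorm(40)
fit <- lm(y ~ x)

# 90% confidence intervals for intercept and slope.
ci90 <- confint(fit, level = 0.90)

# The same intervals by hand: estimate +/- t quantile * standard error.
est <- coef(summary(fit))[, "Estimate"]
se <- coef(summary(fit))[, "Std. Error"]
tcrit <- qt(0.95, df = df.residual(fit))   # 5% in each tail for a 90% interval
ci_manual <- cbind(lower = est - tcrit * se, upper = est + tcrit * se)

print(ci90)
print(ci_manual)
```

The manual version makes explicit that a 90% interval uses the 0.95 quantile of the t distribution on n - 2 degrees of freedom.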
(d) [10 marks] Despite an anticipated significant correlation between body weight and
body fat percentage, the results of the analysis indicate otherwise. You want to investigate
the reason. Examine the scatter plot of the response against the predictor.
Do you notice any notable feature in this plot? What could have caused it?
Explain the feature using another variable in the dataset. Provide a graph that
clearly shows the feature and its cause to support your explanation (i.e. use different
colours and add a legend). Taking this feature into consideration, are there any new
unusual observations based on the graph?
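One way to draw a scatter plot coloured by a grouping variable, with a legend, is sketched below in base R. The data are simulated, and the names `x`, `y`, and `grp` are hypothetical; the sketch shows only the plotting mechanics and says nothing about which variable in the dataset is relevant.

```r
# Simulated stand-in data with a two-level grouping factor (hypothetical names).
set.seed(3)
grp <- factor(rep(c("A", "B"), each = 30))
x <- c(rnorm(30, mean = 60, sd = 8), rnorm(30, mean = 80, sd = 8))
y <- ifelse(grp == "A", 12, 2) + 0.15 * x + rnorm(60, sd = 1)
dat <- data.frame(x = x, y = y, grp = grp)

# One colour per factor level, looked up by level name for each point.
cols <- c(A = "steelblue", B = "tomato")
pt_col <- cols[as.character(dat$grp)]

plot(dat$x, dat$y, col = pt_col, pch = 19,
     xlab = "x (hypothetical predictor)", ylab = "y (hypothetical response)")
legend("topright", legend = names(cols), col = cols, pch = 19, title = "grp")
```

Looking colours up by level name, rather than by position, keeps the legend and the points in agreement even if the factor levels are reordered.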
(e) [8 marks] Building on the previous analysis in part (d), reassess the correlation
between body weight and body fat percentage, accounting for the additional variable.
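Two possible ways of accounting for a grouping variable are sketched below on simulated data: computing the correlation within each group, and fitting a model that includes the group as a factor. The names `x`, `y`, and `grp` are hypothetical, and neither approach is presented as the expected method for this part.

```r
# Simulated stand-in data in which the groups differ in level (hypothetical names).
set.seed(4)
grp <- factor(rep(c("A", "B"), each = 30))
x <- c(rnorm(30, mean = 60, sd = 8), rnorm(30, mean = 80, sd = 8))
y <- ifelse(grp == "A", 12, 2) + 0.15 * x + rnorm(60, sd = 1)
dat <- data.frame(x = x, y = y, grp = grp)

# Correlation ignoring the groups.
pooled <- cor(dat$x, dat$y)

# Correlation computed separately within each group.
within <- sapply(split(dat, dat$grp), function(d) cor(d$x, d$y))

# Alternative: a common slope for x, with a separate intercept per group.
fit_adj <- lm(y ~ x + grp, data = dat)
slope_row <- coef(summary(fit_adj))["x", ]

print(pooled)
print(within)
print(slope_row)
```

In this simulated example the pooled correlation and the within-group correlations can differ sharply, which is the kind of pattern that motivates conditioning on a third variable.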
In this assignment, your focus will be on the following two variables extracted from the
dataset:
(a) [9 marks] You aim to examine the relationship between two variables in this dataset.
However, some observations contain missing values (represented as “NA”). First,
clean the dataset by removing all observations with missing values and report the
final sample size. Generate a scatter plot using school as the predictor (X
variable) and gdp85 as the response (Y variable). Based on the scatter plot, identify
any potential violations of the assumptions required for SLR. (Hint: you may find
the is.na function useful.)
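The cleaning step might look like the following sketch. The data frame `raw` and its values are invented toy data for illustration; only the column names `school` and `gdp85` come from the question.

```r
# Toy data with made-up values; NA marks a missing value.
raw <- data.frame(
  school = c(1.2, NA, 3.4, 5.1, 2.2, NA),
  gdp85  = c(10, 12, NA, 30, 15, 9)
)

# Keep only rows in which neither variable is missing.
keep <- !is.na(raw$school) & !is.na(raw$gdp85)
clean <- raw[keep, ]

# Final sample size after cleaning.
n_clean <- nrow(clean)
print(n_clean)

# Scatter plot with variable names on the axes.
plot(clean$school, clean$gdp85, xlab = "school", ylab = "gdp85")
```

For a data frame with many columns, `complete.cases(raw)` or `na.omit(raw)` removes rows with a missing value in any column, which gives the same result here.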
(c) [14 marks] Perform diagnostic checks on the fitted model from part (b) to assess
its validity. Present appropriate diagnostic plots and evaluate the model
assumptions, noting any unusual observations. Use externally studentised residuals in your
analysis.
(d) [10 marks] Produce the ANOVA (Analysis of Variance) table for the SLR model
and conduct an F-test based on the output. Include all steps of the hypothesis test.
What is the coefficient of determination for this model, and how should you interpret
it in the context of this dataset?
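The pieces of the F-test can be read from `anova()` output and checked by hand, as in this sketch on simulated data with hypothetical variables `x` and `y`. The hypotheses, decision rule, and conclusion still need to be written out in words.

```r
# Simulated stand-in data; `x` and `y` are hypothetical names.
set.seed(6)
x <- runif(30, 0, 12)
y <- 5 + 0.4 * x + rnorm(30)
fit <- lm(y ~ x)

# ANOVA table for the SLR model.
tab <- anova(fit)
print(tab)

# Rebuild the F statistic from the sums of squares.
ss_reg <- tab["x", "Sum Sq"]
ss_res <- tab["Residuals", "Sum Sq"]
df_res <- df.residual(fit)
f_stat <- (ss_reg / 1) / (ss_res / df_res)

# Critical value at the 5% level and the p-value.
f_crit <- qf(0.95, df1 = 1, df2 = df_res)
p_val <- pf(f_stat, df1 = 1, df2 = df_res, lower.tail = FALSE)

# Coefficient of determination: share of total variation explained by the model.
r2 <- ss_reg / (ss_reg + ss_res)
```

Here `f_stat` matches the `F value` column of the table and `r2` matches `summary(fit)$r.squared`, so either route can be reported.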
(e) [9 marks] Express the estimated regression model in terms of the original
(untransformed) response variable. Based on the mathematical expression, describe how
the estimated response variable changes when the predictor variable increases by
one unit. Create a scatter plot on the original scale, overlaying the fitted regression
line on the scale of untransformed response values.
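The sketch below shows the mechanics of back-transforming a fitted line, assuming purely for illustration that the response was log-transformed; use whichever transformation your own fitted model actually has. The data are simulated and the names `x` and `y` are hypothetical.

```r
# Simulated stand-in data; the log transformation is an ASSUMPTION of this example.
set.seed(7)
x <- runif(60, 0, 10)
y <- exp(1 + 0.25 * x + rnorm(60, sd = 0.3))

# Model fitted on the transformed scale: log(y) = b0 + b1 * x.
fit <- lm(log(y) ~ x)
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["x"]]

# On the original scale the fitted curve is exp(b0 + b1 * x), so a one-unit
# increase in x multiplies the fitted response by exp(b1).
mult <- exp(b1)
print(mult)

# Scatter plot on the original scale with the back-transformed fit overlaid.
plot(x, y, xlab = "x (hypothetical predictor)", ylab = "y (original scale)")
curve(exp(b0 + b1 * x), add = TRUE, col = "red", lwd = 2)
```

Under a log transformation the effect of the predictor is multiplicative rather than additive, which is why the overlaid fit is a curve rather than a straight line on the original scale.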