RESEARCH SCHOOL OF FINANCE, ACTUARIAL STUDIES AND STATISTICS

REGRESSION MODELLING
(STAT7038)
Assignment 1 for Semester 1, 2025

INSTRUCTIONS:

• This assignment is worth 15% of your overall marks for this course.
• The data files are on Wattle.
• Submit your assignment by Turnitin on Wattle.
• Assignments must be typed. For each part, include relevant computer output, discussions,
calculations, and the R code used. Make your plots easy to understand; for example,
label the X and Y axes with variable names. Be selective about what you include in
each part: everything should relate closely to that particular question part.
Clearly label each part accordingly.
• You can use any R functions in your assignment, even if they are not covered in the
lectures.
• Unless otherwise advised, use a significance level of 5%.
• Marks may be deducted if these instructions are not strictly adhered to, and marks
will certainly be deducted if the total report is of an unreasonable length, i.e. more
than 12 pages including graphs and tables. You may include an appendix that is in
addition to the above page limits; however the appendix will not be assessed. It will
only be used if there is some question about what you have actually done.
• Late submissions will receive a mark of zero.
• Extensions are usually only granted on medical or compassionate grounds on
production of appropriate evidence. Requests must be made at least 24 hours before
the deadline. If you are granted an extension and submit your assignment after the
extended deadline then you will receive a mark of zero.
• If you have any questions about this assignment, you are welcome to discuss it with
the lecturer.

Assignment 1 - Sem 1, 2025 Page 1 of 4


Question 1 [45 Marks]
Data are collected from a group of high-performance athletes associated with the
Australian Institute of Sport (AIS), see dataset “ais.csv” on Wattle. The variables included
are:

• bmi: Body mass index, kg cm^-2 10^3

• pcBfat: percent Body fat

• lbm: lean body mass, kg

• ht: height, cm

• wt: weight, kg

• sex: a factor with levels f m

• sport: a factor with levels B_Ball Field Gym Netball Row Swim T_400m T_Sprnt
Tennis W_Polo

(a) [3 marks] You aim to investigate the relationship between body weight and body
fat percentage. For predictive purposes, identify which variable should be used as
the predictor (independent variable) and which as the response (dependent variable).
Justify your selection and formulate a simple linear regression (SLR) model,
specifying the model using variable names. Subsequent questions will be based on
this model.

(b) [15 marks] Fit the specified SLR model and perform diagnostic checks to assess
its validity. Present appropriate diagnostic plots and evaluate the model assumptions,
noting any unusual observations. Use externally studentised residuals in your
analysis.
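
As a rough illustration of what "externally studentised residuals" means in R, the sketch below uses simulated data. The names x, y and fit are hypothetical stand-ins, not the assignment variables, and the cutoff of 2 is only a common rule of thumb. It compares R's rstudent() with the leave-one-out definition and draws one possible diagnostic plot.

```r
# Simulated stand-in data; x and y are hypothetical, not the assignment variables.
set.seed(1)
n <- 50
x <- runif(n, 40, 120)
y <- 5 + 0.1 * x + rnorm(n, sd = 3)
fit <- lm(y ~ x)

# Externally studentised residuals: each residual is scaled by an error
# variance estimated with that observation left out.
r_ext <- rstudent(fit)

# The same quantity computed from its definition, as a cross-check.
e <- resid(fit)                 # ordinary residuals
h <- hatvalues(fit)             # leverages
p <- length(coef(fit))          # number of regression coefficients (2 for SLR)
s2_del <- (sum(e^2) - e^2 / (1 - h)) / (n - p - 1)  # leave-one-out variance
r_manual <- e / sqrt(s2_del * (1 - h))

# One possible diagnostic plot: residuals against fitted values.
plot(fitted(fit), r_ext,
     xlab = "Fitted values", ylab = "Externally studentised residuals")
abline(h = 0, lty = 2)

# Observations beyond a rule-of-thumb cutoff (an assumption, not a fixed rule).
flagged <- which(abs(r_ext) > 2)
```

A normal Q-Q plot of r_ext (qqnorm followed by qqline) would be a natural companion for checking the normality assumption.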

(c) [9 marks] Construct 90% confidence intervals for both the intercept and slope
parameters, and interpret these intervals in the context of the variables.
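
For orientation only, here is a minimal sketch of how 90% intervals for SLR coefficients might be obtained in R. It uses simulated data and hypothetical names (x, y, fit). It sets confint() beside the textbook form, estimate plus or minus a t quantile times the standard error.

```r
# Simulated stand-in data; names are hypothetical.
set.seed(2)
x <- runif(40, 0, 10)
y <- 1 + 2 * x + rnorm(40)
fit <- lm(y ~ x)

# Built-in 90% confidence intervals for intercept and slope.
ci90 <- confint(fit, level = 0.90)

# By hand: estimate +/- t(0.95, n - 2) * standard error.
coefs <- coef(summary(fit))
est <- coefs[, "Estimate"]
se <- coefs[, "Std. Error"]
tcrit <- qt(0.95, df = df.residual(fit))   # 5% in each tail for a 90% interval
ci_manual <- cbind(est - tcrit * se, est + tcrit * se)
```

The two results should agree, which is a quick way to confirm that the confidence level was passed correctly.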

(d) [10 marks] Despite an anticipated significant correlation between body weight and
body fat percentage, the analysis result indicates otherwise. You want to investigate
the reason. Examine the scatter plot between the predictor and response variable.
Do you notice any feature in this plot? What could have caused this feature?
Explain the feature using another variable in the dataset. Provide a graph that
clearly shows the feature and its cause to assist your explanation (i.e. use different
colors and add a legend). Taking this feature into consideration, are there any new
unusual observations based on the graph?

(e) [8 marks] Building on the previous analysis in part (d), reassess the correlation
between body weight and body fat percentage, accounting for the additional variable.
Conduct appropriate hypothesis test(s) to answer this question. Include all steps
in your test(s).

Question 2 [55 Marks]


The dataset “Growth.csv” comprises various socio-economic characteristics collected
from multiple countries. This dataset serves as a valuable resource for analyzing and
understanding the factors influencing economic growth and development across different
nations.

In this assignment, your focus will be on the following two variables extracted from the
dataset:

• gdp85: Per capita GDP in 1985.

• school: Average fraction of working-age population enrolled in secondary school
from 1960 to 1985 (in percent).

(a) [9 marks] You aim to examine the relationship between two variables in this dataset.
However, some observations contain missing values (represented as “NA”). First,
clean the dataset by removing all observations with missing values and report the
final sample size. Generate a scatter plot using school as the predictor (X variable)
and gdp85 as the response (Y variable). Based on the scatter plot, identify
any potential violations of the assumptions required for SLR. (Hint: you may find
the is.na function useful.)
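
The hint about is.na can be sketched on a toy data frame as follows. The data frame dat and its columns a and b are hypothetical stand-ins for the real file, and which columns to screen for missing values is left to the analyst.

```r
# Hypothetical toy data standing in for the real file.
dat <- data.frame(a = c(1, NA, 3, 4, 5),
                  b = c(10, 20, NA, 40, 50))

# is.na() returns TRUE where a value is missing; keep rows with no NA
# in either column of interest.
keep <- !is.na(dat$a) & !is.na(dat$b)
clean <- dat[keep, ]

# Sample size after cleaning.
n_clean <- nrow(clean)

# Scatter plot with variable names on the axes.
plot(clean$a, clean$b, xlab = "a", ylab = "b")
```

complete.cases(dat) or na.omit(dat) would give the same rows here; they screen every column of the data frame at once.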

(b) [8 marks] Apply the Box-Cox transformation to determine an appropriate
transformation for Y. Write the transformed model using variable names. Fit the
transformed model and provide the estimated regression equation using variable
names.
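
One way to run a Box-Cox analysis in R is MASS::boxcox, sketched below on simulated data. The names x, y and fit and the lambda grid are assumptions made for illustration. The sketch picks out the lambda that maximises the profile log-likelihood and an approximate 95% range around it.

```r
library(MASS)  # provides boxcox()

# Simulated positive, right-skewed response; names are hypothetical.
set.seed(3)
x <- runif(60, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(60, sd = 0.3))
fit <- lm(y ~ x)

# Profile log-likelihood over a grid of lambda values (no plot drawn here).
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# Lambda with the highest log-likelihood on the grid.
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% range: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum.
in_ci <- bc$x[bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2]
lambda_range <- range(in_ci)
```

In practice a convenient value inside the range (for example 0 for a log, or 0.5 for a square root) is often preferred to the exact maximiser because it is easier to interpret.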

(c) [14 marks] Perform diagnostic checks on the fitted model from part (b) to assess
its validity. Present appropriate diagnostic plots and evaluate the model assumptions,
noting any unusual observations. Use externally studentised residuals in your
analysis.

(d) [10 marks] Produce the ANOVA (Analysis of Variance) table for the SLR model
and conduct an F-test based on the output. Include all steps for a test. What is
the coefficient of determination for this model and how should you interpret it in
the context of this dataset?
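
As a hedged sketch of how the ANOVA table, the F statistic and the coefficient of determination relate to each other, the following uses simulated data with hypothetical names (x, y, fit) and rebuilds each quantity from the sums of squares.

```r
# Simulated stand-in data; names are hypothetical.
set.seed(4)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30)
fit <- lm(y ~ x)

# ANOVA table: regression and residual sums of squares with their df.
tab <- anova(fit)
ss_reg <- tab["x", "Sum Sq"]
ss_res <- tab["Residuals", "Sum Sq"]
df_res <- df.residual(fit)

# F statistic = (regression mean square) / (residual mean square).
f_stat <- (ss_reg / 1) / (ss_res / df_res)

# p-value from the F distribution with (1, n - 2) degrees of freedom.
p_val <- pf(f_stat, df1 = 1, df2 = df_res, lower.tail = FALSE)

# Coefficient of determination: share of total variation explained.
r2 <- ss_reg / (ss_reg + ss_res)
```

These hand-computed values should match the "F value" and "Pr(>F)" columns of the table and the R-squared reported by summary(fit).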

(e) [9 marks] Express the estimated regression model in terms of the original
(untransformed) response variable. Based on the mathematical expression, describe how
the estimated response variable changes when the predictor variable increases by
one unit. Create a scatter plot on the original scale, overlaying the fitted regression
line on the scale of untransformed response values.



(f) [5 marks] For a country with a school level of 4, construct a 99% confidence
interval for its predicted gdp85 value. Interpret this confidence interval in the
context of the variables.
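
A minimal sketch of a 99% confidence interval at a given predictor value, using predict() on simulated data, is given below. The names x, y, fit and new are hypothetical, and the model here is fitted on an untransformed response purely for illustration.

```r
# Simulated stand-in data; names are hypothetical.
set.seed(5)
x <- runif(40, 0, 12)
y <- 3 + 1.5 * x + rnorm(40)
fit <- lm(y ~ x)

# New observation at predictor value 4.
new <- data.frame(x = 4)

# 99% confidence interval for the mean response at x = 4.
ci99 <- predict(fit, newdata = new, interval = "confidence", level = 0.99)

# The same interval by hand: fitted value +/- t(0.995, n - 2) * se of the fit.
pr <- predict(fit, newdata = new, se.fit = TRUE)
tcrit <- qt(0.995, df = df.residual(fit))
ci_manual <- c(pr$fit - tcrit * pr$se.fit, pr$fit + tcrit * pr$se.fit)
```

If the model was fitted to a transformed response, the endpoints returned by predict() are on the transformed scale and need to be back-transformed before they are interpreted in the original units.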
