ST404: Applied Statistical Modelling

ST404 Applied Statistical Modelling 2024


Assignment 2: Linear modelling
1.1 Introduction
Assignment 2 accounts for 40% of your final module mark. It consists of two components:
1. An individual piece of reflective writing, details of which can be found on the other
assignment brief on Moodle (5%)
Deadline for reflective submission: Wednesday 13th March 2024 13:00.
2. A piece of group work analysing an updated version of the Automobile dataset from
Assignment 1 (see below for details). Your analysis will be presented in two parts.
(i) A written report (25%)
Deadline for report submission: Friday 1st March 2024 13:00.
(ii) A poster presentation (10%)
Deadline for submission of the pdf version of poster: Wednesday 28th February 2024
13:00 (Poster sessions will be held in weeks 9 and 10).

This document outlines the group work component.

1.2 The data

After you reported on missing values in the dataset from Assignment 1, your client went back to their
original source and recovered the missing values for the Wid and Cyl variables. From this, they have
provided you with a more complete dataset, contained in the files Auto85_full.rdata and
Auto85_full.rds. (Note, however, that this dataset still contains some missing values in other
variables.)
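
As a starting point, the remaining missing values can be counted per variable. The following is a minimal sketch only: the data frame `toy`, its values, the column name `Other` and the helper `count_missing` are all hypothetical and invented for illustration. Only the names Wid, Cyl and Auto85_full.rds come from this brief.

```r
# Hypothetical toy data frame standing in for the real Auto85_full data.
# The values are invented; "Other" is a made-up column name.
toy <- data.frame(
  Wid   = c(64.1, 65.5, 66.5, 65.2),
  Cyl   = c(4, 6, 4, 8),
  Other = c(NA, NA, 1.2, 3.4)
)

# Count the missing values in each column of a data frame.
count_missing <- function(df) colSums(is.na(df))

missing_counts <- count_missing(toy)
print(missing_counts)

# With the real file in the working directory, the same check would be:
# auto <- readRDS("Auto85_full.rds")
# count_missing(auto)
```

Any non-zero count flags a variable for which a cleaning decision (and its justification) is likely to be needed.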

As a reminder, this data comes from Ward’s 1985 Automotive Yearbook, and was
donated to the UCI machine learning dataset repository by Schlimmer (1987). However, you should
use the Moodle versions of the data, and not the version from the UCI repository. A full description
of this dataset can be found in Table 1 from the Assignment 1 description.

You should already be familiar with the data from your work from the first assignment, and you
should use what you discovered in that assignment and the feedback you received (which will include
class level feedback and presentation feedback) to inform your decisions and approach for this
assignment. You will be in different groups for this assignment.


Group Task
2.1 Questions to Address

In this fictitious scenario, your group is working in a consulting role for an Association of US car
dealers in the mid-1980s (although of course you have access to more sophisticated software). Your
task is to analyse the updated Automotive Dataset to reveal patterns in the MPG (miles per gallon).
The Association is interested in a number of questions, such as:

a) What are the major determinants of the MPG of a car?


b) Are there some types of car with unusually low or high MPG that do not conform to the
general pattern?

The expectation when working with industry experts is to provide information that allows them to
formulate their own decisions, but not to make those decisions yourself. So, for example, it is
acceptable to describe the major determinants of car MPG, and to describe clear differences where
these occur. But you are not expected to tell the client what they should recommend dealers use to
estimate the MPG. To a certain extent, they already know what determines
this. What they want to know includes the following:
• Does the actual MPG match their current understanding?
• Can a model help them identify new aspects that need to be considered?
• Can a model confirm that the aspects already used have the correct weighting in determining
the listed MPG they use in advertising their cars?
• Are there any data quality issues they should tackle for such models to be useful in the
future?

2.2 Modelling Approach

Your model choice strategy combines predictive and explanatory goals. Your model should have good
predictive power in order to support your conclusions about the determinants of car MPG. However,
it should not be a “black box” that gives good predictions but gives no understanding of why these
MPGs are high (or low) in certain cases. Hence your task is to use these data to create a model which
you believe offers an appropriate balance between:

a) predictive power;
b) explanatory power;
c) simplicity (the clients are not experts in statistics, they need to understand it and be able to
use it).

Your task is to use the Automobile Dataset to generate and present a linear regression model which
performs well both as a predictive model and an explanatory model for the MPG.

In generating your model, you should, as previously mentioned, consider the lessons learned from the
Exploratory Data Analysis in Assignment 1. Since you are working in different groups, your first task
will be to combine the ideas you have from your first groups. In your report, you can refer to your
EDA findings without having to repeat them. Some of you asked about using residual plots etc. in the
first assignment, where they were not a requirement. However, model diagnostics can and should now
be carried out as you build your models.

2.3 Considerations
Factors to consider are (some of which may be as a result of your findings in Assignment 1):
a) Whether the data needs to be cleaned, and if so how;
b) Whether the data needs to be transformed, and if so how;
c) Whether there are outliers to be considered, and if so how to deal with them;
d) How best to test the model for its usefulness in terms of both prediction and explanation;
e) Whether a penalty function should be applied to the size of the coefficients in the model;
f) Whether any covariates should be excluded from the model, and if so how these variables
are to be identified.
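
One possible way to approach consideration d) is to hold back part of the data and measure prediction error on it. The sketch below is illustrative only: it uses the `mtcars` data built into R as a stand-in for the assignment dataset, and the 25% split, the seed and the predictors `wt` and `hp` are arbitrary assumptions rather than recommendations.

```r
set.seed(404)  # arbitrary seed so the split is reproducible

# mtcars ships with base R; it is only a stand-in for the assignment data.
n <- nrow(mtcars)
test_idx <- sample(n, size = round(0.25 * n))
train <- mtcars[-test_idx, ]
test  <- mtcars[test_idx, ]

# Fit the model on the training rows only.
fit <- lm(mpg ~ wt + hp, data = train)

# Root mean squared prediction error on the held-out rows.
pred <- predict(fit, newdata = test)
rmse_test <- sqrt(mean((test$mpg - pred)^2))

# In-sample error on the training rows, for comparison.
rmse_train <- sqrt(mean(residuals(fit)^2))

print(c(train = rmse_train, test = rmse_test))
```

A held-out error much larger than the in-sample error would suggest overfitting. With a small dataset, a single split is noisy, so cross-validation may be a better choice.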

You need to present in full only one model for the outcome variable. However, as has been discussed
in lectures, the stepwise regression method can often lead to flawed models. Therefore, if you
present a model found using stepwise regression, you need to justify why the limitations of stepwise
regression have not caused an issue here. You are advised to try a variety of techniques to identify
your final model and you should focus on your chosen method whilst explaining why other methods
were either not considered or were found to be less suitable.
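
For reference, stepwise selection in base R can be sketched as follows. This is a hedged illustration, not a recommended workflow: `mtcars` is again a stand-in for the assignment data, and starting from the full model with backward elimination is an arbitrary choice. The limitations of stepwise regression discussed in lectures apply to this code as much as to any other use of the method.

```r
# mtcars (built into R) is used purely as a stand-in for the assignment data.
full <- lm(mpg ~ ., data = mtcars)
n <- nrow(mtcars)

# Backward elimination: k = 2 penalises by AIC, k = log(n) by BIC.
step_aic <- step(full, direction = "backward", k = 2, trace = 0)
step_bic <- step(full, direction = "backward", k = log(n), trace = 0)

# Compare which predictors each criterion retains.
print(formula(step_aic))
print(formula(step_bic))

# Standard residual diagnostics for a selected model (four base-R plots):
# par(mfrow = c(2, 2)); plot(step_bic)
```

Comparing the two selected formulas, and checking whether they change when a few observations are removed, is one way to gather evidence on how stable the selection is.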


Required Submission Format


You should prepare a report and a poster presentation.

3.1 Report
The report should be structured into three sections:

3.1.1 Findings (max 4 pages, including figures and tables).


Description of your main findings and recommendations for predictors to focus upon in future, as
you would present them to the client. The client is not a statistician, so keep statistical jargon to a
minimum, and use figures or tables to support your predictions or chosen model.

The goal of this section is to provide the client with a good understanding of what you did, so they
can take an evidence-based approach to future price recommendations. Your report should reach
clear conclusions about what are the major determinants of MPG. This will require a good knowledge
of your data, an interpretation of the model and an understanding of the limitations of your analysis.

This section should also include any criticisms or limitations of the data or of the
analysis that you have just performed. Your healthy criticism may give directions for future
implementations, which would be very valuable to the client going forwards.

3.1.2 Statistical methodology (max 7 pages, including figures and tables).

Description of the methods you used. This should indicate any strategies for your analysis such as:
outcome/predictor transformations, variable selection strategies, outlier removal, analysis of the
residuals, and model diagnosis. You should consider at least one selection or penalized likelihood
strategy from the following list:

a) Stepwise regression with AIC and/or BIC


b) Ridge regression
c) LASSO regression
d) Bayesian variable selection

Here you should discuss why you ended up choosing one of these approaches over the others (see
above for a comment on using stepwise regression), and provide any necessary evidence.
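
For reference, the criteria behind options a) to c) can be written as below. The notation is an assumption made for this summary and may differ from the lecture notes: $n$ is the number of observations, $k$ the number of estimated parameters, $\hat{L}$ the maximised likelihood, $p$ the number of predictors and $\lambda \ge 0$ a tuning parameter (often chosen by cross-validation).

```latex
% Information criteria used in stepwise selection (smaller is better)
\mathrm{AIC} = 2k - 2\log\hat{L}, \qquad
\mathrm{BIC} = k\log n - 2\log\hat{L}

% Ridge regression: squared (L2) penalty on the coefficients
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}
  \left\{ \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2
  + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}

% LASSO regression: absolute-value (L1) penalty on the coefficients
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}
  \left\{ \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^2
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}
```

Both penalties shrink coefficients towards zero, but only the LASSO penalty can set some of them exactly to zero, so it also acts as a variable selection method. Because the penalties depend on the scale of the predictors, predictors are usually standardised before fitting.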

A statistical explanation of how you arrived at the recommendations given in the previous section
should be included here, along with any additional discussion of limitations of the data and
suggestions of improvements/alternatives to your approach for future work.

A major goal of this section is to give enough details so that if another statistician attempted to
reproduce your results, they could do so without having to guess at any stage about what decisions
you made and processes you followed - it is not enough to simply include all code used in the
appendix and expect someone to read through it without explanation.


3.1.3 Appendix (max 4 pages).

Here you should include annotated R code. Do not put any R code in Sections 1 and 2 of the report
(described in 3.1.1 and 3.1.2 respectively). If your R code is extensive, think about how to reduce it. For example, if
some code is repeated but for different combinations of variables (e.g. different transformations) you
only need to present one example and add a comment to explain this.

3.1.4 Layout

The report should be written in font size 11 or larger with 1.5 line spacing. Margins
should be appropriate. All figures and tables should be numbered and have captions. Do not include
raw output from R in your report. Excessively small figures will reduce your mark, so be selective about
which ones to include, even if you have produced more behind the scenes.

3.2 Poster Presentation

In addition to the report, you should prepare a poster. Posters are a standard way for early career
scientists to communicate the findings of their research at conferences, especially for work in
progress. In a poster session, conference attendees are free to come and go, to read the contents of
the poster and discuss them with the authors. There will be a poster session in weeks 9 and/or 10.
Details will be published nearer the time.

The target audience for the poster is the same as for the findings section of the report, i.e., you
should aim the main messages of your poster at your clients. However, you should be ready in the
poster session to defend your chosen approach to a fellow statistician.

The poster should be of A1 size (594 × 841 mm). It should contain a brief description of your
methodology and findings which should be visually appealing to a non-technical audience.

To allow time for the poster to be printed we ask that you submit it early in week 8. The actual
poster sessions will be in weeks 9 and 10. Details will follow and be published on Moodle.


Marking Criteria
4.1 Report (Marked out of 60; 25% of overall)
4.1.1 Findings
1) Clarity and accuracy of overview of data and the description and interpretation of model; (5
marks)
2) Quality and relevance of numerical and graphic output; (5 marks)
3) Quality of recommendations provided,
a) correct interpretation of model/main predictors; (5 marks)
b) Limitations of the data, model etc.; with suggestions for future implementations (5 marks)
4) Appropriateness, clarity, and correctness of language. (5 marks)

4.1.2 Statistical Methodology


1) Relevance and quality of numerical and graphical evidence, and soundness and justification of
modelling decisions;
a) Data preparation. (6 marks)
b) Modelling approach: selection methods. (6 marks)
c) Model choice and validation; residual/influential analysis. (6 marks)
2) Depth of critical evaluation of the final model;
a) Model discussion and interpretation. (6 marks)
3) Structure and clarity, appropriate use of terminology, correctness of English. (5 marks)

4.1.3 Appendix
Appropriately presented, annotated and complete. (6 marks)

4.2 Poster Presentation (Marked out of 40; 10% of overall)


4.2.1 Poster (Marked as a group) (4%)
1) Layout, structure and visual appeal (8 marks);
2) Accuracy and relevance of content (8 marks).

4.2.2 Oral Presentation: (Marked individually) (5%)


1) Fluidity (5 marks);
2) Clarity, Use of language, Correctness; (5 marks)
3) Engagement with Audience; (5 marks)
4) Response to targeted questions, where appropriate (5 marks).

4.2.3 Oral Presentation Group Coherence: (Marked as a group) (1%)


1) Timing (1 mark)
2) Balance of Group members in delivery (2 marks)
3) Coherence (1 mark)

4.3 Penalties
Late submission (-5% per working day), Over page limit (-5%), Not using prescribed layout (-5%)


Group Scaling
As in Assignment 1, each group should submit a completed group scaling form along with their report
(as a separate document), which is used to adjust individual group members’ marks (on the report
and poster) based on how much or how little work they have done. This form should be approved by all
members of the group. Based on this form, marks for each individual may be adjusted by up to ±20%
(however, you should ideally aim to have an even split of workload). See the form for further
instructions on how scaling works.

Each student will receive an individual mark for their oral presentation, as in Assignment 1.

Reflective Writing
Remember to look at the other assignment brief on Moodle for details on the reflective writing
component.
