Final Project Part 1 Instructions-1

STA302 final project instructions

Uploaded by

kondor200414

STA302 Fall 2024 Final Project Part 1

Research Proposal and Data Introduction


Due: October 4, 2024 by 8:00PM ET
Latest Acceptance: October 11, 2024 by 8:00PM ET

Please note that if you intend to use the NQA extension time, you should not submit any
documents prior to the posted deadline as Quercus will not allow any changes or additions to
the submission after the initial deadline. Instead, make sure you have all your documents
prepared and ready to submit all at once.

Goal of the Assessment:

• To have the opportunity to work on a topic of interest to them and to be creative about this topic.
• To experience the process of conducting a small literature review and incorporating knowledge gained into analysis.
• To think about whether a research question and/or a dataset is appropriate for use with linear regression.
• To create a draft of the components to be included in an introduction section of a report, as well as summary figures and/or tables for results section.

Learning Outcomes being Assessed:

• Apply multiple linear models on various datasets using R statistical software.
• Differentiate the relationships modelled using qualitative predictors, interactions between predictors, and continuous predictors.
• Create appropriate residual plots to evaluate model assumptions for a given data set using software.
• Recognize distinct patterns in appropriate residual plots and correctly conclude which assumption is violated.
• Report the results of a residual plot analysis and recommend a course of action.

Instruction Summary:

1. Locate open-source data in an area of interest to the group that meets the data requirements listed below.
Some examples could be (but are certainly not limited to) sports, medicine, public health, economics, video
games, literature, etc. Students/groups will also need to argue for why their dataset is suitable to be used
with a linear regression model.

2. Define an explicit research question using the information in that dataset. Note that students/groups will
need to argue for why linear regression is appropriate to answer this question with this dataset.

3. Locate three peer-reviewed academic papers related to the specific research question or topic of interest.
Students/groups will need to describe how each article relates back to their proposed research question.

4. Select at least 5 variables from the dataset to be predictors in a preliminary multiple linear regression model,
with at least one of these five being categorical in nature. These predictors must have been mentioned and
summarized in the three academic papers above. The model will then be fit and a complete residual analysis
to assess model assumptions will be done.

5. Provide a table that numerically summarizes each variable used in their preliminary model, with an
informative caption that highlights any interesting features of the variables (e.g., skews, possible outliers or
nonsensical observations, high spread, missing values).
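
As a rough illustration of the numerical summary table described in step 5, the sketch below builds one with base R only. The data are simulated and the variable names (price, rating, type) are hypothetical placeholders, not part of the assignment; a real proposal would read in the chosen dataset instead.

```r
set.seed(302)
n <- 1000

# Simulated placeholder data; every variable name here is hypothetical.
dat <- data.frame(
  price  = rlnorm(n, meanlog = 5, sdlog = 0.5),  # right-skewed numeric variable
  rating = round(runif(n, 1, 5), 1),             # bounded numeric variable
  type   = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
dat$rating[sample(n, 20)] <- NA  # inject 20 missing values for illustration

# One row per numeric variable, with centre, spread, range, and missingness.
num_vars <- names(dat)[sapply(dat, is.numeric)]
summary_tab <- data.frame(
  variable = num_vars,
  mean     = sapply(dat[num_vars], mean, na.rm = TRUE),
  sd       = sapply(dat[num_vars], sd, na.rm = TRUE),
  min      = sapply(dat[num_vars], min, na.rm = TRUE),
  median   = sapply(dat[num_vars], median, na.rm = TRUE),
  max      = sapply(dat[num_vars], max, na.rm = TRUE),
  missing  = sapply(dat[num_vars], function(x) sum(is.na(x))),
  row.names = NULL
)
print(summary_tab, digits = 3)

# Categorical variables are summarized by level counts instead.
print(table(dat$type, useNA = "ifany"))
```

A mean well above the median (as with the simulated price) is one sign of right skew that the caption could point out. The resulting data frame could then be passed to a table-formatting function when knitting the proposal.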

Dataset Requirements:

o Dataset must be open-source and the website where it was found/downloaded from must be provided.
o MUST contain at least 1000 observations (i.e., rows).
o MUST contain 1 response variable suitable for linear regression and at least 9 predictor variables, one of
which must be categorical. Categorical variables with multiple levels count as 1 variable here.
o Since at least one predictor will need to be categorical, you may convert one of your numerical
variables to categorical if no such variable is available in your downloaded dataset. However, you will
need to justify your choice of variable and categorization in the proposal.
o Should NOT be from an educational resource, such as a textbook dataset. If you’re not sure, please ask the
instructor or one of the TAs.
o Should NOT be one of the following datasets: Boston Housing dataset or Red Wine Quality dataset.
o If the dataset was found in a data repository (e.g., Kaggle, UCI Repository, etc.), you MUST ensure that your
research question is novel and different from the original usage of the data.
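
The size and variable-type requirements above can be checked quickly in R. The following is a minimal sketch on simulated data, assuming a response named y and predictors named x1 to x9 (all hypothetical). It also shows one possible way to convert a numerical predictor to a categorical one with cut(); the tertile cut points are only an illustrative choice and would still need to be justified in the proposal.

```r
set.seed(302)
n <- 1200

# Simulated placeholder data: nine numeric predictors (x1..x9) and a response (y).
# In practice this would be replaced by reading in the downloaded dataset.
dat <- as.data.frame(matrix(rnorm(n * 9), nrow = n))
names(dat) <- paste0("x", 1:9)
dat$y <- 2 + 0.5 * dat$x1 + rnorm(n)

# Convert one numeric predictor into a three-level categorical predictor
# using tertile cut points (an illustrative assumption, not a recommendation).
breaks <- quantile(dat$x9, probs = c(0, 1/3, 2/3, 1))
dat$x9_cat <- cut(dat$x9, breaks = breaks,
                  labels = c("low", "mid", "high"),
                  include.lowest = TRUE)

# Candidate predictors: everything except the response and the original x9.
predictors <- setdiff(names(dat), c("y", "x9"))

checks <- c(
  enough_rows       = nrow(dat) >= 1000,
  enough_predictors = length(predictors) >= 9,
  has_categorical   = any(sapply(dat[predictors], is.factor))
)
print(checks)
```

Note that the categorical version of x9 counts as one predictor regardless of how many levels it has, which matches the counting rule stated in the requirements.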

Proposal Format:

Your group will create a written proposal that should introduce your research question and data, summarize existing
knowledge in that area, fit a preliminary model based on the existing knowledge, and conduct a residual analysis of
the model. The proposal must include the following sections and must not exceed the word count in each case:

o Contributions: each group member’s name is listed and a description of their contribution to the proposal is
outlined (this does not count towards the word limit).

o Introduction (350 words): introduce the relevance/importance of the topic, state the research question of
interest, summarize the results of three peer-reviewed research papers with a focus on their connection to
the research question, and describe why linear regression is a suitable statistical tool to answer the research
question.
o i.e., why should someone be interested in your project, what are you trying to answer, what is already
known about this question, and why should you use linear regression.

o Data description (300 words): state where the data was found, explain how the data was originally collected
(not how you found the data but how the original curator of the data collected it), describe the response
variable (both statistically and with a written description of what it measures and why it meets the
requirements for use in a linear model), summarize numerically or graphically (in a single figure/table) each
predictor in your dataset that will be used in the preliminary model, and interpret the descriptive statistics
in the context of what the predictors measure and how it relates to the research question.
o NOTE: if you had to convert a numerical predictor to a categorical predictor to meet the data
requirements, you must justify your choice and the chosen categories in this section.

o Ethics discussion (100-200 words, only for L0101/L0201/L2001/L2002 students): Would you consider your
dataset to be trustworthy, given the criteria discussed in the ethics module? Justify briefly using material and
terminology discussed in the first ethics module.
o Bonus exercise: We encourage you to think about whether your dataset was collected ethically, and
whether you are making ethically appropriate use of it, given the issues raised in the ethics module
(you do not need to write anything about this question).

o Preliminary results (300 words): fit a preliminary model using 5 predictors noted in the literature, conduct a
full analysis of the linear regression assumptions noting any violations and what led to your conclusions.
Discuss whether your preliminary model results are similar or different to results in the literature and why.
o NOTE: Place residual plots into the document in a grid (i.e., 2-3 plots placed horizontally in a single
figure) so that multiple plots will display in a single figure for improved readability (see Resources
below).

o Bibliography: an appropriately formatted list of resources and literature cited in the proposal (not included
in word count). APA format is acceptable.
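
For the Preliminary results section, fitting a model with five predictors (one categorical) and arranging residual plots in a grid might look like the sketch below. It uses simulated data with hypothetical variable names and base R plotting only. The three plots shown are just one possible subset; which plots make up a complete residual analysis is determined by the course material, not by this example.

```r
set.seed(302)
n <- 1000

# Simulated placeholder data: four numeric predictors and one categorical predictor.
dat <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n),
  group = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
dat$y <- 1 + 0.8 * dat$x1 - 0.5 * dat$x2 + 0.3 * (dat$group == "b") + rnorm(n)

# Preliminary multiple linear regression with five predictors.
fit <- lm(y ~ x1 + x2 + x3 + x4 + group, data = dat)

res <- resid(fit)
fitted_vals <- fitted(fit)

# Place three diagnostic plots side by side in a single figure.
old_par <- par(mfrow = c(1, 3))
plot(fitted_vals, res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)
plot(dat$x1, res, xlab = "x1", ylab = "Residuals",
     main = "Residuals vs x1")
abline(h = 0, lty = 2)
qqnorm(res, main = "Normal Q-Q plot")
qqline(res)
par(old_par)  # restore the previous plotting layout
```

Setting par(mfrow = c(1, 3)) is what produces the horizontal grid; saving and restoring the old settings keeps later figures in the Rmd file from inheriting the three-panel layout.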

What to Submit:

Only ONE member of the group should submit ALL required submission components. A complete submission to
Quercus will include:

✓ Your group’s completed Group Teamwork Agreement, saved as a PDF (see Quercus Final Project page).
✓ The completed proposal, saved as a PDF.
✓ The Rmd file containing the code used to subset and clean the data, fit the model, produce a summary table,
and conduct the residual analysis for checking assumptions.
✓ The original and cleaned (where appropriate) datasets as CSV files, uploaded to a cloud-based storage service
(e.g., OneDrive), with the shareable link included as a submission comment on Quercus.

Failure to meet these submission requirements, including incorrect format of components, missing components, and
cloud links that do not allow shared access, will result in a one-mark deduction on the grade of the proposal.

Resources:

Should your group have difficulty locating a suitable dataset that meets the group’s interest and the dataset
requirements, your group can consider using one of the datasets in the table below. You may also consider
consulting the library resources for help performing your literature search and citing the results. Should your
group use R Markdown to produce the proposal, the R Markdown resources will help you format your
document and make it more presentable.

Dataset Resources:

• Ames Housing dataset
• NHANES survey dataset
• AirBnB dataset (needs you to create a free account)
• Million Song dataset
• NBA player dataset

Library Resources:

• How to search for academic articles
• Using search operators to find articles
• Limiting search to peer-reviewed articles
• Why and how to cite your references
• Help getting the correct citation format
• Exporting a citation

R Markdown Resources:

• Settings for displaying or not displaying R code in knitted document
• Adding captions and other plotting features
• Including multiple plots in a grid using patchwork or base R plot commands
• Creating tables in RMarkdown using kable or manually
• Exporting plots in RStudio

You may also wish to consider the writing resources posted on the General Resources Quercus page.
Alternatively, keep an eye on the course announcements for dedicated writing office hours with our English
Language Learning TA, Dory.

For some advice in formulating a research question and searching the academic literature, see our Tip Sheet
for Creating a Research Question, designed by Dory Abelman.

Criteria of Assessment:

Each criterion group below is scored as Excellent (2 points), Satisfactory (1 point), or Needs Revision (0 points).
Unless noted otherwise: Excellent = all three criteria are met; Satisfactory = only two criteria are met;
Needs Revision = one or fewer criteria are met.

Introduction Section

Proposed research question:
• The response variable of interest is clearly identifiable, and the predictors hypothesized to be related to the
response are explicitly stated (or at minimum groups of common predictor characteristics are listed).
• It is phrased using clear language and familiar terminology and makes a clear hypothesis about the population
relationship.
• It is directly connected to the stated importance/relevance of the project topic.

Literature summary:
• Three legitimate peer-reviewed articles are summarized.
• The main result of each article is summarized concisely and in the context of the original study population.
• A strong and explicit connection is made between each article’s results and the proposed research question.

Suitability of linear regression:
• Uses appropriate terminology from the course materials.
• Provides a reasonable justification for why and how estimating a linear trend will answer the research question
proposed.
• Provides a reasonable justification for whether the focus of the model will be on interpretability (description)
or precision/accuracy (prediction).

Data Description Section

Description of data source:
• Where the data was sourced/downloaded from is explicitly mentioned with a corresponding citation in the
bibliography.
• The original usage or purpose of the dataset is described, and it is explicit how that usage differs from the
current research proposal.
• How the data were originally collected by the curator of the dataset is described and a corresponding reference
is cited from the bibliography.

Response variable summary:
• An appropriate and suitably presentable numerical or graphical summary is used to statistically describe the
response variable.
• A written description of the response variable highlights important features of the response distribution, in
the context of what is being measured/the research question.
• A justification for why the chosen response variable is suitable to be used in a linear regression model is
provided and is correct, based on the statistical summary presented.

Predictor variable summaries:
• An appropriate and suitably presentable numerical or graphical summary is used to statistically describe the
chosen predictor variables.
• Important/interesting variable characteristics (e.g. skews, abnormal values) or lack thereof are described, in
the context of what is being measured/the research question.
• A justification is provided for why the chosen predictor variables are relevant to answering the research
question, making explicit reference to the summarized literature and to any modifications to variables that have
been made.

Ethics Discussion Section (L0101/L0201/L2001/L2002 only)

• Answer correctly references some of the criteria discussed in the first ethics module.
• Response makes a reasonable and clear attempt to argue for its conclusion.
• Meets minimum and maximum word count.

Preliminary Model Results Section

Residual analysis of preliminary model:
• All plots needed for a complete residual analysis have been presented, are correct, and are easily readable with
appropriate axes and labels.
• Each assumption and condition is assessed and a conclusion for each is provided.
• Correct details are provided, with reference to the appropriate plot, to describe how such a conclusion was made
for each assumption and condition.

Preliminary model discussion:
• Model estimates from preliminary model are presented in an easily readable, understandable, and professional way.
• A discussion on what these estimates tell the reader about a possible answer to the research question is provided
in context, highlighting the effect of at least one numerical and one categorical predictor explicitly.
• A comparison is made between the preliminary model results and those summarised from the literature, and it is
discussed why these may be similar or different.

Overall Proposal Formatting

Scoring for this group: Excellent = all four criteria are met; Satisfactory = only three criteria are met;
Needs Revision = two or fewer criteria are met.
• The bibliography and in-text citations are formatted correctly using a consistent style.
• Word counts for each section are met or are no more than 15 words in excess.
• Headers and paragraphs are used effectively to increase readability and separate ideas for increased
comprehension.
• No R code or R output (other than plots) is displayed in the written proposal.

Total Points: /20
