Assignment
The labs will also familiarise students with the use of R, a statistical package
commonly used for data analytics both in industry and academic settings.
LEARNING OUTCOMES
• Understanding and using descriptive statistics to describe and provide an
overview of complex data sets
• Understanding and using univariate and multivariate analysis methods to
analyse data sets
• Understanding and using the main statistical functions of the software R
OVERVIEW OF THE ASSIGNMENT
For this assignment, you will be working with a dataset on the risk of heart
disease. An outcome variable of 1 indicates the person has been diagnosed with
heart disease, while an outcome variable of 0 indicates the person has not been
diagnosed with heart disease. You are going to conduct statistical analysis of the
links between clinical variables and heart disease.
For the analysis you will use several different methods to analyse this outcome:
descriptive statistics, linear regression, logistic regression and K-means clustering.
The analysis will be undertaken using R Studio (https://www.rstudio.com/). The significance level for this
assignment is α=5%, meaning that the p-value has to be less than 0.05 for the
variable to be significantly associated with the outcome.
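As a minimal sketch of how the 5% threshold is applied in R, the following compares a p-value against α. The data are simulated, and the names `chol` and `outcome` are hypothetical, not columns from the assignment dataset. The t-test is only one possible test, chosen here for illustration.

```r
# Simulated example only: 'chol' and 'outcome' are hypothetical names,
# not columns from the assignment dataset.
set.seed(1)
n <- 200
outcome <- rbinom(n, size = 1, prob = 0.4)
chol <- rnorm(n, mean = 200 + 15 * outcome, sd = 30)

alpha <- 0.05  # significance level stated in the assignment

# Welch two-sample t-test comparing chol between the two outcome groups
test_result <- t.test(chol ~ outcome)
p_value <- test_result$p.value

# The association is called significant only if p < alpha
is_significant <- p_value < alpha
cat("p-value:", signif(p_value, 3), "- significant at 5%:", is_significant, "\n")
```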
DESCRIPTION OF ANALYTICAL TASKS
This individual assignment consists of six parts and will include the following
tasks:
Part V – Clustering
NB - Please fill in all parts of the report, referring to the guidance provided in
each section.
IMPORTANT: You will also need to provide a copy of the R code used in the
four main sections of the data analysis as a report appendix.
SUBMISSION
• There is no minimum or maximum word count, but an indicative length to cover
all the required points is between 4000 and 6000 words. A good
report should not exceed 8000 words, including appendices.
the dataset (trends, outliers). Use boxplots and
histograms as appropriate to describe the data.
Linear Regression (Part 3): 20%
Present and describe the results from your linear regression. Define the outcome clearly and how you chose the variables for the model. Describe the results and say what you think they mean. Discuss the performance of the model.

Logistic Regression (Part 4): 26%
Present and describe the results from your logistic regression. Define the outcome clearly and how you chose the variables for the model. Describe the results and say what you think they mean. Discuss the performance of the model. Include a description and discussion of odds ratios.

K-Means Clustering (Part 5): 14%
Describe K means and what it is used for and clearly explain what it is being used for in this study. How did you choose the number of clusters? Provide clearly labelled plots and describe your results and what you think they mean in the context of your study aims.

Discussion (Part 6): 10%
Summarise clearly what variables had associations for each of the methods you used. How does what you found relate to other published literature? State any limitations of this study (the methods and the dataset). Reference the literature where appropriate.

Overall Presentation / Coherence: 5%
Tables and figures are clearly labelled and presented and well described in a way that it is clear the student understands the results and knows what they mean in the context of the aims of the study.

100 total marks
MARKING SCHEME – PRESENTATION
The oral presentation is marked out of 20 and is scaled to 10% of the total mark for this course.
The oral presentation will be marked as follows:
TEMPLATE REPORT
Contents
AIM OF THE ASSIGNMENT
Submission dates:
2.4. Patient Characteristics: Categorical Variables
4.2. Results of Logistic Regression: Variables associated with heart disease
6. Discussion
6.2. Logistic Regression
References
Appendix A
1. Introduction and Background
5% of Report Mark
Provide an introduction and background to the report, discussing the background of the
problem and stating your aims of this analysis.
2. Descriptive Statistics
20% of Report Mark
In this section, please describe the following:
Table 1. Template for descriptive statistics for numerical variables. Please note, all variables
and numbers are an example and not the actual results.
Age 63.55 ± 0.45 72
Table 2. Template for descriptive statistics for categorical variables. Please note, all
variables and numbers are an example and not the actual results.
Variable            Level                  Frequency (%), N=3700
Renal Impairment    Normal                 1169 (31.59)
                    Moderately impaired    740 (20.00)
                    Severely impaired      186 (5.03)
                    Unknown                1605 (43.38)
1. Describe the patients in the dataset based on your results in the tables.
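The summaries asked for in this section could be produced along the following lines. This is a sketch on simulated data using base R only. The data frame `patients` and its columns `age` and `renal` are hypothetical stand-ins for the real dataset, and the statistics shown (mean, standard deviation, median, frequency and percentage) are one reasonable choice rather than a prescribed set.

```r
# Simulated data; 'patients', 'age' and 'renal' are hypothetical names.
set.seed(2)
patients <- data.frame(
  age = round(rnorm(300, mean = 60, sd = 10)),
  renal = factor(sample(
    c("Normal", "Moderately impaired", "Severely impaired", "Unknown"),
    300, replace = TRUE
  ))
)

# Numerical variable: mean, standard deviation and median
age_summary <- c(mean = mean(patients$age),
                 sd = sd(patients$age),
                 median = median(patients$age))
print(round(age_summary, 2))

# Categorical variable: frequency and percentage per level
renal_n <- table(patients$renal)
renal_pct <- round(100 * prop.table(renal_n), 2)
renal_table <- data.frame(Level = names(renal_n),
                          N = as.vector(renal_n),
                          Percent = as.vector(renal_pct))
print(renal_table)

# Plots for a numerical variable: distribution shape and outliers
hist(patients$age, main = "Histogram of age", xlab = "Age (years)")
boxplot(patients$age, main = "Boxplot of age", ylab = "Age (years)")
```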
3. Linear Regression
20% of Report Mark
1. Give background on linear regression. How does it work and what is it used for?
3. Describe how you choose the variables for your linear regression model
4. Describe how you measure the performance of your linear regression model.
Table 3. Template for linear regression model. Please note, all numbers and
variables in the table are an example, not the actual results.
2. Describe the variables in the linear regression model. How are the variables associated
with the outcome?
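A linear regression of the kind described here might be fitted and assessed as sketched below. The data are simulated, and the outcome `sbp` and predictors `age` and `bmi` are hypothetical. The assignment leaves the choice of outcome and variables to you, and R-squared and RMSE are only two of several possible performance measures.

```r
# Simulated data; 'sbp', 'age' and 'bmi' are hypothetical variable names.
set.seed(3)
n <- 250
dat <- data.frame(age = rnorm(n, 55, 9), bmi = rnorm(n, 27, 4))
dat$sbp <- 90 + 0.6 * dat$age + 0.8 * dat$bmi + rnorm(n, sd = 10)

# Fit the model: continuous outcome explained by two predictors
fit <- lm(sbp ~ age + bmi, data = dat)

# Coefficient table: estimate, standard error, t value and p-value
coefs <- summary(fit)$coefficients
print(round(coefs, 4))

# 95% confidence intervals for the coefficients
print(round(confint(fit), 3))

# Performance: proportion of variance explained and typical error size
r2 <- summary(fit)$r.squared
adj_r2 <- summary(fit)$adj.r.squared
rmse <- sqrt(mean(residuals(fit)^2))
cat("R-squared:", round(r2, 3),
    " adjusted R-squared:", round(adj_r2, 3),
    " RMSE:", round(rmse, 2), "\n")
```

Each coefficient is the expected change in the outcome for a one-unit increase in that variable, holding the others fixed.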
4. Logistic Regression
26% of Report Mark
1. Give background about logistic regression. How does it work and what is it used for?
3. Give background to odds ratios. How are they calculated and what do they describe?
4. Describe how you choose the variables for your logistic regression model.
5. Describe how you develop your model, using training and testing data and forward
selection.
6. Describe how you measure the performance of your logistic regression model.
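To illustrate how an odds ratio is calculated, the sketch below works from a 2x2 table of made-up counts (they are not taken from the assignment dataset). The odds ratio is the odds of disease in the exposed group divided by the odds in the unexposed group. The confidence interval uses the standard Wald approximation on the log scale.

```r
# Hypothetical counts, for illustration only:
#              disease   no disease
# exposed         40         60
# unexposed       20         80
exp_dis <- 40;   exp_nodis <- 60
unexp_dis <- 20; unexp_nodis <- 80

# Odds of disease within each group
odds_exposed <- exp_dis / exp_nodis
odds_unexposed <- unexp_dis / unexp_nodis

# Odds ratio: how many times higher the odds are in the exposed group
odds_ratio <- odds_exposed / odds_unexposed

# Wald 95% confidence interval, computed on the log-odds-ratio scale
se_log_or <- sqrt(1 / exp_dis + 1 / exp_nodis + 1 / unexp_dis + 1 / unexp_nodis)
ci <- exp(log(odds_ratio) + c(-1, 1) * qnorm(0.975) * se_log_or)

cat("OR:", round(odds_ratio, 2),
    " 95% CI:", round(ci[1], 2), "-", round(ci[2], 2), "\n")
```

An odds ratio above 1 indicates higher odds of the outcome in the exposed group. If the 95% interval excludes 1, the association is significant at the 5% level.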
1. Provide the results of the Logistic Regression analysis using the table template provided
below.
2. Which variables are significantly associated with the outcome based on unadjusted odds
ratios?
3. For which variables has the significance diminished after covariate adjustment?
Table 4. Template for unadjusted and adjusted odds ratios. Please note, the variables and
the numbers in the table are an example only and are not the actual results.
Variable           Level                 Unadjusted OR (95% CI)   P-value    Adjusted OR (95% CI)   P-value
Renal Impairment   Moderately Impaired   1.89 (1.22-2.01)         0.0283     1.04 (0.78-1.23)       0.0578
                   Severely Impaired     5.66 (3.56-8.78)         <0.0001    3.45 (2.03-4.08)       <0.001
                   Unknown               3.57 (2.98-4.04)         <0.001     2.55 (1.89-3.21)       0.0174
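One way to obtain unadjusted and adjusted odds ratios in R is sketched below on simulated data. The variables `smoker` and `age` and the helper `or_table()` are hypothetical and introduced only for this illustration. The unadjusted model contains a single variable, and the adjusted model adds a covariate. Wald intervals from `confint.default()` are used here as a simple choice.

```r
# Simulated data; 'smoker', 'age' and 'outcome' are hypothetical names.
set.seed(4)
n <- 500
age <- rnorm(n, 60, 10)
# Smoking is made more likely at older ages, so age confounds its effect
smoker <- rbinom(n, 1, plogis((age - 60) / 10))
outcome <- rbinom(n, 1, plogis(-1 + 0.05 * (age - 60) + 0.5 * smoker))
dat <- data.frame(outcome, age, smoker)

# Hypothetical helper: odds ratios, Wald 95% CIs and p-values from a fit
or_table <- function(fit) {
  est <- coef(fit)
  ci <- confint.default(fit)
  p <- summary(fit)$coefficients[, "Pr(>|z|)"]
  out <- data.frame(OR = exp(est), lower = exp(ci[, 1]),
                    upper = exp(ci[, 2]), p_value = p)
  out[rownames(out) != "(Intercept)", ]
}

# Unadjusted: one variable at a time
unadjusted <- or_table(glm(outcome ~ smoker, family = binomial, data = dat))
# Adjusted: the same variable with a covariate in the model
adjusted <- or_table(glm(outcome ~ smoker + age, family = binomial, data = dat))

print(round(unadjusted, 4))
print(round(adjusted, 4))
```

Comparing the two rows for the same variable shows whether its association weakens once other covariates are taken into account.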
Using forward selection, develop a prediction model predicting Outcome based on training
data.
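The training/testing split and forward selection could be set up roughly as follows. This is a sketch on simulated data with hypothetical variables. The 70/30 split is an assumption, since the assignment does not state a proportion. Note also that base R's `step()` selects variables by AIC rather than by p-value.

```r
# Simulated data; all variable names are hypothetical.
set.seed(5)
n <- 600
dat <- data.frame(age = rnorm(n, 60, 10),
                  bmi = rnorm(n, 27, 4),
                  noise = rnorm(n))  # unrelated to the outcome by construction
dat$outcome <- rbinom(n, 1,
                      plogis(-0.5 + 0.06 * (dat$age - 60) + 0.08 * (dat$bmi - 27)))

# Random 70/30 split into training and testing data (proportion assumed)
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Start from the intercept-only model and add variables one at a time
null_model <- glm(outcome ~ 1, family = binomial, data = train)
final_model <- step(null_model,
                    scope = list(lower = ~ 1, upper = ~ age + bmi + noise),
                    direction = "forward", trace = 0)
print(summary(final_model)$coefficients)

# Predicted probabilities on the held-out testing data
test$pred_prob <- predict(final_model, newdata = test, type = "response")
```

The model is fitted on the training data only, and the testing data are kept aside to assess its performance.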
1. Provide a table of your final logistic regression model, based on the table template
below.
Table 5. Template for logistic regression model coefficients. Please note, the variables and
the numbers in the table are an example only and are not the actual results.
2. Discuss your model’s performance based on area under the curve, sensitivity,
specificity, and negative and positive predictive values.
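The performance measures named above can be computed directly from predicted probabilities and true labels, as in this base R sketch. The data are simulated, and the 0.5 classification threshold is an assumption. The AUC is computed with the rank-based (Mann-Whitney) formula rather than an add-on package.

```r
# Simulated true labels and predicted probabilities, for illustration only
set.seed(6)
truth <- rbinom(300, 1, 0.4)
prob <- plogis(rnorm(300, mean = ifelse(truth == 1, 0.8, -0.8)))

threshold <- 0.5  # assumed cut-off for classifying as diseased
pred <- as.integer(prob >= threshold)

# Confusion matrix counts
tp <- sum(pred == 1 & truth == 1)
tn <- sum(pred == 0 & truth == 0)
fp <- sum(pred == 1 & truth == 0)
fn <- sum(pred == 0 & truth == 1)

sensitivity <- tp / (tp + fn)  # diseased patients correctly identified
specificity <- tn / (tn + fp)  # non-diseased patients correctly identified
ppv <- tp / (tp + fp)          # predicted positives that are truly diseased
npv <- tn / (tn + fn)          # predicted negatives that are truly healthy

# AUC: probability that a random diseased patient gets a higher
# predicted probability than a random non-diseased patient
r <- rank(prob)
n1 <- sum(truth == 1)
n0 <- sum(truth == 0)
auc <- (sum(r[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(round(c(sensitivity = sensitivity, specificity = specificity,
              ppv = ppv, npv = npv, auc = auc), 3))
```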
5. K-Means Clustering
14% of Report Mark
1. Provide background on K-Means clustering. How does it work and what is it used for?
2. Describe how the number of clusters is chosen for your analysis (using the elbow
method, silhouette method and gap statistic).
3. Describe what you are using to evaluate the results (e.g. plots).
1. Show how you would choose the number of clusters, using the elbow, silhouette and
gap statistic methods.
3. Discuss the plots you have generated based on the clusters that have formed
4. Assess the relationship between clusters and the target variable and any other variables
of interest
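The elbow method mentioned above can be sketched in base R as follows. The data are simulated, with hypothetical variables `var1` and `var2`. Only the elbow method is shown: the silhouette and gap statistic are usually computed with add-on packages such as `cluster`. The final cross-tabulation stands in for comparing clusters with the target variable.

```r
# Simulated data with three groups; 'var1' and 'var2' are hypothetical.
set.seed(7)
x <- rbind(
  cbind(rnorm(50, 0), rnorm(50, 0)),
  cbind(rnorm(50, 5), rnorm(50, 5)),
  cbind(rnorm(50, 0), rnorm(50, 6))
)
colnames(x) <- c("var1", "var2")
true_group <- rep(1:3, each = 50)

# Standardise so that each variable contributes equally to distances
x_scaled <- scale(x)

# Elbow method: total within-cluster sum of squares for k = 1..8
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(x_scaled, centers = k, nstart = 25)$tot.withinss
})
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow plot")

# Fit the chosen number of clusters (3 here, matching the simulation)
fit <- kmeans(x_scaled, centers = 3, nstart = 25)

# Cross-tabulate clusters against a known grouping
print(table(cluster = fit$cluster, group = true_group))
```

The elbow is the value of k after which adding further clusters reduces the within-cluster sum of squares only slightly.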
6. Discussion
10% of Report Mark
Provide a discussion for each of the three sections of your analysis (linear regression, logistic
regression, clustering):
2. Explain why these variables are associated with the outcome by bringing examples
from other studies
5. Summarise your results and explain how your analysis could improve healthcare.
References
Appendix A
Provide a copy of the R code for the four main sections of the report. The code should be
executable: a marker should be able to copy and paste your script into R Studio and run it
should they wish. Comment the code to state briefly but clearly what the main functionality
of the script is.