
Assignment

This document outlines an individual assignment aimed at providing students with practical experience in handling and analyzing health datasets, specifically focusing on heart disease risk. Students will utilize R for statistical analysis, covering descriptive statistics, linear and logistic regression, and clustering methods. The assignment includes a detailed marking scheme and report structure, emphasizing the importance of clear presentation and adherence to guidelines.

Uploaded by

samwaceke214

Guidelines, Report Template, and Marking Scheme

AIM OF THE ASSIGNMENT


The aim of this individual assignment is to provide students with the practical
experience of handling, processing and analysing health datasets (a clinical
dataset will be used for this exercise), and applying a range of analytical methods
to infer new knowledge from ‘raw’ health and care data.

The labs will also familiarise students with the use of R, a statistical package
commonly used for data analytics both in industry and academic settings.

LEARNING OUTCOMES
• Understand and use descriptive statistics to describe and provide an
overview of complex data sets.
• Understand and use univariate and multivariate analysis methods to
analyse data sets.
• Understand and use the main statistical functions of the software R.

OVERVIEW OF THE ASSIGNMENT
For this assignment, you will be working with a dataset on the risk of heart
disease. An outcome variable of 1 indicates the person has been diagnosed with
heart disease, while an outcome variable of 0 indicates the person has not been
diagnosed with heart disease. You are going to conduct statistical analysis of the
links between clinical variables and heart disease.

The analysis will be carried out using R Studio.

https://www.rstudio.com/

AIMS OF DATA ANALYSIS


The primary outcome (what you are measuring) for this case study will be heart
disease encoded in the target variable, i.e. whether (1) or not (0) the participant
has heart disease.

For the analysis you will use several different methods to analyse this outcome:

(i) Descriptive statistics using measures of central tendency and spread,
and visualisations;
(ii) Linear regression for predicting age from relevant continuous
variables;
(iii) Logistic regression for predicting heart disease from relevant
variables; and
(iv) K-means clustering to group patients and assess the relationship of
the clusters with heart disease.

The analysis will be undertaken using R Studio. The significance level for this
assignment is α=5%, meaning that the p-value has to be less than 0.05 for the
variable to be significantly associated with the outcome.

DESCRIPTION OF ANALYTICAL TASKS
This individual assignment consists of six parts and will include the following
tasks:

Part I – Introduction & Background

Part II – Descriptive Statistics

Part III – Linear Regression

Part IV- Logistic Regression

Part V – Clustering

Part VI- Discussion

NB - Please fill in all parts of the report, referring to the guidance provided in
each section.

IMPORTANT: You will also need to provide a copy of the R code used in the
four main sections of the data analysis as a report appendix.

SUBMISSION
• There is no word count (min or max) but an indicative word count to cover
all the required points would be between 4000 and 6000 words. A good
report should not exceed 8000 words including appendices.

Lab Assessment (coursework A)

Marking: 50% of final class mark for CS979
- 40% allocated to the report
- 10% to the oral presentation

Report component                      Share of report mark
Introduction and Background           5%
Descriptive Statistics                20%
Linear Regression                     20%
Logistic Regression                   26%
K-Means Clustering                    14%
Discussion                            10%
Overall Report Presentation /         5%
coherence

MARKING SCHEME – REPORT

Each report chapter is listed with its share of the report mark and the
criteria for top marks.

Introduction and Background (Part 1): 5%
Provide a clear introduction and background to the problem stating why the
work is necessary/important; what the aims of this study are, and reference
the literature appropriately.

Descriptive Statistics (Part 2): 20%
You should describe the dataset in full. Say where the data comes from, what
it contains, the date range, and what types of patients and procedures the
dataset contains. Think about describing means and standard deviations as
appropriate and also comment on the results in Table 1 (your descriptive
statistics table). Tell the reader which variables are numerical and
categorical and describe these in the tables as frequencies or means as
appropriate. Say what, if anything, is interesting about the dataset (trends,
outliers). Use boxplots and histograms as appropriate to describe the data.

Linear Regression (Part 3): 20%
Present and describe the results from your linear regression. Define the
outcome clearly and how you chose the variables for the model. Describe the
results and say what you think they mean. Discuss the performance of the
model.

Logistic Regression (Part 4): 26%
Present and describe the results from your logistic regression. Define the
outcome clearly and how you chose the variables for the model. Describe the
results and say what you think they mean. Discuss the performance of the
model. Include a description and discussion of odds ratios.

K-Means Clustering (Part 5): 14%
Describe K-means and what it is used for and clearly explain what it is being
used for in this study. How did you choose the number of clusters? Provide
clearly labelled plots and describe your results and what you think they mean
in the context of your study aims.

Discussion (Part 6): 10%
Summarise clearly what variables had associations for each of the methods you
used. How does what you found relate to other published literature? State any
limitations of this study (the methods and the dataset). Reference the
literature where appropriate.

Overall Presentation / Coherence: 5%
Tables and figures are clearly labelled and presented and well described in a
way that it is clear the student understands the results and knows what they
mean in the context of the aims of the study.

Total: 100 marks

MARKING SCHEME – PRESENTATION

The oral presentation is marked out of 20 and is scaled to 10% of the total for this course.
The oral presentation will be marked as follows:

Each criterion is given a mark (0 – no, 1 – partially, 2 – yes) and comments,
for a total out of 20.

CRITERIA

- Did the student present/describe the problem and background clearly?

- Did the student present/describe the datasets and descriptive statistics clearly?

- Did the student present/describe the linear regression methods and the model clearly?

- Did the student present/describe the linear regression model performance clearly?

- Did the student present/describe the logistic regression methods and model clearly?

- Did the student present/describe the logistic regression model’s performance clearly?

- Did the student present/describe the K-Means clustering methods and findings clearly?

- Did the student present/describe the conclusions based on their results clearly?

- Did the student present/describe the plots and other visualisations clearly?

- Did the student present professionally and answer questions with insight?

TEMPLATE REPORT

Contents

AIM OF THE ASSIGNMENT
LEARNING OUTCOMES
Submission dates:
OVERVIEW OF THE ASSIGNMENT
AIMS OF DATA ANALYSIS
DESCRIPTION OF ANALYTICAL TASKS
Project Title

1. Introduction and Background

2. Descriptive Statistics
2.1. Data and Procedures
2.2. Methods for Describing Data
2.3. Patient Characteristics: Numerical Variables
2.4. Patient Characteristics: Categorical Variables

3. Linear Regression
3.1. Methods: Linear Regression
3.2. Results of Linear Regression

4. Logistic Regression
4.1. Methods Overview: Logistic Regression
4.2. Results of Logistic Regression: Variables associated with heart disease
4.3. Results of Logistic Regression: Developing a prediction model
4.4. Results of Logistic Regression: Model performance

5. K-Means Clustering
5.1. Methods: K-Means Clustering
5.2. Results of K-Means Clustering

6. Discussion
6.1. Linear Regression
6.2. Logistic Regression
6.3. K-Means Clustering

References

Appendix A
A.1 R code used in the Descriptive Statistics section
A.2 R code used in the Linear Regression analysis section
A.3 R code used in the Logistic Regression analysis section
A.4 R code used in the K-Means clustering analysis section

1. Introduction and Background
5% of Report Mark
Provide an introduction and background to the report, discussing the background of the
problem and stating your aims of this analysis.

Use references to the literature where appropriate.

2. Descriptive Statistics
20% of Report Mark
In this section, please describe the following:

2.1. Data and Procedures


- Where does the data come from

- When was the data recorded

- How the data is used

- How many patients there are in the dataset

- What is recorded in the dataset

- Which procedures are recorded in the dataset

- If there is missing data, what is done to handle that

2.2. Methods for Describing Data


- Explaining mean, median, standard deviation (including formulae how these are
calculated)

- Using frequency tables

- Using histograms and box plots to visualise data
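
As a minimal sketch of how the measures listed above are calculated, the R snippet below works through the mean, the sample standard deviation and the median by hand and compares them with the built-in functions. The `age` values are made up for illustration and are not taken from the assignment dataset.

```r
# Hypothetical ages for eight patients (illustrative values only)
age <- c(52, 61, 45, 70, 58, 66, 49, 63)
n <- length(age)

# Mean: sum of the values divided by the number of values
mean_manual <- sum(age) / n

# Sample standard deviation: square root of the summed squared
# deviations from the mean, divided by (n - 1)
sd_manual <- sqrt(sum((age - mean_manual)^2) / (n - 1))

# Median: middle value of the sorted data (average of the two
# middle values when n is even)
sorted <- sort(age)
median_manual <- if (n %% 2 == 0) {
  (sorted[n / 2] + sorted[n / 2 + 1]) / 2
} else {
  sorted[(n + 1) / 2]
}

# The built-in functions should agree with the manual versions
print(all.equal(mean_manual, mean(age)))
print(all.equal(sd_manual, sd(age)))
print(all.equal(median_manual, median(age)))
```

Note that `sd()` in R uses the sample formula with n - 1 in the denominator, which is worth stating when the formula is written out in the report.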

2.3. Patient Characteristics: Numerical & Categorical Variables

1. Using R Studio, calculate and report on appropriate measures of central tendency and
spread for each variable in the dataset.

2. Provide your answers in one or more tables following the template:

Table 1. Template for descriptive statistics for numerical variables. Please note, all variables
and numbers are an example and not the actual results.

Variable    Central tendency    Spread
Age         63.55 ± 0.45        72

Table 2. Template for descriptive statistics for categorical variables. Please note, all
variables and numbers are an example and not the actual results.

Variable            Level                  Frequency (%), N=3700
Renal Impairment    Normal                 1169 (31.59)
                    Moderately impaired    740 (20.00)
                    Severely impaired      186 (5.03)
                    Unknown                1605 (43.38)

3. Describe the patients in the dataset based on your results in the tables.

4. Provide up to two figures visualising a selection of variables of interest.

5. Describe the plots.
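
The steps above could be sketched in base R roughly as follows. This is only an illustration on simulated data: the data frame `heart` and its columns (`age`, `chol`, `sex`, `target`) are assumed names, not the actual variables of the assignment dataset.

```r
set.seed(1)  # make the simulated example reproducible

# Simulated stand-in for the heart disease data; column names are assumptions
heart <- data.frame(
  age    = round(rnorm(200, mean = 55, sd = 9)),
  chol   = round(rnorm(200, mean = 245, sd = 50)),
  sex    = factor(sample(c("female", "male"), 200, replace = TRUE)),
  target = rbinom(200, size = 1, prob = 0.45)
)

# Numerical variables: mean, standard deviation, median and range
num_vars <- c("age", "chol")
num_summary <- data.frame(
  variable = num_vars,
  mean     = sapply(heart[num_vars], mean),
  sd       = sapply(heart[num_vars], sd),
  median   = sapply(heart[num_vars], median),
  min      = sapply(heart[num_vars], min),
  max      = sapply(heart[num_vars], max),
  row.names = NULL
)
print(num_summary)

# Categorical variable: counts and percentages of each level
sex_counts <- table(heart$sex)
sex_pct <- round(100 * prop.table(sex_counts), 2)
print(cbind(n = sex_counts, percent = sex_pct))

# Histogram bins and boxplot statistics; drop plot = FALSE, or call
# boxplot(age ~ target, data = heart), to draw the figures
age_hist <- hist(heart$age, plot = FALSE)
age_box <- boxplot.stats(heart$age)
```

The `out` element of `boxplot.stats()` lists the points that a box plot would draw as outliers, which can help when commenting on unusual values.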

3. Linear Regression
20% of Report Mark

3.1. Methods: Linear Regression


Please describe the following:

1. Give background on linear regression. How does it work and what is it used for?

2. Define the outcome of the linear regression analysis

3. Describe how you choose the variables for your linear regression model

4. Describe how you measure the performance of your linear regression model.

3.2. Results of Linear Regression


1. Provide the results of the linear regression analysis using the table provided below

Table 3. Template for linear regression model. Please note, all numbers and
variables in the table are as an example, not the actual results.

Variable            Level                  Estimate    Std. Error    P-value
Intercept                                  2.4790      1.01          0.0144
Age                                        0.1340      0.02          <0.0001
Renal Impairment    Moderately Impaired    1.0360      0.78          0.0250
                    Severely Impaired      2.4409      0.38          0.0017
                    Unknown                1.5879      0.46          <0.0001

2. Describe the variables in the linear regression model. How are the variables associated
with the outcome?

3. Describe the performance of your model.
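
A minimal sketch of fitting and summarising a linear regression in R is given below, with age as the outcome as described in the aims. The data are simulated, and the predictor names `chol` and `trestbps` as well as the effect sizes are assumptions made purely for illustration.

```r
set.seed(2)

# Simulated data; variable names and effect sizes are assumptions
n <- 300
chol     <- rnorm(n, mean = 245, sd = 50)
trestbps <- rnorm(n, mean = 130, sd = 17)
age      <- 20 + 0.05 * chol + 0.18 * trestbps + rnorm(n, sd = 8)
heart    <- data.frame(age, chol, trestbps)

# Fit a linear model with age as the outcome
fit <- lm(age ~ chol + trestbps, data = heart)

# Coefficient table: Estimate, Std. Error, t value and p-value
coef_table <- summary(fit)$coefficients
print(round(coef_table, 4))

# Performance: R-squared and root mean squared error of the residuals
r_squared <- summary(fit)$r.squared
rmse <- sqrt(mean(residuals(fit)^2))
print(c(r_squared = r_squared, rmse = rmse))
```

The `Estimate`, `Std. Error` and `Pr(>|t|)` columns of `coef_table` correspond to the columns of the Table 3 template.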

4. Logistic Regression
26% of Report Mark

4.1. Methods Overview: Logistic Regression


Please describe the following:

1. Give background about logistic regression. How does it work and what is it used for?

2. Define the outcome of the logistic regression analysis.

3. Give background to odds ratios. How are they calculated and what do they describe?

4. Describe how you choose the variables for your logistic regression model.

5. Describe how you develop your model, using training and testing data and forward
selection.

6. Describe how you measure the performance of your logistic regression model.
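
To make the odds ratio concrete, the sketch below calculates one by hand from a hypothetical 2x2 table and shows that the exponentiated coefficient of a logistic regression gives the same value. The counts and the variable name `exposed` are invented for illustration.

```r
# Hypothetical 2x2 counts (illustrative only):
#                 disease   no disease
# exposed            40         60
# not exposed        20         80
exposed     <- c(disease = 40, no_disease = 60)
not_exposed <- c(disease = 20, no_disease = 80)

# Odds = P(disease) / P(no disease) within each group
odds_exposed     <- exposed["disease"] / exposed["no_disease"]
odds_not_exposed <- not_exposed["disease"] / not_exposed["no_disease"]

# Odds ratio: how many times larger the odds are in the exposed group
or_manual <- unname(odds_exposed / odds_not_exposed)

# The same value comes from a logistic regression: exp(coefficient)
dat <- data.frame(
  target  = rep(c(1, 0, 1, 0), times = c(40, 60, 20, 80)),
  exposed = rep(c(1, 0), times = c(100, 100))
)
fit <- glm(target ~ exposed, data = dat, family = binomial)
or_glm <- unname(exp(coef(fit)["exposed"]))

print(c(manual = or_manual, glm = or_glm))
```

An odds ratio above 1 means the odds of the outcome are higher in the exposed group, below 1 means they are lower, and 1 means no association.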

4.2. Results of Logistic Regression: Variables associated with heart disease

Undertake multivariate logistic regression analysis to find which variables are significantly
associated with the target variable based on unadjusted odds ratios.

1. Provide the results of the Logistic Regression analysis using the table template provided
below.

2. Which variables are significantly associated with the outcome based on unadjusted odds
ratios?

3. For which variables has the significance diminished after covariate adjustment?

Table 4. Template for unadjusted and adjusted odds ratios. Please note, the variables and
the numbers in the table are an example only and are not the actual results.

                                           Unadjusted                     Adjusted
Variable            Level                  OR (95% CI)         P-value    OR (95% CI)         P-value
Renal Impairment    Moderately Impaired    1.89 (1.22-2.01)    0.0283     1.04 (0.78-1.23)    0.0578
                    Severely Impaired      5.66 (3.56-8.78)    <0.0001    3.45 (2.03-4.08)    <0.001
                    Unknown                3.57 (2.98-4.04)    <0.001     2.55 (1.89-3.21)    0.0174
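
One possible way to produce the unadjusted and adjusted columns of Table 4 is sketched below: unadjusted odds ratios come from one model per variable, adjusted odds ratios from a single model containing all variables. The data are simulated, the helper `or_table()` and the column names are hypothetical, and the intervals are Wald intervals from `confint.default()`, which is one of several acceptable choices.

```r
set.seed(3)

# Simulated data; names and effects are assumptions for illustration
n <- 500
age    <- rnorm(n, mean = 55, sd = 9)
chol   <- 150 + 1.5 * age + rnorm(n, sd = 40)   # correlated with age
target <- rbinom(n, 1, plogis(-5 + 0.08 * age + 0.002 * chol))
heart  <- data.frame(target, age, chol)

# Helper: odds ratios with Wald 95% confidence intervals and p-values
or_table <- function(fit) {
  est <- coef(summary(fit))
  ci  <- confint.default(fit)          # Wald interval on the log-odds scale
  out <- cbind(OR = exp(est[, "Estimate"]), exp(ci), p = est[, "Pr(>|z|)"])
  out[rownames(out) != "(Intercept)", , drop = FALSE]
}

# Unadjusted: one model per variable
unadjusted <- rbind(
  or_table(glm(target ~ age,  data = heart, family = binomial)),
  or_table(glm(target ~ chol, data = heart, family = binomial))
)

# Adjusted: all variables in one model
adjusted <- or_table(glm(target ~ age + chol, data = heart, family = binomial))

print(round(unadjusted, 4))
print(round(adjusted, 4))
```

Comparing the two printed tables row by row shows which associations weaken once the other covariates are taken into account.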

4.3. Results of Logistic Regression: Developing a prediction model


Split your data into training and testing data.

Using forward selection, develop a prediction model predicting Outcome based on training
data.

1. Provide a table of your final logistic regression model, based on the table template
below.

2. Describe the table of your final logistic regression model.

3. Which variables are included in the final prediction model?

Table 5. Template for logistic regression model coefficients. Please note, the variables and
the numbers in the table are an example only and are not the actual results.

Variable     Level    Estimate    Std. Error    P-value
Intercept             10.3978     0.29          <0.0001
Age                   2.4409      0.78          0.0017
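
The split and forward selection described in this section might look roughly like the sketch below. It uses simulated data with assumed column names, a 70/30 split, and `step()` from base R, which selects by AIC; the lab material may use a different selection criterion, so treat the details as assumptions.

```r
set.seed(4)

# Simulated data; variable names are assumptions for illustration
n <- 400
heart <- data.frame(
  age     = rnorm(n, 55, 9),
  chol    = rnorm(n, 245, 50),
  thalach = rnorm(n, 150, 22),
  noise   = rnorm(n)                      # unrelated to the outcome
)
heart$target <- rbinom(n, 1, plogis(-1 + 0.06 * heart$age - 0.02 * heart$thalach))

# 70/30 split into training and testing rows (the ratio is a choice)
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- heart[train_idx, ]
test  <- heart[-train_idx, ]

# Forward selection: start from the intercept-only model and add the
# variable that most improves AIC until no addition helps
null_model <- glm(target ~ 1, data = train, family = binomial)
final_model <- step(
  null_model,
  scope = list(lower = ~ 1, upper = ~ age + chol + thalach + noise),
  direction = "forward",
  trace = 0
)

print(summary(final_model)$coefficients)
```

Setting `trace = 1` prints each step of the selection, which can be useful when describing how the final model was reached.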

4.4. Results of Logistic Regression: Model performance


1. Using the test data, generate the receiver operating characteristic (ROC) curve.

2. Discuss your model’s performance based on area under the curve, sensitivity,
specificity, and negative and positive predictive values.
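
The performance measures named above can be computed in base R as in the following sketch. The outcomes and predicted probabilities are simulated, and the 0.5 cut-off is an arbitrary choice for illustration; `final_model` and `test` in the comment are hypothetical object names.

```r
set.seed(5)

# Simulated test-set outcomes and predicted probabilities (assumptions);
# in practice: prob <- predict(final_model, newdata = test, type = "response")
actual <- rbinom(200, 1, 0.4)
prob   <- plogis(rnorm(200, mean = ifelse(actual == 1, 1, -1)))

# Confusion-matrix measures at a 0.5 cut-off (the cut-off is a choice)
pred <- as.integer(prob >= 0.5)
tp <- sum(pred == 1 & actual == 1)
tn <- sum(pred == 0 & actual == 0)
fp <- sum(pred == 1 & actual == 0)
fn <- sum(pred == 0 & actual == 1)

sensitivity <- tp / (tp + fn)   # diseased patients correctly flagged
specificity <- tn / (tn + fp)   # healthy patients correctly cleared
ppv <- tp / (tp + fp)           # positive predictions that are correct
npv <- tn / (tn + fn)           # negative predictions that are correct

# ROC curve: sensitivity against 1 - specificity over all cut-offs
cuts <- sort(unique(c(0, prob, 1)), decreasing = TRUE)
tpr <- sapply(cuts, function(k) mean(prob[actual == 1] >= k))
fpr <- sapply(cuts, function(k) mean(prob[actual == 0] >= k))

# AUC: probability that a random case scores higher than a random non-case
r <- rank(prob)
n1 <- sum(actual == 1)
n0 <- sum(actual == 0)
auc <- (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# plot(fpr, tpr, type = "l") would draw the curve
print(c(auc = auc, sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv))
```

An AUC of 0.5 corresponds to random guessing and 1 to perfect discrimination, so the reported value should be read against that range.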

5. K-Means Clustering
14% of Report Mark

5.1. Methods: K-Means Clustering


Please describe the following:

1. Provide background on K-Means clustering. How does it work and what is it used for?

2. Describe how the number of clusters is chosen for your analysis (using the elbow
method, silhouette method and gap statistic).

3. Describe what you are using to evaluate the results (e.g. plots).

5.2. Results of K-Means Clustering


In R Studio, carry out k-means clustering analysis.

1. Show how you would choose the number of clusters, using the elbow, silhouette and
gap statistic methods.

2. Provide plots visualising clusters with respect to important variable pairs

3. Discuss the plots you have generated based on the clusters that have formed

4. Assess the relationship between clusters and the target variable and any other variables
of interest
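
As a rough illustration of these steps, the sketch below runs k-means in base R, computes the within-cluster sum of squares used by the elbow method, and cross-tabulates the clusters against the outcome. The data are simulated with assumed column names, and k = 2 is chosen only for the example. The silhouette and gap statistic methods are not shown because they need an add-on package (for instance `silhouette()` and `clusGap()` from the `cluster` package).

```r
set.seed(6)

# Simulated numeric data with two loose groups (an assumption)
heart <- data.frame(
  age     = c(rnorm(100, 45, 5), rnorm(100, 63, 5)),
  thalach = c(rnorm(100, 165, 12), rnorm(100, 135, 12)),
  target  = c(rbinom(100, 1, 0.3), rbinom(100, 1, 0.7))
)

# k-means uses Euclidean distance, so put variables on a common scale
x <- scale(heart[, c("age", "thalach")])

# Elbow method: total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
# plot(1:8, wss, type = "b") would show where the curve flattens

# Fit the chosen number of clusters (k = 2 here purely for illustration)
km <- kmeans(x, centers = 2, nstart = 25)

# Relationship between cluster membership and the outcome
cluster_vs_target <- table(cluster = km$cluster, target = heart$target)
print(cluster_vs_target)
print(chisq.test(cluster_vs_target)$p.value)
```

A scatter plot such as `plot(heart$age, heart$thalach, col = km$cluster)` is one way to visualise the clusters for a variable pair.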

6. Discussion
10% of Report Mark

Provide a discussion for each of the three sections of your analysis (linear regression, logistic
regression, clustering):

1. Summarise which variables are significantly associated with the outcomes

2. Explain why these variables are associated with the outcome by bringing examples
from other studies

3. Compare your results with some other studies

4. State the limitations of the study

5. Summarise your results and explain how your analysis could improve healthcare.

6. Use references to the literature where appropriate.

6.1. Linear Regression

6.2. Logistic Regression

6.3. K-Means Clustering

References

Appendix A
Provide a copy of the R code for the four main sections of the report. A marker should be able
to run the code by copying and pasting your script into R Studio, should they wish to. The
code should be commented to state briefly but clearly what the main functionality of the
script is.

A.1 R code used in the Descriptive Statistics section

A.2 R code used in the Linear Regression analysis section

A.3 R code used in the Logistic Regression analysis section

A.4 R code used in the K-Means clustering analysis section
