#1660908-Data Management and Statistical Computing
DMC ASSIGNMENT 2
By Student Name
Course
Tutor
University
Date
DMC Assignment 2
Part 1: R
Question 1
> library(readr)
Warning message:
package ‘readr’ was built under R version 3.6.3
> Diabetes <- read_csv("Diabetes.csv")
Parsed with column specification:
cols(
id = col_double(),
location = col_character(),
staff = col_character(),
age = col_double(),
gender = col_character(),
height = col_double(),
weight = col_double(),
waist = col_double(),
hip = col_double(),
bp1s = col_double(),
bp1d = col_double(),
bp2s = col_double(),
bp2d = col_double(),
glyhb = col_double()
)
>
> Diabetes$location <- factor(Diabetes$location)
> Diabetes$gender <- factor(Diabetes$gender)
> Diabetes$staff <- factor(Diabetes$staff)
> attach(Diabetes)
Question 2
> library(ggplot2)
> tbl<-with(Diabetes, table(age_group, diabetes))
> ggplot(as.data.frame(tbl), aes(factor(age_group), Freq, fill=diabetes))+
scale_x_discrete("age_group", labels=c("1"="Under 30", "2"="30 to <40",
"3"="40 to <50", "4"="50 to <60", "5"="60 and above"))+
geom_col(position='dodge')+
ggtitle("Diabetes Status by Age Group")+
ylab('Number of People')
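The plot relies on an age_group variable whose construction is not shown here. A minimal sketch of one way to build it, assuming the groups are coded 1 to 5 with the same boundaries as the axis labels; the age values below are made up for illustration and are not taken from Diabetes.csv:

```r
# Hypothetical ages for illustration only; the real values would be Diabetes$age.
age <- c(19, 29, 30, 39.5, 40, 59, 60, 92)

# right = FALSE closes each interval on the left, e.g. [30, 40),
# which matches labels such as "30 to <40".
# labels = FALSE returns the integer codes 1..5 instead of a factor.
age_group <- cut(age,
                 breaks = c(-Inf, 30, 40, 50, 60, Inf),
                 labels = FALSE,
                 right = FALSE)

table(age_group)
```

With right = FALSE, a person aged exactly 30 falls in group 2 ("30 to <40") rather than group 1.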
Question 3
> which(is.na(weight))
[1] 162
> which(is.na(height))
[1] 64 87 196 232 318
> which(is.na(waist))
[1] 337 394
> which(is.na(hip))
[1] 337 394
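The four separate checks can also be condensed into a single pass over the columns. The sketch below uses a small made-up data frame (not the assignment data) to show the idea:

```r
# Small made-up data frame standing in for the four body-measurement columns.
df <- data.frame(weight = c(70, NA, 82),
                 height = c(NA, 165, NA),
                 waist  = c(34, 36, NA),
                 hip    = c(40, 41, NA))

# Number of missing values in each column.
na_counts <- colSums(is.na(df))

# Row indices of the missing values in each column, as a named list.
na_rows <- lapply(df, function(x) which(is.na(x)))
```

Here na_counts gives a count per variable, while na_rows gives the same row positions that which(is.na(...)) returns for each variable individually.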
Setting NA values to 0
> Diabetes[is.na(Diabetes)]=0
Question 4
> waist_hip_ratio<-waist/hip
> library(car)
> scatterplot(glyhb~waist_hip_ratio|gender, data = Diabetes,
main="Scatterplot")
The appropriate graph for examining the relationship between waist-hip ratio and diabetes
(measured here by glyhb) is a scatter plot. This is because both variables are continuous and can
be compared by plotting their data points (Sarikaya and Gleicher, 2017, 402). Also, the two variables can be used to estimate each
Question 5
iii. Scatterplot of MAP by age, where males and females are on the same graph
v.
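The code used to derive MAP is not shown. A minimal sketch, assuming MAP stands for mean arterial pressure and is computed with the common approximation MAP = DBP + (SBP - DBP) / 3; the function name map_from_bp and the readings below are hypothetical, and in the assignment the inputs would presumably be columns such as bp1s and bp1d:

```r
# Mean arterial pressure from systolic (sbp) and diastolic (dbp) readings, in mmHg.
# Assumption: the common approximation MAP = DBP + (SBP - DBP) / 3.
map_from_bp <- function(sbp, dbp) {
  dbp + (sbp - dbp) / 3
}

# Made-up readings for illustration; a missing reading gives a missing MAP.
sbp <- c(120, 150, NA)
dbp <- c(80, 90, 70)
map <- map_from_bp(sbp, dbp)
```

A reading of 150/90 gives a MAP of 110 mmHg under this approximation. The result can then be plotted against age with a separate colour or symbol for each gender.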
Part 2: Stata
Question 1
a) Recoding variables
Question 2
a) New set of variables for Blood Glucose, CSF Glucose, and CSF Protein
* Create converted copies of each variable by dividing the original value by 18
gen bldgluc_2=(bldgluc_1*1/18)
gen csfgluc_2=(csfgluc_1*1/18)
gen csfpr_2=(csfpr_1*1/18)
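For comparison, the same division can be sketched in R. The factor of 18 comes from the molar mass of glucose (about 180 g/mol), so 1 mmol/L of glucose is roughly 18 mg/dL. The function name mgdl_to_mmol is hypothetical, and the sketch covers glucose only:

```r
# Convert glucose from mg/dL to mmol/L.
# Glucose has a molar mass of about 180 g/mol, so 1 mmol/L is about 18 mg/dL.
mgdl_to_mmol <- function(x) x / 18

mgdl_to_mmol(c(90, 126, 180))
# 5 7 10
```

Because the factor depends on the molar mass of the substance being measured, this sketch does not say whether 18 is also appropriate for a protein measurement.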
from different locations to easily convert the values (Chehregosha et al., 2019,
856). For instance, the U.S. uses the mg/dL scale while the U.K. uses the mmol/L scale.
Thus, reporting values on both scales allows easier communication between countries since the
[Figure: histogram of Blood Glucose (mmol/L), with Frequency on the vertical axis]
The distribution of Blood Glucose is positively skewed. This is because most values are
concentrated on the left, with a long tail extending to the right (Boels et al.,
[Figure: histogram of CSF Glucose (mmol/L), with Frequency on the vertical axis]
The distribution for CSF Glucose is right-skewed. This is because most of the data fall to
[Figure: distribution of CSF Protein (mmol/L)]
The distribution of CSF Protein is positively skewed. The distribution is unimodal since it
distribution
[Figure: distribution of log(CSF Glucose)]
Log transformation is used to transform highly skewed data to ensure the distribution
distribution of CSF Glucose, the distribution was not normal. After a log
transformation of the data, the normal bell curve shows that log(CSF
Glucose) approximately follows a normal distribution, as shown in the figure above (Asar, Ilk and
[Figure: histogram of log(CSF Protein), with Frequency on the vertical axis]
The distribution of CSF Protein was not normal. A log transformation of the data
brings it closer to normality, as illustrated in the histogram above (Templeton and Burney,
2017, 156). The normality of the distribution is observed from the symmetrical normal
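The effect of a log transformation on a right-skewed variable can be sketched in R. The data below are simulated from a log-normal distribution purely for illustration and are not the assignment's values; the helper skewness is a hypothetical name defined here rather than taken from a package:

```r
set.seed(1)  # reproducible simulated data, not the assignment's values

# Simulated right-skewed variable (log-normal), standing in for a CSF measurement.
x <- rlnorm(500, meanlog = 1, sdlog = 0.8)

# Sample skewness: third central moment divided by the cubed standard deviation.
skewness <- function(v) {
  v <- v[!is.na(v)]
  mean((v - mean(v))^3) / sd(v)^3
}

skew_raw <- skewness(x)        # clearly positive for log-normal data
skew_log <- skewness(log(x))   # close to 0 after the transformation

# hist(x); hist(log(x))  # visual comparison of the two distributions
```

A skewness near zero after the transformation is consistent with a roughly symmetric histogram, which is what the log-transformed plots are meant to show.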
Question 3
between two variables (Dasheva, Andonov and Doncheva, 2020, 13). Also, box plots illustrate
the distribution of data by indicating possible outliers (Ho et al., 2019, 565). The histograms and cross-
tabulations below show the differences in interquartile range and median of CSF Glucose by
From the cross-tabulation above, CSF glucose is higher in the Gram-positive group than in
the Gram-negative group (Gogtay and Thatte, 2017, 79). One outlier exists in the Gram-positive
group.
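The comparison of medians, interquartile ranges and outliers by group can be sketched in R as follows. The data frame dat, its column names and all of its values are made up for illustration and do not reproduce the assignment's results:

```r
# Made-up values for illustration; group labels and numbers are hypothetical.
dat <- data.frame(
  group = rep(c("Gram negative", "Gram positive"), each = 6),
  csf_glucose = c(0.5, 0.8, 1.0, 1.2, 1.5, 1.8,
                  1.0, 1.4, 1.6, 1.8, 2.0, 6.0)
)

# Median and interquartile range of CSF glucose within each group.
medians <- tapply(dat$csf_glucose, dat$group, median)
iqrs    <- tapply(dat$csf_glucose, dat$group, IQR)

# Points a box plot would flag as outliers (beyond 1.5 * IQR from the hinges).
outliers <- lapply(split(dat$csf_glucose, dat$group),
                   function(v) boxplot.stats(v)$out)

# boxplot(csf_glucose ~ group, data = dat)  # draws the corresponding box plot
```

In this toy example the value 6.0 is the only point flagged, which is the kind of isolated outlier a box plot makes visible.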
An illustration of the differences in interquartile range and median in CSF glucose by abm is
shown below.
As depicted by the cross-tabulation and box plot above, there is a higher level of CSF Glucose in the
Question 4
[Figure: cumul_prev plotted against month]
plot above. The plot shows an increase in Viral Meningitis, which tends to increase the cases of
References
Asar, Ö., Ilk, O. and Dag, O., 2017. Estimating Box-Cox power transformation parameter via
pp.91-105.
Ash, S.Y., Harmouche, R., Vallejo, D.L.L., Villalba, J.A., Ostridge, K., Gunville, R., Come,
C.E., Onieva, J.O., Ross, J.C., Hunninghake, G.M. and El-Chemaly, S.Y., 2017.
Boels, L., Bakker, A., Van Dooren, W. and Drijvers, P., 2019. Conceptual difficulties when
Chehregosha, H., Khamseh, M.E., Malek, M., Hosseinpanah, F. and Ismail-Beigi, F., 2019. A
view beyond HbA1c: role of continuous glucose monitoring. Diabetes Therapy, 10(3),
pp.853-863.
Dasheva, D., Andonov, H. and Doncheva, L., 2020. Master’s Program High Performance Sport
Gogtay, N.J. and Thatte, U.M., 2017. Principles of correlation analysis. Journal of the
Ho, J., Tumkaya, T., Aryal, S., Choi, H. and Claridge-Chang, A., 2019. Moving beyond P
values: data analysis with estimation graphics. Nature Methods, 16(7), pp.565-566.
Mai, U. and Mirarab, S., 2021. Log Transformation Improves Dating of Phylogenies. Molecular
Ravignani, A., 2017. Visualizing and interpreting rhythmic patterns using phase space plots.
Sarikaya, A. and Gleicher, M., 2017. Scatterplots: Tasks, data, and designs. IEEE transactions
Schmidt, A.F. and Finan, C., 2018. Linear regression and the normality assumption. Journal of
Templeton, G.F. and Burney, L.L., 2017. Using a two-step transformation to address non-
Zhu, J., Volkening, L.K. and Laffel, L.M., 2020. Distinct patterns of daily glucose variability by
pubertal status in youth with type 1 diabetes. Diabetes Care, 43(1), pp.22-28.