#1660908-Data Management and Statistical Computing


DMC Assignment 2 1

DMC ASSIGNMENT 2

By Student Name

Course

Tutor

University

City and State

Date

DMC Assignment 2

Part 1: R

Question 1

> library(readr)
Warning message:
package ‘readr’ was built under R version 3.6.3
> Diabetes <- read_csv("Diabetes.csv")
Parsed with column specification:
cols(
id = col_double(),
location = col_character(),
staff = col_character(),
age = col_double(),
gender = col_character(),
height = col_double(),
weight = col_double(),
waist = col_double(),
hip = col_double(),
bp1s = col_double(),
bp1d = col_double(),
bp2s = col_double(),
bp2d = col_double(),
glyhb = col_double()
)
>
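> # Convert the categorical columns to factors inside the data frame, then attach it
> Diabetes$location <- factor(Diabetes$location)
> Diabetes$gender <- factor(Diabetes$gender)
> Diabetes$staff <- factor(Diabetes$staff)
> attach(Diabetes)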

Question 2

a) Creating variable diabetes

> # Store the status as a column of Diabetes so it travels with the data
> Diabetes$diabetes <- ifelse(Diabetes$glyhb > 7, "diabetic", "nondiabetic")

b) Creating bar chart

> # Bin age into ordered groups; right = FALSE makes each interval [lower, upper)
> Diabetes$age_group <- cut(Diabetes$age,
breaks = c(-Inf, 30, 40, 50, 60, Inf), right = FALSE,
labels = c("Under 30", "30 to <40", "40 to <50", "50 to <60", "60 and over"))

> library(ggplot2)
> age_levels <- c("Under 30", "30 to <40", "40 to <50", "50 to <60", "60 and over")
> tbl <- with(Diabetes, table(age_group, diabetes))
> plot_data <- as.data.frame(tbl)
> # Order the age groups by age rather than alphabetically
> plot_data$age_group <- factor(plot_data$age_group, levels = age_levels)
> ggplot(plot_data, aes(age_group, Freq, fill = diabetes)) +
geom_col(position = "dodge") +
ggtitle("Diabetes Status by Age Group") +
xlab("Age group") +
ylab("Number of People")

Question 3

> boxplot(weight, main="Boxplot for Weight")
> boxplot(height, main="Boxplot for Height")
> boxplot(waist, main="Boxplot for Waist Circumference")
> boxplot(hip, main="Boxplot for Hip Circumference")

[Figures: Boxplot for Weight; Boxplot for Height; Boxplot for Waist; Boxplot for Hip]

Checking for patterns of data entry errors

> which(is.na(weight))
[1] 162
> which(is.na(height))
[1] 64 87 196 232 318
> which(is.na(waist))
[1] 337 394
> which(is.na(hip))
[1] 337 394
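
A per-column count can give a quicker overview of missing values than calling which() on each variable in turn. The sketch below uses a small made-up data frame (the name toy and its values are hypothetical, not the Diabetes data) to show the idea with colSums() and complete.cases().

```r
# Toy data frame standing in for Diabetes (values are made up)
toy <- data.frame(
  weight = c(70, NA, 82, 65),
  height = c(170, 165, NA, NA),
  waist  = c(34, 36, 40, 31)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Row numbers that have at least one missing measurement
rows_with_na <- which(!complete.cases(toy))
print(rows_with_na)
```

Rows that are missing several measurements at once, such as rows 337 and 394 for waist and hip above, show up in the second result as well.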

Setting NA values to 0

> Diabetes[is.na(Diabetes)]=0

Question 4

> library(car)  # provides scatterplot()
> Diabetes$waist_hip_ratio <- Diabetes$waist / Diabetes$hip
> scatterplot(glyhb ~ waist_hip_ratio | gender, data = Diabetes,
main = "glyhb by Waist-Hip Ratio and Gender")

A scatter plot is the appropriate graph for examining the relationship between waist-hip ratio and diabetes (measured here by glyhb). Both variables are continuous, so each participant can be plotted as a single point and the pattern of points compared (Sarikaya and Gleicher, 2017, 402). The plot also shows how well one variable can be used to estimate the other (Schmidt and Finan, 2018, 160).

Question 5

i. Create variables for systolic and diastolic blood pressure

ii. Create a new variable map



iii. Scatterplot of MAP by age, where males and females are on the same graph

iv. The mean and standard deviation of MAP

v.
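
The commands for steps i to iv are not shown above, so the following is only a minimal sketch of how they might look in R, not the code that was actually run. It assumes that systolic and diastolic pressure are each taken as the average of the two readings (bp1s/bp2s and bp1d/bp2d), using whichever reading is available when one is missing, and that MAP is computed with the common approximation diastolic + (systolic - diastolic)/3. The data frame bp and its values are made up.

```r
# Toy readings standing in for bp1s, bp1d, bp2s, bp2d (values are made up)
bp <- data.frame(
  bp1s = c(120, 140, 110),
  bp1d = c(80, 90, 70),
  bp2s = c(124, NA, 114),
  bp2d = c(84, NA, 74)
)

# Assumption: average the two readings, using whichever is available if one is NA
bp$systolic  <- rowMeans(bp[, c("bp1s", "bp2s")], na.rm = TRUE)
bp$diastolic <- rowMeans(bp[, c("bp1d", "bp2d")], na.rm = TRUE)

# Common approximation: MAP = diastolic + (systolic - diastolic) / 3
bp$map <- bp$diastolic + (bp$systolic - bp$diastolic) / 3

# Mean and standard deviation of MAP
map_mean <- mean(bp$map, na.rm = TRUE)
map_sd   <- sd(bp$map, na.rm = TRUE)
print(c(mean = map_mean, sd = map_sd))
```

With the real data, a call such as plot(age, map, col = gender) would place males and females on the same graph, provided gender is a factor.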

Part 2: Stata

Question 1

a) Recoding variables

b) Convert other string variables to numeric variables



Question 2

a) New set of variables for Blood Glucose, CSF Glucose, and CSF Protein

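* Convert glucose from mg/dL to mmol/L (1 mmol/L of glucose = 18 mg/dL)
gen bldgluc_2 = bldgluc_1/18
gen csfgluc_2 = csfgluc_1/18
* The same factor of 18 is applied to CSF protein
gen csfpr_2 = csfpr_1/18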

b) Reason for capturing data in two different units

Capturing data in two units of measurement allows readers in different countries to interpret the values without converting them first (Chehregosha et al., 2019, 856). For instance, the U.S. reports glucose in mg/dL, while countries such as the United Kingdom report it in mmol/L. Recording both units therefore eases communication between countries, since glucose values can be converted by multiplying mmol/L by 18 or dividing mg/dL by 18 (Zhu, Volkening, and Laffel, 2020, 24).
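
The conversion by a factor of 18 can be illustrated with a short sketch, written in R (the language of Part 1) rather than Stata. The function and constant names are hypothetical and the input values are made up.

```r
# Conversion factor for glucose: 1 mmol/L is approximately 18 mg/dL
GLUCOSE_MGDL_PER_MMOLL <- 18

# Hypothetical helper functions for converting in each direction
mgdl_to_mmoll <- function(mgdl) mgdl / GLUCOSE_MGDL_PER_MMOLL
mmoll_to_mgdl <- function(mmoll) mmoll * GLUCOSE_MGDL_PER_MMOLL

in_mmoll <- mgdl_to_mmoll(c(90, 126, 180))  # 5, 7, 10
in_mgdl  <- mmoll_to_mgdl(7)                # 126
print(in_mmoll)
print(in_mgdl)
```

The factor comes from the molar mass of glucose (about 180 g/mol), so it applies to the glucose variables specifically.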

c) The overall distribution of the variables

Histogram of the Distribution for Blood Glucose

[Figure: Distribution of Blood Glucose (mmol/L). x-axis: Blood Glucose (mmol/L); y-axis: Frequency]

The distribution of Blood Glucose is positively skewed: most values are concentrated on the left, with a long tail extending to the right (Boels et al., 2019, 12). The distribution is unimodal.

Histogram of the Distribution for CSF Glucose

[Figure: Distribution of CSF Glucose (mmol/L). x-axis: CSF Glucose (mmol/L); y-axis: Frequency]

The distribution of CSF Glucose is right-skewed: the peak lies towards the left, and the remaining values trail off in a long tail to its right (Ravignani, 2017, 562).

Histogram of the Distribution for CSF Protein

[Figure: Distribution of CSF Protein (mmol/L). x-axis: CSF Protein (mmol/L); y-axis: Frequency]

The distribution of CSF Protein is positively skewed and unimodal, since it has a single peak (Ash et al., 2017, 5).



d) Log-transformation of CSF glucose and CSF protein variables and their distribution

[Figure: Distribution of Log-transformed CSF Glucose (mmol/L). x-axis: log(CSF Glucose); y-axis: Frequency]

A log transformation is applied to highly skewed data so that the distribution more closely approximates normality (Curran-Everett, 2018, 344). The earlier histogram showed that CSF Glucose was not normally distributed. After the log transformation, the histogram above shows that log(CSF Glucose) approximately follows a normal, bell-shaped distribution (Asar, Ilk and Dag, 2017, 93).
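
The effect of a log transformation on skewness can be checked numerically as well as visually. The sketch below, written in R, uses simulated log-normal values rather than the CSF data, and the skewness() helper is a hypothetical moment-based implementation defined here for illustration.

```r
set.seed(1)

# Simulated right-skewed values (log-normal); not the CSF data
x <- rlnorm(500, meanlog = 1, sdlog = 0.8)

# Moment-based sample skewness: third central moment over variance^(3/2)
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(x)       # clearly positive for the raw values
skew_log <- skewness(log(x))  # close to zero after the transformation
print(c(raw = skew_raw, log = skew_log))
```

A skewness near zero is consistent with a symmetric distribution. Values of exactly zero need separate handling before transforming, because log(0) is -Inf in R.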



[Figure: Distribution of Log-transformed CSF Protein (mmol/L). x-axis: log(CSF Protein); y-axis: Frequency]

The distribution of CSF Protein was not normal. A log transformation brings it closer to normality, as illustrated by the roughly symmetrical, bell-shaped histogram above (Templeton and Burney, 2017, 156; Mai and Mirarab, 2021, 1156).

Question 3

A cross-tabulation organizes data in tabular format to display differences between two variables (Dasheva, Andonov and Doncheva, 2020, 13). Box plots illustrate the distribution of data and flag possible outliers (Ho et al., 2019, 565). The box plots and cross-tabulations below show the differences in interquartile range and median of CSF Glucose by grampos and abm.


[Figure: box plot titled "CSF Glucose by Gram Positive". y-axis: csfgluc_1; groups: Gram Negative, Gram Positive]

From the box plot above, CSF glucose is higher in the Gram Positive group than in the Gram Negative group (Gogtay and Thatte, 2017, 79). One outlier exists in the Gram Positive group.

An illustration of the differences in interquartile range and median in CSF glucose by abm is shown below.

[Figure: box plot of CSF glucose by abm. y-axis: csfgluc_1; groups: Viral Meningitis, Bacterial Meningitis, missing]

As depicted by the cross-tabulation and box plot above, there is a higher level of CSF Glucose in the Viral Meningitis group compared to the Bacterial Meningitis group.
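
The median and interquartile range that a box plot displays can also be tabulated by group. The following is a minimal sketch in R using made-up values (the data frame csf and its numbers are hypothetical, not the study data).

```r
# Made-up CSF glucose values (mg/dL) for two groups; not the study data
csf <- data.frame(
  group   = rep(c("Viral", "Bacterial"), each = 4),
  csfgluc = c(60, 65, 70, 75, 20, 30, 40, 50)
)

# Median and interquartile range of CSF glucose within each group
med_by_group <- tapply(csf$csfgluc, csf$group, median)
iqr_by_group <- tapply(csf$csfgluc, csf$group, IQR)

print(med_by_group)
print(iqr_by_group)
```

IQR() uses R's default quantile method, so the values can differ slightly from those reported by other software for small samples.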

Question 4
[Figure: plot of cumul_prev (y-axis, 0 to 1) against month (x-axis, 0 to 15)]

The cumulative prevalence plot above rises over the months shown, indicating that infections accumulate during the period. The plot shows an increase in Viral Meningitis cases, the group in which CSF Glucose levels were higher in the population studied.
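
How cumul_prev was computed is not shown above. One common construction, sketched here in R under that assumption and with made-up monthly counts, is the running total of cases divided by the overall total.

```r
# Made-up monthly case counts; not the study data
cases <- c(4, 6, 10, 8, 12)
month <- seq_along(cases)

# Cumulative proportion of all cases observed up to and including each month
cumul_prev <- cumsum(cases) / sum(cases)

print(data.frame(month, cases, cumul_prev))
```

A curve built this way never decreases and always ends at 1, so the steepness of each segment, rather than the overall rise, shows in which months cases were most frequent. A call such as plot(month, cumul_prev, type = "s") would draw it as a step plot.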



References

Asar, Ö., Ilk, O. and Dag, O., 2017. Estimating Box-Cox power transformation parameter via goodness-of-fit tests. Communications in Statistics-Simulation and Computation, 46(1), pp.91-105.

Ash, S.Y., Harmouche, R., Vallejo, D.L.L., Villalba, J.A., Ostridge, K., Gunville, R., Come, C.E., Onieva, J.O., Ross, J.C., Hunninghake, G.M. and El-Chemaly, S.Y., 2017. Densitometric and local histogram based analysis of computed tomography images in patients with idiopathic pulmonary fibrosis. Respiratory Research, 18(1), pp.1-11.

Boels, L., Bakker, A., Van Dooren, W. and Drijvers, P., 2019. Conceptual difficulties when interpreting histograms: A review. Educational Research Review, 28(3), pp.1-15.

Chehregosha, H., Khamseh, M.E., Malek, M., Hosseinpanah, F. and Ismail-Beigi, F., 2019. A view beyond HbA1c: role of continuous glucose monitoring. Diabetes Therapy, 10(3), pp.853-863.

Curran-Everett, D., 2018. Explorations in statistics: the log transformation. Advances in Physiology Education, 42(2), pp.343-347.

Dasheva, D., Andonov, H. and Doncheva, L., 2020. Master’s Program High Performance Sport E-Learning during COVID-19 Pandemic. Педагогика, 92(S7), pp.9-16.

Gogtay, N.J. and Thatte, U.M., 2017. Principles of correlation analysis. Journal of the Association of Physicians of India, 65(3), pp.78-81.

Ho, J., Tumkaya, T., Aryal, S., Choi, H. and Claridge-Chang, A., 2019. Moving beyond P values: data analysis with estimation graphics. Nature Methods, 16(7), pp.565-566.

Mai, U. and Mirarab, S., 2021. Log Transformation Improves Dating of Phylogenies. Molecular Biology and Evolution, 38(3), pp.1151-1167.

Ravignani, A., 2017. Visualizing and interpreting rhythmic patterns using phase space plots. Music Perception: An Interdisciplinary Journal, 34(5), pp.557-568.

Sarikaya, A. and Gleicher, M., 2017. Scatterplots: Tasks, data, and designs. IEEE Transactions on Visualization and Computer Graphics, 24(1), pp.402-412.

Schmidt, A.F. and Finan, C., 2018. Linear regression and the normality assumption. Journal of Clinical Epidemiology, 98, pp.146-151.

Templeton, G.F. and Burney, L.L., 2017. Using a two-step transformation to address non-normality from a business value of information technology perspective. Journal of Information Systems, 31(2), pp.149-164.

Zhu, J., Volkening, L.K. and Laffel, L.M., 2020. Distinct patterns of daily glucose variability by pubertal status in youth with type 1 diabetes. Diabetes Care, 43(1), pp.22-28.
