Mean Vector and Correlation Matrix in R - Jupyter Notebook
library(car)
library(corrplot)
data <- read.csv(url("https://datalifex.in/dataml/diabetes.csv"))
data
Summary
In [38]:
summary(data)
1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
Outcome
Min. :0.000
1st Qu.:0.000
Median :0.000
Mean :0.349
3rd Qu.:1.000
Max. :1.000
Matrix Scatterplot
Scatterplot matrices are good for spotting rough linear correlations in data that contain continuous
variables.
They are less useful for discrete variables, whose points pile up on a few values.
In [39]:
scatterplotMatrix(data[1:9])
Mean vector
In [40]:
apply(data[,1:9], 2, mean)
In [43]:
colMeans(data[,1:9])
Inference: from the mean vector we can infer that the average age in the collected data is around 33.
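The two cells above compute the same mean vector in two ways. A minimal sketch of that equivalence, using a small made-up data frame (`toy` and its values are assumptions for illustration, not rows from the diabetes data):

```r
# Toy data frame standing in for the diabetes data (values are made up).
toy <- data.frame(
  Glucose = c(99, 140, 120, 85),
  BMI     = c(35.4, 28.1, 30.0, 22.5),
  Age     = c(50, 31, 42, 25)
)

# Column-wise means via apply(): MARGIN = 2 means "apply over columns".
m_apply <- apply(toy, 2, mean)

# The same mean vector via colMeans(), which is purpose-built for this.
m_colmeans <- colMeans(toy)

print(m_colmeans)
print(all.equal(m_apply, m_colmeans))
```

Both calls return a named numeric vector with one entry per column; `colMeans()` is usually the shorter and faster choice for an all-numeric data frame.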
Group means
by function
In [44]:
data$Outcome: 0
------------------------------------------------------------
data$Outcome: 1
Here, Outcome 0 denotes persons without diabetes and Outcome 1 denotes persons with diabetes; the output gives
the mean of every variable within each group. The mean Glucose, Insulin and Age are comparatively higher for
persons with diabetes than for persons without.
aggregate(. ~ Outcome, data, mean)
A data.frame: 2 × 9
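The `by()` call behind the first grouped output is not preserved in this export. The following is a sketch of how per-group means could be obtained with both `by()` and `aggregate()`, under the assumption that the grouping column is `Outcome`; the `toy` data frame and its values are invented for illustration.

```r
# Toy data with a binary grouping column (values are made up).
toy <- data.frame(
  Glucose = c(90, 100, 150, 170),
  Age     = c(25, 35, 40, 50),
  Outcome = c(0, 0, 1, 1)
)

# by() splits the rows by the grouping variable and applies
# colMeans() to each piece, printing one block per group.
grp_by <- by(toy[, c("Glucose", "Age")], toy$Outcome, colMeans)
print(grp_by)

# aggregate() with a formula returns the same means as a single
# data frame with one row per group.
grp_agg <- aggregate(. ~ Outcome, data = toy, FUN = mean)
print(grp_agg)
```

`by()` is convenient for reading results group by group, while the data frame returned by `aggregate()` is easier to pass on to further computation.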
In [46]:
var(data[,1:9]) #the diagonal elements of the matrix are the individual variances
In [47]:
The signs of the off-diagonal entries of the above matrix indicate the direction of the linear relationship between each pair of variables.
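To make the reading of a covariance matrix concrete, here is a small sketch on made-up data (`toy`, `x`, `y` and `z` are hypothetical names): the diagonal holds the variances, and the sign of each off-diagonal entry shows whether two variables move together or in opposite directions.

```r
# Toy data: y rises with x, z falls as x rises (values are made up).
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 6, 8, 10),
  z = c(10, 8, 6, 4, 2)
)

# Sample covariance matrix of all columns.
S <- var(toy)
print(S)

# The diagonal entries are the individual variances.
print(all.equal(diag(S), sapply(toy, var)))

# The sign of an off-diagonal entry gives the direction of the relationship.
print(sign(S["x", "y"]))  # x and y increase together
print(sign(S["x", "z"]))  # z decreases as x increases
```

The magnitude of a covariance depends on the units of the variables, so it shows direction but not strength; the correlation matrix is the scale-free counterpart.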
Correlation Matrix
In [48]:
cor(data[,1:9])
corrplot(cor(data[,1:9]),method="ellipse")
Inference: the plot shows a strong correlation between pregnancies and age, and moderate correlations between
glucose level and the diabetes outcome, between skin thickness and insulin level, and between skin thickness
and BMI.
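Reading pairs off a plot can be cross-checked numerically. The sketch below ranks variable pairs by absolute correlation on a made-up data frame; the names `toy` and `ranked` are hypothetical, and this is only one possible way to list the pairs.

```r
# Toy data (values are made up): y tracks x closely, z does not.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 6, 8, 11),
  z = c(5, 3, 6, 2, 4)
)

R <- cor(toy)

# Blank out the diagonal and upper triangle so each pair appears once.
R[upper.tri(R, diag = TRUE)] <- NA

# Convert to long format: one row per (Var1, Var2) pair, value in Freq.
ranked <- as.data.frame(as.table(R))
ranked <- ranked[!is.na(ranked$Freq), ]

# Sort by absolute correlation, strongest pair first.
ranked <- ranked[order(-abs(ranked$Freq)), ]
print(ranked)
```

On the real data the same steps could be applied to `cor(data[,1:9])` to confirm which pairs the plot highlights.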
In [50]:
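# Rebuild the covariance matrix from the correlation matrix: S = D R D,
# where D is the diagonal matrix of standard deviations.
D1 <- diag(sqrt(diag(var(data[,1:9]))))
S <- D1 %*% cor(data[,1:9]) %*% D1
print(S)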
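The cell above relies on the identity that scaling the correlation matrix on both sides by the diagonal matrix of standard deviations gives back the covariance matrix. A small check of this on made-up data (`toy`, `V`, `R` and `D` are hypothetical names), together with the reverse direction via base R's `cov2cor()`:

```r
# Toy data (values are made up).
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 1, 4, 3, 6),
  z = c(9, 7, 8, 4, 5)
)

V <- var(toy)              # covariance matrix
R <- cor(toy)              # correlation matrix
D <- diag(sqrt(diag(V)))   # diagonal matrix of standard deviations

# Rebuild the covariance matrix from the correlation matrix.
S <- D %*% R %*% D

# The product has no row or column names, so compare values only.
print(all.equal(S, V, check.attributes = FALSE))

# cov2cor() goes the other way, from covariance to correlation.
print(all.equal(cov2cor(V), R))
```

If labelled output is wanted, the row and column names of the original covariance matrix can be copied onto `S` with `dimnames(S) <- dimnames(V)`.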