CORRELATION AND COVARIANCE in R
COVARIANCE in R
Covariance
The covariance of two variables x and y in a data set measures how the two are
linearly related. A positive covariance indicates a positive linear relationship
between the variables, and a negative covariance indicates the opposite.
The sample covariance is defined in terms of the sample means x̄ and ȳ as:
cov(x, y) = Σ (x_i − x̄)(y_i − ȳ) / (n − 1)
Similarly, the population covariance is defined in terms of the population means μx and μy
as:
cov(x, y) = Σ (x_i − μx)(y_i − μy) / N
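The sample formula above can be checked directly in base R by computing the covariance from its definition and comparing it with cov():

```r
# Sample covariance computed from its definition, then checked against cov().
x <- c(2, 4, 6, 8, 10)
y <- c(1, 11, 3, 33, 5)
n <- length(x)
manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
print(manual)
print(cov(x, y))  # identical to the manual computation
```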
Covariance in R programming
In statistics, covariance measures the relationship between two variables of a dataset:
it describes how the two variables vary together.
For instance, when two variables are highly positively correlated, they move
in the same direction.
Covariance is useful in data pre-processing prior to modelling in data science
and machine learning.
In R, we use the cov() function to calculate the covariance between two
vectors or data frames.
Example:
We pass the following three arguments to the cov() function:
x — the first vector
y — the second vector
method — the method used to compute the covariance: "pearson", "kendall", or "spearman". The default is "pearson".
a <- c(2,4,6,8,10)
b <- c(1,11,3,33,5)
print(cov(a, b, method = "spearman"))
Problem
Find the covariance of eruption duration and waiting time in the data set faithful.
Observe if there is any linear relationship between the two variables.
Solution
We apply the cov function to compute the covariance of eruptions and waiting.
> duration = faithful$eruptions # eruption durations
> waiting = faithful$waiting # the waiting period
> cov(duration, waiting) # apply the cov function
[1] 13.978
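The covariance of 13.978 is positive, but its magnitude depends on the units of the variables. To judge the strength of the linear relationship it helps to normalize the covariance to a correlation, which is scale-free:

```r
# Covariance is scale-dependent; correlation normalizes it to [-1, 1].
duration <- faithful$eruptions
waiting  <- faithful$waiting
print(cov(duration, waiting))  # positive: the variables move together
print(cor(duration, waiting))  # close to 1: a strong linear relationship
```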
Correlation in R programming
When two variables are highly (positively) correlated, they largely convey the
same information, since each can be predicted from the other; they therefore have a
similar relationship with the other variables of the dataset.
Example
The cor() function in R enables us to calculate the correlation between the variables of the data set or vector.
Example:
a <- c(2,4,6,8,10)
b <- c(1,11,3,33,5)
corr <- cor(a, b)
print(corr)
print(cor(a, b, method = "spearman"))
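The Pearson correlation is simply the covariance rescaled by the product of the standard deviations, which we can verify directly in base R:

```r
# Pearson correlation is covariance divided by the product of the
# standard deviations of the two variables.
a <- c(2, 4, 6, 8, 10)
b <- c(1, 11, 3, 33, 5)
by_hand <- cov(a, b) / (sd(a) * sd(b))
print(by_hand)
print(cor(a, b))  # same value
```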
Covariance to Correlation in R
Note: cov2cor() expects a square covariance matrix, so instead of passing the two
vectors separately to cov(), we bind them into a matrix and compute its covariance
matrix first.
Example:
Here, we bind the two vectors a and b into a matrix with cbind(), so that cov() returns a 2 x 2 covariance matrix. Using the cov2cor()
function, we then obtain the corresponding correlation matrix for every pair of columns.
a <- c(2,4,6,8)
b <- c(1,11,3,33)
covar <- cov(cbind(a, b))  # 2 x 2 covariance matrix
print(covar)
res <- cov2cor(covar)      # corresponding correlation matrix
print(res)
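As a sanity check, the matrix produced by cov2cor() should match what cor() returns on the same data directly:

```r
# cov2cor() on a covariance matrix reproduces the correlation matrix.
m <- cbind(a = c(2, 4, 6, 8), b = c(1, 11, 3, 33))
print(cov2cor(cov(m)))
print(cor(m))  # identical matrix
```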
Compute correlation matrix in R
# Load data
data("mtcars")
my_data <- mtcars[, c(1,3,4,5,6,7)]
# print the first 6 rows
head(my_data, 6)
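With base R alone, applying cor() to the whole data frame returns the full correlation matrix (rounding keeps it readable):

```r
# Correlation matrix of the selected mtcars columns using base R only.
data("mtcars")
my_data <- mtcars[, c(1, 3, 4, 5, 6, 7)]  # mpg, disp, hp, drat, wt, qsec
round(cor(my_data), 2)
```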
Compute correlation matrix
The function rcorr() [in the Hmisc package] can be used to compute the significance
levels for Pearson and Spearman correlations. It returns both the correlation
coefficients and the p-values of the correlations for all possible pairs of columns in
the data table.
Simplified format:
rcorr(x, type = c("pearson","spearman"))
Install Hmisc package:
install.packages("Hmisc")
Use rcorr() function
library("Hmisc")
res2 <- rcorr(as.matrix(my_data))
res2
The output of the function rcorr() is a list containing the following elements:
r — the correlation matrix
n — the matrix of the number of observations used in analyzing each pair of variables
P — the p-values corresponding to the significance levels of the correlations
To extract the correlation coefficients or the p-values from the output, use res2$r
and res2$P respectively.
Step 1 - Load the iris dataset and the supporting libraries, then summarize the data
library(ggplot2)
library(tidyr)
library(datasets)
data("iris")
summary(iris)
Step 2 - Create a correlation matrix of the iris dataset using the DataExplorer
correlation function. Include only continuous variables in your correlation plot to
avoid confusion, since factor variables don’t make sense in a correlation plot
library(DataExplorer)
library(corrplot)
plot_correlation(iris, type = "continuous")  # the factor column Species is excluded
Step 3 - Create three separate correlation matrices, one for each species of iris flower.
str(iris)
m <- levels(iris$Species)
title0 <- "setosa"
setosaCorr <- cor(iris[iris$Species == m[1], 1:4])
corrplot(setosaCorr, method = "number", title = title0, mar = c(0, 0, 1, 0))
title1 <- "versicolor"
versC <- cor(iris[iris$Species == m[2], 1:4])
corrplot(versC, method = "number", title = title1, mar = c(0, 0, 1, 0))
title2 <- "virginica"
veriC <- cor(iris[iris$Species == m[3], 1:4])
corrplot(veriC, method = "number", title = title2, mar = c(0, 0, 1, 0))
Ancova
A simple regression analysis gives multiple results for each value of a categorical variable. In such
a scenario, we can study the effect of the categorical variable by using it along with the predictor variable
and comparing the regression lines for each level of the categorical variable. Such an analysis is termed
Analysis of Covariance, or ANCOVA.
ANCOVA is a type of general linear model (GLM) that includes at least one continuous and one
categorical independent variable (treatments). ANCOVA is useful when the effect of the treatments is
important while there is an additional continuous variable in the study. ANCOVA was proposed by the British
statistician Ronald A. Fisher in the 1930s.
The additional continuous independent variable in ANCOVA is called a covariate (also known as control,
concomitant, or confounding variable).
We use regression analysis to create models that describe the effect of variation in predictor variables on
the response variable. Sometimes, however, a predictor is categorical, with values like Yes/No or
Male/Female; in that case we can include it alongside the continuous predictor and compare the fitted
regression lines for each level, which is exactly the ANCOVA setting.
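As an illustrative sketch (using the built-in mtcars data, with am — automatic/manual transmission — as the categorical treatment and wt as the continuous covariate), an ANCOVA model can be fitted with lm():

```r
# ANCOVA sketch: model mpg from a continuous covariate (wt) plus a
# categorical treatment (transmission type), using the built-in mtcars data.
data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("automatic", "manual"))
fit <- lm(mpg ~ wt + am, data = mtcars)
summary(fit)
anova(fit)  # tests the treatment effect after adjusting for the covariate
```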
Analysis of covariance on the iris data: combining an ANOVA-style categorical
predictor with a continuous covariate
data(iris)
View(iris) # take a look at the data
library(lattice)
In this case, we will examine sepal width as our response, using species as a categorical
predictor (as in ANOVA) and sepal length as our covariate (as in linear regression).
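A minimal sketch of this model in base R (the variable names follow the iris dataset):

```r
# ANCOVA on iris: sepal width modeled from a continuous covariate
# (Sepal.Length) and a categorical predictor (Species).
data(iris)
fit <- lm(Sepal.Width ~ Sepal.Length + Species, data = iris)
summary(fit)
```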
It is always helpful to look at a plot first. Note that type=c("p","r") puts both points (p) and
regression lines (r) on the plot
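The plot described above can be produced with lattice's xyplot(), giving one panel per species:

```r
library(lattice)

# Scatterplot of sepal width vs. sepal length, one panel per species.
# type = c("p", "r") draws both the points and a fitted regression line.
xyplot(Sepal.Width ~ Sepal.Length | Species, data = iris,
       type = c("p", "r"), layout = c(3, 1))
```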