Calculating Pearson Correlation Coefficient in Python With Numpy
Mehreen Saeed
https://stackabuse.com/calculating-pearson-correlation-coefficient-in-python-with-numpy/
Introduction
This article is an introduction to the Pearson correlation coefficient, its manual calculation, and its computation via Python's numpy module.
The Pearson correlation coefficient measures the linear association between variables. Its value can be interpreted like so:

+1 - a complete positive linear relationship
0 - no linear relationship
-1 - a complete negative linear relationship
We'll illustrate how the correlation coefficient varies with different types of associations. We'll also show that zero correlation does not always mean zero association: non-linearly related variables can have correlation coefficients close to zero.
The coefficient is defined as the covariance of the two variables scaled by the product of their individual standard deviations:

ρ(X, Y) = cov(X, Y) / (σ_X σ_Y)

As the magnitude of the covariance is never larger than the product of the individual standard deviations, the value of ρ varies between -1 and +1. From the definition we can also see that the correlation of a variable with itself is one: ρ(X, X) = cov(X, X) / σ_X² = 1.
Before we start writing code, let's do a short example to see how this coefficient is computed.
How is the Pearson Correlation Coefficient Computed?
Suppose we are given some observations of the random variables X and Y. If you plan to
implement everything from scratch or do some manual calculations, then you need the
following when given X and Y:
 X    Y    X²    Y²    XY
-2    4     4    16    -8
-1    1     1     1    -1
 0    3     0     9     0
 1    2     1     4     2
 2    0     4     0     0

E(X) = 0
E(Y) = 2
E(X²) = 2
E(Y²) = 6
E(XY) = -1.4
Let's use the above to compute the correlation. We'll use biased estimates of the covariance and the standard deviations (dividing by n rather than n - 1). This doesn't affect the value of the correlation coefficient, as the number of observations cancels out between the numerator and denominator:

cov(X, Y) = E(XY) - E(X)E(Y) = -1.4 - (0)(2) = -1.4
σ_X = √(E(X²) - E(X)²) = √(2 - 0) = √2
σ_Y = √(E(Y²) - E(Y)²) = √(6 - 4) = √2
ρ = cov(X, Y) / (σ_X σ_Y) = -1.4 / 2 = -0.7
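Plugging the table values into numpy confirms this arithmetic. A quick sketch, using the biased, divide-by-n estimates described above:

```python
import numpy as np

x = np.array([-2, -1, 0, 1, 2])
y = np.array([4, 1, 3, 2, 0])

# Biased (population) estimates: np.mean divides by n, not n - 1
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)   # E(XY) - E(X)E(Y) = -1.4
std_x = np.sqrt(np.mean(x ** 2) - np.mean(x) ** 2)  # sqrt(2)
std_y = np.sqrt(np.mean(y ** 2) - np.mean(y) ** 2)  # sqrt(2)

rho = cov_xy / (std_x * std_y)
print(rho)  # ≈ -0.7
```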
For n random variables, numpy's corrcoef() returns an n x n square matrix M, with M(i,j) being the correlation coefficient between random variables i and j. As the correlation coefficient of a variable with itself is 1, all diagonal entries (i,i) are equal to one.
In short:
Note that the correlation matrix is symmetric as correlation is symmetric, i.e., M(i,j)=M(j,i).
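A minimal sketch of these properties (the data here is arbitrary, drawn only for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed, purely for reproducibility
data = rng.normal(size=(3, 50))  # 3 variables with 50 observations each

M = np.corrcoef(data)

print(M.shape)                       # (3, 3)
print(np.allclose(M, M.T))           # True: M(i,j) == M(j,i)
print(np.allclose(np.diag(M), 1.0))  # True: unit diagonal
```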
Let's take our simple example from the previous section and see how to use corrcoef() with numpy.
First, let's import the numpy module, alongside the pyplot module from Matplotlib. We'll be
using Matplotlib to visualize the correlation later on:
import numpy as np
import matplotlib.pyplot as plt
We'll use the same values from the manual example from before. Let's store that into x_simple
and compute the correlation matrix:
x_simple = np.array([-2, -1, 0, 1, 2])
y_simple = np.array([4, 1, 3, 2, 0])
my_rho = np.corrcoef(x_simple, y_simple)
print(my_rho)
The following is the output correlation matrix. Note the ones on the diagonals, indicating that
the correlation coefficient of a variable with itself is one:
[[ 1. -0.7]
[-0.7 1. ]]
We'll seed NumPy's RandomState so that this example is repeatable:
seed = 13
rand = np.random.RandomState(seed)

x = rand.uniform(0, 1, 100)                 # base variable
x = np.vstack((x, x * 2 + 1))               # perfect positive linear relation
x = np.vstack((x, -x[0] * 2 + 1))           # perfect negative linear relation
x = np.vstack((x, rand.normal(1, 3, 100)))  # independent random variable
Then, we can call vstack() to vertically stack other arrays to it. This way, we can stack a bunch of
variables like the ones above in the same x reference and access them sequentially.
After the first uniform distribution, we've stacked a few variable sets vertically - the second one
has a complete positive relation to the first one, the third one has a complete negative
correlation to the first one, and the fourth one is fully random, so it should have a ~0
correlation.
When we have a single x reference like this, we can calculate the correlation for each of the
elements in the vertical stack by passing it alone to np.corrcoef():
rho = np.corrcoef(x)
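Individual entries of the matrix can then be read off to confirm how the rows were constructed (the stacking code is repeated here so the snippet is self-contained):

```python
import numpy as np

rand = np.random.RandomState(13)
x = rand.uniform(0, 1, 100)
x = np.vstack((x, x * 2 + 1))               # positive linear function of row 0
x = np.vstack((x, -x[0] * 2 + 1))           # negative linear function of row 0
x = np.vstack((x, rand.normal(1, 3, 100)))  # independent of row 0

rho = np.corrcoef(x)

print(rho[0, 1])  # ≈ +1: complete positive correlation
print(rho[0, 2])  # ≈ -1: complete negative correlation
print(rho[0, 3])  # close to 0: independent variables
```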
In this example, we'll add increasing amounts of noise to the correlated variables and calculate the correlation coefficients at each step:
# Compute correlation
rho_noise = np.corrcoef(x_with_noise)
fig.subplots_adjust(wspace=0.3,hspace=0.4)
plt.show()
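The noise-generation step itself isn't shown above; a minimal sketch of the idea, with noise scales chosen here purely for illustration, might look like this:

```python
import numpy as np

rand = np.random.RandomState(13)
x = rand.uniform(0, 1, 100)
y = x * 2 + 1  # perfectly correlated with x before any noise is added

corrs = []
for noise_scale in [0.0, 0.5, 1.0, 2.0]:
    # Add zero-mean Gaussian noise of increasing magnitude
    y_with_noise = y + rand.normal(0, noise_scale, 100)
    r = np.corrcoef(x, y_with_noise)[0, 1]
    corrs.append(r)
    print(noise_scale, round(r, 2))
```

As the noise scale grows, the coefficient drifts from 1 toward 0, even though the underlying linear relationship is unchanged.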
A Common Pitfall: Associations with no Correlation
There is a common misconception that zero correlation implies no association. Let's clarify that
correlation strictly measures the linear relationship between two variables.
The examples below show variables which are non-linearly associated with each other but have
zero correlation.
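For instance, y = x² over a range symmetric about zero is completely determined by x, yet its Pearson correlation with x is zero:

```python
import numpy as np

x_nl = np.array([-2, -1, 0, 1, 2])
y_nl = x_nl ** 2  # a perfect, but non-linear, association

rho_nl = np.corrcoef(x_nl, y_nl)[0, 1]
print(rho_nl)  # 0.0: no linear relationship, despite full dependence
```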
The last example, y = e^x, has a correlation coefficient of around 0.52, which again does not reflect the true association between the two variables:
Conclusions
In this article, we discussed the Pearson correlation coefficient. We used the corrcoef() method
from Python's numpy module to compute its value.
If random variables have high linear associations then their correlation coefficient is close to +1
or -1. On the other hand, statistically independent variables have correlation coefficients close
to zero.
We also demonstrated that non-linear associations can have a correlation coefficient of zero or close to zero, meaning that strongly associated variables may not have a high value of the Pearson correlation coefficient.