Exploring Correlation in Python

Correlation is a statistical measure of the relationship between two variables. If the relationship is strong, meaning that a change in one variable is reflected by a change in the other in a predictable pattern, we say that the variables are correlated. An increase in the first variable may be accompanied by an increase or a decrease in the second variable. Accordingly, the variables are said to be positively or negatively correlated. The value of the correlation coefficient ranges from -1 to +1.

If the value is +1 or close to it, the variables are positively correlated: they vary in the same direction.
If the value is -1 or close to it, the variables are negatively correlated: they vary in opposite directions.
If the value is 0 or close to it, the variables are not correlated.
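
The three cases above can be illustrated with a small sketch. It uses numpy.corrcoef(), which returns a correlation matrix whose off-diagonal entry is the coefficient for the pair. The variable names and the sample data are made up for illustration only.

```python
import numpy as np

# Hypothetical data: x is a simple ramp, the y variables are built from it
rng = np.random.default_rng(0)
x = np.arange(1, 101, dtype=float)

y_up = 3 * x + 5                     # rises exactly in step with x
y_down = -2 * x + 7                  # falls exactly as x rises
y_noise = rng.standard_normal(100)   # generated independently of x

# corrcoef returns a 2x2 matrix; entry [0, 1] is the coefficient of the pair
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]
r_noise = np.corrcoef(x, y_noise)[0, 1]

print('same direction:     %.3f' % r_up)
print('opposite direction: %.3f' % r_down)
print('unrelated:          %.3f' % r_noise)
```

The first two values should come out at +1 and -1 up to rounding, while the third should be close to 0 but rarely exactly 0, because a finite random sample always shows some chance correlation.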

There are different ways to measure the coefficient of correlation. They are available as functions in numpy or scipy.stats. Below we will see how they are used.

Using Spearman’s Correlation

Spearman’s correlation measures the strength of a monotonic relationship between two variables, whether or not that relationship is linear. It is computed from the ranks of the values rather than from the values themselves. The scipy.stats package provides it as the spearmanr() function.

In the example below, the values of the two variables are generated using the numpy.random.randn() function. Then spearmanr() is applied to get the final result.

Example

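from numpy.random import randn
from numpy.random import seed
from scipy.stats import spearmanr
# Fix the seed so that the run is repeatable
seed(1)
# First variable: 1000 normally distributed values (mean 100, std 20)
data_input1 = 20 * randn(1000) + 100
# Second variable: the first plus independent noise (mean 50, std 10)
data_input2 = data_input1 + (10 * randn(1000) + 50)
# spearmanr returns the coefficient together with a p-value
correlation = spearmanr(data_input1, data_input2)
print(correlation)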

Output

Running the above code gives us a result like the following (newer SciPy versions label the result SignificanceResult and the first field statistic):

SpearmanrResult(correlation=0.8724050484050484, pvalue=1.58425746359e-312)
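
Because Spearman’s coefficient works on ranks, it can be reproduced by ranking both variables and applying Pearson’s formula to the ranks. The sketch below checks this on made-up data in which y is an exponential function of x, a relationship that is monotonic but not linear. The data and variable names are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Hypothetical data: y always increases with x, but not along a straight line
rng = np.random.default_rng(1)
x = rng.standard_normal(200)
y = np.exp(x)

# Spearman's coefficient as computed by SciPy
rho, _ = spearmanr(x, y)

# The same quantity by hand: Pearson's coefficient of the ranks
r_ranks, _ = pearsonr(rankdata(x), rankdata(y))

# Pearson's coefficient of the raw values, for comparison
r_raw, _ = pearsonr(x, y)

print('Spearman:          %.3f' % rho)
print('Pearson on ranks:  %.3f' % r_ranks)
print('Pearson on values: %.3f' % r_raw)
```

The first two numbers should agree, and both should equal 1 because the ordering of y matches the ordering of x exactly. Pearson’s coefficient on the raw values should be lower, since the points do not lie on a straight line.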

Using Pearson’s Correlation

Pearson’s correlation is another way to measure the degree of the relationship between linearly related variables. It is the covariance of the two variables divided by the product of their standard deviations. The scipy.stats package provides it as the pearsonr() function.

In the example below, the values of the two variables are generated using the numpy.random.randn() function. Then pearsonr() is applied to get the final result.

Example

from numpy.random import randn
from numpy.random import seed
from scipy.stats import pearsonr
seed(1)
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
correlation, _ = pearsonr(data1, data2)
print('Pearsons correlation: %.3f' % correlation)

Output

Running the above code gives us the following result:

Pearsons correlation: 0.888
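
To see what pearsonr() computes, the coefficient can also be worked out by hand as the sum of the products of the deviations from the mean, divided by the square root of the product of the summed squared deviations. The sketch below is a minimal check of that formula against SciPy and NumPy. It uses NumPy's default_rng() generator instead of seed(), so the data, and therefore the exact value, differ from the example above.

```python
import numpy as np
from scipy.stats import pearsonr

# Same construction as the example above, but with a different random generator
rng = np.random.default_rng(1)
data1 = 20 * rng.standard_normal(1000) + 100
data2 = data1 + (10 * rng.standard_normal(1000) + 50)

# Deviations of each value from its variable's mean
dev1 = data1 - data1.mean()
dev2 = data2 - data2.mean()

# Pearson's r: co-variation divided by the product of the individual variations
r_manual = (dev1 * dev2).sum() / np.sqrt((dev1 ** 2).sum() * (dev2 ** 2).sum())

# Library results for comparison
r_scipy, _ = pearsonr(data1, data2)
r_numpy = np.corrcoef(data1, data2)[0, 1]

print('manual: %.6f' % r_manual)
print('scipy:  %.6f' % r_scipy)
print('numpy:  %.6f' % r_numpy)
```

All three values should agree up to floating-point rounding. For this construction the theoretical coefficient is 400 / sqrt(400 * 500), about 0.894, because the first variable has variance 400 and the added noise has variance 100, so sample values near 0.89 are expected.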