Correlation is a statistical measure of the relationship between two variables. If the relationship is strong, meaning that a change in one variable is reflected by a change in the other in a predictable pattern, we say that the variables are correlated. An increase in the first variable may be accompanied by an increase or a decrease in the second variable. Accordingly, the variables are said to be positively or negatively correlated. The value of the correlation coefficient lies between -1 and +1.
- If the value is +1 or close to it, the variables are positively correlated: they vary in the same direction.
- If the value is -1 or close to it, the variables are negatively correlated: they vary in opposite directions.
- If the value is 0 or close to it, the variables are not correlated.
There are different ways to measure the coefficient of correlation. They are available as functions in numpy or scipy.stats. Below we will see how they are used.
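As a rough illustration of the three cases listed above, the sketch below uses numpy.corrcoef() on synthetic data. The variable names, the noise levels, and the small corr() helper are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
x = rng.standard_normal(1000)

y_pos = 2 * x + 0.1 * rng.standard_normal(1000)   # rises as x rises
y_neg = -2 * x + 0.1 * rng.standard_normal(1000)  # falls as x rises
y_none = rng.standard_normal(1000)                # generated independently of x


def corr(a, b):
    """Hypothetical helper: pick the coefficient out of numpy's 2x2 correlation matrix."""
    return np.corrcoef(a, b)[0, 1]


r_pos = corr(x, y_pos)
r_neg = corr(x, y_neg)
r_none = corr(x, y_none)

print('positive: %.3f' % r_pos)
print('negative: %.3f' % r_neg)
print('none:     %.3f' % r_none)
```

With these settings the first coefficient should land close to +1, the second close to -1, and the third close to 0.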
Using Spearman’s Correlation
Spearman’s correlation measures the strength of a monotonic relationship between two variables, whether or not that relationship is linear. It is computed from the ranks of the values rather than from the values themselves, and the scipy.stats package provides it as the spearmanr() function. It is a common choice when the relationship may be non-linear or the data contain outliers.
In the example below, the values of the two variables are generated with the numpy.random.randn() function. The spearmanr() function is then applied to get the final result.
Example
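from numpy.random import randn
from numpy.random import seed
from scipy.stats import spearmanr

# Seed the generator so the results are reproducible
seed(1)
# Two related samples: the second is the first plus Gaussian noise
data_input1 = 20 * randn(1000) + 100
data_input2 = data_input1 + (10 * randn(1000) + 50)
# spearmanr() returns the coefficient together with its p-value
correlation = spearmanr(data_input1, data_input2)
print(correlation)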
Output
Running the above code gives us the following result:
SpearmanrResult(correlation=0.8724050484050484, pvalue=1.58425746359e-312)
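To make the rank-based nature of Spearman’s coefficient more concrete, here is a minimal sketch on synthetic data (the exponential relationship and the variable names are assumptions for illustration). It compares spearmanr() with pearsonr() applied to the ranks produced by scipy.stats.rankdata(), and with pearsonr() applied to the raw values.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable
x = rng.standard_normal(200)
y = np.exp(x)  # strictly increasing in x, but not linear

# Spearman's coefficient on the raw values
rho, _ = spearmanr(x, y)

# Pearson's coefficient on the raw values, for comparison
r, _ = pearsonr(x, y)

# Pearson's coefficient on the ranks of the values
rho_from_ranks, _ = pearsonr(rankdata(x), rankdata(y))

print('Spearman:         %.3f' % rho)
print('Pearson:          %.3f' % r)
print('Pearson on ranks: %.3f' % rho_from_ranks)
```

Because y always increases with x, both series have the same ranks, so the Spearman value and the Pearson value computed on ranks should agree, while the Pearson value on the raw data should come out lower.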
Using Pearson’s Correlation
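Pearson’s correlation is another way to measure the relationship between two variables; it measures the degree of their linear relationship. It is computed as the covariance of the two variables divided by the product of their standard deviations, and the scipy.stats package provides it as the pearsonr() function.
In the example below, the values of the two variables are generated with the numpy.random.randn() function. The pearsonr() function is then applied to get the final result. It returns the coefficient together with a p-value, which the example discards.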
Example
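from numpy.random import randn
from numpy.random import seed
from scipy.stats import pearsonr

# Seed the generator so the results are reproducible
seed(1)
# Two related samples: the second is the first plus Gaussian noise
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# pearsonr() returns the coefficient and a p-value; the p-value is ignored here
correlation, _ = pearsonr(data1, data2)
print('Pearsons correlation: %.3f' % correlation)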
Output
Running the above code gives us the following result:
Pearsons correlation: 0.888
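For readers who want to see what lies behind pearsonr(), the sketch below computes the coefficient by hand as the sum of products of the deviations divided by the square root of the product of the summed squared deviations, then compares it with the library value. It uses numpy's default_rng() generator rather than the seed() call above, so the data and the printed number are not expected to match the earlier output exactly; the variable names are assumptions for this example.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable
x = 20 * rng.standard_normal(1000) + 100
y = x + (10 * rng.standard_normal(1000) + 50)

# Deviations of each sample from its own mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r: co-variation of x and y, scaled by the spread of each variable
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity from scipy.stats
r_scipy, _ = pearsonr(x, y)

print('manual: %.3f' % r_manual)
print('scipy:  %.3f' % r_scipy)
```

The two values should agree to within floating-point rounding.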