Correlation Coefficient
Correlation Coefficient
(Alba González)
CORRELATION
WHAT IS CORRELATION?
Correlation is a statistical measure that quantifies the degree of linear association between two
variables, indicating how they change together at a constant rate. This tool is frequently used to
describe simple relationships without making a statement about cause and effect. Correlations describe
data moving together; thus, they are useful for describing simple relationships among data.
TYPES OF CORRELATION
There is a positive linear correlation when the variable on the x-axis increases as the variable on
the y-axis increases. For example, most of the time there is an increase in between a person’s education
level and their family income.
A negative linear correlation is found when one variable increases as the other variable decreases.
This is shown by a downwards-sloping straight regression line. For example, the longer time it takes
a worker to reach their workplace, the lower the job satisfaction is.
No correlation implies that there is no pattern that can be detected between the variables. For example,
the amount of ice cream sold at the number of shark attacks.
There is a non-linear correlation when there is a relationship between variables, but the relationship
is not linear (straight), although this is not the focus of this project. For example, the implantation of
1
new technology might follow an S-shaped growth pattern – slow adaptation onset, followed by a rapid
increase, and a plateau state when it is fully established.
CORRELATION COEFFICIENT
SCATTERPLOTS
We can use scatterplots to visualize correlations. The correlation coefficient, r, shows how close the
point in the scatterplot comes to a linear relationship; the stronger the relationship or bigger the r
values, the closer to the line in which we want to fit the data.
PEARSON’S COEFFICIENT
Pearson’s r measures the strength and direction (decreasing or increasing, depending on the sign) of a
linear relationship between two variables X and Y can be defined as:
Where:
Strength refers to how one variable will change due to the change in the other. The closer to +1 and -
1, the stronger the relationship. In a scatterplot, the values will lie closer to the line.
Direction indicates a positive linear or negative linear relationship between variables. In the scatterplot,
if the slope goes up is positive, and if it goes down then it is negative.
INTERPRETATION
2
The closer the scatterplots lie next to the line, the stronger the relationship between variables. The
further they move from the line, the weaker the relationship.
For example:
It is important to understand that the negative correlation should not be mistaken with no correlation,
for instance, if the Pearson coefficient is -0.9 it indicates a higher correlation than +0.7, and a
correlation of +0.8 is not better than -0.8.
PROPERTIES
SYMMETRY
This coefficient is symmetric, meaning that either x or y can be expressed as a function of the other;
thus, flipping the axis will not affect the Pearson r.
Using the previous example, the mean years of schooling can be plotted against the life expectancy
and the value of r and R2 remains unchanged.
3
INSENSITIVITY TO SCALE AND LOCATION
The Pearson's coefficient does not have any specific unit, meaning it lacks dimension. Therefore,
multiplying x or y by a negative/positive number, or adding/subtracting, it does not have any impact
on the outcome; there will be different location of the values in the scatterplot, but the correlation
remains constant.
In these scatterplots, the x-axis values are different. Miles were obtained by dividing by 1.61, and
minutes by multiplying by 60. The overall result shows that the slope and the values are different, but
the correlation remains constant.
To apply the Pearson correlation coefficient, the following condition must be satisfied:
- The variables must be measured at the interval or ratio level; thus, x and y variables are
quantitative and are expressed as real numbers.
- The data should be organized in paired observations and shown in a 2-column value: x value
with its corresponding y value.
- The variance and covariance of x and y must be defined, and the variances must be non-null.
LINEARITY ASSUMPTION
4
Pearson correlation assumes a linear correlation between
variables, thus, if the relationship is non-linear, the
coefficient will not accurately represent the association.
This is seen when the correlation coefficient is weak, we
may conclude that there is no relationship between the two
variables, however, the issue may be that the relationship
is non-linear.
If we obtain an r=0.82, most of the people will plot a straight regression line. But it might represent
something like this:
SENSITIVE TO OUTLIERS
To observe this, there is an example measuring the relationship between life expectancy and health
expenditure.
Pearson’s coefficient is 0.54 with the outlier, while the coefficient without outlier is 0.71. Therefore,
visually, the scatterplot values fit better to the line. Meaning that one single outlier changes noticeably
the outcome of the analysis.
5
This leads to the following question, should outliers be excluded from the analysis?
It will depend on the context and what the outlier represents, or the sample size – if it is large enough,
outliers are expected to be seen in the analysis. Therefore, keep the outliers if they represent data of
the population studied, and remove them if they appear due to experimental or measurement errors, or
if there is any significant reason why they should be excluded.
In the context of the example, is there a good reason to remove the outlier? Not really, there is no sign
that indicates the removal of the outlier; in fact, it seems more relevant to consider.
ESTIMATOR BIAS
The Pearson’s correlation coefficient can slightly underestimate the absolute value for a population,
especially on small sample sizes. This produces a bias, noticeable around the absolute range value
(0.5-0.7).
The Spearman rank correlation coefficient serves as an alternative to the previous coefficient
mentioned, the Pearson correlation coefficient. The spearman rank is the Pearson correlation
coefficient calculated between the ranks of the x and t values. For instance, we set the smallest value
of the x axis as 1, the next value 2, etc., and in the y axis we do the same. This means that the Spearman
rank correlation is just the Pearson correlation between the two new list of values.
We also find the Kendall Tau which is another rank-based correlation. In this case, it is a non-
parametric procedure, thus, the data obtained does not have to be normally distributed, compared to
Pearson’s correlation which is parametric. This alternative is preferred over Spearman’s when there is
truly little data.
Both alternatives are useful for assessing increases in a y variable according to changes in a x variable,
being the main advantage compared to Pearson’s, is that we can see the correlation even when the
relationship is not linear. When the values of one variable consistently move in the same direction as
the other, this phenomenon is known as monotonic relationships.
6
In the following examples we can compare the alternatives and the Pearson’s correlation:
The Spearman rank correlation and Kendall Tau values are 1, meaning a perfectly monotonic
relationship when positive, and the same would happen for a value of -1. In contrast, the Pearson
correlation coefficient is 0.86 in the first plot and 0.85 in the second sample as it is affected by the non-
linear character of the relationship between x and y. But what would occur when the relationship is
not monotonic?
As predicted, the values of Spearman rank and Kendall Tau are influenced by the non-monotonic
relationship between x and y. Furthermore, similar to Pearson’s coefficient, they cannot detect the
association between the two variables in the second plot.
7
REFERENCES
- https://www.jmp.com/en_au/statistics-knowledge-portal/what-is-
correlation.html#:~:text=Correlation%20is%20a%20statistical%20measure,statement%20about%20cause%20a
nd%20effect.
- https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/statistics/regression-and-
correlation/types-of-correlation.html
- https://www.questionpro.com/blog/pearson-correlation-
coefficient/#what_is_the_pearson_correlation_coefficient?
- https://medium.com/@anthony.demeusy/pearson-correlation-methodology-limitations-alternatives-part-3-
alternatives-cc2a56f7ad1f