Topic 4.5 Correlation Analysis



• Correlation analysis

• Correlation and causality

• Interpret correlation

• Significance

• Pearson correlation

• Spearman correlation

What is a correlation analysis?

Correlation analysis is a statistical technique that gives you information about the relationship between variables.

A correlation analysis is calculated to investigate the relationship between two variables. The strength of the correlation is determined by the correlation coefficient, which ranges from -1 to +1. A correlation analysis can therefore be used to make a statement about the strength and direction of the relationship.

Example
You want to find out whether there is a connection between the age at which a child
speaks its first sentences and its later success at school.

Correlation and causality

If the correlation analysis shows that two characteristics are related to each other, it can subsequently be checked whether one characteristic can be used to predict the other. If the correlation from the example above is confirmed, a linear regression could then be used to check whether school success can be predicted from the age at which a child speaks its first sentences.

But beware! A correlation does not have to be a causal relationship. Any correlation that is discovered should therefore be investigated more closely, but never immediately given a causal interpretation, even if such an interpretation seems obvious.

Correlation and causality example:


If the correlation between sales figures and price is analysed and a strong correlation is
identified, it would be logical to assume that sales figures are influenced by the price (and
not vice versa). This assumption can, however, by no means be proven on the basis of a
correlation analysis.
Furthermore, it can happen that the correlation between variable x and y is generated by
the variable z, see Partial Correlation for more information.
However, depending on which variables you use, you may be able to
speak of a causal relationship right from the start. For example, if
there is a correlation between age and salary, it is clear that age
influences salary and not the other way around, otherwise everyone
would want to earn as little salary as possible : )

Interpret correlation

With the help of a correlation analysis, two statements can be made about the linear relationship between two metric or ordinally scaled variables:

• one about the direction
• and one about the strength

The direction indicates whether the correlation is positive or negative, while the strength indicates whether the correlation between the variables is strong or weak.

Positive correlation
A positive correlation exists if larger values of the variable x are accompanied by larger
values of the variable y, and the other way around. Height and shoe size, for example,
correlate positively and the correlation coefficient lies between 0 and 1, i.e. a positive
value.

Negative correlation
A negative correlation exists if larger values of the variable x are accompanied by smaller
values of the variable y, and the other way around. The product price and the sales
quantity usually have a negative correlation; the more expensive a product is, the smaller
the sales quantity. In this case, the correlation coefficient is between -1 and 0, so it
assumes a negative value.

Strength of correlation
With regard to the strength of the correlation coefficient r, the following table can be
used as a guide:
|r|          Strength of correlation
0.0 – 0.1    no correlation
0.1 – 0.3    little correlation
0.3 – 0.5    medium correlation
0.5 – 0.7    high correlation
0.7 – 1      very high correlation

Scatter plot and correlation

Just as important as looking at the correlation coefficient is looking at the relationship between the two variables graphically, in a scatter plot.
The scatter plot gives you a rough estimate of whether there is a
correlation, whether it is linear or nonlinear, and whether there are
outliers.
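As a quick illustration, the following minimal sketch uses matplotlib with entirely hypothetical data for the example above (age at first sentences and a later school grade) to show how such a visual check can be done:

```python
import matplotlib.pyplot as plt

# Hypothetical example data: age at first sentences (months) vs. a later school grade
age_first_sentences = [18, 20, 22, 24, 26, 28, 30, 32, 34, 36]
school_grade = [1.3, 1.7, 1.5, 2.0, 2.3, 2.1, 2.7, 2.6, 3.0, 3.2]

# The scatter plot shows roughly whether the relationship looks linear or
# nonlinear and whether there are obvious outliers.
plt.scatter(age_first_sentences, school_grade)
plt.xlabel("Age at first sentences (months)")
plt.ylabel("School grade")
plt.title("Visual check before computing a correlation coefficient")
plt.show()
```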
Test correlation for significance

If there is a correlation in the sample, it is still necessary to test whether there is enough evidence that the correlation also exists in the population. Thus, the question arises as to when a correlation coefficient can be considered statistically significant.

The significance of correlation coefficients can be tested using a t-test. As a rule, it is tested whether the correlation coefficient is significantly different from zero, i.e. linear independence is tested. In this case, the null hypothesis is that there is no correlation between the variables under consideration. In contrast, the alternative hypothesis assumes that there is a correlation.

As with any other hypothesis test, the significance level is first set,
usually at 5%. If the calculated p-value is below 5 %, the null
hypothesis is rejected and the alternative hypothesis applies. Thus, if
the p-value is below 5%, it is assumed that there is a relationship
between the variables in the population.

The t-value for testing the hypothesis is given by

t = r · √(n − 2) / √(1 − r²)

where n is the sample size and r is the correlation determined in the sample. The corresponding p-value can easily be calculated with the correlation calculator on DATAtab.
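If you want to reproduce this calculation yourself, a minimal Python sketch could look like the following (assuming SciPy is available; the function name correlation_p_value and the example values are just for illustration):

```python
from math import sqrt
from scipy import stats

def correlation_p_value(r, n):
    """Two-sided p-value for H0: rho = 0, based on t = r*sqrt(n-2)/sqrt(1-r^2)."""
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    df = n - 2                            # degrees of freedom
    p = 2 * stats.t.sf(abs(t), df)        # two-sided p-value from the t distribution
    return t, p

# Example: a correlation of r = 0.5 found in a sample of n = 30
t, p = correlation_p_value(0.5, 30)
print(f"t = {t:.3f}, p = {p:.4f}")
```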

Directional and non-directional hypotheses


With correlation analysis you can test directional and non-directional correlation
hypotheses.
Non-directional correlation hypothesis:
You are only interested in whether there is a relationship or correlation between two
variables, for example, whether there is a correlation between age and salary, but you are
not interested in the direction of this correlation.
Directional correlation hypothesis:
You are also interested in the direction of the correlation, i.e. whether there is a positive or negative correlation between the variables. Your alternative hypothesis is then, for example, that age has a positive influence on salary. What you have to pay attention to with a directional hypothesis is covered in the example at the end of this section.

Pearson correlation analysis

With the Pearson correlation analysis you get a statement about the linear correlation between metrically scaled variables. The covariance of the two variables is used for the calculation. The covariance takes a positive value if there is a positive relationship between the variables and a negative value if there is a negative relationship. The covariance is calculated as

cov(X, Y) = ( Σ (xᵢ − x̄)(yᵢ − ȳ) ) / (n − 1)

However, the covariance is not standardized and can assume values between minus and plus infinity. This makes it difficult to compare the strength of relationships between different variables. For this reason, the correlation coefficient, also called the product-moment correlation coefficient, is calculated. The correlation coefficient is obtained by normalizing the covariance with the standard deviations of the two variables involved:

r = cov(X, Y) / (s_x · s_y)

The Pearson correlation coefficient can take values between -1 and +1 and can be interpreted as follows:

• A value of +1 means that there is a perfectly positive linear relationship (the more, the more).
• A value of -1 means that there is a perfectly negative linear relationship (the more, the less).
• A value of 0 means that there is no linear relationship, i.e. the variables do not correlate with each other.
Now finally the strength of the relationship can be interpreted. This
can be illustrated by the following table:

|r|          Strength of correlation
0.0 – 0.1    no correlation
0.1 – 0.3    little correlation
0.3 – 0.5    medium correlation
0.5 – 0.7    high correlation
0.7 – 1      very high correlation
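To make the normalization step concrete, here is a minimal Python sketch that computes the covariance and the Pearson correlation coefficient by hand and cross-checks the result against NumPy; the data are simply the first five rows of the height/weight example further below:

```python
import numpy as np

x = np.array([1.62, 1.72, 1.85, 1.82, 1.72])   # heights in metres
y = np.array([53, 71, 85, 86, 76])             # weights in kg

# Sample covariance: sum of products of deviations, divided by n - 1
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# Pearson r: covariance normalized by the two standard deviations
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# Cross-check against NumPy's built-in correlation matrix
print(r, np.corrcoef(x, y)[0, 1])   # both values should agree
```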


To check in advance whether a linear relationship exists, scatter plots should be considered. This way, the relationship between the variables can also be checked visually. The Pearson correlation is only meaningful if a linear relationship is present.

Pearson Correlation assumptions


For Pearson correlation to be used, the variables must be normally
distributed and there must be a linear relationship between the
variables. The normal distribution can be tested either analytically or
graphically with the Q-Q plot. Whether the variables have a linear
correlation is best checked with a scatter plot. If these conditions are
not met, then the Spearman correlation is used.
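As a rough sketch of how the normality assumption could be checked analytically, the following uses the Shapiro-Wilk test from SciPy on the height/weight data from the example further below:

```python
from scipy import stats

# Data from the height/weight example further below
height = [1.62, 1.72, 1.85, 1.82, 1.72, 1.55, 1.65, 1.77, 1.83, 1.53]
weight = [53, 71, 85, 86, 76, 62, 68, 77, 97, 65]

# Shapiro-Wilk test: a p-value above 0.05 gives no evidence against normality
for name, sample in [("height", height), ("weight", weight)]:
    stat, p = stats.shapiro(sample)
    print(f"{name}: W = {stat:.3f}, p = {p:.3f}")

# A graphical check (Q-Q plot) could be produced with scipy.stats.probplot
```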

Spearman rank correlation

Spearman correlation analysis is used to calculate the relationship between two variables that have an ordinal level of measurement. Spearman rank correlation is the non-parametric equivalent of the Pearson correlation analysis. It is therefore used when the prerequisites of the Pearson correlation (a parametric procedure) are not met, i.e. when the data are not metric or not normally distributed. In this context, the terms "Spearman correlation" or "Spearman's rho" usually refer to the Spearman rank correlation.

The questions that can be addressed with Spearman rank correlation are similar to those of the Pearson correlation coefficient, i.e. "Is there a correlation between two variables or characteristics?" For example: "Is there a correlation between age and religiousness in the French population?"

The calculation of the rank correlation is based on the ranks of the data series. This means that the measured values themselves are not used for the calculation; instead, they are transformed into ranks, and the analysis is then performed on these ranks.
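A minimal sketch of this idea in Python, using hypothetical age and religiousness ratings (SciPy's spearmanr does the ranking internally):

```python
from scipy import stats

# Hypothetical ordinal data: age and a religiousness rating (1-5)
age = [23, 27, 31, 35, 41, 48, 52, 60]
religiousness = [2, 1, 3, 2, 4, 3, 5, 4]

# spearmanr ranks the data internally and correlates the ranks
rho, p = stats.spearmanr(age, religiousness)
print(f"rho = {rho:.2f}, p = {p:.3f}")

# The same coefficient by hand: transform the raw values into ranks,
# then compute a Pearson correlation on those ranks
rho_manual, _ = stats.pearsonr(stats.rankdata(age), stats.rankdata(religiousness))
print(f"rho (via ranks) = {rho_manual:.2f}")
```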

For the rank correlation coefficient ρ, values between -1 and 1 are possible. If the value is less than zero (ρ < 0), there is a negative monotonic relationship; if it is greater than zero (ρ > 0), there is a positive monotonic relationship; if it is zero (ρ = 0), there is no relationship between the variables. As with the Pearson correlation coefficient, the strength of the correlation can be classified as follows:

|ρ|          Strength of correlation
0.0 – 0.1    no correlation
0.1 – 0.3    little correlation
0.3 – 0.5    medium correlation
0.5 – 0.7    high correlation
0.7 – 1      very high correlation

Point biserial correlation

The point biserial correlation is used when one of the variables is dichotomous, e.g. studied vs. not studied, and the other has a metric scale level, e.g. salary.

The calculation of a point biserial correlation is the same as the calculation of the Pearson correlation. To calculate it, one of the two levels of the dichotomous variable is coded as 0 and the other as 1.
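A short sketch with hypothetical salary figures; SciPy's pointbiserialr gives the same result as pearsonr when the dichotomous variable is coded 0/1:

```python
from scipy import stats

# Dichotomous variable coded 0/1 (0 = did not study, 1 = studied)
studied = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
# Metric variable: monthly salary (hypothetical values)
salary = [2300, 2500, 3100, 3400, 2400, 3600, 3300, 2600, 3500, 2200]

# The point biserial correlation is numerically identical to the Pearson
# correlation when the dichotomous variable is coded as 0 and 1
r_pb, p = stats.pointbiserialr(studied, salary)
r_pearson, _ = stats.pearsonr(studied, salary)
print(r_pb, r_pearson)   # the two coefficients agree
```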
Calculate correlation analysis with DATAtab

Calculate the example directly with DATAtab for free:



A student wants to know if there is a correlation between the height
and weight of the participants in the statistics course. For this
purpose, the student drew a sample, which is described in the table
below.

Height (m)   Weight (kg)
1.62         53
1.72         71
1.85         85
1.82         86
1.72         76
1.55         62
1.65         68
1.77         77
1.83         97
1.53         65
To analyze the linear relationships by means of a correlation analysis,
you can calculate a correlation with DATAtab. First copy the table
above into the statistics calculator.
Then click on "Correlation" and select the two variables from the
example. Finally you will get the following results.
First, you will get the null and the alternative hypothesis. The null
hypothesis is: "There is no correlation between height and weight".
Then you get the correlation coefficient and the p value. If you click
on Summary in words, you will get the following interpretation:

A Pearson correlation analysis was performed to test whether there is a relationship between height and weight. The result of the Pearson correlation analysis showed that there was a significant relationship between height and weight, r(8) = 0.86, p = 0.001.

There is a very high, positive correlation between height and weight in this sample, r = 0.86.
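If you prefer to reproduce this result in code rather than in DATAtab, a minimal SciPy sketch with the data from the table above should give approximately the same values (r ≈ 0.86):

```python
from scipy import stats

height = [1.62, 1.72, 1.85, 1.82, 1.72, 1.55, 1.65, 1.77, 1.83, 1.53]
weight = [53, 71, 85, 86, 76, 62, 68, 77, 97, 65]

# Two-sided Pearson correlation, as in the summary above
r, p = stats.pearsonr(height, weight)
print(f"r({len(height) - 2}) = {r:.2f}, p = {p:.3f}")   # df = n - 2
```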

Directional (one-sided) correlation hypothesis


Of course, in DATAtab you can also choose to calculate a directional hypothesis.
In this case, you must first check whether the correlation is at all in
the direction of the alternative hypothesis, i.e. that height and weight
are positively correlated. If this is the case, the calculated p-value
must be divided by two, since only one side of the distribution is
considered. However, DATAtab takes care of these two steps for you.
The summary in words then looks like this:

A Pearson correlation analysis was performed to test whether there is a positive relationship between height and weight. The result of the Pearson correlation analysis showed that there was a significant positive relationship between height and weight, r(8) = 0.86, p < 0.001.

There is a very high, positive correlation between height and weight in this sample, r = 0.86.
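The same two steps, checking the sign of r and halving the two-sided p-value, can be sketched in code as follows; recent SciPy versions also accept an alternative= argument in pearsonr that does this directly, but the manual version makes the logic explicit:

```python
from scipy import stats

height = [1.62, 1.72, 1.85, 1.82, 1.72, 1.55, 1.65, 1.77, 1.83, 1.53]
weight = [53, 71, 85, 86, 76, 62, 68, 77, 97, 65]

r, p_two_sided = stats.pearsonr(height, weight)

# One-sided test of H1: "height and weight are positively correlated".
# The p-value may only be halved if r actually points in the hypothesised direction.
if r > 0:
    p_one_sided = p_two_sided / 2
else:
    p_one_sided = 1 - p_two_sided / 2

print(f"r = {r:.2f}, one-sided p = {p_one_sided:.4f}")
```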
